Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bash * wildcard for extensions — Regex compliance for YAML #1055

Closed
TheChymera opened this issue Apr 8, 2022 · 5 comments
Closed

Bash * wildcard for extensions — Regex compliance for YAML #1055

TheChymera opened this issue Apr 8, 2022 · 5 comments
Labels
schema Issues related to the YAML schema representation of the specification. Patch version release.

Comments

@TheChymera
Copy link
Collaborator

TheChymera commented Apr 8, 2022

I recently ran into this https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/datatypes/meg.yaml#L63 which messes with regex (re.error: nothing to repeat at position 172). Well, the position part being irrelevant.

The issue arises from the fact that * means different things in Bash and Regex. In Regex it's a modifier for the preceding character meaning zero or more occurrences.

Interestingly this also potentially compounds the confusion arising from the leading period in extensions (discussed here). This is because in Bash, .* means period followed by anything, whereas in Regex it means any character occurring zero or more times (so basically anything).

Luckily I had already implemented a period-safe parsing for the extensions so instead of false negatives the validator gives an outright error. Honestly this was a pretty solid dodging of a bullet since this is the sort of thing that could have gone unnoticed for a very long time.

I ofc understand that bash syntax makes the YAML more legible for casual reference, but perhaps its main utility lies more in programmatic usability than manual lookup.

So I guess what I'm asking is whether you agree that it would make sense to make YAML enries regex-comliant. Given that we do not generally use special characters, that would basically just be this one entry, and also involve a small change to the tools/schemacode/schemacode/render.py file which I had on the roadmap anyway.

It would also be a great opportunity to address the leading periods, though I'm unsure whether it could be said that we agree on that.

@yarikoptic @tsalo

@tsalo
Copy link
Member

tsalo commented Apr 8, 2022

Just an initial thought but we now have formats.yaml, which has the regular expression for each format in a pattern key. I think a reasonable structure would be to default to using the object's key for objects' regular expressions, but rely on a pattern key if that's present.

@TheChymera
Copy link
Collaborator Author

TheChymera commented Apr 9, 2022

@tsalo thank you for the rapid response :)

formats.yaml is indeed very cool (and I'll be integrating it soon, as the design started before it happened and the schema is moving really fast ^^ ).
But it doesn't really cover this issue, as it pertains to variable fields and not predefined values.
I could of course shoehorn the issue into it (maybe that's what you suggested, using e.g. label as an extension entry?), but using a special name which then references a format sounds like it could cause no end of trouble. People could be using .index or .label verbatim as text file extensions.

Another suggestion I've heard floating around (can't remember where) is that perhaps all YAML schema entries should have dedicated regex representations which can then be referenced for regex matching.
The problem I see with that is that it duplicates information and “human-readable” and regex representations might diverge.
So I guess the question is, which is the first-class citizen for the YAML, “human-readable” Bash, or directly usable Regex?
I would strongly advocate for the latter, but it could go either way.
I'm putting together an emergency fix as we speak, but it seems pretty useless to parse-in Bash, since we don't really load the schema in Bash — right?

@sappelhoff sappelhoff added the schema Issues related to the YAML schema representation of the specification. Patch version release. label Apr 9, 2022
@tsalo
Copy link
Member

tsalo commented Apr 13, 2022

But it doesn't really cover this issue, as it pertains to variable fields and not predefined values.

But we do have a variable here, so the name and the search pattern can be dissociated (preferably unlike the rest of the suffixes).

Another suggestion I've heard floating around (can't remember where) is that perhaps all YAML schema entries should have dedicated regex representations which can then be referenced for regex matching.

I would argue that literal matching should be assumed unless there is a pattern or format field in the object definition. In this case, .* could be extended with pattern: '\.[0-9a-zA-Z]+', like this:

.*:
  name: Any Extension
  description: |
    Any extension is allowed.
  pattern: '\.[0-9a-zA-Z]+'
.ave:
  name: AVE  # not sure what ave stands for
  description: |
    File containing data averaged by segments of interest.

    Used by KIT, Yokogawa, and Ricoh MEG systems.

I'd like to know what others think of this though.

@bendhouseart
Copy link
Collaborator

Use regex, but only good regex.

@TheChymera
Copy link
Collaborator Author

I think we are already doing this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema Issues related to the YAML schema representation of the specification. Patch version release.
Projects
None yet
Development

No branches or pull requests

4 participants