Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example of the HED schema library for SCORE implementation #324

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

tpatpa
Copy link

@tpatpa tpatpa commented Jun 26, 2022

This PR includes examples of the HED schema library for the Standardized Computer-based Organized Reporting of EEG (SCORE) implementation in several EEG and invasive EEG datasets in the two ways to annotate HED in BIDS.

BIDS dataset was validated with [email protected] and HED annotations were separately validated with the example jupyter notebook and the hedtools that will be incorporated in the bids-validator soon.

@tpatpa tpatpa marked this pull request as draft June 26, 2022 19:53
@tpatpa tpatpa marked this pull request as ready for review June 26, 2022 19:53
@tpatpa tpatpa marked this pull request as draft June 26, 2022 19:54
@tpatpa tpatpa marked this pull request as ready for review June 26, 2022 19:56
@dorahermes
Copy link
Member

Before this example can be merged, this other pull request in the bids-specificartion that adds the use of an HED schema library bids-specification pull request needs to be merged first. Then the dataset_description.json can be updated to include the SCORE library schema.

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's nice to see a dataset with the MEF data format. Could you please make sure that all binary data files are zeroed out, that is, that they have 0kb?

see also: https://github.com/bids-standard/bids-examples/blob/master/CONTRIBUTING.md#why-do-we-only-host-truncated-data-with-0kb-size

@sappelhoff
Copy link
Member

sappelhoff commented Jul 22, 2022

crosslinking, this PR needs a spec PR and a following validator PR to be merged first:

@sappelhoff
Copy link
Member

The PRs mentioned in #324 (comment) have meanwhile been merged, so I think you can address some of the reviewer comments above @tpatpa

@tpatpa
Copy link
Author

tpatpa commented Sep 27, 2022

I made minor edits to match the latest HED-SCORE library schema. Validation is still failing because the validator is validating against a previous version of the HED-SCORE library schema. We are in the final stages before releasing the schema's latest version. Once released this will be validated and could be merged as well.

@dorahermes
Copy link
Member

We just released HED-SCORE V1.0.0. If you follow the link make sure to click on "Show another version" to browse. The remaining question for this BIDS-HED-SCORE example is on how to list a set of channels affected by an event. It seems like the discussion in the electrophysiology derivatives was going for a Pythonic list of strings, eg ['C3','F3'] or ['C3-F3','P3-C3'] in a separate channels column. Any preference for these examples? @sappelhoff @robertoostenveld @VisLab @guiomar @CPernet

@sappelhoff
Copy link
Member

congrats Dora, Tal, et al!

It seems like the discussion in the electrophysiology derivatives was going for a Pythonic list of strings

yes, this should be a link to the specific discussion thread: gdoc discussion

My preference is indeed the "pythonic list of strings" (clarified more in the above linked thread) but apart from Robert and me, nobody seems to have offered an opinion yet.

@CPernet
Copy link
Contributor

CPernet commented Feb 9, 2023

+1 list of strings

@VisLab
Copy link
Member

VisLab commented Feb 9, 2023

It seems like the discussion in the electrophysiology derivatives was going for a Pythonic list of strings, eg ['C3','F3'] or ['C3-F3','P3-C3'] in a separate channels column.

Could you clarify what would be meant by: ['C3-F3','P3-C3']? If it means a group of channels as ordered by the channels.tsv, I don't think this should be allowed. Which channels are in the data and how they are ordered can change during computation, so this is particularly risky for derivatives. The same comment applies to channel numbers.

Also, given that quotes are usually not used in events files, would [C3, F3] be preferred?

One further option, which perhaps the HED working group should discuss separately, is how HED might be used to define named groups of channels which could then be referred to by name. This does not preclude the need for deciding on a format for a channels column in the events files and their derivatives.

@sappelhoff
Copy link
Member

Let me copy the discussion thread from the Gdoc here, perhaps that clarifies some things:


@sappelhoff :

I think a "pythonic" way to specify channels would be nice.

  • A list of strings where each string is an affected channel
  • An empty list means no channel is affected
  • "n/a" means the interpretation of "channel" is not applicable or the data is not available

We could additionally think about a shorthand to indicate "all channels are affected". Perhaps the string "all" would work for that ... if by chance a channel is named "all" this would still not be ambiguous, because specifying individual channels needs to be in a list like ["all"]

@robertoostenveld :

I don't care too mush, but think that we should not be restrictive in formatting.
And to note: channels in the annotations file do not have to correspond to channels in the data. E.g., how would you document an interictal spike that is observed in a bipolar refreferenced channel ("C3-F3"), whereas the data is stored and shared with a common reference ("C3-REF", or probably just "C3"). Annotations could also refer to electrodes rather than channels, such as "video inspection revealed that electrode C3 fell off at 10 minutes into the recording".

@sappelhoff:

Thanks for these valid examples. Regarding the first one, one could argue that if an interictal spike at "C3-F3" is observed, and the dataset curator wants to annotate this event, then they should include that channel in the data.

Regarding the "electrode" example: We could have an additional column called "electrode" that follows the same formatting as "channel" (the column we are discussing here), but strictly referring to electrodes.

Curators could then for each event specify any number of channels and/or electrodes that are related to the event.

@VisLab
Copy link
Member

VisLab commented Feb 9, 2023

@sappelhoff :

I think a "pythonic" way to specify channels would be nice.

  • A list of strings where each string is an affected channel
  • An empty list means no channel is affected
  • "n/a" means the interpretation of "channel" is not applicable or the data is not available

We could additionally think about a shorthand to indicate "all channels are affected". Perhaps the string "all" would work for that ... if by chance a channel is named "all" this would still not be ambiguous, because specifying individual channels needs to be in a list like ["all"]

I am in favor of this format.

I realize that this discussion pertains to derivatives, however, it might also be relevant to events in the main BIDS specification.

Right now the BIDS specification of events.json indicates two options for describing column values (discounting HED): Levels and Units.

If Levels are given, the column is assumed to contain categorical values.

If Units are given, the column is assumed to contain numeric values expressed in the specified units.

Would this new format of column value require a new List type -- list of numbers or list of strings?

@tpatpa
Copy link
Author

tpatpa commented Feb 15, 2023

Right now the BIDS specification of events.json indicates two options for describing column values (discounting HED): Levels and Units.

Maybe we can use IntendedFor in the corresponding JSON file to link to the relevant channels file similar to how fieldmap data is linked to a specific scan?

@sappelhoff
Copy link
Member

One further option, which perhaps the HED working group should discuss separately, is how HED might be used to define named groups of channels which could then be referred to by name. This does not preclude the need for deciding on a format for a channels column in the events files and their derivatives.

I agree that developing an "HED way" to deal with this problem would be very nice in addition to a "straight forward - non HED" BIDS solution.

Right now the BIDS specification of events.json indicates two options for describing column values (discounting HED): Levels and Units.

you are raising a very important point here, thanks -- I have not considered that my proposal would introduce a new "data type" for rows in TSV files apart from those defined by Levels and Units. It wouldn't make sense to consider the lists of strings (or the [], "all", and "n/a" values that'd be possible) as "Levels", because there'd simply be too many levels without providing a lot of insight.

We should:

  1. think about alternative solutions
  2. think whether introducing a List type makes sense in a more general sense (beyond our present problem), because I would be hesitant to introduce a concept that is only sensible / meaningful for one particular application
  3. beyond the potential List data type, the present proposal would also permit "all" as a shorthand for a full list of channels and the standard "n/a". While permitting "n/a" is easy and needed anyhow, ... what do you think of the "all" shorthand? Needed? Superfluous? ...

In any case, we need some more people to chime in :-)

Maybe we can use IntendedFor in the corresponding JSON file to link to the relevant channels file similar to how fieldmap data is linked to a specific scan?

Not all JSON files have an IntendedFor metadata field, but I like the general direction here ...

@VisLab
Copy link
Member

VisLab commented Feb 23, 2023

One other possibility would be just to use a string for the channel list rather than an actual list. Then BIDS doesn't have to be concerned about validating necessarily, although it could.

"[Cz7, O1, O2]"

Downstream tools could then do their own thing to interpret a channels column, parse, and make sure that these are real channels.

@dorahermes
Copy link
Member

Right now the BIDS specification of events.json indicates two options for describing column values (discounting HED): Levels and Units.

you are raising a very important point here, thanks -- I have not considered that my proposal would introduce a new "data type" for rows in TSV files apart from those defined by Levels and Units. It wouldn't make sense to consider the lists of strings (or the [], "all", and "n/a" values that'd be possible) as "Levels", because there'd simply be too many levels without providing a lot of insight.

We should:

  1. think about alternative solutions
  2. think whether introducing a List type makes sense in a more general sense (beyond our present problem), because I would be hesitant to introduce a concept that is only sensible / meaningful for one particular application
  3. beyond the potential List data type, the present proposal would also permit "all" as a shorthand for a full list of channels and the standard "n/a". While permitting "n/a" is easy and needed anyhow, ... what do you think of the "all" shorthand? Needed? Superfluous? ...

In any case, we need some more people to chime in :-)

I think a List data type would help. This may also be helpful beyond the present problem for other ephys cases where multiple channels or electrodes could be listed and I can imaging other cases where multiple items may to be indexed beyond channels or electrodes.

Maybe we can use IntendedFor in the corresponding JSON file to link to the relevant channels file similar to how fieldmap data is linked to a specific scan?

Not all JSON files have an IntendedFor metadata field, but I like the general direction here ...

This may be similar or comparable to a case in the BIDS-connectivity-BEP, for channel-to-channel connectivity metrics where we also want to interpret a column with respect to an _channels.tsv or _electrodes.tsv file (discussing there whether this should be a sourceAtlas, but it is not an actual atlas...).

@Andesha
Copy link
Contributor

Andesha commented Mar 3, 2023

For the format of this field a "pythonic list" has the downside of encouraging developers to do an eval which is not a safe practice. A comma separated list (or a single item) would work effectively as a representation. Brackets aren't strictly necessary as these files are a TSV.

CCing @christinerogers for EEGnet

@sappelhoff
Copy link
Member

For the format of this field a "pythonic list" has the downside of encouraging developers to do an eval which is not a safe practice. A comma separated list (or a single item) would work effectively as a representation. Brackets aren't strictly necessary as these files are a TSV.

I agree with you -- I guess I proposed it this way to distinguish n/a from a potential channel name n/a and to have a convenient shorthand all that is different from a channel potentially called all. But we can probably come up with better ways.

@christinerogers
Copy link

christinerogers commented Mar 9, 2023

Super, thanks @sappelhoff -- I'll raise this at the BIDS maintainers meeting, since @Andesha 's request for code-agnostic (consistent, comma-separated instead of pythonic) lists helps those making BIDS-supporting data platforms (like EEGNet which is webfacing) across modalities.

Has this been discussed elsewhere in the BIDS specs, i.e. is there an implementation cost concern? cc @CPernet

Does this make sense to you @arnodelorme @VisLab ?
thoughts, @dorahermes @tpatpa ?

@rwblair
Copy link
Member

rwblair commented Mar 9, 2023

Discussing this with a few maintainers we settled on allowing json arrays whose only values are strings as an acceptable solution. The json specification allows json documents to start with a square bracket so any languages implementation of the json parser should be sufficient.

We should also be able to represent this in the schema well enough to tip the validators off that elements of the column may need to be parsed.

@effigies did I misrepresent anything here?

@christinerogers
Copy link

christinerogers commented Mar 9, 2023

Hi @rwblair ,
My takeaway from this maintainers meeting was that it wasn't decided, and will be raised again next time after some community consultation (@sappelhoff).
There's definitely concern about the embedded JSON suggestion on our side -- e.g. within a TSV it's not only hard to read but also can't be validated easily by those preparing data files, so significantly more error-prone.

@effigies
Copy link
Contributor

effigies commented Mar 9, 2023

@rwblair nothing misrepresented, but here's a more extended version:

We discussed two possible approaches:

  1. Comma-separated values: a,b,c
  2. JSON arrays: ["a","b","c"]

Advantages to CSV:

  • Simple. Just use split(val, ","), which will come in every language and is easy to write your own if forced.
  • You can eyeball it.

Advantages to JSON arrays:

  • Typed. The schema could declare a value as having type array[str] (actually something uglier, but you get it) and the validator can check it.
  • Proximal. If a tool is working with BIDS data, it should have access to a JSON parser. No eval needed.
  • Escapable. JSON supports unicode escapes, in case someone needs a tab or newline character inside their strings, which TSV with a CSV insert would interpret as an end-of-column or end-of-row. (I hope people don't do this, but in the extreme...)

@rwblair
Copy link
Member

rwblair commented Mar 9, 2023

Hi @rwblair , My takeaway from this maintainers meeting was that it wasn't decided, and will be raised again next time after some community consultation (@sappelhoff). There's definitely concern about the embedded JSON suggestion on our side -- e.g. within a TSV it's not only hard to read but also can't be validated easily by those preparing data files, so significantly more error-prone.

My apologies for the ambiguity and finality in tone of my post. This was from a chat after the maintainers meeting you presented at. I completely agree with you that it needs more discussion here and in the next maintainers meeting.

@christinerogers
Copy link

thanks @rwblair - let's continue this detailed point in a more general/accessible place than this channel. cc @sappelhoff happy to follow your lead here.

@VisLab
Copy link
Member

VisLab commented Aug 7, 2023

Looks good....

Copy link
Member

@VisLab VisLab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good --- but you will need to release SCORE 1.1.1 before this will validate.

@dorahermes
Copy link
Member

Looks good to me, ok after it is validated!

New example of the HED SCORE schema library used for event annotations in several EEG and invasive EEG datasets.

Dataset validated with [email protected] and PENDING validation of HED annotations with HED python tools.
[ERR] The validation on this HED string returned an error. (code: 104 - HED_ERROR) -- HED validated separately
 - dataset_description.json fixed and waiting for [bids-specification pull request](bids-standard/bids-specification#1106)
 - units added to _channels.tsv for TUH subjects
 - Files truncated via [terminal](https://github.com/bids-standard/bids-examples/blob/master/CONTRIBUTING.md#why-do-we-only-host-truncated-data-with-0kb-size)
Correct annotations to match the latest prerelease library schema.
Example validated following latest schema release (HED score_1.0.0) and tools update using HED-examples jupyter notebook (https://github.com/hed-standard/hed-examples/blob/main/hedcode/jupyter_notebooks/bids/bids_validate_dataset_with_libraries.ipynb).
Array data in .tsv cells discussion is converging towards a comma-separated list (possibly wrapped delimiters list), so the example was edited accordingly - combined rows representing the same event with a list of channels in the channel column.

See full discussion here: bids-standard/bids-specification#1446
Channel column in _events.tsv should match the channel name so updated _channels.tsv to accommodate 2 different references.
Updated example to use the HED-SCORE schema in its partnered version (score_1.1.0).
 - changed specified HEDVersion in dataset_description.json
 - removed HED-SCORE prefix (sc:)
 - changed column name from 'HED' to 'HED_annotations' since the use of 'HED' is only allowed as a second-level child in these files.

The example was validated with bids-validator and HED jupyter notebook - BIDS data set that uses partnered schema has no HED validation errors.
Removing Other-organized-rhythm/non-interesting events (bckg) annotations since the background is considered baseline. If we end up with empty files they are removed.
Corrected mix-up with the TUH dataset stop_time and duration.
Updated example to use the HED-SCORE schema with score_1.1.1
 - changed specified HEDVersion in dataset_description.json
 - changed event annotations to match labels used for annotating the Temple University Hospital EEG Corpus (TUEG) and their corresponding SCORE HED library schema annotations.
@sappelhoff
Copy link
Member

@tpatpa I rebased your branch on bids-standard/bids-examples@master. You will have to run the following command locally before you continue to work on this: git fetch --all and then git reset --hard origin/master

NOTE if you have uncommitted local changes on your branch, do not run the above ☝️ ... just comment here in that case :-)

And in the future, it would be good if you could make a new git branch before submitting a pull request. Submitting pull requests directly from your master branch makes it difficult for you to stay up to date with the upstream repository (bids-standard/bids-examples@master).

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI fails with the following issue:

1: [ERR] The validation on this HED string returned an error. (code: 104 - HED_ERROR)
./dataset_description.json
Evidence: ERROR: [SCHEMA_LOAD_FAILED] Could not load HED schema "{"nickname":"","version":"1.1.1","library":"score","localPath":""}" from remote repository - "XMLHttpRequest is not defined". (For more information on this HED error, see https://hed-specification.readthedocs.io/en/latest/Appendix_B.html#schema-load-failed.)

Please visit https://neurostars.org/search?q=HED_ERROR for existing conversations about this issue.

cc @tpatpa @VisLab @dorahermes


Are you also sharing the full data of this example somewhere? For example on OSF or GIN?

Are the channel names such as FP1-F7 also how they occur in the .edf files? (in the full data, of course)

Furthermore: Why is this example called xeeg with an x, instead of just eeg?

This PR will also need to add the new example to our example index. I can do that for you, however, once we have resolved the above issue.

After that, I'd be happy to merge this.

@VisLab
Copy link
Member

VisLab commented Oct 27, 2023

We're having a chicken and egg problem here... The HED version must be a released version, but SCORE 1.1.1 has not been released yet due to unresolved issues. @dorahermes can you review 1.1.1 and we can discuss the issues offline and move towards release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants