Implement check for missing `participant_id` and `session` values #20

alyssadai · 2023-03-31T06:57:54Z

Both the imaging and phenotypic input schemas require participant_id and session columns to be present, but the dashboard currently does not check whether these columns contain any missing values.

We do not want to keep this behaviour for several reasons, including:

- Every measurement should be associated with a specific participant and session/visit, otherwise the data is not very usable
- Neurobagel tools impose the same restrictions on tabular data
- On a technical level, allowing and preserving missing data (NaN) in these columns complicates the process of reshaping data into wide format, as they serve as 'index' columns which are expected to be non-missing values
- The session column is cast to str in the app to avoid numerical session labels being treated as continuous in plots, etc. - having NaN in this column makes the calculation of unique records (participant-sessions) pretty confusing

Decisions:

- One column with session info should be required in each bagel type:
  - session for imaging bagel
  - visit for phenotypic bagel
- Add a check that when participant_id and {session, visit} columns are present, there are no missing values for these columns (otherwise error)
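
For illustration, a minimal sketch of what the missing-value check in the second decision could look like, assuming the bagel is loaded as a pandas DataFrame. The function name `count_missing_index_values` and the `SESSION_COLUMN` mapping are hypothetical and not part of the dashboard code.

```python
import pandas as pd

# Session info column per bagel type, following the Decisions above.
# The mapping and the function name are illustrative assumptions.
SESSION_COLUMN = {"imaging": "session", "phenotypic": "visit"}


def count_missing_index_values(df: pd.DataFrame, bagel_type: str) -> dict:
    """Return {column: number of missing values} for the columns that have any."""
    required = ["participant_id", SESSION_COLUMN[bagel_type]]
    # Only inspect columns that are present; absence is a separate schema error.
    present = [col for col in required if col in df.columns]
    missing = df[present].isna().sum()
    return {col: int(n) for col, n in missing.items() if n > 0}


bagel = pd.DataFrame(
    {
        "participant_id": ["sub-01", None, "sub-03"],
        "session": ["ses-01", "ses-01", None],
    }
)
errors = count_missing_index_values(bagel, "imaging")
if errors:
    print(f"Missing values found in index columns: {errors}")
```

This sketch only catches values that pandas treats as missing (NaN/None). Empty or whitespace-only strings would need an additional check.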


github-actions · 2023-08-19T01:54:35Z

We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 30 days.
We have applied the stale-issue label to indicate that this issue should be reviewed again and then either prioritized or closed.

github-actions · 2023-09-19T02:01:17Z

We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 30 days.
We have applied the stale-issue label to indicate that this issue should be reviewed again and then either prioritized or closed.

alyssadai · 2024-03-18T19:04:12Z

@nikhil153, @michellewang: I'd like to address this soon, let me know if the issue description makes sense to you or if you have any thoughts on any of the Decisions.

michellewang · 2024-03-18T21:46:33Z

Should the session column itself always be required (e.g., what about for cross-sectional data)?

In Nipoppy we force all imaging data to have a session, even for cross-sectional datasets, so that's fine on our side. But if people aren't using Nipoppy then it could be annoying to invent a session just for the dashboard. Would it make sense (for the dashboard) to add a dummy/constant session column to the input bagel if there isn't one, then do the checks/wrangling as normal?

> Add a check that when participant_id and {session, visit} columns are present, there are no missing values for these columns (otherwise error)

Sounds good to me!

nikhil153 · 2024-03-18T23:16:21Z

I think if it's not too much work, I would prefer to have a check for either session or visit column, since both are used quite commonly by people.
I would also make it mandatory to have at least one of these columns in the bagel and refrain from populating them automatically when they are absent because that would 1) require checking for duplicate participant_id entries which would be assigned the same baseline session label and 2) create confusion in cases where visit or session column was accidentally dropped.
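
To make point 1) concrete, here is a small sketch assuming pandas and a made-up longitudinal table whose session column was dropped. The placeholder label `ses-01` is purely illustrative.

```python
import pandas as pd

# Hypothetical longitudinal table: sub-01 has two measurements,
# but the session column is absent.
df = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-01", "sub-02"],
        "score": [10, 12, 9],
    }
)

# Auto-populating a constant session label (placeholder value, an assumption)...
df["session"] = "ses-01"

# ...makes both sub-01 rows share the same participant-session pair.
n_dupes = int(df.duplicated(subset=["participant_id", "session"]).sum())
print(n_dupes)
```

The second sub-01 row is flagged as a duplicate of the first, so a duplicate check would be needed before the filled column could be trusted.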

nikhil153 · 2024-03-18T23:17:20Z

> Add a check that when participant_id and {session, visit} columns are present, there are no missing values for these columns (otherwise error)

This makes sense!

michellewang · 2024-03-18T23:40:58Z

> I would also make it mandatory to have at least one of these columns in the bagel and refrain from populating them automatically when they are absent because that would 1) require checking for duplicate participant_id entries which would be assigned the same baseline session label

I don't mind forcing the user to have a session/visit column, but I think in any case the dashboard should check that there are no duplicate entries for the index, because that would be unexpected (and very likely wrong).

@alyssadai for the imaging bagel I imagine that the "index" columns would be participant_id, session and/or visit, pipeline_name and pipeline_version? And for the phenotypic bagel it's participant_id, session and/or visit, and assessment_name?

But maybe that should be a separate issue/discussion.

alyssadai · 2024-03-19T22:05:03Z

Thanks for your feedback!!

> I think if it's not too much work, I would prefer to have a check for either session or visit column, since both are used quite commonly by people

I understand the logic. As a first step though, if we want to introduce the visit column name, I think we still should pick a 'primary' column for session info (one of session or visit; note that we can pick a different one for phenotypic vs. imaging bagels). This is because the dashboard needs to know which column takes precedence when both session and visit columns are present in an input file for picking one column to use to stratify plots by, etc.

If you both agree that visit generally makes more sense in the context of phenotypic data, my proposal for now would be to update the phenotypic bagel schema to replace "session" with "visit" below:

digest/schemas/bagel_schema_pheno.json, lines 15 to 19 in 08381a8:

    "session": {
        "Description": "Participant session ID.",
        "dtype": "str",
        "IsRequired": true,
        "IsPrefixedColumn": false

We then have the option to also add to both schemas a secondary session info column (presumably called visit for imaging, session for pheno) which will have "IsRequired": false, similar to how bids_id is treated. From there, I could add conditionals that if the primary (required) session info column is missing, the secondary one is present. These changes have greater implications for dashboard interactivity however, so I would address them in a separate issue.
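
As a rough sketch of the conditional described above (use the primary session info column when present, otherwise fall back to the secondary one, and error if neither exists), something like the following could work. The function name `pick_session_column` is hypothetical, and which column is primary for each bagel type is passed in as an assumption rather than read from the schema.

```python
from typing import Iterable


def pick_session_column(columns: Iterable[str], primary: str, secondary: str) -> str:
    """Return the column to treat as session info, preferring the primary one."""
    available = set(columns)
    if primary in available:
        # The primary column takes precedence even when both are present.
        return primary
    if secondary in available:
        return secondary
    raise ValueError(
        f"Expected at least one of '{primary}' or '{secondary}' in the input file."
    )


# Phenotypic bagel: 'visit' is assumed primary, 'session' secondary.
chosen = pick_session_column(
    ["participant_id", "session", "visit", "assessment_name"],
    primary="visit",
    secondary="session",
)
print(chosen)
```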

For the current issue, it sounds like we're all in agreement that the participant_id and primary session info columns should not have missing values or duplicate combinations. For the latter, the dashboard actually already has a check for duplicates, across all the 'index' columns that uniquely identify a subject measurement in the context of the schema --

> @alyssadai for the imaging bagel I imagine that the "index" columns would be participant_id, session and/or visit, pipeline_name and pipeline_version? And for the phenotypic bagel it's participant_id, session and/or visit, and assessment_name?

-- exactly this (when assessment_version is present, it uses that too). :)
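
For reference, a duplicate check over these index columns can be sketched roughly as below, assuming pandas. This is not the dashboard's actual implementation: the constants and the function name `find_duplicate_records` are made up for illustration, and the session info column is assumed to be named `session` in both bagel types.

```python
import pandas as pd

# Index columns as described in this thread (names assumed, not read from the schema).
IMAGING_INDEX = ["participant_id", "session", "pipeline_name", "pipeline_version"]
PHENO_INDEX = ["participant_id", "session", "assessment_name"]


def find_duplicate_records(df, index_cols, optional_cols=()):
    """Return all rows whose index-column combination occurs more than once."""
    # Optional columns (e.g. assessment_version) are used only when present.
    cols = list(index_cols) + [c for c in optional_cols if c in df.columns]
    return df[df.duplicated(subset=cols, keep=False)]


pheno = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-01"],
        "session": ["ses-01", "ses-01"],
        "assessment_name": ["moca", "moca"],
        "assessment_version": ["v1", "v2"],
    }
)
with_version = find_duplicate_records(pheno, PHENO_INDEX, ["assessment_version"])
without_version = find_duplicate_records(pheno, PHENO_INDEX)
print(len(with_version), len(without_version))
```

In this made-up table the two rows differ only by assessment_version, so they count as duplicates only when that column is left out of the index.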

Let me know if that makes sense @nikhil153 @michellewang !

nikhil153 · 2024-03-19T22:48:30Z

> If you both agree that visit generally makes more sense in the context of phenotypic data, my proposal for now would be to update the phenotypic bagel schema to replace "session" with "visit" below:

Yes, let's use visit as a primary column for phenotypic bagel since many of these visits will not have a bids session.

github-actions · 2024-06-06T01:46:55Z

We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 75 days.
We have applied the _flag:stale label to indicate that this issue should be reviewed again.
When you review, please reread the spec and then apply one of these three options:

- prioritize: apply the flag:schedule label to suggest moving this issue into the backlog now
- close: if the issue is no longer relevant, explain why (give others a chance to reply) and then close.
- archive: sometimes an issue has important information or ideas but we won't work on it soon. In this case apply the someday label to show that this won't be prioritized. The stalebot will ignore issues with this label in the future. Use sparingly!

alyssadai added bug:functional Functional defects resulting from feature changes. type:bug Defects in shipped code and fixes for those defects labels Mar 31, 2023

alyssadai mentioned this issue Jun 8, 2023

Add further validation of bagel.csv columns + column values against schema #33

Open

4 tasks

github-actions bot added the _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again label Aug 19, 2023

github-actions bot removed the @ flag:stale label Sep 20, 2023

alyssadai changed the title ~~Implement check for missing participant_ids~~ Implement check for missing participant_id and session values Mar 5, 2024

github-actions bot removed the _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again label Mar 20, 2024

github-actions bot added the _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again label Jun 6, 2024
