Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement check for missing participant_id and session values #20

Open
alyssadai opened this issue Mar 31, 2023 · 10 comments
Open

Implement check for missing participant_id and session values #20

alyssadai opened this issue Mar 31, 2023 · 10 comments
Labels
bug:functional Functional defects resulting from feature changes. _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again type:bug Defects in shipped code and fixes for those defects

Comments

@alyssadai
Copy link
Collaborator

alyssadai commented Mar 31, 2023

Both the imaging and phenotypic input schema require participant_id and session columns to be present, but currently the dashboard does not check whether there are any missing values in these columns.

We do not want to keep this behaviour for several reasons, including:

  • Every measurement should be associated with a specific participant and session/visit, otherwise the data is not very usable
  • Neurobagel tools impose the same restrictions on tabular data
  • On a technical level, allowing and preserving missing data (NaN) in these columns complicates the process of reshaping data into wide format, as they serve as 'index' columns which are expected to be non-missing values
  • The session column is cast to str in the app to avoid numerical session labels being treated as continuous in plots, etc. - having NaN in this column makes the calculation of unique records (participant-sessions) pretty confusing

Decisions:

  • One column with session info should be required in each bagel type:
    • session for imaging bagel
    • visit for phenotypic bagel
  • Add a check that when participant_id and {session, visit} columns are present, there are no missing values for these columns (otherwise error)
@alyssadai alyssadai added bug:functional Functional defects resulting from feature changes. type:bug Defects in shipped code and fixes for those defects labels Mar 31, 2023
@github-actions
Copy link

We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 30 days.
We have applied the stale-issue label to indicate that this issue should be reviewed again and then either prioritized or closed.

@github-actions github-actions bot added the _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again label Aug 19, 2023
@github-actions
Copy link

We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 30 days.
We have applied the stale-issue label to indicate that this issue should be reviewed again and then either prioritized or closed.

@alyssadai alyssadai changed the title Implement check for missing participant_ids Implement check for missing participant_id and session values Mar 5, 2024
@alyssadai
Copy link
Collaborator Author

@nikhil153, @michellewang: I'd like to address this soon, let me know if the issue description makes sense to you or if you have any thoughts on any of the Decisions.

@michellewang
Copy link
Collaborator

Should the session column itself should always be required (e.g., what about for cross-sectional data)?

In Nipoppy we force all imaging data to have a session, even for cross-sectional datasets, so that's fine on our side. But if people aren't using Nipoppy then it could be annoying to invent a session just for the dashboard. Would it make sense (for the dashboard) to add a dummy/constant session column to the input bagel if there isn't one, then do the checks/wrangling as normal?

Add a check that when participant_id and {session, visit} columns are present, there are no missing values for these columns (otherwise error)

Sounds good to me!

@nikhil153
Copy link
Contributor

I think if it's not too much work, I would prefer to have a check for either session or visit column, since bothare used quite commonly by people.
I would also make it mandatory to have at least one of these columns in the bagel and refrain from populating them automatically when they are absent because that would 1) require checking for duplicate participant_id entries which would be assigned the same baseline session label and 2) create confusion in cases where visit or session column was accidentally dropped.

@nikhil153
Copy link
Contributor

Add a check that when participant_id and {session, visit} columns are present, there are no missing values for these columns (otherwise error)

This makes sense!

@michellewang
Copy link
Collaborator

I would also make it mandatory to have at least one of these columns in the bagel and refrain from populating them automatically when they are absent because that would 1) require checking for duplicate participant_id entries which would be assigned the same baseline session label

I don't mind forcing the user to have a session/visit column, but I think in any case the dashboard should check that there are no duplicate entries for the index, because that would be unexpected (and very likely wrong).

@alyssadai for the imaging bagel I imagine that the "index" columns would be participant_id, session and/or visit, pipeline_name and pipeline_version? And for the phenotypic bagel it's participant_id, session and/or visit, and assessment_name?

But maybe that should be a separate issue/discussion.

@alyssadai
Copy link
Collaborator Author

Thanks for your feedback!!

I think if it's not too much work, I would prefer to have a check for either session or visit column, since both are used quite commonly by people

I understand the logic. As a first step though, if we want to introduce the visit column name, I think we still should pick a 'primary' column for session info (one of session or visit; note that we can pick a different one for phenotypic vs. imaging bagels). This is because the dashboard needs to know which column takes precedence when both session and visit columns are present in an input file for picking one column to use to stratify plots by, etc.

If you both agree that visit generally makes more sense in the context of phenotypic data, my proposal for now would be to update the phenotypic bagel schema to replace "session" with "visit" below:

"session": {
"Description": "Participant session ID.",
"dtype": "str",
"IsRequired": true,
"IsPrefixedColumn": false

We then have the option to also add to both schemas a secondary session info column (presumably called visit for imaging, session for pheno) which will have "IsRequired": false, similar to how bids_id is treated. From there, I could add conditionals that if the primary (required) session info column is missing, the secondary one is present. These changes have greater implications for dashboard interactivity however, so I would address them in a separate issue.

For the current issue, it sounds like we're all in agreement that the participant_id and primary session info columns should not have missing values or duplicate combinations. For the latter, the dashboard actually already has a check for duplicates, across all the 'index' columns that unique identify a subject measurement in the context of the schema --

@alyssadai for the imaging bagel I imagine that the "index" columns would be participant_id, session and/or visit, pipeline_name and pipeline_version? And for the phenotypic bagel it's participant_id, session and/or visit, and assessment_name?

-- exactly this (when assessment_version is present, it uses that too). :)

Let me know if that makes sense @nikhil153 @michellewang !

@nikhil153
Copy link
Contributor

nikhil153 commented Mar 19, 2024

If you both agree that visit generally makes more sense in the context of phenotypic data, my proposal for now would be to update the phenotypic bagel schema to replace "session" with "visit" below:

Yes, let's use visit as a primary column for phenotypic bagel since many of these visits will not have a bids session.

@github-actions github-actions bot removed the _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again label Mar 20, 2024
Copy link

github-actions bot commented Jun 6, 2024

We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 75 days.
We have applied the _flag:stale label to indicate that this issue should be reviewed again.
When you review, please reread the spec and then apply one of these three options:

  • prioritize: apply the flag:schedule label to suggest moving this issue into the backlog now
  • close: if the issue is no longer relevant, explain why (give others a chance to reply) and then close.
  • archive: sometimes an issue has important information or ideas but we won't work on it soon. In this case
    apply the someday label to show that this won't be prioritized. The stalebot will ignore issues with this
    label in the future. Use sparingly!

@github-actions github-actions bot added the _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again label Jun 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug:functional Functional defects resulting from feature changes. _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again type:bug Defects in shipped code and fixes for those defects
Projects
Status: No status
Development

No branches or pull requests

3 participants