race labels for MIMIC-CXR ? #39

Open
robintibor opened this issue Sep 20, 2021 · 5 comments

@robintibor

Hi,

I was wondering how to obtain the race labels for MIMIC-CXR.

I have access to https://physionet.org/content/mimic-cxr/2.0.0/ and https://physionet.org/content/mimic-cxr-jpg/2.0.0/ but could not find where you get the white/asian/black labels.

For example, how do you create the modified_viewposition_race_4-race-ethnicity_60-10-30_split_with_gender_age_ver_b.csv that you use in the training code?

Thanks for any help,
Best,
Robin

@blackboxradiology
Contributor

blackboxradiology commented Sep 20, 2021

Hi Robin,

Race labels can be found here, under the core directory, in the admissions dataset. From there you can join the admissions subject_id with the CXR subject_id.
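The join described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code used to build the CSV in this repo: the `attach_ethnicity` helper, the `cxr_df` table, and its `dicom_id` column are assumptions made for the example, while `subject_id` and `ethnicity` are the admissions columns mentioned in this thread.

```python
import pandas as pd


def attach_ethnicity(cxr_df: pd.DataFrame, admissions_df: pd.DataFrame) -> pd.DataFrame:
    """Left-join ethnicity values from admissions onto CXR records (hypothetical helper)."""
    # Admissions has one row per hospital stay, so collapse to
    # distinct (subject_id, ethnicity) pairs before joining.
    ethnicity_df = admissions_df[["subject_id", "ethnicity"]].drop_duplicates()
    # A left join keeps every CXR record; subjects without an admission get NaN.
    # Subjects with several distinct ethnicity values would be duplicated here.
    return cxr_df.merge(ethnicity_df, on="subject_id", how="left")


# Toy stand-ins for the real tables (assumed layout)
cxr_df = pd.DataFrame(
    {"subject_id": [1, 1, 2, 3], "dicom_id": ["a", "b", "c", "d"]}
)
admissions_df = pd.DataFrame(
    {"subject_id": [1, 1, 2], "ethnicity": ["WHITE", "WHITE", "ASIAN"]}
)

labeled_df = attach_ethnicity(cxr_df, admissions_df)
print(labeled_df)
```

With the real data, `cxr_df` would presumably be loaded from the MIMIC-CXR metadata and `admissions_df` from admissions.csv.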

Let us know if we can help with anything else!

@robintibor
Author

Ah, amazing, thanks, that clears it up! Another question: am I understanding correctly that some of the code that preprocesses MIMIC-CXR is not in this repo? That is, one cannot just follow:

  1. Fork/Download the GitHub repository.
  2. Fetch the data from the data URLs for open-source datasets and drop them in the data folder.
  3. Run the corresponding training code and save the trained model in the models folder.

for MIMIC-CXR, because https://github.com/Emory-HITI/AI-Vengers/blob/cbdf593b0d852e3078abbc72cf92aad03496511d/training_code/CXR_training/MIMIC/MIMIC_resnet34_race_detection_2021_06_29.ipynb starts from a dataframe that you created with code that is not in this repo?

@blackboxradiology
Contributor

That's correct. At the moment you would have to join the CSV dataframes and make your own train-val-test splits, as we did for modified_viewposition_race_4-race-ethnicity_60-10-30_split_with_gender_age_ver_b.csv
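One way such a split could be made is sketched below. It assumes that the "60-10-30" in the file name means train/validate/test fractions in that order and that the split is done per subject, so that all images of one patient land in the same split; neither is confirmed in this thread. The `split_by_subject` function and the `split` column are hypothetical names for illustration.

```python
import numpy as np
import pandas as pd


def split_by_subject(df: pd.DataFrame, fractions=(0.6, 0.1, 0.3), seed=0) -> pd.DataFrame:
    """Assign each subject_id to train/validate/test (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique subjects, not the rows, to avoid patient leakage.
    subjects = rng.permutation(df["subject_id"].unique())
    n_train = int(round(fractions[0] * len(subjects)))
    n_val = int(round(fractions[1] * len(subjects)))
    split_of = {}
    for i, subject in enumerate(subjects):
        if i < n_train:
            split_of[subject] = "train"
        elif i < n_train + n_val:
            split_of[subject] = "validate"
        else:
            split_of[subject] = "test"
    out = df.copy()
    out["split"] = out["subject_id"].map(split_of)
    return out


# Toy data: 10 subjects with 2 images each
toy_df = pd.DataFrame(
    {"subject_id": np.repeat(np.arange(10), 2), "dicom_id": [f"img{i}" for i in range(20)]}
)
split_df = split_by_subject(toy_df)
subjects_per_split = split_df.groupby("split")["subject_id"].nunique()
print(subjects_per_split)
```

Splitting on `subject_id` rather than on rows is a design choice of this sketch: a row-level split could put images of the same patient in both train and test.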

@robintibor
Author

I see.
One more question that came up:
Did you try to handle subjects with multiple values for ethnicity in any way? For example, the following code shows that 168 subjects were entered both as BLACK/AFRICAN AMERICAN and as WHITE, and 2489 subjects both as OTHER and as WHITE:

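import os
import pandas as pd

# mimic_folder is the local directory that holds admissions.csv
admissions_df = pd.read_csv(os.path.join(mimic_folder, 'admissions.csv'))
# One row per distinct (subject_id, ethnicity) pair
ethnicity_df = admissions_df.loc[:, ['subject_id', 'ethnicity']].drop_duplicates()

# Subjects that appear with more than one distinct ethnicity value
v = ethnicity_df.subject_id.value_counts()
subject_id_more_than_once = v.index[v.gt(1)]

ambiguous_ethnicity_df = ethnicity_df[ethnicity_df.subject_id.isin(subject_id_more_than_once)]

# Count how often each combination of values occurs, e.g. "OTHER_WHITE"
grouped = ambiguous_ethnicity_df.groupby('subject_id')
grouped.aggregate(lambda x: "_".join(sorted(x))).ethnicity.value_counts()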
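One possible way to handle such subjects, sketched here on toy data, is simply to exclude anyone with more than one distinct ethnicity value. This is only an illustration of one option and not something the authors are known to do; the `drop_ambiguous_subjects` helper is a hypothetical name, and other strategies (for example keeping the most frequent value) would be equally conceivable.

```python
import pandas as pd


def drop_ambiguous_subjects(admissions_df: pd.DataFrame) -> pd.DataFrame:
    """Keep only subjects with exactly one distinct ethnicity value (hypothetical helper)."""
    ethnicity_df = admissions_df[["subject_id", "ethnicity"]].drop_duplicates()
    # Number of distinct ethnicity values recorded per subject
    n_labels = ethnicity_df.groupby("subject_id")["ethnicity"].nunique()
    unambiguous_ids = n_labels.index[n_labels == 1]
    kept = ethnicity_df[ethnicity_df["subject_id"].isin(unambiguous_ids)]
    return kept.reset_index(drop=True)


# Toy admissions table: subject 2 was entered as both WHITE and OTHER
toy_admissions_df = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 2, 3],
        "ethnicity": ["WHITE", "WHITE", "WHITE", "OTHER", "ASIAN"],
    }
)
clean_df = drop_ambiguous_subjects(toy_admissions_df)
print(clean_df)
```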

@blackboxradiology
Contributor

blackboxradiology commented Sep 21, 2021

Wow! Great catch! As far as I know, we were unaware of this multiple-ethnicity problem. I will look into this and test using these changes. I suspect it could improve performance by reducing noise from mislabeled patients.

Thank you!
