
Biomarkers transform for ModelAD #148

Draft
wants to merge 24 commits into base: dev
Conversation

beatrizsaldana (Member) commented Sep 25, 2024

WIP: PyTest Failed - DO NOT REVIEW

Creates a new transform for the biomarkers dataset. The transform restructures the data as described in this Jira ticket.

This is my first PR in this repo. Please review carefully and be as brutally honest as necessary. It's better for me to learn things now than for us to have to go back and fix or add things later because nobody wanted to tell me I was doing something suboptimally.

Expected Changes

  1. Added modelad_test_config.yaml
  2. Added a biomarkers transform function
  3. Added test cases and test data

Unexpected Changes

  1. The transform_biomarkers() function outputs the transformed data as a list instead of the expected dict or pd.DataFrame.
  2. I added a list_to_json() function in src/agoradatatools/etl/load.py to accommodate the new output type.
  3. I added elif isinstance(df, list): and elif isinstance(df, DataFrame): branches to the process_dataset() function in src/agoradatatools/process.py.
  4. I added an else to catch errors if any of the functions output anything other than a list, dict, or pd.DataFrame.
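As a rough sketch, the branching described in items 1-4 might look like the following (`dispatch_output` and the serializer names are illustrative stand-ins, not the repo's actual `process_dataset()` code):

```python
from typing import Union

import pandas as pd


def dispatch_output(df: Union[pd.DataFrame, dict, list]) -> str:
    """Route a transform's output to a serializer based on its type."""
    # Mirrors the isinstance() branching described above, with an else
    # that rejects any unsupported output type.
    if isinstance(df, dict):
        return "dict_to_json"
    elif isinstance(df, list):
        return "list_to_json"
    elif isinstance(df, pd.DataFrame):
        return "df_to_json"
    else:
        raise TypeError(f"Unsupported transform output type: {type(df).__name__}")
```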

@BWMac what do you think about the Unexpected Changes? Would it be better for the transform_biomarkers() function to output a dict or pd.DataFrame and prevent any of these extra changes? All feedback is welcome.

staging_path=staging_path,
filename=dataset_name + "." + dataset_obj[dataset_name]["final_format"],
)
elif isinstance(df, DataFrame):
Member Author:

@BWMac here are the changes mentioned in the PR description. What do you think about them?

else:
raise ADTDataProcessingError(
f"Data processing failed for {dataset_name}. Dataframe type is not supported."
)
Member Author:

What do we think about this error handling? Is it necessary? Since we control all of the outputs, maybe we don't need it and can keep using the else: for the case where df is a pd.DataFrame.

Contributor:

I'll let Brad or others speak to the necessity of this; my comment is on the message itself.

If I saw this message "in a vacuum", I would ask myself "What WAS the Dataframe type when the exception was raised?"

Could that information be added to the exception?

Member Author:

Really great point! I updated the error message, let me know if you think I should make any more changes here.

@beatrizsaldana beatrizsaldana self-assigned this Sep 25, 2024
@beatrizsaldana beatrizsaldana marked this pull request as draft September 26, 2024 00:01
@beatrizsaldana beatrizsaldana added the enhancement New feature or request label Sep 26, 2024
import pandas as pd


def transform_biomarkers(datasets: dict) -> list:
Contributor:

Nit: is there more specific typing we can add for this, aside from a generic dict or list?

Also - The docstring:

    Returns:
        dict: a dictionary of biomarkers data modeled after intended final JSON structure

This does not match the function's return type of list.

An example of how to add more specific typing (Might not be correct based on my comment above):

from typing import Dict, List

def transform_biomarkers(datasets: Dict[str, str]) -> List[str]:

Member Author:

Oops! Thank you for catching this! I updated the typing hints and the docstring for the function :)

Contributor:

@BryanFauble @beatrizsaldana If I recall from previous experience, this isn't possible to do at this time. Adding type hints like list[str] isn't supported by Python 3.8 so our CI fails on it. Unfortunately we have to leave it generic.

Member Author:

Ohh, I have been using Python 3.9. I'll revert the typing hints back to generic dict, list, etc. Thank you! :)

Contributor:

You're fine to keep using 3.9; I think it's what gets installed by default when installing agora data tools. But our CI runs the test suite on 3(?) different Python versions, including 3.8, so sometimes we run across stuff that fails in 3.8 but not 3.9. I've run into this type hinting issue before haha.

Member Author:

Thank you! You probably saved me a lot of future failed debugging attempts :)

Member:

By the way, Python 3.8 EOL is coming up, so we may want to update

Contributor:

@jaclynbeck-sage @beatrizsaldana
It is supported in Python 3.8, just not in the way you were suggesting ("Adding type hints like list[str] isn't supported by Python 3.8 so our CI fails on it").

The following does work in Python 3.8:

from typing import Dict, List

def transform_biomarkers(datasets: Dict[str, str]) -> List[str]:

Contributor:

Ohh interesting. Learned something new! Thanks Bryan.

Member Author:

I'll use the typing library since I do really like to be specific with type hinting.

Comment on lines 177 to 180
temp_json = open(os.path.join(staging_path, filename), "w+")
json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
temp_json.close()
return temp_json.name
Contributor:

Generally, a context managed open is preferred like:

    with open(os.path.join(staging_path, filename), "w+") as temp_json:
          json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
          return temp_json.name

This is so you don't need to be concerned about calling .close(). Your approach is a valid way of accomplishing this; however, if it's the approach you want to take, the .close() should be within a finally block so it's guaranteed to execute.
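For comparison, the finally-block variant mentioned above might look like this (a sketch only; the repo's NumpyEncoder is omitted here since it isn't defined in this snippet):

```python
import json
import os


def write_json(df, staging_path: str, filename: str) -> str:
    # Explicit open/close: the finally block guarantees the handle is
    # released even if json.dump raises. The `with` form shown above is
    # the more idiomatic way to get the same guarantee.
    temp_json = open(os.path.join(staging_path, filename), "w+")
    try:
        json.dump(df, temp_json, indent=2)
        return temp_json.name
    finally:
        temp_json.close()
```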

Member Author (@beatrizsaldana, Sep 26, 2024):

Hmm, I do like this approach better than what I was doing; I was trying to copy what the other functions are doing. Feedback, please: should I...

  1. Update just this one function with the preferred context managed open
  2. Update all of the X to json functions with the preferred context managed open
  3. Leave things as they are and create a Jira ticket for updating the functions to use the preferred context managed open

Thoughts? @BryanFauble

Contributor:

I would:

  1. Update any of the code you are already touching to follow this approach
  2. Log a tech debt ticket to go back and look at the other areas of the code

Generally, the mantra I follow is: "Leave the code in a better place than when I started". That needs to be balanced with the scope of the change, the time you have to make the changes, and the time it's going to take to validate the change. Some minor things are probably not worth fixing if it means there is a significant effort required to test the change.

Contributor:

I agree, update your own code and make a ticket for anything else you notice. I'm not sure who to assign the issue to so it doesn't get lost in the ether, maybe Jess?

Member Author:

Thank you both for the feedback! I'll update the function I wrote, create a tech debt Jira ticket and assign it to Jess :)

Member Author:

@JessterB I need help figuring out where to create this Jira ticket 🙃


# Check that the dataset looks like what we expect
if not isinstance(biomarkers_dataset, pd.DataFrame):
raise ValueError("Biomarker dataset is not a pandas DataFrame")
Contributor:

Should we add what the biomarkers_dataset is to the exception message?

Member Author:

Yes we should! Thank you :)
I also changed the exception to TypeError because that is what it is.

pass_test_data = [
( # Pass with good real data
"biomarkers_good_input.csv",
"biomarkers_good_output.json",
Contributor:

I'm not sure I see a need to test both real data and fake data if they're both good input. Usually for my tests I just subset to a small number of rows from the real data as my test input, and then tweak a few things from there if I need to check what happens with missing values or duplicates.

"Pass with duplicated data",
]
fail_test_data = [
# No failure cases for this transform
Contributor:

In your transform you have 2 error conditions: A TypeError and a ValueError. Can you make some test data that will force each of these to happen in a test? test_proteomics_distribution_data.py has an example of failure condition code that checks for different error types depending on input.
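Such failure cases might look roughly like the following (the stub transform and its inputs are illustrative, reproducing the two error conditions discussed in this PR rather than the repo's actual code):

```python
import pandas as pd
import pytest


def transform_biomarkers_stub(datasets: dict) -> list:
    # Stand-in that reproduces the transform's two error conditions.
    df = datasets["biomarkers"]
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Expected a DataFrame, got {type(df).__name__}")
    if "model" not in df.columns:
        raise ValueError("Biomarker dataset does not contain expected columns")
    return df.to_dict("records")


@pytest.mark.parametrize(
    "bad_input, expected_error",
    [
        ({"biomarkers": "not a dataframe"}, TypeError),
        ({"biomarkers": pd.DataFrame({"wrong": [1]})}, ValueError),
    ],
)
def test_transform_biomarkers_errors(bad_input, expected_error):
    # pytest.raises fails the test unless the expected error type is raised.
    with pytest.raises(expected_error):
        transform_biomarkers_stub(bad_input)
```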

Contributor:

Actually now that I look at it, I'm not sure you can trigger the TypeError condition, since the input has to be a data frame by the time it gets to your transform. If it wasn't a DF, that means one of the earlier functions (load or standardize) would have failed first. In which case, it might be worth just removing that error check. test_genes_biodomains.py has a simpler, one-error-type failure case you can look at in that case.

biomarkers_dataset = datasets["biomarkers"]

# Check that the dataset looks like what we expect
if not isinstance(biomarkers_dataset, pd.DataFrame):
Contributor:

See my comment on your test function, this error check probably isn't necessary.

Member Author (@beatrizsaldana, Sep 26, 2024):

Yes, I was thinking about this earlier. The type hints should catch this :)

I'll remove it. Thank you for the validation!

].sort()
):
raise ValueError(
f"Biomarker dataset does not contain expected columns. Columns found: {list(biomarkers_dataset.columns)}"
Contributor:

It might be worth changing this check from == to checking that biomarkers contains those columns, so that the data set has to have those columns in it but can have extra columns we don't care about.

Member Author:

So true! I was trying to be strict with the error handling, but if there is a possibility of extra columns that we can just ignore, then I'll use isin or something like that instead of ==. Thank you!
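A containment check along those lines can be done with set operations (the column names below are taken from this discussion and may not match the final transform exactly):

```python
import pandas as pd

EXPECTED_COLUMNS = {
    "model", "type", "ageDeath", "tissue", "units",
    "genotype", "measurement", "sex",
}


def check_expected_columns(df: pd.DataFrame) -> None:
    # Require the expected columns to be present while tolerating extras,
    # and name the missing columns in the error message.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(
            f"Biomarker dataset is missing expected columns: {sorted(missing)}. "
            f"Columns found: {list(df.columns)}"
        )
```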

datasets (dict[str, pd.DataFrame]): dictionary of dataset names mapped to their DataFrame

Returns:
list[dict[str, Any]]: a list of dictionaries containing biomarker data modeled after intended final JSON structure
Contributor:

We do have some code in place to do basically what your transform does, in the form of a data frame instead of a list, where each row is a grouping and it contains a column with 'nested' data (it's the function nest_fields in utils.py). See the genes_biodomains.py transform for an example of how this is called.

In my head your grouping would be ['model', 'type', 'ageDeath', 'tissue', 'units'], and you would want it to nest ['genotype', 'measurement', 'sex'] into a new column called points. However, we've only used/tested nest_fields with one column as the grouping, never with multiple columns like is needed here, so I'm not sure this will work here. It might be simpler to leave the code as you've written it for now, take another look at nest_fields at some point to make it more generic to any number of columns, and edit later, but maybe @JessterB can weigh in.
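The multi-column grouping described here might be sketched with a plain pandas groupby (illustrative only; this is not the repo's nest_fields implementation, and the column names come from the comment above):

```python
import pandas as pd


def nest_points(df: pd.DataFrame) -> pd.DataFrame:
    # Group on the outer fields and nest the remaining fields into a
    # "points" column holding a list of records per group.
    group_cols = ["model", "type", "ageDeath", "tissue", "units"]
    point_cols = ["genotype", "measurement", "sex"]
    nested = (
        df.groupby(group_cols)[point_cols]
        .apply(lambda g: g.to_dict("records"))
        .reset_index(name="points")
    )
    return nested
```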

Contributor:

My 2c: I forgot about nest_fields... ideally we would use nest_fields for this, but if that will add a significant amount of work then we shouldn't tackle it as part of this PR. We can add a ticket to the backlog for extending nest_fields and move forward with the existing implementation.

Member Author:

I did consider this approach but felt it was extra computational work that could be avoided by keeping the data as a list. However, I do agree that, in terms of codebase maintenance, it would be best for us to use existing approaches, because I did have to make a few changes to accommodate the new type of transform output. I can definitely update the transform to output a pd.DataFrame instead of a list. Thoughts?

BWMac (Contributor) commented Sep 27, 2024:

Just dropping a comment to say that I'm watching this PR and will do a review once it is marked as ready (unless told otherwise)!


sonarcloud bot commented Sep 27, 2024

Labels
enhancement New feature or request

6 participants