Biomarkers transform for ModelAD #148
base: dev
Conversation
…functional at the moment
…puts a list and so a new list_to_json() function was added to the load module and logic to handle this was added to the process_dataset function
```python
        staging_path=staging_path,
        filename=dataset_name + "." + dataset_obj[dataset_name]["final_format"],
    )
elif isinstance(df, DataFrame):
```
@BWMac here are the changes mentioned in the PR description. What do you think about them?
```python
else:
    raise ADTDataProcessingError(
        f"Data processing failed for {dataset_name}. Dataframe type is not supported."
    )
```
What do we think about this error handling? Is it necessary? Since we control all of the outputs, maybe we don't need it and can keep using the `else:` for the case where `df` is a `pd.DataFrame`.
I'll let Brad or others speak to the necessity of this; my comment is on the message itself. If I saw this message "in a vacuum" I would ask myself, "What WAS the Dataframe type when the exception was raised?" Could that information be added to the exception?
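One way to surface the offending type, sketched with a stand-in exception class (the real `ADTDataProcessingError` lives in the project, and `ensure_supported` is a hypothetical helper for illustration):

```python
class ADTDataProcessingError(Exception):
    """Stand-in for the project's processing error class."""


def ensure_supported(df, dataset_name: str) -> None:
    # Include the actual runtime type in the message so failures are self-explanatory.
    if not isinstance(df, (list, dict)):
        raise ADTDataProcessingError(
            f"Data processing failed for {dataset_name}. "
            f"Unsupported dataframe type: {type(df).__name__}"
        )
```

With this, a failure on an `int` input reports `Unsupported dataframe type: int` rather than a generic message.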
Really great point! I updated the error message, let me know if you think I should make any more changes here.
```python
import pandas as pd


def transform_biomarkers(datasets: dict) -> list:
```
Nit: is there more specific typing we can add for this aside from a generic `dict` or `list`?
Also, the docstring:

```
Returns:
    dict: a dictionary of biomarkers data modeled after intended final JSON structure
```

does not match the function's return type of `list`.
An example of how to add more specific typing (might not be correct based on my comment above):

```python
from typing import Dict, List

def transform_biomarkers(datasets: Dict[str, str]) -> List[str]:
```
Oops! Thank you for catching this! I updated the typing hints and the docstring for the function :)
@BryanFauble @beatrizsaldana If I recall from previous experience, this isn't possible to do at this time. Adding type hints like `list[str]` isn't supported by Python 3.8, so our CI fails on it. Unfortunately we have to leave it generic.
Ohh, I have been using Python 3.9. I'll revert the typing hints back to generic dict, list, etc. Thank you! :)
You're fine to keep using 3.9, I think it's what gets installed by default when installing agora data tools, but our CI runs the test suite on 3(?) different Python versions, including 3.8, so sometimes we run across stuff that fails in 3.8 but not 3.9. I've run into this type hinting issue before haha.
Thank you! You probably saved me a lot of future failed debugging attempts :)
By the way, Python 3.8 EOL is coming up, so we may want to update
@jaclynbeck-sage @beatrizsaldana It is supported in Python 3.8, just not in the way you are suggesting ("Adding type hints like `list[str]` isn't supported by Python 3.8 so our CI fails on it"). The following does work in Python 3.8:

```python
from typing import Dict, List

def transform_biomarkers(datasets: Dict[str, str]) -> List[str]:
```
Ohh interesting. Learned something new! Thanks Bryan.
I'll use the typing library since I do really like to be specific with type hinting.
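For reference, a Python 3.8-compatible signature built with the `typing` module might look like this sketch (the parameter and return shapes are assumptions based on the docstring discussed in this PR; the body is a minimal stand-in):

```python
from typing import Any, Dict, List

import pandas as pd


def transform_biomarkers(datasets: Dict[str, pd.DataFrame]) -> List[Dict[str, Any]]:
    """Sketch only: illustrates Python 3.8-compatible generics from `typing`."""
    biomarkers_dataset = datasets["biomarkers"]
    # Convert each row to a dict, yielding the list-of-dicts shape.
    return biomarkers_dataset.to_dict(orient="records")
```

`Dict[str, pd.DataFrame]` and `List[Dict[str, Any]]` work on 3.8, unlike the bare `dict[str, ...]` / `list[...]` generics.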
src/agoradatatools/etl/load.py (outdated)
```python
temp_json = open(os.path.join(staging_path, filename), "w+")
json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
temp_json.close()
return temp_json.name
```
Generally, a context-managed open is preferred, like:

```python
with open(os.path.join(staging_path, filename), "w+") as temp_json:
    json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
return temp_json.name
```

This is so you don't need to be concerned about calling `.close()`. Calling it yourself is a valid way of accomplishing this; however, if that is the approach you want to take, the `.close()` should be within a `finally` block so it's guaranteed to execute.
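For comparison, the `finally` variant mentioned above might look like this sketch (signature simplified; the project's `NumpyEncoder` is omitted so the example stays self-contained):

```python
import json
import os


def list_to_json(df: list, staging_path: str, filename: str) -> str:
    # Same behavior without a context manager: the close is guaranteed by `finally`,
    # even if json.dump raises partway through.
    temp_json = open(os.path.join(staging_path, filename), "w+")
    try:
        json.dump(df, temp_json, indent=2)
    finally:
        temp_json.close()
    return temp_json.name
```

The `with` form is shorter and harder to get wrong, which is why it is usually preferred over manual `try`/`finally`.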
Hmm, I do like this approach better than what I was doing. I was trying to copy what the other functions are doing. Feedback please: Should I...
- Update just this one function with the preferred context managed open
- Update all of the X to json functions with the preferred context managed open
- Leave things as they are and create a Jira ticket for updating the functions to use the preferred context managed open
Thoughts? @BryanFauble
I would:
- Update any of the code you are already touching to follow this approach
- Log a tech debt ticket to go back and look at the other areas of the code
Generally, the mantra I follow is: "Leave the code in a better place than when I started". That needs to be balanced with the scope of the change, the time you have to make the changes, and the time it's going to take to validate the change. Some minor things are probably not worth fixing if it means there is a significant effort required to test the change.
I agree, update your own code and make a ticket for anything else you notice. I'm not sure who to assign the issue to so it doesn't get lost in the ether, maybe Jess?
Thank you both for the feedback! I'll update the function I wrote, create a tech debt Jira ticket and assign it to Jess :)
@JessterB I need help figuring out where to create this Jira ticket 🙃
```python
# Check that the dataset looks like what we expect
if not isinstance(biomarkers_dataset, pd.DataFrame):
    raise ValueError("Biomarker dataset is not a pandas DataFrame")
```
Should we add what the `biomarkers_dataset` is to the exception message?
Yes we should! Thank you :)
I also changed the exception to `TypeError` because that is what it is.
…m_biomarkers() function.
…problems with python 3.8
```python
pass_test_data = [
    (  # Pass with good real data
        "biomarkers_good_input.csv",
        "biomarkers_good_output.json",
```
I'm not sure I see a need to test both real data and fake data if they're both good input. Usually for my tests I just subset to a small number of rows from the real data as my test input, and then tweak a few things from there if I need to check what happens with missing values or duplicates.
```python
    "Pass with duplicated data",
]
fail_test_data = [
    # No failure cases for this transform
```
In your transform you have two error conditions: a `TypeError` and a `ValueError`. Can you make some test data that will force each of these to happen in a test? `test_proteomics_distribution_data.py` has an example of failure-condition code that checks for different error types depending on input.
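A failure-case test along these lines might be sketched as follows (the transform body here is a minimal stand-in with assumed column names, not the PR's actual implementation):

```python
import pandas as pd
import pytest


def transform_biomarkers(datasets: dict) -> list:
    # Minimal stand-in for the transform under discussion.
    df = datasets["biomarkers"]
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Biomarker dataset is not a pandas DataFrame: {type(df).__name__}")
    if not {"model", "measurement"}.issubset(df.columns):
        raise ValueError(f"Missing expected columns. Columns found: {list(df.columns)}")
    return df.to_dict(orient="records")


@pytest.mark.parametrize(
    "bad_input,expected_error",
    [
        ({"biomarkers": "not a dataframe"}, TypeError),
        ({"biomarkers": pd.DataFrame({"wrong": [1]})}, ValueError),
    ],
)
def test_transform_biomarkers_failures(bad_input, expected_error):
    # Each bad input should raise its specific error type.
    with pytest.raises(expected_error):
        transform_biomarkers(bad_input)
```

Parametrizing over `(input, expected_error)` pairs keeps one test covering both error branches.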
Actually, now that I look at it, I'm not sure you can trigger the `TypeError` condition, since the input has to be a data frame by the time it gets to your transform. If it wasn't a DF, that means one of the earlier functions (load or standardize) would have failed first. In which case, it might be worth just removing that error check. `test_genes_biodomains.py` has a simpler, one-error-type failure case you can look at in that case.
```python
biomarkers_dataset = datasets["biomarkers"]

# Check that the dataset looks like what we expect
if not isinstance(biomarkers_dataset, pd.DataFrame):
```
See my comment on your test function, this error check probably isn't necessary.
Yes, I was thinking about this earlier. The type hints should catch this :)
I'll remove it. Thank you for the validation!
```python
    ].sort()
):
    raise ValueError(
        f"Biomarker dataset does not contain expected columns. Columns found: {list(biomarkers_dataset.columns)}"
    )
```
It might be worth changing this check from `==` to checking that biomarkers *contains* those columns, so that the data set has to have those columns in it but can have extra columns we don't care about.
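A containment check like the one suggested might be sketched as follows (the column names are assumptions for illustration, mirroring the fields mentioned elsewhere in this thread):

```python
import pandas as pd

REQUIRED_COLUMNS = {
    "model", "type", "ageDeath", "tissue", "units",
    "genotype", "measurement", "sex",
}


def check_columns(biomarkers_dataset: pd.DataFrame) -> None:
    # Require the expected columns but tolerate extra ones.
    missing = REQUIRED_COLUMNS - set(biomarkers_dataset.columns)
    if missing:
        raise ValueError(
            f"Biomarker dataset is missing expected columns: {sorted(missing)}. "
            f"Columns found: {list(biomarkers_dataset.columns)}"
        )
```

Set subtraction also lets the error message name exactly which columns are absent, which is friendlier than a strict equality check.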
So true! I was trying to be strict with the error handling, but if there is a possibility for extra columns that we could just ignore, then I'll use `isin` or something like that instead of the `==`. Thank you!
```python
    datasets (dict[str, pd.DataFrame]): dictionary of dataset names mapped to their DataFrame

Returns:
    list[dict[str, Any]]: a list of dictionaries containing biomarker data modeled after intended final JSON structure
```
We do have some code in place to basically do your transform in the form of a data frame instead of a list, where each row is a grouping and it contains a column with 'nested' data (it's the function `nest_fields` in `utils.py`). See the `genes_biodomains.py` transform for an example of how this is called. So in my head your grouping would be `['model', 'type', 'ageDeath', 'tissue', 'units']`, and you would want it to nest `['genotype', 'measurement', 'sex']` into a new column called `points`. However, we've only used/tested `nest_fields` with one column as the grouping, never with multiple columns like is needed here, so I'm not sure this will work here. It might be simpler to leave the code as you've written it for now, take another look at `nest_fields` at some point to make it more generic to any number of columns, and edit later, but maybe @JessterB can weigh in.
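For illustration, a multi-column grouping that nests the remaining fields into a `points` column might look like this sketch (this is not the actual `nest_fields` implementation, just a plain pandas approximation of the grouping described above):

```python
import pandas as pd


def nest_points(df: pd.DataFrame) -> pd.DataFrame:
    # Group on the outer fields and collapse the remaining fields
    # into a list-of-dicts 'points' column, one list per group.
    group_cols = ["model", "type", "ageDeath", "tissue", "units"]
    nested_cols = ["genotype", "measurement", "sex"]
    return (
        df.groupby(group_cols)
        .apply(lambda group: group[nested_cols].to_dict(orient="records"))
        .reset_index(name="points")
    )
```

Because `groupby` accepts a list of columns, extending this pattern to multiple grouping columns is mostly a matter of parameterizing `group_cols`, which is roughly what generalizing `nest_fields` would entail.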
My 2c: I forgot about `nest_fields`... ideally we would use `nest_fields` for this, but if that will add a significant amount of work then we shouldn't tackle it as part of this PR. We can add a ticket to the backlog for extending `nest_fields` and move forward with the existing implementation.
I did consider this approach but felt it was extra computational work that could be simplified by keeping the data as a list. However, I do agree that in terms of codebase maintenance it would be best for us to use existing approaches, because I did have to make a few changes to accommodate the new type of transform output. I can definitely update the transform to output a `pd.DataFrame` instead of a `list`. Thoughts?
…_biomarkers() function.
…st to a json in list_to_json()
Just dropping a comment to say that I'm watching this PR and will do a review once it is marked as ready (unless told otherwise)!
WIP: PyTest Failed - DO NOT REVIEW
Creates a new transform for the biomarkers dataset. The transform will restructure the data as described in this Jira ticket.
This is my first PR in this repo. Please review carefully and be as brutally honest as is necessary. It's better for me to learn things now than for us to have to go back and fix or add things later because nobody wanted to tell me I was doing something suboptimally.
Expected Changes

Unexpected Changes

- The `transform_biomarkers()` function outputs the transform as a `list` instead of `dict` or `pd.DataFrame` as is expected.
- New `list_to_json()` function in `src/agoradatatools/etl/load.py` to accommodate the new output type.
- New `elif isinstance(df, list):` and `elif isinstance(df, DataFrame):` branches in the `process_dataset()` function in `src/agoradatatools/process.py`.
- New `else` to catch errors if any of the functions output anything other than a `list`, `dict`, or `pd.DataFrame`.

@BWMac what do you think about the Unexpected Changes? Would it be better for the `transform_biomarkers()` function to output a `dict` or `pd.DataFrame` and prevent any of these extra changes? All feedback is welcome.