
Biomarkers transform for ModelAD #148

Draft
wants to merge 24 commits into base: dev
Conversation

beatrizsaldana (Member) commented Sep 25, 2024

WIP: PyTest Failed - DO NOT REVIEW

Creates a new transform for the biomarkers dataset. The transform restructures the data as described in this Jira ticket.

This is my first PR in this repo. Please review carefully and be as brutally honest as necessary. It's better for me to learn things now than for us to have to go back and fix or add things later because nobody wanted to tell me I was doing something suboptimally.

Expected Changes

  1. Added modelad_test_config.yaml
  2. Added a biomarkers transform function
  3. Added test cases and test data

Unexpected Changes

  1. The transform_biomarkers() function outputs the transformed data as a list instead of the expected dict or pd.DataFrame.
  2. I added a list_to_json() function in src/agoradatatools/etl/load.py to accommodate the new output type.
  3. I added elif isinstance(df, list): and elif isinstance(df, DataFrame): branches to the process_dataset() function in src/agoradatatools/process.py.
  4. I added an else to catch errors if any of the functions output anything other than a list, dict, or pd.DataFrame.
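As a rough sketch, the branching described in items 1-4 might look like the following (`dispatch_output` and the serializer names are illustrative stand-ins, not the repo's actual `process_dataset()` code):

```python
from typing import Union

import pandas as pd


def dispatch_output(df: Union[pd.DataFrame, dict, list]) -> str:
    """Route a transform's output to a serializer based on its type."""
    # Mirrors the isinstance() branching described above, with an else
    # that rejects any unsupported output type.
    if isinstance(df, dict):
        return "dict_to_json"
    elif isinstance(df, list):
        return "list_to_json"
    elif isinstance(df, pd.DataFrame):
        return "df_to_json"
    else:
        raise TypeError(f"Unsupported transform output type: {type(df).__name__}")
```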

@BWMac what do you think about the Unexpected Changes? Would it be better for the transform_biomarkers() function to output a dict or pd.DataFrame and prevent any of these extra changes? All feedback is welcome.

staging_path=staging_path,
filename=dataset_name + "." + dataset_obj[dataset_name]["final_format"],
)
elif isinstance(df, DataFrame):
Member Author:

@BWMac here are the changes mentioned in the PR description. What do you think about them?

else:
raise ADTDataProcessingError(
f"Data processing failed for {dataset_name}. Dataframe type is not supported."
)
Member Author:

What do we think about this error handling? Is it necessary? Since we control all of the outputs, maybe we don't need it and can keep using the else: for the case where df is a pd.DataFrame.

Contributor:

I'll let Brad or others speak to the necessity of this; my comment is on the message itself.

If I saw this message "in a vacuum", I would ask myself "What WAS the Dataframe type when the exception was raised?"

Could that information be added to the exception?

Member Author:

Really great point! I updated the error message, let me know if you think I should make any more changes here.

@beatrizsaldana beatrizsaldana self-assigned this Sep 25, 2024
@beatrizsaldana beatrizsaldana marked this pull request as draft September 26, 2024 00:01
@beatrizsaldana beatrizsaldana added the enhancement New feature or request label Sep 26, 2024
import pandas as pd


def transform_biomarkers(datasets: dict) -> list:
Contributor:

Nit: is there more specific typing we can add for this, aside from a generic dict or list?

Also - The docstring:

    Returns:
        dict: a dictionary of biomarkers data modeled after intended final JSON structure

This does not match the function's return type of list.

An example of how to add more specific typing (Might not be correct based on my comment above):

from typing import Dict, List

def transform_biomarkers(datasets: Dict[str, str]) -> List[str]:

Member Author:

Oops! Thank you for catching this! I updated the typing hints and the docstring for the function :)

Contributor:

@BryanFauble @beatrizsaldana If I recall from previous experience, this isn't possible to do at this time. Adding type hints like list[str] isn't supported by Python 3.8 so our CI fails on it. Unfortunately we have to leave it generic.

Member Author:

Ohh, I have been using Python 3.9. I'll revert the typing hints back to generic dict, list, etc. Thank you! :)

Contributor:

You're fine to keep using 3.9; I think it's what gets installed by default when installing agora data tools. But our CI runs the test suite on 3(?) different Python versions, including 3.8, so sometimes we run across stuff that fails in 3.8 but not 3.9. I've run into this type hinting issue before haha.

Member Author:

Thank you! You probably saved me a lot of future failed debugging attempts :)

Member:

By the way, Python 3.8 EOL is coming up, so we may want to update

Contributor:

@jaclynbeck-sage @beatrizsaldana
It is supported in Python 3.8, just not in the way you were suggesting ("Adding type hints like list[str] isn't supported by Python 3.8 so our CI fails on it").

The following does work in Python 3.8:

from typing import Dict, List

def transform_biomarkers(datasets: Dict[str, str]) -> List[str]:

Contributor:

Ohh interesting. Learned something new! Thanks Bryan.

Member Author:

I'll use the typing library since I do really like to be specific with type hinting.

Comment on lines 177 to 180
temp_json = open(os.path.join(staging_path, filename), "w+")
json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
temp_json.close()
return temp_json.name
Contributor:

Generally, a context managed open is preferred like:

    with open(os.path.join(staging_path, filename), "w+") as temp_json:
          json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
          return temp_json.name

This is so you don't need to be concerned about calling .close(). Your approach is a valid way of accomplishing this; however, if it's the approach you want to take, the .close() should be within a finally block so it's guaranteed to execute.
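For comparison, the finally-block variant mentioned above might look like this (a sketch only; the repo's NumpyEncoder is omitted here since it isn't defined in this snippet):

```python
import json
import os


def write_json(df, staging_path: str, filename: str) -> str:
    # Explicit open/close: the finally block guarantees the handle is
    # released even if json.dump raises. The `with` form shown above is
    # the more idiomatic way to get the same guarantee.
    temp_json = open(os.path.join(staging_path, filename), "w+")
    try:
        json.dump(df, temp_json, indent=2)
        return temp_json.name
    finally:
        temp_json.close()
```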

Member Author (@beatrizsaldana, Sep 26, 2024):

Hmm, I do like this approach better than what I was doing; I was trying to copy what the other functions are doing. Feedback, please: should I...

  1. Update just this one function with the preferred context managed open
  2. Update all of the X to json functions with the preferred context managed open
  3. Leave things as they are and create a Jira ticket for updating the functions to use the preferred context managed open

Thoughts? @BryanFauble

Contributor:

I would:

  1. Update any of the code you are already touching to follow this approach
  2. Log a tech debt ticket to go back and look at the other areas of the code

Generally, the mantra I follow is: "Leave the code in a better place than when I started". That needs to be balanced with the scope of the change, the time you have to make the changes, and the time it's going to take to validate the change. Some minor things are probably not worth fixing if it means there is a significant effort required to test the change.

Contributor:

I agree, update your own code and make a ticket for anything else you notice. I'm not sure who to assign the issue to so it doesn't get lost in the ether, maybe Jess?

Member Author:

Thank you both for the feedback! I'll update the function I wrote, create a tech debt Jira ticket and assign it to Jess :)

Member Author:

@JessterB I need help figuring out where to create this Jira ticket 🙃


# Check that the dataset looks like what we expect
if not isinstance(biomarkers_dataset, pd.DataFrame):
raise ValueError("Biomarker dataset is not a pandas DataFrame")
Contributor:

Should we add what the biomarkers_dataset is to the exception message?

Member Author:

Yes we should! Thank you :)
I also changed the exception to TypeError because that is what it is.

pass_test_data = [
( # Pass with good real data
"biomarkers_good_input.csv",
"biomarkers_good_output.json",
Contributor:

I'm not sure I see a need to test both real data and fake data if they're both good input. Usually for my tests I just subset to a small number of rows from the real data as my test input, and then tweak a few things from there if I need to check what happens with missing values or duplicates.

"Pass with duplicated data",
]
fail_test_data = [
# No failure cases for this transform
Contributor:

In your transform you have 2 error conditions: A TypeError and a ValueError. Can you make some test data that will force each of these to happen in a test? test_proteomics_distribution_data.py has an example of failure condition code that checks for different error types depending on input.
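Such failure cases might look roughly like the following (the stub transform and its inputs are illustrative, reproducing the two error conditions discussed in this PR rather than the repo's actual code):

```python
import pandas as pd
import pytest


def transform_biomarkers_stub(datasets: dict) -> list:
    # Stand-in that reproduces the transform's two error conditions.
    df = datasets["biomarkers"]
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Expected a DataFrame, got {type(df).__name__}")
    if "model" not in df.columns:
        raise ValueError("Biomarker dataset does not contain expected columns")
    return df.to_dict("records")


@pytest.mark.parametrize(
    "bad_input, expected_error",
    [
        ({"biomarkers": "not a dataframe"}, TypeError),
        ({"biomarkers": pd.DataFrame({"wrong": [1]})}, ValueError),
    ],
)
def test_transform_biomarkers_errors(bad_input, expected_error):
    # pytest.raises fails the test unless the expected error type is raised.
    with pytest.raises(expected_error):
        transform_biomarkers_stub(bad_input)
```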

Contributor:

Actually now that I look at it, I'm not sure you can trigger the TypeError condition, since the input has to be a data frame by the time it gets to your transform. If it wasn't a DF, that means one of the earlier functions (load or standardize) would have failed first. In which case, it might be worth just removing that error check. test_genes_biodomains.py has a simpler, one-error-type failure case you can look at in that case.

biomarkers_dataset = datasets["biomarkers"]

# Check that the dataset looks like what we expect
if not isinstance(biomarkers_dataset, pd.DataFrame):
Contributor:

See my comment on your test function, this error check probably isn't necessary.

Member Author (@beatrizsaldana, Sep 26, 2024):

Yes, I was thinking about this earlier. The type hints should catch this :)

I'll remove it. Thank you for the validation!

].sort()
):
raise ValueError(
f"Biomarker dataset does not contain expected columns. Columns found: {list(biomarkers_dataset.columns)}"
Contributor:

It might be worth changing this check from == to checking that biomarkers contains those columns, so that the data set has to have those columns in it but can have extra columns we don't care about.

Member Author:

So true! I was trying to be strict with the error handling, but if there is a possibility of extra columns that we can just ignore, then I'll use isin or something like that instead of ==. Thank you!
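A containment check along those lines can be done with set operations (the column names below are taken from this discussion and may not match the final transform exactly):

```python
import pandas as pd

EXPECTED_COLUMNS = {
    "model", "type", "ageDeath", "tissue", "units",
    "genotype", "measurement", "sex",
}


def check_expected_columns(df: pd.DataFrame) -> None:
    # Require the expected columns to be present while tolerating extras,
    # and name the missing columns in the error message.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(
            f"Biomarker dataset is missing expected columns: {sorted(missing)}. "
            f"Columns found: {list(df.columns)}"
        )
```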

datasets (dict[str, pd.DataFrame]): dictionary of dataset names mapped to their DataFrame

Returns:
list[dict[str, Any]]: a list of dictionaries containing biomarker data modeled after intended final JSON structure
Contributor:

We do have some code in place to do basically what your transform does, in the form of a data frame instead of a list, where each row is a grouping and it contains a column with 'nested' data (it's the function nest_fields in utils.py). See the genes_biodomains.py transform for an example of how this is called.

In my head your grouping would be ['model', 'type', 'ageDeath', 'tissue', 'units'], and you would want it to nest ['genotype', 'measurement', 'sex'] into a new column called points. However, we've only used/tested nest_fields with one column as the grouping, never with multiple columns like is needed here, so I'm not sure this will work here. It might be simpler to leave the code as you've written it for now, take another look at nest_fields at some point to make it more generic to any number of columns, and edit later, but maybe @JessterB can weigh in.
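The multi-column grouping described here might be sketched with a plain pandas groupby (illustrative only; this is not the repo's nest_fields implementation, and the column names come from the comment above):

```python
import pandas as pd


def nest_points(df: pd.DataFrame) -> pd.DataFrame:
    # Group on the outer fields and nest the remaining fields into a
    # "points" column holding a list of records per group.
    group_cols = ["model", "type", "ageDeath", "tissue", "units"]
    point_cols = ["genotype", "measurement", "sex"]
    nested = (
        df.groupby(group_cols)[point_cols]
        .apply(lambda g: g.to_dict("records"))
        .reset_index(name="points")
    )
    return nested
```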

Contributor:

My 2c: I forgot about nest_fields... ideally we would use nest_fields for this, but if that will add a significant amount of work then we shouldn't tackle it as part of this PR. We can add a ticket to the backlog for extending nest_fields and move forward with the existing implementation.

Member Author:

I did consider this approach but felt it was extra computational work that could be avoided by keeping the data as a list. However, I do agree that, in terms of codebase maintenance, it would be best for us to use existing approaches, because I did have to make a few changes to accommodate the new type of transform output. I can definitely update the transform to output a pd.DataFrame instead of a list. Thoughts?

BWMac (Contributor) commented Sep 27, 2024:

Just dropping a comment to say that I'm watching this PR and will do a review once it is marked as ready (unless told otherwise)!


sonarcloud bot commented Sep 27, 2024

Labels
enhancement New feature or request

6 participants