Add Dagster job to create rollup JSON/collection #643

sujaypatil96 · 2024-08-16T19:28:48Z

Description

Create a Dagster job that builds a Biosample-keyed rollup JSON, which can be imported into a Mongo collection. The idea is similar to the /data_objects/study/{study_id} endpoint: given a study_id, the Dagster job should produce parseable JSON (or create a Mongo collection) that materializes all NMDC objects associated with each Biosample, not just the Data Objects.

Fixes #642

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

Definition of Done (DoD) Checklist:

A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.

My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

sujaypatil96 · 2024-10-30T16:37:41Z

This is a first pass at what the JSON should look like. We can make improvements as necessary.

aclum

This PR introduces several new functions without any corresponding test coverage. I'm also unclear on whether we still need to explicitly specify the relationship slots, or whether refscan is at a point where we can use functions from that code base to derive them.

pkalita-lbl

I think I'm a little confused about the broader purpose of this Dagster job. Who would run it? When? And for what purpose?

I could imagine a Dagster job that pre-computes a collection which maps Biosample IDs to other associated IDs and then having an API endpoint which utilizes that new collection. But this Dagster job that only does it for one Biosample has me scratching my head a little.

sujaypatil96 · 2024-11-01T08:08:40Z

All good questions above @pkalita-lbl!

The overall direction (we're working on this in the referential integrity and rollup squad) is to have a rollup collection with the Biosample rollup for all studies materialized in it, created by a Dagster job.

> But this Dagster job that only does it for one Biosample has me scratching my head a little.

Hm, the Dagster job in this PR currently does it for a study. It produces JSON containing a list of documents, one per Biosample in the study, each associating that Biosample with the IDs that are "related" to it.
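To make the output shape concrete, here is a minimal sketch in plain Python, with no Dagster or Mongo dependency. It is not the code in this PR. It assumes Biosamples list their studies in `associated_studies` and that processes link records via `has_input` and `has_output`; the function name, the output keys (`study_id`, `biosample_id`, `related_ids`), and all IDs are hypothetical.

```python
from collections import defaultdict, deque


def build_study_rollup(study_id, biosamples, processes):
    """Return one rollup document per Biosample in the given study."""
    # Index processes by input ID so the provenance chain can be walked forward.
    by_input = defaultdict(list)
    for proc in processes:
        for input_id in proc.get("has_input", []):
            by_input[input_id].append(proc)

    rollup = []
    for sample in biosamples:
        if study_id not in sample.get("associated_studies", []):
            continue
        related, seen = [], {sample["id"]}
        queue = deque([sample["id"]])
        # Breadth-first walk: each process that consumes the current ID,
        # and each of that process's outputs, counts as "related".
        while queue:
            current = queue.popleft()
            for proc in by_input.get(current, []):
                for next_id in [proc["id"], *proc.get("has_output", [])]:
                    if next_id not in seen:
                        seen.add(next_id)
                        related.append(next_id)
                        queue.append(next_id)
        rollup.append(
            {"study_id": study_id, "biosample_id": sample["id"], "related_ids": related}
        )
    return rollup


# Made-up records for illustration only.
biosamples = [
    {"id": "nmdc:bsm-00-000001", "associated_studies": ["nmdc:sty-00-000001"]},
    {"id": "nmdc:bsm-00-000002", "associated_studies": ["nmdc:sty-00-000002"]},
]
processes = [
    {
        "id": "nmdc:extrp-00-000001",
        "has_input": ["nmdc:bsm-00-000001"],
        "has_output": ["nmdc:procsm-00-000001"],
    },
    {
        "id": "nmdc:omprc-00-000001",
        "has_input": ["nmdc:procsm-00-000001"],
        "has_output": ["nmdc:dobj-00-000001"],
    },
]
rollup = build_study_rollup("nmdc:sty-00-000001", biosamples, processes)
```

In this sketch the second Biosample is skipped because it belongs to a different study, and the first one picks up both processes plus their outputs.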

pkalita-lbl · 2024-11-01T16:33:59Z

Sorry that's my mistake, I meant to say one Study, not one Biosample.

So then is the idea that this job will eventually be expanded to compute the associated IDs for all studies? Is this just a first step towards that?

aclum · 2024-11-01T23:37:16Z

The use case (I think) is that we have studies with some but not all of the data, so the ETL script needs to figure out which samples need 1) biosamples or 2) data generation records. For example, for NEON we had to delete a subset of the malformed data object records and their corresponding nucleotide sequencing records, and now we need to bring in the fixed records.
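As a rough illustration of the second check, an ETL script could scan rollup documents for Biosamples that have no data generation record among their related IDs. This is a sketch under assumptions, not existing code: the document keys `biosample_id` and `related_ids`, the function name, and the IDs are all hypothetical.

```python
def biosamples_missing_data_generation(rollup, data_generation_ids):
    """Return Biosample IDs whose related IDs include no known data generation record."""
    known = set(data_generation_ids)
    return [
        doc["biosample_id"]
        for doc in rollup
        if known.isdisjoint(doc["related_ids"])
    ]


# Made-up rollup documents: the second Biosample lost its downstream records.
rollup_docs = [
    {
        "biosample_id": "nmdc:bsm-00-000001",
        "related_ids": ["nmdc:omprc-00-000001", "nmdc:dobj-00-000001"],
    },
    {"biosample_id": "nmdc:bsm-00-000002", "related_ids": []},
]
missing = biosamples_missing_data_generation(rollup_docs, ["nmdc:omprc-00-000001"])
```

Here `missing` would hold only the second Biosample, which is the one the ETL script would need to regenerate records for.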

sujaypatil96 · 2024-11-04T18:30:54Z

> So then is the idea that this job will eventually be expanded to compute the associated IDs for all studies? Is this just a first step towards that?

Correct, that's exactly right! Apologies, I should have mentioned that in the PR description for context. But yes, the idea is that there will be a collection in Mongo which will have associated IDs for all biosamples from all studies, and yup, this is a first step towards that.

Ideally there will be an endpoint which accepts an NMDC study id as input, and also interacts with that "biosample rollup collection" to retrieve a rollup JSON for that study alone.

> The use case (I think) is that we have studies with some but not all of the data, so the ETL script needs to figure out which samples need 1) biosamples or 2) data generation records. For example, for NEON we had to delete a subset of the malformed data object records and their corresponding nucleotide sequencing records, and now we need to bring in the fixed records.

Yup, that's an immediate requirement for which this will be useful. But eventually, once we have the materialized rollup collection in Mongo and the endpoint on top of it, we should be able to plug in a study id (say, one of the NEON study ids) and get a JSON that lists each biosample with its associated ids.
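The lookup behind such an endpoint could be sketched roughly as follows. A plain list stands in for the Mongo collection here, and the `study_id` field on each rollup document, the function name, and the IDs are assumptions for illustration; against a real collection the list comprehension would presumably become a query such as `collection.find({"study_id": study_id})`.

```python
import json


def get_study_rollup(rollup_collection, study_id):
    """Return the rollup documents that belong to the given study."""
    return [doc for doc in rollup_collection if doc.get("study_id") == study_id]


# Made-up documents standing in for the "biosample rollup collection".
rollup_collection = [
    {
        "study_id": "nmdc:sty-00-000001",
        "biosample_id": "nmdc:bsm-00-000001",
        "related_ids": ["nmdc:omprc-00-000001"],
    },
    {
        "study_id": "nmdc:sty-00-000002",
        "biosample_id": "nmdc:bsm-00-000002",
        "related_ids": [],
    },
]
study_rollup = get_study_rollup(rollup_collection, "nmdc:sty-00-000001")
# Serialize the result the way an endpoint would return it.
study_rollup_json = json.dumps(study_rollup, indent=2)
```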

sujaypatil96 and others added 4 commits August 16, 2024 12:25

add Dagster job to create rollup JSON - partially complete

6e8233c

Merge branch 'main' into issue-642-dagster-job-for-rollup

42ad45b

Dagster job that creates biosample rollup JSON

5dab308

style: reformat

a06a2dd

sujaypatil96 marked this pull request as ready for review October 30, 2024 16:36

PeopleMakeCulture requested review from PeopleMakeCulture, dwinston, pkalita-lbl and eecavanna October 30, 2024 17:13

add back necessary rollup helper to util.py

15d4458

sujaypatil96 requested review from aclum and removed request for pkalita-lbl October 31, 2024 15:07

aclum reviewed Oct 31, 2024

pkalita-lbl reviewed Oct 31, 2024
