Feat (metadata-manager): Update model #535

Merged on Sep 20, 2024 (14 commits)
66 changes: 38 additions & 28 deletions lib/workload/stateless/stacks/metadata-manager/README.md
@@ -32,24 +32,32 @@ An example of how to use a curl command to access the production API:
curl -s -H "Authorization: Bearer $ORCABUS_TOKEN" "https://metadata.umccr.org/api/v1/library" | jq
```

Filtering of results is also supported by the API. For example, to filter by `internal_id`, append the query parameter
to the URL: `.../library?library_id=LIB001`
Filtering of results is also supported by the API. For example, to filter by `libraryId`, append the query parameter
to the URL: `.../library?libraryId=LIB001`
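
As a minimal sketch of the filtering described above, the query string can be built programmatically. The base URL comes from the curl example; `build_filter_url` is a hypothetical helper for illustration and is not part of the project.

```python
from urllib.parse import urlencode

# Base URL taken from the curl example in this README.
BASE_URL = "https://metadata.umccr.org/api/v1/library"


def build_filter_url(base_url: str, **filters: str) -> str:
    """Append filters as query parameters (hypothetical helper, not project code)."""
    if not filters:
        return base_url
    return f"{base_url}?{urlencode(filters)}"


# Filter libraries by `libraryId`, as in the README example.
url = build_filter_url(BASE_URL, libraryId="LIB001")
print(url)
```

The resulting URL can be passed to the same `curl` command, together with the `Authorization` header.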

## Schema

This is the current (WIP) schema that reflects the current implementation.
This is the current (WIP) schema that reflects the current implementation. The schema is based on the
draft [draw.io in Google Drive](https://app.diagrams.net/#G10ryWSXORMo7Qj7ghvj37LHYqmMm4hXW-#%7B%22pageId%22%3A%22vfe626awnvWGlhOGvxTV%22%7D).

![schema](docs/schema.drawio.svg)

To modify the diagram, open the `docs/schema.drawio.svg` with [diagrams.net](https://app.diagrams.net/?src=about).

`orcabus_id` is the unique identifier for each record in the database. It is generated by the application where the
first 3 characters are the model prefix followed by [ULID](https://pypi.org/project/ulid-py/) separated by a dot (.).
The prefix is as follows:
The `orcabus_id` serves as the unique identifier for each record in the database. It is generated by the application
using the [ULID](https://pypi.org/project/ulid-py/) library. When a record is accessed via the API, the `orcabus_id`
is presented with a prefix consisting of three characters followed by a dot (.). The specific prefix varies depending
on the model of the record.

- Library model are `lib`
- Specimen model are `spc`
- Subject model are `sbj`
| Model | Prefix |
|------------|--------|
| Subject | `sbj.` |
| Sample | `smp.` |
| Library | `lib.` |
| Individual | `idv.` |
| Contact | `ctc.` |
| Project | `prj.` |
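
The prefix convention in the table can be sketched as follows. The helper names (`add_prefix`, `strip_prefix`) and the sample ULID value are assumptions made for illustration; they are not taken from the application code.

```python
# Prefixes copied from the table above (without the trailing dot).
ORCABUS_PREFIXES = {
    "Subject": "sbj",
    "Sample": "smp",
    "Library": "lib",
    "Individual": "idv",
    "Contact": "ctc",
    "Project": "prj",
}


def add_prefix(model: str, ulid: str) -> str:
    """Return the API-facing id: three-character prefix, a dot, then the ULID."""
    return f"{ORCABUS_PREFIXES[model]}.{ulid}"


def strip_prefix(orcabus_id: str) -> str:
    """Remove a known model prefix if present; otherwise return the id unchanged."""
    prefix, sep, rest = orcabus_id.partition(".")
    if sep and prefix in ORCABUS_PREFIXES.values():
        return rest
    return orcabus_id


# Hypothetical ULID value, used only for illustration.
example_ulid = "01J5M2J44HFJ9424G7074NKTGN"
prefixed = add_prefix("Library", example_ulid)
print(prefixed)
print(strip_prefix(prefixed))
```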

## How things work

@@ -59,36 +67,38 @@ In the near future, we might introduce different ways to load data into the appl
loading data
from the Google tracking sheet and mapping it to its respective model as follows.

| Sheet Header | Table | Field Name |
|--------------|------------|---------------|
| SubjectID | `Subject` | subject_id |
| SampleID | `Specimen` | sample_id |
| Source | `Specimen` | source |
| LibraryID | `Library` | library_id |
| Phenotype | `Library` | phenotype |
| Workflow | `Library` | workflow |
| Quality | `Library` | quality |
| Type | `Library` | type |
| Coverage (X) | `Library` | coverage |
| Assay | `Library` | assay |
| ProjectOwner | `Library` | project_owner |
| ProjectName | `Library` | project_name |
| Sheet Header | Table | Field Name |
|-------------------|--------------|--------------------|
| SubjectID | `Individual` | individual_id |
| ExternalSubjectID | `Subject` | subject_id |
| SampleID | `Sample` | sample_id |
| ExternalSampleID | `Sample` | external_sample_id |
| Source | `Sample` | source |
| LibraryID | `Library` | library_id |
| Phenotype | `Library` | phenotype |
| Workflow | `Library` | workflow |
| Quality | `Library` | quality |
| Type | `Library` | type |
| Coverage (X) | `Library` | coverage |
| Assay | `Library` | assay |
| ProjectName | `Project` | project_id |
| ProjectOwner | `Contact` | contact_id |

Some important notes about the sync:

1. The sync will only run from the current year.
2. The tracking sheet is the single source of truth for the current year. Any deletion or update to existing records
will be applied based on their internal IDs (`library_id`, `specimen_id`, and `subject_id`). For the library
will be applied based on their internal IDs (e.g. `library_id`, `subject_id`). For the library
model, the deletion will only occur based on the current year's prefix. For example, syncing the 2024 tracking
sheet will only query libraries with `library_id` starting with `L24` to determine whether to delete it.
3. `LibraryId` is treated as a unique value in the tracking sheet, so for any duplicated value (including from other
tabs) it will only recognize the last appearance.
sheet will only query libraries with `library_id` starting with `L24` to determine whether to delete it.
3. `LibraryId` is treated as a unique value in the tracking sheet, so for any duplicated value only the last
appearance is recognized.
4. In cases where multiple records share the same unique identifier (such as SampleId), only the data from the most
recent record is stored. For instance, if a SampleId appears twice with differing source values, only the values from
the latter record will be retained.
5. The sync runs nightly. See `./deploy/README.md` for more info.

Please refer to the [traking-sheet-service](proc/service/tracking_sheet_srv.py) implementation.
Please refer to the [tracking-sheet-service](proc/service/tracking_sheet_srv.py) implementation.
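
The sync notes above can be illustrated with a small, standard-library-only sketch. It is not the code in `tracking_sheet_srv.py`; the function names and sample records are hypothetical and only mirror the rules described: a year-based `library_id` prefix for deletion, and "last appearance wins" for duplicated identifiers.

```python
def library_prefix_for_year(year: int) -> str:
    """Return the library_id prefix for a sheet year, e.g. 2024 gives 'L24'."""
    return f"L{year % 100:02d}"


def keep_last_by_key(records: list[dict], key: str) -> list[dict]:
    """For records sharing the same key value, keep only the last appearance."""
    latest: dict[str, dict] = {}
    for record in records:
        latest[record[key]] = record  # later rows overwrite earlier ones
    return list(latest.values())


def find_stale_library_ids(existing_ids: list[str], sheet_ids: list[str], year: int) -> list[str]:
    """Return stored ids with the year's prefix that no longer appear in the sheet."""
    prefix = library_prefix_for_year(year)
    in_sheet = set(sheet_ids)
    return [i for i in existing_ids if i.startswith(prefix) and i not in in_sheet]


# Hypothetical rows: the same sample_id appears twice with different sources.
rows = [
    {"sample_id": "PRJ240001", "source": "blood"},
    {"sample_id": "PRJ240001", "source": "tissue"},
]
deduplicated = keep_last_by_key(rows, "sample_id")

# Only ids with the 2024 prefix are considered for deletion.
stale = find_stale_library_ids(
    existing_ids=["L2400001", "L2400002", "L2300001"],
    sheet_ids=["L2400001"],
    year=2024,
)
print(deduplicated, stale)
```

Restricting the deletion query to the year's prefix keeps records from other years untouched when a single year's sheet is synced.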

### Audit Data

@@ -0,0 +1,3 @@
import os
os.environ['SSM_NAME_GDRIVE_ACCOUNT'] = "/umccr/google/drive/lims_service_account_json"
os.environ["SSM_NAME_TRACKING_SHEET_ID"] = "/umccr/google/drive/tracking_sheet_id"
@@ -3,7 +3,7 @@
from django.core.management import BaseCommand

from proc.service.tracking_sheet_srv import sanitize_lab_metadata_df, persist_lab_metadata
from proc.tests.test_tracking_sheet_srv import RECORD_1, RECORD_2, RECORD_3
from proc.tests.test_tracking_sheet_srv import RECORD_1, RECORD_2, RECORD_3, SHEET_YEAR


class Command(BaseCommand):
@@ -16,7 +16,7 @@ def handle(self, *args, **options):

metadata_pd = pd.json_normalize(mock_sheet_data)
metadata_pd = sanitize_lab_metadata_df(metadata_pd)
result = persist_lab_metadata(metadata_pd)
result = persist_lab_metadata(metadata_pd, SHEET_YEAR)

print(json.dumps(result, indent=4))
print("insert mock data completed")