Feat (metadata-manager): Update model (#535)

umccr · Sep 20, 2024 · 0af058d
1 parent c387d2e
Showing 48 changed files with 1,205 additions and 725 deletions.
diff --git a/lib/workload/stateless/stacks/metadata-manager/README.md b/lib/workload/stateless/stacks/metadata-manager/README.md
@@ -32,24 +32,32 @@ An example of how to use a curl command to access the production API:
 curl -s -H "Authorization: Bearer $ORCABUS_TOKEN" "https://metadata.umccr.org/api/v1/library" | jq
 ```
 
-Filtering of results is also supported by the API. For example, to filter by `internal_id`, append the query parameter
-to the URL: `.../library?library_id=LIB001`
+Filtering of results is also supported by the API. For example, to filter by `libraryId`, append the query parameter
+to the URL: `.../library?libraryId=LIB001`
 
 ## Schema
 
-This is the current (WIP) schema that reflects the current implementation.
+This is the current (WIP) schema that reflects the current implementation. The schema is based on the
+draft [draw.io in Google Drive](https://app.diagrams.net/#G10ryWSXORMo7Qj7ghvj37LHYqmMm4hXW-#%7B%22pageId%22%3A%22vfe626awnvWGlhOGvxTV%22%7D)
+.
 
 ![schema](docs/schema.drawio.svg)
 
 To modify the diagram, open the `docs/schema.drawio.svg` with [diagrams.net](https://app.diagrams.net/?src=about).
 
-`orcabus_id` is the unique identifier for each record in the database. It is generated by the application where the
-first 3 characters are the model prefix followed by [ULID](https://pypi.org/project/ulid-py/) separated by a dot (.).
-The prefix is as follows:
+The `orcabus_id` serves as the unique identifier for each record in the database. It is generated by the application
+using the [ULID](https://pypi.org/project/ulid-py/) library. When a record is accessed via the API, the `orcabus_id`
+is presented with a prefix consisting of three characters followed by a dot (.). The specific prefix varies depending
+on the model of the record.
 
-- Library model are `lib`
-- Specimen model are `spc`
-- Subject model are `sbj`
+| Model      | Prefix |
+|------------|--------|
+| Subject    | `sbj.` |
+| Sample     | `smp.` |
+| Library    | `lib.` |
+| Individual | `idv.` |
+| Contact    | `ctc.` |
+| Project    | `prj.` |
 
 ## How things work
 
@@ -59,36 +67,38 @@ In the near future, we might introduce different ways to load data into the appl
 loading data
 from the Google tracking sheet and mapping it to its respective model as follows.
 
-| Sheet Header | Table      | Field Name    |
-|--------------|------------|---------------|
-| SubjectID    | `Subject`  | subject_id    |
-| SampleID     | `Specimen` | sample_id     |
-| Source       | `Specimen` | source        |
-| LibraryID    | `Library`  | library_id    |
-| Phenotype    | `Library`  | phenotype     |
-| Workflow     | `Library`  | workflow      |
-| Quality      | `Library`  | quality       |
-| Type         | `Library`  | type          |
-| Coverage (X) | `Library`  | coverage      |
-| Assay        | `Library`  | assay         |
-| ProjectOwner | `Library`  | project_owner |
-| ProjectName  | `Library`  | project_name  |
+| Sheet Header      | Table        | Field Name         |
+|-------------------|--------------|--------------------|
+| SubjectID         | `Individual` | individual_id      |
+| ExternalSubjectID | `Subject`    | subject_id         |
+| SampleID          | `Sample`     | sample_id          |
+| ExternalSampleID  | `Sample`     | external_sample_id |
+| Source            | `Sample`     | source             |
+| LibraryID         | `Library`    | library_id         |
+| Phenotype         | `Library`    | phenotype          |
+| Workflow          | `Library`    | workflow           |
+| Quality           | `Library`    | quality            |
+| Type              | `Library`    | type               |
+| Coverage (X)      | `Library`    | coverage           |
+| Assay             | `Library`    | assay              |
+| ProjectName       | `Project`    | project_id         |
+| ProjectOwner      | `Contact`    | contact_id         |
 
 Some important notes of the sync:
 
 1. The sync will only run from the current year.
 2. The tracking sheet is the single source of truth for the current year. Any deletion or update to existing records
-   will be applied based on their internal IDs (`library_id`, `specimen_id`, and `subject_id`). For the library
+   will be applied based on their internal IDs (e.g. `library_id`, `subject_id`). For the library
    model, the deletion will only occur based on the current year's prefix. For example, syncing the 2024 tracking
-   sheet will only query libraries with `library_id` starting with `L24` to determine whether to delete it.
-3. `LibraryId` is treated as a unique value in the tracking sheet, so for any duplicated value (including from other
-   tabs) it will only recognize the last appearance.
+   sheet will only query libraries with `library_id` starting with `L24` to determine whether to delete it.
+3. `LibraryId` is treated as a unique value in the tracking sheet, so for any duplicated value the sync will only
+   recognize the last appearance.
 4. In cases where multiple records share the same unique identifier (such as SampleId), only the data from the most
    recent record is stored. For instance, if a SampleId appears twice with differing source values, only the values from
    the latter record will be retained.
 5. The sync runs periodically, every night. See `./deploy/README.md` for more info.
 
-Please refer to the [traking-sheet-service](proc/service/tracking_sheet_srv.py) implementation.
+Please refer to the [tracking-sheet-service](proc/service/tracking_sheet_srv.py) implementation.
 
 ### Audit Data
 

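Note on the `orcabus_id` paragraph in the README hunk above: the prefix is described as something added when a record is presented via the API. The following is a minimal sketch of that presentation step, not the repository's implementation. The helper names `to_api_id` and `from_api_id` are hypothetical, and the ULID in the usage note is only a placeholder value. The prefix table is copied from the README.

```python
# Prefix table copied from the README; everything else here is illustrative.
MODEL_PREFIX = {
    "Subject": "sbj.",
    "Sample": "smp.",
    "Library": "lib.",
    "Individual": "idv.",
    "Contact": "ctc.",
    "Project": "prj.",
}


def to_api_id(model: str, ulid: str) -> str:
    """Return the ULID with the model's three-character prefix and dot."""
    return MODEL_PREFIX[model] + ulid


def from_api_id(api_id: str) -> tuple[str, str]:
    """Split a prefixed id back into (model name, bare ULID)."""
    prefix, dot, ulid = api_id.partition(".")
    for model, model_prefix in MODEL_PREFIX.items():
        if model_prefix == prefix + dot:
            return model, ulid
    raise ValueError(f"unknown orcabus_id prefix: {api_id!r}")
```

For example, `to_api_id("Library", "01ARZ3NDEKTSV4RRFFQ69G5FAV")` yields `lib.01ARZ3NDEKTSV4RRFFQ69G5FAV`, and `from_api_id` reverses it.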
diff --git a/lib/workload/stateless/stacks/metadata-manager/app/management/commands/__init__.py b/lib/workload/stateless/stacks/metadata-manager/app/management/commands/__init__.py
@@ -0,0 +1,3 @@
+import os
+os.environ["SSM_NAME_GDRIVE_ACCOUNT"] = "/umccr/google/drive/lims_service_account_json"
+os.environ["SSM_NAME_TRACKING_SHEET_ID"] = "/umccr/google/drive/tracking_sheet_id"
diff --git a/lib/workload/stateless/stacks/metadata-manager/app/management/commands/insert_mock_data.py b/lib/workload/stateless/stacks/metadata-manager/app/management/commands/insert_mock_data.py
@@ -3,7 +3,7 @@
 from django.core.management import BaseCommand
 
 from proc.service.tracking_sheet_srv import sanitize_lab_metadata_df, persist_lab_metadata
-from proc.tests.test_tracking_sheet_srv import RECORD_1, RECORD_2, RECORD_3
+from proc.tests.test_tracking_sheet_srv import RECORD_1, RECORD_2, RECORD_3, SHEET_YEAR
 
 
 class Command(BaseCommand):
@@ -16,7 +16,7 @@ def handle(self, *args, **options):
 
         metadata_pd = pd.json_normalize(mock_sheet_data)
         metadata_pd = sanitize_lab_metadata_df(metadata_pd)
-        result = persist_lab_metadata(metadata_pd)
+        result = persist_lab_metadata(metadata_pd, SHEET_YEAR)
 
         print(json.dumps(result, indent=4))
         print("insert mock data completed")
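
Note on the `SHEET_YEAR` argument now passed to `persist_lab_metadata`: it lines up with two README rules, namely that duplicated `LibraryId` rows resolve to the last appearance and that deletions are limited to libraries whose `library_id` carries the sheet year's prefix (for example `L24` for 2024). The sketch below illustrates those two rules in plain Python. It is not the code in `tracking_sheet_srv.py`. The function names, the row shape (dicts with a `library_id` key), and the treatment of the year as a value like `"2024"` are all assumptions.

```python
def year_prefix(sheet_year) -> str:
    """Build the library_id prefix for a sheet year, e.g. 2024 -> 'L24' (assumed format)."""
    return "L" + str(sheet_year)[-2:]


def dedupe_last(rows, key="library_id"):
    """Keep one row per key; a later row overwrites an earlier one."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


def ids_to_delete(existing_ids, sheet_rows, sheet_year):
    """Ids with the year's prefix that are no longer present in the sheet.

    Ids from other years are never candidates for deletion.
    """
    prefix = year_prefix(sheet_year)
    in_sheet = {row["library_id"] for row in sheet_rows}
    return sorted(
        lib_id
        for lib_id in existing_ids
        if lib_id.startswith(prefix) and lib_id not in in_sheet
    )
```

With existing ids `L2300001`, `L2400001`, `L2400002` and a 2024 sheet that only lists `L2400001`, `ids_to_delete` returns just `L2400002`; the 2023 library is left alone because it falls outside the `L24` prefix.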