Merge pull request #24 from Cellular-Semantics/17-document-how-to-cur…
…ate-datasets-for-loading-where-place-curation-files

Document how to curate datasets for loading where place curation files
ubyndr authored Sep 11, 2024
2 parents 3adff10 + 2a2cc92 commit 51431e0
Showing 3 changed files with 145 additions and 1 deletion.
2 changes: 1 addition & 1 deletion anndata2rdf/src/pull_anndata.py
@@ -1,6 +1,6 @@
import logging
import os
-from typing import Dict, List, Optional, Union
+from typing import Dict, List, Optional

import requests
import yaml
143 changes: 143 additions & 0 deletions docs/dataset_curation_guidelines.md
@@ -0,0 +1,143 @@
# Dataset Curation Guidelines for Loading and File Placement

This guide outlines the process for curating datasets and specifies where curation files
should be placed so that they load correctly. Following these guidelines ensures that datasets
are prepared correctly for integration and that the files are stored in the appropriate
directories for access and management.

## Curation Process: Steps for Preparing Datasets

This section describes the necessary steps to format, validate, and prepare datasets for loading
into the system, ensuring they meet the required standards.

### Dataset Curation Guidelines

This section explains how to fill in each column required in a curated CSV. The examples are
based on preparing
[this CellxGene collection](https://cellxgene.cziscience.com/collections/9b02383a-9358-4f0f-9795-a891ec523bcc)
as a curated CSV ready for upload.

---

### Columns and Guidelines

1. **Dataset (individual datasets within larger group):**
- **Description**: The specific name of the dataset being curated within a larger dataset group.
- **Example**: "Single cell transcriptional and chromatin accessibility profiling redefine
cellular heterogeneity in the adult human kidney - ATACseq"

2. **Full name dataset (top of page):**
- **Description**: The full descriptive name of the dataset that should be used for
documentation and display.
- **Example**: "Single cell transcriptional and chromatin accessibility profiling redefine
cellular heterogeneity in the adult human kidney"

3. **CxG Link:**
- **Description**: The CellxGene link to access the dataset.
- **Example**: "https://cellxgene.cziscience.com/e/13a027de-ea3e-432b-9a5e-6bc7048498fc.cxg/"

4. **h5ad link:**
- **Description**: The direct link to the `.h5ad` data file of the dataset.
- **Example**: "https://datasets.cellxgene.cziscience.com/dabd979f-cc50-4526-81f3-8bc6c673ca36.h5ad"

5. **Reference_DOI:**
- **Description**: The DOI reference for the associated publication(s) for the dataset.
- **Example**: "DOI: 10.1038/s41467-021-22368-w"

6. **Bionetworks reference:**
- **Description**: Indicate whether the dataset has a reference within the Bionetworks repository.
- **Example**: "T" (True)

7. **Standard category present? (T/F):**
- **Description**: Flag indicating whether standard categories are present in the dataset.
- **Example**: "T" (True)

8. **Standard category cell_type present? (T/F):**
- **Description**: Flag indicating whether the standard category for cell type is present in
the dataset.
- **Example**: "T" (True)

9. **Author Category Cell Type Field Name:**
- **Description**: This column shows the name of the field as it appears in the Dataset
Explorer UI. It indicates which specific field within the dataset corresponds to a certain
category, such as "cell type" or other annotations. Fields marked as `Cell types` in the
`Content` column play a key role in graph generation using the `pandasaurus_cxg` library, which
is employed in the data pipeline.
- **Example**: "author_cell_type"

10. **Content:**
- **Description**: This column indicates whether the field is used for cell type annotations
or for other dataset annotations (e.g., `Cell types` or `Other`).
- **Example**: "Cell types"

11. **Value type(s):**
- **Description**: This column specifies if the values in the dataset are represented in full
names or as abbreviations.
- **Example**: "abbreviations"

12. **Notes:**
- **Description**: Any additional notes or comments regarding the dataset.
- **Example**: "Only standard categories used"

13. **Study Short Name:**
- **Description**: A short, citation-style name for the study associated with the dataset
(first author, year, and journal abbreviation, as in the example).
- **Example**: "Muto et al. (2021) Nat Commun"

14. **CxG Dataset Collection X:**
- **Description**: The CellxGene link to the collection where the dataset is stored.
- **Example**: "https://cellxgene.cziscience.com/collections/9b02383a-9358-4f0f-9795-a891ec523bcc"

15. **Is the dataset Normal or Normal/Diseased:**
- **Description**: Indicates whether the dataset includes normal samples, diseased samples or
both.
- **Example**: "Normal"

16. **Stage:**
- **Description**: The biological stage of the samples in the dataset, such as adult, fetal, etc.
- **Example**: "Adult"

---

## General Tips for Curators:
- Ensure that every row marked as `Cell types` in the `Content` column has the matching field
name in `Author Category Cell Type Field Name`, as these pairs are crucial for graph
generation in the data pipeline using the `pandasaurus_cxg` library.
- Ensure all links (CxG and h5ad) are correct and accessible.
- Use consistent naming for datasets across related entries.
- Double-check flags (T/F) to ensure they correctly reflect the presence of specific categories.
- Fill out fields such as `Study Short Name` and `Notes` with proper references to aid in
documentation and user clarity.

By following these guidelines, curators can ensure that datasets are correctly formatted and
ready for integration into the pipeline.
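
Some of the tips above can be checked mechanically before a CSV is committed. The following is a minimal sketch, not part of the pipeline: the column names are copied from this guide, while `check_row`, the `https` requirement, and the `.h5ad` suffix check are assumptions made for illustration. It does not test whether links are reachable.

```python
from urllib.parse import urlparse

# Column names as listed in this guide; the checks themselves are illustrative.
FLAG_COLUMNS = [
    "Bionetworks reference",
    "Standard category present? (T/F)",
    "Standard category cell_type present? (T/F)",
]
LINK_COLUMNS = ["CxG Link", "h5ad link"]
FIELD_NAME_COLUMN = "Author Category Cell Type Field Name"


def check_row(row: dict) -> list:
    """Return a list of human-readable problems found in one curated CSV row."""
    problems = []
    # Flags must be exactly "T" or "F".
    for column in FLAG_COLUMNS:
        if row.get(column, "").strip() not in {"T", "F"}:
            problems.append(f"{column}: expected 'T' or 'F'")
    # Links must at least look like https URLs (assumption: no plain http links).
    for column in LINK_COLUMNS:
        parsed = urlparse(row.get(column, "").strip())
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append(f"{column}: not an https URL")
    if not row.get("h5ad link", "").strip().endswith(".h5ad"):
        problems.append("h5ad link: does not end in .h5ad")
    # A 'Cell types' row needs the field name it refers to.
    is_cell_type_row = row.get("Content", "").strip() == "Cell types"
    if is_cell_type_row and not row.get(FIELD_NAME_COLUMN, "").strip():
        problems.append(f"{FIELD_NAME_COLUMN}: required when Content is 'Cell types'")
    return problems


# Example values taken from the column descriptions in this guide.
example_row = {
    "CxG Link": "https://cellxgene.cziscience.com/e/13a027de-ea3e-432b-9a5e-6bc7048498fc.cxg/",
    "h5ad link": "https://datasets.cellxgene.cziscience.com/dabd979f-cc50-4526-81f3-8bc6c673ca36.h5ad",
    "Bionetworks reference": "T",
    "Standard category present? (T/F)": "T",
    "Standard category cell_type present? (T/F)": "T",
    "Author Category Cell Type Field Name": "author_cell_type",
    "Content": "Cell types",
}

print(check_row(example_row))  # an empty list means no problems were found
```

Returning a list of problems rather than raising on the first one lets a curator fix a whole row in one pass.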


## File Placement: Where to Store Curation Files

This section provides guidance on the correct directory structure and file locations for placing
curated datasets to ensure they are properly recognized and accessible during the loading process.

In the pipeline, curated CSV files are stored in the `curated_data` folder. When the pipeline
runs, these CSV files are automatically converted into a YAML file named
`cxg_author_cell_type.yml`, which is placed in the `config` folder. The YAML file maps each
dataset link (`CxG_link`) to its `author_cell_type_list`, the list of author cell type field
names that the pipeline processes.

Example of the YAML format:

```yaml
- CxG_link: https://datasets.cellxgene.cziscience.com/03af5481-a0b6-426c-86b4-9127ada17b53.h5ad
  author_cell_type_list:
  - author_cell_type
  - author_cluster_label
- CxG_link: https://datasets.cellxgene.cziscience.com/080f9be4-0f94-48cb-a82f-db53df1542ff.h5ad
  author_cell_type_list:
  - author_cluster_name
  - author_cell_type
  - author_cell_type
```
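
To make the mapping concrete, here is a minimal sketch of how curated rows could be grouped into that structure. It is not the pipeline's own conversion code: `group_author_fields` and `to_yaml` are hypothetical helpers, the use of the `h5ad link` column as the source of `CxG_link` is an assumption based on the `.h5ad` URLs in the example, and `donor_id` is a made-up field standing in for a non-cell-type row.

```python
import csv
import io

URL = "https://datasets.cellxgene.cziscience.com/03af5481-a0b6-426c-86b4-9127ada17b53.h5ad"

# A three-column excerpt of a curated CSV; real files carry all sixteen columns.
CSV_TEXT = "\n".join([
    "h5ad link,Author Category Cell Type Field Name,Content",
    f"{URL},author_cell_type,Cell types",
    f"{URL},author_cluster_label,Cell types",
    f"{URL},donor_id,Other",  # hypothetical non-cell-type field, skipped below
])


def group_author_fields(csv_text: str) -> list:
    """Group cell type field names by dataset link, keeping first-seen order."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Content"].strip() != "Cell types":
            continue  # only cell type fields feed graph generation
        fields = grouped.setdefault(row["h5ad link"].strip(), [])
        fields.append(row["Author Category Cell Type Field Name"].strip())
    return [
        {"CxG_link": link, "author_cell_type_list": fields}
        for link, fields in grouped.items()
    ]


def to_yaml(entries: list) -> str:
    """Serialize the entries in the same layout as the YAML example."""
    lines = []
    for entry in entries:
        lines.append(f"- CxG_link: {entry['CxG_link']}")
        lines.append("  author_cell_type_list:")
        lines.extend(f"  - {name}" for name in entry["author_cell_type_list"])
    return "\n".join(lines) + "\n"


print(to_yaml(group_author_fields(CSV_TEXT)))
```

The hand-written serializer keeps the sketch free of third-party dependencies; a YAML library would be the more robust choice for values that need quoting.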

The links in the `CxG_link` entries of the YAML file (`.h5ad` URLs in the example above) are
then used to download the datasets into the `dataset` folder. Finally, the `pandasaurus_cxg`
library is used to generate RDF graphs, which are stored in the `graph` folder for further use
in the pipeline.
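
As a rough illustration of the download step, the sketch below derives a local path in the `dataset` folder from a link. The helper `local_dataset_path` is hypothetical, and naming the local file after the last URL segment is an assumption; the pipeline may name files differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_dataset_path(h5ad_url: str, dataset_dir: str = "dataset") -> str:
    """Map an .h5ad URL to a path inside the dataset folder (assumed naming)."""
    file_name = PurePosixPath(urlparse(h5ad_url).path).name
    if not file_name.endswith(".h5ad"):
        raise ValueError(f"not an .h5ad link: {h5ad_url}")
    return str(PurePosixPath(dataset_dir) / file_name)


url = "https://datasets.cellxgene.cziscience.com/03af5481-a0b6-426c-86b4-9127ada17b53.h5ad"
print(local_dataset_path(url))  # dataset/03af5481-a0b6-426c-86b4-9127ada17b53.h5ad
```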
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -3,6 +3,7 @@ nav:
- Home: index.md
- Access Guide: access_guide.md
- Query Guide: query_guide.md
- Dataset Curation Guidelines: dataset_curation_guidelines.md
- Schema: schema.md
theme: readthedocs
docs_dir: docs
