Commit

Update schema.md
dosumis authored Sep 25, 2024
1 parent ffeb686 commit 23333d6
Showing 1 changed file with 5 additions and 51 deletions.
56 changes: 5 additions & 51 deletions docs/schema.md
@@ -1,59 +1,13 @@
**Status**: Draft
## CL KG Schema

# OWL/RDF to Neo4j Schema
Full details of the schema are now here:
[CL_KG user stories, schema and roadmap](https://docs.google.com/document/d/1CIvy_NV1poK1wK-lY9E_sksOIRDxMyyBc-ZZLzD8OrM/edit#heading=h.vq3lz7r6domf)

Defined in [documentation of owl2neo library](https://github.com/OBASKTools/neo4j2owl?tab=readme-ov-file#entities).
For ontology representation see:
[OWL-2-NEO mapping](https://github.com/OBASKTools/neo4j2owl/blob/master/README.md#owl-2-el---neo4j-mapping-direct-existentials)

## Nested cell sets:

Cell sets are individuals representing author category cell type annotations.

```cypher
// Each cell set is an instance of 'cluster' (PCL:0010001). This should be improved!
(c1)-[:INSTANCEOF]-(:Cluster { label: 'cluster' })
// Where one cell set subsumes another, this is represented with subcluster_of (RO:0015003):
// https://www.ebi.ac.uk/ols4/ontologies/ro/properties/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FRO_0015003
(c1)-[:subcluster_of]->(c2)
```
subcluster_of is transitive, so a transitive reduction step MUST be used in generating the graph.
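
As an illustration of the transitive reduction step, here is a minimal Python sketch. It assumes the `subcluster_of` edges are available as `(child, parent)` pairs and form a directed acyclic graph. The function name and the input shape are assumptions made for illustration; they are not part of the schema or of any pipeline code.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Return the (child, parent) edges that are not implied by a longer path.

    Assumes the subcluster_of edges form a directed acyclic graph.
    """
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    def ancestors(node, seen=None):
        # Collect every node reachable from `node` via one or more edges.
        seen = set() if seen is None else seen
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    reduced = set()
    for child, parent in edges:
        # The edge is redundant if another direct parent already reaches `parent`.
        others = parents[child] - {parent}
        if not any(parent in ancestors(other) for other in others):
            reduced.add((child, parent))
    return reduced
```

For example, given the edges `c1 -> c2`, `c2 -> c3` and `c1 -> c3`, the sketch drops `c1 -> c3` because it is already implied by the other two.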

All cell sets representing author cell type annotations MUST be present. However, if cell sets have identical membership, they are unified into a single node. Configuration specifies an order of preference that determines which annotation becomes the rdfs:label when nodes are unified. All other names are stored with their original keys.
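
The unification rule can be sketched as follows. This is an illustrative sketch only: the function name, the input shape (a mapping from `(annotation_key, name)` to cell IDs) and the annotation keys used in the usage note are hypothetical, and the real configuration format is not described here.

```python
def unify_cell_sets(annotations, key_preference):
    """Merge cell sets with identical membership into single nodes.

    annotations: dict mapping (annotation_key, name) -> iterable of cell IDs.
    key_preference: annotation keys ordered from most to least preferred.
    """
    rank = {key: i for i, key in enumerate(key_preference)}

    # Group annotation names by the exact set of cells they cover.
    grouped = {}
    for (key, name), cells in annotations.items():
        grouped.setdefault(frozenset(cells), []).append((key, name))

    nodes = []
    for members, names in grouped.items():
        # The name from the most preferred annotation key becomes the label.
        names.sort(key=lambda kn: rank.get(kn[0], len(rank)))
        nodes.append({
            "label": names[0][1],
            "names": dict(names),  # every name kept under its original key
            "cells": members,
        })
    return nodes
```

With hypothetical keys `subclass` and `cell_type_author`, two annotations that cover exactly the same cells collapse into one node. That node's label comes from whichever key appears first in `key_preference`.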

TBD: Should we also represent overlaps between author annotations? These could use RO:overlaps and record percent_overlap on the edge (we should think about how this fits with confusion matrix generation).

## Cell sets to Cell ontology terms

The cell_type fields in the CELLxGENE schema also define cell sets.

**All cell ontology terms MUST be represented.**

Where there is a 1:1 relationship between a cell set defined by a cell_type annotation and one represented by an author annotation, this is represented by:

```cypher
(c:Cluster)-[:composed_primarily_of]->(cl:Cell:Class)
```

'composed primarily of' ([RO:0002473](https://www.ebi.ac.uk/ols4/ontologies/ro/properties/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FRO_0002473))

Where a cell set defined by a cell_type annotation doesn't map to a single cell set defined by an author category annotation, but subsumes more than one of these, we generate a cluster (cell set) node for the cell_type and relate it as above. One advantage of this is that it allows CxG metadata to be consistently attached to an author annotation node.


## Cell sets to standard [CxG metadata](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.0.0/schema.md) (apart from cell ontology terms)

```cypher
(c:Cluster)-[:CxG_metadata_key { percentage: <float> }]-(x)
```

Here, percentage is the percentage of cells in the cell set defined by the author annotation that are also in the cell set defined by the metadata annotation.

e.g.
```cypher
(c:Cluster)-[:tissue { percentage: 50.5 }]->(:Class { label: 'cornea', short_form: 'UBERON_'})
```
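
The percentage recorded on these edges can be computed along the following lines. This is a minimal sketch that assumes each cell set is available as a collection of cell IDs; the function name is hypothetical.

```python
def percentage_overlap(author_cells, metadata_cells):
    """Percent of cells in the author-annotation cell set that are also
    in the metadata-annotation cell set."""
    author_cells = set(author_cells)
    if not author_cells:
        raise ValueError("author cell set is empty")
    shared = author_cells & set(metadata_cells)
    return 100.0 * len(shared) / len(author_cells)
```

Note that the denominator is the size of the author-annotation cell set, so the value is not symmetric between the two sets.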

The above properties are represented as OBASK builtins.

## Markers/marker sets

TBA


