From ef415e2f5f2cf1db5598377880467d97d377c494 Mon Sep 17 00:00:00 2001
From: David Osumi-Sutherland
Date: Wed, 25 Sep 2024 14:58:33 +0100
Subject: [PATCH] Update index.md

---
 docs/index.md | 28 ++++------------------------
 1 file changed, 4 insertions(+), 24 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 77eed14..bfbfbb7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,26 +1,6 @@
-# Pipeline code for building a cell type knowledge graph
+## Cell Ontology Knowledge Graph
 
-## Components
-* Hierarchies of nested cell sets defined by author category cell type annotations, combined with CL annotation.
-* The Cell Ontology and interlinked OBO ontologies - initial emphasis on GO & Pro.
-* Gene/Protein Annotations to GO terms that:
- * Are supported by strong experimental evidence
- * Cover mouse and human genes (we may consider adding other mammals)
- * Are directly linked to CL terms or are closely linked via some defined pattern (e.g. if a cell type has a cellular component, then we should also pull annotations to terms for assembly and maintenance of that cell type).
-* Sources of assertions about cell type markers: LLMs, CL, GO, CAS.
-* Curated information about cell types and the processes they are involved in derived from LLM-based piplines.
-* In all cases, we will capture which publications support/sources support assertions.
-* Standard model for linking Gene/Protein/Transcript IDs. TBD. Initially at least I suggest aggregating to single Neo4J nodes and using APs.
-* For all markers found via any route, validate against annotated data using CxG Census query & storing sumamry statistics - mean, median, variance, entropy? This information can be stored in edge annotation.
+CL-KG is a knowledge graph integrating the Cell Ontology and linked ontologies with hierarchical annotations of single-cell transcriptomics data from CELLxGENE.
+Its main current use is as a semantic layer for querying annotations on CELLxGENE. Future plans include loading cell type markers, along with evidence and confidence from multiple sources, and integrating GO annotation for GO terms used to define cell types.
 
-## Pipelines
-* Pandasaurus extracts cell sets linked to CL terms following standard schemas
-* Python script to QuickGO API to extract relevant GO annotations
-
-## Use cases
- * Mining CxG for missing CL terms and CL annotations (Cypher queries to be defined)
- * Cell Type marker query service
- * Define Cypher queries
- * Build API
- * Build LLM query layer
- * Input to ML algorithms assigning cell type. This is experimental. We need partners early enough in development do guide and avoid making poor choices. It is probably worth being aware of existing options for generating embeddings (e.g. node2vec)
+For more details, please see [CL-KG schema and roadmap](https://docs.google.com/document/d/1CIvy_NV1poK1wK-lY9E_sksOIRDxMyyBc-ZZLzD8OrM/edit#heading=h.vq3lz7r6domf)