Add support for multiple writers #140
base: main
Conversation
OK, so here's my first stab. I split nodes by category. Here's a unit test to run to see what's going on: koza/tests/unit/test_tsvwriter_node_and_edge.py, lines 39 to 107 at 4e0ec24.
So I added a flag parameter; when this flag is enabled, the writer splits the output into separate files per category.
I deliberately didn't provide subject and object categories in some examples, just to show what the splits would look like. This would (hopefully) encourage KG builders to abide by a standard (Biolink) when categorizing their nodes. We could enforce usage of Biolink categories (maybe via pydantic?), but I'm not sure if we want to do that. I'm also not quite sure how to implement this in the JSONWriter, but we'll worry about that once we finalize this. Thoughts? cc: @kevinschaper @justaddcoffee @caufieldjh @DnlRKorn @sierra-moxon @amc-corey-cox |
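To make the splitting idea concrete, here is a minimal, hypothetical sketch of routing nodes to one TSV per category; it is not the PR's actual implementation, and the file-naming scheme is an assumption:

```python
import csv
from pathlib import Path

def write_nodes_split_by_category(nodes: list[dict], outdir: str) -> None:
    """Hypothetical sketch: write each node row to a TSV named after its
    (Biolink) category, e.g. biolink_Gene_nodes.tsv. Assumes rows within
    a category share the same columns. Nodes without a category land in
    unknown_nodes.tsv, which is how missing categories would surface."""
    writers: dict[str, csv.DictWriter] = {}
    handles = []
    try:
        for node in nodes:
            category = node.get("category") or "unknown"
            key = category.replace(":", "_")
            if key not in writers:
                fh = open(Path(outdir) / f"{key}_nodes.tsv", "w", newline="")
                handles.append(fh)
                writer = csv.DictWriter(fh, fieldnames=list(node.keys()), delimiter="\t")
                writer.writeheader()
                writers[key] = writer
            writers[key].writerow(node)
    finally:
        for fh in handles:
            fh.close()
```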
I was thinking more generic for this feature, something like koza_app.write(a, name="filtered"), and then it would produce a separate, named output file alongside the default one. Which would also allow for writing several different subsets of an ingest from a single transform.
|
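To illustrate, here is a minimal sketch of what a transform using that proposed name keyword might look like. The name keyword is the API proposed in this thread (not an existing Koza feature), the import paths follow the style of Koza's STRING example but vary by version, and the source name and score threshold are purely illustrative:

```python
import uuid

# Import paths vary by Koza / Biolink model version; these follow the
# style of Koza's STRING example and are illustrative only.
from koza.cli_runner import get_koza_app
from biolink_model_pydantic.model import PairwiseGeneToGeneInteraction

koza_app = get_koza_app("protein-links-detailed")
row = koza_app.get_row()

edge = PairwiseGeneToGeneInteraction(
    id="uuid:" + str(uuid.uuid4()),
    subject="ENSEMBL:" + row["protein1"],
    object="ENSEMBL:" + row["protein2"],
    predicate="biolink:interacts_with",
)

# Every edge still goes to the default output.
koza_app.write(edge)

# Additionally route high-confidence edges to a named output.
# `name=` is the proposed API, and the 700 cutoff is illustrative.
if int(row["combined_score"]) >= 700:
    koza_app.write(edge, name="filtered")
```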
This is assuming there is a score. How/where is this score calculated? Sorry in advance, I do not follow |
In this example with STRING, the score is provided with the data, e.g.:

```
protein1 protein2 combined_score
493.BWD07_00005 493.BWD07_05105 227
493.BWD07_00005 493.BWD07_03880 221
493.BWD07_00005 493.BWD07_08685 317
493.BWD07_00005 493.BWD07_05905 232
493.BWD07_00005 493.BWD07_06110 174
493.BWD07_00005 493.BWD07_02170 451
493.BWD07_00005 493.BWD07_07175 150
493.BWD07_00005 493.BWD07_01790 161
493.BWD07_00005 493.BWD07_05145 168
```

where each row is a single protein-protein interaction pair. |
Right, in the transform python for an ingest, you'd be able to check each row's score and write to a named output only when it passes a threshold, along the lines of the sketch above. |
Sorry .... still confused and pardon my ignorance ... I have a few questions. I may have misunderstood this whole concept. If this is the case:
The bigger question:
|
No worries, I think we're still figuring out many of the details of how this whole system can/should work.
Many of our ingests are essentially one to many: one data file, many different types of entities and relationships. Our KGs rarely need to include all of these potential components, and that's partially because we already do the work of modeling everything as nodes vs. edges, so that eliminates a whole bunch of other ways we could be modeling the data (e.g., I could try to model everything in a pure RDF approach and make everything a triple - not an invalid approach, but not what we're doing). For a KG like Monarch there's also the assumption that node data will have a single source of truth, but not every KG works that way; some may merge node properties from multiple sources. So if we have a way to separate ingested data based on its component parts, we have a way to produce reusable data modules.
We can't predict them all, but we can make it as easy as possible to modify existing transforms. For Koza's purposes, this just means supporting very generic splits; then it's just a matter of having the "core" transform be the broadest possible interpretation of the data. (I think Koza already assumes this, because if I fail to include one of the column names in an input file within the transform config, it raises an error.)
Yes indeed
Current plans are to provide the component nodes and edges along with their transform module, so if the parts already work for a given use case, they can just be used as-is, no changes needed. The user would still be expected to do the final merge. |
Catching up here ... probably an ingest case that requires numeric data interpretation, like STRING, is not the best example to start with, because it will involve either 1) ingesting all the data, which in the end is not usable without further work and decisions, or 2) an arbitrary slice of the data that will be difficult to agree on.
But what I wanted to ask first was whether the nascent strategy above introduces an extra transform step. That is, say a CHEBI transform exists and we just want the 'antiviral' subset of CHEBI: one would grab the bulk CHEBI transform from the right repo and then need extra steps to filter/subset that source into a specific KG project. Am I interpreting this correctly? I know CHEBI is also not a great example because reference ontology transforms already exist in KG-OBO. We have a similar case with subsetting the NCBITaxonomy.
A better example to talk about would be BacDive: a rich, complex, mostly standardized source. We have been working on ingesting various aspects of this dataset over the last year or so, and are about 70% done. But due to the breadth and complexity of the data, there are also other analyses, interpretations, and augmentations of BacDive that we've ingested. In the end, getting 100% of this data ingested is a huge lift, and even out of scope. So I wanted to throw this example into the mix, perhaps as a bit of an edge case: how could a partial ingest of a valuable data source live in this new modular universe? It seems the wrong direction to prevent ingestion of a source because 100% of it is not available...
One solution to the out-of-scope ingest could be to somehow represent the data selection and modeling decisions in a machine-readable way. I think this is going to be an important piece of the modularity ... the transparency side of it, to help make ingest/selection decisions. |
I think a benefit that we still get from splitting apart into a single repo for each source (or each file from a source) is that even if we subset for practicality, all of the machinery is in place to produce alternate subsets or expand to different parts of a file/source that were initially passed over. An example I have is the Alliance disease association ingest, which includes non-human gene to human disease associations that are inferred via orthology. I don't want to bring those edges into monarch-kg enough to figure out how to model them in Biolink, so I'm excluding them, but if somebody needs them and wants to sort out the modeling, it's just a small PR to an existing repo.
I would love for koza to have an all-declarative mode using linkml-map syntax, so that transforms that don't actually need custom python logic can just be expressed in yaml, and would therefore be naturally machine readable. Maybe kgx-stats-like metadata about each file would be a good way to document things descriptively, though. Our minimal start on that in our cookiecutter was a little report tsv table for nodes and edges, giving counts by category, taxon, etc. |
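For the descriptive-report idea, here is a small sketch of computing the kind of per-category counts that report table contains; the KGX-style nodes TSV with a category column is an assumption for illustration:

```python
import csv
from collections import Counter

def category_counts(nodes_tsv: str) -> Counter:
    """Count nodes per category in a KGX-style nodes TSV: a tiny
    stand-in for the cookiecutter's report table (counts by
    category, taxon, etc.)."""
    counts: Counter = Counter()
    with open(nodes_tsv, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts[row.get("category") or "unknown"] += 1
    return counts

# e.g. category_counts("output/my_source_nodes.tsv")
#      -> Counter({"biolink:Gene": 1200, "biolink:Protein": 800, ...})
```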
I like the koza_app.write(a, name="filtered") idea. We'd probably want name to accept a list, since any edge (or node) could be part of multiple modules/subsets: koza_app.write(a, name=["filtered", ...])
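A small sketch of how a writer might normalize that argument, again assuming the hypothetical name parameter from this thread:

```python
from typing import List, Optional, Union

DEFAULT_OUTPUT = "default"  # hypothetical label for the unnamed output

def resolve_output_names(name: Optional[Union[str, List[str]]]) -> List[str]:
    """Normalize the proposed `name` argument so that a single string,
    a list of names, or None (the default output) all route cleanly."""
    if name is None:
        return [DEFAULT_OUTPUT]
    if isinstance(name, str):
        return [name]
    return list(name)

# e.g. resolve_output_names("filtered")                      -> ["filtered"]
#      resolve_output_names(["filtered", "high_confidence"]) -> ["filtered", "high_confidence"]
#      resolve_output_names(None)                            -> ["default"]
```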