GPAD/GPI spec comments #3031

balhoff · 2020-06-02T13:55:50Z

I've been working on a parser and object model for the new GPAD/GPI specs, and just wanted to report a few areas in the spec that could be clarified:

I wonder if it would be possible to consolidate the grammar definitions a little, as in the SPARQL grammar or in the OWL Functional Syntax grammar. I really enjoy reading those. :-)

kltm · 2020-06-03T22:08:06Z

@vanaukenk Notes from GAF and GPAD diving.

headers

Counter to my intuition, headers are pretty wild across the board. I'm honestly not sure we should bother with supporting the past here and feel that just setting decent rules now and enforcing them would be a cleaner way forward.

Things that would likely be scanned as a "property symbol":
https://gist.github.com/kltm/d28a487b3a009c38dda93293808159a5#file-things_that_are_property_symbols-txt

All headers: https://gist.github.com/kltm/d28a487b3a009c38dda93293808159a5#file-all-headers-txt

Given this, I propose a tight definition for non-comment comment lines in the header:

^!\w*[a-zA-Z0-9-_]+\w*:\w*<OUR_ASCII_SUBSET>+\w*

@balhoff @vanaukenk Any thoughts here?
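To make the proposal easier to try against real files, here is a minimal Python sketch. It rests on two assumptions that are not from the spec: that the `\w*` in the pattern above was intended as optional whitespace (`\s*`), and that `<OUR_ASCII_SUBSET>` can be approximated as printable ASCII minus the double quote and the pipe. The function name `parse_header_property` is hypothetical.

```python
import re

# Assumption: stand-in for <OUR_ASCII_SUBSET>, i.e. printable ASCII
# (0x20-0x7E) without the double quote (0x22) and the pipe (0x7C).
ASCII_SUBSET = r"[\x20\x21\x23-\x7b\x7d\x7e]"

# Assumption: the \w* in the proposed pattern is read as \s* here.
HEADER_PROPERTY = re.compile(
    r"^!\s*([A-Za-z0-9_-]+)\s*:\s*(" + ASCII_SUBSET + r"+)$"
)


def parse_header_property(line):
    """Return (key, value) for a header property line, else None."""
    match = HEADER_PROPERTY.match(line)
    if match is None:
        return None
    value = match.group(2).strip()
    if not value:
        # A key followed only by whitespace is not treated as a property.
        return None
    return match.group(1), value


print(parse_header_property("!gpa-version: 2.0"))
print(parse_header_property("! just a comment"))
```

Under this reading, a line such as `! just a comment` is not scanned as a property because no colon follows the first token.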

taxon

A little easier: if used, only a single '|'; no instances of anything else.

vanaukenk · 2020-06-09T14:30:06Z

@kltm
Your header analysis is consistent with what I had thought when I examined some by eye.
Let's just confirm your proposal on the next GO-CAM specs call.
Where possible, I've tried to fix the other minor typos in the spec.

vanaukenk · 2020-06-09T14:56:01Z

@kltm @balhoff
For the Interacting_taxon_ID field, I believe we said that we would separate multiple values with a comma, since the annotation is likely to reflect multi-species interactions and by analogy to the With/From field, a comma would indicate AND (while a pipe would indicate OR).

dougli1sqrd · 2020-06-16T20:08:11Z

I have a spec question/concern:
For with/from, since we can do pipes and commas now: blah,blah|foo,bar the idea is that things between pipes can be considered separate annotations, yeah?

But we also have this pattern in the extensions. Does that mean we just have W x E number of internal annotations for each pipe separated element in both withfrom and extensions?

With, Ext
A|B, X|Y --> A X, A Y, B X, B Y

ukemi · 2020-06-16T20:33:07Z

Since the with field represents a single piece of evidence, shouldn't it be considered a single annotation? It's just that the evidence is supported by this or that or this or that rather than this and that and this and that. It's an interesting question on where to make the slices.
For the extensions themselves, I have found that commas and pipes have been used inconsistently and @dustine32 and I have worked on rules for the initial imports that reflect what curators were trying to get at. Presumably once the initial import is done, this will no longer be the case and things will be consistent.

dougli1sqrd · 2020-06-16T21:10:17Z

Can we explain more specifically what this means?

Do you mean you can only have pipes or commas in a single annotation?

If you mix them together, where do the "parentheses" go? Like, is A,B|C,D|E:
(A,B)|(C,D)|E
A,(B|C),(D|E)
((((A,B)|C),D)|E)

vanaukenk · 2020-06-16T21:12:13Z

@dougli1sqrd
My interpretation is that the parentheses from the above example would be:
(A,B)|(C,D)|E
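A small Python sketch of that reading, in which the pipe binds more loosely than the comma: split on `|` first, then on `,` within each group. The function name `parse_groups` is hypothetical, and this only illustrates the interpretation given here, not a rule stated in the spec.

```python
def parse_groups(field):
    """Parse 'A,B|C,D|E' into [['A', 'B'], ['C', 'D'], ['E']].

    Outer list: pipe-separated groups (OR).
    Inner lists: comma-separated members of one group (AND).
    """
    if not field:
        return []
    return [
        [item.strip() for item in group.split(",")]
        for group in field.split("|")
    ]


print(parse_groups("A,B|C,D|E"))
```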

vanaukenk · 2020-06-16T21:18:59Z

So if we had this completely made up annotation:

ace-1 involved_in neurotransmission ace-2, ace-3|ace-3, ace-4 occurs_in(motor neuron), part_of(response to odorant)|occurs_in(body wall muscle cell), part_of(response to touch)

It could be split into four annotations as follows:

ace-1 involved_in neurotransmission ace-2, ace-3 occurs_in(motor neuron), part_of(response to odorant)

ace-1 involved_in neurotransmission ace-3, ace-4 occurs_in(motor neuron), part_of(response to odorant)

ace-1 involved_in neurotransmission ace-2, ace-3 occurs_in(body wall muscle cell), part_of(response to touch)

ace-1 involved_in neurotransmission ace-3, ace-4 occurs_in(body wall muscle cell), part_of(response to touch)
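The split above is a cross product of the pipe-separated With/From groups and the pipe-separated extension groups. A minimal Python sketch of that W x E expansion follows; the function name `expand` is hypothetical, and whether With/From should be split this way at all is still an open question in this thread.

```python
from itertools import product


def expand(with_from, extensions):
    """Pair every pipe-separated With/From group with every
    pipe-separated extension group (W x E combinations)."""
    with_groups = [g.strip() for g in with_from.split("|")]
    ext_groups = [g.strip() for g in extensions.split("|")]
    return list(product(with_groups, ext_groups))


pairs = expand(
    "ace-2, ace-3|ace-3, ace-4",
    "occurs_in(motor neuron), part_of(response to odorant)"
    "|occurs_in(body wall muscle cell), part_of(response to touch)",
)
for with_group, ext_group in pairs:
    print("ace-1 involved_in neurotransmission", with_group, ext_group)
```

With two With/From groups and two extension groups this yields four combinations, matching the four annotations listed above (in a different order).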

ukemi · 2020-06-17T12:12:24Z

I'm now wondering if we should rethink this. Allowing for both pipes and commas seems to open up confusion that might lead to what I saw in our original annotations. If there is a delimiter that represents separate annotations, wouldn't it be better to just create the separate annotations? I think when @dustine32 imports conventional annotations, he parses these complex structures into separate 'annotations' anyways. How will we create files as we move forward from GO-CAM exports? Will we need to devise an algorithm for creating pipe versus comma delimited values in one very complex annotation?

vanaukenk · 2020-06-17T13:05:41Z

IIRC, introducing pipes into the With/From field was aimed at time-saving/efficiency wrt curation. I can't otherwise think of a reason pipe-separated With/From entries couldn't be separate annotations.

For annotation extensions, though, my recollection was that there was concern that individual annotations should not just differ by extensions, partly because of concern about how analysis tools, most (all?) of which don't take extensions into account, would handle seemingly 'duplicate' annotations. Perhaps this is no longer a real concern and we should change our SOP on this.

ukemi · 2020-06-17T13:08:47Z

We should re-examine. If I understand correctly, when we round-trip, they will come out as individual annotations.

ukemi · 2020-06-17T14:25:21Z

@vanaukenk, I think we are thinking along the same lines, but we need to take a closer look at this. Does it give us an advantage to split out the with/from field? Does it give us an advantage to split out the annotation extensions? For the latter, several pipe-separated annotation extensions could in fact lead to different primary inferred annotations, so there is a distinct advantage.
For example 'organ development'-----results_in_development_of heart|results_in_development_of kidney would yield two different annotations by inference: heart development and kidney development.

ukemi · 2020-06-17T14:26:19Z

I can't think of an advantage to splitting out pipe-separated with/from fields.

balhoff · 2020-07-01T18:46:23Z

@vanaukenk @ukemi @kltm for GPI columns 2, 3, and 4:

The values are generally text. Columns 2 and 3 are specified as "xxxx", and column 4 is "Label" (undefined). Could these all be defined as Alphanumeric? Or are other characters like hyphens expected?

ukemi · 2020-07-01T18:55:36Z

balhoff · 2020-07-01T19:05:56Z

@ukemi okay I expected as much. :-)

kltm · 2020-07-01T19:47:10Z

@balhoff This is some of where the "ASCII, except for non-printing characters, tabs, new lines, control characters, quotation marks, or pipe" comes from (and we even said that quotes are probably fine too on a call). Doesn't this also bump into the utf-8 question? The future may indeed look different: geneontology/helpdesk#265
geneontology/go-site#1514
geneontology/go-site#1515

dougli1sqrd · 2020-07-28T20:28:12Z

In the GPAD 2 spec I noticed that the symbol Relation_ID may be from the RO ontology or from go_rel. In practice this makes parsing hard. Should we think about whittling this down to a defined set of relations? Or maybe we can generate a file that has all the allowed CURIEs and perhaps the label to make parsing easier. @balhoff @cmungall

vanaukenk · 2020-07-28T20:42:57Z

@dougli1sqrd

You are absolutely right about this, and we are working to move whatever relations we can from gorel to RO.

That said, there are some gorel relations, like 'regulates o occurs in', that will have to stick around for the foreseeable future until we can move all of the annotations that use it into GO-CAM models.

I'm open to suggestions/thoughts on how to make any of this easier.

kltm · 2020-07-28T20:58:01Z

Maybe a topic for the software meeting?

dougli1sqrd · 2020-12-16T21:26:38Z

Hi @ukemi, @kltm, @vanaukenk, etc!

Taking a look at the GPI 2.0 spec file, I have some more questions actually.

In the spec file, we have a grammar for each entity line here: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md#gp-entities
Specifying:

Entity ::= Col_1 tab Col_2 tab ... Col_9 nl

But in the table below, under https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md#gpi-columns it appears that there are 11 possible columns. Should the above Entity spec say up to Col_9?

In the actual column specifications, for columns 3 and 5, the cardinality is [n] or greater, but the field spec doesn't show a separator or how they should be joined (like in column 4). Are those columns actually allowed to take on lists of values?

vanaukenk · 2020-12-16T21:43:30Z

@dougli1sqrd

I think the line below should actually say Col_11.

Entity ::= Col_1 tab Col_2 tab ... Col_9 nl

Some of the specs got copied over from a previous iteration and that update got missed.

For columns 3 and 5, I think we'd follow a similar use of '|' as a separator.

I can update the spec if what I've said here makes sense to you.

dougli1sqrd · 2020-12-16T21:49:06Z

@vanaukenk Okay great! Thanks for the clarification. That makes sense to me.

vanaukenk · 2020-12-16T22:10:00Z

Updated specification. Thanks @dougli1sqrd

dougli1sqrd · 2020-12-16T22:20:03Z

Thanks! Sorry, just noting that column 5, I think, now reads like it can accept an empty field: DB_Object_Type ::= [OBO_ID] ('|' OBO_ID)*. Do we want this to be DB_Object_Type ::= OBO_ID ('|' OBO_ID)*? (tagging @balhoff @vanaukenk)

vanaukenk · 2020-12-17T14:26:58Z

Thanks @dougli1sqrd You're right; I'll fix that.
I also double-checked the GPAD spec and the same issue existed with the reference field there, so I'll update that, too.
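The difference between the two productions can be checked with a short Python sketch. The `OBO_ID` pattern below is an assumption made for illustration (a prefix, a colon, then digits) and is not the spec's definition; the sample identifiers are only placeholders.

```python
import re

# Assumption: simplified stand-in for the spec's OBO_ID production.
OBO_ID = r"[A-Za-z][A-Za-z0-9_]*:[0-9]+"

# DB_Object_Type ::= [OBO_ID] ('|' OBO_ID)*   (first ID optional)
OPTIONAL_FIRST = re.compile(rf"(?:{OBO_ID})?(?:\|{OBO_ID})*")

# DB_Object_Type ::= OBO_ID ('|' OBO_ID)*     (first ID required)
REQUIRED_FIRST = re.compile(rf"{OBO_ID}(?:\|{OBO_ID})*")


def accepts(pattern, value):
    """True if the whole field value matches the production."""
    return pattern.fullmatch(value) is not None


for value in ["", "|SO:0001217", "SO:0001217", "SO:0001217|SO:0000704"]:
    print(repr(value), accepts(OPTIONAL_FIRST, value), accepts(REQUIRED_FIRST, value))
```

The optional form accepts both an empty field and a value with a leading pipe, while the required form rejects both.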

pgaudet · 2024-02-13T15:37:05Z

The time for commenting is over. If issues need to be discussed, please open a new ticket.

vanaukenk added the GPAD/GPI label Jun 3, 2020

vanaukenk assigned balhoff, vanaukenk and kltm Jun 3, 2020

balhoff mentioned this issue Jul 1, 2020

Add comma-separation for Interacting_taxon_ID; add definition for Taxon_ID; add definition for Relation_ID #3171

Merged

pgaudet closed this as completed Feb 13, 2024
