Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPAD/GPI spec comments #3031

Closed
3 of 13 tasks
balhoff opened this issue Jun 2, 2020 · 27 comments
Closed
3 of 13 tasks

GPAD/GPI spec comments #3031

balhoff opened this issue Jun 2, 2020 · 27 comments
Assignees
Labels

Comments

@balhoff
Copy link
Member

balhoff commented Jun 2, 2020

I've been working on a parser and object model for the new GPAD/GPI specs, and just wanted to report a few areas in the spec that could be clarified:

  • GPI_Header ::= '!gpi-version: 1.1' nl should be version 2.0
  • There is GPAD_Doc ::= GPAD_Header Annotation* but no corresponding GPI_Doc production
  • I worry about some ambiguity here in GPAD_Header_Line: '!' Property_Symbol ':' Space* Value nl
    • given Property_Symbol ::= ID | Alphanumeric, since ID must contain a :, how does a colon in the ID interact with the : separating the property from the value?
    • Value doesn't seem to be defined. Should it follow the Property_Value defined for property-value pairs within annotation lines? If so it seems it can also contain a :.
    • Perhaps the solution is to make at least one space after the separating colon mandatory? Currently it's zero-or-more.
  • In GPI column 4, Label is not defined: DB_Object_Synonyms ::= [Label] ('|' Label)*
  • In GPI column 6, Taxon_ID is not defined: DB_Object_Taxon ::= NCBITaxon:[Taxon_ID]. Should it be one or more digits?
  • Property_Value could be more explicit. Current:
    • Property_Value ::= (AnyChar - ('=' | '|' | nl))
      ASCII, except for non-printing characters, tabs, new lines, control characters, quotation marks, or pipe
    • = and | and nl are in the forbidden list, but not the other characters mentioned
  • In Relational_Expression ::= Relation_ID '(' ID ')' should the term be an OBO_ID (or I guess staying with ID allows relations to gene product identifiers)?
  • In GPAD column 8, Interacting_taxon_ID is said to be cardinality 0 or greater, but this isn't in the grammar: Interacting_taxon_ID ::= NCBITaxon:[Taxon_ID]

I wonder if it would be possible to consolidate the grammar definitions a little, as in the SPARQL grammar or in the OWL Functional Syntax grammar. I really enjoy reading those. :-)

@kltm
Copy link
Member

kltm commented Jun 3, 2020

@vanaukenk Notes from GAF and GPAD diving.

headers

Counter to my intuition, headers are pretty wild across the board. I'm honestly not sure we should bother with supporting the past here and feel that just setting decent rules now and enforcing them would be a cleaner way forward.

Things that would likely be scanned as a "property symbol":
https://gist.github.com/kltm/d28a487b3a009c38dda93293808159a5#file-things_that_are_property_symbols-txt

All headers: https://gist.github.com/kltm/d28a487b3a009c38dda93293808159a5#file-all-headers-txt

Given this, I propose a tight definition for non-comment comment lines in the header:

^!\w*[a-zA-Z0-9-_]+\w*:\w*<OUR_ASCII_SUBSET>+\w*

@balhoff @vanaukenk Any thoughts here?

taxon

A little easier: if used, only a single '|'; no instances of anything else.

@vanaukenk
Copy link
Contributor

@kltm
Your header analysis is consistent with what I had thought when I examined some by eye.
Let's just confirm your proposal on the next GO-CAM specs call.
Where possible, I've tried to fix the other minor typos in the spec.

@vanaukenk
Copy link
Contributor

@kltm @balhoff
For the Interacting_taxon_ID field, I believe we said that we would separate multiple values with a comma, since the annotation is likely to reflect multi-species interactions and by analogy to the With/From field, a comma would indicate AND (while a pipe would indicate OR).

@dougli1sqrd
Copy link
Contributor

I have a spec question/concern:
For with/from, since we can do pipes and commas now: blah,blah|foo,bar the idea is that things between pipes can be considered separate annotations, yeah?

But we also have this pattern in the extensions. Does that mean we just have W x E number of internal annotations for each pipe separated element in both withfrom and extensions?

With, Ext
A|B, X|Y --> A X, A Y, B X, B Y

@ukemi
Copy link
Contributor

ukemi commented Jun 16, 2020

Since the with field represents a single piece of evidence, shouldn't it be considered a single annotation? It's just that the evidence is supported by this or that or this or that rather than this and that and this and that. It's an interesting question on where to make the slices.
For the extensions themselves, I have found that commas and pipes have been used inconsistently and @dustine32 and I have worked on rules for the initial imports that reflect what curators were trying to get at. Presumably once the initial import is done, this will no longer be the case and things will be consistent.

@dougli1sqrd
Copy link
Contributor

Can we explain more specifically what this means?

Do you mean you can only have pipes or commas in a single annotation?

If you mix them together, where are the "parentheses" exist? Like, is A,B|C,D|E:
(A,B)|(C,D)|E
A,(B|C),(D|E)
((((A,B)|C),D)|E)

@vanaukenk
Copy link
Contributor

@dougli1sqrd
My interpretation is that the parentheses from the above example would be:
(A,B)|(C,D)|E

@vanaukenk
Copy link
Contributor

So if we had this completely made up annotation:

ace-1 involved_in neurotransmission ace-2, ace-3|ace-3, ace-4 occurs_in(motor neuron), part_of(response to odorant)|occurs_in(body wall muscle cell), part_of(response to touch)

It could be split into four annotations as follows:

ace-1 involved_in neurotransmission ace-2, ace-3 occurs_in(motor neuron), part_of(response to odorant)

ace-1 involved_in neurotransmission ace-3, ace-4 occurs_in(motor neuron), part_of(response to odorant)

ace-1 involved_in neurotransmission ace-2, ace-3 occurs_in(body wall muscle cell), part_of(response to touch)

ace-1 involved_in neurotransmission ace-3, ace-4 occurs_in(body wall muscle cell), part_of(response to touch)

@ukemi
Copy link
Contributor

ukemi commented Jun 17, 2020

I'm now wondering if we should rethink this. Allowing for both pipes and commas seems to open up confusion that might lead to what I saw in our original annotations. If there is a delimiter that represent separate annotations, wouldn't it be better to just create the separate annotations? I think when @dustine32 imports conventional annotations, he parses these complex structures into separate 'annotations' anyways. How will we create files as we move forward from GO-CAM exports? Will we need to devise an algorithm for creating pipe versus comma delimited values in one very complex annotation?

@vanaukenk
Copy link
Contributor

IIRC, introducing pipes into the With/From field was aimed at time-saving/efficiency wrt curation. I can't otherwise think of a reason pipe-separated With/From entries couldn't be separate annotations.

For annotation extensions, though, my recollection was that there was concern that individual annotations should not just differ by extensions, partly because of concern about how analysis tools, most (all?) of which don't take extensions into account, would handle seemingly 'duplicate' annotations. Perhaps this is no longer a real concern and we should change our SOP on this.

@ukemi
Copy link
Contributor

ukemi commented Jun 17, 2020

We should re-examine. If I understand correctly when we round trip, they will come out as individual annotations.

@ukemi
Copy link
Contributor

ukemi commented Jun 17, 2020

@vanaukenk, I think we are thinking along the same lines, but we need to take a closer look at this. Does it give us an advantage to split out the with/from field? Does it give us an advantage to split out the annotation extensions? For the latter, several pipe-separated annotation extensions could in fact lead to different primary inferred annotations, so there is a distinct advantage.
For example 'organ development'-----results_in_development_of heart|results_in_development_of kidney would yield two different annotations by inference: heart development and kidney development.

@ukemi
Copy link
Contributor

ukemi commented Jun 17, 2020

I can't think of an advantage to splitting out pipe-separated with/from fields.

@balhoff
Copy link
Member Author

balhoff commented Jul 1, 2020

@vanaukenk @ukemi @kltm for GPI columns 2, 3, and 4:

The values are generally text. Columns 2 and 3 are specified as "xxxx", and column 4 is "Label" (undefined). Could these all be defined as Alphanumeric? Or are other characters like hyphens expected?

@ukemi
Copy link
Contributor

ukemi commented Jul 1, 2020

MGI:109346 Gene Symbol Nkx1-1 (Column 2), synonym Nkx-1.1 (column 4)
MGI:3629589 Gene Symbol Mir181a-1 (Column 2), synonyms mir 181a-1|miR-213|Mirn181a-1|Mirn213|mmu-mir-181a-1|mmu-mir-213 (column 4), gene name microRNA 181a-1 (column 3)

@balhoff
Copy link
Member Author

balhoff commented Jul 1, 2020

@ukemi okay I expected as much. :-)

@kltm
Copy link
Member

kltm commented Jul 1, 2020

@balhoff This is some of where the "ASCII, except for non-printing characters, tabs, new lines, control characters, quotation marks, or pipe" comes from (and we even said that quotes are probably fine too on a call). Doesn't this also bump into the utf-8 question? The future may indeed look different: geneontology/helpdesk#265
geneontology/go-site#1514
geneontology/go-site#1515

@dougli1sqrd
Copy link
Contributor

dougli1sqrd commented Jul 28, 2020

In the GPAD 2 spec I noticed that the symbol Relation_ID may be from the RO ontology or from go_rel. In practice this makes parsing hard. Should we think about whittling this down to a defined set of relations? Or maybe we can generate a file that has all the allowed CURIEs and perhaps the label to make parsing easier. @balhoff @cmungall

@vanaukenk
Copy link
Contributor

@dougli1sqrd

You are absolutely right about this, and we are working to move whatever relations we can from gorel to RO.

That said, there are some gorel relations, like 'regulates o occurs in', that will have to stick around for the foreseeable future until we can move all of the annotations that use it into GO-CAM models.

I'm open to suggestions/thoughts on how to make any of this easier.

@kltm
Copy link
Member

kltm commented Jul 28, 2020

Maybe a topic for the software meeting?

@dougli1sqrd
Copy link
Contributor

Hi @ukemi, @kltm, @vanaukenk, etc!

Taking a look at the GPI 2.0 spec file, I have some more questions actually.

In the spec file, we have a grammar for each entity line here: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md#gp-entities
Specifiying:

Entity ::= Col_1 tab Col_2 tab ... Col_9 nl

But in the table below, under https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md#gpi-columns it appears that there are 11 possible columns. Should the above Entity spec say up to Col_9?

In the actual column specifications, for columns 3 and 5, the cardinality is [n] or greater, but the field spec doesn't show a separator or how they should be joined (like in column 4). Are those columns actually allowed to take on lists of values?

@vanaukenk
Copy link
Contributor

@dougli1sqrd

I think the line below should actually say Col_11.

Entity ::= Col_1 tab Col_2 tab ... Col_9 nl

Some of the specs got copied over from a previous iteration and that update got missed.

For columns 3 and 5, I think we'd follow a similar use of '|' as a separator.

I can update the spec if what I've said here makes sense to you.

@dougli1sqrd
Copy link
Contributor

@vanaukenk Okay great! Thanks for the clarification. That makes sense to me.

@vanaukenk
Copy link
Contributor

Updated specification. Thanks @dougli1sqrd

@dougli1sqrd
Copy link
Contributor

dougli1sqrd commented Dec 16, 2020

Thanks! Sorry, just noting that column 5 I think now reads like it can accept an empty field: DB_Object_Type ::= [OBO_ID] ('|' OBO_ID)* Do we want this to be DB_Object_Type ::= OBO_ID ('|' OBO_ID)*? (tagging @balhoff @vanaukenk )

@vanaukenk
Copy link
Contributor

Thanks @dougli1sqrd You're right; I'll fix that.
I also double-checked the GPAD spec and the same issue existed with the reference field there, so I'll update that, too.

@pgaudet
Copy link
Contributor

pgaudet commented Feb 13, 2024

The time for commenting is over. If issues need to be discussed, please open a new ticket.

@pgaudet pgaudet closed this as completed Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants