Extend pattern writer to test for validity of pubs & features FB & add to KB #5

Open
dosumis opened this issue Oct 15, 2019 · 11 comments
@dosumis
Member

dosumis commented Oct 15, 2019

Notes on design and implementation.

The spec originally allowed features to be referred to by name only, but this adds quite a bit of overhead for handling special characters. For example: should Greek letters always be spelt out, should they use SGML, or should they use UTF-8? I'd rather follow a simpler approach if possible: whenever a feature must be referred to, the curation table has two columns: one for the name and one for the ID. Loading scripts check that the name given is present in the set of synonyms/names for the feature in question in FlyBase. A warning is thrown if this is not the case.
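A minimal sketch of the name-versus-ID check described above. The function name `check_feature_name` and the `synonyms_by_id` lookup are hypothetical; a real loader would presumably build the lookup from FlyBase rather than hard-code it.

```python
import warnings


def check_feature_name(fbid, name, synonyms_by_id):
    """Return True if `name` is a known name or synonym for `fbid`.

    Emits a warning (rather than raising) when the check fails,
    mirroring the "a warning is thrown" behaviour described above.
    """
    known = synonyms_by_id.get(fbid)
    if known is None:
        warnings.warn(f"{fbid}: ID not found in lookup")
        return False
    if name not in known:
        warnings.warn(f"{fbid}: {name!r} is not a known name or synonym")
        return False
    return True


# Hypothetical lookup: FBid -> set of names and synonyms.
synonyms_by_id = {
    "FBal0097158": {
        "Scer\\GAL4[alphaTub84B.PL]",
        "Scer\\GAL4[Tub84B.PL]",
    }
}
```

Warning instead of failing lets a whole curation table be checked in one pass, so a curator sees every mismatch at once.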

@gm119 - Do you think this is viable?

The spec for Pubs originally specified an FBrf or, if not available, a DOI. This could potentially be supplemented by a miniref for cross-checking.
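As a rough illustration of the "FBrf or, if not available, a DOI" rule, the sketch below classifies a publication identifier by its shape only. Both patterns are assumptions: the FBrf pattern assumes the prefix plus seven digits, and the DOI pattern is a loose shape check, not proof that the DOI resolves. The name `classify_pub_id` is hypothetical.

```python
import re

# Assumed shape: 'FBrf' followed by exactly seven digits.
FBRF_RE = re.compile(r"^FBrf\d{7}$")
# Loose DOI shape: '10.', a registrant code, '/', then a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_pub_id(value):
    """Return 'FBrf', 'DOI', or None if the value matches neither shape."""
    value = value.strip()
    if FBRF_RE.match(value):
        return "FBrf"
    if DOI_RE.match(value):
        return "DOI"
    return None
```

A shape check like this only catches typos; confirming that the publication exists would still need a lookup against FlyBase or a DOI resolver.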

CC @Clare72 @gm119

@dosumis changed the title from "Extend peevish to test for validity of pubs & features FB & add to KB" to "Extend pattern writer to test for validity of pubs & features FB & add to KB" on Nov 12, 2019
@gm119

gm119 commented Nov 18, 2019

whenever a feature must be referred to, the curation table has two columns: one for the name and one for the ID.

The trouble with this is that it doubles the amount of curation that needs doing: a curator has to copy and paste the FBid from somewhere, which is time-consuming, so it would be good to avoid this if possible.

I would not recommend UTF-8 (which should hopefully make it a bit simpler !)

The format we currently use is sgml for special characters - this is what is in all the lookup files that curators would use to identify the correct feature, so if it were possible to use the sgml format, that would match the lookup files the curators will be using, which might be helpful.

Peeves now looks in the synonym_sgml column of the synonym table to get the current sgml symbol for a feature (it does have to convert superscripts and subscripts too).

I've got a question:

Loading scripts check that the name given is present in the set of synonyms/names for the feature in question in FlyBase. A warning is thrown if this is not the case.

What tables are you using to look this up? If you are using the synonym table, you may already have the SQL needed to get the correct uniquename.

I can dig out the Peeves queries if that would help ?

I think if you can find a way to use symbols without making them also add in the FBids, it'll be way faster for curators.

@gm119

gm119 commented Nov 19, 2019

Hi @dosumis, I've been thinking about this. The bottom line is that we should avoid curators having to fill in two columns for the same thing. But perhaps, because image curation is going to have simpler genotypes than I am used to in phenotype curation:
a. maybe FBids will be OK for this (if it's loads of work to get it working for symbols).
b. OR maybe there is a compromise where the split combination symbol can be used for splits and the FBid for others?

I tried looking for the specs and examples so that I could make more concrete suggestions for columns etc., but I can't find them. I got to the https://github.com/VirtualFlyBrain/curation/wiki/Curation-wiki--Home wiki, but the links on there are not working at the moment.

I'm specifically interested in these (doc from wiki):

Curation record types for adding new images:

  • Expression pattern (YAML spec; Example): Specify a driver using a FlyBase feature name. Submission of this curation record will create the expression pattern node if it does not already exist.
    
  • Split (YAML spec; Example): Specify AD and DBD using FlyBase feature names. Submission of this curation record will create the appropriate split expression pattern node if it does not already exist.
    
  • Anatomy (YAML spec; Example): Specify Classification (IS_A) and reasons for classification; Optionally specify a driver.
    

The 'YAML spec' and 'Example' links for Expression pattern and Split just go to the wiki page. The Anatomy 'YAML spec' link goes to https://github.com/VirtualFlyBrain/curation/blob/master/records/anatomy_spec.yaml but I get a page-not-found error (I am logged in!)

ta

@dosumis
Member Author

dosumis commented Nov 19, 2019

Peeves now looks in the synonym_sgml column of the synonym table to get the current sgml symbol for a feature (it does have to convert superscripts and subscripts too).

Thanks for the pointer. I think I may be missing something. We use synonym.synonym_sgml to pull names and synonyms for VFB:

https://github.com/VirtualFlyBrain/VFB_neo4j/blob/master/src/uk/ac/ebi/vfb/neo4j/flybase2neo/feature_tools.py#L117

I always thought the name synonym_sgml was a bit odd: the column contains SGML superscript and subscript markup but, as far as I can tell, doesn't always have SGML Greek entities. Instead it is the only column with Unicode versions of names and synonyms:

Using this

SELECT DISTINCT
    f.uniquename AS fbid,
    f.name AS feature_name,
    s.name AS ascii_name,
    stype.name AS stype,
    fs.is_current,
    s.synonym_sgml AS unicode_name
FROM feature f
LEFT OUTER JOIN feature_synonym fs ON (f.feature_id = fs.feature_id)
JOIN synonym s ON (fs.synonym_id = s.synonym_id)
JOIN cvterm stype ON (s.type_id = stype.cvterm_id)
WHERE f.uniquename = 'FBal0097158';

=>

| fbid | ascii_name | stype | is_current | unicode_name |
| --- | --- | --- | --- | --- |
| FBal0097158 | Scer\Gal4[alphaTub84B.PL] | symbol | f | Scer\Gal4<up>alphaTub84B.PL</up> |
| FBal0097158 | Scer\GAL4[Tub84B.PL] | symbol | f | Scer\GAL4<up>Tub84B.PL</up> |
| FBal0097158 | alpha-Tubulin at 84B promoter construct of Lee | fullname | t | α-Tubulin at 84B promoter construct of Lee |
| FBal0097158 | Scer\GAL4[alphaTub84B.PL] | symbol | t | Scer\GAL4<up>αTub84B.PL</up> |
| FBal0097158 | Scer\GAL4[alphaTub84B.PL] | symbol | f | Scer\GAL4<up>alphaTub84B.PL</up> |

The synonym.synonym_sgml column (here I've called it unicode_name, as that's what I use it for) has Unicode versions of the official symbols and fullname, but no SGML Greek entities.

Note in this case feature.name = 'Scer\GAL4[alphaTub84B.PL]' - also no sgml greek. Matching on feature.name is more appealing as there's only one so I can use this for a fast lookup. Is the greek always spelt out in these cases?
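A minimal sketch of the fast lookup on feature.name mentioned above, assuming the (uniquename, name) pairs have already been pulled from the feature table. The function `build_name_index` is hypothetical. Because the thread only establishes that each feature has one name, not that names are unique across features, the sketch sets aside any name shared by more than one ID.

```python
def build_name_index(rows):
    """Build a name -> FBid index from (uniquename, name) pairs.

    Names that map to more than one FBid are left out of the index
    and returned separately so they can be reported.
    """
    index = {}
    clashes = set()
    for fbid, name in rows:
        if name in index and index[name] != fbid:
            clashes.add(name)
        index.setdefault(name, fbid)
    for name in clashes:
        del index[name]
    return index, clashes


# Illustrative rows; the two 'dup' IDs are made up for the example.
rows = [
    ("FBal0097158", "Scer\\GAL4[alphaTub84B.PL]"),
    ("FBxx0000001", "dup"),
    ("FBxx0000002", "dup"),
]
index, clashes = build_name_index(rows)
```

With Greek letters spelt out in feature.name, the curator's ASCII symbol can be used as the dictionary key directly, with no character conversion.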

@dosumis
Member Author

dosumis commented Nov 19, 2019

Comments just crossed. Reading yours now.

@gm119

gm119 commented Nov 19, 2019

(ditto !)

@dosumis
Member Author

dosumis commented Nov 19, 2019

Re wiki: specs are in a branch and I've done some re-arranging. Latest versions are here - but still draft: https://github.com/VirtualFlyBrain/curation/tree/configs_and_test_recs/records/new_images

Need to conclude this discussion before finalising. I'll wire them up to the wiki, and update that, when details settle down a bit more.

@gm119

gm119 commented Nov 19, 2019

thanks for the link !

@gm119

gm119 commented Nov 19, 2019

OK,

  1. re sgml vs Unicode/UTF-8 in synonym.synonym_sgml - you are right, it does now have Unicode (I think it's supposed to be UTF-8) in this slot rather than sgml format (I think it was sgml when we first moved to chado, but the parser was later changed to put in UTF-8).

This means that when Peeves is given an sgml symbol to check for validity, it first turns it into UTF-8 using a subroutine that has a mapping of sgml entities (&agr;, &bgr; etc.) to UTF-8, turns [ ]-style superscripts and subscripts into <up>…</up> and <down>…</down> markup, and then uses that UTF-8 symbol to query the synonym.synonym_sgml column.
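The conversion described above can be sketched as follows. This is not the Peeves code: the entity table contains only the two entities named in the thread, and the `[[ ]]` subscript syntax and `<down>` tag are assumptions (only `[ ]` and `<up>` appear in the query output above). The function name `sgml_symbol_to_utf8` is hypothetical.

```python
import re

# Only the two entities mentioned in the thread; a real table would be longer.
SGML_GREEK = {"&agr;": "α", "&bgr;": "β"}


def sgml_symbol_to_utf8(symbol):
    """Convert a curator-style symbol to the form stored in synonym_sgml."""
    for entity, char in SGML_GREEK.items():
        symbol = symbol.replace(entity, char)
    # Assumed subscript syntax: handle double brackets before single ones.
    symbol = re.sub(r"\[\[(.*?)\]\]", r"<down>\1</down>", symbol)
    # Superscripts: [x] -> <up>x</up>
    symbol = re.sub(r"\[(.*?)\]", r"<up>\1</up>", symbol)
    return symbol
```

For example, `sgml_symbol_to_utf8("Scer\\GAL4[&agr;Tub84B.PL]")` gives the current symbol shown in the query output above, `Scer\GAL4<up>αTub84B.PL</up>`.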

This sounds like too much conversion back and forth to me for the image curation code, so I reckon that if we think that we do need to use symbols, then getting curators to spell out any greeks would be the way to go, because the answer to your question:

Note in this case feature.name = 'Scer\GAL4[alphaTub84B.PL]' - also no sgml greek. Matching on feature.name is more appealing as there's only one so I can use this for a fast lookup. Is the greek always spelt out in these cases?

is yes, the greek is always spelt out in the feature.name

I've had a look at the yaml specs and I've got a couple of ideas re symbol vs FBid (will put in next comment)

@gm119

gm119 commented Nov 19, 2019

I've got a suggestion for the split_spec.yaml file: use the split combination symbol (e.g. MB002B, SS00043) where possible instead of (or possibly as well as) specifying the AD and DBD. This symbol ought to be unique, so I think you should be able to identify the components of a combination using the FlyLight-to-FB mapping you already have in the VFB knowledgebase.

The code I made for Alex for expression curation uses the following information (1-3 submitted as columns, 4. submitted once when running script) to uniquely identify the line components involved.

  1. combination symbol, e.g. MB077B

  2. DBD identifier, e.g. R25D01 or GMR25D01

  3. AD identifier, e.g. R19F09 or GMR19F09

  4. DBD type, e.g. GAL4. This last one is to future-proof it in case there are split lexA etc. expression patterns. You don't need a similar field for the AD, since the same AD half will work with any compatible DBD, so only one AD set has been made.

In terms of filling in columns 1-3, there are options in the code:

a. fill in combination symbol only

  • curator uses when combination symbol is only thing given
  • code looks up in FlyLight-to-FB mapping for combinations and figures out DBD and AD

b. fill in all info

  • curator uses when there is a nice table of combo symbol, DBD, AD so not loads of extra work to paste into tsv
  • code looks up in FlyLight-to-FB mapping for combinations and figures out DBD and AD
  • code ALSO checks that DBD and AD given by curator match those that the FlyLight-to-FB mapping thinks the combination corresponds to, and prints error if mismatch

c. DO NOT fill in combination symbol, fill in DBD identifier, AD identifier, DBD type

  • curator uses when the split combination symbol isn't given (or it's a non-FlyLight combination, so there isn't a combination symbol)
  • code uses FB mapping for individual hemidrivers and figures out DBD and AD. This relies on the fact that all the hemidrivers are members of either the GMR_Brain_exp_1 or VDRC-VT dataset and that they are marked as hemidrivers in the db using the experimental_tool info (so it's not just limited to the FlyLight set of combinations, but covers all the split lines that we have).

For image curation, I think we'll probably want to use either a. or c. mostly. Using the symbols is probably more human-readable, and for the GMR/VT lines it's still possible to map unambiguously using symbols.
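Options a. to c. above can be sketched roughly as below. This is not the expression-curation script: `resolve_split`, `normalise` and the `combo_map` contents are hypothetical, the example mapping simply reuses the identifiers quoted in the list above without confirming they belong together, and treating R25D01 and GMR25D01 as the same identifier is an assumption. For option c. the sketch stops at normalised identifiers, where real code would go on to look up the FlyBase hemidrivers.

```python
def normalise(identifier):
    """Assumption: 'GMR25D01' and 'R25D01' name the same line."""
    return identifier[2:] if identifier.startswith("GMR") else identifier


def resolve_split(combo_map, combo=None, dbd=None, ad=None, dbd_type=None):
    """Return (dbd, ad, errors) for one curation row."""
    errors = []
    if combo:
        if combo not in combo_map:
            return None, None, [f"unknown combination symbol {combo}"]
        mapped_dbd, mapped_ad = combo_map[combo]  # option a.
        # Option b.: cross-check anything the curator also supplied.
        if dbd and normalise(dbd) != mapped_dbd:
            errors.append(f"{combo}: DBD {dbd} does not match {mapped_dbd}")
        if ad and normalise(ad) != mapped_ad:
            errors.append(f"{combo}: AD {ad} does not match {mapped_ad}")
        return mapped_dbd, mapped_ad, errors
    if dbd and ad and dbd_type:  # option c.
        return normalise(dbd), normalise(ad), errors
    return None, None, ["need a combination symbol, or DBD, AD and DBD type"]


# Illustrative mapping only; not taken from the real FlyLight-to-FB data.
combo_map = {"MB077B": ("R25D01", "R19F09")}
```

Returning errors as a list, instead of raising on the first one, lets a loader report every problem row in a table in a single run.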

@gm119

gm119 commented Nov 19, 2019

For ep_spec.yaml,

I think maybe this description needs tweaking:

Please use transposon rather than insertion.

For promoter fusion transgenes, please use transposon rather than insertion. For enhancer trap lines, please use insertion.

(i.e. it could be either FBtp or FBti ??)

For this, perhaps we should see whether curators find it OK to fill in just the FBid. Given that in most cases it's going to be a single driver per image (and sometimes an effector), it might not be too confusing to use IDs for this rather than symbols?
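If IDs are used here, a loader could tell the two cases apart from the ID prefix alone. A minimal sketch, assuming FBtp and FBti IDs are the prefix plus seven digits; `driver_kind` is a hypothetical helper and the IDs in the usage note are made up.

```python
import re

# Assumed shape: prefix plus seven digits, as in other FlyBase IDs.
DRIVER_ID_RE = re.compile(r"^(FBtp|FBti)\d{7}$")
KIND = {"FBtp": "transposon", "FBti": "insertion"}


def driver_kind(fbid):
    """Return 'transposon', 'insertion', or None for any other ID."""
    match = DRIVER_ID_RE.match(fbid.strip())
    return KIND[match.group(1)] if match else None
```

For instance, `driver_kind("FBtp0000001")` returns `"transposon"`, while an allele ID such as FBal0097158 returns `None` and could be flagged for the curator.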

@dosumis
Member Author

dosumis commented Nov 26, 2019

I'm now allowing name-only for features. Will only change if we hit problems with special characters.
