
Lexmatch revamp #397

Merged: 7 commits merged into main on Jan 29, 2024
Conversation

matentzn (Member) commented Jan 22, 2024:

This PR

  • adds a few missing prefixes to merged.db (through prefixes.csv)
  • replaces the prefix_map parameter with the converter parameter in src/scripts/lexmatch-sssom-compare.py
  • skips meta in src/scripts/match-mondo-sources-all-lexical.py, which is not urgently needed and was causing all the problems @joeflack4 is fixing
  • amends the preprocessing queries that were adding junk xrefs to the pipeline
  • updates src/ontology/config/prefixes.csv to cover all relevant prefixes for the matching
  • updates src/ontology/metadata/* configs to avoid conflicting namespaces

@joeflack4 I am not reading your answers etc. - please review this PR with @twhetzel and see if you have any concerns with these changes.

To test this PR if you like:

docker pull obolibrary/odkfull:dev
ODK_TAG=dev ./run.sh make lexmatch/README.md -B IMP=false PAT=false MIR=false

I ran it locally - now testing the whole pipeline (I circumvent the main problem you are trying to solve by simply ditching meta here, hoping it's not needed).

@@ -85,7 +85,8 @@ def main(verbose: int, quiet: bool):
 def run(input: str, config: str, rules: str, rejects: str, output: str):
     # Implemented `meta` param in `lexical_index_to_sssom`

-    meta = get_metadata_and_prefix_map(config)
+    #meta = get_metadata_and_prefix_map(config)
+    meta = None
joeflack4 (Contributor) commented Jan 22, 2024:

lexical_index_to_sssom(meta=None)

Nico wrote:

> skips meta in src/scripts/match-mondo-sources-all-lexical.py which is not urgently needed and was causing all the problems @joeflack4 is fixing

You are more familiar with all of the lexmatch outputs, and you say that running it locally was fine, so it seems there are no significant negative side effects from this.

You say it is not urgently needed, so I take it there is still a desire to eventually do what I did in #394 (pass a Converter) or similar, perhaps after addressing INCATools/ontology-access-kit#698. So perhaps I should open a medium-priority ticket for this, or keep my current PR open and rebase from / conform to what you have here.

Not important, but to clarify: this wasn't causing all the problems I am fixing. Rather, it is a solution for the last remaining unfixed (conditional) problem in #394.
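For context, the two approaches being compared (passing an explicit prefix map/Converter as in #394, versus passing meta=None and relying on library defaults as in this PR) can be illustrated with a stdlib-only sketch. The function below is a hypothetical stand-in for illustration, not the actual oaklib/sssom API:

```python
# Hypothetical stand-in for the behavior under discussion: an explicit
# prefix map (like a curies Converter) drives CURIE expansion; passing
# None falls back to whatever defaults the library ships with.
DEFAULT_PREFIX_MAP = {"MONDO": "http://purl.obolibrary.org/obo/MONDO_"}

def expand_curie(curie, prefix_map=None):
    """Expand a CURIE such as 'OMIMPS:214100' to a full URI."""
    pm = DEFAULT_PREFIX_MAP if prefix_map is None else prefix_map
    prefix, local_id = curie.split(":", 1)
    if prefix not in pm:
        raise KeyError(f"Unknown prefix: {prefix}")
    return pm[prefix] + local_id

# Explicit map (analogous to passing a Converter in #394):
custom = {"OMIMPS": "https://omim.org/phenotypicSeries/PS"}
print(expand_curie("OMIMPS:214100", custom))
# -> https://omim.org/phenotypicSeries/PS214100

# prefix_map=None (analogous to meta=None here): only defaults are known.
print(expand_curie("MONDO:0005015"))
# -> http://purl.obolibrary.org/obo/MONDO_0005015
```

The practical difference is only which prefixes are recognized at expansion time; if the defaults cover everything the pipeline emits, the two paths produce the same output.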

joeflack4 (Contributor) commented:

Trish wrote:

> @joeflack4 have you also run the entire pipeline locally? Did you find that Nico's solution to "ditch meta" worked as intended?

I haven't run the whole pipeline locally. I'll give it a go, but I'm terrified of it; I've heard it takes over 24 hours, and sometimes this can interfere with other projects I'm working on. My computer wasn't actually capable of doing this until last month (without skipping some things like NCIT). When I've asked Nico if he wants me to do this before, he's said I could hold off. But I'll at least give it a go.

Regarding checking the results of meta=None myself, I'll give that a go. I was trusting Nico, and it sounds like he got the results he expected, but it's good to have a second pair of eyes, especially if you think so. Another difference I just noticed is that I've also been passing no meta (same as passing None), but I've been passing prefix_map instead. I'm not sure it matters in this case, but I'd better remove that as well and run something closer to what Nico is running. I'd check out his PR as well, but I think it's better to run on my PR, because Nico made some other changes to prefixes.csv and I want to more closely compare the outputs I got before/after the simple meta=None/prefix_map=None change.

@@ -85,7 +85,8 @@ def main(verbose: int, quiet: bool):
 def run(input: str, config: str, rules: str, rejects: str, output: str):
     # Implemented `meta` param in `lexical_index_to_sssom`

-    meta = get_metadata_and_prefix_map(config)
+    #meta = get_metadata_and_prefix_map(config)
+    meta = None
joeflack4 (Contributor) commented:

#394 redundancy

The main difference is that #397 passes meta=None to lexical_index_to_sssom(), whereas #394 passes a Converter. Supposedly the #397 approach is fine.

Everything else in that PR is now cosmetic, so I can merge just those changes afterwards, or ignore and close.

My instinct is to feel a little irked that I spent a lot of work and found a way to get the pipeline working without throwing the oio error (although #394 currently has oio removed and loses outputs), and that my work, which I think had the same overall effect, was supplanted by Nico's and Charlie's PRs. But maybe I shouldn't feel irked; I laid the groundwork, this is just the nature of collaboration, and Nico and Charlie are more senior here and have a better idea of how things should be done.

Also, I see that this PR does add more than #394, such as missing prefixes.

@@ -227,12 +227,27 @@ ICD10CM,http://purl.bioontology.org/ontology/ICD10CM/
 ICD10CM2,https://icd.codes/icd10cm/
 ICD10WHO,https://icd.who.int/browse10/2019/en#/
 ICD10WHO2010,http://apps.who.int/classifications/icd10/browse/2010/en#/
-OMIMPS,https://www.omim.org/phenotypicSeries/PS
+OMIMPS,https://omim.org/phenotypicSeries/PS
joeflack4 (Contributor) commented:

prefixes.csv updates

These changes will eventually be supplanted by dynamic creation of prefixes.csv, but they seem necessary for now.

Harshad and I should have thought to check the status of this sooner.
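As a stdlib-only sketch of what loading these mappings might look like (assuming the simple PREFIX,URI row layout visible in the diff above; the function name is hypothetical):

```python
import csv
import io

# Sample rows in the PREFIX,URI layout shown in the diff above.
prefixes_csv = """\
ICD10WHO,https://icd.who.int/browse10/2019/en#/
OMIMPS,https://omim.org/phenotypicSeries/PS
"""

def load_prefix_map(fh):
    """Read PREFIX,URI rows into a dict, skipping blank lines."""
    return {row[0]: row[1] for row in csv.reader(fh) if row}

prefix_map = load_prefix_map(io.StringIO(prefixes_csv))
print(prefix_map["OMIMPS"])  # -> https://omim.org/phenotypicSeries/PS
```

A dynamically generated prefixes.csv would presumably be consumed the same way, which is why keeping the manual entries consistent matters in the meantime.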

src/sparql/fix_xref_prefixes.ru (resolved)
src/sparql/rm_xref_by_prefix.ru (resolved)
@joeflack4 added the "bug (Something isn't working)" label on Jan 23, 2024
@joeflack4 linked an issue on Jan 23, 2024 that may be closed by this pull request
joeflack4 (Contributor) left a review comment:

I trust your review of the outputs, and if you say it is working locally and that you're verifying by running the pipeline, I'm fine for you / us to merge this after that verification has been made!

twhetzel (Contributor) commented:

> I ran it locally - now testing the whole pipeline (I circumvent the main problem you are trying to solve by simply ditching meta here, hoping it's not needed).

@joeflack4 have you also run the entire pipeline locally? Did you find that Nico's solution to "ditch meta" worked as intended?

matentzn (Member, Author) left a comment:

Approved... phew!

@twhetzel @joeflack4

I looked through all the results - they look correct to me. A lot of issues fixed, way beyond the original problem with the prefixes.

I won't be looking at this anymore this week - please merge this, write down any questions you have in a Google doc so we can document things more permanently, and once I see the merge, I will immediately make another run.

@joeflack4 if you want to retain the contents of your comments, please move them to issues.

This was a bit painful, sorry about that. The source of all problems was a major refactor of sssom-py and curies, which led to 3 completely independent major issues all around the pipeline. The debugging process has led to some good fixes in curies (thanks @joeflack4), and I hope some better understanding of what is happening. Let's keep one thing in mind for the future:

Never ever again do we merge a branch unless a full run of the pipeline was successful. I can be responsible for the runs until we have a runner on GCP.

@twhetzel I will post a comment on Slack now to advise Nicole and Sabrina to push forward the syncing.

twhetzel (Contributor) left a comment:

Nico mentioned all looks good, and it looks good to me at this point.

# SNOMEDCT_US_2021_09_01: http://identifiers.org/snomedct/
# SNOMEDCT_US_2022_07_31: http://identifiers.org/snomedct/
# SNOMEDCT_US_2022_10_31: http://identifiers.org/snomedct/
# SNOMEDCT_US_2023_02_28: http://identifiers.org/snomedct/
twhetzel (Contributor) commented:

@matentzn why are these commented out vs. removed or what are these comments a reminder of in the future?

@twhetzel merged commit 28166e2 into main on Jan 29, 2024
@twhetzel deleted the lexmatch-revamp branch on January 29, 2024 at 05:27

Labels: bug (Something isn't working)
Linked issue that may be closed by this pull request: Bug in match-mondo-sources-all-lexical.py
3 participants