Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade externally managed content #628

Open
wants to merge 45 commits into
base: develop
Choose a base branch
from

Conversation

matentzn
Copy link
Member

@matentzn matentzn commented Aug 5, 2024

Overview

This PR:

  • Adds all missing pieces of externally managed content to Mondo Ingest (EFO, ClinGen, MedGen)
  • And make goals to update them

Pre-merge checklist

Documentation

Was the documentation added/updated under docs/?

  • Yes
  • No, updates to the docs were not necessary after careful consideration

QC

Was the full pipeline run before submitting this PR using sh run.sh make build-mondo-ingest on this branch (after
docker pull obolibrary/odkfull:dev), and no errors occurred?

  • Yes
  • No, there are no functional (code-related) changes to the pipeline in the PR, so no re-run is necessary

New Packages

Were any new Python packages added?

Were any other non-Python packages added?

PR Review and Conversations Resolved

Has the PR been sufficiently reviewed by at least 1 team member of the Mondo Technical team and all threads resolved?

  • Yes

@matentzn matentzn changed the base branch from main to develop August 5, 2024 16:32
Unfortunately, I cant find the orginal TSV file, but it is highly unlikely this file will ever change.
@joeflack4 joeflack4 added the enhancement New feature or request label Aug 5, 2024
src/ontology/mondo-ingest.Makefile Outdated Show resolved Hide resolved
src/ontology/mondo-ingest.Makefile Outdated Show resolved Hide resolved
src/ontology/external/efo-proxy-merges.robot.owl Outdated Show resolved Hide resolved
@matentzn matentzn marked this pull request as ready for review August 18, 2024 19:17
@matentzn
Copy link
Member Author

matentzn commented Aug 19, 2024

Next steps:

  • @twhetzel to update develop from main
  • @twhetzel to update externalclingenmedgenefo from develop
  • @twhetzel to update mondo-ingest/src/ontology/config/robot_report_external_content.txt with all important checks she can think of
  • @matentzn to run full build of pipeline and _commit changes to all files in src/ontology/external to this branch (only those files)
  • @twhetzel to approve PR and merge
  • @matentzn CHANGE emc pipeline to point to main branch
    @twhetzel agreed?

.gitignore Outdated Show resolved Hide resolved
src/ontology/mondo-ingest.Makefile Outdated Show resolved Hide resolved
src/ontology/mondo-ingest.Makefile Outdated Show resolved Hide resolved
src/scripts/post_process_externally_managed_content.py Outdated Show resolved Hide resolved
src/ontology/mondo-ingest.Makefile Outdated Show resolved Hide resolved
src/scripts/post_process_externally_managed_content.py Outdated Show resolved Hide resolved
src/scripts/post_process_externally_managed_content.py Outdated Show resolved Hide resolved
src/scripts/post_process_externally_managed_content.py Outdated Show resolved Hide resolved
src/ontology/medgen-xrefs.robot.template.tsv Outdated Show resolved Hide resolved
src/ontology/mondo-ingest.Makefile Outdated Show resolved Hide resolved
@twhetzel twhetzel self-requested a review August 24, 2024 03:35
Copy link
Member Author

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one change request (related to EFO proxy merges, see comments) which I would like your opinion on @twhetzel, else I think this is good to go. You can run the pipeline with:

sh run.sh make update-externally-managed-content -B

However, it does take a long time, so strap in. The biggest downside of this PR is the fact that I chose to properly simulate the EMC pipeline, which adds a big overhead on the runtime.

EDIT: Not that long in the end, at least if you discount the mondo repo refresh:

### DEBUG STATS ###
Elapsed time: 0:13.96
Peak memory: 1669976 kb

I just (24.08.2024, 2pm EEST) finished a complete run of the external pipeline, and all looks good to me. So remains to be seen what you think of my comments and your general review.

| MONDO:0007030 | | | Aarskog Syndrome | | | | | | nord | duplicate_scoped_synonym (oboInOwl:hasExactSynonym) |
| MONDO:0007306 | | | Klippel-Feil Syndrome | | | | | | nord | duplicate_scoped_synonym (oboInOwl:hasBroadSynonym) |
| MONDO:0007389 | | | Spondylothoracic Dysplasia | | | | | | nord | duplicate_scoped_synonym (oboInOwl:hasExactSynonym) |
| MONDO:0005676 | | | | | | http://purl.obolibrary.org/obo/mondo#nord_rare | | | nord | qc-animal-disease-rare (oboInOwl:inSubset) |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am glad this made it through after some fiddling!


$(TMPDIR)/mondo-with-simulated-emc-run.owl: $(TMPDIR)/mondo_repo_built
mkdir -p $(TMPDIR)/mondo/src/ontology/tmp/external/
$(foreach n,$(EXTERNAL_FILES), cp -f $(EXTERNAL_CONTENT_DIR)/$(n).robot.owl $(TMPDIR)/mondo/src/ontology/tmp/external/processed-$(n).robot.owl;) \
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line here is the crux of the whole PR. Instead of creating our own EMC pipeline here, we are letting the real EMC pipeline in the Mondo repo run. This works basically by copying the unprocessed files over into the mondo repo, running the EMC pipeline there as it is, and then copying the mondo edit back here for running our new QC system.

# Unfortunately, @matentzn cant find the orginal TSV file, but it is highly unlikely this file will ever change.

$(EXTERNAL_CONTENT_DIR)/efo-proxy-merges.robot.owl:
echo "WARNING: $@ is currently manually curated!"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my self review, I am a bit inclined now to drop this file entirely from the pipeline and have this being fixed on the EFO side - like with all other content we do. Here is what this file does:

  1. There are 3 cases or 4 where two EFO ids map to the same Mondo ID
  2. This file makes that valid by adding "preferred external" to one of the EFO ids in each pair

I feel now on second thought that we should not easily validate externally provided proxy merged IDs, and work with the upstream sources to fix them.

Copy link
Contributor

@twhetzel twhetzel Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matentzn I agree that we should not fix these on our side, but work with the provider of the source to fix these or a a minimum have the provider determine which should be tagged as preferred. There is also usually some curator work to determine which one should be the "preferred external" so automatically choosing which ID skips this needed work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I have fixed it now. I made a test here: https://github.com/monarch-initiative/mondo/pull/8120/files#r1734450583

This commit describes the removal of that part from the pipeline: 52fdd09

Note that I have yet deleted the efo-merges owl file from this PR, but I have removed all its content, as there is a cyclic dependency here that was too cumbersome to remove. If you like you can remove these files after first removing references to this file from mondo.Makefile, e.g.:

https://github.com/monarch-initiative/mondo/blob/49291cff03da830ec8826c623f042b2ecfafd6d4/src/ontology/mondo.Makefile#L457

cp mondo-edit.owl mondo-edit.owl.bak && \
cp mondo-edit.obo mondo-edit.obo.bak && \
make update-external-content-incl-rare -B MIR=false IMP=false MIR=false DOWNLOAD_EXTERNAL=false &&\
make mondo-edit.owl MIR=false IMP=false MIR=false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this goal in the Makefile?

Copy link
Contributor

@joeflack4 joeflack4 Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it has already made its way into mondo makefile on main.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all that you see here is in the context of the mondo, not mondo-ingest pipeline!

Copy link
Contributor

@twhetzel twhetzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will Approve for now, but I do think that proxy merges from EFO should not be fixed automatically related to this comment. I don't know if there is still time to change this.

@twhetzel
Copy link
Contributor

twhetzel commented Sep 6, 2024

The full build does not work for this branch. I am looking into this.

Error:

if [ false = true ]; then wget "https://raw.githubusercontent.com/monarch-initiative/mondo-ingest/externalclingenmedgenefo/src/ontology/external/processed-mondo-efo.robot.owl" -O tmp/external/processed-mondo-efo.robot.owl; fi
mkdir -p tmp/external
if [ false = true ]; then wget "https://raw.githubusercontent.com/monarch-initiative/mondo-ingest/externalclingenmedgenefo/src/ontology/external/processed-efo-proxy-merges.robot.owl" -O tmp/external/processed-efo-proxy-merges.robot.owl; fi
make[4]: Leaving directory '/work/src/ontology/tmp/mondo/src/ontology'
grep -vE '^(xref: EFO:|subset: otar)' mondo-edit.obo > tmp/mondo-edit.tmp || true
mv tmp/mondo-edit.tmp mondo-edit.obo
robot --catalog catalog-v001.xml merge -i mondo-edit.obo -i tmp/external/processed-mondo-otar-subset.robot.owl -i tmp/external/processed-mondo-efo.robot.owl -i tmp/external/processed-efo-proxy-merges.robot.owl --collapse-import-closure false \
	query --use-graphs false --update ../sparql/update/update-equivalent-obsolete.ru \
	convert -f obo --check false -o mondo-edit.obo.obo
tmp/external/processed-efo-proxy-merges.robot.owl (No such file or directory)
Use the -vvv option to show the stack trace.
Use the --help option to see usage information.
make[3]: *** [mondo.Makefile:460: update-efo-subset] Error 1
make[3]: Leaving directory '/work/src/ontology/tmp/mondo/src/ontology'
make[2]: *** [mondo.Makefile:319: update-external-content-incl-rare] Error 2
make[2]: Leaving directory '/work/src/ontology/tmp/mondo/src/ontology'
make[1]: *** [mondo-ingest.Makefile:617: tmp/mondo-with-simulated-emc-run.owl] Error 2
rm imports/ro_terms_combined.txt
make[1]: Leaving directory '/work/src/ontology'
make: *** [mondo-ingest.Makefile:346: build-mondo-ingest] Error 2
Command exited with non-zero status 2
### DEBUG STATS ###
Elapsed time: 3:48:09

@twhetzel twhetzel self-requested a review September 6, 2024 18:28
Copy link
Contributor

@twhetzel twhetzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full build no longer works after further changes were added after this was already reviewed and previously approved.

The GH Action in the Mondo repo "Update Externally Provided Content" does still work so this PR will be left unmerged until the issue with the full build can be sorted out and fixed.

Copy link
Contributor

@joeflack4 joeflack4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build is failing.

@twhetzel twhetzel added the bug Something isn't working label Sep 19, 2024
ERROR duplicate_scoped_synonym
ERROR file:tmp/mondo/src/sparql/qc/general/qc-duplicate-exact-synonym-no-abbrev.sparql
ERROR file:tmp/mondo/src/sparql/qc/mondo/qc-proxy-merges.sparql
ERROR file:tmp/mondo/src/sparql/qc/general/qc-related-exact-synonym.sparql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider removing this check since it looks like it is covered by the default ROBOT check of duplicate_scoped_synonym already included in this file.

- This provides a necessary bug fix
* data build on main following mondo release

* remove logs file
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants