Update ingest to accommodate oropouche segments #1

Merged
merged 3 commits into from
Aug 12, 2024

miparedes
Collaborator

Oropouche is a segmented virus (L, M, and S segments). To accommodate these segments and allow for downstream phylogenetic analysis, this PR customizes the ingest pipeline to split the metadata by segment, and also to write a metadata file and a sequences file containing all sequences under results/all.
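
As a rough illustration of the per-segment split described above, here is a minimal Python sketch. It is not the code in this PR: the function name `split_metadata_by_segment`, the `segment` column name, and the `<out_dir>/<segment>/metadata.tsv` layout are all assumptions made for the example.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_metadata_by_segment(metadata_path, out_dir, segment_column="segment"):
    """Write one metadata TSV per segment and return the row count per segment.

    The column name and output layout are hypothetical, chosen for illustration.
    """
    rows_by_segment = defaultdict(list)
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = reader.fieldnames
        for row in reader:
            segment = (row.get(segment_column) or "").strip()
            if segment:  # rows without a segment are left out of the per-segment files
                rows_by_segment[segment].append(row)

    counts = {}
    for segment, rows in rows_by_segment.items():
        segment_dir = Path(out_dir) / segment
        segment_dir.mkdir(parents=True, exist_ok=True)
        with open(segment_dir / "metadata.tsv", "w", newline="") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
            )
            writer.writeheader()
            writer.writerows(rows)
        counts[segment] = len(rows)
    return counts
```

In this sketch, rows with an empty segment value would only appear in the combined file, which is consistent with keeping a full set under results/all.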

This was adapted from the work done by @j23414 in nextstrain/lassa#12

I quickly compared the segment assignments made by Nextclade with the existing annotations on NCBI, and they are quite concordant: all genomes annotated as L or M in NCBI were also assigned L or M, respectively, by Nextclade.

There are two genomes that are annotated as S but were not assigned as such by Nextclade. A quick look shows that they're both from Culex mosquitoes and pretty short, so the sequencing quality might not be great to begin with. I can look into that more closely in the future.
Screenshot 2024-07-30 at 5 09 04 PM

About 13% of the genomes didn't have a segment annotation, and Nextclade was able to assign a segment to all except 7. Their information is below; they're just really short segments, so it makes sense that Nextclade would struggle.
oropouche_no_nextclade_segment_assignment.csv
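
For anyone wanting to repeat the comparison described above, a minimal Python sketch of the tally might look like the following. It is not the script used for this PR; the function name `compare_segment_calls` and the column names `ncbi_segment` and `nextclade_segment` are hypothetical and would need to match the actual metadata columns.

```python
import csv
from collections import Counter


def compare_segment_calls(
    metadata_path,
    ncbi_column="ncbi_segment",            # hypothetical column name
    nextclade_column="nextclade_segment",  # hypothetical column name
):
    """Tally agreement between NCBI segment annotations and Nextclade assignments."""
    tally = Counter()
    with open(metadata_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            ncbi = (row.get(ncbi_column) or "").strip()
            nextclade = (row.get(nextclade_column) or "").strip()
            if ncbi and nextclade:
                # Both sources made a call: check whether they agree
                tally["concordant" if ncbi == nextclade else "discordant"] += 1
            elif ncbi:
                tally["ncbi_only"] += 1
            elif nextclade:
                tally["nextclade_only"] += 1
            else:
                tally["unassigned"] += 1
    return dict(tally)
```

Here `nextclade_only` would count genomes lacking an NCBI annotation that Nextclade could still place, and `unassigned` would count those with no segment from either source.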

It all runs perfectly thanks to @j23414's work on the lassa side.

…in order to accommodate the three different segments of oropouche virus
@miparedes miparedes requested a review from genehack July 31, 2024 00:29
Contributor

@joverlee521 joverlee521 left a comment

I left some small comments to clean up the workflow a little bit, but nothing blocking!

Glad this was relatively straightforward to set up 🙏

ingest/Snakefile Outdated
Comment on lines 51 to 59
rule create_final_metadata:
    input:
        metadata="data/subset_metadata.tsv"
    output:
        metadata="results/all/metadata.tsv"
    shell:
        """
        cp {input.metadata} {output.metadata}
        """
Contributor

suggestion

This rule can be removed if the subset_metadata rule just outputs directly to results/all/metadata.tsv.

Collaborator Author

good point, I updated this

    input:
        dataset=f"data/nextclade_data/{DATASET_NAME}.zip",
        sequences="results/sequences.fasta",
        metadata = "data/subset_metadata.tsv",
Contributor

It doesn't seem like the metadata input is needed for this rule?

Collaborator Author

you're right! I removed it from the most recent commit

@miparedes miparedes merged commit 8daff24 into main Aug 12, 2024
@miparedes miparedes deleted the update_ingest branch August 12, 2024 22:00