Update ingest to accommodate oropouche segments #1
…y changing the config.yaml file
…in order to accommodate the three different segments of oropouche virus
I left some small comments to clean up the workflow a little bit, but nothing blocking!
Glad this was relatively straightforward to set up 🙏
ingest/Snakefile
Outdated
rule create_final_metadata:
    input:
        metadata="data/subset_metadata.tsv"
    output:
        metadata="results/all/metadata.tsv"
    shell:
        """
        cp {input.metadata} {output.metadata}
        """
suggestion
This rule can be removed if the subset_metadata rule just outputs directly to results/all/metadata.tsv.
good point, I updated this
ingest/rules/nextclade.smk
Outdated
    input:
        dataset=f"data/nextclade_data/{DATASET_NAME}.zip",
        sequences="results/sequences.fasta",
        metadata = "data/subset_metadata.tsv",
It doesn't seem like the metadata input is needed for this rule?
you're right! I removed it from the most recent commit
Oropouche is a segmented virus (L, M, and S segments). To accommodate these segments and allow for downstream phylogenetic analysis, the ingest pipeline was customized to split up the metadata by segment, and to also produce a metadata file and a sequences file with all the sequences under results/all.
This was adapted from the work done by @j23414 in nextstrain/lassa#12.
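For readers who want a feel for the split described above, here is a minimal Python sketch of the idea, not the actual rule from this PR. It assumes the metadata TSV has a column named `segment` (a hypothetical name) and that rows without an assignment should appear only in the combined `all` group; the function name and the toy accessions are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def split_metadata_by_segment(tsv_text, segment_column="segment"):
    """Group metadata rows by segment; every row is also kept under 'all'.

    `segment_column` is an assumed column name, not taken from the PR.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups["all"].append(row)
        segment = (row.get(segment_column) or "").strip()
        if segment:  # rows without an assignment stay only in 'all'
            groups[segment].append(row)
    return dict(groups)


# Toy input, not real Oropouche records.
example = (
    "accession\tsegment\n"
    "A1\tL\n"
    "A2\tM\n"
    "A3\tS\n"
    "A4\t\n"
)

groups = split_metadata_by_segment(example)
for name, rows in sorted(groups.items()):
    print(name, [row["accession"] for row in rows])
```

In a workflow, each group would presumably be written to its own metadata file, with the `all` group corresponding to the combined file under results/all.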
I quickly compared the segment assignments made by Nextclade with the existing annotations on NCBI, and they are quite concordant: all genomes annotated as L and M in NCBI were also assigned as L and M, respectively, by Nextclade.
Two genomes are annotated as S but were not assigned as such by Nextclade. A quick look shows that they're both from Culex mosquitoes and pretty short, so the sequencing quality might not be great to begin with. I can look into that more closely in the future.
About 13% of the genomes didn't have a segment annotation, and Nextclade was able to assign a segment to all except 7. Their information is below; they're just really short segments, so it makes sense that Nextclade would struggle.
oropouche_no_nextclade_segment_assignment.csv
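The comparison described above could be tallied along these lines. This is only a sketch of one way to do it, not the script actually used here: the function name, the category labels, and the toy records are all hypothetical, and an empty string stands in for a missing annotation or assignment.

```python
from collections import Counter


def compare_segment_calls(records):
    """Tally agreement between NCBI annotations and Nextclade assignments.

    `records` is an iterable of (ncbi_segment, nextclade_segment) pairs;
    an empty string or None means the value is missing.
    """
    tally = Counter()
    for ncbi, nextclade in records:
        ncbi = (ncbi or "").strip().upper()
        nextclade = (nextclade or "").strip().upper()
        if not ncbi and not nextclade:
            tally["unassigned_by_both"] += 1
        elif not ncbi:
            tally["newly_assigned_by_nextclade"] += 1
        elif not nextclade:
            tally["annotated_but_unassigned"] += 1
        elif ncbi == nextclade:
            tally["concordant"] += 1
        else:
            tally["discordant"] += 1
    return tally


# Toy pairs for illustration only, not the real counts from this PR.
toy_records = [
    ("L", "L"),
    ("M", "M"),
    ("S", ""),   # annotated in NCBI, no Nextclade assignment
    ("", "S"),   # no NCBI annotation, assigned by Nextclade
    ("", ""),    # missing in both
]

tally = compare_segment_calls(toy_records)
for category, count in sorted(tally.items()):
    print(category, count)
```

Keeping "annotated but unassigned" separate from "discordant" makes it easier to spot cases like the short S genomes mentioned above, where Nextclade gave no call rather than a conflicting one.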
It all runs perfectly thanks to @j23414's work on the lassa side.