Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uploaded new test dataset for github workflow #107

Merged
merged 5 commits into from
Aug 16, 2024
Merged

Uploaded new test dataset for github workflow #107

merged 5 commits into from
Aug 16, 2024

Conversation

samarth8392
Copy link
Collaborator

@samarth8392 samarth8392 commented Aug 15, 2024

Changes

  • Added subsampled human tumor normal WES reads from seqc2 project and BAM files for github workflow tests. Contains reads spanning all chromosomes.
  • Updated the main.yaml file in github workflow.

Issues

Resolves #27

PR Checklist

(Strikethrough any points that are not applicable.)

  • This comment contains a description of changes with justifications, with any relevant issues linked.
    - [ ] Update docs if there are any API changes.
  • Update CHANGELOG.md with a short description of any user-facing changes and reference the PR number. Guidelines: https://keepachangelog.com/en/1.1.0/

Copy link
Member

@kelly-sovacool kelly-sovacool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move all the test files from .tests/ to tests/data/? This location would be more standard.

@kelly-sovacool
Copy link
Member

I found some instances of .tests/ to change:

grep -rl "\.tests" *
docs/usage/run.md
pyproject.toml
src/xavier/__main__.py
tests/test_run.py
tests/test_cli.py

I went ahead and fixed them with:

grep -rl "\.tests" * | xargs sed -ri "s|\.tests|tests/data|g"

@kelly-sovacool
Copy link
Member

I have a full run in progress with the new test dataset, will let you know how it goes

@samarth8392
Copy link
Collaborator Author

I have a full run in progress with the new test dataset, will let you know how it goes

The full run will fail because these are 0.5% of all sampled reads and will fail at the CNV calling step. I only uploaded these here for dryrun tests as they were small enough. In my tests, the 25% subsampled reads (~800 mb/file) run to completion without errors. Anything below that would fail to call CNVs due to low coverage.

@kelly-sovacool
Copy link
Member

kelly-sovacool commented Aug 15, 2024

I have a full run in progress with the new test dataset, will let you know how it goes

The full run will fail because these are 0.5% of all sampled reads and will fail at the CNV calling step. I only uploaded these here for dryrun tests as they were small enough. In my tests, the 25% subsampled reads (~800 mb/file) run to completion without errors. Anything below that would fail to call CNVs due to low coverage.

Oh, the purpose of uploading real data (rather than blank files) is so we can run real non-dry runs to test the actual functionality and not just the workflow logic.

Is there a way we can have an option to relax restrictions for the CNV callers so they run to completion with this small subsample? Ideally we would not advertise such an option for users and it would only exist for testing.

Also, can you copy your 25% subsampled data to a subdir here, so we can use it for manual testing on biowulf? /data/CCBR_Pipeliner/testdata/XAVIER/

@samarth8392
Copy link
Collaborator Author

I have a full run in progress with the new test dataset, will let you know how it goes

The full run will fail because these are 0.5% of all sampled reads and will fail at the CNV calling step. I only uploaded these here for dryrun tests as they were small enough. In my tests, the 25% subsampled reads (~800 mb/file) run to completion without errors. Anything below that would fail to call CNVs due to low coverage.

Oh, the purpose of uploading real data (rather than blank files) is so we can run real non-dry runs to test the actual functionality and not just the workflow logic.

Is there a way we can have an option to relax restrictions for the CNV callers so they run to completion with this small subsample? Ideally we would not advertise such an option for users and it would only exist for testing.

Also, can you copy your 25% subsampled data to a subdir here, so we can use it for manual testing on biowulf? /data/CCBR_Pipeliner/testdata/XAVIER/

CNV calls are added as --cnv flag and we can remove it from the logic and maybe it would work. You can test it and see if that works.
I have added the 0.25 subsampled files to the folder.

@kelly-sovacool
Copy link
Member

CNV calls are added as --cnv flag and we can remove it from the logic and maybe it would work. You can test it and see if that works. I have added the 0.25 subsampled files to the folder.

Ok, so we can at least run it without CNV entirely to test the other parts of the pipeline. That's still worth doing, definitely better than only a dry run.

@samarth8392
Copy link
Collaborator Author

Ok, so we can at least run it without CNV entirely to test the other parts of the pipeline. That's still worth doing, definitely better than only a dry run.

Okay. Let me do a full run of XAVIER without CNV calling on the uploaded test dataset (0.5%) and see if it runs. Will post an update.

@kelly-sovacool
Copy link
Member

Ok, so we can at least run it without CNV entirely to test the other parts of the pipeline. That's still worth doing, definitely better than only a dry run.

Okay. Let me do a full run of XAVIER without CNV calling on the uploaded test dataset (0.5%) and see if it runs. Will post an update.

Mine is in progress, so there's no need for you to run it too unless you just want to!

@kelly-sovacool
Copy link
Member

somalier_analysis failed likely since the subsample is so small.

[somalier] subset from 17384 to 0 high call-rate sites (removed 100.00%)
p_shapeshifting.nim(103) broadcast2Impl
Error: unhandled exception: Broadcasting error: non-singleton dimensions must be the same in both tensors. Tensors' shapes were: [2504, 0] and [] [ValueError]

But everything else completed successfully (340/343 steps).

@kelly-sovacool
Copy link
Member

I have added the 0.25 subsampled files to the folder.
@samarth8392

I moved them to a subdir called human_subset. Can you make them readable by everyone? I don't have permission since you own the files.

chmod a+r /data/CCBR_Pipeliner/testdata/XAVIER/human_subset/*

@kelly-sovacool
Copy link
Member

kelly-sovacool commented Aug 16, 2024

p.s. I edited your first PR comment to say Resolves #27. Resolves is a keyword that will trigger GitHub to close that issue once this PR is merged (https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests)

@samarth8392 samarth8392 merged commit 315ef23 into main Aug 16, 2024
3 checks passed
@samarth8392 samarth8392 deleted the test-data branch August 16, 2024 13:00
@kelly-sovacool
Copy link
Member

Thanks for your work on this @samarth8392 🎉

@samarth8392
Copy link
Collaborator Author

go team 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subset small test dataset
2 participants