Avoid tarball bloat by discouraging tar from including multiple snapshots of large files written out over time #4

Open
tomkinsc opened this issue Jul 12, 2024 · 0 comments

On newer Illumina machines that write only a few large concatenated bcl files (*.cbcl) per cycle, capturing incremental snapshots of modified files in a run directory too frequently can produce a tarball that is artificially inflated in size. This occurs because multiple snapshots of each *.cbcl file are included, each capturing the state of the file as the sequencer appended to it over time. Ideally, each *.cbcl file would be included only after it has been finalized.

This behavior can be improved by setting or overriding the value for DELAY_BETWEEN_INCREMENTS_SEC to a larger number to reduce the frequency at which snapshots are captured.

A value of DELAY_BETWEEN_INCREMENTS_SEC=600 should limit snapshots to a maximum of two per cycle (in the worst case, one in-progress capture plus one capture after each cbcl is finalized), as 600 seconds is at or above the 95th percentile of observed cycle durations.
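
To make the reasoning behind that number concrete, here is a rough sketch of the worst-case arithmetic. It assumes a snapshot fires every DELAY_BETWEEN_INCREMENTS_SEC seconds and that a cbcl file changes between every pair of snapshots taken while its cycle is being written. The function name `max_snapshots_per_cycle` is hypothetical and not part of the existing script.

```shell
# Hypothetical helper: upper bound on how many copies of one cycle's cbcl
# files can end up in the tarball, given the cycle duration and the delay
# between snapshots (both in whole seconds).
max_snapshots_per_cycle() {
  local cycle_sec="$1"
  local delay_sec="$2"
  # At most ceil(cycle/delay) snapshots can fall inside the window in which
  # the file is still being appended to, plus one capture after the file is
  # finalized.
  echo $(( (cycle_sec + delay_sec - 1) / delay_sec + 1 ))
}

max_snapshots_per_cycle 600 600   # cycle at the 95th percentile, delay of 600 s
max_snapshots_per_cycle 600 60    # same cycle, delay of 60 s
```

Under these assumptions a 600-second delay gives a bound of 2 for any cycle of 600 seconds or less, while a 60-second delay allows up to 11 copies of the same file.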

An additional improvement should be possible by excluding from a snapshot any files that are currently open or that we anticipate may be changed by the sequencer in the (very) near future. Rather than inspecting open file descriptors, we can exclude the paths of the most recent cycle by writing a matching pattern to a file:

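# Find the cycle directories (e.g. C151.1) in every lane, keep the most recent
# one, and replace its lane component with a wildcard so the resulting pattern
# covers that cycle in all lanes.
find "$PATH_TO_RUN_FOLDER"/Data/Intensities/BaseCalls/L00*/ \
  -type d \
  -regextype posix-extended \
  -iregex '^.+\/C[0-9]+\.[0-9]$' |\
sort -r -k1,1 -V |\
head -n1 |\
sed --regexp-extended 's/(BaseCalls\/)L([0-9]+)/\1L\*/g' |\
tee recent_cycle_exclusions.txt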

(obtain $PATH_TO_RUN_FOLDER as desired, e.g. from the directories within /usr/local/illumina/runs/ on a NextSeq 2000)

...and then exclude the patterns in that file as part of the tar call by passing --exclude-from=recent_cycle_exclusions.txt
This option should only be added to the call if run_is_finished is not true, so that upon run completion any previously excluded files are swept up into the final tarball.
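
The conditional could be sketched roughly as follows. This is a minimal illustration, not the existing script's code: the helper name `exclude_args_for_snapshot` and the archive name in the usage comment are hypothetical, and it assumes run_is_finished holds the string "true" once the run has completed.

```shell
# Hypothetical helper: print the tar option that excludes in-progress cycle
# paths, but only while the run is still ongoing.
exclude_args_for_snapshot() {
  local run_is_finished="$1"   # assumed to be "true" once the run has completed
  local exclusions_file="$2"   # e.g. recent_cycle_exclusions.txt

  # Print nothing on the final pass, so previously excluded files are captured,
  # and nothing when the exclusions file is missing or empty.
  if [ "$run_is_finished" != "true" ] && [ -s "$exclusions_file" ]; then
    printf '%s\n' "--exclude-from=$exclusions_file"
  fi
}

# Usage (bash), collecting the optional argument into an array:
#   exclude_args=()
#   while IFS= read -r arg; do exclude_args+=("$arg"); done \
#     < <(exclude_args_for_snapshot "$run_is_finished" recent_cycle_exclusions.txt)
#   tar --create --file=snapshot.tar "${exclude_args[@]}" "$PATH_TO_RUN_FOLDER"
```

Keeping the decision in one small function means the tar invocation itself stays the same for intermediate and final snapshots.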

We could also optionally add to the exclusion file the paths of files that have changed in the past few minutes, e.g.:

find . -mmin -3 -type f >> recent_cycle_exclusions.txt
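
One detail to watch with this approach: GNU tar matches exclusion patterns against member names as they appear in the archive, so the form of the paths written to the file matters (find . emits paths prefixed with ./). A possible variant, assuming GNU find and that suffix-style relative paths are what the tar call needs, strips that prefix. The helper name `recently_modified_files` is hypothetical.

```shell
# Hypothetical helper: list files under a directory that were modified within
# the last N minutes (default 3), as paths relative to that directory and
# without a leading "./".
recently_modified_files() {
  local dir="$1"
  local minutes="${2:-3}"
  # Run find from inside the directory so the output is relative to it; the
  # subshell keeps the caller's working directory unchanged.
  ( cd "$dir" && find . -type f -mmin "-$minutes" | sed 's|^\./||' )
}

# Usage (assumes PATH_TO_RUN_FOLDER is set as in the earlier command):
#   recently_modified_files "$PATH_TO_RUN_FOLDER" 3 >> recent_cycle_exclusions.txt
```

Whether the exclusion file should hold relative or absolute paths depends on how the run directory is passed to tar, so it is worth checking the result against a test archive.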