Combinatorial or split-pool barcoding appends a unique set of barcodes to single cells or nuclei during multiple "rounds". These reactions are confined to each individual fixed and permeabilized cell, bypassing the need for physical barriers between cells such as microwells or nanodroplets.
This workflow is designed to pre-process data from the Parse Biosciences Evercode whole-transcriptome ("WT") barcoding kits. This split-pool method differs from others such as SHARE-seq in its use of both oligo-dT and random hexamer primers during the first round of reverse transcription-based barcoding. The goal of using both primers is to span the entire transcript for more full-length coverage and to avoid polyA-biased reads. After all 3 rounds of barcoding, aliquots of cells are lysed and libraries are prepared from the barcoded cDNA. At the end of short-read library prep, a final Illumina barcode is added to identify these lysed aliquots, or "subpools", and libraries are sequenced and demultiplexed as subpools containing a mix of all the samples on the plate.
The main deviation from the official company pipeline is the mapping/quantification software and custom handling of sample merging to produce Python anndata objects grouped by tissue for downstream processing with scanpy. This workflow is based on the kb-python (kallisto/bustools) suite, the official snRNA-seq tools for IGVF, and also includes optional code for genetic demultiplexing of individuals from sample barcoding wells loaded with 2 individuals of distinct genotypes.
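To make the end goal concrete, here is a minimal sketch (not the pipeline's exact code) of the kind of merging the workflow automates: per-subpool cell x gene AnnData objects are combined and then split by tissue for downstream scanpy processing. The file names are hypothetical; the `Tissue` column comes from the sample metadata described below.

```python
import anndata as ad

# hypothetical per-subpool outputs; the real pipeline tracks these for you
subpool_paths = ["subpool_1.h5ad", "subpool_2.h5ad"]
adatas = [ad.read_h5ad(p) for p in subpool_paths]

# concatenate cells across subpools, recording the subpool of origin in .obs
combined = ad.concat(adatas, join="outer", label="subpool",
                     keys=["Subpool_1", "Subpool_2"])

# one AnnData per tissue of origin for downstream scanpy processing
tissue_adatas = {t: combined[combined.obs["Tissue"] == t].copy()
                 for t in combined.obs["Tissue"].unique()}
```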
This workflow does the following:
- kb-count: Associate reads with their cells/nuclei of origin
- kb-count: Collapse duplicated reads according to unique molecular identifiers (UMIs)
- kb-count: Collapse reads from oligo-dT and random hexamer primers in the same well (`counts_unfiltered_modified`)
- kb-count: Generate cell x gene adatas for each subpool
- klue: Quantify reads associated with distinct genotypes for each cell & merge with kallisto adata
- Merge custom metadata by sample barcode-to-well mapping
- Assign the genotype with the higher count between the 2 expected genotypes, or `tie` otherwise (see the sketch after this list)
- Perform basic filtering for easier merging (adjust in config.yml, default 200 UMI)
- scrublet: Generate doublet scores
- Merge samples across subpools by tissue of origin
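As a rough illustration of the genotype-assignment step above, the logic amounts to comparing klue counts for the 2 expected genotypes per cell. This is a minimal sketch, not the pipeline's code; the column names and counts below are made up.

```python
import numpy as np
import pandas as pd

def assign_genotype(counts: pd.DataFrame, geno1: str, geno2: str) -> pd.Series:
    """Pick the expected genotype with the higher klue count per cell, or 'tie'."""
    g1, g2 = counts[geno1], counts[geno2]
    calls = np.where(g1 > g2, geno1, np.where(g2 > g1, geno2, "tie"))
    return pd.Series(calls, index=counts.index, name="Genotype")

# hypothetical klue counts for three cell barcodes
counts = pd.DataFrame({"B6J": [10, 0, 3], "NODJ": [2, 5, 3]},
                      index=["AAACCC", "CCCGGG", "GGGTTT"])
print(assign_genotype(counts, "B6J", "NODJ"))  # B6J, NODJ, tie
```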
Subpool FASTQ Files:
- Subpool_1_R1.fastq.gz (contains cDNA)
- Subpool_1_R2.fastq.gz (contains the cell barcode and UMI)
Fastq files split across multiple lanes are fine, e.g.
- Subpool_2_S1_L001_R1_001.fastq.gz
- Subpool_2_S1_L001_R2_001.fastq.gz
- Subpool_2_S1_L003_R1_001.fastq.gz
- Subpool_2_S1_L003_R2_001.fastq.gz etc.
The read names are not used directly in the pipeline and can be formatted however you like; just specify the relevant information in the fastq config file.
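For reference, fastq config entries might look like the rows below, with one row per R1/R2 pair. These values are entirely hypothetical; copy the exact conventions for subpool, plate, lane, run, and platform from an existing config such as igvf_015_config.tsv.

```
fastq	fastq_r2	subpool	plate	lane	run	platform
/path/to/Subpool_1_R1.fastq.gz	/path/to/Subpool_1_R2.fastq.gz	Subpool_1	plate1	1	run1	NextSeq
/path/to/Subpool_2_S1_L001_R1_001.fastq.gz	/path/to/Subpool_2_S1_L001_R2_001.fastq.gz	Subpool_2	plate1	1	run1	NextSeq
/path/to/Subpool_2_S1_L003_R1_001.fastq.gz	/path/to/Subpool_2_S1_L003_R2_001.fastq.gz	Subpool_2	plate1	3	run1	NextSeq
```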
Note: Written for and tested on UCI's HPC cluster with the Slurm job scheduler. Follow these steps if you are using this workflow for the first time.
Choose a location on HPC with plenty of space to clone this repo. The kallisto output in particular is very large for a full 1M cell (Mega kit) experiment (~350GB with all unfiltered bus files).
```
git clone https://github.com/fairliereese/parse_pipeline.git
```
- Remember your login node or choose your favorite out of i15, i16, or i17. You can switch login nodes via ssh, e.g. `ssh login-i15`.
- Start a tmux session via `tmux new -s mysession`, or whatever you want to name it. When you need to reconnect, type `tmux a -t mysession` (tmux cheatsheet) from the same login node where you started the session. This way, if your internet goes out or you have to close your laptop, ongoing processes won't be terminated.
- Start an interactive session: `srun -A SEYEDAM_LAB --cpus-per-task=1 --mem 32G --pty bash -i`. This is so you don't clog up the login node. We shouldn't need much since we are just installing packages and running snakemake, which will launch the more computationally intensive jobs for you.
Required packages: `snakemake`, `pandas`, `numpy`, `anndata`, `scanpy`, `scrublet`, `kb-python`, and, if you have genetically multiplexed samples, `klue`.
```
conda install -n base -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake==7.32 python==3.9 pandas
conda activate snakemake
```
- Install the required Python packages with pip: `pip install kb-python scrublet`
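Optionally, you can sanity-check the environment from Python before moving on; this is just a convenience snippet using the package names listed above.

```python
from importlib.metadata import version, PackageNotFoundError

# distribution names as installed via conda/pip above
for pkg in ["snakemake", "pandas", "numpy", "anndata", "scanpy", "scrublet", "kb-python"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```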
Klue installation instructions:
```
git clone https://github.com/Yenaled/klue
cd klue
mkdir build
cd build
cmake ..
make
make install
```
If step 7 (`make install`) does not work, follow these steps instead:
- `cd src` (within the build folder), then run `pwd` and copy the path; for example, mine is `/share/crsp/lab/seyedam/erebboah/parse_pipeline/klue/build/src`.
- Edit your `~/.bashrc` by adding this line to the bottom of the file: `export PATH="<your path>:$PATH"`. For example, I added `export PATH="/share/crsp/lab/seyedam/erebboah/parse_pipeline/klue/build/src:$PATH"`.
- Run `source ~/.bashrc`.
- Test the installation by typing `klue` in the terminal; you should see version and usage information.
- Make a fastq config file with columns that match the many examples in the configs folder, e.g. `igvf_015_config.tsv`, and save it in the configs folder. Required columns are fastq, fastq_r2, subpool, plate, lane, run, and platform.
- Update sample_metadata.csv with relevant metadata for your experiment. The minimum required metadata columns for the pipeline to run properly are Mouse_Tissue_ID, Experiment, bc1_well, well_type, Tissue, and Genotype. If you have genetically multiplexed wells, it's also very convenient for downstream analysis to have Multiplexed_sample1 and Multiplexed_sample2.
- Make a copy of `snakemake/Snakefile.smk` with your experiment name to edit, and keep the copy in the snakemake folder. You can also just edit `Snakefile.smk` directly, but I like to keep a copy of the Snakefile used for each run for my own records. You only need to edit 5 lines maximum in the header region of the file (a sketch of what these edits might look like follows this list):
- config_tsv: Path to the fastq config file which has the paths to your read1 and read2 input fastqs and the plate, lane, run, and sequencing platform.
- sample_csv: Path to the sample metadata. I typically update the Google spreadsheet, download the tab, and upload it to my configs folder. Each row represents a well in the sample barcoding plate with metadata information for that sample, some of which is repeated across all the samples on the plate, such as experiment name, kit, and tissue.
- kit: either WT (48 wells), WT_mini (12 wells), or WT_mega (96 wells)
- chemistry: all IGVF and ModelAD experiments are v2 so far, v3 coming soon
- first_min_counts: minimum # of UMIs per cell
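For illustration, the header edits might look something like the lines below. The variable names come from the list above, but the exact syntax, defaults, and paths in your copy of the Snakefile may differ, so treat this as a sketch and match it to your own experiment.

```python
# Illustrative values only -- adjust to your experiment
config_tsv = 'configs/igvf_015_config.tsv'  # fastq config file
sample_csv = 'configs/sample_metadata.csv'  # sample metadata downloaded from the spreadsheet
kit = 'WT'                                  # WT (48 wells), WT_mini (12 wells), or WT_mega (96 wells)
chemistry = 'v2'                            # all IGVF and ModelAD experiments are v2 so far
first_min_counts = 200                      # minimum number of UMIs per cell (default 200)
```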
Skip steps 1-4 if you were following the setup instructions and are already in an interactive tmux session.
- Pay attention to your login node, or ssh to your favorite, e.g. `ssh login-i15`.
- Change directories to your pipeline directory, e.g. `cd /share/crsp/lab/seyedam/erebboah/parse_pipeline`. You MUST be in the `parse_pipeline` directory, not in a sub-directory like `parse_pipeline/configs` or `parse_pipeline/snakemake`, or it will not run.
- Start a tmux session, e.g. `tmux new -s mysession`.
- Start an interactive session: `srun -A SEYEDAM_LAB --cpus-per-task=1 --mem 32G --pty bash -i`.
- Activate your snakemake environment: `conda activate snakemake`. (You need to activate your snakemake conda environment again even if you were following the setup instructions, since sourcing the bashrc probably reset it to your base environment.)
- Check that snakemake is going to run the appropriate jobs (use the -n flag first). Make sure to change `Snakefile.smk` to the one you are actually using, for example `Snakefile_igvf015.smk`:
```
snakemake \
-s snakemake/Snakefile.smk \
-j 100 \
--latency-wait 120 \
--use-conda \
--cluster "sbatch -A seyedam_lab --partition=highmem --mem={resources.mem_gb}GB -c {resources.threads} --time=72:00:00" -n
```
- Actually run the pipeline:
```
snakemake \
-s snakemake/Snakefile.smk \
-j 100 \
--latency-wait 120 \
--use-conda \
--cluster "sbatch -A seyedam_lab --partition=highmem --mem={resources.mem_gb}GB -c {resources.threads} --time=72:00:00"
```
For a 1M cell WT Mega experiment (96 wells used for sample barcoding) with genetic multiplexing, 2 tissues, and 15 subpools (standard for 8-cube founders), the expected job stats are:
```
Job stats:
job                                        count
---------------------------------------  -------
all                                             1
curl_fa                                         8
dl_annot                                        1
dl_fa                                           1
fa_ref_fmt                                      8
kallisto                                       15
kallisto_ind                                    1
klue                                           60
klue_fa                                         4
klue_get_genotype_counts                       15
klue_ind                                        4
klue_make_subpool_genotype_filter_adata        60
klue_merge_genotype                            15
make_subpool_filter_adata                      15
make_subpool_sample_adata                    1440
make_tissue_adata                               2
scrublet                                     1440
symlink_fastq_r1                              120
symlink_fastq_r2                              120
total                                        3330
```
For a 100k cell WT experiment (48 wells used for sample barcoding) with genetic multiplexing, 8 subpools, and 5 tissues (igvf_012):
```
Job stats:
job                                        count
---------------------------------------  -------
all                                             1
curl_fa                                         8
dl_annot                                        1
dl_fa                                           1
fa_ref_fmt                                      8
kallisto                                        8
kallisto_ind                                    1
klue_fa                                         4
klue_get_genotype_counts                        8
klue_ind                                        4
klue_merge_genotype                             8
make_subpool_filter_adata                       8
make_subpool_sample_adata                      96
make_tissue_adata                               5
scrublet                                       96
symlink_fastq_r1                               16
symlink_fastq_r2                               16
total                                         289
```
- Killed: You are probably on the login node, or you need to increase the requested memory for the interactive session.
- FileNotFoundError/No such file or directory: Check your current directory (`pwd`). Make sure the 3 required input files exist and are in the correct locations: the fastq config, e.g. `igvf_###_config.tsv`, is in `parse_pipeline/configs`, `sample_metadata.csv` is in `parse_pipeline/configs`, and `Snakemake_###.smk` is in `parse_pipeline/snakemake`. Make sure the fastq config file name is spelled correctly in your Snakemake smk file.
- AttributeError: Make sure the columns in `igvf_###_config.tsv` exactly match fastq, fastq_r2, subpool, plate, lane, run, and platform (see the check after this list).
- klue reference generation runs twice for the F1 plates; in other words, it makes both a NODJ.idx file and a B6NODF1J.idx file, which for now are identical. The same goes for all the other 6 non-B6J genotypes. The good news is that this only happens once: reference generation isn't repeated after the first pipeline run.
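For the FileNotFoundError and AttributeError cases above, a quick pre-flight check like the following can catch problems before launching snakemake. The file names here are examples, so substitute your own.

```python
import os
import pandas as pd

fastq_config = "configs/igvf_015_config.tsv"   # example name; use your own
sample_csv = "configs/sample_metadata.csv"

for path in [fastq_config, sample_csv]:
    print(path, "exists" if os.path.exists(path) else "MISSING")

required_fastq_cols = {"fastq", "fastq_r2", "subpool", "plate", "lane", "run", "platform"}
required_meta_cols = {"Mouse_Tissue_ID", "Experiment", "bc1_well", "well_type", "Tissue", "Genotype"}

print("missing fastq config columns:", required_fastq_cols - set(pd.read_csv(fastq_config, sep="\t").columns))
print("missing metadata columns:", required_meta_cols - set(pd.read_csv(sample_csv).columns))
```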
- Integrate Ryan's report code to make beautiful knee plots and well heatmaps
- Integrate experimentSpecs and analysisSpecs to replace manual metadata curation
- Move all the slurm std err/out files to a folder after the run finishes
- Support custom humanized loci for Model AD