You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When running a large number of samples (~700 in the samplesheet) through the differentialabundance pipeline, the PLOT_EXPLORATORY module will generate an OOM error, even when scaling up RAM usage to 128gb via AWS Batch. The R App generates fine, indicating it isn't necessarily an issue with the inputs.
Jonathan Manning has suggested it may lie in the density function of the boxplot.R specifically.
I ran the following command from the terminal:/home/ec2-user/nextflow run /home/ec2-user/differentialabundance/main.nf \-profile docker \-bucket-dir path/to/s3/folder/differential_abundance/work/ \--outdir path/to/s3/folder/differential_abundance/results/ \--input path/to/s3/folder/differential_abundance/inputs/IPSC_21_all_cpd_samples_samplesheet_mod_for_diff_abundance.csv \--matrix path/to/s3/folder/differential_abundance/inputs/salmon.merged.gene_counts_scaled.tsv \--contrasts path/to/s3/folder/differential_abundance/inputs/IPSC_21_all_cpd_samples_contrasts.csv \-c /home/ec2-user/awsbatch_5_retries.config \-resume
The following error occurred:
ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:PLOT_EXPLORATORY (Treatment)'
Caused by:
Essential container in task exited - OutOfMemoryError: Container killed due to memory usage
Command error:
(more omitted..)
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Attaching package: 'shinyngs'
The following object is masked from 'package:MatrixGenerics':
colMedians
The following object is masked from 'package:matrixStats':
colMedians
[1] "Reading inputs..."
[1] "Creating output paths..."
[1] "Writing boxplots..."
[1] "... static"
null device
1
[1] "Writing density plots..."
[1] "... static"
.command.sh: line 8: 223 Killed
### Relevant files
I launched my pipeline from `home/ec2-user/differentalabundance`. Unfortunately, I cannot provide full files due to the proprietary nature of the data. However, the matrix file is a `salmon.merged.gene_counts_scaled.tsv file`, which is 200mb, 770 columns and 29,745 rows. of the format:
gene_id
gene_name
X1A_21_A1_GW_IPSC_CNS_4U_24h_2021
X1A_21_A10_GW_IPSC_CNS_4U_24h_2021
A1BG
A1BG
30.7871312864341
83.588759118784
A1BG-AS1
A1BG-AS1
0
2.57252569758649
A1CF
A1CF
0
0
A2M
A2M
1.56513798198313
3.11160444362184
A2M-AS1
A2M-AS1
0
0.554415060204014
A2ML1
A2ML1
0
0
A2MP1
A2MP1
0
0
A3GALT2
A3GALT2
0
0
A4GALT
A4GALT
0
0
A4GNT
A4GNT
0
0
AA06
AA06
0
0
AAAS
AAAS
9.82407546152402
33.6596479800078
AACS
AACS
3.6887964454676
7.56602734382506
My contrast file contains the following:
id
variable
reference
target
blocking
GWP42003_1_vs_GWP42003_2
Treatment
GWP42003_1
GWP42003_2
DMSO_vs_GWP42003_1
Treatment
DMSO
GWP42003_1
DMSO_vs_GWP42003_2
Treatment
DMSO
GWP42003_2
For the sake of running the pipeline, the sample sheet looks something like the below, except with a total of 769 rows (samples) 48 of which are GWP42003_2, 48 GWP42003_2, and 64 DMSO, leading to 160 total samples that the contrast file should pick up
sample
fastq_1
fastq_2
strandedness
Plate
Well
Treatment
Concentration
TimePoint
CellsWell
Study
Platform
Protocol
Paired
ConcentrationUnit
StudyType
Status
Excluded
Duration
1A_21_A1_GW_IPSC_CNS_4U_24h_2021
s3path/to/fastq/A1.fastq.gz
auto
1A_21
A1
BUFFER
0
24h
10k
GW_IPSC_CNS_4U_24h_2021
plateRNA-Seq
plateRNA-Seq
Y
uM
Invitro
Analyzed
N
21
1A_21_A10_GW_IPSC_CNS_4U_24h_2021
s3path/to/fastq/A10.fastq.gz
auto
1A_21
A10
GWP42003_1
0.1
24h
10k
GW_IPSC_CNS_4U_24h_2021
plateRNA-Seq
plateRNA-Seq
Y
uM
Invitro
Analyzed
N
21
1A_21_A11_GW_IPSC_CNS_4U_24h_2021
s3path/to/fastq/A11.fastq.gz
auto
1A_21
A11
GWP42003_2
0.1
24h
10k
GW_IPSC_CNS_4U_24h_2021
plateRNA-Seq
plateRNA-Seq
Y
uM
Invitro
Analyzed
N
21
### System information
Running via AWS EC2 instance using CentOS
Running via AWS Batch.
using the Docker Profile
NextFlow version 23.10.0, build 5889
differential abundance v1.4.0
The text was updated successfully, but these errors were encountered:
As part of #221 I reduced the granularity of density calculation and made a couple of other tweaks that might make a small difference.
Fundamentally though, this pipeline is not optimised for large sample numbers, and we need to think what that might look like. We probably need to make some plots/ report components conditional on smaller sample numbers, or use different summary statistics across sample groups.
This won't be fixed imminently, but we'll keep it on the TODO list.
Description of the bug
When running a large number of samples (~700 in the samplesheet) through the differentialabundance pipeline, the
PLOT_EXPLORATORY
module will generate an OOM error, even when scaling up RAM usage to 128gb via AWS Batch. The R App generates fine, indicating it isn't necessarily an issue with the inputs.Jonathan Manning has suggested it may lie in the density function of the boxplot.R specifically.
Command used and terminal output
The following error occurred:
ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:PLOT_EXPLORATORY (Treatment)'
Caused by:
Essential container in task exited - OutOfMemoryError: Container killed due to memory usage
Command executed:
exploratory_plots.R
--sample_metadata "IPSC_21_all_cpd_samples_samplesheet_mod_for_diff_abundance.sample_metadata.tsv"
--feature_metadata "matrix_as_anno.feature_metadata.tsv"
--assay_files "salmon.merged.gene_counts_length_scaled.assay.tsv,all.normalised_counts.tsv,all.vst.tsv"
--contrast_variable "Treatment"
--outdir "Treatment"
--sample_id_col "sample" --feature_id_col "gene_id" --assay_names "raw,normalised,variance_stabilised" --final_assay "variance_stabilised" --outlier_mad_threshold -5 --palette_name "Set1"
cat <<-END_VERSIONS > versions.yml$(echo $ (R --version 2>&1) | sed 's/^.R version //; s/ .$//')
"NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:PLOT_EXPLORATORY":
r-base:
r-shinyngs: $(Rscript -e "library(shinyngs); cat(as.character(packageVersion('shinyngs')))")
END_VERSIONS
Command exit status:
137
Command output:
[1] "Reading inputs..."
[1] "Creating output paths..."
[1] "Writing boxplots..."
[1] "... static"
null device
1
[1] "Writing density plots..."
[1] "... static"
Command error:
(more omitted..)
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Attaching package: 'shinyngs'
The following object is masked from 'package:MatrixGenerics':
colMedians
The following object is masked from 'package:matrixStats':
colMedians
[1] "Reading inputs..."
[1] "Creating output paths..."
[1] "Writing boxplots..."
[1] "... static"
null device
1
[1] "Writing density plots..."
[1] "... static"
.command.sh: line 8: 223 Killed
The text was updated successfully, but these errors were encountered: