
Possible memory leak in mincWriteVolume #302

Open

gdevenyi opened this issue Jan 25, 2022 · 9 comments
@gdevenyi (Contributor)

We're running the following code on Niagara:

#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)

# Load packages (not all are necessary)
library(lme4)
library(lmerTest)
library(RMINC)
library(tidyverse)

# Set working directory
setwd("/scratch/m/mchakrav/paulbest/genfi_df5/dbm/dbm_4/linear_model")

# Load data
data <- read_csv("pls_nona_nia.csv")

model1 <- mincLmer(jacobians ~ time_month + (time_month | Id),
                   data = data,
                   mask = "secondlevel_otsumask.mnc",
                   summary_type = "ranef",
                   parallel = c("local", 20),
                   control = lmerControl(optimizer = "Nelder_Mead"))

save(model1, file = "output_lmer/model1.RData")
save.image(file = "complete_model.RData")

# Write out one per-subject random-effect volume per unique Id
unique_id <- unique(data['Id'])
for (i in 1:nrow(unique_id)) {
  id <- unique_id[i, 1]
  column <- paste0("beta-time_month-Id", id)
  dir.create(file.path(paste0("output_lmer/", id, "/")), showWarnings = FALSE)
  output_minc_file <- paste0("output_lmer/", id, "/", id, "_time_month_Id_beta.mnc")
  mincWriteVolume(model1,
                  output.filename = output_minc_file,
                  like.filename = "secondlevel_template0.mnc",
                  column = column)
}

And seeing the following memory behaviour on the node:

[Screenshot: "Screen Shot 2022-01-19 at 6 18 06 PM" — node memory usage climbing over time]

During this time, the loop writes out ~29 files before the system runs out of memory and kills R.

As far as I understand, we're not creating any new memory-holding objects, yet R's memory consumption keeps rising until it fills the node.

@bcdarwin (Member)

Is VOLUME_CACHE_THRESHOLD set?

@gdevenyi (Contributor, Author)

The minc-toolkit default environment sets it as:

VOLUME_CACHE_THRESHOLD=-1
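For anyone reproducing this, a quick way to inspect the inherited value and try an alternative one for a single run (the value `0` below is just an illustrative alternative to test whether the caching behaviour interacts with the leak, and `run_model.R` is a placeholder name for the script above):

```shell
# Inspect the value inherited from the minc-toolkit environment
# (prints "unset" if the variable is not defined)
echo "VOLUME_CACHE_THRESHOLD=${VOLUME_CACHE_THRESHOLD:-unset}"

# Override it for everything launched from this shell session
export VOLUME_CACHE_THRESHOLD=0
echo "VOLUME_CACHE_THRESHOLD=${VOLUME_CACHE_THRESHOLD}"

# ...then launch the analysis as usual, e.g.:
# Rscript run_model.R
```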

@bcdarwin (Member)

Hmm, quite possible it is a memory leak in the C bindings or similar.

Is this memory measurement the whole machine or your R process? If the latter, do you know if the dip at 3.30pm corresponds to the beginning of writing files?

Do I have access to your modules/files on Niagara? If so, I could try to debug by running under Valgrind, but I'm pretty unfamiliar with RMINC internals, so it might be challenging.

@gdevenyi (Contributor, Author)

> Is this memory measurement the whole machine or your R process?

This is the Niagara readout from the Slurm whole-machine statistics.

> If the latter, do you know if the dip at 3.30pm corresponds to the beginning of writing files?

That's a really good question; that dip is pretty big, and all the allocations should be done by that point. I'm not sure.

> Do I have access to your modules/files on Niagara?

Yes

export QUARANTINE_PATH=/project/m/mchakrav/quarantine
module use ${QUARANTINE_PATH}/modules
module load cobralab

For now, we're addressing this by randomizing the list of files to write out and repeating the job so we'll eventually get them all.

@bcdarwin (Member)

Just as a quick note: a better workaround than randomizing would probably be to run mincWriteVolume from short-lived subprocesses, e.g. using batchMap from batchtools with the local multiprocessing backend.
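A minimal sketch of that idea, assuming the fitted `model1` and `data` from the script above are in the session. Rather than batchtools, this swaps in base R's bundled `parallel` package: with `mc.preschedule = FALSE`, `mclapply` forks one fresh child process per element, so any memory leaked inside a single mincWriteVolume call is reclaimed when that child exits.

```r
library(parallel)

# write_one() wraps the body of the original loop; model1, data and the
# template filename are assumed to exist as in the script above
write_one <- function(id) {
  column <- paste0("beta-time_month-Id", id)
  dir.create(file.path("output_lmer", id), showWarnings = FALSE)
  out <- file.path("output_lmer", id, paste0(id, "_time_month_Id_beta.mnc"))
  mincWriteVolume(model1,
                  output.filename = out,
                  like.filename = "secondlevel_template0.mnc",
                  column = column)
  invisible(out)
}

# mc.preschedule = FALSE forks a fresh, short-lived child per Id, so memory
# leaked during one mincWriteVolume call dies with that child
ids <- unique(data$Id)
mclapply(ids, write_one, mc.cores = 4, mc.preschedule = FALSE)
```

This is only a sketch, not a tested drop-in: forking copies the parent's address space lazily, so the (large) `model1` object is shared with the children rather than duplicated, but each child still pays the cost of serializing its return value.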

@bcdarwin (Member)

> > Do I have access to your modules/files on Niagara?
>
> Yes

Thanks. Is there any chance you could give me read access (via extended ACLs, say) to the data directory as well?

@gdevenyi (Contributor, Author)

I have a special share for that:

/scratch/m/mchakrav/share

You have read-write access there. The data is still copying; I suggest waiting ~30 minutes.

@bcdarwin (Member)

Are the jacobians being copied as well?

@gdevenyi (Contributor, Author)

> Are the jacobians being copied as well?

Yes. Warning: the file paths are all absolute paths. The student will be talked to.
