Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error "sorry, Dude. Didn't generate enough snps." #5

Open
ruizeliot opened this issue Oct 23, 2024 · 9 comments
Open

Error "sorry, Dude. Didn't generate enough snps." #5

ruizeliot opened this issue Oct 23, 2024 · 9 comments

Comments

@ruizeliot
Copy link

Hi! Thank you for this new program! Could you explain me why I get the error "sorry, Dude. Didn't generate enough snps." during preprocessing?

The error occurs using the function sample_ts with the following code: geno_mat, locs = training_generator.sample_ts(trees[i], args.seed). The function runs for about 1h and then generates this error message. Here is the command line I am using below:
python mapnn.py --preprocess --gpu_index any --seed 123 --num_snps 100 --n 49 --map_width 649 --slim_width 649 --sample_grid 1 --tree_list ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/sim_mapNN/chr_t_top_100_sim_123_tree_list.txt --target_list ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/cookie_mapNN/chr_t_top_100_cookie_list.txt --habitat_map ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/data_mapNN/depth_map_wio_rgb_cropped.png --empirical ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/data_mapNN/Chr_t_unlinked_snp_subset_top_100_loci_dapc_fixed_mapnn --out ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/preprocess_mapNN

The trees were generated with 2 SliM iterations (for the test as I have a big region):
defineConstant("W", 649); defineConstant("L", 4); defineConstant("G", 1e8); defineConstant("FECUN", 1/L); defineConstant("maxgens", 2);

Thank you in advance, ER

@chriscrsmith
Copy link
Collaborator

Hi and thank you for posting the issue! Is this using one of the slim recipes from the mapnn repo? Did you run slim without mutations, recapitate with msprime, and add mutations to the tree sequence?

@ruizeliot
Copy link
Author

ruizeliot commented Oct 23, 2024

Thank you for your quick answer. I used the benchmark.slim that I just slightly modified to changed the image width and the number of generations. What is the procedure to add mutations if needed as I am not familiar with slim and msprime?

benchmark_custom.zip

@chriscrsmith
Copy link
Collaborator

chriscrsmith commented Oct 23, 2024

We recommend a workflow where you do relatively few generations in slim (expensive) and then "finish" the simulation backwards in time with a coalescent model—called "recapitation"—and then you can add mutations to the completed tree sequence. This workflow is described in general here: https://tskit.dev/pyslim/docs/latest/tutorial.html

The way I did this is using this function (mentioned briefly in the mapnn readme):

from process_input import recap
recap("tempout/sim_123.trees", "tempout/recap_123.trees", seed=123)

Did you use that recap function? I hope that addresses the problem, here.

@ruizeliot
Copy link
Author

Hi again! Thanks for your answer!

I tried to run the recap function yesterday but it has been running since then while I have a small SNP dataset (loci number = 100), even though the geographic area is large (large part of the Western Indian Ocean). I checked in details and the function get stuck when running this row: self.Delta = self._create_incidence_matrix().

I also relaunched the SliM program with 50 generations (without the recap) but the preprocessing function of mapNN also fails:
2024-10-24 14:05:32.251140: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-24 14:05:32.323373: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-24 14:05:33.434702: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
starting pre-processing pipeline
getting mean from training, on sim 0
getting sd from training, on sim 0

Traceback (most recent call last):
File "/home/eruiz/mapNN/mapnn.py", line 1250, in
preprocess()
File "/home/eruiz/mapNN/mapnn.py", line 938, in preprocess
geno_mat, locs = training_generator.sample_ts(trees[i], args.seed)
File "/home/eruiz/mapNN/data_generation.py", line 334, in sample_ts
new_genotypes = self.unpolarize(geno_mat0[shuffled_indices[s]])
IndexError: index 481 is out of bounds for axis 0 with size 481`

Thank for your help in advance!

@chriscrsmith
Copy link
Collaborator

On the one hand, it's normal for the slim step, and the recapitate to be slow, so the slow runtime by itself isn't necessarily a bug. But I'm confused about the detailed steps in your workflow. In particular you mentioned the SNP count in the file you are recapitating—that is different that the order I normally run things (slim -> recapitate -> add mutations).

You need to recapitate to complete the simulation before running the preprocess step with mapnn.

@ruizeliot
Copy link
Author

Hi again! Here is the order I followed: vcf2genos (to get the .genos file) -> create_maps to get a training map -> cookie cutter on the training map -> slim -> preprocessing (this step is possibly preceded by the recap but too slow for me). I try to recapitate the SliM file...

@chriscrsmith
Copy link
Collaborator

chriscrsmith commented Oct 28, 2024

Oops sorry I just noticed your comment about the line "self.Delta = self._create_incidence_matrix()". Will you please share (1) your recap() command, (2) the full error, and (3) and your tskit and msprime versions?

Or is there no error and you're saying it continues running? Can you tell what package that line is in (e.g. maybe msprime)? It might be the case that your simulation just takes a super long time to run.

@Eliot-RUIZ
Copy link

Hi! I wasn't very clear in my previous message, but the problem with create_incidence_matrix concerns the function SpatialGraph that I was trying to use without the recap step. It outputs the error message I pasted in my previous message, which originates from create_incidence_matrix.

Concerning the recap function, there is no error message but it just get stuck running (I stopped it after 4 days). Here is the code to launch the recap function (I have tskit v0.5.5 and msprime v1.2.0): recap("../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/sim_mapNN/chr_t_top_100_sim_123.trees", "../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/recap_mapNN/chr_t_top_100_recap_123.trees", seed=123).

Meanwhile, I could get the result I wanted with EEMS in about one day using one CPU (achieving satisfying convergence) so I think computation time is not the problem...

@andrewkern
Copy link
Member

seems very strange that a coalescent simulation would take > 4 days. how large is your chromosome? what is the associated recombination rate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants