Segmentation fault, and docs #57
Would it be possible to share the input VCFs so that I can debug this? Does this happen with every chromosome file? The documentation is under |
Thanks for your reply. I am not allowed to share the input VCF because it is from an NIH restricted dataset. It does happen with every chromosome file I've tested so far. Was the procedure I used to prepare the VCF files correct? What else should I look for to diagnose the issue? I do see the command list that it produces. Here is the output for one chromosome, which might provide a hint; it does seem like there may be some issue with the input VCF:
|
This is the same file when imputation is attempted with an earlier minimac4. It also fails, but provides a bit more information.
|
Your preparation seems correct. Are your genotypes phased?

It would be good to figure out what is making v4.1.2 crash, but this will be difficult without a reproducible example. Which array was used for genotyping? Can you check whether it's alright to share the site information for chromosome 14 (i.e., the first 8 columns of the VCF, or even just the first 5), excluding the sample genotypes? I can try to reproduce with random genotypes. You could also try running a debug build (configure with |
I get a segfault at a similar stage in the process for minimac v4.1.1 and v4.1.2 when running the pre-built release versions.
This happens on at least two systems I'm running. What they have in common is that they're not very up to date. Something I suspect, but have not confirmed, is that this might be an issue with statically linking glibc. Here is what I know: if I build minimac v4.1.1 myself, it depends on the shared libraries like so:
If I check the distributed version:
This leads me to believe that 1) the distributed version is fully static and 2) it statically links glibc. That could be the cause of the segfault, depending on which parts of glibc are used: modern versions of gcc and g++ emit a warning when statically linking glibc, to the effect that the binary will still require the exact same glibc version at runtime. The general consensus is that statically linking glibc is actually bad for portability, because it depends on the runtime system having that exact version and the symbols matching. The solutions are likely to either
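For anyone who wants to verify the linking story themselves, a quick hedged sketch (demonstrated on /bin/sh here; point the commands at the minimac4 binary instead):

```shell
# Check how a binary is linked. On a dynamic executable, ldd lists the
# shared objects it depends on; on a fully static one it reports
# "not a dynamic executable". /bin/sh stands in for the minimac4 binary.
ldd /bin/sh || echo "not a dynamic executable (fully static)"
```

The same distinction also shows up in `file <binary>` output as "statically linked" vs "dynamically linked".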
My apologies if my conjectures about the build process are totally off base. In my case I do not believe the input data is the issue: the same data runs fine with our older v1.0.2 version that was compiled locally, and with Minimac3. I am having other (non-segfault) issues with the locally compiled v4.1.1, but that probably merits a separate issue.
In particular, these two lines indicate a problem:
This part works correctly on v1.0.2 and on the builds that segfault (distributed v4.1.1 and v4.1.2), but not on my self-compiled v4.1.1. The results might be correct given that incorrect read; imputation quality was much worse, but I guess we would expect that when it uses only 196 sites instead of 1133. In my case I'm using files derived from 1000 Genomes data, so I can share them if that will help, but I'm leaning towards eliminating potential glibc issues before getting into the data weeds. |
Actually, the pre-built v1.0.2 also segfaults for me. It is likely that every prebuilt version distributed in the '.sh' files is segfaulting. Only the ones I compile myself run, and of the two of those, v1.0.2 runs correctly while v4.1.1 seems to have some problem. |
@ttbek, that's a reasonable theory. If it's correct, though, I would have expected the crash to happen immediately, before getting as far as the HMM step.

Is your self-compiled binary a release or debug build? I ask because debug builds will often silence segfaults caused by uninitialized pointers. I just want to make sure that the significant difference between your self-compiled and pre-compiled versions isn't that one is a debug build.

As for your non-segfault issue, I'm not sure I understand what the problem is. Do you believe that there shouldn't be 937 variants exclusive to the target panel? Do you experience this issue with a self-compiled v4.1.2 as well? If you have a shareable example, I'm eager to take a look. |
I started on:
and
But my most recent check of the pre-built v1.0.2 is on
I don't think this sort of glibc problem segfaults right away (a missing shared library probably would, but here the binary is actually static and just depends on the symbols existing at runtime... if I understand it correctly, which I may not). I think it happens when execution reaches the part where the symbols don't match, but I don't have much experience with this either. The self-compiled v4.1.1 is a Release build (and has the non-segfault issue). For the self-compiled v1.0.2 I can't say for sure, as I compiled it a few years ago, but I'm pretty sure I just followed the directions as they were.
Correct, there shouldn't be any variants exclusive to the target panel. The target panel was created by splitting one file, 70% of samples for ref and 30% for the target and then masking 95% of SNPs in the target, which is why
makes sense to me even though that one is segfaulting, which also matches Minimac3 and v1.0.2. Unless something in the process removes variants for not having any variation now? The only file that could be different between v1.0.2 and now is the .msav of the reference, and as it isn't plaintext I can't inspect it directly.
I haven't attempted compiling v4.1.2 myself yet. I did just try v4.1.1 in debug mode; no change from the release-mode behavior. Here's a question, though: would it be a problem if I ran the cget part on one machine and the cmake part on another? It could be, if the cget part builds the dependencies, which I'm thinking it probably does (I did that because I was having problems getting cget on the Redhat system). I'll probably try v4.1.2 compiled myself tomorrow (it's getting sort of late here), and also see if I can do a build with all steps on one system. I had initially assumed cget only fetched sources, so it would be fine, but I'm not really sure. Is there an easy way to rebuild the stuff cget downloaded without using cget? |
Thanks for the info and this example. I'll look into this later today. Yes, cget builds dependencies from source. The list of dependencies is below (non-http URIs are on github). There isn't much difference between v4.1.1 and v4.1.2, so I wouldn't bother testing v4.1.2 if it's going to be a hassle.
|
Alrighty, that was probably the issue. I still didn't get it running on the Redhat machine, but on the other server I managed to get cget working and a new enough cmake. I did end up needing to symlink /usr/bin/gmake to /usr/bin/make. I was still having trouble getting a clean build at that point (cmake was being really weird: it didn't seem to navigate relative paths correctly, couldn't find -pthread, and thought the compiler wasn't working, none of which was actually the case), so I downloaded the source again from scratch. So in the end I have v4.1.1 compiled (I believe correctly now) on one server. Output now looks like this:
That looks roughly like what I expect, though I haven't plotted the comparison with ground truth again yet... but ideally I still want it to work on the other server. Actually, what I would really like is a portable build, since the install is on NFS and it would be nice for all the servers to be able to run the program without worrying about which one I'm on. It's good that we've ruled out problems with the source for local compiling, though.

@rlad Do you know which distributed version you were using? That is, did you build it from source, use the .sh install file, the .deb package, or the .rpm package? We might be having totally different problems. |
@rlad Is that output showing the correct number of GWAS sites when you run with v1.0.2? Aside from the technical problem where it doesn't run, that looks like too few sites for imputation to give meaningful results anyway (unless they were clustered in a much smaller region and only that region was of interest, in which case cutting out the unneeded region would improve the percentage and probably let it proceed). Likewise for v4.1.2:
Is the 6 correct? I hope not, because imputation may be good, but it's not magic. |
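As a quick sanity check on that log line, the reported ratio does come straight from the two counts (6 typed sites, 366795 imputed sites, as shown in the log at the bottom of this issue):

```shell
# Reproduce the "Typed sites to imputed sites ratio" from the log:
# 6 typed sites out of 366795 imputed sites.
awk 'BEGIN { printf "%.5e\n", 6 / 366795 }'   # prints 1.63579e-05
```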
@jonathonl or @ttbek Should it show 100% overlap, or close to it, in GWAS sites? I'm actually not sure which genome build the reference files are in, GRCh37 or GRCh38; it would be helpful if that were noted in the filenames, or somewhere. ftp://share.sph.umich.edu/minimac3/G1K_P3_M3VCF_FILES_WITH_ESTIMATES.tar.gz Can you let me know which build each of those files is, if you know? I installed using the .deb package here: https://github.com/statgen/Minimac4/releases |
@ttbek, I'm getting
I will work on fixing the pre-built binary packages. I'm not sure which approach I will take yet. Thanks for your suggestions. @rlad, you don't need 100%, but imputing from only 6 genotyped variants will not yield very accurate results. The reference panels you listed are 1000 genomes mapped to build 37. This may just be an artifact of improperly built binaries though. Both the crash and the low number of overlapping variants might be fixed by building Minimac4 from source instead of using the deb package. |
@rlad
This looks like your reference is 1000 Genomes, looks fine.
Just one sample: it can sometimes be more informative to impute more samples together; I don't remember offhand which imputation algorithms benefit from that and which don't. Doing one is probably fine. 10444 genotyped variants is probably low. Even if they were all in the reference, that would be about 0.5% (genotyped SNPs / ref SNPs). That might be ok-ish, but personally I would like to hit at least 1% and would be much more comfortable at 5%. Of course, we often receive data well after decisions have been made about which genotyping panel to use, etc., so we work with what we have.
But only 45 variants overlapping the reference means that only 45 will really be informative here. This is bad. Like Jonathon said, we don't need 100% overlap, but we do need enough overlap to get the above ratio somewhere near 1% (impossible here, but as close as we can get, so hopefully at least 95% overlap; it isn't impossible for it to be less, but if it were, I would investigate exactly why). How much overlap to expect depends on the ref and target populations. The larger the ref, the more we expect, but a larger target may also have less overlap due to more rare SNPs that don't appear in the ref (not likely if your target is from a real genotyping array, but possible if you have exome or whole-genome data where SNPs were dropped during quality control and you wish to impute them back). If you have the .vcf version of the ref, you can use bcftools to check the intersection; even inspecting manually should let you see whether it should be more than 45.
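As a sketch of such an intersection check (the two .sites files below are fabricated for illustration; with real data you would derive CHROM:POS keys from the VCFs as shown in the comment, or use `bcftools isec` for an allele-aware intersection):

```shell
# Rough overlap check on CHROM:POS keys. With real VCFs, generate the
# site lists with:  grep -v '^#' file.vcf | awk '{print $1"\t"$2}' | sort -u
printf '1\t100\n1\t200\n1\t300\n' | sort -u > ref.sites
printf '1\t200\n1\t300\n1\t400\n' | sort -u > target.sites
# comm -12 prints lines common to both sorted inputs:
comm -12 ref.sites target.sites | wc -l   # 2 overlapping sites
```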
Actually, here I'm confused: if there are 10444 genotyped SNPs and only 45 overlap with the ref, then shouldn't all the rest be exclusive to the target/GWAS panel?
18+14+16+10 = 58, which is more than 45, but maybe possible because I believe the chunks overlap (see the buffer bounds); or is the number of sites given only between the main bounds? |
@jonathonl Awesome, I'll try that way on the Redhat system tomorrow (it's past 2 AM for me). Another way to try building with musl is probably to do a static build in an Alpine Linux Docker container, since I think Alpine uses musl as its default libc. For the differences between musl and glibc, this is a good read: https://wiki.musl-libc.org/functional-differences-from-glibc.html For one program we worked on, we also distribute the whole thing as an Alpine Linux Docker container, since Alpine is so small: https://hub.docker.com/r/kkunji/gigi2/ The whole thing is just 3.62 MB. The downsides are, of course, that 1) users need Docker (and often sudo) and 2) because it runs through Docker, the arguments need to account for where things are mounted inside the container. |
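A minimal sketch of that Alpine approach (everything here is illustrative: the image tag, package list, and build steps are assumptions, not commands from this thread):

```dockerfile
# Hypothetical sketch: static (musl) build inside Alpine.
FROM alpine:3.17 AS build
RUN apk add --no-cache g++ make cmake git zlib-dev zlib-static
# Fetch and build Minimac4 and its dependencies here, passing -static
# in the linker flags; linking statically against musl avoids the
# glibc static-linking portability pitfalls discussed above.
```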
I have built a new package, minimac4-4.1.2-Linux-x86_64.sh.gz, and have tested it on rhel7, ubuntu18, and ubuntu20. Hopefully this will resolve both of your issues; please let me know. Note: I have not updated the downloads under the releases section yet, so please test the file attached to this comment. |
I'm having trouble with the cmake build; I'm attaching a .txt file that shows the build output. At first I had an error, but switching to a newer gcc toolchain seems to have resolved it. I built from scratch again after that, but it gives the same messed-up output as before, so perhaps the dependencies being built elsewhere wasn't the root cause after all. The only things that really stood out to me in the build were warnings about implicit fallthrough while building LZMA, and
On the positive side, the new pre-built package is working for me on the Redhat system (Operating System: Red Hat Enterprise Linux) and on Ubuntu 20.04. I didn't try the other Ubuntu yet, but it will probably be fine. What change did you end up making? |
The fallthrough is likely intentional switch logic. The two lines that stand out to me in that log file are:
This is just a wild guess, but maybe try |
The clock issue may be due to your build directory being on NFS. Maybe try building on a local disk ( |
I changed the packaging image from Ubuntu to Alpine (81fab63). |
The clock issue shouldn't matter for a clean build, only when there are object files from a previous build that might not need rebuilding. But I gave it a shot anyway; no dice. First I tried the make clean and single-threaded rebuild, and then a totally-from-scratch build on local storage. I'm attaching a file with all the build output from that from-scratch attempt. The new static build solves my immediate problem, and I understand if we don't want to pursue some weird build issue, but it does bug me, so I'm happy to try any next troubleshooting steps if you're interested. That is, it might not be that important whether it builds on my specific server: it's been running for years, so it isn't exactly a clean Redhat install (where we would be more sure it should work). |
I was able to reproduce this on centos7 this morning. I'm debugging now. |
This appears to be an issue that only occurs on rhel7/centos7 (using both devtoolset-7 and devtoolset-8). I cannot reproduce this on centos8, rhel8, ubuntu16, or ubuntu18 using the default compilers. When debugging, it looks like memory corruption, but neither ASan nor Valgrind detect any errors. I'm ready to believe that this is a bug in the rhel7 toolchain. I tried editing some of the relevant code to not rely on modern c++ spec (e.g., implicitly defined copy constructors), but I was unable to do anything that would fix rhel7/centos7 builds. |
Hmm. If it is a rhel7/8 toolchain bug, then what is the path forward? Add a specific warning about these platforms to the build instructions and recommend the static build to those users?

I might poke around a bit to satisfy my curiosity, but it sounds like you were pretty thorough already.
|
Thanks for the updated build. It is working so far and not crashing on our system. I had a few questions I'm hoping you can answer and/or point me in the direction to answer:
Thanks again for all of your help getting minimac4 to work for us! |
R2 is the estimated square of the Pearson correlation under the assumption of Hardy-Weinberg equilibrium. Not sure what the CS stands for off the top of my head.

R2 (also referred to as Rsq, though sometimes one or the other is used by convention to distinguish the Minimac Rsq from, e.g., empirically measured R2 when you know the ground truth) is the quality metric. There is also a leave-one-out empirical Rsq that is computed on genotyped sites by leaving each one out, imputing it back, and then calculating the correlation. That serves as a general quality-control check, whereas the R2 above is often used to filter individual sites. It is unfortunately common for people to use a threshold of 0.3 for this, perhaps because they read that number as a quality cutoff in the literature for Impute/Impute2, but R2 isn't really on the same scale as the INFO score, so I don't think 0.3 is that appropriate; I would lean towards at least 0.5, though what is appropriate will depend on your downstream use, of course. If your genotyped data is still as sparse as it seemed earlier, then you might need to be more lenient to have many sites to work with at all... but I wouldn't consider them trustworthy in that case. On a good genotyping array it isn't unusual to see an average R2 over 0.9 for SNPs with MAF over 0.05 or so. Minimac 3 and 4 can be very good, actually; this might be a good read: https://www.nature.com/articles/s41431-021-00917-7

These metrics are all described on the Minimac3 page of the Michigan Genome Analysis Wiki: https://genome.sph.umich.edu/wiki/Minimac3_Info_File

I think the above metrics sufficiently answer the questions about positive controls and which sites to trust. You may also wish to look over the canonical Minimac 3 and 4 publications; I think most of the formulas are in their supplementary material.
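To make the R2 definition a bit more concrete, here is a tiny hedged sketch, assuming the usual MaCH/Minimac-style formulation (empirical variance of the estimated haploid dosages divided by the Hardy-Weinberg expected variance p(1-p), with p the mean dosage); the six dosage values are made up for illustration:

```shell
# Hedged sketch of an estimated Rsq: Var(dosage) / (p * (1 - p)),
# where p is the mean of the haploid dosages. Values are fabricated.
printf '0.1\n0.9\n0.0\n1.0\n0.5\n0.5\n' | awk '
  { s += $1; ss += $1 * $1; n++ }
  END {
    p = s / n               # estimated allele frequency
    var = ss / n - p * p    # empirical variance of the dosages
    printf "%.4f\n", var / (p * (1 - p))
  }'
```

Dosages pinned near 0 or 1 (confident imputation) push this toward 1; dosages hovering near the allele frequency push it toward 0.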
About the 40%: maybe. Are they genotyped but not in the reference? And which input are they going missing from, the ref or the target? See the option:

--allTypedSites
OFF by default. If ON, Minimac4 will also include variants that were genotyped but NOT in the reference panel in the output files (and imputes any missing data in such variants to the major allele frequency).

I think that option is still around; I'm not in front of the computer at the moment.
…On Fri, Feb 17, 2023, 2:00 AM rlad ***@***.***> wrote:
Thanks for the updated build. It is working so far and not crashing on our
system.
I had a few questions I'm hoping you can answer and/or point me in the
direction to answer:
- Does R2 mean R squared? If so, of what?
- How is AVG_CS computed? What does it signify?
- How are the quality metrics computed?
- How does the algorithm perform on positive control sites? How do we
decide which sites to trust?
- I noticed that the output imputed file dropped about 40% of the
input SNPs. Should it be doing that?
Thanks again for all of your help getting minimac4 to work for us!
|
Probably not, unless you have an array with a lot of rare custom content. I would try running https://github.com/zhanxw/checkVCF with the b37 fasta to ensure your ref alleles match the sequence reference. You could also check out other pre-imputation recommendations at https://imputationserver.readthedocs.io/en/latest/prepare-your-data/.
This is the average call score of a variant, where call score is |
Offhand question: does Minimac4 share any code with the htslib synced_bcf_reader (maybe through Savvy)? Ah, perhaps so: https://github.com/statgen/savvy/blob/master/src/savvy/vcf_reader.cpp |
Actually, neither Minimac4 nor the current version of Savvy uses htslib; Savvy is now a header-only library. I wonder whether using the default rhel compiler vs devtoolset-7/8 has any effect on the bcftools/htslib segfaults. My other thought (kind of a shot in the dark) is to try adding |
Just gave it a shot, but it seems like it is using GCC, and I had already tried -O0. Maybe the toolchain uses a modified compiler (I can try going back to that server), but the one I was trying just now uses loadable gcc modules that I believe are stock GCC; at least they appear to be when invoked with --version. The .hpp files seem to use a similar approach to the storage layout? Since there is a fair amount of bit manipulation, I checked just in case: both machines are little-endian. Ah, I was just playing around with it a bit more and got something potentially useful from the address sanitizer. I'll need to type it out manually here, since security is pretty tight on that server:
I didn't count the spaces or the number of '=' signs, and I redacted two path names, but otherwise I guess this is the message. I suppose I should raise it over at htslib. (Not sure how to unlink the issue numbers; I tried putting '\' before the '#'.) Line 3200 seems to be the line with the opening '{' of the bcf_unpack function in this version of vcf.c. I'll try to make a more minimal example in case it is my surrounding code that isn't portable somehow; maybe when it points to the start of bcf_unpack, it means my bcf1_t pointer argument to the function was bad (though it was fine on Ubuntu; I compile with -fsanitize=undefined but it doesn't seem to catch anything here, and adding the address sanitizer on Ubuntu just now changed nothing, still clean on Ubuntu). Hmm, on the Redhat server I just tried a different input file with -fsanitize=undefined only (without -fsanitize=address) and got an error this time:
Ah, never mind: that is because those other files aren't indexed? Though again, the behavior differs between Redhat and Ubuntu. On Ubuntu the sanitizer does not complain and I just get:
Still with -fsanitize=undefined, on Redhat when I give a file that doesn't exist as the input:
Line 85 of groundySyncedReader.cpp has bcf_hdr_nsamples(hdr), and hdr was declared earlier as:

When I do the same on Ubuntu I get the error message from htslib rather than the sanitizer:
It's not weird to get errors for wrong inputs; what is weird to me is that in one case the undefined-behavior sanitizer gives the output and in the other it doesn't. So is it wrong but defined on Ubuntu? Maybe because of different GCC versions (5.2.0 on Redhat, 9.4.0 on Ubuntu)? Or perhaps different htslib versions (1.13 on Ubuntu; the server doesn't have 1.13, so I've tried 1.10.2, 1.11, and 1.16 there). I just tried the gcc 8.3.0 module on the server; no change from 5.2.0. |
Ok, this looks like an uninitialized pointer issue (either in your code or htslib). I don't think it's related to the Minimac4 issues on rhel7. |
Well, the errors for missing files are uninitialized-pointer issues in my code (though I'm not quite sure whose code is responsible for checking for the index file, but that's neither here nor there). What I wanted to highlight was the difference in ubsan output; I'm not sure why that should differ between two operating systems (different versions of ubsan, I guess... but it's the newer one that didn't complain, and also where the code works). The part in bcf_unpack is what I thought might be related, long odds as they may be. I'll file that with htslib, of course. Returning to the Minimac4 code on the first Redhat server: the address sanitizer looks clean, but the undefined-behavior sanitizer spits out:
The section of code is:
Maybe you saw this already, but just in case. |
Using memstomp also seems to pick up on this. The output from that is over 2 GB of the same sort of thing; I'm attaching the first and last 500 lines of output in one file. |
I can push a fix to silence the sanitizer error message later, but it has no effect on the minimac4 issues on rhel7. In the meantime, you could do it yourself by replacing lines 2063-2064 of deps/include/savvy/typed_value.hpp to look like the following, and then running
|
Trying to run minimac4 v4.1.2 I am getting segmentation faults:
minimac v4.1.2
Imputing 1:1-20000000 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Loading reference haplotypes took 2 seconds
Typed sites to imputed sites ratio: 1.63579e-05 (6/366795)
2257 variants are exclusive to target file and will be excluded from output
Loading switch probabilities ...
Loading switch probabilities took 0 seconds
Running HMM with 1 threads ...
Segmentation fault (core dumped)
The procedure to prepare the input VCF was:
If this is incorrect, it may be the reason for the segfaults, and it would be great to know the correct steps. And if the entire original genotype VCF file, including all chromosomes, can be imputed at once, that would also be good to know.
I am using the msav files downloaded from ftp://share.sph.umich.edu/minimac4/panels/g1k_p3_msav_files_with_estimates.tar.gz which I discovered by accident but assume are the current ones?
Also, is there any more complete documentation for this version and its options? They seem to be entirely different from the ones documented at https://genome.sph.umich.edu/wiki/Minimac4_Documentation
Thanks very much for any/all help and suggestions.