
Segmentation fault, and docs #57

Open
rlad opened this issue Feb 11, 2023 · 34 comments


@rlad

rlad commented Feb 11, 2023

Trying to run minimac4 v4.1.2 I am getting segmentation faults:

minimac v4.1.2
Imputing 1:1-20000000 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Loading reference haplotypes took 2 seconds
Typed sites to imputed sites ratio: 1.63579e-05 (6/366795)
2257 variants are exclusive to target file and will be excluded from output
Loading switch probabilities ...
Loading switch probabilities took 0 seconds
Running HMM with 1 threads ...
Segmentation fault (core dumped)

The procedure to prepare the input VCF was:

  • separate the full genotype VCF into separate chromosome VCFs and add the same header to each of the chromosome VCFs
  • bgzip the chromosome VCFs
  • bcftools index the chromosome vcf.gz files
  • Attempt to impute each chromosome VCF separately
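For reference, the splitting/indexing steps above can be sketched roughly like this (a rough sketch, not an official recipe: `all.vcf.gz` and the chromosome naming are hypothetical, and note that `bcftools view -Oz` writes bgzip-compressed output directly, so a separate bgzip step isn't strictly needed and the header is carried over automatically):

```shell
# Sketch of the preparation steps above. "all.vcf.gz" is a hypothetical
# bgzipped, indexed multi-chromosome VCF; autosomes only here.
split_by_chromosome() {
  input=$1
  for chr in $(seq 1 22); do
    # -r restricts output to one chromosome (requires a .tbi/.csi index
    # on the input); -Oz emits bgzip-compressed VCF directly.
    bcftools view -r "$chr" "$input" -Oz -o "chr${chr}.vcf.gz"
    bcftools index "chr${chr}.vcf.gz"
  done
}
# usage: split_by_chromosome all.vcf.gz
```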

If this procedure is incorrect, it may be the cause of the segfaults, so it would be great to know the correct steps. It would also be good to know whether the entire original genotype VCF, with all chromosomes, can be imputed at once.

I am using the msav files downloaded from ftp://share.sph.umich.edu/minimac4/panels/g1k_p3_msav_files_with_estimates.tar.gz which I discovered by accident but assume are the current ones?

Also, is there any more complete documentation on this version and its options, since they seem to be entirely different from the ones documented at https://genome.sph.umich.edu/wiki/Minimac4_Documentation ?

Thanks very much for any/all help and suggestions.

@jonathonl
Contributor

jonathonl commented Feb 11, 2023

Would it be possible to share the input VCFs so that I can debug this? Does this happen with every chromosome file? The documentation is under ./minimac4 --help.

@rlad
Author

rlad commented Feb 11, 2023

Thanks for your reply. I am not allowed to share the input VCF because it is from an NIH restricted dataset. It does happen with every chromosome file I've tested so far.

Was the procedure I used to prepare the VCF files correct? What else should I look for to diagnose the issue?

I do see the command list produced by minimac4 --help, but it doesn't have any details, for example on whether the steps I outlined above are correct, when to use any of the options, etc. I was hoping for some more detailed documentation.

Here is the output for one chromosome which might provide a hint? It does seem like there may be some issue with the input VCF:

minimac4 /mnt/VCF-Transfer/Minimac4/refhaps/g1k_p3_msav_files_with_estimates/1000g_phase3_v5.chr14.with_parameter_estimates.msav ./14.995041_99504199.vcf.gz > ./14.995041_99504199-test.vcf.gz
minimac v4.1.2
Imputing 14:1-20000000 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Loading reference haplotypes took 0 seconds
Typed sites to imputed sites ratio: 0.000201735 (2/9914)
94 variants are exclusive to target file and will be excluded from output
Loading switch probabilities ...
Loading switch probabilities took 0 seconds
Running HMM with 1 threads ...
Segmentation fault (core dumped)

@rlad
Author

rlad commented Feb 11, 2023

Here is the same file being imputed with an earlier Minimac4. It also fails, but provides a bit more information.

/mnt/VCF-Transfer/Minimac4/release-build/minimac4  --refhaps /mnt/VCF-Transfer/Minimac4/refhaps/14.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf   --haps ./14.995041_99504199.vcf --ignoreDuplicates --prefix test-out --nobgzip
 --------------------------------------------------------------------------------
          Minimac4 - Fast Imputation Based on State Space Reduction HMM
 --------------------------------------------------------------------------------
           (c) 2014 - Sayantan Das, Christian Fuchsberger, David Hinds
                             Mary Kate Wing, Goncalo Abecasis
 Version: 1.0.2;
 Built: Thu Dec  9 21:16:34 UTC 2021 by ubuntu
 URL = http://genome.sph.umich.edu/wiki/Minimac4
 Command Line Options:
       Reference Haplotypes : --refHaps [/mnt/VCF-Transfer/Minimac4/refhaps/14.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf],
                              --passOnly, --rsid, --referenceEstimates [ON],
                              --mapFile [docs/geneticMapFile.b38.map.txt.gz]
          Target Haplotypes : --haps [./14.995041_99504199.vcf]
          Output Parameters : --prefix [test-out], --estimate, --nobgzip [ON],
                              --vcfBuffer [200], --format [GT,DS],
                              --allTypedSites, --meta, --memUsage
        Chunking Parameters : --ChunkLengthMb [20.00], --ChunkOverlapMb [3.00]
          Subset Parameters : --chr [], --start, --end, --window
   Approximation Parameters : --minimac3, --probThreshold [0.01],
                              --diffThreshold [0.01], --topThreshold [0.01]
           Other Parameters : --log, --help, --cpus [5], --params
                  PhoneHome : --noPhoneHome, --phoneHomeThinning [50]
Additional Options:
                              --ignoreDuplicates [1]
 Starting Main Imputation/Estimation Analysis ...
 Performing preliminary check on input parameters...
 ------------------------------------------------------------------------------
                             PRELIMINARY FILE CHECK
 ------------------------------------------------------------------------------
 Checking GWAS haplotype file : ./14.995041_99504199.vcf
 Gathering variant information ...
 Successful !!!
 Checking Reference haplotype file : /mnt/VCF-Transfer/Minimac4/refhaps/14.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf
 Gathering variant information ...
 Successful !!!
 Reference Panel   : Found 2504 samples (5008 haplotypes) and 1535592 variants ...
 Target/GWAS Panel : Found 1 samples (2 haplotypes) and 10444 variants ...
                     45 variants overlap with Reference panel
                     0 variants imported that exist only in Target/GWAS panel
 ------------------------------------------------------------------------------
                           CHUNKING INFORMATION
 ------------------------------------------------------------------------------
 Chunking region into 4 chunk(s) with atleast 383898 variants in each chunk ...
 Details of chunks is given below ...
 No   LeftBuffer      LeftEnd   RightPoint  RightBuffer       #Sites(GWAS/Ref/%)
 -------------------------------------------------------------------------------
  1     19000059     19000059     41751400     44654830       18/  436865/ 0.00%
  2     38789158     41751401     63967623     67198348       14/  489851/ 0.00%
  3     60840456     63967624     86945553     89950903       16/  489816/ 0.00%
  4     83925504     86945554    107289453    107289453       10/  436908/ 0.00%
 ERROR !!! ERROR !!! ERROR !!!
 Chunk 4 has less than 0.1% of variants from the GWAS panel overlapping with the reference panel ...
 Please increase the value of "--chunkSize" to analyze larger chunks
 or decrease the value of "--minRatio" = 0.1
 Program Exiting ...

@jonathonl
Contributor

Your preparation seems correct. Are your genotypes phased? It would be good to figure out what is making v4.1.2 crash, but this will be difficult without a reproducible example.

Which array was used for genotyping? Can you check whether it's alright to share the site information (i.e., the first 8 columns of the VCF or even the first 5 columns) for chromosome 14, excluding the sample genotypes? I can try to reproduce with random genotypes.

You could also try running a debug build (configure with cmake -DCMAKE_BUILD_TYPE=Debug and then rebuild with make). If the debug version of Minimac4 crashes, we can get a hint as to where it is crashing: run gdb path/to/the/minimac4 path/to/core, where path/to/core is the path to the core file dumped by the debug version of minimac4, and then run bt inside the gdb shell to get the backtrace.

@ttbek

ttbek commented Feb 12, 2023

I get a segfault at a similar stage in the process for minimac v4.1.1 and v4.1.2 when running the pre-built release versions.

minimac v4.1.2

Imputing 22:1-20000000 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Notice: no variant records in reference query region (22:1-23000000)
Loading reference haplotypes took 0 seconds
Notice: skipping empty region in reference (22:1:20000000)

Imputing 22:20000001-31999992 ...
Loading target haplotypes ...
Loading target haplotypes took 1 seconds
Loading reference haplotypes ...
Loading reference haplotypes took 0 seconds
Typed sites to imputed sites ratio: 0.05 (1133/22660)
Loading switch probabilities ...
Loading switch probabilities took 0 seconds
Running HMM with 1 threads ...
Segmentation fault (core dumped)

This happens on at least 2 systems I'm running. What they have in common is that they're not very up to date. Something I suspect, but have not confirmed, is that this might be an issue with static linking glibc. Here is what I know:

If I build minimac v4.1.1 myself, it depends on the shared libraries as so:

ldd ./minimac4
      linux-vdso.so.1 => (0x00007ffe26de6000)
      libz.so.1 => /export/cse/kkunji/Minimac4/cget/lib/libz.so.1 (0x00007fc181403000)
      libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc1811e7000)
      libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fc180edf000)
      libm.so.6 => /lib64/libm.so.6 (0x00007fc180bdd000)
      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fc1809c7000)
      libc.so.6 => /lib64/libc.so.6 (0x00007fc1805f9000)
      /lib64/ld-linux-x86-64.so.2 (0x00007fc18161e000)

If I check the distributed version:

ldd /usr/bin/minimac4
not a dynamic executable

This leads me to believe that 1) the distributed version is fully static and 2) it is statically linking glibc. That could be the cause of the segfault depending on what parts of glibc are being used (modern versions of gcc and g++ will try to output a warning while building statically with glibc, to the effect that at runtime it will still require the user to have that exact version of glibc to run).
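To tell at a glance whether a given binary is fully static, the ldd check above can be wrapped in a tiny helper (a minimal sketch; `is_static` is a hypothetical name, and it just matches ldd's "not a dynamic executable" message):

```shell
# Classify a binary as static or dynamic using ldd's output,
# mirroring the manual check above.
is_static() {
  if ldd "$1" 2>&1 | grep -q 'not a dynamic executable'; then
    echo static
  else
    echo dynamic
  fi
}
# usage: is_static /usr/bin/minimac4
```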

The general consensus is that statically linking glibc is actually bad for portability, because the binary will still depend on the exact glibc version being present and the symbols will need to match. The likely solutions are to either:

  1. Compile everything except glibc statically, keeping the dynamic dependence on glibc (e.g. via -Wl,-Bstatic and -Wl,-Bdynamic if using gcc/g++), AND build against the oldest version of glibc that works for your code. glibc supports forward compatibility as best it can but makes no effort to support backwards compatibility (that is, a binary compiled against an old glibc should run on newer glibc, but not vice versa).
  2. Ditch glibc in favor of an implementation that is more friendly to static compiling (just for the release for which we want widespread compatibility), e.g. musl, which provides gcc and clang wrappers just for this purpose. Minimac4 may be complex enough that it needs the full cross-compiling toolchain. https://wiki.musl-libc.org/getting-started.html

My apologies if my conjectures about the build process are totally off base.

In my case I do not believe the input data is the issue: the same data runs fine with our older v1.0.2 that was compiled locally, and with Minimac3. I am having other (non-segfault) issues with the locally compiled v4.1.1, but that probably merits a separate issue.

minimac v4.1.1

Imputing 22:1-20000000 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Notice: no variant records in reference query region (22:1-23000000)
Loading reference haplotypes took 0 seconds
Notice: skipping empty region in reference (22:1:20000000)

Imputing 22:20000001-31999992 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Loading reference haplotypes took 0 seconds
Typed sites to imputed sites ratio: 0.00851582 (196/23016)
937 variants are exclusive to target file and will be excluded from output
Loading switch probabilities ...
Loading switch probabilities took 0 seconds
Running HMM with 1 threads ...
Completed 200 of 1753 samples
Completed 400 of 1753 samples
Completed 600 of 1753 samples
Completed 800 of 1753 samples
Completed 1000 of 1753 samples
Completed 1200 of 1753 samples
Completed 1400 of 1753 samples
Completed 1600 of 1753 samples
Completed 1753 of 1753 samples
Running HMM took 18 seconds
Writing temp files took 3 seconds
Merging temp files ...
Merging temp files took 91 seconds

Mean ER2:

Total time for parsing input: 0 seconds
Total time for HMM: 18 seconds
Total time for writing output: 94 seconds
Total wall time (h:mm:ss): 0:01:52

In particular, these two lines indicate a problem:

Typed sites to imputed sites ratio: 0.00851582 (196/23016)
937 variants are exclusive to target file and will be excluded from output

This part works correctly on v1.0.2 and on the versions that segfault (distributed v4.1.1 and v4.1.2), but not on my self-compiled v4.1.1. The results might be correct given that incorrect read; imputation quality was much worse, but I guess we would expect that when only 196 sites are used instead of 1133.

In my case I'm using files derived from 1000 Genomes data, so I can share them if it will help, but I'm leaning towards eliminating potential glibc issues before getting into the data weeds.

@ttbek

ttbek commented Feb 12, 2023

Actually, the pre-built v1.0.2 also segfaults for me. It is likely that every prebuilt version distributed in the '.sh' files is segfaulting. Only the ones I compile run, and of the two of those I have compiled, v1.0.2 runs correctly while v4.1.1 seems to have some problem.

@jonathonl
Contributor

@ttbek, that's a reasonable theory. If your theory is correct, I would have expected it to crash immediately (before entering main()), but maybe I'm wrong. Which OS version are you running on?

Is your self-compiled binary a release or debug build? I ask because debug builds will often silence segfaults that are caused by uninitialized pointers. I just want to make sure that the significant difference between your self-compiled and pre-compiled versions isn't that one is a debug build.

As for your non-segfault issue, I'm not sure I understand what the problem is. Do you believe that there shouldn't be 937 variants exclusive to the target panel? Do you experience this issue with a self-compiled v4.1.2 as well? If you have a shareable example, I'm eager to take a look.

@ttbek

ttbek commented Feb 12, 2023

I started on:

Operating System: Red Hat Enterprise Linux
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
Kernel: Linux 3.10.0-1160.80.1.el7.x86_64
Architecture: x86-64

and

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.7 LTS
Release: 16.04
Codename: xenial

But my most recent check of the pre-built v1.0.2 is on

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy

I don't think this sort of glibc problem segfaults right away. It probably would for a missing shared library, but here the binary is actually static yet depends on the symbols existing at runtime (if I understand it myself, which I may not). I think the crash happens when execution reaches the part where the symbols don't match, but I don't have much experience with this either.

The self-compiled v4.1.1 is in Release mode (has the non-segfault issue). The self-compiled v1.0.2, I can't say for sure as I compiled it a few years ago, but I'm pretty sure I did so by just following the directions as they were.

Built: Wed Oct 9 14:42:55 +03 2019 by

Do you believe that there shouldn't be 937 variants exclusive to the target panel?

Correct, there shouldn't be any variants exclusive to the target panel. The target panel was created by splitting one file (70% of samples for the ref and 30% for the target) and then masking 95% of SNPs in the target, which is why

Typed sites to imputed sites ratio: 0.05 (1133/22660)

makes sense to me even though that run is segfaulting; it also matches Minimac3 and v1.0.2. Unless something in the process now removes variants for not having any variation? The only file that could differ between v1.0.2 and now is the .msav of the reference, and since it isn't plaintext I can't inspect it directly.
Example.zip

Do you experience this issue with a self-compiled v4.1.2 as well?

I haven't attempted v4.1.2 compilation myself yet. I did just try v4.1.1 in debug mode right now; no change from release mode behavior. Here's a question though: would it be a problem if I ran the cget part on one machine and the cmake part on another? It could be, if the cget part builds the dependencies, which I'm thinking it probably does (I split the steps because I was having problems getting cget on the Red Hat system).

I'll probably try compiling v4.1.2 myself tomorrow (it's getting sort of late here), and also see if I can do a build with all steps on one system. I had initially assumed cget only fetched sources, so splitting the steps would be fine, but I'm not really sure. Is there an easy way to rebuild the stuff cget downloaded without using cget?

@jonathonl
Contributor

Thanks for the info and this example. I'll look into this later today.

Yes, cget builds dependencies from source. The list of dependencies is below (non-http URIs are on GitHub). There isn't much difference between v4.1.1 and v4.1.2, so I wouldn't bother testing v4.1.2 if it's going to be a hassle.

@ttbek

ttbek commented Feb 13, 2023

Alrighty, that was probably the issue. I still didn't get it running on the Red Hat machine, but on the other server I managed to get cget working and a new enough cmake. I did end up needing to symlink /usr/bin/gmake to /usr/bin/make. I was still having trouble getting a clean build at that point (cmake was being really weird: it didn't seem to navigate relative paths correctly, couldn't find pthreads, and thought the compiler wasn't working, none of which was actually the case), so I downloaded the source again from scratch.

So in the end I have v4.1.1 compiled (I believe correctly now) on one server. Output now looks like this:

minimac v4.1.1

Imputing 22:1-20000000 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Notice: no variant records in reference query region (22:1-23000000)
Loading reference haplotypes took 0 seconds
Notice: skipping empty region in reference (22:1:20000000)

Imputing 22:20000001-31999992 ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Loading reference haplotypes took 0 seconds
Typed sites to imputed sites ratio: 0.05 (1133/22660)
Loading switch probabilities ...
Loading switch probabilities took 0 seconds
Running HMM with 1 threads ...
Completed 200 of 1753 samples
Completed 400 of 1753 samples
Completed 600 of 1753 samples
Completed 800 of 1753 samples
Completed 1000 of 1753 samples
Completed 1200 of 1753 samples
Completed 1400 of 1753 samples
Completed 1600 of 1753 samples
Completed 1753 of 1753 samples
Running HMM took 15 seconds
Writing temp files took 2 seconds
Merging temp files ...
Merging temp files took 43 seconds

Mean ER2: 0.926101 0.900519 0.809843 0.550555

Total time for parsing input: 0 seconds
Total time for HMM: 15 seconds
Total time for writing output: 45 seconds
Total wall time (h:mm:ss): 0:01:00

That looks roughly like what I expect, though I haven't plotted the comparison with ground truth again yet.

.. but ideally I still want it to work on the other server. Actually, what I would really like is for a portable build to work, since the install is on NFS and it would be nice for all the servers to be able to run the program without worrying about which one I'm on. It's good that we've ruled out problems with the source for local compiling, though.

@rlad Do you know which distributed version you were using? That is, did you build it from source, use the .sh install file, the .deb package, or the .rpm package? We might be having totally different problems.

@ttbek

ttbek commented Feb 13, 2023

@rlad Is that output showing the correct number of GWAS sites when you run with v1.0.2? Aside from the technical problem where it doesn't run, that looks like too few sites for imputation to give meaningful results anyway (unless they were clustered in a much smaller region and only that region was of interest, in which case cutting out the unneeded region would improve the percentage and probably let it proceed).

Likewise for v4.1.2

Typed sites to imputed sites ratio: 1.63579e-05 (6/366795)

Is the 6 correct? I hope not, because imputation may be good, but it's not magic.

@rlad
Author

rlad commented Feb 13, 2023

@jonathonl or @ttbek Should it show 100% overlap or close to it in GWAS sites? I'm actually not sure which genome build the reference files are in, GRCh37 or GRCh38. It would be helpful if that was noted in the filenames, or somewhere.

ftp://share.sph.umich.edu/minimac3/G1K_P3_M3VCF_FILES_WITH_ESTIMATES.tar.gz
and
ftp://share.sph.umich.edu/minimac4/panels/g1k_p3_msav_files_with_estimates.tar.gz

Can you let me know which build each of those files are if you know?

I installed using the .deb package here https://github.com/statgen/Minimac4/releases

@jonathonl
Contributor

@ttbek, I'm getting Typed sites to imputed sites ratio: 0.05 (1133/22660) when building from source, so you are probably correct that moving the dependencies between machines caused the incorrect ratio. I just pushed a cmake-only alternative to cget that should hopefully help you build on Red Hat. The instructions below are also documented in the README.

cmake -P dependencies.cmake deps/
mkdir build; cd build
cmake -DCMAKE_PREFIX_PATH=$(pwd)/../deps -DCMAKE_CXX_FLAGS="-I$(pwd)/../deps/include" ..
make
make install

I will work on fixing the pre-built binary packages. I'm not sure which approach I will take yet. Thanks for your suggestions.

@rlad, you don't need 100%, but imputing from only 6 genotyped variants will not yield very accurate results. The reference panels you listed are 1000 genomes mapped to build 37. This may just be an artifact of improperly built binaries though. Both the crash and the low number of overlapping variants might be fixed by building Minimac4 from source instead of using the deb package.

@ttbek

ttbek commented Feb 13, 2023

@rlad
Well, let's assume the v1.0.2 output you have is correct for a moment and I'll see if I can interpret it correctly.

Reference Panel : Found 2504 samples (5008 haplotypes) and 1535592 variants ...

This looks like your reference is 1000 Genomes, looks fine.

Target/GWAS Panel : Found 1 samples (2 haplotypes) and 10444 variants ...

Just one sample. It can sometimes be more informative to impute more samples together; I don't remember offhand which imputation algorithms benefit from that and which don't. Doing one is probably fine. 10444 genotyped variants is probably low: even if they were all in the reference, it would be about 0.5% (genotyped SNPs / ref SNPs). That might be ok-ish, but personally I would like to hit at least 1% and would be much more comfortable at 5%. Of course, we often receive data well after decisions have been made about which genotyping panel to use, so we work with what we have.

                 45 variants overlap with Reference panel

But only 45 variants overlapping the reference means that only 45 will really be informative here. This is bad. Like Jonathon said, we don't need 100% overlap, but we do need enough overlap to get the above ratio to roughly 1% (impossible here, but as close as we can get, so hopefully at least 95% of the genotyped sites overlap; less is not impossible, but if it were less I would investigate exactly why). How much overlap we expect depends on the ref and target populations: the larger the ref, the more we expect, but a larger target may also have less overlap due to more rare SNPs that don't appear in the ref (not likely if your target is from a real genotyping array, but possible if you have exome or whole-genome data where SNPs were ditched during quality control and you wish to impute them back).

If you have the .vcf version of the ref, then you can use bcftools to check the intersection; even inspecting manually should let you see whether it should be more than 45.
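As a rough way to do that intersection check with standard tools (a sketch assuming uncompressed, tab-delimited VCFs; `overlap_count` is a hypothetical helper that compares CHROM:POS keys only, and `bcftools isec` remains the more thorough option since it also compares alleles):

```shell
# Count CHROM:POS keys shared by two uncompressed VCFs.
# For .vcf.gz inputs, substitute zcat for the plain reads.
overlap_count() {
  grep -v '^#' "$1" | awk '{print $1":"$2}' | sort -u > /tmp/keys_a.$$
  grep -v '^#' "$2" | awk '{print $1":"$2}' | sort -u > /tmp/keys_b.$$
  comm -12 /tmp/keys_a.$$ /tmp/keys_b.$$ | wc -l
  rm -f /tmp/keys_a.$$ /tmp/keys_b.$$
}
# usage: overlap_count target.vcf reference.vcf
```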

                 0 variants imported that exist only in Target/GWAS panel

Actually, here I'm confused: if there are 10444 genotyped SNPs and only 45 overlap with the ref, then shouldn't all the remaining ones be exclusive to the Target/GWAS panel?
@jonathonl? This would support the idea that these numbers come from the prebuilt binaries acting up. Or am I misinterpreting?

------------------------------------------------------------------------------
CHUNKING INFORMATION
------------------------------------------------------------------------------
Chunking region into 4 chunk(s) with atleast 383898 variants in each chunk ...
Details of chunks is given below ...
No LeftBuffer LeftEnd RightPoint RightBuffer #Sites(GWAS/Ref/%)
-------------------------------------------------------------------------------
1 19000059 19000059 41751400 44654830 18/ 436865/ 0.00%
2 38789158 41751401 63967623 67198348 14/ 489851/ 0.00%
3 60840456 63967624 86945553 89950903 16/ 489816/ 0.00%
4 83925504 86945554 107289453 107289453 10/ 436908/ 0.00%

18+14+16+10 = 58... which is more than 45, but maybe possible because I believe the chunks overlap (see the buffer bounds), or is the number of sites given only between the main bounds?

@ttbek

ttbek commented Feb 13, 2023

@jonathonl Awesome, I'll try that way on the Redhat system tomorrow (past 2AM for me).

Another way to try building with musl is probably to static build in an Alpine Linux Docker container, because I think they use musl as the default libc in Alpine. For differences between musl and glibc, this is a good read: https://wiki.musl-libc.org/functional-differences-from-glibc.html

For one program we worked on, we also distribute the whole thing as an Alpine Linux Docker container, since Alpine is so small. https://hub.docker.com/r/kkunji/gigi2/ The whole thing is just 3.62 MB. The downsides are of course that 1) users need Docker (and often sudo) and 2) because it runs through Docker, the arguments need to account for where things are mounted inside the container.

@jonathonl
Contributor

I have built a new package minimac4-4.1.2-Linux-x86_64.sh.gz and have tested it on rhel7, ubuntu18, and ubuntu20. Hopefully this will resolve both of your issues. Please let me know.

Note: I have not updated the downloads under the releases section yet, so please test the file attached to this comment.

@ttbek

ttbek commented Feb 14, 2023

I'm having trouble with the cmake build; I'm attaching a .txt file that shows the build output. At first I had an error, but switching to a newer gcc toolchain seems to have resolved it. I built from scratch again after that, but it is giving the same messed-up output as before, so perhaps the dependencies being built elsewhere wasn't the root cause. The only things that really stood out to me in the build were warnings about implicit fallthrough while building LZMA and

CMake Warning (dev) at CMakeGet.cmake:291 (list):
Policy CMP0007 is not set: list command no longer ignores empty elements.
Run "cmake --help-policy CMP0007" for policy details. Use the cmake_policy
command to set the policy and suppress this warning. List has value =
[https:;;github.com;facebook;zstd;archive;v1.3.2.tar.gz].
Call Stack (most recent call first):
CMakeGet.cmake:394 (cget_fetch)
CMakeGet.cmake:461 (cmake_get)
CMakeGet.cmake:396 (cmake_get_from)
CMakeGet.cmake:461 (cmake_get)
CMakeGet.cmake:396 (cmake_get_from)
CMakeGet.cmake:461 (cmake_get)
dependencies.cmake:17 (cmake_get_from)
This warning is for project developers. Use -Wno-dev to suppress it.

Redhat_Build_Issues.txt

On the positive side, the new pre-built package is working for me on the Redhat system

Operating System: Red Hat Enterprise Linux
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
Kernel: Linux 3.10.0-1160.80.1.el7.x86_64
Architecture: x86-64

and on Ubuntu 20.04, didn't try the other Ubuntu yet, but it will probably be fine.

What change did you end up making?

@jonathonl
Contributor

The fallthrough is likely intentional switch logic. The two lines that stand out to me in that log file are:

make[2]: Warning: File 'CMakeFiles/minimac4.dir/depend.make' has modification time 4.1 s in the future
make[2]: warning:  Clock skew detected.  Your build may be incomplete.

This is just a wild guess, but maybe try make clean && make -j1.

@jonathonl
Contributor

The clock issue may be due to your build directory being on NFS. Maybe try building on a local disk (/tmp maybe).

@jonathonl
Contributor

What change did you end up making?

I changed the packaging image from Ubuntu to Alpine (81fab63).

@ttbek

ttbek commented Feb 15, 2023

The clock issue shouldn't matter for a clean build, only when there are object files from a previous build that might not need rebuilding. But I gave it a shot anyway; no dice. First I tried the make clean and single-threaded rebuild, and then I built totally from scratch again on local storage. I'm attaching a file with all the build output from the from-scratch attempt:
Redhat_Build_Issues_2.txt
No more warning about the clock skew, but the build output is identical. Any chance gcc 7.3.1 is still too old for something here?

The new static build solves my immediate problem. I understand if we don't want to pursue some weird build issue, but it does bug me, and I'm happy to run any further troubleshooting steps if you're interested. That said, it might not matter much whether it builds on my specific server; it's been running for years, so it isn't exactly a clean Red Hat install (where we would be more sure it should work).

@jonathonl
Contributor

I was able to reproduce this on centos7 this morning. I'm debugging now.

@jonathonl
Contributor

This appears to be an issue that only occurs on rhel7/centos7 (using both devtoolset-7 and devtoolset-8). I cannot reproduce this on centos8, rhel8, ubuntu16, or ubuntu18 using the default compilers.

When debugging, it looks like memory corruption, but neither ASan nor Valgrind detects any errors. I'm ready to believe that this is a bug in the rhel7 toolchain. I tried editing some of the relevant code to not rely on the modern C++ spec (e.g., implicitly defined copy constructors), but I was unable to do anything that would fix rhel7/centos7 builds.

@ttbek

ttbek commented Feb 15, 2023 via email

@rlad
Author

rlad commented Feb 16, 2023

Thanks for the updated build. It is working so far and not crashing on our system.

I had a few questions I'm hoping you can answer and/or point me in the direction to answer:

  • Does R2 mean R squared? If so, of what?
  • How is AVG_CS computed? What does it signify?
  • How are the quality metrics computed?
  • How does the algorithm perform on positive control sites? How do we decide which sites to trust?
  • I noticed that the output imputed file dropped about 40% of the input SNPs. Should it be doing that?

Thanks again for all of your help getting minimac4 to work for us!

@ttbek

ttbek commented Feb 17, 2023 via email

@jonathonl
Contributor

jonathonl commented Feb 17, 2023

I noticed that the output imputed file dropped about 40% of the input SNPs. Should it be doing that?

Probably not, unless you have an array with a lot of rare custom content. I would try running https://github.com/zhanxw/checkVCF with the b37 fasta to ensure your ref alleles match the sequence reference. You could also check out other pre-imputation recommendations at https://imputationserver.readthedocs.io/en/latest/prepare-your-data/.
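
For reference, checkVCF's usual invocation is roughly the following (the fasta and VCF names here are placeholders for your own files):

```shell
# Writes its report files using the prefix given to -o
python checkVCF.py -r human_g1k_v37.fasta -o check_out chr1.vcf.gz
```

Mismatched ref alleles reported there would explain dropped sites.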

How is AVG_CS computed? What does it signify?

This is the average call score of a variant, where the call score for a haplotype is 1 - HDS when HDS < 0.5, and HDS otherwise (HDS being the haplotype dosage). This is a measure of model confidence, not accuracy.
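
In code form, that scoring looks roughly like this (a sketch; the function names are mine, not Minimac4's):

```python
def call_score(hds):
    """Call score for one haplotype dosage (HDS), per the rule above:
    1 - HDS when HDS < 0.5, else HDS. Dosages near 0 or 1 score high
    (confident); 0.5 scores lowest (maximally uncertain)."""
    return 1.0 - hds if hds < 0.5 else hds

def avg_cs(haplotype_dosages):
    """AVG_CS: mean call score over all haplotypes at a variant."""
    return sum(call_score(h) for h in haplotype_dosages) / len(haplotype_dosages)

print(avg_cs([0.01, 0.99, 0.02, 0.98]))  # confident calls, ~0.985
print(avg_cs([0.50, 0.45, 0.55]))        # uncertain calls, ~0.53
```

So a variant where every haplotype dosage sits near 0 or 1 gets AVG_CS near 1, regardless of whether those calls are actually correct.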

@ttbek

ttbek commented Feb 20, 2023

Offhand question: does Minimac4 share any code with the htslib synced_bcf_reader (maybe through Savvy)? Ah, perhaps so: https://github.com/statgen/savvy/blob/master/src/savvy/vcf_reader.cpp
I ask because I'm also having trouble with some of my own code that uses htslib. Locally it builds fine, runs fine, and Valgrind gives a clean bill of health. On Redhat it builds fine but segfaults when run. I wonder if htslib is doing something that is undefined behavior.
Maybe not specifically in the synced_bcf_reader; the behavior above is with htslib generally, and in that case it's even worse (I can't even seem to build a static binary with musl that is compatible with the Redhat system). On the other hand, the admins of that server seem to have been able to build bcftools alright there... though, unless my memory is faulty, I think their bcftools might segfault when doing a 3-way intersection, which would use the synced reader. I should verify that and dig some more.

@jonathonl
Contributor

Actually, neither Minimac4 nor the current version of savvy use htslib. Savvy is now a header-only library and the src/savvy/*.cpp files are dead code.

I wonder whether using the default rhel compiler vs devtoolset-7/8 has any effect on the bcftools/htslib segfaults.

My other thought (kind of a shot in the dark) is to try adding -fno-tree-vectorize when compiling htslib and bcftools. The design of htslib is prone to memory misalignment. That's usually not a concern, since htslib uses -O2 and tree-vectorize was historically only enabled at -O3 in gcc, but maybe the rhel compilers treat that optimization level differently.
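
Concretely, that could look something like the following (a sketch; the version and flags here assume htslib's default CFLAGS of -g -Wall -O2, which make lets you override on the command line):

```shell
cd htslib-1.16
make clean
# Same defaults as upstream, with vectorization disabled
make CFLAGS="-g -Wall -O2 -fno-tree-vectorize"
```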

@ttbek

ttbek commented Feb 20, 2023

Just gave it a shot, but it seems like it is using GCC, and I had already tried -O0. Maybe the toolchain uses a modified compiler (I can try going back to that server), but the one I'm currently trying uses loadable gcc modules that I believe are stock GCC; at least they appear to be when invoked with --version.

Do the .hpp files use a similar approach for the storage layout?

Since there is a fair amount of bit manipulation, I checked just in case: both machines are little-endian.

Ah, I was just playing around with it a bit more and got something potentially useful from the address sanitizer. I'll need to type it out manually here, since security is pretty tight on that server:

ASAN:SIGSEGV
================================================
==20939==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000020 (pc 0x2ae469863711 bp 0x7ffeff85b1e30 sp 0x7ffef85b1910 T0)
#0 0x2ae469863710 in bcf_unpack /gpfs/software/genomics/htslib/1.16/htslib-1.16/vcf.c:3200
#1 0x4029d7 in main /home/..redacted../groundySyncedReader.cpp:159
#2 0x2ae46aa17554 in __libc_start_main (/lib64/libc.so.6+0x22554)
#3 0x401858 (/mnt/ovd/slaveserver/..redacted../groundySyncedReader+0x401858)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /gpfs/software/genomics/htslib/1.16/htslib-1.16/vcf.c:3200 bcf_unpack
==20939==ABORTING

I didn't count the spaces or the number of '=' signs, and I redacted two path names, but otherwise I guess this is the message. I suppose I should raise it over at htslib. (Not sure how to unlink the issue numbers; I tried putting '\' before the '#'.)

Line 3200 seems to be the line with the opening '{' of the bcf_unpack function in this version of the vcf.c file.

I'll try to make a more minimal example in case it's my surrounding code that isn't portable somehow. Since the trace points to the start of bcf_unpack, maybe my bcf1_t pointer argument to the function was bad (though it was fine on Ubuntu). I compile with -fsanitize=undefined, but it doesn't seem to catch anything here, and adding the address sanitizer on Ubuntu just now changed nothing; it still looks clean there.

Hmm, on the Redhat server I just tried a different input file with -fsanitize=undefined only (without -fsanitize=address) and got an error from that this time:

groundySyncedReader.cpp:85:59: runtime error: member access within null pointer of type 'struct bcf_hdr_t'
Segmentation fault

Ah, never mind, that is because those other files aren't indexed? Though again, the behavior differs between Redhat and Ubuntu. On Ubuntu the sanitizer does not complain and I just get:

[E::idx_find_and_load] Could not retrieve index file for 'OutCR0.05_chr1_ref0.3.Imputed.dose.vcf.gz'
Segmentation fault (core dumped)

Still with -fsanitize=undefined, on Redhat when I give a file that doesn't exist as the input:
groundySyncedReader.cpp:85:55: runtime error: member access within misaligned address 0x000000000002 for type 'struct bcf_hdr_t', which requires 8 byte alignment
0x000000000002: note: pointer points here

Segmentation fault

Line 85 of groundySyncedReader.cpp has bcf_hdr_nsamples(hdr), and hdr was declared earlier as:
bcf_hdr_t *hdr = bcf_hdr_read(inFile);
So that makes some sense when the file is missing, but...

When I do the same on Ubuntu I get the error message from htslib rather than the sanitizer:

[E::hts_open_format] Failed to open file "./OutCR0.05_chr1_ref0.3.Imputed." : No such file or directory
The input files could not be read.

It's not weird to get errors for bad inputs; what's weird to me is that in one case the undefined behavior sanitizer produces the output and in the other it doesn't. So is the behavior wrong but somehow defined on Ubuntu?

Maybe it's down to different GCC versions (5.2.0 on Redhat, 9.4.0 on Ubuntu)? Or different htslib versions (1.13 on Ubuntu; the server doesn't have 1.13, so there I've tried 1.10.2, 1.11, and 1.16). I also just tried the gcc 8.3.0 module on the server, with no change from 5.2.0.

@jonathonl
Contributor

jonathonl commented Feb 20, 2023

Ok, this looks like an uninitialized pointer issue (either in your code or htslib). I don't think it's related to the Minimac4 issues on rhel7.

@ttbek

ttbek commented Feb 21, 2023

Well, the errors for missing files are uninitialized pointer issues in my code (though I'm not quite sure whose code is responsible for checking for the index file, but that's neither here nor there). What I wanted to highlight is the difference in ubsan output; I'm not sure why that should differ between two operating systems (different ubsan versions, I guess, but it's the newer one that didn't complain, and also where the code works). The part in bcf_unpack is what I thought might be related, long odds as they may have been. I'll file that with htslib, of course.

Returning to the Minimac4 code on the first Redhat server, the address sanitizer looks clean, but the undefined behavior sanitizer spits out:

/export/cse/kkunji/Minimac4attempt3Redhat/retest/Minimac4/deps/include/savvy/typed_value.hpp:2063:18: runtime error: null pointer passed as argument 1, which is declared to never be null
/export/cse/kkunji/Minimac4attempt3Redhat/retest/Minimac4/deps/include/savvy/typed_value.hpp:2063:18: runtime error: null pointer passed as argument 2, which is declared to never be null

The section of code is:

2046   inline
2047   typed_value& typed_value::operator=(const typed_value& src)
2048   {
2049     if (&src != this)
2050     {
2051       val_type_ = src.val_type_;
2052       off_type_ = src.off_type_;
2053       size_ = src.size_;
2054       sparse_size_ = src.sparse_size_;
2055       pbwt_flag_ = src.pbwt_flag_;
2056
2057       std::size_t off_width = off_type_ ? 1u << bcf_type_shift[off_type_] : 0;
2058       std::size_t val_width = val_type_ ? 1u << bcf_type_shift[val_type_] : 0;
2059       std::size_t sz = off_type_ ? sparse_size_ : size_;
2060       off_data_.resize(off_width * sz);
2061       val_data_.resize(val_width * sz);
2062
2063       std::memcpy(off_data_.data(), src.off_data_.data(), off_width * sz);
2064       std::memcpy(val_data_.data(), src.val_data_.data(), val_width * sz);
2065     }
2066     return *this;
2067   }

Maybe you saw this already, but just in case.

@ttbek

ttbek commented Feb 21, 2023

Using memstomp also seems to pick up on this. Its output is over 2 GB of the same sort of thing; I'm attaching the first and last 500 lines in one file.
memstomp_output_start_end.txt

@jonathonl
Contributor

jonathonl commented Feb 21, 2023

I can push a fix to silence the sanitizer error message later, but it has no effect on the Minimac4 issues on rhel7. In the meantime, you could do it yourself by replacing lines 2063-2064 of deps/include/savvy/typed_value.hpp with the following and then running make clean && make from your build directory.

if (!off_data_.empty())
  std::memcpy(off_data_.data(), src.off_data_.data(), off_width * sz);
if (!val_data_.empty())
  std::memcpy(val_data_.data(), src.val_data_.data(), val_width * sz);
