Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test refactoring of Makefiles #753

Open
wants to merge 111 commits into
base: master
Choose a base branch
from

Conversation

hageboeck
Copy link
Member

@hageboeck hageboeck commented Aug 17, 2023

Based on Jorgen's work to separate builds with different vectorisation levels and cuda into dedicated targets and build folders, here we completely change the strategy how makefiles are used.

TODOs:

  • regenerate all processes once CODEGEN work is finalised
  • Fix runTest.exe for cuda. Due to a namespace issue, "MadgraphTest" is not correctly instantiated
  • Discuss overall strategy and target names
  • Decide on:
    • Default make target yes/no?
    • If yes: What does it do?
  • Add a master Makefile to each process directory that just recursively makes or make clean in all subprocesses? (suggested by Stefan)
  • Work around clang-format inconsistencies.
    • Locally, clang-format passes
    • In the CI, it doesn't pass. This is most likely because of different clang-format versions, so the clang-format check has either be downgraded to a warning, or the relevant lines have to excluded from the clang-format step
  • Understand the intention/impact of Andrea's -mmacosx-version-min=11.3 patch. I tried to incorporate it as best as possible, but using this option now leads to the warning "ld: warning: dylib (../../lib/build.native_d_inl0_hrd0/libmg5amc_common.so) was built for newer macOS version (13.0) than being linked (11.3)", which is understandable if the link version is that old". I might have misunderstood what the patch was supposed to fix, but it's not removing the warning, it's adding one at the moment.
  • Work around patch problems for the top of make_opts. Locally, the patch applies cleanly, but apparently not in the CI. A solution might be to not at all touch any line at the top of make_opts.

General description (might need to be updated before merge):

  • Include rules from the C++ Makefile into the Fortran Makefile
    --> Now, only one make is invoked to build both C++ and Fortran executables (or fortran+cuda for the cuda build)
  • Move common setup stuff into Source/make_opts
    --> Setup such as architecture selection and compiler things don't have to be repeated in each Makefile
  • Allow for overriding flags from outside for every compiler. Internal
    flags are now called MG_CXXFLAGS, and CXXFLAGS are appended to the
    compiler invocation. Similar flags exist for FC and NVCC.
    • This means that you can easily do make CXXFLAGS="-O0 -g" cppavx2 to debug C++ with avx2 (or export that variable for an entire session) without breaking the build system.
    • With this, we follow the makefile convention that the flags that the user should set for compiler XX are called XXFLAGS, and these flags follow the internal flags that the Makefile specifies.
  • Every architecture-specific C++/cuda build now ends up in its own
    directory. This includes the library, object files and all executables.
  • In order to do an architecture-specific build, the Makefile invokes
    itself with the appropriate flags. This allows for parallel
    architecture-specific builds, and greatly simplifies the Makefiles.
    make cppavx2 will e.g. do something like $(make) AVX=avx2 AVXFLAGS=<...> cpp
  • The cuda build is now completely separated from SIMD-enabled builds
  • A new default target has been added, "cppnative", which builds
    -march=native if the user doesn't specify anything else.
  • The main targets are now:
    • make cuda
    • make cpp{sse4,avx2,avx512y,avx512z}
    • make cppall for all of the above
    • make ALL for all cpp and cuda
    • make cppnative
    • ... names to be discussed
  • Clean targets have been simplified.
    • "clean" cleans an entire subprocess,
    • "cleanavx2" cleans only the architecture-specific build in the subprocess,
    • "cleanall" cleans all subprocesses.
    • "distclean" cleans everything

@oliviermattelaer
Copy link
Member

oliviermattelaer commented Aug 30, 2023

will comment here (another?) issue with the makefile that we might want to handle at the same time.

If I try to compile the pure fortran code (on my mac). First error is:

[P1_uux_ttx]$ make madevent_fortran
/bin/bash: -c: line 0: syntax error near unexpected token `&'
/bin/bash: -c: line 0: `/Library/Developer/CommandLineTools/usr/bin/make  -f cudacpp.mk -pn |& awk '/Building/{print $3}' | sed s/BUILDDIR=//'
makefile:56: *** CUDACPP_BUILDDIR='' should not be empty!.  Stop.
[P1_uux_ttx]$ make madevent_cudacpp
/bin/bash: -c: line 0: syntax error near unexpected token `&'
/bin/bash: -c: line 0: `/Library/Developer/CommandLineTools/usr/bin/make  -f cudacpp.mk -pn |& awk '/Building/{print $3}' | sed s/BUILDDIR=//'
makefile:56: *** CUDACPP_BUILDDIR='' should not be empty!.  Stop.

obviously having CUDACPP_BUILDDIR empty is reasonable in this particular case.

second error (after removing such check by hand) is

ld: unknown option: --no-relax

which is related to avoid 'failed to convert GOTPCREL relocation' error #458

UPDATE

problem was a bash syntax not supported on all terminal/bash version

Now that C++ and cuda build folders are separate, the test cases have to
be instantiated on every compiler pass.
clang-format 16 was not accepting the macros at the end of runTest.cc
In order to generate a correct patch for madgraph4gpu, the generated
code cannot be patched. Therefore, the pre-patch step is removed here.

The desired change (-O3 -ffast-math -fbounds-check) is absorbed into
patch.common
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Dec 9, 2023
…2py patching from generateAndCompare.sh to patchMad.sh

Before this commit, codegen was now failing on itscrd90 as it was previously failing on itscrd80 (patchMd.sh fails due to f2py)

After this commit, codegen is succeeding again on itscrd90, and now also on itscrd80 (fixing the issue in PR madgraph5#753)
@valassi
Copy link
Member

valassi commented Dec 9, 2023

Hi @hageboeck thanks for all this, I started having a look.

The first thing that I think needs solving is that the CI fails. In the codegen, many of my new actions fail patchMad.sh. I actually realised that this was also happening across my nodes: on el9 the codegen was ok, but not on el8. So I added a fix in WIP PR #797. However in that PR my CI also fails, so I guess there is still more to be done. I will investigaete more

@valassi
Copy link
Member

valassi commented Dec 9, 2023

Hi again, I think I fixed the make_opts issue.

The CI however now fails because we need clang format 16 and this is not available in the CI. You have an easy way to do that (without external actions from outside official github, preferably)? Otherwise, do we really need clang format 16? You changed runTest.cc formatting for that, but I was using clang format 13 previously. (An alternative is to not require 16, and disable formatting in runTest.cc)

@valassi
Copy link
Member

valassi commented Dec 9, 2023

Ok now clang-format-16 is fixed

Next issues

  • fix 'make avxall' for something else in the CI scripts

(One more issue would be caching clang-16... but in the end I removed the need for clang-16, now 13 is ok, and I protected runTest.cc)

@valassi
Copy link
Member

valassi commented Dec 9, 2023

Hi @hageboeck I will stop here, sorry.

The main issue is that make -j fails

ccache gfortran -I. -fPIC -w -ffixed-line-length-132 -O3 -ffast-math -fbounds-check -c readgrid.f -o readgrid.o
dsample.f:1090:10:

 1090 |       use DiscreteSampler
      |          1
Fatal Error: Cannot open module file ‘discretesampler.mod’ for reading at (1): No such file or directory

I had a look, you removed those (ugly, I admit) .lib and replaced by explicit discretesampler.mod. Can you please have a look yourslef and try to fix make -j? Thanks Andrea

@hageboeck
Copy link
Member Author

The main issue is that make -j fails

ccache gfortran -I. -fPIC -w -ffixed-line-length-132 -O3 -ffast-math -fbounds-check -c readgrid.f -o readgrid.o
dsample.f:1090:10:

 1090 |       use DiscreteSampler
      |          1
Fatal Error: Cannot open module file ‘discretesampler.mod’ for reading at (1): No such file or directory

Hello @valassi,

I had a look. Long story short:
Not my fault, the bug (or shortcoming I guess) is in ccache.

In a bit more detail:
The missing file is the side effect of a Fortran compilation that ccache is unaware of. ccache cannot deal with fortran modules (or Fortran at all; they reverted Fortran support in the linked discussion. No idea if they re-reverted this at some point).

The module that's missing above gets built in the following way:
DiscreteSampler.f --> DiscreteSampler.o + discretesampler.mod --> dsample.o

See this demonstration:

$ make dsample.o
gfortran -I. -fPIC -w -ffixed-line-length-132 -O3 -ffast-math -fbounds-check -c DiscreteSampler.f -o DiscreteSampler.o
gfortran -I. -fPIC -w -ffixed-line-length-132 -O3 -ffast-math -fbounds-check -c dsample.f -o dsample.o

The makefile implements the dependency via object files, and hopes that the modules drop out correctly. I never touched that part:
https://github.com/hageboeck/madgraph4gpu/blob/96f9ba777db50a18ba2c4049b4398e8abd591213/epochX/cudacpp/gg_tt.mad/Source/makefile#L77-L78

It's just that ccache doesn't create modules. Since I don't use ccache for madgraph, I never saw any problems. I can clean and make -j in Source in a loop, and it always builds correctly.

I don't know what the way forward is. You could disable ccache for Fortran, or you could add special make rules for Fortran compilations that also generate modules.

@valassi
Copy link
Member

valassi commented Dec 12, 2023

Hi @hageboeck thanks. Various points.

  1. It seems I am unable myself to reproduce 'make -j' now, very strange. Maybe I had made a mistake? Or maybe I got into some strange configuration which in any case would have failed in my older code too? Anyway, unless I see this again, I guess there is nothing to fix for that.

  2. I was aware about ccache not supporting fortran. But I do not understand how this could influence 'make -j': if ccache works, it simply means that it produces a .o (or other things, .a, .mod etc) from cache rather than from a real build, and if it does not work, it creates it from a real build. But if the build product is there already, it is the makefile that does not trigger anything, and ccache is not involved at all? (As I said above I seem to be unable to reproduce 'make -j' issues, so I guess this is irrelevant, but I mention this anyway).

Anyway: I agree, for clarity it is probably better to remove ccache from gfortran. I have done this in codegen in my branch.

  1. More importantly now, from other small tests, I do not understand how 'make clean*' is supposed to do. Previously 'make cleanall' was cleaning both Source and local P1 (and mybe mre than one P1), now it does not clean src? And in any case if I do both 'make cleanall' and 'make cleansrc' I am stuck with a build,native empty directory?

Rephrasing:

Clean targets have been simplified.

* "clean" cleans an entire subprocess,

* "cleanavx2" cleans only the architecture-specific build in the subprocess,

* "cleanall" cleans all subprocesses.

* "distclean" cleans everything

What I miss here is (1) something that cleans also src and Source (distclean is too m,uch, it also cleans googletest) and (2) the fact that build.* are not removed for me on linux.

And by the way I think we do not need cleanavx2?

@valassi
Copy link
Member

valassi commented Dec 12, 2023

I am just comparing what these were doing before

  • For make clean
    Cleans the local P* subdirectory. Does not clean all other P*, does not clean src/lib/Source.

  • For make cleanall
    Cleans the local P* subdirectory. all other P*, and src/lib/Source.

  • For make distclean
    In addition cleans also googletest

I would keep something similar: a simple 'make cleanall' to clean also src/lib/Source.

Or we can discuss it. It is clearly "ugly" that "make" also makes src/Source/lib while "make clean" does not clean that? And we maybe need a way to make all P* subdirectories too...

@hageboeck
Copy link
Member Author

hageboeck commented Dec 13, 2023

I am just comparing what these were doing before

  • For make clean
    Cleans the local P* subdirectory. Does not clean all other P*, does not clean src/lib/Source.

It's doing the same now; just leaving the directories intact. Maybe you want me to remove the directories, too, but I didn't do that because you might have profiling results or other data in those. I just remove build artifacts, so it doesn't destroy something that you would like to keep. Can be discussed.

  • For make cleanall
    Cleans the local P* subdirectory. all other P*, and src/lib/Source.

It does the same except for cleaning src/Source. For those, there is cleansrc, which I found somewhere in Jørgen's work, so I left that intact.
This could be included in the cleanall target with

- cleanall:
+ cleanall: cleansrc

Say it and I add this.

  • For make distclean
    In addition cleans also googletest

I would keep something similar: a simple 'make cleanall' to clean also src/lib/Source.

Or we can discuss it. It is clearly "ugly" that "make" also makes src/Source/lib while "make clean" does not clean that? And we maybe need a way to make all P* subdirectories too...

For that, there would be the meta-Makefile in <processName>.mad. [Edit: It's not in the PR, but Stefan and me mentioned that. Let me add it in the next minutes.] I think it's more logical to keep the makefiles in subdirectories "local" in the sense that they deal only with the local stuff. (Of course we break that rule by building Source/src/lib, but it wouldn't compile otherwise).
But to build everything everything, I guess it's more logical to do:

cd gg_tt.mad
make -j cppavx2
make clean

and this would build and clean all Source/src/lib and all P*. What do you think?

Zenny's and Filips work showed that the colour matrix can run in single
precision without impact on the results. Make this the default in the
makefiles now.
- clean will now also remove build directories when they are empty.
- cleanall will also invoke cleansrc to clean the source directory.
@hageboeck
Copy link
Member Author

hageboeck commented Dec 13, 2023

Hello @valassi,

just pushed the following amendments:

  • clean wipes the build directories unless they contain some user-created files.
  • cleanall now also cleans Source/src/lib
  • And make mixed-precision mode the default since the colour matrix is sufficiently accurate as we discussed in person a few times.

@valassi
Copy link
Member

valassi commented Dec 13, 2023

Thanks Stefan I will have a look!

Listen at least some of the patches I have in #797 are also needed, how do you prefer that we proceed? Maybe I make a subset and make them as a PR against your branch? Or I tell you which ones to cherry-pick?

Thanks
Andrea

valassi added a commit to valassi/madgraph4gpu that referenced this pull request Dec 14, 2023
…2py patching from generateAndCompare.sh to patchMad.sh

(This is an old commit from 9 Dec but the description is still valid on 14 Dec)

Before this commit, codegen was now failing on itscrd90 as it was previously failing on itscrd80 (patchMd.sh fails due to f2py)

After this commit, codegen is succeeding again on itscrd90, and now also on itscrd80 (fixing the issue in PR madgraph5#753)
@valassi
Copy link
Member

valassi commented Dec 14, 2023

@hageboeck thanks I am having a look.

A couple of points

  • I am now recreating/retesting the patches I had made in WIP: refactor makefiles (additions to Stephan's and Joorgen's PRs) #797 to fix CODEGEN in the CI and similar issues. Can you please wait and then apply my patches on top of yours before doing other changes?
  • In particular, please do not do forced pushes now on what you have so far.
  • I checked 'make cleanall': it gives an error (had you tested that? maybe it succeeds on mac but must be tested on linux?). Anyway, I will also fix that.
ls: cannot access 'build.*': No such file or directory
rm: cannot remove 'build.*': No such file or directory
make[1]: *** [makefile:224: clean] Error 1
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx'
make: *** [makefile:227: cleanall] Error 2

More later!
Thanks

@hageboeck
Copy link
Member Author

  • I am now recreating/retesting the patches I had made in WIP: refactor makefiles (additions to Stephan's and Joorgen's PRs) #797 to fix CODEGEN in the CI and similar issues. Can you please wait and then apply my patches on top of yours before doing other changes?
  • In particular, please do not do forced pushes now on what you have so far.
  • I checked 'make cleanall': it gives an error (had you tested that? maybe it succeeds on mac but must be tested on linux?). Anyway, I will also fix that.

Interesting. Of course I will wait, because I know what conflict resolution feels like in this project.

Also interesting that you ask whether I tested that. Of course I tested! Both on mac and linux. I do nothing without testing it, but of course I might overlook specific workflows. In this case, this seems to be cleaning without having a build directory. The fix is easy, but I see that you are already doing that, so will wait now as promised.

@valassi
Copy link
Member

valassi commented Dec 14, 2023

Also interesting that you ask whether I tested that. Of course I tested! Both on mac and linux. I do nothing without testing it, but of course I might overlook specific workflows. In this case, this seems to be cleaning without having a build directory. The fix is easy, but I see that you are already doing that, so will wait now as promised.

Thanks Stephan! Yes I imagine you tested... which is even more surprising. Maybe you tested only with a clean dir? The failure is only when build.* have already been deleted. This is because for d in build.*; do if there is NO directory called build.* then expands to literal build.*, which is not a directory. I just learnt this has to do with bash "nullglob" and "failglob" settings etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants