Docker Image #590
Is there any reason we can't just have Python and C++ in the same docker image? How much additional size is it really going to add? |
I'm doing a benchmark to compare FLAME GPU 2 with mesa-frames (cc: @adamamer20), and have prepared a Dockerfile for it. I will turn it into a PR when I have the time. This was generated by GPT-4o and then I fixed several missing dependencies.
One thing I am not sure about is the |
This is what I get on an NVIDIA V100. The result doesn't make sense to me. Why is it almost the same for 1 million and 16 million agents?
|
For peak performance/benchmarking, 'seatbelts' (runtime error checking) should also be disabled.
@ptheywood can probably advise, but that table of results does seem sus at a glance, especially if it's a brute force model. |
Sorry for not stating which benchmark I did. It was the Sugarscape IG to reproduce the paper result. |
For V100, I think that should actually be `70` (the V100 is compute capability 7.0). In the grand scheme of things, so long as it's less than or equal to that (and compiles/runs) it should be fine, although newer is typically preferred. Similarly, you can also build for multiple architectures, and CUDA should pick the newest at runtime, although that inflates the binary size and compile time as all device code is duplicated for each architecture requested. I've seen performance both get better and get worse when compiling for earlier CUDA architectures. All it does is enable/disable whether the compiler utilises certain architectural features. So if the code includes hyper-modern functions, it might not build if compiled for earlier architectures, but typical CUDA code is just going to see some statements compiled to slightly different instructions, which may be faster or may be slower (it depends on a lot of unknown variables). |
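For the multiple-architecture case mentioned above, a minimal sketch (the specific architecture list here is just an illustrative assumption, not a recommendation):

```bash
# Build fat binaries for e.g. Pascal (60), Volta (70) and Ampere (80);
# CUDA picks the newest matching architecture at runtime, at the cost of
# larger binaries and longer compile times.
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="60;70;80"
cmake --build build -j 8
```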
I see, thank you for the informative explanation of the architecture configuration. Update: I ran
My
|
I rebuilt both the Docker image, so that FLAME GPU 2 is compiled with this flag, and the Sugarscape IG benchmark. |
Currently the main FLAME GPU repo isn't really set up for installation targets and then being found by CMake, which would make a generic flamegpu2 Dockerfile useful (although it could be used to package pyflamegpu, as an alternative to installing from our pip wheelhouse). I.e. the benchmark repos will all fetch their own version of flamegpu during configuration and build it themselves. This would need #260 to be worthwhile, although as you've probably found it is useful for installing dependencies. From the Dockerfile you've included above, it looks sensible enough other than the last two segments.
There is no install target, and I think the pip install statement would need tweaking.
This entrypoint does nothing useful; it's just GPT-4o regurgitating things it doesn't understand.
By specifying a single value, the compilation time and binary file size would be reduced, but this restricts the GPUs which can run the code to ones newer than specified (and the first run will JIT embedded PTX for newer architectures).

For the fresh build, I used:

```bash
$ module load CUDA/11.8 # specific to the machine
$ cmake -B build-11.8 -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF
$ cmake --build build-11.8/ -j 8
$ cd build-11.8/
$ ./bin/Release/submodel-benchmark
```

I let this run for a few simulations from the performance_Scaling bench,
and also re-ran the existing binary on our HPC machine with V100s (which was compiled using CUDA 11.0 and GCC 9), and then with a clean build, which shows similar performance.
The difference between our V100 and Titan Vs is much larger than I'd expected, although a chunk of that may be in the slightly different drivers and the presence of an X server on our Titan machine. Some of it could be due to power state, but I'd have only expected that to be a penalty for the first simulation at most. I've tweaked the benchmark repo to only run a single simulation before completing, and enabled NVTX in flamegpu via the corresponding CMake option.

For the V100 this produces a sensible, very short timeline, with most simulation step NVTX ranges taking ~600us, which lines up with the reported time from this run (nsys does add some overhead, and the first few steps are slower due to how the model behaves).
However my Titan V machine took ~120ms per step, with most of the time spent not doing any GPU work, but being blocked by a host-side call. The output time for the Titan V run was ~120ms, which lines up with the output time during profiling (and without profiling was ~99ms).
The above screenshot shows the timeline for the final step of the simulation, with vastly differing timelines shown between the V100 at the top (~640us) and the Titan V below (~120ms). This is something we need to dig into more, to understand why this is happening on our Titan V machines and to see if it is impacting any of our other non-HPC machines. It also might not be the same thing that is impacting your machine. |
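For anyone wanting to reproduce the profiling described above, the rough shape is sketched below; the NVTX CMake option name is an assumption and should be checked against the FLAME GPU 2 documentation, and the binary path matches the build shown earlier.

```bash
# Reconfigure with NVTX ranges compiled in (option name assumed), then rebuild.
cmake -B build-11.8 -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF -DFLAMEGPU_ENABLE_NVTX=ON
cmake --build build-11.8/ -j 8

# Capture a timeline with Nsight Systems; open the resulting report in the GUI
# to inspect the per-step NVTX ranges.
nsys profile -o submodel-trace ./build-11.8/bin/Release/submodel-benchmark
```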
After noticing that the ensemble benchmark was still using an older FLAME GPU version, I tried updating it. I did this by changing the appropriate line of the benchmark's CMake configuration to fetch a newer FLAME GPU version, then configuring a fresh CMake build using the same CUDA 11.0 and GCC 9 as before on our Titan V machine. This has reduced the NVTX trace to show the steps taking ~580us rather than 125ms, and without profiling it now reports much more sensible timings for our Titan V machine.
Having looked at the changelog, I've narrowed down the cause to a bug in our telemetry (#1079), which was submitting a telemetry packet every time a submodel finished. This bug has been fixed in the main FLAMEGPU 2 repository, but not in the standalone benchmark repo, which we haven't updated. For the results in the paper generated with rc0, I disabled the telemetry on our HPC system when running the benchmarks at CMake configure time. @rht could you try again doing one of the following to see if it resolves the issue for you as well:
|
Thank you for finding the cause!
This one works! I see, I assume any of the 3 has the same effect, with the last one, 2.0.0-rc.1, disabling the statistics by default? It's almost the same as the V100 result on your machine.
For context, @adamamer20 is trying to do fast vectorized ABM using pandas/Polars: https://github.com/adamamer20/mesa-frames/pull/71. It's not yet using the GPU, but a GPU-based DF backend is in the works. |
The environment variable prevents executables with telemetry enabled from submitting telemetry at runtime. The CMake configuration option prevents telemetry from being embedded in the binary at all (so you don't need to remember to set the environment variable). The update to 2.0.0-rc.1 fixes a bug which caused a telemetry packet to be emitted for every step of the submodel benchmark (when a simulation completes it submits a telemetry packet, but that included submodels, which are run many times by the parent model, i.e. 100 times more than intended for the submodel benchmark model, which runs for 100 steps in the performance test). |
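Concretely, the two mechanisms described above look roughly like the following; the exact environment variable and CMake option names are assumptions here and should be checked against the FLAME GPU 2 telemetry documentation.

```bash
# Runtime opt-out: stop a telemetry-enabled binary from submitting
# (environment variable name assumed).
FLAMEGPU_SHARE_USAGE_STATISTICS=False ./bin/Release/submodel-benchmark

# Configure-time opt-out: don't embed telemetry in the binary at all
# (CMake option name assumed).
cmake -B build -DFLAMEGPU_SHARE_USAGE_STATISTICS=OFF
cmake --build build -j 8
```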
We probably want to provide docker image(s) as one option for running flamegpu.
Based on other projects, we probably want to provide:
- docker/cuda/Dockerfile (or something to identify it as the C++ version?)
- docker/python3/Dockerfile
We will have to base these on NVIDIA dockerfiles to comply with the redistribution licence of libcuda.so.
It might also be worth separating images for using FLAME GPU from images in which FLAME GPU can be modified, i.e. provide -dev images which include all source and build artifacts, and other images which just contain the CUDA/C++ static lib and includes, and/or a docker image with a python wheel already installed.

There will likely be some limitations for visualisation via docker. I.e. the nvidia docker container runtime suggests that GLX is not available, and EGL must be used instead (source).
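A rough sketch of how the non-dev images might be built and used (image names and tags are hypothetical, matching the layout suggested above; GPU access assumes the NVIDIA Container Toolkit is installed on the host):

```bash
# Hypothetical image names/paths; these images do not exist yet.
docker build -t flamegpu2/cuda -f docker/cuda/Dockerfile .
docker build -t flamegpu2/python3 -f docker/python3/Dockerfile .

# Running with GPU access via the NVIDIA Container Toolkit.
docker run --rm --gpus all flamegpu2/python3 python3 -c "import pyflamegpu"
```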