Docker Image #590

Open
ptheywood opened this issue Jul 13, 2021 · 12 comments
ptheywood (Member) commented Jul 13, 2021

We probably want to provide Docker image(s) as one option for running FLAME GPU.

Based on other projects, we probably want to provide:

  • docker/cuda/Dockerfile (or something to identify it as the C++ version?)
  • docker/python3/Dockerfile

We will have to base these on NVIDIA Dockerfiles to comply with the redistribution licence of libcuda.so.

It might also be worth separating images for using FLAME GPU from images for modifying FLAME GPU, i.e. provide -dev images which include all source and build artifacts, and other images which just contain the CUDA/C++ static lib and includes, and/or a docker image with a python wheel already installed.

There will likely be some limitations for visualisation via docker, i.e. the nvidia docker container runtime suggests that GLX is not available, and EGL must be used instead (source).
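
For illustration, a minimal sketch of what a docker/python3/Dockerfile could look like, assuming pyflamegpu is installed from a prebuilt wheel (the base image tag and wheelhouse URL below are placeholders, not a final design):

# Sketch only: runtime-oriented Python image; tags and URLs are placeholders
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install pyflamegpu from a wheel, e.g. via a hosted wheelhouse (URL is hypothetical)
RUN python3 -m pip install --extra-index-url https://whl.flamegpu.com/whl pyflamegpu

CMD ["python3"]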

Robadob (Member) commented Jul 13, 2021

Is there any reason we can't just have python and c++ in the same docker image? How much additional size is it really going to add?

rht commented Aug 20, 2024

I'm doing a benchmark to compare FLAME GPU 2 with mesa-frames (cc: @adamamer20), and have prepared a Dockerfile for it. I will turn it into a PR when I have the time.

This was generated by GPT-4o and then I fixed several missing dependencies.

# Use an official CUDA base image
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive

# Install dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    python3 \
    python3-pip \
    libgl1-mesa-dev \
    libglew-dev \
    freeglut3-dev \
    xorg-dev \
    swig \
    patchelf \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set up Python environment
RUN python3 -m pip install --upgrade pip setuptools wheel build

# Clone the FLAMEGPU2 repository
RUN git clone https://github.com/FLAMEGPU/FLAMEGPU2.git /flamegpu2

# Set the working directory
WORKDIR /flamegpu2

# Checkout the desired branch (e.g., master or a specific version)
RUN git checkout master

# Create and build the project using CMake
RUN mkdir -p build && cd build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release -DFLAMEGPU_BUILD_PYTHON=ON && \
    cmake --build . --target flamegpu boids_bruteforce -j 8

# Optional: Install Python bindings (if needed)
# RUN cd build && make install && pip3 install ./lib/python

# Set up entry point (if required)
ENTRYPOINT ["tail", "-f", "/dev/null"]

One thing I am not sure about with the cmake command: is -DCMAKE_CUDA_ARCHITECTURES=61 necessary to optimize the result further? Am I missing a lot if I compile without a specific target architecture?

rht commented Aug 20, 2024

This is what I get on an NVIDIA V100. The result doesn't make sense to me. Why is the mean step time almost the same for 1 million and 16 million agents?

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.543342,6158.32
0,512,262144,0.17,0.537335,25022.8
0,768,589824,0.17,0.537974,57078.9
0,1024,1048576,0.17,0.53916,101629
0,1280,1638400,0.17,0.541711,159002
0,1536,2359296,0.17,0.546862,230722
0,1792,3211264,0.17,0.549065,314442
0,2048,4194304,0.17,0.552074,409362
0,2304,5308416,0.17,0.56062,518971
0,2560,6553600,0.17,0.567626,640464
0,2816,7929856,0.17,0.570724,774681
0,3072,9437184,0.17,0.580403,922129
0,3328,11075584,0.17,0.57991,1.08285e+06
0,3584,12845056,0.17,0.588218,1.25571e+06
0,3840,14745600,0.17,0.588479,1.44259e+06
0,4096,16777216,0.17,0.59788,1.64187e+06
1,256,65536,0.17,0.553675,6294.4
1,512,262144,0.17,0.554435,25029.3
1,768,589824,0.17,0.555797,57294.6
1,1024,1048576,0.17,0.555232,101536
1,1280,1638400,0.17,0.557549,160115
1,1536,2359296,0.17,0.56029,231049
1,1792,3211264,0.17,0.560761,314853
1,2048,4194304,0.17,0.563404,410717
1,2304,5308416,0.17,0.567223,520563
1,2560,6553600,0.17,0.569116,641398
1,2816,7929856,0.17,0.57899,776075
1,3072,9437184,0.17,0.577088,925085

Robadob (Member) commented Aug 20, 2024

# Create and build the project using CMake
RUN mkdir -p build && cd build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release -DFLAMEGPU_BUILD_PYTHON=ON && \
    cmake --build . --target flamegpu boids_bruteforce -j 8

For peak performance/benchmarking, 'seatbelts' (runtime error checking) should also be disabled via -DFLAMEGPU_SEATBELTS=OFF.
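
For example, the configure step above would become something like this (a sketch):

$ cmake .. -DCMAKE_BUILD_TYPE=Release -DFLAMEGPU_BUILD_PYTHON=ON -DFLAMEGPU_SEATBELTS=OFF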

This is what I get on an NVIDIA V100. The result doesn't make sense to me. Why is the mean step time almost the same for 1 million and 16 million agents?

@ptheywood can probably advise, but that table of results does seem sus at a glance, especially if it's a brute force model.

rht commented Aug 20, 2024

Sorry for not stating which benchmark I ran. It was the Sugarscape IG benchmark, to reproduce the paper's result.

Robadob (Member) commented Aug 20, 2024

is -DCMAKE_CUDA_ARCHITECTURES=61 necessary to optimize the result further?

For a V100, I think that should actually be 70; 61 is Pascal.

In the grand scheme of things, so long as the value is less than or equal to your GPU's architecture (and it compiles/runs) it should be fine, although newer is typically preferred. You can also build for multiple architectures, and CUDA should pick the newest compatible one at runtime, although that inflates the binary size and compile time as all device code is duplicated for each architecture requested.

I've seen performance both get better and get worse when compiling for earlier CUDA architectures. All it does is control whether the compiler utilises certain architectural features. So if the code includes hyper-modern features, it might not build if compiled for earlier architectures. But typical CUDA code is just going to see some statements compiled to slightly different instructions, which may be faster or may be slower (it depends on a lot of unknown variables).
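
For instance, a multi-architecture configure might look like this (a sketch; pick the list to match the GPUs you need to support):

$ cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES="60;70;80"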

rht commented Aug 20, 2024

I see, thank you for the informative explanation of the architecture configuration.

Update: I ran cmake with -DCMAKE_CUDA_ARCHITECTURES=70 and -DFLAMEGPU_SEATBELTS=OFF, and I still got ~0.5s per step.

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.557127,6158.32
0,512,262144,0.17,0.550818,25022.8
0,768,589824,0.17,0.554487,57078.9
0,1024,1048576,0.17,0.555948,101629

My lshw -C display output

       description: 3D controller
       product: GV100GL [Tesla V100 PCIe 16GB]
       vendor: NVIDIA Corporation

rht commented Aug 20, 2024

Update: I ran cmake with -DCMAKE_CUDA_ARCHITECTURES=70 and -DFLAMEGPU_SEATBELTS=OFF, and I still got ~0.5s per step.

I did both: rebuilding the Docker image so that FLAME GPU 2 is compiled with these flags, and also rebuilding the Sugarscape IG benchmark.

ptheywood (Member, Author) commented Aug 21, 2024

Currently the main FLAME GPU repo isn't really set up with installation targets that can then be found by CMake, which is what would make a generic flamegpu2 dockerfile useful (although it could be used to package pyflamegpu, as an alternative to installing from our pip wheelhouse).

I.e. the benchmark repos all fetch their own copy of FLAME GPU during CMake configuration and build it as part of their own build.

This would need #260 for a generic image to be worthwhile, although as you've probably found, the Dockerfile is useful for installing dependencies.
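
To illustrate the fetch mechanism mentioned above, the general pattern is roughly the following (a sketch using CMake's FetchContent; the actual scripts used by the benchmark repositories differ in detail):

include(FetchContent)
FetchContent_Declare(
    flamegpu2
    GIT_REPOSITORY https://github.com/FLAMEGPU/FLAMEGPU2.git
    GIT_TAG        v2.0.0-rc.1
)
FetchContent_MakeAvailable(flamegpu2)
# Downstream model targets can then link against the fetched flamegpu library target, e.g.
# target_link_libraries(my_model PRIVATE flamegpu)   # "my_model" is a hypothetical target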

The dockerfile you've included above looks sensible enough, other than the last two segments.

# Optional: Install Python bindings (if needed)
# RUN cd build && make install && pip3 install ./lib/python

There is no install target, and I think the pip install statement would need tweaking.

# Set up entry point (if required)
ENTRYPOINT ["tail", "-f", "/dev/null"]

This entrypoint does nothing useful; it's just GPT-4o regurgitating things it doesn't understand.
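
If the image is meant to be used interactively, a more conventional default would be a shell, or the entrypoint could run one of the built binaries directly. A sketch (the binary path assumes the build layout produced above):

# Drop the user into a shell by default...
CMD ["/bin/bash"]
# ...or run a built example directly
# ENTRYPOINT ["/flamegpu2/build/bin/Release/boids_bruteforce"]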


-DCMAKE_CUDA_ARCHITECTURES generates CUDA code which is compatible with specific GPU architectures. If unspecified, FLAME GPU will build for all major architectures supported by the CUDA version in use; for CUDA 11.8 this would be 35;50;60;70;80;90 (Kepler to Hopper, although full Hopper support requires CUDA 12.0).

By specifying a single value, the compilation time and binary file size are reduced, but this restricts the GPUs which can run the code to ones at least as new as the specified architecture (and the first run will JIT the embedded PTX for newer architectures). I.e. specifying 70 would allow Volta and newer GPUs to run the code, but features only present in Ampere and Hopper would not be used.
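
To check which architectures actually ended up in a binary, cuobjdump (shipped with the CUDA toolkit) can be used; a sketch, with an illustrative binary path:

$ cuobjdump --list-elf ./bin/Release/boids_bruteforce   # one cubin per compiled architecture
$ cuobjdump --list-ptx ./bin/Release/boids_bruteforce   # embedded PTX, JIT-compiled on newer GPUs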


For the FLAMEGPU2-submodel-benchmark performance, I've done a fresh native build on our Titan V machine:

$ module load CUDA/11.8 # specific to the machine
$ cmake -B build-11.8 -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF
$ cmake --build build-11.8/ -j 8 
$ cd build-11.8/
$ ./bin/Release/submodel-benchmark

I let this run for a few simulations from the performance_Scaling benchmark:

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.0933171,6158.32
0,512,262144,0.17,0.0923017,25022.8
0,768,589824,0.17,0.0933319,57078.9
0,1024,1048576,0.17,0.0950031,101629
0,1280,1638400,0.17,0.095181,159002
0,1536,2359296,0.17,0.0986446,230722
0,1792,3211264,0.17,0.099533,314442

I also re-ran the existing binary on our HPC machine with V100s; this binary was compiled using CUDA 11.0 and GCC 9:

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.000485079,6158.32
0,512,262144,0.17,0.00095145,25022.8
0,768,589824,0.17,0.00153795,57078.9
0,1024,1048576,0.17,0.00252234,101629
0,1280,1638400,0.17,0.00358018,159002
0,1536,2359296,0.17,0.0053722,230722
0,1792,3211264,0.17,0.00701561,314442
0,2048,4194304,0.17,0.00886571,409362

and then with a clean build, which shows similar performance:

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.000434862,6158.32
0,512,262144,0.17,0.000939674,25022.8
0,768,589824,0.17,0.00153737,57078.9
0,1024,1048576,0.17,0.00262005,101629
0,1280,1638400,0.17,0.00359381,159002
0,1536,2359296,0.17,0.0053471,230722

The difference between our V100 and Titan V is much larger than I'd expected, although a chunk of that may be down to slightly different drivers and the presence of an X server on our Titan V machine.

Some of it could be due to power state, but I'd have only expected that to be a penalty for the first simulation at most.

I've tweaked the benchmark repo to only run a single simulation before completing, and enabled NVTX in flamegpu via -DFLAMEGPU_ENABLE_NVTX=ON at configure time. As this is the first sim, the population is very small, so the amount of GPU time will be relatively low.
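
The timelines discussed below were captured with Nsight Systems; a representative invocation would be something along these lines (a sketch, paths illustrative):

$ nsys profile --trace=cuda,nvtx -o submodel-benchmark-profile ./bin/Release/submodel-benchmark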

For the V100 this produces a sensible, very short timeline, with most simulation step NVTX ranges taking ~600us, which lines up with the reported time from this run (nsys does add some overhead, and the first few steps are slower due to how the model behaves):

0,256,65536,0.17,0.000696624,6158.32

However, my Titan V machine took ~120ms per step, with most of the time spent not doing any GPU work but blocked on a system call. The actual portion of that 120ms where the GPU was doing something was ~1.4ms.

The time reported in the output for the Titan V run was the 120ms, which lines up with the time seen during profiling (without profiling it was ~99ms):

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.125717,6158.32

[screenshot: profiler timelines for the V100 (top) and Titan V (bottom)]

The above screenshot shows the timeline for the final step of the simulation, with vastly differing durations between the V100 at the top (~640us) and the Titan V below (~120ms).

This is something we need to dig into more, to understand why it is happening on our Titan V machines and to see if it is impacting any of our other non-HPC machines.

It also might not be the same thing that is impacting your machine.

ptheywood (Member, Author) commented:

After noticing that the ensemble benchmark was still using FLAME GPU 2.0.0-rc, rather than the much more recent 2.0.0-rc.1 or current master, I thought I'd see if this was caused by a bug we'd previously fixed but forgotten about, or one that appeared unrelated.

I did this by changing the appropriate line of CMakeLists.txt from

set(FLAMEGPU_VERSION "v2.0.0-rc" CACHE STRING "FLAMEGPU/FLAMEGPU2 git branch or tag to use")

to

set(FLAMEGPU_VERSION "v2.0.0-rc.1" CACHE STRING "FLAMEGPU/FLAMEGPU2 git branch or tag to use")

and configuring a fresh cmake build. Using the same CUDA 11.0 and GCC 9 as before on our Titan V machine, this has reduced the NVTX trace to show the steps taking ~580us rather than 125ms.

[screenshot: profiler timeline for the Titan V after updating to v2.0.0-rc.1]

And without profiling, it now reports much more sensible timings for our Titan V machine:

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.0004592,6157.67
0,512,262144,0.17,0.00103467,25034.7
0,768,589824,0.17,0.0018241,57058.4
0,1024,1048576,0.17,0.00307118,101641
0,1280,1638400,0.17,0.0044578,158976

Having looked at the changelog, I've narrowed down the cause to a bug in our telemetry, #1079, which was submitting a telemetry packet every time a submodel finished.
I.e. each step was making a network request, which takes roughly the same duration each step, hence the apparent lack of scaling when the step duration is negligible compared to a network request.

This bug has been fixed in the main FLAME GPU 2 repository, but not in the standalone benchmark repo, which we haven't updated.

For the results in the paper generated with rc0, I disabled the telemetry on our HPC system when running the benchmarks by configuring with -DFLAMEGPU_SHARE_USAGE_STATISTICS=OFF, so they didn't exhibit this problem.

@rht could you try again doing one of the following to see if it resolves the issue for you as well:

  • Run with the environment variable FLAMEGPU_SHARE_USAGE_STATISTICS set to OFF
    • i.e. FLAMEGPU_SHARE_USAGE_STATISTICS=OFF ./bin/Release/submodel-benchmark
  • Configure the FLAMEGPU2-submodel-benchmark repository build directory with -DFLAMEGPU_SHARE_USAGE_STATISTICS=OFF, recompile and rerun (see the sketch after this list)
  • Change the version of FLAME GPU 2 fetched by FLAMEGPU2-submodel-benchmark/CMakeLists.txt to 2.0.0-rc.1 as above, reconfigure (maybe with an entirely fresh build directory) and re-run.
    • I'll bump the version in this repository anyway; it doesn't seem to be impacted by any breaking changes between these release candidates.
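
A sketch of the second option, assuming a fresh build directory and the same flags as the native build above:

$ cmake -B build -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF -DFLAMEGPU_SHARE_USAGE_STATISTICS=OFF
$ cmake --build build -j 8
$ ./build/bin/Release/submodel-benchmark   # telemetry is compiled out, no environment variable needed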

rht commented Aug 21, 2024

Thank you for finding the cause!

Run with the environment variable FLAMEGPU_SHARE_USAGE_STATISTICS set to OFF
i.e. FLAMEGPU_SHARE_USAGE_STATISTICS=OFF ./bin/Release/submodel-benchmark

This one works! I see; I assume any of the three has the same effect, with the last one, 2.0.0-rc.1, disabling the statistics by default?

It's almost the same as the V100 result on your machine.

repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.000483422,6158.32
0,512,262144,0.17,0.000958801,25022.8
0,768,589824,0.17,0.00177816,57078.9
0,1024,1048576,0.17,0.00265805,101629
0,1280,1638400,0.17,0.0037753,159002
0,1536,2359296,0.17,0.00545013,230722
0,1792,3211264,0.17,0.00723409,314442
0,2048,4194304,0.17,0.00905549,409362
0,2304,5308416,0.17,0.0116754,518971
0,2560,6553600,0.17,0.0145468,640464

For context, @adamamer20 is trying to do fast vectorized ABM using pandas/Polars: https://github.com/adamamer20/mesa-frames/pull/71. It's not yet using the GPU, but GPU-based DataFrame support is in the works.

ptheywood (Member, Author) commented:

This one works! I see; I assume any of the three has the same effect, with the last one, 2.0.0-rc.1, disabling the statistics by default?

The environment variable prevents executables with telemetry enabled from submitting telemetry at runtime.

The CMake configuration option prevents telemetry from being embedded in the binary at all (so you don't need to remember to set the environment variable).

The update to 2.0.0-rc.1 fixes a bug which caused a telemetry packet to be emitted for every step of the submodel benchmark (when a simulation completes it submits a telemetry packet, but that included submodels, which are run many times by the parent model, i.e. 100 times more than intended in the submodel benchmark model, which runs for 100 steps for the performance test).
