A simple test setup for running different memory alignments on CUDA devices. Currently it is an implementation of the CUDA offsetCopy kernel from the programming guide. For more information on CUDA memory, read [3].
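For reference, a minimal sketch of what such an offsetCopy kernel looks like, written here for double since that is the default type in this setup (the exact implementation in this repository may differ slightly):

```
// Each thread copies one element, shifted by a user-defined offset.
// A non-zero offset misaligns the accesses of a warp and therefore
// increases the number of memory transactions per request.
__global__ void offsetCopy(double *odata, double *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}
```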
**REQUIRES CUDA 5.0 OR GREATER**
To run the program do:
This will run the RunTest.sh script and use nvprof to get global load/store transactions with no offset.
This test runs the copy with a user-defined constant offset. The test can be run with the command
or
where the first value (1) is the test number and the second value (5) is the offset to use. This profiles the run and reports the following nvprof events:
- L1_global_load_hit
- L1_global_load_miss
- gld_requested
- gst_requested
- global_store_requests
Run nvprof --query-events to get a list of the events available on the system. Information on the CUDA profiler can be found in [1]. When running the default large setup (i.e. 1000 blocks per grid and 128 threads per block) with a zero offset, the resulting ratio of memory transactions to load/store requests should be 2 when using doubles (which is also the default): a warp of 32 threads each reading an 8-byte double requests 256 bytes, which maps onto exactly two 128-byte transactions. This means that the access is coalesced. Running with a non-zero offset will result in a ratio greater than 2.
Test 2 is almost the same as test 1, but the profiler now reports kernel runtime instead of the events above. The copy is run for the offset sequence 0:2:32. The output is printed to stdout and saved to a text file, which is then plotted with gnuplot. The result should have the same shape as Figure 6 in [3].
Same setup as test 1, but not all threads copy memory. By default, threads 36 to 45 skip the copy. The result is the same as in test 1 when similar parameters are used, which agrees with the theory.
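A hedged sketch of what this variant could look like (the skipped range is assumed here to refer to the global thread index; the actual kernel may use the index within a block, which does not change the outcome):

```
// Copy kernel where a small range of threads skips the copy. The remaining
// active threads of the affected warp still touch the same cache lines, so
// the number of memory transactions stays the same as in test 1.
__global__ void offsetCopySkip(double *odata, double *idata, int offset)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= 36 && tid <= 45)
        return;                          // these threads do not copy
    odata[tid + offset] = idata[tid + offset];
}
```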
The setup is a quite simple finite difference stencil that computes the difference between the values above and below the current node. All nodes are independent, so the kernel should scale nicely. The problem with the kernel is that the memory access is no longer coalesced: nodes access memory at quite different addresses. There is also an overlap where four threads need to access the same values. The test should therefore show a cache hit rate greater than 0% and a ratio greater than 2, because the GPU now has to transfer much more data.
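As an illustration, here is a minimal sketch of such a stencil. The thread-to-node mapping is an assumption made for this example only: consecutive threads handle consecutive rows of a row-major array, so the loads of a warp are nx elements apart and cannot be coalesced. The kernel in the repository may map threads to nodes differently.

```
// Vertical-difference stencil with uncoalesced global access (sketch).
__global__ void stencilDiff(double *out, const double *in, int nx, int ny)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // row handled by this thread
    int c = blockIdx.y;                             // column handled by this block

    if (r > 0 && r < ny - 1 && c < nx)
        // difference between the node above and the node below
        out[r * nx + c] = in[(r - 1) * nx + c] - in[(r + 1) * nx + c];
}
```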
The same stencil as in test 4, but now the values are read from shared memory. This means that the uncoalesced memory access goes to shared memory instead of global memory, and the overhead should be reduced considerably.
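A hedged sketch of the shared-memory version, using the same assumed thread-to-node mapping as above and a 32x32 thread block (the details of the real kernel may differ): the block first fills a tile with coalesced global loads, and the stencil reads are then served from shared memory.

```
#define BDIM 32

// Shared-memory variant (sketch): load a 34x32 tile (32 rows plus one halo
// row above and below) with coalesced accesses, then read the above/below
// values from shared memory instead of global memory.
__global__ void stencilDiffShared(double *out, const double *in, int nx, int ny)
{
    __shared__ double tile[BDIM + 2][BDIM];

    int r0 = blockIdx.x * BDIM;   // first row handled by this block
    int c0 = blockIdx.y * BDIM;   // first column handled by this block

    // Load phase: threadIdx.x walks along a row, so each warp reads
    // consecutive global addresses.
    for (int i = threadIdx.y; i < BDIM + 2; i += BDIM) {
        int r = r0 + i - 1;
        int c = c0 + threadIdx.x;
        if (r >= 0 && r < ny && c < nx)
            tile[i][threadIdx.x] = in[r * nx + c];
    }
    __syncthreads();

    // Compute phase: same node assignment as the sketch above, but the
    // neighbouring values now come from shared memory.
    int r = r0 + threadIdx.x;     // row handled by this thread
    int c = c0 + threadIdx.y;     // column handled by this thread
    if (r > 0 && r < ny - 1 && c < nx)
        out[r * nx + c] = tile[threadIdx.x][threadIdx.y]        // node above
                        - tile[threadIdx.x + 2][threadIdx.y];   // node below
}
```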
Using the nvprof tool, it was observed that for small problems the load and store ratios are greater than expected. This test shows how the ratio changes with the number of memory copies.
When using padding, every block loads a redundant set of halo nodes, and the number of extra transactions depends on the padding size of the halos. This test runs different grid sizes with and without padding. It should show the overhead of using padding and the drop in memory performance from giving up coalescing in favor of the speed of shared memory. The block (and shared-memory tile) size is forced to 32 by 32 threads per block. A 30-by-30 tile of inner nodes can be used to compute the stencil (900 of the 1024 threads, roughly 88%). The remaining 124 threads (roughly 12%) are boundary nodes that only load memory into shared memory. In the best case a boundary node will get an L1 cache hit when loading its value.
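A hedged sketch of this decomposition (kernel and variable names are illustrative, not taken from the repository): a 32-by-32 block in which every thread loads exactly one value into shared memory and only the 30-by-30 interior computes.

```
#define BDIM 32

// Padded shared-memory stencil (sketch): neighbouring blocks overlap by one
// node on each side, so the halo ring of one block consists of interior
// nodes of its neighbours.
__global__ void stencilPadded(double *out, const double *in, int nx, int ny)
{
    __shared__ double tile[BDIM][BDIM];

    int c = blockIdx.x * (BDIM - 2) + threadIdx.x - 1;   // global column
    int r = blockIdx.y * (BDIM - 2) + threadIdx.y - 1;   // global row

    // Every thread, interior and halo alike, performs one global load.
    if (r >= 0 && r < ny && c >= 0 && c < nx)
        tile[threadIdx.y][threadIdx.x] = in[r * nx + c];
    __syncthreads();

    // Only the 30x30 interior threads (900 of 1024) compute and write a
    // result; the 124 halo threads exist purely to fill shared memory.
    if (threadIdx.x > 0 && threadIdx.x < BDIM - 1 &&
        threadIdx.y > 0 && threadIdx.y < BDIM - 1 &&
        r > 0 && r < ny - 1 && c < nx)
        out[r * nx + c] = tile[threadIdx.y - 1][threadIdx.x]
                        - tile[threadIdx.y + 1][threadIdx.x];
}
```

Note that with this layout each block starts its loads one node before a 30-node boundary, so a warp's accesses are generally not aligned to a 128-byte segment and it will often touch one more transaction than in the aligned case, much like a non-zero offset in the offsetCopy test.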
To use the software you need to install getoptpp [2] and have the file getopt_pp.cpp in the root src directory. You will also need gnuplot if you want to produce .png files. If the system does not have an awk version installed, I suggest installing gawk.