CUDA for Wave Simulation - Part 3: Efficient CUDA

Non-trivial CUDA
The current snippet of code calling CUDA is:
cuda_step<<< 1, 1 >>>
This uses only one CUDA thread, and is probably extremely inefficient. To verify this, let's add some way to time the program. I could use NVIDIA's nsys, or just the Linux time command (or some kind of time counters in the program). In any case, I'll need a new "run configuration": run the program for a fixed number of cycles to check its efficiency. I don't think this justifies a new executable, so let's make this a program flag. For this, I'll integrate the Boost Program Options library (I could also write program options checking by hand, but Boost is easier, better tested and probably more extensible).
This pulls an additional dependency into our code, but Boost is already pretty widely installed, and requiring it is rather trivial in CMake. (A future TODO: have a fallback in case the user does not have Boost installed.)
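For illustration, here's a minimal sketch of what the flag parsing could look like with Boost Program Options. The --perf option name matches the flag used below; everything else (the entry point, the help text) is hypothetical wiring, not the repository's actual code:

```cpp
#include <boost/program_options.hpp>
#include <cstddef>
#include <iostream>

namespace po = boost::program_options;

int main(int argc, char** argv) {
    po::options_description desc("Allowed options");
    desc.add_options()
        ("help", "show this help message")
        ("perf", po::value<std::size_t>(),
         "run the given number of simulation steps without the GUI and report timing");

    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);

    if (vm.count("help")) {
        std::cout << desc << "\n";
        return 0;
    }
    if (vm.count("perf")) {
        std::size_t num_steps = vm["perf"].as<std::size_t>();
        // run_performance_test(num_steps);  // hypothetical entry point
        return 0;
    }
    // otherwise: launch the GUI as before
}
```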
The new run configuration has the following differences from the original:
- There is no visualization, as it could be a bottleneck (I'm not confident in the current communication between the UI and the simulation) and I don't want a window to pop up when I run tests.
- The program doesn't call cudaDeviceSynchronize after each step, as that is no longer necessary and can become a bottleneck. (In fact, it is one: ./src/waves --perf 1000 takes 3ms without cudaDeviceSynchronize and 2080ms with it.)
NSys analysis
First, let's run the program with nsys:

```
$ nsys profile src/waves --perf 100
$ nsys stats report1.nsys-rep
** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)     Max (ns)    StdDev (ns)          Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ----------------------
     98.0      128,491,979          2  64,245,989.5  64,245,989.5    72,052  128,419,927  90,755,652.8  cudaMallocManaged
      1.2        1,553,190        100      15,531.9       5,443.5     5,074      962,247      95,655.4  cudaLaunchKernel
[...more lines left out...]

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------
    100.0        1,923,059        100  19,230.6  16,096.0    16,063   330,014     31,392.3  cuda_step(const double *, double *, double, double, unsigned long, unsigned long, unsigned long)
```
Okay, so the allocations are still too big in comparison with the actual work: we spend 128ms (128,491,979ns) allocating memory, 1.55ms launching kernels, and 1.92ms doing the kernel work. Notes:
- This is particularly bad because, as of 265a7e2, the --perf configuration doesn't measure the initialization time, including memory allocation, so our results will differ wildly from the actual runtime whenever the allocation time is significant. (I should probably fix this.)
- I don't know whether the 1.55ms and 1.92ms should be added to or subtracted from each other (i.e. whether kernel launch time counts as part of kernel execution time), but it doesn't really matter for this example.
Sanity check: nsys confirms that we did 2 allocations and launched 100 kernels, as expected. (We allocate one "front" and one "back" matrix buffer and swap them back and forth.)
Let's increase N. I won't copy the output of nsys again; instead, here's a summary. All times are in ms.

| N | 10 (previous) | 1000 |
| --- | --- | --- |
| cudaMallocManaged total time | 128.49 | 126.56 |
| cudaMallocManaged calls | 2 | 2 |
| cudaLaunchKernel total time | 1.55 | 127.88 |
| cudaLaunchKernel number of calls | 100 | 10000 |
| Kernel run time | 1.92 | 161.19 |
Runtime analysis
Now we can do a few tests with the --perf configuration. All times are in ms.

| N | 10 | 100 | 1000 | 10000 |
| --- | --- | --- | --- | --- |
| Without CUDA | 0.07 | 0.50 | 5.04 | 50.08 |
| With CUDA | 0.41 | 0.27 | 125.95 | 11314.5 |
Again, the CUDA results for \(N \le 1000\) are not really significant, because the runtime is dominated by malloc, which falls outside the region tracked by --perf. The no-CUDA version scales linearly with N, as expected.
The CUDA version's runtime first decreases with higher N, then scales very badly for large N. A possibly related bug is that I forgot to sync the CUDA state after all steps were done. Could this be the explanation?
A nice post on CUDA kernel launching explains the process of launching a kernel, including this interesting detail:
The driver doesn’t immediately send your kernel to the GPU. Instead, it [...] allocates space in a command buffer — think of this as a todo list for the GPU
How large is this command buffer? Unfortunately, this is difficult to find out. So for all I know, the following explanation is plausible: we launch a lot of kernels without waiting, and the first 100 or so land in the command buffer. Since we don't sync, we measure only the time for launching the kernels, which is tiny. When we add more kernels, at some point the command buffer becomes full and we have to wait for queued work to be processed. So for higher N, we effectively add a synchronization that was not present for lower N, which makes the measured runtime much higher.
In any case, the fix is trivial: call cudaDeviceSynchronize at the end of each experiment. (See commit cae55ce.)
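As a minimal sketch of where the synchronization now sits (the function and parameter names here are illustrative, and the kernel's argument list is abbreviated; the real signatures live in the repository):

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <cuda_runtime.h>

// Illustrative stand-in for the real kernel, defined elsewhere.
__global__ void cuda_step(const double* in, double* out, double t, double c,
                          std::size_t grid_width, std::size_t grid_height);

// Returns the elapsed time of num_steps simulation steps in ms.
double run_perf(double* front, double* back, double t, double dt, double c,
                std::size_t grid_width, std::size_t grid_height,
                std::size_t num_steps) {
    auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < num_steps; ++n) {
        cuda_step<<<1, 1>>>(front, back, t, c, grid_width, grid_height);
        std::swap(front, back);  // the "front"/"back" buffer swap
        t += dt;
    }
    // The fix: one synchronization at the end, so the timer covers the
    // queued GPU work instead of just the asynchronous launch calls.
    cudaDeviceSynchronize();
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

With this, we have the following results: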
| N | 10 | 100 | 1000 | 10000 |
| --- | --- | --- | --- | --- |
| Without CUDA | 0.07 | 0.50 | 5.04 | 50.08 |
| With CUDA | 15.20 | 140.90 | 1128.67 | 11285.2 |
This is more sensible, although the performance is terrible. So now that we've ironed out these bugs, we can do what we wanted to do in the first place: add more CUDA threads to actually take advantage of the GPU.
Increasing threads and grid size
Using more threads can be done by simply calling the kernel as cuda_step<<< 1, NUMBER_OF_THREADS >>>. The kernel is then executed by NUMBER_OF_THREADS threads, each with a different value of threadIdx to give it an identity. We need to decide what each thread does. Fortunately, in our simple problem, each row of the grid is independent, so we can just give one or more rows to each thread. This is done with the following code:
```diff
 void cuda_step(
     const double* in, PlainCGrid out,
     double t, double c,
+    std::size_t block_size,
     std::size_t grid_width, std::size_t grid_height
 ) {
+    int start = threadIdx.x * block_size;
+    int end = (threadIdx.x + 1) * block_size;
-    for(int i = 0; i < grid_height; i++) {
+    for(int i = start; i < end; i++) {
```
where the caller has to specify the block size and ensure that NUMBER_OF_THREADS * block_size == grid_height. From the documentation, it seems to be allowed to use an arbitrary number of threads, e.g. to specify NUMBER_OF_THREADS = grid_height and block_size = 1. In fact, this works on my laptop even for NUMBER_OF_THREADS = 10000, even though the Programming Guide states that

On current GPUs, a thread block may contain up to 1024 threads.

This limit can also be confirmed on my GPU with deviceQuery. I assume that the additional threads are run sequentially on different "physical" threads.
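An aside: kernel launches are asynchronous and don't fail loudly; an invalid configuration is only visible through the error API. So one way to check whether such an over-sized launch actually ran or silently failed is a sketch like the following (variable names illustrative; requires <cstdio> and <cuda_runtime.h>):

```cpp
// Hypothetical launch site: one thread per row (block_size == 1), which
// requires NUMBER_OF_THREADS * block_size == grid_height.
std::size_t num_threads = grid_height;
std::size_t block_size = 1;
cuda_step<<<1, num_threads>>>(front, back, t, c, block_size,
                              grid_width, grid_height);
// A configuration error (e.g. exceeding the threads-per-block limit) is
// reported here, not as a crash at the launch itself.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    std::fprintf(stderr, "kernel launch failed: %s\n",
                 cudaGetErrorString(err));
}
```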
A few experiments with the grid size. I use different values of \(S\), the number of elements per side, so there are \(S^2\) elements in the full grid. The time step \(dt\) is adapted to fulfill the CFL condition, but the number of time steps stays constant, namely \(N = 100000\) timesteps. The number of time steps has to be chosen sufficiently large, because otherwise cudaDeviceSynchronize takes up most of the runtime (since it now has to sync \(O(S^2)\) memory, which can be very large).
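For reference, assuming the usual explicit scheme with each row treated as an independent 1D wave (as described above), the CFL condition reads

\[ \frac{c \, \Delta t}{\Delta x} \le 1, \qquad \Delta x \propto \frac{1}{S}, \]

so doubling \(S\) roughly halves the admissible time step \(dt\).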
| S | 10 | 50 | 100 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- |
| Without CUDA | 7.67 | 112.70 | 495.12 | 154523 | |
| With CUDA, 1 thread | 1697.91 | 29064.3 | 111808 | 3902360 | |
| With CUDA, S threads | 372.36 | 1125.84 | 2002.39 | 41444 | 210764 |
In theory, the work scales with the number of elements, i.e. with \(S^2\). We observe the following:
- The non-CUDA version scales worse than \(O(S^2)\). This could be due to cache effects: for S = 500, each grid takes 500*500*sizeof(double), i.e. 2MB; on my machine, this fits into the L3 cache but not into the L2.
- The CUDA 1-thread version has worse performance than the non-CUDA version in the beginning, but scales like \(S^2\), i.e. slightly better. This might be due to different cache effects.
- The CUDA S-thread version scales like \(S^2\) for high values of \(S\). I would have expected it to scale like \(S\): each thread takes one row (\(S\) elements), and there should be enough hardware threads that no rows have to be processed sequentially.
Merging kernels
As seen previously, launching a kernel is a somewhat complex process. Can we put the for-loop over N inside the kernel, so that we only launch the kernel once? Each thread's work is independent, so there should be no synchronization issues. This is implemented in 6bd0ee7.
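The commit itself isn't reproduced here; the following is a sketch of the idea under the assumptions above (per-row independence), with illustrative names:

```cpp
// The time loop moves inside the kernel, so the host launches once instead
// of N times. This is only safe because each thread's rows never read data
// written by other threads.
__global__ void cuda_step_merged(
    double* front, double* back,
    double t, double dt, double c,
    std::size_t block_size,
    std::size_t grid_width, std::size_t grid_height,
    std::size_t num_steps
) {
    std::size_t start = threadIdx.x * block_size;
    std::size_t end = (threadIdx.x + 1) * block_size;
    for (std::size_t n = 0; n < num_steps; ++n) {
        for (std::size_t i = start; i < end; ++i) {
            // ... update row i of `back` from row i of `front`
            //     (stencil elided; see the repository) ...
        }
        // Swap the buffers locally in each thread; no synchronization is
        // needed under the independence assumption.
        double* tmp = front;
        front = back;
        back = tmp;
        t += dt;
    }
}
```

The runtime is as follows: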
| S | 10 | 50 | 100 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- |
| Without CUDA | 7.67 | 112.70 | 495.12 | 154523 | |
| With CUDA, 1 thread | 1697.91 | 29064.3 | 111808 | 3902360 | |
| With CUDA, S threads | 372.36 | 1125.84 | 2002.39 | 41444 | 210764 |
| With CUDA, S threads, one kernel | 178.77 | 661.35 | 1848.84 | 41206 | 210291 |
So this is an improvement, but one that becomes less relevant for high \(S\), as the actual kernel execution and memory transfers outweigh the kernel launch overhead.
Future work
Bugs
- Even though the --perf version runs fine, there's now a SEGFAULT when running without --perf, i.e. with the GUI.
- CUDA with \(S = 10^4\) threads and \(N = 1000\) reports 0.64ms runtime, where \(S = 10^3\) took 2162ms. There is probably something somewhere that immediately fails and exits.
Features
- I never actually checked whether the program transformations were correct, i.e. whether the result is the same for every commit so far.
- Use multiple thread blocks, i.e. put a non-1 number as the first argument of the launch syntax cuda_step<<< 1, NUMBER_OF_THREADS >>>; a small sketch of the required indexing change follows below.
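As a hedged sketch of what that future change might look like (names illustrative):

```cpp
// With multiple thread blocks, each thread computes a global row index from
// its block and thread coordinates; a bounds check replaces the strict
// NUMBER_OF_THREADS * block_size == grid_height requirement.
std::size_t row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < grid_height) {
    // ... update row `row` as before ...
}
// launched e.g. as:
//   std::size_t threads = 256;
//   std::size_t blocks = (grid_height + threads - 1) / threads;
//   cuda_step<<< blocks, threads >>>(...);
```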