CUDA for Wave Simulation - Part 3: Efficient CUDA

Non-trivial CUDA
The current snippet of code calling CUDA is:
cuda_step<<< 1, 1 >>>
This uses only one CUDA thread, and is probably extremely inefficient. To verify this, let's add some way to time the program. I could use NVIDIA's nsys, or just the Linux time command (or some kind of time counters in the program). In any case, I'll need a new "run configuration": run the program for a fixed number of cycles to check its efficiency. I don't think this justifies a new executable, so let's make this a program flag. For this, I'll integrate the Boost Program Options library (I could also write program options checking by hand, but Boost is easier, better tested and probably more extensible).
This pulls an additional dependency into our code, but Boost is already pretty widely installed, and requiring it is rather trivial in CMake. (A future TODO: have a fallback in case the user does not have Boost installed.)
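For illustration, here's a minimal sketch of what the flag parsing could look like with Boost Program Options. The --perf option name matches the flag used below; everything else (the entry point, the help text) is hypothetical wiring, not the repository's actual code:

```cpp
#include <boost/program_options.hpp>
#include <cstddef>
#include <iostream>

namespace po = boost::program_options;

int main(int argc, char** argv) {
    po::options_description desc("Allowed options");
    desc.add_options()
        ("help", "show this help message")
        ("perf", po::value<std::size_t>(),
         "run the given number of simulation steps without the GUI and report timing");

    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);

    if (vm.count("help")) {
        std::cout << desc << "\n";
        return 0;
    }
    if (vm.count("perf")) {
        std::size_t num_steps = vm["perf"].as<std::size_t>();
        // run_performance_test(num_steps);  // hypothetical entry point
        return 0;
    }
    // otherwise: launch the GUI as before
}
```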
The new run configuration has the following differences from the original:
- There is no visualization, as it could be a bottleneck (I'm not confident in the current communication between the UI and the simulation) and I don't want a window to pop up when I run tests.
- The program doesn't call cudaDeviceSynchronize after each step, as that is no longer necessary and can become a bottleneck. (In fact, it is one: ./src/waves --perf 1000 takes 3ms without cudaDeviceSynchronize and 2080ms with it.)
NSys analysis
First, let's run the program with nsys:

```
$ nsys profile src/waves --perf 100
$ nsys stats report1.nsys-rep
** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)     Max (ns)    StdDev (ns)          Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ----------------------
     98.0      128,491,979          2  64,245,989.5  64,245,989.5    72,052  128,419,927  90,755,652.8  cudaMallocManaged
      1.2        1,553,190        100      15,531.9       5,443.5     5,074      962,247      95,655.4  cudaLaunchKernel
[...more lines left out...]

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------
    100.0        1,923,059        100  19,230.6  16,096.0    16,063   330,014     31,392.3  cuda_step(const double *, double *, double, double, unsigned long, unsigned long, unsigned long)
```
Okay, so the allocations are still too big in comparison with the actual work: we spend 128ms (128,491,979ns) allocating memory, 1.55ms launching kernels, and 1.92ms doing the kernel work. Notes:
- This is particularly bad because, as of 265a7e2, the --perf configuration doesn't measure the initialization time, including memory allocation, so our results will differ wildly from the actual runtime whenever the allocation time is significant. (I should probably fix this.)
- I don't know whether the 1.55ms and 1.92ms should be added to or subtracted from each other (i.e. whether kernel launch time counts as part of kernel execution time), but it doesn't really matter for this example.
Sanity check: nsys confirms that we did 2 allocations and launched 100 kernels, as expected. (We allocate one "front" and one "back" matrix buffer and swap them back and forth.)
Let's increase N. I won't copy the output of nsys again; instead, here's a summary. All times are in ms.

| N | 10 (previous) | 1000 |
| --- | --- | --- |
| cudaMallocManaged total time | 128.49 | 126.56 |
| cudaMallocManaged calls | 2 | 2 |
| cudaLaunchKernel total time | 1.55 | 127.88 |
| cudaLaunchKernel number of calls | 100 | 10000 |
| Kernel run time | 1.92 | 161.19 |
Runtime analysis
Now we can do a few tests with the --perf configuration. All times are in ms.

| N | 10 | 100 | 1000 | 10000 |
| --- | --- | --- | --- | --- |
| Without CUDA | 0.07 | 0.50 | 5.04 | 50.08 |
| With CUDA | 0.41 | 0.27 | 125.95 | 11314.5 |
Again, the CUDA results for \(N \le 1000\) are not really significant, because the runtime is dominated by malloc, which falls outside the region tracked by --perf. The no-CUDA version scales linearly with N, as expected.
The CUDA version's runtime first decreases with higher N, then scales very badly for large N. A possibly related bug is that I forgot to sync the CUDA state after all steps were done. Could this be the explanation?
A nice post on CUDA kernel launching explains the process of launching a kernel, including this interesting detail:
The driver doesn’t immediately send your kernel to the GPU. Instead, it [...] allocates space in a command buffer — think of this as a todo list for the GPU
How large is this command buffer? Unfortunately, this is difficult to find out. So for all I know, the following explanation is plausible: we launch a lot of kernels without waiting, and the first 100 or so land in the command buffer. Since we don't sync, we measure only the time for launching the kernels, which is tiny. When we add more kernels, at some point the command buffer becomes full and we have to wait for queued work to be processed. So for higher N, we effectively add a synchronization that was not present for lower N, which makes the measured runtime much higher.
In any case, the fix is trivial: call cudaDeviceSynchronize at the end of each experiment. (See commit cae55ce.)
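As a minimal sketch of where the synchronization now sits (the function and parameter names here are illustrative, and the kernel's argument list is abbreviated; the real signatures live in the repository):

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <cuda_runtime.h>

// Illustrative stand-in for the real kernel, defined elsewhere.
__global__ void cuda_step(const double* in, double* out, double t, double c,
                          std::size_t grid_width, std::size_t grid_height);

// Returns the elapsed time of num_steps simulation steps in ms.
double run_perf(double* front, double* back, double t, double dt, double c,
                std::size_t grid_width, std::size_t grid_height,
                std::size_t num_steps) {
    auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < num_steps; ++n) {
        cuda_step<<<1, 1>>>(front, back, t, c, grid_width, grid_height);
        std::swap(front, back);  // the "front"/"back" buffer swap
        t += dt;
    }
    // The fix: one synchronization at the end, so the timer covers the
    // queued GPU work instead of just the asynchronous launch calls.
    cudaDeviceSynchronize();
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

With this, we have the following results: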
| N | 10 | 100 | 1000 | 10000 |
| --- | --- | --- | --- | --- |
| Without CUDA | 0.07 | 0.50 | 5.04 | 50.08 |
| With CUDA | 15.20 | 140.90 | 1128.67 | 11285.2 |
This is more sensible, although the performance is terrible. So now that we've ironed out these bugs, we can do what we wanted to do in the first place: add more CUDA threads to actually take advantage of the GPU.
Increasing threads and grid size
Using more threads can be done by simply calling the kernel as cuda_step<<< 1, NUMBER_OF_THREADS >>>. The kernel is then executed by NUMBER_OF_THREADS threads, each with a different value of threadIdx to give it an identity. We need to decide what each thread does. Fortunately, in our simple problem, each row of the grid is independent, so we can just give one or more rows to each thread. This is done with the following code:
```diff
 void cuda_step(
     const double* in, PlainCGrid out,
     double t, double c,
+    std::size_t block_size,
     std::size_t grid_width, std::size_t grid_height
 ) {
+    int start = threadIdx.x * block_size;
+    int end = (threadIdx.x + 1) * block_size;
-    for(int i = 0; i < grid_height; i++) {
+    for(int i = start; i < end; i++) {
```
where the caller has to specify the block size and ensure that NUMBER_OF_THREADS * block_size == grid_height. From the documentation, it seems to be allowed to use an arbitrary number of threads, e.g. to specify NUMBER_OF_THREADS = grid_height and block_size = 1. In fact, this works on my laptop even for NUMBER_OF_THREADS = 10000, even though the Programming Guide states that

On current GPUs, a thread block may contain up to 1024 threads.

This limit can also be confirmed on my GPU with deviceQuery. I assume that the additional threads are run sequentially on different "physical" threads.
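An aside: kernel launches are asynchronous and don't fail loudly; an invalid configuration is only visible through the error API. So one way to check whether such an over-sized launch actually ran or silently failed is a sketch like the following (variable names illustrative; requires <cstdio> and <cuda_runtime.h>):

```cpp
// Hypothetical launch site: one thread per row (block_size == 1), which
// requires NUMBER_OF_THREADS * block_size == grid_height.
std::size_t num_threads = grid_height;
std::size_t block_size = 1;
cuda_step<<<1, num_threads>>>(front, back, t, c, block_size,
                              grid_width, grid_height);
// A configuration error (e.g. exceeding the threads-per-block limit) is
// reported here, not as a crash at the launch itself.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    std::fprintf(stderr, "kernel launch failed: %s\n",
                 cudaGetErrorString(err));
}
```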
A few experiments with the grid size. I use different values of \(S\), the number of elements per side, so there are \(S^2\) elements in the full grid. The time step \(dt\) is adapted to fulfill the CFL condition, but the number of time steps stays constant, namely \(N = 100000\) timesteps. The number of time steps has to be chosen sufficiently large, because otherwise cudaDeviceSynchronize takes up most of the runtime (since it now has to sync \(O(S^2)\) memory, which can be very large).
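For reference, assuming the usual explicit scheme with each row treated as an independent 1D wave (as described above), the CFL condition reads

\[ \frac{c \, \Delta t}{\Delta x} \le 1, \qquad \Delta x \propto \frac{1}{S}, \]

so doubling \(S\) roughly halves the admissible time step \(dt\).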
| S | 10 | 50 | 100 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- |
| Without CUDA | 7.67 | 112.70 | 495.12 | 154523 | |
| With CUDA, 1 thread | 1697.91 | 29064.3 | 111808 | 3902360 | |
| With CUDA, S threads | 372.36 | 1125.84 | 2002.39 | 41444 | 210764 |
In theory, the work scales with the number of elements, i.e. with \(S^2\). We observe the following:
- The non-CUDA version scales worse than \(O(S^2)\). This could be due to cache effects: for S = 500, each grid takes 500*500*sizeof(double), i.e. 2MB; on my machine, this fits into the L3 cache but not into the L2.
- The CUDA 1-thread version has worse performance than the non-CUDA version in the beginning, but scales like \(S^2\), i.e. slightly better. This might be due to different cache effects.
- The CUDA S-thread version scales like \(S^2\) for high values of \(S\). I would have expected it to scale like \(S\): each thread takes one row (\(S\) elements), and there should be enough hardware threads that no rows have to be processed sequentially.
Merging kernels
As seen previously, launching a kernel is a somewhat complex process. Can we put the for-loop over N inside the kernel, so that we only launch the kernel once? Each thread's work is independent, so there should be no synchronization issues. This is implemented in 6bd0ee7.
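The commit itself isn't reproduced here; the following is a sketch of the idea under the assumptions above (per-row independence), with illustrative names:

```cpp
// The time loop moves inside the kernel, so the host launches once instead
// of N times. This is only safe because each thread's rows never read data
// written by other threads.
__global__ void cuda_step_merged(
    double* front, double* back,
    double t, double dt, double c,
    std::size_t block_size,
    std::size_t grid_width, std::size_t grid_height,
    std::size_t num_steps
) {
    std::size_t start = threadIdx.x * block_size;
    std::size_t end = (threadIdx.x + 1) * block_size;
    for (std::size_t n = 0; n < num_steps; ++n) {
        for (std::size_t i = start; i < end; ++i) {
            // ... update row i of `back` from row i of `front`
            //     (stencil elided; see the repository) ...
        }
        // Swap the buffers locally in each thread; no synchronization is
        // needed under the independence assumption.
        double* tmp = front;
        front = back;
        back = tmp;
        t += dt;
    }
}
```

The runtime is as follows: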
| S | 10 | 50 | 100 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- |
| Without CUDA | 7.67 | 112.70 | 495.12 | 154523 | |
| With CUDA, 1 thread | 1697.91 | 29064.3 | 111808 | 3902360 | |
| With CUDA, S threads | 372.36 | 1125.84 | 2002.39 | 41444 | 210764 |
| With CUDA, S threads, one kernel | 178.77 | 661.35 | 1848.84 | 41206 | 210291 |
So this is an improvement, but one that becomes less relevant for high \(S\), as the actual kernel execution and memory transfers outweigh the kernel launch overhead.
Future work
Bugs
- Even though the --perf version runs fine, there's now a SEGFAULT when running without --perf, i.e. with the GUI.
- CUDA with \(S = 10^4\) threads and \(N = 1000\) reports 0.64ms runtime, where \(S = 10^3\) took 2162ms. There is probably something somewhere that immediately fails and exits.
Features
- I never actually checked whether the program transformations were correct, i.e. whether the result is the same for every commit so far.
- Use multiple thread blocks, i.e. put a non-1 number as the first argument of the launch syntax cuda_step<<< 1, NUMBER_OF_THREADS >>>; a small sketch of the required indexing change follows below.
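As a hedged sketch of what that future change might look like (names illustrative):

```cpp
// With multiple thread blocks, each thread computes a global row index from
// its block and thread coordinates; a bounds check replaces the strict
// NUMBER_OF_THREADS * block_size == grid_height requirement.
std::size_t row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < grid_height) {
    // ... update row `row` as before ...
}
// launched e.g. as:
//   std::size_t threads = 256;
//   std::size_t blocks = (grid_height + threads - 1) / threads;
//   cuda_step<<< blocks, threads >>>(...);
```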