Introduction to CUDA Programming


It hasn’t been long since I was new to this stuff, so I will keep this as comprehensible as possible. The thing is, when entering the domain of GPU programming, you need to understand the GPU before you program it. This means we will have to get through some technical jargon before getting to code. The best introduction is still the CUDA docs and a few books; this is just a toned-down quickstart.
The first thing to know is that there are two main hierarchies in CUDA: the Execution hierarchy and the Memory hierarchy.
The Execution hierarchy (the purple one in the image) explains how CUDA divides work to achieve massive parallelism. If you’ve studied some Operating Systems, or even CPU parallelism, you will already be familiar with the term thread.
- Thread: The most fundamental, logically independent execution unit. Each thread has its own registers and executes one instance of the kernel function (we will get to kernels in a moment).
In layman’s terms, if everything else is the vocabulary, a thread is a single letter, like an atom or a cell. We will get to a better analogy later.
These threads never work alone. They work in groups.
- Block: A group of threads is called a thread block, or simply a Block. Remember how each thread has its own registers for local storage? Similarly, each block has Shared Memory, which is accessible by all threads within that block.
So the threads can be coordinated to access shared memory, to perform a task.
- Grid: Just as threads are grouped into blocks, all the blocks collectively are called a Grid. The grid level is also where Global Memory sits, which is large but slow to fetch from (again, this might be intuitive if you have read about file systems before).
All the blocks have access to global memory (of course). With this, we have also mostly covered the Memory hierarchy itself (the green one), except for constant memory.
So constant memory is a separate thing. It resides in the global memory space, but despite that, it is very fast, because it is backed by a small on-chip constant cache. In other words, its speed comes from a design choice of the chip.
If all threads in a warp (we will get to warps in a minute) read the same constant address, the value is broadcast in a single transaction, which makes the speed really fast, almost register-like.
You can define something to be in constant memory if you want to access it again and again (and if it is a constant, of course).
All in all:
Global memory should be accessed in a coalesced manner to exploit maximum bandwidth.
Shared memory should be used to minimize global memory accesses, particularly in scenarios requiring high inter-thread communication within blocks.
Constant memory should be employed for read-only data that is accessed by all threads, benefiting from the low latency offered by the on-chip cache (a small sketch follows).
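For instance, here is a minimal sketch of how read-only data could be placed in constant memory (the names coeffs, scale, and h_coeffs are made up for illustration, and the thread-index arithmetic is explained later in this post). The array is declared with __constant__ and filled from the host with cudaMemcpyToSymbol:
__constant__ float coeffs[4];   // lives in the constant memory space, served by the on-chip constant cache

__global__ void scale(const float *in, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread reads the same constant addresses, so the reads are broadcast to the warp.
    if (i < N) out[i] = coeffs[0] * in[i] + coeffs[1];
}

// On the host, before launching the kernel:
// float h_coeffs[4] = {2.0f, 1.0f, 0.0f, 0.0f};
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));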
Warps
This clears up our two main hierarchies, the execution and memory ones. If you go back to the image we referenced, you will notice that we skipped one last thing: warps.
Both warps and blocks are groups of threads, so it is important to understand the difference properly. There are two main things to note here:
A warp is a collection of exactly 32 threads (a fixed number).
A warp follows SIMT (Single Instruction, Multiple Threads), i.e., all 32 threads execute the same instruction at the same time. They can still access shared memory, but they do so while carrying out that one common instruction stream.
A warp is a hardware-enforced concept: NVIDIA GPUs are designed to group 32 threads together and execute one instruction across them through SIMT. Blocks and grids, on the other hand, are part of the programming model. We programmers define the block size to manage the division of labour within the grid, but those blocks are further split by the hardware into warps. That is why it is preferred to have the block size be a multiple of 32, so that it maps onto warps without waste.
Now if you would look again at the image, you will find it to be much more intuitive than before. If not, then just read everything again while looking at the diagram and I hope it will be clear.
Otherwise, here is an analogy to tone it down:
- Consider the Grid to be a factory and the Blocks to be the workshops inside the factory. Within a single workshop, workers operate in groups of 32 (the warps), each group working on one task.
There is a separate hardware organization hierarchy which you can look up if you are interested, but I will omit it for now and mention it in the end as it may complicate things.
This is enough background for us to get started with a few CUDA programs. First, let’s write a simple vector addition to see the structure of a CUDA C++ program. Of course, to compile and run this, you will need a GPU and a properly set up CUDA toolkit. You can follow the setup instructions in the docs for that.
Vector Addition
So there are mainly two parts to a CUDA program. One is the kernel, where you define the instructions you want the GPU threads to carry out. The other is the regular C++ host program, where you allocate the memory, call the kernel on it, and then free the memory.
Most of the CUDA part, as you can guess, happens while writing the kernel. For our vector addition task, we will keep it simple and just use global memory for now.
__global__ void vec_add(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}
The important thing to note here is that while writing CUDA kernels, you must compute a global index so each thread knows which element to work on. The calculation goes like this: blockIdx tells us which block within the grid the thread belongs to, and blockDim tells us how many threads there are within that block. If you have read about row-major indexing before, multiplying them might feel intuitive. If not, don’t worry, they are not exactly the same.
CUDA provides per-block thread indices (threadIdx) and block indices (blockIdx), as you can see here; it does not think in terms of rows and columns. Obviously, you can’t use threadIdx alone. That would be like telling someone your house number is 33; it means nothing on its own. So we have to provide an offset as well, i.e., the location of house number 1.
In our case, if each block holds 256 threads (say), then block 0 holds threads 0-255, block 1 holds threads 256-511, and so on. If we multiply the block number (blockIdx) by the number of threads per block (blockDim), we get the offset in terms of thread number. For thread number 513, that is (2 × 256 + 1), i.e.: blockIdx.x * blockDim.x + threadIdx.x
Also, in the code, the variable N is the size of the vector. We perform a simple bounds check on the calculated index to make sure it lies within the array. Why it is necessary here, and whether or not it is good practice, are questions we will address as we go.
One more thing before we proceed. Sometimes it becomes cumbersome to pin down the line that is causing an error, especially with CUDAUnknownError (which can often be resolved with a restart, but still). Also, sometimes the code will run but not print any particular error message. So, until we are using a proper debugger, it is better to define a CUDA_CHECK macro to at least get an idea of the error and the line where it breaks.
// CUDA error checking macro
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::cerr << "CUDA Error: " << cudaGetErrorString(err_) << " at line " \
                      << __LINE__ << std::endl; \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
(A macro body is supposed to be a single logical line, but the backslashes let us split it up to maintain readability. Storing the result in the local err_ also means the wrapped call is evaluated only once.)
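To use it, you wrap any CUDA runtime call that returns a cudaError_t, exactly as the program below does, for example:
CUDA_CHECK(cudaMalloc((void **)&d_A, size));
CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));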
Now what remains is writing the C++ code, which is often simple once you know a few methods from the CUDA API:
cudaMalloc: Just like regular malloc, used to allocate memory on the GPU.
cudaMemcpy: Copies data from the CPU to the GPU or vice versa.
cudaFree: As you can guess, used to free the allocated memory on the GPU.
Most of the main program while learning CUDA will revolve around allocating memory separately on both the CPU and the GPU, populating the CPU data containers if they are not filled already, calling the kernel after choosing a proper block size, and then freeing everything. This is also why the above three methods are important.
Other than these, we have used cudaGetLastError to check for kernel launch errors, and cudaDeviceSynchronize. You will often see the latter, or __syncthreads, while learning CUDA. Because of the massive parallelism, with so many things happening at once, it is important to wait for everything to complete before proceeding to the next step, to avoid race conditions and unintended overlapping writes.
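As a minimal sketch of the device-side barrier (not part of the vector-addition example; it assumes a block size of 256 and an array length that is an exact multiple of it), here is a kernel where each block stages its chunk of the input into shared memory and writes it back reversed. The __syncthreads() guarantees every write to the shared buffer has finished before any thread reads from it:
__global__ void block_reverse(const float *in, float *out) {
    __shared__ float buf[256];                   // one entry per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];                    // stage the element into fast shared memory
    __syncthreads();                             // wait until the whole block has finished writing
    out[i] = buf[blockDim.x - 1 - threadIdx.x];  // now it is safe to read what other threads wrote
}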
The full code is below, and with the comments it is mostly self-explanatory.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdlib>
#include <iostream>

//Define CUDA_CHECK Macro here with the help of the code given above
//Define kernel here with the help of the code given above

int main() {
    int N = 1 << 16;
    size_t size = N * sizeof(float);

    // Allocate host memory
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    if (!h_A || !h_B || !h_C) {
        std::cerr << "Host memory allocation failed" << std::endl;
        return 1;
    }

    // Initialize input vectors
    for (int i = 0; i < N; i++) {
        h_A[i] = i * 0.5f;
        h_B[i] = i * 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc((void **)&d_A, size));
    CUDA_CHECK(cudaMalloc((void **)&d_B, size));
    CUDA_CHECK(cudaMalloc((void **)&d_C, size));

    // Copy data from host to device
    CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice));

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vec_add<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    CUDA_CHECK(cudaGetLastError());      // Check for kernel launch errors
    CUDA_CHECK(cudaDeviceSynchronize()); // Wait for kernel to finish

    // Copy result back to host
    CUDA_CHECK(cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost));

    // Validate results
    bool valid = true;
    for (int i = 0; i < 10; i++) {
        float expected = h_A[i] + h_B[i]; // Expected: i * 0.5 + i * 2.0 = i * 2.5
        std::cout << "Element " << i << ": " << h_C[i] << " (Expected: " << expected
                  << ")" << std::endl;
        if (std::fabs(h_C[i] - expected) > 1e-5) {
            valid = false;
        }
    }
    if (!valid) {
        std::cerr << "Validation failed: Incorrect results" << std::endl;
    } else {
        std::cout << "Validation passed for first 10 elements" << std::endl;
    }

    // Free device memory
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
You can print more elements if you want to validate further. There are a few other things you may notice in this program, but let’s try to compile and execute it first.
To compile this, make sure you have properly installed and set up the NVIDIA CUDA Toolkit. A simple check is to run nvcc --version to see the version of your compiler, and nvidia-smi for an overall report of your GPU and the processes running on it. If either of them is not working properly, you should recheck your setup.
If it works fine, you can proceed with the following command:
nvcc vec_add.cu -o vec_add -arch=sm_XX
The arch flag may be unnecessary most of the time, but you can sometimes run into errors during compilation, so it is better to find out your device’s compute capability and put it there. If your compute capability is, say, 7.5, then you should write sm_75; similarly, sm_86 for 8.6, sm_89 for 8.9, and so on.
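If you are not sure what your compute capability is, one option (a small standalone helper, separate from the vector-add program) is to ask the runtime itself through cudaGetDeviceProperties:
#include <cuda_runtime.h>
#include <iostream>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    std::cout << "Compute capability: " << prop.major << "." << prop.minor << std::endl;
    return 0;
}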
After this, a compiled binary will be generated, and you can run it simply with:
./vec_add
It should print something like:
Element 0: 0 (Expected: 0)
Element 1: 2.5 (Expected: 2.5)
Element 2: 5 (Expected: 5)
Element 3: 7.5 (Expected: 7.5)
Element 4: 10 (Expected: 10)
Element 5: 12.5 (Expected: 12.5)
Element 6: 15 (Expected: 15)
Element 7: 17.5 (Expected: 17.5)
Element 8: 20 (Expected: 20)
Element 9: 22.5 (Expected: 22.5)
Validation passed for first 10 elements
Getting something up and running sure brings a sigh of relief, but there are still a few things to keep in mind.
Choosing Block Dimension
Firstly, you will have noticed that while defining blocksPerGrid, we add threadsPerBlock - 1 (one less than the divisor) so that the division rounds up to its ceiling and not its floor. Otherwise, if N turned out not to be an exact multiple of threadsPerBlock, we would end up with too few blocks, leaving some elements unprocessed.
Say N = 1000 and threadsPerBlock = 256.
N / threadsPerBlock = 1000 / 256 = 3 (integer division rounds down). But 3 * 256 = 768 threads in total, so only 768 elements get processed. That leaves 1000 - 768 = 232 elements unprocessed.
By adding (threadsPerBlock - 1), we "push up" the division to the next integer if there is any remainder: blocksPerGrid = (1000 + 255) / 256 = 1255 / 256 = 4.
Now we launch 4 * 256 = 1024 threads, which covers all 1000 elements (with 24 "extra" threads that won’t be used).
We can use an if statement, like we already did in the kernel, to keep those extra threads from touching out-of-range elements (see the small sketch below). That is a much easier fix than having unprocessed elements.
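This rounding-up trick is just integer ceiling division. A tiny illustrative helper (not part of the original program) with the numbers from above:
inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// ceil_div(1000, 256) == (1000 + 255) / 256 == 4   -> 4 blocks, 1024 threads, all elements covered
// ceil_div(768, 256)  == (768 + 255) / 256  == 3   -> exact multiples are not pushed up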
Warp Divergence
Speaking of conditional statements, there is a second thing we need to keep in mind: warp divergence.
The problem is that conditionals like an if statement may cause branching between the threads of a warp. Look at this code, for instance:
__global__ void kernel(int *data) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) data[tid] *= 2;
    else data[tid] += 5;
}
Here, the threads perform entirely different work depending on whether their index is even or odd. A warp cannot do that in parallel: since a warp is a hardware-enforced group of 32 threads that must execute the same instruction, the even and odd paths get serialized.
That is, the even threads execute first while the odd ones are masked, and then the odd ones execute while the even ones are masked. This not only degrades performance, since the work was supposed to happen in parallel; combined with conditional writes, it is also easy to end up reading values that were never written. Look at this code:
__global__ void Kernel(int *data) {
    __shared__ int temp[32];
    int tid = threadIdx.x;
    if (tid < 16) temp[tid] = data[tid];
    __syncthreads();
    data[tid] = temp[tid];
}
Here, only the first 16 threads write to temp, and __syncthreads() makes sure that all the threads reach the same execution point before continuing. It’s like asking 32 people to run, where only 16 of them have to pick a coin up from the ground, and then everyone has to line up at a checkpoint (the __syncthreads()) before starting again. When you now check for coins, half of them obviously won’t have any; that is, when reading from that array in the last line, half the threads read values that were never written (the fix is to have every thread write an entry, or to guard the final read with the same condition).
In our case it doesn’t matter much, but in general you want to avoid divergent conditionals within a warp. There are ways to do that which you can learn as you read more about GPU programming. At this level, the easiest options are to pick sizes that are multiples of 32, write algorithms that avoid conditionals (see the sketch below), or separate the different cases into different kernels.
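As one illustrative sketch (not the only or necessarily the fastest fix, and the compiler often predicates branches this small anyway), the even/odd kernel from above can be rewritten so that every thread executes the same instruction stream, blending the two results arithmetically:
__global__ void kernel_branchless(int *data) {
    int tid = threadIdx.x;
    int x = data[tid];
    int is_even = (tid % 2 == 0) ? 1 : 0;
    // Compute both candidate results and select between them without branching.
    data[tid] = is_even * (x * 2) + (1 - is_even) * (x + 5);
}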
The Hardware Hierarchy
We left this for later, remember? This section covers what the hardware arrangement of an NVIDIA GPU looks like and how it maps onto our execution model. Again, there are three main terms we should know:
- Streaming Multiprocessors (SMs): An SM is a kind of self-contained processing engine. Each GPU has multiple SMs, and each SM contains a collection of CUDA cores, along with shared memory, a constant cache, ALUs, SFUs (special function units), LD/ST (load/store) units, warp schedulers, and so on. Basically, it sits at the top of this hierarchy.
Mapping it to our model, each block runs entirely on a single SM; it never splits across SMs. However, it’s not a one-to-one mapping: an SM can host multiple blocks.
- Warps: We already know about them at this point.
- CUDA Cores: While the thread is the most fundamental execution unit in the software-defined sense we already discussed, cores sit at exactly that level on the hardware side.
A CUDA core is primarily designed for single-precision floating-point (FP32) and integer (INT32) operations. (Note: modern architectures have separate INT32 and FP32 units, but the conceptual role is similar.) A short snippet after the diagram below shows how to query these hardware numbers for your own GPU.
This diagram can help clarify the relation further:
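Diagram aside, if you want to see these hardware numbers for your own GPU, the same cudaGetDeviceProperties call used in the compilation section reports them too (a small standalone sketch querying device 0; the fields shown come from the CUDA runtime’s cudaDeviceProp struct):
#include <cuda_runtime.h>
#include <iostream>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::cout << "SMs:               " << prop.multiProcessorCount << std::endl;
    std::cout << "Warp size:         " << prop.warpSize << std::endl;
    std::cout << "Max threads/block: " << prop.maxThreadsPerBlock << std::endl;
    std::cout << "Shared mem/block:  " << prop.sharedMemPerBlock << " bytes" << std::endl;
    return 0;
}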
You don’t actually need to know how a full CUDA workflow goes from start to end when you are just starting, so we won’t discuss that here, but it is worth looking up as you progress.
This is the end. Thank you for reading till here. This is the first blog I’m not so sure about, so please point out any errors or suggestions.
One last thing: the example we discussed is more of a hello-world and not exactly a good representative of how CUDA parallelism enhances performance and shows its true power. To see that, a better project is matrix multiplication with 2D tiling, which you can read about in this much more amazing blog.