CUDA Thread Hierarchy and Kernel Launch

1. Thread Layout Configuration and Kernel Launch
The thread layout is the arrangement of threads, defined in terms of a grid of blocks and the threads within each block.
The launch syntax is as follows:
Kernel<<<grid, block>>>(...)
grid: the overall grid size, i.e. the number of blocks (available inside the kernel as gridDim)
block: the number of threads within each block (available inside the kernel as blockDim)
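For orientation, here is a minimal 1D sketch of this launch syntax; the kernel name helloKernel and the sizes are placeholders, not part of the larger example in the next section.
// minimal_launch.cu — minimal sketch of the launch syntax (names and sizes are placeholders)
#include <stdio.h>

__global__ void helloKernel() {
    // Inside the kernel, gridDim/blockDim reflect the launch configuration
    printf("block %d of %d, thread %d of %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main() {
    dim3 grid(2);   // grid  = 2 blocks
    dim3 block(4);  // block = 4 threads per block
    helloKernel<<<grid, block>>>();  // Kernel<<<grid, block>>>(...)
    cudaDeviceSynchronize();         // wait so the device printf output is flushed
    return 0;
}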
2. Example
// thread_layout.cu
#include <stdio.h>

__global__ void printThreadInfo() {
    // ==============================
    // Thread ID inside the block (local coordinates)
    // ==============================
    int tx = threadIdx.x; // thread x-coordinate
    int ty = threadIdx.y; // thread y-coordinate
    int tz = threadIdx.z; // thread z-coordinate

    // ==============================
    // Block ID inside the grid (block coordinates)
    // ==============================
    int bx = blockIdx.x; // block x-coordinate
    int by = blockIdx.y; // block y-coordinate
    int bz = blockIdx.z; // block z-coordinate

    // ==============================
    // Block dimensions (blockDim)
    // ==============================
    int bdx = blockDim.x; // number of threads in x within a block
    int bdy = blockDim.y; // number of threads in y within a block
    int bdz = blockDim.z; // number of threads in z within a block

    // ==============================
    // Grid dimensions (gridDim)
    // ==============================
    int gdx = gridDim.x; // number of blocks in x within a grid
    int gdy = gridDim.y; // number of blocks in y within a grid
    // gridDim.z is not needed in the calculation below, so it is omitted

    // ==============================
    // Global thread ID calculation (flatten 3D → 1D)
    // ==============================
    int globalThreadId =
        tx +                                // local x position
        ty * bdx +                          // add y offset within the block
        tz * bdx * bdy +                    // add z offset within the block
        bx * bdx * bdy * bdz +              // block offset in x
        by * (bdx * bdy * bdz * gdx) +      // block offset in y
        bz * (bdx * bdy * bdz * gdx * gdy); // block offset in z

    // ==============================
    // Print results
    // ==============================
    printf("Block(%d,%d,%d) Thread(%d,%d,%d) GlobalId=%d\n",
           bx, by, bz, tx, ty, tz, globalThreadId);
}

int main() {
    // ===========================
    // Thread layout configuration
    // ===========================
    dim3 grid(3, 2, 2);  // Grid size  = (3,2,2) → 12 blocks
    dim3 block(2, 2, 2); // Block size = (2,2,2) → 8 threads per block
    // Total number of threads = 12 × 8 = 96

    // ===========================
    // Kernel launch
    // ===========================
    printThreadInfo<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}
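If you want to try the example yourself, it can be compiled with nvcc (the output file name below is arbitrary):
nvcc thread_layout.cu -o thread_layout
./thread_layout
Each of the 96 threads prints one line; the order of the lines is not deterministic.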
3. ceil()
Why do we need ceil()? (a people analogy)
Thread = a worker (person)
Block = a group of workers
Example:
Total tasks to do: 1000
Workers per block: 256
👉 1000 ÷ 256 ≈ 3.9
If we only use 3 blocks: 3 × 256 = 768 workers → 232 tasks remain ❌
If we use 4 blocks: 4 × 256 = 1024 workers → all tasks are covered ✅
So, we must always round up to ensure all tasks are processed.
Code Example (C/C++)
#include <stdio.h>
#include <math.h>

int main() {
    int N = 1000;        // total number of tasks
    int blockSize = 256; // number of threads per block

    // calculate the number of blocks using ceil
    int gridSize = (int)ceil((double)N / blockSize);

    printf("Total work = %d\n", N);
    printf("Threads per block = %d\n", blockSize);
    printf("Blocks needed = %d\n", gridSize);
    printf("Total threads allocated = %d\n", gridSize * blockSize);
    return 0;
}
Output
Total work = 1000
Threads per block = 256
Blocks needed = 4
Total threads allocated = 1024
👉 In practice, 1024 threads are launched, but the extra 24 threads simply do nothing if we guard with if (idx < N) inside the kernel.
Common CUDA Pattern
int gridSize = (int)ceil((double)N / blockSize);
// Or using integer arithmetic (faster and safer)
int gridSize = (N + blockSize - 1) / blockSize;
✅ Summary: ceil() ensures that all tasks are covered when computing the grid size. It’s essential in CUDA for grid/block configuration.
Q. If I allocate more blocks, there will be extra threads. What happens to those threads?
For example:
Total data elements: N = 1000
Block size: blockSize = 256
Required number of blocks = (1000 + 255) / 256 = 4
👉 Total threads = 4 × 256 = 1024
👉 But only 1000 are actually needed, so 24 threads remain unused.
A. The answer is simple:
👉 The extra threads simply do nothing.
Each thread computes its global ID (idx) and then checks:
if (idx < N) {
    // Process only valid data
}
Threads with IDs 0–999 will perform useful work,
while threads with IDs 1000–1023 will fail the condition and do nothing.
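To make the boundary check concrete, here is a minimal sketch of the full pattern; the kernel name scaleKernel and the doubling operation are just illustrative choices, not from the text above.
// guarded_kernel.cu — sketch of the standard boundary-check pattern (illustrative)
#include <stdio.h>

__global__ void scaleKernel(float *data, int n) {
    // Global 1D thread ID: block offset + local thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {          // threads 1000–1023 fail this check and do nothing
        data[idx] *= 2.0f;  // process only valid elements
    }
}

int main() {
    int N = 1000;
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize; // integer round-up → 4 blocks

    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    scaleKernel<<<gridSize, blockSize>>>(d_data, N);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}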
Q. But why create extra threads at all?
GPUs perform best when the number of threads per block is a multiple of the warp size (32); common warp-friendly block sizes are 128, 256, 512, or 1024.
If you try to match N = 1000 exactly with an irregular block size, you often lose performance due to poor warp alignment. Therefore, the standard approach is to launch a few extra threads and simply ignore them with a boundary check.
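As a side note, the warp size can be queried at runtime from the device properties; a minimal sketch, assuming device 0:
#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0
    printf("Warp size = %d\n", prop.warpSize);  // 32 on current NVIDIA GPUs
    return 0;
}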