CUDA Thread Hierarchy and Kernel Launch

박서경
4 min read

1. Thread Layout Configuration and Kernel Launch

The thread layout refers to the arrangement of threads, which is defined in terms of the grid and block structure.

The syntax is as follows:

Kernel<<<grid, block>>>(arguments);
  • grid: the grid dimensions (number of blocks); available inside the kernel as the built-in variable gridDim

  • block: the block dimensions (number of threads per block); available inside the kernel as the built-in variable blockDim

2. Example

// thread_layout.cu
#include <stdio.h>

__global__ void printThreadInfo() {
    // ==============================
    // Thread ID inside the block (local coordinates)
    // ==============================
    int tx = threadIdx.x;  // thread x-coordinate
    int ty = threadIdx.y;  // thread y-coordinate
    int tz = threadIdx.z;  // thread z-coordinate

    // ==============================
    // Block ID inside the grid (block coordinates)
    // ==============================
    int bx = blockIdx.x;   // block x-coordinate
    int by = blockIdx.y;   // block y-coordinate
    int bz = blockIdx.z;   // block z-coordinate

    // ==============================
    // Block dimensions (blockDim)
    // ==============================
    int bdx = blockDim.x;  // number of threads in x within a block
    int bdy = blockDim.y;  // number of threads in y within a block
    int bdz = blockDim.z;  // number of threads in z within a block

    // ==============================
    // Grid dimensions (gridDim)
    // ==============================
    int gdx = gridDim.x;   // number of blocks in x within a grid
    int gdy = gridDim.y;   // number of blocks in y within a grid
    // gridDim.z is not used, so omitted

    // ==============================
    // Global thread ID calculation (flatten 3D → 1D)
    // ==============================
    int globalThreadId =
        tx +                                   // local x position
        ty * bdx +                             // add y offset
        tz * bdx * bdy +                       // add z offset
        bx * bdx * bdy * bdz +                 // block offset in x
        by * (bdx * bdy * bdz * gdx) +         // block offset in y
        bz * (bdx * bdy * bdz * gdx * gdy);    // block offset in z

    // ==============================
    // Print results
    // ==============================
    printf("Block(%d,%d,%d) Thread(%d,%d,%d) GlobalId=%d\n",
           bx, by, bz, tx, ty, tz, globalThreadId);
}

int main() {
    // ===========================
    // Thread layout configuration
    // ===========================
    dim3 grid(3,2,2);   // Grid size = (3,2,2) → 12 blocks
    dim3 block(2,2,2);  // Block size = (2,2,2) → 8 threads per block
    // Total number of threads = 12 × 8 = 96

    // ===========================
    // Kernel launch
    // ===========================
    printThreadInfo<<<grid, block>>>();
    cudaDeviceSynchronize();

    return 0;
}

3. ceil()

Why do we need ceil()? (people analogy)

  • Thread = a worker (person)

  • Block = a group of workers

Example:

  • Total tasks to do: 1000

  • Workers per block: 256

👉 1000 ÷ 256 ≈ 3.91

  • If we only use 3 blocks: 3 × 256 = 768 workers → 232 tasks remain ❌

  • If we use 4 blocks: 4 × 256 = 1024 workers → all tasks are covered ✅

So, we must always round up to ensure all tasks are processed.


Code Example (C/C++)

#include <stdio.h>
#include <math.h>

int main() {
    int N = 1000;        // total number of tasks
    int blockSize = 256; // number of threads per block

    // calculate the number of blocks using ceil
    int gridSize = (int)ceil((double)N / blockSize);

    printf("Total work = %d\n", N);
    printf("Threads per block = %d\n", blockSize);
    printf("Blocks needed = %d\n", gridSize);
    printf("Total threads allocated = %d\n", gridSize * blockSize);

    return 0;
}

Output

Total work = 1000
Threads per block = 256
Blocks needed = 4
Total threads allocated = 1024

👉 In practice, 1024 threads are launched, but the extra 24 threads simply do nothing if we guard with if (idx < N) inside the kernel.


Common CUDA Pattern

int gridSize = (int)ceil((double)N / blockSize);
// Or, equivalently, using pure integer arithmetic
// (avoids the int/double round trip and any floating-point rounding):
int gridSize = (N + blockSize - 1) / blockSize;

✅ Summary: ceil() ensures that all tasks are covered when computing the grid size.
It’s essential in CUDA for grid/block configuration.



Q. If I allocate more blocks, there will be extra threads. What happens to those threads?

For example:

  • Total data elements: N = 1000

  • Block size: blockSize = 256

  • Required number of blocks = (1000 + 255) / 256 = 4

👉 Total threads = 4 × 256 = 1024
👉 But only 1000 are actually needed, so 24 threads remain unused.


A. How does CUDA handle these extra threads?

The answer is simple:
👉 The extra threads simply do nothing.

Each thread computes its global ID (idx) and then checks:

if (idx < N) {
    // Process only valid data
}

Threads with IDs 0–999 will perform useful work,
while threads with IDs 1000–1023 will fail the condition and do nothing.


Q. But why create extra threads at all?

  • GPUs perform best when the number of threads per block is a multiple of 32, 64, 128, 256, 512, or 1024 (warp-friendly sizes).

  • If you try to match N=1000 exactly with irregular block sizes, you often lose performance due to poor warp alignment.

  • Therefore, the standard approach is to launch a few extra threads and simply ignore them with a boundary check.

