This time, let’s write some code for a GrayScale conversion.

GrayScale conversion = turning a color image into a black-and-white (grayscale) one.

Normally, a color image is expressed with three channels: RGB (red, green, blue).
What we do here is squeeze those three values into a single brightness value (0–255).

Why bother?

Simplification: Using brightness instead of full color slashes the amount of computation → perfect for GPU parallel programming practice.
Pre-processing: In computer vision tasks like face detection or edge detection, the first step is often converting to grayscale.
Visualization: When color isn’t needed (X-rays, black-and-white prints), dumping the RGB info is just more efficient.

Conversion method (example)

The most common formula is a weighted sum:

Gray = 0.299 * R + 0.587 * G + 0.114 * B

The human eye is more sensitive to green (G) → that’s why it gets the largest weight.

There’s also the simple average method: just add up R, G, and B and divide by 3 → (R+G+B)/3(R+G+B)/3(R+G+B)/3.

GrayScale conversion is per-pixel independent, so there’s no need for thread communication or synchronization. From a learner’s perspective, this makes it one of the easiest GPU kernels

1. Base knowledge

🔹 1. What is the interleaved format?

Image data is usually stored per pixel, with R, G, and B values mixed together (interleaved).

Example:

픽셀 0: R0 G0 B0
픽셀 1: R1 G1 B1
픽셀 2: R2 G2 B2
...

So on the GPU, if threadIdx.x == i, then that thread should grab the R/G/B of the i-th pixel all at once.

The opposite concept is the planar format:

RRRR.... GGGG.... BBBB....

(stored per channel instead of per pixel) → this is common in deep learning frameworks.
For now, though, we only care about the interleaved format.

🔹 2. Data types in CUDA

CUDA provides vector types to handle RGB-like image data:

uchar3: R, G, B (3 bytes)
uchar4: R, G, B, A (4 bytes, includes Alpha)

So, why is uchar4 more efficient?

Because GPU global memory runs fastest when accesses are aligned to 32/64/128 bytes (coalesced).

With uchar3 (3 bytes per pixel), 32 threads reading in parallel end up with misaligned access → inefficient.
With uchar4 (4-byte aligned), it fits perfectly into memory transactions → much faster.

👉 That’s why in practice, even plain RGB images often get padded to RGBA and handled as uchar4.

(This part can be confusing, so I’ll go deeper into it later at the end.)

🔹 3. Output (Grayscale buffer)

The output is 1-channel grayscale:

Each pixel = 1 byte (uint8_t)
Formula:

gray[i]=0.299R+0.587G+0.114Bgray[i] = 0.299R + 0.587G + 0.114Bgray[i]=0.299R+0.587G+0.114B

🔹4. Example: Input RGBA (`uchar4`) → Output Gray (1 channel)

#include <cuda_runtime.h>
#include <stdint.h>

__device__ __forceinline__ uint8_t rgb_to_gray_u8(uint8_t r, uint8_t g, uint8_t b) {
    // (77*R + 150*G + 29*B) / 256
    return static_cast<uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

__global__ void rgba_to_gray_kernel(const uchar4* __restrict__ in,
                                    uint8_t* __restrict__ out,
                                    int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; // col
    int y = blockIdx.y * blockDim.y + threadIdx.y; // row
    if (x >= width || y >= height) return;

    int idx = y * width + x;        // row-major
    uchar4 p = in[idx];             // coalesced load (4B aligned)
    out[idx] = rgb_to_gray_u8(p.x, p.y, p.z); // x=R, y=G, z=B, w=A
}

2. STB Library

Want to throw an image onto the GPU and play with it?
The very first thing you need to grab is the STB library.

Heavy installs? Complicated builds?
Nope. None of that. Just drop stb_image.h and stb_image_write.h into your project folder and you’re done.
Loading and saving? STB takes care of it all.

Popular STB header files (the ones you’ll use the most):

stb_image.h → image loader (PNG/JPG/BMP/PSD/TGA/GIF/HDR/… → load as RGBA, etc.)
stb_image_write.h → image writer (PNG/JPG/BMP/TGA/HDR)
stb_truetype.h → font rendering
stb_rect_pack.h → rectangle packing (texture atlas)
stb_vorbis.c → audio decoder (this one is a .c file)
and a bunch of other handy utilities.

📥 How to get it

Download directly from GitHub

  wget https://raw.githubusercontent.com/nothings/stb/master/stb_image.h
  wget https://raw.githubusercontent.com/nothings/stb/master/stb_image_write.h

3. Hands-on

// grayscale.cu
// Build: nvcc -O2 grayscale.cu -o grayscale
// Usage: ./grayscale input.jpg out.png

#define STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image.h"
#include "stb_image_write.h"

#include <cuda_runtime.h>
#include <stdint.h>
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cmath>

#define CUDA_CHECK(expr) do {                              \
  cudaError_t _e = (expr);                                 \
  if (_e != cudaSuccess) {                                 \
    fprintf(stderr, "CUDA error %s:%d: %s\n",              \
            __FILE__, __LINE__, cudaGetErrorString(_e));   \
    std::exit(1);                                          \
  }                                                        \
} while(0)

__device__ __forceinline__ uint8_t rgb_to_gray_u8(uint8_t r, uint8_t g, uint8_t b) {
    // BT.601 integer approximation: (77R + 150G + 29B) / 256
    return static_cast<uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

__global__ void rgba_to_gray_kernel(const uchar4* __restrict__ in,
                                    uint8_t* __restrict__ out,
                                    int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; // column
    int y = blockIdx.y * blockDim.y + threadIdx.y; // row
    if (x >= width || y >= height) return;

    int idx = y * width + x;        // row-major indexing
    uchar4 p = in[idx];             // coalesced load (4B aligned)
    out[idx] = rgb_to_gray_u8(p.x, p.y, p.z); // x=R, y=G, z=B, w=A
}

// CPU reference (same integer approximation formula)
static inline uint8_t rgb_to_gray_u8_cpu(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

int main(int argc, char** argv) {
    if (argc < 3) {
        printf("Usage: %s <input_image> <output_png>\n", argv[0]);
        return 0;
    }
    const char* in_path = argv[1];
    const char* out_path = argv[2];

    int w = 0, h = 0, ch = 0;
    // force RGBA (4 channels)
    unsigned char* img = stbi_load(in_path, &w, &h, &ch, 4);
    if (!img) {
        fprintf(stderr, "Failed to load image: %s\n", in_path);
        return 1;
    }
    size_t num_pixels = static_cast<size_t>(w) * h;
    size_t in_bytes   = num_pixels * 4;  // RGBA
    size_t out_bytes  = num_pixels;      // Gray (1 channel)

    // Device allocation
    uchar4* d_in = nullptr;
    uint8_t* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in,  in_bytes));
    CUDA_CHECK(cudaMalloc(&d_out, out_bytes));

    // Host to Device copy
    CUDA_CHECK(cudaMemcpy(d_in, img, in_bytes, cudaMemcpyHostToDevice));

    // Grid/Block configuration
    dim3 block(32, 8);
    dim3 grid((w + block.x - 1) / block.x,
              (h + block.y - 1) / block.y);

    // Warm-up (to eliminate context/JIT overhead)
    CUDA_CHECK(cudaFree(0));

    // Timing setup
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start));

    // Kernel launch
    rgba_to_gray_kernel<<<grid, block>>>(d_in, d_out, w, h);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float ms = 0.f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Kernel time: %.3f ms (w=%d, h=%d)\n", ms, w, h);

    // Device to Host copy
    std::vector<unsigned char> host_out(out_bytes);
    CUDA_CHECK(cudaMemcpy(host_out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost));

    // (Optional) accuracy check: compare against CPU result (expect max_diff=0)
    size_t mismatches = 0;
    uint8_t max_diff = 0;
    for (size_t i = 0; i < num_pixels; ++i) {
        uint8_t r = img[4*i + 0];
        uint8_t g = img[4*i + 1];
        uint8_t b = img[4*i + 2];
        uint8_t y_cpu = rgb_to_gray_u8_cpu(r, g, b);
        uint8_t y_gpu = host_out[i];
        uint8_t diff = (y_cpu > y_gpu) ? (y_cpu - y_gpu) : (y_gpu - y_cpu);
        if (diff != 0) { ++mismatches; max_diff = std::max(max_diff, diff); }
    }
    printf("Check: mismatches=%zu, max_diff=%u\n", mismatches, (unsigned)max_diff);

    // Save PNG (1-channel grayscale)
    if (!stbi_write_png(out_path, w, h, 1, host_out.data(), w)) {
        fprintf(stderr, "Failed to write image: %s\n", out_path);
    } else {
        printf("Saved: %s\n", out_path);
    }

    // Cleanup
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    stbi_image_free(img);
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return 0;
}

That turned out way cooler than I expected.

4. Important: Why is `uchar4` more efficient?

A GPU runs fastest when multiple threads access contiguous memory at the same time.
This is called memory coalescing, and GPUs handle data most efficiently when it’s aligned to units like 32, 64, or 128 bytes.

uchar4 fits perfectly into 4-byte alignment.

uchar4 is made up of 4 bytes.
When 32 threads each read one uchar4, that’s 32 × 4 = 128 bytes.
And 128 bytes is exactly the “sweet spot” size a GPU can process in a single transaction.

It’s like moving a box of neatly packed items in one go — the GPU grabs the data as a clean block, so it’s fast.

uchar3 breaks the alignment.

uchar3 is 3 bytes.
When 32 threads each read one uchar3, that’s 32 × 3 = 96 bytes.
96 bytes doesn’t line up with the GPU’s efficient transaction size.
The GPU can’t fetch it all at once, so it splits the work into multiple loads — slower and inefficient.

It’s like trying to carry loose items that aren’t boxed — you have to make multiple trips.

Analogy: Grocery shopping 🛒

Think of it this way:

uchar4 is like instant noodles packed in bundles of 4. At the checkout (GPU), the cashier can scan the 4-pack in one go. If 32 customers all bring 4-packs, that’s 128 noodles scanned super fast in a single sweep.
uchar3 is like awkward 3-packs. The cashier is optimized for 4-packs, so those 3-packs need to be scanned in parts. If 32 customers bring 3-packs, the cashier has to break it up into multiple scans — much slower.

GrayScale conversion (pixel-independent processing)

Table of contents

Conversion method (example)

1. Base knowledge

🔹 1. What is the interleaved format?

🔹 2. Data types in CUDA

🔹 3. Output (Grayscale buffer)

🔹4. Example: Input RGBA (`uchar4`) → Output Gray (1 channel)

2. STB Library

3. Hands-on

4. Important: Why is `uchar4` more efficient?

Subscribe to my newsletter

박서경

박서경

GrayScale conversion (pixel-independent processing)

Table of contents

Conversion method (example)

1. Base knowledge

🔹 1. What is the interleaved format?

🔹 2. Data types in CUDA

🔹 3. Output (Grayscale buffer)

🔹4. Example: Input RGBA (uchar4) → Output Gray (1 channel)

2. STB Library

3. Hands-on

4. Important: Why is uchar4 more efficient?

Subscribe to my newsletter

박서경

박서경

🔹4. Example: Input RGBA (`uchar4`) → Output Gray (1 channel)

4. Important: Why is `uchar4` more efficient?