GrayScale conversion (pixel-independent processing)

This time, let’s write some code for a GrayScale conversion.
GrayScale conversion = turning a color image into a black-and-white (grayscale) one.
Normally, a color image is expressed with three channels: RGB (red, green, blue).
What we do here is squeeze those three values into a single brightness value (0–255).
Why bother?
Simplification: Using brightness instead of full color slashes the amount of computation → perfect for GPU parallel programming practice.
Pre-processing: In computer vision tasks like face detection or edge detection, the first step is often converting to grayscale.
Visualization: When color isn’t needed (X-rays, black-and-white prints), dumping the RGB info is just more efficient.
Conversion method (example)
The most common formula is a weighted sum:
Gray = 0.299 * R + 0.587 * G + 0.114 * B
The human eye is more sensitive to green (G) → that’s why it gets the largest weight.
There’s also the simple average method: just add up R, G, and B and divide by 3 → (R+G+B)/3(R+G+B)/3(R+G+B)/3.
GrayScale conversion is per-pixel independent, so there’s no need for thread communication or synchronization. From a learner’s perspective, this makes it one of the easiest GPU kernels
1. Base knowledge
🔹 1. What is the interleaved format?
Image data is usually stored per pixel, with R, G, and B values mixed together (interleaved).
Example:
픽셀 0: R0 G0 B0
픽셀 1: R1 G1 B1
픽셀 2: R2 G2 B2
...
So on the GPU, if threadIdx.x == i
, then that thread should grab the R/G/B of the i-th pixel all at once.
The opposite concept is the planar format:
RRRR.... GGGG.... BBBB....
(stored per channel instead of per pixel) → this is common in deep learning frameworks.
For now, though, we only care about the interleaved format.
🔹 2. Data types in CUDA
CUDA provides vector types to handle RGB-like image data:
uchar3: R, G, B (3 bytes)
uchar4: R, G, B, A (4 bytes, includes Alpha)
So, why is uchar4
more efficient?
Because GPU global memory runs fastest when accesses are aligned to 32/64/128 bytes (coalesced).
With uchar3 (3 bytes per pixel), 32 threads reading in parallel end up with misaligned access → inefficient.
With uchar4 (4-byte aligned), it fits perfectly into memory transactions → much faster.
👉 That’s why in practice, even plain RGB images often get padded to RGBA and handled as uchar4
.
(This part can be confusing, so I’ll go deeper into it later at the end.)
🔹 3. Output (Grayscale buffer)
The output is 1-channel grayscale:
Each pixel = 1 byte (
uint8_t
)Formula:
gray[i]=0.299R+0.587G+0.114Bgray[i] = 0.299R + 0.587G + 0.114Bgray[i]=0.299R+0.587G+0.114B
🔹4. Example: Input RGBA (uchar4
) → Output Gray (1 channel)
#include <cuda_runtime.h>
#include <stdint.h>
__device__ __forceinline__ uint8_t rgb_to_gray_u8(uint8_t r, uint8_t g, uint8_t b) {
// (77*R + 150*G + 29*B) / 256
return static_cast<uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}
__global__ void rgba_to_gray_kernel(const uchar4* __restrict__ in,
uint8_t* __restrict__ out,
int width, int height)
{
int x = blockIdx.x * blockDim.x + threadIdx.x; // col
int y = blockIdx.y * blockDim.y + threadIdx.y; // row
if (x >= width || y >= height) return;
int idx = y * width + x; // row-major
uchar4 p = in[idx]; // coalesced load (4B aligned)
out[idx] = rgb_to_gray_u8(p.x, p.y, p.z); // x=R, y=G, z=B, w=A
}
2. STB Library
Want to throw an image onto the GPU and play with it?
The very first thing you need to grab is the STB library.
Heavy installs? Complicated builds?
Nope. None of that. Just drop stb_image.h
and stb_image_write.h
into your project folder and you’re done.
Loading and saving? STB takes care of it all.
Popular STB header files (the ones you’ll use the most):
stb_image.h
→ image loader (PNG/JPG/BMP/PSD/TGA/GIF/HDR/… → load as RGBA, etc.)stb_image_write.h
→ image writer (PNG/JPG/BMP/TGA/HDR)stb_truetype.h
→ font renderingstb_rect_pack.h
→ rectangle packing (texture atlas)stb_vorbis.c
→ audio decoder (this one is a.c
file)and a bunch of other handy utilities.
📥 How to get it
Download directly from GitHub
wget https://raw.githubusercontent.com/nothings/stb/master/stb_image.h wget https://raw.githubusercontent.com/nothings/stb/master/stb_image_write.h
3. Hands-on
// grayscale.cu
// Build: nvcc -O2 grayscale.cu -o grayscale
// Usage: ./grayscale input.jpg out.png
#define STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image.h"
#include "stb_image_write.h"
#include <cuda_runtime.h>
#include <stdint.h>
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cmath>
#define CUDA_CHECK(expr) do { \
cudaError_t _e = (expr); \
if (_e != cudaSuccess) { \
fprintf(stderr, "CUDA error %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(_e)); \
std::exit(1); \
} \
} while(0)
__device__ __forceinline__ uint8_t rgb_to_gray_u8(uint8_t r, uint8_t g, uint8_t b) {
// BT.601 integer approximation: (77R + 150G + 29B) / 256
return static_cast<uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}
__global__ void rgba_to_gray_kernel(const uchar4* __restrict__ in,
uint8_t* __restrict__ out,
int width, int height)
{
int x = blockIdx.x * blockDim.x + threadIdx.x; // column
int y = blockIdx.y * blockDim.y + threadIdx.y; // row
if (x >= width || y >= height) return;
int idx = y * width + x; // row-major indexing
uchar4 p = in[idx]; // coalesced load (4B aligned)
out[idx] = rgb_to_gray_u8(p.x, p.y, p.z); // x=R, y=G, z=B, w=A
}
// CPU reference (same integer approximation formula)
static inline uint8_t rgb_to_gray_u8_cpu(uint8_t r, uint8_t g, uint8_t b) {
return static_cast<uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}
int main(int argc, char** argv) {
if (argc < 3) {
printf("Usage: %s <input_image> <output_png>\n", argv[0]);
return 0;
}
const char* in_path = argv[1];
const char* out_path = argv[2];
int w = 0, h = 0, ch = 0;
// force RGBA (4 channels)
unsigned char* img = stbi_load(in_path, &w, &h, &ch, 4);
if (!img) {
fprintf(stderr, "Failed to load image: %s\n", in_path);
return 1;
}
size_t num_pixels = static_cast<size_t>(w) * h;
size_t in_bytes = num_pixels * 4; // RGBA
size_t out_bytes = num_pixels; // Gray (1 channel)
// Device allocation
uchar4* d_in = nullptr;
uint8_t* d_out = nullptr;
CUDA_CHECK(cudaMalloc(&d_in, in_bytes));
CUDA_CHECK(cudaMalloc(&d_out, out_bytes));
// Host to Device copy
CUDA_CHECK(cudaMemcpy(d_in, img, in_bytes, cudaMemcpyHostToDevice));
// Grid/Block configuration
dim3 block(32, 8);
dim3 grid((w + block.x - 1) / block.x,
(h + block.y - 1) / block.y);
// Warm-up (to eliminate context/JIT overhead)
CUDA_CHECK(cudaFree(0));
// Timing setup
cudaEvent_t start, stop;
CUDA_CHECK(cudaEventCreate(&start));
CUDA_CHECK(cudaEventCreate(&stop));
CUDA_CHECK(cudaEventRecord(start));
// Kernel launch
rgba_to_gray_kernel<<<grid, block>>>(d_in, d_out, w, h);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
float ms = 0.f;
CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
printf("Kernel time: %.3f ms (w=%d, h=%d)\n", ms, w, h);
// Device to Host copy
std::vector<unsigned char> host_out(out_bytes);
CUDA_CHECK(cudaMemcpy(host_out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost));
// (Optional) accuracy check: compare against CPU result (expect max_diff=0)
size_t mismatches = 0;
uint8_t max_diff = 0;
for (size_t i = 0; i < num_pixels; ++i) {
uint8_t r = img[4*i + 0];
uint8_t g = img[4*i + 1];
uint8_t b = img[4*i + 2];
uint8_t y_cpu = rgb_to_gray_u8_cpu(r, g, b);
uint8_t y_gpu = host_out[i];
uint8_t diff = (y_cpu > y_gpu) ? (y_cpu - y_gpu) : (y_gpu - y_cpu);
if (diff != 0) { ++mismatches; max_diff = std::max(max_diff, diff); }
}
printf("Check: mismatches=%zu, max_diff=%u\n", mismatches, (unsigned)max_diff);
// Save PNG (1-channel grayscale)
if (!stbi_write_png(out_path, w, h, 1, host_out.data(), w)) {
fprintf(stderr, "Failed to write image: %s\n", out_path);
} else {
printf("Saved: %s\n", out_path);
}
// Cleanup
CUDA_CHECK(cudaFree(d_in));
CUDA_CHECK(cudaFree(d_out));
stbi_image_free(img);
CUDA_CHECK(cudaEventDestroy(start));
CUDA_CHECK(cudaEventDestroy(stop));
return 0;
}
That turned out way cooler than I expected.
4. Important: Why is uchar4
more efficient?
A GPU runs fastest when multiple threads access contiguous memory at the same time.
This is called memory coalescing, and GPUs handle data most efficiently when it’s aligned to units like 32, 64, or 128 bytes.
uchar4
fits perfectly into 4-byte alignment.
uchar4
is made up of 4 bytes.When 32 threads each read one
uchar4
, that’s32 × 4 = 128 bytes
.And 128 bytes is exactly the “sweet spot” size a GPU can process in a single transaction.
It’s like moving a box of neatly packed items in one go — the GPU grabs the data as a clean block, so it’s fast.
uchar3
breaks the alignment.
uchar3
is 3 bytes.When 32 threads each read one
uchar3
, that’s32 × 3 = 96 bytes
.96 bytes doesn’t line up with the GPU’s efficient transaction size.
The GPU can’t fetch it all at once, so it splits the work into multiple loads — slower and inefficient.
It’s like trying to carry loose items that aren’t boxed — you have to make multiple trips.
Analogy: Grocery shopping 🛒
Think of it this way:
uchar4
is like instant noodles packed in bundles of 4. At the checkout (GPU), the cashier can scan the 4-pack in one go. If 32 customers all bring 4-packs, that’s 128 noodles scanned super fast in a single sweep.uchar3
is like awkward 3-packs. The cashier is optimized for 4-packs, so those 3-packs need to be scanned in parts. If 32 customers bring 3-packs, the cashier has to break it up into multiple scans — much slower.
Subscribe to my newsletter
Read articles from 박서경 directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
