A response to the "Luajit is wicked fast?" video

Recently, I watched the video “Luajit is wicked fast?” by Philip Bohun.
Fantastic video; I learned a lot. Please watch it.
This article is by no means a refutation, only a small addendum. I also want to write down some of Philip’s code, as he didn’t link a repository. My repository is here: https://github.com/MarioAriasC/luajit-is-fast
Lua code running in LuaJIT
Philip started with some Lua code, reproduced below (yes, from a still frame):
local function image_ramp_green(n)
  local img = {}
  local f = 255 / (n - 1)
  for i = 1, n do
    img[i] = { red = 0, green = (i - 1) * f, blue = 0, alpha = 255 }
  end
  return img
end

local function image_to_gray(img, n)
  for i = 1, n do
    local y = 0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue
    img[i].red = y
    img[i].blue = y
    img[i].green = y
  end
end

local N = 400 * 400
local img = image_ramp_green(N)
for i = 1, 1000 do
  image_to_gray(img, N)
end
Some pixel manipulation: ramping green, then fading to grey, repeated a thousand times, just to keep the processor warm and cosy. Nothing fancy.
On my machine, using LuaJIT, it takes 1.303 s ± 0.056 s.
❯ hyperfine -w 3 'luajit img.lua'
Benchmark 1: luajit img.lua
Time (mean ± σ): 1.303 s ± 0.056 s [User: 1.286 s, System: 0.016 s]
Range (min … max): 1.245 s … 1.367 s 10 runs
Then, Philip introduces a feature that I had never heard of: LuaJIT's FFI library. Instead of using Lua tables, we can use a C struct.
local ffi = require("ffi")

ffi.cdef([[
typedef struct { uint8_t red, green, blue, alpha;} rgba_pixel;
]])

local function image_ramp_green(n)
  -- ffi.new zero-fills the array, so red and blue start at 0
  local img = ffi.new("rgba_pixel[?]", n)
  local f = 255 / (n - 1)
  -- FFI arrays are zero-based: valid indices run from 0 to n - 1
  for i = 0, n - 1 do
    img[i].green = i * f
    img[i].alpha = 255
  end
  return img
end

local function image_to_gray(img, n)
  for i = 0, n - 1 do
    local y = 0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue
    img[i].red = y
    img[i].blue = y
    img[i].green = y
  end
end

local N = 400 * 400
local img = image_ramp_green(N)
for i = 1, 1000 do
  image_to_gray(img, N)
end
And it is indeed wicked fast.
On my machine, 346.3 ms ± 1.5 ms, roughly 3.8 times faster than the table version. Amazing.
❯ hyperfine -w 3 'luajit img_with_ffi.lua'
Benchmark 1: luajit img_with_ffi.lua
Time (mean ± σ): 346.3 ms ± 1.5 ms [User: 344.3 ms, System: 1.8 ms]
Range (min … max): 344.2 ms … 348.6 ms 10 runs
Philip claims that it is faster than C.
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint8_t red;
  uint8_t green;
  uint8_t blue;
  uint8_t alpha;
} rgba_pixel;

rgba_pixel *image_ramp_green(int n) {
  rgba_pixel *img = calloc(n, sizeof(rgba_pixel));
  float f = 255.0 / (float)(n - 1);
  for (int i = 0; i < n; i++) {
    img[i].green = (int)((float)i * f);
    img[i].alpha = 255;
  }
  return img;
}

void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    float y = (0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
               0.11 * (float)img[i].blue);
    img[i].red = (int)y;
    img[i].green = (int)y;
    img[i].blue = (int)y;
  }
}

int main(void) {
  int n = 400 * 400;
  rgba_pixel *img = image_ramp_green(n);
  for (int i = 0; i < 1000; i++) {
    image_to_grey(img, n);
  }
  return EXIT_SUCCESS;
}
After compiling with GCC and Clang, it turns out that, yes, LuaJIT with FFI is faster than C.
On my machine, the Clang build runs in 375.1 ms ± 1.4 ms and the GCC build runs in 369.2 ms ± 1.1 ms.
❯ hyperfine -w 3 ./clangimg-mc
Benchmark 1: ./clangimg-mc
Time (mean ± σ): 375.1 ms ± 1.4 ms [User: 373.9 ms, System: 1.2 ms]
Range (min … max): 373.4 ms … 378.0 ms 10 runs
❯ hyperfine -w 3 ./gccimg-mc
Benchmark 1: ./gccimg-mc
Time (mean ± σ): 369.2 ms ± 1.1 ms [User: 367.5 ms, System: 1.5 ms]
Range (min … max): 367.9 ms … 371.3 ms 10 runs
I’m not a C expert, but part of my job is performance engineering for JVM applications, and I can identify some tweaks here and there.
One tweak is to convert the result to an integer once, where y is declared, instead of casting the float on each of the three assignments.
// original version
void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    float y = (0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
               0.11 * (float)img[i].blue);
    img[i].red = (int)y;
    img[i].green = (int)y;
    img[i].blue = (int)y;
  }
}

// new version
void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    int y = (int)(0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
                  0.11 * (float)img[i].blue);
    img[i].red = y;
    img[i].green = y;
    img[i].blue = y;
  }
}
This change brings the C code within the margin of error of LuaJIT.
Clang runs in 348.6 ms ± 1.1 ms and GCC in 348.9 ms ± 1.1 ms. I’m sure that some C wizards can optimise this code even further.
❯ hyperfine -w 3 ./clangimg
Benchmark 1: ./clangimg
Time (mean ± σ): 348.6 ms ± 1.1 ms [User: 346.4 ms, System: 2.0 ms]
Range (min … max): 347.0 ms … 349.9 ms 10 runs
❯ hyperfine -w 3 ./gccimg
Benchmark 1: ./gccimg
Time (mean ± σ): 348.9 ms ± 1.1 ms [User: 346.8 ms, System: 1.9 ms]
Range (min … max): 347.7 ms … 350.9 ms 10 runs
Time to test my favourite natively compiled language, Crystal.
struct RGBPixel
  property red, blue, green, alpha

  def initialize(@red : UInt8 = 0, @blue : UInt8 = 0, @green : UInt8 = 0, @alpha : UInt8 = 255)
  end
end

def image_ramp_green(n)
  img = Array.new(n) { RGBPixel.new }
  f = (255/(n - 1)).to_u8
  (0...n).each { |i| img[i].green = (i * f).to_u8 }
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue).to_u8
    img[i].red = y
    img[i].green = y
    img[i].blue = y
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }
Crystal runs in 307.6 ms ± 1.4 ms. The fastest of all.
❯ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
Time (mean ± σ): 307.6 ms ± 1.4 ms [User: 304.6 ms, System: 2.7 ms]
Range (min … max): 306.1 ms … 310.1 ms 10 runs
What makes Crystal so fast? The Crystal compiler is a fantastic piece of tech; it does a lot of optimisation trickery and uses, effectively, the latest LLVM.
Update 9 August 2025
I had the feeling that my Crystal code was too fast and that I was missing something, but my bias towards Crystal got the better of me. It turns out that my code was doing basically nothing. Let me explain.
Structs in Crystal are value types, so img[i] returns a copy of the pixel, and the property updates were applied to that temporary copy, never to the array itself. The only real work left was the overflow checks that to_u8 performs on the float values.
The correct equivalent Crystal version is this one:
record RGBPixel, red : UInt8 = 0, green : UInt8 = 0, blue : UInt8 = 0, alpha : UInt8 = 255

def image_ramp_green(n)
  img = Slice.new(n) { RGBPixel.new }
  f = (255/(n - 1))
  (0...n).each do |i|
    img.update(i, &.copy_with(green: (i * f).to_u8!))
  end
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue).to_u8!
    img.update(i, &.copy_with(red: y, green: y, blue: y))
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }
The macro “record” (yes, Crystal has beautiful macros) creates a struct that includes a copy_with function (similar to a case class in Scala).
The function to_u8! doesn’t execute any overflow check, similar to what a cast in C does.
❯ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
Time (mean ± σ): 348.5 ms ± 2.7 ms [User: 345.2 ms, System: 3.1 ms]
Range (min … max): 345.9 ms … 354.5 ms 10 runs
Crystal runs 0.1 ms faster than Clang… which is within the error margin. So Clang, GCC, LuaJIT with FFI and Crystal all run at the same speed.
Conclusion
Well, LuaJIT (with FFI) is indeed fast. It seems clear that you can use LuaJIT to glue C structs and code together and get C-like performance from a friendlier language, without a separate compilation step.
Other Lua-inspired or Lua-adjacent languages, like Nelua or Terra, require compilation (both languages look very interesting).
On the other hand, I’ll stay with Crystal, but it’s very cool to see what other languages and communities are working on.