Recently, I watched the video “Luajit is wicked fast?” by Philip Bohun.

https://www.youtube.com/watch?v=gS8Wji_YnAE

Fantastic video, I learned a lot. Please watch it.

This article is by no means a refutation, only a small addendum. I also want to write down some of Philip’s code, as he didn’t link a repository. My repository is here: https://github.com/MarioAriasC/luajit-is-fast

Lua code running in LuaJIT

Philip started with some Lua code, reproduced below (yes, from a still frame):

local function image_ramp_green(n)
    local img = {}
    local f = 255 / (n - 1)
    for i = 1, n do
        img[i] = { red = 0, green = (i - 1) * f, blue = 0, alpha = 255 }
    end
    return img
end

local function image_to_gray(img, n)
    for i = 1, n do
        local y = 0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue
        img[i].red = y
        img[i].blue = y
        img[i].green = y
    end
end

local N = 400 * 400
local img = image_ramp_green(N)
for i = 1, 1000 do
    image_to_gray(img, N)
end

Some pixel manipulation: ramp the green channel, convert to grey, and repeat 1,000 times, just to keep the processor warm and cosy. Nothing fancy.

On my machine, LuaJIT takes 1.303 s ± 0.056 s.

❯ hyperfine -w 3 'luajit img.lua'
Benchmark 1: luajit img.lua
  Time (mean ± σ):      1.303 s ±  0.056 s    [User: 1.286 s, System: 0.016 s]
  Range (min … max):    1.245 s …  1.367 s    10 runs

Then, Philip introduces a feature that I had never heard of: the LuaJIT FFI library. Instead of using Lua tables, we can use a C struct.

local ffi = require("ffi")

ffi.cdef([[
typedef struct { uint8_t red, green, blue, alpha;} rgba_pixel;
]])

local function image_ramp_green(n)
    -- ffi.new zero-fills the array, so red and blue start at 0
    local img = ffi.new("rgba_pixel[?]", n)
    local f = 255 / (n - 1)
    -- cdata arrays are zero-based: valid indices are 0 .. n - 1
    for i = 0, n - 1 do
        img[i].green = i * f
        img[i].alpha = 255
    end
    return img
end

local function image_to_gray(img, n)
    for i = 0, n - 1 do
        local y = 0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue
        img[i].red = y
        img[i].blue = y
        img[i].green = y
    end
end

local N = 400 * 400
local img = image_ramp_green(N)
for i = 1, 1000 do
    image_to_gray(img, N)
end

And it is indeed wicked fast.

On my machine, 346.3 ms ± 1.5 ms. Amazing

❯ hyperfine -w 3 'luajit img_with_ffi.lua'
Benchmark 1: luajit img_with_ffi.lua
  Time (mean ± σ):     346.3 ms ±   1.5 ms    [User: 344.3 ms, System: 1.8 ms]
  Range (min … max):   344.2 ms … 348.6 ms    10 runs

Philip claims that it is faster than C.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint8_t red;
  uint8_t green;
  uint8_t blue;
  uint8_t alpha;
} rgba_pixel;

rgba_pixel *image_ramp_green(int n) {
  rgba_pixel *img = calloc(n, sizeof(rgba_pixel));
  float f = 255.0 / (float)(n - 1);
  for (int i = 0; i < n; i++) {
    img[i].green = (int)((float)i * f);
    img[i].alpha = 255;
  }
  return img;
}

void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    float y = (0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
               0.11 * (float)img[i].blue);
    img[i].red = (int)y;
    img[i].green = (int)y;
    img[i].blue = (int)y;
  }
}

int main(void) {
  int n = 400 * 400;
  rgba_pixel *img = image_ramp_green(n);
  for (int i = 0; i < 1000; i++) {
    image_to_grey(img, n);
  }
  return EXIT_SUCCESS;
}

After compiling with GCC and Clang, it turns out that, yes, LuaJIT with FFI is faster than this C code.

On my machine, the Clang build runs in 375.1 ms ± 1.4 ms and the GCC build in 369.2 ms ± 1.1 ms.

❯ hyperfine -w 3 ./clangimg-mc
Benchmark 1: ./clangimg-mc
  Time (mean ± σ):     375.1 ms ±   1.4 ms    [User: 373.9 ms, System: 1.2 ms]
  Range (min … max):   373.4 ms … 378.0 ms    10 runs

❯ hyperfine -w 3 ./gccimg-mc
Benchmark 1: ./gccimg-mc
  Time (mean ± σ):     369.2 ms ±   1.1 ms    [User: 367.5 ms, System: 1.5 ms]
  Range (min … max):   367.9 ms … 371.3 ms    10 runs

I’m not a C expert, but part of my job is performance engineering for JVM applications, and I can identify some tweaks here and there.

One tweak is to replace the repeated casts with a single cast in the declaration.

// original version: y is a float and is cast to int at every assignment
void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    float y = (0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
               0.11 * (float)img[i].blue);
    img[i].red = (int)y;
    img[i].green = (int)y;
    img[i].blue = (int)y;
  }
}

// new version: the sum is cast to int once, in the declaration
void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    int y = (int)(0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
                  0.11 * (float)img[i].blue);
    img[i].red = y;
    img[i].green = y;
    img[i].blue = y;
  }
}

This change brings the C code within the margin of error of LuaJIT.

Clang runs in 348.6 ms ± 1.1 ms and GCC in 348.9 ms ± 1.1 ms. I’m sure that some C wizards can optimise this code even further.

❯ hyperfine -w 3 ./clangimg
Benchmark 1: ./clangimg
  Time (mean ± σ):     348.6 ms ±   1.1 ms    [User: 346.4 ms, System: 2.0 ms]
  Range (min … max):   347.0 ms … 349.9 ms    10 runs

❯ hyperfine -w 3 ./gccimg
Benchmark 1: ./gccimg
  Time (mean ± σ):     348.9 ms ±   1.1 ms    [User: 346.8 ms, System: 1.9 ms]
  Range (min … max):   347.7 ms … 350.9 ms    10 runs

Time to test my favourite natively compiled language, Crystal.

struct RGBPixel
  property red, blue, green, alpha

  def initialize(@red : UInt8 = 0, @blue : UInt8 = 0, @green : UInt8 = 0, @alpha : UInt8 = 255)
  end
end

def image_ramp_green(n)
  img = Array.new(n) { RGBPixel.new }
  f = (255/(n - 1)).to_u8
  (0...n).each { |i| img[i].green = (i * f).to_u8 }
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue).to_u8
    img[i].red = y
    img[i].green = y
    img[i].blue = y
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }

~~Crystal runs in 307.6 ms ± 1.4 ms. The fastest of all.~~

❯ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
  Time (mean ± σ):     307.6 ms ±   1.4 ms    [User: 304.6 ms, System: 2.7 ms]
  Range (min … max):   306.1 ms … 310.1 ms    10 runs

~~What makes Crystal so fast? The Crystal compiler is a fantastic piece of tech; it does a lot of optimisation trickery and uses, effectively, the latest LLVM.~~

Update 9 August 2025

I had the feeling that my Crystal code was too fast and I was missing something, but my bias towards Crystal got the best of me. It turns out that my code was doing basically nothing. Let me explain.

Structs in Crystal are value types, so img[i] returns a copy of the element. The property assignments only changed that temporary copy, never the array itself, so the loop did nothing apart from the overflow checks performed by to_u8.

The correct equivalent version is this one:

record RGBPixel, red : UInt8 = 0, green : UInt8 = 0, blue : UInt8 = 0, alpha : UInt8 = 255

def image_ramp_green(n)
  img = Slice.new(n) { RGBPixel.new }
  f = (255/(n - 1))
  (0...n).each do |i|
    img.update(i, &.copy_with(green: (i * f).to_u8!))
  end
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue).to_u8!
    img.update(i, &.copy_with(red: y, green: y, blue: y))
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }

The “record” macro (yes, Crystal has beautiful macros) creates a struct that includes a copy_with method (similar to a case class in Scala).

The to_u8! method doesn’t perform any overflow check, similar to a cast in C.

❯ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
  Time (mean ± σ):     348.5 ms ±   2.7 ms    [User: 345.2 ms, System: 3.1 ms]
  Range (min … max):   345.9 ms … 354.5 ms    10 runs

Crystal runs 0.1 ms faster than Clang, which is within the margin of error. So Clang, GCC, LuaJIT with FFI and Crystal all run at the same speed.

Conclusion

Well, LuaJIT (with FFI) is indeed fast. It seems clear that you can use LuaJIT to glue C structs and code together and get C-like performance in a friendlier language, without a separate compilation step.

Other Lua-inspired or Lua-adjacent languages, like Nelua or Terra, require compilation (both languages look very interesting).

On the other hand, I’ll stay with Crystal, but it’s very cool to see what other languages and communities are working on.

A response to the "Luajit is wicked fast?" video


Mario Arias
