A response to the "Luajit is wicked fast?" video

Mario Arias

Recently, I watched the video “Luajit is wicked fast?” by Philip Bohun.

Fantastic video, I learned a lot. Please watch it.

This article is by no means a refutation, only a small addendum. I also want to write down some of Philip’s code, as he didn’t link a repository. My repository is here: https://github.com/MarioAriasC/luajit-is-fast

Lua code running in luajit

Philip started with some Lua code, reproduced below (yes, from a still frame):

local function image_ramp_green(n)
    local img = {}
    local f = 255 / (n - 1)
    for i = 1, n do
        img[i] = { red = 0, green = (i - 1) * f, blue = 0, alpha = 255 }
    end
    return img
end

local function image_to_gray(img, n)
    for i = 1, n do
        local y = 0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue
        img[i].red = y
        img[i].blue = y
        img[i].green = y
    end
end

local N = 400 * 400
local img = image_ramp_green(N)
for i = 1, 1000 do
    image_to_gray(img, N)
end

Some pixel manipulation: ramping green, then fading to grey, repeated a thousand times, just to keep the processor warm and cosy. Nothing fancy.

On my machine, using luajit, it takes 1.303 s ± 0.056 s.

❯ hyperfine -w 3 'luajit img.lua'
Benchmark 1: luajit img.lua
  Time (mean ± σ):      1.303 s ±  0.056 s    [User: 1.286 s, System: 0.016 s]
  Range (min … max):    1.245 s …  1.367 s    10 runs

Then, Philip introduces a feature that I had never heard of: the luajit FFI library. Instead of using Lua tables, we can use a C struct.

local ffi = require("ffi")

ffi.cdef([[
typedef struct { uint8_t red, green, blue, alpha;} rgba_pixel;
]])

local function image_ramp_green(n)
    local img = ffi.new("rgba_pixel[?]", n)
    local f = 255 / (n - 1)
    -- FFI arrays are zero-based: valid indices run from 0 to n - 1
    for i = 0, n - 1 do
        img[i].green = i * f
        img[i].alpha = 255
    end
    return img
end

local function image_to_gray(img, n)
    for i = 0, n - 1 do
        local y = 0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue
        img[i].red = y
        img[i].blue = y
        img[i].green = y
    end
end

local N = 400 * 400
local img = image_ramp_green(N)
for i = 1, 1000 do
    image_to_gray(img, N)
end

And it is indeed wicked fast.

On my machine, it takes 346.3 ms ± 1.5 ms. Amazing.

❯ hyperfine -w 3 'luajit img_with_ffi.lua'
Benchmark 1: luajit img_with_ffi.lua
  Time (mean ± σ):     346.3 ms ±   1.5 ms    [User: 344.3 ms, System: 1.8 ms]
  Range (min … max):   344.2 ms … 348.6 ms    10 runs
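
One plausible reason for the gap (my assumption; the video may explain it differently) is memory layout. The plain Lua version builds a separate table for every pixel, while the FFI version stores the whole image as one contiguous array of small structs. A minimal sketch in C of that layout, where `image_bytes` is a name I made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Same fields as the rgba_pixel declared through ffi.cdef above. */
typedef struct {
  uint8_t red, green, blue, alpha;
} rgba_pixel;

/* Bytes needed to store n pixels as one contiguous array
   (hypothetical helper, only here to make the size explicit). */
size_t image_bytes(size_t n) {
  return n * sizeof(rgba_pixel);
}
```

With four `uint8_t` fields and no padding, each pixel takes 4 bytes, so `image_bytes(400 * 400)` comes to 640000 bytes in a single block, compared with 160,000 individual tables in the first version.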

Philip claims that it is faster than the equivalent C program:

#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint8_t red;
  uint8_t green;
  uint8_t blue;
  uint8_t alpha;
} rgba_pixel;

rgba_pixel *image_ramp_green(int n) {
  rgba_pixel *img = calloc(n, sizeof(rgba_pixel));
  float f = 255.0 / (float)(n - 1);
  for (int i = 0; i < n; i++) {
    img[i].green = (int)((float)i * f);
    img[i].alpha = 255;
  }
  return img;
}

void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    float y = (0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
               0.11 * (float)img[i].blue);
    img[i].red = (int)y;
    img[i].green = (int)y;
    img[i].blue = (int)y;
  }
}

int main(void) {
  int n = 400 * 400;
  rgba_pixel *img = image_ramp_green(n);
  for (int i = 0; i < 1000; i++) {
    image_to_grey(img, n);
  }
  return EXIT_SUCCESS;
}

After compiling with GCC and Clang, it turns out that, yes, luajit with FFI is faster than this C version.

On my machine, the Clang-compiled binary runs in 375.1 ms ± 1.4 ms and the GCC-compiled binary in 369.2 ms ± 1.1 ms.

❯ hyperfine -w 3 ./clangimg-mc
Benchmark 1: ./clangimg-mc
  Time (mean ± σ):     375.1 ms ±   1.4 ms    [User: 373.9 ms, System: 1.2 ms]
  Range (min … max):   373.4 ms … 378.0 ms    10 runs

❯ hyperfine -w 3 ./gccimg-mc
Benchmark 1: ./gccimg-mc
  Time (mean ± σ):     369.2 ms ±   1.1 ms    [User: 367.5 ms, System: 1.5 ms]
  Range (min … max):   367.9 ms … 371.3 ms    10 runs

I’m not a C expert, but part of my job is performance engineering for JVM applications, and I can identify some tweaks here and there.

One tweak we can implement: instead of casting y to int three times, once per assignment, we can cast once, at the declaration.

// original version
void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    float y = (0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
               0.11 * (float)img[i].blue);
    img[i].red = (int)y;
    img[i].green = (int)y;
    img[i].blue = (int)y;
  }
}

// new version
void image_to_grey(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    int y = (int)(0.3 * (float)img[i].red + 0.59 * (float)img[i].green +
                  0.11 * (float)img[i].blue);
    img[i].red = y;
    img[i].green = y;
    img[i].blue = y;
  }
}
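
The two versions are not guaranteed to agree bit for bit. The literals 0.3 and 0.59 are doubles, so the sum is computed in double; the original version then rounds it to float before truncating, while the new version truncates the double directly. A small sketch that measures how far apart the two can get on grey inputs (the function names are mine, and I use the 0.11 blue weight from the Lua version):

```c
#include <stdint.h>

/* Original shape: the double sum is rounded to float, then truncated. */
int gray_via_float(uint8_t r, uint8_t g, uint8_t b) {
  float y = (0.3 * (float)r + 0.59 * (float)g + 0.11 * (float)b);
  return (int)y;
}

/* Tweaked shape: the double sum is truncated directly. */
int gray_direct(uint8_t r, uint8_t g, uint8_t b) {
  return (int)(0.3 * (float)r + 0.59 * (float)g + 0.11 * (float)b);
}

/* Largest absolute difference between the two over all grey inputs
   (red == green == blue), which is what the benchmark feeds the loop
   after the first pass. */
int max_difference(void) {
  int worst = 0;
  for (int v = 0; v < 256; v++) {
    uint8_t c = (uint8_t)v;
    int d = gray_via_float(c, c, c) - gray_direct(c, c, c);
    if (d < 0) d = -d;
    if (d > worst) worst = d;
  }
  return worst;
}
```

Because rounding to float moves the value by far less than one, the two results can differ by at most 1. That is harmless for a benchmark, but worth knowing if the output pixels matter.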

This change puts the C code within the margin of error of luajit.

The Clang build runs in 348.6 ms ± 1.1 ms and the GCC build in 348.9 ms ± 1.1 ms. I’m sure that some C wizards can optimise this code even further.

❯ hyperfine -w 3 ./clangimg
Benchmark 1: ./clangimg
  Time (mean ± σ):     348.6 ms ±   1.1 ms    [User: 346.4 ms, System: 2.0 ms]
  Range (min … max):   347.0 ms … 349.9 ms    10 runs

❯ hyperfine -w 3 ./gccimg
Benchmark 1: ./gccimg
  Time (mean ± σ):     348.9 ms ±   1.1 ms    [User: 346.8 ms, System: 1.9 ms]
  Range (min … max):   347.7 ms … 350.9 ms    10 runs
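
As one example of the kind of trick a C wizard might reach for, here is a sketch that replaces the floating-point arithmetic with fixed-point integers. I have not benchmarked it, so treat it as an illustration rather than a result. The names `luma_fixed` and `image_to_grey_fixed` are mine, and the weights are 0.3, 0.59 and 0.11 scaled by 256 and rounded:

```c
#include <stdint.h>

typedef struct {
  uint8_t red, green, blue, alpha;
} rgba_pixel;

/* Weights scaled by 256: 77 + 151 + 28 = 256, so the result
   always fits in a byte after shifting right by 8. */
uint8_t luma_fixed(uint8_t r, uint8_t g, uint8_t b) {
  return (uint8_t)((77u * r + 151u * g + 28u * b) >> 8);
}

/* Same loop as image_to_grey, using integer arithmetic only. */
void image_to_grey_fixed(rgba_pixel *img, int n) {
  for (int i = 0; i < n; i++) {
    uint8_t y = luma_fixed(img[i].red, img[i].green, img[i].blue);
    img[i].red = y;
    img[i].green = y;
    img[i].blue = y;
  }
}
```

The rounded weights mean that some pixels may come out one level off from the floating-point version, which is the usual trade-off with this approach.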

Time to test my favourite natively compiled language, Crystal.

struct RGBPixel
  property red, blue, green, alpha

  def initialize(@red : UInt8 = 0, @blue : UInt8 = 0, @green : UInt8 = 0, @alpha : UInt8 = 255)
  end
end

def image_ramp_green(n)
  img = Array.new(n) { RGBPixel.new }
  f = (255/(n - 1)).to_u8
  (0...n).each { |i| img[i].green = (i * f).to_u8 }
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.11 * img[i].blue).to_u8
    img[i].red = y
    img[i].green = y
    img[i].blue = y
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }

Crystal runs in 307.6 ms ± 1.4 ms, the fastest time of all.

❯ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
  Time (mean ± σ):     307.6 ms ±   1.4 ms    [User: 304.6 ms, System: 2.7 ms]
  Range (min … max):   306.1 ms … 310.1 ms    10 runs

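What makes Crystal so fast? The Crystal compiler is a fantastic piece of tech; it does a lot of optimisation trickery and uses, effectively, the latest LLVM. One caveat: RGBPixel is a struct, which is a value type in Crystal, so the loop may be writing to copies of the pixels rather than to the array, and doing less work than the C and Lua versions.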
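
The value-type behaviour is easier to see in C, where the copy is explicit. In the Crystal listing, `img[i]` returns a copy of the struct, so a setter called on it does not reach the array element. A minimal sketch of the same distinction (both function names are made up for illustration):

```c
#include <stdint.h>

typedef struct {
  uint8_t red, green, blue, alpha;
} rgba_pixel;

/* Writes to a local copy; the array element is left untouched.
   This mirrors calling a setter on a struct returned by value. */
void set_red_on_copy(rgba_pixel *img, int i, uint8_t y) {
  rgba_pixel copy = img[i];
  copy.red = y;
  (void)copy; /* the modified copy is simply discarded */
}

/* Writes through to the array element itself. */
void set_red_in_place(rgba_pixel *img, int i, uint8_t y) {
  img[i].red = y;
}
```

Only the second function changes the image. A like-for-like Crystal benchmark would need to store the updated pixel back into the array, for example by assigning a new `RGBPixel` to `img[i]`.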

Conclusion

Well, luajit (with FFI) is indeed fast. It seems clear that you can use luajit to glue together C structs and code, and get C-like performance from a friendlier language, without a separate compilation step.

Other Lua-inspired or Lua-adjacent languages, like Nelua or Terra, require compilation (both languages look very interesting).

On the other hand, I’ll stay with Crystal, but it’s very cool to see what other languages and communities are working on.
