I covered our migration to Scala 3 extensively in my last blog post, but we deployed the migrated code to production only a week ago. Such releases are always a little stressful because you never know what kind of unexpected bug can occur, and on top of that, we usually get a heavy load right away (from 0 to 100k concurrent users within 5 minutes of the launch). To our surprise, everything went well! Metrics looked good, and there was nothing alarming in the hours following the release.

The Problem

About 48 hours later, we received an alert stating that the error rate was higher than usual on one of our servers. I had a quick look at the live profiler and didn’t see anything out of the ordinary. We killed that pod, and the issue was resolved. But a couple of hours later, another pod started to show similar degraded behavior. I checked the profiler again, and this time I looked at the Thread Timeline view that Datadog has, and noticed the following:

This view shows the threads currently running. Threads under the ZScheduler-Worker group are the threads of the ZIO scheduler. What’s weird here is the number: our pods have 16 cores, so they should always have 16 threads running at any given time.

I took a look at the pod that we killed earlier, and it was even worse: only one thread was showing up! At least the degraded performance made sense. I checked the time when the performance became worse and was able to see that it had two threads doing work, and then suddenly only one.

What happened to those threads? Was there an issue with ZIO’s scheduler? Only with Scala 3? We had never seen that before.

The Investigation

We immediately took a thread dump of the degraded pod with two active threads to see what the other threads were doing. The result was very interesting: those 14 threads were all in the WAITING state, with a similar stack trace:

java.lang.Thread.State: WAITING (parking)
 at jdk.internal.misc.Unsafe.park(java.base@23.0.2/Native Method)
 - parking to wait for  <0x0000040c85642938> (a java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(java.base@23.0.2/LockSupport.java:221)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@23.0.2/AbstractQueuedSynchronizer.java:754)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@23.0.2/AbstractQueuedSynchronizer.java:1099)
 at java.util.concurrent.CountDownLatch.await(java.base@23.0.2/CountDownLatch.java:230)
 at <redacted>$lzyINIT2(<redacted>.scala:65)

All those threads were waiting on a CountDownLatch.await call that didn’t come from our code. The important part here is lzyINIT and the line number, which matched a part of our code where a lazy val was defined.

case class AchievementCounter[K1, K2](records: Map[(K1, K2), Long]) {
  lazy val values: Seq[(K1, K2, Long)] =
    records.toSeq.map { case ((key1, key2), count) => (key1, key2, count)}
}

It looked pretty simple and should not have caused any concurrency problems. Since our release contained only the Scala 3 migration and no other changes, I looked at potential differences and remembered that the lazy val encoding had changed compared to Scala 2 (note: this documentation page is obsolete, the actual encoding is not the one documented). Maybe a bug was responsible for this, so I searched the list of Scala 3 issues and, thanks to the very similar stack trace, quickly found this one: Scala 3 lazy vals are not serialization-safe.

What is this issue about?

To understand it, let’s look at how Scala 2 encodes lazy vals. We first create a simple class:

class Foo(bar: Int) {
  lazy val baz: Int = bar + 1
}

Here's a simplified version of the equivalent Java code generated by the Scala 2 compiler.

public class Foo {
    private int baz;
    private final int bar;
    private volatile boolean bitmap$0;

    private int baz$lzycompute() {
        synchronized(this){
            if (!this.bitmap$0) {
                this.baz = this.bar + 1;
                this.bitmap$0 = true;
            }
        }
        return this.baz;
    }

    public int baz() {
        return !this.bitmap$0 ? this.baz$lzycompute() : this.baz;
    }
}

The synchronized block locks the whole instance (this) during initialization, which can lead to deadlocks if two objects have lazy vals whose initializers reference each other and are accessed from different threads. For this reason, a new encoding was introduced in Scala 3. Here’s what the latest version of Scala generates:

public class Foo {
    public static final long OFFSET$0;
    private final int bar;
    private volatile Object baz$lzy1;

    static {
        OFFSET$0 = .MODULE$.getOffsetStatic(Foo.class.getDeclaredField("baz$lzy1"));
    }

    public int baz() {
        Object var1 = this.baz$lzy1;
        if (var1 instanceof Integer) {
            return BoxesRunTime.unboxToInt(var1);
        } else {
            return var1 == scala.runtime.LazyVals.NullValue..MODULE$ ? BoxesRunTime.unboxToInt((Object)null) : BoxesRunTime.unboxToInt(this.baz$lzyINIT1());
        }
    }

    private Object baz$lzyINIT1() {
        while(true) {
            Object var1 = this.baz$lzy1;
            if (var1 == null) {
                if (.MODULE$.objCAS(this, OFFSET$0, (Object)null, scala.runtime.LazyVals.Evaluating..MODULE$)) {
                    Object var2 = null;
                    Object var3 = null;

                    try {
                        var3 = BoxesRunTime.boxToInteger(this.bar + 1);
                        if (var3 == null) {
                            var2 = scala.runtime.LazyVals.NullValue..MODULE$;
                        } else {
                            var2 = var3;
                        }
                    } finally {
                        if (!.MODULE$.objCAS(this, OFFSET$0, scala.runtime.LazyVals.Evaluating..MODULE$, var2)) {
                            LazyVals.Waiting var5 = (LazyVals.Waiting)this.baz$lzy1;
                            .MODULE$.objCAS(this, OFFSET$0, var5, var2);
                            var5.countDown();
                        }

                    }

                    return var3;
                }
            } else {
                if (var1 instanceof LazyVals.LazyValControlState) {
                    if (var1 == scala.runtime.LazyVals.Evaluating..MODULE$) {
                        .MODULE$.objCAS(this, OFFSET$0, var1, new LazyVals.Waiting());
                        continue;
                    }

                    if (var1 instanceof LazyVals.Waiting) {
                        ((LazyVals.Waiting)var1).await();
                        continue;
                    }

                    return null;
                }

                return var1;
            }
        }
    }
}

It is a bit complex, but the important part is that baz$lzy1 contains an object that can be either the actual value of the field once it’s computed, or a value of type LazyValControlState, which is defined as follows:

  sealed trait LazyValControlState extends Serializable

  final class Waiting extends CountDownLatch(1) with LazyValControlState
  object Evaluating extends LazyValControlState
  object NullValue extends LazyValControlState

In other words, our lazy val is represented by a value that is initially null, then becomes Evaluating on the first call, and eventually becomes the actual value of the field. If another thread tries to access the lazy val during evaluation, the value becomes Waiting, and that thread will wait on a latch until evaluation ends. NullValue stands for a computed value that is legitimately null, so that it can be told apart from the initial null meaning “not evaluated yet”.

This brings us to the problem of this encoding with serialization. If you serialize a class while its lazy val is still evaluating (whether it’s Evaluating or Waiting), that “state” value will be embedded in the serialized object. Then if you try to access the field after deserialization, the current thread will start waiting on the latch since that state indicates the field is being evaluated. Except this time it will wait forever because there is no thread actually evaluating the field. The issue happens “randomly” because it’s only a problem if the class is serialized during evaluation of the field: once evaluated, there is no problem since the actual value is serialized.
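To make this failure mode concrete, here is a minimal Java model of it. It is not the real Scala runtime: LazySlotModel, roundTrip, read and complete are hypothetical names, the Evaluating and Waiting states are merged into a single latch, and read gives up after a timeout where the real accessor would block forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of a Scala 3 lazy val slot (hypothetical, not the real runtime classes).
final class LazySlotModel {
    // Stand-in for LazyVals.Waiting: a latch stored in the slot while evaluation runs.
    static final class Waiting extends CountDownLatch {
        Waiting() { super(1); }
    }

    // null = not evaluated, Waiting = being evaluated, anything else = the value.
    final AtomicReference<Object> slot = new AtomicReference<>();

    // Mimics a reflection-based serializer round trip: the slot content is copied
    // as-is, so an in-progress state comes back as a latch nobody will release.
    LazySlotModel roundTrip() {
        LazySlotModel copy = new LazySlotModel();
        Object state = slot.get();
        copy.slot.set(state instanceof Waiting ? new Waiting() : state);
        return copy;
    }

    // Reader side: returns the value, or "TIMEOUT" where the real code would block forever.
    Object read(long timeoutMillis) {
        Object state = slot.get();
        if (state instanceof Waiting) {
            try {
                boolean released = ((Waiting) state).await(timeoutMillis, TimeUnit.MILLISECONDS);
                return released ? slot.get() : "TIMEOUT";
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "TIMEOUT";
            }
        }
        return state;
    }

    // Evaluator side: publish the value, then wake up any waiting readers.
    void complete(Object value) {
        Object state = slot.getAndSet(value);
        if (state instanceof Waiting) ((Waiting) state).countDown();
    }

    public static void main(String[] args) {
        LazySlotModel original = new LazySlotModel();
        original.slot.set(new Waiting());           // evaluation has started
        LazySlotModel copy = original.roundTrip();  // "serialized" mid-evaluation
        original.complete(42);                      // the original finishes normally
        System.out.println(original.read(100));     // prints 42
        System.out.println(copy.read(100));         // prints TIMEOUT: the copy has no evaluator
    }
}
```

The original instance completes as usual, while the copy keeps waiting on a latch that no thread will ever count down. A round trip taken after complete carries the plain value and is unaffected.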

The issue was initially discovered using Java serialization, and a fix was applied to serialize Waiting and Evaluating as null. This is great, but does not work for other serialization methods. In our case, we use a library called Kryo via the Scala library Chill. The solution was to implement a custom Kryo serializer for these two objects (Waiting and Evaluating) to encode them as null as well. I’m sharing the code below showing how to create a custom serializer and register it with Kryo.

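import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import scala.runtime.LazyVals
import scala.runtime.LazyVals.LazyValControlState

// Writes any in-progress lazy val state as null, so that the field is simply
// evaluated again on first access after deserialization.
class LazyValControlStateSerializer extends Serializer[LazyValControlState] {
  // The payload is always the null written below, so this reads back null.
  override def read(kryo: Kryo, input: Input, cls: Class[LazyValControlState]): LazyValControlState = {
    kryo.readClassAndObject(input).asInstanceOf[LazyValControlState]
  }

  override def write(kryo: Kryo, output: Output, obj: LazyValControlState): Unit =
    kryo.writeClassAndObject(output, null)
}

val k: Kryo = ??? // create the Kryo instance
val ser = new LazyValControlStateSerializer
k.register(classOf[LazyVals.Waiting], ser)
k.register(classOf[LazyVals.Evaluating.type], ser)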
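For comparison, the Java serialization fix mentioned earlier can be pictured with the standard writeReplace hook, which lets a class substitute another object (here null) for itself when it is written. The sketch below is plain Java and assumes that this hook is the mechanism behind the fix; InProgress, Holder and roundTrip are hypothetical stand-ins, not the actual Scala runtime classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class WriteReplaceModel {
    // Hypothetical stand-in for an in-progress marker such as Evaluating or Waiting.
    static final class InProgress implements Serializable {
        private static final long serialVersionUID = 1L;
        // Java serialization calls this hook and writes the returned object instead.
        private Object writeReplace() { return null; }
    }

    // Hypothetical stand-in for a class whose lazy val slot holds a marker or the value.
    static final class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        Object slot;
        Holder(Object slot) { this.slot = slot; }
    }

    // Serializes and deserializes a Holder with plain Java serialization.
    static Holder roundTrip(Holder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Holder) in.readObject();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The marker is written as null, so the copy looks "not evaluated yet".
        System.out.println(roundTrip(new Holder(new InProgress())).slot); // prints null
        // A computed value goes through unchanged.
        System.out.println(roundTrip(new Holder(42)).slot);               // prints 42
    }
}
```

Serializers that copy fields directly, without going through Java serialization, never invoke this hook, which is why the same substitution has to be expressed as a dedicated serializer there.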

Takeaways

I believe it’s important to know this behavior if you use any serialization method that relies on runtime reflection. Unfortunately, Chill is currently unmaintained, so the chance of getting it fixed is pretty low unless someone decides to fork it. There’s another Scala library for Kryo, scala-kryo-serialization, where that fix should probably be applied. The good thing is that you can easily apply this fix in your own code since registering a new serializer is part of the public API.

Datadog was once again a lifesaver in helping us detect the thread count issue. We were even able to create a metric from that view so that we could see all affected pods in real time and limit the degradation by killing the right pods until a fix was live.
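A similar signal can also be derived inside the JVM itself. The sketch below is only an illustration and not the Datadog metric described above: ParkedThreadCounter, countParkedOnLatch and demo are hypothetical names, and the "demo-worker" prefix is a placeholder (in a ZIO application one would presumably pass "ZScheduler-Worker"). It counts threads whose stack currently sits in CountDownLatch.await, the same frame that appeared in the thread dump.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;

final class ParkedThreadCounter {
    // Counts live threads whose name starts with namePrefix and that are
    // currently parked inside CountDownLatch.await, like the stuck workers.
    static long countParkedOnLatch(String namePrefix) {
        long count = 0;
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (!thread.getName().startsWith(namePrefix)) continue;
            if (thread.getState() != Thread.State.WAITING) continue;
            for (StackTraceElement frame : entry.getValue()) {
                boolean isLatchAwait =
                    "java.util.concurrent.CountDownLatch".equals(frame.getClassName())
                        && "await".equals(frame.getMethodName());
                if (isLatchAwait) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    // Starts one thread that parks on a latch, counts it, then releases it.
    static long demo() {
        CountDownLatch latch = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo-worker-1");
        stuck.setDaemon(true);
        stuck.start();
        try {
            // Wait until the thread is actually parked before sampling.
            while (stuck.getState() != Thread.State.WAITING) Thread.sleep(10);
            long parked = countParkedOnLatch("demo-worker");
            latch.countDown();
            stuck.join(1000);
            return parked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            latch.countDown();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1: the single parked demo thread
    }
}
```

Exposed as a gauge, a value that stays above zero for scheduler workers would point to the kind of permanently parked threads described in this post. Note that a thread legitimately waiting on a latch for a short time would also be counted, so the value is only meaningful when it persists.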

Finally, I would say it’s quite important to keep track of how the Scala language evolves to be able to connect the dots when such a problem occurs. Knowing how things work under the hood can help you build a better intuition for troubleshooting.

Debugging session #2: Scala 3 lazy vals & serialization

Pierre Ricadat
