Unraveling a Scala type mismatch mystery

Chungmin Lee
4 min read

I have a love-hate relationship with Scala. It's expressive and concise enough to make writing code for abstract concepts genuinely fun. However, some parts of the language feel annoying and weird, and the tooling is horrible once you step outside the basics, whether you want to customize the build process or need to troubleshoot when something goes wrong. It's a fun language to play with, but less so when it comes to building large software like Apache Spark.

That being said, the issue I want to share in this post is technically not Scala's fault, although it stems from the language's complexity and its use of annotations as one of its core mechanisms.

Let's look at this Scala code, which I prepared to demonstrate the issue:

// Main.scala
import io.glutenproject.memory.arrowalloc.ArrowBufferAllocators
import io.glutenproject.utils.ArrowAbiUtil
import org.apache.arrow.c.ArrowArray
import org.apache.arrow.c.ArrowSchema
import org.apache.spark.sql.vectorized.ColumnarBatch

object Main {
  def main(args: Array[String]): Unit = {
    val allocator = ArrowBufferAllocators.contextInstance()
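    // Nulls are fine here: this example only needs to compile, not run.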
    val batch: ColumnarBatch = null
    val cSchema: ArrowSchema = null
    val cArray: ArrowArray = null
    ArrowAbiUtil.exportFromSparkColumnarBatch(allocator, batch, cSchema, cArray)
  }
}

This calls an actual method from Gluten. You can try it yourself with sbt and this build.sbt:

scalaVersion := "2.12.17" // assuming Scala 2.12, the default build for Spark 3.4.x

libraryDependencies ++= Seq(
  "io.glutenproject" % "gluten-package" % "1.1.0",
  "org.apache.spark" %% "spark-sql" % "3.4.3",
)

Running sbt compile will result in this error:

[error] /home/chungmin/example/Main.scala:14:47: type mismatch;
[error]  found   : BufferAllocator (in io.glutenproject.shaded.org.apache.arrow.memory)
[error]  required: BufferAllocator (in org.apache.arrow.memory)
[error]     ArrowAbiUtil.exportFromSparkColumnarBatch(allocator, batch, cSchema, cArray)
[error]                                               ^

If you know what shading is, you can tell this has something to do with it. Indeed, looking at the POM file, you can see that classes from the org.apache.arrow package are relocated to io.glutenproject.shaded.org.apache.arrow.

My understanding of Maven and the Maven Shade Plugin is superficial, so I had roughly assumed that shading works by renaming all references from the old package to the new one in the produced bytecode. But then why does the Scala compiler expect the type to have an unshaded package name?

When this kind of issue happens, I usually check these things:

  • Classpath
  • Actual version of the library used
  • Package manager cache

But even after I tracked down the exact class file being used and decompiled the bytecode, the method signature turned out to be correct, with the shaded type name:

$ javap -v -c -p io/glutenproject/utils/ArrowAbiUtil$.class
...
  public void exportFromSparkColumnarBatch(io.glutenproject.shaded.org.apache.arrow.memory.BufferAllocator, org.apache.spark.sql.vectorized.ColumnarBatch, org.apache.arrow.c.ArrowSchema, org.apache.arrow.c.ArrowArray);
    descriptor: (Lio/glutenproject/shaded/org/apache/arrow/memory/BufferAllocator;Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lorg/apache/arrow/c/ArrowSchema;Lorg/apache/arrow/c/ArrowArray;)V
...

It was then that I remembered that Scala classes carry an annotation called ScalaSignature. Sure enough, the annotation was attached to io.glutenproject.utils.ArrowAbiUtil:

$ javap -v -c -p io/glutenproject/utils/ArrowAbiUtil.class
...
SourceFile: "ArrowAbiUtil.scala"
RuntimeVisibleAnnotations:
  0: #6(#7=s#8)
    scala.reflect.ScalaSignature(
      bytes="\u0006\u0001\u0005\rt!..."
    )
  ScalaSig: length = 0x3 (unknown attribute)
   05 00 00
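
As an aside, note the RuntimeVisibleAnnotations section above: ScalaSignature is retained at runtime, so plain Java reflection can confirm the annotation is there without javap. A minimal sketch, assuming the Gluten jar is on the classpath (the class name CheckSignature is mine):

// CheckSignature.java
import java.lang.annotation.Annotation;

public class CheckSignature {
    public static void main(String[] args) throws Exception {
        // Load the class without initializing it; we only need annotation metadata.
        Class<?> c = Class.forName("io.glutenproject.utils.ArrowAbiUtil",
                false, CheckSignature.class.getClassLoader());
        for (Annotation a : c.getAnnotations()) {
            // Prints scala.reflect.ScalaSignature among the class annotations.
            System.out.println(a.annotationType().getName());
        }
    }
}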

To decode the bytes, I needed to use scalap:

$ scalap -private -verbose io.glutenproject.utils.ArrowAbiUtil
...
  def exportFromSparkColumnarBatch(allocator: org.apache.arrow.memory.BufferAllocator, columnarBatch: org.apache.spark.sql.vectorized.ColumnarBatch, cSchema: org.apache.arrow.c.ArrowSchema, cArray: org.apache.arrow.c.ArrowArray): scala.Unit = { /* compiled code */ }
...

Voilà! The signature has the unshaded type name org.apache.arrow.memory.BufferAllocator. This makes sense: the Maven Shade Plugin doesn't know anything about Scala or ScalaSignature; it works purely at the bytecode level, so the pickled Scala signature is left untouched. The error shows that the Scala compiler checks types using the ScalaSignature rather than the information in the bytecode.

To verify that this is the case, we can write equivalent Java code:

// Main.java
import io.glutenproject.memory.arrowalloc.ArrowBufferAllocators;
import io.glutenproject.utils.ArrowAbiUtil;
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class Main {
    public static void main(String[] args) {
        var allocator = ArrowBufferAllocators.contextInstance();
        ColumnarBatch batch = null;
        ArrowSchema cSchema = null;
        ArrowArray cArray = null;
        ArrowAbiUtil.exportFromSparkColumnarBatch(allocator, batch, cSchema, cArray);
    }
}

It compiles fine: javac sees only the bytecode signatures, which the Shade Plugin relocated correctly, so the shaded types match.

Knowing the root cause, we can work around the issue by writing a wrapper in Java that calls ArrowAbiUtil.exportFromSparkColumnarBatch, and using the wrapper from Scala instead of the original method.
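
Here is a minimal sketch of such a wrapper. The class name ArrowAbiUtilExport is my own invention; the shaded import path comes straight from the compiler error above:

// ArrowAbiUtilExport.java
import io.glutenproject.shaded.org.apache.arrow.memory.BufferAllocator;
import io.glutenproject.utils.ArrowAbiUtil;
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public final class ArrowAbiUtilExport {
    private ArrowAbiUtilExport() {}

    // Delegates to the original method. The parameter types match the
    // relocated bytecode signature, which is all javac looks at.
    public static void exportFromSparkColumnarBatch(
            BufferAllocator allocator,
            ColumnarBatch batch,
            ArrowSchema cSchema,
            ArrowArray cArray) {
        ArrowAbiUtil.exportFromSparkColumnarBatch(allocator, batch, cSchema, cArray);
    }
}

Because the wrapper is plain Java, it carries no ScalaSignature; when Scala code calls it, scalac has only the bytecode signature to go on, and there the shaded types line up.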

A more fundamental fix belongs on the Gluten side: avoid exposing shaded types in public APIs. Although it's also possible that ArrowAbiUtil is simply not meant to be public.

Key takeaway: beware that shading can produce bytecode that is not fully compatible with Scala! Avoid shading types that appear in public method signatures, as those methods might not be usable from Scala.
