Navigating the Uncertain Evolution of AI Compilers


In recent years, the landscape of artificial intelligence has been dramatically transformed by advancements in deep learning and the underlying computational frameworks. As AI models grow increasingly complex, the demand for efficient and powerful compilers has surged, marking what some experts call the "Golden Age of Compilers." This article delves into the evolution of AI compilers, drawing insights from the Modular blog series "Democratizing AI Compute." We will explore the challenges and opportunities faced by software developers, particularly AI backend engineers, in navigating this rapidly evolving field. From the dominance of Nvidia's CUDA to the emerging alternatives, we aim to provide a comprehensive overview of the current state and future directions of AI compiler technology.
The Golden Age of Compilers
This section's title is borrowed from Chris Lattner's 2021 keynote talk. He observes a trend among tech companies: substantial investment in compiler design to meet the demands of Deep Learning computation. The distinct nature of DL models compared to traditional compiler optimization targets gives researchers ample opportunity to develop innovative solutions. However, compiler design requires close collaboration between software and hardware. One of the major challenges in AI compiler design is generating optimal low-level code for diverse backend hardware, including TPUs, NPUs, and GPUs. Among these, Nvidia CUDA plays a crucial role due to its dominant status in the software ecosystem and the accessibility of GPUs to ordinary developers, compared to proprietary hardware like the TPUs used by tech giants. Thus, it is hard for any new tool to build an ecosystem unless it integrates seamlessly with Nvidia CUDA; that integration is the key to adoption.
Breaking CUDA’s Moat
Over time, CUDA has become a vast, layered platform. It encompasses every aspect of the GPU programming ecosystem, from low-level parallel programming models and mid-level libraries like cuDNN to high-level serving tools like Triton Inference Server and NIM. The monopoly of Nvidia's software stack affects both other high-performance computing vendors and AI engineers.
The main idea of the "Democratizing AI Compute" series is to explore ways to reduce CUDA's dominance from a corporate viewpoint. The series walks readers through the efforts of tech giants, including OpenCL, TVM, Triton, and others. What about us? As AI engineers, do we benefit from Nvidia's monopoly? I believe the answer depends on our role in the ecosystem. For those who use CUDA as a library, the main issue is often version compatibility, and Nvidia addresses it with all-in-one solutions like NIM. Engineers focused on performance and portability, however, spend most of their time programming in the CUDA language, so the biggest challenge becomes the language itself. CUDA's aging design gradually pushes developers toward workarounds; a well-known example is DeepSeek dropping below CUDA C++ to program directly in PTX. Writing PTX directly sacrifices even more portability and deepens vendor lock-in. Developers also have less and less control at the software level to exploit the latest GPU features, such as the Tensor Memory Accelerator (TMA). Another major limitation is that writing CUDA fundamentally requires C++, which forces programmers working on DL frameworks to switch back and forth between Python and C++ and to bridge the mindset gap between the two languages.
In short, there is an ongoing conflict between Nvidia and developers who want more portability, as well as companies trying to compete in the HPC market.
Nvidia’s Fight Back
Nvidia has also recognized this significant issue at the language level. Recently, Nvidia announced native support and full integration of Python, along with other languages like Rust and Julia, in its CUDA toolkit. Developers will be able to write algorithm-style GPU computations directly in Python. Although the software is still a work in progress, readers can get an early look through the cuda-python example code. There are two ways to run a GPU kernel from Python. One is to apply a JIT-compilation decorator to a Python function, as shown below.
@cuda.jit('void(int32[:], int32[:])')
def foo(aryA, aryB):
...
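For context, here is a minimal runnable sketch of the decorator approach, assuming Numba's CUDA JIT, which provides this decorator form; the kernel body and launch configuration are illustrative only, not part of the original example.
from numba import cuda
import numpy as np

@cuda.jit('void(int32[:], int32[:])')
def foo(aryA, aryB):
    i = cuda.grid(1)              # global thread index
    if i < aryA.size:
        aryB[i] = aryA[i] * 2     # trivial elementwise work

aryA = np.arange(1024, dtype=np.int32)
aryB = np.zeros_like(aryA)
foo[4, 256](aryA, aryB)           # launch 4 blocks of 256 threads; Numba copies the arrays to the device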
Another way is to write traditional CUDA code directly in Python as a string and compile it through the binding interface. The following snippet is adapted from the cuda-python bindings example.
saxpy = """\
extern "C" __global__
void saxpy(float a, float *x, float *y, float *out, size_t n)
{
size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < n) {
out[tid] = a * x[tid] + y[tid];
}
}
"""
# Initialize CUDA Driver API
checkCudaErrors(driver.cuInit(0))
# Retrieve handle for device 0
cuDevice = checkCudaErrors(driver.cuDeviceGet(0))
# Derive target architecture for device 0
major = checkCudaErrors(driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice))
minor = checkCudaErrors(driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice))
arch_arg = bytes(f'--gpu-architecture=compute_{major}{minor}', 'ascii')
# Create program
prog = checkCudaErrors(nvrtc.nvrtcCreateProgram(str.encode(saxpy), b"saxpy.cu", 0, [], []))
# Compile program
opts = [b"--fmad=false", arch_arg]
checkCudaErrors(nvrtc.nvrtcCompileProgram(prog, 2, opts))
# Get PTX from compilation
ptxSize = checkCudaErrors(nvrtc.nvrtcGetPTXSize(prog))
ptx = b" " * ptxSize
checkCudaErrors(nvrtc.nvrtcGetPTX(prog, ptx))
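The snippet stops at PTX generation. Roughly, the next driver-API steps load that PTX as a module and look up the kernel before launching it; the condensed sketch below follows the cuda-python overview example, with memory allocation and the actual cuLaunchKernel call omitted.
import numpy as np

# Create a context on the device and load the compiled PTX as a module
context = checkCudaErrors(driver.cuCtxCreate(0, cuDevice))
ptx = np.char.array(ptx)
module = checkCudaErrors(driver.cuModuleLoadData(ptx.ctypes.data))
# Look up the kernel by its extern "C" name; it can then be launched with
# driver.cuLaunchKernel once device buffers are allocated and the arguments are packed.
kernel = checkCudaErrors(driver.cuModuleGetFunction(module, b"saxpy"))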
Additionally, Nvidia has identified CUDA's limitations in distributed computing and plans to improve support for modern workloads by redesigning CUDA.
What Can We Do?
Based on the criteria used in Modular's comparison and Nvidia's goals for redesigning CUDA, some common themes emerge:
Achieve top performance on the industry leader’s hardware
Foster developer enthusiasm
Allow full programmability
Manage AI computing complexity effectively
Support large-scale applications
Study Python Even Harder
Without a doubt, a deep understanding of Python, the most user-friendly language in the AI community, is essential. Current approaches to AI computing optimization can be classified into two categories.
The first approach builds a new ecosystem that extends Python while still interoperating with the existing Python ecosystem. Mojo is the only player in this area. Mojo embeds CPython into the executable, and if needed, part of the execution path is handled by the embedded interpreter, as shown below.
# mypython.py
import numpy as np

def gen_random_values(size, base):
    # generate a size x size array of random numbers between base and base+1
    random_array = np.random.rand(size, size)
    return random_array + base

# main.mojo
from python import Python

def main():
    Python.add_to_path("path/to/module")
    mypython = Python.import_module("mypython")
    values = mypython.gen_random_values(2, 3)
    print(values)
The second approach, which includes the cuda-python example above, builds an importable compiler inside the existing Python ecosystem. The compiler handles the decorated Python function with its corresponding optimizations. Another typical example is Torch Dynamo, where a regular Python function is decorated with torch.compile, as shown below.
import torch

@torch.compile
def mse(x, y):
    z = (x - y) ** 2
    return z.sum()

x = torch.randn(200)
y = torch.randn(200)
mse(x, y)
The decorated function, in this case mse, is processed by the compiler, which converts the source into low-level execution code for different hardware backends. The key implementation difference is how the compiler receives the source code: Torch Dynamo hooks into the frame evaluation API (eval_frame) defined in PEP 523 and takes bytecode as input, while Triton uses an AST-based JIT and JAX uses a tracing JIT. For more implementation details, refer to "Decorator JITs - Python as a DSL".
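As a quick illustration of the tracing strategy, here is a small sketch using jax.jit (not from the original article): on the first call, JAX replaces the arrays with abstract tracers, records the operations, and compiles the resulting graph with XLA.
import jax
import jax.numpy as jnp

@jax.jit
def mse(x, y):
    # x and y are abstract tracers during tracing; the recorded operations
    # are compiled once and reused for later calls with the same shapes/dtypes.
    z = (x - y) ** 2
    return z.sum()

x = jnp.ones(200)
y = jnp.zeros(200)
print(mse(x, y))  # first call triggers tracing + compilation; subsequent calls hit the cache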
The key is not to pick one side but to learn the optimization techniques from both and, if possible, to join the communities and contribute. Whichever way the conflict is resolved, a few things hold:
The scope here covers only DL model computation, but the same optimization techniques apply to related domains such as streaming systems and databases.
Increasing portability is a spectrum. If your application uses only a subset of GPU features, then the better you understand the underlying mechanisms, the sooner you can benefit from portability once the infrastructure matures.
Some necessary tasks still cannot be handled by the GPU, such as the networking stack and file I/O. Optimizing these areas requires a deep understanding of the OS and the CPU, and focusing on them makes you invaluable no matter which side prevails.
Distributed Applications Are the Future
The trend toward using multiple GPUs across server clusters is unavoidable. Traditionally, distributed computing and GPU programming have been mostly separate disciplines. In the modern era, to support applications like LLMs, the computational load during training and inference is so heavy that a system may require hundreds or even thousands of GPUs. The same issues that distributed computing addresses on the host side, such as the CAP theorem and parallel algorithms, still pose challenges for distributed GPU-accelerated applications. A noticeable trend, however, is that GPU kernels increasingly rely on specialized hardware and software such as NVLink and RDMA to achieve high performance instead of going through the ordinary network stack; DeepSeek's DeepEP is a typical example. I believe the necessary infrastructure here is still incomplete.
Another research direction applies well-known CPU scheduling techniques from textbooks to deep learning training at a higher level. Instead of scheduling instructions, researchers try to parallelize micro-batches across the forward and backward passes; DeepSeek's DualPipe, for example, attempts to improve pipelining and reduce pipeline bubbles at the same time.
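To make the micro-batch idea concrete, here is a toy sketch in plain PyTorch on a single device; it only shows gradient accumulation over micro-batches, whereas a real pipeline scheduler such as 1F1B or DualPipe interleaves these forward and backward passes across stages on different GPUs to hide bubbles.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(32, 64)
y = torch.randn(32, 1)

opt.zero_grad()
# Split the batch into 4 micro-batches; gradients accumulate across them.
# A pipeline scheduler would overlap these passes across pipeline stages.
for mx, my in zip(x.chunk(4), y.chunk(4)):
    loss = loss_fn(model(mx), my) / 4
    loss.backward()
opt.step()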
Conclusion
The AI wave is certainly driving growth in compiler design. In the near future, the current software stack will evolve, whether through Nvidia or the open-source community. As long as Nvidia doesn't resolve its Python-level issues, it will face strong competition on both the software and hardware fronts. Although we developers are outsiders in this race and can't significantly influence its uncertain outcome, there are still many things we can learn to stay competitive.