Implementing `simplify` for `starts_with` in Apache DataFusion

Jagdish PariharJagdish Parihar
7 min read

Intro about Datafusion

Apache DataFusion is a Rust-native query engine with a powerful optimizer. One key component of its optimizer is expression simplification, often referred to as the simplify function. In this post, we'll walk through how we implemented a simplify rule for the starts_with string function in DataFusion, turning it into an equivalent LIKE pattern. We'll cover a high-level overview of DataFusion's simplify mechanism, why this change was needed, how we implemented it step by step (including basic regex handling and tests), and the benefits it brings to both the DataFusion project and Rust developers.

Note: Portions of this article were drafted with the help of an AI writing assistant, then heavily edited by me for clarity and accuracy.

What is DataFusion's simplify Function?

In DataFusion, simplify is an optimization mechanism that rewrites expressions into simpler or more efficient forms during query planning. This includes things like constant folding (e.g. replacing 1 + 2 with 3) and algebraic rewrites. Each built-in function or expression can provide its own simplify logic. Thanks to DataFusion's well-structured design, these simplification rules are modular – typically implemented as a method on the function's definition – making it easy to add new rules without touching the rest of the codebase. For example, the starts_with function in DataFusion has a fn simplify(...) method where we can plug in custom rewrite logic​. The optimizer automatically applies these rules, so if a function's arguments allow a simplification, DataFusion will use the simpler form in the query plan.

Why Simplify starts_with?

The starts_with(string, prefix) function returns true if string begins with the given prefix. Without simplification, starts_with would be treated as a black-box scalar function in the query plan. By introducing a simplification rule, we can rewrite starts_with(col, 'prefix') into a standard SQL pattern match col LIKE 'prefix%'.

This has big benefits:

SQL engines (and DataFusion) know how to optimize LIKE 'prefix%' patterns for string columns. In particular, converting to LIKE enables predicate pruning based on prefix filters​. In other words, DataFusion can use file metadata (like min/max column values) to skip reading data that can't match the prefix, significantly improving performance for queries that filter on string prefixes.

Thanks to the PR that implements predicate pruning: https://github.com/apache/datafusion/pull/12978

Another benefit is that the logical plan becomes more transparent. A LIKE 'prefix%' is a recognizable operation for developers and other optimizations, whereas a custom function call might be harder to leverage. Overall, simplifying starts_with makes the query plan more efficient and easier to reason about.

Implementing the Simplification for starts_with

Thanks to DataFusion's modular architecture, adding this optimization was straightforward. All the logic for starts_with is encapsulated in its own module, so we only needed to modify that and add tests. Here are the steps we took to implement the simplify rule for starts_with:

  1. Introducing the simplify logic: We added a new simplify method in the starts_with function implementation. This method inspects the function's arguments. If the second argument (the prefix) is a literal string, we can simplify. DataFusion provides an ExprSimplifyResult type to indicate whether an expression was simplified or left unchanged. In our case, when we detect a literal prefix, we'll return a Simplified result with a new expression.

  2. Converting to a LIKE expression (basic regex handling): If the prefix is literal, we construct a new pattern for a SQL LIKE. Essentially, we take the prefix and append a % wildcard to match any suffix. For example, starts_with(name, "Ja") becomes name LIKE "Ja%". We also had to handle any special characters in the prefix. In SQL LIKE patterns, the % and _ characters have special meaning (any sequence of characters and any single character, respectively). We made sure to escape these in the prefix before appending % so that, for instance, a literal prefix "ja%" is treated as "ja\%" in the pattern (meaning the string "ja%") and then becomes "ja\%%" after adding the wildcard​. This way, the semantics of starts_with (which treats the prefix as literal text) are preserved in the LIKE expression.

The core snippet of the implementation looked like this:

    if let Expr::Literal(ScalarValue::Utf8(Some(prefix))) = &args[1] {
        // Found constant prefix, convert to LIKE pattern
        let escaped = prefix.replace("%", "\\%").replace("_", "\\_");
        let pattern = format!("{}%", escaped);
        let like_expr = Expr::Literal(ScalarValue::Utf8(Some(pattern)));
        return Ok(ExprSimplifyResult::Simplified(Expr::Like(Like {
            negated: false,
            expr: Box::new(args[0].clone()),    // the string column
            pattern: Box::new(like_expr),       // the 'prefix%' pattern
            escape_char: None,
            case_insensitive: false,
        }))));
    }
    // If not a literal prefix, no change
    return Ok(ExprSimplifyResult::Original(args));

In practice, we wrote similar logic for all relevant string types (Utf8, LargeUtf8, and the newer Utf8View) since DataFusion supports multiple string array types. The idea is the same: detect literal prefix, build the 'prefix%' pattern, wrap it in a LIKE expression node, and mark the expression as simplified.

  1. Updating tests: With the implementation in place, we updated and added tests to ensure everything works as expected. We wrote unit tests for starts_with to confirm that the function still returns correct boolean results for various cases (e.g. "alphabet", "alph" -> true, "alphabet", "bet" -> false, etc.)​. We paid special attention to edge cases like an empty prefix (which should always yield true as long as the string is not NULL, since every string starts with ""). We also added an integration test using DataFusion's SQL engine to verify the optimization kicks in. For example, using an EXPLAIN query on a filter with starts_with, we should see the plan has been rewritten to use LIKE. After our change, an EXPLAIN SELECT * FROM my_table WHERE starts_with(col, 'f') showed a filter of col LIKE 'f%' in the logical plan, and the physical plan included a pruning predicate based on the prefix​. This confirmed that predicate pushdown was working: DataFusion was able to determine, for instance, that only values between "f" and "g" (non-inclusive of "g") could match f%, and use that to skip irrelevant data. All new tests passed, demonstrating that the simplify rule behaves correctly.

Why DataFusion's Design Made This Easy

Implementing this feature highlighted how well-structured DataFusion is. The codebase cleanly separates concerns: the logic for each function (including execution and optimization rules) lives in one place. We didn't have to tinker with the core optimizer loop or planner; we just implemented the simplify method for starts_with and the existing optimization framework took care of invoking it. DataFusion already had patterns for similar rewrites (for instance, simplifying certain regex patterns to equality checks in the past), so we were able to follow an established approach. This modular design meant less risk of breaking unrelated parts of the system and made the code review process smoother. It’s a testament to DataFusion’s extensibility that a contributor could add such an optimization in a relatively small, focused PR.

Benefits and Impact

This enhancement brings several benefits:

  • Performance Boost: Queries filtering on string prefixes can see significant speed-ups. By converting starts_with(col, "prefix") into col LIKE 'prefix%', DataFusion can apply predicate pruning at the data source level​. In practical terms, if you're querying a large dataset stored in Parquet files, DataFusion will only read the files (or row groups) that could possibly contain values with the given prefix. This reduces I/O and speeds up query execution.

  • Optimized Plans: The logical and physical plans become more optimized and easier to understand. Using a native LIKE operator in plans means other optimizations (like combining with other conditions) work seamlessly. It also aligns with SQL standards, which can be helpful for developers debugging query plans. The plan shows a clear intention ("column starts with X") via a LIKE 'X%' clause, which is intuitive.

  • Encouraging Extensibility: For Rust developers and contributors, this is a great example of how to extend DataFusion. If you have a custom scalar function or see an opportunity for a rewrite rule, DataFusion provides the hooks (like the simplify trait method) to implement it. You can add specialized knowledge (such as domain-specific optimizations) in the same way we did for starts_with. The fact that we accomplished this with a small change localized to one module underscores how approachable DataFusion's codebase is for newcomers and seasoned devs alike.

  • Maintaining Correctness: Importantly, all these optimizations maintain the exact semantics of the original expression. By escaping special characters and carefully constructing the new expression, we ensure that starts_with and the equivalent LIKE behave identically for all inputs. The robust test suite gives confidence in the correctness of this rewrite.

Conclusion

Implementing the simplify rule for starts_with was a satisfying journey that improved DataFusion's optimizer and demonstrated the strength of its design. In just a few lines of code, we unlocked an optimization that can benefit many real-world queries. This change makes DataFusion smarter about string predicates, allowing it to skip unnecessary work and respond faster. For DataFusion users, it means you can write queries with starts_with and get performance similar to writing a LIKE by hand – the engine does it for you. For Rust developers, it's an example of how contributing to an open-source project like DataFusion can be straightforward thanks to clear abstraction boundaries. We hope this encourages more contributions and explorations into DataFusion's optimization capabilities. After all, as we've seen, a simple change (pun intended) can go a long way in making your queries run simply faster!

0
Subscribe to my newsletter

Read articles from Jagdish Parihar directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Jagdish Parihar
Jagdish Parihar

I am software developer, primarily working on the nodejs, graphql, react and mongoDB.