Implementing `simplify` for `starts_with` in Apache DataFusion

Intro about Datafusion
Apache DataFusion is a Rust-native query engine with a powerful optimizer. One key component of its optimizer is expression simplification, often referred to as the simplify
function. In this post, we'll walk through how we implemented a simplify
rule for the starts_with
string function in DataFusion, turning it into an equivalent LIKE
pattern. We'll cover a high-level overview of DataFusion's simplify mechanism, why this change was needed, how we implemented it step by step (including basic regex handling and tests), and the benefits it brings to both the DataFusion project and Rust developers.
Note: Portions of this article were drafted with the help of an AI writing assistant, then heavily edited by me for clarity and accuracy.
What is DataFusion's simplify
Function?
In DataFusion, simplify
is an optimization mechanism that rewrites expressions into simpler or more efficient forms during query planning. This includes things like constant folding (e.g. replacing 1 + 2
with 3
) and algebraic rewrites. Each built-in function or expression can provide its own simplify
logic. Thanks to DataFusion's well-structured design, these simplification rules are modular – typically implemented as a method on the function's definition – making it easy to add new rules without touching the rest of the codebase. For example, the starts_with
function in DataFusion has a fn simplify(...)
method where we can plug in custom rewrite logic. The optimizer automatically applies these rules, so if a function's arguments allow a simplification, DataFusion will use the simpler form in the query plan.
Why Simplify starts_with
?
The starts_with(string, prefix)
function returns true if string
begins with the given prefix
. Without simplification, starts_with
would be treated as a black-box scalar function in the query plan. By introducing a simplification rule, we can rewrite starts_with(col, 'prefix')
into a standard SQL pattern match col LIKE 'prefix%'
.
This has big benefits:
SQL engines (and DataFusion) know how to optimize LIKE 'prefix%'
patterns for string columns. In particular, converting to LIKE
enables predicate pruning based on prefix filters. In other words, DataFusion can use file metadata (like min/max column values) to skip reading data that can't match the prefix, significantly improving performance for queries that filter on string prefixes.
Thanks to the PR that implements predicate pruning: https://github.com/apache/datafusion/pull/12978
Another benefit is that the logical plan becomes more transparent. A LIKE 'prefix%'
is a recognizable operation for developers and other optimizations, whereas a custom function call might be harder to leverage. Overall, simplifying starts_with
makes the query plan more efficient and easier to reason about.
Implementing the Simplification for starts_with
Thanks to DataFusion's modular architecture, adding this optimization was straightforward. All the logic for starts_with
is encapsulated in its own module, so we only needed to modify that and add tests. Here are the steps we took to implement the simplify
rule for starts_with
:
Introducing the
simplify
logic: We added a newsimplify
method in thestarts_with
function implementation. This method inspects the function's arguments. If the second argument (the prefix) is a literal string, we can simplify. DataFusion provides anExprSimplifyResult
type to indicate whether an expression was simplified or left unchanged. In our case, when we detect a literal prefix, we'll return aSimplified
result with a new expression.Converting to a
LIKE
expression (basic regex handling): If the prefix is literal, we construct a new pattern for a SQLLIKE
. Essentially, we take the prefix and append a%
wildcard to match any suffix. For example,starts_with(name, "Ja")
becomesname LIKE "Ja%"
. We also had to handle any special characters in the prefix. In SQLLIKE
patterns, the%
and_
characters have special meaning (any sequence of characters and any single character, respectively). We made sure to escape these in the prefix before appending%
so that, for instance, a literal prefix"ja%"
is treated as"ja\%"
in the pattern (meaning the string "ja%") and then becomes"ja\%%"
after adding the wildcard. This way, the semantics ofstarts_with
(which treats the prefix as literal text) are preserved in theLIKE
expression.
The core snippet of the implementation looked like this:
if let Expr::Literal(ScalarValue::Utf8(Some(prefix))) = &args[1] {
// Found constant prefix, convert to LIKE pattern
let escaped = prefix.replace("%", "\\%").replace("_", "\\_");
let pattern = format!("{}%", escaped);
let like_expr = Expr::Literal(ScalarValue::Utf8(Some(pattern)));
return Ok(ExprSimplifyResult::Simplified(Expr::Like(Like {
negated: false,
expr: Box::new(args[0].clone()), // the string column
pattern: Box::new(like_expr), // the 'prefix%' pattern
escape_char: None,
case_insensitive: false,
}))));
}
// If not a literal prefix, no change
return Ok(ExprSimplifyResult::Original(args));
In practice, we wrote similar logic for all relevant string types (Utf8
, LargeUtf8
, and the newer Utf8View
) since DataFusion supports multiple string array types. The idea is the same: detect literal prefix, build the 'prefix%'
pattern, wrap it in a LIKE
expression node, and mark the expression as simplified.
- Updating tests: With the implementation in place, we updated and added tests to ensure everything works as expected. We wrote unit tests for
starts_with
to confirm that the function still returns correct boolean results for various cases (e.g."alphabet", "alph" -> true
,"alphabet", "bet" -> false
, etc.). We paid special attention to edge cases like an empty prefix (which should always yield true as long as the string is not NULL, since every string starts with""
). We also added an integration test using DataFusion's SQL engine to verify the optimization kicks in. For example, using anEXPLAIN
query on a filter withstarts_with
, we should see the plan has been rewritten to useLIKE
. After our change, anEXPLAIN SELECT * FROM my_table WHERE starts_with(col, 'f')
showed a filter ofcol LIKE 'f%'
in the logical plan, and the physical plan included a pruning predicate based on the prefix. This confirmed that predicate pushdown was working: DataFusion was able to determine, for instance, that only values between"f"
and"g"
(non-inclusive of"g"
) could matchf%
, and use that to skip irrelevant data. All new tests passed, demonstrating that thesimplify
rule behaves correctly.
Why DataFusion's Design Made This Easy
Implementing this feature highlighted how well-structured DataFusion is. The codebase cleanly separates concerns: the logic for each function (including execution and optimization rules) lives in one place. We didn't have to tinker with the core optimizer loop or planner; we just implemented the simplify
method for starts_with
and the existing optimization framework took care of invoking it. DataFusion already had patterns for similar rewrites (for instance, simplifying certain regex patterns to equality checks in the past), so we were able to follow an established approach. This modular design meant less risk of breaking unrelated parts of the system and made the code review process smoother. It’s a testament to DataFusion’s extensibility that a contributor could add such an optimization in a relatively small, focused PR.
Benefits and Impact
This enhancement brings several benefits:
Performance Boost: Queries filtering on string prefixes can see significant speed-ups. By converting
starts_with(col, "prefix")
intocol LIKE 'prefix%'
, DataFusion can apply predicate pruning at the data source level. In practical terms, if you're querying a large dataset stored in Parquet files, DataFusion will only read the files (or row groups) that could possibly contain values with the given prefix. This reduces I/O and speeds up query execution.Optimized Plans: The logical and physical plans become more optimized and easier to understand. Using a native
LIKE
operator in plans means other optimizations (like combining with other conditions) work seamlessly. It also aligns with SQL standards, which can be helpful for developers debugging query plans. The plan shows a clear intention ("column starts with X") via aLIKE 'X%'
clause, which is intuitive.Encouraging Extensibility: For Rust developers and contributors, this is a great example of how to extend DataFusion. If you have a custom scalar function or see an opportunity for a rewrite rule, DataFusion provides the hooks (like the
simplify
trait method) to implement it. You can add specialized knowledge (such as domain-specific optimizations) in the same way we did forstarts_with
. The fact that we accomplished this with a small change localized to one module underscores how approachable DataFusion's codebase is for newcomers and seasoned devs alike.Maintaining Correctness: Importantly, all these optimizations maintain the exact semantics of the original expression. By escaping special characters and carefully constructing the new expression, we ensure that
starts_with
and the equivalentLIKE
behave identically for all inputs. The robust test suite gives confidence in the correctness of this rewrite.
Conclusion
Implementing the simplify
rule for starts_with
was a satisfying journey that improved DataFusion's optimizer and demonstrated the strength of its design. In just a few lines of code, we unlocked an optimization that can benefit many real-world queries. This change makes DataFusion smarter about string predicates, allowing it to skip unnecessary work and respond faster. For DataFusion users, it means you can write queries with starts_with
and get performance similar to writing a LIKE
by hand – the engine does it for you. For Rust developers, it's an example of how contributing to an open-source project like DataFusion can be straightforward thanks to clear abstraction boundaries. We hope this encourages more contributions and explorations into DataFusion's optimization capabilities. After all, as we've seen, a simple change (pun intended) can go a long way in making your queries run simply faster!
Subscribe to my newsletter
Read articles from Jagdish Parihar directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Jagdish Parihar
Jagdish Parihar
I am software developer, primarily working on the nodejs, graphql, react and mongoDB.