Each of us frequently interacts with databases, whether we are developers, data analysts, or system administrators. However, have we ever truly considered the intricate processes that occur when our queries are executed by the database's execution engine? These queries can originate from various sources, such as programs, scripts, or other applications. But how exactly do they transform into the output data that we display on user interfaces or use for further processing and decision-making within our applications?

Why some queries are processed in just a few milliseconds, while others might take several minutes or even longer to complete. What are the factors that influence the speed and efficiency of query execution? How does the execution engine handle these queries, and what steps are involved in this process?

To provide a clear understanding, we will focus on relational databases, using Microsoft SQL Server as our primary example. I will guide you through the various stages of query execution, explaining each step in detail. We will examine how the database engine parses, optimizes, and executes queries, and how it retrieves and processes data to produce the final results.

Are you intrigued by the inner workings of database systems and eager to learn more about query execution? If so, let's embark on this journey together and uncover the mysteries behind the efficient functioning of relational databases.

What is the Relational Engine?

All of our SQL queries, no matter where they come from, are processed by a component known as the "Relational Engine." This engine is a crucial part of the database management system, responsible for interpreting and executing SQL commands. When a query is submitted, the Relational Engine takes charge of parsing the SQL syntax, ensuring that it is correct and can be understood by the system. It then proceeds to optimize the query, determining the most efficient way to access the required data. This involves analyzing various possible execution plans and selecting the one that will execute the query in the shortest time with the least resource usage. Once the optimal plan is chosen, the Relational Engine executes the query, retrieving data from the database and processing it as needed to generate the desired output.

As you can see, the process of query execution is divided into two main phases:

Compilation Phase
Execution Phase.

The Compilation Phase itself is further broken down into three distinct stages. The first stage is the Syntax Check, where the system verifies the correctness of the SQL syntax. If the syntax is correct, it results in a Parse Tree, which represents the structural breakdown of the query.

Next, we move to the Object Binding stage. Here, the system checks the query against the database schema to ensure that all referenced objects, such as tables and columns, exist and are accessible. This stage produces a Query Processor Tree, which is a more refined representation of the query, now linked to actual database objects.

The final stage of the Compilation Phase is Query Optimization. During this stage, the system evaluates multiple potential execution strategies to determine the most efficient way to retrieve the required data. It considers factors such as available indexes, data distribution, and resource costs. The outcome of this stage is an Execution Plan, which outlines the steps the system will take to execute the query.

Once the Compilation Phase is complete, we move to the Execution Phase. This is where the Execution Plan is put into action. The system follows the plan to access the necessary data from the database, process it according to the query's instructions, and produce the final output. This output can either be the requested data or an error message if any issues arise during execution.

We begin this entire process using Transact-SQL (T-SQL), which is the primary language for managing and processing data in Microsoft SQL Server. T-SQL allows us to write complex queries to interact with the database effectively.

Let's look at each stage to understand their main goals and tasks, showing how the query execution process works efficiently to provide accurate and timely data for our applications and database environments.

Syntax Check

The main goal of this stage is to ensure the query is written correctly. For example, consider the following SQL query:

SELETC -- Syntax Error, might be SELECT
       p.LastName + ', ' + p.FirstName,
       p.Title,
       pp.PhoneNumber
FROM Product.Person AS p
JOIN Person.PersonPhone AS pp
   ON pp.BusinessEntityID = p.BusinessEntityID
WHERE p.Id = 1000

Since there was a syntax error in the code due to the use of an unknown operator, "SELETC", instead of the correct "SELECT" operator, the execution of the query will halt immediately at the syntax check stage. This stage is crucial because it ensures that the query is free from any grammatical mistakes before proceeding further. When the system encounters such an error, it will not attempt to execute the query. Instead, it will generate an error message to inform the user about the issue. In this case: “Error: near "SELETC": syntax error.“

Once the syntax error is identified, the user can correct it by replacing "SELETC" with the correct "SELECT" keyword. After making this correction, the query is ready to be executed:

SELECT -- Correct spelling
       p.LastName + ', ' + p.FirstName,
       p.Title,
       pp.PhoneNumber
FROM Product.Person AS p
JOIN Person.PersonPhone AS pp
   ON pp.BusinessEntityID = p.BusinessEntityID
WHERE p.Id = 1000

As a result of executing the corrected query, the relational engine will generate a "Parse Tree". Parse Tree is a structured representation that outlines the logical sequence of operations, or operators, required to successfully execute the query. Essentially, it breaks down the query into its fundamental components, detailing each step that the database engine needs to perform. This includes operations such as selecting specific columns, joining tables, and applying conditions to filter the data. The Parse Tree serves as a blueprint for the database engine, guiding it through the process of retrieving and processing the data as specified in the query. An example of a Parse Tree is illustrated in the image below, showcasing how the query is systematically decomposed into these logical steps.

The Parse Tree is a lightweight structure, which will be further utilized in the next stage - Object Binding, as detailed below.

Object Binding

The next stage in the query processing pipeline is Binding stage. This crucial phase involves several key tasks:

Resolving object, table, and column names within the query.
Identifying the data types of each element.
Determining the use of aggregate functions such as SUM and MAX, a process known as aggregate binding**.**

Essentially, this stage ensures that all elements of the query are correctly aligned with the database's schema, verifying that the specified tables and columns exist and that their data types are compatible with the operations being performed.

During this stage, the database engine carefully checks each part of the query to make sure it follows the database's structure and rules. It confirms that the table and column names in the query match those in the database. It also checks that the column data types are suitable for the operations in the query, like math calculations or string changes. This stage also involves spotting any aggregate functions in the query and making sure they are used correctly on the data.

For instance, if we revisit the code example provided above, we can see that there is a discrepancy with the table name used in the query. The table referenced as Product.Person does not exist in the database schema, the correct table name should be Person.Person. This kind of error would be detected during the Binding stage, prompting the database engine to flag it for correction before the query can proceed to execution. This meticulous process ensures that the query is not only syntactically correct but also semantically valid, aligning perfectly with the database's structure and constraints.

SELECT
       p.LastName + ', ' + p.FirstName,
       p.Title,
       pp.PhoneNumber
FROM Product.Person AS p -- Product.Person is not correct, schema is wrong
JOIN Person.PersonPhone AS pp
   ON pp.BusinessEntityID = p.BusinessEntityID
WHERE p.Id = 1000

Since there was an error in the code due to the incorrect table name (the table Product.Person does not exist in the database schema), the execution of the query will be halted. When this happens, the database engine will generate an error message: “Error: Invalid object name 'Product.Person'.” This message indicates that the table referenced in the query cannot be found, which prevents the query from proceeding further.

To resolve this issue, we need to correct the schema name to match the actual structure of the database. By updating the table reference to the correct schema, as shown below, we ensure that the query aligns with the database schema. This correction allows the query to execute successfully, retrieving the desired data without any errors. The corrected query is as follows:

SELECT
       p.LastName + ', ' + p.FirstName,
       p.Title,
       pp.PhoneNumber
FROM Person.Person AS p -- Corrected schema name
JOIN Person.PersonPhone AS pp
   ON pp.BusinessEntityID = p.BusinessEntityID
WHERE p.Id = 1000

As a result of executing the corrected query, the database system generates a "Query Processor Tree". This tree is a more advanced structure compared to a "Parse Tree". While a Parse Tree primarily focuses on the syntax of the query, the Query Processor Tree includes additional details that are crucial for the execution process. It not only represents the operators used in the query but also incorporates the conditions required for joining multiple tables. Furthermore, it includes all the filters and values applied to refine the results obtained from the query. This comprehensive structure helps the database engine efficiently process and execute the query, ensuring that the desired data is retrieved accurately. An example of what the resulting "Query Processor Tree" might look like below:

It's important to highlight that, in addition to the "Query Processor Tree", the database system also generates a hash value of this tree. This hash acts as a unique identifier for the tree, which is crucial for determining if an execution plan has already been cached. If the system finds a matching hash, it can bypass the optimization stage, which is often time-consuming, and move directly to the execution phase. This process significantly enhances the efficiency of query execution by reducing the time needed to process repeated queries.

Now, let's look more closely at the optimization stage. During this part, the database engine checks different ways to run the query, thinking about things like available indexes, data size, and join complexity. The aim is to find the best way to run the query using the least resources and time. The optimizer looks at different methods, like choosing between a nested loop join or a hash join, and picks the one that works best for the current database situation. By understanding this stage, we can see how the database system makes sure queries run efficiently, which boosts the overall performance of the database application.

Optimization

The final stage of the compilation phase is known as Optimization, and its primary role is to construct the Execution Plan.

The Execution Plan is essentially a detailed sequence of operations that the database system needs to perform to retrieve the results of an SQL query within a relational DBMS.

Within the relational database mechanism, an "optimizer" is embedded to handle this task. The same query, when submitted to the database, can be executed in multiple ways. The optimizer has the ability to decide the order of operations, select the most efficient execution methods, and apply filters at various stages of the query execution process. This flexibility leads to the critical question: which sequence of operations should be chosen to ensure the query executes in the shortest time and with the least resource consumption? The optimization stage is designed to answer this question.

As previously mentioned, this stage can be bypassed if the database system identifies that a similar query, or a query with the same structure but different parameter values, has already been processed and its Execution Plan is stored in the cache. This caching mechanism allows the system to skip the time-consuming optimization process and proceed directly to execution, thus saving valuable time.

However, if no cached Execution Plan is available, the database system must perform optimization, which can take one of two paths:

Trivial Optimization - This path is used for optimizing simple, lightweight queries that primarily operate with a single table. The key advantage of trivial optimization is its speed, as it typically requires no more than 1 millisecond to generate the Execution Plan.

The reason the Execution Plan is constructed so rapidly is that the optimizer has a very limited set of options to consider when building the plan, which significantly accelerates the process.

Below is an example of a query that would undergo Trivial Optimization. This type of query is straightforward, involving minimal complexity, which allows the optimizer to quickly determine the most efficient execution path.

SELECT d.Name
FROM HumanResources.Department AS d -- SELECT from a single Table
WHERE d.DepartmentID = 42

Full Cost-Based Optimization - This optimization process is designed for handling complex and resource-intensive queries. These queries typically involve multiple tables, making use of aggregate operators, intricate filters, and presenting a vast array of potential execution plans to consider.

When constructing an Execution Plan for such queries, the process can vary significantly in duration. It may take just a few milliseconds for simpler cases, but for particularly large and complex queries, it can extend to several minutes. This is due to the comprehensive analysis required to determine the most efficient execution strategy.

The optimizer plays a crucial role in generating numerous Execution Plans and meticulously evaluates each one to select the most optimal plan. This selection is based on a detailed comparison of several key metrics associated with the Execution Plans:

CPU Cost: This metric assesses the amount of processing power needed to execute the plan. It considers the computational resources required to complete the query efficiently.
I/O Cost: This involves evaluating the disk and memory usage associated with the plan. It measures the input/output operations required, which can significantly impact the overall performance.
Row Cardinality: This metric examines the number of rows processed at each step of the query execution. Understanding row cardinality helps in predicting the volume of data that needs to be handled.
Memory Usage: This estimates the amount of memory required to store intermediate results during the execution process. Efficient memory usage is crucial for maintaining performance, especially with large datasets.

For example, consider a query that requires Full Cost-Based Optimization. This query might join several tables, use complex filters, and calculate totals. The optimizer needs to explore different ways to execute the query to find the most efficient one.

Here is an example of such a query.

SELECT
       p.LastName + ', ' + p.FirstName,
       p.Title,
       pp.PhoneNumber
FROM Person.Person AS p -- Query using 2 tables, aggregation is used
JOIN Person.PersonPhone AS pp
   ON pp.BusinessEntityID = p.BusinessEntityID
WHERE p.Id = 1000

Each of the optimization options described above requires several key parameters as input. These parameters are crucial for selecting the most efficient operators for the query. By leveraging these inputs, the execution of the query, as determined by the generated Execution Plan, is designed to be as fast as possible. Here are some of the important parameters involved:

Query Processor Tree: This is a structured representation of the query, breaking it down into various components and operations that need to be executed. It helps in organizing the sequence of operations for optimal performance.
Statistics: These are detailed data points that provide insights into the distribution and volume of data within the database. They help the optimizer make informed decisions about the best way to access and process data.
Constraints: These are rules applied to the data, such as primary keys, foreign keys, and unique constraints, which can influence the execution plan by reducing the search space and improving efficiency.
Etc…: includes other factors like indexes, available resources, and system configuration settings that can affect the optimization process.

The optimizer's working mechanism is illustrated below.

An important outcome of the optimizer's work is not only the creation of the Execution Plan but also its storage in the Plan Cache. This caching mechanism allows the system to quickly retrieve a previously prepared plan, thereby skipping the time-consuming optimization process for future queries that are similar or identical. By doing so, it enhances the overall efficiency and speed of query processing, as the system can reuse existing plans rather than recalculating them from scratch each time.

Once the Execution Plan for a query is constructed and stored, it moves to the next crucial phase, known as the Execution Stage. During this stage, the database engine takes the detailed instructions outlined in the Execution Plan and carries them out to retrieve or manipulate the data as requested by the query. This stage is where the plan is put into action, interacting with the database to produce the desired results. The execution stage is vital because it translates the plan into actual operations that access and modify the data, ensuring that the query's objectives are achieved efficiently and accurately.

Execution

The final stage in the query processing lifecycle is the Execution Phase. This phase is crucial as it involves carrying out the query according to the detailed instructions specified in the execution plan. The execution phase is where the theoretical plan is transformed into practical operations that interact directly with the database. During this stage, the database engine meticulously follows the steps outlined in the execution plan to access, retrieve, or modify the data as requested by the query. This ensures that the query's objectives are met with precision and efficiency.

For instance, if the execution plan includes a series of operations such as scanning a table, applying filters, joining tables, or sorting results, the execution phase will perform these tasks in the exact order and manner prescribed. This phase is not only about retrieving data but also about ensuring that the data is processed accurately to meet the query's requirements.

An example of how execution works in practice is illustrated in the figure below.

Although this stage might appear simple and straightforward at first glance, there are several critical details to consider to fully understand its complexity and importance:

Immutability of the Execution Plan: Once the execution phase begins, the execution engine must strictly adhere to the optimizer's plan. This means that the execution plan is set in stone and cannot be altered, adjusted, or modified in any way during execution. The immutability of the plan ensures consistency and reliability in how queries are executed, preventing any unexpected changes that could affect the outcome.
Recompilation of Execution Plans: There are scenarios where the execution engine might need to recompile a plan. This typically occurs if there are significant changes to the database tables during the execution process. For example, if the statistics of a table are updated, if there is a substantial increase in the number of rows, or if the structure of the table is altered, the execution engine may determine that the existing plan is no longer optimal. In such cases, a new execution plan is generated to reflect the current state of the database, and execution will resume only after the new plan is in place.
Caching of Frequently-Used Plans: Execution plans that are frequently referenced and costly to compile are often retained in the cache for extended periods. This caching mechanism is crucial because generating new plans from scratch is a resource-intensive process. By keeping the most commonly used plans readily available in the cache, the system can improve efficiency and performance, reducing the overhead associated with plan compilation. This strategy ensures that the database engine can quickly execute queries without unnecessary delays, especially for operations that are repeatedly executed.

Summary

In conclusion, gaining a thorough understanding of the complexities involved in query execution within relational databases, especially when working with Microsoft SQL Server, offers significant insights into the efficient processing of data. By delving into each stage of query execution - starting from parsing and binding, moving through optimization, and finally reaching execution, we develop a comprehensive appreciation for the sophisticated processes that ensure queries are carried out both swiftly and accurately.

This detailed knowledge not only enhances our capability to craft well-optimized queries but also equips us with the skills needed to effectively troubleshoot and resolve performance issues that may arise. As we continue to engage with databases, this foundational understanding becomes an invaluable tool, aiding us in the development of robust and efficient data-driven applications. It empowers us to make informed decisions that can lead to improved performance and reliability in our database interactions, ultimately contributing to the success of our software projects.

Docs

SQL Server Execution Plans, Third Edition by Grant Fritchey: A comprehensive guide to understanding and optimizing SQL Server execution plans. Available at Red Gate.
Microsoft SQL Server Documentation: Official documentation from Microsoft, covering various aspects of SQL Server, including query execution and optimization. Available at Microsoft Docs.

Demystifying the Process of Query Execution in Relational Databases

Table of contents