Why Using SQL for Data Application Development Might Be a Mistake

Akash Desarda

Data is a crucial component of any modern application or service, and how you process it can determine the success or failure of your app. While data is often stored in databases, relying on SQL queries for processing can be problematic. Let me explain why this approach might not be ideal and what alternatives you can consider.

💡
This article is opinionated advice based on my experience. I am not presenting the idea as a universal truth that you have to follow no matter what. Based on your own experience, you may have a completely different take.

Generally, the use of data in any application can be categorised as:

  1. Analytical

  2. ETL/ELT

Using SQL for analytical purposes is great, and you should keep using it. SQL truly excels in this area. There's no need for setup or worrying about the environment: just run the query and get your results.

But for ETL pipelines with complex transformations (which are common in the real world), you should move away from SQL. Using SQL here often leads to very long, deeply nested, seemingly endless chains of CTEs, which makes the code hard to read and turns debugging into a nightmare.

Challenges of SQL-heavy ETL Pipelines (Why it can be a mistake):

  • Increased Complexity and Readability: ETL pipelines with many transformations often result in complex, deeply nested SQL queries with numerous CTEs, making the code hard to read, understand, and maintain. This complexity can be challenging for new team members, or even your future self, to untangle.

  • Debugging Nightmares: Tracking down errors in a massive SQL script with multiple levels of nesting can be a nightmare. The error messages might not always be as informative as those you'd get in your primary programming language's debugger.

  • Limited Expressiveness for Complex Logic: While SQL is powerful for set-based operations, implementing intricate business logic or conditional transformations can become cumbersome and less intuitive compared to using the control flow structures available in languages like Python or Java.

  • Version Control and Collaboration Challenges: While SQL scripts can be version-controlled, the changes and their impact might be harder to visualize and collaborate on compared to code within a larger application codebase managed by standard development tools and practices.

  • Fragility and Maintainability: Complex SQL ETL can become fragile and hard to maintain over time, which might lead to constantly fixing problems as they arise.

  • Testing Difficulties: Unit testing individual components and transformations within a large SQL script can be challenging. Testing often involves running the entire pipeline or significant portions of it, making it slower and less granular than testing functions or methods in your application code.

Below is a sample of a typical query structure for an ETL pipeline. As you can see, even with relatively few lines of code, it quickly becomes hard to follow.

WITH
    SourceData AS (
        -- Step 1: Extract data from the raw source
        SELECT
            column1,
            column2,
            column3,
            -- ... more columns ...
            CAST(timestamp_column AS TIMESTAMP) AS processed_timestamp
        FROM
            raw_data_table
        WHERE
            date_column >= 'some_start_date' AND date_column < 'some_end_date'
    ),

    CleanedData AS (
        -- Step 2: Clean and standardize the data
        SELECT
            UPPER(TRIM(column1)) AS standardized_column1,
            CASE
                WHEN column2 = 'value_a' THEN 'Category A'
                WHEN column2 = 'value_b' THEN 'Category B'
                ELSE 'Other'
            END AS categorized_column2,
            CAST(column3 AS INTEGER) AS numeric_column3,
            processed_timestamp
        FROM
            SourceData
        WHERE
            column1 IS NOT NULL AND column3 IS NOT NULL
    ),

    TransformedData1 AS (
        -- Step 3: Apply the first set of transformations
        SELECT
            standardized_column1,
            categorized_column2,
            numeric_column3 * 10 AS transformed_numeric_column3,
            EXTRACT(YEAR FROM processed_timestamp) AS processing_year,
            EXTRACT(MONTH FROM processed_timestamp) AS processing_month
        FROM
            CleanedData
    ),

    LookupTable1 AS (
        -- Step 4: Join with a lookup table
        SELECT
            td1.*,
            lt1.lookup_value
        FROM
            TransformedData1 td1
        LEFT JOIN
            lookup_table_one lt1 ON td1.standardized_column1 = lt1.lookup_key
    ),

    FilteredData AS (
        -- Step 5: Apply specific filtering criteria
        SELECT
            *
        FROM
            LookupTable1
        WHERE
            processing_year = 2024 AND categorized_column2 IN ('Category A', 'Category B')
    ),

    AggregatedData AS (
        -- Step 6: Perform aggregations
        SELECT
            processing_year,
            processing_month,
            categorized_column2,
            COUNT(*) AS record_count,
            SUM(transformed_numeric_column3) AS total_transformed_value
        FROM
            FilteredData
        GROUP BY
            processing_year, processing_month, categorized_column2
    ),

    FinalStage AS (
        -- Step 7: Final transformations and calculations
        SELECT
            ad.*,
            (total_transformed_value / record_count) AS average_transformed_value,
            'Processed' AS status
        FROM
            AggregatedData ad
        -- Potentially another join with another lookup table here
        LEFT JOIN
            final_lookup_table flt ON ad.categorized_column2 = flt.category
    )

-- Step 8: Load or select the final processed data
SELECT
    *
FROM
    FinalStage
ORDER BY
    processing_year, processing_month, categorized_column2;

-- You might even have subqueries within these CTEs for more complex logic:
--
-- TransformedData2 AS (
--     SELECT
--         fd.*,
--         (SELECT MAX(another_column) FROM another_table WHERE fk_column = fd.some_id) AS max_value_from_other_table
--     FROM
--         FilteredData fd
-- ),
--
-- And so on...

The Solution

You can follow the generic steps below to solve the problems stated above and improve your development experience.

  • Step 1: Initial Starting Point

    • Use an ORM to pull decently filtered initial source data.

    • If you prefer, you can even use raw SQL with your programming language's database drivers.

    • If you are using Python, I have already written a blog post that explains how to work effectively with databases in Python; it describes a solution that gives you the best of both worlds. You can read it here: How to effectively work with Databases in Python. A minimal sketch of this step is shown below.
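For illustration, here is a minimal sketch of Step 1, assuming SQLAlchemy as the ORM and PostgreSQL as the database. The connection string, table, and column names are hypothetical placeholders, not part of any real project.

# Step 1 sketch: pull a decently filtered slice of the source data.
# All names below (raw_data_table, date_column, the connection string) are assumptions.
from sqlalchemy import MetaData, Table, create_engine, select

engine = create_engine("postgresql+psycopg2://user:password@host/db")
raw_data = Table("raw_data_table", MetaData(), autoload_with=engine)

# Push only the cheap, set-based filtering down to the database...
stmt = (
    select(raw_data)
    .where(raw_data.c.date_column >= "2024-01-01")
    .where(raw_data.c.date_column < "2025-01-01")
)

with engine.connect() as conn:
    rows = conn.execute(stmt).fetchall()

# ...and keep the complex business logic in Python, where it is
# easier to read, test, and debug.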

  • Step 2: Design your codebase using a proper Library Structure

    • I'm confident that in the real world, you'll have many ETL pipelines running in a project to meet different business needs. I'm also certain that these pipelines will share a lot of common business logic.

    • All the code, business logic, utilities, connections, visualizations, and more should be developed and organized as a library. This approach provides a wide range of flexible options and enables us to leverage software engineering best practices. You can apply principles like DRY and SRP, along with design patterns, OOP, and more. The possibilities are extensive.

    • Below is a sample library folder structure. This allows us to write each piece of code in the appropriate location. A sketch of what a module (and its test) inside this structure might look like follows the tree.

    •   some_library/
        ├── config
        ├── utils/
        │   ├── module1
        │   └── module2
        ├── ops/
        │   ├── preprocessing/
        │   │   ├── module1
        │   │   └── module2
        │   └── processing/
        │       ├── module1
        │       └── module2
        └── pipeline/
            ├── pipeline1
            ├── pipeline2
            └── pipeline3
        cicd.yaml
        pyproject.toml
        poetry.lock
        test/
        ├── test_utils
        ├── test_ops
        └── test_pipeline
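As an illustration, a transformation module inside such a library, plus its unit test, might look like the sketch below. The file paths, function, and column names are hypothetical; the point is that each piece of business logic becomes a small, importable, testable function.

# some_library/ops/preprocessing/cleaning.py (hypothetical module)
import pandas as pd


def standardize_category(df: pd.DataFrame, column: str, mapping: dict) -> pd.DataFrame:
    """Map raw codes in `column` to business categories, defaulting to 'Other'."""
    out = df.copy()
    out[f"categorized_{column}"] = out[column].map(mapping).fillna("Other")
    return out


# test/test_ops/test_cleaning.py (hypothetical pytest test)
def test_standardize_category():
    df = pd.DataFrame({"column2": ["value_a", "unknown"]})
    result = standardize_category(df, "column2", {"value_a": "Category A"})
    assert list(result["categorized_column2"]) == ["Category A", "Other"]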
  • Step 3: Use your library as a package in the pipeline

    • Now that the entire codebase is organized as a library, the next step is to publish it as a package.

    • Install and use this package in downstream pipelines just like any other open-source package.

    • This approach lets you reuse the same functions for common functionality across pipelines, as sketched below.
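For example, a downstream pipeline could look like the hypothetical sketch below, assuming the library from the previous step has been published under the name some_library:

# pipeline/orders_pipeline.py (hypothetical downstream pipeline)
# The library is installed like any other dependency, e.g. declared in
# pyproject.toml and pulled from your package index.
import pandas as pd

from some_library.ops.preprocessing.cleaning import standardize_category


def run() -> pd.DataFrame:
    # Stand-in for real source data pulled via the ORM in Step 1.
    df = pd.DataFrame({"column2": ["value_a", "value_b", "something_else"]})
    return standardize_category(
        df, "column2", {"value_a": "Category A", "value_b": "Category B"}
    )


if __name__ == "__main__":
    print(run())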

  • Step 4: Use the programming language ecosystem

    • Now that we're using programming languages like Python, Java, or JavaScript, we can take advantage of open-source tools and packages to enhance the pipeline even more.

    • This saves us from the pain of reinventing the wheel. For example, the CTE-heavy query shown earlier can be expressed as a handful of small pandas steps, as sketched below.
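To make that concrete, here is a minimal sketch (not a production implementation) of how the earlier CTE-heavy query could be expressed with pandas. The table and column names are the same placeholders used in the SQL example, and the connection string is an assumption.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host/db")  # assumed connection

# Step 1: extract a date-bounded slice (keep the simple, set-based filtering in SQL)
source = pd.read_sql(
    """
    SELECT column1, column2, column3, timestamp_column
    FROM raw_data_table
    WHERE date_column >= '2024-01-01' AND date_column < '2025-01-01'
    """,
    engine,
)

# Step 2: clean and standardise
cleaned = source.dropna(subset=["column1", "column3"]).assign(
    standardized_column1=lambda df: df["column1"].str.strip().str.upper(),
    categorized_column2=lambda df: df["column2"]
    .map({"value_a": "Category A", "value_b": "Category B"})
    .fillna("Other"),
    numeric_column3=lambda df: df["column3"].astype(int),
    processed_timestamp=lambda df: pd.to_datetime(df["timestamp_column"]),
)

# Step 3: derive new columns
cleaned["transformed_numeric_column3"] = cleaned["numeric_column3"] * 10
cleaned["processing_year"] = cleaned["processed_timestamp"].dt.year
cleaned["processing_month"] = cleaned["processed_timestamp"].dt.month

# Step 4: join with a lookup table
lookup = pd.read_sql("SELECT lookup_key, lookup_value FROM lookup_table_one", engine)
joined = cleaned.merge(
    lookup, how="left", left_on="standardized_column1", right_on="lookup_key"
)

# Steps 5-7: filter, aggregate, and add the final calculations
final = (
    joined.query(
        "processing_year == 2024 and categorized_column2 in ['Category A', 'Category B']"
    )
    .groupby(["processing_year", "processing_month", "categorized_column2"], as_index=False)
    .agg(
        record_count=("transformed_numeric_column3", "size"),
        total_transformed_value=("transformed_numeric_column3", "sum"),
    )
    .assign(
        average_transformed_value=lambda df: df["total_transformed_value"] / df["record_count"],
        status="Processed",
    )
)

Each step is a named, inspectable DataFrame, so you can set a breakpoint, print an intermediate result, or unit test a single step without re-running the whole query.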

Possible Counterarguments or Nuances to Consider:

  • Performance: For very large datasets and simple, set-based transformations, well-optimized SQL can sometimes outperform code-based ETL due to the database's ability to leverage indexing and other performance optimizations. However, for complex logic, the overhead of repeated SQL calls and data transfer might negate this benefit.

  • Database-Specific Features: Sometimes, specific database features might be most efficiently utilized directly through SQL. However, if the core logic becomes overly convoluted, the trade-off for readability and maintainability might still favor the above approach for the bulk of the pipeline.

  • Out of Memory Issues: A large dataset may not fit into memory, which can prevent the pipeline from running at all. This can be solved with batch processing, as sketched below. To further improve performance, the batches can be executed in parallel on a distributed system.
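As a minimal sketch of the batch-processing idea (the query, table names, connection string, and batch size are all placeholders), pandas can stream a query result in chunks instead of loading it all at once:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host/db")  # assumed connection

# Read and process the table in fixed-size batches instead of loading it all at once.
for batch in pd.read_sql("SELECT * FROM raw_data_table", engine, chunksize=100_000):
    processed = batch  # apply your library's transformation functions here
    processed.to_sql("processed_table", engine, if_exists="append", index=False)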

Conclusion:

| Feature | SQL-heavy ETL pipelines | ETL pipelines running in a programming language |
| --- | --- | --- |
| Readability | Can become complex and hard to read with many CTEs and nesting | Generally more readable and maintainable, especially for complex logic |
| Maintainability | Challenging to maintain and understand over time | Easier to maintain and refactor |
| Debugging | Difficult to debug complex nested queries | Easier to debug using programming language tools |
| Expressiveness | Limited for intricate business logic | More flexible and expressive with full programming language features |
| SDLC integration | Might fall outside the standard development lifecycle | Better integration with the application codebase and SDLC practices |
| Testing | Harder to unit test granular components | Easier to unit test individual transformation steps |
| Development speed | Can be quick for simple transformations | Might be slower initially for complex mappings, but faster for complex logic |
| Performance | Can be highly optimized for set-based operations | Potential overhead of object mapping, but manageable with careful design |
| Team skillset | Requires strong SQL expertise | Leverages existing programming language skills |
| Complexity management | Complexity increases significantly with more transformations | Complexity can be better managed through code organization and modularity |
| Flexibility | Less flexible for complex, multi-step transformations | More flexible in handling diverse data manipulations and integrations |