When Fixtures Aren't Enough: Testing with Large Datasets in the Database

Aleš Blaznik
3 min read

We all know the value of test data. It helps us identify bugs, ensure functionality, and build confidence in our systems. But what happens when your trusty test data libraries start to slow down under the weight of millions of records? This is where exploring data locality and in-database test data generation comes in.

Fixture Fatigue: The Slowdown of Traditional Approaches

Many developers rely on fixture and factory libraries to create test data. These libraries work well for smaller datasets, but as your data volume grows, so does the time it takes to generate and manage that data. This can significantly slow down your test suite, hindering development velocity.

The Power of In-Database Test Data Generation

Here's where leveraging the power of your database itself becomes advantageous. By generating test data directly within the database, we can take advantage of data locality – the physical proximity of related data – to speed things up. This approach bypasses the overhead of external libraries and allows the database to optimize data access for testing purposes.
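To make this concrete, here is a minimal sketch of generating rows entirely inside the database. It assumes PostgreSQL (the snippet later in this post uses gen_random_uuid(), which points that way) and borrows the audit_logs column names from that snippet; the column types and values are illustrative assumptions, not the real schema.

-- Generate one million synthetic audit log rows directly in the database
-- using generate_series; no external fixture library involved.
-- Column types and values are assumptions for illustration only.
INSERT INTO audit_logs (id, user_id, type, data, metadata, created_at)
SELECT
    gen_random_uuid(),                      -- fresh primary key per row
    (random() * 1000)::int,                 -- spread rows across ~1000 users (assumed integer user_id)
    'user.login',                           -- placeholder event type
    '{}'::jsonb,                            -- placeholder payload (assumed jsonb)
    '{}'::jsonb,                            -- placeholder metadata (assumed jsonb)
    now() - (n || ' seconds')::interval     -- spread timestamps over the past ~11 days
FROM generate_series(1, 1000000) AS s(n);

Because the data never leaves the database, there is no per-row round trip from the test process, which is where most of the fixture overhead usually goes.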

A Real-World Example: Testing with Millions of Audit Logs

Let's say you're building a system that generates audit logs. You want to understand how the system behaves when it needs to process millions of these entries. Creating such a massive dataset manually (or with a fixture library) is impractical. Here's a quick and effective approach:

  1. Run the Test Suite: Start by running your test suite with your existing test data.

  2. Duplicate the Data: Within the database, duplicate the existing audit log data (generated by the initial test run) to create a larger dataset representing millions of records.

  3. Run the Test Suite Again: Rerun your test suite with the newly duplicated data.

  4. Evaluate Performance Impact: Analyze the test execution times. Did the test suite slow down significantly with the larger dataset?

This simple process helps identify potential bottlenecks. It can reveal issues like missing indexes or inefficient data loading practices.

-- Recursively build the numbers 1..1,000,000, then cross join them
-- against the existing audit_logs rows to bulk-duplicate the data,
-- giving each copy a fresh UUID.
WITH RECURSIVE num_of_replications AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM num_of_replications WHERE n < 1000000
)
INSERT INTO audit_logs (id, user_id, type, data, metadata, created_at)
SELECT gen_random_uuid(), a.user_id, a.type, a.data, a.metadata, a.created_at
FROM num_of_replications
CROSS JOIN audit_logs AS a;

The SQL snippet uses a recursive CTE (WITH RECURSIVE) to generate the numbers 1 through 1,000,000. It then cross joins that sequence against the existing audit_logs rows, inserting a copy of every original row (with a fresh UUID) for each number, so even a small seed dataset balloons into millions of records that mimic the structure and content of the real data.
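Two small follow-ups are worth considering around a bulk duplication like this, sketched below for PostgreSQL: refresh the planner statistics so query plans reflect the new table size, and, if the duplicated rows should not outlive the experiment, run everything inside a transaction you can roll back.

-- After the bulk insert, refresh planner statistics so subsequent
-- query plans account for the much larger table.
ANALYZE audit_logs;

-- If the inflated data is only needed for this experiment, run the
-- duplication and the queries under test inside a transaction and roll
-- it back when done (assumes your test run can happen in-transaction).
BEGIN;
-- ... duplication INSERT and test queries here ...
ROLLBACK;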

Benefits of In-Database Test Data Generation

  • Speed: By leveraging the database's processing power and data locality, you can significantly reduce test execution time.

  • Efficiency: No need to manage external libraries or large test data files.

  • Quick Insights: Provides a fast way to identify potential performance issues before they become critical.

Remember: While duplicating data isn't a perfect representation of real-world scenarios, it's a valuable tool for quick performance checks and identifying areas for improvement.
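One such quick check: run EXPLAIN ANALYZE on a query your tests exercise, before and after the duplication. The sketch below assumes PostgreSQL and a lookup by user_id ordered by created_at; the filter value and the suggested index are illustrative, not taken from any real system.

-- Compare this plan before and after inflating audit_logs.
-- A sequential scan over millions of rows usually signals a missing index.
EXPLAIN ANALYZE
SELECT *
FROM audit_logs
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 50;

-- If the plan shows a Seq Scan, an index along these lines may help:
-- CREATE INDEX audit_logs_user_id_created_at_idx
--     ON audit_logs (user_id, created_at DESC);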

By exploring in-database test data generation, you can overcome the limitations of traditional approaches and ensure your system performs optimally when dealing with large datasets.
