You Will Love Data Duplication!

In software development, we're often taught to follow the DRY principle (Don't Repeat Yourself) religiously. The idea that data should exist in exactly one place is deeply ingrained in our programming culture. Yet, in real-world systems, particularly in microservice architectures, strict adherence to this principle can lead to complex dependencies, performance bottlenecks, and fragile systems.

This article explores why duplicating data across multiple domains, while seemingly counterintuitive, can actually be the right architectural choice in many scenarios.

Data Duplication is Everywhere

Before we dive into the benefits of intentional data duplication, it's worth noting that duplication already happens at virtually every level of computing—often without us even realizing it.

Hardware-Level Duplication

At the hardware level, modern CPUs contain multiple layers of cache (L1, L2, L3) that duplicate data from main memory to improve access speed. Your RAM itself is a duplication of data from persistent storage. Even within storage devices, technologies like RAID mirror data across multiple disks for redundancy.

When you're running a program, the same piece of data might simultaneously exist in:

  • The hard drive or SSD

  • Main memory (RAM)

  • CPU L3 cache

  • CPU L2 cache

  • CPU L1 cache

  • CPU registers

Each level takes on the burden of consistency management in exchange for performance gains. This is duplication by design!

System-Level Duplication

Operating systems maintain file system caches, page caches, and buffer caches—all forms of data duplication designed to improve performance. Virtual memory systems duplicate portions of RAM to disk when memory pressure increases.

Consider what happens when multiple applications open the same file:

  • The kernel reads the file from disk once and keeps it in its file (page) cache

  • The first application gets its own copy in its private memory space

  • Each additional application gets yet another copy in its own memory space

With two applications open, that's already three copies of the same data, all serving different purposes.

Application-Level Duplication

Web browsers cache resources locally. Content Delivery Networks (CDNs) duplicate website assets across global edge locations. Database replicas maintain copies of the same data for read scaling and disaster recovery.

When you visit a website, the same image might exist:

  • On the origin server

  • In multiple CDN edge locations

  • In your browser's memory

  • In your browser's disk cache

Framework-Level Duplication

Modern frameworks implement various caching strategies that duplicate data: HTTP response caching, object caching, computed value memoization, and more. Each represents a deliberate choice to trade consistency for performance.
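To make that concrete, here is what computed value memoization looks like in Ruby. This is a minimal sketch, not tied to any particular framework; the service and method names are illustrative. The first call computes and stores the value locally; every later call reads the duplicate.

# Memoization: a deliberate, tiny duplication of a computed value.
class ExchangeRateService
  def current_rate
    # The first call pays for the computation; later calls read the local copy.
    @current_rate ||= fetch_rate_from_remote_api
  end

  private

  # Hypothetical slow call; stands in for any expensive computation or I/O.
  def fetch_rate_from_remote_api
    sleep(1) # simulate network latency
    1.0842
  end
end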

The reality is that data duplication isn't an anti-pattern—it's a fundamental strategy employed throughout computing to balance competing concerns like performance, availability, and resilience.

The Problem with Strict Data Normalization

In traditional monolithic applications, we strive for normalized database schemas where each piece of data exists in exactly one place. This approach works well when:

  1. All data access happens within a single application

  2. Transactions can span multiple tables

  3. Joins are efficient and low-cost

  4. The application scales vertically (bigger machines)

However, as systems grow and evolve toward distributed architectures, these assumptions break down:

  • Service Boundaries: Different services need access to the same data

  • Network Costs: Remote calls to fetch data introduce latency

  • Availability Concerns: Dependencies on other services create failure points

  • Scaling Challenges: Distributed transactions become complex and costly

Event-Driven Data Duplication in Microservices

A common pattern in microservice architectures is to use event-driven mechanisms to duplicate and synchronize data across service boundaries. Let's examine a typical example:

The Person-Employee Example

Consider a system with two separate services:

  1. An Identity and Access Management (IAM) service that manages user accounts and core identity information

  2. An HR/Office service that handles employment-specific information

Both services need access to person data like names and contact information. Rather than having the HR service constantly query the IAM service, each maintains its own copy of the data.

How Duplication Can Be Managed

A typical implementation might use an event-driven approach:

  1. The IAM service is the "source of truth" for core identity information

  2. When identity data changes in IAM, an event is published to a message bus

  3. The HR service subscribes to these events and updates its local copy

This pattern can be implemented with a helper like this (pseudocode):

# Example of a DuplicateFieldHelper in a Ruby-based system
def duplicate_fields(source_event_type, target_repository, mapping)
  # Subscribe to events from the source service
  subscribe_to_event(source_event_type) do |event|
    # Extract the ID of the changed record
    record_id = event.resource_id

    # Find the corresponding record in the local repository
    local_record = target_repository.find_by_source_id(record_id)
    next if local_record.nil?

    # Only consider fields that were actually changed in the event
    changed_fields = event.changed_fields
    fields_to_update = mapping.select { |source_field, _|
      changed_fields.include?(source_field)
    }

    # Build the update hash keyed by the local (target) field names,
    # reading the new values from the event payload by their source names
    updates = fields_to_update.each_with_object({}) do |(source_field, target_field), hash|
      hash[target_field] = event.data[source_field]
    end

    # Update the local copy with the new values
    target_repository.update(local_record.id, updates) unless updates.empty?
  end
end

# Usage example
duplicate_fields(
  "users.updated",
  EmployeeRepository,
  {
    first_name: :first_name,
    last_name: :last_name,
    email: :contact_email,
    profile_picture: :photo_url
  }
)

This pattern allows services to maintain local copies of data while ensuring they eventually reflect changes made in the authoritative source.
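The publishing side can stay equally small. Here is a rough sketch of what the IAM service might do after persisting a change; publish_event, PersonRepository, and the event shape are assumptions that mirror the consumer above, not an actual API.

# In the IAM service (pseudocode): after persisting a change to a person,
# publish an event describing exactly which fields changed.
def update_person(person_id, attributes)
  person = PersonRepository.find(person_id)
  changed_fields = attributes.keys.select { |field| person.public_send(field) != attributes[field] }

  PersonRepository.update(person_id, attributes)

  # Publishing only the changed fields lets subscribers update selectively.
  publish_event(
    "users.updated",
    resource_id: person_id,
    changed_fields: changed_fields,
    data: attributes
  )
end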

Advantages of Data Duplication in Microservices

1. Service Autonomy

By duplicating data, each service can operate independently without runtime dependencies on other services. If the IAM service is temporarily unavailable, the HR service can still function with its local copy of employee data.

2. Performance Optimization

Local data access is almost always faster than a remote call. By keeping a copy of frequently accessed data within each service, we eliminate network round-trips and reduce latency.

Consider a dashboard that displays employee information and their current projects. Without data duplication, rendering this dashboard might require:

Dashboard Service → IAM Service (get employee details)
                  → Project Service (get project assignments)
                  → Department Service (get department info)

With data duplication, it becomes:

Dashboard Service → Local Database (get all required data)

The performance difference can be dramatic, especially at scale.

3. Domain-Specific Data Models

Each service can model data according to its specific domain needs. The HR service can add employment-specific fields without affecting the IAM service's data model.

For example, the IAM service might store basic name information:

Person {
  id: bigint,
  first_name: String,
  last_name: String,
  email: String
}

While the HR service might enhance this with employment details:

Employee {
  id: bigint, // Same ID as the person's ID, because "Employee" is a "facet" of a person
  first_name: String,  // Duplicated from IAM
  last_name: String,   // Duplicated from IAM
  employee_number: String,
  position: String,
  department: String,
  hire_date: Date,
  manager_id: bigint   // References the manager's person ID
}

4. Resilience to Failures

If one service fails, others can continue operating with their local data. This creates a more robust system overall, as temporary outages in one service don't cascade throughout the entire system.

5. Simplified Queries

Complex queries can be performed locally without joining data across service boundaries, which would require complex distributed transactions or client-side joins.

6. Scalability

Services can scale independently based on their specific load patterns, without being constrained by dependencies on other services.

The Master-Duplicate Pattern

A key principle in successful data duplication is establishing clear ownership. A common approach is the "master-duplicate" pattern:

  1. Single Source of Truth: Each piece of data has one authoritative source (the "master"). For person identity information, the IAM service is the master.

  2. Controlled Propagation: Changes to master data are published as events, which duplicates consume to update their local copies.

  3. Unidirectional Flow: Updates only flow from master to duplicates, never the reverse. If a duplicate needs to change master data, it must request the change through the master service's API.

  4. Eventual Consistency: The system acknowledges that duplicates may temporarily be out of sync with the master, but will eventually converge.

This pattern provides a structured approach to data duplication that maintains data integrity while gaining the benefits of duplication.
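Principle 3 is the one most easily violated in practice: when the HR service needs to correct, say, a person's email address, it asks the IAM service to change it rather than writing to its local copy. A minimal sketch using Ruby's standard HTTP client; the endpoint and payload are assumptions.

require "net/http"
require "json"
require "uri"

# In the HR service: request a change to master data through the IAM API.
# The local copy is NOT written here; it will be refreshed when the
# resulting "users.updated" event arrives on the message bus.
def request_email_change(person_id, new_email)
  uri = URI("https://iam.example.internal/api/persons/#{person_id}")
  request = Net::HTTP::Patch.new(uri, "Content-Type" => "application/json")
  request.body = { email: new_email }.to_json

  Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
end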

DRY vs. Pragmatism: A Balanced View

The DRY principle remains valuable, but like all principles, it should be applied with nuance. Consider these perspectives:

"DRY is about knowledge duplication, not code duplication." — Dave Thomas, co-author of The Pragmatic Programmer

"Duplication is far cheaper than the wrong abstraction." — Sandi Metz

In microservice architectures, some level of data duplication is not just acceptable but often necessary. The key is to duplicate deliberately, with clear ownership and synchronization mechanisms.

Remember that DRY was conceived in an era dominated by monolithic applications. In distributed systems, strict adherence to DRY across service boundaries often leads to tight coupling—precisely what microservices aim to avoid.

When to Duplicate Data

Data duplication makes sense when:

  1. Service Boundaries Align with Business Domains: Each service owns a specific business capability and needs local access to relevant data.

  2. Read-Heavy Workloads: The data is read much more frequently than it's updated.

  3. Loose Coupling is Priority: You want to minimize runtime dependencies between services.

  4. Performance is Critical: Network calls to fetch data from other services would introduce unacceptable latency.

  5. Resilience Requirements: Services need to function even when other services are unavailable.

Challenges and Mitigations

Data duplication isn't without challenges:

1. Consistency Management

Challenge: Keeping duplicated data in sync can be complex.

Mitigation: Use event-driven architectures with clear ownership models. Implement idempotent update handlers and reconciliation processes.
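An idempotent handler can be as simple as recording the version (or timestamp) of the last event applied and ignoring anything older. A sketch, assuming the event carries a monotonically increasing version and the local record stores the last one seen (both are assumptions, not part of the helper shown earlier):

# Skip events we have already applied, or that are older than the last one seen.
def apply_update(repository, event)
  local_record = repository.find_by_source_id(event.resource_id)
  return if local_record.nil?
  return if local_record.source_version && local_record.source_version >= event.version

  repository.update(
    local_record.id,
    first_name: event.data[:first_name],
    last_name: event.data[:last_name],
    source_version: event.version
  )
end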

2. Increased Storage Requirements

Challenge: Duplicating data increases storage needs (though in practice, storage is cheap enough that this is rarely a real problem).

Mitigation: Be selective about what data you duplicate. Often, only a subset of fields needs duplication.

3. Complexity in System Understanding

Challenge: Developers need to understand which service owns which data.

Mitigation: Clear documentation and conventions. Use tooling to visualize data ownership and flow.

4. Eventual Consistency Implications

Challenge: Applications must handle the reality that data might be temporarily stale. Events can be dropped, update handlers can fail, and so on; this often has to be handled through periodic maintenance tasks or anti-corruption layers (ACLs).

Mitigation: Design UIs and APIs to gracefully handle eventual consistency. Consider showing "last updated" timestamps where appropriate. Write maintenance scripts that flag and correct data discrepancies once in a while.
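Such a maintenance script doesn't need to be sophisticated: fetch the authoritative records in batches, compare the duplicated fields, and patch whatever drifted. A rough sketch; the batch fetch on the IAM side and the repository methods are assumptions.

# Periodic reconciliation task (run nightly, for example).
def reconcile_employees
  EmployeeRepository.all.each_slice(500) do |employees|
    master_records = fetch_persons_from_iam(employees.map(&:id)) # assumed batch API, keyed by ID

    employees.each do |employee|
      master = master_records[employee.id]
      next if master.nil?
      next if employee.first_name == master[:first_name] &&
              employee.last_name == master[:last_name]

      EmployeeRepository.update(
        employee.id,
        first_name: master[:first_name],
        last_name: master[:last_name]
      )
    end
  end
end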

Real-World Example: E-Commerce Platform

Consider an e-commerce platform with these services:

  • Product Catalog Service: Manages product information

  • Inventory Service: Tracks stock levels

  • Order Service: Handles customer orders

  • Customer Service: Manages customer profiles

  • Shipping Service: Coordinates deliveries

The Order Service needs product information to display order details. Options include:

  1. No Duplication: Query the Product Catalog Service every time order details are viewed

  2. Full Duplication: Maintain a complete copy of all product data

  3. Selective Duplication: Store only the product data needed for order display (name, SKU, image URL)

Option 3 is often the best compromise. When a product is updated in the Product Catalog Service, an event is published, and the Order Service updates its local copy of the relevant fields.
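With the duplicate_fields helper sketched earlier, option 3 boils down to a single mapping. The event name, repository, and field names below are illustrative, not part of any real catalog API.

# In the Order Service: duplicate only the product fields needed to render an order.
duplicate_fields(
  "products.updated",
  ProductSnapshotRepository,
  {
    name: :product_name,
    sku: :product_sku,
    image_url: :product_image_url
  }
)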

In the End

Data duplication across multiple domains may feel like breaking the rules, but it's often a pragmatic solution to real-world challenges in distributed systems. By understanding when and how to duplicate data effectively, we can build systems that are more resilient, performant, and maintainable.

The next time someone invokes the DRY principle to argue against data duplication, remember that even your CPU is constantly duplicating data to work efficiently. Sometimes, the right architectural choice is to embrace controlled duplication rather than fight against it.

As with most engineering decisions, the answer isn't about following rules dogmatically—it's about understanding tradeoffs and choosing the approach that best serves your specific requirements.


This article is part of our series on API design and microservice architecture. Stay tuned for more insights into how we've built a scalable, maintainable system.


Written by Yacine Petitprez