Write -> Audit -> Publish (WAP) pattern with Iceberg

Akash Jain
8 min read

Ever heard of "Garbage In, Garbage Out"? Yeah, it's pretty much the golden rule of data. If your data's a mess, your insights will be too. And with the world increasingly relying on AI, this matters more than ever. That's why the Write, Audit, Publish (WAP) pattern is a must in my opinion – it's a super simple 3-step dance to make sure your data is always sparkling clean.

But pulling this off with old-school tools like Hive? That's not a dance; it's a clumsy wrestling match – expensive, complicated, and a total headache.

In this article, we will discuss what WAP is, why implementing it in Hive is such a pain, and how Apache Iceberg's awesome branching feature turns the whole thing into a total breeze. Let's dive in!

So, what is WAP?

At its core, WAP, i.e. the Write → Audit → Publish pattern, is a three-step process designed to ensure you have good-quality data. The principle is simple - check before you publish - much like how I checked this blog for errors before publishing. Let’s break it down:

  • Write: First, your ETL job processes new or updated data and writes it to a temporary, isolated staging area. Why isolated staging, you ask? So that no one can see this data while you audit it. It's like the draft version of a blog post - you have written it, but because it is not yet validated or audited, you keep it in drafts.

  • Audit: Once the data is written to the staging area, you run a series of rigorous quality checks against this staged data. This can involve checking for duplicates, ensuring referential integrity, or comparing the new data against benchmarks laid down by the business. Only if the data passes all these checks does it proceed.

  • Publish: If the audit is successful, the newly validated data is swapped with the current production data. This swap is atomic. Why? So that the transition from old to new data is instantaneous and seamless, ensuring that consumers always see a complete and valid state. If the audit fails, the staged data is simply discarded, leaving the production data untouched and you, the data engineer, go back to fixing the issues.

The beauty of WAP lies in its ability to guarantee consistency, prevent partial data exposure, and provide a clear rollback mechanism.
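
To make the three steps concrete, here is a minimal, engine-agnostic SQL sketch. The table names (raw_sales, sales_staging, sales) and the audit queries are purely illustrative - the interesting question, as we'll see, is how the final publish swap actually happens.

```sql
-- 1. WRITE: land the new batch in an isolated staging table (hypothetical names)
INSERT OVERWRITE TABLE sales_staging
SELECT * FROM raw_sales WHERE ingest_date = current_date();

-- 2. AUDIT: run quality checks against the staged data only
SELECT count(*) FROM sales_staging WHERE order_id IS NULL;                 -- expect 0
SELECT order_id FROM sales_staging GROUP BY order_id HAVING count(*) > 1;  -- expect no rows

-- 3. PUBLISH: only if every check passes, swap the staged data into production.
--    Whether this swap is atomic depends entirely on your table format.
INSERT OVERWRITE TABLE sales SELECT * FROM sales_staging;
```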

Implementing WAP with Hive: Welcome to the ‘wrestle mania’

Before the rise of modern table formats, the majority of data lakes relied on Apache Hive, which managed data as files in HDFS directories. Implementing WAP there is possible, but not without challenges:

  • Manual Directory Gymnastics: The ‘Publish’ phase requires physically moving data from the staging table to the production table. Now imagine doing that every day on a high-volume table. Moving data physically is resource intensive - you need more compute. It is also time consuming - your time to insight is delayed. It is not fun!

  • Lack of True Atomicity: Alright, so in a perfect world, when you "Publish" your squeaky-clean data, it should be like a magic trick: poof! The old data is gone, and the new data is there, instantly, for everyone, all at once, without a single hiccup. That's what atomic means – all or nothing, no in-between states.

    But with Hive, when you make that switch in the publish phase, you move the data physically. And during that awkward moment of transition, there is always a window where queries might fail or users might see inconsistent data. INSERT OVERWRITE TABLE doesn’t help either because it deletes everything and rewrites, making your table temporarily unavailable or incomplete.

    You know that feeling? When you're trying to pull off a complex manoeuvre, you just want everyone to look away for a second and hope you succeed. That's exactly how it feels. Not exactly the seamless experience we're dreaming of.

  • Schema Evolution Nightmares: Have you ever tried to change a table's schema in Hive, like adding a column or changing a data type? Bet you had to resort to a full table rewrite - scary even to say out loud, and not without some serious hair-pulling. Want to see what these "simple" steps look like? Imagine you're just trying to add a new column to your sales table. It might go something like this (with a quick HiveQL sketch after the steps):

    1. Read from sales table.

    2. Write all that data (plus your new column) to a temporary table, let's call it sales_new_i_cancelled_my_date_to_add_a_col (because that is the cost you will pay for something as simple as adding a column).

    3. Then, the rename dance:

      • sales (your old table) → sales_i_dont_know_when_to_delete (because you genuinely won't know when to delete it!)

      • sales_new_i_cancelled_my_date_to_add_a_col → sales (finally, your updated table!)
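
In HiveQL, that dance looks roughly like this - a sketch only, with an invented discount_code column and the same illustrative table names:

```sql
-- Steps 1 & 2: read the old table and rewrite all of it into a new table
-- that carries the extra column (discount_code is a made-up example)
CREATE TABLE sales_new_i_cancelled_my_date_to_add_a_col AS
SELECT s.*, CAST(NULL AS STRING) AS discount_code
FROM sales s;

-- Step 3: the rename dance - keep the old table around "just in case",
-- then promote the rewritten copy
ALTER TABLE sales RENAME TO sales_i_dont_know_when_to_delete;
ALTER TABLE sales_new_i_cancelled_my_date_to_add_a_col RENAME TO sales;
```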

Am I overcomplicating it? Nope, that's just Hive being Hive. It's not me, it's the system! And speaking of nightmares, let's see how a rollback would go down next...

  • Rollback & Cleanup Headaches: Imagine this: You've just pulled off a painful schema evolution, literally rewriting the whole darn table. You're leaning back, happily sipping your coffee, feeling like a data superhero. Then, BAM! The call comes in from the business: that shiny new column you just added? Wrong data type. Your coffee suddenly tastes like dread. If not planned meticulously, this kind of nightmare scenario can lead to data loss and force you into hours (or days!) of manual recovery.

    So, how do you roll back that sales table to its old, healthy version in Hive? Well, lucky for you, your smartness led you to preserve the old sales table as sales_i_dont_know_when_to_delete. Good call, because if you'd dropped it, you'd be chugging espressos through a string of all-nighters!

    Now, for the actual rollback dance (sketched right after these steps):

    1. First, you rename your current sales table (the one with the bad data) to something like sales_hail_iceberg_wont_miss_a_date (a little prayer for the future, perhaps?).

    2. Then, you rename your trusty sales_i_dont_know_when_to_delete (your backup) back to sales. Phew!
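
In HiveQL, the rollback is just two more renames (same illustrative table names as before):

```sql
-- 1. Park the bad version out of the way
ALTER TABLE sales RENAME TO sales_hail_iceberg_wont_miss_a_date;

-- 2. Restore the backup as the production table
ALTER TABLE sales_i_dont_know_when_to_delete RENAME TO sales;
```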

But here's the kicker: you won't rest easy until you've figured out when it is safe to delete that lingering sales_hail_iceberg_wont_miss_a_date table. In all likelihood, you'd secretly keep it, tucked away, until you change jobs – just in case you ever need to perform another rollback!

I hope you're not starting to hate Hive. But you can see how these challenges make Hive-based WAP implementations fragile, operationally intensive, and a constant source of anxiety for data teams.

Implementing WAP with Iceberg: The dance

Apache Iceberg, as an open table format, fundamentally changes the game by bringing database-like capabilities (ACID transactions, schema evolution, time travel) to your data lake. Its branching feature takes the WAP pattern from a headache to a delightful, controlled process.

Think of Iceberg branching like Git for your data. Here's how it makes WAP a breeze (an end-to-end Spark SQL sketch follows this list):

  • 1. Write (on a Branch): Instead of writing to an arbitrary staging directory, you now write to a dedicated branch of your Iceberg table (e.g., my_table@dev_feature, my_table@wap_cycle). This branch is an isolated version of your table's history. Production queries continue to read from my_table@main without any disruption or awareness of the changes happening elsewhere.

  • 2. Audit (on the Branch, with Time Travel): With the new data now committed to your dev_feature branch, the audit phase becomes incredibly powerful. You can run all your validation queries directly against my_table@dev_feature. Need to compare it to yesterday's production data? No problem! Iceberg's time travel allows you to easily query my_table@main (which points to the current production snapshot) or even my_table@main VERSION AS OF 'yesterday'. This enables precise, in-place comparative auditing.

  • 3. Publish (as an Atomic Merge): This is where Iceberg's branching truly shines. If your dev_feature branch successfully passes all audits, "publishing" the data simply involves merging that branch into your main (production) branch. This merge is an atomic metadata operation.

    • Atomicity: The main branch's pointer is instantaneously updated to reflect the state of the dev_feature branch. There is no data copying and no directory renaming.

    • Zero Downtime: Readers on main seamlessly switch to the new, merged data once the merge is complete. Ongoing queries see the snapshot they started with, ensuring complete consistency.

    • Effortless Rollback: If a critical bug somehow slips through and is discovered after the merge, you can use Iceberg's time travel to instantly roll back the main branch to a snapshot before the merge. Your data is immediately restored to a known good state.

    • Parallel Development: Multiple teams can work on different branches, each conducting their own WAP cycles in isolation. When ready, their branches can be merged into main independently, accelerating development without sacrificing quality.
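
Here is what a full WAP cycle could look like in Spark SQL with the Iceberg extensions enabled. Treat it as a sketch, not a drop-in recipe: the catalog name (catalog), table (db.sales), branch (wap_cycle), source table (staging_updates), and the audit queries are all illustrative, and the syntax assumes a reasonably recent Iceberg runtime that supports branches and the fast_forward procedure.

```sql
-- WRITE: stage this cycle's changes on an isolated branch
ALTER TABLE catalog.db.sales SET TBLPROPERTIES ('write.wap.enabled' = 'true');
ALTER TABLE catalog.db.sales CREATE BRANCH wap_cycle;
SET spark.wap.branch = wap_cycle;              -- this session now writes to the branch
INSERT INTO catalog.db.sales
SELECT * FROM staging_updates;                 -- lands on wap_cycle; main is untouched

-- AUDIT: query the branch directly and compare it against production via time travel
SELECT count(*) FROM catalog.db.sales VERSION AS OF 'wap_cycle' WHERE order_id IS NULL;
SELECT count(*) FROM catalog.db.sales VERSION AS OF 'main';
SELECT count(*) FROM catalog.db.sales TIMESTAMP AS OF '2025-01-01 00:00:00';  -- e.g. yesterday

-- PUBLISH: fast-forward main to the audited branch - one atomic metadata swap,
-- no data copies, no directory renames
CALL catalog.system.fast_forward('db.sales', 'main', 'wap_cycle');

-- ROLLBACK (if a bug is found after the merge): point main back at an earlier
-- snapshot id taken from the table's history - again, metadata only
CALL catalog.system.rollback_to_snapshot('db.sales', 1234567890123456789);
```

The fast_forward call is the entire publish step: main's pointer simply jumps to the branch's latest snapshot, which is exactly the atomic, zero-copy behavior described above.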

Conclusion

So there you have it! We've journeyed through the wild west of Hive's WAP patterns, full of manual gymnastics, atomic glitches, schema nightmares, and those dreaded rollback headaches. It is less about data engineering and more about data wrestling, right?

But now we have Apache Iceberg with its super-powered branching feature, waving its magic wand! By letting you Write on a Branch, Audit with Time Travel, and Publish with an Atomic Merge, Iceberg transforms the Write, Audit, Publish pattern from a complex, risky, and manually intensive chore into a streamlined, automated, and supremely reliable workflow.

It's like finally bringing the best practices of software development – think isolated environments, atomic commits, and that glorious "Ctrl+Z" for your entire dataset – directly into your data world. No more sweating bullets, no more data graveyards, just clean, high-quality data effortlessly powering your business decisions.
