What a Legacy Script Taught Me About Performance, Parsing, and Patience

Fayaz Khan

🚀 How I Reduced an 8-Hour Legacy Extraction Script to 2 Hours (Without Breaking Things)

🧩 Background

I was asked to urgently review a legacy data extraction script that had started failing during high-load scenarios. The script was originally designed for simpler, one-off use cases but hadn’t evolved to handle today's multi-consumer, full-load demands.

I initially intended just to patch the issues. But once I saw the deeper patterns, I realized there was a bigger opportunity to modernize how large-scale extractions are handled in this environment — safely and efficiently.


🔍 Problems I Identified

  1. Fragile Delimiters

    • The script used | for columns and non-printable SOH for record separation.

    • Unfortunately, some fields (like description) included these characters, which made parsing highly unreliable.

  2. Heavy Memory Usage

    • The full MQL query output was stored in a string variable, printed to the console, and then written to a file. For large datasets, this led to memory exhaustion and script failures.

  3. File Handle Leaks

    • Files were opened but not explicitly closed. Over time, this caused “too many open files” errors — a silent and painful issue in long-running processes.

  4. Inefficient Per-ID Data Fetching

    • After extracting primary data, the script iterated over each ID and called execute program to fetch dependent information — sequentially. This was painfully slow.

  5. Hardcoded Credentials

    • Passwords for launching MQL commands were hardcoded in the shell script — a security and maintainability risk.

🛠️ What I Did Differently

✅ Safe Delimiters

  • Replaced fragile delimiters with long, unique strings:

    • COLUMNSEPARATORCHANGEACTION (column)

    • RECORDSEPARATORCHANGEACTION (record)

  • During parsing, used string search instead of split to correctly isolate columns and rows (a minimal sketch follows below).
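
Here is a minimal TCL sketch of that parsing step. Only the two separator strings above are real; the procedure name and sample data are illustrative. The reason plain split doesn't work here is that Tcl's split treats its second argument as a set of individual characters, not as one long delimiter.

```tcl
# Sketch of multi-character delimiter parsing (names and sample data are
# illustrative). Tcl's [split] cannot split on a long delimiter string,
# so we walk the text with [string first] instead.

proc splitOnString {text sep} {
    set parts {}
    set start 0
    set sepLen [string length $sep]
    while {[set idx [string first $sep $text $start]] >= 0} {
        lappend parts [string range $text $start [expr {$idx - 1}]]
        set start [expr {$idx + $sepLen}]
    }
    lappend parts [string range $text $start end]
    return $parts
}

set colSep "COLUMNSEPARATORCHANGEACTION"
set recSep "RECORDSEPARATORCHANGEACTION"

# Example dump: 2 records x 2 columns, one field containing a "|"
set dump "CA-0001${colSep}Release note${recSep}CA-0002${colSep}Pipe|spec update"

foreach record [splitOnString $dump $recSep] {
    set columns [splitOnString $record $colSep]
    puts [join $columns " / "]
}
```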

✅ Field Data Cleanup

  • Some actual data fields contained non-printable ACK or SOH characters.

  • During parsing (while fetching dependent info), I replaced:

    • ACK with 'nonprintableack'

    • SOH with 'nonprintablesoh'

  • When writing the final output, I used actual ACK (0x06) as the column separator and SOH (0x01) as the record separator (see the sketch below).
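
A rough TCL sketch of that cleanup, assuming the control characters arrive embedded in the raw field values; the field contents and output file name are illustrative, while the placeholder tokens are the ones listed above.

```tcl
# Sketch: neutralise embedded control characters in field data, then emit the
# cleaned record with real ACK (0x06) as column separator and SOH (0x01) as
# record separator. Field values and file name are illustrative.

set ACK \x06
set SOH \x01

# Replace any ACK/SOH found inside a field value with printable placeholders.
proc cleanField {value} {
    return [string map [list \x06 "nonprintableack" \x01 "nonprintablesoh"] $value]
}

set fields [list "CA-0001" "Contains an ACK\x06right here" "Approved"]

set cleaned {}
foreach f $fields {
    lappend cleaned [cleanField $f]
}

# With the data sanitised, the control characters are safe to use as separators.
set out [open "final_output.txt" a]
puts -nonewline $out [join $cleaned $ACK]
puts -nonewline $out $SOH
close $out
```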

✅ Direct File Output

  • Used MQL’s output keyword to write query results directly to a file — eliminating the large in-memory variable.
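
In TCL terms the change looks roughly like this; the type, name pattern, and select list are placeholders, and the point is simply that the result set never passes through a script variable on its way to disk.

```tcl
# Illustrative only: the type and select list are placeholders. The
# "output <file>" clause is the MQL feature described above; this runs inside
# the MQL Tcl shell, where the mql command is available.

set sep     "COLUMNSEPARATORCHANGEACTION"
set outFile "/tmp/changeaction_primary.dat"

# Before: the whole result set was captured in a variable, echoed to the
# console, and then written to a file -- several copies of a very large string.
#   set result [mql temp query bus "Change Action" * * select id name dump $sep]
#   puts $result

# After: MQL streams the result set straight to disk; nothing is buffered here.
mql temp query bus "Change Action" * * select id name dump $sep output $outFile
```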

✅ Modular Script Design

  • Split the legacy TCL script into two:

    1. One for extraction and writing results to a file.

    2. One for parsing and fetching dependent details.

✅ Shell Orchestration + Parallelism

  • Shell script was modified to:

    • Split extracted data into chunks (100,000 records each).

    • Launch multiple dependent extraction scripts in parallel.

    • Wait for completion and merge results into a final output (the overall pattern is sketched below).
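
The real orchestration lives in the shell script, but the split / fan-out / wait / merge pattern it implements can be sketched in TCL along these lines. The file names, the worker script, and its ".out" convention are illustrative; only the chunk size comes from the actual setup.

```tcl
# Rough sketch of the chunking and parallel fan-out pattern.

set chunkSize 100000
set srcFile   "changeaction_primary.dat"

# 1. Split the primary extract into fixed-size chunks.
set in    [open $srcFile r]
set n     0
set count 0
set out   [open "chunk_$n.dat" w]
while {[gets $in line] >= 0} {
    puts $out $line
    if {[incr count] % $chunkSize == 0} {
        close $out
        set out [open "chunk_[incr n].dat" w]
    }
}
close $out
close $in

# 2. Launch one dependent-extraction worker per chunk, in parallel.
set pipes {}
foreach chunk [glob -nocomplain chunk_*.dat] {
    lappend pipes [open "|tclsh extract_dependent.tcl $chunk" r]
}

# 3. Wait for every worker: closing a command pipeline blocks until it exits.
foreach p $pipes {
    while {[gets $p line] >= 0} {}    ;# drain any console output
    close $p
}

# 4. Merge per-chunk results (assumed to be written as <chunk>.out) into one file.
set final [open "final_output.dat" w]
foreach part [lsort [glob -nocomplain chunk_*.dat.out]] {
    set src [open $part r]
    fcopy $src $final
    close $src
}
close $final
```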

✅ File Handle Cleanup

  • Added proper close statements after reading files to prevent descriptor exhaustion.
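
The pattern itself is tiny; here is a generic sketch using Tcl 8.6's try/finally, with an illustrative file name.

```tcl
# Sketch: make sure the descriptor is released even if processing fails.
set fh [open "chunk_0.dat" r]
try {
    while {[gets $fh line] >= 0} {
        # ... process $line ...
    }
} finally {
    close $fh
}
```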

✅ Secure Credential Management

  • Integrated a pre-existing Java utility to fetch credentials from Azure Key Vault at runtime.

  • Removed all hardcoded passwords from shell and script logic.
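
In rough terms, the script side looks like the sketch below. The jar name, its flags, the secret name, and the context user are hypothetical placeholders; only the idea of resolving the credential at runtime reflects the actual change.

```tcl
# Hypothetical invocation: jar name, arguments, secret name, and user are
# placeholders. The real lookup goes through a pre-existing internal Java
# utility backed by Azure Key Vault; exec returns its stdout.
set password [exec java -jar keyvault-client.jar --secret mql-extract-password]

# The credential exists only in memory for the lifetime of this run;
# nothing is hardcoded in the script or the shell wrapper.
mql set context user extract_svc password $password
```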


⚡ Results

  • Runtime reduced from ~8 hours to ~2 hours.

  • Script now executes consistently without failure under full-load conditions.

  • Design is modular, secure, and far easier to maintain.


🧠 Reflection

Most of these improvements weren’t flashy — just careful fixes applied with a fresh perspective:

  • Knowing when not to trust legacy delimiters.

  • Understanding how memory pressure and file-handle leaks creep in silently.

  • Realizing the value of breaking things into composable chunks.

Sometimes, fixing something isn't about rewriting it all — it’s about seeing where the old assumptions no longer hold, and calmly rebuilding around those.


📝 Disclaimer

Note: Final data validation is still pending from the downstream data consumer teams.
This post focuses purely on the engineering-level improvements around stability, performance, and design.


💡 Final Note: Format Reflection

After publishing this post, I received thoughtful feedback about using structured formats like JSON, JSONL, or Parquet — especially when working with data lakes and scalable pipelines.

In this particular case:

  • MQL and TCL didn’t natively support structured formats, making it challenging to emit JSON or Parquet directly.

  • The downstream ingestion process currently works with delimited files, but it’s possible that the data lake platform itself supports more robust formats like JSON or Parquet.

This experience made me reflect on where structured formats could eventually improve safety, scalability, and parsing clarity — especially if the upstream tooling evolves or we introduce a translation step before ingestion.

It was a small step technically, but a big step in understanding the broader ecosystem I’m part of.


🙋‍♂️ Have You Faced Something Similar?

If you’ve ever worked on long-running data pipelines or inherited legacy scripts under pressure, I’d love to hear how you tackled it.
Feel free to share your thoughts or lessons below.


Written by

Fayaz Khan

Hi, I’m Fayaz Khan — a PLM Solution Architect with deep hands-on experience in 3DEXPERIENCE, real-world integrations, and enterprise system behavior. I work at the intersection of engineering, security, and systems thinking, navigating between PLM platforms, cloud infrastructure, APIs, and practical business needs. I prefer digging into why something works (or breaks), rather than just how to configure it. My writing is an attempt to document the often-overlooked details — the silent bugs, the edge cases, the security gaps — and sometimes, the mental models that help me make sense of it all.