Struggles, Wins, and Scribe-Data: My Outreachy Journey

muhamad asif
2 min read

Embarking on my Outreachy project with Scribe-Data was a mix of excitement and apprehension. The goal? Build a CLI tool to parse massive Wikidata Lexeme (LID) dumps—a critical step for multilingual translation workflows. Little did I know how much I’d learn about bzip2 compression, parsing JSON at scale, and the intricacies of lemmas and senses.

The Learning Cliff: From QIDs to Lexemes

Wikidata’s structure felt overwhelming at first. Mentors Andrew, Will, and Henrik patiently guided me through its core components:

  • QIDs (Wikidata Items): The backbone of Wikidata (e.g., Q42 = Douglas Adams).

  • Lexemes (LIDs): My project’s focus—these represent words like “run” (L12345) with forms (“ran”, “running”) and senses (e.g., “run” as a verb vs. noun).

  • LexicalCategory: A property defining a word’s part of speech (verb, noun, etc.).

Understanding how these pieces fit together was like learning a new language. But once I grasped how lemmas (base words like “happy”) map to their grammatical variants (forms like “happier”), the data started to make sense.
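To make that mapping concrete, here is a trimmed-down sketch of the JSON shape a lexeme entity takes in the dump. The field names follow the dump format; the ID `L12345` and the values are illustrative, and the QIDs in the comments are my own annotations.

```python
# A simplified sketch of one Wikidata lexeme entity (fields abbreviated).
lexeme = {
    "type": "lexeme",
    "id": "L12345",
    "lemmas": {"en": {"language": "en", "value": "run"}},
    "lexicalCategory": "Q24905",  # QID for "verb"
    "forms": [
        {
            "id": "L12345-F1",
            "representations": {"en": {"language": "en", "value": "ran"}},
            "grammaticalFeatures": ["Q1392475"],  # QID for "simple past"
        }
    ],
    "senses": [
        {
            "id": "L12345-S1",
            "glosses": {"en": {"language": "en", "value": "to move quickly on foot"}},
        }
    ],
}

# The lemma is the base word; each form is a grammatical variant of it.
lemma = lexeme["lemmas"]["en"]["value"]
variants = [f["representations"]["en"]["value"] for f in lexeme["forms"]]
```

Once you see that a lexeme is just a lemma plus a list of forms and a list of senses, the rest of the dump stops looking so alien.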

Battling the Wikidata Dump

The real test came with the full Wikidata dump—an 87GB+ JSON file compressed with bzip2. My first CLI attempts crashed my system. The answer for Scribe-Data was to switch to the lexeme-only dump, which weighs in at around 355MB and contains exactly the entities we need.

The problem? Wikidata’s dumps are monolithic. Loading them entirely into memory was a rookie mistake.

The fix: iterative parsing. Using Python’s ijson library together with the built-in bz2 module, I streamed lexemes out of the dump one entity at a time instead of loading the whole file:

import bz2
import ijson

# Stream the lexeme dump (one huge JSON array) without ever
# decompressing it to disk or holding it all in memory.
with bz2.open("latest-lexemes.json.bz2", "rb") as f:
    for lexeme in ijson.items(f, "item"):  # yields one entity at a time
        if lexeme.get("type") == "lexeme":
            process_lexeme(lexeme)  # extract lemmas, forms, senses
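The `process_lexeme` call above is a placeholder. A minimal version—my own sketch, not Scribe-Data’s actual implementation—might pull out the lemma, forms, and senses like this:

```python
def process_lexeme(lexeme: dict) -> dict:
    """Extract the fields we care about from one lexeme entity.

    A sketch only: real handling must cope with missing languages,
    multiple lemmas per lexeme, and malformed entries.
    """
    lemmas = {lang: v["value"] for lang, v in lexeme.get("lemmas", {}).items()}
    forms = [
        {
            "id": form["id"],
            "values": {
                lang: r["value"]
                for lang, r in form.get("representations", {}).items()
            },
            "features": form.get("grammaticalFeatures", []),
        }
        for form in lexeme.get("forms", [])
    ]
    senses = [
        {lang: g["value"] for lang, g in sense.get("glosses", {}).items()}
        for sense in lexeme.get("senses", [])
    ]
    return {
        "id": lexeme["id"],
        "lexicalCategory": lexeme.get("lexicalCategory"),
        "lemmas": lemmas,
        "forms": forms,
        "senses": senses,
    }
```

Keeping this step pure (dict in, dict out) made it easy to unit-test without touching the 355MB dump.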

Small Wins, Big Confidence

  1. CLI Optimization
    Rewrote the tool to accept filters (e.g., --lexicalCategory=verb), leveraging Wikidata Properties (PIDs) such as P5185 (grammatical gender).

  2. CamelCase Formatting
    Struggled for days aligning output with camelCase standards.
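The filtering step boils down to mapping a user-facing category name to its Wikidata QID and checking each streamed lexeme against it. A hypothetical sketch (the names and QID table are mine, not the tool’s actual code):

```python
# Map user-facing category names to the Wikidata QIDs used in the
# lexicalCategory field of each lexeme entity.
CATEGORY_QIDS = {
    "verb": "Q24905",
    "noun": "Q1084",
    "adjective": "Q34698",
}

def matches_category(lexeme: dict, category: str) -> bool:
    """True if the lexeme's lexicalCategory QID matches the requested name."""
    qid = CATEGORY_QIDS.get(category)
    return qid is not None and lexeme.get("lexicalCategory") == qid
```

Because the check runs per streamed entity, filtering costs almost nothing on top of the parse itself.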
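Much of the camelCase struggle reduced to converting snake_case keys on output. A small helper of my own, for illustration:

```python
def to_camel_case(name: str) -> str:
    """Convert a snake_case identifier to camelCase,
    e.g. grammatical_features -> grammaticalFeatures."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)
```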

Why This Matters

Lexemes bridge human language and structured data. By streamlining their extraction, our CLI tool helps developers build better translation apps and dictionaries.


Growth Beyond Code

The Outreachy organizers’ mantra—“Learning can be hard! Struggles are part of growth”—kept me going. Errors that once felt like failures (a crashed parser, a mislabeled lemma) became milestones.

To future interns: Embrace the grind. Every bzip2 decompression error or malformed JSON teaches resilience. And mentors? Ask them everything.


What’s Next

I’m now integrating Wikidata Properties (PIDs) to link lexemes to semantic relationships—another leap into the unknown. Grateful to Andrew, Henrik, Will, and Outreachy for this transformative experience. Onward! 🚀


Key Takeaways

  • Wikidata’s lexical structure is complex but deeply rewarding.

  • CLI tools demand efficiency; streaming > loading.

  • Mentorship turns roadblocks into stepping stones.
