Cleaning Up Legacy Data: From Wikipedia Links to Wikidata IDs in Open Library


Software developers often avoid maintenance work, partly because companies rarely incentivize it. This is one reason why some tech companies have a reputation for abandoning products after launch. But maintenance is important, which is why I'm doing some data cleanup work after a recent feature.
Context
I recently shipped a feature for Open Library that displays Wikidata (a Wikipedia sister project) information on author pages, including Wikipedia links. This enhancement automatically shows Wikipedia links in the correct language and updates when new Wikipedia pages are created about an author - as long as we have their Wikidata ID.
The Problem
A long time ago, Open Library had a field on authors specifically for Wikipedia links. While users can no longer add or edit these legacy Wikipedia links through the UI (they now use a generic links field instead), the old links are still visible. Simply moving these links to the generic field would work, but we'd lose the benefits of language-aware links and the other Wikidata integration features we're planning. On top of that, a bunch of the links were broken anyway.
The Solution
Replace the old Wikipedia field entries with their corresponding Wikidata IDs.
Implementation Steps:
1. Download the Open Library author data dump
2. Query the dump to find authors with Wikipedia links
3. Use the Open Library API to add Wikidata IDs and remove Wikipedia links (sketched below)
Here's the query I used to extract authors and their Wikipedia links:

```sql
SELECT
  column1 AS key,
  json_extract(column4, '$.wikipedia') AS wikipedia
FROM read_csv('~/Downloads/ol_dump_2024-07-31/ol_dump_authors_2024-07-31.txt.gz')
WHERE wikipedia IS NOT NULL;
```
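For step 3, each confirmed author gets the Wikidata ID written to its record and the legacy link dropped. Here's a minimal sketch of that update; reading `/authors/{olid}.json` is Open Library's public API, but the authenticated session, the PUT-based save, and the helper name are assumptions, not my exact script:

```python
import requests

OL_BASE = "https://openlibrary.org"

def move_wikipedia_to_wikidata(olid: str, qid: str, session: requests.Session) -> None:
    """Replace an author's legacy `wikipedia` field with a Wikidata ID (sketch)."""
    url = f"{OL_BASE}/authors/{olid}.json"
    record = session.get(url).json()

    # Author records keep external identifiers under `remote_ids`.
    record.setdefault("remote_ids", {})["wikidata"] = qid

    # Drop the legacy field; the Wikidata ID now carries the link.
    record.pop("wikipedia", None)

    # Assumption: saving via an authenticated PUT to the same endpoint.
    session.put(url, json=record).raise_for_status()
```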
The dump revealed 1,786 authors with Wikipedia links - a tad too many to manage in spreadsheets. So, with the help of AI, I built a small tool to help review them quickly.
The Tool
I built the review tool using Vue.js, Tailwind, and PocketBase. I chose this stack primarily to experiment with PocketBase, which I'm considering for future projects. It's an excellent solution for simple database hosting. I also wanted to try PocketHost, though they've since removed their free tier (fortunately, I'm grandfathered in).
For importing data into PocketBase, I used pocketbase-import. The initial import was slow, so I [requested](https://github.com/michal-kapala/pocketbase-import/issues/1) batch import capability from the creator, which they added soon after.
While using the tool, I discovered many authors already had correct Wikidata IDs and just needed their redundant Wikipedia links removed. I wrote a script to extract Wikidata IDs from Wikipedia links, which helped identify cases where we could simply remove the Wikipedia link without manual review.
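The extraction works because every Wikipedia article exposes its Wikidata item through the MediaWiki API's `pageprops`. A simplified sketch of the idea (not my exact script):

```python
import requests
from urllib.parse import unquote, urlparse

def qid_from_wikipedia_url(wikipedia_url: str) -> str | None:
    """Look up the Wikidata QID behind a Wikipedia article URL."""
    # e.g. https://en.wikipedia.org/wiki/Octavia_E._Butler
    parsed = urlparse(wikipedia_url)
    title = unquote(parsed.path.removeprefix("/wiki/"))

    resp = requests.get(
        f"https://{parsed.netloc}/w/api.php",
        params={
            "action": "query",
            "prop": "pageprops",
            "ppprop": "wikibase_item",
            "titles": title,
            "redirects": 1,  # follow article-level redirects
            "format": "json",
        },
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    return page.get("pageprops", {}).get("wikibase_item")
```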
This automation reduced the manual review cases to 209 authors - much more manageable! During the review process, I also identified and merged duplicate author records using the Open Library QuickStatements tool I've been developing.
After processing these records, I found a small oversight in my query logic. I had been comparing `QID != QID_from_wikipedia`, but an inequality check never matches when both sides are null, so I also needed to manually review the cases where both fields were null.
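In Python terms (the names here are illustrative, not from my actual tool), the corrected check looks like this:

```python
def needs_review(qid: str | None, qid_from_wikipedia: str | None) -> bool:
    """Flag rows whose IDs disagree, plus rows with no ID from either source."""
    if qid is None and qid_from_wikipedia is None:
        # `None != None` is False, so a plain inequality check
        # quietly skipped these rows.
        return True
    return qid != qid_from_wikipedia
```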
To ensure completeness, I wrote a verification script to check the live data for any missed Wikipedia links. This caught a few redirects from merged authors, which I fixed.
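The verification pass can be as simple as re-fetching each author from the live site and checking what's still there. A sketch, assuming `olids` holds the author keys collected from the dump query; merged authors leave behind records of type `/type/redirect` pointing at the surviving record:

```python
import requests

def check_author(olid: str) -> str | None:
    """Return a problem description for a live author record, or None if clean."""
    record = requests.get(f"https://openlibrary.org/authors/{olid}.json").json()

    # A merged author leaves behind a redirect record.
    if record.get("type", {}).get("key") == "/type/redirect":
        return f"redirects to {record.get('location')}"

    # The legacy field should be gone after cleanup.
    if "wikipedia" in record:
        return f"still has a wikipedia link: {record['wikipedia']}"

    return None

olids = ["OL23919A"]  # author keys collected from the dump query
for olid in olids:
    if (problem := check_author(olid)):
        print(olid, problem)
```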
Now Open Library is a little easier to use and the data is a little cleaner!
Video walkthrough: https://youtu.be/QQRKMWFK5yE
Next Steps:
- Run the query again after the next data dump to catch any new Wikipedia links
- Submit a PR to remove the old Wikipedia logic from Open Library
- Remove the Wikipedia field from the Open Library schema