Automating Query Corrections with GitHub Workflows

muhamad asif
3 min read

When working with multilingual lexical data, ensuring completeness and consistency is crucial. Our workflow extracts lexical forms from structured data sources such as the Wikidata lexeme dump and compares them, via SPARQL, against predefined queries. This helps us verify that our queries cover all expected forms and catch discrepancies early. In this post, we'll explore how to automate this validation process to ensure data integrity across multiple languages.

In Scribe-Data, the language_data_extraction directory organizes supported languages into folders, with each language folder containing subfolders for supported data types (e.g., nouns, verbs, adverbs). Within these subfolders, SPARQL files are used to fetch lexical data for grammatical features, like this:

The directory structure for the "language_data_extraction" project, containing language folders with subfolders for grammatical categories.

However, we face two key challenges:

  1. Listing all possible grammatical features for a given data type in a specific language (e.g., all forms that nouns or verbs can take).

  2. Verifying that our SPARQL queries account for all these grammatical features, which, if overlooked, could lead to incomplete or inconsistent data extraction.

To address these challenges, we need a mechanism that tracks the forms for each data type directly from Wikidata. This would ensure that all forms are captured accurately across languages, ultimately improving data quality and consistency.

One of the goals for improving efficiency is to create a GitHub workflow that automates the process of handling errors in queries. Ideally, this workflow would identify any missing forms or errors in the queries and, upon encountering one, automatically open a pull request (PR) with the corrected query, including the missing forms. This approach ensures that the tedious task of manually writing or fixing queries is eliminated. Instead, we can focus on simply reviewing the automatically generated queries, streamlining the process while maintaining accuracy and productivity. 😊

Automated Query Correction Workflow

The image shows an automated workflow for processing SPARQL queries, extracting data from Scribe-Data, generating a missing-features JSON file, and maintaining quality checks.

  1. Triggering the Workflow:
    The GitHub workflow can be initiated either manually by developers or through a scheduled trigger. This flexibility ensures that updates are handled regularly or as needed.

Workflow Initialization:
Once triggered, the check_and_update_missing_query_forms.yaml configuration file starts the workflow. It ensures the process only runs within the scribe-org/Scribe-Data repository. If this condition isn’t met, the workflow halts immediately.
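As a rough sketch of how such a guard can look (this is illustrative, not the actual contents of check_and_update_missing_query_forms.yaml), the triggers and the repository check can be expressed like this:

```yaml
# Illustrative sketch only: run on a schedule or on demand, but only
# inside the scribe-org/Scribe-Data repository — forks halt immediately.
name: Check and Update Missing Query Forms

on:
  schedule:
    - cron: "0 0 * * 0" # weekly run
  workflow_dispatch:    # manual trigger by developers

jobs:
  check-forms:
    if: github.repository == 'scribe-org/Scribe-Data'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
```

The `if` condition on the job is what halts the workflow when the repository condition isn't met.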

Check and Update Missing Query Forms runner

  1. Parsing SPARQL Queries:
    The workflow scans all *.sparql files to:

    • Extract languages and data types for each query.

    • Identify optional forms associated with unique QIDs.
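The parsing step above can be sketched in Python. The function name and the regular expressions here are illustrative assumptions, not Scribe-Data's actual implementation; the idea is simply to collect QIDs referenced through the `wd:` prefix and single out those appearing inside OPTIONAL blocks:

```python
import re

def parse_sparql_query(query_text: str) -> dict:
    """Hypothetical sketch: collect entity QIDs from a SPARQL query and
    the subset that appears inside OPTIONAL { ... } blocks (the forms)."""
    # Any QID referenced via the Wikidata entity prefix, e.g. wd:Q1084
    all_qids = re.findall(r"wd:(Q\d+)", query_text)

    # Optional forms live inside OPTIONAL { ... } blocks (non-nested here)
    optional_blocks = re.findall(r"OPTIONAL\s*\{(.*?)\}", query_text, re.DOTALL)
    optional_qids = set()
    for block in optional_blocks:
        optional_qids.update(re.findall(r"wd:(Q\d+)", block))

    return {"all_qids": all_qids, "optional_form_qids": optional_qids}
```

Running this over every *.sparql file yields, per query, the language and data-type QIDs plus the set of optional form QIDs to compare against the dump.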

  2. Extracting Dump Forms:
    Lexeme dumps are processed to:

    • Convert ISO language codes and data types into QIDs.

    • Gather unique forms into a set for efficient comparison.
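A minimal sketch of this extraction step, assuming simplified dump entries and hypothetical ISO-to-QID mapping tables (the real project derives these mappings from Wikidata):

```python
# Hypothetical mapping tables for illustration only.
LANGUAGE_QIDS = {"en": "Q1860", "de": "Q188"}
DATA_TYPE_QIDS = {"nouns": "Q1084", "verbs": "Q24905"}

def extract_dump_forms(dump_entries: list, iso_code: str, data_type: str) -> set:
    """Collect the grammatical-feature QID combinations found in a lexeme
    dump for one language/data-type pair, as a set for cheap comparison."""
    language_qid = LANGUAGE_QIDS[iso_code]
    category_qid = DATA_TYPE_QIDS[data_type]
    forms = set()
    for entry in dump_entries:
        if entry["language"] != language_qid or entry["category"] != category_qid:
            continue
        for form in entry.get("forms", []):
            # A form is identified by the sorted tuple of its feature QIDs
            forms.add(tuple(sorted(form["grammaticalFeatures"])))
    return forms
```

Using tuples of feature QIDs as set elements makes the later intersection step a plain set operation.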

  3. Intersecting Results:
    By comparing result_sparql and result_dump as sets, the workflow identifies the missing forms, which are then stored in a missing_features.json file (kept out of version control via .gitignore).
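The comparison itself reduces to a set difference. A minimal sketch, with hypothetical feature tuples and an illustrative JSON layout for missing_features.json:

```python
import json

def find_missing_forms(result_sparql: set, result_dump: set) -> set:
    """Forms present in the lexeme dump but absent from the SPARQL query."""
    return result_dump - result_sparql

# Example with two hypothetical feature-QID tuples:
missing = find_missing_forms(
    result_sparql={("Q110786",)},
    result_dump={("Q110786",), ("Q146786",)},
)

# Persist for the query-generation step; the file itself is listed in
# .gitignore, so it never lands in the PR.
with open("missing_features.json", "w") as f:
    json.dump({"English": {"nouns": [list(m) for m in sorted(missing)]}}, f, indent=2)
```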

  4. Generating Queries:

    • Queries are converted into a readable format using the lexeme_Form_Metadata.json file for consistency and clarity.

    • CamelCase naming conventions are applied for uniformity.

    • Error checks are maintained to ensure smooth execution.
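The label-generation step can be sketched as follows. The metadata dictionary below is a hypothetical slice of what lexeme_Form_Metadata.json provides, used here only to show the QID-to-label lookup and the camelCase join:

```python
# Hypothetical excerpt of lexeme_Form_Metadata.json: feature QID -> label.
FORM_METADATA = {
    "Q110786": "singular",
    "Q146786": "plural",
    "Q131105": "nominative",
}

def form_label(feature_qids: tuple) -> str:
    """Join feature labels into one camelCase identifier,
    e.g. ('Q131105', 'Q146786') -> 'nominativePlural'.
    Unknown QIDs fall back to the QID itself so errors stay visible."""
    labels = [FORM_METADATA.get(qid, qid) for qid in feature_qids]
    head, *rest = labels
    return head + "".join(word.capitalize() for word in rest)
```

The QID fallback doubles as a lightweight error check: an unmapped feature shows up verbatim in the generated query, where a reviewer can spot it.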

Successful GitHub Actions workflow run for "Check and Update Missing Query Forms," detailing the steps taken to create a pull request.

Creating a Pull Request:
Using peter-evans/create-pull-request@v5, the workflow automatically generates query files and a summary of changes for the PR body. This ensures any missing forms are accounted for and corrections are ready for review.
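A sketch of what this final step can look like in the workflow file, assuming the corrected query files were written to the working tree earlier in the job (the branch name and messages are illustrative):

```yaml
# Illustrative step only; inputs shown are standard options of the
# peter-evans/create-pull-request action.
- name: Create Pull Request
  uses: peter-evans/create-pull-request@v5
  with:
    commit-message: "Add missing forms to SPARQL queries"
    title: "Update SPARQL queries with missing forms"
    body: "Automatically generated corrections for queries with missing forms."
    branch: automated-query-fixes
```

The action commits any changed files on a dedicated branch and opens (or updates) the PR, so each run leaves at most one open PR to review.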

The image shows the automatically generated PR description.

By tackling these challenges, we can ensure that the data extraction process becomes more robust, comprehensive, and reliable. This approach will enhance the overall quality of lexical data, supporting accurate and consistent multilingual applications.
