Beyond Scorecards: Building a Natural Language Stats Engine

Sanchit Jain
10 min read

Ever wanted to know "Who hit the most sixes for Royal Challengers Bengaluru while chasing in 2024?" or "What was MS Dhoni's highest score in a winning cause?". We'll answer these questions by the end of this blog. Traditionally, answering such specific cricket questions required sifting through scorecards or, for a programmer, writing complex SQL queries. But what if you could just ask in plain English?

That's exactly what I set out to build: an IPL Natural Language Stats Explorer. This project combines the power of Large Language Models (LLMs) with a structured relational database, and wires the two together to turn your everyday IPL questions into answers backed by data.

Let's dive into how it works!

The Big Picture: Architecture Overview

At its core, the system follows this flow:

  1. User Input: You ask a question in natural language.

  2. Question Refinement: An LLM rephrases your question into a more detailed, unambiguous form.

  3. SQL Generation: Another LLM, equipped with knowledge of our database schema and specific IPL context, translates the refined question into a PostgreSQL query.

  4. Database Execution: The generated SQL query is run against the IPL database.

  5. Answer Formulation: The raw data from the database is passed back to an LLM, which generates a user-friendly answer.

  6. Display: The answer, along with the intermediate steps (refined question, SQL query, raw result), is shown.

ER-diagram of the database

1. The Foundation: Data Layer (PostgreSQL & SQLAlchemy)

No stats engine can exist without data. I built this on PostgreSQL, a robust open-source relational database.

Schema (model.py):
The database schema is defined using SQLAlchemy, a Python ORM. Key tables include (a minimal sketch of a few of these models follows the list):

  • Teams: Stores IPL team names.

  • Players: Stores unique player names.

  • Venues: Stores IPL venue names.

  • Matches: Detailed information for each match (season, date, teams involved, winner, toss details, runs scored by each team).

  • PlayerTeams: Tracks which player played for which team in a given season (many-to-many relationship).

  • Deliveries: Ball-by-ball data for every match – the most granular level. Includes batter, bowler, runs, extras, wicket details.

  • PlayerMatches: Aggregated player statistics for each match (runs scored, balls faced, wickets taken, etc.). This is crucial for performance and simplifies many queries.
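
To make the schema concrete, here is a minimal sketch of a few of these models in SQLAlchemy. The column names below are illustrative assumptions rather than the project's exact model.py, but they show the declarative pattern used throughout.

    # Minimal sketch of the declarative models (column names are assumptions, not the real model.py)
    from sqlalchemy import Column, Date, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Player(Base):
        __tablename__ = "players"
        id = Column(Integer, primary_key=True)
        name = Column(String, unique=True, nullable=False)   # standardized full name

    class Match(Base):
        __tablename__ = "matches"
        id = Column(Integer, primary_key=True)
        season = Column(Integer, nullable=False)
        match_date = Column(Date)

    class PlayerMatch(Base):
        __tablename__ = "player_matches"
        id = Column(Integer, primary_key=True)
        player_id = Column(Integer, ForeignKey("players.id"), nullable=False)
        match_id = Column(Integer, ForeignKey("matches.id"), nullable=False)
        runs_scored = Column(Integer, default=0)
        balls_faced = Column(Integer, default=0)
        sixes = Column(Integer, default=0)
        wickets_taken = Column(Integer, default=0)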

Data Ingestion (data_insert.py, loader.py, util.py):
Populating this database is a multi-step process, with data quality and standardization being key:

  1. Loading Raw Data (loader.py): The primary source of our IPL data is cricsheet.org. Cricsheet is a great open-source initiative that provides ball-by-ball data for a vast number of cricket matches, including the IPL. They make this data available in structured formats, primarily JSON, which makes it easy to process. For this project, individual match data is downloaded as separate JSON files. Here's an example of what a part of a Cricsheet JSON file looks like:

     {
       "info": { /* ... match metadata ... */ },
       "innings": [
         {
           "team": "Team Name",
           "overs": [
             {
               "over": 0,
               "deliveries": [
                 {
                   "batter": "V Kohli",
                   "bowler": "JJ Bumrah",
                   "runs": { "batter": 6, "extras": 0, "total": 6 }
                   /* ... other delivery details ... */
                 }
               ]
             }
           ]
         }
       ]
     }
    

    The load_match_files function in loader.py is responsible for iterating through these Cricsheet JSON files in a specified directory and loading their full content.

  2. The Player Name Conundrum & Resolution (util.py, data_insert.py):
    A significant challenge was standardizing player names. Cricsheet often uses short or abbreviated names (e.g., "V Kohli" instead of "Virat Kohli"), but users are likely to ask questions using full names or common well-known names. Furthermore, short names can create ambiguity between players with similar initials, so standardizing them was crucial for consistent querying.
    To address this, I took the following steps (a sketch of the resolution flow follows this numbered list):

    • Finding ESPN IDs: The people.csv file provided by Cricsheet contains mappings from player names to their unique ESPN Cricinfo IDs (key_cricinfo). The get_espn_id function in util.py leverages this to find a player's ESPN ID.

    • Fetching Full Names via ESPN API: I then needed a reliable way to get the official full name. After some research, I found a working (unofficial) ESPN API endpoint: http://core.espnuk.org/v2/sports/cricket/athletes/{espn_id}.
      The get_player_full_name function in util.py queries this API using the ESPN ID to retrieve the displayName field from the JSON returned.

    • Caching with PLAYER_MAP: To avoid repeated API calls and to have a persistent mapping, successfully resolved names are stored in player_map.json. This map is loaded at the start and updated during the insert_players process. If a Cricsheet name is already in PLAYER_MAP, its standardized version is used directly; otherwise, the ESPN API lookup is performed.
      This process ensures that whether Cricsheet says "MS Dhoni" or "M S Dhoni" and the user asks about "Mahendra Singh Dhoni", the system can still identify the same player in the database.

  3. Venue and Team Name Normalization (util.py, data_insert.py):
    Similar to player names, cricket data often has inconsistencies in venue and team names.

    • VENUE_MAP (in util.py) standardizes venue names (e.g., "Feroz Shah Kotla" becomes "Arun Jaitley Stadium, Delhi").

    • TEAM_MAP (in data_insert.py) standardizes team names, especially for franchises that have rebranded (e.g., "Kings XI Punjab" becomes "Punjab Kings", "Delhi Daredevils" becomes "Delhi Capitals").

  4. Populating Tables (data_insert.py):

    • insert_teams and insert_venues: Populate Teams and Venues tables from predefined lists, applying the mappings discussed above.

    • insert_players: Iterates through match data, identifies unique players, resolves their names using the PLAYER_MAP and ESPN API lookup, and populates the Players table. It also populates PlayerTeams to link players to teams for specific seasons.

    • insert_matches: Parses match metadata (teams, venue, winner, toss, etc.), using the standardized names to look up corresponding IDs, and stores match records.

    • insert_deliveries: This is the most detailed step, processing ball-by-ball data from each match's innings, linking batters, bowlers (using their standardized names/IDs), and outcomes.

    • insert_player_matches: After deliveries are inserted, this function aggregates performance statistics for each player in each match (runs, wickets, sixes, etc.) by querying the Deliveries table and stores them in PlayerMatches. This significantly speeds up common player-centric queries (a sketch of this aggregation also follows the list).

  5. Database Connection (db.py): This script sets up the SQLAlchemy engine, session, and includes the initial calls to create tables and run the insertion scripts.

  6. Database Hosting (render.com): The resulting PostgreSQL database is hosted on render.com's free plan.
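
To make steps 1 and 2 concrete, here is a simplified sketch of the loading and name-resolution flow. The function names mirror the ones mentioned above (load_match_files, get_espn_id, get_player_full_name), but their exact signatures and the people.csv column names are my assumptions, not the project's actual code.

    import csv
    import glob
    import json

    import requests

    ESPN_ATHLETE_URL = "http://core.espnuk.org/v2/sports/cricket/athletes/{espn_id}"

    def load_match_files(directory):
        # Yield the parsed contents of every Cricsheet JSON file in a directory
        for path in sorted(glob.glob(f"{directory}/*.json")):
            with open(path) as f:
                yield json.load(f)

    def get_espn_id(cricsheet_name, people_csv="people.csv"):
        # Look up a player's ESPN Cricinfo id from Cricsheet's people.csv
        # (assumes 'name' and 'key_cricinfo' columns)
        with open(people_csv, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("name") == cricsheet_name and row.get("key_cricinfo"):
                    return row["key_cricinfo"]
        return None

    def get_player_full_name(espn_id):
        # Fetch the official displayName from the (unofficial) ESPN endpoint
        resp = requests.get(ESPN_ATHLETE_URL.format(espn_id=espn_id), timeout=10)
        resp.raise_for_status()
        return resp.json().get("displayName")

    def resolve_player_name(cricsheet_name, player_map):
        # Use the PLAYER_MAP-style cache first; fall back to the ESPN lookup
        if cricsheet_name in player_map:
            return player_map[cricsheet_name]
        espn_id = get_espn_id(cricsheet_name)
        full_name = get_player_full_name(espn_id) if espn_id else cricsheet_name
        player_map[cricsheet_name] = full_name   # later persisted to player_map.json
        return full_name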
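
And here is roughly what the insert_player_matches aggregation in step 4 does: summing per-player numbers out of the ball-by-ball table once it is populated. It assumes a Delivery model mapped to the Deliveries table (with match_id, batter_id, and batter_runs columns) plus the PlayerMatch model from the earlier sketch; the real data_insert.py also aggregates bowling figures, extras, and dismissal details.

    from sqlalchemy import case, func, select

    def insert_player_matches(session, match_id):
        # Aggregate batting stats for one match from the Deliveries table
        rows = session.execute(
            select(
                Delivery.batter_id,
                func.sum(Delivery.batter_runs).label("runs_scored"),
                func.count().label("balls_faced"),   # simplification: wides not excluded
                func.sum(case((Delivery.batter_runs == 6, 1), else_=0)).label("sixes"),
            )
            .where(Delivery.match_id == match_id)
            .group_by(Delivery.batter_id)
        ).all()
        for batter_id, runs, balls, sixes in rows:
            session.add(PlayerMatch(player_id=batter_id, match_id=match_id,
                                    runs_scored=runs, balls_faced=balls, sixes=sixes))
        session.commit()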

2. The Brain: LLMs and Langchain (main.py)

This is where the natural language understanding and SQL generation magic happens.

  • LLM Choices: The system is flexible, allowing selection from models like Gemini Flash or Llama 3 via Groq, handled by get_llm.

  • Langchain: This powerful framework handles the interactions between the user, LLMs, and the database.

    • SQLDatabase.from_uri(): Connects Langchain to our PostgreSQL database.

    • Prompt Engineering (system_message): This is arguably the most crucial part of the LLM interaction. The system_message is a carefully crafted prompt that guides the LLM on:

      • The SQL dialect (PostgreSQL).

      • Query constraints (e.g., top_k results).

      • Database schema details (table and column names).

      • IPL-Specific Logic: It includes rules like how to identify players batting first/second, on winning/losing teams, and emphasizes using standardized names (e.g., "Royal Challengers Bengaluru" not "Bangalore"). It also explains that overs are 0-indexed. There are additional rules for calculating stats and returning the results.

      • Examples of good question-to-SQL conversions.

    • LangGraph (StateGraph): Langchain's LangGraph is used to define a stateful, multi-step process for handling a query. The graph, built in the build_graph function, runs the following sequence (a minimal sketch of the wiring follows this list):

      1. natural_language_expand: Takes the user's potentially vague question and uses an LLM to expand it into a more detailed and unambiguous set of bullet points. This helps the next step and spares the user from having to phrase a perfectly precise prompt.

      2. write_query: The expanded question, along with the system_message and schema info, is fed to an LLM (configured for structured output like a QueryOutput Pydantic model) to generate the SQL query.

      3. execute_query: The generated SQL is run against the database using Langchain's QuerySQLDataBaseTool.

      4. generate_answer: The raw SQL result, the original question, and the generated query are given to an LLM to synthesize a final, human-readable answer in Markdown.
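
To give a feel for the wiring, here is a minimal sketch of this graph, assuming the langgraph, langchain-google-genai, and langchain-community packages. The node bodies are heavily trimmed, the one-line prompts stand in for the real system_message, and the connection string is a placeholder.

    from typing import TypedDict

    from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool
    from langchain_community.utilities import SQLDatabase
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langgraph.graph import END, START, StateGraph

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")                    # or a Groq-hosted Llama 3
    db = SQLDatabase.from_uri("postgresql+psycopg2://user:pass@host/ipl")     # placeholder DSN
    db_tool = QuerySQLDataBaseTool(db=db)

    class State(TypedDict):
        question: str
        expanded: str
        query: str
        result: str
        answer: str

    def natural_language_expand(state: State) -> dict:
        msg = llm.invoke(f"Rewrite this IPL question as detailed, unambiguous bullet points:\n{state['question']}")
        return {"expanded": msg.content}

    def write_query(state: State) -> dict:
        # The real app feeds the full system_message plus schema info and uses structured output (QueryOutput)
        msg = llm.invoke(f"Write a single PostgreSQL query answering:\n{state['expanded']}")
        return {"query": msg.content}

    def execute_query(state: State) -> dict:
        return {"result": db_tool.invoke(state["query"])}

    def generate_answer(state: State) -> dict:
        msg = llm.invoke(
            f"Question: {state['question']}\nSQL: {state['query']}\nResult: {state['result']}\n"
            "Answer the question in Markdown."
        )
        return {"answer": msg.content}

    def build_graph():
        graph = StateGraph(State)
        for node in (natural_language_expand, write_query, execute_query, generate_answer):
            graph.add_node(node.__name__, node)
        graph.add_edge(START, "natural_language_expand")
        graph.add_edge("natural_language_expand", "write_query")
        graph.add_edge("write_query", "execute_query")
        graph.add_edge("execute_query", "generate_answer")
        graph.add_edge("generate_answer", END)
        return graph.compile()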

3. The Interface: Gradio (main.py)

A simple yet effective web interface is built using Gradio.

  • Users can select their preferred LLM model.

  • A textbox allows users to type their IPL questions.

  • Outputs are clearly displayed in separate Markdown sections: Final Answer, Expanded Question, SQL Query, and Raw Query Result.

  • It also displays an image of the database schema for user reference.

The process_question function in main.py ties everything together, taking the user's input and LLM choice, running it through the LangGraph, and returning the structured results to the Gradio interface.
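
A stripped-down version of that wiring might look like the sketch below. The component labels and the process_question signature are assumptions based on the description above, and build_graph refers to the graph sketched in the previous section.

    import gradio as gr

    def process_question(model_name, question):
        # The real app picks the LLM via get_llm(model_name); this sketch ignores the choice
        graph = build_graph()
        state = graph.invoke({"question": question})
        return state["answer"], state["expanded"], state["query"], state["result"]

    demo = gr.Interface(
        fn=process_question,
        inputs=[
            gr.Dropdown(["Gemini Flash", "Llama 3 (Groq)"], label="Model"),
            gr.Textbox(label="Ask an IPL question"),
        ],
        outputs=[
            gr.Markdown(label="Final Answer"),
            gr.Markdown(label="Expanded Question"),
            gr.Markdown(label="SQL Query"),
            gr.Markdown(label="Raw Query Result"),
        ],
    )

    if __name__ == "__main__":
        demo.launch()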

Why Gradio?

Gradio is similar to Streamlit, but I found it even easier for creating AI and ML demos. It also integrates well with Hugging Face Spaces, which makes hosting and sharing your work with others straightforward.

Key Challenges & Learnings

  • Data Normalization: Ensuring consistent team, player, and venue names was unavoidable. The PLAYER_MAP, TEAM_MAP, and VENUE_MAP, along with the ESPN API integration, were essential but time-consuming to develop.

  • Prompt Engineering: Crafting the system_message was an iterative process. Getting the LLM to reliably produce correct SQL, respect naming conventions, and understand IPL-specific nuances required careful instruction and examples.

  • Schema Design for Queries: The PlayerMatches table, although redundant in a fully normalized database, was a crucial optimization. It aggregates player stats per match, making many common queries much simpler and faster than hitting the Deliveries table each time. It also helps with stats like "the most 50s in 2024": without PlayerMatches, the SQL query would need to scan every 2024 delivery, group the rows by batter and match, total each batter's runs per innings, and then count the scores of fifty or more, which quickly becomes tedious and complex. Precomputing these per-match statistics was essential (see the query sketch after this list).

  • Handling Edge Cases: Cricket has too many! Ensuring toss decisions correctly determine batting order, correctly identifying players on winning vs. losing teams, classifying what counts as a debut match, handling stats that need minimum-qualification thresholds, and more, all required careful thought while designing the schema, inserting the data, and prompting the LLM.
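
To see why the precomputation pays off, here is what the "most 50s in 2024" query can look like once PlayerMatches exists. Table and column names follow the earlier sketch and are assumptions, not the schema's exact spelling.

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://user:pass@host/ipl")   # placeholder DSN

    most_fifties_2024 = text("""
        SELECT p.name, COUNT(*) AS fifties
        FROM player_matches pm
        JOIN players p ON p.id = pm.player_id
        JOIN matches m ON m.id = pm.match_id
        WHERE m.season = 2024
          AND pm.runs_scored >= 50
        GROUP BY p.name
        ORDER BY fifties DESC
        LIMIT 5
    """)

    with engine.connect() as conn:
        for name, fifties in conn.execute(most_fifties_2024):
            print(name, fifties)

Each row of PlayerMatches is already one innings for one player, so the query reduces to a couple of joins and a filter instead of a full pass over the Deliveries table.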

What's Next?

This IPL Natural Language Stats Explorer is already quite powerful, but there's always room for improvement:

  • More Complex Queries: Supporting comparative analyses ("Player A vs Player B") or trend-based questions.

  • Visualizations: Integrating charts or graphs for results.

  • Error Handling & Feedback: More robust feedback if a query can't be understood or if no data is found.

  • Performance Optimizations: For extremely large datasets or very complex queries, further indexing or query optimization might be needed.

Exciting Possibilities of a Detailed Cricket Database

My current database is quite detailed, but there's still so much more cricket-specific data to explore, like ball trajectory data and the wagon wheel. A comprehensive database benefits everyone, from casual viewers to professional coaches. It could lead to:

  • Data-driven storytelling, allowing fans and analysts to create narratives about specific cricket situations with precise data, thus enhancing cricket discussions.

  • Improved picks and analysis for fantasy league players, helping them optimize their strategies.

  • A well-structured dataset ideal for developing advanced statistical models to forecast match outcomes, predict player performance, or simulate strategic team decisions, paving the way for next-level cricket intelligence.

Conclusion

Building this IPL Stats Explorer has been a valuable learning experience at the intersection of sports data, relational databases, and the fast-growing abilities of Large Language Models. It shows how natural language can be a powerful tool for retrieving complex data, making in-depth IPL insights available to everyone, not just SQL experts.

Answering the questions asked at the start (Answers directly from the engine):

  • “Who hit the most sixes for Royal Challengers Bengaluru while chasing in 2024?”

    • Will Jacks hit the most sixes (15) for Royal Challengers Bengaluru while chasing in 2024.
  • "What was MS Dhoni's highest score in a winning cause?”

    • MS Dhoni's highest score in a winning cause was 75 runs. This happened on March 31, 2019, during the 2019 season, in a match where Chennai Super Kings played against Rajasthan Royals.

Feel free to try it out with your own queries or explore the code! I'd love to hear your feedback.

Project Link: https://huggingface.co/spaces/sanchitjain1107/ipl-natural-language-query

GitHub Link: https://github.com/Sanchit-Jain07/ipl-natural-query-executor


Written by

Sanchit Jain

Hi! I'm Sanchit. A curious mind always exploring data, coding, and AI/ML. I blog about my learning journey, diving into everything from analytics to artificial intelligence and machine learning. Join me as I write, learn, and share insights along the way!