Hearable: Bringing Board Books and Rhymes to Life for Toddlers with Google Gemini

Jeffrey mak
3 min read

Introduction

In an era where technology intertwines seamlessly with daily life, the potential to enhance early childhood education through AI is immense. Recognizing this, I embarked on creating Hearable—a GenAI-powered voice companion tailored for 2-year-olds. Hearable bridges the tactile world of board books with the dynamic capabilities of AI, transforming passive reading into an interactive auditory experience.

Project Significance

Toddlers often express themselves with charming mispronunciations and fragmented phrases. Hearable is designed to understand these unique expressions, interpret their intent, and respond with relevant audio cues. For instance, when a child says “fire twuck,” Hearable recognizes it as “fire truck,” searches for a corresponding siren sound, and plays it back, enriching the child’s learning experience.

Beyond sound effects, Hearable recites beloved nursery rhymes, aligning auditory stimuli with visual elements in books. This multisensory approach not only captivates toddlers but also reinforces language development and comprehension.

Technical Implementation

Leveraging Google Gemini SDK

The backbone of Hearable is the Google Gemini SDK, a powerful tool that simplifies the integration of large language models into applications. Here’s how it was utilized:

  1. Structured Output with JSON Mode: Ensures consistent and parsable responses, facilitating seamless integration with other components (a brief sketch combining JSON mode with few-shot prompting follows this list).

  2. Few-Shot Prompting: Provides the model with examples to better understand and interpret toddler speech patterns.

  3. Function Calling: Enables the model to execute specific functions based on user input, such as searching for audio clips.

  4. Grounding: Anchors the model’s responses in factual data, enhancing reliability.

  5. Document Understanding: Allows the model to interpret and recite nursery rhymes accurately.
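
To make the first two ideas concrete, here is a minimal sketch, assuming a made-up intent schema (ToddlerIntent) and invented few-shot examples rather than Hearable's actual prompt; it uses the google-genai SDK's JSON mode (response_mime_type and response_schema):

from pydantic import BaseModel
from google import genai
from google.genai import types

# Illustrative output schema: what we want back for every toddler utterance.
class ToddlerIntent(BaseModel):
    intent: str  # e.g. "play_sound" or "recite_rhyme"
    query: str   # normalized phrase, e.g. "fire truck siren"

client = genai.Client()

# Hypothetical few-shot examples mapping mispronounced speech to intents.
prompt = """You interpret toddler speech for a reading companion.
Examples:
Child says: "fire twuck" -> {"intent": "play_sound", "query": "fire truck siren"}
Child says: "tinkle staw" -> {"intent": "recite_rhyme", "query": "Twinkle Twinkle Little Star"}

Now interpret:
Child says: "choo choo twain"
"""

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",  # JSON mode: consistent, parsable output
        response_schema=ToddlerIntent,
    ),
)

print(response.text)    # raw JSON, e.g. {"intent": "play_sound", "query": "train whistle"}
print(response.parsed)  # the same result parsed into a ToddlerIntent instance

Because the reply is guaranteed to be valid JSON matching the schema, the rest of the pipeline can parse it without guesswork.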

Setting Up the Environment

To build an application like Hearable, follow these steps:

  1. Install the Google Cloud SDK: This provides the necessary tools to interact with Google Cloud services.

  2. Enable the Vertex AI API: This API allows access to Google’s AI models, including Gemini.

  3. Set Up Authentication: Obtain an API key from Google AI Studio and configure your environment to use it (a minimal client setup is sketched after this list).

  4. Install Required Libraries: Ensure that libraries such as google-cloud-aiplatform and google-genai (the Google Gen AI SDK used in the code below) are installed.
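
As a rough illustration of step 3, the google-genai client can be constructed with an AI Studio API key read from an environment variable; the Vertex AI variant is included only as an assumed alternative, and the project ID is a placeholder:

import os
from google import genai

# Gemini Developer API: use an API key from Google AI Studio.
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

# Alternative (assumed setup): route requests through Vertex AI instead.
# client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")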

Implementing Function Calling

Function calling is pivotal in transforming user intent into actionable tasks. For instance, when a child requests a sound, the model identifies the intent and executes a function to retrieve the appropriate audio clip.
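
A minimal sketch of that flow might look like the following; the play_sound function, its schema, and the sample utterance are hypothetical stand-ins for Hearable's real audio lookup, declared as a tool so the model can request that the app run it:

from google import genai
from google.genai import types

client = genai.Client()

# Hypothetical function declaration: the app, not the model, actually plays the sound.
play_sound_declaration = {
    "name": "play_sound",
    "description": "Play a sound effect for something the child mentioned.",
    "parameters": {
        "type": "object",
        "properties": {
            "sound_name": {
                "type": "string",
                "description": "Canonical sound name, e.g. 'fire truck siren'.",
            },
        },
        "required": ["sound_name"],
    },
}

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents='The child said: "fire twuck!"',
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[play_sound_declaration])],
    ),
)

# When the model decides the tool is needed, it returns the function name and
# arguments instead of plain text; the app then fetches and plays the clip.
part = response.candidates[0].content.parts[0]
if part.function_call:
    print(part.function_call.name, dict(part.function_call.args))
    # e.g. play_sound {'sound_name': 'fire truck siren'}

The application then performs the actual lookup and playback with those arguments, and can optionally send the result back to the model as a function response to continue the conversation.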

Here’s a simplified example of the closely related grounding capability, using the Google Search tool so the model can anchor its answer in live web results:

from google import genai
from google.genai.types import Tool, GenerateContentConfig, GoogleSearch

client = genai.Client()
model_id = "gemini-2.0-flash"

google_search_tool = Tool(
    google_search = GoogleSearch()
)

response = client.models.generate_content(
    model=model_id,
    contents="When is the next total solar eclipse in the United States?",
    config=GenerateContentConfig(
        tools=[google_search_tool],
        response_modalities=["TEXT"],
    )
)

for each in response.candidates[0].content.parts:
    print(each.text)
# Example response:
# The next total solar eclipse visible in the contiguous United States will be on ...

# To get grounding metadata as web content.
print(response.candidates[0].grounding_metadata.search_entry_point.rendered_content)

Combining function calling with grounded search results like this helps the application respond appropriately to the wide variety of things a toddler might say.

Achievements

Through the integration of the Google Gemini SDK, Hearable has achieved:

  • Enhanced Understanding: Accurately interprets toddler speech, even with mispronunciations.

  • Dynamic Responses: Generates engaging and contextually appropriate replies.

  • Interactive Learning: Transforms traditional reading into an immersive auditory experience.

Future Enhancements

To further augment Hearable’s capabilities:

  • Smart Device Integration: Connect with devices like Google Nest Mini for hands-free interactions.

  • Expanded Audio Sources: Incorporate platforms like YouTube and Spotify for a broader range of audio content.

  • Feedback Mechanisms: Implement reflection loops to refine responses and reduce inaccuracies.

  • Educational Modules: Introduce features like phonetics games and basic math challenges to diversify learning.

Conclusion

Building an LLM-powered application like Hearable is both feasible and rewarding, especially with tools like the Google Gemini SDK at your disposal. By leveraging its capabilities, developers can create applications that not only entertain but also educate, laying the foundation for a new era of interactive learning.
