Milestone 1: Successfully Added Image Generation Capabilities to the LLM4S Library


🎨 TL;DR

Milestone 1 (implementing the first milestone from the GSoC proposal) delivers a robust, extensible image-generation subsystem in llm4s, featuring:

  • A unified factory and client abstraction (ImageGenerationClient)

  • Two plug-and-play providers: Stable Diffusion WebUI & Hugging Face Inference

  • A rich options model (prompts, sizes, seeds, samplers, negative prompts)

  • Type-safe error-handling, asynchronous-ready API, and automated file-saving

  • Cross-compiled for Scala 2/3, well-documented, and comprehensively unit-tested

This foundation enables future providers (Midjourney, DALL·E), UI integration, batching, and async enhancements.

See PR #76 for the implementation.


Introduction

LLM4S is an open-source library designed to unify and simplify interaction with Large Language Models (LLMs) from Scala rather than Python. As part of the 2025 Google Summer of Code (GSoC) with the Scala Center, my project aims to enhance llm4s with comprehensive image generation capabilities, enabling developers and end users to seamlessly leverage powerful generative AI models directly through a unified API.

This milestone (Pull Request #76) implements that capability and lays the architectural foundation for future enhancements such as Midjourney and DALL·E providers, batching, and UI-level integrations.

Why Image Generation?

Image generation is rapidly becoming essential across industries, powering creative tools, content creation, automation, and richer user interfaces. By integrating robust generative models directly into llm4s, developers gain immediate access to state-of-the-art image generation without dealing with provider-specific intricacies, significantly reducing integration complexity.

How Do Generative Models Work?

Picture yourself describing a scene to a skilled illustrator who can hear every nuance in your voice, flip instantly through millions of art books for inspiration, and then paint at super-human speed. That’s essentially what a modern text-to-image system does, but instead of brushes, it wields neural networks and probability theory. Here’s the journey, start to finish:


1. Reading Your Idea

When you type a prompt—say, “a retro-futuristic airship drifting across a neon sunset”—the model’s language encoder (often a CLIP‐style transformer) translates each word into high-dimensional vectors. These vectors capture not just dictionary meanings but vibes: retro, futuristic, sunset all sit in different regions of a semantic map the network learned from billions of captioned images.


2. Guiding the Imagination with Cross-Attention

The text vectors don’t create pixels directly; they act more like a creative director whispering instructions. Inside the image generator, a UNet listens to those instructions through cross-attention layers, constantly asking, “Does this patch of the picture match the idea of ‘neon’ or ‘airship’ yet?” If not, it adjusts.


3. From Snowstorm to Sketch—Diffusion in Latent Space

Instead of beginning with a blank canvas, diffusion models start with pure visual noise—digital snow. Over 30–100 iterative steps, the network predicts and subtracts noise, gradually revealing structure. Crucially, Stable Diffusion does this in a compressed latent space (a 64 × 64 × 4 tensor) learned by an auto-encoder. Working small makes the process 10-times faster and keeps the model file under 5 GB [2].

A slider called classifier-free guidance scale lets you choose between faithfulness to the prompt (higher values) and creative freedom (lower). Samplers (DDIM, Euler, DPM++, etc.) control the rhythm of noise removal, a bit like choosing charcoal versus watercolor for sketching.
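
To make this step concrete, here is a deliberately simplified Scala sketch of a denoising loop with classifier-free guidance. It is illustrative pseudocode over plain arrays, not code from llm4s or from any diffusion library: predictNoise stands in for the UNet, and the final subtraction stands in for a sampler's update rule.

// Illustrative pseudocode only (not llm4s code): a toy denoising loop with
// classifier-free guidance. predictNoise stands in for the UNet; passing
// None asks for the unconditional prediction, Some(prompt) for the
// prompt-conditioned one.
def denoise(
    initialNoise: Array[Double],
    prompt: String,
    steps: Int,
    guidanceScale: Double
)(predictNoise: (Array[Double], Option[String]) => Array[Double]): Array[Double] = {
  var latent = initialNoise
  for (_ <- 1 to steps) {
    val uncond = predictNoise(latent, None)          // what the model would paint with no prompt
    val cond   = predictNoise(latent, Some(prompt))  // what it would paint for this prompt
    // Classifier-free guidance: a higher guidanceScale pushes the estimate
    // harder toward the prompt-conditioned direction.
    val guided = uncond.zip(cond).map { case (u, c) => u + guidanceScale * (c - u) }
    // Remove a fraction of the predicted noise each step (a crude stand-in
    // for real samplers such as DDIM, Euler, or DPM++).
    latent = latent.zip(guided).map { case (x, n) => x - n / steps }
  }
  latent
}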


4. Upscaling the Blueprint

Once the noise is gone, the latent “blueprint” is passed through the decoder part of the auto-encoder, blooming into a full-resolution RGB image. Optional post-processing—face restoration, super-resolution, NSFW filtering—adds polish and safety.


5. Where the Magic Runs

  • Stable Diffusion WebUI: Everything happens on your machine or server. You tweak seeds, sampler types, step counts, and watch images evolve in real time—ideal for power users or privacy-sensitive workflows.

  • Hugging Face Inference API: You send a prompt and a bearer token; the cloud runs the heavy computation and streams back finished images. It’s pay-as-you-go scalability without GPU headaches.


6. Why It Feels Like Magic—But Isn’t

Under the hood are three research breakthroughs made practical:

  1. Transformers for language–vision alignment (CLIP, 2021 [1])

  2. Diffusion probabilistic models for crisp, controllable synthesis (DDPM, 2020 [3])

  3. Latent diffusion that slashes compute cost without sacrificing detail (Stable Diffusion, 2022 [2])

Blend them with a friendly API—like ImageGenerationClient in llm4s—and you get a tool that turns everyday language into production-quality art at the press of a button.


Key Features Delivered

Milestone 1 delivers a solid, extensible foundation featuring:

  • Unified Factory and Client Abstraction (ImageGenerationClient): Abstracts provider-specific complexities behind a simple, type-safe API.

  • Plug-and-Play Providers: Initial support for Stable Diffusion WebUI and Hugging Face Inference API.

  • Rich Options Model: Includes prompt customization, image size, random seeds, sampler selection, and negative prompt handling (see the usage sketch after this list).

  • Type-safe Error Handling: Structured, clear exceptions ensuring easy debugging and stability.

  • Asynchronous-Ready API: Supports async workflows for scalability.

  • Automated File Saving: Directly saves generated images to disk without manual intervention.

  • Cross-Compilation for Scala 2 and 3: Ensures broad compatibility.

  • Comprehensive Documentation and Unit Testing: Empowers users and developers with clear guidelines and reliability guarantees.
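
To give a feel for how these pieces fit together, here is a hypothetical usage sketch of the options model. The client factory and save call mirror the Getting Started examples below, but ImageGenerationOptions, its field names, and the two-argument generateImage call are illustrative rather than the exact API; docs/ImageGeneration.md documents the real signatures.

// Hypothetical usage sketch: ImageGenerationOptions and its fields are
// illustrative names, not necessarily the exact API shipped in PR #76.
val options = ImageGenerationOptions(
  negativePrompt = Some("blurry, low quality"), // traits to steer away from
  width          = 768,
  height         = 512,
  seed           = Some(42L),                   // fixed seed => reproducible output
  sampler        = "DPM++ 2M"                   // sampler selection
)

val client = ImageGeneration.stableDiffusionClient("http://localhost:7860")
val image  = client.generateImage("a lighthouse in a thunderstorm", options)
image.save("./lighthouse.png")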

Files Breakdown

1. docs/ImageGeneration.md

  • Purpose: User-facing documentation including examples, configuration details, and troubleshooting guidance.

  • Impact: Enables users to quickly leverage the new image generation subsystem.

2. src/main/scala/org/llm4s/imagegeneration/ImageGeneration.scala

  • Purpose: Central factory and facade providing access to Stable Diffusion and Hugging Face providers.

  • Impact: Simplifies provider selection and hides complexity behind a unified API.

3. src/main/scala/org/llm4s/imagegeneration/README.md

  • Purpose: Developer-focused overview, architecture details, and guidelines for future provider integrations.

  • Impact: Facilitates maintenance and future enhancements, enabling straightforward addition of new generative models.

4. src/main/scala/org/llm4s/imagegeneration/examples/ImageGenerationExample.scala

  • Purpose: Minimal runnable example demonstrating basic image generation and saving functionality.

  • Impact: Provides immediate validation of the setup, acting as a quick-start guide and smoke test.

5. src/main/scala/org/llm4s/imagegeneration/provider/StableDiffusionClient.scala

  • Purpose: Implements communication with a self-hosted or remotely hosted Stable Diffusion WebUI instance via REST.

  • Impact: Offers a self-contained, dependency-free option suitable for local or private deployments.
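
For context, the popular AUTOMATIC1111-style WebUI exposes its text-to-image endpoint at /sdapi/v1/txt2img. The sketch below shows roughly what such a request looks like using only the JDK HTTP client; field names follow the WebUI's public API, and the exact payload StableDiffusionClient assembles may differ.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Illustrative request to a Stable Diffusion WebUI instance; the exact
// payload built by StableDiffusionClient may differ.
val payload =
  """{
    |  "prompt": "a retro-futuristic airship drifting across a neon sunset",
    |  "negative_prompt": "blurry, watermark",
    |  "width": 512,
    |  "height": 512,
    |  "steps": 30,
    |  "cfg_scale": 7.5,
    |  "sampler_name": "Euler a",
    |  "seed": 42
    |}""".stripMargin

val request = HttpRequest.newBuilder()
  .uri(URI.create("http://localhost:7860/sdapi/v1/txt2img"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(payload))
  .build()

val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
// The JSON response carries base64-encoded images in its "images" field.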

6. src/main/scala/org/llm4s/imagegeneration/provider/HuggingFaceClient.scala

  • Purpose: Implements integration with Hugging Face’s Inference API, supporting bearer-token authentication, rate limiting, and resilience strategies.

  • Impact: Provides effortless cloud scalability, opening access to a vast range of Hugging Face-hosted models.
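
As a point of reference, the raw HTTP exchange underneath a Hugging Face text-to-image call looks roughly like the sketch below (again using only the JDK HTTP client). The model name and token are placeholders; HuggingFaceClient wraps this plumbing, along with rate-limit handling, behind the unified API.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.{Files, Paths}

// Illustrative only: the raw call underneath a Hugging Face text-to-image
// request. Model name and token are placeholders.
val model = "stabilityai/stable-diffusion-2-1"
val token = sys.env.getOrElse("HF_TOKEN", "your_hf_token")

val request = HttpRequest.newBuilder()
  .uri(URI.create(s"https://api-inference.huggingface.co/models/$model"))
  .header("Authorization", s"Bearer $token")   // bearer-token authentication
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString("""{"inputs": "a futuristic cityscape"}"""))
  .build()

val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofByteArray())
// For text-to-image models the response body is the image bytes themselves;
// a 429 status indicates the rate limit was hit.
Files.write(Paths.get("hf_raw_response.png"), response.body())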

7. src/test/scala/org/llm4s/imagegeneration/ImageGenerationTest.scala

  • Purpose: Unit tests utilizing mock servers to rigorously validate success scenarios, error handling, and JSON payload correctness.

  • Impact: Ensures early detection of errors, robustness, and correctness across diverse scenarios.
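
To illustrate the mock-server idea (this is not the actual test code from the PR), here is a self-contained sketch using the JDK's built-in HttpServer: serve a canned WebUI-style response on an ephemeral port, fire a request at it, and assert on the result.

import com.sun.net.httpserver.HttpServer
import java.net.{InetSocketAddress, URI}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Illustrative mock-server pattern, not the actual ImageGenerationTest code:
// stand up a throwaway HTTP server with a canned response, point a client
// at it, and assert on what comes back.
val server = HttpServer.create(new InetSocketAddress(0), 0)
server.createContext("/sdapi/v1/txt2img", exchange => {
  val body = """{"images": ["aGVsbG8="]}""".getBytes("UTF-8") // canned base64 payload
  exchange.getResponseHeaders.add("Content-Type", "application/json")
  exchange.sendResponseHeaders(200, body.length)
  exchange.getResponseBody.write(body)
  exchange.close()
})
server.start()

val port = server.getAddress.getPort
val response = HttpClient.newHttpClient().send(
  HttpRequest.newBuilder()
    .uri(URI.create(s"http://localhost:$port/sdapi/v1/txt2img"))
    .POST(HttpRequest.BodyPublishers.ofString("""{"prompt": "test"}"""))
    .build(),
  HttpResponse.BodyHandlers.ofString()
)

assert(response.statusCode() == 200)
assert(response.body().contains("images"))
server.stop(0)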

Supporting Changes

  • build.sbt Enhancements: Integrated the necessary JSON and HTTP libraries, added Scala 3 tasty-reader support, and enforced stricter compiler flags to improve reliability.

  • Tool API Schema Updates: Extended toolapi/SchemaDefinition.scala and ToolFunction.scala to officially register the new image-generation functionality.

  • Cross-compile Adjustments: Performed minor cleanups in VersionTest.scala for smoother cross-version compatibility.


Technical Deep Dive

Stable Diffusion WebUI

Stable Diffusion is a latent diffusion model capable of creating highly detailed images from textual prompts. As outlined above, it pairs a text encoder with a UNet denoiser and an auto-encoder, trained to reconstruct visual content that matches semantic descriptions.

The implementation in llm4s allows developers to connect either to a locally hosted or remotely accessible instance, making it ideal for privacy-conscious or offline-capable applications.

Hugging Face Inference API

Hugging Face provides a cloud-hosted inference API that grants access to a broad spectrum of state-of-the-art generative models. The integration requires users to supply their Hugging Face bearer tokens, easily managed through the provided API. This connection leverages Hugging Face’s scalability, supporting large-scale, high-throughput image generation scenarios effortlessly.

Getting Started

Users can quickly dive into image generation using the provided examples:

val client = ImageGeneration.stableDiffusionClient("http://localhost:7860")
val image = client.generateImage(prompt = "A sunset over mountains", width = 512, height = 512)

image.save("./generated_image.png")

For Hugging Face:

val client = ImageGeneration.huggingFaceClient(token = "your_hf_token")
val image = client.generateImage(prompt = "A futuristic cityscape")

image.save("./hf_generated_image.png")

Detailed usage guides and troubleshooting information are available in the documentation.


Conclusion

This milestone provides a robust, extensible foundation for image generation within llm4s, seamlessly integrating powerful providers like Stable Diffusion and Hugging Face. Future work will expand these capabilities, incorporating additional generative models, performance optimizations, batch processing, and enhanced UI integrations, further enriching llm4s’s multimodal capabilities.

Stay tuned for further updates; contributions are always welcome via GitHub (please follow issue #80)!


References

[1] Radford et al., CLIP: Learning Transferable Visual Models from Natural Language Supervision, 2021.

[2] Rombach et al., High-Resolution Image Synthesis with Latent Diffusion Models, 2022.

[3] Ho et al., Denoising Diffusion Probabilistic Models, 2020.

This article may be updated in the future based on the status of the library.
