How to Integrate OpenAI, Gemini, and Ollama AI Models into Rails

Erika Lopez

In this post we’ll put together a minimal backend flow for summarization with two API models and two local ones. Each has its own strengths and trade-offs, and since they all run through the same Rails flow, switching between them is as simple as changing one line of code.

The goal isn’t to build a production-ready system (we’ll skip validations, retries, and other extras) but to highlight the core logic in the simplest way possible: a minimal controller with clean service objects. By the end, you’ll have a straightforward but complete and adaptable integration for consuming different LLM models in your application, and with a few tweaks or some prompt engineering, you can easily tailor it to your own projects.

Next, we’ll set up the app, plug in each provider, and refine the prompts to keep the outputs consistent. Let’s dive in.

Setting up the Rails application

You can start with a fresh Rails app or adapt it to your existing project. The setup (database, testing framework, etc.) is entirely up to you.

We’ll be working with two Rails models: one to store the original text (Article) and another to store the AI responses (Summary), which also holds some useful metrics like tokens and request time.

# db/schema.rb (simplified)
create_table "articles", force: :cascade do |t|
  t.string "title"
  t.text   "content"
end

create_table "summaries", force: :cascade do |t|
  t.string     "provider"
  t.text       "content"
  t.integer    "request_tokens"
  t.integer    "response_tokens"
  t.integer    "request_time_ms"
  t.references "article", foreign_key: true
end
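
Since the controller will create summaries through their article (article.summaries.create!), the two models also need the corresponding associations. Here’s a minimal sketch, assuming you define the classes yourself (dependent: :destroy is just a sensible default, not required by the flow):

# app/models/article.rb
class Article < ApplicationRecord
  has_many :summaries, dependent: :destroy
end

# app/models/summary.rb
class Summary < ApplicationRecord
  belongs_to :article
end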

This demo uses a single controller with a create action to trigger the summarization. In the view we’ll add a minimal form with a provider selector and a single button that POSTs. Nothing fancy, just enough to drive the flow.

# app/controllers/summaries_controller.rb
class SummariesController < ApplicationController
  def create
    article  = Article.find(params[:article_id])
    provider = params[:provider] || "mistral"
    response = SummaryService.new(provider: provider).call(article.content)

    if response
      article.summaries.create!(response)
      redirect_to article, notice: "Summary created."
    else
      redirect_to article, alert: "Summarization failed."
    end
  end
end
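
For reference, the nested route and the minimal form mentioned above could look like this. The view path and markup are just one possible setup; all the flow needs is a POST carrying article_id and provider:

# config/routes.rb
resources :articles do
  resources :summaries, only: :create
end

<%# app/views/articles/show.html.erb (simplified) %>
<%= form_with url: article_summaries_path(@article), method: :post do |form| %>
  <%= form.select :provider, %w[openai gemini mistral llama3:8b] %>
  <%= form.submit "Summarize" %>
<% end %>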

Since not all providers return their own latency, we’ll measure it ourselves with a tiny helper:

# app/services/request_timer.rb
class RequestTimer
  def self.ms
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000).round
    [ result, ms ]
  end
end
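
Usage is just wrapping whatever block you want to time; the block’s return value comes back alongside the elapsed milliseconds (expensive_call below is only a placeholder):

result, elapsed_ms = RequestTimer.ms { expensive_call }
# => [<block result>, <elapsed milliseconds>]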

Finally, we’ll add the SummaryService, which builds a basic prompt and delegates the request to the right provider. This keeps the code clean and avoids giving extra responsibilities to the controller:

# app/services/summary_service.rb
class SummaryService
  def initialize(provider:, max_chars: 300)
    @provider  = provider.to_s.downcase
    @max_chars = max_chars
  end

  def call(text)
    prompt = "Summarize in #{@max_chars} characters or fewer: " \
             "#{text}"

    response =
      case @provider
      when "openai"   then OpenaiService.new.call(prompt)
      when "gemini"   then GeminiService.new.call(prompt)
      when "llama3:8b",
           "mistral"  then OllamaService.new(provider: @provider).call(prompt)
      else
        raise "Unknown provider: #{@provider}"
      end

    response&.merge(provider: @provider)
  end
end

Here’s the setup: two simple Rails models, a lightweight controller with a single entry point, and a service object that centralizes the routing logic to the different adapters. With this in place, we can start integrating each provider.

OpenaiService (gpt-3.5-turbo)

OpenAI is probably the most popular provider for summarization tasks, and integrating it into Rails is straightforward thanks to the ruby-openai gem.

# Gemfile
gem "ruby-openai"

To use the API, you need to generate an API key from platform.openai.com/api-keys and make sure you have credits available, since there’s currently no free tier.

Once you have your key, we’ll store it safely in credentials or environment variables before using it in our Rails app:

# config/credentials.yml.enc
openai:
  api_key: your_api_key_here
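
If you haven’t worked with encrypted credentials before, the standard Rails command opens the file for editing (set EDITOR to whatever editor you use):

EDITOR="nano" bin/rails credentials:edit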

The model we are going to use works through the chat completions endpoint, sending the prompt as part of a conversation. If you decide to use GPT-5, keep in mind it uses the responses endpoint with a slightly different request format.

Here we’ll focus on extracting the summary text plus a couple of useful metrics (tokens from the API and request time measured on our side). The API returns more fields, but we’ll stick to the essentials for clarity.

# app/services/openai_service.rb
require "openai"

class OpenaiService
  def initialize
    @client = OpenAI::Client.new(
      access_token: Rails.application.credentials.dig(:openai, :api_key)
    )
  end

  def call(prompt)
    resp, elapsed_ms = RequestTimer.ms do
      @client.chat(
        parameters: {
          model: "gpt-3.5-turbo",
          messages: [{ role: "user", content: prompt }],
          temperature: 0.5,
          max_tokens: 300
        }
      )
    end

    {
      content: resp.dig("choices", 0, "message", "content").to_s.strip,
      request_time_ms: elapsed_ms,
      request_tokens:  resp.dig("usage", "prompt_tokens") || 0,
      response_tokens: resp.dig("usage", "completion_tokens") || 0
    }

  # ruby-openai surfaces failed API calls as Faraday errors, so rescue those too
  rescue OpenAI::Error, Faraday::Error => e
    Rails.logger.error("OpenAI failed: #{e.message}")
    nil
  end
end

For this demo we’ll use GPT-3.5-turbo because it’s faster, uses fewer tokens, and costs significantly less. In practice, it consistently produced summaries within the 300-character limit without losing coherence or key information.

Newer models like GPT-4 or GPT-5 may be worth considering if you need more context or deeper reasoning, but for this use case they’re not necessary.

GeminiService (gemini-1.5-flash)

Gemini is Google’s family of AI models, accessible through Google AI Studio. The Flash models are tuned for speed while still offering solid reasoning quality, and they can handle very large contexts, making them ideal for long inputs. On top of that, there’s a free tier, so we can experiment without paying right away.

To use it in Rails, first generate an API key and then enable the Generative Language API in your Google Cloud project. Don't forget to store the key in your credentials.
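
The service below reads the key from Rails credentials, so the entry mirrors the OpenAI one:

# config/credentials.yml.enc
gemini:
  api_key: your_api_key_here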

Once set up, requests are made to POST /v1beta/models/{model}:generateContent, where we include the prompt in a structured payload. Gemini expects prompts to be wrapped in a contents/parts format, even if we’re only sending plain text.

For the requests, we’ll use http.rb, but you can swap it for HTTParty, Faraday, or whichever client you prefer.

# Gemfile
gem "http"

# app/services/gemini_service.rb
require "http"

class GeminiService
  BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"

  def initialize(model: "gemini-1.5-flash")
    @model   = model
    @api_key = Rails.application.credentials.dig(:gemini, :api_key)
  end

  def call(prompt)
    payload = {
      contents: [ { parts: [ { text: prompt } ] } ]
    }

    resp, elapsed_ms = RequestTimer.ms do
      HTTP.timeout(connect: 5, read: 30).post(
        "#{BASE_URL}/#{@model}:generateContent?key=#{@api_key}",
        json: payload
      )
    end

    return unless resp.status.success?

    data    = resp.parse
    content = data.dig("candidates", 0, "content", "parts", 0, "text")
    usage   = data["usageMetadata"] || {}

    {
      content: content.to_s.strip,
      request_time_ms: elapsed_ms,
      request_tokens:  usage["promptTokenCount"],
      response_tokens: usage["candidatesTokenCount"]
    }

  rescue HTTP::Error => e
    Rails.logger.error("Gemini failed: #{e.message}")
    nil
  end
end

In this example we’ll go with gemini-1.5-flash, though you can switch to any other Gemini model with no code changes. Just keep in mind that each version trades off speed, reasoning depth, and cost.

Note: This snippet is simplified for demonstration. Usage metadata may vary depending on the model and endpoint, and the response is returned as plain text. In production, you may want to handle errors, retries, and formatting.

Local Inference with Ollama

Before jumping into the code, it's worth explaining what local inference is: basically, running the AI model directly on your own hardware (your PC, server, or device) instead of sending data to external services or APIs hosted by third parties.

The main advantage is that you eliminate token costs, keep full control over your data, and avoid depending on network latency. The trade-off is that you’ll need decent hardware (especially a good GPU for larger models), sufficient RAM, plus enough disk space to store the models, which can range from a few GB for small models up to hundreds of GB for larger ones. The final performance will depend on your machine’s capabilities. In my experience, having 16 GB of RAM or more makes a big difference for smoother results.

For instance, in my case I didn’t have enough RAM to run Mistral 7B smoothly on my main PC, so I set up Ollama on a secondary host within my local network: an Nvidia 4070 Super GPU with 12 GB of VRAM and 32 GB of RAM, optimized with CUDA support.

Setting up the models with Ollama

Download and install Ollama from ollama.com, then start the local server and pull the models. In this tutorial we’ll use Mistral and Llama 3, but you can install any other model from the official library:

ollama serve
ollama pull mistral
ollama pull llama3:8b

Once the models are pulled, we don’t need extra API keys or network access. Just run ollama serve and you’re ready to go. From here, we can set up a simple service in Rails to call the models.
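
Before wiring it into Rails, a quick sanity check against the local API (the same endpoint our service will call) might look like this:

curl http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Say hello in five words.", "stream": false}'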

OllamaService (Mistral / Llama 3)

One nice thing about Ollama is that all models share the same API, so we don’t need different code for Mistral or Llama 3. Everything works through a simple POST to http://localhost:11434/api/generate, unless you’ve changed the port. In our case, we’ll just use http.rb to send the requests.

Note: Pass the exact Ollama model name you have installed (e.g., mistral, llama3:8b). If it doesn’t match, the request will fail.
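
You can check the exact names of the models installed on your machine with:

ollama list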

# app/services/ollama_service.rb
require "http"

class OllamaService
  def initialize(provider:)
    @provider = provider
  end

  def call(prompt)
    payload = {
      model: @provider,
      prompt: prompt,
      stream: false,
      options: { temperature: 0.2, num_predict: 200 }
    }

    resp, elapsed_ms = RequestTimer.ms do
      HTTP.post("http://localhost:11434/api/generate", json: payload)
    end

    return nil unless resp.status.success?

    body = resp.parse
    {
      content: body["response"].to_s.strip,
      request_time_ms: elapsed_ms,
      request_tokens:  body["prompt_eval_count"],
      response_tokens: body["eval_count"]
    }

  rescue HTTP::Error => e
    Rails.logger.error("Ollama failed: #{e.message}")
    nil
  end
end

This way, local inference runs side-by-side with API providers in our SummaryService, letting us swap between OpenAI, Gemini, and Ollama seamlessly.
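
As a quick check from the Rails console, switching providers really is a one-argument change (assuming you already have an article saved):

article = Article.first

SummaryService.new(provider: "openai").call(article.content)
SummaryService.new(provider: "gemini").call(article.content)
SummaryService.new(provider: "mistral").call(article.content)
# => { content: "...", request_time_ms: ..., request_tokens: ..., response_tokens: ..., provider: "mistral" }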

Refining the output with prompts

With all our services integrated and running, we can now add a simple view to test what we’ve built: choose a model, send it some text, and get a summary with metrics in return.

If you've been testing along the way, you might have noticed that we're using the same basic prompt across all models:

"Summarize in #{@max_chars} characters or fewer: #{text}"

For OpenAI and Gemini this works perfectly fine, since they are robust enough to understand and respect the constraint. However, smaller local models like Mistral or Llama 3 can be less reliable with character limits: they might produce longer outputs or stray from the requested format. Adding more context and structure to the instructions significantly improves their consistency.

To improve reliability, we can switch to a slightly more structured prompt (this replaces the one-line prompt built inside SummaryService#call):

<<~PROMPT
  You are a summarizer.

  Task:
    - Summarize the following text in one single paragraph.

  Rules:
    - Do not exceed #{@max_chars} characters.
    - Keep the original language of the input.
    - Be clear and neutral (no emojis, no lists).

  Text:
    """#{text.to_s.strip}"""

PROMPT

This structured approach reduces inconsistency and makes results more predictable. It also aligns well with how most LLMs were trained on instruction-style data, so they're naturally better at following this format. The trade-off is that longer prompts mean more tokens, which can add up with paid APIs if you're making lots of requests. It's worth experimenting to find the sweet spot between prompt detail and cost for your needs.

Conclusion

In this post, we built a complete AI integration in Rails, connecting OpenAI, Gemini, and local models with Ollama through a single flow. The goal was to keep the code as simple as possible to highlight the core logic, without getting sidetracked by refactoring or advanced error handling.

In a real-world setup, you’d likely want to add validations, retries, and other best practices, but the foundation is already there: an extensible system where you can switch models with just one line of code.

And while we tweaked the prompts to improve consistency, there are many other optimizations that can take these integrations further, like applying chunking strategies for long texts, offloading requests to background jobs, using smart caching, running parallel requests, or even designing a more efficient pipeline. That will be the focus of my next post, where we’ll dive into how to make these Rails + AI integrations faster and more cost-effective by optimizing token usage and response time.
