My Current Stack For Affordable But Effective Self-Hosted AI (LLMs & RAG)


2025 is set to be the year in which agentic AI makes it to the mainstream.

OpenAI’s just-launched Operator will provide countless users with their first glimpses into what it will look like when AI tools can do more than just retrieve text.

But many personal users and small businesses are going to be grappling for the first time with how to afford the growing list of AI tools required to keep up with the fast pace of technical evolution. This is especially true for high-volume LLM users who might be using them for everything from brainstorming to product ideation and competitive analysis.

Here is a stack that I've put together over the past couple of months which allows for uninterrupted high-volume use at a relatively affordable price point.

Component 1: A Cloud VPS

A Virtual Private Server, or VPS, is the foundation of this stack.

While you can deploy self-hosted tools just about any way imaginable these days, including through dedicated platform-as-a-service solutions, cloud VPSs are affordable and let you quickly spin up a machine in the cloud onto which you can provision all manner of software.

You're still responsible for security, so using tools like Cloudflare Tunnels makes sense for many users. But you don't have to worry about maintaining the underlying infrastructure or securing services that are physically located in your own home.

Popular providers include DigitalOcean and Hetzner, among others.

Component 2: An LLM Frontend

The next thing that you're going to need in order to use a large language model is a frontend to access the familiar chat interface.

There is a growing ecosystem of LLM frontends available both for cloud hosting and local deployment.

The differences between them, at least at the time of writing, are somewhat nuanced. For example, while I love OpenWebUI for its intuitive design, it doesn't currently support Anthropic models natively (although, as a workaround, these can be accessed via OpenRouter).

Others are more oriented towards specific use cases like speech-to-speech interaction.

But it's hard to go wrong with LibreChat or OpenWebUI. Both can be deployed via Docker onto a server. OpenWebUI is particularly straightforward to deploy as it only requires one image. Portainer is a highly useful tool for streamlining the deployment process.
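
To make the deployment step concrete, here's a minimal sketch of launching the single OpenWebUI container from Python; this is my own illustration rather than the exact commands from my setup, and it assumes Docker is already installed on the VPS:

```python
# Minimal sketch: start the single OpenWebUI container via Docker.
# Assumptions: Docker is installed on the VPS; the image tag below
# (ghcr.io/open-webui/open-webui:main) is still current when you run this.
import subprocess

subprocess.run(
    [
        "docker", "run", "-d",
        "--name", "open-webui",
        "-p", "3000:8080",                      # expose the chat UI on port 3000
        "-v", "open-webui:/app/backend/data",   # persist chats, users and settings
        "--restart", "always",
        "ghcr.io/open-webui/open-webui:main",
    ],
    check=True,
)
```

Portainer simply gives you a web UI for the same operation, which is handy if you'd rather not manage containers from the command line.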

Component 3: LLM API Keys

The next thing that you're going to need in order to use these just as you would use ChatGPT is API keys for the platforms that you like working with.

Personally, I'm a big fan of OpenRouter. It's a unified access platform that allows you to access a vast number of different models through a single API key.

Together AI is noteworthy for its focus on making open source models available at very affordable price points. And of course you can hold a balance directly with Anthropic, Google, or DeepSeek. Increasingly, the OpenAI API is emerging as a de facto standard for interoperability, and frontends are adding support for it, which effectively allows you to use any LLM API that conforms to this standard.
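
That interoperability is easy to see in code. The sketch below is my illustration (the model slug is just an example): it points the standard OpenAI Python client at OpenRouter's OpenAI-compatible endpoint, so swapping providers is mostly a matter of changing the base URL and key.

```python
# Sketch: the OpenAI client pointed at OpenRouter's OpenAI-compatible API.
# The model slug is illustrative; any model OpenRouter exposes will work.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter endpoint
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[{"role": "user", "content": "Suggest three angles for a competitive analysis."}],
)
print(response.choices[0].message.content)
```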

Component 4: Cloudflare Tunnel (Optional)

Once you've got your shiny new AI infrastructure deployed on the cloud, you’ll want to implement some basic security measures to ensure that random people aren't having fun with AI tools on your dime. You might also wish to consider hiring a friendly AI sloth in a tuxedo to monitor the entrance.

A robust security posture might entail blocking all non-essential ports at the web application firewall level, installing the Cloudflare agent (cloudflared) on the server, and then relying upon a Cloudflare Tunnel to expose internal services. Connectivity between applications on the server itself (like containers running in the same Docker environment) will not be affected.
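
As a rough sketch of what that looks like in practice (assuming cloudflared is installed and you've already run `cloudflared tunnel login`; the tunnel name and hostname below are placeholders):

```python
# Sketch: create a named tunnel, give it a public hostname, and proxy
# traffic to the frontend running on localhost:3000. Assumes cloudflared
# is installed and already authenticated against your Cloudflare account.
import subprocess

def run(*args: str) -> None:
    subprocess.run(args, check=True)

run("cloudflared", "tunnel", "create", "ai-stack")
run("cloudflared", "tunnel", "route", "dns", "ai-stack", "chat.example.com")
run("cloudflared", "tunnel", "run", "--url", "http://localhost:3000", "ai-stack")
```

Pairing the tunnel with Cloudflare Access then lets you put authentication in front of the hostname without opening any inbound ports on the VPS.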

Component 5: Vector Database, LLM, STT Model (Local)

There are two emerging workflows for those deploying their own LLM infrastructure in the cloud. I would summarise them as follows:

Option 1: Minimise Local Services

This is the deployment method that I'm currently using.

It's absolutely possible to deploy all of these things locally on the same server on which your LLM frontend is hosted. But I'd personally rather consume the LLM via a cloud API, and likewise for speech-to-text, text-to-speech, and the vector database.

Here's an outline of my current setup:

| Service | Deployment Method | APIs |
| --- | --- | --- |
| LLM (for inference) | Accessed via cloud API | OpenRouter |
| Vector database & embedding models | Locally hosted or SaaS | Pinecone, Weaviate, etc. |
| Frontend | Locally hosted (OpenWebUI) | N/A |
| STT | Via cloud API | Whisper / Deepgram (where supported) |
| TTS | Via cloud API | OpenAI |
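
For retrieval (the RAG part of the stack), Option 1 means the embedding model, the vector database, and the LLM are all remote services. Here's a hedged sketch of that flow, assuming a Pinecone index called "notes" whose records carry a `text` metadata field (both assumptions for illustration, not details of my setup):

```python
# Sketch of the cloud-API RAG flow: hosted embeddings, managed vector
# database, hosted LLM. Index name and metadata layout are assumptions.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI(api_key="YOUR_OPENAI_API_KEY")
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("notes")

question = "What did we decide about pricing last month?"

# 1. Embed the query with a hosted embedding model.
embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=question,
).data[0].embedding

# 2. Retrieve the closest chunks from the managed vector database.
results = index.query(vector=embedding, top_k=5, include_metadata=True)
context = "\n".join(match.metadata["text"] for match in results.matches)

# 3. Pass the retrieved context plus the question to the LLM.
answer = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```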

Option 2: The Full DIY Approach

Some people are lucky enough to have local hardware powerful enough to run advanced LLMs totally on-premises, or budgets large enough to deploy them onto sufficiently resourced cloud servers.

Similarly, embedding models can be self-deployed and managed.

The great thing about the world of open source and self-hosting is that the options are vast. It's possible to start out your self-hosted AI journey using Option 1, and then, as your resources increase or the cost of provisioning cloud hardware comes down, gradually shift towards Option 2, self-deploying and managing all your own services.

Achieving a “full DIY” implementation might entail a setup like this:

| Component | Deployment |
| --- | --- |
| LLM | Llama deployed on the server; Ollama for local availability |
| Frontend | OpenWebUI (as before) |
| Vector database | Chroma (deployed on VPS) |
| Speech-to-text (STT) | Whisper, locally deployed |
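
In that configuration, nothing leaves the box: Chroma handles retrieval and Ollama serves the model. Here's a rough sketch of how those two pieces fit together (model and collection names are illustrative, not prescriptions):

```python
# Sketch of the full DIY flow: local Chroma for retrieval, a Llama model
# served by a locally running Ollama daemon for inference.
import chromadb
import ollama

# Persistent, fully local vector store (uses Chroma's default embedding model).
docs = chromadb.PersistentClient(path="./chroma-data").get_or_create_collection("docs")
docs.add(
    ids=["note-1"],
    documents=["Our VPS bill for January was 38 EUR, mostly storage."],
)

question = "How much did the VPS cost in January?"
hits = docs.query(query_texts=[question], n_results=1)
context = hits["documents"][0][0]

# Inference against the local Ollama daemon (model name is an example).
reply = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}],
)
print(reply["message"]["content"])
```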

Deployment Price Estimates

Deployment method one is going to be significantly more affordable. A basic VPS is sufficient to run an LLM frontend. Arguably, such a VPS doesn't even need any AI-specific hardware, such as carefully chosen GPUs.

Deployment option 2, the full DIY option, is going to be significantly more expensive. A credible absolute minimum budget for this might be $500 per month and even that would require careful selection of hardware.

Component 6: A PWA (Or Hermit)

Once you finally have your frontend up and running, you're likely going to want to access it from your mobile devices as well as from a desktop.

The easiest way to do that is to make sure that your frontend is PWA-compliant. If so, you can very easily create an app with just a couple of clicks on Android or iOS.

However, the authentication provided by Cloudflare can make these slightly trickier to configure.

In that case, Hermit provides a useful workaround. It's a highly versatile Android tool that lets you create lite apps from any URL. Its user agent add-on might be necessary to ensure successful authentication through Google.

