DeepSeek Unleashed – Part 1: From Bedrock to Control


By Arjun Manoj Kumar K, DevOps Engineer | AI Enthusiast
“November 30th 2022. ChatGPT dropped. And everything changed.”
LLMs went from research labs to everyday tools — fast.
But spinning one up yourself? Still feels... a little rough.
In this two-part blog, I’ll walk you through how to run DeepSeek, one of the most capable open-source LLMs today — first on AWS Bedrock, then on your own machine with Ollama.
Before we dive into terminals and code, we'll make sense of a few key AI building blocks:
Inference, weights, parameters, tokens, and even a quick peek at a new architecture called MAMBA.
Don’t worry, I’ve got you.
This isn’t a lecture — it’s a dev-to-dev download.
Grab your coffee. Let’s boot up the cloud.
LLM 101: Inference, Weights, and Transformers (Oh My!)
Before we get our hands dirty with DeepSeek, let’s cover a few basics of large language models (LLMs) in plain English:
Parameters and Weights: You can think of parameters as the internal configuration or “knowledge” of an LLM. These are often referred to as weights, essentially giant tables of numbers that the model has learned during training. Modern LLMs have lots of these: the full DeepSeek-R1 has 671 billion parameters, while the distilled variant we'll use packs about 8 billion. The more parameters, generally, the more knowledge or nuance a model can have, but it also means more computing power is needed to run it. When someone says “a 13B model,” they mean 13 billion parameters. These weights are what get loaded into memory when you “boot up” the model.
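To make “more computing power” concrete, here's a quick back-of-the-envelope sketch (my own illustration, nothing DeepSeek-specific): if each parameter is stored in 16-bit precision, that's 2 bytes per parameter, so the memory needed for the weights grows linearly with the parameter count.

```python
# Back-of-the-envelope memory estimate for holding model weights, assuming
# 16-bit precision (2 bytes per parameter) and ignoring activation/KV-cache
# overhead -- order-of-magnitude numbers only.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1024**3

for name, params in [("8B distilled", 8e9), ("671B full model", 671e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB just for the weights")
# 8B distilled: ~15 GB, 671B full model: ~1250 GB
```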
Tokens & Tokenization: When you type a sentence into a model like DeepSeek, it doesn’t actually understand words the way we do. Instead, it breaks everything down into tokens, which are like bite-sized pieces of text. A token might be a full word ("hello"), part of a word ("tion"), or even just punctuation ("!"). This process is called tokenization, and it’s the model’s first step in trying to make sense of language.
For example, the sentence:
"Transformers are cool!"
might get split into tokens like:
["Transform", "ers", " are", " cool", "!"]
Embeddings: Once text is tokenized, the model still can’t process it as-is — it needs numbers. That’s where embeddings come in. An embedding is basically a vector representation of a token — a list of numbers that captures its meaning based on context. It’s like giving the model a way to understand that:
"dog" and "puppy" are similar,
but "dog" and "banana"... not so much.
Inference (vs. Training): Training an LLM is the heavyweight process of updating those billions of weights by showing the model tons of examples (this is done by the model’s creators and can take weeks on supercomputers). Inference, on the other hand, is what we do as users: it’s when we feed input to a pre-trained model and get an output (a prediction). Inference is basically using the model to generate text. For instance, asking DeepSeek a question and getting an answer is an inference. It’s much less intensive than training, but for big models it can still be slow or require powerful hardware. In short, inference = running the model forward to get results (no learning happening at that time).
Transformer Architecture: Most state-of-the-art LLMs today (including DeepSeek) are based on something called the Transformer architecture. Without diving too deep, the Transformer is a type of neural network that introduced a magic ingredient known as “self-attention.” Attention mechanisms let the model weigh the importance of different words in the input when generating each word of the output. Imagine you’re writing a reply to someone – you “pay attention” to the relevant parts of what they said. Transformers do this at scale and in multiple layers. This architecture was revolutionary because it allowed models to handle very long text inputs and outputs far more effectively than older recurrent neural networks. If you hear about “GPT-style” models, that basically means a Transformer-based model. So when we use DeepSeek, under the hood there’s a Transformer network doing the heavy lifting: reading the input text, figuring out which bits matter, and then producing a coherent response one token at a time. Neat, huh?
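For the curious, here's a minimal numpy sketch of that core idea, scaled dot-product self-attention, stripped of the multiple heads, layers, and learned projections a real Transformer has:

```python
import numpy as np

def self_attention(Q, K, V):
    """Minimal single-head scaled dot-product attention: each token 'pays
    attention' to every other token; the output is a weighted mix of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))        # 3 tokens, 4-dimensional embeddings, random for demo
print(self_attention(x, x, x))     # "self"-attention: Q, K and V all come from the same tokens
```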
Now, keep in mind that the AI field moves fast. Transformers were bleeding edge for a while, but researchers are exploring new architectures to address some of Transformers’ limitations, like the heavy compute cost for very long inputs. One of those contenders is MAMBA, which we’ll touch upon now.
Beyond Transformers: Meet MAMBA, the New Kid on the Block
You might have heard whispers about architectures beyond Transformers. MAMBA is one such new approach that’s generating buzz. So, what is MAMBA?
In a nutshell, MAMBA is a new LLM architecture built on structured state space models (the S4 line of work) to handle lengthy sequences of data. The state-space technique borrows ideas from older sequence models (like RNNs and signal processing) to efficiently capture long-range dependencies. This means MAMBA-based models aim to manage very long context lengths with better efficiency than Transformers, which tend to get slow or memory-hungry as context grows. In fact, MAMBA tries to combine the best of various approaches (recurrent models, convolutional models, etc.) to model long-term dependencies in text.
What does that mean for you? Potentially, future LLMs (maybe even a future DeepSeek version) could handle book-sized inputs or huge documents much faster using architectures like MAMBA. For example, where a Transformer might choke or take forever on a thousands-of-words prompt, a MAMBA model could breeze through with roughly linear scaling (no quadratic slowdown as the text grows). This is still an area of active research, but it’s exciting because it shows how the landscape is evolving.
For this blog, we’ll be working with DeepSeek, which (as of the R1 version) is still a Transformer-based model. But it’s cool to know what’s on the horizon. If nothing else, you can drop “structured state-space models” in your next group chat with friends to sound a bit extra intelligent. 😄 The key takeaway: Transformer models are the current standard, but architectures like MAMBA hint at a future where LLMs get even faster and handle more data with ease.
With the groundwork set, let's dive into DeepSeek!
What Is DeepSeek?
DeepSeek-R1 is an open-source large language model that burst onto the scene in late 2024 / early 2025. It grabbed the spotlight for its strong reasoning abilities in areas like math and coding. In fact, DeepSeek has demonstrated performance on par with some of OpenAI’s models, at a fraction of the cost or resource requirements. For example, it reportedly scored about 79.3% on the AIME 2024 math competition and did well on a software engineering benchmark; these are tough tests, so those numbers turned heads. The model was developed by a team with the goal of pushing reasoning capabilities in an open manner, and it’s released under the MIT license.
Being open-source has a couple of big implications. First, we can actually look under the hood: the architecture and weights are accessible, not a proprietary secret (the training data itself isn’t published, but the technical reports describe how the model was built). That allows the community to understand its strengths and limitations, and even improve it. Second, we’re free to deploy DeepSeek on our own infrastructure. No dependency on a specific cloud provider or API; you could run it in AWS (as we will soon), on your own servers, or even on a beefy laptop. This freedom lets organizations avoid vendor lock-in and gives developers like us a playground to experiment without hefty paywalls.
DeepSeek comes in a few flavors. The main DeepSeek-R1 model is a large one (671 billion parameters, far beyond GPT-3’s 175 billion). Recognizing that not everyone has the means to run a huge model, the creators also provided distilled, smaller versions: essentially compressed models that retain a lot of the capabilities of the big one, but with far fewer parameters. In this guide, we’ll use deepseek-r1:8b (an 8-billion-parameter variant distilled from the original). It’s much easier to host and experiment with 8B than, say, 34B or more. Think of it as DeepSeek’s little sibling that hasn’t been to the gym: smaller in size, still pretty strong.
Alright, now that we know who DeepSeek is and have some context, let’s actually run this thing! We’ll start with the cloud route: AWS Bedrock.
Running DeepSeek on AWS Bedrock
Let’s go step-by-step through getting DeepSeek running on Bedrock.
- Accessing DeepSeek on Bedrock
First, you’ll need an AWS account with access to Bedrock. In the AWS Management Console, navigate to Amazon Bedrock. Under Bedrock configurations, find Model access in the sidebar. Here, you’ll see a list of available models from various providers (Amazon’s own models, Anthropic, Cohere, Stability, and many others – Bedrock is a bit like a model marketplace). Look for the DeepSeek section. You should see DeepSeek-R1 listed as a model under that category.
Screenshot: The AWS Bedrock console’s Model Access panel, showing DeepSeek-R1 in the list of base models (with access granted).
In my case, after I clicked “Request access” for DeepSeek-R1 and got the green light, I was ready to use the model. You should only have to do this once per account and region. Now the DeepSeek model is enabled for your account. Time to test it out!
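If you’d rather verify access from code than from the console, boto3’s control-plane bedrock client can list the foundation models your account can see. The provider filter value below is my assumption of how DeepSeek appears in the listing, so adjust it if yours differs:

```python
import boto3

# Control-plane client ("bedrock"), as opposed to "bedrock-runtime" used for inference.
bedrock = boto3.client("bedrock", region_name="us-east-1")

resp = bedrock.list_foundation_models(byProvider="DeepSeek")  # provider name is my assumption
for model in resp["modelSummaries"]:
    print(model["modelId"], "-", model["modelName"])
```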
- Testing DeepSeek in the Bedrock Playground
AWS Bedrock provides a handy UI called the Playground, where you can interact with models directly in the browser. On the left menu, click Playgrounds, then choose Chat/Text (since DeepSeek is a text generation model). At the top of the Playground interface, there’s a drop-down to select a model. Click that, and you’ll see categories for each provider. Choose DeepSeek, and then select DeepSeek-R1 as the model. DeepSeek-R1 supports a context window of up to 128K tokens on Bedrock. Hit Apply to confirm the model selection.
Screenshot: Selecting DeepSeek-R1 in the Bedrock Playground. Under “Model providers” you can see DeepSeek (alongside Amazon, Anthropic, etc.), and we’ve chosen the DeepSeek-R1 model.
With the model selected, you can enter a prompt in the text box and hit send. For example, try something simple like: “Hello DeepSeek, can you solve 2+2?”. In a few seconds, you should get a response from the model right there in the browser. (DeepSeek might return something like “4” or a brief explanation; it handles much more complex math reasoning too.) This playground is nice for quick experiments. I gave it a more complex math word problem from a textbook, and it actually wrote out the reasoning steps and the solution, which was impressive to see live.
So the Playground confirms everything is working. But the real power is using DeepSeek in your applications via code. Let’s see how to do that.
- Challenges & Workarounds: When I first started, I used the CLI approach. The gotcha was that the CLI returns the result as a bytestream, which isn’t printed directly to the terminal. That’s why I had to write the output to a file and then inspect the file. It felt a bit roundabout, copying JSON strings and opening files in VS Code just to see the answer. Switching to a simple Python script with boto3 made things easier, since I could see the output directly in my VS Code output panel. So, if you’re iterating as a developer, I’d recommend using boto3 or an SDK instead of the CLI.
Another snag I hit had nothing to do with Bedrock or DeepSeek itself: I was authenticating through an AWS SSO role, so I initially ran into a few permission and credential blocks before my calls went through.
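If you’re in the same boat with SSO-based credentials, pointing boto3 at a named profile after an aws sso login usually clears it up. The profile name below is just a placeholder:

```python
import boto3

# After `aws sso login --profile my-sso-profile` in the terminal, build the
# client from that named profile instead of the default credential chain.
# "my-sso-profile" is a placeholder for whatever your profile is called.
session = boto3.Session(profile_name="my-sso-profile", region_name="us-east-1")
bedrock_runtime = session.client("bedrock-runtime")
```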
- Invoking DeepSeek via Python (with Boto3)
Once you’ve got access to AWS Bedrock, there are two ways to interact with DeepSeek:
InvokeModel – for simple, one-shot prompts (like asking a question, getting an answer).
Converse – for multi-turn chat-like conversations where the model remembers previous context.
Even though DeepSeek is tuned for dialogue, you can use either. If you're building a chatbot, Converse feels more natural. But if you just want a quick answer or to run one-off prompts from your script or terminal, InvokeModel works great too.
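Here’s a minimal sketch of the Converse route in boto3. The model ID is the inference profile I believe Bedrock exposes for DeepSeek-R1, so double-check it against the model catalog in your console before copying:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "us.deepseek.r1-v1:0"  # assumption -- confirm the exact ID in your Bedrock console

# Converse keeps the chat structure for you: append messages as the
# conversation grows and send the whole list back on each turn.
messages = [{"role": "user", "content": [{"text": "Hello DeepSeek, can you solve 2+2?"}]}]

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=messages,
    inferenceConfig={"maxTokens": 512, "temperature": 0.5},
)

# DeepSeek may return reasoning blocks alongside the final answer, so only print text blocks.
for block in response["output"]["message"]["content"]:
    if "text" in block:
        print(block["text"])
```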
- Using Python (Boto3 SDK)
Let’s be real — CLI gets messy fast. For better flexibility and cleaner code, you can use Python with Boto3. Here's a full working script I ran myself:
https://github.com/mkarjun/AWSDeepSeekTest/blob/main/DS.py
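If you just want the shape of the one-shot InvokeModel call without opening the repo, here’s a stripped-down sketch (this is not the script from the link above, and the request-body fields are my assumptions for DeepSeek, so verify them against the Bedrock model parameters docs):

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "us.deepseek.r1-v1:0"  # assumption -- confirm in your Bedrock console

# Request-body fields are model-specific; these are my assumptions for DeepSeek-R1,
# so check them against the Bedrock model parameters documentation.
body = json.dumps({
    "prompt": "Hello DeepSeek, can you solve 2+2?",
    "max_tokens": 512,
    "temperature": 0.5,
})

response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=body)

# The body comes back as a stream (the same "bytestream" gotcha as the CLI),
# so read and decode it before parsing the JSON.
result = json.loads(response["body"].read())
print(json.dumps(result, indent=2))
```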
- Tuning and Guardrails
Bedrock also comes with Guardrails – basically content filters and controls you can enable for the model. In the Bedrock console, you can define rules to block certain kinds of content. For example, you could create a guardrail to filter out any output that looks like sensitive info or contains specific keywords (like “politics” or other topics you want to avoid). This is really useful if you’re deploying an app in production and want to ensure the model doesn’t say something out-of-line or reveal confidential data. DeepSeek being fully managed on Bedrock means it can take advantage of these enterprise features (monitoring, logging, access control, etc.) that AWS provides.
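Once a guardrail exists in your account, attaching it to a Converse call is a single extra parameter. The identifier and version below are placeholders for whatever your guardrail’s detail page shows:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="us.deepseek.r1-v1:0",  # same model-ID assumption as before
    messages=[{"role": "user", "content": [{"text": "Tell me about current politics."}]}],
    # Placeholder values -- use the ID and version from your guardrail's detail page.
    guardrailConfig={"guardrailIdentifier": "your-guardrail-id", "guardrailVersion": "1"},
)
print(response["stopReason"])  # becomes "guardrail_intervened" when the guardrail blocks content
```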
Alright – we’ve successfully run DeepSeek in the cloud! We saw that it can be done through a nice UI or via code, and noted some tricks (like prompt formatting and using the SDK). But what if you want to run DeepSeek offline, on your own hardware? Maybe you don’t want to rely on an internet service, or you want to experiment with the model locally for free. That’s where Ollama comes in.
Conclusion – Part 1: From Spinning Up to Taking Control
So far, we’ve tamed the beast in the cloud. DeepSeek runs cleanly through Bedrock — efficient, powerful, wrapped in AWS polish. You typed. It answered. Simple.
But let me tell you something:
This... was just the beginning.
What happens when you unplug from the cloud? When you go completely local — no safety nets, no billing dashboards, no external dependencies. Just you… and the raw weight of an 8 billion parameter model pulsing in your machine’s RAM.
In Part 2, we ditch the cloud and go full rogue. We’ll drag DeepSeek down from the heavens and run it where it wasn’t designed to thrive — in the trenches of your laptop, powered by coffee and curiosity.
The revolution won't be deployed — it’ll be downloaded.
See you in Part 2.