Infrastructure Cost Optimization Project. Part 1
This is based on my experience working on a project to reduce infrastructure cost.
This is just Part 1, since I have only just started on this project - it's all still very new to me. Recently, I spoke to the folks working at https://svaksha.in and they had a project that needed help with reducing infrastructure cost, apart from another requirement around running the project in offline mode or locally, instead of in a cloud
The system is a healthcare system. It has some web APIs, a database and some cron jobs - I'm still trying to understand more. They use AWS to host the whole thing: AWS EC2 instances for the system itself, along with AWS SQS and AWS S3 for queues and object storage (images) respectively
Below is a draft proposal I had for them, which seemed generic enough to share, since it has mostly technical details and nothing client specific.
Any client-specific information, or any other information like names, has been anonymised for privacy - unless it's my own name, Karuppiah Natarajan, or KP.
What this document intends to answer -
What are we proposing, as a strategy and solution, for the high infrastructure cost (bill) problem?
What will it cost to solve the problem? That is, the R&D cost - research and development to implement the solution - and the final cost of the infrastructure after the solution is implemented.
How much time will it take to implement? Human resources - KP's time, Alex's time and/or Kyle's time.
How many people will be implementing this solution? Most probably just one - either Alex or Kyle. KP is just architecting or solutionizing for now
Current Scenario and Context:
Currently, we have one AWS EC2 instance for the whole system - that is, one server - which has the following things running
Database, with around 10GB of data. This is a PostgreSQL Database
API Service. This is always up and running. This is implemented in Python using the FastAPI web framework
Cron jobs. The cron job runs once every minute, and each run takes around a minute
Questions
Is there only one cron job here, or multiple cron jobs? If there are multiple, what is the cron schedule for each? And are all the cron jobs doing the same / similar thing - that is, processing the images from S3?
Does the cron job kill itself after running for a minute, even if it hasn't finished / completed its processing?
What happens when a cron job fails? Both when it fails gracefully and when it fails non-gracefully
How is the cron job implemented? Is it some Python code? A Python script? Does it use the API service? Does it use the database?
Need more detail here 👆
We also have one AWS EC2 instance for the development server, or what we call the dev server. This is used for development purposes - mainly testing before deploying
So, overall, if we look at the current AWS EC2 instance usage and pricing -
We currently use one big EC2 instance in AWS for running the whole system, which is of type c5.xlarge - the complete details can be found in different places on AWS, but an easier way to find them all in one place, via a third-party service, is here - https://instances.vantage.sh/aws/ec2/c5.xlarge
We also use one medium-sized EC2 instance for running the development server, which is of type t3.medium - the complete details can be found here - https://instances.vantage.sh/aws/ec2/t3.medium
Goals of the Solution:
Bring minimal changes to the system - the software and the infrastructure. Preferably no changes to the software at all, except for configuration changes specific to the infrastructure, especially when making the infrastructure changes
Keep It Simple. No complications. No over-engineering, and definitely no over-complication.
Does not need too much effort, time or energy from the maintainers of the software and infrastructure to make the new changes, if any, to implement the solution
Does not need too much of a learning curve either - especially not having to learn something completely new - for the maintainers of the software and infrastructure to make the new changes, if any, to implement the solution
Proposed Solution
Use a separate VM (Virtual Machine) for running different things
Separation of Concerns
Since we are not running all the services and other things inside containers (Linux containers) as isolated processes, it's very possible that one or more processes hog the resources of the virtual machine, affecting other, possibly very critical, processes and hence the whole system. This is popularly called the noisy neighbor problem.
For example, if for some reason the cron jobs never complete and take up too many resources (CPU and RAM), they can hinder important processes like the API service and the database. Another example: if any process that uses the disk - say the cron job - uses up the disk completely, it will cause problems for the other processes, since at least some disk space is needed to run the system, including the Linux OS and kernel-related processes. It will especially cause problems for processes like the database's, which need disk to store and read data.
Use small or very small VMs (Virtual Machines) and scale horizontally whenever possible
Why? Rationale? A few reasons
Public clouds give virtual machines and physical machines only in specific standard sizes. Even chip manufacturers provide only certain standard sizes or units for their chips - like 2, 4 or 8 cores, and 2 GB, 4 GB or 8 GB RAM. One will not easily find something in between, like a 5-core CPU or a 5 GB RAM chip, and it's not a popular thing either. So, when we use small or very small VMs, we can actually use them efficiently. If we need just 5 GB RAM, instead of having to get a VM or physical machine with 6 GB or 8 GB RAM from the cloud, we can use a mix of VMs or physical machines with less RAM - say one 4 GB RAM VM and one 1 GB RAM VM, or five 1 GB RAM VMs, or any such mix. We can mix and match. As long as our software can run on the VM and can scale horizontally, all is great. For example, API services / API servers can generally scale horizontally since they are stateless. The same is true for workloads like cron jobs - though we do have to ensure that if a cron job runs on one machine, the exact same cron job doesn't run again on another machine when that's not needed; one way to do this is sketched right after this list.
We can scale with ease and use just the resources we need, so there won't be under-utilization. There are optimal resource utilization percentages, like 60% or 70%, and we can maintain those more easily by using more, but smaller, VMs.
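To illustrate the cron job point above: one simple way to make sure the exact same job doesn't run on two machines at once is a distributed lock. Below is a minimal sketch using a PostgreSQL advisory lock, since we already have PostgreSQL in the system - the connection string and the process_images function are hypothetical placeholders, not the actual implementation.

```python
# Minimal sketch: run a cron job on only one machine at a time, using a
# PostgreSQL advisory lock. Assumes psycopg2 is installed; the DSN and
# process_images() are hypothetical placeholders.
import psycopg2

JOB_LOCK_ID = 42  # any fixed integer that identifies this particular job


def process_images():
    ...  # the actual job logic would go here


def run_job_once_across_machines():
    conn = psycopg2.connect("postgresql://user:password@db-host:5432/app")
    try:
        with conn.cursor() as cur:
            # Non-blocking: returns True only if no other session holds the lock
            cur.execute("SELECT pg_try_advisory_lock(%s)", (JOB_LOCK_ID,))
            if not cur.fetchone()[0]:
                print("Another machine is already running this job; skipping")
                return
            try:
                process_images()
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (JOB_LOCK_ID,))
    finally:
        conn.close()


if __name__ == "__main__":
    run_job_once_across_machines()
```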
Problems?
- Managing more and more VMs can be problematic. As more VMs come into the system, we need some way to manage them. Usually people use orchestrators here, like Kubernetes, HashiCorp Nomad (either open source or enterprise) and similar.
Cost of the Final Infrastructure
Some estimates
We spend $32.704 a month for the dev server (t3.medium) - see the calculation sketch below
We spend $124.10 a month for the main server (c5.xlarge) - see the calculation sketch below
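For transparency, here is a minimal sketch of where these monthly numbers come from, assuming on-demand pricing and the usual cloud convention of ~730 hours per month. The hourly rates below are the ones that reproduce the figures in this document; actual rates vary by AWS region.

```python
# Minimal sketch of the monthly cost arithmetic, assuming on-demand pricing
# and ~730 hours per month. The hourly rates are the ones that reproduce the
# figures in this document; actual rates vary by AWS region.
HOURS_PER_MONTH = 730

hourly_rates = {
    "c5.xlarge": 0.1700,  # main server
    "t3.medium": 0.0448,  # dev server
    "t3.small": 0.0224,   # worst-case proposal, per VM
    "t3.micro": 0.0112,   # proposal, per VM
}

for instance_type, rate in hourly_rates.items():
    print(f"{instance_type}: ${rate * HOURS_PER_MONTH:.3f}/month")

# Proposed setup: 3 x t3.micro
print(f"3 x t3.micro: ${3 * hourly_rates['t3.micro'] * HOURS_PER_MONTH:.3f}/month")
```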
The proposal is to use at least 1 VM for each of the following
Running API service
Running Database
Running Cron Jobs
We have not included the “main site” here. We would have to consider it as well if that's needed
Each VM's size can be based on the workload it runs. Given we use PostgreSQL for our database, I think it can run with fewer resources too. A good minimum is 1 GB of RAM and 2 CPU cores - in the cloud world these are called vCPUs, that is, virtual CPUs, similar in spirit to virtual machines (VMs).
We can try running all our workloads with just 1 GB of RAM and 2 CPU cores each. Maybe for PostgreSQL, if we are concerned, we can run it with 2 GB of RAM and 4 CPU cores. But the thing is, currently there is zero traffic, so 1 GB of RAM and 2 CPU cores should be enough for now. Even with some traffic, I think PostgreSQL can handle it; if not, we scale it up to 2 GB of RAM and 4 CPU cores, and if even that is not enough, then we look at the traffic and the reasoning before scaling up further. That's the ideal thing to do - understand why the resources are not enough. Is there an actual need, or is there a problem in the system, like a bug or some other issue causing the extra resource usage? For example, autovacuum in PostgreSQL could be running again and again and failing for some reason, and that could be the cause of the resource usage - this is just an example though; a sketch for checking exactly that follows below.
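As a minimal sketch of that kind of investigation, assuming direct access to the database: PostgreSQL exposes vacuum statistics in its pg_stat_user_tables view, so checking whether autovacuum is churning can look something like this (the connection string is a hypothetical placeholder).

```python
# Minimal sketch: check autovacuum activity per table via PostgreSQL's
# pg_stat_user_tables statistics view. The DSN is a hypothetical placeholder.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@db-host:5432/app")
with conn.cursor() as cur:
    cur.execute("""
        SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
        FROM pg_stat_user_tables
        ORDER BY autovacuum_count DESC
    """)
    for relname, last_autovacuum, autovacuum_count, n_dead_tup in cur.fetchall():
        # Unusually high counts, or dead tuples that keep growing, hint at
        # autovacuum churning or failing to keep up
        print(relname, last_autovacuum, autovacuum_count, n_dead_tup)
conn.close()
```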
So, with that in mind, we can say we will need 3 servers
- All 3 will be t3.micro instances. So, the cost will be around $8.176 x 3 = $24.528
Worst case scenario -
- All 3 will be t3.small instances. So, the cost will be around $16.352 x 3 = $49.056
That’s almost double the cost. t3.small has 2GB RAM and 2 CPU cores. If that’s also not enough, we need to look at other options
Problems that I foresee
If any of the workloads use up too many resources even when traffic is normal, we need to scale up and then understand why the system needs that many resources and whether it makes sense. If the workloads use up too many resources when traffic is high, then again we need to understand whether it makes sense and why the system needs that many resources.
I feel like the web service written in Python might need more resources even for one instance. Only when we run it with fewer resources will we be able to tell.
Assumptions
I don't think PostgreSQL will have a problem. It will run fine, given it's battle-tested software, as long as the configuration is right and it's given resources (CPU, RAM, disk) good enough for the workload it's going to run. This assumes the SQL queries are written in an efficient way; otherwise we need to make the SQL queries efficient and also use PostgreSQL's basic features like indexing and indexes (or indices) to make queries run fast for things like search - a small indexing sketch follows after these assumptions.
I don't think the Linux OS and the Linux kernel will require many resources to run, so it should be fine to use 1 GB RAM and 2 vCPUs.
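To make the indexing point from the assumptions above concrete, here is a minimal sketch, with hypothetical table and column names, of using EXPLAIN ANALYZE to spot a slow query and then adding an index for it.

```python
# Minimal sketch: use EXPLAIN ANALYZE to spot a slow sequential scan, then
# add an index. Table/column names and the DSN are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@db-host:5432/app")
conn.autocommit = True
with conn.cursor() as cur:
    # Inspect the query plan: a "Seq Scan" on a large table is a hint that
    # an index might help
    cur.execute("EXPLAIN ANALYZE SELECT * FROM scans WHERE patient_id = %s", (123,))
    for (line,) in cur.fetchall():
        print(line)

    # Add an index on the filtered column so lookups become index scans
    cur.execute("CREATE INDEX IF NOT EXISTS idx_scans_patient_id ON scans (patient_id)")
conn.close()
```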
Current set of problems or issues that are hindering us from finding the right / perfect / ideal / good-enough solution and from being able to compare solutions:
We don’t know exactly how much resource each process takes up - in terms of CPU, RAM (Memory), Network and Disk
- The AWS EC2 web console does show monitoring, but only at the EC2 instance level. The problem is that one EC2 instance runs all the processes - both our processes and the system's (Linux OS, Linux kernel) - so we cannot tell which process is using how much. We can only see high-level data at the instance level, and even then, for some reason, only CPU usage as a percentage, no RAM. There are some network usage graphs too.
These problems will persist in the future too, if we don't solve them. How do we solve them? Set up monitoring and observability. A full setup is a bit of an overkill for a small-scale system like this, but being able to understand what's going on is key and important - at least something basic (see the sketch below). Maybe Svaksha can implement and run a central monitoring and observability setup for all their clients and provide it at a low cost.
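Even something basic can be very simple. Here is a minimal sketch using the psutil library that snapshots per-process CPU and RAM usage - run periodically (say, from cron) and logged somewhere, it would already answer the "which process uses how much" question.

```python
# Minimal sketch: per-process CPU and memory snapshot using psutil
# (pip install psutil). Run it periodically and log the output to get
# a basic view over time.
import time

import psutil

# Prime per-process CPU counters; the first cpu_percent() call returns 0.0
for proc in psutil.process_iter():
    try:
        proc.cpu_percent(interval=None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(1)  # measure usage over a one-second window

# psutil caches Process objects, so these are the same instances as above
for proc in psutil.process_iter(["pid", "name", "memory_info"]):
    try:
        cpu = proc.cpu_percent(interval=None)
        rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
        print(f"{proc.info['pid']:>7} {proc.info['name']:<25} cpu={cpu:5.1f}% rss={rss_mb:8.1f} MB")
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
```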
Goals for an Ideal Future
Very reduced infrastructure cost
Performant - in terms of time, speed, resources required to run
Requires less compute - less CPU and RAM too
Requires less storage - less RAM, less or no disk
Requires less or no network
Easy to modify system - both software and infrastructure
Easy to completely destroy the whole system and bring it back up again whenever needed. Especially the infrastructure and the infrastructure related setup and any software too
Easy to test the system
Easy to develop the system
Ability to Reuse the setup elsewhere and ease of reuse too
Local / Running Offline
Other cloud environments where cloud costs are cheaper
Futuristic Ideas for an Ideal Future, based on the above goals:
Different Levels of Improvement
Software
Use efficient algorithms
Use efficient software - libraries, frameworks, tools, and systems
- Libraries, frameworks, tools and systems that implement the things you need - say, algorithms and any processing - in an efficient, cost-effective manner, with less or the least resources
Use efficient programming languages - with efficient runtimes - for better performance with less resource usage. Basically, a better performance-to-price ratio
Understand the software's cost (if any, in terms of money, effort, etc.) and performance using benchmarks, and then use it
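As a minimal sketch of the "benchmark before you choose" idea: Python's built-in timeit module is enough for a first comparison between two implementations. The two functions here are trivial stand-ins for whatever real alternatives are being compared.

```python
# Minimal sketch: benchmark two implementations with Python's built-in
# timeit module before choosing one. The two functions are trivial
# stand-ins for the real alternatives being compared.
import timeit


def concat_with_loop(n=1000):
    s = ""
    for i in range(n):
        s += str(i)
    return s


def concat_with_join(n=1000):
    return "".join(str(i) for i in range(n))


for fn in (concat_with_loop, concat_with_join):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds:.3f}s for 1000 runs")
```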
Hardware
Use efficient hardware overall
RAM
Research and use RAM that has better performance. There are many manufacturers of RAM out there, and different types of RAM too, I think
Understand the RAM chip’s cost and performance using benchmarks and then use it
Disk
Use Disks with better input/output performance - read/write performance - basically, IOPS - Input Output Operations Per Second
Use newer-technology disks whenever possible, especially for disk-heavy software like databases. So, consider using Solid State Drives (SSDs) instead of Hard Disk Drives (HDDs). There are different variants, versions and types of these in the market and in the cloud, so choose appropriately, based on cost and requirement/need. SSDs are significantly costlier than HDDs
Understand the disk’s cost and performance using benchmarks and then use it
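As a very crude minimal sketch of a disk benchmark: dedicated tools like fio are the usual choice, but even a simple timed sequential write and read in Python gives a first impression. The file path and sizes here are arbitrary.

```python
# Very crude minimal sketch of a disk throughput check: time a large
# sequential write and read. Dedicated tools like fio measure this far
# more rigorously (IOPS, random access, queue depths, etc.).
import os
import time

PATH = "/tmp/disk_bench.bin"   # arbitrary test file location
CHUNK = b"\0" * (1024 * 1024)  # 1 MiB chunk
TOTAL_MB = 256                 # total data to write/read

start = time.perf_counter()
with open(PATH, "wb") as f:
    for _ in range(TOTAL_MB):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())       # make sure the data actually hits the disk
write_s = time.perf_counter() - start
print(f"write: {TOTAL_MB / write_s:.1f} MB/s")

start = time.perf_counter()
with open(PATH, "rb") as f:
    while f.read(1024 * 1024):
        pass
read_s = time.perf_counter() - start
print(f"read: {TOTAL_MB / read_s:.1f} MB/s (may be served from the page cache)")

os.remove(PATH)
```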
CPU
Try using newer architectures and chips
- For example, try using arm64 - the ARM architecture, 64-bit. Apparently it has better performance with less resource usage; basically, a better performance-to-price ratio. Try the same with AMD chips in general if you are using the popular Intel chips - for the same amd64 architecture, one can get chips from AMD or from Intel, though the architecture is named amd64: "amd" for the naming and "64" for 64-bit
Understand the CPU’s (CPU chip’s) cost and performance using benchmarks and then use it
Network
Use efficient topologies
Less complexity preferably, for better understanding
Fewer hops, as much as possible, to avoid delays and latency
Use efficient network protocols
Use the network efficiently - requiring less bandwidth whenever possible, as network calls take time depending on the speed of the network (bandwidth) and also cost a lot in the cloud in some specific cases
Use the network cost-effectively - network calls / network usage cost a lot in the cloud in some specific cases, and even on a local machine it can get costly depending on the network being used and the network provider
Understand the network cost and performance using benchmarks and then use it