Bridging the Gap: My Journey from Platform Engineering to Becoming an Ideal SRE


As a Platform Engineer with a strong foundation in Cloud technologies, DevOps practices, Terraform, Kubernetes, and CI/CD tools, I often find myself pondering the gap between my current responsibilities and the true essence of a Site Reliability Engineer (SRE). The world of SRE is vast, challenging, and rewarding, and I’m embarking on a journey to bridge this gap—not just for myself but for aspiring SREs who might find themselves in a similar position.
In this blog, I’ll share:
Current Role vs. Ideal SRE Role
A Realistic Plan to Transition to an SRE
Actionable Steps for Aspiring SREs
Current Role vs. Ideal SRE Role
Current Role: Platform Engineer
Primary Focus: Building and maintaining platforms that developers use to deploy and run applications.
Key Activities:
Managing Kubernetes clusters and automating deployments using tools like ArgoCD or Helm.
Writing infrastructure-as-code (IaC) using Terraform for AWS or other cloud platforms.
Ensuring CI/CD pipelines are robust, fast, and secure with tools like Jenkins, GitHub Actions, or GitLab.
Monitoring system health and performance using Prometheus, Grafana, and other observability tools.
While these are crucial responsibilities, they often center on maintaining platforms rather than the broader scope of reliability engineering.
Ideal SRE Role
Primary Focus: Ensuring the reliability, scalability, and performance of systems.
Key Responsibilities:
Service Level Objectives (SLOs) & Error Budgets: Collaborating with product teams to define and track reliability metrics.
Proactive Automation: Automating operational tasks to reduce toil and improve efficiency.
Incident Management: Implementing systems to detect, respond to, and learn from incidents.
Capacity Planning: Predicting system growth and ensuring infrastructure scales appropriately.
Collaboration: Bridging the gap between developers and operations to create a culture of shared responsibility.
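To make the capacity-planning responsibility above a little more concrete, here is a minimal sketch of the kind of projection involved. It assumes usage grows by a fixed percentage each month, which real workloads rarely do exactly; the function name and the numbers are illustrative, not taken from any particular tool.

```python
from typing import Optional


def months_until_exhausted(
    current_usage: float, capacity: float, monthly_growth_rate: float
) -> Optional[int]:
    """Months until usage reaches capacity under compound monthly growth.

    Returns 0 if capacity is already reached and None if usage is not growing.
    """
    if current_usage <= 0 or capacity <= 0:
        raise ValueError("current_usage and capacity must be positive")
    if current_usage >= capacity:
        return 0
    if monthly_growth_rate <= 0:
        return None  # flat or shrinking usage never hits the limit

    months = 0
    usage = current_usage
    while usage < capacity:
        usage *= 1.0 + monthly_growth_rate
        months += 1
    return months


# Hypothetical example: 600 of 1000 units used, growing 10% per month.
print(months_until_exhausted(600, 1000, 0.10))  # 6
```

A number like this is only a starting point for a conversation about lead times and headroom, but it turns "we should scale soon" into a date.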
Bridging the Gap: A Plan of Action
Deep Dive into SRE Principles
Learn about SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements).
Understand error budgets and how they guide operational decisions.
Action: Enroll in Google’s SRE Fundamentals course or similar resources.
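The arithmetic behind an error budget is simple enough to sketch in a few lines. The example below is my own illustration, assuming an availability SLO expressed as a fraction (0.999 for "three nines") and a rolling 30-day window; the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(f"{error_budget_minutes(0.999):.1f} minutes of downtime allowed in 30 days")
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the error budget remains")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime. When the remaining budget approaches zero, that is the signal to slow down risky releases and spend time on reliability work instead.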
Focus on Observability
Move beyond traditional monitoring to include distributed tracing, log aggregation, and alerting strategies.
Learn tools like OpenTelemetry, Fluentd, and Loki.
Action: Create a project to implement observability for a sample microservices application.
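One alerting strategy worth practicing in such a project is alerting on how fast the error budget is burning rather than on raw error counts. The sketch below is a simplified illustration of that idea, not a production alerting rule. It assumes you already have error ratios for a long and a short window (for example one hour and five minutes); the 14.4 default is the commonly cited threshold that corresponds to spending 2% of a 30-day budget in one hour, and the function names are my own.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast: the long window shows the problem
    is significant, the short window shows it is still happening."""
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    short_burn = burn_rate(short_window_error_ratio, slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Hypothetical readings against a 99.9% SLO.
print(should_page(0.02, 0.02, 0.999))    # sustained 2% errors
print(should_page(0.02, 0.0005, 0.999))  # errors have already subsided
```

Requiring both windows to exceed the threshold keeps the alert from firing long after an incident has already recovered.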
Automate Incident Management
Simulate outages and create runbooks to standardize incident responses.
Explore tools like PagerDuty or OpsGenie.
Action: Conduct chaos engineering experiments using tools like Gremlin or LitmusChaos.
Improve Reliability with GitOps
Extend your GitOps expertise to ensure infrastructure is version-controlled and recoverable.
Action: Build a GitOps pipeline that includes drift detection and automatic remediation.
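At its core, drift detection is a comparison between the state declared in Git and the state actually running. The sketch below shows that comparison on plain dictionaries. It is a toy model under my own assumptions: real GitOps tools compare full manifests or plans, and the keys and values here are made up for illustration.

```python
from typing import Any

ABSENT = "<absent>"  # placeholder for a key that exists on only one side


def detect_drift(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (desired_value, live_value)} for every key that differs."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want = desired.get(key, ABSENT)
        have = live.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


def remediate(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, Any]:
    """Naive remediation: Git wins, so the live state becomes a copy of desired."""
    return dict(desired)


# Hypothetical states: someone scaled up by hand and added a debug flag.
desired_state = {"replicas": 3, "image": "web:1.4.2"}
live_state = {"replicas": 5, "image": "web:1.4.2", "debug": True}

print(detect_drift(desired_state, live_state))
print(detect_drift(desired_state, remediate(desired_state, live_state)))
```

The first call reports the manual changes; the second returns an empty result because remediation has brought the live state back to what Git declares. Whether remediation should be automatic or require approval is a policy decision worth making explicitly.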
Upskill in Chaos Engineering
Deliberately test the limits of your systems to understand failure modes.
Action: Design a chaos engineering experiment and document the learnings.
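A chaos experiment usually follows the same shape: confirm the system is in a steady state, inject a fault, check whether the steady state holds, and always roll back. The sketch below captures that shape with plain callables. It is an illustration under my own assumptions, not the API of Gremlin or LitmusChaos; `ReplicaPool` is a hypothetical stand-in for a real service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_during: bool  # did the steady state hold while the fault was active?
    steady_after: bool   # did the system look healthy again after rollback?

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_during and self.steady_after


def run_experiment(
    check_steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    if not check_steady_state():
        # Never inject faults into a system that is already unhealthy.
        raise RuntimeError("system is not in steady state; aborting experiment")
    inject_fault()
    try:
        steady_during = check_steady_state()
    finally:
        rollback()  # runs even if the check itself raises
    return ExperimentResult(steady_during, check_steady_state())


class ReplicaPool:
    """Hypothetical service: healthy as long as at least 2 replicas are up."""

    def __init__(self, replicas: int) -> None:
        self.target = replicas
        self.healthy = replicas

    def is_steady(self) -> bool:
        return self.healthy >= 2

    def kill_one(self) -> None:
        self.healthy -= 1

    def restore(self) -> None:
        self.healthy = self.target


pool = ReplicaPool(3)
result = run_experiment(pool.is_steady, pool.kill_one, pool.restore)
print(result, result.hypothesis_held)
```

With three replicas the hypothesis "losing one replica does not break the service" holds. Starting the same experiment with only two replicas would show the steady state failing during the fault, which is exactly the kind of learning worth documenting.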
Collaborate and Share
Join SRE communities to learn from peers and share experiences.
Action: Contribute to open-source SRE tools or write blogs on lessons learned.
Actionable Steps for Aspiring SREs
If you’re looking to start or transition into SRE, here’s a step-by-step roadmap:
Master the Basics
Cloud Platforms: AWS, GCP, or Azure.
Container Orchestration: Kubernetes and Docker.
IaC Tools: Terraform or Pulumi.
CI/CD: Jenkins, GitHub Actions, or CircleCI.
Focus on Reliability Engineering Skills
Learn monitoring and observability tools.
Study distributed systems concepts.
Practice incident management and retrospectives.
Build Hands-On Projects
Create a high-availability setup in AWS using Terraform.
Implement a monitoring stack with Prometheus and Grafana.
Automate deployments using GitOps tools like ArgoCD.
Understand SRE Culture
Read Google’s SRE books (Site Reliability Engineering and The Site Reliability Workbook).
Advocate for a culture of shared responsibility and continuous improvement.
Final Thoughts
Transitioning from a Platform Engineer to an SRE is not about abandoning your current skills but building upon them. As I dive deeper into this transformation, I’ll continue sharing my learnings, hands-on projects, and insights on my Hashnode blog and LinkedIn.
This is more than a career goal: it’s about growing into a role that matches where modern software engineering is heading. Join me on this journey, and let’s redefine what it means to be an SRE! 🚀
Written by

Venkatesh Sarivisetty
I am a Lead SRE with solid experience in cloud and DevOps services. I am a Certified Kubernetes Administrator and a Certified Cloud Architect. I love to play with my kids, read the latest blogs on new technologies, and write ✍️ blogs about them.