As a Platform Engineer with a strong foundation in cloud technologies, DevOps practices, Terraform, Kubernetes, and CI/CD tools, I often think about the gap between my current responsibilities and what a Site Reliability Engineer (SRE) actually does. SRE is a broad, challenging, and rewarding field, and I’m setting out to bridge this gap, not just for myself but for aspiring SREs who find themselves in a similar position.

In this post, I’ll share:

- Current Role vs. Ideal SRE Role
- A Realistic Plan to Transition to an SRE
- Actionable Steps for Aspiring SREs

Current Role vs. Ideal SRE Role

Current Role: Platform Engineer

Primary Focus: Building and maintaining platforms that developers use to deploy and run applications.
Key Activities:
- Managing Kubernetes clusters and automating deployments using tools like ArgoCD or Helm.
- Writing infrastructure-as-code (IaC) using Terraform for AWS or other cloud platforms.
- Ensuring CI/CD pipelines are robust, fast, and secure with tools like Jenkins, GitHub Actions, or GitLab CI/CD.
- Monitoring system health and performance using Prometheus, Grafana, and other observability tools.

While these are crucial responsibilities, they often center on maintaining platforms rather than the broader scope of reliability engineering.

Ideal SRE Role

Primary Focus: Ensuring the reliability, scalability, and performance of systems.
Key Responsibilities:
- Service Level Objectives (SLOs) & Error Budgets: Collaborating with product teams to define and track reliability metrics.
- Proactive Automation: Automating operational tasks to reduce toil and improve efficiency.
- Incident Management: Implementing systems to detect, respond to, and learn from incidents.
- Capacity Planning: Predicting system growth and ensuring infrastructure scales appropriately.
- Collaboration: Bridging the gap between developers and operations to create a culture of shared responsibility.
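
Capacity planning is the most arithmetic-friendly item on this list, so here is a minimal sketch of what "predicting system growth" can look like in Python. The function name `months_until_capacity`, the compound-growth model, and the sample numbers are my own illustrative assumptions; real forecasts usually need seasonality and headroom on top of this.

```python
def months_until_capacity(current_load: float, capacity: float, monthly_growth: float) -> int:
    """Return the number of whole months until load first exceeds capacity.

    Assumes simple compound growth: load is multiplied by (1 + monthly_growth)
    each month. This is an illustrative model, not a production forecast.
    """
    if current_load <= 0 or capacity <= 0:
        raise ValueError("current_load and capacity must be positive")
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive, otherwise capacity is never exceeded")

    months = 0
    load = current_load
    # Step forward one month at a time until the projected load no longer fits.
    while load <= capacity:
        load *= 1.0 + monthly_growth
        months += 1
    return months


if __name__ == "__main__":
    # Hypothetical service: 600 requests/s today, 1000 requests/s capacity, 10% growth per month.
    print(months_until_capacity(current_load=600, capacity=1000, monthly_growth=0.10))
```

The useful part is less the number itself and more the habit: once the runway is written down, it becomes something the team can review alongside the SLOs.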

Bridging the Gap: A Plan of Action

Deep Dive into SRE Principles
- Learn about SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements).
- Understand error budgets and how they guide operational decisions.

Action: Enroll in Google’s SRE Fundamentals course or similar resources.
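
To make the error-budget idea concrete, here is a minimal sketch in Python for a request-based availability SLO. The names `BudgetReport` and `error_budget_report`, the 99.9% target, and the request counts are illustrative assumptions, not figures from a real service.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    allowed_failures: float   # failed requests the SLO tolerates in the window
    budget_used: float        # fraction of the budget consumed (1.0 means fully spent)
    budget_remaining: float   # fraction of the budget left (negative if overspent)


def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> BudgetReport:
    """Compute error-budget consumption for a request-based availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The error budget is simply the share of requests the SLO allows to fail.
    allowed = (1.0 - slo_target) * total_requests
    used = failed_requests / allowed
    return BudgetReport(allowed_failures=allowed, budget_used=used, budget_remaining=1.0 - used)


if __name__ == "__main__":
    # Hypothetical 30-day window: one million requests, 400 of them failed.
    report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures: {report.allowed_failures:.0f}")
    print(f"budget used: {report.budget_used:.0%}")
    print(f"budget remaining: {report.budget_remaining:.0%}")
```

The "guides operational decisions" part follows from the last number: with plenty of budget left, a team can afford riskier releases; with the budget nearly spent, reliability work takes priority.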

Focus on Observability
- Move beyond traditional monitoring to include distributed tracing, log aggregation, and alerting strategies.
- Learn tools like OpenTelemetry, Fluentd, and Loki.

Action: Create a project to implement observability for a sample microservices application.
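
For the alerting-strategy piece, one approach described in Google's Site Reliability Workbook is to alert on error-budget burn rate across two windows instead of on raw error counts. Below is a minimal sketch of that idea. The function names are hypothetical, and the default threshold of 14.4 is an assumption that corresponds to spending 2% of a 30-day budget within one hour (0.02 × 720 hours); tune it for your own SLO window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing.

    A burn rate of 1.0 means the error budget would be used up exactly at the
    end of the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn too fast.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical readings for a 99.9% SLO.
    print(should_page(0.02, 0.02, slo_target=0.999))    # sustained, ongoing failure
    print(should_page(0.02, 0.0005, slo_target=0.999))  # incident already recovering
```

Requiring both windows is what keeps the pager quiet for blips that have already resolved, which is a large part of moving from monitoring to alerting that people trust.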

Automate Incident Management
- Simulate outages and create runbooks to standardize incident responses.
- Explore tools like PagerDuty or Opsgenie.

Action: Conduct chaos engineering experiments using tools like Gremlin or LitmusChaos.

Improve Reliability with GitOps
- Extend your GitOps expertise to ensure infrastructure is version-controlled and recoverable.

Action: Build a GitOps pipeline that includes drift detection and automatic remediation.
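
Drift detection boils down to comparing the state declared in Git with the state actually running. Here is a minimal sketch of that comparison in Python, using plain dictionaries as stand-ins for rendered manifests. The names `detect_drift` and `plan_remediation`, the `MISSING` marker, and the sample settings are all hypothetical; a real pipeline would read from the repository and the cluster API.

```python
from typing import Any

MISSING = "<missing>"  # marker for a key that exists on only one side


def detect_drift(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (desired_value, live_value)} for every setting that differs."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want = desired.get(key, MISSING)
        have = live.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


def plan_remediation(drift: dict[str, tuple[Any, Any]]) -> list[str]:
    """Turn drift into actions that make the live state match Git again."""
    actions = []
    for key, (want, _have) in drift.items():
        if want == MISSING:
            # The setting is not in Git, so it should not exist in the live system.
            actions.append(f"delete {key}")
        else:
            actions.append(f"set {key}={want}")
    return actions


if __name__ == "__main__":
    desired = {"replicas": 3, "image": "app:1.4"}              # what Git declares
    live = {"replicas": 5, "image": "app:1.4", "debug": True}  # what is running
    drift = detect_drift(desired, live)
    print(drift)
    print(plan_remediation(drift))
```

The design choice worth noting is that Git always wins: remediation never copies live changes back into the repository, which is what makes the system recoverable from a known state.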

Upskill in Chaos Engineering
- Deliberately test the limits of your systems to understand failure modes.

Action: Design a chaos engineering experiment and document the learnings.
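
Before reaching for a dedicated chaos tool, the shape of an experiment can be rehearsed in a few lines of Python: state a hypothesis, inject a fault, and measure. The sketch below injects failures into a fake dependency and checks whether retries keep the success ratio up. Every name here (`make_flaky_dependency`, `call_with_retries`, `run_experiment`) and every number is a hypothetical stand-in, and the in-process fault is only a toy version of what tools like Gremlin or LitmusChaos do against real infrastructure.

```python
import random


def make_flaky_dependency(failure_rate: float, rng: random.Random):
    """Fault injection: build a stand-in dependency that fails with the given probability."""
    def dependency() -> None:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
    return dependency


def call_with_retries(dependency, attempts: int) -> bool:
    """Call the dependency until it succeeds or attempts run out; True on success."""
    for _ in range(attempts):
        try:
            dependency()
            return True
        except ConnectionError:
            continue
    return False


def run_experiment(failure_rate: float, attempts: int, requests: int = 1000, seed: int = 42) -> float:
    """Return the observed success ratio under the injected failure rate."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    dependency = make_flaky_dependency(failure_rate, rng)
    successes = sum(call_with_retries(dependency, attempts) for _ in range(requests))
    return successes / requests


if __name__ == "__main__":
    # Hypothesis: with 30% injected failures, three attempts keep success above 95%.
    without_retries = run_experiment(failure_rate=0.3, attempts=1)
    with_retries = run_experiment(failure_rate=0.3, attempts=3)
    print(f"no retries: {without_retries:.1%}, three attempts: {with_retries:.1%}")
    print("hypothesis holds" if with_retries >= 0.95 else "hypothesis rejected")
```

Whatever the outcome, the write-up matters as much as the run: the hypothesis, the injected fault, and the observed result are the learnings to document.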

Collaborate and Share
- Join SRE communities to learn from peers and share experiences.

Action: Contribute to open-source SRE tools or write blogs on lessons learned.

Actionable Steps for Aspiring SREs

If you’re looking to start or transition into SRE, here’s a step-by-step roadmap:

Master the Basics
- Cloud Platforms: AWS, GCP, or Azure.
- Container Orchestration: Kubernetes and Docker.
- IaC Tools: Terraform or Pulumi.
- CI/CD: Jenkins, GitHub Actions, or CircleCI.

Focus on Reliability Engineering Skills
- Learn monitoring and observability tools.
- Study distributed systems concepts.
- Practice incident management and retrospectives.

Build Hands-On Projects
- Create a high-availability setup in AWS using Terraform.
- Implement a monitoring stack with Prometheus and Grafana.
- Automate deployments using GitOps tools like ArgoCD.

Understand SRE Culture
- Read Google’s SRE books (Site Reliability Engineering and The Site Reliability Workbook).
- Advocate for a culture of shared responsibility and continuous improvement.

Final Thoughts

Transitioning from a Platform Engineer to an SRE is not about abandoning your current skills but building upon them. As I dive deeper into this transformation, I’ll continue sharing my learnings, hands-on projects, and insights on my Hashnode blog and LinkedIn.

This is more than a career goal; it’s about growing into a role that fits where modern software engineering is heading. Join me on this journey, and let’s redefine what it means to be an SRE! 🚀

Bridging the Gap: My Journey from Platform Engineering to Becoming an Ideal SRE
