Real SRE (Site Reliability Engineer) Interview Questions & Answers Part 1

I recently switched my job as a Senior Site Reliability Engineer (SRE) and got a 50% salary hike as well. I am sharing the interview questions that I faced across multiple interview rounds, along with my answers.
YouTube Video URL for detailed explanation: https://youtu.be/p0tYcr4Z7eY
Q1. How would you deploy an application to AWS?
Ans: When a company wants to use AWS to deploy its workloads, it needs to set up a landing zone (https://docs.aws.amazon.com/prescriptive-guidance/latest/migration-aws-environment/understanding-landing-zones.html). Here, you will set up VPCs, including NAT gateways, Internet gateways, Security Groups, etc. If you want to deploy an application on a VM, that is, an EC2 instance, then you will need to provision it. Depending on the kind of application, it may or may not need internet access, so you will have to choose the right VPC and, within that VPC, the right subnet (public or private) and the right Security Group. If the application needs autoscaling, you can use an Auto Scaling group, attach it to a Load Balancer, and point a DNS record in your hosted zone (for example, in Route 53) to the load balancer's DNS name. Once the instance is up, you must install the application dependencies (here I am talking about a monolithic application) and deploy the application package (like a JAR file). If you need a database, you can use the AWS RDS service, and so on.
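To make the provisioning step concrete, here is a minimal boto3 sketch of launching a single EC2 instance into a chosen subnet with a security group. The AMI ID, subnet ID, and security group ID are placeholders (not values from the interview), and in a real setup you would more likely drive this through an Auto Scaling group launch template or infrastructure-as-code.

```python
# Minimal sketch (not the exact setup described above) of launching an EC2
# instance into a specific VPC subnet with a security group using boto3.
# The AMI ID, subnet ID, and security group ID below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",         # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",     # public or private subnet, as needed
    SecurityGroupIds=["sg-0123456789abcdef0"],
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "monolith-app"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```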
Q2. What measures have you taken to secure your EKS clusters?
Ans: • If you are using managed node groups, then AWS manages the worker node OS, kubelet, and AMIs for you.
• Don’t allow SSH access to the nodes; instead, use AWS Systems Manager (SSM) Session Manager.
• Do not rely on service account tokens for authentication, as those credentials are static and long-lived. Instead, use RBAC (Role-Based Access Control) for in-cluster authorization and follow the principle of least privilege when granting access to AWS resources (for example, through IAM Roles for Service Accounts).
• The cluster endpoint should be made private (see the sketch after this list).
• Run your applications as a non-root user.
• Use CloudTrail for audit logs.
• Use AWS KMS and enable rotation for the customer-managed keys.
• Have a robust security vulnerability management system within your EKS cluster.
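As a small illustration of the endpoint hardening point above, here is a hedged boto3 sketch that switches an existing cluster's API endpoint from public to private. The cluster name is a placeholder, and such a change has to be planned, since kubectl access from outside the VPC stops working once the public endpoint is disabled.

```python
# Hedged sketch: making an existing EKS cluster's API endpoint private with boto3.
# "my-cluster" is a placeholder name; schedule this carefully, because anything
# that reaches the API server from outside the VPC will lose access.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

eks.update_cluster_config(
    name="my-cluster",
    resourcesVpcConfig={
        "endpointPublicAccess": False,   # disable the public endpoint
        "endpointPrivateAccess": True,   # serve the API only inside the VPC
    },
)
```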
Q3. What is the toughest challenge that you have faced with EKS clusters?
Ans: Upgrading a cluster looks straightforward on paper, but a lot of things have to be taken care of before you go for it. The first thing is to go through all the release notes in detail to understand the changes in the newer version. We need to be careful with any deprecated APIs that might already be in use by some of our applications, so it is important to discuss this with the developer teams and understand the impact of the change. It is also critical to complete testing in the lower environments before we go for a production upgrade.
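Once the release notes have been reviewed, deprecated APIs cleaned up, and the lower environments tested, the control-plane upgrade itself can be triggered with a single EKS API call. The sketch below is illustrative only; the cluster name and target version are placeholders, and managed node groups and add-ons still need to be upgraded separately afterwards.

```python
# Illustrative boto3 sketch of the final step of an EKS upgrade, after all the
# preparation described above. Cluster name and target version are placeholders.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# Check the current control-plane version first.
current = eks.describe_cluster(name="my-cluster")["cluster"]["version"]
print(f"Current control-plane version: {current}")

# Kicks off an asynchronous, one-minor-version upgrade of the control plane.
update = eks.update_cluster_version(name="my-cluster", version="1.30")
print(update["update"]["id"], update["update"]["status"])
```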
Q4. Have you ever worked on remediating security vulnerabilities?
Ans: Yes, we use a tool called Tenable to run scans across our infrastructure, both internal and external. This is managed by the security team, who receive reports containing details of each vulnerability, such as the CVE ID, impact, remediation references, etc. Depending on the criticality of the vulnerability, we have SLAs to fix it.
For example, suppose a new PHP version is available that fixes some vulnerabilities reported in an application running in production. We first check with the application developers whether we can go ahead with this change. If they say yes, we apply the change in the lower environment first, the developers test it and share the results, and then we make the same change in production.
Q5. What monitoring tools have you used? Can you explain which components you monitored?
Ans: I have used Prometheus to scrape metrics and Grafana to visualize them.
Generally, we have multiple Kubernetes clusters to monitor. We scrape metrics from control-plane components like the api-server, CoreDNS, and kube-scheduler, and on the worker nodes we scrape container metrics from the kubelet (cAdvisor). We also scrape kube-state-metrics, which exposes cluster-level object metrics for deployments, pods, etc. To get host-level metrics like CPU, memory, and network, we use node-exporter.
For visualization, we use Grafana dashboards: for example, requests per minute for an application, the count of 500, 502, and 504 errors, and so on.
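To give a flavor of the queries behind such panels, here is a small Python sketch that runs PromQL against the Prometheus HTTP API. The Prometheus URL and the application metric name (http_requests_total with an app label) are assumptions for illustration; node_cpu_seconds_total is a standard node-exporter metric.

```python
# Hedged sketch of the kind of queries behind the Grafana panels described above,
# issued directly against the Prometheus HTTP API. The Prometheus URL and the
# application metric/label names are placeholders, not from the interview.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder URL

QUERIES = {
    # Requests per minute for one application (assumed app-level counter).
    "requests_per_minute": 'sum(rate(http_requests_total{app="my-app"}[5m])) * 60',
    # Count of 5xx responses (500/502/504) over the last 5 minutes.
    "errors_5xx": 'sum(increase(http_requests_total{app="my-app",code=~"500|502|504"}[5m]))',
    # Node-level CPU busy fraction from node-exporter.
    "node_cpu_busy": '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
}

for name, promql in QUERIES.items():
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    print(name, resp.json()["data"]["result"])
```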
