DevOps Complexity Philosophy: Pragmatism and Skepticism - 2025 Challenges Forecast
Like any field or discipline, DevOps is divided into theory and practice. The practice (or pragmatics) should lead us to a production-ready environment: code running on a production cloud or on-premise, with underlying networking suitable for my Kubernetes environment, obeying best-practice microservices decomposition for my business logic (whether a bank, fintech, ad-tech, ML, SaaS, etc.), and serving the business's customers, which translates to ARR and profitability.
This complexity leads to deconstruction and abstraction of the infrastructure, which in turn brings more challenges that will stay with us in 2025, along with questions you and your DevOps team need to ask yourselves.
Patterns, More Patterns
Developers are well familiar with MVC, which mutated into MVVM, alongside the classic OOP design patterns, and the same goes for microservices design patterns (e.g., Saga).
Eventually, separating the front end and the back end infrastructure-wise is required: for the front end I'm going to need a CDN, and for the back end I'm going to need Kubernetes. So the back end is back to OOP-style design patterns, which take event-driven patterns and message queues into consideration. Further, Kubernetes Operators and Controllers offer more and more design patterns (in the form of concepts and CRDs), not to mention how the microservices design patterns should be translated into Pod-to-Pod relationships.
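To make the event-driven part concrete, here is a minimal sketch of two microservices handing off work through a queue (SQS in this case); the queue URL, event shape, and service roles are hypothetical placeholders, not a prescription.

# Minimal event-driven hand-off between two microservices via SQS.
# Queue URL and event payload are hypothetical placeholders.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-events"  # placeholder

def publish_order_created(order_id: str) -> None:
    """Producer side: the 'orders' service emits an event instead of calling consumers directly."""
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"type": "OrderCreated", "order_id": order_id}))

def consume_once() -> None:
    """Consumer side: the 'billing' service polls, processes, and acknowledges."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        print("processing", event["type"], event.get("order_id"))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])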
Historiography: Modern (EC2/Ansible) vs. Post-Modern (IaC/K8s/Helm Charts)
Is there a right and a wrong, or is it a matter of historiography? Is bringing up EC2 manually without IaC and running an Ansible playbook to configure your server considered legacy? The same can be said of companies that avoid MSK/SQS in favor of the reliability and comfort of RDS.
If so, is it like the relationship between modernity and post-modernity?
Should we use Managed MSK/SQS or should we run them inside K8s?
Continuous Integration: Development ←→ QA ←→ Staging
After the relevant microservices patterns have been decided (have they, really?), I'm going to need to integrate the development teams with the QA team. For this, I'll create an EKS cluster while allowing the developers to push OCI images to ECR using GitHub Actions or any other CI tool, integrated with Flux's ImageRepository or ArgoCD, and this will add Kubernetes Pods to my EKS development cluster.
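As a hedged sketch of the CI side, this is roughly what such a pipeline step does when it builds and pushes an image to ECR; the repository name, tag, and region are placeholders, and the real job would run inside GitHub Actions rather than locally.

# Hedged sketch of a CI step that builds an OCI image and pushes it to ECR.
# Repository name, tag, and region are hypothetical placeholders.
import base64
import boto3
import docker  # pip install docker

REGION = "us-east-1"
REPO = "my-backend"          # ECR repository name (placeholder)
TAG = "feature-login-123"    # feature-branch tag (placeholder)

ecr = boto3.client("ecr", region_name=REGION)
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
registry = auth["proxyEndpoint"].replace("https://", "")

cli = docker.from_env()
cli.login(username=user, password=password, registry=registry)

image_ref = f"{registry}/{REPO}:{TAG}"
cli.images.build(path=".", tag=image_ref)        # build from the local Dockerfile
cli.images.push(f"{registry}/{REPO}", tag=TAG)   # Flux/ArgoCD image automation can pick this tag up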
IMPORTANT: Try to avoid using third-party software that commits and pushes to your Git repo just to deploy a Pod!
All that's left is to implement the ingress architecture to allow HTTPS/443 traffic into my EKS cluster and to route each feature-branch Pod with Traefik or Istio.
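As a hedged sketch of that routing, the following creates a host-based Ingress for a single feature branch; the ingress class, hostname, namespace, and service name are assumptions for illustration, not part of the original setup.

# Hedged sketch: host-based routing for a feature-branch Service behind an ingress controller
# (Traefik, Istio ingress, or NGINX). Hostname, namespace, and service name are hypothetical.
from kubernetes import client, config

config.load_kube_config()

branch = "feature-login"
ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(name=f"{branch}-ingress", namespace="dev"),
    spec=client.V1IngressSpec(
        ingress_class_name="traefik",  # assumption: Traefik is the installed ingress class
        rules=[client.V1IngressRule(
            host=f"{branch}.dev.example.com",  # each feature branch gets its own host
            http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
                path="/", path_type="Prefix",
                backend=client.V1IngressBackend(service=client.V1IngressServiceBackend(
                    name=f"{branch}-svc", port=client.V1ServiceBackendPort(number=80))))]))]))

client.NetworkingV1Api().create_namespaced_ingress(namespace="dev", body=ingress)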
Continuous Deployment is easy with Flagger
Later, when the developers finish their development and are ready for production, I'll take a specific tag from a specific ECR repository and promote it to production. After another cycle of deploying Pods to a staging environment that more closely resembles the production environment, and sessions of QA, we move from deployment to production.
Now the CD of the Pods and the Helm chart provisioning of production (for scalability and observability) take forking paths.
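For the Flagger path mentioned above, a canary rollout is declared as a Canary resource next to the Deployment. The following is a minimal sketch; the Deployment name, namespace, and analysis thresholds are placeholder assumptions.

# Hedged sketch of a Flagger Canary driving progressive delivery for a Deployment.
# Deployment name, namespace, and analysis thresholds are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

canary = {
    "apiVersion": "flagger.app/v1beta1",
    "kind": "Canary",
    "metadata": {"name": "backend", "namespace": "prod"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "backend"},
        "service": {"port": 80},
        "analysis": {                      # shift traffic in steps and roll back on failed checks
            "interval": "1m",
            "threshold": 5,
            "maxWeight": 50,
            "stepWeight": 10,
            "metrics": [{"name": "request-success-rate",
                         "thresholdRange": {"min": 99},
                         "interval": "1m"}],
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="flagger.app", version="v1beta1", namespace="prod", plural="canaries", body=canary)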
Philosophy of Production
Pod as Microcosmos: SideCars
Alongside your Pod's main application container (this is your primary application), did you consider adding sidecars for networking, secrets, observability, VPA recommendations, authentication, and logs? (See the sketch after this list.)
Envoy Proxy: Handles traffic management and load balancing.
Vault Agent: Manages secrets and provides security features.
Prometheus Exporter: Collects and exposes performance metrics.
Vertical Pod Autoscaler (VPA) Recommender: Optimizes resource allocation.
OAuth2 Proxy: Handles authentication and authorization.
Fluentd: Collects and forwards logs.
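As a hedged illustration of the Pod-as-microcosm idea, here is a minimal Pod spec carrying the main container plus two of the sidecars above (Envoy and Fluentd). The images and names are illustrative placeholders, and in practice injectors (Istio, Vault Agent Injector) usually add these containers for you.

# Hedged sketch of a Pod with a main application container and two sidecars.
# Image tags and the app name are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="backend", labels={"app": "backend"}),
    spec=client.V1PodSpec(containers=[
        client.V1Container(name="app", image="example/backend:1.0",              # main application
                           ports=[client.V1ContainerPort(container_port=8080)]),
        client.V1Container(name="envoy", image="envoyproxy/envoy:v1.31-latest"), # traffic management
        client.V1Container(name="fluentd", image="fluent/fluentd:edge"),         # log collection/forwarding
    ]))

client.CoreV1Api().create_namespaced_pod(namespace="prod", body=pod)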
Managing Helm Charts with a Helmfile Operator: CRDs
Deployment to production is easy, but how do I coordinate Helm charts and Pod deployments alongside ReplicaSet and DaemonSet reconfiguration? How does GitOps work with Helm charts?
Should we use a GitOps repository for provisioning the production environment, or should I use the Helmfile Operator? How do I manage all the Helm charts consistently?
https://github.com/mumoshu/helmfile-operator
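For intuition, here is a hedged sketch of what the Helmfile Operator linked above (or Helmfile itself) automates: reconciling a declarative list of releases with helm upgrade --install. The chart versions and values paths are placeholder assumptions.

# Hedged sketch of declarative, consistent Helm release management.
# Chart versions and values paths are hypothetical placeholders.
import subprocess

RELEASES = [
    {"name": "cilium", "chart": "cilium/cilium", "namespace": "kube-system",
     "version": "1.16.0", "values": "values/cilium.yaml"},
    {"name": "keda", "chart": "kedacore/keda", "namespace": "keda",
     "version": "2.15.0", "values": "values/keda.yaml"},
]

def sync(release: dict) -> None:
    """Idempotently install or upgrade one release, pinned to an explicit chart version."""
    subprocess.run([
        "helm", "upgrade", "--install", release["name"], release["chart"],
        "--namespace", release["namespace"], "--create-namespace",
        "--version", release["version"], "-f", release["values"],
    ], check=True)

for r in RELEASES:
    sync(r)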
Dealing with values.yaml
Has your DevOps team successfully managed to integrate GitOps with each Helm chart's values.yaml? Does your DevOps team understand all of Cilium's values.yaml (~3,600 lines), Istio's (~500 lines), and KEDA's values.yaml (~850 lines), and take full advantage of the functionality?
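One recurring source of confusion is how layered values files combine. The following hedged sketch mimics what helm upgrade -f base.yaml -f prod.yaml does when merging them; the file names are placeholders.

# Hedged sketch of layered values files: later files override earlier ones,
# the same way multiple -f flags merge in Helm. File names are placeholders.
import yaml  # pip install pyyaml

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base (maps merge, scalars/lists are replaced)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

with open("values/base.yaml") as f, open("values/prod.yaml") as g:
    effective = deep_merge(yaml.safe_load(f) or {}, yaml.safe_load(g) or {})

print(yaml.safe_dump(effective, sort_keys=False))  # the values the chart will actually render with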
Q: Do you use Lens?
CNI, Networking & Ingresses Design Patterns (Envoy/Traefik/Istio)
Finishing the design of the Pods with their sidecars, as outlined by the microservices design patterns, leads us to the service mesh and K8s networking: how do the Pods interact with each other over mTLS while having all the certificates ready at hand?
In this book, the complexity is dealt with at a low level, and how to handle networking practices will still be a challenge in 2025. But what really helps is being able to debug all of this networking with Kiali.
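If Istio is the mesh in use, a hedged sketch of the mTLS piece is a PeerAuthentication resource that enforces strict mutual TLS for a namespace, with certificates issued and rotated by the mesh; the namespace here is a placeholder.

# Hedged sketch: strict mTLS between Pods in one namespace via Istio PeerAuthentication.
# Namespace name is a hypothetical placeholder.
from kubernetes import client, config

config.load_kube_config()

peer_auth = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "PeerAuthentication",
    "metadata": {"name": "default", "namespace": "prod"},
    "spec": {"mtls": {"mode": "STRICT"}},   # reject any plaintext Pod-to-Pod traffic
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="security.istio.io", version="v1beta1",
    namespace="prod", plural="peerauthentications", body=peer_auth)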
DevSecOps Aesthetics
Comprehensively securing the code (developers), the container (Trivy, gVisor), the cluster (RBAC, NetworkPolicy, Pod Security Admission), and the cloud (IAM, OIDC, TLS/SSL, Security Groups, MFA, STS) has become pretty basic in today's DevSecOps world, and following the methodology of the CIS Kubernetes Benchmark plus eBPF could be a good start. Try to add Pods inside your K8s cluster to take care of DevSecOps!
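As a hedged example of one cluster-layer control from that list, here is a default-deny NetworkPolicy for a namespace, on top of which you would explicitly allow the flows you need; the namespace is a placeholder.

# Hedged sketch: default-deny NetworkPolicy for a namespace (namespace is a placeholder).
from kubernetes import client, config

config.load_kube_config()

deny_all = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-all", namespace="prod"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),   # empty selector = every Pod in the namespace
        policy_types=["Ingress", "Egress"],      # no rules listed => deny all traffic by default
    ))

client.NetworkingV1Api().create_namespaced_network_policy(namespace="prod", body=deny_all)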
Q: Are you running Vault inside your k8s cluster?
eBPF-based Security Observability: Tetragon & Cilium
Between zero trust and air gaps, it has become best practice to embed kernel observability with eBPF tools like Tetragon and Cilium, so make sure you add them to your observability Helm chart toolset alongside your OpenTelemetry/SIEM observability, while reading up on how to build a WAF to prevent command injection, backdoors, and reverse shells.
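As a hedged sketch, Tetragon is typically installed from the cilium Helm repository; the release name and namespace below follow the upstream defaults, but treat them as assumptions for your environment.

# Hedged sketch: adding the eBPF security-observability layer via its Helm chart.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(list(cmd), check=True)

run("helm", "repo", "add", "cilium", "https://helm.cilium.io")
run("helm", "repo", "update")
# Tetragon: kernel-level (eBPF) security observability events per Pod/process
run("helm", "upgrade", "--install", "tetragon", "cilium/tetragon", "--namespace", "kube-system")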
DevSecOps Tools
https://medium.com/@noah_h/kubernetes-security-tools-seccomp-apparmor-586fdc61e6d9
https://github.com/aquasecurity/kube-bench
https://www.cisecurity.org/benchmark/kubernetes
https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html
https://spacelift.io/blog/aws-sts
https://spiffe.io/docs/latest/spiffe-about/overview/
https://medium.com/@vanchi811/aws-iam-roles-anywhere-63656682c7aa
Production Skepticism: Chaos Engineering
SRE or Continuous Observability? Chaos Engineering Experiments or Disaster Recovery?
[ TIP: Keep challenging your production environment. Consider Chaos Mesh or Gremlin and the full ChaosHub. ]
If you think your production environment is safe because you took care of auto-scaling, and you can sleep well at night, think again: beyond metrics observability (and alerting systems), you want to prepare for disasters in production by shifting left with chaos engineering and running experiments such as the following.
Case: Scalability Chaos Engineering
If you are using auto-scaling solutions (KEDA/Karpenter), try to challenge them with a Pod autoscaler chaos engineering experiment, as sketched below.
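A hedged sketch with Chaos Mesh: a PodChaos experiment that kills one replica of a scaled workload so you can watch KEDA/Karpenter and the Deployment recover. The namespace and labels are placeholders.

# Hedged sketch of a Chaos Mesh PodChaos experiment (namespace and labels are placeholders).
from kubernetes import client, config

config.load_kube_config()

pod_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "kill-one-backend", "namespace": "prod"},
    "spec": {
        "action": "pod-kill",
        "mode": "one",                                   # pick a single matching Pod at random
        "selector": {"namespaces": ["prod"],
                     "labelSelectors": {"app": "backend"}},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="chaos-mesh.org", version="v1alpha1",
    namespace="prod", plural="podchaos", body=pod_chaos)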
About the Author:
Amit Sides is a development, DevOps & SRE expert, specializing in DevSecOps and MLOps.
Useful links
https://github.com/cncf/curriculum
https://github.com/notaryproject/notary
https://github.com/notaryproject/notary/blob/master/docs/service_architecture.md
https://spiffe.io/docs/latest/spiffe-about/overview/#aws
https://github.com/Silas-cloudspace/terraform-modules
https://medium.com/@vanchi811/aws-iam-roles-anywhere-63656682c7aa
https://docs.cast.ai/docs/about-the-read-only-agent
https://github.com/bottlerocket-os/bottlerocket/tree/develop