Leveraging Few-Shot Prompting to Transform Your Workflows

Ugochi Ukaegbu
8 min read

Introduction

Staying ahead in the ever-changing DevOps landscape requires a constant exploration of new tools and techniques that streamline operations. One of the most promising developments in recent months has been the use of few-shot prompting, a technique borrowed from AI that is changing the way DevOps engineers approach their daily workflows.

What is few-shot prompting?

Few-shot prompting is a technique in which you guide an AI model's behavior by providing a small number of examples of the task you want performed. Unlike traditional programming, where you must explicitly code every rule and exception, few-shot prompting lets you teach the model through examples, much like you would train a junior colleague.
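For instance, two worked examples are often enough for the model to infer the pattern and complete a new case. The snippet below is a made-up illustration of that Input/Output style:

Input: prod-us-east-1
Output: environment=production, region=us-east-1

Input: stg-eu-west-2
Output: environment=staging, region=eu-west-2

Input: dev-ap-south-1
Output:

Given the two completed pairs, the model infers that the final answer should be environment=development, region=ap-south-1, without any rules being written down.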

Why DevOps Engineers Should Care

For DevOps professionals, few-shot prompting is an effective way to optimize processes, automate repetitive operations, and communicate more effectively with AI systems. Whether it's developing infrastructure-as-code (IaC) templates, debugging scripts, or optimizing CI/CD pipelines, this method allows engineers to obtain precise, context-aware results with minimal manual effort.

As AI becomes more integrated into DevOps operations, mastering few-shot prompting can be a game changer, allowing teams to reduce errors, accelerate deployments, and focus on innovation rather than tedious configuration. This article examines how the technique works and how you can apply it to improve your DevOps processes.

Automating Repetitive DevOps Tasks with Few-Shot Prompting

DevOps engineers often spend valuable time on repetitive tasks like writing deployment scripts, parsing logs, or debugging configuration files. Few-shot prompting can automate many of these processes by providing AI models with just a few examples to generate accurate, context-aware outputs.

Common Repetitive DevOps Tasks Ideal for Automation

  • Log Analysis & Error Detection: Filtering logs, identifying anomalies, or categorizing errors.

  • Script Generation: Writing Bash, Python, or PowerShell scripts for deployments or backups.

  • Infrastructure-as-Code (IaC) Templates: Generating Terraform, Ansible, or Kubernetes YAML files.

  • CI/CD Pipeline Debugging: Suggesting fixes for failed builds or optimizing workflows.

  • Alert Triage: Summarizing and prioritizing alerts from monitoring tools.

  • Configuration Management: Generating or updating configuration files across different environments.

  • Ticket Triage: Categorizing and routing incoming tickets based on content.
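Because the prompt is just text, any of the tasks above can be wired into a script or pipeline step rather than typed into a chat window. The following is a minimal Bash sketch for the ticket-triage task; it assumes the OpenAI Chat Completions endpoint, an OPENAI_API_KEY environment variable, and jq installed, and the model name, example tickets, and categories are purely illustrative:

#!/bin/bash
# Minimal sketch: send a few-shot ticket-triage prompt to an LLM from a script.
# Assumes the OpenAI Chat Completions API, an OPENAI_API_KEY environment
# variable, and jq. Model name and example tickets are illustrative.

PROMPT=$(cat <<'EOF'
Input: "Build #482 failing on main with exit code 137"
Output: Category: CI/CD | Priority: High | Route to: Platform team

Input: "Please add me to the staging VPN group"
Output: Category: Access request | Priority: Low | Route to: IT support

Input: "Checkout service returning 502 errors for EU users since 09:00"
Output:
EOF
)

curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$PROMPT" \
        '{model: "gpt-4o-mini", messages: [{role: "user", content: $p}]}')" \
  | jq -r '.choices[0].message.content'

A script like this can run on a schedule or be triggered by your ticketing system's webhook, with the few-shot examples kept in version control alongside the rest of your automation.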

Structuring Effective Prompts for Automation

The key to successful automation with few-shot prompting is how you structure the inputs you feed the large language model (LLM). The steps to create effective prompts are as follows (a skeleton template follows the list):
1. Identify the pattern: Define the input and output formats clearly.
2. Provide varied examples: Cover a range of scenarios, including edge cases.
3. Include reasoning: Explain the reasoning behind each example.
4. Apply uniform formatting: Maintain a consistent structure across all examples.
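Putting these steps together, a generic prompt skeleton might look like the following; the angle-bracket fields are placeholders, not required syntax:

Task: <what you want the model to produce>
Context: <environment, tools, and constraints>

Input: <example 1>
Reasoning: <why this output follows from this input>
Output: <example 1 output>

Input: <example 2, ideally covering an edge case>
Reasoning: <why this output follows from this input>
Output: <example 2 output>

Input: <the new case you want handled>
Output: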

Prompt Examples

Example 1: Generating a Bash Script for Log Cleanup

Prompt:

Input: Write a Bash script to delete log files older than 7 days in /var/log.  
Output:  
#!/bin/bash  
find /var/log -type f -name "*.log" -mtime +7 -exec rm {} \;  

Input: Write a Bash script to compress logs older than 30 days in /opt/app/logs.  
Output:  
#!/bin/bash  
find /opt/app/logs -type f -name "*.log" -mtime +30 -exec gzip {} \;  

Input: Write a Bash script to delete temporary files older than 1 day in /tmp.  
Output:

ChatGPT’s response:

#!/bin/bash
find /tmp -type f -mtime +1 -exec rm -f {} \;

DeepSeek’s response:

#!/bin/bash  
find /tmp -type f -name "*.tmp" -mtime +1 -exec rm {} \;

Both responses reproduce the pattern from the examples, including the shebang line alongside the cleanup command. Note the difference, though: ChatGPT deletes every file older than one day, while DeepSeek restricts the match to *.tmp files, so it is worth reviewing generated commands before running them.

Example 2: Summarizing Error Log Entries

Task: Analyze this error log entry and provide a concise summary with severity level, affected component, potential root cause, and recommended action.

Context: These logs come from our Kubernetes cluster running microservices. Severity levels should be critical, high, medium, or low.

Prompt:

Input: [2024-05-01T14:22:18.345Z] ERROR [auth-service] - Connection refused to database at 10.0.0.12:5432. Retry attempt 3 failed. 
Output:
Severity: High
Component: auth-service
Root Cause: Database connectivity issue 
Recommendation: Check database pod health and network policies

Input: [2024-05-01T15:10:45.128Z] WARN [payment-processor] - Rate limit reached for external payment API (120/100 requests). Backing off for 30s. 
Output:
Severity: Medium
Component: payment-processor
Root Cause: API rate limiting 
Recommendation: Review rate limit settings and implement better request throttling

New Input: [2024-05-01T16:05:22.891Z] ERROR [user-service] - Out of memory error occurred while processing batch operation. Process terminated.

AI’s response:

**Severity:** Critical
**Component:** user-service
**Root Cause:** Out of memory condition during batch processing
**Recommendation:** Investigate memory limits in the pod configuration and optimize memory usage or scale resources accordingly

Intelligent Troubleshooting with Few-Shot Prompting

As DevOps teams grow, debugging infrastructure issues becomes more complicated. Few-shot prompting can turn an LLM into an AI-powered troubleshooting assistant that diagnoses problems by drawing on previous incidents. Engineers can guide the model to recommend fixes by providing organized examples of frequent issues and their solutions, reducing downtime and manual debugging work.

The Traditional Troubleshooting Workflow

In most organizations, the troubleshooting procedure looks like this:
1. An alert fires for a system issue.
2. An engineer investigates by reviewing logs, metrics, and configurations.
3. The engineer compares the current symptoms with previous experience.
4. Several diagnostic approaches are attempted, often through trial and error.
5. Finally, the root cause is discovered and remedied.

This approach works, but it is inefficient, depends heavily on individual heroics, and does not scale well. Few-shot prompting can transform it.

Creating a Troubleshooting Knowledge Base

The first step in using few-shot prompting for troubleshooting is to compile a knowledge base of frequent issues and their solutions. This doesn't have to be a huge project; start by documenting 5-10 of your most common or most significant incidents.

Document the following for each incident:

  • Symptoms (alerts, logs, error messages)

  • Diagnosis process

  • Root cause

  • Resolution steps

  • Any preventive actions implemented

This documentation serves as the foundation for your few-shot examples.
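As a rough illustration, if each documented incident lives in its own text file formatted as an Input/Output pair (the layout used in the next section), a short script can stitch them into a reusable few-shot prompt. The directory and file names here are assumptions, not a required convention:

#!/bin/bash
# Minimal sketch: assemble documented incidents into a few-shot troubleshooting
# prompt. Assumes one incident per file under incidents/, each written as an
# Input/Output pair, and the current symptoms saved in current-symptoms.txt.

{
  echo "Task: Diagnose this issue and provide investigation steps, potential causes, and recommended solutions."
  echo
  cat incidents/*.txt            # documented incidents become the few-shot examples
  echo "New Input:"
  cat current-symptoms.txt       # symptoms of the incident being investigated
} > troubleshooting-prompt.txt   # feed this file to the LLM of your choice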

Structured Troubleshooting Prompts

Let's examine how to structure a troubleshooting prompt for Kubernetes cluster issues:

Task: Diagnose this Kubernetes cluster issue and provide a prioritized list of investigation steps, 
potential causes, and recommended solutions.

Context: This is a production cluster running on AWS EKS with approximately 50 microservices. 
The cluster uses Istio for service mesh, Prometheus for monitoring, and Fluentd for logging. 

Example 1: 
Input:
Symptoms: 
- Multiple pods showing CrashLoopBackOff status 
- CPU usage across nodes spiked to 90%+ 
- Connection timeouts reported by frontend services 
- No recent deployments or configuration changes 
- Occurred suddenly at approximately 14:30 UTC  

Output: 
Diagnosis: 
This pattern suggests a resource contention issue, most likely caused by: 
1. Memory leak in one or more services 
2. External traffic surge overwhelming resources 
3. Possible DDoS attack 
4. Background job or cron job consuming excessive resources  

Investigation Steps: 
1. Identify which nodes are experiencing highest CPU usage 
2. Check which pods on those nodes started failing first 
3. Examine resource metrics for the 30 minutes prior to the incident 
4. Look for unusual network traffic patterns in Istio metrics 
5. Review any scheduled jobs that might have triggered at 14:30 UTC  

Recommended Actions: 
1. Temporarily increase resource limits on critical services 
2. Identify and restart the problematic service(s) 
3. Consider scaling out the affected node group 
4. Implement resource quotas on namespaces if not already present  

Example 2: 
Input: 
Symptoms: 
- Intermittent 503 errors from API gateway 
- Service discovery lookups failing sporadically 
- DNS resolution timeouts in pod logs 
- Cluster has been running stable for weeks 
- Recently upgraded CoreDNS version  

Output: 
Diagnosis: 
This pattern indicates DNS resolution issues within the cluster, most likely caused by: 
1. CoreDNS configuration problems after upgrade 
2. CoreDNS pods under-resourced for current cluster size 
3. Network policy blocking DNS traffic 
4. kube-proxy issues affecting service discovery  

Investigation Steps: 
1. Check CoreDNS pod logs for error patterns 
2. Verify CoreDNS deployment has sufficient replicas and resources 
3. Test DNS resolution from various namespaces 
4. Examine network policies that might affect kube-dns namespace 
5. Verify kube-proxy daemonset status across all nodes  

Recommended Actions: 
1. Scale up CoreDNS deployment if under-provisioned 
2. Rollback CoreDNS version if issues began after upgrade 
3. Ensure appropriate network policies for DNS traffic 
4. Consider implementing DNS caching at pod level for critical services  

New Input: 
Symptoms: 
- Persistent volume claims stuck in "Pending" state 
- New pods won't schedule, showing "waiting for volume" events 
- AWS console shows EBS volumes as "creating" for over 30 minutes 
- Cluster autoscaler logs show node scaling but not completing 
- Only affects production namespace, staging working fine

With this systematic method, the AI can apply the reasoning patterns from your examples to new scenarios, essentially "thinking like" your most experienced engineers.

Case Study: Troubleshooting a Kubernetes Cluster Issue

Problem: A production cluster suddenly reports NodeNotReady errors, and pods are evicted.

Step 1: Feed the AI Context with Few-Shot Examples

Prompt:

Input: kubectl get nodes shows 'NotReady'. kubectl describe node reports 'DiskPressure'.  

Output:  
Likely Cause: Node disk is full.  

Solution:  
1. Check disk usage and identify large files (kubectl exec -it <pod> -- df -h).  
2. Clean up unused logs or PVCs.  
3. Expand the node volume if needed.  

Input: kubectl get nodes shows 'NotReady'. kubectl describe node reports 'MemoryPressure'. 

Output:  
Likely Cause: Insufficient memory.  

Solution:  
1. Check pod memory limits (kubectl top pods).  
2. Evict or reschedule memory-heavy pods.  
3. Scale up the node group.  

Input: kubectl get nodes shows 'NotReady'. kubectl describe node reports 'NetworkUnavailable'.  
Output:

AI Response

Likely Cause: CNI plugin crash or network misconfiguration.  
Solution:  
1. Restart kubelet (systemctl restart kubelet).  
2. Verify CNI pods are running (kubectl get pods -n kube-system).  
3. Check node network interfaces (ip addr show).

Building a Learning System

The most powerful aspect of using few-shot prompting for troubleshooting is that it creates a learning system:

  1. Document resolved incidents as new examples.

  2. Periodically review and refine your example set.

  3. Track which examples lead to successful resolutions.

  4. Identify gaps in your knowledge base when new types of issues arise.

Over time, this creates a compounding return on investment: each incident makes your system smarter and more capable.

Ethical and Cultural Considerations

While implementing these systems, consider:

  1. Maintaining human oversight for critical systems.

  2. Ensuring the system recommends safe diagnostic steps first.

  3. Creating a culture where the tool augments rather than replaces expertise.

  4. Encouraging engineers to contribute to the knowledge base.

Conclusion

The goal is not to replace human judgment but to encode organizational knowledge so that it is available to everyone, regardless of experience level. By systematizing troubleshooting through few-shot prompting, you not only solve today's problems more quickly, but you also build institutional knowledge that compounds over time, resulting in a more resilient organization.


Written by

Ugochi Ukaegbu

DevOps/Cloud Engineer who loves learning, sharing knowledge and enjoys engaging with others on various topics. Welcome to my Universe of Learning, where I transform complex ideas into simple forms. My passion for sharing knowledge fuels my writing, making it accessible and fun.