SRE Interview Questions - Comprehensive Answer Guide
Part 1: SRE Fundamentals & Practices
1. What is the difference between SRE and traditional operations, and how do you balance reliability with feature velocity?
Strong Answer: SRE differs from traditional ops in several key ways:
- Proactive vs Reactive: SRE focuses on preventing issues through engineering rather than just responding to them
- Error Budgets: We quantify acceptable unreliability, allowing teams to move fast while maintaining reliability targets
- Automation: SRE emphasizes eliminating toil through automation and self-healing systems
- Shared Ownership: Development and operations work together using the same tools and metrics
Balancing reliability with velocity:
- Set clear SLIs/SLOs with stakeholders (e.g., a 99.9% availability target allows roughly 43 minutes of downtime per 30-day month; see the sketch after this list)
- Use error budgets as a shared currency - if we're within budget, dev teams can deploy faster
- When error budget is exhausted, focus shifts to reliability work
- Implement gradual rollouts and feature flags to reduce blast radius
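As a rough illustration of the error-budget arithmetic, here is a minimal sketch assuming a 30-day window and a simple downtime-minutes model (the function and print statement are illustrative only):

```python
# Error-budget arithmetic for a 99.9% availability SLO over a 30-day window
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60          # 43,200 minutes in a 30-day window

error_budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)   # ~43.2 minutes

def remaining_budget_minutes(downtime_so_far: float) -> float:
    """How many minutes of downtime are still allowed in this window."""
    return error_budget_minutes - downtime_so_far

print(f"Total budget: {error_budget_minutes:.1f} min; "
      f"remaining after 12 min of downtime: {remaining_budget_minutes(12):.1f} min")
```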
Follow-up - Implementing error budgets with resistant teams:
- Start with education - show how error budgets enable faster delivery
- Use concrete examples of downtime costs vs delayed features
- Begin with lenient budgets and tighten over time
- Make error budget status visible in dashboards and planning meetings
2. Explain the four golden signals of monitoring. How would you implement alerting around these for a Python microservice?
Strong Answer: The four golden signals are:
- Latency: Time to process requests
- Traffic: Demand on your system (requests/second)
- Errors: Rate of failed requests
- Saturation: How "full" your service is (CPU, memory, I/O)
Implementation for a Python microservice (Flask with prometheus_client):
```python
# Using Prometheus with Flask
from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time
import psutil

app = Flask(__name__)

# Metrics covering the four golden signals
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')
ERROR_RATE = Counter('http_errors_total', 'Total HTTP errors', ['status'])
CPU_USAGE = Gauge('cpu_usage_percent', 'CPU usage percentage')

@app.before_request
def before_request():
    # Record the request start time on Flask's per-request context
    g.start_time = time.time()

@app.after_request
def after_request(response):
    latency = time.time() - g.start_time
    REQUEST_LATENCY.observe(latency)  # latency
    REQUEST_COUNT.labels(request.method, request.endpoint, response.status_code).inc()  # traffic
    if response.status_code >= 400:
        ERROR_RATE.labels(response.status_code).inc()  # errors
    CPU_USAGE.set(psutil.cpu_percent())  # saturation
    return response

@app.route('/metrics')
def metrics():
    # Expose the metrics for Prometheus to scrape
    return Response(generate_latest(), mimetype='text/plain')
```
Alerting Rules:
- Latency: Alert if p99 > 500ms for 5 minutes
- Traffic: Alert on 50% increase/decrease from baseline
- Errors: Alert if error rate > 1% for 2 minutes
- Saturation: Alert if CPU > 80% or Memory > 85% for 10 minutes
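To make the latency rule concrete, here is a minimal, hypothetical sketch that evaluates the p99 threshold against Prometheus's HTTP query API, using the histogram exported above. PROMETHEUS_URL and the polling approach are assumptions; in practice you would encode these thresholds as Prometheus alerting rules and route them through Alertmanager rather than polling from application code:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address of the Prometheus server

def p99_latency_seconds():
    # p99 over the last 5 minutes, computed from http_request_duration_seconds
    promql = ('histogram_quantile(0.99, '
              'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": promql}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

if __name__ == "__main__":
    p99 = p99_latency_seconds()
    if p99 is not None and p99 > 0.5:  # 500 ms threshold from the rules above
        print(f"ALERT: p99 latency is {p99 * 1000:.0f} ms (> 500 ms)")
```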
3. Walk me through how you would conduct a post-mortem for a production incident.
Strong Answer: Timeline:
- Immediate: Focus on resolution, collect logs/metrics during incident
- Within 24-48 hours: Conduct post-mortem meeting
- Within 1 week: Publish written post-mortem and track action items
Post-mortem Process:
- Timeline Construction: Build detailed timeline with all events, decisions, and communications
- Root Cause Analysis: Use techniques like "5 Whys" or Fishbone diagrams
- Impact Assessment: Quantify user impact, revenue loss, SLO burn
- Action Items: Focus on systemic fixes, not individual blame
- Follow-up: Track action items to completion
Good Post-mortem Characteristics:
- Blameless culture - focus on systems, not individuals
- Detailed timeline with timestamps
- Clear root cause analysis
- Actionable remediation items with owners and deadlines
- Written in accessible language for all stakeholders
- Includes what went well (not just failures)
Psychological Safety:
- Use "the system allowed..." instead of "person X did..."
- Ask "how can we make this impossible to happen again?"
- Celebrate people who surface problems early
- Make post-mortems learning opportunities, not punishment
4. You notice your application's 99th percentile latency has increased by 50ms over the past week, but the average latency remains the same. How would you investigate this?
Strong Answer: This points to a long-tail latency problem - most requests are fine, but a small fraction are much slower.
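A quick illustration of why the average hides this: in the sketch below (with made-up latency samples), adding a handful of slow requests barely moves the mean but pulls the p99 up dramatically.

```python
import statistics

# Hypothetical latency samples (seconds): most requests fast, a few very slow
latencies = [0.050] * 990 + [0.600] * 10

mean = statistics.mean(latencies)
# statistics.quantiles(n=100) returns the 1st..99th percentile cut points
p99 = statistics.quantiles(latencies, n=100)[98]

print(f"mean={mean * 1000:.1f}ms  p99={p99 * 1000:.1f}ms")
# The mean stays near 55 ms while the p99 lands near 600 ms -- the long tail.
```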
Investigation Steps:
- Check Request Distribution: Look at latency histograms - are we seeing bimodal distribution?
- Analyze Traffic Patterns: Has the mix of request types changed? Are we getting more complex queries?
- Database Performance: Check for slow queries, table locks, or index problems
- Resource Saturation: Look for memory pressure, GC pauses, or I/O bottlenecks during peak times
- Dependency Analysis: Check latency of downstream services - could be cascading slow responses
- Code Changes: Review recent deployments for inefficient algorithms or new features
Specific Checks:
- Database slow query logs
- Application profiling data
- Memory usage patterns and GC metrics
- Thread pool utilization
- External API response times
- Distributed tracing for slow requests
Tools: Use APM tools like New Relic, DataDog, or distributed tracing with Jaeger/Zipkin to identify bottlenecks.
5. Design a monitoring strategy for a Go-based API that processes financial transactions.
Strong Answer: Business Metrics:
- Transaction volume and value per minute
- Success rate by transaction type
- Time to settlement
- Regulatory compliance metrics (PCI DSS)
Technical Metrics:
```go
// Key metrics to track (Prometheus Go client)
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	transactionCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "transactions_total"},
		[]string{"type", "status", "payment_method"})

	transactionLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "transaction_duration_seconds"},
		[]string{"type"})

	queueDepth = prometheus.NewGauge(
		prometheus.GaugeOpts{Name: "transaction_queue_depth"})

	dbConnectionPool = prometheus.NewGauge(
		prometheus.GaugeOpts{Name: "db_connections_active"})
)

func init() {
	// Register the collectors so they appear on the /metrics endpoint
	prometheus.MustRegister(transactionCounter, transactionLatency, queueDepth, dbConnectionPool)
}
```
Logging Strategy:
- Structured logging with correlation IDs (see the sketch after this list)
- Log all transaction state changes
- Security events (failed auth, suspicious patterns)
- Audit trail for compliance
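As an illustration of the first point, here is a minimal sketch of structured JSON logging with a correlation ID using Python's standard logging module. The CorrelationIdFilter helper and the field names are assumptions; a real service would propagate the ID from an incoming request header or trace context:

```python
import logging
import uuid

class CorrelationIdFilter(logging.Filter):
    """Attach a correlation ID to every log record (illustrative helper)."""
    def __init__(self, correlation_id: str):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = self.correlation_id
        return True

logger = logging.getLogger("transactions")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"ts":"%(asctime)s","level":"%(levelname)s",'
    '"correlation_id":"%(correlation_id)s","msg":"%(message)s"}'
))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a real service the correlation ID comes from the incoming request
logger.addFilter(CorrelationIdFilter(str(uuid.uuid4())))
logger.info("transaction state changed: pending -> settled")
```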
Alerting:
- Transaction failure rate > 0.1%
- Processing latency > 2 seconds
- Queue depth > 1000 items
- Database connection pool > 80% utilization
- Any security-related events
Compliance Considerations:
- PII data must be masked in logs
- Audit logs with tamper-proof storage
- Real-time fraud detection alerts
Part 2: Software Engineering & Development
6. Code Review Scenario: Memory leak optimization
Strong Answer: Problems with the original code:
- Loads entire dataset into memory before processing
- No streaming or chunked processing
- Memory usage grows linearly with file size
Optimized version:
```python
def process_large_dataset(file_path, chunk_size=1000):
    """Process a large dataset in chunks to manage memory usage."""
    results = []
    with open(file_path, 'r') as f:
        chunk = []
        for line in f:
            chunk.append(line.strip())
            if len(chunk) >= chunk_size:
                # Process the chunk and collect the partial result
                processed_chunk = [expensive_processing(item) for item in chunk]
                partial_result = analyze_data(processed_chunk)
                results.append(partial_result)
                # Clear chunk to free memory
                chunk.clear()
        # Process remaining items
        if chunk:
            processed_chunk = [expensive_processing(item) for item in chunk]
            partial_result = analyze_data(processed_chunk)
            results.append(partial_result)
    return combine_results(results)

# Even better - use a generator for streaming
def process_large_dataset_streaming(file_path):
    """Stream processing for minimal memory footprint."""
    with open(file_path, 'r') as f:
        for line in f:
            yield expensive_processing(line.strip())

# Usage
def analyze_streaming_data(file_path):
    processed_items = process_large_dataset_streaming(file_path)
    return analyze_data_streaming(processed_items)
```
Additional Optimizations:
- Use mmap for very large files (see the sketch below)
- Implement backpressure if processing can't keep up
- Add memory monitoring and circuit breakers
- Consider using asyncio for I/O-bound operations
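For the mmap suggestion, here is a minimal sketch assuming a non-empty, newline-delimited text file; iter_lines_mmap is a hypothetical helper and expensive_processing is the same placeholder used above:

```python
import mmap

def iter_lines_mmap(file_path):
    """Iterate over lines of a very large file via a memory map."""
    with open(file_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm.readline() returns b"" at end of file
            for line in iter(mm.readline, b""):
                yield line.decode("utf-8").rstrip("\n")

# Usage with the streaming pipeline above:
# results = (expensive_processing(line) for line in iter_lines_mmap("big_file.txt"))
```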
7. In Go, explain the difference between buffered and unbuffered channels.
Strong Answer: Unbuffered Channels:
- Synchronous communication - sender blocks until receiver reads
- Zero capacity - no internal storage
- Guarantees handoff between goroutines
```go
ch := make(chan int) // unbuffered
go func() {
    ch <- 42 // blocks until someone reads
}()
value := <-ch // blocks until someone sends
```
Buffered Channels:
- Asynchronous communication up to buffer size
- Sender only blocks when buffer is full
- Receiver only blocks when buffer is empty
```go
ch := make(chan int, 3) // buffered with capacity 3
ch <- 1 // doesn't block
ch <- 2 // doesn't block
ch <- 3 // doesn't block
ch <- 4 // blocks - buffer is full
```
When to use in high-throughput systems:
Unbuffered for:
- Strict synchronization requirements
- Request-response patterns
- When you need guaranteed delivery confirmation
- Worker pools where you want backpressure
Buffered for:
- Producer-consumer with different rates
- Batching operations
- Reducing contention in high-throughput scenarios
- Event streaming where some loss is acceptable
8. React Performance: Optimize dashboard with real-time metrics
Strong Answer: Problems with frequent re-renders:
- All components re-render when any metric updates
- Expensive calculations on every render
- DOM thrashing from rapid updates
Optimization Strategy:
```jsx
import React, { memo, useMemo, useCallback, useRef, useState } from "react";
import { useVirtualizer } from "@tanstack/react-virtual";

// 1. Memoize metric components
const MetricCard = memo(({ metric, value, threshold }) => {
  // Only re-render when props actually change
  const status = useMemo(
    () => (value > threshold ? "critical" : "normal"),
    [value, threshold]
  );
  return (
    <div className={`metric-card ${status}`}>
      <h3>{metric}</h3>
      <span>{value}</span>
    </div>
  );
});

// 2. Virtualize large lists
const MetricsList = ({ metrics }) => {
  const parentRef = useRef();
  const virtualizer = useVirtualizer({
    count: metrics.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 100,
  });
  return (
    <div ref={parentRef} style={{ height: "400px", overflow: "auto" }}>
      {/* Inner container sized to the full list; visible rows are absolutely positioned */}
      <div style={{ height: virtualizer.getTotalSize(), position: "relative" }}>
        {virtualizer.getVirtualItems().map((virtualRow) => (
          <div
            key={virtualRow.key}
            style={{
              position: "absolute",
              top: 0,
              left: 0,
              width: "100%",
              transform: `translateY(${virtualRow.start}px)`,
            }}
          >
            <MetricCard {...metrics[virtualRow.index]} />
          </div>
        ))}
      </div>
    </div>
  );
};

// 3. Debounce updates and batch state changes
const Dashboard = () => {
  const [metrics, setMetrics] = useState({});
  const updateQueue = useRef(new Map());
  const flushTimeout = useRef();

  const queueUpdate = useCallback((serviceName, newMetrics) => {
    updateQueue.current.set(serviceName, newMetrics);
    // Debounce updates - batch multiple rapid changes
    clearTimeout(flushTimeout.current);
    flushTimeout.current = setTimeout(() => {
      setMetrics((prev) => {
        const updates = Object.fromEntries(updateQueue.current);
        updateQueue.current.clear();
        return { ...prev, ...updates };
      });
    }, 100); // 100ms debounce
  }, []);

  return <MetricsList metrics={Object.values(metrics)} />;
};
```
9. Design a CI/CD pipeline for a multi-service application
Strong Answer: Pipeline Architecture:
```yaml
# .github/workflows/main.yml
name: Multi-Service CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      python-api: ${{ steps.changes.outputs.python-api }}
      go-workers: ${{ steps.changes.outputs.go-workers }}
      react-frontend: ${{ steps.changes.outputs.react-frontend }}
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v2
        id: changes
        with:
          filters: |
            python-api:
              - 'services/api/**'
            go-workers:
              - 'services/workers/**'
            react-frontend:
              - 'frontend/**'

  test-python:
    needs: detect-changes
    if: needs.detect-changes.outputs.python-api == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Python tests
        run: |
          cd services/api
          pip install -r requirements.txt
          pytest --cov=. --cov-report=xml
          flake8 .
          mypy .

  deploy:
    needs: [test-python, test-go, test-react]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy with blue-green
        run: |
          # Database migration strategy
          kubectl apply -f k8s/migration-job.yaml
          kubectl wait --for=condition=complete job/db-migration

          # Deploy the new version to the green environment
          helm upgrade app-green ./helm-chart \
            --set image.tag=${{ github.sha }} \
            --set environment=green

          # Health check the green environment
          ./scripts/health-check.sh green

          # Switch traffic to green
          kubectl patch service app-service -p \
            '{"spec":{"selector":{"version":"green"}}}'
```
Part 3: System Design Deep Dive
10. Requirements Gathering Questions
Strong Answer: Functional Requirements:
- What specific metrics need to be displayed? (orders/minute, revenue, concurrent users)
- How real-time? (sub-second, few seconds, minute-level updates)
- What user roles need access? (executives, ops teams, developers)
- What actions can users take? (view-only, alerts, drill-down)
- Geographic distribution of users?
Non-Functional Requirements:
- Scale: How many concurrent dashboard users? (100s, 1000s)
- Data volume: Orders per day? Peak traffic? Data retention period?
- Availability: 99.9% or higher? Maintenance windows?
- Latency: How fast should dashboard updates be?
- Consistency: Can we show slightly stale data? (eventual consistency)
- Security: Authentication, authorization, audit logging?
11. High-Level Architecture Diagram
Strong Answer:
```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   React SPA     │      │  Load Balancer   │      │   API Gateway   │
│                 │◄────►│   (ALB/NGINX)    │◄────►│  (Kong/Envoy)   │
│ - Dashboard     │      │                  │      │ - Auth          │
│ - WebSocket     │      │                  │      │ - Rate Limiting │
└─────────────────┘      └──────────────────┘      └────────┬────────┘
                                                            │
          ┌────────────────────────┬───────────────────┬────┘
          │                        │                   │
┌─────────▼────────┐   ┌───────────▼────────┐   ┌──────▼──────┐
│   Metrics API    │   │   WebSocket API    │   │ Config API  │
│  (Python/Flask)  │   │    (Go/Gorilla)    │   │(Python/Fast)│
└─────────┬────────┘   └───────────┬────────┘   └──────┬──────┘
          │                        │                   │
┌─────────▼────────┐   ┌───────────▼────────┐   ┌──────▼──────┐
│   Redis Cache    │   │   Message Queue    │   │ PostgreSQL  │
│    (Metrics)     │   │   (Kafka/Redis)    │   │  (Config)   │
└─────────┬────────┘   └───────────┬────────┘   └─────────────┘
          │                        │
┌─────────▼────────────────────────▼────────┐
│            Time Series Database            │
│           (InfluxDB/TimescaleDB)           │
└────────────────────────────────────────────┘
```
12. Database Design: SQL vs NoSQL
Strong Answer: Hybrid Approach - Use Both:
SQL (PostgreSQL) for:
- Transactional Data: Orders, users, inventory
- ACID Requirements: Financial transactions
- Complex Queries: Joins, aggregations
- Data Consistency: Strong consistency needs
```sql
-- OLTP database schema (range-partitioned by order date)
CREATE TABLE orders (
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id),
    order_total DECIMAL(10,2) NOT NULL,
    order_timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- On a partitioned table the primary key must include the partition key
    PRIMARY KEY (id, order_timestamp)
) PARTITION BY RANGE (order_timestamp);

CREATE INDEX idx_orders_timestamp ON orders(order_timestamp);
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- Monthly partition for time-series data
CREATE TABLE orders_2025_06 PARTITION OF orders
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');
```
NoSQL (InfluxDB) for:
- Time-series Metrics: Performance data, system metrics
- High Write Volume: Thousands of metrics per second
- Retention Policies: Automatic data aging
```python
# InfluxDB for metrics storage
from datetime import datetime
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# org is required for writes; "my-org" is a placeholder like "my-token"
client = InfluxDBClient(url="http://influxdb:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

def write_metric(measurement, tags, fields):
    point = (
        Point(measurement)
        .tag("service", tags.get("service"))
        .tag("region", tags.get("region"))
        .field("value", fields["value"])
        .time(datetime.utcnow(), WritePrecision.S)
    )
    write_api.write(bucket="metrics", record=point)
```
Part 4: Advanced SRE & Operations
13. Go Service CPU Investigation
Strong Answer: Systematic CPU Investigation Process:
```go
// 1. Enable pprof in the Go service for CPU profiling
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
	"runtime"
)

func main() {
	// Start the pprof server on a separate local port
	go func() {
		log.Println("Starting pprof server on :6060")
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Set GOMAXPROCS to the container CPU limit
	runtime.GOMAXPROCS(2) // adjust based on container resources

	// Your application code
	startApplication()
}
```
Investigation Tools:
```bash
#!/bin/bash
# cpu-investigation.sh
echo "🔍 Investigating Go service CPU usage..."

# 1. Get a CPU profile (30 seconds) and open the interactive web UI
echo "📊 Collecting CPU profile..."
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"

# 2. Check for goroutine leaks
echo "🧵 Checking goroutine count..."
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -20

# 3. Memory allocation profile
echo "💾 Checking memory allocations..."
go tool pprof http://localhost:6060/debug/pprof/allocs

# 4. Check GC performance (requires the expvar handler to be registered)
echo "🗑️ Checking garbage collection stats..."
curl -s http://localhost:6060/debug/vars | jq '.memstats'
```
14. Staying Current with SRE Practices
Strong Answer: My Learning Strategy:
Daily (30 minutes):
- SRE Weekly Newsletter - concise industry updates
- Hacker News - scan for infrastructure/reliability topics
- Internal Slack channels - #sre-learning, #incidents-learned
Weekly (2-3 hours):
- Google SRE Book Club - team works through chapters together
- Kubernetes documentation - staying current with new features
- Conference talk videos - KubeCon, SREcon, Velocity recordings
Monthly Deep Dives:
- Academic papers - especially from USENIX, SOSP, OSDI conferences
- Vendor whitepapers - but with healthy skepticism
- Open source project exploration - contribute small patches
Hands-on Learning Lab:
```yaml
# Home lab setup for experimentation
homelab_projects:
  current_experiments:
    - name: "eBPF monitoring tools"
      status: "Building custom metrics collector"
      learning: "Kernel-level observability"
    - name: "Chaos engineering with Litmus"
      status: "Testing failure scenarios"
      learning: "Resilience patterns"
  infrastructure:
    platform: "Kubernetes cluster on Raspberry Pi"
    monitoring: "Prometheus + Grafana + Jaeger"
    ci_cd: "GitLab CI with ArgoCD"
```
Community Engagement:
- SRE Discord/Slack communities - daily participation
- Local meetups - monthly CNCF and DevOps meetups
- Conference speaking - submitted 3 talks this year
- Mentoring - guide 2 junior engineers
- Open source contributions - maintain a small monitoring tool
Key Success Factors:
- Consistency over intensity - 30 minutes daily beats 8 hours monthly
- Applied learning - immediately try new concepts in lab/work
- Teaching others - best way to solidify knowledge
- Balance breadth and depth - stay broad but go deep on core areas
Summary