SRE Interview Questions - Comprehensive Answer Guide
Part 1: SRE Fundamentals & Practices
1. What is the difference between SRE and traditional operations, and how do you balance reliability with feature velocity?
Strong Answer: SRE differs from traditional ops in several key ways:
- Proactive vs Reactive: SRE focuses on preventing issues through engineering rather than just responding to them
- Error Budgets: We quantify acceptable unreliability, allowing teams to move fast while maintaining reliability targets
- Automation: SRE emphasizes eliminating toil through automation and self-healing systems
- Shared Ownership: Development and operations work together using the same tools and metrics
Balancing reliability with velocity:
- Set clear SLIs/SLOs with stakeholders (e.g., a 99.9% availability target allows roughly 43 minutes of downtime per 30-day month; see the sketch after this list)
- Use error budgets as a shared currency - if we're within budget, dev teams can deploy faster
- When error budget is exhausted, focus shifts to reliability work
- Implement gradual rollouts and feature flags to reduce blast radius
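As a rough illustration of the error-budget arithmetic, here is a minimal sketch assuming a 30-day window and a simple downtime-minutes model (the function and print statement are illustrative only):

```python
# Error-budget arithmetic for a 99.9% availability SLO over a 30-day window
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60          # 43,200 minutes in a 30-day window

error_budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)   # ~43.2 minutes

def remaining_budget_minutes(downtime_so_far: float) -> float:
    """How many minutes of downtime are still allowed in this window."""
    return error_budget_minutes - downtime_so_far

print(f"Total budget: {error_budget_minutes:.1f} min; "
      f"remaining after 12 min of downtime: {remaining_budget_minutes(12):.1f} min")
```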
Follow-up - Implementing error budgets with resistant teams:
- Start with education - show how error budgets enable faster delivery
- Use concrete examples of downtime costs vs delayed features
- Begin with lenient budgets and tighten over time
- Make error budget status visible in dashboards and planning meetings
2. Explain the four golden signals of monitoring. How would you implement alerting around these for a Python microservice?
Strong Answer: The four golden signals are:
- Latency: Time to process requests
- Traffic: Demand on your system (requests/second)
- Errors: Rate of failed requests
- Saturation: How "full" your service is (CPU, memory, I/O)
Implementation for a Python microservice (Flask with prometheus_client):
```python
# Using Prometheus with Flask
from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time
import psutil

app = Flask(__name__)

# Metrics covering the four golden signals
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')
ERROR_RATE = Counter('http_errors_total', 'Total HTTP errors', ['status'])
CPU_USAGE = Gauge('cpu_usage_percent', 'CPU usage percentage')

@app.before_request
def before_request():
    # Record the request start time on Flask's per-request context
    g.start_time = time.time()

@app.after_request
def after_request(response):
    latency = time.time() - g.start_time
    REQUEST_LATENCY.observe(latency)  # latency
    REQUEST_COUNT.labels(request.method, request.endpoint, response.status_code).inc()  # traffic
    if response.status_code >= 400:
        ERROR_RATE.labels(response.status_code).inc()  # errors
    CPU_USAGE.set(psutil.cpu_percent())  # saturation
    return response

@app.route('/metrics')
def metrics():
    # Expose the metrics for Prometheus to scrape
    return Response(generate_latest(), mimetype='text/plain')
```
Alerting Rules:
- Latency: Alert if p99 > 500ms for 5 minutes
- Traffic: Alert on 50% increase/decrease from baseline
- Errors: Alert if error rate > 1% for 2 minutes
- Saturation: Alert if CPU > 80% or Memory > 85% for 10 minutes
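To make the latency rule concrete, here is a minimal, hypothetical sketch that evaluates the p99 threshold against Prometheus's HTTP query API, using the histogram exported above. PROMETHEUS_URL and the polling approach are assumptions; in practice you would encode these thresholds as Prometheus alerting rules and route them through Alertmanager rather than polling from application code:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address of the Prometheus server

def p99_latency_seconds():
    # p99 over the last 5 minutes, computed from http_request_duration_seconds
    promql = ('histogram_quantile(0.99, '
              'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": promql}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

if __name__ == "__main__":
    p99 = p99_latency_seconds()
    if p99 is not None and p99 > 0.5:  # 500 ms threshold from the rules above
        print(f"ALERT: p99 latency is {p99 * 1000:.0f} ms (> 500 ms)")
```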
3. Walk me through how you would conduct a post-mortem for a production incident.
Strong Answer: Timeline:
- Immediate: Focus on resolution, collect logs/metrics during incident
- Within 24-48 hours: Conduct post-mortem meeting
- Within 1 week: Publish written post-mortem and track action items
Post-mortem Process:
- Timeline Construction: Build detailed timeline with all events, decisions, and communications
- Root Cause Analysis: Use techniques like "5 Whys" or Fishbone diagrams
- Impact Assessment: Quantify user impact, revenue loss, SLO burn
- Action Items: Focus on systemic fixes, not individual blame
- Follow-up: Track action items to completion
Good Post-mortem Characteristics:
- Blameless culture - focus on systems, not individuals
- Detailed timeline with timestamps
- Clear root cause analysis
- Actionable remediation items with owners and deadlines
- Written in accessible language for all stakeholders
- Includes what went well (not just failures)
Psychological Safety:
- Use "the system allowed..." instead of "person X did..."
- Ask "how can we make this impossible to happen again?"
- Celebrate people who surface problems early
- Make post-mortems learning opportunities, not punishment
4. You notice your application's 99th percentile latency has increased by 50ms over the past week, but the average latency remains the same. How would you investigate this?
Strong Answer: This points to a long-tail latency problem - most requests are fine, but a small fraction are much slower.
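A quick illustration of why the average hides this: in the sketch below (with made-up latency samples), adding a handful of slow requests barely moves the mean but pulls the p99 up dramatically.

```python
import statistics

# Hypothetical latency samples (seconds): most requests fast, a few very slow
latencies = [0.050] * 990 + [0.600] * 10

mean = statistics.mean(latencies)
# statistics.quantiles(n=100) returns the 1st..99th percentile cut points
p99 = statistics.quantiles(latencies, n=100)[98]

print(f"mean={mean * 1000:.1f}ms  p99={p99 * 1000:.1f}ms")
# The mean stays near 55 ms while the p99 lands near 600 ms -- the long tail.
```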
Investigation Steps:
- Check Request Distribution: Look at latency histograms - are we seeing bimodal distribution?
- Analyze Traffic Patterns: Has the mix of request types changed? Are we getting more complex queries?
- Database Performance: Check for slow queries, table locks, or index problems
- Resource Saturation: Look for memory pressure, GC pauses, or I/O bottlenecks during peak times
- Dependency Analysis: Check latency of downstream services - could be cascading slow responses
- Code Changes: Review recent deployments for inefficient algorithms or new features
Specific Checks:
- Database slow query logs
- Application profiling data
- Memory usage patterns and GC metrics
- Thread pool utilization
- External API response times
- Distributed tracing for slow requests
Tools: Use APM tools like New Relic, DataDog, or distributed tracing with Jaeger/Zipkin to identify bottlenecks.
5. Design a monitoring strategy for a Go-based API that processes financial transactions.
Strong Answer: Business Metrics:
- Transaction volume and value per minute
- Success rate by transaction type
- Time to settlement
- Regulatory compliance metrics (PCI DSS)
Technical Metrics:
```go
// Key metrics to track (Prometheus Go client)
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	transactionCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "transactions_total"},
		[]string{"type", "status", "payment_method"})

	transactionLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "transaction_duration_seconds"},
		[]string{"type"})

	queueDepth = prometheus.NewGauge(
		prometheus.GaugeOpts{Name: "transaction_queue_depth"})

	dbConnectionPool = prometheus.NewGauge(
		prometheus.GaugeOpts{Name: "db_connections_active"})
)

func init() {
	// Register the collectors so they appear on the /metrics endpoint
	prometheus.MustRegister(transactionCounter, transactionLatency, queueDepth, dbConnectionPool)
}
```
Logging Strategy:
- Structured logging with correlation IDs (see the sketch after this list)
- Log all transaction state changes
- Security events (failed auth, suspicious patterns)
- Audit trail for compliance
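As an illustration of the first point, here is a minimal sketch of structured JSON logging with a correlation ID using Python's standard logging module. The CorrelationIdFilter helper and the field names are assumptions; a real service would propagate the ID from an incoming request header or trace context:

```python
import logging
import uuid

class CorrelationIdFilter(logging.Filter):
    """Attach a correlation ID to every log record (illustrative helper)."""
    def __init__(self, correlation_id: str):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = self.correlation_id
        return True

logger = logging.getLogger("transactions")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"ts":"%(asctime)s","level":"%(levelname)s",'
    '"correlation_id":"%(correlation_id)s","msg":"%(message)s"}'
))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a real service the correlation ID comes from the incoming request
logger.addFilter(CorrelationIdFilter(str(uuid.uuid4())))
logger.info("transaction state changed: pending -> settled")
```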
Alerting:
- Transaction failure rate > 0.1%
- Processing latency > 2 seconds
- Queue depth > 1000 items
- Database connection pool > 80% utilization
- Any security-related events
Compliance Considerations:
- PII data must be masked in logs
- Audit logs with tamper-proof storage
- Real-time fraud detection alerts
Part 2: Software Engineering & Development
6. Code Review Scenario: Memory leak optimization
Strong Answer: Problems with the original code:
- Loads entire dataset into memory before processing
- No streaming or chunked processing
- Memory usage grows linearly with file size
Optimized version:
```python
def process_large_dataset(file_path, chunk_size=1000):
    """Process a large dataset in chunks to manage memory usage."""
    results = []
    with open(file_path, 'r') as f:
        chunk = []
        for line in f:
            chunk.append(line.strip())
            if len(chunk) >= chunk_size:
                # Process the chunk and collect the partial result
                processed_chunk = [expensive_processing(item) for item in chunk]
                partial_result = analyze_data(processed_chunk)
                results.append(partial_result)
                # Clear chunk to free memory
                chunk.clear()
        # Process remaining items
        if chunk:
            processed_chunk = [expensive_processing(item) for item in chunk]
            partial_result = analyze_data(processed_chunk)
            results.append(partial_result)
    return combine_results(results)

# Even better - use a generator for streaming
def process_large_dataset_streaming(file_path):
    """Stream processing for minimal memory footprint."""
    with open(file_path, 'r') as f:
        for line in f:
            yield expensive_processing(line.strip())

# Usage
def analyze_streaming_data(file_path):
    processed_items = process_large_dataset_streaming(file_path)
    return analyze_data_streaming(processed_items)
```
Additional Optimizations:
- Use mmap for very large files (see the sketch below)
- Implement backpressure if processing can't keep up
- Add memory monitoring and circuit breakers
- Consider using asyncio for I/O-bound operations
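For the mmap suggestion, here is a minimal sketch assuming a non-empty, newline-delimited text file; iter_lines_mmap is a hypothetical helper and expensive_processing is the same placeholder used above:

```python
import mmap

def iter_lines_mmap(file_path):
    """Iterate over lines of a very large file via a memory map."""
    with open(file_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm.readline() returns b"" at end of file
            for line in iter(mm.readline, b""):
                yield line.decode("utf-8").rstrip("\n")

# Usage with the streaming pipeline above:
# results = (expensive_processing(line) for line in iter_lines_mmap("big_file.txt"))
```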
7. In Go, explain the difference between buffered and unbuffered channels.
Strong Answer: Unbuffered Channels:
- Synchronous communication - sender blocks until receiver reads
- Zero capacity - no internal storage
- Guarantees handoff between goroutines
```go
ch := make(chan int) // unbuffered
go func() {
    ch <- 42 // blocks until someone reads
}()
value := <-ch // blocks until someone sends
```
Buffered Channels:
- Asynchronous communication up to buffer size
- Sender only blocks when buffer is full
- Receiver only blocks when buffer is empty
```go
ch := make(chan int, 3) // buffered with capacity 3
ch <- 1 // doesn't block
ch <- 2 // doesn't block
ch <- 3 // doesn't block
ch <- 4 // blocks - buffer is full
```
When to use in high-throughput systems:
Unbuffered for:
- Strict synchronization requirements
- Request-response patterns
- When you need guaranteed delivery confirmation
- Worker pools where you want backpressure
Buffered for:
- Producer-consumer with different rates
- Batching operations
- Reducing contention in high-throughput scenarios
- Event streaming where some loss is acceptable
8. React Performance: Optimize dashboard with real-time metrics
Strong Answer: Problems with frequent re-renders:
- All components re-render when any metric updates
- Expensive calculations on every render
- DOM thrashing from rapid updates
Optimization Strategy:
```jsx
import React, { memo, useMemo, useCallback, useRef, useState } from "react";
import { useVirtualizer } from "@tanstack/react-virtual";

// 1. Memoize metric components
const MetricCard = memo(({ metric, value, threshold }) => {
  // Only re-render when props actually change
  const status = useMemo(
    () => (value > threshold ? "critical" : "normal"),
    [value, threshold]
  );
  return (
    <div className={`metric-card ${status}`}>
      <h3>{metric}</h3>
      <span>{value}</span>
    </div>
  );
});

// 2. Virtualize large lists
const MetricsList = ({ metrics }) => {
  const parentRef = useRef();
  const virtualizer = useVirtualizer({
    count: metrics.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 100,
  });
  return (
    <div ref={parentRef} style={{ height: "400px", overflow: "auto" }}>
      {/* Inner container sized to the full list; visible rows are absolutely positioned */}
      <div style={{ height: virtualizer.getTotalSize(), position: "relative" }}>
        {virtualizer.getVirtualItems().map((virtualRow) => (
          <div
            key={virtualRow.key}
            style={{
              position: "absolute",
              top: 0,
              left: 0,
              width: "100%",
              transform: `translateY(${virtualRow.start}px)`,
            }}
          >
            <MetricCard {...metrics[virtualRow.index]} />
          </div>
        ))}
      </div>
    </div>
  );
};

// 3. Debounce updates and batch state changes
const Dashboard = () => {
  const [metrics, setMetrics] = useState({});
  const updateQueue = useRef(new Map());
  const flushTimeout = useRef();

  const queueUpdate = useCallback((serviceName, newMetrics) => {
    updateQueue.current.set(serviceName, newMetrics);
    // Debounce updates - batch multiple rapid changes
    clearTimeout(flushTimeout.current);
    flushTimeout.current = setTimeout(() => {
      setMetrics((prev) => {
        const updates = Object.fromEntries(updateQueue.current);
        updateQueue.current.clear();
        return { ...prev, ...updates };
      });
    }, 100); // 100ms debounce
  }, []);

  return <MetricsList metrics={Object.values(metrics)} />;
};
```
9. Design a CI/CD pipeline for a multi-service application
Strong Answer: Pipeline Architecture:
```yaml
# .github/workflows/main.yml
name: Multi-Service CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      python-api: ${{ steps.changes.outputs.python-api }}
      go-workers: ${{ steps.changes.outputs.go-workers }}
      react-frontend: ${{ steps.changes.outputs.react-frontend }}
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v2
        id: changes
        with:
          filters: |
            python-api:
              - 'services/api/**'
            go-workers:
              - 'services/workers/**'
            react-frontend:
              - 'frontend/**'

  test-python:
    needs: detect-changes
    if: needs.detect-changes.outputs.python-api == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Python tests
        run: |
          cd services/api
          pip install -r requirements.txt
          pytest --cov=. --cov-report=xml
          flake8 .
          mypy .

  deploy:
    needs: [test-python, test-go, test-react]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy with blue-green
        run: |
          # Database migration strategy
          kubectl apply -f k8s/migration-job.yaml
          kubectl wait --for=condition=complete job/db-migration

          # Deploy the new version to the green environment
          helm upgrade app-green ./helm-chart \
            --set image.tag=${{ github.sha }} \
            --set environment=green

          # Health check the green environment
          ./scripts/health-check.sh green

          # Switch traffic to green
          kubectl patch service app-service -p \
            '{"spec":{"selector":{"version":"green"}}}'
```
Part 3: System Design Deep Dive
10. Requirements Gathering Questions
Strong Answer: Functional Requirements:
- What specific metrics need to be displayed? (orders/minute, revenue, concurrent users)
- How real-time? (sub-second, few seconds, minute-level updates)
- What user roles need access? (executives, ops teams, developers)
- What actions can users take? (view-only, alerts, drill-down)
- Geographic distribution of users?
Non-Functional Requirements:
- Scale: How many concurrent dashboard users? (100s, 1000s)
- Data volume: Orders per day? Peak traffic? Data retention period?
- Availability: 99.9% or higher? Maintenance windows?
- Latency: How fast should dashboard updates be?
- Consistency: Can we show slightly stale data? (eventual consistency)
- Security: Authentication, authorization, audit logging?
11. High-Level Architecture Diagram
Strong Answer:
```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   React SPA     │      │  Load Balancer   │      │   API Gateway   │
│                 │◄────►│   (ALB/NGINX)    │◄────►│  (Kong/Envoy)   │
│ - Dashboard     │      │                  │      │ - Auth          │
│ - WebSocket     │      │                  │      │ - Rate Limiting │
└─────────────────┘      └──────────────────┘      └────────┬────────┘
                                                            │
          ┌────────────────────────┬───────────────────┬────┘
          │                        │                   │
┌─────────▼────────┐   ┌───────────▼────────┐   ┌──────▼──────┐
│   Metrics API    │   │   WebSocket API    │   │ Config API  │
│  (Python/Flask)  │   │    (Go/Gorilla)    │   │(Python/Fast)│
└─────────┬────────┘   └───────────┬────────┘   └──────┬──────┘
          │                        │                   │
┌─────────▼────────┐   ┌───────────▼────────┐   ┌──────▼──────┐
│   Redis Cache    │   │   Message Queue    │   │ PostgreSQL  │
│    (Metrics)     │   │   (Kafka/Redis)    │   │  (Config)   │
└─────────┬────────┘   └───────────┬────────┘   └─────────────┘
          │                        │
┌─────────▼────────────────────────▼────────┐
│            Time Series Database            │
│           (InfluxDB/TimescaleDB)           │
└────────────────────────────────────────────┘
```
12. Database Design: SQL vs NoSQL
Strong Answer: Hybrid Approach - Use Both:
SQL (PostgreSQL) for:
- Transactional Data: Orders, users, inventory
- ACID Requirements: Financial transactions
- Complex Queries: Joins, aggregations
- Data Consistency: Strong consistency needs
```sql
-- OLTP database schema (range-partitioned by order date)
CREATE TABLE orders (
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id),
    order_total DECIMAL(10,2) NOT NULL,
    order_timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- On a partitioned table the primary key must include the partition key
    PRIMARY KEY (id, order_timestamp)
) PARTITION BY RANGE (order_timestamp);

CREATE INDEX idx_orders_timestamp ON orders(order_timestamp);
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- Monthly partition for time-series data
CREATE TABLE orders_2025_06 PARTITION OF orders
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');
```
NoSQL (InfluxDB) for:
- Time-series Metrics: Performance data, system metrics
- High Write Volume: Thousands of metrics per second
- Retention Policies: Automatic data aging
```python
# InfluxDB for metrics storage
from datetime import datetime
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# org is required for writes; "my-org" is a placeholder like "my-token"
client = InfluxDBClient(url="http://influxdb:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

def write_metric(measurement, tags, fields):
    point = (
        Point(measurement)
        .tag("service", tags.get("service"))
        .tag("region", tags.get("region"))
        .field("value", fields["value"])
        .time(datetime.utcnow(), WritePrecision.S)
    )
    write_api.write(bucket="metrics", record=point)
```
Part 4: Advanced SRE & Operations
13. Go Service CPU Investigation
Strong Answer: Systematic CPU Investigation Process:
```go
// 1. Enable pprof in the Go service for CPU profiling
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
	"runtime"
)

func main() {
	// Start the pprof server on a separate local port
	go func() {
		log.Println("Starting pprof server on :6060")
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Set GOMAXPROCS to the container CPU limit
	runtime.GOMAXPROCS(2) // adjust based on container resources

	// Your application code
	startApplication()
}
```
Investigation Tools:
```bash
#!/bin/bash
# cpu-investigation.sh
echo "🔍 Investigating Go service CPU usage..."

# 1. Get a CPU profile (30 seconds) and open the interactive web UI
echo "📊 Collecting CPU profile..."
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"

# 2. Check for goroutine leaks
echo "🧵 Checking goroutine count..."
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -20

# 3. Memory allocation profile
echo "💾 Checking memory allocations..."
go tool pprof http://localhost:6060/debug/pprof/allocs

# 4. Check GC performance (requires the expvar handler to be registered)
echo "🗑️ Checking garbage collection stats..."
curl -s http://localhost:6060/debug/vars | jq '.memstats'
```
14. Staying Current with SRE Practices
Strong Answer: My Learning Strategy:
Daily (30 minutes):
- SRE Weekly Newsletter - concise industry updates
- Hacker News - scan for infrastructure/reliability topics
- Internal Slack channels - #sre-learning, #incidents-learned
Weekly (2-3 hours):
- Google SRE Book Club - team works through chapters together
- Kubernetes documentation - staying current with new features
- Conference talk videos - KubeCon, SREcon, Velocity recordings
Monthly Deep Dives:
- Academic papers - especially from USENIX, SOSP, OSDI conferences
- Vendor whitepapers - but with healthy skepticism
- Open source project exploration - contribute small patches
Hands-on Learning Lab:
```yaml
# Home lab setup for experimentation
homelab_projects:
  current_experiments:
    - name: "eBPF monitoring tools"
      status: "Building custom metrics collector"
      learning: "Kernel-level observability"
    - name: "Chaos engineering with Litmus"
      status: "Testing failure scenarios"
      learning: "Resilience patterns"
  infrastructure:
    platform: "Kubernetes cluster on Raspberry Pi"
    monitoring: "Prometheus + Grafana + Jaeger"
    ci_cd: "GitLab CI with ArgoCD"
```
Community Engagement:
- SRE Discord/Slack communities - daily participation
- Local meetups - monthly CNCF and DevOps meetups
- Conference speaking - submitted 3 talks this year
- Mentoring - guide 2 junior engineers
- Open source contributions - maintain a small monitoring tool
Key Success Factors:
- Consistency over intensity - 30 minutes daily beats 8 hours monthly
- Applied learning - immediately try new concepts in lab/work
- Teaching others - best way to solidify knowledge
- Balance breadth and depth - stay broad but go deep on core areas
Summary