SRE Interview Questions - Comprehensive Answer Guide
Part 1: SRE Fundamentals & Practices
1. What is the difference between SRE and traditional operations, and how do you balance reliability with feature velocity?
Strong Answer: SRE differs from traditional ops in several key ways:
- Proactive vs Reactive: SRE focuses on preventing issues through engineering rather than just responding to them
- Error Budgets: We quantify acceptable unreliability, allowing teams to move fast while maintaining reliability targets
- Automation: SRE emphasizes eliminating toil through automation and self-healing systems
- Shared Ownership: Development and operations work together using the same tools and metrics
Balancing reliability with velocity:
- Set clear SLIs/SLOs with stakeholders (e.g., 99.9% uptime ≈ 43 minutes of downtime per month; see the sketch at the end of this answer)
- Use error budgets as a shared currency - if we're within budget, dev teams can deploy faster
- When error budget is exhausted, focus shifts to reliability work
- Implement gradual rollouts and feature flags to reduce blast radius
Follow-up - Implementing error budgets with resistant teams:
- Start with education - show how error budgets enable faster delivery
- Use concrete examples of downtime costs vs delayed features
- Begin with lenient budgets and tighten over time
- Make error budget status visible in dashboards and planning meetings
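To make the error-budget conversation concrete, here is a minimal sketch of the arithmetic (the SLO, window, and request counts are illustrative):
# Minimal sketch of error-budget arithmetic; SLO and request counts are illustrative.
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means exhausted)."""
    allowed_failures = (1 - slo) * total_requests  # budget expressed in failed requests
    return 1 - (failed_requests / allowed_failures)

# A 99.9% SLO over 10M requests allows 10,000 failures; 4,200 failures leaves 58% of budget.
remaining = error_budget_remaining(slo=0.999, total_requests=10_000_000, failed_requests=4_200)
print(f"Error budget remaining: {remaining:.0%}")
Surfacing this number in dashboards and planning meetings is usually what makes the budget feel like a shared currency rather than an abstract SLO.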
2. Explain the four golden signals of monitoring. How would you implement alerting around these for a Python microservice?
Strong Answer: The four golden signals are:
- Latency: Time to process requests
- Traffic: Demand on your system (requests/second)
- Errors: Rate of failed requests
- Saturation: How "full" your service is (CPU, memory, I/O)
Implementation for Python microservice:
# Using Prometheus with Flask
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time
import psutil

app = Flask(__name__)

# Metrics covering the four golden signals
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')
ERROR_RATE = Counter('http_errors_total', 'Total HTTP errors', ['status'])
CPU_USAGE = Gauge('cpu_usage_percent', 'CPU usage percentage')

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_LATENCY.observe(latency)
    REQUEST_COUNT.labels(request.method, request.endpoint, response.status_code).inc()
    if response.status_code >= 400:
        ERROR_RATE.labels(response.status_code).inc()
    CPU_USAGE.set(psutil.cpu_percent())
    return response

@app.route('/metrics')
def metrics():
    # Expose the collected metrics for Prometheus to scrape
    return generate_latest()
Alerting Rules:
- Latency: Alert if p99 > 500ms for 5 minutes
- Traffic: Alert on 50% increase/decrease from baseline
- Errors: Alert if error rate > 1% for 2 minutes
- Saturation: Alert if CPU > 80% or Memory > 85% for 10 minutes
3. Walk me through how you would conduct a post-mortem for a production incident.
Strong Answer: Timeline:
- Immediate: Focus on resolution, collect logs/metrics during incident
- Within 24-48 hours: Conduct post-mortem meeting
- Within 1 week: Publish written post-mortem and track action items
Post-mortem Process:
- Timeline Construction: Build detailed timeline with all events, decisions, and communications
- Root Cause Analysis: Use techniques like "5 Whys" or Fishbone diagrams
- Impact Assessment: Quantify user impact, revenue loss, SLO burn
- Action Items: Focus on systemic fixes, not individual blame
- Follow-up: Track action items to completion
Good Post-mortem Characteristics:
- Blameless culture - focus on systems, not individuals
- Detailed timeline with timestamps
- Clear root cause analysis
- Actionable remediation items with owners and deadlines
- Written in accessible language for all stakeholders
- Includes what went well (not just failures)
Psychological Safety:
- Use "the system allowed..." instead of "person X did..."
- Ask "how can we make this impossible to happen again?"
- Celebrate people who surface problems early
- Make post-mortems learning opportunities, not punishment
4. You notice your application's 99th percentile latency has increased by 50ms over the past week, but the average latency remains the same. How would you investigate this?
Strong Answer: This suggests a long-tail problem - most requests are fine, but a small fraction are much slower (illustrated in the sketch below).
Investigation Steps:
- Check Request Distribution: Look at latency histograms - are we seeing a bimodal distribution?
- Analyze Traffic Patterns: Has the mix of request types changed? Are we getting more complex queries?
- Database Performance: Check for slow queries, table locks, or index problems
- Resource Saturation: Look for memory pressure, GC pauses, or I/O bottlenecks during peak times
- Dependency Analysis: Check latency of downstream services - could be cascading slow responses
- Code Changes: Review recent deployments for inefficient algorithms or new features
Specific Checks:
- Database slow query logs
- Application profiling data
- Memory usage patterns and GC metrics
- Thread pool utilization
- External API response times
- Distributed tracing for slow requests
Tools: Use APM tools like New Relic, DataDog, or distributed tracing with Jaeger/Zipkin to identify bottlenecks.
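To see why the average can hide a tail regression, here is a quick synthetic illustration (numbers are made up): a 2% slow path adds roughly 50ms to p99 while barely moving the mean.
# Synthetic illustration: a small slow path shifts p99 while barely moving the mean.
import random
import statistics

random.seed(42)
baseline = [random.gauss(100, 10) for _ in range(100_000)]              # ~100ms requests
degraded = [x + 70 if random.random() < 0.02 else x for x in baseline]  # 2% hit a slow path

def p99(samples):
    return statistics.quantiles(samples, n=100)[98]

print(f"mean: {statistics.mean(baseline):.1f}ms -> {statistics.mean(degraded):.1f}ms")  # ~100 -> ~101
print(f"p99:  {p99(baseline):.1f}ms -> {p99(degraded):.1f}ms")                          # ~123 -> ~170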
5. Design a monitoring strategy for a Go-based API that processes financial transactions.
Strong Answer: Business Metrics:
- Transaction volume and value per minute
- Success rate by transaction type
- Time to settlement
- Regulatory compliance metrics (PCI DSS)
Technical Metrics:
// Key metrics to track
var (
transactionCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{Name: "transactions_total"},
[]string{"type", "status", "payment_method"})
transactionLatency = prometheus.NewHistogramVec(
prometheus.HistogramOpts{Name: "transaction_duration_seconds"},
[]string{"type"})
queueDepth = prometheus.NewGauge(
prometheus.GaugeOpts{Name: "transaction_queue_depth"})
dbConnectionPool = prometheus.NewGauge(
prometheus.GaugeOpts{Name: "db_connections_active"})
)
Logging Strategy:
- Structured logging with correlation IDs
- Log all transaction state changes
- Security events (failed auth, suspicious patterns)
- Audit trail for compliance
Alerting:
- Transaction failure rate > 0.1%
- Processing latency > 2 seconds
- Queue depth > 1000 items
- Database connection pool > 80% utilization
- Any security-related events
Compliance Considerations:
- PII data must be masked in logs (see the masking sketch below)
- Audit logs with tamper-proof storage
- Real-time fraud detection alerts
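As a concrete example of the PII-masking point above, here is a minimal sketch of a log formatter that masks sensitive fields before they reach the sink (field names and masking rules are illustrative, not a compliance checklist):
# Sketch: mask PII fields in structured log records before emission.
import json
import logging

PII_FIELDS = {"card_number", "email", "ssn"}

def mask(value: str) -> str:
    return value[-4:].rjust(len(value), "*") if len(value) > 4 else "****"

class MaskingJsonFormatter(logging.Formatter):
    def format(self, record):
        payload = getattr(record, "payload", {})
        safe = {k: (mask(str(v)) if k in PII_FIELDS else v) for k, v in payload.items()}
        return json.dumps({"msg": record.getMessage(), **safe})

handler = logging.StreamHandler()
handler.setFormatter(MaskingJsonFormatter())
logger = logging.getLogger("transactions")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# -> {"msg": "payment_authorized", "card_number": "************1111", "amount": 42.5}
logger.info("payment_authorized", extra={"payload": {"card_number": "4111111111111111", "amount": 42.5}})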
6. How would you implement distributed tracing across a system with Python backend services and a React frontend?
Strong Answer: Architecture:
- Use OpenTelemetry standard for vendor-neutral tracing
- Jaeger or Zipkin as tracing backend
- Trace context propagation across service boundaries
Backend Implementation (Python):
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
# Auto-instrument Flask and requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
tracer = trace.get_tracer(__name__)
@app.route('/api/user/<user_id>')
def get_user(user_id):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)
        # Service logic here
        return response
Frontend Implementation (React):
import { WebTracerProvider } from "@opentelemetry/web"
import { getWebAutoInstrumentations } from "@opentelemetry/auto-instrumentations-web"
const provider = new WebTracerProvider()
provider.addSpanProcessor(new BatchSpanProcessor(new JaegerExporter()))
// Auto-instrument fetch, XMLHttpRequest
registerInstrumentations({
instrumentations: [getWebAutoInstrumentations()],
})
Key Implementation Points:
- Correlation IDs: Pass trace context in HTTP headers
- Sampling: Use probabilistic sampling (1-10%) to reduce overhead (see the sketch below)
- Service Map: Visualize service dependencies
- Performance Analysis: Identify bottlenecks across service boundaries
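A sketch of how the sampling and context-propagation points might look with the OpenTelemetry Python SDK (the Jaeger endpoint and the 5% ratio are placeholders):
# Sketch: probabilistic sampling plus explicit trace-context propagation.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.propagate import inject

provider = TracerProvider(sampler=TraceIdRatioBased(0.05))  # keep ~5% of traces
provider.add_span_processor(
    BatchSpanProcessor(JaegerExporter(collector_endpoint="http://jaeger:14268/api/traces"))
)
trace.set_tracer_provider(provider)

# On outbound calls, inject the current context so the downstream span joins the same trace.
headers = {}
inject(headers)  # adds the W3C traceparent header
# requests.get("http://inventory-service/api/stock", headers=headers)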
Part 2: Software Engineering & Development
7. Code Review Scenario: Memory leak optimization
Strong Answer: Problems with the original code:
- Loads entire dataset into memory before processing
- No streaming or chunked processing
- Memory usage grows linearly with file size
Optimized version:
def process_large_dataset(file_path, chunk_size=1000):
    """Process a large dataset in chunks to manage memory usage."""
    results = []
    with open(file_path, 'r') as f:
        chunk = []
        for line in f:
            chunk.append(line.strip())
            if len(chunk) >= chunk_size:
                # Process the chunk and keep only the aggregated partial result
                processed_chunk = [expensive_processing(item) for item in chunk]
                partial_result = analyze_data(processed_chunk)
                results.append(partial_result)
                # Clear chunk to free memory
                chunk.clear()
        # Process remaining items
        if chunk:
            processed_chunk = [expensive_processing(item) for item in chunk]
            partial_result = analyze_data(processed_chunk)
            results.append(partial_result)
    return combine_results(results)

# Even better - use a generator for streaming
def process_large_dataset_streaming(file_path):
    """Stream processing for minimal memory footprint."""
    with open(file_path, 'r') as f:
        for line in f:
            yield expensive_processing(line.strip())

# Usage
def analyze_streaming_data(file_path):
    processed_items = process_large_dataset_streaming(file_path)
    return analyze_data_streaming(processed_items)
Additional Optimizations:
- Use mmap for very large files (sketched below)
- Implement backpressure if processing can't keep up
- Add memory monitoring and circuit breakers
- Consider using asyncio for I/O-bound operations
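A minimal sketch of the mmap option, assuming a newline-delimited text file:
# Sketch: iterate a large file via mmap so the OS pages data in lazily
# instead of the process reading everything onto the heap.
import mmap

def iter_lines_mmap(file_path):
    with open(file_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                yield line.rstrip(b"\n").decode()

# Feeds the same streaming pipeline as above:
# for item in iter_lines_mmap("huge_dataset.txt"):
#     expensive_processing(item)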
8. In Go, explain the difference between buffered and unbuffered channels.
Strong Answer: Unbuffered Channels:
- Synchronous communication - sender blocks until receiver reads
- Zero capacity - no internal storage
- Guarantees handoff between goroutines
ch := make(chan int) // unbuffered
go func() {
ch <- 42 // blocks until someone reads
}()
value := <-ch // blocks until someone sends
Buffered Channels:
- Asynchronous communication up to buffer size
- Sender only blocks when buffer is full
- Receiver only blocks when buffer is empty
ch := make(chan int, 3) // buffered with capacity 3
ch <- 1 // doesn't block
ch <- 2 // doesn't block
ch <- 3 // doesn't block
ch <- 4 // blocks - buffer full
When to use in high-throughput systems:
Unbuffered for:
- Strict synchronization requirements
- Request-response patterns
- When you need guaranteed delivery confirmation
- Worker pools where you want backpressure
Buffered for:
- Producer-consumer with different rates
- Batching operations
- Reducing contention in high-throughput scenarios
- Event streaming where some loss is acceptable
Example - High-throughput log processor:
// Buffered for log ingestion
logChan := make(chan LogEntry, 10000)
// Unbuffered for critical operations
errorChan := make(chan error)
func logProcessor() {
for {
select {
case log := <-logChan:
processLog(log)
case err := <-errorChan:
handleCriticalError(err) // immediate attention
}
}
}
9. React Performance: Optimize dashboard with real-time metrics
Strong Answer: Problems with frequent re-renders:
- All components re-render when any metric updates
- Expensive calculations on every render
- DOM thrashing from rapid updates
Optimization Strategy:
import React, { memo, useMemo, useCallback, useRef, useState, startTransition } from "react"
import { useVirtualizer } from "@tanstack/react-virtual"
// 1. Memoize metric components
const MetricCard = memo(({ metric, value, threshold }) => {
// Only re-render when props actually change
const status = useMemo(
() => (value > threshold ? "critical" : "normal"),
[value, threshold]
)
return (
<div className={`metric-card ${status}`}>
<h3>{metric}</h3>
<span>{value}</span>
</div>
)
})
// 2. Virtualize large lists
const MetricsList = ({ metrics }) => {
const parentRef = useRef()
const virtualizer = useVirtualizer({
count: metrics.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 100,
})
return (
<div ref={parentRef} style={{ height: "400px", overflow: "auto" }}>
{virtualizer.getVirtualItems().map((virtualRow) => (
<MetricCard key={virtualRow.key} {...metrics[virtualRow.index]} />
))}
</div>
)
}
// 3. Debounce updates and batch state changes
const Dashboard = () => {
const [metrics, setMetrics] = useState({})
const updateQueue = useRef(new Map())
const flushTimeout = useRef()
const queueUpdate = useCallback((serviceName, newMetrics) => {
updateQueue.current.set(serviceName, newMetrics)
// Debounce updates - batch multiple rapid changes
clearTimeout(flushTimeout.current)
flushTimeout.current = setTimeout(() => {
setMetrics((prev) => {
const updates = Object.fromEntries(updateQueue.current)
updateQueue.current.clear()
return { ...prev, ...updates }
})
}, 100) // 100ms debounce
}, [])
// 4. Use React.startTransition for non-critical updates
const handleMetricUpdate = useCallback(
(data) => {
if (data.priority === "high") {
queueUpdate(data.service, data.metrics)
} else {
startTransition(() => {
queueUpdate(data.service, data.metrics)
})
}
},
[queueUpdate]
)
return <MetricsList metrics={Object.values(metrics)} />
}
Additional Optimizations:
- WebSocket connection pooling - single connection for all metrics
- Data normalization - structure data to minimize re-renders
- Service Worker - offload heavy calculations
- Canvas/WebGL - for complex visualizations instead of DOM updates
10. Design a CI/CD pipeline for a multi-service application
Strong Answer: Pipeline Architecture:
# .github/workflows/main.yml
name: Multi-Service CI/CD
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
detect-changes:
runs-on: ubuntu-latest
outputs:
python-api: ${{ steps.changes.outputs.python-api }}
go-workers: ${{ steps.changes.outputs.go-workers }}
react-frontend: ${{ steps.changes.outputs.react-frontend }}
steps:
- uses: actions/checkout@v3
- uses: dorny/paths-filter@v2
id: changes
with:
filters: |
python-api:
- 'services/api/**'
go-workers:
- 'services/workers/**'
react-frontend:
- 'frontend/**'
test-python:
needs: detect-changes
if: needs.detect-changes.outputs.python-api == 'true'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run Python tests
run: |
cd services/api
pip install -r requirements.txt
pytest --cov=. --cov-report=xml
flake8 .
mypy .
test-go:
needs: detect-changes
if: needs.detect-changes.outputs.go-workers == 'true'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-go@v3
with:
go-version: "1.21"
- name: Run Go tests
run: |
cd services/workers
go test -race -coverprofile=coverage.out ./...
go vet ./...
staticcheck ./...
deploy:
needs: [test-python, test-go, test-react]
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
steps:
- name: Deploy with blue-green
run: |
# Database migration strategy
kubectl apply -f k8s/migration-job.yaml
kubectl wait --for=condition=complete job/db-migration
# Deploy new version to green environment
helm upgrade app-green ./helm-chart \
--set image.tag=${{ github.sha }} \
--set environment=green
# Health check green environment
./scripts/health-check.sh green
# Switch traffic to green
kubectl patch service app-service -p \
'{"spec":{"selector":{"version":"green"}}}'
# Verify traffic switch
./scripts/verify-deployment.sh
# Clean up blue environment
helm uninstall app-blue
Database Migration Strategy:
#!/bin/bash
# scripts/db-migration.sh
# Create backup
pg_dump $DATABASE_URL > backup-$(date +%Y%m%d-%H%M%S).sql
# Run migrations in transaction
psql $DATABASE_URL << EOF
BEGIN;
-- Run all pending migrations
\i migrations/001_add_user_table.sql
\i migrations/002_add_index.sql
-- Verify migration success
SELECT count(*) FROM schema_migrations;
COMMIT;
EOF
# Test rollback capability
if [ "$1" == "test-rollback" ]; then
psql $DATABASE_URL < backup-latest.sql
fi
Rollback Strategy:
- Keep previous 3 versions in registry
- Database rollback scripts for each migration
- Feature flags to disable new features quickly
- Automated rollback triggers on error-rate increase (sketched below)
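One possible shape for the automated rollback trigger, assuming error rates can be queried from Prometheus and the release is managed by Helm (URL, PromQL query, threshold, and release name are placeholders):
# Sketch: roll back a release if the post-deploy error rate stays above a threshold.
import subprocess
import time
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
ERROR_RATE_QUERY = "sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m]))"

def current_error_rate() -> float:
    result = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=5).json()
    samples = result["data"]["result"]
    return float(samples[0]["value"][1]) if samples else 0.0

def watch_and_rollback(release="app-green", threshold=0.01, window_minutes=15) -> bool:
    """Return True if the deploy is healthy, False if it was rolled back."""
    for _ in range(window_minutes):
        if current_error_rate() > threshold:
            subprocess.run(["helm", "rollback", release], check=True)
            return False
        time.sleep(60)
    return True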
11. How would you implement blue-green deployments for a stateful service with a database?
Strong Answer: Challenges with Stateful Services:
- Database schema changes
- Data consistency during switch
- Connection management
- State synchronization
Implementation Strategy:
# Blue-Green with Database
apiVersion: v1
kind: Service
metadata:
name: app-active
spec:
selector:
app: myapp
version: blue # Will switch to green
ports:
- port: 80
---
# Database proxy for connection management
apiVersion: apps/v1
kind: Deployment
metadata:
name: db-proxy
spec:
template:
spec:
containers:
- name: pgbouncer
image: pgbouncer/pgbouncer
env:
- name: DATABASES_HOST
value: "postgres-primary"
- name: POOL_MODE
value: "transaction" # Allow connection switching
Deployment Process:
1. Pre-deployment:
- Run backward-compatible schema migrations
- Ensure both versions can operate with new schema
2. Green Deployment:
- Deploy green version with same database
- Warm up green instances (cache, connections)
- Run health checks
3. Traffic Switch:
- Update service selector to point to green
- Monitor metrics for 10-15 minutes
- Keep blue running for quick rollback
4. Post-deployment:
- Run cleanup migrations (remove old columns)
- Terminate blue environment
Database Migration Strategy:
-- Phase 1: Additive changes (safe for both versions)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Phase 2: After green is stable, remove old columns
-- ALTER TABLE users DROP COLUMN old_email_field;
Rollback Plan:
- Revert service selector to blue (see the sketch below)
- Emergency database rollback scripts
- Circuit breaker to stop problematic requests
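A sketch of the traffic switch (and its rollback) using the official Kubernetes Python client; service name, namespace, and label values are placeholders:
# Sketch: blue/green cutover by patching the Service selector; rollback is the same patch.
from kubernetes import client, config

def switch_traffic(version: str, service: str = "app-service", namespace: str = "default"):
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    v1 = client.CoreV1Api()
    patch = {"spec": {"selector": {"app": "myapp", "version": version}}}
    v1.patch_namespaced_service(service, namespace, patch)

switch_traffic("green")   # cut over to green
# switch_traffic("blue")  # rollback points the selector back at blue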
Part 3: System Design Deep Dive
12. Requirements Gathering Questions
Strong Answer: Functional Requirements:
- What specific metrics need to be displayed? (orders/minute, revenue, concurrent users)
- How real-time? (sub-second, few seconds, minute-level updates)
- What user roles need access? (executives, ops teams, developers)
- What actions can users take? (view-only, alerts, drill-down)
- Geographic distribution of users?
Non-Functional Requirements:
- Scale: How many concurrent dashboard users? (100s, 1000s)
- Data volume: Orders per day? Peak traffic? Data retention period?
- Availability: 99.9% or higher? Maintenance windows?
- Latency: How fast should dashboard updates be?
- Consistency: Can we show slightly stale data? (eventual consistency)
- Security: Authentication, authorization, audit logging?
Technical Constraints:
- Existing infrastructure? (AWS, on-prem, hybrid)
- Integration requirements? (existing systems, APIs)
- Compliance requirements? (SOX, PCI DSS)
- Budget constraints?
13. API Design and Framework Selection
Strong Answer: API Endpoints:
// REST API Design
GET /api/v1/metrics/realtime
{
"active_users": 1250,
"orders_per_minute": 45,
"revenue_per_minute": 12500,
"inventory_alerts": 3,
"system_health": "healthy",
"timestamp": "2025-06-30T10:30:00Z"
}
GET /api/v1/metrics/historical?metric=revenue&period=24h&granularity=1h
{
"metric": "revenue",
"data": [
{"timestamp": "2025-06-30T09:00:00Z", "value": 75000},
{"timestamp": "2025-06-30T10:00:00Z", "value": 82000}
]
}
POST /api/v1/alerts/subscribe
{
"metric": "inventory_level",
"threshold": 100,
"product_id": "12345",
"notification_method": "webhook"
}
Framework Comparison:
REST ✅ Recommended
- Simple, cacheable
- Wide tooling support
- Good for CRUD operations
- HTTP status codes for errors
WebSockets ✅ For Real-time Updates
// WebSocket for live updates
const ws = new WebSocket("wss://api.example.com/metrics/stream")
ws.onmessage = (event) => {
const metrics = JSON.parse(event.data)
updateDashboard(metrics)
}
GraphQL ❌ Not Recommended
- Adds complexity for simple metrics
- Caching more difficult
- Overkill for this use case
Final Architecture:
- REST API for historical data and configuration
- WebSocket for real-time metric streaming
- Server-Sent Events (SSE) as a WebSocket fallback (sketched below)
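A minimal Flask sketch of the SSE fallback endpoint (get_latest_metrics is a stub for whatever feeds the dashboard):
# Sketch: Server-Sent Events fallback for clients that cannot hold a WebSocket open.
import json
import time
from flask import Flask, Response

app = Flask(__name__)

def get_latest_metrics():
    # Stub: in practice this would read from the Redis metrics cache.
    return {"active_users": 1250, "orders_per_minute": 45}

@app.route("/api/v1/metrics/stream")
def metrics_stream():
    def event_stream():
        while True:
            yield f"data: {json.dumps(get_latest_metrics())}\n\n"  # SSE frame format
            time.sleep(2)
    return Response(event_stream(), mimetype="text/event-stream")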
14. High-Level Architecture Diagram
Strong Answer:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ React SPA │ │ Load Balancer │ │ API Gateway │
│ │◄──►│ (ALB/NGINX) │◄──►│ (Kong/Envoy) │
│ - Dashboard │ │ │ │ - Auth │
│ - WebSocket │ │ │ │ - Rate Limiting │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
┌──────────────────────────┼─────────────────┐
│ │ │
┌─────────▼────────┐ ┌───────────▼────────┐ ┌──────▼──────┐
│ Metrics API │ │ WebSocket API │ │ Config API │
│ (Python/Flask) │ │ (Go/Gorilla) │ │(Python/Fast)│
└─────────┬────────┘ └───────────┬────────┘ └──────┬──────┘
│ │ │
┌─────────▼────────┐ ┌───────────▼────────┐ ┌──────▼──────┐
│ Redis Cache │ │ Message Queue │ │ PostgreSQL │
│ (Metrics) │ │ (Kafka/Redis) │ │ (Config) │
└─────────┬────────┘ └───────────┬────────┘ └─────────────┘
│ │
┌─────────▼─────────────────────────▼────────┐
│ Time Series Database │
│ (InfluxDB/TimescaleDB) │
└─────────┬──────────────────────────────────┘
│
┌─────────▼────────┐
│ ETL Pipeline │
│ (Apache Beam) │
└─────────┬────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌───────▼──────┐ ┌─────────▼─────────┐ ┌─────▼─────┐
│ E-commerce │ │ Inventory │ │ System │
│ Database │ │ Management │ │ Metrics │
│ (Orders) │ │ (Stock Levels) │ │ (Health) │
└──────────────┘ └───────────────────┘ └───────────┘
Data Flow:
- Source Systems → ETL Pipeline (real-time streaming)
- ETL Pipeline → Time Series DB (processed metrics)
- Time Series DB → Redis Cache (frequently accessed data)
- API Services → Frontend (REST + WebSocket)
- Message Queue → WebSocket API (real-time updates)
15. Real-time Implementation Approaches
Strong Answer: Comparison of Real-time Approaches:
| Approach | Pros | Cons | Use Case |
| --- | --- | --- | --- |
| Polling | Simple, stateless | High latency, wasteful | Non-critical updates |
| WebSockets | True real-time, bidirectional | Complex, stateful | Live dashboards |
| Server-Sent Events | Simpler than WebSockets, auto-reconnect | One-way only | Event streams |
| Message Queues | Reliable, scalable | Added complexity | High-volume events |
Recommended Architecture:
// WebSocket implementation in Go
type Hub struct {
clients map[*Client]bool
broadcast chan []byte
register chan *Client
unregister chan *Client
}
type Client struct {
hub *Hub
conn *websocket.Conn
send chan []byte
subscriptions map[string]bool
}
func (h *Hub) run() {
for {
select {
case client := <-h.register:
h.clients[client] = true
case client := <-h.unregister:
delete(h.clients, client)
close(client.send)
case message := <-h.broadcast:
for client := range h.clients {
select {
case client.send <- message:
default:
close(client.send)
delete(h.clients, client)
}
}
}
}
}
// Kafka consumer for real-time metrics
func consumeMetrics() {
consumer, _ := kafka.NewConsumer(&kafka.ConfigMap{
"bootstrap.servers": "localhost:9092",
"group.id": "dashboard-consumers",
})
consumer.Subscribe("metrics-topic", nil)
for {
msg, _ := consumer.ReadMessage(-1)
// Process metric and broadcast to WebSocket clients
metric := parseMetric(msg.Value)
hub.broadcast <- metric
// Also update Redis cache
updateCache(metric)
}
}
Client-side Connection Management:
class MetricsWebSocket {
constructor(url) {
this.url = url
this.reconnectInterval = 1000
this.maxReconnectInterval = 30000
this.reconnectDecay = 1.5
this.connect()
}
connect() {
this.ws = new WebSocket(this.url)
this.ws.onopen = () => {
console.log("Connected to metrics stream")
this.reconnectInterval = 1000 // Reset backoff
}
this.ws.onmessage = (event) => {
const metrics = JSON.parse(event.data)
this.updateDashboard(metrics)
}
this.ws.onclose = () => {
console.log("Connection lost, reconnecting...")
setTimeout(() => {
this.reconnectInterval = Math.min(
this.reconnectInterval * this.reconnectDecay,
this.maxReconnectInterval
)
this.connect()
}, this.reconnectInterval)
}
}
subscribe(metricType) {
this.ws.send(
JSON.stringify({
action: "subscribe",
metric: metricType,
})
)
}
}
16. Single Points of Failure Analysis
Strong Answer: Identified SPOFs and Solutions:
1. Load Balancer SPOF
- Problem: Single ALB failure takes down entire system
- Solution: Multi-AZ deployment with Route 53 health checks
# Terraform for multi-region setup
resource "aws_lb" "primary" {
name = "app-lb-primary"
availability_zones = ["us-east-1a", "us-east-1b"]
}
resource "aws_route53_record" "failover_primary" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
health_check_id = aws_route53_health_check.primary.id
}
2. Database SPOF
- Problem: Single database failure
- Solution: Primary-replica setup with automatic failover
# PostgreSQL HA setup
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: postgres-cluster
spec:
instances: 3
primaryUpdateStrategy: unsupervised
postgresql:
parameters:
max_connections: "200"
shared_buffers: "256MB"
backup:
retentionPolicy: "30d"
barmanObjectStore:
destinationPath: "s3://backups/postgres"
3. Redis Cache SPOF
- Problem: Cache failure impacts performance
- Solution: Redis Sentinel for HA + graceful degradation
import redis.sentinel
sentinel = redis.sentinel.Sentinel([
('sentinel1', 26379),
('sentinel2', 26379),
('sentinel3', 26379)
])
def get_metric(key):
try:
master = sentinel.master_for('mymaster', socket_timeout=0.1)
return master.get(key)
except redis.RedisError:
# Graceful degradation - fetch from database
return get_metric_from_db(key)
4. Message Queue SPOF
- Problem: Kafka broker failure stops real-time updates
- Solution: Multi-broker Kafka cluster with replication
# Kafka with replication
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-cluster
spec:
kafka:
replicas: 3
config:
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
default.replication.factor: 3
min.insync.replicas: 2
17. Caching Implementation Strategy
Strong Answer: Multi-Level Caching Strategy:
# 1. Application-level caching
from functools import lru_cache
import redis
import json
import time
class CacheService:
def __init__(self):
self.redis_client = redis.Redis(host='redis-cluster')
self.local_cache = {}
@lru_cache(maxsize=1000)
def get_metric_definition(self, metric_name):
"""Cache metric metadata (rarely changes)"""
return self.fetch_metric_definition(metric_name)
def get_real_time_metric(self, metric_key):
"""Multi-level cache for real-time data"""
# L1: Memory cache (100ms TTL)
if metric_key in self.local_cache:
data, timestamp = self.local_cache[metric_key]
if time.time() - timestamp < 0.1: # 100ms
return data
# L2: Redis cache (5s TTL)
cached = self.redis_client.get(f"metric:{metric_key}")
if cached:
data = json.loads(cached)
self.local_cache[metric_key] = (data, time.time())
return data
# L3: Database fallback
data = self.fetch_from_database(metric_key)
# Cache with appropriate TTL
self.redis_client.setex(
f"metric:{metric_key}",
5, # 5 second TTL
json.dumps(data)
)
self.local_cache[metric_key] = (data, time.time())
return data
2. CDN Caching for Static Assets:
// CloudFront configuration
{
"Origins": [{
"Id": "dashboard-origin",
"DomainName": "dashboard-api.example.com",
"CustomOriginConfig": {
"HTTPPort": 443,
"OriginProtocolPolicy": "https-only"
}
}],
"DefaultCacheBehavior": {
"TargetOriginId": "dashboard-origin",
"ViewerProtocolPolicy": "redirect-to-https",
"CachePolicyId": "custom-cache-policy",
"TTL": {
"DefaultTTL": 300, // 5 minutes for API responses
"MaxTTL": 3600 // 1 hour max
}
},
"CacheBehaviors": [{
"PathPattern": "/static/*",
"TTL": {
"DefaultTTL": 86400, // 24 hours for static assets
"MaxTTL": 31536000 // 1 year max
}
}]
}
3. Database Query Result Caching:
-- Materialized views for expensive aggregations
CREATE MATERIALIZED VIEW hourly_revenue AS
SELECT
date_trunc('hour', order_timestamp) as hour,
SUM(order_total) as revenue,
COUNT(*) as order_count
FROM orders
WHERE order_timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY date_trunc('hour', order_timestamp);
-- Refresh every 5 minutes
CREATE OR REPLACE FUNCTION refresh_hourly_revenue()
RETURNS void AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_revenue;
END;
$$ LANGUAGE plpgsql;
SELECT cron.schedule('refresh-revenue', '*/5 * * * *', 'SELECT refresh_hourly_revenue();');
Cache Invalidation Strategy:
class CacheInvalidator:
def __init__(self):
self.redis_client = redis.Redis()
def invalidate_on_order(self, order_data):
"""Invalidate relevant caches when new order arrives"""
patterns_to_invalidate = [
"metric:orders_per_minute",
"metric:revenue_per_minute",
f"metric:inventory:{order_data['product_id']}"
]
for pattern in patterns_to_invalidate:
# Use Redis pipeline for efficiency
pipe = self.redis_client.pipeline()
pipe.delete(pattern)
pipe.publish('cache_invalidation', pattern)
pipe.execute()
def smart_invalidation(self, event_type, entity_id):
"""Invalidate based on event type"""
invalidation_map = {
'order_created': ['revenue', 'orders', 'inventory'],
'user_login': ['active_users'],
'inventory_update': ['inventory', 'stock_alerts']
}
metrics_to_invalidate = invalidation_map.get(event_type, [])
for metric in metrics_to_invalidate:
self.redis_client.delete(f"metric:{metric}:{entity_id}")
18. Database Design: SQL vs NoSQL
Strong Answer: Hybrid Approach - Use Both:
SQL (PostgreSQL) for:
- Transactional Data: Orders, users, inventory
- ACID Requirements: Financial transactions
- Complex Queries: Joins, aggregations
- Data Consistency: Strong consistency needs
-- OLTP Database Schema
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
order_total DECIMAL(10,2) NOT NULL,
order_timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
status VARCHAR(20) NOT NULL DEFAULT 'pending'
);
CREATE INDEX idx_orders_timestamp ON orders(order_timestamp);
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- Partitioning for time-series data (requires orders to be declared
-- with PARTITION BY RANGE (order_timestamp))
CREATE TABLE orders_2025_06 PARTITION OF orders
FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');
NoSQL (InfluxDB) for:
- Time-series Metrics: Performance data, system metrics
- High Write Volume: Thousands of metrics per second
- Retention Policies: Automatic data aging
# InfluxDB for metrics storage
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
from datetime import datetime
client = InfluxDBClient(url="http://influxdb:8086", token="my-token")
write_api = client.write_api(write_options=SYNCHRONOUS)
def write_metric(measurement, tags, fields):
point = Point(measurement) \
.tag("service", tags.get("service")) \
.tag("region", tags.get("region")) \
.field("value", fields["value"]) \
.time(datetime.utcnow(), WritePrecision.S)
write_api.write(bucket="metrics", record=point)
# Example usage
write_metric(
measurement="orders_per_minute",
tags={"service": "ecommerce", "region": "us-east-1"},
fields={"value": 45}
)
Read/Write Load Balancing:
class DatabaseRouter:
def __init__(self):
self.primary_db = PostgreSQLConnection("primary-db")
self.read_replicas = [
PostgreSQLConnection("replica-1"),
PostgreSQLConnection("replica-2"),
PostgreSQLConnection("replica-3")
]
self.current_replica = 0
def get_read_connection(self):
"""Round-robin read replica selection"""
replica = self.read_replicas[self.current_replica]
self.current_replica = (self.current_replica + 1) % len(self.read_replicas)
return replica
def get_write_connection(self):
"""Always use primary for writes"""
return self.primary_db
def execute_read_query(self, query, params=None):
try:
return self.get_read_connection().execute(query, params)
except Exception:
# Fallback to primary if replica fails
return self.primary_db.execute(query, params)
# Usage
@app.route('/api/metrics/historical')
def get_historical_metrics():
# Use read replica for analytics queries
db = router.get_read_connection()
return db.execute("""
SELECT date_trunc('hour', created_at) as hour,
SUM(total) as revenue
FROM orders
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour
""")
19. Scaling Strategy for 10x Traffic
Strong Answer: Scaling Priority Order:
1. Application Tier (Scale First)
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: dashboard-api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: dashboard-api
minReplicas: 3
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 15
2. Database Scaling
# Read replica scaling + connection pooling
class DatabaseScaler:
def __init__(self):
self.connection_pools = {
'primary': create_pool('primary-db', pool_size=20),
'replicas': [
create_pool('replica-1', pool_size=50),
create_pool('replica-2', pool_size=50),
create_pool('replica-3', pool_size=50),
]
}
def scale_read_capacity(self, target_qps):
"""Add more read replicas based on QPS"""
current_capacity = len(self.connection_pools['replicas']) * 1000 # QPS per replica
if target_qps > current_capacity * 0.8: # 80% utilization threshold
# Add new read replica
new_replica = create_read_replica()
self.connection_pools['replicas'].append(
create_pool(new_replica, pool_size=50)
)
def implement_sharding(self):
"""Implement horizontal sharding for orders table"""
shards = {
'shard_1': 'orders_americas',
'shard_2': 'orders_europe',
'shard_3': 'orders_asia'
}
def get_shard(user_id):
return shards[f'shard_{hash(user_id) % 3 + 1}']
3. Cache Scaling
# Redis Cluster for horizontal scaling
import rediscluster
redis_cluster = rediscluster.RedisCluster(
startup_nodes=[
{"host": "redis-1", "port": "7000"},
{"host": "redis-2", "port": "7000"},
{"host": "redis-3", "port": "7000"},
{"host": "redis-4", "port": "7000"},
{"host": "redis-5", "port": "7000"},
{"host": "redis-6", "port": "7000"},
],
decode_responses=True,
skip_full_coverage_check=True
)
# Cache partitioning strategy
def cache_key_partition(metric_type, entity_id):
"""Partition cache keys for better distribution"""
partition = hash(f"{metric_type}:{entity_id}") % 1000
return f"{metric_type}:{partition}:{entity_id}"
4. CDN and Edge Caching
// CloudFlare Workers for edge computing
addEventListener("fetch", (event) => {
event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
const url = new URL(request.url)
// Cache API responses at edge
if (url.pathname.startsWith("/api/metrics/")) {
const cacheKey = new Request(url.toString(), request)
const cache = caches.default
// Check edge cache first
let response = await cache.match(cacheKey)
if (!response) {
// Fetch from origin
response = await fetch(request)
// Cache for 30 seconds
const headers = new Headers(response.headers)
headers.set("Cache-Control", "public, max-age=30")
response = new Response(response.body, {
status: response.status,
statusText: response.statusText,
headers: headers,
})
await cache.put(cacheKey, response.clone())
}
return response
}
return fetch(request)
}
5. Message Queue Scaling
# Kafka cluster scaling
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: metrics-kafka
spec:
kafka:
replicas: 9 # Scale from 3 to 9 brokers
config:
num.partitions: 50 # More partitions for parallelism
default.replication.factor: 3
min.insync.replicas: 2
resources:
requests:
memory: 8Gi
cpu: 2000m
limits:
memory: 16Gi
cpu: 4000m
20. Failover Strategy for Database Outage
Strong Answer: Multi-Tier Failover Strategy:
# Automatic failover implementation
import time
import threading
import logging
import psycopg2
import boto3
from enum import Enum

logger = logging.getLogger(__name__)
class DatabaseState(Enum):
HEALTHY = "healthy"
DEGRADED = "degraded"
FAILED = "failed"
class DatabaseFailoverManager:
def __init__(self):
self.primary_db = "primary-db.example.com"
self.replica_dbs = [
"replica-1.example.com",
"replica-2.example.com",
"replica-3.example.com"
]
self.current_state = DatabaseState.HEALTHY
self.current_primary = self.primary_db
self.health_check_interval = 5 # seconds
def health_check(self, db_host):
"""Check database health"""
try:
conn = psycopg2.connect(
host=db_host,
database="analytics",
user="readonly",
password="password",
connect_timeout=3
)
cursor = conn.cursor()
cursor.execute("SELECT 1")
result = cursor.fetchone()
conn.close()
return result[0] == 1
except Exception as e:
logger.error(f"Health check failed for {db_host}: {e}")
return False
def promote_replica_to_primary(self, replica_host):
"""Promote replica to primary (manual intervention required)"""
# This would typically involve:
# 1. Stopping replication on chosen replica
# 2. Updating DNS/load balancer to point to new primary
# 3. Updating application config
logger.critical(f"Promoting {replica_host} to primary")
# Update Route 53 record to point to new primary
route53 = boto3.client('route53')
route53.change_resource_record_sets(
HostedZoneId='Z123456789',
ChangeBatch={
'Changes': [{
'Action': 'UPSERT',
'ResourceRecordSet': {
'Name': 'primary-db.example.com',
'Type': 'CNAME',
'TTL': 60,
'ResourceRecords': [{'Value': replica_host}]
}
}]
}
)
self.current_primary = replica_host
def circuit_breaker_fallback(self):
"""Fallback to cached data when database is unavailable"""
logger.warning("Database unavailable, switching to read-only mode")
# Serve from cache only
app.config['READ_ONLY_MODE'] = True
# Show banner to users
return {
"status": "degraded",
"message": "Real-time data temporarily unavailable",
"data_freshness": "cached_5_minutes_ago"
}
def monitor_and_failover(self):
"""Background thread for monitoring and automatic failover"""
consecutive_failures = 0
while True:
if self.health_check(self.current_primary):
consecutive_failures = 0
self.current_state = DatabaseState.HEALTHY
app.config['READ_ONLY_MODE'] = False
else:
consecutive_failures += 1
logger.warning(f"Primary DB health check failed {consecutive_failures} times")
if consecutive_failures >= 3: # 15 seconds of failures
self.current_state = DatabaseState.FAILED
# Try to find healthy replica
healthy_replica = None
for replica in self.replica_dbs:
if self.health_check(replica):
healthy_replica = replica
break
if healthy_replica:
self.promote_replica_to_primary(healthy_replica)
else:
# All databases failed - enable circuit breaker
self.circuit_breaker_fallback()
time.sleep(self.health_check_interval)
# Start monitoring in background
failover_manager = DatabaseFailoverManager()
monitor_thread = threading.Thread(target=failover_manager.monitor_and_failover)
monitor_thread.daemon = True
monitor_thread.start()
Emergency Procedures:
#!/bin/bash
# emergency-failover.sh
echo "=== EMERGENCY DATABASE FAILOVER ==="
echo "1. Checking primary database status..."
if ! pg_isready -h primary-db.example.com -p 5432; then
echo "❌ Primary database is DOWN"
echo "2. Finding healthy replica..."
for replica in replica-1 replica-2 replica-3; do
if pg_isready -h ${replica}.example.com -p 5432; then
echo "