SRE & System Design Interview Questions
This guide covers essential interview questions for Site Reliability Engineering (SRE), system design, and full-stack engineering roles. Each question includes detailed expected answers with practical examples.
General Understanding
Question 1: How do you differentiate between functional and non-functional requirements?
Expected Answer:
- Functional requirements specify what the system should do:
- Examples: Allow user registration, serve video content, process payments, send notifications
- Focus on business logic and features
- Non-functional requirements focus on how the system performs its functions:
- Examples: Performance (response time < 200ms), security (encryption), scalability (handle 1M users), reliability (99.9% uptime)
- Often called "quality attributes" or "system properties"
Key Differences:
- Functional = What the system does
- Non-functional = How well the system does it
Site Reliability Engineering (SRE) Practices
Question 2: What is SLI, SLO, and SLA? How are they related?
Expected Answer:
- SLI (Service Level Indicator):
- A quantitative metric used to measure performance
- Examples: Response time, error rate, throughput, availability percentage
- Must be measurable and meaningful to users
- SLO (Service Level Objective):
- The target value or range for an SLI
- Examples: 99.9% availability, 95% of requests < 200ms response time
- Internal goals that drive engineering decisions
- SLA (Service Level Agreement):
- A formal contract with consequences if SLOs are not met
- Includes penalties, refunds, or compensation
- External commitments to customers
Relationship: SLIs measure → SLOs set targets → SLAs define consequences
Example:
- SLI: API response time
- SLO: 95% of API calls respond within 200ms
- SLA: If uptime falls below 99.5%, customers get 10% service credit
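A quick way to make these numbers concrete is to convert an availability SLO into an error budget. A minimal Python sketch (the SLO values and 30-day window are illustrative):

def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    # Error budget expressed as minutes of downtime allowed over the window
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

print(allowed_downtime_minutes(0.999))   # ~43.2 minutes per 30 days
print(allowed_downtime_minutes(0.9995))  # ~21.6 minutes per 30 days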
Question 3: Describe how you'd design for fault tolerance in a distributed system.
Expected Answer:
Key Strategies:
- Redundancy & Replication:
- Duplicate critical components across multiple availability zones/regions
- Use active-active or active-passive configurations
- Implement database replication (master-slave, master-master)
- Load Balancing & Health Checks:
- Distribute traffic across healthy instances
- Implement health checks to remove unhealthy nodes
- Use circuit breakers to prevent cascade failures
- Graceful Degradation:
- Design systems to function with reduced capability when components fail
- Implement fallback mechanisms and default responses
- Prioritize core functionality over nice-to-have features
- Monitoring & Alerting:
- Comprehensive observability (logs, metrics, traces)
- Automated failover mechanisms
- Real-time alerting for quick incident response
- Retry Logic & Timeouts:
- Implement exponential backoff for retries
- Set appropriate timeouts to prevent resource exhaustion
- Use bulkhead patterns to isolate failures
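The retry idea above fits in a few lines of Python; this is a minimal sketch (attempt counts, delays, and jitter bounds are placeholder choices):

import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    # Retry a transient-failure-prone call with exponential backoff and jitter
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted, surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter spreads out retries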
Question 4: What is observability and how would you build it into an application?
Expected Answer:
Definition: Observability is the ability to understand the internal state of a system by examining its external outputs.
Three Pillars of Observability:
- Logs:
- Structured logging (JSON format) - see the sketch at the end of this answer
- Centralized log aggregation
- Searchable and queryable
- Include correlation IDs for tracing requests
- Metrics:
- Time-series data (counters, gauges, histograms)
- System metrics: CPU, memory, disk, network
- Application metrics: request rate, response time, error rate
- Business metrics: user signups, revenue
- Traces:
- End-to-end request flow across services
- Distributed tracing to identify bottlenecks
- Span relationships and timing information
Implementation Strategy:
- Tools: Prometheus + Grafana, ELK Stack, Jaeger, OpenTelemetry
- Standards: Use OpenTelemetry for vendor-neutral instrumentation
- Alerting: Set up alerts based on SLIs and error budgets
- Dashboards: Create role-specific dashboards for different stakeholders
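To illustrate the logging pillar, structured JSON logs with a correlation ID can be produced with the Python standard library alone; the field names below are arbitrary:

import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit one JSON object per log line so aggregators can parse it
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "logger": record.name,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("video-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach a per-request correlation ID so log lines can be joined with traces
logger.info("video requested", extra={"correlation_id": str(uuid.uuid4())})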
Software Engineering & Fullstack Focus
Question 5: When would you use GraphQL over REST?
Expected Answer:
Use GraphQL when:
- Clients need fine-grained control over data fetching
- Mobile applications with bandwidth constraints
- Frontend teams want to avoid over-fetching/under-fetching
- Multiple clients need different data shapes from the same backend
- Real-time subscriptions are required
Use REST when:
- Simple CRUD operations that map well to HTTP verbs
- Caching is critical (HTTP caching works well with REST)
- Team familiarity and existing infrastructure
- Third-party integrations expect REST APIs
Practical Example:
# GraphQL - fetch only the fields the client needs, in one request
query {
  user(id: "123") {
    name
    email
    posts(limit: 5) {
      title
      createdAt
    }
  }
}

# REST - multiple requests or over-fetching
GET /users/123        # returns all user fields
GET /users/123/posts  # returns all post fields
Trade-offs:
- GraphQL: More complex caching, learning curve, potential for expensive queries
- REST: Simpler caching, well-understood, but can lead to multiple round trips
Question 6: What are some Python and Go best practices for backend development?
Expected Answer:
Python Best Practices:
- Code Quality:
- Follow PEP 8 style guide
- Use type hints (def func(name: str) -> str)
- Implement comprehensive testing (pytest)
- Use linting tools (flake8, black, mypy)
- Framework Choices:
- FastAPI: Modern, async, automatic API documentation
- Django: Full-featured, great for complex applications
- Flask: Lightweight, flexible for microservices
- Performance:
- Use async/await for I/O-bound operations
- Implement proper connection pooling
- Use caching (Redis) for frequently accessed data
Go Best Practices:
- Language Features:
- Embrace simplicity and readability
- Use goroutines and channels for concurrency
- Handle errors explicitly (no exceptions)
- Leverage interfaces for loose coupling
- Performance & Patterns:
- Use the standard library when possible
- Implement graceful shutdown patterns
- Use context for request cancellation and timeouts
- Follow the "accept interfaces, return structs" principle
Common Patterns:
# Python - async FastAPI example
@app.get("/users/{user_id}")
async def get_user(user_id: int) -> User:
    return await user_service.get_user(user_id)

// Go - HTTP handler with context
func getUserHandler(w http.ResponseWriter, r *http.Request) {
    // userID would be parsed from the request path or query string (omitted here)
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    user, err := userService.GetUser(ctx, userID)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    json.NewEncoder(w).Encode(user)
}
System Design Deep Dive (End-to-End Scenario)
Question 7: Design a high-level architecture for a video streaming platform.
Step 1: Understand the Scope
Functional Requirements:
- Upload and store videos
- Stream videos with different quality options
- User authentication and profiles
- Video metadata management (title, description, tags)
- Search and discovery
- Analytics and view tracking
- Comment system
Non-functional Requirements:
- Support 10M+ concurrent users
- 99.9% availability
- Low latency streaming (< 2s startup time)
- Global content delivery
- Secure content protection
- Scalable storage (petabytes of video data)
Scale Estimates:
- 1 billion hours watched per day
- 500 hours of video uploaded per minute
- Peak concurrent users: 10M+
- Storage: ~1 petabyte of new content daily
Step 2: High-Level Architecture Components
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │     CDN     │     │     API     │
│   (React)   │────►│ (CloudFront)│────►│   Gateway   │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
         ┌─────────────────────┬───────────────┴─────┐
         │                     │                     │
  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
  │   Upload    │       │  Streaming  │       │  Metadata   │
  │   Service   │       │   Service   │       │   Service   │
  │  (FastAPI)  │       │    (Go)     │       │  (Node.js)  │
  └─────────────┘       └─────────────┘       └─────────────┘
         │                     │                     │
  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
  │   Object    │       │    Video    │       │  Database   │
  │   Storage   │       │  Processing │       │  (MongoDB   │
  │    (S3)     │       │ (MediaConv) │       │   + Redis)  │
  └─────────────┘       └─────────────┘       └─────────────┘
Component Details:
- API Gateway: Rate limiting, authentication, routing
- CDN: Global content delivery, edge caching
- Upload Service: Handle video uploads, trigger processing
- Streaming Service: Serve video content with adaptive bitrate
- Metadata Service: User data, video information, search
- Message Queue: Asynchronous processing (RabbitMQ/SQS)
Step 3: API Design & Data Flow
| Functionality | Endpoint | Method | Framework | Response Time SLO |
|---|---|---|---|---|
| Video Upload | /api/v1/upload | POST | FastAPI | < 500ms (initiate) |
| Video Stream | /api/v1/stream/:id | GET | Go + NGINX | < 100ms |
| User Profile | /api/v1/users/:id | GET | Node.js | < 200ms |
| Search Videos | /api/v1/search | GET | Elasticsearch | < 300ms |
| Video Metadata | /api/v1/videos/:id | GET | Node.js | < 150ms |
Upload Flow:
- Client uploads video to signed S3 URL
- Upload service validates and stores metadata
- Video processing pipeline triggered (encoding, thumbnail generation)
- CDN cache populated with multiple quality versions
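Step 1 of the upload flow usually means handing the client a short-lived presigned URL. A hedged sketch using boto3 (the bucket name, key layout, and expiry are placeholders):

import boto3

s3 = boto3.client("s3")

def create_upload_url(video_id: str, expires_in: int = 900) -> str:
    # Presigned PUT URL the client uploads the raw video to directly
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "raw-video-uploads", "Key": f"{video_id}.mp4"},
        ExpiresIn=expires_in,
    )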
Streaming Flow:
- Client requests video stream
- CDN checks cache, serves if available
- If not cached, origin server provides stream
- Adaptive bitrate streaming based on client bandwidth
Step 4: Design Options & Tradeoffs
Architecture Patterns:
- Monolith vs Microservices:
- Monolith Pros: Simpler deployment, easier debugging, faster development initially
- Monolith Cons: Hard to scale independently, technology lock-in, single point of failure
- Microservices Pros: Independent scaling, technology diversity, fault isolation
- Microservices Cons: Distributed system complexity, network latency, data consistency challenges
Recommendation: Start with modular monolith, evolve to microservices as team and requirements grow
- Database Choices:
- SQL (PostgreSQL): ACID compliance, complex queries, strong consistency
- NoSQL (MongoDB): Horizontal scaling, flexible schema, eventual consistency
- Hybrid Approach: SQL for user data/transactions, NoSQL for video metadata and analytics
- Caching Strategies:
- CDN Caching: Video content at edge locations
- Application Caching: Popular video metadata in Redis
- Database Caching: Query result caching for search operations
Preferred Architecture: Domain-driven microservices with event-driven communication
Step 5: Scalability & Performance Optimizations
Potential Bottlenecks & Solutions:
- API Gateway Bottleneck:
- Problem: Single point of failure, traffic concentration
- Solution: Multiple API gateway instances behind load balancer, circuit breakers
- Database Performance:
- Problem: Read/write bottlenecks, slow queries
- Solutions:
- Read replicas for scaling reads
- Database sharding by user_id or video_id
- Caching layer (Redis) for frequently accessed data (see the cache-aside sketch after this list)
- Video Processing:
- Problem: CPU-intensive encoding tasks
- Solution: Distributed processing with message queues, auto-scaling workers
- Storage Scalability:
- Problem: Petabyte-scale storage requirements
- Solutions:
- Object storage (S3) with lifecycle policies
- Multi-region replication for disaster recovery
- Cold storage for older content
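To make the Redis caching layer mentioned under Database Performance concrete, here is a cache-aside sketch (key naming, the TTL, and the fetch_metadata_from_db helper are illustrative):

import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_video_metadata(video_id: str) -> dict:
    # Cache-aside: serve hot metadata from Redis, fall back to the database
    key = f"video:meta:{video_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    metadata = fetch_metadata_from_db(video_id)  # placeholder database query
    cache.set(key, json.dumps(metadata), ex=300)  # 5-minute TTL
    return metadata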
Performance Optimizations:
- CDN Strategy:
- Edge locations in major markets
- Cache popular content proactively
- Use HTTP/2 for better performance
- Database Optimization:
- Indexing on frequently queried fields
- Partitioning large tables by date
- Read replicas in different regions
- Monitoring & Alerting:
- Real-time metrics: response time, error rate, throughput
- Infrastructure monitoring: CPU, memory, disk, network
- Business metrics: video upload success rate, streaming quality
Question 8: Apply the CAP Theorem to this system.
Expected Answer:
The CAP theorem states that in a distributed system, you can only guarantee two of the following three properties:
- Consistency (C): All nodes see the same data at the same time - for this platform, a user's uploaded video is visible to them instantly
- Availability (A): The system remains operational and responsive - APIs stay responsive even under load
- Partition Tolerance (P): The system continues to operate despite network failures - an outage between regions shouldn't take the platform down
For our video streaming platform:
- Partition Tolerance is Required:
- Must handle network failures between data centers
- Geographic distribution across regions is essential
- Cannot sacrifice P in a global system
- Availability vs Consistency Trade-off - Prioritize Availability (AP System):
- Users can always stream videos (critical for user experience)
- Video uploads may be eventually consistent across regions
- View counts and analytics can tolerate slight delays
- When Consistency Matters:
- User authentication and authorization (strong consistency required)
- Payment processing (ACID transactions needed)
- Content moderation (immediate consistency for safety)
Implementation Strategy:
- Video Content: Eventually consistent (AP) - users can watch videos even during network partitions
- User Data: Strong consistency (CP) - authentication must be accurate
- Analytics: Eventually consistent (AP) - view counts can be slightly delayed
Example Scenario - a network partition occurs between the US and EU data centers:
- Users in both regions can still stream videos (availability maintained)
- New video uploads may take time to replicate (consistency temporarily relaxed)
- User login still works against a local authentication cache (partition tolerance)
In short: because the system is globally distributed, we temporarily relax strong consistency for most user-facing workloads to preserve availability and partition tolerance - an AP system with eventual consistency - while reserving CP behavior for authentication and payments.
Question 9: How would you shard the video metadata DB?
Expected Answer:
Database Sharding Strategy for Video Metadata:
1. Sharding Keys Options:
Option A: Shard by video_id
-- Hash-based sharding
shard_id = hash(video_id) % num_shards
- Pros: Even distribution of videos across shards
- Cons: User's videos scattered across multiple shards
Option B: Shard by user_id (Recommended)
-- Hash-based sharding
shard_id = hash(user_id) % num_shards
- Pros: User's data co-located, efficient user-centric queries
- Cons: Popular users might create hot spots
Option C: Hybrid Approach - Directory-based Sharding
- Maintain a lookup service that maps ranges to shards
- More flexible but adds complexity
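As a rough illustration of Option C, the directory is just an explicit mapping from key ranges to shards; the ranges and shard names below are invented:

# Directory-based sharding: a lookup table owns the key-range-to-shard mapping
SHARD_DIRECTORY = [
    # (inclusive_start, exclusive_end, shard_name) over a hashed user-id space
    (0, 1_000_000, "shard_a"),
    (1_000_000, 2_000_000, "shard_b"),
    (2_000_000, 3_000_000, "shard_c"),
]

def shard_for(hashed_user_id: int) -> str:
    for start, end, shard in SHARD_DIRECTORY:
        if start <= hashed_user_id < end:
            return shard
    raise KeyError("no shard owns this key range")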
2. Implementation Details:
// Sharding logic example
function getShardForUser(userId) {
  const shardId = hash(userId) % NUMBER_OF_SHARDS;
  return `shard_${shardId}`;
}

function getVideoMetadata(videoId) {
  const userId = getUserIdFromVideo(videoId);
  const shard = getShardForUser(userId);
  return queryDatabase(shard, videoId);
}
3. MongoDB Sharding Configuration:
- Use compound shard key: {user_id: 1, created_at: 1}
- Enable zone sharding for geographic distribution
- Configure chunk size appropriately (64MB default)
4. Handling Cross-Shard Operations:
- Video Search: Use search service (Elasticsearch) with replicated data
- Popular Videos: Maintain separate collection for trending content
- Analytics: Use separate OLAP system for complex queries
5. Rebalancing Strategy:
- Monitor shard utilization and hot spots
- Use MongoDB balancer for automatic chunk migration
- Plan for adding new shards as data grows
Question 10: How would you implement rate limiting for the API?
Expected Answer:
Rate Limiting Algorithms:
- Token Bucket Algorithm (Recommended):
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum number of tokens in the bucket
        self.tokens = capacity
        self.refill_rate = refill_rate    # tokens added per second
        self.last_refill = time.time()

    def consume(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        tokens_to_add = (now - self.last_refill) * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
- Implementation Strategies:
- Application Level: Middleware in API gateway
- Infrastructure Level: Use Redis for distributed rate limiting
- CDN Level: Cloudflare or AWS CloudFront built-in rate limiting
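For the Redis-backed option above, a simple fixed-window counter sketch (window size, limits, and key naming are illustrative; sliding windows are often preferred in production):

import time
import redis

r = redis.Redis()

def allow_request(user_id: str, limit: int = 1000, window_seconds: int = 3600) -> bool:
    # Fixed-window counter shared across all API gateway instances via Redis
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"
    count = r.incr(key)  # atomic increment
    if count == 1:
        r.expire(key, window_seconds)  # expire the counter after the window ends
    return count <= limit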
- Rate Limiting Tiers:
- Anonymous users: 100 requests/hour
- Authenticated users: 1000 requests/hour
- Premium users: 5000 requests/hour
- Different limits per endpoint: Upload (strict), read (lenient)
Headers and Response:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1609459200
Retry-After: 3600
Question 11: How would you secure the video streaming platform?
Expected Answer:
Security Layers:
- Authentication & Authorization:
- JWT tokens with short expiration (15 minutes) - see the sketch at the end of this answer
- Refresh token rotation
- OAuth 2.0 for third-party integration
- Role-based access control (RBAC)
- API Security:
- HTTPS everywhere (TLS 1.3)
- API rate limiting and DDoS protection
- Input validation and sanitization
- CORS configuration for web clients
- Content Protection:
- Signed URLs for video access (S3 presigned URLs)
- CDN token authentication
- DRM for premium content
- Watermarking for copyright protection
- Infrastructure Security:
- VPC with private subnets
- Security groups and NACLs
- WAF (Web Application Firewall)
- Regular security scanning and penetration testing
Example: Signed URL Generation
import boto3

s3_client = boto3.client("s3")

def generate_signed_video_url(video_id, user_id, expiration=3600):
    # Verify the user has access to the video
    if not user_has_access(user_id, video_id):
        raise PermissionError("Access denied")
    # Generate a signed URL that expires after `expiration` seconds
    return s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'videos', 'Key': f'{video_id}.mp4'},
        ExpiresIn=expiration
    )
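The short-lived JWT approach from the first layer could look roughly like this, assuming the PyJWT library (claims and secret handling are simplified):

import time
import jwt  # PyJWT

SECRET = "replace-with-a-managed-secret"

def issue_access_token(user_id: str, ttl_seconds: int = 900) -> str:
    now = int(time.time())
    claims = {"sub": user_id, "iat": now, "exp": now + ttl_seconds, "role": "viewer"}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_access_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on failure
    return jwt.decode(token, SECRET, algorithms=["HS256"])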
Question 12: How would you handle incident response and post-mortems?
Expected Answer:
Incident Response Process:
- Incident Detection (< 5 minutes):
- Automated monitoring alerts
- User reports through support channels
- Health check failures
- Initial Response (< 15 minutes):
- Acknowledge alert and assess severity
- Create incident channel (Slack/Teams)
- Assign incident commander
- Communicate to stakeholders
- Incident Severity Levels:
- SEV1: Complete service outage, data loss
- SEV2: Major feature down, significant user impact
- SEV3: Minor issues, degraded performance
- SEV4: Cosmetic issues, no user impact
- Resolution Process:
- Implement immediate fix or rollback
- Document all actions taken
- Communicate status updates
- Monitor for resolution confirmation
Post-Mortem Process:
- Post-Mortem Template:
- Summary: What happened and impact
- Timeline: Detailed sequence of events
- Root Cause: Why it happened
- Action Items: How to prevent recurrence
- Lessons Learned: What went well/poorly
- Blameless Culture:
- Focus on systems and processes, not individuals
- Encourage honest reporting
- Share learnings across teams
Example Action Items:
- Add monitoring for X metric
- Implement automated failover for Y component
- Update runbook for Z scenario
- Schedule disaster recovery testing
Question 13: Explain different deployment strategies and when to use them.
Expected Answer:
Deployment Strategies:
- Blue-Green Deployment:
- Maintain two identical production environments
- Switch traffic between environments
- Pros: Instant rollback, zero downtime
- Cons: Requires double infrastructure, complex data migration
- Canary Deployment:
- Gradually roll out to small percentage of users
- Monitor metrics and increase traffic if healthy
- Pros: Early issue detection, reduced blast radius
- Cons: Longer deployment time, complex routing
- Rolling Deployment:
- Update instances one by one or in small batches
- Pros: Resource efficient, gradual rollout
- Cons: Mixed versions during deployment
- Feature Flags:
- Deploy code with features disabled
- Enable features for specific users/groups
- Pros: Decouple deployment from release, easy rollback
- Cons: Code complexity, technical debt
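Percentage rollouts behind a feature flag are often just a stable hash of the user ID; one common, purely illustrative bucketing scheme:

import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    # Deterministic 0-99 bucket so a given user stays in or out of the rollout
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Example: enable the new player UI for 10% of users
print(flag_enabled("new-player-ui", "user-123", 10))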
Implementation Example:
# Kubernetes canary deployment with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: video-api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10          # 10% of traffic to the new version
      - pause: { duration: 1h }
      - setWeight: 50          # 50% of traffic
      - pause: { duration: 30m }
      - setWeight: 100         # full rollout
Decision Matrix:
- Video Streaming Service: Canary (user experience critical)
- Internal APIs: Rolling (cost-effective)
- Critical Payment Service: Blue-Green (zero downtime required)
- Experimental Features: Feature flags (safe experimentation)
Wrap-Up Question: How would you improve this architecture over time?
Expected Answer:
Evolution Strategy:
- Short-term Improvements (0-6 months):
- Implement comprehensive monitoring and alerting
- Add automated scaling based on metrics (CPU, memory, request rate)
- Introduce caching layers for popular content
- Set up proper CI/CD pipelines with automated testing
- Medium-term Enhancements (6-18 months):
- Machine Learning Integration:
- Recommendation engine for personalized content
- Predictive caching based on viewing patterns
- Content moderation using ML models
- Automated video quality optimization
- Performance Optimizations:
- Edge computing for video processing
- WebRTC for low-latency streaming
- Advanced CDN strategies with intelligent routing
- Long-term Innovations (18+ months):
- Advanced Technologies:
- gRPC for internal service communication (better performance than REST)
- Service mesh (Istio) for advanced traffic management
- Event sourcing for audit trails and replay capabilities
- CQRS (Command Query Responsibility Segregation) for read/write optimization
- Reliability Engineering:
- Chaos engineering with tools like Chaos Monkey
- Automated disaster recovery testing
- Multi-cloud deployment for vendor independence
- Advanced security with zero-trust architecture
Implementation Priority:
- Reliability first: Monitoring, alerting, SLOs
- Performance: Caching, CDN optimization
- Innovation: ML features, advanced architectures
- Resilience: Chaos engineering, disaster recovery
Metrics to Track Improvement:
- Technical: Response time, error rate, availability, MTTR
- Business: User engagement, content upload success rate, streaming quality
- Operational: Deployment frequency, change failure rate, recovery time
Quick Reference Guide
SRE Key Concepts
- Error Budget: Amount of downtime acceptable (100% - SLO)
- MTTR: Mean Time To Recovery - how quickly you recover from incidents
- MTBF: Mean Time Between Failures - reliability measure
- Toil: Manual, repetitive work that should be automated
System Design Patterns
- Circuit Breaker: Prevent cascade failures
- Bulkhead: Isolate resources to prevent total failure
- Retry with Backoff: Handle transient failures gracefully
- CQRS: Separate read and write operations for performance
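A stripped-down sketch of the circuit-breaker idea listed above (the failure threshold and cool-down period are illustrative):

import time

class CircuitBreaker:
    # Open the circuit after repeated failures; allow a trial call after a cool-down
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open, failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result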
Performance Metrics
- Latency: Response time (p50, p95, p99)
- Throughput: Requests per second
- Error Rate: Percentage of failed requests
- Availability: Uptime percentage (99.9% = 43.8 minutes downtime/month)
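The latency percentiles above can be computed straight from raw samples, for example (the sample values are invented):

import statistics

latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 180]  # invented request samples
cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
print("p50:", cuts[49], "p95:", cuts[94], "p99:", cuts[98])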
Summary Table
| Area | Concepts Covered |
|---|---|
| SRE Fundamentals | SLI/SLO/SLA, Error Budgets, Monitoring, Incident Response |
| System Design | Scalability, CAP Theorem, Database Sharding, Caching Strategies |
| Architecture Patterns | Microservices vs Monolith, Load Balancing, CDN, Message Queues |
| Security | Authentication, Authorization, API Security, Content Protection |
| DevOps | Deployment Strategies, CI/CD, Infrastructure as Code |
| Programming | Python/Go Best Practices, API Design, Database Optimization |
| Reliability | Fault Tolerance, Disaster Recovery, Chaos Engineering |
Additional Resources
Books
- "Site Reliability Engineering" by Google
- "Designing Data-Intensive Applications" by Martin Kleppmann
- "Building Microservices" by Sam Newman
Tools & Technologies
- Monitoring: Prometheus, Grafana, Datadog, New Relic
- Logging: ELK Stack, Fluentd, Splunk
- Tracing: Jaeger, Zipkin, OpenTelemetry
- Infrastructure: Kubernetes, Docker, Terraform, Helm
Practice Platforms
- System Design: Educative.io, InterviewBit
- Coding: LeetCode, HackerRank
- Architecture: AWS Well-Architected Framework
This guide covers the most common SRE and system design interview questions. Focus on understanding the principles rather than memorizing answers, and always be prepared to dive deeper into any topic based on your experience.