Video Reference: This blog post is based on the 7-step system design interview framework demonstrated in System Design Interview: A Step-By-Step Guide by ByteByteGo.
The 7-Step Framework:
- Requirements & Assumptions - Define scope and constraints
- Capacity Planning - Estimate storage, bandwidth, and throughput
- High-Level Architecture - Design system components and data flow
- Data Models - Structure databases and relationships
- API Design - Define interfaces and contracts
- Critical Flow - Detail the most important user journey
- Scalability - Plan for growth from MVP to global scale
Table of Contents
Core Framework (Steps 1-7):
- Step 1: Requirements & Assumptions
- Step 2: Capacity Planning
- Step 3: High-Level Architecture
- Step 4: Data Models
- Step 5: API Design
- Step 6: Song Playback Flow (Critical User Journey)
- Step 7: Scalability Strategy
Advanced Topics:
- 8. Advanced Features
- 9. Monitoring & Observability
- 10. Security Considerations
1. Requirements & Assumptions
Step 1 of 7: Define Scope & Constraints
Functional Requirements
Define what the system must do:
Core Features:
- Artist Upload: Artists can upload songs with metadata (title, album, genre, cover art)
- Search & Discovery: Users can search for songs, artists, albums, and playlists
- Playback: Stream audio with adaptive bitrate based on network conditions
- Playlist Management: Create, update, delete, and share playlists
- User Profiles: Manage user accounts, subscriptions, and preferences
- Social Features: Follow artists, share songs, collaborative playlists
- Recommendations: Personalized song suggestions based on listening history
Scale Assumptions:
- Active Users: 500,000 daily active users (DAU) initially
- Song Library: 30 million songs
- Concurrent Streams: Peak of 50,000 concurrent streams
- Upload Rate: 10,000 new songs uploaded daily
Non-Functional Requirements
Performance:
- Latency: < 200ms for metadata queries
- Time to First Byte (TTFB): < 500ms for audio streaming
- Availability: 99.9% uptime (8.76 hours downtime/year)
Audio Quality:
- Support multiple bitrates: 64kbps (low), 128kbps (normal), 320kbps (high)
- Adaptive bitrate streaming (ABR) based on network conditions
- Formats: Ogg Vorbis, AAC, FLAC (lossless for premium)
Storage:
- Average song file: ~3MB at 128kbps (3.5 minutes)
- Multiple quality versions per song
2. Capacity Planning
Step 2 of 7: Estimate Storage, Bandwidth & Throughput
Accurate capacity estimation is crucial for infrastructure provisioning and cost optimization.
Storage Requirements
Audio Storage:
```
Base calculation:
  30M songs × 3MB/song (128kbps) = 90TB

Multi-bitrate storage:
  64kbps:  1.5MB/song × 30M = 45TB
  128kbps: 3MB/song × 30M   = 90TB
  320kbps: 7.5MB/song × 30M = 225TB

Total: 360TB
With 3x replication: 360TB × 3 = 1.08PB
```
Metadata Storage:
```
Song metadata:
  30M songs × 200 bytes = 6GB

User data:
  500k users × 2KB (profile + preferences) = 1GB

Playlist data:
  Avg 10 playlists/user, 50 songs/playlist
  5M playlists × 1KB = 5GB

Total metadata: ~15GB with index overhead (easily fits in SQL)
```
Bandwidth Requirements
Daily Streaming Bandwidth:
```
Assumptions:
  500k DAU
  Average 10 songs/user/day
  Average song: 4MB (rounded up from the 3MB 128kbps baseline
    to account for higher-bitrate streams)

Daily: 500k × 10 × 4MB = 20TB/day
Monthly: 20TB × 30 = 600TB/month

With CDN (~80% cache hit ratio):
  Origin egress: 600TB × 0.2 = 120TB/month
```
Upload Bandwidth:
```
10,000 songs/day × 20MB (uncompressed) = 200GB/day
```
Database Throughput
Read Operations (QPS - Queries Per Second):
```
Song metadata queries: 50k concurrent users × 0.1 QPS = 5,000 QPS
Search queries: 500k DAU × 5 searches/day ÷ 86,400s ≈ 30 QPS
User profile: 1,000 QPS

Total read QPS: ~6,000 QPS
```
Write Operations:
```
Song uploads: 10,000/day ÷ 86,400s ≈ 0.12 QPS
Playlist updates: ~50 QPS
Play count updates: 5,000 QPS (batch these!)
```
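These estimates are simple enough to sanity-check in a few lines of code. The sketch below just replays the arithmetic above; every constant comes from this section's assumptions.

```python
# Back-of-the-envelope check of the capacity numbers above
DAU = 500_000
SONGS = 30_000_000
PLAYS_PER_USER_PER_DAY = 10
AVG_SONG_MB = 4                              # rounded-up 128kbps stream
BITRATE_MB = {64: 1.5, 128: 3.0, 320: 7.5}   # MB per song at each bitrate

audio_tb = sum(mb * SONGS for mb in BITRATE_MB.values()) / 1_000_000
print(f"Audio storage: {audio_tb:.0f}TB; with 3x replication: {audio_tb * 3 / 1000:.2f}PB")

daily_tb = DAU * PLAYS_PER_USER_PER_DAY * AVG_SONG_MB / 1_000_000
print(f"Streaming egress: {daily_tb:.0f}TB/day, {daily_tb * 30:.0f}TB/month")
print(f"Origin egress at 80% CDN hit ratio: {daily_tb * 30 * 0.2:.0f}TB/month")
```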
3. High-Level Architecture
Step 3 of 7: Design System Components & Data Flow
The system follows a microservices architecture with clear separation of concerns.
Component Responsibilities
API Gateway:
- Authentication & authorization (JWT validation)
- Rate limiting (prevent abuse)
- Request routing
- SSL termination
User Service:
- User registration/login
- Profile management
- Subscription handling
- User preferences
Song Service:
- Song metadata CRUD
- Artist management
- Album management
- Play count tracking (batched writes)
Playlist Service:
- Playlist CRUD operations
- Collaborative playlists
- Playlist sharing
Search Service:
- Full-text search across songs, artists, albums
- Auto-complete suggestions
- Trending searches
Stream Service:
- Generate signed URLs for audio files
- Handle playback sessions
- Adaptive bitrate logic
Upload Service:
- Handle artist uploads
- Queue songs for encoding
- Validate file formats
4. Data Models (Relational Database)
Step 4 of 7: Structure Databases & Relationships
We use PostgreSQL for structured metadata with strong consistency requirements.
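The indexing strategy below assumes tables along the following lines. The exact columns are illustrative assumptions consistent with the indexes and API payloads in this post, not a canonical schema:

```sql
-- Illustrative schema (column names are assumptions)
CREATE TABLE users (
    user_id           BIGSERIAL PRIMARY KEY,
    email             TEXT UNIQUE NOT NULL,
    display_name      TEXT NOT NULL,
    subscription_type TEXT NOT NULL DEFAULT 'free',
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE songs (
    song_id    BIGSERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    artist_id  BIGINT NOT NULL,
    album_id   BIGINT,
    genre      TEXT,
    duration   INT,                         -- seconds
    play_count BIGINT NOT NULL DEFAULT 0
);

CREATE TABLE playlists (
    playlist_id BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users(user_id),
    name        TEXT NOT NULL,
    is_public   BOOLEAN NOT NULL DEFAULT false
);

CREATE TABLE playlist_songs (
    playlist_id BIGINT REFERENCES playlists(playlist_id),
    song_id     BIGINT REFERENCES songs(song_id),
    position    INT NOT NULL,
    PRIMARY KEY (playlist_id, song_id)
);
```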
Database Indexing Strategy
Critical Indexes:
```sql
-- Users
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_subscription ON users(subscription_type);

-- Songs
CREATE INDEX idx_songs_artist ON songs(artist_id);
CREATE INDEX idx_songs_album ON songs(album_id);
CREATE INDEX idx_songs_genre ON songs(genre);
CREATE INDEX idx_songs_play_count ON songs(play_count DESC);

-- Playlists
CREATE INDEX idx_playlists_user ON playlists(user_id);
CREATE INDEX idx_playlists_public ON playlists(is_public) WHERE is_public = true;

-- Listening History (partitioned by month)
CREATE INDEX idx_history_user_time ON listening_history(user_id, played_at DESC);
CREATE INDEX idx_history_song_time ON listening_history(song_id, played_at DESC);
```
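The "(partitioned by month)" note on listening_history maps naturally to PostgreSQL declarative range partitioning, which keeps per-partition indexes small and lets old months be detached and archived cheaply. A sketch:

```sql
-- Range-partition listening_history by month
CREATE TABLE listening_history (
    user_id   BIGINT NOT NULL,
    song_id   BIGINT NOT NULL,
    played_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (played_at);

-- One partition per month; indexes created on the parent
-- (as above) propagate to each partition automatically
CREATE TABLE listening_history_2026_01 PARTITION OF listening_history
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
```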
5. API Design
Step 5 of 7: Define Interfaces & Contracts
RESTful API endpoints with proper versioning and pagination.
Authentication Endpoints
```
POST /api/v1/auth/register
POST /api/v1/auth/login
POST /api/v1/auth/refresh
POST /api/v1/auth/logout
```
Example Request/Response:
```
POST /api/v1/auth/login
{
  "email": "[email protected]",
  "password": "securePassword123"
}

Response 200:
{
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "refresh_token": "...",
  "expires_in": 3600,
  "user": {
    "user_id": 12345,
    "display_name": "John Doe",
    "subscription_type": "premium"
  }
}
```
Search & Discovery
```
GET /api/v1/search?q={query}&type={song,artist,album,playlist}&limit={20}&offset={0}
GET /api/v1/songs/trending?genre={genre}&region={us}&limit={50}
GET /api/v1/recommendations?user_id={id}&limit={20}
```
Example:
```
GET /api/v1/search?q=bohemian&type=song&limit=5

Response 200:
{
  "results": [
    {
      "type": "song",
      "song_id": 98765,
      "title": "Bohemian Rhapsody",
      "artist": { "artist_id": 111, "name": "Queen" },
      "album": {
        "album_id": 222,
        "title": "A Night at the Opera",
        "cover_art_url": "https://cdn.example.com/covers/222.jpg"
      },
      "duration": 354,
      "play_count": 5000000000
    }
  ],
  "total": 1,
  "limit": 5,
  "offset": 0
}
```
Song Metadata & Streaming
```
GET /api/v1/songs/{song_id}
GET /api/v1/songs/{song_id}/stream?quality={low,normal,high}
POST /api/v1/songs/{song_id}/play
```
Stream Endpoint Response:
```
GET /api/v1/songs/98765/stream?quality=high

Response 200:
{
  "song_id": 98765,
  "title": "Bohemian Rhapsody",
  "artist": "Queen",
  "duration": 354,
  "stream_url": "https://cdn.example.com/audio/...[signed-url]...",
  "expires_at": "2026-01-05T12:00:00Z",
  "bitrate": 320,
  "format": "aac"
}
```
Playlist Management
```
GET /api/v1/playlists/{playlist_id}
POST /api/v1/playlists
PUT /api/v1/playlists/{playlist_id}
DELETE /api/v1/playlists/{playlist_id}
POST /api/v1/playlists/{playlist_id}/songs
DELETE /api/v1/playlists/{playlist_id}/songs/{song_id}
```
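Example, following the same conventions as the endpoints above (IDs and field names are illustrative):

```
POST /api/v1/playlists/555/songs
{
  "song_ids": [98765, 98766]
}

Response 200:
{
  "playlist_id": 555,
  "song_count": 52,
  "updated_at": "2026-01-05T12:00:00Z"
}
```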
Artist Upload
```
POST /api/v1/upload/song
GET /api/v1/upload/status/{upload_id}
```
Upload Flow:
```
POST /api/v1/upload/song
Content-Type: multipart/form-data
{
  "audio_file": [binary],
  "title": "New Song",
  "album_id": 222,
  "genre": "rock",
  "duration": 240
}

Response 202 Accepted:
{
  "upload_id": "upload_abc123",
  "status": "processing",
  "estimated_time_seconds": 120
}
```
6. Song Playback Flow
Step 6 of 7: Detail the Most Critical User Journey
Streaming a song is the heart of the product, so let's break the flow down step by step.
Key Implementation Details
1. JWT Authentication
```python
from fastapi import HTTPException
from jose import jwt, JWTError

async def validate_token(token: str) -> User:
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        user_id = payload.get("user_id")

        # Reject tokens for lapsed subscriptions
        if not await check_subscription(user_id):
            raise HTTPException(status_code=403, detail="Subscription expired")

        return User(id=user_id, subscription=payload.get("subscription"))
    except JWTError:
        raise HTTPException(status_code=401, detail="Invalid token")
```
2. Signed URL Generation (S3 Presigned URL)
```python
import boto3

async def generate_signed_url(s3_key: str, expiry_seconds: int = 3600) -> str:
    s3_client = boto3.client('s3')
    signed_url = s3_client.generate_presigned_url(
        'get_object',
        Params={
            'Bucket': 'music-streaming-audio',
            'Key': s3_key
        },
        ExpiresIn=expiry_seconds
    )
    return signed_url
```
3. HTTP Range Requests (Streaming)
The mobile app uses HTTP Range requests to stream audio in chunks:
```
GET /audio/song_98765_320kbps.aac
Range: bytes=0-524287

Response 206 Partial Content:
Content-Range: bytes 0-524287/7864320
Content-Length: 524288
Content-Type: audio/aac

[binary audio data]
```
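A minimal client-side sketch of this chunked fetch pattern, using Python's requests library (URL handling and buffering are simplified; a real player feeds chunks into a decoder as they arrive):

```python
import requests

CHUNK = 512 * 1024  # 512KB, matching the example above

def stream_in_chunks(url: str):
    """Yield the audio file in CHUNK-sized byte ranges."""
    start = 0
    while True:
        headers = {"Range": f"bytes={start}-{start + CHUNK - 1}"}
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        yield resp.content                 # hand the chunk to the audio decoder
        if resp.status_code != 206:
            break                          # server ignored Range, sent whole file
        total = int(resp.headers["Content-Range"].split("/")[1])
        start += CHUNK
        if start >= total:
            break
```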
4. Adaptive Bitrate Streaming
The client monitors network conditions and switches quality:
```javascript
// Client-side logic
function selectBitrate(networkSpeed) {
  if (networkSpeed < 500) return 64    // kbps
  if (networkSpeed < 1500) return 128
  return 320
}

// Monitor and adapt
setInterval(() => {
  const speed = measureNetworkSpeed()
  const newBitrate = selectBitrate(speed)
  if (newBitrate !== currentBitrate) {
    switchStreamQuality(newBitrate)
  }
}, 10000)  // Check every 10 seconds
```
5. Play Count Analytics (Batched Writes)
Don't update the database on every play; buffer counts in memory and batch the writes to reduce load:
```python
import asyncio
from collections import defaultdict

play_count_buffer = defaultdict(int)
BATCH_SIZE = 1000
BATCH_INTERVAL = 60  # seconds

async def record_play(song_id: int):
    play_count_buffer[song_id] += 1
    if sum(play_count_buffer.values()) >= BATCH_SIZE:
        await flush_play_counts()

async def flush_play_counts():
    if not play_count_buffer:
        return

    # Bulk update in a single round trip
    async with db_pool.acquire() as conn:
        values = [(count, song_id) for song_id, count in play_count_buffer.items()]
        await conn.executemany(
            "UPDATE songs SET play_count = play_count + $1 WHERE song_id = $2",
            values
        )
    play_count_buffer.clear()

async def periodic_flush():
    # Flush whatever has accumulated, even below BATCH_SIZE
    while True:
        await asyncio.sleep(BATCH_INTERVAL)
        await flush_play_counts()

# Background task (started from an async context, e.g. app startup)
asyncio.create_task(periodic_flush())
```
7. Scalability (Scaling to 50M Users)
Step 7 of 7: Plan for Growth from MVP to Global Scale
Scaling from 500K to 50M users requires architectural evolution.
Database Scaling Strategies
1. Read Replicas (Leader-Follower Replication)
Configuration:
- 1 Leader (handles all writes)
- 5-10 Read Replicas (distribute read traffic)
- Async replication (acceptable replication lag: < 100ms)
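At the application layer, read/write splitting can be as simple as routing by statement type. A minimal sketch (pool objects here are assumed to be asyncpg-style connection pools; many deployments push this into the driver or a proxy layer instead):

```python
import random

class ReplicaRouter:
    """Route writes to the leader and reads across replicas.

    With async replication, a read that must observe a write the
    client just made (read-your-own-writes) should still hit the leader.
    """

    def __init__(self, leader_pool, replica_pools):
        self.leader = leader_pool
        self.replicas = replica_pools

    def pool_for(self, is_write: bool, needs_fresh_read: bool = False):
        if is_write or needs_fresh_read or not self.replicas:
            return self.leader
        return random.choice(self.replicas)

# Usage: router.pool_for(is_write=False) for song-metadata lookups,
#        router.pool_for(is_write=True) for playlist updates
```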
2. Database Sharding (Horizontal Partitioning)
When metadata outgrows a single instance (e.g., 50GB+ of user data, 20GB+ of song data), shard by key:
User Data Sharding:
```python
def get_user_shard(user_id: int, num_shards: int = 10) -> int:
    return user_id % num_shards

# Route queries to the correct shard
shard_id = get_user_shard(user_id, num_shards=10)
db_conn = shard_connections[shard_id]
```
Song Data Sharding:
```python
# Shard by artist_id so all of an artist's songs live on the same shard
def get_song_shard(artist_id: int, num_shards: int = 20) -> int:
    return artist_id % num_shards
```
3. Caching Strategy
Cache Keys:
```
song:{song_id}:metadata          TTL: 1 hour
user:{user_id}:profile           TTL: 30 min
playlist:{playlist_id}           TTL: 15 min
trending:songs:{genre}:{region}  TTL: 5 min
search:autocomplete:{prefix}     TTL: 24 hours
```
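The read path pairing with these keys is a standard cache-aside lookup. A sketch, assuming redis-py's async client, an asyncpg-style db handle, and the song TTL from the table above:

```python
import json

SONG_TTL = 3600  # 1 hour, matching the table above

async def get_song_metadata(song_id: int) -> dict:
    key = f"song:{song_id}:metadata"

    # 1. Try the cache first
    cached = await redis.get(key)
    if cached:
        return json.loads(cached)

    # 2. Fall back to the database on a miss
    song = await db.fetchrow("SELECT * FROM songs WHERE song_id = $1", song_id)

    # 3. Populate the cache for subsequent readers
    await redis.set(key, json.dumps(dict(song)), ex=SONG_TTL)
    return dict(song)
```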
Cache Invalidation:
```python
async def update_song_metadata(song_id: int, data: dict):
    # Update database
    await db.execute("UPDATE songs SET ... WHERE song_id = $1", song_id)

    # Invalidate cache
    await redis.delete(f"song:{song_id}:metadata")
```
CDN Strategy
Geographic Distribution:
Region-based CDN PoPs:
- North America: 15 edge locations
- Europe: 12 edge locations
- Asia-Pacific: 10 edge locations
- South America: 5 edge locations
- Africa/Middle East: 3 edge locations

Total: 45 edge locations globally
Cache Configuration:
```nginx
# CDN cache rules
location /audio/ {
    proxy_cache audio_cache;
    proxy_cache_valid 200 7d;                       # Cache for 7 days
    proxy_cache_lock on;                            # Prevent thundering herd
    proxy_cache_use_stale error timeout updating;
    add_header X-Cache-Status $upstream_cache_status;
}
```
Benefits:
- 80-90% cache hit ratio
- Reduced origin egress costs (from 600TB to 60-120TB/month at an 80-90% hit ratio)
- Lower latency (< 50ms to CDN edge vs. 200ms+ to origin)
Auto-Scaling
API Server Auto-Scaling:
```yaml
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 10
  maxReplicas: 200
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
Message Queue for Async Processing
Use Kafka or RabbitMQ for:
- Audio encoding (upload → encode → store)
- Play count updates (batch writes)
- Recommendation engine updates
- Analytics processing
```python
import uuid

# Producer (API Server)
async def handle_song_upload(file: UploadFile, artist_id: int):
    upload_id = str(uuid.uuid4())

    # Store raw file temporarily
    await s3.upload_file(file, f"uploads/{upload_id}.raw")

    # Queue encoding job
    await kafka_producer.send('song-encoding', {
        'upload_id': upload_id,
        'artist_id': artist_id,
        's3_key': f"uploads/{upload_id}.raw",
        'target_bitrates': [64, 128, 320]
    })

    return {"upload_id": upload_id, "status": "queued"}

# Consumer (Encoder Worker)
async def encode_song(message):
    upload_id = message['upload_id']
    artist_id = message['artist_id']

    # Download raw file
    raw_audio = await s3.download(message['s3_key'])

    # Encode to multiple bitrates
    for bitrate in message['target_bitrates']:
        encoded = await encode_audio(raw_audio, bitrate)
        s3_key = f"audio/{artist_id}/{upload_id}_{bitrate}kbps.aac"
        await s3.upload(s3_key, encoded)

        # Update database
        await db.insert_song_file(upload_id, bitrate, s3_key)

    # Clean up raw file
    await s3.delete(message['s3_key'])
```
8. Advanced Features
Beyond the Core Framework
The following sections explore advanced features that would enhance the platform beyond the core MVP design.
Recommendation Engine
Collaborative Filtering:
```python
from sklearn.neighbors import NearestNeighbors

async def get_recommendations(user_id: int, limit: int = 20):
    # Build user-song interaction matrix (scipy sparse)
    # Rows: users, Columns: songs, Values: play counts
    matrix = build_interaction_matrix()

    # Find similar users by cosine distance over play-count vectors
    # (assumes user_id maps directly to a row index)
    model = NearestNeighbors(metric='cosine', algorithm='brute')
    model.fit(matrix)
    distances, indices = model.kneighbors(matrix[user_id], n_neighbors=50)

    # Aggregate songs from similar users
    recommended_songs = aggregate_songs_from_users(indices)

    # Filter out songs the user has already heard
    user_history = await get_user_history(user_id)
    recommendations = [
        song for song in recommended_songs
        if song not in user_history
    ][:limit]

    return recommendations
```
Content-Based Filtering:
```python
from sentence_transformers import SentenceTransformer

# Embed song metadata (title, artist, genre, lyrics)
model = SentenceTransformer('all-MiniLM-L6-v2')

async def find_similar_songs(song_id: int, limit: int = 10):
    # Get song metadata
    song = await db.get_song(song_id)

    # Create text representation
    text = f"{song.title} {song.artist} {song.genre} {song.lyrics}"

    # Embed
    query_embedding = model.encode(text)

    # Search in vector database (Pinecone, Milvus, etc.)
    results = await vector_db.search(query_embedding, limit=limit)
    return results
```
Real-Time Lyrics Sync
```
GET /api/v1/songs/98765/lyrics

Response 200:
{
  "song_id": 98765,
  "lyrics": [
    { "start_time": 0.5, "end_time": 3.2, "text": "Is this the real life?" },
    { "start_time": 3.5, "end_time": 6.8, "text": "Is this just fantasy?" }
  ]
}
```
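On the client, syncing reduces to finding the line whose time window contains the current playback position. A small binary-search sketch over the payload above (the function itself is illustrative; field names match the response):

```python
import bisect

def current_line(lyrics: list[dict], position_s: float) -> str | None:
    """Return the lyric line active at position_s, or None between lines."""
    starts = [line["start_time"] for line in lyrics]  # assumed sorted
    i = bisect.bisect_right(starts, position_s) - 1
    if i >= 0 and position_s <= lyrics[i]["end_time"]:
        return lyrics[i]["text"]
    return None
```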
Social Features
Activity Feed:
```python
import json
import time

# Redis Sorted Set for activity timeline
async def add_activity(user_id: int, activity: dict):
    timestamp = time.time()
    activity_json = json.dumps(activity)
    await redis.zadd(
        f"user:{user_id}:feed",
        {activity_json: timestamp}
    )
    # Keep only the most recent 1000 activities
    await redis.zremrangebyrank(f"user:{user_id}:feed", 0, -1001)

async def get_feed(user_id: int, limit: int = 50):
    # Get following list
    following = await db.get_following(user_id)

    # Aggregate activities from followed users
    # (entries are JSON strings; the score is the timestamp)
    activities = []
    for followed_id in following:
        raw = await redis.zrange(
            f"user:{followed_id}:feed", 0, -1, withscores=True
        )
        activities.extend(
            {**json.loads(item), "timestamp": score} for item, score in raw
        )

    # Sort by timestamp and limit
    activities.sort(key=lambda x: x['timestamp'], reverse=True)
    return activities[:limit]
```
Offline Mode
Download Management:
```python
# Client-side
async def download_playlist(playlist_id: int):
    songs = await api.get_playlist_songs(playlist_id)

    for song in songs:
        # Download the highest quality the user has access to
        stream_url = await api.get_stream_url(song.id, quality='high')

        # Download to local storage
        await download_file(stream_url, f"downloads/{song.id}.aac")

        # Store metadata
        await local_db.save_song_metadata(song)

    await local_db.mark_playlist_downloaded(playlist_id)
```
9. Monitoring & Observability
Key Metrics to Track
Application Metrics:
```python
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
request_count = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds',
                             'HTTP request duration')

# Business metrics
songs_streamed = Counter('songs_streamed_total', 'Total songs streamed')
song_upload_errors = Counter('song_upload_errors_total', 'Failed song uploads')
active_streams = Gauge('active_streams_current', 'Current active streams')

# Database metrics
db_query_duration = Histogram('db_query_duration_seconds',
                              'Database query duration', ['query_type'])
db_connection_pool_size = Gauge('db_connection_pool_size',
                                'Database connection pool size')

# Cache metrics
cache_hit_rate = Gauge('cache_hit_rate', 'Cache hit rate', ['cache_type'])
```
Health Checks:
```python
from fastapi import FastAPI, status
from fastapi.responses import JSONResponse

@app.get("/health", status_code=status.HTTP_200_OK)
async def health_check():
    # Check database connectivity
    db_healthy = await check_db_health()
    # Check Redis
    cache_healthy = await check_redis_health()
    # Check S3
    storage_healthy = await check_s3_health()

    if not all([db_healthy, cache_healthy, storage_healthy]):
        # Return 503 so load balancers take this instance out of rotation
        return JSONResponse(
            status_code=503,
            content={
                "status": "unhealthy",
                "database": db_healthy,
                "cache": cache_healthy,
                "storage": storage_healthy,
            },
        )

    return {"status": "healthy"}
```
Distributed Tracing:
```python
from fastapi import Request
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Initialize tracing
FastAPIInstrumentor.instrument_app(app)

@app.get("/songs/{song_id}/stream")
async def stream_song(song_id: int, request: Request):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("validate_user"):
        user = await validate_token(request.headers['Authorization'])

    with tracer.start_as_current_span("fetch_song_metadata"):
        song = await get_song_metadata(song_id)

    with tracer.start_as_current_span("generate_signed_url"):
        stream_url = await generate_signed_url(song.s3_key)

    return {"stream_url": stream_url}
```
Alerting:
```yaml
# Prometheus alerting rules
groups:
- name: api_alerts
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'High error rate detected'

  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: '95th percentile latency > 1s'

  - alert: DatabaseConnectionPoolExhausted
    expr: db_connection_pool_size / db_connection_pool_max > 0.9
    for: 5m
    labels:
      severity: warning
```
10. Security Considerations
Authentication & Authorization
JWT Token Structure:
{ "sub": "12345", "user_id": 12345, "email": "[email protected]", "subscription": "premium", "roles": ["user"], "iat": 1704470400, "exp": 1704474000 }
Rate Limiting:
```python
from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.get("/search")
@limiter.limit("100/minute")
async def search(request: Request, q: str):
    return await perform_search(q)

# Premium users get higher limits
# (get_user_tier is an app-specific key function)
@limiter.limit("1000/minute", key_func=get_user_tier)
async def premium_search(request: Request, q: str):
    return await perform_search(q)
```
Data Protection
Encryption at Rest:
- Database: AWS RDS encryption (AES-256)
- S3: Server-side encryption (SSE-S3 or SSE-KMS)
- Backups: Encrypted with KMS keys
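Server-side encryption is requested per object at upload time. A boto3 sketch (the bucket name is reused from earlier in the post; the KMS key alias is hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# SSE-KMS: S3 encrypts the object at rest with a customer-managed KMS key
s3.put_object(
    Bucket="music-streaming-audio",
    Key="audio/111/upload_abc123_320kbps.aac",
    Body=open("song.aac", "rb"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/music-audio-key",  # hypothetical key alias
)
```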
Encryption in Transit:
- TLS 1.3 for all API communication
- HTTPS only (HSTS headers)
DRM (Digital Rights Management), shown here in simplified form as application-level encryption:
```python
# For premium content
from cryptography.fernet import Fernet

async def encrypt_audio_file(file_path: str, key: bytes):
    fernet = Fernet(key)

    with open(file_path, 'rb') as f:
        audio_data = f.read()

    encrypted_data = fernet.encrypt(audio_data)

    with open(f"{file_path}.encrypted", 'wb') as f:
        f.write(encrypted_data)

    return f"{file_path}.encrypted"

# Client decrypts with a user-specific key
```
Input Validation
```python
from pydantic import BaseModel, validator, constr

class SongUploadRequest(BaseModel):
    title: constr(min_length=1, max_length=200)
    artist_id: int
    duration: int  # seconds
    genre: str

    @validator('duration')
    def validate_duration(cls, v):
        if v < 1 or v > 3600:  # Max 1 hour
            raise ValueError('Duration must be between 1 and 3600 seconds')
        return v

    @validator('genre')
    def validate_genre(cls, v):
        allowed_genres = ['rock', 'pop', 'jazz', 'classical', 'hip-hop']
        if v.lower() not in allowed_genres:
            raise ValueError(f'Genre must be one of {allowed_genres}')
        return v.lower()
```
DDoS Protection
- AWS Shield / Cloudflare for network-layer protection
- Rate limiting at API Gateway
- Geo-blocking for suspicious regions
- Web Application Firewall (WAF) rules
Conclusion
Building a music streaming platform like Spotify requires careful consideration of:
- Storage: Separating metadata (SQL) from binary files (S3/Blob)
- Delivery: Using CDNs to reduce latency and costs
- Scalability: Database sharding, caching, and horizontal scaling
- Performance: Async operations, connection pooling, batch writes
- Security: JWT authentication, signed URLs, encryption, rate limiting
- Observability: Comprehensive monitoring, tracing, and alerting
Key Takeaways
✅ Decouple audio delivery from metadata queries using signed URLs and CDNs
✅ Batch write operations (play counts, analytics) to reduce database load
✅ Use multi-tier caching (in-memory → Redis → database) for hot data
✅ Implement adaptive bitrate streaming for optimal user experience
✅ Design for failure with health checks, circuit breakers, and graceful degradation
✅ Monitor everything: latency, error rates, cache hit ratios, resource utilization
Further Reading
Video Resources:
- System Design Interview: A Step-By-Step Guide - ByteByteGo - Comprehensive system design interview framework
Technical Blogs:
- Spotify Engineering Blog - Real-world insights from Spotify's engineering team
- Netflix Tech Blog - Video Streaming - Lessons on content delivery at scale
- Discord Engineering - Voice & Video Infrastructure - Audio streaming architecture patterns
Guides & Documentation:
- AWS Well-Architected Framework - Cloud architecture best practices
- System Design Primer - Comprehensive system design resource
- Martin Kleppmann - Designing Data-Intensive Applications - Deep dive into distributed systems
Acknowledgments
This blog post is inspired by and expands upon the system design principles demonstrated in ByteByteGo's System Design Interview Guide. The 7-step framework (Requirements → Capacity → Architecture → Data Model → API → Critical Flow → Scalability) provides an excellent structure for approaching system design interviews and real-world architecture decisions.
This comprehensive guide covers the essential components and advanced considerations for building a production-ready music streaming platform. The architecture can be adapted based on specific business requirements, scale, and available resources.
