Learning Objectives: By the end of this lesson, you'll build a complete distributed system that handles thousands of events per second, understand why companies like Netflix and Uber architect their systems this way, and gain hands-on experience with patterns used in real production environments.
What We're Building Today
Complete distributed log processing system with producer, consumer, and API gateway services
Production-grade infrastructure using Docker Compose with Kafka, Redis, PostgreSQL, and monitoring
Observability-first architecture with metrics, health checks, and distributed tracing
Scalable foundation that handles real-world failure scenarios and performance requirements
[IMAGE PLACEHOLDER: System Overview Diagram - showing the three main services and their connections]
Why This Matters: Beyond Hello World to Production Scale
Most programming tutorials show you how to build a simple app that works on your laptop. But what happens when thousands of people try to use your app at the same time? What if one part breaks? What if you need to process a million events per hour?
That's where distributed systems come in. Think of it like this: instead of having one super-powerful computer doing everything (which would be expensive and could fail), you have many smaller computers working together. Each one has a specific job, and they communicate with each other to get the work done.
The difference between a school project and a production system isn't just scale - it's the architectural decisions that anticipate failure, enable observability, and provide clear scaling paths. We're implementing patterns that work at 10 requests per second and scale to 100,000 requests per second with minimal changes.
This log processing system represents one of the most fundamental distributed patterns: event-driven architecture with persistent messaging. Every major tech company has multiple variants of this pattern powering their core business logic.
[IMAGE PLACEHOLDER: Before/After comparison showing monolithic vs distributed architecture]
System Design Deep Dive: Five Critical Distributed Patterns
Pattern 1: Event-Driven Decoupling with Apache Kafka
The Challenge: Traditional request-response systems create tight coupling and cascade failures. When service A calls service B directly, A's availability depends on B's availability. It's like a chain - if one link breaks, the whole thing fails.
The Solution: Event-driven architecture with persistent messaging. Services publish events without knowing who consumes them, and consumers process events at their own pace. Think of it like a postal system - you drop off mail, and it gets delivered even if the recipient isn't home right away.
Trade-offs:
Gains: High availability, natural backpressure handling, replay capability
Costs: Eventual consistency, increased complexity, message ordering challenges
Implementation: Kafka provides ordered, partitioned, replicated event logs. Our producer publishes log events to topics, consumers process them independently. If a consumer fails, messages wait in the partition until recovery.
Anti-pattern: Using Kafka as a database. Kafka excels at streaming data, not random access queries.
[IMAGE PLACEHOLDER: Event-driven flow diagram showing producer → Kafka → consumer]
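The flow above can be sketched in plain Java, using a bounded in-memory queue as a stand-in for one Kafka topic partition (the real system would use the kafka-clients producer and consumer APIs instead):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// In-process analogy: the producer publishes events without knowing who
// consumes them, and the consumer drains the buffer at its own pace,
// preserving order within the "partition".
public class EventFlowSketch {

    // Consumer side: take events in publish order until the partition is empty.
    public static List<String> drain(BlockingQueue<String> partition) {
        List<String> processed = new ArrayList<>();
        String event;
        while ((event = partition.poll()) != null) {
            processed.add(event);
        }
        return processed;
    }

    public static void main(String[] args) {
        BlockingQueue<String> partition = new ArrayBlockingQueue<>(100);
        // Producer side: publish and move on; no reply is awaited.
        partition.offer("user signed in");
        partition.offer("payment failed");
        System.out.println(drain(partition));
    }
}
```

If the consumer stops, events simply wait in the buffer, which is exactly the decoupling property Kafka gives us durably and across processes.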
Pattern 2: Circuit Breaker for Cascade Failure Prevention
The Challenge: In distributed systems, one slow service can cascade failures across the entire system. Database connection pool exhaustion, memory leaks, or network partitions in one service shouldn't destroy the whole system.
The Solution: Circuit breakers monitor failure rates and automatically "open" to prevent calls to failing services, giving them time to recover. Think of it like an electrical circuit breaker in your house - when there's too much current, it shuts off to prevent a fire.
Trade-offs:
Gains: System stability, graceful degradation, automatic recovery
Costs: Added complexity, potential for false positives, tuning effort (thresholds and timeouts must be calibrated against real traffic)
Implementation: Resilience4j circuit breakers wrap external calls with configurable failure thresholds, timeout periods, and fallback responses.
Scale Reality: Netflix's Hystrix (now in maintenance mode, with Resilience4j as its recommended successor) prevented massive outages by isolating failures. Without circuit breakers, a single database issue could take down their entire streaming platform.
[IMAGE PLACEHOLDER: Circuit breaker state diagram showing closed/open/half-open states]
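The state machine in the diagram reduces to a small amount of logic. This is a dependency-free sketch of the closed/open/half-open transitions; production code would use Resilience4j's CircuitBreaker with its richer sliding-window failure-rate tracking:

```java
// Minimal circuit breaker: trip OPEN after N consecutive failures,
// allow one probe after a cooldown (HALF_OPEN), close again on success.
public class CircuitBreakerSketch {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;          // cooldown before probing again
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Should the next call be allowed through?
    public boolean allowRequest(long now) {
        if (state == State.OPEN && now - openedAt >= openMillis) {
            state = State.HALF_OPEN;        // cooldown elapsed: probe once
        }
        return state != State.OPEN;
    }

    public void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;               // dependency recovered
    }

    public void recordFailure(long now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;             // stop calling; let it recover
            openedAt = now;
        }
    }

    public State state() { return state; }
}
```

While OPEN, the caller returns a fallback response immediately instead of waiting on a timeout, which is what stops the cascade.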
Pattern 3: Distributed Caching with Redis
The Challenge: Database queries become bottlenecks as traffic scales. Even optimized queries can't handle thousands of concurrent requests for the same data. Imagine if every time someone wanted to know the weather, they had to call the weather station directly.
The Solution: Multi-layer caching with cache-aside pattern and TTL-based invalidation. Store frequently accessed data in fast memory so you don't have to go to the slow database every time.
Trade-offs:
Gains: Sub-millisecond response times, reduced database load, natural load balancing
Costs: Cache invalidation complexity, memory overhead, eventual consistency
Implementation: Redis provides distributed caching with pub/sub capabilities. Application code implements cache-aside pattern with fallback to database.
Bottleneck Alert: Cache hit ratio below 85% indicates either wrong data being cached or TTL too aggressive. Monitor cache performance metrics closely.
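The cache-aside read path described above looks like this, sketched against a plain map. In the real system the map is Redis (GET plus SETEX via a client library) and the loader is a PostgreSQL query, but the logic is identical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside with TTL: try the cache, fall back to the database on a
// miss, and repopulate the cache with an expiry time.
public class CacheAsideSketch {
    private record Entry(String value, long expiresAt) {}

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;
    private int hits = 0, misses = 0;

    public CacheAsideSketch(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public String get(String key, long now, Function<String, String> loadFromDb) {
        Entry e = cache.get(key);
        if (e != null && now < e.expiresAt) {
            hits++;
            return e.value;                   // hit: no database round trip
        }
        misses++;
        String value = loadFromDb.apply(key); // miss or expired: go to the DB
        cache.put(key, new Entry(value, now + ttlMillis));
        return value;
    }

    // This ratio feeds the "below 85%" bottleneck alert mentioned above.
    public double hitRatio() {
        int total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```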
Pattern 4: Health Checks and Observability
The Challenge: In distributed systems, partial failures are normal. Services might be running but unable to process requests due to database connections, memory pressure, or dependency failures. It's like a car that starts but can't drive because it's out of gas.
The Solution: Multi-level health checks with shallow and deep verification, combined with metrics-driven alerting.
Trade-offs:
Gains: Early failure detection, automatic recovery, performance visibility
Costs: Monitoring overhead, false positive alerts, increased operational complexity
Implementation: Spring Boot Actuator provides health endpoints, Micrometer exports metrics to Prometheus, Grafana visualizes system state.
[IMAGE PLACEHOLDER: Monitoring dashboard screenshot showing key metrics]
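The shallow-versus-deep distinction can be shown without Spring. Actuator's /actuator/health aggregates "contributors" in much the same way: shallow means "the process answers", deep means "the dependencies answer":

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Two levels of health checking: a cheap liveness answer, and a deep
// check that probes each dependency and aggregates the results.
public class HealthCheckSketch {

    // Shallow: the service can respond at all.
    public static Map<String, String> shallow() {
        return Map.of("status", "UP");
    }

    // Deep: overall status is DOWN if any dependency probe fails.
    public static Map<String, String> deep(Map<String, BooleanSupplier> probes) {
        Map<String, String> report = new LinkedHashMap<>();
        boolean allUp = true;
        for (Map.Entry<String, BooleanSupplier> probe : probes.entrySet()) {
            boolean up = probe.getValue().getAsBoolean();
            report.put(probe.getKey(), up ? "UP" : "DOWN");
            allUp &= up;
        }
        report.put("status", allUp ? "UP" : "DOWN");
        return report;
    }
}
```

Load balancers typically poll the shallow check frequently and the deep check on a slower cadence, since deep checks cost real dependency calls.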
Pattern 5: API Gateway Pattern for Service Orchestration
The Challenge: Client applications shouldn't need to know about internal service topology. Cross-cutting concerns like authentication, rate limiting, and request routing need centralized implementation. Imagine if you had to know the phone number of every department in a company instead of just calling the main number.
The Solution: API Gateway acts as single entry point, handling routing, authentication, rate limiting, and request aggregation.
Trade-offs:
Gains: Simplified client logic, centralized cross-cutting concerns, easier service evolution
Costs: Single point of failure, potential bottleneck, increased latency
Implementation: Spring Cloud Gateway with reactive routing, Redis-backed rate limiting, and JWT authentication.
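Rate limiting is the gateway concern that's easiest to see in miniature. Spring Cloud Gateway's Redis-backed limiter implements a token bucket per client key so all gateway instances share one budget; here is the same algorithm in memory:

```java
// Token bucket: each request costs one token; tokens refill at a fixed
// rate up to a burst capacity. Reject (HTTP 429) when the bucket is empty.
public class TokenBucketSketch {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillMillis;

    public TokenBucketSketch(double capacity, double refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;            // start with a full bucket
        this.lastRefillMillis = nowMillis;
    }

    // Returns true if the request is allowed, false if it should be rejected.
    public boolean tryAcquire(long nowMillis) {
        double elapsedSeconds = (nowMillis - lastRefillMillis) / 1000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillMillis = nowMillis;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```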
Implementation Walkthrough: Building the Foundation
Step 1: Multi-Service Architecture Design
Our system implements three core services:
Log Producer: REST API that accepts log events and publishes to Kafka
Log Consumer: Kafka consumer that processes events and stores to PostgreSQL
API Gateway: Routes requests, handles authentication, provides unified API
Architectural Decision: Separate producer and consumer services enable independent scaling. During traffic spikes, we can scale producers horizontally without affecting consumer processing rates.
[IMAGE PLACEHOLDER: Service architecture diagram with three boxes and connections]
Step 2: Kafka Topic Design and Partitioning Strategy
We create the main log topic with 6 partitions. Why 6? Partitions are Kafka's unit of parallelism: up to 6 consumer instances in one group can process the topic concurrently, so partition count should match your expected peak consumer count for optimal throughput.
Message Key Strategy: Using organizationId as partition key ensures logs from the same organization always go to the same partition, maintaining order while enabling parallelism.
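The key-to-partition mapping is worth seeing concretely. Kafka's default partitioner actually hashes key bytes with murmur2; the hashCode-based version below is a simplified stand-in that demonstrates the property we rely on:

```java
// Same key -> same partition, always. That is what preserves per-org
// ordering while different organizations process in parallel.
// (Kafka's real DefaultPartitioner uses murmur2 over the key bytes;
// hashCode() here is only an illustration.)
public class PartitionerSketch {
    public static int partitionFor(String organizationId, int partitionCount) {
        return Math.abs(organizationId.hashCode() % partitionCount);
    }
}
```

Note the corollary: changing the partition count later remaps keys to different partitions, which is why the count is chosen up front with headroom.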
Step 3: Error Handling and Retry Logic
Exponential Backoff: Prevents overwhelming failed services during recovery. Initial retry after 1 second, then 2 seconds, then 4 seconds.
Dead Letter Queue: After 3 retries, messages go to error topic for manual investigation. This prevents one bad message from blocking the entire pipeline.
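The retry schedule and dead-letter decision from the two points above fit in a few lines (the "logs.dead-letter" topic name is illustrative):

```java
// Exponential backoff (1s, 2s, 4s) and dead-letter routing after the
// retry budget is exhausted, so one poison message can't block a partition.
public class RetryPolicySketch {
    public static final int MAX_RETRIES = 3;
    public static final long INITIAL_DELAY_MILLIS = 1000;

    // Delay before retry attempt n (0-based): doubles each time.
    public static long delayMillis(int attempt) {
        return INITIAL_DELAY_MILLIS << attempt;
    }

    // After MAX_RETRIES failures, hand the message to the error topic
    // for manual investigation instead of retrying forever.
    public static String destinationAfterFailure(int failedAttempts) {
        return failedAttempts >= MAX_RETRIES ? "logs.dead-letter" : "retry";
    }
}
```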
Step 4: Distributed Tracing Implementation
Every request gets a trace ID that follows the request across all services. This enables debugging distributed transactions and identifying bottlenecks.
Performance Impact: Tracing adds ~2-3ms latency per request but provides invaluable debugging capability. In production, sample 10% of requests to balance observability with performance.
[IMAGE PLACEHOLDER: Trace visualization showing request flow across services]
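Trace propagation in miniature: generate an ID at the edge, carry it on every hop, and stamp it into every log line. In practice the carrier is an HTTP header (such as the W3C traceparent) or Kafka record headers; the "X-Trace-Id" name below is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// The core of distributed tracing: one ID follows the request across
// every service, so logs from different processes can be correlated.
public class TracingSketch {
    public static final String TRACE_HEADER = "X-Trace-Id"; // illustrative name

    // At the gateway: reuse an incoming trace ID, or start a new trace.
    public static Map<String, String> withTraceId(Map<String, String> headers) {
        Map<String, String> out = new HashMap<>(headers);
        out.putIfAbsent(TRACE_HEADER, UUID.randomUUID().toString());
        return out;
    }

    // Downstream services log with the same ID they received.
    public static String logLine(Map<String, String> headers, String message) {
        return "[trace=" + headers.get(TRACE_HEADER) + "] " + message;
    }
}
```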
Production Considerations: Real-World Operational Challenges
Performance Under Load
Throughput Targets: System handles 1,000 events/second with 99th percentile latency under 100ms. Kafka can scale to 100,000+ events/second by adding partitions and consumer instances.
Memory Management: JVM heap sized at 2GB with G1GC for low-latency garbage collection. Monitor for memory leaks in long-running consumer processes.
Connection Pooling: Database connection pool sized to 20 connections per service instance. Too many connections overwhelm PostgreSQL; too few create bottlenecks.
Monitoring and Alerting Strategy
Key Metrics:
Consumer lag: Alert if >1000 messages behind
Error rate: Alert if >5% of requests fail
Response latency: Alert if 95th percentile >200ms
Database connections: Alert if >80% pool utilization
Failure Scenarios: Test network partitions, database failures, and OOM conditions. Our circuit breakers should prevent cascade failures.
Scale Connection: How FAANG Companies Use These Patterns
Netflix: Uses closely related patterns for their recommendation pipeline. User interactions go through Kafka, multiple microservices process events, and results cache in distributed Redis clusters.
Uber: Their dispatch system follows the same architecture. Ride requests publish to Kafka, driver matching services consume events, and Redis caches driver locations for sub-second matching.
Amazon: Warehouse management systems use event-driven architecture with circuit breakers to handle peak holiday traffic without degrading performance.
The patterns we're implementing today are battle-tested at the highest scales in the industry.
Next Steps: Tomorrow's Advanced Concepts
Tomorrow we'll implement the log generator that produces realistic log events at configurable rates, adding backpressure handling and batch processing optimizations that enable 10x throughput improvements.
Hands-On Implementation: Building Your System
Now let's put theory into practice. We'll build this entire system step by step, so you can see how each piece works together.
Prerequisites Setup
Before we start coding, make sure you have these tools installed:
Required Software:
Java 17 or higher
Maven 3.6 or higher
Docker Desktop
Git
Your favorite IDE (IntelliJ IDEA, VS Code, or Eclipse)
Quick Check:
Open your terminal and run these commands to verify everything is installed:
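A quick way to verify the toolchain in one pass (the tool names assume default installs; the script reports missing tools rather than failing):

```shell
# Check that each required tool is on PATH.
for tool in java mvn docker git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT FOUND - install it before continuing"
  fi
done
TOOLCHECK=done
```

You can also run `java -version` and `mvn -v` individually to confirm Java 17+ and Maven 3.6+.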
[IMAGE PLACEHOLDER: Terminal screenshot showing version commands]
Creating the Project Structure
We'll use Maven to create a multi-module project. This means one parent project with three child projects inside it.
Step 1: Create the Main Project
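Create the top-level directory and one folder per service (the names follow this lesson; adjust to taste):

```shell
# Parent project with three service modules inside it.
mkdir -p log-processing-system/log-producer
mkdir -p log-processing-system/log-consumer
mkdir -p log-processing-system/api-gateway
cd log-processing-system
ls
```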
Step 2: Set Up Maven Structure
Create the parent pom.xml file that defines our three services:
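A minimal parent POM might look like the following. The groupId and version are illustrative, and a real build would also import the Spring Boot BOM in dependencyManagement:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>log-processing-system</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>pom</packaging>

    <!-- The three services built in this lesson -->
    <modules>
        <module>log-producer</module>
        <module>log-consumer</module>
        <module>api-gateway</module>
    </modules>

    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
</project>
```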
Infrastructure Setup with Docker
Before building our services, let's set up the infrastructure they need to run.
Step 3: Create Docker Compose File
Create docker-compose.yml in your main project directory:
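A minimal sketch is below; image tags, ports, and credentials are illustrative, so pin versions you have actually tested. The monitoring services are added later in Step 18:

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
  postgres:
    image: postgres:15-alpine
    ports: ["5432:5432"]
    environment:
      POSTGRES_DB: logs
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
```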
Step 4: Start the Infrastructure
Run docker compose up -d from your project directory to start all the infrastructure services in the background, then check that they're running with docker compose ps.
[IMAGE PLACEHOLDER: Docker Desktop showing running containers]
Building the Log Producer Service
The log producer receives HTTP requests and publishes them to Kafka.
Step 5: Create Producer Structure
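Lay out the producer module using the standard Maven source tree:

```shell
# Standard Maven layout for the producer service.
mkdir -p log-producer/src/main/java/com/example/logprocessor/producer
mkdir -p log-producer/src/main/resources
find log-producer -type d
```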
Step 6: Producer Application Class
Create log-producer/src/main/java/com/example/logprocessor/producer/LogProducerApplication.java:
Step 7: Log Event Model
Create LogEvent.java to define what a log event looks like:
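A plain Java record keeps the model sketch dependency-free; in the Spring services the same shape is serialized to JSON (field names here are this lesson's assumptions):

```java
import java.time.Instant;
import java.util.UUID;

// The event the producer publishes and the consumer persists.
// organizationId doubles as the Kafka partition key, preserving
// per-organization ordering.
public record LogEvent(
        String id,              // unique per event, useful for deduplication
        String organizationId,  // partition key
        String level,           // e.g. INFO, WARN, ERROR
        String message,
        Instant timestamp) {

    // Convenience factory that fills in the generated fields.
    public static LogEvent of(String organizationId, String level, String message) {
        return new LogEvent(UUID.randomUUID().toString(), organizationId,
                level, message, Instant.now());
    }
}
```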
Building the Consumer Service
The consumer reads events from Kafka and saves them to PostgreSQL.
Step 8: Create Consumer Structure
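The consumer module mirrors the producer's Maven layout:

```shell
# Standard Maven layout for the consumer service.
mkdir -p log-consumer/src/main/java/com/example/logprocessor/consumer
mkdir -p log-consumer/src/main/resources
```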
Step 9: Consumer Service Logic
The most important part is the Kafka listener that processes incoming events:
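In the real service this logic lives inside a Spring @KafkaListener method; stripped of the Kafka wiring, the core decisions are validate, persist, and signal failure so the retry/dead-letter machinery can take over. The in-memory list below stands in for the PostgreSQL repository:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Core of the listener's processing logic, without the messaging wiring.
public class LogEventProcessor {
    private static final Set<String> LEVELS = Set.of("DEBUG", "INFO", "WARN", "ERROR");
    private final List<String> store = new ArrayList<>();

    // Returns true when stored; false signals "retry or dead-letter".
    public boolean process(String organizationId, String level, String message) {
        if (organizationId == null || organizationId.isBlank()) return false;
        if (!LEVELS.contains(level)) return false;   // malformed event
        store.add(organizationId + "|" + level + "|" + message);
        return true;
    }

    public int storedCount() { return store.size(); }
}
```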
Testing Your System
Let's make sure everything works together.
Step 10: Start Your Services
Open three terminal windows and run:
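One service per terminal, started from the parent directory (the commands are shown as comments so this snippet doesn't block; run each in its own window):

```shell
#   terminal 1: mvn -pl log-producer spring-boot:run
#   terminal 2: mvn -pl log-consumer spring-boot:run
#   terminal 3: mvn -pl api-gateway spring-boot:run
echo "run each service in its own terminal"
STARTED=yes
```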
Step 11: Send Test Events
Use curl to send a test log event:
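For example (the port, path, and JSON shape are this lesson's assumptions; match them to your producer's actual controller):

```shell
# Post one log event to the producer's REST endpoint.
BODY='{"organizationId":"org-1","level":"INFO","message":"hello from curl"}'
curl -s -X POST http://localhost:8080/api/logs \
  -H 'Content-Type: application/json' \
  -d "$BODY" || echo "request failed (is the log producer running?)"
CURL_DONE=yes
```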
You should see:
The producer receives the request and publishes to Kafka
The consumer picks up the message and saves to PostgreSQL
All services log what they're doing
[IMAGE PLACEHOLDER: Terminal windows showing service logs]
Load Testing Your System
Let's see how your system performs under load.
Step 12: Simple Load Test
Create a script that sends many requests quickly:
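A minimal version fires requests in the background and waits for them all (endpoint and payload shape are assumptions; spreading events across five organization IDs exercises multiple partitions):

```shell
# Fire N concurrent requests at the producer.
N=50
for i in $(seq 1 "$N"); do
  curl -s -o /dev/null -X POST http://localhost:8080/api/logs \
    -H 'Content-Type: application/json' \
    -d "{\"organizationId\":\"org-$((i % 5))\",\"level\":\"INFO\",\"message\":\"load test $i\"}" &
done
wait
echo "sent $N requests"
```

For serious load testing, reach for a purpose-built tool such as k6, JMeter, or hey instead of a curl loop.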
Run this and watch your system handle concurrent requests. You should see messages being processed in order within each organization ID.
Monitoring Your System
Step 13: Check Health Endpoints
Spring Boot provides built-in health checks:
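You can query each service's Actuator endpoint directly (the ports are this lesson's assumed defaults; adjust to your configuration):

```shell
# Poll the health endpoint of each service.
for port in 8080 8081 8082; do
  curl -s "http://localhost:$port/actuator/health" \
    || echo "no response on port $port (is that service running?)"
  echo
done
HEALTH_CHECKED=yes
```

A healthy service returns a JSON body like {"status":"UP"}; the deep checks also list each dependency's state.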
Step 14: View Metrics
[IMAGE PLACEHOLDER: Grafana dashboard showing system metrics]
Experiment with Failures
Let's see how resilient your system is.
Step 15: Test Database Failure
Stop PostgreSQL and send requests:
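For example (container and endpoint names are assumptions). With Postgres down, consumed messages should back up in Kafka - watch consumer lag climb - rather than being lost, and the consumer should catch up once the database returns:

```shell
# Simulate a database outage, exercise the pipeline, then recover.
docker compose stop postgres 2>/dev/null || echo "docker not available here"
curl -s -X POST http://localhost:8080/api/logs \
  -H 'Content-Type: application/json' \
  -d '{"organizationId":"org-1","level":"INFO","message":"during outage"}' \
  || echo "request failed"
docker compose start postgres 2>/dev/null || true
DB_FAILURE_TEST=done
```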
Step 16: Test Kafka Failure
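Stop the broker and watch the producer: sends should fail fast or buffer briefly, the circuit breaker should open, and the API should degrade gracefully instead of hanging (again, names and ports are assumptions):

```shell
# Simulate a broker outage, then restore it.
docker compose stop kafka 2>/dev/null || echo "docker not available here"
curl -s -X POST http://localhost:8080/api/logs \
  -H 'Content-Type: application/json' \
  -d '{"organizationId":"org-1","level":"INFO","message":"broker down"}' \
  || echo "request failed"
docker compose start kafka 2>/dev/null || true
KAFKA_FAILURE_TEST=done
```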
This demonstrates how distributed systems handle partial failures gracefully.
Deployment and Production Readiness
Step 17: Complete System Deployment
Package everything for production:
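A typical packaging pass builds all three jars and then bakes them into images (this assumes each module has its own Dockerfile):

```shell
# Build the service jars, then the container images.
mvn -q -DskipTests package || echo "mvn build failed (run from the parent directory)"
docker compose build 2>/dev/null || echo "docker build skipped"
PACKAGED=yes
```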
Step 18: Production Monitoring Setup
Add Prometheus and Grafana to your docker-compose.yml:
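Something like the following, merged into the services: section of your existing docker-compose.yml (image tags are illustrative, and prometheus.yml is a scrape configuration you supply pointing at each service's /actuator/prometheus endpoint):

```yaml
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:10.1.0
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
```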
Access Grafana at http://localhost:3000 (admin/admin) to see beautiful dashboards of your system metrics.
Architecture Insight: The key to building scalable distributed systems isn't avoiding failures—it's designing systems that fail gracefully and recover automatically. Every component we built today assumes other components will fail and provides mechanisms to handle those failures without human intervention.
This mindset shift from "preventing failures" to "embracing failures" is what separates systems that work in development from systems that work at global scale. You've just built a system that thinks like Netflix, Uber, and Amazon - congratulations!