Building Production-Ready Distributed Log Processing Infrastructure

Lesson 1 (2-3 hours)

Learning Objectives: By the end of this lesson, you'll build a complete distributed system that handles thousands of events per second, understand why companies like Netflix and Uber architect their systems this way, and gain hands-on experience with patterns used in real production environments.

What We're Building Today

Component Architecture

[IMAGE PLACEHOLDER: Full system architecture diagram - Client Applications → API Gateway (:8080, rate limiting, circuit breaker) → Log Producer (:8081, REST API, Kafka producer) → Apache Kafka (:9092, topic log-events, 6 partitions, ordered delivery) → Log Consumer (:8082, Kafka consumer, retry logic) → PostgreSQL (:5432, persistent store), with Redis (:6379) for caching and rate limiting, and Prometheus (:9090) + Grafana (:3000) for metrics and dashboards]

Performance characteristics:

  • Throughput: 1,000+ events/second

  • Latency: 99th percentile < 100ms

  • Availability: 99.9% with circuit breakers

  • Scalability: horizontal via partitions

  • Recovery: automatic with retry logic
  • Complete distributed log processing system with producer, consumer, and API gateway services

  • Production-grade infrastructure using Docker Compose with Kafka, Redis, PostgreSQL, and monitoring

  • Observability-first architecture with metrics, health checks, and distributed tracing

  • Scalable foundation that handles real-world failure scenarios and performance requirements

[IMAGE PLACEHOLDER: System Overview Diagram - showing the three main services and their connections]

Why This Matters: Beyond Hello World to Production Scale

Most programming tutorials show you how to build a simple app that works on your laptop. But what happens when thousands of people try to use your app at the same time? What if one part breaks? What if you need to process a million events per hour?

That's where distributed systems come in. Think of it like this: instead of having one super-powerful computer doing everything (which would be expensive and could fail), you have many smaller computers working together. Each one has a specific job, and they communicate with each other to get the work done.

The difference between a school project and a production system isn't just scale - it's the architectural decisions that anticipate failure, enable observability, and provide clear scaling paths. We're implementing patterns that work at 10 requests per second and scale to 100,000 requests per second with minimal changes.

This log processing system represents one of the most fundamental distributed patterns: event-driven architecture with persistent messaging. Every major tech company has multiple variants of this pattern powering their core business logic.

[IMAGE PLACEHOLDER: Before/After comparison showing monolithic vs distributed architecture]

System Design Deep Dive: Five Critical Distributed Patterns

Pattern 1: Event-Driven Decoupling with Apache Kafka

The Challenge: Traditional request-response systems create tight coupling and cascade failures. When service A calls service B directly, A's availability depends on B's availability. It's like a chain - if one link breaks, the whole thing fails.

The Solution: Event-driven architecture with persistent messaging. Services publish events without knowing who consumes them, and consumers process events at their own pace. Think of it like a postal system - you drop off mail, and it gets delivered even if the recipient isn't home right away.

Trade-offs:

  • Gains: High availability, natural backpressure handling, replay capability

  • Costs: Eventual consistency, increased complexity, message ordering challenges

Implementation: Kafka provides ordered, partitioned, replicated event logs. Our producer publishes log events to topics, consumers process them independently. If a consumer fails, messages wait in the partition until recovery.

Anti-pattern: Using Kafka as a database. Kafka excels at streaming data, not random access queries.
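To make the decoupling concrete, here is a toy, in-memory stand-in for a single Kafka partition (not Kafka itself): the producer appends to an ordered log, and the consumer reads from its own offset at its own pace, so a slow or crashed consumer never blocks the producer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory stand-in for one Kafka partition. Producers append,
// consumers track their own offset -- illustration only, not Kafka's API.
public class EventLogDemo {
    public static class PartitionLog {
        private final List<String> events = new ArrayList<>();

        // Producer side: append and return the new event's offset.
        public long publish(String event) {
            events.add(event);
            return events.size() - 1;
        }

        // Consumer side: read everything from the given offset onward.
        // A consumer that crashes simply resumes from its last offset.
        public List<String> poll(long fromOffset) {
            return new ArrayList<>(events.subList((int) fromOffset, events.size()));
        }
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        log.publish("user-login");
        log.publish("user-logout");

        long consumerOffset = 0;
        List<String> batch = log.poll(consumerOffset);
        consumerOffset += batch.size();

        // The producer keeps publishing even while the consumer is behind.
        log.publish("user-login");
        System.out.println(batch.size() + " events consumed, offset now " + consumerOffset);
    }
}
```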

[IMAGE PLACEHOLDER: Event-driven flow diagram showing producer → Kafka → consumer]

Pattern 2: Circuit Breaker for Cascade Failure Prevention

The Challenge: In distributed systems, one slow service can cascade failures across the entire system. Database connection pool exhaustion, memory leaks, or network partitions in one service shouldn't destroy the whole system.

The Solution: Circuit breakers monitor failure rates and automatically "open" to prevent calls to failing services, giving them time to recover. Think of it like an electrical circuit breaker in your house - when there's too much current, it shuts off to prevent a fire.

Trade-offs:

  • Gains: System stability, graceful degradation, automatic recovery

  • Costs: Added complexity, potential for false positives, cache invalidation challenges

Implementation: Resilience4j circuit breakers wrap external calls with configurable failure thresholds, timeout periods, and fallback responses.
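Our services rely on Resilience4j for this, but it helps to see the state machine (CLOSED → OPEN → HALF_OPEN) in miniature. The sketch below is a hand-rolled toy for illustration, not the Resilience4j API; it counts consecutive failures rather than a failure rate.

```java
// Toy circuit breaker: trips OPEN after N consecutive failures, waits a
// cool-down period, then allows one probe call in HALF_OPEN state.
public class ToyCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openDurationMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public ToyCircuitBreaker(int failureThreshold, long openDurationMillis) {
        this.failureThreshold = failureThreshold;
        this.openDurationMillis = openDurationMillis;
    }

    // Should the next call be allowed through?
    public synchronized boolean allowRequest(long now) {
        if (state == State.OPEN && now - openedAt >= openDurationMillis) {
            state = State.HALF_OPEN;   // cool-down elapsed: probe the dependency
        }
        return state != State.OPEN;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;          // dependency recovered
    }

    public synchronized void recordFailure(long now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;        // stop calling the failing service
            openedAt = now;
        }
    }

    public synchronized State getState() { return state; }

    public static void main(String[] args) {
        ToyCircuitBreaker cb = new ToyCircuitBreaker(3, 1000);
        cb.recordFailure(0);
        cb.recordFailure(0);
        cb.recordFailure(0);
        System.out.println("after 3 failures: " + cb.getState());
    }
}
```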

Scale Reality: Netflix's Hystrix pioneered this pattern and prevented massive outages by isolating failures; Hystrix is now in maintenance mode, with Resilience4j as its recommended successor. Without circuit breakers, a single database issue could take down their entire streaming platform.

[IMAGE PLACEHOLDER: Circuit breaker state diagram showing closed/open/half-open states]

Pattern 3: Distributed Caching with Redis

The Challenge: Database queries become bottlenecks as traffic scales. Even optimized queries can't handle thousands of concurrent requests for the same data. Imagine if every time someone wanted to know the weather, they had to call the weather station directly.

The Solution: Multi-layer caching with cache-aside pattern and TTL-based invalidation. Store frequently accessed data in fast memory so you don't have to go to the slow database every time.

Trade-offs:

  • Gains: Sub-millisecond response times, reduced database load, natural load balancing

  • Costs: Cache invalidation complexity, memory overhead, eventual consistency

Implementation: Redis provides distributed caching with pub/sub capabilities. Application code implements cache-aside pattern with fallback to database.
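As an illustration of the cache-aside pattern, the sketch below uses plain in-memory maps as stand-ins for Redis and PostgreSQL (TTL-based expiry and invalidation are omitted for brevity): on a miss we load from the "database", populate the cache, and return; on a hit we skip the database entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside sketch. A HashMap stands in for Redis and a Function stands
// in for a database query; databaseHits counts the misses that reached it.
public class CacheAsideDemo {
    private final Map<String, String> cache = new HashMap<>();  // stand-in for Redis
    private final Function<String, String> database;            // stand-in for a DB query
    private int databaseHits = 0;

    public CacheAsideDemo(Function<String, String> database) {
        this.database = database;
    }

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                    // cache hit: fast in-memory path
        }
        databaseHits++;
        String value = database.apply(key);   // cache miss: go to the database
        cache.put(key, value);                // populate for the next caller
        return value;
    }

    public int getDatabaseHits() { return databaseHits; }

    public static void main(String[] args) {
        CacheAsideDemo store = new CacheAsideDemo(key -> "row-for-" + key);
        store.get("org-1");   // miss -> database
        store.get("org-1");   // hit  -> cache
        System.out.println("database hits: " + store.getDatabaseHits());
    }
}
```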

Bottleneck Alert: Cache hit ratio below 85% indicates either wrong data being cached or TTL too aggressive. Monitor cache performance metrics closely.

Pattern 4: Health Checks and Observability

The Challenge: In distributed systems, partial failures are normal. Services might be running but unable to process requests due to database connections, memory pressure, or dependency failures. It's like a car that starts but can't drive because it's out of gas.

The Solution: Multi-level health checks with shallow and deep verification, combined with metrics-driven alerting.

Trade-offs:

  • Gains: Early failure detection, automatic recovery, performance visibility

  • Costs: Monitoring overhead, false positive alerts, increased operational complexity

Implementation: Spring Boot Actuator provides health endpoints, Micrometer exports metrics to Prometheus, Grafana visualizes system state.
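A minimal Actuator configuration for each service might look like the fragment below. It assumes the micrometer-registry-prometheus dependency is on the classpath (required for the /actuator/prometheus endpoint); the exact set of exposed endpoints is a choice, not a requirement.

```yaml
# application.yml (per service): expose health, metrics, and the
# Prometheus scrape endpoint via Spring Boot Actuator.
management:
  endpoints:
    web:
      exposure:
        include: health, metrics, prometheus
  endpoint:
    health:
      show-details: always   # include component-level (DB, Kafka) detail
```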

[IMAGE PLACEHOLDER: Monitoring dashboard screenshot showing key metrics]

Pattern 5: API Gateway Pattern for Service Orchestration

The Challenge: Client applications shouldn't need to know about internal service topology. Cross-cutting concerns like authentication, rate limiting, and request routing need centralized implementation. Imagine if you had to know the phone number of every department in a company instead of just calling the main number.

The Solution: API Gateway acts as single entry point, handling routing, authentication, rate limiting, and request aggregation.

Trade-offs:

  • Gains: Simplified client logic, centralized cross-cutting concerns, easier service evolution

  • Costs: Single point of failure, potential bottleneck, increased latency

Implementation: Spring Cloud Gateway with reactive routing, Redis-backed rate limiting, and JWT authentication.
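A minimal route definition for the gateway might look like the fragment below. The RequestRateLimiter filter shown uses Spring Cloud Gateway's Redis-backed rate limiter; the rates are illustrative, and a KeyResolver bean is also required to decide what to rate-limit by (for example, client IP or organization ID).

```yaml
# application.yml for the API Gateway: route /api/logs/** to the producer
# service and apply Redis-backed rate limiting (illustrative values).
spring:
  cloud:
    gateway:
      routes:
        - id: log-producer
          uri: http://localhost:8081
          predicates:
            - Path=/api/logs/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 100   # steady-state req/s
                redis-rate-limiter.burstCapacity: 200   # short burst allowance
```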

Implementation Walkthrough: Building the Foundation

Step 1: Multi-Service Architecture Design

Our system implements three core services:

  • Log Producer: REST API that accepts log events and publishes to Kafka

  • Log Consumer: Kafka consumer that processes events and stores to PostgreSQL

  • API Gateway: Routes requests, handles authentication, provides unified API

Architectural Decision: Separate producer and consumer services enable independent scaling. During traffic spikes, we can scale producers horizontally without affecting consumer processing rates.

[IMAGE PLACEHOLDER: Service architecture diagram with three boxes and connections]

Step 2: Kafka Topic Design and Partitioning Strategy

yaml
# Kafka Topic Configuration
log-events:
  partitions: 6
  replication-factor: 1    # single broker in development; use 3 in production
  cleanup.policy: delete   # time-based retention; 'compact' would keep only the
                           # latest event per key and silently drop older logs
  retention.ms: 604800000  # 7 days

Why 6 partitions? Enables parallel processing with up to 6 consumer instances. Partition count should match expected peak consumer count for optimal throughput.

Message Key Strategy: Using organizationId as partition key ensures logs from the same organization always go to the same partition, maintaining order while enabling parallelism.
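The sketch below illustrates the property that matters: a stable key always maps to the same partition. Note that Kafka's default partitioner actually hashes the serialized key with murmur2; String.hashCode() here is just a simplified stand-in.

```java
// Simplified illustration of key-based partition assignment. Same key ->
// same partition, so per-organization ordering is preserved while different
// organizations spread across partitions for parallelism.
public class PartitionDemo {
    public static int partitionFor(String key, int partitionCount) {
        return Math.abs(key.hashCode() % partitionCount);
    }

    public static void main(String[] args) {
        int p1 = partitionFor("org-42", 6);
        int p2 = partitionFor("org-42", 6);
        System.out.println("org-42 -> partition " + p1 + " (stable: " + (p1 == p2) + ")");
    }
}
```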

Step 3: Error Handling and Retry Logic

java
@Retryable(value = {DataAccessException.class}, maxAttempts = 3,
           backoff = @Backoff(delay = 1000, multiplier = 2))
public void processLogEvent(LogEvent event) {
    // Processing logic retried automatically with exponential backoff
}

Exponential Backoff: Prevents overwhelming failed services during recovery. Initial retry after 1 second, then 2 seconds, then 4 seconds.

Dead Letter Queue: After 3 retries, messages go to error topic for manual investigation. This prevents one bad message from blocking the entire pipeline.
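The backoff schedule above is simply delay(n) = initialDelay × 2^(n-1), which is what the @Backoff(delay = 1000, multiplier = 2) configuration produces. A small sketch of the arithmetic:

```java
// Exponential backoff schedule: attempt 1 waits 1s, attempt 2 waits 2s,
// attempt 3 waits 4s (initial delay doubled on each retry).
public class BackoffDemo {
    public static long delayMillis(long initialDelayMillis, int attempt) {
        return initialDelayMillis * (1L << (attempt - 1));   // initial * 2^(attempt-1)
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + delayMillis(1000, attempt) + " ms");
        }
    }
}
```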

Step 4: Distributed Tracing Implementation

Every request gets a trace ID that follows the request across all services. This enables debugging distributed transactions and identifying bottlenecks.

Performance Impact: Tracing adds ~2-3ms latency per request but provides invaluable debugging capability. In production, sample 10% of requests to balance observability with performance.
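One common way to sample consistently is head-based sampling: derive the keep/drop decision deterministically from the trace ID, so every service in the call chain makes the same choice. Real tracers (e.g. OpenTelemetry, Brave) implement this for you; the sketch below only illustrates the idea.

```java
import java.util.UUID;

// Head-based sampling sketch: hash the trace ID into a 0-99 bucket and keep
// the trace if the bucket falls under the sample percentage. Deterministic,
// so all services agree on the same decision for the same trace.
public class SamplingDemo {
    public static boolean shouldSample(String traceId, int samplePercent) {
        return Math.abs(traceId.hashCode() % 100) < samplePercent;
    }

    public static void main(String[] args) {
        String traceId = UUID.randomUUID().toString();
        System.out.println(traceId + " sampled at 10%: " + shouldSample(traceId, 10));
    }
}
```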

[IMAGE PLACEHOLDER: Trace visualization showing request flow across services]

Production Considerations: Real-World Operational Challenges

Performance Under Load

Throughput Targets: System handles 1,000 events/second with 99th percentile latency under 100ms. Kafka can scale to 100,000+ events/second by adding partitions and consumer instances.

Memory Management: JVM heap sized at 2GB with G1GC for low-latency garbage collection. Monitor for memory leaks in long-running consumer processes.

Connection Pooling: Database connection pool sized to 20 connections per service instance. Too many connections overwhelm PostgreSQL; too few create bottlenecks.
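With Spring Boot's default pool (HikariCP), the sizing above corresponds to properties like the fragment below. The values mirror the text; minimum-idle and connection-timeout are illustrative and should be tuned per workload.

```yaml
# application.yml: HikariCP pool sizing per service instance.
spring:
  datasource:
    hikari:
      maximum-pool-size: 20     # matches the 20-connection budget above
      minimum-idle: 5           # keep a few warm connections ready
      connection-timeout: 30000 # ms to wait for a connection before failing
```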

Monitoring and Alerting Strategy

Key Metrics:

  • Consumer lag: Alert if >1000 messages behind

  • Error rate: Alert if >5% of requests fail

  • Response latency: Alert if 95th percentile >200ms

  • Database connections: Alert if >80% pool utilization
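In our stack these thresholds live in Prometheus alerting rules rather than application code, but expressed as plain predicates they amount to:

```java
// The alert thresholds above as plain predicates (illustration only;
// production alerting belongs in Prometheus rules, not application code).
public class AlertRulesDemo {
    public static boolean consumerLagAlert(long lagMessages)      { return lagMessages > 1000; }
    public static boolean errorRateAlert(double errorRatePercent) { return errorRatePercent > 5.0; }
    public static boolean latencyAlert(double p95Millis)          { return p95Millis > 200.0; }
    public static boolean poolAlert(int usedConnections, int maxConnections) {
        return usedConnections > 0.8 * maxConnections;   // >80% pool utilization
    }

    public static void main(String[] args) {
        System.out.println("lag 1500 alerts: " + consumerLagAlert(1500));
        System.out.println("pool 17/20 alerts: " + poolAlert(17, 20));
    }
}
```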

Failure Scenarios: Test network partitions, database failures, and OOM conditions. Our circuit breakers should prevent cascade failures.

Scale Connection: How FAANG Companies Use These Patterns

Netflix: Uses identical patterns for their recommendation engine. User interactions go through Kafka, multiple microservices process events, and results cache in distributed Redis clusters.

Uber: Their dispatch system follows the same architecture. Ride requests publish to Kafka, driver matching services consume events, and Redis caches driver locations for sub-second matching.

Amazon: Warehouse management systems use event-driven architecture with circuit breakers to handle peak holiday traffic without degrading performance.

The patterns we're implementing today are battle-tested at the highest scales in the industry.

Next Steps: Tomorrow's Advanced Concepts

Tomorrow we'll implement the log generator that produces realistic log events at configurable rates, adding backpressure handling and batch processing optimizations that enable 10x throughput improvements.


Hands-On Implementation: Building Your System

Now let's put theory into practice. We'll build this entire system step by step, so you can see how each piece works together.

Prerequisites Setup

Before we start coding, make sure you have these tools installed:

Required Software:

  • Java 17 or higher

  • Maven 3.6 or higher

  • Docker Desktop

  • Git

  • Your favorite IDE (IntelliJ IDEA, VS Code, or Eclipse)

Quick Check:
Open your terminal and run these commands to verify everything is installed:

bash
java -version     # Should show Java 17+
mvn -version      # Should show Maven 3.6+
docker --version  # Should show Docker version
git --version     # Should show Git version

[IMAGE PLACEHOLDER: Terminal screenshot showing version commands]

Creating the Project Structure

We'll use Maven to create a multi-module project. This means one parent project with three child projects inside it.

Step 1: Create the Main Project

bash
mkdir distributed-log-processor
cd distributed-log-processor

Step 2: Set Up Maven Structure
Create the parent pom.xml file that defines our three services:

xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>distributed-log-processor</artifactId>
    <version>1.0.0</version>
    <packaging>pom</packaging>

    <modules>
        <module>log-producer</module>
        <module>log-consumer</module>
        <module>api-gateway</module>
    </modules>

    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <spring-boot.version>3.2.0</spring-boot.version>
    </properties>
</project>
Infrastructure Setup with Docker

Before building our services, let's set up the infrastructure they need to run.

Step 3: Create Docker Compose File
Create docker-compose.yml in your main project directory:

yaml
version: '3.8'
services:
  # Zookeeper (required by this Kafka image)
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  # Apache Kafka for event streaming
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  # Redis for caching and rate limiting
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  # PostgreSQL for data persistence
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: logprocessor
      POSTGRES_USER: loguser
      POSTGRES_PASSWORD: logpassword
    ports:
      - "5432:5432"

Step 4: Start the Infrastructure

bash
docker compose up -d

This starts all our infrastructure services in the background. You can check they're running with:

bash
docker compose ps

[IMAGE PLACEHOLDER: Docker Desktop showing running containers]

Building the Log Producer Service

The log producer receives HTTP requests and publishes them to Kafka.

Step 5: Create Producer Structure

bash
mkdir -p log-producer/src/main/java/com/example/logprocessor/producer
mkdir -p log-producer/src/main/resources

Step 6: Producer Application Class
Create log-producer/src/main/java/com/example/logprocessor/producer/LogProducerApplication.java:

java
package com.example.logprocessor.producer;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class LogProducerApplication {
    public static void main(String[] args) {
        SpringApplication.run(LogProducerApplication.class, args);
    }
}

Step 7: Log Event Model
Create LogEvent.java to define what a log event looks like:

java
package com.example.logprocessor.producer.model;

import java.time.LocalDateTime;

public class LogEvent {
    private String id;
    private String organizationId;
    private String level;        // INFO, WARN, ERROR
    private String message;
    private String source;       // Which application sent this
    private LocalDateTime timestamp;
    
    // Constructor that auto-generates ID and timestamp
    public LogEvent() {
        this.timestamp = LocalDateTime.now();
        this.id = java.util.UUID.randomUUID().toString();
    }
    
    // Getters and setters for all fields...
}

Building the Consumer Service

The consumer reads events from Kafka and saves them to PostgreSQL.

Step 8: Create Consumer Structure

bash
mkdir -p log-consumer/src/main/java/com/example/logprocessor/consumer
mkdir -p log-consumer/src/main/resources

Step 9: Consumer Service Logic
The most important part is the Kafka listener that processes incoming events:

java
package com.example.logprocessor.consumer;

import java.time.LocalDateTime;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

@Service
public class LogEventConsumer {

    private static final Logger logger = LoggerFactory.getLogger(LogEventConsumer.class);

    // JavaTimeModule lets Jackson handle the LocalDateTime timestamp field
    private final ObjectMapper objectMapper = new ObjectMapper()
            .registerModule(new JavaTimeModule());

    @Autowired
    private LogEventRepository repository;

    @KafkaListener(topics = "log-events", groupId = "log-consumer-group")
    public void consumeLogEvent(String message) {
        try {
            // Parse the JSON message into a LogEvent object
            LogEvent event = objectMapper.readValue(message, LogEvent.class);

            // Save to database
            event.setProcessedAt(LocalDateTime.now());
            repository.save(event);

            logger.info("Processed log event: {}", event.getId());

        } catch (Exception e) {
            logger.error("Failed to process message: {}", message, e);
            // In production, this would go to a dead letter queue
        }
    }
}

Testing Your System

Let's make sure everything works together.

Step 10: Start Your Services
Open three terminal windows and run:

bash
# Terminal 1: Start producer
cd log-producer
mvn spring-boot:run

# Terminal 2: Start consumer  
cd log-consumer
mvn spring-boot:run

# Terminal 3: Start gateway
cd api-gateway
mvn spring-boot:run

Step 11: Send Test Events
Use curl to send a test log event:

bash
curl -X POST http://localhost:8080/api/logs \
  -H "Content-Type: application/json" \
  -d '{
    "organizationId": "my-company",
    "level": "INFO",
    "message": "User logged in successfully",
    "source": "auth-service"
  }'

You should see:

  1. The producer receives the request and publishes to Kafka

  2. The consumer picks up the message and saves to PostgreSQL

  3. All services log what they're doing

[IMAGE PLACEHOLDER: Terminal windows showing service logs]

Load Testing Your System

Let's see how your system performs under load.

Step 12: Simple Load Test
Create a script that sends many requests quickly:

bash
#!/bin/bash
echo "Sending 100 log events..."

for i in {1..100}; do
  curl -s -X POST http://localhost:8080/api/logs \
    -H "Content-Type: application/json" \
    -d "{
      \"organizationId\": \"org-$((i % 5))\",
      \"level\": \"INFO\",
      \"message\": \"Load test message $i\",
      \"source\": \"load-test\"
    }" &

  # Send 10 at a time to avoid overwhelming the system
  if (( i % 10 == 0 )); then
    wait
    echo "Sent $i requests..."
  fi
done

wait
echo "Load test complete!"

Run this and watch your system handle concurrent requests. You should see messages being processed in order within each organization ID.

Monitoring Your System

Step 13: Check Health Endpoints
Spring Boot provides built-in health checks:

bash
# Check if services are healthy
curl http://localhost:8080/actuator/health  # Gateway
curl http://localhost:8081/actuator/health  # Producer  
curl http://localhost:8082/actuator/health  # Consumer

Step 14: View Metrics

bash
# See detailed metrics
curl http://localhost:8081/actuator/metrics
curl http://localhost:8081/actuator/prometheus

[IMAGE PLACEHOLDER: Grafana dashboard showing system metrics]

Experiment with Failures

Let's see how resilient your system is.

Step 15: Test Database Failure
Stop PostgreSQL and send requests:

bash
docker compose stop postgres
# Send some requests - they should be retried
docker compose start postgres
# Previous requests should now be processed

Step 16: Test Kafka Failure

bash
docker compose stop kafka
# Requests should queue up and circuit breaker should activate
docker compose start kafka
# Messages should be processed once Kafka is back

This demonstrates how distributed systems handle partial failures gracefully.

Deployment and Production Readiness

Step 17: Complete System Deployment
Package everything for production:

bash
# Build all services
mvn clean package

# Create Docker images for each service
docker build -t log-producer ./log-producer
docker build -t log-consumer ./log-consumer
docker build -t api-gateway ./api-gateway

Step 18: Production Monitoring Setup
Add Prometheus and Grafana to your docker-compose.yml:

yaml
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

Access Grafana at http://localhost:3000 (admin/admin) to see beautiful dashboards of your system metrics.


Architecture Insight: The key to building scalable distributed systems isn't avoiding failures—it's designing systems that fail gracefully and recover automatically. Every component we built today assumes other components will fail and provides mechanisms to handle those failures without human intervention.

This mindset shift from "preventing failures" to "embracing failures" is what separates systems that work in development from systems that work at global scale. You've just built a system that thinks like Netflix, Uber, and Amazon - congratulations!