Event-Driven Architecture Fundamentals

Lesson 1 (2-3 hours)

What We're Building Today

State Machine

Figure: Event state machine. Flow: Start → Event Created → Event Published → Event Stored → Handler Processing → Side Effects Applied → UI Updated → Event Complete; error transitions lead to an Error State, with retry paths back into the normal flow. Event categories: user_action, content_interaction, system_event, error conditions. Performance targets: event creation < 1ms, publish latency < 5ms, handler processing < 100ms, UI update < 200ms, throughput 1000+ events/sec, error rate < 0.1%.

Flowchart

Figure: Event-driven flow. User Action → Create Event (+ timestamp, + event ID) → Publish to Event Bus → Store Event (immutable log) → async routing to subscribers, processed in parallel: Feed Handler (update timeline), Notification (send alerts), Analytics (track metrics), WebSocket broadcast (real-time UI update). Stats: latency < 5ms, throughput 1000+/sec. Key benefits: non-blocking, independent handlers, real-time updates.

Component Architecture

Figure: StreamSocial component architecture. Client layer (web, mobile, API) → API Gateway (FastAPI, WebSocket) → Event Bus (publish/subscribe), backed by an Event Store (immutable, replay-capable log). Subscribers: Feed Handler (timeline updates, content ranking) writing user feeds/timeline data, Notification Handler (push alerts), Analytics Handler (metrics collection).

Today we transform StreamSocial from a single-point-of-failure system into a resilient, distributed event platform. We'll deploy a 3-broker Kafka cluster that can handle millions of social media events without breaking a sweat.

Today's Agenda:

  • Deploy multi-broker Kafka cluster with Docker Compose

  • Implement broker discovery and failover mechanisms

  • Create monitoring dashboard for cluster health

  • Test fault tolerance with controlled broker failures
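The first agenda item can be sketched in Compose form. This is a minimal illustration only, not the lesson's official file: the Bitnami Kafka image, service names, ports, and the placeholder cluster ID are all assumptions, and only one of the three brokers is shown.

```yaml
# Hypothetical docker-compose.yml fragment -- one of three KRaft brokers.
services:
  kafka-1:
    image: bitnami/kafka:latest
    ports:
      - "9092:9092"
    environment:
      - KAFKA_CFG_NODE_ID=1
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
      # All three brokers must share one KAFKA_KRAFT_CLUSTER_ID value.
  # kafka-2 and kafka-3: same pattern with node IDs 2 and 3
```

In a real deployment each broker also needs advertised listeners and persistent volumes; we'll wire those up during the hands-on section.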

Core Concepts: Distributed Consensus in Action

Broker Architecture Deep Dive

Think of Kafka brokers like Netflix's content delivery network - instead of one massive server, you have multiple smaller servers working together. Each broker is an independent Kafka server that stores and serves data.

Why 3 Brokers Matter:

  • Fault Tolerance: Lose one broker? System keeps running

  • Load Distribution: 50M requests spread across multiple machines

  • Replication: Your precious user posts aren't lost forever
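The three benefits above can be seen in one place with a toy replica-placement sketch (this mimics the idea, not Kafka's actual assignment algorithm): with replication factor 3 on 3 brokers, every partition has a copy on every broker, so losing any single machine never loses the only copy.

```python
# Toy sketch: spreading partition replicas across a 3-broker cluster.
BROKERS = ["broker-0", "broker-1", "broker-2"]

def assign_replicas(partition: int, replication_factor: int = 3) -> list[str]:
    """Place the leader round-robin by partition, followers on the next brokers."""
    return [BROKERS[(partition + i) % len(BROKERS)] for i in range(replication_factor)]

# With RF == broker count, each partition lives on all three brokers,
# so no single broker failure can destroy the last copy.
for p in range(6):
    assert len(set(assign_replicas(p))) == 3
```

Note how leadership also rotates: partition 0's leader is broker-0, partition 1's is broker-1, and so on, which is what spreads the write load.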

Zookeeper vs KRaft: The Leadership Challenge

Imagine coordinating a group project without a clear leader - chaos, right? That's why Kafka needs distributed consensus.

Zookeeper (Traditional):

  • External coordination service

  • Handles leader election and metadata

  • Like having a dedicated project manager

KRaft (Modern):

  • Self-managing Kafka cluster

  • No external dependencies

  • Like the team electing their own leader

StreamSocial uses KRaft because it's simpler and reduces operational overhead.

Context in Ultra-Scalable System Design

Real-World Production Patterns

Netflix Example: Their recommendation engine processes billions of viewing events across hundreds of Kafka brokers. When a broker fails during peak hours, traffic seamlessly routes to healthy brokers.

StreamSocial Architecture Position:

```
User Actions → Load Balancer → Kafka Cluster → Event Processors → Database
```

Our 3-broker setup handles:

  • User posts, likes, comments (high write volume)

  • Real-time notifications (low latency requirement)

  • Analytics events (high throughput tolerance)

Scaling Considerations

Current Target: 1M daily active users
Growth Path: 100M users (requiring 50+ brokers)

Starting with 3 brokers teaches core concepts without complexity overload.

Control Plane:

  • KRaft Controller: Manages cluster metadata and leader elections

  • Broker Registration: Dynamic discovery and health monitoring

  • Topic Management: Automated partition assignment

Data Plane:

  • Producer APIs: Accept events from StreamSocial frontend

  • Consumer APIs: Serve events to recommendation engine

  • Replication: Ensure data durability across brokers

  1. Write Path: User posts → Producer → Leader Broker → Follower Replicas

  2. Read Path: Consumer → Any Broker (leader or follower)

  3. Failure Recovery: Failed broker → Leader election → Traffic redirection
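Step 3 above can be sketched in a few lines (a simplification of Kafka's actual controller logic): when the leader dies, a surviving replica from the in-sync set (ISR) is promoted, and out-of-sync replicas are never chosen.

```python
# Sketch of failure recovery: promote an in-sync replica when the leader dies.
def recover_leader(leader: str, isr: list[str], failed: str) -> str:
    if failed != leader:
        return leader  # a follower died; leadership is unchanged
    survivors = [b for b in isr if b != failed]
    if not survivors:
        raise RuntimeError("no in-sync replica left; partition goes offline")
    return survivors[0]  # promote the first surviving in-sync replica

assert recover_leader("broker-0", ["broker-0", "broker-1"], "broker-0") == "broker-1"
```

The "no in-sync replica left" branch is exactly the availability-vs-consistency trade-off discussed under CAP below: Kafka would rather take the partition offline than elect a stale leader.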

Broker States:

  • STARTING: Loading local data and registering with cluster

  • RUNNING: Actively serving produce/consume requests

  • RECOVERING: Catching up after network partition

  • SHUTDOWN: Graceful termination with data consistency
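The lifecycle above can be written as a small transition table (a simplified teaching model, not Kafka's actual internal state enum):

```python
# Simplified broker lifecycle as a transition table.
TRANSITIONS = {
    "STARTING": {"RUNNING"},
    "RUNNING": {"RECOVERING", "SHUTDOWN"},
    "RECOVERING": {"RUNNING", "SHUTDOWN"},  # catch up, then rejoin or stop
    "SHUTDOWN": set(),                      # terminal state
}

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS.get(src, set())

assert can_transition("RUNNING", "RECOVERING")
assert not can_transition("SHUTDOWN", "RUNNING")
```

Encoding the table explicitly makes illegal moves (like serving traffic straight from STARTING) impossible rather than merely discouraged — the same pattern you'll use in the monitoring dashboard.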

Production System Integration

StreamSocial Event Flow

```python
# Simplified event routing logic
class EventRouter:
    def route_event(self, event_type: str, user_id: int) -> str:
        if event_type == "user_action":
            return f"broker-{user_id % 3}"  # spread users across all 3 brokers
        elif event_type == "content_interaction":
            return "broker-1"               # dedicated broker for hot data
        return "broker-0"                   # default for all other event types
```

Monitoring & Observability

Real production clusters need visibility. Our setup includes:

  • JMX metrics exposure

  • Docker health checks

  • Custom Python monitoring dashboard

  • Automated failover testing
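The dashboard's core health rule can be captured in a toy check (the metric name mirrors Kafka's UnderReplicatedPartitions JMX gauge; how you scrape it is up to your monitoring stack):

```python
# Toy health check: a cluster is healthy only when no broker reports
# under-replicated partitions (every replica is caught up with its leader).
def cluster_healthy(metrics: dict[str, int]) -> bool:
    """metrics maps broker id -> its UnderReplicatedPartitions count."""
    return all(count == 0 for count in metrics.values())

assert cluster_healthy({"broker-0": 0, "broker-1": 0, "broker-2": 0})
assert not cluster_healthy({"broker-0": 0, "broker-1": 3, "broker-2": 0})
```

A non-zero value is your earliest warning sign: data is still being served, but the safety margin of replication has shrunk.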

Key Insights for System Design

Distributed Systems Reality

CAP Theorem in Practice: Kafka chooses Consistency + Partition tolerance over Availability. When network splits occur, affected partitions become unavailable rather than serving stale data.

Replication Factor = 3: Not arbitrary. With three copies of every partition and min.insync.replicas=2, the cluster tolerates one broker failure without losing acknowledged writes. The same majority rule sizes the KRaft controller quorum: max_failures = (replication_factor - 1) / 2
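The arithmetic is quick to verify (integer division, since you can't lose half a broker):

```python
# Majority-quorum failure tolerance: how many nodes can fail while a
# strict majority of the original set survives.
def max_failures(replication_factor: int) -> int:
    return (replication_factor - 1) // 2

assert max_failures(3) == 1  # our 3-broker cluster survives one loss
assert max_failures(5) == 2  # a 5-node quorum survives two
```

Notice that 4 nodes tolerate no more failures than 3 — which is why quorum sizes are almost always odd.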

Operational Complexity Trade-offs

Single Broker: Simple to deploy, impossible to scale
3-Broker Cluster: Sweet spot for learning and small production
100+ Broker Cluster: Requires sophisticated tooling and dedicated ops team

Real-World Application

When You'll Use This

Startup Phase: 3-broker cluster handles 10M events/day
Growth Phase: Add brokers horizontally as traffic increases
Enterprise Phase: Multi-region clusters with hundreds of brokers

Common Production Patterns

  • Hot-standby regions: Disaster recovery setup

  • Cross-datacenter replication: Geographic distribution

  • Blue-green deployments: Zero-downtime upgrades

Success Criteria

By lesson end, you'll have:

  • ✅ Running 3-broker Kafka cluster

  • ✅ Fault tolerance demonstration (kill broker, system survives)

  • ✅ Monitoring dashboard showing cluster health

  • ✅ Event production/consumption verification

Assignment: Stress Test Your Cluster

Challenge: Simulate Black Friday traffic spike

  1. Configure producers to send 10,000 events/second

  2. Kill one broker during peak load

  3. Measure recovery time and data loss

  4. Document performance characteristics

Success Metrics:

  • Zero data loss during broker failure

  • Recovery time < 30 seconds

  • Throughput degradation < 50%
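Checking the third metric is simple arithmetic once you've measured baseline and during-failure throughput (the numbers below are illustrative, not expected results):

```python
# Throughput degradation as a fraction of the healthy baseline.
def degradation(baseline_eps: float, during_failure_eps: float) -> float:
    return 1 - during_failure_eps / baseline_eps

# Example: 10,000 events/sec baseline dropping to 7,500 during failure
# is a 25% dip -- well within the < 50% target.
assert degradation(10_000, 7_500) == 0.25
```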

Solution Hints

Monitoring Strategy: Track under-replicated-partitions metric
Recovery Optimization: Keep unclean.leader.election.enable=false so only in-sync replicas can become leader (no data loss, at the cost of brief unavailability)
Load Testing: Use built-in kafka-producer-perf-test.sh tool

Next Steps Preview

Tomorrow we'll optimize this cluster for StreamSocial's specific traffic patterns by designing topic partitioning strategies. You'll learn why user-actions needs 1000 partitions while notifications only need 10.


Remember: Every Netflix stream, every Instagram like, every Slack message flows through systems exactly like what you're building today. Master these fundamentals, and you're ready for any scale.