Scaling for Millions: How ProSiebenSat.1 Revolutionized Its Streaming Infrastructure

In the hyper-competitive world of digital media, a streaming platform’s success is measured in milliseconds. For Joyn, the streaming service operated by German broadcaster ProSiebenSat.1, the mandate was clear: deliver high-availability, scalable, and cost-effective entertainment to millions of users across the DACH (Germany, Austria, Switzerland) region. However, as Daniele Frasca, a lead engineer on the project, recently revealed, the path to achieving this was not paved with "easy-peasy" management solutions; it ran through a radical, 18-month architectural overhaul that challenged the status quo of traditional media engineering.

The Breaking Point: When Architecture Fails Business

Eighteen months ago, the Joyn platform was struggling under its own weight. Frasca’s team, a lean group of just two developers with limited prior AWS experience, inherited a system that was effectively "on fire."

The original architecture relied on a rigid, monolithic-style worker pattern: a Kafka-subscribing worker performing transformations and writing to a single-node database. There was no caching, no redundancy, and a complete lack of standard protocols across six core services. During peak traffic—often tied to major events like the Bundesliga—the database would buckle, leading to widespread outages and inconsistent data across the platform. Users would navigate from a video page to a details page, only to find the content suddenly unavailable.

For the engineering team, the realization was stark: technical debt was no longer just a matter of messy code; it was a systemic failure of infrastructure to scale with the business.

Chronology of a Transformation

The team’s journey from a brittle, monolithic setup to a resilient, multi-region serverless environment followed a deliberate, iterative path:

Phase 1: Standardization and Boundaries (Months 1-6): The team moved to clearly define service boundaries. By implementing a "Hub and Spoke" pattern, they replaced direct, messy inter-service communication with a centralized event-driven architecture using Amazon EventBridge.
Phase 2: The Serverless Pivot (Months 6-12): Recognizing that the team’s bandwidth was better spent on business logic rather than managing servers, they migrated to AWS managed services. This phase focused on replacing custom API layers with scalable abstractions.
Phase 3: Multi-Region Resilience (Months 12-18): Once the core services were stable, the team focused on disaster recovery and geo-distribution, implementing cell-based architectures and automated traffic shifting to survive regional outages.

Data Consistency: The Hub and Spoke Revolution

One of the most persistent issues Joyn faced was data inconsistency. Because various services handled Kafka messages differently—applying inconsistent validations and transformations—the "source of truth" had effectively dissolved.

To fix this, the team deployed a "Bus Mesh" using EventBridge. With EventBridge as the primary interface, the team gained a standardized "fan-out" mechanism. Internal microservices no longer communicated directly with one another, an anti-pattern that had previously exposed their internal state. Instead, every service interfaced with its own local EventBridge event bus.
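
The fan-out idea can be illustrated with a small in-memory model. This is only a sketch: the EventBus class, the spoke names, and the simplified pattern matching (top-level fields only) are assumptions made for illustration, not the EventBridge API or the team's actual rules.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Pattern = Dict[str, List[str]]  # field name -> list of accepted values


class EventBus:
    """In-memory stand-in for an event bus (hypothetical, not the AWS API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._rules: List[Tuple[Pattern, Callable[[Event], object]]] = []

    def add_rule(self, pattern: Pattern, target: Callable[[Event], object]) -> None:
        """Register a target that receives every event matching the pattern."""
        self._rules.append((pattern, target))

    def put_event(self, event: Event) -> int:
        """Deliver the event to all matching targets; return how many matched."""
        matched = 0
        for pattern, target in self._rules:
            if all(event.get(field) in allowed for field, allowed in pattern.items()):
                target(event)
                matched += 1
        return matched


# Hub and spoke: producers publish to the hub only; the hub forwards to each
# service's local bus, and each service defines its own local rules.
hub = EventBus("hub")
search_bus = EventBus("search")      # hypothetical spoke
playback_bus = EventBus("playback")  # hypothetical spoke

hub.add_rule({"source": ["catalog"]}, search_bus.put_event)
hub.add_rule({"source": ["catalog"]}, playback_bus.put_event)
```

In this model no service ever calls another service directly. A producer only knows the hub, and a consumer only knows its own bus, which is the property the text attributes to the new design.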

To handle the large payloads common in media streaming, which can exceed EventBridge’s 256 KB event size limit, the team implemented the Claim Check pattern. They used EventBridge Pipes to intercept, validate, and store large payloads in Amazon S3, passing only the S3 key (the "claim check") through the event bus. This kept events lightweight while still giving consumers access to the full, consistent data without overloading the event infrastructure.
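
A minimal sketch of the claim check idea follows. It assumes an in-memory object store standing in for Amazon S3 and a hypothetical event envelope with "claim_check" and "detail" fields; the real pipeline runs in EventBridge Pipes, which this code does not model.

```python
import json
import uuid

MAX_EVENT_BYTES = 256 * 1024  # event size limit cited in the text


class InMemoryObjectStore:
    """Stand-in for Amazon S3 so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]


def to_event(detail: dict, store: InMemoryObjectStore,
             max_bytes: int = MAX_EVENT_BYTES) -> dict:
    """Wrap a payload in an event, offloading it to the store if it is too big."""
    body = json.dumps(detail).encode("utf-8")
    if len(body) <= max_bytes:
        return {"claim_check": None, "detail": detail}
    key = f"payloads/{uuid.uuid4()}.json"  # hypothetical key layout
    store.put(key, body)
    # Only the key travels on the bus; the payload stays in the store.
    return {"claim_check": key, "detail": None}


def resolve(event: dict, store: InMemoryObjectStore) -> dict:
    """Consumer side: redeem the claim check to get the full payload back."""
    if event["claim_check"] is not None:
        return json.loads(store.get(event["claim_check"]))
    return event["detail"]
```

A consumer calls resolve on every event and never needs to know whether the payload travelled inline or through the store.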

Scalability and the "Cell-Based" Strategy

Resiliency at scale is not just about choosing the right service; it is about reducing the "blast radius" of any potential failure. The team adopted a cell-based architecture, where traffic is segmented by country, user type (paid vs. free), and platform.

By splitting the workload across these segments, the team effectively multiplied their capacity. Instead of a single Lambda function handling all requests, the platform now utilizes dozens of granular instances. This allows for localized deployments—testing a new feature for "iOS free users in Austria," for instance—before a wider rollout.
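
To make the segmentation concrete, here is a sketch of how a request could be mapped to a cell and how a rollout could target a single cell. The three dimensions come from the text; the platform list, the cell naming scheme, and the version labels are assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import product

COUNTRIES = ("DE", "AT", "CH")
USER_TYPES = ("free", "paid")
PLATFORMS = ("ios", "android", "web")  # hypothetical platform list


@dataclass(frozen=True)
class Cell:
    country: str
    user_type: str
    platform: str

    @property
    def name(self) -> str:
        return f"{self.country}-{self.user_type}-{self.platform}".lower()


def cell_for(country: str, user_type: str, platform: str) -> Cell:
    """Map request attributes to a cell, rejecting unknown segments."""
    cell = Cell(country.upper(), user_type.lower(), platform.lower())
    if (cell.country not in COUNTRIES
            or cell.user_type not in USER_TYPES
            or cell.platform not in PLATFORMS):
        raise ValueError(f"no cell for {cell}")
    return cell


# Every combination of the three dimensions is its own cell.
ALL_CELLS = [Cell(*combo) for combo in product(COUNTRIES, USER_TYPES, PLATFORMS)]


def version_for(cell: Cell, canary_cells: set,
                stable: str = "v1", candidate: str = "v2") -> str:
    """Serve the candidate build only to cells selected for the rollout."""
    return candidate if cell in canary_cells else stable
```

With these assumed dimensions there are 3 x 2 x 3 = 18 cells, and a failure or a bad release in one of them leaves the other 17 untouched.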

The Caching Layer

To protect the database from the "thundering herd" of millions of concurrent requests, the team implemented a three-layer caching strategy:

Edge Caching (CloudFront): Handles the most repetitive, global requests.
In-Memory Caching: Stores hot keys closer to the compute layer.
Dedicated Cache Service (Momento): Acts as the final buffer before hitting the database.

According to Frasca, this multi-layered approach ensures that during prime time, less than 10% of requests ever hit the primary database, allowing for a much leaner, more cost-effective database cluster.
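
The layered lookup can be sketched as a generic read-through function. The DictCache class is a stand-in for any cache layer and the hit ratios in the comment are made-up numbers; in the real system the edge layer (CloudFront) sits in front of the compute layer rather than inside one function.

```python
from typing import Callable, List, Optional


class DictCache:
    """Toy cache layer; values are assumed to never be None."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def read_through(key: str, layers: List[DictCache],
                 load_from_db: Callable[[str], str]) -> str:
    """Check each layer in order; fall back to the database only on a full miss."""
    for index, layer in enumerate(layers):
        value = layer.get(key)
        if value is not None:
            # Backfill the faster layers that missed.
            for upper in layers[:index]:
                upper.set(key, value)
            return value
    value = load_from_db(key)
    for layer in layers:
        layer.set(key, value)
    return value


def db_fraction(hit_ratios: List[float]) -> float:
    """Share of requests that miss every layer, given per-layer hit ratios."""
    fraction = 1.0
    for ratio in hit_ratios:
        fraction *= 1.0 - ratio
    return fraction


# Hypothetical hit ratios of 60%, 50% and 60% would leave roughly 8% of
# requests for the database, in line with the "less than 10%" figure.
share = db_fraction([0.6, 0.5, 0.6])
```

The point of the arithmetic is that misses compound: no single layer needs a very high hit ratio for the database to see only a small share of the traffic.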

Official Perspectives: The Cost of Availability

Management often demands that services be "highly available, scalable, and cheap"—a trifecta that is fundamentally impossible to achieve simultaneously. Frasca’s approach was to bring the harsh reality of these trade-offs to stakeholders.
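
One way to put numbers on such a trade-off is to compare the extra yearly infrastructure cost with the revenue an outage would be expected to destroy. The sketch below shows the arithmetic only; the function names and every figure in it are hypothetical and not taken from Joyn.

```python
def expected_outage_loss(outages_per_year: int, hours_per_outage: int,
                         revenue_per_hour: int) -> int:
    """Revenue expected to be lost to outages over one year."""
    return outages_per_year * hours_per_outage * revenue_per_hour


def investment_justified(extra_infra_cost_per_year: int,
                         loss_avoided_per_year: int) -> bool:
    """The extra spend pays off if it costs less than the loss it prevents."""
    return extra_infra_cost_per_year < loss_avoided_per_year


# Made-up example: 4 prime-time outages a year, 2 hours each, 50,000 in
# revenue per hour, against 150,000 a year of additional infrastructure.
loss = expected_outage_loss(4, 2, 50_000)
worth_it = investment_justified(150_000, loss)
```

This deliberately ignores reputational damage and subscriber churn, which would only push the expected loss higher.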

"If the infrastructure costs less than the lost revenue of a prime-time outage, the investment is justified," Frasca argued. By quantifying the risk, the team moved the conversation from "how can we make this cheap" to "how much are we willing to pay to avoid a reputation-damaging outage?"

The Multi-Region Economics

The team’s shift to multi-region architecture was driven by the necessity of survival. While multi-region deployments are inherently more expensive, the team utilized "affordability tactics" to mitigate the overhead:

Load Balancer Optimization: Switching from API Gateway to Application Load Balancers for high-volume routes resulted in a 90% cost saving.
Dynamic Compute Shifting: During off-peak hours, the system scales Fargate tasks to zero and shifts traffic entirely to Lambda, minimizing idle resource costs. During spikes, the system uses Lambda as an "overflow valve" to handle surges while Fargate tasks slowly scale up.
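
The "overflow valve" behavior can be sketched as a simple split of incoming load. The function name, the per-task capacity, and the request rates are assumptions for illustration; the article does not describe how the team actually computes or applies the split.

```python
from typing import Tuple


def split_traffic(requests_per_sec: int, fargate_tasks: int,
                  per_task_capacity: int) -> Tuple[int, int]:
    """Return (requests for Fargate, requests overflowing to Lambda)."""
    fargate_capacity = fargate_tasks * per_task_capacity
    to_fargate = min(requests_per_sec, fargate_capacity)
    to_lambda = requests_per_sec - to_fargate
    return to_fargate, to_lambda


# Off-peak: Fargate is scaled to zero, so Lambda takes everything.
off_peak = split_traffic(120, fargate_tasks=0, per_task_capacity=500)

# Sudden spike: the running tasks are saturated and Lambda absorbs the rest.
spike = split_traffic(3000, fargate_tasks=4, per_task_capacity=500)

# After scale-up: enough tasks are running, so nothing overflows.
scaled = split_traffic(3000, fargate_tasks=8, per_task_capacity=500)
```

Under these assumed numbers the three cases split as (0, 120), (2000, 1000) and (3000, 0): Lambda carries the baseline at night, the surplus during a spike, and nothing once Fargate has caught up.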

Implications for Modern Media Engineering

The success of the Joyn overhaul provides a blueprint for other media companies struggling with legacy debt. The primary implication is that delegation to the cloud provider is a feature, not a compromise.

By treating AWS managed services as a black box that handles availability and resiliency, the team was able to focus on the "code that matters." The transition from managing subnets, VPCs, and database clusters to using serverless APIs (like DynamoDB and Lambda) allowed two developers to do the work that previously required a much larger, and often less effective, team.

Lessons for the Future

The team also relies on "Chaos Engineering", using AWS Fault Injection Simulator (since renamed AWS Fault Injection Service) to deliberately break its own systems, so that it is far less likely to be caught off guard. By forcing failures during development, the team created a system that "recovers gracefully" rather than crashing spectacularly.

As the industry continues to move toward more globalized streaming models, the Joyn case study suggests that the biggest challenge is not the technology itself, but the mentality. Moving away from the "cluster-first" mindset and embracing a fully event-driven, serverless approach is not just a technical upgrade; it is an organizational necessity. The end result is a platform that scales automatically, survives regional failures, and, perhaps most importantly, allows its engineers to sleep through the night during the biggest broadcast events of the year.