AWS Infrastructure That Scales With the Game

Protagona partnered with a live tabletop gaming events platform to diagnose EKS scaling failures, harden its security posture, and deliver a fully observable, rightsized AWS infrastructure capable of absorbing sharp player concurrency spikes without timeouts.

Industry

Startups & Software

Teams & Services

Cloud Architecture, DevOps, Observability, Infrastructure Optimization

Tech & Tools

Amazon EKS, Amazon EC2 (r7a.2xlarge), AWS Fargate, Amazon Aurora Serverless, Amazon ElastiCache (Redis), Amazon VPC, Kubernetes Cluster Autoscaler, Celery, Django, New Relic, Vercel

Key Data Points

EKS cluster scaled reliably from 10 nodes at idle to 80 nodes under peak load, with autoscaling thresholds tuned to match real event-driven traffic patterns.

AWS infrastructure spend reduced by more than 50% through data-driven rightsizing, replacing intuition-based provisioning with measured CPU and memory baselines.

Aurora Serverless and ElastiCache held below 25% CPU and memory utilization throughout load testing, confirming the data tier was not a bottleneck.

The Vision

Built around the future of live tabletop gaming events, the platform brings players together in real time for high-energy, synchronous experiences. That product model creates a distinctive infrastructure challenge: traffic is not steady-state. It spikes sharply when events go live and returns to near-idle between them. As the team prepared for a major player-facing event, they recognized the platform needed to handle that surge reliably, without timeouts or degraded performance. They engaged Protagona to conduct a deep architectural review of their Amazon EKS environment, identify the root causes of scaling failures observed during internal load tests, and implement the fixes needed to give their engineers confidence before the event.

The Goal

The engagement had two concrete objectives: diagnose why the EKS cluster was hitting scaling limits and causing application timeouts during load tests, then implement targeted fixes to ensure the platform could absorb large, sudden spikes in concurrent players. A secondary goal was to reduce unnecessary AWS spend by rightsizing infrastructure to match actual workload requirements rather than worst-case assumptions.

The Challenge

The platform experiences extreme traffic variance by design. During live events, player concurrency spikes sharply; outside of them, the environment runs at a fraction of that load. The existing EKS configuration was not tuned for this pattern. During load tests, the cluster consumed all available nodes almost instantly, even after the node group was expanded from 50 to 80 nodes. That pointed to autoscaling thresholds that triggered too late and pod resource requests that did not reflect actual consumption. Without reliable profiling data, however, the team could not confirm whether the root cause was node capacity, pod-level resource definitions, or application-level inefficiency.
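
A rough back-of-envelope sketch shows why inaccurate pod resource requests can exhaust a node group regardless of real load: the Kubernetes scheduler places pods according to what they request, not what they actually use. Every figure below (pod count, request sizes, and roughly 7900 millicores of allocatable CPU on an 8 vCPU node) is an illustrative assumption, not a measurement from this engagement, and the helper `nodes_needed` is hypothetical. Memory is ignored for simplicity.

```python
import math


def nodes_needed(pod_count: int, cpu_request_m: int, node_allocatable_m: int) -> int:
    """Nodes required when pods are packed purely by CPU request (millicores)."""
    pods_per_node = node_allocatable_m // cpu_request_m
    if pods_per_node == 0:
        raise ValueError("a single pod's request exceeds the node's allocatable CPU")
    return math.ceil(pod_count / pods_per_node)


# Hypothetical workload: 600 pods on nodes with ~7900m allocatable CPU.
over_requested = nodes_needed(600, 1000, 7900)  # 7 pods fit per node -> 86 nodes
rightsized = nodes_needed(600, 400, 7900)       # 19 pods fit per node -> 32 nodes

print(over_requested, rightsized)
```

Under these assumed numbers, the over-requested configuration needs more nodes than an 80-node group can provide, while the same pod count with measured requests fits comfortably.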

Compounding the challenge was a security posture that needed attention before any production event: nodes were running in public subnets, there were no VPC flow logs, and development and production workloads shared a single cluster. Observability was also incomplete, with monitoring tools partially configured and no alerting in place for critical cluster events such as pod evictions or crash loops. Protagona needed to triage all of these issues, distinguish high-severity items from lower-priority improvements, and sequence the work to deliver the most important fixes before the event.

The Solution

Protagona began with a structured architectural review, mapping the full EKS environment against the symptoms observed during load testing. The team confirmed that Aurora Serverless and ElastiCache were not contributing to failures, since both stayed below 25% CPU and memory utilization throughout load testing. That narrowed the focus to the compute layer, specifically autoscaler configuration and pod resource management. A rightsizing methodology was then designed: autoscaling thresholds were set above expected peaks, load tests were run under controlled conditions, and actual CPU and memory consumption was recorded across pods and node groups. This gave the engineering team a data-driven baseline for setting accurate resource requests and limits, replacing the previous approach of provisioning by intuition.
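
One way to turn recorded load-test samples into resource requests and limits can be sketched as follows. This is a minimal illustration, not the method Protagona used: the 90th-percentile request, the 20% headroom on the observed peak, the sample values, and the names `ResourceSpec`, `percentile`, and `recommend` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSpec:
    request: int  # value to set as the pod's resource request
    limit: int    # value to set as the pod's resource limit


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile using integer arithmetic only."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def recommend(samples: list[int], request_pct: int = 90, headroom_pct: int = 20) -> ResourceSpec:
    """Request = chosen percentile of usage; limit = observed peak plus headroom."""
    request = percentile(samples, request_pct)
    limit = (max(samples) * (100 + headroom_pct) + 99) // 100  # rounded up
    return ResourceSpec(request=request, limit=limit)


# Hypothetical per-pod CPU usage in millicores, sampled during one load test.
cpu_samples_m = [180, 210, 250, 240, 300, 320, 290, 410, 380, 350]
spec = recommend(cpu_samples_m)
print(spec)  # ResourceSpec(request=380, limit=492)
```

The same function can be applied to memory samples. The percentile and headroom are tuning choices: a higher percentile trades packing density for fewer throttled or evicted pods during spikes.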

Protagona also recommended and implemented moving the nodes from public to private subnets with NAT-based egress, eliminating a significant security exposure ahead of the public event. VPC flow logging was enabled for traffic visibility, and a roadmap was established for separating development and production into distinct clusters. On the observability side, Protagona accelerated onboarding of AWS infrastructure and application metrics into New Relic, configured dashboards to surface unhealthy cluster events in real time, and connected alerting to Slack so the engineering team would have immediate visibility into pod evictions, crash loops, and resource saturation during the event. Together, accurate autoscaling, rightsized compute, tightened security, and full observability produced a platform capable of absorbing live event traffic without manual intervention.

Reliable Scale for Live Event Traffic

Autoscaling thresholds tuned to measured baselines allowed the EKS cluster to scale from 10 to 80 nodes on demand, eliminating timeouts and node exhaustion that had appeared during pre-event load tests.

Over 50% Reduction in AWS Spend

Replacing intuition-based provisioning with data-driven resource requests and limits removed significant over-provisioned capacity, cutting AWS infrastructure spend by more than 50% without sacrificing headroom for peak traffic.

Real-Time Cluster Health Visibility

New Relic APM and infrastructure monitoring were fully onboarded, with dashboards surfacing pod evictions, crash loops, and resource saturation in real time, and Slack-connected alerting giving the engineering team immediate awareness during live events.