The edge gateway, the thundering herd, and the day we ran out of ports

// What is the edge gateway? #

Whether you interact with DigitalOcean using its CLI, public API or web UI, your requests flow through the edge gateway. The edge gateway looks at every inbound HTTP request and routes it to one of hundreds of backend microservices. It also performs authentication, rate limiting, and observability injection. Load balancing is constant-time O(1) weighted random, which lets traffic shift easily between regions during maintenance or unexpected issues (it uses Vose's aliasing algorithm, but that's a topic for another post).

It wasn't always this way. The web UI and API were once served entirely by a couple of monolithic Ruby applications. This approach had real advantages, including fewer moving parts and a shared way to enforce policies like authentication and rate limiting. But as the company grew, it became difficult to reliably scale and deploy changes across dozens of engineering teams. The company needed to split the monolithic apps into smaller, more manageable pieces with clear ownership. The edge gateway was built to handle the platform-wide concerns while freeing teams to own their services.

The edge gateway

Every HTTP request, from the CLI, web UI, public API, or AI agents, flows through the edge gateway before reaching a backend. The gateway handles platform-wide concerns (authentication, rate limiting, observability) and routes each request to one of hundreds of microservices. It runs on Kubernetes clusters in multiple regions worldwide.

// The scale we're working with #

The edge gateway is made up of several components deployed on bare-metal Kubernetes clusters across continents. The real architectural challenge is designing for the chaotic days: sudden traffic spikes, datacenter link failures, or runaway customer scripts hammering the API thousands of times a second.

DigitalOcean serves 650,000+ customers and billions of HTTP requests daily. One of the edge gateway's biggest consumers is DigitalOcean itself. Tens of thousands of servers, spread across 14 datacenters, rely on it to communicate with internal services. That means we control some of the clients and can enforce best practices. The other clients, the ones hitting our public API from anywhere on the internet, we don't control at all, which calls for a different set of defenses.

// Defending against thundering herds #

One afternoon, 35,000 of our servers were disconnected from the network. No data was lost, and the servers stayed healthy. They just couldn't phone home for a bit. Then connectivity came back, and all of them tried to reconnect and re-authenticate at the same instant to upload telemetry collected while they were offline. We DDoSed ourselves.

That's a network partition, and the recovery can be worse than the initial outage. In distributed systems, partitions happen for all kinds of reasons: a backhoe hits a buried fiber cable; a datacenter router overheats; a configuration update temporarily breaks connectivity. Across thousands of servers and dozens of network paths, this isn't a rare event. It's a Tuesday.

One fix is to make sure devices don't all reconnect at the exact same moment. For clients under our control, we spread recovery out over a longer period of time using exponential backoff and jitter. Backoff is what it sounds like: if a client's reconnection attempt is unsuccessful, it backs off for a second before retrying. If it fails again, it waits for two seconds. Each failure doubles the wait, usually capped at a maximum of 60 seconds. Jitter adds a small, random amount of time to the backoff, so clients are less likely to retry at the exact same moment.

Server load during reconnection

When a network partition heals, every disconnected client tries to reconnect. Without coordination they all retry at the same instant, which can push the server past its capacity and trigger a fresh outage. Exponential backoff with jitter spreads the same reconnections across minutes, keeping load comfortably under the line.

All at once (thundering herd)

Exponential backoff + jitter

Another important mitigation is server-side caching, because it reduces the impact of sudden bursts of traffic from both clients we control and those we don't. For example, without caching, if a client sends 1,000 requests to the edge gateway in rapid succession, that would result in 1,000 database lookups to map the client's API key to its corresponding user. With caching, the first request stores the lookup result in memory, and the other 999 read it directly. These cached results have a very short time-to-live (TTL), so a revoked API key can't be used from a stale cache entry.

But what happens if those 1,000 requests hit the gateway at the exact same millisecond, before the first request has finished fetching the data from the database? They would all experience a cache miss and slam the database simultaneously (a phenomenon known as cache stampede). Request coalescing solves this problem by pausing all but the first request. Once complete, the response is cached and returned to all the waiting requests simultaneously.

Caching helps when requests arrive one after another, but here six requests (standing in for a thousand) hit a cold cache at the exact same instant. With coalescing, only the first goes upstream; the rest wait, then all share its single response. Turn it off and every request stampedes the origin at once.

Coalescing: on

// When clients go rogue #

Imagine a script designed to use the DigitalOcean API to restart a Droplet once every hour, but it has a bug and actually calls the API once every millisecond. It's an honest mistake, but the resources (CPU, memory, network bandwidth) used to serve those requests could instead be used to serve legitimate requests from other customers. At a large enough scale, especially if done with malicious intent, the excessive load will eventually cause everything to grind to a halt (denial of service attack).

To keep one buggy script from degrading service for everyone else, the edge gateway gives engineering teams a highly configurable set of rate limiting rules. Service owners can configure limits across multiple dimensions, like timing (short bursts or sustained periods), identity (API key, user ID), network origin, and behavioral heuristics like bot detection.

The multi-dimensional part is important. A single bad user is a different problem from a dozen legitimate users sharing one IP address because they're behind the same office network. A rate limit that looks at IP alone would punish the office; a rate limit that only looks at user ID would miss a coordinated attack from many accounts.

Once a client hits its rate limit, additional requests are blocked until a cooldown period has elapsed. The gateway returns an HTTP 429 Too Many Requests status, with a Retry-After header, so well-behaved clients know when to try again.

// Talking to dependencies #

Rate limiting helps protect the gateway from clients. But the gateway also makes requests to other internal services, and those can get overwhelmed too.

If you're already stumbling with an overflowing armful of groceries, handing you another bag isn't going to help. Similarly, when an application is struggling under heavy load or recovering from a crash, piling on more requests is the worst thing you can do. This is why the edge gateway uses the circuit breaker pattern when it sends requests to its dependencies.

Named after the electrical safety mechanism, a software circuit breaker works on the same principle. Under normal conditions, the circuit is closed and traffic flows freely. But if the gateway notices that requests to a dependency (like our authentication service) are consistently failing, it assumes the downstream server is overloaded. To protect it, the breaker trips open.

Instead of routing traffic to a service that's already struggling, which only wastes resources and forces the client to wait for a long timeout, the open breaker "fails fast," instantly returning an error to the client. This gives the overloaded service room to recover instead of getting buried. Once a cooldown period passes, the breaker enters a half-open state where it cautiously lets a small number of requests through to test whether the service has recovered. If those test requests still fail, the breaker returns to the open state and resets the timer. If they succeed, the breaker closes, and normal operation resumes.

Backend: unhealthy

// The day we ran out of ports #

Every request the edge gateway sends to a backend service travels over a TCP connection. Establishing a fresh connection isn't free: there's a TCP handshake, usually a TLS handshake, and sometimes DNS resolution. At the volumes we're dealing with, doing that every time adds real latency and burns unnecessary CPU. So, like most HTTP client libraries do by default, we keep connections open and reuse them.

The catch is that idle open connections aren't free either. Each open connection holds kernel memory and a file descriptor on both ends. When you multiply by hundreds of backend services and many gateway instances, the numbers add up quickly. Connections also go stale when you're not looking (load balancers in the middle silently drop them, NAT mappings expire, and you don't find out until you try to use one). So the goal is to keep connections warm enough to absorb a burst, but not so warm that you're hoarding dead weight.

One way to manage this tradeoff is with an idle connection timeout. Leave a connection unused long enough and it gets closed. And here is where I learned something the hard way.

We'd set that timeout aggressively on the theory that closing unused connections quickly would keep our pool tidy. It seemed mostly harmless. For a long time, it was. Then one afternoon, production started throwing a steady stream of connection failures to a particular Redis datastore. The errors were EADDRNOTAVAIL (cannot assign requested address). The service was healthy, Redis was healthy, the network between them was healthy. Our dashboards were green, and the thing was still broken (which, if you've ever been on call, you know is the worst possible kind of broken). We weren't out of memory or CPU. We were out of ports.

When a host closes a TCP connection, the operating system holds onto the local port for about 60 seconds in a state called TIME_WAIT. This is a feature of TCP that prevents stray packets from a closed connection from being misinterpreted as part of a new one. For that minute, the port cannot be used to open a new connection. Linux gives you about 28,000 ports per destination IP and port combo. This sounds like a fortune right up until you have bursty waves of requests hitting the same destination and an aggressive idle timeout diligently shredding connections in the lulls between them. We were opening and closing connections fast enough to shove tens of thousands of ports into TIME_WAIT at once. New connection attempts had nowhere to go.

The primary fix was to raise the idle connection timeout and move the limit closer to the max-open limit, so the pool held onto connections between bursts instead of churning them. Less churn, fewer ports cycling through TIME_WAIT, and the pool stopped running dry. We also added a quick liveness check (Redis PING) when borrowing a connection from the pool, so the occasional stale connection got caught and swapped out cleanly instead of blowing up a real customer's request. Lastly, we added better visibility into pool churn, so the next version of this surprises us a little less.

None of these fixes are clever on their own. The lesson was that "close idle connections promptly" sounds like obviously good hygiene, but it's actually a knob with a sharp edge if you turn it the wrong way under the wrong traffic shape.

// Outrunning latency #

Even under perfect conditions, cross-region network requests are bound by the laws of physics. A round trip between distant global datacenters can easily take 150 milliseconds. At our latency targets, that's more than the entire budget.

To beat this latency barrier, we replicate critical data across multiple geographic regions, routing incoming requests to the deployments that are physically closest to the client. Usually, this works perfectly, but networks are unpredictable. Sometimes the network path to the closest deployment will be congested or the service might be undergoing maintenance with degraded performance.

When an internal request fails due to a temporary error like a timeout, the gateway retries it a few times before giving up. But waiting for a request to completely time out before trying again destroys tail latency (p99). So for latency-sensitive paths like authentication and authorization, we use a more aggressive strategy called request hedging.

Instead of waiting around to see if the primary request fails, hedging preemptively fires duplicate requests to alternative regions, before an error even occurs. The gateway keeps whichever response comes back first and immediately cancels the others.

Hedging is essentially a load multiplier: a slow request spawns two or three more requests. That's fine in normal operation, but during an incident, when a service is already struggling, hedging actively makes things worse by piling on more load. This is why hedging has to be paired with the techniques from earlier sections. Circuit breakers detect that a destination is unhealthy and stop sending hedged requests to it. Exponential backoff prevents the retries from overwhelming other services. When failures are consistent, the fastest path to recovery is reducing load on the overloaded service.

A request goes to the primary region first. If no response arrives by the hedge threshold, the same request is fired to backup regions in parallel without waiting for the first to fail. The first response wins; the rest are cancelled. A low threshold trims tail latency but adds duplicate load; a high one saves load but worsens the slow tail. Drag the threshold and send a few requests.

Winning response

Cancelled (wasted)

Hedge threshold

Hedge threshold: 300 ms

// Seeing what's happening #

Because the edge gateway handles all inbound HTTP traffic, it's a high-leverage place to capture the telemetry needed for end-to-end debugging.

Before forwarding an incoming request, the gateway injects a unique request ID and initiates or propagates OpenTelemetry spans for distributed tracing. This means the request's entire journey through downstream services can be reconstructed during a post-mortem. As the response travels back to the client, the gateway records real-time metrics using Prometheus. Combined, this data provides instant visibility into the Four Golden Signals (latency, traffic, errors, and saturation), and it allows our automated alerting systems to detect SLO burn rate anomalies before users notice a degradation.

But, real-time metrics only tell you that something is wrong; request and response logs tell you why. Logs (with sensitive fields scrubbed) are published asynchronously to a Kafka stream and ingested into our long-term data warehouse for retrospective analysis. The async part is critical: at high volumes of traffic, logging cannot block the request path or introduce latency. Kafka cleanly absorbs the write throughput, and the slower analytical pipeline runs at its own pace.

// Closing thoughts #

It's important to notice how all of these techniques interact. Hedging cuts latency but multiplies load, so it leans on circuit breakers to know when to stop. Caching shields the database until a stampede turns every cache miss into a synchronized blast of requests, so it leans on coalescing. Closing idle connections promptly sounds like a good plan until your ephemeral ports disappear.

Sometimes the setting you tuned to prevent one failure was secretly engineering another.