For most of its life, our data plane was a well-loved antique. A set of regional gateways forwarded writes to a primary, which replicated asynchronously and waved its hands at the three nines we had promised. It worked, for a while. Then traffic grew ten-fold, and we started to notice the cracks in ways that are only visible at three in the morning.
This is the story of the year we spent replacing it. Not the heroic version — the slow, careful version, where the interesting moments are the ones where we chose to delete code instead of write it.
The constraints
Before we drew a single box on a whiteboard, we wrote down what we were not willing to negotiate. Three things made the list:
- P99 write latency must stay under 8 ms, measured at the SDK, globally.
- The plane must degrade gracefully under partial region failure — no hard dependency on any single region being reachable.
- The total code we are willing to own on the hot path must fit in a single engineer’s head. If it doesn’t, the design is wrong.
That third one is the one that kept us honest. It’s easy to add a caching layer, a queue, a supervisor, a coordinator — each one individually defensible. The sum is a system nobody can reason about at 3 AM, and 3 AM is the only time the system really matters.
The job of an infrastructure team is not to add layers. It is to remove them, one at a time, until the remaining ones are load-bearing.
— internal design doc, January 2025
Design choices
We ended up with three decisions that drove everything else:
1. One ring to coordinate them all
Instead of per-region primaries replicating to each other, we put every region into a single Raft group of seven nodes: one per region plus a tie-breaker. Reads stay local. Writes go to the leader, wherever it is. This sounds slow, and on paper it is. In practice, with the right batching and pipelining, it is faster than what we had.
// hot-path write. seven lines. the whole protocol.
fn write(req: WriteReq) -> Result<Ack> {
    let entry = Entry::from(req).with_term(state.term);
    raft.append(&entry)?;
    raft.wait_quorum(Duration::from_millis(8))?;
    state.apply(entry)
}
There is a lot of machinery behind each of those lines, of course. But the hot path is short enough to fit on a screenshot, which was the point.
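One piece of that machinery is the batching mentioned above. As a rough sketch of the idea, with an invented `Batcher` type and strings standing in for real log entries: concurrent writes are coalesced into one appended batch, so a single quorum round-trip is amortized across many requests.

```rust
// Hypothetical sketch of write batching in front of the Raft log.
// Strings stand in for real entries; the real code would also flush
// on a timer, not only on a size threshold.

struct Batcher {
    pending: Vec<String>,
    max_batch: usize,
}

impl Batcher {
    fn new(max_batch: usize) -> Self {
        Batcher { pending: Vec::new(), max_batch }
    }

    /// Queue a write; return a full batch once the threshold is hit.
    /// One quorum round then covers every entry in the batch.
    fn push(&mut self, entry: String) -> Option<Vec<String>> {
        self.pending.push(entry);
        if self.pending.len() >= self.max_batch {
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

The effect is that the per-write cost of a global leader is dominated by queueing, not by the cross-region round-trip.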
2. The client is part of the system
Our old gateways existed partly so clients could stay dumb. The new plane pushes routing back to the SDK. The SDK keeps a small snapshot of the ring’s topology, picks its region-local follower for reads, and streams writes to whatever it believes is the current leader. When it’s wrong, the follower redirects with a cheap error.
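The routing loop can be sketched in a few lines. The types here (`Topology`, `WriteOutcome`, `Sdk`) are invented for illustration, not the real SDK's API; the shape is what matters: read locally, write to the believed leader, and treat a redirect as a topology refresh plus one retry.

```rust
// Hypothetical sketch of the SDK's routing: reads go to the
// region-local follower; writes go to the presumed leader, and a
// cheap NotLeader redirect refreshes the cached topology snapshot.

struct Topology {
    leader: String,         // node the SDK believes is the leader
    local_follower: String, // follower in the client's own region
}

enum WriteOutcome {
    Ack,
    // The contacted node tells us who the leader actually is now.
    NotLeader { actual_leader: String },
}

struct Sdk {
    topo: Topology,
}

impl Sdk {
    /// Reads never leave the client's region.
    fn read_target(&self) -> &str {
        &self.topo.local_follower
    }

    /// Send to the believed leader; on a redirect, update the
    /// snapshot and retry once before giving up.
    fn write<F>(&mut self, mut send: F) -> Result<(), String>
    where
        F: FnMut(&str) -> WriteOutcome,
    {
        match send(&self.topo.leader) {
            WriteOutcome::Ack => Ok(()),
            WriteOutcome::NotLeader { actual_leader } => {
                self.topo.leader = actual_leader;
                match send(&self.topo.leader) {
                    WriteOutcome::Ack => Ok(()),
                    WriteOutcome::NotLeader { .. } => {
                        Err("leader moved twice; re-resolve topology".to_string())
                    }
                }
            }
        }
    }
}
```

The nice property is that a stale snapshot costs exactly one extra round-trip, and the retry itself keeps the snapshot fresh.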
3. One binary, one config file
The old plane had four services. The new plane has one, and most of what used to be separate services is now a thread inside that binary. This is unfashionable — the industry has spent a decade telling us to decompose. We did the opposite, for this one thing, and the operational story got dramatically better.
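To make the shape concrete, here is a minimal sketch of the one-binary idea, with invented role names standing in for what used to be separate services: each former service becomes a named thread sharing one loaded config in memory, instead of a process talking over the network.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical sketch: former services become threads in one binary.
// Role names are invented for illustration.

struct Config {
    region: String,
}

/// Spawn one thread per former service and collect their startup
/// reports. In the real binary these would be long-running loops.
fn spawn_roles(cfg: Arc<Config>, roles: &[&str]) -> Vec<String> {
    let handles: Vec<_> = roles
        .iter()
        .map(|role| {
            let cfg = Arc::clone(&cfg);
            let role = role.to_string();
            thread::spawn(move || format!("{} up in {}", role, cfg.region))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Coordination between the pieces becomes a channel or a mutex rather than an RPC, which is most of why the operational story improved.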
The rollout
We rolled v3 out the way you should roll anything out that matters: slowly, in shadow, and with a visible rollback button. The first two months we ran v3 in parallel and threw away its answers. The next two months we started trusting a percent of the reads. Only in month seven did we let it serve writes, and only for a single internal tenant with whom we had a lot of coffee.
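The "trust a percent of the reads" knob is worth making concrete. One common way to implement it, sketched here with invented names (the post does not describe our actual mechanism), is deterministic bucketing: hash the request id so a given request always lands in the same bucket, and the rollout percentage stays stable as the knob moves.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical sketch of percentage-based rollout bucketing.

/// Deterministic: the same request id always lands in the same
/// bucket, so a 1% knob selects a stable 1% slice of traffic.
fn in_rollout(request_id: &str, percent: u64) -> bool {
    let mut h = DefaultHasher::new();
    request_id.hash(&mut h);
    h.finish() % 100 < percent
}

/// During shadow, v2 always answers and v3's answer is compared and
/// discarded; once trusted, the selected bucket flips to v3.
fn serve(request_id: &str, percent_trusted: u64) -> &'static str {
    if in_rollout(request_id, percent_trusted) { "v3" } else { "v2" }
}
```

Note that `DefaultHasher` is only stable within a process; a real rollout knob would use a fixed hash so buckets agree across deploys.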
The only real surprise came from clock skew. A subtle bug in our quorum accounting surfaced during a VM-host migration that bumped one node’s wall clock backwards by ~400 ms. The fix was four lines; finding it took a week. It is a good reminder that time, as always, is the enemy.
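The post does not show the actual four-line fix, but the class of fix is standard and worth spelling out: deadlines on the hot path must come from the monotonic clock. A minimal sketch, with an invented `QuorumWait` type:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the clock-skew fix's general shape: the
// wall clock (SystemTime) can step backwards, as ours did by ~400 ms
// during a VM-host migration; Instant is monotonic and cannot.

struct QuorumWait {
    started: Instant, // monotonic start point, immune to clock steps
    timeout: Duration,
}

impl QuorumWait {
    fn begin(timeout: Duration) -> Self {
        QuorumWait { started: Instant::now(), timeout }
    }

    /// True once the quorum deadline has passed. Because `elapsed`
    /// is monotonic, a backwards wall-clock jump can neither fire
    /// this early nor stall it forever.
    fn expired(&self) -> bool {
        self.started.elapsed() >= self.timeout
    }
}
```

The same rule generalizes: wall time is for humans and logs; durations and deadlines belong to the monotonic clock.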
What we learned
The things we’d do again:
- Start by writing down what you’ll refuse to do. It prunes the tree of possible designs by 90%.
- Make the hot path readable. If it doesn’t fit on one page, you’ll regret it.
- Ship in shadow for twice as long as you think you need to. You will find the real bugs in month five, not month two.
The things we’d do differently:
- We underestimated how much client SDK work the redesign would create. If we did it again, we’d staff the SDK team before the core team.
- Our feature-flag story for the gradual rollout was ad-hoc. Teams that came after us built a proper framework and it was obviously the right call.
If you are staring at your own well-loved antique and wondering whether it’s worth the year, the honest answer is: probably. But only if you are willing to delete more than you add, and to resist every invitation to make the new thing interesting. The goal is the boringest possible thing that clears the bar. Ours took us a year. Yours might take less. We hope this helps.
Thanks to the whole Platform team for reviews, and to everyone who stayed up for the cutovers. Questions or corrections: engineering@bbyb.dev.