For most of its life, our data plane was a well-loved antique. A set of regional gateways forwarded writes to a primary, which replicated asynchronously and waved its hands at the three nines we had promised. It worked, for a while. Then traffic grew ten-fold, and we started to notice the cracks in ways that are only visible at three in the morning.
This is the story of the year we spent replacing it. Not the heroic version — the slow, careful version, where the interesting moments are the ones where we chose to delete code instead of write it.
The constraints
Before we drew a single box on a whiteboard, we wrote down what we were not willing to negotiate. Three things made the list:
- P99 write latency must stay under 8 ms, measured at the SDK, globally.
- The plane must degrade gracefully under partial region failure — no hard dependency on any single region being reachable.
- The total code we are willing to own on the hot path must fit in a single engineer’s head. If it doesn’t, the design is wrong.
That third one is the one that kept us honest. It’s easy to add a caching layer, a queue, a supervisor, a coordinator — each one individually defensible. The sum is a system nobody can reason about at 3 AM, and 3 AM is the only time the system really matters.
The job of an infrastructure team is not to add layers. It is to remove them, one at a time, until the remaining ones are load-bearing.
— internal design doc, January 2025
Design choices
We ended up with three decisions that drove everything else:
1. One ring to coordinate them all
Instead of per-region primaries replicating to each other, we put every region into a single Raft group of seven nodes: one per region plus a tie-breaker. Reads stay local. Writes go to the leader, wherever it is. This sounds slow, and on paper it is. In practice, with the right batching and pipelining, it is faster than what we had.
// hot-path write. seven lines. the whole protocol.
fn write(req: WriteReq) -> Result<Ack> {
    let entry = Entry::from(req).with_term(state.term);
    raft.append(&entry)?;
    raft.wait_quorum(Duration::from_millis(8))?;
    state.apply(entry)
}
There is a lot of machinery behind each of those lines, of course. But the hot path is short enough to fit on a screenshot, which was the point.
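One piece of that machinery is the batching mentioned above. As a rough sketch of the idea, with an invented `Batcher` type and strings standing in for real log entries: concurrent writes are coalesced into one appended batch, so a single quorum round-trip is amortized across many requests.

```rust
// Hypothetical sketch of write batching in front of the Raft log.
// Strings stand in for real entries; the real code would also flush
// on a timer, not only on a size threshold.

struct Batcher {
    pending: Vec<String>,
    max_batch: usize,
}

impl Batcher {
    fn new(max_batch: usize) -> Self {
        Batcher { pending: Vec::new(), max_batch }
    }

    /// Queue a write; return a full batch once the threshold is hit.
    /// One quorum round then covers every entry in the batch.
    fn push(&mut self, entry: String) -> Option<Vec<String>> {
        self.pending.push(entry);
        if self.pending.len() >= self.max_batch {
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

The effect is that the per-write cost of a global leader is dominated by queueing, not by the cross-region round-trip.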
2. The client is part of the system
Our old gateways existed partly so clients could stay dumb. The new plane pushes routing back to the SDK. The SDK keeps a small snapshot of the ring’s topology, picks its region-local follower for reads, and streams writes to whatever it believes is the current leader. When it’s wrong, the follower redirects with a cheap error.
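The routing loop can be sketched in a few lines. The types here (`Topology`, `WriteOutcome`, `Sdk`) are invented for illustration, not the real SDK's API; the shape is what matters: read locally, write to the believed leader, and treat a redirect as a topology refresh plus one retry.

```rust
// Hypothetical sketch of the SDK's routing: reads go to the
// region-local follower; writes go to the presumed leader, and a
// cheap NotLeader redirect refreshes the cached topology snapshot.

struct Topology {
    leader: String,         // node the SDK believes is the leader
    local_follower: String, // follower in the client's own region
}

enum WriteOutcome {
    Ack,
    // The contacted node tells us who the leader actually is now.
    NotLeader { actual_leader: String },
}

struct Sdk {
    topo: Topology,
}

impl Sdk {
    /// Reads never leave the client's region.
    fn read_target(&self) -> &str {
        &self.topo.local_follower
    }

    /// Send to the believed leader; on a redirect, update the
    /// snapshot and retry once before giving up.
    fn write<F>(&mut self, mut send: F) -> Result<(), String>
    where
        F: FnMut(&str) -> WriteOutcome,
    {
        match send(&self.topo.leader) {
            WriteOutcome::Ack => Ok(()),
            WriteOutcome::NotLeader { actual_leader } => {
                self.topo.leader = actual_leader;
                match send(&self.topo.leader) {
                    WriteOutcome::Ack => Ok(()),
                    WriteOutcome::NotLeader { .. } => {
                        Err("leader moved twice; re-resolve topology".to_string())
                    }
                }
            }
        }
    }
}
```

The nice property is that a stale snapshot costs exactly one extra round-trip, and the retry itself keeps the snapshot fresh.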
3. One binary, one config file
The old plane had four services. The new plane has one, and most of what used to be separate services is now a thread inside that binary. This is unfashionable — the industry has spent a decade telling us to decompose. We did the opposite, for this one thing, and the operational story got dramatically better.
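To make the shape concrete, here is a minimal sketch of the one-binary idea, with invented role names standing in for what used to be separate services: each former service becomes a named thread sharing one loaded config in memory, instead of a process talking over the network.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical sketch: former services become threads in one binary.
// Role names are invented for illustration.

struct Config {
    region: String,
}

/// Spawn one thread per former service and collect their startup
/// reports. In the real binary these would be long-running loops.
fn spawn_roles(cfg: Arc<Config>, roles: &[&str]) -> Vec<String> {
    let handles: Vec<_> = roles
        .iter()
        .map(|role| {
            let cfg = Arc::clone(&cfg);
            let role = role.to_string();
            thread::spawn(move || format!("{} up in {}", role, cfg.region))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Coordination between the pieces becomes a channel or a mutex rather than an RPC, which is most of why the operational story improved.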
The rollout
We rolled v3 out the way you should roll anything out that matters: slowly, in shadow, and with a visible rollback button. The first two months we ran v3 in parallel and threw away its answers. The next two months we started trusting a percent of the reads. Only in month seven did we let it serve writes, and only for a single internal tenant with whom we had a lot of coffee.
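The "trust a percent of the reads" knob is worth making concrete. One common way to implement it, sketched here with invented names (the post does not describe our actual mechanism), is deterministic bucketing: hash the request id so a given request always lands in the same bucket, and the rollout percentage stays stable as the knob moves.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical sketch of percentage-based rollout bucketing.

/// Deterministic: the same request id always lands in the same
/// bucket, so a 1% knob selects a stable 1% slice of traffic.
fn in_rollout(request_id: &str, percent: u64) -> bool {
    let mut h = DefaultHasher::new();
    request_id.hash(&mut h);
    h.finish() % 100 < percent
}

/// During shadow, v2 always answers and v3's answer is compared and
/// discarded; once trusted, the selected bucket flips to v3.
fn serve(request_id: &str, percent_trusted: u64) -> &'static str {
    if in_rollout(request_id, percent_trusted) { "v3" } else { "v2" }
}
```

Note that `DefaultHasher` is only stable within a process; a real rollout knob would use a fixed hash so buckets agree across deploys.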
The only real surprise came from clock skew. A subtle bug in our quorum accounting surfaced during a VM-host migration that bumped one node’s wall clock backwards by ~400 ms. The fix was four lines; finding it took a week. It is a good reminder that time, as always, is the enemy.
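The post does not show the actual four-line fix, but the class of fix is standard and worth spelling out: deadlines on the hot path must come from the monotonic clock. A minimal sketch, with an invented `QuorumWait` type:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the clock-skew fix's general shape: the
// wall clock (SystemTime) can step backwards, as ours did by ~400 ms
// during a VM-host migration; Instant is monotonic and cannot.

struct QuorumWait {
    started: Instant, // monotonic start point, immune to clock steps
    timeout: Duration,
}

impl QuorumWait {
    fn begin(timeout: Duration) -> Self {
        QuorumWait { started: Instant::now(), timeout }
    }

    /// True once the quorum deadline has passed. Because `elapsed`
    /// is monotonic, a backwards wall-clock jump can neither fire
    /// this early nor stall it forever.
    fn expired(&self) -> bool {
        self.started.elapsed() >= self.timeout
    }
}
```

The same rule generalizes: wall time is for humans and logs; durations and deadlines belong to the monotonic clock.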
What we learned
The things we’d do again:
- Start by writing down what you’ll refuse to do. It prunes the tree of possible designs by 90%.
- Make the hot path readable. If it doesn’t fit on one page, you’ll regret it.
- Ship in shadow for twice as long as you think you need to. You will find the real bugs in month five, not month two.
The things we’d do differently:
- We underestimated how much client SDK work the redesign would create. If we did it again, we’d staff the SDK team before the core team.
- Our feature-flag story for the gradual rollout was ad-hoc. Teams that came after us built a proper framework and it was obviously the right call.
If you are staring at your own well-loved antique and wondering whether it’s worth the year, the honest answer is: probably. But only if you are willing to delete more than you add, and to resist every invitation to make the new thing interesting. The goal is the boringest possible thing that clears the bar. Ours took us a year. Yours might take less. We hope this helps.
Thanks to the whole Platform team for reviews, and to everyone who stayed up for the cutovers. Questions or corrections: engineering@bbyb.dev.