In the good ol' days, we used to build redundant systems with a tie-breaker process that used only a fraction of the resources of the full processes. Some Raft implementations allow for nodes that are voters but can never become quorum leaders (for example, the branch office where all traffic is funneled through an asymmetric VPN tunnel should never be elected leader, but it still knows which candidates it can and cannot see).
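Here is a minimal sketch of that idea, assuming a hypothetical `LeaderEligible` flag; this is not the API of etcd/raft, hashicorp/raft, or any other particular library. The point is that such a node never campaigns on election timeout, but still grants votes and therefore still counts toward quorum.

```go
package main

import "fmt"

// NodeConfig is a hypothetical knob set, not any real Raft library's API:
// LeaderEligible=false marks a full voter that must never campaign.
type NodeConfig struct {
	ID             string
	LeaderEligible bool // false for e.g. the branch office behind the asymmetric VPN
}

// Node holds the minimal election state this sketch needs:
// the current term and who we voted for in that term.
type Node struct {
	cfg         NodeConfig
	currentTerm uint64
	votedFor    string
}

// OnElectionTimeout fires when no heartbeat has arrived. An eligible node
// starts a campaign; an ineligible voter simply keeps waiting so it can
// vote for a candidate it can actually see.
func (n *Node) OnElectionTimeout() (becameCandidate bool) {
	if !n.cfg.LeaderEligible {
		return false // never a candidate, always a follower/voter
	}
	n.currentTerm++
	n.votedFor = n.cfg.ID
	// ... send RequestVote RPCs to peers (omitted in this sketch)
	return true
}

// HandleRequestVote shows that the leader-ineligible node still counts
// toward quorum: it grants votes under the usual one-vote-per-term rule.
func (n *Node) HandleRequestVote(term uint64, candidateID string) bool {
	if term < n.currentTerm {
		return false
	}
	if term > n.currentTerm {
		n.currentTerm, n.votedFor = term, ""
	}
	if n.votedFor == "" || n.votedFor == candidateID {
		n.votedFor = candidateID
		return true
	}
	return false
}

func main() {
	witness := &Node{cfg: NodeConfig{ID: "branch-office", LeaderEligible: false}}
	fmt.Println("campaigns on timeout:", witness.OnElectionTimeout())    // false
	fmt.Println("grants vote to hq-1:", witness.HandleRequestVote(1, "hq-1")) // true
}
```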
So the table stakes for running a cluster were not 3x as much hardware but closer to 2.2x: two full replicas plus a tie breaker at roughly a fifth of the footprint. That is a huge deal for small-scale deployments and developer sandboxes. It also matters when 3 shards don't quite cover your load but 5 is too many, or when you're weighing 6 against 7.
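To make the 2.2x concrete, here's a tiny back-of-the-envelope sketch; the 0.2x weight for the tie breaker is an illustrative assumption, not a measured figure.

```go
package main

import "fmt"

// footprint returns a cluster's hardware multiple relative to a single full
// node, given per-node weights (1.0 = a full replica).
func footprint(weights ...float64) float64 {
	total := 0.0
	for _, w := range weights {
		total += w
	}
	return total
}

func main() {
	// Classic three full replicas: 3.0x the hardware.
	fmt.Printf("3 full nodes: %.1fx\n", footprint(1, 1, 1))

	// Two full replicas plus a lightweight tie breaker: ~2.2x,
	// yet still a 3-member quorum that tolerates one failure.
	fmt.Printf("2 full nodes + tie breaker: %.1fx\n", footprint(1, 1, 0.2))
}
```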
The catch is that once you add geographic replication, this doesn't fix either of the problems laid out in the thesis of this article:
1. Cloud economics – by design, Kafka's replication strategy will rack up massive inter-AZ bandwidth costs.
2. Operational overhead – running your own Kafka cluster realistically requires a dedicated team and sophisticated custom tooling.
Still, we need this functionality back for the cloud, particularly as the pendulum swings back toward self-hosting, as it always has before.