We’ve been running a 3-node cluster for several years, and a significant minority of my pages have been because ZK got into a bad state that a restart fixed (what bad state exactly? Don’t know, don’t care, don’t have two spare weeks to figure it out). Note that we do have proper liveness checks on individual instances, so the issue is subtler than something a liveness check would catch.
We migrated to Kafka 3.3 with KRaft about half a year ago, and we haven’t had a single issue since. It just runs, and we resize the disks from time to time.
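For anyone curious what the KRaft setup looks like, a minimal combined-mode server.properties is roughly this (hostnames, ports, and paths are placeholders, not our actual config):

```properties
# KRaft combined mode: each node acts as both broker and controller
process.roles=broker,controller
node.id=1

# Raft quorum voters, one node.id@host:port entry per controller
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093

# Separate listeners for client traffic and the controller quorum
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT

# Cluster metadata and partition data live here; no ZooKeeper anywhere
log.dirs=/var/lib/kafka/data
```

One gotcha: before first start you have to format the storage directory once, e.g. `kafka-storage.sh format -t "$(kafka-storage.sh random-uuid)" -c server.properties`, since KRaft nodes won’t start on an unformatted log dir.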