ZooKeeper is rock solid. Moving off it is a mistake, IMO.
My tinfoil-hat theory is that the whole impetus for KRaft is that Confluent Cloud's multi-tenant clusters have so many partitions that they started to exceed ZK's capacity, so Confluent built KRaft for Confluent.
And yeah, the migration approach is nutso. Also very annoying: the KRaft metadata topics were made super-secret, for... some good reason, I'm sure.
But that entirely removes the ability you had with ZK to react to cluster metadata changes by watching znodes.
I'm not at all a fan, tbh.
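For anyone who hasn't used them: znode watches are one-shot notifications you attach when reading a path, and re-register when they fire. A minimal sketch of watching Kafka's broker registrations (the /brokers/ids path is real; the connection string and the rest are illustrative):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.util.List;

public class BrokerWatcher {
    public static void main(String[] args) throws Exception {
        // Session with a no-op default watcher; 15s session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.NodeChildrenChanged) {
                    try {
                        // Watches fire only once; re-register by reading again.
                        List<String> brokers = zk.getChildren("/brokers/ids", this);
                        System.out.println("Broker set changed: " + brokers);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        };

        System.out.println("Current brokers: " + zk.getChildren("/brokers/ids", watcher));
        Thread.sleep(Long.MAX_VALUE); // stay alive to receive events
    }
}
```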
We’ve been running a 3-node cluster for several years, and a significant minority of the times I’ve been paged have been because ZK got into a bad state that was fixed by a restart (what bad state exactly? Don’t know, don’t care, don’t have two spare weeks to spend figuring it out). Note that we have proper liveness checks on individual instances, so the issue is more subtle than a single instance dying.
Migrated to Kafka 3.3 with KRaft about half a year ago, and we haven’t had a single issue since. It just runs, and we resize the disks from time to time.
That has not been my experience. I've been running several small clusters (3 and 5 nodes) of Confluent-packaged Kafka for the last 3 years, and roughly 20 times ZooKeeper has gotten into a state where a node isn't in the ensemble, and the way to "fix" it is to restart the current leader node. Usually I have to play whack-a-mole, restarting leaders until the missing node rejoins. Sometimes I haven't been able to get the node back in without shutting down the whole cluster and restarting it.
Once it's running it's fine, until updates are done. But this habit of getting into a weird state sure doesn't sit well with me.
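Half the whack-a-mole is working out which node is currently the leader. ZooKeeper's `srvr` four-letter command reports each server's role over the client port (on ZK 3.5+ it must be whitelisted with `4lw.commands.whitelist=srvr` in zoo.cfg). A rough sketch that speaks the protocol directly, with made-up hostnames:

```java
import java.io.OutputStream;
import java.net.Socket;

public class FindLeader {
    public static void main(String[] args) throws Exception {
        String[] hosts = {"zk1", "zk2", "zk3"}; // hypothetical ensemble members
        for (String host : hosts) {
            try (Socket s = new Socket(host, 2181)) {
                // Send the real "srvr" four-letter command; ZK replies with
                // plain text and closes the connection.
                OutputStream out = s.getOutputStream();
                out.write("srvr".getBytes());
                out.flush();
                String reply = new String(s.getInputStream().readAllBytes());
                for (String line : reply.split("\n")) {
                    if (line.startsWith("Mode:")) {
                        // e.g. "Mode: leader" or "Mode: follower"
                        System.out.println(host + " -> " + line.trim());
                    }
                }
            } catch (Exception e) {
                System.out.println(host + " -> unreachable: " + e.getMessage());
            }
        }
    }
}
```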
This thread is an excellent example of the author's point: Kafka is polarizing.
Personally, in my experience with Kafka and Zookeeper at Airbnb back in the day (we also used ZK for general-purpose service discovery), they both were... temperamental. They'd chug along just fine for a bit, seemingly handling outages that e.g. RDS would have thrown a fit over, and then suddenly they'd be cataclysmically down in extremely complicated ways and be very difficult to bring back up. Even just using them required teaching a more complex mental model than most cloud-hosted offerings of similar things, and you ended up in this path dependency trap of "we already invested so much in Kafka, so if you want to send a message, use Kafka" when for like 95+% of use cases something easy like SQS would've been fine and simpler. TBQH I don't think either Kafka or ZK ever quite paid back their operational overhead cost, and personally I wouldn't recommend using either unless you absolutely need to.
> ZooKeeper is rock solid. Moving off it is a mistake, IMO.
I’m agnostic about Kafka but ZooKeeper is problematic for many use cases based on personal experience and I wouldn’t recommend it. It can be “rock solid” and still not very good. I’ve seen ZK replaced with alternatives at a few different organizations now because it didn’t work well in practice, and what it was replaced with worked much better in every case.
ZooKeeper works, sort of, but I wouldn’t call it “good” in some objective sense.
To be fair, a lot of people use ZK wrong, then complain about it.
For example, if you use it as a general-purpose KV store, the way you'd use Redis, you'll have a bad time.
Another common mistake: assuming ZK doesn't need to store much data, people deploy it to a server with a slow disk or network. Big mistake. Every write to ZK has to be broadcast to the ensemble and synced to disk, so a bottleneck in disk or network IOPS will kill your ensemble.
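The usual mitigation is giving the transaction log its own fast device. A sketch of the relevant zoo.cfg settings — the keys are real ZooKeeper config options, the paths and hostnames are made up:

```
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181

# Snapshots can tolerate a shared disk...
dataDir=/var/lib/zookeeper/data
# ...but the txn log is fsynced on every write, so put it on a
# dedicated device (ideally local SSD), not one shared with the OS.
dataLogDir=/var/lib/zookeeper/txnlog

server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```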
This has also been my experience with unreliable ZKs: the OS, ZK, and sometimes other services were all sharing the same disk, occasionally with software RAID or something layered on top.
I don't think teams who can't run ZK will have much luck running other distributed systems. (Maybe KRaft, if they're Kafka experts.) Most of the alternatives proposed here have been "let someone else run the hard part." (Which isn't a bad choice, but it's not technically a solution.)