
Man, Ceph really doesn't get enough love. For all the distributed systems hype out there - be it Kubernetes or blockchains or serverless - the ol' rock-solid distributed storage systems have sat in the background, iterating like crazy.

We had a huge Rook/Ceph installation in the early days of our startup before we killed off the product that used it (sadly). It did explode under some rare unusual cases, but I sometimes miss it! For folks who aren't aware, a rough TLDR is that Ceph is to ZFS/LVM what Kubernetes is to containers.

This seems like a very cool board for a Ceph lab, although extremely expensive, and I say that as someone who sells very expensive Raspberry Pi based computers!



Ceph is fantastic. I use it as the storage layer in my homelab. I've done some things that I can only concisely describe as super fucked up to this Ceph cluster, and every single time I've come out the other side with zero data loss, not having to restore a backup.


Haha "super fucked up" is a much better way of describing the "usual, rare" situations I was putting it into as well :P


Care to provide examples of what these things were that you were doing to a storage pool? I guess I'm just not imaginative enough to think about ways of using a storage pool other than storing data in it.


In our case we were a free-to-use-without-any-signup way of testing Kubernetes. You could just go to the site and spin up pods. Looking back, it was a bit insane.

Anyways, you can imagine we had all sorts of attacks and miners or other abusive software running. This, on top of using ephemeral nodes for our free service, meant hosts were always coming and going and Ceph was always busy migrating data around. The wrong combo of nodes dying, bursting traffic, and beta versions of Rook meant we ran into a huge number of edge cases. We did some optimization and re-design, but it turned out there just weren't enough folks interested in paying for multi-tenant Kubernetes. We did learn an absolute ton about multi-tenant K8s, so, if anyone is running into those challenges, feel free to hire us :P


Not OP, but I would start with filling disk space up to 100%, or creating zillions of empty files. In the case of distributed filesystems, maybe removing one node (preferably under heavy load), or "cloning" nodes so they have the same UUIDs (preferably nodes storing some data on them, to see if the data will be de-duplicated somehow).

Or just a disk with unreliable USB connection?


Administering the storage pool.

The worst that comes to mind for me was a node failure in the middle of a major version upgrade. Not likely a big deal for proper deployments, but I don't have enough nodes to tolerate complete node failure for most of my data.

Grabbed a new root/boot SSD, reinstalled the OS, reinstalled the OSDs on each disk, told Ceph what OSD ID each one had previously (not actually sure if that was required), and... voila, they just rejoined the cluster and started serving their data like nothing ever happened.
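
For anyone curious, the recovery was roughly the sketch below (assuming a non-containerized, LVM-backed deployment; commands from memory, so treat it as an outline rather than a runbook). ceph-volume rediscovers the OSD IDs from the metadata already on each disk, which is probably why telling Ceph the IDs by hand wasn't strictly needed:

  # Rough post-reinstall sketch for a non-containerized, LVM-backed node.
  # The OSD data (and IDs) still live on the data disks; the fresh OS just
  # needs to rediscover and start them. Assumes /etc/ceph/ceph.conf and the
  # keyring have already been copied back from the cluster.
  import subprocess

  def run(*cmd):
      print("+", " ".join(cmd))
      subprocess.run(cmd, check=True)

  # Scan LVM tags on every disk and bring the OSDs back up under their
  # original IDs (no manual ID bookkeeping should be required).
  run("ceph-volume", "lvm", "activate", "--all")

  # Verify they rejoined and the cluster is healing.
  run("ceph", "osd", "tree")
  run("ceph", "-s")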


I think many people (myself included) had been burned by major disasters on earlier clustered storage solutions (like early Gluster installations). Ceph seems to have been under the radar for a while as it got to a more stable/usable point, and came more into the limelight once people started deploying Kubernetes (and Rook, and more integrated/holistic clustered storage solutions).

So I think a big part of Ceph's success (at least IMO) was its timing, and its adoption into a more cloud-first ecosystem. That narrowed the use cases down from what the earliest networked storage systems were trying to solve.


We're feeling more and more that we made the wrong call with Gluster... The underlying bricks being a POSIX fs felt a lot safer at the time, but in hindsight Ceph or one of the newer ones would probably have been a better choice. So much inexplicable behavior. For your sake I hope the grass really is greener.


Red Hat (owner of Gluster) has announced EOL in 2024: https://access.redhat.com/support/policy/updates/rhs/

Ceph is where the action is now.


Whoa, totally missed that announcement, whenever it was...

But yeah, have felt the wind blowing for a while now so with this I guess it's about time to get moving.


Can someone with experience with Ceph and MinIO or SeaweedFS comment on how they compare?

I currently run a single-node SnapRAID setup, but would like to expand to a distributed one, and would ideally prefer something simple (which is why I chose SnapRAID over ZFS). Ceph feels too enterprisey and complex for my needs, but at the same time, I wouldn't want to entrust my data to a simpler project that can have major issues I only discover years down the road.

SeaweedFS has an interesting comparison[1], but I'm not sure how biased it is.

[1]: https://github.com/seaweedfs/seaweedfs#compared-to-ceph


SeaweedFS has problems with large "pools". It's based on an old Facebook paper (Haystack) and is meant for storing and distributing large image caches. I found it mediocre at best, as its documentation was lacking, performance was lacking (in my tests), and the multitude of components was hard to get working. The idea behind it is that every daemon uses one large file as a data store to skip slow metadata access. There are different ways to access the storage through gateways.
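
(If it helps, here's a toy sketch of that idea, nothing like SeaweedFS's real on-disk format: blobs get appended to one big volume file and a small in-memory index maps each id to an offset/length, so a read is one positioned read instead of a per-file metadata lookup.)

  # Toy illustration of the Haystack-style layout (not SeaweedFS's actual code):
  # one large append-only volume file plus an in-memory index of offsets.
  import os

  class ToyVolume:
      def __init__(self, path):
          self.f = open(path, "a+b")
          self.index = {}  # blob_id -> (offset, size)

      def put(self, blob_id, data):
          self.f.seek(0, os.SEEK_END)
          offset = self.f.tell()
          self.f.write(data)
          self.f.flush()
          self.index[blob_id] = (offset, len(data))

      def get(self, blob_id):
          offset, size = self.index[blob_id]
          self.f.seek(offset)
          return self.f.read(size)

  vol = ToyVolume("volume_001.dat")
  vol.put("img123", b"...jpeg bytes...")
  print(vol.get("img123"))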

MinIO has changed so much in the last few years that I can't give a competent answer, but compared to SeaweedFS it uses many small local databases. Right now it's deprecating many features like the gateway, and it is split into 2 main components (CLI and server). In comparison, the SeaweedFS deployment is dead simple, but I don't know which direction the project is going. It went from a normal open source project to a more business-like deal (from what I saw); like I said, I didn't quite follow the process.

Ceph is based on block storage. It offers an object gateway (S3/Swift), a filesystem (CephFS), and block storage (RBD). You can access everything through librados directly as well. For a minimal setup you need a "larger" cluster, but it is the most flexible solution (imho). It uses the most resources as well, but you can do nearly everything you want with it, without limit.
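
Just to make the librados point concrete, the Python binding lets you put objects straight into RADOS, underneath the S3/CephFS/RBD layers. A minimal sketch, assuming python3-rados is installed, the usual /etc/ceph/ceph.conf and keyring are in place, and a pool called "mypool" already exists (the names are just placeholders):

  # Write and read one object directly via librados, bypassing the gateways.
  # "mypool" and the object name are placeholders for this sketch.
  import rados

  cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
  cluster.connect()
  try:
      ioctx = cluster.open_ioctx("mypool")
      try:
          ioctx.write_full("hello-object", b"stored straight in RADOS")
          print(ioctx.read("hello-object"))
      finally:
          ioctx.close()
  finally:
      cluster.shutdown()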


SeaweedFS author here. Thanks for your candid answer. You do not need to use multiple SeaweedFS components. Just download the binary and run "weed server -s3".

There are many other components, but you do not really need to use them. This default mode should be good enough for most cases. I have often seen people try to optimize too early, unnecessarily, and sometimes in the wrong way.
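
For example, once "weed server -s3" is running, any standard S3 client should work against it. A quick sketch with boto3, assuming the default S3 port of 8333 and no authentication configured (the credentials below are dummies):

  # Talk to a local "weed server -s3" with a generic S3 client.
  # Assumes the S3 gateway listens on its default port 8333.
  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="http://localhost:8333",
      aws_access_key_id="any",
      aws_secret_access_key="any",
  )

  s3.create_bucket(Bucket="test-bucket")
  s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hello seaweedfs")
  print(s3.get_object(Bucket="test-bucket", Key="hello.txt")["Body"].read())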

I would like to know what kind of setup you are running. It should beat most other options if the use case needs lots of small files, e.g. millions or billions of files. For a small use case, e.g. a few personal files, it would be overkill.

Another aspect is how to increase capacity for existing clusters. It is simplest with SeaweedFS: just start one more volume server, and it will linearly increase the throughput.


Yeah, sorry, my answer was more than insufficient to be honest. I wrote it _in bed_ and was embarrassed the next day because it was of really low quality. I thought of expanding it later. So yeah, I screwed the pooch here and I'm sorry; I will try to do better now by expanding on my answer.

First of all this is all from memory and I didn't try seaweedfs again for this.

So first things first. I evaluated SeaweedFS for HPC cluster usage in 2020 (oh my, this is some time ago), but my test setup was VMs. I tried it with many small and larger files and it didn't scale at all (at least when I tested it) for parallel loads. The response time was acceptable, but the throughput was very low. When I tried it, "weed server" spun up everything more or less fine, but had problems binding correctly so that a distributed setup would work. Based on the wiki documentation I configured a master server, a filer, and a few volume servers (iirc). My main gripes at that time were as follows:

  * the syntax of the different clients was inconsistent
  * the throughput was rather low
  * the components didn't work well together in a certain configuration and I had to specify many things manually
  * the wiki was lacking
I tried filer (FUSE), S3, and Hadoop. S3 wasn't compatible enough to work with everything I tried with it, so I spun up a MinIO instance as a gateway to test the whole thing. When working over a longer period I had some hangs as well.

That's sadly everything I remember about it, but I made a presentation; if you are interested I can look for it and give you the benchmarks I tried and the limitations I found (although they will all be HORRIBLY out of date). When I tested it there were 2 versions with different size limitations, iirc. I just now looked over your GitHub releases and can't find these.

Sorry again if I misrepresented SeaweedFS here with my outdated tests. I looked at the GitHub wiki and it looks much better than when I last played with it. I will give it a spin again soon, and if I find my old experience of it to be not representative, maybe write something about it and post it here.

---

MinIO was, when I tried it, mainly an S3 server and gateway. It had a simple web framework that allowed you to upload and share files. One of our use cases that we thought we could use MinIO for was as a bucket browser/web interface. It was easy to set up as a server as well. Like I said, I didn't track it after testing it for about a month. Today it boasts about its performance and AI/ML use cases. Here is their pricing model, https://min.io/pricing, where you can see how they add value to their product.

---

Ceph is, like I said, the most complex product of the three, with the most components that need to be set up (even though it's quite easy now). Performance is being optimized in their Crimson project, https://next.redhat.com/2021/01/18/crimson-evolving-ceph-for... (this is a WIP and not enabled by default). It's not the most straightforward to tune, since many small things can lead to big performance gains and losses (for instance the erasure code k and m you choose), but I found that the defaults got more sane with time.


Thanks for the detailed clarification! I am too deep into the SeaweedFS low level details and am all ears on how to make it simpler to use. SeaweedFS has weekly releases and is constantly evolving.

Depending on your case, you may need to add more filers. UCSD has a setup that uses about 10 filers to achieve 1.5 billion iops. https://twitter.com/SeaweedFS/status/1549890262633107456 There are many AI/ML users switching from MinIO or Ceph to SeaweedFS, especially with lots of images/text/audio files to process.

I found MinIO benchmark results are really, well, "marketing". MinIO is basically just an S3 API layer on top of the local disks. Any object is mapped to at least 2 files on disk, one for metadata and one for the object itself.


Thanks for your perspective. Ceph does sound the most appealing for my use case. I'm hoping that the learning curve is mild, and that it has a mostly set-and-forget UX.


I love it, but when it fails at scale, it can be hard to reason about. Or at least that was the case when I was using it a few years back. Still keen to try it again and see what's changed. I haven't run it since bluestore was released.


Yeah, I've been running a small Ceph cluster at home, and my only real issue with it is the relative scarcity of good conceptual documentation.

I personally learned about Ceph from a coworker and fellow distributed systems geek who's a big fan of the design. So I kind of absorbed a lot of the concepts before I ever actually started using it. There have been quite a few times where I look at a command or config parameter, and think, "oh, I know what that's probably doing under the hood"... but when I try to actually check that assumption, the documentation is missing, or sparse, or outdated, or I have to "read between the lines" of a bunch of different pages to understand what's really happening.


Ceph always seems to be related to big block storage outages. This is why I am very wary of using it. Has this changed? Edit: rephrased a bit


Ceph is incredibly stable and resilient.

I've run Ceph at two Fortune 50 companies from 2013 to now, and I've not lost a single production object. We've had outages, yes, but not because of Ceph; it was always something else causing cascading issues.

Today I have a few dozen clusters with over 250 PB total storage, some on hardware with spinning rust that's over 5 years old, and I sleep very well at night. I've been doing storage for a long time, and no other system, open source or enterprise, has given me such a feeling of security in knowing my data is safe.

Any time I read about a big Ceph outage, it's always a bunch of things that should have never been allowed in production, compounded by non-existent monitoring, and poor understanding of how Ceph works.


Can you talk about the method that Ceph has for determining whether there was bit rot in a system?

My understanding is that you have to run a separate task/process that has Ceph go through its file structures and check them against some checksums. Is it a separate step for you, do you run it at night, etc.?


Just to add to the other comment: Ceph checksums data and metadata on every read/write operation. So even if you completely disable scrubbing, if data on a disk becomes corrupted, the OSD will detect it and the client will transparently fail over to another replica, rather than seeing bad data or an I/O error.

Scrubbing is only necessary to proactively detect bad sectors or silent corruption on infrequently-accessed data, so that you can replace the drive early without losing redundancy.


That’s called ceph scrub & deep-scrub.

By default it "scrubs" basic metadata daily, and weekly does a deep scrub, where it fully reads each object and confirms the checksum is correct across all 3 replicas, for all of the data in the cluster.

It’s automatic and enabled by default.
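
If you want to see or override that cadence, it's driven by OSD config options, and you can also trigger a scrub by hand. A rough sketch (assuming a recent release with centralized config and an existing PG "1.0"; run somewhere with admin credentials):

  # Inspect the scrub cadence and force a deep scrub of one placement group.
  # Assumes "ceph config" (Mimic or later) and that PG "1.0" exists.
  import subprocess

  def ceph(*args):
      out = subprocess.run(["ceph", *args], check=True,
                           capture_output=True, text=True)
      return out.stdout.strip()

  # Defaults are roughly daily (scrub) and weekly (deep scrub), in seconds.
  print("scrub interval:     ", ceph("config", "get", "osd", "osd_scrub_min_interval"))
  print("deep scrub interval:", ceph("config", "get", "osd", "osd_deep_scrub_interval"))

  # Kick off an immediate deep scrub of a single PG.
  print(ceph("pg", "deep-scrub", "1.0"))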


So what amount of disk bandwidth/usage is involved?

For instance, say that I have 30TB of disk space used and it is across 3 replicas, thus 3 systems.

When I kick off the deep scrub operation, what amount of reads will happen on each system? Just the smaller amount of metadata or the actual full size of the files themselves?


In Ceph, objects are organized into placement groups (PGs), and a scrub is performed on one PG at a time, operating on all replicas of that PG.

For a normal scrub, only the metadata (essentially, the list of stored objects) is compared, so the amount of data read is very small. For a deep scrub, each replica reads and verifies the contents of all its data, and compares the hashes with its peers. So a deep scrub of all PGs ends up reading the entire contents of every disk. (Depending on what you mean by "disk space used", that could be 30TB, or 30TBx3.)

The deep scrub frequency is configurable, so e.g. if each disk is fast enough to sequentially read its entire contents in 24 hours, and you choose to deep-scrub every 30 days, you're devoting 1/30th of your total IOPS to scrubbing.
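
To put rough numbers on that (the figures below are made up for illustration, not measurements):

  # Back-of-the-envelope deep scrub overhead, with hypothetical figures.
  disk_capacity_tb = 20            # hypothetical disk size
  sequential_read_mb_s = 240       # hypothetical sustained read rate
  deep_scrub_interval_days = 30    # one full pass per 30 days

  full_read_hours = disk_capacity_tb * 1e6 / sequential_read_mb_s / 3600
  # Fraction of the disk's time spent deep scrubbing, spread over the interval:
  overhead = full_read_hours / (deep_scrub_interval_days * 24)
  print(f"~{full_read_hours:.0f} h to read the whole disk once, "
        f"~{overhead:.1%} of its time spent scrubbing")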

Note that "3 replicas" is not necessarily the same as "3 systems". The normal way to use Ceph is that if you set a replication factor of 3, each PG has 3 replicas that are chosen from your pool of disks/servers; a cluster with N replicas and N servers is just a special case of this (with more limited fault-tolerance). In a typical cluster, any given scrub operation only touches a small fraction of the disks at a time.


Not to be too glib, but "a bunch of things that should have never been allowed in production, compounded by non-existent monitoring, and poor understanding of how <thing> works" is the cause of most outages.



