
Ceph always seems to come up in connection with big block storage outages, which is why I'm very wary of using it. Has this changed? Edit: rephrased a bit


Ceph is incredibly stable and resilient.

I've run Ceph at two Fortune 50 companies from 2013 to now, and I haven't lost a single production object. We've had outages, yes, but never because of Ceph; it was always something else causing cascading issues.

Today I have a few dozen clusters with over 250 PB total storage, some on hardware with spinning rust that's over 5 years old, and I sleep very well at night. I've been doing storage for a long time, and no other system, open source or enterprise, has given me such a feeling of security in knowing my data is safe.

Any time I read about a big Ceph outage, it's always a bunch of things that should have never been allowed in production, compounded by non-existent monitoring, and poor understanding of how Ceph works.


Can you talk about the method that Ceph has for determining whether there was bit rot in a system?

My understanding is that you have to run a separate task/process that has Ceph go through its file structures and check them against some checksums. Is it a separate step for you, do you run it at night, etc.?


Just to add to the other comment: Ceph checksums data and metadata on every read/write operation. So even if you completely disable scrubbing, if data on a disk becomes corrupted, the OSD will detect it and the client will transparently fail over to another replica, rather than seeing bad data or an I/O error.

Scrubbing is only necessary to proactively detect bad sectors or silent corruption on infrequently-accessed data, so that you can replace the drive early without losing redundancy.
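
Roughly speaking, the read path behaves like this toy sketch (not Ceph's actual code; internally Ceph uses its own checksums such as crc32c, and the names here are made up for illustration):

    import hashlib

    def read_object(replicas, stored_checksum):
        """Toy illustration: verify the checksum on read and transparently
        fall back to another replica if a copy is corrupted."""
        for data in replicas:  # the same object as stored on different OSDs
            if hashlib.sha256(data).hexdigest() == stored_checksum:
                return data  # good copy: hand it to the client
            # bad copy: skip it rather than returning corrupt data or an I/O error
        raise IOError("all replicas failed checksum verification")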


That’s called ceph scrub & deep-scrub.

By default it “scrubs” basic metadata daily, and weekly it does a deep scrub of all of the data in the cluster, fully reading each object and confirming the checksum is correct across all 3 replicas.

It’s automatic and enabled by default.
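
If memory serves, the knobs that control this live in the OSD config; expressed in rough Python terms (values from memory, so treat the exact option names and defaults as assumptions and check the docs for your release):

    # Approximate scrub interval defaults, in seconds (illustrative, not authoritative)
    scrub_defaults = {
        "osd_scrub_min_interval": 24 * 60 * 60,       # light scrub: roughly daily
        "osd_deep_scrub_interval": 7 * 24 * 60 * 60,  # deep scrub: roughly weekly
    }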


So what amount of disk bandwidth/usage is involved?

For instance, say that I have 30TB of disk space used and it is across 3 replicas, thus 3 systems.

When I kick off the deep scrub operation, what amount of reads will happen on each system? Just the smaller amount of metadata or the actual full size of the files themselves?


In Ceph, objects are organized into placement groups (PGs), and a scrub is performed on one PG at a time, operating on all replicas of that PG.

For a normal scrub, only the metadata (essentially, the list of stored objects) is compared, so the amount of data read is very small. For a deep scrub, each replica reads and verifies the contents of all its data, and compares the hashes with its peers. So a deep scrub of all PGs ends up reading the entire contents of every disk. (Depending on what you mean by "disk space used", that could be 30TB, or 30TBx3.)

The deep scrub frequency is configurable, so e.g. if each disk is fast enough to sequentially read its entire contents in 24 hours, and you choose to deep-scrub every 30 days, you're devoting 1/30th of your total IOPS to scrubbing.
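
Putting numbers on the example above (back-of-the-envelope only, using the figures from this thread rather than measurements):

    # How much a full deep scrub reads, and roughly what it costs in bandwidth
    user_data_tb = 30              # logical data stored by clients
    replication_factor = 3
    raw_read_tb = user_data_tb * replication_factor  # 90 TB read cluster-wide

    disk_full_read_days = 1        # assume a disk can sequentially read itself in ~24h
    deep_scrub_interval_days = 30
    bandwidth_fraction = disk_full_read_days / deep_scrub_interval_days  # ~3.3% per disk

    print(raw_read_tb, round(bandwidth_fraction, 3))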

Note that "3 replicas" is not necessarily the same as "3 systems". The normal way to use Ceph is that if you set a replication factor of 3, each PG has 3 replicas that are chosen from your pool of disks/servers; a cluster with N replicas and N servers is just a special case of this (with more limited fault-tolerance). In a typical cluster, any given scrub operation only touches a small fraction of the disks at a time.
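
A toy example of what "replicas chosen from a pool" means (this is NOT CRUSH, just a made-up hash-based placement to show that 3 replicas don't imply 3 servers):

    import hashlib

    def place_pg(pg_id, servers, replicas=3):
        # Rank servers by a per-PG hash and take the top N -- a crude stand-in
        # for CRUSH, which also weighs capacity and failure domains.
        ranked = sorted(servers,
                        key=lambda s: hashlib.sha256(f"{pg_id}:{s}".encode()).hexdigest())
        return ranked[:replicas]

    servers = [f"host{i}" for i in range(12)]
    print(place_pg("pg.1a", servers))  # each PG lands on 3 of the 12 hosts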


Not to be too glib, but "a bunch of things that should have never been allowed in production, compounded by non-existent monitoring, and poor understanding of how <thing> works" is the cause of most outages.



