When I worked in storage at Google, we had our own figures of merit showing that we were the best and that Amazon's durability was trash by comparison.
As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.
9's are overblown. When cloud providers report them, they're really saying: "Assuming random hard drive failures at the rates we've historically measured, and given how quickly we detect and fix those failures, what's the mean time to data loss?"
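To make that concrete, here's a toy version of that kind of model, in the spirit of the classic RAID reliability formulas. It's my own sketch with made-up parameter names, assuming 3-way replication, independent disk failures, and a fixed repair time - nothing like the real models any provider uses - but it shows where the absurdly large numbers come from:

    HOURS_PER_YEAR = 24 * 365

    def mttdl_years(disk_mttf_hours: float, repair_hours: float) -> float:
        # Data is lost only if all 3 replicas die before a repair completes.
        # Classic approximation: MTTDL ~= MTTF^3 / (3 * 2 * 1 * MTTR^2)
        return disk_mttf_hours ** 3 / (6 * repair_hours ** 2) / HOURS_PER_YEAR

    # Disks with a ~1M-hour rated MTTF, repairs that finish within an hour:
    print(f"{mttdl_years(1_000_000, 1):.2e}")  # ~1.90e+13 years

Plug in plausible hardware numbers and you get trillions of years, which is why the headline figure ends up dominated by everything the model leaves out.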
But that's burying the lede. By far the greatest risks to a file's durability are:
1. Bugs (which aren't captured by a durability model). This is mitigated by deploying slowly and having good isolation between regions.
2. An act of God that wipes out a facility.
The point of my comment was that it's not just about checksums. That's table stakes. The main driver of data loss for storage organizations with competent software is safety culture and physical infrastructure.
My experience was that S3's safety culture is outstanding. In terms of physical separation and how "solid" the AZs are, AWS is overbuilt compared to the other players.
That was not how we treated the 9's at Google. Those had been tested through natural experiments (disasters).
I was not at Google for the Clichy fire, but it wasn't the first datacenter fire Google experienced. I think your information about Google's data placement may be incorrect, or you may be mapping AWS concepts onto Google internal infrastructure in the wrong way.
Do you mean Google included "acts of God" when computing 9's? That's definitely not right.
11 9's of durability means mean time to data loss of 100 billion years. Nothing on earth is 11 9's durable in the face of natural (or man-made) disasters. The earth is only 4.5 billion years old.
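For what it's worth, the arithmetic behind that claim is just: treat "N nines of annual durability" as a 10^-N chance of losing any given object in a year, then invert it (names here are my own):

    def mean_years_to_loss(nines: int) -> float:
        # N nines of annual durability -> 10**-N chance of losing a given
        # object in a year -> the inverse is the expected years until loss.
        return 1.0 / (10.0 ** -nines)

    print(f"{mean_years_to_loss(11):.0e}")  # 1e+11, i.e. 100 billion years

Which is exactly why quoting it as a mean time to data loss against disasters makes no sense.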
Yeah, that's definitely true. Google sort of mapped an AWS concept onto its own cluster splits. However, there are enough regional-scale outages at all the major clouds that I don't personally place much stock in the idea of zones to begin with. The only way to get close to true 24/7 five-9's uptime with clouds is to be multi-region (and preferably multi-cloud).
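To put a number on the multi-region point, the back-of-the-envelope looks like this - assuming region failures are independent, which in practice they often aren't (shared control planes, DNS, global auth, etc.):

    def combined_availability(per_region: float, regions: int) -> float:
        # You're only fully down when every region is down at the same time.
        return 1 - (1 - per_region) ** regions

    print(combined_availability(0.9995, 1))  # 0.9995  (three and a half nines)
    print(combined_availability(0.9995, 2))  # 0.99999975  (past five nines, on paper)

The hard part is making your deployment actually behave like two independent regions.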
I have experienced many outages that were contained to a specific availability zone in AWS, from power failures to flooding to cable cuts. You are correct that 5 9’s still requires multi-region though.
I think Google as a whole also has pretty good diversity. But Cloud customers demanded regions in big population centers and in smaller countries that Google had traditionally avoided for cost reasons. This led to less redundant sites that were often owned and/or operated by third parties. So in the US and Europe you can probably trust GCP zones quite literally, but in other regions (I have heard lots of rumours about APAC) they may not be as diverse as they appear.
I think most Googlers don't actually know the specifics (I certainly don't), and those who do probably couldn't tell you. It's sort of common knowledge that some of them are like this, but not exactly which ones.
According to https://nuclearsecrecy.com/nukemap/ - it'd take at least a 1 megaton warhead to take out two of the ap-southeast-2 datacenters, and over 10MT to take out 3.
I suspect you'd need a lot less than that though, the 1MT warhead would probably take out enough outside-the-datacenter infrastructure to take the entire AZ offline. I don't care too much though, if someone's dropped a warhead that close to home I have other things to worry about than whether all the cat pictures and audit logs survive.
I’m picturing you having a slideshow of audit logs that you make guests to your home sit down and watch with you, like the vacation pictures slideshow of old.
Awwwww! Check out this cute little IAM audit log! Look at its funny little fuzzy privilege escalation! I just want to scratch its belly until it p0wns the whole prod deployment.
A 1MT warhead will take out the infrastructure and make the data not Available.
However, the data still exists in the 3rd datacenter, so Durability isn't compromised.
But yes, we don't need the cat pictures when one lands that close to home :)
I'm guessing that even though the 3rd datacenter is about 35km away from the other 2, and so the building isn't in the expected destruction zone of a 1MT warhead, the damage to the city's electricity/water/network infrastructure would take the 3rd datacenter offline as well - so while your cat pictures are probably still in existence on the no-longer-spinning rust there, they'd be inaccessible for quite some time.
I think less. Why would AWS need to store 3 full copies when they can use Reed-Solomon codes (or similar), cut that down to 2x or 1.5x, and save a lot of storage space?
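Rough sketch of the overhead math (purely illustrative shard counts; AWS doesn't publish S3's actual layout as far as I know):

    def storage_overhead(data_shards: int, parity_shards: int) -> float:
        # A (k data + m parity) Reed-Solomon layout survives the loss of any
        # m shards while storing only (k + m) / k times the original bytes.
        return (data_shards + parity_shards) / data_shards

    print(storage_overhead(1, 2))   # 3.0 -> plain 3x replication
    print(storage_overhead(4, 2))   # 1.5 -> same tolerance for any 2 lost shards
    print(storage_overhead(10, 4))  # 1.4

So yes, erasure coding gets you the same (or better) loss tolerance for far less raw space; the trade-off is more network and CPU on reads and repairs.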
9's are useful when they're backed by an actual SLA - like the GCP Cloud Storage and AWS S3 availability SLAs. Neither one commits to any durability SLA whatsoever, so I wouldn't put any stock in the 'eleventy nine nines' durability claims.
> As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.
How can you tell that if it's not measurable?
As far as I can tell the '11/14 9s' durability numbers are more or less completely made up. That's why AWS doesn't offer any actual durability SLA for S3, only a 99.9% availability SLA[0].
Sorry, I'm not buying the personal anecdote, as the public numbers from both orgs tell a different story. When reliability and long-term support come up in conversation, Google is not a name to reach for.
Note that I said "durability," not any other reliability metric. GCP is pretty well-known for its outages and abysmal support. It's a reputation they want to change, but they did earn it.