When I worked in storage at Google, we had our own figures of merit showing that we were the best and that Amazon's durability was trash by comparison.
As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.
9's are overblown. When cloud providers report them, they're really saying: "Assuming random hard drive failures at the rates we've historically measured, and given how quickly we detect and fix those failures, what's the mean time to data loss?"
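To make that concrete, here's a toy version of that kind of model, in the spirit of the classic RAID reliability formulas. It's my own sketch with made-up parameter names, assuming 3-way replication, independent disk failures, and a fixed repair time - nothing like the real models any provider uses - but it shows where the absurdly large numbers come from:

    HOURS_PER_YEAR = 24 * 365

    def mttdl_years(disk_mttf_hours: float, repair_hours: float) -> float:
        # Data is lost only if all 3 replicas die before a repair completes.
        # Classic approximation: MTTDL ~= MTTF^3 / (3 * 2 * 1 * MTTR^2)
        return disk_mttf_hours ** 3 / (6 * repair_hours ** 2) / HOURS_PER_YEAR

    # Disks with a ~1M-hour rated MTTF, repairs that finish within an hour:
    print(f"{mttdl_years(1_000_000, 1):.2e}")  # ~1.90e+13 years

Plug in plausible hardware numbers and you get trillions of years, which is why the headline figure ends up dominated by everything the model leaves out.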
But that's burying the lede. By far the greatest risks to a file's durability are:
1. Bugs (which aren't captured by a durability model). This is mitigated by deploying slowly and having good isolation between regions.
2. An act of God that wipes out a facility.
The point of my comment was that it's not just about checksums. That's table stakes. The main driver of data loss for storage organizations with competent software is safety culture and physical infrastructure.
My experience was that S3's safety culture is outstanding. In terms of physical separation and how "solid" the AZs are, AWS is overbuilt compared to the other players.
That was not how we treated the 9's at Google. Those had been tested through natural experiments (disasters).
I was not at Google for the Clichy fire, but it wasn't the first datacenter fire Google experienced. I think your information about Google's data placement may be incorrect, or you may be mapping AWS concepts onto Google internal infrastructure in the wrong way.
Do you mean Google included "acts of God" when computing 9's? That's definitely not right.
11 9's of durability means mean time to data loss of 100 billion years. Nothing on earth is 11 9's durable in the face of natural (or man-made) disasters. The earth is only 4.5 billion years old.
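For what it's worth, the arithmetic behind that claim is just: treat "N nines of annual durability" as a 10^-N chance of losing any given object in a year, then invert it (names here are my own):

    def mean_years_to_loss(nines: int) -> float:
        # N nines of annual durability -> 10**-N chance of losing a given
        # object in a year -> the inverse is the expected years until loss.
        return 1.0 / (10.0 ** -nines)

    print(f"{mean_years_to_loss(11):.0e}")  # 1e+11, i.e. 100 billion years

Which is exactly why quoting it as a mean time to data loss against disasters makes no sense.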
Yeah, that's definitely true. Google sort of mapped an AWS concept onto its own cluster splits. However, there are enough regional-scale outages at all the major clouds that I don't personally place much stock in the idea of zones to begin with. The only way to get close to true 24/7 five-9's uptime with clouds is to be multi-region (and preferably multi-cloud).
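To put a number on the multi-region point, the back-of-the-envelope looks like this - assuming region failures are independent, which in practice they often aren't (shared control planes, DNS, global auth, etc.):

    def combined_availability(per_region: float, regions: int) -> float:
        # You're only fully down when every region is down at the same time.
        return 1 - (1 - per_region) ** regions

    print(combined_availability(0.9995, 1))  # 0.9995  (three and a half nines)
    print(combined_availability(0.9995, 2))  # 0.99999975  (past five nines, on paper)

The hard part is making your deployment actually behave like two independent regions.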
I have experienced many outages that were contained to a specific availability zone in AWS, from power failures to flooding to cable cuts. You are correct that 5 9’s still requires multi-region though.
I think Google as a whole also has pretty good diversity. But Cloud customers demanded regions in big population centers and in smaller countries that Google had traditionally avoided for cost reasons. This led to less redundant sites that were often owned and/or operated by third parties. So in the US and Europe you can probably trust GCP zones quite literally, but in other regions (I have heard lots of rumours about APAC) they may not be as diverse as they appear.
I think most Googlers don't actually know the specifics (I certainly don't), and those who do probably couldn't tell you. It's sort of common knowledge that some of them are like this, but not exactly which ones.
According to https://nuclearsecrecy.com/nukemap/ - it'd take at least a 1 megaton warhead to take out two of the ap-southeast-2 datacenters, and over 10MT to take out 3.
I suspect you'd need a lot less than that though, the 1MT warhead would probably take out enough outside-the-datacenter infrastructure to take the entire AZ offline. I don't care too much though, if someone's dropped a warhead that close to home I have other things to worry about than whether all the cat pictures and audit logs survive.
I’m picturing you having a slideshow of audit logs that you make guests to your home sit down and watch with you, like the vacation pictures slideshow of old.
Awwwww! Check out this cute little IAM audit log! Look at its funny little fuzzy privilege escalation! I just want to scratch its belly until it p0wns the whole prod deployment.
A 1MT warhead will take out the infrastructure and make the data not Available.
However, the data still exists in the 3rd datacenter, so Durability isn't compromised.
But yes, we don't need the cat pictures when one lands that close to home :)
I'm guessing that even though the 3rd datacenter is about 35km away from the other 2, and so the building isn't in the expected destruction zone of a 1MT warhead, the damage to the city's electricity/water/network infrastructure would take the 3rd datacenter offline as well - so while your cat pictures are probably still in existence on the no-longer-spinning rust there, they'd be inaccessible for quite some time.
I think less. Why would AWS need to store 3 full copies when they can use Reed-Solomon codes (or similar), cut that down to 2x or 1.5x, and save a lot of storage space?
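Rough sketch of the overhead math (purely illustrative shard counts; AWS doesn't publish S3's actual layout as far as I know):

    def storage_overhead(data_shards: int, parity_shards: int) -> float:
        # A (k data + m parity) Reed-Solomon layout survives the loss of any
        # m shards while storing only (k + m) / k times the original bytes.
        return (data_shards + parity_shards) / data_shards

    print(storage_overhead(1, 2))   # 3.0 -> plain 3x replication
    print(storage_overhead(4, 2))   # 1.5 -> same tolerance for any 2 lost shards
    print(storage_overhead(10, 4))  # 1.4

So yes, erasure coding gets you the same (or better) loss tolerance for far less raw space; the trade-off is more network and CPU on reads and repairs.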
9's are useful when they're backed by an actual SLA - like the GCP Cloud Storage and AWS S3 availability SLAs. Neither one commits to any durability SLA whatsoever, so I wouldn't put any stock in the 'eleventy nine nines' durability claims.
> As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.
How can you tell that if it's not measurable?
As far as I can tell the '11/14 9s' durability numbers are more or less completely made up. That's why AWS doesn't offer any actual durability SLA for S3, only a 99.9% availability SLA[0].
Sorry, I'm not buying the personal anecdote, as the public numbers from both orgs tell a different story. When reliability and long-term support come up in conversation, Google is not a name to reach for.
Note that I said "durability," not any other reliability metric. GCP is pretty well-known for its outages and abysmal support. It's a reputation they want to change, but they did earn it.