> My statement was definitely a reach, but I think there is a bunch of space in the life of an early startup for lack of highly available services.
I think there's room for a lot fewer 9s for most businesses generally, provided you have enough isolation that one thing being down doesn't also take down everything else. The exceptions are mainly ones that are trying to be someone else's infrastructure.
For lots of businesses, especially early on, maintenance windows are entirely fine, and perfect bigco rolling deploys to giant HA clusters are way overkill: they introduce all kinds of cost and complexity (= risk) that could be avoided by just accepting that every now and then some part of your site/service will be down. Just have backups, know how to restore from them quickly, and make sure you have scripted from-scratch deployments that actually work (having developers run them daily or weekly for their own local environments does a pretty decent job of sussing out brokenness), and you're probably fine.
[EDIT] but of course that's the opposite of Résumé Driven Development and is very Not Cool and likely to be unpopular with clueless managers.
manager: "Why's our site down?"
you: "a deployment broke, it's fine, we can restore from scratch in ten minutes flat if we have to."
versus, also you: "as you know we follow Google's best practices and use Kubernetes and blah blah blah and you see [translated from Bullshit Speak] we don't actually understand it very well and it's super complex and it shit the bed for some reason but we're fixing it, and as you know this is all best practices and Google like and such as"
Unfortunately the latter is often "safer" than the former... for one's career, not for the product or service you're providing.