
> How people ever got away with building massively redundant fault-tolerant applications that were completely dependent on a single SQL server, I'll never understand.

It works, with a lower cognitive burden than horizontal scaling.

For the load concern (i.e. is this enough to handle the load):

For most businesses, being able to serve 20k concurrent requests is way more than they need anyway: an internal app used by 500k users typically has fewer than 20k concurrent requests in flight at peak.

A cheap VPS running PostgreSQL can easily handle that.[1]
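
As a rough back-of-envelope (every number below is an assumption, purely to illustrate the magnitudes), Little's law gives you the in-flight count from arrival rate and service time:

    # Back-of-envelope sketch; all figures are assumptions for illustration.
    users = 500_000              # total user base of the internal app
    peak_active = 0.05           # fraction of users active at the busiest moment
    req_per_active_per_s = 0.1   # roughly one request every 10 s per active user
    service_time_s = 0.05        # 50 ms to serve a request

    arrival_rate = users * peak_active * req_per_active_per_s  # requests/second
    in_flight = arrival_rate * service_time_s                   # Little's law: L = lambda * W

    print(f"~{arrival_rate:,.0f} req/s, ~{in_flight:,.0f} requests in flight at peak")
    # ~2,500 req/s, ~125 requests in flight at peak

Even with generous assumptions you stay orders of magnitude below the point where a single database becomes the problem.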

For the "if something breaks" concern:

Each "fault-tolerance" criteria added adds some cost. At some point the cost of being resistant to errors exceeds the cost of downtime. The mechanisms to reduce downtime when the single large SQL server shits the bed (failovers, RO followers, whatever) can reduce that downtime to mere minutes.

What is the benefit to removing 3 minutes of downtime? $100? $1k? $100k? $1m? The business will have to decide what those 3 minutes are worth, and whether that worth exceeds the cost of using something other than a single large SQL server.
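
To make the comparison concrete, this is the whole calculation (every figure below is hypothetical; plug in your own):

    # Hypothetical downtime-vs-redundancy comparison; all figures are made up.
    downtime_minutes_per_month = 3
    revenue_lost_per_minute = 500        # $ lost per minute of downtime (assumed)
    yearly_cost_of_downtime = downtime_minutes_per_month * 12 * revenue_lost_per_minute

    extra_infra_per_year = 30_000        # extra nodes, licences, managed HA (assumed)
    extra_engineering_per_year = 60_000  # ops and complexity cost in staff time (assumed)
    yearly_cost_of_avoiding_it = extra_infra_per_year + extra_engineering_per_year

    print(f"downtime: ~${yearly_cost_of_downtime:,}/yr, "
          f"avoiding it: ~${yearly_cost_of_avoiding_it:,}/yr")
    # downtime: ~$18,000/yr, avoiding it: ~$90,000/yr

With numbers like these, eating the occasional 3 minutes is the rational choice; scale the left side up by 100x and the answer flips.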

Until and unless you reach the load and downtime-cost of Google, Amazon, Twitter, FB, Netflix, etc, you're simply prematurely optimising for a scenario that, even in the business's best-case projections, might never exist.

The best thing to do, TBH, is ask the business for their best-case projections and build to handle 90% of that.

[1] An expensive VPS running PostgreSQL can handle a lot more than you think.



> Each "fault-tolerance" criteria added adds some cost. At some point the cost of being resistant to errors exceeds the cost of downtime.

Not to forget: those costs are not just in money and time, but also in complexity. And added complexity comes with its own downtime risks. It's not that uncommon for systems to go down due to problems with mechanisms or components that would not exist in a simpler, "not fault tolerant" system.


> For most businesses, being able to serve 20k concurrent requests is way more than they need anyway: an internal app used by 500k

This is a very simple distinction and I am not sure why it is not understood.

For some reason people design public apps the same way they design internal apps.

The largest companies employ circa 1 million people - that's Walmart, Amazon, and the like. Most giants, like Shell, have ~100k tops. That can be handled by one beefy server.

Successful consumer-facing apps have hundreds of millions to billions of users. That's a difference of 3 orders of magnitude.

I have seen a company with 5k employees invest in a mega-scalable, event-driven microservice architecture, and I was thinking - I hope they realise what they are doing and that it's just CV-driven development.


> Each "fault-tolerance" criteria added adds some cost. At some point the cost of being resistant to errors exceeds the cost of downtime.

Agreed, but the same cost/benefit reasoning should apply at every level of the stack. I get it when people build a single-server system. I'm just baffled that so many people insist they need a load balancer and multiple instances of their application so that there's no single point of failure (which is not free by any means), but then run them all off of a single SQL server.


The business can try to decide what those 3min are worth, but ultimately the customers vote by either staying or leaving that service.


> The business can try to decide what those 3min are worth, but ultimately the customers vote by either staying or leaving that service.

That's still a business decision.

Customers don't vote with their feet based on what tech stack the business chose, they vote based on a range of other factors, few, if any, of which are related to 3m of downtime.

There are few, if any, services I know of that would lose customers over 3m of downtime per week.

IOW, 3m of downtime is mostly an imaginary problem.


That's really too broad a generalization.

Services that people might leave because of downtime are, for example, a git hoster or a password manager. When people cannot push their commits and this happens multiple times, they may leave for another git hoster. I have seen this very example when GitLab was less stable and often unreachable for a few minutes at a time. When people need some credentials but cannot reach their online password manager, they cannot work. They cannot trust that service to be available in critical moments. Not being able to access your credentials leaves a very bad impression, and some will look for more reliable ways of storing their credentials.


Why does a password manager need to be online? I understand the need for synchronization, but being exclusively online is a very bad decision. And git synchronization is basically ssh, and if you mess that up on a regular basis, you have no business being in business in the first place. These are examples, but there are a few things that do not need to be online unless your computer is a thin client or you don't trust it at all.


The user experience of "often unreachable" means way more than 3m per week in practice.


True, but what people should understand about databases is that they're incredibly mature software. They don't fail, they just don't. It's not like the software we're used to, where "whoopsie! Something broke!" is common.

I've never, in my life, seen an error in SQL Server related to SQL Server. It's always been me, the app code developer.

Now, to be fair, the server itself or the hardware CAN fail. But an active/passive database configuration is simple, tried, and tested.
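
FWIW, on the PostgreSQL side (the other database in this thread) the client half of active/passive is nearly free these days. A minimal sketch, assuming libpq 10+ and made-up hostnames; actually promoting the standby is the job of whatever failover tooling you run:

    # Client-side failover sketch (psycopg2 / libpq). "db-primary" and
    # "db-standby" are made-up hostnames; promoting the standby is handled
    # by your failover tooling, not by this snippet.
    import psycopg2

    conn = psycopg2.connect(
        "host=db-primary,db-standby port=5432 dbname=app user=app "
        "target_session_attrs=read-write connect_timeout=5"
    )
    # libpq tries the hosts in order and only keeps a connection that accepts
    # writes, so after a failover the same string reaches the new primary.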


And the server itself can be very resilient if you run something like Debian or FreeBSD. Even on Arch, I've seen things fail only rarely, unless it's fringe/proprietary code (Bluetooth, NVIDIA, the browser and 3D-accelerated graphics, ...). That presumes you use boring tech that is heavily tested by people around the world, not something "new" and "hyped" that is still on 0.x.


I agree 100%. Unfortunately my company is pretty tied to Windows and Windows Server, which is a pain. Upgrading and sysadmin-type work is still very manual and there's a lot of room for human error.

I wish we would use something like Debian and take advantage of tech like systemd. But alas, we're still using COM and Windows Services and we still need to remote desktop in and click around on random GUIs to get stuff to work.

Luckily, SQL Server itself is very stable and reliable. But even SQL Server runs on Linux.


>> What is the benefit to removing 3 minutes of downtime?

> The business can try to decide what those 3min are worth, but ultimately the customers vote by either staying or leaving that service.

What do you think the business is doing when it evaluates what 3 minutes are worth?


There is no "the business". Businesses do all kinds of f'ed up things and lie to themselves all the time as well.

I don't understand what people are arguing about here. Are we really arguing about customers making their own choice? That is all I stated. The business can jump up and down all it wants, if the customers decide to leave. Is that not very clear?


> The business can jump up and down all it wants, if the customers decide to leave.

I think the point is that, for a few minutes of downtime, businesses lose so few customers that it's not worth avoiding that downtime.

Just now, we had a 5m period where Disney+ stopped responding. We aren't going to cut off our toddler from Peppa Pig and Bluey over 5m of downtime per day, never mind per week.

You appeared to be under the impression that 3m downtime/week is enough to make people leave. This is simply not true, especially for internet services where the users are conditioned to simply wait.


I think people are arguing that what you say makes no sense as a counter-argument. Of course customers make their choice. THAT IS WHAT BUSINESSES ARE CALCULATING. How many customers do we lose as a result of downtime, how much is that worth in $, and how many $ do we need to spend to avoid that downtime? If the projected loss is less than the cost of fixing the downtime, let the customers go.

You saying "the customer can decide to leave" is not a counter to this at all. It's just a weird way of saying what everyone else is saying, while framing it as a counter-argument to what is being said. Which it simply isn't.



