The details (the particular companies / systems etc) of this global incident don't really matter.
When the entire society and economy are being digitized AND that digitization is controlled and passes through a handful of choke points, it's an invitation to major disaster.
It is risk management 101: never put all your digital eggs in one (or even a few) baskets.
The love affair with oligopoly, cornered markets and power concentration (which creates abnormal returns for a select few) is priming the rest of us for major disasters.
As a rule of thumb there should be at least ten alternatives in any diversified set of critical infrastructure service providers, all of them instantly replaceable / forced to provide interoperability...
Some truths will hit you in the face again and again until you acknowledge the nature of reality.
Can you imagine having just one road connecting two big cities to cut costs? No alternative roads, big or small.
That would be really cheap to maintain, and you could charge as much as you want in tolls, as there are no alternatives. And you could plaster the road with ads, since people have to watch them to get from one city to the other.
And if the road breaks, the government has to pay for the repairs, as it cannot allow the cities to go unconnected.
I'm writing this in the wake of the aftermath of the disclosure of the log4j zero-day vulnerability. But this is only a recent example of just one kind of networked risk.
With managed services we effectively add one more level to the Inception world of our software organisation. We outsource nice big chunks of supply chain risk management, but we in-source a different risk of depending critically on entities that we do not control and cannot fix if they fail.
Not to mention the fact that change ripples through the parallel yet deeply enmeshed dimensions of cyberspace and meatspace. Code running on hardware is inexorably tied to concepts running in wetware. Of course, at this level of abstraction, the notion applies to any field of human endeavour. Yet, it is so much more true of software. Because software is essentially the thoughts of people being played on repeat.
The oligopoly is not a "love affair", that's how IT works: first-mover advantage, "move fast and break things" (the first of them being interoperability), moats, the brittleness of programming...
The whole startup/unicorns ecosystem exists only because there is the possibility of becoming the dominant player in a field within a few years (or being bought out by one of the big players). This "love affair with oligopoly" is the reason why Ycombinator/HN exists.
It's correct that these are political/economic decisions. But most people in society have neither the knowledge for an informed opinion on such matters, nor a vote.
Centralisation vs decentralisation.
Cost-savings vs localisation of disaster.
It's a swinging pendulum of decisions. And developers know that software/hardware provision is a house of cards. The more levels of dependency, the more fragile the system is.
Absolutely horrible when lives will be lost, but determining the way our global systems are engineered and paid for will always be a moving target based on policy and incentive.
My heart goes out to those facing the life-and-death consequences of this. There are no perfect tech solutions.
Be aware that enterprise firms actively choose and "assess" who their AV suppliers are, on-premises and in the cloud; this is not imposed by MSFT. Googling it, it does seem that CrowdStrike has a history of kernel panics. Perhaps interesting things like kernel panics should be part of the compliance checklist.
Googling it, it seems CrowdStrike has a history of causing kernel panics.
Every time there was a mysterious performance problem affecting a random subset of machines, it was Tanium. I know how difficult it is for anyone to just get rid of this type of software, but frankly it has been proven over and over that antivirus is just more attack surface, not less.
I think the enterprise software ecosystem currently is not really "all eggs in one basket", but rather you have a whole bunch of baskets, some of them you are not even aware of, some are full of eggs, some have grenades in them instead, some are buckets instead. All baskets are being constantly bombarded with a barrage of eggs from unknown sources, sometimes the eggs explode for inexplicable reasons. Oh yeah and sometimes the baskets themselves disintegrate all at once for no apparent reason.
The problem is allowing a single vendor, with a reputation of fucking up over and over again, to push code into your production systems at will with no testing on your part.
Right. I thought the "big guys" know better and they have some processes to vet Crowdstrike updates. Maybe even if they don't get its source code, they at least have a separate server that manages the updates, like Microsoft's WSUS.
But no, they are okay with a black box that calls home and they give it kernel access to their machines. What?
Monocultures are known to be points of failure, but people keep going down that path because they optimize for efficiency (heck, most modern economics is premised on the market being efficient).
This problem is pervasive and affects everything from the food supply (planting genetically identical seeds rather than diversified "heirloom" crops) to businesses across the board buying and gutting their competitors, thus reducing consumer choice.
It's a tough problem akin to a multi-armed bandit: exploit a known strategy or "waste" some effort exploring alternatives in the hopes of better returns. The more efficient you are (exploitation), the higher the likelihood of catastrophic failure in weird edge cases.
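To make that trade-off concrete, here's a minimal epsilon-greedy sketch of the bandit framing above; the vendor names and reliability numbers are invented purely for illustration, not taken from anywhere real:

```python
import random

# Hypothetical "arms": candidate vendors with unknown reliability.
# The reward probabilities below are made up for illustration.
vendors = ["incumbent", "alt_a", "alt_b"]
true_reliability = {"incumbent": 0.95, "alt_a": 0.90, "alt_b": 0.97}

estimates = {v: 0.0 for v in vendors}
counts = {v: 0 for v in vendors}
epsilon = 0.1  # fraction of effort "wasted" on exploration

def pick_vendor():
    # Mostly exploit the best-known option, occasionally explore.
    if random.random() < epsilon:
        return random.choice(vendors)
    return max(vendors, key=lambda v: estimates[v])

for _ in range(10_000):
    v = pick_vendor()
    reward = 1.0 if random.random() < true_reliability[v] else 0.0
    counts[v] += 1
    # Incremental mean update of the reliability estimate.
    estimates[v] += (reward - estimates[v]) / counts[v]

print(estimates, counts)
```

With epsilon at zero you converge on whatever looked best first and never learn about the alternatives; that's the pure-efficiency monoculture failure mode in miniature.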
this isn't even the first time something like this has happened. it's literally a running joke in programmer circles that AWS East going down will take down half the internet, and yet there's absolutely zero initiative being taken by anyone who makes these sorts of decisions to maybe not have every major service on the internet be put into the same handful of points of failure. nothing will change, no one will learn anything, and this will happen again.
That’s very different though. That’s avoidable. We all can easily have our services running in different data centers around the world. Heck, the non-amateurs out there all have their services running in different Amazon data centers around the world. So you can get that even from a single provider. Hardware redundancy is just that cheap nowadays.
This CS thing, there's no way around it. You use it and they screw up, you get hit. Period. You don't fail over to another data center in Europe or Asia. You just go down.
Hardware, even cloud hardware, is rarely the issue. Probably especially cloud hardware is not an issue because failover is so inexpensive relative to software.
Software is a different issue entirely. How many of us will develop, shadow run, and maintain a parallel service written on a separate OS? My guess is “not many”. That’s the redundancy we’re talking about to avoid something like this. You’d have to be using a different OS and not using CS anywhere in that new software stack. (Though not using CS wouldn’t be much of a problem if the OS is different but I think you see what I mean.)
On Amazon, implementing failover for your hardware is a few clicks. But if you want to implement an identical service with different software, you'd better have a spare dev team somewhere.
AWS East going down will (and has) cause(d) disruption in other regions.
Last time it happened (maybe like 18 months ago), you ran into billing and quota issues, if my memory serves.
AWS is, like any company, centralized in one way or another.
Want to be sure you won't be impacted by AWS East going down, even if you run in another region? Well, better be prepared to run (or have a DRP) on another cloud provider then...
The cost of running your workload on two different CSPs is quite high, especially if your teams have been convinced to use AWS-specific technologies. You need to first make your software stack provider-agnostic and then manage the two platforms in sync from a technical and contract perspective, which is not always easy...
You just made your hardware abstraction layer the single point of failure of your software stack. There's a bug in it, you're down. Everywhere. Not only that, but if there is CS in either your HAL or your application, you're down. So to get the redundancy the original commenter was talking about, you need to develop 2 different HALs with 2 different applications, all using a minimum of 2 different OS and language stacks.
Why multiply your problems? Use your cloud service provider only to access hardware and leave the rest of that alone. That way any cloud provider will do. Any region on any cloud provider will do. You could even just fall back to your own racks if you want. Point is, you only want the hardware.
Now to get that level of redundancy, you would still have to create 2 different implementations of your application on 2 different software and OS stacks. But the hardware layer is now able to run anywhere. Again, you can even have a self hosted rack in your dispatch stack.
So hardware redundancy is easy to do at the level the original commenter recommends. Software redundancy is incredibly difficult and expensive to do at the level the original commenter was talking about. Your idea of a hardware/cloud abstraction layer only multiplies the number of software layers you would need to implement twice, shadow run, and maintain to achieve that hypothetical level of redundancy.
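For what it's worth, the "use the cloud for hardware only" idea above can be kept to a pretty thin layer; here's a minimal, hypothetical sketch (the class names and methods are invented, not any real SDK, and each adapter would wrap the provider's own API in practice):

```python
from abc import ABC, abstractmethod

class HardwareProvider(ABC):
    """Thin abstraction: only ask the provider for machines, nothing else."""

    @abstractmethod
    def provision(self, cpu: int, ram_gb: int) -> str:
        """Return an identifier for a fresh machine."""

    @abstractmethod
    def destroy(self, machine_id: str) -> None:
        ...

# Hypothetical adapters; real ones would call the provider's actual API.
class CloudA(HardwareProvider):
    def provision(self, cpu, ram_gb):
        return f"cloud-a-vm-{cpu}x{ram_gb}"
    def destroy(self, machine_id):
        print("released", machine_id)

class OwnRack(HardwareProvider):
    def provision(self, cpu, ram_gb):
        return "rack-42-blade-7"
    def destroy(self, machine_id):
        print("powered off", machine_id)

def deploy(provider: HardwareProvider):
    vm = provider.provision(cpu=8, ram_gb=32)
    # ...then install the *same* software stack on whatever came back...
    return vm
```

Which is exactly the thread's point: the hardware side is a small interface, while the software stack running on top of it is where real redundancy gets expensive.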
> It is risk management 101, never put all your digital eggs in one (or even a few) baskets.
The fact it's widespread is because so many individual organisations individually chose to use CrowdStrike, not because they all got together and decided to crown CrowdStrike as king, surely?
I agree with you in principle, but the only solution I can think of would be to split up a company with reach like CrowdStrike's. The consequences of having to do that are up for debate.
It's never that simple. There is a strong herd mentality in the business space.
Just yesterday I was in a presentation from the risk department, and they described the motives for choosing a specific security product as `safe choice, because a lot of other companies use it in our space, so the regulator can't complain`... The whole decision structure boiled down to: `I don't want to do extra work to check the other options, we go with whatever the herd chooses`. It's terrifying to hear this...
The whole point of software like this is a regulatory box-ticking exercise; no one wants it to actually do anything except satisfy the regulator. CrowdStrike had less overhead and (until now) fewer outages than its competitors, and the regulators were willing to tick the box, so of course people picked them. There are bad cases of people following the herd where there are other solutions with actually better functionality, but this isn't that.
OTOH... I remember an O365 outage in London a few years ago.
You're down? Great, so are your competitors, your customers, and your suppliers. Head to the pub. Actually, you'll probably get more real value there, as your competitors, customers and suppliers are at that same pub. Insurance multinationals have been founded from less.
That didn't affect any OT though, so it was more just proof that 90% of work carried out via O365 adds no real value. Knowing where the planes are probably is important.
> You're down? Great, so are your competitors, your customers, and your suppliers. Head to the pub. Actually, you'll probably get more real value there, as your competitors, customers and suppliers are at that same pub. Insurance multinationals have been founded from less.
I mean yeah, that's the other thing: the Keynesian sound-banker aspect. But that's more for software that you're intentionally using for your business processes. I don't think anyone was thinking about CrowdStrike being down in the first place, unless they were worried about an outage in the webpage that lists all the security certifications they have.
You say that as if it's some bad thing, but it's just other words for "use boring tech".
Yes, there could be reasons to choose a lesser-known product, but they better be really good reasons.
Because there are multiple general reasons in the other direction, and incidents like this are actually one of those reasons: they could happen with any product, but now you have a bigger community sharing heads-ups and workarounds, and vendor's incident response might also be better when the whole world is on fire, not only a couple of companies.
It's not just Crowdstrike, it's all up and down the software and hardware supply chain.
It's that so many people are on Azure, which is a de facto monopoly for people using the Microsoft stack, which is a de facto monopoly for people using .NET.
And if they're doing that, the clients are on Windows as well, and probably also running Crowdstrike. The AD servers that you need to get around Bitlocker to automatically restore a machine are on Azure, running Windows, running Crowdstrike. The VM image storage? Same. This is basically a "rebuild the world from scratch" exercise to some greater or lesser degree. I hope some of the admins have non-windows machines.
How come AWS sometimes has even better tooling for .NET than Azure, while JetBrains offers a better IDE on Linux, macOS and, depending on your taste, Windows than Microsoft does? Or, for some reason, the most popular deployment target is just a container that is vendor-agnostic? Surely I must be missing something you aren't.
All of that is absolutely true and in no way affects the behavior at hand. Big companies go with whoever sells them the best, not any kind of actual technical evaluation.
Perhaps the organisations have a similar security posture. And that creates a market that will eventually result in a few large providers who have the resources to service larger corporations. You see something similar in VPN software where Fortinet and Palo become the linchpin of security. The deeper question is to wonder at the soundness of the security posture itself.
There's a strong drive for everyone to do things the same way in IT. Some of the same pressure that drives us towards open standards can also drive us towards using a standard vendor.
> I agree with you in principle, but the only solution I can think of would be to split up a company with reach like CrowdStrike's.
Changing corporate structures doesn't necessarily help. It's possible that if CrowdStrike were split up into smaller companies, all the customers would go to the one with the "better" product and we'd be in a similar position.
Well, if they'd used a different vendor (or nothing) on the DR servers, we could have done a failover and gotten on with our day. But alas, nobody saw an app that can download data from the internet and update itself arbitrarily, whenever it wants, without user intervention, as a problem.
They choose it because others have. "Look how many others choose us" is a common marketing cry. Perhaps, instead, being too popular is a reason not to choose? Perhaps not parroting your competitors and industry is a reason not to choose?
When it comes to security products, the size of the customer base matters. More customers means more telemetry. More telemetry means better awareness of IOCs, better training sets to determine what's good and what's bad.
I wonder how many of those orgs were "independently" audited by security firms that made passing the audit without CrowdStrike specifically a hell.
Most of the crap security I've met in big organisations was driven by checklist audits and compliance audits by a few "security" firms. Either you did it the dumb way, or good luck fighting your org and their org to pass the audit.
Setting aside the utter fecklessness if not outright perniciousness of cybersecurity products such as this, I hope this incident (re-)triggers a discussion of our increasing dependence on computing technology in our lives, its utter inescapability, and our ever-growing inability to function without it in modern society.
Not everything needs to be done through a computer, and we are seeing the effects now of organizing our systems such that the only way to interface with them is through a digital device or a smartphone, with no alternative. Such are the consequences of moving everything "into the cloud" and onto digital devices as a result of easy monetary policy and the concomitant digital gold rush where everyone and their dog scrambled to turn everything into a smartphone app.
This past week I purchased a thermostat. There were "high-end" touch-only models, app-assisted models that also have analog controls, and then finally old-school analog only. I went with the middle/combo option so that I have analog as a fallback if the pure tech mode fails.
Being prepared can cost more and/or be less flashy (read: I didn't get touch-only), but it buys peace of mind, at least for critical components. I want a thermostat that works; I don't get no satisfaction from any bragging rights. Nod to the Rolling Stones.
I literally dealt with this just a few hours ago. I need a new HVAC system. I wanted the high-end model, but it will only work with their fancy cloud-connected thermostat. You cannot replace it with an off-the-shelf thermostat.
Have home automation? Sorry, you'll have to use the Internet.
I vote with my dollars, so it cost them the higher-margin sale. I also went with the mid-tier system, and grabbed a Z-Wave compatible thermostat along with it. I wonder if I'll miss the nifty variable-speed system?
I really wish everyone would stop trying to trap us into their walled gardens. Apple at least lets people write software for theirs, but the hardware/appliance manufacturers (not to mention the automotive folks) are awful about this.
> The details (the particular companies / systems etc) of this global incident don't really matter.
It definitely matters. The main issue here is that CrowdStrike was able to push an update to every server around the world where their agent is installed... it looks like an enormous botnet...
We need a detailed post-mortem on what happened here.
The other aspect of risk management is an acceptance that something going wrong isn't necessarily a reason to change what you are doing. If the plan was tacitly to run something at a 99% uptime, then incidents causing 1% downtime can be ignored.
We are going to get hit by some terrible outage eventually (I hope someone is tracking things like what happens if a big war breaks out and the GPS constellations all go down together). But having 10x providers won't help against the big IT-related threats which are things like grid outages and suchlike having cascading effects into food supplies.
> there should be at least ten alternatives in any diversified set of critical infrastructure service providers, all of them instantly replaceable / forced to provide interoperability...
And does anyone actually know how to implement this, at the scale required (dealing with billions of transactions daily), in a way that would resolve the problems we are seeing?
It very much seems like a data access problem; places can't access/modify data. The physical disks themselves are most likely fine, but the 'interfaces' are having trouble (assuming that the data isn't stored on the devices having the issue).
But in any case, how do you design a system where, if the 'main' interface is troubled, you can switch over instantly and seamlessly, duplicating access controls, permissions, data validation, logic, etc.?
There is a reason everything is centralised: it makes no financial sense to duplicate for an extremely unlikely and rare event. The world is random and these things will happen, but a global outage on this kind of scale is not a daily occurrence.
We'll look back in a few years and think "those were a crazy few hours" and move on...
> The details (the particular companies / systems etc) of this global incident don't really matter.
But they do matter. This is elementary. It's like saying "playing with matches doesn't matter". This is a problem that has happened before, albeit on a smaller scale, and the solution/cure is well known; imho it should have been established two decades ago in every org on the planet.
This is basic COBIT (or BYOFramework) stuff from 10-15-20 years ago.
How can you push a patch/update without testing it first? I get it if you are a tiny company with 1 IT person and 20 local PCs. Stuff like that cripples you for a couple of days. But when you are an org with 10k+ laptops and 500+ servers (half of them MS Win), how can you NOT test each and every update?
If you don't want to have the test/staging environments, then at least wait 1-3-5 days to see what the updates do to others/the news.
Sorry not sorry, guys and gals. I've been auditing systems and procedures for so many years that this is a basic failure. "One cannot just push an update without testing it first": any update, no matter how small or innocent.
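A minimal sketch of that "soak it, then stage it" policy on the consuming org's side; everything here is hypothetical (the update id, field names, and canary hosts are invented), it just shows the gating logic:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: an update is only approved for the broad fleet once it
# has (a) aged a few days in the wild and (b) survived an internal canary ring.
SOAK_DAYS = 3

def approve_for_fleet(update, canary_results):
    age = datetime.now(timezone.utc) - update["published_at"]
    if age < timedelta(days=SOAK_DAYS):
        return False, "still in the soak window, let the news find the landmines"
    if any(r["status"] != "healthy" for r in canary_results):
        return False, "canary ring reported failures"
    return True, "approved for staged rollout"

# Example usage with made-up data.
update = {"id": "sensor-update-example",
          "published_at": datetime(2024, 7, 15, tzinfo=timezone.utc)}
canary_results = [{"host": "test-win-01", "status": "healthy"},
                  {"host": "test-win-02", "status": "healthy"}]
print(approve_for_fleet(update, canary_results))
```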
> So I am not convinced that there need to be "at least ten alternatives" to be fail safe as society.
The required number "N for safety" is a good discussion to have. Risk-Return, Cost-Benefit etc are essential considerations. We live in the real world with finite resources and stark choices. But I would argue (without trying to be facetious) that they are risk management 102 type considerations.
Why? Because they must rely on the pretense of knowledge [1]. As digitization keeps expanding to encompass basically everything we do, the system becomes exceedingly complex, nobody has a good picture of all internal or external vulnerabilities and how much they might cascade inside an interconnected system.
Assessing cost versus benefit implies one can reasonably quantify all sides of the equation. In the absence of a demonstrably valid model of the "system" the prudent thing is to favor detail-agnostic rules of thumb. If these rules suggest that reducing unsafe levels of concentration is not economically viable there must be something wrong with the conceptual business model of digitization as it is now pursued.
Or perhaps it's just because companies release features, planes, devices, etc. without any form of QA, aiming just to increase their profits?
In this case, has CS done any QA on this release? Have they tested it for months on all the variations of the devices that they claim to support? It seems not.
Considering CS Falcon causes your performance to drop by about half and does the same to your battery life, I doubt they have any sort of QA that cares about anything but hitting stakeholder goals.
Yet, catastrophic failures like this happen, and people move on. Sure, there is that one guy who spent 10 years building a 10-fold redundancy plan, and his service didn't go down when the whole planet went down, but do people really care?
Unless his systems are up but critically dependent on other external systems (payment services, bucket storage, auth etc...) that are down. It's becoming increasingly difficult to not have those dependencies.
While this is a great theory, how would you actually accomplish this with antivirus software?
Multiple machines, each one using different vendor software? What other software needs to be partitioned this way? What about combinations of this software?
I’m just barely awake but don’t know if I’m affected yet. One of my devs is, our client support staff is, and I have no idea how our servers are doing just yet.
> It is risk management 101, never put all your digital eggs in one (or even a few) baskets.
I mean, plenty of businesses only have penguin eggs in their basket, and some sort of penguin problem would cause major problems for them. I believe the last time this happened was with the leap second thing around 2005 or thereabouts.
"Don't put all your eggs in one basket" sounds nice, but it would mean a completely different independent service all through your stack. That's not really realistic, IMHO.
The bigger issues here are: 1) some driver update "just" gets pushed (or how does this work?), and 2) there's no easy way to say "this is broken, restore the last version". That could even be automatic.
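A sketch of what that automatic "restore the last version" could look like if the agent tracked it itself, e.g. a boot-loop counter that reverts to the previous content version; this is entirely hypothetical (the state-file path and thresholds are made up), not how any real agent works:

```python
import json, os

STATE_FILE = "/var/lib/agent/update_state.json"  # hypothetical path
MAX_FAILED_BOOTS = 2

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"current": None, "previous": None, "failed_boots": 0}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def on_boot_start(state):
    # Incremented at startup, reset once the system proves healthy.
    state["failed_boots"] += 1
    if state["failed_boots"] > MAX_FAILED_BOOTS and state["previous"]:
        # Too many consecutive bad boots since the last update: roll back.
        state["current"], state["previous"] = state["previous"], None
        state["failed_boots"] = 0
    save_state(state)
    return state["current"]

def on_system_healthy(state):
    state["failed_boots"] = 0
    save_state(state)
```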
This isn't some global conspiracy, it's just incentives and economies of scale. When it's cheaper to pay a hyperexpert to handle your security, why wouldn't you?
The fact that physical distance is no longer a limit to who you do business with means that you can select the cheapest vendor globally, but then that vendor has an incentive to hyperspecialize (because everyone goes to them for this one thing), which means that even more people go to them.
Avoiding once-in-a-century events just isn't something we're willing to pay the extra cost for, except now we have around twenty places where these once-in-a-century events can happen, which kind of makes them more frequent.
How much stuff do you host on Hetzner instead of AWS?
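Rough back-of-the-envelope for the "around twenty places" point, assuming independent choke points each with a 1-in-100 chance of a catastrophic year (the numbers are invented for illustration, and independence is generous):

```python
# Probability that at least one of n "once-in-a-century" choke points
# fails in a given year, assuming independent failures.
p_single = 0.01
for n in (1, 5, 20):
    p_any = 1 - (1 - p_single) ** n
    print(f"{n:>2} choke points -> {p_any:.1%} chance of a bad year")
# 1 -> 1.0%, 5 -> ~4.9%, 20 -> ~18.2%
```

So the rare events stay rare for any single provider, but the system as a whole sees one far more often than once a century.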
Now they know the state of each of the affected companies' systems. How adept their sysops guys are, a bird's-eye view of their security practices. Nice move, and plausibly deniable too :D.
I mean, how did this happen at all? Are there no checks in place at CrowdStrike? Like deploying the new update to a selected set of machines, checking whether everything is OK, and then releasing it to the wild incrementally?
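That kind of check usually takes the shape of a staged rollout gated on health telemetry; a minimal sketch, with invented ring sizes, a placeholder health check, and a made-up update id (none of this reflects any vendor's actual pipeline):

```python
import time

# Hypothetical rollout rings, smallest first.
RINGS = [("internal", 0.001), ("canary", 0.01), ("early", 0.10), ("broad", 1.00)]

def ring_is_healthy(ring_name: str) -> bool:
    """Placeholder: in reality this would compare crash/telemetry data
    from machines in the ring against a pre-rollout baseline."""
    return True

def rollout(update_id: str) -> bool:
    for ring_name, fraction in RINGS:
        print(f"pushing {update_id} to {fraction:.1%} of fleet ({ring_name})")
        time.sleep(1)  # stand-in for a real soak period (hours or days)
        if not ring_is_healthy(ring_name):
            print("health regression detected, halting and rolling back")
            return False
    return True

rollout("content-update-example")
```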
> When the entire society and economy are being digitized AND that digitisation is controlled and passes through a handful of choke points its an invitation to major disaster.
Once again, it's Microsoft, directly, or indirectly, choosing a strategy of eventually getting all worldwide Windows desktops online, and connected via their systems.
Which is why I installed Fedora after Windows 7 and never looked back. 100% local, 100% offline if needed.
My company is looking to move to a non-Microsoft desktop. We're not affected by this, but it will certainly encourage us to move sooner rather than later.
Society was able to move to mass WFH on a global scale in a single month during Covid, thanks to the highly centralized and efficient cloud infrastructure. That could have easily saved tens of millions of lives (Imagine the Spanish flu with mass air travel, no vaccines, no strain-weakening)
These small 'downages' basically never cause serious issue. Your solutions are just alarmist and extremely costly (though they will provide developer employment...).
> These small 'downages' basically never cause serious issue.
Hospitals, airlines, 911, grocery stores, electric companies, gas companies, all offline. There will be more than a few people dead as an indirect result of this outage, depending on how long it lasts.
> These small 'downages' basically never cause serious issue.
Emergency Departments and 911 were knocked offline. People will indirectly die because of this, just like the last time 911 went down, and just like the last time EDs went down.
If CrowdStrike can cause this with a faulty update (allegedly), what do you think could happen to Western infrastructure from a full blown cyberwar? It's a valid risk.
> Society was able to move to mass WFH on a global scale in a single month during Covid
I don't know how much WFH saved lives, seeing as ordered isolation and social distancing were a thing during the Spanish Flu too (you just take the economic hit). But yes, it allowed companies to keep maintaining profits. Those that couldn't WFH got paid in most countries anyway (furlough in England, etc.).
true, but incentives should be in place to encourage a more diverse array of products, at the moment with many solutions (especially security) it is a choice between that one popular known product (Okta, CrowdStrike, et al, $$$) and bespoke ($$$$$$$$$$).
If only because we can then move away from one-size-fits-all, while mitigating the short-term impact of events like the above.