
Two things are clear, though:

Nobody ran this update before it shipped.

The update was pushed globally to all computers.

From that alone we know they failed the simplest of quality-control methods for a piece of software as widespread as theirs. And that's before even getting into the fact that there should have been some kind of error handling to allow the computer to boot if they did push bad code.



While I agree with this, from a software engineering perspective I think it's more useful to look at the lessons learned. I think it's too easy to just throw "Crowdstrike is a bunch of idiots" against the wall, and I don't think that's true.

It's clear to me that CrowdStrike saw this as a data update vs. a code update, and that they had much more stringent QA procedures for code updates than they did for data updates. It's very easy for organizations to lull themselves into this false sense of security when they make these kinds of delineations (sometimes even subconsciously at first), and then over time they lose sight of the fact that a bad data update can be just as catastrophic as a bad code update. I've seen shades of this issue elsewhere many times.

So all that said, I think your point is valid. I know CrowdStrike's posture was that they wanted vulnerability files deployed globally as fast as possible upon a new threat detection in order to protect their clients, but it wouldn't have been that hard to build some simple checks into their build process (first deploy to a test bed, then deploy globally), even if they felt a slower staged rollout would have left too many of their clients unprotected for too long.
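A deterministic ring assignment is roughly the minimum you'd need for that kind of canary stage. Here's a rough sketch of the idea in C (entirely hypothetical names and numbers, nothing to do with CrowdStrike's actual pipeline):

    /* Hypothetical sketch of staged-rollout gating. Hosts are hashed into
     * rings; a content update is only served to rings at or below the
     * current rollout stage, so a bad file hits a small canary population
     * before it can hit the whole fleet. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_RINGS 4  /* 0 = earliest canary ring, 3 = full fleet */

    static uint32_t fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 16777619u;
        }
        return h;
    }

    /* Deterministically place a host into a rollout ring. */
    static int rollout_ring(const char *host_id)
    {
        return (int)(fnv1a(host_id) % NUM_RINGS);
    }

    /* Serve the new channel file only if the host's ring has been reached. */
    static int should_receive_update(const char *host_id, int current_stage)
    {
        return rollout_ring(host_id) <= current_stage;
    }

    int main(void)
    {
        /* Stage 0: only ring-0 hosts get the new file. */
        printf("HOST-0001 gets the update at stage 0: %d\n",
               should_receive_update("HOST-0001", 0));
        return 0;
    }

In practice you'd pin your own test-bed machines to ring 0 rather than relying on the hash, but the gating logic is the same: the stage only advances once the earlier rings have survived for some soak period.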

Hindsight is always 20/20, but I think the most important lesson is that this code vs data dichotomy can be dangerous if the implications are not fully understood.


It could have been OK to expedite data updates, had the code treated configuration data as untrusted input, as if it could have been written by an attacker. That means fuzz testing and all that.
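For concreteness, a minimal sketch of what "treat the content file as untrusted input" looks like at the code level (made-up struct and limits, not the real sensor), i.e. exactly the kind of surface a fuzzer would hammer:

    /* Hedged sketch: validate a decoded template before the interpreter
     * ever touches it, and reject anything out of spec instead of
     * trusting that the pipeline only ships well-formed files. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_SUPPORTED_PARAMS 20

    struct template_instance {
        uint32_t param_count;
        const char *params[MAX_SUPPORTED_PARAMS];
    };

    /* Returns 0 on success, -1 if the instance is malformed or out of spec.
     * A failed validation should disable the rule, never crash the kernel. */
    static int validate_template(const struct template_instance *t)
    {
        if (t == NULL)
            return -1;
        if (t->param_count > MAX_SUPPORTED_PARAMS)
            return -1;                  /* more fields than we have room for */
        for (uint32_t i = 0; i < t->param_count; i++) {
            if (t->params[i] == NULL)
                return -1;              /* missing value */
        }
        return 0;
    }

    int main(void)
    {
        /* A file claiming 21 parameters is out of spec and gets rejected. */
        struct template_instance t = { .param_count = 21 };
        return validate_template(&t) == 0 ? 1 : 0;  /* exit 0 = correctly rejected */
    }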

Obviously the system wasn't very robust, as a simple, within-spec change could break it. A company like CrowdStrike, which routinely deals with memory exploits and claims to do "zero trust", should know better.

As is often the case, there's a good chance it's an organizational problem. The team in charge of the parsing expected that the team in charge of the data did their tests and made sure the files weren't broken, while on the other side, they expected the parser to be robust and that, at worst, a quick rollback could fix the problem. This may indeed be the sign of a broken company culture, which would give some credit to the ex-employees.


> Obviously the system wasn't very robust, as a simple, within-spec change could break it.

From my limited understanding, the file was corrupted in some way. Lots of NULL bytes, something like that.


That rumor floated around Twitter but the company quickly disavowed it. The problem was that they added an extra parameter to a common function but never tested it with a non-wildcard value, revealing a gap in their code coverage review:

https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann...


From the report, it seems the problem is that they added a feature that could use 21 arguments, but there was only enough space for 20. Until this update, no configuration used all 21 (the 21st was a wildcard regex, which apparently was never actually evaluated), but when one finally did, it caused an out-of-bounds read and crashed.
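To make the bug class concrete, here's a simplified and hypothetical illustration (not the actual driver code) of an interpreter with room for 20 input values being asked to evaluate a 21st field:

    /* Simplified illustration of the bug class described in the report:
     * the interpreter holds 20 input values, but the new template type
     * defines 21 fields, so evaluating the 21st field reads past the
     * end of the array. */
    #include <stdio.h>

    #define NUM_INPUTS 20                   /* what the interpreter was built for */

    static const char *inputs[NUM_INPUTS];  /* only 20 values are ever supplied */

    static const char *get_field(int index)
    {
        /* Missing bounds check: nothing stops index from being 20. */
        return inputs[index];
    }

    int main(void)
    {
        for (int i = 0; i < NUM_INPUTS; i++)
            inputs[i] = "value";

        int defined_fields = 21;            /* the new template type */

        /* Fine as long as the 21st field is a wildcard that never gets
         * evaluated; the first non-wildcard 21st field triggers the
         * out-of-bounds read (undefined behavior; in the driver it meant
         * dereferencing out-of-bounds memory in kernel mode and crashing). */
        for (int i = 0; i < defined_fields; i++)
            printf("field %d -> %s\n", i, get_field(i));

        return 0;
    }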


> It's clear to me that CrowdStrike saw this as a data update vs. a code update, and that they had much more stringent QA procedures for code updates than they did for data updates.

It cannot have been a surprise to CrowdStrike that pushing bad data had the potential to bork the target computer. If they really had that attitude, it would indicate striking incompetence. So perhaps you are right.


> It's clear to me that CrowdStrike saw this as a data update vs. a code update

> Hindsight is always 20/20, but I think the most important lesson is that this code vs data dichotomy can be dangerous if the implications are not fully understood.

But it's not some new condition that the industry hasn't already been dealing with for many, many decades (i.e., code vs. config vs. data vs. any other type of change to a system).

There are known strategies to reduce the risk.


If they weren't idiots, they wouldn't be parsing data in a kernel-level module.


Crowdstrike is a bunch of idiots


I'm sorry but there comes a point where you have to call a spade a spade.

When you have the trifecta of regex, *argv packing and uninitialized memory you're reaching levels of incompetence which require being actively malicious and not just stupid.


Also, it's the _second_ time they've done this in a few short months.

They previously bricked Linux hosts with a similar type of update.

So we also know that they don't learn from their mistakes.


The blame for the Linux situation isn't as clear-cut as you make it out to be. Red Hat rolled out a breaking change to BPF which was likely a regression. That wasn't caused directly by a CrowdStrike update.


At least one of the incidents involved Debian machines, so I don't understand how Red Hat's change would be related.


Sorry, that's correct, it was Debian, but Debian did apply a RHEL-specific patch to their kernel. That's the relationship to Red Hat.


It's not about the blame, it's about how you respond to incidents and what mitigation steps you take. Even if they aren't directly responsible, they clearly didn't take proper mitigation steps when they encountered the problem.


How do you mitigate the OS breaking an API underneath you in an update? Test the updates before they come out? Even if you could, you'd still need to deploy a fix before the OS update hits the customers, and anyone that didn't update would still be affected.

The Linux case is just _very_ different from the Windows case. The mitigation steps that could have been taken to avoid the Linux problem would not have helped with the Windows outage anyway; the problems are just too different. The Linux incident was about an OS update breaking their program, while the Windows issue was about a configuration change they made triggering crashes in their driver.


You're missing the forest for the trees.

It's: a) an update, b) pushed out globally without proper testing, c) that bricked the OS.

It's an obvious failure mode that, with a proper incident response process, would have been surfaced by that specific incident and flagged as needing mitigation.

I do this specific thing for a living. You don't just address the exact failure that happened but try to identify classes of risk in your platform.

> Even if you could, you'd still need to deploy a fix before the OS update hits the customers, and anyone that didn't update would still be affected.

And yet the problem would still only affect CrowdStrike's paying customers. No matter how much you blame upstream, your paying customers are only ever going to blame their vendor, because the vendor had the discretion to test and not release the update. As their customers should.


Sure, customers are free to blame their vendor. But please, we're on HN; we aren't customers, we don't have skin in this game. So we can do better here and properly allocate blame, instead of piling on the CrowdStrike hate for internet clout.

And again, you cannot prevent your vendor from breaking you. Sure, you can magic up some convoluted process to catch it ASAP. But that won't help the poor sods who got caught in between.


> there should have been some kind of error handling

This is the point I would emphasize. A kernel module that parses configuration files must defend itself against a failed parse.
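A minimal sketch of that posture (hypothetical types and helpers, not the real driver API): parse, validate, and on any failure keep the last known-good configuration rather than taking the machine down:

    #include <stdlib.h>

    struct config {
        size_t rule_count;     /* stand-in for whatever a parsed config holds */
    };

    /* Stub parser: returns NULL on any malformed input. */
    static struct config *parse_content_file(const unsigned char *blob, size_t len)
    {
        if (blob == NULL || len == 0)
            return NULL;
        struct config *cfg = calloc(1, sizeof(*cfg));
        if (cfg != NULL)
            cfg->rule_count = len / 16;    /* pretend each rule is 16 bytes */
        return cfg;
    }

    /* Stub validator: reject anything out of spec. */
    static int validate_config(const struct config *cfg)
    {
        return (cfg->rule_count > 0 && cfg->rule_count < 10000) ? 0 : -1;
    }

    static struct config *active_cfg;      /* last known-good configuration */

    static void on_new_content_file(const unsigned char *blob, size_t len)
    {
        struct config *cfg = parse_content_file(blob, len);

        if (cfg == NULL)
            return;                        /* parse failed: keep the old config */
        if (validate_config(cfg) != 0) {
            free(cfg);                     /* out of spec: discard, keep the old config */
            return;
        }

        free(active_cfg);                  /* accept the update */
        active_cfg = cfg;
    }

    int main(void)
    {
        static const unsigned char good[64];
        on_new_content_file(NULL, 0);            /* bad file: nothing changes */
        on_new_content_file(good, sizeof good);  /* good file: becomes active */
        return active_cfg == NULL;
    }

The worst case with this shape of code is a detection rule that silently doesn't apply until someone notices the parse failures, which is a far better failure mode than a fleet of machines stuck in a boot loop.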



