This post was submitted on 19 Jul 2024
8 points (100.0% liked)


…according to a Twitter post by the Chief Informational Security Officer of Grand Canyon Education.

So, does anyone else find it odd that the file that caused CrowdStrike to freak out, C-00000291-00000000-00000032.sys, was 42KB of blank/null values, while the replacement file, C-00000291-00000000-00000033.sys, was 35KB and looked like a normal, if obfuscated, sys/.conf file?

Also, apparently CrowdStrike had at least 5 hours to work on the problem between the time it was discovered and the time it was fixed.

all 45 comments
[–] [email protected] 3 points 4 months ago* (last edited 4 months ago) (2 children)

Every affected company should be extremely thankful that this was an accidental bug, because if CrowdStrike gets hacked, the bad actors could basically ransom who knows how many millions of computers overnight.

Not to mention that CrowdStrike will now be a massive target for hackers trying to do exactly this.

[–] [email protected] 3 points 4 months ago (1 children)
[–] [email protected] 1 points 4 months ago (1 children)
[–] [email protected] 1 points 4 months ago

New vulnerability just dropped

[–] [email protected] 0 points 4 months ago (1 children)

On Monday I will once again be raising the point of not automatically updating software. Just because something has an update does not mean it's better, and it does not mean we should be running it on production servers.

Of course they won't listen to me but at least it's been brought up.

[–] [email protected] 0 points 4 months ago* (last edited 4 months ago) (1 children)

I thought it was a security definition download; as in, there's nothing short of not connecting to the Internet that you can do about it.

[–] [email protected] 0 points 4 months ago (1 children)

Well, I haven't looked into it for this piece of software, but essentially you can prevent automatic updates from reaching the network, usually because the network sits behind a firewall that you can use to block the update until you decide you want it.

Also, a lot of vendors recognize that businesses like to vet updates and so have more streamlined ways of doing it. For instance, Apple has a whole dedicated update system for iOS devices that only businesses have access to, where you can decide you don't want the latest iOS; you just don't enable it and it doesn't happen.

Regardless of the method, what should happen is that you download the update to a few testing computers (preferably also physically isolated from the main network) and run some basic checks to see if it works. In this case the testing computers would have blue-screened instantly, and you would have known that this is not an update you want on your systems. Usually, though, it requires a bit more investigation to uncover problems.
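A rough sketch of that kind of canary gate, with hypothetical host names and commands standing in for whatever tooling your environment actually uses:

import subprocess
import sys

# Hypothetical isolated canary machines; replace with your own test fleet.
TEST_HOSTS = ["canary-01.test.internal", "canary-02.test.internal"]

def canary_check(update_id: str) -> bool:
    """Apply an update to canary hosts and confirm they stay healthy before a wider rollout."""
    for host in TEST_HOSTS:
        apply = subprocess.run(["ssh", host, "apply-update", update_id])  # hypothetical command
        health = subprocess.run(["ssh", host, "run-health-checks"])       # hypothetical command
        if apply.returncode != 0 or health.returncode != 0:
            return False  # a single failing canary blocks the rollout
    return True

if __name__ == "__main__":
    sys.exit(0 if canary_check(sys.argv[1]) else 1)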

[–] [email protected] 0 points 4 months ago (1 children)

It makes me so fuckdamn angry that people make this assumption.

This CrowdStrike update was NOT pausable. You cannot disable these updates without disabling the service, as it pulls new fingerprint files nearly every day.

[–] [email protected] 1 points 4 months ago

I hear you, but there's no reason to be angry.

When I first learned of the issue, my first thought was, "Hey, our update policy doesn't pull the latest sensor to production servers." After a little more research I came to the same conclusion you did: aside from disconnecting from the internet, there's nothing we really could have done.

There will always be armchair quarterbacks; use this as an opportunity to teach. Life's too short to be upset about such things.

[–] [email protected] 3 points 4 months ago (2 children)

If I had to bet my money, I'd say a bad machine with corrupted memory pushed the file at the very final stage of the release.

The astonishing fact is that for security software I would expect all files to be verified against a signature (which would have prevented this issue and some kinds of attacks).
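A minimal sketch of that kind of check, assuming a detached Ed25519 signature shipped alongside each channel file (hypothetical paths and key; uses the Python cryptography library):

from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def load_verified(file_path: str, sig_path: str, pubkey_raw: bytes) -> bytes | None:
    """Return the file's bytes only if its detached signature verifies."""
    data = Path(file_path).read_bytes()
    signature = Path(sig_path).read_bytes()
    public_key = Ed25519PublicKey.from_public_bytes(pubkey_raw)
    try:
        public_key.verify(signature, data)  # raises InvalidSignature on any mismatch
    except InvalidSignature:
        return None  # reject corrupted or tampered updates instead of loading them
    return data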

[–] [email protected] 1 points 4 months ago (2 children)

So here's my uneducated question: Don't huge software companies like this usually do updates in "rollouts" to a small portion of users (companies) at a time?

[–] [email protected] 1 points 4 months ago

I mean yes, but one of the issues with "state of the art" AV is that vendors are trying to roll out updates faster than bad actors can push out code to exploit discovered vulnerabilities.

The code/config/software push may have worked on some test systems but MS is always changing things too.

[–] [email protected] 0 points 4 months ago (1 children)

Companies don't like to be beta testers. Apparently the solution is to just not test anything and call it production ready.

[–] [email protected] 1 points 4 months ago

Every company has a full-scale test environment. Some companies are just lucky enough to have a separate prod environment.

[–] [email protected] 1 points 4 months ago (1 children)

From my experience, it was more likely an accidental overwrite caused by human error, combined with recent policy changes that removed vetting steps.

[–] [email protected] 2 points 4 months ago (2 children)

I'm not a dev, but don't they have, like, A/B updates, or at least test their updates in a sandbox before releasing them?

[–] [email protected] 3 points 4 months ago

It could have been the release process itself that was bugged. The actual update that was supposed to go out was tested and worked; then the upload was corrupted or failed. They need to add tests on the actual released version instead of a local copy.
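For example, a small sanity check along those lines could hash what the update server actually serves and compare it against the build that passed testing (hypothetical URL and path):

import hashlib
import urllib.request

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def released_matches_tested(release_url: str, tested_build_path: str) -> bool:
    """Compare the artifact customers will actually download against the build that was tested."""
    with urllib.request.urlopen(release_url) as resp:
        released = resp.read()
    with open(tested_build_path, "rb") as f:
        tested = f.read()
    return sha256_hex(released) == sha256_hex(tested)

# e.g. released_matches_tested("https://updates.example.com/C-00000291.sys", "build/C-00000291.sys")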

[–] [email protected] 1 points 4 months ago

One would think. Apparently the world is their sandbox.

[–] [email protected] 2 points 4 months ago (1 children)

Ah, a classic off by 43,008 zeroes error.

[–] [email protected] 1 points 4 months ago (4 children)

The fact that a single bad file can cause a kernel panic like this tells you everything you need to know about using this kind of integrated security product. CrowdStrike is apparently a rootkit, and Windows apparently has zero execution integrity.

[–] [email protected] 3 points 4 months ago (1 children)

This is a pretty hot take. A single bad file can topple pretty much any operating system depending on what the file is. That's part of why it's important to be able to detect file corruption in a mission critical system.
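As a minimal illustration of that kind of corruption check, a loader could refuse any file that is empty or consists entirely of null bytes, which is what the bad channel file reportedly looked like (Python sketch, not CrowdStrike's actual code):

def looks_corrupted(path: str) -> bool:
    """Flag files that are empty or made up entirely of null bytes."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data) == 0 or data.count(0) == len(data)

# e.g. refuse to load the update if looks_corrupted("C-00000291-00000000-00000032.sys") is True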

[–] [email protected] 1 points 4 months ago* (last edited 3 months ago)

This was a binary configuration file of some sort though?

Something along the lines of:

if (config.parameter.read == garbage) {
     dont_panic();
}

Would have helped greatly here.

Edit: oh it's more like an unsigned binary blob that gets downloaded and directly executed. What could possibly go wrong with that approach?

[–] [email protected] 2 points 4 months ago

I’m not sure why you think this statement is so profound.

CrowdStrike is expected to have kernel level access to operate correctly. Kernel level exceptions cause these types of errors.

Windows handles exceptions just fine when code is run in user space.

This is how nearly all computers operate.

[–] [email protected] 1 points 4 months ago (2 children)

Yeah, pretty much all security products need kernel-level access, unfortunately. The Linux ones, including CrowdStrike, and also the open source tools SELinux and AppArmor, all need some kind of kernel component in order to work.

[–] [email protected] 1 points 4 months ago (1 children)

CrowdStrike has caused issues like this with Linux systems in the past, but it sounds like they have now moved to eBPF user mode by default (I don't know enough about low-level Linux to understand that, though, haha), and it now can't crash the whole computer. source

[–] [email protected] 1 points 4 months ago

As explained in that source, eBPF code still runs in kernel space. The difference is that it's not Turing complete and has protections in place to make sure it can't do anything too nasty. That being said, I'm sure you could still break something like networking or critical services on the system by applying the wrong eBPF code. It's on the authors of the software to make sure they thoroughly test and review their software prior to release if it's designed to work with the kernel, especially in enterprise environments. I'm glad this is something they are doing, though.
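For a concrete feel for that model, here's a tiny sketch of loading an eBPF program from user space; the kernel's verifier checks it before it is allowed to run in kernel space. It uses the bcc Python bindings (needs root and bcc installed) and is purely illustrative, not how CrowdStrike's sensor is built:

from bcc import BPF

# The program below is written in restricted C; the in-kernel verifier must accept it
# before it can attach. A program the verifier can't prove safe is rejected outright.
prog = """
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")
b.trace_print()  # stream the trace output until interrupted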

[–] [email protected] 0 points 4 months ago (1 children)

At least SELinux doesn't crash on a bad config file.

[–] [email protected] 2 points 4 months ago* (last edited 4 months ago)

I'm not praising CrowdStrike here. They fucked up big time. I'm saying that the concept of security software needing kernel access isn't unheard of, and is unfortunately necessary for a reason. There is only so much a security tool can do without that kernel-level access.

[–] [email protected] 1 points 4 months ago

Security products of this nature need to be tight with the kernel in order to actually be effective (and prevent actual rootkits).

That said, the old mantra of "with great power" comes to mind...

[–] [email protected] 0 points 4 months ago (1 children)

How can all of those zeroes cause a major OS crash?

[–] [email protected] 0 points 4 months ago (1 children)

If I send you on stage at the Olympic Games opening ceremony with a sealed envelope

And I say "This contains your script, just open it and read it"

And then when you open it, the script is blank

You're gonna freak out

[–] [email protected] 0 points 4 months ago (1 children)

Maybe. But I'd like to think I'd just say something clever like, "Says here that this year the pommel horse will be replaced by yours truly!"

[–] [email protected] 0 points 4 months ago (1 children)

Problem is that software cannot deal with unexpected situations the way a human brain can. Computers do exactly what programmers tell them to do, nothing more, nothing less. So if a situation arises that the programmer hasn't written code for, there will be a crash.

[–] [email protected] 0 points 4 months ago (1 children)

Poorly written code can't.

In this case:

  1. Load config data
  2. If data is valid: use config data
  3. If data is invalid: crash the entire OS

is just poor code.
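A minimal sketch of the safer pattern being described, using JSON as a stand-in for whatever format the real config files use (hypothetical file layout, not CrowdStrike's code):

import json
import logging
import shutil
from pathlib import Path

def load_config(path: Path, last_good: Path) -> dict:
    """Load a config file, falling back to the last known good copy instead of crashing."""
    try:
        config = json.loads(path.read_bytes())
        if isinstance(config, dict) and config:  # minimal validity check
            shutil.copyfile(path, last_good)     # promote this file to last known good
            return config
    except (OSError, json.JSONDecodeError):
        pass
    logging.error("config %s failed validation, falling back to %s", path, last_good)
    return json.loads(last_good.read_bytes())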

[–] [email protected] 0 points 4 months ago (1 children)

If AV suddenly stops working, it could mean the AV is compromised. A BSOD is a desirable outcome in that case. Booting a compromised system anyway is bad code.

[–] [email protected] 0 points 4 months ago (1 children)

You know there's a whole other scenario where the system can simply boot the last known good config.

[–] [email protected] 0 points 3 months ago (1 children)

And what guarantees that that "last known good config" is available, not compromised and there's no malicious actor trying to force the system to use a config that has a vulnerability?

[–] [email protected] 0 points 3 months ago* (last edited 3 months ago) (1 children)

The following:

  • An internal backup of previous configs
  • Encrypted copies
  • Massive warnings in the system that the currently loaded config has failed its integrity check

There's a load of other checks that could be employed. This is literally no different than securing the OS itself.

This is essentially a solved problem, but even then it's impossible to make any system 100% secure. As the person you replied to said: "this is poor code."

Edit: just to add, failure for the system to boot should NEVER be the desired outcome, especially when the party implementing that is a 3rd-party service. The people who set up these servers expect them to operate in order for things to work. Nothing is gained from a non-booting critical system, and there is literally EVERYTHING to lose. If it's critical then it must be operational.

[–] [email protected] 0 points 3 months ago (1 children)

The 3rd party service is AV. You do not want to boot a potentially compromised or insecure system that is unable to start its AV properly, and have it potentially access other critical systems. That's a recipe for a perhaps more local but also more painful disaster. It makes sense that a critical enterprise system does not boot if something is off. No AV means the system is a security risk and should not boot and connect to other critical/sensitive systems, period.

These sorts of errors should be alleviated through backup systems and prevented by not auto-updating these sorts of systems.

Sure, for a personal PC I would not necessarily want a BSOD, I'd prefer if it just booted and alerted the user. But for enterprise servers? Best not.

[–] [email protected] 1 points 3 months ago (1 children)

Sure, for a personal PC I would not necessarily want a BSOD, I’d prefer if it just booted and alerted the user. But for enterprise servers? Best not.

You have that backwards. I work as a dev and system admin for a medium-sized company. You absolutely do not want any server to ever not boot. You absolutely want to know immediately that there's an issue that needs to be addressed ASAP, but a loss of service generally means loss of revenue and, even worse, a loss of reputation. If your server is briefly at a lower protection level, that's not an issue unless you're actively being targeted and attacked. But if that's the case, then getting notified of an issue can get some people to deal with it immediately.

[–] [email protected] 2 points 3 months ago

A single server not booting should not usually lead to a loss of service as you should always run some sort of redundancy.

I'm a dev for a medium-sized PSP that, due to our customers, does occasionally get targeted by malicious actors, including state actors. We build our services to be highly available; e.g., a server not booting would automatically fail over to another one, and if that fails, several alerts will go off so that the sysadmins can investigate.

Temporary loss of service does lead to reputational damage, but if contained, most of our customers tend to be understanding. However, if a malicious actor could gain entry to our systems, the damage could be incredibly severe (depending on what they manage to access, of course), so much so that we prefer the service to stop rather than continue in a potentially compromised state. What's worse: service disrupted for an hour, or tons of personal data leaked?

Of course, your threat model might be different and a compromised server might not lead to severe damage. But Crowdstrike/Microsoft/whatever may not know that, and thus opt for the most "secure" option, which is to stop the boot process.

[–] [email protected] -1 points 4 months ago

School districts were also affected... at least mine was.