I've been running my most recent server build for quite some time now. I think uptime was somewhere around 5 months. Absolutely flawless. A few days ago I started to have issues. Hard locks, freezing...but absolutely zero log entries. Nothing. The server was built with "off the shelf" hardware and no ECC (even though the Ryzen CPU technically supports it, at the time ECC 3200 MHz memory was still a lot more expensive than it is now) and is running ZFS. Risky business, but it's "just" a home server. I would never build a server running mission-critical stuff like that (and I've been building servers for over 10 years now as my main job). Over the last few weeks, I've been trying some stuff and had a pretty high memory load.
In any case, I also like astrophysics and subscribe to some newsletters about auroras and so on. Auroras are extremely rare here in southern Germany. Yesterday we had one of the biggest and brightest I've ever seen.
But it got me thinking about my hard locks and crashes, and I remembered I had an account for ESA's SSCC (SSA Space Weather Coordination Centre). They have something called "Post-Event Analysis", where you can correlate certain timestamps with real-time data, for example from DSCOVR ("THE" space weather satellite).
For auroras to occur, the so-called "Bz value" is important. Basically, it tells you the north-south direction of the interplanetary magnetic field. If it points north (positive Bz), most of the charged particles the sun throws at us get deflected. If it points south (negative Bz), it couples with Earth's magnetic field, and the particles "come in" and produce auroras, because the charged particles excite atoms in the upper atmosphere. They mostly excite oxygen, which results in green auroras. They can also do all sorts of other stuff (and that's why spaceships, sats and other stuff floating around in space need shielding). The value is measured in nanotesla (nT).
There's also the Kp-Index...which was 7-8, out of 9.
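For anyone who wants to do the same kind of correlation by hand, here is a minimal sketch of the idea: take a list of timestamped Bz readings and look up the one closest to each freeze. The function name `nearest_sample`, the 5-minute tolerance, and the sample values are all assumptions made for illustration. This is not how the SSCC tool works internally, and the numbers are placeholders, not real DSCOVR measurements.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_sample(samples, when, max_gap=timedelta(minutes=5)):
    """Return the (timestamp, bz_nt) sample closest to `when`.

    `samples` must be sorted by timestamp. Returns None if there is no
    sample within `max_gap` of `when`.
    """
    times = [t for t, _ in samples]
    i = bisect_left(times, when)
    # Only the sample just before and the one at/after `when` can be closest.
    candidates = samples[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: abs(s[0] - when))
    return best if abs(best[0] - when) <= max_gap else None


# Made-up placeholder readings (timestamp, Bz in nT), NOT real measurements.
samples = [
    (datetime(2000, 1, 1, 0, 0), 2.0),
    (datetime(2000, 1, 1, 0, 1), -5.0),
    (datetime(2000, 1, 1, 0, 2), -12.0),
]
freeze = datetime(2000, 1, 1, 0, 1, 40)  # hypothetical freeze time
print(nearest_sample(samples, freeze))
```

A match in time is only a correlation, of course. It doesn't prove the freeze was caused by a particle strike.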
So yeah, I'm pretty sure I experienced a single-event upset/bit flip. Amazing stuff!
Edit: Picture of the Aurora https://i.imgur.com/TIxketJ.jpg
Shouldn't ZFS have detected the bad data and repaired itself from redundancy though?
In memory?
Oh! I thought OP was referencing OS files from the drive.
It also wouldn't cause hard locks and freezes without any errors.
It certainly could. A bit-flip in a core part of the kernel could easily cause it to lock up, if an address is corrupted and it starts writing garbage over its code, or execution jumps to somewhere unexpected, or an instruction is changed from something reasonable to a halt.
Yes, most of those should trigger a blue screen or kernel panic, but that's not guaranteed when you're making completely random changes.
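To make the "address is corrupted" case concrete, here is a small sketch of what a single flipped bit does to a 64-bit value. The address is a hypothetical one picked purely for illustration, and `flip_bit` is just a helper name I made up. Real memory corruption obviously doesn't go through Python.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the given bit position inverted (XOR with a one-bit mask)."""
    return value ^ (1 << bit)


# Hypothetical 64-bit kernel-style address, chosen only for illustration.
address = 0xFFFF_FFFF_8100_0000

# Flipping a low bit moves the pointer by only a few bytes...
near = flip_bit(address, 3)
# ...while flipping a high bit sends it a terabyte away.
far = flip_bit(address, 40)

print(hex(address))  # 0xffffffff81000000
print(hex(near))     # 0xffffffff81000008
print(hex(far))      # 0xfffffeff81000000
```

Which bit gets hit is pure chance, which is why the outcome ranges from "nothing noticeable" to a clean panic to a silent lock-up.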
Sure, I should have mentioned that the system itself runs not on ZFS but from its own SSD. So a "ZFS cache in memory bit flip" should (theoretically...) not cause a hard lock/freeze. It would probably trigger a complete garbage collection though.
And yes - that's what was so confusing to me, no kernel panic, no log entry...nothing, just a sudden, random freeze.
Right, a bit flip in ZFS cache shouldn't cause that. But a bit flip in active memory could.
Absolutely! And I think that's actually what happened :)
It probably did - but that's not why the server crashed :)