this post was submitted on 07 Feb 2024
65 points (97.1% liked)

Linux

48144 readers
909 users here now

From Wikipedia, the free encyclopedia

Linux is a family of open source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991 by Linus Torvalds. Linux is typically packaged in a Linux distribution (or distro for short).

Distributions include the Linux kernel and supporting system software and libraries, many of which are provided by the GNU Project. Many Linux distributions use the word "Linux" in their name, but the Free Software Foundation uses the name GNU/Linux to emphasize the importance of GNU software, causing some controversy.

Rules

Related Communities

Community icon by Alpár-Etele Méder, licensed under CC BY 3.0

founded 5 years ago
MODERATORS
 

Hi everyone,

I have been experiencing some weird problems lately, starting with the OS becoming unresponsive "randomly". After reinstalling multiple times (different filesystems, tried XFS and BTRFS, different nvme slots with different nvme drives, same results) I have narrowed it down to heavy IO operations on the nvme drive. Most of the time, I can't even pull up dmesg, and force shutdown, as ZSH gives an Input/Output error no matter the command. A couple of times I was lucky enough for the system to stay somewhat responsive, so that I could pull up dmesg.

It gives a controller is down, resetting message, which I've seen on archwiki for some older Kingston and Samsung nvmes, and gives Kernel parameters to try (didn't help much, they pretty much disable aspm on pcie).

What did help a bit was reverting a recent bios upgrade on my MSI Z490 Tomahawk, causing the system to not crash immediately with heavy I/O, but rather mount as ro, but the issue still persists. I have additionally run memtest86 for 8 passes, no issues there.

I have tried running the lts Kernel, but this didn't help. The strange thing is, this error does not happen on Windows 11.

Has anyone experienced this before, and can give some pointers on what to try next? I'm at my wits end here. EDIT: When this issue first appeared, I assumed the Kioxia drive was defective, which the manufacturer replaced after. This issue still happens with the new replacement drive too, as well as the Samsung drive. I thus assume, that neither drives are defective (smartctl also seems to think so)

Here are hardware and software details:

  • Arch with latest Zen Kernel, 6.7.4, happened with other, older kernels too though, tried regular, lts and zen
  • BTRFS on LUKS
  • i9-10850k
  • MSI z490 Tomahawk
  • GSkill 3200 MHz RAM, 32GB, DDR4
  • Samsung 970 Evo 1TB & Kioxia Exceria G2 1TB (tested both drives, in both slots each, over multiple installs)
  • Vega 56 GPU
  • Be quiet Straight Power 11 750W PSU
you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 5 points 9 months ago (1 children)

ssds getting not enough power? i'd test it with different PSU, i had a problem with my ssd failing and changing PSU worked, apparently 3.3VDC rail is routed on the motherboards without any conversion straight to m.2/pcie devices

[–] [email protected] 1 points 9 months ago (3 children)

Unfortunately I don't have a spare PSU, but I might try to measure the 3.3 volt rail with a multimeter (don't own an oscilloscope unfortunately) while under load and see what happens

[–] [email protected] 1 points 9 months ago

Do you have a spare set up where you can boot up from that same SSD? Literally any laptop would work plug and play and that would rule out the possibility of it being the motherboard on the OP.

[–] [email protected] 1 points 9 months ago

Yeah, an oscilloscope would be handy in hunting spikes, it's a bit harder with a standard multimeter, you sure you don't know anyone with a spare PSU to borrow?

[–] [email protected] 0 points 9 months ago

you could also try one of those USB to m.2 and see if that works