Looking for thoughts/opinions
I have a 5-disc raidz1 array. The drives are accumulating CKSUM errors, fairly evenly distributed across the discs. I've been lazy and let this progress to the point where there are permanent errors in files.
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 748K in 06:17:19 with 1 errors on Sun Jul 14 06:41:22 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD13YBW  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD13YE4  ONLINE       0     0     7
            ata-ST8000VN004-2M2101_WSD1454G  ONLINE       0     0     8
            ata-ST8000VN004-2M2101_WSD1454W  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD14563  ONLINE       0     0     7

errors: Permanent errors have been detected in the following files:

        /you/do/not/need/this/level of detail.txt
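To put a number on "fairly evenly distributed", the per-disc CKSUM column can be pulled out of the status output. This is a minimal shell sketch, assuming the device names start with `ata-` as they do in this pool and that the counts are plain integers. The function names `cksum_by_disk` and `cksum_total` are made up for illustration and are not part of ZFS.

```shell
#!/bin/sh
# Hypothetical helpers: both read `zpool status` text on stdin.

# Print "<device> <cksum>" for every line whose first field starts with
# "ata-" (an assumption that matches the device naming in this pool).
cksum_by_disk() {
    awk '$1 ~ /^ata-/ && NF >= 5 { print $1, $5 }'
}

# Sum the CKSUM column (fifth field) over those same lines.
cksum_total() {
    awk '$1 ~ /^ata-/ && NF >= 5 { sum += $5 } END { print sum + 0 }'
}
```

Typical use would be `zpool status -v tank | cksum_by_disk`. If one disc had most of the errors I'd suspect that disc or its cable; similar counts on all five point more towards something they share, such as the controller.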
I've done some research and believe (hope) that the cause of these errors is the "domestic" onboard SATA controllers I'm using, so I have ordered an LSI SAS3008 9300-8i HBA as an upgrade.
I know I can fix the permanent error by deleting the affected file, restoring it from backup and then running a scrub. But I'm torn: should I scrub now and risk stressing the array more on the crappy SATA controllers, or wait until I get the new HBA (in a few weeks, thanks to free, slow shipping)?
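For reference, the delete/restore/scrub sequence looks roughly like this. It is a sketch only: the pool name `tank` comes from the status output, the two file paths are placeholders, and the `run` and `repair_file` helpers are hypothetical names of mine. The `zpool clear` step is my own addition (it resets the per-device error counters so any new errors stand out) and can be dropped. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch under assumptions: POOL defaults to "tank"; paths passed to
# repair_file are placeholders for the damaged file and a known-good copy.
POOL="${POOL:-tank}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = damaged file on the pool, $2 = known-good copy from backup
repair_file() {
    run rm -- "$1"
    run cp -- "$2" "$1"
    run zpool clear "$POOL"       # optional: reset the error counters
    run zpool scrub "$POOL"       # re-verify every block against its checksum
    run zpool status -v "$POOL"   # check scrub progress and the error list
}
```

Calling `repair_file /path/to/damaged/file /path/to/backup/copy` prints the five commands; setting `DRY_RUN=0` would actually run them. The scrub is the step that reads every block, so that is the part that would load the onboard controllers.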
I'd shut it down before it corrupts even more, install the HBA when it arrives, and then run a scrub to see what the damage is.
I know that's the correct response. But, it's been running like this for many months, maybe even years - as I said in the post, I've been lazy. There's nothing on it that can't easily be restored, or replaced, and shutting it down would be a PITA.
There's always a chance your backups might get corrupted too if you let it continue like that.