A true life story of a way to do something.
Disclaimer
I have left out some details for clarity. Available options were dictated by the even older, mature technology involved, the business-critical nature of the environment, and the time and hardware available to create the backup solution and keep it running.
Background
Over 10 years ago I worked with customer sites hosting multiple systems, 16 TB in each chassis spread across 4 TB hardware RAID 5 arrays. I made suggestions for improving the robustness of the way we stored data, but management was not interested.
All of the stored data was created once, never modified, and only deleted after 5-10 years based on variable retention rules. We had customers who had been with us longer than that.
Our system required careful use. If the customer modified the wrong retention rule, the system would purge all of the affected data immediately. There was no warning and no 'undo'. The change might go unnoticed for anywhere from a few days to a week, depending on how often audits were done.
Get on with IT!
This was our Saga Customer, good at breaking things in weird, unconventional, and time-consuming ways. There was the Saga that caused the push for a backup solution, where we provided technical recovery and were moderately successful.
Then there was the Saga around the creation of the backup solution itself (not completely their fault). Normally we stayed away from backup-related issues, leaving our customers to create their own solutions, but this customer was more than a beta site.
Everything was backed up weekly (full) and daily (incremental), covering 5-10 GB of daily changes - both additions and routine retention-based delete events. Available system I/O was an issue when it came to speed. Each data file was 20-30 seconds of Base64-encoded, 8-bit stereo audio, and was about 350 KB. I'll let you do the math on how many fit in 4 TB.
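(If you don't feel like doing it: 4 TB at roughly 350 KB per file works out to something on the order of 11 million files per array, and four times that in a full 16 TB chassis.)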
Small recoveries were just restored from the backup system directly onto the target Host system, and commands were run to recreate the database entries pointing at the restored data. Each file header contained the information that went into the DB.
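Conceptually, that re-ingest step looked something like the sketch below. The header layout, the .rec extension, and the SQLite catalog are stand-ins I've made up for illustration; the real tooling, schema, and header format were proprietary.

```python
import sqlite3
from pathlib import Path

# Hypothetical fixed-width header layout: record id, creation timestamp,
# retention rule id. The real header format was proprietary.
HEADER_BYTES = 64

def parse_header(raw: bytes) -> dict:
    """Pull the fields the catalog needs out of a file header (made-up layout)."""
    text = raw.decode("ascii", errors="replace")
    return {
        "record_id": text[0:16].strip(),
        "created": text[16:36].strip(),
        "retention_rule": text[36:44].strip(),
    }

def reingest(restore_root: str, db_path: str) -> None:
    """Walk the restored files and recreate the catalog rows that point at them."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS recordings "
        "(record_id TEXT PRIMARY KEY, created TEXT, retention_rule TEXT, path TEXT)"
    )
    for f in Path(restore_root).rglob("*.rec"):
        with open(f, "rb") as fh:
            meta = parse_header(fh.read(HEADER_BYTES))
        db.execute(
            "INSERT OR REPLACE INTO recordings VALUES (?, ?, ?, ?)",
            (meta["record_id"], meta["created"], meta["retention_rule"], str(f)),
        )
    db.commit()
    db.close()

if __name__ == "__main__":
    # Example paths only.
    reingest("/restore/volume1", "catalog.db")
```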
Larger events affecting multiple TB (Saga) had a different recovery procedure. That we even needed this alternate process says a lot.
When a major deletion event occurred, the remaining data on the Host was shifted to consolidate it and free up one of the storage volumes. It is always faster to shift files between volumes on the same system than over a network, especially when production load is still active on that system.
Because of the backup retention rules and the available backup storage, all of the backups were compressed to save space. Deduplication wasn't a feature we had access to; only one vendor at that time even offered it.
While Host consolidation was happening, a Recovery system (a spare host, because downtime was not an option) was used to restore all of the missing data from the compressed archives onto a new storage volume array. The recovery took a few days because you needed the last full backup taken before the change, plus all of the incrementals applied up to the time the error was made. It also took longer due to the compression, but we didn't think the process would become a regular one when we created it.
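The selection of what to restore boiled down to the logic below. This is only a sketch with a made-up catalog; the real backup software tracked its own catalog and we drove it through the vendor's tools.

```python
from datetime import datetime

# Hypothetical backup catalog: (type, completion time) pairs.
CATALOG = [
    ("full", "2010-03-07 02:00"),
    ("incr", "2010-03-08 02:00"),
    ("incr", "2010-03-09 02:00"),
    ("full", "2010-03-14 02:00"),
    ("incr", "2010-03-15 02:00"),
    ("incr", "2010-03-16 02:00"),
]

def _ts(stamp: str) -> datetime:
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M")

def restore_set(catalog, error_time: str):
    """Last full taken before the error, plus every incremental between that
    full and the error, in the order they need to be applied."""
    err = _ts(error_time)
    fulls = [b for b in catalog if b[0] == "full" and _ts(b[1]) < err]
    if not fulls:
        raise RuntimeError("no full backup predates the error")
    base = max(fulls, key=lambda b: _ts(b[1]))
    incrs = [b for b in catalog
             if b[0] == "incr" and _ts(base[1]) < _ts(b[1]) < err]
    return [base] + sorted(incrs, key=lambda b: _ts(b[1]))

if __name__ == "__main__":
    # Say the bad retention change happened mid-morning on the 16th.
    for backup in restore_set(CATALOG, "2010-03-16 10:30"):
        print(backup)
```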
Sneakernet Happens!
Once the restore completed and was validated, after-hours work was scheduled. Both the Recovery and Target hosts were shut down. The empty array was pulled from the Host chassis and replaced with the array from the Recovery chassis. On Target Host boot, the system detected the volume change and prompted for where to mount it in the file structure. Once mounted, the restored data was scanned by our tools, and the various back-end databases had entries added, restoring access through our software.
Afterwards the empty array was put into the Recovery system, and that controller was used to break the array so the drives could go to the top of the site's replacement pool for reuse, if there were no issues with the drives. The invalid incrementals and any full backups created after the error also needed to be removed from the backup system to recover the space before the 30-day retention window ran out.
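Identifying which backup sets to pull was essentially the mirror image of the restore selection, roughly the sketch below. Again, the catalog is made up, and the actual removal went through the backup vendor's admin tools.

```python
from datetime import datetime

def _ts(stamp: str) -> datetime:
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M")

def invalidated_sets(catalog, error_time: str):
    """Backups taken after the error only captured the post-purge state,
    so they just held space we needed back."""
    err = _ts(error_time)
    return [b for b in catalog if _ts(b[1]) > err]

if __name__ == "__main__":
    catalog = [
        ("full", "2010-03-14 02:00"),
        ("incr", "2010-03-15 02:00"),
        ("incr", "2010-03-16 02:00"),
        ("incr", "2010-03-17 02:00"),  # ran after the bad retention change
        ("full", "2010-03-21 02:00"),  # weekly full of the already-purged data
    ]
    for backup in invalidated_sets(catalog, "2010-03-16 10:30"):
        print("remove:", backup)
```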