It's A Digital Disease!


This is a sub that aims at bringing data hoarders together to share their passion with like minded people.

1
 
 
The original post: /r/datahoarder by /u/akaxd123 on 2024-12-26 22:25:05.

Can't right click :c

2
 
 
The original post: /r/datahoarder by /u/Matti_Meikalainen on 2024-12-26 22:24:25.

https://github.com/ledimestari/Backblaze-progress

It took me a while to find the time to clean this up and post it to GitHub, but now I've done it.

This is the documentation for that dashboard I posted earlier.

As I was educated in the comments of my previous post, this is no longer a stupidly complex screenshot + OCR bundle; the script now parses the XML files generated by the Backblaze client.
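
For a rough idea of the approach, here is a minimal XML-parsing sketch in Python; the file path, element, and attribute names are placeholders rather than the actual Backblaze schema, so see the repo for the real parsing.

    import xml.etree.ElementTree as ET

    # Placeholder path and tag/attribute names for illustration only;
    # the real file names and schema used by the Backblaze client differ.
    tree = ET.parse(r"C:\ProgramData\Backblaze\bzdata\example_progress.xml")
    root = tree.getroot()

    for drive in root.iter("drive"):          # hypothetical element
        name = drive.get("name")              # hypothetical attribute
        remaining = drive.get("bytes_remaining")
        print(f"{name}: {remaining} bytes left to back up")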

Graphs for individual drives have also been added.

Screenshots

https://i.imgur.com/cp5yz8Z.png

https://i.imgur.com/YlNfLEl.png

This won't help you set up InfluxDB or Grafana, but I guess if you're reading this you're probably familiar with those already.

Hope you like it. :)

3
 
 
The original post: /r/datahoarder by /u/mike_six on 2024-12-26 20:38:19.

A few months ago I began archiving Sports Illustrated issues from their Vault website. I created a macro for my mouse that would right-click and save the image, then advance the slideshow to the next page. In a couple of minutes I would have all the images saved and could then make a file I could view in a comic book reader app.

Now SI is under new management and all that's available is the cover image plus the next page. After that, the slideshow seems to be broken. I thought about copying the link and doing a URL-sequence batch download in uGet, but the folder names for the images seem to change frequently.

Their website has been broken for nearly a year, so the new owners don't seem interested in fixing this. Is there a different way to navigate around the website and grab these images somehow?
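
Since sequential URL guessing breaks when the image folders change, one thing worth trying is pulling the image URLs out of each issue page's HTML instead. A rough sketch follows; the page structure and selector are assumptions, and the site may load images via JavaScript, in which case the URLs may sit in an embedded JSON blob instead:

    import requests
    from bs4 import BeautifulSoup

    def page_image_urls(issue_url):
        """Grab every <img> src from an issue/slideshow page (structure is assumed)."""
        html = requests.get(issue_url, timeout=30,
                            headers={"User-Agent": "Mozilla/5.0"}).text
        soup = BeautifulSoup(html, "html.parser")
        return [img["src"] for img in soup.find_all("img") if img.get("src")]

    # Hypothetical issue URL, just to show usage
    for url in page_image_urls("https://vault.si.com/some-issue"):
        print(url)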

4
 
 
The original post: /r/datahoarder by /u/manzurfahim on 2024-12-26 20:18:22.

https://preview.redd.it/gcp2y3q9099e1.jpg?width=1500&format=pjpg&auto=webp&s=1ca465758ec29e8d83768c93e192d902973c1b2b

https://preview.redd.it/bit4ooaa099e1.jpg?width=989&format=pjpg&auto=webp&s=0a9eada30caae770ff7cecf1cc3b6628df507938

https://preview.redd.it/dplmaob8199e1.jpg?width=800&format=pjpg&auto=webp&s=d158fff382ddc5c9ff7ed9b3e4dd20c28881e115

So, I decided to gift myself some storage this Christmas.

  1. 2 x Seagate Exos X22 20TB HDDs from Serverpartdeals (2024 manufacture year)
  2. 1 x 4TB Kingston Fury Renegade NVMe (4000 TBW)
  3. 1 x 4TB Crucial T700 NVMe (2400 TBW)

Going to retire my old, trusty MegaRAID 9271-8i card, currently running two arrays (5 x 16TB RAID5 and 3 x 18TB RAID5).

I purchased a 9361-8i card and the cables. I already have 3 x 20TB drives; once these two drives arrive, I'll make a 5 x 20TB RAID 5 array and repurpose the 5 x 16TB drives for offline storage. Once I save some more money, I'll get three more 20TB drives and make an 8 x 20TB RAID 6 array.

The NVMe SSDs are going inside the system; one of them will replace an old 500GB NVMe drive.

This is my plan so far. I'd love to hear what you think, and if you have any ideas, I'd love to hear them.

5
 
 
The original post: /r/datahoarder by /u/SecretlyCarl on 2024-12-26 19:59:02.

Not sure this is the right sub for this but I figured this community would appreciate the purpose of the script.

I recently downloaded ~80k epubs (zip folder, no way to pre-select what I wanted). I didn't want to keep ALL of them, but I also didn't want to go through them one by one. I spent the last few days chatting with chatgpt to get a working script, and now I want to make it more efficient. Right now it takes about 3hr to process 1000 books, so 80k would take a few days.

In the readme I outline the flow of the script. It uses an LLM to clean up filenames, passes them to Goodreads to parse genres, and saves the genres in a txt file. The txt files are then used in a separate GUI script to filter, delete, and move the epubs by genre.

From what I can tell, the main slowdown is being caused by the way selenium webdriver and beautifulsoup are being implemented.

Here is the github repo - https://github.com/secretlycarl/epub_filter_tool

And the file I'm looking for advice about - https://github.com/secretlycarl/epub_filter_tool/blob/main/grsearch/grsearch.py
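
One common way to speed this kind of pipeline up (not a claim about how the linked script should be structured) is to replace the Selenium browser with plain HTTP requests and fetch pages concurrently. A rough sketch, where the Goodreads search URL and the CSS selector are assumptions for illustration:

    import concurrent.futures
    import requests
    from bs4 import BeautifulSoup

    def search_goodreads(title):
        """Fetch a Goodreads search page for a cleaned-up title and return candidate hits.
        The URL pattern and selector are assumptions; the real page may differ."""
        resp = requests.get("https://www.goodreads.com/search",
                            params={"q": title}, timeout=30,
                            headers={"User-Agent": "epub-filter-tool"})
        soup = BeautifulSoup(resp.text, "html.parser")
        return [a.get_text(strip=True) for a in soup.select("a.bookTitle")][:5]

    titles = ["Dune", "Neuromancer", "Hyperion"]  # cleaned filenames would go here
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for title, hits in zip(titles, pool.map(search_goodreads, titles)):
            print(title, hits)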

6
 
 
The original post: /r/datahoarder by /u/LowerDoor on 2024-12-26 19:02:08.

I do not have the 1.2TB limit on my service, which leads me to wonder: at what point would they say something about my data usage? I pay for the most expensive plan they have.

I had to re-download part of my Steam library and it used 1.8TB, and I still don't have all my games.

7
 
 
The original post: /r/datahoarder by /u/NinjaskPvP on 2024-12-26 18:46:28.

I have a Facebook group with 40 thousand members, and I want to preserve everything in it because it has very important content. Is there a way for me to do this? I have admin privileges.

8
 
 
The original post: /r/datahoarder by /u/JudgeStock3888 on 2024-12-26 18:43:47.

I’ve already tried the Wayback Machine, but although the link was archived, the video itself wasn’t :(

9
 
 
The original post: /r/datahoarder by /u/S3ND_ME_PT_INVIT3S on 2024-12-26 18:32:23.

I'm not really in a position to build a server I'd be able to keep expanding; I've been thinking that burning BD-Rs with cover art etc. to store in binders might be better than continuing to purchase external HDDs. Blu-ray discs last like 100 years, right? I've got like 150 BD-R discs lying around. That'll just cover like one HDD, out of the 10+ I have. Still, I'm starting to think burning and storing away might be better than adding more and more externals. How are y'all storing those Linux distros? What's the best plan for long term, on a budget? Most of it is pretty much cold storage, so BD-R would make more sense, I'd think?

10
 
 
The original post: /r/datahoarder by /u/Googles_Janitor on 2024-12-26 18:20:35.

Has anyone hoarded the entirety of Wikipedia? Any sense of scale and how this could be done?

11
 
 
The original post: /r/datahoarder by /u/Ok-Development7092 on 2024-12-26 17:27:34.

I thought this might fit this subreddit better than r/techsupport so here I am.

I will preface this with some info first:

I'm a college student who experienced data loss for the first time a year ago because of the damn Windows EFS (Encrypting File System), and I am very afraid of my data being lost again.

A few months ago, I got an old Acer PC for free that I have been using for testing out Linux distros, and it stores the only copy of old data from my devices (so I could reformat them). Thing is, all 3 drives in there are used/second-hand, and one of them even has read errors on some sectors whenever I run a full disk test (I kept that part unformatted).

CPU : AMD A6-6400K APU

Motherboard: MSI A68HM-E33 V2

And now I want to use it as a simple backup PC so I can at least have some peace of mind if one of my devices fails, or if one of the drives fails (at least I could rebuild the array, which is better than nothing). The motherboard has 4 SATA ports, so I'll use one port for the OS (Windows/Linux distros) and the other three for the array.

Also, I can only buy used HDDs. Used 512GB HDDs are $8 max where I am, so I can buy three after 4 weeks or so. I know that RAID is not a good backup solution, but I don't have many options right now. I DO plan on thoroughly testing the drives before I try to store my data on them.

Also, I do not care about read/write speeds. This will only be used for backup and testing linux distros(on a different HDD), and I am considering a RAID 5 with three 512GB drives. Currently, 1TB can fit all of my important data so I'm fine with that. The PC will also be kept disconnected(after updates) so no worries on that part.

Here's the questions:

  1. Before everything else, is it even worth it to RAID the drives with these specs?
  2. If so, which OS should I use? win10 and debloat it heavily, or linux with mdadm?
  3. Is it fine to have different brands/sizes(2.5 or 3.5) of HDDs?
  4. Is there anything else I should consider, or anything else I'm missing?

12
 
 
The original post: /r/datahoarder by /u/erparucca on 2024-12-26 17:26:09.

First of all: shame and blame on me. I have a huge amount of disk space and an LTO5 library, so I am the only one to blame, but that's not the topic. I won't get into why, at the moment, I didn't have a 2nd (and 3rd) copy of that data, which I should have had.

Starting scenario: a laptop with a 1TB system NVMe and a 2TB data NVMe. The data NVMe had a 60GB unused (ex-Linux) partition and the rest as NTFS. I swapped the 1TB drive with a twin 2TB NVMe and, you see it coming, I repartitioned what I was persuaded was the system NVMe and installed Windows on it. I booted and realized I had just zapped the data NVMe.

I tried PhotoRec/TestDisk with mixed results. I'm looking for more advanced techniques to recover the data, considering what I care about most are the Sony RAW files (about 600GB) of all the photos I shot (amateur photographer). So we know the file format and the fixed size of each file (including its structure). We also know the empty partition at the beginning was between 60 and 64GB (no software found the original data partition, an NTFS partition of 2TB minus that 60-64GB at the beginning).
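
Since the format and roughly fixed file size are known, one angle beyond PhotoRec is a simple signature-carving pass over the binary image. A minimal sketch: the little-endian TIFF magic that ARW uses is standard, but the start offset, file size, and image filename are placeholders to adjust:

    import os

    SIGNATURE = b"II*\x00"                # little-endian TIFF magic used by Sony ARW
    START_OFFSET = 60 * 1024**3           # skip the former ~60GB partition (placeholder)
    FILE_SIZE = 50 * 1024**2              # typical size of one RAW file (placeholder)
    CHUNK = 4 * 1024**2
    OUT_DIR = "carved"

    os.makedirs(OUT_DIR, exist_ok=True)
    hits = []

    with open("nvme_image.bin", "rb") as img:   # the binary copy of the NVMe
        img.seek(START_OFFSET)
        pos, carry = START_OFFSET, b""
        while True:
            block = img.read(CHUNK)
            if not block:
                break
            data = carry + block
            idx = data.find(SIGNATURE)
            while idx != -1:
                hits.append(pos - len(carry) + idx)   # absolute offset of a candidate header
                idx = data.find(SIGNATURE, idx + 1)
            carry = data[-(len(SIGNATURE) - 1):]      # overlap so boundary-spanning hits aren't missed
            pos += len(block)

        # Dump FILE_SIZE bytes from each candidate header; a real pass would validate
        # the TIFF structure and trim each file to its true end.
        for n, offset in enumerate(hits):
            img.seek(offset)
            with open(os.path.join(OUT_DIR, f"recovered_{n:06d}.arw"), "wb") as out:
                out.write(img.read(FILE_SIZE))

    print(f"carved {len(hits)} candidate files")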

I've been in IT for more than 28 years: every pointer to whatever can be tried will be welcome. Of course, I created a binary copy of the NVMe to prevent further damage. Thanks to everyone who'll take the time to help!

13
 
 
The original post: /r/datahoarder by /u/-IGadget- on 2024-12-26 17:15:21.

A true life story of a way to do something.

Disclaimer: I have left out some details for clarity. Available options were dictated by the even older mature technology, the business-critical nature of the environment, and the time and hardware available to create the backup solution and keep it running.

Background: Over 10 years ago I worked with customer sites hosting multiple systems, 16TB in each chassis spread across 4TB hardware RAID 5 arrays. I made suggestions for improving the robustness of the way we stored data, but management was not interested.

All of the stored data was created, never modified, and only deleted after 5-10 years based on variable retention rules. We had customers older than that.

Our system required careful use. If the customer modified the wrong retention rule, the system would purge all of the affected data immediately. There was no warning and no 'undo'. This change might go unnoticed for a period of time, a few days to a week, depending on how often audits were done.

Get on with IT! This was our Saga Customer, good at breaking things in weird, unconventional, and time-consuming ways. There was one Saga that caused the push for a backup solution. We provided technical recovery and were moderately successful.

The Saga around the creation of a backup solution (not completely their fault). Normally we stayed away from backup related issues, leaving our customers to create their own solutions. This customer was more than a beta site.

Everything was backed up weekly (full) and daily (incremental), 5-10GB of changes - both adds and routine retention-based delete events. Available system I/O was an issue when it came to speed. Each data file was 20-30 seconds of Base64-encoded, 8-bit stereo audio, and was about 350 KB. I'll let you do the math on how many fit in 4TB.
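
A quick back-of-the-envelope check, taking the 350 KB figure at face value and assuming decimal units:

    # ~350 KB per file into a 4 TB volume
    print((4 * 10**12) // (350 * 10**3))  # -> 11428571, roughly 11.4 million files per array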

Small recoveries were just restored from the backup system directly onto the target Host system, and commands were run to recreate the database entries pointing at the restored data. Each file header had information that went into the DB.

Larger events affecting multiple TB (the Saga) had a different recovery procedure. That we even needed this alternate process says a lot.

When a major deletion event occurred, the remaining data on the Host was shifted to consolidate and free up one of the storage volumes. It is always faster to shift files between volumes on the same system than over a network, especially when production load was still active on the system.

Because of backup retention rules and the available backup storage, all the backups were compressed to save space. Deduplication wasn't a feature we had access to; only one vendor at that time even offered it.

While Host consolidation was happening, a Recovery system (spare host because downtime was not an option), was used to restore all of the missing data from compressed archive onto a new storage volume array. The recovery took a few days because you needed the last full before the change, and all of the incrementals to be applied up to the time the error was made. It also took longer due to the compression, but we didn't think the process would become regular when we created it.

Sneakernet Happens! Once the restore completed and was validated, after-hours work was scheduled. Both the Recovery and Target hosts were shut down. The empty array was pulled from the Host chassis and replaced with the array from the Recovery chassis. On Target Host boot, the system detected the volume change and prompted for where to mount it in the file structure. Once mounted, the restored data was scanned by our tools and various back-end databases had entries added, restoring access to our software.

Afterwards, the empty array was put into the Recovery system, and that controller was used to break the array so the drives could go to the top of the site's replacement pool for reuse if there were no issues with them. Also, the invalid incrementals and any full backups created after the error needed to be removed from the backup system to recover the space before 30 days.

14
 
 
The original post: /r/datahoarder by /u/nunyajaks on 2024-12-26 16:55:45.

I recently got a new 12TB HGST drive (huh721212ale600) to add to my collection, and like many others I had to look into the thump-every-5-seconds thing and found out that it's probably PWL. The thing that seems strange to me about PWL is that it's so frequent over such a long drive lifespan with only as much lubricant as was put in the drive before sealing it up. Does it simply never run out or degrade in the closed environment inside the HDD and can just be re-distributed over and over until the eventual death of the drive? I feel like I'm missing some obvious information, so apologies if this is a stupid question.

15
 
 
The original post: /r/datahoarder by /u/russiancarl on 2024-12-26 16:27:34.

I am looking to expand my HDD storage but my favorite hard drives, WD Red Plus 14 TBs, seem to be discontinued. They've been out of stock for a long time and no longer show up on the website.

The only other HDD in my case is a shucked 14TB WD white label, and it drives me mad. It isn't loud, but it does vibrate, and it clicked for the longest time. My Red Pluses don't have this issue at all.

I am thinking about taking the leap and going to the refurb enterprise drives like the 18TB+ exos, or just getting a Red Pro, but I really don't want to drop that kind of cash if they will vibrate or make that loud head seeking/clicking noise.

I do prefer a quiet hard drive, but oftentimes it's the frequency or quality of the sound that bothers me as opposed to the absolute volume. I have tinnitus and I'm always playing white noise anyway, so that helps a lot, but some sounds I just can't deal with.

Unfortunately, moving my case isn't possible at this point.

Any suggestions or experiences? Thank you. I could go down to 12TB Red Plus's but that seems less than ideal.

16
 
 
The original post: /r/datahoarder by /u/Mr_MoeO on 2024-12-26 15:33:25.

Hey everybody,

I am researching ways to store photos and videos that are currently on external hard drives. The combined size of everything is around 40TB+. I've always used a Mac, so most of the photos are in iPhoto or Photos libraries.

Which software would be the best solution with regards to file format compatibility and organization? Is it even possible or do I have to find a different solution?

17
 
 
The original post: /r/datahoarder by /u/cjr71244 on 2024-12-26 15:05:09.

For those of you who take a lot of pictures, what tools/apps do you use for tagging, organizing, and adding geolocations? I'm a data hoarder and therefore I have TONS of pictures from 20+ years. I've just started using a Windows program to organize and tag them; it's called digiKam.

My primary goal is to be able to quickly access my favorite random photos that I have taken of scenery etc and also quickly be able to view my vacation pictures in slideshows.

But I'd like to hear what other people do. I'd prefer not to use a paid service.

I have an Android phone, which is my primary camera, and a GoPro Hero 11 I use occasionally; all photos and videos are backed up to my Windows server using Syncthing. Then I add tags and organize them in subfolders by year and then by event. But often there are hundreds of pictures from a year that I don't really have any logical way to organize because they're kind of random.

18
 
 
The original post: /r/datahoarder by /u/SkippaChip on 2024-12-26 13:34:31.

Hi, I've got a 14TB HDD here that I just bought used from a trusted UK seller of second-hand HDDs. I think the drive isn't working on my PC due to the 3.3V pin issue, and I just keep getting this error. I've tried covering pin 3, but I'm unsure if the pinout for this drive is different and I need to be covering a different pin. I've highlighted in red the pin I've already tried covering. Any suggestions?

https://preview.redd.it/r5v3wpg7479e1.png?width=376&format=png&auto=webp&s=af55cf1417f21291dcc437b3349ce89fc338093d

https://preview.redd.it/mkpy36y7479e1.jpg?width=4032&format=pjpg&auto=webp&s=5ef3ee87333c634eff3f0117a1218acc9662dd7e

19
 
 
The original post: /r/datahoarder by /u/dumnezilla on 2024-12-26 10:37:28.

No category in mind. Just whatever.

20
Best local Setup?
 
 
The original post: /r/datahoarder by /u/Autumnlight_02 on 2024-12-26 10:30:54.

Hey, so I want to set up a larger NAS at home. I don't wanna buy a new system since I've got a mainboard etc. lying around here. I plan to buy a PC case with 10 drive slots and use that.

I am now curious about how I can encrypt the data securely, and which OS is best suited. Preferably, I would like to only store the encrypted data and decrypt it on my local PC.

Generally, which NAS OS should I use? (This will be my first NAS.) I need my data to be safe in case the NAS gets stolen.

21
128+ GB BluRay?
 
 
The original post: /r/datahoarder by /u/h3lnwein on 2024-12-26 08:39:06.

I am looking at a remux of Lord of the Rings: Return of the King. It’s actually 132GB in size… how? Aren’t Blu-ray discs 128GB max?

22
 
 
The original post: /r/datahoarder by /u/hmmqzaz on 2024-12-26 05:30:39.

So: I’m a librarian with all the major archival and digital archival certifications in the US.

About seven years into that, I studied taxonomy theory they often use in bioinformatics and biocuration, and learned better stuff about digital organizing in two graduate courses than I learned in the degrees, certs, and work.

I mean, the single best thing I learned was when to use upper and lower case, and plurals and singular, in naming file folders, but there was also other stuff :-P

So now that I have my own satisfyingly insane system for organizing my stuff, based on the Gene Ontology and pre-GPT semantic programming, what’s yours?

23
 
 
The original post: /r/datahoarder by /u/LeviAEthan512 on 2024-12-26 04:25:28.

Hi, does anyone have recommendations for an expansion card that's good for data hoarding?

Right now, I've got a USB HDD enclosure. I hear it's best to go for eSATA, so I think I'll switch to that (unless the latest USB advances have made that irrelevant).

I also have a USB expansion card, but it's kinda sketchy, so that's another reason I want to upgrade.

Now I have two problems. I'm using a couple of those ports on my USB card, and I only have one available PCIe slot (the rest are blocked by my graphics card). So whatever card I get needs to have both USB and eSATA. I'm not sure this exists. I can't find it in any case. I've found cables that go eSATA to USB-C, but they're all crazy expensive compared to USB cables, and I don't recognise any of the brand names either.

If I can't get both, then I'm stuck on USB for the foreseeable future. In that case, I would like a proper card from a reputable brand. Does anyone know any? The Chinese one I'm using works fine 90% of the time, but causes a wide range of problems infrequently. I know it's the card because I've used it in 3 different systems and they've all encountered these problems, but only when the card is in use.

24
 
 
The original post: /r/datahoarder by /u/Imaginary_War9923 on 2024-12-26 00:53:01.

Hi strangers,

I have been working on a media scraper for a few months for my own archiving and it is finally ready to be released to the public~

This bot is designed as a comprehensive tool for managing and automating downloads using a self-reliant database. It offers both manual and automatic entry through a command-line tool and a batch file, as well as a guided command-line interface to add, manage, and delete entries and to modify a global blacklist system that blocks tags on booru sites. It also currently supports Pixiv downloads.

Note that this program doesn't currently support anything but Gelbooru and Pixiv, but that will be changing within the next week or so as I refine the bot :)

Anyway, here is a link to the project. Keep an eye on it because I am working on improving it. Happy hoarding!

https://github.com/Waffles-54/scraping-bot-manager
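
For anyone curious what tag-based scraping with a global blacklist looks like in practice, here is a minimal sketch; it is not taken from the linked bot, and it assumes Gelbooru's public dapi JSON endpoint:

    import requests

    GELBOORU_API = "https://gelbooru.com/index.php"
    BLACKLIST = {"example_blocked_tag"}  # hypothetical global blacklist

    def fetch_posts(tags, limit=20):
        """Query the dapi endpoint and drop posts containing blacklisted tags."""
        params = {"page": "dapi", "s": "post", "q": "index",
                  "json": 1, "tags": tags, "limit": limit}
        data = requests.get(GELBOORU_API, params=params, timeout=30).json()
        posts = data.get("post", []) if isinstance(data, dict) else data
        return [p for p in posts
                if not BLACKLIST & set(p.get("tags", "").split())]

    for post in fetch_posts("landscape rating:general"):
        print(post.get("file_url"))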

25
 
 
The original post: /r/datahoarder by /u/olo99 on 2024-12-26 00:09:07.

I got some questions that I hope to get some clarification on:

I have 8 x WD RED SSD (4TB) that I want to set up with Drivepool + SnapRAID.

  1. Is 2 disk parity for SnapRAID still recommended even when using SSDs and not regular HDDs, or is 1 disk good enough?

  2. Is there any point in getting a PLP SSD to use as cache drive in addition to the other ones? (at the moment I don't have a UPS, would PLP give a little protection for sudden shut downs?)

  3. I have seen some people using 2 SSDs as cache drives (1 with duplication), then "dumping" the data over to the storage drives and having SnapRAID just run off the storage drives. In my case all drives will be SSDs; would this still benefit me in any way? Or would it be better just having all drives (without parity) in one big storage pool? I know SnapRAID doesn't like to be run when anything is active.

For example would this be ideal:

disk 1 cache drive

disk 2 duplicate of 1

disk 3-7 storage

disk 8 parity (SnapRAID).

This would then give 24TB total storage (down to 20TB with double parity).
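
For what it's worth, a minimal snapraid.conf sketch for that layout might look something like this; the mount points and content-file locations are placeholders, not recommendations:

    # Disk 8 holds the SnapRAID parity; cache disks 1-2 stay outside the array
    parity /mnt/disk8/snapraid.parity

    # Keep several copies of the content file on different disks
    content /var/snapraid/snapraid.content
    content /mnt/disk3/snapraid.content
    content /mnt/disk4/snapraid.content

    # Disks 3-7 as SnapRAID data disks
    data d1 /mnt/disk3
    data d2 /mnt/disk4
    data d3 /mnt/disk5
    data d4 /mnt/disk6
    data d5 /mnt/disk7

    exclude *.tmp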

Or what would you recommend? :)
