The original post: /r/datahoarder by /u/SecretlyCarl on 2024-12-26 19:59:02.

Not sure if this is the right sub for this, but I figured this community would appreciate the purpose of the script.

I recently downloaded ~80k epubs (a zip folder, with no way to pre-select what I wanted). I didn't want to keep ALL of them, but I also didn't want to go through them one by one. I spent the last few days chatting with ChatGPT to get a working script, and now I want to make it more efficient. Right now it takes about 3 hours to process 1,000 books, so all 80k would take roughly 240 hours, about 10 days.

In the readme I outline the flow of the script. It uses an LLM to clean up the filenames, then looks each title up on Goodreads to parse its genres, which are saved to a txt file. A separate GUI script then uses those txt files to filter, delete, and move the epubs by genre.
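For context, the Goodreads step looks roughly like this (a simplified sketch in the same spirit as the repo, not the actual code: the search URL is real, but the genre selector is a placeholder and the function names here are made up):

```python
import requests
from bs4 import BeautifulSoup

def fetch_genres(title: str) -> list[str]:
    """Search Goodreads for a cleaned-up title and return any genre labels found."""
    resp = requests.get(
        "https://www.goodreads.com/search",
        params={"q": title},
        headers={"User-Agent": "Mozilla/5.0"},  # Goodreads may reject the default client string
        timeout=15,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Placeholder selector -- adjust to match whatever genre markup Goodreads actually serves.
    return [a.get_text(strip=True) for a in soup.select("a.bookPageGenreLink")]

def save_genres(epub_path: str, genres: list[str]) -> None:
    """Write one genre per line to a .txt next to the epub, as in the flow described above."""
    txt_path = epub_path.rsplit(".epub", 1)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(genres))
```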

From what I can tell, the main slowdown comes from the way Selenium WebDriver and BeautifulSoup are being used.
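One idea I've been toying with (again just a rough sketch, not what grsearch.py does today) is swapping the per-title browser session for plain requests calls run through a thread pool. `fetch_genres()` is the hypothetical helper from the sketch above, and the worker count is a guess that would need to stay within whatever rate Goodreads tolerates:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(titles: list[str], workers: int = 8) -> dict[str, list[str]]:
    """Fetch genres for many cleaned titles concurrently instead of one page load at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # fetch_genres is the hypothetical helper sketched earlier
        results = list(pool.map(fetch_genres, titles))
    return dict(zip(titles, results))
```

Even a handful of workers should cut the wall-clock time a lot if the bottleneck really is waiting on page loads rather than parsing.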

Here is the GitHub repo - https://github.com/secretlycarl/epub_filter_tool

And the file I'm looking for advice about - https://github.com/secretlycarl/epub_filter_tool/blob/main/grsearch/grsearch.py
