this post was submitted on 13 Aug 2023

48 points (83.3% liked)

No Stupid Questions

Calibre is far from ideal, so I wonder if there is a better way to convert a PDF into EPUB? Maybe a new AI tool exists for that purpose? What do you use? (social.graves.cl)

submitted 1 year ago by [email protected] to c/[email protected]

Cc @[email protected]

[–] [email protected] 32 points 1 year ago (1 children)

I had this exact question myself a little while ago, so I’ll share what I learned. I don’t know your level of knowledge with these things, so forgive me if I’m explaining things you already know. And spoiler alert, the answer is “technically, but not how you’d like.”

An EPUB “file” is really a zipped folder containing a bunch of individual HTML files that hold the text of the book and the table of contents, plus any photos (if your ebook has pictures) and CSS for styling. This is the exact medium you’d work in if you were designing a web page, but with an ebook there are different best practices and considerations.
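To make the "folder of files" point concrete: an EPUB is a ZIP archive under the hood, so Python's standard `zipfile` module can open one. Here is a minimal sketch that groups the entries by rough role. The function name `summarize_epub` and the extension-to-role table are my own illustrative assumptions, not part of any EPUB tool or of the spec.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Illustrative mapping from file extension to a rough role. This is an
# assumption made for the example, not an official EPUB classification.
ROLES = {
    ".xhtml": "text", ".html": "text", ".htm": "text",
    ".css": "style",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".gif": "image", ".svg": "image",
    ".opf": "manifest",
    ".ncx": "toc",
}

def summarize_epub(source):
    """Group the entries of an EPUB (path or file-like object) by rough role."""
    groups = defaultdict(list)
    with zipfile.ZipFile(source) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(name).suffix.lower()
            groups[ROLES.get(suffix, "other")].append(name)
    return dict(groups)
```

Calling something like `summarize_epub("book.epub")` (a hypothetical file name) on a real book should show a handful of text files, a stylesheet or two, and a manifest, which is roughly the set of pieces a converter has to produce.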

Now, assuming that your PDF has a good OCR (optical character recognition) layer, it will be possible for Calibre and other programs to grab the text of the PDF, and even to create an EPUB with it. But as you’ve noticed, they don’t do a good job of this. The fundamental problem is that creating an EPUB is something of an art, with best practices and personal choices as far as layout and file structure. When you “convert”, you’re not changing the file type from PDF to EPUB; you’re grabbing the text from the PDF and then sticking it into multiple different files, with HTML and CSS instructions throughout to tell the e-reader how to lay things out, which footnotes link to which annotations, where to display pictures, etc.

As far as I’m aware, this basically can’t be done (well) with dumb, automatic programs like what Calibre offers because there’s too much “thinking” involved. Perhaps an AI tool could be created that would handle this better, but I’m not aware of one, and it’s a pretty specialised application so it’s possible you’ll need to wait a while before someone gets around to that.

So I realised that if I wanted an EPUB version, I’d need to make it myself. I used Sigil, a free EPUB creation tool, to do it, which gave me some nice features to help speed up the process, but it’s a big time commitment (unless you’re working with a very short PDF), especially for your first EPUB where you’re still learning what to do while making it. You’ll also need to learn HTML and CSS if you haven’t already.

I did it as a sort of fun side project in my free time to learn a new skill, but unfortunately, other than that, I don’t think there’s any such thing as an “EPUBinator” that’s gonna take your PDF and create a well-made ebook.

[–] [email protected] 6 points 1 year ago (2 children)

You’ve identified the main issue: PDF extraction. A PDF can lay out pages in an infinite number of ways.

My personal workflow is to take a PDF, run it through ClearType OCR, and save it as a web-friendly, accessibility-standard-compliant PDF. That step extracts all the text and lays it out again so a screen reader can read it in the correct order.

After that, it’s a matter of exporting the PDF to HTML, chunking it, zipping the results with a CSS file and a manifest, and you’ve got an ePub.
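For anyone curious what "chunking, zipping with a CSS file and a manifest" can look like, here is a rough Python sketch using only the standard library. It is not the exact workflow described above: the helper names (`chunk_by_h1`, `build_epub`), the one-chunk-per-`<h1>` rule, the file names, and the placeholder identifier are all assumptions for illustration. It also assumes the exported HTML is already well-formed XHTML, and I make no promise that the result passes a strict validator.

```python
import re
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

PAGE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title><link rel="stylesheet" type="text/css" href="style.css"/></head>
<body>{body}</body>
</html>"""

NAV = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol>{links}</ol></nav></body>
</html>"""

def chunk_by_h1(html):
    """Split exported HTML body content into chunks, each starting at an <h1>."""
    parts = re.split(r"(?=<h1\b)", html)
    return [p.strip() for p in parts if p.strip()]

def build_epub(out, title, html, css,
               book_id="urn:uuid:00000000-0000-0000-0000-000000000000"):
    """Write a minimal EPUB 3 to `out` (path or file-like); return chapter names.

    `book_id` defaults to a placeholder; a real book needs its own identifier.
    """
    chunks = chunk_by_h1(html)
    names = [f"chapter{i:03d}.xhtml" for i in range(1, len(chunks) + 1)]
    items = "\n".join(
        f'    <item id="c{i}" href="{n}" media-type="application/xhtml+xml"/>'
        for i, n in enumerate(names, 1))
    spine = "\n".join(f'    <itemref idref="c{i}"/>' for i in range(1, len(names) + 1))
    links = "".join(f'<li><a href="{n}">Chapter {i}</a></li>'
                    for i, n in enumerate(names, 1))
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # The manifest lists every file; the spine gives the reading order.
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">{escape(book_id)}</dc:identifier>
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="css" href="style.css" media-type="text/css"/>
{items}
  </manifest>
  <spine>
{spine}
  </spine>
</package>"""
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/nav.xhtml", NAV.format(links=links))
        zf.writestr("OEBPS/style.css", css)
        for name, chunk in zip(names, chunks):
            zf.writestr(f"OEBPS/{name}", PAGE.format(title=escape(title), body=chunk))
    return names
```

Splitting on `<h1>` is the crudest possible chunking rule; a real book probably wants chapter titles in the nav file instead of "Chapter 1", and images would need their own manifest entries.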

And of course, there are Python libraries to do a lot of the conversion as well.

[–] [email protected] 1 points 1 year ago (1 children)

Oh yeah, my actual workflow to create a book was horribly inefficient and time-consuming. How automated is that HTML export and chunking process? Are you still going through and manually adding in every last <p> and href?

I’m curious about your use case, because I was doing this with a book that was hundreds of pages long, full of photos and footnotes, which added lots of tedium.

[–] [email protected] 3 points 1 year ago

I just use a text editor and regex to add all the paras and hrefs. Done it for a few horribly mangled books I was converting. Whole process was “manually automated”.
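The same kind of find-and-replace can be scripted. Below is a small Python sketch of the idea, not the actual expressions used above: it assumes paragraphs are separated by blank lines and that footnote markers look like `[12]`. The function names and the `notes.xhtml` target file are made up for the example.

```python
import re
from html import escape

def add_paragraphs(text):
    """Wrap blank-line-separated blocks of plain text in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        # Collapse hard line wraps inside a block and escape &, < and >.
        f"<p>{escape(' '.join(b.split()), quote=False)}</p>"
        for b in blocks if b.strip()
    )

def link_footnotes(html, notes_file="notes.xhtml"):
    """Turn bracketed markers such as [12] into links to a notes file."""
    return re.sub(
        r"\[(\d+)\]",
        rf'<a id="ref\1" href="{notes_file}#fn\1">[\1]</a>',
        html,
    )
```

Running `link_footnotes(add_paragraphs(raw_text))` covers the two most tedious tags in one pass. Mangled books will still need manual cleanup wherever the blank-line assumption fails.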

[–] [email protected] 1 points 1 year ago

Question: why is using OCR software more worthwhile than pulling the contents out with something like LibreOffice Draw?

[–] [email protected] 11 points 1 year ago

The ideal would have to be some sort of AI translation. The problem is that PDF is a page layout format and EPUB is a reading format, and you can’t just extract the text without understanding which parts are affected by page layout; think of text set in columns, for example. You would also need to train the AI on what’s unnecessary for reading comprehension.
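The column problem is easy to show with a toy example. In the Python sketch below, the `Block` type, the coordinates, and the fixed two-column assumption are all invented for illustration; real PDFs need proper layout analysis, and this only shows why a plain top-to-bottom read goes wrong.

```python
from typing import NamedTuple

class Block(NamedTuple):
    x: float   # left edge of the text block on the page
    y: float   # top edge, increasing downwards
    text: str

def naive_order(blocks):
    """Read strictly top-to-bottom, then left-to-right: interleaves columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]

def column_order(blocks, page_width, columns=2):
    """Assign each block to a column by its left edge, then read column by column."""
    col_width = page_width / columns

    def key(b):
        col = min(int(b.x // col_width), columns - 1)
        return (col, b.y)

    return [b.text for b in sorted(blocks, key=key)]

# A made-up two-column page: A1 and A2 are the left column, B1 and B2 the right.
page = [
    Block(50, 100, "A1"), Block(320, 100, "B1"),
    Block(50, 200, "A2"), Block(320, 200, "B2"),
]
```

With this page, `naive_order(page)` alternates between the two columns, while `column_order(page, page_width=600)` reads the left column in full before the right one. Headers, footers, sidebars, and captions all break the fixed-column assumption, which is where the "understanding" part comes in.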

[–] [email protected] 10 points 1 year ago (2 children)

By "far from ideal", I think you mean "not perfect".

[–] [email protected] 1 points 1 year ago

No. They mean really bad. OP is being overly polite.

[–] [email protected] -2 points 1 year ago (1 children)

And ugly!

[–] [email protected] 13 points 1 year ago (1 children)

An ugly powerhouse Linux application? What will they think of next?!

[–] [email protected] 3 points 1 year ago

Yeah! Like Audacity!

[–] [email protected] 7 points 1 year ago (1 children)

Well, data extraction from PDF is always tricky, and there isn’t a defined way to translate PDF to EPUB 1:1, so I don’t think Calibre is the problem. It’s difficult to program an automatic way to reverse-engineer the layout.

[–] [email protected] 3 points 1 year ago

This is the answer.

PDF was designed to be hard to convert. Mobi to EPUB is easy, though, because it's just text.

/thread