
Batch conversion of arbitrary formats to djvu, mobi, epub - not with calibre
I have a few thousand files, mostly PDF and many PostScript, that I need to convert to several other formats. I can't use calibre (don't get me started). I have considered writing scripts to dump pages to image files and using dubious CLI programs to compile the images into these formats, but thought I would ask here before reinventing the wheel. Hopefully someone has an elegant solution. EPUB is just zipped HTML with a specific file layout; the others (DjVu, MOBI) seem a bit different.
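To make the "zipped HTML with a specific file layout" point concrete, here is a minimal sketch of an EPUB 3 container written with Python's standard `zipfile` module. The function name `write_epub`, the `OEBPS/` folder, the file names inside it, and the placeholder identifier are illustrative choices, not anything the format mandates. What the format does require is a `mimetype` entry stored first and uncompressed, and a `META-INF/container.xml` that points at the package file.

```python
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">{identifier}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="text" href="text.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="text"/>
  </spine>
</package>
"""

NAV_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol><li><a href="text.xhtml">{title}</a></li></ol></nav></body>
</html>
"""

PAGE_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title></head>
<body>{body}</body>
</html>
"""


def write_epub(path, title, body_xhtml,
               identifier="urn:uuid:00000000-0000-0000-0000-000000000000"):
    """Write a one-chapter EPUB. `body_xhtml` must already be well-formed XHTML."""
    title = escape(title)
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with zipfile.ZipFile(path, "w") as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)

        def add(name, text):
            zf.writestr(name, text, compress_type=zipfile.ZIP_DEFLATED)

        add("META-INF/container.xml", CONTAINER_XML)
        add("OEBPS/content.opf", OPF_TEMPLATE.format(
            identifier=escape(identifier), title=title, modified=modified))
        add("OEBPS/nav.xhtml", NAV_TEMPLATE.format(title=title))
        add("OEBPS/text.xhtml", PAGE_TEMPLATE.format(title=title, body=body_xhtml))
```

Calling `write_epub("book.epub", "My Title", "<p>Hello</p>")` should produce a file that readers accept, though running the result through a validator is still a good idea before doing a few thousand of them.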
More posts by @Ravi

2 Comments
I've been converting PDF to other formats for over 3 years. I've converted at least 20 books.
A PDF file is not easily convertible, except for plain text paragraphs. I have a lot of experience extracting text from PDFs. PDF is an endpoint format, not meant for further processing or text extraction. You will run into a LOT of problems extracting images and tables. Images generally don't extract at all. Tables come out badly mangled, and converting them to other formats requires a LOT of manual cleanup.
That said, if you want to extract the text, I found this site to be better than the others: online-convert.com. It supports many input and output formats. I tried at least 5 other sites and 5 OCR programs for the PC.
As far as I know, no program will give you the accuracy you want. There might be better programs, such as ABBYY FineReader, but they are not free, and there is no such thing as perfect conversion from PDF to anything else, especially for images and tabular data.
Fiction is easy to extract text from because it is just wrapped paragraphs with few or no images and no tabular data. So to test your conversion, pick a more challenging book with tabular data and images and see how it goes.
Some PDF books have the text embedded in them and extract fairly well. Others are just a series of images, one scanned page per image. For these you need OCR.
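With a few thousand files it helps to sort them into "has a text layer" and "needs OCR" automatically. The sketch below is one rough way to do that, assuming poppler's `pdftotext` is installed and on the PATH. The function names and the 50-characters-per-page threshold are my own guesses, not an established rule, and the splitting relies on `pdftotext` ending each page with a form feed character.

```python
import subprocess


def split_pages(raw_text):
    """Split pdftotext output into pages (pdftotext ends each page with \\f)."""
    pages = raw_text.split("\f")
    # The form feed after the last page leaves one empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def needs_ocr(pages, min_chars_per_page=50):
    """Guess whether a PDF is scanned images: very little text per page."""
    if not pages:
        return True
    # Count non-whitespace characters only, so blank layout padding is ignored.
    total = sum(len("".join(page.split())) for page in pages)
    return total / len(pages) < min_chars_per_page


def extract_pages(pdf_path):
    """Run pdftotext on one file and return a list of page strings."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" sends text to stdout
        capture_output=True, text=True, check=True,
    )
    return split_pages(result.stdout)
```

Something like `needs_ocr(extract_pages("book.pdf"))` then decides which pile a file goes into. A PDF that mixes typeset and scanned pages can still fool a per-file average, so a per-page check may be worth adding.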
Also, no OCR is 100% accurate. You might get 90-95% accuracy. But you don't know which letters or words are wrong, so you have to check every word in the output. This takes a lot of time.
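To put a number on that, here is a back-of-the-envelope estimate. It assumes the 90-95% figure applies per word and that errors are independent, both simplifications, and the 80,000-word book length is just a hypothetical example.

```python
def expected_wrong_words(word_count, word_accuracy):
    """Expected number of misrecognized words, assuming each word is
    independently correct with probability `word_accuracy`."""
    return round(word_count * (1.0 - word_accuracy))


# Hypothetical 80,000-word book at the two ends of the 90-95% range.
for accuracy in (0.90, 0.95):
    print(f"{accuracy:.0%} accurate: about "
          f"{expected_wrong_words(80_000, accuracy)} wrong words")
```

Under those assumptions that is roughly 4,000 to 8,000 wrong words scattered through one book, which is why the proofreading pass dominates the time.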
For highly structured pages with many pictures, DjVu is, IMHO, the only usable format. So use the DjVuLibre application library.
To automate it, you can use scripts built on python-djvulibre.
I think there are instructions in English, and there are certainly instructions in Slavic languages (Instruction in Czech 1, Instruction in Czech 2). If the community wants, I could translate the tutorial (PDF/TIFF/JPG -> DjVu) into English or automate the process with a script. (A fully automated process would need machine learning algorithms for some parts.)
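As a rough idea of what such a script could look like, here is a sketch that drives the DjVuLibre command-line tools (`c44`, `cjb2`, `djvm`) from Python with `subprocess`, rather than going through python-djvulibre. It is not the tutorial's method. It assumes poppler's `pdftoppm` is available for rasterizing PDF pages, that any TIFF input is bitonal (which `cjb2` expects), and that `pdftoppm` zero-pads page numbers so that sorting the file names keeps page order. All function names are made up for illustration.

```python
import subprocess
from pathlib import Path

BITONAL_SUFFIXES = {".pbm", ".tif", ".tiff"}  # assumption: TIFFs are black-and-white scans


def rasterize_command(pdf_path, prefix, dpi=300):
    """pdftoppm writes one PPM per page, named <prefix>-<page>.ppm."""
    return ["pdftoppm", "-r", str(dpi), str(pdf_path), str(prefix)]


def encode_command(image_path, djvu_path):
    """Pick cjb2 for bitonal input, c44 for colour or grey (PPM, PGM, JPEG)."""
    suffix = Path(image_path).suffix.lower()
    encoder = "cjb2" if suffix in BITONAL_SUFFIXES else "c44"
    return [encoder, str(image_path), str(djvu_path)]


def bundle_command(page_djvus, output):
    """djvm -c creates a multi-page document from single-page files."""
    return ["djvm", "-c", str(output), *map(str, page_djvus)]


def pdf_to_djvu(pdf_path, work_dir, output):
    """Rasterize a PDF, encode every page, and bundle the pages into one DjVu."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work_dir / "page"), check=True)
    page_djvus = []
    for image in sorted(work_dir.glob("page-*.ppm")):
        page = image.with_suffix(".djvu")
        subprocess.run(encode_command(image, page), check=True)
        page_djvus.append(page)
    subprocess.run(bundle_command(page_djvus, output), check=True)
```

For a batch run, loop over the input folder and call `pdf_to_djvu(pdf, scratch_dir, pdf.with_suffix(".djvu"))` with a separate scratch directory per file. This treats every page as a colour image. Separating text from background, which is where DjVu really shines, is the part that needs the smarter tooling mentioned above.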
For pure text, use the method for creating EPUB, which has already been referenced.
If you are not publishing commercially, you do not need the mobi format once you have EPUB. EPUB is IMHO a better format than mobi.
On an e-book reader, update the firmware and then you should be able to read EPUB.