
To get good reflowable, readable text you would indeed, as you suggest, have to use OCR software. Tesseract is a free OCR engine with good (IMHO) quality that you could use.
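As a rough illustration of the OCR step, here is a minimal sketch that calls the tesseract command-line tool from Python. It assumes tesseract is installed and on your PATH, and that each page is already available as an image; the function names and file names are made up for the example.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for the tesseract command-line tool.

    Follows the basic CLI form: tesseract IMAGE OUTPUTBASE -l LANG
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the recognised text.

    Assumes the tesseract binary is installed; raises CalledProcessError
    if the tool exits with an error.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base for plain-text output.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling something like ocr_page("page_001.png", "page_001") would leave a page_001.txt next to your script and return its contents, ready for proofreading.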

The problem is the mathematical formulae: I have not seen OCR software that does a good job in that area. That leaves two options:

Cut out the formulae as images and put them into the text at the places where the OCR faltered.
Rewrite the formulae in some system that can generate HTML.
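The first option could be sketched roughly like this. The idea is to type a marker into the OCR text wherever a formula came out garbled, and then swap each marker for an image tag when generating the HTML. The marker syntax, the "formulae" directory, and the function name are my own assumptions for the example, not part of any tool.

```python
import html
import re

# Hypothetical marker typed into the OCR text wherever a formula was garbled,
# for example "[[formula:eq3|Euler's identity]]" (image name, then alt text).
MARKER = re.compile(r"\[\[formula:([A-Za-z0-9_-]+)\|([^\]]*)\]\]")


def insert_formula_images(text, image_dir="formulae"):
    """Escape OCR text for HTML and swap formula markers for <img> tags."""
    parts = []
    last = 0
    for match in MARKER.finditer(text):
        # Escape the ordinary text between markers so "<" and "&" stay literal.
        parts.append(html.escape(text[last:match.start()]))
        name, alt = match.group(1), match.group(2)
        parts.append(
            f'<img src="{image_dir}/{name}.png" alt="{html.escape(alt, quote=True)}">'
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

The alt text is worth filling in, since an ebook reader that cannot show the image will fall back to it.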

The second option is more work and more error-prone. I have used it (in combination with the Python sympy module, generating LaTeX), but of course any typo leads to incorrect formulae, a mistake that is much harder to make when you just cut out the formulae as images.
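For the second option, a minimal sketch of the sympy route might look like the following. It assumes sympy is installed; the helper names are hypothetical, and the cross-check in same_formula is just one possible way to catch typos (type the formula twice, or compare against an expanded form), not something the approach requires.

```python
import sympy


def formula_to_latex(source):
    """Parse a formula typed in SymPy syntax and return its LaTeX form.

    sympify raises SympifyError on malformed input, which catches some
    typos but not a formula that is valid and simply wrong.
    """
    expr = sympy.sympify(source)
    return sympy.latex(expr)


def same_formula(source_a, source_b):
    """Return True if SymPy can simplify the difference of two formulas to zero."""
    difference = sympy.sympify(source_a) - sympy.sympify(source_b)
    return sympy.simplify(difference) == 0
```

For instance, formula_to_latex("sqrt(x)") gives \sqrt{x}, which you can then hand to whatever LaTeX-to-HTML tool you prefer. Note that sympy may reorder or simplify terms, so the output will not always match the book's layout exactly.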

One other, maybe less obvious, road is to ask the professor for the source material from which the PDF was generated. You might have an easier time starting from there, and your professor might be willing to supply the material, lured by the prospect of getting an ebook-compatible version of the text in exchange. Even if the original material is individual scans, you are better off starting with those images for OCR than with PDF files (a format which is, apart from its multipage capabilities, fundamentally unsuitable for scanned material).



More posts by @Lorraine
