I've done work for pgdp.net, which is Distributed Proofreaders (PGDP), the volunteer proofreading project that feeds Project Gutenberg. Here are the steps if you're not going through PGDP.

1. I'm assuming you scanned each page into a separate image. You might want to put all these images into a PDF file. You will probably have to cut the pages out of the book so they lie perfectly flat on the scanner. A scan of a curved page is a huge source of OCR errors.
2. Use OCR to extract the text from the images and save the book to a text file. Depending on the font, smudges in the images, and other factors such as how straight the lines of text are, you might get 70-90% accuracy.
3. Proofread every word by hand.
4. Construct the ebook in the format you want. I mark up my text files with MultiMarkdown, and then I have a program that converts that to an EPUB. I just like MMD better than editing HTML.
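
As a rough illustration of what can sit between the OCR and proofreading steps above, here is a minimal Python sketch of a cleanup pass. It is not the tooling PGDP uses or anything from my own workflow; the function names and the "letters mixed with digits" heuristic are assumptions made for the example. It rejoins words split across line breaks and flags tokens that look like typical OCR misreads, so you know where to look first. It does not replace reading every word.

```python
import re


def join_hyphenated(text: str) -> str:
    """Rejoin words split across a line break, e.g. 'proof-\\nreading'.

    Limitation: a genuinely hyphenated compound that happens to break at
    its hyphen ('time-\\nconsuming') is also joined, so review the result.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def looks_misread(token: str) -> bool:
    """Heuristic (an assumption): letters mixed with digits, like 'c0me'.

    This also flags legitimate tokens such as '1st' or '2nd'.
    """
    has_alpha = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    return has_alpha and has_digit


def flag_suspects(text: str) -> list[tuple[int, str]]:
    """Return (line_number, token) pairs worth a closer look."""
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"[\w']+", line):
            if looks_misread(token):
                suspects.append((number, token))
    return suspects


if __name__ == "__main__":
    # Hypothetical OCR output, made up for this example.
    sample = "Proof-\nreading is a time c0nsuming task.\nThe 1ittle dog."
    cleaned = join_hyphenated(sample)
    for number, token in flag_suspects(cleaned):
        print(f"line {number}: {token}")
```

Run the join first, then the flagging, because joining changes the line numbers that the flagging reports.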

ABBYY FineReader is one of the best OCR programs out there, but it is pricey. Here is a site where you can OCR a few pages using their software, but there are limits.

If you have a PDF full of images (which is what many of the free books in Google Books are), try doing OCR at this site; again, there may be limits on the size of the PDF file you can upload.

All in all, proofreading is a very time-consuming task.
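
To put rough numbers on that, here is a small Python sketch that turns an OCR accuracy figure into an expected count of words needing correction. The 80,000-word book length is a made-up example, and it assumes the 70-90% figure means word-level accuracy, which may not match how a given OCR program reports it.

```python
def words_to_correct(total_words: int, accuracy: float) -> int:
    """Expected number of misrecognized words at a given word-level accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(total_words * (1.0 - accuracy))


if __name__ == "__main__":
    BOOK_WORDS = 80_000  # assumed book length, for illustration only
    for accuracy in (0.70, 0.90):
        errors = words_to_correct(BOOK_WORDS, accuracy)
        print(f"{accuracy:.0%} accurate: about {errors} words to fix")
```

Even at the top of that range, a book of this assumed length leaves thousands of words to fix, on top of reading all the ones that came through correctly.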

EDIT: PGDP does not accept requests to digitize books; they work through their own queue. However, reCAPTCHA is a way to crowdsource digitizing books. It was supposedly used on many Google Books titles, but I can't find many there that were actually edited. If you use it to crowdsource proofreading, it might cost you money, depending on whether you use their service or are able to install the software yourself.

I found the reCAPTCHA concept fascinating, and its inventor even has a TED talk.



More posts by @Melissa
