
: Re: How to skip those hyphens while searching in Acrobat
I have two PDF files: an original image PDF, where the PDF file is just an image of the text, and an OCR PDF produced by OCR, using ABBYY FineReader
I cannot answer you with certainty because I do not have the tools you
use. Hence I am not sure what kind of file they create (PDF can come
in different flavors). Furthermore, besides the basic character
recognition, OCR may or may not involve various steps, often based on
linguistic analysis, intended to improve the result. Hyphen removal is
one such step, since there is usually no point in keeping hyphenation
in the resulting OCRed text.
See for example the question "How to remove hard hyphens?" or this
Master's thesis on dehyphenation.
Comparing images is sometimes possible, but unlikely to give you
easily usable results for textual purposes.
Thus one must assume that you used one tool (ABBYY FineReader 11) to
get a first OCRed document, and that your comparison did a second OCR on
the image to compare it with the first result.
This may make sense, and can help identify possibly erroneous
locations in the OCRed text (though I do not know whether this is what
is actually done, as there may be other ways). It is also likely that
both OCR processes will sometimes agree on making the same mistake,
which then remains undetected.
Now it is possible that one OCR system does hyphenation removal while
the other does not. Then the comparison of the resulting files would show
differences wherever there was a hyphen in the original image. BTW,
does Acrobat find (mostly) extra hyphens or missing hyphens?
To do away with the problem would require that both OCR steps do the
same regarding hyphenation: either keep it or remove it. I would
expect that some OCR algorithms can be configured to do either, which
might solve your problem.
Note however that hyphen removal may sometimes be ambiguous, requiring
a somewhat arbitrary choice. Thus if both OCR systems used for double
checking do hyphen removal, they may sometimes disagree on which
hyphens should be removed, and there is not much one can do to avoid
this.
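To illustrate the ambiguity, here is a minimal Python sketch of naive line-end hyphen removal. The function name `dehyphenate` and the tiny `KNOWN_WORDS` set are made up for this illustration. A real OCR tool would presumably rely on a full dictionary or other linguistic analysis, and I do not know how ABBYY FineReader or Acrobat actually handle it.

```python
import re

# Hypothetical mini-dictionary, for illustration only.
KNOWN_WORDS = {"recognition", "hyphenation"}

# A word fragment, a hyphen at the end of a line, then the next fragment.
LINE_END_HYPHEN = re.compile(r"(\w+)-\n(\w+)")

def dehyphenate(text, known_words=KNOWN_WORDS):
    """Join words split across lines; keep the hyphen when unsure."""
    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in known_words:
            return joined            # hyphen came from line breaking: drop it
        return left + "-" + right    # possibly a real hyphen: keep it
    return LINE_END_HYPHEN.sub(join, text)

print(dehyphenate("character recog-\nnition of well-\nknown texts"))
```

With a different word list the second case could just as well come out as "wellknown", which is exactly the kind of choice on which two OCR systems may disagree.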
Another way to avoid the problem is to write a piece of software that
filters out the reported hyphen differences. How to do that depends on
the context you work in.
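As a rough idea of what such a filter could look like, here is a small Python sketch. It assumes that you can somehow get the reported differences as pairs of strings (text from the first OCR, text from the second OCR). I do not know in what form Acrobat reports them, so the `reported` list and all function names below are hypothetical.

```python
import re

def normalize(text):
    """Drop hyphens, soft hyphens and whitespace from a piece of text."""
    return re.sub(r"[-\u00ad\s]+", "", text)

def is_hyphen_only(difference):
    """True when the two sides differ only by hyphens or spacing."""
    first, second = difference
    return normalize(first) == normalize(second)

def filter_differences(differences):
    """Keep only the differences that hyphenation does not explain."""
    return [d for d in differences if not is_hyphen_only(d)]

# Hypothetical report: (text from first OCR, text from second OCR).
reported = [
    ("recog- nition", "recognition"),   # hyphenation only
    ("modern", "modem"),                # real OCR disagreement
    ("well-known", "wellknown"),        # hyphenation only
]
print(filter_differences(reported))
```

Only the "modern" / "modem" pair survives the filter. The price is that a genuinely wrong hyphen is also filtered out, so this only makes sense if hyphens do not matter for your purpose.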
More posts by @Kevin
