
: How to skip those hyphens while searching in Acrobat
I have two PDF files:
the original image PDF, where the PDF file is just an image of the text
the OCR PDF, produced by OCR using ABBYY FineReader 11
This is the process I followed:
I had an image PDF and used ABBYY to perform OCR on it.
Then I converted my PDF to a searchable PDF in ABBYY PDF Transformer 3.0.
Then I compared the source file and the OCR file using Acrobat, but all the hyphens are reported as errors. How can I skip those hyphens while searching?
Free books android app tbrJar TBR JAR Read Free books online gutenberg
More posts by @Xin

2 Comments
I cannot answer you with certainty because I do not have the tools you
use. Hence I am not sure what kind of file they create (PDF can come
in different flavors). Furthermore, besides the basic character
recognition, OCR may or may not involve various steps, often based on
linguistic analysis, intended to improve the result. Hyphen removal is
one such step, since there is usually no point in keeping hyphenation
in the resulting OCRed text.
See for example the question "How to remove hard hyphens?" or this
Master's thesis on dehyphenation.
Comparing images is sometimes possible, but unlikely to give you
easily usable results for textual purposes.
Thus one must assume that you used one tool (ABBYY FineReader 11) to
get a first OCRed document, and that your comparison did a second OCR on
the image to compare it with the first result.
This may make sense, and may help identify some possibly erroneous
locations in the OCRed text (though I do not know whether it is actually
done this way, as there may be other ways). It is also likely that both
OCR processes will sometimes agree on making the same mistake, which then
remains undetected.
Now it is possible that one OCR system does hyphenation removal while
the other does not. Then the comparison of the resulting files would show
differences wherever there was a hyphen in the original image. By the way,
does Acrobat find (mostly) extra hyphens or missing hyphens?
Doing away with the problem would require that both OCR steps do the
same regarding hyphenation: either keep it or remove it. I would
expect that some OCR tools can be configured to do either, which
might solve your problem.
Note, however, that hyphen removal may sometimes be ambiguous, requiring
a somewhat arbitrary choice. Thus, if both OCR systems used for double
checking do hyphen removal, they may sometimes disagree on which
hyphens should be removed, and there is not much one can do to avoid
this.
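To illustrate why the choice can be arbitrary, here is a minimal Python sketch of dictionary-based dehyphenation. It is not what ABBYY or Acrobat do internally (I do not know that). The names `KNOWN_WORDS`, `dehyphenate` and `keep_when_unknown` are made up for this example, and the tiny word list stands in for a real dictionary.

```python
import re

# Tiny stand-in vocabulary (an assumption for this example);
# a real tool would use a full dictionary for the document's language.
KNOWN_WORDS = {"document", "recognition"}


def dehyphenate(text: str, keep_when_unknown: bool = True) -> str:
    """Join words that were split by a hyphen at the end of a line.

    If the joined word is in KNOWN_WORDS, the hyphen is dropped.
    Otherwise the outcome depends on keep_when_unknown, which is
    exactly the arbitrary choice two OCR tools may make differently.
    """
    def join(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in KNOWN_WORDS:
            return joined
        return f"{left}-{right}" if keep_when_unknown else joined

    # A word, a hyphen, a line break (with optional spaces), then a word.
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)


sample = "the docu-\nment is well-\nknown"
print(dehyphenate(sample))                           # hyphen kept in "well-known"
print(dehyphenate(sample, keep_when_unknown=False))  # hyphen dropped everywhere
```

Two tools that differ only in the equivalent of `keep_when_unknown` would produce texts that disagree at "well-known", even though neither misread a character.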
Another way to avoid the hyphen problem is to write a piece of software
that filters out the reported hyphen differences. How to do that depends
on the context you work in.
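As a rough sketch of such a filter, assuming you can export plain text from both PDFs (how you do that depends on your tools), the Python below compares the two texts word by word with the standard `difflib` module and drops every difference that disappears once hyphens and whitespace are ignored. The function names and the `HYPHENS` set are my own choices for this example, not part of any ABBYY or Acrobat feature.

```python
import difflib
import re

# Hyphen-like characters an OCR tool might emit: ASCII hyphen, soft hyphen,
# Unicode hyphen, and non-breaking hyphen (an assumed, non-exhaustive set).
HYPHENS = "-\u00ad\u2010\u2011"


def strip_hyphens(text: str) -> str:
    """Remove hyphens and all whitespace, so that 'docu- ment',
    'docu-ment' and 'document' all compare as equal."""
    return re.sub(f"[{re.escape(HYPHENS)}\\s]", "", text)


def non_hyphen_differences(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (fragment_a, fragment_b) pairs that differ by more than hyphenation."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        frag_a = " ".join(words_a[a1:a2])
        frag_b = " ".join(words_b[b1:b2])
        # Skip differences that vanish once hyphens and spacing are ignored.
        if strip_hyphens(frag_a) == strip_hyphens(frag_b):
            continue
        differences.append((frag_a, frag_b))
    return differences


ocr_a = "The docu- ment was scanned yesterday"
ocr_b = "The document was scaned yesterday"
print(non_hyphen_differences(ocr_a, ocr_b))
```

Here the "docu- ment" versus "document" difference is filtered out and only the "scanned" versus "scaned" mismatch is reported. Note that this filter also ignores differences in spacing, which may or may not be what you want.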
If I understand your question correctly, you are asking for a way to automatically double-check an OCR conversion from image to text. I'm afraid that is not possible: checking would require conversion from image to text via OCR, which I'm sure you can see is circular. If you want to proof an OCR conversion, you'll have to do it the old-fashioned way. At least it's faster than typing the whole thing out (and then proofing it again)!