
How can I remove spaces and tabs from the text layer of a DjVu document, for better text search, using the DjVuLibre library?

Removing the unnecessary characters (and their XML tags) also reduces the file size.



More posts by @Nuzhat


@Carla


First, use the djvutoxml tool to extract the text layer from the DjVu document in XML format.

At the command prompt, type:

path\djvutoxml.exe path\book.djvu path\book.xml

Replace path with the actual location on your disk.

Press Enter...

Then use regular expressions to remove the unwanted characters (the ones placed between the angle brackets > <). Any text editor with regex support will do.

Pattern to remove spaces:

<WORD><CHARACTER coords="\d*,\d*,\d*,\d*"> </CHARACTER></WORD>

Pattern to remove tabs:

<WORD><CHARACTER coords="\d*,\d*,\d*,\d*">&#9;</CHARACTER></WORD>

The regular expression can also be written in the following form, which removes spaces and tabs in one pass:

<WORD><CHARACTER coords="([0-9,]*?)">(&#9;| )</CHARACTER></WORD>
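If no regex-capable editor is at hand, the same substitution can be scripted. The following is only a minimal sketch in Python: it assumes the XML has the shape shown in this answer, and the names WHITESPACE_WORD and strip_whitespace_words are made up for illustration.

```python
import re

# Pattern from the answer: a WORD that wraps a single CHARACTER whose
# content is either the tab entity (&#9;) or one literal space.
WHITESPACE_WORD = re.compile(
    r'<WORD><CHARACTER coords="([0-9,]*?)">(&#9;| )</CHARACTER></WORD>'
)

def strip_whitespace_words(xml_text: str) -> str:
    """Return xml_text with every space/tab WORD element deleted."""
    return WHITESPACE_WORD.sub("", xml_text)

# Small sample in the same shape as djvutoxml output (assumed).
sample = (
    '<WORD coords="318,262,706,190">Hallo</WORD>\n'
    '<WORD><CHARACTER coords="707,262,760,190"> </CHARACTER></WORD>\n'
    '<WORD coords="761,262,813,190">World!</WORD>\n'
    '<WORD><CHARACTER coords="814,262,860,190">&#9;</CHARACTER></WORD>\n'
)
cleaned = strip_whitespace_words(sample)
print(cleaned)
```

The substitution leaves an empty line where each element used to be, which is harmless in XML; real words such as Hallo and World! are not touched because they carry their coords on the WORD tag itself.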

Original fragment:

<WORD coords="318,262,706,190">Hallo</WORD>
<WORD><CHARACTER coords="707,262,760,190"> </CHARACTER></WORD>
<WORD coords="761,262,813,190">World!</WORD>
<WORD><CHARACTER coords="814,262,860,190"> </CHARACTER></WORD>

Fixed fragment:

<WORD coords="318,262,706,190">Hallo</WORD>
[here was the code for the space]
<WORD coords="761,262,813,190">World!</WORD>
[here was the code for the space]

(Note how much text is removed: here it takes 62 characters to describe a single space!)

Finally, use the djvuxmlparser tool to merge the modified XML back into the DjVu document.
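The 62-character figure can be checked, and the total saving for a whole fragment estimated, with a few lines of Python. This is a rough sketch; the helper name removed_characters is hypothetical, and it counts characters in the XML text, not bytes in the final DjVu file.

```python
import re

# Same combined pattern as in the answer, without capturing groups.
PATTERN = re.compile(
    r'<WORD><CHARACTER coords="[0-9,]*?">(?:&#9;| )</CHARACTER></WORD>'
)

def removed_characters(xml_text: str) -> int:
    """Total length of all space/tab WORD elements found in xml_text."""
    return sum(len(m.group(0)) for m in PATTERN.finditer(xml_text))

one_space = '<WORD><CHARACTER coords="707,262,760,190"> </CHARACTER></WORD>'
fragment = (
    '<WORD coords="318,262,706,190">Hallo</WORD>\n'
    '<WORD><CHARACTER coords="707,262,760,190"> </CHARACTER></WORD>\n'
    '<WORD coords="761,262,813,190">World!</WORD>\n'
    '<WORD><CHARACTER coords="814,262,860,190"> </CHARACTER></WORD>\n'
)

print(len(one_space))                 # 62
print(removed_characters(fragment))   # 124
```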

At the command prompt, type:

path\djvuxmlparser.exe -o path\final.djvu path\book.xml

Replace path with the actual location on your disk.
The -o parameter specifies the output file.

Press Enter...
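The three steps (extract, clean, merge) could also be chained in one script. The sketch below is an assumption-laden outline, not a tested tool: the function names are invented, the paths to the two executables must be supplied by the caller, and it assumes the XML file can be read and written as UTF-8.

```python
import re
import subprocess
from pathlib import Path

PATTERN = re.compile(
    r'<WORD><CHARACTER coords="[0-9,]*?">(?:&#9;| )</CHARACTER></WORD>'
)

def extract_command(djvutoxml, book_djvu, book_xml):
    """Command line for step 1: djvutoxml book.djvu book.xml"""
    return [str(djvutoxml), str(book_djvu), str(book_xml)]

def merge_command(djvuxmlparser, final_djvu, book_xml):
    """Command line for step 3: djvuxmlparser -o final.djvu book.xml"""
    return [str(djvuxmlparser), "-o", str(final_djvu), str(book_xml)]

def clean_xml_file(book_xml, encoding="utf-8"):
    """Step 2: delete space/tab WORD elements from the XML file in place.

    The encoding is an assumption; adjust it if the file declares another.
    """
    path = Path(book_xml)
    text = path.read_text(encoding=encoding)
    path.write_text(PATTERN.sub("", text), encoding=encoding)

def clean_djvu(djvutoxml, djvuxmlparser, book_djvu, book_xml, final_djvu):
    """Run all three steps; raises CalledProcessError if a tool fails."""
    subprocess.run(extract_command(djvutoxml, book_djvu, book_xml), check=True)
    clean_xml_file(book_xml)
    subprocess.run(merge_command(djvuxmlparser, final_djvu, book_xml), check=True)
```

A call might look like clean_djvu(r"path\djvutoxml.exe", r"path\djvuxmlparser.exe", r"path\book.djvu", r"path\book.xml", r"path\final.djvu"), with path replaced as described above. Keeping a copy of the original book.djvu is a sensible precaution before merging.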

