Here is one way to count the words in an epub file and list them by frequency, with the most-used words at the top of the list.

I did this on a Mac laptop; it will also work on Unix hosts.

The overview of the process:

1. Install Calibre
2. Use the ./ebook-convert command in Calibre to convert the epub file to text
3. Transform the entire text file to lowercase (so "Word" and "word" match)
4. Convert punctuation to whitespace (so "period." and "period " match)
5. Convert all whitespace to a newline. This puts each word on its own line.
6. Exclude any blank lines from the list
7. Sort the list of words alphabetically
8. Pipe (send) that list of words through uniq -c. You now have a count of how often each word appears.
9. Sort the result in numerical order. If you use the sort command with the -r argument, the most frequent words are at the top.

Here's an example of steps (2) through (9). The head command lists the top ten words in the final output.

$ ./ebook-convert ./book.epub ./book.txt
$ cat ./book.txt | tr '[:upper:]' '[:lower:]' | tr "“" " " | tr "”" " " | tr "," " " | tr "." " " | tr " " "\n" | grep -v '^$' | sort | uniq -c | sort -gr | head
5303 the
1960 and
1934 of
1910 to
1874 a
1168 i
1067 you
844 in
812 that
703 it
$

The result is pretty boring. The word 'the' appears 5303 times, while the word 'it' appears 703 times.
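
The chain of single-character tr calls only handles the four punctuation marks it names. As a rough sketch, and not the command that produced the output above, the same steps can be wrapped in a shell function that treats every non-letter as a word break. The function name word_freq is made up for this example. It assumes plain ASCII letters: contractions such as "don't" are split in two, and accented letters may be split as well, depending on your tr and locale.

```shell
# word_freq FILE: print "count word" lines, most frequent word first.
# Hypothetical helper for illustration, not part of Calibre.
word_freq() {
  # 1. lowercase everything
  # 2. turn every run of non-letters into a single newline (one word per line)
  # 3. drop any blank line left at the start
  # 4. sort, count duplicates, then sort by count, highest first
  tr '[:upper:]' '[:lower:]' < "$1" |
    tr -cs '[:alpha:]' '\n' |
    grep -v '^$' |
    sort |
    uniq -c |
    sort -rn
}
```

Running word_freq ./book.txt | head should give a top-ten list in the same format, though the counts may differ a little because more punctuation is stripped.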

I suspect that in most books the most common words are the tiny conjunctions, articles, prepositions and pronouns. Perhaps for something that is not a novel the result would be more interesting.
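
One way to test that suspicion is to drop the most common function words before looking at the top of the list. The sketch below filters the "count word" lines produced by uniq -c through awk. The function name drop_stopwords and the stop list (just the ten words from the output above) are assumptions for illustration; a useful stop list would be much longer.

```shell
# drop_stopwords: read "count word" lines on stdin and print only those
# whose word is not in the stop list. Hypothetical helper; the list here
# is only the ten words from the sample output and should be extended.
drop_stopwords() {
  awk -v stop="the and of to a i you in that it" '
    BEGIN {
      # build a lookup table from the space-separated stop list
      n = split(stop, words, " ")
      for (k = 1; k <= n; k++) skip[words[k]] = 1
    }
    # field 1 is the count, field 2 is the word
    !($2 in skip)
  '
}
```

It slots into the pipeline just before head, for example: ... | sort | uniq -c | sort -gr | drop_stopwords | head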

Good luck!

