
Natural Language Processing: Reading Project Gutenberg

This post is a process exploration of my most recent project, Reading Project Gutenberg, a proof of concept for a content-analysis recommendation engine. I’m going to go deeper into process on this one than I have for my other projects. I’m still getting the hang of explaining all this, so reach out with questions.

For our third project at Metis, we dipped into Natural Language Processing (NLP), a branch of machine learning that deals with reading and interpreting text. Google Translate, for example, uses machine learning to automatically translate text, which is an insanely hard thing to do.

Another constraint of the project: we needed to use unsupervised machine learning. (A primer on those topics here.) Basically, we had to write a program that would read text and do something with it, and we couldn’t provide the program with examples of what we wanted.

Is that vague enough for you?

My idea: Project Gutenberg is a great site dedicated to creating and distributing free e-books. But compared to Amazon, Gutenberg has little to no data about which users like which books.

So how can I help them with data science? By building a basic recommendation engine, which will offer books similar to a chosen selection. This can help guide readers to books they might like.

Here’s the process.

Structure

Logically, it makes sense that books that discuss similar topics and are written similarly would be enjoyed by similar readers. Knowing this, we can read and categorize each book by the words in it. Using NLP, we can separate out topics, ideas and themes, and find books that are alike.

This is called “clustering.” If we can prove that books can be distributed into similar clusters, we can begin sharpening those clusters and defining them. Perhaps we’ll see a “sci-fi” or a “war” cluster. For the recommendation engine, this will help us identify similar books.

So time to grab the data.

Process

NLP is computationally expensive, so analysis takes a long time when there are a lot of books. I used this ISO creator to get the full text of 93 random books from the fantasy, science fiction, drama and other fiction genres. However, the books aren’t labeled by genre, so I don’t know exactly which category each one belongs to.
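One preprocessing wrinkle worth knowing about: every Gutenberg text ships with a license header and footer that would pollute the word counts. A minimal sketch for stripping them, assuming the standard `*** START OF ...` / `*** END OF ...` markers (not necessarily how I did it in the project):

```python
import re

def strip_gutenberg_boilerplate(text):
    """Keep only the body between the standard *** START/END *** markers.

    If a marker is missing, fall back to the start or end of the text.
    """
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```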

After removing stopwords (common words like “they,” “there,” and “that”) and choosing the most important words, I was ready to run analyses on the text. A snapshot of my vectorizer, for those interested:
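For a sense of what that step looks like, here’s a sketch of a scikit-learn vectorizer for this kind of project; the parameter values are illustrative guesses, not my exact settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two stand-in "books"; the real corpus was 93 full texts.
books = [
    "The captain sailed the ship across the stormy sea.",
    "The king and queen ruled from the royal palace.",
]

# Illustrative settings: drop English stopwords, keep only the most
# informative terms, and ignore words in more than 80% of the books.
vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=5000,
    max_df=0.8,
)
doc_term_matrix = vectorizer.fit_transform(books)
```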

Using K-Means and Cosine Similarity, we can start clustering the books and see the actual structure. The machine doesn’t know what clusters exist – it just reads the books, finds similarities, and compares them to one another.
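A sketch of that step with scikit-learn, on a toy stand-in corpus (the data and the cluster count here are invented for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: two "sea" books and two "royalty" books.
books = [
    "ship captain deck sailed vessel sea",
    "boat sailor harbor captain voyage sea",
    "king queen princess palace royal majesty",
    "prince king crown throne royal court",
]
X = TfidfVectorizer().fit_transform(books)

# The machine is never told what the clusters mean; n_clusters is a
# choice you tune by inspecting the results.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

# sim[i][j] measures how alike books i and j are, regardless of length.
sim = cosine_similarity(X)
```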

After that, I created a visualization of all the books. By looking at them mapped out, we can see if there are patterns of similar books.

Each dot represents a book, and each color represents a cluster. The distance of each dot from all the other dots shows how similar they are. Closer dots are more similar in topic, and dots that are the same color are in the same cluster.
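I won’t reproduce the plotting code here, but one common way to get such a 2-D map is multidimensional scaling (MDS) on the cosine distances; a minimal sketch under that assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

books = [
    "ship captain deck sailed vessel sea",
    "boat sailor harbor captain voyage sea",
    "king queen princess palace royal majesty",
    "prince king crown throne royal court",
]
X = TfidfVectorizer().fit_transform(books)

# Squash the pairwise cosine distances down to two dimensions, so each
# book becomes an (x, y) dot and nearby dots are similar books.
dist = cosine_distances(X)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=42)
coords = mds.fit_transform(dist)  # one (x, y) pair per book
```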

(See the code here.)

[Figure: scatter plot of the books, colored by cluster]

Clearly, clusters have appeared. By looking at some of the top words in each cluster, we can determine their topics. For example:

Green Cluster: ships, boats, sailed, deck, captain, board, vessels
Orange Cluster: London, England, honor, chapter, poet
Purple Cluster: princess, princes, king, colonel, queen, palaces, majesty, royal
Pink Cluster: Captain, aunt, hotel, garden, doctor
Turquoise Cluster: Jack, spiked, wagon, dollars, allies

So you can see some clusters clearly, while others are less clear. The Green Cluster is obviously books about the high seas, and the three closest titles back up this claim:

-The Cruise of the Cachalot Round the World After Sperm Whales – Frank T. Bullen

-Ned Myers, Or a Life Before the Mast – James Fenimore Cooper

-The Rover’s Secret: A Tale of the Pirate Cays and Lagoons of Cuba – Harry Collingwood
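Those nearest titles fall straight out of the similarity matrix: a recommendation is just a sorted lookup. A minimal pure-Python sketch, with titles and scores invented for illustration:

```python
def recommend(title, titles, sim, top_n=3):
    """Return the top_n titles whose similarity to `title` is highest."""
    i = titles.index(title)
    ranked = sorted(range(len(titles)), key=lambda j: sim[i][j], reverse=True)
    return [titles[j] for j in ranked if j != i][:top_n]

# Invented example data: a 3x3 cosine-similarity matrix.
titles = ["Moby Dick", "Treasure Island", "Pride and Prejudice"]
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

recommend("Moby Dick", titles, sim, top_n=2)
# -> ['Treasure Island', 'Pride and Prejudice']
```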

So that’s it: we proved it can be done. If any of you want to help me build this thing for Gutenberg, let me know. And if this post was unclear in any way, please comment so I can clear things up or answer any questions.

As always, fork me. 

Cover picture is Stranger in a Strange Land, a painting by James Warhola for a cover of the book written by Robert A. Heinlein. A great classic sci-fi read.

 

 

Elizabethan Stop Words for NLP

For Natural Language Processing, you will usually want to clean the text of common “stop words” that rarely contribute to a topical analysis. No version of this list is standard, since requirements change from project to project.

If you find yourself processing either Elizabethan or older English texts, most modern stopword lists will fail to pick up things like “thee,” “thy,” or “thine.”

I couldn’t find an Elizabethan English stopword list for an NLP project I did with Project Gutenberg text, so I made one. See it below, or fork me on Github.

The older words are arranged in alphabetical order at the end of the standard stop word list on Github, and below is a copy-paste version so you can add it to your own stopword file.
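For a sense of how the list plugs in, here’s a minimal sketch; the words shown are just a sample of the additions (see the Github list for the full set):

```python
# A sample of the Elizabethan additions; the full list lives on Github.
elizabethan_stopwords = {
    "thee", "thou", "thy", "thine", "ye",
    "hath", "doth", "art", "wherefore", "hither",
}

def remove_stopwords(text, stopwords):
    """Drop stopwords from a lowercased, whitespace-split text."""
    return [word for word in text.lower().split() if word not in stopwords]

remove_stopwords("Wherefore art thou Romeo", elizabethan_stopwords)
# -> ['romeo']
```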

If I’ve forgotten any, let me know and I’ll add them.

 

All the world’s a stage,
And all the men and women merely players.
They have their exits and their entrances,
And one man in his time plays many parts,
His acts being seven ages.