This post is a process exploration of my most recent project, Reading Project Gutenberg, a proof of concept for a content-analysis recommendation engine. I’m going to go deeper on process for this one than I did for the other projects. I’m still getting the hang of explaining all this, so reach out with questions.
For our third project at Metis, we dipped into Natural Language Processing (NLP), a branch of machine learning that deals with reading and interpreting text. Google Translate, for example, uses machine learning to automatically translate text, which is an insanely hard thing to do.
Another requirement of the project: we needed to do unsupervised machine learning. (A primer on those topics here.) Basically, we had to write a program that would read text and do something with it, and we couldn’t provide the program with examples of what we wanted.
Is that vague enough for you?
My idea: Project Gutenberg is a great site dedicated to creating and distributing free e-books. Compared to Amazon, Gutenberg has little to no data about which users like which books.
So how can I help them with data science? By building a basic recommendation engine, which will offer books similar to a chosen selection. This can help guide readers to books they might like.
Here’s the process.
Logically, it makes sense that books that discuss similar topics and are written similarly would be enjoyed by similar readers. Knowing this, we can read and categorize each book by the words in it. Using NLP, we can separate out topics, ideas and themes, and find books that are alike.
This is called “clustering.” If we can prove that books can be distributed into similar clusters, we can begin sharpening those clusters and defining them. Perhaps we’ll see a “sci-fi” or a “war” cluster. For the recommendation engine, this will help us identify similar books.
So time to grab the data.
NLP is a computationally expensive endeavor, so it takes a long time if we have a lot of books. I used this ISO creator to get the full text of 93 random books from fantasy, science fiction, drama and other fiction genres. However, the books aren’t labeled by genre, so I don’t know exactly which category each one belongs to.
After removing stopwords (common words like they, there, that) and choosing the most important words, I was ready to run analyses on the text. A snapshot of my vectorizer, for those interested:
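from sklearn.feature_extraction.text import TfidfVectorizer
# stopwords (a list of words) and tokenize_and_stem (a function) are defined elsewhere in the project
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, max_features=200000,
                                   min_df=0.20, analyzer='word', stop_words=stopwords,
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))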
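To make the max_df and min_df settings above a little more concrete, here is a minimal pure-Python sketch of the idea behind TF-IDF with document-frequency cutoffs. This isn't the project code and it isn't scikit-learn's implementation (which, among other things, also normalizes each row by default). The function name tfidf_vectors and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=0.2, max_df=0.8):
    """Return (vocabulary, vectors) for a list of already-tokenized documents.

    min_df and max_df are fractions of the corpus: a term is kept only if it
    appears in at least min_df and at most max_df of all documents.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, count in df.items()
                   if min_df * n_docs <= count <= max_df * n_docs)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts for this document
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

# Toy "books", already tokenized with stopwords removed (hypothetical data).
docs = [
    ["ship", "captain", "deck", "sea"],
    ["ship", "sea", "whale", "captain"],
    ["king", "queen", "palace", "sea"],
    ["king", "princess", "royal", "sea"],
    ["doctor", "garden", "hotel", "sea"],
]
vocab, vectors = tfidf_vectors(docs, min_df=0.4, max_df=0.8)
print(vocab)  # ['captain', 'king', 'ship']
```

In this toy run, "sea" is dropped because it shows up in every document (above max_df), and words that appear in only one document are dropped because they fall below min_df. What survives are the words that help tell groups of books apart.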
Using K-Means and Cosine Similarity, we can start clustering the books and see the actual structure. The machine doesn’t know what clusters exist – it just reads the books, finds similarities, and compares them to one another.
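For the curious, here is one common way to combine the two ideas, sketched in plain Python: scale every book's vector to unit length, then run K-Means so that each book joins the centroid it is most similar to by cosine similarity. Treat this as an illustration under those assumptions, not the exact code behind the project. The function names and the toy vectors are hypothetical, and the farthest-first starting points are just a simple deterministic choice.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def kmeans_cosine(vectors, k, n_iter=20):
    """Cluster vectors by cosine similarity; returns one label per vector."""
    points = [normalize(v) for v in vectors]
    # Farthest-first initialization: start from the first point, then keep
    # adding the point least similar to the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            min(points, key=lambda p: max(cosine(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the re-normalized mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

# Hypothetical TF-IDF-like vectors for six "books" in two obvious groups.
vectors = [
    [3, 2, 0, 0], [2, 3, 0, 0], [4, 3, 1, 0],
    [0, 0, 3, 2], [0, 1, 2, 3], [0, 0, 2, 2],
]
labels = kmeans_cosine(vectors, k=2)
print(labels)
```

Notice that the function is never told what the groups mean. It only gets the vectors and the number of clusters to look for.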
After that, I created a visualization of all the books. By looking at them mapped out, we can see if there are patterns of similar books.
Each dot represents a book, and each color represents a cluster. The distance of each dot from all the other dots shows how similar they are. Closer dots are more similar in topic, and dots that are the same color are in the same cluster.
Clearly, clusters have appeared. By looking at some of the top words in each cluster, we can determine their topics. For example:
Green Cluster: ships, boats, sailed, deck, captain, board, vessels
Orange Cluster: London, England, honor, chapter, poet
Purple Cluster: princess, princes, king, colonel, queen, palaces, majesty, royal
Pink Cluster: Captain, aunt, hotel, garden, doctor
Turquoise Cluster: Jack, spiked, wagon, dollars, allies
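One common way to get a list of top words like the ones above is to average the TF-IDF vectors of every book in a cluster and pick the terms with the largest average weight. Here's a small sketch of that idea. The function name top_terms_per_cluster and the toy numbers are made up for illustration, and this isn't necessarily how the lists above were produced.

```python
from collections import defaultdict

def top_terms_per_cluster(vocab, vectors, labels, n_terms=3):
    """Return {cluster_label: [highest-weighted terms]} using cluster centroids."""
    members = defaultdict(list)
    for vec, label in zip(vectors, labels):
        members[label].append(vec)
    top = {}
    for label, vecs in members.items():
        # Centroid = average weight of each term across the cluster's books.
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        ranked = sorted(range(len(vocab)), key=lambda i: centroid[i], reverse=True)
        top[label] = [vocab[i] for i in ranked[:n_terms]]
    return top

# Hypothetical toy data: four terms, four "books", two clusters.
vocab = ["captain", "king", "queen", "ship"]
vectors = [
    [0.6, 0.0, 0.0, 0.8],
    [0.5, 0.1, 0.0, 0.9],
    [0.0, 0.7, 0.6, 0.0],
    [0.1, 0.8, 0.5, 0.0],
]
labels = [0, 0, 1, 1]
top = top_terms_per_cluster(vocab, vectors, labels, n_terms=2)
print(top)  # {0: ['ship', 'captain'], 1: ['king', 'queen']}
```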
Some clusters are easy to interpret, while others are less clear. The Green Cluster is obviously made up of books about the high seas, and the three closest titles back up this claim:
-The Cruise of the Cachalot Round the World After Sperm Whales – Frank T. Bullen
-Ned Myers, Or a Life Before the Mast – James Fenimore Cooper
-The Rover’s Secret: A Tale of the Pirate Cays and Lagoons of Cuba – Harry Collingwood
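Finding the "closest" titles is the heart of the recommendation idea: rank every other book by how similar its vector is to the one you picked. Below is a minimal sketch of that ranking using cosine similarity. The recommend function, the titles, and the vectors are all hypothetical stand-ins, not the project's data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(title, titles, vectors, n=3):
    """Return the n titles whose vectors are most similar to the chosen title's."""
    query = vectors[titles.index(title)]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in zip(titles, vectors) if other != title]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:n]]

# Made-up titles and vectors, purely for illustration.
titles = ["Sea Story A", "Sea Story B", "Court Romance", "Frontier Tale"]
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.2],
    [0.1, 0.1, 0.9],
]
picks = recommend("Sea Story A", titles, vectors, n=2)
print(picks)
```

With real data, the vectors would be the TF-IDF rows for each book, and the chosen title would be whatever the reader just finished.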
So that’s it. We proved it could be done, so now if any of you want to help me build this thing for Gutenberg, let me know. Also, if this post was unclear in any way, please comment so I can clear things up or answer any questions.
Cover picture is Stranger in a Strange Land, a painting by James Warhola for a cover of the book written by Robert A. Heinlein. A great classic sci-fi read.