
Natural Language Processing: Reading Project Gutenberg

This post is a process exploration of my most recent project, Reading Project Gutenberg, a proof of concept for a content-analysis recommendation engine. I’m going to go deeper into process on this one than on my other projects. I’m still getting the hang of explaining all this, so reach out with questions.

For our third project at Metis, we dipped into Natural Language Processing, a branch of machine learning that deals with reading and interpreting text. Google Translate, for example, uses machine learning to automatically translate text, which is an insanely hard thing to do.

Another constraint of the project: we needed to use unsupervised machine learning. (A primer on those topics here.) Basically, we had to write a program that would read text and do something with it, and we couldn’t provide the program with examples of what we wanted.

Is that vague enough for you?

My idea: Project Gutenberg is a great site dedicated to creating and distributing free e-books. Compared to Amazon, Gutenberg has little to no data about which users like which books.

So how can I help them with data science? By building a basic recommendation engine, which will offer books similar to a chosen selection. This can help guide readers to books they might like.

Here’s the process.


Logically, it makes sense that books that discuss similar topics and are written in a similar style would be enjoyed by similar readers. Knowing this, we can read and categorize each book by the words in it. Using NLP, we can separate out topics, ideas and themes, and find books that are alike.

This is called “clustering.” If we can prove that books can be distributed into similar clusters, we can begin sharpening those clusters and defining them. Perhaps we’ll see a “sci-fi” or a “war” cluster. For the recommendation engine, this will help us identify similar books.

So time to grab the data.


NLP is computationally expensive, so the analysis takes a long time when there are a lot of books. I used this ISO creator to get the full text of 93 random books from the fantasy, science fiction, drama and other fiction genres. The books aren’t labeled by genre, however, so I don’t know exactly which category each one belongs to.

After removing stopwords (common words like “they,” “there,” and “that”) and choosing the most important words, I was ready to run analyses on the text. A snapshot of my vectorizer, for those interested:
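The original vectorizer snapshot isn’t reproduced here, but a minimal sketch of this step with scikit-learn’s TfidfVectorizer looks like the following. The parameter values and the tiny stand-in corpus are my assumptions, not the originals:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for the full book texts (assumed loaded elsewhere).
books = [
    "the captain sailed his ship across the stormy sea",
    "the queen ruled the royal palace beside the king",
]

# TF-IDF weights words by how distinctive they are to each book;
# these parameter values are illustrative, not the originals.
vectorizer = TfidfVectorizer(
    stop_words="english",  # drops common words like "they", "there", "that"
    max_features=5000,     # keeps only the highest-scoring words
)
tfidf_matrix = vectorizer.fit_transform(books)
print(tfidf_matrix.shape)  # one row per book, one column per kept word
```

Each book becomes a row of word weights, which is what the clustering steps below operate on.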

Using K-Means and cosine similarity, we can start clustering the books and see their actual structure. The machine doesn’t know in advance what clusters exist – it just reads the books, finds similarities, and compares them to one another.
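A sketch of that clustering step, using the scikit-learn versions of K-Means and cosine similarity. The toy corpus and the choice of k are assumptions (the post doesn’t state the k used on the real 93 books):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy corpus standing in for the 93 full book texts.
books = [
    "the captain sailed the ship across the sea with his crew",
    "sailors on deck watched the vessel cross the stormy ocean",
    "the king and queen ruled the royal palace with the princess",
    "the prince bowed to her majesty in the royal court",
    "the poet wandered london and wrote of england in his chapters",
    "an english writer published poems about honor and london",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(books)

# K-Means groups books into k clusters of similar word usage;
# k=3 fits this toy corpus, not necessarily the real project.
km = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = km.fit_predict(X)

# Cosine similarity scores how alike two books' word profiles are,
# independent of how long each book is.
sim = cosine_similarity(X)
```

`labels` assigns each book a cluster number, and `sim` holds a similarity score for every pair of books.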

After that, I created a visualization of all the books. By looking at them mapped out, we can see if there are patterns of similar books.

Each dot represents a book, and each color represents a cluster. The distance of each dot from all the other dots shows how similar they are. Closer dots are more similar in topic, and dots that are the same color are in the same cluster.

(See the code here.)
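The post doesn’t say which projection produced the 2D map, but multidimensional scaling (MDS) over cosine distances is one common way to get it; here’s a sketch under that assumption, again with a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

# Toy corpus in place of the 93 books.
books = [
    "the captain sailed the ship across the sea",
    "sailors on deck watched the vessel cross the ocean",
    "the king and queen ruled the royal palace",
    "the prince bowed to her majesty at court",
]

X = TfidfVectorizer(stop_words="english").fit_transform(books)

# Turn pairwise cosine distances into 2D coordinates so that books
# with similar vocabularies land near each other on the plot.
dist = cosine_distances(X)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=42)
coords = mds.fit_transform(dist)
# coords[:, 0] and coords[:, 1] are x/y positions for a scatter plot,
# colored by each book's cluster label.
```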

[Visualization: the 93 books plotted as colored dots, grouped by cluster]

Clearly, clusters have appeared. By looking at some of the top words in each cluster, we can determine their topics. For example:

Green Cluster: ships, boats, sailed, deck, captain, board, vessels
Orange Cluster: London, England, honor, chapter, poet
Purple Cluster: princess, princes, king, colonel, queen, palaces, majesty, royal
Pink Cluster: Captain, aunt, hotel, garden, doctor
Turquoise Cluster: Jack, spiked, wagon, dollars, allies

So you can clearly see some clusters appearing, while others are less clear. The Green Cluster is obviously made up of books about the high seas, and the three closest titles back up this claim:

- The Cruise of the Cachalot Round the World After Sperm Whales – Frank T. Bullen
- Ned Myers, Or a Life Before the Mast – James Fenimore Cooper
- The Rover’s Secret: A Tale of the Pirate Cays and Lagoons of Cuba – Harry Collingwood
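Finding the “closest titles” like this is the recommendation step itself: given a book, return the others with the highest cosine similarity. A minimal sketch, with hypothetical titles and texts standing in for the Gutenberg catalog:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and texts standing in for the Gutenberg books.
titles = ["Sea Voyage", "Whaling Days", "Palace Intrigue", "Royal Court"]
texts = [
    "the captain sailed the ship across the stormy sea",
    "the whaling vessel and its captain hunted across the ocean",
    "the king and queen schemed inside the royal palace",
    "the prince bowed to her majesty at the royal court",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)
sim = cosine_similarity(X)

def recommend(title, n=3):
    """Return the n books most similar to the given title."""
    i = titles.index(title)
    # Sort by similarity, highest first; skip position 0 (the book itself).
    nearest = np.argsort(sim[i])[::-1][1:n + 1]
    return [titles[j] for j in nearest]

print(recommend("Sea Voyage", n=2))
```

Here the two sea books share vocabulary, so each recommends the other first.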

So that’s it. We proved it could be done, so if any of you want to help me build this thing for Gutenberg, let me know. And if this post was unclear in any way, please comment so I can clear things up or answer any questions.

As always, fork me. 

Cover picture is Stranger in a Strange Land, a painting by James Warhola for a cover of the book written by Robert A. Heinlein. A great classic sci-fi read.



Mizzou and bold new frontiers

Interactive data visualizations are all the rage these days, with major news organizations like the WSJ and the New York Times setting up interactive desks that churn out engrossing, compelling visualizations.

Mike Jenner, the Houston Harte Chair in Journalism at Mizzou and data visualization extraordinaire, set up a workshop back in October where dataviz journalists Chris Canipe (WSJ), Andrew Garcia-Phillips (Chartball) and Leah Becerra (Omaha World-Herald) came and taught us D3, a JavaScript library for data visualization, in one speedy weekend.

Check out that last link for some awesome data visualizations that capture the power of D3.

In one hot-and-heavy 16-hour sprint, we learned the basics of data viz. Nobody left an expert, but the class exposed lots of students to the future of digital journalism.

This was huge for one reason: Chris, Andrew and Leah were all self-taught in data viz, and they brought what they learned into an academic environment.

After the class, Madi Alexander and I organized the Mizzou Data Viz Club, where we met and tried to hang on to the skills we learned. (Madi recently landed a digital reporting internship at the NYT. Hooray!)

In a conversation with Mike, I discovered he wanted to create a longer-term class; he just needed pledges that students would take it.

Our Dataviz club had the students and he had the resources. It was like the planets aligned.

Mike moved swiftly, organizing an 8-week course in D3 with Chris for this spring semester. I helped him design the posters to advertise it and recruited a bunch of students to join the class. Chris drives in occasionally from his home in Saint Louis, where he works remotely for the WSJ.

Our skill levels are all over the board, from accomplished programmers to brand-new students. The class is open and modular, with each student working at their own pace. It’s pieced together and sometimes challenging, but I want to outline why this Data Visualization class is wildly important to the future of Mizzou journalism academics.

1. Data visualization skills are in high demand. The success of Mizzou’s CAR and Data Reporting classes is a testament to this. We teach students how to find the data and how to pull stories from it, and now we’re on the cutting edge of visualizing it.

2. Most people who know this stuff were self-taught, and our class is the foundation for rigorous academic treatment of the subject. By turning this into an academic affair, we make it easier for students to learn the basics quickly. Once people learn those basics, they can move beyond them, developing new techniques and taking them to industry publications.

3. It’s confusing, challenging and uneven – but it’s happening, and we’re moving forward, setting standards for future dataviz classes. After this is over, we’ll know which classes should be required as prerequisites. We’ll understand the gaps in the digital knowledge of the journalists we’re training. We’ll know what classes we need to establish a powerful data journalism course sequence. This is us surging into a new frontier for science, know what I mean?

As we move forward, I’ll inevitably have more to say about this venture, so stay tuned.

Why you should learn Dataviz now

The other weekend I sat in on a Data Visualization introductory class taught over three days by three professionals in the business: Chris Canipe of The Wall Street Journal, Andrew Garcia Phillips of ChartBall.com, and Leah Becerra of the Omaha World-Herald.

In a quick-and-dirty 16-hour sprint, we were introduced to a variety of tools, including HighCharts, D3 and various text editors.

Using these tools, we built a basic interactive graph using raw sports data. Numbers go in, beautiful pictures come out. This stuff is cutting edge – peep some gorgeous examples here. One of Mizzou’s own used these kinds of data visualizations to win a Pulitzer, and these graphics are common at the New York Times and The WSJ.

The weekend was crazy. Basically, a whole bunch of journalism nerds got together and did nerdy journalism stuff. And it was exceedingly awesome, and you should feel bad that you missed it.

But fret not – you can learn these highly demanded skills on your own with a little determination. Here’s why (and how) you should.

1. Because it’s part of the future of journalism. Take a look at journalism’s history and you’ll notice the people on the cutting edge are always the most successful, whether it’s Ben Franklin and his printing presses or ABC and color television. Take a lesson from the greats and secure your spot in journalism’s shining future, or something like that.

2. Because it’s a wild storytelling tool that helps audiences process the internet’s infinite stores of data. Journalists are no longer “gatekeepers” – if people want to know something, they can find any information they want on the internet. The flipside? There’s so much data, so many websites, that people get turned off by the gushing stream. Data visualizations help people process and explore vast amounts of data. All you do is hold their hand through it.

3. BECAUSE YOU CAN LEARN IT ON YOUR OWN FOR FREE. Like, seriously. Programming is becoming an easy skill to learn on your own, and all the journalists who taught this course taught themselves first. Explore sites like Codecademy, Treehouse, GitHub, and W3Schools and you could know as much as anyone with a computer science degree. For D3 specifically, start here.

4. Because if you’re a Mizzou student, we just started a data visualization club, and there may be a class in the spring. Jump on it.