Tag Archives: project

Natural Language Processing: Reading Project Gutenberg

This post is a process exploration of my most recent project, Reading Project Gutenberg, a proof of concept for a content-analysis recommendation engine. I’m going to go deep on process for this one, more so than the other projects. I’m still getting the hang of explaining all this, so reach out with questions.

For our third project at Metis, we dipped into Natural Language Processing, a branch of machine learning that deals with reading and interpreting text. Google Translate, for example, uses machine learning to automatically translate text, which is an insanely hard thing to do.

One more requirement for the project: we needed to do unsupervised machine learning. (A primer on those topics here.) Basically, we had to write a program that would read text and do something with it, and we couldn’t provide the program with examples of what we wanted.

Is that vague enough for you?

My idea: Project Gutenberg is a great site dedicated to creating and distributing free e-books. Compared to Amazon, Gutenberg has little to no data about which users like which books.

So how can I help them with data science? By building a basic recommendation engine, which will offer books similar to a chosen selection. This can help guide readers to books they might like.

Here’s the process.


Logically, it makes sense that books that discuss similar topics and are written similarly would be enjoyed by similar readers. Knowing this, we can read and categorize each book by the words in it. Using NLP, we can separate out topics, ideas and themes, and find books that are alike.

This is called “clustering.” If we can prove that books can be distributed into similar clusters, we can begin sharpening those clusters and defining them. Perhaps we’ll see a “sci-fi” or a “war” cluster. For the recommendation engine, this will help us identify similar books.

So time to grab the data.


NLP is a computationally expensive endeavor, and the run time grows with the number of books. I used this ISO creator to get the full text of 93 random books from fantasy, science fiction, drama and other fiction genres. However, the books aren’t labeled by genre, so I don’t know exactly which category each one is in.

After removing stopwords (common words like they, there and that) and choosing the most important words, I was ready to run analyses on the text. A snapshot of my vectorizer, for those interested:
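The original snapshot isn’t reproduced here. As a rough stand-in, here is a minimal sketch of what stopword removal and word weighting can look like in plain Python. The tiny stopword list, the helper names `tokenize` and `tfidf_vectors`, the three toy sentences, and the choice of TF-IDF weighting are all assumptions made for illustration, not necessarily what the project used.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would hold hundreds of words.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
             "they", "there", "that", "was", "is"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def tfidf_vectors(documents):
    """Return one {word: weight} dict per document, using TF-IDF weights."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append({
            # Frequent in this document, rare across documents => high weight.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

# Toy stand-ins for full book texts.
books = [
    "The captain sailed the ship to the harbor",
    "The king and the queen ruled the palace",
    "The captain of the ship met the king",
]
vectors = tfidf_vectors(books)
```

In this toy example, "sailed" ends up with a higher weight than "captain" in the first text, because "sailed" appears in only one of the three documents.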

Using K-Means and Cosine Similarity, we can start clustering the books and see the actual structure. The machine doesn’t know what clusters exist – it just reads the books, finds similarities, and compares them to one another.
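To make that concrete, here is a minimal sketch of clustering with cosine similarity in plain Python. It is a simplified variant (sometimes called spherical k-means) that assigns each vector to the centroid it is most similar to. Standard K-Means uses Euclidean distance, and the project’s actual setup may differ. The function names, the fixed iteration count, the seed, and the two-dimensional toy vectors are assumptions for illustration.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def kmeans(vectors, k, iterations=20, seed=0):
    """Group vectors into k clusters by cosine similarity to each centroid."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its most similar centroid.
        labels = [
            max(range(k), key=lambda c: cosine_similarity(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels, centroids

# Toy vectors: the first two point one way, the last two another.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(toy_vectors, k=2)
```

Note that the code never sees a genre label. It only compares vectors, which is what makes this unsupervised.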

After that, I created a visualization of all the books. By looking at them mapped out, we can see if there are patterns of similar books.

Each dot represents a book, and each color represents a cluster. The distance of each dot from all the other dots shows how similar they are. Closer dots are more similar in topic, and dots that are the same color are in the same cluster.

(See the code here.)

[Screenshot: the books plotted as dots, colored by cluster]

Clearly, clusters have appeared. By looking at some of the top words in each cluster, we can determine their topics. For example:

Green Cluster: ships, boats, sailed, deck, captain, board, vessels
Orange Cluster: London, England, honor, chapter, poet
Purple Cluster: princess, princes, king, colonel, queen, palaces, majesty, royal
Pink Cluster: Captain, aunt, hotel, garden, doctor
Turquoise Cluster: Jack, spiked, wagon, dollars, allies
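One way to pull out top words like these is to average each word’s weight across the books in a cluster and keep the highest-scoring words. Here is a minimal sketch in plain Python. The function name `top_words_per_cluster`, the numeric cluster labels, and the made-up toy weights are assumptions for illustration, not the project’s actual numbers.

```python
from collections import defaultdict

def top_words_per_cluster(vectors, labels, n=3):
    """Return {cluster label: the n words with the highest average weight}."""
    totals = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for vector, label in zip(vectors, labels):
        sizes[label] += 1
        for word, weight in vector.items():
            totals[label][word] += weight
    top = {}
    for label, word_totals in totals.items():
        averaged = {w: t / sizes[label] for w, t in word_totals.items()}
        # Highest average weight first; ties are broken alphabetically.
        ranked = sorted(averaged, key=lambda w: (-averaged[w], w))
        top[label] = ranked[:n]
    return top

# Made-up word weights for three books, and a made-up cluster for each.
toy_vectors = [
    {"ships": 0.5, "deck": 0.3},
    {"ships": 0.4, "captain": 0.2},
    {"king": 0.6, "queen": 0.1},
]
toy_labels = [0, 0, 1]
top_words = top_words_per_cluster(toy_vectors, toy_labels)
```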

Some of these clusters are easy to interpret, while others are less clear. The Green Cluster is obviously books about the high seas, and the three closest titles back up this claim:

-The Cruise of the Cachalot Round the World After Sperm Whales – Frank T. Bullen

-Ned Myers, Or a Life Before the Mast – James Fenimore Cooper

-The Rover’s Secret: A Tale of the Pirate Cays and Lagoons of Cuba – Harry Collingwood
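Finding the “closest titles” is also the core of the recommendation idea: pick a book, score every other book against it, and return the best matches. Here is a minimal sketch, assuming each book is stored as a {word: weight} dictionary. The `recommend` function, the placeholder titles, and the weights are hypothetical and only there to show the mechanics.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(title, library, n=3):
    """Return the n titles in the library most similar to the given title."""
    target = library[title]
    scores = [
        (cosine_similarity(target, vector), other)
        for other, vector in library.items()
        if other != title
    ]
    # Most similar first; ties are broken alphabetically by title.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scores[:n]]

# Placeholder titles with made-up word weights.
library = {
    "Sea Story A": {"ship": 0.6, "captain": 0.4},
    "Sea Story B": {"ship": 0.5, "deck": 0.5},
    "Court Tale": {"king": 0.7, "queen": 0.3},
    "Mixed Tale": {"captain": 0.3, "king": 0.3},
}
suggestions = recommend("Sea Story A", library, n=2)
```

Here the other sea story ranks first because it shares the heavily weighted word "ship", while the court tale shares no words at all and scores zero.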

So that’s it. We proved it could be done, so now if any of you want to help me build this thing for Gutenberg, let me know. Also, if this post was unclear in any way, please comment so I can clear things up or answer any questions.

As always, fork me. 

Cover picture is Stranger in a Strange Land, a painting by James Warhola for a cover of the book written by Robert A. Heinlein. A great classic sci-fi read.



Mapping NYC subway traffic: an interactive

Ever wondered if you could count how many people go through the subway every day?

Okay, probably not. But bear with me here. No code this time.

For our first project in the Metis Data Science Bootcamp, we were given a hypothetical data science project by a company. Our team was asked to use data to help a nonprofit. In an email from an organization created to advocate for women in tech, we got our assignment. A quote:

Where we’d like to solicit your engagement is to use MTA subway data, which as I’m sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.

Basically, calculate where people are. But it wasn’t so simple. Our team, made up of Ingrid, Ben, Ken and me, thought through it like this:

1. The busiest turnstiles aren’t necessarily the best. We’re looking for a demographic here – young, progressive and interested in tech. Thousands of pissed-off people at Penn Station won’t be any good. We crunched census, income and community data to identify the best neighborhoods.

2. Sometimes the data you’re given isn’t enough. We had to look for lots of extra resources beyond simple MTA turnstile data. Some of this helped us make the map below.

3. When you’re doing data science, make something useful. It’s easy to get lost in “We could do this…” and “But what if…” What if what your client actually cares about is something they can use, not all the stuff you discovered? Never forget your end goal.

And so, what we presented to the company is the map below. We selected five places for their street teams to hang out. The heat flashes show the busiest subway stops over the dates you can see in the bottom corner. Notice how they change throughout the week?

Next step is to plot hourly movements over a day.

Why I’m Becoming a Data Scientist

First of all, this post exists because I’m currently in New York, studying the art of Data Science with Metis. It’s a 12-week boot camp focused on training us to use all the tools needed to pull insights from massive mountains of data.

I’m here because I want a job as a Data Scientist.

Usually when I say this, people respond in three ways:

  1. But I thought you wanted to be a journalist?
  2. Aren’t you getting your master’s degree in journalism?
  3. What the hell is data science, and why would you want to do it?

My reasons are both philosophical and practical, so here’s a short explainer.

What is Data Science?

Our entire world is recorded in bits of information. Thanks to technology and the internet, human beings create and store more information now than at any point in the history of our species. From birth to death, our entire lives are recorded with paper documents, Google searches, emails, pictures and Facebook statuses. Every day, billions of humans create trillions of data points.

For example, this is how much information people create every minute of every day:


That’s a monstrous amount of information, and we’re only looking at a slice! Many firms across hundreds of industries are also recording their own information, as more of our world goes digital. Thanks to constantly improving server memory, it’s cheaper than ever to save all this information, so most of this stuff is just sitting around, unused.

But what if we could use all this information?

What if the data revealed patterns? What if we could look deeply at the information and discover the who, what, where, why and how of our world, to empower people, companies and governments to make better decisions?

People are starting to do just that, and they’re already making waves. Mandatory reading, for example: Big Data, A Revolution.

That’s data science. And it sounds hella awesome.

Why do Data Science?

While I was doing my master’s degree in Data Journalism at Mizzou, I realized data scientists and data journalists are basically the same thing. We learned lots of amazing tools like D3.js, CartoDB and Highcharts to tell stories with data. One major difference is that professional data scientists have stronger backgrounds in mathematics, programming and statistics, which would seriously help data journalists. As I created data-driven projects and infographics, I thought a lot about how I could use data to tell stories. Soon enough, it wasn’t satisfying to make a chart or a graph here or there. I wanted to do something bigger with data.

I also realized my journalistic skills – analysis, research, inherent curiosity and storytelling – were a perfect fit for a job as a data scientist. This inspired me to think outside the box and join this bootcamp, where I could get the programming and statistics needed to complete my education. This Venn Diagram explains the rare and challenging mix of skills needed to be a great Data Scientist.

[Image: Venn diagram of data science skills]

There’s also the practical motivation: Data Science as a profession is exploding, and every industry, from entertainment to healthcare, is hiring. Demand is high, and so is the pay: the median salary of a data scientist is around $107,000. Compare that to journalism, where median salaries usually sit around $31,000 a year for newspaper reporters and layoffs loom around every corner. The storytelling opportunities could also be deeper if I were involved with data science research.

Data science is still emerging, and with it, the potential for good or evil. I want to put the skills and ethics I learned as a journalist to use in this industry, so I can help establish responsible, ethical and useful uses of data to improve our world.

That’s it, pretty much. Right now, I’m looking to join a data science team that echoes those values, so I can better learn the skills of the trade. Thanks for reading this far and letting me explain this. Every week or so I’ll be blogging here about this camp, if you’re interested in learning more.

For more on Data Science, read the groundbreaking Booz Allen Hamilton Field Guide to Data Science, online for free.

Also, yes – I’m still working on my thesis. I haven’t forgotten about you, Mizzou.

Syrian Refugees In the United States, An Interactive

This bit is inspired by a map published in the New York Times. This piece originally appeared in the Columbia Missourian.

People think they’re stopping Syrian refugees from entering the United States, but guess what: they’re already here.

The graphic below shows states with governors who have pledged to keep out refugees, while the blue dots show where most of the refugees from the last ten years have resettled.

I made this by smashing together some data from various government sources.

Header photo is from Flickr.