Tag Archives: data

Elizabethan Stop Words for NLP

For Natural Language Processing, you will usually want to clean the text of common “stop words” that rarely contribute to a topical analysis. No version of this list is standard, as requirements change from project to project.

If you find yourself processing either Elizabethan or older English texts, most modern stopword lists will fail to pick up things like “thee,” “thy,” or “thine.”

I couldn’t find an Elizabethan English stopword list for an NLP project I did with Project Gutenberg text, so I made one. See it below, or fork me on GitHub.

The older words are arranged in alphabetical order at the end of the standard stop word list on GitHub, and below is a copy-paste version so you can easily add it to your own stopword file.

If I’ve forgotten any, let me know and I’ll add them.
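To show how a list like this gets used, here is a minimal sketch in Python. Only “thee,” “thy” and “thine” come from this post; the other older words, the tiny modern list and the `remove_stopwords` helper are illustrative assumptions, not the full list from the repo.

```python
import re

# Illustrative subset only: "thee", "thy" and "thine" are named in the post;
# "thou", "hath" and "doth" are assumed additions, not the author's full list.
ELIZABETHAN_STOPWORDS = {"thee", "thy", "thine", "thou", "hath", "doth"}

# A tiny stand-in for a modern stopword list (assumption for this sketch).
MODERN_STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "that"}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]


line = "To thine own self be true"

# A modern list alone lets "thine" slip through.
print(remove_stopwords(line, MODERN_STOPWORDS))

# Merging in the older words catches it.
combined = MODERN_STOPWORDS | ELIZABETHAN_STOPWORDS
print(remove_stopwords(line, combined))
```

Keeping the older words in their own set makes it easy to merge them into whatever modern list your project already uses.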


All the world’s a stage,
And all the men and women merely players.
They have their exits and their entrances,
And one man in his time plays many parts,
His acts being seven ages.

Mapping NYC subway traffic: an interactive

Ever wondered if you could count how many people go through the subway every day?

Okay, probably not. But bear with me here. No code this time.

For our first project in the Metis Data Science Bootcamp, we were given a hypothetical data science project by a company. Our team was asked to use data to help a nonprofit. In an email from an organization created to advocate for women in tech, we got our assignment. A quote:

Where we’d like to solicit your engagement is to use MTA subway data, which as I’m sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.

Basically, calculate where people are. But it wasn’t so simple. Our team, made up of Ingrid, Ben, Ken and me, thought through it like this:

1. The busiest turnstiles aren’t necessarily the best. We’re looking for a demographic here – young, progressive and interested in tech. Thousands of pissed off people at Penn Station won’t be any good. We crunched census, income and community data to identify the best neighborhoods.

2. Sometimes the data you’re given isn’t enough. We had to look for lots of extra resources beyond simple MTA turnstile data. Some of this helped us make the map below.

3. When you’re doing data science, make something useful. It’s easy to get lost in “We could do this…” and “But what if…” What if what your client actually cares about is something they can use, not all the stuff you discovered? Never forget your end goal.

And so, what we presented to the company is the map below. We selected five places for their street teams to hang out. The heat flashes show the busiest subway stops over the dates you can see in the bottom corner. Notice how they change throughout the week?

Next step is to plot hourly movements over a day.

Why I’m Becoming a Data Scientist

First of all, this post exists because I’m currently in New York, studying the art of Data Science with Metis. It’s a 12-week boot camp focused on training us to use all the tools needed to pull insights from massive mountains of data.

I’m here because I want a job as a Data Scientist.

Usually when I say this, people respond in three ways:

  1. But I thought you wanted to be a journalist?
  2. Aren’t you getting your master’s degree in journalism?
  3. What the hell is data science, and why would you want to do it?

My reasons are both philosophical and practical, so here’s a short explainer.

What is Data Science?

Our entire world is recorded in bits of information. Thanks to technology and the internet, human beings create and store more information now than at any point in the history of our species. From birth to death, our entire lives are recorded with paper documents, Google searches, emails, pictures and Facebook statuses. Every day, billions of humans create trillions of data points.

For example, this is how much information people create every minute of every day:


That’s a monstrous amount of information, and we’re only looking at a slice! Many firms across hundreds of industries are also recording their own information, as more of our world goes digital. Thanks to constantly improving server storage, it’s cheaper than ever to save all this information, so most of this stuff is just sitting around, unused.

But what if we could use all this information?

What if the data revealed patterns? What if we could look deeply at the information and discover the who, what, where, why and how of our world, to empower people, companies and governments to make better decisions?

People are starting to do just that, and they’re already making waves. Mandatory reading, for example: Big Data, A Revolution.

That’s data science. And it sounds hella awesome.

Why do Data Science?

While I was doing my master’s degree in Data Journalism at Mizzou, I realized data scientists and data journalists are basically the same thing. We learned lots of amazing tools like D3.js, CartoDB and Highcharts to tell stories with data. One major difference is that professional data scientists have stronger backgrounds in mathematics, programming and statistics, which would seriously help data journalists. As I created data-driven projects and infographics, I thought a lot about how I could use data to tell stories. Soon enough, it wasn’t satisfying to make a chart or a graph here or there. I wanted to do something bigger with data.

I also realized my journalistic skills – analysis, research, inherent curiosity and storytelling – were a perfect fit for a job as a data scientist. This inspired me to think outside the box and join this bootcamp, where I could get the programming and statistics needed to complete my education. This Venn Diagram explains the rare and challenging mix of skills needed to be a great Data Scientist.

[Image: data science Venn diagram]

There’s also the practical motivation: Data Science as a profession is exploding, and every industry, from entertainment to healthcare, is hiring. Demand is high, and so is the pay: the median salary of a data scientist is around $107,000. Companies are hiring people right and left. Compare that to journalism as a profession, where median salaries usually sit around $31,000 a year for newspaper reporters, and layoffs loom around every corner. The storytelling opportunities could be deeper if I were involved with data science research.

Data science is still emerging, and with it, the potential for good or evil. I want to put the skills and ethics I learned as a journalist to use in this industry, so I can help establish responsible, ethical and useful uses of data to improve our world.

That’s it, pretty much. Right now, I’m looking to join a data science team that echoes those values, so I can better learn the skills of the trade. Thanks for reading this far and letting me explain this. Every week or so I’ll be blogging here about this camp, if you’re interested in learning more.

For more on Data Science, read the groundbreaking Booz Allen Hamilton Field Guide to Data Science, online for free.

Also, yes – I’m still working on my thesis. I haven’t forgotten about you, Mizzou.

Syrian Refugees In the United States, An Interactive

This bit is inspired by a map published in the New York Times. This piece originally appeared in the Columbia Missourian.

People think they’re stopping Syrian refugees from entering the United States, but guess what: they’re already here.

The graphic below shows states with governors who have pledged to keep out refugees, while the blue dots show where most of the refugees from the last ten years have resettled.

I made this by smashing together some data from various government sources.

Header photo is from Flickr. 

Mizzou and bold new frontiers

Interactive data visualizations are all the rage these days, with major news organizations like the WSJ and the New York Times setting up interactive desks that churn out engrossing, compelling visualizations.

Mike Jenner, the Houston Harte Chair in Journalism at Mizzou and data visualization extraordinaire, set up a workshop back in October where dataviz journalists Chris Canipe (WSJ), Andrew Garcia-Phillips (Chartball) and Leah Becerra (Omaha World-Herald) came and taught us all D3, a JavaScript library for data visualization, in one speedy weekend.

Check out that last link for some awesome data visualizations that capture the power of D3.

In one hot-and-heavy 16-hour sprint, we got the basics in Data Viz. Nobody left a data expert, but the class exposed lots of students to the future of digital journalism.

This was huge for one reason: while Chris, Andrew and Leah were all self-taught in data viz, they brought what they learned to an academic environment.  

After the class, Madi Alexander and I organized the Mizzou Data Viz Club, where we met and tried to hang on to the skills we learned. (Madi recently got an internship at the NYT as a digital reporting intern. Hooray!)

In a conversation with Mike, I discovered he wanted to make a longer-term class; he just needed pledges that students would take it.

Our Dataviz club had the students and he had the resources. It was like the planets aligned.

Mike moved swiftly, organizing an 8-week course in D3 with Chris for this spring semester. I helped him design the posters to advertise it and recruited a bunch of students to join the class. Chris drives in occasionally from his home in Saint Louis, where he works remotely for the WSJ.

Our skill levels are all over the board, from accomplished programmers to brand new students. The class is open and modular, with each student working at their own pace. It’s patched together and sometimes challenging, but I want to outline why this Data Visualization class is wildly important to the future of Mizzou journalism academics.

1. Data visualization skills are in high demand. The success of Mizzou’s CAR and Data Reporting classes is testament to this. We teach the students how to find the data and how to pull stories from it, but now we’re on the cutting edge of visualizing it.

2. Most people who know this stuff were self-taught, and our class is the foundation for rigorous academic improvement of the subject. By turning this into an academic affair, we make it easier for students to learn the basics quickly. Once people are learning it, they can move beyond and improve it, developing new techniques and taking those to industry publications.

3. It’s confusing, challenging and uneven – but it’s happening and we’re moving forward, setting standards for future dataviz classes. After this is over, we’ll know what kind of classes should be required as prerequisites. We’ll understand gaps in the digital knowledge of the journalists we’re training. We’ll know what kind of classes we need to establish a powerful sequence of data journalism courses. This is us surging into a new frontier for science, know what I mean?

As we move forward, I’ll inevitably have more to say about this venture, so stay tuned.

Why you should learn Dataviz now

The other weekend I sat in on a Data Visualization introductory class taught over three days by three professionals in the business: Chris Canipe of The Wall Street Journal, Andrew Garcia Phillips of ChartBall.com, and Leah Becerra of the Omaha World-Herald.

In a quick and dirty 16-hour sprint, we were introduced to programming with a variety of tools, including Highcharts, D3 and various text editors.

Using these tools, we built a basic interactive graph using raw sports data. Numbers go in, beautiful pictures come out. This stuff is cutting edge – peep some gorgeous examples here. One of Mizzou’s own used these kinds of data visualizations to win a Pulitzer, and these graphics are common at the New York Times and The WSJ.

The weekend was crazy. Basically, a whole bunch of journalism nerds got together and did nerdy journalism stuff. And it was exceedingly awesome, and you should feel bad that you missed it.

But fret not – you can learn these highly demanded skills on your own with a little determination. Here’s why (and how) you should.

1. Because it’s part of the future of journalism. Take a look at journalism’s history and you’ll notice the people on the cutting edge are always the most successful, whether it’s Ben Franklin and his printing presses or ABC and color television. Take a lesson from the greats and secure your spot in journalism’s shining future, or something like that.

2. Because it’s a wild storytelling tool that helps audiences process the internet’s infinite stores of data. Journalists are no longer “gatekeepers” – if people want to know something, they can find any information they want on the internet. The flipside? There’s so much data, so many websites, that people get turned off by the gushing stream. Data visualizations help people process and explore vast amounts of data. All you do is hold their hand through it.

3. BECAUSE YOU CAN LEARN IT ON YOUR OWN FOR FREE. Like, seriously. Programming is becoming an easy skill to learn on your own, and all the journalists who taught this course taught themselves first. Explore sites like Codecademy, Treehouse, GitHub and W3Schools and you could know as much as anyone with a computer science degree. For D3 specifically, start here.

4. Because if you’re a Mizzou student, we just started a data visualization club, and there might be a class in the spring. Jump on it.