Elizabethan Stop Words for NLP

In Natural Language Processing, you will usually want to strip the text of common “stop words” that contribute little to a topical analysis. There is no standard version of this list, since requirements change from project to project.

If you find yourself processing either Elizabethan or older English texts, most modern stopword lists will fail to pick up things like “thee,” “thy,” or “thine.”

I couldn’t find an Elizabethan English stopword list for an NLP project I did with Project Gutenberg text, so I made one. See it below, or fork it on GitHub.

The older words are arranged in alphabetical order at the end of the standard stop word list on GitHub. Below is a copy-paste version so you can add them to your own stopword file.

If I’ve forgotten any, let me know and I’ll add them.
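If you just want to see how archaic words slot into an existing stopword filter, here is a minimal sketch. It is not the list from the repo: only “thee,” “thy,” and “thine” come from this post, and the other archaic words, the tiny modern set, and the `remove_stopwords` helper are assumptions made up for illustration.

```python
import re

# Illustrative sets only, NOT the full list from the repo.
# "thee", "thy", "thine" are from the post; everything else is an assumption.
MODERN_STOPWORDS = {"a", "and", "the", "to", "of", "in", "is", "i"}
ARCHAIC_STOPWORDS = {"thee", "thy", "thine", "thou", "hath", "doth"}

# Merge the archaic words into the modern list.
STOPWORDS = MODERN_STOPWORDS | ARCHAIC_STOPWORDS


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into word tokens, drop the stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]


line = "I do love thee, and thy heart is mine"

# The modern list alone lets "thee" and "thy" through.
print(remove_stopwords(line, MODERN_STOPWORDS))

# The merged list filters them out.
print(remove_stopwords(line, STOPWORDS))
```

In a real project you would swap the toy modern set for whatever stopword file you already use and append the archaic words to it.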

All the world’s a stage,
And all the men and women merely players.
They have their exits and their entrances,
And one man in his time plays many parts,
His acts being seven ages.

Mapping NYC subway traffic: an interactive

Ever wondered if you could count how many people go through the subway every day?

Okay, probably not. But bear with me here. No code this time.

For our first project in the Metis Data Science Bootcamp, we were given a hypothetical data science project by a company. Our team was asked to use data to help a nonprofit. Our assignment came in an email from an organization created to advocate for women in tech. A quote:

Where we’d like to solicit your engagement is to use MTA subway data, which as I’m sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.

Basically, calculate where people are. But it wasn’t so simple. Our team, made up of Ingrid, Ben, Ken and me, thought through it like this:

1. The busiest turnstiles aren’t necessarily the best. We’re looking for a demographic here – young, progressive and interested in tech. Thousands of pissed off people at Penn Station won’t be any good. We crunched census, income and community data to identify the best neighborhoods.

2. Sometimes the data you’re given isn’t enough. We had to look for lots of extra resources beyond simple MTA turnstile data. Some of this helped us make the map below.

3. When you’re doing data science, make something useful. It’s easy to get lost in “We could do this…” and “But what if…” What if what your client actually cares about is something they can use, not all the stuff you discovered? Never forget your end goal.

And so, what we presented to the company was the map below. We selected five places for their street teams to hang out. The heat flashes show the busiest subway stops over the dates shown in the bottom corner. Notice how they change throughout the week?

The next step is to plot hourly movements over a day.

Syrian Refugees In the United States, An Interactive

This bit is inspired by a map published in The New York Times. This piece originally appeared in the Columbia Missourian.

People think they’re stopping Syrian refugees from entering the United States, but guess what: they’re already here.

The graphic below shows states with governors who have pledged to keep out refugees, while the blue dots show where most of the refugees from the last ten years have resettled.

I made this by smashing together some data from various government sources.

Header photo is from Flickr.