All posts by bryanbumgardner@gmail.com

Natural Language Processing: Reading Project Gutenberg

This post is a process exploration of my most recent project, Reading Project Gutenberg, a proof of concept for a content-analysis recommendation engine. I’m going to go deep on process for this one, more so than the other projects. Still trying to get the hang of explaining all this, so reach out with questions.

For our third project at Metis, we dipped into Natural Language Processing, a branch of machine learning that deals with reading and interpreting text. Google Translate, for example, uses machine learning to automatically translate text, which is an insanely hard thing to do.

Another caveat of the project: we needed to do unsupervised machine learning. (A primer on those topics here.) Basically, we had to write a program that would read text and do something with it, and we couldn’t provide the program with examples of what we wanted.

Is that vague enough for you?

My idea: Project Gutenberg is a great site dedicated to creating and distributing free e-books. Compared to Amazon, Gutenberg has little to no data about which users like which books.

So how can I help them with data science? By building a basic recommendation engine, which will offer books similar to a chosen selection. This can help guide readers to books they might like.

Here’s the process.

Structure

Logically, it makes sense that books that discuss similar topics and are written similarly would be enjoyed by similar readers. Knowing this, we can read and categorize each book by the words in it. Using NLP, we can separate out topics, ideas and themes, and find books that are alike.

This is called “clustering.” If we can prove that books can be distributed into similar clusters, we can begin sharpening those clusters and defining them. Perhaps we’ll see a “sci-fi” or a “war” cluster. For the recommendation engine, this will help us identify similar books.

So time to grab the data.

Process

NLP is a computationally-expensive endeavor, so it takes a long time if we have a lot of books. I used this ISO creator to get the full text of 93 random books from fantasy, science fiction, drama and other fiction genres. However, the books aren’t labeled by genre, so I don’t know what category they’re in exactly.

After removing stopwords (common words like “they,” “there” and “that”) and choosing the most important words, I was ready to run analyses on the text. A snapshot of my vectorizer, for those interested:
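For anyone who wants to try the idea without the snapshot, here is a minimal sketch in plain Python. It assumes a basic TF-IDF weighting (term frequency times the log of inverse document frequency), which may not match the settings used in the project. The tiny stop-word set, the `tokenize` and `tfidf_vectors` names, and the sample sentences are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini stop-word list; a real project would use a much longer one.
STOP_WORDS = {"the", "a", "and", "of", "they", "there", "that", "to", "in"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return one {word: weight} dict per document using basic TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

# Made-up sample "books", just to show the shape of the output.
books = [
    "The captain sailed the ship to the island",
    "The king and the queen ruled the palace",
    "The captain of the ship met the king",
]
vectors = tfidf_vectors(books)
```

Words that show up in only one book (like "sailed") end up with a higher weight than words shared across books (like "captain"), which is what lets the later steps tell books apart.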

Using K-Means and Cosine Similarity, we can start clustering the books and see the actual structure. The machine doesn’t know what clusters exist – it just reads the books, finds similarities, and compares them to one another.
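The post doesn't spell out how K-Means and cosine similarity were combined, so the following is only one plausible reading: a small k-means loop that assigns each book to the centroid it is most similar to by cosine similarity (sometimes called spherical k-means). The function names, the seed, and the two-dimensional sample vectors are assumptions for illustration, not the project's code.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def kmeans_cosine(vectors, k, iterations=20, seed=0):
    """Cluster vectors by cosine similarity; returns one cluster label per vector."""
    points = [normalize(v) for v in vectors]
    rng = random.Random(seed)
    centroids = [points[i] for i in rng.sample(range(len(points)), k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its most similar centroid.
        labels = [
            max(range(k), key=lambda c: cosine_similarity(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Made-up word-weight vectors: two "sea" books and two "royalty" books.
book_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = kmeans_cosine(book_vectors, k=2)
```

The label numbers themselves are arbitrary. What matters is that the first two books share one label and the last two share the other.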

After that, I created a visualization of all the books. By looking at them mapped out, we can see if there are patterns of similar books.

Each dot represents a book, and each color represents a cluster. The distance of each dot from all the other dots shows how similar they are. Closer dots are more similar in topic, and dots that are the same color are in the same cluster.

(See the code here.)

[Image: screenshot of the book cluster visualization]

Clearly, clusters have appeared. By looking at some of the top words in each cluster, we can determine their topics. For example:

Green Cluster: ships, boats, sailed, deck, captain, board, vessels
Orange Cluster: London, England, honor, chapter, poet
Purple Cluster: princess, princes, king, colonel, queen, palaces, majesty, royal
Pink Cluster: Captain, aunt, hotel, garden, doctor
Turquoise Cluster: Jack, spiked, wagon, dollars, allies
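One simple way to get lists like these, sketched here under the assumption that each book is a dictionary of word weights and already has a cluster label, is to add up the weights inside each cluster and keep the heaviest words. The `top_words_per_cluster` helper and the sample data are hypothetical.

```python
from collections import defaultdict

def top_words_per_cluster(vectors, labels, n=3):
    """Sum word weights within each cluster and return the n heaviest words."""
    totals = defaultdict(lambda: defaultdict(float))
    for vector, label in zip(vectors, labels):
        for word, weight in vector.items():
            totals[label][word] += weight
    return {
        label: sorted(weights, key=weights.get, reverse=True)[:n]
        for label, weights in totals.items()
    }

# Made-up weights and cluster labels for four imaginary books.
vectors = [
    {"ship": 0.9, "captain": 0.5, "garden": 0.1},
    {"ship": 0.7, "deck": 0.6},
    {"king": 0.8, "queen": 0.7},
    {"king": 0.6, "royal": 0.5, "ship": 0.1},
]
labels = [0, 0, 1, 1]
top = top_words_per_cluster(vectors, labels, n=2)
```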

So you can clearly see some clusters appearing, while others are less clear. The Green Cluster is obviously books about the high seas, and the three closest titles back up this claim:

-The Cruise of the Cachalot Round the World After Sperm Whales – Frank T. Bullen

-Ned Myers, Or a Life Before the Mast – James Fenimore Cooper

-The Rover’s Secret: A Tale of the Pirate Cays and Lagoons of Cuba – Harry Collingwood
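Finding the "closest titles" is also the heart of the recommendation idea. A rough sketch, assuming each book is stored as a dictionary of word weights and that closeness means cosine similarity: rank every other book against the chosen one and return the top few. The titles, weights, and the `recommend` function are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(title, library, n=3):
    """Return the n titles whose word vectors are most similar to `title`."""
    target = library[title]
    others = [t for t in library if t != title]
    others.sort(key=lambda t: cosine_similarity(target, library[t]), reverse=True)
    return others[:n]

# Invented titles and weights, purely for illustration.
library = {
    "Sea Story A": {"ship": 0.9, "captain": 0.4},
    "Sea Story B": {"ship": 0.8, "deck": 0.5},
    "Court Tale": {"king": 0.9, "queen": 0.6},
    "Mixed Voyage": {"ship": 0.3, "king": 0.7},
}
suggestions = recommend("Sea Story A", library, n=2)
```

Here the other sea story ranks first, and the book that only partly overlaps comes second.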

So that’s it. We proved it could be done, so now if any of you want to help me build this thing for Gutenberg, let me know. Also, if this post was unclear in any way, please comment so I can clear things up or answer any questions.

As always, fork me. 

Cover picture is Stranger in a Strange Land, a painting by James Warhola for a cover of the book written by Robert A. Heinlein. A great classic sci-fi read.

 

 

Elizabethan Stop Words for NLP

For Natural Language Processing, usually you will want to clean the text of common “stop words” that don’t usually contribute to a topical analysis. No version of this list is standard, as requirements change from project to project.

If you find yourself processing either Elizabethan or older English texts, most modern stopword lists will fail to pick up things like “thee,” “thy,” or “thine.”

I couldn’t find an Elizabethan English stopword list for an NLP project I did with Project Gutenberg text, so I made one. See it below, or fork me on Github.

The older words are arranged in alphabetical order at the end of the standard stop word list on Github, and below is an easy copy-paste so you can add it to your own stopword file easily.

If I’ve forgotten any, let me know and I’ll add them.
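To show how a list like this plugs into a cleaning step, here is a minimal sketch. Only “thee,” “thy” and “thine” come from this post. The other archaic words, the short modern list, and the `remove_stop_words` helper are assumptions added for illustration, not the full list from Github.

```python
# Stand-in for a modern stop-word list; real lists are much longer.
modern_stop_words = {"the", "and", "you", "your", "that", "there", "they"}

# "thee", "thy" and "thine" come from the post; the rest are a few more
# archaic forms added here for illustration, not the full list.
elizabethan_stop_words = {"thee", "thy", "thine", "thou", "hath", "doth"}

stop_words = modern_stop_words | elizabethan_stop_words

def remove_stop_words(text):
    """Lowercase the text, strip basic punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return [w for w in words if w and w not in stop_words]

# Made-up sentence in an older style of English.
cleaned = remove_stop_words("Thou hast thy sword, and the king hath thine oath.")
```

Note that “hast” survives the filter because it is not in the illustrative set above, which is exactly the kind of gap an extended list is meant to close.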

 

All the world’s a stage,
And all the men and women merely players.
They have their exits and their entrances,
And one man in his time plays many parts,
His acts being seven ages.

Mapping NYC subway traffic: an interactive

Ever wondered if you could count how many people go through the subway every day?

Okay, probably not. But bear with me here. No code this time.

For our first project in the Metis Data Science Bootcamp, we were given a hypothetical data science project by a company. Our team was asked to use data to help a nonprofit. In an email from an organization created to advocate for women in tech, we got our assignment. A quote:

Where we’d like to solicit your engagement is to use MTA subway data, which as I’m sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.

Basically, calculate where people are. But it wasn’t so simple. Our team, made up of Ingrid, Ben, Ken and me, thought through it like this:

1. The busiest turnstiles aren’t necessarily the best. We’re looking for a demographic here – young, progressive and interested in tech. Thousands of pissed off people at Penn Station won’t be any good. We crunched census, income and community data to identify the best neighborhoods.

2. Sometimes the data you’re given isn’t enough. We had to look for lots of extra resources beyond simple MTA turnstile data. Some of this helped us make the map below.

3. When you’re doing data science, make something useful. It’s easy to get lost in “We could do this…” and “But what if…” What if what your client actually cares about is something they can use, not all the stuff you discovered? Never forget your end goal.

And so, what we presented to the company is the map below. We selected five places for their street teams to hang out. The heat flashes show the busiest subway stops over the dates you can see in the bottom corner. Notice how they change throughout the week?

Next step is to plot hourly movements over a day.

Why I’m Becoming a Data Scientist

First of all, this post exists because I’m currently in New York, studying the art of Data Science with Metis. It’s a 12-week boot camp focused on training us to use all the tools needed to pull insights from massive mountains of data.

I’m here because I want a job as a Data Scientist.

Usually when I say this, people respond in three ways:

  1. But I thought you wanted to be a journalist?
  2. Aren’t you getting your master’s degree in journalism?
  3. What the hell is data science, and why would you want to do it?

My reasons are both philosophical and practical, so here’s a short explainer.

What is Data Science?

Our entire world is recorded in bits of information. Thanks to technology and the internet, human beings create and store more information now than at any point in the history of our species. From birth to death, our entire lives are recorded with paper documents, Google searches, emails, pictures and Facebook statuses. Every day trillions of data points are created by billions of humans.

For example, this is how much information people create every minute of every day:

[Image: Data Never Sleeps 2.0 infographic]

That’s a monstrous amount of information, and we’re only looking at a slice! Many firms across hundreds of industries are also recording their own information, as more of our world goes digital. Thanks to constantly improving server memory, it’s cheaper than ever to save all this information, so most of this stuff is just sitting around, unused.

But what if we could use all this information?

What if the data revealed patterns? What if we could look deeply at the information and discover the who, what, where, why and how of our world, to empower people, companies and governments to make better decisions?

People are starting to do just that, and they’re already making waves. Mandatory reading, for example: Big Data, A Revolution.

That’s data science. And it sounds hella awesome.

Why do Data Science?

While I was doing my master’s degree in Data Journalism at Mizzou, I realized data scientists and data journalists are basically the same thing. We learned lots of amazing tools like D3.js, CartoDB and Highcharts to tell stories with data. One major difference is that professional data scientists have stronger backgrounds in mathematics, programming and statistics, which would seriously help data journalists. As I created data-driven projects and infographics, I thought a lot about how I could use data to tell stories. Soon enough, it wasn’t satisfying to make a chart or a graph here or there. I wanted to do something bigger with data.

I also realized my journalistic skills – analysis, research, inherent curiosity and storytelling – were a perfect fit for a job as a data scientist. This inspired me to think outside the box and join this bootcamp, where I could get the programming and statistics needed to complete my education. This Venn Diagram explains the rare and challenging mix of skills needed to be a great Data Scientist.

[Image: data science Venn diagram]

There’s also the practical motivation: Data Science as a profession is exploding, and every industry, from entertainment to healthcare, is hiring. Demand is high, and so is the pay: the median salary of a data scientist is around $107,000. Companies are hiring people right and left. Compare that to journalism as a profession, where median salaries usually sit around $31,000 a year for newspaper reporters, and layoffs loom around every corner. The storytelling opportunities could be deeper if I were involved with data science research.

Data science is still emerging, and with it, the potential for good or evil. I want to put the skills and ethics I learned as a journalist to use in this industry, so I can help establish responsible, ethical and useful uses of data to improve our world.

That’s it, pretty much. Right now, I’m looking to join a data science team that echoes those values, so I can better learn the skills of the trade. Thanks for reading this far and letting me explain this. Every week or so I’ll be blogging here about this camp, if you’re interested in learning more.

For more on Data Science, read the groundbreaking Booz Allen Hamilton Field Guide to Data Science, online for free.

Also, yes – I’m still working on my thesis. I haven’t forgotten about you, Mizzou.

Queer Sensibilities: News Xchange Berlin, 2015

Below is a video of our talk at the 2015 News Xchange conference in Berlin in October. Sara Trimble and I discussed LGBT coverage and style guides in modern journalism. News Xchange is a conference of news executives from all over the world. I’m pretty honored they invited us. It was a blast, and a major shout out to Amy Selwyn for putting the whole thing together. Here’s to hoping I can visit the conference again soon.

QUEER SENSIBILITIES from News Xchange on Vimeo.

Syrian Refugees In the United States, An Interactive

This bit is inspired by a map published in the New York Times. This piece originally appeared in the Columbia Missourian.

People think they’re stopping Syrian refugees from entering the United States, but guess what: they’re already here.

The graphic below shows states with governors who have pledged to keep out refugees, while the blue dots show where most of the refugees from the last ten years have resettled.

I made this by smashing together some data from various government sources.

Header photo is from Flickr. 

Martin Luther King Jr’s “I Have a Dream”

Because I feel like all of us have to read this again. MLK Jr. delivered this speech half a century ago and it still applies. 

I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation.

Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity.

But one hundred years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later, the Negro is still languishing in the corners of American society and finds himself an exile in his own land. So we have come here today to dramatize a shameful condition.

In a sense we have come to our nation’s capital to cash a check. When the architects of our republic wrote the magnificent words of the Constitution and the Declaration of Independence, they were signing a promissory note to which every American was to fall heir. This note was a promise that all men, yes, black men as well as white men, would be guaranteed the unalienable rights of life, liberty, and the pursuit of happiness.

It is obvious today that America has defaulted on this promissory note insofar as her citizens of color are concerned. Instead of honoring this sacred obligation, America has given the Negro people a bad check, a check which has come back marked “insufficient funds.” But we refuse to believe that the bank of justice is bankrupt. We refuse to believe that there are insufficient funds in the great vaults of opportunity of this nation. So we have come to cash this check — a check that will give us upon demand the riches of freedom and the security of justice. We have also come to this hallowed spot to remind America of the fierce urgency of now. This is no time to engage in the luxury of cooling off or to take the tranquilizing drug of gradualism. Now is the time to make real the promises of democracy. Now is the time to rise from the dark and desolate valley of segregation to the sunlit path of racial justice. Now is the time to lift our nation from the quick sands of racial injustice to the solid rock of brotherhood. Now is the time to make justice a reality for all of God’s children.

It would be fatal for the nation to overlook the urgency of the moment. This sweltering summer of the Negro’s legitimate discontent will not pass until there is an invigorating autumn of freedom and equality. Nineteen sixty-three is not an end, but a beginning. Those who hope that the Negro needed to blow off steam and will now be content will have a rude awakening if the nation returns to business as usual. There will be neither rest nor tranquility in America until the Negro is granted his citizenship rights. The whirlwinds of revolt will continue to shake the foundations of our nation until the bright day of justice emerges.

But there is something that I must say to my people who stand on the warm threshold which leads into the palace of justice. In the process of gaining our rightful place we must not be guilty of wrongful deeds. Let us not seek to satisfy our thirst for freedom by drinking from the cup of bitterness and hatred.

We must forever conduct our struggle on the high plane of dignity and discipline. We must not allow our creative protest to degenerate into physical violence. Again and again we must rise to the majestic heights of meeting physical force with soul force. The marvelous new militancy which has engulfed the Negro community must not lead us to a distrust of all white people, for many of our white brothers, as evidenced by their presence here today, have come to realize that their destiny is tied up with our destiny. They have come to realize that their freedom is inextricably bound to our freedom. We cannot walk alone.

As we walk, we must make the pledge that we shall always march ahead. We cannot turn back. There are those who are asking the devotees of civil rights, “When will you be satisfied?” We can never be satisfied as long as the Negro is the victim of the unspeakable horrors of police brutality. We can never be satisfied, as long as our bodies, heavy with the fatigue of travel, cannot gain lodging in the motels of the highways and the hotels of the cities. We cannot be satisfied as long as the Negro’s basic mobility is from a smaller ghetto to a larger one. We can never be satisfied as long as our children are stripped of their selfhood and robbed of their dignity by signs stating “For Whites Only”. We cannot be satisfied as long as a Negro in Mississippi cannot vote and a Negro in New York believes he has nothing for which to vote. No, no, we are not satisfied, and we will not be satisfied until justice rolls down like waters and righteousness like a mighty stream.

I am not unmindful that some of you have come here out of great trials and tribulations. Some of you have come fresh from narrow jail cells. Some of you have come from areas where your quest for freedom left you battered by the storms of persecution and staggered by the winds of police brutality. You have been the veterans of creative suffering. Continue to work with the faith that unearned suffering is redemptive.

Go back to Mississippi, go back to Alabama, go back to South Carolina, go back to Georgia, go back to Louisiana, go back to the slums and ghettos of our northern cities, knowing that somehow this situation can and will be changed. Let us not wallow in the valley of despair.

I say to you today, my friends, so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.

I have a dream that one day this nation will rise up and live out the true meaning of its creed: “We hold these truths to be self-evident: that all men are created equal.”

I have a dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.

I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.

I have a dream today.

I have a dream that one day, down in Alabama, with its vicious racists, with its governor having his lips dripping with the words of interposition and nullification; one day right there in Alabama, little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers.

I have a dream today.

I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

This is our hope. This is the faith that I go back to the South with. With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day.

This will be the day when all of God’s children will be able to sing with a new meaning, “My country, ’tis of thee, sweet land of liberty, of thee I sing. Land where my fathers died, land of the pilgrim’s pride, from every mountainside, let freedom ring.”

And if America is to be a great nation this must become true. So let freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty mountains of New York. Let freedom ring from the heightening Alleghenies of Pennsylvania!

Let freedom ring from the snowcapped Rockies of Colorado!

Let freedom ring from the curvaceous slopes of California!

But not only that; let freedom ring from Stone Mountain of Georgia!

Let freedom ring from Lookout Mountain of Tennessee!

Let freedom ring from every hill and molehill of Mississippi. From every mountainside, let freedom ring.

And when this happens, when we allow freedom to ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God’s children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual, “Free at last! free at last! thank God Almighty, we are free at last!”

5 Stupid things journalism students say

Journalism students, I have good news and bad news.

Good news: you’re all talented storytellers with amazing, in-demand publishing skills, especially in digital mediums.

Bad news: your educations have brainwashed you into saying and believing stupid things that have absolutely no truth in the real world.

It’s time you learn the truth before you take that first job. Below are some stupid things you’ve probably said. I’ve said them before. If you say them in the real world, people will laugh at you.

Or you can take your dose of reality now and feel your full potential.

Will you find this list offensive? Probably. Should we talk about that? Definitely.

1. “I don’t care about my salary.”
How admirable of you. Unfortunately, our society runs almost exclusively on money. “Personal conviction” isn’t an acceptable currency you can use to pay rent. Everyone needs money – it pervades everything we do, even journalism. Money buys food, health, time, freedom, education. You will need money, whether you admit it or not. Instead, try saying “I will take a job where I can make enough money to support my chosen lifestyle without violating my personal moral imperatives.” Or just keep couchsurfing and blogging, you douche.

2. “Working for PR is selling your soul.”
Job titles alone do not ascribe moral integrity to individuals. Just as a prison executioner might actually be a nice guy, journalists can be terrible, shitty, lying, totally mentally insane people.
Public relations is often more like brand advocacy: helping journalists and bloggers know about your company, spreading the truth about your organization, or helping customers use your products. The best public relations professionals are completely transparent and genuinely good, much like their companies.

Tylenol, for example. After the tampering of their drugs in the 80s, Tylenol went on an aggressive recall and safety campaign, inventing the modern tamper-proof seal. As you will soon discover, PR jobs outnumber journalism jobs 3-1 and pay to the same ratio. The explosion of branded, native and sponsored content is making a journalist’s skills more in demand in PR than ever. So maybe it’s time you considered that a person’s (or organization’s) ethics are much more complex than titles describe.
Or you can stick to your standard narrative:
Once upon a time, at a dirt crossroads in rural Georgia, Edward Bernays shook hands with a dark, ominous figure…

3. “There aren’t any jobs out there.”
This is complete bull. There are jobs everywhere for people with your skills, you just won’t take them. Let me rephrase your complaint: “Because all I want to do is work in one specific city, and because I don’t look very hard on the internet, and because I spend most of my time watching Netflix and eating in bed, I can’t find a job.” Want a job? Imagine murderers have taken one of your loved ones hostage. Their only demand: you have to get a job in 60 days. What would you do?

4. “Advertising is the dark side.”
Advertising is the lifeblood of good journalism. If you believe otherwise, you have a fundamental misunderstanding of the journalism industry. Good content leads to more readers. More readers lead to more advertisers. More advertisers means more money for your company. More money for your company means better content. Get it? Ignoring your advertising department is like putting an expensive paint job on a car without an engine: you’ll look like an asshole AND you won’t have a ride.

5. “I just want to be a full-time freelancer.”
Oh yeah? Go for it, kid. Just be prepared for the worst hours, for zero benefits, for terrible bosses, and no consistent income. So much for that luxurious travel writing lifestyle you expected. Welcome to the real world – time to get a full-time job.

ICYMI: Vox Magazine at the RJI Tech Innovation Showcase

I clearly need to work on my public speaking, but for a last-second presentation, this wasn’t half bad. For a quick rundown of Vox’s digital innovation this semester, watch this video:

[9:49] Our Blue Highways — Ride along with Vox reporters and digital editors as they discuss this award-winning multimedia project from Fall 2014.
Members: Atiya Abbas, Bryan Bumgardner, Jenna Fear and Carson Kohler

[16:44] Vox on social media — There’s a right way and a wrong way for publications to use social media. Vox social gurus share some of our success stories, including Renz prison, the Antlers and CoMo cups.
Members: Christine Jackson and Dan Roe

[1:22] Vox’s new website — Publication websites are in a state of constant development. Vox students took a leadership role in the relaunch of Vox’s spiffy new site last summer.
Members: Laura Heck and Justin Paprocki

[26:38] Q&A

More information about this event: http://rjionline.org/events/tech15

The Spring 2015 Mizzou Mag Club Trip, in Quotes

Select quotes from the annual Mizzou Magazine Club trip to New York, where we toured magazines, quizzed editors and mingled with alumni. These quotes tell the story of what we learned.

“Nobody gets hired on GPA, where you went to school, how you structure your resume… It’s who you are.”

Ryan D’Agostino, Editor-in-Chief of Popular Mechanics and former manager of the band Dispatch

“If you end up working at a smaller publication and doing a lot, that can sometimes be better.”

Sara Gaynes Levy, Features Editor of SELF Magazine, talking about summer internship opportunities

“Look for where there is a need and fill it. Every magazine has a blind spot.”

Jesse Kissinger, Assistant Editor of Esquire giving internship advice

Touring SELF Magazine with Tova Diamond.

“If you’re not up for a wild adventure for the next ten years, find a different career.”

Richard Dorment, Senior Editor of Esquire Magazine

“Your ideas should always outsize your resources.”

Andrew Del-Colle, Senior Editor of Popular Mechanics and WVU and MU grad

“It’s sexist, it’s disgusting… It makes more money than anything we publish.”

Mark Godich, Senior Editor of Sports Illustrated and MU graduate, talking about the annual Swimsuit Edition of SI

 

Real talk with Mara Reinstein of Us Weekly

“Your magazine must always be evolving – as an editor, that’s your role.”

Lindsay Schallon, Features Editor of Seventeen Magazine and MU grad

“I think there’s a lot more value in personal experiences than people realize.”

Tova Diamond, Senior Designer of SELF Magazine and MU grad, talking about independent passion projects

“Don’t be afraid to tell people what you’re gonna do – don’t just have them tell you what to do.”

Joe Bargmann, Special Projects Director of Popular Mechanics

“I suggest you live life beyond your wildest dreams.”

Allyson Torrisi, Director of Photography at Popular Mechanics