Scikit-learn

Scikit-learn provides simple and efficient tools for data mining and data analysis. It implements a wide variety of machine learning algorithms for conducting advanced analytics.

Library documentation: http://scikit-learn.org/stable/

Classification with Naive Bayes

  • Two classes of headlines: “upworthy” (1) and “not upworthy” (0)
  • Load the data
  • Featurize the data
  • Train and test the classifier the right way
  • Use the model to predict new headlines

Let’s start with a helper function for loading the raw data

def import_titles(filename):
    # Import the raw text file as bytes
    with open(filename, 'rb') as f:
        titles = f.read()
    # Decode to unicode, then drop any characters that aren't ASCII
    titles = titles.decode('utf-8')
    titles = titles.encode('ascii', 'ignore').decode('ascii')

    return titles.splitlines()

upworthy_titles = import_titles('./data/upworthy_titles.txt')

print(len(upworthy_titles))
upworthy_titles[:5] # first five

times_titles = import_titles('./data/not_upworthy_titles.txt')

print(len(times_titles))
times_titles[:5] # first five

What we want the data to look like next

upworthy  titles
1         He Was About To Take His Own Life — Until A Man Stopped Him. Here He Meets Him Face To Face Again
0         CVS Pharmacy to Stop Selling Tobacco Goods by October
0         Twitter’s Share Price Falls After It Reports a Loss
1         A 16-Year-Old Explains Why Everything You Thought You Knew About Beauty May Be Wrong. With Math.
# Set up placeholder lists
upworthy = []
titles   = []

# Go through all the Upworthy headlines
for title in upworthy_titles:
    # add the title to the title list
    titles.append(title)
    # add 1 to the label list
    upworthy.append(1) # Upworthy = 1

# Do the same for the Times headlines, labeled 0
for title in times_titles:
    titles.append(title)
    upworthy.append(0) # Upworthy = 0
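
A quick sanity check: every title should have a label, and we can see how the two classes balance out.

# One label per title, and the class balance
print(len(titles), len(upworthy))                    # these should match
print(sum(upworthy), len(upworthy) - sum(upworthy))  # number of 1s vs. 0s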

What the machine wants the data to look like next

upworthy  stop  man  obama  explain  everything  you  nato  debate  industry  believe
       1     0    1      0        0           0    2     0       0         0        1
       0     1    0      0        0           0    0     1       1         0        0
       0     0    0      1        1           0    0     0       0         1        0
       1     0    0      0        0           0    1     0       1         0        0
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase     = True,      # fold everything to lowercase
                             min_df        = 2,         # drop words in fewer than 2 titles
                             max_df        = .5,        # drop words in more than half the titles
                             ngram_range   = (1, 1),    # single words only
                             stop_words    = 'english', # drop common English stop words
                             strip_accents = 'unicode') # remove accents

# Learn the vocabulary, then build the document-term matrix
vectorizer.fit(titles)
X = vectorizer.transform(titles)
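
To see what the featurizer actually kept after the min_df, max_df, and stop-word filters, inspect the shape of X and a few vocabulary terms. (get_feature_names_out is the method in current scikit-learn releases; older versions call it get_feature_names.)

# Number of titles by number of vocabulary terms
print(X.shape)

# A few of the tokens the vectorizer learned
print(vectorizer.get_feature_names_out()[:10])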

Fit a Multinomial Naive Bayes Classifier

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X,upworthy)

# Predict on the same data used for training (don't do this at home)
y_hat = clf.predict(X) 

# Print out some predictions
y_hat[:20]

What % were correctly predicted?

clf.score(X, upworthy)
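
MultinomialNB stores per-class log word probabilities in its feature_log_prob_ attribute, so we can also peek at which words pull a headline toward the “upworthy” class. A minimal sketch (again using get_feature_names_out):

import numpy as np

# log P(word | upworthy) - log P(word | not upworthy) for each vocabulary term
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
vocab = np.array(vectorizer.get_feature_names_out())

# The ten words most associated with upworthy-style headlines
print(vocab[np.argsort(log_odds)[-10:]])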

Use better fit statistics (a confusion matrix)

# A helper function to plot the matrix
import itertools
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm):
    plt.matshow(cm,cmap=plt.cm.Blues)
    classes = ['not upworthy','upworthy']
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # Plot Values
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center", color="black")

    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

# Create and plot confusion matrix
from sklearn import metrics
cm = metrics.confusion_matrix(upworthy, y_hat)

plot_confusion_matrix(cm)

Data scientists care about overfitting…so should we

from sklearn.model_selection import train_test_split

# Split Test / Train
X_train, X_test, y_train, y_test = train_test_split(X, upworthy, test_size=0.3)
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)

print("classification accuracy:", metrics.accuracy_score(y_test, y_test_pred))
cm = metrics.confusion_matrix(y_test, y_test_pred)
plot_confusion_matrix(cm)
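
Accuracy and the confusion matrix can be supplemented with per-class precision and recall; scikit-learn's classification_report prints them in one table. (Passing a random_state to train_test_split would also make the split, and hence these numbers, reproducible.) A small optional addition:

# Per-class precision, recall, and F1 on the held-out test set
print(metrics.classification_report(y_test, y_test_pred,
                                    target_names=['not upworthy', 'upworthy']))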

Cross-validation scores vary from fold to fold (each fold trains and tests on a different split), so the individual numbers can look unstable; averaging across folds gives a steadier estimate. Note that the old sklearn.cross_validation module has been replaced by sklearn.model_selection:

from sklearn.model_selection import cross_val_score

# Accuracy for each of 10 cross-validation folds
cross_val_score(clf, X, np.array(upworthy), cv=10)

# Average accuracy across the 10 folds
np.mean(cross_val_score(clf, X, np.array(upworthy), cv=10))

What about predicting new headlines?

soc_title = 'Educational Assortative Mating and Earnings Inequality in the United States'
soc_title_vector = vectorizer.transform([soc_title])
clf.predict_proba(soc_title_vector)

gawker_title = 'Shocking Footage Aired of Police Shooting Face-Eating Nude Man'
gawker_title_vector = vectorizer.transform([gawker_title])
clf.predict_proba(gawker_title_vector)
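
The vectorize-then-predict pattern can be wrapped in a small helper so new headlines can be scored in one call. The name upworthiness is just an illustrative choice:

def upworthiness(title):
    """Return the model's probability that a headline is 'upworthy'."""
    title_vector = vectorizer.transform([title])
    # predict_proba returns [[P(not upworthy), P(upworthy)]]; take the second column
    return clf.predict_proba(title_vector)[0, 1]

print(upworthiness(soc_title))
print(upworthiness(gawker_title))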