NOTE: Need data with Max/Impressions; need a separator between the phrases in primary_kw and tags.

BuzzModel - Build a prediction model for articles

Step 1: Load and clean the data

A. Features: Clean Null

B. Target: Normalize using (freq, impressions) and max_impressions

Start with two classes, Viral and Non-Viral (picking -1 std. dev. as an arbitrary cutoff), then try multiple classes: 1buzz (bottom quartile), 2buzz (middle 50%), and 3buzz (top quartile).
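A minimal sketch of the scoring-and-bucketing idea on made-up freq/impressions values, using mean ± 1.5 std. dev. of the log score as the class cutoffs (the thresholds the code further down actually uses):

```python
import numpy as np
import pandas as pd

# Toy rows; the freq/impressions values here are illustrative only
df = pd.DataFrame({"freq": [2, 10, 300], "impressions": [3000, 90000, 17000000]})

# log2 of freq * impressions compresses the heavy right tail
# toward a roughly normal shape
df["score"] = np.log2(df["freq"] * (df["impressions"] + 1) / 1000)

mean, std = df["score"].mean(), df["score"].std()
df["buzz"] = np.where(df["score"] <= mean - 1.5 * std, "1buzz",
             np.where(df["score"] > mean + 1.5 * std, "3buzz", "2buzz"))
print(df["buzz"].tolist())
```

With only three toy rows nothing falls outside ±1.5 std. dev., so all land in 2buzz; on the full data the tails populate 1buzz and 3buzz.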

Step 2: Describe and understand the data

A. ...

Step 3: Select features to try out

A. Combine all text fields

B. Identify fields with signal

C. Remove infrequent terms

D. Remove too-frequent terms

E. Remove instances with time bias (News)

F. Use tags and primary_kw phrases as tokens instead of breaking them apart (Convert String to List)

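Item F could be handled with a custom analyzer, assuming the phrases eventually arrive with a separator (the current data has none; see the note at the top). The `phrase_analyzer` name and the pipe delimiter here are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer

def phrase_analyzer(doc):
    # Treat each pipe-separated phrase as one token instead of
    # letting the default tokenizer split it into words
    return [p.strip().lower() for p in doc.split("|") if p.strip()]

docs = ["game of thrones|dogs|instagram", "ice cream sandwiches|dessert"]
vect = CountVectorizer(analyzer=phrase_analyzer)
X = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))
```

This keeps multi-word tags such as "game of thrones" intact as single features rather than the three unrelated tokens the default word analyzer would produce.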

Step 4: Modeling

Model selection: Multinomial NB, Logistic Regression, and SVM (maybe, if we have time).

Hyperparameter tuning
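One way to sketch the tuning step, using the newer `sklearn.model_selection` API on toy texts (the grid values, documents, and labels are illustrative, not tuned on this data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

texts = ["cats are great", "dogs are great", "cats and dogs",
         "stocks fell", "markets fell hard", "stocks and markets"] * 5
labels = [0, 0, 0, 1, 1, 1] * 5

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
grid = {
    "countvectorizer__min_df": [1, 2],              # drop infrequent terms
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 1.0],             # smoothing strength
}
search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

The same double-underscore parameter names would let one grid cover both the vectorizer settings (steps C and D of feature selection) and the model's own hyperparameters at once.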

Step 5: Pipeline

Cross Validation

Step 6: Feature Engineering and Reduction

Try another model with engineered features: length of the title, number of tags, whether the title is a list, or other descriptions of the title

Feature reduction (PCA, SVD)
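For the reduction step, a sketch of TruncatedSVD (LSA) on toy documents; unlike PCA it works directly on the sparse term-count matrix without densifying it:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase mice", "dogs chase cats", "stocks and bonds",
        "bonds and markets", "cats and dogs", "markets fell"]

# Sparse document-term matrix, then project onto 2 latent components
X = CountVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (6, 2)
```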

Feature union

Weighted feature analysis: which is more important, title, descr, primary_kw, or tags?
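The union step could stack bag-of-words counts with the numeric features above (title length, tag count). A sketch with `FeatureUnion`; the `length_features` helper and the toy titles are illustrative:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

titles = ["17 Shoe Charts Every Guy Needs", "Can You Find All The Cheese?"]

def length_features(docs):
    # One row per document: [character length, word count]
    return np.array([[len(d), len(d.split())] for d in docs])

union = FeatureUnion([
    ("counts", CountVectorizer()),
    ("lengths", FunctionTransformer(length_features, validate=False)),
])
X = union.fit_transform(titles)
print(X.shape)
```

The transformer weights argument of `FeatureUnion` (`transformer_weights`) is one way to probe the "what is more important" question by up- or down-weighting each block.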

Step 7: Random things to try:

A. For articles that go viral in a country, measure the impact of source country, categories, keywords, etc. (pandas can do this)
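A sketch of that per-country cut with pandas; the toy frame below mirrors the pull_cc / cat columns used in this notebook, with made-up impression counts:

```python
import pandas as pd

df = pd.DataFrame({
    "pull_cc": ["us", "us", "uk", "uk", "au"],
    "cat": ["USNews", "Food", "UK", "Food", "Australia"],
    "impressions": [1000000, 5000, 800000, 7000, 250000],
})

# Top category per country by total impressions: sum, sort, then keep
# the first (largest) row within each pull_cc group
top = (df.groupby(["pull_cc", "cat"])["impressions"].sum()
         .sort_values(ascending=False)
         .groupby(level="pull_cc").head(1))
print(top)
```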


In [1]:
import os
import json
import time
import pickle
import requests
import math


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

In [112]:
df = pd.DataFrame()
df = pd.read_csv('may_june_july.csv', delimiter="|")
#df = df[df.pull_cc == 'us']
#df = df.reset_index(drop=True)
df.head()


Out[112]:
id pull_cc cc freq impressions descr cat title metav primary_kw tags
0 4251480 au en-us 2 29316 Giant man with tiny dog alert! Celebrity The Mountain From "Game Of Thrones" Has A Ridi... buzz the mountain dogs game of thrones instagram pomeranians pup...
1 4312033 ca en-us 2 17180 FYI: Ice cream sandwiches > all other sandw... Food 16 Grown-Up Ice Cream Sandwiches That'll Up Yo... life ice cream sandwiches dessert DIY ice cream food52 homemade ice crea...
2 4236366 au en-us 2 3474 "My mama always said you can tell a lot about ... Style 17 Shoe Charts Every Guy Needs To Bookmark life menslifestyle charts shoes
3 4306947 in en-us 2 9027 Let's see if you're a true cheese whiz. Food Can You Find All The Cheese? life cheese cheese quiz cheesy Food food quiz jumblequiz t...
4 4253360 au en-us 2 7247 The EPA just released first-time guidelines on... Science Where To Worry About Fluorinated Chemicals I... news science news epa fluorinated chemicals water

In [113]:
# Combine all text
df['AllText'] = ""
df['primary_kw'].fillna(" ", inplace=True)
df['tags'].fillna(" ", inplace=True)
for i, row in df.iterrows():
    # Concatenate cc, descr, cat, title, primary_kw, and tags by column
    # position; metav (column 8) is left out
    cv = df.iloc[i,2]+" "+df.iloc[i,5]+" "+df.iloc[i,6]+" "+df.iloc[i,7]+" "+df.iloc[i,9]+" "+df.iloc[i,10]
    df.set_value(i,'AllText',cv)

print df.tail()


# Raw buzz score: freq * impressions (scaled); the log2 below makes the
# heavy-tailed distribution roughly normal
df['Log'] = df['freq']*(df['impressions']+1)/1000

for i, row in df.iterrows():
    cv = math.log(df.iloc[i,12],2)
    df.set_value(i,'Log',cv)
    
# analyse data a bit
data_mean = df["Log"].mean()
print data_mean
data_std = df["Log"].std()
print data_std
%matplotlib inline
plt.hist(df["Log"])
plt.show()

# Assign buzz classes: 1buzz/3buzz beyond 1.5 std. dev. below/above the mean Log score, else 2buzz
df['viral'] = ""
for i, row in df.iterrows():
    if df.iloc[i,12]<=(data_mean-1.5*data_std):
        df.set_value(i,'viral','1buzz')
    elif (df.iloc[i,12]>(data_mean+1.5*data_std)):
        df.set_value(i,'viral','3buzz')
    else:
        df.set_value(i,'viral','2buzz')


#df['viral'] = np.where(df['Log']<data_mean-1*data_std, 'notviral', 'viral')
df['viral_num'] = 0
df['viral_num'] = df.viral.map({'1buzz':1, '2buzz':2, '3buzz':3})


            id pull_cc     cc  freq  impressions  \
14292  4267490      us  en-us   240     16919616   
14293  4267490      uk  en-us   242     17616881   
14294  4267490      ca  en-us   244     17497742   
14295  4209250      au  en-au   274       257463   
14296  4206100      in  en-uk   336      1329315   

                                                   descr        cat  \
14292  A former Stanford swimmer who sexually assault...     USNews   
14293  A former Stanford swimmer who sexually assault...     USNews   
14294  A former Stanford swimmer who sexually assault...     USNews   
14295        A definitive ranking of our dirtiest words.  Australia   
14296  Don't worry, you won't need to know Chandler B...         UK   

                                                   title metav  \
14292  Here's The Powerful Letter The Stanford Victim...  news   
14293  Here's The Powerful Letter The Stanford Victim...  news   
14294  Here's The Powerful Letter The Stanford Victim...  news   
14295      The 100 Rudest Fucking Things Australians Say  None   
14296  Only A True "Friends" Fan Can Get More Than 15...  None   

                  primary_kw  \
14292  campus sexual assault   
14293  campus sexual assault   
14294  campus sexual assault   
14295              Australia   
14296                friends   

                                                    tags  \
14292                                stanford university   
14293                                stanford university   
14294                                stanford university   
14295  arse over tit arsewipe bloody hell bloody wank...   
14296  courtney cox david schwimmer Friends quiz frie...   

                                                 AllText  
14292  en-us A former Stanford swimmer who sexually a...  
14293  en-us A former Stanford swimmer who sexually a...  
14294  en-us A former Stanford swimmer who sexually a...  
14295  en-au A definitive ranking of our dirtiest wor...  
14296  en-uk Don't worry, you won't need to know Chan...  
9.97906734145
3.21618646466
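The row-by-row `iterrows`/`set_value` loops in the cell above can be replaced with vectorized operations (`set_value` is deprecated in newer pandas). A sketch on the first three head() rows, which reproduces their Log values (e.g. 5.873666 for row 0):

```python
import numpy as np
import pandas as pd

# freq/impressions taken from the first rows of the head() output above
df = pd.DataFrame({"freq": [2, 2, 240], "impressions": [29316, 17180, 16919616]})

# Vectorized replacement for the score + log2 loop
df["Log"] = np.log2(df["freq"] * (df["impressions"] + 1) / 1000.0)

# Vectorized replacement for the buzz-assignment loop
m, s = df["Log"].mean(), df["Log"].std()
df["viral"] = "2buzz"
df.loc[df["Log"] <= m - 1.5 * s, "viral"] = "1buzz"
df.loc[df["Log"] > m + 1.5 * s, "viral"] = "3buzz"
df["viral_num"] = df["viral"].map({"1buzz": 1, "2buzz": 2, "3buzz": 3})
print(df[["Log", "viral_num"]])
```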

In [114]:
X = df.AllText
y = df.viral_num
# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_df=0.1)
df.head()


Out[114]:
id pull_cc cc freq impressions descr cat title metav primary_kw tags AllText Log viral viral_num
0 4251480 au en-us 2 29316 Giant man with tiny dog alert! Celebrity The Mountain From "Game Of Thrones" Has A Ridi... buzz the mountain dogs game of thrones instagram pomeranians pup... en-us Giant man with tiny dog alert! Celebrity... 5.873666 2buzz 2
1 4312033 ca en-us 2 17180 FYI: Ice cream sandwiches > all other sandw... Food 16 Grown-Up Ice Cream Sandwiches That'll Up Yo... life ice cream sandwiches dessert DIY ice cream food52 homemade ice crea... en-us FYI: Ice cream sandwiches > all other... 5.102742 1buzz 1
2 4236366 au en-us 2 3474 "My mama always said you can tell a lot about ... Style 17 Shoe Charts Every Guy Needs To Bookmark life menslifestyle charts shoes en-us "My mama always said you can tell a lot ... 2.797013 1buzz 1
3 4306947 in en-us 2 9027 Let's see if you're a true cheese whiz. Food Can You Find All The Cheese? life cheese cheese quiz cheesy Food food quiz jumblequiz t... en-us Let's see if you're a true cheese whiz. ... 4.174406 1buzz 1
4 4253360 au en-us 2 7247 The EPA just released first-time guidelines on... Science Where To Worry About Fluorinated Chemicals I... news science news epa fluorinated chemicals water en-us The EPA just released first-time guideli... 3.857583 1buzz 1

In [115]:
df.tail()


Out[115]:
id pull_cc cc freq impressions descr cat title metav primary_kw tags AllText Log viral viral_num
14292 4267490 us en-us 240 16919616 A former Stanford swimmer who sexually assault... USNews Here's The Powerful Letter The Stanford Victim... news campus sexual assault stanford university en-us A former Stanford swimmer who sexually a... 21.953300 3buzz 3
14293 4267490 uk en-us 242 17616881 A former Stanford swimmer who sexually assault... USNews Here's The Powerful Letter The Stanford Victim... news campus sexual assault stanford university en-us A former Stanford swimmer who sexually a... 22.023534 3buzz 3
14294 4267490 ca en-us 244 17497742 A former Stanford swimmer who sexually assault... USNews Here's The Powerful Letter The Stanford Victim... news campus sexual assault stanford university en-us A former Stanford swimmer who sexually a... 22.025619 3buzz 3
14295 4209250 au en-au 274 257463 A definitive ranking of our dirtiest words. Australia The 100 Rudest Fucking Things Australians Say None Australia arse over tit arsewipe bloody hell bloody wank... en-au A definitive ranking of our dirtiest wor... 16.106259 3buzz 3
14296 4206100 in en-uk 336 1329315 Don't worry, you won't need to know Chandler B... UK Only A True "Friends" Fan Can Get More Than 15... None friends courtney cox david schwimmer Friends quiz frie... en-uk Don't worry, you won't need to know Chan... 18.768786 3buzz 3

In [116]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

from sklearn.pipeline import make_pipeline
pipe=make_pipeline(vect, nb)
pipe.steps


Out[116]:
[('countvectorizer',
  CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
          dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
          lowercase=True, max_df=0.1, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None, stop_words=None,
          strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

In [117]:
# calculate accuracy of class predictions
from sklearn.cross_validation import cross_val_score
cross_val_score(pipe,X,y,cv=12,scoring='accuracy').mean()


Out[117]:
0.82981844975972141

In [118]:
# import and instantiate a Logistic Regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.pipeline import make_pipeline
pipe=make_pipeline(vect, logreg)
pipe.steps


Out[118]:
[('countvectorizer',
  CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
          dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
          lowercase=True, max_df=0.1, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None, stop_words=None,
          strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('logisticregression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False))]

In [119]:
# calculate accuracy of class predictions
cross_val_score(pipe,X,y,cv=12,scoring='accuracy').mean()


Out[119]:
0.87968707842769478

In [120]:
data_mean-1.5*data_std


Out[120]:
5.1547876444580458

In [121]:
data_mean+1.5*data_std


Out[121]:
14.803347038439068

In [122]:
print data_mean
print data_std


9.97906734145
3.21618646466

In [123]:
df.shape


Out[123]:
(14297, 15)

In [124]:
df.viral.value_counts()


Out[124]:
2buzz    12382
1buzz     1003
3buzz      912
Name: viral, dtype: int64
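These class counts suggest checking the majority-class baseline before trusting the accuracy scores: always predicting 2buzz already scores about 0.866, so the Multinomial NB result (~0.830) is below that baseline and logistic regression (~0.880) only slightly above it.

```python
# Class counts from the value_counts() output above
counts = {"2buzz": 12382, "1buzz": 1003, "3buzz": 912}
baseline = max(counts.values()) / float(sum(counts.values()))
print(round(baseline, 3))  # 0.866
```

A class-balanced metric (e.g. macro F1, or `class_weight='balanced'` for logistic regression) would be a fairer yardstick here.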

In [ ]: