W207 Summer 2017 Final Project iPython notebook

Omar Al Taher, Ted Pham, Chris SanChez


In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# General libraries.
import json
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV #update module model_selection
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, cross_val_score

from sklearn import preprocessing
from sklearn.mixture import GMM

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

I. Background:

This project aims to predict whether a reddit post asking for pizzas would get funded. Since it's a binary classification problem, we will explore several algorithms with a focus on logistic regression. In particular, we will look into details how to extract features from text.

II. Data Pre-Processing:

The data in its raw form consists of 4040 observations of 31 features. The original columns consist of 19 integer values, 4 floats, and 8 objects (there is one boolean column which is the outcome variable). In order to extract predictive value from the dataset a good deal of pre-processing and feature engineering was required. A walkthrough of the various steps taken follows below in a narrative format:


In [2]:
#load json training data into pandas dataframe
df = pd.read_json('train.json')
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4040 entries, 0 to 4039
Data columns (total 32 columns):
giver_username_if_known                                 4040 non-null object
number_of_downvotes_of_request_at_retrieval             4040 non-null int64
number_of_upvotes_of_request_at_retrieval               4040 non-null int64
post_was_edited                                         4040 non-null int64
request_id                                              4040 non-null object
request_number_of_comments_at_retrieval                 4040 non-null int64
request_text                                            4040 non-null object
request_text_edit_aware                                 4040 non-null object
request_title                                           4040 non-null object
requester_account_age_in_days_at_request                4040 non-null float64
requester_account_age_in_days_at_retrieval              4040 non-null float64
requester_days_since_first_post_on_raop_at_request      4040 non-null float64
requester_days_since_first_post_on_raop_at_retrieval    4040 non-null float64
requester_number_of_comments_at_request                 4040 non-null int64
requester_number_of_comments_at_retrieval               4040 non-null int64
requester_number_of_comments_in_raop_at_request         4040 non-null int64
requester_number_of_comments_in_raop_at_retrieval       4040 non-null int64
requester_number_of_posts_at_request                    4040 non-null int64
requester_number_of_posts_at_retrieval                  4040 non-null int64
requester_number_of_posts_on_raop_at_request            4040 non-null int64
requester_number_of_posts_on_raop_at_retrieval          4040 non-null int64
requester_number_of_subreddits_at_request               4040 non-null int64
requester_received_pizza                                4040 non-null bool
requester_subreddits_at_request                         4040 non-null object
requester_upvotes_minus_downvotes_at_request            4040 non-null int64
requester_upvotes_minus_downvotes_at_retrieval          4040 non-null int64
requester_upvotes_plus_downvotes_at_request             4040 non-null int64
requester_upvotes_plus_downvotes_at_retrieval           4040 non-null int64
requester_user_flair                                    994 non-null object
requester_username                                      4040 non-null object
unix_timestamp_of_request                               4040 non-null int64
unix_timestamp_of_request_utc                           4040 non-null int64
dtypes: bool(1), float64(4), int64(19), object(8)
memory usage: 1013.9+ KB

We'll start by transforming the outcome variable into a binary variable.


In [3]:
df['requester_received_pizza'] = np.where(df['requester_received_pizza'] == True, 1, 0)
df['requester_received_pizza'].value_counts()


Out[3]:
0    3046
1     994
Name: requester_received_pizza, dtype: int64

Next, we'll remove all the "_at_retrieval" columns from the dataset as they are not found in the test data set and therefore represent data that is not avaialable at the time of the request for pizza.


In [4]:
good_indexes = []
for i, name in enumerate(df.columns):
    if re.findall('retrieval', name):
        pass
    else:
        good_indexes.append(i)

# Remove at_retrieval fields from dataframce df
columns = df.columns[good_indexes]
df =  df.loc[:,columns]

We also found that there were several other columns that were not needed for predictive power, and we therefore removed them as well.


In [5]:
#Drop six more columns from dataset
df.drop(['giver_username_if_known', 'post_was_edited', 'request_id', 
        'requester_user_flair', 'requester_username'], axis=1, inplace=True)
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4040 entries, 0 to 4039
Data columns (total 16 columns):
request_text                                          4040 non-null object
request_text_edit_aware                               4040 non-null object
request_title                                         4040 non-null object
requester_account_age_in_days_at_request              4040 non-null float64
requester_days_since_first_post_on_raop_at_request    4040 non-null float64
requester_number_of_comments_at_request               4040 non-null int64
requester_number_of_comments_in_raop_at_request       4040 non-null int64
requester_number_of_posts_at_request                  4040 non-null int64
requester_number_of_posts_on_raop_at_request          4040 non-null int64
requester_number_of_subreddits_at_request             4040 non-null int64
requester_received_pizza                              4040 non-null int64
requester_subreddits_at_request                       4040 non-null object
requester_upvotes_minus_downvotes_at_request          4040 non-null int64
requester_upvotes_plus_downvotes_at_request           4040 non-null int64
unix_timestamp_of_request                             4040 non-null int64
unix_timestamp_of_request_utc                         4040 non-null int64
dtypes: float64(2), int64(10), object(4)
memory usage: 536.6+ KB

Removing the "at_retrieval" columns and the other six columns, reduces our dataset by 17 total features (leaving us with 14 features, not including the outcome variable ).

During the EDA phase of this project we found that several (104 to be exact), observations had a "request_text" length of zero, some of these observations actually ended up being given a pizza. After looking through the data, we discovered that some people had left their request in the "request_title" field of the RAOP Reddit page and had left their request_text field blank. In order to clean this discrepancy up, we decided to combine these two fields together, as it is unclear if the benefactors (those who ended up giving pizzas away), were responding to the '"request_title" field, the "request_text" field, or both when they made their altruistic decision.


In [6]:
#Show that 104 observations have a blank "request_text" field
len(df[df['request_text'].str.len() == 0])


Out[6]:
104

In [7]:
#1. Combine request_text and request_title fields
#2. Lowercase all words 

df['request_text_n_title'] = (df['request_title'] + ' ' + df['request_text_edit_aware'])
df['request_text_n_title'] = [ text.split(" ",1)[1].lower() for text in df['request_text_n_title']]
print df['request_text_n_title'].head()
 
#3. Add a total length feature to the dataset 
df['total_length'] =df['request_text_n_title'].apply(lambda x: len(x.split(' ')))

#4. Ensure there are no zero length requests in the new feature/column
print '\nAfter combining request_title and request_text, number of requests with length of \
"zero": {}'.format(len(df[df['request_text_n_title'] == 0]))


0    colorado springs help us please hi i am in nee...
1    california, no cash and i could use some dinne...
2    hungry couple in dundee, scotland would love s...
3    in canada (ontario), just got home from school...
4    old friend coming to visit. would love to feed...
Name: request_text_n_title, dtype: object

After combining request_title and request_text, number of requests with length of "zero": 0

In [8]:
#Showcase of new added features to dataset
df.loc[:5,['request_text_n_title', 'total_length']]


Out[8]:
request_text_n_title total_length
0 colorado springs help us please hi i am in nee... 72
1 california, no cash and i could use some dinne... 25
2 hungry couple in dundee, scotland would love s... 68
3 in canada (ontario), just got home from school... 39
4 old friend coming to visit. would love to feed... 115
5 i'll give a two week xbox live code for a slic... 47

III. Baseline Modeling:

With pre-processing out of the way, we decided to run a baseline text analysis model using the "request_text_n_title" feature alone. This model serves as our basis against which we can compare the success or failure of all futue feature engineering efforts.

IIIa. Train and Dev set creation

We begin by forming our training and dev sets. Given that the success rate for receiving a pizza in the original dataset is ~25%, we'll check our randomized training and dev sets to ensure a similar success rate is preserved.


In [9]:
np.random.seed(0)

#separate into features and labels
text_features = df['request_text_n_title'].values # text features
success_rate = sum(df['requester_received_pizza'])/4040.
print "Original success rate: {}".format(round(success_rate, 4))

# Create target field for received pizza are only 1's and 0's
target = df['requester_received_pizza'].values

#shuffle our data to ensure randomization
shuffle = np.random.permutation(len(text_features))
text_features, target = text_features[shuffle], target[shuffle]

#separate into training and dev groups
train_data, train_labels = text_features[:3200], target[:3200]
dev_data, dev_labels = text_features[3200:], target[3200:]

#check to ensure success rate is roughly preserved across sets 
train_success_rate = sum(train_labels)/3200.
dev_success_rate = sum(dev_labels)/840.
print "Training success rate: {}".format(round(train_success_rate, 4))
print "Dev success rate: {}".format(round(dev_success_rate, 4))

#check to ensure we've got the right datasets
print '\n\nTraining Data shape: \t{}'.format(train_data.shape)
print 'Training Labels shape: \t{}'.format(train_labels.shape)
print 'Dev data shape: \t{}'.format(dev_data.shape)
print 'Dev Labels shape: \t{}'.format(dev_labels.shape)


Original success rate: 0.246
Training success rate: 0.2462
Dev success rate: 0.2452


Training Data shape: 	(3200,)
Training Labels shape: 	(3200,)
Dev data shape: 	(840,)
Dev Labels shape: 	(840,)

IIIb. Logistic Regression

Our initial model will make use of the sklearn TfidfVectorizer function to encapsulate predictive power from the training data. After some initial trial runs, we realized that we needed to limit the number of features to make a useful model. In this case, a predictive "sweet spot" emerged in the 350+ - 400 range, so we limited our feature number to this amount.


In [10]:
#fit logistic classifier to training data
print "BASELINE LOGISTIC REGRESSION"
print "----------------------------"
vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_data)
dev_matrix = vec.transform(dev_data)
lr= LogisticRegression(n_jobs=-1, class_weight='balanced').fit(train_matrix, train_labels)
predictions = lr.predict(dev_matrix)
score = round(roc_auc_score(dev_labels, predictions, average='weighted'), 4)
print "Baseline ROC AUC score: {}".format(score)

print "\n\nRESTRICTED FEATURES LOGISTIC REGRESSION"
print "---------------------------------------"

model_scores = []
range_features = np.arange(385,400)

for features in range_features:    
    vec = TfidfVectorizer(stop_words='english',sublinear_tf=1, ngram_range=(1,1),max_features=features)
    train_matrix = vec.fit_transform(train_data)
    dev_matrix = vec.transform(dev_data)
    lr= LogisticRegression(n_jobs=-1, class_weight='balanced').fit(train_matrix, train_labels)
    predictions = lr.predict(dev_matrix)
    model_scores.append(round(metrics.roc_auc_score(dev_labels, predictions, average = 'weighted'), 4))
    best_score = round(max(model_scores), 4)
print "Max number of features: {}".format(range_features[np.argmax(model_scores)])
print "Best ROC AUC Score: {}".format(best_score)

#best_max_feature
best_max_feature = range_features[np.argmax(model_scores)]

#fit logistic classifier to training data
vec = TfidfVectorizer(stop_words='english',sublinear_tf=1, ngram_range=(1,1), max_features=best_max_feature)
train_matrix = vec.fit_transform(train_data)
dev_matrix = vec.transform(dev_data)

print "\n"
print "LOGISTIC REGRESSION with tuning c value and restricted # of features"
print "--------------------------------------------------------------------"
c_values = np.logspace(.0001, 2, 200)
c_scores = []
for value in c_values:
    lr = LogisticRegression(C=value, n_jobs=-1, class_weight='balanced', penalty='l2')
    lr.fit(train_matrix, train_labels)
    predictions = lr.predict(dev_matrix)
    c_scores.append(round(metrics.roc_auc_score(dev_labels, predictions, average = 'weighted'), 4))
    
best_c_value = c_values[np.argmax(c_scores)]
print "Best C-value: {}".format(best_c_value)
print "Best ROC AUC Score: {}".format(c_scores[np.argmax(c_scores)])
lr = LogisticRegression(C=best_c_value, n_jobs=-1).fit(train_matrix, train_labels)
predictions = lr.predict(dev_matrix)
print '\n'


BASELINE LOGISTIC REGRESSION
----------------------------
Baseline ROC AUC score: 0.5841


RESTRICTED FEATURES LOGISTIC REGRESSION
---------------------------------------
Max number of features: 396
Best ROC AUC Score: 0.6056


LOGISTIC REGRESSION with tuning c value and restricted # of features
--------------------------------------------------------------------
Best C-value: 2.30086403033
Best ROC AUC Score: 0.6088


Our initial baseline model with Logistic Regression does suprisingly well using only the raw text without any feature engineering or hyperparameter tuning. The initial score represents a rough 9% improvement over random guessing alone. After restricting the number of features in the model we see a 2% improvement in predictive accuracy, suggesting that the great majority of unique words in the feature space do not contribute to predictive accuracy. We further tested baseline models using Multinomial Naive Bayes and SVM.

IIIc. Bernoulli Naive Bayes


In [11]:
#fit Multinomial Naive Bayes classifier to training data
print "BERNOULLI NAIVE BAYES"
print "-----------------------"
model_scores = []
range_features = np.arange(415, 431)
for features in range_features:
    vec = TfidfVectorizer(max_features=features)
    train_matrix = vec.fit_transform(train_data)
    dev_matrix = vec.transform(dev_data)
    alphas = np.linspace(.0001, 1, 20)
    bnb = BernoulliNB(alpha=.001).fit(train_matrix, train_labels)
    predictions = bnb.predict(dev_matrix)
    model_scores.append(round(roc_auc_score(dev_labels, predictions, average='weighted'), 4))
print "Best ROC AUC Score: {}".format(max(model_scores))
print "Max Features: {}".format(range_features[np.argmax(model_scores)])


BERNOULLI NAIVE BAYES
-----------------------
Best ROC AUC Score: 0.5948
Max Features: 419

Using Bernoulli NB we get similar results to LR, again making use of restricting the number of features in our training matrix.

IIId. Support Vector Machine


In [12]:
print "\nSupport Vector Machine"
print "------------------------"
vec = TfidfVectorizer(max_features=best_max_feature, stop_words='english')
train_matrix = vec.fit_transform(train_data)
dev_matrix = vec.transform(dev_data)
svc = LinearSVC().fit(train_matrix, train_labels)
predictions = svc.predict(dev_matrix)
print round(roc_auc_score(dev_labels, predictions, average='weighted'),4)


Support Vector Machine
------------------------
0.5196

SVM performs no better than previous models.

IV. Feature Engineering

Setting aside text processing for the time being, we decided to focus on specific features within the dataset that could possibly lead to increased predictive power. For example, we noted that users who included pictues in their request (presumably as a means of validating their story), were disporportionately more likely to receive a pizza. Similarly, users who wrote longer requests, or who had spent more time in the RAOP community were also more likely to recieve a pizza. Therefore, what follows is an explanation of our attempts at engineering features to build the best predictive model possible. With a few exceptions all features were binarized.

a. Including an image

The inclusion of an image in the request indicates that the requester is providing some evidence of their need for a pizza. A photo might be a screenshot of a bank account balance, a large bill, an injury, or other misfortune. Adding such evidence increases the odds of receiving a pizza.


In [13]:
#Create feature where in image is included in the request
df['image_incl'] = np.where(df['request_text_n_title'].str.contains("imgur"), int(1), int(0))

b. Community standing/status


In [14]:
#Create feature that is an aggregate of all indicators of community status including seniority

df['karma'] = df['requester_account_age_in_days_at_request'] + df['requester_days_since_first_post_on_raop_at_request']\
+ df['requester_number_of_comments_at_request'] + df['requester_number_of_comments_in_raop_at_request'] + \
df['requester_number_of_posts_at_request'] + df['requester_number_of_posts_on_raop_at_request'] + \
df['requester_number_of_subreddits_at_request'] + df['requester_upvotes_minus_downvotes_at_request']

karma_winners = df['karma'][df['requester_received_pizza'] == 1].describe()
karma_losers = df['karma'][df['requester_received_pizza'] == 0].describe()
karma_comparison = pd.concat([karma_winners, karma_losers], axis=1)
karma_comparison.columns = ['winners', 'losers']
karma_comparison


Out[14]:
winners losers
count 994.000000 3046.000000
mean 1848.437240 1501.113257
std 5598.137867 3231.267421
min 0.000000 -10.895197
25% 132.845203 5.020179
50% 666.343900 480.832176
75% 2127.290081 1762.444358
max 156272.652604 89490.103021

A quick analysis of the newly created "karma" features highlights the fact that people who received pizza have a median karma value of over 660, compared to those who did not with a median value of only 480. This large difference will likely have strong predictve power for our model. As indicated in the code snippet below, roughly one-third of all observations who did not receive a pizza have a karma value below 100.


In [15]:
low_losers= len(df[df['requester_received_pizza'] == 0][df['karma'] < 15])
low_winners=len(df[df['requester_received_pizza'] == 1][df['karma'] < 15])
print low_losers, low_losers/3046.
print low_winners, low_winners/994.

high_losers = len(df[df['requester_received_pizza'] == 0][df['karma'] > 5000])
high_winners = len(df[df['requester_received_pizza'] == 1][df['karma'] > 5000])
print high_losers, high_losers/3046.
print high_winners, high_winners/994.

df['karma_low'] = np.where(df['karma'] < 15, 1, 0)


820 0.26920551543
163 0.163983903421
237 0.0778069599475
81 0.0814889336016

c. Request length


In [16]:
length_winners = df['total_length'][df['requester_received_pizza'] == 1].describe()
length_losers = df['total_length'][df['requester_received_pizza'] == 0].describe()
length_comparison = pd.concat([length_winners, length_losers], axis=1)
length_comparison.columns = ['winners', 'losers']
length_comparison


Out[16]:
winners losers
count 994.000000 3046.000000
mean 101.668008 83.379186
std 75.466061 66.959005
min 8.000000 4.000000
25% 54.000000 43.000000
50% 82.000000 66.000000
75% 125.000000 102.000000
max 828.000000 862.000000

The analsyis here indicates that while not as dramatic a difference as the previous feature, there is a moderate difference in text length between winners and losers. Apparently, a longer text (request) indicates more of an effort to explain the users particular situation, and is therefore more likely to be seen as sincere or plausible by the RAOP community.

d. Extracting time-based features:

The UTC time stamp is parsed and converted to human readable format. Also, the first half of the month is identified with a binary variable since people might be more money to use for donating a pizza earlier in the month.


In [17]:
df['day']=df['unix_timestamp_of_request_utc'].apply(lambda x:int(datetime.datetime.fromtimestamp(int(x)).strftime('%d')))
df['time']=df['unix_timestamp_of_request'].apply(lambda x:int(datetime.datetime.fromtimestamp(int(x)).strftime('%H')))
df['first_half'] = np.where(df['day'] < 16, 1, 0)

e. Extracting gratitude and "pay it forward" sentiment


In [18]:
#Create binary variables that show whether requester is grateful or willing to "pay it forward"
df['requester_grateful'] = np.where(df['request_text_n_title'].str.contains('thanks' or 'advance' or 'guy'\
                    or 'reading' or 'anyone' or 'anything' or'story'or 'tonight'or 'favor'or'craving'), int(1), int(0))

df['requester_payback'] = np.where(df['request_text_n_title'].str.contains('return' or 'pay it back'\
                                                                     or 'pay it forward' or 'favor'), int(1), int(0))

f. Extracting request narratives

We extract the narrative out of each request. Based on our EDA, we picked 5 narratives: Money, Job, Student, Family and Craving. 5 new columns will be added for the narratives which are coded in binary 0 or 1. The terms that specify each narrative are given in the narratives' dictionary. We counted the total occurent of the words in each post, normalize this count by the word count in the post, and use median-threshold to determin whether the post fall into the respective narrative.


In [19]:
# Define narrative categories
narratives = {
            'money':['money','now','broke','week','until','time',
                      'last','day','when','today','tonight','paid',
                      'next','first','night','night','after','tomorrow',
                      'while','account','before','long','friday','rent',
                      'buy','bank','still','bills','ago','cash','due',
                      'soon','past','never','paycheck','check','spent',
                      'year','years','poor','till','yesterday','morning',
                      'dollars','financial','hour','bill','evening','credit',
                      'budget','loan','bucks','deposit','dollar','current','payed'],
             'job':['work','job','paycheck','unemployment','interviewed',
                   'fired','employment','hired','hire'],
             'student':['college','student','school','roommate','studying',
                       'study','university','finals','semester','class','project',
                       'dorm','tuition'],
             'family': ['family','mom','wife','parents','mother','husband','dad','son',
                     'daughter','father','parent','mum','children','starving','hungry'],
             'craving': ['friend','girlfriend','birthday','boyfriend','celebrate',
                      'party','game','games','movie','movies','date','drunk',
                      'beer','celebrating','invited','drinks','crave','wasted','invited']
            }

# function to extract word count for each narrative from one text post, normalize by the word count of that post
def single_extract(text):  
    count = {'money':0.,
            'job':0.,
            'student':0.,
            'family':0.,
            'craving':0.}
    words = text.split(' ')
    length = 1./len(words)
    for word in text.split(' '):
        for i,k in narratives.items():
            if word in k:
                count[i] += length
    return count.values()

# Extract request_text_n_title field
texts = df['request_text_n_title'].copy()

#initialize count 
count =[]
# return normalized count for each narrative from all requests
for text in texts:
    count.append(single_extract(text))

# narrative dataframe
narrative = pd.DataFrame(count)
# set up median for using with the test set
median_values = []

#extract narrative field
for i,k in enumerate(narratives.keys()):
    median_values.append(np.median(narrative[i]))
    narrative['narrative_'+k] = (narrative[i] > np.median(narrative[i])).astype(int)
    narrative.drop([i],axis=1,inplace=True)

# concatenate 
df = pd.concat([df,narrative],axis=1)

In the end, we added a total of 16 additional features, 3 continuous and 13 binary.


In [20]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4040 entries, 0 to 4039
Data columns (total 31 columns):
request_text                                          4040 non-null object
request_text_edit_aware                               4040 non-null object
request_title                                         4040 non-null object
requester_account_age_in_days_at_request              4040 non-null float64
requester_days_since_first_post_on_raop_at_request    4040 non-null float64
requester_number_of_comments_at_request               4040 non-null int64
requester_number_of_comments_in_raop_at_request       4040 non-null int64
requester_number_of_posts_at_request                  4040 non-null int64
requester_number_of_posts_on_raop_at_request          4040 non-null int64
requester_number_of_subreddits_at_request             4040 non-null int64
requester_received_pizza                              4040 non-null int64
requester_subreddits_at_request                       4040 non-null object
requester_upvotes_minus_downvotes_at_request          4040 non-null int64
requester_upvotes_plus_downvotes_at_request           4040 non-null int64
unix_timestamp_of_request                             4040 non-null int64
unix_timestamp_of_request_utc                         4040 non-null int64
request_text_n_title                                  4040 non-null object
total_length                                          4040 non-null int64
image_incl                                            4040 non-null int64
karma                                                 4040 non-null float64
karma_low                                             4040 non-null int64
day                                                   4040 non-null int64
time                                                  4040 non-null int64
first_half                                            4040 non-null int64
requester_grateful                                    4040 non-null int64
requester_payback                                     4040 non-null int64
narrative_money                                       4040 non-null int64
narrative_job                                         4040 non-null int64
narrative_family                                      4040 non-null int64
narrative_student                                     4040 non-null int64
narrative_craving                                     4040 non-null int64
dtypes: float64(3), int64(23), object(5)
memory usage: 1010.0+ KB

To fit our model, we combined all engineered features into a single dataframe. And after several iterations, decided to truncate some of the added features to improve predictive accuracy ('time' and 'karma' specifically).


In [21]:
continuous_list = ['total_length']#, 'time', 'karma']
binary_list = [
                                               u'image_incl',
                                                u'karma_low',
                                            #        u'month',
                                             #         u'day',
                                                     u'time',
                                               u'first_half',
                                       u'requester_grateful',
                                        u'requester_payback',
                                          u'narrative_money',
                                            u'narrative_job',
                                         u'narrative_family',
                                        u'narrative_student',
                                        u'narrative_craving'
             ]

#create new DataFrame using previously defined "numeric_features" object to determine all columns in the DF

numeric_features = df.copy().loc[:,continuous_list]
numeric_features_norm = pd.DataFrame(data=preprocessing.normalize(numeric_features, axis=0),\
                                     columns=numeric_features.columns.values)  

# combine to contious and binary
numeric_features_norm = pd.concat([df[binary_list],numeric_features_norm],axis=1)
numeric_features_norm.head()


Out[21]:
image_incl karma_low time first_half requester_grateful requester_payback narrative_money narrative_job narrative_family narrative_student narrative_craving total_length
0 0 1 15 1 0 0 0 0 1 0 0 0.010106
1 0 0 22 0 0 0 1 0 0 0 0 0.003509
2 0 1 10 0 0 0 1 0 1 0 1 0.009545
3 0 0 11 1 0 0 0 0 1 0 0 0.005474
4 0 0 12 1 0 0 0 0 0 0 1 0.016142

In [22]:
features = numeric_features_norm.copy()
target = df['requester_received_pizza']
features.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4040 entries, 0 to 4039
Data columns (total 12 columns):
image_incl            4040 non-null int64
karma_low             4040 non-null int64
time                  4040 non-null int64
first_half            4040 non-null int64
requester_grateful    4040 non-null int64
requester_payback     4040 non-null int64
narrative_money       4040 non-null int64
narrative_job         4040 non-null int64
narrative_family      4040 non-null int64
narrative_student     4040 non-null int64
narrative_craving     4040 non-null int64
total_length          4040 non-null float64
dtypes: float64(1), int64(11)
memory usage: 410.3 KB

V. Final Modeling

We decided to create three final models.

  • One model using just our engineered features (without unigram text)
  • One combined model using engineered features and unigram text
  • One model using predicted probabilities as model features

In [23]:
X, y = features.values, target.copy()
#shuffle = np.random.permutation(len(X))
X, y = X[shuffle], y[shuffle]
Xtrain, Xdev, ytrain, ydev = X[:3200], X[3200:], y[:3200], y[3200:]
Xtrain.shape, Xdev.shape, ytrain.shape, ydev.shape


Out[23]:
((3200, 12), (840, 12), (3200,), (840,))

a. Model One: Engineered Features only


In [24]:
scores = []
c_values = np.linspace(.001, 100, 20)
for c in c_values:
    lr_8 = LogisticRegression(C=c, class_weight='balanced', n_jobs=-1).fit(Xtrain, ytrain)
    predictions = lr_8.predict(Xdev)
    scores.append(round(roc_auc_score(ydev, predictions, average='weighted'), 4))
print "Best C-value: {}".format(c_values[np.argmax(scores)])
print 'Best AUC score based on metrics.roc_auc_score = {}'.format(max(scores))


Best C-value: 68.4213684211
Best AUC score based on metrics.roc_auc_score = 0.5876

b. Model Two: Engineered Features combined with Unigram text


In [25]:
#Initialize our vectorizer using hyperparameters from previous models
vec = TfidfVectorizer(stop_words='english',sublinear_tf=1, max_features=best_max_feature)
train_matrix = vec.fit_transform(train_data)

#Transform our text features
text_features = df['request_text_n_title'].values.copy()
text_transformed = vec.transform(text_features)

# Concatenate egineered numeric_features with vectorized text features 
combined_features = np.append(features.values, text_transformed.toarray(), axis = 1)
print combined_features.shape

#prep data for modeling
X, y = combined_features, target.copy()
X, y = X[shuffle], y[shuffle]
Xtrain, Xdev, ytrain, ydev = X[:3200], X[3200:], y[:3200], y[3200:]
Xtrain.shape, Xdev.shape, ytrain.shape, ydev.shape

#Fit logistic model to data
c_values = np.logspace(.0001, 2, 200)
c_scores = []
for value in c_values:
    lr = LogisticRegression(C=value, n_jobs=-1, class_weight='balanced', penalty='l2')
    lr.fit(Xtrain, ytrain)
    predictions = lr.predict(Xdev)
    c_scores.append(round(metrics.roc_auc_score(ydev, predictions, average = 'weighted'), 4))
    
best_C_value = c_values[np.argmax(c_scores)]
print "Best C-value: {}".format(best_C_value)
print "Best ROC AUC Score: {}".format(c_scores[np.argmax(c_scores)])


(4040, 408)
Best C-value: 1.783789974
Best ROC AUC Score: 0.6121

Model Three: Unigrams predict_probability combined with engineered features


In [26]:
#fit logistic classifier to training data
vec = TfidfVectorizer(stop_words='english',sublinear_tf=1, max_features=best_max_feature)
train_matrix = vec.fit_transform(train_data)


lr1 = LogisticRegression(C=best_c_value, n_jobs=-1).fit(train_matrix, train_labels)

text_features = df['request_text_n_title'].values.copy()

text_transformed = vec.transform(text_features)
# create pizza_predict field for all records based on features array
pizza_predict = lr1.predict_proba(text_transformed)[:,1][:, np.newaxis] 

# Concatenate numeric_features with pizza_predict to create pizza_predict + numeric features in ens_features
combined_features = np.append(features.values, pizza_predict, axis = 1)

X, y = combined_features, target.copy()
#shuffle = np.random.permutation(len(X))
X, y = X[shuffle], y[shuffle]
Xtrain, Xdev, ytrain, ydev = X[:3200], X[3200:], y[:3200], y[3200:]
Xtrain.shape, Xdev.shape, ytrain.shape, ydev.shape

#####
c_values = np.logspace(.0001, 2, 200)
c_scores = []
for value in c_values:
    lr = LogisticRegression(C=value, n_jobs=-1, class_weight='balanced', penalty='l2')
    lr.fit(Xtrain, ytrain)
    predictions = lr.predict(Xdev)
    c_scores.append(round(metrics.roc_auc_score(ydev, predictions, average = 'weighted'), 4))
    
best_C_value = c_values[np.argmax(c_scores)]
print "Best C-value: {}".format(best_C_value)
print "Best ROC AUC Score: {}".format(c_scores[np.argmax(c_scores)])


Best C-value: 100.0
Best ROC AUC Score: 0.5906

VI. Results and Analysis

Our best model ended up being model Two which was a combination of engineered features and unigram text. This result is not surprising, especially considering the amount of work required to engineer the given features. We submitted different variants of model two to the Kaggle website with varying degrees of success.

FOR SUBMISSION WITH FINAL MODELS


In [27]:
"""Our first attempt is a brute force method with logistics regression that 
use the best parameters from developing the model"""


df_test = pd.read_json('test.json')

df_test['request_text_n_title'] = (df_test['request_title'] + ' ' + df_test['request_text_edit_aware'])
df_test['request_text_n_title'] = [ text.split(" ",1)[1].lower() for text in df_test['request_text_n_title']]
# print df_test['request_text_n_title'].head()

#create test set
test_features = df_test['request_text_n_title'].values # text features

#create new training set using all 4040 training samples
train_features = df['request_text_n_title'].values
train_labels = df['requester_received_pizza'].values

#shuffle our data to ensure randomization
shuffle = np.random.permutation(len(train_features))
train_features, train_labels = train_features[shuffle], train_labels[shuffle]


#transform raw data into Tfdif vector
vec = TfidfVectorizer(stop_words='english',sublinear_tf=1, max_features=387)
train_matrix = vec.fit_transform(train_features)
test_matrix = vec.transform(test_features)

#fit Logistic Regression model
lr= LogisticRegression(n_jobs=-1, class_weight='balanced').fit(train_matrix, train_labels)
test_predictions = lr.predict(test_matrix)[:, np.newaxis]

# print type(predictions_test)
sub_1 = np.append(df_test['request_id'].values[:, np.newaxis], test_predictions, axis = 1)
sub_1_df = pd.DataFrame(data=sub_1, columns=['request_id', 'requester_received_pizza'])  # 1st row as the column names
sub_1_df.to_csv("submission_1.csv", sep=',', header=True,  mode='w', index=0)


"""With this submission, we get  a AUC score of 0.59386"""

print "Kaggle submission 1, result AUC  = 0.59386"


Kaggle submission 1, result AUC  = 0.59386

In [28]:
"""The 2nd submission, we define a function process
to process the train and test json files

This function use the file name and the max number of feature tuned previously.
The output of process function depends on if it's train or test set
"""


def process(filename,max_feature_length,train=True):

    df = pd.read_json(filename)
    
    df['request_text_n_title'] = (df['request_title'] + ' ' + df['request_text_edit_aware'])
    
    
    df['request_text_n_title'] = [ text.split(" ",1)[1].lower() for text in df['request_text_n_title']]
    #total length
    df['total_length'] =df['request_text_n_title'].apply(lambda x: len(x.split(' ')))
    
    #get engineer features
    
    # image included?
    df['image_incl'] = np.where(df['request_text_n_title'].str.contains("imgur"), int(1), int(0))
    #Karma
    df['karma'] = df['requester_account_age_in_days_at_request']\
                 + df['requester_days_since_first_post_on_raop_at_request']\
                 + df['requester_number_of_comments_at_request'] + df['requester_number_of_comments_in_raop_at_request'] + \
                df['requester_number_of_posts_at_request'] + df['requester_number_of_posts_on_raop_at_request'] + \
                df['requester_number_of_subreddits_at_request'] + df['requester_upvotes_minus_downvotes_at_request']
    
    
    df['karma_low'] = np.where(df['karma'] < 15, 1, 0)
    
    
    # time
    df['day']=df['unix_timestamp_of_request_utc'].apply(lambda x:int(datetime.datetime.fromtimestamp(int(x)).strftime('%d')))
    df['time']=df['unix_timestamp_of_request'].apply(lambda x:int(datetime.datetime.fromtimestamp(int(x)).strftime('%H')))
    df['first_half'] = np.where(df['day'] < 16, 1, 0)
    
    # requester's attitude
    df['requester_grateful'] = np.where(df['request_text_n_title'].
                               str.contains('thanks' or 'advance' or 'guy'\
                                            or 'reading' or 'anyone' or 'anything'\
                                           'story'or 'tonight'or 'favor'or'craving'), 
                                            int(1), int(0))

    df['requester_payback'] = np.where(df['request_text_n_title'].
                               str.contains('return' or 'pay it back' or 'pay it forward' or 'favor'), int(1), int(0))

    #narrative
    # Define narrative categories
    narratives = {
                'money':['money','now','broke','week','until','time',
                          'last','day','when','today','tonight','paid',
                          'next','first','night','night','after','tomorrow',
                          'while','account','before','long','friday','rent',
                          'buy','bank','still','bills','ago','cash','due',
                          'soon','past','never','paycheck','check','spent',
                          'year','years','poor','till','yesterday','morning',
                          'dollars','financial','hour','bill','evening','credit',
                          'budget','loan','bucks','deposit','dollar','current','payed'],
                 'job':['work','job','paycheck','unemployment','interviewed',
                       'fired','employment','hired','hire'],
                 'student':['college','student','school','roommate','studying',
                           'study','university','finals','semester','class','project',
                           'dorm','tuition'],
                 'family': ['family','mom','wife','parents','mother','husband','dad','son',
                         'daughter','father','parent','mum','children','starving','hungry'],
                 'craving': ['friend','girlfriend','birthday','boyfriend','celebrate',
                          'party','game','games','movie','movies','date','drunk',
                          'beer','celebrating','invited','drinks','crave','wasted','invited']
                }

    # function to extract word count for each narrative from one text post
    # normalize by the word count of that post
    def single_extract(text):  
        count = {'money':0.,
                'job':0.,
                'student':0.,
                'family':0.,
                'craving':0.}
        words = text.split(' ')
        length = 1./len(words)
        for word in text.split(' '):
            for i,k in narratives.items():
                if word in k:
                    count[i] += length
        return count.values()

    # Extract request_text_n_title field
    texts = df['request_text_n_title'].copy()

    #initialize count 
    count =[]
    # return normalized count for each narrative from all requests
    for text in texts:
        count.append(single_extract(text))

    # narrative dataframe
    narrative = pd.DataFrame(count)
    # set up median for using with the test set

    #extract narrative field
    for i,k in enumerate(narratives.keys()):
        narrative['narrative_'+k] = (narrative[i] > np.median(narrative[i])).astype(int)
        narrative.drop([i],axis=1,inplace=True)

    # concatenate 
    df = pd.concat([df,narrative],axis=1)
    
    
    continuous_list = ['total_length','time']
    # include relevant binary variables and avoid overfitting
    binary_list = ['requester_grateful','narrative_job',
              'narrative_family', 'narrative_craving',
              'image_incl']
    
    #create new DataFrame using previously defined "numeric_features" object to determine all columns in the DF

    continuous_features = df.copy().loc[:,continuous_list]
    
    engineered_features = pd.concat([df[binary_list],continuous_features],axis=1)

    
    if train==True:
        # binarize sucess
        # combine title and text
        df['requester_received_pizza'] = np.where(df['requester_received_pizza'] == True, 1, 0)
            # Get vectorized fields
        vec = TfidfVectorizer(stop_words='english',sublinear_tf=1, max_features=max_feature_length)
        text_features = df['request_text_n_title'].values.copy()
        vectorized_matrix = vec.fit_transform(text_features)
        target = df['requester_received_pizza']
        return vec, vectorized_matrix, engineered_features, target
    else:

        text_features = df['request_text_n_title'].values.copy()
        request_id = df['request_id']
        return request_id,text_features, engineered_features

In [29]:
# process train.json get vectorizer, train_tfid matrix, engineered features, and target
vec, train_tfid_matrix, train_engr_features,train_target = process('train.json',best_max_feature)

# Model 2 on combined tfid_matrix and engr features
combined_features = np.append(train_engr_features.values, train_tfid_matrix.toarray(), axis = 1)

lr2 = LogisticRegression(n_jobs=-1,class_weight='balanced',penalty = 'l2').fit(combined_features, train_target)

# get test
request_id,test_text_features, test_engr_features = process('test.json',best_max_feature,train=False)

test_tfid_matrix = vec.transform(test_text_features)

combined_test_features = np.append(test_engr_features.values,test_tfid_matrix.toarray(),axis=1)

#  predict

predict2 = lr2.predict(combined_test_features)[:,np.newaxis]

# print type(predictions_test)
sub_2 = np.append(request_id.values[:, np.newaxis], predict2, axis = 1)
sub_2_df = pd.DataFrame(data=sub_2, columns=['request_id', 'requester_received_pizza'])  # 1st row as the column names
sub_2_df.to_csv("submission_2.csv", sep=',', header=True,  mode='w', index=0)


"""We achieved AUC of 0.60278 with this submission"""

print "Kaggle submission 2, result AUC  = 0.60278"


Kaggle submission 2, result AUC  = 0.60278

Conclusions

We chose logistic regression as our final classifier. The simple implementation of the classifier with our tuned max number of TDIDFvectorizer features resulted in AUC of 59%. We engineered some additional features from the request text and improved the AUC score by 1%.