In [15]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15,5)
In [16]:
NYT_train_raw = pd.read_csv("NYTimesBlogTrain.csv")
NYT_test_raw = pd.read_csv("NYTimesBlogTest.csv")
Join the train and test data for preprocessing:
In [17]:
print('Max train ID: %d. Max test ID: %d' % (np.max(NYT_train_raw['UniqueID']), np.max(NYT_test_raw['UniqueID'])))
# The outer merge effectively concatenates the two sets, since their UniqueID ranges don't overlap
joined = NYT_train_raw.merge(NYT_test_raw, how='outer')
Create additional features:
In [18]:
# Flag headlines containing a question mark or an exclamation mark
joined['QorE'] = joined['Headline'].str.contains(r'\!|\?').astype(int)
# Flag "Q. and A." headlines
joined['Q&A'] = joined['Headline'].str.contains(r'Q\. and A\.').astype(int)
Convert "PubDate" into two columns: Weekday and Hour:
In [19]:
joined['PubDate'] = pd.to_datetime(joined['PubDate'])
joined['Weekday'] = joined['PubDate'].dt.weekday
joined['Hour'] = joined['PubDate'].dt.hour
In [20]:
print("At the moment, we have %d entries with NewsDesk=Nan." % len(joined.loc[joined['NewsDesk'].isnull()]))
Below are the results of one day of searching for meaningful patterns in the data. There are a few easily identifiable groups of posts, most of which lead to zero popularity; they are relabelled in the cells below.
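One quick way to check which groups really are unpopular is to look at the share of popular posts per NewsDesk on the labelled training rows, for instance (a small sketch, not part of the original notebook):
In [ ]:
# Sketch: share of popular posts and group size per NewsDesk, training rows only
# ('Popular' is NaN for the test rows that came in through the outer merge)
joined.loc[joined['Popular'].notnull()].groupby('NewsDesk')['Popular'].agg(['mean', 'count'])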
As ask788 pointed out in this thread, the problem often lies in the structure of the data rather than in the models we apply to it. Ideally this feature engineering would have been done automatically, but I am a novice and had to tediously plod through the rows of data manually.
You can browse individual features that I selected by printing the head() of a subset, like so:
In [21]:
joined.loc[(joined['NewsDesk'] == 'Foreign') & (joined['SectionName'].isnull())].head()
Out[21]:
In [22]:
# Relabel a few easily identifiable groups:
# Styles posts without a section are reassigned to TStyle
joined.loc[(joined['NewsDesk'] == 'Styles') & (joined['SectionName'].isnull()), 'NewsDesk'] = 'TStyle'
# Foreign posts without a section, and headlines starting with a four-digit year (1xxx), are treated as historical pieces
joined.loc[(joined['NewsDesk'] == 'Foreign') & (joined['SectionName'].isnull()), 'NewsDesk'] = 'History'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'].str.contains(r'^1[0-9]{3}')), 'NewsDesk'] = 'History'
# Recurring daily rubrics: clip report, politics digest, reading list, first draft, education, pictures of the day
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'] == 'Daily Clip Report'), 'NewsDesk'] = 'Daily Rubric'
joined.loc[joined['NewsDesk'] == 'Daily Rubric', 'SectionName'] = 'Clip Report'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'] == 'Today in Politics'), 'SectionName'] = 'Today in Politics'
joined.loc[joined['SectionName'] == 'Today in Politics', 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'].str.contains(r'what we\'re reading', case=False)), 'SectionName'] = 'What we\'re reading'
joined.loc[joined['SectionName'] == 'What we\'re reading', 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'].str.contains(r'first draft', case=False)), 'SectionName'] = 'First draft'
joined.loc[joined['SectionName'] == 'First draft', 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['SubsectionName'] == 'Education'), 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['Headline'].str.contains('pictures of the day|week in pictures', case=False)), 'NewsDesk'] = 'Daily Rubric'
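To see how large each of these groups is after the reassignment, you can simply count the desks (a quick check, not part of the original notebook):
In [ ]:
# Quick check: how many posts each NewsDesk now holds (NaN = still unassigned)
joined['NewsDesk'].value_counts(dropna=False)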
Filling the gaps in NewsDesk, SectionName and SubsectionName.
In [23]:
section_to_newsdesk = {'Business Day': 'Business', 'Crosswords/Games': 'Business', 'Technology': 'Business',
'Arts': 'Culture',
'World': 'Foreign',
'Magazine': 'Magazine',
'N.Y. / Region': 'Metro',
'Opinion': 'OpEd',
'Travel': 'Travel',
'Multimedia': 'Multimedia',
'Open': 'Open'}
section_to_subsection = {'Crosswords/Games': 'Crosswords/Games',
'Technology': 'Technology'}
newsdesk_to_section = {'TStyle': 'TStyle',
'Culture': 'Arts',
'OpEd': 'Opinion',
'History': 'History'}
newsdesk_to_subsection = {'TStyle': 'TStyle',
'Culture': 'Arts',
'Daily Rubric': 'Rubric',
'Magazine': 'Magazine',
'Metro': 'Metro',
'Multimedia': 'Multimedia',
'OpEd': 'OpEd',
'Science': 'Science',
'Sports': 'Sports',
'Styles': 'Styles',
'Travel': 'Travel',
'History': 'History'}
# Fill missing NewsDesk/SubsectionName from SectionName, and missing
# SectionName/SubsectionName from NewsDesk, using the mappings above.
for sec in set(joined['SectionName']):
    if sec in section_to_newsdesk:
        joined['NewsDesk'].fillna(joined.loc[joined['SectionName'] == sec, 'NewsDesk'].fillna(section_to_newsdesk[sec]), inplace=True)
    if sec in section_to_subsection:
        joined['SubsectionName'].fillna(joined.loc[joined['SectionName'] == sec, 'SubsectionName'].fillna(section_to_subsection[sec]), inplace=True)
for nd in set(joined['NewsDesk']):
    if nd in newsdesk_to_section:
        joined['SectionName'].fillna(joined.loc[joined['NewsDesk'] == nd, 'SectionName'].fillna(newsdesk_to_section[nd]), inplace=True)
    if nd in newsdesk_to_subsection:
        joined['SubsectionName'].fillna(joined.loc[joined['NewsDesk'] == nd, 'SubsectionName'].fillna(newsdesk_to_subsection[nd]), inplace=True)
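The fillna(Series) pattern above works thanks to index alignment: only the rows picked out by the section/desk mask receive the mapped value, everything else keeps its NaN. A toy illustration of the same pattern (hypothetical data, not the NYT set):
In [ ]:
# Toy illustration of index-aligned fillna (hypothetical data)
# (mirrors the chained-fillna pattern above; with pandas copy-on-write you would assign back instead)
demo = pd.DataFrame({'desk': [None, 'OpEd', None],
                     'section': ['Arts', 'Opinion', 'World']})
# Fill the missing desk only where section == 'Arts'; the 'World' row keeps its NaN
demo['desk'].fillna(demo.loc[demo['section'] == 'Arts', 'desk'].fillna('Culture'), inplace=True)
demo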
Filling even more gaps with some clustering. I built a TF-IDF matrix from the abstracts of the remaining unlabelled entries and ran Ward (agglomerative) clustering on it. Four of the six clusters looked meaningful and fitted well into the existing NewsDesk/S(ubs)ectionName categories.
In [24]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
# Rows that still have no NewsDesk: build TF-IDF features from their abstracts
nans = joined.loc[joined['NewsDesk'].isnull()]
words = nans['Abstract'].astype(str).tolist()
tfv = TfidfVectorizer(min_df=0.005, max_features=None,
strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1,
stop_words = 'english')
X_tr = tfv.fit_transform(words)
ward = AgglomerativeClustering(n_clusters=6,
linkage='ward').fit(X_tr.toarray())
joined.loc[joined['NewsDesk'].isnull(), 'cluster'] = ward.labels_
cluster_to = {}
cluster_to['NewsDesk'] = {4: 'Metro', 3: 'National', 2: 'Foreign', 1: 'National'}
cluster_to['SectionName'] = {4: 'N.Y. / Region', 3: 'U.S.', 2: 'Not_Asia', 1: 'U.S.'}
cluster_to['SubsectionName'] = {4: 'NYT', 3: 'Politics', 2: 'Not_Asia', 1: 'Politics'}
# The cluster labels were only assigned to rows with a missing NewsDesk,
# so selecting on the cluster label alone picks out exactly those rows
for col in cluster_to:
    for label in cluster_to[col]:
        joined.loc[joined['cluster'] == label, col] = cluster_to[col][label]
You can see what these clusters look like by typing:
In [26]:
joined.loc[joined['cluster'] == 3].head()
Out[26]:
Finally, use a few (6) obvious keywords to categorise the data even more. After this, we are left with 950 entries where NewsDesk, SectionName and SubsectionName are NaN; I had no good idea of how to deal with those.
In [ ]:
joined.drop('cluster', axis=1, inplace=True)
In [29]:
keywords = {}
keywords['clinton|white house|obama'] = {'NewsDesk': 'National', 'SectionName': 'U.S.', 'SubsectionName': 'Politics'}
keywords['isis|iraq'] = {'NewsDesk': 'Foreign', 'SectionName': 'Not_Asia', 'SubsectionName': 'Not_Asia'}
keywords['york'] = {'NewsDesk': 'Metro', 'SectionName': 'N.Y. / Region', 'SubsectionName': 'N.Y. / Region'}
# For rows still missing a NewsDesk, assign desk/section/subsection when the abstract matches a keyword pattern
for pattern in keywords:
    indices = (joined['NewsDesk'].isnull()) & (joined['Abstract'].str.contains(pattern, case=False))
    for col in keywords[pattern]:
        joined.loc[indices, col] = keywords[pattern][col]
In [30]:
print("Now we have %d entries with NewsDesk=Nan." % len(joined.loc[joined['NewsDesk'].isnull()]))
In [31]:
from sklearn.feature_extraction import DictVectorizer
def categorizeDF(df):
    '''One-hot encode NewsDesk, SectionName and SubsectionName with DictVectorizer.'''
    old_columns = df.columns
    cat_cols = ['NewsDesk', 'SectionName', 'SubsectionName']
    temp_dict = df[cat_cols].to_dict(orient="records")
    vec = DictVectorizer()
    vec_arr = vec.fit_transform(temp_dict).toarray()
    # convert_objects is deprecated in newer pandas; pd.to_numeric would be the modern replacement
    new_df = pd.DataFrame(vec_arr).convert_objects(convert_numeric=True)
    new_df.index = df.index
    new_df.columns = vec.get_feature_names()
    columns_to_add = [col for col in old_columns if col not in cat_cols]
    new_df[columns_to_add] = df[columns_to_add]
    # Rows whose categories were still NaN produce numeric columns named after the bare
    # column ('NewsDesk', 'SectionName', 'SubsectionName'); drop those
    new_df.drop(cat_cols, inplace=True, axis=1)
    return new_df
joined_cat = categorizeDF(joined)
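DictVectorizer names the encoded columns '&lt;column&gt;=&lt;category&gt;' (e.g. 'NewsDesk=Business'), so you can inspect the result directly, for instance:
In [ ]:
# Sketch: shape of the encoded frame and the first few one-hot column names
print(joined_cat.shape)
joined_cat.columns[:10]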
In [32]:
# UniqueIDs up to 6532 belong to the training set, larger IDs to the test set (see the ID check above)
train = joined_cat[joined_cat['UniqueID'] <= 6532]
test = joined_cat[joined_cat['UniqueID'] > 6532]
In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
# Use every engineered column except the raw text, date, ID and target fields (and the Q&A flag)
Xcols = [c for c in train.columns if c not in ('Headline', 'Snippet', 'Abstract', 'PubDate', 'UniqueID', 'Popular', 'Q&A')]
y = train['Popular']
forest = RandomForestClassifier(n_estimators=7000, max_features=0.1, min_samples_split=24, random_state=33, n_jobs=3)
forest.fit(train[Xcols], y)
probsRF = forest.predict_proba(test[Xcols])[:,1]
print("10 Fold CV Score: ", np.mean(cross_validation.cross_val_score(forest, train[Xcols], y, cv=10, scoring='roc_auc')))
In [35]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import cross_validation
Xcols = [c for c in train.columns if c not in ('Headline', 'Snippet', 'Abstract', 'PubDate', 'UniqueID', 'Popular', 'Q&A')]
y = train['Popular']
est = GradientBoostingClassifier(n_estimators=3000,
learning_rate=0.005,
max_depth=4,
max_features=0.3,
min_samples_leaf=9,
random_state=33)
est.fit(train[Xcols], y)
probsGBC = est.predict_proba(test[Xcols])[:,1]
print("10 Fold CV Score: ", np.mean(cross_validation.cross_val_score(est, train[Xcols], y, cv=10, scoring='roc_auc')))
Define a function for cross-validating and plotting an ensemble of my two models:
In [50]:
from sklearn import cross_validation
from sklearn.metrics import roc_auc_score
def calculate_ensemble_score(model1, model2, Xcols, ycol, dataset, cv=10):
    '''Calculates the score for various weights of two models in an ensemble'''
    num_points = 21
    score_arr = np.zeros((cv, num_points))
    kf = cross_validation.KFold(len(dataset), cv, shuffle=True)
    for i, (train_idx, test_idx) in enumerate(kf):
        # KFold yields positional indices, so use iloc rather than label-based indexing
        fold_train, fold_test = dataset.iloc[train_idx], dataset.iloc[test_idx]
        model1.fit(fold_train[Xcols], fold_train[ycol])
        probs1 = model1.predict_proba(fold_test[Xcols])[:, 1]
        model2.fit(fold_train[Xcols], fold_train[ycol])
        probs2 = model2.predict_proba(fold_test[Xcols])[:, 1]
        for wg in range(num_points):
            w = wg / float(num_points - 1)
            probs = w * probs1 + (1 - w) * probs2
            score_arr[i][wg] = roc_auc_score(fold_test[ycol], probs)
    return np.mean(score_arr, axis=0)

def plot_ensemble_score(scores):
    import seaborn as sbs  # imported for its nicer default plot styling
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.axhline(y=scores[0], linestyle='--', color='red')
    ax.axhline(y=scores[-1], linestyle='--', color='green')
    ax.text(0.03, scores[0] + 0.00001, "Pure GBM", verticalalignment='bottom', horizontalalignment='left', color='red', size='larger')
    ax.text(0.98, scores[-1] + 0.00001, "Pure RF", verticalalignment='bottom', horizontalalignment='right', color='green', size='larger')
    ax.plot(np.linspace(0, 1, len(scores)), scores)
    ax.set_xlabel("RF model weight")
    ax.set_ylabel("AUC")
    ax.set_title("Choosing the weights for two models in an ensemble (10-fold cross-validation)")
    return fig
In [37]:
means = calculate_ensemble_score(forest, est, Xcols, 'Popular', train)
In [51]:
myplot = plot_ensemble_score(means)
myplot.savefig("AUC.png", dpi=300)
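The final submission below blends the two models with fixed weights: 0.6 for the gradient boosting probabilities and 0.4 for the random forest. You could also read the best RF weight straight off the cross-validated means (a small sketch, not part of the original notebook):
In [ ]:
# Sketch: locate the RF weight with the highest mean CV AUC
weights = np.linspace(0, 1, len(means))
print("Best RF weight: %.2f (mean CV AUC %.5f)" % (weights[np.argmax(means)], means.max()))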
In [54]:
# Final blend: 60% gradient boosting, 40% random forest
test['Popular'] = 0.6*probsGBC + 0.4*probsRF
test['UniqueID'] = test['UniqueID'].astype(int)
test.to_csv('preds.csv', columns=['UniqueID', 'Popular'], header=['UniqueID', 'Probability1'], index=False)
Kaggle: 0.93613.