Exercise 9

Mashable news stories analysis

Predicting if a news story is going to be popular


In [27]:
import pandas as pd

url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/mashable.csv'
train_df = pd.read_csv(url, index_col=0)
train_df.head()


Out[27]:
url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs ... min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity Popular
0 http://mashable.com/2014/12/10/cia-torture-rep... 28.0 9.0 188.0 0.732620 1.0 0.844262 5.0 1.0 1.0 ... 0.200000 0.80 -0.487500 -0.60 -0.250000 0.9 0.8 0.4 0.8 1
1 http://mashable.com/2013/10/18/bitlock-kicksta... 447.0 7.0 297.0 0.653199 1.0 0.815789 9.0 4.0 1.0 ... 0.160000 0.50 -0.135340 -0.40 -0.050000 0.1 -0.1 0.4 0.1 0
2 http://mashable.com/2013/07/24/google-glass-po... 533.0 11.0 181.0 0.660377 1.0 0.775701 4.0 3.0 1.0 ... 0.136364 1.00 0.000000 0.00 0.000000 0.3 1.0 0.2 1.0 0
3 http://mashable.com/2013/11/21/these-are-the-m... 413.0 12.0 781.0 0.497409 1.0 0.677350 10.0 3.0 1.0 ... 0.100000 1.00 -0.195701 -0.40 -0.071429 0.0 0.0 0.5 0.0 0
4 http://mashable.com/2014/02/11/parking-ticket-... 331.0 8.0 177.0 0.685714 1.0 0.830357 3.0 2.0 1.0 ... 0.100000 0.55 -0.175000 -0.25 -0.100000 0.0 0.0 0.5 0.0 0

5 rows × 61 columns


In [28]:
train_df.shape


Out[28]:
(6000, 61)

In [29]:
X = train_df.drop(['url', 'Popular'], axis=1)
y = train_df['Popular']

In [30]:
y.mean()


Out[30]:
0.5

In [32]:
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [ ]:

Exercise 9.1

Estimate a Decision Tree Classifier and a Logistic Regression

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
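A minimal sketch of this step, using synthetic data from `make_classification` as a stand-in for the Mashable features (in the notebook itself, the `X_train`, `X_test`, `y_train`, `y_test` from the split above would be used instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the Mashable features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit both classifiers and collect Accuracy and F1-Score on the test set.
results = {}
for model in (DecisionTreeClassifier(random_state=1),
              LogisticRegression(max_iter=1000)):
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[type(model).__name__] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
    }
print(results)
```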

In [ ]:

Exercise 9.2

Generate 300 bootstrap (bagged) samples

Estimate the following set of classifiers:

  • 100 Decision Trees where max_depth=None
  • 100 Decision Trees where max_depth=2
  • 100 Logistic Regressions
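One way to sketch manual bagging, again on synthetic stand-in data: draw 300 bootstrap samples of training-row indices, then fit each of the 300 classifiers (100 unpruned trees, 100 depth-2 trees, 100 logistic regressions) on its own sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Mashable training data.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

n_train = X_train.shape[0]
rng = np.random.RandomState(1)

# 300 bootstrap samples: each entry holds one sample's row indices,
# drawn with replacement from the training set.
samples = [rng.choice(n_train, size=n_train, replace=True) for _ in range(300)]

# 100 unpruned trees, 100 shallow trees, 100 logistic regressions.
models = ([DecisionTreeClassifier(max_depth=None, random_state=1) for _ in range(100)]
          + [DecisionTreeClassifier(max_depth=2, random_state=1) for _ in range(100)]
          + [LogisticRegression(max_iter=1000) for _ in range(100)])

# Fit each classifier on its own bootstrap sample.
for model, idx in zip(models, samples):
    model.fit(X_train[idx], y_train[idx])

print(len(models), 'classifiers fitted on', len(samples), 'bootstrap samples')
```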

In [ ]:

Exercise 9.3

Ensemble using majority voting

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
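A compact sketch of majority voting, using a small all-tree ensemble on synthetic data for brevity (the same idea applies to the full 300-model ensemble): stack every model's predictions and take the majority class per test row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in; 25 bagged trees keep the sketch short.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models = []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Majority vote: the predicted class is 1 when at least half the models vote 1.
preds = np.array([m.predict(X_test) for m in models])   # shape (n_models, n_test)
y_vote = (preds.mean(axis=0) >= 0.5).astype(int)

print('accuracy=%.3f f1=%.3f'
      % (accuracy_score(y_test, y_vote), f1_score(y_test, y_vote)))
```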

In [ ]:

Exercise 9.4

Estimate the probability as the percentage of models that predict the positive class

Modify the probability threshold and select the one that maximizes the F1-Score
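A sketch of this idea on synthetic stand-in data: the ensemble probability for a row is the fraction of models voting positive, and a grid of thresholds is swept to pick the one with the best F1-Score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models = []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

preds = np.array([m.predict(X_test) for m in models])

# Probability estimate: fraction of models voting positive.
y_prob = preds.mean(axis=0)

# Sweep thresholds and keep the one that maximizes the F1-Score.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('best threshold=%.2f f1=%.3f' % (best_t, max(f1s)))
```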


In [ ]:

Exercise 9.5

Ensemble using weighted voting, with weights based on the out-of-bag (OOB) error

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
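One way to sketch OOB-weighted voting on synthetic stand-in data: score each model on the training rows left out of its bootstrap sample, use that OOB accuracy as its weight (normalized to sum to 1), and threshold the weighted vote at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models, samples = [], []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    samples.append(idx)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Weight each model by its out-of-bag accuracy: score it on the training
# rows it never saw, then normalize the weights to sum to 1.
weights = []
for model, idx in zip(models, samples):
    oob = np.setdiff1d(np.arange(n_train), idx)
    weights.append(accuracy_score(y_train[oob], model.predict(X_train[oob])))
weights = np.array(weights)
weights = weights / weights.sum()

# Weighted vote: each model's prediction counts in proportion to its weight.
preds = np.array([m.predict(X_test) for m in models])
y_weighted = (weights @ preds >= 0.5).astype(int)

print('accuracy=%.3f f1=%.3f'
      % (accuracy_score(y_test, y_weighted), f1_score(y_test, y_weighted)))
```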

In [ ]:

Exercise 9.6

Estimate the probability of the weighted voting

Modify the probability threshold and select the one that maximizes the F1-Score
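This step combines the two previous ideas; a sketch on synthetic stand-in data: the weighted vote itself (before thresholding at 0.5) is the probability estimate, and the threshold is again chosen to maximize the F1-Score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models, samples = [], []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    samples.append(idx)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Normalized OOB-accuracy weights, as in the previous exercise.
weights = []
for model, idx in zip(models, samples):
    oob = np.setdiff1d(np.arange(n_train), idx)
    weights.append(accuracy_score(y_train[oob], model.predict(X_train[oob])))
weights = np.array(weights)
weights = weights / weights.sum()

# Weighted probability: weighted fraction of models voting positive.
preds = np.array([m.predict(X_test) for m in models])
y_prob = weights @ preds

# Sweep thresholds and keep the one that maximizes the F1-Score.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('best threshold=%.2f f1=%.3f' % (best_t, max(f1s)))
```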


In [ ]:

Exercise 9.7

Estimate a logistic regression using the estimated classifiers' predictions as input (stacking)

Modify the probability threshold and select the one that maximizes the F1-Score
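A stacking sketch on synthetic stand-in data: each base classifier's predictions become one meta-feature column, a second-level logistic regression is fitted on those columns, and its predicted probability is thresholded for the best F1-Score. (For a cleaner estimate the meta-features would come from out-of-bag or cross-validated predictions rather than in-sample ones.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models = []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Meta-features: one column per base classifier.
Z_train = np.array([m.predict(X_train) for m in models]).T   # (n_train, n_models)
Z_test = np.array([m.predict(X_test) for m in models]).T

# Second-level logistic regression on the base predictions.
meta = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
y_prob = meta.predict_proba(Z_test)[:, 1]

# Sweep thresholds and keep the one that maximizes the F1-Score.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('best threshold=%.2f f1=%.3f' % (best_t, max(f1s)))
```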


In [ ]: