Exercise 9

Mashable news stories analysis

Predicting if a news story is going to be popular


In [27]:
import pandas as pd

url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/mashable.csv'
train_df = pd.read_csv(url, index_col=0)
train_df.head()


Out[27]:
url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs ... min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity Popular
0 http://mashable.com/2014/12/10/cia-torture-rep... 28.0 9.0 188.0 0.732620 1.0 0.844262 5.0 1.0 1.0 ... 0.200000 0.80 -0.487500 -0.60 -0.250000 0.9 0.8 0.4 0.8 1
1 http://mashable.com/2013/10/18/bitlock-kicksta... 447.0 7.0 297.0 0.653199 1.0 0.815789 9.0 4.0 1.0 ... 0.160000 0.50 -0.135340 -0.40 -0.050000 0.1 -0.1 0.4 0.1 0
2 http://mashable.com/2013/07/24/google-glass-po... 533.0 11.0 181.0 0.660377 1.0 0.775701 4.0 3.0 1.0 ... 0.136364 1.00 0.000000 0.00 0.000000 0.3 1.0 0.2 1.0 0
3 http://mashable.com/2013/11/21/these-are-the-m... 413.0 12.0 781.0 0.497409 1.0 0.677350 10.0 3.0 1.0 ... 0.100000 1.00 -0.195701 -0.40 -0.071429 0.0 0.0 0.5 0.0 0
4 http://mashable.com/2014/02/11/parking-ticket-... 331.0 8.0 177.0 0.685714 1.0 0.830357 3.0 2.0 1.0 ... 0.100000 0.55 -0.175000 -0.25 -0.100000 0.0 0.0 0.5 0.0 0

5 rows × 61 columns


In [28]:
train_df.shape


Out[28]:
(6000, 61)

In [29]:
X = train_df.drop(['url', 'Popular'], axis=1)
y = train_df['Popular']

In [30]:
y.mean()


Out[30]:
0.5

In [32]:
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [ ]:

Exercise 9.1

Estimate a Decision Tree Classifier and a Logistic Regression

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
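A minimal sketch of this step, using synthetic data from `make_classification` as a stand-in for the Mashable features (in the notebook itself, the `X_train`, `X_test`, `y_train`, `y_test` from the split above would be used instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the Mashable features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit both classifiers and collect Accuracy and F1-Score on the test set.
results = {}
for model in (DecisionTreeClassifier(random_state=1),
              LogisticRegression(max_iter=1000)):
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[type(model).__name__] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
    }
print(results)
```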

In [ ]:

Exercise 9.2

Generate 300 bootstrap (bagged) samples

Estimate the following set of classifiers:

  • 100 Decision Trees where max_depth=None
  • 100 Decision Trees where max_depth=2
  • 100 Logistic Regressions
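One way to sketch manual bagging, again on synthetic stand-in data: draw 300 bootstrap samples of training-row indices, then fit each of the 300 classifiers (100 unpruned trees, 100 depth-2 trees, 100 logistic regressions) on its own sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Mashable training data.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

n_train = X_train.shape[0]
rng = np.random.RandomState(1)

# 300 bootstrap samples: each entry holds one sample's row indices,
# drawn with replacement from the training set.
samples = [rng.choice(n_train, size=n_train, replace=True) for _ in range(300)]

# 100 unpruned trees, 100 shallow trees, 100 logistic regressions.
models = ([DecisionTreeClassifier(max_depth=None, random_state=1) for _ in range(100)]
          + [DecisionTreeClassifier(max_depth=2, random_state=1) for _ in range(100)]
          + [LogisticRegression(max_iter=1000) for _ in range(100)])

# Fit each classifier on its own bootstrap sample.
for model, idx in zip(models, samples):
    model.fit(X_train[idx], y_train[idx])

print(len(models), 'classifiers fitted on', len(samples), 'bootstrap samples')
```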

In [ ]:

Exercise 9.3

Ensemble using majority voting

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
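A compact sketch of majority voting, using a small all-tree ensemble on synthetic data for brevity (the same idea applies to the full 300-model ensemble): stack every model's predictions and take the majority class per test row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in; 25 bagged trees keep the sketch short.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models = []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Majority vote: the predicted class is 1 when at least half the models vote 1.
preds = np.array([m.predict(X_test) for m in models])   # shape (n_models, n_test)
y_vote = (preds.mean(axis=0) >= 0.5).astype(int)

print('accuracy=%.3f f1=%.3f'
      % (accuracy_score(y_test, y_vote), f1_score(y_test, y_vote)))
```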

In [ ]:

Exercise 9.4

Estimate the probability as the percentage of models that predict the positive class

Modify the probability threshold and select the one that maximizes the F1-Score
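A sketch of this idea on synthetic stand-in data: the ensemble probability for a row is the fraction of models voting positive, and a grid of thresholds is swept to pick the one with the best F1-Score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models = []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

preds = np.array([m.predict(X_test) for m in models])

# Probability estimate: fraction of models voting positive.
y_prob = preds.mean(axis=0)

# Sweep thresholds and keep the one that maximizes the F1-Score.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('best threshold=%.2f f1=%.3f' % (best_t, max(f1s)))
```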


In [ ]:

Exercise 9.5

Ensemble using weighted voting, with weights based on the out-of-bag (OOB) error

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
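One way to sketch OOB-weighted voting on synthetic stand-in data: score each model on the training rows left out of its bootstrap sample, use that OOB accuracy as its weight (normalized to sum to 1), and threshold the weighted vote at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models, samples = [], []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    samples.append(idx)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Weight each model by its out-of-bag accuracy: score it on the training
# rows it never saw, then normalize the weights to sum to 1.
weights = []
for model, idx in zip(models, samples):
    oob = np.setdiff1d(np.arange(n_train), idx)
    weights.append(accuracy_score(y_train[oob], model.predict(X_train[oob])))
weights = np.array(weights)
weights = weights / weights.sum()

# Weighted vote: each model's prediction counts in proportion to its weight.
preds = np.array([m.predict(X_test) for m in models])
y_weighted = (weights @ preds >= 0.5).astype(int)

print('accuracy=%.3f f1=%.3f'
      % (accuracy_score(y_test, y_weighted), f1_score(y_test, y_weighted)))
```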

In [ ]:

Exercise 9.6

Estimate the probability of the weighted voting

Modify the probability threshold and select the one that maximizes the F1-Score
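This step combines the two previous ideas; a sketch on synthetic stand-in data: the weighted vote itself (before thresholding at 0.5) is the probability estimate, and the threshold is again chosen to maximize the F1-Score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models, samples = [], []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    samples.append(idx)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Normalized OOB-accuracy weights, as in the previous exercise.
weights = []
for model, idx in zip(models, samples):
    oob = np.setdiff1d(np.arange(n_train), idx)
    weights.append(accuracy_score(y_train[oob], model.predict(X_train[oob])))
weights = np.array(weights)
weights = weights / weights.sum()

# Weighted probability: weighted fraction of models voting positive.
preds = np.array([m.predict(X_test) for m in models])
y_prob = weights @ preds

# Sweep thresholds and keep the one that maximizes the F1-Score.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('best threshold=%.2f f1=%.3f' % (best_t, max(f1s)))
```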


In [ ]:

Exercise 9.7

Estimate a logistic regression using the estimated classifiers' predictions as input (stacking)

Modify the probability threshold and select the one that maximizes the F1-Score
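A stacking sketch on synthetic stand-in data: each base classifier's predictions become one meta-feature column, a second-level logistic regression is fitted on those columns, and its predicted probability is thresholded for the best F1-Score. (For a cleaner estimate the meta-features would come from out-of-bag or cross-validated predictions rather than in-sample ones.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in; a small bagged-tree ensemble for brevity.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_train = X_train.shape[0]

models = []
for _ in range(25):
    idx = rng.choice(n_train, size=n_train, replace=True)
    models.append(DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx]))

# Meta-features: one column per base classifier.
Z_train = np.array([m.predict(X_train) for m in models]).T   # (n_train, n_models)
Z_test = np.array([m.predict(X_test) for m in models]).T

# Second-level logistic regression on the base predictions.
meta = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
y_prob = meta.predict_proba(Z_test)[:, 1]

# Sweep thresholds and keep the one that maximizes the F1-Score.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('best threshold=%.2f f1=%.3f' % (best_t, max(f1s)))
```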


In [ ]: