Activities

  1. We used entropy as the decision factor (the node impurity measure). Test the same random dataset with the Gini measure and compare the results (a comparison sketch follows the decision tree run below). Ref1.: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier Ref2.: https://en.wikipedia.org/wiki/Decision_tree_learning

  2. Balance the data contained in "train.csv", apply the Decision Tree algorithm, and submit to Kaggle. Try to improve on the result obtained in class (position 3100 on the leaderboard); a submission sketch appears near the end. Dataset: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction

  3. (Optional) Run a Random Forest on the Kaggle competition and see whether the accuracy improves. Use 10, 100, or 1000 trees, depending on how much your computer can handle =] (a sketch sweeping these counts follows the random forest run): http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

#To install the imblearn package (newer environments can also use: pip install imbalanced-learn)
#!conda install -c glemaitre imbalanced-learn
from imblearn.over_sampling import SMOTE

from collections import Counter

In [10]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [11]:
#Features start at column 2 (columns 0 and 1 are id and target)
x = train_data.iloc[:,2:]
y = train_data.iloc[:,1]
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3)
class_count = y_train.value_counts()
class_count


Out[11]:
0    401631
1     15017
Name: target, dtype: int64
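
The counts above show the imbalance that motivates the resampling below; a one-line check of the ratio:

#Roughly 27 negatives per positive (~3.6% positives)
print("Imbalance ratio:", class_count[0] / class_count[1])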

In [12]:
#Balancing the data by removing rows of class 0 (undersampling attempt, kept for reference)
#train = x_train.copy()
#train = train.join(y_train.copy())
#train_df = pd.DataFrame(data=train)
#train_class0 = train[train.target == 0]
#train_class1 = train[train.target == 1]
#np.random.seed(10)
#remove_n = class_count[0] - (3*class_count[1])
#drop_indices = np.random.choice(train_class0.index, remove_n, replace=False)
#train_class0 = train_class0.drop(drop_indices)

#Since removing data did not improve results, oversample the minority class
#with synthetic samples instead (SMOTE)
print('Original dataset shape {}'.format(Counter(y_train)))
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_sample(x_train, y_train)  #newer imblearn releases renamed this to fit_resample
print('Resampled dataset shape {}'.format(Counter(y_res)))


Original dataset shape Counter({0: 401631, 1: 15017})
Resampled dataset shape Counter({0: 401631, 1: 401631})

In [13]:
#Training a decision tree on the resampled data
d_tree = DecisionTreeClassifier(criterion='gini')
d_tree.fit(X_res, y_res)
y_hat = d_tree.predict(x_test)

print("Score:",d_tree.score(x_test, y_test),"\n")


Score: 0.917519768822 
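
For Activity 1, a minimal sketch comparing the two impurity criteria on the same resampled data (results will vary with the random split; fixing random_state makes each tree reproducible):

#Compare the two impurity criteria on the same data (Activity 1)
for criterion in ('gini', 'entropy'):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree.fit(X_res, y_res)
    print(criterion, "accuracy:", tree.score(x_test, y_test))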


In [14]:
#Sanity check: inspect the confusion matrix to see per-class behaviour
CM = confusion_matrix(y_test, y_hat)

TN = CM[0][0]
FN = CM[1][0]
TP = CM[1][1]
FP = CM[0][1]

print("False Positives: ", FP, "\nFalse Negatives: ", FN)


False Positives:  8501 
False Negatives:  6227
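
With only ~3.7% positives in the test split, accuracy alone is misleading; a per-class breakdown (a sketch using scikit-learn's classification_report) makes the class 1 recall explicit:

from sklearn.metrics import classification_report

#Precision and recall per class; recall on class 1 is the number to watch
print(classification_report(y_test, y_hat, digits=3))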

Random Forest


In [20]:
from sklearn.ensemble import RandomForestClassifier

#Only 5 trees here; Activity 3 suggests trying 10, 100, or 1000 (see the sweep below)
rfc = RandomForestClassifier(n_estimators=5, random_state=0, criterion='entropy')
rfc.fit(X_res, y_res)
y_hatRfc = rfc.predict(x_test)

print("Score:",rfc.score(x_test, y_test),"\n")


Score: 0.961033578997 


In [21]:
CMRfc = confusion_matrix(y_test, y_hatRfc)

TN = CMRfc[0][0]
FN = CMRfc[1][0]
TP = CMRfc[1][1]
FP = CMRfc[0][1]

print("False Positives: ", FP, "\nFalse Negatives: ", FN)


False Positives:  311 
False Negatives:  6647
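
For Activity 3 proper, a sketch sweeping the suggested tree counts (1000 trees on ~800k resampled rows will be slow; n_jobs=-1 uses all available cores):

#Try increasing numbers of trees, as Activity 3 suggests
for n in (10, 100, 1000):
    forest = RandomForestClassifier(n_estimators=n, criterion='entropy',
                                    random_state=0, n_jobs=-1)
    forest.fit(X_res, y_res)
    print(n, "trees, accuracy:", forest.score(x_test, y_test))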

In [19]:
y_test.value_counts()


Out[19]:
0    171887
1      6677
Name: target, dtype: int64
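
For the Activity 2 submission, a minimal sketch, assuming test.csv shares the train layout minus the target column and that the competition expects an id/target file; Porto Seguro is scored on the normalized Gini coefficient (2*AUC - 1 for a binary target), so submitted probabilities are what matter:

from sklearn.metrics import roc_auc_score

#Local estimate of the competition metric: normalized Gini = 2*AUC - 1
auc = roc_auc_score(y_test, d_tree.predict_proba(x_test)[:, 1])
print("Normalized Gini (local):", 2*auc - 1)

#Build the submission from predicted probabilities of class 1
x_submit = test_data.iloc[:, 1:]  #assumes test.csv = id column + features
submission = pd.DataFrame({'id': test_data['id'],
                           'target': d_tree.predict_proba(x_submit)[:, 1]})
submission.to_csv('submission.csv', index=False)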

In [ ]:
#I could not get these models to properly classify class 1 :(
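
One hedged alternative to resampling for the class 1 problem: keep the original training data and reweight the classes instead (class_weight='balanced' is a standard scikit-learn option; whether it actually helps here is untested):

from sklearn.metrics import classification_report

#Weight classes inversely to their frequencies instead of oversampling
weighted_tree = DecisionTreeClassifier(class_weight='balanced', random_state=0)
weighted_tree.fit(x_train, y_train)
print(classification_report(y_test, weighted_tree.predict(x_test), digits=3))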