Homework 05 - Naive Bayes

Isac Lira, 371890

Question 1

Implement a Naive Bayes classifier to predict the quality of a car. For this we will use a car-evaluation dataset, available from the UCI repository. The dataset has the following features and class:

Attributes

  1. buying: vhigh, high, med, low
  2. maint: vhigh, high, med, low
  3. doors: 2, 3, 4, 5, more
  4. persons: 2, 4, more
  5. lug_boot: small, med, big
  6. safety: low, med, high

Classes

  1. unacc, acc, good, vgood

In [394]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [22]:
# Load the data

cols = ['buying','maint','doors','persons','lug_book','safety','class']
carset = pd.read_csv('carData.csv',names=cols)
carset.head()


Out[22]:
buying maint doors persons lug_book safety class
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc

In [23]:
carset.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_book    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB

No missing values. Amen!


In [388]:
# Implement the Naive Bayes classifier

class naive_classifier(object):
    def __init__(self):
        self.priors = {}
        self.lh_probs = {}
        self.train = None
          
    def calc_prior_probs(self):
        train = self.train
        classes = train['class'].unique()
        priors = {}
        for class_ in classes:            
            priors[class_] = len(train[train['class'] == class_])/len(train)            
        self.priors = priors
        
    def calc_likelihood_probs(self):
        # Use the training split (not the global carset) for all estimates
        train = self.train
        columns = train.drop(['class'],axis=1).columns
        classes = train['class'].unique()
        lh_probs = {}
        for column in columns:
            for class_ in classes:
                for cat in train[column].unique():
                    # P(feature = cat) and P(feature = cat | class)
                    cat_prior_prob = sum(train[column]==cat)/len(train)
                    conditional_prob = sum((train['class']==class_) & (train[column]==cat))/sum(train['class']==class_)

                    # Small floor so an unseen (category, class) pair does not
                    # zero out the whole posterior product
                    if not conditional_prob: conditional_prob = 0.001

                    lh_probs[column,cat,class_] = conditional_prob/cat_prior_prob
        self.lh_probs = lh_probs
    
    def fit(self,train):
        self.train = train
        self.calc_prior_probs()
        self.calc_likelihood_probs()
    
    def predict(self,xtest):
        columns = self.train.drop(['class'],axis=1).columns
        classes = self.train['class'].unique()
        predictions = []

        for i in xtest.index:
            posterior_prob = {}
            x = xtest.loc[i]
            for class_ in classes:
                # P(class | x) is proportional to P(class) * prod_i P(x_i | class) / P(x_i)
                posterior_prob[class_] = 1
                for column in columns:
                    posterior_prob[class_] *= self.lh_probs[column,x[column],class_]
                posterior_prob[class_] *= self.priors[class_]
            # Pick the class with the highest (unnormalized) posterior
            predictions.append(max(posterior_prob,key=posterior_prob.get))
        return predictions
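Multiplying many small probabilities as in `predict` can underflow once there are more features; summing log-probabilities is the usual remedy. A minimal self-contained sketch of the same idea on a toy frame (the column and label values below are illustrative, not taken from the homework data):

```python
import numpy as np
import pandas as pd

# Toy training data: one categorical feature, two classes
toy = pd.DataFrame({'safety': ['low', 'low', 'high', 'high', 'high'],
                    'class':  ['unacc', 'unacc', 'acc', 'acc', 'unacc']})

def log_posterior(train, x, class_):
    """log P(class) + sum of log P(feature=value | class), with a small
    probability floor playing the role of the 0.001 constant above."""
    cls_rows = train[train['class'] == class_]
    score = np.log(len(cls_rows) / len(train))   # log prior
    for col in train.columns.drop('class'):
        cond = (cls_rows[col] == x[col]).mean()  # P(value | class)
        score += np.log(max(cond, 1e-3))         # floor avoids log(0)
    return score

x = {'safety': 'high'}
scores = {c: log_posterior(toy, x, c) for c in toy['class'].unique()}
print(max(scores, key=scores.get))  # → 'acc'
```

Since the per-x denominator P(x_i) is constant across classes, it can be dropped entirely in log space without changing the argmax.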

In [389]:
# Instantiate a classifier and run predictions on the test data

msk = np.random.rand(len(carset))<=0.8  # random ~80/20 train/test split
xtrain = carset[msk]
xtest = carset[~msk]

ytrain = xtrain['class']
ytest = xtest['class']

naive = naive_classifier()
naive.fit(xtrain)
pred = naive.predict(xtest)
acc_my_nb = np.mean(pred==ytest)*100
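The `np.mean(pred==ytest)` expression relies on pandas comparing a plain list against the Series elementwise; `sklearn.metrics.accuracy_score` computes the same quantity without that subtlety. A tiny self-contained check with made-up labels:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true vs. predicted labels, 3 of 4 correct
y_true = ['unacc', 'acc', 'acc', 'good']
y_pred = ['unacc', 'acc', 'unacc', 'good']

acc = accuracy_score(y_true, y_pred) * 100
print(acc)  # → 75.0
```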

In [390]:
from sklearn.metrics import classification_report
my_nv_report = classification_report(ytest,pred)

Question 2

Create a version of your implementation using the Naive Bayes functions available in the scikit-learn library (see here)


In [391]:
# Convert the categorical features to numeric

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(xtrain['class'])
ytrain = le.transform(xtrain['class'])
ytest = le.transform(xtest['class'])

xtrain =pd.get_dummies(xtrain.drop(['class'],axis=1))
xtest = pd.get_dummies(xtest.drop(['class'],axis=1))

xtrain.head()


Out[391]:
buying_high buying_low buying_med buying_vhigh maint_high maint_low maint_med maint_vhigh doors_2 doors_3 ... doors_5more persons_2 persons_4 persons_more lug_book_big lug_book_med lug_book_small safety_high safety_low safety_med
1 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 0 1 0 0 1
2 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 0 1 1 0 0
3 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 1 0 0 1 0
4 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 1 0 0 0 1
5 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 1 0 1 0 0

5 rows × 21 columns
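One caveat with calling `pd.get_dummies` on the train and test splits separately, as above: if some category never appears in one split, the two frames end up with different columns. `DataFrame.align` (or `reindex`) guards against that. A self-contained sketch with made-up frames:

```python
import pandas as pd

train_raw = pd.DataFrame({'safety': ['low', 'med', 'high']})
test_raw = pd.DataFrame({'safety': ['low', 'low']})  # 'med'/'high' absent

xtr = pd.get_dummies(train_raw)
xte = pd.get_dummies(test_raw)

# Align test columns to the train columns, filling missing dummies with 0
xtr, xte = xtr.align(xte, join='left', axis=1, fill_value=0)

print(list(xte.columns))  # same columns, same order, as the train frame
```

Here the split happens to be harmless because every category of every feature occurs in both splits, but the alignment step makes the pipeline robust.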


In [392]:
# Apply the Naive Bayes classifier (sklearn)

nb = GaussianNB()
nb.fit(xtrain,ytrain)
pred = nb.predict(xtest)
acc_sk_nb = np.mean(pred == ytest)*100
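GaussianNB models each feature as a continuous Gaussian, which is a questionable fit for 0/1 dummy columns. scikit-learn's `CategoricalNB` handles categorical features directly once they are integer-coded with `OrdinalEncoder`, and is arguably the closer sklearn counterpart to the from-scratch classifier. A hedged sketch on toy data (feature values and labels below are illustrative):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical data: two features, two classes
X = np.array([['low', 'small'], ['low', 'big'],
              ['high', 'big'], ['high', 'small']])
y = np.array(['unacc', 'unacc', 'acc', 'acc'])

enc = OrdinalEncoder()
Xn = enc.fit_transform(X)  # categories -> non-negative integer codes

clf = CategoricalNB()      # per-feature categorical likelihoods + smoothing
clf.fit(Xn, y)
pred = clf.predict(enc.transform([['high', 'big']]))
print(pred[0])  # → 'acc'
```

Swapping this encoder/estimator pair into the cell above in place of `get_dummies` + `GaussianNB` would keep the model assumptions matched to the data.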

Question 3

Analyze the accuracy of the two algorithms and discuss your solution.


In [393]:
print('Accuracy (from scratch):',acc_my_nb)
print('Accuracy (sklearn version):',acc_sk_nb)


Accuracy (from scratch): 84.2239185751
Accuracy (sklearn version): 80.1526717557

The accuracies of the two classifiers are fairly close (about 84% vs. 80%). Since the sklearn version requires numeric input, the categorical-to-numeric transformation can hurt performance: GaussianNB assumes continuous, Gaussian-distributed features, which the 0/1 dummy columns are not. BernoulliNB or CategoricalNB would be a better match for this kind of data.


In [ ]: