Multivariate Linear Regression - Assignment

Case study: Wine Quality

In this assignment, we will train a linear regression model using stochastic gradient descent on the Wine Quality dataset. The example assumes that a CSV copy of the dataset is in the current working directory with the filename winequality-white.csv.

The wine quality dataset involves predicting the quality of white wines on a scale, given chemical measurements of each wine. It is a multiclass classification problem, but it can also be framed as a regression problem. The number of observations per class is not balanced (see the inspection sketch after the variable list). There are 4,898 observations with 11 input variables and 1 output variable. The variable names are as follows:

  1. Fixed acidity.
  2. Volatile acidity.
  3. Citric acid.
  4. Residual sugar.
  5. Chlorides.
  6. Free sulfur dioxide.
  7. Total sulfur dioxide.
  8. Density.
  9. pH.
  10. Sulphates.
  11. Alcohol.
  12. Quality (score between 0 and 10).
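
As a quick check of these figures (a minimal sketch, assuming the semicolon-separated winequality-white.csv is in the working directory), the shape and the per-class counts can be inspected directly; note the imbalance across quality scores:

import pandas as pd

df = pd.read_csv('winequality-white.csv', delimiter=';')
print(df.shape)                      # expected: (4898, 12) -- 11 inputs plus quality
print(df['quality'].value_counts())  # observations per quality score (unbalanced)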

The baseline performance of predicting the mean value is an RMSE of approximately 0.148 quality points.
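
This zero-rule baseline (always predicting the mean quality) can be reproduced with a few lines; a minimal sketch on the raw 0-10 quality scale is shown below (the exact value depends on whether the target has been rescaled beforehand):

import numpy as np
import pandas as pd

df = pd.read_csv('winequality-white.csv', delimiter=';')
mean_quality = df['quality'].mean()
baseline_rmse = np.sqrt(((df['quality'] - mean_quality) ** 2).mean())
print('Zero-rule baseline RMSE: %.3f' % baseline_rmse)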

Use the example presented in the tutorial and modify it to load the data and evaluate the accuracy of your solution.


In [154]:
import csv
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import matplotlib.pyplot as plt

def coefficients_sgd(train, l_rate, n_epoch):
    # Estimate linear regression coefficients with stochastic gradient descent.
    size = train.shape[0]
    # One coefficient per input column (the bias column B0 plus the 11 features).
    coef = np.random.normal(size=train.shape[1] - 1)
    print('Initial coefficients=', coef)
    errors = []

    for epoch in range(n_epoch):
        sum_error = 0
        for row in train.values:
            x = row[0:-1]                     # inputs, including the bias column
            yhat = np.dot(x, coef)            # prediction with the current coefficients
            error = yhat - row[-1]            # prediction error for this sample
            sum_error += error**2
            coef = coef - l_rate * error * x  # per-sample gradient step
        # Training RMSE accumulated over the epoch (each error is measured
        # before the corresponding update).
        rmse = np.sqrt(sum_error / size)
        errors.append(rmse)
        print('epoch=%d, lrate=%.3f, RMSE=%.3f' % (epoch, l_rate, rmse))
    return coef, errors

In [155]:
dataset = pd.read_csv('winequality-white.csv', delimiter=";")

cols = list(dataset.columns)
cols.remove('quality')

# Standardize the 11 input features (zero mean, unit variance); keep the raw quality score as the target.
datasetNorm = pd.DataFrame(preprocessing.scale(dataset[cols]))
datasetNorm['y'] = dataset['quality']

# Prepend a column of ones (B0) so the bias is learned as an ordinary coefficient.
idx = 0
new_col = np.ones(datasetNorm.shape[0])  # can be a list, a Series, an array or a scalar
datasetNorm.insert(loc=idx, column='B0', value=new_col)

print(datasetNorm.shape)

train, test = train_test_split(datasetNorm, test_size=0.3)


(4898, 13)
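
As an optional cross-check (a sketch, not part of the original tutorial), the same standardized features can be fitted with scikit-learn's SGDRegressor and its coefficients compared with the hand-rolled version trained below:

from sklearn.linear_model import SGDRegressor

X_train = train.values[:, 1:-1]   # drop the manual B0 column; SGDRegressor fits its own intercept
y_train = train.values[:, -1]
sgd = SGDRegressor(max_iter=200, tol=None, random_state=0)  # default squared-error loss
sgd.fit(X_train, y_train)
print('sklearn intercept:', sgd.intercept_)
print('sklearn coefficients:', sgd.coef_)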

In [168]:
# Calculate coefficients
l_rate = 1e-5
n_epoch = 200
coefs, errors = coefficients_sgd(train, l_rate, n_epoch)
print('Final coefficients=', coefs)


Initial coefficients= [-0.45111468  2.23254304 -1.48777801  0.262857   -0.35499591 -0.03776972
  1.26994076 -1.44760054 -0.02891993 -0.30925472 -0.28220391  1.83270609]
epoch=0, lrate=0.000, RMSE=7.146
epoch=1, lrate=0.000, RMSE=6.884
epoch=2, lrate=0.000, RMSE=6.635
epoch=3, lrate=0.000, RMSE=6.398
epoch=4, lrate=0.000, RMSE=6.171
epoch=5, lrate=0.000, RMSE=5.954
epoch=6, lrate=0.000, RMSE=5.747
epoch=7, lrate=0.000, RMSE=5.549
epoch=8, lrate=0.000, RMSE=5.359
epoch=9, lrate=0.000, RMSE=5.177
epoch=10, lrate=0.000, RMSE=5.002
epoch=11, lrate=0.000, RMSE=4.835
epoch=12, lrate=0.000, RMSE=4.674
epoch=13, lrate=0.000, RMSE=4.520
epoch=14, lrate=0.000, RMSE=4.371
epoch=15, lrate=0.000, RMSE=4.229
epoch=16, lrate=0.000, RMSE=4.092
epoch=17, lrate=0.000, RMSE=3.961
epoch=18, lrate=0.000, RMSE=3.835
epoch=19, lrate=0.000, RMSE=3.713
epoch=20, lrate=0.000, RMSE=3.597
epoch=21, lrate=0.000, RMSE=3.484
epoch=22, lrate=0.000, RMSE=3.376
epoch=23, lrate=0.000, RMSE=3.273
epoch=24, lrate=0.000, RMSE=3.173
epoch=25, lrate=0.000, RMSE=3.077
epoch=26, lrate=0.000, RMSE=2.985
epoch=27, lrate=0.000, RMSE=2.896
epoch=28, lrate=0.000, RMSE=2.811
epoch=29, lrate=0.000, RMSE=2.729
epoch=30, lrate=0.000, RMSE=2.651
epoch=31, lrate=0.000, RMSE=2.575
epoch=32, lrate=0.000, RMSE=2.502
epoch=33, lrate=0.000, RMSE=2.432
epoch=34, lrate=0.000, RMSE=2.365
epoch=35, lrate=0.000, RMSE=2.301
epoch=36, lrate=0.000, RMSE=2.239
epoch=37, lrate=0.000, RMSE=2.179
epoch=38, lrate=0.000, RMSE=2.122
epoch=39, lrate=0.000, RMSE=2.067
epoch=40, lrate=0.000, RMSE=2.014
epoch=41, lrate=0.000, RMSE=1.964
epoch=42, lrate=0.000, RMSE=1.915
epoch=43, lrate=0.000, RMSE=1.868
epoch=44, lrate=0.000, RMSE=1.824
epoch=45, lrate=0.000, RMSE=1.781
epoch=46, lrate=0.000, RMSE=1.739
epoch=47, lrate=0.000, RMSE=1.700
epoch=48, lrate=0.000, RMSE=1.662
epoch=49, lrate=0.000, RMSE=1.626
epoch=50, lrate=0.000, RMSE=1.591
epoch=51, lrate=0.000, RMSE=1.558
epoch=52, lrate=0.000, RMSE=1.526
epoch=53, lrate=0.000, RMSE=1.495
epoch=54, lrate=0.000, RMSE=1.466
epoch=55, lrate=0.000, RMSE=1.438
epoch=56, lrate=0.000, RMSE=1.411
epoch=57, lrate=0.000, RMSE=1.385
epoch=58, lrate=0.000, RMSE=1.360
epoch=59, lrate=0.000, RMSE=1.337
epoch=60, lrate=0.000, RMSE=1.314
epoch=61, lrate=0.000, RMSE=1.293
epoch=62, lrate=0.000, RMSE=1.272
epoch=63, lrate=0.000, RMSE=1.252
epoch=64, lrate=0.000, RMSE=1.233
epoch=65, lrate=0.000, RMSE=1.215
epoch=66, lrate=0.000, RMSE=1.198
epoch=67, lrate=0.000, RMSE=1.181
epoch=68, lrate=0.000, RMSE=1.165
epoch=69, lrate=0.000, RMSE=1.150
epoch=70, lrate=0.000, RMSE=1.136
epoch=71, lrate=0.000, RMSE=1.122
epoch=72, lrate=0.000, RMSE=1.109
epoch=73, lrate=0.000, RMSE=1.096
epoch=74, lrate=0.000, RMSE=1.084
epoch=75, lrate=0.000, RMSE=1.072
epoch=76, lrate=0.000, RMSE=1.061
epoch=77, lrate=0.000, RMSE=1.051
epoch=78, lrate=0.000, RMSE=1.040
epoch=79, lrate=0.000, RMSE=1.031
epoch=80, lrate=0.000, RMSE=1.021
epoch=81, lrate=0.000, RMSE=1.013
epoch=82, lrate=0.000, RMSE=1.004
epoch=83, lrate=0.000, RMSE=0.996
epoch=84, lrate=0.000, RMSE=0.988
epoch=85, lrate=0.000, RMSE=0.981
epoch=86, lrate=0.000, RMSE=0.974
epoch=87, lrate=0.000, RMSE=0.967
epoch=88, lrate=0.000, RMSE=0.960
epoch=89, lrate=0.000, RMSE=0.954
epoch=90, lrate=0.000, RMSE=0.948
epoch=91, lrate=0.000, RMSE=0.942
epoch=92, lrate=0.000, RMSE=0.936
epoch=93, lrate=0.000, RMSE=0.931
epoch=94, lrate=0.000, RMSE=0.926
epoch=95, lrate=0.000, RMSE=0.921
epoch=96, lrate=0.000, RMSE=0.916
epoch=97, lrate=0.000, RMSE=0.912
epoch=98, lrate=0.000, RMSE=0.907
epoch=99, lrate=0.000, RMSE=0.903
epoch=100, lrate=0.000, RMSE=0.899
epoch=101, lrate=0.000, RMSE=0.895
epoch=102, lrate=0.000, RMSE=0.892
epoch=103, lrate=0.000, RMSE=0.888
epoch=104, lrate=0.000, RMSE=0.885
epoch=105, lrate=0.000, RMSE=0.881
epoch=106, lrate=0.000, RMSE=0.878
epoch=107, lrate=0.000, RMSE=0.875
epoch=108, lrate=0.000, RMSE=0.872
epoch=109, lrate=0.000, RMSE=0.869
epoch=110, lrate=0.000, RMSE=0.866
epoch=111, lrate=0.000, RMSE=0.864
epoch=112, lrate=0.000, RMSE=0.861
epoch=113, lrate=0.000, RMSE=0.859
epoch=114, lrate=0.000, RMSE=0.856
epoch=115, lrate=0.000, RMSE=0.854
epoch=116, lrate=0.000, RMSE=0.852
epoch=117, lrate=0.000, RMSE=0.849
epoch=118, lrate=0.000, RMSE=0.847
epoch=119, lrate=0.000, RMSE=0.845
epoch=120, lrate=0.000, RMSE=0.843
epoch=121, lrate=0.000, RMSE=0.841
epoch=122, lrate=0.000, RMSE=0.839
epoch=123, lrate=0.000, RMSE=0.838
epoch=124, lrate=0.000, RMSE=0.836
epoch=125, lrate=0.000, RMSE=0.834
epoch=126, lrate=0.000, RMSE=0.833
epoch=127, lrate=0.000, RMSE=0.831
epoch=128, lrate=0.000, RMSE=0.829
epoch=129, lrate=0.000, RMSE=0.828
epoch=130, lrate=0.000, RMSE=0.826
epoch=131, lrate=0.000, RMSE=0.825
epoch=132, lrate=0.000, RMSE=0.824
epoch=133, lrate=0.000, RMSE=0.822
epoch=134, lrate=0.000, RMSE=0.821
epoch=135, lrate=0.000, RMSE=0.820
epoch=136, lrate=0.000, RMSE=0.819
epoch=137, lrate=0.000, RMSE=0.817
epoch=138, lrate=0.000, RMSE=0.816
epoch=139, lrate=0.000, RMSE=0.815
epoch=140, lrate=0.000, RMSE=0.814
epoch=141, lrate=0.000, RMSE=0.813
epoch=142, lrate=0.000, RMSE=0.812
epoch=143, lrate=0.000, RMSE=0.811
epoch=144, lrate=0.000, RMSE=0.810
epoch=145, lrate=0.000, RMSE=0.809
epoch=146, lrate=0.000, RMSE=0.808
epoch=147, lrate=0.000, RMSE=0.807
epoch=148, lrate=0.000, RMSE=0.806
epoch=149, lrate=0.000, RMSE=0.805
epoch=150, lrate=0.000, RMSE=0.805
epoch=151, lrate=0.000, RMSE=0.804
epoch=152, lrate=0.000, RMSE=0.803
epoch=153, lrate=0.000, RMSE=0.802
epoch=154, lrate=0.000, RMSE=0.801
epoch=155, lrate=0.000, RMSE=0.801
epoch=156, lrate=0.000, RMSE=0.800
epoch=157, lrate=0.000, RMSE=0.799
epoch=158, lrate=0.000, RMSE=0.798
epoch=159, lrate=0.000, RMSE=0.798
epoch=160, lrate=0.000, RMSE=0.797
epoch=161, lrate=0.000, RMSE=0.797
epoch=162, lrate=0.000, RMSE=0.796
epoch=163, lrate=0.000, RMSE=0.795
epoch=164, lrate=0.000, RMSE=0.795
epoch=165, lrate=0.000, RMSE=0.794
epoch=166, lrate=0.000, RMSE=0.794
epoch=167, lrate=0.000, RMSE=0.793
epoch=168, lrate=0.000, RMSE=0.792
epoch=169, lrate=0.000, RMSE=0.792
epoch=170, lrate=0.000, RMSE=0.791
epoch=171, lrate=0.000, RMSE=0.791
epoch=172, lrate=0.000, RMSE=0.790
epoch=173, lrate=0.000, RMSE=0.790
epoch=174, lrate=0.000, RMSE=0.789
epoch=175, lrate=0.000, RMSE=0.789
epoch=176, lrate=0.000, RMSE=0.789
epoch=177, lrate=0.000, RMSE=0.788
epoch=178, lrate=0.000, RMSE=0.788
epoch=179, lrate=0.000, RMSE=0.787
epoch=180, lrate=0.000, RMSE=0.787
epoch=181, lrate=0.000, RMSE=0.786
epoch=182, lrate=0.000, RMSE=0.786
epoch=183, lrate=0.000, RMSE=0.786
epoch=184, lrate=0.000, RMSE=0.785
epoch=185, lrate=0.000, RMSE=0.785
epoch=186, lrate=0.000, RMSE=0.785
epoch=187, lrate=0.000, RMSE=0.784
epoch=188, lrate=0.000, RMSE=0.784
epoch=189, lrate=0.000, RMSE=0.784
epoch=190, lrate=0.000, RMSE=0.783
epoch=191, lrate=0.000, RMSE=0.783
epoch=192, lrate=0.000, RMSE=0.783
epoch=193, lrate=0.000, RMSE=0.782
epoch=194, lrate=0.000, RMSE=0.782
epoch=195, lrate=0.000, RMSE=0.782
epoch=196, lrate=0.000, RMSE=0.781
epoch=197, lrate=0.000, RMSE=0.781
epoch=198, lrate=0.000, RMSE=0.781
epoch=199, lrate=0.000, RMSE=0.781
Final coefficients= [  5.86812920e+00  -4.19244636e-02  -1.86885075e-01  -3.58146598e-02
  -1.49440419e-02   2.63705850e-03   2.10170502e-01  -1.78754580e-01
   2.65607665e-01   1.08557358e-02   4.20615454e-02   5.98836336e-01]

In [167]:
# Plot of training RMSE (cost) per epoch
plt.plot(errors)
plt.xlabel('epoch')
plt.ylabel('RMSE')
plt.show()



In [ ]:
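
The 30% test split created earlier is never evaluated. A minimal held-out check (a sketch, assuming coefs and test from the cells above; not part of the original run) could fill the empty cell above:

X_test = test.values[:, :-1]                 # B0 column plus the 11 standardized inputs
y_test = test.values[:, -1]                  # raw quality scores
yhat_test = X_test.dot(coefs)
test_rmse = np.sqrt(np.mean((yhat_test - y_test) ** 2))
print('Test RMSE: %.3f' % test_rmse)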