Regressão Linear Multivariada - Trabalho

Estudo de caso: Qualidade de Vinhos

Nesta trabalho, treinaremos um modelo de regressão linear usando descendência de gradiente estocástico no conjunto de dados da Qualidade do Vinho. O exemplo pressupõe que uma cópia CSV do conjunto de dados está no diretório de trabalho atual com o nome do arquivo winequality-white.csv.

O conjunto de dados de qualidade do vinho envolve a previsão da qualidade dos vinhos brancos em uma escala, com medidas químicas de cada vinho. É um problema de classificação multiclasse, mas também pode ser enquadrado como um problema de regressão. O número de observações para cada classe não é equilibrado. Existem 4.898 observações com 11 variáveis de entrada e 1 variável de saída. Os nomes das variáveis são os seguintes:

  1. Fixed acidity.
  2. Volatile acidity.
  3. Citric acid.
  4. Residual sugar.
  5. Chlorides.
  6. Free sulfur dioxide.
  7. Total sulfur dioxide.
  8. Density.
  9. pH.
  10. Sulphates.
  11. Alcohol.
  12. Quality (score between 0 and 10).

O desempenho de referencia de predição do valor médio é um RMSE de aproximadamente 0.148 pontos de qualidade.

Utilize o exemplo apresentado no tutorial e altere-o de forma a carregar os dados e analisar a acurácia de sua solução.

Definição das Bibliotecas e Funções Principais


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

In [2]:
def RMSE(errors):
    return np.sqrt(1/errors.shape[1] * np.sum(errors**2))

def predict(X, coef, addOnes=False):
    if(addOnes): X = np.append(np.ones([X.shape[0], 1]), X, axis=1)
    return np.dot(X, coef).reshape(1, X.shape[0])

def stochasticGD(X, y, alfa=0.00001, maxEpoch=50):
    X = np.append(np.ones([X.shape[0], 1]), X, axis=1)
    coef = np.random.randn(X.shape[1], 1)
    errorHist = []
    
    for epoch in range(maxEpoch):
        error =  predict(X, coef) - y
        errorHist.append(RMSE(error))
        
        for i in range(X.shape[0]):
            coef[0] -= alfa * error[0,i]
            for j in range(len(coef)-1):
                coef[j+1] -= alfa * error[0,i] * X[i,j]
                
        print("Epoch: {} | RMSE: {}".format(epoch, errorHist[-1]))
        print("Coefficients: \n", coef.T)
        print("\n###")
    
    return coef, errorHist

Carregando o conjunto de dados e utilizando o Gradiente Descendente Estocástico


In [3]:
data = pd.read_csv("winequality-white.csv", delimiter=";")

X = MinMaxScaler().fit_transform(data.values[:,:-1])
y = data.values[:,-1]

In [4]:
[coef, errorHist] = stochasticGD(X, y)


Epoch: 0 | RMSE: 5.603006430346713
Coefficients: 
 [[ 1.41482737 -0.14446488 -1.36397147 -1.72856557  0.25729327  0.29945667
  -1.72824681 -0.14588103  1.0290101   0.72245113 -0.25032003 -0.22558473]]

###
Epoch: 1 | RMSE: 5.116385508659133
Coefficients: 
 [[ 1.66132821  0.10203597 -1.29172991 -1.68097011  0.30732737  0.32096455
  -1.70195755 -0.11703484  1.10247275  0.75444081 -0.14551486 -0.14777653]]

###
Epoch: 2 | RMSE: 4.674617386194008
Coefficients: 
 [[ 1.88584284  0.3265506  -1.22602007 -1.63768232  0.35290373  0.34050594
  -1.67806477 -0.09074454  1.16930883  0.78348593 -0.05005471 -0.07696228]]

###
Epoch: 3 | RMSE: 4.273807335226205
Coefficients: 
 [[ 2.09033743  0.53104519 -1.16625786 -1.59831694  0.394421    0.35825667
  -1.65635417 -0.06678155  1.23011095  0.8098498   0.03689623 -0.01251654]]

###
Epoch: 4 | RMSE: 3.9104159314646747
Coefficients: 
 [[ 2.27660233  0.71731009 -1.11191145 -1.56252317  0.43224216  0.37437687
  -1.63663059 -0.04493775  1.2854187   0.83377217  0.11609899  0.0461302 ]]

###
Epoch: 5 | RMSE: 3.5812267374506135
Coefficients: 
 [[ 2.44626777  0.88697552 -1.06249654 -1.52998157  0.46669774  0.38901235
  -1.61871634 -0.02502363  1.33572341  0.85547139  0.18824656  0.09949652]]

###
Epoch: 6 | RMSE: 3.283316528662896
Coefficients: 
 [[ 2.60081819  1.04152595 -1.01757216 -1.50040128  0.49808872  0.40229585
  -1.60244961 -0.00686662  1.38147248  0.87514623  0.25396995  0.14805461]]

###
Epoch: 7 | RMSE: 3.014027756319595
Coefficients: 
 [[ 2.7416053   1.18231306 -0.97673677 -1.47351744  0.52668916  0.41434827
  -1.58768308  0.00969042  1.4230733   0.89297773  0.31384375  0.19223447]]

###
Epoch: 8 | RMSE: 2.770942967636425
Coefficients: 
 [[ 2.86985991  1.31056767 -0.93962473 -1.44908889  0.5527486   0.4252797
  -1.57428257  0.02479056  1.46089684  0.90913071  0.36839114  0.2324276 ]]

###
Epoch: 9 | RMSE: 2.5518609377083132
Coefficients: 
 [[ 2.98670277  1.42741053 -0.90590312 -1.42689603  0.57649428  0.43519036
  -1.56212593  0.03856409  1.49528087  0.92375527  0.41808855  0.26899051]]

###
Epoch: 10 | RMSE: 2.3547743108144545
Coefficients: 
 [[ 3.0931544   1.53386216 -0.87526877 -1.40673889  0.5981331   0.44417153
  -1.55110192  0.05112965  1.52653297  0.93698807  0.46336978  0.30224785]]

###
Epoch: 11 | RMSE: 2.177848606124745
Coefficients: 
 [[ 3.19014406  1.63085181 -0.84744563 -1.38843539  0.61785348  0.45230632
  -1.54110925  0.06259526  1.55493321  0.94895356  0.50462984  0.3324952 ]]

###
Epoch: 12 | RMSE: 2.019402515391679
Coefficients: 
 [[ 3.2785179   1.71922566 -0.82218235 -1.37181973  0.63582697  0.4596704
  -1.53205569  0.07305929  1.58073661  0.95976506  0.54222842  0.36000172]]

###
Epoch: 13 | RMSE: 1.8778895070448227
Coefficients: 
 [[ 3.35904641  1.79975417 -0.79925003 -1.35674094  0.65220979  0.46633268
  -1.52385725  0.0826113   1.6041754   0.96952575  0.57649303  0.38501252]]

###
Epoch: 14 | RMSE: 1.7518808459940758
Coefficients: 
 [[ 3.43243119  1.87313894 -0.77844026 -1.34306156  0.66714418  0.47235592
  -1.51643747  0.09133286  1.62546102  0.9783296   0.60772192  0.40775075]]

###
Epoch: 15 | RMSE: 1.640050228816099
Coefficients: 
 [[ 3.49931106  1.94001882 -0.75956326 -1.3306564   0.68075968  0.47779723
  -1.50972671  0.09929823  1.64478602  0.98626215  0.63618663  0.42841964]]

###
Epoch: 16 | RMSE: 1.541160300968715
Coefficients: 
 [[ 3.56026774  2.0009755  -0.74244623 -1.31941149  0.69317422  0.48270866
  -1.50366155  0.10657503  1.66232572  0.9934013   0.66213445  0.44720423]]

###
Epoch: 17 | RMSE: 1.4540513439012783
Coefficients: 
 [[ 3.61583093  2.05653868 -0.72693182 -1.30922303  0.7044952   0.48713758
  -1.49818428  0.11322485  1.67823976  0.999818    0.68579054  0.46427302]]

###
Epoch: 18 | RMSE: 1.3776323756560345
Coefficients: 
 [[ 3.66648294  2.10719069 -0.71287674 -1.29999652  0.71482041  0.49112711
  -1.49324231  0.11930374  1.69267351  1.00557683  0.70735993  0.47977943]]

###
Epoch: 19 | RMSE: 1.3108747899419195
Coefficients: 
 [[ 3.71266296  2.15337072 -0.70015052 -1.2916459   0.7242389   0.49471654
  -1.48878776  0.12486276  1.70575934  1.01073662  0.7270293   0.49386316]]

###
Epoch: 20 | RMSE: 1.2528084752887587
Coefficients: 
 [[ 3.75477091  2.19547867 -0.68863438 -1.28409281  0.73283174  0.49794164
  -1.48477706  0.12994841  1.71761778  1.01535092  0.74496865  0.50665143]]

###
Epoch: 21 | RMSE: 1.2025201383098463
Coefficients: 
 [[ 3.79317094  2.2338787  -0.67822012 -1.27726591  0.74067277  0.50083498
  -1.48117051  0.13460302  1.72835858  1.01946849  0.76133278  0.51826007]]

###
Epoch: 22 | RMSE: 1.1591533492785853
Coefficients: 
 [[ 3.82819463  2.26890239 -0.66880925 -1.27110021  0.74782922  0.50342621
  -1.47793196  0.13886512  1.73808168  1.02313375  0.77626264  0.52879457]]

###
Epoch: 23 | RMSE: 1.1219096823484431
Coefficients: 
 [[ 3.86014391  2.30085167 -0.66031208 -1.26553655  0.7543623   0.50574233
  -1.4750285   0.14276984  1.74687807  1.02638712  0.78988657  0.53835096]]

###
Epoch: 24 | RMSE: 1.0900502712824562
Coefficients: 
 [[ 3.88929372  2.33000148 -0.65264694 -1.26052108  0.76032776  0.50780793
  -1.47243017  0.14634911  1.75483062  1.02926543  0.80232145  0.54701669]]

###
Epoch: 25 | RMSE: 1.0628971537160334
Coefficients: 
 [[ 3.91589439  2.35660215 -0.64573945 -1.25600472  0.76577636  0.5096454
  -1.47010968  0.14963205  1.76201478  1.03180221  0.81367368  0.55487142]]

###
Epoch: 26 | RMSE: 1.039833914857127
Coefficients: 
 [[ 3.94017389  2.38088165 -0.63952189 -1.25194283  0.77075431  0.51127514
  -1.46804218  0.15264515  1.76849926  1.03402797  0.82404017  0.56198763]]

###
Epoch: 27 | RMSE: 1.020305328278051
Coefficients: 
 [[ 3.96233979  2.40304755 -0.63393262 -1.24829472  0.77530369  0.5127157
  -1.46620505  0.15541255  1.77434661  1.03597052  0.83350916  0.56843137]]

###
Epoch: 28 | RMSE: 1.0038158848535903
Coefficients: 
 [[ 3.98258109  2.42328885 -0.62891547 -1.24502335  0.77946284  0.51398398
  -1.46457768  0.15795621  1.7796138   1.03765518  0.84216099  0.57426276]]

###
Epoch: 29 | RMSE: 0.989927266047197
Coefficients: 
 [[ 4.0010699   2.44177766 -0.62441932 -1.24209502  0.78326663  0.51509539
  -1.46314132  0.16029615  1.78435269  1.03910498  0.85006881  0.57953654]]

###
Epoch: 30 | RMSE: 0.9782549343084114
Coefficients: 
 [[ 4.01796293  2.45867069 -0.62039763 -1.239479    0.78674686  0.51606393
  -1.46187888  0.16245058  1.78861053  1.04034092  0.85729925  0.58430256]]

###
Epoch: 31 | RMSE: 0.9684640763744312
Coefficients: 
 [[ 4.03340289  2.47411065 -0.61680799 -1.23714734  0.78993244  0.51690238
  -1.4607748   0.16443609  1.79243032  1.0413821   0.86391296  0.58860621]]

###
Epoch: 32 | RMSE: 0.9602651514179553
Coefficients: 
 [[ 4.04751971  2.48822747 -0.6136118  -1.23507458  0.79284973  0.51762234
  -1.45981494  0.16626779  1.79585123  1.04224594  0.86996518  0.59248883]]

###
Epoch: 33 | RMSE: 0.9534092779954162
Coefficients: 
 [[ 4.06043171  2.50113947 -0.61077393 -1.23323753  0.79552271  0.5182344
  -1.45898638  0.16795942  1.79890891  1.04294826  0.8755062   0.59598806]]

###
Epoch: 34 | RMSE: 0.9476836555510322
Coefficients: 
 [[ 4.07224663  2.51295439 -0.60826235 -1.23161509  0.79797323  0.51874821
  -1.45827738  0.16952352  1.80163583  1.04350351  0.88058183  0.59913816]]

###
Epoch: 35 | RMSE: 0.9429071697568481
Coefficients: 
 [[ 4.08306258  2.52377034 -0.60604794 -1.23018802  0.80022117  0.51917252
  -1.45767726  0.17097149  1.80406156  1.04392482  0.88523375  0.60197034]]

###
Epoch: 36 | RMSE: 0.938926284782838
Coefficients: 
 [[ 4.09296887  2.53367663 -0.60410416 -1.22893883  0.80228465  0.51951533
  -1.45717627  0.17231371  1.80621303  1.04422415  0.88949996  0.60451302]]

###
Epoch: 37 | RMSE: 0.9356112848279582
Coefficients: 
 [[ 4.10204686  2.54275462 -0.60240683 -1.22785159  0.80418017  0.51978393
  -1.45676555  0.17355965  1.80811475  1.0444124   0.89341501  0.60679207]]

###
Epoch: 38 | RMSE: 0.93285289424897
Coefficients: 
 [[ 4.11037061  2.55107837 -0.60093396 -1.22691176  0.80592273  0.51998492
  -1.45643703  0.1747179   1.80978904  1.04449948  0.8970104   0.60883105]]

###
Epoch: 39 | RMSE: 0.9305592808160877
Coefficients: 
 [[ 4.11800756  2.55871532 -0.59966552 -1.22610615  0.80752601  0.52012434
  -1.45618335  0.17579631  1.81125624  1.04449443  0.9003148   0.6106514 ]]

###
Epoch: 40 | RMSE: 0.9286534292870214
Coefficients: 
 [[ 4.12501913  2.56572689 -0.59858325 -1.22542271  0.80900247  0.52020768
  -1.45599782  0.17680202  1.81253484  1.04440547  0.90335431  0.61227268]]

###
Epoch: 41 | RMSE: 0.9270708613758998
Coefficients: 
 [[ 4.13146123  2.57216899 -0.59767056 -1.22485049  0.81036344  0.52023995
  -1.45587435  0.17774153  1.8136417   1.04424009  0.9061527   0.61371266]]

###
Epoch: 42 | RMSE: 0.9257576719188847
Coefficients: 
 [[ 4.13738479  2.57809255 -0.5969123  -1.22437952  0.81161924  0.5202257
  -1.45580738  0.17862075  1.81459217  1.04400511  0.90873162  0.61498754]]

###
Epoch: 43 | RMSE: 0.9246688483158175
Coefficients: 
 [[ 4.14283617  2.58354392 -0.59629471 -1.2240007   0.81277927  0.52016908
  -1.45579186  0.17944509  1.81540023  1.04370673  0.91111077  0.61611207]]

###
Epoch: 44 | RMSE: 0.9237668400669623
Coefficients: 
 [[ 4.14785758  2.58856534 -0.59580525 -1.22370576  0.8138521   0.52007386
  -1.45582319  0.18021944  1.81607859  1.04335062  0.9133081   0.61709968]]

###
Epoch: 45 | RMSE: 0.9230203465641086
Coefficients: 
 [[ 4.15248749  2.59319524 -0.5954325  -1.22348718  0.81484552  0.51994348
  -1.45589717  0.18094828  1.81663885  1.04294192  0.91533996  0.6179626 ]]

###
Epoch: 46 | RMSE: 0.9224032935884039
Coefficients: 
 [[ 4.15676088  2.59746863 -0.59516605 -1.22333808  0.81576663  0.51978106
  -1.45601     0.18163567  1.81709156  1.04248532  0.91722123  0.61871195]]

###
Epoch: 47 | RMSE: 0.9218939717499177
Coefficients: 
 [[ 4.16070963  2.60141739 -0.59499644 -1.22325223  0.81662188  0.51958948
  -1.4561582   0.18228532  1.81744631  1.04198509  0.91896545  0.61935789]]

###
Epoch: 48 | RMSE: 0.9214743130584504
Coefficients: 
 [[ 4.16436278  2.60507054 -0.59491505 -1.22322394  0.81741717  0.5193713
  -1.4563386   0.18290061  1.81771186  1.04144511  0.92058496  0.61990964]]

###
Epoch: 49 | RMSE: 0.9211292847340766
Coefficients: 
 [[ 4.16774675  2.6084545  -0.59491402 -1.22324803  0.81815785  0.51912891
  -1.45654833  0.1834846   1.81789618  1.04086892  0.922091    0.62037561]]

###

In [5]:
print("Gradiente Descendente Estocástico\nRMSE: {}".format(RMSE(y - predict(X, coef, True))))
print("Coeficientes:\n", coef.T)


Gradiente Descendente Estocástico
RMSE: 0.9208463821240235
Coeficientes:
 [[ 4.16774675  2.6084545  -0.59491402 -1.22324803  0.81815785  0.51912891
  -1.45654833  0.1834846   1.81789618  1.04086892  0.922091    0.62037561]]

Plotagem do Custo por Época


In [6]:
plt.plot(errorHist)
plt.show()


Estimativa dos Coeficientes pelo Método dos Mínimos Quadrados (OLS)


In [7]:
X = np.append(np.ones([X.shape[0], 1]), X, axis=1)
beta = np.dot(np.dot(np.linalg.pinv(np.dot(X.T,X)), X.T), y)

print("Métodos dos Mínimos Quadrados\nRMSE: {}".format(RMSE(y - predict(X,beta))))
print("Coeficientes:\n", beta)


Métodos dos Mínimos Quadrados
RMSE: 0.7504359153109991
Coeficientes:
 [ 5.55089003  0.6814076  -1.90044063  0.03666973  5.31267873 -0.08333219
  1.07130361 -0.12315714 -7.79524045  0.75497812  0.54306977  1.19954932]

Comentários

Obs.: Os dados desse dataset foram disponibilizados pela Universidade do Minho (Portugal) :P

Primeiramente, é interessante notar que os dados de saída ($Y$), o atributo "Quality", possue apenas valores discretos e bastante baixos. Em contra-partida, os atributos de entrada ($X$) se apresentam em várias escalas, e podem ser bem maiores que os valores de saída. Por esse motivo, para manter a estabilidade do Stochastic Gradient Descent, é necessário realizar algum tipo de Feature Scaling para normalizar os dados de entrada, gerando assim um treinamento mais estável.

No meu código, utilizei a classe MinMaxScaler do próprio Scikit-Learn para realizar essa normalização de forma rápida. O Min-Max Scaling consiste em, para cada atributo, subtrair todos os valores pelo menor valor e dividir isso pela diferença entre o maior e menor valor. Isso garante, então, que todos os dados serão dispostos no intervalo fechado [0, 1].

No meu código, também, utilizei a notação matricial das operações entre os coeficientes ($\beta$) e os dados de entrada ($X$). Isso permite uma computação mais rápida, com menos linhas de códigos, e ainda mantém todas as características originais do problema. Uma outra estivativa de coeficientes, utilizando o Método dos Mínimos Quadrados, também foi apresentada e mostrou resultados similares aos do Gradiente Descendente Estocástico.