Exercise 6

SVM & Regularization

For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has seen a recent growth spurt as social drinking is on the rise. The price of wine depends in part on the rather abstract concept of wine appreciation by wine tasters, whose opinions can vary widely, so pricing rests to some extent on a volatile factor. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. It would be of interest to the wine market if the quality perceived by human tasters could be related to the chemical properties of wine, so that the certification, quality assessment and quality assurance process is more controlled.

Two datasets are available: one on red wine, with 1599 different varieties, and one on white wine, with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties of the wines, including density, acidity, alcohol content, etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters, and the final rank assigned is the median of the ranks given by the tasters.

A predictive model developed on these data is expected to give vineyards guidance on the quality and price to expect for their produce, without heavy reliance on the volatile judgments of wine tasters.



In [1]:

    
import pandas as pd
import numpy as np



In [4]:

    
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')



In [5]:

    
data = data_w.assign(type = 'white')

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.sample(5)









    Out[5]:

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality   type
3594            6.8              0.19         0.32             7.6      0.049                 37.0                 107.0  0.99332  3.12       0.44     10.7        7  white
2412            6.3              0.38         0.17             8.8      0.080                 50.0                 212.0  0.99803  3.47       0.66      9.4        4  white
5378           10.6              0.28         0.39            15.5      0.069                  6.0                  23.0  1.00260  3.12       0.66      9.2        5    red
2905            7.6              0.31         0.26             1.7      0.073                 40.0                 157.0  0.99380  3.10       0.46      9.8        5  white
6432            6.6              0.56         0.14             2.4      0.064                 13.0                  29.0  0.99397  3.42       0.62     11.7        7    red

Exercise 6.1

Show the frequency table of quality by type of wine
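One possible way to build such a table is `pd.crosstab`. The sketch below uses a small made-up frame named `toy` (a hypothetical stand-in, not the real wine data) so that it runs on its own; on the notebook's data the same call would presumably take `data['quality']` and `data['type']`.

```python
import pandas as pd

# Toy stand-in for the combined wine frame (values are made up for illustration).
toy = pd.DataFrame({
    "quality": [5, 6, 5, 7, 6, 5],
    "type": ["white", "white", "red", "red", "white", "red"],
})

# Rows are quality scores, columns are wine types, cells are counts.
freq = pd.crosstab(toy["quality"], toy["type"])
print(freq)
```

Passing `normalize="columns"` to `pd.crosstab` would give within-type proportions instead of counts, which may be easier to compare given that the two datasets differ in size.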



In [ ]:

SVM

Exercise 6.2

Standardize the features (not the quality)
Create a binary target for each type of wine
Create two linear SVMs, one for the white wines and one for the red wines.
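The three steps above could be sketched as follows. Everything here is illustrative: `toy` is a synthetic frame with only two features, and the cutoff `quality >= 6` for the binary target is an assumption, since the exercise does not fix one. The scaler sits inside a pipeline so that it is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (two features only, values made up).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, n),
    "volatile acidity": rng.normal(0.35, 0.15, n),
    "type": np.where(np.arange(n) < n // 2, "white", "red"),
})
score = 5.5 + 0.8 * (toy["alcohol"] - 10.5) - 2.0 * (toy["volatile acidity"] - 0.35)
toy["quality"] = np.clip(np.round(score + rng.normal(0, 0.5, n)), 1, 10).astype(int)

features = ["alcohol", "volatile acidity"]
accuracy = {}
for wine_type, group in toy.groupby("type"):
    X = group[features]
    y = (group["quality"] >= 6).astype(int)  # assumed cutoff for a "good" wine
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    # StandardScaler is fitted on the training split only, then a linear SVM.
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    model.fit(X_tr, y_tr)
    accuracy[wine_type] = model.score(X_te, y_te)

print(accuracy)
```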



In [ ]:

Exercise 6.3

Test the two SVMs using the different kernels ('poly', 'rbf', 'sigmoid')
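A kernel comparison might look like the sketch below. It runs on synthetic data from `make_classification` (an assumption, standing in for one wine type's standardized features and binary target) and scores each kernel with 5-fold cross-validation; the linear kernel is included only as a baseline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 11 features, like the chemical properties of the wines.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    # Mean accuracy over 5 cross-validation folds.
    kernel_scores[kernel] = cross_val_score(model, X, y, cv=5).mean()

for kernel, acc in kernel_scores.items():
    print(f"{kernel:8s} {acc:.3f}")
```

On the real data the loop would be run once per wine type, and a metric other than accuracy may be more informative if the binary target is imbalanced.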



In [ ]:

Exercise 6.4

Using the best SVM, find the parameters that give the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]
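One way to search this grid is `GridSearchCV`. In the sketch below the data are synthetic and the RBF kernel is assumed to be the "best SVM" from the previous exercise; the `svc__` prefix is how scikit-learn routes a parameter to the `SVC` step of a pipeline built with `make_pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features and binary target.
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=0
)

# Grid from the exercise, addressed to the SVC step of the pipeline.
param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__gamma": [0.01, 0.001, 0.0001],
}
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # kernel is an assumption
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```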



In [ ]:

Exercise 6.5

Compare the results with other methods



In [ ]:

Regularization

Exercise 6.6

Train a linear regression to predict wine quality (continuous)
Analyze the coefficients
Evaluate the RMSE
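A minimal sketch of these steps, on synthetic data with known coefficients (`true_coef` is made up) so that the fitted values can be sanity-checked. RMSE is computed as the square root of `mean_squared_error`, which should work across scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three standardized features and a quality-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_coef = np.array([0.8, -0.5, 0.0])  # made-up coefficients
y = 5.8 + X @ true_coef + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
linreg = LinearRegression().fit(X_tr, y_tr)

# Coefficients are comparable in size only because the features share a scale.
print("coefficients:", linreg.coef_.round(3))

rmse = np.sqrt(mean_squared_error(y_te, linreg.predict(X_te)))
print("RMSE:", round(rmse, 3))
```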



In [ ]:

Exercise 6.7

Estimate a ridge regression with alpha equal to 0.1 and 1.
Compare the coefficients with the linear regression
Evaluate the RMSE
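The comparison could be sketched as below, again on made-up data. Two of the three synthetic features are deliberately near-collinear, since that is the situation in which ridge shrinkage is easiest to see; the norm of the coefficient vector is tracked as a simple summary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a nearly collinear pair of features (columns 0 and 2).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
X[:, 2] = X[:, 0] + rng.normal(scale=0.1, size=120)
y = 5.8 + X @ np.array([0.8, -0.5, 0.0]) + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
norms = {"ols": np.linalg.norm(ols.coef_)}
ridge_rmse = {}
print("ols  ", ols.coef_.round(3))

for alpha in [0.1, 1]:
    ridge = Ridge(alpha=alpha).fit(X_tr, y_tr)
    norms[alpha] = np.linalg.norm(ridge.coef_)
    ridge_rmse[alpha] = np.sqrt(mean_squared_error(y_te, ridge.predict(X_te)))
    print(alpha, ridge.coef_.round(3), "RMSE:", round(ridge_rmse[alpha], 3))
```

A larger `alpha` pulls the coefficient vector toward zero, so its norm should not exceed that of the ordinary least squares fit on the same training data.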



In [ ]:

Exercise 6.8

Estimate a lasso regression with alpha equal to 0.01, 0.1 and 1.
Compare the coefficients with the linear regression
Evaluate the RMSE
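A sketch of the lasso comparison on synthetic data (the true coefficients are made up, with the third one set to zero). Unlike ridge, the L1 penalty can set coefficients to exactly zero, so the number of non-zero coefficients is reported next to the RMSE.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three standardized features, one of them irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.8 + X @ np.array([0.6, -0.4, 0.0]) + rng.normal(scale=0.3, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lasso_coefs, lasso_rmse = {}, {}
for alpha in [0.01, 0.1, 1]:
    lasso = Lasso(alpha=alpha).fit(X_tr, y_tr)
    lasso_coefs[alpha] = lasso.coef_
    lasso_rmse[alpha] = np.sqrt(mean_squared_error(y_te, lasso.predict(X_te)))
    n_nonzero = int(np.count_nonzero(lasso.coef_))
    print(alpha, lasso.coef_.round(3), "non-zero:", n_nonzero,
          "RMSE:", round(lasso_rmse[alpha], 3))
```

In this toy setup the largest penalty is expected to zero out every coefficient, which leaves a model that predicts the training mean and therefore has a much higher RMSE.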



In [ ]:

Exercise 6.9

Create a binary target
Train a logistic regression to predict wine quality (binary)
Analyze the coefficients
Evaluate the F1 score
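These steps might be sketched as follows. The data are synthetic and the cutoff `quality >= 6` is an assumption, as the exercise leaves the definition of the binary target open. Note that scikit-learn's `LogisticRegression` applies an L2 penalty with `C=1.0` by default, so this baseline is already mildly regularized.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three features and an integer quality-like score.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
raw = 5.8 + X @ np.array([0.8, -0.5, 0.0]) + rng.normal(scale=0.5, size=500)
quality = np.clip(np.round(raw), 1, 10)
y = (quality >= 6).astype(int)  # assumed cutoff for a "good" wine

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_tr)  # fitted on the training split only
logreg = LogisticRegression().fit(scaler.transform(X_tr), y_tr)

# Positive coefficients raise the log-odds of the positive class.
print("coefficients:", logreg.coef_.round(3))

f1 = f1_score(y_te, logreg.predict(scaler.transform(X_te)))
print("F1:", round(f1, 3))
```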



In [ ]:

Exercise 6.10

Estimate a regularized logistic regression using:
C = 0.01, 0.1 & 1.0
penalty = ['l1', 'l2']
Compare the coefficients and the F1 score
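The grid above could be looped over as in this sketch. The data and the `quality >= 6` cutoff are assumptions, and the `liblinear` solver is chosen here only because it accepts both penalties; other solver choices are possible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three standardized features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
raw = 5.8 + X @ np.array([0.8, -0.5, 0.0]) + rng.normal(scale=0.5, size=500)
y = (np.round(raw) >= 6).astype(int)  # assumed cutoff
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for penalty in ["l1", "l2"]:
    for C in [0.01, 0.1, 1.0]:
        # Smaller C means stronger regularization.
        clf = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        clf.fit(X_tr, y_tr)
        results[(penalty, C)] = {
            "coef": clf.coef_.ravel(),
            "f1": f1_score(y_te, clf.predict(X_te), zero_division=0),
        }

for (penalty, C), res in results.items():
    print(penalty, C, res["coef"].round(3), "F1:", round(res["f1"], 3))
```

Because `C` is the inverse of the regularization strength, the smallest value in the grid is the one that shrinks the coefficients most.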



In [ ]:



In [ ]:
