Exercise - Linear Regression

Over 370,000 used-car listings were scraped from Ebay-Kleinanzeigen. The content of the data is in German, so it has to be translated into English first. The data is available here. The fields included in the file data/autos.csv are:

  • seller : private seller or dealer
  • offerType
  • vehicleType
  • yearOfRegistration : the year in which the car was first registered
  • gearbox
  • powerPS : power of the car in PS (metric horsepower)
  • model
  • kilometer : how many kilometers the car has been driven
  • monthOfRegistration : the month in which the car was first registered
  • fuelType
  • brand
  • notRepairedDamage : whether the car has damage that has not yet been repaired
  • price : the asking price in the ad.

Goal
Given the characteristics/features of the car, the sale price of the car is to be predicted.


In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)

In [2]:
#Load the data
cars = pd.read_csv("data/autos.csv")

In [3]:
#Do a basic sanity check.
#1. Look at the first few records
cars.head()


Out[3]:
  seller offerType vehicleType  yearOfRegistration  gearbox  powerPS    model  kilometer  monthOfRegistration fuelType       brand notRepairedDamage  price
0 privat   Angebot  kleinwagen                2001  manuell       75     golf     150000                    6   benzin  volkswagen              nein   1500
1 privat   Angebot  kleinwagen                2008  manuell       69    fabia      90000                    7   diesel       skoda              nein   3600
2 privat   Angebot   limousine                1995  manuell      102      3er     150000                   10   benzin         bmw                ja    650
3 privat   Angebot      cabrio                2004  manuell      109  2_reihe     150000                    8   benzin     peugeot              nein   2200
4 privat   Angebot   limousine                1980  manuell       50   andere      40000                    7   benzin  volkswagen              nein      0

In [4]:
#2. What are the column names?
cars.columns


Out[4]:
Index(['seller', 'offerType', 'vehicleType', 'yearOfRegistration', 'gearbox',
       'powerPS', 'model', 'kilometer', 'monthOfRegistration', 'fuelType',
       'brand', 'notRepairedDamage', 'price'],
      dtype='object')

In [5]:
#3. What are the column types?
cars.dtypes


Out[5]:
seller                 object
offerType              object
vehicleType            object
yearOfRegistration      int64
gearbox                object
powerPS                 int64
model                  object
kilometer               int64
monthOfRegistration     int64
fuelType               object
brand                  object
notRepairedDamage      object
price                   int64
dtype: object

In [6]:
#4. Do label encoding
from sklearn import preprocessing
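
LabelEncoder assigns each distinct string an integer code, in alphabetical order. A minimal sketch of what it does to a single column, using values that appear in the fuelType field:

le_demo = preprocessing.LabelEncoder()
le_demo.fit_transform(['benzin', 'diesel', 'benzin'])
# -> array([0, 1, 0]); le_demo.classes_ is ['benzin', 'diesel']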

In [10]:
col_object = [x for x in cars.columns if cars[x].dtype == "object"]
col_object


Out[10]:
['seller',
 'offerType',
 'vehicleType',
 'gearbox',
 'model',
 'fuelType',
 'brand',
 'notRepairedDamage']

In [11]:
col_others = [x for x in cars.columns if cars[x].dtype != "object"]
col_others


Out[11]:
['yearOfRegistration', 'powerPS', 'kilometer', 'monthOfRegistration', 'price']

In [12]:
le = preprocessing.LabelEncoder()

In [13]:
# apply() calls fit_transform on each object column independently
cars_encoded_object = cars[col_object].apply(le.fit_transform)

In [14]:
cars_encoded_object.dtypes


Out[14]:
seller               int64
offerType            int64
vehicleType          int64
gearbox              int64
model                int64
fuelType             int64
brand                int64
notRepairedDamage    int64
dtype: object

In [15]:
# Recombine the encoded object columns with the numeric columns
cars_encoded = pd.concat([cars_encoded_object, cars[col_others]], axis=1)

In [16]:
cars_encoded.head()


Out[16]:
   seller  offerType  vehicleType  gearbox  model  fuelType  brand  notRepairedDamage  yearOfRegistration  powerPS  kilometer  monthOfRegistration  price
0       1          0            4        1    118         1     37                  1                2001       75     150000                    6   1500
1       1          0            4        1    103         3     31                  1                2008       69      90000                    7   3600
2       1          0            6        1     11         1      2                  0                1995      102     150000                   10    650
3       1          0            2        1      8         1     25                  1                2004      109     150000                    8   2200
4       1          0            6        1     40         1     37                  1                1980       50      40000                    7      0

In [17]:
cars_encoded.columns


Out[17]:
Index(['seller', 'offerType', 'vehicleType', 'gearbox', 'model', 'fuelType',
       'brand', 'notRepairedDamage', 'yearOfRegistration', 'powerPS',
       'kilometer', 'monthOfRegistration', 'price'],
      dtype='object')

In [18]:
#5. Ideally, we would do some exploratory analysis here.
#For practice, plot year of registration vs price

In [19]:
plt.scatter(cars_encoded.yearOfRegistration, 
           cars_encoded.price,
           s=20, alpha = 0.05)
plt.xlabel('year')
plt.ylabel('price')


Out[19]:
[Figure: scatter plot of price vs yearOfRegistration]

In [20]:
# Log-transform the price; the +1 offset keeps zero-price ads finite
cars_encoded['log10price'] = np.log10(cars_encoded.price + 1)
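
Why the +1 offset: some ads list a price of 0, and log10(0) is undefined. A quick check against the head() output further down:

np.log10(0 + 1)      # 0.0 -- zero-price ads map to 0
np.log10(1500 + 1)   # ~3.176, the log10price of row 0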

In [21]:
plt.scatter(cars_encoded.yearOfRegistration, cars_encoded.log10price,
           s=20, alpha = 0.05)
plt.xlabel('year')
plt.ylabel('log10price')


Out[21]:
[Figure: scatter plot of log10price vs yearOfRegistration]

In [22]:
#Plot month of registration against log10price

In [23]:
plt.scatter(cars_encoded.monthOfRegistration, cars_encoded.log10price,
           s=20, alpha = 0.05)
plt.xlabel('monthOfRegistration')
plt.ylabel('log10price')


Out[23]:
[Figure: scatter plot of log10price vs monthOfRegistration]

In [24]:
cars_encoded.head()


Out[24]:
   seller  offerType  vehicleType  gearbox  model  fuelType  brand  notRepairedDamage  yearOfRegistration  powerPS  kilometer  monthOfRegistration  price  log10price
0       1          0            4        1    118         1     37                  1                2001       75     150000                    6   1500    3.176381
1       1          0            4        1    103         3     31                  1                2008       69      90000                    7   3600    3.556423
2       1          0            6        1     11         1      2                  0                1995      102     150000                   10    650    2.813581
3       1          0            2        1      8         1     25                  1                2004      109     150000                    8   2200    3.342620
4       1          0            6        1     40         1     37                  1                1980       50      40000                    7      0    0.000000

In [25]:
#6. Separate the target from the features
y = cars_encoded.iloc[:, -1]   # the last column, log10price

In [26]:
X = cars_encoded.iloc[:, 0:12]  # every column except price and log10price

In [27]:
# From sklearn import the linear models
from sklearn import linear_model

In [28]:
model_ols = linear_model.LinearRegression()

In [29]:
model_ols.fit(X, y)


Out[29]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [30]:
model_ols.coef_


Out[30]:
array([ -3.44100506e-01,  -2.48931559e+00,  -1.45016778e-02,
        -2.15433457e-01,  -1.72855546e-04,   7.75851320e-02,
        -2.21562066e-03,   4.63360547e-01,   3.18689809e-02,
         5.22540972e-04,  -3.43232660e-06,   5.24214376e-03])
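
The raw array is hard to read. A small sketch that pairs each coefficient with its feature name, using only objects already defined above:

pd.Series(model_ols.coef_, index=X.columns)
# offerType carries the largest magnitude (~ -2.49) in this fit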

In [31]:
model_ols.intercept_


Out[31]:
-59.912081033556944

In [32]:
y_ols = model_ols.predict(X)

In [33]:
from sklearn import metrics

In [34]:
#7. Report the diagnostics and discuss the results
metrics.mean_squared_error(y_ols, y)


Out[34]:
0.26061808366881362
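
Since the target is log10price, this MSE is in squared log10 units. A sketch that converts it to a more interpretable scale:

rmse_log10 = np.sqrt(metrics.mean_squared_error(y, y_ols))
# sqrt(0.2606) ~ 0.51, i.e. a typical prediction is off by a factor of about 10**0.51 ~ 3.2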

In [35]:
#8. Build an L2-regularized regression using sklearn's linear_model.Ridge

In [36]:
from sklearn.linear_model import Ridge

In [120]:
# Note: 10e10 == 1e11, an extremely strong penalty
model_ridge = Ridge(alpha=10e10)

In [121]:
model_ridge.fit(X, y)


Out[121]:
Ridge(alpha=100000000000.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [122]:
model_ridge.coef_


Out[122]:
array([  3.12292970e-12,  -1.02760799e-10,   6.50003153e-09,
        -1.54680630e-07,  -2.87347351e-06,   4.69374603e-07,
        -2.25296042e-06,   1.33893300e-07,   3.76199402e-06,
         4.33738513e-05,  -5.73678960e-06,   3.34487916e-07])

In [123]:
# With such a large alpha, the L2 penalty shrinks all coefficients toward zero
sum(model_ridge.coef_)


Out[123]:
3.7062097352428309e-05
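
To make the shrinkage visible, a sketch comparing the total coefficient magnitude of the OLS fit against this heavily penalized ridge fit:

np.abs(model_ols.coef_).sum()    # on the order of 1 for OLS
np.abs(model_ridge.coef_).sum()  # on the order of 1e-5 with alpha = 1e11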

In [108]:
y_ridge = model_ridge.predict(X)

In [42]:
metrics.mean_squared_error(y_ridge,y)


Out[42]:
0.26071136846619347

In [43]:
#9. Try with different values of alpha (0.001, 0.01, 0.05, 0.1, 0.5); a sketch follows below
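
A minimal sketch for this step, fitting one ridge model per alpha and reporting the training MSE (all names are defined above):

for a in [0.001, 0.01, 0.05, 0.1, 0.5]:
    m = Ridge(alpha=a)
    m.fit(X, y)
    print(a, metrics.mean_squared_error(y, m.predict(X)))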

10. The following code is from the sklearn official documentation:

# Author: Fabian Pedregosa -- <fabian.pedregosa@inria.fr>
# License: BSD 3 clause

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# X is the 10x10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

#compute paths
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
clf = linear_model.Ridge(fit_intercept=False)

coefs = []
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

#Display results
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

Can you modify this code to plot for 10 values of alpha and see how the weights change?


In [57]:
# Standardize the features (note: the ridge loop below still fits on the unscaled X)
X_std = preprocessing.StandardScaler().fit_transform(X)

In [136]:
# Author: Fabian Pedregosa -- <fabian.pedregosa@inria.fr>
# License: BSD 3 clause

# Compute coefficient paths on the cars data for 20 values of alpha
n_alphas = 20
alphas = np.logspace(-10, 10, n_alphas)
clf = linear_model.Ridge(fit_intercept=False)

coefs = []
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

#Display results
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()



In [137]:
#11. Build an L1 linear model (Lasso)
from sklearn.linear_model import Lasso

In [138]:
model_lasso = Lasso(alpha=0.1, fit_intercept = False)

In [139]:
model_lasso.fit(X, y)


Out[139]:
Lasso(alpha=0.1, copy_X=True, fit_intercept=False, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [140]:
# L1 regularization drives several coefficients exactly to zero (sparsity)
model_lasso.coef_


Out[140]:
array([ -0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,   9.06125056e-05,   6.15534174e-02,
        -3.52407975e-03,   0.00000000e+00,   2.04062618e-03,
         8.09221369e-04,  -5.82902308e-06,   8.85392698e-04])
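
A sketch that names the features the lasso kept, i.e. those with nonzero coefficients:

lasso_coefs = pd.Series(model_lasso.coef_, index=X.columns)
lasso_coefs[lasso_coefs != 0]
# here: model, fuelType, brand, yearOfRegistration, powerPS, kilometer, monthOfRegistration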

In [141]:
np.sum(model_lasso.coef_)


Out[141]:
0.061849361425333364

In [142]:
#12. Feature selection from the L1 linear model

In [143]:
from sklearn.feature_selection import SelectFromModel

In [144]:
# Note: this instance is created but never assigned, so it has no effect;
# sfm below still wraps model_lasso (max_iter=1000)
Lasso(alpha=0.1, fit_intercept = False, max_iter=10000000)


Out[144]:
Lasso(alpha=0.1, copy_X=True, fit_intercept=False, max_iter=10000000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [145]:
# Keep only the features whose lasso coefficient magnitude is at least 0.1
sfm = SelectFromModel(model_lasso, threshold=0.1)

In [146]:
sfm


Out[146]:
SelectFromModel(estimator=Lasso(alpha=0.1, copy_X=True, fit_intercept=False, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
        prefit=False, threshold=0.1)

In [147]:
sfm.fit(X, y)


Out[147]:
SelectFromModel(estimator=Lasso(alpha=0.1, copy_X=True, fit_intercept=False, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
        prefit=False, threshold=0.1)

In [148]:
# Every coefficient falls below the 0.1 cutoff, so no features survive (see the warning)
featureSel = sfm.transform(X)


/Users/amitkaps/miniconda3/lib/python3.5/site-packages/sklearn/feature_selection/base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict.
  UserWarning)

In [149]:
featureSel.shape


Out[149]:
(249176, 0)

In [150]:
featureSel[:,:3]


Out[150]:
array([], shape=(249176, 0), dtype=float64)
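
Nothing survives because the largest lasso coefficient magnitude is only ~0.06, well below the 0.1 threshold. A sketch of a retry with a lower, assumed cutoff of 1e-4 (not part of the original exercise):

sfm_low = SelectFromModel(model_lasso, threshold=1e-4)  # 1e-4 is an assumed cutoff
sfm_low.fit(X, y)
X.columns[sfm_low.get_support()]  # names of the features with |coef| >= 1e-4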

In [151]:
#13. Find the generalization error.
#Split the dataset in two: train (80%) and test (20%)
from sklearn.model_selection import train_test_split

In [152]:
# An 80/20 split; passing a fixed random_state would make it reproducible
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [86]:
x_train.shape, y_train.shape


Out[86]:
((199340, 12), (199340,))

In [88]:
#14. Build an L2-regularized model on the training set and predict on the test set

In [89]:
model_ridge = Ridge(alpha=1.0, fit_intercept=True, normalize=True)

In [90]:
model_ridge.fit(x_train, y_train)


Out[90]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)

In [91]:
ridge_output = model_ridge.predict(x_test)

In [92]:
#15. Report the RMSE
# Note: the expression below is the sum of squared errors (SSE), not the RMSE;
# see the sketch after the output
np.sum((y_test - ridge_output)**2)


Out[92]:
14217.47449793072
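
The value above is the raw sum of squared errors. A sketch of the RMSE itself (the exact number depends on the random split):

rmse = np.sqrt(metrics.mean_squared_error(y_test, ridge_output))
# equivalent to np.sqrt(np.sum((y_test - ridge_output)**2) / len(y_test)), ~0.53 on this split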
