Exercise - Linear Regression

Over 370,000 used cars were scraped from eBay Kleinanzeigen. The content of the data is in German, so it has to be translated to English first. The data is available here. The fields included in the file data/autos.csv are:

seller : private or dealer
offerType
vehicleType
yearOfRegistration : the year in which the car was first registered
gearbox
powerPS : power of the car in PS
model
kilometer : how many kilometers the car has been driven
monthOfRegistration : the month in which the car was first registered
fuelType
brand
notRepairedDamage : whether the car has damage that has not been repaired yet
price : the asking price of the car in the ad

Goal
Given the features of a car, predict its sale price.



In [1]:

    
#import the required libraries



In [2]:

    
#Load the data



In [3]:

    
#Do basic sanity check.
#1. Look into the first few records



In [4]:

    
#2. What are the column names?



In [5]:

    
#3. What are the column types?
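
The loading step and the three sanity checks above can be sketched as follows. This is only a sketch: the three rows are a made-up stand-in for data/autos.csv, with a subset of the listed fields and English values assumed for the categorical columns.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for data/autos.csv (values are made up)
sample_csv = io.StringIO(
    "seller,vehicleType,yearOfRegistration,gearbox,powerPS,kilometer,fuelType,brand,price\n"
    "private,sedan,2004,manual,102,150000,petrol,volkswagen,3500\n"
    "private,wagon,2011,automatic,190,125000,diesel,audi,18300\n"
    "dealer,small car,2008,manual,69,90000,petrol,skoda,3600\n"
)
cars = pd.read_csv(sample_csv)
# With the real file this would be pd.read_csv("data/autos.csv");
# an explicit encoding argument may be needed, depending on how the file was saved.

print(cars.head())             # 1. first few records
print(cars.columns.tolist())   # 2. column names
print(cars.dtypes)             # 3. column types
```

Text columns show up with a string-like dtype in the last printout, which is a quick way to see which fields need encoding before a linear model can use them.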



In [6]:

    
#4. Do label encoding
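
One possible way to do the label encoding is sklearn's LabelEncoder, applied column by column. The small frame below is a hypothetical stand-in for the real data, and the list of categorical columns is an assumption to adapt to the actual file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the cars table (values are made up)
cars = pd.DataFrame({
    "gearbox": ["manual", "automatic", "manual", "manual"],
    "fuelType": ["petrol", "diesel", "petrol", "lpg"],
    "powerPS": [102, 190, 69, 75],
})

categorical_cols = ["gearbox", "fuelType"]  # assumed list of text columns

# Keep one fitted encoder per column so the codes can be mapped back later
encoders = {}
for col in categorical_cols:
    enc = LabelEncoder()
    cars[col] = enc.fit_transform(cars[col])
    encoders[col] = enc

print(cars)
print(encoders["fuelType"].classes_)  # position in this array = integer code
```

Label encoding assigns integers in sorted order of the category names, so a linear model will treat the codes as an ordered numeric scale. Missing values would need to be filled before encoding.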



In [7]:

    
#5. Ideally, we would do some exploratory analysis. 
#For practice, plot: year of registration vs price
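
A scatter plot is probably the simplest way to look at this relationship. The sketch below uses synthetic numbers in place of the real yearOfRegistration and price columns, so the shape of the cloud says nothing about the actual data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the two columns (not real data)
rng = np.random.default_rng(0)
year = rng.integers(1995, 2016, size=300)
price = 600 * (year - 1995) + rng.normal(0, 1500, size=300)

fig, ax = plt.subplots()
ax.scatter(year, price, s=8, alpha=0.4)  # small, translucent points reduce overplotting
ax.set_xlabel("yearOfRegistration")
ax.set_ylabel("price")
ax.set_title("Year of registration vs price (synthetic data)")
plt.show()
```

With the real table, the two arrays would be replaced by the corresponding DataFrame columns.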



In [51]:

    
#6. Build OLS Model - sklearn
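
A minimal OLS fit with sklearn might look like this. The features and the price formula are synthetic assumptions chosen so the recovered coefficients can be checked by eye; with the real data, X would hold the encoded feature columns and y the price column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: price depends linearly on car age offset and power, plus noise
rng = np.random.default_rng(0)
n = 500
year = rng.integers(1995, 2016, size=n)
power = rng.integers(50, 300, size=n)
X = np.column_stack([year - 1995, power])
y = 1000 + 500 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 300, size=n)

ols = LinearRegression()  # ordinary least squares with an intercept
ols.fit(X, y)

print("intercept:", ols.intercept_)
print("coefficients:", ols.coef_)
```

The fitted coefficients should land close to the 500 and 20 used to generate the data, which is a useful self-check before moving to the real table.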



In [9]:

    
#7. Report the diagnostics and discuss the results
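
sklearn's LinearRegression does not print a summary table, so the diagnostics have to be computed by hand. The sketch below reports R^2, RMSE and the mean residual on synthetic data; which diagnostics the exercise expects is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression problem with known coefficients and unit noise
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1.0, size=400)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
residuals = y - pred

r2 = r2_score(y, pred)                       # share of variance explained
rmse = np.sqrt(mean_squared_error(y, pred))  # error in the units of y

print(f"R^2: {r2:.3f}  RMSE: {rmse:.3f}")
print(f"mean residual: {residuals.mean():.2e}")
```

These are in-sample numbers, so they tend to be optimistic. Plotting the residuals against the predictions is a common next check for non-linearity or uneven variance.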



In [10]:

    
#8. Build L2 Regression using sklearn - linear_model.Ridge



In [11]:

    
#9. Try with different values of alpha (0.001, 0.01, 0.05, 0.1, 0.5)
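
A loop over the alpha values could be sketched as below, reading the last value in the list as 0.5. The data is synthetic, so only the trend across alphas is meaningful, not the numbers themselves.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with four features, one of which has no effect
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([4.0, -3.0, 2.0, 0.0]) + rng.normal(0, 1.0, size=200)

alphas = [0.001, 0.01, 0.05, 0.1, 0.5]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(np.linalg.norm(ridge.coef_))  # L2 norm of the weights
    print(alpha, ridge.coef_.round(3))

print(coef_norms)
```

The norm of the coefficient vector shrinks as alpha grows, because a larger alpha puts more weight on the L2 penalty.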

10. The following code is from the official sklearn documentation

# Author: Fabian Pedregosa -- <fabian.pedregosa@inria.fr>
# License: BSD 3 clause

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# X is the 10x10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

# Compute paths
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
clf = linear_model.Ridge(fit_intercept=False)

coefs = []
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

# Display results
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

Can you modify this code to plot the paths for 10 values of alpha and see how the weights change?
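
One possible modification is to change n_alphas from 200 to 10 and add markers so the individual fits are visible. The sketch keeps the same alpha range as the code above; the marker and the use of plt.subplots are choices made here, not part of the original example.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

# X is the 10x10 Hilbert matrix, as in the code above
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

# Only 10 regularization strengths instead of 200
n_alphas = 10
alphas = np.logspace(-10, -2, n_alphas)
clf = linear_model.Ridge(fit_intercept=False)

coefs = []
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

# One line per weight, one marker per alpha value
fig, ax = plt.subplots()
ax.plot(alphas, coefs, marker='o')
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
ax.set_xlabel('alpha')
ax.set_ylabel('weights')
ax.set_title('Ridge coefficients for 10 values of alpha')
plt.show()
```

Because the Hilbert matrix is badly conditioned, sklearn may emit an ill-conditioning warning for the smallest alphas; the fits still complete.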



In [ ]:



In [12]:

    
#11. Build an L1 Linear Model
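
In sklearn the L1-penalized linear model is linear_model.Lasso. The sketch below fits it on synthetic data where only the first three features matter; the alpha value of 0.2 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the last three features have zero true coefficient
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
true_coef = np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(0, 0.5, size=300)

lasso = Lasso(alpha=0.2)  # alpha scales the L1 penalty
lasso.fit(X, y)

print(lasso.coef_.round(3))
```

Unlike Ridge, the L1 penalty can set coefficients to exactly zero, which is what makes it usable for feature selection.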



In [13]:

    
#12. Feature selection from L1 Linear Model
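
One way to turn the L1 model into a feature selector is sklearn's SelectFromModel, which keeps the features whose Lasso coefficient is non-negligible. The data and the generic feature names x0..x5 are synthetic placeholders, so the selected set below says nothing about the car features.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: only x0 and x2 influence the target
rng = np.random.default_rng(4)
feature_names = np.array([f"x{i}" for i in range(6)])
X = rng.normal(size=(300, 6))
y = 5.0 * X[:, 0] - 4.0 * X[:, 2] + rng.normal(0, 0.5, size=300)

selector = SelectFromModel(Lasso(alpha=0.2))  # alpha is an arbitrary choice
selector.fit(X, y)

mask = selector.get_support()        # boolean mask of kept features
X_selected = selector.transform(X)   # matrix restricted to kept columns

print("kept:", feature_names[mask])
print(X_selected.shape)
```

The same result could be obtained by inspecting lasso.coef_ directly; SelectFromModel is convenient when the selection has to sit inside a pipeline.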



In [14]:

    
#13. Find the generalization error.
#Split the dataset into two: train (80%) and test (20%)
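
The 80/20 split can be done with sklearn's train_test_split. The arrays below are placeholders for the feature matrix and the price vector, and the random_state value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Rows are shuffled before splitting by default, so train and test are random, non-overlapping subsets.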



In [15]:

    
#14. Build an L2-regularized model on the train set and predict on the test set



In [16]:

    
#15. Report the RMSE
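
Steps 13 to 15 can be sketched end to end as follows: split, fit Ridge on the train part, predict on the test part, and report RMSE as the square root of the mean squared error. The data is synthetic with unit noise, and alpha=0.1 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with noise standard deviation 1.0
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the train set only; the test set is used purely for evaluation
ridge = Ridge(alpha=0.1).fit(X_train, y_train)

rmse_train = np.sqrt(mean_squared_error(y_train, ridge.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, ridge.predict(X_test)))

print(f"train RMSE: {rmse_train:.3f}")
print(f"test RMSE:  {rmse_test:.3f}")
```

The test RMSE is the estimate of the generalization error; a test value far above the train value would suggest overfitting.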