Exercise - Linear Regression

Over 370,000 used-car listings were scraped from Ebay-Kleinanzeigen. The content of the data is in German, so it has to be translated to English first. The data is available here. The fields included in the file data/autos.csv are:

  • seller : private or dealer
  • offerType
  • vehicleType
  • yearOfRegistration : at which year the car was first registered
  • gearbox
  • powerPS : power of the car in PS
  • model
  • kilometer : how many kilometers the car has driven
  • monthOfRegistration : at which month the car was first registered
  • fuelType
  • brand
  • notRepairedDamage : if the car has a damage which is not repaired yet
  • price : the price on the ad to sell the car.

Goal
Given the characteristics/features of the car, the sale price of the car is to be predicted.


In [1]:
#import the required libraries

In [2]:
#Load the data

In [3]:
#Do basic sanity check.
#1. Look into the first few records

In [4]:
#2. What are the column names?

In [5]:
#3. What are the column types?

In [6]:
#4. Do label encoding
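
A minimal sketch of label encoding, assuming a small made-up DataFrame in place of data/autos.csv (the rows and the `toy`, `encoded` and `encoders` names are illustrative, not part of the exercise data). Each non-numeric column is mapped to integer codes with sklearn's LabelEncoder, and the fitted encoders are kept so the codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows standing in for data/autos.csv (values are made up).
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", "manuell", "manuell"],
    "fuelType": ["benzin", "diesel", "benzin", "diesel"],
    "powerPS": [75, 150, 90, 110],
    "price": [1500, 9800, 3200, 5400],
})

encoded = toy.copy()
encoders = {}  # one fitted encoder per column, kept for inverse_transform
for col in encoded.columns:
    # Only non-numeric (categorical) columns need encoding.
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        enc = LabelEncoder()
        encoded[col] = enc.fit_transform(encoded[col].astype(str))
        encoders[col] = enc

print(encoded)
print({col: list(enc.classes_) for col, enc in encoders.items()})
```

LabelEncoder assigns codes in sorted label order, so the integers carry no meaning beyond identity. A linear model will still treat them as ordered numbers, which is worth keeping in mind when interpreting the coefficients.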

In [7]:
#5. Ideally, we would do some exploratory analysis. 
#For practice, plot: year of registration vs price

In [8]:
#6. Build OLS model - sklearn
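
A minimal sketch of an OLS fit with sklearn's LinearRegression. It assumes synthetic data: the three feature columns borrow names from the field list, but the values and the linear price rule are made up so that the recovered coefficients can be checked against known ones. On the real data, X would be the encoded feature columns and y the price column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the encoded autos data (all values are made up).
rng = np.random.default_rng(0)
n = 200
year = rng.integers(1995, 2016, size=n)
power_ps = rng.integers(50, 251, size=n)
kilometer = rng.integers(5_000, 150_001, size=n)
X = np.column_stack([year, power_ps, kilometer]).astype(float)

# Assumed linear price rule plus Gaussian noise, used only to check the fit.
y = (200.0 * (year - 1990) + 50.0 * power_ps - 0.02 * kilometer
     + rng.normal(0.0, 500.0, size=n))

ols = LinearRegression()
ols.fit(X, y)

print("coefficients:", ols.coef_)
print("intercept:", ols.intercept_)
print("R^2 on the training data:", ols.score(X, y))
```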

In [9]:
#7. Report the diagnostics. And discuss the results

In [10]:
#8. Build L2 Regression using sklearn - linear_model.Ridge

In [11]:
#9. Try with different values of alpha (0.001, 0.01, 0.05, 0.1, 0.5)
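
One way to compare the alpha values is sketched below, assuming synthetic standardized features rather than the autos data (the data-generating rule and the `coef_norms` name are illustrative). For each alpha it fits a Ridge model and records the Euclidean norm of the coefficient vector, which shrinks as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): two informative features and one pure-noise feature.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(40, 3))
y = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(0.0, 0.5, size=40)

# Standardize so the penalty treats every feature on the same scale.
X = StandardScaler().fit_transform(X_raw)

alphas = [0.001, 0.01, 0.05, 0.1, 0.5]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))
    print(f"alpha={alpha:<6} coef={ridge.coef_} R^2={ridge.score(X, y):.4f}")
```

With alphas this small relative to the sample size the coefficients change only slightly. Larger values, or features left unscaled, would make the shrinkage more visible.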

10. The following code is from sklearn official documentation

# Author: Fabian Pedregosa -- <fabian.pedregosa@inria.fr>
# License: BSD 3 clause

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# X is the 10x10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

#compute paths
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
clf = linear_model.Ridge(fit_intercept=False)

coefs = []
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

#Display results
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

Can you modify this code to plot only 10 values of alpha and see how the weights change?
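
One possible modification, offered as a sketch rather than the only answer: set n_alphas to 10 and add point markers so the individual alpha values are visible on each coefficient path. Everything else follows the snippet above. Creating the estimator inside the loop is a stylistic choice here, and behaves the same as calling set_params on a single estimator.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# X is the 10x10 Hilbert matrix, as in the original example.
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

# Only 10 values of alpha, log-spaced over the same range as before.
n_alphas = 10
alphas = np.logspace(-10, -2, n_alphas)

coefs = []
for a in alphas:
    clf = linear_model.Ridge(alpha=a, fit_intercept=False)
    clf.fit(X, y)
    coefs.append(clf.coef_)

# One line per weight; markers show the 10 alpha values actually fitted.
ax = plt.gca()
ax.plot(alphas, coefs, marker='o')
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients for 10 values of alpha')
plt.axis('tight')
plt.show()
```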


In [ ]:


In [12]:
#11. Build an L1 Linear Model

In [13]:
#12. Feature selection from L1 Linear Model
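
A minimal sketch of L1-based feature selection using sklearn's Lasso. It assumes synthetic data: the feature names are borrowed from the field list, but the values are random and the rule that only the first three features drive the target is made up for illustration. Features whose Lasso coefficient is exactly zero are treated as dropped.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (made up): only the first three features influence y.
rng = np.random.default_rng(0)
feature_names = ["yearOfRegistration", "powerPS", "kilometer",
                 "monthOfRegistration", "gearbox"]
X = rng.normal(size=(200, len(feature_names)))
y = (3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2]
     + rng.normal(0.0, 0.1, size=200))

# The L1 penalty drives the coefficients of uninformative features to exactly zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

selected = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0.0]
print(dict(zip(feature_names, lasso.coef_)))
print("selected features:", selected)
```

Which features survive depends on alpha. On the real data the features would normally be standardized first, so that the penalty is comparable across columns.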

In [14]:
#13. Find the generalization error.
#Split dataset into two: Train and Test : 80% and 20%

In [15]:
#14. Build L2 Regularization model on train and predict on test

In [16]:
#15. Report the RMSE
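
Steps 13 to 15 can be sketched together as follows, again assuming synthetic data in place of the autos file (the values and the price rule are made up, and alpha=0.1 is an arbitrary choice). The data is split 80/20, a Ridge model is fit on the training part, and the RMSE is computed on the held-out test part as the square root of the mean squared error.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded autos data (all values are made up).
rng = np.random.default_rng(0)
n = 200
year = rng.integers(1995, 2016, size=n)
power_ps = rng.integers(50, 251, size=n)
kilometer = rng.integers(5_000, 150_001, size=n)
X = np.column_stack([year, power_ps, kilometer]).astype(float)
y = (200.0 * (year - 1990) + 50.0 * power_ps - 0.02 * kilometer
     + rng.normal(0.0, 500.0, size=n))

# Step 13: 80% train, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 14: fit the L2-regularized model on train, predict on test.
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

# Step 15: RMSE on the held-out data estimates the generalization error.
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print("test RMSE:", rmse)
```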