Multivariate Linear Regression

predict prices of houses (Ng ML Example 1, part 2)

Suppose you are selling your house, and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.

The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# import data
data = pd.read_csv('mlclass-ex1/ex1data2.txt', header=None)

Notation

Features: $x_{1}, x_{2}, x_{3}, ..., x_{n}$
Prediction: $y$

$m =$ number of training examples
$n =$ number of features
$x^{(i)} =$ input (features) of the $i$-th training example (the superscript is the index into the training set)
$x_{j}^{(i)} =$ value of feature $j$ in the $i$-th training example

For this data set:
$x_{1}$: size (feet$^{2}$)
$x_{2}$: # of bedrooms
$y$: price ($)

Feature Scaling Mean Normalization

$x_{j}^{(i)} = \frac{x_{j}^{(i)} - \bar{x}_{j}}{s_{j}}$

Hypothesis

$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$

For convienience of notation, define $x_{0} = 1$, so that $x_{0}^{(i)} = 1$

Now, the feature vector can be zero-indexed:

$x^{(i)} = \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix} \in \mathbb{R}^{n+1}$

And, since the parameter vector is zero-indexed:

$\theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \theta_{2} \\ ... \\ \theta_{n} \end{bmatrix} \in \mathbb{R}^{n+1}$

The new hypothesis form is: $h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$

Hypothesis Calculation

Construct the following ($m\times[n+1]$) design matrix:

$X = \begin{bmatrix} x_{0}^{(0)} && x_{1}^{(0)} && ... && x_{n}^{(0)} \\ x_{0}^{(1)} && x_{1}^{(1)} && ... && x_{n}^{(1)} \\ x_{0}^{(2)} && x_{1}^{(2)} && ... && x_{n}^{(2)} \\ ... && ... && ... && ... \\ x_{0}^{(m-1)} && x_{1}^{(m-1)} && ... && x_{n}^{(m-1)} \end{bmatrix}$

To create a ($m\times1$) hypothesis matrix, perform matrix-vector multiplication:

$X \times \theta = \begin{bmatrix} x_{0}^{(0)} \times \theta_{0} + x_{1}^{(0)} \times \theta_{1} \\ x_{0}^{(1)} \times \theta_{0} + x_{1}^{(1)} \times \theta_{1} \\ x_{0}^{(2)} \times \theta_{0} + x_{1}^{(2)} \times \theta_{1} \\ ... \\ x_{0}^{(m-1)} \times \theta_{0} + x_{1}^{(m-1)} \times \theta_{1} \end{bmatrix}$

Alternatively, to compute each hypothesis return value per training sample:

Perform the inner/dot/scalar product of vectors of $\theta$-transpose and $X^{(i)}$:

$\theta^{T} = \begin{bmatrix}\theta_{0}&&\theta_{1}&&\theta_{2}&&...&&\theta_{n}\end{bmatrix}$

$h_{\theta}(x) = \theta^{T}X^{(i)} = \begin{bmatrix}\theta_{0}&&\theta_{1}&&\theta_{2}&&...&&\theta_{n}\end{bmatrix} \times \begin{bmatrix} x_{0}^{(i)} \\ x_{1}^{(i)} \\ x_{2}^{(i)} \\ ... \\ x_{n}^{(i)} \end{bmatrix}$


In [3]:
# initial data vectors
x_1 = data[0].values
x_2 = data[1].values
y = data[2].values

# training sample size
m = data[0].count()

# number of features
n = 2

# proper vector shapes
x_1.shape = (m,1)
x_2.shape = (m, 1)
y.shape = (m,1)

# Feature Scaling (unnecessary if using the normal equation method)
# option1: normalization
#x_1 = np.true_divide(x_1, np.amax(x_1))
#x_2 = np.true_divide(x_2, np.amax(x_2))
# option2: mean normalization
xbar_1 = np.mean(x_1)
xbar_2 = np.mean(x_2)

s_1 = np.std(x_1)
s_2 = np.std(x_2)

x_1 = np.divide((np.subtract(x_1,xbar_1)),s_1)
x_2 = np.divide((np.subtract(x_2,xbar_2)),s_2)

# design matrix
X = np.hstack((np.ones((m,1)),x_1,x_2))

# theta parameters 
theta = np.zeros(((n+1),1))

#gradient descent, number of iterations
iterations = 50

# learning rates 
alpha = [-0.3, -0.1, -0.03, -0.01, -0.003, -0.001, 0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3]

Cost Function (iterative and vectorized forms)

$J(\theta_{0},\theta_{1},...,\theta_{n}) = J(\theta) = \frac{1}{2m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}\sum\limits_{i=0}^{m-1}(\theta^{T}X^{(i)} - y^{(i)})^2 = \frac{1}{2m}(X\theta - y)^{T}(X\theta - y)$

Vectorized Implementation of Gradient Descent

Minimize the objective/cost function $J$

$\delta = \begin{bmatrix} \delta_{0} \\ \delta_{1} \\ \delta_{2} \\ ... \\ \delta_{n} \end{bmatrix}$

$\delta_{j} = \frac{1}{m}\sum\limits_{i=0}^{m-1}(\theta^{T}X^{(i)} - y^{(i)}) \cdot X_{j}^{(i)}$


In [4]:
# cost function: iterative implementation
    #def J():
    #    summation = 0
    #    for i in xrange(0,m):
    #        summation += ((theta.T.dot(X[i])) - y[i])**2
    #        
    #    return (1.0/(2*m)) * summation
    # 
    
# cost function: vectorized implementation
def J():
    return (1.0/(2*m)) * (((X.dot(theta)) - y).T).dot(X.dot(theta) - y)

# delta-vector function for derivatives
def deltas():
    delta = np.zeros(((n+1),1))
    for j in xrange(0,(n+1)):
        summation = 0
        #for i in xrange(0,m):
        #summation += ((theta.T.dot(X[i]) - y[i]) * X[i][j])
        
        summation = np.sum((X.dot(theta) - y) * X[:,j])
        
        delta[j] = (1.0/m) * summation
    print np.sum(delta)
    return delta

$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial\theta_{j}} J(\theta) = \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)} = \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=0}^{m-1}(\theta^{T}x^{(i)} - y^{(i)}) \cdot x_{j}^{(i)} = \theta_{j} - \alpha\delta_{j}$

repeat until convergence {
$\theta := \theta - \alpha \delta$
}


In [5]:
# gradient descent, test multiple alphas
for a in xrange(0,len(alpha)):
    
    # reset theta
    theta = np.zeros(((n+1),1))
    
    # reset vector J_values, store cost function values for plotting
    J_values = np.zeros((iterations,1))
    
    for iteration in xrange(0,iterations):
        theta = theta - (alpha[a] * deltas())
        J_values[iteration] = J()
    
    # visualize the cost function (2-D)
    cost_x = np.arange(iterations)
    cost_x.shape = (iterations,1)
    plt.plot(cost_x,J_values)
    plt.title("Learning Rate: " + str(alpha[a]))
    plt.xlabel('iterations')
    plt.ylabel(r"$J(\theta)$")
    plt.show()


-15999395.0
-241590864.5
-3648022053.95
-55085133014.6
-831785508521.0
-1.25599611787e+13
-1.89655413798e+14
-2.86379674835e+15
-4.32433309001e+16
-6.52974296591e+17
-9.85991187852e+18
-1.48884669366e+20
-2.24815850742e+21
-3.39471934621e+22
-5.12602621277e+23
-7.74029958129e+24
-1.16878523677e+26
-1.76486570753e+27
-2.66494721837e+28
-4.02407029974e+29
-6.0763461526e+30
-9.17528269043e+31
-1.38546768626e+33
-2.09205620625e+34
-3.15900487143e+35
-4.77009735586e+36
-7.20284700735e+37
-1.08762989811e+39
-1.64232114615e+40
-2.47990493068e+41
-3.74465644533e+42
-5.65443123244e+43
-8.53819116099e+44
-1.28926686531e+46
-1.94679296662e+47
-2.93965737959e+48
-4.43888264318e+49
-6.70271279121e+50
-1.01210963147e+52
-1.52828554352e+53
-2.30771117072e+54
-3.48464386779e+55
-5.26181224036e+56
-7.94533648294e+57
-1.19974580892e+59
-1.81161617148e+60
-2.73554041893e+61
-4.13066603258e+62
-6.2373057092e+63
-9.41833162089e+64
-15999395.0
-91196551.5
-519820343.55
-2962975958.24
-16888962961.9
-96267088883.1
-548722406633.0
-3.12771771781e+12
-1.78279909915e+13
-1.01619548652e+14
-5.79231427314e+14
-3.30161913569e+15
-1.88192290734e+16
-1.07269605719e+17
-6.11436752596e+17
-3.4851894898e+18
-1.98655800919e+19
-1.13233806524e+20
-6.45432697184e+20
-3.67896637395e+21
-2.09701083315e+22
-1.1952961749e+23
-6.81318819691e+23
-3.88351727224e+24
-2.21360484518e+25
-1.26175476175e+26
-7.19200214198e+26
-4.09944122093e+27
-2.33668149593e+28
-1.33190845268e+29
-7.59187818027e+29
-4.32737056275e+30
-2.46660122077e+31
-1.40596269584e+32
-8.01398736628e+32
-4.56797279878e+33
-2.60374449531e+34
-1.48413436232e+35
-8.45956586525e+35
-4.82195254319e+36
-2.74851294962e+37
-1.56665238128e+38
-8.92991857331e+38
-5.09005358679e+39
-2.90133054447e+40
-1.65375841035e+41
-9.42642293898e+41
-5.37306107522e+42
-3.06264481287e+43
-1.74570754334e+44
-15999395.0
-38558541.95
-92926086.0995
-223951867.5
-539724000.675
-1300734841.63
-3134770968.32
-7554798033.65
-18207063261.1
-43879022459.2
-105748444127.0
-254853750345.0
-614197538332.0
-1.48021606738e+12
-3.56732072239e+12
-8.59724294096e+12
-2.07193554877e+13
-4.99336467254e+13
-1.20340088608e+14
-2.90019613546e+14
-6.98947268645e+14
-1.68446291743e+15
-4.05955563102e+15
-9.78352907075e+15
-2.35783050605e+16
-5.68237151958e+16
-1.36945153622e+17
-3.30037820229e+17
-7.95391146751e+17
-1.91689266367e+18
-4.61971131945e+18
-1.11335042799e+19
-2.68317453145e+19
-6.46645062079e+19
-1.55841459961e+20
-3.75577918506e+20
-9.051427836e+20
-2.18139410848e+21
-5.25715980143e+21
-1.26697551214e+22
-3.05341098427e+22
-7.35872047208e+22
-1.77345163377e+23
-4.27401843739e+23
-1.03003844341e+24
-2.48239264862e+24
-5.98256628317e+24
-1.44179847425e+25
-3.47473432293e+25
-8.37410971826e+25
-15999395.0
-23519110.65
-34573092.6555
-50822446.2036
-74708995.9193
-109822224.001
-161438669.282
-237314843.844
-348852820.451
-512813646.064
-753836059.713
-1108139007.78
-1628964341.43
-2394577581.91
-3520029045.41
-5174442696.75
-7606430764.22
-11181453223.4
-16436736238.4
-24162002270.4
-35518143337.6
-52211670706.2
-76751155938.1
-112824199229.0
-165851572867.0
-243801812114.0
-358388663808.0
-526831335797.0
-774442063622.0
-1.13842983352e+12
-1.67349185528e+12
-2.46003302726e+12
-3.61624855008e+12
-5.31588536861e+12
-7.81435149186e+12
-1.1487096693e+13
-1.68860321388e+13
-2.4822467244e+13
-3.64890268486e+13
-5.36388694675e+13
-7.88491381172e+13
-1.15908233032e+14
-1.70385102558e+14
-2.5046610076e+14
-3.68185168117e+14
-5.41232197131e+14
-7.95611329783e+14
-1.16954865478e+15
-1.71923652253e+15
-2.52727768812e+15
-15999395.0
-18255309.695
-20829308.362
-23766240.841
-27117280.7996
-30940817.3924
-35303472.6447
-40281262.2876
-45960920.2701
-52441410.0282
-59835648.8422
-68272475.329
-77898894.3504
-88882638.4538
-101415090.476
-115714618.233
-132030379.404
-150646662.9
-171887842.368
-196124028.142
-223777516.11
-255330145.882
-291331696.451
-332409465.651
-379279200.308
-432757567.551
-493776384.576
-563398854.801
-642838093.328
-733478264.487
-836898699.78
-954901416.449
-1089542516.17
-1243168010.95
-1418454700.49
-1618456813.26
-1846659223.93
-2107038174.5
-2404130557.11
-2743112965.66
-3129891893.82
-3571206650.85
-4074746788.62
-4649286085.82
-5304835423.91
-6052817218.69
-6906264446.52
-7880047733.48
-8991134463.9
-10258884423.3
-15999395.0
-16751366.565
-17538680.7936
-18362998.7909
-19226059.734
-20129684.5415
-21075779.715
-22066341.3616
-23103459.4056
-24189321.9976
-25326220.1315
-26516552.4777
-27762830.4442
-29067683.475
-30433864.5984
-31864256.2345
-33361876.2775
-34929884.4625
-36571589.0323
-38290453.7168
-40090105.0415
-41974339.9784
-43947133.9574
-46012649.2534
-48175243.7683
-50439480.2254
-52810135.796
-55292212.1785
-57890946.1508
-60611820.6199
-63460576.1891
-66443223.27
-69566054.7636
-72835659.3375
-76258935.3264
-79843105.2867
-83595731.2352
-87524730.6033
-91638392.9416
-95945397.4099
-100454831.088
-105176208.149
-110119489.932
-115295105.959
-120713975.939
-126387532.808
-132327746.85
-138547150.952
-145058867.047
-151876633.798
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15999395.0
-15247423.435
-14530794.5336
-13847847.1905
-13196998.3725
-12576739.449
-11985632.6949
-11422307.9583
-10885459.4842
-10373842.8885
-9886272.2727
-9421617.47588
-8978801.45452
-8556797.78615
-8154628.2902
-7771360.76056
-7406106.80482
-7058019.78499
-6726292.8551
-6410157.09091
-6108879.70763
-5821762.36138
-5548139.53039
-5287376.97246
-5038870.25476
-4802043.35278
-4576347.3152
-4361258.99139
-4156279.81879
-3960934.66731
-3774770.73795
-3597356.51326
-3428280.75714
-3267151.56155
-3113595.43816
-2967256.45257
-2827795.3993
-2694889.01553
-2568229.2318
-2447522.45791
-2332488.90238
-2222861.92397
-2118387.41354
-2018823.20511
-1923938.51447
-1833513.40429
-1747338.27429
-1665213.3754
-1586948.34675
-1512361.77445
-15999395.0
-13743480.305
-11805649.582
-10141052.9909
-8711164.51921
-7482890.322
-6427802.7866
-5521482.59369
-4742953.54798
-4074197.09771
-3499735.30694
-3006272.62866
-2582388.18802
-2218271.45351
-1905495.17856
-1636820.35839
-1406028.68785
-1207778.64287
-1037481.85422
-891196.912776
-765538.148075
-657597.269196
-564876.05424
-485228.530592
-416811.307778
-358040.913382
-307557.144595
-264191.587207
-226940.573411
-194941.95256
-167455.137249
-143843.962897
-123561.964128
-106139.727186
-91174.025653
-78318.4880359
-67275.5812229
-57789.7242704
-49641.3731483
-42641.9395344
-36629.4260601
-31464.6769856
-27028.1575306
-23217.1873188
-19943.5639068
-17131.521396
-14715.9768791
-12641.0241392
-10858.6397356
-9327.57153285
-15999395.0
-8479679.35
-4494230.0555
-2381941.92941
-1262429.22259
-669087.487973
-354616.368626
-187946.675372
-99611.7379469
-52794.2211119
-27980.9371893
-14829.8967103
-7859.84525647
-4165.71798593
-2207.83053254
-1170.15018225
-620.179596592
-328.695186195
-174.208448684
-92.330477804
-48.9351532353
-25.9356312138
-13.7458845423
-7.28531880617
-3.86121896596
-2.04644605272
-1.08461640744
-0.574846695736
-0.304668749666
-0.161474436239
-0.0855814512507
-0.0453581701296
-0.0240398307554
-0.0127411115719
-0.00675278994567
-0.00357898009911
-0.00189685982285
-0.00100533567255
-0.000532829102706
-0.000282400359983
-0.000149671999679
-7.93275127975e-05
-4.20443627282e-05
-2.22842453637e-05
-1.18117510438e-05
-6.26061104258e-06
-3.31736048386e-06
-1.7574992021e-06
-9.31517477395e-07
-4.93871258631e-07
-15999395.0
6559751.95
-2689498.2995
1102694.30279
-452104.664146
185362.9123
-75998.7940429
31159.5055576
-12775.3972786
5237.91288423
-2147.54428254
880.49315584
-361.002193894
148.010899497
-60.6844687944
24.8806322066
-10.2010592052
4.18243427267
-1.71479805094
0.703067201433
-0.288257551722
0.118185596561
-0.0484560952055
0.0198669977874
-0.00814546978267
0.00333964235076
-0.00136925462612
0.000561393221823
-0.000230171650981
9.43715611047e-05
-3.86930643915e-05
1.58632237605e-05
-6.50423967175e-06
2.6658226295e-06
-1.09276864331e-06
4.47182831889e-07
-1.81581429949e-07
7.51087816056e-08
-3.14038071009e-08
1.24945221072e-08
-4.27067750431e-09
1.63848373167e-09
-1.13566861825e-09
1.63848373167e-09
-1.13566861825e-09
1.63848373167e-09
-1.13566861825e-09
1.63848373167e-09
-1.13566861825e-09
1.63848373167e-09
-15999395.0
59197761.5
-219031717.55
810417354.935
-2998544213.26
11094613589.1
-41050070279.5
151885260034.0
-561975462127.0
2.07930920987e+12
-7.69344407651e+12
2.84657430831e+13
-1.05323249407e+14
3.89696022808e+14
-1.44187528439e+15
5.33493855224e+15
-1.97392726433e+16
7.30353087801e+16
-2.70230642486e+17
9.998533772e+17
-3.69945749564e+18
1.36879927339e+19
-5.06455731153e+19
1.87388620527e+20
-6.93337895949e+20
2.56535021501e+21
-9.49179579554e+21
3.51196444435e+22
-1.29942684441e+23
4.80787932431e+23
-1.77891535e+24
6.58198679498e+24
-2.43533511414e+25
9.01073992233e+25
-3.33397377126e+26
1.23357029537e+27
-4.56421009286e+27
1.68875773436e+28
-6.24840361712e+28
2.31190933834e+29
-8.55406455184e+29
3.16500388418e+30
-1.17105143715e+31
4.33289031745e+31
-1.60316941745e+32
5.93172684458e+32
-2.1947389325e+33
8.12053405023e+33
-3.00459759859e+34
1.11170111148e+35
-15999395.0
209592074.5
-2745656175.95
35968095904.9
-471182056355.0
6.17248493825e+12
-8.0859552691e+13
1.05926014025e+15
-1.38763078373e+16
1.81779632669e+17
-2.38131318796e+18
3.11952027623e+19
-4.08657156186e+20
5.35340874604e+21
-7.01296545731e+22
9.18698474907e+23
-1.20349500213e+25
1.57657845279e+26
-2.06531777315e+27
2.70556628283e+28
-3.54429183051e+29
4.64302229796e+30
-6.08235921033e+31
7.96789056554e+32
-1.04379366409e+34
1.36736969995e+35
-1.79125430694e+36
2.34654314209e+37
-3.07397151613e+38
4.02690268614e+39
-5.27524251884e+40
6.91056769968e+41
-9.05284368658e+42
1.18592252294e+44
-1.55355850505e+45
2.03516164162e+46
-2.66606175052e+47
3.49254089318e+48
-4.57522857007e+49
5.99354942679e+50
-7.8515497491e+51
1.02855301713e+53
-1.34740445244e+54
1.7650998327e+55
-2.31228078084e+56
3.0290878229e+57
-3.968105048e+58
5.19821761287e+59
-6.80966507286e+60
8.92066124545e+61

In [6]:
# predict price of house with 1650 feet^2 and 3 bedrooms
x = np.array([[1.0],
              [1650],
              [3]])

# mean normalize (unnecessary if using the normal equation method)
x[1] = np.true_divide((np.subtract(x[1],xbar_1)),s_1)
x[2] = np.true_divide((np.subtract(x[2],xbar_2)),s_2)

# compute and display prediction
y_hat = theta.T.dot(x)
print "\nPrediction for 1650 feet^2 and 3 bedrooms: " + str(y_hat) + "\n"


Prediction for 1650 feet^2 and 3 bedrooms: [[ -2.48639707e+61]]

Normal Equation (an alternative to gradient descent)

Return the value of $\theta$ that minimizes the cost function:
$\theta = (X^{T}X)^{-1}X^{T}y$


In [13]:
normal_eq_theta = (np.linalg.pinv(X.T.dot(X)).dot(X.T)).dot(y)

normal_eq_y_hat = normal_eq_theta.T.dot(x)
print "\n\nPrediction for 1650 feet^2 and 3 bedrooms: " + str(normal_eq_y_hat) + "\n"



Prediction for 1650 feet^2 and 3 bedrooms: [[ 293081.4643349]]

last equality NOT FROM Ng (from Wikipedia)

$\theta = (X^{T}X)^{-1}X^{T}y = X^{+}y$


In [19]:
normal_eq_theta = (np.linalg.pinv(X).dot(y))

normal_eq_y_hat = normal_eq_theta.T.dot(x)
print "\n\nPrediction for 1650 feet^2 and 3 bedrooms: " + str(normal_eq_y_hat) + "\n"



Prediction for 1650 feet^2 and 3 bedrooms: [[ 293081.4643349]]