Introduction

This is going to be a very basic examples of Linear Regression. Basically, we have generated data of total amount of meals and tips. We would like to use this historical data to predict the tip for any given amount of bill.

The data is going to be perfect because I just want to show how easy it is to do Linear Regression.

Best example so far I have found to calculate Linear Regression.

http://onlinestatbook.com/2/regression/intro.html

http://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/

https://github.com/mattnedrich/GradientDescentExample

Importing


In [1]:
%matplotlib inline 
import pandas as pd
import numpy as np
from scipy import stats
import collections
import time
from sklearn.linear_model import SGDRegressor

Generate Data

We are going to generate a 1000 samples of random number between \$0 - \$100. And let's say that each meal the customer tips 10% of the amount


In [3]:
total_bills = np.random.randint(100, size=1000)
tips = total_bills * 0.10

It's easier if we select the correct X and Y axis. Usually, The Y axis would be the value we want to predict and X would be the feed data.


In [4]:
x = pd.Series(tips, name='tips')
y = pd.Series(total_bills, name='total_bills')

In [6]:
df = pd.concat([x, y], axis=1)

In [7]:
df.plot(kind='scatter', x='total_bills', y='tips');


As we can see from the graph that there's a strong correlation between amount of tip and meal. Now we want to calculate the regression line. We need the slope and intercept to feed in the formula. Y = MX + C


In [8]:
slope, intercept, r_value, p_value, std_err = stats.linregress(x=total_bills, y=tips)
print("slope is %f and intercept is %s" % (slope,intercept))


slope is 0.100000 and intercept is 1.7763568394e-15

Let's say if the customer spent $70 how much the customer will tip


In [9]:
predicted_tips = (slope * 70) + intercept

In [10]:
print('The customer will leave the tip of $%f' % predicted_tips)


The customer will leave the tip of $7.000000

Large dataset

Now let's have a look at large dataset. Let's see how our Linear Regression performs. I'm going to create 100 million datasets.


In [20]:
large_total_bills = np.random.randint(10000, size=100000000)
large_tips = total_bills * 0.10

In [21]:
now = time.time()

slope, intercept, r_value, p_value, std_err = stats.linregress(x=large_total_bills, y=large_tips)
predicted_tips = (slope * 700) + intercept

later = time.time()
difference = int(later - now) 
print('The customer will leave the tip of $%f' % predicted_tips)
print('The time spent is %f seconds' % difference)


The customer will leave the tip of $499.961690
The time spent is 11.000000 seconds

Gradient Decent

Now, I'm going to use Gradient Decent to find the fitted line. It's been known that Gradient Decent is better for large dataset. Let's see how well it performs. I'm going to use the code example from https://github.com/mattnedrich/GradientDescentExample


In [17]:
def compute_error_for_line_given_points (b, m, points):
    totalError = 0
    for i in range(0, len(points)):
        totalError += (points[i].y - (m * points[i].x + b)) ** 2
    return totalError / float(len(points))

In [18]:
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        b_gradient += -(2/N) * (points[i].y - ((m_current*points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

In [19]:
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return [b, m]

In [22]:
class point:
    def __init__(self,x,y):
        self.x=x
        self.y=y
        
x = np.random.randint(100, size=1000)
y = x * 0.10

np.column_stack((x,y))

points = []
collections.namedtuple('Point', ['x', 'y'])
for i in range(len(x)):
        points.append(point(x[i],y[i]))

learning_rate = 0.0001
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
print("Running...")
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))


Starting gradient descent at b = 0, m = 0, error = 33.086309999999955
Running...
After 1000 iterations b = 0.001432728484911941, m = 0.09997841844411902, error = 5.119034280329996e-07

Let's see after 1000 interations how close are we. Pretty close I think


In [25]:
gradient_predicted_tips = (m * 70) + b
gradient_predicted_tips


Out[25]:
6.9999220195732432

But you really don't need to write that on your own as Scikit provides that for you already.


In [3]:
x = np.random.randint(100, size=100000000)
y = x * 0.10

x = x[:,None]

now = time.time()

clf = SGDRegressor()
clf.fit(x, y)

later = time.time()
difference = int(later - now) 
print("Time spent for SGDRegressor is %d seconds" % difference)    
print("slope is %f and intercept is %s" % (clf.coef_, clf.intercept_[0]))


Time spent for SGDRegressor is 313 seconds
slope is 0.100000 and intercept is 5.96178570559e-07

In [4]:
clf.predict(70) # How much tip


Out[4]:
array([ 6.99999978])

In [ ]: