This is going to be a very basic example of Linear Regression. We have generated data of total bill amounts and tips, and we would like to use this historical data to predict the tip for any given bill amount.
The data is going to be perfect, because I just want to show how easy it is to do Linear Regression.
These are the best examples I have found so far for calculating Linear Regression:
http://onlinestatbook.com/2/regression/intro.html
http://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import collections
import time
from sklearn.linear_model import SGDRegressor
In [3]:
total_bills = np.random.randint(100, size=1000)
tips = total_bills * 0.10
It's easier if we select the correct X and Y axes. Usually, the Y axis is the value we want to predict and the X axis is the input data.
In [4]:
x = pd.Series(total_bills, name='total_bills')
y = pd.Series(tips, name='tips')
In [6]:
df = pd.concat([x, y], axis=1)
In [7]:
df.plot(kind='scatter', x='total_bills', y='tips');
As we can see from the graph, there's a strong correlation between the bill amount and the tip. Now we want to calculate the regression line. We need the slope and intercept to feed into the formula Y = MX + C.
In [8]:
slope, intercept, r_value, p_value, std_err = stats.linregress(x=total_bills, y=tips)
print("slope is %f and intercept is %s" % (slope,intercept))
Let's say the customer spent $70; how much will the customer tip? Since the tips were generated as exactly 10% of the bill, we expect the answer to be about $7.
In [9]:
predicted_tips = (slope * 70) + intercept
In [10]:
print('The customer will leave a tip of $%f' % predicted_tips)
In [20]:
large_total_bills = np.random.randint(10000, size=100000000)
large_tips = large_total_bills * 0.10
In [21]:
now = time.time()
slope, intercept, r_value, p_value, std_err = stats.linregress(x=large_total_bills, y=large_tips)
predicted_tips = (slope * 700) + intercept
later = time.time()
difference = later - now
print('The customer will leave a tip of $%f' % predicted_tips)
print('The time spent is %f seconds' % difference)
Now, I'm going to use Gradient Descent to find the fitted line. Gradient Descent is known to work better for large datasets. Let's see how well it performs. I'm going to use the code example from https://github.com/mattnedrich/GradientDescentExample
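For reference, the functions below minimize the mean squared error: error(b, m) = (1/N) * sum of (y_i - (m * x_i + b))^2 over all N points. Each gradient descent step nudges b and m in the direction that lowers this error, using the partial derivatives d/db = -(2/N) * sum of (y_i - (m * x_i + b)) and d/dm = -(2/N) * sum of x_i * (y_i - (m * x_i + b)), scaled by the learning rate.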
In [17]:
def compute_error_for_line_given_points(b, m, points):
    totalError = 0
    for i in range(0, len(points)):
        totalError += (points[i].y - (m * points[i].x + b)) ** 2
    return totalError / float(len(points))
In [18]:
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        b_gradient += -(2/N) * (points[i].y - ((m_current * points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]
In [19]:
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return [b, m]
In [22]:
# Use a named tuple so each point has .x and .y attributes
Point = collections.namedtuple('Point', ['x', 'y'])

x = np.random.randint(100, size=1000)
y = x * 0.10
points = []
for i in range(len(x)):
    points.append(Point(x[i], y[i]))
learning_rate = 0.0001
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
print("Running...")
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))
Let's see how close we are after 1000 iterations. Pretty close, I think.
In [25]:
gradient_predicted_tips = (m * 70) + b
gradient_predicted_tips
Out[25]:
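As a quick sanity check, here's a minimal sketch comparing the gradient descent fit with scipy's linregress on the same points (reusing the x, y arrays and the b, m values from the cells above):
In [ ]:
# Compare the gradient descent coefficients against scipy's closed-form fit
ls_slope, ls_intercept, _, _, _ = stats.linregress(x=x, y=y)
print("gradient descent: m = %f, b = %f" % (m, b))
print("linregress:       m = %f, b = %f" % (ls_slope, ls_intercept))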
But you really don't need to write that yourself, as scikit-learn already provides it.
In [3]:
x = np.random.randint(100, size=100000000)
y = x * 0.10
x = x[:,None]
now = time.time()
clf = SGDRegressor()
clf.fit(x, y)
later = time.time()
difference = int(later - now)
print("Time spent for SGDRegressor is %d seconds" % difference)
print("slope is %f and intercept is %s" % (clf.coef_, clf.intercept_[0]))
In [4]:
clf.predict([[70]])  # How much tip for a $70 bill (predict expects a 2D array)
Out[4]:
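One thing worth noting: SGDRegressor is sensitive to feature scaling, so on real data you would usually standardize the input first. A minimal sketch using a scikit-learn Pipeline (reusing the x and y arrays from the cell above):
In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the feature, then fit SGDRegressor on the scaled values
model = make_pipeline(StandardScaler(), SGDRegressor())
model.fit(x, y)
model.predict([[70]])  # predicted tip for a $70 bill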