First a basic linear regression example

Consider the rate a medicine is repurchased based on its effectiveness.


In [35]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(10)

dossageEffectiveness = abs(np.random.normal(5.0, 1.5, 1000))
repurchaseRate = (dossageEffectiveness + np.random.normal(0, 0.1, 1000)) * 3
repurchaseRate/=np.max(repurchaseRate)
plt.scatter(dossageEffectiveness, repurchaseRate)
plt.show()



In [36]:
from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(dossageEffectiveness, repurchaseRate)

r_value ** 2


Out[36]:
0.99509525372340124

In [37]:
def predict(x):
    return slope * x + intercept

fitLine = predict(dossageEffectiveness)

plt.scatter(dossageEffectiveness, repurchaseRate)
plt.plot(dossageEffectiveness, fitLine, c='r')
plt.show()


Modify this to a multivariate regression example


In [41]:
repurchaseRate = np.random.normal(1, 0.1, 1000)*dossageEffectiveness**2

plt.scatter(dossageEffectiveness, repurchaseRate)
plt.show()



In [42]:
poly = np.poly1d(np.polyfit(dossageEffectiveness, repurchaseRate, 4))

xPoly = np.linspace(0, 7, 100)
plt.scatter(dossageEffectiveness, repurchaseRate)
plt.plot(xPoly, poly(xPoly), c='r')
plt.show()



In [45]:
from sklearn.metrics import r2_score

r2 = r2_score(repurchaseRate, poly(dossageEffectiveness))
print r2


0.957161926314

Make distribution more complicated to see if scikit-learn can fit it


In [58]:
dossageEffectiveness = np.sort(dossageEffectiveness)
repurchaseRate = (dossageEffectiveness + np.random.normal(0, 1, 1000)) * 3
repurchaseRate/=np.max(repurchaseRate)


angles = np.sort(np.random.uniform(0,np.pi,1000))
cs = np.sin(angles)


repurchaseRateComplicated = repurchaseRate+(cs*100)
repurchaseRateComplicated/=np.max(repurchaseRateComplicated)
plt.scatter(dossageEffectiveness, repurchaseRateComplicated)
plt.show()



In [63]:
poly = np.poly1d(np.polyfit(dossageEffectiveness, repurchaseRateComplicated, 9))

xPoly = np.linspace(0, 7, 100)
plt.scatter(dossageEffectiveness, repurchaseRateComplicated)
plt.plot(xPoly, poly(xPoly), c='r')
plt.show()


With a high-N polynomial, it is unlikely to hold up to future testing and only fits the test data well.


In [ ]: