R-squared

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. Some of the variance in price can be explained by a model, and some cannot, because it comes from factors the model does not measure. R-squared is the ratio of the explained variance to the total variance. It is not a measure of accuracy; it is a measure of the explanatory power of one's model.
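To make the ratio concrete, here is a minimal sketch in plain Python. The bedroom counts and prices are made-up numbers for illustration only, and `fit_line` and `variance_breakdown` are hypothetical helper names, not part of any library. The sketch assumes a single predictor and a least-squares line with an intercept.

```python
# Hypothetical toy data, invented for illustration: bedrooms and price (in thousands).
bedrooms = [1, 2, 2, 3, 3, 4, 4, 5]
prices = [150, 200, 230, 260, 310, 330, 390, 420]

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def variance_breakdown(xs, ys):
    """Return (total, explained, unexplained) sums of squares for the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / float(len(ys))
    predicted = [intercept + slope * x for x in xs]
    total = sum((y - mean_y) ** 2 for y in ys)
    explained = sum((p - mean_y) ** 2 for p in predicted)
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return total, explained, unexplained

total, explained, unexplained = variance_breakdown(bedrooms, prices)
r_squared = explained / total  # explained variance over total variance
print("R-squared = %.3f" % r_squared)
```

For a least-squares line with an intercept, the explained and unexplained parts add up to the total, so `explained / total` and `1 - unexplained / total` give the same number.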

Below are some example plots and their corresponding R-squared values. Notice how the value changes as I add more noise to the data. These examples fail to capture the important "real world" explainability property, but they should serve as a reference point for what the values of R-squared look like.



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf









    






In [6]:

    
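# Noise-free relationship y = 0.5 * x on 100 evenly spaced points in [0, 1)
x = 1.0 * np.arange(100) / 100
y = 0.5 * np.arange(100) / 100
# Add a small amount of uniform noise, in [-0.005, 0.005)
yy = y + (np.random.rand(100)-.5) * .01
# Note: sm.OLS fits no intercept unless x includes a constant column
f = sm.OLS(yy, x).fit()
plt.scatter(x, yy)
plt.plot(x, f.predict(x))
plt.title('R-squared = ' + str(f.rsquared))
plt.xlim(0, 1)
plt.ylim(-1, 1)
plt.show()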



In [7]:

    
# Ten times more noise: uniform in [-0.05, 0.05)
yy = y + (np.random.rand(100)-.5) * .1
f = sm.OLS(yy, x).fit()
plt.plot(x, f.predict(x))
plt.scatter(x, yy)
plt.title('R-squared = ' + str(f.rsquared))
plt.xlim(0, 1)
plt.ylim(-1, 1)
plt.show()



In [10]:

    
# Noise uniform in [-0.5, 0.5), as large as the range of y itself
yy = y + (np.random.rand(100)-.5)
f = sm.OLS(yy, x).fit()
plt.plot(x, f.predict(x))
plt.scatter(x, yy)
plt.title('R-squared = ' + str(f.rsquared))
plt.xlim(0, 1)
plt.ylim(-1, 1)
plt.show()

Notice that as more noise is added, the model still finds the best fit, but that fit does not explain the data as well, which is captured in the lower R-squared value. Our last plot shows a model fit to completely random data, for which we would expect R-squared to be close to zero.
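The trend across the three plots can also be checked numerically. The following is a rough sketch in plain Python rather than the notebook's statsmodels code: `r_squared` is a hypothetical helper that fits a line with an intercept (the `sm.OLS(yy, x)` calls above fit without one), and the seed is arbitrary.

```python
import random

def r_squared(xs, ys):
    """R-squared of a least-squares line with an intercept (hypothetical helper)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - residual / total

random.seed(0)  # arbitrary seed so the sketch is repeatable
xs = [i / 100.0 for i in range(100)]

# Same signal (y = 0.5 * x) with the three noise scales used in the plots above
results = {}
for scale in (0.01, 0.1, 1.0):
    ys = [0.5 * x + (random.random() - 0.5) * scale for x in xs]
    results[scale] = r_squared(xs, ys)
    print("noise scale %.2f -> R-squared %.4f" % (scale, results[scale]))
```

The signal is identical in all three runs. Only the unexplained part of the variance grows, which is what pulls R-squared down.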



In [11]:

    
# Pure noise: uniform in [-1, 1), unrelated to x
yy = np.random.rand(100)*2 - 1
f = sm.OLS(yy, x).fit()
plt.plot(x, f.predict(x))
plt.scatter(x, yy)
plt.title('R-squared = ' + str(f.rsquared))
plt.xlim(0, 1)
plt.ylim(-1, 1)
plt.show()
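On any single random sample, the fitted line picks up a little chance correlation, so R-squared tends to come out slightly above zero rather than exactly zero. Here is a minimal sketch of that effect, again in plain Python with a hypothetical `r_squared` helper that assumes an intercept, and with an arbitrary seed and trial count:

```python
import random

def r_squared(xs, ys):
    """R-squared of a least-squares line with an intercept (hypothetical helper)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - residual / total

random.seed(1)  # arbitrary seed so the sketch is repeatable
xs = [i / 100.0 for i in range(100)]

# Repeat the "completely random data" experiment many times
trials = []
for _ in range(200):
    ys = [random.random() * 2 - 1 for _ in xs]
    trials.append(r_squared(xs, ys))

mean_r2 = sum(trials) / len(trials)
print("smallest: %.5f" % min(trials))
print("largest:  %.5f" % max(trials))
print("average:  %.5f" % mean_r2)
```

Averaging over many trials gives a sense of how far from zero a "no relationship" R-squared typically lands for 100 points.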