Topic:    Challenge Set 5
Subject:  Linear Regression and Train/Test Split
Date:     02/07/2017
Name:     Prashant Tatineni



In [97]:

    
import pandas as pd
import patsy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

%matplotlib inline



In [3]:

    
df = pd.read_csv('2013_movies.csv')



In [4]:

    
df.head()









    Out[4]:






  
    
      
                             Title       Budget  DomesticTotalGross                   Director Rating  Runtime          ReleaseDate
0  The Hunger Games: Catching Fire  130000000.0           424668047           Francis Lawrence  PG-13      146  2013-11-22 00:00:00
1                       Iron Man 3  200000000.0           409013994                Shane Black  PG-13      129  2013-05-03 00:00:00
2                           Frozen  150000000.0           400738009     Chris BuckJennifer Lee     PG      108  2013-11-22 00:00:00
3                  Despicable Me 2   76000000.0           368061265  Pierre CoffinChris Renaud     PG       98  2013-07-03 00:00:00
4                     Man of Steel  225000000.0           291045518                Zack Snyder  PG-13      143  2013-06-14 00:00:00



In [5]:

    
y, X = patsy.dmatrices('DomesticTotalGross ~ Budget + Runtime', data=df, return_type="dataframe")



In [10]:

    
X.head()









    Out[10]:






  
    
      
   Intercept       Budget  Runtime
0        1.0  130000000.0    146.0
1        1.0  200000000.0    129.0
2        1.0  150000000.0    108.0
3        1.0   76000000.0     98.0
4        1.0  225000000.0    143.0

Challenge 1



In [13]:

    
model = sm.OLS(y, X['Intercept'])
fit = model.fit()
fit.summary()









    Out[13]:





OLS Regression Results

  Dep. Variable:     DomesticTotalGross    R-squared:             0.000


  Model:                     OLS           Adj. R-squared:        0.000


  Method:               Least Squares      F-statistic:             nan


  Date:               Tue, 07 Feb 2017     Prob (F-statistic):      nan 


  Time:                   14:37:47         Log-Likelihood:      -1753.0


  No. Observations:            89          AIC:                   3508.


  Df Residuals:                88          BIC:                   3511.


  Df Model:                     0                                      


  Covariance Type:        nonrobust                                    




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept    1.03e+08   9.24e+06     11.148   0.000   8.47e+07  1.21e+08




  Omnibus:        49.405    Durbin-Watson:         0.016


  Prob(Omnibus):   0.000    Jarque-Bera (JB):    127.458


  Skew:            2.051    Prob(JB):           2.10e-28


  Kurtosis:        7.189    Cond. No.               1.00

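This intercept-only model is the null (baseline) model: it predicts the mean domestic gross, about 1.03e+08, for every movie, which is why R-squared is 0.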
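
A quick way to confirm that an intercept-only least-squares fit reduces to the sample mean is sketched below with NumPy. The `gross` values are made-up numbers, not the movie data, and `np.linalg.lstsq` stands in for `sm.OLS` so that the example has no other dependencies.

```python
import numpy as np

# Hypothetical response values (not taken from 2013_movies.csv).
gross = np.array([424.7, 409.0, 400.7, 368.1, 291.0])

# Design matrix with a single column of ones: the intercept-only model.
X0 = np.ones((len(gross), 1))
coef, *_ = np.linalg.lstsq(X0, gross, rcond=None)

pred = X0 @ coef                               # same value for every row
ss_res = np.sum((gross - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((gross - gross.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(coef[0], gross.mean(), r_squared)
```

The fitted coefficient matches the mean of `gross`, and because the residual and total sums of squares coincide, R-squared is 0 up to floating-point error.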



In [24]:

    
records = range(len(y))  # one x-position per row, instead of a hard-coded 89



In [32]:

    
plt.scatter(records, y, color='g')
plt.scatter(records, fit.predict(X['Intercept']))









    Out[32]:





<matplotlib.collections.PathCollection at 0x11ad62110>



In [51]:

    
plt.hist((y['DomesticTotalGross'] - fit.predict(X['Intercept'])));

Challenge 2



In [58]:

    
model = sm.OLS(y, X[['Intercept','Budget']])
fit = model.fit()
fit.summary()









    Out[58]:





OLS Regression Results

  Dep. Variable:     DomesticTotalGross    R-squared:             0.286


  Model:                     OLS           Adj. R-squared:        0.278


  Method:               Least Squares      F-statistic:           34.82


  Date:               Tue, 07 Feb 2017     Prob (F-statistic):  6.80e-08


  Time:                   15:09:42         Log-Likelihood:      -1738.1


  No. Observations:            89          AIC:                   3480.


  Df Residuals:                87          BIC:                   3485.


  Df Model:                     1                                      


  Covariance Type:        nonrobust                                    




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept    4.44e+07   1.27e+07      3.504   0.001   1.92e+07  6.96e+07


  Budget         0.7846      0.133      5.901   0.000      0.520     1.049




  Omnibus:        39.749    Durbin-Watson:         0.674


  Prob(Omnibus):   0.000    Jarque-Bera (JB):     99.441


  Skew:            1.587    Prob(JB):           2.55e-22


  Kurtosis:        7.091    Cond. No.           1.54e+08



In [59]:

    
plt.scatter(X['Budget'], y, color='g')
plt.scatter(X['Budget'], fit.predict(X[['Intercept','Budget']]))









    Out[59]:





<matplotlib.collections.PathCollection at 0x11e0b4750>



In [61]:

    
plt.scatter(X['Budget'], y['DomesticTotalGross'] - fit.predict(X[['Intercept','Budget']]))  # residual = actual - predicted, as in the histograms









    Out[61]:





<matplotlib.collections.PathCollection at 0x11e5d9dd0>

For higher-budget, higher-grossing movies the data are more spread out and the model's residuals are larger in magnitude.
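
One rough way to put a number on that widening spread is to compare the standard deviation of the residuals below and above the median budget. The sketch below does this on synthetic data in which the noise is deliberately made to grow with `budget`; the data, the median split, and the variable names are assumptions for illustration, not results from the movie dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): noise grows with budget.
budget = rng.uniform(1.0, 10.0, size=200)
gross = 2.0 + 0.8 * budget + rng.normal(0.0, 0.3 * budget)

# Fit a straight line and compute residuals as actual minus predicted.
slope, intercept = np.polyfit(budget, gross, deg=1)
resid = gross - (intercept + slope * budget)

# Compare residual spread below and above the median budget.
high = budget > np.median(budget)
spread_low = resid[~high].std()
spread_high = resid[high].std()

print(spread_low, spread_high)
```

A clearly larger spread in the high-budget half is a sign of non-constant residual variance, which is the pattern the residual plot suggests.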

Challenge 3



In [66]:

    
y3, X3 = patsy.dmatrices('DomesticTotalGross ~ Rating', data=df, return_type="dataframe")



In [72]:

    
X3.head()









    Out[72]:






  
    
      
   Intercept  Rating[T.PG]  Rating[T.PG-13]  Rating[T.R]
0        1.0           0.0              1.0          0.0
1        1.0           0.0              1.0          0.0
2        1.0           1.0              0.0          0.0
3        1.0           1.0              0.0          0.0
4        1.0           0.0              1.0          0.0



In [83]:

    
model = sm.OLS(y3, X3)
fit = model.fit()
fit.summary()









    Out[83]:





OLS Regression Results

  Dep. Variable:     DomesticTotalGross    R-squared:             0.109


  Model:                     OLS           Adj. R-squared:        0.081


  Method:               Least Squares      F-statistic:           3.924


  Date:               Tue, 07 Feb 2017     Prob (F-statistic):   0.0109 


  Time:                   15:22:55         Log-Likelihood:      -1964.2


  No. Observations:           100          AIC:                   3936.


  Df Residuals:                96          BIC:                   3947.


  Df Model:                     3                                      


  Covariance Type:        nonrobust                                    




                     coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept         2.685e+08   8.38e+07      3.205   0.002   1.02e+08  4.35e+08


  Rating[T.PG]     -1.374e+08   8.65e+07     -1.588   0.116  -3.09e+08  3.44e+07


  Rating[T.PG-13]   -1.57e+08   8.47e+07     -1.855   0.067  -3.25e+08   1.1e+07


  Rating[T.R]      -1.986e+08   8.49e+07     -2.340   0.021  -3.67e+08 -3.01e+07




  Omnibus:        46.897    Durbin-Watson:         0.240


  Prob(Omnibus):   0.000    Jarque-Bera (JB):    109.261


  Skew:            1.826    Prob(JB):           1.88e-24


  Kurtosis:        6.589    Cond. No.               23.7



In [89]:

    
records3 = range(len(y3))  # one x-position per row, instead of a hard-coded 100



In [90]:

    
plt.scatter(records3, y3, color='g')
plt.scatter(records3, fit.predict(X3))









    Out[90]:





<matplotlib.collections.PathCollection at 0x11f26d8d0>



In [91]:

    
plt.hist((y3['DomesticTotalGross'] - fit.predict(X3)));

Here, the model uses only 'Rating' to predict domestic gross. Since there are four ratings, it predicts one of four domestic gross values: the mean gross of the movies with that rating.
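
The claim that a regression on a single categorical predictor can only produce one value per category can be checked directly. The sketch below builds the same kind of design matrix as the `X3` above (an intercept plus one 0/1 column per non-baseline level) by hand with NumPy. The `rating` and `gross` arrays are hypothetical, and `np.linalg.lstsq` is used in place of `sm.OLS` to keep the example dependency-free.

```python
import numpy as np

# Hypothetical ratings and gross values (not the movie data).
rating = np.array(["G", "PG", "PG", "PG-13", "PG-13", "R", "R", "R"])
gross = np.array([270.0, 140.0, 120.0, 115.0, 105.0, 80.0, 60.0, 70.0])

levels = np.unique(rating)  # sorted; the first level acts as the baseline
# Intercept plus one 0/1 indicator column per non-baseline level.
X = np.column_stack(
    [np.ones(len(rating))] + [(rating == lvl).astype(float) for lvl in levels[1:]]
)
coef, *_ = np.linalg.lstsq(X, gross, rcond=None)
fitted = X @ coef

# Each fitted value should equal the mean gross of that row's rating group.
group_mean = np.array([gross[rating == r].mean() for r in rating])

print(np.unique(fitted.round(6)))
```

With this coding the intercept is the mean of the baseline level and each remaining coefficient is that level's offset from the baseline, so the fitted values collapse to the four group means.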

Challenge 4



In [92]:

    
y4, X4 = patsy.dmatrices('DomesticTotalGross ~ Budget + Runtime + Rating', data=df, return_type="dataframe")



In [93]:

    
X4.head()









    Out[93]:






  
    
      
   Intercept  Rating[T.PG]  Rating[T.PG-13]  Rating[T.R]       Budget  Runtime
0        1.0           0.0              1.0          0.0  130000000.0    146.0
1        1.0           0.0              1.0          0.0  200000000.0    129.0
2        1.0           1.0              0.0          0.0  150000000.0    108.0
3        1.0           1.0              0.0          0.0   76000000.0     98.0
4        1.0           0.0              1.0          0.0  225000000.0    143.0



In [94]:

    
model = sm.OLS(y4, X4)
fit = model.fit()
fit.summary()









    Out[94]:





OLS Regression Results

  Dep. Variable:     DomesticTotalGross    R-squared:             0.302


  Model:                     OLS           Adj. R-squared:        0.269


  Method:               Least Squares      F-statistic:           9.089


  Date:               Tue, 07 Feb 2017     Prob (F-statistic):  3.77e-06


  Time:                   15:31:42         Log-Likelihood:      -1737.0


  No. Observations:            89          AIC:                   3484.


  Df Residuals:                84          BIC:                   3497.


  Df Model:                     4                                      


  Covariance Type:        nonrobust                                    




                     coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept        -1.012e+07   4.05e+07     -0.250   0.803  -9.08e+07  7.05e+07


  Rating[T.PG]      1.438e+07   1.74e+07      0.826   0.411  -2.02e+07   4.9e+07


  Rating[T.PG-13]  -1.342e+07   1.98e+07     -0.677   0.500  -5.28e+07   2.6e+07


  Rating[T.R]      -1.109e+07   2.03e+07     -0.547   0.586  -5.14e+07  2.92e+07


  Budget               0.7036      0.165      4.255   0.000      0.375     1.032


  Runtime           6.083e+05   5.21e+05      1.168   0.246  -4.27e+05  1.64e+06




  Omnibus:        36.883    Durbin-Watson:         0.740


  Prob(Omnibus):   0.000    Jarque-Bera (JB):     82.319


  Skew:            1.525    Prob(JB):           1.33e-18


  Kurtosis:        6.591    Cond. No.           2.02e+17



In [96]:

    
plt.scatter(records, y4, color='g')
plt.scatter(records, fit.predict(X4))









    Out[96]:





<matplotlib.collections.PathCollection at 0x11dd2b290>

Challenge 5



In [98]:

    
X_train, X_test, y_train, y_test = train_test_split(X4, y4, test_size=0.25)



In [99]:

    
y_test.shape









    Out[99]:





(23, 1)



In [100]:

    
model = sm.OLS(y_train, X_train)
fit = model.fit()
fit.summary()









    Out[100]:





OLS Regression Results

  Dep. Variable:     DomesticTotalGross    R-squared:             0.257


  Model:                     OLS           Adj. R-squared:        0.208


  Method:               Least Squares      F-statistic:           5.274


  Date:               Tue, 07 Feb 2017     Prob (F-statistic):   0.00103


  Time:                   15:38:13         Log-Likelihood:      -1294.7


  No. Observations:            66          AIC:                   2599.


  Df Residuals:                61          BIC:                   2610.


  Df Model:                     4                                      


  Covariance Type:        nonrobust                                    




                     coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept        -9.789e+06   5.14e+07     -0.190   0.850  -1.13e+08   9.3e+07


  Rating[T.PG]      2.227e+07   2.28e+07      0.979   0.332  -2.32e+07  6.78e+07


  Rating[T.PG-13]  -1.653e+07   2.58e+07     -0.640   0.525  -6.82e+07  3.51e+07


  Rating[T.R]      -1.553e+07   2.54e+07     -0.611   0.544  -6.64e+07  3.53e+07


  Budget               0.6559      0.220      2.983   0.004      0.216     1.096


  Runtime           6.422e+05   6.62e+05      0.971   0.336  -6.81e+05  1.96e+06




  Omnibus:        28.421    Durbin-Watson:         2.118


  Prob(Omnibus):   0.000    Jarque-Bera (JB):     49.197


  Skew:            1.561    Prob(JB):           2.07e-11


  Kurtosis:        5.854    Cond. No.           5.44e+17



In [101]:

    
records5 = range(len(y_test))  # one x-position per test row, instead of a hard-coded 23



In [102]:

    
plt.scatter(records5, y_test, color='g')
plt.scatter(records5, fit.predict(X_test))









    Out[102]:





<matplotlib.collections.PathCollection at 0x11c764490>
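
The scatter plot compares held-out values with predictions visually. A numeric counterpart would be an out-of-sample RMSE or R-squared, and the sketch below computes both with NumPy on synthetic data. The data, the 75/25 split done with `rng.permutation`, and all variable names are assumptions for illustration, not the notebook's own evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for illustration).
n = 88
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])

# Shuffle the row indices and hold out 25% as the test set.
idx = rng.permutation(n)
n_test = n // 4
test, train = idx[:n_test], idx[n_test:]

# Fit on the training rows only.
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Score on the held-out rows.
pred = X[test] @ coef
rmse = np.sqrt(np.mean((y[test] - pred) ** 2))
r2_test = 1 - np.sum((y[test] - pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)

print(rmse, r2_test)
```

Unlike the R-squared reported in the training summary, the test-set value is computed on rows the model never saw, so it can be lower or even negative when the model generalizes poorly.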
