Lab

Multiple Linear Regression

Alessandro D. Gagliardi


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

This dataset contains six numeric survey responses $X_1, \ldots, X_6$ and a dependent variable $Y$ (perceived supervisor quality). We want to predict $Y$ from the $X$'s.


In [2]:
x = pd.read_table('http://www.ats.ucla.edu/stat/examples/chp/p054.txt')
x.head()


Out[2]:
Y X1 X2 X3 X4 X5 X6
0 43 51 30 39 61 92 45
1 63 64 51 54 63 73 47
2 71 70 68 69 76 86 48
3 61 63 45 47 54 84 35
4 81 78 56 66 71 83 47

5 rows × 7 columns

The column names have trailing whitespace, so we fix that by mapping str.strip onto the column names.


In [3]:
x.columns = x.columns.map(str.strip)

This set of scatterplots gives us an idea of the pairwise relationships present in the dataset.


In [4]:
_ = pd.plotting.scatter_matrix(x, figsize=(7, 7))  # pd.scatter_matrix in older pandas


This linear fit is the "full model", i.e., the fit with all of the independent variables included.


In [5]:
lm = smf.ols('Y ~ X1 + X2 + X3 + X4 + X5 + X6', data=x)
fit = lm.fit()
print(fit.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.733
Model:                            OLS   Adj. R-squared:                  0.663
Method:                 Least Squares   F-statistic:                     10.50
Date:                Wed, 30 Apr 2014   Prob (F-statistic):           1.24e-05
Time:                        20:05:26   Log-Likelihood:                -97.250
No. Observations:                  30   AIC:                             208.5
Df Residuals:                      23   BIC:                             218.3
Df Model:                           6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     10.7871     11.589      0.931      0.362       -13.187    34.761
X1             0.6132      0.161      3.809      0.001         0.280     0.946
X2            -0.0731      0.136     -0.538      0.596        -0.354     0.208
X3             0.3203      0.169      1.901      0.070        -0.028     0.669
X4             0.0817      0.221      0.369      0.715        -0.376     0.540
X5             0.0384      0.147      0.261      0.796        -0.266     0.342
X6            -0.2171      0.178     -1.218      0.236        -0.586     0.152
==============================================================================
Omnibus:                        2.386   Durbin-Watson:                   1.795
Prob(Omnibus):                  0.303   Jarque-Bera (JB):                1.255
Skew:                          -0.081   Prob(JB):                        0.534
Kurtosis:                       2.011   Cond. No.                     1.34e+03
==============================================================================

Warnings:
[1] The condition number is large, 1.34e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Remove the feature with the lowest absolute t-statistic (here X5) and refit.


In [6]:
fit2 = smf.ols('Y ~ X1 + X2 + X3 + X4 + X6', data=x).fit()
print(fit2.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.732
Model:                            OLS   Adj. R-squared:                  0.676
Method:                 Least Squares   F-statistic:                     13.10
Date:                Wed, 30 Apr 2014   Prob (F-statistic):           3.28e-06
Time:                        20:06:32   Log-Likelihood:                -97.294
No. Observations:                  30   AIC:                             206.6
Df Residuals:                      24   BIC:                             215.0
Df Model:                           5                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     12.7979      8.491      1.507      0.145        -4.726    30.322
X1             0.6131      0.158      3.885      0.001         0.287     0.939
X2            -0.0722      0.133     -0.543      0.592        -0.347     0.202
X3             0.3117      0.162      1.924      0.066        -0.023     0.646
X4             0.0980      0.208      0.470      0.643        -0.332     0.528
X6            -0.2111      0.173     -1.218      0.235        -0.569     0.147
==============================================================================
Omnibus:                        2.254   Durbin-Watson:                   1.775
Prob(Omnibus):                  0.324   Jarque-Bera (JB):                1.239
Skew:                          -0.110   Prob(JB):                        0.538
Kurtosis:                       2.029   Cond. No.                         871.
==============================================================================

Note that $R^2$ decreases slightly, but adjusted $R^2$ increases slightly.

$\rightarrow$ increasing bias, decreasing variance

Ditto: remove the feature with the lowest absolute t-statistic (now X4) and refit.


In [7]:
fit3 = smf.ols('Y ~ X1 + X2 + X3 + X6', data=x).fit()
print(fit3.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.729
Model:                            OLS   Adj. R-squared:                  0.686
Method:                 Least Squares   F-statistic:                     16.84
Date:                Wed, 30 Apr 2014   Prob (F-statistic):           8.13e-07
Time:                        20:07:08   Log-Likelihood:                -97.432
No. Observations:                  30   AIC:                             204.9
Df Residuals:                      25   BIC:                             211.9
Df Model:                           4                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.3035      7.740      1.848      0.076        -1.636    30.243
X1             0.6534      0.131      5.006      0.000         0.385     0.922
X2            -0.0768      0.131     -0.588      0.562        -0.346     0.192
X3             0.3239      0.157      2.058      0.050        -0.000     0.648
X6            -0.1715      0.149     -1.151      0.261        -0.478     0.135
==============================================================================
Omnibus:                        2.565   Durbin-Watson:                   1.820
Prob(Omnibus):                  0.277   Jarque-Bera (JB):                1.315
Skew:                          -0.107   Prob(JB):                        0.518
Kurtosis:                       1.997   Cond. No.                         698.
==============================================================================

In [8]:
fit4 = smf.ols('Y ~ X1 + X3 + X6', data=x).fit()
print(fit4.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.726
Model:                            OLS   Adj. R-squared:                  0.694
Method:                 Least Squares   F-statistic:                     22.92
Date:                Wed, 30 Apr 2014   Prob (F-statistic):           1.81e-07
Time:                        20:07:29   Log-Likelihood:                -97.638
No. Observations:                  30   AIC:                             203.3
Df Residuals:                      26   BIC:                             208.9
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     13.5777      7.544      1.800      0.084        -1.929    29.084
X1             0.6227      0.118      5.271      0.000         0.380     0.866
X3             0.3124      0.154      2.026      0.053        -0.005     0.629
X6            -0.1870      0.145     -1.291      0.208        -0.485     0.111
==============================================================================
Omnibus:                        2.856   Durbin-Watson:                   1.938
Prob(Omnibus):                  0.240   Jarque-Bera (JB):                1.394
Skew:                          -0.121   Prob(JB):                        0.498
Kurtosis:                       1.972   Cond. No.                         605.
==============================================================================

Stopping criterion met: all remaining features have $|t| > 1$.

$\rightarrow$ optimal bias-variance point reached

$\rightarrow$ Residual standard error (RSE) minimized
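
The manual pruning above can be automated. Here is a minimal sketch of the backward-elimination loop, assuming the cleaned DataFrame x from above and the $|t| > 1$ stopping rule used in this lab (other stopping rules are common):

predictors = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6']
while True:
    formula = 'Y ~ ' + ' + '.join(predictors)
    fit_step = smf.ols(formula, data=x).fit()  # separate name so `fit` above is untouched
    tvals = fit_step.tvalues.drop('Intercept').abs()
    if tvals.min() > 1:
        break
    predictors.remove(tvals.idxmin())  # drop the feature with the smallest |t| and refit
print(formula)  # ends at 'Y ~ X1 + X3 + X6', i.e. the model in fit4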


In [9]:
fit5 = smf.ols('Y ~ X1 + X3', data=x).fit()
print(fit5.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.708
Model:                            OLS   Adj. R-squared:                  0.686
Method:                 Least Squares   F-statistic:                     32.74
Date:                Wed, 30 Apr 2014   Prob (F-statistic):           6.06e-08
Time:                        20:16:43   Log-Likelihood:                -98.569
No. Observations:                  30   AIC:                             203.1
Df Residuals:                      27   BIC:                             207.3
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      9.8709      7.061      1.398      0.174        -4.618    24.359
X1             0.6435      0.118      5.432      0.000         0.400     0.887
X3             0.2112      0.134      1.571      0.128        -0.065     0.487
==============================================================================
Omnibus:                        6.448   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.040   Jarque-Bera (JB):                1.959
Skew:                          -0.041   Prob(JB):                        0.375
Kurtosis:                       1.751   Cond. No.                         503.
==============================================================================

Note this model is weaker (lower adjusted $R^2$).


In [10]:
fit6 = smf.ols('Y ~ X1', data=x).fit()
print(fit6.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.670
Method:                 Least Squares   F-statistic:                     59.86
Date:                Wed, 30 Apr 2014   Prob (F-statistic):           1.99e-08
Time:                        20:17:03   Log-Likelihood:                -99.882
No. Observations:                  30   AIC:                             203.8
Df Residuals:                      28   BIC:                             206.6
Df Model:                           1                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.3763      6.620      2.172      0.039         0.816    27.937
X1             0.7546      0.098      7.737      0.000         0.555     0.954
==============================================================================
Omnibus:                        7.462   Durbin-Watson:                   2.245
Prob(Omnibus):                  0.024   Jarque-Bera (JB):                2.537
Skew:                          -0.331   Prob(JB):                        0.281
Kurtosis:                       1.739   Cond. No.                         352.
==============================================================================

Ditto: the single-predictor model is weaker still (lower adjusted $R^2$).
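
To compare the candidate models side by side, here is a quick sketch using the rsquared_adj and mse_resid attributes of the fits above (numpy is an extra import, used only for the square root):

import numpy as np

# adjusted R^2 and residual standard error (RSE = sqrt of the residual mean square)
for name, f in [('full', fit), ('fit4', fit4), ('fit5', fit5), ('fit6', fit6)]:
    print(name, round(f.rsquared_adj, 3), round(np.sqrt(f.mse_resid), 3))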

We want to see an absence of structure in the residual scatterplot (i.e., Gaussian white noise).


In [11]:
fit4.resid.plot(style='o', figsize=(12,8))


Out[11]:
<matplotlib.axes.AxesSubplot at 0x10f19a5d0>

$\rightarrow$ This plot looks pretty good; note also that the residual quartiles look reasonable.


In [12]:
plt.figure(figsize=(12,8))
_ = stats.probplot(fit4.resid, dist="norm", plot=plt)



1-2 Pairs

(Based on RABE 3.15)

A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:

  • Age: Median age of a person living in a state.

  • HS: Percentage of people over 25 years of age in a state who had completed high school.

  • Income: Per capita personal income for a state (income in dollars).

  • Black: Percentage of blacks living in a state.

  • Female: Percentage of females living in a state.

  • Price: Weighted average price (in cents) of a pack of cigarettes in a state.

  • Sales: Number of packs of cigarettes sold in a state on a per capita basis.

The data can be found at http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt.

Below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance; the relevant test statistic is written out after the list.

  1. Test the hypothesis that the variable Female is not needed in the regression equation relating Sales to the six predictor variables.

  2. Test the hypothesis that the variables Female and HS are not needed in the above regression equation.

  3. Compute a 95% confidence interval for the true regression coefficient of the variable Income.

  4. What percentage of the variation in Sales can be accounted for when Income is removed from the above regression equation? Which model did you use?
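
For questions 1 and 2, the standard test for dropping one or more predictors is the partial F-test (for a single coefficient it is equivalent to the usual t-test). Writing $\mathrm{RSS}$ for the residual sum of squares, $q$ for the number of coefficients set to zero under the null, and $n - p - 1$ for the residual degrees of freedom of the full model:

$$F = \frac{(\mathrm{RSS}_{\mathrm{reduced}} - \mathrm{RSS}_{\mathrm{full}})/q}{\mathrm{RSS}_{\mathrm{full}}/(n - p - 1)}$$

Under $H_0$, $F \sim F_{q,\, n-p-1}$; reject at the 5% level when the p-value is below 0.05.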


In [14]:
data = pd.read_table('http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt')

In [29]:
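# Pull each column into a named variable so the model formulas below can refer to them.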
X2 = data['Age']
X3 = data['HS']
X4 = data['Income']
X5 = data['Black']
X6 = data['Female']
X7 = data['Price']
Y = data['Sales']

In [30]:
a = smf.ols('Y ~ X2 + X3 + X4 + X5 + X6 + X7', data=data)

In [31]:
fit = a.fit()
print(fit.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.321
Model:                            OLS   Adj. R-squared:                  0.228
Method:                 Least Squares   F-statistic:                     3.464
Date:                Wed, 30 Apr 2014   Prob (F-statistic):            0.00686
Time:                        20:36:15   Log-Likelihood:                -238.86
No. Observations:                  51   AIC:                             491.7
Df Residuals:                      44   BIC:                             505.2
Df Model:                           6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    103.3448    245.607      0.421      0.676      -391.644   598.334
X2             4.5205      3.220      1.404      0.167        -1.969    11.009
X3            -0.0616      0.815     -0.076      0.940        -1.703     1.580
X4             0.0189      0.010      1.855      0.070        -0.002     0.040
X5             0.3575      0.487      0.734      0.467        -0.624     1.339
X6            -1.0529      5.561     -0.189      0.851       -12.260    10.155
X7            -3.2549      1.031     -3.156      0.003        -5.334    -1.176
==============================================================================
Omnibus:                       56.254   Durbin-Watson:                   1.663
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              358.088
Skew:                           2.842   Prob(JB):                     1.75e-78
Kurtosis:                      14.670   Cond. No.                     2.37e+05
==============================================================================

Warnings:
[1] The condition number is large, 2.37e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

$H_0$: the coefficient of Female ($X_6$) is 0. $H_a$: the coefficient of Female is not 0. Test used: t-test on $X_6$ in the full model (a).


In [33]:
b = smf.ols('Y ~ X2 + X3 + X4 + X5 + X7', data=data)
fit = b.fit()
print(fit.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.320
Model:                            OLS   Adj. R-squared:                  0.245
Method:                 Least Squares   F-statistic:                     4.241
Date:                Wed, 30 Apr 2014   Prob (F-statistic):            0.00304
Time:                        20:48:52   Log-Likelihood:                -238.88
No. Observations:                  51   AIC:                             489.8
Df Residuals:                      45   BIC:                             501.4
Df Model:                           5                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     59.4633     80.388      0.740      0.463      -102.446   221.372
X2             4.1178      2.391      1.722      0.092        -0.698     8.934
X3            -0.0668      0.805     -0.083      0.934        -1.689     1.555
X4             0.0195      0.010      1.997      0.052        -0.000     0.039
X5             0.3115      0.418      0.746      0.460        -0.530     1.153
X7            -3.2520      1.020     -3.188      0.003        -5.307    -1.197
==============================================================================
Omnibus:                       56.194   Durbin-Watson:                   1.659
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              353.488
Skew:                           2.846   Prob(JB):                     1.74e-77
Kurtosis:                      14.573   Cond. No.                     7.85e+04
==============================================================================

Warnings:
[1] The condition number is large, 7.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Based on the full model (a: X2, ..., X7), the t-statistic for X6 is -0.189 with p = 0.851 > 0.05, so there is not enough evidence to reject $H_0$: the coefficient of Female is 0.

Based on the reduced model (b: X2, X3, X4, X5, X7), $R^2$ is essentially unchanged (0.321 $\rightarrow$ 0.320) while adjusted $R^2$ improves (0.228 $\rightarrow$ 0.245), which is consistent with dropping Female: there is not enough evidence to reject the null hypothesis.
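
The same conclusion can be read directly off the fitted full model; a small sketch using the pvalues attribute of the model object a defined above:

print(a.fit().pvalues['X6'])  # ~0.851 > 0.05, so fail to reject H0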

$H_0$: the coefficients of Female ($X_6$) and HS ($X_3$) are both 0. $H_a$: at least one of the two is not 0. Test used: partial F-test comparing the full model (a) to the reduced model (c).


In [34]:
c = smf.ols('Y ~ X2 + X4 + X5 + X7', data=data)
fit = c.fit()
print(fit.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.320
Model:                            OLS   Adj. R-squared:                  0.261
Method:                 Least Squares   F-statistic:                     5.416
Date:                Wed, 30 Apr 2014   Prob (F-statistic):            0.00117
Time:                        20:56:40   Log-Likelihood:                -238.88
No. Observations:                  51   AIC:                             487.8
Df Residuals:                      46   BIC:                             497.4
Df Model:                           4                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     55.3296     62.395      0.887      0.380       -70.266   180.925
X2             4.1915      2.196      1.909      0.062        -0.228     8.611
X4             0.0189      0.007      2.745      0.009         0.005     0.033
X5             0.3342      0.312      1.071      0.290        -0.294     0.962
X7            -3.2399      0.999     -3.244      0.002        -5.250    -1.230
==============================================================================
Omnibus:                       56.030   Durbin-Watson:                   1.661
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              350.319
Skew:                           2.838   Prob(JB):                     8.49e-77
Kurtosis:                      14.517   Cond. No.                     6.16e+04
==============================================================================

Warnings:
[1] The condition number is large, 6.16e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Based on the full model (a: X2, ..., X7), the t-statistics for HS (-0.076, p = 0.940) and Female (-0.189, p = 0.851) are both far from significant, and there is not enough evidence to reject $H_0$: the coefficients of Female and HS are both 0.

Based on the reduced model (c: X2, X4, X5, X7), $R^2$ is essentially unchanged (0.321 $\rightarrow$ 0.320) while adjusted $R^2$ improves (0.228 $\rightarrow$ 0.261), which is consistent with dropping X3 and X6: there is not enough evidence to reject the null hypothesis.
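
The joint hypothesis can also be tested directly with a partial F-test; a minimal sketch using statsmodels' compare_f_test, with the model objects a and c defined above:

full_fit = a.fit()
reduced_fit = c.fit()
F, p, df_diff = full_fit.compare_f_test(reduced_fit)  # H0: coefficients of X3 and X6 are both 0
print(F, p, df_diff)  # fail to reject H0 at the 5% level if p > 0.05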


In [37]:
d = smf.ols('Y ~ X4', data=data)
fit = d.fit()
print(fit.summary())
print(fit.conf_int(alpha=0.05))


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.106
Model:                            OLS   Adj. R-squared:                  0.088
Method:                 Least Squares   F-statistic:                     5.829
Date:                Wed, 30 Apr 2014   Prob (F-statistic):             0.0195
Time:                        21:08:05   Log-Likelihood:                -245.86
No. Observations:                  51   AIC:                             495.7
Df Residuals:                      49   BIC:                             499.6
Df Model:                           1                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     55.3625     27.743      1.996      0.052        -0.389   111.114
X4             0.0176      0.007      2.414      0.020         0.003     0.032
==============================================================================
Omnibus:                       47.570   Durbin-Watson:                   1.787
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              216.600
Skew:                           2.434   Prob(JB):                     9.25e-48
Kurtosis:                      11.845   Cond. No.                     2.46e+04
==============================================================================

Warnings:
[1] The condition number is large, 2.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
                  0           1
Intercept -0.389356  111.114265
X4         0.002948    0.032218

[2 rows x 2 columns]

Based on the third model (d: Sales ~ Income) and $\alpha = 0.05$, the 95% confidence interval for the coefficient of Income (X4) is (0.0029, 0.0322), which contains the point estimate 0.0176.
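
The interval printed by conf_int can be reproduced by hand from the coefficient, its standard error, and a t critical value; a sketch using the fitted model d from above:

res = d.fit()
t_crit = stats.t.ppf(0.975, res.df_resid)  # two-sided 95% critical value
lo = res.params['X4'] - t_crit * res.bse['X4']
hi = res.params['X4'] + t_crit * res.bse['X4']
print(lo, hi)  # approximately (0.0029, 0.0322), matching the table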


In [39]:
e = smf.ols('Y ~ X2 + X3 + X5 + X6 + X7', data=data)
fit = e.fit()
print(fit.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.268
Model:                            OLS   Adj. R-squared:                  0.186
Method:                 Least Squares   F-statistic:                     3.291
Date:                Wed, 30 Apr 2014   Prob (F-statistic):             0.0129
Time:                        21:12:27   Log-Likelihood:                -240.78
No. Observations:                  51   AIC:                             493.6
Df Residuals:                      45   BIC:                             505.1
Df Model:                           5                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    162.3245    250.054      0.649      0.520      -341.309   665.959
X2             7.3073      2.924      2.499      0.016         1.419    13.196
X3             0.9717      0.610      1.592      0.118        -0.258     2.201
X5             0.8447      0.421      2.005      0.051        -0.004     1.693
X6            -3.7815      5.506     -0.687      0.496       -14.872     7.309
X7            -2.8603      1.036     -2.760      0.008        -4.947    -0.773
==============================================================================
Omnibus:                       50.377   Durbin-Watson:                   1.594
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              252.496
Skew:                           2.572   Prob(JB):                     1.48e-55
Kurtosis:                      12.611   Cond. No.                     5.43e+03
==============================================================================

Warnings:
[1] The condition number is large, 5.43e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Removing Income (X4) drops $R^2$ from 0.321 (model a) to 0.268 (model e), so Income accounts for roughly 5.3 percentage points of the variation in Sales; without it, model e accounts for 26.8% of the variation.
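
That figure comes straight from the two fits; a small sketch using the rsquared attribute of the model objects a and e defined above:

r2_full = a.fit().rsquared        # ~0.321, with Income
r2_no_income = e.fit().rsquared   # ~0.268, without Income
print(r2_full - r2_no_income)     # ~0.053, about 5.3 percentage points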



Homework

Level 1

Use enron.db from last week.

  1. Select a DataFrame containing the recipient count (as before) and the department of the sender for each message.

  2. Create a histogram of the recipient count.

  3. Create a new column based on the log of the recipient count.

  4. Create a histogram of that log.

  5. Create a boxplot of the log, splitting the data based on the department of the sender.

  6. Compute the sample mean and standard deviation of the log in the three groups.

  7. Compute a 95% confidence interval for the difference in recipient count between the three groups.

  8. At level $\alpha=5\%$, test the null hypothesis that the average recipient count does not differ between the three groups. What assumptions are you making? What can you conclude?

Level 2

A criminologist studying the relationship between income level and assaults in U.S. cities (among other things) collected the following data for 2215 communities. The dataset can be found at the UCI Machine Learning Repository.

We are interested in the per capita assault rate and its relation to median income.


In [ ]:
crime = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt", header=None, na_values='?',
                    names = ['communityname', 'state', 'countyCode', 'communityCode', 'fold', 'population', 'householdsize', 'racepctblack', 'racePctWhite', 'racePctAsian', 'racePctHisp', 'agePct12t21', 'agePct12t29', 'agePct16t24', 'agePct65up', 'numbUrban', 'pctUrban', 'medIncome', 'pctWWage', 'pctWFarmSelf', 'pctWInvInc', 'pctWSocSec', 'pctWPubAsst', 'pctWRetire', 'medFamInc', 'perCapInc', 'whitePerCap', 'blackPerCap', 'indianPerCap', 'AsianPerCap', 'OtherPerCap', 'HispPerCap', 'NumUnderPov', 'PctPopUnderPov', 'PctLess9thGrade', 'PctNotHSGrad', 'PctBSorMore', 'PctUnemployed', 'PctEmploy', 'PctEmplManu', 'PctEmplProfServ', 'PctOccupManu', 'PctOccupMgmtProf', 'MalePctDivorce', 'MalePctNevMarr', 'FemalePctDiv', 'TotalPctDiv', 'PersPerFam', 'PctFam2Par', 'PctKids2Par', 'PctYoungKids2Par', 'PctTeen2Par', 'PctWorkMomYoungKids', 'PctWorkMom', 'NumKidsBornNeverMar', 'PctKidsBornNeverMar', 'NumImmig', 'PctImmigRecent', 'PctImmigRec5', 'PctImmigRec8', 'PctImmigRec10', 'PctRecentImmig', 'PctRecImmig5', 'PctRecImmig8', 'PctRecImmig10', 'PctSpeakEnglOnly', 'PctNotSpeakEnglWell', 'PctLargHouseFam', 'PctLargHouseOccup', 'PersPerOccupHous', 'PersPerOwnOccHous', 'PersPerRentOccHous', 'PctPersOwnOccup', 'PctPersDenseHous', 'PctHousLess3BR', 'MedNumBR', 'HousVacant', 'PctHousOccup', 'PctHousOwnOcc', 'PctVacantBoarded', 'PctVacMore6Mos', 'MedYrHousBuilt', 'PctHousNoPhone', 'PctWOFullPlumb', 'OwnOccLowQuart', 'OwnOccMedVal', 'OwnOccHiQuart', 'OwnOccQrange', 'RentLowQ', 'RentMedian', 'RentHighQ', 'RentQrange', 'MedRent', 'MedRentPctHousInc', 'MedOwnCostPctInc', 'MedOwnCostPctIncNoMtg', 'NumInShelters', 'NumStreet', 'PctForeignBorn', 'PctBornSameState', 'PctSameHouse85', 'PctSameCity85', 'PctSameState85', 'LemasSwornFT', 'LemasSwFTPerPop', 'LemasSwFTFieldOps', 'LemasSwFTFieldPerPop', 'LemasTotalReq', 'LemasTotReqPerPop', 'PolicReqPerOffic', 'PolicPerPop', 'RacialMatchCommPol', 'PctPolicWhite', 'PctPolicBlack', 'PctPolicHisp', 'PctPolicAsian', 'PctPolicMinor', 'OfficAssgnDrugUnits', 'NumKindsDrugsSeiz', 'PolicAveOTWorked', 'LandArea', 'PopDens', 'PctUsePubTrans', 'PolicCars', 'PolicOperBudg', 'LemasPctPolicOnPatr', 'LemasGangUnitDeploy', 'LemasPctOfficDrugUn', 'PolicBudgPerPop', 'murders', 'murdPerPop', 'rapes', 'rapesPerPop', 'robberies', 'robbbPerPop', 'assaults', 'assaultPerPop', 'burglaries', 'burglPerPop', 'larcenies', 'larcPerPop', 'autoTheft', 'autoTheftPerPop', 'arsons', 'arsonsPerPop', 'ViolentCrimesPerPop', 'nonViolPerPop'])

  1. Fit a simple linear regression model to the data with np.log(crime.assaults) as the dependent variable and np.log(crime.medIncome) as the independent variable. Plot the estimated regression line.

  2. Test whether there is a linear relationship between assaults and medIncome at level $\alpha=0.05$. State the null hypothesis, the alternative, the conclusion and the $p$-value.

  3. Give a 95% confidence interval for the slope of the regression line. Interpret your interval.

  4. Report the $R^2$ and the adjusted $R^2$ of the model, as well as an estimate of the variance of the errors in the model.

  5. Go to archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized and pick out a few other factors that might help you predict assaults.