In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf
this dataset contains six numeric survey responses $X_1, \dots, X_6$
and a dependent variable $Y$ (perceived supervisor quality)
we want to predict $Y$ from the $X_i$
In [2]:
x = pd.read_table('http://www.ats.ucla.edu/stat/examples/chp/p054.txt')
x.head()
Out[2]:
the column names have trailing whitespace, so I fix that by mapping str.strip over the column names.
In [3]:
x.columns = x.columns.map(str.strip)
this set of scatterplots gives us an idea of the pairwise relationships present in the dataset
In [4]:
_ = pd.plotting.scatter_matrix(x, figsize=(7, 7))
this linear fit represents the "full model"; i.e., the fit with all of the independent variables included
In [5]:
lm = smf.ols('Y ~ X1 + X2 + X3 + X4 + X5 + X6', data=x)
fit = lm.fit()
print(fit.summary())
remove the feature with the lowest absolute $t$-statistic
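to find that feature programmatically, a sketch using the full fit's $t$ statistics (fit is the full-model result above):

weakest = fit.tvalues.drop('Intercept').abs().idxmin()
print(weakest)  # per the next cell, this is X5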
In [6]:
fit2 = smf.ols('Y ~ X1 + X2 + X3 + X4 + X6', data=x).fit()
print(fit2.summary())
note $R^2$ decreases slightly, but adjusted $R^2$ increases slightly
$\rightarrow$ increasing bias, decreasing variance
same step again: remove the feature with the lowest absolute $t$-statistic
In [7]:
fit3 = smf.ols('Y ~ X1 + X2 + X3 + X6', data=x).fit()
print(fit3.summary())
In [8]:
fit4 = smf.ols('Y ~ X1 + X3 + X6', data=x).fit()
print(fit4.summary())
stopping criterion met: all features have $|t| > 1$
$\rightarrow$ optimal bias-variance point reached
$\rightarrow$ Residual standard error (RSE) minimized
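as a quick check, a sketch comparing adjusted $R^2$ and RSE across the fits so far (mse_resid is the residual mean square; its square root is the RSE):

import numpy as np
for name, f in [('full', fit), ('drop X5', fit2), ('drop X4', fit3), ('drop X2', fit4)]:
    print(name, f.rsquared_adj, np.sqrt(f.mse_resid))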
In [9]:
fit5 = smf.ols('Y ~ X1 + X3', data=x).fit()
print(fit5.summary())
note this model is weaker (lower adjusted $R^2$)
In [10]:
fit6 = smf.ols('Y ~ X1', data=x).fit()
print(fit6.summary())
ditto
want to see an absence of structure in the residual scatterplot ("Gaussian white noise")
In [11]:
fit4.resid.plot(style='o', figsize=(12,8))
Out[11]:
$\rightarrow$ this plot looks pretty good; also note that resid quartiles look good
In [12]:
plt.figure(figsize=(12,8))
_ = stats.probplot(fit4.resid, dist="norm", plot=plt)
A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:
Age: Median age of a person living in a state.
HS: Percentage of people over 25 years of age in a state who had completed high school.
Income: Per capita personal income for a state (income in dollars).
Black: Percentage of blacks living in a state.
Female: Percentage of females living in a state.
Price: Weighted average price (in cents) of a pack of cigarettes in a state.
Sales: Number of packs of cigarettes sold in a state on a per capita basis.
The data can be found at http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt.
Below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance.
Test the hypothesis that the variable Female is not needed in the regression equation relating Sales to the six predictor variables.
Test the hypothesis that the variables Female and HS are not needed in the above regression equation.
Compute a 95% confidence interval for the true regression coefficient of the variable Income.
What percentage of the variation in Sales can be accounted for when Income is removed from the above regression equation? Which model did you use?
In [14]:
data = pd.read_table('http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt')
In [29]:
# pull the columns out under the X2..X7 notation used below;
# patsy resolves these names from the calling environment when fitting
X2 = data['Age']
X3 = data['HS']
X4 = data['Income']
X5 = data['Black']
X6 = data['Female']
X7 = data['Price']
Y = data['Sales']
In [30]:
a = smf.ols('Y ~ X2 + X3 + X4 + X5 + X6 + X7', data=data)
In [31]:
fit = a.fit()
print(fit.summary())
$H_0$: the coefficient of Female ($X_6$) is 0; $H_a$: the coefficient of Female is not 0.
In [33]:
b = smf.ols('Y ~ X2 + X3 + X4 + X5 + X7', data=data)
fit = b.fit()
print(fit.summary())
Based on the results of the full model (a: X2–X7), the $t$-test for Female (X6) is not significant at the 5% level, so there is not enough evidence to reject $H_0$: the coefficient of Female is 0.
Based on the results of the second model (b: X2, X3, X4, X5, X7), the slightly higher adjusted $R^2$ indicates a stronger fit without the inclusion of X6. There is not enough evidence to reject the null hypothesis.
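The same conclusion can be checked with a partial $F$-test, which for a single dropped variable is equivalent to the $t$-test; a sketch using statsmodels' compare_f_test on the two fits above:

f_value, p_value, df_diff = a.fit().compare_f_test(b.fit())
print(f_value, p_value, df_diff)  # p > 0.05 agrees with the conclusion above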
$H_0$: the coefficients of Female ($X_6$) and HS ($X_3$) are both 0; $H_a$: at least one of them is not 0.
In [34]:
c = smf.ols('Y ~ X2 + X4 + X5 + X7', data=data)
fit = c.fit()
print(fit.summary())
Based on the results of the full model (a: X2–X7), there is not enough evidence to reject $H_0$: the coefficients of Female and HS are both 0.
Based on the results of the second model (c: X2, X4, X5, X7), the higher adjusted $R^2$ indicates a stronger fit without the inclusion of X6 and X3. There is not enough evidence to reject the null hypothesis.
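For this joint hypothesis, the partial $F$-test comparing the full model a against the restricted model c is the appropriate formal test; a sketch:

f_value, p_value, df_diff = a.fit().compare_f_test(c.fit())
print(f_value, p_value, df_diff)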
In [37]:
d = smf.ols('Y ~ X4', data=data)
fit = d.fit()
print(fit.summary())
print(fit.conf_int(alpha=0.05))
Based on the results of the third model (d: Sales ~ X4) and given $\alpha = 0.05$, the 95% confidence interval for the coefficient of X4 (Income) is (0.002948, 0.032218), around the point estimate 0.0176.
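For comparison, the analogous interval for Income inside the full model a can be pulled the same way (a sketch; the 'X4' row of conf_int holds Income's bounds):

print(a.fit().conf_int(alpha=0.05).loc['X4'])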
In [39]:
e = smf.ols('Y ~ X2 + X3 + X5 + X6 + X7', data=data)
fit = e.fit()
print(fit.summary())
Comparing the full model (a) against model e, which omits Income, the full model accounts for about 5.3% more of the variation in Sales; model e (X2, X3, X5, X6, X7) was the model used for the comparison.
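That percentage is just the difference in $R^2$ between the two fits; a sketch:

print(a.fit().rsquared - e.fit().rsquared)  # about 0.053, per the two summaries above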
Use enron.db from last week.
Select a DataFrame containing the recipient count (as before) and the department of the sender for each message.
Create a histogram of the recipient count.
Create a new column based on the log of the recipient count.
Create a histogram of that log.
Create a boxplot of the log, splitting the data based on the department of the sender.
Compute the sample mean and standard deviation of the log in the three groups.
Compute a 95% confidence interval for the difference in recipient count between the three groups.
At level $\alpha=5\%$, test the null hypothesis that the average recipient count does not differ between the three groups. What assumptions are you making? What can you conclude?
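A skeletal sketch of these steps; the table and column names (Message, Recipient, Employee, department, sender) are assumptions about enron.db's schema, so adjust them to the real one:

import sqlite3
import numpy as np

con = sqlite3.connect('enron.db')
# hypothetical schema: one row per message, with its recipient count and sender department
q = """SELECT m.mid, COUNT(r.rno) AS rcount, e.department
       FROM Message m
       JOIN Recipient r ON r.mid = m.mid
       JOIN Employee e ON e.eid = m.sender
       GROUP BY m.mid, e.department"""
msgs = pd.read_sql(q, con)
msgs['log_rcount'] = np.log(msgs['rcount'])
msgs['rcount'].hist()                               # raw recipient counts
plt.figure()
msgs['log_rcount'].hist()                           # log-transformed counts
msgs.boxplot(column='log_rcount', by='department')  # split by sender department
print(msgs.groupby('department')['log_rcount'].agg(['mean', 'std']))
# one-way ANOVA for equal means across the groups (assumes normality and equal variances)
groups = [g for _, g in msgs.groupby('department')['log_rcount']]
print(stats.f_oneway(*groups))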
A criminologist studying the relationship between income level and assaults in U.S. cities (among other things) collected the following data for 2215 communities. The dataset can be found on the UCI Machine Learning Repository.
We are interested in the per capita assault rate and its relation to median income.
In [ ]:
crime = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt", header = None, na_values = '?',
names = ['communityname', 'state', 'countyCode', 'communityCode', 'fold', 'population', 'householdsize', 'racepctblack', 'racePctWhite', 'racePctAsian', 'racePctHisp', 'agePct12t21', 'agePct12t29', 'agePct16t24', 'agePct65up', 'numbUrban', 'pctUrban', 'medIncome', 'pctWWage', 'pctWFarmSelf', 'pctWInvInc', 'pctWSocSec', 'pctWPubAsst', 'pctWRetire', 'medFamInc', 'perCapInc', 'whitePerCap', 'blackPerCap', 'indianPerCap', 'AsianPerCap', 'OtherPerCap', 'HispPerCap', 'NumUnderPov', 'PctPopUnderPov', 'PctLess9thGrade', 'PctNotHSGrad', 'PctBSorMore', 'PctUnemployed', 'PctEmploy', 'PctEmplManu', 'PctEmplProfServ', 'PctOccupManu', 'PctOccupMgmtProf', 'MalePctDivorce', 'MalePctNevMarr', 'FemalePctDiv', 'TotalPctDiv', 'PersPerFam', 'PctFam2Par', 'PctKids2Par', 'PctYoungKids2Par', 'PctTeen2Par', 'PctWorkMomYoungKids', 'PctWorkMom', 'NumKidsBornNeverMar', 'PctKidsBornNeverMar', 'NumImmig', 'PctImmigRecent', 'PctImmigRec5', 'PctImmigRec8', 'PctImmigRec10', 'PctRecentImmig', 'PctRecImmig5', 'PctRecImmig8', 'PctRecImmig10', 'PctSpeakEnglOnly', 'PctNotSpeakEnglWell', 'PctLargHouseFam', 'PctLargHouseOccup', 'PersPerOccupHous', 'PersPerOwnOccHous', 'PersPerRentOccHous', 'PctPersOwnOccup', 'PctPersDenseHous', 'PctHousLess3BR', 'MedNumBR', 'HousVacant', 'PctHousOccup', 'PctHousOwnOcc', 'PctVacantBoarded', 'PctVacMore6Mos', 'MedYrHousBuilt', 'PctHousNoPhone', 'PctWOFullPlumb', 'OwnOccLowQuart', 'OwnOccMedVal', 'OwnOccHiQuart', 'OwnOccQrange', 'RentLowQ', 'RentMedian', 'RentHighQ', 'RentQrange', 'MedRent', 'MedRentPctHousInc', 'MedOwnCostPctInc', 'MedOwnCostPctIncNoMtg', 'NumInShelters', 'NumStreet', 'PctForeignBorn', 'PctBornSameState', 'PctSameHouse85', 'PctSameCity85', 'PctSameState85', 'LemasSwornFT', 'LemasSwFTPerPop', 'LemasSwFTFieldOps', 'LemasSwFTFieldPerPop', 'LemasTotalReq', 'LemasTotReqPerPop', 'PolicReqPerOffic', 'PolicPerPop', 'RacialMatchCommPol', 'PctPolicWhite', 'PctPolicBlack', 'PctPolicHisp', 'PctPolicAsian', 'PctPolicMinor', 'OfficAssgnDrugUnits', 'NumKindsDrugsSeiz', 'PolicAveOTWorked', 'LandArea', 'PopDens', 'PctUsePubTrans', 'PolicCars', 'PolicOperBudg', 'LemasPctPolicOnPatr', 'LemasGangUnitDeploy', 'LemasPctOfficDrugUn', 'PolicBudgPerPop', 'murders', 'murdPerPop', 'rapes', 'rapesPerPop', 'robberies', 'robbbPerPop', 'assaults', 'assaultPerPop', 'burglaries', 'burglPerPop', 'larcenies', 'larcPerPop', 'autoTheft', 'autoTheftPerPop', 'arsons', 'arsonsPerPop', 'ViolentCrimesPerPop', 'nonViolPerPop'])
Fit a simple linear regression model to the data with np.log(crime.assaults) as the dependent variable and np.log(crime.medIncome) as the independent variable. Plot the estimated regression line.
Test whether there is a linear relationship between assaults and medIncome at level $\alpha=0.05$. State the null hypothesis, the alternative, the conclusion and the $p$-value.
Give a 95% confidence interval for the slope of the regression line. Interpret your interval.
Report the $R^2$ and the adjusted $R^2$ of the model, as well as an estimate of the variance of the errors in the model.
Go to archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized and pick out a few other factors that might help you predict assaults.
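As a starting point, a sketch of the simple fit requested above; the positivity filter is my assumption, so that both logs are defined:

import numpy as np

sub = crime[(crime.assaults > 0) & (crime.medIncome > 0)]
crime_fit = smf.ols('np.log(assaults) ~ np.log(medIncome)', data=sub).fit()
print(crime_fit.summary())             # the slope's t-test gives the p-value for the linear-relationship hypothesis
print(crime_fit.conf_int(alpha=0.05))  # 95% CI for the intercept and slope
print(crime_fit.mse_resid)             # estimate of the error variance

# scatter of the data with the estimated regression line
plt.scatter(np.log(sub.medIncome), np.log(sub.assaults), alpha=0.3)
plt.plot(np.log(sub.medIncome), crime_fit.fittedvalues, 'r.')
plt.xlabel('log(medIncome)')
plt.ylabel('log(assaults)')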