You are currently looking at version 1.1 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.



In [1]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

Assignment 4 - Hypothesis Testing

This assignment requires more individual learning than previous assignments - you are encouraged to check out the pandas documentation to find functions or methods you might not have used yet, or ask questions on Stack Overflow and tag them as pandas and python related. And of course, the discussion forums are open for interaction with your peers and the course staff.

Definitions:

  • A quarter is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.
  • A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
  • A recession bottom is the quarter within a recession which had the lowest GDP.
  • A university town is a city which has a high percentage of university students compared to the total population of the city.

Hypothesis: University towns have their mean housing prices less affected by recessions. Run a t-test to compare the ratios of the mean price of houses in university towns versus non-university towns, where each ratio is the mean price the quarter before the recession starts divided by the mean price at the recession bottom. (price_ratio=quarter_before_recession/recession_bottom)
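
A rough sketch of the comparison this hypothesis calls for (the lists below are hypothetical stand-ins for the per-town price ratios, not the graded implementation):

from scipy.stats import ttest_ind

# price_ratio = mean price the quarter before the recession / mean price at the recession bottom
univ_ratios = [0.95, 1.02, 0.98]        # hypothetical ratios for university towns
non_univ_ratios = [1.08, 1.12, 1.04]    # hypothetical ratios for non-university towns

statistic, pvalue = ttest_ind(univ_ratios, non_univ_ratios)
# A small p-value suggests the mean ratios of the two groups differ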

The following data files are available for this assignment:

  • From the Zillow research data site there is housing data for the United States. In particular the datafile for all homes at a city level, City_Zhvi_AllHomes.csv, has median home sale prices at a fine-grained level.
  • From the Wikipedia page on college towns is a list of university towns in the United States which has been copy and pasted into the file university_towns.txt.
  • From the Bureau of Economic Analysis, US Department of Commerce, the GDP over time of the United States in current dollars (use the chained value in 2009 dollars), in quarterly intervals, in the file gdplev.xls. For this assignment, only look at GDP data from the first quarter of 2000 onward.

Each function in this assignment below is worth 10%, with the exception of run_ttest(), which is worth 50%.


In [2]:
# Use this dictionary to map two letter state acronyms to full state names
states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}

In [3]:
df_all_homes = pd.read_csv("City_Zhvi_AllHomes.csv")

In [4]:
df_all_homes.head()


Out[4]:
RegionID RegionName State Metro CountyName SizeRank 1996-04 1996-05 1996-06 1996-07 ... 2015-11 2015-12 2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 2016-07 2016-08
0 6181 New York NY New York Queens 1 NaN NaN NaN NaN ... 573600 576200 578400 582200 588000 592200 592500 590200 588000 586400
1 12447 Los Angeles CA Los Angeles-Long Beach-Anaheim Los Angeles 2 155000.0 154600.0 154400.0 154200.0 ... 558200 560800 562800 565600 569700 574000 577800 580600 583000 585100
2 17426 Chicago IL Chicago Cook 3 109700.0 109400.0 109300.0 109300.0 ... 207800 206900 206200 205800 206200 207300 208200 209100 211000 213000
3 13271 Philadelphia PA Philadelphia Philadelphia 4 50000.0 49900.0 49600.0 49400.0 ... 122300 121600 121800 123300 125200 126400 127000 127400 128300 129100
4 40326 Phoenix AZ Phoenix Maricopa 5 87200.0 87700.0 88200.0 88400.0 ... 183800 185300 186600 188000 189100 190200 191300 192800 194500 195900

5 rows × 251 columns


In [5]:
def get_list_of_university_towns():
    '''Returns a DataFrame of towns and the states they are in from the 
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ], 
    columns=["State", "RegionName"]  )
    
    The following cleaning needs to be done:

    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'. '''
    
    with open("university_towns.txt", "r") as towns:
        lines = towns.read().split("\n")
    
    # Lines containing the string "[edit]" are state headers.
    state_indicator = "[edit]"
    state = ""
    states_regions = []
    
    for element in lines:
        if state_indicator in element:
            # State
            state = element.split("[")[0].rstrip()
        else:
            # Region
            states_regions.append({"State" : state, "RegionName" : element.split("(")[0].rstrip()})
            
    df = pd.DataFrame.from_dict(states_regions)
    # Drop the last row: the trailing newline produces an entry with a State but an empty RegionName
    df = df.drop(df.index[len(df)-1])
    
    # Move State to be the first column; the automatic grader expects
    # this column order
    cols = df.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    df = df[cols]
    
    return df

get_list_of_university_towns()


Out[5]:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
12 Arkansas Arkadelphia
13 Arkansas Conway
14 Arkansas Fayetteville
15 Arkansas Jonesboro
16 Arkansas Magnolia
17 Arkansas Monticello
18 Arkansas Russellville
19 Arkansas Searcy
20 California Angwin
21 California Arcata
22 California Berkeley
23 California Chico
24 California Claremont
25 California Cotati
26 California Davis
27 California Irvine
28 California Isla Vista
29 California University Park, Los Angeles
... ... ...
487 Virginia Wise
488 Virginia Chesapeake
489 Washington Bellingham
490 Washington Cheney
491 Washington Ellensburg
492 Washington Pullman
493 Washington University District, Seattle
494 West Virginia Athens
495 West Virginia Buckhannon
496 West Virginia Fairmont
497 West Virginia Glenville
498 West Virginia Huntington
499 West Virginia Montgomery
500 West Virginia Morgantown
501 West Virginia Shepherdstown
502 West Virginia West Liberty
503 Wisconsin Appleton
504 Wisconsin Eau Claire
505 Wisconsin Green Bay
506 Wisconsin La Crosse
507 Wisconsin Madison
508 Wisconsin Menomonie
509 Wisconsin Milwaukee
510 Wisconsin Oshkosh
511 Wisconsin Platteville
512 Wisconsin River Falls
513 Wisconsin Stevens Point
514 Wisconsin Waukesha
515 Wisconsin Whitewater
516 Wyoming Laramie

517 rows × 2 columns
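
To illustrate the cleaning rules described in get_list_of_university_towns() on a couple of hypothetical raw lines (state headers end with "[edit]", region lines may carry a parenthetical and references):

state_line = "Alabama[edit]"                     # hypothetical state header line
region_line = "Auburn (Auburn University)[1]"    # hypothetical region line

state_line.split("[")[0].rstrip()     # 'Alabama'
region_line.split("(")[0].rstrip()    # 'Auburn'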


In [6]:
# Get GDP info: drop unnecessary header rows
df_gdp = pd.read_excel("gdplev.xls", header=6)
# Keep only the quarterly columns: quarter label, GDP in current dollars, GDP in chained 2009 dollars
df_gdp = df_gdp.iloc[:, 4:7]
df_gdp.columns = ["Quarter", "GDP current", "GDP chained"]
df_gdp.head()


Out[6]:
Quarter GDP current GDP chained
0 1947q1 243.1 1934.5
1 1947q2 246.3 1932.3
2 1947q3 250.1 1930.3
3 1947q4 260.3 1960.7
4 1948q1 266.2 1989.5

In [7]:
# Find GDP changes between quarters
df_gdp["Diff"] = df_gdp["GDP chained"].diff()
df_gdp.head()


Out[7]:
Quarter GDP current GDP chained Diff
0 1947q1 243.1 1934.5 NaN
1 1947q2 246.3 1932.3 -2.2
2 1947q3 250.1 1930.3 -2.0
3 1947q4 260.3 1960.7 30.4
4 1948q1 266.2 1989.5 28.8

In [8]:
# Mark increase (1) or decrease (0) in GDP
df_gdp["Change"] = np.where(df_gdp["Diff"] < 0, "0", "1")
# Now drop years before 2000 since "For this assignment, only look at GDP data from the first quarter of 2000 onward"
df_gdp = df_gdp[df_gdp.Quarter >= "2000q1"]
df_gdp.head()


Out[8]:
Quarter GDP current GDP chained Diff Change
212 2000q1 10031.0 12359.1 35.8 1
213 2000q2 10278.3 12592.5 233.4 1
214 2000q3 10357.4 12607.7 15.2 1
215 2000q4 10472.3 12679.3 71.6 1
216 2001q1 10508.1 12643.3 -36.0 0

In [9]:
# Find 2 consecutive declines in GDP. Encode the increase/decline flags as a single
# string so it's easy to search for consecutive declines/increases
change_str = "".join(df_gdp["Change"].values)
change_str
change_str


Out[9]:
'111101011111111111111111111111110100001111110111111111110111111111'

In [10]:
# Find the recession: two consecutive declines followed by two consecutive increases ('0011')
recession_index = change_str.index("0011")
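
To see why searching the change string for "0011" locates the recession, here is the same trick on a made-up toy string (not the assignment data):

toy = "1110001111"                 # growth, three quarters of decline, then growth
i = toy.index("0011")              # 4: last two declines followed by two quarters of growth
start = toy.rindex("1", 0, i) + 1  # 3: first quarter of the run of declines
end = i + len("0011") - 1          # 7: second consecutive quarter of growth
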

In [11]:
def get_recession_start():
    '''Returns the year and quarter of the recession start time as a 
    string value in a format such as 2005q3'''
        
    # Recession start: the quarter right after the last '1' preceding the '0011' pattern,
    # i.e. the first quarter of the run of consecutive GDP declines
    start_recession_index = change_str.rindex("1", 0, recession_index) + 1
    return df_gdp.iloc[start_recession_index]["Quarter"]

get_recession_start()


Out[11]:
'2008q3'

In [12]:
def get_recession_end():
    '''Returns the year and quarter of the recession end time as a 
    string value in a format such as 2005q3'''
    
    # Recession end: the last position of the '0011' pattern, i.e. the second
    # consecutive quarter of GDP growth
    end_recession_index = recession_index + len("0011") - 1
    return df_gdp.iloc[end_recession_index]["Quarter"]

get_recession_end()


Out[12]:
'2009q4'

In [13]:
def get_recession_bottom():
    '''Returns the year and quarter of the recession bottom time as a 
    string value in a format such as 2005q3'''
    
    recession_start = get_recession_start()
    recession_end = get_recession_end()
    
    df_recession = df_gdp[ (df_gdp.Quarter >= recession_start) & (df_gdp.Quarter <= recession_end) ]
    return df_recession.loc[df_recession["GDP chained"].idxmin()]["Quarter"]

get_recession_bottom()


Out[13]:
'2009q2'

In [14]:
def convert_housing_data_to_quarters():
    '''Converts the housing data to quarters and returns it as mean 
    values in a dataframe. This dataframe should be a dataframe with
    columns for 2000q1 through 2016q3, and should have a multi-index
    in the shape of ["State","RegionName"].
    
    Note: Quarters are defined in the assignment description, they are
    not arbitrary three month periods.
    
    The resulting dataframe should have 67 columns, and 10,730 rows.
    '''
    # A quarter is a specific three month period, 
    # - Q1 is January through March (1-3)
    # - Q2 is April through June (4-6)
    # - Q3 is July through September (7-9)
    # - Q4 is October through December. (10-12)
    
    # Drop columns for years not under inspection (year < 2000)
    cols = [c for c in df_all_homes.columns if c.lower()[:3] != '199']
    # Take an explicit copy so adding quarter columns below doesn't raise SettingWithCopyWarning
    df_homes = df_all_homes[cols].copy()
    
    # Calculate mean for each quarter between 2000-2016
    quarters = [ ["01", "02", "03"], ["04", "05", "06"], ["07", "08", "09"], ["10", "11", "12"] ]
    
    for year in range(2000,2016):
        for q in range(0, 4):
            quarter_columns = ["{}-{}".format(year,quarters[q][0]), "{}-{}".format(year,quarters[q][1]), "{}-{}".format(year,quarters[q][2])]
            df_homes["{}q{}".format(year, q+1)] = df_homes[quarter_columns].mean(axis=1)
            
    # 2016 is not a full year (data run through 2016-08), so compute its quarters separately
    df_homes["2016q1"] = df_homes[["2016-01", "2016-02", "2016-03"]].mean(axis=1)
    df_homes["2016q2"] = df_homes[["2016-04", "2016-05", "2016-06"]].mean(axis=1)
    df_homes["2016q3"] = df_homes[["2016-07", "2016-08"]].mean(axis=1)
    
    # Drop the monthly 'YYYY-MM' columns, keeping only the quarter columns and metadata
    cols = [c for c in df_homes.columns if c.lower()[4] != '-']
    df_homes = df_homes[cols]
    
    # Drop unnecessary columns
    df_homes = df_homes.drop( ["RegionID", "Metro", "CountyName", "SizeRank"], axis=1 )
    
    # Convert two letter state acronyms to full state names
    df_homes["State"] = df_homes["State"].map(states)
    
    # Create indexes
    df_homes_indexed = df_homes.set_index(["State", "RegionName"])
    
    return df_homes_indexed

df_houses_quarters = convert_housing_data_to_quarters()
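
An alternative sketch of the same quarter-averaging step, assuming the installed pandas still supports column-wise groupby (axis=1) and that price_cols below captures exactly the monthly 'YYYY-MM' columns from 2000 onward:

# Hypothetical alternative: average the monthly columns into calendar quarters in one step
price_cols = [c for c in df_all_homes.columns if c[:2] == '20']        # '2000-01' ... '2016-08'
quarter_index = pd.PeriodIndex(price_cols, freq='Q')                   # maps each month to its quarter
df_quarterly = df_all_homes[price_cols].groupby(quarter_index, axis=1).mean()
df_quarterly.columns = [str(q).lower() for q in df_quarterly.columns]  # e.g. '2000q1'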



In [15]:
# Calculate recession impact on house prices
recession_start = get_recession_start()
recession_end = get_recession_end()
    
print("Recession start {} and end {}".format(recession_start, recession_end))
    
# The line below caused the notebook to run out of memory, so use apply with a lambda instead.
#df_houses_quarters["Ratio"] =  df_houses_quarters[recession_start] / df_houses_quarters[recession_bottom]

df_houses_quarters["Ratio"] = df_houses_quarters.apply(lambda row: row[recession_start] / row[recession_end], axis=1)
df_houses_quarters.head()


Recession start 2008q3 and end 2009q4
Out[15]:
2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 2002q1 2002q2 ... 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3 Ratio
State RegionName
New York New York NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 522800.000000 528066.666667 532266.666667 540800.000000 557200.000000 572833.333333 582866.666667 591633.333333 587200.0 1.090321
California Los Angeles 207066.666667 214466.666667 220966.666667 226166.666667 233000.000000 239100.000000 245066.666667 253033.333333 261966.666667 272700.000000 ... 509066.666667 518866.666667 528800.000000 538166.666667 547266.666667 557733.333333 566033.333333 577466.666667 584050.0 1.161171
Illinois Chicago 138400.000000 143633.333333 147866.666667 152133.333333 156933.333333 161800.000000 166400.000000 170433.333333 175500.000000 177566.666667 ... 195766.666667 201266.666667 201066.666667 206033.333333 208300.000000 207900.000000 206066.666667 208200.000000 212000.0 1.096063
Pennsylvania Philadelphia 53000.000000 53633.333333 54133.333333 54700.000000 55333.333333 55533.333333 56266.666667 57533.333333 59133.333333 60733.333333 ... 115300.000000 115666.666667 116200.000000 117966.666667 121233.333333 122200.000000 123433.333333 126933.333333 128700.0 0.986224
Arizona Phoenix 111833.333333 114366.666667 116000.000000 117400.000000 119600.000000 121566.666667 122700.000000 124300.000000 126533.333333 128366.666667 ... 165366.666667 168500.000000 171533.333333 174166.666667 179066.666667 183833.333333 187900.000000 191433.333333 195200.0 1.350604

5 rows × 68 columns


In [16]:
university_towns = get_list_of_university_towns()
university_towns = university_towns.set_index(["State", "RegionName"])
university_towns.head()


Out[16]:
State RegionName
Alabama Auburn
Florence
Jacksonville
Livingston
Montevallo

In [17]:
# Get all university towns
df_univ_towns = df_houses_quarters.loc[list(university_towns.index)]
df_univ_towns = df_univ_towns.dropna()
df_univ_towns.head()


Out[17]:
2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 2002q1 2002q2 ... 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3 Ratio
Alabama Montevallo 97000.0 96800.000000 96533.333333 98333.333333 99466.666667 101333.333333 103200.000000 101866.666667 100900.000000 102700.000000 ... 120266.666667 120000.000000 119900.000000 119466.666667 119400.000000 119833.333333 121700.000000 122400.000000 121500.0 1.025242
Tuscaloosa 104500.0 106033.333333 108500.000000 108766.666667 109900.000000 111366.666667 110000.000000 110700.000000 111900.000000 111233.333333 ... 139566.666667 142733.333333 145000.000000 144700.000000 144266.666667 145000.000000 140300.000000 138033.333333 138650.0 1.022461
Alaska Fairbanks 163200.0 165033.333333 169300.000000 172800.000000 164433.333333 157800.000000 158200.000000 154666.666667 152766.666667 154533.333333 ... 207100.000000 207000.000000 207766.666667 206466.666667 208433.333333 209466.666667 209066.666667 212933.333333 215850.0 1.115757
Arizona Flagstaff 161300.0 164100.000000 165466.666667 168266.666667 172000.000000 175633.333333 178966.666667 180800.000000 182066.666667 184400.000000 ... 295066.666667 295966.666667 297633.333333 302800.000000 307333.333333 310033.333333 312266.666667 318433.333333 322650.0 1.136033
Tempe 128100.0 129933.333333 131100.000000 132800.000000 135633.333333 137900.000000 139700.000000 141833.333333 144166.666667 146566.666667 ... 213066.666667 216866.666667 220466.666667 225900.000000 231200.000000 237800.000000 241900.000000 245000.000000 246950.0 1.206204

5 rows × 68 columns


In [18]:
# Get all non-university towns
non_univ_index = set(df_houses_quarters.index) - set(df_univ_towns.index)
df_non_univ_towns = df_houses_quarters.loc[list(non_univ_index)]
df_non_univ_towns = df_non_univ_towns.dropna()
df_non_univ_towns.head()


Out[18]:
2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 2002q1 2002q2 ... 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3 Ratio
State RegionName
Kentucky Briarwood 101666.666667 105766.666667 108300.000000 109733.333333 109166.666667 109266.666667 108800.000000 108700.000000 111533.333333 112066.666667 ... 129666.666667 129600.000000 129366.666667 132200.000000 131766.666667 131400.000000 133033.333333 134466.666667 134900.0 1.001606
Texas Baytown 74133.333333 73000.000000 73600.000000 74566.666667 74400.000000 73400.000000 73766.666667 74400.000000 75566.666667 77366.666667 ... 102433.333333 104766.666667 107566.666667 110500.000000 113533.333333 116666.666667 118933.333333 122266.666667 124200.0 0.990712
Pennsylvania Pulaski 65266.666667 66033.333333 65666.666667 65000.000000 61300.000000 60800.000000 62466.666667 64966.666667 65933.333333 67666.666667 ... 67533.333333 64266.666667 63300.000000 64700.000000 67466.666667 71000.000000 73900.000000 78866.666667 78200.0 0.962736
Illinois Schiller Park 153400.000000 157100.000000 158933.333333 162966.666667 165933.333333 172733.333333 178266.666667 186366.666667 190200.000000 189400.000000 ... 181700.000000 182966.666667 180700.000000 179766.666667 176433.333333 177433.333333 178133.333333 182066.666667 186950.0 1.177983
California Mentone 126933.333333 129900.000000 132866.666667 135766.666667 139133.333333 142066.666667 146800.000000 153700.000000 159100.000000 164866.666667 ... 234000.000000 238866.666667 244866.666667 252266.666667 261900.000000 272000.000000 276766.666667 278766.666667 280200.0 1.191615

5 rows × 68 columns
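
A mask-based split is an equivalent sketch (not the code used above) and also sidesteps the KeyError that newer pandas versions raise when .loc receives labels missing from the housing data:

# Hypothetical alternative: boolean mask over the MultiIndex
is_univ = df_houses_quarters.index.isin(university_towns.index)
df_univ_alt = df_houses_quarters[is_univ].dropna()
df_non_univ_alt = df_houses_quarters[~is_univ].dropna()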


In [19]:
def run_ttest():
    '''First creates new data showing the decline or growth of housing prices
    between the recession start and the recession bottom. Then runs a ttest
    comparing the university town values to the non-university town values,
    returning whether the null hypothesis (that the two groups are the same)
    can be rejected, together with the p-value of the test.
    
    Return the tuple (different, p, better) where different=True if the t-test is
    True at a p<0.01 (we reject the null hypothesis), or different=False if 
    otherwise (we cannot reject the null hypothesis). The variable p should
    be equal to the exact p value returned from scipy.stats.ttest_ind(). The
    value for better should be either "university town" or "non-university town"
    depending on which has a lower mean price ratio (which is equivalent to a
    reduced market loss).'''
    statistic, pvalue = ttest_ind(df_univ_towns["Ratio"], df_non_univ_towns["Ratio"])
    
    different = pvalue < 0.01
    
    if df_univ_towns["Ratio"].mean() < df_non_univ_towns["Ratio"].mean():
        better = "university town"
    else:
        better = "non-university town"
        
    return different, pvalue, better

run_ttest()


Out[19]:
(True, 0.0041517008594592966, 'university town')
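
A side note on the test itself: scipy.stats.ttest_ind assumes equal variances in the two groups by default; if that assumption is in doubt, Welch's unequal-variance t-test can be requested instead (shown only as an illustration, not as part of the graded answer):

ttest_ind(df_univ_towns["Ratio"], df_non_univ_towns["Ratio"], equal_var=False)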