Analysis of STD prevalence in U.S. counties. Here I start with Chlamydia data from the CDC and add data from the Census Bureau to train a probabilistic model that predicts the Chlamydia rate from population characteristics. This model will then be used to predict Chlamydia rates in neighborhoods / ZIP code areas for which Census data is available.
In [1]:
# load some libraries we will need
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import re
%matplotlib inline
In [2]:
# Pandas and Seaborn options
# Always display all the columns
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 200)
# Plain Seaborn figures with matplotlib color codes mapped to the default seaborn palette
sns.set(style="white", color_codes=True)
In [3]:
# read in the csv retrieved from http://gis.cdc.gov/grasp/nchhstpatlas/main.html?value=atlas
df = pd.read_csv("../data/cdc/chlamydia.csv")
Let's explore the dataset by looking at its shape, columns and variable types
In [4]:
df.shape
Out[4]:
In [5]:
df.columns
Out[5]:
In [6]:
df.dtypes
Out[6]:
The object columns actually contain floats and integers. Let's see if we can convert them.
In [7]:
df_test = df.convert_objects(convert_numeric=True)
df_test.dtypes
Out[7]:
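As an aside, convert_objects is deprecated in newer pandas releases. A rough equivalent for the numeric columns, sketched here, uses pd.to_numeric with coercion; anything that cannot be parsed (values with commas, the 'Data not available' placeholder) becomes NaN, which is why the cleaning steps below are still needed.
# pd.to_numeric replaces the deprecated convert_objects; unparsable values become NaN
for col in ["Cases", "Rate", "Population"]:
    df_test[col] = pd.to_numeric(df_test[col], errors="coerce")
df_test.dtypes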
In [8]:
df_test.head()
Out[8]:
Some entries are not numeric: the Rate column contains 'Data not available' placeholders, and the Population and Cases columns use commas as thousands separators. Both have to be fixed before converting.
In [9]:
df["Rate"].sort_values()
Out[9]:
In [10]:
# replace the 'Data not available' placeholder with a stand-in value so the column can be converted to numeric
df['Rate'] = df['Rate'].str.replace('Data not available','275.300000')
In [11]:
df['Population'] = df['Population'].str.replace(',','')
df['Cases'] = df['Cases'].str.replace(',','')
In [12]:
df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[12]:
In [13]:
df.head()
Out[13]:
Let's explore a bit more and see what these counties look like
In [14]:
df['Rate'].describe()
Out[14]:
That's a wide range of values! Let's make sure this is not just some outlier.
In [15]:
df['Population'].idxmax()
Out[15]:
In [16]:
df.loc[207]
Out[16]:
It makes sense that LA County is that large. What's the distribution in general?
In [17]:
fig = plt.figure(figsize=(10, 6))
data = np.log10(df['Population'])
ax = data.plot.hist(50)
ax.set_xlabel("log(Population)")
ax.set_title("Chlamydia")
plt.savefig('../graphics/county_population_chlamydia.png', bbox_inches='tight', dpi=150)
A log-normal distribution peaking at $10^4$. Looks good so far, but I doubt that there are any cases of Chlamydia in the smallest of counties.
In [18]:
fig = plt.figure(figsize=(10, 6))
data = np.log10(df['Cases']+1)
ax = data.plot.hist(50)
ax.set_xlabel("log(Cases)")
ax.set_title("Chlamydia")
plt.savefig('../graphics/county_cases_chlamydia.png', bbox_inches='tight', dpi=150)
We will have to remove some of these counties, since they will only introduce noise. The question is: should we remove counties with a low number of cases, or counties with low rates?
In [19]:
fig = plt.figure(figsize=(10, 6))
ax = df['Rate'].plot.hist(100)
ax.set_xlabel("Rate (per 100,000 inhabitants)")
ax.set_title("Chlamydia")
plt.savefig('../graphics/county_rate_chlamydia.png', bbox_inches='tight', dpi=150)
In [20]:
outliers = df[df['Rate']<50]
In [21]:
len(outliers)
Out[21]:
Cutting out the low-rate ones would not cost too many data points. But there may be counties with a significant number of cases that also have many inhabitants and hence a low rate. It seems better to cut out the ones with low numbers of reported cases.
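To make the trade-off concrete, we can compare how many counties each criterion would remove; the rate threshold of 50 and the case cutoff of 1 below are just illustrative values.
low_rate = df[df['Rate'] < 50]
low_cases = df[df['Cases'] <= 1]
len(low_rate), len(low_cases)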
Let's look for missing values, NaN's, zeros, and duplicates.
In [22]:
df["Rate"].sort_values()
Out[22]:
We definitely want to remove those NaN's later!
How about unique values/duplicates?
In [23]:
len(df['Area'].unique())
Out[23]:
Why are there only 1962 unique counties?
In [24]:
df.shape
Out[24]:
In [25]:
df.sort_values(by='Area')
Out[25]:
In [26]:
df['Area'].value_counts()
Out[26]:
That makes sense now: many county names occur in more than one state. It's good that we have the FIPS number, which gives each of these counties a unique identifier.
Let's check for nulls.
In [27]:
df.isnull().values.any()
Out[27]:
We can remove those entries and create a new, completely clean dataframe
In [28]:
null_list = df["Population"].isnull()
not_null_list = [not i for i in null_list]
df_clean = df[not_null_list].copy()
In [29]:
null_list = df_clean["Rate"].isnull()
not_null_list = [not i for i in null_list]
df_completely_clean = df_clean[not_null_list].copy()
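The same cleaning can be written more compactly with dropna (a sketch that assumes Population and Rate are the only columns we need to check):
df_completely_clean = df.dropna(subset=["Population", "Rate"]).copy()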
In [30]:
df_completely_clean.isnull().values.any()
Out[30]:
We will need a mapping between ZIP codes and FIPS codes. ZIP codes are used by the USPS and are generally known to users, whereas governmental studies and CDC data usually identify regions by FIPS codes.
Crosswalk csv files with the conversion in both directions are available. We just have to turn them into two dictionaries to go from ZIP code to FIPS and vice versa.
In [31]:
df_fipszip= pd.read_csv("../data/COUNTY_ZIP_122014.csv", usecols={0,1})
In [32]:
df_fipszip.shape
Out[32]:
In [33]:
df_fipszip.head()
Out[33]:
In [34]:
df_zipfips= pd.read_csv("../data/ZIP_COUNTY_122014.csv", usecols={0,1})
In [35]:
df_zipfips.shape
Out[35]:
In [36]:
df_zipfips.head()
Out[36]:
In [37]:
df_zipfips.dtypes
Out[37]:
There are 51280 ZIP codes in these two tables, compared to the 3143 counties. Going from ZIP code to FIPS code will be easy:
In [38]:
zip2fips = dict(zip(df_zipfips["ZIP"], df_zipfips["COUNTY"]))
In [39]:
zip2fips[2139]
Out[39]:
Going from FIPS to ZIP code will be a bit more complicated since each FIPS contains many ZIPs.
In [40]:
fips2zip = {}
In [41]:
for fips in np.arange(len(df_fipszip.COUNTY)):
    if df_fipszip.COUNTY[fips] in fips2zip:
        fips2zip[df_fipszip.COUNTY[fips]].append(df_fipszip.ZIP[fips])
    else:
        fips2zip[df_fipszip.COUNTY[fips]] = []
        fips2zip[df_fipszip.COUNTY[fips]].append(df_fipszip.ZIP[fips])
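An equivalent, more idiomatic way to build this mapping is a pandas groupby (a sketch yielding a dict of ZIP code lists per county):
# group the crosswalk table by county and collect the ZIP codes into lists
fips2zip = df_fipszip.groupby("COUNTY")["ZIP"].apply(list).to_dict()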
In [42]:
len(fips2zip[25017])
Out[42]:
In [43]:
df_fipszip.COUNTY[1]
Out[43]:
In [44]:
fips2zip[zip2fips[754]]
Out[44]:
In [45]:
df_fipszip['COUNTY'].value_counts()
Out[45]:
That's quite remarkable. Some counties contain several hundred ZIP codes! LA County has the most of course.
The Census Bureau offers tons of data on county level:
http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
I pre-selected a few tables that seem most relevant in terms of population characteristics. Other tables can be added later. Most importantly, the tables should also be available at the ZIP code level for the prediction step later on; that's not the case for many of the tables.
In [46]:
df_census = pd.read_csv("../data/census/DEC_10_general.csv", header=0, skiprows={1})
df_census_labels = pd.read_csv("../data/census/DEC_10_general_metadata.csv", header=0, nrows=1)
In [47]:
df_census.shape, df_census_labels.shape
Out[47]:
In [48]:
df_census.head()
Out[48]:
We will be facing the same problems as with the CDC data. The columns have to be converted to a proper format.
In [49]:
df_census.convert_objects(convert_numeric=True)
df_census.dtypes
Out[49]:
For SQL purposes, I will remove all special characters and capital letters from the column names. This will make life a lot easier later on.
In [50]:
columnnames = list(df_census.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    print(columnname_wo_specialcharacters.lower())
    df_census.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
    df_census_labels.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
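Since this renaming step is repeated for every Census table below, it could also be wrapped in a small helper (a sketch; the explicit loops are kept in the rest of the notebook):
def clean_column_names(frame):
    # lower-case the column names and strip any non-alphanumeric characters
    frame.columns = [re.sub(r'[^0-9a-z]', '', str(c).lower()) for c in frame.columns]
    return frame

clean_column_names(df_census)
clean_column_names(df_census_labels)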
In [51]:
df_census_clean = df_census.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_clean.dtypes
Out[51]:
In [52]:
df_census_clean.describe()
Out[52]:
Very nice! Let's look at the data and compare it to the CDC values. Keep in mind though that the Census data is mostly 2010 while the CDC data is from 2014. So there will be small differences in population numbers.
In [53]:
fig = plt.figure(figsize=(10, 6))
data = np.log10(df_completely_clean['Population'])
binwidth = 0.1
bins=np.arange(min(data), max(data) + binwidth, binwidth)
ax = data.plot.hist(bins=bins, alpha=0.5, label='CDC', color='red')
ax.set_xlabel("log(Population)")
ax.set_title("Chlamydia")
data2 = np.log10(df_census_clean["hd01s001"])
ax2 = data2.plot.hist(bins=bins, alpha=0.5, label='Census')
plt.legend()
plt.savefig('../graphics/county_population_comparison.png', bbox_inches='tight', dpi=150)
In [54]:
data = np.log10(df_completely_clean['Population'])
data2 = np.log10(df_census_clean["hd01s001"])
len(data), len(data2)
Out[54]:
The CDC and Census data have the same number of entries. But are the FIPS and geoid2 exactly the same?
In [55]:
true = df_completely_clean["FIPS"].isin(df_census_clean["geoid2"])
true.head()#.sort_values()
Out[55]:
In [56]:
not_in_census = set(df_completely_clean["FIPS"])-set(df_census_clean["geoid2"])
len(not_in_census)
Out[56]:
No: 83 FIPS codes from the CDC data do not appear in the Census table. We will have to make sure to exclude those later on.
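One way to exclude them (a sketch of what the merge further down effectively does; the names in_census and df_matched are just for illustration) is to keep only the rows whose FIPS code appears in the Census table:
in_census = df_completely_clean["FIPS"].isin(df_census_clean["geoid2"])
df_matched = df_completely_clean[in_census].copy()
df_matched.shape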
We don't need all of the columns in the Census data. Let's select a subset of the data for modeling:
In [57]:
df_census_subset = df_census_clean[["geoid2",
"hd01s001", #log10(population)
"hd02s002", #under 5 yrs
"hd02s003", #5-9 yrs
"hd02s004", #10-14
"hd02s005", #15-19
"hd02s006", #20-24
"hd02s007", #25-29
"hd02s008", #30-34
"hd02s009", #35-39
"hd02s010", #40-44
"hd02s011", #45-49
"hd02s012", #50-54
"hd02s013", #55-59
"hd02s014", #60-64
"hd02s015", #65-69
"hd02s016", #70-74
"hd02s017", #75-79
"hd02s018", #80-84
"hd02s019", #85 and over
"hd01s020", #median age
"hd02s026", #male percent
"hd02s051", #female percent
"hd02s078", #white
"hd02s079", #black
"hd02s080", #native
"hd02s081", #asian
"hd02s089", #pacific
"hd02s095", #two or more
"hd02s107", #hispanic
"hd02s131", #in households
"hd02s132", #householder
"hd02s133", #spouse
"hd02s134", #child
"hd02s135", #child w own child under 18
"hd02s136", #other relatives
"hd02s143", #in group quarters
"hd02s151", #family households
"hd02s152", #family households w own child under 18
"hd02s153", #husband-wife family
"hd02s154", #husband-wife family w own child under 18
"hd02s159", #nonfamily households
"hd01s167", #average household size
"hd01s168", #average family size
"hd02s181", #owner occupied housing units
"hd02s184" #renter occupied housing units
]].copy()
df_census_subset.head()
Out[57]:
The population variable is the feature with the largest range. We can engineer a new feature as the log of the population number. Maybe there is just a weak dependence on population, and the log will pick this up better than the raw number.
In [58]:
df_census_subset['hd01s001'] = df_census_subset['hd01s001'].apply(np.log10)
In [59]:
df_census_subset['hd02s002'] = df_census_subset['hd02s002']+df_census_subset['hd02s003']+df_census_subset['hd02s004']
df_census_subset['hd02s011'] = df_census_subset['hd02s011']+df_census_subset['hd02s012']
df_census_subset['hd02s013'] = df_census_subset['hd02s013']+df_census_subset['hd02s014']
df_census_subset['hd02s015'] = df_census_subset['hd02s015']+df_census_subset['hd02s016']+df_census_subset['hd02s017']+df_census_subset['hd02s018']+df_census_subset['hd02s019']
In [60]:
df_census_combined_subset = df_census_subset[["geoid2",
"hd01s001", #log10(population)
"hd02s002", #0-14
"hd02s005", #15-19
"hd02s006", #20-24
"hd02s007", #25-29
"hd02s008", #30-34
"hd02s009", #35-39
"hd02s010", #40-44
"hd02s011", #45-54
"hd02s013", #55-64
"hd02s015", #65+
"hd01s020", #median age
"hd02s026", #male percent
"hd02s051", #female percent
"hd02s078", #white
"hd02s079", #black
"hd02s080", #native
"hd02s081", #asian
"hd02s089", #pacific
"hd02s095", #two or more
"hd02s107", #hispanic
"hd02s131", #in households
"hd02s132", #householder
"hd02s133", #spouse
"hd02s134", #child
"hd02s135", #child w own child under 18
"hd02s136", #other relatives
"hd02s143", #in group quarters
"hd02s151", #family households
"hd02s152", #family households w own child under 18
"hd02s153", #husband-wife family
"hd02s154", #husband-wife family w own child under 18
"hd02s159", #nonfamily households
"hd01s167", #average household size
"hd01s168", #average family size
"hd02s181", #owner occupied housing units
"hd02s184" #renter occupied housing units
]].copy()
df_census_combined_subset.head()
Out[60]:
In [61]:
df_cdc = df_completely_clean[true].copy()
df_cdc_subset = df_cdc[["FIPS","Population","Cases"]]
df_cdc_subset.shape
Out[61]:
In [62]:
df_cdc_subset.head()
Out[62]:
In [63]:
true_new = df_cdc_subset["FIPS"].isin(df_census_clean["geoid2"])
true_new.describe()
Out[63]:
In [64]:
df_census_acs = pd.read_csv("../data/census/ACS_14_5YR_income.csv", header=0, skiprows={1})
df_census_acs_labels = pd.read_csv("../data/census/ACS_14_5YR_income_metadata.csv", header=0, nrows=1)
In [65]:
columnnames = list(df_census_acs.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    print(columnname_wo_specialcharacters.lower())
    df_census_acs.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
    df_census_acs_labels.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
In [66]:
df_census_acs_clean = df_census_acs.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_acs_clean.dtypes
Out[66]:
In [67]:
df_census_acs_clean.head()
Out[67]:
In [68]:
df_census_acs_clean["hd01vd01"].hist()
Out[68]:
In [69]:
df_census_acs_labels
Out[69]:
In [70]:
true = df_completely_clean["FIPS"].isin(df_census_acs_clean["geoid2"])
In [71]:
not_in_census = set(df_completely_clean["FIPS"])-set(df_census_acs_clean["geoid2"])
len(not_in_census)
Out[71]:
In [72]:
df_census_acs_clean.head()
Out[72]:
In [73]:
df_census_acs_subset = df_census_acs_clean[["geoid2",
"hd01vd01" #median income
]].copy()
df_census_acs_subset.head()
Out[73]:
In [74]:
df_census_lgbt = pd.read_csv("../data/census/DEC_10_lgbt.csv", header=0, skiprows={1})
df_census_lgbt_labels = pd.read_csv("../data/census/DEC_10_lgbt_metadata.csv", header=0, nrows=1)
In [75]:
columnnames = list(df_census_lgbt.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    print(columnname_wo_specialcharacters.lower())
    df_census_lgbt.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
    df_census_lgbt_labels.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
In [76]:
df_census_lgbt_clean = df_census_lgbt.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_lgbt_clean.dtypes
Out[76]:
In [77]:
df_census_lgbt_clean.head()
Out[77]:
In [78]:
true = df_completely_clean["FIPS"].isin(df_census_lgbt_clean["geoid2"])
In [79]:
not_in_census = set(df_completely_clean["FIPS"])-set(df_census_lgbt_clean["geoid2"])
len(not_in_census)
Out[79]:
In [80]:
df_census_lgbt_subset = df_census_lgbt_clean[["geoid2",
"d001", #Total households
"d002", #Husband-wife households
"d014", #Unmarried-partner households: - Male householder and male partner
"d019", #Unmarried-partner households: - Male householder and female partner
"d024", #Unmarried-partner households: - Female householder and female partner
"d029" #Unmarried-partner households: - Female householder and male partner
]].copy()
df_census_lgbt_subset.head()
Out[80]:
In [81]:
df_census_lgbt_subset.describe()
Out[81]:
In [82]:
df_census_lgbt_subset["d002"] = df_census_lgbt_subset["d002"]/df_census_lgbt_subset["d001"]
df_census_lgbt_subset["d014"] = df_census_lgbt_subset["d014"]/df_census_lgbt_subset["d001"]
df_census_lgbt_subset["d019"] = df_census_lgbt_subset["d019"]/df_census_lgbt_subset["d001"]
df_census_lgbt_subset["d024"] = df_census_lgbt_subset["d024"]/df_census_lgbt_subset["d001"]
df_census_lgbt_subset["d029"] = df_census_lgbt_subset["d029"]/df_census_lgbt_subset["d001"]
In [83]:
df_census_lgbt_subset.describe()
Out[83]:
In [84]:
df_census_lgbt_subset.drop('d001', axis=1, inplace=True)
In [85]:
df_census_area = pd.read_csv("../data/census/Area_counties.csv", header=0, skiprows={1}, usecols={"STCOU", "LND110210D"})
In [86]:
df_census_area.head()
Out[86]:
In [87]:
columnnames = list(df_census_area.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    print(columnname_wo_specialcharacters.lower())
    df_census_area.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
In [88]:
df_census_area_clean = df_census_area.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_area_clean.dtypes
Out[88]:
In [89]:
true = df_completely_clean["FIPS"].isin(df_census_area_clean["stcou"])
In [90]:
not_in_census = set(df_completely_clean["FIPS"])-set(df_census_area_clean["stcou"])
len(not_in_census)
Out[90]:
In [91]:
df_census_area_clean.describe()
Out[91]:
In [92]:
df_merged = pd.merge(df_cdc_subset, df_census_combined_subset, left_on='FIPS', right_on='geoid2', how='inner', sort=False)
In [93]:
df_merged2 = pd.merge(df_merged, df_census_acs_subset, left_on='FIPS', right_on='geoid2', how='inner', sort=False)
In [94]:
df_merged = pd.merge(df_merged2, df_census_lgbt_subset, left_on='FIPS', right_on='geoid2', how='inner', sort=False)
In [95]:
df_merged2 = pd.merge(df_merged, df_census_area_clean, left_on='FIPS', right_on='stcou', how='inner', sort=False)
In [96]:
df_merged = df_merged2
In [97]:
df_merged.head()
Out[97]:
In [98]:
df_merged.drop('stcou', axis=1, inplace=True)
df_merged.drop('geoid2_x', axis=1, inplace=True)
df_merged.drop('geoid2', axis=1, inplace=True)
df_merged.drop('geoid2_y', axis=1, inplace=True)
In [99]:
df_merged["lnd110210d"] = df_merged["Population"]/(df_merged["lnd110210d"]+1.0)
In [100]:
df_merged.head()
Out[100]:
In [101]:
df_merged.shape
Out[101]:
In [102]:
unique_values = [len(np.unique(df_merged.values[:,i])) for i in range(df_merged.values.shape[1])]
plt.bar(range(df_merged.values.shape[1]), unique_values)
Out[102]:
In [103]:
plt.plot(np.log(np.std(df_merged.values, axis=0)))
Out[103]:
In [104]:
df_merged.to_csv("../data/chlamydia_cdc_census.csv", index=False)
In [105]:
df_merged.shape
Out[105]:
In [106]:
df_merged = pd.read_csv("../data/chlamydia_cdc_census.csv")
In [107]:
plt.scatter(np.log10(df_merged["Population"]), df_merged["Cases"]/df_merged["Population"])
Out[107]:
In [108]:
plt.scatter(df_merged["FIPS"], df_merged["Cases"]/df_merged["Population"])
Out[108]:
In [109]:
df_all = df_merged.copy()
df_all["Rate"] = df_all["Cases"]/df_all["Population"]
corr = df_all.corr()
In [110]:
pearsonr = corr["Rate"]
pearsonr
Out[110]:
In [111]:
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
In [112]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
square=True, xticklabels=2, yticklabels=2,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
plt.savefig('../graphics/cross-correlation.png', bbox_inches='tight', dpi=150)
In [113]:
#sns.pairplot(df_all, size=2.5)
#plt.savefig('../graphics/pairplot_chlamydia.png', bbox_inches='tight', dpi=150)
In [114]:
from sklearn.decomposition import PCA
In [115]:
df_merged.describe()
Out[115]:
In [116]:
df_new = df_merged.drop(["FIPS","Cases"], axis=1)
X = df_new.values
columns = X.shape[1]
for column in np.arange(columns):
    mean_temp = X[:,column].mean()
    std_temp = X[:,column].std()
    X[:,column] = (X[:,column]-mean_temp)/std_temp
plt.plot((np.std(X, axis=0)))
Out[116]:
In [117]:
X.shape
Out[117]:
In [118]:
pca = PCA(n_components=44)
pca.fit(X)
pca.explained_variance_ratio_
Out[118]:
In [119]:
plt.scatter(pca.components_[:,0], pca.components_[:,1])
Out[119]:
In [120]:
X_trans = pca.transform(X)
Y = df_merged["Cases"]/df_merged["Population"]
plt.scatter(X_trans[:,0],Y)
Out[120]:
In [121]:
from sklearn import linear_model
In [122]:
df_merged = pd.read_csv("../data/chlamydia_cdc_census.csv")
In [123]:
df_new = df_merged.drop(["FIPS","Cases", "d002", "hd02s051", "hd02s143", "hd02s159", "hd02s184", "hd01s001"], axis=1)
In [124]:
df_new.head()
Out[124]:
In [125]:
df_merged.describe()
Out[125]:
In [126]:
df_new["hd02s002"].hist()
Out[126]:
Split the data set into a training/validation part and a held-out test part
In [127]:
cutoff = 1
X = df_new[df_merged["Cases"]>cutoff].values
Y = df_merged[df_merged["Cases"]>cutoff].Cases/(df_merged[df_merged["Cases"]>cutoff].Population+1.0)
X_full = df_new.values
Y_full = df_merged.Cases/(df_merged.Population+1.0)
#normalize all columns to the same normalization
columns = X.shape[1]
means = np.zeros(columns)
stds = np.zeros(columns)
for column in np.arange(columns):
    mean_temp = X[:,column].mean()
    std_temp = X[:,column].std()
    means[column] = mean_temp
    stds[column] = std_temp
    X[:,column] = (X[:,column]-mean_temp)/std_temp
    X_full[:,column] = (X_full[:,column]-mean_temp)/std_temp
Ymean = Y.mean()
Ystd = Y.std()
Y = (Y-Ymean)/Ystd
Y_full = (Y_full-Ymean)/Ystd
ones = np.ones(round(0.75*len(X)), dtype=bool)
zeros = np.zeros(len(X)-round(0.75*len(X)), dtype=bool)
training_list = np.hstack((ones, zeros))
np.random.shuffle(training_list)
test_list = np.zeros(len(X),dtype=bool)
test_list = np.array([not i for i in training_list])
X_train = X[training_list]
X_test = X[test_list]
Y_train = Y[training_list]
Y_test = Y[test_list]
X.shape, X_train.shape, X_test.shape, Y.shape, Y_train.shape, Y_test.shape
Out[127]:
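The same standardization and split can also be expressed with scikit-learn utilities; a minimal sketch starting from the unscaled feature matrix (the manual version above is kept because the per-column means and stds are reused for the ZIP-code data later):
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.preprocessing import StandardScaler

X_raw = df_new[df_merged["Cases"] > cutoff].values
scaler = StandardScaler().fit(X_raw)                    # learns per-column mean and std
X_scaled = scaler.transform(X_raw)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_scaled, Y, test_size=0.25)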
In [128]:
X.shape, Y.shape
Out[128]:
In [129]:
Y_test.describe()
Out[129]:
In [130]:
#X_weights = df_merged.values
#X_train_weights = X_weights[training_list]
weights = 1 #X_train_weights[:,2]
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train, sample_weight=weights)
Out[130]:
In [131]:
print(regr.coef_)
In [132]:
1.0-np.sum((regr.predict(X_test)-Y_test)**2)/np.sum((Y_test-np.mean(Y_test))**2)
Out[132]:
In [133]:
plt.scatter(Y,regr.predict(X))
plt.plot(np.linspace(Y.min(),Y.max(),num=10),np.linspace(Y.min(),Y.max(),num=10),color='red')
Out[133]:
In [134]:
print('Variance score: %.5f\t(%.5f)' % (regr.score(X_test, Y_test), regr.score(X_full, Y_full)))
In [135]:
from sklearn import cross_validation
cv = cross_validation.ShuffleSplit(len(Y_train), n_iter=10, test_size=0.2, random_state=1)
scores_regression = cross_validation.cross_val_score(regr, X_train, Y_train, cv=cv)
scores_regression
#cross_val_score(regr, X_train, Y_train, cv=6, n_jobs=1)
#scores
Out[135]:
In [136]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_regression.mean(), scores_regression.std() * 2))
In [137]:
from sklearn.metrics import explained_variance_score
In [138]:
explained_variance_score(Y_train, regr.predict(X_train))
Out[138]:
In [139]:
from sklearn.metrics import mean_absolute_error
In [140]:
mean_absolute_error(Y_train, regr.predict(X_train))
Out[140]:
In [141]:
from sklearn.metrics import mean_squared_error
In [142]:
from sklearn.metrics import median_absolute_error
In [143]:
from sklearn.metrics import r2_score
In [144]:
median_absolute_error(Y_train, regr.predict(X_train))
Out[144]:
In [145]:
mse_regression = mean_squared_error(Y_test, regr.predict(X_test))
r2_regression = r2_score(Y_test, regr.predict(X_test))
r2_regression, mse_regression
Out[145]:
Try recursive feature elimination to see whether it increases $R^2$
In [146]:
from sklearn.feature_selection import RFE
In [147]:
selector = RFE(regr, 20, step=1)
selector = selector.fit(X_train, Y_train)
In [148]:
selector.ranking_
Out[148]:
In [149]:
selector.support_
Out[149]:
In [150]:
# note: labels_dict (a mapping from column codes to readable labels) is defined further down, in cell In [304]
for column in df_new.columns[selector.support_].values:
    print(labels_dict[column])
In [151]:
from sklearn.preprocessing import PolynomialFeatures
In [152]:
poly = PolynomialFeatures(degree=2)
In [153]:
X_train_poly = poly.fit_transform(X_train)
In [154]:
poly_regr = linear_model.LinearRegression(fit_intercept=False)
poly_regr.fit(X_train_poly, Y_train)
Out[154]:
In [155]:
plt.scatter(Y,poly_regr.predict(poly.fit_transform(X)))
plt.ylim([-3,4])
plt.plot(np.linspace(Y_train.min(),Y_train.max(),num=10),np.linspace(Y_train.min(),Y_train.max(),num=10),color='red')
Out[155]:
In [156]:
print('Variance score: %.5f\t(%.5f)' % (poly_regr.score(poly.fit_transform(X_test), Y_test), poly_regr.score(poly.fit_transform(X), Y)))
In [157]:
from sklearn import linear_model
In [158]:
rregr = linear_model.Ridge(alpha=0.5)
rregr.fit(X_train, Y_train)
Out[158]:
In [159]:
print('Variance score: %.5f\t(%.5f)' % (rregr.score(X_test, Y_test), rregr.score(X_full, Y_full)))
In [160]:
scores_rregression = cross_validation.cross_val_score(rregr, X_train, Y_train, cv=cv, n_jobs=1)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_rregression.mean(), scores_rregression.std() * 2))
In [161]:
mse_rregression = mean_squared_error(Y_test, rregr.predict(X_test))
r2_rregression = r2_score(Y_test, rregr.predict(X_test))
r2_rregression, mse_rregression
Out[161]:
In [162]:
from sklearn.ensemble import ExtraTreesRegressor
In [163]:
clf_et = ExtraTreesRegressor(n_estimators=250, bootstrap=True, oob_score=True, max_features='sqrt')
clf_et.fit(X_train, Y_train)
Out[163]:
In [164]:
print('Variance score: %.5f\t(%.5f)\nOut of bag error score: %.5f' % (clf_et.score(X_test, Y_test), clf_et.score(X_full, Y_full),clf_et.oob_score_))
In [165]:
scores_etregression = cross_validation.cross_val_score(clf_et, X_train, Y_train, cv=cv, n_jobs=1)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_etregression.mean(), scores_etregression.std() * 2))
In [166]:
mse_etregression = mean_squared_error(Y_test, clf_et.predict(X_test))
r2_etregression = r2_score(Y_test, clf_et.predict(X_test))
r2_etregression, mse_etregression
Out[166]:
In [167]:
from sklearn.ensemble import AdaBoostRegressor
In [168]:
clf_ada = AdaBoostRegressor(n_estimators=400, learning_rate=0.1, loss='linear')
clf_ada.fit(X_train, Y_train)
Out[168]:
In [169]:
print('Variance score: %.5f\t(%.5f)' % (clf_ada.score(X_test, Y_test), clf_ada.score(X_full, Y_full)))
In [170]:
scores_adaregression = cross_validation.cross_val_score(clf_ada, X_train, Y_train, cv=cv, n_jobs=1)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_adaregression.mean(), scores_adaregression.std() * 2))
In [171]:
mse_adaregression = mean_squared_error(Y_test, clf_ada.predict(X_test))
r2_adaregression = r2_score(Y_test, clf_ada.predict(X_test))
r2_adaregression, mse_adaregression
Out[171]:
In [172]:
from sklearn.ensemble import BaggingRegressor
In [173]:
clf_bg = BaggingRegressor(n_estimators=250, bootstrap=True, oob_score=True, max_features=20)
clf_bg.fit(X_train, Y_train)
Out[173]:
In [174]:
print('Variance score: %.5f\t(%.5f)' % (clf_bg.score(X_test, Y_test), clf_bg.score(X_full, Y_full)))
In [175]:
scores_bagregression = cross_validation.cross_val_score(clf_bg, X_train, Y_train, cv=cv, n_jobs=1)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_bagregression.mean(), scores_bagregression.std() * 2))
In [176]:
mse_bagregression = mean_squared_error(Y_test, clf_bg.predict(X_test))
r2_bagregression = r2_score(Y_test, clf_bg.predict(X_test))
r2_bagregression, mse_bagregression
Out[176]:
In [177]:
from sklearn.ensemble import GradientBoostingRegressor
In [178]:
clf_gb = GradientBoostingRegressor(n_estimators=100, max_features=30,max_depth=4)
clf_gb.fit(X_train, Y_train)
Out[178]:
In [179]:
print('Variance score: %.5f\t(%.5f)' % (clf_gb.score(X_test, Y_test), clf_gb.score(X_full, Y_full)))
In [180]:
plt.scatter(Y_full,clf_gb.predict(X_full))
plt.plot(np.linspace(Y_full.min(),Y_full.max(),num=10),np.linspace(Y_full.min(),Y_full.max(),num=10),color='red')
Out[180]:
In [181]:
scores_gradboostregression = cross_validation.cross_val_score(clf_gb, X_train, Y_train, cv=cv, n_jobs=1)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_gradboostregression.mean(), scores_gradboostregression.std() * 2))
In [182]:
mse_gradboostregression = mean_squared_error(Y_test, clf_gb.predict(X_test))
r2_gradboostregression = r2_score(Y_test, clf_gb.predict(X_test))
r2_gradboostregression, mse_gradboostregression
Out[182]:
In [183]:
clf_gb.fit(X_train, Y_train)
Out[183]:
In [184]:
import pickle
In [185]:
with open("../data/gradientboosting_params.pickle", "wb") as myfile:
pickle.dump(clf_gb, myfile)
In [186]:
with open("../data/Ymean.pickle", "wb") as myfile:
pickle.dump(Ymean, myfile)
In [187]:
with open("../data/Ystd.pickle", "wb") as myfile:
pickle.dump(Ystd, myfile)
In [188]:
with open("../data/Xmeans.pickle", "wb") as myfile:
pickle.dump(means, myfile)
In [189]:
with open("../data/Xstds.pickle", "wb") as myfile:
pickle.dump(stds, myfile)
In [190]:
from sklearn.ensemble import RandomForestRegressor
In [212]:
df_merged = pd.read_csv("../data/chlamydia_cdc_census.csv")
df_new = df_merged.drop(["FIPS","Cases", "d002", "hd02s051", "hd02s143", "hd02s159", "hd02s184", "hd01s001"], axis=1)
#df_new = df_merged.drop(["FIPS","Cases"], axis=1)
In [213]:
df_new.head()
Out[213]:
In [214]:
cutoff = 1
X = df_new[df_merged["Cases"]>cutoff].values
Y = df_merged[df_merged["Cases"]>cutoff].Cases/(df_merged[df_merged["Cases"]>cutoff].Population+1.0)
X_full = df_new.values
Y_full = df_merged.Cases/(df_merged.Population+1.0)
#normalize all columns and the full dataset to the same normalization
columns = X.shape[1]
means = np.zeros(columns)
stds = np.zeros(columns)
for column in np.arange(columns):
    mean_temp = X[:,column].mean()
    std_temp = X[:,column].std()
    means[column] = mean_temp
    stds[column] = std_temp
    X[:,column] = (X[:,column]-mean_temp)/std_temp
    X_full[:,column] = (X_full[:,column]-mean_temp)/std_temp
Ymean = Y.mean()
Ystd = Y.std()
Y = (Y-Ymean)/Ystd
Y_full = (Y_full-Ymean)/Ystd
#split into training/validation and test data
ones = np.ones(round(0.75*len(X)), dtype=bool)
zeros = np.zeros(len(X)-round(0.75*len(X)), dtype=bool)
training_list = np.hstack((ones, zeros))
np.random.shuffle(training_list)
test_list = np.zeros(len(X),dtype=bool)
test_list = np.array([not i for i in training_list])
X_train = X[training_list]
X_test = X[test_list]
Y_train = Y[training_list]
Y_test = Y[test_list]
In [215]:
clf = RandomForestRegressor(n_estimators=250, oob_score=True, max_features='sqrt',min_samples_split=2, n_jobs=4)
clf.fit(X_train, Y_train)
Out[215]:
In [216]:
print('Variance score: %.5f\t(%.5f)\nOut of bag error score: %.5f' % (clf.score(X_test, Y_test), clf.score(X_full, Y_full),clf.oob_score_))
In [217]:
scores_randomforest = cross_validation.cross_val_score(clf, X_train, Y_train, cv=cv, n_jobs=4)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_randomforest.mean(), scores_randomforest.std()/np.sqrt(5) * 2))
In [218]:
mse_randomforest = mean_squared_error(Y_test, clf.predict(X_test))
r2_randomforest = r2_score(Y_test, clf.predict(X_test))
r2_randomforest, mse_randomforest
Out[218]:
In [219]:
clf = RandomForestRegressor(n_estimators=250, oob_score=True, max_features='sqrt',min_samples_split=2, n_jobs=4)
clf.fit(X_train, Y_train)
Out[219]:
In [220]:
mean_squared_error(Y_train, clf.predict(X_train))
Out[220]:
In [221]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
#ax.text(-1.5, 3, r'$R^2 = $%.2f'%(clf.oob_score_), fontsize=30)
ax.set_xlabel("CDC data", fontsize=20)
ax.set_ylabel("Model prediction", fontsize=20)
ax = plt.scatter(Y,clf.predict(X))
ax2 = plt.plot(np.linspace(Y.min(),Y.max(),10),np.linspace(Y.min(),Y.max(),10),color='red')
plt.savefig('../graphics/data_vs_model.png', bbox_inches='tight', dpi=150)
In [1041]:
from collections import OrderedDict
RANDOM_STATE = 123
ensemble_clfs = [
("RandomForestRegressor, max_features='sqrt'",
RandomForestRegressor(warm_start=False, oob_score=True,
max_features="sqrt",
random_state=RANDOM_STATE)),
("RandomForestRegressor, max_features='log2'",
RandomForestRegressor(warm_start=False, max_features='log2',
oob_score=True,
random_state=RANDOM_STATE)),
("RandomForestRegressor, max_features=None",
RandomForestRegressor(warm_start=False, max_features=None,
oob_score=True,
random_state=RANDOM_STATE))
]
# Map a regressor name to a list of (<n_estimators>, <OOB score>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
# Range of `n_estimators` values to explore.
min_estimators = 10
max_estimators = 250
for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators, 10):
        clf.set_params(n_estimators=i)
        clf.fit(X_train, Y_train)
        # Record the OOB score (R^2) for each `n_estimators=i` setting.
        oob_error = clf.oob_score_
        error_rate[label].append((i, oob_error))
# Generate the "OOB score" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB score")
plt.legend(loc="lower right")
plt.savefig('../graphics/features_estimators.png', bbox_inches='tight', dpi=150)
In [302]:
feature_importance = (np.vstack((np.arange(len(clf.feature_importances_)), clf.feature_importances_)).T)
ranking = feature_importance[feature_importance[:,1].argsort()[::-1]]
In [303]:
for rank, importance in ranking:
    # rank comes out of a float array, so cast it back to an integer index
    print(df_new.columns[int(rank)], importance)
In [304]:
labels_dict = {
"Population": "Population",
"hd01s001": "log10(Population)",
"hd02s002": "0-14",
"hd02s005": "15-19",
"hd02s006": "20-24",
"hd02s007": "25-29",
"hd02s008": "30-34",
"hd02s009": "35-39",
"hd02s010": "40-44",
"hd02s011": "45-54",
"hd02s013": "55-64",
"hd02s015": "65+",
"hd01s020": "Median age",
"hd02s026": "Male percent",
"hd02s051": "Female percent",
"hd02s078": "White",
"hd02s079": "Black",
"hd02s080": "Native",
"hd02s081": "Asian",
"hd02s089": "Pacific/Hawaiian",
"hd02s095": "Two or more races",
"hd02s107": "Hispanic",
"hd02s131": "In households",
"hd02s132": "Householder",
"hd02s133": "Spouse",
"hd02s134": "Child",
"hd02s135": "Child w own child under 18",
"hd02s136": "Other relatives",
"hd02s143": "In group quarters",
"hd02s151": "Family households",
"hd02s152": "Family households w own child under 18",
"hd02s153": "Husband-wife family",
"hd02s154": "Husband-wife family w own child under 18",
"hd02s159": "Nonfamily households",
"hd01s167": "Average household size",
"hd01s168": "Average family size",
"hd02s181": "Owner occupied housing units",
"hd02s184": "Renter occupied housing units",
"hd01vd01": "Median income",
"d001": "Total households",
"d002": "Husband-wife households",
"d014": "Unmarried-partner households: - Male householder and male partner",
"d019": "Unmarried-partner households: - Male householder and female partner",
"d024": "Unmarried-partner households: - Female householder and female partner",
"d029": "Unmarried-partner households: - Female householder and male partner",
"lnd110210d": "Population density"}
In [305]:
ranked_list = []
ranked_labels = []
for rank, importance in ranking:
    print(labels_dict[df_new.columns[int(rank)]], importance)
    ranked_list.append(importance)
    ranked_labels.append(labels_dict[df_new.columns[int(rank)]])
len(ranked_labels)
Out[305]:
In [306]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
ax.set_xlabel("Feature importance", fontsize=20)
ax.set_ylabel("Feature", fontsize=20)
ax = sns.barplot(x=ranked_list, y=ranked_labels)
plt.savefig('../graphics/feature_importance_ranking.png', bbox_inches='tight', dpi=150)
In [307]:
len(ranked_list)
Out[307]:
Save model parameters for use in web app:
In [222]:
import pickle
In [223]:
with open("../data/randomforest_params.pickle", "wb") as myfile:
pickle.dump(clf, myfile)
In [224]:
with open("../data/Ymean.pickle", "wb") as myfile:
pickle.dump(Ymean, myfile)
In [225]:
with open("../data/Ystd.pickle", "wb") as myfile:
pickle.dump(Ystd, myfile)
In [226]:
with open("../data/Xmeans.pickle", "wb") as myfile:
pickle.dump(means, myfile)
In [227]:
with open("../data/Xstds.pickle", "wb") as myfile:
pickle.dump(stds, myfile)
In [228]:
deployed_model = pickle.load(open('../data/randomforest_params.pickle', "rb" ))
In [229]:
print('Variance score: %.5f\t(%.5f)' % (deployed_model.score(X_test, Y_test), deployed_model.score(X_full, Y_full)))
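In the web app, the pickled model and normalization constants can then be loaded to score a new feature vector; a minimal sketch, where new_features is a hypothetical one-row array in the same column order as X:
import pickle
model = pickle.load(open("../data/randomforest_params.pickle", "rb"))
Xmeans = pickle.load(open("../data/Xmeans.pickle", "rb"))
Xstds = pickle.load(open("../data/Xstds.pickle", "rb"))
Ymean = pickle.load(open("../data/Ymean.pickle", "rb"))
Ystd = pickle.load(open("../data/Ystd.pickle", "rb"))
new_features_scaled = (new_features - Xmeans) / Xstds                 # same normalization as the training data
predicted_rate = model.predict(new_features_scaled) * Ystd + Ymean    # undo the target standardization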
In [316]:
from sklearn.svm import SVR
In [317]:
svr_rbf = SVR(kernel='rbf', C=15, gamma=0.0001, epsilon=0.0005, tol=0.00001)
#svr_lin = SVR(kernel='linear', C=1, epsilon=0.001)
#svr_poly = SVR(kernel='poly', C=1, degree=2, epsilon=0.001)
svr_rbf.fit(X_train, Y_train)
#svr_lin.fit(X_train, Y_train)
#svr_poly.fit(X_train, Y_train)
#print('Variance score:\n\t%.5f\t(rbf)\n\t%.5f\t(lin)\n\t%.5f\t(poly)' % (svr_rbf.score(X_train, Y_train), svr_lin.score(X_train, Y_train),svr_poly.score(X_train, Y_train)))
print('Variance score:\n\t%.5f\t(rbf)' % (svr_rbf.score(X_test, Y_test)))
In [318]:
scores_svm = cross_validation.cross_val_score(svr_rbf, X_train, Y_train, cv=cv, n_jobs=1)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_svm.mean(), scores_svm.std() * 2))
In [319]:
mse_svm = mean_squared_error(Y_test, svr_rbf.predict(X_test))
r2_svm = r2_score(Y_test, svr_rbf.predict(X_test))
r2_svm, mse_svm
Out[319]:
In [324]:
model_scores = []
model_errors = []
model_names = []
model_scores.append(np.mean(scores_regression))
model_errors.append(np.std(scores_regression)*2)
model_names.append("Linear")
model_scores.append(np.mean(scores_rregression))
model_errors.append(np.std(scores_rregression)*2)
model_names.append("Ridge")
model_scores.append(np.mean(scores_etregression))
model_errors.append(np.std(scores_etregression)*2)
model_names.append("Extra trees")
model_scores.append(np.mean(scores_adaregression))
model_errors.append(np.std(scores_adaregression)*2)
model_names.append("ADA boost")
model_scores.append(np.mean(scores_bagregression))
model_errors.append(np.std(scores_bagregression)*2)
model_names.append("Bagging")
model_scores.append(np.mean(scores_gradboostregression))
model_errors.append(np.std(scores_gradboostregression)*2)
model_names.append("Gradient boost")
model_scores.append(np.mean(scores_svm))
model_errors.append(np.std(scores_svm)*2)
model_names.append("Suport vector")
model_scores.append(np.mean(scores_randomforest))
model_errors.append(np.std(scores_randomforest)*2)
model_names.append("Random forest")
In [325]:
model_names, model_scores, model_errors
Out[325]:
In [326]:
# Set up the matplotlib figure
sns.set_context("poster", font_scale=1)
f, ax = plt.subplots(figsize=(11, 9))
ax.set_xlabel("Model", fontsize=20)
ax.set_ylabel("Cross validation score", fontsize=20)
ax = sns.barplot(x=model_names, y=model_scores, yerr=model_errors/np.sqrt(5))
plt.xticks(rotation=90)
plt.savefig('../graphics/model_performance.png', bbox_inches='tight', dpi=150)
In [327]:
from mlxtend.regressor import StackingRegressor
In [328]:
svr_rbf_meta = SVR(kernel='rbf', C=1, gamma=0.0001, epsilon=0.1, tol=0.001)
stregr = StackingRegressor(regressors=[regr, rregr, clf_et, clf_ada, clf_bg, clf_gb, clf, svr_rbf],
meta_regressor=svr_rbf_meta)
# Training the stacking classifier
stregr.fit(X_train, Y_train)
stregr.predict(X_train)
# Evaluate and visualize the fit
print("Mean Squared Error: %.4f"
% np.mean((stregr.predict(X_test) - Y_test) ** 2))
print('Variance Score: %.4f' % stregr.score(X_test, Y_test))
#plt.scatter(X, Y, c='lightgray')
plt.scatter(Y_test, stregr.predict(X_test), c='darkgreen')
Out[328]:
In [329]:
scores_meta = cross_validation.cross_val_score(stregr, X_train, Y_train, cv=cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_meta.mean(), scores_meta.std()/np.sqrt(5) * 2))
In [330]:
scores_meta
Out[330]:
In [331]:
df_census_zip = pd.read_csv("../data/census/DEC_10_general_zipcode.csv", header=0, skiprows={1})
In [332]:
df_census_zip.head()
Out[332]:
In [333]:
columnnames = list(df_census_zip.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    #print(columnname_wo_specialcharacters.lower())
    df_census_zip.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
In [334]:
df_census_zip_clean = df_census_zip.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_zip_clean.dtypes
Out[334]:
In [335]:
df_census_zip_clean.describe()
Out[335]:
In [336]:
df_census_zip_subset = df_census_zip_clean[["geoid2",
"hd01s001", #population
"hd02s002", #under 5 yrs
"hd02s003", #5-9 yrs
"hd02s004", #10-14
"hd02s005", #15-19
"hd02s006", #20-24
"hd02s007", #25-29
"hd02s008", #30-34
"hd02s009", #35-39
"hd02s010", #40-44
"hd02s011", #45-49
"hd02s012", #50-54
"hd02s013", #55-59
"hd02s014", #60-64
"hd02s015", #65-69
"hd02s016", #70-74
"hd02s017", #75-79
"hd02s018", #80-84
"hd02s019", #85 and over
"hd01s020", #median age
"hd02s026", #male percent
"hd02s051", #female percent
"hd02s078", #white
"hd02s079", #black
"hd02s080", #native
"hd02s081", #asian
"hd02s089", #pacific
"hd02s095", #two or more
"hd02s107", #hispanic
"hd02s131", #in households
"hd02s132", #householder
"hd02s133", #spouse
"hd02s134", #child
"hd02s135", #child w own child under 18
"hd02s136", #other relatives
"hd02s143", #in group quarters
"hd02s151", #family households
"hd02s152", #family households w own child under 18
"hd02s153", #husband-wife family
"hd02s154", #husband-wife family w own child under 18
"hd02s159", #nonfamily households
"hd01s167", #average household size
"hd01s168", #average family size
"hd02s181", #owner occupied housing units
"hd02s184" #renter occupied housing units
]].copy()
df_census_zip_subset.head()
Out[336]:
In [337]:
df_census_zip_subset['hd02s002'] = df_census_zip_subset['hd02s002']+df_census_zip_subset['hd02s003']+df_census_zip_subset['hd02s004']
df_census_zip_subset['hd02s011'] = df_census_zip_subset['hd02s011']+df_census_zip_subset['hd02s012']
df_census_zip_subset['hd02s013'] = df_census_zip_subset['hd02s013']+df_census_zip_subset['hd02s014']
df_census_zip_subset['hd02s015'] = df_census_zip_subset['hd02s015']+df_census_zip_subset['hd02s016']+df_census_zip_subset['hd02s017']+df_census_zip_subset['hd02s018']+df_census_zip_subset['hd02s019']
In [338]:
df_census_zip_combined_subset = df_census_zip_subset[["geoid2",
"hd01s001", #log10(population)
"hd02s002", #0-14
"hd02s005", #15-19
"hd02s006", #20-24
"hd02s007", #25-29
"hd02s008", #30-34
"hd02s009", #35-39
"hd02s010", #40-44
"hd02s011", #45-54
"hd02s013", #55-64
"hd02s015", #65+
"hd01s020", #median age
"hd02s026", #male percent
"hd02s051", #female percent
"hd02s078", #white
"hd02s079", #black
"hd02s080", #native
"hd02s081", #asian
"hd02s089", #pacific
"hd02s095", #two or more
"hd02s107", #hispanic
"hd02s131", #in households
"hd02s132", #householder
"hd02s133", #spouse
"hd02s134", #child
"hd02s135", #child w own child under 18
"hd02s136", #other relatives
"hd02s143", #in group quarters
"hd02s151", #family households
"hd02s152", #family households w own child under 18
"hd02s153", #husband-wife family
"hd02s154", #husband-wife family w own child under 18
"hd02s159", #nonfamily households
"hd01s167", #average household size
"hd01s168", #average family size
"hd02s181", #owner occupied housing units
"hd02s184" #renter occupied housing units
]].copy()
df_census_zip_combined_subset.head()
Out[338]:
In [339]:
df_census_zip_subset.head()
Out[339]:
In [340]:
df_census_acs_zipcodes = pd.read_csv("../data/census/ACS_14_5YR_income_zipcodes.csv", header=0, skiprows={1})
In [341]:
columnnames = list(df_census_acs_zipcodes.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    print(columnname_wo_specialcharacters.lower())
    df_census_acs_zipcodes.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
In [342]:
df_census_acs_zipcodes_clean = df_census_acs_zipcodes.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_acs_zipcodes_clean.dtypes
Out[342]:
In [343]:
df_census_acs_zipcodes_subset = df_census_acs_zipcodes_clean[["geoid2",
"hd01vd01" #median income
]].copy()
df_census_acs_zipcodes_subset.head()
Out[343]:
In [344]:
df_census_lgbt_zipcodes = pd.read_csv("../data/census/DEC_10_lgbt_zipcodes.csv", header=0, skiprows={1})
In [345]:
columnnames = list(df_census_lgbt_zipcodes.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    print(columnname_wo_specialcharacters.lower())
    df_census_lgbt_zipcodes.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
In [346]:
df_census_lgbt_zipcodes_clean = df_census_lgbt_zipcodes.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_lgbt_zipcodes_clean.dtypes
Out[346]:
In [347]:
df_census_lgbt_zipcodes_subset = df_census_lgbt_zipcodes_clean[["geoid2",
"d001", #Total households
"d002", #Husband-wife households
"d014", #Unmarried-partner households: - Male householder and male partner
"d019", #Unmarried-partner households: - Male householder and female partner
"d024", #Unmarried-partner households: - Female householder and female partner
"d029" #Unmarried-partner households: - Female householder and male partner
]].copy()
df_census_lgbt_zipcodes_subset.head()
Out[347]:
In [348]:
df_census_lgbt_zipcodes_subset["d002"] = df_census_lgbt_zipcodes_subset["d002"]/df_census_lgbt_zipcodes_subset["d001"]
df_census_lgbt_zipcodes_subset["d014"] = df_census_lgbt_zipcodes_subset["d014"]/df_census_lgbt_zipcodes_subset["d001"]
df_census_lgbt_zipcodes_subset["d019"] = df_census_lgbt_zipcodes_subset["d019"]/df_census_lgbt_zipcodes_subset["d001"]
df_census_lgbt_zipcodes_subset["d024"] = df_census_lgbt_zipcodes_subset["d024"]/df_census_lgbt_zipcodes_subset["d001"]
df_census_lgbt_zipcodes_subset["d029"] = df_census_lgbt_zipcodes_subset["d029"]/df_census_lgbt_zipcodes_subset["d001"]
In [349]:
df_census_lgbt_zipcodes_subset.drop('d001', axis=1, inplace=True)
In [350]:
df_census_lgbt_zipcodes_subset.dtypes
Out[350]:
In [351]:
df_census_area_zipcodes = pd.read_csv("../data/census/Area_zipcodes.csv", header=0, skiprows={1}, usecols={"ZCTA5", "LANDSQMI"})
In [352]:
df_census_area_zipcodes.head()
Out[352]:
In [353]:
df_census_area_zipcodes.describe()
Out[353]:
In [354]:
columnnames = list(df_census_area_zipcodes.columns.values)
for columnname in columnnames:
    columnname_wo_specialcharacters = re.sub('[ \-\_\+\=\`\~\{\}\;\:\,\.\<\>\?\/\!\@\#\$\%\^\&\*\(\)\[\]]', '', columnname)
    print(columnname_wo_specialcharacters.lower())
    df_census_area_zipcodes.rename(columns={columnname: columnname_wo_specialcharacters.lower()}, inplace=True)
In [355]:
df_census_area_clean_zipcodes = df_census_area_zipcodes.replace(to_replace='\(r.+\)', value="", regex=True).convert_objects(convert_numeric=True)
df_census_area_clean_zipcodes.dtypes
Out[355]:
In [374]:
df_merged_zipcodes = pd.merge(df_census_zip_combined_subset, df_census_acs_zipcodes_subset, left_on='geoid2', right_on='geoid2', how='inner', sort=False).convert_objects(convert_numeric=True)
In [375]:
df_merged_zipcodes2 = pd.merge(df_merged_zipcodes, df_census_lgbt_zipcodes_subset, left_on='geoid2', right_on='geoid2', how='inner', sort=False).convert_objects(convert_numeric=True)
In [376]:
df_merged_zipcodes = pd.merge(df_merged_zipcodes2, df_census_area_zipcodes, left_on='geoid2', right_on='zcta5', how='inner', sort=False).convert_objects(convert_numeric=True)
In [377]:
df_merged_zipcodes.head()
Out[377]:
In [378]:
df_zipcodes = df_merged_zipcodes['geoid2'].copy()
df_merged_zipcodes.drop('geoid2', axis=1, inplace=True)
df_merged_zipcodes.drop('zcta5', axis=1, inplace=True)
In [379]:
df_merged_zipcodes.insert(0, "Population", df_merged_zipcodes["hd01s001"])
In [380]:
df_merged_zipcodes["hd01s001"] = df_merged_zipcodes["hd01s001"].apply(np.log10)
In [381]:
df_merged_zipcodes["landsqmi"] = df_merged_zipcodes["Population"]/(df_merged_zipcodes["landsqmi"]+1.0)
In [382]:
df_merged_zipcodes = df_merged_zipcodes.drop(["d002", "hd02s051", "hd02s143", "hd02s159", "hd02s184", "hd01s001"], axis=1)
In [383]:
df_merged_zipcodes_unnormalized = df_merged_zipcodes.copy()
In [384]:
df_merged_zipcodes.shape, means.shape
Out[384]:
In [385]:
for column in np.arange(df_merged_zipcodes.shape[1]):
    columnname = df_merged_zipcodes.columns[column]
    df_merged_zipcodes[columnname] = (df_merged_zipcodes[columnname]-means[column])/stds[column]
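Since the columns of df_merged_zipcodes are in the same order as the training features, the loop above is equivalent to a single vectorized expression (a sketch):
# broadcast the per-column training means and stds across all rows
df_merged_zipcodes = (df_merged_zipcodes - means) / stds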
In [386]:
df_merged_zipcodes.insert(0, "geoid2", df_zipcodes)
In [387]:
df_merged_zipcodes_unnormalized.insert(0, "geoid2", df_zipcodes)
In [388]:
df_merged_zipcodes_unnormalized.head()
Out[388]:
In [389]:
df_merged_zipcodes.to_csv("../data/census_zipcode.csv", index=False)
In [390]:
df_merged_zipcodes_unnormalized.to_csv("../data/census_zipcode_unnormalized.csv", index=False)
In [ ]: