In [2]:
#First import pandas and KMeans from scikit-learn
import pandas as pd
from sklearn.cluster import KMeans
#Configure the plotting library
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 5)
In [3]:
#Load county data - this also contains state-level data
#The first file contains the actual statistics, but the columns have codes
county_facts = pd.read_csv('../kaggle-data/county_facts.csv', index_col=['fips', 'area_name'])
#The second file translates from codes to the meaning of each column
county_facts_columns = pd.read_csv('../kaggle-data/county_facts_dictionary.csv')
In [4]:
#Take a look at the columns
county_facts_columns
Out[4]:
In [5]:
#Let's take a look at some of the rows we have, using the pandas head() method
county_facts.head()
Out[5]:
county_facts contains data at both the county and state level. Since we only want state-level data, we need to filter the rows. As the previous cell shows, county-level rows have a value in the state_abbreviation column, while all other rows have a null there. To keep only state-level data, we select the rows with a null state_abbreviation and leave out the first one, since it contains data for the entire country.
In [6]:
#Subselect rows, let's just keep state level data
df = county_facts[county_facts.state_abbreviation.isnull()][1:]
df.head()
Out[6]:
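To make the filtering step concrete, here is a minimal sketch on a made-up miniature of county_facts. The area names and the toy frame are hypothetical; only the state_abbreviation logic mirrors the cell above.

```python
import pandas as pd

# Hypothetical miniature of county_facts: the national row comes first,
# followed by state rows (null abbreviation) and county rows (abbreviation set)
toy = pd.DataFrame(
    {
        "area_name": ["United States", "Alabama", "Autauga County", "Alaska", "Some Borough"],
        "state_abbreviation": [None, None, "AL", None, "AK"],
    }
)

# Rows without a state abbreviation are the national and state-level rows
is_aggregate = toy["state_abbreviation"].isnull()

# Skip the first aggregate row (the whole country) to keep only states
toy_states = toy[is_aggregate].iloc[1:]
state_names = toy_states["area_name"].tolist()
print(state_names)
```

Note that skipping the first row relies on the national row appearing first in the file, as it does in the output of head() above.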
While we can cluster using all columns, let's subset them for now. Feel free to modify this piece of code and experiment with different column combinations!
In [7]:
#Select columns that have to do with ethnicity proportions (their codes start with RHI)
df = df.filter(regex='^RHI')
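A note on the pattern passed to filter: pandas matches the regex anywhere in the column name, so 'RHI*' means "RH followed by zero or more I", while '^RHI' only matches names that start with RHI. The sketch below shows the difference on a toy frame; the column XRH000000 is invented purely to trigger the looser match.

```python
import pandas as pd

# One-row toy frame; XRH000000 is a hypothetical column, not part of the real data
toy = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=["PST045214", "RHI125214", "RHI225214", "HSG010214", "XRH000000"],
)

# 'RHI*' matches any name containing "RH" (the trailing I is optional)
loose = toy.filter(regex="RHI*").columns.tolist()

# '^RHI' anchors the match at the start of the column name
anchored = toy.filter(regex="^RHI").columns.tolist()

print(loose)
print(anchored)
```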
The column names in our data are coded. Since we are interested in the meaning of each column we are using, we need to replace each code with its actual meaning. The following cell does that.
In [8]:
#Rename columns to use their meaning instead of their code name
#Each row of the dictionary file is a (code, meaning) pair
col_names = dict(county_facts_columns.itertuples(index=False, name=None))
df.rename(columns=col_names, inplace=True)
df.head()
Out[8]:
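The renaming step boils down to building a code-to-meaning dictionary and passing it to rename. Here is a small sketch with a made-up two-row dictionary; the header names column_name and description, the descriptions, and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-row version of the dictionary file (header names assumed)
dictionary = pd.DataFrame(
    {
        "column_name": ["RHI125214", "RHI225214"],
        "description": ["White alone, percent", "Black or African American alone, percent"],
    }
)

# Map each code to its human-readable meaning
col_names = dict(zip(dictionary["column_name"], dictionary["description"]))

# Toy data frame with coded column names and illustrative values
toy = pd.DataFrame({"RHI125214": [70.0], "RHI225214": [25.0]})

# rename returns a new frame; codes missing from the mapping are left untouched
renamed = toy.rename(columns=col_names)
print(renamed.columns.tolist())
```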
Now let's run the KMeans algorithm with 4 clusters. Feel free to experiment with different clustering algorithms and different parameters (the available parameters depend on the algorithm).
In [9]:
#Run a clustering algorithm, group in 4 clusters
model = KMeans(n_clusters=4)
results = model.fit_predict(df.values)
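If you want to experiment with the number of clusters or with another algorithm, one possible approach is sketched below. It uses synthetic data from make_blobs as a stand-in for df.values, compares several values of k with the silhouette score, and then swaps in AgglomerativeClustering. The choice of metric, the range of k, and random_state=0 are assumptions made for illustration, not part of the original analysis.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df.values: 50 rows with 4 numeric features
X, _ = make_blobs(n_samples=50, n_features=4, centers=3, random_state=0)

# Fit KMeans for several cluster counts; fixing random_state makes
# the cluster labels repeatable from one run to the next
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores, best_k)

# A different algorithm exposing the same fit_predict interface
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)
```

Without a fixed random_state, KMeans starts from random centers, so the cluster numbers it assigns can differ between runs even when the groups themselves are the same.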
In [10]:
#Add the cluster number to our dataframe so we can see which cluster
#each state was assigned to
df['cluster'] = results
In [11]:
#Count the number of states assigned to each cluster
df.cluster.value_counts().plot.bar()
Out[11]:
In [12]:
#Let's take a look at how the algorithm clustered states
df.groupby(df.cluster).mean().transpose().plot.bar()
Out[12]:
We can see that cluster 0 has a much higher proportion of Hispanics, cluster 3 of Asians, and cluster 1 of African Americans.
In [13]:
#Let's see which states are in each cluster
df[df.cluster==0]
Out[13]:
In [14]:
df[df.cluster==3]
Out[14]:
In [15]:
df[df.cluster==2]
Out[15]:
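Instead of filtering one cluster at a time, the members of every cluster can be collected in a single pass with groupby. The sketch below uses a toy frame with placeholder state names and cluster labels; in the real dataframe the index also carries the fips code, so the names would come from the area_name level of the index.

```python
import pandas as pd

# Toy frame: placeholder state names in the index, made-up cluster labels
toy = pd.DataFrame(
    {"cluster": [0, 1, 0, 2]},
    index=pd.Index(["State A", "State B", "State C", "State D"], name="area_name"),
)

# Build a mapping from cluster label to the list of states in that cluster
members = {
    int(label): group.index.tolist()
    for label, group in toy.groupby("cluster")
}
print(members)
```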