Finding similar states - clustering using scikit-learn and pandas

On this notebook we'll use pandas to load and process data from the US states, after that we'll use scikit-learn to cluster states based on census statistics.


In [2]:
#First import pandas and KMeans from scikit-learn
import pandas as pd
from sklearn.cluster import KMeans

#Configure the plotting library
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 5)

In [3]:
#Load county data - this also contains state-level data

#The first file contains the actual statistics, but the columns have codes
county_facts = pd.read_csv('../kaggle-data/county_facts.csv', index_col=['fips', 'area_name'])
#The second file translates from codes to the meaning of each column
county_facts_columns = pd.read_csv('../kaggle-data/county_facts_dictionary.csv')

In [4]:
#Take a look at the columns
county_facts_columns


Out[4]:
column_name description
0 PST045214 Population, 2014 estimate
1 PST040210 Population, 2010 (April 1) estimates base
2 PST120214 Population, percent change - April 1, 2010 to ...
3 POP010210 Population, 2010
4 AGE135214 Persons under 5 years, percent, 2014
5 AGE295214 Persons under 18 years, percent, 2014
6 AGE775214 Persons 65 years and over, percent, 2014
7 SEX255214 Female persons, percent, 2014
8 RHI125214 White alone, percent, 2014
9 RHI225214 Black or African American alone, percent, 2014
10 RHI325214 American Indian and Alaska Native alone, perce...
11 RHI425214 Asian alone, percent, 2014
12 RHI525214 Native Hawaiian and Other Pacific Islander alo...
13 RHI625214 Two or More Races, percent, 2014
14 RHI725214 Hispanic or Latino, percent, 2014
15 RHI825214 White alone, not Hispanic or Latino, percent, ...
16 POP715213 Living in same house 1 year & over, percent, 2...
17 POP645213 Foreign born persons, percent, 2009-2013
18 POP815213 Language other than English spoken at home, pc...
19 EDU635213 High school graduate or higher, percent of per...
20 EDU685213 Bachelor's degree or higher, percent of person...
21 VET605213 Veterans, 2009-2013
22 LFE305213 Mean travel time to work (minutes), workers ag...
23 HSG010214 Housing units, 2014
24 HSG445213 Homeownership rate, 2009-2013
25 HSG096213 Housing units in multi-unit structures, percen...
26 HSG495213 Median value of owner-occupied housing units, ...
27 HSD410213 Households, 2009-2013
28 HSD310213 Persons per household, 2009-2013
29 INC910213 Per capita money income in past 12 months (201...
30 INC110213 Median household income, 2009-2013
31 PVY020213 Persons below poverty level, percent, 2009-2013
32 BZA010213 Private nonfarm establishments, 2013
33 BZA110213 Private nonfarm employment, 2013
34 BZA115213 Private nonfarm employment, percent change, 20...
35 NES010213 Nonemployer establishments, 2013
36 SBO001207 Total number of firms, 2007
37 SBO315207 Black-owned firms, percent, 2007
38 SBO115207 American Indian- and Alaska Native-owned firms...
39 SBO215207 Asian-owned firms, percent, 2007
40 SBO515207 Native Hawaiian- and Other Pacific Islander-ow...
41 SBO415207 Hispanic-owned firms, percent, 2007
42 SBO015207 Women-owned firms, percent, 2007
43 MAN450207 Manufacturers shipments, 2007 ($1,000)
44 WTN220207 Merchant wholesaler sales, 2007 ($1,000)
45 RTN130207 Retail sales, 2007 ($1,000)
46 RTN131207 Retail sales per capita, 2007
47 AFN120207 Accommodation and food services sales, 2007 ($...
48 BPS030214 Building permits, 2014
49 LND110210 Land area in square miles, 2010
50 POP060210 Population per square mile, 2010

In [5]:
#Let's take a look at some of the rows we have using head() pandas method
county_facts.head()


Out[5]:
state_abbreviation PST045214 PST040210 PST120214 POP010210 AGE135214 AGE295214 AGE775214 SEX255214 RHI125214 ... SBO415207 SBO015207 MAN450207 WTN220207 RTN130207 RTN131207 AFN120207 BPS030214 LND110210 POP060210
fips area_name
0 United States NaN 318857056 308758105 3.3 308745538 6.2 23.1 14.5 50.8 77.4 ... 8.3 28.8 5319456312 4174286516 3917663456 12990 613795732 1046363 3531905.43 87.4
1000 Alabama NaN 4849377 4780127 1.4 4779736 6.1 22.8 15.3 51.5 69.7 ... 1.2 28.1 112858843 52252752 57344851 12364 6426342 13369 50645.33 94.4
1001 Autauga County AL 55395 54571 1.5 54571 6.0 25.2 13.8 51.4 77.9 ... 0.7 31.7 0 0 598175 12003 88157 131 594.44 91.8
1003 Baldwin County AL 200111 182265 9.8 182265 5.6 22.2 18.7 51.2 87.1 ... 1.3 27.3 1410273 0 2966489 17166 436955 1384 1589.78 114.6
1005 Barbour County AL 26887 27457 -2.1 27457 5.7 21.2 16.5 46.6 50.2 ... 0.0 27.0 0 0 188337 6334 0 8 884.88 31.0

5 rows × 52 columns

Data processing

county_facts contains data at the county and state level. Since we only want state-level data, we need to filter the rows. As you can see from the previous cell, county-level rows contain the state_abbreviation, the rest of the rows contain null. To get only state-level data we'll filter by rows that have a null value in state_abbreviation column but leave the first row out since it contains data for the entire country.


In [6]:
#Subselect rows, let's just keep state level data
df = county_facts[county_facts.state_abbreviation.isnull()][1:]
df.head()


Out[6]:
state_abbreviation PST045214 PST040210 PST120214 POP010210 AGE135214 AGE295214 AGE775214 SEX255214 RHI125214 ... SBO415207 SBO015207 MAN450207 WTN220207 RTN130207 RTN131207 AFN120207 BPS030214 LND110210 POP060210
fips area_name
1000 Alabama NaN 4849377 4780127 1.4 4779736 6.1 22.8 15.3 51.5 69.7 ... 1.2 28.1 112858843 52252752 57344851 12364 6426342 13369 50645.33 94.4
2000 Alaska NaN 736732 710249 3.7 710231 7.4 25.3 9.4 47.4 66.9 ... 0.0 25.9 8204030 4563605 9303387 13635 1851293 1518 570640.95 1.2
4000 Arizona NaN 6731484 6392310 5.3 6392017 6.4 24.1 15.9 50.3 83.7 ... 10.7 28.1 57977827 57573459 86758801 13637 13268514 26997 113594.08 56.3
5000 Arkansas NaN 2966369 2915958 1.7 2915918 6.5 23.8 15.7 50.9 79.7 ... 2.3 24.5 60735582 29659789 32974282 11602 3559795 7666 52035.48 56.0
6000 California NaN 38802500 37254503 4.2 37253956 6.5 23.6 12.9 50.3 73.2 ... 16.5 30.3 491372092 598456486 455032270 12561 80852787 83645 155779.22 239.1

5 rows × 52 columns

While we can cluster using all columns, let's subset them for now. Feel free to modify this piece of code and experiment with different column combinations!


In [7]:
#Select columns that have to do with etnicity proportion
df = df.filter(regex='RHI*')

The column names in our data are coded, since we are interested in the meaning of each column we are using, we need to replace the code for the actual meaning. The following cell achieves that.


In [8]:
#Rename columns to use their meaning instead their codename
col_names = dict((k, v) for k,v in county_facts_columns.itertuples(index=False, name=None))
df.rename(columns=col_names, inplace=True)
df.head()


Out[8]:
White alone, percent, 2014 Black or African American alone, percent, 2014 American Indian and Alaska Native alone, percent, 2014 Asian alone, percent, 2014 Native Hawaiian and Other Pacific Islander alone, percent, 2014 Two or More Races, percent, 2014 Hispanic or Latino, percent, 2014 White alone, not Hispanic or Latino, percent, 2014
fips area_name
1000 Alabama 69.7 26.7 0.7 1.3 0.1 1.5 4.1 66.2
2000 Alaska 66.9 3.9 14.8 6.1 1.3 7.1 6.8 61.9
4000 Arizona 83.7 4.7 5.3 3.3 0.3 2.7 30.5 56.2
5000 Arkansas 79.7 15.6 1.0 1.5 0.3 1.9 7.0 73.4
6000 California 73.2 6.5 1.7 14.4 0.5 3.7 38.6 38.5

Now let's run a KMeans algorithm with 4 clusters, feel free to experiment with different clustering algorithms and different parameters (parameter change depending on the algorithm).

Running a clustering algorithm with scikit-learn


In [9]:
#Run a clustering algorithm, group in 4 clusters
model = KMeans(n_clusters=4)
results = model.fit_predict(df.values)

In [10]:
#Assign cluster number to our dataframe, this will help us identify which cluster was assigned to
#every state
df['cluster'] = results

Data visualization


In [11]:
#Count the number of states assigned to each cluster
df.cluster.value_counts().plot.bar()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f65c490>

In [12]:
#Let's take a look at how the algorithm clustered states
df.groupby(df.cluster).mean().transpose().plot.bar()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f756e10>

We can se that cluster 0 has a much higher proportion of hispanics, cluster 3 of asians and clister 1 african americanas.


In [13]:
#Let's see which states are in each cluster
df[df.cluster==0]


Out[13]:
White alone, percent, 2014 Black or African American alone, percent, 2014 American Indian and Alaska Native alone, percent, 2014 Asian alone, percent, 2014 Native Hawaiian and Other Pacific Islander alone, percent, 2014 Two or More Races, percent, 2014 Hispanic or Latino, percent, 2014 White alone, not Hispanic or Latino, percent, 2014 cluster
fips area_name
1000 Alabama 69.7 26.7 0.7 1.3 0.1 1.5 4.1 66.2 0
2000 Alaska 66.9 3.9 14.8 6.1 1.3 7.1 6.8 61.9 0
10000 Delaware 70.8 22.2 0.7 3.8 0.1 2.5 8.9 63.7 0
11000 District Of Columbia 43.6 49.0 0.6 4.0 0.2 2.6 10.4 35.8 0
13000 Georgia 62.1 31.5 0.5 3.8 0.1 2.0 9.3 54.3 0
17000 Illinois 77.5 14.7 0.6 5.3 0.1 1.8 16.7 62.3 0
22000 Louisiana 63.4 32.5 0.8 1.8 0.1 1.5 4.8 59.3 0
24000 Maryland 60.1 30.3 0.6 6.4 0.1 2.6 9.3 52.6 0
28000 Mississippi 59.7 37.5 0.6 1.0 0.1 1.2 3.0 57.3 0
34000 New Jersey 73.0 14.8 0.6 9.4 0.1 2.1 19.3 56.8 0
36000 New York 70.4 17.6 1.0 8.5 0.1 2.4 18.6 56.5 0
37000 North Carolina 71.5 22.1 1.6 2.7 0.1 2.1 9.0 64.1 0
45000 South Carolina 68.3 27.8 0.5 1.5 0.1 1.7 5.4 63.9 0
51000 Virginia 70.5 19.7 0.5 6.3 0.1 2.8 8.9 63.1 0

In [14]:
df[df.cluster==3]


Out[14]:
White alone, percent, 2014 Black or African American alone, percent, 2014 American Indian and Alaska Native alone, percent, 2014 Asian alone, percent, 2014 Native Hawaiian and Other Pacific Islander alone, percent, 2014 Two or More Races, percent, 2014 Hispanic or Latino, percent, 2014 White alone, not Hispanic or Latino, percent, 2014 cluster
fips area_name
4000 Arizona 83.7 4.7 5.3 3.3 0.3 2.7 30.5 56.2 3
6000 California 73.2 6.5 1.7 14.4 0.5 3.7 38.6 38.5 3
12000 Florida 77.8 16.8 0.5 2.8 0.1 2.0 24.1 55.8 3
32000 Nevada 76.2 9.1 1.6 8.3 0.7 4.0 27.8 51.5 3
35000 New Mexico 82.8 2.5 10.4 1.7 0.2 2.5 47.7 38.9 3
48000 Texas 80.0 12.5 1.0 4.5 0.1 1.8 38.6 43.5 3

In [15]:
df[df.cluster==2]


Out[15]:
White alone, percent, 2014 Black or African American alone, percent, 2014 American Indian and Alaska Native alone, percent, 2014 Asian alone, percent, 2014 Native Hawaiian and Other Pacific Islander alone, percent, 2014 Two or More Races, percent, 2014 Hispanic or Latino, percent, 2014 White alone, not Hispanic or Latino, percent, 2014 cluster
fips area_name
5000 Arkansas 79.7 15.6 1.0 1.5 0.3 1.9 7.0 73.4 2
8000 Colorado 87.7 4.5 1.6 3.1 0.2 2.9 21.2 69.0 2
9000 Connecticut 81.2 11.5 0.5 4.5 0.1 2.2 15.0 68.8 2
16000 Idaho 93.5 0.8 1.7 1.4 0.2 2.3 12.0 82.8 2
18000 Indiana 86.1 9.6 0.4 2.0 0.1 1.9 6.6 80.3 2
19000 Iowa 92.1 3.4 0.5 2.2 0.1 1.7 5.6 87.1 2
20000 Kansas 86.8 6.3 1.2 2.8 0.1 2.8 11.4 76.8 2
21000 Kentucky 88.3 8.2 0.3 1.4 0.1 1.8 3.4 85.4 2
23000 Maine 95.0 1.4 0.7 1.2 0.0 1.6 1.5 93.8 2
25000 Massachusetts 82.6 8.3 0.5 6.3 0.1 2.2 10.8 74.3 2
26000 Michigan 79.9 14.2 0.7 2.9 0.0 2.3 4.8 75.8 2
27000 Minnesota 85.7 5.9 1.3 4.7 0.1 2.3 5.1 81.4 2
29000 Missouri 83.5 11.8 0.5 1.9 0.1 2.1 4.0 80.1 2
30000 Montana 89.4 0.6 6.6 0.8 0.1 2.6 3.5 86.7 2
31000 Nebraska 89.4 4.9 1.4 2.2 0.1 2.0 10.2 80.5 2
33000 New Hampshire 94.0 1.5 0.3 2.5 0.0 1.6 3.3 91.3 2
38000 North Dakota 89.1 2.1 5.4 1.3 0.1 2.0 3.2 86.6 2
39000 Ohio 83.0 12.6 0.3 2.0 0.1 2.1 3.5 80.1 2
40000 Oklahoma 75.1 7.7 9.0 2.1 0.2 5.9 9.8 67.0 2
41000 Oregon 87.9 2.0 1.8 4.3 0.4 3.6 12.5 77.0 2
42000 Pennsylvania 82.9 11.6 0.3 3.3 0.1 1.8 6.6 77.9 2
44000 Rhode Island 85.1 7.7 0.9 3.5 0.2 2.6 14.0 74.5 2
46000 South Dakota 85.7 1.9 8.9 1.3 0.1 2.2 3.6 83.0 2
47000 Tennessee 78.9 17.1 0.4 1.7 0.1 1.7 5.0 74.6 2
49000 Utah 91.4 1.3 1.5 2.4 1.0 2.4 13.5 79.3 2
50000 Vermont 95.0 1.2 0.4 1.6 0.0 1.8 1.8 93.5 2
53000 Washington 80.7 4.1 1.9 8.2 0.7 4.5 12.2 70.4 2
54000 West Virginia 93.7 3.6 0.2 0.8 0.0 1.6 1.5 92.5 2
55000 Wisconsin 87.8 6.6 1.1 2.6 0.0 1.8 6.5 82.2 2
56000 Wyoming 92.7 1.6 2.7 1.0 0.1 2.0 9.8 84.1 2