k-Means Clustering on 2016 Primary Election Data

Amar Seoparson

For my project, I used k-Means clustering on a dataset containing statistics about the 2016 primaries. I first processed the dataset to cut it down to Donald Trump's results in each county. I then normalized this data and ran a k-Means algorithm on it to see what kinds of counties supported him.


In [85]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

Importing the .csv files and processing them into a single dataframe.


In [86]:
# processing the .csv containing county statistics
counties = pd.read_csv('county_facts.csv')
counties = counties.drop(["state_abbreviation", "fips"], axis=1)
# combine it with the .csv containing primary statistics
primary = pd.read_csv('primary_results.csv')
primary = pd.concat([primary, counties], axis=1)
trump = primary[primary['candidate'] == 'Donald Trump'].sort_index()
# drop the features we don't need
trump = trump.drop(["state_abbreviation", "party", "candidate", "area_name"], axis=1)
# get rid of counties with no statistical data
trump = trump.fillna(0.0)
trump = trump[trump['POP010210'] > 0]
trump.head()


Out[86]:
state county fips votes fraction_votes PST045214 PST040210 PST120214 POP010210 AGE135214 ... SBO415207 SBO015207 MAN450207 WTN220207 RTN130207 RTN131207 AFN120207 BPS030214 LND110210 POP060210
135 Alabama Autauga 1001.0 5387 0.445 7755.0 8116.0 -4.4 8116.0 5.5 ... 0.0 12.9 184521.0 10852.0 79676.0 9727.0 4648.0 1.0 667.39 12.2
140 Alabama Baldwin 1003.0 23618 0.469 12125.0 12245.0 -1.0 12245.0 4.8 ... 0.0 21.9 0.0 8179.0 42434.0 3598.0 5299.0 1.0 618.19 19.8
145 Alabama Barbour 1005.0 1710 0.501 33368.0 32923.0 1.4 32923.0 5.3 ... 0.0 23.8 858460.0 77002.0 207424.0 6511.0 16843.0 0.0 615.20 53.5
150 Alabama Bibb 1007.0 1959 0.494 72297.0 77435.0 -6.6 77435.0 6.2 ... 0.0 25.6 0.0 0.0 867380.0 10922.0 87074.0 55.0 870.75 88.9
155 Alabama Blount 1009.0 7390 0.487 13970.0 14134.0 -1.2 14134.0 4.4 ... 0.0 32.8 0.0 0.0 57549.0 4202.0 0.0 0.0 561.52 25.2

5 rows × 56 columns

This creates a dataframe containing Trump's results and the county statistics for every county in the primary data, whether he won it or not. Now the data has to be normalized.


In [87]:
state = trump["state"]
county = trump["county"]

def min_max_normalize(series):
    # scale a feature to the [0, 1] range
    return np.array((series - series.min()) / (series.max() - series.min())).reshape(-1, 1)

# any of the features in the trump dataframe can be used; these were chosen because they seemed interesting
# fraction of the county's vote that went to Donald Trump
fraction_votes = trump["fraction_votes"]
fraction_votes_norm = min_max_normalize(fraction_votes)
# median household income of the county
median_income = trump["INC110213"]
median_income_norm = min_max_normalize(median_income)
# percent of people in the county who were born outside of the United States
foreign_born = trump["POP645213"]
foreign_born_norm = min_max_normalize(foreign_born)
# percent of people in the county who graduated high school
high_school = trump["EDU635213"]
high_school_norm = min_max_normalize(high_school)
# percent of people in the county with a bachelor's degree
bachelors = trump["EDU685213"]
bachelors_norm = min_max_normalize(bachelors)
# stack the two features used for k-Means into a single 2-D array
trump_norm = np.hstack((high_school_norm, median_income_norm))
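
The same scaling can also be done with scikit-learn's MinMaxScaler; a minimal sketch of the equivalent call on the two chosen feature columns:

from sklearn.preprocessing import MinMaxScaler

# equivalent to the manual min-max normalization above:
# each column is scaled independently to the [0, 1] range
scaler = MinMaxScaler()
trump_norm_alt = scaler.fit_transform(trump[["EDU635213", "INC110213"]])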

Graphs showing the relationships between some of these features and the election results are displayed below.


In [88]:
# graphs of the normalized data
f, axarr = plt.subplots(2, 2)
axarr[0,0].set_title('Income and Trump Votes')
axarr[0,0].scatter(median_income_norm, fraction_votes_norm, c='red')
axarr[0,1].set_title('Foreigners and Trump Votes')
axarr[0,1].scatter(foreign_born_norm, fraction_votes_norm, c='green')
axarr[1,0].set_title('College and Trump Votes')
axarr[1,0].scatter(bachelors_norm, fraction_votes_norm, c='blue')
axarr[1,1].set_title('High School and Trump Votes')
axarr[1,1].scatter(high_school_norm, fraction_votes_norm, c='yellow')
plt.setp([a.get_xticklabels() for a in axarr[0, :]], visible=False)
plt.setp([a.get_yticklabels() for a in axarr[:, 1]], visible=False)
plt.show()


With the data normalized, the k-Means algorithm can be run on it. To find a good number of clusters, the average silhouette score was calculated for each value of k from 2 to 9. The silhouette score measures how close each point is to its own cluster relative to the nearest neighboring cluster, with values closer to 1 indicating better-separated clusters.


In [89]:
best_nc = 0
best_ss = 0
for n_clusters in range(2,10):
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(trump_norm)
    silhouette_avg = silhouette_score(trump_norm, cluster_labels)
    print("For", n_clusters,"clusters, the average silhouette score is", silhouette_avg)
    if silhouette_avg > best_ss:
        best_nc = n_clusters
        best_ss = silhouette_avg
print("The best number of clusters is",best_nc)


For 2 clusters, the average silhouette score is 0.417204527717
For 3 clusters, the average silhouette score is 0.436471878714
For 4 clusters, the average silhouette score is 0.376470905095
For 5 clusters, the average silhouette score is 0.370916049204
For 6 clusters, the average silhouette score is 0.385980951897
For 7 clusters, the average silhouette score is 0.374056524232
For 8 clusters, the average silhouette score is 0.337198437131
For 9 clusters, the average silhouette score is 0.343756549117
The best number of clusters is 3
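
Since silhouette_samples was already imported, the per-sample scores can also be broken down by cluster to check whether any one cluster drags the average down; a quick sketch using the same random_state:

# per-sample silhouette scores for the best k, summarized by cluster
clusterer = KMeans(n_clusters=best_nc, random_state=10)
cluster_labels = clusterer.fit_predict(trump_norm)
sample_scores = silhouette_samples(trump_norm, cluster_labels)
for c in range(best_nc):
    scores = sample_scores[cluster_labels == c]
    print("Cluster", c, "mean silhouette:", scores.mean(), "size:", len(scores))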

With the best-scoring number of clusters chosen, the final model can be fit.


In [90]:
kmeans = KMeans(n_clusters=best_nc, random_state=10)
kmeans.fit(trump_norm)


Out[90]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=10, tol=0.0001, verbose=0)
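
Because the features were min-max scaled, the fitted cluster centers can be mapped back to their original units to make them interpretable; a sketch that inverts the scaling using the high_school and median_income series defined earlier:

# invert the min-max scaling: original = normalized * (max - min) + min
centers = kmeans.cluster_centers_
hs_centers = centers[:, 0] * (high_school.max() - high_school.min()) + high_school.min()
inc_centers = centers[:, 1] * (median_income.max() - median_income.min()) + median_income.min()
for c in range(best_nc):
    print("Cluster", c, "center: high school grad rate", round(hs_centers[c], 1),
          "%, median household income $", round(inc_centers[c]))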

The results of the k-Means algorithm are plotted below. The plotting code is adapted from the scikit-learn documentation.


In [91]:
h = .02  # step size of the mesh
x_min, x_max = trump_norm[:, 0].min() - 0.5, trump_norm[:, 0].max() + 0.5
y_min, y_max = trump_norm[:, 1].min() - 0.5, trump_norm[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(trump_norm[:, 0], trump_norm[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=100, linewidths=3,
            color='w', zorder=10)
plt.title('K-means Clustering on Primary Results')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()
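
The fitted model can also assign a cluster to a new data point, as long as the point is scaled with the same minima and maxima as the training data; a sketch using a hypothetical county with an 80% graduation rate and a $45,000 median income:

# hypothetical county: 80% high school graduation, $45,000 median household income
new_hs = (80.0 - high_school.min()) / (high_school.max() - high_school.min())
new_inc = (45000.0 - median_income.min()) / (median_income.max() - median_income.min())
label = kmeans.predict(np.array([[new_hs, new_inc]]))
print("Hypothetical county assigned to cluster", label[0])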


Findings

The two clusters at the top and bottom are spread out and include many outliers. The middle cluster is the densest and likely represents the typical Trump-supporting county. From this, it appears that the counties where Trump drew support tend to have average to below-average median household incomes and average to slightly below-average high school graduation rates.
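
To put names to these groups, the cluster labels can be joined back onto the state and county series saved before normalization; a sketch printing a few example counties per cluster:

# attach cluster labels to the county names saved earlier
labeled = pd.DataFrame({"state": state.values, "county": county.values,
                        "cluster": kmeans.labels_})
for c in range(best_nc):
    members = labeled[labeled["cluster"] == c]
    print("Cluster", c, "sample counties:")
    print(members[["state", "county"]].head().to_string(index=False))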