I recently moved to the Pacific Northwest from Texas and I've noticed a few differences between the regions. The big one is all the WATER. There is so much water here! It falls from the sky nearly every day. It causes moss to grow on rooftops!!! And then there's the ocean, ahem, I mean the bay.
Another difference I've noticed is the industry. I live in Tacoma which has a fairly active port. There is more train and truck traffic than I experienced in Texas. All those goods need to be transported from the Port to points east and south! This got me thinking about the industry and business scene in Tacoma.
The city of Tacoma has a really nice website devoted to its public data. You can download the data directly or use some of the tools provided by Socrata. I decided to take a look at Tacoma's list of business licenses. It was last updated April 17, 2017 and has over 32,000 rows of data. That means that over 32,000 business licenses have been issued to businesses operating in Tacoma. These businesses include lessors that rent or lease out their property in Tacoma, as well as drivers who drive for a ride-share company like Uber or Lyft.
I'm curious how Tacoma's industry landscape has changed over time. Let's take a look at the data.
In [1]:
# Load packages.
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Read data into pandas dataframe.
bl = pd.read_csv('TACOMA_BUSINESS_LICENSE.csv')
In [3]:
# How big is the dataset?
bl.shape
Out[3]:
In [4]:
# View the dataset
bl.head()
Out[4]:
Most of the columns are pretty self-explanatory. The ZIP CODE and STATE columns tell me where the owner of the business is located. The BUSINESS OPEN DATE might be useful too. The columns labeled NAICS CODE and NAICS CODE DESCRIPTION give an idea of the industry of the business. NAICS (North American Industry Classification System) codes are like the Dewey Decimal System numbers that you can see marked on library books.
For NAICS codes, the longer the number, the more precise the classification. For example, if a business is classified as part of business sector 61, then it provides educational services. However, if it is classified as 611512, then it is a flight training school.
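To make that prefix hierarchy concrete, here's a tiny sketch. The descriptions below are real NAICS categories as far as I know, but the lookup table itself is just for illustration and is nowhere near the full listing:

```python
# NAICS codes refine from left to right: the first two digits give the
# broad sector, and longer prefixes narrow the classification.
# (Illustrative lookup table only, not the full NAICS listing.)
naics_lookup = {
    '61': 'Educational Services',
    '6115': 'Technical and Trade Schools',
    '611512': 'Flight Training',
}

def classify(code):
    """Return the most specific description available for a NAICS code
    by trying successively shorter prefixes."""
    for length in range(len(code), 1, -1):
        prefix = code[:length]
        if prefix in naics_lookup:
            return naics_lookup[prefix]
    return 'Unknown'

print(classify('611512'))  # Flight Training
print(classify('611699'))  # no exact match, falls back to 'Educational Services'
```

The fallback behavior mirrors how you can always read off the broad sector from a detailed code, even if you don't know the fine-grained category.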
Here's a listing of the description of the first two digits of the NAICS code.
There are about 20 generally defined sectors of business. It would be interesting to see how these sectors have changed in Tacoma over time using the BUSINESS OPEN DATE.
However, before I get ahead of myself, let me clean the data up a bit.
In [5]:
# Check the data types
bl.dtypes
Out[5]:
In [6]:
# Convert date to datetime format.
bl['BUSINESS OPEN DATE'] = pd.to_datetime(bl['BUSINESS OPEN DATE'])
In [7]:
# Sort businesses by open date.
sorted = bl.sort_values(by = 'BUSINESS OPEN DATE')
# Oldest businesses
sorted[['BUSINESS NAME','NAICS CODE DESCRIPTION', 'BUSINESS OPEN DATE']].head(10)
Out[7]:
Interesting! The oldest company in Tacoma is the Salvation Army, which opened in 1888. The city of Tacoma was incorporated in 1875, so there must have been businesses before then. I haven't been able to figure out when Tacoma started issuing business licenses. However, the description of the dataset describes it as a list of "active accounts," so perhaps these are the oldest companies in Tacoma that are also still around.
Now let's take a look at the most recent businesses.
In [8]:
# Newest businesses
sorted[['BUSINESS NAME','NAICS CODE DESCRIPTION', 'BUSINESS OPEN DATE']].tail(15)
Out[8]:
What's with BUSINESS OPEN DATE = 2216-09-21?
That's odd. Some of these businesses have open dates that are far in the future. I don't think that's right. I'm going to remove businesses that have opening dates after today.
In [9]:
# Remove businesses with open date after "today".
bl = bl[bl['BUSINESS OPEN DATE'] <='2017-05-04']
# Check work
sorted = bl.sort_values(by = 'BUSINESS OPEN DATE')
sorted[['BUSINESS NAME','NAICS CODE DESCRIPTION', 'BUSINESS OPEN DATE']].tail(10)
Out[9]:
That looks better!
In [10]:
# Let's take a look at the different types of business in Tacoma that have a business license.
bl_groupedbynaics = bl.groupby('NAICS CODE DESCRIPTION')['BUSINESS NAME'].count()
sorted = bl_groupedbynaics.sort_values()
sorted.tail(10)
Out[10]:
Whoa! Taxi services and rental properties are Big Business in Tacoma! I was not expecting that.
Ok, but what if I divide this up by ZIP code? What's the top business for each ZIP code operating in Tacoma?
In [11]:
# Find top industry in each of Tacoma's major ZIP code areas.
# http://www.zipmap.net/Washington/Pierce_County/Tacoma.htm
tacoma_zip_codes = [98402, 98403, 98404, 98405, 98406, 98407,
98408, 98409, 98418, 98421, 98422, 98465]
print(len(tacoma_zip_codes))
In [12]:
# Clean up ZIP code data.
# Keep only the first 5 digits and deal with non-numeric ZIP codes.
bl['ZIP CODE 5 string'] = bl['ZIP CODE'].astype(str).str[:5]
# If the ZIP code contains any letters (including 'nan' from missing values), replace it with 00000.
bl.loc[bl['ZIP CODE 5 string'].str.contains('[a-zA-Z]', na=False), 'ZIP CODE 5 string'] = '00000'
# Create another ZIP code column that has integer format.
bl['ZIP CODE 5 int'] = bl['ZIP CODE 5 string'].astype(int)
In [13]:
# Print top three NAICS codes in each ZIP code.
for zc in tacoma_zip_codes:
    bl_zipsort = bl[bl['ZIP CODE 5 int'] == zc]
    bl_zipsort = bl_zipsort.groupby('NAICS CODE DESCRIPTION')['BUSINESS NAME'].count()
    sorted = bl_zipsort.sort_values()
    print(zc, ' #1 ', sorted.keys()[-1])
    print(zc, ' #2 ', sorted.keys()[-2])
    print(zc, ' #3 ', sorted.keys()[-3])
    print('----')
It looks like folks renting out their homes and ride-share drivers are still the top industries even if I divide Tacoma up by ZIP code. However, there are two ZIP code areas that stand out as different. They are 98421 and 98402, the ZIP codes where the Port of Tacoma and downtown Tacoma are located, respectively. The top two industries in the Port are warehouse rentals and long-haul trucking. In downtown Tacoma, there are more lawyer offices than ride-share drivers.
Since leasing and taxi services account for the most business licenses in Tacoma, and the city's business license homepage specifically draws attention to these two industries, I thought I'd look into them a bit more. Business licenses for for-hire transportation have only been required since January 1, 2014, and licenses for lessors renting out their homes have been required since January 1, 2004. Therefore, I would expect to see spikes in the number of business licenses in 2004 and 2014 due to these new requirements. Let's see if this is the case.
In [14]:
# Count up the TOTAL number of businesses that opened each year.
bl_resampled = bl.resample('AS', on='BUSINESS OPEN DATE')['BUSINESS NAME'].count()
bl_resampled = pd.DataFrame(bl_resampled)
bl_resampled['BUSINESS OPEN YEAR'] = bl_resampled.index.year
bl_resampled = bl_resampled.set_index('BUSINESS OPEN YEAR')
bl_resampled = bl_resampled.rename(columns = {'BUSINESS NAME':'Count'})
In [15]:
# Let's plot evolution in time for ALL businesses.
%matplotlib inline
ax = bl_resampled.plot.bar(figsize=(15,8), fontsize = 7, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 20)
ax.set_ylabel('Count', fontsize = 20);
Whoa! Look at that jump in 2004! That is suspiciously close to the year that business licenses were first required for lessors. Based on the underlying trend in the number of business licenses, I would have guessed only about 1,000 new business licenses for 2004. Instead there are 3,000! That's about 2,000 extra business licenses for 2004. If I look only at lessor business licenses, do those extra 2,000 show up?
In [16]:
# Lessor business licenses grouped yearly
bl_lessor = bl[bl['NAICS CODE DESCRIPTION']=='Lessors of Residential Buildings and Dwellings']
bl_lessor = bl_lessor.resample('AS', on='BUSINESS OPEN DATE')['BUSINESS NAME'].count()
bl_lessor = pd.DataFrame(bl_lessor)
bl_lessor['BUSINESS OPEN YEAR'] = bl_lessor.index.year
bl_lessor = bl_lessor.set_index('BUSINESS OPEN YEAR')
bl_lessor = bl_lessor.rename(columns = {'BUSINESS NAME':'Count_Lessor'})
# Plot it.
ax = bl_lessor.plot.bar(figsize=(20,4), fontsize = 10, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 20)
ax.set_ylabel('Count', fontsize = 20);
Yes! I see a jump up to 2000 business licenses for lessors in 2004. It looks like they are the cause of the overall jump in 2004. Interesting!
Let's also take a look at the taxi business. It does seem like there might be an unusual rise in the number of licenses in 2014. Let's take a deeper look at taxi business licenses.
In [17]:
# Taxi business licenses grouped yearly
bl_taxi = bl[bl['NAICS CODE DESCRIPTION']=='Taxi Service']
bl_taxi = bl_taxi.resample('AS', on='BUSINESS OPEN DATE')['BUSINESS NAME'].count()
bl_taxi = pd.DataFrame(bl_taxi)
bl_taxi['BUSINESS OPEN YEAR'] = bl_taxi.index.year
bl_taxi = bl_taxi.set_index('BUSINESS OPEN YEAR')
bl_taxi = bl_taxi.rename(columns = {'BUSINESS NAME':'Count_Taxi'})
# Plot it.
ax = bl_taxi.plot.bar(figsize=(20,4), fontsize = 10, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 20)
ax.set_ylabel('Count', fontsize = 20);
The city of Tacoma began requiring business licenses for ride-share drivers on January 1, 2014; however, the number of licenses doesn't pick up until 2016. It looks like Tacoma was getting ready for a possible influx of ride-share drivers when it changed the business license requirements. Nice planning!
My previous analysis of lessors and taxi drivers gave me an idea to explore the data more using clustering. When looking at the plot showing the number of new business licenses every year I was led to investigate an underlying feature of the dataset -- the jump in licenses in 2004 for lessors. I wonder if there are other features of the dataset that I can pull out. One way of finding features is using a machine learning technique called clustering which sorts through each data point (in our case, each business license) and groups it with other similar data points. For example, business licenses with opening dates around 2004 in the leasing industry might be grouped together.
I'll be using a version of k-means clustering that allows for categorical data, called k-modes clustering. The basic steps of k-means clustering using numerical data are as follows: (1) pick k initial cluster centers; (2) assign each data point to its nearest center; (3) move each center to the mean of the points assigned to it; (4) repeat steps 2 and 3 until the assignments stop changing.
Another way to say this is that the k-means clustering algorithm minimizes the sum of the squared differences between the cluster members and cluster center.
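As a concrete illustration, here is a minimal pure-Python sketch of the k-means loop on made-up 1-D data (a toy, not the library implementation I use below):

```python
import random

def kmeans_1d(points, k, n_iter=20, seed=0):
    """Toy 1-D k-means: assign points to the nearest center, recompute
    each center as its cluster's mean, and repeat."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep an empty cluster's center where it was).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups should yield centers near 1 and 10.
data = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
print(kmeans_1d(data, k=2))
```

Minimizing the squared distance to the assigned center is exactly what the assignment and update steps accomplish together.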
Since the business license dataset contains a few columns of categorical rather than numerical data, I'll use a modified version of k-means clustering called k-modes clustering. Rather than minimizing the sum of the squared differences, this technique minimizes a dissimilarity measure, which is defined as
$d(x, y) = \sum_{\textrm{all } j} \delta(x_j, y_j)$
The dissimilarity measure is a sum over all of the categorical features $j$ in the dataset, where $x$ and $y$ each represent a business license. The function $\delta$ is the delta function, i.e., $\delta(x_j, y_j) = 0$ when $x_j$ and $y_j$ are in the same category and $\delta(x_j, y_j) = 1$ when they are in different categories.
Another difference between the k-modes method and the k-means method is the use of the mode rather than the mean to find the center of the cluster.
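Both ideas are simple enough to sketch directly. The two hypothetical "licenses" below are made-up records with the same three features I use later (NAICS first two digits, open year, ZIP code); this is not the kmodes library's code, just the underlying arithmetic:

```python
from collections import Counter

def dissimilarity(x, y):
    """Matching dissimilarity: count the categorical features on which
    two records disagree (the sum of delta(x_j, y_j) over features j)."""
    return sum(1 for a, b in zip(x, y) if a != b)

def cluster_mode(records):
    """k-modes cluster center: the most frequent category in each column."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Two hypothetical licenses: (NAICS first two, open year, ZIP).
a = ('53', '2004', '98409')
b = ('53', '2016', '98402')
print(dissimilarity(a, b))  # they differ on year and ZIP -> 2

cluster = [('53', '2004', '98409'),
           ('53', '2004', '98402'),
           ('23', '2004', '98409')]
print(cluster_mode(cluster))  # -> ('53', '2004', '98409')
```

So where k-means averages coordinates, k-modes just takes a majority vote per feature, which is why it works on categories that have no meaningful mean.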
A Python implementation of this technique has been written. It also includes k-prototypes, which allows for a mix of categorical and numerical data in a single dataset. I'll be using that.
Before plugging my dataset into the algorithm, let me choose the "features" to be used. Based on my previous experiments with the lessors and the most common businesses in each ZIP code, I think some useful features will be the opening year of the business, the NAICS code, and the ZIP code. However, rather than using the entire NAICS code, I'll only use the first two digits. These give the broad category of each business which is what I am interested in here.
In [18]:
# Create new column which contains only first two digits of NAICS CODE.
bl['NAICS CODE FIRST TWO'] = bl['NAICS CODE'].fillna('00')
bl['NAICS CODE FIRST TWO'] = bl['NAICS CODE FIRST TWO'].astype(str)
bl['NAICS CODE FIRST TWO'] = bl['NAICS CODE FIRST TWO'].str[:2]
In [19]:
# Count up the number of licenses for each NAICS CODE 2 digit grouping.
bl_groupedbynaicsfirst2 = bl.groupby('NAICS CODE FIRST TWO')['BUSINESS NAME'].count()
sorted = bl_groupedbynaicsfirst2.sort_values()
sorted.tail(3)
Out[19]:
If I only use the first two digits of the NAICS code, then the top business types are "Real Estate and Rental and Leasing," "Construction," and "Transportation and Warehousing."
In [20]:
# How many unique NAICS 2 digits codes are there in this dataset?
naics_keys = sorted.keys()
len(naics_keys)
Out[20]:
In [21]:
# Create the dataset to be used in the k-modes clustering algorithm
# Create new column with open year.
bl['BUSINESS OPEN YEAR'] = bl['BUSINESS OPEN DATE'].dt.year
# Create a dataframe that contains the data to be used for clustering.
df_kmodes = bl[['NAICS CODE FIRST TWO', 'BUSINESS OPEN YEAR','ZIP CODE 5 string']].copy()
df_kmodes.head()
Out[21]:
In [22]:
df_kmodes.dtypes
Out[22]:
In [23]:
# Import kprototypes module from kmodes
# https://github.com/nicodv/kmodes
from kmodes import kprototypes
# Convert dataframe to numpy array.
X = df_kmodes.values
print(X)
A common question when trying to cluster data is "How many clusters should I have?" Usually it's a good idea to plot the data to try to figure this out. This is a bit difficult to do when dealing with three features, but let's see what we find.
In [24]:
# ZIP code vs business open year
ax = bl.plot.scatter(x="BUSINESS OPEN YEAR", y = "ZIP CODE 5 int", figsize=(20,8), fontsize = 15, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 15)
ax.set_ylabel('ZIP CODE 5', fontsize = 15);
In [25]:
# First two digits of NAICS code vs business open year
bl['NAICS CODE FIRST TWO INT'] = bl['NAICS CODE FIRST TWO'].astype(int)
ax = bl.plot.scatter(x="BUSINESS OPEN YEAR", y = "NAICS CODE FIRST TWO INT", figsize=(20,8), fontsize = 15, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 15)
ax.set_ylabel('NAICS CODE First Two Digits', fontsize = 15);
I'm not seeing any obvious groupings of the data, so that wasn't really helpful in figuring out the number of clusters. That's ok; there's another way. For the k-means algorithm, a common technique is to find the "elbow point." The first step is to run the k-means algorithm for a few different values of $k$ (the number of clusters). Then plot the average distance of each point to its cluster's center against the number of clusters. For only two clusters, the average distance will be large, but as the number of clusters increases, the average distance decreases until it bottoms out. The point where the curve goes from steep to flat is the "elbow point." Clusters beyond the elbow point are likely to overfit; if the number of clusters equals the number of data points, then each point has its own cluster and that's not very useful.
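As a side note, rather than eyeballing the plot, one crude way to locate the elbow numerically is to pick the $k$ where the cost curve bends most sharply, i.e., where the second difference of the costs is largest. A sketch with hypothetical cost values (in this post I'll still just read the elbow off the plot):

```python
def elbow_point(ks, costs):
    """Pick the k where the cost curve bends most sharply, using the
    largest second difference as a crude curvature estimate."""
    best_k, best_bend = None, float('-inf')
    for i in range(1, len(ks) - 1):
        # How much the slope flattens going through point i.
        bend = (costs[i - 1] - costs[i]) - (costs[i] - costs[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical cost values: a steep drop that flattens after k = 4.
ks = [2, 3, 4, 5, 6, 7]
costs = [100.0, 60.0, 30.0, 27.0, 25.0, 24.0]
print(elbow_point(ks, costs))  # -> 4
```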
Let's give the elbow method a try and see what happens. I'll try between 2 and 25 clusters. There are 25 unique two-digit NAICS codes in the dataset, and I suspect the clusters might start to be grouped by NAICS code if I allow that many. Let's see.
In [26]:
# Create dictionaries to store the data from the k-modes algorithm.
cluster_dict = {}
kproto_dict = {}
WARNING! The following loop takes a while (> 1 hour) to run on my ~2010 Macbook Air. If you'd like to play with the data it creates, but would rather not run this loop yourself, then use the data that I have put in a pickle file for safe-keeping. I write to the pickle file after the loop runs. You should be able to find it here as well.
In [ ]:
for i in range(2,26):
    kproto_dict[i] = kprototypes.KPrototypes(n_clusters = i, init = 'Cao', verbose = 0)
    cluster_dict[i] = kproto_dict[i].fit_predict(X, categorical=[0,2])
In [ ]:
# Create dictionary with cluster data.
cluster_info = {}
for i in range(2,26):
    cluster_info[i] = [kproto_dict[i].cost_,
                       kproto_dict[i].n_iter_,
                       kproto_dict[i].cluster_centroids_,
                       cluster_dict[i]]
In [ ]:
# Save data in a pickle file.
import pickle
pickle.dump(cluster_info,open('cluster_data.p', 'wb'))
In [27]:
# To use the data in the pickle file, read it in here.
import pickle
cluster_info = pickle.load(open('cluster_data.p', 'rb'))
Now that I have the data compiled for a range of k values, I can compare how the error (or "Cost") changes with k.
In [28]:
x = list(cluster_info.keys())
cost = {}
for k in cluster_info:
    cost[k] = cluster_info[k][0]
y = list(cost.values())
plt.xlabel('k (number of clusters)', fontsize = 15)
plt.ylabel('Cost', fontsize = 15)
plt.scatter(x, y);
It looks like the 'elbow' occurs around 10 and since 10 is a nice round number, I'll use that for my number of clusters. The next step is to see what I can learn about how the data is organized into these 10 clusters. What are the primary features of each cluster? One way of investigating this is by looking at the cluster centroids.
In [29]:
# Print cluster centroids: year, NAICS code, and ZIP code.
year = cluster_info[10][2][0]
IDZIP = cluster_info[10][2][1]
print('Year NAICS ZIP')
for i in range(len(year)):
    print(int(year[i][0]), IDZIP[i][0], ' ', IDZIP[i][1])
The years for the clusters vary, but the NAICS code is often 53, "Real Estate and Rental and Leasing."
The fact that the ZIP code 98409 is the most common for the clusters makes sense because it is the most common ZIP code in the dataset.
In [30]:
# Print most common ZIP codes.
bl_groupedbyZIP = bl.groupby('ZIP CODE')['BUSINESS NAME'].count()
sorted = bl_groupedbyZIP.sort_values()
sorted.tail(5)
Out[30]:
In addition to understanding the clusters by looking at their centers, it is often useful to visualize the properties of the cluster members.
In [31]:
# First, assign each business in the dataframe to a cluster.
bl['cluster_num'] = cluster_info[10][3]
In [32]:
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)
plt.figure(figsize = (10,8)) #w,h
numplots = 10
# Clusters arranged in approximate chronological order.
plot_order = [1,2,6,3,5,0,7,4,8,9]
top_zip = [98409, 98409,98409,98409, 98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+1)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]
    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 5)
    plt.ylabel('Count', fontsize = 8)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 8)
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 5)
    plt.xlabel('Year', fontsize = 8)
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 5)
    plt.xlabel('ZIP codes', fontsize = 8)
    plt.text(-0.1, 0.79-0.09*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 8)
    plt.text(0.05, 0.79-0.09*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 8)
    plt.text(0.21, 0.79-0.09*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 8)
This figure shows the overall structure of each of the 10 clusters. Each row of graphs in the figure shows the properties of a single cluster; the top three graphs show the NAICS codes, years, and ZIP codes for Cluster 1, from left to right.
Let's discuss a few of the clusters at a time, looking at their commonalities and differences. I'll talk about Clusters 1, 2, and 3 together because they look similar except for their yearly graph. Then I'll take a look at Cluster 4 which stands out as unique from the other clusters due to the peak in the middle of its NAICS graph. Then I'll discuss Clusters 5, 6, 7, 8, and 9 together. I'll finish with a discussion of Cluster 10.
In [33]:
# Visualize Clusters 1, 2, and 3.
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)
plt.figure(figsize = (14,8)) #w,h
numplots = 3
# Clusters arranged in approximate chronological order.
plot_order = [1,2,6,3,5,0,7,4,8,9]
top_zip = [98409, 98409,98409,98409, 98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+1)# + '\n' + str(top_zip[i])
    current_cluster = bl[bl['cluster_num']==plot_order[i]]
    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 12)
    plt.ylabel('Count', fontsize = 12)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 12)
    plt.text(-0.07, 0.85-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 12)
    # plt.ylabel('Count', fontsize = 25)
    plt.xlabel('Year', fontsize = 12)
    plt.text(0.135, 0.85-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 12)
    # plt.ylabel('Count', fontsize = 25)
    plt.xlabel('ZIP codes', fontsize = 12)
    plt.text(0.410, 0.85-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
The NAICS code and ZIP code distributions for Clusters 1, 2, and 3 look about the same. The main difference between the clusters occurs in their range of years; Clusters 1, 2, and 3 are arranged from older to newer businesses.
There's an interesting spike in Cluster 1 around 1950. I'm not sure what's happening here. Since there aren't as many business licenses before 1950, perhaps around this time the city's business license requirements were more strictly enforced.
Let's take a look at the next cluster - Cluster 4, the odd ball.
In [34]:
# Visualize Cluster 4
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)
plt.figure(figsize = (16,4)) #w,h
numplots = 1
# Clusters arranged in approximate chronological order.
plot_order = [3,5,0,7,4,8,9]
top_zip = [98409, 98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+4)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]
    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 12)
    plt.ylabel('Count', fontsize = 12)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 12)
    plt.text(-0.07, 0.36-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 12)
    plt.xlabel('Year', fontsize = 12)
    plt.text(0.2, 0.36-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 12)
    plt.xlabel('ZIP codes', fontsize = 12)
    plt.text(0.5, 0.36-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
I'm not sure what's happening with Cluster 4. It's a bit unusual because it has a peak at NAICS code 00, which marks the licenses whose codes were originally NaN values. Are business owners suddenly unable to classify their businesses? Or maybe a previously defined NAICS code was retired and is no longer valid, hence the NaN? (It turns out that NAICS codes do change over time.) Or maybe, whispering in my best conspiracy voice, it's related to the Y2K bug, because this spike happened around the year 2000. Maybe we'll never know the truth... back to normal voice.
Let's take a look at the next big group of clusters, Clusters 5-9.
In [35]:
# Visualize Cluster 5-9
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)
plt.figure(figsize = (16,13)) #w,h
numplots = 5
# Clusters arranged in approximate chronological order.
plot_order = [5,0,7,4,8,9]
top_zip = [ 98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+5)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]
    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 12)
    plt.ylabel('Count', fontsize = 14)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 16)
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 12)
    plt.xlabel('Year', fontsize = 16)
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 12)
    plt.xlabel('ZIP codes', fontsize = 16)
    plt.text(-0.07, 1.38-0.3*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 13)
    plt.text(0.175, 1.38-0.3*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 13)
    clusternamezip = clustername + '\n' + str(top_zip[i])
    plt.text(0.43, 1.38-0.3*i, clusternamezip, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 13)
Clusters 5 through 9 have similarly shaped NAICS code distributions. Their most common NAICS code is 53, "Real Estate and Rental and Leasing," followed by 23, "Construction." Each cluster, however, tracks a different yearly range, starting around 2000 and running through today. It is interesting to note that the ratio of rental business licenses to all other licenses seems to decrease with time. This may be a residual effect of the large influx of lessor business licenses in 2004, or an actual comparative decrease in home rentals over time. Something to explore for another day...
The ZIP code plot shows another difference between the clusters; I've written the top ZIP code on each of the ZIP code plots. All of these top ZIP codes cover primarily residential areas, and different areas seem to have experienced growth at different times. ZIP code 98409 saw a big jump in the number of business licenses in the mid-2000s, as seen in Cluster 6, while in the following few years ZIP code 98405 took over as the leader in business licenses.
Now, let's take a look at the last group, Cluster 10.
In [36]:
# Visualize Cluster 10
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)
plt.figure(figsize = (16,4)) #w,h
numplots = 1
plot_order = [9]
top_zip = [ 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+10)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]
    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 13)
    plt.ylabel('Count', fontsize = 12)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 12)
    plt.text(-0.06, 0.36-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 13)
    plt.xlabel('Year', fontsize = 12)
    plt.text(0.2, 0.36-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 13)
    plt.xlabel('ZIP codes', fontsize = 12)
    plt.text(0.48, 0.36-0.34*i, clustername, horizontalalignment='center',
             verticalalignment='center', transform=ax.transAxes, fontsize = 15)
Cluster 10 is dominated by NAICS code 48; our old friends the ride-share drivers are back. Cluster 10 tracks the recent growth in Uber- and Lyft-like businesses in the Tacoma area, as seen in the graph of years. It looks like these drivers are found in most ZIP codes, but they dominate the 98409 area.
I think my big takeaway is that k-modes clustering can help confirm trends in your data that you may already be aware of (like lessors and ride-share drivers), and it can also be used to find new ones (why are there so many unknown NAICS codes in the early 2000s?!?). Also, it is a bit tricky to find the correct number of clusters for your data. In the future, it might be interesting to allow for a few more clusters now that I understand the properties of my first ten.
I was hoping to learn more about the business licenses issued in downtown Tacoma (98402) and the Port (98421). However, since there aren't very many licenses in those areas to begin with, it's hard for them to overcome the dominance of lessors and drivers in residential areas. Allowing more clusters might reveal something interesting about the Port and downtown, but it would probably also create many more clusters similar to the ones I already have, and the noise from all those extra clusters might make it hard to spot new and interesting ones.