I recently moved to the Pacific Northwest from Texas and I've noticed a few differences between the regions. The big one is all the WATER. There is so much water here! It falls from the sky nearly everyday. It causes moss to grow on rooftops!!! And then there's the ocean, ahem, I mean the bay.

Another difference I've noticed is the industry. I live in Tacoma which has a fairly active port. There is more train and truck traffic than I experienced in Texas. All those goods need to be transported from the Port to points east and south! This got me thinking about the industry and business scene in Tacoma.

The city of Tacoma has a really nice website devoted to its public data. You can download the data directly or use some of the tools provided by Socrata. I decided to take a look at Tacoma's list of business licenses. It was last updated April 17, 2017 and has over 32,000 rows of data. That means that over 32,000 business licenses have been issued to businesses operating in Tacoma. These businesses include lessors that rent or lease out their property in Tacoma, as well as drivers who drive for a ride-share company like uber or lyft.

I'm curious how Tacoma's industry landscape has changed over time. Let's take a look at the data.


In [1]:
# Load packages.
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt   
%matplotlib inline

In [2]:
# Read data into pandas dataframe.
bl = pd.read_csv('TACOMA_BUSINESS_LICENSE.csv')

In [3]:
# How big is the dataset?
bl.shape


Out[3]:
(32349, 13)

In [4]:
# View the dataset
bl.head()


Out[4]:
LICENSE NUMBER BUSINESS NAME OWNER NAME HOUSE NUMBER AND STREET NAME UNIT NUMBER P.O. BOX CITY STATE ZIP CODE NAICS CODE NAICS CODE DESCRIPTION BUSINESS OPEN DATE Location 1
0 500006886 CH2M HILL INC CH2M HILL INC 1100 112TH AVE NE 500 NaN BELLEVUE WA 98004 541330.0 Engineering Services (ES 0200 0225 08) 01/01/1994 1100 112TH AVE NE\nBELLEVUE, WA 98004\n(47.620...
1 500110805 MONARCH MUSIC & ENTERTAINMENT MONARCH MUSIC & ENTERTAINMENT 3910 N 24TH ST NaN NaN TACOMA WA 98406 611610.0 Fine Arts Schools (ES 0200 0225 08) 04/16/2016 3910 N 24TH ST\nTACOMA, WA 98406\n(47.269197, ...
2 500092547 MARY BAKER CARL E BAKER 4501 ALAMEDA AVE NaN NaN UNIVERSITY PLACE WA 98466-6314 531110.0 Lessors of Residential Buildings and Dwellings 01/01/2007 4501 ALAMEDA AVE\nUNIVERSITY PLACE, WA 98466-6...
3 500115952 GAMETIME PLAYCORE WISCONSIN INC 401 CHESTNUT ST STE 410 NaN CHATTANOOGA TN 37402 451110.0 Sporting Goods Stores 12/01/2005 401 CHESTNUT ST\nCHATTANOOGA, TN 37402\n(35.05...
4 500091065 FRANK RICHMOND SECURITY FRANK RICHMOND 3702 64TH AVENUE CT NW NaN NaN GIG HARBOR WA 98335-7236 561612.0 Security Guards and Patrol Services 02/06/2015 3702 64TH AVENUE CT NW\nGIG HARBOR, WA 98335-7...

Most of the columns are pretty self-explanatory. The ZIP CODE and STATE columns tell me where the owner of the business is located. Also the BUSINESS OPEN DATE might be useful too. The columns labeled NAICS CODE and NAICS CODE DESCRIPTION give an idea of the industry of the business. NAICS (North American Industry Classification System) codes are like the Dewey Decimal System numbers that you can see marked on library books.

For NAICS codes, the longer the number the more precise the classification. For example, if a business is classified as part of business sector 61, then it provides educational services. However, if it is classified as 611512, then it is a flight training school.

Here's a listing of the description of the first two digits of the NAICS code.

There are about 20 generally defined sectors of business. It would interesting to see how these sectors have changed in Tacoma over time using the BUSINESS OPEN DATE.

However, before I get ahead of myself, let me clean the data up a bit.


In [5]:
# Check the data types
bl.dtypes


Out[5]:
LICENSE NUMBER                    int64
BUSINESS NAME                    object
OWNER NAME                       object
HOUSE NUMBER AND STREET NAME     object
UNIT NUMBER                      object
P.O. BOX                        float64
CITY                             object
STATE                            object
ZIP CODE                         object
NAICS CODE                      float64
NAICS CODE DESCRIPTION           object
BUSINESS OPEN DATE               object
Location 1                       object
dtype: object

In [6]:
# Convert date to datetime format.
bl['BUSINESS OPEN DATE'] = pd.to_datetime(bl['BUSINESS OPEN DATE'])

History of businesses in Tacoma

Now that the data is cleaned up, let's take a look at the oldest and the newest businesses in Tacoma.


In [7]:
# Sort businesses by open date.
sorted = bl.sort_values(by = 'BUSINESS OPEN DATE')
# Oldest businesses
sorted[['BUSINESS NAME','NAICS CODE DESCRIPTION', 'BUSINESS OPEN DATE']].head(10)


Out[7]:
BUSINESS NAME NAICS CODE DESCRIPTION BUSINESS OPEN DATE
21176 THE SALVATION ARMY Used Merchandise Stores (ES 0200 0225 08) 1888-01-01 00:00:00
31196 OTIS ELEVATOR COMPANY Industrial Machinery/Equipment Merch Whlslrs (... 1920-01-01
30013 TRINITY PRESBYTERIAN CHURCH OF TACO Religious Organizations 1922-01-01
7139 TACOMA GOODWILL INDUSTRIES Used Merchandise Stores (ES 0200 0225 08) 1922-04-01
13503 LOOMIS ARMORED US LLC Armored Car Services (ES 0200 0225 08) 1936-01-01
25824 CASCADE POLE & LUMBER CO Other Commercial and Industrial Machinery and ... 1936-01-01
30559 TITUS-WILL FORD INC New Car Dealers (ES 0200 0225 08) 1938-01-01
24526 MCKESSON CORPORATION Drugs and Druggists' Sundries Merchant Wholesa... 1938-01-01
3137 DARRELL WITTE Lessors of Residential Buildings and Dwellings 1942-07-01
32249 KBR CREDIT UNION Credit Unions (ES 0200 0225 08) 1949-07-01

Interesting! The oldest company in Tacoma is the Salvation Army which opened in 1888. The city of Tacoma was incorporated in 1875 and there must have been businesses then. I haven't been able to figure out when Tacoma started issuing business licenses. However, the description of the dataset describes it as a list of "active accounts." So perhaps these are the oldest companies in Tacoma that are also still around.

Now let's take a look at the most recent businesses.


In [8]:
# Newest businesses
sorted[['BUSINESS NAME','NAICS CODE DESCRIPTION', 'BUSINESS OPEN DATE']].tail(15)


Out[8]:
BUSINESS NAME NAICS CODE DESCRIPTION BUSINESS OPEN DATE
32251 NICE GIRL JEWELRY AND REPAIR Other Direct Selling Establishments (ES 0200 ... 2017-08-05
21934 WORD OF MOUTH CONCRETE Commercial and Institutional Building Construc... 2017-08-16
22432 MAVERICK Painting and Wall Covering Contractors (ES 0... 2017-09-01
6024 REBEL JUNK NaN 2017-09-23
23531 TNC - UBER Taxi Service 2017-10-01
24916 GOSHEN ADULT FAMILY HOME LLC Services for the Elderly and Persons with Disa... 2017-12-01
17342 TNC Taxi Service 2018-02-16
21947 JESSE M REEVES Lessors of Residential Buildings and Dwellings 2018-07-01
26494 HALLS HAULING LLC General Freight Trucking Local (ES 0200 0225 ... 2200-01-01
10913 UBER - TNC Taxi Service 2201-01-09
24478 UBER / LYFT TNC Taxi Service 2201-03-30
22833 ANNALEE ROTHENBERG Lessors of Residential Buildings and Dwellings 2201-08-01
20585 LINDA D SMITH Lessors of Residential Buildings and Dwellings 2201-08-27
15985 PR BARBERSHOP Barber Shops (ES 0200 0225 08) 2215-02-12
3946 UBER - TNC Taxi Service 2216-09-21

What's with BUSINESS OPEN DATE = 2216-09-21?

That's odd. Some of these businesses have open dates that are far in the future. I don't think that's right. I'm going to remove businesses that have opening dates after today.


In [9]:
# Remove businesses with open date after "today".
bl = bl[bl['BUSINESS OPEN DATE'] <='2017-05-04']
# Check work
sorted = bl.sort_values(by = 'BUSINESS OPEN DATE')
sorted[['BUSINESS NAME','NAICS CODE DESCRIPTION', 'BUSINESS OPEN DATE']].tail(10)


Out[9]:
BUSINESS NAME NAICS CODE DESCRIPTION BUSINESS OPEN DATE
8089 GRIT CITY FARM Fruit and Vegetable Markets 2017-05-01
12670 LYMANS VALET All Other Business Support Services 2017-05-01
31128 NEA D MOYER Lessors of Residential Buildings and Dwellings 2017-05-01
7387 UBER - TNC Taxi Service 2017-05-01
12782 BAKER CONCRETE CONSTRUCTION INC Poured Concrete Foundation/Structure Contracto... 2017-05-01
30650 CMP CONSTRUCTION LLC All Other Specialty Trade Contractors (ES 0200... 2017-05-01
3643 T U P A C Dance Companies 2017-05-01
5312 CAROLINAS SALON Other Personal Care Services (ES 0200 0225 ... 2017-05-01
17923 FERMENTATION INITIATIVE LLC All Othr Miscll Store Retailers (ex Tobacco St... 2017-05-01
21497 TT HAIR SALON AND SUPPLIES Beauty Salons (ES 0200 0225 08) 2017-05-01

That looks better!

What are the top businesses in Tacoma?

Looking at the business descriptions of the most recent businesses show a lot of diversity. I wonder what are the most common industries in Tacoma.


In [10]:
# Let's take a look at the different types of business in Tacoma that have a business license.
bl_groupedbynaics = bl.groupby('NAICS CODE DESCRIPTION')['BUSINESS NAME'].count()
sorted = bl_groupedbynaics.sort_values()
sorted.tail(10)


Out[10]:
NAICS CODE DESCRIPTION
Janitorial Services  (ES 0200 0225  06)                          414
Beauty Salons  (ES 0200 0225    08)                              414
Plumbing/Heating/Air-Conditioning Contractors (ES 0200 0225      536
New Single-Family Housing Constn (ex Operative Bldrs) (ES 02     541
Residential Remodelers  (ES 0200 0225  08)                       560
Landscaping Services                                             574
Lessors of Nonresidential Bildngs (ex Miniwrhss) (ES 0200 02     682
Electrical Contractors   (ES 0200 0225    06)                    754
Taxi Service                                                    3564
Lessors of Residential Buildings and Dwellings                  6827
Name: BUSINESS NAME, dtype: int64

Whoa! Taxi services and rental properties are Big Business in Tacoma! I was not expecting that.

Ok, but what if I divide this up by ZIP code? What's the top business for each ZIP code operating in Tacoma?

ZIP Codes: Downtown Tacoma and the Port of Tacoma are different from the rest of Tacoma.


In [11]:
# Find top industry in each of Tacoma's major ZIP code areas. 
# http://www.zipmap.net/Washington/Pierce_County/Tacoma.htm
tacoma_zip_codes = [98402, 98403, 98404,  98405, 98406, 98407, 
                    98408, 98409, 98418, 98421, 98422, 98465]
print len(tacoma_zip_codes)


12

In [12]:
# Clean up ZIP code data.
# Remove extra numbers after first 5 digits and remove non-numeric ZIP codes.
bl['ZIP CODE 5 string'] = bl['ZIP CODE'].str[:5]
# If zip code contains any characters, replace it with 00000
bl.loc[bl['ZIP CODE 5 string'].str.contains('[a-zA-Z]'), 'ZIP CODE 5 string']  = '00000' 
# Create another zip code column that has integer format.
bl['ZIP CODE 5 int'] = bl['ZIP CODE 5 string'].astype(int)

In [13]:
# Print top three NAICS codes in each ZIP code.
for zip in tacoma_zip_codes:
    bl_zipsort = bl[bl['ZIP CODE 5 int']== zip]
    bl_zipsort = bl_zipsort.groupby('NAICS CODE DESCRIPTION')['BUSINESS NAME'].count()
    sorted = bl_zipsort.sort_values()
    print zip, ' #1 ', sorted.keys()[-1]
    print zip, ' #2 ', sorted.keys()[-2]
    print zip, ' #3 ', sorted.keys()[-3]
    print '----'


98402  #1  Offices of Lawyers
98402  #2  Beauty Salons  (ES 0200 0225    08)
98402  #3  Lessors of Residential Buildings and Dwellings
----
98403  #1  Lessors of Residential Buildings and Dwellings
98403  #2  Taxi Service
98403  #3  Beauty Salons  (ES 0200 0225    08)
----
98404  #1  Lessors of Residential Buildings and Dwellings
98404  #2  Taxi Service
98404  #3  Landscaping Services
----
98405  #1  Lessors of Residential Buildings and Dwellings
98405  #2  Taxi Service
98405  #3  Offices of Physicians (ex Mental Health Specialists) (ES 020
----
98406  #1  Lessors of Residential Buildings and Dwellings
98406  #2  Beauty Salons  (ES 0200 0225    08)
98406  #3  Taxi Service
----
98407  #1  Lessors of Residential Buildings and Dwellings
98407  #2  Offices of Real Estate Agents and Brokers   (ES 0200 0225
98407  #3  Taxi Service
----
98408  #1  Lessors of Residential Buildings and Dwellings
98408  #2  Taxi Service
98408  #3  Beauty Salons  (ES 0200 0225    08)
----
98409  #1  Lessors of Residential Buildings and Dwellings
98409  #2  Taxi Service
98409  #3  Security Guards and Patrol Services
----
98418  #1  Lessors of Residential Buildings and Dwellings
98418  #2  Taxi Service
98418  #3  Other Personal Care Services   (ES 0200 0225    08)
----
98421  #1  Lessors of Nonresidential Bildngs (ex Miniwrhss) (ES 0200 02
98421  #2  General Freight Trucking/Long-Distance/Truckload (ES 0200 02
98421  #3  Lessors of Residential Buildings and Dwellings
----
98422  #1  Lessors of Residential Buildings and Dwellings
98422  #2  Taxi Service
98422  #3  Administrative Management and General Management Consulting
----
98465  #1  Lessors of Residential Buildings and Dwellings
98465  #2  Taxi Service
98465  #3  Other Personal Care Services   (ES 0200 0225    08)
----

It looks like folks renting out their homes and ride-share drivers are still the top industries even if I divide Tacoma up by ZIP code. However, there are two ZIP code areas that stand out as different. They are 98421 and 98402, the ZIP codes where the Port of Tacoma and downtown Tacoma are located, respectively. The top two industries in the Port are warehouse rentals and long-haul trucking. In downtown Tacoma, there are more lawyer offices than ride-share drivers.

Taking a deeper look at the taxi and rental housing industry.

Since leasing and taxi services are the top business licenses in Tacoma and the city's business license homepage specifically draws attention to these two industries I thought I'd look into them a bit more. For-hire transportation business licenses were only recently required on January 1, 2014 and licenses for lessors renting out their homes were required starting January 1, 2004. Therefore, I would expect to see a spike in the number of business licenses for January 1 in 2004 and 2014 due to these new business license requirements. Let's see if this is the case.


In [14]:
# Count up the TOTAL number of businesses that opened each year.
bl_resampled = bl.resample('AS', on='BUSINESS OPEN DATE')['BUSINESS NAME'].count()
bl_resampled = pd.DataFrame(bl_resampled)
bl_resampled['BUSINESS OPEN YEAR'] = bl_resampled.index.year
bl_resampled = bl_resampled.set_index('BUSINESS OPEN YEAR')
bl_resampled = bl_resampled.rename(columns = {'BUSINESS NAME':'Count'})

In [15]:
# Let's plot evolution in time for ALL businesses.
%matplotlib inline  
ax = bl_resampled.plot.bar(figsize=(15,8), fontsize = 7, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 20)
ax.set_ylabel('Count', fontsize = 20);


Whoa! Look at that jump in 2004! That does look suspiciously close to the year that business licenses were required for lessors. Based on the underlying trend in the number of business licenses, I would guess that there should be only about 1000 new business licenses for 2004. Instead there are 3000! There's about 2000 extra business licenses for 2004. If I look only at lessor business licenses, do I come up with those extra 2000 business licenses?


In [16]:
# Lessor business licenses grouped yearly
bl_lessor = bl[bl['NAICS CODE DESCRIPTION']=='Lessors of Residential Buildings and Dwellings']
bl_lessor = bl_lessor.resample('AS', on='BUSINESS OPEN DATE')['BUSINESS NAME'].count()
bl_lessor = pd.DataFrame(bl_lessor)
bl_lessor['BUSINESS OPEN DATE'] = bl_lessor.index.year
bl_lessor = bl_lessor.set_index('BUSINESS OPEN DATE')
bl_lessor=bl_lessor.rename(columns = {'BUSINESS NAME':'Count_Lessor'})
# Plot it.
ax = bl_lessor.plot.bar(figsize=(20,4), fontsize = 10, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 20)
ax.set_ylabel('Count', fontsize = 20);


Yes! I see a jump up to 2000 business licenses for lessors in 2004. It looks like they are the cause of the overall jump in 2004. Interesting!

Let's also take a look at the taxi business. It does seem like there might be an unusual rise in the number of licenses in 2014. Let's take a deeper look at taxi business licenses.


In [17]:
# Taxi business licenses grouped yearly
bl_lessor = bl[bl['NAICS CODE DESCRIPTION']=='Taxi Service']
bl_lessor = bl_lessor.resample('AS', on='BUSINESS OPEN DATE')['BUSINESS NAME'].count()
bl_lessor = pd.DataFrame(bl_lessor)
bl_lessor['BUSINESS OPEN DATE'] = bl_lessor.index.year
bl_lessor = bl_lessor.set_index('BUSINESS OPEN DATE')
bl_lessor=bl_lessor.rename(columns = {'BUSINESS NAME':'Count_Lessor'})
# Plot it.
ax = bl_lessor.plot.bar(figsize=(20,4), fontsize = 10, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 20)
ax.set_ylabel('Count', fontsize = 20);


The city of Tacoma required business licenses for ride-share drivers on January 1, 2014, however, the number of licenses doesn't pick up until 2016. It looks like Tacoma was getting ready for a possible influx of ride-share drivers when they changed the business requirements. Nice planning!

Finding groups with similar features in the dataset using K-modes clustering

My previous analysis of lessors and taxi drivers gave me an idea to explore the data more using clustering. When looking at the plot showing the number of new business licenses every year I was led to investigate an underlying feature of the dataset -- the jump in licenses in 2004 for lessors. I wonder if there are other features of the dataset that I can pull out. One way of finding features is using a machine learning technique called clustering which sorts through each data point (in our case, each business license) and groups it with other similar data points. For example, business licenses with opening dates around 2004 in the leasing industry might be grouped together.

I'll be using a version of k-means clustering that allows for categorical data called k-modes clustering. The basic steps of k-means clustering using numerical data are as follows.

  • Randomly select the centers of the clusters.
  • Find the distance between the centers and each data point.
  • Assign each data point to the cluster center that is closest.
  • Calculate the new center of the cluster using its cluster members.
  • Repeat process until cluster center doesn't change a significant amount.

Another way to say this is that the k-means clustering algorithm minimizes the sum of the squared differences between the cluster members and cluster center.

Since the business license dataset contains a few columns containing categorical, rather than numerical data, I'll use a modified version of k-means clustering called k-modes clustering. Rather than minimizing the sum of the squared differences, this technique minimizes the dissimilarity measure which is defined as

$ \Sigma_{\textrm{all }j}{\delta(x_j, y_j)}$

The dissimilarity measure is the sum over all of the categorical features $j$ in the dataset. Each business license is represented by $x$ and $y$. The function $\delta$ is the delta function, i.e., $\delta = 0$ when $x$ and $y$ are in the same category and $\delta = 1$ when they are in different categories.

Another difference between the k-modes method and the k-means method is the use of the mode rather than the mean to find the center of the cluster.

A python implementation of this technique has been written. It also includes k-prototypes which allows for a mix of categorical and numerical data in a single data set. I'll be using that.

Feature selection

Before plugging my dataset into the algorithm, let me choose the "features" to be used. Based on my previous experiments with the lessors and the most common businesses in each ZIP code, I think some useful features will be the opening year of the business, the NAICS code, and the ZIP code. However, rather than using the entire NAICS code, I'll only use the first two digits. These give the broad category of each business which is what I am interested in here.


In [18]:
# Create new column which contains only first two digits of NAICS CODE.
bl['NAICS CODE FIRST TWO'] = bl['NAICS CODE'].fillna('00')
bl['NAICS CODE FIRST TWO'] = bl['NAICS CODE FIRST TWO'].astype(str)
bl['NAICS CODE FIRST TWO'] = bl['NAICS CODE FIRST TWO'].str[:2]

In [19]:
# Count up the number of licenses for each NAICS CODE 2 digit grouping.
bl_groupedbynaicsfirst2 = bl.groupby('NAICS CODE FIRST TWO')['BUSINESS NAME'].count()
sorted = bl_groupedbynaicsfirst2.sort_values()
sorted.tail(3)


Out[19]:
NAICS CODE FIRST TWO
48    4062
23    4495
53    8544
Name: BUSINESS NAME, dtype: int64

If I only use the first two digits of the NAICS code, then the top business types are "Real Estate and Rental and Leasing," "Construction," and "Transportation and Warehousing."


In [20]:
# How many unique NAICS 2 digits codes are there in this dataset?
naics_keys = sorted.keys()
len(naics_keys)


Out[20]:
25

In [21]:
# Create the dataset to be used in the k-modes clustering algorithm
# Create new column with open year.
bl['BUSINESS OPEN YEAR'] = bl['BUSINESS OPEN DATE'].dt.year
# Create a dataframe that contains the data to be used for clustering.
df_kmodes = bl[['NAICS CODE FIRST TWO', 'BUSINESS OPEN YEAR','ZIP CODE 5 string']].copy()
df_kmodes.head()


Out[21]:
NAICS CODE FIRST TWO BUSINESS OPEN YEAR ZIP CODE 5 string
0 54 1994 98004
1 61 2016 98406
2 53 2007 98466
3 45 2005 37402
4 56 2015 98335

In [22]:
df_kmodes.dtypes


Out[22]:
NAICS CODE FIRST TWO    object
BUSINESS OPEN YEAR       int64
ZIP CODE 5 string       object
dtype: object

In [23]:
# Import kprototypes module from kmodes
# https://github.com/nicodv/kmodes
from kmodes import kprototypes
# Convert dataframe to numpy array.
X = pd.DataFrame.as_matrix(df_kmodes)
print X


[['54' 1994 '98004']
 ['61' 2016 '98406']
 ['53' 2007 '98466']
 ..., 
 ['71' 2009 '98405']
 ['53' 2004 '98109']
 ['23' 2001 '98370']]

A common question that is often asked when trying to cluster data is "How many clusters should I have?" Usually it's a good idea to plot the data to try to figure this out. This is a bit difficult to do when dealing with three features, but let's see what we find.


In [24]:
# ZIP code vs business open year
ax = bl.plot.scatter(x="BUSINESS OPEN YEAR", y = "ZIP CODE 5 int", figsize=(20,8), fontsize = 15, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 15)
ax.set_ylabel('ZIP CODE 5', fontsize = 15);



In [25]:
# First two digits of NAICS code vs business open year
bl['NAICS CODE FIRST TWO INT'] = bl['NAICS CODE FIRST TWO'].astype(int)
ax = bl.plot.scatter(x="BUSINESS OPEN YEAR", y = "NAICS CODE FIRST TWO INT", figsize=(20,8), fontsize = 15, legend = False);
ax.set_xlabel('Business Open Year', fontsize = 15)
ax.set_ylabel('NAICS CODE First Two Digits', fontsize = 15);


I'm not seeing any obvious groupings of the data. So, that wasn't really helpful in trying to figure out the number of clusters. That's ok, there's another way. For the k-means algorithm, a common technique is to find the "elbow point." The first step is to run the k-means algorithm for a few different values of $k$ (the number of clusters). Then plot the average distance of each point to its cluster's center against the number of clusters. For only two clusters, the average distance will be large. But as the number of clusters increases, the average distance decreases until it bottoms out. The point where the curve goes from steep to flat is the "elbow point." Clusters to the right of the "elbow point" are likely to be subjected to overfitting. If the number of clusters equals the number of datapoints, then each point has its own cluster and that's not very useful.

Let's give the elbow method a try and see what happens. I'll try between 2 and 25 clusters. The maximum number of NAICS codes is 25. I think the clusters might start to be grouped by NAICS codes if I let there be that many. Let's see.


In [26]:
# Create dictionaries to store the data from the k-modes algorithm.
cluster_dict = {}
kproto_dict = {}

WARNING! The following loop takes awhile (> 1 hour) to run on my ~2010 Macbook Air. If you'd like to play with the data it creates, but would rather not run this loop yourself, then use the data that I have put in a pickle file for safe-keeping. I write to the pickle file after the loop runs. You should be able to find it here as well.


In [ ]:
for i in range(2,26):
    kproto_dict[i] = kprototypes.KPrototypes(n_clusters = i, init = 'Cao', verbose = 0) 
    cluster_dict[i]= kproto_dict[i].fit_predict(X, categorical=[0,2])

In [ ]:
# Create dictionary with cluster data. 
cluster_info = {}
for i in range(2,26):
    cluster_info[i]=[kproto_dict[i].cost_,
                     kproto_dict[i].n_iter_, 
                     kproto_dict[i].cluster_centroids_,
                     cluster_dict[i]]

In [ ]:
# Save data in a pickle file.
import pickle   
pickle.dump(cluster_info,open('cluster_data.p', 'wb'))

In [27]:
# To use the data in the pickle file, read it in here.
import pickle   
cluster_info = pickle.load(open('cluster_data.p', 'rb'))

Now that I have the data compiled for a range of k values, I can compare how the error (or "Cost") changes with k.


In [28]:
x = cluster_info.keys()
cost = {}
for k in cluster_info:
    cost[k] = cluster_info[k][0]
y = cost.values()
plt.xlabel('k (number of clusters)', fontsize = 15)
plt.ylabel('Cost', fontsize = 15)
plt.scatter(x,y);


It looks like the 'elbow' occurs around 10 and since 10 is a nice round number, I'll use that for my number of clusters. The next step is to see what I can learn about how the data is organized into these 10 clusters. What are the primary features of each cluster? One way of investigating this is by looking at the cluster centroids.


In [29]:
# Print cluster centroid: year, NAICS ID, and ZIP code
year = cluster_info[10][2][0]
IDZIP = cluster_info[10][2][1]
print 'Year NAICS  ZIP'
for i in range(len(year)):
    print int(year[i][0]), IDZIP[i][0], '   ', IDZIP[i][1]


Year NAICS  ZIP
2005 53     98409
1959 42     98409
1981 23     98409
1999 00     98409
2011 53     98409
2003 53     98406
1992 23     98409
2008 53     98405
2014 53     98408
2016 48     98409

The years for the clusters seem to be varied, but the NAICS code is often 53, "Real Estate and Rental and Leasing" The fact that the ZIP code 98409 is most common for the clusters makes sense because it is the most common ZIP code in the data set.


In [30]:
# Print most common ZIP codes.
bl_groupedbyZIP = bl.groupby('ZIP CODE')['BUSINESS NAME'].count()
sorted = bl_groupedbyZIP.sort_values()
sorted.tail(5)


Out[30]:
ZIP CODE
98407     376
98406     530
98405     604
98402     921
98409    1102
Name: BUSINESS NAME, dtype: int64

In addition to understanding the clusters by looking at their centers, it is often useful to visualize the properties of the cluster members.


In [31]:
# First, assign each business in the dataframe to a cluster.
bl['cluster_num'] = cluster_info[10][3]

In [32]:
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)

plt.figure(figsize = (10,8))  #w,h
numplots = 10
# Clusters arranged in approximate chronological order.
plot_order = [1,2,6,3,5,0,7,4,8,9]  
top_zip = [98409, 98409,98409,98409,  98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+1)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]

    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 5)
    plt.ylabel('Count', fontsize = 8)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 8)

    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 5)
    plt.xlabel('Year', fontsize = 8)

    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 5)
    plt.xlabel('ZIP codes', fontsize = 8)

    plt.text(-0.1,0.79-0.09*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 8)
    plt.text(0.05,0.79-0.09*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 8)
    plt.text(0.21,0.79-0.09*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 8)


This figure shows the overall structure of each of the 10 clusters. Each row of graphs in the figure shows the properties of a single cluster; the top three graphs show the NAICS codes, years, and ZIP codes for Cluster 1, from left to right.

Let's discuss a few of the clusters at a time, looking at their commonalities and differences. I'll talk about Clusters 1, 2, and 3 together because they look similar except for their yearly graph. Then I'll take a look at Cluster 4 which stands out as unique from the other clusters due to the peak in the middle of its NAICS graph. Then I'll discuss Clusters 5, 6, 7, 8, and 9 together. I'll finish with a discussion of Cluster 10.


In [33]:
# Visualize Clusters 1, 2, and 3.
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)

plt.figure(figsize = (14,8))  #w,h
numplots = 3
# Clusters arranged in approximate chronological order.
plot_order = [1,2,6,3,5,0,7,4,8,9]  
top_zip = [98409, 98409,98409,98409,  98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+1)# + '\n' + str(top_zip[i])
    current_cluster = bl[bl['cluster_num']==plot_order[i]]

    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 12)
    plt.ylabel('Count', fontsize = 12)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 12)
    plt.text(-0.07,0.85-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)
    
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 12)
#    plt.ylabel('Count', fontsize = 25)
    plt.xlabel('Year', fontsize = 12)
    plt.text(0.135,0.85-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)    
        
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 12)
#    plt.ylabel('Count', fontsize = 25)
    plt.xlabel('ZIP codes', fontsize = 12)
    plt.text(0.410,0.85-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)


The NAICS code and ZIP code distributions for Clusters 1, 2, and 3 look about the same. The main difference between the clusters occurs in their range of years; Cluster 1, 2 and 3 are arranged from older to newer businesses.

There's an interesting spike in Cluster 1 around 1950. I'm not sure what's happening here. Since there aren't as many business licenses before 1950, perhaps around this time the city's business license requirements were more strictly enforced.

Let's take a look at the next cluster - Cluster 4, the odd ball.


In [34]:
# Visualize Cluster 4
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)

plt.figure(figsize = (16,4))  #w,h
numplots = 1
# Clusters arranged in approximate chronological order.
plot_order = [3,5,0,7,4,8,9]  
top_zip = [98409,  98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+4)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]

    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 12)
    plt.ylabel('Count', fontsize = 12)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 12)
    plt.text(-0.07,0.36-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)
    
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 12)
    plt.xlabel('Year', fontsize = 12)
    plt.text(0.2,0.36-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)    
        
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 12)
    plt.xlabel('ZIP codes', fontsize = 12)
    plt.text(0.5,0.36-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)


I'm not sure what's happening with Cluster 4. It's a bit unusual because it has a peak for NAICS codes of 00 which were originally NaN values. Are business owners suddenly unable to classify their businesses? Or maybe a previously defined NAICS code was retired, and is now no longer valid, hence the NaN? (It turns out that NAICS codes do change over time). Or maybe, whispering in my best conspiracy voice it's related to the Y2K bug because this spike happened around the year 2000. Maybe we'll never know the truth... back to normal voice.

Let's take a look at the next big group of clusters, Clusters 5-9.


In [35]:
# Visualize Cluster 5-9
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)

plt.figure(figsize = (16,13))  #w,h
numplots = 5
# Clusters arranged in approximate chronological order.
plot_order = [5,0,7,4,8,9]  
top_zip = [ 98406 , 98409, 98405, 98409, 98408, 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+5)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]

    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 12)
    plt.ylabel('Count', fontsize = 14)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 16)

    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 12)
    plt.xlabel('Year', fontsize = 16)

    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 12)
    plt.xlabel('ZIP codes', fontsize = 16)

    plt.text(-0.07,1.38-0.3*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 13)
    plt.text(0.175,1.38-0.3*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 13)
    clusternamezip = clustername + '\n' + str(top_zip[i])
    plt.text(0.43,1.38-0.3*i,clusternamezip, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 13)


Clusters 5 through 9 have a similar overall shape of their NAICS code distribution. Their most common NAICS code is 53, "Real Estate and Rental and Leasing," followed by 23, "Construction." Each cluster however tracks a different yearly range, starting around 2000 and going through today. It is interesting to note that the ratio between the rental business licenses and all the others licenses seems to be decreasing with time. This may be due to the residual effect of the large influx of lessor business licenses in 2001, or because of an actual comparative decrease in home rentals over time. Something to explore for another day...

The ZIP code plot shows another difference between the clusters; I've written the top ZIP code on each of the ZIP code plots. All of these top ZIP codes cover primarily residential areas. Certain areas seem to be experiencing growth over different times. ZIP code 98409 experienced a big jump in the number of business licenses in the mid-2000s as seen in Cluster 6. While in the following few years, ZIP code 98405 took over as the leader in business licenses.

Now, let's take a look at the last group, Cluster 10.


In [36]:
# Visualize Cluster 10
xaxisorderkeys = naics_keys
xaxisorderyears = range(1922,2017)

plt.figure(figsize = (16,4))  #w,h
numplots = 1
plot_order = [9]  
top_zip = [ 98409]
for i in range(numplots):
    clustername = 'Cluster ' + str(i+10)
    current_cluster = bl[bl['cluster_num']==plot_order[i]]

    plt.subplot(numplots, 3, (i+2*i)+1)
    current_cluster['NAICS CODE FIRST TWO'].value_counts().reindex(xaxisorderkeys).plot(
        kind='bar', fontsize = 13)
    plt.ylabel('Count', fontsize = 12)
    plt.xlabel('First 2 digits of NAICS Code', fontsize = 12)
    plt.text(-0.06,0.36-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)
    
    plt.subplot(numplots, 3, (i+2*i)+2)
    current_cluster['BUSINESS OPEN YEAR'].value_counts().reindex(xaxisorderyears).plot(
        kind='line', fontsize = 13)
    plt.xlabel('Year', fontsize = 12)
    plt.text(0.2,0.36-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)    
        
    plt.subplot(numplots, 3, (i+2*i)+3)
    current_cluster['ZIP CODE 5 int'].value_counts().reindex(tacoma_zip_codes).plot(
        kind='bar', fontsize = 13)
    plt.xlabel('ZIP codes', fontsize = 12)
    plt.text(0.48,0.36-0.34*i,clustername, horizontalalignment='center', 
             verticalalignment='center',transform=ax.transAxes, fontsize = 15)


Cluster 10 is dominated by NAICS code 48; our old friends the ride-share drivers are back. Cluster 10 tracks the growth in uber and lyft-like businesses in the Tacoma area. It has happened only recently as seen in the graph of years. It looks like these drivers are found in most ZIP codes, but dominate the 98409 area.

What have we learned about clustering?

I think my big takeaway is that k-means clustering can help identify trends in your data that you may already be aware of (like lessors and ride-share drivers) and it can also be used to find new trends (why are there so many unknown NAICS codes in the early 2000's?!?). Also, it is a bit tricky to find the correct number of clusters for your data. In the future, it might be interesting to allow for a few more clusters now that I understand the properties of my first ten clusters.

I was hoping to learn more about the business licenses issued in the downtown Tacoma (98402) and the Port (98421). However, since there aren't very many licenses in these regions to begin with, it's hard for the few of them to overcome the dominance of lessors and drivers in residential areas. I think allowing for more clusters might show something interesting about the Port and downtown but I would probably also create a lot more clusters that are already similar to what I currently have. The noise of all these extra clusters might make it hard to find new and interesting clusters.