This notebook is the project report for the data science project done as part of ANLY-501. The project intends to demonstrate all stages of the data science pipeline, and this notebook facilitates that by organizing the different phases into sections and keeping the code and visualizations in one place. Install the Jupyter notebook software available from http://jupyter.org/ to run this notebook. All code for this project is available on Github at https://github.com/aarora79/sb_study and there is also a simple website associated with it at https://aarora79.github.io/sb_study/. The entire code (including the generated output) can be downloaded as a zip file from Github via this URL: https://github.com/aarora79/sb_study/archive/master.zip. To run the code, simply unzip master.zip and run "python main.py". The code requires Python 3.5.2|Anaconda 4.1.1 to be installed on the machine. Also, since the code fetches the datasets in real time, an Internet connection is required.
[Updated for Project3 11/30/2016, starting from section Predictions based on classification].
Starbucks Corporation is an American coffee company and coffeehouse chain with more than 24,000 stores across the world. This project intends to explore the following data science problems:
This problem is important because it attempts to provide a model (using Starbucks as an example) which can be applied to any similar business (say, in the food and hospitality industry) to predict where it could expand globally. It also provides insight into where the business is located today, bringing out information such as the fact that the total number of stores found in the entire continent of Africa (48) is far less than the number of stores found just at U.S. airports (184).
The data used as part of this project is obtained from two sources.
Starbucks store location data is available from https://opendata.socrata.com via an API. The Socrata Open Data API (SODA) endpoint for this data is https://opendata.socrata.com/resource/xy4y-c4mk.json
The economic and urban development data for various countries is available from the World Bank (WB) website. WB APIs are described at https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information. The API endpoint for all the World Development Indicators (WDI) is http://api.worldbank.org/indicators?format=json. Some examples of indicators being collected include GDP, urban to rural population ratio, % of employed youth (15-24 years), international tourism receipts (as % of total exports), ease of doing business and so on.
The possible directions/hypotheses based on the collected data (this is not the complete list; it will be expanded as the project goes on):
EDA about Starbucks store locations (for example):
Data visualization of the Starbucks store data:
Machine learning model for predicting the number of Starbucks stores based on various economic and urban development indicators. This could then be used to predict which countries where Starbucks does not have a presence today would be best suited as new markets. For example, model the number of Starbucks locations in a country based on a) ease of doing business, b) GDP, c) urban to rural population ratio, d) employment rate among people in the 15-24 year age group, e) type of government, f) access to 24-hour electricity, and so on.
The data used for this project is obtained via APIs from the Socrata website and the World Bank website and is therefore expected to be relatively error free (for example, as compared to the same data being obtained by scraping these websites). Even so, the data is checked for quality, and appropriate error handling or even alternate mechanisms are put in place to handle errors/issues with the data. The table below summarizes the handling strategies, followed by a minimal code sketch of the "ignore and count" approach.
Issue | Handling Strategy |
---|---|
Some of the city names (for example, cities in China) include UTF-16 characters and would therefore not display correctly in documents and charts. | Replace the city name with the country code followed by _1, _2 and so on, for example CN_1, CN_2 etc. |
Missing data in any of the fields in the Starbucks dataset. | Ignore the data for the location with any mandatory field missing (say country name is missing). Keep a count of the locations ignored due to missing data to get a sense of the overall quality of data. |
Incorrect format of the value in various fields in the Starbucks dataset. For example, Latitude/Longitude values being out of range, country codes being invalid etc. | Ignore the data for the location with any invalid value. Keep a count of the locations ignored due to invalid data to get a sense of the overall quality of data. |
Misc. data type related errors such as the date-time field (first_seen field in the Starbucks dataset) not being correct, or fields considered as primary keys not being unique (for example store id for the Starbucks dataset, country code for the WB dataset). | Flag all invalid fields as errors. |
Missing data for any of the indicators in the WB dataset. | The most recent year for which the data is available is 2015, if for a particular indicator the 2015 data is not available then use data for the previous year i.e. 2014. If no data is available for that indicator even for the previous 5 years then flag it as such and have the user define some custom value for it. |
Incorrect format of the value in various fields in the WB dataset. For example alphanumeric or non-numeric data for fields such as GDP for which numeric values are expected. | Provide sufficient information to the user (in this case the programmer) about the incorrect data and have the user define correct values. |
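The "ignore and count" strategy for missing mandatory fields can be sketched in a few lines of pandas. This is only an illustration; the mandatory column list below is an assumption (the actual list lives in the SBS code).

import pandas as pd

SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)

#assumed mandatory columns; keep only those that actually exist in the file
mandatory = [c for c in ['store_id', 'country', 'name'] if c in df.columns]
total = len(df)
df_clean = df.dropna(subset=mandatory)
ignored = total - len(df_clean)
print('%d of %d locations ignored due to missing mandatory fields' % (ignored, total))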
The subsequent sections of this notebook provide a view of the data for the two datasets used in this project and also provide the data quality scores (DQS) for both the datasets.
The Python code for the SB Study (SBS) is run offline and the results (along with the code) are periodically pushed to Github. The code here simply downloads the result files from Github to show the results and what the data looks like.
In [9]:
import pandas as pd
#get the Starbucks dataset, as downloaded by the SBS program, from Github
SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)
print('Starbucks dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()
Out[9]:
In [10]:
import pandas as pd
#get the WorldBank dataset, as downloaded by the SBS program, from Github
WB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WDI_data_as_downloaded.csv'
df = pd.read_csv(WB_DATASET_URL)
print('Worldbank dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()
Out[10]:
To check for data quality, two types of checks were done on both datasets.
Two metrics are derived to quantify missing data and invalid data: a raw score and an adjusted score.
Raw Score for Missing Data: the percentage of data that is not missing, i.e. the count of non-empty cells versus the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset is from the perspective of missing data. This score is available for both the Starbucks and WorldBank datasets.
Adjusted Raw Score for Missing Data: the percentage of data that is not missing for mandatory features (i.e. features without which the record would have to be discarded), i.e. the count of non-empty cells in mandatory columns versus the total number of such cells, expressed as a percentage. If all features are mandatory then the adjusted score and the raw score are the same. The higher this score, the cleaner the dataset is from the perspective of missing data. This score is available for both the Starbucks and WorldBank datasets.
Raw Score for Invalid Data: the percentage of data that is not invalid, i.e. the count of cells containing valid data versus the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset is from the perspective of invalid data. Validity checks are both generic and context specific (see the tables below). This score is available for both the Starbucks and WorldBank datasets.
Adjusted Score for Invalid Data: the percentage of data that is not invalid for mandatory features (i.e. features without which the record would have to be discarded), i.e. the count of cells containing valid data in mandatory columns versus the total number of such cells, expressed as a percentage. The higher this score, the cleaner the dataset is from the perspective of invalid data. This score is available for both the Starbucks and WorldBank datasets.
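As an illustration, the raw and adjusted missing-data scores can be computed with a few lines of pandas. The mandatory column list below is an assumption; the real list is defined in the SBS code.

import pandas as pd

def missing_data_scores(df, mandatory_cols):
    #raw score: % of non-empty cells over the whole dataset
    raw = 100.0 * df.notnull().values.sum() / df.size
    #adjusted score: same calculation restricted to the mandatory columns
    sub = df[mandatory_cols]
    adjusted = 100.0 * sub.notnull().values.sum() / sub.size
    return raw, adjusted

SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)
mandatory = [c for c in ['store_id', 'country', 'city'] if c in df.columns]  #assumed mandatory columns
raw, adjusted = missing_data_scores(df, mandatory)
print('raw score: %.2f%%, adjusted score: %.2f%%' % (raw, adjusted))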
Field | Validity Check |
---|---|
Store Id | Has to be unique for each row |
Latitude/Longitude | Longitude must be in the range -180° to +180°, latitude in the range -90° to +90° |
Timezone offset | Has to be divisible by 15 |
Country code | Has to be present in a country code list downloaded offline |
Brand | Has to be within one of 6 brands listed on Starbucks website http://www.starbucks.com/careers/brands |
Store Number | Has to follow a format XXXX-YYYY |
Timezone name | Has to have either "Standard Time" or "UTC" as part of the name |
Ownership Type | Should not contain special characters |
First seen | Has to follow a date format |
Field | Validity Check |
---|---|
Country Code | Should not contain special characters |
All 83 features (WDIs) | Should be numeric as all of them represent a numerical measure of the feature |
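A few of the generic validity checks listed above can be expressed as simple vectorized tests, as sketched below. The latitude/longitude and timezone offset column names are assumptions for illustration, which is why each check is guarded by a column-existence test.

import pandas as pd

SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)

checks = {}
if 'store_id' in df.columns:
    checks['store id unique'] = df['store_id'].is_unique
if 'latitude' in df.columns and 'longitude' in df.columns:  #assumed column names
    checks['lat/long in range'] = bool((df['latitude'].between(-90, 90) &
                                        df['longitude'].between(-180, 180)).all())
if 'timezone_offset' in df.columns:  #assumed column name
    checks['tz offset divisible by 15'] = bool((df['timezone_offset'] % 15 == 0).all())

for name in sorted(checks):
    print('%s: %s' % (name, 'OK' if checks[name] else 'FAILED'))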
The following code fragment downloads a CSV file containing these scores from the Github repo of this project. These scores have been calculated offline by running the program and uploading the results to the Github repo. The missing data raw score for the WorldBank data is quite low (~34%); this is because for several parameters the 2015 data is not yet available, so in this case we will use the 2014 data, as these are macro level indicators which do not see a drastic change year on year (this will be discussed more in the next phase of the project).
In [11]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs.csv'
df = pd.read_csv(DQS_URL)
df.head()
Out[11]:
Three new features (for now) are created using the datasets. These features are all added to the Starbucks dataset and are described below:
Feature | Format | Explanation |
---|---|---|
Continent | String | Map the country code to the continent to which the country belongs. This feature enables answering basic questions like what % of stores exist in Asia as compared to North America for example. The country code to continent mapping is obtained with the help of two csv files downloaded offline (code to country mapping, country to continent mapping), first country code is mapped to country name and then country name is mapped to the continent name. |
On Airport | Boolean | Does this store exist at an airport? This feature helps answer questions like how many of the Starbucks stores in Europe exist at airports. For now a store is considered to be at an airport if the name field or the street combined field for the store contains one of the terms "Airport", "Arpt" or "Terminal". Going forward this logic will be replaced with an IATA API (see comments in code, wb.py file). |
Ease of doing business Category | String | The ease of doing business index (obtained from the WorldBank data) for the country in which the store exists is mapped to a category viz. Very High (1 to 10), High (11 to 30), Medium (31 to 90), Low (91 to 130), Very Low (131 and beyond) and finally Unknown for countries where no data is available. This feature provides insights like: around 70% of Starbucks stores exist in countries which have an ease of doing business index between 1 and 10 (with 1 being the highest and 189 being the lowest). The ease of doing business index is obtained by looking up the country code in the WDI dataset. |
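The sketch below illustrates two of these derived features. The keyword test for "On Airport" follows the description above; the street column name and the country-to-index mapping used for the ease of doing business category are assumptions (the real logic lives in sb/sb.py).

import pandas as pd

SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)

#On Airport: keyword match on the store name (and street field, if present)
text = df['name'].fillna('')
if 'street_combined' in df.columns:  #assumed column name
    text = text + ' ' + df['street_combined'].fillna('')
df['on_airport'] = text.str.contains('Airport|Arpt|Terminal', case=False)
print('%d stores flagged as on an airport' % df['on_airport'].sum())

#Ease of doing business category: map the numeric index into the five buckets
def eodb_category(index_value):
    if pd.isnull(index_value):
        return 'Unknown'
    for upper, label in [(10, 'Very High'), (30, 'High'), (90, 'Medium'), (130, 'Low')]:
        if index_value <= upper:
            return label
    return 'Very Low'

#hypothetical usage, assuming a country -> index mapping built from the WDI data:
#df['eodb_category'] = df['country'].map(eodb_index_by_country).map(eodb_category)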
The following table shows a view of the modified dataset with new features added.
In [2]:
import pandas as pd
SB_W_NEW_FEATURES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_w_features.csv'
df = pd.read_csv(SB_W_NEW_FEATURES_URL)
#show only some of the columns so that new feature can be shown without a scrollbar
cols = ['store_id', 'country', 'name', 'city', 'brand', 'continent', 'on_airport', 'eodb_category']
df[cols].head()
Out[2]:
This section provides information about how to run the code, how the code is organized into modules, and what output files are generated.
When you unzip the zip file (from Blackboard or Github), you will see a folder called sb_study which contains all the code, helper files and an output folder (the contents of the output folder get rewritten on every run of the SBS program). As a pre-requisite, the machine on which the program is run must have Python 3.5.2|Anaconda 4.1.1 installed. Run the program with:
python main.py
The code takes about 2 to 3 minutes to run completely. As the program runs it emits traces on the console, and these traces are also automatically logged to a file called SBS.log in the output directory (created by the SBS program in the current working directory if not already present). The program creates the output directory and several sub-directories inside it to store the various artifacts (CSV files, plots etc.) produced while it runs.
The program can also be run in analysis-only mode, which assumes that the data has already been downloaded by running the program in full mode first. In analysis-only mode the program runs various forms of analysis on the data and stores the created artifacts under subdirectories in the output folder. This mode is useful if only the analysis needs to be redone and the entire dataset does not need to be downloaded again. To run the program in analysis mode use the following command line.
python main.py -a
Here is a listing of everything that is included, along with a brief description.
File name | Notes |
---|---|
main.py | Everything starts from here, as the name suggests this is the main file of the package. This is what we run. It is organized to run a series of functions like a data science pipeline, for example get_data, clean_data and so on. It internally invokes the wb (for world bank) module and the sb (for Starbucks) module to do the same tasks for the Worldbank and Starbucks datasets respectively. |
README.md | This is the readme markdown file for Github. |
LICENCE.md | This is the license markdown file for Github. The SBS project is available under MIT license. |
countries.csv | A CSV file which provides the country to continent mapping. |
data.csv | A CSV file which provides the country code to country name mapping. Downloaded offline. |
WDI_Series.csv | A CSV file containing the name and description of World Development Indicators (WDI) selected for use in this project. Created offline. |
wb | Directory for the world bank submodule |
wb\wb.py | Main file for the WorldBank submodule. It contains functions to run the data science pipeline for the WorldBank dataset. |
wb\wb_check_quality.py | This file performs the data quality checks i.e. missing data and invalid data checks for the WorldBank data. It internally invokes the utils.py module for some of the data validation checks. |
sb\sb.py | Main file for the Starbucks submodule. It contains functions to run the data science pipeline for the Starbucks dataset. It also contains code for feature generation. |
sb\sb_check_quality.py | This file performs the data quality checks i.e. missing data and invalid data checks for the Starbucks data. It internally invokes the utils.py module for some of the data validation checks. Many of the data validation checks done are specific to the Starbucks data set and are included in this file. |
common | Directory for the common submodule. |
common\utils.py | This file contains utility functions that are common to both modules, so this is the one place to implement them. Several data validation checks are also included in this file. |
common\logger.py | This file creates a logger object using the Python logging module. It sets the format specifier so that the trace messages have the desired format (timestamp, module name, etc.) and also takes care of attaching two handlers, one for the console and one for the log file so all traces go to both the console and the log file. |
common\globals.py | This file contains the common constants and even some variables that are common across modules. Think of this as a common header file. |
output | This is the folder which contains all the output (including log file) produced by the SBS program. The SBS program creates this directory if not present. |
output\SBS.log | This is the log file created upon running the program. This file also contains more detailed DQS metrics i.e. missing data and invalid data related errors for individual features. |
output\SB_data_as_downloaded.csv | This is the CSV version of the Starbucks data downloaded via the Socrata API. |
output\SB_data_w_features.csv | This is the CSV version of the Starbucks data with the new features added to it (last three columns) as part of the feature creation activity. |
output\WB_data_as_downloaded.csv | This is the CSV version of the WorldBank data downloaded via the WB API. |
output\dqs.csv | This is the CSV file which contains the data quality score for both the datasets. |
The output folder contains several CSV files which correspond to the data as downloaded, the cleaned data and the combined data.
The WorldBank WDI dataset and the Starbucks dataset are combined for the hypothesis testing and machine learning described later in this notebook. To combine the two datasets we first filter out all the countries from the WorldBank dataset which do not have a Starbucks store, i.e. we keep only those countries in the combined dataset that have at least one Starbucks store. This combined dataset is then cleaned (described later in this document) and all analysis is run on it. There are 72 countries in the world where Starbucks is present, so the combined dataset contains 72 rows and 48 columns (more on the column count later). The following fields computed from the Starbucks dataset are added to the combined dataset:
Feature | Description |
---|---|
Num.Starbucks.Stores | Total number of Starbucks stores in the country |
Num.Starbucks.Stores.Categorical | Total number of Starbucks stores in the country, categorized as VL (Very Low), L (Low), M (Medium), H (High), VH (Very High) |
Starbucks.Store.Density | Number of Starbucks stores per 100,000 people |
Starbucks.Store.Density.Categorical | Number of Starbucks stores per 100,000 people, categorized as VL (Very Low), L (Low), M (Medium), H (High), VH (Very High) |
Exists.inMultiple.Cities | True if Starbucks stores exist in multiple cities in a country, False otherwise |
Ownership.Type.Mixed | True if multiple ownership types exist for Starbucks stores in a country (franchisee owned, Starbucks owned etc.) |
Continent | Name of the continent in which the country is located |
MultipleBrands | True if Starbucks exists as multiple brands in the country, False otherwise |
The shape of the combined dataset is (72, 48). This includes 35 WDI (World Development Indicators) from the WorldBank dataset.
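A minimal sketch of building the combined dataset is shown below, using the as-downloaded files shown earlier and assuming a shared two-letter country code between the two datasets (the join key name in the WorldBank file is an assumption; the real merge is done inside the SBS code).

import pandas as pd

SB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WDI_data_as_downloaded.csv'
sb = pd.read_csv(SB_URL)
wb = pd.read_csv(WB_URL)

#count stores per country from the Starbucks data
store_counts = sb.groupby('country').size().rename('Num.Starbucks.Stores')

#keep only WorldBank rows for countries with at least one store (inner join)
key = 'country_code' if 'country_code' in wb.columns else wb.columns[0]  #assumed key column
combined = wb.join(store_counts, on=key, how='inner')
print('combined dataset shape:', combined.shape)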
For converting continuous columns to categorical, a quantile (percentile) based binning strategy is used (a minimal code sketch follows the table below).
Range | Category |
---|---|
<= 10th percentile | Very Low (VL) |
<=20th percentile | Low (L) |
<=60th percentile | Medium (M) |
<= 80th percentile | High (H) |
beyond 80th percentile | Very High (VH) |
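A minimal sketch of this percentile-based binning, assuming a pandas Series of numeric values (the real implementation is part of the SBS code):

import pandas as pd

def to_categorical(series):
    #bin edges follow the table above: 10th, 20th, 60th and 80th percentiles
    edges = series.quantile([0.0, 0.10, 0.20, 0.60, 0.80, 1.0]).values
    labels = ['VL', 'L', 'M', 'H', 'VH']
    #note: duplicate percentile edges (possible for very skewed data) would need extra handling
    return pd.cut(series, bins=edges, labels=labels, include_lowest=True)

#quick demonstration on synthetic data
s = pd.Series(range(1, 101))
print(to_categorical(s).value_counts().sort_index())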
This section provides basic statistical analysis for both the WorldBank dataset and the StarBucks datasets. The following boxplot provides a quick overview into the distribution of Starbucks stores across continents.
There is a huge variation in the distribution of Starbucks stores in the continent of North America. This is along expected lines since, other than the U.S. and Canada, all other countries are islands. The North American continent also contains the country with the most Starbucks stores (the U.S., of course, no surprises there).
There is minimal variation in the distribution of Starbucks stores in the continent of Oceania since it consists of only two countries which have Starbucks stores, i.e. Australia and New Zealand.
The European continent has the most even distribution around the median.
The WorldBank dataset consists of all numeric values except for the country code, country name and the derived feature for ease of doing business (categorical). The country code and name are excluded from the statistical analysis since there is only one row per country in the dataset. The statistical measures for each feature are listed below. Some of the features show NaN for the statistical measures; that is because there is no data at all for these features (they are subsequently removed from the dataset, as described later).
In [19]:
import pandas as pd
WB_EDA_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WB_EDA.csv'
df = pd.read_csv(WB_EDA_URL)
df
Out[19]:
The Starbucks dataset consists of all categorical features, therefore only a mode calculation is valid. Several features such as street names are excluded from the mode calculation as well. There are some interesting observations to be made.
In [20]:
import pandas as pd
SB_EDA_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/SB_EDA.csv'
df = pd.read_csv(SB_EDA_URL)
df
Out[20]:
Outlier detection is performed on the WorldBank dataset. We determine outliers using a simple rule: anything more than three standard deviations from the mean is an outlier. The results of outlier detection are provided below. Note that while there are values which get categorized as outliers, they are still valid values (as understood by human inspection of the data), therefore no outlier handling as such (smoothing, removal etc.) is done for these values. Also, given that these are valid values, they cannot be ignored as they may have a significant role in the machine learning algorithms.
No outlier detection is performed on the Starbucks dataset as it contains only categorical values (values outside of well defined valid values set are considered invalid and that analysis has already been performed as part of the data quality detection phase).
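The three-standard-deviation rule can be sketched as below, using the as-downloaded WorldBank file; the actual outlier report is the WB_outliers.csv file loaded in the next cell.

import pandas as pd

WB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WDI_data_as_downloaded.csv'
wb = pd.read_csv(WB_DATASET_URL)

outlier_counts = {}
for col in wb.select_dtypes(include=['number']).columns:
    s = wb[col].dropna()
    if len(s) < 2 or s.std() == 0:
        continue
    z = (s - s.mean()) / s.std()   #z-score of each value
    n = int((z.abs() > 3).sum())   #values more than 3 standard deviations from the mean
    if n > 0:
        outlier_counts[col] = n

print(pd.Series(outlier_counts).sort_values(ascending=False).head())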
In [21]:
import pandas as pd
WB_OUTLIER_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WB_outliers.csv'
df = pd.read_csv(WB_OUTLIER_URL)
df
Out[21]:
As part of the analysis that follows, several continuous/discrete features were binned (as described below this is needed for association rule mining, feature selection etc.). Here we show how one specific feature, IC.BUS.EASE.XQ, which quantifies the ease of doing business in a country (1 being the highest; the larger the value, the lower the ease of doing business), is converted into a categorical value. Numeric values get mapped to categories as follows:
Range | Category |
---|---|
[1,11) | VeryHigh |
[11,31) | High |
[31,91) | Medium |
[91, 131) | Low |
[131, beyond) | VeryLow
The benefit of binning this feature is that it supports exploratory data analysis, answering questions like what % of Starbucks stores are in countries with high or very high ease of doing business. The method used provides a simple strategy to bin the data, although it would require a subject matter expert to say what the most appropriate bin edges are.
Here is an excerpt from the WorldBank data showing the new column created as a result of binning along with the original feature (first 5 rows shown).
In [22]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WB_data_w_features.csv'
df = pd.read_csv(WB_URL)
df[['name', 'IC.BUS.EASE.XQ', 'Ease.Of.Doing.Business']].head()
Out[22]:
There are not a lot of missing values in the Starbucks dataset (the adjusted data quality score for missing values is 100%, which means all the features that we care about have no missing values; see the section on DQS above). The WorldBank dataset, however, has a lot of missing data: its DQS for missing values is ~33.65%. The missing values in the WorldBank dataset are handled as follows (a minimal sketch of the first two steps appears after this list):
The WorldBank data is updated every year and many times there is a time lag before the data for a year becomes available. Also, since these indicators are macro level indicators, the year on year change is not drastic. As a first step, all the values that are missing for the latest year are filled with the values from the previous year whenever the previous year's data is available.
Define a feature density threshold of 90% and delete all features which are less than 90% filled.
Finally, the analysis is to be performed on the combined dataset; as described above, the combined dataset is obtained by joining the WorldBank and Starbucks datasets only for those countries that exist in the Starbucks dataset. This intersection results in a dataset which is nearly complete (99.8% filled).
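A minimal sketch of the first two cleaning steps is shown below. The per-year column names are hypothetical stand-ins for however the raw WorldBank data lays out its yearly values; the actual cleaning is done in the SBS code.

import pandas as pd

def fill_from_previous_year(df, latest='2015', previous='2014'):
    #step 1: where the latest year's value is missing, fall back to the previous year
    out = df.copy()
    if latest in out.columns and previous in out.columns:
        out[latest] = out[latest].fillna(out[previous])
    return out

def drop_sparse_features(df, density_threshold=0.90):
    #step 2: drop any feature that is less than 90% filled
    density = df.notnull().mean()
    return df[density[density >= density_threshold].index]

#hypothetical usage on the raw WorldBank dataframe:
#wb_clean = drop_sparse_features(fill_from_previous_year(wb_raw))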
The DQS metrics for the original datasets as well as the combined dataset after cleaning are shown below.
In [24]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs/dqs_after_cleaning.csv'
df = pd.read_csv(DQS_URL)
#This is the DQS after Step 1 described above
df
Out[24]:
In [25]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs/dqs_combined_dataset.csv'
df = pd.read_csv(DQS_URL)
#This is the DQS after Step 3 described above (Combined dataset, actually used for analysis)
df
Out[25]:
The combined dataset is cleaned for missing values as described in the previous section.
The following figure shows histograms of three different features from the WorldBank dataset.
The observations from these histograms are described below.
Feature | Description | Observations |
---|---|---|
IC.BUS.EASE.XQ | Ease of doing business index (1=most business-friendly regulations) | Almost uniformly distributed except for the 100 to 150 range (indicating more concentration of countries in that range). |
SP.URB.GROW | Urban population growth (annual %) | Almost normally distributed. |
WP15163_4.1 | Mobile account (% age 15+) | Almost looks like an EL (Exponential logarithmic distribution). |
The following figure shows histograms of features from the combined dataset. The value of the correlation coefficient is also shown.
The observations from these histograms are described below.
Feature | Description | Observations |
---|---|---|
IT.NET.USER.P2 | Internet users (per 100 people) | Looks like a skewed left distribution. Most countries with Starbucks stores have a very high percentage of internet users, which makes sense if one visualizes a typical coffee shop. |
Num.Starbucks.Stores | Number of Starbucks stores. | Most countries (in fact 69 out of 73 which have Starbucks) have less than 1000 stores, which makes sense but does not tell us much. A better representation would be to increase the number of bins in this histogram. |
SP.POP.TOTL | Total population | The outliers in this chart towards the extreme right probably represent India and China. |
ST.INT.ARVL | International tourism, number of arrivals | Interestingly, it appears that most Starbucks stores are not located in countries with very high tourist footfalls. |
The following figure shows the scatter plot for three features in the WorldBank dataset.
From the scatter plots we can observe that there is a negative correlation between the ease of doing business index and the number of Internet users in a country; but since the index is expressed such that a lower number means greater ease, it follows that ease of doing business and number of Internet users are positively correlated.
Ease of doing business is also positively correlated with per capita GDP. There also seems to be a positive correlation between the number of internet users and per capita GDP.
It is important to mention here the oft-repeated phrase: correlation does not imply causation.
The following table shows the correlation coefficients between these features, and the value of r corroborates the observations from the scatter plot above.
In [26]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WDI_pearson_r.csv'
df = pd.read_csv(WB_URL)
df
Out[26]:
The following figure shows the scatter plot for three features in the combined dataset.
From the scatter plots we can observe that the number of Starbucks stores has a somewhat positive, non-linear relationship with the number of international tourist arrivals, total population and number of people having access to the Internet. This is further clarified when we do polynomial regression between the number of Starbucks stores and the number of international tourist arrivals.
The following table shows the correlation coefficients between these features, and the value of r corroborates the observations from the scatter plot above.
In [27]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/Combined_r.csv'
df = pd.read_csv(WB_URL)
df
Out[27]:
Cluster analysis was conducted to find patterns within the data (from the combined dataset). Recall that the combined dataset is created by filtering out countries which do not have a Starbucks presence; this brings a Starbucks-specific dimension to the dataset. The presence of Starbucks stores (say, in terms of the number of stores in a country or any other feature derived from the Starbucks dataset) is the dependent variable in this analysis, therefore any patterns we find in the clusters that are created will be examined in that light, i.e. do the cluster members show an affinity to a particular value of the dependent variable.
Before clustering, a Principal Component Analysis (PCA) is first conducted to transform the data into 3 dimensions. PCA in 3 dimensions helps because the clusters can then be visualized in 3D space. Three different types of clustering mechanisms are used viz. KMeans, DBSCAN and Hierarchical (Agglomerative) clustering. This is described in detail in the next few sections; let's examine the PCA results first.
The following figure shows the result of PCA in 3 dimensions. Note that PCA only helps to spatially distinguish the data, but seeing the clusters without a spectral differentiation is difficult (which is what the clustering algorithms provide).
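A minimal sketch of the PCA and KMeans steps is shown below, reusing the labelled clustering output file (the same file shown a few cells below) as input. Standardization, k=5 and the random seed are assumptions for illustration, so the labels will not necessarily match the ones stored in the file.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

Combined_W_Labels_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
df = pd.read_csv(Combined_W_Labels_URL)

#use the numeric columns, excluding any previously stored label columns
num = df.select_dtypes(include=['number'])
num = num[[c for c in num.columns if not c.endswith('_labels')]]
X = num.fillna(num.mean())

#PCA to 3 dimensions (for 3D visualization), then KMeans with 5 clusters
X3 = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X3)
print(pd.Series(labels).value_counts())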
KMeans label=4 groups together the U.S. and China in a cluster of their own, which should not be a surprise. Of course we can see any number of similarities that could have caused this, for example GDP, tourist arrivals etc. Also, from a Starbucks perspective, these countries represent the two highest numbers of Starbucks stores.
KMeans label=3 groups together Brazil, Russia and other countries of the erstwhile Russian federation. From a Starbucks perspective these countries represent very similar Starbucks store density (number of stores per 100,000 people).
KMeans label=2 groups together mostly Asian, African and some South American countries which have low to medium ease of business but medium to high population.
Other labels can also be explained in a similar way.
In [28]:
import pandas as pd
Combined_W_Labels_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
df = pd.read_csv(Combined_W_Labels_URL)
df[['name', 'KMeans_labels', 'SL.GDP.PCAP.EM.KD','Ease.Of.Doing.Business','ST.INT.ARVL.Categorical','SP.POP.TOTL.Categorical','Num.Starbucks.Stores','Starbucks.Store.Density']].head(10)
Out[28]:
Association rule mining is run on the combined dataset to find tuples of various feature values that occur together in the dataset and draw inferences from that. Most of the data in the combined dataset is continuous or discrete; to run association rule mining the first step needed is to convert the data into categorical values. The combined dataset contains 40-odd features, so instead of converting the entire dataset to categorical, some selected features are converted and association rule mining is run for those features.
Continuous/discrete features are converted into categorical by binning them by percentile values as described in previous sections. The dataset for this particular problem does not lend itself very well to this type of analysis (because a large number of features are numeric). The conversion of numeric data to categorical in the context of this dataset is an art rather than a science (meaning what bin sizes are good for categorizing the number of stores as low, medium or high is subjective), therefore we can get vastly different results by changing the criteria for converting numeric data to categorical (having 3 bins instead of 5, for example). The regression and classification algorithms discussed later are better suited for this data science problem.
To select which features have a relationship with the dependent variables that we want to predict, we draw scatter plots of the dependent feature vs. all numeric features in the dataset and manually examine where a correlation is seen. Another technique used for finding the most important features is a machine learning one (the ExtraTreesClassifier), with the top two most important features then converted to categorical. The scatter plots are not included in this report although they are available as part of the overall package that has been submitted. Similarly, the list of important features can be seen in the logs generated on running the program.
Two different types of rule mining are done, one with a rule containing a single variable predicting the dependent variable and another one with two variables predicting the dependent variable.
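A minimal sketch of how such rules and their frequency, support and confidence can be counted is shown below; it is an illustration of the idea, not the project's actual rule-mining code.

import pandas as pd

def two_feature_rules(df, antecedent, consequent):
    n = len(df)
    counts = df.groupby([antecedent, consequent]).size()
    antecedent_counts = df[antecedent].value_counts()
    rows = []
    for key in counts.index:
        a, b = key
        freq = int(counts[key])
        rows.append({'rule': 'R:({%s=%s})->%s=%s' % (antecedent, a, consequent, b),
                     'frequency': freq,
                     'support': freq / float(n),                         #fraction of all records
                     'confidence': freq / float(antecedent_counts[a])})  #fraction of records containing the antecedent
    return pd.DataFrame(rows).sort_values('frequency', ascending=False)

#hypothetical usage on the combined dataset (column names as in the tables below):
#rules = two_feature_rules(combined, 'ST.INT.ARVL.Categorical', 'Num.Starbucks.Stores.Categorical')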
The following table identifies the features used for the association rules with a single predictor variable:
Feature | Description | Categorical values |
---|---|---|
Num.Starbucks.Stores.Categorical | Number of Starbucks stores, categorical | VL, L, M, H, VH |
ST.INT.ARVL.Categorical | International tourist arrivals, categorical | VL, L, M, H, VH |
The expectation is that a very high influx of tourists should occur together with a very high number of Starbucks stores. This does occur, as it is the second most frequent rule shown in the table below; however the support value for this rule is a low 0.125 and the confidence is 0.64. The rule that occurs most frequently is that a medium tourist influx corresponds to a medium number of Starbucks stores; this rule occurs 15 times and has a support of 0.20 and confidence of 0.53. A greater support level was expected for the more frequent rules, but this is probably not happening for two reasons: a) there are only 72 countries with Starbucks stores so there is not a lot of data, and b) the conversion of numeric to categorical data in this case is not an exact science and can be tweaked to give better results (reduce the number of categories, for example).
This is all summarized in the following table. The rules themselves are written in the following format: R:({A})->B means if A is present then B is also present (with the frequency, support and confidence mentioned alongside to give a sense of probability).
In [29]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df.sort_values(by='frequency', ascending=False).head()
Out[29]:
In [30]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.05_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1
Out[30]:
In [31]:
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.07_and_min_confidence_0.25.csv'
df2 = pd.read_csv(ASSOC_RULES_URL)
df2
Out[31]:
In [32]:
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.1_and_min_confidence_0.25.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df
Out[32]:
The following table identifies the features used for the association rules with two predictor variables:
Feature | Description | Categorical values |
---|---|---|
Num.Starbucks.Stores.Categorical | Number of Starbucks stores, categorical | VL, L, M, H, VH |
ST.INT.ARVL.Categorical | International tourist arrivals, categorical | VL, L, M, H, VH |
SP.POP.TOTL.Categorical | Total population, categorical | VL, L, M, H, VH |
The expectation is that a high influx of tourists and a high local population should occur together with a high number of Starbucks stores. This does occur, as it is the second most frequent rule shown in the table below; however the support value for this rule is a very low 0.08, although the confidence is 0.75. The rule that occurs most frequently is that a medium tourist influx in a country with a medium population corresponds to a medium number of Starbucks stores; this rule occurs 7 times and has a support of 0.09 and confidence of 0.53. A greater support level was expected for the more frequent rules, but this is probably not happening for two reasons: a) there are only 72 countries with Starbucks stores so there is not a lot of data, and b) the conversion of numeric to categorical data in this case is not an exact science and can be tweaked to give better results (reduce the number of categories, for example).
This is all summarized in the following table. The rules themselves are written in the following format: R:({A,B})->C means if A and B are present together then C is also present (with the frequency, support and confidence mentioned alongside to give a sense of probability).
In [33]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df.sort_values(by='frequency', ascending=False).head()
Out[33]:
In [34]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.05_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1
Out[34]:
In [35]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.07_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1
Out[35]:
In [36]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.1_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1
Out[36]:
This hypothesis is tested in two ways.
Two sample T-test: A two sample T-test is used to find out if the means of two independent samples are different. The null hypothesis is that the means of both samples are the same. The two sample t-test relies only on the sample means and does not need to know the population mean.
Linear and Polynomial regression: ordinary least squares regression. Linear regression tries to predict the value of the dependent variable by using a linear combination of explanatory variables. Linear regression can be univariate (one explanatory variable) or multivariate. A polynomial regression tries to model the dependent variable using higher powers of the explanatory variable(s).
The combined dataset is used to filter out entries with 'ST.INT.ARVL.Categorical' equal to 'VH' (for very high) and then another set of entries with 'ST.INT.ARVL.Categorical' values other than 'VH' so as to include the rest of the world. The Python scipy module is used to calculate a T statistic and a corresponding p-value when comparing the two distributions. The p-value comes out to be ~0.17. This is an unusually large value, and with a typical T-test alpha value of 0.05 it would not be possible to reject the null hypothesis; but in this case, since the distribution of the number of stores was assumed to be normal when it is known that it is not normal (see the histogram in previous sections), a higher alpha of 0.20 is chosen, and this enables the null hypothesis that there is no difference to be rejected. This is in line with what is seen in the scatter plots for these two features.
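A minimal sketch of this test using scipy is shown below, reusing the labelled clustering output file (which carries both columns) as a stand-in for the combined dataset; the scipy defaults are used, so the numbers may differ slightly from the t_test_results.csv file loaded in the next cell.

import pandas as pd
from scipy import stats

URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
df = pd.read_csv(URL)

vh = df[df['ST.INT.ARVL.Categorical'] == 'VH']['Num.Starbucks.Stores'].dropna()
rest = df[df['ST.INT.ARVL.Categorical'] != 'VH']['Num.Starbucks.Stores'].dropna()
t_stat, p_value = stats.ttest_ind(vh, rest)   #two sample t-test
print('t statistic = %.3f, p-value = %.3f' % (t_stat, p_value))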
In [37]:
import pandas as pd
T_TEST_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/regression/t_test_results.csv'
df1 = pd.read_csv(T_TEST_URL)
df1
Out[37]:
In this method, instead of using the categorical value of the number of international tourist arrivals, the actual discrete value is used and it is provided to a linear regression model and a polynomial regression model to predict the number of Starbucks stores. The linear model provides an explained variance of 0.27 whereas the polynomial model performs much better, providing an explained variance of 0.33. Ideally the explained variance should be close to 1; it is a matter of further investigation as to what else can be done to make the prediction better. The following scatter plots show the linear and polynomial regression models in action; as can be seen from the plots, the polynomial model provides a better fit (a minimal sketch of these fits follows the figure). The polynomial model is a degree 3 polynomial (other degrees were also tried and 3 provided the best results). The outlier value seen towards the top right is the number of stores in the U.S.; it is an extreme value and further investigation would need to be done to figure out if a single variable can predict such a huge jump in value.
[Figure: scatter plots showing the linear and polynomial regression fits]
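A minimal sketch of the two fits described above, using numpy.polyfit and again reusing the labelled clustering output as a stand-in for the combined dataset (assuming the raw ST.INT.ARVL column is present in that file):

import pandas as pd
import numpy as np
from sklearn.metrics import explained_variance_score

URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
df = pd.read_csv(URL).dropna(subset=['ST.INT.ARVL', 'Num.Starbucks.Stores'])
x = df['ST.INT.ARVL'].values
y = df['Num.Starbucks.Stores'].values

for degree in (1, 3):
    coeffs = np.polyfit(x, y, degree)   #least squares fit of the given degree
    y_hat = np.polyval(coeffs, x)
    print('degree %d: explained variance = %.2f' % (degree, explained_variance_score(y, y_hat)))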
This hypothesis states that the number of Starbucks stores in a country expressed as a categorical value (Very High, High, Medium, Low, Very Low) can be predicted using the WDI (World Development Indicators) from the WorldBank data. This is verified by running various classification algorithms and then selecting the one that provides the highest accuracy value as measured via cross validation.
The combined dataset contains 35 WDI features. Not all of these are used for classification; the 15 most important features are chosen. The Extra Trees classifier, which uses an ensemble of decision trees, is used to provide a ranking for all the features, and then the top ranked 15 features are used as inputs for the classification algorithms. Tests were done with 5, 10, 15, 30 and all 35 features as well; the 15 most important features provided the best results in terms of accuracy.
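A minimal sketch of the feature-ranking step is shown below, again using the labelled clustering output as a stand-in for the combined dataset and assuming it carries the Num.Starbucks.Stores.Categorical target column; the real selection is part of the SBS code.

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
df = pd.read_csv(URL)

target = df['Num.Starbucks.Stores.Categorical']
#candidate features: numeric columns that are neither stored labels nor derived from the Starbucks data
features = df.select_dtypes(include=['number'])
features = features[[c for c in features.columns
                     if 'Starbucks' not in c and not c.endswith('_labels')]]
features = features.fillna(features.mean())

mask = target.notnull()
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(features[mask], target[mask])
importances = pd.Series(model.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head(15))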
The 15 most important features for determining the number of stores value as a categorical are:
Feature | Description |
---|---|
TX.VAL.TECH.CD | High-technology exports (current US dollars) |
ST.INT.ARVL | International tourism, number of arrivals |
SP.POP.TOTL | Total Population |
BG.GSR.NFSV.GD.ZS | Trade in services (percentage of GDP) |
SL.GDP.PCAP.EM.KD | GDP per person employed (constant 2011 PPP dollars) |
IC.LGL.CRED.XQ | Strength of legal rights index (0=weak to 12=strong) |
BX.KLT.DINV.WD.GD.ZS | Foreign direct investment, net inflows (percentage of GDP) |
IC.EXP.COST.CD | Cost to export (US$ per container) |
NE.CON.PETC.ZS | Household final consumption expenditure, etc. (percentage of GDP) |
IC.REG.DURS | Time required to start a business (days) |
TX.VAL.OTHR.ZS.WT | Computer, communications and other services (percentage of commercial service exports) |
IC.REG.PROC | Start-up procedures to register a business (number) |
IC.BUS.EASE.XQ | Ease of doing business |
SH.STA.ACSN.UR | Improved sanitation facilities, urban (percentage of urban population with access) |
IT.NET.USER.P2 | Internet users (per 100 people) |
The above list is very interesting because all of these indicators appear to be important factors that a business would consider. A lot of these are either economic or directly related to business activity, so it intuitively makes sense that the machine learning algorithm is choosing the right indicators.
Once the indicators were selected, all of the following classification algorithms were applied to the data and the results in terms of accuracy and standard deviation of a K-fold cross validation were noted. Random forest turned out to be the best classifier in terms of accuracy.
The accuracy turned out to be in the 50 to 60 percent range, but this is just the first pass at prediction. The indicators, how they are used, and finally the bins for the classification of the dependent variable are all subject to refinement, so this is not the final score. The hypothesis is considered valid although the accuracy value is lower than desirable. More investigation needs to be performed to bring the accuracy score up to around the 80 percent range.
All of the following algorithms were used for classification. All of these algorithms are available via the Scikit-learn Python package. A train/test split was done using a training data fraction of 20%. A 10-fold cross validation was done to determine the accuracy of the various algorithms (a minimal sketch of this comparison follows the table below).
Algorithm | Description | Important notes | Results (Accuracy (standard deviation) |
---|---|---|---|
kNN | K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). | Used scikit learn's defaults. | 0.476667 (0.164688) |
CART | Classification and Regression Trees. Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. Leaf node represents a classification or decision. The top most node is called root and is also the best predictor. | Used scikit learn's defaults. | 0.383333 (0.188709) |
Naive Bayes | The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. | Used scikit-learn's defaults. Performs well, along with kNN and RandomForest, for the combined dataset. | 0.493333 (0.221008) |
SVM | A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors. | Used scikit-learn's defaults (the rbf kernel is the default; other kernels were tried and all give the same result). Does not perform well for the dataset. | 0.363333 (0.210000) |
RandomForest | A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default). | Used scikit-learn defaults. Provides the best results. | 0.510000 (0.201136) |
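A minimal sketch of the cross-validation comparison summarized above; X and y stand for the 15 selected WDI features and the categorical store-count target, built as in the earlier sketches (assumptions, since the full pipeline lives in the SBS code).

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = [('kNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('NaiveBayes', GaussianNB()),
          ('SVM', SVC()),
          ('RandomForest', RandomForestClassifier())]

def compare_classifiers(X, y, cv=10):
    #10-fold cross validation; report mean accuracy and standard deviation per classifier
    for name, model in models:
        scores = cross_val_score(model, X, y, cv=cv)
        print('%s: %f (%f)' % (name, scores.mean(), scores.std()))

#hypothetical usage, with X and y built as in the feature-selection sketch above:
#compare_classifiers(X, y)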
This hypothesis states that the number of Starbucks stores in a country, which is a discrete variable (as opposed to the categorical value considered in hypothesis 2), can be predicted using some WDI indicators. The method used here is similar to hypothesis 2 in that first a set of the three most important features is determined using the Extra Trees classifier and then these three features are fed into a multivariate polynomial regression model to predict the dependent variable, i.e. the number of Starbucks stores in a country. The three features used for this prediction are 'ST.INT.ARVL' (international tourist arrivals), 'TX.VAL.TECH.CD' (high-technology exports in current US dollars) and 'SP.POP.TOTL' (population of the country).
A multivariate polynomial model is chosen since there is more than one explanatory variable and the relationship with the dependent variable is definitely not linear, as is known from the scatter plots (not included here but available as part of the package). A degree 3 polynomial works best, as determined by experimenting with other values as well. An explained variance score of 0.99 is achieved; however the mean squared error is still very high at 27063. The MSE needs to be improved before the hypothesis can be considered valid, and this needs further investigation in terms of feature selection and the model used for prediction.
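A minimal sketch of this multivariate degree-3 fit, again reusing the labelled clustering output as a stand-in for the combined dataset and assuming the three explanatory columns are present in that file:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_squared_error

URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
features = ['ST.INT.ARVL', 'TX.VAL.TECH.CD', 'SP.POP.TOTL']
df = pd.read_csv(URL).dropna(subset=features + ['Num.Starbucks.Stores'])

X = df[features].values
y = df['Num.Starbucks.Stores'].values

#polynomial feature expansion followed by ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
y_hat = model.predict(X)
print('explained variance: %.2f, MSE: %.0f' % (explained_variance_score(y, y_hat),
                                               mean_squared_error(y, y_hat)))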
A couple of other hypotheses were also tested as part of this phase.
At this point in the data science lifecycle a clean combined dataset exists and has been used to validate multiple hypotheses as described above. The focus of this phase was getting everything in place to run predictive algorithms; the next phase will focus on improving the accuracy of the predictions. The results of the predictions are included in the overall package (see the regression folder) but are not discussed here at this time.
What does appear is that, more than a regression problem, this can be modeled as a clustering and classification problem. In other words, use clustering algorithms (like KMeans as described above) to see which clusters the countries that have no Starbucks stores fall into; for example, if they appear in clusters corresponding to countries with a high number of Starbucks stores, then that is a prediction. Similarly, the RandomForest or NaiveBayes classifier (which provided between 50 and 61% accuracy) could be used to predict the category (low, medium, high etc.) for the number of Starbucks stores.
Another aspect to review is that maybe the categories for the number of Starbucks stores need to be reduced from 5 (Very high, high, medium, low, very low) to just 3 (high, medium, low) to improve the accuracy of predictions. These are all things to be considered for the next phase of the project.
As discussed in the previous section, it appears that classification is a better option than regression for prediction in this data science problem. In other words, it is possible to predict the number of Starbucks stores in a country as a categorical variable (i.e. Very Low, Low, Medium, High, Very High) with a much higher degree of accuracy than predicting the exact number as a continuous variable. The RandomForest classifier discussed above provided an accuracy of between 50 and 61% (stddev 12 to 20%); it is used to predict the number of stores in countries which do not have a Starbucks presence yet but for which data is available for all the features needed for prediction. The features used for prediction are listed in Hypothesis 2 and are not repeated here. Since the original WorldBank data has a lot of missing data, predictions cannot be made for all countries where Starbucks does not have a presence yet. The following table lists the results of the prediction (VL: Very Low, L: Low, M: Medium, H: High, VH: Very High). The categories translate into the following numerical values:
Range | Category |
---|---|
[1,3] | Very Low |
[4,6] | Low |
[7,41] | Medium |
[42, 128] | High |
[129, beyond) | Very High
In [1]:
import pandas as pd
RESULTS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/classification/Num.Starbucks.Stores.Categorical_predictions.csv'
df = pd.read_csv(RESULTS_URL)
df
Out[1]:
The prediction throws up some interesting results. While some interpretation is provided here, it would be up to a subject matter expert (and, well, Starbucks) to say why these groupings make sense (or not). A point to note here (which has not been discussed earlier) is that the WorldBank data also has data for groups of countries (such as the OECD) combined as a single entity, so these groups also appear in the following classification.
Category | Notes |
---|---|
Very High | High income countries, OECD countries (Switzerland, Turkey, U.S., U.K.) and the EU show up in this category. This makes intuitive sense: high income, developed nations have a very high number of Starbucks stores. An interesting result here is the OECD countries group (U.S., U.K., Turkey, Switzerland), since a Very High or High number of Starbucks already exists in these countries, so this speaks to the accuracy of the model. |
High | Several groups show up in this category. The world overall is classified as a High Starbucks region. |
Very Low | This group consists of some African and South American countries. It is a matter of study to say why these countries are clustered together. Israel and Iceland could be outliers in this list. |
Medium | This is the single largest group consisting of 29 countries from all over the world. Italy shows up in this list, whereas one would expect that there would already be Starbucks in Italy. This list can be combined with other aspects (see visualizations) such as tourist inflow to prioritize which countries should be higher on the list for Starbucks to enter. A few countries worth mentioning from this list are Sri Lanka, Mauritius, Jamaica and European countries like Italy, Albania, Lithuania, Montenegro, Slovenia etc. |
Low | This is a small list, couple of African, one Asian and one South American country. |
A piece of information available in the Starbucks dataset is the date on which a store was first seen, and the dataset contains information from December 2013 to the current month, i.e. November 2016. The first seen field can be used to identify, say, a month on month increase in the number of stores, i.e. stores for which the first seen date is in January 2014 can all be considered as newly opened stores in January 2014 (this was also empirically verified by the fact that a new store that opened in Clarksburg, Maryland, which the author is personally aware of, shows up with a first seen date of November 2016).
This information about the first seen date can be used to do a time series analysis to examine and predict the number of Starbucks stores in the next 12 months. To do this the following steps were followed:
Extract a time series of interest i.e. total number of Starbucks stores across the world and plot it to visually examine how the number of stores was increasing over time.
Examine the time series for being stationary, i.e. is the mean of the series increasing with time. The idea behind requiring the time series to be stationary is that if the time series is stationary then individual points can be considered I.I.D (independent identically distributed) and the usual concepts for stationary random variables (Central Limit Theorem, Law of Large Numbers) can be applied to it. If the series is not stationary then it needs to be transformed into a stationary series. There are several transformations to stationarize a series, such as taking the logarithm, deflation by CPI, first differences etc.
The Auto Regressive Integrated Moving Average (ARIMA) model is used to forecast the timeseries. In general, ARIMA models are applied in some cases where data show evidence of non-stationarity (like in case of the Starbucks store data), where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity. See https://www.otexts.org/fpp/8/1 for more details about time series and https://people.duke.edu/~rnau/411arim.htm for an explanation on ARIMA.
The parameters (p,d,q) of the non-seasonal ARIMA model are non-negative integers: p is the order (number of time lags) of the autoregressive model, d is the degree of differencing (the number of times the data have had past values subtracted), and q is the order of the moving-average model. The (p,d,q) values are generally identified by studying the ACF/PACF values; in this case, they were identified by trial and error (the values that gave the least mean squared error were used).
A Seasonal ARIMA (SARIMA) could have been tried, but the current stable release of the statsmodels package (0.6.1) does not support it, although it is available in a version currently under development (0.7.1).
The ARIMA model is provided the logged time series for the reasons discussed above. The fitted values provided by the model are actually the differences of the logged series; these values are then cumulatively summed to derive the actual logged series. Once the predicted logged series is obtained it is transformed back to its original scale by taking the exponent (e^x) of the logged values. The forecasted values for future timestamps are obtained using the same procedure.
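A minimal sketch of this workflow is shown below, written against the statsmodels ARIMA API in use at the time (newer statsmodels releases moved ARIMA to statsmodels.tsa.arima.model); the way the monthly series is built from the first_seen column and the (p,d,q) order are assumptions for illustration.

import pandas as pd
import numpy as np
from statsmodels.tsa.arima_model import ARIMA

SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
sb = pd.read_csv(SB_DATASET_URL, parse_dates=['first_seen'])

#monthly count of newly seen stores, cumulatively summed to get the running total
monthly_new = sb.set_index('first_seen').resample('M').size()
total_stores = monthly_new.cumsum()
logged = np.log(total_stores.astype(float))          #log transform to stabilize the trend

model = ARIMA(logged, order=(2, 1, 0)).fit(disp=0)   #(p,d,q) chosen by trial and error (assumed here)
forecast_log = model.forecast(steps=12)[0]           #12-month forecast on the logged scale
print(np.exp(forecast_log))                          #back-transform to store counts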
The following sections discuss time series analysis for two time series, one for the total number of stores across the world and another one for the number of stores in the state of New York. Several other TSA analyses (such as time series for the number of stores in the U.K., India, and the continent of Africa) are included in the project submission.
There are four plots provided below.
[Four plots: time series analysis and 12-month forecast for the total number of Starbucks stores worldwide]
The same four plots are provided for the time series analysis of the number of stores in the state of New York as well. The RMSE is again less than 1%. The forecasted number of stores in the state of New York at the end of 2017 is ~692, an addition of 53 stores over the current count of 639.
[Four plots: time series analysis and 12-month forecast for the number of Starbucks stores in the state of New York]
This analysis explores the distribution of the number of Starbucks stores in countries which have a Very High density of Starbucks stores (> 1 Starbucks store per 100,000 people) and at least 100 Starbucks stores across the country. The number of stores per city in a country is modeled as a random variable and its distribution is examined. The following table lists the countries studied. Note that Singapore, which has 125 stores across the country, is not included since there are only two cities in Singapore, one of which has a single Starbucks store.
In [3]:
import pandas as pd

# per-country table of the cities with the most Starbucks stores,
# published under output/EDA/more in the project repository
RESULTS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/more/cities_withmost_stores.csv'
df = pd.read_csv(RESULTS_URL)
df
Out[3]:
From the above table it is clear that the maximum number of stores occurs in the capital city of each country, except in the case of the U.S. where the city with the most stores is New York (which is not a surprise).
The following boxplots and histogram show the distribution of Starbucks stores across cities in the above mentioned countries. As is clear from these plots, the distribution is right skewed, with a very large difference between the third quartile and the maximum (because the one city with the maximum number of stores is an outlier).
[Boxplots and histogram of the number of Starbucks stores per city, before Winsorization]
In order to obtain any insight from the above plots the outliers need to be removed. This is done by "Winsorizing" (see https://en.wikipedia.org/wiki/Winsorizing) the distribution at the 0.05 level, i.e. all values greater than the 95th percentile are clamped at the 95th percentile value. The following are the plots after Winsorization. As seen from the plots, the basic nature of the distribution has not changed, but it is now clear that the median number of stores per city for all of these countries is around 3 and the 3rd quartile is less than 7. The shape of the distribution as seen from the histograms is similar for all of these countries, with the U.S. having the maximum number of stores at almost all percentiles. The shape of the distribution is discussed in the next section.
[Boxplots and histogram of the number of Starbucks stores per city, after Winsorization at the 0.05 level]
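The clamping step described above can be sketched as follows; this is a minimal example using scipy's winsorize, and the variable name is illustrative rather than taken from the project code.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# `stores_per_city` is assumed to be a 1-D array of store counts per city
def winsorize_upper_tail(stores_per_city, upper=0.05):
    # clamp everything above the 95th percentile at the 95th percentile value;
    # limits=(0, upper) leaves the lower tail untouched
    return np.asarray(winsorize(stores_per_city, limits=(0, upper)))

# an equivalent formulation with plain numpy
def clamp_at_95th(stores_per_city):
    cap = np.percentile(stores_per_city, 95)
    return np.minimum(stores_per_city, cap)
```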
Looking at the histogram above, the distribution looks similar to an exponential distribution, or at least something from that family of distributions. To understand this better, the distributions for each of these countries were examined individually, and visual inspection suggested that an Exponential-Logarithmic distribution or an Erlang distribution fit the data best. The Seaborn package used for making the plots also provides the option to pass a "fit" distribution, and it then overlays the fitted curve on the actual data.
The following plots show the CDF (cumulative distribution function) plots as well as the histograms with the fitted curve from the Erlang distribution. The plots provided here are for the U.S. and the U.K. (the rest of the plots are included in the submission package). The Erlang distribution is used to model incoming calls in a telephone system, waiting times in queueing systems, interpurchase times in business economics, etc. In the context of this data science problem, this distribution could be applied to model the number of Starbucks stores in the cities of a country where Starbucks does not exist today: it would cover both the number of stores in the capital city (which would be an outlier) and the number of stores in other cities, which would fall within the 1st quartile, 3rd quartile and so on. This could be used in conjunction with the previous analysis where the number of Starbucks stores is predicted as a categorical variable: that analysis provides a range for the total number of stores in the country, and the Erlang distribution could then be used to distribute that total across multiple cities. This is a topic for further research.
[CDF plots and histograms with fitted Erlang curves for the U.S. and the U.K.]
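As an illustration of the fitting approach described above, here is a minimal sketch that overlays an Erlang fit on the histogram using the older Seaborn distplot API (the one with the "fit" option mentioned earlier) and compares the empirical CDF against the fitted CDF. Variable names are illustrative, not taken from the project code.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

def plot_erlang_fit(stores_per_city):
    # histogram with a fitted Erlang curve overlaid; distplot accepts any
    # scipy.stats-style distribution object through the `fit` argument
    plt.figure()
    sns.distplot(stores_per_city, kde=False, fit=stats.erlang)

    # fit the Erlang distribution explicitly to compare CDFs
    params = stats.erlang.fit(stores_per_city)
    xs = np.sort(stores_per_city)
    ecdf = np.arange(1, len(xs) + 1) / float(len(xs))

    plt.figure()
    plt.step(xs, ecdf, where='post', label='empirical CDF')
    plt.plot(xs, stats.erlang.cdf(xs, *params), label='fitted Erlang CDF')
    plt.legend(loc='lower right')
    plt.show()
```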
The following visualization shows all Starbucks stores as of November 2016, overlaid on a world map. The brown dots on the map represent locations of Starbucks stores. Hovering over an individual point brings up popup text with the name of the location where the store exists along with the name of the city. Since the map is displayed at a fixed zoom level, the points representing Starbucks stores appear very close to each other in high density areas.
The visualization clearly shows that a lot of Starbucks stores are located on the eastern and western coasts of the U.S., in the U.K. in Europe, in China and Japan in Asia, and in Malaysia and Indonesia in south east Asia. There are relatively fewer stores in South America and extremely few stores in Africa.
This visualization has been created using GMapPlot and Bokeh.
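A map of this kind can be put together with Bokeh roughly as follows. This is a sketch only: the data source columns and the Google Maps API key are placeholders, and the exact GMapPlot API may differ depending on the Bokeh version in use.

```python
from bokeh.io import output_file, show
from bokeh.models import (GMapPlot, GMapOptions, ColumnDataSource, Circle,
                          Range1d, PanTool, WheelZoomTool, HoverTool)

def plot_store_locations(df, api_key):
    # `df` is assumed to have 'latitude', 'longitude', 'name' and 'city' columns
    map_options = GMapOptions(lat=20.0, lng=0.0, map_type='roadmap', zoom=2)
    plot = GMapPlot(x_range=Range1d(), y_range=Range1d(),
                    map_options=map_options, api_key=api_key)

    source = ColumnDataSource(data=dict(lat=df['latitude'], lon=df['longitude'],
                                        name=df['name'], city=df['city']))
    # brown dots for the individual stores
    plot.add_glyph(source, Circle(x='lon', y='lat', size=3,
                                  fill_color='brown', line_color=None))
    # hover text with the store name and city
    plot.add_tools(PanTool(), WheelZoomTool(),
                   HoverTool(tooltips=[('store', '@name'), ('city', '@city')]))

    output_file('locations.html')
    show(plot)
```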
In [15]:
from IPython.core.display import display, HTML

# render the pre-generated Bokeh map (locations.html) inline in the notebook
filename = 'locations.html'
display(HTML(filename=filename))
With more than 13,000 Starbucks stores, the U.S. has the most Starbucks stores in the world by a huge margin, but how are these stores distributed within the U.S.? This is what the next visualization tries to explore by creating a heatmap of U.S. counties: the more stores in a county, the more intense its color. Hovering over a county brings up the name of the county and the number of stores in it. As expected, the stores are concentrated on the east coast and the west coast, with very few in the states at the geographical center of the country such as Nebraska, South Dakota and Kansas.
The maximum number of stores, 665, occurs in Los Angeles county in California. New York county (234 stores) in New York state, Maricopa county (334 stores) in Arizona, Cook county (325 stores) in Illinois and King county (336 stores) in Washington are some of the counties with a very high number of Starbucks stores. Just from visual examination of the graphic it appears that New York county (which includes Manhattan) would probably have the highest density of Starbucks stores in terms of number of stores divided by the size of the county.
This plot has been created using a combination of an SVG (Scalable Vector Graphics) image of the U.S. map with counties marked on it and the FCC (Federal Communications Commission) API that provides the name of the county given a location's latitude and longitude. All Starbucks locations are mapped to a county name using the FCC API in an offline process, and a count is maintained of the number of stores in each county; a sketch of this lookup is shown after the table below. Once the store to county mapping is complete the result is stored in a CSV file. This CSV file is then used to assign a color and hover text to the counties listed in the SVG file. The counties are colored using shades of a single color palette, with 0 (no Starbucks stores in the county) being the lightest and greater than 300 stores being the darkest.
| Number of stores | Color intensity (increases with number of stores) |
|---|---|
| 0 | Lightest |
| 1 | |
| [2, 15) | |
| [16, 150) | |
| [151, 300) | |
| 301 and greater | Darkest |
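The offline mapping from store coordinates to county names described above could be sketched as follows. The FCC endpoint URL and the response field names shown here are assumptions based on the public FCC census block lookup API and may differ from what the project actually used; the CSV file name is also a placeholder.

```python
import csv
from collections import Counter

import requests

FCC_BLOCK_API = 'https://geo.fcc.gov/api/census/block/find'  # endpoint assumed

def county_for_location(lat, lon):
    # look up the county name for a latitude/longitude pair via the FCC API
    resp = requests.get(FCC_BLOCK_API,
                        params={'latitude': lat, 'longitude': lon, 'format': 'json'})
    resp.raise_for_status()
    return resp.json().get('County', {}).get('name')

def count_stores_by_county(locations, out_csv='stores_by_county.csv'):
    # `locations` is an iterable of (lat, lon) tuples for U.S. stores
    counts = Counter(county_for_location(lat, lon) for lat, lon in locations)
    with open(out_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['county', 'num_stores'])
        for county, n in counts.most_common():
            writer.writerow([county, n])
    return counts
```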
In [13]:
import IPython

# embed the pre-generated county heatmap (us_counties_starbucks_stores.html) in an iframe
fname = 'us_counties_starbucks_stores.html'
iframe = '<iframe src="' + fname + '" width=1000 height=550 scrolling=no frameBorder=0></iframe>'
IPython.display.HTML(iframe)
Out[13]:
As discussed in previous sections, the number of international tourist arrivals per year and the number of Internet users (per 100 people) are two features which have been seen to influence the number of Starbucks stores in a country. They were determined to be two of the more important features by the ExtraTreesClassifier, and intuitively it also makes sense that countries with a large number of tourist arrivals and high Internet accessibility should have a higher number of Starbucks stores.
This relationship between the number of Starbucks stores, Internet users and international tourists is depicted using a bubble chart. The bubble chart represents four-dimensional data: the x-axis shows international tourist arrivals per year, the y-axis shows Internet users (per 100 people), the size of a bubble encodes the number of Starbucks stores in that country and the color of a bubble encodes the continent.
The bubble chart presented below is created using Plotly; hover over the bubbles to see information about the country, the number of Starbucks stores, etc. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as PNG, etc. are available. All the data required for the bubble chart is available as part of the combined World Bank Starbucks dataset.
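For reference, a bubble chart of this kind could be built with Plotly roughly as follows. This is a sketch only: the DataFrame column names are illustrative and not necessarily those used in the project.

```python
import plotly.graph_objs as go
from plotly.offline import iplot

def bubble_chart(df):
    # `df` is assumed to have one row per country with columns:
    # 'country', 'continent', 'tourist_arrivals', 'internet_users_per_100', 'num_stores'
    traces = []
    for continent, grp in df.groupby('continent'):
        traces.append(go.Scatter(
            x=grp['tourist_arrivals'],
            y=grp['internet_users_per_100'],
            mode='markers',
            name=continent,  # one color per continent
            text=grp['country'] + ': ' + grp['num_stores'].astype(str) + ' stores',
            marker=dict(size=grp['num_stores'], sizemode='area',
                        sizeref=2.0 * df['num_stores'].max() / (60.0 ** 2),
                        sizemin=4)))
    layout = go.Layout(
        xaxis=dict(title='International tourist arrivals per year'),
        yaxis=dict(title='Internet users (per 100 people)'))
    iplot(go.Figure(data=traces, layout=layout))
```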
In [9]:
import IPython

# embed the published Plotly bubble chart in an iframe
url = 'https://plot.ly/~aa1603/2.embed?link=false'
iframe = '<iframe width="1000" height="500" frameborder="0" scrolling="no" src="%s"></iframe>' % url
IPython.display.HTML(iframe)
Out[9]:
Some important observations from the above visualization are as follows:
Almost all the bubbles which are medium to large in size occur to the right of the 10 Million mark on the x-axis, which indicates that unless a country has more than 10 Million international tourist arrivals a year it tends to have a very small number of Starbucks stores, probably around 20 or so, with the only exceptions being Brazil, Argentina, Philippines and Indonesia.
The green colored bubbles, which correspond to Asia, are very visible (most of them are at least medium sized, with China being the largest). This seems to indicate that countries in Asia generally tend to have a higher number of Starbucks stores than countries in other continents. This is also indicated by the boxplot drawn in a previous section, where the mean number of Starbucks stores was the highest for Asia.
The largest bubble towards the top right side of the chart is of course the U.S., which has high availability of Internet access and a very large number of tourists.
France, the blue colored bubble in the top right corner, is relatively much smaller than the bubbles representing other countries with high availability of Internet access and a large influx of tourists. This looks like an aberration: with more than 80 Million tourists per year and more than 80 out of 100 people having access to the Internet, France should have a lot more Starbucks stores than just 124.
The following bar chart shows the increase in Starbucks stores by country from December 2013 to November 2016. Hover over individual bars to see the name of the country and the number of Starbucks stores added in that country between December 2013 and November 2016. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as PNG, etc. are available.
In [12]:
import IPython

# embed the published Plotly bar chart of store count increases by country
url = 'https://plot.ly/~aa1603/6.embed?link=false'
iframe = '<iframe width="1000" height="500" frameborder="0" scrolling="no" src="%s"></iframe>' % url
IPython.display.HTML(iframe)
Out[12]:
Some observations from the above visualization:
While the increase in absolute terms is useful, it is also interesting to find out which countries have the highest rate of increase, i.e. where Starbucks is expanding fastest. This list may not be the same as the list of countries with the greatest growth in absolute numbers. To determine this, the month-on-month percentage increase in the number of stores is computed for each country and cumulatively summed over the December 2013 to November 2016 period; countries are then ranked by this cumulative percentage increase.
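One reasonable way to compute this is sketched below, assuming a DataFrame `monthly` with one column per country of monthly store counts; this illustrates the idea and is not necessarily the exact computation used in the project.

```python
import pandas as pd

def fastest_growing(monthly, top_n=15):
    # month-on-month percentage increase for every country
    pct_increase = monthly.pct_change() * 100.0
    # cumulative percentage increase over the whole Dec 2013 .. Nov 2016 period
    cumulative = pct_increase.cumsum().iloc[-1]
    # countries with the highest cumulative percentage increase
    return cumulative.sort_values(ascending=False).head(top_n)
```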
The time series of the number of stores for the 15 countries with the highest cumulative percentage increase is represented below as Sparkline charts (more on Sparklines: https://en.wikipedia.org/wiki/Sparkline). Note that this list does not include the U.S., so even though the U.S. had the highest increase in absolute numbers, the fastest rate of increase is elsewhere. It is seen from this visualization (and confirmed from the actual numbers, not shown here) that the maximum cumulative percentage increase happens in India, where the number of stores increased from 29 in December 2013 to 85 in November 2016.
In [11]:
import IPython

# embed the published Plotly Sparkline charts for the 15 fastest growing countries
url = 'https://plot.ly/~aa1603/74.embed?link=false'
iframe = '<iframe width="800" height="1000" frameborder="0" scrolling="no" src="%s"></iframe>' % url
IPython.display.HTML(iframe)
Out[11]:
The following limitations and scope for future study are identified:
In [10]:
import IPython
url = 'https://plot.ly/~aa1603/76.embed?link=false'
iframe = '<iframe width=\"800\" height=\"500\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)
Out[10]: