In [17]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random
Authors:
RJ Nowling rnowling@gmail.com
Hadoop has positioned itself as the go-to solution for building big data warehousing and analysis solutions. Hadoop is a large ecosystem that includes tools such as MapReduce, Pig, Hive, and Mahout. Unfortunately, the Hadoop community has been moving faster than the creation of documentation and high-quality examples. As a result, it is difficult for new users to learn the ins and outs of the tools and, maybe more importantly, how the tools fit together.
BigPetStore (BPS) aims to be the canonical example for the Hadoop ecosystem built around a fictional chain of pet stores. BPS implements an analytics pipeline that showcases the capabilities and position of each tool in the Hadoop ecosystem. Transaction records are generated randomly. Pig is used to clean and reformat the data into a more consumable format. Hive is used to ... Finally, Mahout is used to perform analytics on the data to predict trends.
BPS has started to move beyond just serving as an example for aspiring Hadoop users. BPS has recently gained attention and has been accepted into Apache BigTop, an open-source Hadoop distribution, where it's being used as part of the BigTop testing framework. Hadoop developers have also started using BPS in their live examples.
The random generation of the fictional data lies at the core of BPS. By using random generation, users don't need to download huge data sets. Random generation provides control over how much data is generated, so that BPS can be used on systems as small as a single desktop up to large clusters.
Although the current generation approach is efficient, the generated data contains very few patterns and little complexity. To truly showcase the capabilities of the Hadoop tools and application of analytics, more complex data is needed. Since the current approach makes it hard to incorporate new variables and embed trends, I prototyped a new approach using a probabilistic mathematical model.
A BigPetStore transaction contains:
In [ ]:
My models are probabilistic. For each variable or property in the model, I use a probability density function (PDF) to describe the probability of each value that the variable can take on. (Technically, PDFs describe continuous-valued variables and probability mass functions (PMFs) describe discrete-valued variables. I use the term PDF throughout to refer to both.)
What is a PDF? A PDF assigns a probability (for discrete variables) or a probability density (for continuous variables) to each value a variable can take. For example, a fair six-sided die has a PMF that assigns probability 1/6 to each face. Two defining properties: the values are non-negative, and they sum (or integrate) to 1 over the domain.
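As a small illustration (a hypothetical toy distribution, not part of the BPS data model), the sketch below defines a discrete PMF over pet types, checks the two properties above, and samples from it using the same cumulative-sum idea as the roulette wheel sampler that appears later in this notebook.
In [ ]:
import random

# Hypothetical PMF over pet types, for illustration only.
pet_type_pmf = {"dog": 0.5, "cat": 0.35, "fish": 0.15}

# Property 1: every probability is non-negative.
assert all(p >= 0.0 for p in pet_type_pmf.values())
# Property 2: the probabilities sum to 1.
assert abs(sum(pet_type_pmf.values()) - 1.0) < 1e-9

def sample_pmf(pmf):
    # sample by walking the cumulative sums until the uniform draw is covered
    r = random.random()
    cumulative = 0.0
    for value, prob in pmf.items():
        cumulative += prob
        if r < cumulative:
            return value
    return value  # guard against floating-point round-off

print sample_pmf(pet_type_pmf)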
PDFs enable the data model to be modular, declarative, and composable. As a result, the approach can easily be expanded to account for additional variables.
Advantages of formal models using PDFs:
Let's start with a simple homogeneous Poisson process. A Poisson process is a continuous-time stochastic process that models the occurrence of events within a given time interval. A Poisson process is known as a pure-birth process. The term homogeneous refers to the fact that the rate of event occurrences is constant over the time domain. The waiting times between consecutive events are modeled by an exponential distribution.
The Poisson process is described by:
$$ P[(N(t + \Delta t) - N(t)) = k] = \frac{(\lambda \Delta t)^k}{k!}e^{-\lambda \Delta t}$$The above equation describes the probability of $k$ events occurring in the interval $(t, t + \Delta t]$, where $\lambda$ is the rate of event occurrences in events per unit time.
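For example, as a quick sanity check of the formula: with $\lambda = 10$ events/hour and $\Delta t = 1$ hour, the probability of observing no events in the interval is $e^{-10} \approx 4.5 \times 10^{-5}$, while the probability of observing exactly $k = 10$ events is $\frac{10^{10}}{10!}e^{-10} \approx 0.125$.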
I'll mention two approaches for generating events from the Poisson process.
The first approach is similar to the roulette wheel sampling method discussed later in this notebook. We can build the cumulative distribution function (CDF) by computing the probabilities for $k = 0, 1, 2, \dots$ events occurring in the time interval $(t, t + \Delta t]$. We then generate a random number from a uniform distribution and find the number of events by locating the range of probabilities that contains it. Since $k$ is unbounded, we only generate probabilities until we find a match.
There are two potential problems with this approach: the number of CDF terms that must be computed before a match is found is unbounded, so the approach can end up being computationally intensive; and it yields only the number of events in the interval, not their exact arrival times.
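A minimal sketch of this first approach follows (illustration only; the hypothetical sample_poisson_count function below is not part of the BPS generator).
In [ ]:
import math
import random

def sample_poisson_count(rate, delta_t):
    """Approach 1: sample the number of events in (t, t + delta_t] by
    walking the Poisson CDF until the uniform draw falls inside an interval."""
    lam = rate * delta_t
    r = random.random()
    k = 0
    cdf = 0.0
    while True:
        # P[N = k] = (lam^k / k!) e^(-lam)
        prob_k = (lam ** k) * math.exp(-lam) / math.factorial(k)
        cdf += prob_k
        if r < cdf:
            return k
        k += 1

print sample_poisson_count(10.0, 1.0)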
For the second approach, we can compute the waiting time $X_n$ of event $n$ by sampling from the exponential distribution:
$$ f(X_n) = \lambda e^{-\lambda X_n} $$The arrival time of event $n$ is the sum of all of the previous waiting times:
$$ t_n = \sum_{i=1}^n X_i $$The second approach overcomes both disadvantages of the first approach. It gives exact arrival times and is computationally efficient.
(I am unsure of whether the second approach will hold for non-homogeneous Poisson processes.)
In [626]:
class HomogenousPoissonProcess(object):
    def __init__(self, rate=None):
        self._rate = rate
        self._last_arrival = 0.0
        self._last_time_period = 0.0
    def sample(self, time_period):
        """Return the arrival times that fall within the next time_period units."""
        arrival_times = []
        # an arrival generated on a previous call may fall within this period
        if self._last_arrival > self._last_time_period:
            arrival_times.append(self._last_arrival)
        self._last_time_period += time_period
        while True:
            # waiting times between events are exponentially distributed
            waiting_time = random.expovariate(self._rate)
            self._last_arrival += waiting_time
            if self._last_arrival < self._last_time_period:
                arrival_times.append(self._last_arrival)
            else:
                break
        return arrival_times
In [627]:
rate = 10.0 # 10 events per hour
process = HomogenousPoissonProcess(rate=rate)
arrival_times = process.sample(24.0)
print "%s events" % len(arrival_times)
print "Expected: %s events/hour" % rate
print "Observed: %s events/hour" % (len(arrival_times) / 24.0)
plt.plot(xrange(1, len(arrival_times) + 1), arrival_times)
plt.xlabel("Events")
plt.ylabel("Arrival Times")
plt.title("Homogenous Poisson Process")
Out[627]:
A Markov model is defined by a set of states and transition probabilities $p_{ij}$ from each state $i$ to each state $j$. When used as a stochastic process, Markov models produce a sequence of states called a Markov chain. On every step, the next state in the chain is generated by sampling from the PDF defined by the outbound transition probabilities for the current state.
For example, assume we have a Markov Model with three states: A, B, and C. The transition probabilities are given as follows:
A -> B : 0.5
A -> C : 0.5
A -> A : 0.0
B -> A : 0.3
B -> C : 0.7
B -> B : 0.0
C -> A : 0.7
C -> B : 0.2
C -> C : 0.1
If the current state is A, the Markov chain will choose B or C as the next state with equal probability. If the chain is in state B, the next state will be A with a 30% probability and C with a 70% probability. State C introduces the possibility of choosing itself as the next state.
It is important to note that the transition probabilities do not need to be symmetric. Specifically, $p_{ij}$ does not need to equal $p_{ji}$. Self-transitions are also allowed, meaning that $p_{ii}$ can be non-zero. The sum of the outbound transitions for each state $i$ must equal unity: $\sum_j p_{ij} = 1$.
Hidden Markov Models (HMMs) extend and generalize the concept of Markov models. In an HMM, the state itself is not the output. Rather, each state defines a PDF of possible outputs. After each state transition, the output is chosen by sampling from the current state's PDF. A Markov model is a special case of an HMM where each state's output PDF assigns probability 1 to a single output.
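As a small illustrative sketch (the transitions dictionary and helper functions below are hypothetical, built from the A/B/C example above), a Markov chain can be sampled like this:
In [ ]:
import random

# transition probabilities from the A/B/C example above
transitions = {
    "A": [("B", 0.5), ("C", 0.5)],
    "B": [("A", 0.3), ("C", 0.7)],
    "C": [("A", 0.7), ("B", 0.2), ("C", 0.1)],
}

def next_state(current):
    # sample the next state from the current state's outbound transition PDF
    r = random.random()
    cumulative = 0.0
    for state, prob in transitions[current]:
        cumulative += prob
        if r < cumulative:
            return state
    return state  # guard against floating-point round-off

def markov_chain(start, length):
    chain = [start]
    for i in xrange(length - 1):
        chain.append(next_state(chain[-1]))
    return chain

print markov_chain("A", 10)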
As the store location PDF is one-dimensional, a simple roulette wheel approach is efficient for sampling. The roulette wheel builds a table by computing the cumulative distribution function (CDF) and assigning each discrete value in the domain an interval. Values are sampled by generating a random number from a uniform distribution and iterating through the table until the interval containing the randomly-generated number is found.
In [613]:
class RouletteWheelSampler(object):
    def __init__(self, domain, pdf):
        self._wheel = []
        end = 0.0
        for x in domain:
            end += pdf.probability(x)
            self._wheel.append((end, x))
        # sanity check: the final cumulative value should be close to 1.0
        print end
    def sample(self):
        r = random.random()
        for end, x in self._wheel:
            if r < end:
                return x
        # we should never get here since probabilities
        # should sum to 1
        raise Exception("Could not pick a value!")
Monte Carlo methods are stochastic methods for sampling from complex probability density functions (PDFs). Since complex PDFs are often very difficult to sample directly, we instead generate candidate samples from a simpler distribution and accept or reject each candidate based on the target PDF, so that the accepted samples follow the target distribution.
In [602]:
class MonteCarloSampler(object):
def __init__(self, generator, pdf):
self.generator = generator
self.pdf = pdf
self.num_iter_accept = []
def sample(self):
num_iter = 0
while True:
num_iter += 1
obj = self.generator.generate()
prob = self.pdf.probability(obj)
rand = random.random()
if rand < prob:
self.num_iter_accept.append(num_iter)
return obj
def sample_n(self, n):
samples = []
for i in xrange(n):
samples.append(self.sample())
return samples
def acceptance_rate(self):
if len(self.num_iter_accept) == 0:
return 0.0
return 1.0 / np.average(self.num_iter_accept)
In [603]:
class GaussianPDF(object):
def __init__(self, mean, std_dev):
self.mean = mean
self.std_dev = std_dev
def probability(self, x):
exponent = -1.0 * np.power(x - self.mean, 2.0) / (2.0 * self.std_dev * self.std_dev)
return np.exp(exponent) / (self.std_dev * np.sqrt(2.0 * np.pi))
class UniformSampler(object):
def __init__(self, a, b):
self.a = a
self.b = b
def generate(self):
return random.uniform(self.a, self.b)
In [604]:
mean = 0.0
std = 1.0
narrow_a = -10.0
narrow_b = 10.0
wide_a = -100.0
wide_b = 100.0
In [605]:
gaussian_pdf = GaussianPDF(mean, std)
narrow_uniform_sampler = UniformSampler(narrow_a, narrow_b)
wide_uniform_sampler = UniformSampler(wide_a, wide_b)
narrow_mc = MonteCarloSampler(narrow_uniform_sampler, gaussian_pdf)
wide_mc = MonteCarloSampler(wide_uniform_sampler, gaussian_pdf)
In [606]:
n_samples = 10000
gauss_samples = [random.gauss(mean, std) for i in xrange(n_samples)]
narrow_uniform_samples = [random.uniform(narrow_a, narrow_b) for i in xrange(n_samples)]
wide_uniform_samples = [random.uniform(wide_a, wide_b) for i in xrange(n_samples)]
In [607]:
bins = np.linspace(wide_a, wide_b, num=1000)
narrow_hist, _ = np.histogram(narrow_uniform_samples, bins=bins, density=True)
wide_hist, _ = np.histogram(wide_uniform_samples, bins=bins, density=True)
gauss_hist, _ = np.histogram(gauss_samples, bins=bins, density=True)
In [608]:
plt.clf()
plt.hold(True)
plt.plot(bins[:-1], gauss_hist, 'r-', label="Gaussian PDF")
plt.plot(bins[:-1], narrow_hist, 'g-', label="Narrow MC")
plt.plot(bins[:-1], wide_hist, 'b-', label="Wide MC")
plt.legend(loc="upper right")
Out[608]:
The figure shows the differences between the three distributions. It should be apparent that the narrow-range uniform distribution overlaps more with the Gaussian distribution than the wide-range uniform distribution does. This will be important later.
In [609]:
n_samples = 10000
narrow_samples = narrow_mc.sample_n(n_samples)
wide_samples = wide_mc.sample_n(n_samples)
gauss_samples = [random.gauss(mean, std) for i in xrange(n_samples)]
In [610]:
bins = np.linspace(narrow_a, narrow_b, num=100)
narrow_hist, _ = np.histogram(narrow_samples, bins=bins, density=True)
wide_hist, _ = np.histogram(wide_samples, bins=bins, density=True)
gauss_hist, _ = np.histogram(gauss_samples, bins=bins, density=True)
In [611]:
plt.clf()
plt.hold(True)
plt.plot(bins[:-1], gauss_hist, 'r-', label="Gaussian PDF")
plt.plot(bins[:-1], narrow_hist, 'g-', label="Narrow MC")
plt.plot(bins[:-1], wide_hist, 'b-', label="Wide MC")
plt.legend(loc="upper right")
Out[611]:
I ran the Monte Carlo sampler using the wide and narrow uniform distributions to generate candidate samples. Even though neither of these distributions matches the target distribution (Gaussian), the Monte Carlo process ensures that we sample the target distribution. This is the power of a Monte Carlo method: candidates can be generated from any distribution, as long as it covers the correct domain (here, $(-\infty, \infty)$). (In our case, I'm assuming that samples outside of the range $[-10, 10]$ have near-zero probability, hence my choice for the ranges of the uniform distributions.) Regardless of the distribution used for generating the candidates, the MC method ensures we sample from the correct distribution.
When the target distribution is very complex, sampling it directly can be impossible. MC methods allow us to overcome this limitation.
In [612]:
print narrow_mc.acceptance_rate(), wide_mc.acceptance_rate()
This is not to say that the choice of sampling distribution doesn't have any impact -- in fact, it is quite important. The choice of sampling distribution will affect the efficiency of the Monte Carlo method, in particular the acceptance rate. The narrow uniform sampler was more likely to pick values in the high-probability region of the Gaussian distribution, while the wide uniform sampler was more likely to pick values outside of the high-probability region. As a result, the MC method accepted 1 out of every 20 samples generated by the narrow sampler but only 1 out of every 200 samples generated by the wide sampler.
As an input to the model for BigPetStore, I'm using three datasets: population counts by zipcode, median household incomes by zipcode, and latitude and longitude coordinates by zipcode. I give a brief overview of the datasets here.
I obtained median household income data by zipcode from the 2012 American Community Survey. I downloaded the data from the U.S. Census Bureau web site.
In [340]:
def read_income_data(flname):
zipcodes = []
household_incomes = []
fl = open(flname)
#skip headers
next(fl)
next(fl)
for ln in fl:
cols = ln.strip().split(",")
# zipcodes in column 3 in the format "ZCTA5 XXXXX"
zipcode = cols[2].split()[1]
try:
median_household_income = int(cols[5])
zipcodes.append(zipcode)
household_incomes.append(median_household_income)
except:
# some records do not contain incomes
pass
fl.close()
return pd.DataFrame({"zipcode" : zipcodes, "median_household_income" : household_incomes}, index=zipcodes)
In [341]:
income_table = read_income_data("../resources/ACS_12_5YR_S1903/ACS_12_5YR_S1903_with_ann.csv")
In [342]:
income_table.shape
Out[342]:
In [343]:
income_table.iloc[:10, :]
Out[343]:
In [344]:
plt.clf()
plt.hist(income_table.iloc[:, 0], bins=100)
plt.xlabel("Median Household Income for Zipcode")
plt.ylabel("Number of Zipcodes")
Out[344]:
In [345]:
def load_population_data(flname):
fl = open(flname)
pop_data = dict()
# skip headers
next(fl)
for ln in fl:
if ln.strip() == "":
continue
zipcode, pop = ln.split(",")
pop = int(pop)
# remove duplicates. keep largest pop values
if zipcode in pop_data:
pop_data[zipcode] = max(pop_data[zipcode], pop)
else:
pop_data[zipcode] = pop
fl.close()
pops = []
zipcodes = []
for z, p in pop_data.iteritems():
zipcodes.append(z)
pops.append(p)
return pd.DataFrame(data={"zipcode" : zipcodes, "population_count" : pops}, index=zipcodes)
In [346]:
pop_table = load_population_data("../resources/population_data.csv")
In [347]:
pop_table.shape
Out[347]:
In [348]:
plt.clf()
plt.hist(pop_table.iloc[:, 0], bins=100)
plt.ylabel("Number of Zipcodes")
plt.xlabel("Population Count of Zipcode")
Out[348]:
In [539]:
def load_zip_long_lat(flname):
fl = open(flname)
# skip header
next(fl)
zipcodes = []
lats = []
longs = []
for ln in fl:
cols = ln.split(", ")
zipcode = cols[0][1:-1] # remove double-quote marks
lat = float(cols[2][1:-1])
long = float(cols[3][1:-1])
zipcodes.append(zipcode)
lats.append(lat)
longs.append(long)
fl.close()
return pd.DataFrame(data={"zipcode" : zipcodes, "latitude" : lats, "longitude" : longs}, index=zipcodes)
In [540]:
zipcode_lats_longs = load_zip_long_lat("../resources/zips.csv")
In [541]:
zipcode_lats_longs.shape
Out[541]:
In [542]:
zipcode_lats_longs.iloc[:10, :]
Out[542]:
Since there are zipcodes missing from each of the three datasets, I needed to reduce the datasets down to their common zipcodes.
In [543]:
income_zipcodes = set(income_table["zipcode"])
pop_zipcodes = set(pop_table["zipcode"])
lat_long_zipcodes = set(zipcode_lats_longs["zipcode"])
common_zipcodes = income_zipcodes.intersection(pop_zipcodes).intersection(lat_long_zipcodes)
print len(common_zipcodes)
In [544]:
income_table = income_table.loc[common_zipcodes, :]
pop_table = pop_table.loc[common_zipcodes, :]
lat_long_table = zipcode_lats_longs.loc[common_zipcodes, :]
print income_table.shape, pop_table.shape, lat_long_table.shape
In [546]:
plt.scatter(lat_long_table["longitude"], lat_long_table["latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("All Common Zipcodes")
Out[546]:
The scatter plot of all of the zipcodes by latitude and longitude shows a few familiar shapes: the continental U.S., Alaska, and Hawaii. (The Puerto Rican zipcodes were removed.) I didn't know Alaska was so large. :)
In [550]:
populations = list(reversed(sorted(pop_table["population_count"])))
cutoff_idx = int(len(populations) * 0.01)
cutoff = populations[cutoff_idx]
cutoff_zipcodes = [z for z in common_zipcodes if pop_table.loc[z, "population_count"] >= cutoff]
plt.scatter(lat_long_table.loc[cutoff_zipcodes, "longitude"], lat_long_table.loc[cutoff_zipcodes, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("Top 1% Most Populated Zipcodes")
Out[550]:
A scatter plot of the zipcodes in the top 1% of the population counts shows large populations on the coasts and major metropolitan areas like Chicago.
In [551]:
incomes = list(reversed(sorted(income_table["median_household_income"])))
cutoff_idx = int(len(incomes) * 0.01)
cutoff = incomes[cutoff_idx]
cutoff_zipcodes = [z for z in common_zipcodes if income_table.loc[z, "median_household_income"] >= cutoff]
plt.scatter(lat_long_table.loc[cutoff_zipcodes, "longitude"], lat_long_table.loc[cutoff_zipcodes, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("Top 1% Median Household Income Zipcodes")
Out[551]:
A scatter plot of the zipcodes in the top 1% of the median household incomes shows concentrations of wealth in the northeast, parts of California, South Florida, and a few major metropolitan areas.
In [574]:
def read_names(flname):
fl = open(flname)
names = []
first_names = []
last_names = []
weights = []
for ln in fl:
cols = ln.split(",")
name = cols[0]
title = cols[2] == "1"
if title:
continue
last_name = cols[3] == "1"
first_name = cols[4] == "1"
weight = float(cols[5])
names.append(name)
first_names.append(first_name)
last_names.append(last_name)
weights.append(weight)
fl.close()
return pd.DataFrame(data={"name" : names, "first_name" : first_names, "last_name" : last_names, "weight" : weights}, index=names)
In [575]:
customer_names = read_names("../resources/namedb/data/data.dat")
In [576]:
customer_names.shape
Out[576]:
In [577]:
customer_names.iloc[:10, :]
Out[577]:
Stores really only have two fields in our data model: a unique ID and a location (zipcode). The complexity in modeling the stores comes from how we choose to locate them. I assume that stores are more likely to be placed in areas that have larger populations and higher incomes. The store locations will be generated according to a probability density function giving the probability of a store occurring in a given zipcode.
Note: the current approach for choosing the store locations assumes the stores are independent. It may be interesting to incorporate patterns such that the location of new stores is influenced by previously generated stores. For example, maybe we want to ensure that stores are located a minimum distance from each other while also within a maximum distance of each other.
I directly used zipcode population counts to determine the probability of a store being placed in a given zipcode:
$$p_{\textrm{population_weighted}}(\textrm{zipcode}) = \frac{\textrm{population_counts}(\textrm{zipcode})}{\sum_i\textrm{population_counts}(i)} $$
In [392]:
class PopulationWeightedLocationPDF(object):
def __init__(self, table):
self.pop_density = table["population_count"] / table["population_count"].sum()
def probability(self, zipcode):
return self.pop_density[zipcode]
I want the most prosperous areas to be 100x more likely to have a store than the least prosperous areas. To define the income weights, I needed a function that takes a user-defined parameter $s$ for the scaling between the minimum and maximum incomes and interpolates for incomes in between. I decided to use an exponential function.
$$w_{\textrm{income}}(\textrm{income}) = \frac{1}{Z}\exp( k (\textrm{income} - \textrm{min_income}))$$To achieve the appropriate scaling factor $s$, $k$ is set as follows:
$$ k = \frac{\log(s)}{\textrm{max_income} - \textrm{min_income}}$$$$\textrm{min_income} = \min_i{\textrm{median_household_incomes}(i)}$$$$\textrm{max_income} = \max_i{\textrm{median_household_incomes}(i)}$$
In [537]:
class IncomeScalingWeights(object):
def __init__(self, income_table=None, scaling_factor=100.0):
self.max_income = income_table["median_household_income"].max()
self.min_income = income_table["median_household_income"].min()
self.scaling_factor = scaling_factor
self.k = np.log(self.scaling_factor) / (self.max_income - self.min_income)
def weight(self, income):
return np.exp(self.k * (income - self.min_income)) #/ self.normalization_factor
In [451]:
income_weights = IncomeScalingWeights(income_table)
In [460]:
print income_table["median_household_income"].min(), income_table["median_household_income"].max()
incomes = np.linspace(income_table["median_household_income"].min(), income_table["median_household_income"].max(), num=100)
weights = [income_weights.weight(income) for income in incomes]
In [461]:
plt.plot(incomes, weights)
plt.xlabel("Income")
plt.ylabel("Weight")
Out[461]:
Since the income PDF takes income as an input, not location, I need to use function composition to map zipcodes to median household incomes and median household income to the weights.
Mathematically, the function is written:
$$p_{\textrm{income_weighted}}(\textrm{zipcode}) = \frac{1}{Z} w_{\textrm{income}}(\textrm{median_household_incomes}(\textrm{zipcode}))$$where $Z$ is the normalization factor.
$Z$ is found by integrating $w_{\textrm{income}}$ over all of the zipcodes:
$$ Z = \int w_{\textrm{income}}(\textrm{median_household_incomes}(z)) \, dz$$
In [455]:
class IncomeWeightedLocationPDF(object):
def __init__(self, zipcodes=None, income_table=None, scaling_factor=100.0):
self.income_table = income_table
self.max_income = income_table["median_household_income"].max()
self.min_income = income_table["median_household_income"].min()
self.scaling_factor = scaling_factor
self.k = np.log(self.scaling_factor) / (self.max_income - self.min_income)
zipcode_weights = [self._income_weight(self._zipcode_income(z)) for z in zipcodes]
self.normalization_factor = sum(zipcode_weights)
def _income_weight(self, income):
return np.exp(self.k * (income - self.min_income))
def _zipcode_income(self, zipcode):
return self.income_table["median_household_income"].loc[zipcode]
def probability(self, zipcode):
return self._income_weight(self._zipcode_income(zipcode)) / self.normalization_factor
The composite store location PDF is written as follows:
$$p_{\textrm{store_location}}(\textrm{zipcode}) = \frac{1}{Z} p_{\textrm{population_weighted}}(\textrm{zipcode}) \, p_{\textrm{income_weighted}}(\textrm{zipcode})$$where $Z$ is the normalization factor.
Although $p_{\textrm{population_weighted}}$ and $p_{\textrm{income_weighted}}$ are individually proper PDFs, each with total probability equal to 1, their product is not. As a result, a new normalization factor $Z$ needs to be computed.
$Z$ is found by integrating over all of the zipcodes:
$$ Z = \int p_{\textrm{population_weighted}}(z) \, p_{\textrm{income_weighted}}(z) \, dz$$The implementation is as follows:
In [442]:
class StoreLocationPDF(object):
def __init__(self, zipcodes=None, income_table=None, pop_table=None, income_scaling=100.0):
self.population_pdf = PopulationWeightedLocationPDF(pop_table)
self.income_pdf = IncomeWeightedLocationPDF(zipcodes, income_table, scaling_factor=income_scaling)
weights = [self._weight(z) for z in zipcodes]
self.normalization_factor = sum(weights)
def _weight(self, zipcode):
return self.population_pdf.probability(zipcode) * self.income_pdf.probability(zipcode)
def probability(self, zipcode):
return self._weight(zipcode) / self.normalization_factor
Stores have two fields: the location and a unique identifier (for cases where there are multiple stores in the same zipcode). Using a builder allows us to add new fields in the future. The builder encapsulates the PDF and sampling method. Stores are represented as dictionaries.
In [438]:
class StoreBuilder(object):
def __init__(self, zipcodes=None, income_table=None, pop_table=None, income_scaling=100.0):
self._ident = 1
location_pdf = StoreLocationPDF(zipcodes, income_table, pop_table, income_scaling=income_scaling)
self._location_sampler = RouletteWheelSampler(zipcodes, location_pdf)
def build(self):
store_ident = self._ident
self._ident += 1
store_zipcode = self._location_sampler.sample()
return {"store_id" : store_ident, "zipcode" : store_zipcode}
In [527]:
store_builder = StoreBuilder(zipcodes=common_zipcodes, income_table=income_table, pop_table=pop_table, income_scaling=100.0)
In [529]:
stores_10 = [store_builder.build() for i in xrange(10)]
stores_100 = [store_builder.build() for i in xrange(100)]
stores_1000 = [store_builder.build() for i in xrange(1000)]
In [530]:
print stores_10[:10]
In [563]:
zipcodes_10 = [store["zipcode"] for store in stores_10]
zipcodes_100 = [store["zipcode"] for store in stores_100]
zipcodes_1000 = [store["zipcode"] for store in stores_1000]
In [569]:
plt.scatter(lat_long_table.loc[zipcodes_10, "longitude"], lat_long_table.loc[zipcodes_10, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("10 Stores")
plt.xlim([-200, -60])
plt.ylim([10, 80])
Out[569]:
In [570]:
plt.scatter(lat_long_table.loc[zipcodes_100, "longitude"], lat_long_table.loc[zipcodes_100, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("100 Stores")
plt.xlim([-200, -60])
plt.ylim([10, 80])
Out[570]:
In [571]:
plt.scatter(lat_long_table.loc[zipcodes_1000, "longitude"], lat_long_table.loc[zipcodes_1000, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("1000 Stores")
plt.xlim([-200, -60])
plt.ylim([10, 80])
Out[571]:
Customer data model:
The customer location will be modeled based on population density and the locations of the stores. More customers will be in more populated areas. Customers will also be much more likely to frequent stores within 5-10 miles of their homes.
(Note: to be realistic, we should limit the maximum number of customers per zipcode to the population of that zipcode. I'm ignoring this for now.)
First, we'll look at the distribution of the minimum distances to stores based on the number of stores generated.
In [523]:
import math
class DistanceNearestStore(object):
def __init__(self, stores=None, lat_long_table=None):
self._lat_long_table = lat_long_table
self._stores = list()
for store in stores:
zipcode = store["zipcode"]
latA = self._lat_long_table.loc[zipcode, "latitude"]
longA = self._lat_long_table.loc[zipcode, "longitude"]
self._stores.append((store, latA, longA))
def _dist(self, lat_A, long_A, lat_B, long_B):
"""
Computes distance between latitude-longitude
pairs in miles.
"""
dist = (math.sin(math.radians(lat_A)) *
math.sin(math.radians(lat_B)) +
math.cos(math.radians(lat_A)) *
math.cos(math.radians(lat_B)) *
math.cos(math.radians(long_A - long_B)))
dist = (math.degrees(math.acos(dist))) * 69.09
return dist
def min_dist(self, zipcode):
latA = self._lat_long_table.loc[zipcode, "latitude"]
longA = self._lat_long_table.loc[zipcode, "longitude"]
distances = []
for store, latB, longB in self._stores:
if store["zipcode"] == zipcode:
dist = 0.0
else:
dist = self._dist(latA, longA, latB, longB)
distances.append((dist, store))
return min(distances)
In [552]:
distance_nearest_store_10 = DistanceNearestStore(stores_10, lat_long_table)
distance_nearest_store_100 = DistanceNearestStore(stores_100, lat_long_table)
distance_nearest_store_1000 = DistanceNearestStore(stores_1000, lat_long_table)
In [553]:
zipcode_dist_10 = [distance_nearest_store_10.min_dist(z) for z in common_zipcodes]
zipcode_dist_100 = [distance_nearest_store_100.min_dist(z) for z in common_zipcodes]
zipcode_dist_1000 = [distance_nearest_store_1000.min_dist(z) for z in common_zipcodes]
In [555]:
distances_only_10 = [dist for dist, store in zipcode_dist_10]  # optionally filter: if dist < 200.0
plt.clf()
plt.hist(distances_only_10, bins=100)
plt.title("10 stores")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")
Out[555]:
In [556]:
distances_only_100 = [dist for dist, store in zipcode_dist_100]  # optionally filter: if dist < 200.0
plt.clf()
plt.hist(distances_only_100, bins=100)
plt.title("100 stores")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")
Out[556]:
In [558]:
distances_only_1000 = [dist for dist, store in zipcode_dist_1000]  # optionally filter: if dist < 180.0
plt.clf()
plt.hist(distances_only_1000, bins=100)
plt.title("1000 stores")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")
Out[558]:
The number of stores generated has an (obvious) impact on the minimum distances. For 10 stores, a large number of zipcodes can be up to 1,000 miles away. For 100 stores, most zipcodes are within 450 miles, and when we generate 1,000 stores, most zipcodes are within 100-150 miles of a store.
Regardless of the number of stores generated, the maximum distances are unrealistic. (A customer would not be likely to go to a store over 100 miles away on a regular basis.) It is therefore reasonable to limit the customer locations based on the distance to the nearest store. Based on personal experience, 5 - 10 miles is the maximum distance for regular store visits.
We will use an exponential distribution to model the probability of a customer living in a given zipcode based on the distance to the nearest store. An exponential distribution has a single parameter, its mean $\beta$:
$$f(x; \beta) = \frac{1}{\beta} \exp(-\frac{x}{\beta}) $$
In [559]:
class DistanceWeights(object):
def __init__(self, avg_distance=5.0):
self.lambd = 1.0 / avg_distance
def weight(self, distance):
return self.lambd * np.exp(-self.lambd * distance)
In [560]:
distance_weights = DistanceWeights()
In [561]:
distances = np.linspace(0.0, 100.0, num=1000)
weights = [distance_weights.weight(distance) for distance in distances]
In [562]:
plt.plot(distances, weights)
plt.xlabel("Distance (Miles)")
plt.ylabel("Weight")
Out[562]:
The exponential distribution actually matches the real distance distributions quite well given enough stores. By choosing an average of 5 miles, the exponential distribution ensures that most customers are within 10-15 miles of a store.
In [584]:
class LocationDistanceWeightedPDF(object):
def __init__(self, avg_distance=5.0, stores=None, lat_long_table=None, zipcodes=None):
self._dist_calculator = DistanceNearestStore(stores=stores, lat_long_table=lat_long_table)
self._dist_weights = DistanceWeights(avg_distance=avg_distance)
weight_sum = 0.0
for z in zipcodes:
dist, _ = self._dist_calculator.min_dist(z)
weight = self._dist_weights.weight(dist)
weight_sum += weight
self._normalization_factor = weight_sum
def probability(self, zipcode):
dist, _ = self._dist_calculator.min_dist(zipcode)
weight = self._dist_weights.weight(dist)
prob = weight / self._normalization_factor
return prob
In [585]:
distance_pdf = LocationDistanceWeightedPDF(stores=stores_100, lat_long_table=lat_long_table, zipcodes=common_zipcodes)
In [586]:
customer_location_sampler = RouletteWheelSampler(common_zipcodes, distance_pdf)
In [589]:
customer_zipcodes = [customer_location_sampler.sample() for i in xrange(1000)]
In [593]:
customer_zipcodes[:10]
Out[593]:
In [598]:
customer_distances = [distance_nearest_store_100.min_dist(z)[0] for z in customer_zipcodes]
In [599]:
plt.clf()
plt.hist(customer_distances, bins=100)
plt.title("100 stores, 1000 customers")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")
Out[599]:
In [614]:
plt.clf()
plt.hold(True)
plt.scatter(lat_long_table.loc[zipcodes_100, "longitude"], lat_long_table.loc[zipcodes_100, "latitude"], c="b")
plt.scatter(lat_long_table.loc[customer_zipcodes, "longitude"], lat_long_table.loc[customer_zipcodes, "latitude"], c="r")
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("100 Stores, 1000 customers")
plt.xlim([-200, -60])
plt.ylim([10, 80])
Out[614]:
A transaction consists of:
There has to be a saturation point for transactions. If we have too few transactions, we won't capture regular customers or other patterns. If we have too many transactions, we end up with too many duplications (e.g., customers purchasing more food than they should). It's possible that we could use a Monte Carlo method (since previous state will matter here) and graph the acceptance rate as a function of the number of transactions generated.
Several decisions have to be made when building the model for the transactions. This results in three approaches:
Sample the PDF in a completely unordered way. With this approach, samples would not be generated in any guaranteed order (e.g., a transaction can be generated for days Tuesday, Monday, Thursday). We could specify the total number of transactions to generate and those transactions would be distributed according to the PDF. Introducing memory or controlling or preventing duplicate transactions would be difficult.
For each store, integrate along the time dimension using a Poisson process. For each store, a non-homogeneous Poisson process could be used to generate the number of transactions between $t$ and $t + \Delta t$ (a sketch of simulating such a process appears below). The rate would be parameterized according to the particular date and time to include effects such as peak shopping times, weather, etc. The Poisson process would then be integrated over time for each store to determine the number of transactions in each timestep. We could then generate the other variables for the transactions by sampling from a PDF. Since the approach would generate the values in order, we could incorporate memory effects more easily. There are two limitations to this approach. First, the number of transactions specified would be treated as a constraint on the customers. In reality, the number of transactions should be a function of the number of customers and their particular buying habits. Secondly, it would be hard to model each customer's history correctly.
Model each customer as a stochastic process and integrate over time. Statistics such as the amount and distribution of transactions will arise naturally from the number of customers and their buying habits. This will make it easier to ensure that each customer's expected transaction is carried out and would avoid needing to choose parameters such as the number of transactions or their rates for each store.
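As a rough sketch of how the time integration in the second and third approaches might be implemented (an assumption on my part, not a settled design), a non-homogeneous Poisson process can be simulated by thinning: generate candidate arrivals from a homogeneous process running at the maximum rate $\lambda_{\max}$ and keep each candidate at time $t$ with probability $\lambda(t) / \lambda_{\max}$. The rate function below is a made-up example of a daily peak, and the function names are hypothetical.
In [ ]:
import math
import random

def sample_nonhomogeneous_poisson(rate_fn, max_rate, t_start, t_end):
    """Thinning: simulate a homogeneous Poisson process at max_rate and
    accept each candidate arrival at time t with probability
    rate_fn(t) / max_rate. Requires rate_fn(t) <= max_rate everywhere."""
    arrivals = []
    t = t_start
    while True:
        t += random.expovariate(max_rate)
        if t > t_end:
            break
        if random.random() < rate_fn(t) / max_rate:
            arrivals.append(t)
    return arrivals

# hypothetical daily rate (transactions/hour) that peaks in the early afternoon
def daily_rate(t):
    hour = t % 24.0
    return 2.0 + 8.0 * math.exp(-((hour - 13.0) ** 2) / 8.0)

transactions = sample_nonhomogeneous_poisson(daily_rate, 10.0, 0.0, 24.0 * 7)
print "%s transactions in one week" % len(transactions)
This also suggests an answer to the earlier question about non-homogeneous processes: the exponential waiting-time approach still applies for generating candidate arrivals, with thinning correcting for the time-varying rate.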
Factors:
In [ ]:
We assume that stores in more populated areas are going to be visited more often. We can use a scaling approach like the one used for incomes in the store location model.
We could also include an enrichment factor for each store. Some stores will be more favorable than others. The enrichment factors could be determined by an exponential or beta distribution. Enrichment could represent or be related to local competition.
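For illustration only (the beta shape parameters below are arbitrary guesses, and sample_enrichment_factors is a hypothetical helper), per-store enrichment factors could be sampled like this:
In [ ]:
import random

def sample_enrichment_factors(stores, alpha=2.0, beta=5.0):
    # one multiplicative weight per store, drawn from a beta distribution;
    # values fall in (0, 1), so some stores end up more favorable than others
    return dict((store["store_id"], random.betavariate(alpha, beta))
                for store in stores)

enrichment = sample_enrichment_factors(stores_100)
print enrichment.items()[:5]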
The customer PDF will be based on:
Additional factors:
Factors to consider: