In [17]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Mathematical Model & Methods for Data Generation in BigPetStore v1.0

Authors:

RJ Nowling rnowling@gmail.com

Introduction and Background

Hadoop has positioned itself as the go-to solution for building big data warehousing and analysis solutions. Hadoop is a large ecosystem that includes tools such as MapReduce, Pig, Hive, and Mahout. Unfortunately, the Hadoop community has been moving faster than the creation of documentation and high-quality examples. As a result, it is difficult for new users to learn the ins and outs of the tools and, perhaps more importantly, how the tools fit together.

BigPetStore (BPS) aims to be the canonical example for the Hadoop ecosystem built around a fictional chain of pet stores. BPS implements an analytics pipeline that showcases the capabilities and position of each tool in the Hadoop ecosystem. Transaction records are generated randomly. Pig is used to clean and reformat the data into a more consumable format. Hive is used to ... Finally, Mahout is used to perform analytics on the data to predict trends.

BPS has started to move beyond just serving as an example for aspiring Hadoop users. BPS has recently gained attention and has been accepted into Apache BigTop, an open-source Hadoop distribution, where it's being used as part of the BigTop testing framework. Hadoop developers have also started using BPS in their live examples.

The random generation of the fictional data lies at the core of BPS. By using random generation, users don't need to download huge data sets. Random generation provides control over how much data is generated, so that BPS can be used on systems as small as a single desktop up to large clusters.

Although the current generation approach is efficient, the generated data contains very few patterns and little complexity. To truly showcase the capabilities of the Hadoop tools and application of analytics, more complex data is needed. Since the current approach makes it hard to incorporate new variables and embed trends, I prototyped a new approach using a probabilistic mathematical model.

Relational Description of Data Model

A BigPetStore transaction contains:

  • First Name
  • Last Name
  • Date / Time
  • Product
  • SalesTotal
  • CustomerLocation (zipcode)


Mathematical Modeling using Probability Density Functions (PDFs) as a Framework for Describing Data

My models are probabilistic. For each variable or property in the model, I use a probability density function (PDF) to describe the probability of each value that the variable can take on. (Technically, PDFs describe continuous random variables and probability mass functions (PMFs) describe discrete random variables. I use the term PDF throughout to refer to both.)

What is a PDF? Examples. Properties.

PDFs enable the data model to be modular, declarative, and composable. As a result, the approach can easily be expanded to account for additional variables.

Advantages of formal models using PDFs:

  • Embed as much or as little detail as we want
  • Hypothesis-driven
  • Transparent
  • Robust (can reason about them, prove properties, exploit properties)
  • Well understood and widely researched
  • Simulation provides empirical approach
  • Declarative
  • Composable
  • Modular

Review of Probability Density Functions (PDFs) and Their Use in Defining Models

Poisson Process

Let's start with a simple homogeneous Poisson process. A Poisson process is a continuous-time stochastic process that models the occurrence of events within a given time interval. A Poisson process is known as a pure-birth process. The term homogeneous refers to the fact that the rate of event occurrences is constant over the time domain. The waiting times between events are assumed to follow an exponential distribution.

The Poisson process is described by:

$$ P[(N(t + \Delta t) - N(t)) = k] = \frac{(\lambda \Delta t)^k}{k!}e^{-\lambda \Delta t}$$

The above equation describes the probability of $k$ events occurring in the interval $(t, t + \Delta t]$ given $\lambda$, the rate of event occurrences in units of events per $\Delta t$.

I'll mention two approaches for generating events from the Poisson process.

The first approach is similar to the Roulette Wheel sampling method discussed below. We can build the cumulative distribution function (CDF) by computing the probabilities for $k = 0, 1, 2, \dots$ events occurring in the time interval $(t, t + \Delta t]$. We then generate a random number from a uniform distribution and find the number of events by finding the matching range of probabilities. Since $k$ is unbounded, we only generate probabilities until we have a match.

There are two potential problems with this approach:

  1. We don't know the exact arrival time of each event -- we only know how many events have occurred in the interval. To achieve a fine granularity, we need to use a small value for $\Delta t$.
  2. Finding the number of events requires evaluating the probabilities for all smaller values of $k$.

As a result, this approach ends up being computationally intensive.
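To illustrate the first approach, here is a minimal sketch (my own; the function name sample_poisson_count is hypothetical, not part of BPS) that accumulates the Poisson probabilities for $k = 0, 1, 2, \dots$ until the cumulative probability covers a uniform random draw.

In [ ]:
import math
import random

def sample_poisson_count(rate, delta_t):
    """Sample the number of events in an interval of length delta_t
    for a homogeneous Poisson process by inverting the CDF."""
    mean = rate * delta_t
    r = random.random()
    k = 0
    prob_k = math.exp(-mean)  # P[N = 0]
    cdf = prob_k
    # accumulate P[N = k] until the CDF covers the random draw
    while r >= cdf:
        k += 1
        prob_k *= mean / k    # P[N = k] = P[N = k - 1] * mean / k
        cdf += prob_k
    return k

# e.g., 10 events/hour over a 1-hour interval
print sample_poisson_count(10.0, 1.0)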

For the second approach, we can compute the waiting time $X_n$ before event $n$ by sampling from the exponential distribution:

$$ f(X_n) = \lambda e^{-\lambda X_n} $$

The arrival time of $n$ is the sum of all the previous waiting times:

$$ t_n = \sum_{i=1}^n X_i $$

The second approach overcomes both disadvantages of the first approach. It gives exact arrival times and is computationally efficient.

(I am unsure whether the second approach will hold for non-homogeneous Poisson processes.)


In [626]:
class HomogenousPoissonProcess(object):
    def __init__(self, rate=None):
        self._rate = rate
        # arrival time of the most recently generated event
        self._last_arrival = 0.0
        # end of the last sampled time window
        self._last_time_period = 0.0
        
    def sample(self, time_period):
        arrival_times = []
        self._last_time_period += time_period
        # an arrival generated past the previous window may fall inside this one
        if self._last_arrival > self._last_time_period - time_period:
            if self._last_arrival >= self._last_time_period:
                # the pending arrival lies beyond this window as well
                return arrival_times
            arrival_times.append(self._last_arrival)
        while True:
            # waiting times between events are exponentially distributed
            waiting_time = random.expovariate(self._rate)
            self._last_arrival += waiting_time
            if self._last_arrival < self._last_time_period:
                arrival_times.append(self._last_arrival)
            else:
                break
        return arrival_times

In [627]:
rate = 10.0 # 10 events per hour
process = HomogenousPoissonProcess(rate=rate)
arrival_times = process.sample(24.0)
print "%s events" % len(arrival_times)
print "Expected: %s events/hour" % rate
print "Observed: %s events/hour" % (len(arrival_times) / 24.0)
plt.plot(xrange(1, len(arrival_times) + 1), arrival_times)
plt.xlabel("Events")
plt.ylabel("Arrival Times")
plt.title("Homogenous Poisson Process")


248 events
Expected: 10.0 events/hour
Observed: 10.3333333333 events/hour
Out[627]:
<matplotlib.text.Text at 0x118b4d690>

Markov Models

A Markov model is defined by a set of states and transition probabilities $p_{ij}$ from each state $i$ to each state $j$. When used as a stochastic process, Markov models produce a sequence of states called a Markov chain. On every step, the next state in the chain is generated by sampling from the PDF defined by the outbound transition probabilities for the current state.

For example, assume we have a Markov Model with three states: A, B, and C. The transition probabilities are given as follows:

  • A -> B : 0.5
  • A -> C : 0.5
  • A -> A : 0.0
  • B -> A : 0.3
  • B -> C : 0.7
  • B -> B : 0.0
  • C -> A : 0.7
  • C -> B : 0.2
  • C -> C : 0.1

If the current state is A, the Markov chain will choose B or C as the next state with equal probability. If the chain is in state B, the next state will be A with a 30% probability and C with a 70% probability. State C introduces the possibility of choosing itself as the next state.

It is important to note that the transition probabilities do not need to be symmetric. Specifically, $p_{ij}$ does not need to equal $p_{ji}$. Self-transitions are also allowed, meaning that $p_{ii}$ can be non-zero. The sum of the outbound transitions for each state $i$ must equal unity: $\sum_j p_{ij} = 1$.

Hidden Markov Models (HMMs) extend and generalize the concept of Markov models. In an HMM, the state itself is not the output. Rather, each state defines a PDF of possible outputs. After each state transition, the output is chosen by sampling from the current state's PDF. A Markov model is a specific case of an HMM where the output PDF only defines one possible output with probability 1.
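To make this concrete, here is a minimal sketch (my own, not part of BPS) of a Markov chain sampler using the A, B, C transition probabilities above. At each step it samples the next state from the current state's outbound transition probabilities.

In [ ]:
import random

class MarkovChain(object):
    def __init__(self, transition_probs, start_state):
        # transition_probs maps each state to a list of (next_state, probability)
        self._transition_probs = transition_probs
        self._state = start_state

    def step(self):
        # sample the next state from the current state's outbound PDF
        r = random.random()
        cumulative = 0.0
        for next_state, prob in self._transition_probs[self._state]:
            cumulative += prob
            if r < cumulative:
                self._state = next_state
                return self._state
        # guard against floating-point round-off in the probabilities
        self._state = self._transition_probs[self._state][-1][0]
        return self._state

# transition probabilities for the A, B, C example above
transitions = {"A" : [("B", 0.5), ("C", 0.5)],
               "B" : [("A", 0.3), ("C", 0.7)],
               "C" : [("A", 0.7), ("B", 0.2), ("C", 0.1)]}
chain = MarkovChain(transitions, "A")
print [chain.step() for i in xrange(10)]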

Sampling Approaches

Direct Sampling

Roulette Wheel Sampling

As the store location PDF is one-dimensional, a simple roulette wheel approach is efficient for sampling. The roulette wheel builds a table by computing the cumulative distribution function (CDF) and assigning each discrete value in the domain an interval. Values are sampled by generating a random number from a uniform distribution and iterating through the table until the interval containing the randomly-generated number is found.


In [613]:
class RouletteWheelSampler(object):
    def __init__(self, domain, pdf):
        self._wheel = []
        end = 0.0
        for x in domain:
            end += pdf.probability(x)
            self._wheel.append((end, x))
        print end

    def sample(self):
        r = random.random()
        for end, x in self._wheel:
            if r < end:
                return x
        # we should never get here since probabilities
        # should sum to 1
        raise Exception("Could not pick a value!")

Monte Carlo Methods

Monte Carlo methods are stochastic methods for sampling from complex probability density functions (PDFs). Since complex PDFs are often very difficult to sample from directly, Monte Carlo methods instead draw candidate samples from a simpler proposal distribution and accept or reject each candidate based on the target PDF, as the sampler below does.


In [602]:
class MonteCarloSampler(object):
    def __init__(self, generator, pdf):
        self.generator = generator
        self.pdf = pdf
        self.num_iter_accept = []

    def sample(self):
        num_iter = 0
        while True:
            num_iter += 1
            obj = self.generator.generate()
            prob = self.pdf.probability(obj)
            rand = random.random()
            if rand < prob:
                self.num_iter_accept.append(num_iter)
                return obj

    def sample_n(self, n):
        samples = []
        for i in xrange(n):
            samples.append(self.sample())
        return samples

    def acceptance_rate(self):
        if len(self.num_iter_accept) == 0:
            return 0.0
        return 1.0 / np.average(self.num_iter_accept)

In [603]:
class GaussianPDF(object):
	def __init__(self, mean, std_dev):
		self.mean = mean
		self.std_dev = std_dev

	def probability(self, x):
		exponent = -1.0 * np.power(x - self.mean, 2.0) / (2.0 * self.std_dev * self.std_dev)
		return np.exp(exponent) / (self.std_dev * np.sqrt(2.0 * np.pi))

class UniformSampler(object):
	def __init__(self, a, b):
		self.a = a
		self.b = b

	def generate(self):
		return random.uniform(self.a, self.b)

In [604]:
mean = 0.0
std = 1.0
narrow_a = -10.0
narrow_b = 10.0
wide_a = -100.0
wide_b = 100.0

In [605]:
gaussian_pdf = GaussianPDF(mean, std)
narrow_uniform_sampler = UniformSampler(narrow_a, narrow_b)
wide_uniform_sampler = UniformSampler(wide_a, wide_b)
narrow_mc = MonteCarloSampler(narrow_uniform_sampler, gaussian_pdf)
wide_mc = MonteCarloSampler(wide_uniform_sampler, gaussian_pdf)

In [606]:
n_samples = 10000
gauss_samples = [random.gauss(mean, std) for i in xrange(n_samples)]
narrow_uniform_samples = [random.uniform(narrow_a, narrow_b) for i in xrange(n_samples)]
wide_uniform_samples = [random.uniform(wide_a, wide_b) for i in xrange(n_samples)]

In [607]:
bins = np.linspace(wide_a, wide_b, num=1000)
narrow_hist, _ = np.histogram(narrow_uniform_samples, bins=bins, density=True)
wide_hist, _ = np.histogram(wide_uniform_samples, bins=bins, density=True)
gauss_hist, _ = np.histogram(gauss_samples, bins=bins, density=True)

In [608]:
plt.clf()
plt.hold(True)
plt.plot(bins[:-1], gauss_hist, 'r-', label="Gaussian PDF")
plt.plot(bins[:-1], narrow_hist, 'g-', label="Narrow MC")
plt.plot(bins[:-1], wide_hist, 'b-', label="Wide MC")
plt.legend(loc="upper right")


Out[608]:
<matplotlib.legend.Legend at 0x1180e4b10>

The figure shows the differences between the three distributions. It should be apparent that the narrow-range uniform distribution overlaps more with the Gaussian distribution than the wide-range uniform distribution does. This will be important later.


In [609]:
n_samples = 10000
narrow_samples = narrow_mc.sample_n(n_samples)
wide_samples = wide_mc.sample_n(n_samples)
gauss_samples = [random.gauss(mean, std) for i in xrange(n_samples)]

In [610]:
bins = np.linspace(narrow_a, narrow_b, num=100)
narrow_hist, _ = np.histogram(narrow_samples, bins=bins, density=True)
wide_hist, _ = np.histogram(wide_samples, bins=bins, density=True)
gauss_hist, _ = np.histogram(gauss_samples, bins=bins, density=True)

In [611]:
plt.clf()
plt.hold(True)
plt.plot(bins[:-1], gauss_hist, 'r-', label="Gaussian PDF")
plt.plot(bins[:-1], narrow_hist, 'g-', label="Narrow MC")
plt.plot(bins[:-1], wide_hist, 'b-', label="Wide MC")
plt.legend(loc="upper right")


Out[611]:
<matplotlib.legend.Legend at 0x118063cd0>

I ran the Monte Carlo sampler using the wide and narrow uniform distributions to generate candidate samples. Even though neither of these distributions matches the target distribution (Gaussian), the Monte Carlo process ensures that we sample the target distribution. This is the power of a Monte Carlo method: candidates can be drawn from any distribution, as long as it covers the correct domain (e.g., $(-\infty, \infty)$). (In our case, I'm assuming that samples outside of the range $[-10, 10]$ have near-zero probability, hence my choice for the ranges of the uniform distributions.) Regardless of the distribution used for generating candidates, the MC method ensures we sample from the correct distribution.

When the target distribution is very complex, sampling it directly can be impossible. MC methods allow us to overcome this limitation.


In [612]:
print narrow_mc.acceptance_rate(), wide_mc.acceptance_rate()


0.0511647658956 0.00500015500481

This is not to say that the choice of sampling distribution doesn't have any impact -- in fact, it is quite important. The choice of sampling distribution will affect the efficiency of the Monte Carlo method, in particular the acceptance rate. The narrow uniform sampler was more likely to pick values in the high-probability region of the Gaussian distribution, while the wide uniform sampler was more likely to pick values outside of the high-probability region. As a result, the MC method accepted 1 out of every 20 samples generated by the narrow sampler but only 1 out of every 200 samples generated by the wide sampler.
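A quick back-of-the-envelope calculation (my own, assuming essentially all of the Gaussian's mass lies inside each sampling range) explains these numbers. A candidate $x$ drawn from $U(a, b)$ is accepted with probability $p(x)$, so the expected acceptance rate is

$$ \textrm{acceptance rate} = \int_a^b \frac{1}{b - a} p(x) \, dx \approx \frac{1}{b - a} $$

For the narrow sampler, $b - a = 20$, giving an acceptance rate of about $0.05$; for the wide sampler, $b - a = 200$, giving about $0.005$. Both match the observed values.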

Description of Mathematical Model

Overview of External Data

As an input to the model for BigPetStore, I'm using three datasets: population counts by zipcode, median household incomes by zipcode, and latitude and longitude coordinates by zipcode. I give a brief overview of the datasets here.

Median Household Income Data

I obtained median household income data by zipcode from the 2012 American Community Survey. I downloaded the data from the U.S. Census Bureau web site.


In [340]:
def read_income_data(flname):
    zipcodes = []
    household_incomes = []
    fl = open(flname)
    #skip headers
    next(fl)
    next(fl)
    for ln in fl:
        cols = ln.strip().split(",")
        # zipcodes in column 3 in the format "ZCTA5 XXXXX"
        zipcode = cols[2].split()[1]
        try:
            median_household_income = int(cols[5])
            zipcodes.append(zipcode)
            household_incomes.append(median_household_income)
        except ValueError:
            # some records do not contain incomes
            pass
    fl.close()
    return pd.DataFrame({"zipcode" : zipcodes, "median_household_income" : household_incomes}, index=zipcodes)

In [341]:
income_table = read_income_data("../resources/ACS_12_5YR_S1903/ACS_12_5YR_S1903_with_ann.csv")

In [342]:
income_table.shape


Out[342]:
(32215, 2)

In [343]:
income_table.iloc[:10, :]


Out[343]:
median_household_income zipcode
00601 13495 00601
00602 15106 00602
00603 15079 00603
00606 12098 00606
00610 16923 00610
00612 18111 00612
00616 16279 00616
00617 15114 00617
00622 12059 00622
00623 16447 00623

10 rows × 2 columns


In [344]:
plt.clf()
plt.hist(income_table.iloc[:, 0], bins=100)
plt.xlabel("Median Household Income for Zipcode")
plt.ylabel("Number of Zipcodes")


Out[344]:
<matplotlib.text.Text at 0x10ea54f50>

Population Count Data


In [345]:
def load_population_data(flname):
	fl = open(flname)
	pop_data = dict()
	# skip headers
	next(fl)
	for ln in fl:
		if ln.strip() == "":
			continue
		zipcode, pop = ln.split(",")
		pop = int(pop)
        # remove duplicates.  keep largest pop values
		if zipcode in pop_data:
			pop_data[zipcode] = max(pop_data[zipcode], pop)
		else:
			pop_data[zipcode] = pop
	fl.close()
	pops = []
	zipcodes = []
	for z, p in pop_data.iteritems():
		zipcodes.append(z)
		pops.append(p)
	return pd.DataFrame(data={"zipcode" : zipcodes, "population_count" : pops}, index=zipcodes)

In [346]:
pop_table = load_population_data("../resources/population_data.csv")

In [347]:
pop_table.shape


Out[347]:
(32989, 2)

In [348]:
plt.clf()
plt.hist(pop_table.iloc[:, 0], bins=100)
plt.ylabel("Number of Zipcodes")
plt.xlabel("Population Count of Zipcode")


Out[348]:
<matplotlib.text.Text at 0x111b25d50>

Zipcode Longitude and Latitude Data


In [539]:
def load_zip_long_lat(flname):
    fl = open(flname)
    # skip header
    next(fl)
    zipcodes = []
    lats = []
    longs = []
    for ln in fl:
        cols = ln.split(", ")
        zipcode = cols[0][1:-1] # remove double-quote marks
        lat = float(cols[2][1:-1])
        long = float(cols[3][1:-1])
        zipcodes.append(zipcode)
        lats.append(lat)
        longs.append(long)
    fl.close()
    return pd.DataFrame(data={"zipcode" : zipcodes, "latitude" : lats, "longitude" : longs}, index=zipcodes)

In [540]:
zipcode_lats_longs = load_zip_long_lat("../resources/zips.csv")

In [541]:
zipcode_lats_longs.shape


Out[541]:
(33178, 3)

In [542]:
zipcode_lats_longs.iloc[:10, :]


Out[542]:
latitude longitude zipcode
35004 33.606379 -86.50249 35004
35005 33.592585 -86.95969 35005
35006 33.451714 -87.23957 35006
35007 33.232422 -86.80871 35007
35010 32.903432 -85.92669 35010
35014 33.355960 -86.27720 35014
35016 34.323715 -86.49278 35016
35019 34.292540 -86.63505 35019
35020 33.405559 -86.95141 35020
35022 33.346817 -86.95252 35022

10 rows × 3 columns

Zipcodes

Since there are zipcodes missing from each of the three datasets, I needed to reduce the datasets down to their common zipcodes.


In [543]:
income_zipcodes = set(income_table["zipcode"])
pop_zipcodes = set(pop_table["zipcode"])
lat_long_zipcodes = set(zipcode_lats_longs["zipcode"])
common_zipcodes = income_zipcodes.intersection(pop_zipcodes).intersection(lat_long_zipcodes)
print len(common_zipcodes)


30891

In [544]:
income_table = income_table.loc[common_zipcodes, :]
pop_table = pop_table.loc[common_zipcodes, :]
lat_long_table = zipcode_lats_longs.loc[common_zipcodes, :]

print income_table.shape, pop_table.shape, lat_long_table.shape


(30891, 2) (30891, 2) (30891, 3)

Visualization of Zipcode Locations


In [546]:
plt.scatter(lat_long_table["longitude"], lat_long_table["latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("All Common Zipcodes")


Out[546]:
<matplotlib.text.Text at 0x112bfa2d0>

The scatter plot of all of the zipcodes by latitude and longitude shows a few familiar shapes: the continental U.S., Alaska, and Hawaii. (The Puerto Rican zipcodes were removed.) I didn't know Alaska was so large. :)


In [550]:
populations = list(reversed(sorted(pop_table["population_count"])))
cutoff_idx = int(len(populations) * 0.01)
cutoff = populations[cutoff_idx]
cutoff_zipcodes = [z for z in common_zipcodes if pop_table.loc[z, "population_count"] >= cutoff]
plt.scatter(lat_long_table.loc[cutoff_zipcodes, "longitude"], lat_long_table.loc[cutoff_zipcodes, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("Top 1% Most Populated Zipcodes")


Out[550]:
<matplotlib.text.Text at 0x110381a50>

A scatter plot of the zipcodes in the top 1% of the population counts shows large populations on the coasts and in major metropolitan areas like Chicago.


In [551]:
incomes = list(reversed(sorted(income_table["median_household_income"])))
cutoff_idx = int(len(incomes) * 0.01)
cutoff = incomes[cutoff_idx]
cutoff_zipcodes = [z for z in common_zipcodes if income_table.loc[z, "median_household_income"] >= cutoff]
plt.scatter(lat_long_table.loc[cutoff_zipcodes, "longitude"], lat_long_table.loc[cutoff_zipcodes, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("Top 1% Median Household Income Zipcodes")


Out[551]:
<matplotlib.text.Text at 0x112c5a690>

A scatter plot of the zipcodes in the top 1% of the median household incomes shows concentrations of wealth in the northeast, parts of California, South Florida, and a few major metropolitan areas.

Names


In [574]:
def read_names(flname):
    fl = open(flname)
    names = []
    first_names = []
    last_names = []
    weights = []
    for ln in fl:
        cols = ln.split(",")
        name = cols[0]
        title = cols[2] == "1"
        if title:
            continue
        last_name = cols[3] == "1"
        first_name = cols[4] == "1"
        weight = float(cols[5])
        names.append(name)
        first_names.append(first_name)
        last_names.append(last_name)
        weights.append(weight)
    fl.close()
    return pd.DataFrame(data={"name" : names, "first_name" : first_names, "last_name" : last_names, "weight" : weights}, index=names)

In [575]:
customer_names = read_names("../resources/namedb/data/data.dat")

In [576]:
customer_names.shape


Out[576]:
(128988, 4)

In [577]:
customer_names.iloc[:10, :]


Out[577]:
first_name last_name name weight
Aaban True False Aaban 864
Aabid True False Aabid 917
Aabriella True False Aabriella 915
Aadam True False Aadam 703
Aadan True False Aadan 754
Aadarsh True False Aadarsh 712
Aaden True False Aaden 504
Aadesh True False Aadesh 871
Aadhav True False Aadhav 840
Aadhya True False Aadhya 781

10 rows × 4 columns

Model of Stores

Stores really only have two fields in our data model: a unique ID and a location (zipcode). The complexity in modeling the stores comes from how we choose to locate them. I assume that stores are more likely to be placed in areas that have larger populations and higher incomes. The store locations will be generated according to a probability density function giving the probability of a store occurring in a given zipcode.

Note: the current approach for choosing the store locations assumes the stores are independent. It may be interesting to incorporate patterns such that the location of new stores is influenced by previously generated stores. For example, maybe we want to ensure that stores are located a minimum distance from each other while also within a maximum distance of each other.

Store Population Density-Weighted PDF

I directly used zipcode population counts to determine the probability of a store being placed in a given zipcode:

$$p_{\textrm{population_weighted}}(\textrm{zipcode}) = \frac{\textrm{population_counts}(\textrm{zipcode})}{\sum_i\textrm{population_counts}(i)} $$

In [392]:
class PopulationWeightedLocationPDF(object):
    def __init__(self, table):
        self.pop_density = table["population_count"] / table["population_count"].sum()
    
    def probability(self, zipcode):
        return self.pop_density[zipcode]

Store Income-Weighted PDF

I want the most prosperous areas to be 100x more likely to have a store than the least prosperous areas. To define the income weights, I needed a function that takes a user-defined parameter $s$ for the scaling between the minimum and maximum incomes and interpolates for incomes in between. I decided to use an exponential function.

$$w_{\textrm{income}}(\textrm{income}) = \exp( k (\textrm{income} - \textrm{min_income}))$$

To achieve the appropriate scaling factor $s$, $k$ is set as follows:

$$ k = \frac{\log(s)}{\textrm{max_income} - \textrm{min_income}}$$

$$\textrm{min_income} = \min_i{\textrm{median_household_incomes}(i)}$$

$$\textrm{max_income} = \max_i{\textrm{median_household_incomes}(i)}$$
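To see why this choice of $k$ produces the desired scaling between the extremes, note that

$$\frac{w_{\textrm{income}}(\textrm{max_income})}{w_{\textrm{income}}(\textrm{min_income})} = \exp(k(\textrm{max_income} - \textrm{min_income})) = \exp(\log(s)) = s$$

so with $s = 100$, the wealthiest zipcode receives 100x the weight of the poorest.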

In [537]:
class IncomeScalingWeights(object):
    def __init__(self, income_table=None, scaling_factor=100.0):
        self.max_income = income_table["median_household_income"].max()
        self.min_income = income_table["median_household_income"].min()
        self.scaling_factor = scaling_factor
        self.k = np.log(self.scaling_factor) / (self.max_income - self.min_income)
        
    def weight(self, income):
        return np.exp(self.k * (income - self.min_income)) #/ self.normalization_factor

In [451]:
income_weights = IncomeScalingWeights(income_table)

In [460]:
print income_table["median_household_income"].min(), income_table["median_household_income"].max()
incomes = np.linspace(income_table["median_household_income"].min(), income_table["median_household_income"].max(), num=100)
weights = [income_weights.weight(income) for income in incomes]


3372 242824

In [461]:
plt.plot(incomes, weights)
plt.xlabel("Income")
plt.ylabel("Weight")


Out[461]:
<matplotlib.text.Text at 0x10ff9fa50>

Since the income weight function takes an income as input, not a location, I need to use function composition to map zipcodes to median household incomes and median household incomes to weights.

Mathematically, the function is written:

$$p_{\textrm{income_weighted}}(\textrm{zipcode}) = \frac{1}{Z} w_{\textrm{income}}(\textrm{median_household_incomes}(\textrm{zipcode}))$$

where $Z$ is the normalization factor.

$Z$ is found by summing $w_{\textrm{income}}$ over all of the zipcodes:

$$ Z = \sum_{z} w_{\textrm{income}}(\textrm{median_household_incomes}(z))$$

In [455]:
class IncomeWeightedLocationPDF(object):
    def __init__(self, zipcodes=None, income_table=None, scaling_factor=100.0):
        self.income_table = income_table
        
        self.max_income = income_table["median_household_income"].max()
        self.min_income = income_table["median_household_income"].min()
        self.scaling_factor = scaling_factor
        self.k = np.log(self.scaling_factor) / (self.max_income - self.min_income)
    
        zipcode_weights = [self._income_weight(self._zipcode_income(z)) for z in zipcodes]
        self.normalization_factor = sum(zipcode_weights) 
        
    def _income_weight(self, income):
        return np.exp(self.k * (income - self.min_income))
    
    def _zipcode_income(self, zipcode):
        return self.income_table["median_household_income"].loc[zipcode]
    
    def probability(self, zipcode):
        return self._income_weight(self._zipcode_income(zipcode)) / self.normalization_factor

Store Location PDF

The composite store location PDF is written as follows:

$$p_{\textrm{store_location}}(\textrm{zipcode}) = \frac{1}{Z} p_{\textrm{population_weighted}}(\textrm{zipcode}) \, p_{\textrm{income_weighted}}(\textrm{zipcode})$$

where $Z$ is the normalization factor.

Although $p_{\textrm{population_weighted}}$ and $p_{\textrm{income_weighted}}$ are proper PDFs individually, each summing to 1 over the zipcodes, their product is not. As a result, a new normalization factor $Z$ needs to be computed.

$Z$ is found by summing over all of the zipcodes:

$$ Z = \sum_{z} p_{\textrm{population_weighted}}(z) \, p_{\textrm{income_weighted}}(z)$$

The implementation is as follows:


In [442]:
class StoreLocationPDF(object):
    def __init__(self, zipcodes=None, income_table=None, pop_table=None, income_scaling=100.0):
        self.population_pdf = PopulationWeightedLocationPDF(pop_table)
        self.income_pdf = IncomeWeightedLocationPDF(zipcodes, income_table, scaling_factor=income_scaling)
        
        weights = [self._weight(z) for z in zipcodes]
        self.normalization_factor = sum(weights)
        
    def _weight(self, zipcode):
        return self.population_pdf.probability(zipcode) * self.income_pdf.probability(zipcode)
                
    
    def probability(self, zipcode):
        return self._weight(zipcode) / self.normalization_factor

Building Store Objects

Stores have two fields: the location and a unique identifier (for cases where there are multiple stores in the same zipcode). Using a builder allows us to add new fields in the future. The builder encapsulates the PDF and sampling method. Stores are represented as dictionaries.


In [438]:
class StoreBuilder(object):
    def __init__(self, zipcodes=None, income_table=None, pop_table=None, income_scaling=100.0):
        self._ident = 1
        
        location_pdf = StoreLocationPDF(zipcodes, income_table, pop_table, income_scaling=income_scaling)
        self._location_sampler = RouletteWheelSampler(zipcodes, location_pdf)
        
    def build(self):
        store_ident = self._ident
        self._ident += 1
        
        store_zipcode = self._location_sampler.sample()
        
        return {"store_id" : store_ident, "zipcode" : store_zipcode}

In [527]:
store_builder = StoreBuilder(zipcodes=common_zipcodes, income_table=income_table, pop_table=pop_table, income_scaling=100.0)


1.0

In [529]:
stores_10 = [store_builder.build() for i in  xrange(10)]
stores_100 = [store_builder.build() for i in  xrange(100)]
stores_1000 = [store_builder.build() for i in  xrange(1000)]

In [530]:
print stores_10[:10]


[{'store_id': 1, 'zipcode': '93021'}, {'store_id': 2, 'zipcode': '78613'}, {'store_id': 3, 'zipcode': '06062'}, {'store_id': 4, 'zipcode': '65721'}, {'store_id': 5, 'zipcode': '76063'}, {'store_id': 6, 'zipcode': '37771'}, {'store_id': 7, 'zipcode': '77033'}, {'store_id': 8, 'zipcode': '20111'}, {'store_id': 9, 'zipcode': '74073'}, {'store_id': 10, 'zipcode': '02346'}]

In [563]:
zipcodes_10 = [store["zipcode"] for store in stores_10]
zipcodes_100 = [store["zipcode"] for store in stores_100]
zipcodes_1000 = [store["zipcode"] for store in stores_1000]

In [569]:
plt.scatter(lat_long_table.loc[zipcodes_10, "longitude"], lat_long_table.loc[zipcodes_10, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("10 Stores")
plt.xlim([-200, -60])
plt.ylim([10, 80])


Out[569]:
(10, 80)

In [570]:
plt.scatter(lat_long_table.loc[zipcodes_100, "longitude"], lat_long_table.loc[zipcodes_100, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("100 Stores")
plt.xlim([-200, -60])
plt.ylim([10, 80])


Out[570]:
(10, 80)

In [571]:
plt.scatter(lat_long_table.loc[zipcodes_1000, "longitude"], lat_long_table.loc[zipcodes_1000, "latitude"])
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("1000 Stores")
plt.xlim([-200, -60])
plt.ylim([10, 80])


Out[571]:
(10, 80)

Model of Customers

Customer data model:

  • Identifier
  • Name
  • Location (zipcode)
  • Number of Pets (used when generating pets)
  • Customer Enrichment Factor (hidden, used to determine customers who are more or less likely to have transactions than normal)

Customer Location

The customer location will be modeled based on population density and the locations of the stores. More customers will be in more populated areas. Customers will also be much more likely to frequent stores within 5-10 miles of their homes.

(Note: to be realistic, we should limit the maximum number of customers per zipcode to the population of that zipcode. I'm ignoring this for now.)

First, we'll look at the distribution of the minimum distances to stores based on the number of stores generated.


In [523]:
import math

class DistanceNearestStore(object):
    def __init__(self, stores=None, lat_long_table=None):
        self._lat_long_table = lat_long_table
        self._stores = list()
        for store in stores:
            zipcode = store["zipcode"]
            latA = self._lat_long_table.loc[zipcode, "latitude"]
            longA = self._lat_long_table.loc[zipcode, "longitude"]
            self._stores.append((store, latA, longA))
        
    def _dist(self, lat_A, long_A, lat_B, long_B):
        """
        Computes distance between latitude-longitude
        pairs in miles.
        """
        dist = (math.sin(math.radians(lat_A)) *
                math.sin(math.radians(lat_B)) +
                math.cos(math.radians(lat_A)) *
                math.cos(math.radians(lat_B)) *
                math.cos(math.radians(long_A - long_B)))
        dist = (math.degrees(math.acos(dist))) * 69.09
        return dist
        
    def min_dist(self, zipcode):
        latA = self._lat_long_table.loc[zipcode, "latitude"]
        longA = self._lat_long_table.loc[zipcode, "longitude"]
        
        distances = []
        for store, latB, longB in self._stores:
            if store["zipcode"] == zipcode:
                dist = 0.0
            else:
                dist = self._dist(latA, longA, latB, longB) 
            distances.append((dist, store))

        return min(distances, key=lambda pair: pair[0])

In [552]:
distance_nearest_store_10 = DistanceNearestStore(stores_10, lat_long_table)
distance_nearest_store_100 = DistanceNearestStore(stores_100, lat_long_table)
distance_nearest_store_1000 = DistanceNearestStore(stores_1000, lat_long_table)

In [553]:
zipcode_dist_10 = [distance_nearest_store_10.min_dist(z) for z in common_zipcodes]
zipcode_dist_100 = [distance_nearest_store_100.min_dist(z) for z in common_zipcodes]
zipcode_dist_1000 = [distance_nearest_store_1000.min_dist(z) for z in common_zipcodes]

In [555]:
distances_only_10 = [dist for dist, store in zipcode_dist_10]#if dist < 200.0]
plt.clf()
plt.hist(distances_only_10, bins=100)
plt.title("10 stores")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")


Out[555]:
<matplotlib.text.Text at 0x1159a8b90>

In [556]:
distances_only_100 = [dist for dist, store in zipcode_dist_100]#  if dist < 200.0]
plt.clf()
plt.hist(distances_only_100, bins=100)
plt.title("100 stores")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")


Out[556]:
<matplotlib.text.Text at 0x115946c50>

In [558]:
distances_only_1000 = [dist for dist, store in zipcode_dist_1000] # if dist < 180.0]
plt.clf()
plt.hist(distances_only_1000, bins=100)
plt.title("1000 stores")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")


Out[558]:
<matplotlib.text.Text at 0x1171eb210>

The number of stores generated has an (obvious) impact on the minimum distances. For 10 stores, a large number of zipcodes can be up to 1,000 miles away. For 100 stores, most zipcodes are within 450 miles, and when we generate 1,000 stores, most zipcodes are within 100-150 miles of a store.

Regardless of the number of stores generated, the maximum distances are unrealistic. (A customer would not be likely to go to a store over 100 miles away on a regular basis.) It is therefore reasonable to limit the customer locations based on the distance to the nearest store. Based on personal experience, 5 - 10 miles is the maximum distance for regular store visits.

We will choose to use an exponential distribution to model the probability of a customer living in a certain zipcode based on the distance to the nearest store. An exponential distribution has a single parameter, the average.

$$f(x; \beta) = \frac{1}{\beta} \exp(-\frac{x}{\beta}) $$

In [559]:
class DistanceWeights(object):
    def __init__(self, avg_distance=5.0):
        self.lambd = 1.0 / avg_distance
    
    def weight(self, distance):
        return self.lambd * np.exp(-self.lambd * distance)

In [560]:
distance_weights = DistanceWeights()

In [561]:
distances = np.linspace(0.0, 100.0, num=1000)
weights = [distance_weights.weight(distance) for distance in distances]

In [562]:
plt.plot(distances, weights)
plt.xlabel("Distance (Miles)")
plt.ylabel("Weight")


Out[562]:
<matplotlib.text.Text at 0x1173d57d0>

The exponential distribution actually matches the real distance distributions quite well given enough stores. By choosing an average of 5 miles, the exponential distribution ensures that most customers are within 10-15 miles of a store.
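To put numbers behind the 10-15 mile claim: under the exponential weighting itself with $\beta = 5$ miles,

$$P[X > x] = \exp(-x / \beta), \qquad P[X > 10] = e^{-2} \approx 0.14, \qquad P[X > 15] = e^{-3} \approx 0.05$$

so roughly 95% of the weight falls within 15 miles of the nearest store. (The sampled customer distances will deviate somewhat from this, since the weights are applied over the discrete set of zipcodes and then renormalized.)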


In [584]:
class LocationDistanceWeightedPDF(object):
    def __init__(self, avg_distance=5.0, stores=None, lat_long_table=None, zipcodes=None):
        self._dist_calculator = DistanceNearestStore(stores=stores, lat_long_table=lat_long_table)
        self._dist_weights = DistanceWeights(avg_distance=avg_distance)
        
        weight_sum = 0.0
        for z in zipcodes:
            dist, _ = self._dist_calculator.min_dist(z)
            weight = self._dist_weights.weight(dist)
            weight_sum += weight
        self._normalization_factor = weight_sum
    
    def probability(self, zipcode):
        dist, _ = self._dist_calculator.min_dist(zipcode)
        weight = self._dist_weights.weight(dist)
        prob = weight / self._normalization_factor
        return prob

In [585]:
distance_pdf = LocationDistanceWeightedPDF(stores=stores_100, lat_long_table=lat_long_table, zipcodes=common_zipcodes)

In [586]:
customer_location_sampler = RouletteWheelSampler(common_zipcodes, distance_pdf)


1.0

In [589]:
customer_zipcodes = [customer_location_sampler.sample() for i in xrange(1000)]

In [593]:
customer_zipcodes[:10]


Out[593]:
['17563',
 '15012',
 '90402',
 '60610',
 '07921',
 '72211',
 '02151',
 '94538',
 '11223',
 '28206']

In [598]:
customer_distances = [distance_nearest_store_100.min_dist(z)[0] for z in customer_zipcodes]

In [599]:
plt.clf()
plt.hist(customer_distances, bins=100)
plt.title("100 stores, 1000 customers")
plt.xlabel("Distance to Nearest Store (miles)")
plt.ylabel("Count")


Out[599]:
<matplotlib.text.Text at 0x1173f3d90>

In [614]:
plt.clf()
plt.hold(True)
plt.scatter(lat_long_table.loc[zipcodes_100, "longitude"], lat_long_table.loc[zipcodes_100, "latitude"], c="b")
plt.scatter(lat_long_table.loc[customer_zipcodes, "longitude"], lat_long_table.loc[customer_zipcodes, "latitude"], c="r")
plt.ylabel("Latitude (degrees)")
plt.xlabel("Longitude (degrees)")
plt.title("100 Stores, 1000 customers")
plt.xlim([-200, -60])
plt.ylim([10, 80])


Out[614]:
(10, 80)

Pet PDFs

Product PDFs

Transaction PDFs

A transaction consists of:

  • Identifier
  • Store
  • Customer
  • Date
  • Time
  • List of Products Purchased and Their Quantities

There has to be a saturation point for transactions. If we have too few transactions, we won't capture regular customers or other patterns. If we have too many transactions, we end up with too many duplications (e.g., customers purchasing more food than they should). It's possible that we could use a Monte Carlo method (since previous state will matter here) and graph the acceptance rate as a function of the number of transactions generated.

Several decisions have to be made when building the model for the transactions. This results in three approaches:

  1. Sample the PDF in a completely unordered way. With this approach, samples would not be generated in any guaranteed order (e.g., transactions could be generated for Tuesday, then Monday, then Thursday). We could specify the total number of transactions to generate, and those transactions would be distributed according to the PDF. Introducing memory effects or preventing duplicate transactions would be difficult.

  2. For each store, integrate along the time dimension using a Poisson process. For each store, a non-homogeneous Poisson process could be used to generate the number of transactions between $t$ and $t + \Delta t$. The rate would be parameterized according to the particular date and time to include effects such as peak shopping times, weather, etc. The Poisson process would then be integrated over time for each store to determine the number of transactions in each timestep. We could then generate the other variables for the transactions by sampling from a PDF. Since this approach would generate values in order, we could incorporate memory effects more easily. There are two limitations to this approach. First, the number of transactions specified would be treated as a constraint on the customers. In reality, the number of transactions should be a function of the number of customers and their particular buying habits. Second, it would be hard to model the customers' histories correctly.

  3. Model each customer as a stochastic process and integrate over time. Statistics such as the number and distribution of transactions will arise naturally from the number of customers and their buying habits. This makes it easier to ensure that each customer's expected transactions are carried out and avoids needing to choose parameters such as the number of transactions or their rates for each store (see the sketch after this list).
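As a rough illustration of the third approach, here is a minimal sketch (my own; the names CustomerTransactionSimulator and transaction_rate are hypothetical, not part of BPS). Each customer is modeled as an independent homogeneous Poisson process using the HomogenousPoissonProcess class defined earlier, and the transaction stream emerges from the customers rather than being specified per store.

In [ ]:
class CustomerTransactionSimulator(object):
    def __init__(self, customers, transaction_rate):
        # one Poisson process per customer; transaction_rate is in
        # transactions per unit time (e.g., per day)
        self._processes = [(customer, HomogenousPoissonProcess(rate=transaction_rate))
                           for customer in customers]

    def simulate(self, time_period):
        # integrate each customer's process over the time window and
        # merge the arrivals into a single, time-ordered transaction stream
        transactions = []
        for customer, process in self._processes:
            for arrival_time in process.sample(time_period):
                transactions.append((arrival_time, customer))
        transactions.sort()
        return transactions

# e.g., 3 hypothetical customers averaging one transaction per week,
# simulated over 30 days
customers = ["customer_%d" % i for i in xrange(3)]
simulator = CustomerTransactionSimulator(customers, transaction_rate=1.0 / 7.0)
for arrival_time, customer in simulator.simulate(30.0):
    print customer, arrival_time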

Transaction Date / Time PDF

Factors:

  • Time of day
    • assume peaks before work, during lunch, and after work M-F
    • Sunday, during church in religious areas -- reduced probability
    • Equal probability for all times Saturday and Sunday
  • Date
    • Holiday?
    • Weather -- lower probability during cold days or storms. May need to generate weather patterns for all date / times first


Transaction Store PDF

We assume that stores in more populated areas are going to be visited more often. We can use a scaling approach like the one used for incomes when choosing the store locations.

We could also include an enrichment factor for each store. Some stores will be more favorable than others. The enrichment factors could be determined by an exponential or beta distribution. Enrichment could represent or be related to local competition.
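As a small sketch of the enrichment idea (my own; the weighting scheme and class name StoreEnrichmentPDF are assumptions, not the BPS implementation), each store could be assigned an enrichment factor drawn from an exponential distribution, with the factors normalized into a PDF over stores:

In [ ]:
import random

class StoreEnrichmentPDF(object):
    def __init__(self, stores, avg_enrichment=1.0):
        # draw a per-store enrichment factor from an exponential distribution
        self._enrichment = dict()
        for store in stores:
            self._enrichment[store["store_id"]] = random.expovariate(1.0 / avg_enrichment)
        self._normalization_factor = sum(self._enrichment.values())

    def probability(self, store):
        # probability that a transaction occurs at this store, ignoring
        # the population weighting discussed above
        return self._enrichment[store["store_id"]] / self._normalization_factor

store_enrichment_pdf = StoreEnrichmentPDF(stores_10)
print [store_enrichment_pdf.probability(store) for store in stores_10]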

Transaction Customer PDF

The customer PDF will be based on:

  • Preferred stores based on distance -- Compute distance to all stores. Weight stores based on distance.
  • Restrict maximum distance to store -- restrict the set of customers to those within a certain distance. Essentially, a Heaviside function that switches between a higher probability and a much lower probability.
  • Customer enrichment factor -- some customers will be more likely to purchase goods than other customers

Additional factors:

  • Put limit on time between transactions? Try to create periodic transactions? Let period vary based on customer frequency factor. E.g., let probability be a function of elapsed time to most recent (past, future) transaction -- low probability for recent transactions, higher for no recent transactions.

Transaction Products PDF

Factors to consider:

  • Product type enrichment -- products will be purchased based on consumption rate. E.g., food, snacks, poop bags, bedding, etc. will be much more common than other types. Flea & tick meds may be purchased with medium frequency.
  • Customer pets -- how many pets? what species?
  • Customer income -- how much a customer may spend, what type of products they buy
  • Previous purchases -- maybe certain things like food should be modeled as being purchased at a certain frequency, or we could try to model the average rate of purchases. Certain products (such as food) may need to be limited based on recent purchases. Weight of previous purchases could depend on a factor.
  • Preferred brands / products -- customers tend to buy the same type of food over and over. Weight of previous purchases could depend on a factor.
  • Location -- certain products may be more common in certain locations
  • Randomness factor -- some times, customers purchase things for the hell of it... how can we incorporate this?