Welcome to the Doppelganger example. If you have not already done so, please see the README for installation instructions and an explanation of what Doppelganger does under the hood. For a simplified walkthrough, take a look at doppelganger_example_simple.
This workbook acts on a single PUMA and carries out the following tasks:
Before we get going, let's take care of some housekeeping tasks to make your Doppelganger Example experience as seamless as possible.
We have included a cross-walk between PUMAs and Census tracts that we'll use later. Please navigate to this file and unzip it in the same directory. The file is here:
In [ ]:
# ./examples/sample_data/2010_puma_tract_mapping.txt.zip
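If you would rather unzip the crosswalk from Python than from a file manager, the standard-library zipfile module can do it. This is a minimal sketch: the helper name unzip_in_place is ours, not part of Doppelganger, and the path is the one shown in the cell above, so adjust it to match your checkout.

```python
import os
import zipfile


def unzip_in_place(zip_path):
    """Extract every member of zip_path into the directory that holds it."""
    target_dir = os.path.dirname(os.path.abspath(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return [os.path.join(target_dir, name) for name in archive.namelist()]


# Path taken from the cell above; it only runs if the file is actually there.
CROSSWALK_ZIP = './examples/sample_data/2010_puma_tract_mapping.txt.zip'
if os.path.exists(CROSSWALK_ZIP):
    print(unzip_in_place(CROSSWALK_ZIP))
```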
The example operates on a single PUMA in California (00106). In Step 01 below, you'll see that we have already extracted PUMS data for this PUMA, so we'd recommend that you do not change to your favorite PUMA just yet (here's a set of reference maps for California -- all states are available). But once you get comfortable with the tools, download the PUMS data for the PUMA of your choice, and go for it.
In [ ]:
STATE = '06'
PUMA = '00106'
Later in the example we grab some Census data using the Census API. Enter your Census key below (if you need one, you can get one for free here).
In [ ]:
MY_CENSUS_KEY = ''
This example will generate a variety of output. Please specify where you'd like this data written to disk.
In [ ]:
output_dir = '.'
In [ ]:
import csv
import os
import pandas as pd
from doppelganger import (
    allocation,
    inputs,
    Configuration,
    HouseholdAllocator,
    PumsData,
    SegmentedData,
    BayesianNetworkModel,
    Population,
    Preprocessor,
    Marginals,
)
The configuration file (sample_data/config.json) defines the following:
person_fields. In the example, you'll see age, sex, and individual_income. These variables are mapped to the PUMS variables in inputs.py. For example, age in Doppelganger is mapped to the PUMS variable agep. To use other PUMS variables with Doppelganger, you'll need to map their relationships in inputs.py and specify them here.
household_fields. In the example, you'll see household_income and num_vehicles. As with the person-specific variables, you'll need to modify inputs.py to use other variables in Doppelganger.
preprocessing. The settings used to build the preprocessor that is applied to the PUMS data.
network_config_files. The specification used to build the Bayesian Networks.
In [ ]:
configuration = Configuration.from_file('sample_data/config.json')
With the configuration in hand, we can create a preprocessor object that will create methods to apply to the household and person PUMS data. In the current configuration, the preprocessor bins the individual income variable.
In [ ]:
preprocessor = Preprocessor.from_config(configuration.preprocessing_config)
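To make "binning" concrete, here is a small stand-alone sketch of mapping a numeric income to a category. It does not use Doppelganger's Preprocessor; the bin edges, labels, and the function name bin_income are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical bin edges and labels, chosen only for illustration;
# Doppelganger's actual income bins come from its own configuration.
INCOME_EDGES = [0, 40000, 80000]
INCOME_LABELS = ['<=0', '(0, 40000]', '(40000, 80000]', '>80000']


def bin_income(income):
    """Map a numeric income to the label of the bin that contains it."""
    # bisect_left counts the edges strictly below income, so a value
    # equal to an edge falls into the bin that ends at that edge.
    return INCOME_LABELS[bisect_left(INCOME_EDGES, income)]


for value in (-10, 0, 25000, 40000, 95000):
    print(value, bin_income(value))
```

Binning matters here because the Bayesian Networks trained later work on categories, so a continuous variable such as income has to be reduced to a handful of levels first.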
Bayesian Networks are built using the specification (network_config_files) in the configuration file and cleaned PUMS data. Up first: let's read in and clean the PUMS data.
The fields we need from the household data are the union of those defined in allocation.DEFAULT_HOUSEHOLD_FIELDS and those defined in the household_fields section of the configuration file. The raw/dirty data collected for our example PUMA is available here.
In [ ]:
household_fields = tuple(
    set(field.name for field in allocation.DEFAULT_HOUSEHOLD_FIELDS).union(
        set(configuration.household_fields)
    )
)
households_data = PumsData.from_csv('sample_data/households_00106_dirty.csv').clean(
    household_fields, preprocessor, puma=PUMA
)
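The union step above can be tried in isolation. In this sketch the Field namedtuple stands in for the field objects in allocation.DEFAULT_HOUSEHOLD_FIELDS (only their .name attribute is used), and every field name is illustrative rather than taken from Doppelganger.

```python
from collections import namedtuple

# Stand-in for the field objects in allocation.DEFAULT_HOUSEHOLD_FIELDS;
# only the .name attribute matters here. All names are illustrative.
Field = namedtuple('Field', ['name'])
default_fields = [
    Field('serial_number'),
    Field('num_people'),
    Field('household_weight'),
]
configured_fields = ['household_income', 'num_vehicles', 'num_people']

# A set union drops the duplicate 'num_people'; sorting is optional and
# is done here only so the printed order is the same on every run.
needed = tuple(sorted({f.name for f in default_fields} | set(configured_fields)))
print(needed)
```

Because sets are unordered, the tuple built in the cell above may list the fields in a different order from run to run. That is harmless as long as the fields are looked up by name.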
In [ ]:
persons_fields = tuple(
    set(field.name for field in allocation.DEFAULT_PERSON_FIELDS).union(
        set(configuration.person_fields)
    )
)
persons_data = PumsData.from_csv('sample_data/persons_00106_dirty.csv').clean(
    persons_fields, preprocessor, puma=PUMA
)
The example below shows the training of a person-level model. It requires the following inputs:
persons_data, which was created in the previous step;
the person fields listed in the configuration file;
the person weight field, pwgtp, which is defined in inputs.py.
Once the training data is prepared, the training needs the person_structure, which is defined in the network_config_files section of the configuration file, and the person_fields, which are also defined in the configuration file.
The example person_structure specifies a network with three nodes (age, sex, and income) and two edges (age --> income, and sex --> income). The example household_structure specifies a network with three nodes (num_people, household_income, and num_vehicles) and three edges (num_people --> household_income, num_people --> num_vehicles, and household_income --> num_vehicles).
In [ ]:
person_training_data = SegmentedData.from_data(
    persons_data,
    list(configuration.person_fields),
    weight_field=inputs.PERSON_WEIGHT.name
)
person_model = BayesianNetworkModel.train(
    person_training_data,
    configuration.person_structure,
    configuration.person_fields
)
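The example person_structure described above (age and sex both pointing to income) is a small directed acyclic graph. The sketch below is not how Doppelganger stores its networks; it is a plain-Python picture of the same nodes and edges, with hypothetical helpers parents_of and sampling_order, showing which variables each node is conditioned on and an order in which they could be drawn.

```python
# Nodes and (parent, child) edges of the example person network.
PERSON_NODES = ['age', 'sex', 'income']
PERSON_EDGES = [('age', 'income'), ('sex', 'income')]


def parents_of(nodes, edges):
    """Return {node: sorted list of its parents} for a directed graph."""
    parents = {node: [] for node in nodes}
    for parent, child in edges:
        parents[child].append(parent)
    return {node: sorted(found) for node, found in parents.items()}


def sampling_order(nodes, edges):
    """Order nodes so that every parent precedes its children."""
    remaining = {node: set(found) for node, found in parents_of(nodes, edges).items()}
    order = []
    while remaining:
        # Nodes with no unsatisfied parents can be emitted next.
        ready = sorted(node for node, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError('structure contains a cycle')
        for node in ready:
            order.append(node)
            del remaining[node]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


print(parents_of(PERSON_NODES, PERSON_EDGES))
print(sampling_order(PERSON_NODES, PERSON_EDGES))
```

The same helpers applied to the household structure would place num_people first, then household_income, then num_vehicles, since each later node depends on the earlier ones.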
The Bayesian Networks can also be fully segmented. In the example below, we rebuild the person network, fully segmented by age category, by passing a segmenter function.
In [ ]:
person_segmentation = lambda x: x[inputs.AGE.name]
person_training_data = SegmentedData.from_data(
    persons_data,
    list(configuration.person_fields),
    inputs.PERSON_WEIGHT.name,
    person_segmentation
)
person_model = BayesianNetworkModel.train(
    person_training_data,
    configuration.person_structure,
    configuration.person_fields
)
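One way to picture what a segmenter does, independent of how SegmentedData implements it, is as a function that assigns each row to a group so that each group can be modeled on its own. The sketch below uses plain dictionaries; the helper segment_rows, the field names, and the rows are made up for illustration.

```python
from collections import defaultdict


def segment_rows(rows, segmenter, weight_key):
    """Group rows by segmenter(row) and total the weights in each group."""
    groups = defaultdict(list)
    totals = defaultdict(float)
    for row in rows:
        key = segmenter(row)
        groups[key].append(row)
        totals[key] += row[weight_key]
    return dict(groups), dict(totals)


# Made-up rows; field names and values are illustrative only.
rows = [
    {'age': '0-17', 'sex': 'F', 'person_weight': 12},
    {'age': '18-34', 'sex': 'M', 'person_weight': 20},
    {'age': '0-17', 'sex': 'M', 'person_weight': 8},
]
groups, totals = segment_rows(rows, lambda row: row['age'], 'person_weight')
print(totals)
```

Segmenting on a variable means no information is shared across its categories, so each segment needs enough weighted observations to estimate its own distribution.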
The Bayesian Network can be written to disk and read from disk as follows.
In [ ]:
person_model_filename = os.path.join(output_dir, 'person_model.json')
person_model.write(person_model_filename)
person_model_reloaded = BayesianNetworkModel.from_file(person_model_filename)
Following the same steps as above, you can also build a household network.
In [ ]:
household_segmenter = lambda x: x[inputs.NUM_PEOPLE.name]
household_training_data = SegmentedData.from_data(
    households_data,
    list(configuration.household_fields),
    inputs.HOUSEHOLD_WEIGHT.name,
    household_segmenter,
)
household_model = BayesianNetworkModel.train(
    household_training_data,
    configuration.household_structure,
    configuration.household_fields
)
household_model_filename = os.path.join(output_dir, 'household_model.json')
household_model.write(household_model_filename)
The sample data comes with a set of marginals that can be used in the allocation step. This file is named sample_data/marginals_00106.csv. To demonstrate how marginals can be created, we create them here. Note that this step will take a few minutes. If you want to skip it, use the controls = Marginals.from_csv('sample_data/marginals_00106.csv') call in the subsequent code box.
In [ ]:
if MY_CENSUS_KEY:
    new_marginal_filename = os.path.join(output_dir, 'new_marginals.csv')
    with open('sample_data/2010_puma_tract_mapping.txt') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        marginals = Marginals.from_census_data(
            csv_reader, MY_CENSUS_KEY, state=STATE, pumas=[PUMA]
        )
        marginals.write(new_marginal_filename)
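To see what the crosswalk contributes, it can help to read it directly with csv.DictReader and list the tracts that belong to one PUMA. The sketch below runs on a few made-up rows held in memory. The column names (STATEFP, COUNTYFP, TRACTCE, PUMA5CE) are an assumption based on the Census Bureau's 2010 tract-to-PUMA relationship file, so check the header of your copy before relying on them. The helper tracts_for_puma is ours.

```python
import csv
import io

# Made-up rows for illustration. The column names are assumed to follow
# the Census Bureau's 2010 tract-to-PUMA relationship file.
SAMPLE = """STATEFP,COUNTYFP,TRACTCE,PUMA5CE
06,001,000001,00106
06,001,000002,00106
06,013,000003,01301
"""


def tracts_for_puma(csv_file, state, puma):
    """Return (state, county, tract) tuples for rows matching state and puma."""
    reader = csv.DictReader(csv_file)
    return [
        (row['STATEFP'], row['COUNTYFP'], row['TRACTCE'])
        for row in reader
        if row['STATEFP'] == state and row['PUMA5CE'] == puma
    ]


matches = tracts_for_puma(io.StringIO(SAMPLE), '06', '00106')
print(matches)
```

Note that the codes are compared as strings: leading zeros are significant in FIPS codes, which is why STATE and PUMA are defined as strings at the top of this example.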
In [ ]:
# controls = Marginals.from_csv(new_marginal_filename) # use this one to use your new marginals
controls = Marginals.from_csv('sample_data/marginals_00106.csv') # use this one if you want to skip previous step
With the above marginal controls, the methods in allocation.py allocate discrete PUMS households to the subject PUMA.
In [ ]:
allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)
It may be convenient to replace the source population with a synthetic population, either to add heterogeneity to the synthetic population or to obscure the source data set. In the example below, we generate a set of persons and households using the allocator (the PUMS persons allocated to tracts) and the Bayesian Networks (person_model, household_model).
In [ ]:
population = Population.generate(
    allocator, person_model, household_model
)
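Generating from a Bayesian Network amounts to drawing each variable from a distribution conditioned on its parents. The sketch below is not Doppelganger's implementation: it hand-writes a conditional probability table for income given (age, sex), with invented categories and probabilities, and draws from it with the standard-library random module.

```python
import random

# Hypothetical conditional probability table for income given (age, sex).
# The categories and probabilities are invented for illustration.
INCOME_GIVEN_AGE_SEX = {
    ('18-34', 'F'): {'low': 0.5, 'mid': 0.4, 'high': 0.1},
    ('18-34', 'M'): {'low': 0.4, 'mid': 0.4, 'high': 0.2},
}


def sample_income(age, sex, rng):
    """Draw one income category from the table row for (age, sex)."""
    distribution = INCOME_GIVEN_AGE_SEX[(age, sex)]
    categories = list(distribution)
    weights = [distribution[category] for category in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [sample_income('18-34', 'F', rng) for _ in range(1000)]
print({category: draws.count(category) for category in ('low', 'mid', 'high')})
```

Because the draws come from a fitted distribution rather than being copied from PUMS records, the generated people can take attribute combinations that no single source record has, which is where the added heterogeneity comes from.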
We can access the people and households as Pandas DataFrames and work with them directly. Households and people are unique by (tract, serial_number, repeat_index). serial_number is the PUMS serialno for the household.
In [ ]:
people = population.generated_people
households = population.generated_households
sort_cols = [inputs.HOUSEHOLD_ID.name]
print(people.sort_values(sort_cols).head())
print(households.sort_values(sort_cols).head())
To create one fat table of people and household attributes we can join on inputs.HOUSEHOLD_ID.name:
In [ ]:
combined = pd.merge(people, households, on=[inputs.HOUSEHOLD_ID.name])
print(combined.head())
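If you want to see the shape of that join without generating a population first, the sketch below runs the same kind of merge on two tiny made-up tables. The column name household_id stands in for inputs.HOUSEHOLD_ID.name, and the validate argument is an optional safeguard we add here, not something the example above uses.

```python
import pandas as pd

# Tiny made-up tables; 'household_id' stands in for inputs.HOUSEHOLD_ID.name.
people = pd.DataFrame({
    'household_id': [1, 1, 2],
    'age': ['18-34', '0-17', '65+'],
})
households = pd.DataFrame({
    'household_id': [1, 2],
    'num_vehicles': ['1', '0'],
})

# validate='many_to_one' makes pandas raise if household_id repeats in
# the households table, which would silently duplicate people otherwise.
combined = pd.merge(people, households, on='household_id', validate='many_to_one')
print(combined)
```

Each person row picks up the attributes of its household, so the result has one row per person.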
Or write them to disk:
In [ ]:
generated_people_filename = os.path.join(output_dir, 'generated_people.csv')
generated_households_filename = os.path.join(output_dir, 'generated_households.csv')
population.write(generated_people_filename, generated_households_filename)