Doppelganger!

Welcome to the Doppelganger example. If you have not already done so, please see the README for installation instructions and information on what Doppelganger is doing under the hood. For a simplified walkthrough, take a look at doppelganger_example_simple.

What's in this Example?

This workbook acts on a single PUMA and carries out the following tasks:

  1. Builds household- and person-specific Bayesian Networks for an individual PUMA;
  2. Allocates PUMS households to an individual PUMA consistent with subjectively-weighted marginal controls; and,
  3. Replaces the drawn PUMS population with synthetic people created by moving through the household- and person-specific Bayesian Networks.

Housekeeping

Before we get going, let's take care of some housekeeping tasks to make your Doppelganger Example experience as seamless as possible.

We have included a cross-walk between PUMAs and Census tracts that we'll use later. Please navigate to this file and unzip it in the same directory. The file is here:


In [ ]:
# ./examples/sample_data/2010_puma_tract_mapping.txt.zip
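If you would rather unzip the cross-walk from Python than from a file browser, the standard-library zipfile module can do it. The following is a minimal sketch: the helper name unzip_in_place is ours and is not part of Doppelganger, and the demo builds a throwaway archive with made-up contents so the snippet runs anywhere.

```python
import os
import tempfile
import zipfile


def unzip_in_place(zip_path):
    """Extract every member of zip_path into the directory that holds it."""
    target_dir = os.path.dirname(os.path.abspath(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return [os.path.join(target_dir, name) for name in archive.namelist()]


# Demo on a throwaway archive; the file name and contents are placeholders,
# not the real PUMA/tract cross-walk.
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, 'demo_mapping.txt.zip')
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('demo_mapping.txt', 'state,puma\n06,00106\n')

extracted = unzip_in_place(demo_zip)
print(extracted)
```

For this example you would call unzip_in_place('./examples/sample_data/2010_puma_tract_mapping.txt.zip') instead of the demo archive.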

The example operates on a single PUMA in California (00106). In Step 01 below, you'll see that we have already extracted PUMS data for this PUMA, so we'd recommend that you do not change to your favorite PUMA just yet (here's a set of reference maps for California -- all states are available). But once you get comfortable with the tools, download the PUMS data for the PUMA of your choice, and go for it.


In [ ]:
STATE = '06'
PUMA = '00106'

Later in the example we grab some Census data using the Census API. Enter your Census key below (if you need one, you can get one for free here).


In [ ]:
MY_CENSUS_KEY = ''

This example will generate a variety of output. Please specify where you'd like this data written to disk.


In [ ]:
output_dir = '.'

Configuration

We'll need to configure your computing environment and load in the configuration file before getting started.

Import the relevant Doppelganger Python packages


In [ ]:
import csv
import os

import pandas as pd

from doppelganger import (
    allocation,
    inputs,
    Configuration,
    HouseholdAllocator,
    PumsData,
    SegmentedData,
    BayesianNetworkModel,
    Population,
    Preprocessor,
    Marginals
)

Load the Doppelganger example configuration file

This file does the following four things:

  1. Defines person-specific variables in person_fields. In the example, you'll see age, sex, and individual_income. These variables are mapped to the PUMS variables in inputs.py. For example, age in Doppelganger is mapped to the PUMS variable agep. To use other variables from the PUMS with Doppelganger, you'll need to map their relationships in inputs.py and specify them here.
  2. Defines household-specific variables in household_fields. In the example, you'll see household_income and num_vehicles. As with the person-specific variables, you'll need to modify inputs.py to use other variables in Doppelganger.
  3. Defines procedures to process input variables into bins in preprocessing.
  4. Defines the structure of the household and person Bayesian Networks in network_config_files.

In [ ]:
configuration = Configuration.from_file('sample_data/config.json')
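To make the four sections concrete, here is a rough sketch of what such a file might contain and how to check that all four are present. Only the section names and field names come from the description above; the nesting inside preprocessing and network_config_files, the bin edges, and the file names are placeholders we made up, so treat the shipped sample_data/config.json as the authority.

```python
import json

# Illustrative only: section names follow the list above, but every value
# and the inner layout of each section is a made-up placeholder.
EXAMPLE_CONFIG_TEXT = """
{
  "person_fields": ["age", "sex", "individual_income"],
  "household_fields": ["household_income", "num_vehicles"],
  "preprocessing": {"individual_income": {"bins": [0, 40000, 80000]}},
  "network_config_files": {
    "person": "person_structure.json",
    "household": "household_structure.json"
  }
}
"""

EXPECTED_SECTIONS = (
    'person_fields',
    'household_fields',
    'preprocessing',
    'network_config_files',
)

example_config = json.loads(EXAMPLE_CONFIG_TEXT)
# Collect any expected section that the parsed file does not define.
missing_sections = [s for s in EXPECTED_SECTIONS if s not in example_config]
print('missing sections:', missing_sections)
```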

With the configuration in hand, we can create a preprocessor object that will create methods to apply to the household and person PUMS data. In the current configuration, the preprocessor bins the individual income variable.


In [ ]:
preprocessor = Preprocessor.from_config(configuration.preprocessing_config)
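Binning simply maps a continuous value such as income onto a small set of labelled ranges, which keeps the Bayesian Network's tables manageable. The sketch below shows the idea with the standard library only; the bin edges and the bin_income helper are assumptions for illustration and do not reflect how Preprocessor is implemented.

```python
from bisect import bisect_right

# Hypothetical bin edges in dollars; the real ones live in the configuration file.
INCOME_BIN_EDGES = [0, 40000, 80000]


def bin_income(income, edges=INCOME_BIN_EDGES):
    """Return a label for the half-open bin [low, high) that contains income.

    Values below the first edge are placed in the first bin; values at or
    above the last edge are placed in an open-ended top bin.
    """
    index = max(bisect_right(edges, income) - 1, 0)
    if index == len(edges) - 1:
        return '{}+'.format(edges[index])
    return '{}-{}'.format(edges[index], edges[index + 1])


binned = [bin_income(x) for x in (-500, 0, 39999, 40000, 125000)]
print(binned)
```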

Step 01: Let's Build some Bayesian Networks!

Bayesian Networks are built using the specification (network_config_files) in the configuration file and cleaned PUMS data. Up first: let's read in and clean the PUMS data.

Data reads

Household Data

The fields we need from the household data are the union of those defined in allocation.DEFAULT_HOUSEHOLD_FIELDS and those defined in the household_fields section of the configuration file. The raw/dirty data collected for our example PUMA is available here.


In [ ]:
household_fields = tuple(set(
    field.name for field in allocation.DEFAULT_HOUSEHOLD_FIELDS).union(
        set(configuration.household_fields)
))

households_data = PumsData.from_csv('sample_data/households_00106_dirty.csv').clean(
    household_fields, preprocessor, puma=PUMA
)
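The union step is easier to see on toy data. In the sketch below, Field is a stand-in for Doppelganger's field objects (all we assume is a .name attribute), and both lists of names are hypothetical.

```python
from collections import namedtuple

# Stand-in for Doppelganger's field objects: we rely only on a .name attribute.
Field = namedtuple('Field', ['name'])

# Hypothetical defaults and configured names, chosen so that they overlap.
default_fields = [
    Field('serial_number'),
    Field('num_people'),
    Field('household_weight'),
]
configured_fields = ['num_people', 'household_income', 'num_vehicles']

# The set union drops the duplicate 'num_people'; sorting makes the order repeatable.
union_fields = tuple(sorted(
    {field.name for field in default_fields} | set(configured_fields)
))
print(union_fields)
```

The notebook cell above does not sort, so the order of its tuple can differ from run to run; that is harmless as long as the fields are looked up by name.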

Person Data

allocation.DEFAULT_PERSON_FIELDS defines a set of fields that can be used to create the persons_data; we take the union of these defaults with those defined in the person_fields section of the configuration file. This data is then extracted from the raw/dirty PUMS data.


In [ ]:
persons_fields = tuple(set(
    field.name for field in allocation.DEFAULT_PERSON_FIELDS).union(
        set(configuration.person_fields)
))
persons_data = PumsData.from_csv('sample_data/persons_00106_dirty.csv').clean(
    persons_fields, preprocessor, puma=PUMA
)

Bayesian Networks

The example below shows the training of a person-level model. It requires the following inputs:

  1. The persons_data, which was created in the previous step;
  2. A list of variables you want to consider in the training -- taken here from our configuration file; and,
  3. An optional weight if you want to weight the records (in the example, we use the PUMS variable pwgtp, which is defined in inputs.py).

Once the training data is prepared, the training needs the person_structure, which is defined in the network_config_files section of the configuration file, and the person_fields, which are also defined in the configuration file. The example person_structure specifies a network with three nodes (age, sex, and income) and two edges (age --> income, and sex --> income). The example household_structure specifies a network with three nodes (num_people, household_income, and num_vehicles) and three edges (num_people --> household_income, num_people --> num_vehicles, and household_income --> num_vehicles).


In [ ]:
person_training_data = SegmentedData.from_data(
    persons_data,
    list(configuration.person_fields),
    weight_field=inputs.PERSON_WEIGHT.name
)
person_model = BayesianNetworkModel.train(
    person_training_data,
    configuration.person_structure,
    configuration.person_fields
)
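To build some intuition for what the trained person network represents, the sketch below hard-codes the example structure (age --> income, sex --> income) and draws synthetic people from it in parent-before-child order. Everything here is an assumption for illustration: the category labels and probabilities are invented, and the code is plain Python rather than anything from BayesianNetworkModel.

```python
import random

# Edges of the example person network, written as (parent, child).
EDGES = [('age', 'income'), ('sex', 'income')]

# Invented probabilities; a trained model would estimate these from
# weighted PUMS records.
P_AGE = {'young': 0.4, 'old': 0.6}
P_SEX = {'female': 0.5, 'male': 0.5}
P_INCOME = {  # conditional on the two parents, keyed by (age, sex)
    ('young', 'female'): {'low': 0.7, 'high': 0.3},
    ('young', 'male'): {'low': 0.6, 'high': 0.4},
    ('old', 'female'): {'low': 0.4, 'high': 0.6},
    ('old', 'male'): {'low': 0.3, 'high': 0.7},
}


def parents_of(node, edges=EDGES):
    """List the parents of node implied by the edge list."""
    return sorted(parent for parent, child in edges if child == node)


def draw(distribution, rng):
    """Draw one value from a {value: probability} mapping."""
    values = list(distribution)
    weights = [distribution[value] for value in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_person(rng):
    """Sample the root nodes first, then income given its parents."""
    age = draw(P_AGE, rng)
    sex = draw(P_SEX, rng)
    income = draw(P_INCOME[(age, sex)], rng)
    return {'age': age, 'sex': sex, 'income': income}


rng = random.Random(0)  # fixed seed so repeated runs match
synthetic_people = [sample_person(rng) for _ in range(5)]
print(parents_of('income'))
print(synthetic_people)
```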

The Bayesian Networks can also be fully segmented. In the below example, we rebuild the networks, fully segmented by age category via the segmenter functionality.


In [ ]:
person_segmentation = lambda x: x[inputs.AGE.name]

person_training_data = SegmentedData.from_data(
    persons_data,
    list(configuration.person_fields),
    inputs.PERSON_WEIGHT.name,
    person_segmentation
)
person_model = BayesianNetworkModel.train(
    person_training_data,
    configuration.person_structure,
    configuration.person_fields
)
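A segmenter is just a function that maps a record to a segment key, so that records sharing a key can be handled together. The sketch below groups invented person records by age category and totals their weights per segment. How SegmentedData stores segments internally is not shown in this notebook, so the dictionary layout here is our assumption.

```python
from collections import defaultdict

# Invented person records; 'weight' stands in for the PUMS person weight.
toy_records = [
    {'age': '0-17', 'sex': 'F', 'weight': 10},
    {'age': '18-64', 'sex': 'M', 'weight': 25},
    {'age': '18-64', 'sex': 'F', 'weight': 15},
]


def segment_records(rows, segmenter):
    """Group rows by the key that segmenter returns for each row."""
    segments = defaultdict(list)
    for row in rows:
        segments[segmenter(row)].append(row)
    return dict(segments)


toy_segments = segment_records(toy_records, lambda row: row['age'])
# Total weight represented by each segment.
segment_weights = {
    key: sum(row['weight'] for row in rows)
    for key, rows in toy_segments.items()
}
print(segment_weights)
```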

The Bayesian Network can be written to disk and read from disk as follows.


In [ ]:
person_model_filename = os.path.join(output_dir, 'person_model.json')
person_model.write(person_model_filename)
person_model_reloaded = BayesianNetworkModel.from_file(person_model_filename)

Following the same steps as above, you can also build a household network.


In [ ]:
household_segmenter = lambda x: x[inputs.NUM_PEOPLE.name]

household_training_data = SegmentedData.from_data(
    households_data,
    list(configuration.household_fields),
    inputs.HOUSEHOLD_WEIGHT.name,
    household_segmenter,
)
household_model = BayesianNetworkModel.train(
    household_training_data,
    configuration.household_structure,
    configuration.household_fields
)

household_model_filename = os.path.join(output_dir, 'household_model.json')
household_model.write(household_model_filename)

Step 02: Allocate PUMS households to the PUMA

The sample data comes with a set of marginals that can be used in the allocation step. This file is named sample_data/marginals_00106.csv. To demonstrate how marginals can be created, we create them here. Note that this step will take a few minutes. If you want to skip it, use the controls = Marginals.from_csv('sample_data/marginals_00106.csv') call in the subsequent code box.


In [ ]:
if MY_CENSUS_KEY:
    new_marginal_filename = os.path.join(output_dir, 'new_marginals.csv')

    with open('sample_data/2010_puma_tract_mapping.txt') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        marginals = Marginals.from_census_data(
            csv_reader, MY_CENSUS_KEY, state=STATE, pumas=[PUMA]
        )
        marginals.write(new_marginal_filename)

In [ ]:
# controls = Marginals.from_csv(new_marginal_filename) # use this one to use your new marginals
controls = Marginals.from_csv('sample_data/marginals_00106.csv') # use this one if you want to skip previous step

With the above marginal controls, the methods in allocation.py allocate discrete PUMS households to the subject PUMA.


In [ ]:
allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)
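To see what "consistent with marginal controls" means, it can help to work a toy case by hand. The sketch below uses iterative proportional fitting (raking), a textbook technique we have swapped in for illustration. It is not a description of how HouseholdAllocator solves the problem, and it ignores the subjective weighting of controls. The households and control totals are invented.

```python
# Invented households for a single tract, one per combination of categories.
toy_households = [
    {'id': 'a', 'num_people': '1', 'num_vehicles': '0'},
    {'id': 'b', 'num_people': '1', 'num_vehicles': '1+'},
    {'id': 'c', 'num_people': '2+', 'num_vehicles': '0'},
    {'id': 'd', 'num_people': '2+', 'num_vehicles': '1+'},
]
# Invented control totals; both controls sum to the same 100 households.
toy_controls = {
    'num_people': {'1': 30.0, '2+': 70.0},
    'num_vehicles': {'0': 20.0, '1+': 80.0},
}


def weighted_total(households, weights, field, category):
    """Sum the weights of households whose field equals category."""
    return sum(weights[h['id']] for h in households if h[field] == category)


def rake(households, controls, iterations=50):
    """Rescale household weights, one control at a time, toward each total."""
    weights = {h['id']: 1.0 for h in households}
    for _ in range(iterations):
        for field, targets in controls.items():
            totals = {
                category: weighted_total(households, weights, field, category)
                for category in targets
            }
            for h in households:
                category = h[field]
                weights[h['id']] *= targets[category] / totals[category]
    return weights


toy_weights = rake(toy_households, toy_controls)
print(toy_weights)
```

Each resulting weight can be read as the number of copies of that household placed in the tract. The sketch assumes every control category is represented by at least one household; otherwise the rescaling step would divide by zero.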

Step 03: Replace the PUMS Persons with Synthetic Persons created from the Bayesian Network

It may be convenient to replace the source population with a synthetic population -- to add heterogeneity to the synthetic population or to obscure the source data set. In the example below, we generate a set of persons and households using the allocator (the PUMS persons allocated to tracts) and the Bayesian Networks (person_model, household_model).


In [ ]:
population = Population.generate(
    allocator, person_model, household_model
)

We can access the people and households as Pandas DataFrames and work with them directly. Households and people are unique by (tract, serial_number, repeat_index). serial_number is the PUMS serialno for the household.


In [ ]:
people = population.generated_people
households = population.generated_households

sort_cols = [inputs.HOUSEHOLD_ID.name]

print(people.sort_values(sort_cols).head())
print(households.sort_values(sort_cols).head())
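If you want to confirm the uniqueness claim on household rows, one way is to count how often each (tract, serial_number, repeat_index) combination occurs. The sketch below does this on invented rows with plain Python; the duplicate_keys helper is ours, and the column names are taken from the description above rather than checked against the generated DataFrames.

```python
from collections import Counter

# Invented household rows; the same PUMS household can recur within a tract
# (different repeat_index) and across tracts.
toy_rows = [
    {'tract': '000100', 'serial_number': 'A1', 'repeat_index': 0},
    {'tract': '000100', 'serial_number': 'A1', 'repeat_index': 1},
    {'tract': '000200', 'serial_number': 'A1', 'repeat_index': 0},
]
KEY_COLUMNS = ('tract', 'serial_number', 'repeat_index')


def duplicate_keys(rows, key_columns=KEY_COLUMNS):
    """Return every composite key that appears on more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]


no_duplicates = duplicate_keys(toy_rows)
with_duplicate = duplicate_keys(toy_rows + [toy_rows[0]])
print(no_duplicates, with_duplicate)
```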

To create one fat table of people and household attributes we can join on inputs.HOUSEHOLD_ID.name:


In [ ]:
combined = pd.merge(people, households, on=[inputs.HOUSEHOLD_ID.name])

print(combined.head())

Or write them to disk:


In [ ]:
generated_people_filename = os.path.join(output_dir, 'generated_people.csv')
generated_households_filename = os.path.join(output_dir, 'generated_households.csv')

population.write(generated_people_filename, generated_households_filename)
