Doppelganger! (Simple)

Welcome to the simplified Doppelganger example. If you have not already done so, see the README for installation instructions and an explanation of what Doppelganger does under the hood. For a more thorough walkthrough, take a look at doppelganger_example_full.

Getting Started

Doppelganger lets you configure which census fields you use, the relationships among these fields (the network structure), and the data preprocessing. We'll begin by importing the necessary packages and then loading a simple configuration file.


In [ ]:
import pandas as pd

from doppelganger import (
    allocation,
    inputs,
    Configuration,
    HouseholdAllocator,
    PumsData,
    SegmentedData,
    BayesianNetworkModel,
    Population,
    Preprocessor,
    Marginals
)

configuration = Configuration.from_file('sample_data/config.json')

Loading and Cleaning Data

The following loads our data and cleans it according to the configuration. Reusing the same preprocessor ensures all data is cleaned consistently. We'll use California's PUMA 00106 for our demonstration.


In [ ]:
PUMA = '00106'

preprocessor = Preprocessor.from_config(configuration.preprocessing_config)

# Take PUMS fields from the config and from the default fields needed for
# the household allocation process.
household_fields = tuple(
    {field.name for field in allocation.DEFAULT_HOUSEHOLD_FIELDS}
    | set(configuration.household_fields)
)

households_data = PumsData.from_csv('sample_data/households_00106_dirty.csv').clean(
    household_fields, preprocessor, puma=PUMA
)

persons_fields = tuple(
    {field.name for field in allocation.DEFAULT_PERSON_FIELDS}
    | set(configuration.person_fields)
)
persons_data = PumsData.from_csv('sample_data/persons_00106_dirty.csv').clean(
    persons_fields, preprocessor, puma=PUMA
)
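
The field lists above are built as a set union, so a field named both in the defaults and in the configuration appears only once. Here is a minimal sketch of that idea with made-up field names. The `Field` tuple and every name below are illustrative stand-ins, not Doppelganger's actual definitions.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's field objects, which expose `.name`.
Field = namedtuple('Field', ['name'])

# Illustrative values only; the real defaults live in `allocation`.
default_fields = (Field('serial_number'), Field('num_people'))
configured_fields = ('num_people', 'household_income')

# The union drops the duplicate 'num_people'; sorting makes the order stable.
combined_fields = tuple(sorted(
    {field.name for field in default_fields} | set(configured_fields)
))
print(combined_fields)
```

A plain `tuple(set(...))` has no guaranteed order. The sort is optional and is shown here only to make the output reproducible.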

Household Allocation

Now we will allocate persons and households to tracts to align with census controls. First, load our controls based on ACS marginals.


In [ ]:
controls = Marginals.from_csv('sample_data/marginals_00106.csv')

Now use HouseholdAllocator to generate household allocations.


In [ ]:
allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)
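
To give a feel for what "aligning with controls" means, here is a deliberately simplified sketch that rescales household weights so their sum matches a single control total. This is plain proportional rescaling with hypothetical names and numbers. It is not the method HouseholdAllocator uses, which works against a full set of marginals rather than one total.

```python
def rescale_weights(weights, control_total):
    """Scale weights proportionally so that they sum to control_total."""
    current_total = sum(weights)
    if current_total <= 0:
        raise ValueError('weights must have a positive sum')
    factor = control_total / current_total
    return [weight * factor for weight in weights]


# Hypothetical survey weights for four sampled households.
sample_weights = [10.0, 20.0, 30.0, 40.0]

# Hypothetical control: the tract should contain 250 households in total.
tract_weights = rescale_weights(sample_weights, control_total=250)
print(tract_weights)
```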

Bayesian Network Generation

Let's create models to generate characteristics for the people and households we just allocated. We'll start by turning our cleaned PUMS data into training data. The model learns a separate probability distribution for each category of person. The category can be whatever you want; you specify it by passing a segmentation function when you load the training data.


In [ ]:
segmentation_function = lambda x: x[inputs.AGE.name]
person_training_data = SegmentedData.from_data(
    persons_data,
    list(configuration.person_fields),
    inputs.PERSON_WEIGHT.name,
    segmenter=segmentation_function
)
person_model = BayesianNetworkModel.train(
    person_training_data,
    configuration.person_structure,
    configuration.person_fields
)

household_segmenter = lambda x: x[inputs.NUM_PEOPLE.name]

household_training_data = SegmentedData.from_data(
    households_data,
    list(configuration.household_fields),
    inputs.HOUSEHOLD_WEIGHT.name,
    segmenter=household_segmenter,
)
household_model = BayesianNetworkModel.train(
    household_training_data,
    configuration.household_structure,
    configuration.household_fields
)
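
To make segmentation concrete, the sketch below groups toy records by a segmenter function and computes a weighted distribution of one field within each segment. The records, field names, and the `weighted_distributions` helper are all assumptions made for illustration. This is not how SegmentedData or BayesianNetworkModel are implemented.

```python
from collections import defaultdict


def weighted_distributions(records, segmenter, field, weight_field):
    """Return {segment: {value: probability}} from weighted counts.

    Assumes every segment has a positive total weight.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for record in records:
        totals[segmenter(record)][record[field]] += record[weight_field]

    distributions = {}
    for segment, counts in totals.items():
        segment_total = sum(counts.values())
        distributions[segment] = {
            value: count / segment_total for value, count in counts.items()
        }
    return distributions


# Hypothetical person records; categories and weights are made up.
people = [
    {'age': '18-34', 'income': 'low', 'weight': 3.0},
    {'age': '18-34', 'income': 'high', 'weight': 1.0},
    {'age': '65+', 'income': 'low', 'weight': 2.0},
]

by_age = weighted_distributions(
    people,
    segmenter=lambda person: person['age'],
    field='income',
    weight_field='weight',
)
print(by_age)
```

Swapping in a different segmenter function changes how the records are grouped without touching the rest of the code, which is the flexibility the segmentation function provides.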

Population Synthesis

Now for the main event! We can synthesize a population by taking the household allocations we produced above and filling out missing categories with our Bayesian Networks.


In [ ]:
population = Population.generate(allocator, person_model, household_model)
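
Filling out missing categories amounts to sampling each unknown attribute conditioned on what is already known. The sketch below does this for a single hypothetical conditional table. The table, the categories, and the `sample_income` function are assumptions. A real Bayesian network chains many such tables together.

```python
import random

# Hypothetical conditional table P(income | age); not learned from real data.
income_given_age = {
    '18-34': {'low': 0.6, 'high': 0.4},
    '65+': {'low': 0.8, 'high': 0.2},
}


def sample_income(age, rng):
    """Draw an income category conditioned on a known age category."""
    distribution = income_given_age[age]
    values = list(distribution)
    weights = [distribution[value] for value in values]
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs draw the same values

# Toy "allocated" people whose age is known but whose income is missing.
allocated_people = [{'age': '18-34'}, {'age': '65+'}, {'age': '65+'}]
for person in allocated_people:
    person['income'] = sample_income(person['age'], rng)

print(allocated_people)
```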

We can access the people and households as pandas DataFrames and work with them directly. Both tables carry a household_id column, so we can join them on it to create one wide table of individual and household attributes.


In [ ]:
people = population.generated_people
households = population.generated_households

merge_cols = [inputs.HOUSEHOLD_ID.name]
combined = pd.merge(people, households, on=merge_cols)

print(combined)

We can easily save this population to CSV files, one for people and one for households.


In [ ]:
population.write('generated_people.csv', 'generated_households.csv')

We can also save any of our intermediate stages and load them again whenever we want. For example, we could save our Bayesian networks and reuse them later with the same or different household allocations.


In [ ]:
person_model.write('person_model.json')
person_model_reloaded = BayesianNetworkModel.from_file('person_model.json', segmenter=segmentation_function)
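
The reload call above passes the segmenter again. A likely reason is that JSON can store data but not Python functions, so the function has to be supplied when the file is read back. The toy class below illustrates that pattern. It is entirely hypothetical and is not Doppelganger's implementation.

```python
import json
import os
import tempfile


class ToyModel:
    """Hypothetical model: per-segment distributions plus a segmenter."""

    def __init__(self, distributions, segmenter):
        self.distributions = distributions
        self.segmenter = segmenter

    def write(self, path):
        # Only the data is written; the segmenter function is not.
        with open(path, 'w') as handle:
            json.dump(self.distributions, handle)

    @classmethod
    def from_file(cls, path, segmenter):
        # The caller supplies the segmenter again when loading.
        with open(path) as handle:
            return cls(json.load(handle), segmenter)


def age_segmenter(person):
    return person['age']


model = ToyModel({'18-34': {'low': 0.75, 'high': 0.25}}, age_segmenter)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'toy_model.json')
    model.write(path)
    reloaded = ToyModel.from_file(path, segmenter=age_segmenter)

print(reloaded.distributions == model.distributions)
```

If the pattern holds for the real model, the practical consequence is that the segmenter used at load time should match the one used for training, since the saved distributions are keyed by the segments it produced.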

Customize by PUMA

To try this out on the PUMA of your choice and learn to make other customizations, take a look at doppelganger_example_full.