Welcome to the simplified Doppelganger example. If you have not already done so, please see the README document for installation instructions and information on what Doppelganger is doing under the hood. For a more thorough walkthrough, take a look at doppelganger_example_full.
In [ ]:
import pandas as pd
from doppelganger import (
allocation,
inputs,
Configuration,
HouseholdAllocator,
PumsData,
SegmentedData,
BayesianNetworkModel,
Population,
Preprocessor,
Marginals
)
configuration = Configuration.from_file('sample_data/config.json')
In [ ]:
PUMA = '00106'
preprocessor = Preprocessor.from_config(configuration.preprocessing_config)
# Take PUMS fields from the config and from the default fields needed for
# the household allocation process.
household_fields = tuple(set(
    field.name for field in allocation.DEFAULT_HOUSEHOLD_FIELDS
).union(set(configuration.household_fields)))
households_data = PumsData.from_csv('sample_data/households_00106_dirty.csv').clean(
    household_fields, preprocessor, puma=PUMA
)
persons_fields = tuple(set(
    field.name for field in allocation.DEFAULT_PERSON_FIELDS
).union(set(configuration.person_fields)))
persons_data = PumsData.from_csv('sample_data/persons_00106_dirty.csv').clean(
    persons_fields, preprocessor, puma=PUMA
)
In [ ]:
controls = Marginals.from_csv('sample_data/marginals_00106.csv')
Now use HouseholdAllocator to generate household allocations.
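To build intuition for what allocation does, here is a conceptual sketch in plain Python. It is not doppelganger's actual algorithm; it just illustrates the idea of rescaling sample-household weights so the weighted count matches a tract's marginal control total. All names and numbers below are made up for illustration.

```python
# Conceptual sketch of household allocation (NOT doppelganger's actual
# algorithm): rescale sample-household weights proportionally so the
# weighted household count matches a marginal control total.

def allocate(weights, control_total):
    """Scale weights so they sum to the control total."""
    factor = control_total / sum(weights)
    return [w * factor for w in weights]

# Three sample households with survey weights; the tract's control says
# there should be 120 households in total.
pums_weights = [10.0, 30.0, 20.0]
allocated = allocate(pums_weights, control_total=120)
print(allocated)  # each weight doubled: [20.0, 60.0, 40.0]
```

The real allocator fits weights against several controls at once, but the goal is the same: make the weighted sample reproduce the marginal totals.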
In [ ]:
allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)
Let's create models to generate characteristics for the people and households we just allocated. We'll start by loading up our PUMS data. Our model learns a different probability distribution for each category of person. The category can be whatever you want; it is specified by passing a segmentation function when you load the training data.
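A segmenter is just a function mapping one row of training data to a segment key; the model then learns a separate distribution per key. The following toy sketch (plain Python with made-up rows, not the doppelganger API) shows how such a function partitions rows:

```python
# Toy illustration of segmentation: group made-up person rows by the
# value a segmenter function returns for each row.
from collections import defaultdict

rows = [
    {'age': '18-34', 'income': 'low'},
    {'age': '35-64', 'income': 'high'},
    {'age': '18-34', 'income': 'high'},
]

segmenter = lambda row: row['age']  # mirrors x[inputs.AGE.name] below

segments = defaultdict(list)
for row in rows:
    segments[segmenter(row)].append(row)

print(sorted(segments))         # ['18-34', '35-64']
print(len(segments['18-34']))   # 2
```

Any function of a row works as a segment key, which is why segmenting by age for people and by household size for households, as below, is just a choice.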
In [ ]:
segmentation_function = lambda x: x[inputs.AGE.name]
person_training_data = SegmentedData.from_data(
    persons_data,
    list(configuration.person_fields),
    inputs.PERSON_WEIGHT.name,
    segmenter=segmentation_function
)
person_model = BayesianNetworkModel.train(
    person_training_data,
    configuration.person_structure,
    configuration.person_fields
)
household_segmenter = lambda x: x[inputs.NUM_PEOPLE.name]
household_training_data = SegmentedData.from_data(
    households_data,
    list(configuration.household_fields),
    inputs.HOUSEHOLD_WEIGHT.name,
    household_segmenter,
)
household_model = BayesianNetworkModel.train(
    household_training_data,
    configuration.household_structure,
    configuration.household_fields
)
In [ ]:
population = Population.generate(allocator, person_model, household_model)
We can access the people and households as pandas DataFrames and work with them directly. Each household has a unique household_id, and every generated person carries the household_id of the household it belongs to, so we can join the two tables into a single wide table of person and household attributes.
In [ ]:
people = population.generated_people
households = population.generated_households
merge_cols = [inputs.HOUSEHOLD_ID.name]
combined = pd.merge(people, households, on=merge_cols)
print(combined)
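The join above is a plain pandas merge. This self-contained toy sketch (made-up columns, not actual doppelganger output) shows the shape of the result: each person row picks up its household's attributes, and a household's values repeat across its members.

```python
import pandas as pd

# Toy person and household tables keyed by household_id.
people = pd.DataFrame({
    'household_id': [1, 1, 2],
    'age': ['18-34', '35-64', '65+'],
})
households = pd.DataFrame({
    'household_id': [1, 2],
    'num_people': [2, 1],
})

# Inner merge on the shared key: one output row per person.
combined = pd.merge(people, households, on=['household_id'])
print(combined)
# Household 1's num_people appears on both of its members.
```

Note that if the two tables shared any non-key column names, pandas would disambiguate them with `_x`/`_y` suffixes by default (configurable via `suffixes=`).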
We can easily save this population to CSV files.
In [ ]:
population.write('generated_people.csv', 'generated_households.csv')
We can also save any of our intermediate stages and load them up again whenever we want. For example, we can save our Bayesian network models and reuse them later with the same or different household allocations.
In [ ]:
person_model.write('person_model.json')
person_model_reloaded = BayesianNetworkModel.from_file('person_model.json', segmenter=segmentation_function)
To try this out on the PUMA of your choice and learn to make other customizations, take a look at doppelganger_example_full.