Title: Convert A String Categorical Variable With Patsy
Slug: pandas_convert_string_categorical_to_numeric_with_patsy
Summary: Convert A String Categorical Variable With Patsy
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

Import modules


In [1]:
import pandas as pd
import patsy

Create dataframe


In [2]:
raw_data = {'patient': [1, 1, 1, 0, 0],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong']}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
df


Out[2]:
   patient  obs  treatment   score
0        1    1          0  strong
1        1    2          1    weak
2        1    3          0  normal
3        0    1          1    weak
4        0    2          0  strong

Convert df['score'] into a categorical variable ready for regression (i.e. set one category as the baseline; here patsy uses 'normal', the first level in sorted order)


In [3]:
# Dummy-code the 'score' column of df and return a dataframe;
# patsy adds an intercept column and drops the baseline level
patsy.dmatrix('score', df, return_type='dataframe')


Out[3]:
   Intercept  score[T.strong]  score[T.weak]
0        1.0              1.0            0.0
1        1.0              0.0            1.0
2        1.0              0.0            0.0
3        1.0              0.0            1.0
4        1.0              1.0            0.0
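
If a different baseline is more natural for your analysis, patsy's formula syntax lets you name it. The sketch below is a minimal example, assuming you want 'weak' as the reference level; it uses patsy's `C()` wrapper with `Treatment(reference=...)`, and the names `design`, `indicator_columns`, and `weak_rows` are illustrative only.

```python
import pandas as pd
import patsy

# Same 'score' values as the dataframe above
df = pd.DataFrame({'score': ['strong', 'weak', 'normal', 'weak', 'strong']})

# Treatment(reference='weak') asks patsy to use 'weak' as the baseline
# instead of the default first level in sorted order ('normal')
design = patsy.dmatrix("C(score, Treatment(reference='weak'))", df,
                       return_type='dataframe')

# The baseline level gets no indicator column of its own, so rows where
# score == 'weak' are 0 in every column except the intercept
indicator_columns = [c for c in design.columns if c != 'Intercept']
weak_rows = design.loc[df['score'] == 'weak', indicator_columns]

print(design)
```

The choice of baseline only changes how the coefficients are interpreted (each one is a difference from the reference level), not the fit of the model.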

Convert df['score'] into a categorical variable without setting one category as the baseline

This is likely what you want when every category should get its own indicator column


In [4]:
# Remove the intercept with '- 1' so that every level of 'score'
# gets its own indicator column, then return a dataframe
patsy.dmatrix('score - 1', df, return_type='dataframe')


Out[4]:
   score[normal]  score[strong]  score[weak]
0            0.0            1.0          0.0
1            0.0            0.0          1.0
2            1.0            0.0          0.0
3            0.0            0.0          1.0
4            0.0            1.0          0.0
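
One way to see why the version with an intercept drops a category: the three indicator columns always sum to 1, so together with a constant column they would be redundant. The following is a rough check of that idea using NumPy's `matrix_rank`; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
import patsy

# Same 'score' values as the dataframe above
df = pd.DataFrame({'score': ['strong', 'weak', 'normal', 'weak', 'strong']})

# One indicator column per level, no intercept
full = patsy.dmatrix('score - 1', df, return_type='dataframe')

# Each row has exactly one indicator set to 1
row_sums = full.sum(axis=1)

# Stacking a column of ones next to the three indicators gives four columns,
# but the constant is the sum of the indicators, so the rank stays at 3
with_constant = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank = np.linalg.matrix_rank(with_constant)

print(row_sums.tolist(), rank)
```

This is why a regression that already includes an intercept usually needs the baseline-coded matrix, while the full set of indicators suits models without one.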

Create an interaction variable that is 1 only when both patient and treatment are 1


In [5]:
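# 'patient:treatment' is the interaction (product) of the two columns; '- 1' removes the intercept
patsy.dmatrix('patient + treatment + patient:treatment - 1', df, return_type='dataframe')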


Out[5]:
   patient  treatment  patient:treatment
0      1.0        0.0                0.0
1      1.0        1.0                1.0
2      1.0        0.0                0.0
3      0.0        1.0                0.0
4      0.0        0.0                0.0
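
For numeric columns like these, the interaction term is just the element-wise product of the two variables. As a small sketch, the code below checks that against a plain pandas multiplication and also shows patsy's `*` shorthand, which expands to both main effects plus their interaction. The names `explicit`, `shorthand`, and `product` are illustrative.

```python
import pandas as pd
import patsy

# Same 'patient' and 'treatment' values as the dataframe above
df = pd.DataFrame({'patient': [1, 1, 1, 0, 0],
                   'treatment': [0, 1, 0, 1, 0]})

# The formula used above, written out in full
explicit = patsy.dmatrix('patient + treatment + patient:treatment - 1', df,
                         return_type='dataframe')

# 'a * b' is shorthand for 'a + b + a:b'
shorthand = patsy.dmatrix('patient * treatment - 1', df,
                          return_type='dataframe')

# The interaction column should match the element-wise product
product = df['patient'] * df['treatment']

print(shorthand)
```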