This section shows how to use PySCD to track historical changes in pandas DataFrame objects. For more background on Slowly Changing Dimensions, see the Wikipedia article on the topic.
In [1]:
import pandas as pd
import pyscd
Use one of the pandas I/O tools to read your data (from CSV, HDF5, ...):
In [2]:
df = pd.read_csv('clients 2015-01.csv')
df
Out[2]:
Now, let's create a dimension from this DataFrame:
In [3]:
dim = pyscd.SlowlyChangingDimension(
    df, source_keys='ssn', as_of='2015-01-01')
dim.df
Out[3]:
I used the as_of parameter here to indicate that this data is from January. When as_of is omitted, the dimension uses the current date.
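To make the idea of an effective date concrete, here is a minimal plain-pandas sketch of what stamping a first snapshot could look like. It is not PySCD's implementation: the helper `initial_load`, the columns `valid_from`, `valid_to` and `is_current`, and the sample rows are all assumptions made up for illustration.

```python
import pandas as pd


def initial_load(snapshot, as_of=None):
    """Stamp every row of a first snapshot with validity metadata.

    Hypothetical helper: the column names below are assumptions,
    not the schema PySCD actually produces.
    """
    # Fall back to today's date (at midnight) when no effective date is given.
    if as_of is None:
        start = pd.Timestamp.today().normalize()
    else:
        start = pd.Timestamp(as_of)

    versioned = snapshot.copy()
    versioned["valid_from"] = start
    versioned["valid_to"] = pd.NaT      # open-ended: still the current version
    versioned["is_current"] = True
    return versioned


# Made-up sample rows, for illustration only.
clients = pd.DataFrame({
    "ssn": ["111-11-1111", "222-22-2222"],
    "name": ["Fred", "Anna"],
    "state": ["FL", "TX"],
})

loaded = initial_load(clients, as_of="2015-01-01")
print(loaded)
```

Passing the date explicitly matters when loading historical files: without it, every row would be marked as valid from the day the load happened to run.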
In [4]:
df = pd.read_csv('clients 2015-02.csv')
df
Out[4]:
Note that Fred has moved from FL to NY.
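A change like this can also be spotted programmatically by comparing the two snapshots. The sketch below uses only plain pandas; the sample rows and the `name` and `state` column names are assumptions (only the `ssn` key and Fred's move from FL to NY come from the text above).

```python
import pandas as pd

# Made-up snapshots, for illustration only.
january = pd.DataFrame({
    "ssn": ["111-11-1111", "222-22-2222"],
    "name": ["Fred", "Anna"],
    "state": ["FL", "TX"],
})
february = pd.DataFrame({
    "ssn": ["111-11-1111", "222-22-2222"],
    "name": ["Fred", "Anna"],
    "state": ["NY", "TX"],
})

# Align the two snapshots on the key; overlapping columns get a suffix.
both = january.merge(february, on="ssn", suffixes=("_old", "_new"))

# Keep only the rows whose tracked attribute differs between snapshots.
moved = both[both["state_old"] != both["state_new"]]
print(moved[["ssn", "name_old", "state_old", "state_new"]])
```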
Before updating the dimension with this new data, let's indicate that this time the data is from February.
In [5]:
dim.as_of = '2015-02-01'
In [6]:
dim.update(df)
In [7]:
dim.df
Out[7]:
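For readers who want to see the mechanics behind an update of this kind, the following is a rough plain-pandas sketch of a Type 2 update: the superseded row is closed and a new version is appended. It is not how PySCD is implemented. The function `scd2_update`, the columns `valid_from`, `valid_to` and `is_current`, and the sample data are hypothetical.

```python
import pandas as pd


def scd2_update(dim, snapshot, key, tracked, as_of):
    """Expire changed rows in `dim` and append their new versions.

    Hypothetical helper; assumes `dim` carries the columns
    valid_from, valid_to and is_current.
    """
    as_of = pd.Timestamp(as_of)
    dim = dim.copy()

    # Compare the current versions with the incoming snapshot.
    current = dim[dim["is_current"]]
    merged = current.merge(snapshot, on=key, suffixes=("", "_new"))
    old_values = merged[tracked].to_numpy()
    new_values = merged[[col + "_new" for col in tracked]].to_numpy()
    changed_keys = merged.loc[(old_values != new_values).any(axis=1), key]

    # Close the superseded versions.
    expire = dim["is_current"] & dim[key].isin(changed_keys)
    dim.loc[expire, "valid_to"] = as_of
    dim.loc[expire, "is_current"] = False

    # Append new versions of changed rows, plus keys never seen before.
    is_new = ~snapshot[key].isin(current[key])
    to_add = snapshot[snapshot[key].isin(changed_keys) | is_new].copy()
    to_add["valid_from"] = as_of
    to_add["valid_to"] = pd.NaT
    to_add["is_current"] = True
    return pd.concat([dim, to_add], ignore_index=True)


# Made-up January dimension and February snapshot, for illustration only.
dim_jan = pd.DataFrame({
    "ssn": ["111-11-1111", "222-22-2222"],
    "name": ["Fred", "Anna"],
    "state": ["FL", "TX"],
})
dim_jan["valid_from"] = pd.Timestamp("2015-01-01")
dim_jan["valid_to"] = pd.NaT
dim_jan["is_current"] = True

feb = pd.DataFrame({
    "ssn": ["111-11-1111", "222-22-2222"],
    "name": ["Fred", "Anna"],
    "state": ["NY", "TX"],
})

result = scd2_update(dim_jan, feb, key="ssn", tracked=["state"], as_of="2015-02-01")
print(result)
```

In this sketch, Fred ends up with two rows: the FL version closed on the February date and a current NY version, while unchanged rows are left untouched. Rows that disappear from the snapshot are deliberately not handled here.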