Many statistical Python packages can deal with numpy Arrays.
Numpy Arrays however are not always easy to use.
Pandas is a package that provides a dataframe interface, similar to what R uses as the main data structure. Since Pandas has become so popular, many packages accept both pd.DataFrames and numpy Arrays.
In [1]:
import os
from dotenv import load_dotenv, find_dotenv
# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()
# load up the entries as environment variables
load_dotenv(dotenv_path)
Out[1]:
In [2]:
PROJECT_DIR = os.path.dirname(dotenv_path)
RAW_DATA_DIR = PROJECT_DIR + os.environ.get("RAW_DATA_DIR")
INTERIM_DATA_DIR = PROJECT_DIR + os.environ.get("INTERIM_DATA_DIR")
files=os.environ.get("FILES").split()
print("Project directory is : {0}".format(PROJECT_DIR))
print("Raw data directory is : {0}".format(RAW_DATA_DIR))
print("Interim directory is : {0}".format(INTERIM_DATA_DIR))
In [3]:
# The following jupyter notebook magic makes the plots appear in the notebook.
# If you run in batch mode, you have to save your plots as images.
%matplotlib inline
# matplotlib.pyplot is traditionally imported as plt
import matplotlib.pyplot as plt
# numpy is imported as np
import numpy as np
# Pandas is traditionaly imported as pd.
import pandas as pd
from pylab import rcParams
# some display options to size the figures. feel free to experiment
pd.set_option('display.max_columns', 25)
rcParams['figure.figsize'] = (17, 7)
Reading a CSV file is really easy in Pandas. There are several formats that Pandas can deal with.
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google Big Query | read_gbq | to_gbq |
Because the dataset is a real world dataset with data about many different aspects of the sample population, the data is not as nice for a regression or classification like the Iris dataset or the MPG data set.
What does work very well for such a data set is the use of a Probabilistic Graphical Model.
A good article describing PGM and a commonly used library pgmpy
In [4]:
df=pd.read_csv(INTERIM_DATA_DIR+'/arthritis_study.csv', index_col=0)
We will prepare the data by filtering some of the columns. This is for practical reasons for the lunch and learn. In reality you would leave all data in and only drop columns after determining they are not relevant for the model.
We also will split the data in a Train and Test dataset. This allows us to check how good the model predicts. The cool part is, with PGM, you can predict any of the columns.
In [5]:
from sklearn.cross_validation import train_test_split
columns_of_interest=['ARTH1','SPD','AGE_C', 'CHRONIC_CT','FUNC_LIMIT','SOC_RESTR','POV_RATIO','ACTIVITY']
data, test, y, test_y = train_test_split(df[columns_of_interest], df['ARTH1'], test_size=0.1, random_state=42, stratify=[ 0 if x=="Yes" else 1 for x in df['ARTH1']])
print("Number of samples in data: {}".format(data.shape[0]))
print("Number of samples in test: {}".format(test.shape[0]))
We need to drop any rows that still have NaN values
In [6]:
# to avoid problems later on we remove any rows that have any
data.dropna(axis=0, how='any', inplace=True)
test.dropna(axis=0, how='any', inplace=True)
print("Number of samples in data: {}".format(data.shape[0]))
print("Number of samples in test: {}".format(test.shape[0]))
In [7]:
from IPython.display import Image
Image(filename='Student_Bayesian_Network_Example.PNG')
Out[7]:
In [8]:
import pgmpy
from pgmpy.estimators import ConstraintBasedEstimator
est = ConstraintBasedEstimator(data)
First we stry to create a skeleton that is one of the many possible networks between the nodes. It does that by doing various statistical tests, like the Student test, Chi squared test, etc. (depending on the library you use and the type of data you have or the type of distribution you told the variables to be), between the different values of sets of features. If there is a significant difference, you can argue that one features influences the values of the other.
In [9]:
# don't think we will go over this.
def is_independent(X, Y, Zs=[], significance_level=0.05):
return est.test_conditional_independence(X, Y, Zs)[1] >= significance_level
In [10]:
for i in data.columns:
for j in data.columns:
print("relation between {} and {} is independent?: {}".format(i,j, is_independent('FUNC_LIMIT','POV_RATIO')))
In [11]:
# this step takes a long time with a dataset as big as we have
# without a model, the estimator will pairwise check independece for all categorical values for all combinations of columns
skel, seperating_sets = est.estimate_skeleton(significance_level=0.01)
print("Undirected edges: ", skel.edges())
Now that we have a skeleton there are some bidirectional and some loops still in the mix.
We need to recalculate the probabilities of the different options to check which one has the overall best cohesion.
In [12]:
# With an estimate of the skeleton, we can now add probabilities to the edges
pdag = est.skeleton_to_pdag(skel, seperating_sets)
print("PDAG edges: ", pdag.edges())
Now that we have the Probabilities calculated, we can eliminate the bi-directional ones and turn the network into a proper DAG.
In [13]:
# and finally based on the probabilities of the directionality, make a decision
model = est.pdag_to_dag(pdag)
print("DAG edges: ", model.edges())
You can let networkx determine the layout of the data, but with this data, we want to be more in control. So we define the positions
In [14]:
pos={'ACTIVITY': [ 2,3],
'AGE_C': [ 1,5],
'ARTH1': [ 3,5],
'CHRONIC_CT': [4,7],
'FUNC_LIMIT': [ 4,3],
'POV_RATIO': [ 6,3],
'SOC_RESTR': [ 5,5],
'SPD': [ 7,5]}
In [15]:
import networkx as nx
import warnings
warnings.filterwarnings("ignore")
nx.draw(model,pos=pos,node_size=6500, width=2, node_color='b')
nx.draw_networkx_edges(model,pos=pos)
nx.draw_networkx_labels(model,pos=pos)
Out[15]:
In [16]:
# adding the states to the model
for col in data.columns:
print(col)
print(data[col].unique())
model.node[col] = { 'STATES': data[col].unique() }
#for var, properties in nodes.items():
# model.node[var] = properties
In [17]:
# we could add any additional info to the model at this point, that adjusts our prior.
# now update the CPD's given the data.
model.fit(data)
for cpd in model.get_cpds():
if cpd.variable in ('ARTH1','CHRONIC_CT'):
print(cpd.variable)
print(cpd)
In [18]:
# OK, so now that we fit the data against the training set, we can test against the test set.
# we first create a version that does not have the value we want to test the accuracy of the model for.
test_x = test.copy()
# get the actual values for SPD for the dataset
test_y = test_x['ARTH1']
# then remove it from the data
test_x.drop('ARTH1', axis=1, inplace=True)
# now let the model predict the values
y_pred=model.predict(test_x)
from sklearn.metrics import f1_score, accuracy_score
y_true=[ 'No' if x is None else x for x in test_y ]
print(f1_score(y_true, y_pred, average='micro'))
That was not that great. But we know from the paper that the data is not predicting Arthritis, but the likelihood of having SPD with or without Arthritis. So how good is the model at predicting SPD.
In [19]:
# Is the Model better at predicting SPD?
# have to remove rows where SPD is blank
test_x = test.copy()
# get the actual values for SPD for the dataset
test_y = test_x['SPD']
# then remove it from the data
test_x.drop('SPD', axis=1, inplace=True)
# have the model predict the value
y_pred=model.predict(test_x)
In [20]:
from sklearn.metrics import f1_score, accuracy_score
y_true=[ 'No' if x is None else x for x in test_y ]
print(f1_score(y_true, y_pred, average='micro'))
Now we can predict full sets of data and infer missing columns in bulk, and we saw that the F1 score was not that great for predicting ARTH1.
However, on a case by case basis, this model can be very valuable if you look at the predicted probabilities for incomplete situation, especially if you are slowly gathering evidence.
Here an example of the likelihood of Arthritis by itself, and with more evidence.
In [21]:
# Set up for inference of
from pgmpy.inference import VariableElimination
infer = VariableElimination(model)
print(infer.query(['ARTH1'])['ARTH1'] )
print(infer.query(['ARTH1'], evidence={'CHRONIC_CT': 2})['ARTH1'] )
print(infer.query(['ARTH1'], evidence={'CHRONIC_CT': 2, 'AGE_C':2, 'FUNC_LIMIT':2})['ARTH1'] )
Also interesting is the rise in probability of SPD given some evidence, compared to just by itself.
In [22]:
print(infer.query(['SPD'])['SPD'] )
print(infer.query(['SPD'], evidence={'CHRONIC_CT': 2, 'POV_RATIO':1, 'SOC_RESTR':1, 'FUNC_LIMIT':1})['SPD'] )
When you have some idea of what the causal relations are, you can provide that instead of learn it from the data.
So what if I would put my very uneducated guess on what influences arthritis and what influences SPD out of this set of 8 features. I would come up with something like this.
And, besides the scope for today, but if I knew the CPD between to variables, I could provide those as well. Either as prior, before fitting the data, or, just override the one that is created from fitting the data. You need to have confidence in the prior to do the former, and extreme confidence in the prior, or know your data sample is biased against your prior, to do the latter.
In [23]:
from pgmpy.models import BayesianModel
edges=[('AGE_C','ARTH1'), ('AGE_C','CHRONIC_CT'),('AGE_C','FUNC_LIMIT')
,('ACTIVITY','ARTH1')
,('ARTH1','FUNC_LIMIT'),('ARTH1','SOC_RESTR')
,('CHRONIC_CT','FUNC_LIMIT'),('CHRONIC_CT','SOC_RESTR'),('CHRONIC_CT','SPD')
,('SOC_RESTR','SPD')
,('FUNC_LIMIT','SPD')
,('POV_RATIO','SPD')
]
altmodel = BayesianModel()
altmodel.add_edges_from(edges)
In [24]:
import warnings
warnings.filterwarnings("ignore")
nx.draw(altmodel,pos=pos,node_size=6500, width=2, node_color='b')
nx.draw_networkx_labels(altmodel,pos=pos)
Out[24]:
In [25]:
altmodel.fit(data)
for cpd in altmodel.get_cpds():
if cpd.variable in ('ARTH1','CHRONIC_CT'):
print(cpd.variable)
print(cpd)
In [26]:
y_pred=altmodel.predict(test_x)
from sklearn.metrics import f1_score, accuracy_score
y_true=[ 'No' if x is None else x for x in test_y ]
print(f1_score(y_true, y_pred, average='weighted'))
So the accuracy went down slightly, but with SPD being unbalanced, and a more realistic network, that is not a bad thing. It will reduce overfitting.
In [27]:
infer = VariableElimination(altmodel)
print(infer.query(['SPD'])['SPD'] )
print(infer.query(['SPD'], evidence={'ARTH1':1,'CHRONIC_CT': 2, 'POV_RATIO':1, 'SOC_RESTR':1, 'FUNC_LIMIT':1})['SPD'] )
In [28]:
infer = VariableElimination(altmodel)
print(infer.query(['SPD'], evidence={'ARTH1':0})['SPD'])
print(infer.query(['SPD'], evidence={'ARTH1':1})['SPD'])
As the paper suggests, the probability increases from around 3% to over 6%.