I will start by adding the general setup code that I use everywhere, then load the data and turn it into a data frame. Using this data frame I will explore the data and build an understanding of it.
In [5]:
import sys
projects_home = '/home/aseem/projects/ud120-projects'
final_project_home = projects_home + '/final_project/'
sys.path.append(final_project_home)
sys.path.append(projects_home + '/tools/')
In [6]:
%matplotlib inline
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
with open(final_project_home + "final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)
data_df = pd.DataFrame.from_dict(data_dict, orient = 'index')
#del data_dict
I will note down the versions that I am using so as to avoid any confusion later
In [7]:
print sns.__version__
print matplotlib.__version__
print np.__version__
print pd.__version__
In [8]:
data_df.head()
Out[8]:
I remember that there was an outlier (the TOTAL row) that needs to be removed. Also, email_address does not seem to be useful.
In [9]:
imputated_df1 = data_df.drop(['TOTAL'], axis=0).drop(['email_address'], axis = 1)
imputated_df1.head()
Out[9]:
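For context, a quick way one might confirm that TOTAL is the outlier (a small sketch, not part of the original run) is to coerce salary on the raw frame to numeric and look at the largest values:

# Sketch: the 'TOTAL' spreadsheet row dwarfs every real person on salary.
salary = data_df['salary'].replace('NaN', np.nan).astype(float)
print salary.sort_values(ascending=False).head()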
I can see a lot of NaN. I will replace them with 0 for now
In [10]:
imputated_df2 = imputated_df1.replace(to_replace='NaN', value = 0)
imputated_df2.head()
Out[10]:
Now, with some of the obvious things out of the way, I will do some univariate analysis to get a better understanding of the data. Let's start by looking at the summary statistics.
In [11]:
imputated_df2.columns
Out[11]:
In [12]:
imputated_df2.describe().transpose()
Out[12]:
In [13]:
imputated_df2.median()
Out[13]:
In [14]:
imputated_df2.mode()
Out[14]:
We see a lot of zeros. Maybe there were a lot of nulls, more than I was expecting? Let me see how many there were.
In [15]:
imputated_df1.replace(to_replace='NaN', value=np.nan).isnull().sum()
Out[15]:
There are a lot of nulls. How many records were there again?
In [16]:
imputated_df2.shape
Out[16]:
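Before looking at individual columns, one could also rank the features by how sparse they are (a quick sketch reusing the null counts computed above; not part of the original run):

# Sketch: rank features by number of missing values to spot the sparsest ones.
null_counts = imputated_df1.replace(to_replace='NaN', value=np.nan).isnull().sum()
print null_counts.sort_values(ascending=False).head()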
There are some columns with very few non-null values. Let's check them out.
In [17]:
imputated_df2.loan_advances.value_counts()
Out[17]:
In [18]:
imputated_df2[imputated_df2.loan_advances != 0]
Out[18]:
Although there is a POI for which this value is present, I don't think it gives us any discriminating power. Hence I will remove this column.
In [19]:
imputated_df3 = imputated_df2.drop(['loan_advances'], axis = 1)
imputated_df3.head()
Out[19]:
In [20]:
imputated_df3.shape
Out[20]:
In [21]:
imputated_df3.poi.value_counts()
Out[21]:
In [22]:
def get_poi_wise_breakup(df, feature):
    # For a given feature, count how many values are missing vs present,
    # broken down by POI status (zeros are treated as missing here, since
    # the original NaNs were replaced with 0 earlier).
    def my_func(test):
        return {
            'isNull': test[feature].isnull().sum(),
            'isNotNull': test[feature].notnull().sum()
        }
    return df.replace(to_replace=0, value=np.nan)\
             .loc[:, ['poi', feature]]\
             .groupby('poi')\
             .apply(my_func)
In [23]:
get_poi_wise_breakup(imputated_df3, 'deferral_payments')
Out[23]:
In [24]:
get_poi_wise_breakup(imputated_df3, 'restricted_stock_deferred')
Out[24]:
In [25]:
get_poi_wise_breakup(imputated_df3, 'director_fees')
Out[25]:
Looking at these I can see that restricted_stock_deferred as well as director_fees are null for all of the POIs, so we can't really discriminate between POI and non-POI based on them. I looked at some discussions but don't really agree with them, and I don't want to make assumptions, so I am going to remove those two columns. deferral_payments still has some data for both classes, so I am going to keep it for now.
In [26]:
imputated_df4 = imputated_df3.drop(['restricted_stock_deferred', 'director_fees'], axis = 1)
imputated_df4.head()
Out[26]:
To summarize the feature cleanup so far, the following columns were removed:
- email_address
- loan_advances
- restricted_stock_deferred
- director_fees

deferral_payments was kept, though I am doubtful whether it will be useful or not.
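As a convenience, the cleanup above could be collected into a single helper (a sketch; clean_enron_df is a hypothetical name, and it simply repeats the drops and the NaN-to-0 replacement decided on above):

def clean_enron_df(data_dict):
    # Hypothetical helper: rebuild the cleaned frame from the raw dict.
    df = pd.DataFrame.from_dict(data_dict, orient='index')
    # Drop the spreadsheet 'TOTAL' row and the columns discarded above.
    df = df.drop(['TOTAL'], axis=0)
    df = df.drop(['email_address', 'loan_advances',
                  'restricted_stock_deferred', 'director_fees'], axis=1)
    # Treat the 'NaN' strings as zero, as in the imputation step above.
    return df.replace(to_replace='NaN', value=0)

cleaned_df = clean_enron_df(data_dict)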