Michaël Defferrard, PhD student, EPFL LTS2
In [1]:
# While packages are usually imported at the top, they can
# be imported wherever you prefer, in whatever scope.
import numpy as np
# Show matplotlib graphs inside the notebook.
%matplotlib inline
import sys
print('Python {}.{}'.format(sys.version_info.major, sys.version_info.minor))
In [2]:
# Search modules in parent folder.
import sys
sys.path.insert(1, '..')
import ntds
# Cross-platform (Windows / Mac / Linux) paths.
import os.path
folder = os.path.join('..', 'data', 'credit_card_defaults')
# Download the data.
ntds.get_data(folder)
In [3]:
!ls ../data/credit_card_defaults/
# Windows: !dir ..\data\credit_card_defaults\
SQLAlchemy to the rescue.
In [4]:
import sqlalchemy
filename = os.path.join(folder, 'payments.sqlite')
engine = sqlalchemy.create_engine('sqlite:///' + filename, echo=False)
# Infer from existing DB.
metadata = sqlalchemy.MetaData()
metadata.reflect(engine)
# An SQL SELECT statement.
table = metadata.tables.get('payments')
op = sqlalchemy.sql.select([table])
engine.echo = True
result = engine.execute(op)
engine.echo = False
In [5]:
# Show some lines, i.e. clients.
for row in result.fetchmany(size=10):
    print('ID: {:2d}, payments: {}'.format(row[0], row[1:]))
result.close()
In [6]:
# Execute some raw SQL.
paid = 1000
op = sqlalchemy.sql.text('SELECT payments."ID", payments."PAY6" FROM payments WHERE payments."PAY6" = :paid')
result = engine.execute(op, paid=paid).fetchall()
print('{} clients paid {} in April 2005'.format(len(result), paid))
Put some pandas in our Python!
In [7]:
def get_data(directory):
    filename_csv = os.path.join(directory, 'demographics.csv')
    filename_xls = os.path.join(directory, 'delays.xls')
    filename_hdf = os.path.join(directory, 'bills.hdf5')
    filename_json = os.path.join(directory, 'target.json')
    demographics = pd.read_csv(filename_csv, index_col=0)
    delays = pd.read_excel(filename_xls, index_col=0)
    bills = pd.read_hdf(filename_hdf, 'bills')
    payments = pd.read_sql('payments', engine, index_col='ID')
    target = pd.read_json(filename_json)
    return pd.concat([demographics, delays, bills, payments, target], axis=1)
import pandas as pd
data = get_data(folder)
attributes = data.columns.tolist()
# Transform from numerical to categorical variables.
data['SEX'] = data['SEX'].astype('category')
data['SEX'] = data['SEX'].cat.rename_categories(['MALE', 'FEMALE'])
data['MARRIAGE'] = data['MARRIAGE'].astype('category')
data['MARRIAGE'] = data['MARRIAGE'].cat.rename_categories(['UNK', 'MARRIED', 'SINGLE', 'OTHERS'])
data['EDUCATION'] = data['EDUCATION'].astype('category')
data['EDUCATION'] = data['EDUCATION'].cat.rename_categories(['UNK', 'GRAD SCHOOL', 'UNIVERSITY', 'HIGH SCHOOL', 'OTHERS', 'UNK1', 'UNK2'])
In [8]:
data.loc[:6, ['LIMIT', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'DEFAULT']]
Out[8]:
In [9]:
data.iloc[:5, 4:10]
Out[9]:
In [10]:
data.iloc[:5, 11:23]
Out[10]:
Export as an HTML table for manual inspection.
In [11]:
data[:1000].to_html('subset.html')
While cleaning data is the most time-consuming and least enjoyable Data Science task, it should be performed nonetheless. Problems come in two flavours: missing data, i.e. unknown values, and erroneous data, i.e. wrong values.
The actions to be taken in each case are highly data- and problem-specific.
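As a minimal sketch of how both flavours can be surfaced with pandas (using a hypothetical toy frame, not our dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with both flavours of problem (hypothetical values, not our dataset).
df = pd.DataFrame({'AGE': [25.0, np.nan, 300.0],
                   'SEX': ['MALE', 'FEMALE', 'MALE']})

# Missing data: count unknown values per column.
print(df.isnull().sum())

# Erroneous data: flag implausible values for manual inspection.
print(df[df['AGE'] > 120])
```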
Example: marital status. Below, we simply drop the clients whose marital status is unknown (UNK).
In [12]:
print(data['MARRIAGE'].value_counts())
data = data[data['MARRIAGE'] != 'UNK']
data['MARRIAGE'] = data['MARRIAGE'].cat.remove_unused_categories()
print('\nWe are left with {} clients\n'.format(data.shape[0]))
print(data['MARRIAGE'].unique())
Example: education. Below, we merge the three unknown education levels (UNK, UNK1 and UNK2) into a single UNK category.
In [13]:
print(data['EDUCATION'].value_counts())
data.loc[data['EDUCATION']=='UNK1', 'EDUCATION'] = 'UNK'
data.loc[data['EDUCATION']=='UNK2', 'EDUCATION'] = 'UNK'
data['EDUCATION'] = data['EDUCATION'].cat.remove_unused_categories()
print(data['EDUCATION'].value_counts())
In [14]:
attributes_numerical = ['LIMIT', 'AGE']
attributes_numerical.extend(attributes[11:23])
data.loc[:, attributes_numerical].describe().astype(int)
Out[14]:
Let's plot a histogram of the ages, so that we get a better impression of who our clients are. That may even be an end goal, e.g. if your marketing team asks which customer groups to target.
Then a boxplot of the bills, which may serve as a check on the quality of the acquired data.
In [15]:
data.loc[:, 'AGE'].plot.hist(bins=20, figsize=(15,5))
ax = data.iloc[:, 11:17].plot.box(logy=True, figsize=(15,5))
Simple question: what proportion of our clients default?
In [16]:
percentage = data['DEFAULT'].value_counts()[1] / data.shape[0] * 100
print('Percentage of defaults: {:.2f}%'.format(percentage))
Another question: who is more susceptible to default, males or females?
In [17]:
observed = pd.crosstab(data['SEX'], data['DEFAULT'], margins=True)
observed
Out[17]:
It seems that females are a better risk. Let's verify with a Chi-squared test of independence, using scipy.stats.
In [18]:
import scipy.stats as stats
_, p, _, expected = stats.chi2_contingency(observed.iloc[:2,:2])
print('p-value = {:.2e}'.format(p))
print('expected values:\n{}'.format(expected))
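To make the "expected values" concrete, here is a sketch on a toy 2x2 table (hypothetical counts, not our data): under independence, each expected count is the product of its row and column totals divided by the grand total, which matches what `chi2_contingency` returns.

```python
import numpy as np
from scipy import stats

# Toy 2x2 contingency table (hypothetical counts, not our data).
observed = np.array([[20, 30],
                     [40, 10]])

# Expected counts under independence: outer product of the marginals / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

chi2, p, dof, exp = stats.chi2_contingency(observed)
print(np.allclose(exp, expected))  # True
```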
Intuition: people who pay late present a higher risk of defaulting. Let's verify! Checking such intuitions also helps you identify mistakes: it would be suspicious if the data did not confirm this one. Did we select the right column, or mis-compute a result?
In [19]:
group = data.groupby('DELAY1').mean()
corr = data['DEFAULT'].corr(data['DELAY1'], method='pearson')
group['DEFAULT'].plot(grid=True, title='Pearson correlation: {:.4f}'.format(corr), figsize=(15,5));
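As a reminder of what the Pearson correlation measures, a small sketch on toy vectors (not our data): it is the covariance of the two variables normalized by their standard deviations.

```python
import numpy as np

# Toy vectors (hypothetical, not our dataset).
a = np.array([1., 2., 3., 4., 5.])
b = np.array([2., 4., 5., 4., 5.])

# Pearson correlation from its definition: covariance / (std_a * std_b).
r = np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())

print(np.isclose(r, np.corrcoef(a, b)[0, 1]))  # True
```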
Bokeh is a Python interactive visualization library that targets modern web browsers for presentation, in the style of D3.js. Alternatively, matplotlib.widgets could be used. Such interactive visualizations are very helpful for exploring the data at hand in the quest for anomalies or patterns. Try it with the plots below!
In [20]:
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
output_notebook()
x, y1, y2 = 'LIMIT', 'PAY1', 'PAY2'
n = 1000 # Less intensive for the browser.
options = dict(
tools='pan,box_zoom,wheel_zoom,box_select,lasso_select,crosshair,reset,save',
x_axis_type='log', y_axis_type='log',
)
plot1 = figure(
x_range=[1e4,1e6],
x_axis_label=x, y_axis_label=y1,
**options
)
plot2 = figure(
x_range=plot1.x_range, y_range=plot1.y_range,
x_axis_label=x, y_axis_label=y2,
**options
)
html_color = lambda r,g,b: '#{:02x}{:02x}{:02x}'.format(r,g,b)
colors = [html_color(150,0,0) if default == 1 else html_color(0,150,0) for default in data['DEFAULT'][:n]]
# The above line is a list comprehension.
radii = data['AGE'][:n] / 5
# To link brushing (where a selection on one plot causes a selection to update on other plots).
source = ColumnDataSource(dict(x=data[x][:n], y1=data[y1][:n], y2=data[y2][:n], radii=radii, colors=colors))
plot1.scatter('x', 'y1', source=source, size='radii', color='colors', alpha=0.6)
plot2.scatter('x', 'y2', source=source, size='radii', color='colors', alpha=0.6)
plot = gridplot([[plot1, plot2]], toolbar_location='right', plot_width=400, plot_height=400)
show(plot)
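For reference, the matplotlib.widgets alternative mentioned above could look like the following sketch (toy data, hypothetical variable names; the slider only responds in an interactive backend):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

# Toy data (hypothetical, not our dataset).
rng = np.random.RandomState(0)
age = rng.randint(20, 70, size=100)
limit = rng.uniform(1e4, 1e6, size=100)

fig, ax = plt.subplots()
points = ax.scatter(limit, age)
ax.set_xlabel('LIMIT')
ax.set_ylabel('AGE')

# Slider below the plot; dragging it (in an interactive backend) hides older clients.
slider_ax = fig.add_axes([0.2, 0.01, 0.6, 0.03])
slider = Slider(slider_ax, 'max AGE', 20, 70, valinit=70)

def update(val):
    mask = age <= val
    points.set_offsets(np.column_stack([limit[mask], age[mask]]))
    fig.canvas.draw_idle()

slider.on_changed(update)
```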
Statsmodels is similar to scikit-learn, with a much stronger emphasis on parameter estimation and (statistical) testing. It is similar in spirit to statistical packages such as R, SPSS, SAS and Stata. That split reflects the two statistical modeling cultures: (1) Statistics, which wants to know how well a given model fits the data and which variables "explain" or affect the outcome, and (2) Machine Learning, where the main task is choosing the "best" model for prediction.
In [21]:
# Back to numeric values.
# Note: in a serious project, these should be treated as categories.
data['SEX'] = data['SEX'].cat.rename_categories([-1, 1]).astype(int)
data['MARRIAGE'] = data['MARRIAGE'].cat.rename_categories([-1, 1, 0]).astype(int)
data['EDUCATION'] = data['EDUCATION'].cat.rename_categories([-2, 2, 1, 0, -1]).astype(int)
data['DEFAULT'] = data['DEFAULT'] * 2 - 1 # [0,1] --> [-1,1]
In [22]:
# Observations and targets.
X = data.values[:,:23]
y = data.values[:,23]
n, d = X.shape
print('The data is a {} with {} samples of dimensionality {}.'.format(type(X), n, d))
In [23]:
import statsmodels.api as sm
# Fit the Ordinary Least Square regression model.
results = sm.OLS(y, X).fit()
# Inspect the results.
print(results.summary())
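To illustrate the split described above: the same least-squares fit in scikit-learn recovers the coefficients but produces no summary table of p-values. A sketch on a toy problem with known coefficients (not our `X` and `y`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy regression problem with known coefficients (hypothetical, not our dataset).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
coef_true = np.array([1.0, -2.0, 0.5])
y_demo = X_demo @ coef_true

# Like sm.OLS(y, X), fit without an intercept.
model = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)
print(model.coef_)  # close to [1.0, -2.0, 0.5]
```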
In [24]:
np.save(os.path.join(folder, 'X.npy'), X)
np.save(os.path.join(folder, 'y.npy'), y)