A Python Tour of Data Science: Data Acquisition & Exploration

Michaël Defferrard, PhD student, EPFL LTS2


In [1]:
# While packages are usually imported at the top, they can
# be imported wherever you prefer, in whatever scope.
import numpy as np

# Show matplotlib graphs inside the notebook.
%matplotlib inline

import sys
print('Python {}.{}'.format(sys.version_info.major, sys.version_info.minor))


Python 3.5

1 Data Importation

  • The world is messy: we have data in CSV, JSON, Excel, HDF5 files and an SQL database.
  • It could also have come as MATLAB, HTML, or XML files, from the web via scraping and APIs (e.g. the Twitter Firehose), from NoSQL data stores, etc.; a quick sketch of the web case follows.
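
As an illustration of the web case, pandas and requests can pull HTML tables or JSON directly off the network. This is only a sketch with hypothetical URLs, not part of the dataset used below.

import pandas as pd
import requests

# Hypothetical URL: read_html parses every <table> on the page into a list of DataFrames.
tables = pd.read_html('https://example.com/some_page.html')
df_html = tables[0]

# Hypothetical JSON API endpoint: most web APIs return JSON, which maps nicely to a DataFrame.
response = requests.get('https://example.com/api/records')
df_api = pd.DataFrame(response.json())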

In [2]:
# Search modules in parent folder.
import sys
sys.path.insert(1, '..')
import ntds

# Cross-platform (Windows / Mac / Linux) paths.
import os.path
folder = os.path.join('..', 'data', 'credit_card_defaults')

# Download the data.
ntds.get_data(folder)

In [3]:
!ls ../data/credit_card_defaults/
# Windows: !dir ..\data\credit_card_defaults\


X.npy            delays.xls       original.xls     target.json
bills.hdf5       demographics.csv payments.sqlite  y.npy

1.1 Importing from an SQL Database

SQLAlchemy to the rescue.


In [4]:
import sqlalchemy
filename = os.path.join(folder, 'payments.sqlite')
engine = sqlalchemy.create_engine('sqlite:///' + filename, echo=False)

# Infer from existing DB.
metadata = sqlalchemy.MetaData()
metadata.reflect(engine)

# An SQL SELECT statement.
table = metadata.tables.get('payments')
op = sqlalchemy.sql.select([table])
engine.echo = True
result = engine.execute(op)
engine.echo = False


2016-11-14 14:42:48,241 INFO sqlalchemy.engine.base.Engine SELECT payments."ID", payments."PAY1", payments."PAY2", payments."PAY3", payments."PAY4", payments."PAY5", payments."PAY6" 
FROM payments
2016-11-14 14:42:48,242 INFO sqlalchemy.engine.base.Engine ()

In [5]:
# Show some lines, i.e. clients.
for row in result.fetchmany(size=10):
    print('ID: {:2d}, payments: {}'.format(row[0], row[1:]))
result.close()


ID:  1, payments: (0, 689, 0, 0, 0, 0)
ID:  2, payments: (0, 1000, 1000, 1000, 0, 2000)
ID:  3, payments: (1518, 1500, 1000, 1000, 1000, 5000)
ID:  4, payments: (2000, 2019, 1200, 1100, 1069, 1000)
ID:  5, payments: (2000, 36681, 10000, 9000, 689, 679)
ID:  6, payments: (2500, 1815, 657, 1000, 1000, 800)
ID:  7, payments: (55000, 40000, 38000, 20239, 13750, 13770)
ID:  8, payments: (380, 601, 0, 581, 1687, 1542)
ID:  9, payments: (3329, 0, 432, 1000, 1000, 1000)
ID: 10, payments: (0, 0, 0, 13007, 1122, 0)

In [6]:
# Execute some raw SQL.
paid = 1000
op = sqlalchemy.sql.text('SELECT payments."ID", payments."PAY6" FROM payments WHERE payments."PAY6" = {}'.format(paid))
result = engine.execute(op).fetchall()
print('{} clients paid {} in April 2005'.format(len(result), paid))


1299 clients paid 1000 in April 2005
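
Formatting values straight into the SQL string works here, but bound parameters are safer (especially with untrusted input). A minimal sketch of the same query with a bind parameter:

# Same query as above, with the value passed as a bound parameter instead of string formatting.
op = sqlalchemy.sql.text('SELECT payments."ID", payments."PAY6" FROM payments WHERE payments."PAY6" = :paid')
result = engine.execute(op, paid=paid).fetchall()
print('{} clients paid {} in April 2005'.format(len(result), paid))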

1.2 Merging Data Sources

Put some pandas in our Python!

  • Import / export data from / to various sources.
  • Data frames manipulations: slicing, dicing, grouping.
  • And many more!

In [7]:
def get_data(directory):
    filename_csv = os.path.join(directory, 'demographics.csv')
    filename_xls = os.path.join(directory, 'delays.xls')
    filename_hdf = os.path.join(directory, 'bills.hdf5')
    filename_json = os.path.join(directory, 'target.json')

    demographics = pd.read_csv(filename_csv, index_col=0)
    delays = pd.read_excel(filename_xls, index_col=0)
    bills = pd.read_hdf(filename_hdf, 'bills')
    payments = pd.read_sql('payments', engine, index_col='ID')
    target = pd.read_json(filename_json)

    return pd.concat([demographics, delays, bills, payments, target], axis=1)

import pandas as pd
data = get_data(folder)
attributes = data.columns.tolist()

# Transform from numerical to categorical variables.
data['SEX'] = data['SEX'].astype('category')
data['SEX'].cat.categories = ['MALE', 'FEMALE']
data['MARRIAGE'] = data['MARRIAGE'].astype('category')
data['MARRIAGE'].cat.categories = ['UNK', 'MARRIED', 'SINGLE', 'OTHERS']
data['EDUCATION'] = data['EDUCATION'].astype('category')
data['EDUCATION'].cat.categories = ['UNK', 'GRAD SCHOOL', 'UNIVERSITY', 'HIGH SCHOOL', 'OTHERS', 'UNK1', 'UNK2']
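
Assigning .cat.categories relies on the positional order of the codes. An equivalent, more explicit alternative (a sketch assuming the 1 = male, 2 = female coding from the dataset description) would map the raw codes before converting, replacing the two SEX lines above:

# Explicit mapping of the raw numeric codes, then conversion to a categorical column.
sex_map = {1: 'MALE', 2: 'FEMALE'}
data['SEX'] = data['SEX'].map(sex_map).astype('category')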

1.3 Looking at the Data


In [8]:
data.loc[:6, ['LIMIT', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'DEFAULT']]


Out[8]:
LIMIT SEX EDUCATION MARRIAGE AGE DEFAULT
ID
1 20000 FEMALE UNIVERSITY MARRIED 24 1
2 120000 FEMALE UNIVERSITY SINGLE 26 1
3 90000 FEMALE UNIVERSITY SINGLE 34 0
4 50000 FEMALE UNIVERSITY MARRIED 37 0
5 50000 MALE UNIVERSITY MARRIED 57 0
6 50000 MALE GRAD SCHOOL SINGLE 37 0

In [9]:
data.iloc[:5, 4:10]


Out[9]:
AGE DELAY1 DELAY2 DELAY3 DELAY4 DELAY5
ID
1 24 2 2 -1 -1 -2
2 26 -1 2 0 0 0
3 34 0 0 0 0 0
4 37 0 0 0 0 0
5 57 -1 0 -1 0 0

In [10]:
data.iloc[:5, 11:23]


Out[10]:
BILL1 BILL2 BILL3 BILL4 BILL5 BILL6 PAY1 PAY2 PAY3 PAY4 PAY5 PAY6
ID
1 3913 3102 689 0 0 0 0 689 0 0 0 0
2 2682 1725 2682 3272 3455 3261 0 1000 1000 1000 0 2000
3 29239 14027 13559 14331 14948 15549 1518 1500 1000 1000 1000 5000
4 46990 48233 49291 28314 28959 29547 2000 2019 1200 1100 1069 1000
5 8617 5670 35835 20940 19146 19131 2000 36681 10000 9000 689 679

Export as an HTML table for manual inspection.


In [11]:
data[:1000].to_html('subset.html')

2 Data Cleaning

While cleaning data is the most time-consuming, least enjoyable Data Science task, it should be performed nonetheless. Problems come in two flavours:

  1. Missing data, i.e. unknown values.
  2. Errors in data, i.e. wrong values.

The actions to be taken in each case are highly data- and problem-specific.
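
A quick way to spot the first flavour is to count missing values per attribute; a minimal sketch:

# Number of missing values per column; all zeros would mean no missing data.
print(data.isnull().sum())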

Example: marital status

  1. According to the dataset description, it should either be 1 (married), 2 (single) or 3 (others).
  2. But we find some 0 (previously transformed to UNK).
  3. Let's assume that 0 represents errors when collecting the data and that we should remove those clients.

In [12]:
print(data['MARRIAGE'].value_counts())
data = data[data['MARRIAGE'] != 'UNK'].copy()
data['MARRIAGE'] = data['MARRIAGE'].cat.remove_unused_categories()
print('\nWe are left with {} clients\n'.format(data.shape[0]))
print(data['MARRIAGE'].unique())


SINGLE     15964
MARRIED    13659
OTHERS       323
UNK           54
Name: MARRIAGE, dtype: int64

We are left with 29946 clients

[MARRIED, SINGLE, OTHERS]
Categories (3, object): [MARRIED, SINGLE, OTHERS]

Example: education

  1. It should either be 1 (graduate school), 2 (university), 3 (high school) or 4 (others).
  2. But we find some 0, 5 and 6 (previously transformed to UNK, UNK1 and UNK2).
  3. Let's assume these values are dubious but do not invalidate the data, and keep them as they may have some predictive power.

In [13]:
print(data['EDUCATION'].value_counts())
data.loc[data['EDUCATION']=='UNK1', 'EDUCATION'] = 'UNK'
data.loc[data['EDUCATION']=='UNK2', 'EDUCATION'] = 'UNK'
data['EDUCATION'] = data['EDUCATION'].cat.remove_unused_categories()
print(data['EDUCATION'].value_counts())


UNIVERSITY     14024
GRAD SCHOOL    10581
HIGH SCHOOL     4873
UNK1             280
OTHERS           123
UNK2              51
UNK               14
Name: EDUCATION, dtype: int64
UNIVERSITY     14024
GRAD SCHOOL    10581
HIGH SCHOOL     4873
UNK              345
OTHERS           123
Name: EDUCATION, dtype: int64

3 Data Exploration

  • Get descriptive statistics.
  • Plot informative figures.
  • Verify some intuitive correlations.

Let's first get some descriptive statistics of our numerical variables.


In [14]:
attributes_numerical = ['LIMIT', 'AGE']
attributes_numerical.extend(attributes[11:23])
data.loc[:, attributes_numerical].describe().astype(np.int)


Out[14]:
LIMIT AGE BILL1 BILL2 BILL3 BILL4 BILL5 BILL6 PAY1 PAY2 PAY3 PAY4 PAY5 PAY6
count 29946 29946 29946 29946 29946 29946 29946 29946 29946 29946 29946 29946 29946 29946
mean 167546 35 51278 49224 47063 43306 40352 38911 5659 5926 5227 4829 4804 5220
std 129807 9 73682 71219 69393 64374 60836 59592 16552 23060 17618 15677 15290 17791
min 10000 21 -165580 -69777 -157264 -170000 -81334 -339603 0 0 0 0 0 0
25% 50000 28 3570 2988 2684 2335 1770 1261 1000 836 390 298 255 122
50% 140000 34 22400 21221 20108 19066 18121 17098 2100 2010 1800 1500 1500 1500
75% 240000 41 67263 64108 60240 54601 50244 49248 5007 5000 4511 4015 4040 4000
max 1000000 79 964511 983931 1664089 891586 927171 961664 873552 1684259 896040 621000 426529 528666

Let's plot a histogram of the ages, so that we get a better impression of who our clients are. That may even be an end goal, e.g. if your marketing team asks which customer groups to target.

Then a boxplot of the bills, which may serve as a verification of the quality of the acquired data.


In [15]:
data.loc[:, 'AGE'].plot.hist(bins=20, figsize=(15,5))
ax = data.iloc[:, 11:17].plot.box(logy=True, figsize=(15,5))


Simple question: what proportion of our clients default?


In [16]:
percentage = data['DEFAULT'].value_counts()[1] / data.shape[0] * 100
print('Percentage of defaults: {:.2f}%'.format(percentage))


Percentage of defaults: 22.14%
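
The same proportion can be read directly from a normalized count of the target:

# Relative frequency of each class of DEFAULT.
print(data['DEFAULT'].value_counts(normalize=True))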

Another question: who's more susceptible to default, males or females?


In [17]:
observed = pd.crosstab(data['SEX'], data['DEFAULT'], margins=True)
observed


Out[17]:
DEFAULT 0 1 All
SEX
MALE 9003 2871 11874
FEMALE 14312 3760 18072
All 23315 6631 29946

It seems that females are a better risk. Let's verify with a chi-squared test of independence, using scipy.stats.


In [18]:
import scipy.stats as stats
_, p, _, expected = stats.chi2_contingency(observed.iloc[:2,:2])
print('p-value = {:.2e}'.format(p))
print('expected values:\n{}'.format(expected))


p-value = 6.75e-12
expected values:
[[  9244.71749148   2629.28250852]
 [ 14070.28250852   4001.71749148]]

Intuition: people who pay late present a higher risk of defaulting. Let's verify! Verifying some intuitions will also help you identify mistakes. E.g. it would be suspicious if that intuition were not verified in the data: did we select the right column, or did we miscompute a result?


In [19]:
group = data.groupby('DELAY1').mean()
corr = data['DEFAULT'].corr(data['DELAY1'], method='pearson')
group['DEFAULT'].plot(grid=True, title='Pearson correlation: {:.4f}'.format(corr), figsize=(15,5));
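
Beyond this single intuition, a quick look at how every numerical attribute correlates with the target can suggest further candidates to inspect; a minimal sketch (categorical columns are ignored by corr()):

# Pearson correlation of each numerical attribute with the target, strongest first.
print(data.corr()['DEFAULT'].sort_values(ascending=False))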


4 Interactive Visualization

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation, in the style of D3.js. Alternatively, matplotlib.widgets could be used. Such interactive visualizations are very helpful for exploring the data at hand in search of anomalies or patterns. Try it with the plots below!


In [20]:
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource

output_notebook()

x, y1, y2 = 'LIMIT', 'PAY1', 'PAY2'
n = 1000  # Less intensive for the browser.

options = dict(
    tools='pan,box_zoom,wheel_zoom,box_select,lasso_select,crosshair,reset,save',
    x_axis_type='log', y_axis_type='log',
)
plot1 = figure(
    x_range=[1e4,1e6],
    x_axis_label=x, y_axis_label=y1,
    **options
)
plot2 = figure(
    x_range=plot1.x_range, y_range=plot1.y_range,
    x_axis_label=x, y_axis_label=y2,
    **options
)

html_color = lambda r,g,b: '#{:02x}{:02x}{:02x}'.format(r,g,b)
colors = [html_color(150,0,0) if default == 1 else html_color(0,150,0) for default in data['DEFAULT'][:n]]
# The above line is a list comprehension.

radii = data['AGE'][:n] / 5

# To link brushing (where a selection on one plot causes a selection to update on other plots).
source = ColumnDataSource(dict(x=data[x][:n], y1=data[y1][:n], y2=data[y2][:n], radii=radii, colors=colors))

plot1.scatter('x', 'y1', source=source, size='radii', color='colors', alpha=0.6)
plot2.scatter('x', 'y2', source=source, size='radii', color='colors', alpha=0.6)

plot = gridplot([[plot1, plot2]], toolbar_location='right', plot_width=400, plot_height=400)

show(plot)


Loading BokehJS ...

5 Traditional Statistics

Statsmodels is similar to scikit-learn, with a much stronger emphasis on parameter estimation and (statistical) testing. It is similar in spirit to other statistical packages such as R, SPSS, SAS and Stata. That split reflects the two statistical modeling cultures: (1) Statistics, which wants to know how well a given model fits the data and what variables "explain" or affect the outcome, and (2) Machine Learning, where the main supported task is choosing the "best" model for prediction.


In [21]:
# Back to numeric values.
# Note: in a serious project, these should be treated as categories.
data['SEX'].cat.categories = [-1, 1]
data['SEX'] = data['SEX'].astype(np.int)
data['MARRIAGE'].cat.categories = [-1, 1, 0]
data['MARRIAGE'] = data['MARRIAGE'].astype(np.int)
data['EDUCATION'].cat.categories = [-2, 2, 1, 0, -1]
data['EDUCATION'] = data['EDUCATION'].astype(np.int)

data['DEFAULT'] = data['DEFAULT'] * 2 - 1  # [0,1] --> [-1,1]

In [22]:
# Observations and targets.
X = data.values[:,:23]
y = data.values[:,23]
n, d = X.shape
print('The data is a {} with {} samples of dimensionality {}.'.format(type(X), n, d))


The data is a <class 'numpy.ndarray'> with 29946 samples of dimensionality 23.

In [23]:
import statsmodels.api as sm

# Fit the Ordinary Least Square regression model.
results = sm.OLS(y, X).fit()

# Inspect the results.
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.384
Model:                            OLS   Adj. R-squared:                  0.383
Method:                 Least Squares   F-statistic:                     809.4
Date:                Mon, 14 Nov 2016   Prob (F-statistic):               0.00
Time:                        14:42:58   Log-Likelihood:                -35248.
No. Observations:               29946   AIC:                         7.054e+04
Df Residuals:                   29923   BIC:                         7.073e+04
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1         -2.364e-07   4.35e-08     -5.428      0.000     -3.22e-07 -1.51e-07
x2            -0.0337      0.005     -7.280      0.000        -0.043    -0.025
x3            -0.0265      0.006     -4.656      0.000        -0.038    -0.015
x4            -0.0801      0.005    -16.708      0.000        -0.090    -0.071
x5            -0.0103      0.000    -42.750      0.000        -0.011    -0.010
x6             0.1853      0.006     33.159      0.000         0.174     0.196
x7             0.0352      0.007      5.221      0.000         0.022     0.048
x8             0.0196      0.007      2.695      0.007         0.005     0.034
x9             0.0060      0.008      0.747      0.455        -0.010     0.022
x10            0.0114      0.009      1.314      0.189        -0.006     0.029
x11            0.0068      0.007      0.959      0.338        -0.007     0.021
x12        -1.372e-06   2.31e-07     -5.946      0.000     -1.82e-06  -9.2e-07
x13         3.537e-07   3.24e-07      1.091      0.275     -2.82e-07  9.89e-07
x14         3.364e-08   3.05e-07      0.110      0.912     -5.65e-07  6.32e-07
x15        -1.465e-07   3.18e-07     -0.461      0.645      -7.7e-07  4.77e-07
x16         1.527e-08   3.73e-07      0.041      0.967     -7.16e-07  7.46e-07
x17         2.117e-07   2.95e-07      0.717      0.473     -3.67e-07   7.9e-07
x18        -1.583e-06   3.58e-07     -4.421      0.000     -2.29e-06 -8.81e-07
x19        -4.521e-07   2.94e-07     -1.536      0.125     -1.03e-06  1.25e-07
x20        -3.437e-08   3.41e-07     -0.101      0.920     -7.04e-07  6.35e-07
x21        -6.357e-07   3.72e-07     -1.710      0.087     -1.36e-06  9.28e-08
x22        -6.479e-07   3.86e-07     -1.680      0.093      -1.4e-06  1.08e-07
x23        -1.959e-07   2.76e-07     -0.710      0.478     -7.37e-07  3.45e-07
==============================================================================
Omnibus:                     4479.171   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6851.842
Skew:                           1.169   Prob(JB):                         0.00
Kurtosis:                       3.160   Cond. No.                     6.14e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.14e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
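
Note that the regression above is fit without an intercept. Adding one with sm.add_constant would change the coefficients and fit statistics; a minimal sketch:

# Same regression with an explicit intercept term.
X_const = sm.add_constant(X)
results_const = sm.OLS(y, X_const).fit()
print(results_const.summary())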

6 Exporting Data

Save the collected data to disk for further analysis.


In [24]:
np.save(os.path.join(folder, 'X.npy'), X)
np.save(os.path.join(folder, 'y.npy'), y)
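
To pick up the analysis later, the saved arrays can be reloaded with np.load:

# Reload the exported observations and targets.
X = np.load(os.path.join(folder, 'X.npy'))
y = np.load(os.path.join(folder, 'y.npy'))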