NTDS'17 demo 7: data exploration and visualisation

Michael Defferrard and Hermina Petric Maretic, based on exercises from the 2016 edition of NTDS.


In [ ]:
%matplotlib inline

import sys
import os
import warnings

import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')

1 Organising and looking at data


In [ ]:
filename = os.path.join('..', 'data', 'credit_card_defaults.csv')
data = pd.read_csv(filename, index_col = 0)
attributes = data.columns.tolist()

# Transform from numerical to categorical variables.
data['SEX'] = data['SEX'].astype('category')
data['SEX'].cat.categories = ['MALE', 'FEMALE']
data['MARRIAGE'] = data['MARRIAGE'].astype('category')
data['MARRIAGE'].cat.categories = ['UNK', 'MARRIED', 'SINGLE', 'OTHERS']
data['EDUCATION'] = data['EDUCATION'].astype('category')
data['EDUCATION'].cat.categories = ['UNK', 'GRAD SCHOOL', 'UNIVERSITY', 'HIGH SCHOOL', 'OTHERS', 'UNK1', 'UNK2']

In [ ]:
data.loc[:6, ['LIMIT', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'DEFAULT']]

In [ ]:
data.iloc[:5, 4:10]

In [ ]:
data.iloc[:5, 11:23]

Export as an HTML table for manual inspection.


In [ ]:
data[:1000].to_html('subset.html')

2 Data cleaning

Problems come in two flavours:

  1. Missing data, i.e. unknown values.
  2. Errors in data, i.e. wrong values.

The actions to be taken in each case are highly data- and problem-specific.

Example: marital status

  1. According to the dataset description, it should be either 1 (married), 2 (single) or 3 (others).
  2. But we find some 0 (previously transformed to UNK).
  3. Let's assume that 0 represents errors made when collecting the data and that we should remove those clients.

In [ ]:
print(data['MARRIAGE'].value_counts())
data = data[data['MARRIAGE'] != 'UNK']
data['MARRIAGE'] = data['MARRIAGE'].cat.remove_unused_categories()
print('\nWe are left with {} clients\n'.format(data.shape[0]))
print(data['MARRIAGE'].unique())

Example: education

  1. It should be either 1 (graduate school), 2 (university), 3 (high school) or 4 (others).
  2. But we find some 0, 5 and 6 (previously transformed to UNK, UNK1 and UNK2).
  3. Let's assume these values are dubious but do not invalidate the data, and keep them as they may have some predictive power.

In [ ]:
print(data['EDUCATION'].value_counts())
data.loc[data['EDUCATION'] == 'UNK1', 'EDUCATION'] = 'UNK'
data.loc[data['EDUCATION'] == 'UNK2', 'EDUCATION'] = 'UNK'
data['EDUCATION'] = data['EDUCATION'].cat.remove_unused_categories()
print(data['EDUCATION'].value_counts())

3 Data statistics

  • Get descriptive statistics.
  • Plot informative figures.
  • Verify some intuitive correlations.

3.1 Descriptive statistics

Let's first get some descriptive statistics of our numerical variables.


In [ ]:
attributes_numerical = ['LIMIT', 'AGE']
attributes_numerical.extend(attributes[11:23])
data.loc[:, attributes_numerical].describe().astype(int)

Let's plot a histogram of the ages, so that we get a better impression of who our clients are. That may even be an end goal, e.g. if your marketing team asks which customer groups to target.

Then a boxplot of the bills, which may serve as a verification of the quality of the acquired data.


In [ ]:
data.loc[:, 'AGE'].plot.hist(bins=20, figsize=(15,5))
ax = data.iloc[:, 11:17].plot.box(logy=True, figsize=(15,5))

3.2 Check a hypothesis

Simple question: what proportion of our clients default?


In [ ]:
percentage = data['DEFAULT'].value_counts()[1] / data.shape[0] * 100
print('Percentage of defaults: {:.2f}%'.format(percentage))

Another question: who is more susceptible to default, males or females?


In [ ]:
observed = pd.crosstab(data['SEX'], data['DEFAULT'], margins=True)
observed

It seems that females are a better risk. Let's verify with a chi-squared test of independence, using scipy.stats: a small p-value would let us reject the hypothesis that sex and default are independent.


In [ ]:
from scipy import stats
_, p, _, expected = stats.chi2_contingency(observed.iloc[:2,:2])
print('p-value = {:.2e}'.format(p))
print('expected values:\n{}'.format(expected))

Intuition: people who pay late present a higher risk of defaulting. Let's verify! Verifying some intuitions will also help you identify mistakes. E.g. it would be suspicious if that intuition were not verified in the data: did we select the right column, or did we miscompute a result?


In [ ]:
group = data.groupby('DELAY1').mean()
corr = data['DEFAULT'].corr(data['DELAY1'], method='pearson')
group['DEFAULT'].plot(grid=True, title='Pearson correlation: {:.4f}'.format(corr), figsize=(15,5));

In [ ]:
group = data.groupby('AGE').mean()
group['LIMIT'].plot(grid=True, title='Mean credit card limit per age', figsize=(15,5));

In [ ]:
print(data['AGE'].value_counts())

3.3 Statistical modelling

Statsmodels is similar to scikit-learn, with a much stronger emphasis on parameter estimation and (statistical) testing. It is similar in spirit to other statistical packages such as R, SPSS, SAS and Stata. That split reflects the two statistical modeling cultures: (1) Statistics, which wants to know how well a given model fits the data and which variables "explain" or affect the outcome, and (2) Machine Learning, whose main supported task is choosing the "best" model for prediction.


In [ ]:
# Back to numeric values.
data['SEX'].cat.categories = [-1, 1]
data['SEX'] = data['SEX'].astype(int)
data['MARRIAGE'].cat.categories = [-1, 1, 0]
data['MARRIAGE'] = data['MARRIAGE'].astype(int)
data['EDUCATION'].cat.categories = [-2, 2, 1, 0, -1]
data['EDUCATION'] = data['EDUCATION'].astype(int)

data['DEFAULT'] = data['DEFAULT'] * 2 - 1  # [0,1] --> [-1,1]

In [ ]:
# Observations and targets.
X = data.values[:,:23]
y = data.values[:,23]
n, d = X.shape
print('The data is a {} with {} samples of dimensionality {}.'.format(type(X), n, d))

In [ ]:
import statsmodels.api as sm

# Fit the Ordinary Least Square regression model.
results = sm.OLS(y, X).fit()

# Inspect the results.
print(results.summary())
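
Note that sm.OLS does not add an intercept by default, so the regression above has no constant term. A minimal variation, assuming the same X and y as above, that includes one:


In [ ]:
# Sketch: the same regression with an explicit intercept, added with sm.add_constant.
X1 = sm.add_constant(X)
results1 = sm.OLS(y, X1).fit()
print(results1.summary())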

4 Data visualisation

Data visualization is a key aspect of exploratory data analysis.

4.1 Time series

To start slowly, let's make a static line plot from some time series. Reproduce the plots below using:

  1. The procedural API of matplotlib, the main data visualization library for Python; it is similar to MATLAB's and convenient for interactive work.
  2. Pandas, which wraps matplotlib around its DataFrame format and makes many standard plots easy to code. It offers many helpers for data visualization.

In [ ]:
import matplotlib.pyplot as plt

# Random time series.
n = 1000
rs = np.random.RandomState(42)
data = rs.randn(n, 4).cumsum(axis=0)

plt.figure(figsize=(15,5))
plt.plot(data[:, 0], label='A')
plt.plot(data[:, 1], '.-k', label='B')
plt.plot(data[:, 2], '--m', label='C')
plt.plot(data[:, 3], ':', label='D')
plt.legend(loc='upper left')
plt.xticks(range(0, 1000, 50))
plt.ylabel('Value')
plt.xlabel('Day')
plt.grid()

In [ ]:
idx = pd.date_range('1/1/2000', periods=n)
df = pd.DataFrame(data, index=idx, columns=list('ABCD'))
df.plot(figsize=(15,5));

4.2 Frequency

A frequency plot is a graph that shows the pattern in a set of data by plotting how often particular values of a measure occur. It often takes the form of a histogram or a box plot.

Seaborn is a statistical visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Its advantage is that you can still modify the produced plots with matplotlib (see the sketch below), so you lose nothing.


In [ ]:
import seaborn as sns
df = sns.load_dataset('iris', data_home=os.path.join('..', 'data'))

In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

g = sns.distplot(df['petal_width'], kde=True, rug=False, ax=axes[0])
g.set(title='Distribution of petal width')

g = sns.boxplot(x='species', y='petal_width', data=df, ax=axes[1])
g.set(title='Distribution of petal width by species');
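
Since seaborn draws on regular matplotlib axes, the figure above can still be tweaked with plain matplotlib calls. A minimal sketch, assuming the fig and axes from the previous cell:


In [ ]:
# Sketch: post-process the seaborn figure with plain matplotlib.
axes[0].set_xlabel('petal width (cm)')  # iris measurements are in centimetres
axes[1].grid(axis='y')
fig.tight_layout()
fig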

4.3 Correlation

Scatter plots are widely used to assess the correlation between two variables. Pair plots are then a useful way of displaying the pairwise relations between the variables in a dataset.

Use the seaborn pairplot() function to analyze how separable the iris dataset is.


In [ ]:
sns.pairplot(df, hue="species");

4.4 Dimensionality reduction

Humans can only comprehend up to 3 dimensions in space (then there is e.g. color or size), so dimensionality reduction is often needed to explore high-dimensional datasets. Analyze how separable the iris dataset is by visualizing it in a 2D scatter plot after reduction from 4 to 2 dimensions with two popular methods:

  1. The classical principal component analysis (PCA).
  2. t-distributed stochastic neighbor embedding (t-SNE).

Hints:

  • t-SNE is a stochastic method, so you may want to run it multiple times (a reproducibility sketch follows the plots below).
  • The easiest way to create the scatter plot is to add columns to the pandas DataFrame, then use the Seaborn swarmplot().

In [ ]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [ ]:
pca = PCA(n_components=2)
X = pca.fit_transform(df.values[:, :4])
df['pca1'] = X[:, 0]
df['pca2'] = X[:, 1]

tsne = TSNE(n_components=2)
X = tsne.fit_transform(df.values[:, :4])
df['tsne1'] = X[:, 0]
df['tsne2'] = X[:, 1]

In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.swarmplot(x='pca1', y='pca2', data=df, hue='species', ax=axes[0])
sns.swarmplot(x='tsne1', y='tsne2', data=df, hue='species', ax=axes[1]);
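
As noted in the hints, t-SNE is stochastic: successive runs generally give different embeddings. A minimal sketch to make the embedding reproducible (the random_state value is arbitrary):


In [ ]:
# Sketch: fix the seed so that repeated runs give the same t-SNE embedding.
tsne = TSNE(n_components=2, random_state=0)
X = tsne.fit_transform(df.values[:, :4])
df['tsne1'] = X[:, 0]
df['tsne2'] = X[:, 1]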

5 Graph visualisation

You have already worked with NetworkX to make basic graph plots. Other useful resources are Gephi, which allows for a nicer and more artistic approach, and Plotly, which is useful for data visualisation as well as graph drawing.


In [ ]:
%matplotlib notebook

import time
import networkx as nx
import scipy

In [ ]:
n = 20
T = 50
er = nx.erdos_renyi_graph(n, 0.3)
er.remove_nodes_from(list(nx.isolates(er)))  # remove isolated nodes
n = len(er)  # update n to the number of remaining nodes
nx.draw(er)

Time series data can also live on the nodes of our graph (e.g. temperatures in cities). Create random time series data for our graph and plot it on the graph, with node size representing the signal value.

You can also experiment with different layouts for drawing the graph. The default layout positions nodes using the Fruchterman-Reingold force-directed algorithm (spring layout). Draw the graph using the spectral layout.
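
As a quick static look (a minimal sketch; the animation function below fixes the same layout for every frame):


In [ ]:
# Sketch: draw the graph once with the spectral layout instead of the default spring layout.
nx.draw(er, pos=nx.spectral_layout(er))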


In [ ]:
sigma_eps = 1
epsilon = stats.norm.rvs(scale=sigma_eps, size=[n, T])  # Gaussian noise on each node at each time step

In [ ]:
def draw_graph_time_series(x):
    pos = nx.spectral_layout(er)  # fix the node positions across frames

    for i in range(T):
        ax.clear()
        nx.draw_networkx(er, ax=ax, pos=pos, node_size=x[:, i]*100)
        ax.text(0, 0, i)  # current time step
        fig.canvas.draw()
        time.sleep(0.03)

In [ ]:
fig, ax = plt.subplots()
plt.ion()
fig.show()
fig.canvas.draw()

In [ ]:
draw_graph_time_series(epsilon)

In the last experiment we plotted noise. Create two events on your graph, at times 10 and 30. Each event should be centered on a randomly selected node and increase the signal value of that node by 50. Both events should then propagate through the graph according to the heat kernel model. When an event occurs, give the source node of the event a different color to help visualise it. Draw the graph using the spring layout.
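
As a reminder of the standard heat diffusion model on graphs: a unit impulse $\delta_s$ placed on the source node $s$ at time $t_0$ evolves as $x(\tau) = e^{-(\tau - t_0) L}\, \delta_s$, where $L$ is the (normalised) graph Laplacian. This matrix exponential is what scipy.linalg.expm computes in the cells below.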


In [ ]:
def create_x(Lapl):
    x = epsilon.copy()
    peak = np.zeros((n, 2))  # one impulse vector per event
    for i in range(1, T):
        l = 0
        while t[l] < i:  # diffuse every event that has already occurred
            peak[source[l]][l] = 1  # unit impulse at the source node of event l
            event = np.dot(scipy.linalg.expm((t[l] - i) * Lapl), peak[:, l])  # heat kernel exp(-(i - t[l]) L)
            x[:, i] = x[:, i] + 50 * event
            l += 1
            if l >= 2:
                break
    return x

t = [10, 30]  # event times
source = stats.randint.rvs(low=0, high=n, size=2)  # random source node of each event

In [ ]:
Lapl = nx.normalized_laplacian_matrix(er)
Lapl = Lapl.todense()
x = create_x(Lapl)

In [ ]:
def draw_graph_diffusion(x):
    pos = nx.spring_layout(er)  # fix the node positions across frames
    node_col = ['r'] * n  # one color per node
    l = 0
    for i in range(T):
        ax.clear()
        if l < 2:
            if i == t[l]:
                node_col[source[l]] = 'b'  # highlight the source node when the event occurs
                l = l + 1
        nx.draw_networkx(er, ax=ax, node_color=node_col, pos=pos, node_size=x[:, i]*100)
        ax.text(0, 0, i)  # current time step
        fig.canvas.draw()
        time.sleep(0.03)

In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.ion()
fig.show()
fig.canvas.draw()

In [ ]:
draw_graph_diffusion(x)