Michaël Defferrard, PhD student, EPFL LTS2
Data visualization is a key aspect of exploratory data analysis. During this exercise we'll build increasingly complex visualizations by replicating plots. Try to reproduce not only the lines but also the axis labels, legends, and titles.
Data visualization is both an art and a science. It should combine aesthetic form with functionality.
To start slowly, let's make a static line plot from some time series. Reproduce the plots below using matplotlib and then pandas, as in the code cells that follow.
Hint: to plot with pandas, you first need to create a DataFrame, pandas' tabular data format.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Random time series.
n = 1000
rs = np.random.RandomState(42)
data = rs.randn(n, 4).cumsum(axis=0)
In [3]:
plt.figure(figsize=(15,5))
plt.plot(data[:, :])
In [3]:
# df = pd.DataFrame(...)
# df.plot(...)
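One possible way to fill in the pandas cell is sketched below. It wraps the same random walk in a DataFrame and calls its `plot()` method. The column names, the date index, and the axis labels are assumptions chosen for illustration; they are not taken from the reference plot.

```python
import numpy as np
import pandas as pd

# Same random time series as in the cell above.
n = 1000
rs = np.random.RandomState(42)
data = rs.randn(n, 4).cumsum(axis=0)

# Hypothetical labels: the column names and the daily date index are assumptions.
index = pd.date_range('2017-01-01', periods=n, freq='D')
df = pd.DataFrame(data, index=index, columns=list('ABCD'))

# DataFrame.plot() draws one line per column and adds a legend automatically.
ax = df.plot(figsize=(15, 5), title='Random time series')
ax.set_xlabel('Time')
ax.set_ylabel('Value')
```

Because the DataFrame carries column names and an index, the legend and the x-axis ticks come for free, which is the main convenience over calling `plt.plot()` on a bare array.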
Categorical data is best represented by bar or pie charts. Reproduce the plots below using the object-oriented API of matplotlib, which is recommended for programming.
Question: What are the pros and cons of each plot?
Tip: the matplotlib gallery is a convenient starting point.
In [4]:
data = [10, 40, 25, 15, 10]
categories = list('ABCDE')
In [5]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Right plot.
# axes[1].
# axes[1].
# Left plot.
# axes[0].
# axes[0].
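The skeleton above could be completed along the following lines, using only methods of the `Axes` objects (the object-oriented API). This is a minimal sketch: the titles, axis labels, and percentage format are assumptions, since the reference plots may be styled differently.

```python
import matplotlib.pyplot as plt

data = [10, 40, 25, 15, 10]
categories = list('ABCDE')

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Left plot: bar chart, one bar per category.
bars = axes[0].bar(categories, data)
axes[0].set_xlabel('Category')  # assumed label
axes[0].set_ylabel('Value')     # assumed label
axes[0].set_title('Bar chart')  # assumed title

# Right plot: pie chart with the share of each category printed on its wedge.
wedges, texts, autotexts = axes[1].pie(data, labels=categories, autopct='%1.0f%%')
axes[1].set_aspect('equal')     # keep the pie circular
axes[1].set_title('Pie chart')  # assumed title
```

The bar chart makes it easy to compare individual values, while the pie chart emphasizes each category's share of the whole.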
A frequency plot is a graph that shows the pattern in a set of data by plotting how often particular values of a measure occur. It often takes the form of a histogram or a box plot.
Reproduce the plots with each of the three libraries used in the cells below (Seaborn, ggplot, and Altair), which provide high-level declarative syntax for statistical visualization as well as a convenient interface to pandas.
Hint: for Seaborn, look at distplot() and boxplot().
In [6]:
import seaborn as sns
import os
df = sns.load_dataset('iris', data_home=os.path.join('..', 'data'))
In [7]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Your code for Seaborn: distplot() and boxplot().
In [8]:
import ggplot
# Your code for ggplot.
In [10]:
import altair
# altair.Chart(df).mark_bar(opacity=.75).encode(
# x=...,
# y=...,
# color=...
# )
Scatter plots are widely used to assess the correlation between two variables. Pair plots are then a useful way of displaying the pairwise relations between variables in a dataset.
Use the seaborn pairplot() function to analyze how separable the iris dataset is.
In [11]:
# One line with Seaborn.
Humans can perceive at most three spatial dimensions (color or size can encode a few more), so dimensionality reduction is often needed to explore high-dimensional datasets. Analyze how separable the iris dataset is by visualizing it in a 2D scatter plot after reduction from 4 to 2 dimensions with two popular methods, PCA and t-SNE.
Hint: use Seaborn's swarmplot().
In [12]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
In [13]:
# df['pca1'] =
# df['pca2'] =
# df['tsne1'] =
# df['tsne2'] =
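The four commented assignments could be filled in roughly as follows. This is a sketch under stated assumptions: the iris table is rebuilt from scikit-learn's bundled copy with seaborn-style column names, the features are used unscaled, and `random_state=0` is an arbitrary choice that makes the stochastic t-SNE embedding reproducible.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Assumption: rebuild the iris DataFrame locally with seaborn-style column names.
iris = load_iris()
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df = pd.DataFrame(iris.data, columns=features)
df['species'] = iris.target_names[iris.target]

X = df[features].values  # 150 samples, 4 features

# PCA: linear projection onto the two directions of largest variance.
pca = PCA(n_components=2).fit_transform(X)
df['pca1'] = pca[:, 0]
df['pca2'] = pca[:, 1]

# t-SNE: non-linear embedding that tries to preserve local neighborhoods.
tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
df['tsne1'] = tsne[:, 0]
df['tsne2'] = tsne[:, 1]
```

Storing the reduced coordinates as DataFrame columns lets the plotting cell refer to them by name, alongside the `species` column used for coloring.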
In [14]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.swarmplot(x='pca1', y='pca2', data=df, hue='species', ax=axes[0])
sns.swarmplot(x='tsne1', y='tsne2', data=df, hue='species', ax=axes[1]);
For interactive visualization, look at bokeh (we used it during the data exploration exercise) or VisPy.
If you want to visualize data on an interactive map, look at Folium.