The following notebook serves as an introduction to data visualization with Python for the course "Data Mining".
For any comments or suggestions you can contact charlotte[dot]laclau[at]univ-grenoble-alpes[dot]fr or parantapa[dot]goswami[at]viseo[dot]com
Data visualization (DataViz) is an essential tool for exploring data and finding insights in it. Before jumping to complex machine learning or multivariate models, one should always take a first look at the data through simple visualization techniques. Indeed, visualization provides a unique perspective on the dataset that might in some cases allow you to detect potential challenges or specificities in your data that should be taken into account in future, in-depth analysis.
Data can be visualized in many different ways depending on the nature of the features of the data (continuous, discrete or categorical). The level of visualization also depends on the number of dimensions that need to be represented: univariate (1D), bivariate (2D) or multivariate (ND).
The goal of this session is to discover how to make 1D, 2D, 3D and eventually multidimensional data visualizations with Python. We will see different methods that can help you choose the visualization best suited to the data at hand.
We will explore three different libraries: matplotlib, pandas and seaborn.
In [1]:
# Import all three libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# For displaying the plots inside Notebook
%matplotlib inline
Note: Both the pandas visualization modules and seaborn are based on matplotlib, therefore many commands related to the customization of plots can be found in tutorials on matplotlib.
The Pima Indian Diabetes Dataset consists of 768 females, who are at least 21 years old, of Pima Indian heritage. They are described by the following 8 features, which take numerical values, and the class:
In [2]:
# We start by importing the data using pandas
# Hint: use "read_csv" method, Note that comma (",") is the field separator, and we have no "header"
df = pd.read_csv('pima.txt', sep=",", header=None)
# We name the columns based on above features
df.columns = ["Pregnancy","Glucose","BloodPress", "Fold", "Insulin","BodyMass",'Diabetes','Age','Class']
# We take a sneak peek into the data
# Hint: use dataframe "head" method with "n" parameter
df.head(n=5)
Out[2]:
1D plots are a good way to detect outliers, level of variability in the features etc.
Here is a non-exhaustive list of possible methods:
The Box Plot is an interesting tool for data visualization as it summarizes several statistics of the feature: the minimum and maximum values (ends of the whiskers, excluding any points drawn separately as outliers), the first and third quartiles (bottom and top lines of the box), the median value (middle line in the box) and the range (dispersion).
We will use "BodyMass" as the example here.
In [3]:
# Write code to draw a Box Plot of "BodyMass" feature
# Hint 1: use "DataFrame.boxplot" from pandas on the dataframe df
# Hint 2: choose the column properly
# Hint 3: you can control "grid" parameter as True or False
df.boxplot(column = 'BodyMass', grid=False)
Out[3]:
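The statistics that a box plot summarizes can also be computed directly. The sketch below uses a small made-up series (not the Pima data) and assumes the common 1.5 x IQR whisker rule, which is the matplotlib default; all variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample values, standing in for a feature such as "BodyMass"
values = pd.Series([18.0, 22.0, 24.0, 25.0, 27.0, 30.0, 31.0, 33.0, 36.0, 60.0])

q1 = values.quantile(0.25)      # first quartile: bottom of the box
median = values.quantile(0.50)  # median: line inside the box
q3 = values.quantile(0.75)      # third quartile: top of the box
iqr = q3 - q1                   # interquartile range: height of the box

# Assumed whisker rule: points further than 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```

Comparing these numbers with the drawn box is a quick way to check how the plot should be read.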
In [4]:
# Write code to draw a Histogram of "BodyMass" feature
# Hint 1: use "DataFrame.hist" from pandas on the dataframe df
# Hint 2: choose the column properly
# Hint 3: you can control "grid" parameter as True or False
# Hint 4: for this plot choose "bins" as 10, "alpha" as 0.5 and "ec" as "black"
df.hist(column='BodyMass', grid=False, bins=10, alpha=0.5, ec='black')
Out[4]:
Warning: The number of bins/intervals that you choose can strongly impact the representation. To check it, change the value in the option bins to 3 for instance.
In [5]:
# Write code to draw a histogram for "BodyMass" feature with 3 bins
df.hist(column='BodyMass', grid=False, bins=3, alpha=0.5, ec='black')
Out[5]:
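To see the effect of the bin count without plotting, the sketch below bins the same made-up sample with numpy.histogram, first with 3 bins and then with 10. The sample (two groups of values around 20 and 40) is an assumption chosen for illustration, not the Pima data.

```python
import numpy as np

# Hypothetical sample: two groups of values centred on 20 and 40
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(20, 2, 200), rng.normal(40, 2, 200)])

# np.histogram returns the counts per bin and the bin edges
counts_3, edges_3 = np.histogram(sample, bins=3)
counts_10, edges_10 = np.histogram(sample, bins=10)

print(counts_3)   # coarse summary: 3 counts
print(counts_10)  # finer summary: 10 counts
```

Both arrays describe the same 400 values; only the resolution of the summary changes.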
In [6]:
# Write code to draw a Density Plot of "BodyMass" feature
# Hint: use "DataFrame.plot.density" from pandas on the dataframe df
df['BodyMass'].plot.density()
Out[6]:
In [7]:
# Write code to draw Box Plot for "BodyMass" and "Glucose" together
# Hint: you can pass a list of features for "column" parameter
df.boxplot(column=['BodyMass', 'Glucose'], grid=False)
Out[7]:
In [8]:
# Write code to draw Density Plot for several continuous features together (three are used here)
# Hint: you can filter dataframe by a list of columns and then use plot.density
df[['BodyMass','Glucose','Fold']].plot.density()
Out[8]:
For the box plot and histogram, to visualize each feature in its own scale, it is better to draw one plot per feature. All these plots can be arranged in a grid.
Task: See the usage of plt.subplots(). Then draw:
Note: You can also play with basic customization for each figure (labeling, title, colors etc.)
In [9]:
# Write code to create subplots to visualize four continuous features
# Hint: use plt.subplots() with a 2 by 2 grid. You can adjust these using "nrows" and "ncols"
fig, axes = plt.subplots(nrows=2, ncols=2)
df.hist('BodyMass', bins=10, grid=False, alpha=0.5, ec='black', ax=axes[0, 0])
df.hist('Glucose', bins=10, grid=False, alpha=0.5, ec='black', ax=axes[0, 1])
df.hist('Fold', bins=10, grid=False, alpha=0.5, ec='black', ax=axes[1, 0])
df.hist('BloodPress', bins=10, grid=False, alpha=0.5, ec='black', ax=axes[1, 1])
# For a neat display of plots
plt.tight_layout()
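The grid of axes returned by plt.subplots() can also be filled in a loop, which avoids repeating one call per feature. This is a minimal sketch on made-up data; the column names f1 to f4 and the figure size are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: four made-up continuous features (not the Pima columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

# axes is a 2 x 2 array; axes.flat iterates over it row by row
for ax, name in zip(axes.flat, toy.columns):
    toy[name].plot.hist(bins=10, alpha=0.5, ec="black", ax=ax)
    ax.set_title(name)       # basic customization: one title per panel
    ax.set_xlabel("value")

fig.tight_layout()
```

Each panel keeps its own scale, which is the point of drawing one plot per feature.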
In [10]:
# Write code to create a Bar Chart of the "Pregnancy" feature.
df.Pregnancy.plot.bar()
# For a neat display of plots: playing with the X-tick labels
_ = plt.xticks(list(df.index)[::100], list(df.index)[::100])
Let us now get some other information from the "Pregnancy" feature. We will now visualize the distribution of the number of females for different "Pregnancy" values. For that, first generate the count distribution (hint: use value_counts()), then draw a Bar Chart of this count distribution.
In [11]:
# Step 1: Write code to generate the count distribution
df["Pregnancy"].value_counts()
Out[11]:
In [12]:
# Step 2: Write code to create Bar Chart for the count distribution
df["Pregnancy"].value_counts().plot.bar()
Out[12]:
The most common way to visualize the distribution of a categorical feature is the Pie Chart. It is a circular statistical graphic which is divided into slices to illustrate the numerical proportion of the different possible values of the categorical feature. The arc length of each slice is proportional to the quantity it represents.
As we are visualizing the count distribution, the first step is to use DataFrame.value_counts()
In [13]:
# Write code to create a Pie Chart of the "Class" feature.
# Hint 1: plot the count distribution, NOT the data itself
# Hint 2: use plot.pie() on the count distribution
# Hint 3: use autopct="%1.1f%%" to display percentage values and following colors
colors = ['gold', 'lightcoral']
df['Class'].value_counts().plot.pie(autopct='%1.1f%%', colors=colors)
# For a neat display of the plot
_ = plt.axis('equal')
Remark: Pie charts are very effective for visualizing the distribution of classes in the training data. They help to discover whether there exists a strong imbalance of classes in the data.
Warning: Pie charts cannot show more than a few values as the slices become too small. This makes them unsuitable for use with categorical features with a larger number of possible values.
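The proportions that a pie chart displays can be checked numerically with value_counts(). The labels below are made up for illustration and only stand in for the "Class" column.

```python
import pandas as pd

# Hypothetical binary labels, standing in for the "Class" column
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Class")

counts = labels.value_counts()                # absolute counts per class
shares = labels.value_counts(normalize=True)  # relative frequencies, summing to 1

print(counts.to_dict())
print(shares.to_dict())
```

A share far from an even split is a first numerical sign of class imbalance.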
2D plots (or multi-D plots in general) are important to detect potential dependencies in the data (collinearity, linearity etc.).
Again, the nature of the features will guide you to choose a suitable representation.
A Scatter Plot is typically used to display the values of two variables for a set of data on Cartesian coordinates. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. A scatter plot can suggest various kinds of correlations between features.
Let's use "BodyMass" and "Fold" features together.
In [14]:
# Write code to create Scatter Plot between "BodyMass" and "Fold" features
# Hint: use "DataFrame.plot.scatter()" on our dataframe df, and mention the "x" and "y" axis features
df.plot.scatter(x="BodyMass", y="Fold")
Out[14]:
Questions:
Remark: The pandas plot module is very useful for basic visualization techniques (see the pandas visualization documentation for more details).
Now we will explore the seaborn library to create more advanced visualizations.
First, we will see seaborn jointplot(). It shows bivariate scatter plots and univariate histograms in the same figure.
In [15]:
# Write code to create a jointplot using "BodyMass" and "Fold" features
# Hint 1: mention the "x" and "y" axis features, and our dataframe df as "data"
# Hint 2: "height" parameter controls the size of the plot (it was named "size" before seaborn 0.9). Here try height=6
sns.jointplot(x="BodyMass", y="Fold", data=df, height=6)
Out[15]:
Note: The legend refers to a correlation test (Pearson $\rho$), which indicates a significant correlation between these features (p-value below .05). Recent seaborn releases no longer print this annotation on the joint plot.
Question: Does the Pearson $\rho$ calculated correspond to your interpretation of correlation from the previous scatter plot?
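Pearson's coefficient can also be computed directly, which is useful when the plot does not display it. The sketch below uses made-up paired values (not the Pima data) and compares the pandas result with the textbook definition; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements (not taken from the Pima data)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [1.5, 1.9, 3.2, 3.8, 5.1, 5.8],
})

# pandas computes Pearson's coefficient by default
rho_pandas = toy["x"].corr(toy["y"])

# Same quantity from the definition: covariance over product of standard deviations
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()
rho_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(rho_pandas, 3), round(rho_manual, 3))
```

A value close to 1 or -1 corresponds to points lying near a straight line in the scatter plot, while a value near 0 indicates no linear trend.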
In [16]:
# Write code to create a seaborn pairplot using all 4 continuous features
# Hint 1: use our dataframe df
# Hint 2: give the list of features to "vars" parameters.
# Hint 3: use markers=["o", "s"] and palette="husl" for better display
sns.pairplot(df, vars=["BodyMass","Fold", "Glucose", "Diabetes"])
Out[16]:
Question: Can you explain the nature of the diagonal plots?
Note: It is possible to project the class onto the pair plot (one color for each class) using the option hue in the pairplot function.
In [17]:
# Write code to create a seaborn pairplot using all 4 continuous features and the "Class"
# Hint: use "hue" option with the "Class" variable
sns.pairplot(df, hue='Class', vars=["BodyMass","Fold", "Glucose", "Diabetes"])
Out[17]:
In order to cross continuous and categorical features, you can again use a box plot. It allows you to visualize the distribution of a continuous variable for each possible value of the categorical variable. One common application is to visualize the output of a clustering algorithm.
Here, we will draw a box plot between the continuous "BodyMass" and the categorical "Class" features. We will use the seaborn boxplot function.
In [18]:
# Write code to create a Box Plot between "BodyMass" and "Class"
# Hint: mention the "x" and "y" axis features, and our dataframe df as "data"
sns.boxplot(x="Class", y="BodyMass", data=df)
Out[18]:
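The per-class boxes can be cross-checked numerically by computing the quartiles of the continuous feature within each class. This is a minimal sketch on made-up values; only the column names are borrowed from the exercise.

```python
import pandas as pd

# Hypothetical toy data reusing the column names of the exercise
toy = pd.DataFrame({
    "Class":    [0, 0, 0, 0, 1, 1, 1, 1],
    "BodyMass": [22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0],
})

# Quartiles of "BodyMass" within each class: one row per class,
# one column per quantile (0.25, 0.5, 0.75)
summary = toy.groupby("Class")["BodyMass"].quantile([0.25, 0.5, 0.75]).unstack()

print(summary)
```

Each row of the table corresponds to one box: the 0.25 and 0.75 columns give its edges and the 0.5 column gives the median line.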
The Violin Plot is similar to the box plot, except that it also shows the probability density of the data at different values. Like box plots, violin plots include a marker for the median of the data and a box indicating the interquartile range. Violin plots are used to compare the distribution of a variable (or sample distribution) across different categories.
In [19]:
# Write code to create a Violin Plot between "BodyMass" and "Class"
# Hint: mention the "x" and "y" axis features, and our dataframe df as "data"
sns.violinplot(x="Class", y="BodyMass", data=df)
Out[19]:
A Heat Map is a two-dimensional representation of data in which values are represented by colors. A simple heat map provides an immediate visual summary of information.
In this exercise, we will use "Pregnancy" and "Class" features:
1. Use groupby to group the new two-column dataframe based on both "Pregnancy" and "Class" features
2. Use the size() function to get the count of every possible pair of values of "Pregnancy" and "Class"
3. Use reset_index() with the name="count" argument to set a name for the count column
4. Generate a pivot table using all three columns in the following order: "Pregnancy", "Class", "count"
5. Draw the heat map of the pivot table with seaborn.heatmap
Note: This is an advanced piece of code, so you may need to consult several sources before you get it right. Do not hesitate to ask for help.
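The data preparation steps listed above can be tried first on a tiny table where the counts are easy to verify by hand. The values below are made up; only the column names match the exercise, and the pivot call uses keyword arguments, which recent pandas versions require.

```python
import pandas as pd

# Hypothetical toy data with the same two column names as in the exercise
toy = pd.DataFrame({
    "Pregnancy": [0, 0, 1, 1, 1, 2, 2, 2, 2],
    "Class":     [0, 1, 0, 0, 1, 0, 1, 1, 1],
})

# Count every observed ("Pregnancy", "Class") pair and name the count column
pair_counts = toy.groupby(["Pregnancy", "Class"]).size().reset_index(name="count")

# Rows: "Pregnancy" values, columns: "Class" values, cells: counts
table = pair_counts.pivot(index="Pregnancy", columns="Class", values="count")

print(table)
```

The resulting table is the kind of 2D input that seaborn.heatmap expects. Pairs that never occur in the data appear as missing values in the pivot table.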
In [20]:
# Write code to create a Heat Map using the above steps
df2 = df[["Pregnancy", "Class"]].groupby(["Pregnancy", "Class"]).size().reset_index(name="count")
# Keyword arguments are required by recent pandas versions
sns.heatmap(df2.pivot(index="Pregnancy", columns="Class", values="count"))
Out[20]:
3D plots let you visualize 3 features together. Like 2D plots, 3D plots are used to analyze potential dependencies in the data (collinearity, linearity etc.).
Here also, the nature of the features guides the choice of visualization. In this exercise, however, we will only visualize 3 continuous features together.
In this part, we will use the Axes3D module from the matplotlib library to draw a 3D scatter plot. For other kinds of 3D plots, you can refer to the matplotlib mplot3d documentation.
In [21]:
from mpl_toolkits.mplot3d import Axes3D
# For interactive 3D plots
%matplotlib notebook
In [28]:
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook
# Write code to create a 3D Scatter Plot for "BodyMass", "Fold" and "Glucose" features
# Hint 1: follow the basic steps mentioned in the link above.
# Hint 2: pass three desired columns of our dataframe as "xs", "ys" and "zs" in the scatter plot function
fig_scatter = plt.figure()
ax = fig_scatter.add_subplot(111, projection='3d')
ax.scatter(df["BodyMass"], df["Fold"], df["Glucose"])
# Write code to display the feature names as axes labels
# Hint: set_xlabel etc. methods to set the labels
ax.set_xlabel("BodyMass")
ax.set_ylabel("Fold")
ax.set_zlabel("Glucose")
Out[28]:
For visualizing data with more than 3 features, we have to rely on additional tools. One such tool is Principal Component Analysis (PCA).
PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called Principal Components (PC). The number of distinct principal components is equal to the smaller of the number of original variables or the number of observations minus one.
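The idea of the transformation can be illustrated with numpy alone. The sketch below is not how the exercise is solved (that uses scikit-learn); it assumes made-up data with three variables, two of which are strongly correlated, and derives the principal components from a singular value decomposition of the centered data.

```python
import numpy as np

# Hypothetical data: 200 observations of 3 variables, the first two correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center each variable, then take the singular value decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; projecting gives the components
components = X_centered @ Vt.T

# Share of the total variance carried by each component
explained_ratio = S ** 2 / np.sum(S ** 2)

# Covariance matrix of the components: off-diagonal terms are numerically zero
cov = np.cov(components, rowvar=False)

print(np.round(explained_ratio, 3))
```

Because the decomposition is driven by variance, features measured on larger scales weigh more heavily in the result, so standardizing the features beforehand is a common choice.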
We will do PCA to transform our data having 8 numerical features into 2 principal components. We will use the sklearn.decomposition.PCA class.
In [27]:
# Write code to import libraries
from sklearn.decomposition import PCA
# We will use following columns of our dataframe:
columns_pca = ["Pregnancy","Glucose","BloodPress", "Fold", "Insulin","BodyMass",'Diabetes','Age']
# Write code to fit a PCA with the dataframe using above columns.
# Hint 1: first create a PCA instance with "n_components=2"
# as we are attempting to generate 2 principal components.
# Hint 2: fit_transform PCA with the required dataframe
pca2 = PCA(n_components = 2)
array2PC = pca2.fit_transform(df[columns_pca])
As you can discover from the PCA documentation, PCA returns principal components as a numpy array. For ease of plotting with seaborn, we will create a pandas DataFrame from the principal components: use pandas.DataFrame() to convert the numpy array, then DataFrame.join() to append the "Class" column.
In [10]:
# Write code to convert array2PC to a DataFrame with columns "PC1" and "PC2"
df2PC = pd.DataFrame(array2PC, columns=["PC1", "PC2"])
# Write code to update df2PC by appending "Class" column from original dataframe df
# Hint: using "join" on df2PC
df2PC = df2PC.join(df["Class"])
Now, we will create a scatter plot to visualize our 8D data transformed into 2 principal components.
For creating scatter plots using seaborn we will use the lmplot function with fit_reg=False.
In [11]:
# For displaying the plots inside Notebook
%matplotlib inline
# Write code to create scatter plot for 2 PCs.
# Hint 1: use seaborn.lmplot and set fit_reg=False
# Hint 2: use hue option to visualize the "Class" labels in the plot
sns.lmplot(x="PC1", y="PC2", data=df2PC, hue="Class", fit_reg=False)
Out[11]:
Now, you have to do PCA on the data as before but for 3 principal components. Then plot 3 principal components in a 3D scatter plot.
Hint: The hue equivalent for 3D scatter plot is c, and you have to pass the entire "Class" column, not just the name.
In [12]:
# Write code to do PCA with 3 principal components on the dataframe with columns_pca
pca3 = PCA(n_components = 3)
array3PC = pca3.fit_transform(df[columns_pca])
df3PC = pd.DataFrame(array3PC, columns=["PC1", "PC2", "PC3"])
df3PC = df3PC.join(df["Class"])
fig_pca = plt.figure()
ax = fig_pca.add_subplot(111, projection='3d')
ax.scatter(df3PC["PC1"], df3PC["PC2"], df3PC["PC3"], c=df3PC["Class"])
Out[12]: