In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
In [6]:
%matplotlib notebook
As we saw during the lecture, a nearly unlimited number of parameters can be adjusted when developing visuals for scientific communication. From something as small as the thickness of the axes to something as critical as the choice of color (or the choice to avoid color), each of these choices will eventually affect the final interpretation of the data.
As you construct visualizations today, there are three points from the lecture that I especially want to highlight:
Alternatively, ask yourself, "what would the newspaper headline be for this figure/presentation?"
We will use the make_classification function from scikit-learn to generate some data in a low-dimensional space.
Problem 1a
Create 225 sources that live in 4 dimensions, where each source belongs to one of two classes.
Hint –– execute the cell below.
In [22]:
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])
Problem 1b
Using the defaults in matplotlib, make a scatter plot of the data showing feature 1 vs. feature 2. Use different colors for the two classes (again with the matplotlib defaults).
Hint –– recall that scikit-learn organizes feature data in a two-dimensional array, where every row corresponds to a single source and every column corresponds to a single feature.
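A quick way to confirm the array layout is to inspect the shapes returned by make_classification. The sketch below regenerates stand-in data with the same settings as the Problem 1a cell; passing random_state=23 is an assumption made here for reproducibility, so the values will not match the cell above exactly.

```python
import numpy as np
from sklearn.datasets import make_classification

# Stand-in data with the same settings as the Problem 1a cell.
# random_state=23 is an assumption made for reproducibility.
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38],
                           random_state=23)

print(X.shape)  # (number of sources, number of features)
print(y.shape)  # one class label per source

feature_1 = X[:, 0]   # the first feature for every source
class_0 = X[y == 0]   # every feature for the sources in class 0
```

Boolean masks such as y == 0 select rows (sources), while the second index selects columns (features).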
In [ ]:
fig, ax = plt.subplots()
ax.scatter( # complete
Now that we are familiar with the "defaults", we will apply several of the lessons from the lecture to create more salient visualizations.
Note –– many of the following questions are a little open ended. Be sure you are happy with your results, but I suggest that you do not dwell on any single problem for too long ($\gtrsim$15 min).
Problem 2a
Replot the data using symbols that provide strong visual boundaries between the two classes.
Hint –– make a choice that highlights the most important feature in the data (this will be subjective).
In [ ]:
fig, ax = plt.subplots()
Problem 2b
Replot the data, again with strong visual boundaries, but this time do not use color (if you did not use color in 2a, then use color for this problem).
In [ ]:
fig, ax = plt.subplots()
Problem 2c
Replot the data, again with strong visual boundaries, varying some new aspect of the plotting symbols to distinguish the two classes.
Hint –– recall that you have many options at your disposal (e.g., symbol, color, size, orientation, shape, motion, etc.).
In [ ]:
fig, ax = plt.subplots()
Problem 2d
Use www.color-blindness.com to examine how each of your choices above would appear to someone who is color blind. How do they appear in black and white?
After this examination, do you want to alter any of the previous plots?
In [ ]:
fig, ax = plt.subplots()
Problem 2e
Use the principle of enclosure to further highlight salient features in the data set.
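One possible way to apply enclosure, offered as a rough sketch rather than the intended answer, is to shade a region behind one class. The example below regenerates stand-in data (random_state=23 is an assumption) and draws a light gray ellipse spanning two standard deviations around the class 1 mean. The ellipse size and the symbol choices are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.datasets import make_classification

# Stand-in data; random_state=23 is an assumption for reproducibility.
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38],
                           random_state=23)

fig, ax = plt.subplots()
ax.scatter(X[y == 0, 0], X[y == 0, 1], marker="o",
           facecolors="none", edgecolors="0.4", label="class 0")
ax.scatter(X[y == 1, 0], X[y == 1, 1], marker="+",
           color="k", label="class 1")

# Enclose the bulk of class 1 in a lightly shaded ellipse that spans
# +/- 2 standard deviations along each axis (an arbitrary choice).
center = X[y == 1, :2].mean(axis=0)
spread = X[y == 1, :2].std(axis=0)
enclosure = Ellipse(xy=center, width=4 * spread[0], height=4 * spread[1],
                    facecolor="0.85", edgecolor="none", zorder=0)
ax.add_patch(enclosure)

ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.legend()
```

An axis-aligned ellipse ignores any correlation between the two features, so a different shape may enclose the class more tightly.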
In [ ]:
fig, ax = plt.subplots()
Problem 3a
Make a pie chart showing the relative number of sources in each class.
Hint –– don't do this in real life.
In [ ]:
fig, ax = plt.subplots()
Problem 3b
Make a bar graph showing the relative number of sources in each class.
In [ ]:
fig, ax = plt.subplots()
Problem 3c
Plot the same bar graph with a background grid that makes it easy to rapidly judge the relative magnitude of each class (i.e. remove the y-axis labels).
Note –– beware of introducing judgement error.
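As a starting point, here is a minimal sketch of a bar graph that relies on a background grid instead of y-axis labels. It uses stand-in data (random_state=23 is an assumption), and the grid spacing, gray levels, and line width are arbitrary choices to experiment with. Starting the y-axis at zero is one way to avoid exaggerating the difference between the classes; whether that is the judgement error the note has in mind is my assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Stand-in data; random_state=23 is an assumption for reproducibility.
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38],
                           random_state=23)

counts = np.bincount(y)            # number of sources in each class
fractions = counts / counts.sum()  # relative number of sources

fig, ax = plt.subplots()
ax.bar(["class 0", "class 1"], fractions, color="0.3")

# Horizontal grid lines every 10%, drawn behind the bars.
ax.set_yticks(np.arange(0, 1.01, 0.1))
ax.yaxis.grid(True, color="0.8", linewidth=0.8)
ax.set_axisbelow(True)

# Hide the y-axis tick labels and tick marks so the grid carries the scale.
ax.tick_params(axis="y", labelleft=False, length=0)

# Keep the baseline at zero so bar heights stay proportional.
ax.set_ylim(0, 1)
```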
In [ ]:
fig, ax = plt.subplots()
Problem 3d
Can you adjust the grid to improve the salience of the bar graph? What thickness are you using for the grid lines? How does this compare to the axes lines? What line style? What opacity?
In [ ]:
fig, ax = plt.subplots()
Problem 4a
Using a plotting element that is currently not shown (shape, size, color, symbol, etc.), encode the 3rd feature on your 2d scatter plot.
Hint –– recall that hue does a terrible job of representing relative magnitude.
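Since hue is a poor choice for magnitude, one alternative is marker size. The sketch below, which is only one of several reasonable encodings, rescales the 3rd feature to marker areas and keeps the class encoded by symbol. The data are a stand-in (random_state=23 is an assumption), and the 10 to 200 area range is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Stand-in data; random_state=23 is an assumption for reproducibility.
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38],
                           random_state=23)

# Rescale feature 3 to marker areas between 10 and 200 points^2.
third = X[:, 2]
sizes = 10 + 190 * (third - third.min()) / (third.max() - third.min())

fig, ax = plt.subplots()
for label, marker in [(0, "o"), (1, "s")]:
    mask = y == label
    ax.scatter(X[mask, 0], X[mask, 1], s=sizes[mask], marker=marker,
               facecolors="none", edgecolors="k", label=f"class {label}")

ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.legend()
```

Note that the s argument of scatter sets the marker area, not the diameter, which affects how readers perceive the relative magnitudes.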
In [ ]:
fig, ax = plt.subplots()
Problem 4b
Using a different plotting element that is currently not shown (shape, size, color, symbol, etc.), encode the 4th feature, in addition to the 3rd, on your 2d scatter plot.
How successful is each of the previous representations?
In [ ]:
fig, ax = plt.subplots()
Problem 4c
Create a parallel coordinate plot to represent the two different classes in 4 dimensions.
Is this representation more successful than the previous 2d scatter plots? Why or why not?
Hint –– think about how you normalize the parallel axes.
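To make the normalization question concrete, here is a rough sketch that applies min-max scaling to each feature so that all four parallel axes share the range 0 to 1. Min-max scaling is just one option (standardizing to zero mean and unit variance is another), and the data are a stand-in with random_state=23 as an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Stand-in data; random_state=23 is an assumption for reproducibility.
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38],
                           random_state=23)

# Min-max normalize each feature (column) independently.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

fig, ax = plt.subplots()
positions = np.arange(X.shape[1])  # one vertical axis per feature

# One line per source; transposing gives shape (n_features, n_sources),
# so each column is drawn as its own line.
for label, color in [(0, "0.6"), (1, "C1")]:
    ax.plot(positions, X_norm[y == label].T,
            color=color, alpha=0.3, linewidth=0.8)

ax.set_xticks(positions)
ax.set_xticklabels([f"feature {i + 1}" for i in positions])
ax.set_ylabel("min-max normalized value")
```

A single outlier can compress the rest of an axis under min-max scaling, so it is worth comparing against other normalizations.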
In [ ]:
fig, ax = plt.subplots()
Problem 4d
Create a corner plot to represent the two different classes.
Is this representation more successful than the parallel coordinate plot? Why or why not?
Hint –– corner and/or seaborn are your friiiiiiiiiiiiiiiiiiiends.
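If neither corner nor seaborn is installed, a similar figure can be approximated with pandas. The sketch below is a substitute for those packages, not their API: it uses pandas.plotting.scatter_matrix and hides the upper triangle of panels by hand. The data are a stand-in (random_state=23 is an assumption), and the column names and colors are hypothetical choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import make_classification

# Stand-in data; random_state=23 is an assumption for reproducibility.
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38],
                           random_state=23)

# Column names are hypothetical labels for the four features.
df = pd.DataFrame(X, columns=[f"feature {i + 1}" for i in range(4)])

# One color per source, chosen by class.
colors = ["0.6" if label == 0 else "C1" for label in y]

axes = scatter_matrix(df, c=colors, alpha=0.6, figsize=(7, 7),
                      diagonal="hist")

# Hide the redundant upper-triangle panels to leave a corner layout.
for i, j in zip(*np.triu_indices_from(axes, k=1)):
    axes[i, j].set_visible(False)
```

The diagonal histograms here combine both classes, which is a limitation compared with a per-class corner plot.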
In [ ]:
fig, ax = plt.subplots()
Which of all the above representations is "best"? Why (think about the Gestalt and design principles that have been successfully used)?
Challenge Problem
Use the Gestalt principles and theories of design discussed in the previous lecture to redesign your own figures or slides.
Be sure to share a screen grab of the original and the improved version so we can compare at the end.
In [ ]:
fig, ax = plt.subplots()