In [5]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification



In [6]:

    
%matplotlib notebook

Salience: Highlighting the Most Important Features in the Data

Version 0.1

By AA Miller 11 June 2019

As we saw during the lecture, there are a nearly infinite number of parameters that can be adjusted when developing visuals for scientific communication. From something as small as the thickness of the axes to something as critical as the choice of color (or the choice to avoid color), each of these choices will eventually affect the final interpretation of the data.

As you construct visualizations today, there are three points from the lecture that I especially want to highlight:

Salience –– make specific choices to highlight the most important features of the visualization

Storytelling –– figure out the story you want to tell with the data

Alternatively, ask yourself, "what would the newspaper headline be for this figure/presentation?"

Problem 1) Simple Synthetic Data

We will use the make_classification function from scikit-learn to generate some data in a low dimensional data space.

Problem 1a

Create 225 sources that live in 4 dimensions, where each source belongs to one of two classes.

Hint –– execute the cell below.



In [22]:

    
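# Seed NumPy's global random state, which make_classification uses
# when no random_state argument is passed, so the data are reproducible
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])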

Problem 1b

Using the defaults in matplotlib, make a scatter plot of the data showing feature 1 vs. feature 2. Use different colors for the two classes (again with the matplotlib defaults).

Hint –– recall that scikit-learn organizes feature data in a two-dimensional array, where every row corresponds to a single source and every column corresponds to a single feature.
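To make the array layout concrete, here is a minimal sketch that repeats the `make_classification` call from the cell in Problem 1a and inspects the resulting arrays. The variable names `first_feature`, `first_source`, and `is_class_0` are illustrative only and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as the cell in Problem 1a
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # one class label per source

first_feature = X[:, 0]  # column 0: the first feature for every source
first_source = X[0]      # row 0: all four features for a single source
is_class_0 = y == 0      # boolean mask selecting the sources labeled 0
```

A boolean mask such as `is_class_0` can be used to index the rows of `X`, which is one way to plot the two classes separately.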



In [ ]:

    
fig, ax = plt.subplots()
ax.scatter( # complete

Now that we are familiar with the "defaults", we will apply several of the lessons from the lecture to create more salient visualizations.

Note –– many of the following questions are a little open ended. Be sure you are happy with your results, but I suggest that you do not dwell on any single question for too long ($\gtrsim$15 min).

Problem 2) Salience –– Plotting Symbols

Problem 2a

Replot the data using symbols that provide strong visual boundaries between the two classes.

Hint –– make a choice that highlights the most important feature in the data (this will be subjective).



In [ ]:

    
fig, ax = plt.subplots()

Problem 2b

Replot the data, again with strong visual boundaries, but this time do not use color (if you did not use color in 2a, then use color for this problem).



In [ ]:

    
fig, ax = plt.subplots()

Problem 2c

Replot the data, again with strong visual boundaries, varying some new aspect of the plotting symbols to distinguish the two classes.

Hint –– recall that you have many options at your disposal (e.g., symbol, color, size, orientation, shape, motion, etc.).



In [ ]:

    
fig, ax = plt.subplots()

Problem 2d

Use www.color-blindness.com to examine how each of your choices above would appear to someone who is color blind. How do they appear in black and white?

After this examination, do you want to alter any of the previous plots?



In [ ]:

    
fig, ax = plt.subplots()

Problem 2e

Use the principle of enclosure to further highlight salient features in the data set.
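As one possible reading of "enclosure", the sketch below shades a region around a single class so that the eye groups those points together. It uses stand-in data rather than the notebook's `X` and `y`, and the choice of an ellipse spanning two standard deviations on each side of the class mean is an assumption made for illustration, not a prescription from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# Stand-in data (NOT the notebook's X, y): two 2D blobs, one per class
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(60, 2))
class_b = rng.normal(loc=[4.0, 3.0], scale=0.7, size=(40, 2))

fig, ax = plt.subplots()
ax.scatter(class_a[:, 0], class_a[:, 1], color="0.6", label="class A")
ax.scatter(class_b[:, 0], class_b[:, 1], color="C3", label="class B")

# Enclose class B: center on its mean, span +/- 2 standard deviations per axis
center = class_b.mean(axis=0)
width, height = 4 * class_b.std(axis=0)
enclosure = Ellipse(center, width, height, facecolor="C3", alpha=0.15,
                    edgecolor="none", zorder=0)
ax.add_patch(enclosure)
ax.legend()
```

A light fill with no edge keeps the enclosure from competing with the data points themselves.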



In [ ]:

    
fig, ax = plt.subplots()

Problem 3) Salience –– Relative magnitude

Problem 3a

Make a pie chart showing the relative number of sources in each class.

Hint –– don't do this in real life.



In [ ]:

    
fig, ax = plt.subplots()

Problem 3b

Make a bar graph showing the relative number of sources in each class.



In [ ]:

    
fig, ax = plt.subplots()

Problem 3c

Plot the same bar graph with a background grid that makes it easy to rapidly judge the relative magnitude of each class (i.e. remove the y-axis labels).

Note –– beware of introducing judgement error.



In [ ]:

    
fig, ax = plt.subplots()

Problem 3d

Can you adjust the grid to improve the salience of the bar graph? What thickness are you using for the grid lines? How does this compare to the axes lines? What line style? What opacity?
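The questions above map onto matplotlib keyword arguments: `linewidth` for thickness, `linestyle` for line style, and `alpha` for opacity. The sketch below is one possible starting point; the class counts are hypothetical placeholders rather than values computed from `y`, and the specific style values are arbitrary choices to experiment with.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical class counts, for illustration only (not computed from y)
counts = np.array([140, 85])

fig, ax = plt.subplots()
ax.bar([0, 1], counts, color="0.3")
ax.set_xticks([0, 1])
ax.set_xticklabels(["class 0", "class 1"])

# Horizontal grid only, drawn behind the bars
ax.set_axisbelow(True)
ax.yaxis.grid(True, linewidth=0.5, linestyle="-", color="0.5", alpha=0.5)

# Hide the y-axis tick labels and tick marks so the grid carries the scale
ax.tick_params(axis="y", labelleft=False, length=0)

# Compare the grid line width with the axes (spine) line width
spine_width = ax.spines["left"].get_linewidth()
grid_width = ax.yaxis.get_gridlines()[0].get_linewidth()
print(spine_width, grid_width)
```

Printing the two widths side by side makes it easy to check whether the grid is visually subordinate to the axes.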



In [ ]:

    
fig, ax = plt.subplots()

Problem 4) Salience –– Multiple dimensions

Problem 4a

Using a plotting element that is currently not shown (shape, size, color, symbol, etc), encode the 3rd feature on your 2d scatter plot.

Hint –– recall that hue does a terrible job of representing relative magnitude.
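One alternative to hue is marker size. The sketch below rescales a third feature to the range [0, 1] and maps it onto marker areas. It uses stand-in data rather than the notebook's `X`, and the area limits `min_area` and `max_area` are arbitrary values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data (NOT the notebook's X): 100 sources, 3 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))

# Rescale the third feature to the range [0, 1] ...
third = X_demo[:, 2]
third_scaled = (third - third.min()) / (third.max() - third.min())

# ... then map it onto marker areas; `s` is an area in points**2
min_area, max_area = 10.0, 200.0
sizes = min_area + third_scaled * (max_area - min_area)

fig, ax = plt.subplots()
ax.scatter(X_demo[:, 0], X_demo[:, 1], s=sizes, alpha=0.5, edgecolors="none")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
```

Because `s` sets the marker area rather than its radius, a linear mapping like this one makes area, not diameter, proportional to the rescaled feature.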



In [ ]:

    
fig, ax = plt.subplots()

Problem 4b

Using a different plotting element that is currently not shown (shape, size, color, symbol, etc), encode the 4th, in addition to the 3rd feature, on your 2d scatter plot.

How successful is each of the previous representations?



In [ ]:

    
fig, ax = plt.subplots()

Problem 4c

Create a parallel coordinate plot to represent the two different classes in 4 dimensions.

Is this representation more successful than the previous 2d scatter plots? Why or why not?

Hint –– think about how you normalize the parallel axes.
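To illustrate why normalization matters, the sketch below applies min-max scaling so that every axis spans [0, 1] before drawing the lines with `pandas.plotting.parallel_coordinates`. It uses stand-in data with deliberately mismatched scales rather than the notebook's `X` and `y`, and min-max scaling is only one of several reasonable choices (standardizing to zero mean and unit variance is another).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Stand-in data (NOT the notebook's X, y): 4 features on very different scales
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = np.repeat([0, 1], 25)

# Min-max normalization: every feature is rescaled to span [0, 1]
X_min = X_demo.min(axis=0)
X_max = X_demo.max(axis=0)
X_scaled = (X_demo - X_min) / (X_max - X_min)

columns = ["feature 1", "feature 2", "feature 3", "feature 4"]
df = pd.DataFrame(X_scaled, columns=columns)
df["class"] = y_demo

fig, ax = plt.subplots()
parallel_coordinates(df, "class", ax=ax, color=["C0", "C1"], alpha=0.4)
```

Without the rescaling step, the feature with the largest range would dominate the shared vertical axis and flatten the others.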



In [ ]:

    
fig, ax = plt.subplots()

Problem 4d

Create a corner plot to represent the two different classes.

Is this representation more successful than the parallel coordinate plot? Why or why not?

Hint –– corner and/or seaborn are your friiiiiiiiiiiiiiiiiiiends.



In [ ]:

    
fig, ax = plt.subplots()

Which of all the above representations is "best"? Why (think about the Gestalt and design principles that have been successfully used)?

Challenge Problem –– Real World

Challenge Problem

Use the Gestalt principles and theories of design discussed in the previous lecture to redesign your own figures or slides.

Be sure to share a screen grab of the original and the improved version so we can compare at the end.



In [ ]:

    
fig, ax = plt.subplots()