Assignment 3 - Building a Custom Visualization

In this assignment you must choose one of the options presented below and submit a visual as well as your source code for peer grading. The details of how you solve the assignment are up to you, although your assignment must use matplotlib so that your peers can evaluate your work. The options differ in challenge level, but there are no grades associated with the challenge level you choose. However, your peers will be asked to ensure you at least met a minimum quality for a given technique in order to pass. Implement the technique fully (or exceed it!) and you should be able to earn full marks for the assignment.

Ferreira, N., Fisher, D., & Konig, A. C. (2014, April). Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 571-580). ACM.

In this paper the authors describe the challenges users face when trying to make judgements about probabilistic data generated through samples. As an example, they look at a bar chart of four years of data (replicated below in Figure 1). Each year has a y-axis value, which is derived from a sample of a larger dataset. For instance, the first value might be the number of votes in a given district or riding for 1992, with the average being around 33,000. On top of this is plotted the 95% confidence interval for the mean -- the range of vote counts that is expected, with 95% confidence, to contain the true average (see the boxplot lectures for more information, and the yerr parameter of barcharts).
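As a rough sketch of how such an interval could be computed, the snippet below derives a mean and a 95% half-width using the normal approximation (mean plus or minus 1.96 standard errors). The function name `mean_confidence_interval` and the toy sample are assumptions made purely for illustration; they are not part of the assignment.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (mean, half_width) of a normal-approximation interval.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    # Sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(sample)
    # Standard error of the mean.
    std_error = std / math.sqrt(n)
    return mean, z * std_error

# Toy data, purely illustrative (hypothetical vote counts).
sample = [31000, 35000, 33000, 36000, 30000]
mean, half_width = mean_confidence_interval(sample)
print(f"{mean:.0f} +/- {half_width:.0f}")
```

The half-width returned here is the kind of value that can be passed as `yerr` when drawing the bars.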

Figure 1 from (Ferreira et al., 2014).

A challenge that users face is that, for a given y-axis value (e.g. 42,000), it is difficult to know which x-axis values are most likely to be representative, because the confidence intervals overlap and their distributions are different (the lengths of the confidence interval bars are unequal). One of the solutions the authors propose for this problem (Figure 2c) is to allow users to indicate the y-axis value of interest (e.g. 42,000) and then draw a horizontal line and color bars based on this value. So bars might be colored red if they are definitely above this value (given the confidence interval), blue if they are definitely below this value, or white if they contain this value.

Figure 2c from (Ferreira et al., 2014). Note that the colorbar legend at the bottom as well as the arrows are not required in the assignment descriptions below.

Easiest option: Implement the bar coloring as described above - a color scale with only three colors, (e.g. blue, white, and red). Assume the user provides the y axis value of interest as a parameter or variable.
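The three-color rule can be sketched as a small function. This is a minimal illustration, assuming each bar is described by its mean and the half-width of its confidence interval; the name `bar_color` and the sample numbers are hypothetical.

```python
def bar_color(mean, half_width, y_value):
    """Classify one bar against the y-axis value of interest."""
    lower, upper = mean - half_width, mean + half_width
    if lower > y_value:
        return "red"    # interval lies entirely above the value
    if upper < y_value:
        return "blue"   # interval lies entirely below the value
    return "white"      # interval contains the value

# Illustrative (mean, half-width) pairs, not taken from real data.
bars = [(30000, 2000), (41000, 3000), (50000, 1500)]
colors = [bar_color(m, h, 42000) for m, h in bars]
print(colors)
```

A list like `colors` can be passed directly to the `color` argument of `plt.bar`.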

Harder option: Implement the bar coloring as described in the paper, where the color of the bar is actually based on the amount of data covered (e.g. a gradient ranging from dark blue for the distribution being certainly below this y-axis, to white if the value is certainly contained, to dark red if the value is certainly not contained as the distribution is above the axis).
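One possible reading of the gradient, offered as a sketch rather than the paper's exact method: estimate the probability that the true mean lies below the value of interest (normal approximation), then blend from white toward dark blue or dark red. The function names and the RGB endpoints below are assumptions chosen for illustration.

```python
import math

def prob_below(mean, std_error, y_value):
    """P(true mean < y_value) under a normal approximation."""
    z = (y_value - mean) / std_error
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gradient_color(p_below):
    """Map a probability to an (r, g, b) tuple.

    1.0 -> dark blue (certainly below), 0.5 -> white, 0.0 -> dark red.
    """
    dark_blue, white, dark_red = (0.0, 0.0, 0.55), (1.0, 1.0, 1.0), (0.55, 0.0, 0.0)
    if p_below >= 0.5:
        t, end = (p_below - 0.5) * 2.0, dark_blue
    else:
        t, end = (0.5 - p_below) * 2.0, dark_red
    # Linear blend from white (t = 0) to the end color (t = 1).
    return tuple(w + t * (e - w) for w, e in zip(white, end))

# A bar whose mean sits exactly on the value of interest comes out white.
print(gradient_color(prob_below(42000, 1500, 42000)))
```

The resulting tuples are valid matplotlib colors. A built-in diverging colormap such as `RdBu` could serve the same purpose in place of the hand-written blend.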

Even Harder option: Add interactivity to the above, which allows the user to click on the y axis to set the value of interest. The bar colors should change with respect to what value the user has selected.
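Interactivity of this kind is commonly wired up through matplotlib's `button_press_event`. The sketch below is one way to do it, assuming the three-color rule; `make_click_handler`, `three_colors`, and `demo` are hypothetical names, and the data in `demo` is made up.

```python
def three_colors(mean, half_width, y_value):
    """Red if the interval is above y_value, blue if below, white otherwise."""
    if mean - half_width > y_value:
        return "red"
    if mean + half_width < y_value:
        return "blue"
    return "white"

def make_click_handler(bars, line, stats, y_to_color, redraw):
    """Build a callback for matplotlib's 'button_press_event'.

    bars: bar patches, line: the horizontal line artist,
    stats: list of (mean, half_width) per bar, redraw: zero-argument callable.
    """
    def on_click(event):
        # event.ydata is None when the click lands outside the axes.
        if event.ydata is None:
            return
        line.set_ydata([event.ydata, event.ydata])
        for bar, (mean, half_width) in zip(bars, stats):
            bar.set_facecolor(y_to_color(mean, half_width, event.ydata))
        redraw()
    return on_click

def demo():
    """Open an interactive window; requires a GUI or notebook backend."""
    import matplotlib.pyplot as plt
    stats = [(30000, 2000), (41000, 3000), (50000, 1500)]  # illustrative only
    fig, ax = plt.subplots()
    bars = ax.bar(range(len(stats)), [m for m, _ in stats],
                  yerr=[h for _, h in stats], color="white", edgecolor="black")
    line = ax.axhline(y=42000, color="grey")
    handler = make_click_handler(bars, line, stats, three_colors, fig.canvas.draw_idle)
    fig.canvas.mpl_connect("button_press_event", handler)
    plt.show()
```

Calling `demo()` opens the figure; clicking anywhere inside the axes moves the line and recolors the bars. Keeping the handler separate from the figure setup makes the recoloring logic easy to check without opening a window.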

Hardest option: Allow the user to interactively set a range of y values they are interested in, and recolor based on this (e.g. a y-axis band, see the paper for more details).
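For a y-axis band, one plausible scoring (an assumption here, not necessarily the paper's exact formulation) is the probability that the true mean falls inside the band under a normal approximation. The helper names below are hypothetical.

```python
import math

def normal_cdf(x, mean, std_error):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std_error * math.sqrt(2.0))))

def prob_in_band(mean, std_error, y_low, y_high):
    """P(y_low < true mean < y_high) under a normal approximation."""
    if y_low > y_high:
        y_low, y_high = y_high, y_low  # tolerate a band given upside down
    return normal_cdf(y_high, mean, std_error) - normal_cdf(y_low, mean, std_error)

# A band spanning mean +/- 1.96 standard errors covers roughly 95%.
print(round(prob_in_band(40000, 1000, 38040, 41960), 3))
```

The resulting value in [0, 1] can be mapped to a sequential color scale. The band itself could be collected interactively, for example with `matplotlib.widgets.SpanSelector`.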



In [1]:

    
%matplotlib notebook



In [2]:

    
# Use the following data for this assignment:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sci

np.random.seed(12345)

df = pd.DataFrame([np.random.normal(33500,150000,3650), 
                   np.random.normal(41000,90000,3650), 
                   np.random.normal(41000,120000,3650), 
                   np.random.normal(48000,55000,3650)], 
                  index=[1992,1993,1994,1995])
df = df.T



In [3]:

    
df.head()









    Out[3]:

                 1992            1993            1994            1995
    0     2793.851077   -44406.485331   134288.798913   -44485.202120
    1   105341.500709   180815.466879   169097.538334     -156.410517
    2   -44415.807259  -108866.427539   337957.368420   -13425.878636
    3   -49859.545652  -114625.083717   -76005.273164    53540.999558
    4   328367.085875   196807.232582    90130.207911   130408.559874



In [4]:

    
data = {'mean':[df[1992].mean(), df[1993].mean(), df[1994].mean(), df[1995].mean()],
             'std':[df[1992].std(), df[1993].std(), df[1994].std(), df[1995].std()]}



In [5]:

    
df_mean = pd.DataFrame(data, [1992, 1993 , 1994, 1995])



In [6]:

    
df_mean









    Out[6]:

                  mean            std
    1992  34484.080607  150473.176164
    1993  39975.673587   88558.520583
    1994  37565.689950  120317.078777
    1995  47798.504333   54828.074297



In [7]:

    
sqrt_n = np.sqrt(df.count().iloc[0])  # square root of the sample size (n = 3650)



In [8]:

    
df_mean['std error'] = df_mean['std'].apply(lambda x: (x/np.sqrt(df.count().iloc[0])))



In [9]:

    
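# Caution: 0.95 is the confidence level, not the z critical value (1.96); the plot uses the interval from cell 14 instead.
df_mean['margin of error'] = df_mean['std error'].apply(lambda x: x*0.95)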



In [10]:

    
df_mean









    Out[10]:

                  mean            std    std error  margin of error
    1992  34484.080607  150473.176164  2490.649733      2366.117247
    1993  39975.673587   88558.520583  1465.831062      1392.539509
    1994  37565.689950  120317.078777  1991.502458      1891.927335
    1995  47798.504333   54828.074297   907.520743       862.144706



In [11]:

    
labels = ['1992', '1993', '1994', '1995']
left_pos = np.arange(len(df_mean['mean']))



In [12]:

    
y_axis_interest = 42000



In [13]:

    
df_mean









    Out[13]:

                  mean            std    std error  margin of error
    1992  34484.080607  150473.176164  2490.649733      2366.117247
    1993  39975.673587   88558.520583  1465.831062      1392.539509
    1994  37565.689950  120317.078777  1991.502458      1891.927335
    1995  47798.504333   54828.074297   907.520743       862.144706



In [14]:

    
# Half-width of the 95% confidence interval for each yearly mean (z = 1.96).
conf_inter = 1.96 * (df_mean['std'] / np.sqrt(df.count().iloc[0] - 1))



In [15]:

    
df_mean['confidence interval'] = conf_inter.values



In [16]:

    
df_mean









    Out[16]:

                  mean            std    std error  margin of error  confidence interval
    1992  34484.080607  150473.176164  2490.649733      2366.117247          4882.342337
    1993  39975.673587   88558.520583  1465.831062      1392.539509          2873.422529
    1994  37565.689950  120317.078777  1991.502458      1891.927335          3903.879632
    1995  47798.504333   54828.074297   907.520743       862.144706          1778.984369



In [17]:

    
# y_axis_interest is the value of interest.
# Bars are colored red if they are definitely above this value (given the confidence interval),
# blue if they are definitely below it, or white if their interval contains it.
# First pass: flag only whether each mean exceeds the value (the interval is not yet considered).
df_mean['color'] = df_mean['mean'].astype('int') > y_axis_interest



In [18]:

    
df_mean









    Out[18]:

                  mean            std    std error  margin of error  confidence interval  color
    1992  34484.080607  150473.176164  2490.649733      2366.117247          4882.342337  False
    1993  39975.673587   88558.520583  1465.831062      1392.539509          2873.422529  False
    1994  37565.689950  120317.078777  1991.502458      1891.927335          3903.879632  False
    1995  47798.504333   54828.074297   907.520743       862.144706          1778.984369   True



In [19]:

    
# Upper bound of each bar's interval (mean + half-width).
df_mean['bar´s confidence interval'] = df_mean['mean'] + df_mean['confidence interval']



In [20]:

    
df_mean









    Out[20]:

                  mean            std    std error  margin of error  confidence interval  color  bar´s confidence interval
    1992  34484.080607  150473.176164  2490.649733      2366.117247          4882.342337  False               39366.422944
    1993  39975.673587   88558.520583  1465.831062      1392.539509          2873.422529  False               42849.096116
    1994  37565.689950  120317.078777  1991.502458      1891.927335          3903.879632  False               41469.569582
    1995  47798.504333   54828.074297   907.520743       862.144706          1778.984369   True               49577.488702



In [21]:

    
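# Classify each bar against y_axis_interest using both interval bounds:
# blue = interval entirely below, red = entirely above, white = interval contains the value.
lower = df_mean['mean'] - df_mean['confidence interval']
upper = df_mean['mean'] + df_mean['confidence interval']
colors = np.where(upper < y_axis_interest, 'blue',
                  np.where(lower > y_axis_interest, 'red', 'white')).tolist()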



In [22]:

    
#df_mean['color'].map({True: 'red', False: 'blue'})



In [23]:

    
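# Draw the bars with their confidence-interval error bars, colored relative to y_axis_interest.
plt.bar(left_pos, df_mean['mean'], align='center', alpha=0.8, yerr=df_mean['confidence interval'],
        color=colors,
        error_kw=dict(ecolor='black', lw=1, capsize=5, capthick=1))
plt.xticks(left_pos, labels)
# Horizontal line marking the y-axis value of interest.
plt.axhline(y=y_axis_interest, color='grey', alpha=0.8)
ax = plt.gca()
ax.set_facecolor('lightgrey')  # light grey plot background
plt.tick_params(bottom=False, left=False)  # hide tick marks on both axes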


