Module 7: 1D data

Let's first import the basic packages and then load a dataset from the vega_datasets package. If you don't have vega_datasets or altair installed yet, use pip or conda to install them.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from vega_datasets import data

In [2]:
cars = data.cars()
cars.head()


Out[2]:
                        Name  Miles_per_Gallon  Cylinders  Displacement  Horsepower  Weight_in_lbs  Acceleration        Year  Origin
0  chevrolet chevelle malibu              18.0          8         307.0       130.0           3504          12.0  1970-01-01     USA
1          buick skylark 320              15.0          8         350.0       165.0           3693          11.5  1970-01-01     USA
2         plymouth satellite              18.0          8         318.0       150.0           3436          11.0  1970-01-01     USA
3              amc rebel sst              16.0          8         304.0       150.0           3433          12.0  1970-01-01     USA
4                ford torino              17.0          8         302.0       140.0           3449          10.5  1970-01-01     USA

1D scatter plot

Let's consider the Acceleration column as our 1D data. If we ask pandas to plot this series, it'll produce a line graph where the index becomes the horizontal axis.


In [3]:
cars.Acceleration.plot()


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x1167bc350>

Because the index is not really meaningful, drawing a line between consecutive values is misleading! This is definitely not the plot we want!

It's actually not trivial to use pandas to create a 1D scatter plot. Instead, we can use matplotlib's scatter function. We can first create an array of zeros to use as the vertical coordinates of the points that we will plot. np.zeros_like returns an array of zeros that matches the shape of the input array.


In [4]:
np.zeros_like([1,2,3])


Out[4]:
array([0, 0, 0])
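As a small sketch of why np.zeros_like is convenient here: the result matches both the shape and the dtype of its input, so it can be paired element by element with the data. The values below are made up for illustration and are not taken from the cars dataset.

```python
import numpy as np

# Made-up float values standing in for a 1D data column (an assumption for illustration).
x = np.array([12.0, 11.5, 11.0, 12.0, 10.5])

# One zero per data point, with the same shape and dtype as x.
y = np.zeros_like(x)

print(y.shape, y.dtype)  # (5,) float64
```

Because x and y have the same length, they can be passed together as the horizontal and vertical coordinates of a scatter plot.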

Q: now can you create a 1D scatter plot with matplotlib's scatter function? Make the figure wide (e.g. set figsize=(10,2)) and then remove the y ticks.


In [5]:
# TODO: put your code here


Out[5]:
<matplotlib.collections.PathCollection at 0x116f00650>

As you can see, there is a lot of occlusion, so this plot cannot show the distribution properly and we would like to fix it. How about adding some jitter? You can use numpy's random.rand() function to generate random numbers, instead of using an array of zeros.
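To make the idea concrete, here is a minimal sketch of what np.random.rand returns, using made-up values rather than the cars data. It only generates the random offsets; the plotting itself is left to the exercise.

```python
import numpy as np

# Made-up values standing in for the data column (an assumption for illustration).
x = np.array([12.0, 11.5, 11.0, 12.0, 10.5])

# One random number per data point, drawn uniformly from [0, 1).
offsets = np.random.rand(len(x))

print(offsets.shape)
```

Each run produces different offsets, so a jittered plot will look slightly different every time unless you fix the random seed.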

Q: create a jittered 1D scatter plot.


In [6]:
# TODO: put your code here
# jittered_y = ...
# plt ...


We can further improve this by adding transparency to the symbols. The transparency option for scatter function is called alpha. Set it to be 0.2.

Q: create a jittered 1D scatter plot with transparency (alpha=0.2)


In [7]:
# TODO: put your code here


Another strategy is using empty symbols. The option is facecolors. You can also change the stroke color (edgecolors).

Q: create a jittered 1D scatter plot with empty symbols.


In [8]:
# TODO: put your code here


What happens if you have lots and lots of points?

Whatever strategy you use, it's almost useless if you have too many data points. Let's play with different numbers of data points and see how it looks.

Not only does the plot become completely useless, it also takes a while to draw.


In [9]:
# TODO: play with N and see what happens. 

# TODO: 1D scatter plot code here


Histogram and boxplot

When you have lots of data points, you can no longer use scatter plots. Even when you don't have millions of data points, you often want to get a quick summary of the distribution rather than seeing the whole dataset. For 1D datasets, two major approaches are the histogram and the boxplot. A histogram aggregates and counts the data, while a boxplot summarizes it. Let's first draw some histograms.
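The difference between counting and summarizing can be sketched numerically, without drawing anything. The example below uses synthetic, normally distributed values (an assumption, not the cars data): np.histogram returns the per-bin counts that a histogram displays, and np.percentile returns the quartiles that a boxplot is built around.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
values = rng.normal(loc=15.0, scale=3.0, size=1000)

# Histogram view: aggregate the data into bins and count each bin.
counts, edges = np.histogram(values, bins=10)

# Boxplot view: summarize the data with a few quantiles.
q1, median, q3 = np.percentile(values, [25, 50, 75])

print(counts)
print(q1, median, q3)
```

The histogram keeps one number per bin, while the summary keeps only a handful of numbers regardless of how many points there are.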

Histogram

It's very easy to draw a histogram with pandas.


In [10]:
cars.Acceleration.hist()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x117930f90>

You can adjust the number of bins (and therefore the bin size), which is the main parameter of the histogram.


In [11]:
cars.Acceleration.hist(bins=15)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x117a15350>
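As a rough sketch of what an integer bins value means: the data range is split into that many equal-width intervals. The example uses numpy's np.histogram_bin_edges on made-up values; it assumes that the plotting call bins the data in the same equal-width way, which is worth checking in the matplotlib documentation.

```python
import numpy as np

# Made-up values spanning 8.0 to 23.0 (an assumption for illustration).
values = np.array([8.0, 9.5, 11.0, 12.0, 12.5, 14.0, 15.5, 16.0, 19.0, 23.0])

# 15 bins need 16 edges, evenly spaced between the minimum and the maximum.
edges = np.histogram_bin_edges(values, bins=15)
widths = np.diff(edges)

print(len(edges))
print(widths)
```

Here the range is 15 units wide, so each of the 15 bins is about one unit wide. Changing the number of bins changes the bin width accordingly.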

You can even specify the actual bins.


In [12]:
bins = [7.5, 8.5, 10, 15, 30]
cars.Acceleration.hist(bins=bins)


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x117b009d0>

Do you see anything funky going on with this histogram? What's wrong? Can you fix it?

Q: Explain what's wrong with this histogram and fix it.

(Hints: do you remember what we discussed regarding histograms? Also, the pandas documentation does not show the option that you should use. You should take a look at matplotlib's documentation.)


In [13]:
# TODO: put your code here


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1167bcc50>

Boxplot

A boxplot can be created with pandas very easily. Check out the plot documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

Q: create a box plot of Acceleration


In [14]:
# TODO: put your code here.


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x116ad2e50>

1D scatter plot with Seaborn and Altair

As you may have noticed, it is not very easy to use matplotlib. The organization of plot functions and parameters is not very systematic. Whenever you draw something, you have to search for how to do it, which parameters you can tweak, and so on. You need to manually tweak a lot of things when you work with matplotlib.

There are more systematic approaches to data visualization, such as the "Grammar of Graphics". This idea of grammar led to the famous ggplot2 (http://ggplot2.tidyverse.org) package in R as well as Vega and Vega-Lite for the web. The grammar-based approach lets you work with tidy data in a natural way, and also lets you approach data visualization systematically. In other words, they are very cool. 😎

I'd like to introduce two nice Python libraries. One is seaborn (https://seaborn.pydata.org), which focuses on creating complex statistical data visualizations. The other is altair (https://altair-viz.github.io/), a Python library that lets you define a visualization and translates it into vega-lite JSON.

Seaborn would be useful when you are doing exploratory data analysis; altair may be useful if you are thinking about creating and putting an interactive visualization on the web.

If you don't have them yet, check the installation page of altair. In conda,

$ conda install -c conda-forge altair vega_datasets jupyterlab 

Let's play with it.


In [15]:
import seaborn as sns
import altair as alt

# Uncomment the following line if you are using Jupyter notebook
# alt.renderers.enable('notebook')

In [16]:
cars.head()


Out[16]:
                        Name  Miles_per_Gallon  Cylinders  Displacement  Horsepower  Weight_in_lbs  Acceleration        Year  Origin
0  chevrolet chevelle malibu              18.0          8         307.0       130.0           3504          12.0  1970-01-01     USA
1          buick skylark 320              15.0          8         350.0       165.0           3693          11.5  1970-01-01     USA
2         plymouth satellite              18.0          8         318.0       150.0           3436          11.0  1970-01-01     USA
3              amc rebel sst              16.0          8         304.0       150.0           3433          12.0  1970-01-01     USA
4                ford torino              17.0          8         302.0       140.0           3449          10.5  1970-01-01     USA

Beeswarm plots with seaborn

Seaborn has a built-in function, stripplot, that creates 1D scatter plots with multiple categories, and it adds jitter by default. Let's first turn the jitter off with jitter=False to see the raw points.


In [31]:
sns.stripplot(x='Origin', y='Acceleration', data=cars, jitter=False)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1ade8f50>

You can easily add jitter (here by leaving out jitter=False) or even create a beeswarm plot.


In [32]:
sns.stripplot(x='Origin', y='Acceleration', data=cars)


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1ada8050>

Seems like European cars tend to have good acceleration. 😎 Let's look at the beeswarm plot, which is a pretty nice option for fairly small datasets.


In [19]:
sns.swarmplot(x='Origin', y='Acceleration', data=cars)


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1a9ad0d0>

Seaborn also allows you to use colors for another categorical variable. The option is hue.

Q: can you create a beeswarm plot where the swarms are grouped by Cylinders, y-values are Acceleration, and colors represent the Origin?


In [20]:
# TODO: put your code here


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1aa7a490>

And of course you can create box plots too.

Q: Create boxplots to show the relationships between Cylinders and Acceleration.


In [21]:
# TODO: put your code here


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1ab534d0>

Altair basics

With altair, you think in terms of a whole dataframe, rather than separate vectors for x and y. Passing the dataset to Chart on its own does not produce a plot: if you run alt.Chart(cars), it will complain that a mark is missing. You need to specify a mark and the visual encoding of the data.


In [22]:
alt.Chart(cars)


---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
/opt/anaconda3/envs/py37/lib/python3.7/site-packages/altair/vegalite/v3/api.py in to_dict(self, *args, **kwargs)
    382         if dct is None:
    383             kwargs['validate'] = 'deep'
--> 384             dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
    385 
    386         # TODO: following entries are added after validation. Should they be validated?

/opt/anaconda3/envs/py37/lib/python3.7/site-packages/altair/utils/schemapi.py in to_dict(self, validate, ignore, context)
    300                 self.validate(result)
    301             except jsonschema.ValidationError as err:
--> 302                 raise SchemaValidationError(self, err)
    303         return result
    304 

SchemaValidationError: Invalid specification

        altair.vegalite.v3.api.Chart, validating 'required'

        'mark' is a required property
        
Out[22]:
alt.Chart(...)

Note: If the altair plots don't show properly, use one of the following lines depending on your environment. Also check out the troubleshooting document here.


In [30]:
#alt.renderers.enable('notebook')
#alt.renderers.enable('jupyterlab')
#alt.renderers.enable('default')

In [23]:
alt.Chart(cars).mark_point()


Out[23]:

So you just see one point. But actually this is not a single point. This is every row of the dataset represented as a point at the same location. Because there is no specification about where to put the points, it simply draws everything on top of each other. Let's specify how to spread them across the horizontal axis.


In [24]:
alt.Chart(cars).mark_point().encode(
    x='Acceleration',
)


Out[24]:

This is called the "short form". It is a simplified version of the "long form", which allows more fine-tuning. For this plot, they are equivalent:


In [25]:
alt.Chart(cars).mark_point().encode(
    x=alt.X('Acceleration')
)


Out[25]:

There is another nice mark called tick:


In [26]:
alt.Chart(cars).mark_tick().encode(
    x='Acceleration',
)


Out[26]:

In altair, a histogram is not a special type of visualization, but simply a bar plot where a variable is binned and a counting aggregation function is used.


In [27]:
alt.Chart(cars).mark_bar().encode(
    x=alt.X('Acceleration', bin=True),
    y='count()'
)


Out[27]:

Q: can you create a 2D scatterplot with Acceleration and Horsepower? Use Origin for the colors.


In [28]:
# TODO: put your code here


Out[28]:

Because altair, vega-lite, and vega essentially draw the chart using JavaScript (and D3.js), it is very easy to export it to the web. Probably the simplest way is to export it to an HTML file: https://altair-viz.github.io/getting_started/starting.html#publishing-your-visualization

Save the chart to m07.html and upload it too.


In [29]:
# TODO: your code here.