Lab 3: Intro to Visualizations

Authors: Sam Lau, Deb Nolan

Due 11:59pm 02/03/2017 (Completion-based)

Today, we'll learn the basics of plotting using the Python libraries matplotlib and seaborn! You should walk out of lab today understanding:

  • The functionality that matplotlib provides
  • Why we use seaborn for plotting
  • How to make and customize basic plots, including bar charts, box plots, histograms, and scatterplots.

As usual, to submit this lab you must scroll down to the bottom and set the i_definitely_finished variable to True before running the submit cell.

Please work in pairs on this lab assignment. You will discuss the results with your partner instead of writing them up in the notebook.


In [263]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

# These lines load the tests.
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab03.ok')

matplotlib

matplotlib is the most widely used plotting library available for Python. It comes with a good amount of out-of-the-box functionality and is highly customizable. Most other plotting libraries in Python, including seaborn, are built on matplotlib and provide simpler ways to generate complicated matplotlib plots, so it's worth learning a bit about matplotlib now.

Notice how all of our notebooks have lines that look like:

%matplotlib inline
import matplotlib.pyplot as plt

The %matplotlib inline magic command tells matplotlib to render plots directly in the notebook (otherwise, it may open a new window for each plot).

Then, the import line lets us call matplotlib functions using plt.<func>

Here's a graph of cos(x) from 0 to 2 * pi (you've made this in homework 1 already).


In [264]:
# Set up (x, y) pairs from 0 to 2*pi
xs = np.linspace(0, 2 * np.pi, 300)
ys = np.cos(xs)

# plt.plot takes in x-values and y-values and plots them as a line
plt.plot(xs, ys)

matplotlib also conveniently has the ability to plot multiple things on the same plot. Just call plt.plot multiple times in the same cell:


In [265]:
plt.plot(xs, ys)
plt.plot(xs, np.sin(xs))

Question 0: That plot looks pretty nice but isn't publication-ready. Luckily, matplotlib has a wide array of plot customizations.

Skim through the first part of the tutorial at https://www.labri.fr/perso/nrougier/teaching/matplotlib to create the plot below. There is a lot of extra information there which we suggest you read on your own time. For now, just look for what you need to make the plot.

Specifically, you'll have to change the x and y limits, add a title, and add a legend.
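
As a rough sketch of the kinds of calls involved, the example below customizes a different plot: two exponential decay curves chosen only for illustration. plt.xlim, plt.ylim, plt.title, and plt.legend are standard matplotlib.pyplot functions; the data, limits, title, and label text are made up and are not the values the target plot uses.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data only: two decay curves, not the lab's sine/cosine plot.
ts = np.linspace(0, 5, 200)

plt.figure()
# The label argument supplies the text that plt.legend() will display.
plt.plot(ts, np.exp(-ts), label='fast decay')
plt.plot(ts, np.exp(-ts / 3), label='slow decay')

# Restrict the visible region of each axis.
plt.xlim(0, 5)
plt.ylim(0, 1.1)

plt.title('Exponential decay')
plt.legend()
```

Each call acts on the current figure, so all of them need to be in the same notebook cell as the plt.plot calls.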


In [266]:
# Here's the starting code from last time. Edit / Add code to create the plot above.
plt.plot(xs, ys)
plt.plot(xs, np.sin(xs))

Dataset: Bikeshare trips

Today, we'll be performing some basic EDA (exploratory data analysis) on bikeshare data from Washington, D.C.

The variables in this data frame are defined as:

  • instant: record index
  • dteday : date
  • season : season (1:spring, 2:summer, 3:fall, 4:winter)
  • yr : year (0: 2011, 1:2012)
  • mnth : month ( 1 to 12)
  • hr : hour (0 to 23)
  • holiday : whether day is holiday or not
  • weekday : day of the week
  • workingday : 1 if day is neither weekend nor holiday, 0 otherwise
  • weathersit :
    • 1: Clear or partly cloudy
    • 2: Mist + clouds
    • 3: Light Snow or Rain
    • 4: Heavy Rain or Snow
  • temp : Normalized temperature in Celsius (divided by 41)
  • atemp: Normalized feeling temperature in Celsius (divided by 50)
  • hum: Normalized percent humidity (divided by 100)
  • windspeed: Normalized wind speed (divided by 67)
  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes including casual and registered

In [268]:
bike_trips = pd.read_csv('bikeshare.csv')

# Here we'll do some pandas datetime parsing so that the dteday column
# contains datetime objects.
bike_trips['dteday'] += ':' + bike_trips['hr'].astype(str)
bike_trips['dteday'] = pd.to_datetime(bike_trips['dteday'], format="%Y-%m-%d:%H")
bike_trips = bike_trips.drop(['yr', 'mnth', 'hr'], axis=1)

bike_trips.head()
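
To see what the datetime parsing above does, here is a minimal sketch on a hypothetical two-row dataframe (the toy name and its values are made up; only the dteday and hr column names come from the dataset).

```python
import pandas as pd

# Hypothetical two-row frame mimicking the dteday and hr columns.
toy = pd.DataFrame({'dteday': ['2011-01-01', '2011-01-01'], 'hr': [0, 13]})

# Append the hour as text, giving strings like '2011-01-01:13'.
toy['dteday'] = toy['dteday'] + ':' + toy['hr'].astype(str)

# Parse those strings with a format that matches the date:hour layout.
toy['dteday'] = pd.to_datetime(toy['dteday'], format='%Y-%m-%d:%H')

# The hour is now recoverable from the datetime column itself.
hours = toy['dteday'].dt.hour.tolist()
```

Because the hour lives inside the datetime values afterwards, a separate hr column carries no extra information, which is why the cell above can drop it.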

Question 1: Discuss the data with your partner. What is its granularity? What time range is represented here? Perform your exploration in the cell below.


In [ ]:

Using pandas to plot

pandas provides useful methods on dataframes. For simple plots, we prefer to just use those methods instead of the matplotlib methods since we're often working with dataframes anyway. The syntax is:

dataframe.plot.<plotfunc>

Where the plotfunc is one of the functions listed here: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html#other-plots
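
For instance, a minimal sketch of the dataframe.plot.<plotfunc> pattern on made-up data (the readings dataframe and its values are hypothetical) might look like:

```python
import pandas as pd

# Made-up readings, only to show the dataframe.plot.<plotfunc> pattern.
readings = pd.DataFrame({
    'day': [1, 2, 3, 4],
    'rainfall': [0.0, 2.5, 1.0, 4.2],
})

# plot.bar is one of the available plot functions; x and y name columns.
# The call returns the matplotlib Axes that the bars were drawn on.
ax = readings.plot.bar(x='day', y='rainfall')
```

The returned Axes can be customized further with matplotlib methods such as ax.set_title.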


In [260]:
# This plot shows the temperature at each data point

bike_trips.plot.line(x='dteday', y='temp')

# Stop here! Discuss why this plot is shaped like this with your partner.

seaborn

Now, we'll learn how to use the seaborn Python library. seaborn is built on top of matplotlib and provides many helpful functions for statistical plotting that matplotlib and pandas don't have.

Generally speaking, we'll use seaborn for more complex statistical plots, pandas for simple plots (e.g. line / scatter plots), and matplotlib for plot customization.

Nearly all seaborn functions are designed to operate on pandas dataframes. Most of these functions assume that the dataframe is in a specific format called long-form, where each column of the dataframe is a particular feature and each row of the dataframe is a single data point.

For example, this dataframe is long-form:

   country year avgtemp
 1  Sweden 1994       6
 2 Denmark 1994       6
 3  Norway 1994       3
 4  Sweden 1995       5
 5 Denmark 1995       8
 6  Norway 1995      11
 7  Sweden 1996       7
 8 Denmark 1996       8
 9  Norway 1996       7

But this dataframe of the same data is not:

   country avgtemp.1994 avgtemp.1995 avgtemp.1996
 1  Sweden            6            5            7
 2 Denmark            6            8            8
 3  Norway            3           11            7

Note that the bike_trips dataframe is long-form.

For more about long-form data, see https://stanford.edu/~ejdemyr/r-tutorials/wide-and-long. For now, just remember that we typically prefer long-form data, which also makes plotting with seaborn easy.
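
If you ever receive data in the wide layout, one way to reshape it is pandas' melt method. The sketch below rebuilds the wide table from the example above and converts it to long-form; the variable names wide and long_form are our own.

```python
import pandas as pd

# The wide (not long-form) table from the example above.
wide = pd.DataFrame({
    'country': ['Sweden', 'Denmark', 'Norway'],
    'avgtemp.1994': [6, 6, 3],
    'avgtemp.1995': [5, 8, 11],
    'avgtemp.1996': [7, 8, 7],
})

# melt keeps 'country' as an identifier and stacks the remaining columns
# into (year, avgtemp) pairs, one row per country-year combination.
long_form = wide.melt(id_vars='country', var_name='year', value_name='avgtemp')

# The year column still holds names like 'avgtemp.1994'; keep only the year.
long_form['year'] = (
    long_form['year'].str.replace('avgtemp.', '', regex=False).astype(int)
)
```

The result has nine rows and the columns country, year, and avgtemp, matching the long-form table shown earlier.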

Question 2: Use seaborn's barplot function to make a bar chart showing the average number of registered riders on each day of the week over the entire bike_trips dataset.

Here's a link to the seaborn API: http://seaborn.pydata.org/api.html

See if you can figure it out by reading the docs and talking with your partner.

Once you have the plot, discuss it with your partner. What trends do you notice? What do you suppose causes these trends?

Notice that barplot draws error bars for each category. It uses bootstrapping to make those.
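
To give a feel for the idea, here is a minimal sketch of a percentile bootstrap interval for a mean, written with numpy. It illustrates the general technique only; seaborn's internal implementation may differ in its details, and the sample values and the bootstrap_mean_ci helper are hypothetical.

```python
import numpy as np

# Hypothetical rider counts for one category (made-up numbers).
sample = np.array([12, 15, 9, 20, 14, 11, 18, 16])

def bootstrap_mean_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap interval for the mean of values."""
    rng = np.random.default_rng(seed)
    # Resample with replacement many times, recording each resample's mean.
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    # The middle ci percent of the resampled means forms the interval.
    tail = (100 - ci) / 2
    return np.percentile(means, tail), np.percentile(means, 100 - tail)

low, high = bootstrap_mean_ci(sample)
```

An error bar drawn from low to high shows how much the category's average could plausibly vary if the data were collected again.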


In [196]:
...

Question 3: Now for a fancier plot that seaborn makes really easy to produce.

Use the distplot function to plot a histogram of all the total rider counts in the bike_trips dataset.


In [196]:
...

Notice that seaborn will fit a curve to the histogram of the data. Fancy!

Question 4: Discuss this plot with your partner. What shape does the distribution have? What does that imply about the rider counts?

Question 5: Use seaborn to make side-by-side boxplots of the number of casual riders (just checked out a bike for that day) and registered riders (have a bikeshare membership).

The boxplot function will plot all the columns of the dataframe you pass in.

Once you make the plot, you'll notice that there are many outliers that make the plot hard to see. To mitigate this, change the y-scale to be logarithmic.

That's a plot customization so you'll use matplotlib. The boxplot function returns a matplotlib Axes object which represents a single plot and has a set_yscale function.
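
As a sketch of how set_yscale behaves, the example below calls it on an Axes created directly with matplotlib rather than one returned by seaborn; the method is the same on any Axes object. The skewed data is randomly generated and purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up, heavily right-skewed values standing in for rider counts.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3, sigma=1, size=500)

# plt.subplots returns a Figure and a single Axes object.
fig, ax = plt.subplots()
ax.boxplot(skewed)

# Switch the y-axis to a logarithmic scale so the box is not squashed
# by the large outliers.
ax.set_yscale('log')
```

With an Axes returned by a seaborn function, you would store the return value in a variable and call set_yscale on it in the same way.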

The result should look like:


In [205]:
...

Question 6: Discuss with your partner what the plot tells you about the distribution of casual vs. the distribution of registered riders.

Question 7: Let's take a closer look at the number of registered vs. casual riders.

Use the lmplot function to make a scatterplot. Put the number of casual riders on the x-axis and the number of registered riders on the y-axis. Each point should correspond to a single row in your bike_trips dataframe.


In [210]:
...

Question 8: What do you notice about that plot? Discuss with your partner. Notice that seaborn automatically fits a line of best fit to the plot. Does that line seem to be relevant?

You should note that lmplot allows you to pass in fit_reg=False to avoid plotting lines of best fit when you feel they are unnecessary or misleading.
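
As an illustration, seaborn's lmplot accepts a fit_reg parameter that controls whether the regression line is drawn. The sketch below uses a small made-up dataframe (points and its values are hypothetical) rather than the bikeshare data.

```python
import pandas as pd
import seaborn as sns

# Made-up points with no real linear trend, for illustration only.
points = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'y': [4, 1, 5, 2, 6, 1],
})

# fit_reg=False draws only the scatter points and skips the regression line.
grid = sns.lmplot(x='x', y='y', data=points, fit_reg=False)
```

Note that lmplot returns a seaborn FacetGrid rather than a single Axes; the underlying Axes is available as grid.ax when the plot has only one panel.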

Question 9: There seem to be two main groups in the scatterplot. Let's see if we can separate them out.

Use lmplot to make the scatterplot again. This time, use the hue parameter with the workingday column to color points for working-day trips differently from points for weekend and holiday trips. You should get something that looks like:


In [223]:
# In your plot, you'll notice that your points are larger than ours. That's
# fine. If you'd like them to be smaller, you can add scatter_kws={'s': 6}
# to your lmplot call. That tells the underlying matplotlib scatter function
# to change the size of the points.
...

# Note that the legend for workingday isn't super helpful. 0 in this case
# means "not a working day" and 1 means "working day". Try fixing the legend
# to be more descriptive.
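
One possible approach to a more descriptive legend, sketched here on a hypothetical toy dataframe, is to add a column of readable labels and pass that column as hue instead. The day_type column name and the label text are our own choices, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows with the same 0/1 encoding as the workingday column.
toy = pd.DataFrame({
    'workingday': [0, 1, 1, 0],
    'casual': [40, 5, 8, 55],
    'registered': [30, 120, 150, 45],
})

# Map each code to a readable label; seaborn uses the values of the hue
# column as the legend entries.
toy['day_type'] = toy['workingday'].map({0: 'not a working day',
                                         1: 'working day'})
```

Passing hue='day_type' to a seaborn plotting function would then show these labels in the legend instead of 0 and 1.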

Question 10: Discuss the plot with your partner. Was splitting the data by working day informative? One of the best-fit lines looks valid but the other doesn't. Why do you suppose that is?

Question 11 (bonus): Eventually, you'll want to be able to pose a question yourself and answer it using a visualization. Here's a question you can think about:

How do the number of casual and registered riders change throughout the day, on average?

See if you can make a plot to answer this.


In [221]:
...

Want to learn more?

We recommend checking out the seaborn tutorials on your own time. http://seaborn.pydata.org/tutorial.html

The matplotlib tutorial we linked in Question 0 is also a great refresher on common matplotlib functions: https://www.labri.fr/perso/nrougier/teaching/matplotlib/

Here's a great blog post about the differences between Python's visualization libraries: https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/

Submission

Change i_definitely_finished to True and run the cells below to submit the lab. You may resubmit as many times as you want. We will be grading you on effort/completion.


In [ ]:
i_definitely_finished = False

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()