Authors: Sam Lau, Deb Nolan
Today, we'll learn the basics of plotting using the Python libraries matplotlib and seaborn!
You should walk out of lab today understanding what matplotlib provides and how to use seaborn for plotting.
As usual, to submit this lab you must scroll down to the bottom and set the
i_definitely_finished variable to True before running the submit cell.
In [263]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
# These lines load the tests.
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab03.ok')
matplotlib
matplotlib is the most widely used plotting library available for Python.
It comes with a good amount of out-of-the-box functionality and is highly
customizable. Most other plotting libraries in Python, including seaborn, provide
simpler ways to generate complicated matplotlib plots, so it's worth learning
a bit about matplotlib now.
Notice how all of our notebooks have lines that look like:
%matplotlib inline
import matplotlib.pyplot as plt
The %matplotlib inline magic command tells matplotlib to render the plots
directly onto the notebook (by default it will open a new window with the plot).
Then, the import line lets us call matplotlib functions using plt.<func>.
Here's a graph of cos(x) from 0 to 2 * pi (you've made this in homework 1 already).
In [264]:
# Set up (x, y) pairs from 0 to 2*pi
xs = np.linspace(0, 2 * np.pi, 300)
ys = np.cos(xs)
# plt.plot takes in x-values and y-values and plots them as a line
plt.plot(xs, ys)
matplotlib also conveniently has the ability to plot multiple things on the
same plot. Just call plt.plot multiple times in the same cell:
In [265]:
plt.plot(xs, ys)
plt.plot(xs, np.sin(xs))
Question 0:
That plot looks pretty nice but isn't publication-ready. Luckily, matplotlib has a wide array of plot customizations.
Skim through the first part of the tutorial at https://www.labri.fr/perso/nrougier/teaching/matplotlib to create the plot below. There is a lot of extra information there which we suggest you read on your own time. For now, just look for what you need to make the plot.
Specifically, you'll have to change the x and y limits, add a title, and add a legend.
In [266]:
# Here's the starting code from last time. Edit / Add code to create the plot above.
plt.plot(xs, ys)
plt.plot(xs, np.sin(xs))
Today, we'll be performing some basic EDA (exploratory data analysis) on bikeshare data in Washington D.C.
The variables in this data frame are defined as:
In [268]:
bike_trips = pd.read_csv('bikeshare.csv')
# Here we'll do some pandas datetime parsing so that the dteday column
# contains datetime objects.
bike_trips['dteday'] += ':' + bike_trips['hr'].astype(str)
bike_trips['dteday'] = pd.to_datetime(bike_trips['dteday'], format="%Y-%m-%d:%H")
bike_trips = bike_trips.drop(['yr', 'mnth', 'hr'], axis=1)
bike_trips.head()
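The datetime parsing in the cell above can be tried on a tiny table first. This is a minimal sketch: the toy dataframe and its values are hypothetical and only mimic the dteday and hr columns used above.

```python
import pandas as pd

# Hypothetical toy rows that mimic the 'dteday' (date string) and 'hr' (hour) columns.
toy = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-01', '2011-01-02'],
    'hr': [0, 5, 23],
})

# Append the hour to the date string, e.g. '2011-01-01' becomes '2011-01-01:5'.
toy['dteday'] = toy['dteday'] + ':' + toy['hr'].astype(str)

# Parse the combined strings with an explicit format that matches them.
toy['dteday'] = pd.to_datetime(toy['dteday'], format='%Y-%m-%d:%H')

# The hour now lives inside the timestamp, so the separate column is redundant.
toy = toy.drop(['hr'], axis=1)
print(toy)
```

After this step each row carries a single timestamp that includes both the date and the hour, which is what makes time-based plots straightforward later on.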
Question 1: Discuss the data with your partner. What is its granularity? What time range is represented here? Perform your exploration in the cell below.
In [ ]:
Using pandas to plot
pandas provides useful plotting methods on dataframes. For simple plots, we prefer to
just use those methods instead of the matplotlib methods since we're often
working with dataframes anyway. The syntax is:
dataframe.plot.<plotfunc>
where plotfunc is one of the functions listed here: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html#other-plots
In [260]:
# This plot shows the temperature at each data point
bike_trips.plot.line(x='dteday', y='temp')
# Stop here! Discuss why this plot is shaped like this with your partner.
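The dataframe.plot.<plotfunc> syntax works the same way for other plot types. Here is a small sketch using plot.scatter; the toy dataframe and its column names are made up for illustration and are not part of the bikeshare data.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the call pattern.
toy = pd.DataFrame({
    'hour': [0, 6, 12, 18],
    'riders': [5, 40, 120, 90],
})

# dataframe.plot.<plotfunc> returns a matplotlib Axes object,
# so matplotlib customizations can be applied to the result.
ax = toy.plot.scatter(x='hour', y='riders')
ax.set_title('Toy data: riders by hour')
```

Because the return value is a matplotlib Axes, the pandas plotting methods combine naturally with the matplotlib customizations covered earlier.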
seaborn
Now, we'll learn how to use the seaborn Python library. seaborn is built on top of
matplotlib and provides many helpful functions for statistical plotting that
matplotlib and pandas don't have.
Generally speaking, we'll use seaborn for more complex statistical plots,
pandas for simple plots (e.g. line / scatter plots), and
matplotlib for plot customization.
Nearly all seaborn functions are designed to operate on pandas
dataframes. Most of these functions assume that the dataframe is in
a specific format called long-form, where each column of the dataframe
is a particular feature and each row of the dataframe is a single datapoint.
For example, this dataframe is long-form:
   country  year  avgtemp
1  Sweden   1994        6
2  Denmark  1994        6
3  Norway   1994        3
4  Sweden   1995        5
5  Denmark  1995        8
6  Norway   1995       11
7  Sweden   1996        7
8  Denmark  1996        8
9  Norway   1996        7
But this dataframe of the same data is not:
   country  avgtemp.1994  avgtemp.1995  avgtemp.1996
1  Sweden              6             5             7
2  Denmark             6             8             8
3  Norway              3            11             7
Note that the bike_trips dataframe is long-form.
For more about long-form data, see https://stanford.edu/~ejdemyr/r-tutorials/wide-and-long.
For now, just remember that we typically prefer long-form data, and that it makes
plotting with seaborn easy as well.
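To make the two layouts concrete, here is one way to convert the wide table above into the long-form one with pandas. This is a sketch, not the only approach; the variable names wide and long_df are ours.

```python
import pandas as pd

# The wide-form table from above: one column per year.
wide = pd.DataFrame({
    'country': ['Sweden', 'Denmark', 'Norway'],
    'avgtemp.1994': [6, 6, 3],
    'avgtemp.1995': [5, 8, 11],
    'avgtemp.1996': [7, 8, 7],
})

# melt stacks the year columns into rows: one row per (country, year) pair.
long_df = wide.melt(id_vars='country', var_name='year', value_name='avgtemp')

# The melted 'year' values look like 'avgtemp.1994'; keep only the number.
long_df['year'] = long_df['year'].str.replace('avgtemp.', '', regex=False).astype(int)
print(long_df)
```

The result has one feature per column (country, year, avgtemp) and one datapoint per row, matching the long-form table shown earlier.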
Question 2:
Use seaborn's barplot function to make a bar chart showing the average
number of registered riders on each day of the week over the
entire bike_trips dataset.
Here's a link to the seaborn API: http://seaborn.pydata.org/api.html
See if you can figure it out by reading the docs and talking with your partner.
Once you have the plot, discuss it with your partner. What trends do you notice? What do you suppose causes these trends?
Notice that barplot draws error bars for each category. It uses bootstrapping
to make those.
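If bootstrapping is unfamiliar, the idea can be sketched with numpy alone. This is a generic percentile bootstrap for a mean, not seaborn's exact implementation: the number of resamples and the 95% level are assumptions here, and the counts are made-up values for a single category.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observations for one category (e.g. one day of the week).
counts = np.array([12, 30, 45, 8, 60, 22, 37, 51, 19, 40])

# Resample the data with replacement many times and record each resample's mean.
n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(counts, size=len(counts), replace=True)
    boot_means[i] = resample.mean()

# The middle 95% of the bootstrap means gives the ends of the error bar.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {counts.mean():.1f}, interval = ({low:.1f}, {high:.1f})")
```

A wide interval means the average for that category is uncertain, usually because there are few observations or they vary a lot.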
In [196]:
...
Question 3: Now for a fancier plot that seaborn makes really easy to produce.
Use the distplot function to plot a histogram of all the total rider counts in the
bike_trips dataset.
In [196]:
...
Notice that seaborn will fit a curve to the histogram of the data. Fancy!
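The curve is a kernel density estimate: a smooth average of small bumps, one centered on each data point. Here is a rough numpy sketch of the idea. The helper gaussian_kde_curve is our own name, the sample is synthetic, and the bandwidth rule is an assumption that may differ from what seaborn uses.

```python
import numpy as np

def gaussian_kde_curve(data, grid, bandwidth):
    """Average of Gaussian bumps centered on each data point, evaluated on grid."""
    data = np.asarray(data, dtype=float)
    # Standardized distance from every grid point to every data point,
    # shape (len(grid), len(data)).
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic data

# Rule-of-thumb bandwidth (an assumption, not necessarily seaborn's choice).
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)
grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 400)
density = gaussian_kde_curve(sample, grid, bandwidth)

# A density curve should enclose an area of roughly 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that hides detail; a smaller one follows the histogram more closely but looks bumpier.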
Question 4: Discuss this plot with your partner. What shape does the distribution have? What does that imply about the rider counts?
Question 5:
Use seaborn to make side-by-side boxplots of the number of casual riders (just
checked out a bike for that day) and registered riders (have a bikeshare membership).
The boxplot function will plot all the columns of the dataframe you pass in.
Once you make the plot, you'll notice that there are many outliers that make the plot hard to see. To mitigate this, change the y-scale to be logarithmic.
That's a plot customization so you'll use matplotlib. The boxplot function returns
a matplotlib Axes object, which represents a single plot and has a set_yscale method.
The result should look like:
In [205]:
...
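The Axes-based customization described in Question 5 is not specific to seaborn. As a sketch of the pattern, here it is with matplotlib's own boxplot on synthetic data; the two groups below are randomly generated and have nothing to do with the bikeshare counts.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic, strictly positive, heavily skewed values in two groups.
groups = [
    rng.lognormal(mean=2, sigma=1, size=200),
    rng.lognormal(mean=4, sigma=1, size=200),
]

# plt.subplots returns a Figure and an Axes; the Axes represents a single plot.
fig, ax = plt.subplots()
ax.boxplot(groups)

# Switch the y-axis to a logarithmic scale so the boxes are not squashed
# by a handful of very large values.
ax.set_yscale('log')
ax.set_ylabel('value (log scale)')
```

Keep in mind that a logarithmic axis cannot display zero values, so any zeros in the data will not appear on the plot.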
Question 6: Discuss with your partner what the plot tells you about the distribution of casual vs. the distribution of registered riders.
Question 7: Let's take a closer look at the number of registered vs. casual riders.
Use the lmplot function to make a scatterplot. Put the number of casual
riders on the x-axis and the number of registered riders on the y-axis.
Each point should correspond to a single row in your bike_trips dataframe.
In [210]:
...
Question 8: What do you notice about that plot? Discuss with
your partner. Notice that seaborn automatically fits a line of best
fit to the plot. Does that line seem to be relevant?
You should note that lmplot allows you to pass in fit_reg=False to
avoid plotting lines of best fit when you feel they are unnecessary
or misleading.
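For reference, a line of best fit of this kind is typically an ordinary least-squares line: the slope and intercept that minimize the squared vertical distances to the points. Here is a minimal numpy sketch on made-up numbers; it illustrates the idea and is not seaborn's internal code.

```python
import numpy as np

# Made-up points that lie close to a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a degree-1 polynomial (a line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are the vertical gaps between the data and the fitted line.
fitted = slope * x + intercept
residuals = y - fitted
print(slope, intercept)
```

Looking at the residuals is a quick way to judge whether a single line is a sensible summary: if they show a clear pattern or split into groups, the line may be misleading.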
Question 9: There seem to be two main groups in the scatterplot. Let's see if we can separate them out.
Use lmplot to make the scatterplot again. This time, use the hue parameter
to color points for weekday trips differently from weekend trips. You should
get something that looks like:
In [223]:
# In your plot, you'll notice that your points are larger than ours. That's
# fine. If you'd like them to be smaller, you can add scatter_kws={'s': 6}
# to your lmplot call. That tells the underlying matplotlib scatter function
# to change the size of the points.
...
# Note that the legend for workingday isn't super helpful. 0 in this case
# means "not a working day" and 1 means "working day". Try fixing the legend
# to be more descriptive.
Question 10: Discuss the plot with your partner. Was splitting the data by working day informative? One of the best-fit lines looks valid but the other doesn't. Why do you suppose that is?
Question 11 (bonus): Eventually, you'll want to be able to pose a question yourself and answer it using a visualization. Here's a question you can think about:
How do the number of casual and registered riders change throughout the day, on average?
See if you can make a plot to answer this.
In [221]:
...
We recommend checking out the seaborn tutorials on your own time: http://seaborn.pydata.org/tutorial.html
The matplotlib tutorial we linked in Question 0 is also a great refresher on common matplotlib
functions: https://www.labri.fr/perso/nrougier/teaching/matplotlib/
Here's a great blog post about the differences between Python's visualization libraries: https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/
In [ ]:
i_definitely_finished = False
In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()
In [ ]:
_ = ok.submit()