[Data, the Humanist's New Best Friend](index.ipynb)
*Class 07*

In this class you are expected to learn:

  1. Split-Apply-Combine
  2. String manipulation
  3. Cross-tabulation and pivoting
  4. Visualizations (matplotlib)
  5. Histograms

*We won't need Excel anymore... hopefully*

Group By: Split-Apply-Combine

Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.

In pandas, those operations are carried out by the .groupby() method of DataFrames and Series, but the meaning is the same. From the documentation, by "group by" we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

*Split by column `x`, apply the average of `y` values for each category of `x`, and combine the results.*
Source: [Hadley Wickham's Data Science in R slides](http://courses.had.co.nz/12-oscon/)

Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, we might wish to do one of the following (a short sketch follows the list):

  • Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
    • Compute group sums or means
    • Compute group sizes / counts
  • Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
    • Standardizing data within group
    • Filling NAs within groups with a value derived from each group
  • Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
    • Discarding data that belongs to groups with only a few members
    • Filtering out data based on the group sum or mean
  • Some combination of the above: .groupby() will examine the results of the apply step and try to return a sensibly combined result if it doesn't fit into any of the above categories
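
For instance, here is a minimal sketch of the three kinds of apply steps, using a small made-up DataFrame (the column names x and y are hypothetical, not part of the Chicago data used below):

import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'b'], 'y': [1, 2, 3, 4, 5]})
grouped = df.groupby('x')
print(grouped['y'].mean())                             # aggregation: one value per group
print(grouped['y'].transform(lambda s: s - s.mean()))  # transformation: same length as the input
print(grouped.filter(lambda g: len(g) >= 3))           # filtration: keep only groups with 3+ rows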

Since the set of object instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame method on each group. The name .groupby() should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:

SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2

We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality then provide some non-trivial examples / use cases.
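
For comparison, a rough pandas equivalent of that SQL query could look like this (a sketch; some_table is a hypothetical DataFrame holding SomeTable):

# assuming SomeTable has been loaded into a DataFrame called some_table
some_table.groupby(['Column1', 'Column2']).agg({'Column3': 'mean', 'Column4': 'sum'})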

To illustrate these operations, we will use the City of Chicago salaries CSV file:


In [1]:
import numpy as np
import pandas as pd
!head data/city-of-chicago-salaries.csv


Name,Position Title,Department,Employee Annual Salary
"AARON,  ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00
"AARON,  JEFFERY M",POLICE OFFICER,POLICE,$75372.00
"AARON,  KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$80916.00
"ABAD JR,  VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$99648.00
"ABBATACOLA,  ROBERT J",ELECTRICAL MECHANIC,AVIATION,$89440.00
"ABBATEMARCO,  JAMES J",FIREFIGHTER,FIRE,$78012.00
"ABBATE,  TERRY M",POLICE OFFICER,POLICE,$80724.00
"ABBOTT,  BETTY L",FOSTER GRANDPARENT,FAMILY & SUPPORT,$2756.00
"ABBOTT,  LYNISE M",CLERK III,POLICE,$41784.00

In [2]:
chicago = pd.read_csv('data/city-of-chicago-salaries.csv')
chicago.head()


Out[2]:
Name Position Title Department Employee Annual Salary
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT $85512.00
1 AARON, JEFFERY M POLICE OFFICER POLICE $75372.00
2 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES $80916.00
3 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT $99648.00
4 ABBATACOLA, ROBERT J ELECTRICAL MECHANIC AVIATION $89440.00

So far we have seen how to manipulate data once it is loaded, but pandas also provides a way to apply some basic modifications at load time, the converters, as well as a way to rename the columns. Since the salaries have a $ symbol preceding the figure, let's create a converter for that column that removes the symbol and converts the string into a float.


In [3]:
headers = ['name', 'title', 'department', 'salary']
chicago = pd.read_csv('data/city-of-chicago-salaries.csv',
                      header=0,
                      names=headers,
                      converters={'salary': lambda x: float(x.replace('$', ''))})  # yes, lambda functions!
print("salary", chicago["salary"].dtype)
chicago.head()


salary float64
Out[3]:
name title department salary
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT 85512
1 AARON, JEFFERY M POLICE OFFICER POLICE 75372
2 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES 80916
3 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT 99648
4 ABBATACOLA, ROBERT J ELECTRICAL MECHANIC AVIATION 89440

Pandas .groupby() returns a DataFrameGroupBy object which has a variety of methods, many of which are similar to standard SQL aggregate functions.


In [4]:
by_dept = chicago.groupby('department')
by_dept


Out[4]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f4d5b465400>

Calling .count() returns the total number of non-null values within each column. If we were interested in the total number of records in each group, we could use .size().


In [5]:
by_dept.count().head(5)  # Not null records within each column


Out[5]:
name title salary
department
ADMIN HEARNG 42 42 42
ANIMAL CONTRL 61 61 61
AVIATION 1218 1218 1218
BOARD OF ELECTION 110 110 110
BOARD OF ETHICS 9 9 9

In [6]:
by_dept.size().tail()  # total records for each department


Out[6]:
department
PUBLIC LIBRARY     926
STREETS & SAN     2070
TRANSPORTN        1168
TREASURER           25
WATER MGMNT       1857
dtype: int64

Summation can be done via .sum(), averaging by .mean(), etc. (if it's a SQL function, chances are it exists in pandas). Oh, and there's .median() too, something not available in most databases.


In [7]:
by_dept.sum()[20:25]  # total salaries of each department


Out[7]:
salary
department
HUMAN RESOURCES 4850928.0
INSPECTOR GEN 4035150.0
IPRA 7006128.0
LAW 31883920.2
LICENSE APPL COMM 65436.0

In [8]:
by_dept.mean().head(5)  # average salary of each department


Out[8]:
salary
department
ADMIN HEARNG 70336.857143
ANIMAL CONTRL 57121.455738
AVIATION 70638.249130
BOARD OF ELECTION 55728.872727
BOARD OF ETHICS 81650.666667

In [9]:
by_dept.median()[20:25]  # take that, RDBMS!


Out[9]:
salary
department
HUMAN RESOURCES 68496
INSPECTOR GEN 76116
IPRA 82524
LAW 66492
LICENSE APPL COMM 65436

Operations can also be done on an individual Series within a grouped object. Say we were curious about the five departments with the most distinct titles.


In [10]:
by_dept.title.nunique().order(ascending=False)[:5]


Out[10]:
department
WATER MGMNT    153
TRANSPORTN     150
POLICE         130
AVIATION       125
HEALTH         118
Name: title, dtype: int64
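
Note that Series.order() reflects the pandas version this notebook was run with; in recent pandas releases it has been removed, and the equivalent call (a sketch) would be:

by_dept.title.nunique().sort_values(ascending=False)[:5]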

Split-Apply-Combine

The real power of .groupby() comes from its split-apply-combine ability.

What if we wanted to see the highest paid employee within each department? Using .groupby() we can define a function (which we'll call ranker()) that will label each record from 1 to N, where N is the number of employees within the department. We can then call apply to, well, apply that function to each group (in this case, each department).


In [11]:
def ranker(df):
    """Assigns a rank to each employee based on salary, with 1 being the highest paid.
    Assumes the data is DESC sorted."""
    df['dept_rank'] = np.arange(len(df)) + 1
    return df

In [12]:
chicago.sort('salary', ascending=False, inplace=True)
chicago = chicago.groupby('department').apply(ranker)
chicago.head()


Out[12]:
name title department salary dept_rank
18039 MC CARTHY, GARRY F SUPERINTENDENT OF POLICE POLICE 260004 1
8004 EMANUEL, RAHM MAYOR MAYOR'S OFFICE 216210 1
25588 SANTIAGO, JOSE A FIRE COMMISSIONER FIRE 202728 1
27594 STEWART III, CHARLES FIRST DEPUTY FIRE COMMISSIONER FIRE 188316 2
31587 WYSINGER, ALFONZA FIRST DEPUTY SUPERINTENDENT POLICE 188316 2

In [13]:
chicago[chicago.dept_rank == 1].head(7)


Out[13]:
name title department salary dept_rank
18039 MC CARTHY, GARRY F SUPERINTENDENT OF POLICE POLICE 260004 1
8004 EMANUEL, RAHM MAYOR MAYOR'S OFFICE 216210 1
25588 SANTIAGO, JOSE A FIRE COMMISSIONER FIRE 202728 1
763 ANDOLINO, ROSEMARIE S COMMISSIONER OF AVIATION AVIATION 186576 1
4697 CHOUCAIR, BECHARA N COMMISSIONER OF HEALTH HEALTH 177156 1
21971 PATTON, STEPHEN R CORPORATION COUNSEL LAW 173664 1
12635 HOLT, ALEXANDRA D BUDGET DIR BUDGET & MGMT 169992 1

In [14]:
chicago[chicago.department == "LAW"][:5]


Out[14]:
name title department salary dept_rank
21971 PATTON, STEPHEN R CORPORATION COUNSEL LAW 173664 1
6311 DARLING, LESLIE M FIRST ASST CORPORATION COUNSEL LAW 149160 2
17680 MARTINICO, JOSEPH P CHIEF LABOR NEGOTIATOR LAW 144036 3
22357 PETERS, LYNDA A CITY PROSECUTOR LAW 139932 4
31383 WONG JR, EDWARD J DEPUTY CORPORATION COUNSEL LAW 137076 5

We can now see where each employee ranks within their department based on salary.
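
As an aside, DataFrame.sort() has likewise been renamed to sort_values() in recent pandas releases, and the per-department rank can be computed without apply() by using cumcount() on the sorted frame (a sketch, assuming a recent pandas):

chicago = chicago.sort_values('salary', ascending=False)
chicago['dept_rank'] = chicago.groupby('department').cumcount() + 1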

String manipulations

*[I got 99 problems. So I used regular expressions. Now I have 100 problems.](http://xkcd.com/1171/)*

Regular expressions are a powerful language for matching text patterns. The Python re module provides regular expression support.


In [15]:
import re

In Python a regular expression search is typically written as:

match = re.search(pat, str)

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object or None otherwise. Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded, as shown in the following example which searches for the pattern 'word:' followed by a 3 letter word:


In [16]:
string = """
    My phone is 519 456-1234, call me, maybe. Oh., wait, this is my real number 226 456 1234
"""
match = re.findall(r'\d{3}\s\d{3}[\s-]?\d{4}', string)
# If-statement after search() tests if it succeeded
match


Out[16]:
['519 456-1234', '226 456 1234']

In [17]:
string = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', string)
# If-statement after search() tests if it succeeded
if match:                      
    print('found', match.group()) ## 'found word:cat'
else:
    print('did not find')


found word:cat

The code match = re.search(pattern, string) stores the search result in a variable named match. Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.

The 'r' at the start of the pattern string designates a Python raw string, which passes backslashes through without change; this is very handy for regular expressions. I recommend that you always write pattern strings with the 'r' just as a habit.

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9 -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( )
  • . (a period) -- matches any single character except newline \n
  • \w (lowercase w) -- matches a word character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b -- boundary between word and non-word
  • \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r -- tab, newline, return
  • \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end -- match the start or end of the string
  • \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a slash. If you are unsure if a character has special meaning, such as @, you can put a slash in front of it, \@, to make sure it is treated just as a character.

Basic Examples

Joke: what do you call a pig with three eyes? piiig!

The basic rules of regular expression search for a pattern within a string are:

  • The search proceeds through the string from start to end, stopping at the first match found
  • All of the pattern must be matched, but not all of the string
  • If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text

In [18]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')
print(match)  #  found, match.group() == "iii"
match = re.search(r'igs', 'piiig')
print(match)  #  not found, match == None


<_sre.SRE_Match object; span=(1, 4), match='iii'>
None

In [19]:
## . = any char but \n
match = re.search(r'..g', 'piiig')
print(match)  #  found, match.group() == "iig"


<_sre.SRE_Match object; span=(2, 5), match='iig'>

In [20]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')
print(match)  #  found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!')
print(match)  #  found, match.group() == "abc"


<_sre.SRE_Match object; span=(1, 4), match='123'>
<_sre.SRE_Match object; span=(2, 5), match='abc'>

Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

  • + -- 1 or more occurrences of the pattern to its left, e.g. i+ = one or more i's
  • * -- 0 or more occurrences of the pattern to its left
  • ? -- match 0 or 1 occurrences of the pattern to its left

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").


In [21]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')
print(match)  #  found, match.group() == "piii"


<_sre.SRE_Match object; span=(0, 4), match='piii'>

In [22]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')
print(match)  #  found, match.group() == "ii"


<_sre.SRE_Match object; span=(1, 3), match='ii'>

In [23]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')
print(match)  #  found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')
print(match)  #  found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print(match)  #  found, match.group() == "123"


<_sre.SRE_Match object; span=(2, 9), match='1 2   3'>
<_sre.SRE_Match object; span=(2, 7), match='12  3'>
<_sre.SRE_Match object; span=(2, 5), match='123'>

In [24]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')
print(match)  # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')
print(match)  #  found, match.group() == "bar"


None
<_sre.SRE_Match object; span=(3, 6), match='bar'>

The Eternal Emails Example

Suppose you want to find the email address inside the string 'purple alice-b@google.com monkey dishwasher'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':


In [25]:
string = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', string)
if match:
    print(match.group())


b@google

The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.

Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches a or b or c. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add . and - to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:


In [26]:
match = re.search(r'[\w.-]+@[\w.-]+', string)
if match:
    print(match.group())  ## 'alice-b@google.com'


alice-b@google.com

You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
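
A couple of quick sketches of ranges and the inverted set (using the re module imported above):

print(re.search(r'[a-z]+', 'ABCdef123').group())  # 'def' -- a range of lowercase letters
print(re.findall(r'[^ab]', 'abcabd'))             # ['c', 'd'] -- any char except 'a' or 'b'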

Group Extraction

The group feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.


In [27]:
string = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', string)
if match:
    print(match.group())  ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)


alice-b@google.com
alice-b
google.com

A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.


In [28]:
## Suppose we have a text with many email addresses
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', string) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)


alice@google.com
bob@abc.com

findall and Groups

The parentheses ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of tuples. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each of length 2, containing the username and host, e.g. ('alice', 'google.com').


In [29]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', string)
print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in tuples:
    print(pair[0])  ## username
    print(pair[1])  ## host


[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com

Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parentheses, then findall() returns a list of found strings as in earlier examples. If the pattern includes a single set of parentheses, then findall() returns a list of strings corresponding to that single group. (Obscure optional feature: sometimes you have paren ( ) groupings in the pattern which you do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ), and that left paren will not count as a group result; see the sketch below.)
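
A quick sketch of that (?: ) feature: the non-capturing group still takes part in the match, but it does not show up in the findall() results.

string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
print(re.findall(r'[\w.-]+@(?:[\w.-]+)', string))  # ['alice@google.com', 'bob@abc.com'] -- whole matches
print(re.findall(r'[\w.-]+@([\w.-]+)', string))    # ['google.com', 'abc.com'] -- only the captured group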

RE Workflow and Debug

Regular expression patterns pack a lot of meaning into just a few characters, but they are so dense that you can spend a lot of time debugging your patterns. Set up your runtime so you can run a pattern and print what it matches easily, for example by running it on a small test text and printing the result of findall(). If the pattern matches nothing, try weakening the pattern, removing parts of it so you get too many matches. When it's matching nothing, you can't make any progress since there's nothing concrete to look at. Once it's matching too much, then you can work on tightening it up incrementally to hit just what you want.

Options

The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).

  • IGNORECASE -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
  • DOTALL -- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
  • MULTILINE -- Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string. (A short example follows the list.)
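
A short sketch of two of these flags in action:

print(re.findall(r'cat', 'Cat cAt cat', re.IGNORECASE))          # ['Cat', 'cAt', 'cat']
print(re.findall(r'^\w+', 'one line\ntwo lines', re.MULTILINE))  # ['one', 'two'] -- ^ matches each line start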

Greedy vs. Non-Greedy (optional)

This is an optional section which shows a more advanced regular expression technique not needed for the exercises.

Suppose you have text with tags in it: <b>foo</b> and <i>so on</i>

Suppose you are trying to match each tag with the pattern '(<.*>)' -- what does it match first?

The result is a little surprising, but the greedy aspect of the .* causes it to match the whole '<b>foo</b> and <i>so on</i>' as one big match. The problem is that the .* goes as far as it can, instead of stopping at the first > (aka it is "greedy").

There is an extension to regular expressions where you add a ? at the end, such as .*? or .+?, changing them to be non-greedy. Now they stop as soon as they can. So the pattern '(<.*?>)' will get just '<b>' as the first match, and '</b>' as the second match, and so on, getting each <..> pair in turn. The style is typically that you use a .*?, and then immediately to its right you look for some concrete marker (> in this case) that forces the end of the .*? run.

The *? extension originated in Perl, and regular expressions that include Perl's extensions are known as Perl Compatible Regular Expressions -- pcre. Python includes pcre support. Many command line utils etc. have a flag where they accept pcre patterns.

An older but widely used technique to code this idea of "all of these chars except stopping at X" uses the square-bracket style. For the above you could write the pattern, but instead of .* to get all the chars, use [^>]* which skips over all characters which are not > (the leading ^ "inverts" the square bracket set, so it matches any char not in the brackets).
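
A minimal sketch comparing the three approaches on the tagged string above:

tagged = '<b>foo</b> and <i>so on</i>'
print(re.findall(r'<.*>', tagged))     # ['<b>foo</b> and <i>so on</i>'] -- greedy: one big match
print(re.findall(r'<.*?>', tagged))    # ['<b>', '</b>', '<i>', '</i>'] -- non-greedy
print(re.findall(r'<[^>]*>', tagged))  # ['<b>', '</b>', '<i>', '</i>'] -- square-bracket alternative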

Substitution

The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.


In [30]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, string) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', string))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher


purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

Activity

Given the `arts` data frame, try to do the following:

  1. Clean the dates so you only see numbers by using string manipulations. If a year is lower than 100, it refers to the 1900s. For example, 78 is actually 1978, and that needs to be fixed too.
  2. Get the average execution year per artist.
  3. Get the average execution year per category.
  4. Get the number of artworks per artist. Which artist is the most prolific?
  5. Get the number of artworks per category. Which category has the highest number?
  6. Get the average length of artworks titles per category and artist.
  7. Get the year with the highest production.
  8. Get the approximate period of production for each artist. If an artist painted from 1970 to 1990, the period is 20.


In [31]:
string = "©1977-1979, CARCC"
finds = re.findall(r"\d+", string)
last_find = finds[-1]  # take the last number found in the string
int(last_find)


Out[31]:
1979

In [36]:
arts = pd.read_csv("data/arts.csv")
arts.head()


Out[36]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 ©1977, CARCC Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland ©1982, CARCC Photograph 92/3-0221
3 Walker, Peter 1.25.79 ©1979, CARCC Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

In [37]:
arts["execution_date"] = arts["execution_date"].str.findall(r"(\d+)").str[0]
arts["execution_date"].head()


Out[37]:
0    1984
1    1977
2    1982
3    1979
4    1983
Name: execution_date, dtype: object

In [38]:
# 1. If a year is lower than 100, it refers to the 1900s. For example, 78 is actually 1978, and that needs to be fixed too.
arts["execution_date"] = arts["execution_date"].astype(float)
arts["execution_date"] = arts["execution_date"].apply(lambda value: 1900 + value if value < 100 else value)
arts.head()


Out[38]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 1977 Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland 1982 Photograph 92/3-0221
3 Walker, Peter 1.25.79 1979 Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

In [39]:
# 2. Get the average execution year per artist.
arts.groupby("artist_name").mean().head()


Out[39]:
execution_date
artist_name
A & B Associés 1982.250000
Aalto, Harri 1976.000000
Aaluk, Paul 1990.000000
Abbott, Louise 1978.666667
Abdilla, Alex 2001.000000

In [40]:
# 3. Get the average execution year per category.
arts.groupby("category")[["execution_date"]].mean()


Out[40]:
execution_date
category
Fibre 1986.183007
Installation 1984.579710
Media 1987.400000
Other 1982.707424
Painting 1980.344828
Photograph 1984.550800
Sculpture 1984.260649
Work on paper 1979.037114

In [41]:
# 4. Get the number of artworks per artist. Which artist is the most prolific?
artworks_by_artist = arts.groupby("artist_name")[["title"]].aggregate(np.count_nonzero)
artworks_by_artist.sort("title", ascending=False).head()


Out[41]:
title
artist_name
Smith, Gordon 137
Onley, Toni 130
Kipling, Ann 122
Hurtubise, Jacques 113
Laing, William (Bill) 111

In [42]:
# 5. Get the number of artworks per category. Which category has the highest number?
artworks_by_category = arts.groupby("category")[["title"]].aggregate(np.count_nonzero)
artworks_by_category.sort("title", ascending=False).head()


Out[42]:
title
category
Work on paper 10637
Photograph 2874
Painting 2351
Sculpture 988
Other 459

In [43]:
arts['title_len'] = arts['title'].apply(len)
arts.head()


Out[43]:
artist_name title execution_date category id title_len
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174 8
1 Hansen, Jim Nfld. album no. 99 1977 Work on paper 79/80-0510 18
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland 1982 Photograph 92/3-0221 46
3 Walker, Peter 1.25.79 1979 Work on paper 79/80-0529 7
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007 18

In [44]:
# 6. Get the average length of artworks titles per category and artist.
arts['title_len'] = arts['title'].apply(len)
length_by_category = arts.groupby("category")[["title_len"]].aggregate(np.mean)
length_by_category.sort("title_len")


Out[44]:
title_len
category
Media 12.400000
Sculpture 16.527328
Painting 17.075287
Other 18.389978
Work on paper 18.432265
Fibre 20.681818
Installation 25.144928
Photograph 28.188935

In [45]:
# 7. Get the year with the highest production.
artworks_by_year = arts.groupby("execution_date")[["title"]].aggregate(np.count_nonzero)
artworks_by_year.sort("title", ascending=False).head(3)


Out[45]:
title
execution_date
1973 1625
1972 1087
1974 1041

In [46]:
# 8. Get the approximate period of production for each artist. If an artist painted from 1970 to 1990, the period is 20.
period_min = arts.groupby("artist_name")[["execution_date"]].aggregate(np.min)
period_max = arts.groupby("artist_name")[["execution_date"]].aggregate(np.max)
(period_max - period_min + 1).sort("execution_date", ascending=False).head(5)


Out[46]:
execution_date
artist_name
Tousignant, Serge 47
Pellan, Alfred 43
Ashevak, Kenojuak 40
Morris, Michael 37
Haeseker, Alexandra 37

In [47]:
arts.to_csv("data/arts_clean.csv")

Pivot tables

Once we know the concept behind .groupby(), the next natural step is to think about doing a lot of groupings so we can see the data aggregated in many different ways. And this is precisely the problem that pivot tables are designed to solve.

Let's consider a new dataset, books (adapted from here), from a publisher, with the number of copies of the books they handle, and say that we want to know things like:

  • How many copies are already published?
  • How many copies are in the pipeline?
  • How many copies per book and in what stage?

Let's set up our environment first and load the dataset.


In [2]:
import pandas as pd
import numpy as np

books = pd.read_excel("data/books-funnel.xlsx")
books.head()


Out[2]:
Identifier Book Author Editor Edition Copies Price Status
0 714466 Adventures of Trantow-Barrows Craig Booker Debra Henley Premiun 100 30 In-print
1 714466 Adventures of Trantow-Barrows Craig Booker Debra Henley Standard 100 10 In-print
2 714466 Adventures of Trantow-Barrows Craig Booker Debra Henley Pocket 200 5 pending
3 737550 The Russel Paradox Craig Booker Debra Henley Premiun 100 35 rejected
4 146832 Theory of Kiehn Daniel Hilton Debra Henley Premiun 200 65 published

In [8]:
# How many copies are already published?
books[books["Status"] == "published"]["Copies"].sum()


Out[8]:
1200

In [16]:
# How many copies are in the pipeline?
books[books["Status"].isin(["In-print", "pending"])]["Copies"].sum()


Out[16]:
1400

In [18]:
books.groupby(["Book", "Status"])[["Copies"]].aggregate(np.sum)


Out[18]:
Copies
Book Status
A brave warrior pending 200
Adventures of Trantow-Barrows In-print 200
pending 200
Barton Fink rejected 100
Kassulke from Tokyo published 300
Kulas In-print 100
pending 200
Lords of the Baltic In-print 200
rejected 200
Purdy & Kunde In-print 100
Skin published 500
Stokes the Vampire In-print 100
pending 100
The Russel Paradox rejected 100
The Whale and the Hunter published 200
Theory of Kiehn published 200

Pivot the data

As we build up the pivot table, it's easier to take it one step at a time. Add items and check each step to verify you are getting the results you expect. Don't be afraid to play with the order and the variables to see what presentation makes the most sense for your needs.

The simplest pivot table must have a dataframe and an index. In this case, let's use the book title Book as our index.


In [91]:
pd.pivot_table(books, index=["Book"])


Out[91]:
Copies Identifier Price
Book
A brave warrior 200.000000 412290 5.0
Adventures of Trantow-Barrows 133.333333 714466 15.0
Barton Fink 100.000000 740150 35.0
Kassulke from Tokyo 300.000000 307599 7.0
Kulas 150.000000 218895 25.0
Lords of the Baltic 200.000000 729833 35.0
Purdy & Kunde 100.000000 163416 30.0
Skin 500.000000 688981 100.0
Stokes the Vampire 100.000000 239344 7.5
The Russel Paradox 100.000000 737550 35.0
The Whale and the Hunter 200.000000 141962 65.0
Theory of Kiehn 200.000000 146832 65.0

You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.


In [19]:
pd.pivot_table(books, index=["Book", "Author", "Editor"])


Out[19]:
Copies Identifier Price
Book Author Editor
A brave warrior John Smith Debra Henley 200.000000 412290 5.0
Adventures of Trantow-Barrows Craig Booker Debra Henley 133.333333 714466 15.0
Barton Fink John Smith Debra Henley 100.000000 740150 35.0
Kassulke from Tokyo Wendy Yule Fred Anderson 300.000000 307599 7.0
Kulas Daniel Hilton Debra Henley 150.000000 218895 25.0
Lords of the Baltic Wendy Yule Fred Anderson 200.000000 729833 35.0
Purdy & Kunde Cedric Moss Fred Anderson 100.000000 163416 30.0
Skin Wendy Yule Fred Anderson 500.000000 688981 100.0
Stokes the Vampire Cedric Moss Fred Anderson 100.000000 239344 7.5
The Russel Paradox Craig Booker Debra Henley 100.000000 737550 35.0
The Whale and the Hunter Cedric Moss Fred Anderson 200.000000 141962 65.0
Theory of Kiehn Daniel Hilton Debra Henley 200.000000 146832 65.0

This is interesting but not particularly useful. What we probably want to do is look at this by Editor and Author. It’s easy enough to do by changing the index.


In [21]:
pd.pivot_table(books, index=["Editor", "Author"])


Out[21]:
Copies Identifier Price
Editor Author
Debra Henley Craig Booker 125.000000 720237.0 20.000000
Daniel Hilton 166.666667 194874.0 38.333333
John Smith 150.000000 576220.0 20.000000
Fred Anderson Cedric Moss 125.000000 196016.5 27.500000
Wendy Yule 300.000000 614061.5 44.250000

You can see that the pivot table is smart enough to start aggregating the data and summarizing it by grouping the authors with their editors. Now we start to get a glimpse of what a pivot table can do for us.

For this purpose, the Identifier and Price columns aren’t really useful. Let’s remove them by explicitly defining the columns we care about using the values field.


In [22]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"])


Out[22]:
Copies
Editor Author
Debra Henley Craig Booker 125.000000
Daniel Hilton 166.666667
John Smith 150.000000
Fred Anderson Cedric Moss 125.000000
Wendy Yule 300.000000

The Copies column is automatically averaged, but we can do a count or a sum instead. Switching to a sum is simple using aggfunc and np.sum.


In [23]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"], aggfunc=np.sum)


Out[23]:
Copies
Editor Author
Debra Henley Craig Booker 500
Daniel Hilton 500
John Smith 300
Fred Anderson Cedric Moss 500
Wendy Yule 1200

aggfunc can take a list of functions. Let’s try np.sum together with len to get both a total and a count.


In [98]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"], aggfunc=[np.sum, len])


Out[98]:
sum len
Copies Copies
Editor Author
Debra Henley Craig Booker 500 4
Daniel Hilton 500 3
John Smith 300 2
Fred Anderson Cedric Moss 500 4
Wendy Yule 1200 4

If we want to see copies broken down by the edition, the columns variable allows us to define one or more columns.

Columns vs. Values

One of the confusing points with the pivot_table is the use of columns and values. Remember, columns are optional - they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.


In [99]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"],
               columns=["Edition"], aggfunc=[np.sum])


Out[99]:
sum
Copies
Edition Pocket Premiun Standard
Editor Author
Debra Henley Craig Booker 200 200 100
Daniel Hilton NaN 400 100
John Smith 200 100 NaN
Fred Anderson Cedric Moss 100 300 100
Wendy Yule 300 700 200

The NaN’s are a bit distracting. If we want to remove them, we could use fill_value to set them to 0.


In [24]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"],
               columns=["Edition"], aggfunc=[np.sum], fill_value=0)


Out[24]:
sum
Copies
Edition Pocket Premiun Standard
Editor Author
Debra Henley Craig Booker 200 200 100
Daniel Hilton 0 400 100
John Smith 200 100 0
Fred Anderson Cedric Moss 100 300 100
Wendy Yule 300 700 200

It would be useful to add the average price as well. Add Price to the values list.


In [26]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies", "Price"],
               columns=["Edition"], aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0)


Out[26]:
Copies
Edition Pocket Premiun Standard
Author
Cedric Moss 100 300 100
Craig Booker 200 200 100
Daniel Hilton 0 400 100
John Smith 200 100 0
Wendy Yule 300 700 200

What's interesting is that you can move items to the index to get a different visual representation. Remove Edition from the columns and add it to the index.


In [27]:
pd.pivot_table(books, index=["Editor", "Author", "Edition"], values=["Copies", "Price"],
               aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0)


Out[27]:
Copies Price
Editor Author Edition
Debra Henley Craig Booker Pocket 200 5.0
Premiun 200 32.5
Standard 100 10.0
Daniel Hilton Premiun 400 52.5
Standard 100 10.0
John Smith Pocket 200 5.0
Premiun 100 35.0
Fred Anderson Cedric Moss Pocket 100 5.0
Premiun 300 47.5
Standard 100 10.0
Wendy Yule Pocket 300 7.0
Premiun 700 82.5
Standard 200 5.0

For this data set, this representation makes more sense. Now, what if I want to see some totals? margins=True does that for us. Also, a really handy feature is the ability to pass a dictionary to the aggfunc so you can perform different functions on each of the values you select. This has a side-effect of making the labels a little cleaner.


In [29]:
pd.pivot_table(books, index=["Editor", "Author", "Edition"], values=["Copies", "Price"],
               aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0, margins=True)


Out[29]:
Copies Price
Editor Author Edition
Debra Henley Craig Booker Pocket 200 5.000000
Premiun 200 32.500000
Standard 100 10.000000
Daniel Hilton Premiun 400 52.500000
Standard 100 10.000000
John Smith Pocket 200 5.000000
Premiun 100 35.000000
Fred Anderson Cedric Moss Pocket 100 5.000000
Premiun 300 47.500000
Standard 100 10.000000
Wendy Yule Pocket 300 7.000000
Premiun 700 82.500000
Standard 200 5.000000
All 3000 30.705882
*Eeeeeh, well...*

Let’s move the analysis up a level and look at our pipeline at the editor level.


In [30]:
pd.pivot_table(books, index=["Editor", "Status"], values=["Copies"],
               aggfunc=[np.sum], fill_value=0, margins=True)


Out[30]:
sum
Copies
Editor Status
Debra Henley In-print 300
pending 600
published 200
rejected 200
Fred Anderson In-print 400
pending 100
published 1000
rejected 200
All 3000

You can also provide a list of aggregation functions to apply to each value:


In [31]:
table = pd.pivot_table(books, index=["Editor", "Status"], columns=["Edition"], values=["Copies", "Price"],
               aggfunc={"Copies": np.sum, "Price":[np.sum, np.mean]}, fill_value=0)
table


Out[31]:
Copies Price
sum mean sum
Edition Pocket Premiun Standard Pocket Premiun Standard Pocket Premiun Standard
Editor Status
Debra Henley In-print 0 100 200 0 30.0 10.0 0 30 20
pending 400 200 0 5 40.0 0.0 10 40 0
published 0 200 0 0 65.0 0.0 0 65 0
rejected 0 200 0 0 35.0 0.0 0 70 0
Fred Anderson In-print 0 100 300 0 30.0 7.5 0 30 15
pending 100 0 0 5 0.0 0.0 5 0 0
published 300 700 0 7 82.5 0.0 7 165 0
rejected 0 200 0 0 65.0 0.0 0 65 0

It can look daunting to try to pull this all together at one time, but as soon as you start playing with the data and slowly adding the items, you get a feel for how it works. As a general rule of thumb, once you find yourself using multiple .groupby() calls, you should evaluate whether a pivot table is a more useful approach.

Advanced Pivot Table Filtering

Once you have generated your data, it is in a DataFrame so you can filter on it using your standard DataFrame functions.

If you want to look at just one editor:


In [32]:
table.query('Editor == ["Debra Henley"]')  # you might need to pip install numexpr


Out[32]:
Copies Price
sum mean sum
Edition Pocket Premiun Standard Pocket Premiun Standard Pocket Premiun Standard
Editor Status
Debra Henley In-print 0 100 200 0 30 10 0 30 20
pending 400 200 0 5 40 0 10 40 0
published 0 200 0 0 65 0 0 65 0
rejected 0 200 0 0 35 0 0 70 0
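
If you prefer not to depend on numexpr, an equivalent selection can be made with plain pandas indexing; for example (a sketch), .xs() picks one label from the Editor level of the index:

table.xs("Debra Henley", level="Editor", drop_level=False)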

We can look at our pending and in-print books. Note that query() is case-sensitive, so 'in-print' does not match the 'In-print' records and only the pending rows show up below.


In [33]:
table.query('Status == ["pending", "in-print"]')


Out[33]:
Copies Price
sum mean sum
Edition Pocket Premiun Standard Pocket Premiun Standard Pocket Premiun Standard
Editor Status
Debra Henley pending 400 200 0 5 40 0 10 40 0
Fred Anderson pending 100 0 0 5 0 0 5 0 0

This is a powerful feature of pivot_table, so do not forget that you have the full power of pandas once you get your data into the shape you need.

Cross-tabulation

Sometimes, the only operation you want to do in a pivot table is to count. In those cases, when frequency tables with multiple indices are required, a cross-tabulated version of the data might come in handy. In pandas, that kind of table has its own function, so you don't have to call pivot_table with all the parameters.


In [34]:
df = pd.DataFrame({'a' : np.random.randint(0, 2, size=20),
                   'b' : np.random.randint(0, 5, size=20),
                   'c' : np.random.randint(0, 3, size=20)})
 
df


Out[34]:
a b c
0 1 2 0
1 0 4 2
2 0 1 1
3 1 0 2
4 0 3 2
5 0 0 0
6 0 3 0
7 1 2 2
8 0 1 2
9 1 4 0
10 1 3 1
11 1 0 1
12 0 3 0
13 0 4 2
14 1 2 2
15 0 3 0
16 0 1 1
17 0 4 0
18 0 1 2
19 1 4 0

In [35]:
pd.crosstab(df['b'], [df['a'], df['c']])


Out[35]:
a 0 1
c 0 1 2 0 1 2
b
0 1 0 0 0 1 1
1 0 2 2 0 0 0
2 0 0 0 1 0 2
3 3 0 1 0 1 0
4 1 0 2 2 0 0
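
pd.crosstab() also accepts margins=True to append row and column totals; for example, reusing the df created above (a sketch):

pd.crosstab(df['b'], [df['a'], df['c']], margins=True)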

Activity

Given the `arts` data frame, create a pivot table by splitting the table into categories per year, and counting the number of artworks for each. Add marginal results and fill the `NaN` values.


In [57]:
pd.pivot_table(arts, index=["execution_date"], columns=["category"],
               values=["title"], aggfunc={"title": np.count_nonzero},
               fill_value=0, margins=True)


Out[57]:
title
category Fibre Installation Media Other Painting Photograph Sculpture Work on paper All
execution_date
1940.0 0 0 0 0 0 0 0 1 1
1943.0 0 0 0 0 1 0 0 0 1
1945.0 0 0 0 0 0 0 0 1 1
1948.0 0 0 0 0 0 0 0 2 2
1949.0 0 0 0 0 0 1 0 0 1
1950.0 0 0 0 0 0 0 0 1 1
1951.0 0 0 0 0 1 1 0 0 2
1952.0 0 0 0 0 0 1 0 0 1
1953.0 0 0 0 0 1 0 0 0 1
1954.0 0 0 0 0 2 1 0 3 6
1955.0 0 0 0 0 1 3 0 6 10
1956.0 0 0 0 0 2 3 0 2 7
1957.0 0 0 0 0 0 0 0 5 5
1958.0 0 0 0 0 4 4 0 2 10
1959.0 0 0 0 0 4 3 0 3 10
1960.0 0 0 0 0 13 4 0 5 22
1961.0 0 0 0 0 10 2 0 11 23
1962.0 0 0 0 0 11 2 0 8 21
1963.0 0 0 0 0 10 0 0 12 22
1964.0 0 0 0 1 23 0 0 35 59
1965.0 0 0 0 7 17 5 1 72 102
1966.0 0 0 0 1 35 1 5 99 141
1967.0 0 0 0 1 26 9 3 93 132
1968.0 1 0 0 7 29 8 15 109 169
1969.0 2 0 0 8 33 9 22 153 227
1970.0 3 1 0 1 40 11 16 196 268
1971.0 2 0 0 6 56 31 19 422 536
1972.0 2 1 0 16 195 59 47 767 1087
1973.0 10 1 0 20 164 85 45 1300 1625
1974.0 7 2 0 13 100 102 46 771 1041
... ... ... ... ... ... ... ... ... ...
1985.0 4 11 0 12 65 171 26 313 602
1986.0 4 0 2 21 73 89 23 364 576
1987.0 2 3 0 16 56 124 25 297 523
1988.0 11 0 0 19 60 111 27 239 467
1989.0 8 5 1 19 55 116 56 350 610
1990.0 4 5 0 22 57 120 33 298 539
1991.0 2 4 2 23 56 98 32 301 518
1992.0 14 2 2 38 90 92 54 309 601
1993.0 4 5 0 11 43 32 22 91 208
1994.0 8 1 0 6 15 27 18 59 134
1995.0 0 0 0 0 3 4 0 1 8
1996.0 0 0 0 1 1 7 0 1 10
1997.0 0 0 0 1 2 6 0 5 14
1998.0 1 0 0 0 18 6 2 7 34
1999.0 0 0 0 0 7 13 5 3 28
2000.0 2 0 0 2 14 13 8 14 53
2001.0 4 0 0 0 11 17 5 20 57
2002.0 5 0 0 3 17 10 18 31 84
2003.0 2 0 0 0 15 12 18 8 55
2004.0 2 0 0 1 15 21 7 12 58
2005.0 4 0 0 1 11 21 9 18 64
2006.0 1 0 0 2 9 23 8 13 56
2007.0 0 0 0 0 17 24 12 14 67
2008.0 0 0 0 0 13 13 11 14 51
2009.0 1 0 0 0 17 96 10 8 132
2010.0 2 0 0 0 8 10 18 2 40
2011.0 0 0 0 0 4 1 1 2 8
2012.0 0 0 0 1 0 2 0 0 3
2013.0 0 0 0 0 0 0 0 2 2
All 154 69 10 459 2351 2874 988 10637 17542

70 rows × 9 columns

Plotting and Visualization

There are a handful of third-party Python packages that are suitable for creating scientific plots and visualizations. These include packages like:

  • matplotlib
  • Chaco
  • PyX
  • Bokeh

Here, we will focus exclusively on matplotlib and the high-level plotting available within pandas. It is currently the most robust and feature-rich package available.

Visual representation of data

We require plots, charts and other statistical graphics for the written communication of quantitative ideas. They allow us to more easily convey relationships and reveal deviations from patterns.

Gelman and Unwin 2011.

A well-designed graph can display more information than a table of the same size, and more information than numbers embedded in text. Graphical displays allow and encourage direct visual comparisons.

Matplotlib

The easiest way to interact with matplotlib used to be via pylab in IPython: by starting IPython (or the IPython notebook) in "pylab mode", both matplotlib and numpy were pre-loaded into the session. That approach is now deprecated, and the recommended way is simply to use the %matplotlib magic; numpy is loaded with a regular import.


In [59]:
%matplotlib inline

You can specify a custom graphical backend (e.g. qt, gtk, osx), but IPython generally does a good job of auto-selecting. Now matplotlib is ready to go, and you can access the matplotlib API via plt. If you do not start IPython in pylab mode, you can do this manually with the following convention:


In [60]:
import matplotlib.pyplot as plt
plt.plot(np.random.normal(size=100), np.random.normal(size=100), 'ro')


Out[60]:
[<matplotlib.lines.Line2D at 0x7f889c2f1978>]

Matplotlib: Scripted vs. Object Oriented

Matplotlib provides two flavors of interface. One is the scripted interface, designed to feel like Matlab. To enable this, matplotlib maintains a pointer to the current figure, current axis, etc. and directs top-level commands to those places:


In [61]:
# create a figure using the matlab-like interface
x = np.linspace(0, 10, 1000)
plt.subplot(2, 1, 1)  # 2x1 grid, first plot
plt.plot(x, np.sin(x))
plt.title('Trig is easy.')

plt.subplot(2, 1, 2)  # 2x1 grid, second plot
plt.plot(x, np.cos(x))
plt.xlabel('x')


Out[61]:
<matplotlib.text.Text at 0x7f889c2167f0>

The other interface is an object-oriented interface, where we explicitly pass around references to the plot elements we want to work with:


In [65]:
# create the same figure using the object-oriented interface
fig = plt.figure(figsize=(10, 8))

ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(x, np.sin(x))
ax1.set_title("Trig iss asdasd asd  easy")

ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(x, np.cos(x))
ax2.set_xlabel('x')
ax2.set_ylabel("Hello!")


Out[65]:
<matplotlib.text.Text at 0x7f8897fb24a8>

Why use one interface vs another?

These two interfaces are convenient for different circumstances. I find that for doing quick, simple plots, the scripted interface is often easiest. On the other hand, when I want more sophisticated plots, the object-oriented interface is simpler and more powerful. In fact, the scripted interface has several distinct limitations. It's good practice to use the object-oriented interface. That is, you should get in the habit of never using plt. when you can reference a specific axes or figure object instead.

Multi-panel plots

There are four main ways to create multi-panel plots in matplotlib. From lowest to highest-level they are (roughly):

1. fig.add_axes()

Useful for creating inset plots. The add_axes method allows you to create an axes instance by specifying the size relative to the figure edges. The argument is [left, bottom, width, height] which specifies the axes extent in fractions of the figure size (i.e. between 0 and 1):


In [162]:
fig = plt.figure()
main_ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])



In [163]:
inset_ax = fig.add_axes([0.6, 0.6, 0.25, 0.25])
fig


Out[163]:

In [164]:
main_ax.plot(np.random.rand(100), color='gray')
inset_ax.plot(np.random.rand(20), color='black')
fig


Out[164]:

2. fig.add_subplot()

Useful for simple multi-panel plots. If you're trying to create multiple axes in a grid, you might use .add_axes() repeatedly, but calculating the extent for each axes is not trivial. The .add_subplot() method can streamline this. The arguments are of the form N_vertical, N_horizontal, Plot_number, and the indices start at 1:


In [165]:
fig = plt.figure()
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.45, 0.45, str(i), fontsize=24)


If you desire, you can adjust the spacing using fig.subplots_adjust(), with units relative to the figure size (i.e. between 0 and 1)


In [166]:
fig.subplots_adjust(left=0.1, right=0.9,
                    bottom=0.1, top=0.9,
                    hspace=0.4, wspace=0.4)
fig


Out[166]:

3. plt.subplots()

Convenience function to create multiple subplots. Because creating a full grid of subplots is such a common task, matplotlib recently added the plt.subplots() command which creates the figure and axes in one go. The arguments are N_vertical, N_horizontal, and the axes are returned within a Numpy array:


In [167]:
fig, ax = plt.subplots(2, 3)
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.2, 0.45, str((i, j)), fontsize=20)



In [168]:
print(type(ax))
print(ax.shape)
print(ax.dtype)


<class 'numpy.ndarray'>
(2, 3)
object

An additional nice piece of this routine is the ability to specify that the subplots have a shared x or y axis: this ties together the axis limits and removes redundant tick labels:


In [169]:
fig, ax = plt.subplots(2, 3, sharex=True, sharey=True)
x = np.linspace(0, 10, 1000)
for i in range(2):
    for j in range(3):
        ax[i, j].plot(x, (j + 1) * np.sin((i + 1) * x))


4. plt.GridSpec()

Useful for more involved layouts. GridSpec is the highest-level routine for creating subplots. It's an abstract object that allows the creation of multi-row or multi-column subplots via an intuitive slicing interface:


In [170]:
gs = plt.GridSpec(3, 3)  # a 3x3 grid
fig = plt.figure(figsize=(6, 6))  # figure size in inches

fig.add_subplot(gs[:, :])
fig.add_subplot(gs[1, 1])


Out[170]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d4b6cd320>

In [171]:
gs = plt.GridSpec(3, 3, wspace=0.4, hspace=0.4)
fig = plt.figure(figsize=(6, 6))

fig.add_subplot(gs[1, :2])
fig.add_subplot(gs[0, :2])
fig.add_subplot(gs[2, 0])
fig.add_subplot(gs[:2, 2])
fig.add_subplot(gs[2, 1:])


Out[171]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d4b54ccf8>

Check out the documentation of plt.GridSpec for information on adjusting the subplot parameters. The keyword arguments are similar to those of plt.subplots_adjust().
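
For example (a sketch), GridSpec also accepts width_ratios and height_ratios to make some rows or columns larger than others:

gs = plt.GridSpec(2, 2, width_ratios=[2, 1], height_ratios=[1, 2], wspace=0.3, hspace=0.3)
fig = plt.figure(figsize=(6, 6))
fig.add_subplot(gs[0, 0])
fig.add_subplot(gs[0, 1])
fig.add_subplot(gs[1, :])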

Plotting in Pandas

On the other hand, Pandas includes methods for DataFrame and Series objects that are relatively high-level, and that make reasonable assumptions about how the plot should look.


In [73]:
normals = pd.Series(np.random.normal(size=10))
normals.plot()


Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f889c309240>

Notice that by default a line plot is drawn, and a light grid is included. All of this can be changed, however:


In [77]:
normals.cumsum().plot(grid=False)


Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f88c02a2630>

Similarly, for a DataFrame:


In [78]:
variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                          'gamma': np.random.gamma(1, size=100), 
                          'poisson': np.random.poisson(size=100)})
variables.cumsum(0).plot()


Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f889797c2b0>

As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for plot:


In [80]:
variables.cumsum(0).plot(subplots=True)


Out[80]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f8897747f28>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f88976d46a0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f889769d940>], dtype=object)

Or, we may want to have some series displayed on the secondary y-axis, which can allow for greater detail and less empty space:


In [81]:
variables.cumsum(0).plot(secondary_y='normal')


Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8897635940>

Of course, each plot returned is an object with some methods and attributes to modify the aspect of your chart.


In [82]:
ax = variables.cumsum(0).plot()
ax.set_xlabel('Radius')
ax.set_ylabel('Area')
ax.set_title('Area of a Circle')


Out[82]:
<matplotlib.text.Text at 0x7f8897519198>

If we would like a little more control, we can use matplotlib's subplots function directly, and manually assign plots to its axes:


In [83]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i, var in enumerate(['normal','gamma','poisson']):
    variables[var].cumsum(0).plot(ax=axes[i], title=var)
axes[0].set_ylabel('cumulative sum')


Out[83]:
<matplotlib.text.Text at 0x7f88974fb978>

Bar plots

*Thanks, Bear Grylls!*

Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes. In Pandas, we just use the plot method with a kind='bar' argument.

For this series of examples, let's use our arts and books datasets:


In [84]:
arts.groupby('category')[["title"]].count().plot(kind='bar')


Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f88973dc438>

In [85]:
books.groupby(['Editor', 'Author'])[["Copies"]].aggregate(np.sum).plot(kind='barh')


Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f88973a1278>

In [86]:
counts = pd.crosstab(books["Edition"], books["Status"])
counts.plot(kind='bar', stacked=True, grid=False)


Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8897307c18>

Activity

Given the `arts` data frame, plot a chart with the number of artworks per year. Set the title and the names of the axes. Also hide the legend and the grid.


In [119]:
grouped = arts.groupby("execution_date")[["title"]].count()
ax = grouped.plot(legend=False, grid=False)
ax.set_title("Artworks per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Number of Artworks")
ax.set_xlim([1950, 2000])


Out[119]:
(1950, 2000)

Histograms

Frequently it is useful to look at the distribution of data before you analyze it. Histograms are a sort of bar graph that displays relative frequencies of data values; hence, the y-axis is always some measure of frequency. This can either be raw counts of values or scaled proportions.

For example, we might want to see how the artworks are distributed across time:


In [123]:
arts["execution_date"].hist(grid=False)


Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d52223c50>

The hist method puts the continuous year values into bins, trying to make a sensible decision about how many bins to use (or equivalently, how wide the bins are). We can override the default value (10):


In [120]:
arts["execution_date"].hist(grid=False, bins=30)


Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8896f9a5c0>

There are algorithms for determining an "optimal" number of bins, each of which varies somehow with the number of observations in the data series.


In [121]:
sturges = lambda n: int(np.log2(n) + 1)
square_root = lambda n: int(np.sqrt(n))
from scipy.stats import kurtosis
doanes = lambda data: int(1 + np.log(len(data)) + np.log(1 + kurtosis(data) * (len(data) / 6.) ** 0.5))

n = len(arts["execution_date"])
sturges(n), square_root(n), doanes(arts["execution_date"].dropna())


Out[121]:
(15, 132, 14)

In [122]:
arts["execution_date"].hist(bins=doanes(arts["execution_date"].dropna()))


Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8897032b70>

A density plot is similar to a histogram in that it describes the distribution of the underlying data, but rather than being a pure empirical representation, it is an estimate of the underlying "true" distribution. As a result, it is smoothed into a continuous line plot. We create them in Pandas using the plot method with kind='kde', where kde stands for kernel density estimate.


In [178]:
arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020))


Out[178]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d4914a240>

We can even make it look funny by using the XKCD style!


In [179]:
with plt.xkcd():
    arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020))


Often, histograms and density plots are shown together:


In [136]:
arts["execution_date"].hist(bins=doanes(arts["execution_date"].dropna()), normed=True, color='lightseagreen')
arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020), style='r--')


Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d4c938358>

Here, we had to normalize the histogram (normed=True), since the kernel density is normalized by definition (it is a probability distribution).
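
Note that normed= is the keyword in the matplotlib/pandas versions this notebook was written against; in more recent releases it has been replaced by density=. A sketch of the same plot, assuming a newer version:

# density=True plays the same role as normed=True in newer matplotlib releases
arts["execution_date"].hist(bins=doanes(arts["execution_date"].dropna()),
                            density=True, color='lightseagreen')
arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020), style='r--')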

Boxplots

A different way of visualizing the distribution of data is the boxplot, which is a display of common quantiles; these are typically the quartiles and the lower and upper 5 percent values.


In [138]:
arts.boxplot(column="execution_date", by="category", grid=False)


Out[138]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d4caaa518>

You can think of the box plot as viewing the distribution from above. The blue crosses are "outlier" points that occur outside the extreme quantiles.

One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.


In [143]:
bp = arts.boxplot(column="execution_date", by="category", grid=False)
# Boxplot positions start at 1, and the groups are drawn in sorted order
for i, category in enumerate(sorted(arts["category"].unique())):
    y = arts["execution_date"][arts["category"] == category].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i + 1, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)


When data are dense, a couple of tricks used above help the visualization:

  • Reducing the alpha level to make the points partially transparent
  • Adding random "jitter" along the x-axis to avoid overstriking

Scatterplots

To look at how Pandas does scatterplots, let's load the baseball sample dataset.


In [144]:
baseball = pd.read_csv("data/baseball.csv")
baseball.head()


Out[144]:
id player year stint team lg g ab r h ... rbi sb cs bb so ibb hbp sh sf gidp
0 88641 womacto01 2006 2 CHN NL 19 50 6 14 ... 2 1 1 4 4 0 0 3 0 0
1 88643 schilcu01 2006 1 BOS AL 31 2 0 1 ... 0 0 0 0 1 0 0 0 0 0
2 88645 myersmi01 2006 1 NYA AL 62 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 88649 helliri01 2006 1 MIL NL 20 3 0 0 ... 0 0 0 0 2 0 0 0 0 0
4 88650 johnsra05 2006 1 NYA AL 33 6 0 1 ... 0 0 0 0 4 0 0 0 0 0

5 rows × 23 columns

Scatterplots are useful for data exploration, where we seek to uncover relationships among variables. There are no scatterplot methods for Series or DataFrame objects in the Pandas version used here; we must instead use the matplotlib function scatter.


In [201]:
fig = plt.scatter(baseball.ab, baseball.h)  # plt.scatter returns the plotted collection
fig.axes.set_xlim(0, 700)                   # its .axes attribute is the Axes it was drawn on
fig.axes.set_ylim(0, 200)


Out[201]:
(0, 200)
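
As a side note, more recent Pandas releases do add a scatter plot kind to DataFrame.plot; a sketch of the equivalent call, assuming one of those versions:

# Equivalent scatterplot through the DataFrame.plot interface (newer pandas)
ax = baseball.plot(kind='scatter', x='ab', y='h')
ax.set_xlim(0, 700)
ax.set_ylim(0, 200)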

We can add additional information to scatterplots by assigning variables to either the size of the symbols or their colors.


In [199]:
fig = plt.scatter(baseball.ab, baseball.h, s=baseball.hr*10, alpha=0.5)  # s for size
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)


Out[199]:
(0, 200)

In [202]:
fig = plt.scatter(baseball.ab, baseball.h, c=baseball.hr, s=40, cmap='hot')  # c for color
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)


Out[202]:
(0, 200)

To view scatterplots of a large number of variables simultaneously, we can use the scatter_matrix function that was recently added to Pandas. It generates a matrix of pair-wise scatterplots, optionally with histograms or kernel density estimates on the diagonal.


In [176]:
pd.scatter_matrix(baseball.loc[:,'r':'sb'], figsize=(12,8), diagonal='kde')


Out[176]:
7×7 array of matplotlib AxesSubplot objects (one panel per pair of columns from 'r' to 'sb')
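
In newer Pandas releases the top-level pd.scatter_matrix function has moved to the pandas.plotting module; a sketch of the same call, assuming such a version:

from pandas.plotting import scatter_matrix
# The same pair-wise scatterplot matrix through the newer import location
scatter_matrix(baseball.loc[:, 'r':'sb'], figsize=(12, 8), diagonal='kde')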

For the next class