In this class you are expected to learn:
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.
In pandas, those operations are carried out by the .groupby() method of DataFrames and Series (the method exists on both, and the meaning is the same). From the documentation, by "group by" we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, we might wish to do one of the following:
Aggregation: compute a summary statistic (or statistics) about each group, such as group sums or means, or group sizes / counts.
Transformation: perform some group-specific computation and return a like-indexed object.
Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples: discard data that belong to groups with only a few members, or filter out data based on the group sum or mean.
Beyond these, .groupby() will examine the results of the apply step and try to return a sensibly combined result if it doesn't fit into either of the above two categories. Since the set of instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name .groupby() should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality then provide some non-trivial examples / use cases.
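As a rough sketch of how that SQL query maps onto pandas (the table and column names below are made up to mirror the SQL above; they are not part of the salaries example that follows):

import pandas as pd

# Hypothetical table mirroring the columns in the SQL query above
some_table = pd.DataFrame({
    "Column1": ["a", "a", "b", "b"],
    "Column2": ["x", "y", "x", "x"],
    "Column3": [1.0, 2.0, 3.0, 4.0],
    "Column4": [10, 20, 30, 40],
})

# SELECT Column1, Column2, mean(Column3), sum(Column4) ... GROUP BY Column1, Column2
some_table.groupby(["Column1", "Column2"]).agg({"Column3": "mean", "Column4": "sum"})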
To illustrate these operations, we will use the City of Chicago salaries CSV.
In [1]:
import numpy as np
import pandas as pd
!head data/city-of-chicago-salaries.csv
In [2]:
chicago = pd.read_csv('data/city-of-chicago-salaries.csv')
chicago.head()
Out[2]:
So far, we have seen how to manipulate data once it is loaded, but pandas also lets us apply some basic transformations at load time, through converters, as well as a way to rename the columns. Since the salaries have a $ symbol preceding the figure, let's create a converter for that column that removes the symbol and converts the string into a float.
In [3]:
headers = ['name', 'title', 'department', 'salary']
chicago = pd.read_csv('data/city-of-chicago-salaries.csv',
header=0,  # skip the file's own header row, since we supply names below
names=headers,
converters={'salary': lambda x: float(x.replace('$', ''))}) # yes, lambda functions!
print("salary", chicago["salary"].dtype)
chicago.head()
Out[3]:
Pandas .groupby()
returns a DataFrameGroupBy
object which has a variety of methods, many of which are similar to standard SQL aggregate functions.
In [4]:
by_dept = chicago.groupby('department')
by_dept
Out[4]:
Calling .count() returns the number of non-null values within each column. If we were interested in the total number of records in each group, we could use .size().
In [5]:
by_dept.count().head(5) # Not null records within each column
Out[5]:
In [6]:
by_dept.size().tail() # total records for each department
Out[6]:
Summation can be done via .sum()
, averaging by .mean()
, etc. (if it's a SQL function, chances are it exists in pandas). Oh, and there's .median()
too, something not available in most databases.
In [7]:
by_dept.sum()[20:25] # total salaries of each department
Out[7]:
In [8]:
by_dept.mean().head(5) # average salary of each department
Out[8]:
In [9]:
by_dept.median()[20:25] # take that, RDBMS!
Out[9]:
Operations can also be done on an individual Series
within a grouped object. Say we were curious about the five departments with the most distinct titles.
In [10]:
by_dept.title.nunique().sort_values(ascending=False)[:5]
Out[10]:
The real power of .groupby()
comes from its split-apply-combine ability.
What if we wanted to see the highest paid employee within each department? Using .groupby()
we can define a function (which we'll call ranker()
) that will label each record from 1 to N, where N is the number of employees within the department. We can then call apply to, well, apply that function to each group (in this case, each department).
In [11]:
def ranker(df):
    """Assigns a rank to each employee based on salary, with 1 being the highest paid.
    Assumes the data is DESC sorted."""
    df['dept_rank'] = np.arange(len(df)) + 1
    return df
In [12]:
chicago.sort_values('salary', ascending=False, inplace=True)
chicago = chicago.groupby('department').apply(ranker)
chicago.head()
Out[12]:
In [13]:
chicago[chicago.dept_rank == 1].head(7)
Out[13]:
In [14]:
chicago[chicago.department == "LAW"][:5]
Out[14]:
We can now see where each employee ranks within their department based on salary.
Regular expressions are a powerful language for matching text patterns. The Python re
module provides regular expression support.
In [15]:
import re
In Python a regular expression search is typically written as:
match = re.search(pat, str)
The re.search()
method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search()
returns a match object or None
otherwise. Therefore, the search is usually immediately followed by an if-statement to test whether the search succeeded. Below, a quick findall() example with phone numbers is shown first, and then an example which searches for the pattern 'word:'
followed by a 3 letter word:
In [16]:
string = """
My phone is 519 456-1234, call me, maybe. Oh., wait, this is my real number 226 456 1234
"""
match = re.findall(r'\d{3}\s\d{3}[\s-]?\d{4}', string)
# findall() returns a list with every match of the pattern
match
Out[16]:
In [17]:
string = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', string)
# If-statement after search() tests if it succeeded
if match:
    print('found', match.group()) ## 'found word:cat'
else:
    print('did not find')
The code match = re.search(pattern, string)
stores the search result in a variable named match
. Then the if-statement tests the match -- if true the search succeeded and match.group()
is the matching text (e.g. 'word:cat'
). Otherwise if the match is false (None
to be more specific), then the search did not succeed, and there is no matching text.
The 'r' at the start of the pattern string designates a Python raw string, which passes backslashes through without change -- very handy for regular expressions. I recommend that you always write pattern strings with the 'r' just as a habit.
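A tiny illustration of the difference (the strings here are made up for this example): without the r prefix, Python itself interprets the backslashes before re ever sees them.

import re

print(len('\tab'))   # 3 -- Python turned '\t' into a single tab character
print(len(r'\tab'))  # 4 -- the backslash and the 't' stay as two separate characters

# Both of these match a tab followed by 'ab', but the raw string is easier to read
print(re.search(r'\tab', 'x\tabc').group())
print(re.search('\\tab', 'x\tabc').group())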
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns, which match single chars:

a, X, 9 -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( )
. (a period) -- matches any single character except newline \n
\w (lowercase w) -- matches a word character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
\b -- boundary between word and non-word
\s (lowercase s) -- matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^ = start, $ = end -- match the start or end of the string
\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as @, you can put a backslash in front of it, \@, to make sure it is treated just as a character.

Joke: what do you call a pig with three eyes? piiig!
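Of these, \b is not demonstrated in the cells that follow, so here is a quick sketch on a made-up string:

import re

# \b matches the empty boundary between a word char and a non-word char,
# so this finds 'cat' only when it stands alone as a word
print(re.findall(r'\bcat\b', 'cat catalog concat cat.'))  # ['cat', 'cat']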
The basic rules of a regular expression search for a pattern within a string are: the search proceeds through the string from start to end, stopping at the first match found; all of the pattern must match, but it may appear anywhere in the string; and on success, match.group() is the matched text.
In [18]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')
print(match) # found, match.group() == "iii"
match = re.search(r'igs', 'piiig')
print(match) # not found, match == None
In [19]:
## . = any char but \n
match = re.search(r'..g', 'piiig')
print(match) # found, match.group() == "iig"
In [20]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')
print(match) # found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!')
print(match) # found, match.group() == "abc"
Things get more interesting when you use + and * to specify repetition in the pattern:

+ -- 1 or more occurrences of the pattern to its left, e.g. i+ = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").
In [21]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')
print(match) # found, match.group() == "piii"
In [22]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')
print(match) # found, match.group() == "ii"
In [23]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx')
print(match) # found, match.group() == "1 2 3"
match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx')
print(match) # found, match.group() == "12 3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print(match) # found, match.group() == "123"
In [24]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')
print(match) # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')
print(match) # found, match.group() == "bar"
Suppose you want to find the email address inside the string 'xyz alice-b@google.com purple monkey'
. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+'
:
In [25]:
string = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', string)
if match:
    print(match.group())
The search does not get the whole email address in this case because the \w
does not match the '-'
or '.'
in the address. We'll fix this using the regular expression features below.
Square brackets can be used to indicate a set of chars, so [abc]
matches a
or b
or c
. The codes \w
, \s
etc. work inside square brackets too with the one exception that dot (.
) just means a literal dot. For the emails problem, the square brackets are an easy way to add .
and -
to the set of chars which can appear around the @
with the pattern r'[\w.-]+@[\w.-]+'
to get the whole email address:
In [26]:
match = re.search(r'[\w.-]+@[\w.-]+', string)
if match:
    print(match.group()) ## 'alice-b@google.com'
You can also use a dash to indicate a range, so [a-z]
matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]
. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
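A small sketch of those three forms on made-up strings (a range, a literal dash placed last, and an inverted set):

import re

print(re.search(r'[a-z]+', 'ABC def').group())       # 'def' -- a run of lowercase letters
print(re.search(r'[abc-]+', 'xx-ab-cc yy').group())  # '-ab-cc' -- dash placed last is a literal dash
print(re.search(r'[^ab]+', 'aabbccdd').group())      # 'ccdd' -- any chars except 'a' or 'b'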
The group feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1)
is the match text corresponding to the 1st left parenthesis, and match.group(2)
is the text corresponding to the 2nd left parenthesis. The plain match.group()
is still the whole match text as usual.
In [27]:
string = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', string)
if match:
    print(match.group()) ## 'alice-b@google.com' (the whole match)
    print(match.group(1)) ## 'alice-b' (the username, group 1)
    print(match.group(2)) ## 'google.com' (the host, group 2)
A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.
findall()
is probably the single most powerful function in the re module. Above we used re.search()
to find the first match for a pattern. findall()
finds all the matches and returns them as a list of strings, with each string representing one match.
In [28]:
## Suppose we have a text with many email addresses
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', string) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)
The parenthesis (
)
group mechanism can be combined with findall()
. If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall()
returns a list of tuples. Each tuple represents one match of the pattern, and inside the tuple is the group(1)
, group(2)
data. So if 2 parenthesis groups are added to the email pattern, then findall()
returns a list of tuples, each length 2 containing the username and host, e.g. ('alice'
, 'google.com'
).
In [29]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', string)
print(tuples) ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in tuples:
    print(pair[0]) ## username
    print(pair[1]) ## host
Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parenthesis, then findall()
returns a list of found strings as in earlier examples. If the pattern includes a single set of parenthesis, then findall()
returns a list of strings corresponding to that single group. (Obscure optional feature: sometimes you have paren ( ) groupings in the pattern which you do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ), and that left paren will not count as a group result.)
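A brief sketch of both behaviours, reusing the email string from above: with a single capturing group findall() returns just that group, and a (?: ) group is left out of the results.

import re

string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
# One capturing group: findall() returns only the usernames
print(re.findall(r'([\w\.-]+)@[\w\.-]+', string))      # ['alice', 'bob']
# Non-capturing username group: only the hosts are returned
print(re.findall(r'(?:[\w\.-]+)@([\w\.-]+)', string))  # ['google.com', 'abc.com']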
Regular expression patterns pack a lot of meaning into just a few characters, but they are so dense that you can spend a lot of time debugging your patterns. Set up your runtime so you can run a pattern and print what it matches easily, for example by running it on a small test text and printing the result of findall()
. If the pattern matches nothing, try weakening the pattern, removing parts of it so you get too many matches. When it's matching nothing, you can't make any progress since there's nothing concrete to look at. Once it's matching too much, then you can work on tightening it up incrementally to hit just what you want.
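As a sketch of that workflow (the pattern and test text below are invented for illustration): start with a pattern that is too strict, weaken it until it matches something, then tighten it back up.

import re

test = 'id: AB-1234, id: CD-5678'
print(re.findall(r'[A-Z]{2}-\d{4};', test))  # too strict (stray ';'): matches nothing
print(re.findall(r'[A-Z]+-\d+', test))       # weakened: ['AB-1234', 'CD-5678']
print(re.findall(r'[A-Z]{2}-\d{4}', test))   # tightened again: ['AB-1234', 'CD-5678']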
The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search()
or findall()
etc., e.g. re.search(pat, str, re.IGNORECASE)
.
IGNORECASE -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
DOTALL -- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*.
MULTILINE -- within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
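A short sketch of the three flags on made-up strings:

import re

print(re.search(r'python', 'I love PYTHON', re.IGNORECASE).group())       # 'PYTHON'
print(re.search(r'start.*end', 'start\nmiddle\nend', re.DOTALL).group())  # spans the newlines
print(re.findall(r'^\w+', 'one\ntwo\nthree', re.MULTILINE))               # ['one', 'two', 'three']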
Greedy vs. Non-Greedy (optional)
This is an optional section which shows a more advanced regular expression technique not needed for the exercises.
Suppose you have text with tags in it: <b>foo</b>
and <i>so on</i>
Suppose you are trying to match each tag with the pattern '(<.*>)'
-- what does it match first?
The result is a little surprising, but the greedy aspect of the .*
causes it to match the whole '<b>foo</b>
and <i>so on</i>'
as one big match. The problem is that the .*
goes as far as it can, instead of stopping at the first >
(aka it is "greedy").
There is an extension to regular expression where you add a ?
at the end, such as .*?
or .+?
, changing them to be non-greedy. Now they stop as soon as they can. So the pattern '(<.*?>)'
will get just '<b>'
as the first match, and '</b>'
as the second match, and so on getting each <..>
pair in turn. The style is typically that you use a .*?
, and then immediately to its right look for some concrete marker (>
in this case) that forces the end of the .*?
run.
The *?
extension originated in Perl, and regular expressions that include Perl's extensions are known as Perl Compatible Regular Expressions -- pcre. Python includes pcre support. Many command line utils etc. have a flag where they accept pcre patterns.
An older but widely used technique to code this idea of "all of these chars except stopping at X" uses the square-bracket style. For the above you could write the pattern, but instead of .*
to get all the chars, use [^>]*
which skips over all characters which are not >
(the leading ^
"inverts" the square bracket set, so it matches any char not in the brackets).
The re.sub(pat, replacement, str)
function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1'
, '\2'
which refer to the text from group(1)
, group(2)
, and so on from the original matching text.
Here's an example which searches for all the email addresses, and changes them to keep the user (\1
) but have yo-yo-dyne.com
as the host.
In [30]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, string) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', string))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
Activity
Given the `arts` data frame, try to do the following:
In [31]:
string = "©1977-1979, CARCC"
finds = re.findall(r"\d+", string)
first_find = finds[0]  # take the first year that appears
int(first_find)
Out[31]:
In [36]:
arts = pd.read_csv("data/arts.csv")
arts.head()
Out[36]:
In [37]:
arts["execution_date"] = arts["execution_date"].str.findall(r"(\d+)").str[0]
arts["execution_date"].head()
Out[37]:
In [38]:
# 1. If a year is lower than 100, it refers to the 1900s. For example, 78 is actually 1978, and that needs to be fixed too.
arts["execution_date"] = arts["execution_date"].astype(float)
arts["execution_date"] = arts["execution_date"].apply(lambda value: 1900 + value if value < 100 else value)
arts.head()
Out[38]:
In [39]:
# 2. Get the average execution year per artist.
arts.groupby("artist_name").mean().head()
Out[39]:
In [40]:
# 3. Get the average execution year per category.
arts.groupby("category")[["execution_date"]].mean()
Out[40]:
In [41]:
# 4. Get the number of artworks per artist. Which artist is the most prolific?
artworks_by_artist = arts.groupby("artist_name")[["title"]].aggregate(np.count_nonzero)
artworks_by_artist.sort_values("title", ascending=False).head()
Out[41]:
In [42]:
# 5. Get the number of artworks per category. Which category has the highest number?
artworks_by_category = arts.groupby("category")[["title"]].aggregate(np.count_nonzero)
artworks_by_category.sort_values("title", ascending=False).head()
Out[42]:
In [43]:
arts['title_len'] = arts['title'].apply(len)
arts.head()
Out[43]:
In [44]:
# 6. Get the average length of artworks titles per category and artist.
# reuse the title_len column computed in the previous cell
length_by_category = arts.groupby("category")[["title_len"]].aggregate(np.mean)
length_by_category.sort_values("title_len")
Out[44]:
In [45]:
# 7. Get the year with the highest production.
artworks_by_year = arts.groupby("execution_date")[["title"]].aggregate(np.count_nonzero)
artworks_by_year.sort_values("title", ascending=False).head(3)
Out[45]:
In [46]:
# 8. Get the approximate period of production for each artist. If an artist painted from 1970 to 1990, the period is 20.
period_min = arts.groupby("artist_name")[["execution_date"]].aggregate(np.min)
period_max = arts.groupby("artist_name")[["execution_date"]].aggregate(np.max)
(period_max - period_min + 1).sort_values("execution_date", ascending=False).head(5)
Out[46]:
In [47]:
arts.to_csv("data/arts_clean.csv")
Once we know the concept behind .groupby(), the next natural step is to think about doing a lot of groupings so we can see the data aggregated in many different ways. And this is precisely the problem that pivot tables are designed to solve.
Let's consider a new dataset, books (adapted from here), from a publisher, with the books in their catalogue and the number of copies of each, and say that we want to know things like how many copies are already published and how many are still in the pipeline.
Let's set up our environment first and load the dataset.
In [2]:
import pandas as pd
import numpy as np
books = pd.read_excel("data/books-funnel.xlsx")
books.head()
Out[2]:
In [8]:
# How many copies are already published?
books[books["Status"] == "published"]["Copies"].sum()
Out[8]:
In [16]:
# How many copies are in the pipeline?
books[books["Status"].isin(["In-print", "pending"])]["Copies"].sum()
Out[16]:
In [18]:
books.groupby(["Book", "Status"])[["Copies"]].aggregate(np.sum)
Out[18]:
As we build up the pivot table, it's easier to take it one step at a time. Add items and check each step to verify you are getting the results you expect. Don't be afraid to play with the order and the variables to see what presentation makes the most sense for your needs.
The simplest pivot table must have a dataframe and an index. In this case, let's use the book title Book
as our index.
In [91]:
pd.pivot_table(books, index=["Book"])
Out[91]:
You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.
In [19]:
pd.pivot_table(books, index=["Book", "Author", "Editor"])
Out[19]:
This is interesting but not particularly useful. What we probably want to do is look at this by Editor
and Author
. It’s easy enough to do by changing the index.
In [21]:
pd.pivot_table(books, index=["Editor", "Author"])
Out[21]:
You can see that the pivot table is smart enough to start aggregating the data and summarizing it by grouping the authors with their editors. Now we start to get a glimpse of what a pivot table can do for us.
For this purpose, the Identifier and Price columns aren't really useful. Let's remove them by explicitly defining the columns we care about using the values field.
In [22]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"])
Out[22]:
The Copies column is automatically averaged, but we can do a count or a sum instead. Adding them is simple using aggfunc and np.sum.
In [23]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"], aggfunc=np.sum)
Out[23]:
aggfunc can take a list of functions. Let's try a sum using the numpy sum function and len to get a count.
In [98]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"], aggfunc=[np.sum, len])
Out[98]:
If we want to see copies broken down by the edition, the columns variable allows us to define one or more columns.
One of the confusing points with the pivot_table is the use of columns and values. Remember, columns are optional - they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.
In [99]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"],
columns=["Edition"], aggfunc=[np.sum])
Out[99]:
The NaN
’s are a bit distracting. If we want to remove them, we could use fill_value
to set them to 0.
In [24]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"],
columns=["Edition"], aggfunc=[np.sum], fill_value=0)
Out[24]:
It would be useful to add the average price as well. Add Price
to the values list.
In [26]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies", "Price"],
columns=["Edition"], aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0)
Out[26]:
What's interesting is that you can move items to the index to get a different visual representation. Remove Edition
from the columns and add to the index.
In [27]:
pd.pivot_table(books, index=["Editor", "Author", "Edition"], values=["Copies", "Price"],
aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0)
Out[27]:
For this data set, this representation makes more sense. Now, what if I want to see some totals? margins=True
does that for us. Also, a really handy feature is the ability to pass a dictionary to the aggfunc so you can perform different functions on each of the values you select. This has a side-effect of making the labels a little cleaner.
In [29]:
pd.pivot_table(books, index=["Editor", "Author", "Edition"], values=["Copies", "Price"],
aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0, margins=True)
Out[29]:
Let’s move the analysis up a level and look at our pipeline at the editor level.
In [30]:
pd.pivot_table(books, index=["Editor", "Status"], values=["Copies"],
aggfunc=[np.sum], fill_value=0, margins=True)
Out[30]:
You can provide a list of aggregation functions to apply to each value too:
In [31]:
table = pd.pivot_table(books, index=["Editor", "Status"], columns=["Edition"], values=["Copies", "Price"],
aggfunc={"Copies": np.sum, "Price":[np.sum, np.mean]}, fill_value=0)
table
Out[31]:
It can look daunting to try to pull this all together at one time, but as soon as you start playing with the data and slowly adding items, you can get a feel for how it works. As a general rule of thumb, once you find yourself writing multiple groupby() calls to look at the same data from different angles, you should evaluate whether a pivot table is a more useful approach.
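As a rough illustration of that rule of thumb with the books data used above, these two lines produce essentially the same summary (this is just a sketch; both forms appear in the cells above):

books.groupby(["Editor", "Status"])[["Copies"]].sum()
pd.pivot_table(books, index=["Editor", "Status"], values=["Copies"], aggfunc=np.sum)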
Once you have generated your data, it is in a DataFrame
so you can filter on it using your standard DataFrame
functions.
If you want to look at just one editor:
In [32]:
table.query('Editor == ["Debra Henley"]') # you might need to pip install numexpr
Out[32]:
We can look at all of our pending and in-print books.
In [33]:
table.query('Status == ["pending", "in-print"]')
Out[33]:
This is a powerful feature of the pivot_table, so do not forget that you have the full power of pandas once you get your data into the pivot_table format you need.
Sometimes, the only operation you want to do in a pivot table is to count. In those cases, when frequency tables with multiple indices are required, a cross-tabulated version of the data might come in handy. In pandas, that kind of table has its own function, so you don't have to call pivot_table with all the parameters.
In [34]:
df = pd.DataFrame({'a' : np.random.randint(0, 2, size=20),
'b' : np.random.randint(0, 5, size=20),
'c' : np.random.randint(0, 3, size=20)})
df
Out[34]:
In [35]:
pd.crosstab(df['b'], [df['a'], df['c']])
Out[35]:
Activity
Given the `arts` data frame, create a pivot table by splitting the table into categories per year, and counting the number of artworks for each. Add marginal results and fill the `NaN` values.
In [57]:
pd.pivot_table(arts, index=["execution_date"], columns=["category"],
values=["title"], aggfunc={"title": np.count_nonzero},
fill_value=0, margins=True)
Out[57]:
There are a handful of third-party Python packages that are suitable for creating scientific plots and visualizations. Here, we will focus exclusively on matplotlib and the high-level plotting available within pandas; it is currently the most robust and feature-rich package available.
We require plots, charts and other statistical graphics for the written communication of quantitative ideas. They allow us to more easily convey relationships and reveal deviations from patterns.
Gelman and Unwin 2011.
A well-designed graph can display more information than a table of the same size, and more information than numbers embedded in text. Graphical displays allow and encourage direct visual comparisons.
The easiest way to interact with matplotlib used to be via pylab in IPython: by starting IPython (or the IPython notebook) in "pylab mode", both matplotlib and numpy were pre-loaded into the session. That approach is now deprecated, and the current way is simply to use the %matplotlib magic. For loading numpy we just use the regular imports.
In [59]:
%matplotlib inline
You can specify a custom graphical backend (e.g. qt, gtk, osx), but IPython generally does a good job of auto-selecting one. Now matplotlib is ready to go, and you can access the matplotlib API via plt, imported with the following convention:
In [60]:
import matplotlib.pyplot as plt
plt.plot(np.random.normal(size=100), np.random.normal(size=100), 'ro')
Out[60]:
In [61]:
# create a figure using the matlab-like interface
x = np.linspace(0, 10, 1000)
plt.subplot(2, 1, 1) # 2x1 grid, first plot
plt.plot(x, np.sin(x))
plt.title('Trig is easy.')
plt.subplot(2, 1, 2) # 2x1 grid, second plot
plt.plot(x, np.cos(x))
plt.xlabel('x')
Out[61]:
The other interface is an object-oriented interface, where we explicitly pass around references to the plot elements we want to work with:
In [65]:
# create the same figure using the object-oriented interface
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(x, np.sin(x))
ax1.set_title("Trig iss asdasd asd easy")
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(x, np.cos(x))
ax2.set_xlabel('x')
ax2.set_ylabel("Hello!")
Out[65]:
These two interfaces are convenient for different circumstances. I find that for doing quick, simple plots, the scripted interface is often easiest. On the other hand, when I want more sophisticated plots, the object-oriented interface is simpler and more powerful; in fact, the scripted interface has several distinct limitations.
It's good practice to use the object-oriented interface. That is, you should get in the habit of working with explicit figure and axes objects rather than relying on the implicit plt state.
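A minimal sketch of that habit, with made-up data: create the figure and axes explicitly, then call methods on the axes object.

fig, ax = plt.subplots(figsize=(6, 4))
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x))          # plot on the axes object, not via plt.plot()
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.set_title('Object-oriented sketch')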
There are four main ways to create multi-panel plots in matplotlib. From lowest to highest-level they are (roughly):
fig.add_axes()
Useful for creating inset plots. The add_axes method allows you to create an axes instance by specifying the size relative to the figure edges.
The argument is [left
, bottom
, width
, height
] which specifies the axes extent in fractions of the figure size (i.e. between 0 and 1):
In [162]:
fig = plt.figure()
main_ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
In [163]:
inset_ax = fig.add_axes([0.6, 0.6, 0.25, 0.25])
fig
Out[163]:
In [164]:
main_ax.plot(np.random.rand(100), color='gray')
inset_ax.plot(np.random.rand(20), color='black')
fig
Out[164]:
Useful for simple multi-panel plots. If you're trying to create multiple axes in a grid, you might use .add_axes()
repeatedly, but calculating the extent for each axes is not trivial. The .add_subplot()
method can streamline this.
The arguments are of the form N_vertical
, N_horizontal
, Plot_number
, and the indices start at 1:
In [165]:
fig = plt.figure()
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.45, 0.45, str(i), fontsize=24)
If you desire, you can adjust the spacing using fig.subplots_adjust()
, with units relative to the figure size (i.e. between 0 and 1)
In [166]:
fig.subplots_adjust(left=0.1, right=0.9,
bottom=0.1, top=0.9,
hspace=0.4, wspace=0.4)
fig
Out[166]:
plt.subplots()
Convenience function to create multiple subplots.
Because creating a full grid of subplots is such a common task, matplotlib recently added the plt.subplots()
command which creates the figure and axes in one go.
The arguments are N_vertical
, N_horizontal
, and the axes are returned within a Numpy array:
In [167]:
fig, ax = plt.subplots(2, 3)
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.2, 0.45, str((i, j)), fontsize=20)
In [168]:
print(type(ax))
print(ax.shape)
print(ax.dtype)
An additional nice piece of this routine is the ability to specify that the subplots have a shared x or y axis: this ties together the axis limits and removes redundant tick labels:
In [169]:
fig, ax = plt.subplots(2, 3, sharex=True, sharey=True)
x = np.linspace(0, 10, 1000)
for i in range(2):
    for j in range(3):
        ax[i, j].plot(x, (j + 1) * np.sin((i + 1) * x))
In [170]:
gs = plt.GridSpec(3, 3) # a 3x3 grid
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.add_subplot(gs[:, :])
fig.add_subplot(gs[1, 1])
Out[170]:
In [171]:
gs = plt.GridSpec(3, 3, wspace=0.4, hspace=0.4)
fig = plt.figure(figsize=(6, 6))
fig.add_subplot(gs[1, :2])
fig.add_subplot(gs[0, :2])
fig.add_subplot(gs[2, 0])
fig.add_subplot(gs[:2, 2])
fig.add_subplot(gs[2, 1:])
Out[171]:
Check out the documentation of plt.GridSpec
for information on adjusting the subplot parameters. The keyword arguments are similar to those of plt.subplots_adjust()
.
In [73]:
normals = pd.Series(np.random.normal(size=10))
normals.plot()
Out[73]:
Notice that by default a line plot is drawn, and a light grid is included. All of this can be changed, however:
In [77]:
normals.cumsum().plot(grid=False)
Out[77]:
Similarly, for a DataFrame
:
In [78]:
variables = pd.DataFrame({'normal': np.random.normal(size=100),
'gamma': np.random.gamma(1, size=100),
'poisson': np.random.poisson(size=100)})
variables.cumsum(0).plot()
Out[78]:
As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for plot:
In [80]:
variables.cumsum(0).plot(subplots=True)
Out[80]:
Or, we may want to have some series displayed on the secondary y-axis, which can allow for greater detail and less empty space:
In [81]:
variables.cumsum(0).plot(secondary_y='normal')
Out[81]:
Of course, each plot returned is an object with some methods and attributes to modify the aspect of your chart.
In [82]:
ax = variables.cumsum(0).plot()
ax.set_xlabel('Radius')
ax.set_ylabel('Area')
ax.set_title('Area of a Circle')
Out[82]:
If we would like a little more control, we can use matplotlib
's subplots function directly, and manually assign plots to its axes:
In [83]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i, var in enumerate(['normal', 'gamma', 'poisson']):
    variables[var].cumsum(0).plot(ax=axes[i], title=var)
axes[0].set_ylabel('cumulative sum')
Out[83]:
In [84]:
arts.groupby('category')[["title"]].count().plot(kind='bar')
Out[84]:
In [85]:
books.groupby(['Editor', 'Author'])[["Copies"]].aggregate(np.sum).plot(kind='barh')
Out[85]:
In [86]:
counts = pd.crosstab(books["Edition"], books["Status"])
counts.plot(kind='bar', stacked=True, grid=False)
Out[86]:
Activity
Given the `arts` data frame, plot a chart with the number of artworks per year. Set the title and the names of the axes. Also hide the legend and the grid.
In [119]:
grouped = arts.groupby("execution_date")[["title"]].count()
ax = grouped.plot()
ax.set_title("Artworks per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Number of Artworks")
ax.set_xlim([1950, 2000])
Out[119]:
Frequently it is useful to look at the distribution of data before you analyze it. Histograms are a sort of bar graph that displays relative frequencies of data values; hence, the y-axis is always some measure of frequency. This can either be raw counts of values or scaled proportions.
For example, we might want to see how the artworks are distributed across time:
In [123]:
arts["execution_date"].hist(grid=False)
Out[123]:
The hist method puts the continuous year values into bins, trying to make a sensible decision about how many bins to use (or, equivalently, how wide the bins are). We can override the default value (10):
In [120]:
arts["execution_date"].hist(grid=False, bins=30)
Out[120]:
There are algorithms for determining an "optimal" number of bins, each of which varies somehow with the number of observations in the data series.
In [121]:
sturges = lambda n: int(np.log2(n) + 1)
square_root = lambda n: int(np.sqrt(n))
from scipy.stats import kurtosis
doanes = lambda data: int(1 + np.log(len(data)) + np.log(1 + kurtosis(data) * (len(data) / 6.) ** 0.5))
n = len(arts["execution_date"])
sturges(n), square_root(n), doanes(arts["execution_date"].dropna())
Out[121]:
In [122]:
arts["execution_date"].hist(bins=doanes(arts["execution_date"].dropna()))
Out[122]:
A density plot is similar to a histogram in that it describes the distribution of the underlying data, but rather than being a pure empirical representation, it is an estimate of the underlying "true" distribution. As a result, it is smoothed into a continuous line plot. We create them in Pandas using the plot method with kind='kde'
, where kde
stands for kernel density estimate.
In [178]:
arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020))
Out[178]:
We can even make it look funny by using the XKCD style!
In [179]:
with plt.xkcd():
    arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020))
Often, histograms and density plots are shown together:
In [136]:
arts["execution_date"].hist(bins=doanes(arts["execution_date"].dropna()), normed=True, color='lightseagreen')
arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020), style='r--')
Out[136]:
Here, we had to normalize the histogram (density=True), since the kernel density is normalized by definition (it is a probability distribution).
In [138]:
arts.boxplot(column="execution_date", by="category", grid=False)
Out[138]:
You can think of the box plot as viewing the distribution from above. The blue crosses are "outlier" points that occur outside the extreme quantiles.
One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.
In [143]:
bp = arts.boxplot(column="execution_date", by="category", grid=False)
for i, index in enumerate(arts["category"].unique()):
    y = arts["execution_date"][arts["category"] == index].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)
When data are dense, a couple of tricks used above help the visualization: lowering the alpha level so that overlapping points remain visible, and adding random "jitter" along the x-axis so that the points do not all fall on the same vertical line.
To look at how Pandas does scatterplots, let's load the baseball
sample dataset.
In [144]:
baseball = pd.read_csv("data/baseball.csv")
baseball.head()
Out[144]:
Scatterplots are useful for data exploration, where we seek to uncover relationships among variables. There are no scatterplot methods for Series or DataFrame objects; we must instead use the matplotlib function scatter.
In [201]:
fig = plt.scatter(baseball.ab, baseball.h)
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)
Out[201]:
We can add additional information to scatterplots by assigning variables to either the size of the symbols or their colors.
In [199]:
fig = plt.scatter(baseball.ab, baseball.h, s=baseball.hr*10, alpha=0.5) # s for size
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)
Out[199]:
In [202]:
fig = plt.scatter(baseball.ab, baseball.h, c=baseball.hr, s=40, cmap='hot') # c for color
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)
Out[202]:
To view scatterplots of a large number of variables simultaneously, we can use the scatter_matrix function that was recently added to Pandas. It generates a matrix of pair-wise scatterplots, optionally with histograms or kernel density estimates on the diagonal.
In [176]:
pd.plotting.scatter_matrix(baseball.loc[:, 'r':'sb'], figsize=(12, 8), diagonal='kde')
Out[176]: