In this class you are expected to learn:
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.
In pandas, those operations are carried out by the .groupby() method of DataFrames and Series (the method exists on both, and the meaning is the same). From the documentation, by "group by" we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, we might wish to do one of the following:
Aggregation: compute a summary statistic (or statistics) about each group, such as group sums or means, or group sizes / counts.
Transformation: perform some group-specific computation and return a like-indexed object.
Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples: discard data that belong to groups with only a few members, or filter out data based on the group sum or mean.
Beyond these, .groupby() will examine the results of the apply step and try to return a sensibly combined result if it doesn't fit into either of the above two categories. Since the set of instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name .groupby() should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality then provide some non-trivial examples / use cases.
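As a rough sketch of how that SQL query maps onto pandas (the table and column names below are made up to mirror the SQL above; they are not part of the salaries example that follows):

import pandas as pd

# Hypothetical table mirroring the columns in the SQL query above
some_table = pd.DataFrame({
    "Column1": ["a", "a", "b", "b"],
    "Column2": ["x", "y", "x", "x"],
    "Column3": [1.0, 2.0, 3.0, 4.0],
    "Column4": [10, 20, 30, 40],
})

# SELECT Column1, Column2, mean(Column3), sum(Column4) ... GROUP BY Column1, Column2
some_table.groupby(["Column1", "Column2"]).agg({"Column3": "mean", "Column4": "sum"})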
To illustrate these operations, we will use the City of Chicago salaries CSV.
In [1]:
import numpy as np
import pandas as pd
!head data/city-of-chicago-salaries.csv
In [2]:
chicago = pd.read_csv('data/city-of-chicago-salaries.csv')
chicago.head()
Out[2]:
So far, we have seen how to manipulate data once it is loaded, but pandas also lets us apply some basic transformations at load time, through converters, as well as a way to rename the columns. Since the salaries have a $ symbol preceding the figure, let's create a converter for that column that removes the symbol and converts the string into a float.
In [3]:
headers = ['name', 'title', 'department', 'salary']
chicago = pd.read_csv('data/city-of-chicago-salaries.csv',
header=0,  # skip the file's own header row, since we supply names below
names=headers,
converters={'salary': lambda x: float(x.replace('$', ''))}) # yes, lambda functions!
print("salary", chicago["salary"].dtype)
chicago.head()
Out[3]:
Pandas .groupby()
returns a DataFrameGroupBy
object which has a variety of methods, many of which are similar to standard SQL aggregate functions.
In [4]:
by_dept = chicago.groupby('department')
by_dept
Out[4]:
Calling .count() returns the number of non-null values within each column. If we were interested in the total number of records in each group, we could use .size().
In [5]:
by_dept.count().head(5) # Not null records within each column
Out[5]:
In [6]:
by_dept.size().tail() # total records for each department
Out[6]:
Summation can be done via .sum()
, averaging by .mean()
, etc. (if it's a SQL function, chances are it exists in pandas). Oh, and there's .median()
too, something not available in most databases.
In [7]:
by_dept.sum()[20:25] # total salaries of each department
Out[7]:
In [8]:
by_dept.mean().head(5) # average salary of each department
Out[8]:
In [9]:
by_dept.median()[20:25] # take that, RDBMS!
Out[9]:
Operations can also be done on an individual Series
within a grouped object. Say we were curious about the five departments with the most distinct titles.
In [10]:
by_dept.title.nunique().sort_values(ascending=False)[:5]
Out[10]:
The real power of .groupby()
comes from its split-apply-combine ability.
What if we wanted to see the highest paid employee within each department? Using .groupby()
we can define a function (which we'll call ranker()
) that will label each record from 1 to N, where N is the number of employees within the department. We can then call apply to, well, apply that function to each group (in this case, each department).
In [11]:
def ranker(df):
    """Assigns a rank to each employee based on salary, with 1 being the highest paid.
    Assumes the data is DESC sorted."""
    df['dept_rank'] = np.arange(len(df)) + 1
    return df
In [12]:
chicago.sort_values('salary', ascending=False, inplace=True)
chicago = chicago.groupby('department').apply(ranker)
chicago.head()
Out[12]:
In [13]:
chicago[chicago.dept_rank == 1].head(7)
Out[13]:
In [14]:
chicago[chicago.department == "LAW"][:5]
Out[14]:
We can now see where each employee ranks within their department based on salary.
Regular expressions are a powerful language for matching text patterns. The Python re
module provides regular expression support.
In [15]:
import re
In Python a regular expression search is typically written as:
match = re.search(pat, str)
The re.search()
method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search()
returns a match object or None
otherwise. Therefore, the search is usually immediately followed by an if-statement to test whether the search succeeded. Below, a quick findall() example with phone numbers is shown first, and then an example which searches for the pattern 'word:'
followed by a 3 letter word:
In [16]:
string = """
My phone is 519 456-1234, call me, maybe. Oh., wait, this is my real number 226 456 1234
"""
match = re.findall(r'\d{3}\s\d{3}[\s-]?\d{4}', string)
# findall() returns a list with every match of the pattern
match
Out[16]:
In [17]:
string = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', string)
# If-statement after search() tests if it succeeded
if match:
    print('found', match.group()) ## 'found word:cat'
else:
    print('did not find')
The code match = re.search(pattern, string)
stores the search result in a variable named match
. Then the if-statement tests the match -- if true the search succeeded and match.group()
is the matching text (e.g. 'word:cat'
). Otherwise if the match is false (None
to be more specific), then the search did not succeed, and there is no matching text.
The 'r' at the start of the pattern string designates a Python raw string, which passes backslashes through without change -- very handy for regular expressions. I recommend that you always write pattern strings with the 'r' just as a habit.
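A tiny illustration of the difference (the strings here are made up for this example): without the r prefix, Python itself interprets the backslashes before re ever sees them.

import re

print(len('\tab'))   # 3 -- Python turned '\t' into a single tab character
print(len(r'\tab'))  # 4 -- the backslash and the 't' stay as two separate characters

# Both of these match a tab followed by 'ab', but the raw string is easier to read
print(re.search(r'\tab', 'x\tabc').group())
print(re.search('\\tab', 'x\tabc').group())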
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns, which match single chars:

a, X, 9 -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( )
. (a period) -- matches any single character except newline \n
\w (lowercase w) -- matches a word character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
\b -- boundary between word and non-word
\s (lowercase s) -- matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^ = start, $ = end -- match the start or end of the string
\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as @, you can put a backslash in front of it, \@, to make sure it is treated just as a character.

Joke: what do you call a pig with three eyes? piiig!
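Of these, \b is not demonstrated in the cells that follow, so here is a quick sketch on a made-up string:

import re

# \b matches the empty boundary between a word char and a non-word char,
# so this finds 'cat' only when it stands alone as a word
print(re.findall(r'\bcat\b', 'cat catalog concat cat.'))  # ['cat', 'cat']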
The basic rules of a regular expression search for a pattern within a string are: the search proceeds through the string from start to end, stopping at the first match found; all of the pattern must match, but it may appear anywhere in the string; and on success, match.group() is the matched text.
In [18]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')
print(match) # found, match.group() == "iii"
match = re.search(r'igs', 'piiig')
print(match) # not found, match == None
In [19]:
## . = any char but \n
match = re.search(r'..g', 'piiig')
print(match) # found, match.group() == "iig"
In [20]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')
print(match) # found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!')
print(match) # found, match.group() == "abc"
Things get more interesting when you use + and * to specify repetition in the pattern:

+ -- 1 or more occurrences of the pattern to its left, e.g. i+ = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").
In [21]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')
print(match) # found, match.group() == "piii"
In [22]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')
print(match) # found, match.group() == "ii"
In [23]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx')
print(match) # found, match.group() == "1 2 3"
match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx')
print(match) # found, match.group() == "12 3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print(match) # found, match.group() == "123"
In [24]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')
print(match) # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')
print(match) # found, match.group() == "bar"
Suppose you want to find the email address inside the string 'xyz alice-b@google.com purple monkey'
. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+'
:
In [25]:
string = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', string)
if match:
    print(match.group())
The search does not get the whole email address in this case because the \w
does not match the '-'
or '.'
in the address. We'll fix this using the regular expression features below.
Square brackets can be used to indicate a set of chars, so [abc]
matches a
or b
or c
. The codes \w
, \s
etc. work inside square brackets too with the one exception that dot (.
) just means a literal dot. For the emails problem, the square brackets are an easy way to add .
and -
to the set of chars which can appear around the @
with the pattern r'[\w.-]+@[\w.-]+'
to get the whole email address:
In [26]:
match = re.search(r'[\w.-]+@[\w.-]+', string)
if match:
    print(match.group()) ## 'alice-b@google.com'
You can also use a dash to indicate a range, so [a-z]
matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]
. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
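A small sketch of those three forms on made-up strings (a range, a literal dash placed last, and an inverted set):

import re

print(re.search(r'[a-z]+', 'ABC def').group())       # 'def' -- a run of lowercase letters
print(re.search(r'[abc-]+', 'xx-ab-cc yy').group())  # '-ab-cc' -- dash placed last is a literal dash
print(re.search(r'[^ab]+', 'aabbccdd').group())      # 'ccdd' -- any chars except 'a' or 'b'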
The group feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1)
is the match text corresponding to the 1st left parenthesis, and match.group(2)
is the text corresponding to the 2nd left parenthesis. The plain match.group()
is still the whole match text as usual.
In [27]:
string = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', string)
if match:
    print(match.group()) ## 'alice-b@google.com' (the whole match)
    print(match.group(1)) ## 'alice-b' (the username, group 1)
    print(match.group(2)) ## 'google.com' (the host, group 2)
A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.
findall()
is probably the single most powerful function in the re module. Above we used re.search()
to find the first match for a pattern. findall()
finds all the matches and returns them as a list of strings, with each string representing one match.
In [28]:
## Suppose we have a text with many email addresses
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', string) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)
The parenthesis (
)
group mechanism can be combined with findall()
. If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall()
returns a list of tuples. Each tuple represents one match of the pattern, and inside the tuple is the group(1)
, group(2)
data. So if 2 parenthesis groups are added to the email pattern, then findall()
returns a list of tuples, each length 2 containing the username and host, e.g. ('alice'
, 'google.com'
).
In [29]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', string)
print(tuples) ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in tuples:
    print(pair[0]) ## username
    print(pair[1]) ## host
Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parenthesis, then findall()
returns a list of found strings as in earlier examples. If the pattern includes a single set of parenthesis, then findall()
returns a list of strings corresponding to that single group. (Obscure optional feature: sometimes you have paren ( ) groupings in the pattern which you do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ), and that left paren will not count as a group result.)
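A brief sketch of both behaviours, reusing the email string from above: with a single capturing group findall() returns just that group, and a (?: ) group is left out of the results.

import re

string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
# One capturing group: findall() returns only the usernames
print(re.findall(r'([\w\.-]+)@[\w\.-]+', string))      # ['alice', 'bob']
# Non-capturing username group: only the hosts are returned
print(re.findall(r'(?:[\w\.-]+)@([\w\.-]+)', string))  # ['google.com', 'abc.com']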
Regular expression patterns pack a lot of meaning into just a few characters, but they are so dense that you can spend a lot of time debugging your patterns. Set up your runtime so you can run a pattern and print what it matches easily, for example by running it on a small test text and printing the result of findall()
. If the pattern matches nothing, try weakening the pattern, removing parts of it so you get too many matches. When it's matching nothing, you can't make any progress since there's nothing concrete to look at. Once it's matching too much, then you can work on tightening it up incrementally to hit just what you want.
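As a sketch of that workflow (the pattern and test text below are invented for illustration): start with a pattern that is too strict, weaken it until it matches something, then tighten it back up.

import re

test = 'id: AB-1234, id: CD-5678'
print(re.findall(r'[A-Z]{2}-\d{4};', test))  # too strict (stray ';'): matches nothing
print(re.findall(r'[A-Z]+-\d+', test))       # weakened: ['AB-1234', 'CD-5678']
print(re.findall(r'[A-Z]{2}-\d{4}', test))   # tightened again: ['AB-1234', 'CD-5678']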
The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search()
or findall()
etc., e.g. re.search(pat, str, re.IGNORECASE)
.
IGNORECASE -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
DOTALL -- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*.
MULTILINE -- within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
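A short sketch of the three flags on made-up strings:

import re

print(re.search(r'python', 'I love PYTHON', re.IGNORECASE).group())       # 'PYTHON'
print(re.search(r'start.*end', 'start\nmiddle\nend', re.DOTALL).group())  # spans the newlines
print(re.findall(r'^\w+', 'one\ntwo\nthree', re.MULTILINE))               # ['one', 'two', 'three']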
Greedy vs. Non-Greedy (optional)
This is an optional section which shows a more advanced regular expression technique not needed for the exercises.
Suppose you have text with tags in it: <b>foo</b>
and <i>so on</i>
Suppose you are trying to match each tag with the pattern '(<.*>)'
-- what does it match first?
The result is a little surprising, but the greedy aspect of the .*
causes it to match the whole '<b>foo</b>
and <i>so on</i>'
as one big match. The problem is that the .*
goes as far as it can, instead of stopping at the first >
(aka it is "greedy").
There is an extension to regular expression where you add a ?
at the end, such as .*?
or .+?
, changing them to be non-greedy. Now they stop as soon as they can. So the pattern '(<.*?>)'
will get just '<b>'
as the first match, and '</b>'
as the second match, and so on getting each <..>
pair in turn. The style is typically that you use a .*?
, and then immediately to its right look for some concrete marker (>
in this case) that forces the end of the .*?
run.
The *?
extension originated in Perl, and regular expressions that include Perl's extensions are known as Perl Compatible Regular Expressions -- pcre. Python includes pcre support. Many command line utils etc. have a flag where they accept pcre patterns.
An older but widely used technique to code this idea of "all of these chars except stopping at X" uses the square-bracket style. For the above you could write the pattern, but instead of .*
to get all the chars, use [^>]*
which skips over all characters which are not >
(the leading ^
"inverts" the square bracket set, so it matches any char not in the brackets).
The re.sub(pat, replacement, str)
function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1'
, '\2'
which refer to the text from group(1)
, group(2)
, and so on from the original matching text.
Here's an example which searches for all the email addresses, and changes them to keep the user (\1
) but have yo-yo-dyne.com
as the host.
In [30]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, string) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', string))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
Activity
Given the `arts` data frame, try to do the following:
In [31]:
string = "©1977-1979, CARCC"
finds = re.findall(r"\d+", string)
first_find = finds[0]  # take the first year that appears
int(first_find)
Out[31]:
In [36]:
arts = pd.read_csv("data/arts.csv")
arts.head()
Out[36]:
In [37]:
arts["execution_date"] = arts["execution_date"].str.findall(r"(\d+)").str[0]
arts["execution_date"].head()
Out[37]:
In [38]:
# 1. If a year is lower than 100, it refers to the 1900s. For example, 78 is actually 1978, and that needs to be fixed too.
arts["execution_date"] = arts["execution_date"].astype(float)
arts["execution_date"] = arts["execution_date"].apply(lambda value: 1900 + value if value < 100 else value)
arts.head()
Out[38]:
In [39]:
# 2. Get the average execution year per artist.
arts.groupby("artist_name").mean().head()
Out[39]:
In [40]:
# 3. Get the average execution year per category.
arts.groupby("category")[["execution_date"]].mean()
Out[40]:
In [41]:
# 4. Get the number of artworks per artist. Which artist is the most prolific?
artworks_by_artist = arts.groupby("artist_name")[["title"]].aggregate(np.count_nonzero)
artworks_by_artist.sort_values("title", ascending=False).head()
Out[41]:
In [42]:
# 5. Get the number of artworks per category. Which category has the highest number?
artworks_by_category = arts.groupby("category")[["title"]].aggregate(np.count_nonzero)
artworks_by_category.sort_values("title", ascending=False).head()
Out[42]:
In [43]:
arts['title_len'] = arts['title'].apply(len)
arts.head()
Out[43]:
In [44]:
# 6. Get the average length of artworks titles per category and artist.
# reuse the title_len column computed in the previous cell
length_by_category = arts.groupby("category")[["title_len"]].aggregate(np.mean)
length_by_category.sort_values("title_len")
Out[44]:
In [45]:
# 7. Get the year with the highest production.
artworks_by_year = arts.groupby("execution_date")[["title"]].aggregate(np.count_nonzero)
artworks_by_year.sort_values("title", ascending=False).head(3)
Out[45]:
In [46]:
# 8. Get the approximate period of production for each artist. If an artist painted from 1970 to 1990, the period is 20.
period_min = arts.groupby("artist_name")[["execution_date"]].aggregate(np.min)
period_max = arts.groupby("artist_name")[["execution_date"]].aggregate(np.max)
(period_max - period_min + 1).sort_values("execution_date", ascending=False).head(5)
Out[46]:
In [47]:
arts.to_csv("data/arts_clean.csv")
Once we know the concept behind .groupby(), the next natural step is to think about doing a lot of groupings so we can see the data aggregated in many different ways. And this is precisely the problem that pivot tables are designed to solve.
Let's consider a new dataset, books (adapted from here), from a publisher, with the books in their catalogue and the number of copies of each, and say that we want to know things like how many copies are already published and how many are still in the pipeline.
Let's set up our environment first and load the dataset.
In [2]:
import pandas as pd
import numpy as np
books = pd.read_excel("data/books-funnel.xlsx")
books.head()
Out[2]:
In [8]:
# How many copies are already published?
books[books["Status"] == "published"]["Copies"].sum()
Out[8]:
In [16]:
# How many copies are in the pipeline?
books[books["Status"].isin(["In-print", "pending"])]["Copies"].sum()
Out[16]:
In [18]:
books.groupby(["Book", "Status"])[["Copies"]].aggregate(np.sum)
Out[18]:
As we build up the pivot table, it's easier to take it one step at a time. Add items and check each step to verify you are getting the results you expect. Don't be afraid to play with the order and the variables to see what presentation makes the most sense for your needs.
The simplest pivot table must have a dataframe and an index. In this case, let's use the book title Book
as our index.
In [91]:
pd.pivot_table(books, index=["Book"])
Out[91]:
You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.
In [19]:
pd.pivot_table(books, index=["Book", "Author", "Editor"])
Out[19]:
This is interesting but not particularly useful. What we probably want to do is look at this by Editor
and Author
. It’s easy enough to do by changing the index.
In [21]:
pd.pivot_table(books, index=["Editor", "Author"])
Out[21]:
You can see that the pivot table is smart enough to start aggregating the data and summarizing it by grouping the authors with their editors. Now we start to get a glimpse of what a pivot table can do for us.
For this purpose, the Identifier and Price columns aren't really useful. Let's remove them by explicitly defining the columns we care about using the values field.
In [22]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"])
Out[22]:
The Copies column is automatically averaged, but we can do a count or a sum instead. Adding them is simple using aggfunc and np.sum.
In [23]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"], aggfunc=np.sum)
Out[23]:
aggfunc can take a list of functions. Let's try a sum using the numpy sum function and len to get a count.
In [98]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"], aggfunc=[np.sum, len])
Out[98]:
If we want to see copies broken down by the edition, the columns variable allows us to define one or more columns.
One of the confusing points with the pivot_table is the use of columns and values. Remember, columns are optional - they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.
In [99]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"],
columns=["Edition"], aggfunc=[np.sum])
Out[99]:
The NaN
’s are a bit distracting. If we want to remove them, we could use fill_value
to set them to 0.
In [24]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies"],
columns=["Edition"], aggfunc=[np.sum], fill_value=0)
Out[24]:
It would be useful to add the average price as well. Add Price
to the values list.
In [26]:
pd.pivot_table(books, index=["Editor", "Author"], values=["Copies", "Price"],
columns=["Edition"], aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0)
Out[26]:
What's interesting is that you can move items to the index to get a different visual representation. Remove Edition
from the columns and add to the index.
In [27]:
pd.pivot_table(books, index=["Editor", "Author", "Edition"], values=["Copies", "Price"],
aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0)
Out[27]:
For this data set, this representation makes more sense. Now, what if I want to see some totals? margins=True
does that for us. Also, a really handy feature is the ability to pass a dictionary to the aggfunc so you can perform different functions on each of the values you select. This has a side-effect of making the labels a little cleaner.
In [29]:
pd.pivot_table(books, index=["Editor", "Author", "Edition"], values=["Copies", "Price"],
aggfunc={"Copies": np.sum, "Price": np.mean}, fill_value=0, margins=True)
Out[29]:
Let’s move the analysis up a level and look at our pipeline at the editor level.
In [30]:
pd.pivot_table(books, index=["Editor", "Status"], values=["Copies"],
aggfunc=[np.sum], fill_value=0, margins=True)
Out[30]:
You can provide a list of aggregation functions to apply to each value too:
In [31]:
table = pd.pivot_table(books, index=["Editor", "Status"], columns=["Edition"], values=["Copies", "Price"],
aggfunc={"Copies": np.sum, "Price":[np.sum, np.mean]}, fill_value=0)
table
Out[31]:
It can look daunting to try to pull this all together at one time, but as soon as you start playing with the data and slowly adding items, you can get a feel for how it works. As a general rule of thumb, once you find yourself writing multiple groupby() calls to look at the same data from different angles, you should evaluate whether a pivot table is a more useful approach.
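As a rough illustration of that rule of thumb with the books data used above, these two lines produce essentially the same summary (this is just a sketch; both forms appear in the cells above):

books.groupby(["Editor", "Status"])[["Copies"]].sum()
pd.pivot_table(books, index=["Editor", "Status"], values=["Copies"], aggfunc=np.sum)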
Once you have generated your data, it is in a DataFrame
so you can filter on it using your standard DataFrame
functions.
If you want to look at just one editor:
In [32]:
table.query('Editor == ["Debra Henley"]') # you might need to pip install numexpr
Out[32]:
We can look at all of our pending and in-print books.
In [33]:
table.query('Status == ["pending", "in-print"]')
Out[33]:
This is a powerful feature of the pivot_table, so do not forget that you have the full power of pandas once you get your data into the pivot_table format you need.
Sometimes, the only operation you want to do in a pivot table is to count. In those cases, when frequency tables with multiple indices are required, a cross-tabulated version of the data might come in handy. In pandas, that kind of table has its own function, so you don't have to call pivot_table with all the parameters.
In [34]:
df = pd.DataFrame({'a' : np.random.randint(0, 2, size=20),
'b' : np.random.randint(0, 5, size=20),
'c' : np.random.randint(0, 3, size=20)})
df
Out[34]:
In [35]:
pd.crosstab(df['b'], [df['a'], df['c']])
Out[35]:
Activity
Given the `arts` data frame, create a pivot table by splitting the table into categories per year, and counting the number of artworks for each. Add marginal results and fill the `NaN` values.
In [57]:
pd.pivot_table(arts, index=["execution_date"], columns=["category"],
values=["title"], aggfunc={"title": np.count_nonzero},
fill_value=0, margins=True)
Out[57]:
There are a handful of third-party Python packages that are suitable for creating scientific plots and visualizations. Here, we will focus exclusively on matplotlib and the high-level plotting available within pandas; it is currently the most robust and feature-rich package available.
We require plots, charts and other statistical graphics for the written communication of quantitative ideas. They allow us to more easily convey relationships and reveal deviations from patterns.
Gelman and Unwin 2011.
A well-designed graph can display more information than a table of the same size, and more information than numbers embedded in text. Graphical displays allow and encourage direct visual comparisons.
The easiest way to interact with matplotlib used to be via pylab in IPython: by starting IPython (or the IPython notebook) in "pylab mode", both matplotlib and numpy were pre-loaded into the session. That approach is now deprecated, and the current way is simply to use the %matplotlib magic. For loading numpy we just use the regular imports.
In [59]:
%matplotlib inline
You can specify a custom graphical backend (e.g. qt, gtk, osx), but IPython generally does a good job of auto-selecting one. Now matplotlib is ready to go, and you can access the matplotlib API via plt, imported with the following convention:
In [60]:
import matplotlib.pyplot as plt
plt.plot(np.random.normal(size=100), np.random.normal(size=100), 'ro')
Out[60]:
In [61]:
# create a figure using the matlab-like interface
x = np.linspace(0, 10, 1000)
plt.subplot(2, 1, 1) # 2x1 grid, first plot
plt.plot(x, np.sin(x))
plt.title('Trig is easy.')
plt.subplot(2, 1, 2) # 2x1 grid, second plot
plt.plot(x, np.cos(x))
plt.xlabel('x')
Out[61]:
The other interface is an object-oriented interface, where we explicitly pass around references to the plot elements we want to work with:
In [65]:
# create the same figure using the object-oriented interface
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(x, np.sin(x))
ax1.set_title("Trig iss asdasd asd easy")
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(x, np.cos(x))
ax2.set_xlabel('x')
ax2.set_ylabel("Hello!")
Out[65]:
These two interfaces are convenient for different circumstances. I find that for doing quick, simple plots, the scripted interface is often easiest. On the other hand, when I want more sophisticated plots, the object-oriented interface is simpler and more powerful; in fact, the scripted interface has several distinct limitations.
It's good practice to use the object-oriented interface. That is, you should get in the habit of working with explicit figure and axes objects rather than relying on the implicit plt state.
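A minimal sketch of that habit, with made-up data: create the figure and axes explicitly, then call methods on the axes object.

fig, ax = plt.subplots(figsize=(6, 4))
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x))          # plot on the axes object, not via plt.plot()
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.set_title('Object-oriented sketch')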
There are four main ways to create multi-panel plots in matplotlib. From lowest to highest-level they are (roughly):
fig.add_axes()
Useful for creating inset plots. The add_axes method allows you to create an axes instance by specifying the size relative to the figure edges.
The argument is [left
, bottom
, width
, height
] which specifies the axes extent in fractions of the figure size (i.e. between 0 and 1):
In [162]:
fig = plt.figure()
main_ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
In [163]:
inset_ax = fig.add_axes([0.6, 0.6, 0.25, 0.25])
fig
Out[163]:
In [164]:
main_ax.plot(np.random.rand(100), color='gray')
inset_ax.plot(np.random.rand(20), color='black')
fig
Out[164]:
Useful for simple multi-panel plots. If you're trying to create multiple axes in a grid, you might use .add_axes()
repeatedly, but calculating the extent for each axes is not trivial. The .add_subplot()
method can streamline this.
The arguments are of the form N_vertical
, N_horizontal
, Plot_number
, and the indices start at 1:
In [165]:
fig = plt.figure()
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.45, 0.45, str(i), fontsize=24)
If you desire, you can adjust the spacing using fig.subplots_adjust()
, with units relative to the figure size (i.e. between 0 and 1)
In [166]:
fig.subplots_adjust(left=0.1, right=0.9,
bottom=0.1, top=0.9,
hspace=0.4, wspace=0.4)
fig
Out[166]:
plt.subplots()
Convenience function to create multiple subplots.
Because creating a full grid of subplots is such a common task, matplotlib recently added the plt.subplots()
command which creates the figure and axes in one go.
The arguments are N_vertical
, N_horizontal
, and the axes are returned within a Numpy array:
In [167]:
fig, ax = plt.subplots(2, 3)
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.2, 0.45, str((i, j)), fontsize=20)
In [168]:
print(type(ax))
print(ax.shape)
print(ax.dtype)
An additional nice piece of this routine is the ability to specify that the subplots have a shared x or y axis: this ties together the axis limits and removes redundant tick labels:
In [169]:
fig, ax = plt.subplots(2, 3, sharex=True, sharey=True)
x = np.linspace(0, 10, 1000)
for i in range(2):
    for j in range(3):
        ax[i, j].plot(x, (j + 1) * np.sin((i + 1) * x))
In [170]:
gs = plt.GridSpec(3, 3) # a 3x3 grid
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.add_subplot(gs[:, :])
fig.add_subplot(gs[1, 1])
Out[170]:
In [171]:
gs = plt.GridSpec(3, 3, wspace=0.4, hspace=0.4)
fig = plt.figure(figsize=(6, 6))
fig.add_subplot(gs[1, :2])
fig.add_subplot(gs[0, :2])
fig.add_subplot(gs[2, 0])
fig.add_subplot(gs[:2, 2])
fig.add_subplot(gs[2, 1:])
Out[171]:
Check out the documentation of plt.GridSpec
for information on adjusting the subplot parameters. The keyword arguments are similar to those of plt.subplots_adjust()
.
In [73]:
normals = pd.Series(np.random.normal(size=10))
normals.plot()
Out[73]:
Notice that by default a line plot is drawn, and a light grid is included. All of this can be changed, however:
In [77]:
normals.cumsum().plot(grid=False)
Out[77]:
Similarly, for a DataFrame
:
In [78]:
variables = pd.DataFrame({'normal': np.random.normal(size=100),
'gamma': np.random.gamma(1, size=100),
'poisson': np.random.poisson(size=100)})
variables.cumsum(0).plot()
Out[78]:
As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for plot:
In [80]:
variables.cumsum(0).plot(subplots=True)
Out[80]:
Or, we may want to have some series displayed on the secondary y-axis, which can allow for greater detail and less empty space:
In [81]:
variables.cumsum(0).plot(secondary_y='normal')
Out[81]:
Of course, each plot returned is an object with some methods and attributes to modify the aspect of your chart.
In [82]:
ax = variables.cumsum(0).plot()
ax.set_xlabel('Radius')
ax.set_ylabel('Area')
ax.set_title('Area of a Circle')
Out[82]:
If we would like a little more control, we can use matplotlib
's subplots function directly, and manually assign plots to its axes:
In [83]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i, var in enumerate(['normal', 'gamma', 'poisson']):
    variables[var].cumsum(0).plot(ax=axes[i], title=var)
axes[0].set_ylabel('cumulative sum')
Out[83]:
In [84]:
arts.groupby('category')[["title"]].count().plot(kind='bar')
Out[84]:
In [85]:
books.groupby(['Editor', 'Author'])[["Copies"]].aggregate(np.sum).plot(kind='barh')
Out[85]:
In [86]:
counts = pd.crosstab(books["Edition"], books["Status"])
counts.plot(kind='bar', stacked=True, grid=False)
Out[86]:
Activity
Given the `arts` data frame, plot a chart with the number of artworks per year. Set the title and the names of the axes. Also hide the legend and the grid.
In [119]:
grouped = arts.groupby("execution_date")[["title"]].count()
ax = grouped.plot()
ax.set_title("Artworks per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Number of Artworks")
ax.set_xlim([1950, 2000])
Out[119]:
Frequently it is useful to look at the distribution of data before you analyze it. Histograms are a sort of bar graph that displays relative frequencies of data values; hence, the y-axis is always some measure of frequency. This can either be raw counts of values or scaled proportions.
For example, we might want to see how the artworks are distributed across time:
In [123]:
arts["execution_date"].hist(grid=False)
Out[123]:
The hist method puts the continuous year values into bins, trying to make a sensible decision about how many bins to use (or, equivalently, how wide the bins are). We can override the default value (10):
In [120]:
arts["execution_date"].hist(grid=False, bins=30)
Out[120]:
There are algorithms for determining an "optimal" number of bins, each of which varies somehow with the number of observations in the data series.
In [121]:
sturges = lambda n: int(np.log2(n) + 1)
square_root = lambda n: int(np.sqrt(n))
from scipy.stats import kurtosis
doanes = lambda data: int(1 + np.log(len(data)) + np.log(1 + kurtosis(data) * (len(data) / 6.) ** 0.5))
n = len(arts["execution_date"])
sturges(n), square_root(n), doanes(arts["execution_date"].dropna())
Out[121]:
In [122]:
arts["execution_date"].hist(bins=doanes(arts["execution_date"].dropna()))
Out[122]:
A density plot is similar to a histogram in that it describes the distribution of the underlying data, but rather than being a pure empirical representation, it is an estimate of the underlying "true" distribution. As a result, it is smoothed into a continuous line plot. We create them in Pandas using the plot method with kind='kde'
, where kde
stands for kernel density estimate.
In [178]:
arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020))
Out[178]:
We can even make it look funny by using the XKCD style!
In [179]:
with plt.xkcd():
    arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020))
Often, histograms and density plots are shown together:
In [136]:
arts["execution_date"].hist(bins=doanes(arts["execution_date"].dropna()), normed=True, color='lightseagreen')
arts["execution_date"].dropna().plot(kind='kde', xlim=(1940, 2020), style='r--')
Out[136]:
Here, we had to normalize the histogram (density=True), since the kernel density is normalized by definition (it is a probability distribution).
In [138]:
arts.boxplot(column="execution_date", by="category", grid=False)
Out[138]:
You can think of the box plot as viewing the distribution from above. The blue crosses are "outlier" points that occur outside the extreme quantiles.
One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.
In [143]:
bp = arts.boxplot(column="execution_date", by="category", grid=False)
for i, index in enumerate(arts["category"].unique()):
    y = arts["execution_date"][arts["category"] == index].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)
When data are dense, a couple of tricks used above help the visualization: lowering the alpha level so that overlapping points remain visible, and adding random "jitter" along the x-axis so that the points do not all fall on the same vertical line.
To look at how Pandas does scatterplots, let's load the baseball
sample dataset.
In [144]:
baseball = pd.read_csv("data/baseball.csv")
baseball.head()
Out[144]:
Scatterplots are useful for data exploration, where we seek to uncover relationships among variables. There are no scatterplot methods for Series or DataFrame objects; we must instead use the matplotlib function scatter.
In [201]:
fig = plt.scatter(baseball.ab, baseball.h)
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)
Out[201]:
We can add additional information to scatterplots by assigning variables to either the size of the symbols or their colors.
In [199]:
fig = plt.scatter(baseball.ab, baseball.h, s=baseball.hr*10, alpha=0.5) # s for size
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)
Out[199]:
In [202]:
fig = plt.scatter(baseball.ab, baseball.h, c=baseball.hr, s=40, cmap='hot') # c for color
fig.axes.set_xlim(0, 700)
fig.axes.set_ylim(0, 200)
Out[202]:
To view scatterplots of a large number of variables simultaneously, we can use the scatter_matrix function that was recently added to Pandas. It generates a matrix of pair-wise scatterplots, optionally with histograms or kernel density estimates on the diagonal.
In [176]:
pd.plotting.scatter_matrix(baseball.loc[:, 'r':'sb'], figsize=(12, 8), diagonal='kde')
Out[176]: