In this class you are expected to learn:
For those of you who have heard of R, a statistics-oriented language, Pandas is the R for Python. It provides easy-to-use data structures and a ton of useful helper functions for data cleanup and transformations. It's fast because it is backed by NumPy arrays, and it integrates pretty well with other Python libraries such as scikit-learn
for statistical learning, or matplotlib
for plotting.
Pandas provides a key data structure: the pandas.DataFrame, which is like a super-rich 2D array that addresses three deficiencies of plain arrays:

- Columns can have different numpy.dtypes, so you can have, in the same row, numbers, strings, and dates.
- DataFrame rows and columns are labeled with row indices and column names, which makes slicing and indexing easier.
- Missing data is supported, represented as NaN, which stands for "not a number".

Data frames are extremely useful for data munging. They provide a large range of operations such as filter, join, and group-by aggregation. Furthermore, pandas.Series is to a 1D array what pandas.DataFrame is to a 2D array.
In [2]:
import numpy as np # we will need it
import pandas as pd
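To make the three points above concrete, here is a tiny hand-made example (a sketch, not one of the course datasets): mixed column types, labeled rows and columns, and a missing value.
```
# A toy DataFrame: each column has its own dtype, rows/columns are labeled,
# and missing data shows up as NaN
toy_df = pd.DataFrame(
    {'name': ['Ada', 'Grace', 'Alan'],
     'born': pd.to_datetime(['1815-12-10', '1906-12-09', '1912-06-23']),
     'score': [9.5, np.nan, 8.0]},
    index=['row1', 'row2', 'row3']
)
toy_df
```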
Pandas is able to read data from a bunch of different data formats: CSV, TSV, Excel, HDF5, JSON, etc. The whole list is in the online documentation. In order to read data from a CSV file, we use the read_csv()
function, which returns a data frame. By default, it assumes that the fields are comma-separated, but it has parameters to tune the behaviour of the importer.
We're going to use some datasets from Julia Evans' Pandas Cookbook. The first one is about cyclist data from Montréal in 2012.
This bikes dataset is a list of how many people were on 7 different bike paths in Montréal, each day.
In [2]:
bikes_df = pd.read_csv('data/bikes.csv') # We add _df 'cause it's a data frame
Hey! We know that error! It's an encoding error probably caused by French accents. Fortunately, read_csv()
allows us to pass the encoding as an argument.
In [3]:
bikes_df = pd.read_csv('data/bikes.csv', encoding="latin1")
bikes_df
Out[3]:
You'll notice that this is totally broken! Fortunately, read_csv()
has a bunch more options that will let us fix it. Here we'll:

- change the column separator to a ;
- parse the dates in the Date column
- tell it that our dates have the day first instead of the month first
- set the index to be the Date column
In [4]:
bikes_df = pd.read_csv(
    'data/bikes.csv',
    sep=';',
    encoding='latin1',
    parse_dates=['Date'],
    dayfirst=True,
    index_col='Date'
)
bikes_df
Out[4]:
And now let's just remove the columns that contain missing values.
In [5]:
# Get rid of missing columns, don't worry, you'll see more on .dropna() later
bikes_df = bikes_df.dropna(axis=1)
bikes_df.head()
Out[5]:
Way better!
In the last example we added head()
to show just the first rows. If you want the last ones, use the tail()
function. Both can receive the number of rows to show, which defaults to 5.
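For instance, a quick peek at just the last three rows:
```
# Show only the last 3 rows
bikes_df.tail(3)
```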
Pandas is also able to get data from the Internet without even batting an eye. The excellent book Introduction to Statistical Learning has one of its data sets available to download.
In [6]:
autos_df = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
autos_df.head() # print the first lines
Out[6]:
A DataFrame
is made up of rows and columns. You get columns out of a DataFrame
the same way you get elements out of a dictionary.
Here's an example:
In [7]:
bikes_df['Berri 1'].head()
Out[7]:
For selecting several columns we use the same fancy indexing that we saw for NumPy.
In [8]:
fancy_index = ['Berri 1', 'Rachel1']
bikes_df[fancy_index].head()
Out[8]:
But it's very common to just put the fancy index list as the key:
In [9]:
bikes_df[['Berri 1', 'Rachel1']].head()
Out[9]:
For slicing rows, the syntax is just as expected: the same one we use for lists.
In [10]:
bikes_df[['Berri 1', 'Rachel1']][:5]
Out[10]:
In [11]:
bikes_df['Berri 1'][-5:]
Out[11]:
In a DataFrame
every column is a Series
, so you can use the index to select specific items from the Series
.
In [12]:
berri = bikes_df['Berri 1']
berri.head()
Out[12]:
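Since berri is now a Series indexed by date, a single value can be pulled out by its index label. A small sketch (assuming, as the data suggests, that 2012-01-05 is one of the dates in the index):
```
# Select one item from the Series by its index label
berri['2012-01-05']
```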
Or you can use boolean indexing for selection.
In [13]:
bikes_df[bikes_df['Berri 1'] < 40]
Out[13]:
That last one might be a little weird, so let's make it clearer: bikes_df['Berri 1'] < 40
returns a Series of True
/False
values, which we then pass to our DataFrame bikes_df
, returning the corresponding True
items.
In [14]:
less_than_40 = bikes_df['Berri 1'] < 40
print(less_than_40)
bikes_df[less_than_40]
Out[14]:
Series
and DataFrame
share the same indexing, slicing and selecting operations. What if you aren't sure whether an item is in the Series
? You can check using idiomatic Python.
Let's first create a new Series
from a dictionary.
In [15]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities
Out[15]:
In [16]:
'Seattle' in cities
Out[16]:
In [17]:
'San Francisco' in cities
Out[17]:
Finally, it's also possible to get a specific row by using the index value and the .loc
property (you may see the older .ix accessor elsewhere; it has been deprecated in favour of .loc).
In [18]:
bikes_df.loc['2012-01-01']
Out[18]:
Mathematical operations can be done using scalars and functions.
In [19]:
# divide city values by 3
cities / 3
Out[19]:
In [20]:
# square city values
np.square(cities)
Out[20]:
But what can we do with that NaN
value? The easiest way to deal with it is simply to drop it.
In [21]:
cities.dropna()
Out[21]:
Which is equivalent to doing:
In [22]:
cities[cities.notnull()]
Out[22]:
We can also fill the null data or even interpolate its value from the rest of the Series
.
In [23]:
cities.fillna(0)
Out[23]:
In [24]:
cities.interpolate()
Out[24]:
But null data might come in different ways. The next example illustrates the creation of a DataFrame
from a dictionary. There are two blogs with a number of entries per tag. In this case, null data is represented by either the character "?"
or the empty string ""
.
In [8]:
blogs = pd.DataFrame({0: {'blog': 1, 'tag': 'GIS', 'entries': '?'},
                      1: {'blog': 1, 'tag': 'NLP', 'entries': 1638},
                      2: {'blog': 1, 'tag': 'SNA', 'entries': 569},
                      3: {'blog': 1, 'tag': 'DH', 'entries': 115},
                      4: {'blog': 2, 'tag': 'GIS', 'entries': ''},
                      5: {'blog': 2, 'tag': 'NLP', 'entries': 1130},
                      6: {'blog': 2, 'tag': 'SNA', 'entries': 754},
                      7: {'blog': 2, 'tag': 'DH', 'entries': 555}})
In [9]:
blogs
Out[9]:
In [10]:
blogs = blogs.T # We transpose it to swap rows and columns
blogs
Out[10]:
If we wanted to treat those values as nulls, we could use the na_values
parameter when reading the data from a CSV (see the sketch after the next cell). For DataFrames
that we already have, the process is more manual. For now, let's just ignore the warning :-)
In [11]:
blogs[blogs.entries != "?"][blogs.entries != ""]
Out[11]:
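For comparison, had the blog data come from a CSV file, na_values could have handled this at read time. A small sketch using an in-memory CSV (not one of the course files):
```
import io

# na_values turns "?" into NaN while reading; empty fields become NaN by default
csv_text = "blog,tag,entries\n1,GIS,?\n1,NLP,1638\n2,GIS,\n"
pd.read_csv(io.StringIO(csv_text), na_values=['?'])
```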
In some cases, we might want to keep the rows, but fill them with real null data.
In [12]:
blogs['entries'] = blogs['entries'].replace("?", np.nan)
blogs['entries'].replace("", np.nan, inplace=True)
blogs
Out[12]:
Activity
What do you think is the difference between these two lines?
```
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
```
```
blogs['entries'].replace("?", np.NaN, inplace=True)
```
When the Series
is of str
type, we can use almost the whole set of Python string functions by accessing the .str
attribute, followed by whatever string method we need.
In [13]:
blogs.tag.str.lower()
Out[13]:
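A couple more of these vectorized string methods, as a quick sketch:
```
# Does the tag contain the letter "N"? And how long is each tag?
print(blogs['tag'].str.contains('N'))
blogs['tag'].str.len()
```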
But if the cleaning needs a more complicated operation, we can always .apply()
a specific function to the whole DataFrame
or Series
, in order to perform that function over the individual values or cells.
In [14]:
def clean_func(value):
    if value == "GIS":
        return "Geographical Information Systems"
    elif value == "NLP":
        return "Natural Language Processing"
    elif value == "DH":
        return "Digital Humanities"
    elif value == "SNA":
        return "Social Network Analysis"
    else:
        return value

blogs['tag'].apply(clean_func)
Out[14]:
In [15]:
blogs
Out[15]:
Pandas provides nifty methods to understand your data. Let's look at the describe, correlation, and covariance methods that you can use to quickly make sense of the data.
The describe()
method provides quick stats on all suitable columns.
In [33]:
bikes_df.describe()
Out[33]:
In [34]:
blogs.describe()
Out[34]:
In [35]:
autos_df.describe()
Out[35]:
Activity
Other summary statistics include: `count()`, `mean()`, `median()`, `quantile()`, `std()`, `var()`, `min()`, `max()`, etc. Play around a bit to see what they do in the `DataFrames` we already have.
Throughout an analysis, we'll often need to merge/join datasets as data is typically stored in a relational manner.
Like SQL's JOIN clause, pandas.merge()
allows two DataFrames
to be joined on one or more keys. The function provides a series of parameters (on
, how
, left_on
, right_on
, left_index
, right_index
) allowing you to specify the columns or indexes on which to join.
By default, pandas.merge()
operates as an inner join, which can be changed using the how
parameter. An inner join combines the two DataFrames
on a certain key, keeping only the rows whose key appears in both; rows whose key has no match on the other side are dropped.
This is always easiest to understand with examples.
In [3]:
left_frame = pd.DataFrame({'key': range(5),
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
left_frame
Out[3]:
In [4]:
right_frame = pd.DataFrame({'key': range(2, 7),
                            'right_value': ['f', 'g', 'h', 'i', 'j']})
right_frame
Out[4]:
In [5]:
pd.merge(left_frame, right_frame, on='key', how='inner')
Out[5]:
We lose values from both frames since certain keys do not match up. Had our key columns been named differently, we could have used the left_on
and right_on
parameters to specify which fields to join from each frame:
pd.merge(left_frame, right_frame, left_on='left_key', right_on='right_key')
Alternatively, if our keys were indexes, we could use the left_index
or right_index
parameters, which accept a boolean value. You can mix and match columns and indexes like this:
pd.merge(left_frame, right_frame, left_on='key', right_index=True)
In [6]:
pd.merge(left_frame, right_frame, on='key', how='left')
Out[6]:
We keep everything from the left frame, pulling in the value from the right frame where the keys match up. The right_value
is null where keys do not match (NaN
).
In [7]:
pd.merge(left_frame, right_frame, on='key', how='right')
Out[7]:
This time we've kept everything from the right frame with the left_value
being null where the right frame's key did not find a match.
In [41]:
pd.merge(left_frame, right_frame, on='key', how='outer')
Out[41]:
We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are null.
Pandas also provides a way to combine DataFrames
along an axis: pandas.concat()
. It takes a list of Series
or DataFrames
and returns a Series
or DataFrame
of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.
In [42]:
pd.concat([left_frame, right_frame])
Out[42]:
By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be null.
Additionally, objects can be concatenated side-by-side using the function's axis parameter.
In [43]:
pd.concat([left_frame, right_frame], axis=1)
Out[43]:
The groupby()
method lets you perform grouping operations. The example below groups the rows by the blog
column and counts the number of entries
for each blog.
In [18]:
blogs
Out[18]:
In [20]:
blogs.groupby('blog').count()
Out[20]:
And now for a more complex example. From our bikes_df
DataFrame
, we want to know whether people are more active on weekdays or on weekends. We are going to add a new column with the information about the weekday
, and then sum all the values for each weekday. We do that by grouping by weekday
and aggregating the values by simple addition. In this case we use np.sum()
, but any other operation, such as np.max()
or np.average()
, can be used.
In [45]:
bikes_df.index
Out[45]:
The data frame's index is of type DatetimeIndex
, which allows us to extract the same kind of information as a standard Python datetime
object.
In [46]:
bikes_df.index.weekday
Out[46]:
In order to create the new column we just need to pick a name and assign a Series
value, as if the DataFrame
were a dictionary (remember that DataFrame
and Series
are labeled).
In [47]:
bikes_df['weekday'] = bikes_df.index.weekday
bikes_df.head()
Out[47]:
In [48]:
bikes_df.groupby('weekday')
Out[48]:
In [49]:
counts_by_day = bikes_df.groupby('weekday').aggregate(np.sum)
counts_by_day
Out[49]:
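As mentioned above, any other aggregation works the same way. For instance, a quick sketch with the maximum count observed for each weekday:
```
# Same grouping, different aggregation: busiest single day for each weekday
bikes_df.groupby('weekday').aggregate(np.max)
```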
And finally a bit of matplotlib + pandas
magic! But don't worry, you'll learn about plotting in the next class.
In [50]:
# display graphs inline
%matplotlib inline
import matplotlib
# Make the fonts bigger
matplotlib.rc('figure', figsize=(14, 7))
counts_by_day.plot()
Out[50]:
This is certainly not all we will see about grouping, but it's a good start if you already understand what's going on here.
The Art Bank collection is a dataset provided by data.gc.ca, a key part of Canada's Action Plan on Open Government to enhance transparency and accountability, that contains the art works in the Canada Council Art Bank collection. For 40 years, the Art Bank has been collecting work by the best Canadian artists of our times. The Art Bank is part of the Canada Council for the Arts, Canada's national arts funder. It's licensed under the terms of the Open Government Licence - Canada.
And... it's in XML! XML stands for eXtensible Markup Language, and if you want to see an example right now, just press Ctrl+U
and the browser will (likely) show you the source of this web page. HTML is not strictly XML, but it looks very much like it (XHTML is an XML dialect), so now you have an idea.
XML is something that you should know about already. Very basically, XML defines a hierarchical structure in which concepts are enclosed in tags that might have properties.
<art_list>
<art_work title="Title of the artwork" />
</art_list>
In the last example, art_list
is a tag, and <art_list>...</art_list>
is how you enclose something in a tag. Sometimes a tag has nothing to enclose, as in <art_work />
, which is called a self-closing tag. Finally, attributes (like title above) behave just like dictionary key-value pairs.
In Python, the fastest way to get all the concepts we need is usually by finding all the tags of a certain kind. The module lxml.etree
is able to do that for us.
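Before tackling the real file, a tiny sketch with the toy snippet from above shows how attributes behave like dictionary entries:
```
from lxml import etree

# Parse the toy XML string and read the attribute of the art_work tag
toy = etree.fromstring('<art_list><art_work title="Title of the artwork"/></art_list>')
toy.find('art_work').get('title')
```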
First, let's see some lines of the dataset by using the special IPython command !head
.
In [21]:
!head data/art_listing.xml
From here, it's pretty obvious that what we want are the Art_Summary
elements and their attributes.
The function etree.parse()
opens a file, parses it into a Python data structure and returns it as an object.
In [22]:
from lxml import etree
arts = etree.parse("data/art_listing.xml")
To find all Art_Summary
elements, we can call the method .findall()
.
In [23]:
summaries = arts.findall("Art_Summary")
summaries[:5]
Out[23]:
Each of those elements in the list behaves like a dictionary.
In [24]:
summaries[0].keys()
Out[24]:
In [25]:
summaries[0].values()
Out[25]:
Now we are ready to build a data frame from the summaries
list.
In [26]:
art_list = []
for summary in summaries:
    art_list.append(summary.values())
arts = pd.DataFrame(
    art_list,
    columns=['artist_name', 'title', 'execution_date', 'category', 'category_french', 'id']
)
Let's remove the category in French.
In [27]:
del arts['category_french']
In [28]:
arts.head()
Out[28]:
Voilà! Ready for the fun. Let's just save it for later.
In [38]:
arts.to_csv("data/arts.csv", index=False)
Activity
Given the `arts` data frame, clean the dates so you only see numbers. If a year is lower than 100, it refers to the 1900s; for example, 78 is actually 1978, and that needs to be fixed too. [Solution](data/arts1.py)
In [34]:
l = ['1989', '1999']
l[-1]
Out[34]:
In [37]:
"1989, CARCC".split("-")[-1]
Out[37]:
In [40]:
# Clean the dates so you only see numbers.
def clean_years(value):
    result = value
    chars_to_replace = ["c.", "©", ", CARCC", "no date", "n.d.", " SODRAC", ", CA", " CARCC"]
    chars_to_split = ["-", "/"]
    if isinstance(result, str):  # what does isinstance() do?
        for char in chars_to_split:
            result = result.split(char)[-1]
        for char in chars_to_replace:
            result = result.replace(char, "")
        if result == "":
            return np.nan
        else:
            return int(result)
    else:
        return result
arts['execution_date'] = arts['execution_date'].apply(clean_years)
arts.head()
Out[40]:
In [43]:
# If a year is lower than 100, it refers to the 1900s. For example, 78 is actually 1978, and that needs to be fixed too.
def clean_year_99(value):
    if value <= 99:
        return 1900 + value
    else:
        return value
arts["execution_date"] = arts["execution_date"].apply(clean_year_99)
arts[arts["execution_date"] < 100].head()
Out[43]:
In [45]:
# using lambda functions
arts["execution_date"] = arts["execution_date"].apply(lambda value: 1900 + value if value <= 99 else value)