In this class you are expected to learn:
For any of you who have heard about R, a statistics-oriented language, Pandas is the R for Python. It provides easy-to-use data structures and a ton of useful helper functions for data cleanup and transformations. It's fast because it is backed by NumPy arrays, and it integrates well with other Python libraries such as scikit-learn for statistical learning, or matplotlib for plotting.
Pandas provides a key data structure: the pandas.DataFrame, which is like a super rich 2D array that addresses three deficiencies of plain arrays:

- Arrays are restricted to a single numpy.dtype, while a DataFrame can hold, in the same row, numbers, strings, and dates.
- Arrays are indexed only by position, while DataFrames are labeled with column names and row indices, which makes slicing and indexing easier.
- Arrays have no notion of missing data, while DataFrames represent it with NaN, which stands for "not a number".

Data frames are extremely useful for data munging. They provide a large range of operations such as filter, join, and group-by aggregation. Furthermore, pandas.Series is to a 1D array what pandas.DataFrame is to a 2D array.
In [1]:
import numpy as np # we will need it
import pandas as pd
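Before loading real data, here's a minimal sketch of both structures; the column names and values below are made up purely for illustration:

```
# A DataFrame can mix dtypes across columns and carries labels
df = pd.DataFrame({'city': ['Montréal', 'Toronto'],
                   'founded': [1642, 1793],
                   'rainy': [True, False]},
                  index=['a', 'b'])
df['founded']  # every column is a labeled 1D pandas.Series
```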
Pandas is able to read data from a bunch of different data formats: CSV, TSV, Excel, HDF5, JSON, etc. The whole list is in the online documentation. In order to read data from a CSV file, we use the read_csv() function, which returns a data frame. By default, it assumes that the fields are comma-separated, but it has parameters to tune the behaviour of the importer.
We're going to use some datasets from Julia Evans' Pandas Cookbook. The first one is about cyclist data from Montréal in 2012.
This bikes dataset is a list of how many people were on 7 different bike paths in Montréal, each day.
In [2]:
bikes_df = pd.read_csv('data/bikes.csv') # We add _df 'cause it's a data frame
Hey! We know that error! It's an encoding error, probably caused by French accents. Fortunately, read_csv() allows us to pass the encoding as an argument.
In [3]:
bikes_df = pd.read_csv('data/bikes.csv', encoding="latin1")
bikes_df
Out[3]:
You'll notice that this is totally broken! Fortunately, read_csv() has a bunch more options that will let us fix it. Here we'll:

- change the field separator to a semicolon (sep=';')
- parse the dates in the Date column (parse_dates=['Date'])
- tell the parser that the day comes first in those dates (dayfirst=True)
- set the index to be the Date column (index_col='Date')
In [4]:
bikes_df = pd.read_csv(
    'data/bikes.csv',
    sep=';',
    encoding='latin1',
    parse_dates=['Date'],
    dayfirst=True,
    index_col='Date'
)
bikes_df
Out[4]:
And now let's just drop the columns that contain missing values.
In [5]:
# Drop columns with missing values; don't worry, you'll see more on .dropna() later
bikes_df = bikes_df.dropna(axis=1)
bikes_df.head()
Out[5]:
Way better!
In the last example we added head() to show just the first rows. If you want the last ones, use the tail() function. Both can receive the number of rows to show, which defaults to 5.
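For example, a quick peek at the end of the frame:

```
bikes_df.tail(3)  # the last 3 rows; bikes_df.head(3) would show the first 3
```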
Pandas is also able to get data from the Internet without even batting an eye. The excellent book Introduction to Statistical Learning has one of its data sets available to download.
In [6]:
autos_df = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
autos_df.head() # print the first lines
Out[6]:
A DataFrame is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.
Here's an example:
In [7]:
bikes_df['Berri 1'].head()
Out[7]:
For selecting several columns we use the same fancy indexing that we saw for NumPy.
In [8]:
fancy_index = ['Berri 1', 'Rachel1']
bikes_df[fancy_index].head()
Out[8]:
But it's very common to just put the fancy index list as the key:
In [9]:
bikes_df[['Berri 1', 'Rachel1']].head()
Out[9]:
For slicing rows the syntax is just as expected: the same one we use for lists.
In [10]:
bikes_df[['Berri 1', 'Rachel1']][:5]
Out[10]:
In [11]:
bikes_df['Berri 1'][-5]
Out[11]:
In a DataFrame every column is a Series, so you can use the index to select specific items from the Series.
In [12]:
berri = bikes_df['Berri 1']
berri.head()
Out[12]:
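Since berri keeps the Date index, we can look up a specific day by its label (assuming that date is present in the index, as it is in this dataset):

```
berri['2012-01-01']  # cyclists on Berri 1 on January 1st, 2012
```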
Or you can use boolean indexing for selection.
In [13]:
bikes_df[bikes_df['Berri 1'] < 40]
Out[13]:
That last one might be a little weird, so let's make it clearer: bikes_df['Berri 1'] < 40 returns a Series of True/False values, which we then pass to our DataFrame bikes_df, getting back the rows where the value is True.
In [14]:
less_than_40 = bikes_df['Berri 1'] < 40
print(less_than_40)
bikes_df[less_than_40]
Out[14]:
Series and DataFrame share the same indexing, slicing and selecting operations. What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.
Let's first create a new Series from a dictionary.
In [15]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities
Out[15]:
In [16]:
'Seattle' in cities
Out[16]:
In [17]:
'San Francisco' in cities
Out[17]:
Finally, it's also possible to get a specific row by using the index value and the ix property.
In [18]:
bikes_df.ix['2012-01-01']
Out[18]:
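A side note: the ix property has been removed in recent versions of pandas; the label-based equivalent is .loc:

```
bikes_df.loc['2012-01-01']  # modern label-based replacement for .ix
```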
Mathematical operations can be done using scalars and functions.
In [19]:
# divide city values by 3
cities / 3
Out[19]:
In [20]:
# square city values
np.square(cities)
Out[20]:
But what can we do with that NaN value? The easiest way to deal with it is simply to remove it.
In [21]:
cities.dropna()
Out[21]:
Which is equivalent to doing:
In [22]:
cities[cities.notnull()]
Out[22]:
We can also fill the null data or even interpolate its value from the rest of the Series.
In [23]:
cities.fillna(0)
Out[23]:
In [24]:
cities.interpolate()
Out[24]:
But null data might come in different ways. The next example illustrates the creation of a DataFrame from a dictionary. There are two blogs with a number of entries per tag. In this case, null data is represented by either the character "?" or the empty string "".
In [25]:
blogs = pd.DataFrame({0: {'blog': 1, 'tag': 'GIS', 'entries': '?'},
                      1: {'blog': 1, 'tag': 'NLP', 'entries': 1638},
                      2: {'blog': 1, 'tag': 'SNA', 'entries': 569},
                      3: {'blog': 1, 'tag': 'DH', 'entries': 115},
                      4: {'blog': 2, 'tag': 'GIS', 'entries': ''},
                      5: {'blog': 2, 'tag': 'NLP', 'entries': 1130},
                      6: {'blog': 2, 'tag': 'SNA', 'entries': 754},
                      7: {'blog': 2, 'tag': 'DH', 'entries': 555}})
In [26]:
blogs
Out[26]:
In [27]:
blogs = blogs.T # We transpose it to swap rows and columns
blogs
Out[27]:
If we want to remove rows with null values at load time, we can use the na_values parameter when reading the data from a CSV. For DataFrames that we already have, the process is more manual. For now, let's just ignore the warning :-)
In [28]:
blogs[blogs.entries != "?"][blogs.entries != ""]
Out[28]:
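For reference, this is roughly how the na_values approach would look at load time; the file name here is hypothetical, since our blogs frame was built by hand:

```
# Hypothetical CSV version of the blogs data: treat "?" and "" as NaN on read
blogs_from_csv = pd.read_csv('data/blogs.csv', na_values=['?', ''])
```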
In some cases, we might want to keep the rows, but fill them with real null data.
In [29]:
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
blogs['entries'].replace("", np.NaN, inplace=True)
blogs
Out[29]:
Activity
What do you think is the difference between these two lines?
```
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
```
```
blogs['entries'].replace("", np.NaN, inplace=True)
```
Pandas provides nifty methods to understand your data. Let's look at the describe, correlation, and covariance methods that you can use to quickly make sense of the data.
The describe() method provides quick stats on all suitable columns.
In [30]:
bikes_df.describe()
Out[30]:
In [31]:
blogs.describe()
Out[31]:
In [32]:
autos_df.describe()
Out[32]:
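The correlation and covariance methods mentioned above are corr() and cov(). A quick sketch on the numeric bikes columns:

```
bikes_df.corr()  # pairwise correlation between bike path counts
bikes_df.cov()   # pairwise covariance
```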
Activity
Other summary statistics include: `count()`, `mean()`, `median()`, `quantile()`, `std()`, `var()`, `min()`, `max()`, etc. Play around a bit to see what they do in the `DataFrames` we already have.
Throughout an analysis, we'll often need to merge/join datasets as data is typically stored in a relational manner.
Like SQL's JOIN clause, pandas.merge() allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (on, how, left_on, right_on, left_index, right_index) allowing you to specify the columns or indexes on which to join.
By default, pandas.merge() operates as an inner join, which can be changed using the how parameter. An inner join combines the two DataFrames on a key, keeping only the rows whose key appears in both frames; rows with a NaN key are dropped as well.
This is always easiest to understand with examples.
In [35]:
left_frame = pd.DataFrame({'key': range(5),
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7),
                            'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print(right_frame)
In [36]:
pd.merge(left_frame, right_frame, on='key', how='inner')
Out[36]:
We lose values from both frames since certain keys do not match up. If our key columns had different names, we could have used the left_on and right_on parameters to specify which field to join on from each frame.
```
pd.merge(left_frame, right_frame, left_on='left_key', right_on='right_key')
```
Alternatively, if our keys were indexes, we could use the left_index or right_index parameters, which accept a boolean value. You can mix and match columns and indexes like this:
```
pd.merge(left_frame, right_frame, left_on='key', right_index=True)
```
In [37]:
pd.merge(left_frame, right_frame, on='key', how='left')
Out[37]:
We keep everything from the left frame, pulling in the value from the right frame where the keys match up. The right_value is null where keys do not match (NaN).
In [38]:
pd.merge(left_frame, right_frame, on='key', how='right')
Out[38]:
This time we've kept everything from the right frame with the left_value being null where the right frame's key did not find a match.
In [39]:
pd.merge(left_frame, right_frame, on='key', how='outer')
Out[39]:
We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are null.
Pandas also provides a way to combine DataFrames along an axis: pandas.concat(). It takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.
In [40]:
pd.concat([left_frame, right_frame])
Out[40]:
By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be null.
Additionally, objects can be concatenated side-by-side using the function's axis parameter.
In [41]:
pd.concat([left_frame, right_frame], axis=1)
Out[41]:
The groupby() method lets you perform grouping operations. The example below groups the rows by the blog column and calculates the mean number of entries for each blog.
In [42]:
blogs.groupby('blog').mean()
Out[42]:
And now for a more complex example. From our bikes_df DataFrame, we want to know whether people are more active on weekdays or on weekends. We are going to add a new column with information about the weekday, and then sum all the values for each weekday. We do that by grouping by weekday and aggregating the values by simple addition. In this case we use np.sum(), but any other operation, such as np.max() or np.average(), can be used.
In [43]:
bikes_df.index
Out[43]:
The data frame's index is of type DatetimeIndex, which allows us to extract information from the standard datetime Python object.
In [44]:
bikes_df.index.weekday
Out[44]:
In order to create the new column we just need to pick a name and assign a Series value as if the DataFrame were a dictionary (remember that DataFrame and Series are labeled).
In [45]:
bikes_df['weekday'] = bikes_df.index.weekday
bikes_df.head()
Out[45]:
In [46]:
bikes_df.groupby('weekday')
Out[46]:
In [47]:
counts_by_day = bikes_df.groupby('weekday').aggregate(np.sum)
counts_by_day
Out[47]:
And finally a bit of matplotlib + pandas magic! But don't worry, you'll learn about plotting in the next class.
In [48]:
import matplotlib
# display graphs inline
%matplotlib inline
# Make the figures bigger
matplotlib.rc('figure', figsize=(14, 7))
counts_by_day.plot()
Out[48]:
This is certainly not all we will see about grouping, but it's a good start if you already understand what's going on here.
The Art Bank collection is a dataset provided by data.gc.ca, a key part of Canada's Action Plan on Open Government to enhance transparency and accountability, that contains the art works in the Canada Council Art Bank collection. For 40 years, the Art Bank has been collecting work by the best Canadian artists of our times. The Art Bank is part of the Canada Council for the Arts, Canada's national arts funder. The dataset is licensed under the terms of the Open Government Licence - Canada.
And... it's in XML! XML stands for eXtensible Markup Language, and if you want to see an example right now, just press Ctrl+U and the browser will (likely) show you the HTML under this web page. HTML is a close cousin of XML (XHTML is, in fact, an XML dialect). So now you have an idea.
XML is something that you should know about already. Very basically, XML defines a hierarchical structure in which concepts are enclosed in tags that might have properties.
<art_list>
<art_work title="Title of the artwork" />
</art_list>
In the last example, art_list is a tag, and <art_list>...</art_list> is how you enclose something in a tag. Sometimes tags might not have anything to enclose, as in <art_work />, which is a self-closing tag. Finally, properties behave just like dictionary entries.
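As a quick illustration of how this maps to Python, we can parse that little snippet from a string with etree.fromstring() (below we'll use etree.parse() to read from a file instead):

```
from lxml import etree

root = etree.fromstring('<art_list><art_work title="Title of the artwork" /></art_list>')
work = root.find('art_work')  # the first <art_work> tag inside <art_list>
work.get('title')             # properties behave like a dictionary
```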
In Python, the fastest way to get all the concepts we need is usually by finding all the tags of a certain kind. The module lxml.etree is able to do that for us.
First, let's see some lines of the dataset by using the special IPython command !head.
In [49]:
!head data/art_listing.xml
From here, it's pretty obvious that what we want are the Art_Summary elements and their properties.
The function etree.parse() opens a file, parses it into a Python data structure and returns it as an object.
In [50]:
from lxml import etree
arts = etree.parse("data/art_listing.xml")
To find all Art_Summary elements, we can call the method .findall().
In [51]:
summaries = arts.findall("Art_Summary")
summaries[:5]
Out[51]:
Each of those elements in the list behaves like a dictionary.
In [52]:
summaries[0].keys()
Out[52]:
In [53]:
summaries[0].values()
Out[53]:
Now we are ready to build a data frame from the summaries list.
In [54]:
art_list = []
for summary in summaries:
    art_list.append(summary.values())

arts = pd.DataFrame(
    art_list,
    columns=['artist_name', 'title', 'execution_date',
             'category', 'category_french', 'id']
)
Let's remove the category in French.
In [55]:
del arts['category_french']
In [56]:
arts.head()
Out[56]:
Voilà! Ready for the fun.
Activity
Given the `arts` data frame, try the following: