[Data, the Humanist's New Best Friend](index.ipynb)
*Class 06*

In this class you are expected to learn:

  1. Pandas
  2. Cleaning data
  3. Summary statistics
  4. Indexing
  5. Merging, joining
  6. Group by

Pandas

For those of you who have heard of R, a statistics-oriented language, Pandas is the R of Python. It provides easy-to-use data structures and a ton of useful helper functions for data cleanup and transformation. It's fast because it is backed by NumPy arrays, and it integrates well with other Python libraries such as scikit-learn for statistical learning, or matplotlib for plotting.

Pandas provides a key data structure: the pandas.DataFrame, which is like a super rich 2D array that addresses three deficiencies of arrays:

  • They hold heterogeneous data; each column can have its own numpy.dtype, so a single row can contain numbers, strings, and dates.
  • The axes of a DataFrame are labeled with column names and row indices, which makes slicing and indexing easier.
  • They account for missing values, which arrays do not directly support, in the form of NaN, which stands for "not a number".

Data frames are extremely useful for data munging. They provide a large range of operations such as filter, join, and group-by aggregation. Furthermore, pandas.Series is to a 1D array what pandas.DataFrame is to a 2D array.


In [1]:
import numpy as np  # we will need it
import pandas as pd
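
As a quick taste of the three features listed above, here is a minimal sketch (with made-up data, just for illustration) that builds a small DataFrame with mixed column types, labeled axes, and a missing value:

```python
people_df = pd.DataFrame(
    {'name': ['Ada', 'Charles'],   # strings
     'born': [1815, np.nan],       # numbers, with a missing value (NaN)
     'alive': [False, False]},     # booleans
    index=['first', 'second']      # labeled row index
)
people_df
```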

Read data with Pandas

Pandas is able to read data from a bunch of different data formats: CSV, TSV, Excel, HDF5, JSON, etc. The whole list is in the online documentation. In order to read data from a CSV file, we use the read_csv() function, which returns a data frame. By default, it assumes that the fields are comma-separated, but it has parameters to tune the behaviour of the importer.

We're going to use some datasets from Julia Evans' Pandas Cookbook. The first one is about cyclist data from Montréal in 2012.

This bikes dataset is a list of how many people were on 7 different bike paths in Montréal, each day.


In [2]:
bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-e54169c3d1a7> in <module>()
----> 1 bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    418                     infer_datetime_format=infer_datetime_format)
    419 
--> 420         return _read(filepath_or_buffer, kwds)
    421 
    422     parser_f.__name__ = name

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    216 
    217     # Create the parser.
--> 218     parser = TextFileReader(filepath_or_buffer, **kwds)
    219 
    220     if nrows is not None:

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    500             self.options['has_index_names'] = kwds['has_index_names']
    501 
--> 502         self._make_engine(self.engine)
    503 
    504     def _get_options_with_defaults(self, engine):

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    608     def _make_engine(self, engine='c'):
    609         if engine == 'c':
--> 610             self._engine = CParserWrapper(self.f, **self.options)
    611         else:
    612             if engine == 'python':

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
    970         kwds['allow_leading_cols'] = self.index_col is not False
    971 
--> 972         self._reader = _parser.TextReader(src, **kwds)
    973 
    974         # XXX

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/parser.cpython-33m.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4412)()

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/parser.cpython-33m.so in pandas.parser.TextReader._get_header (pandas/parser.c:5924)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 15: invalid continuation byte

Hey! We know that error! It's an encoding error, probably caused by French accents. Fortunately, read_csv() allows us to pass the encoding as an argument.


In [3]:
bikes_df = pd.read_csv('data/bikes.csv', encoding="latin1")
bikes_df


Out[3]:
Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;

5 rows × 1 columns

You'll notice that this is totally broken! Fortunately, read_csv() has a bunch more options that will let us fix it. Here we'll:

  • Change the column separator to a ;
  • Parse the dates in the Date column
  • Tell it that our dates have the day first instead of the month first
  • Set the index to be the Date column

In [4]:
bikes_df = pd.read_csv(
    'data/bikes.csv',
    sep=';',
    encoding='latin1',
    parse_dates=['Date'],
    dayfirst=True,
    index_col='Date'
)
bikes_df


Out[4]:
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN

5 rows × 9 columns

And now let's just remove the missing columns.


In [5]:
# Get rid of missing columns, don't worry, you'll see more on .dropna() later
bikes_df = bikes_df.dropna(axis=1)
bikes_df.head()


Out[5]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-02 83 1 68 153 53 6 43
2012-01-03 135 2 104 248 89 3 58
2012-01-04 144 1 116 318 111 8 61
2012-01-05 197 2 124 330 97 13 95

5 rows × 7 columns

Way better!

In the last example we added head() to show just the first rows. If you want the last ones, use the tail() function. Both can receive the number of rows to show, but it defaults to 5.
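
For instance, to peek at the last three rows (a quick sketch, using the bikes_df we just loaded):

```python
bikes_df.tail(3)  # last 3 rows; with no argument it shows 5
```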

Pandas is also able to get data from the Internet without even batting an eye. The excellent book Introduction to Statistical Learning has one of its data sets available to download.


In [6]:
autos_df = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
autos_df.head()  # print the first lines


Out[6]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15 8 350 165 3693 11.5 70 1 buick skylark 320
2 18 8 318 150 3436 11.0 70 1 plymouth satellite
3 16 8 304 150 3433 12.0 70 1 amc rebel sst
4 17 8 302 140 3449 10.5 70 1 ford torino

5 rows × 9 columns

Slicing

A DataFrame is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.

Here's an example:


In [7]:
bikes_df['Berri 1'].head()


Out[7]:
Date
2012-01-01     35
2012-01-02     83
2012-01-03    135
2012-01-04    144
2012-01-05    197
Name: Berri 1, dtype: int64

For selecting several columns we use the same fancy indexing that we saw for NumPy.


In [8]:
fancy_index = ['Berri 1', 'Rachel1']
bikes_df[fancy_index].head()


Out[8]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

5 rows × 2 columns

But it's very common to just put the fancy index list as the key:


In [9]:
bikes_df[['Berri 1', 'Rachel1']].head()


Out[9]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

5 rows × 2 columns

For slicing rows the syntax is just as expected: the same as we use for lists.


In [10]:
bikes_df[['Berri 1', 'Rachel1']][:5]


Out[10]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

5 rows × 2 columns


In [11]:
bikes_df['Berri 1'][-5]


Out[11]:
2405

Indexing

In a DataFrame every column is a Series, so you can use the index to select specific items. Watch the brackets in the next example, though: double brackets, as in bikes_df[['Berri 1']], return a one-column DataFrame, whereas single brackets return a Series.


In [12]:
berri = bikes_df[['Berri 1']]
berri.head()


Out[12]:
Berri 1
Date
2012-01-01 35
2012-01-02 83
2012-01-03 135
2012-01-04 144
2012-01-05 197

5 rows × 1 columns
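
A quick way to check the difference between the two bracket styles (a sketch, assuming bikes_df as above):

```python
type(bikes_df['Berri 1'])    # pandas Series
type(bikes_df[['Berri 1']])  # pandas DataFrame with one column
```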

Or you can use boolean indexing for selection.


In [13]:
bikes_df[bikes_df['Berri 1'] < 40]


Out[13]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

2 rows × 7 columns

That last one might be a little weird, so let's make it clearer: bikes_df['Berri 1'] < 40 returns a Series of True/False values, which we then pass to our DataFrame bikes_df, getting back only the rows where the mask is True.


In [14]:
less_than_40 = bikes_df['Berri 1'] < 40
print(less_than_40)
bikes_df[less_than_40]


Date
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05    False
2012-01-06    False
2012-01-07    False
2012-01-08    False
2012-01-09    False
2012-01-10    False
2012-01-11    False
2012-01-12    False
2012-01-13    False
2012-01-14     True
2012-01-15    False
...
2012-10-22    False
2012-10-23    False
2012-10-24    False
2012-10-25    False
2012-10-26    False
2012-10-27    False
2012-10-28    False
2012-10-29    False
2012-10-30    False
2012-10-31    False
2012-11-01    False
2012-11-02    False
2012-11-03    False
2012-11-04    False
2012-11-05    False
Name: Berri 1, Length: 310
Out[14]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

2 rows × 7 columns

Series and DataFrame share the same indexing, slicing and selecting operations. What if you aren't sure whether a label is in the Series? You can check using idiomatic Python: the in operator tests membership against the index.

Let's first create a new Series from a dictionary.


In [15]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities


Out[15]:
Austin            450
Boston            NaN
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [16]:
'Seattle' in cities


Out[16]:
False

In [17]:
'San Francisco' in cities


Out[17]:
True

Finally, it's also possible to get a specific row by using the index value and the ix property.


In [18]:
bikes_df.ix['2012-01-01']


Out[18]:
Berri 1                  35
Côte-Sainte-Catherine     0
Maisonneuve 1            38
Maisonneuve 2            51
du Parc                  26
Pierre-Dupuy             10
Rachel1                  16
Name: 2012-01-01 00:00:00, dtype: int64
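
Note that in later versions of pandas the ix property was deprecated; the label-based equivalent is loc, which works the same way here:

```python
bikes_df.loc['2012-01-01']  # same row, via the label-based indexer
```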

Cleaning

Mathematical operations can be done using scalars and functions.


In [19]:
# divide city values by 3
cities / 3


Out[19]:
Austin           150.000000
Boston                  NaN
Chicago          333.333333
New York         433.333333
Portland         300.000000
San Francisco    366.666667
dtype: float64

In [20]:
# square city values
np.square(cities)


Out[20]:
Austin            202500
Boston               NaN
Chicago          1000000
New York         1690000
Portland          810000
San Francisco    1210000
dtype: float64

But what can we do with that NaN value? The easiest way to deal with it is simply to remove it.


In [21]:
cities.dropna()


Out[21]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

Which is equivalent to doing:


In [22]:
cities[cities.notnull()]


Out[22]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

We can also fill the null data or even interpolate its value from the rest of the Series.


In [23]:
cities.fillna(0)


Out[23]:
Austin            450
Boston              0
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [24]:
cities.interpolate()


Out[24]:
Austin            450
Boston            725
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: int64

But null data might come in different ways. The next example illustrates the creation of a DataFrame from a dictionary. There are two blogs with a number of entries per tag. In this case, null data is represented by either the character "?" or the empty string "".


In [25]:
blogs = pd.DataFrame({0: {'blog': 1, 'tag': 'GIS', 'entries': '?'},
                      1: {'blog': 1, 'tag': 'NLP', 'entries': 1638},
                      2: {'blog': 1, 'tag': 'SNA', 'entries': 569},
                      3: {'blog': 1, 'tag': 'DH', 'entries': 115},
                      4: {'blog': 2, 'tag': 'GIS', 'entries': ''},
                      5: {'blog': 2, 'tag': 'NLP', 'entries': 1130},
                      6: {'blog': 2, 'tag': 'SNA', 'entries': 754},
                      7: {'blog': 2, 'tag': 'DH', 'entries': 555}})

In [26]:
blogs


Out[26]:
0 1 2 3 4 5 6 7
blog 1 1 1 1 2 2 2 2
entries ? 1638 569 115 1130 754 555
tag GIS NLP SNA DH GIS NLP SNA DH

3 rows × 8 columns


In [27]:
blogs = blogs.T  # We transpose it to swap rows by columns
blogs


Out[27]:
blog entries tag
0 1 ? GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

8 rows × 3 columns

If we were reading the data from a CSV, we could use the na_values parameter to declare which markers stand for null values, as sketched below. For DataFrames that we already have, the process is more manual. For now, let's just ignore the warning :-)
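
For illustration, this is how the import would look if the blogs data lived in a CSV file (hypothetical path; the file is not in our data folder):

```python
# '?' and empty fields become NaN at import time
blogs_from_csv = pd.read_csv('data/blogs.csv', na_values=['?', ''])
```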


In [28]:
blogs[blogs.entries != "?"][blogs.entries != ""]


/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/core/frame.py:1686: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[28]:
blog entries tag
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

6 rows × 3 columns
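
Incidentally, the warning above hints at a cleaner approach: instead of chaining two selections, combine both conditions into a single boolean Series with the & operator, which selects the same rows without the warning:

```python
blogs[(blogs.entries != "?") & (blogs.entries != "")]
```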

In some cases, we might want to keep the rows, but fill them with real null data.


In [29]:
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
blogs['entries'].replace("", np.NaN, inplace=True)
blogs


Out[29]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

8 rows × 3 columns

Activity

What do you think is the difference between these two lines?
```python
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
blogs['entries'].replace("", np.NaN, inplace=True)
```

Summary statistics

Pandas provides nifty methods to understand your data. Let's look at the describe, correlation, and covariance methods that you can use to quickly make sense of the data.

The describe() method provides quick stats on all suitable columns.


In [30]:
bikes_df.describe()


Out[30]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
count 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000
mean 2985.048387 1233.351613 1983.325806 3510.261290 1862.983871 1054.306452 2873.483871
std 2169.271062 944.643188 1450.715170 2484.959789 1332.543266 1064.029205 2039.315504
min 32.000000 0.000000 33.000000 47.000000 18.000000 0.000000 0.000000
25% 596.000000 243.250000 427.000000 831.000000 474.750000 53.250000 731.000000
50% 3128.000000 1269.000000 2019.500000 3688.500000 1822.500000 704.000000 3223.500000
75% 4973.250000 2003.000000 3168.250000 5731.750000 3069.000000 1818.500000 4717.250000
max 7077.000000 3124.000000 4999.000000 8222.000000 4510.000000 4386.000000 6595.000000

8 rows × 7 columns


In [31]:
blogs.describe()


Out[31]:
entries
count 6.000000
mean 793.500000
std 528.321398
min 115.000000
25% 558.500000
50% 661.500000
75% 1036.000000
max 1638.000000

8 rows × 1 columns


In [32]:
autos_df.describe()


Out[32]:
mpg cylinders displacement weight acceleration year origin
count 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000
mean 23.515869 5.458438 193.532746 2970.261965 15.555668 75.994962 1.574307
std 7.825804 1.701577 104.379583 847.904119 2.749995 3.690005 0.802549
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.000000 2223.000000 13.800000 73.000000 1.000000
50% 23.000000 4.000000 146.000000 2800.000000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3609.000000 17.100000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000

8 rows × 7 columns

Activity

Other summary statistics include: `count()`, `mean()`, `median()`, `quantile()`, `std()`, `var()`, `min()`, `max()`, etc. Play around a bit to see what they do in the `DataFrames` we already have.
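
For example, a few one-liners to get you started (using the frames we already have):

```python
bikes_df['Berri 1'].mean()  # average daily count on the Berri 1 path
autos_df['mpg'].median()    # median fuel efficiency in the Auto dataset
bikes_df.max()              # column-wise maxima
```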

Merging and joining

Throughout an analysis, we'll often need to merge or join datasets, as data is typically stored in a relational manner.

Like SQL's JOIN clause, pandas.merge() allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (on, how, left_on, right_on, left_index, right_index) allowing you to specify the columns or indexes on which to join.

By default, pandas.merge() operates as an inner join, which can be changed using the how parameter. An inner join combines the two DataFrames on a certain key, keeping only the rows whose key appears in both frames; rows without a match on the other side are dropped.

This is always easiest to understand with examples.


In [35]:
left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print(right_frame)


   key left_value
0    0          a
1    1          b
2    2          c
3    3          d
4    4          e

[5 rows x 2 columns]
   key right_value
0    2           f
1    3           g
2    4           h
3    5           i
4    6           j

[5 rows x 2 columns]

inner join (default)


In [36]:
pd.merge(left_frame, right_frame, on='key', how='inner')


Out[36]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h

3 rows × 3 columns

We lose values from both frames since certain keys do not match up. If our key columns had different names, we could have used the left_on and right_on parameters to specify which fields to join from each frame:

pd.merge(left_frame, right_frame, left_on='left_key', right_on='right_key')

Alternatively, if our keys were indexes, we could use the left_index or right_index parameters, which accept a boolean value. You can mix and match columns and indexes like this:

pd.merge(left_frame, right_frame, left_on='key', right_index=True)

left outer join


In [37]:
pd.merge(left_frame, right_frame, on='key', how='left')


Out[37]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h

5 rows × 3 columns

We keep everything from the left frame, pulling in the values from the right frame where the keys match up. The right_value is null (NaN) where keys do not match.

right outer join


In [38]:
pd.merge(left_frame, right_frame, on='key', how='right')


Out[38]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h
3 5 NaN i
4 6 NaN j

5 rows × 3 columns

This time we've kept everything from the right frame with the left_value being null where the right frame's key did not find a match.

full outer join


In [39]:
pd.merge(left_frame, right_frame, on='key', how='outer')


Out[39]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h
5 5 NaN i
6 6 NaN j

7 rows × 3 columns

We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are null.

concat

Pandas also provides a way to combine DataFrames along an axis: pandas.concat(). It takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.


In [40]:
pd.concat([left_frame, right_frame])


Out[40]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c NaN
3 3 d NaN
4 4 e NaN
0 2 NaN f
1 3 NaN g
2 4 NaN h
3 5 NaN i
4 6 NaN j

10 rows × 3 columns

By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be null.

Additionally, objects can be concatenated side-by-side using the function's axis parameter. The axis labeling information in Pandas objects serves many purposes:

  • Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
  • Enables automatic and explicit data alignment.
  • Allows intuitive getting and setting of subsets of the data set.

In [41]:
pd.concat([left_frame, right_frame], axis=1)


Out[41]:
key left_value key right_value
0 0 a 2 f
1 1 b 3 g
2 2 c 4 h
3 3 d 5 i
4 4 e 6 j

5 rows × 4 columns

Group by

The groupby() method lets you perform grouping operations. The example below shows a grouping operation performed with the blog column as the key. It is used to calculate the mean of the entries for each blog.


In [42]:
blogs.groupby('blog').mean()


Out[42]:
entries
blog
1 774
2 813

2 rows × 1 columns

And now for a more complex example. From our bikes_df DataFrame, we want to know whether people are more active on weekdays or on weekends. We are going to add a new column with the information about the weekday, and then sum all the values for each weekday. We do that by grouping by weekday and aggregating the values by simple addition. In this case we use np.sum(), but any other operation, such as np.max() or np.average(), can be used.


In [43]:
bikes_df.index


Out[43]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01, ..., 2012-11-05]
Length: 310, Freq: None, Timezone: None

The data frame's index is of type DatetimeIndex, which allows us to extract information just as we would from a standard Python datetime object.


In [44]:
bikes_df.index.weekday


Out[44]:
array([6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5,
       6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0], dtype=int32)

In order to create the new column, we just need to pick a name and assign a Series value as if it were a dictionary (remember that DataFrame and Series are labeled).


In [45]:
bikes_df['weekday'] = bikes_df.index.weekday
bikes_df.head()


Out[45]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 weekday
Date
2012-01-01 35 0 38 51 26 10 16 6
2012-01-02 83 1 68 153 53 6 43 0
2012-01-03 135 2 104 248 89 3 58 1
2012-01-04 144 1 116 318 111 8 61 2
2012-01-05 197 2 124 330 97 13 95 3

5 rows × 8 columns


In [46]:
bikes_df.groupby('weekday')


Out[46]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f403a9eed90>

In [47]:
counts_by_day = bikes_df.groupby('weekday').aggregate(np.sum)
counts_by_day


Out[47]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
weekday
0 134298 60329 90051 163767 90184 46204 130130
1 135305 58708 92035 165880 91399 35167 120088
2 152972 67344 104891 186061 102103 43263 133088
3 160131 69028 111895 196715 105674 45385 140241
4 141771 56446 98568 172390 89872 42470 137255
5 101578 34018 62067 105060 48289 52861 112828
6 99310 36466 55324 98308 50004 61485 117150

7 rows × 7 columns

And finally a bit of matplotlib + pandas magic! But don't worry, you'll learn about plotting in the next class.


In [48]:
import matplotlib
# display graphs inline
%matplotlib inline 
# Make the figures bigger
matplotlib.rc('figure', figsize=(14, 7))

counts_by_day.plot()


Out[48]:
<matplotlib.axes.AxesSubplot at 0x7f40389f9dd0>

This is certainly not all we will see about grouping, but it's a good start if you already understand what's going on here.

Art Bank collection

The Art Bank collection is a dataset provided by data.gc.ca, a key part of Canada's Action Plan on Open Government to enhance transparency and accountability, that contains the art works in the Canada Council Art Bank collection. For 40 years, the Art Bank has been collecting work by the best Canadian artists of our times. The Art Bank is part of the Canada Council for the Arts, Canada's national arts funder. It's licensed under the terms of the Open Government Licence - Canada.

And... it's in XML! XML stands for eXtensible Markup Language, and if you want to see an example right now, just press Ctrl+U and the browser will (likely) show you the HTML under this web page. Because yes, HTML is a close relative of XML. So now you have an idea.

XML is something that you should know about already. Very basically, XML defines a hierarchical structure in which concepts are enclosed in tags that might have properties.

<art_list>
    <art_work title="Title of the artwork" />
</art_list>

In the last example, art_list is a tag, and <art_list>...</art_list> is how you enclose something in a tag. Sometimes tags might not have anything to enclose, as in <art_work />, so they are self-closing. Finally, properties, like title above, behave just like dictionary entries.

In Python, the fastest way to get all the concepts we need is usually by finding all the tags of a certain kind. The module lxml.etree is able to do that for us.
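
Before touching the real dataset, here is a minimal sketch of how lxml handles the toy snippet above, parsing it from a string and reading an attribute:

```python
from lxml import etree

root = etree.fromstring(
    '<art_list><art_work title="Title of the artwork"/></art_list>')
work = root.find('art_work')  # the first art_work element
work.get('title')             # -> 'Title of the artwork'
```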

Now, let's see some lines of the dataset by using the special IPython command !head.


In [49]:
!head data/art_listing.xml


[output not preserved: the first lines of art_listing.xml, a list of Art_Summary elements]
From here, it's pretty obvious that what we want are the Art_Summary elements and their properties.

The function etree.parse() opens a file, parses it into a Python data structure and returns it as an object.


In [50]:
from lxml import etree
arts = etree.parse("data/art_listing.xml")

To find all Art_Summary elements, we can call the method .findall().


In [51]:
summaries = arts.findall("Art_Summary")
summaries[:5]


Out[51]:
[<Element Art_Summary at 0x7f4038a22200>,
 <Element Art_Summary at 0x7f40386ab3b0>,
 <Element Art_Summary at 0x7f403874f368>,
 <Element Art_Summary at 0x7f403870da70>,
 <Element Art_Summary at 0x7f403a9e9e60>]

Each of those elements in the list behaves like a dictionary.


In [52]:
summaries[0].keys()


Out[52]:
['Artist_Name', 'Title', 'Execution_Date', 'E_Category', 'F_Category', 'ID']

In [53]:
summaries[0].values()


Out[53]:
['Wright, Don',
 'Turf-Arm',
 '1984',
 'Work on paper',
 'Oeuvre sur papier',
 '85/6-0174']

Now we are ready to build a data frame from the summaries list.


In [54]:
art_list = []
for summary in summaries:
    art_list.append(summary.values())
arts = pd.DataFrame(
    art_list,
    columns=['artist_name', 'title', 'execution_date', 'category', 'category_french', 'id']
)
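
Equivalently, and a bit more idiomatically, the loop can be folded into a list comprehension (same resulting frame):

```python
arts = pd.DataFrame(
    [summary.values() for summary in summaries],
    columns=['artist_name', 'title', 'execution_date',
             'category', 'category_french', 'id']
)
```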

Let's remove the category in French.


In [55]:
del arts['category_french']

In [56]:
arts.head()


Out[56]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 ©1977, CARCC Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland ©1982, CARCC Photograph 92/3-0221
3 Walker, Peter 1.25.79 ©1979, CARCC Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

5 rows × 5 columns

Voilà! Ready for the fun.

Activity

Given the `arts` data frame, try the following:

  1. Clean the dates so you only see numbers.
  2. Get the average execution year per artist.
  3. Get the average execution year per category.
  4. Get the number of artworks per artist. Which artist is the most prolific?
  5. Get the number of artworks per category. Which category has the highest number?
  6. Get the average length of artworks titles per category and artist.
  7. Get the year with the highest production.
  8. Get the approximate period of production for each artist.
  9. Is there any correlation between the periods of production and the number of artworks per artist?

For the next class