[Data, the Humanist's New Best Friend](index.ipynb)
*Class 06*

In this class you are expected to learn:

  1. Pandas
  2. Cleaning data
  3. Summary statistics
  4. Indexing
  5. Merging, joining
  6. Group by

Pandas

For those of you who have heard of R, a statistics-oriented language, Pandas is the R of Python. It provides easy-to-use data structures and a ton of useful helper functions for data cleanup and transformation. It's fast because it is backed by NumPy arrays, and it integrates well with other Python libraries such as scikit-learn for statistical learning, or matplotlib for plotting.

Pandas provides a key data structure: the pandas.DataFrame, which is like a super rich 2D array that addresses three deficiencies of arrays:

  • They hold heterogeneous data; each column can have its own numpy.dtype, so a single row can contain numbers, strings, and dates.
  • The axes of a DataFrame are labeled with column names and row indices, which makes slicing and indexing easier.
  • They account for missing values, which arrays do not directly support, in the form of NaN, which stands for "not a number".

Data frames are extremely useful for data munging. They provide a large range of operations such as filter, join, and group-by aggregation. Furthermore, pandas.Series is to a 1D array what pandas.DataFrame is to a 2D array.


In [1]:
import numpy as np  # we will need it
import pandas as pd
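
As a quick taste of the three features listed above, here is a minimal sketch (with made-up data, just for illustration) that builds a small DataFrame with mixed column types, labeled axes, and a missing value:

```python
people_df = pd.DataFrame(
    {'name': ['Ada', 'Charles'],   # strings
     'born': [1815, np.nan],       # numbers, with a missing value (NaN)
     'alive': [False, False]},     # booleans
    index=['first', 'second']      # labeled row index
)
people_df
```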

Read data with Pandas

Pandas is able to read data from a bunch of different data formats: CSV, TSV, Excel, HDF5, JSON, etc. The whole list is in the online documentation. In order to read data from a CSV file, we use the read_csv() function, which returns a data frame. By default, it assumes that the fields are comma-separated, but it has parameters to tune the behaviour of the importer.

We're going to use some datasets from Julia Evans' Pandas Cookbook. The first one is about cyclist data from Montréal in 2012.

This bikes dataset is a list of how many people were on 7 different bike paths in Montréal, each day.


In [2]:
bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-e54169c3d1a7> in <module>()
----> 1 bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    418                     infer_datetime_format=infer_datetime_format)
    419 
--> 420         return _read(filepath_or_buffer, kwds)
    421 
    422     parser_f.__name__ = name

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    216 
    217     # Create the parser.
--> 218     parser = TextFileReader(filepath_or_buffer, **kwds)
    219 
    220     if nrows is not None:

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    500             self.options['has_index_names'] = kwds['has_index_names']
    501 
--> 502         self._make_engine(self.engine)
    503 
    504     def _get_options_with_defaults(self, engine):

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    608     def _make_engine(self, engine='c'):
    609         if engine == 'c':
--> 610             self._engine = CParserWrapper(self.f, **self.options)
    611         else:
    612             if engine == 'python':

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
    970         kwds['allow_leading_cols'] = self.index_col is not False
    971 
--> 972         self._reader = _parser.TextReader(src, **kwds)
    973 
    974         # XXX

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/parser.cpython-33m.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4412)()

/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/parser.cpython-33m.so in pandas.parser.TextReader._get_header (pandas/parser.c:5924)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 15: invalid continuation byte

Hey! We know that error! It's an encoding error, probably caused by French accents. Fortunately, read_csv() allows us to pass the encoding as an argument.


In [3]:
bikes_df = pd.read_csv('data/bikes.csv', encoding="latin1")
bikes_df


Out[3]:
Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;

5 rows × 1 columns

You'll notice that this is totally broken! Fortunately, read_csv() has a bunch more options that will let us fix it. Here we'll:

  • Change the column separator to a ;
  • Parse the dates in the Date column
  • Tell it that our dates have the day first instead of the month first
  • Set the index to be the Date column

In [4]:
bikes_df = pd.read_csv(
    'data/bikes.csv',
    sep=';',
    encoding='latin1',
    parse_dates=['Date'],
    dayfirst=True,
    index_col='Date'
)
bikes_df


Out[4]:
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN

5 rows × 9 columns

And now let's just remove the missing columns.


In [5]:
# Get rid of missing columns, don't worry, you'll see more on .dropna() later
bikes_df = bikes_df.dropna(axis=1)
bikes_df.head()


Out[5]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-02 83 1 68 153 53 6 43
2012-01-03 135 2 104 248 89 3 58
2012-01-04 144 1 116 318 111 8 61
2012-01-05 197 2 124 330 97 13 95

5 rows × 7 columns

Way better!

In the last example we added head() to show just the first rows. If you want the last ones, use the tail() function. Both can receive the number of rows to show, but it defaults to 5.
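
For instance, to peek at the last three rows (a quick sketch, using the bikes_df we just loaded):

```python
bikes_df.tail(3)  # last 3 rows; with no argument it shows 5
```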

Pandas is also able to get data from the Internet without even batting an eye. The excellent book Introduction to Statistical Learning has one of its data sets available to download.


In [6]:
autos_df = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
autos_df.head()  # print the first lines


Out[6]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15 8 350 165 3693 11.5 70 1 buick skylark 320
2 18 8 318 150 3436 11.0 70 1 plymouth satellite
3 16 8 304 150 3433 12.0 70 1 amc rebel sst
4 17 8 302 140 3449 10.5 70 1 ford torino

5 rows × 9 columns

Slicing

A DataFrame is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.

Here's an example:


In [7]:
bikes_df['Berri 1'].head()


Out[7]:
Date
2012-01-01     35
2012-01-02     83
2012-01-03    135
2012-01-04    144
2012-01-05    197
Name: Berri 1, dtype: int64

For selecting several columns we use the same fancy indexing that we saw for NumPy.


In [8]:
fancy_index = ['Berri 1', 'Rachel1']
bikes_df[fancy_index].head()


Out[8]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

5 rows × 2 columns

But it's very common to just put the fancy index list as the key:


In [9]:
bikes_df[['Berri 1', 'Rachel1']].head()


Out[9]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

5 rows × 2 columns

For slicing rows the syntax is just as expected: the same as we use for lists.


In [10]:
bikes_df[['Berri 1', 'Rachel1']][:5]


Out[10]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

5 rows × 2 columns


In [11]:
bikes_df['Berri 1'][-5]


Out[11]:
2405

Indexing

In a DataFrame every column is a Series, so you can use the index to select specific items. Watch the brackets in the next example, though: double brackets, as in bikes_df[['Berri 1']], return a one-column DataFrame, whereas single brackets return a Series.


In [12]:
berri = bikes_df[['Berri 1']]
berri.head()


Out[12]:
Berri 1
Date
2012-01-01 35
2012-01-02 83
2012-01-03 135
2012-01-04 144
2012-01-05 197

5 rows × 1 columns
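
A quick way to check the difference between the two bracket styles (a sketch, assuming bikes_df as above):

```python
type(bikes_df['Berri 1'])    # pandas Series
type(bikes_df[['Berri 1']])  # pandas DataFrame with one column
```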

Or you can use boolean indexing for selection.


In [13]:
bikes_df[bikes_df['Berri 1'] < 40]


Out[13]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

2 rows × 7 columns

That last one might be a little weird, so let's make it clearer: bikes_df['Berri 1'] < 40 returns a Series of True/False values, which we then pass to our DataFrame bikes_df, getting back only the rows where the mask is True.


In [14]:
less_than_40 = bikes_df['Berri 1'] < 40
print(less_than_40)
bikes_df[less_than_40]


Date
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05    False
2012-01-06    False
2012-01-07    False
2012-01-08    False
2012-01-09    False
2012-01-10    False
2012-01-11    False
2012-01-12    False
2012-01-13    False
2012-01-14     True
2012-01-15    False
...
2012-10-22    False
2012-10-23    False
2012-10-24    False
2012-10-25    False
2012-10-26    False
2012-10-27    False
2012-10-28    False
2012-10-29    False
2012-10-30    False
2012-10-31    False
2012-11-01    False
2012-11-02    False
2012-11-03    False
2012-11-04    False
2012-11-05    False
Name: Berri 1, Length: 310
Out[14]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

2 rows × 7 columns

Series and DataFrame share the same indexing, slicing and selecting operations. What if you aren't sure whether a label is in the Series? You can check using idiomatic Python: the in operator tests membership against the index.

Let's first create a new Series from a dictionary.


In [15]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities


Out[15]:
Austin            450
Boston            NaN
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [16]:
'Seattle' in cities


Out[16]:
False

In [17]:
'San Francisco' in cities


Out[17]:
True

Finally, it's also possible to get a specific row by using the index value and the ix property.


In [18]:
bikes_df.ix['2012-01-01']


Out[18]:
Berri 1                  35
Côte-Sainte-Catherine     0
Maisonneuve 1            38
Maisonneuve 2            51
du Parc                  26
Pierre-Dupuy             10
Rachel1                  16
Name: 2012-01-01 00:00:00, dtype: int64
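
Note that in later versions of pandas the ix property was deprecated; the label-based equivalent is loc, which works the same way here:

```python
bikes_df.loc['2012-01-01']  # same row, via the label-based indexer
```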

Cleaning

Mathematical operations can be done using scalars and functions.


In [19]:
# divide city values by 3
cities / 3


Out[19]:
Austin           150.000000
Boston                  NaN
Chicago          333.333333
New York         433.333333
Portland         300.000000
San Francisco    366.666667
dtype: float64

In [20]:
# square city values
np.square(cities)


Out[20]:
Austin            202500
Boston               NaN
Chicago          1000000
New York         1690000
Portland          810000
San Francisco    1210000
dtype: float64

But what can we do with that NaN value? The easiest way to deal with it is simply to remove it.


In [21]:
cities.dropna()


Out[21]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

Which is equivalent to doing:


In [22]:
cities[cities.notnull()]


Out[22]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

We can also fill the null data or even interpolate its value from the rest of the Series.


In [23]:
cities.fillna(0)


Out[23]:
Austin            450
Boston              0
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [24]:
cities.interpolate()


Out[24]:
Austin            450
Boston            725
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: int64

But null data might come in different ways. The next example illustrates the creation of a DataFrame from a dictionary. There are two blogs with a number of entries per tag. In this case, null data is represented by either the character "?" or the empty string "".


In [25]:
blogs = pd.DataFrame({0: {'blog': 1, 'tag': 'GIS', 'entries': '?'},
                      1: {'blog': 1, 'tag': 'NLP', 'entries': 1638},
                      2: {'blog': 1, 'tag': 'SNA', 'entries': 569},
                      3: {'blog': 1, 'tag': 'DH', 'entries': 115},
                      4: {'blog': 2, 'tag': 'GIS', 'entries': ''},
                      5: {'blog': 2, 'tag': 'NLP', 'entries': 1130},
                      6: {'blog': 2, 'tag': 'SNA', 'entries': 754},
                      7: {'blog': 2, 'tag': 'DH', 'entries': 555}})

In [26]:
blogs


Out[26]:
0 1 2 3 4 5 6 7
blog 1 1 1 1 2 2 2 2
entries ? 1638 569 115 1130 754 555
tag GIS NLP SNA DH GIS NLP SNA DH

3 rows × 8 columns


In [27]:
blogs = blogs.T  # We transpose it to swap rows by columns
blogs


Out[27]:
blog entries tag
0 1 ? GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

8 rows × 3 columns

If we were reading the data from a CSV, we could use the na_values parameter to declare which markers stand for null values, as sketched below. For DataFrames that we already have, the process is more manual. For now, let's just ignore the warning :-)
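
For illustration, this is how the import would look if the blogs data lived in a CSV file (hypothetical path; the file is not in our data folder):

```python
# '?' and empty fields become NaN at import time
blogs_from_csv = pd.read_csv('data/blogs.csv', na_values=['?', ''])
```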


In [28]:
blogs[blogs.entries != "?"][blogs.entries != ""]


/home/versae/.venvs/data/lib/python3.3/site-packages/pandas/core/frame.py:1686: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[28]:
blog entries tag
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

6 rows × 3 columns
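
Incidentally, the warning above hints at a cleaner approach: instead of chaining two selections, combine both conditions into a single boolean Series with the & operator, which selects the same rows without the warning:

```python
blogs[(blogs.entries != "?") & (blogs.entries != "")]
```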

In some cases, we might want to keep the rows, but fill them with real null data.


In [29]:
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
blogs['entries'].replace("", np.NaN, inplace=True)
blogs


Out[29]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

8 rows × 3 columns

Activity

What do you think is the difference between these two lines?
```python
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
blogs['entries'].replace("", np.NaN, inplace=True)
```

Summary statistics

Pandas provides nifty methods to understand your data. Let's look at the describe, correlation, and covariance methods that you can use to quickly make sense of the data.

The describe() method provides quick stats on all suitable columns.


In [30]:
bikes_df.describe()


Out[30]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
count 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000
mean 2985.048387 1233.351613 1983.325806 3510.261290 1862.983871 1054.306452 2873.483871
std 2169.271062 944.643188 1450.715170 2484.959789 1332.543266 1064.029205 2039.315504
min 32.000000 0.000000 33.000000 47.000000 18.000000 0.000000 0.000000
25% 596.000000 243.250000 427.000000 831.000000 474.750000 53.250000 731.000000
50% 3128.000000 1269.000000 2019.500000 3688.500000 1822.500000 704.000000 3223.500000
75% 4973.250000 2003.000000 3168.250000 5731.750000 3069.000000 1818.500000 4717.250000
max 7077.000000 3124.000000 4999.000000 8222.000000 4510.000000 4386.000000 6595.000000

8 rows × 7 columns


In [31]:
blogs.describe()


Out[31]:
entries
count 6.000000
mean 793.500000
std 528.321398
min 115.000000
25% 558.500000
50% 661.500000
75% 1036.000000
max 1638.000000

8 rows × 1 columns


In [32]:
autos_df.describe()


Out[32]:
mpg cylinders displacement weight acceleration year origin
count 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000
mean 23.515869 5.458438 193.532746 2970.261965 15.555668 75.994962 1.574307
std 7.825804 1.701577 104.379583 847.904119 2.749995 3.690005 0.802549
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.000000 2223.000000 13.800000 73.000000 1.000000
50% 23.000000 4.000000 146.000000 2800.000000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3609.000000 17.100000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000

8 rows × 7 columns

Activity

Other summary statistics include: `count()`, `mean()`, `median()`, `quantile()`, `std()`, `var()`, `min()`, `max()`, etc. Play around a bit to see what they do in the `DataFrames` we already have.
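
For example, a few one-liners to get you started (using the frames we already have):

```python
bikes_df['Berri 1'].mean()  # average daily count on the Berri 1 path
autos_df['mpg'].median()    # median fuel efficiency in the Auto dataset
bikes_df.max()              # column-wise maxima
```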

Merging and joining

Throughout an analysis, we'll often need to merge or join datasets, as data is typically stored in a relational manner.

Like SQL's JOIN clause, pandas.merge() allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (on, how, left_on, right_on, left_index, right_index) allowing you to specify the columns or indexes on which to join.

By default, pandas.merge() operates as an inner join, which can be changed using the how parameter. An inner join combines the two DataFrames on a certain key, keeping only the rows whose key appears in both frames; rows without a match on the other side are dropped.

This is always easiest to understand with examples.


In [35]:
left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print(right_frame)


   key left_value
0    0          a
1    1          b
2    2          c
3    3          d
4    4          e

[5 rows x 2 columns]
   key right_value
0    2           f
1    3           g
2    4           h
3    5           i
4    6           j

[5 rows x 2 columns]

inner join (default)


In [36]:
pd.merge(left_frame, right_frame, on='key', how='inner')


Out[36]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h

3 rows × 3 columns

We lose values from both frames since certain keys do not match up. If our key columns had different names, we could have used the left_on and right_on parameters to specify which fields to join from each frame:

pd.merge(left_frame, right_frame, left_on='left_key', right_on='right_key')

Alternatively, if our keys were indexes, we could use the left_index or right_index parameters, which accept a boolean value. You can mix and match columns and indexes like this:

pd.merge(left_frame, right_frame, left_on='key', right_index=True)

left outer join


In [37]:
pd.merge(left_frame, right_frame, on='key', how='left')


Out[37]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h

5 rows × 3 columns

We keep everything from the left frame, pulling in the values from the right frame where the keys match up. The right_value is null (NaN) where keys do not match.

right outer join


In [38]:
pd.merge(left_frame, right_frame, on='key', how='right')


Out[38]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h
3 5 NaN i
4 6 NaN j

5 rows × 3 columns

This time we've kept everything from the right frame with the left_value being null where the right frame's key did not find a match.

full outer join


In [39]:
pd.merge(left_frame, right_frame, on='key', how='outer')


Out[39]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h
5 5 NaN i
6 6 NaN j

7 rows × 3 columns

We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are null.

concat

Pandas also provides a way to combine DataFrames along an axis: pandas.concat(). It takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.


In [40]:
pd.concat([left_frame, right_frame])


Out[40]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c NaN
3 3 d NaN
4 4 e NaN
0 2 NaN f
1 3 NaN g
2 4 NaN h
3 5 NaN i
4 6 NaN j

10 rows × 3 columns

By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be null.

Additionally, objects can be concatenated side-by-side using the function's axis parameter. The axis labeling information in Pandas objects serves many purposes:

  • Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
  • Enables automatic and explicit data alignment.
  • Allows intuitive getting and setting of subsets of the data set.

In [41]:
pd.concat([left_frame, right_frame], axis=1)


Out[41]:
key left_value key right_value
0 0 a 2 f
1 1 b 3 g
2 2 c 4 h
3 3 d 5 i
4 4 e 6 j

5 rows × 4 columns

Group by

The groupby() method lets you perform grouping operations. The example below shows a grouping operation performed with the blog column as the key. It is used to calculate the mean of the entries for each blog.


In [42]:
blogs.groupby('blog').mean()


Out[42]:
entries
blog
1 774
2 813

2 rows × 1 columns

And now for a more complex example. From our bikes_df DataFrame, we want to know whether people are more active on weekdays or on weekends. We are going to add a new column with the information about the weekday, and then sum all the values for each weekday. We do that by grouping by weekday and aggregating the values by simple addition. In this case we use np.sum(), but any other operation, such as np.max() or np.average(), can be used.


In [43]:
bikes_df.index


Out[43]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01, ..., 2012-11-05]
Length: 310, Freq: None, Timezone: None

The data frame's index is of type DatetimeIndex, which allows us to extract information just as we would from a standard Python datetime object.


In [44]:
bikes_df.index.weekday


Out[44]:
array([6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5,
       6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0], dtype=int32)

In order to create the new column, we just need to pick a name and assign a Series value as if it were a dictionary (remember that DataFrame and Series are labeled).


In [45]:
bikes_df['weekday'] = bikes_df.index.weekday
bikes_df.head()


Out[45]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 weekday
Date
2012-01-01 35 0 38 51 26 10 16 6
2012-01-02 83 1 68 153 53 6 43 0
2012-01-03 135 2 104 248 89 3 58 1
2012-01-04 144 1 116 318 111 8 61 2
2012-01-05 197 2 124 330 97 13 95 3

5 rows × 8 columns


In [46]:
bikes_df.groupby('weekday')


Out[46]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f403a9eed90>

In [47]:
counts_by_day = bikes_df.groupby('weekday').aggregate(np.sum)
counts_by_day


Out[47]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
weekday
0 134298 60329 90051 163767 90184 46204 130130
1 135305 58708 92035 165880 91399 35167 120088
2 152972 67344 104891 186061 102103 43263 133088
3 160131 69028 111895 196715 105674 45385 140241
4 141771 56446 98568 172390 89872 42470 137255
5 101578 34018 62067 105060 48289 52861 112828
6 99310 36466 55324 98308 50004 61485 117150

7 rows × 7 columns

And finally a bit of matplotlib + pandas magic! But don't worry, you'll learn about plotting in the next class.


In [48]:
import matplotlib
# display graphs inline
%matplotlib inline 
# Make the figures bigger
matplotlib.rc('figure', figsize=(14, 7))

counts_by_day.plot()


Out[48]:
<matplotlib.axes.AxesSubplot at 0x7f40389f9dd0>

This is certainly not all we will see about grouping, but it's a good start if you already understand what's going on here.

Art Bank collection

The Art Bank collection is a dataset provided by data.gc.ca, a key part of Canada's Action Plan on Open Government to enhance transparency and accountability, that contains the art works in the Canada Council Art Bank collection. For 40 years, the Art Bank has been collecting work by the best Canadian artists of our times. The Art Bank is part of the Canada Council for the Arts, Canada's national arts funder. It's licensed under the terms of the Open Government Licence - Canada.

And... it's in XML! XML stands for eXtensible Markup Language, and if you want to see an example right now, just press Ctrl+U and the browser will (likely) show you the HTML under this web page. Because yes, HTML is a close relative of XML. So now you have an idea.

XML is something that you should know about already. Very basically, XML defines a hierarchical structure in which concepts are enclosed in tags that might have properties.

<art_list>
    <art_work title="Title of the artwork" />
</art_list>

In the last example, art_list is a tag, and <art_list>...</art_list> is how you enclose something in a tag. Sometimes tags might not have anything to enclose, as in <art_work />, so they are self-closing. Finally, properties, like title above, behave just like dictionary entries.

In Python, the fastest way to get all the concepts we need is usually by finding all the tags of a certain kind. The module lxml.etree is able to do that for us.
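
Before touching the real dataset, here is a minimal sketch of how lxml handles the toy snippet above, parsing it from a string and reading an attribute:

```python
from lxml import etree

root = etree.fromstring(
    '<art_list><art_work title="Title of the artwork"/></art_list>')
work = root.find('art_work')  # the first art_work element
work.get('title')             # -> 'Title of the artwork'
```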

Now, let's see some lines of the dataset by using the special IPython command !head.


In [49]:
!head data/art_listing.xml


[output not preserved: the first lines of art_listing.xml, a list of Art_Summary elements]
From here, it's pretty obvious that what we want are the Art_Summary elements and their properties.

The function etree.parse() opens a file, parses it into a Python data structure and returns it as an object.


In [50]:
from lxml import etree
arts = etree.parse("data/art_listing.xml")

To find all Art_Summary elements, we can call the method .findall().


In [51]:
summaries = arts.findall("Art_Summary")
summaries[:5]


Out[51]:
[<Element Art_Summary at 0x7f4038a22200>,
 <Element Art_Summary at 0x7f40386ab3b0>,
 <Element Art_Summary at 0x7f403874f368>,
 <Element Art_Summary at 0x7f403870da70>,
 <Element Art_Summary at 0x7f403a9e9e60>]

Each of those elements in the list behaves like a dictionary.


In [52]:
summaries[0].keys()


Out[52]:
['Artist_Name', 'Title', 'Execution_Date', 'E_Category', 'F_Category', 'ID']

In [53]:
summaries[0].values()


Out[53]:
['Wright, Don',
 'Turf-Arm',
 '1984',
 'Work on paper',
 'Oeuvre sur papier',
 '85/6-0174']

Now we are ready to build a data frame from the summaries list.


In [54]:
art_list = []
for summary in summaries:
    art_list.append(summary.values())
arts = pd.DataFrame(
    art_list,
    columns=['artist_name', 'title', 'execution_date', 'category', 'category_french', 'id']
)
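
Equivalently, and a bit more idiomatically, the loop can be folded into a list comprehension (same resulting frame):

```python
arts = pd.DataFrame(
    [summary.values() for summary in summaries],
    columns=['artist_name', 'title', 'execution_date',
             'category', 'category_french', 'id']
)
```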

Let's remove the category in French.


In [55]:
del arts['category_french']

In [56]:
arts.head()


Out[56]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 ©1977, CARCC Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland ©1982, CARCC Photograph 92/3-0221
3 Walker, Peter 1.25.79 ©1979, CARCC Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

5 rows × 5 columns

Voilà! Ready for the fun.

Activity

Given the `arts` data frame, try the following:

  1. Clean the dates so you only see numbers.
  2. Get the average execution year per artist.
  3. Get the average execution year per category.
  4. Get the number of artworks per artist. Which artist is the most prolific?
  5. Get the number of artworks per category. Which category has the highest number?
  6. Get the average length of artworks titles per category and artist.
  7. Get the year with the highest production.
  8. Get the approximate period of production for each artist.
  9. Is there any correlation between the periods of production and the number of artworks per artist?

For the next class