[Data, the Humanist's New Best Friend](index.ipynb)
*Class 06*

In this class you are expected to learn:

  1. Pandas
  2. Cleaning data
  3. Summary statistics
  4. Indexing
  5. Merging, joining

Pandas

For those of you who have heard of R, a statistics-oriented language, Pandas is the R for Python. It provides easy-to-use data structures and a ton of useful helper functions for data cleanup and transformation. It's fast because it is backed by NumPy arrays, and it integrates well with other Python libraries such as scikit-learn for statistical learning, or matplotlib for plotting.

Pandas provides a key data structure, the pandas.DataFrame, which is like a super-rich 2D array that addresses three deficiencies of plain arrays:

  • They hold heterogeneous data; each column can have its own numpy.dtype, so the same row can contain numbers, strings, and dates.
  • The axes of a DataFrame are labeled with column names and row indices, which makes slicing and indexing easier.
  • They account for missing values, which plain arrays do not directly support, in the form of NaN, which stands for "not a number".

Data frames are extremely useful for data munging. They provide a large range of operations such as filter, join, and group-by aggregation. Furthermore, pandas.Series is to a 1D array what pandas.DataFrame is to a 2D array.


In [2]:
import numpy as np  # we will need it
import pandas as pd
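To make the relationship between the two structures concrete, here is a minimal sketch with made-up values (the names are illustrative only):

```
# A Series is a labeled 1D array...
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# ...and a DataFrame is a labeled 2D table whose columns are Series
df = pd.DataFrame({'count': s, 'label': ['x', 'y', 'z']})
```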

Read data with Pandas

Pandas is able to read data from a bunch of different data formats: CSV, TSV, Excel, HDF5, JSON, etc. The whole list is in the online documentation. In order to read data from a CSV file, we use the read_csv() function, which returns a data frame. By default it assumes that the fields are comma-separated, but it has parameters to tune the behaviour of the importer.

We're going to use some datasets from Julia Evans' Pandas Cookbook. The first one is about cyclist data from Montréal in 2012.

This bikes dataset records how many people rode on each of 7 different bike paths in Montréal, each day.


In [2]:
bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-e54169c3d1a7> in <module>()
----> 1 bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    463                     skip_blank_lines=skip_blank_lines)
    464 
--> 465         return _read(filepath_or_buffer, kwds)
    466 
    467     parser_f.__name__ = name

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    239 
    240     # Create the parser.
--> 241     parser = TextFileReader(filepath_or_buffer, **kwds)
    242 
    243     if (nrows is not None) and (chunksize is not None):

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    555             self.options['has_index_names'] = kwds['has_index_names']
    556 
--> 557         self._make_engine(self.engine)
    558 
    559     def _get_options_with_defaults(self, engine):

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    692     def _make_engine(self, engine='c'):
    693         if engine == 'c':
--> 694             self._engine = CParserWrapper(self.f, **self.options)
    695         else:
    696             if engine == 'python':

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1059         kwds['allow_leading_cols'] = self.index_col is not False
   1060 
-> 1061         self._reader = _parser.TextReader(src, **kwds)
   1062 
   1063         # XXX

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4710)()

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._get_header (pandas/parser.c:6438)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 15: invalid continuation byte

Hey! We know that error! It's an encoding error, probably caused by French accents. Fortunately, read_csv() lets us pass the encoding as an argument.


In [3]:
bikes_df = pd.read_csv('data/bikes.csv', encoding="latin1")
bikes_df


Out[3]:
Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;
5 06/01/2012;146;;0;98;244;86;4;75;
6 07/01/2012;98;;2;80;108;53;6;54;
7 08/01/2012;95;;1;62;98;64;11;63;
8 09/01/2012;244;;2;165;432;198;12;173;
9 10/01/2012;397;;3;238;563;275;18;241;
10 11/01/2012;273;;0;182;443;258;12;194;
11 12/01/2012;157;;1;134;261;137;9;63;
12 13/01/2012;75;;0;41;105;64;2;0;
13 14/01/2012;32;;0;54;56;19;0;1;
14 15/01/2012;54;;0;33;60;18;0;0;
15 16/01/2012;168;;2;136;312;137;1;0
16 17/01/2012;155;;0;86;256;74;0;0
17 18/01/2012;139;;0;66;188;68;3;0
18 19/01/2012;191;;1;104;248;79;3;0
19 20/01/2012;161;;4;96;217;67;1;1
20 21/01/2012;53;;0;47;70;32;1;0
21 22/01/2012;71;;0;41;73;35;5;0
22 23/01/2012;210;;6;114;357;91;6;0
23 24/01/2012;299;;1;189;444;174;4;0
24 25/01/2012;334;;1;217;453;180;4;0
25 26/01/2012;306;;0;215;495;191;0;1
26 27/01/2012;91;;5;79;204;65;0;0
27 28/01/2012;80;;1;61;123;33;9;1
28 29/01/2012;87;;1;65;132;40;7;0
29 30/01/2012;219;;0;146;371;152;2;0
... ...
280 07/10/2012;1580;;660;922;1629;860;695;2052
281 08/10/2012;1854;;880;987;1818;1040;1115;2502
282 09/10/2012;4787;;2210;3026;5138;3418;927;4078
283 10/10/2012;3115;;1537;2081;3681;2608;560;2703
284 11/10/2012;3746;;1857;2569;4694;3034;558;3457
285 12/10/2012;3169;;1460;2261;4045;2564;448;3224
286 13/10/2012;1783;;802;1205;2113;1183;681;2309
287 14/10/2012;587;;287;443;852;503;65;952
288 15/10/2012;3292;;1678;2165;4197;2754;560;3183
289 16/10/2012;3739;;1858;2684;4681;2997;554;3593
290 17/10/2012;4098;;1964;2645;4836;3063;728;3834
291 18/10/2012;4671;;2292;3129;5542;3477;1108;4245
292 19/10/2012;1313;;597;885;1668;1209;111;1486
293 20/10/2012;2011;;748;1323;2266;1213;797;2243
294 21/10/2012;1277;;609;869;1777;898;242;1648
295 22/10/2012;3650;;1819;2495;4800;3023;757;3721
296 23/10/2012;4177;;1997;2795;5216;3233;795;3554
297 24/10/2012;3744;;1868;2625;4900;3035;649;3622
298 25/10/2012;3735;;1815;2528;5010;3017;631;3767
299 26/10/2012;4290;;1987;2754;5246;3000;1456;4578
300 27/10/2012;1857;;792;1244;2461;1193;618;2471
301 28/10/2012;1310;;697;910;1776;955;387;1876
302 29/10/2012;2919;;1458;2071;3768;2440;411;2795
303 30/10/2012;2887;;1251;2007;3516;2255;338;2790
304 31/10/2012;2634;;1294;1835;3453;2220;245;2570
305 01/11/2012;2405;;1208;1701;3082;2076;165;2461
306 02/11/2012;1582;;737;1109;2277;1392;97;1888
307 03/11/2012;844;;380;612;1137;713;105;1302
308 04/11/2012;966;;446;710;1277;692;197;1374
309 05/11/2012;2247;;1170;1705;3221;2143;179;2430

310 rows × 1 columns

You'll notice that this is totally broken! Fortunately, read_csv() has a bunch more options that will let us fix it. Here we'll:

  • Change the column separator to a ;
  • Parse the dates in the Date column
  • Tell it that our dates have the date first instead of the month first
  • Set the index to be the Date column

In [4]:
bikes_df = pd.read_csv(
    'data/bikes.csv',
    sep=';',
    encoding='latin1',
    parse_dates=['Date'],
    dayfirst=True,
    index_col='Date'
)
bikes_df


Out[4]:
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN
2012-01-06 146 NaN 0 98 244 86 4 75 NaN
2012-01-07 98 NaN 2 80 108 53 6 54 NaN
2012-01-08 95 NaN 1 62 98 64 11 63 NaN
2012-01-09 244 NaN 2 165 432 198 12 173 NaN
2012-01-10 397 NaN 3 238 563 275 18 241 NaN
2012-01-11 273 NaN 0 182 443 258 12 194 NaN
2012-01-12 157 NaN 1 134 261 137 9 63 NaN
2012-01-13 75 NaN 0 41 105 64 2 0 NaN
2012-01-14 32 NaN 0 54 56 19 0 1 NaN
2012-01-15 54 NaN 0 33 60 18 0 0 NaN
2012-01-16 168 NaN 2 136 312 137 1 0 NaN
2012-01-17 155 NaN 0 86 256 74 0 0 NaN
2012-01-18 139 NaN 0 66 188 68 3 0 NaN
2012-01-19 191 NaN 1 104 248 79 3 0 NaN
2012-01-20 161 NaN 4 96 217 67 1 1 NaN
2012-01-21 53 NaN 0 47 70 32 1 0 NaN
2012-01-22 71 NaN 0 41 73 35 5 0 NaN
2012-01-23 210 NaN 6 114 357 91 6 0 NaN
2012-01-24 299 NaN 1 189 444 174 4 0 NaN
2012-01-25 334 NaN 1 217 453 180 4 0 NaN
2012-01-26 306 NaN 0 215 495 191 0 1 NaN
2012-01-27 91 NaN 5 79 204 65 0 0 NaN
2012-01-28 80 NaN 1 61 123 33 9 1 NaN
2012-01-29 87 NaN 1 65 132 40 7 0 NaN
2012-01-30 219 NaN 0 146 371 152 2 0 NaN
... ... ... ... ... ... ... ... ... ...
2012-10-07 1580 NaN 660 922 1629 860 695 2052 NaN
2012-10-08 1854 NaN 880 987 1818 1040 1115 2502 NaN
2012-10-09 4787 NaN 2210 3026 5138 3418 927 4078 NaN
2012-10-10 3115 NaN 1537 2081 3681 2608 560 2703 NaN
2012-10-11 3746 NaN 1857 2569 4694 3034 558 3457 NaN
2012-10-12 3169 NaN 1460 2261 4045 2564 448 3224 NaN
2012-10-13 1783 NaN 802 1205 2113 1183 681 2309 NaN
2012-10-14 587 NaN 287 443 852 503 65 952 NaN
2012-10-15 3292 NaN 1678 2165 4197 2754 560 3183 NaN
2012-10-16 3739 NaN 1858 2684 4681 2997 554 3593 NaN
2012-10-17 4098 NaN 1964 2645 4836 3063 728 3834 NaN
2012-10-18 4671 NaN 2292 3129 5542 3477 1108 4245 NaN
2012-10-19 1313 NaN 597 885 1668 1209 111 1486 NaN
2012-10-20 2011 NaN 748 1323 2266 1213 797 2243 NaN
2012-10-21 1277 NaN 609 869 1777 898 242 1648 NaN
2012-10-22 3650 NaN 1819 2495 4800 3023 757 3721 NaN
2012-10-23 4177 NaN 1997 2795 5216 3233 795 3554 NaN
2012-10-24 3744 NaN 1868 2625 4900 3035 649 3622 NaN
2012-10-25 3735 NaN 1815 2528 5010 3017 631 3767 NaN
2012-10-26 4290 NaN 1987 2754 5246 3000 1456 4578 NaN
2012-10-27 1857 NaN 792 1244 2461 1193 618 2471 NaN
2012-10-28 1310 NaN 697 910 1776 955 387 1876 NaN
2012-10-29 2919 NaN 1458 2071 3768 2440 411 2795 NaN
2012-10-30 2887 NaN 1251 2007 3516 2255 338 2790 NaN
2012-10-31 2634 NaN 1294 1835 3453 2220 245 2570 NaN
2012-11-01 2405 NaN 1208 1701 3082 2076 165 2461 NaN
2012-11-02 1582 NaN 737 1109 2277 1392 97 1888 NaN
2012-11-03 844 NaN 380 612 1137 713 105 1302 NaN
2012-11-04 966 NaN 446 710 1277 692 197 1374 NaN
2012-11-05 2247 NaN 1170 1705 3221 2143 179 2430 NaN

310 rows × 9 columns

And now let's just drop the columns that contain missing values.


In [5]:
# Get rid of missing columns, don't worry, you'll see more on .dropna() later
bikes_df = bikes_df.dropna(axis=1)
bikes_df.head()


Out[5]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-02 83 1 68 153 53 6 43
2012-01-03 135 2 104 248 89 3 58
2012-01-04 144 1 116 318 111 8 61
2012-01-05 197 2 124 330 97 13 95

Way better!

In the last example we added head() to show just the first rows. If you want the last ones, use the tail() method. Both accept the number of rows to show, which defaults to 5.
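For instance, to peek at the last three rows:

```
bikes_df.tail(3)  # the three most recent days in the dataset
```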

Pandas is also able to get data from the Internet without even batting an eye. The excellent book Introduction to Statistical Learning has one of its data sets available to download.


In [6]:
autos_df = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
autos_df.head()  # print the first lines


Out[6]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15 8 350 165 3693 11.5 70 1 buick skylark 320
2 18 8 318 150 3436 11.0 70 1 plymouth satellite
3 16 8 304 150 3433 12.0 70 1 amc rebel sst
4 17 8 302 140 3449 10.5 70 1 ford torino

Slicing

A DataFrame is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.

Here's an example:


In [7]:
bikes_df['Berri 1'].head()


Out[7]:
Date
2012-01-01     35
2012-01-02     83
2012-01-03    135
2012-01-04    144
2012-01-05    197
Name: Berri 1, dtype: int64

To select several columns, we use the same fancy indexing that we saw for NumPy.


In [8]:
fancy_index = ['Berri 1', 'Rachel1']
bikes_df[fancy_index].head()


Out[8]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

But it's very common to just put the fancy index list as the key:


In [9]:
bikes_df[['Berri 1', 'Rachel1']].head()


Out[9]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

For slicing rows the syntax is just as expected: the same one we use for lists.


In [10]:
bikes_df[['Berri 1', 'Rachel1']][:5]


Out[10]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

In [11]:
bikes_df['Berri 1'][-5]


Out[11]:
2405

Indexing

In a DataFrame every column is a Series, so you can use the index to select specific items from the Series.


In [12]:
berri = bikes_df[['Berri 1']]
berri.head()


Out[12]:
Berri 1
Date
2012-01-01 35
2012-01-02 83
2012-01-03 135
2012-01-04 144
2012-01-05 197
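Note the difference in the brackets: single brackets (bikes_df['Berri 1']) return a Series, while double brackets (bikes_df[['Berri 1']]) return a one-column DataFrame, which is why berri above prints with a column header. A quick check:

```
type(bikes_df['Berri 1'])    # a pandas Series
type(bikes_df[['Berri 1']])  # a pandas DataFrame
```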

Or you can use boolean indexing for selection.


In [13]:
bikes_df[bikes_df['Berri 1'] < 40]


Out[13]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

That last one might be a little weird, so let's make it more clear: bikes_df['Berri 1'] < 40 returns a Series of True/False values, which we then pass to our DataFrame bikes_df, returning the corresponding True items.


In [14]:
less_than_40 = bikes_df['Berri 1'] < 40
print(less_than_40)
bikes_df[less_than_40]


Date
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05    False
2012-01-06    False
2012-01-07    False
2012-01-08    False
2012-01-09    False
2012-01-10    False
2012-01-11    False
2012-01-12    False
2012-01-13    False
2012-01-14     True
2012-01-15    False
...
2012-10-22    False
2012-10-23    False
2012-10-24    False
2012-10-25    False
2012-10-26    False
2012-10-27    False
2012-10-28    False
2012-10-29    False
2012-10-30    False
2012-10-31    False
2012-11-01    False
2012-11-02    False
2012-11-03    False
2012-11-04    False
2012-11-05    False
Name: Berri 1, Length: 310
Out[14]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

Series and DataFrame share the same indexing, slicing, and selection operations. What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.

Let's first create a new Series from a dictionary.


In [15]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities


Out[15]:
Austin            450
Boston            NaN
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [16]:
'Seattle' in cities


Out[16]:
False

In [17]:
'San Francisco' in cities


Out[17]:
True
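Keep in mind that in tests the index labels, not the values: 'Boston' is found even though its value is missing.

```
'Boston' in cities  # True: the label exists, even though its value is NaN
```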

Finally, it's also possible to get a specific row by using the index value and the ix property.


In [18]:
bikes_df.ix['2012-01-01']


Out[18]:
Berri 1                  35
Côte-Sainte-Catherine     0
Maisonneuve 1            38
Maisonneuve 2            51
du Parc                  26
Pierre-Dupuy             10
Rachel1                  16
Name: 2012-01-01 00:00:00, dtype: int64
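In later versions of Pandas, .ix was deprecated in favour of the label-based .loc, which does the same job here:

```
bikes_df.loc['2012-01-01']  # label-based row lookup
```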

Cleaning

Mathematical operations can be done using scalars and functions.


In [19]:
# divide city values by 3
cities / 3


Out[19]:
Austin           150.000000
Boston                  NaN
Chicago          333.333333
New York         433.333333
Portland         300.000000
San Francisco    366.666667
dtype: float64

In [20]:
# square city values
np.square(cities)


Out[20]:
Austin            202500
Boston               NaN
Chicago          1000000
New York         1690000
Portland          810000
San Francisco    1210000
dtype: float64

But what can we do with that NaN value? The easiest way to deal with it is simply to remove it.


In [21]:
cities.dropna()


Out[21]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

Which is equivalent to doing:


In [22]:
cities[cities.notnull()]


Out[22]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

We can also fill the null data or even interpolate its value from the rest of the Series.


In [23]:
cities.fillna(0)


Out[23]:
Austin            450
Boston              0
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [24]:
cities.interpolate()


Out[24]:
Austin            450
Boston            725
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

But null data might come in different forms. The next example illustrates the creation of a DataFrame from a dictionary. There are two blogs, each with a number of entries per tag. In this case, null data is represented by either the character "?" or the empty string "".


In [8]:
blogs = pd.DataFrame({0: {'blog': 1, 'tag': 'GIS', 'entries': '?'},
                      1: {'blog': 1, 'tag': 'NLP', 'entries': 1638},
                      2: {'blog': 1, 'tag': 'SNA', 'entries': 569},
                      3: {'blog': 1, 'tag': 'DH', 'entries': 115},
                      4: {'blog': 2, 'tag': 'GIS', 'entries': ''},
                      5: {'blog': 2, 'tag': 'NLP', 'entries': 1130},
                      6: {'blog': 2, 'tag': 'SNA', 'entries': 754},
                      7: {'blog': 2, 'tag': 'DH', 'entries': 555}})

In [9]:
blogs


Out[9]:
0 1 2 3 4 5 6 7
blog 1 1 1 1 2 2 2 2
entries ? 1638 569 115 1130 754 555
tag GIS NLP SNA DH GIS NLP SNA DH

In [10]:
blogs = blogs.T  # We transpose it to swap rows and columns
blogs


Out[10]:
blog entries tag
0 1 ? GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

If we were reading the data from a CSV, the na_values parameter would let us mark those values as missing right away. For DataFrames that we already have, the process is more manual. For now, let's just ignore the warning :-)
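As a sketch, assuming the same table lived in a hypothetical data/blogs.csv:

```
blogs = pd.read_csv('data/blogs.csv', na_values=['?', ''])  # both markers become NaN on load
```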


In [11]:
blogs[blogs.entries != "?"][blogs.entries != ""]


/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/core/frame.py:1808: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[11]:
blog entries tag
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH
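The warning appears because we chained two boolean selections, so the second mask gets reindexed against the already-filtered frame. Combining both conditions into a single mask with & avoids it:

```
blogs[(blogs.entries != "?") & (blogs.entries != "")]
```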

In some cases, we might want to keep the rows, but fill them with real null data.


In [12]:
blogs['entries'] = blogs['entries'].replace("?", np.nan)
blogs['entries'].replace("", np.nan, inplace=True)
blogs


Out[12]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

Activity

What do you think is the difference between these two lines?
```
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
blogs['entries'].replace("?", np.NaN, inplace=True)
```

When a Series is of str type, we can use almost the whole set of Python string functions by accessing the .str attribute, followed by whatever string method we need.


In [13]:
blogs.tag.str.lower()


Out[13]:
0    gis
1    nlp
2    sna
3     dh
4    gis
5    nlp
6    sna
7     dh
Name: tag, dtype: object
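Any other string method works the same way; for example:

```
blogs.tag.str.contains('N')  # boolean Series: which tags contain an "N"
```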

But if the cleaning needs a more complicated operation, we can always .apply() a specific function to a whole DataFrame or Series, in order to perform that function over the individual values or cells.


In [14]:
def clean_func(value):
    if value == "GIS":
        return "Geographical Information Systems"
    elif value == "NLP":
        return "Natural Language Processing"
    elif value == "DH":
        return "Digital Humanities"
    elif value == "SNA":
        return "Social Network Analysis"
    else:
        return value  # leave unknown tags untouched

blogs['tag'].apply(clean_func)


Out[14]:
0    Geographical Information Systems
1         Natural Language Processing
2             Social Network Analysis
3                  Digital Humanities
4    Geographical Information Systems
5         Natural Language Processing
6             Social Network Analysis
7                  Digital Humanities
Name: tag, dtype: object
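An equivalent, arguably tidier approach (a sketch) is to keep the long names in a dictionary and use .replace(), which leaves values without a match untouched:

```
long_names = {
    'GIS': 'Geographical Information Systems',
    'NLP': 'Natural Language Processing',
    'DH': 'Digital Humanities',
    'SNA': 'Social Network Analysis',
}
blogs['tag'].replace(long_names)
```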

In [15]:
blogs


Out[15]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH
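Note that blogs still shows the short tags: .apply() returns a new Series instead of modifying the frame. To keep the cleaned values, assign the result back:

```
blogs['tag'] = blogs['tag'].apply(clean_func)
```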

Summary statistics

Pandas provides nifty methods to understand your data. Let's look at the describe, correlation (corr), and covariance (cov) methods that you can use to quickly make sense of the data.

The describe() method provides quick stats on all suitable columns.


In [33]:
bikes_df.describe()


Out[33]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
count 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000
mean 2985.048387 1233.351613 1983.325806 3510.261290 1862.983871 1054.306452 2873.483871
std 2169.271062 944.643188 1450.715170 2484.959789 1332.543266 1064.029205 2039.315504
min 32.000000 0.000000 33.000000 47.000000 18.000000 0.000000 0.000000
25% 596.000000 243.250000 427.000000 831.000000 474.750000 53.250000 731.000000
50% 3128.000000 1269.000000 2019.500000 3688.500000 1822.500000 704.000000 3223.500000
75% 4973.250000 2003.000000 3168.250000 5731.750000 3069.000000 1818.500000 4717.250000
max 7077.000000 3124.000000 4999.000000 8222.000000 4510.000000 4386.000000 6595.000000

In [34]:
blogs.describe()


Out[34]:
entries
count 6.000000
mean 793.500000
std 528.321398
min 115.000000
25% 558.500000
50% 661.500000
75% 1036.000000
max 1638.000000

In [35]:
autos_df.describe()


Out[35]:
mpg cylinders displacement weight acceleration year origin
count 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000
mean 23.515869 5.458438 193.532746 2970.261965 15.555668 75.994962 1.574307
std 7.825804 1.701577 104.379583 847.904119 2.749995 3.690005 0.802549
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.000000 2223.000000 13.800000 73.000000 1.000000
50% 23.000000 4.000000 146.000000 2800.000000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3609.000000 17.100000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000

Activity

Other summary statistics include: `count()`, `mean()`, `median()`, `quantile()`, `std()`, `var()`, `min()`, `max()`, etc. Play around a bit to see what they do in the `DataFrames` we already have.
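For example, to get started:

```
bikes_df['Berri 1'].median()   # the middle daily count on Berri 1
autos_df['mpg'].quantile(0.9)  # 90th percentile of fuel efficiency
```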

Merging and joining

Throughout an analysis, we'll often need to merge/join datasets as data is typically stored in a relational manner.

Like SQL's JOIN clause, pandas.merge() allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (on, how, left_on, right_on, left_index, right_index) allowing you to specify the columns or indexes on which to join.

By default, pandas.merge() operates as an inner join, which can be changed using the how parameter. An inner join combines the two DataFrames on a key, keeping only the rows whose key value appears in both.

This is always easiest to understand with examples.


In [3]:
left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
left_frame


Out[3]:
key left_value
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e

In [4]:
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})

right_frame


Out[4]:
key right_value
0 2 f
1 3 g
2 4 h
3 5 i
4 6 j

inner join (default)


In [5]:
pd.merge(left_frame, right_frame, on='key', how='inner')


Out[5]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h

We lose values from both frames since certain keys do not match up. If our key columns had different names, we could have used the left_on and right_on parameters to specify which field to join on from each frame:

pd.merge(left_frame, right_frame, left_on='left_key', right_on='right_key')

Alternatively, if our keys were indexes, we could use the left_index or right_index parameters, which accept a boolean value. You can mix and match columns and indexes like this:

pd.merge(left_frame, right_frame, left_on='key', right_index=True)

left outer join


In [6]:
pd.merge(left_frame, right_frame, on='key', how='left')


Out[6]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h

We keep everything from the left frame, pulling in the value from the right frame where the keys match up. The right_value is null where keys do not match (NaN).

right outer join


In [7]:
pd.merge(left_frame, right_frame, on='key', how='right')


Out[7]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h
3 5 NaN i
4 6 NaN j

This time we've kept everything from the right frame with the left_value being null where the right frame's key did not find a match.

full outer join


In [41]:
pd.merge(left_frame, right_frame, on='key', how='outer')


Out[41]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h
5 5 NaN i
6 6 NaN j

We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are null.

concat

Pandas also provides a way to combine DataFrames along an axis: pandas.concat(). It takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.


In [42]:
pd.concat([left_frame, right_frame])


Out[42]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c NaN
3 3 d NaN
4 4 e NaN
0 2 NaN f
1 3 NaN g
2 4 NaN h
3 5 NaN i
4 6 NaN j

By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be null.
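Notice that the row labels 0-4 now appear twice. If the original indices don't matter, ignore_index=True renumbers the rows of the result:

```
pd.concat([left_frame, right_frame], ignore_index=True)
```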

Additionally, objects can be concatenated side-by-side using the function's axis parameter. The axis labeling information in Pandas objects serves many purposes:

  • Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
  • Enables automatic and explicit data alignment.
  • Allows intuitive getting and setting of subsets of the data set.

In [43]:
pd.concat([left_frame, right_frame], axis=1)


Out[43]:
key left_value key right_value
0 0 a 2 f
1 1 b 3 g
2 2 c 4 h
3 3 d 5 i
4 4 e 6 j

Group by

The groupby() method lets you perform grouping operations. The example below groups the rows by the blog column and counts the non-null values in the remaining columns for each blog.


In [18]:
blogs


Out[18]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

In [20]:
blogs.groupby('blog').count()


Out[20]:
entries tag
blog
1 3 4
2 3 4

And now for a more complex example. From our bikes_df DataFrame, we want to know whether people are more active on weekdays or on weekends. We are going to add a new column with information about the weekday, and then sum all the values for each weekday. We do that by grouping by weekday and aggregating the values by simple addition. In this case we use np.sum(), but any other operation, such as np.max() or np.average(), could be used.


In [45]:
bikes_df.index


Out[45]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01, ..., 2012-11-05]
Length: 310, Freq: None, Timezone: None

The data frame's index is of type DatetimeIndex, which lets us extract datetime information, such as the weekday, directly from the index.


In [46]:
bikes_df.index.weekday


Out[46]:
array([6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5,
       6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0], dtype=int32)

In order to create the new column we just need to pick a name and assign a Series value as if the DataFrame were a dictionary (remember that DataFrame and Series are labeled).


In [47]:
bikes_df['weekday'] = bikes_df.index.weekday
bikes_df.head()


-c:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[47]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 weekday
Date
2012-01-01 35 0 38 51 26 10 16 6
2012-01-02 83 1 68 153 53 6 43 0
2012-01-03 135 2 104 248 89 3 58 1
2012-01-04 144 1 116 318 111 8 61 2
2012-01-05 197 2 124 330 97 13 95 3

In [48]:
bikes_df.groupby('weekday')


Out[48]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f2d5303a668>

In [49]:
counts_by_day = bikes_df.groupby('weekday').aggregate(np.sum)
counts_by_day


Out[49]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
weekday
0 134298 60329 90051 163767 90184 46204 130130
1 135305 58708 92035 165880 91399 35167 120088
2 152972 67344 104891 186061 102103 43263 133088
3 160131 69028 111895 196715 105674 45385 140241
4 141771 56446 98568 172390 89872 42470 137255
5 101578 34018 62067 105060 48289 52861 112828
6 99310 36466 55324 98308 50004 61485 117150
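Pandas numbers weekdays from 0 (Monday) to 6 (Sunday). As a quick sketch, we could relabel the index to make the table more readable:

```
counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                       'Friday', 'Saturday', 'Sunday']
```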

And finally a bit of matplotlib + pandas magic! But don't worry, you'll learn about plotting in the next class.


In [50]:
# display graphs inline
%matplotlib inline 
import matplotlib
# Make the fonts bigger
matplotlib.rc('figure', figsize=(14, 7))

counts_by_day.plot()


Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2d5169ea90>

This is by no means all we will see about grouping, but it's a good start if you already understand what's going on here.

Art Bank collection

The Art Bank collection is a dataset provided by data.gc.ca, a key part of Canada's Action Plan on Open Government to enhance transparency and accountability. It contains the art works in the Canada Council Art Bank collection. For 40 years, the Art Bank has been collecting work by the best Canadian artists of our times. The Art Bank is part of the Canada Council for the Arts, Canada's national arts funder. The dataset is licensed under the terms of the Open Government Licence - Canada.

And... it's in XML! XML stands for eXtensible Markup Language, and if you want to see an example right now, just press Ctrl+U and the browser will (likely) show you the HTML under this web page. HTML is a close relative of XML (XHTML is the variant that is actually XML-compliant), so now you have an idea.

XML is something that you should know about already. Very basically, XML defines a hierarchical structure in which concepts are enclosed in tags that might have properties.

<art_list>
    <art_work title="Title of the artwork" />
</art_list>

In the last example, art_list is a tag, and <art_list>...</art_list> is how you enclose something in a tag. Sometimes a tag has nothing to enclose, as in <art_work />, which is self-closing. Finally, attributes (like title above) work just like dictionary keys and values.
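For instance, with lxml we can parse that tiny document and read the title attribute as if the element were a dictionary (a minimal sketch):

```
from lxml import etree

art_list = etree.fromstring('<art_list><art_work title="Title of the artwork"/></art_list>')
art_list.find('art_work').get('title')  # 'Title of the artwork'
```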

In Python, the fastest way to get all the concepts we need is usually by finding all the tags of a certain kind. The module lxml.etree is able to do that for us.

First, let's look at the first lines of the dataset by running the shell command head through IPython's special ! prefix.


In [21]:
!head data/art_listing.xml

From the file's first lines, it's pretty obvious that what we want are the Art_Summary elements and their attributes.

The function etree.parse() opens a file, parses it into a Python data structure and returns it as an object.


In [22]:
from lxml import etree
arts = etree.parse("data/art_listing.xml")

To find all Art_Summary elements, we can call the method .findall().


In [23]:
summaries = arts.findall("Art_Summary")
summaries[:5]


Out[23]:
[<Element Art_Summary at 0x7f85dd0005c8>,
 <Element Art_Summary at 0x7f85dcffbbc8>,
 <Element Art_Summary at 0x7f85dcffbf08>,
 <Element Art_Summary at 0x7f85dcffb148>,
 <Element Art_Summary at 0x7f85dcffbd88>]

Each of those elements in the list behaves like a dictionary.


In [24]:
summaries[0].keys()


Out[24]:
['Artist_Name', 'Title', 'Execution_Date', 'E_Category', 'F_Category', 'ID']

In [25]:
summaries[0].values()


Out[25]:
['Wright, Don',
 'Turf-Arm',
 '1984',
 'Work on paper',
 'Oeuvre sur papier',
 '85/6-0174']

Now we are ready to build a data frame from the summaries list.


In [26]:
art_list = []
for summary in summaries:
    art_list.append(summary.values())
arts = pd.DataFrame(
    art_list,
    columns=['artist_name', 'title', 'execution_date', 'category', 'category_french', 'id']
)

Let's remove the category in French.


In [27]:
del arts['category_french']

In [28]:
arts.head()


Out[28]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 ©1977, CARCC Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland ©1982, CARCC Photograph 92/3-0221
3 Walker, Peter 1.25.79 ©1979, CARCC Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

Voilà! Ready for the fun. Let's just save it for later.


In [38]:
arts.to_csv("data/arts.csv", index=False)

Activity

Given the `arts` data frame, clean the dates so you only see numbers. If a year is lower than 100, it refers to the 1900s: for example, 78 is actually 1978, and that needs to be fixed too. [Solution](data/arts1.py)


In [34]:
l = ['1989', '1999']
l[-1]


Out[34]:
'1999'

In [37]:
"1989, CARCC".split("-")[-1]


Out[37]:
'1989, CARCC'

In [40]:
# Clean the dates so you only see numbers.
def clean_years(value):
    result = value
    chars_to_replace = ["c.", "©", ", CARCC", "no date", "n.d.", " SODRAC", ", CA", " CARCC"]
    chars_to_split = ["-", "/"]
    if isinstance(result, str):  # isinstance checks whether the value is a string; non-strings pass through
        for char in chars_to_split:
            result = result.split(char)[-1]
        for char in chars_to_replace:
            result = result.replace(char, "")
        if result == "":
            return np.nan
        else:
            return int(result)
    else:
        return result

arts['execution_date'] = arts['execution_date'].apply(clean_years)
arts.head()


Out[40]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 1977 Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland 1982 Photograph 92/3-0221
3 Walker, Peter 1.25.79 1979 Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

In [43]:
# If a year is lower than 100, it refers to the 1900s. For example, 78 is actually 1978, and that needs to be fixed too.
def clean_year_99(value):
    if value <= 99:
        return 1900 + value
    else:
        return value 

arts["execution_date"] = arts["execution_date"].apply(clean_year_99)
arts[arts["execution_date"] < 100].head()


Out[43]:
artist_name title execution_date category id

In [45]:
# using lambda functions
arts["execution_date"] = arts["execution_date"].apply(lambda value: 1900 + value if value <= 99 else value)

For the next class