[Data, the Humanist's New Best Friend](index.ipynb)
*Class 06*

In this class you are expected to learn:

  1. Pandas
  2. Cleaning data
  3. Summary statistics
  4. Indexing
  5. Merging, joining

Pandas

For those of you who have heard of R, a statistics-oriented language, Pandas is the R for Python. It provides easy-to-use data structures and a ton of useful helper functions for data cleanup and transformation. It's fast because it is backed by NumPy arrays, and it integrates well with other Python libraries such as scikit-learn for statistical learning, or matplotlib for plotting.

Pandas provides a key data structure, the pandas.DataFrame, which is like a super-rich 2D array that addresses three deficiencies of plain arrays:

  • They hold heterogeneous data; each column can have its own numpy.dtype, so the same row can contain numbers, strings, and dates.
  • The axes of a DataFrame are labeled with column names and row indices, which makes slicing and indexing easier.
  • They account for missing values, which plain arrays do not directly support, in the form of NaN, which stands for "not a number".

Data frames are extremely useful for data munging. They provide a large range of operations such as filter, join, and group-by aggregation. Furthermore, pandas.Series is to a 1D array what pandas.DataFrame is to a 2D array.


In [2]:
import numpy as np  # we will need it
import pandas as pd
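To make the relationship between the two structures concrete, here is a minimal sketch with made-up values (the names are illustrative only):

```
# A Series is a labeled 1D array...
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# ...and a DataFrame is a labeled 2D table whose columns are Series
df = pd.DataFrame({'count': s, 'label': ['x', 'y', 'z']})
```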

Read data with Pandas

Pandas is able to read data from a bunch of different data formats: CSV, TSV, Excel, HDF5, JSON, etc. The whole list is in the online documentation. In order to read data from a CSV file, we use the read_csv() function, which returns a data frame. By default it assumes that the fields are comma-separated, but it has parameters to tune the behaviour of the importer.

We're going to use some datasets from Julia Evans' Pandas Cookbook. The first one is about cyclist data from Montréal in 2012.

This bikes dataset records how many people rode on each of 7 different bike paths in Montréal, each day.


In [2]:
bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-e54169c3d1a7> in <module>()
----> 1 bikes_df = pd.read_csv('data/bikes.csv')  # We add _df 'cause it's a data frame

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    463                     skip_blank_lines=skip_blank_lines)
    464 
--> 465         return _read(filepath_or_buffer, kwds)
    466 
    467     parser_f.__name__ = name

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    239 
    240     # Create the parser.
--> 241     parser = TextFileReader(filepath_or_buffer, **kwds)
    242 
    243     if (nrows is not None) and (chunksize is not None):

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    555             self.options['has_index_names'] = kwds['has_index_names']
    556 
--> 557         self._make_engine(self.engine)
    558 
    559     def _get_options_with_defaults(self, engine):

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    692     def _make_engine(self, engine='c'):
    693         if engine == 'c':
--> 694             self._engine = CParserWrapper(self.f, **self.options)
    695         else:
    696             if engine == 'python':

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1059         kwds['allow_leading_cols'] = self.index_col is not False
   1060 
-> 1061         self._reader = _parser.TextReader(src, **kwds)
   1062 
   1063         # XXX

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4710)()

/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._get_header (pandas/parser.c:6438)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 15: invalid continuation byte

Hey! We know that error! It's an encoding error, probably caused by French accents. Fortunately, read_csv() lets us pass the encoding as an argument.


In [3]:
bikes_df = pd.read_csv('data/bikes.csv', encoding="latin1")
bikes_df


Out[3]:
Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;
5 06/01/2012;146;;0;98;244;86;4;75;
6 07/01/2012;98;;2;80;108;53;6;54;
7 08/01/2012;95;;1;62;98;64;11;63;
8 09/01/2012;244;;2;165;432;198;12;173;
9 10/01/2012;397;;3;238;563;275;18;241;
10 11/01/2012;273;;0;182;443;258;12;194;
11 12/01/2012;157;;1;134;261;137;9;63;
12 13/01/2012;75;;0;41;105;64;2;0;
13 14/01/2012;32;;0;54;56;19;0;1;
14 15/01/2012;54;;0;33;60;18;0;0;
15 16/01/2012;168;;2;136;312;137;1;0
16 17/01/2012;155;;0;86;256;74;0;0
17 18/01/2012;139;;0;66;188;68;3;0
18 19/01/2012;191;;1;104;248;79;3;0
19 20/01/2012;161;;4;96;217;67;1;1
20 21/01/2012;53;;0;47;70;32;1;0
21 22/01/2012;71;;0;41;73;35;5;0
22 23/01/2012;210;;6;114;357;91;6;0
23 24/01/2012;299;;1;189;444;174;4;0
24 25/01/2012;334;;1;217;453;180;4;0
25 26/01/2012;306;;0;215;495;191;0;1
26 27/01/2012;91;;5;79;204;65;0;0
27 28/01/2012;80;;1;61;123;33;9;1
28 29/01/2012;87;;1;65;132;40;7;0
29 30/01/2012;219;;0;146;371;152;2;0
... ...
280 07/10/2012;1580;;660;922;1629;860;695;2052
281 08/10/2012;1854;;880;987;1818;1040;1115;2502
282 09/10/2012;4787;;2210;3026;5138;3418;927;4078
283 10/10/2012;3115;;1537;2081;3681;2608;560;2703
284 11/10/2012;3746;;1857;2569;4694;3034;558;3457
285 12/10/2012;3169;;1460;2261;4045;2564;448;3224
286 13/10/2012;1783;;802;1205;2113;1183;681;2309
287 14/10/2012;587;;287;443;852;503;65;952
288 15/10/2012;3292;;1678;2165;4197;2754;560;3183
289 16/10/2012;3739;;1858;2684;4681;2997;554;3593
290 17/10/2012;4098;;1964;2645;4836;3063;728;3834
291 18/10/2012;4671;;2292;3129;5542;3477;1108;4245
292 19/10/2012;1313;;597;885;1668;1209;111;1486
293 20/10/2012;2011;;748;1323;2266;1213;797;2243
294 21/10/2012;1277;;609;869;1777;898;242;1648
295 22/10/2012;3650;;1819;2495;4800;3023;757;3721
296 23/10/2012;4177;;1997;2795;5216;3233;795;3554
297 24/10/2012;3744;;1868;2625;4900;3035;649;3622
298 25/10/2012;3735;;1815;2528;5010;3017;631;3767
299 26/10/2012;4290;;1987;2754;5246;3000;1456;4578
300 27/10/2012;1857;;792;1244;2461;1193;618;2471
301 28/10/2012;1310;;697;910;1776;955;387;1876
302 29/10/2012;2919;;1458;2071;3768;2440;411;2795
303 30/10/2012;2887;;1251;2007;3516;2255;338;2790
304 31/10/2012;2634;;1294;1835;3453;2220;245;2570
305 01/11/2012;2405;;1208;1701;3082;2076;165;2461
306 02/11/2012;1582;;737;1109;2277;1392;97;1888
307 03/11/2012;844;;380;612;1137;713;105;1302
308 04/11/2012;966;;446;710;1277;692;197;1374
309 05/11/2012;2247;;1170;1705;3221;2143;179;2430

310 rows × 1 columns

You'll notice that this is totally broken! Fortunately, read_csv() has a bunch more options that will let us fix it. Here we'll:

  • Change the column separator to a ;
  • Parse the dates in the Date column
  • Tell it that our dates have the date first instead of the month first
  • Set the index to be the Date column

In [4]:
bikes_df = pd.read_csv(
    'data/bikes.csv',
    sep=';',
    encoding='latin1',
    parse_dates=['Date'],
    dayfirst=True,
    index_col='Date'
)
bikes_df


Out[4]:
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN
2012-01-06 146 NaN 0 98 244 86 4 75 NaN
2012-01-07 98 NaN 2 80 108 53 6 54 NaN
2012-01-08 95 NaN 1 62 98 64 11 63 NaN
2012-01-09 244 NaN 2 165 432 198 12 173 NaN
2012-01-10 397 NaN 3 238 563 275 18 241 NaN
2012-01-11 273 NaN 0 182 443 258 12 194 NaN
2012-01-12 157 NaN 1 134 261 137 9 63 NaN
2012-01-13 75 NaN 0 41 105 64 2 0 NaN
2012-01-14 32 NaN 0 54 56 19 0 1 NaN
2012-01-15 54 NaN 0 33 60 18 0 0 NaN
2012-01-16 168 NaN 2 136 312 137 1 0 NaN
2012-01-17 155 NaN 0 86 256 74 0 0 NaN
2012-01-18 139 NaN 0 66 188 68 3 0 NaN
2012-01-19 191 NaN 1 104 248 79 3 0 NaN
2012-01-20 161 NaN 4 96 217 67 1 1 NaN
2012-01-21 53 NaN 0 47 70 32 1 0 NaN
2012-01-22 71 NaN 0 41 73 35 5 0 NaN
2012-01-23 210 NaN 6 114 357 91 6 0 NaN
2012-01-24 299 NaN 1 189 444 174 4 0 NaN
2012-01-25 334 NaN 1 217 453 180 4 0 NaN
2012-01-26 306 NaN 0 215 495 191 0 1 NaN
2012-01-27 91 NaN 5 79 204 65 0 0 NaN
2012-01-28 80 NaN 1 61 123 33 9 1 NaN
2012-01-29 87 NaN 1 65 132 40 7 0 NaN
2012-01-30 219 NaN 0 146 371 152 2 0 NaN
... ... ... ... ... ... ... ... ... ...
2012-10-07 1580 NaN 660 922 1629 860 695 2052 NaN
2012-10-08 1854 NaN 880 987 1818 1040 1115 2502 NaN
2012-10-09 4787 NaN 2210 3026 5138 3418 927 4078 NaN
2012-10-10 3115 NaN 1537 2081 3681 2608 560 2703 NaN
2012-10-11 3746 NaN 1857 2569 4694 3034 558 3457 NaN
2012-10-12 3169 NaN 1460 2261 4045 2564 448 3224 NaN
2012-10-13 1783 NaN 802 1205 2113 1183 681 2309 NaN
2012-10-14 587 NaN 287 443 852 503 65 952 NaN
2012-10-15 3292 NaN 1678 2165 4197 2754 560 3183 NaN
2012-10-16 3739 NaN 1858 2684 4681 2997 554 3593 NaN
2012-10-17 4098 NaN 1964 2645 4836 3063 728 3834 NaN
2012-10-18 4671 NaN 2292 3129 5542 3477 1108 4245 NaN
2012-10-19 1313 NaN 597 885 1668 1209 111 1486 NaN
2012-10-20 2011 NaN 748 1323 2266 1213 797 2243 NaN
2012-10-21 1277 NaN 609 869 1777 898 242 1648 NaN
2012-10-22 3650 NaN 1819 2495 4800 3023 757 3721 NaN
2012-10-23 4177 NaN 1997 2795 5216 3233 795 3554 NaN
2012-10-24 3744 NaN 1868 2625 4900 3035 649 3622 NaN
2012-10-25 3735 NaN 1815 2528 5010 3017 631 3767 NaN
2012-10-26 4290 NaN 1987 2754 5246 3000 1456 4578 NaN
2012-10-27 1857 NaN 792 1244 2461 1193 618 2471 NaN
2012-10-28 1310 NaN 697 910 1776 955 387 1876 NaN
2012-10-29 2919 NaN 1458 2071 3768 2440 411 2795 NaN
2012-10-30 2887 NaN 1251 2007 3516 2255 338 2790 NaN
2012-10-31 2634 NaN 1294 1835 3453 2220 245 2570 NaN
2012-11-01 2405 NaN 1208 1701 3082 2076 165 2461 NaN
2012-11-02 1582 NaN 737 1109 2277 1392 97 1888 NaN
2012-11-03 844 NaN 380 612 1137 713 105 1302 NaN
2012-11-04 966 NaN 446 710 1277 692 197 1374 NaN
2012-11-05 2247 NaN 1170 1705 3221 2143 179 2430 NaN

310 rows × 9 columns

And now let's just drop the columns that contain missing values.


In [5]:
# Get rid of missing columns, don't worry, you'll see more on .dropna() later
bikes_df = bikes_df.dropna(axis=1)
bikes_df.head()


Out[5]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-02 83 1 68 153 53 6 43
2012-01-03 135 2 104 248 89 3 58
2012-01-04 144 1 116 318 111 8 61
2012-01-05 197 2 124 330 97 13 95

Way better!

In the last example we added head() to show just the first rows. If you want the last ones, use the tail() method. Both accept the number of rows to show, which defaults to 5.
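For instance, to peek at the last three rows:

```
bikes_df.tail(3)  # the three most recent days in the dataset
```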

Pandas is also able to get data from the Internet without even batting an eye. The excellent book Introduction to Statistical Learning has one of its data sets available to download.


In [6]:
autos_df = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
autos_df.head()  # print the first lines


Out[6]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15 8 350 165 3693 11.5 70 1 buick skylark 320
2 18 8 318 150 3436 11.0 70 1 plymouth satellite
3 16 8 304 150 3433 12.0 70 1 amc rebel sst
4 17 8 302 140 3449 10.5 70 1 ford torino

Slicing

A DataFrame is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.

Here's an example:


In [7]:
bikes_df['Berri 1'].head()


Out[7]:
Date
2012-01-01     35
2012-01-02     83
2012-01-03    135
2012-01-04    144
2012-01-05    197
Name: Berri 1, dtype: int64

To select several columns, we use the same fancy indexing that we saw for NumPy.


In [8]:
fancy_index = ['Berri 1', 'Rachel1']
bikes_df[fancy_index].head()


Out[8]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

But it's very common to just put the fancy index list as the key:


In [9]:
bikes_df[['Berri 1', 'Rachel1']].head()


Out[9]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

For slicing rows the syntax is just as expected: the same one we use for lists.


In [10]:
bikes_df[['Berri 1', 'Rachel1']][:5]


Out[10]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58
2012-01-04 144 61
2012-01-05 197 95

In [11]:
bikes_df['Berri 1'][-5]


Out[11]:
2405

Indexing

In a DataFrame every column is a Series, so you can use the index to select specific items from the Series.


In [12]:
berri = bikes_df[['Berri 1']]
berri.head()


Out[12]:
Berri 1
Date
2012-01-01 35
2012-01-02 83
2012-01-03 135
2012-01-04 144
2012-01-05 197
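Note the difference in the brackets: single brackets (bikes_df['Berri 1']) return a Series, while double brackets (bikes_df[['Berri 1']]) return a one-column DataFrame, which is why berri above prints with a column header. A quick check:

```
type(bikes_df['Berri 1'])    # a pandas Series
type(bikes_df[['Berri 1']])  # a pandas DataFrame
```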

Or you can use boolean indexing for selection.


In [13]:
bikes_df[bikes_df['Berri 1'] < 40]


Out[13]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

That last one might be a little weird, so let's make it more clear: bikes_df['Berri 1'] < 40 returns a Series of True/False values, which we then pass to our DataFrame bikes_df, returning the corresponding True items.


In [14]:
less_than_40 = bikes_df['Berri 1'] < 40
print(less_than_40)
bikes_df[less_than_40]


Date
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05    False
2012-01-06    False
2012-01-07    False
2012-01-08    False
2012-01-09    False
2012-01-10    False
2012-01-11    False
2012-01-12    False
2012-01-13    False
2012-01-14     True
2012-01-15    False
...
2012-10-22    False
2012-10-23    False
2012-10-24    False
2012-10-25    False
2012-10-26    False
2012-10-27    False
2012-10-28    False
2012-10-29    False
2012-10-30    False
2012-10-31    False
2012-11-01    False
2012-11-02    False
2012-11-03    False
2012-11-04    False
2012-11-05    False
Name: Berri 1, Length: 310
Out[14]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-14 32 0 54 56 19 0 1

Series and DataFrame share the same indexing, slicing, and selection operations. What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.

Let's first create a new Series from a dictionary.


In [15]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities


Out[15]:
Austin            450
Boston            NaN
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [16]:
'Seattle' in cities


Out[16]:
False

In [17]:
'San Francisco' in cities


Out[17]:
True
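Keep in mind that in tests the index labels, not the values: 'Boston' is found even though its value is missing.

```
'Boston' in cities  # True: the label exists, even though its value is NaN
```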

Finally, it's also possible to get a specific row by using the index value and the ix property.


In [18]:
bikes_df.ix['2012-01-01']


Out[18]:
Berri 1                  35
Côte-Sainte-Catherine     0
Maisonneuve 1            38
Maisonneuve 2            51
du Parc                  26
Pierre-Dupuy             10
Rachel1                  16
Name: 2012-01-01 00:00:00, dtype: int64
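In later versions of Pandas, .ix was deprecated in favour of the label-based .loc, which does the same job here:

```
bikes_df.loc['2012-01-01']  # label-based row lookup
```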

Cleaning

Mathematical operations can be done using scalars and functions.


In [19]:
# divide city values by 3
cities / 3


Out[19]:
Austin           150.000000
Boston                  NaN
Chicago          333.333333
New York         433.333333
Portland         300.000000
San Francisco    366.666667
dtype: float64

In [20]:
# square city values
np.square(cities)


Out[20]:
Austin            202500
Boston               NaN
Chicago          1000000
New York         1690000
Portland          810000
San Francisco    1210000
dtype: float64

But what can we do with that NaN value? The easiest way to deal with it is simply to remove it.


In [21]:
cities.dropna()


Out[21]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

Which is equivalent to doing:


In [22]:
cities[cities.notnull()]


Out[22]:
Austin            450
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

We can also fill the null data or even interpolate its value from the rest of the Series.


In [23]:
cities.fillna(0)


Out[23]:
Austin            450
Boston              0
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

In [24]:
cities.interpolate()


Out[24]:
Austin            450
Boston            725
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

But null data might come in different forms. The next example illustrates the creation of a DataFrame from a dictionary. There are two blogs, each with a number of entries per tag. In this case, null data is represented by either the character "?" or the empty string "".


In [8]:
blogs = pd.DataFrame({0: {'blog': 1, 'tag': 'GIS', 'entries': '?'},
                      1: {'blog': 1, 'tag': 'NLP', 'entries': 1638},
                      2: {'blog': 1, 'tag': 'SNA', 'entries': 569},
                      3: {'blog': 1, 'tag': 'DH', 'entries': 115},
                      4: {'blog': 2, 'tag': 'GIS', 'entries': ''},
                      5: {'blog': 2, 'tag': 'NLP', 'entries': 1130},
                      6: {'blog': 2, 'tag': 'SNA', 'entries': 754},
                      7: {'blog': 2, 'tag': 'DH', 'entries': 555}})

In [9]:
blogs


Out[9]:
0 1 2 3 4 5 6 7
blog 1 1 1 1 2 2 2 2
entries ? 1638 569 115 1130 754 555
tag GIS NLP SNA DH GIS NLP SNA DH

In [10]:
blogs = blogs.T  # We transpose it to swap rows and columns
blogs


Out[10]:
blog entries tag
0 1 ? GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

If we were reading the data from a CSV, the na_values parameter would let us mark those values as missing right away. For DataFrames that we already have, the process is more manual. For now, let's just ignore the warning :-)
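As a sketch, assuming the same table lived in a hypothetical data/blogs.csv:

```
blogs = pd.read_csv('data/blogs.csv', na_values=['?', ''])  # both markers become NaN on load
```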


In [11]:
blogs[blogs.entries != "?"][blogs.entries != ""]


/home/versae/.venvs/dh2304/lib/python3.4/site-packages/pandas/core/frame.py:1808: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[11]:
blog entries tag
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH
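The warning appears because we chained two boolean selections, so the second mask gets reindexed against the already-filtered frame. Combining both conditions into a single mask with & avoids it:

```
blogs[(blogs.entries != "?") & (blogs.entries != "")]
```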

In some cases, we might want to keep the rows, but fill them with real null data.


In [12]:
blogs['entries'] = blogs['entries'].replace("?", np.nan)
blogs['entries'].replace("", np.nan, inplace=True)
blogs


Out[12]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

Activity

What do you think is the difference between these two lines?
```
blogs['entries'] = blogs['entries'].replace("?", np.NaN)
blogs['entries'].replace("?", np.NaN, inplace=True)
```

When a Series is of str type, we can use almost the whole set of Python string functions by accessing the .str attribute, followed by whatever string method we need.


In [13]:
blogs.tag.str.lower()


Out[13]:
0    gis
1    nlp
2    sna
3     dh
4    gis
5    nlp
6    sna
7     dh
Name: tag, dtype: object
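Any other string method works the same way; for example:

```
blogs.tag.str.contains('N')  # boolean Series: which tags contain an "N"
```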

But if the cleaning needs a more complicated operation, we can always .apply() a specific function to a whole DataFrame or Series, in order to perform that function over the individual values or cells.


In [14]:
def clean_func(value):
    if value == "GIS":
        return "Geographical Information Systems"
    elif value == "NLP":
        return "Natural Language Processing"
    elif value == "DH":
        return "Digital Humanities"
    elif value == "SNA":
        return "Social Network Analysis"
    else:
        return value  # leave unknown tags untouched

blogs['tag'].apply(clean_func)


Out[14]:
0    Geographical Information Systems
1         Natural Language Processing
2             Social Network Analysis
3                  Digital Humanities
4    Geographical Information Systems
5         Natural Language Processing
6             Social Network Analysis
7                  Digital Humanities
Name: tag, dtype: object
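An equivalent, arguably tidier approach (a sketch) is to keep the long names in a dictionary and use .replace(), which leaves values without a match untouched:

```
long_names = {
    'GIS': 'Geographical Information Systems',
    'NLP': 'Natural Language Processing',
    'DH': 'Digital Humanities',
    'SNA': 'Social Network Analysis',
}
blogs['tag'].replace(long_names)
```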

In [15]:
blogs


Out[15]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH
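Note that blogs still shows the short tags: .apply() returns a new Series instead of modifying the frame. To keep the cleaned values, assign the result back:

```
blogs['tag'] = blogs['tag'].apply(clean_func)
```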

Summary statistics

Pandas provides nifty methods to understand your data. Let's look at the describe, correlation (corr), and covariance (cov) methods that you can use to quickly make sense of the data.

The describe() method provides quick stats on all suitable columns.


In [33]:
bikes_df.describe()


Out[33]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
count 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000 310.000000
mean 2985.048387 1233.351613 1983.325806 3510.261290 1862.983871 1054.306452 2873.483871
std 2169.271062 944.643188 1450.715170 2484.959789 1332.543266 1064.029205 2039.315504
min 32.000000 0.000000 33.000000 47.000000 18.000000 0.000000 0.000000
25% 596.000000 243.250000 427.000000 831.000000 474.750000 53.250000 731.000000
50% 3128.000000 1269.000000 2019.500000 3688.500000 1822.500000 704.000000 3223.500000
75% 4973.250000 2003.000000 3168.250000 5731.750000 3069.000000 1818.500000 4717.250000
max 7077.000000 3124.000000 4999.000000 8222.000000 4510.000000 4386.000000 6595.000000

In [34]:
blogs.describe()


Out[34]:
entries
count 6.000000
mean 793.500000
std 528.321398
min 115.000000
25% 558.500000
50% 661.500000
75% 1036.000000
max 1638.000000

In [35]:
autos_df.describe()


Out[35]:
mpg cylinders displacement weight acceleration year origin
count 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000 397.000000
mean 23.515869 5.458438 193.532746 2970.261965 15.555668 75.994962 1.574307
std 7.825804 1.701577 104.379583 847.904119 2.749995 3.690005 0.802549
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.000000 2223.000000 13.800000 73.000000 1.000000
50% 23.000000 4.000000 146.000000 2800.000000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3609.000000 17.100000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000

Activity

Other summary statistics include: `count()`, `mean()`, `median()`, `quantile()`, `std()`, `var()`, `min()`, `max()`, etc. Play around a bit to see what they do in the `DataFrames` we already have.
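For example, to get started:

```
bikes_df['Berri 1'].median()   # the middle daily count on Berri 1
autos_df['mpg'].quantile(0.9)  # 90th percentile of fuel efficiency
```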

Merging and joining

Throughout an analysis, we'll often need to merge/join datasets as data is typically stored in a relational manner.

Like SQL's JOIN clause, pandas.merge() allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (on, how, left_on, right_on, left_index, right_index) allowing you to specify the columns or indexes on which to join.

By default, pandas.merge() operates as an inner join, which can be changed using the how parameter. An inner join combines the two DataFrames on a key, keeping only the rows whose key value appears in both.

This is always easiest to understand with examples.


In [3]:
left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
left_frame


Out[3]:
key left_value
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e

In [4]:
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})

right_frame


Out[4]:
key right_value
0 2 f
1 3 g
2 4 h
3 5 i
4 6 j

inner join (default)


In [5]:
pd.merge(left_frame, right_frame, on='key', how='inner')


Out[5]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h

We lose values from both frames since certain keys do not match up. If our key columns had different names, we could have used the left_on and right_on parameters to specify which field to join on from each frame:

pd.merge(left_frame, right_frame, left_on='left_key', right_on='right_key')

Alternatively, if our keys were indexes, we could use the left_index or right_index parameters, which accept a boolean value. You can mix and match columns and indexes like this:

pd.merge(left_frame, right_frame, left_on='key', right_index=True)

left outer join


In [6]:
pd.merge(left_frame, right_frame, on='key', how='left')


Out[6]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h

We keep everything from the left frame, pulling in the value from the right frame where the keys match up. The right_value is null where keys do not match (NaN).

right outer join


In [7]:
pd.merge(left_frame, right_frame, on='key', how='right')


Out[7]:
key left_value right_value
0 2 c f
1 3 d g
2 4 e h
3 5 NaN i
4 6 NaN j

This time we've kept everything from the right frame with the left_value being null where the right frame's key did not find a match.

full outer join


In [41]:
pd.merge(left_frame, right_frame, on='key', how='outer')


Out[41]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h
5 5 NaN i
6 6 NaN j

We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are null.

concat

Pandas also provides a way to combine DataFrames along an axis: pandas.concat(). It takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.


In [42]:
pd.concat([left_frame, right_frame])


Out[42]:
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c NaN
3 3 d NaN
4 4 e NaN
0 2 NaN f
1 3 NaN g
2 4 NaN h
3 5 NaN i
4 6 NaN j

By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be null.
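Notice that the row labels 0-4 now appear twice. If the original indices don't matter, ignore_index=True renumbers the rows of the result:

```
pd.concat([left_frame, right_frame], ignore_index=True)
```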

Additionally, objects can be concatenated side-by-side using the function's axis parameter. The axis labeling information in Pandas objects serves many purposes:

  • Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
  • Enables automatic and explicit data alignment.
  • Allows intuitive getting and setting of subsets of the data set.

In [43]:
pd.concat([left_frame, right_frame], axis=1)


Out[43]:
key left_value key right_value
0 0 a 2 f
1 1 b 3 g
2 2 c 4 h
3 3 d 5 i
4 4 e 6 j

Group by

The groupby() method lets you perform grouping operations. The example below groups the rows by the blog column and counts the non-null values in the remaining columns for each blog.


In [18]:
blogs


Out[18]:
blog entries tag
0 1 NaN GIS
1 1 1638 NLP
2 1 569 SNA
3 1 115 DH
4 2 NaN GIS
5 2 1130 NLP
6 2 754 SNA
7 2 555 DH

In [20]:
blogs.groupby('blog').count()


Out[20]:
entries tag
blog
1 3 4
2 3 4

And now for a more complex example. From our bikes_df DataFrame, we want to know whether people are more active on weekdays or on weekends. We are going to add a new column with information about the weekday, and then sum all the values for each weekday. We do that by grouping by weekday and aggregating the values by simple addition. In this case we use np.sum(), but any other operation, such as np.max() or np.average(), could be used.


In [45]:
bikes_df.index


Out[45]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01, ..., 2012-11-05]
Length: 310, Freq: None, Timezone: None

The data frame's index is of type DatetimeIndex, which lets us extract datetime information, such as the weekday, directly from the index.


In [46]:
bikes_df.index.weekday


Out[46]:
array([6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5,
       6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0], dtype=int32)

In order to create the new column we just need to pick a name and assign a Series value as if the DataFrame were a dictionary (remember that DataFrame and Series are labeled).


In [47]:
bikes_df['weekday'] = bikes_df.index.weekday
bikes_df.head()


-c:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[47]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 weekday
Date
2012-01-01 35 0 38 51 26 10 16 6
2012-01-02 83 1 68 153 53 6 43 0
2012-01-03 135 2 104 248 89 3 58 1
2012-01-04 144 1 116 318 111 8 61 2
2012-01-05 197 2 124 330 97 13 95 3

In [48]:
bikes_df.groupby('weekday')


Out[48]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f2d5303a668>

In [49]:
counts_by_day = bikes_df.groupby('weekday').aggregate(np.sum)
counts_by_day


Out[49]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
weekday
0 134298 60329 90051 163767 90184 46204 130130
1 135305 58708 92035 165880 91399 35167 120088
2 152972 67344 104891 186061 102103 43263 133088
3 160131 69028 111895 196715 105674 45385 140241
4 141771 56446 98568 172390 89872 42470 137255
5 101578 34018 62067 105060 48289 52861 112828
6 99310 36466 55324 98308 50004 61485 117150
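Pandas numbers weekdays from 0 (Monday) to 6 (Sunday). As a quick sketch, we could relabel the index to make the table more readable:

```
counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                       'Friday', 'Saturday', 'Sunday']
```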

And finally a bit of matplotlib + pandas magic! But don't worry, you'll learn about plotting in the next class.


In [50]:
# display graphs inline
%matplotlib inline 
import matplotlib
# Make the fonts bigger
matplotlib.rc('figure', figsize=(14, 7))

counts_by_day.plot()


Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2d5169ea90>

This is by no means all we will see about grouping, but it's a good start if you already understand what's going on here.

Art Bank collection

The Art Bank collection is a dataset provided by data.gc.ca, a key part of Canada's Action Plan on Open Government to enhance transparency and accountability. It contains the art works in the Canada Council Art Bank collection. For 40 years, the Art Bank has been collecting work by the best Canadian artists of our times. The Art Bank is part of the Canada Council for the Arts, Canada's national arts funder. The dataset is licensed under the terms of the Open Government Licence - Canada.

And... it's in XML! XML stands for eXtensible Markup Language, and if you want to see an example right now, just press Ctrl+U and the browser will (likely) show you the HTML under this web page. HTML is a close relative of XML (XHTML is the variant that is actually XML-compliant), so now you have an idea.

XML is something that you should know about already. Very basically, XML defines a hierarchical structure in which concepts are enclosed in tags that might have properties.

<art_list>
    <art_work title="Title of the artwork" />
</art_list>

In the last example, art_list is a tag, and <art_list>...</art_list> is how you enclose something in a tag. Sometimes a tag has nothing to enclose, as in <art_work />, which is self-closing. Finally, attributes (like title above) work just like dictionary keys and values.
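For instance, with lxml we can parse that tiny document and read the title attribute as if the element were a dictionary (a minimal sketch):

```
from lxml import etree

art_list = etree.fromstring('<art_list><art_work title="Title of the artwork"/></art_list>')
art_list.find('art_work').get('title')  # 'Title of the artwork'
```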

In Python, the fastest way to get all the concepts we need is usually by finding all the tags of a certain kind. The module lxml.etree is able to do that for us.

First, let's look at the first lines of the dataset by running the shell command head through IPython's special ! prefix.


In [21]:
!head data/art_listing.xml

From the file's first lines, it's pretty obvious that what we want are the Art_Summary elements and their attributes.

The function etree.parse() opens a file, parses it into a Python data structure and returns it as an object.


In [22]:
from lxml import etree
arts = etree.parse("data/art_listing.xml")

To find all Art_Summary elements, we can call the method .findall().


In [23]:
summaries = arts.findall("Art_Summary")
summaries[:5]


Out[23]:
[<Element Art_Summary at 0x7f85dd0005c8>,
 <Element Art_Summary at 0x7f85dcffbbc8>,
 <Element Art_Summary at 0x7f85dcffbf08>,
 <Element Art_Summary at 0x7f85dcffb148>,
 <Element Art_Summary at 0x7f85dcffbd88>]

Each of those elements in the list behaves like a dictionary.


In [24]:
summaries[0].keys()


Out[24]:
['Artist_Name', 'Title', 'Execution_Date', 'E_Category', 'F_Category', 'ID']

In [25]:
summaries[0].values()


Out[25]:
['Wright, Don',
 'Turf-Arm',
 '1984',
 'Work on paper',
 'Oeuvre sur papier',
 '85/6-0174']

Now we are ready to build a data frame from the summaries list.


In [26]:
art_list = []
for summary in summaries:
    art_list.append(summary.values())
arts = pd.DataFrame(
    art_list,
    columns=['artist_name', 'title', 'execution_date', 'category', 'category_french', 'id']
)

Let's remove the category in French.


In [27]:
del arts['category_french']

In [28]:
arts.head()


Out[28]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 ©1977, CARCC Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland ©1982, CARCC Photograph 92/3-0221
3 Walker, Peter 1.25.79 ©1979, CARCC Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

Voilà! Ready for the fun. Let's just save it for later.


In [38]:
arts.to_csv("data/arts.csv", index=False)

Activity

Given the `arts` data frame, clean the dates so you only see numbers. If a year is lower than 100, it refers to the 1900s: for example, 78 is actually 1978, and that needs to be fixed too. [Solution](data/arts1.py)


In [34]:
l = ['1989', '1999']
l[-1]


Out[34]:
'1999'

In [37]:
"1989, CARCC".split("-")[-1]


Out[37]:
'1989, CARCC'

In [40]:
# Clean the dates so you only see numbers.
def clean_years(value):
    result = value
    chars_to_replace = ["c.", "©", ", CARCC", "no date", "n.d.", " SODRAC", ", CA", " CARCC"]
    chars_to_split = ["-", "/"]
    if isinstance(result, str):  # isinstance checks whether the value is a string; non-strings pass through
        for char in chars_to_split:
            result = result.split(char)[-1]
        for char in chars_to_replace:
            result = result.replace(char, "")
        if result == "":
            return np.nan
        else:
            return int(result)
    else:
        return result

arts['execution_date'] = arts['execution_date'].apply(clean_years)
arts.head()


Out[40]:
artist_name title execution_date category id
0 Wright, Don Turf-Arm 1984 Work on paper 85/6-0174
1 Hansen, Jim Nfld. album no. 99 1977 Work on paper 79/80-0510
2 Creates, Marlene A Stone Placed in Gathered Water, Newfoundland 1982 Photograph 92/3-0221
3 Walker, Peter 1.25.79 1979 Work on paper 79/80-0529
4 Bretzloff, Carol Under Heaven, Wind 1983 Work on paper 13/4-0007

In [43]:
# If a year is lower than 100, it refers to the 1900s. For example, 78 is actually 1978, and that needs to be fixed too.
def clean_year_99(value):
    if value <= 99:
        return 1900 + value
    else:
        return value 

arts["execution_date"] = arts["execution_date"].apply(clean_year_99)
arts[arts["execution_date"] < 100].head()


Out[43]:
artist_name title execution_date category id

In [45]:
# using lambda functions
arts["execution_date"] = arts["execution_date"].apply(lambda value: 1900 + value if value <= 99 else value)

For the next class