Last updated: June 29th 2016

Climate data exploration: a journey through Pandas

Welcome to a demo of Python's data analysis package called Pandas. Our goal is to learn about Data Analysis and transformation using Pandas while exploring datasets used to analyze climate change.

The story

The global goal of this demo is to provide the tools to be able to try and reproduce some of the analysis done in the IPCC global climate reports published in the last decade (see for example https://www.ipcc.ch/pdf/assessment-report/ar5/syr/SYR_AR5_FINAL_full.pdf).

We are first going to load a few public datasets containing information about global temperature, global and local sea level infomation, and global concentration of greenhouse gases like CO2, to see if there are correlations and how the trends are to evolve, assuming no fundamental change in the system. For all these datasets, we will download them, visualize them, clean them, search through them, merge them, resample them, transform them and summarize them.

In the process, we will learn about:

1. Loading data
2. Pandas datastructures
3. Cleaning and formatting data
4. Basic visualization
5. Accessing data
6. Working with dates and times
7. Transforming datasets
8. Statistical analysis
9. Data agregation and summarization
10. Correlations and regressions
11. Predictions from auto regression models

Some initial setup


In [1]:
%matplotlib inline
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_rows", 16)

LARGE_FIGSIZE = (12, 8)

In [2]:
# Change this cell to the demo location on YOUR machine
%cd ~/Projects/pandas_tutorial/climate_timeseries/
%ls


/Users/jrocher/Projects/pandas_tutorial/climate_timeseries
climate_timeseries.ipynb  data/

1. Loading data

To find all reading functions in pandas, ask ipython's tab completion:


In [3]:
#pd.read_<TAB>

In [4]:
pd.read_table?

From a local text file

Let's first load some temperature data which covers all lattitudes. Since read_table is supposed to do the job for a text file, let's just try it:


In [5]:
filename = "data/temperatures/annual.land_ocean.90S.90N.df_1901-2000mean.dat"
full_globe_temp = pd.read_table(filename)
full_globe_temp


Out[5]:
1880 -0.1591
0 1881 -0.0789
1 1882 -0.1313
2 1883 -0.1675
3 1884 -0.2485
4 1885 -0.2042
5 1886 -0.1695
6 1887 -0.2505
7 1888 -0.1605
... ...
122 2003 0.5818
123 2004 0.5416
124 2005 0.6154
125 2006 0.5601
126 2007 0.5472
127 2008 0.4804
128 2009 0.5551
129 2010 -999.0000

130 rows × 1 columns

There is only 1 column! Let's try again stating that values are separated by any number of spaces:


In [6]:
full_globe_temp = pd.read_table(filename, sep="\s+")
full_globe_temp


Out[6]:
1880 -0.1591
0 1881 -0.0789
1 1882 -0.1313
2 1883 -0.1675
3 1884 -0.2485
4 1885 -0.2042
5 1886 -0.1695
6 1887 -0.2505
7 1888 -0.1605
... ... ...
122 2003 0.5818
123 2004 0.5416
124 2005 0.6154
125 2006 0.5601
126 2007 0.5472
127 2008 0.4804
128 2009 0.5551
129 2010 -999.0000

130 rows × 2 columns

There are columns but the column names are 1880 and -0.1591!


In [7]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"])
full_globe_temp


Out[7]:
year mean temp
0 1880 -0.1591
1 1881 -0.0789
2 1882 -0.1313
3 1883 -0.1675
4 1884 -0.2485
5 1885 -0.2042
6 1886 -0.1695
7 1887 -0.2505
... ... ...
123 2003 0.5818
124 2004 0.5416
125 2005 0.6154
126 2006 0.5601
127 2007 0.5472
128 2008 0.4804
129 2009 0.5551
130 2010 -999.0000

131 rows × 2 columns

Since we only have 2 columns, one of which would be nicer to access the data (the year of the record), let's try using the index_col option:


In [8]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"], 
                                index_col=0)
full_globe_temp


Out[8]:
mean temp
year
1880 -0.1591
1881 -0.0789
1882 -0.1313
1883 -0.1675
1884 -0.2485
1885 -0.2042
1886 -0.1695
1887 -0.2505
... ...
2003 0.5818
2004 0.5416
2005 0.6154
2006 0.5601
2007 0.5472
2008 0.4804
2009 0.5551
2010 -999.0000

131 rows × 1 columns

Last step: the index is made of dates. Let's make that explicit:


In [9]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"], 
                                index_col=0, parse_dates=True)
full_globe_temp


Out[9]:
mean temp
year
1880-01-01 00:00:00 -0.1591
1881-01-01 00:00:00 -0.0789
1882-01-01 00:00:00 -0.1313
1883-01-01 00:00:00 -0.1675
1884-01-01 00:00:00 -0.2485
1885-01-01 00:00:00 -0.2042
1886-01-01 00:00:00 -0.1695
1887-01-01 00:00:00 -0.2505
... ...
2003-01-01 0.5818
2004-01-01 0.5416
2005-01-01 0.6154
2006-01-01 0.5601
2007-01-01 0.5472
2008-01-01 0.4804
2009-01-01 0.5551
2010-01-01 -999.0000

131 rows × 1 columns

From a chunked file

Since every dataset can contain mistakes, let's load a different file with temperature data. NASA's GISS dataset is written in chunks: look at it in data/temperatures/GLB.Ts+dSST.txt


In [10]:
giss_temp = pd.read_table("data/temperatures/GLB.Ts+dSST.txt", sep="\s+", skiprows=7,
                          skip_footer=11, engine="python")
giss_temp


Out[10]:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON Year.1
0 1880 -34 -27 -22 -30 -16 -24 -19 -12 -20 -19 -16 -21 -22 *** **** -23 -18 -18 1880
1 1881 -13 -16 -2 -3 -3 -27 -12 -8 -18 -23 -28 -18 -14 -14 -17 -3 -15 -23 1881
2 1882 3 4 -2 -24 -20 -32 -27 -11 -11 -25 -25 -37 -17 -16 -4 -15 -23 -20 1882
3 1883 -38 -38 -12 -20 -20 -8 -3 -13 -19 -19 -28 -21 -20 -21 -38 -18 -8 -22 1883
4 1884 -20 -14 -31 -36 -33 -36 -31 -24 -29 -25 -29 -25 -28 -28 -18 -33 -31 -28 1884
5 1885 -57 -29 -19 -36 -35 -40 -28 -24 -17 -14 -14 0 -26 -28 -37 -30 -31 -15 1885
6 1886 -37 -43 -34 -23 -21 -30 -13 -21 -12 -22 -29 -18 -25 -24 -26 -26 -21 -21 1886
7 1887 -60 -43 -26 -34 -28 -25 -19 -28 -24 -33 -29 -40 -32 -30 -40 -29 -24 -28 1887
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
135 2009 56 48 49 57 59 62 66 61 64 58 72 58 59 59 52 55 63 65 2009
136 2010 66 75 87 82 71 60 56 59 55 65 74 44 66 67 66 80 58 65 2010
137 2011 45 44 57 60 47 54 70 69 52 60 50 48 55 55 45 55 64 54 2011
138 2012 38 43 52 62 71 59 50 56 68 73 69 46 57 57 43 62 55 70 2012
139 2013 62 52 60 48 56 61 53 61 73 61 75 61 60 59 53 55 58 70 2013
140 2014 68 44 71 72 79 62 50 74 81 78 64 74 68 67 58 74 62 74 2014
141 2015 75 80 84 71 **** **** **** **** **** **** **** **** **** *** 76 **** **** **** 2015
142 Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON Year

143 rows × 20 columns

QUIZ: What happens if you remove the skiprows? skipfooter? engine?

EXERCISE: Load some readings of CO2 concentrations in the atmosphere from the data/greenhouse_gaz/co2_mm_global.txt data file.


In [11]:
# Your code here

From a remote text file

So far, we have only loaded temperature datasets. Climate change also affects the sea levels on the globe. Let's load some datasets with the sea levels. The university of colorado posts updated timeseries for mean sea level globably, per hemisphere, or even per ocean, sea, ... Let's download the global one, and the ones for the northern and southern hemisphere.

That will also illustrate that to load text files that are online, there is no more work than replacing the filepath by a URL n read_table:


In [12]:
# Local backup: data/sea_levels/sl_nh.txt
northern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_nh.txt", 
                                   sep="\s+")
northern_sea_level


Out[12]:
year msl_ib(mm) #version_2016_rel3
0 1992.9323 13.717 NaN
1 1992.9595 1.328 NaN
2 1992.9866 -13.375 NaN
3 1993.0138 -24.723 NaN
4 1993.0409 -29.231 NaN
5 1993.0681 -32.538 NaN
6 1993.0952 -35.588 NaN
7 1993.1223 -26.579 NaN
... ... ... ...
853 2016.1440 49.218 NaN
854 2016.1711 44.274 NaN
855 2016.1983 33.961 NaN
856 2016.2254 42.199 NaN
857 2016.2526 45.004 NaN
858 2016.2797 49.809 NaN
859 2016.3068 55.401 NaN
860 2016.3340 50.416 NaN

861 rows × 3 columns


In [13]:
# Local backup: data/sea_levels/sl_sh.txt
southern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_sh.txt", 
                                   sep="\s+")
southern_sea_level


Out[13]:
year msl_ib(mm) #version_2016_rel3
0 1992.9323 2.517 NaN
1 1992.9595 -7.645 NaN
2 1992.9866 -2.267 NaN
3 1993.0138 0.855 NaN
4 1993.0409 -2.020 NaN
5 1993.0681 -0.398 NaN
6 1993.0952 4.102 NaN
7 1993.1223 6.787 NaN
... ... ... ...
853 2016.1440 84.902 NaN
854 2016.1711 83.829 NaN
855 2016.1983 86.020 NaN
856 2016.2254 88.122 NaN
857 2016.2526 88.090 NaN
858 2016.2797 87.806 NaN
859 2016.3068 77.270 NaN
860 2016.3340 70.916 NaN

861 rows × 3 columns


In [14]:
# The 2015 version of the global dataset:
# Local backup: data/sea_levels/sl_ns_global.txt
url = "http://sealevel.colorado.edu/files/2015_rel2/sl_ns_global.txt"
global_sea_level = pd.read_table(url, sep="\s+")
global_sea_level


Out[14]:
year msl_ib_ns(mm) #version_2015_rel2
0 1992.9595 -5.818 NaN
1 1992.9866 -7.525 NaN
2 1993.0138 -9.215 NaN
3 1993.0409 -11.796 NaN
4 1993.0681 -11.291 NaN
5 1993.0952 -9.569 NaN
6 1993.1223 -3.714 NaN
7 1993.1495 -2.471 NaN
... ... ... ...
772 2014.9494 70.723 NaN
773 2014.9766 70.522 NaN
774 2015.0037 66.662 NaN
775 2015.0309 64.804 NaN
776 2015.0580 62.115 NaN
777 2015.0852 68.589 NaN
778 2015.1123 69.745 NaN
779 2015.1395 75.007 NaN

780 rows × 3 columns

There are clearly lots of cleanup to be done on these datasets. See below...

From a local or remote HTML file

To be able to grab more local data about mean sea levels, we can download and extract data about mean sea level stations around the world from the PSMSL (http://www.psmsl.org/). Again to download and parse all tables in a webpage, just give read_html the URL to parse:


In [15]:
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
# Local backup in data/sea_levels/Obtaining Tide Gauge Data.html
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")

In [16]:
# there is 1 table on that page which contains metadata about the stations where 
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stations


Out[16]:
Station Name ID Lat. Lon. GLOSS ID Country Date Coastline Station
0 HELIGMAN 231 60.200 19.300 NaN ALA 01/01/1980 60 251
1 KOBBAKLINTAR 63 60.033 19.883 NaN ALA 01/01/1980 60 261
2 LEMSTROM 84 60.100 20.017 NaN ALA 01/01/1980 60 271
3 FOGLO / DEGERBY 249 60.032 20.385 NaN ALA 06/10/2015 60 281
4 PAGO PAGO 539 -14.280 -170.690 144.0 ASM 08/03/2016 745 1
5 BAHIA ESPERANZA 988 -63.300 -56.917 185.0 ATA 11/05/1999 999 1
6 ARGENTINE ISLANDS 913 -65.246 -64.257 188.0 ATA 21/05/2015 999 3
7 PUERTO SOBERANIA 1603 -62.483 -59.633 189.0 ATA 30/09/2004 999 5
... ... ... ... ... ... ... ... ... ...
1458 DANANG 1475 16.100 108.217 NaN VNM 17/11/2015 605 51
1459 HONNGU 1003 18.800 105.767 NaN VNM 17/11/2015 605 61
1460 HONDAU 841 20.667 106.800 NaN VNM 17/11/2015 605 81
1461 CHARLOTTE AMALIE 1393 18.335 -64.920 NaN VIR 08/03/2016 939 1
1462 LIME TREE BAY, ST CROIX 1447 17.693 -64.753 NaN VIR 08/03/2016 939 11
1463 CHRISTIANSTED HARBOUR 2118 17.750 -64.705 NaN VIR 09/03/2016 939 15
1464 LAMESHUR BAY 2119 18.317 -64.723 NaN VIR 09/03/2016 939 21
1465 ADEN 44 12.788 44.974 3.0 YEM 01/02/2016 485 1

1466 rows × 9 columns

That table can be used to search for a station in a region of the world we choose, extract an ID for it and download the corresponding time series with the URL http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata

2. Pandas DataStructures

Now that we have used read_** functions to load datasets, we need to understand better what kind of objects we got from them to learn to work with them.

DataFrame, the pandas 2D structure


In [17]:
# Type of the object?
type(giss_temp)


Out[17]:
pandas.core.frame.DataFrame

In [18]:
# Internal nature of the object
print(giss_temp.shape)
print(giss_temp.dtypes)


(143, 20)
Year      object
Jan       object
Feb       object
Mar       object
Apr       object
May       object
Jun       object
Jul       object
           ...  
Dec       object
J-D       object
D-N       object
DJF       object
MAM       object
JJA       object
SON       object
Year.1    object
dtype: object

Descriptors for the vertical axis (axis=0)


In [19]:
giss_temp.index


Out[19]:
RangeIndex(start=0, stop=143, step=1)

Descriptors for the horizontal axis (axis=1)


In [20]:
giss_temp.columns


Out[20]:
Index([u'Year', u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', u'Aug',
       u'Sep', u'Oct', u'Nov', u'Dec', u'J-D', u'D-N', u'DJF', u'MAM', u'JJA',
       u'SON', u'Year.1'],
      dtype='object')

A lot of information at once including memory usage:


In [21]:
giss_temp.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 20 columns):
Year      143 non-null object
Jan       143 non-null object
Feb       143 non-null object
Mar       143 non-null object
Apr       143 non-null object
May       143 non-null object
Jun       143 non-null object
Jul       143 non-null object
Aug       143 non-null object
Sep       143 non-null object
Oct       143 non-null object
Nov       143 non-null object
Dec       143 non-null object
J-D       143 non-null object
D-N       143 non-null object
DJF       143 non-null object
MAM       143 non-null object
JJA       143 non-null object
SON       143 non-null object
Year.1    143 non-null object
dtypes: object(20)
memory usage: 22.4+ KB

Series, the pandas 1D structure

A series can be constructed with the pd.Series constructor (passing a list or array of values) or from a DataFrame, by extracting one of its columns.


In [22]:
# Do we already have a series for the full_globe_temp?
type(full_globe_temp)


Out[22]:
pandas.core.frame.DataFrame

In [23]:
# Since there is only one column of values, we can make this a Series without 
# loosing information:
full_globe_temp = full_globe_temp["mean temp"]

Core attributes/information:


In [24]:
print(type(full_globe_temp))
print(full_globe_temp.dtype)
print(full_globe_temp.shape)
print(full_globe_temp.nbytes)


<class 'pandas.core.series.Series'>
float64
(131,)
1048

Probably the most important attribute of a Series or DataFrame is its index since we will use that to, well, index into the structures to access te information:


In [25]:
full_globe_temp.index


Out[25]:
DatetimeIndex(['1880-01-01', '1881-01-01', '1882-01-01', '1883-01-01',
               '1884-01-01', '1885-01-01', '1886-01-01', '1887-01-01',
               '1888-01-01', '1889-01-01',
               ...
               '2001-01-01', '2002-01-01', '2003-01-01', '2004-01-01',
               '2005-01-01', '2006-01-01', '2007-01-01', '2008-01-01',
               '2009-01-01', '2010-01-01'],
              dtype='datetime64[ns]', name=u'year', length=131, freq=None)

NumPy arrays as backend of Pandas

It is always possible to fall back to a good old NumPy array to pass on to scientific libraries that need them: SciPy, scikit-learn, ...


In [26]:
full_globe_temp.values


Out[26]:
array([ -1.59100000e-01,  -7.89000000e-02,  -1.31300000e-01,
        -1.67500000e-01,  -2.48500000e-01,  -2.04200000e-01,
        -1.69500000e-01,  -2.50500000e-01,  -1.60500000e-01,
        -1.10300000e-01,  -3.01400000e-01,  -2.52700000e-01,
        -2.92600000e-01,  -3.27400000e-01,  -2.88000000e-01,
        -2.14600000e-01,  -1.00900000e-01,  -1.29400000e-01,
        -2.51700000e-01,  -1.35600000e-01,  -1.12500000e-01,
        -1.69300000e-01,  -2.25500000e-01,  -3.44200000e-01,
        -3.79900000e-01,  -2.58300000e-01,  -2.06000000e-01,
        -3.72600000e-01,  -3.92200000e-01,  -4.04100000e-01,
        -3.86000000e-01,  -4.14400000e-01,  -3.42200000e-01,
        -3.19600000e-01,  -1.59900000e-01,  -8.10000000e-02,
        -2.80300000e-01,  -3.22700000e-01,  -2.10600000e-01,
        -2.18600000e-01,  -2.02500000e-01,  -1.39000000e-01,
        -2.22200000e-01,  -2.06300000e-01,  -1.84300000e-01,
        -1.26000000e-01,  -1.92000000e-02,  -1.00700000e-01,
        -9.35000000e-02,  -2.20100000e-01,  -2.82000000e-02,
         4.10000000e-03,  -2.85000000e-02,  -1.61600000e-01,
        -2.85000000e-02,  -5.65000000e-02,  -3.18000000e-02,
         7.27000000e-02,   9.55000000e-02,   8.98000000e-02,
         1.23400000e-01,   1.83700000e-01,   1.37700000e-01,
         1.29900000e-01,   2.36200000e-01,   1.16200000e-01,
        -2.02000000e-02,  -4.06000000e-02,  -4.94000000e-02,
        -7.52000000e-02,  -1.63500000e-01,  -2.90000000e-03,
         3.53000000e-02,   1.19000000e-01,  -1.02800000e-01,
        -1.22000000e-01,  -1.83600000e-01,   5.70000000e-02,
         1.05400000e-01,   6.34000000e-02,   1.14000000e-02,
         8.94000000e-02,   1.13600000e-01,   1.33300000e-01,
        -1.26800000e-01,  -6.95000000e-02,  -1.17000000e-02,
         2.20000000e-03,  -1.21000000e-02,   9.11000000e-02,
         4.71000000e-02,  -5.50000000e-02,   3.48000000e-02,
         1.61700000e-01,  -8.62000000e-02,  -1.77000000e-02,
        -1.06200000e-01,   1.42500000e-01,   6.15000000e-02,
         1.53300000e-01,   1.93000000e-01,   2.28400000e-01,
         1.17100000e-01,   2.81700000e-01,   8.75000000e-02,
         6.81000000e-02,   1.60500000e-01,   2.96800000e-01,
         3.00600000e-01,   2.22300000e-01,   3.86100000e-01,
         3.36000000e-01,   2.02300000e-01,   2.30700000e-01,
         2.93400000e-01,   4.07300000e-01,   2.75300000e-01,
         4.78200000e-01,   5.97100000e-01,   4.19900000e-01,
         3.88600000e-01,   5.17300000e-01,   5.74500000e-01,
         5.81800000e-01,   5.41600000e-01,   6.15400000e-01,
         5.60100000e-01,   5.47200000e-01,   4.80400000e-01,
         5.55100000e-01,  -9.99000000e+02])

In [27]:
type(full_globe_temp.values)


Out[27]:
numpy.ndarray

Creating new DataFrames manually

DataFrames can also be created manually, by grouping several Series together. Let's make a new frame from the 3 sea level datasets we downloaded above. They will be displayed along the same index. Wait, does that makes sense to do that?


In [28]:
# Are they aligned?
southern_sea_level.year == northern_sea_level.year


Out[28]:
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
       ... 
853    True
854    True
855    True
856    True
857    True
858    True
859    True
860    True
Name: year, dtype: bool

In [29]:
# So, are they aligned?
np.all(southern_sea_level.year == northern_sea_level.year)


Out[29]:
True

So the northern hemisphere and southern hemisphere datasets are aligned. What about the global one?


In [30]:
len(global_sea_level.year) == len(northern_sea_level.year)


Out[30]:
False

For now, let's just build a DataFrame with the 2 hemisphere datasets then. We will come back to add the global one later...


In [31]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"], 
                               "date": northern_sea_level.year})
mean_sea_level


Out[31]:
date northern_hem southern_hem
0 1992.9323 13.717 2.517
1 1992.9595 1.328 -7.645
2 1992.9866 -13.375 -2.267
3 1993.0138 -24.723 0.855
4 1993.0409 -29.231 -2.020
5 1993.0681 -32.538 -0.398
6 1993.0952 -35.588 4.102
7 1993.1223 -26.579 6.787
... ... ... ...
853 2016.1440 49.218 84.902
854 2016.1711 44.274 83.829
855 2016.1983 33.961 86.020
856 2016.2254 42.199 88.122
857 2016.2526 45.004 88.090
858 2016.2797 49.809 87.806
859 2016.3068 55.401 77.270
860 2016.3340 50.416 70.916

861 rows × 3 columns

Note: there are other ways to create DataFrames manually, for example from a 2D numpy array.

There is still the date in a regular column and a numerical index that is not that meaningful. We can specify the index of a DataFrame at creation. Let's try:


In [32]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"]},
                               index = northern_sea_level.year)
mean_sea_level


Out[32]:
northern_hem southern_hem
year
1992.9323 NaN NaN
1992.9595 NaN NaN
1992.9866 NaN NaN
1993.0138 NaN NaN
1993.0409 NaN NaN
1993.0681 NaN NaN
1993.0952 NaN NaN
1993.1223 NaN NaN
... ... ...
2016.1440 NaN NaN
2016.1711 NaN NaN
2016.1983 NaN NaN
2016.2254 NaN NaN
2016.2526 NaN NaN
2016.2797 NaN NaN
2016.3068 NaN NaN
2016.3340 NaN NaN

861 rows × 2 columns

Now the fact that it is failing show that Pandas does auto-alignment of values: for each value of the index, it searches for a value in each Series that maps the same value. Since these series have a dumb numerical index, no values are found.

Since we know that the order of the values match the index we chose, we can replace the Series by their values only at creation of the DataFrame:


In [33]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"].values, 
                               "southern_hem": southern_sea_level["msl_ib(mm)"].values},
                               index = northern_sea_level.year)
mean_sea_level


Out[33]:
northern_hem southern_hem
year
1992.9323 13.717 2.517
1992.9595 1.328 -7.645
1992.9866 -13.375 -2.267
1993.0138 -24.723 0.855
1993.0409 -29.231 -2.020
1993.0681 -32.538 -0.398
1993.0952 -35.588 4.102
1993.1223 -26.579 6.787
... ... ...
2016.1440 49.218 84.902
2016.1711 44.274 83.829
2016.1983 33.961 86.020
2016.2254 42.199 88.122
2016.2526 45.004 88.090
2016.2797 49.809 87.806
2016.3068 55.401 77.270
2016.3340 50.416 70.916

861 rows × 2 columns

3. Cleaning and formatting data

The datasets that we obtain straight from the reading functions are pretty raw. A lot of pre-processing can be done during data read but we haven't used all the power of the reading functions. Let's learn to do a lot of cleaning and formatting of the data.

The GISS temperature dataset has a lot of issues too: useless numerical index, redundant columns, useless rows, placeholder (****) for missing values, and wrong type for the columns. Let's fix all this:

Renaming columns


In [34]:
# The columns of the local_sea_level_stations aren't clean: they contain spaces and dots.
local_sea_level_stations.columns


Out[34]:
Index([u'Station Name', u'ID', u'Lat.', u'Lon.', u'GLOSS ID  ', u'Country  ',
       u'Date', u'Coastline  ', u'Station  '],
      dtype='object')

In [35]:
# Let's clean them up a bit:
local_sea_level_stations.columns = [name.strip().replace(".", "") 
                                    for name in local_sea_level_stations.columns]
local_sea_level_stations.columns


Out[35]:
Index([u'Station Name', u'ID', u'Lat', u'Lon', u'GLOSS ID', u'Country',
       u'Date', u'Coastline', u'Station'],
      dtype='object')

We can also rename an index by setting its name. For example, the index of the mean_sea_level dataFrame could be called date since it contains more than just the year:


In [36]:
mean_sea_level.index.name = "date"
mean_sea_level


Out[36]:
northern_hem southern_hem
date
1992.9323 13.717 2.517
1992.9595 1.328 -7.645
1992.9866 -13.375 -2.267
1993.0138 -24.723 0.855
1993.0409 -29.231 -2.020
1993.0681 -32.538 -0.398
1993.0952 -35.588 4.102
1993.1223 -26.579 6.787
... ... ...
2016.1440 49.218 84.902
2016.1711 44.274 83.829
2016.1983 33.961 86.020
2016.2254 42.199 88.122
2016.2526 45.004 88.090
2016.2797 49.809 87.806
2016.3068 55.401 77.270
2016.3340 50.416 70.916

861 rows × 2 columns

Setting missing values

In the full globe dataset, -999.00 was used to indicate that there was no value for that year. Let's search for all these values and replace them with the missing value that Pandas understand: np.nan


In [37]:
full_globe_temp == -999.000


Out[37]:
year
1880-01-01 00:00:00    False
1881-01-01 00:00:00    False
1882-01-01 00:00:00    False
1883-01-01 00:00:00    False
1884-01-01 00:00:00    False
1885-01-01 00:00:00    False
1886-01-01 00:00:00    False
1887-01-01 00:00:00    False
                       ...  
2003-01-01             False
2004-01-01             False
2005-01-01             False
2006-01-01             False
2007-01-01             False
2008-01-01             False
2009-01-01             False
2010-01-01              True
Name: mean temp, dtype: bool

In [38]:
full_globe_temp[full_globe_temp == -999.000] = np.nan
full_globe_temp.tail()


Out[38]:
year
2006-01-01    0.5601
2007-01-01    0.5472
2008-01-01    0.4804
2009-01-01    0.5551
2010-01-01       NaN
Name: mean temp, dtype: float64

Choosing what is the index


In [39]:
# We didn't set a column number of the index of giss_temp, we can do that afterwards:
giss_temp = giss_temp.set_index("Year")
giss_temp.head()


Out[39]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON Year.1
Year
1880 -34 -27 -22 -30 -16 -24 -19 -12 -20 -19 -16 -21 -22 *** **** -23 -18 -18 1880
1881 -13 -16 -2 -3 -3 -27 -12 -8 -18 -23 -28 -18 -14 -14 -17 -3 -15 -23 1881
1882 3 4 -2 -24 -20 -32 -27 -11 -11 -25 -25 -37 -17 -16 -4 -15 -23 -20 1882
1883 -38 -38 -12 -20 -20 -8 -3 -13 -19 -19 -28 -21 -20 -21 -38 -18 -8 -22 1883
1884 -20 -14 -31 -36 -33 -36 -31 -24 -29 -25 -29 -25 -28 -28 -18 -33 -31 -28 1884

Dropping rows and columns


In [40]:
# 1 column is redundant with the index: 
giss_temp.columns


Out[40]:
Index([u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', u'Aug', u'Sep',
       u'Oct', u'Nov', u'Dec', u'J-D', u'D-N', u'DJF', u'MAM', u'JJA', u'SON',
       u'Year.1'],
      dtype='object')

In [41]:
# Let's drop it:
giss_temp = giss_temp.drop("Year.1", axis=1)
giss_temp


Out[41]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON
Year
1880 -34 -27 -22 -30 -16 -24 -19 -12 -20 -19 -16 -21 -22 *** **** -23 -18 -18
1881 -13 -16 -2 -3 -3 -27 -12 -8 -18 -23 -28 -18 -14 -14 -17 -3 -15 -23
1882 3 4 -2 -24 -20 -32 -27 -11 -11 -25 -25 -37 -17 -16 -4 -15 -23 -20
1883 -38 -38 -12 -20 -20 -8 -3 -13 -19 -19 -28 -21 -20 -21 -38 -18 -8 -22
1884 -20 -14 -31 -36 -33 -36 -31 -24 -29 -25 -29 -25 -28 -28 -18 -33 -31 -28
1885 -57 -29 -19 -36 -35 -40 -28 -24 -17 -14 -14 0 -26 -28 -37 -30 -31 -15
1886 -37 -43 -34 -23 -21 -30 -13 -21 -12 -22 -29 -18 -25 -24 -26 -26 -21 -21
1887 -60 -43 -26 -34 -28 -25 -19 -28 -24 -33 -29 -40 -32 -30 -40 -29 -24 -28
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2009 56 48 49 57 59 62 66 61 64 58 72 58 59 59 52 55 63 65
2010 66 75 87 82 71 60 56 59 55 65 74 44 66 67 66 80 58 65
2011 45 44 57 60 47 54 70 69 52 60 50 48 55 55 45 55 64 54
2012 38 43 52 62 71 59 50 56 68 73 69 46 57 57 43 62 55 70
2013 62 52 60 48 56 61 53 61 73 61 75 61 60 59 53 55 58 70
2014 68 44 71 72 79 62 50 74 81 78 64 74 68 67 58 74 62 74
2015 75 80 84 71 **** **** **** **** **** **** **** **** **** *** 76 **** **** ****
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON

143 rows × 18 columns


In [42]:
# We can also just select the columns we want to keep:
giss_temp = giss_temp[[u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', 
                       u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']]
giss_temp


Out[42]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
1880 -34 -27 -22 -30 -16 -24 -19 -12 -20 -19 -16 -21
1881 -13 -16 -2 -3 -3 -27 -12 -8 -18 -23 -28 -18
1882 3 4 -2 -24 -20 -32 -27 -11 -11 -25 -25 -37
1883 -38 -38 -12 -20 -20 -8 -3 -13 -19 -19 -28 -21
1884 -20 -14 -31 -36 -33 -36 -31 -24 -29 -25 -29 -25
1885 -57 -29 -19 -36 -35 -40 -28 -24 -17 -14 -14 0
1886 -37 -43 -34 -23 -21 -30 -13 -21 -12 -22 -29 -18
1887 -60 -43 -26 -34 -28 -25 -19 -28 -24 -33 -29 -40
... ... ... ... ... ... ... ... ... ... ... ... ...
2009 56 48 49 57 59 62 66 61 64 58 72 58
2010 66 75 87 82 71 60 56 59 55 65 74 44
2011 45 44 57 60 47 54 70 69 52 60 50 48
2012 38 43 52 62 71 59 50 56 68 73 69 46
2013 62 52 60 48 56 61 53 61 73 61 75 61
2014 68 44 71 72 79 62 50 74 81 78 64 74
2015 75 80 84 71 **** **** **** **** **** **** **** ****
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

143 rows × 12 columns


In [43]:
# Let's remove all these extra column names (Year  Jan ...). They all correspond to the index "Year"
giss_temp = giss_temp.drop("Year")
giss_temp


Out[43]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
1880 -34 -27 -22 -30 -16 -24 -19 -12 -20 -19 -16 -21
1881 -13 -16 -2 -3 -3 -27 -12 -8 -18 -23 -28 -18
1882 3 4 -2 -24 -20 -32 -27 -11 -11 -25 -25 -37
1883 -38 -38 -12 -20 -20 -8 -3 -13 -19 -19 -28 -21
1884 -20 -14 -31 -36 -33 -36 -31 -24 -29 -25 -29 -25
1885 -57 -29 -19 -36 -35 -40 -28 -24 -17 -14 -14 0
1886 -37 -43 -34 -23 -21 -30 -13 -21 -12 -22 -29 -18
1887 -60 -43 -26 -34 -28 -25 -19 -28 -24 -33 -29 -40
... ... ... ... ... ... ... ... ... ... ... ... ...
2008 22 31 70 48 46 43 54 39 59 62 62 51
2009 56 48 49 57 59 62 66 61 64 58 72 58
2010 66 75 87 82 71 60 56 59 55 65 74 44
2011 45 44 57 60 47 54 70 69 52 60 50 48
2012 38 43 52 62 71 59 50 56 68 73 69 46
2013 62 52 60 48 56 61 53 61 73 61 75 61
2014 68 44 71 72 79 62 50 74 81 78 64 74
2015 75 80 84 71 **** **** **** **** **** **** **** ****

136 rows × 12 columns

Let's also set **** to a real missing value (np.nan). We can often do it using a boolean mask, but that may trigger pandas warning. Another way to assign based on a boolean condition is to use the where method:


In [44]:
#giss_temp[giss_temp == "****"] = np.nan
giss_temp = giss_temp.where(giss_temp != "****", np.nan)

In [45]:
giss_temp.tail()


Out[45]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
2011 45 44 57 60 47 54 70 69 52 60 50 48
2012 38 43 52 62 71 59 50 56 68 73 69 46
2013 62 52 60 48 56 61 53 61 73 61 75 61
2014 68 44 71 72 79 62 50 74 81 78 64 74
2015 75 80 84 71 NaN NaN NaN NaN NaN NaN NaN NaN

Adding columns

While building the mean_sea_level dataFrame earlier, we didn't include the values from global_sea_level since the years were not aligned. Adding a column to a dataframe is as easy as adding an entry to a dictionary. So let's try:


In [46]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level


Out[46]:
northern_hem southern_hem mean_global
date
1992.9323 13.717 2.517 NaN
1992.9595 1.328 -7.645 NaN
1992.9866 -13.375 -2.267 NaN
1993.0138 -24.723 0.855 NaN
1993.0409 -29.231 -2.020 NaN
1993.0681 -32.538 -0.398 NaN
1993.0952 -35.588 4.102 NaN
1993.1223 -26.579 6.787 NaN
... ... ... ...
2016.1440 49.218 84.902 NaN
2016.1711 44.274 83.829 NaN
2016.1983 33.961 86.020 NaN
2016.2254 42.199 88.122 NaN
2016.2526 45.004 88.090 NaN
2016.2797 49.809 87.806 NaN
2016.3068 55.401 77.270 NaN
2016.3340 50.416 70.916 NaN

861 rows × 3 columns

The column is full of NaNs again because the auto-alignment feature of Pandas is searching for the index values like 1992.9323 in the index of global_sea_level["msl_ib_ns(mm)"] series and not finding them. Let's set its index to these years so that that auto-alignment can work for us and figure out which values we have and not:


In [47]:
global_sea_level = global_sea_level.set_index("year")
global_sea_level["msl_ib_ns(mm)"]


Out[47]:
year
1992.9595    -5.818
1992.9866    -7.525
1993.0138    -9.215
1993.0409   -11.796
1993.0681   -11.291
1993.0952    -9.569
1993.1223    -3.714
1993.1495    -2.471
              ...  
2014.9494    70.723
2014.9766    70.522
2015.0037    66.662
2015.0309    64.804
2015.0580    62.115
2015.0852    68.589
2015.1123    69.745
2015.1395    75.007
Name: msl_ib_ns(mm), dtype: float64

In [48]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level


Out[48]:
northern_hem southern_hem mean_global
date
1992.9323 13.717 2.517 NaN
1992.9595 1.328 -7.645 -5.818
1992.9866 -13.375 -2.267 -7.525
1993.0138 -24.723 0.855 -9.215
1993.0409 -29.231 -2.020 -11.796
1993.0681 -32.538 -0.398 -11.291
1993.0952 -35.588 4.102 -9.569
1993.1223 -26.579 6.787 -3.714
... ... ... ...
2016.1440 49.218 84.902 NaN
2016.1711 44.274 83.829 NaN
2016.1983 33.961 86.020 NaN
2016.2254 42.199 88.122 NaN
2016.2526 45.004 88.090 NaN
2016.2797 49.809 87.806 NaN
2016.3068 55.401 77.270 NaN
2016.3340 50.416 70.916 NaN

861 rows × 3 columns

EXERCISE: Create a new series containing the average of the 2 hemispheres minus the global value to see if that is close to 0. Work inside the mean_sea_level dataframe first. Then try with the original Series to see what happens with data alignment while doing computations.


In [49]:
# Your code here

Changing dtype of series

Now that the sea levels are looking pretty good, let's got back to the GISS temperature dataset. Because of the labels (strings) found in the middle of the timeseries, every column only assumed to contain strings (didn't convert them to floating point values):


In [50]:
giss_temp.dtypes


Out[50]:
Jan    object
Feb    object
Mar    object
Apr    object
May    object
Jun    object
Jul    object
Aug    object
Sep    object
Oct    object
Nov    object
Dec    object
dtype: object

That can be changed after the fact (and after the cleanup) with the astype method of a Series:


In [51]:
giss_temp["Jan"].astype("float32")


Out[51]:
Year
1880   -34.0
1881   -13.0
1882     3.0
1883   -38.0
1884   -20.0
1885   -57.0
1886   -37.0
1887   -60.0
        ... 
2008    22.0
2009    56.0
2010    66.0
2011    45.0
2012    38.0
2013    62.0
2014    68.0
2015    75.0
Name: Jan, dtype: float32

In [52]:
for col in giss_temp.columns:
    giss_temp.loc[:, col] = giss_temp[col].astype(np.float32)

An index has a dtype just like any Series and that can be changed after the fact too.


In [53]:
giss_temp.index.dtype


Out[53]:
dtype('O')

For now, let's change it to an integer so that values can at least be compared properly. We will learn below to change it to a datetime object.


In [54]:
giss_temp.index = giss_temp.index.astype(np.int32)

Removing missing values

Removing missing values - once they have been converted to np.nan - is very easy. Entries that contain missing values can be removed (dropped), or filled with many strategies.


In [55]:
full_globe_temp


Out[55]:
year
1880-01-01 00:00:00   -0.1591
1881-01-01 00:00:00   -0.0789
1882-01-01 00:00:00   -0.1313
1883-01-01 00:00:00   -0.1675
1884-01-01 00:00:00   -0.2485
1885-01-01 00:00:00   -0.2042
1886-01-01 00:00:00   -0.1695
1887-01-01 00:00:00   -0.2505
                        ...  
2003-01-01             0.5818
2004-01-01             0.5416
2005-01-01             0.6154
2006-01-01             0.5601
2007-01-01             0.5472
2008-01-01             0.4804
2009-01-01             0.5551
2010-01-01                NaN
Name: mean temp, dtype: float64

In [56]:
full_globe_temp.dropna()


Out[56]:
year
1880-01-01 00:00:00   -0.1591
1881-01-01 00:00:00   -0.0789
1882-01-01 00:00:00   -0.1313
1883-01-01 00:00:00   -0.1675
1884-01-01 00:00:00   -0.2485
1885-01-01 00:00:00   -0.2042
1886-01-01 00:00:00   -0.1695
1887-01-01 00:00:00   -0.2505
                        ...  
2002-01-01             0.5745
2003-01-01             0.5818
2004-01-01             0.5416
2005-01-01             0.6154
2006-01-01             0.5601
2007-01-01             0.5472
2008-01-01             0.4804
2009-01-01             0.5551
Name: mean temp, dtype: float64

In [57]:
# This will remove any year that has a missing value. Use how='all' to keep partial years
giss_temp.dropna(how="any").tail()


Out[57]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
2010 66.0 75.0 87.0 82.0 71.0 60.0 56.0 59.0 55.0 65.0 74.0 44.0
2011 45.0 44.0 57.0 60.0 47.0 54.0 70.0 69.0 52.0 60.0 50.0 48.0
2012 38.0 43.0 52.0 62.0 71.0 59.0 50.0 56.0 68.0 73.0 69.0 46.0
2013 62.0 52.0 60.0 48.0 56.0 61.0 53.0 61.0 73.0 61.0 75.0 61.0
2014 68.0 44.0 71.0 72.0 79.0 62.0 50.0 74.0 81.0 78.0 64.0 74.0

In [58]:
giss_temp.fillna(value=0).tail()


Out[58]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
2011 45.0 44.0 57.0 60.0 47.0 54.0 70.0 69.0 52.0 60.0 50.0 48.0
2012 38.0 43.0 52.0 62.0 71.0 59.0 50.0 56.0 68.0 73.0 69.0 46.0
2013 62.0 52.0 60.0 48.0 56.0 61.0 53.0 61.0 73.0 61.0 75.0 61.0
2014 68.0 44.0 71.0 72.0 79.0 62.0 50.0 74.0 81.0 78.0 64.0 74.0
2015 75.0 80.0 84.0 71.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

In [59]:
# This fills them with the previous year. See also temp3.interpolate
giss_temp.fillna(method="ffill").tail()


Out[59]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
2011 45.0 44.0 57.0 60.0 47.0 54.0 70.0 69.0 52.0 60.0 50.0 48.0
2012 38.0 43.0 52.0 62.0 71.0 59.0 50.0 56.0 68.0 73.0 69.0 46.0
2013 62.0 52.0 60.0 48.0 56.0 61.0 53.0 61.0 73.0 61.0 75.0 61.0
2014 68.0 44.0 71.0 72.0 79.0 62.0 50.0 74.0 81.0 78.0 64.0 74.0
2015 75.0 80.0 84.0 71.0 79.0 62.0 50.0 74.0 81.0 78.0 64.0 74.0

Let's also mention the .interpolate method on a Series:


In [60]:
giss_temp.Aug.interpolate().tail()


Out[60]:
Year
2011    69.0
2012    56.0
2013    61.0
2014    74.0
2015    74.0
Name: Aug, dtype: float32

For now, we will leave the missing values in all our datasets, because it wouldn't be meaningful to fill them.

EXERCISE: Go back to the reading functions, and learn more about other options that could have allowed us to fold some of these pre-processing steps into the data loading.

4. Basic visualization

Now they have been formatted, visualizing your datasets is the next logical step and is trivial with Pandas. The first thing to try is to invoke the .plot to generate a basic visualization (uses matplotlib under the covers).

Line plots


In [61]:
full_globe_temp.plot()


Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x10fa8a110>

In [62]:
giss_temp.plot(figsize=LARGE_FIGSIZE)


Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x116f8ed10>

In [63]:
mean_sea_level.plot(subplots=True, figsize=(16, 12));


Showing distributions information


In [64]:
# Distributions of mean sean level globally and per hemisphere?
mean_sea_level.plot(kind="kde", figsize=(12, 8))


Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1180f47d0>

QUIZ: How to list the possible kinds of plots that the plot method can allow?


In [65]:
# Distributions of temperature in each month since 1880
giss_temp.boxplot();


/Users/jrocher/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/ipykernel/__main__.py:2: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  from ipykernel import kernelapp as app

Correlations

There are more plot options inside pandas.tools.plotting:


In [66]:
# Is there correlations between the northern and southern sea level timeseries we loaded?
from pandas.tools.plotting import scatter_matrix
scatter_matrix(mean_sea_level, figsize=LARGE_FIGSIZE);


We will confirm the correlations we think we see further down...

EXERCISE: Refer to exercises/aapl_adj_close_plot/aapl_adj_close_plot.ipynb

5. Accessing data

The general philosophy for accessing values inside a Pandas datastructure is that, unlike a numpy array that only allows to index using integers a Series allows to index with the values inside the index. That makes the code more readable.

In a series


In [67]:
full_globe_temp


Out[67]:
year
1880-01-01 00:00:00   -0.1591
1881-01-01 00:00:00   -0.0789
1882-01-01 00:00:00   -0.1313
1883-01-01 00:00:00   -0.1675
1884-01-01 00:00:00   -0.2485
1885-01-01 00:00:00   -0.2042
1886-01-01 00:00:00   -0.1695
1887-01-01 00:00:00   -0.2505
                        ...  
2003-01-01             0.5818
2004-01-01             0.5416
2005-01-01             0.6154
2006-01-01             0.5601
2007-01-01             0.5472
2008-01-01             0.4804
2009-01-01             0.5551
2010-01-01                NaN
Name: mean temp, dtype: float64

In [68]:
# By default [] on a series accesses values using the index, not the location in the series
# print(temp1[0])  # This would to fail!!

In [69]:
# This index is non-trivial though (will talk more about these datetime objects further down):
full_globe_temp.index.dtype


Out[69]:
dtype('<M8[ns]')

In [70]:
first_date = full_globe_temp.index[0]
first_date == pd.Timestamp('1880')


Out[70]:
True

In [71]:
# By default [] on a series accesses values using the index, not the location in the series
print(full_globe_temp[pd.Timestamp('1880')])
# print(temp1[0])  # This would fail!!


-0.1591

In [72]:
# Another more explicit way to do the same thing is to use loc
print(full_globe_temp.loc[pd.Timestamp('1990')])
print(full_globe_temp.iloc[0], full_globe_temp.iloc[-1])


0.3861
(-0.15909999999999999, nan)

In [73]:
# Year of the last record?
full_globe_temp.index[-1]


Out[73]:
Timestamp('2010-01-01 00:00:00')

In [74]:
# New records can be added:
full_globe_temp[pd.Timestamp('2011')] = np.nan

In a dataframe


In [75]:
# In 2D, same idea, though in a DF [] accesses columns (Series)
giss_temp["Jan"]


Out[75]:
Year
1880   -34.0
1881   -13.0
1882     3.0
1883   -38.0
1884   -20.0
1885   -57.0
1886   -37.0
1887   -60.0
        ... 
2008    22.0
2009    56.0
2010    66.0
2011    45.0
2012    38.0
2013    62.0
2014    68.0
2015    75.0
Name: Jan, dtype: float32

In [76]:
# while .loc and .iloc allow to access individual values, slices or masked selections:
print(giss_temp.loc[1979, "Dec"])


43.0

In [77]:
# Slicing can be done with .loc and .iloc
print(giss_temp.loc[1979, "Jan":"Jun"])  # Note that the end point is included unlike NumPy!!!
print(giss_temp.loc[1979, ::2])


Jan    10.0
Feb   -12.0
Mar    14.0
Apr     8.0
May     1.0
Jun     9.0
Name: 1979, dtype: float32
Jan    10.0
Mar    14.0
May     1.0
Jul    -3.0
Sep    20.0
Nov    24.0
Name: 1979, dtype: float32

In [78]:
# Masking can also be used in one or more dimensions. For example, another way to grab every other month for the first year:
mask = [True, False] * 6
print(giss_temp.iloc[0, mask])
print(giss_temp.loc[1880, mask])


Jan   -34.0
Mar   -22.0
May   -16.0
Jul   -19.0
Sep   -20.0
Nov   -16.0
Name: 1880, dtype: float32
Jan   -34.0
Mar   -22.0
May   -16.0
Jul   -19.0
Sep   -20.0
Nov   -16.0
Name: 1880, dtype: float32

In [79]:
# We could also add a new column like a new entry in a dictionary
giss_temp["totals"] = giss_temp.sum(axis=1)
giss_temp


Out[79]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec totals
Year
1880 -34.0 -27.0 -22.0 -30.0 -16.0 -24.0 -19.0 -12.0 -20.0 -19.0 -16.0 -21.0 -260.0
1881 -13.0 -16.0 -2.0 -3.0 -3.0 -27.0 -12.0 -8.0 -18.0 -23.0 -28.0 -18.0 -171.0
1882 3.0 4.0 -2.0 -24.0 -20.0 -32.0 -27.0 -11.0 -11.0 -25.0 -25.0 -37.0 -207.0
1883 -38.0 -38.0 -12.0 -20.0 -20.0 -8.0 -3.0 -13.0 -19.0 -19.0 -28.0 -21.0 -239.0
1884 -20.0 -14.0 -31.0 -36.0 -33.0 -36.0 -31.0 -24.0 -29.0 -25.0 -29.0 -25.0 -333.0
1885 -57.0 -29.0 -19.0 -36.0 -35.0 -40.0 -28.0 -24.0 -17.0 -14.0 -14.0 0.0 -313.0
1886 -37.0 -43.0 -34.0 -23.0 -21.0 -30.0 -13.0 -21.0 -12.0 -22.0 -29.0 -18.0 -303.0
1887 -60.0 -43.0 -26.0 -34.0 -28.0 -25.0 -19.0 -28.0 -24.0 -33.0 -29.0 -40.0 -389.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2008 22.0 31.0 70.0 48.0 46.0 43.0 54.0 39.0 59.0 62.0 62.0 51.0 587.0
2009 56.0 48.0 49.0 57.0 59.0 62.0 66.0 61.0 64.0 58.0 72.0 58.0 710.0
2010 66.0 75.0 87.0 82.0 71.0 60.0 56.0 59.0 55.0 65.0 74.0 44.0 794.0
2011 45.0 44.0 57.0 60.0 47.0 54.0 70.0 69.0 52.0 60.0 50.0 48.0 656.0
2012 38.0 43.0 52.0 62.0 71.0 59.0 50.0 56.0 68.0 73.0 69.0 46.0 687.0
2013 62.0 52.0 60.0 48.0 56.0 61.0 53.0 61.0 73.0 61.0 75.0 61.0 723.0
2014 68.0 44.0 71.0 72.0 79.0 62.0 50.0 74.0 81.0 78.0 64.0 74.0 817.0
2015 75.0 80.0 84.0 71.0 NaN NaN NaN NaN NaN NaN NaN NaN 310.0

136 rows × 13 columns


In [80]:
# Let's remove this new column, we will learn to do this differently
giss_temp = giss_temp.drop("totals", axis=1)

More complex queries rely on the same concepts. For example what are the names, and IDs of the sea level stations in the USA?


In [81]:
local_sea_level_stations.columns


Out[81]:
Index([u'Station Name', u'ID', u'Lat', u'Lon', u'GLOSS ID', u'Country',
       u'Date', u'Coastline', u'Station'],
      dtype='object')

In [82]:
american_stations = local_sea_level_stations["Country"] == "USA"
local_sea_level_stations.loc[american_stations, ["ID", "Station Name"]]


Out[82]:
ID Station Name
1267 1372 FRENCH FRIGATE SHOALS
1268 2192 FRENCH FRIGATE SHOALS B
1269 2129 PORT ALLEN, HANAPEPE BAY, KAUAI ISLAND
1270 756 NAWILIWILI BAY, KAUAI ISLAND
1271 155 HONOLULU
1272 823 MOKUOLOE ISLAND
1273 2273 LANAI ISLAND, KAUMALAPAU
1274 521 KAHULUI HARBOR, MAUI ISLAND
... ... ...
1437 288 SEAVEY ISLAND
1438 2291 FORT POINT, NEWCASTLE ISLAND
1439 183 PORTLAND (MAINE)
1440 1279 ROCKLAND
1441 525 BAR HARBOR, FRENCHMAN BAY, ME
1442 1524 CUTLER II
1443 1081 CUTLER
1444 332 EASTPORT

178 rows × 2 columns

6. Working with dates and times

Let's work some more with full_globe_temp's index since we saw it is special.


In [83]:
# Its dtype is NumPy's new 'datetime64[ns]':
full_globe_temp.index.dtype


Out[83]:
dtype('<M8[ns]')

The advantage of having a real datetime index is that operations can be done efficiently on it. Let's add a flag to signal if the value is before or after the great depression's black Friday:


In [84]:
black_friday = pd.to_datetime('1929-10-29')
full_globe_temp.index > black_friday


Out[84]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True], dtype=bool)

Timestamps or periods?


In [85]:
# Convert its index from timestamp to period: it is more meaningfull since it was measured and averaged over the year...
full_globe_temp.index = full_globe_temp.index.to_period()
full_globe_temp


Out[85]:
year
1880   -0.1591
1881   -0.0789
1882   -0.1313
1883   -0.1675
1884   -0.2485
1885   -0.2042
1886   -0.1695
1887   -0.2505
         ...  
2004    0.5416
2005    0.6154
2006    0.5601
2007    0.5472
2008    0.4804
2009    0.5551
2010       NaN
2011       NaN
Freq: A-DEC, Name: mean temp, dtype: float64

See also to_timestamp to conver back to timestamps and its how method to specify when inside the range to set the timestamp.

Resampling

Another thing that can be done is to resample the series, downsample or upsample. Let's see the series converted to 10 year blocks or upscale to a monthly series:


In [86]:
# Frequencies can be specified as strings: "us", "ms", "S", "T", "H", "D", "B", "W", "M", "A", "3min", "2h20", ...
# More aliases at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
full_globe_temp.resample("M").mean()


Out[86]:
1880-01   -0.1591
1880-02       NaN
1880-03       NaN
1880-04       NaN
1880-05       NaN
1880-06       NaN
1880-07       NaN
1880-08       NaN
            ...  
2011-05       NaN
2011-06       NaN
2011-07       NaN
2011-08       NaN
2011-09       NaN
2011-10       NaN
2011-11       NaN
2011-12       NaN
Freq: M, Name: mean temp, dtype: float64

In [87]:
full_globe_temp.resample("10A").mean()


Out[87]:
year
1880-12-31 00:00:00   -0.15910
1890-12-31 00:00:00   -0.18226
1900-12-31            -0.21054
1910-12-31            -0.31381
1920-12-31            -0.25518
1930-12-31            -0.13395
1940-12-31             0.00786
1950-12-31             0.04548
1960-12-31            -0.00198
1970-12-31             0.02566
1980-12-31             0.04817
1990-12-31             0.21491
2000-12-31             0.36288
2010-12-31             0.55260
2020-12-31                 NaN
Freq: 10A-DEC, Name: mean temp, dtype: float64

Generating DatetimeIndex objects

The index for giss_temp isn't an instance of datetimes so we may want to generate such DatetimeIndex objects. This can be done with date_range and period_range:


In [88]:
# Can specify a start date and a number of values desired. By default it will assume an interval of 1 day:
pd.date_range('1/1/1880', periods=4)


Out[88]:
DatetimeIndex(['1880-01-01', '1880-01-02', '1880-01-03', '1880-01-04'], dtype='datetime64[ns]', freq='D')

In [89]:
# Can also specify a start and a stop date, as well as a frequency
pd.date_range('1/1/1880', '1/1/2016', freq="A")


Out[89]:
DatetimeIndex(['1880-12-31', '1881-12-31', '1882-12-31', '1883-12-31',
               '1884-12-31', '1885-12-31', '1886-12-31', '1887-12-31',
               '1888-12-31', '1889-12-31',
               ...
               '2006-12-31', '2007-12-31', '2008-12-31', '2009-12-31',
               '2010-12-31', '2011-12-31', '2012-12-31', '2013-12-31',
               '2014-12-31', '2015-12-31'],
              dtype='datetime64[ns]', length=136, freq='A-DEC')

Note that "A" by default means the end of the year. Other times in the year can be specified with "AS" (start), "A-JAN" or "A-JUN". Even more options can be imported from pandas.tseries.offsets:


In [90]:
from pandas.tseries.offsets import YearBegin
pd.date_range('1/1/1880', '1/1/2015', freq=YearBegin())


Out[90]:
DatetimeIndex(['1880-01-01', '1881-01-01', '1882-01-01', '1883-01-01',
               '1884-01-01', '1885-01-01', '1886-01-01', '1887-01-01',
               '1888-01-01', '1889-01-01',
               ...
               '2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01',
               '2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01',
               '2014-01-01', '2015-01-01'],
              dtype='datetime64[ns]', length=136, freq='AS-JAN')

Actually we will convert that dataset to a 1D dataset, and build a monthly index, so lets build a monthly period index


In [91]:
giss_temp_index = pd.period_range('1/1/1880', '12/1/2015', freq="M")
giss_temp_index


Out[91]:
PeriodIndex(['1880-01', '1880-02', '1880-03', '1880-04', '1880-05', '1880-06',
             '1880-07', '1880-08', '1880-09', '1880-10',
             ...
             '2015-03', '2015-04', '2015-05', '2015-06', '2015-07', '2015-08',
             '2015-09', '2015-10', '2015-11', '2015-12'],
            dtype='int64', length=1632, freq='M')

7. Transforming datasets: apply, sort, stack/unstack and transpose

Let's look at our local_sea_level_stations dataset some more, to learn more about it and also do some formatting. What is the range of dates and lattitudes we have, the list of countries, the range of variations, ...


In [92]:
# What about the range of dates?
local_sea_level_stations["Date"].min(), local_sea_level_stations["Date"].max(), local_sea_level_stations["Date"].iloc[-1]


Out[92]:
(u'01/01/1980', u'31/07/2014', u'01/02/2016')

In [93]:
local_sea_level_stations.dtypes


Out[93]:
Station Name     object
ID                int64
Lat             float64
Lon             float64
GLOSS ID        float64
Country          object
Date             object
Coastline         int64
Station           int64
dtype: object

Apply: transforming Series

We don't see the range of dates because the dates are of dtype "Object", (usually meaning strings). Let's convert that using apply:


In [94]:
local_sea_level_stations["Date"].apply(pd.to_datetime)


Out[94]:
0      1980-01-01
1      1980-01-01
2      1980-01-01
3      2015-06-10
4      2016-08-03
5      1999-11-05
6      2015-05-21
7      2004-09-30
          ...    
1458   2015-11-17
1459   2015-11-17
1460   2015-11-17
1461   2016-08-03
1462   2016-08-03
1463   2016-09-03
1464   2016-09-03
1465   2016-01-02
Name: Date, dtype: datetime64[ns]

This apply method is very powerful and general. We have used it to do something we could have done with astype, but any custom function can be provided to apply.


In [95]:
local_sea_level_stations["Date"] = local_sea_level_stations["Date"].apply(pd.to_datetime)

# Now we can really compare the dates, and therefore get a real range:
print(local_sea_level_stations["Date"].min(), local_sea_level_stations["Date"].max())


(Timestamp('1980-01-01 00:00:00'), Timestamp('2016-12-05 00:00:00'))

EXERCISE: Use the apply method to search through the stations names for a station in New York. What is the ID of the station?


In [96]:
# Your code here

Now that we know the range of dates, to look at the data, sorting it following the dates is done with sort:


In [97]:
local_sea_level_stations.sort_values(by="Date")


Out[97]:
Station Name ID Lat Lon GLOSS ID Country Date Coastline Station
0 HELIGMAN 231 60.200 19.300 NaN ALA 1980-01-01 60 251
352 KOTKA 164 60.450 26.950 NaN FIN 1980-01-01 60 358
351 SODERSKAR 29 60.117 25.417 NaN FIN 1980-01-01 60 354
347 RUSSARO 28 59.767 22.950 NaN FIN 1980-01-01 60 316
346 JUNGFRUSUND 18 59.950 22.367 NaN FIN 1980-01-01 60 311
345 TVARMINNE 238 59.850 23.250 NaN FIN 1980-01-01 60 306
344 LOHM 233 60.100 21.667 NaN FIN 1980-01-01 60 301
358 ANTIFER 1383 49.650 0.150 NaN FRA 1980-01-01 190 46
... ... ... ... ... ... ... ... ... ...
1085 VIGO II 1898 42.243 -8.726 NaN ESP 2016-12-01 200 42
1082 FERROL 2 2053 43.476 -8.249 NaN ESP 2016-12-01 200 35
1078 GIJON II 1871 43.558 -5.698 NaN ESP 2016-12-01 200 22
1077 SANTANDER III 1807 43.461 -3.791 NaN ESP 2016-12-01 200 13
1074 BILBAO 1806 43.352 -3.045 NaN ESP 2016-12-01 200 6
1106 SAGUNTO 2059 39.634 -0.206 NaN ESP 2016-12-01 220 60
1083 VILLAGARCIA 1897 42.600 -8.767 NaN ESP 2016-12-01 200 36
912 MATI, DAVAO ORIENTAL 2156 6.950 126.217 NaN PHL 2016-12-05 660 116

1466 rows × 9 columns

Since many stations last updated on the same dates, it is logical to want to sort further, for example, by Country at constant date:


In [98]:
local_sea_level_stations.sort_values(by=["Date", "Country"], ascending=False)


Out[98]:
Station Name ID Lat Lon GLOSS ID Country Date Coastline Station
912 MATI, DAVAO ORIENTAL 2156 6.950 126.217 NaN PHL 2016-12-05 660 116
1074 BILBAO 1806 43.352 -3.045 NaN ESP 2016-12-01 200 6
1077 SANTANDER III 1807 43.461 -3.791 NaN ESP 2016-12-01 200 13
1078 GIJON II 1871 43.558 -5.698 NaN ESP 2016-12-01 200 22
1082 FERROL 2 2053 43.476 -8.249 NaN ESP 2016-12-01 200 35
1083 VILLAGARCIA 1897 42.600 -8.767 NaN ESP 2016-12-01 200 36
1085 VIGO II 1898 42.243 -8.726 NaN ESP 2016-12-01 200 42
1088 HUELVA 1883 37.132 -6.834 NaN ESP 2016-12-01 220 5
... ... ... ... ... ... ... ... ... ...
17 PIRAMIDE 867 -42.583 -64.283 NaN ARG 1980-01-01 860 41
18 ROSALES 870 -38.933 -62.067 NaN ARG 1980-01-01 860 61
20 MAR DEL PLATA (PUERTO) 857 -38.033 -57.533 192.0 ARG 1980-01-01 860 91
23 RIO SANTIAGO 897 -34.850 -57.900 NaN ARG 1980-01-01 860 141
26 ISLA MARTIN GARCIA 864 -34.183 -58.250 NaN ARG 1980-01-01 860 171
0 HELIGMAN 231 60.200 19.300 NaN ALA 1980-01-01 60 251
1 KOBBAKLINTAR 63 60.033 19.883 NaN ALA 1980-01-01 60 261
2 LEMSTROM 84 60.100 20.017 NaN ALA 1980-01-01 60 271

1466 rows × 9 columns

Stack and unstack

Let's look at the GISS dataset differently. Instead of seeing the months along the axis 1, and the years along the axis 0, it would could be good to convert these into an outer and an inner axis along only 1 time dimension.

Stacking and unstacking allows to convert a dataframe into a series and vice-versa:


In [99]:
giss_temp.unstack?
unstacked = giss_temp.unstack()
unstacked


Out[99]:
     Year
Jan  1880   -34.0
     1881   -13.0
     1882     3.0
     1883   -38.0
     1884   -20.0
     1885   -57.0
     1886   -37.0
     1887   -60.0
             ... 
Dec  2008    51.0
     2009    58.0
     2010    44.0
     2011    48.0
     2012    46.0
     2013    61.0
     2014    74.0
     2015     NaN
dtype: float32

In [100]:
# Note the nature of the result:
type(unstacked)


Out[100]:
pandas.core.series.Series

The result is grouped in the wrong order since it sorts first the axis that was unstacked. Another transformation that would help us is transposing...


In [101]:
giss_temp.transpose()


Out[101]:
Year 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
Jan -34.0 -13.0 3.0 -38.0 -20.0 -57.0 -37.0 -60.0 -43.0 -18.0 ... 52.0 93.0 22.0 56.0 66.0 45.0 38.0 62.0 68.0 75.0
Feb -27.0 -16.0 4.0 -38.0 -14.0 -29.0 -43.0 -43.0 -41.0 16.0 ... 65.0 65.0 31.0 48.0 75.0 44.0 43.0 52.0 44.0 80.0
Mar -22.0 -2.0 -2.0 -12.0 -31.0 -19.0 -34.0 -26.0 -39.0 6.0 ... 59.0 67.0 70.0 49.0 87.0 57.0 52.0 60.0 71.0 84.0
Apr -30.0 -3.0 -24.0 -20.0 -36.0 -36.0 -23.0 -34.0 -25.0 7.0 ... 45.0 71.0 48.0 57.0 82.0 60.0 62.0 48.0 72.0 71.0
May -16.0 -3.0 -20.0 -20.0 -33.0 -35.0 -21.0 -28.0 -26.0 -1.0 ... 43.0 64.0 46.0 59.0 71.0 47.0 71.0 56.0 79.0 NaN
Jun -24.0 -27.0 -32.0 -8.0 -36.0 -40.0 -30.0 -25.0 -20.0 -10.0 ... 60.0 55.0 43.0 62.0 60.0 54.0 59.0 61.0 62.0 NaN
Jul -19.0 -12.0 -27.0 -3.0 -31.0 -28.0 -13.0 -19.0 -11.0 -13.0 ... 50.0 58.0 54.0 66.0 56.0 70.0 50.0 53.0 50.0 NaN
Aug -12.0 -8.0 -11.0 -13.0 -24.0 -24.0 -21.0 -28.0 -13.0 -18.0 ... 66.0 57.0 39.0 61.0 59.0 69.0 56.0 61.0 74.0 NaN
Sep -20.0 -18.0 -11.0 -19.0 -29.0 -17.0 -12.0 -24.0 -10.0 -18.0 ... 59.0 61.0 59.0 64.0 55.0 52.0 68.0 73.0 81.0 NaN
Oct -19.0 -23.0 -25.0 -19.0 -25.0 -14.0 -22.0 -33.0 -3.0 -21.0 ... 66.0 56.0 62.0 58.0 65.0 60.0 73.0 61.0 78.0 NaN
Nov -16.0 -28.0 -25.0 -28.0 -29.0 -14.0 -29.0 -29.0 -1.0 -32.0 ... 69.0 54.0 62.0 72.0 74.0 50.0 69.0 75.0 64.0 NaN
Dec -21.0 -18.0 -37.0 -21.0 -25.0 0.0 -18.0 -40.0 -10.0 -30.0 ... 74.0 46.0 51.0 58.0 44.0 48.0 46.0 61.0 74.0 NaN

12 rows × 136 columns


In [102]:
giss_temp_series = giss_temp.transpose().unstack()
giss_temp_series.name = "Temp anomaly"
giss_temp_series


Out[102]:
Year     
1880  Jan   -34.0
      Feb   -27.0
      Mar   -22.0
      Apr   -30.0
      May   -16.0
      Jun   -24.0
      Jul   -19.0
      Aug   -12.0
             ... 
2015  May     NaN
      Jun     NaN
      Jul     NaN
      Aug     NaN
      Sep     NaN
      Oct     NaN
      Nov     NaN
      Dec     NaN
Name: Temp anomaly, dtype: float32

A side note: Multi-indexes


In [103]:
# Note the nature of the resulting index:
giss_temp_series.index


Out[103]:
MultiIndex(levels=[[1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015], [u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 98, 98, 98, 98, 98, 98, 98, 98, 98, 98, 98, 98, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 106, 106, 106, 106, 106, 106, 106, 106, 106, 106, 106, 106, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 108, 108, 108, 108, 108, 108, 108, 108, 108, 108, 108, 108, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
           names=[u'Year', None])

In [104]:
# It is an index made of 2 columns. Let's fix the fact that one of them doesn't have a name:
giss_temp_series.index = giss_temp_series.index.set_names(["year", "month"])

In [105]:
# We can now access deviations by specifying the year and month:
giss_temp_series[1980, "Jan"]


Out[105]:
24.0

But this new multi-index isn't very good, because is it not viewed as 1 date, just as a tuple of values:


In [106]:
giss_temp_series.plot(figsize=LARGE_FIGSIZE)


Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x114ceae50>

To improve on this, let's reuse an index we generated above with date_range:


In [107]:
giss_temp_series.index = giss_temp_index
giss_temp_series.plot(figsize=LARGE_FIGSIZE)


Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x114bd0dd0>

8. Statistical analysis

Descriptive statistics

Let's go back to the dataframe version of the GISS temperature dataset temporarily to analyze anomalies month per month. Like most functions on a dataframe, stats functions are computed column per column. They also ignore missing values:


In [108]:
monthly_averages = giss_temp.mean()
monthly_averages


Out[108]:
Jan   -1.411765
Feb   -0.161765
Mar    1.294118
Apr   -0.823529
May   -2.051852
Jun   -3.237037
Jul   -0.688889
Aug   -0.622222
Sep    0.755556
Oct    1.333333
Nov    0.148148
Dec   -2.274074
dtype: float32

It is possible to apply stats functions across rows instead of columns using the axis keyword (just like in NumPy).


In [109]:
yearly_averages = giss_temp.mean(axis=1)
yearly_averages


Out[109]:
Year
1880   -21.666666
1881   -14.250000
1882   -17.250000
1883   -19.916666
1884   -27.750000
1885   -26.083334
1886   -25.250000
1887   -32.416668
          ...    
2008    48.916668
2009    59.166668
2010    66.166664
2011    54.666668
2012    57.250000
2013    60.250000
2014    68.083336
2015    77.500000
dtype: float32

describe provides many descriptive stats computed at once:


In [110]:
mean_sea_level.describe()


Out[110]:
northern_hem southern_hem mean_global
count 861.000000 861.000000 780.000000
mean 25.050923 28.422725 29.166036
std 29.292476 24.730756 21.226033
min -38.377000 -28.239000 -11.796000
25% 3.810000 8.946000 11.130500
50% 24.487000 27.836000 31.783000
75% 47.140000 46.066000 45.047500
max 99.108000 90.633000 75.007000

Rolling statistics

Let's remove high frequency signal and extract the trend:


In [111]:
full_globe_temp.plot()
rolled_series = full_globe_temp.rolling(window=10, center=False)
print(rolled_series)
rolled_series.mean().plot(figsize=LARGE_FIGSIZE)


Rolling [window=10,center=False,axis=0]
Out[111]:
<matplotlib.axes._subplots.AxesSubplot at 0x1198e9990>

In [112]:
# To see what all can be done while rolling, 
#pd.rolling_<TAB>

Describing categorical series

Let's look at our local_sea_level_stations dataset some more:


In [113]:
local_sea_level_stations.describe()


Out[113]:
ID Lat Lon GLOSS ID Coastline Station
count 1466.000000 1466.000000 1466.000000 281.000000 1466.000000 1466.000000
mean 1069.452251 28.762511 21.889766 165.327402 532.484993 85.937926
std 655.126610 30.313309 92.947697 95.313053 312.892857 120.120620
min 1.000000 -77.850000 -179.367000 3.000000 10.000000 1.000000
25% 518.250000 13.810500 -64.745500 87.000000 220.000000 12.000000
50% 1021.500000 35.700000 14.448500 154.000000 641.000000 41.000000
75% 1629.750000 49.320500 118.023500 242.000000 822.000000 101.000000
max 2304.000000 82.490000 179.217000 351.000000 999.000000 725.000000

.describe() only displays information about continuous Series. What about categorical ones?


In [114]:
local_sea_level_stations.columns


Out[114]:
Index([u'Station Name', u'ID', u'Lat', u'Lon', u'GLOSS ID', u'Country',
       u'Date', u'Coastline', u'Station'],
      dtype='object')

In [115]:
local_sea_level_stations["Country"]


Out[115]:
0       ALA
1       ALA
2       ALA
3       ALA
4       ASM
5       ATA
6       ATA
7       ATA
       ... 
1458    VNM
1459    VNM
1460    VNM
1461    VIR
1462    VIR
1463    VIR
1464    VIR
1465    YEM
Name: Country, dtype: object

In [116]:
local_sea_level_stations["Country"].describe()


Out[116]:
count     1466
unique     136
top        USA
freq       178
Name: Country, dtype: object

In [117]:
# List of unique values:
local_sea_level_stations["Country"].unique()


Out[117]:
array([u'ALA', u'ASM', u'ATA', u'ARG', u'AUS', u'BHS', u'BHR', u'BGD',
       u'BEL', u'BMU', u'BRA', u'IOT', u'BGR', u'CAN', u'CPV', u'CYM',
       u'CHL', u'CHN', u'CCK', u'COL', u'COG', u'COK', u'CRI', u'CIV',
       u'HRV', u'CUB', u'CYP', u'DNK', u'DOM', u'ECU', u'EGY', u'SLV',
       u'EST', u'FLK', u'FRO', u'FJI', u'FIN', u'FRA', u'GUF', u'PYF',
       u'GEO', u'DEU', u'GHA', u'GIB', u'GRC', u'GRD', u'GLP', u'GUM',
       u'GTM', u'HTI', u'HND', u'HKG', u'ISL', u'IND', u'IDN', u'IRN',
       u'IRL', u'IMN', u'ISR', u'ITA', u'JAM', u'JPN', u'JEY', u'KEN',
       u'KIR', u'PRK', u'KOR', u'LVA', u'LTU', u'MAC', u'MDG', u'MYS',
       u'MDV', u'MLT', u'MHL', u'MTQ', u'MUS', u'MYT', u'MEX', u'FSM',
       u'MCO', u'MNE', u'MAR', u'MOZ', u'MMR', u'NAM', u'NRU', u'NLD',
       u'NCL', u'NZL', u'NGA', u'NFK', u'MNP', u'NOR', u'OMN', u'PAK',
       u'PLW', u'PAN', u'PNG', u'PER', u'PHL', u'POL', u'PRT', u'PRI',
       u'REU', u'ROU', u'RUS', u'SHN', u'WSM', u'SEN', u'SYC', u'SGP',
       u'SVN', u'SLB', u'ZAF', u'ESP', u'SDN', u'SJM', u'SWE', u'TWN',
       u'TZA', u'THA', u'TON', u'TTO', u'TUR', u'TUV', u'UKR', u'GBR',
       u'USA', u'UMI', u'URY', u'VUT', u'VEN', u'VNM', u'VIR', u'YEM'], dtype=object)

In [118]:
local_sea_level_stations["Country"].value_counts()


Out[118]:
USA    178
JPN    139
RUS     86
CAN     82
AUS     81
GBR     68
ESP     61
ITA     44
      ... 
TZA      1
PRK      1
JEY      1
JAM      1
MNP      1
UKR      1
COG      1
VUT      1
Name: Country, dtype: int64

In [119]:
# To save memory, we can convert it to a categorical column:
local_sea_level_stations["Country"] = local_sea_level_stations["Country"].astype("category")

We can also create categorical series from continuous ones with the cut function:


In [120]:
categorized = pd.cut(full_globe_temp, 3, labels=["L", "M", "H"])
categorized


Out[120]:
year
1880      L
1881      L
1882      L
1883      L
1884      L
1885      L
1886      L
1887      L
       ... 
2004      H
2005      H
2006      H
2007      H
2008      H
2009      H
2010    NaN
2011    NaN
Freq: A-DEC, Name: mean temp, dtype: category
Categories (3, object): [L < M < H]

In [121]:
# The advantage is that we can use labels and control the order they should be treated in (L < M < H)
categorized.cat.categories


Out[121]:
Index([u'L', u'M', u'H'], dtype='object')

QUIZ: How much memory did we save? What if it was categorized but with dtype object instead of category?

9. Data Aggregation/summarization

Now that we have a good grasp on our datasets, Let's transform and analyze them some more to prepare them to compare them. The 2 function(alities)s to learn about here are groupby and pivot_table.

GroupBy

Let's explore the sea levels, first splitting into calendar years to compute average sea levels for each year:


In [122]:
mean_sea_level


Out[122]:
northern_hem southern_hem mean_global
date
1992.9323 13.717 2.517 NaN
1992.9595 1.328 -7.645 -5.818
1992.9866 -13.375 -2.267 -7.525
1993.0138 -24.723 0.855 -9.215
1993.0409 -29.231 -2.020 -11.796
1993.0681 -32.538 -0.398 -11.291
1993.0952 -35.588 4.102 -9.569
1993.1223 -26.579 6.787 -3.714
... ... ... ...
2016.1440 49.218 84.902 NaN
2016.1711 44.274 83.829 NaN
2016.1983 33.961 86.020 NaN
2016.2254 42.199 88.122 NaN
2016.2526 45.004 88.090 NaN
2016.2797 49.809 87.806 NaN
2016.3068 55.401 77.270 NaN
2016.3340 50.416 70.916 NaN

861 rows × 3 columns


In [123]:
mean_sea_level = mean_sea_level.reset_index()
mean_sea_level


Out[123]:
date northern_hem southern_hem mean_global
0 1992.9323 13.717 2.517 NaN
1 1992.9595 1.328 -7.645 -5.818
2 1992.9866 -13.375 -2.267 -7.525
3 1993.0138 -24.723 0.855 -9.215
4 1993.0409 -29.231 -2.020 -11.796
5 1993.0681 -32.538 -0.398 -11.291
6 1993.0952 -35.588 4.102 -9.569
7 1993.1223 -26.579 6.787 -3.714
... ... ... ... ...
853 2016.1440 49.218 84.902 NaN
854 2016.1711 44.274 83.829 NaN
855 2016.1983 33.961 86.020 NaN
856 2016.2254 42.199 88.122 NaN
857 2016.2526 45.004 88.090 NaN
858 2016.2797 49.809 87.806 NaN
859 2016.3068 55.401 77.270 NaN
860 2016.3340 50.416 70.916 NaN

861 rows × 4 columns


In [124]:
# Groupby with pandas can be done on a column or by applying a custom function to the index. 
# If we want to group the data by year, we can build a year column into the DF:
mean_sea_level["year"] = mean_sea_level["date"].apply(int)
mean_sea_level


Out[124]:
date northern_hem southern_hem mean_global year
0 1992.9323 13.717 2.517 NaN 1992
1 1992.9595 1.328 -7.645 -5.818 1992
2 1992.9866 -13.375 -2.267 -7.525 1992
3 1993.0138 -24.723 0.855 -9.215 1993
4 1993.0409 -29.231 -2.020 -11.796 1993
5 1993.0681 -32.538 -0.398 -11.291 1993
6 1993.0952 -35.588 4.102 -9.569 1993
7 1993.1223 -26.579 6.787 -3.714 1993
... ... ... ... ... ...
853 2016.1440 49.218 84.902 NaN 2016
854 2016.1711 44.274 83.829 NaN 2016
855 2016.1983 33.961 86.020 NaN 2016
856 2016.2254 42.199 88.122 NaN 2016
857 2016.2526 45.004 88.090 NaN 2016
858 2016.2797 49.809 87.806 NaN 2016
859 2016.3068 55.401 77.270 NaN 2016
860 2016.3340 50.416 70.916 NaN 2016

861 rows × 5 columns


In [125]:
sl_grouped_year = mean_sea_level.groupby("year")

What kind of object did we create?


In [126]:
type(sl_grouped_year)


Out[126]:
pandas.core.groupby.DataFrameGroupBy

What to do with that strange GroupBy object? We can first loop over it to get the labels and the sub-dataframes for each group:


In [127]:
for group_name, subdf in sl_grouped_year:
    print(group_name)
    print(subdf)
    print("")


1992
        date  northern_hem  southern_hem  mean_global  year
0  1992.9323        13.717         2.517          NaN  1992
1  1992.9595         1.328        -7.645       -5.818  1992
2  1992.9866       -13.375        -2.267       -7.525  1992

1993
         date  northern_hem  southern_hem  mean_global  year
3   1993.0138       -24.723         0.855       -9.215  1993
4   1993.0409       -29.231        -2.020      -11.796  1993
5   1993.0681       -32.538        -0.398      -11.291  1993
6   1993.0952       -35.588         4.102       -9.569  1993
7   1993.1223       -26.579         6.787       -3.714  1993
8   1993.1495       -29.622        10.315       -2.471  1993
9   1993.1766       -34.527         5.919       -6.784  1993
10  1993.2038       -34.250         7.193       -5.772  1993
..        ...           ...           ...          ...   ...
32  1993.8010        30.541       -14.024       -1.886  1993
33  1993.8282        18.240       -14.008          NaN  1993
34  1993.8553         8.044        -9.115       -7.186  1993
35  1993.8825        10.529        -7.718       -4.588  1993
36  1993.9096        10.772        -6.169       -2.648  1993
37  1993.9368        10.732        -4.656       -0.795  1993
38  1993.9639         1.704         3.492        1.461  1993
39  1993.9911        -8.579        -0.251       -3.859  1993

[37 rows x 5 columns]

1994
         date  northern_hem  southern_hem  mean_global  year
40  1994.0182       -14.764        -1.078       -5.904  1994
41  1994.0454       -12.663         3.887       -1.261  1994
42  1994.0725       -19.485        13.173        2.404  1994
43  1994.0997       -23.443         8.984       -1.173  1994
44  1994.1268       -31.320         4.399       -6.507  1994
45  1994.1540       -37.411         1.308      -10.638  1994
46  1994.1811       -35.844        13.495       -2.544  1994
47  1994.2083       -20.794        10.220          NaN  1994
..        ...           ...           ...          ...   ...
69  1994.8055        18.954       -10.939       -4.447  1994
70  1994.8327        14.186       -10.363       -5.614  1994
71  1994.8598        19.882         4.403          NaN  1994
72  1994.8870        16.490       -10.154       -3.240  1994
73  1994.9141        -0.195       -12.189      -10.229  1994
74  1994.9413        -0.438        -9.834       -7.929  1994
75  1994.9684        -6.651        -2.703       -5.132  1994
76  1994.9956       -12.734         7.809       -0.212  1994

[37 rows x 5 columns]

1995
          date  northern_hem  southern_hem  mean_global  year
77   1995.0227       -16.869        13.406        2.458  1995
78   1995.0499       -16.734        11.835        2.365  1995
79   1995.0770       -24.215        17.897        3.806  1995
80   1995.1042       -32.395        10.742       -3.379  1995
81   1995.1313       -37.790         4.341       -8.843  1995
82   1995.1585       -30.054        11.276       -1.374  1995
83   1995.1856       -18.882        24.336          NaN  1995
84   1995.2128       -29.813        21.872        5.367  1995
..         ...           ...           ...          ...   ...
104  1995.7557        26.196       -16.096       -4.116  1995
105  1995.7829        26.565       -18.685       -5.580  1995
106  1995.8100        18.206       -11.709          NaN  1995
107  1995.8372        20.199        -7.319       -1.007  1995
108  1995.8643        22.797        -8.014        0.257  1995
109  1995.8915        20.865       -10.144       -0.928  1995
110  1995.9458         3.133       -10.152       -6.178  1995
111  1995.9729         3.723         4.050        3.549  1995

[35 rows x 5 columns]

1996
          date  northern_hem  southern_hem  mean_global  year
112  1996.0000        -8.298         8.021        2.121  1996
113  1996.0272       -15.226         4.369       -1.907  1996
114  1996.0543       -19.351         0.676       -5.008  1996
115  1996.0815       -27.781         1.914       -6.979  1996
116  1996.1086       -32.942         8.965       -4.363  1996
117  1996.1358       -26.880        14.640          NaN  1996
118  1996.1629       -25.254        18.665        5.333  1996
119  1996.1901       -31.981        14.792        0.404  1996
..         ...           ...           ...          ...   ...
141  1996.7873        16.485       -16.277          NaN  1996
142  1996.8145        23.172        -7.566        0.035  1996
143  1996.8416        27.836        -5.890        3.471  1996
144  1996.8688        25.333        -9.022        1.228  1996
145  1996.8959         8.147        -5.264       -2.764  1996
146  1996.9231         6.546        -0.403        0.452  1996
147  1996.9502         3.238         7.443        4.804  1996
148  1996.9774         0.742         5.352        3.602  1996

[37 rows x 5 columns]

1997
          date  northern_hem  southern_hem  mean_global  year
149  1997.0045        -6.439         5.169        1.636  1997
150  1997.0317       -11.603         3.083       -0.803  1997
151  1997.0588       -27.445         5.189       -5.010  1997
152  1997.0860       -24.603        12.049        0.989  1997
153  1997.1131       -32.545         2.174          NaN  1997
154  1997.1403       -22.992        13.187        3.083  1997
155  1997.1674       -21.150         9.064        1.468  1997
156  1997.1946       -28.745         8.237       -1.823  1997
..         ...           ...           ...          ...   ...
178  1997.7918        41.540         4.793       14.829  1997
179  1997.8190        33.572         6.671       12.947  1997
180  1997.8461        23.610        10.016       11.460  1997
181  1997.8733        10.445        13.478        8.891  1997
182  1997.9004        12.106        14.756       11.173  1997
183  1997.9276        11.524        19.435       14.685  1997
184  1997.9547         4.187        16.530       11.068  1997
185  1997.9819        -7.182        21.664       10.593  1997

[37 rows x 5 columns]

1998
          date  northern_hem  southern_hem  mean_global  year
186  1998.0090       -18.303        17.642        4.736  1998
187  1998.0362       -12.483        18.718        8.719  1998
188  1998.0633       -25.377         9.945          NaN  1998
189  1998.0905       -15.267        27.836       14.428  1998
190  1998.1176       -14.355        25.288       13.615  1998
191  1998.1448       -20.614        24.298       10.997  1998
192  1998.1719       -21.860        18.501        7.134  1998
193  1998.1990       -21.656        23.332       10.277  1998
..         ...           ...           ...          ...   ...
215  1998.7963        24.604       -28.239          NaN  1998
216  1998.8235        40.522        -9.477        6.725  1998
217  1998.8506        36.535        -6.496        7.464  1998
218  1998.8777        37.038         1.925       13.388  1998
219  1998.9049        25.888        -1.569        7.522  1998
220  1998.9320        30.483         9.042       16.727  1998
221  1998.9592        12.372         9.913       10.897  1998
222  1998.9863         6.869         4.941        6.738  1998

[37 rows x 5 columns]

1999
          date  northern_hem  southern_hem  mean_global  year
223  1999.0135        -3.830        13.669        8.833  1999
224  1999.0406        -2.387        21.230       14.662  1999
225  1999.0678       -21.709         2.566          NaN  1999
226  1999.0949        -5.048        13.182       10.119  1999
227  1999.1221       -17.530        20.970       10.080  1999
228  1999.1492       -18.436        16.705        7.655  1999
229  1999.1764       -19.660        22.345       10.832  1999
230  1999.2035       -13.171        20.341       12.249  1999
..         ...           ...           ...          ...   ...
252  1999.8008        34.355         2.509       11.388  1999
253  1999.8279        34.834         5.045       13.432  1999
254  1999.8551        42.485         4.787       16.967  1999
255  1999.8822        36.612         5.828       15.993  1999
256  1999.9094        28.903         5.765       13.563  1999
257  1999.9365         7.769         7.551          NaN  1999
258  1999.9637         7.359        17.264       13.728  1999
259  1999.9908         3.449        24.124       17.333  1999

[37 rows x 5 columns]

2000
          date  northern_hem  southern_hem  mean_global  year
260  2000.0180        10.176        18.375       17.533  2000
261  2000.0451         6.061        20.247       17.858  2000
262  2000.0723        -6.104        24.273       16.159  2000
263  2000.0994       -21.263        24.218       10.783  2000
264  2000.1266       -15.703        21.589       11.807  2000
265  2000.1537        -9.047        26.780       17.805  2000
266  2000.1809       -10.735        29.419       18.971  2000
267  2000.2080        -9.286        27.323       18.474  2000
..         ...           ...           ...          ...   ...
289  2000.8053        43.405         9.175       19.382  2000
290  2000.8324        42.480         3.772          NaN  2000
291  2000.8596        43.878        10.708       21.424  2000
292  2000.8867        30.982        11.342       17.268  2000
293  2000.9139        29.436        11.797       17.882  2000
294  2000.9410        15.358        14.999       15.043  2000
295  2000.9682        17.786        15.890       17.602  2000
296  2000.9953        19.537        20.462       22.057  2000

[37 rows x 5 columns]

2001
          date  northern_hem  southern_hem  mean_global  year
297  2001.0225         7.732        19.694       17.812  2001
298  2001.0496        -1.064         9.399          NaN  2001
299  2001.0767       -13.165        26.607       15.095  2001
300  2001.1039        -6.459        27.718       19.007  2001
301  2001.1310        -1.461        25.099       19.868  2001
302  2001.1582        -1.053        31.460       24.128  2001
303  2001.1853        -5.181        34.418       24.451  2001
304  2001.2125       -15.876        30.493       18.088  2001
..         ...           ...           ...          ...   ...
326  2001.8097        51.379         8.698       22.653  2001
327  2001.8369        50.977        14.050       26.002  2001
328  2001.8640        37.447        17.758       23.384  2001
329  2001.8912        25.848        20.314       20.948  2001
330  2001.9183        19.967        22.864       21.027  2001
331  2001.9455        26.199        18.393       21.930  2001
332  2001.9726        17.269        26.257       24.021  2001
333  2001.9998        10.626        30.408       24.842  2001

[37 rows x 5 columns]

2002
          date  northern_hem  southern_hem  mean_global  year
334  2002.0269        -4.812        28.283       18.262  2002
335  2002.0541        -5.577        30.166       19.437  2002
336  2002.0812        -3.769        38.490       26.850  2002
337  2002.1084         2.032        38.583       29.728  2002
338  2002.1355         1.644        36.029       28.163  2002
339  2002.1627        -8.796        38.791       23.377  2002
340  2002.1898       -19.634        35.139       17.886  2002
341  2002.2170       -15.731        35.783       20.958  2002
..         ...           ...           ...          ...   ...
362  2002.7871        58.563        16.982       30.754  2002
363  2002.8142        56.392        18.809       31.080  2002
364  2002.8414        43.217        14.710       23.692  2002
365  2002.8685        36.530        21.482       25.621  2002
366  2002.8957        37.029        33.326       33.671  2002
367  2002.9228        32.933        28.857       30.318  2002
368  2002.9500        30.865        26.055       28.853  2002
369  2002.9771        15.771        31.447       26.970  2002

[36 rows x 5 columns]

2003
          date  northern_hem  southern_hem  mean_global  year
370  2003.0043         5.963        31.473       24.113  2003
371  2003.0314         2.851        31.508       23.798  2003
372  2003.0586         4.805        43.326       32.515  2003
373  2003.0857        11.383        41.987       34.928  2003
374  2003.1129         3.465        45.671       34.526  2003
375  2003.1400        -7.422        34.322       23.710  2003
376  2003.1672        -7.073        37.813       26.165  2003
377  2003.1943        -6.972        41.401       28.678  2003
..         ...           ...           ...          ...   ...
399  2003.7916        56.984         9.884       26.229  2003
400  2003.8187        51.042        15.270       27.206  2003
401  2003.8459        47.689        17.865       27.779  2003
402  2003.8730        50.523        19.480       30.516  2003
403  2003.9002        38.028        23.465          NaN  2003
404  2003.9273        38.324        24.332       30.318  2003
405  2003.9544        28.413        23.719       26.939  2003
406  2003.9816        20.909        24.111       25.179  2003

[37 rows x 5 columns]

2004
          date  northern_hem  southern_hem  mean_global  year
407  2004.0087        15.141        35.164       30.448  2004
408  2004.0359        15.516        42.777       36.011  2004
409  2004.0630         9.291        45.368       35.822  2004
410  2004.0902         0.779        43.484       31.936  2004
411  2004.1173        -8.565        29.311       20.438  2004
412  2004.1445         4.549        38.408       31.004  2004
413  2004.1716         4.706        37.540       31.054  2004
414  2004.1988         8.429        40.072       34.182  2004
..         ...           ...           ...          ...   ...
436  2004.7960        59.767        14.801       30.720  2004
437  2004.8232        49.215        25.103       32.614  2004
438  2004.8503        52.121        27.575       35.760  2004
439  2004.8775        55.696        23.664       35.617  2004
440  2004.9046        39.797        22.644       29.439  2004
441  2004.9318        33.859        26.647       30.321  2004
442  2004.9589        28.070        34.103       33.447  2004
443  2004.9861        30.828        45.432       42.391  2004

[37 rows x 5 columns]

2005
          date  northern_hem  southern_hem  mean_global  year
444  2005.0132        13.266        44.864       36.016  2005
445  2005.0404         8.487        41.739       33.156  2005
446  2005.0675         5.659        42.850       33.256  2005
447  2005.0947         1.446        42.562       32.125  2005
448  2005.1218         3.810        44.824       34.871  2005
449  2005.1490         2.986        50.571       38.288  2005
450  2005.1761        11.661        53.613       43.796  2005
451  2005.2033         5.552        53.487       41.497  2005
..         ...           ...           ...          ...   ...
473  2005.8005        59.212        27.535       38.238  2005
474  2005.8277        60.640        26.807       38.673  2005
475  2005.8548        62.702        27.678       40.573  2005
476  2005.8820        56.170        29.275       39.607  2005
477  2005.9091        42.528        36.391       39.121  2005
478  2005.9363        37.889        36.448       38.252  2005
479  2005.9634        31.711        46.602       42.833  2005
480  2005.9906        28.838        45.449       42.022  2005

[37 rows x 5 columns]

2006
          date  northern_hem  southern_hem  mean_global  year
481  2006.0177        15.208        40.995       34.920  2006
482  2006.0449         9.615        40.183       33.051  2006
483  2006.0720        11.103        42.116       35.496  2006
484  2006.0992        12.648        43.330       37.377  2006
485  2006.1263         1.121        54.380       39.798  2006
486  2006.1534         4.234        50.949       39.271  2006
487  2006.1806         3.625        51.639       39.790  2006
488  2006.2077        -2.376        41.009       31.129  2006
..         ...           ...           ...          ...   ...
509  2006.7779        72.515        25.209       42.603  2006
510  2006.8050        60.166        34.091       42.843  2006
511  2006.8321        53.757        35.536          NaN  2006
512  2006.8864        39.182        40.578       39.739  2006
513  2006.9136        35.415        38.551       37.942  2006
514  2006.9407        38.152        47.284       45.260  2006
515  2006.9679        34.626        42.577       42.091  2006
516  2006.9950        22.585        39.894       36.713  2006

[36 rows x 5 columns]

2007
          date  northern_hem  southern_hem  mean_global  year
517  2007.0222        14.020        39.105       33.873  2007
518  2007.0493         6.764        39.398       31.887  2007
519  2007.0765        12.670        47.309       39.598  2007
520  2007.1036         7.893        50.473       40.276  2007
521  2007.1308         2.065        50.542       38.291  2007
522  2007.1579         1.877        52.641       39.852  2007
523  2007.1851        -2.946        45.414       33.755  2007
524  2007.2122        -4.152        47.918       34.891  2007
..         ...           ...           ...          ...   ...
546  2007.8095        59.060        22.833       36.094  2007
547  2007.8366        51.554        24.945       34.622  2007
548  2007.8638        47.467        36.165       40.161  2007
549  2007.8909        51.270        40.482       45.064  2007
550  2007.9181        43.922        41.457       43.596  2007
551  2007.9452        38.697        40.180       41.739  2007
552  2007.9724        28.495        31.595       33.547  2007
553  2007.9995        21.967        34.136       33.442  2007

[37 rows x 5 columns]

2008
          date  northern_hem  southern_hem  mean_global  year
554  2008.0267        21.058        40.662       37.931  2008
555  2008.0538        19.814        49.348       43.468  2008
556  2008.0810        21.366        46.670       43.161  2008
557  2008.1081         9.819        39.811       35.051  2008
558  2008.1353         3.636        45.203       36.126  2008
559  2008.1624        -1.445        45.221       34.481  2008
560  2008.1896         0.857        52.063       39.613  2008
561  2008.2167         5.715        53.753       42.747  2008
..         ...           ...           ...          ...   ...
582  2008.7868        62.449        26.224       39.684  2008
583  2008.8140        58.994        31.039       41.257  2008
584  2008.8411        57.145        34.752       43.062  2008
585  2008.8683        57.446        38.421       46.017  2008
586  2008.8954        51.724        37.105       43.723  2008
587  2008.9226        47.520        40.971       45.242  2008
588  2008.9497        35.003        40.331       40.793  2008
589  2008.9769        27.860        42.238       40.079  2008

[36 rows x 5 columns]

2009
          date  northern_hem  southern_hem  mean_global  year
590  2009.0040        23.942        45.255       41.273  2009
591  2009.0311        20.181        52.552       44.988  2009
592  2009.0583        19.316        54.624       46.683  2009
593  2009.0854         9.433        48.872       39.889  2009
594  2009.1126         4.119        46.818       36.979  2009
595  2009.1397         3.346        57.280       43.305  2009
596  2009.1669         8.681        59.119       46.849  2009
597  2009.1940        11.278        59.213       48.000  2009
..         ...           ...           ...          ...   ...
619  2009.7913        63.070        38.871       47.685  2009
620  2009.8184        63.075        46.435       52.336  2009
621  2009.8456        62.655        49.685       54.538  2009
622  2009.8727        52.791        45.958       48.978  2009
623  2009.8999        53.320        41.828       47.605  2009
624  2009.9270        46.465        41.179       45.401  2009
625  2009.9542        37.661        49.314       47.678  2009
626  2009.9813        37.620        56.767       53.142  2009

[37 rows x 5 columns]

2010
          date  northern_hem  southern_hem  mean_global  year
627  2010.0085        31.284        53.366       49.536  2010
628  2010.0356        25.411        51.109       46.686  2010
629  2010.0628        12.342        54.702       44.267  2010
630  2010.0899         8.114        50.866       40.904  2010
631  2010.1171        12.005        58.937       47.728  2010
632  2010.1442        19.827        63.416       53.985  2010
633  2010.1714        21.825        64.558       55.642  2010
634  2010.1985        14.246        54.671       46.863  2010
..         ...           ...           ...          ...   ...
656  2010.7958        73.332        35.395       50.165  2010
657  2010.8229        66.836        36.806       48.586  2010
658  2010.8501        58.128        34.763       44.281  2010
659  2010.8772        52.196        33.227       41.655  2010
660  2010.9044        41.361        35.411       39.390  2010
661  2010.9315        41.026        44.128       45.394  2010
662  2010.9587        38.633        45.985       46.572  2010
663  2010.9858        34.141        46.744       46.239  2010

[37 rows x 5 columns]

2011
          date  northern_hem  southern_hem  mean_global  year
664  2011.0130        30.716        42.938       43.566  2011
665  2011.0401        23.483        41.860       40.867  2011
666  2011.0673        13.902        46.066       40.210  2011
667  2011.0944         5.609        58.039       44.603  2011
668  2011.1216        11.143        57.750       47.179  2011
669  2011.1487        12.212        56.681       47.204  2011
670  2011.1759         5.104        51.587       41.495  2011
671  2011.2030        -3.162        52.335       38.762  2011
..         ...           ...           ...          ...   ...
693  2011.8003        72.225        41.945       53.828  2011
694  2011.8274        63.727        35.358       46.806  2011
695  2011.8546        58.443        33.233       43.885  2011
696  2011.8817        49.444        39.057       44.402  2011
697  2011.9089        48.666        52.376       52.858  2011
698  2011.9360        51.497        49.351       53.230  2011
699  2011.9631        45.172        45.614       49.491  2011
700  2011.9903        36.838        48.626       48.917  2011

[37 rows x 5 columns]

2012
          date  northern_hem  southern_hem  mean_global  year
701  2012.0174        34.488        54.170       52.218  2012
702  2012.0446        26.415        59.324       52.830  2012
703  2012.0717        26.632        66.007       57.621  2012
704  2012.0989        19.153        68.557       56.729  2012
705  2012.1260        18.327        66.719       55.610  2012
706  2012.1532        10.008        64.715       51.437  2012
707  2012.1803        11.120        57.088       47.644  2012
708  2012.2075        21.211        58.073       52.460  2012
..         ...           ...           ...          ...   ...
730  2012.8047        71.692        46.084       56.444  2012
731  2012.8319        68.279        49.719       57.525  2012
732  2012.8590        69.288        51.148       59.318  2012
733  2012.8862        68.955        56.288       63.006  2012
734  2012.9133        65.818        58.031       63.669  2012
735  2012.9405        59.783        57.225       61.746  2012
736  2012.9676        53.291        54.447       58.504  2012
737  2012.9948        48.216        61.680       61.747  2012

[37 rows x 5 columns]

2013
          date  northern_hem  southern_hem  mean_global  year
738  2013.0219        41.826        64.677       61.889  2013
739  2013.0491        42.422        66.982       64.321  2013
740  2013.0762        39.225        67.524       63.984  2013
741  2013.1034        31.808        67.212       61.167  2013
742  2013.1305        25.700        65.306       58.161  2013
743  2013.1577        22.666        66.361       57.881  2013
744  2013.1848        22.479        76.215       63.789  2013
745  2013.2120        30.486        74.929       66.444  2013
..         ...           ...           ...          ...   ...
767  2013.8092        78.021        49.384       61.335  2013
768  2013.8364        75.166        51.062       61.503  2013
769  2013.8635        68.294        56.361       62.432  2013
770  2013.8907        65.842        58.415       63.435  2013
771  2013.9178        55.578        58.389       60.160  2013
772  2013.9450        47.291        53.524       54.927  2013
773  2013.9721        45.383        59.871       58.974  2013
774  2013.9993        46.902        63.605       62.804  2013

[37 rows x 5 columns]

2014
          date  northern_hem  southern_hem  mean_global  year
775  2014.0264        40.417        68.653       64.102  2014
776  2014.0536        34.571        67.932       62.125  2014
777  2014.0807        23.058        67.100       57.643  2014
778  2014.1079        21.168        72.272       60.502  2014
779  2014.1350        23.216        74.869       63.246  2014
780  2014.1621        18.486        82.303       66.032  2014
781  2014.1893        21.125        82.769       67.625  2014
782  2014.2164        21.179        74.717       62.877  2014
..         ...           ...           ...          ...   ...
803  2014.7866        83.396        55.520       67.344  2014
804  2014.8137        82.410        58.394       68.780  2014
805  2014.8409        79.392        53.893       65.301  2014
806  2014.8680        70.305        58.685       65.015  2014
807  2014.8951        59.157        55.438       59.381  2014
808  2014.9223        59.013        57.644       61.591  2014
809  2014.9494        63.004        68.561       70.723  2014
810  2014.9766        56.990        70.551       70.522  2014

[36 rows x 5 columns]

2015
          date  northern_hem  southern_hem  mean_global  year
811  2015.0037        49.183        67.742       66.662  2015
812  2015.0309        39.398        69.746       64.804  2015
813  2015.0580        31.524        69.317       62.115  2015
814  2015.0852        33.878        77.533       68.589  2015
815  2015.1123        29.822        81.370       69.745  2015
816  2015.1395        33.825        86.893       75.007  2015
817  2015.1666        34.251        85.865          NaN  2015
818  2015.1938        33.765        75.001          NaN  2015
..         ...           ...           ...          ...   ...
840  2015.7910        96.789        71.575          NaN  2015
841  2015.8182        92.405        74.234          NaN  2015
842  2015.8453        81.278        72.443          NaN  2015
843  2015.8725        79.341        67.228          NaN  2015
844  2015.8996        78.108        74.586          NaN  2015
845  2015.9268        76.599        77.518          NaN  2015
846  2015.9539        69.691        77.618          NaN  2015
847  2015.9811        59.957        76.646          NaN  2015

[37 rows x 5 columns]

2016
          date  northern_hem  southern_hem  mean_global  year
848  2016.0082        46.712        77.585          NaN  2016
849  2016.0354        42.882        82.549          NaN  2016
850  2016.0625        45.051        90.633          NaN  2016
851  2016.0897        43.793        88.760          NaN  2016
852  2016.1168        46.603        90.561          NaN  2016
853  2016.1440        49.218        84.902          NaN  2016
854  2016.1711        44.274        83.829          NaN  2016
855  2016.1983        33.961        86.020          NaN  2016
856  2016.2254        42.199        88.122          NaN  2016
857  2016.2526        45.004        88.090          NaN  2016
858  2016.2797        49.809        87.806          NaN  2016
859  2016.3068        55.401        77.270          NaN  2016
860  2016.3340        50.416        70.916          NaN  2016

We could have done the same with less effort by grouping by the result of a custom function applied to the index. Let's reset the dataframe:


In [128]:
mean_sea_level = mean_sea_level.drop(["year"], axis=1).set_index("date")

So that we can do the groupby on the index:


In [129]:
sl_grouped_year = mean_sea_level.groupby(int)

Something else that can be done with such an object is to look at its groups attribute to see the labels mapped to the rows involved:


In [130]:
sl_grouped_year.groups


Out[130]:
{1992: [1992.9323, 1992.9595, 1992.9866],
 1993: [1993.0138,
  1993.0409,
  1993.0681,
  1993.0952,
  1993.1223,
  1993.1495,
  1993.1766,
  1993.2038,
  1993.2309,
  1993.2581,
  1993.2852,
  1993.3124,
  1993.3395,
  1993.3667,
  1993.3938,
  1993.421,
  1993.4481,
  1993.4753,
  1993.5024,
  1993.5296,
  1993.5567,
  1993.5839,
  1993.611,
  1993.6382,
  1993.6653,
  1993.6925,
  1993.7196,
  1993.7468,
  1993.7739,
  1993.801,
  1993.8282,
  1993.8553,
  1993.8825,
  1993.9096,
  1993.9368,
  1993.9639,
  1993.9911],
 1994: [1994.0182,
  1994.0454,
  1994.0725,
  1994.0997,
  1994.1268,
  1994.154,
  1994.1811,
  1994.2083,
  1994.2354,
  1994.2626,
  1994.2897,
  1994.3169,
  1994.344,
  1994.3712,
  1994.3983,
  1994.4255,
  1994.4526,
  1994.4798,
  1994.5069,
  1994.534,
  1994.5612,
  1994.5883,
  1994.6155,
  1994.6426,
  1994.6698,
  1994.6969,
  1994.7241,
  1994.7512,
  1994.7784,
  1994.8055,
  1994.8327,
  1994.8598,
  1994.887,
  1994.9141,
  1994.9413,
  1994.9684,
  1994.9956],
 1995: [1995.0227,
  1995.0499,
  1995.077,
  1995.1042,
  1995.1313,
  1995.1585,
  1995.1856,
  1995.2128,
  1995.2399,
  1995.267,
  1995.2942,
  1995.3213,
  1995.3485,
  1995.3756,
  1995.4028,
  1995.4299,
  1995.4571,
  1995.4842,
  1995.5114,
  1995.5385,
  1995.5657,
  1995.5928,
  1995.62,
  1995.6471,
  1995.6743,
  1995.7014,
  1995.7286,
  1995.7557,
  1995.7829,
  1995.81,
  1995.8372,
  1995.8643,
  1995.8915,
  1995.9458,
  1995.9729],
 1996: [1996.0,
  1996.0272,
  1996.0543,
  1996.0815,
  1996.1086,
  1996.1358,
  1996.1629,
  1996.1901,
  1996.2172,
  1996.2444,
  1996.2715,
  1996.2987,
  1996.3258,
  1996.353,
  1996.3801,
  1996.4073,
  1996.4344,
  1996.4616,
  1996.4887,
  1996.5159,
  1996.543,
  1996.5702,
  1996.5973,
  1996.6245,
  1996.6516,
  1996.6788,
  1996.7059,
  1996.733,
  1996.7602,
  1996.7873,
  1996.8145,
  1996.8416,
  1996.8688,
  1996.8959,
  1996.9231,
  1996.9502,
  1996.9774],
 1997: [1997.0045,
  1997.0317,
  1997.0588,
  1997.086,
  1997.1131,
  1997.1403,
  1997.1674,
  1997.1946,
  1997.2217,
  1997.2489,
  1997.276,
  1997.3032,
  1997.3303,
  1997.3575,
  1997.3846,
  1997.4118,
  1997.4389,
  1997.466,
  1997.4932,
  1997.5203,
  1997.5475,
  1997.5746,
  1997.6018,
  1997.6289,
  1997.6561,
  1997.6832,
  1997.7104,
  1997.7375,
  1997.7647,
  1997.7918,
  1997.819,
  1997.8461,
  1997.8733,
  1997.9004,
  1997.9276,
  1997.9547,
  1997.9819],
 1998: [1998.009,
  1998.0362,
  1998.0633,
  1998.0905,
  1998.1176,
  1998.1448,
  1998.1719,
  1998.199,
  1998.2262,
  1998.2533,
  1998.2805,
  1998.3076,
  1998.3348,
  1998.3619,
  1998.3891,
  1998.4162,
  1998.4434,
  1998.4705,
  1998.4977,
  1998.5248,
  1998.552,
  1998.5791,
  1998.6063,
  1998.6334,
  1998.6606,
  1998.6877,
  1998.7149,
  1998.742,
  1998.7692,
  1998.7963,
  1998.8235,
  1998.8506,
  1998.8777,
  1998.9049,
  1998.932,
  1998.9592,
  1998.9863],
 1999: [1999.0135,
  1999.0406,
  1999.0678,
  1999.0949,
  1999.1221,
  1999.1492,
  1999.1764,
  1999.2035,
  1999.2307,
  1999.2578,
  1999.285,
  1999.3121,
  1999.3393,
  1999.3664,
  1999.3936,
  1999.4207,
  1999.4479,
  1999.475,
  1999.5022,
  1999.5293,
  1999.5565,
  1999.5836,
  1999.6107,
  1999.6379,
  1999.665,
  1999.6922,
  1999.7193,
  1999.7465,
  1999.7736,
  1999.8008,
  1999.8279,
  1999.8551,
  1999.8822,
  1999.9094,
  1999.9365,
  1999.9637,
  1999.9908],
 2000: [2000.018,
  2000.0451,
  2000.0723,
  2000.0994,
  2000.1266,
  2000.1537,
  2000.1809,
  2000.208,
  2000.2352,
  2000.2623,
  2000.2895,
  2000.3166,
  2000.3437,
  2000.3709,
  2000.398,
  2000.4252,
  2000.4523,
  2000.4795,
  2000.5066,
  2000.5338,
  2000.5609,
  2000.5881,
  2000.6152,
  2000.6424,
  2000.6695,
  2000.6967,
  2000.7238,
  2000.751,
  2000.7781,
  2000.8053,
  2000.8324,
  2000.8596,
  2000.8867,
  2000.9139,
  2000.941,
  2000.9682,
  2000.9953],
 2001: [2001.0225,
  2001.0496,
  2001.0767,
  2001.1039,
  2001.131,
  2001.1582,
  2001.1853,
  2001.2125,
  2001.2396,
  2001.2668,
  2001.2939,
  2001.3211,
  2001.3482,
  2001.3754,
  2001.4025,
  2001.4297,
  2001.4568,
  2001.484,
  2001.5111,
  2001.5383,
  2001.5654,
  2001.5926,
  2001.6197,
  2001.6469,
  2001.674,
  2001.7012,
  2001.7283,
  2001.7555,
  2001.7826,
  2001.8097,
  2001.8369,
  2001.864,
  2001.8912,
  2001.9183,
  2001.9455,
  2001.9726,
  2001.9998],
 2002: [2002.0269,
  2002.0541,
  2002.0812,
  2002.1084,
  2002.1355,
  2002.1627,
  2002.1898,
  2002.217,
  2002.2441,
  2002.2713,
  2002.2984,
  2002.3256,
  2002.3527,
  2002.3799,
  2002.407,
  2002.4342,
  2002.4613,
  2002.4885,
  2002.5156,
  2002.5427,
  2002.5699,
  2002.597,
  2002.6242,
  2002.6513,
  2002.6785,
  2002.7056,
  2002.7328,
  2002.7599,
  2002.7871,
  2002.8142,
  2002.8414,
  2002.8685,
  2002.8957,
  2002.9228,
  2002.95,
  2002.9771],
 2003: [2003.0043,
  2003.0314,
  2003.0586,
  2003.0857,
  2003.1129,
  2003.14,
  2003.1672,
  2003.1943,
  2003.2214,
  2003.2486,
  2003.2757,
  2003.3029,
  2003.33,
  2003.3572,
  2003.3843,
  2003.4115,
  2003.4386,
  2003.4658,
  2003.4929,
  2003.5201,
  2003.5472,
  2003.5744,
  2003.6015,
  2003.6287,
  2003.6558,
  2003.683,
  2003.7101,
  2003.7373,
  2003.7644,
  2003.7916,
  2003.8187,
  2003.8459,
  2003.873,
  2003.9002,
  2003.9273,
  2003.9544,
  2003.9816],
 2004: [2004.0087,
  2004.0359,
  2004.063,
  2004.0902,
  2004.1173,
  2004.1445,
  2004.1716,
  2004.1988,
  2004.2259,
  2004.2531,
  2004.2802,
  2004.3074,
  2004.3345,
  2004.3617,
  2004.3888,
  2004.416,
  2004.4431,
  2004.4703,
  2004.4974,
  2004.5246,
  2004.5517,
  2004.5789,
  2004.606,
  2004.6332,
  2004.6603,
  2004.6874,
  2004.7146,
  2004.7417,
  2004.7689,
  2004.796,
  2004.8232,
  2004.8503,
  2004.8775,
  2004.9046,
  2004.9318,
  2004.9589,
  2004.9861],
 2005: [2005.0132,
  2005.0404,
  2005.0675,
  2005.0947,
  2005.1218,
  2005.149,
  2005.1761,
  2005.2033,
  2005.2304,
  2005.2576,
  2005.2847,
  2005.3119,
  2005.339,
  2005.3662,
  2005.3933,
  2005.4204,
  2005.4476,
  2005.4747,
  2005.5019,
  2005.529,
  2005.5562,
  2005.5833,
  2005.6105,
  2005.6376,
  2005.6648,
  2005.6919,
  2005.7191,
  2005.7462,
  2005.7734,
  2005.8005,
  2005.8277,
  2005.8548,
  2005.882,
  2005.9091,
  2005.9363,
  2005.9634,
  2005.9906],
 2006: [2006.0177,
  2006.0449,
  2006.072,
  2006.0992,
  2006.1263,
  2006.1534,
  2006.1806,
  2006.2077,
  2006.2349,
  2006.262,
  2006.2892,
  2006.3163,
  2006.3435,
  2006.3706,
  2006.3978,
  2006.4249,
  2006.4521,
  2006.4792,
  2006.5064,
  2006.5335,
  2006.5607,
  2006.5878,
  2006.615,
  2006.6421,
  2006.6693,
  2006.6964,
  2006.7236,
  2006.7507,
  2006.7779,
  2006.805,
  2006.8321,
  2006.8864,
  2006.9136,
  2006.9407,
  2006.9679,
  2006.995],
 2007: [2007.0222,
  2007.0493,
  2007.0765,
  2007.1036,
  2007.1308,
  2007.1579,
  2007.1851,
  2007.2122,
  2007.2394,
  2007.2665,
  2007.2937,
  2007.3208,
  2007.348,
  2007.3751,
  2007.4023,
  2007.4294,
  2007.4566,
  2007.4837,
  2007.5109,
  2007.538,
  2007.5651,
  2007.5923,
  2007.6194,
  2007.6466,
  2007.6737,
  2007.7009,
  2007.728,
  2007.7552,
  2007.7823,
  2007.8095,
  2007.8366,
  2007.8638,
  2007.8909,
  2007.9181,
  2007.9452,
  2007.9724,
  2007.9995],
 2008: [2008.0267,
  2008.0538,
  2008.081,
  2008.1081,
  2008.1353,
  2008.1624,
  2008.1896,
  2008.2167,
  2008.2439,
  2008.271,
  2008.2981,
  2008.3253,
  2008.3524,
  2008.3796,
  2008.4067,
  2008.4339,
  2008.461,
  2008.4882,
  2008.5153,
  2008.5425,
  2008.5696,
  2008.5968,
  2008.6239,
  2008.6511,
  2008.6782,
  2008.7054,
  2008.7325,
  2008.7597,
  2008.7868,
  2008.814,
  2008.8411,
  2008.8683,
  2008.8954,
  2008.9226,
  2008.9497,
  2008.9769],
 2009: [2009.004,
  2009.0311,
  2009.0583,
  2009.0854,
  2009.1126,
  2009.1397,
  2009.1669,
  2009.194,
  2009.2212,
  2009.2483,
  2009.2755,
  2009.3026,
  2009.3298,
  2009.3569,
  2009.3841,
  2009.4112,
  2009.4384,
  2009.4655,
  2009.4927,
  2009.5198,
  2009.547,
  2009.5741,
  2009.6013,
  2009.6284,
  2009.6556,
  2009.6827,
  2009.7099,
  2009.737,
  2009.7642,
  2009.7913,
  2009.8184,
  2009.8456,
  2009.8727,
  2009.8999,
  2009.927,
  2009.9542,
  2009.9813],
 2010: [2010.0085,
  2010.0356,
  2010.0628,
  2010.0899,
  2010.1171,
  2010.1442,
  2010.1714,
  2010.1985,
  2010.2257,
  2010.2528,
  2010.28,
  2010.3071,
  2010.3343,
  2010.3614,
  2010.3886,
  2010.4157,
  2010.4429,
  2010.47,
  2010.4971,
  2010.5243,
  2010.5514,
  2010.5786,
  2010.6057,
  2010.6329,
  2010.66,
  2010.6872,
  2010.7143,
  2010.7415,
  2010.7686,
  2010.7958,
  2010.8229,
  2010.8501,
  2010.8772,
  2010.9044,
  2010.9315,
  2010.9587,
  2010.9858],
 2011: [2011.013,
  2011.0401,
  2011.0673,
  2011.0944,
  2011.1216,
  2011.1487,
  2011.1759,
  2011.203,
  2011.2301,
  2011.2573,
  2011.2844,
  2011.3116,
  2011.3387,
  2011.3659,
  2011.393,
  2011.4202,
  2011.4473,
  2011.4745,
  2011.5016,
  2011.5288,
  2011.5559,
  2011.5831,
  2011.6102,
  2011.6374,
  2011.6645,
  2011.6917,
  2011.7188,
  2011.746,
  2011.7731,
  2011.8003,
  2011.8274,
  2011.8546,
  2011.8817,
  2011.9089,
  2011.936,
  2011.9631,
  2011.9903],
 2012: [2012.0174,
  2012.0446,
  2012.0717,
  2012.0989,
  2012.126,
  2012.1532,
  2012.1803,
  2012.2075,
  2012.2346,
  2012.2618,
  2012.2889,
  2012.3161,
  2012.3432,
  2012.3704,
  2012.3975,
  2012.4247,
  2012.4518,
  2012.479,
  2012.5061,
  2012.5333,
  2012.5604,
  2012.5876,
  2012.6147,
  2012.6419,
  2012.669,
  2012.6961,
  2012.7233,
  2012.7504,
  2012.7776,
  2012.8047,
  2012.8319,
  2012.859,
  2012.8862,
  2012.9133,
  2012.9405,
  2012.9676,
  2012.9948],
 2013: [2013.0219,
  2013.0491,
  2013.0762,
  2013.1034,
  2013.1305,
  2013.1577,
  2013.1848,
  2013.212,
  2013.2391,
  2013.2663,
  2013.2934,
  2013.3206,
  2013.3477,
  2013.3749,
  2013.402,
  2013.4291,
  2013.4563,
  2013.4834,
  2013.5106,
  2013.5377,
  2013.5649,
  2013.592,
  2013.6192,
  2013.6463,
  2013.6735,
  2013.7006,
  2013.7278,
  2013.7549,
  2013.7821,
  2013.8092,
  2013.8364,
  2013.8635,
  2013.8907,
  2013.9178,
  2013.945,
  2013.9721,
  2013.9993],
 2014: [2014.0264,
  2014.0536,
  2014.0807,
  2014.1079,
  2014.135,
  2014.1621,
  2014.1893,
  2014.2164,
  2014.2436,
  2014.2707,
  2014.2979,
  2014.325,
  2014.3522,
  2014.3793,
  2014.4065,
  2014.4336,
  2014.4608,
  2014.4879,
  2014.5151,
  2014.5422,
  2014.5694,
  2014.5965,
  2014.6237,
  2014.6508,
  2014.678,
  2014.7051,
  2014.7323,
  2014.7594,
  2014.7866,
  2014.8137,
  2014.8409,
  2014.868,
  2014.8951,
  2014.9223,
  2014.9494,
  2014.9766],
 2015: [2015.0037,
  2015.0309,
  2015.058,
  2015.0852,
  2015.1123,
  2015.1395,
  2015.1666,
  2015.1938,
  2015.2209,
  2015.2481,
  2015.2752,
  2015.3024,
  2015.3295,
  2015.3567,
  2015.3838,
  2015.411,
  2015.4381,
  2015.4653,
  2015.4924,
  2015.5196,
  2015.5467,
  2015.5739,
  2015.601,
  2015.6281,
  2015.6553,
  2015.6824,
  2015.7096,
  2015.7367,
  2015.7639,
  2015.791,
  2015.8182,
  2015.8453,
  2015.8725,
  2015.8996,
  2015.9268,
  2015.9539,
  2015.9811],
 2016: [2016.0082,
  2016.0354,
  2016.0625,
  2016.0897,
  2016.1168,
  2016.144,
  2016.1711,
  2016.1983,
  2016.2254,
  2016.2526,
  2016.2797,
  2016.3068,
  2016.334]}

How to aggregate the results of this grouping depends on what we want to see: do we want to see averaged over the years? That is so common that it has been implemented directly as a method on the GroupBy object.


In [131]:
sl_grouped_year.mean()


Out[131]:
northern_hem southern_hem mean_global
1992 0.556667 -2.465000 -6.671500
1993 -3.787108 -5.485378 -4.181882
1994 -4.631432 -3.523946 -3.668971
1995 -2.709314 -1.625171 -1.755452
1996 -1.184541 -0.204919 0.775912
1997 1.205108 8.180189 7.155545
1998 10.994865 4.420541 10.108182
1999 10.244595 8.327162 12.321727
... ... ... ...
2009 37.268270 43.703973 46.127108
2010 39.910946 43.347676 47.285838
2011 36.664865 43.639108 46.442865
2012 48.506541 53.315595 57.293730
2013 52.047757 54.749270 59.685571
2014 51.709861 60.794194 63.649583
2015 61.329568 71.623676 67.820333
2016 45.794077 84.387923 NaN

25 rows × 3 columns


In [132]:
# We can apply any other reduction function or even a dict of functions using aggregate:
sl_grouped_year.aggregate({"mean_global": np.std})


Out[132]:
mean_global
1992 1.207031
1993 3.411971
1994 3.347750
1995 3.486596
1996 3.138087
1997 5.444541
1998 3.565379
1999 2.960703
... ...
2009 3.776398
2010 4.000347
2011 4.277106
2012 4.249205
2013 3.276180
2014 3.716325
2015 4.448240
2016 NaN

25 rows × 1 columns

Another possibility is to transform each group separately, rather than aggregate. For example, here we group over decades and subtract to each value, the average over that decade:


In [133]:
sl_grouped_decade = mean_sea_level.groupby(lambda x: int(x/10.))
sl_grouped_decade.groups.keys()


Out[133]:
[200, 201, 199]

In [134]:
sl_grouped_decade.transform(lambda subframe: (subframe - subframe.mean()/subframe.std()))


Out[134]:
northern_hem southern_hem mean_global
date
1992.9323 13.652814 2.395273 NaN
1992.9595 1.263814 -7.766727 -6.207707
1992.9866 -13.439186 -2.388727 -7.914707
1993.0138 -24.787186 0.733273 -9.604707
1993.0409 -29.295186 -2.141727 -12.185707
1993.0681 -32.602186 -0.519727 -11.680707
1993.0952 -35.652186 3.980273 -9.958707
1993.1223 -26.643186 6.665273 -4.103707
... ... ... ...
2016.1440 47.147536 81.172989 NaN
2016.1711 42.203536 80.099989 NaN
2016.1983 31.890536 82.290989 NaN
2016.2254 40.128536 84.392989 NaN
2016.2526 42.933536 84.360989 NaN
2016.2797 47.738536 84.076989 NaN
2016.3068 53.330536 73.540989 NaN
2016.3340 48.345536 67.186989 NaN

861 rows × 3 columns

Pivot_table

Pivot table also allows to summarize the information, allowing to convert repeating columns into axes. For example, let's say that we would like to know how many sea level stations are in various european countries. And we would like to group the answers into 2 categories: the stations that have been updated recently (after 2000) and the others.

Let's first extract only entries located (roughly) in Europe.


In [135]:
european_filter = ((local_sea_level_stations["Lat"] > 30) & 
                   (local_sea_level_stations["Lat"] < 70) & 
                   (local_sea_level_stations["Lon"] > -10) & 
                   (local_sea_level_stations["Lon"] < 40) 
                   )

# Let's make a copy to work with a new, clean block of memory 
# (if you are interested, try and remove the copy to see the consequences further down...
european_stations = local_sea_level_stations[european_filter].copy()
european_stations["Country"].unique()


Out[135]:
[ALA, BEL, BGR, HRV, CYP, ..., ESP, SWE, TUR, UKR, GBR]
Length: 37
Categories (37, object): [ALA, BEL, BGR, HRV, ..., SWE, TUR, UKR, GBR]

The columns of our future table should have 2 values, whether the station was updated recently or not. Let's build a column to store that information:


In [136]:
european_stations["Recently updated"] = european_stations["Date"] > pd.to_datetime("2000")

Finally, what value will be displayed inside the table. The values should be extracted from a column, pivot_table allowing an aggregation function to be applied when more than 1 value is found for a given case. Each station should count for 1, and we could aggregate multiple stations by summing these ones:


In [137]:
european_stations["Number of stations"] = np.ones(len(european_stations))

In [138]:
european_stations.sort_values(by="Country")


Out[138]:
Station Name ID Lat Lon GLOSS ID Country Date Coastline Station Recently updated Number of stations
0 HELIGMAN 231 60.200 19.300 NaN ALA 1980-01-01 60 251 False 1.0
1 KOBBAKLINTAR 63 60.033 19.883 NaN ALA 1980-01-01 60 261 False 1.0
2 LEMSTROM 84 60.100 20.017 NaN ALA 1980-01-01 60 271 False 1.0
3 FOGLO / DEGERBY 249 60.032 20.385 NaN ALA 2015-06-10 60 281 True 1.0
116 ZEEBRUGGE 470 51.350 3.200 NaN BEL 2014-01-20 160 11 True 1.0
117 OOSTENDE 413 51.233 2.917 NaN BEL 2016-06-14 160 21 True 1.0
118 NIEUWPOORT 489 51.150 2.733 NaN BEL 2016-05-26 160 31 True 1.0
134 BOURGAS 317 42.483 27.483 NaN BGR 1997-08-27 295 21 False 1.0
... ... ... ... ... ... ... ... ... ... ... ...
1193 ANTALYA II 1681 36.833 30.617 NaN TUR 2013-09-07 310 52 True 1.0
1192 BODRUM II 1680 37.033 27.417 NaN TUR 2013-09-07 310 46 True 1.0
1191 MENTES/IZMIR 1679 38.433 26.717 NaN TUR 2013-09-07 310 42 True 1.0
1190 ERDEK 1598 40.383 27.850 NaN TUR 2013-09-07 310 38 True 1.0
1189 MARMARA EREGLISI 2010 40.967 27.967 NaN TUR 2010-06-28 310 37 True 1.0
1188 AMASRA 1919 41.433 32.233 NaN TUR 2010-06-24 310 26 True 1.0
1187 IGNEDA 1926 41.883 28.017 NaN TUR 2013-09-07 310 18 True 1.0
1198 SEVASTOPOL 42 44.617 33.533 NaN UKR 1998-03-02 298 41 False 1.0

424 rows × 11 columns


In [139]:
station_counts = pd.pivot_table(european_stations, index="Country", columns="Recently updated", 
                                values="Number of stations", aggfunc=np.sum)
# Let's remove from the table the countries for which no station was found:
station_counts.dropna(how="all")


Out[139]:
Recently updated False True
Country
ALA 3.0 1.0
BEL NaN 3.0
BGR 2.0 NaN
CYP 1.0 1.0
DEU NaN 9.0
DNK 1.0 14.0
EGY 1.0 1.0
ESP 7.0 38.0
... ... ...
POL 1.0 6.0
PRT 6.0 2.0
ROU 1.0 NaN
RUS 6.0 6.0
SVN 1.0 1.0
SWE 11.0 26.0
TUR NaN 10.0
UKR 1.0 NaN

37 rows × 2 columns

QUIZ: Why is there still some countries with no entries?

EXERCISE: How many recently updated stations? Not recently updated stations? Which country has the most stations? Which country has the most recently updated stations?


In [140]:
# Your code here

EXERCISE: How would we build the same dataframe with a groupby operation?


In [141]:
# Your code here

EXERCISE: Refer to exercises/pivot_table/pivot_tables.py

10. Correlations and regressions

Correlation coefficients

Both Series and dataframes have a corr method to compute the correlation coefficient between series:


In [142]:
# Let's see what how the various sea levels are correlated with each other:
mean_sea_level["northern_hem"].corr(mean_sea_level["southern_hem"])


Out[142]:
0.347989603115341

In [143]:
# If series are already grouped into a DataFrame, computing all correlation coeff is trivial:
mean_sea_level.corr()


Out[143]:
northern_hem southern_hem mean_global
northern_hem 1.000000 0.347990 0.649231
southern_hem 0.347990 1.000000 0.896886
mean_global 0.649231 0.896886 1.000000

Note: by default, the method used is the Pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). Other methods are available (kendall, spearman using the method kwarg).


In [144]:
# Visualize the correlation matrix
plt.imshow(mean_sea_level.corr(), interpolation="nearest")


Out[144]:
<matplotlib.image.AxesImage at 0x11a83e3d0>

In [145]:
plt.yticks?

In [146]:
# let's make it a little better to confirm that learning about global sea level cannot be done from just 
# looking at stations in the northern hemisphere:
plt.imshow(mean_sea_level.corr(), interpolation="nearest")
plt.xticks(np.arange(3), mean_sea_level.corr().columns)
plt.yticks(np.arange(3), mean_sea_level.corr().index)
plt.colorbar()


Out[146]:
<matplotlib.colorbar.Colorbar at 0x11aa8fe90>

OLS

The recommeded way to build ordinaty least square regressions is by using statsmodels.


In [147]:
import statsmodels.formula.api as sm

In [148]:
sm_model = sm.ols(formula="mean_global ~ northern_hem + southern_hem", data=mean_sea_level).fit()

In [149]:
sm_model.params


Out[149]:
Intercept       1.468349
northern_hem    0.333625
southern_hem    0.735599
dtype: float64

In [150]:
type(sm_model.params)


Out[150]:
pandas.core.series.Series

In [151]:
sm_model.summary()


Out[151]:
OLS Regression Results
Dep. Variable: mean_global R-squared: 0.988
Model: OLS Adj. R-squared: 0.988
Method: Least Squares F-statistic: 3.188e+04
Date: Sun, 10 Jul 2016 Prob (F-statistic): 0.00
Time: 22:17:18 Log-Likelihood: -1765.7
No. Observations: 780 AIC: 3537.
Df Residuals: 777 BIC: 3551.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.4683 0.138 10.651 0.000 1.198 1.739
northern_hem 0.3336 0.003 108.839 0.000 0.328 0.340
southern_hem 0.7356 0.004 191.198 0.000 0.728 0.743
Omnibus: 86.727 Durbin-Watson: 0.134
Prob(Omnibus): 0.000 Jarque-Bera (JB): 29.349
Skew: 0.206 Prob(JB): 4.24e-07
Kurtosis: 2.144 Cond. No. 75.8

In [152]:
plt.figure(figsize=LARGE_FIGSIZE)
mean_sea_level["mean_global"].plot()
sm_model.fittedvalues.plot(label="OLS prediction")
plt.legend(loc="upper left")


Out[152]:
<matplotlib.legend.Legend at 0x117e76590>

An interlude: data alignment

Converting the floating point date to a timestamp

Now, we would like to look for correlations between our monthly temperatures and the sea levels we have. For this to be possible, some data alignment must be done since the time scales are very different for the 2 datasets.


In [153]:
mean_sea_level["mean_global"].index


Out[153]:
Float64Index([1992.9323, 1992.9595, 1992.9866, 1993.0138, 1993.0409, 1993.0681,
              1993.0952, 1993.1223, 1993.1495, 1993.1766,
              ...
              2016.0897, 2016.1168,  2016.144, 2016.1711, 2016.1983, 2016.2254,
              2016.2526, 2016.2797, 2016.3068,  2016.334],
             dtype='float64', name=u'date', length=861)

In [154]:
giss_temp_series.index


Out[154]:
PeriodIndex(['1880-01', '1880-02', '1880-03', '1880-04', '1880-05', '1880-06',
             '1880-07', '1880-08', '1880-09', '1880-10',
             ...
             '2015-03', '2015-04', '2015-05', '2015-06', '2015-07', '2015-08',
             '2015-09', '2015-10', '2015-11', '2015-12'],
            dtype='int64', length=1632, freq='M')

In [155]:
DAYS_PER_YEAR = {}

In [156]:
import calendar
# Let's first convert the floating point dates in the sea level to timestamps:
def floating_year_to_timestamp(float_date):
    """ Convert a date as a floating point year number to a pandas timestamp object.
    """
    year = int(float_date)
    days_per_year = 366 if calendar.isleap(year) else 365
    remainder = float_date - year
    daynum = 1 + remainder * (days_per_year - 1)
    daynum = int(round(daynum))
    # Convert day number to month and day
    day = daynum
    month = 1
    while month < 13:
        month_days = calendar.monthrange(year, month)[1]
        if day <= month_days:
            return pd.Timestamp(str(year)+"/"+str(month)+"/"+str(day))
        day -= month_days
        month += 1
    raise ValueError('{} does not have {} days'.format(year, daynum))

In [157]:
floating_year_to_timestamp(1996.0), floating_year_to_timestamp(1996.5), floating_year_to_timestamp(1996.9999)


Out[157]:
(Timestamp('1996-01-01 00:00:00'),
 Timestamp('1996-07-02 00:00:00'),
 Timestamp('1996-12-31 00:00:00'))

In [158]:
dt_index = pd.Series(mean_sea_level["mean_global"].index).apply(floating_year_to_timestamp)
dt_index


Out[158]:
0     1992-12-06
1     1992-12-16
2     1992-12-26
3     1993-01-06
4     1993-01-16
5     1993-01-26
6     1993-02-05
7     1993-02-15
         ...    
853   2016-02-23
854   2016-03-03
855   2016-03-13
856   2016-03-23
857   2016-04-02
858   2016-04-12
859   2016-04-22
860   2016-05-02
Name: date, dtype: datetime64[ns]

In [159]:
mean_sea_level = mean_sea_level.reset_index(drop=True)
mean_sea_level.index = dt_index
mean_sea_level


Out[159]:
northern_hem southern_hem mean_global
date
1992-12-06 13.717 2.517 NaN
1992-12-16 1.328 -7.645 -5.818
1992-12-26 -13.375 -2.267 -7.525
1993-01-06 -24.723 0.855 -9.215
1993-01-16 -29.231 -2.020 -11.796
1993-01-26 -32.538 -0.398 -11.291
1993-02-05 -35.588 4.102 -9.569
1993-02-15 -26.579 6.787 -3.714
... ... ... ...
2016-02-23 49.218 84.902 NaN
2016-03-03 44.274 83.829 NaN
2016-03-13 33.961 86.020 NaN
2016-03-23 42.199 88.122 NaN
2016-04-02 45.004 88.090 NaN
2016-04-12 49.809 87.806 NaN
2016-04-22 55.401 77.270 NaN
2016-05-02 50.416 70.916 NaN

861 rows × 3 columns

Now, how to align the 2 series? Is this one sampled regularly so that the month temperatures can be upscaled to that frequency?

Computing the difference between successive values

What is the frequency of that new index?


In [160]:
dt_index.dtype


Out[160]:
dtype('<M8[ns]')

In [161]:
# What is the frequency of the new index? The numpy way to compute differences between all values doesn't work:
dt_index[1:] - dt_index[:-1]


Out[161]:
0        NaT
1     0 days
2     0 days
3     0 days
4     0 days
5     0 days
6     0 days
7     0 days
       ...  
853   0 days
854   0 days
855   0 days
856   0 days
857   0 days
858   0 days
859   0 days
860      NaT
Name: date, dtype: timedelta64[ns]

IMPORTANT Note: The above failure is due to the fact that operations between series automatically align them based on their index.


In [162]:
# There is a method for shifting values up/down the index:
dt_index.shift()


Out[162]:
0            NaT
1     1992-12-06
2     1992-12-16
3     1992-12-26
4     1993-01-06
5     1993-01-16
6     1993-01-26
7     1993-02-05
         ...    
853   2016-02-13
854   2016-02-23
855   2016-03-03
856   2016-03-13
857   2016-03-23
858   2016-04-02
859   2016-04-12
860   2016-04-22
Name: date, dtype: datetime64[ns]

In [163]:
# So the distances can be computed with 
dt_index - dt_index.shift()


Out[163]:
0         NaT
1     10 days
2     10 days
3     11 days
4     10 days
5     10 days
6     10 days
7     10 days
        ...  
853   10 days
854    9 days
855   10 days
856   10 days
857   10 days
858   10 days
859   10 days
860   10 days
Name: date, dtype: timedelta64[ns]

In [164]:
# Not constant reads apparently. Let's downscale the frequency of the sea levels 
# to monthly, like the temperature reads we have:
monthly_mean_sea_level = mean_sea_level.resample("MS").mean().to_period()
monthly_mean_sea_level


Out[164]:
northern_hem southern_hem mean_global
date
1992-12 0.556667 -2.465000 -6.671500
1993-01 -28.830667 -0.521000 -10.767333
1993-02 -30.596333 7.068000 -5.251333
1993-03 -32.431667 6.967000 -5.136333
1993-04 -23.668000 1.549667 -1.118000
1993-05 -21.731333 -2.863000 -6.319667
1993-06 -4.644333 -4.418000 -0.902333
1993-07 0.172667 -12.684667 -4.508000
... ... ... ...
2015-10 96.100667 70.860667 NaN
2015-11 79.575667 71.419000 NaN
2015-12 68.749000 77.260667 NaN
2016-01 44.881667 83.589000 NaN
2016-02 46.538000 88.074333 NaN
2016-03 40.144667 85.990333 NaN
2016-04 50.071333 84.388667 NaN
2016-05 50.416000 70.916000 NaN

282 rows × 3 columns


In [165]:
monthly_mean_sea_level["mean_global"].align(giss_temp_series)


Out[165]:
(1880-01   NaN
 1880-02   NaN
 1880-03   NaN
 1880-04   NaN
 1880-05   NaN
 1880-06   NaN
 1880-07   NaN
 1880-08   NaN
            ..
 2015-10   NaN
 2015-11   NaN
 2015-12   NaN
 2016-01   NaN
 2016-02   NaN
 2016-03   NaN
 2016-04   NaN
 2016-05   NaN
 Freq: M, Name: mean_global, dtype: float64, 1880-01   -34.0
 1880-02   -27.0
 1880-03   -22.0
 1880-04   -30.0
 1880-05   -16.0
 1880-06   -24.0
 1880-07   -19.0
 1880-08   -12.0
            ... 
 2015-10     NaN
 2015-11     NaN
 2015-12     NaN
 2016-01     NaN
 2016-02     NaN
 2016-03     NaN
 2016-04     NaN
 2016-05     NaN
 Freq: M, Name: Temp anomaly, dtype: float32)

In [166]:
giss_temp_series.align?

In [167]:
# Now that the series are using the same type and frequency of indexes, to align them is trivial:
monthly_mean_sea_level["mean_global"].align(giss_temp_series, join='inner')


Out[167]:
(1992-12    -6.671500
 1993-01   -10.767333
 1993-02    -5.251333
 1993-03    -5.136333
 1993-04    -1.118000
 1993-05    -6.319667
 1993-06    -0.902333
 1993-07    -4.508000
              ...    
 2015-05          NaN
 2015-06          NaN
 2015-07          NaN
 2015-08          NaN
 2015-09          NaN
 2015-10          NaN
 2015-11          NaN
 2015-12          NaN
 Freq: M, Name: mean_global, dtype: float64, 1992-12    18.0
 1993-01    34.0
 1993-02    37.0
 1993-03    33.0
 1993-04    22.0
 1993-05    23.0
 1993-06    19.0
 1993-07    24.0
            ... 
 2015-05     NaN
 2015-06     NaN
 2015-07     NaN
 2015-08     NaN
 2015-09     NaN
 2015-10     NaN
 2015-11     NaN
 2015-12     NaN
 Freq: M, Name: Temp anomaly, dtype: float32)

In [168]:
aligned_sl, aligned_temp = monthly_mean_sea_level["mean_global"].align(giss_temp_series, join='inner')
aligned_df = pd.DataFrame({"mean_sea_level": aligned_sl, "mean_global_temp": aligned_temp})

The alignment can even be done on an entire dataframe:


In [169]:
monthly_mean_sea_level.align(giss_temp_series, axis=0, join='inner')


Out[169]:
(         northern_hem  southern_hem  mean_global
 1992-12      0.556667     -2.465000    -6.671500
 1993-01    -28.830667     -0.521000   -10.767333
 1993-02    -30.596333      7.068000    -5.251333
 1993-03    -32.431667      6.967000    -5.136333
 1993-04    -23.668000      1.549667    -1.118000
 1993-05    -21.731333     -2.863000    -6.319667
 1993-06     -4.644333     -4.418000    -0.902333
 1993-07      0.172667    -12.684667    -4.508000
 ...               ...           ...          ...
 2015-05     43.444000     71.664750          NaN
 2015-06     59.716000     71.828667          NaN
 2015-07     72.460667     60.819000          NaN
 2015-08     89.157000     62.902000          NaN
 2015-09     93.290333     61.620333          NaN
 2015-10     96.100667     70.860667          NaN
 2015-11     79.575667     71.419000          NaN
 2015-12     68.749000     77.260667          NaN
 
 [277 rows x 3 columns], 1992-12    18.0
 1993-01    34.0
 1993-02    37.0
 1993-03    33.0
 1993-04    22.0
 1993-05    23.0
 1993-06    19.0
 1993-07    24.0
            ... 
 2015-05     NaN
 2015-06     NaN
 2015-07     NaN
 2015-08     NaN
 2015-09     NaN
 2015-10     NaN
 2015-11     NaN
 2015-12     NaN
 Freq: M, Name: Temp anomaly, dtype: float32)

In [170]:
aligned_sea_levels, aligned_temp = monthly_mean_sea_level.align(giss_temp_series, axis=0, join='inner')
aligned_monthly_data = aligned_sea_levels.copy()
aligned_monthly_data["global_temp"] = aligned_temp
aligned_monthly_data


Out[170]:
northern_hem southern_hem mean_global global_temp
1992-12 0.556667 -2.465000 -6.671500 18.0
1993-01 -28.830667 -0.521000 -10.767333 34.0
1993-02 -30.596333 7.068000 -5.251333 37.0
1993-03 -32.431667 6.967000 -5.136333 33.0
1993-04 -23.668000 1.549667 -1.118000 22.0
1993-05 -21.731333 -2.863000 -6.319667 23.0
1993-06 -4.644333 -4.418000 -0.902333 19.0
1993-07 0.172667 -12.684667 -4.508000 24.0
... ... ... ... ...
2015-05 43.444000 71.664750 NaN NaN
2015-06 59.716000 71.828667 NaN NaN
2015-07 72.460667 60.819000 NaN NaN
2015-08 89.157000 62.902000 NaN NaN
2015-09 93.290333 61.620333 NaN NaN
2015-10 96.100667 70.860667 NaN NaN
2015-11 79.575667 71.419000 NaN NaN
2015-12 68.749000 77.260667 NaN NaN

277 rows × 4 columns

Correlations between sea levels and temperatures


In [171]:
aligned_monthly_data.plot(figsize=LARGE_FIGSIZE)


Out[171]:
<matplotlib.axes._subplots.AxesSubplot at 0x11506fb90>

In [172]:
aligned_monthly_data.corr()


Out[172]:
northern_hem southern_hem mean_global global_temp
northern_hem 1.000000 0.349101 0.658666 0.378089
southern_hem 0.349101 1.000000 0.901072 0.597901
mean_global 0.658666 0.901072 1.000000 0.620717
global_temp 0.378089 0.597901 0.620717 1.000000

In [173]:
model = sm.ols("southern_hem ~ global_temp", data=aligned_monthly_data)
params = model.fit()
params.rsquared


Out[173]:
0.35748527157977106

What if we had done the analysis yearly instead of monthly to remove seasonal variations?


In [174]:
aligned_yearly_data = aligned_monthly_data.resample("A").mean()
aligned_yearly_data.plot()


Out[174]:
<matplotlib.axes._subplots.AxesSubplot at 0x1142f03d0>

In [175]:
aligned_yearly_data.corr()


Out[175]:
northern_hem southern_hem mean_global global_temp
northern_hem 1.000000 0.989688 0.992187 0.821351
southern_hem 0.989688 1.000000 0.994649 0.820091
mean_global 0.992187 0.994649 1.000000 0.832198
global_temp 0.821351 0.820091 0.832198 1.000000

In [176]:
model = sm.ols("southern_hem ~ global_temp", data=aligned_yearly_data).fit()
model.rsquared


Out[176]:
0.6725492407603153

11. Predictions from auto regression models

An auto-regresssive model fits existing data and build a (potentially predictive) model of the data fitted. We use the timeseries analysis (tsa) submodule of statsmodels to make out-of-sample predictions for the upcoming decades:


In [177]:
from statsmodels.tsa.api import AR

In [178]:
import statsmodels as sm
# Let's remove seasonal variations by resampling annually
data = giss_temp_series.resample("A").mean().to_timestamp()
ar_model = AR(data, freq='A')
ar_res = ar_model.fit(maxlag=60, disp=True)

In [179]:
plt.figure(figsize=LARGE_FIGSIZE)
pred = ar_res.predict(start='1950-1-1', end='2070')
data.plot(style='k', label="Historical Data")
pred.plot(style='r', label="Predicted Data")
plt.ylabel("Temperature variation (0.01 degC)")
plt.legend()


/Users/jrocher/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/statsmodels/base/data.py:503: FutureWarning: TimeSeries is deprecated. Please use Series
  return TimeSeries(result, index=self.predict_dates)
Out[179]:
<matplotlib.legend.Legend at 0x11b500a90>

EXERCISE: Make another auto-regression on the sea level of the Atlantic ocean to estimate how much New York is going to flood in the coming century.

  1. You can find the historical sea levels of the Atlantic ocean at http://sealevel.colorado.edu/files/current/sl_Atlantic_Ocean.txt or locally in data/sea_levels/sl_Atlantic_Ocean.txt.
  2. A little more work but more precise: extract the ID of a station in NewYork from the local_sea_level_stations dataset, and use it to download timeseries in NY (URL would be http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata).

In [180]:
# Your code here