Last updated: June 29th 2016

Climate data exploration: a journey through Pandas

Welcome to a demo of Python's data analysis package called Pandas. Our goal is to learn about Data Analysis and transformation using Pandas while exploring datasets used to analyze climate change.

The story

The global goal of this demo is to provide the tools to be able to try and reproduce some of the analysis done in the IPCC global climate reports published in the last decade (see for example https://www.ipcc.ch/pdf/assessment-report/ar5/syr/SYR_AR5_FINAL_full.pdf).

We are first going to load a few public datasets containing information about global temperature, global and local sea level infomation, and global concentration of greenhouse gases like CO2, to see if there are correlations and how the trends are to evolve, assuming no fundamental change in the system. For all these datasets, we will download them, visualize them, clean them, search through them, merge them, resample them, transform them and summarize them.

In the process, we will learn about:

1. Loading data
2. Pandas datastructures
3. Cleaning and formatting data
4. Basic visualization
5. Accessing data
6. Working with dates and times
7. Transforming datasets
8. Statistical analysis
9. Data agregation and summarization
10. Correlations and regressions
11. Predictions from auto regression models

Some initial setup



In [1]:

    
%matplotlib inline
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_rows", 16)

LARGE_FIGSIZE = (12, 8)



In [2]:

    
# Change this cell to the demo location on YOUR machine
%cd ~/Projects/pandas_tutorial/climate_timeseries/
%ls









    



/Users/jrocher/Projects/pandas_tutorial/climate_timeseries
climate_timeseries.ipynb  data/

1. Loading data

More details, see http://pandas.pydata.org/pandas-docs/stable/io.html

To find all reading functions in pandas, ask ipython's tab completion:



In [3]:

    
#pd.read_<TAB>



In [4]:

    
pd.read_table?

From a local text file

Let's first load some temperature data which covers all lattitudes. Since read_table is supposed to do the job for a text file, let's just try it:



In [5]:

    
filename = "data/temperatures/annual.land_ocean.90S.90N.df_1901-2000mean.dat"
full_globe_temp = pd.read_table(filename)
full_globe_temp









    Out[5]:






  
    
      
      1880   -0.1591
    
  
  
    
      0
      1881   -0.0789
    
    
      1
      1882   -0.1313
    
    
      2
      1883   -0.1675
    
    
      3
      1884   -0.2485
    
    
      4
      1885   -0.2042
    
    
      5
      1886   -0.1695
    
    
      6
      1887   -0.2505
    
    
      7
      1888   -0.1605
    
    
      ...
      ...
    
    
      122
      2003    0.5818
    
    
      123
      2004    0.5416
    
    
      124
      2005    0.6154
    
    
      125
      2006    0.5601
    
    
      126
      2007    0.5472
    
    
      127
      2008    0.4804
    
    
      128
      2009    0.5551
    
    
      129
      2010 -999.0000
    
  

130 rows × 1 columns

There is only 1 column! Let's try again stating that values are separated by any number of spaces:



In [6]:

    
full_globe_temp = pd.read_table(filename, sep="\s+")
full_globe_temp









    Out[6]:






  
    
      
      1880
      -0.1591
    
  
  
    
      0
      1881
      -0.0789
    
    
      1
      1882
      -0.1313
    
    
      2
      1883
      -0.1675
    
    
      3
      1884
      -0.2485
    
    
      4
      1885
      -0.2042
    
    
      5
      1886
      -0.1695
    
    
      6
      1887
      -0.2505
    
    
      7
      1888
      -0.1605
    
    
      ...
      ...
      ...
    
    
      122
      2003
      0.5818
    
    
      123
      2004
      0.5416
    
    
      124
      2005
      0.6154
    
    
      125
      2006
      0.5601
    
    
      126
      2007
      0.5472
    
    
      127
      2008
      0.4804
    
    
      128
      2009
      0.5551
    
    
      129
      2010
      -999.0000
    
  

130 rows × 2 columns

There are columns but the column names are 1880 and -0.1591!



In [7]:

    
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"])
full_globe_temp









    Out[7]:






  
    
      
      year
      mean temp
    
  
  
    
      0
      1880
      -0.1591
    
    
      1
      1881
      -0.0789
    
    
      2
      1882
      -0.1313
    
    
      3
      1883
      -0.1675
    
    
      4
      1884
      -0.2485
    
    
      5
      1885
      -0.2042
    
    
      6
      1886
      -0.1695
    
    
      7
      1887
      -0.2505
    
    
      ...
      ...
      ...
    
    
      123
      2003
      0.5818
    
    
      124
      2004
      0.5416
    
    
      125
      2005
      0.6154
    
    
      126
      2006
      0.5601
    
    
      127
      2007
      0.5472
    
    
      128
      2008
      0.4804
    
    
      129
      2009
      0.5551
    
    
      130
      2010
      -999.0000
    
  

131 rows × 2 columns

Since we only have 2 columns, one of which would be nicer to access the data (the year of the record), let's try using the index_col option:



In [8]:

    
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"], 
                                index_col=0)
full_globe_temp









    Out[8]:






  
    
      
      mean temp
    
    
      year
      
    
  
  
    
      1880
      -0.1591
    
    
      1881
      -0.0789
    
    
      1882
      -0.1313
    
    
      1883
      -0.1675
    
    
      1884
      -0.2485
    
    
      1885
      -0.2042
    
    
      1886
      -0.1695
    
    
      1887
      -0.2505
    
    
      ...
      ...
    
    
      2003
      0.5818
    
    
      2004
      0.5416
    
    
      2005
      0.6154
    
    
      2006
      0.5601
    
    
      2007
      0.5472
    
    
      2008
      0.4804
    
    
      2009
      0.5551
    
    
      2010
      -999.0000
    
  

131 rows × 1 columns

Last step: the index is made of dates. Let's make that explicit:



In [9]:

    
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"], 
                                index_col=0, parse_dates=True)
full_globe_temp









    Out[9]:






  
    
      
      mean temp
    
    
      year
      
    
  
  
    
      1880-01-01 00:00:00
      -0.1591
    
    
      1881-01-01 00:00:00
      -0.0789
    
    
      1882-01-01 00:00:00
      -0.1313
    
    
      1883-01-01 00:00:00
      -0.1675
    
    
      1884-01-01 00:00:00
      -0.2485
    
    
      1885-01-01 00:00:00
      -0.2042
    
    
      1886-01-01 00:00:00
      -0.1695
    
    
      1887-01-01 00:00:00
      -0.2505
    
    
      ...
      ...
    
    
      2003-01-01
      0.5818
    
    
      2004-01-01
      0.5416
    
    
      2005-01-01
      0.6154
    
    
      2006-01-01
      0.5601
    
    
      2007-01-01
      0.5472
    
    
      2008-01-01
      0.4804
    
    
      2009-01-01
      0.5551
    
    
      2010-01-01
      -999.0000
    
  

131 rows × 1 columns

From a chunked file

Since every dataset can contain mistakes, let's load a different file with temperature data. NASA's GISS dataset is written in chunks: look at it in data/temperatures/GLB.Ts+dSST.txt



In [10]:

    
giss_temp = pd.read_table("data/temperatures/GLB.Ts+dSST.txt", sep="\s+", skiprows=7,
                          skip_footer=11, engine="python")
giss_temp









    Out[10]:






  
    
      
      Year
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
      J-D
      D-N
      DJF
      MAM
      JJA
      SON
      Year.1
    
  
  
    
      0
      1880
      -34
      -27
      -22
      -30
      -16
      -24
      -19
      -12
      -20
      -19
      -16
      -21
      -22
      ***
      ****
      -23
      -18
      -18
      1880
    
    
      1
      1881
      -13
      -16
      -2
      -3
      -3
      -27
      -12
      -8
      -18
      -23
      -28
      -18
      -14
      -14
      -17
      -3
      -15
      -23
      1881
    
    
      2
      1882
      3
      4
      -2
      -24
      -20
      -32
      -27
      -11
      -11
      -25
      -25
      -37
      -17
      -16
      -4
      -15
      -23
      -20
      1882
    
    
      3
      1883
      -38
      -38
      -12
      -20
      -20
      -8
      -3
      -13
      -19
      -19
      -28
      -21
      -20
      -21
      -38
      -18
      -8
      -22
      1883
    
    
      4
      1884
      -20
      -14
      -31
      -36
      -33
      -36
      -31
      -24
      -29
      -25
      -29
      -25
      -28
      -28
      -18
      -33
      -31
      -28
      1884
    
    
      5
      1885
      -57
      -29
      -19
      -36
      -35
      -40
      -28
      -24
      -17
      -14
      -14
      0
      -26
      -28
      -37
      -30
      -31
      -15
      1885
    
    
      6
      1886
      -37
      -43
      -34
      -23
      -21
      -30
      -13
      -21
      -12
      -22
      -29
      -18
      -25
      -24
      -26
      -26
      -21
      -21
      1886
    
    
      7
      1887
      -60
      -43
      -26
      -34
      -28
      -25
      -19
      -28
      -24
      -33
      -29
      -40
      -32
      -30
      -40
      -29
      -24
      -28
      1887
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      135
      2009
      56
      48
      49
      57
      59
      62
      66
      61
      64
      58
      72
      58
      59
      59
      52
      55
      63
      65
      2009
    
    
      136
      2010
      66
      75
      87
      82
      71
      60
      56
      59
      55
      65
      74
      44
      66
      67
      66
      80
      58
      65
      2010
    
    
      137
      2011
      45
      44
      57
      60
      47
      54
      70
      69
      52
      60
      50
      48
      55
      55
      45
      55
      64
      54
      2011
    
    
      138
      2012
      38
      43
      52
      62
      71
      59
      50
      56
      68
      73
      69
      46
      57
      57
      43
      62
      55
      70
      2012
    
    
      139
      2013
      62
      52
      60
      48
      56
      61
      53
      61
      73
      61
      75
      61
      60
      59
      53
      55
      58
      70
      2013
    
    
      140
      2014
      68
      44
      71
      72
      79
      62
      50
      74
      81
      78
      64
      74
      68
      67
      58
      74
      62
      74
      2014
    
    
      141
      2015
      75
      80
      84
      71
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ***
      76
      ****
      ****
      ****
      2015
    
    
      142
      Year
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
      J-D
      D-N
      DJF
      MAM
      JJA
      SON
      Year
    
  

143 rows × 20 columns

QUIZ: What happens if you remove the skiprows? skipfooter? engine?

EXERCISE: Load some readings of CO2 concentrations in the atmosphere from the data/greenhouse_gaz/co2_mm_global.txt data file.



In [11]:

    
# Your code here

From a remote text file

So far, we have only loaded temperature datasets. Climate change also affects the sea levels on the globe. Let's load some datasets with the sea levels. The university of colorado posts updated timeseries for mean sea level globably, per hemisphere, or even per ocean, sea, ... Let's download the global one, and the ones for the northern and southern hemisphere.

That will also illustrate that to load text files that are online, there is no more work than replacing the filepath by a URL n read_table:



In [12]:

    
# Local backup: data/sea_levels/sl_nh.txt
northern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_nh.txt", 
                                   sep="\s+")
northern_sea_level









    Out[12]:






  
    
      
      year
      msl_ib(mm)
      #version_2016_rel3
    
  
  
    
      0
      1992.9323
      13.717
      NaN
    
    
      1
      1992.9595
      1.328
      NaN
    
    
      2
      1992.9866
      -13.375
      NaN
    
    
      3
      1993.0138
      -24.723
      NaN
    
    
      4
      1993.0409
      -29.231
      NaN
    
    
      5
      1993.0681
      -32.538
      NaN
    
    
      6
      1993.0952
      -35.588
      NaN
    
    
      7
      1993.1223
      -26.579
      NaN
    
    
      ...
      ...
      ...
      ...
    
    
      853
      2016.1440
      49.218
      NaN
    
    
      854
      2016.1711
      44.274
      NaN
    
    
      855
      2016.1983
      33.961
      NaN
    
    
      856
      2016.2254
      42.199
      NaN
    
    
      857
      2016.2526
      45.004
      NaN
    
    
      858
      2016.2797
      49.809
      NaN
    
    
      859
      2016.3068
      55.401
      NaN
    
    
      860
      2016.3340
      50.416
      NaN
    
  

861 rows × 3 columns



In [13]:

    
# Local backup: data/sea_levels/sl_sh.txt
southern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_sh.txt", 
                                   sep="\s+")
southern_sea_level









    Out[13]:






  
    
      
      year
      msl_ib(mm)
      #version_2016_rel3
    
  
  
    
      0
      1992.9323
      2.517
      NaN
    
    
      1
      1992.9595
      -7.645
      NaN
    
    
      2
      1992.9866
      -2.267
      NaN
    
    
      3
      1993.0138
      0.855
      NaN
    
    
      4
      1993.0409
      -2.020
      NaN
    
    
      5
      1993.0681
      -0.398
      NaN
    
    
      6
      1993.0952
      4.102
      NaN
    
    
      7
      1993.1223
      6.787
      NaN
    
    
      ...
      ...
      ...
      ...
    
    
      853
      2016.1440
      84.902
      NaN
    
    
      854
      2016.1711
      83.829
      NaN
    
    
      855
      2016.1983
      86.020
      NaN
    
    
      856
      2016.2254
      88.122
      NaN
    
    
      857
      2016.2526
      88.090
      NaN
    
    
      858
      2016.2797
      87.806
      NaN
    
    
      859
      2016.3068
      77.270
      NaN
    
    
      860
      2016.3340
      70.916
      NaN
    
  

861 rows × 3 columns



In [14]:

    
# The 2015 version of the global dataset:
# Local backup: data/sea_levels/sl_ns_global.txt
url = "http://sealevel.colorado.edu/files/2015_rel2/sl_ns_global.txt"
global_sea_level = pd.read_table(url, sep="\s+")
global_sea_level









    Out[14]:






  
    
      
      year
      msl_ib_ns(mm)
      #version_2015_rel2
    
  
  
    
      0
      1992.9595
      -5.818
      NaN
    
    
      1
      1992.9866
      -7.525
      NaN
    
    
      2
      1993.0138
      -9.215
      NaN
    
    
      3
      1993.0409
      -11.796
      NaN
    
    
      4
      1993.0681
      -11.291
      NaN
    
    
      5
      1993.0952
      -9.569
      NaN
    
    
      6
      1993.1223
      -3.714
      NaN
    
    
      7
      1993.1495
      -2.471
      NaN
    
    
      ...
      ...
      ...
      ...
    
    
      772
      2014.9494
      70.723
      NaN
    
    
      773
      2014.9766
      70.522
      NaN
    
    
      774
      2015.0037
      66.662
      NaN
    
    
      775
      2015.0309
      64.804
      NaN
    
    
      776
      2015.0580
      62.115
      NaN
    
    
      777
      2015.0852
      68.589
      NaN
    
    
      778
      2015.1123
      69.745
      NaN
    
    
      779
      2015.1395
      75.007
      NaN
    
  

780 rows × 3 columns

There are clearly lots of cleanup to be done on these datasets. See below...

From a local or remote HTML file

To be able to grab more local data about mean sea levels, we can download and extract data about mean sea level stations around the world from the PSMSL (http://www.psmsl.org/). Again to download and parse all tables in a webpage, just give read_html the URL to parse:



In [15]:

    
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
# Local backup in data/sea_levels/Obtaining Tide Gauge Data.html
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")



In [16]:

    
# there is 1 table on that page which contains metadata about the stations where 
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stations









    Out[16]:






  
    
      
      Station Name
      ID
      Lat.
      Lon.
      GLOSS ID
      Country
      Date
      Coastline
      Station
    
  
  
    
      0
      HELIGMAN
      231
      60.200
      19.300
      NaN
      ALA
      01/01/1980
      60
      251
    
    
      1
      KOBBAKLINTAR
      63
      60.033
      19.883
      NaN
      ALA
      01/01/1980
      60
      261
    
    
      2
      LEMSTROM
      84
      60.100
      20.017
      NaN
      ALA
      01/01/1980
      60
      271
    
    
      3
      FOGLO / DEGERBY
      249
      60.032
      20.385
      NaN
      ALA
      06/10/2015
      60
      281
    
    
      4
      PAGO PAGO
      539
      -14.280
      -170.690
      144.0
      ASM
      08/03/2016
      745
      1
    
    
      5
      BAHIA ESPERANZA
      988
      -63.300
      -56.917
      185.0
      ATA
      11/05/1999
      999
      1
    
    
      6
      ARGENTINE ISLANDS
      913
      -65.246
      -64.257
      188.0
      ATA
      21/05/2015
      999
      3
    
    
      7
      PUERTO SOBERANIA
      1603
      -62.483
      -59.633
      189.0
      ATA
      30/09/2004
      999
      5
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      1458
      DANANG
      1475
      16.100
      108.217
      NaN
      VNM
      17/11/2015
      605
      51
    
    
      1459
      HONNGU
      1003
      18.800
      105.767
      NaN
      VNM
      17/11/2015
      605
      61
    
    
      1460
      HONDAU
      841
      20.667
      106.800
      NaN
      VNM
      17/11/2015
      605
      81
    
    
      1461
      CHARLOTTE AMALIE
      1393
      18.335
      -64.920
      NaN
      VIR
      08/03/2016
      939
      1
    
    
      1462
      LIME TREE BAY, ST CROIX
      1447
      17.693
      -64.753
      NaN
      VIR
      08/03/2016
      939
      11
    
    
      1463
      CHRISTIANSTED HARBOUR
      2118
      17.750
      -64.705
      NaN
      VIR
      09/03/2016
      939
      15
    
    
      1464
      LAMESHUR BAY
      2119
      18.317
      -64.723
      NaN
      VIR
      09/03/2016
      939
      21
    
    
      1465
      ADEN
      44
      12.788
      44.974
      3.0
      YEM
      01/02/2016
      485
      1
    
  

1466 rows × 9 columns

That table can be used to search for a station in a region of the world we choose, extract an ID for it and download the corresponding time series with the URL http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata

2. Pandas DataStructures

For more details, see http://pandas.pydata.org/pandas-docs/stable/dsintro.html

Now that we have used read_** functions to load datasets, we need to understand better what kind of objects we got from them to learn to work with them.

DataFrame, the pandas 2D structure



In [17]:

    
# Type of the object?
type(giss_temp)









    Out[17]:





pandas.core.frame.DataFrame



In [18]:

    
# Internal nature of the object
print(giss_temp.shape)
print(giss_temp.dtypes)









    



(143, 20)
Year      object
Jan       object
Feb       object
Mar       object
Apr       object
May       object
Jun       object
Jul       object
           ...  
Dec       object
J-D       object
D-N       object
DJF       object
MAM       object
JJA       object
SON       object
Year.1    object
dtype: object

Descriptors for the vertical axis (axis=0)



In [19]:

    
giss_temp.index









    Out[19]:





RangeIndex(start=0, stop=143, step=1)

Descriptors for the horizontal axis (axis=1)



In [20]:

    
giss_temp.columns









    Out[20]:





Index([u'Year', u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', u'Aug',
       u'Sep', u'Oct', u'Nov', u'Dec', u'J-D', u'D-N', u'DJF', u'MAM', u'JJA',
       u'SON', u'Year.1'],
      dtype='object')

A lot of information at once including memory usage:



In [21]:

    
giss_temp.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 20 columns):
Year      143 non-null object
Jan       143 non-null object
Feb       143 non-null object
Mar       143 non-null object
Apr       143 non-null object
May       143 non-null object
Jun       143 non-null object
Jul       143 non-null object
Aug       143 non-null object
Sep       143 non-null object
Oct       143 non-null object
Nov       143 non-null object
Dec       143 non-null object
J-D       143 non-null object
D-N       143 non-null object
DJF       143 non-null object
MAM       143 non-null object
JJA       143 non-null object
SON       143 non-null object
Year.1    143 non-null object
dtypes: object(20)
memory usage: 22.4+ KB

Series, the pandas 1D structure

A series can be constructed with the pd.Series constructor (passing a list or array of values) or from a DataFrame, by extracting one of its columns.



In [22]:

    
# Do we already have a series for the full_globe_temp?
type(full_globe_temp)









    Out[22]:





pandas.core.frame.DataFrame



In [23]:

    
# Since there is only one column of values, we can make this a Series without 
# loosing information:
full_globe_temp = full_globe_temp["mean temp"]

Core attributes/information:



In [24]:

    
print(type(full_globe_temp))
print(full_globe_temp.dtype)
print(full_globe_temp.shape)
print(full_globe_temp.nbytes)









    



<class 'pandas.core.series.Series'>
float64
(131,)
1048

Probably the most important attribute of a Series or DataFrame is its index since we will use that to, well, index into the structures to access te information:



In [25]:

    
full_globe_temp.index









    Out[25]:





DatetimeIndex(['1880-01-01', '1881-01-01', '1882-01-01', '1883-01-01',
               '1884-01-01', '1885-01-01', '1886-01-01', '1887-01-01',
               '1888-01-01', '1889-01-01',
               ...
               '2001-01-01', '2002-01-01', '2003-01-01', '2004-01-01',
               '2005-01-01', '2006-01-01', '2007-01-01', '2008-01-01',
               '2009-01-01', '2010-01-01'],
              dtype='datetime64[ns]', name=u'year', length=131, freq=None)

NumPy arrays as backend of Pandas

It is always possible to fall back to a good old NumPy array to pass on to scientific libraries that need them: SciPy, scikit-learn, ...



In [26]:

    
full_globe_temp.values









    Out[26]:





array([ -1.59100000e-01,  -7.89000000e-02,  -1.31300000e-01,
        -1.67500000e-01,  -2.48500000e-01,  -2.04200000e-01,
        -1.69500000e-01,  -2.50500000e-01,  -1.60500000e-01,
        -1.10300000e-01,  -3.01400000e-01,  -2.52700000e-01,
        -2.92600000e-01,  -3.27400000e-01,  -2.88000000e-01,
        -2.14600000e-01,  -1.00900000e-01,  -1.29400000e-01,
        -2.51700000e-01,  -1.35600000e-01,  -1.12500000e-01,
        -1.69300000e-01,  -2.25500000e-01,  -3.44200000e-01,
        -3.79900000e-01,  -2.58300000e-01,  -2.06000000e-01,
        -3.72600000e-01,  -3.92200000e-01,  -4.04100000e-01,
        -3.86000000e-01,  -4.14400000e-01,  -3.42200000e-01,
        -3.19600000e-01,  -1.59900000e-01,  -8.10000000e-02,
        -2.80300000e-01,  -3.22700000e-01,  -2.10600000e-01,
        -2.18600000e-01,  -2.02500000e-01,  -1.39000000e-01,
        -2.22200000e-01,  -2.06300000e-01,  -1.84300000e-01,
        -1.26000000e-01,  -1.92000000e-02,  -1.00700000e-01,
        -9.35000000e-02,  -2.20100000e-01,  -2.82000000e-02,
         4.10000000e-03,  -2.85000000e-02,  -1.61600000e-01,
        -2.85000000e-02,  -5.65000000e-02,  -3.18000000e-02,
         7.27000000e-02,   9.55000000e-02,   8.98000000e-02,
         1.23400000e-01,   1.83700000e-01,   1.37700000e-01,
         1.29900000e-01,   2.36200000e-01,   1.16200000e-01,
        -2.02000000e-02,  -4.06000000e-02,  -4.94000000e-02,
        -7.52000000e-02,  -1.63500000e-01,  -2.90000000e-03,
         3.53000000e-02,   1.19000000e-01,  -1.02800000e-01,
        -1.22000000e-01,  -1.83600000e-01,   5.70000000e-02,
         1.05400000e-01,   6.34000000e-02,   1.14000000e-02,
         8.94000000e-02,   1.13600000e-01,   1.33300000e-01,
        -1.26800000e-01,  -6.95000000e-02,  -1.17000000e-02,
         2.20000000e-03,  -1.21000000e-02,   9.11000000e-02,
         4.71000000e-02,  -5.50000000e-02,   3.48000000e-02,
         1.61700000e-01,  -8.62000000e-02,  -1.77000000e-02,
        -1.06200000e-01,   1.42500000e-01,   6.15000000e-02,
         1.53300000e-01,   1.93000000e-01,   2.28400000e-01,
         1.17100000e-01,   2.81700000e-01,   8.75000000e-02,
         6.81000000e-02,   1.60500000e-01,   2.96800000e-01,
         3.00600000e-01,   2.22300000e-01,   3.86100000e-01,
         3.36000000e-01,   2.02300000e-01,   2.30700000e-01,
         2.93400000e-01,   4.07300000e-01,   2.75300000e-01,
         4.78200000e-01,   5.97100000e-01,   4.19900000e-01,
         3.88600000e-01,   5.17300000e-01,   5.74500000e-01,
         5.81800000e-01,   5.41600000e-01,   6.15400000e-01,
         5.60100000e-01,   5.47200000e-01,   4.80400000e-01,
         5.55100000e-01,  -9.99000000e+02])



In [27]:

    
type(full_globe_temp.values)









    Out[27]:





numpy.ndarray

Creating new DataFrames manually

DataFrames can also be created manually, by grouping several Series together. Let's make a new frame from the 3 sea level datasets we downloaded above. They will be displayed along the same index. Wait, does that makes sense to do that?



In [28]:

    
# Are they aligned?
southern_sea_level.year == northern_sea_level.year









    Out[28]:





0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
       ... 
853    True
854    True
855    True
856    True
857    True
858    True
859    True
860    True
Name: year, dtype: bool



In [29]:

    
# So, are they aligned?
np.all(southern_sea_level.year == northern_sea_level.year)









    Out[29]:





True

So the northern hemisphere and southern hemisphere datasets are aligned. What about the global one?



In [30]:

    
len(global_sea_level.year) == len(northern_sea_level.year)









    Out[30]:





False

For now, let's just build a DataFrame with the 2 hemisphere datasets then. We will come back to add the global one later...



In [31]:

    
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"], 
                               "date": northern_sea_level.year})
mean_sea_level









    Out[31]:






  
    
      
      date
      northern_hem
      southern_hem
    
  
  
    
      0
      1992.9323
      13.717
      2.517
    
    
      1
      1992.9595
      1.328
      -7.645
    
    
      2
      1992.9866
      -13.375
      -2.267
    
    
      3
      1993.0138
      -24.723
      0.855
    
    
      4
      1993.0409
      -29.231
      -2.020
    
    
      5
      1993.0681
      -32.538
      -0.398
    
    
      6
      1993.0952
      -35.588
      4.102
    
    
      7
      1993.1223
      -26.579
      6.787
    
    
      ...
      ...
      ...
      ...
    
    
      853
      2016.1440
      49.218
      84.902
    
    
      854
      2016.1711
      44.274
      83.829
    
    
      855
      2016.1983
      33.961
      86.020
    
    
      856
      2016.2254
      42.199
      88.122
    
    
      857
      2016.2526
      45.004
      88.090
    
    
      858
      2016.2797
      49.809
      87.806
    
    
      859
      2016.3068
      55.401
      77.270
    
    
      860
      2016.3340
      50.416
      70.916
    
  

861 rows × 3 columns

Note: there are other ways to create DataFrames manually, for example from a 2D numpy array.

There is still the date in a regular column and a numerical index that is not that meaningful. We can specify the index of a DataFrame at creation. Let's try:



In [32]:

    
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"]},
                               index = northern_sea_level.year)
mean_sea_level









    Out[32]:






  
    
      
      northern_hem
      southern_hem
    
    
      year
      
      
    
  
  
    
      1992.9323
      NaN
      NaN
    
    
      1992.9595
      NaN
      NaN
    
    
      1992.9866
      NaN
      NaN
    
    
      1993.0138
      NaN
      NaN
    
    
      1993.0409
      NaN
      NaN
    
    
      1993.0681
      NaN
      NaN
    
    
      1993.0952
      NaN
      NaN
    
    
      1993.1223
      NaN
      NaN
    
    
      ...
      ...
      ...
    
    
      2016.1440
      NaN
      NaN
    
    
      2016.1711
      NaN
      NaN
    
    
      2016.1983
      NaN
      NaN
    
    
      2016.2254
      NaN
      NaN
    
    
      2016.2526
      NaN
      NaN
    
    
      2016.2797
      NaN
      NaN
    
    
      2016.3068
      NaN
      NaN
    
    
      2016.3340
      NaN
      NaN
    
  

861 rows × 2 columns

Now the fact that it is failing show that Pandas does auto-alignment of values: for each value of the index, it searches for a value in each Series that maps the same value. Since these series have a dumb numerical index, no values are found.

Since we know that the order of the values match the index we chose, we can replace the Series by their values only at creation of the DataFrame:



In [33]:

    
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"].values, 
                               "southern_hem": southern_sea_level["msl_ib(mm)"].values},
                               index = northern_sea_level.year)
mean_sea_level









    Out[33]:






  
    
      
      northern_hem
      southern_hem
    
    
      year
      
      
    
  
  
    
      1992.9323
      13.717
      2.517
    
    
      1992.9595
      1.328
      -7.645
    
    
      1992.9866
      -13.375
      -2.267
    
    
      1993.0138
      -24.723
      0.855
    
    
      1993.0409
      -29.231
      -2.020
    
    
      1993.0681
      -32.538
      -0.398
    
    
      1993.0952
      -35.588
      4.102
    
    
      1993.1223
      -26.579
      6.787
    
    
      ...
      ...
      ...
    
    
      2016.1440
      49.218
      84.902
    
    
      2016.1711
      44.274
      83.829
    
    
      2016.1983
      33.961
      86.020
    
    
      2016.2254
      42.199
      88.122
    
    
      2016.2526
      45.004
      88.090
    
    
      2016.2797
      49.809
      87.806
    
    
      2016.3068
      55.401
      77.270
    
    
      2016.3340
      50.416
      70.916
    
  

861 rows × 2 columns

3. Cleaning and formatting data

The datasets that we obtain straight from the reading functions are pretty raw. A lot of pre-processing can be done during data read but we haven't used all the power of the reading functions. Let's learn to do a lot of cleaning and formatting of the data.

The GISS temperature dataset has a lot of issues too: useless numerical index, redundant columns, useless rows, placeholder (****) for missing values, and wrong type for the columns. Let's fix all this:

Renaming columns



In [34]:

    
# The columns of the local_sea_level_stations aren't clean: they contain spaces and dots.
local_sea_level_stations.columns









    Out[34]:





Index([u'Station Name', u'ID', u'Lat.', u'Lon.', u'GLOSS ID  ', u'Country  ',
       u'Date', u'Coastline  ', u'Station  '],
      dtype='object')



In [35]:

    
# Let's clean them up a bit:
local_sea_level_stations.columns = [name.strip().replace(".", "") 
                                    for name in local_sea_level_stations.columns]
local_sea_level_stations.columns









    Out[35]:





Index([u'Station Name', u'ID', u'Lat', u'Lon', u'GLOSS ID', u'Country',
       u'Date', u'Coastline', u'Station'],
      dtype='object')

We can also rename an index by setting its name. For example, the index of the mean_sea_level dataFrame could be called date since it contains more than just the year:



In [36]:

    
mean_sea_level.index.name = "date"
mean_sea_level









    Out[36]:






  
    
      
      northern_hem
      southern_hem
    
    
      date
      
      
    
  
  
    
      1992.9323
      13.717
      2.517
    
    
      1992.9595
      1.328
      -7.645
    
    
      1992.9866
      -13.375
      -2.267
    
    
      1993.0138
      -24.723
      0.855
    
    
      1993.0409
      -29.231
      -2.020
    
    
      1993.0681
      -32.538
      -0.398
    
    
      1993.0952
      -35.588
      4.102
    
    
      1993.1223
      -26.579
      6.787
    
    
      ...
      ...
      ...
    
    
      2016.1440
      49.218
      84.902
    
    
      2016.1711
      44.274
      83.829
    
    
      2016.1983
      33.961
      86.020
    
    
      2016.2254
      42.199
      88.122
    
    
      2016.2526
      45.004
      88.090
    
    
      2016.2797
      49.809
      87.806
    
    
      2016.3068
      55.401
      77.270
    
    
      2016.3340
      50.416
      70.916
    
  

861 rows × 2 columns

Setting missing values

In the full globe dataset, -999.00 was used to indicate that there was no value for that year. Let's search for all these values and replace them with the missing value that Pandas understand: np.nan



In [37]:

    
full_globe_temp == -999.000









    Out[37]:





year
1880-01-01 00:00:00    False
1881-01-01 00:00:00    False
1882-01-01 00:00:00    False
1883-01-01 00:00:00    False
1884-01-01 00:00:00    False
1885-01-01 00:00:00    False
1886-01-01 00:00:00    False
1887-01-01 00:00:00    False
                       ...  
2003-01-01             False
2004-01-01             False
2005-01-01             False
2006-01-01             False
2007-01-01             False
2008-01-01             False
2009-01-01             False
2010-01-01              True
Name: mean temp, dtype: bool



In [38]:

    
full_globe_temp[full_globe_temp == -999.000] = np.nan
full_globe_temp.tail()









    Out[38]:





year
2006-01-01    0.5601
2007-01-01    0.5472
2008-01-01    0.4804
2009-01-01    0.5551
2010-01-01       NaN
Name: mean temp, dtype: float64

Choosing what is the index



In [39]:

    
# We didn't set a column number of the index of giss_temp, we can do that afterwards:
giss_temp = giss_temp.set_index("Year")
giss_temp.head()

Dropping rows and columns



In [40]:

    
# 1 column is redundant with the index: 
giss_temp.columns









    Out[40]:





Index([u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', u'Aug', u'Sep',
       u'Oct', u'Nov', u'Dec', u'J-D', u'D-N', u'DJF', u'MAM', u'JJA', u'SON',
       u'Year.1'],
      dtype='object')



In [41]:

    
# Let's drop it:
giss_temp = giss_temp.drop("Year.1", axis=1)
giss_temp









    Out[41]:






  
    
      
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
      J-D
      D-N
      DJF
      MAM
      JJA
      SON
    
    
      Year
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1880
      -34
      -27
      -22
      -30
      -16
      -24
      -19
      -12
      -20
      -19
      -16
      -21
      -22
      ***
      ****
      -23
      -18
      -18
    
    
      1881
      -13
      -16
      -2
      -3
      -3
      -27
      -12
      -8
      -18
      -23
      -28
      -18
      -14
      -14
      -17
      -3
      -15
      -23
    
    
      1882
      3
      4
      -2
      -24
      -20
      -32
      -27
      -11
      -11
      -25
      -25
      -37
      -17
      -16
      -4
      -15
      -23
      -20
    
    
      1883
      -38
      -38
      -12
      -20
      -20
      -8
      -3
      -13
      -19
      -19
      -28
      -21
      -20
      -21
      -38
      -18
      -8
      -22
    
    
      1884
      -20
      -14
      -31
      -36
      -33
      -36
      -31
      -24
      -29
      -25
      -29
      -25
      -28
      -28
      -18
      -33
      -31
      -28
    
    
      1885
      -57
      -29
      -19
      -36
      -35
      -40
      -28
      -24
      -17
      -14
      -14
      0
      -26
      -28
      -37
      -30
      -31
      -15
    
    
      1886
      -37
      -43
      -34
      -23
      -21
      -30
      -13
      -21
      -12
      -22
      -29
      -18
      -25
      -24
      -26
      -26
      -21
      -21
    
    
      1887
      -60
      -43
      -26
      -34
      -28
      -25
      -19
      -28
      -24
      -33
      -29
      -40
      -32
      -30
      -40
      -29
      -24
      -28
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      2009
      56
      48
      49
      57
      59
      62
      66
      61
      64
      58
      72
      58
      59
      59
      52
      55
      63
      65
    
    
      2010
      66
      75
      87
      82
      71
      60
      56
      59
      55
      65
      74
      44
      66
      67
      66
      80
      58
      65
    
    
      2011
      45
      44
      57
      60
      47
      54
      70
      69
      52
      60
      50
      48
      55
      55
      45
      55
      64
      54
    
    
      2012
      38
      43
      52
      62
      71
      59
      50
      56
      68
      73
      69
      46
      57
      57
      43
      62
      55
      70
    
    
      2013
      62
      52
      60
      48
      56
      61
      53
      61
      73
      61
      75
      61
      60
      59
      53
      55
      58
      70
    
    
      2014
      68
      44
      71
      72
      79
      62
      50
      74
      81
      78
      64
      74
      68
      67
      58
      74
      62
      74
    
    
      2015
      75
      80
      84
      71
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ***
      76
      ****
      ****
      ****
    
    
      Year
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
      J-D
      D-N
      DJF
      MAM
      JJA
      SON
    
  

143 rows × 18 columns



In [42]:

    
# We can also just select the columns we want to keep:
giss_temp = giss_temp[[u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', 
                       u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']]
giss_temp









    Out[42]:






  
    
      
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
    
    
      Year
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1880
      -34
      -27
      -22
      -30
      -16
      -24
      -19
      -12
      -20
      -19
      -16
      -21
    
    
      1881
      -13
      -16
      -2
      -3
      -3
      -27
      -12
      -8
      -18
      -23
      -28
      -18
    
    
      1882
      3
      4
      -2
      -24
      -20
      -32
      -27
      -11
      -11
      -25
      -25
      -37
    
    
      1883
      -38
      -38
      -12
      -20
      -20
      -8
      -3
      -13
      -19
      -19
      -28
      -21
    
    
      1884
      -20
      -14
      -31
      -36
      -33
      -36
      -31
      -24
      -29
      -25
      -29
      -25
    
    
      1885
      -57
      -29
      -19
      -36
      -35
      -40
      -28
      -24
      -17
      -14
      -14
      0
    
    
      1886
      -37
      -43
      -34
      -23
      -21
      -30
      -13
      -21
      -12
      -22
      -29
      -18
    
    
      1887
      -60
      -43
      -26
      -34
      -28
      -25
      -19
      -28
      -24
      -33
      -29
      -40
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      2009
      56
      48
      49
      57
      59
      62
      66
      61
      64
      58
      72
      58
    
    
      2010
      66
      75
      87
      82
      71
      60
      56
      59
      55
      65
      74
      44
    
    
      2011
      45
      44
      57
      60
      47
      54
      70
      69
      52
      60
      50
      48
    
    
      2012
      38
      43
      52
      62
      71
      59
      50
      56
      68
      73
      69
      46
    
    
      2013
      62
      52
      60
      48
      56
      61
      53
      61
      73
      61
      75
      61
    
    
      2014
      68
      44
      71
      72
      79
      62
      50
      74
      81
      78
      64
      74
    
    
      2015
      75
      80
      84
      71
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ****
    
    
      Year
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
    
  

143 rows × 12 columns



In [43]:

    
# Let's remove all these extra column names (Year  Jan ...). They all correspond to the index "Year"
giss_temp = giss_temp.drop("Year")
giss_temp









    Out[43]:






  
    
      
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
    
    
      Year
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1880
      -34
      -27
      -22
      -30
      -16
      -24
      -19
      -12
      -20
      -19
      -16
      -21
    
    
      1881
      -13
      -16
      -2
      -3
      -3
      -27
      -12
      -8
      -18
      -23
      -28
      -18
    
    
      1882
      3
      4
      -2
      -24
      -20
      -32
      -27
      -11
      -11
      -25
      -25
      -37
    
    
      1883
      -38
      -38
      -12
      -20
      -20
      -8
      -3
      -13
      -19
      -19
      -28
      -21
    
    
      1884
      -20
      -14
      -31
      -36
      -33
      -36
      -31
      -24
      -29
      -25
      -29
      -25
    
    
      1885
      -57
      -29
      -19
      -36
      -35
      -40
      -28
      -24
      -17
      -14
      -14
      0
    
    
      1886
      -37
      -43
      -34
      -23
      -21
      -30
      -13
      -21
      -12
      -22
      -29
      -18
    
    
      1887
      -60
      -43
      -26
      -34
      -28
      -25
      -19
      -28
      -24
      -33
      -29
      -40
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      2008
      22
      31
      70
      48
      46
      43
      54
      39
      59
      62
      62
      51
    
    
      2009
      56
      48
      49
      57
      59
      62
      66
      61
      64
      58
      72
      58
    
    
      2010
      66
      75
      87
      82
      71
      60
      56
      59
      55
      65
      74
      44
    
    
      2011
      45
      44
      57
      60
      47
      54
      70
      69
      52
      60
      50
      48
    
    
      2012
      38
      43
      52
      62
      71
      59
      50
      56
      68
      73
      69
      46
    
    
      2013
      62
      52
      60
      48
      56
      61
      53
      61
      73
      61
      75
      61
    
    
      2014
      68
      44
      71
      72
      79
      62
      50
      74
      81
      78
      64
      74
    
    
      2015
      75
      80
      84
      71
      ****
      ****
      ****
      ****
      ****
      ****
      ****
      ****
    
  

136 rows × 12 columns

Let's also set **** to a real missing value (np.nan). We can often do it using a boolean mask, but that may trigger pandas warning. Another way to assign based on a boolean condition is to use the where method:



In [44]:

    
#giss_temp[giss_temp == "****"] = np.nan
giss_temp = giss_temp.where(giss_temp != "****", np.nan)



In [45]:

    
giss_temp.tail()

Adding columns

While building the mean_sea_level dataFrame earlier, we didn't include the values from global_sea_level since the years were not aligned. Adding a column to a dataframe is as easy as adding an entry to a dictionary. So let's try:



In [46]:

    
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level









    Out[46]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
    
      date
      
      
      
    
  
  
    
      1992.9323
      13.717
      2.517
      NaN
    
    
      1992.9595
      1.328
      -7.645
      NaN
    
    
      1992.9866
      -13.375
      -2.267
      NaN
    
    
      1993.0138
      -24.723
      0.855
      NaN
    
    
      1993.0409
      -29.231
      -2.020
      NaN
    
    
      1993.0681
      -32.538
      -0.398
      NaN
    
    
      1993.0952
      -35.588
      4.102
      NaN
    
    
      1993.1223
      -26.579
      6.787
      NaN
    
    
      ...
      ...
      ...
      ...
    
    
      2016.1440
      49.218
      84.902
      NaN
    
    
      2016.1711
      44.274
      83.829
      NaN
    
    
      2016.1983
      33.961
      86.020
      NaN
    
    
      2016.2254
      42.199
      88.122
      NaN
    
    
      2016.2526
      45.004
      88.090
      NaN
    
    
      2016.2797
      49.809
      87.806
      NaN
    
    
      2016.3068
      55.401
      77.270
      NaN
    
    
      2016.3340
      50.416
      70.916
      NaN
    
  

861 rows × 3 columns

The column is full of NaNs again because the auto-alignment feature of Pandas is searching for the index values like 1992.9323 in the index of global_sea_level["msl_ib_ns(mm)"] series and not finding them. Let's set its index to these years so that that auto-alignment can work for us and figure out which values we have and not:



In [47]:

    
global_sea_level = global_sea_level.set_index("year")
global_sea_level["msl_ib_ns(mm)"]









    Out[47]:





year
1992.9595    -5.818
1992.9866    -7.525
1993.0138    -9.215
1993.0409   -11.796
1993.0681   -11.291
1993.0952    -9.569
1993.1223    -3.714
1993.1495    -2.471
              ...  
2014.9494    70.723
2014.9766    70.522
2015.0037    66.662
2015.0309    64.804
2015.0580    62.115
2015.0852    68.589
2015.1123    69.745
2015.1395    75.007
Name: msl_ib_ns(mm), dtype: float64



In [48]:

    
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level









    Out[48]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
    
      date
      
      
      
    
  
  
    
      1992.9323
      13.717
      2.517
      NaN
    
    
      1992.9595
      1.328
      -7.645
      -5.818
    
    
      1992.9866
      -13.375
      -2.267
      -7.525
    
    
      1993.0138
      -24.723
      0.855
      -9.215
    
    
      1993.0409
      -29.231
      -2.020
      -11.796
    
    
      1993.0681
      -32.538
      -0.398
      -11.291
    
    
      1993.0952
      -35.588
      4.102
      -9.569
    
    
      1993.1223
      -26.579
      6.787
      -3.714
    
    
      ...
      ...
      ...
      ...
    
    
      2016.1440
      49.218
      84.902
      NaN
    
    
      2016.1711
      44.274
      83.829
      NaN
    
    
      2016.1983
      33.961
      86.020
      NaN
    
    
      2016.2254
      42.199
      88.122
      NaN
    
    
      2016.2526
      45.004
      88.090
      NaN
    
    
      2016.2797
      49.809
      87.806
      NaN
    
    
      2016.3068
      55.401
      77.270
      NaN
    
    
      2016.3340
      50.416
      70.916
      NaN
    
  

861 rows × 3 columns

EXERCISE: Create a new series containing the average of the 2 hemispheres minus the global value to see if that is close to 0. Work inside the mean_sea_level dataframe first. Then try with the original Series to see what happens with data alignment while doing computations.



In [49]:

    
# Your code here

Changing dtype of series

Now that the sea levels are looking pretty good, let's got back to the GISS temperature dataset. Because of the labels (strings) found in the middle of the timeseries, every column only assumed to contain strings (didn't convert them to floating point values):



In [50]:

    
giss_temp.dtypes









    Out[50]:





Jan    object
Feb    object
Mar    object
Apr    object
May    object
Jun    object
Jul    object
Aug    object
Sep    object
Oct    object
Nov    object
Dec    object
dtype: object

That can be changed after the fact (and after the cleanup) with the astype method of a Series:



In [51]:

    
giss_temp["Jan"].astype("float32")









    Out[51]:





Year
1880   -34.0
1881   -13.0
1882     3.0
1883   -38.0
1884   -20.0
1885   -57.0
1886   -37.0
1887   -60.0
        ... 
2008    22.0
2009    56.0
2010    66.0
2011    45.0
2012    38.0
2013    62.0
2014    68.0
2015    75.0
Name: Jan, dtype: float32



In [52]:

    
for col in giss_temp.columns:
    giss_temp.loc[:, col] = giss_temp[col].astype(np.float32)

An index has a dtype just like any Series and that can be changed after the fact too.



In [53]:

    
giss_temp.index.dtype









    Out[53]:





dtype('O')

For now, let's change it to an integer so that values can at least be compared properly. We will learn below to change it to a datetime object.



In [54]:

    
giss_temp.index = giss_temp.index.astype(np.int32)

Removing missing values

Removing missing values - once they have been converted to np.nan - is very easy. Entries that contain missing values can be removed (dropped), or filled with many strategies.



In [55]:

    
full_globe_temp









    Out[55]:





year
1880-01-01 00:00:00   -0.1591
1881-01-01 00:00:00   -0.0789
1882-01-01 00:00:00   -0.1313
1883-01-01 00:00:00   -0.1675
1884-01-01 00:00:00   -0.2485
1885-01-01 00:00:00   -0.2042
1886-01-01 00:00:00   -0.1695
1887-01-01 00:00:00   -0.2505
                        ...  
2003-01-01             0.5818
2004-01-01             0.5416
2005-01-01             0.6154
2006-01-01             0.5601
2007-01-01             0.5472
2008-01-01             0.4804
2009-01-01             0.5551
2010-01-01                NaN
Name: mean temp, dtype: float64



In [56]:

    
full_globe_temp.dropna()









    Out[56]:





year
1880-01-01 00:00:00   -0.1591
1881-01-01 00:00:00   -0.0789
1882-01-01 00:00:00   -0.1313
1883-01-01 00:00:00   -0.1675
1884-01-01 00:00:00   -0.2485
1885-01-01 00:00:00   -0.2042
1886-01-01 00:00:00   -0.1695
1887-01-01 00:00:00   -0.2505
                        ...  
2002-01-01             0.5745
2003-01-01             0.5818
2004-01-01             0.5416
2005-01-01             0.6154
2006-01-01             0.5601
2007-01-01             0.5472
2008-01-01             0.4804
2009-01-01             0.5551
Name: mean temp, dtype: float64



In [57]:

    
# This will remove any year that has a missing value. Use how='all' to keep partial years
giss_temp.dropna(how="any").tail()



In [58]:

    
giss_temp.fillna(value=0).tail()



In [59]:

    
# This fills them with the previous year. See also temp3.interpolate
giss_temp.fillna(method="ffill").tail()

Let's also mention the .interpolate method on a Series:



In [60]:

    
giss_temp.Aug.interpolate().tail()









    Out[60]:





Year
2011    69.0
2012    56.0
2013    61.0
2014    74.0
2015    74.0
Name: Aug, dtype: float32

For now, we will leave the missing values in all our datasets, because it wouldn't be meaningful to fill them.

EXERCISE: Go back to the reading functions, and learn more about other options that could have allowed us to fold some of these pre-processing steps into the data loading.

4. Basic visualization

Now they have been formatted, visualizing your datasets is the next logical step and is trivial with Pandas. The first thing to try is to invoke the .plot to generate a basic visualization (uses matplotlib under the covers).

Line plots



In [61]:

    
full_globe_temp.plot()









    Out[61]:





<matplotlib.axes._subplots.AxesSubplot at 0x10fa8a110>



In [62]:

    
giss_temp.plot(figsize=LARGE_FIGSIZE)









    Out[62]:





<matplotlib.axes._subplots.AxesSubplot at 0x116f8ed10>



In [63]:

    
mean_sea_level.plot(subplots=True, figsize=(16, 12));

Showing distributions information



In [64]:

    
# Distributions of mean sean level globally and per hemisphere?
mean_sea_level.plot(kind="kde", figsize=(12, 8))









    Out[64]:





<matplotlib.axes._subplots.AxesSubplot at 0x1180f47d0>

QUIZ: How to list the possible kinds of plots that the plot method can allow?



In [65]:

    
# Distributions of temperature in each month since 1880
giss_temp.boxplot();









    



/Users/jrocher/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/ipykernel/__main__.py:2: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  from ipykernel import kernelapp as app

Correlations

There are more plot options inside pandas.tools.plotting:



In [66]:

    
# Is there correlations between the northern and southern sea level timeseries we loaded?
from pandas.tools.plotting import scatter_matrix
scatter_matrix(mean_sea_level, figsize=LARGE_FIGSIZE);

We will confirm the correlations we think we see further down...

EXERCISE: Refer to exercises/aapl_adj_close_plot/aapl_adj_close_plot.ipynb

5. Accessing data

The general philosophy for accessing values inside a Pandas datastructure is that, unlike a numpy array that only allows to index using integers a Series allows to index with the values inside the index. That makes the code more readable.

In a series



In [67]:

    
full_globe_temp









    Out[67]:





year
1880-01-01 00:00:00   -0.1591
1881-01-01 00:00:00   -0.0789
1882-01-01 00:00:00   -0.1313
1883-01-01 00:00:00   -0.1675
1884-01-01 00:00:00   -0.2485
1885-01-01 00:00:00   -0.2042
1886-01-01 00:00:00   -0.1695
1887-01-01 00:00:00   -0.2505
                        ...  
2003-01-01             0.5818
2004-01-01             0.5416
2005-01-01             0.6154
2006-01-01             0.5601
2007-01-01             0.5472
2008-01-01             0.4804
2009-01-01             0.5551
2010-01-01                NaN
Name: mean temp, dtype: float64



In [68]:

    
# By default [] on a series accesses values using the index, not the location in the series
# print(temp1[0])  # This would to fail!!



In [69]:

    
# This index is non-trivial though (will talk more about these datetime objects further down):
full_globe_temp.index.dtype









    Out[69]:





dtype('<M8[ns]')



In [70]:

    
first_date = full_globe_temp.index[0]
first_date == pd.Timestamp('1880')









    Out[70]:





True



In [71]:

    
# By default [] on a series accesses values using the index, not the location in the series
print(full_globe_temp[pd.Timestamp('1880')])
# print(temp1[0])  # This would fail!!



In [72]:

    
# Another more explicit way to do the same thing is to use loc
print(full_globe_temp.loc[pd.Timestamp('1990')])
print(full_globe_temp.iloc[0], full_globe_temp.iloc[-1])









    



0.3861
(-0.15909999999999999, nan)



In [73]:

    
# Year of the last record?
full_globe_temp.index[-1]









    Out[73]:





Timestamp('2010-01-01 00:00:00')



In [74]:

    
# New records can be added:
full_globe_temp[pd.Timestamp('2011')] = np.nan

In a dataframe



In [75]:

    
# In 2D, same idea, though in a DF [] accesses columns (Series)
giss_temp["Jan"]









    Out[75]:





Year
1880   -34.0
1881   -13.0
1882     3.0
1883   -38.0
1884   -20.0
1885   -57.0
1886   -37.0
1887   -60.0
        ... 
2008    22.0
2009    56.0
2010    66.0
2011    45.0
2012    38.0
2013    62.0
2014    68.0
2015    75.0
Name: Jan, dtype: float32



In [76]:

    
# while .loc and .iloc allow to access individual values, slices or masked selections:
print(giss_temp.loc[1979, "Dec"])



In [77]:

    
# Slicing can be done with .loc and .iloc
print(giss_temp.loc[1979, "Jan":"Jun"])  # Note that the end point is included unlike NumPy!!!
print(giss_temp.loc[1979, ::2])









    



Jan    10.0
Feb   -12.0
Mar    14.0
Apr     8.0
May     1.0
Jun     9.0
Name: 1979, dtype: float32
Jan    10.0
Mar    14.0
May     1.0
Jul    -3.0
Sep    20.0
Nov    24.0
Name: 1979, dtype: float32



In [78]:

    
# Masking can also be used in one or more dimensions. For example, another way to grab every other month for the first year:
mask = [True, False] * 6
print(giss_temp.iloc[0, mask])
print(giss_temp.loc[1880, mask])









    



Jan   -34.0
Mar   -22.0
May   -16.0
Jul   -19.0
Sep   -20.0
Nov   -16.0
Name: 1880, dtype: float32
Jan   -34.0
Mar   -22.0
May   -16.0
Jul   -19.0
Sep   -20.0
Nov   -16.0
Name: 1880, dtype: float32



In [79]:

    
# We could also add a new column like a new entry in a dictionary
giss_temp["totals"] = giss_temp.sum(axis=1)
giss_temp









    Out[79]:






  
    
      
      Jan
      Feb
      Mar
      Apr
      May
      Jun
      Jul
      Aug
      Sep
      Oct
      Nov
      Dec
      totals
    
    
      Year
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1880
      -34.0
      -27.0
      -22.0
      -30.0
      -16.0
      -24.0
      -19.0
      -12.0
      -20.0
      -19.0
      -16.0
      -21.0
      -260.0
    
    
      1881
      -13.0
      -16.0
      -2.0
      -3.0
      -3.0
      -27.0
      -12.0
      -8.0
      -18.0
      -23.0
      -28.0
      -18.0
      -171.0
    
    
      1882
      3.0
      4.0
      -2.0
      -24.0
      -20.0
      -32.0
      -27.0
      -11.0
      -11.0
      -25.0
      -25.0
      -37.0
      -207.0
    
    
      1883
      -38.0
      -38.0
      -12.0
      -20.0
      -20.0
      -8.0
      -3.0
      -13.0
      -19.0
      -19.0
      -28.0
      -21.0
      -239.0
    
    
      1884
      -20.0
      -14.0
      -31.0
      -36.0
      -33.0
      -36.0
      -31.0
      -24.0
      -29.0
      -25.0
      -29.0
      -25.0
      -333.0
    
    
      1885
      -57.0
      -29.0
      -19.0
      -36.0
      -35.0
      -40.0
      -28.0
      -24.0
      -17.0
      -14.0
      -14.0
      0.0
      -313.0
    
    
      1886
      -37.0
      -43.0
      -34.0
      -23.0
      -21.0
      -30.0
      -13.0
      -21.0
      -12.0
      -22.0
      -29.0
      -18.0
      -303.0
    
    
      1887
      -60.0
      -43.0
      -26.0
      -34.0
      -28.0
      -25.0
      -19.0
      -28.0
      -24.0
      -33.0
      -29.0
      -40.0
      -389.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      2008
      22.0
      31.0
      70.0
      48.0
      46.0
      43.0
      54.0
      39.0
      59.0
      62.0
      62.0
      51.0
      587.0
    
    
      2009
      56.0
      48.0
      49.0
      57.0
      59.0
      62.0
      66.0
      61.0
      64.0
      58.0
      72.0
      58.0
      710.0
    
    
      2010
      66.0
      75.0
      87.0
      82.0
      71.0
      60.0
      56.0
      59.0
      55.0
      65.0
      74.0
      44.0
      794.0
    
    
      2011
      45.0
      44.0
      57.0
      60.0
      47.0
      54.0
      70.0
      69.0
      52.0
      60.0
      50.0
      48.0
      656.0
    
    
      2012
      38.0
      43.0
      52.0
      62.0
      71.0
      59.0
      50.0
      56.0
      68.0
      73.0
      69.0
      46.0
      687.0
    
    
      2013
      62.0
      52.0
      60.0
      48.0
      56.0
      61.0
      53.0
      61.0
      73.0
      61.0
      75.0
      61.0
      723.0
    
    
      2014
      68.0
      44.0
      71.0
      72.0
      79.0
      62.0
      50.0
      74.0
      81.0
      78.0
      64.0
      74.0
      817.0
    
    
      2015
      75.0
      80.0
      84.0
      71.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      310.0
    
  

136 rows × 13 columns



In [80]:

    
# Let's remove this new column, we will learn to do this differently
giss_temp = giss_temp.drop("totals", axis=1)

More complex queries rely on the same concepts. For example what are the names, and IDs of the sea level stations in the USA?



In [81]:

    
local_sea_level_stations.columns









    Out[81]:





Index([u'Station Name', u'ID', u'Lat', u'Lon', u'GLOSS ID', u'Country',
       u'Date', u'Coastline', u'Station'],
      dtype='object')



In [82]:

    
american_stations = local_sea_level_stations["Country"] == "USA"
local_sea_level_stations.loc[american_stations, ["ID", "Station Name"]]









    Out[82]:






  
    
      
      ID
      Station Name
    
  
  
    
      1267
      1372
      FRENCH FRIGATE SHOALS
    
    
      1268
      2192
      FRENCH FRIGATE SHOALS B
    
    
      1269
      2129
      PORT ALLEN, HANAPEPE BAY, KAUAI ISLAND
    
    
      1270
      756
      NAWILIWILI BAY, KAUAI ISLAND
    
    
      1271
      155
      HONOLULU
    
    
      1272
      823
      MOKUOLOE ISLAND
    
    
      1273
      2273
      LANAI ISLAND, KAUMALAPAU
    
    
      1274
      521
      KAHULUI HARBOR, MAUI ISLAND
    
    
      ...
      ...
      ...
    
    
      1437
      288
      SEAVEY ISLAND
    
    
      1438
      2291
      FORT POINT, NEWCASTLE ISLAND
    
    
      1439
      183
      PORTLAND (MAINE)
    
    
      1440
      1279
      ROCKLAND
    
    
      1441
      525
      BAR HARBOR, FRENCHMAN BAY, ME
    
    
      1442
      1524
      CUTLER II
    
    
      1443
      1081
      CUTLER
    
    
      1444
      332
      EASTPORT
    
  

178 rows × 2 columns

6. Working with dates and times

More details at http://pandas.pydata.org/pandas-docs/stable/timeseries.html

Let's work some more with full_globe_temp's index since we saw it is special.



In [83]:

    
# Its dtype is NumPy's new 'datetime64[ns]':
full_globe_temp.index.dtype









    Out[83]:





dtype('<M8[ns]')

The advantage of having a real datetime index is that operations can be done efficiently on it. Let's add a flag to signal if the value is before or after the great depression's black Friday:



In [84]:

    
black_friday = pd.to_datetime('1929-10-29')
full_globe_temp.index > black_friday









    Out[84]:





array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True], dtype=bool)

Timestamps or periods?



In [85]:

    
# Convert its index from timestamp to period: it is more meaningfull since it was measured and averaged over the year...
full_globe_temp.index = full_globe_temp.index.to_period()
full_globe_temp









    Out[85]:





year
1880   -0.1591
1881   -0.0789
1882   -0.1313
1883   -0.1675
1884   -0.2485
1885   -0.2042
1886   -0.1695
1887   -0.2505
         ...  
2004    0.5416
2005    0.6154
2006    0.5601
2007    0.5472
2008    0.4804
2009    0.5551
2010       NaN
2011       NaN
Freq: A-DEC, Name: mean temp, dtype: float64

See also to_timestamp to conver back to timestamps and its how method to specify when inside the range to set the timestamp.

Resampling

Another thing that can be done is to resample the series, downsample or upsample. Let's see the series converted to 10 year blocks or upscale to a monthly series:



In [86]:

    
# Frequencies can be specified as strings: "us", "ms", "S", "T", "H", "D", "B", "W", "M", "A", "3min", "2h20", ...
# More aliases at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
full_globe_temp.resample("M").mean()









    Out[86]:





1880-01   -0.1591
1880-02       NaN
1880-03       NaN
1880-04       NaN
1880-05       NaN
1880-06       NaN
1880-07       NaN
1880-08       NaN
            ...  
2011-05       NaN
2011-06       NaN
2011-07       NaN
2011-08       NaN
2011-09       NaN
2011-10       NaN
2011-11       NaN
2011-12       NaN
Freq: M, Name: mean temp, dtype: float64



In [87]:

    
full_globe_temp.resample("10A").mean()









    Out[87]:





year
1880-12-31 00:00:00   -0.15910
1890-12-31 00:00:00   -0.18226
1900-12-31            -0.21054
1910-12-31            -0.31381
1920-12-31            -0.25518
1930-12-31            -0.13395
1940-12-31             0.00786
1950-12-31             0.04548
1960-12-31            -0.00198
1970-12-31             0.02566
1980-12-31             0.04817
1990-12-31             0.21491
2000-12-31             0.36288
2010-12-31             0.55260
2020-12-31                 NaN
Freq: 10A-DEC, Name: mean temp, dtype: float64

Generating DatetimeIndex objects

The index for giss_temp isn't an instance of datetimes so we may want to generate such DatetimeIndex objects. This can be done with date_range and period_range:



In [88]:

    
# Can specify a start date and a number of values desired. By default it will assume an interval of 1 day:
pd.date_range('1/1/1880', periods=4)









    Out[88]:





DatetimeIndex(['1880-01-01', '1880-01-02', '1880-01-03', '1880-01-04'], dtype='datetime64[ns]', freq='D')



In [89]:

    
# Can also specify a start and a stop date, as well as a frequency
pd.date_range('1/1/1880', '1/1/2016', freq="A")









    Out[89]:





DatetimeIndex(['1880-12-31', '1881-12-31', '1882-12-31', '1883-12-31',
               '1884-12-31', '1885-12-31', '1886-12-31', '1887-12-31',
               '1888-12-31', '1889-12-31',
               ...
               '2006-12-31', '2007-12-31', '2008-12-31', '2009-12-31',
               '2010-12-31', '2011-12-31', '2012-12-31', '2013-12-31',
               '2014-12-31', '2015-12-31'],
              dtype='datetime64[ns]', length=136, freq='A-DEC')

Note that "A" by default means the end of the year. Other times in the year can be specified with "AS" (start), "A-JAN" or "A-JUN". Even more options can be imported from pandas.tseries.offsets:



In [90]:

    
from pandas.tseries.offsets import YearBegin
pd.date_range('1/1/1880', '1/1/2015', freq=YearBegin())









    Out[90]:





DatetimeIndex(['1880-01-01', '1881-01-01', '1882-01-01', '1883-01-01',
               '1884-01-01', '1885-01-01', '1886-01-01', '1887-01-01',
               '1888-01-01', '1889-01-01',
               ...
               '2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01',
               '2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01',
               '2014-01-01', '2015-01-01'],
              dtype='datetime64[ns]', length=136, freq='AS-JAN')

Actually we will convert that dataset to a 1D dataset, and build a monthly index, so lets build a monthly period index



In [91]:

    
giss_temp_index = pd.period_range('1/1/1880', '12/1/2015', freq="M")
giss_temp_index









    Out[91]:





PeriodIndex(['1880-01', '1880-02', '1880-03', '1880-04', '1880-05', '1880-06',
             '1880-07', '1880-08', '1880-09', '1880-10',
             ...
             '2015-03', '2015-04', '2015-05', '2015-06', '2015-07', '2015-08',
             '2015-09', '2015-10', '2015-11', '2015-12'],
            dtype='int64', length=1632, freq='M')

7. Transforming datasets: apply, sort, stack/unstack and transpose

Let's look at our local_sea_level_stations dataset some more, to learn more about it and also do some formatting. What is the range of dates and lattitudes we have, the list of countries, the range of variations, ...



In [92]:

    
# What about the range of dates?
local_sea_level_stations["Date"].min(), local_sea_level_stations["Date"].max(), local_sea_level_stations["Date"].iloc[-1]









    Out[92]:





(u'01/01/1980', u'31/07/2014', u'01/02/2016')



In [93]:

    
local_sea_level_stations.dtypes









    Out[93]:





Station Name     object
ID                int64
Lat             float64
Lon             float64
GLOSS ID        float64
Country          object
Date             object
Coastline         int64
Station           int64
dtype: object

Apply: transforming Series

We don't see the range of dates because the dates are of dtype "Object", (usually meaning strings). Let's convert that using apply:



In [94]:

    
local_sea_level_stations["Date"].apply(pd.to_datetime)









    Out[94]:





0      1980-01-01
1      1980-01-01
2      1980-01-01
3      2015-06-10
4      2016-08-03
5      1999-11-05
6      2015-05-21
7      2004-09-30
          ...    
1458   2015-11-17
1459   2015-11-17
1460   2015-11-17
1461   2016-08-03
1462   2016-08-03
1463   2016-09-03
1464   2016-09-03
1465   2016-01-02
Name: Date, dtype: datetime64[ns]

This apply method is very powerful and general. We have used it to do something we could have done with astype, but any custom function can be provided to apply.



In [95]:

    
local_sea_level_stations["Date"] = local_sea_level_stations["Date"].apply(pd.to_datetime)

# Now we can really compare the dates, and therefore get a real range:
print(local_sea_level_stations["Date"].min(), local_sea_level_stations["Date"].max())









    



(Timestamp('1980-01-01 00:00:00'), Timestamp('2016-12-05 00:00:00'))

EXERCISE: Use the apply method to search through the stations names for a station in New York. What is the ID of the station?



In [96]:

    
# Your code here

Now that we know the range of dates, to look at the data, sorting it following the dates is done with sort:



In [97]:

    
local_sea_level_stations.sort_values(by="Date")









    Out[97]:






  
    
      
      Station Name
      ID
      Lat
      Lon
      GLOSS ID
      Country
      Date
      Coastline
      Station
    
  
  
    
      0
      HELIGMAN
      231
      60.200
      19.300
      NaN
      ALA
      1980-01-01
      60
      251
    
    
      352
      KOTKA
      164
      60.450
      26.950
      NaN
      FIN
      1980-01-01
      60
      358
    
    
      351
      SODERSKAR
      29
      60.117
      25.417
      NaN
      FIN
      1980-01-01
      60
      354
    
    
      347
      RUSSARO
      28
      59.767
      22.950
      NaN
      FIN
      1980-01-01
      60
      316
    
    
      346
      JUNGFRUSUND
      18
      59.950
      22.367
      NaN
      FIN
      1980-01-01
      60
      311
    
    
      345
      TVARMINNE
      238
      59.850
      23.250
      NaN
      FIN
      1980-01-01
      60
      306
    
    
      344
      LOHM
      233
      60.100
      21.667
      NaN
      FIN
      1980-01-01
      60
      301
    
    
      358
      ANTIFER
      1383
      49.650
      0.150
      NaN
      FRA
      1980-01-01
      190
      46
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      1085
      VIGO II
      1898
      42.243
      -8.726
      NaN
      ESP
      2016-12-01
      200
      42
    
    
      1082
      FERROL 2
      2053
      43.476
      -8.249
      NaN
      ESP
      2016-12-01
      200
      35
    
    
      1078
      GIJON II
      1871
      43.558
      -5.698
      NaN
      ESP
      2016-12-01
      200
      22
    
    
      1077
      SANTANDER III
      1807
      43.461
      -3.791
      NaN
      ESP
      2016-12-01
      200
      13
    
    
      1074
      BILBAO
      1806
      43.352
      -3.045
      NaN
      ESP
      2016-12-01
      200
      6
    
    
      1106
      SAGUNTO
      2059
      39.634
      -0.206
      NaN
      ESP
      2016-12-01
      220
      60
    
    
      1083
      VILLAGARCIA
      1897
      42.600
      -8.767
      NaN
      ESP
      2016-12-01
      200
      36
    
    
      912
      MATI, DAVAO ORIENTAL
      2156
      6.950
      126.217
      NaN
      PHL
      2016-12-05
      660
      116
    
  

1466 rows × 9 columns

Since many stations last updated on the same dates, it is logical to want to sort further, for example, by Country at constant date:



In [98]:

    
local_sea_level_stations.sort_values(by=["Date", "Country"], ascending=False)









    Out[98]:






  
    
      
      Station Name
      ID
      Lat
      Lon
      GLOSS ID
      Country
      Date
      Coastline
      Station
    
  
  
    
      912
      MATI, DAVAO ORIENTAL
      2156
      6.950
      126.217
      NaN
      PHL
      2016-12-05
      660
      116
    
    
      1074
      BILBAO
      1806
      43.352
      -3.045
      NaN
      ESP
      2016-12-01
      200
      6
    
    
      1077
      SANTANDER III
      1807
      43.461
      -3.791
      NaN
      ESP
      2016-12-01
      200
      13
    
    
      1078
      GIJON II
      1871
      43.558
      -5.698
      NaN
      ESP
      2016-12-01
      200
      22
    
    
      1082
      FERROL 2
      2053
      43.476
      -8.249
      NaN
      ESP
      2016-12-01
      200
      35
    
    
      1083
      VILLAGARCIA
      1897
      42.600
      -8.767
      NaN
      ESP
      2016-12-01
      200
      36
    
    
      1085
      VIGO II
      1898
      42.243
      -8.726
      NaN
      ESP
      2016-12-01
      200
      42
    
    
      1088
      HUELVA
      1883
      37.132
      -6.834
      NaN
      ESP
      2016-12-01
      220
      5
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      17
      PIRAMIDE
      867
      -42.583
      -64.283
      NaN
      ARG
      1980-01-01
      860
      41
    
    
      18
      ROSALES
      870
      -38.933
      -62.067
      NaN
      ARG
      1980-01-01
      860
      61
    
    
      20
      MAR DEL PLATA (PUERTO)
      857
      -38.033
      -57.533
      192.0
      ARG
      1980-01-01
      860
      91
    
    
      23
      RIO SANTIAGO
      897
      -34.850
      -57.900
      NaN
      ARG
      1980-01-01
      860
      141
    
    
      26
      ISLA MARTIN GARCIA
      864
      -34.183
      -58.250
      NaN
      ARG
      1980-01-01
      860
      171
    
    
      0
      HELIGMAN
      231
      60.200
      19.300
      NaN
      ALA
      1980-01-01
      60
      251
    
    
      1
      KOBBAKLINTAR
      63
      60.033
      19.883
      NaN
      ALA
      1980-01-01
      60
      261
    
    
      2
      LEMSTROM
      84
      60.100
      20.017
      NaN
      ALA
      1980-01-01
      60
      271
    
  

1466 rows × 9 columns

Stack and unstack

Let's look at the GISS dataset differently. Instead of seeing the months along the axis 1, and the years along the axis 0, it would could be good to convert these into an outer and an inner axis along only 1 time dimension.

Stacking and unstacking allows to convert a dataframe into a series and vice-versa:



In [99]:

    
giss_temp.unstack?
unstacked = giss_temp.unstack()
unstacked









    Out[99]:





     Year
Jan  1880   -34.0
     1881   -13.0
     1882     3.0
     1883   -38.0
     1884   -20.0
     1885   -57.0
     1886   -37.0
     1887   -60.0
             ... 
Dec  2008    51.0
     2009    58.0
     2010    44.0
     2011    48.0
     2012    46.0
     2013    61.0
     2014    74.0
     2015     NaN
dtype: float32



In [100]:

    
# Note the nature of the result:
type(unstacked)









    Out[100]:





pandas.core.series.Series

The result is grouped in the wrong order since it sorts first the axis that was unstacked. Another transformation that would help us is transposing...



In [101]:

    
giss_temp.transpose()









    Out[101]:






  
    
      Year
      1880
      1881
      1882
      1883
      1884
      1885
      1886
      1887
      1888
      1889
      ...
      2006
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
    
  
  
    
      Jan
      -34.0
      -13.0
      3.0
      -38.0
      -20.0
      -57.0
      -37.0
      -60.0
      -43.0
      -18.0
      ...
      52.0
      93.0
      22.0
      56.0
      66.0
      45.0
      38.0
      62.0
      68.0
      75.0
    
    
      Feb
      -27.0
      -16.0
      4.0
      -38.0
      -14.0
      -29.0
      -43.0
      -43.0
      -41.0
      16.0
      ...
      65.0
      65.0
      31.0
      48.0
      75.0
      44.0
      43.0
      52.0
      44.0
      80.0
    
    
      Mar
      -22.0
      -2.0
      -2.0
      -12.0
      -31.0
      -19.0
      -34.0
      -26.0
      -39.0
      6.0
      ...
      59.0
      67.0
      70.0
      49.0
      87.0
      57.0
      52.0
      60.0
      71.0
      84.0
    
    
      Apr
      -30.0
      -3.0
      -24.0
      -20.0
      -36.0
      -36.0
      -23.0
      -34.0
      -25.0
      7.0
      ...
      45.0
      71.0
      48.0
      57.0
      82.0
      60.0
      62.0
      48.0
      72.0
      71.0
    
    
      May
      -16.0
      -3.0
      -20.0
      -20.0
      -33.0
      -35.0
      -21.0
      -28.0
      -26.0
      -1.0
      ...
      43.0
      64.0
      46.0
      59.0
      71.0
      47.0
      71.0
      56.0
      79.0
      NaN
    
    
      Jun
      -24.0
      -27.0
      -32.0
      -8.0
      -36.0
      -40.0
      -30.0
      -25.0
      -20.0
      -10.0
      ...
      60.0
      55.0
      43.0
      62.0
      60.0
      54.0
      59.0
      61.0
      62.0
      NaN
    
    
      Jul
      -19.0
      -12.0
      -27.0
      -3.0
      -31.0
      -28.0
      -13.0
      -19.0
      -11.0
      -13.0
      ...
      50.0
      58.0
      54.0
      66.0
      56.0
      70.0
      50.0
      53.0
      50.0
      NaN
    
    
      Aug
      -12.0
      -8.0
      -11.0
      -13.0
      -24.0
      -24.0
      -21.0
      -28.0
      -13.0
      -18.0
      ...
      66.0
      57.0
      39.0
      61.0
      59.0
      69.0
      56.0
      61.0
      74.0
      NaN
    
    
      Sep
      -20.0
      -18.0
      -11.0
      -19.0
      -29.0
      -17.0
      -12.0
      -24.0
      -10.0
      -18.0
      ...
      59.0
      61.0
      59.0
      64.0
      55.0
      52.0
      68.0
      73.0
      81.0
      NaN
    
    
      Oct
      -19.0
      -23.0
      -25.0
      -19.0
      -25.0
      -14.0
      -22.0
      -33.0
      -3.0
      -21.0
      ...
      66.0
      56.0
      62.0
      58.0
      65.0
      60.0
      73.0
      61.0
      78.0
      NaN
    
    
      Nov
      -16.0
      -28.0
      -25.0
      -28.0
      -29.0
      -14.0
      -29.0
      -29.0
      -1.0
      -32.0
      ...
      69.0
      54.0
      62.0
      72.0
      74.0
      50.0
      69.0
      75.0
      64.0
      NaN
    
    
      Dec
      -21.0
      -18.0
      -37.0
      -21.0
      -25.0
      0.0
      -18.0
      -40.0
      -10.0
      -30.0
      ...
      74.0
      46.0
      51.0
      58.0
      44.0
      48.0
      46.0
      61.0
      74.0
      NaN
    
  

12 rows × 136 columns



In [102]:

    
giss_temp_series = giss_temp.transpose().unstack()
giss_temp_series.name = "Temp anomaly"
giss_temp_series









    Out[102]:





Year     
1880  Jan   -34.0
      Feb   -27.0
      Mar   -22.0
      Apr   -30.0
      May   -16.0
      Jun   -24.0
      Jul   -19.0
      Aug   -12.0
             ... 
2015  May     NaN
      Jun     NaN
      Jul     NaN
      Aug     NaN
      Sep     NaN
      Oct     NaN
      Nov     NaN
      Dec     NaN
Name: Temp anomaly, dtype: float32

A side note: Multi-indexes



In [103]:

    
# Note the nature of the resulting index:
giss_temp_series.index









    Out[103]:





MultiIndex(levels=[[1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015], [u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 91, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 98, 98, 98, 98, 98, 98, 98, 98, 98, 98, 98, 98, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 106, 106, 106, 106, 106, 106, 106, 106, 106, 106, 106, 106, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 108, 108, 108, 108, 108, 108, 108, 108, 108, 108, 108, 108, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
           names=[u'Year', None])



In [104]:

    
# It is an index made of 2 columns. Let's fix the fact that one of them doesn't have a name:
giss_temp_series.index = giss_temp_series.index.set_names(["year", "month"])



In [105]:

    
# We can now access deviations by specifying the year and month:
giss_temp_series[1980, "Jan"]









    Out[105]:





24.0

But this new multi-index isn't very good, because is it not viewed as 1 date, just as a tuple of values:



In [106]:

    
giss_temp_series.plot(figsize=LARGE_FIGSIZE)









    Out[106]:





<matplotlib.axes._subplots.AxesSubplot at 0x114ceae50>

To improve on this, let's reuse an index we generated above with date_range:



In [107]:

    
giss_temp_series.index = giss_temp_index
giss_temp_series.plot(figsize=LARGE_FIGSIZE)









    Out[107]:





<matplotlib.axes._subplots.AxesSubplot at 0x114bd0dd0>

8. Statistical analysis

Descriptive statistics

Let's go back to the dataframe version of the GISS temperature dataset temporarily to analyze anomalies month per month. Like most functions on a dataframe, stats functions are computed column per column. They also ignore missing values:



In [108]:

    
monthly_averages = giss_temp.mean()
monthly_averages









    Out[108]:





Jan   -1.411765
Feb   -0.161765
Mar    1.294118
Apr   -0.823529
May   -2.051852
Jun   -3.237037
Jul   -0.688889
Aug   -0.622222
Sep    0.755556
Oct    1.333333
Nov    0.148148
Dec   -2.274074
dtype: float32

It is possible to apply stats functions across rows instead of columns using the axis keyword (just like in NumPy).



In [109]:

    
yearly_averages = giss_temp.mean(axis=1)
yearly_averages









    Out[109]:





Year
1880   -21.666666
1881   -14.250000
1882   -17.250000
1883   -19.916666
1884   -27.750000
1885   -26.083334
1886   -25.250000
1887   -32.416668
          ...    
2008    48.916668
2009    59.166668
2010    66.166664
2011    54.666668
2012    57.250000
2013    60.250000
2014    68.083336
2015    77.500000
dtype: float32

describe provides many descriptive stats computed at once:



In [110]:

    
mean_sea_level.describe()









    Out[110]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
  
  
    
      count
      861.000000
      861.000000
      780.000000
    
    
      mean
      25.050923
      28.422725
      29.166036
    
    
      std
      29.292476
      24.730756
      21.226033
    
    
      min
      -38.377000
      -28.239000
      -11.796000
    
    
      25%
      3.810000
      8.946000
      11.130500
    
    
      50%
      24.487000
      27.836000
      31.783000
    
    
      75%
      47.140000
      46.066000
      45.047500
    
    
      max
      99.108000
      90.633000
      75.007000

Rolling statistics

Let's remove high frequency signal and extract the trend:



In [111]:

    
full_globe_temp.plot()
rolled_series = full_globe_temp.rolling(window=10, center=False)
print(rolled_series)
rolled_series.mean().plot(figsize=LARGE_FIGSIZE)









    



Rolling [window=10,center=False,axis=0]






    Out[111]:





<matplotlib.axes._subplots.AxesSubplot at 0x1198e9990>



In [112]:

    
# To see what all can be done while rolling, 
#pd.rolling_<TAB>

Describing categorical series

Let's look at our local_sea_level_stations dataset some more:



In [113]:

    
local_sea_level_stations.describe()









    Out[113]:






  
    
      
      ID
      Lat
      Lon
      GLOSS ID
      Coastline
      Station
    
  
  
    
      count
      1466.000000
      1466.000000
      1466.000000
      281.000000
      1466.000000
      1466.000000
    
    
      mean
      1069.452251
      28.762511
      21.889766
      165.327402
      532.484993
      85.937926
    
    
      std
      655.126610
      30.313309
      92.947697
      95.313053
      312.892857
      120.120620
    
    
      min
      1.000000
      -77.850000
      -179.367000
      3.000000
      10.000000
      1.000000
    
    
      25%
      518.250000
      13.810500
      -64.745500
      87.000000
      220.000000
      12.000000
    
    
      50%
      1021.500000
      35.700000
      14.448500
      154.000000
      641.000000
      41.000000
    
    
      75%
      1629.750000
      49.320500
      118.023500
      242.000000
      822.000000
      101.000000
    
    
      max
      2304.000000
      82.490000
      179.217000
      351.000000
      999.000000
      725.000000

.describe() only displays information about continuous Series. What about categorical ones?



In [114]:

    
local_sea_level_stations.columns









    Out[114]:





Index([u'Station Name', u'ID', u'Lat', u'Lon', u'GLOSS ID', u'Country',
       u'Date', u'Coastline', u'Station'],
      dtype='object')



In [115]:

    
local_sea_level_stations["Country"]









    Out[115]:





0       ALA
1       ALA
2       ALA
3       ALA
4       ASM
5       ATA
6       ATA
7       ATA
       ... 
1458    VNM
1459    VNM
1460    VNM
1461    VIR
1462    VIR
1463    VIR
1464    VIR
1465    YEM
Name: Country, dtype: object



In [116]:

    
local_sea_level_stations["Country"].describe()









    Out[116]:





count     1466
unique     136
top        USA
freq       178
Name: Country, dtype: object



In [117]:

    
# List of unique values:
local_sea_level_stations["Country"].unique()









    Out[117]:





array([u'ALA', u'ASM', u'ATA', u'ARG', u'AUS', u'BHS', u'BHR', u'BGD',
       u'BEL', u'BMU', u'BRA', u'IOT', u'BGR', u'CAN', u'CPV', u'CYM',
       u'CHL', u'CHN', u'CCK', u'COL', u'COG', u'COK', u'CRI', u'CIV',
       u'HRV', u'CUB', u'CYP', u'DNK', u'DOM', u'ECU', u'EGY', u'SLV',
       u'EST', u'FLK', u'FRO', u'FJI', u'FIN', u'FRA', u'GUF', u'PYF',
       u'GEO', u'DEU', u'GHA', u'GIB', u'GRC', u'GRD', u'GLP', u'GUM',
       u'GTM', u'HTI', u'HND', u'HKG', u'ISL', u'IND', u'IDN', u'IRN',
       u'IRL', u'IMN', u'ISR', u'ITA', u'JAM', u'JPN', u'JEY', u'KEN',
       u'KIR', u'PRK', u'KOR', u'LVA', u'LTU', u'MAC', u'MDG', u'MYS',
       u'MDV', u'MLT', u'MHL', u'MTQ', u'MUS', u'MYT', u'MEX', u'FSM',
       u'MCO', u'MNE', u'MAR', u'MOZ', u'MMR', u'NAM', u'NRU', u'NLD',
       u'NCL', u'NZL', u'NGA', u'NFK', u'MNP', u'NOR', u'OMN', u'PAK',
       u'PLW', u'PAN', u'PNG', u'PER', u'PHL', u'POL', u'PRT', u'PRI',
       u'REU', u'ROU', u'RUS', u'SHN', u'WSM', u'SEN', u'SYC', u'SGP',
       u'SVN', u'SLB', u'ZAF', u'ESP', u'SDN', u'SJM', u'SWE', u'TWN',
       u'TZA', u'THA', u'TON', u'TTO', u'TUR', u'TUV', u'UKR', u'GBR',
       u'USA', u'UMI', u'URY', u'VUT', u'VEN', u'VNM', u'VIR', u'YEM'], dtype=object)



In [118]:

    
local_sea_level_stations["Country"].value_counts()









    Out[118]:





USA    178
JPN    139
RUS     86
CAN     82
AUS     81
GBR     68
ESP     61
ITA     44
      ... 
TZA      1
PRK      1
JEY      1
JAM      1
MNP      1
UKR      1
COG      1
VUT      1
Name: Country, dtype: int64



In [119]:

    
# To save memory, we can convert it to a categorical column:
local_sea_level_stations["Country"] = local_sea_level_stations["Country"].astype("category")

We can also create categorical series from continuous ones with the cut function:



In [120]:

    
categorized = pd.cut(full_globe_temp, 3, labels=["L", "M", "H"])
categorized









    Out[120]:





year
1880      L
1881      L
1882      L
1883      L
1884      L
1885      L
1886      L
1887      L
       ... 
2004      H
2005      H
2006      H
2007      H
2008      H
2009      H
2010    NaN
2011    NaN
Freq: A-DEC, Name: mean temp, dtype: category
Categories (3, object): [L < M < H]



In [121]:

    
# The advantage is that we can use labels and control the order they should be treated in (L < M < H)
categorized.cat.categories









    Out[121]:





Index([u'L', u'M', u'H'], dtype='object')

QUIZ: How much memory did we save? What if it was categorized but with dtype object instead of category?

9. Data Aggregation/summarization

Now that we have a good grasp on our datasets, Let's transform and analyze them some more to prepare them to compare them. The 2 function(alities)s to learn about here are groupby and pivot_table.

GroupBy

Let's explore the sea levels, first splitting into calendar years to compute average sea levels for each year:



In [122]:

    
mean_sea_level









    Out[122]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
    
      date
      
      
      
    
  
  
    
      1992.9323
      13.717
      2.517
      NaN
    
    
      1992.9595
      1.328
      -7.645
      -5.818
    
    
      1992.9866
      -13.375
      -2.267
      -7.525
    
    
      1993.0138
      -24.723
      0.855
      -9.215
    
    
      1993.0409
      -29.231
      -2.020
      -11.796
    
    
      1993.0681
      -32.538
      -0.398
      -11.291
    
    
      1993.0952
      -35.588
      4.102
      -9.569
    
    
      1993.1223
      -26.579
      6.787
      -3.714
    
    
      ...
      ...
      ...
      ...
    
    
      2016.1440
      49.218
      84.902
      NaN
    
    
      2016.1711
      44.274
      83.829
      NaN
    
    
      2016.1983
      33.961
      86.020
      NaN
    
    
      2016.2254
      42.199
      88.122
      NaN
    
    
      2016.2526
      45.004
      88.090
      NaN
    
    
      2016.2797
      49.809
      87.806
      NaN
    
    
      2016.3068
      55.401
      77.270
      NaN
    
    
      2016.3340
      50.416
      70.916
      NaN
    
  

861 rows × 3 columns



In [123]:

    
mean_sea_level = mean_sea_level.reset_index()
mean_sea_level









    Out[123]:






  
    
      
      date
      northern_hem
      southern_hem
      mean_global
    
  
  
    
      0
      1992.9323
      13.717
      2.517
      NaN
    
    
      1
      1992.9595
      1.328
      -7.645
      -5.818
    
    
      2
      1992.9866
      -13.375
      -2.267
      -7.525
    
    
      3
      1993.0138
      -24.723
      0.855
      -9.215
    
    
      4
      1993.0409
      -29.231
      -2.020
      -11.796
    
    
      5
      1993.0681
      -32.538
      -0.398
      -11.291
    
    
      6
      1993.0952
      -35.588
      4.102
      -9.569
    
    
      7
      1993.1223
      -26.579
      6.787
      -3.714
    
    
      ...
      ...
      ...
      ...
      ...
    
    
      853
      2016.1440
      49.218
      84.902
      NaN
    
    
      854
      2016.1711
      44.274
      83.829
      NaN
    
    
      855
      2016.1983
      33.961
      86.020
      NaN
    
    
      856
      2016.2254
      42.199
      88.122
      NaN
    
    
      857
      2016.2526
      45.004
      88.090
      NaN
    
    
      858
      2016.2797
      49.809
      87.806
      NaN
    
    
      859
      2016.3068
      55.401
      77.270
      NaN
    
    
      860
      2016.3340
      50.416
      70.916
      NaN
    
  

861 rows × 4 columns



In [124]:

    
# Groupby with pandas can be done on a column or by applying a custom function to the index. 
# If we want to group the data by year, we can build a year column into the DF:
mean_sea_level["year"] = mean_sea_level["date"].apply(int)
mean_sea_level









    Out[124]:






  
    
      
      date
      northern_hem
      southern_hem
      mean_global
      year
    
  
  
    
      0
      1992.9323
      13.717
      2.517
      NaN
      1992
    
    
      1
      1992.9595
      1.328
      -7.645
      -5.818
      1992
    
    
      2
      1992.9866
      -13.375
      -2.267
      -7.525
      1992
    
    
      3
      1993.0138
      -24.723
      0.855
      -9.215
      1993
    
    
      4
      1993.0409
      -29.231
      -2.020
      -11.796
      1993
    
    
      5
      1993.0681
      -32.538
      -0.398
      -11.291
      1993
    
    
      6
      1993.0952
      -35.588
      4.102
      -9.569
      1993
    
    
      7
      1993.1223
      -26.579
      6.787
      -3.714
      1993
    
    
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      853
      2016.1440
      49.218
      84.902
      NaN
      2016
    
    
      854
      2016.1711
      44.274
      83.829
      NaN
      2016
    
    
      855
      2016.1983
      33.961
      86.020
      NaN
      2016
    
    
      856
      2016.2254
      42.199
      88.122
      NaN
      2016
    
    
      857
      2016.2526
      45.004
      88.090
      NaN
      2016
    
    
      858
      2016.2797
      49.809
      87.806
      NaN
      2016
    
    
      859
      2016.3068
      55.401
      77.270
      NaN
      2016
    
    
      860
      2016.3340
      50.416
      70.916
      NaN
      2016
    
  

861 rows × 5 columns



In [125]:

    
sl_grouped_year = mean_sea_level.groupby("year")

What kind of object did we create?



In [126]:

    
type(sl_grouped_year)









    Out[126]:





pandas.core.groupby.DataFrameGroupBy

What to do with that strange GroupBy object? We can first loop over it to get the labels and the sub-dataframes for each group:



In [127]:

    
for group_name, subdf in sl_grouped_year:
    print(group_name)
    print(subdf)
    print("")









    



1992
        date  northern_hem  southern_hem  mean_global  year
0  1992.9323        13.717         2.517          NaN  1992
1  1992.9595         1.328        -7.645       -5.818  1992
2  1992.9866       -13.375        -2.267       -7.525  1992

1993
         date  northern_hem  southern_hem  mean_global  year
3   1993.0138       -24.723         0.855       -9.215  1993
4   1993.0409       -29.231        -2.020      -11.796  1993
5   1993.0681       -32.538        -0.398      -11.291  1993
6   1993.0952       -35.588         4.102       -9.569  1993
7   1993.1223       -26.579         6.787       -3.714  1993
8   1993.1495       -29.622        10.315       -2.471  1993
9   1993.1766       -34.527         5.919       -6.784  1993
10  1993.2038       -34.250         7.193       -5.772  1993
..        ...           ...           ...          ...   ...
32  1993.8010        30.541       -14.024       -1.886  1993
33  1993.8282        18.240       -14.008          NaN  1993
34  1993.8553         8.044        -9.115       -7.186  1993
35  1993.8825        10.529        -7.718       -4.588  1993
36  1993.9096        10.772        -6.169       -2.648  1993
37  1993.9368        10.732        -4.656       -0.795  1993
38  1993.9639         1.704         3.492        1.461  1993
39  1993.9911        -8.579        -0.251       -3.859  1993

[37 rows x 5 columns]

1994
         date  northern_hem  southern_hem  mean_global  year
40  1994.0182       -14.764        -1.078       -5.904  1994
41  1994.0454       -12.663         3.887       -1.261  1994
42  1994.0725       -19.485        13.173        2.404  1994
43  1994.0997       -23.443         8.984       -1.173  1994
44  1994.1268       -31.320         4.399       -6.507  1994
45  1994.1540       -37.411         1.308      -10.638  1994
46  1994.1811       -35.844        13.495       -2.544  1994
47  1994.2083       -20.794        10.220          NaN  1994
..        ...           ...           ...          ...   ...
69  1994.8055        18.954       -10.939       -4.447  1994
70  1994.8327        14.186       -10.363       -5.614  1994
71  1994.8598        19.882         4.403          NaN  1994
72  1994.8870        16.490       -10.154       -3.240  1994
73  1994.9141        -0.195       -12.189      -10.229  1994
74  1994.9413        -0.438        -9.834       -7.929  1994
75  1994.9684        -6.651        -2.703       -5.132  1994
76  1994.9956       -12.734         7.809       -0.212  1994

[37 rows x 5 columns]

1995
          date  northern_hem  southern_hem  mean_global  year
77   1995.0227       -16.869        13.406        2.458  1995
78   1995.0499       -16.734        11.835        2.365  1995
79   1995.0770       -24.215        17.897        3.806  1995
80   1995.1042       -32.395        10.742       -3.379  1995
81   1995.1313       -37.790         4.341       -8.843  1995
82   1995.1585       -30.054        11.276       -1.374  1995
83   1995.1856       -18.882        24.336          NaN  1995
84   1995.2128       -29.813        21.872        5.367  1995
..         ...           ...           ...          ...   ...
104  1995.7557        26.196       -16.096       -4.116  1995
105  1995.7829        26.565       -18.685       -5.580  1995
106  1995.8100        18.206       -11.709          NaN  1995
107  1995.8372        20.199        -7.319       -1.007  1995
108  1995.8643        22.797        -8.014        0.257  1995
109  1995.8915        20.865       -10.144       -0.928  1995
110  1995.9458         3.133       -10.152       -6.178  1995
111  1995.9729         3.723         4.050        3.549  1995

[35 rows x 5 columns]

1996
          date  northern_hem  southern_hem  mean_global  year
112  1996.0000        -8.298         8.021        2.121  1996
113  1996.0272       -15.226         4.369       -1.907  1996
114  1996.0543       -19.351         0.676       -5.008  1996
115  1996.0815       -27.781         1.914       -6.979  1996
116  1996.1086       -32.942         8.965       -4.363  1996
117  1996.1358       -26.880        14.640          NaN  1996
118  1996.1629       -25.254        18.665        5.333  1996
119  1996.1901       -31.981        14.792        0.404  1996
..         ...           ...           ...          ...   ...
141  1996.7873        16.485       -16.277          NaN  1996
142  1996.8145        23.172        -7.566        0.035  1996
143  1996.8416        27.836        -5.890        3.471  1996
144  1996.8688        25.333        -9.022        1.228  1996
145  1996.8959         8.147        -5.264       -2.764  1996
146  1996.9231         6.546        -0.403        0.452  1996
147  1996.9502         3.238         7.443        4.804  1996
148  1996.9774         0.742         5.352        3.602  1996

[37 rows x 5 columns]

1997
          date  northern_hem  southern_hem  mean_global  year
149  1997.0045        -6.439         5.169        1.636  1997
150  1997.0317       -11.603         3.083       -0.803  1997
151  1997.0588       -27.445         5.189       -5.010  1997
152  1997.0860       -24.603        12.049        0.989  1997
153  1997.1131       -32.545         2.174          NaN  1997
154  1997.1403       -22.992        13.187        3.083  1997
155  1997.1674       -21.150         9.064        1.468  1997
156  1997.1946       -28.745         8.237       -1.823  1997
..         ...           ...           ...          ...   ...
178  1997.7918        41.540         4.793       14.829  1997
179  1997.8190        33.572         6.671       12.947  1997
180  1997.8461        23.610        10.016       11.460  1997
181  1997.8733        10.445        13.478        8.891  1997
182  1997.9004        12.106        14.756       11.173  1997
183  1997.9276        11.524        19.435       14.685  1997
184  1997.9547         4.187        16.530       11.068  1997
185  1997.9819        -7.182        21.664       10.593  1997

[37 rows x 5 columns]

1998
          date  northern_hem  southern_hem  mean_global  year
186  1998.0090       -18.303        17.642        4.736  1998
187  1998.0362       -12.483        18.718        8.719  1998
188  1998.0633       -25.377         9.945          NaN  1998
189  1998.0905       -15.267        27.836       14.428  1998
190  1998.1176       -14.355        25.288       13.615  1998
191  1998.1448       -20.614        24.298       10.997  1998
192  1998.1719       -21.860        18.501        7.134  1998
193  1998.1990       -21.656        23.332       10.277  1998
..         ...           ...           ...          ...   ...
215  1998.7963        24.604       -28.239          NaN  1998
216  1998.8235        40.522        -9.477        6.725  1998
217  1998.8506        36.535        -6.496        7.464  1998
218  1998.8777        37.038         1.925       13.388  1998
219  1998.9049        25.888        -1.569        7.522  1998
220  1998.9320        30.483         9.042       16.727  1998
221  1998.9592        12.372         9.913       10.897  1998
222  1998.9863         6.869         4.941        6.738  1998

[37 rows x 5 columns]

1999
          date  northern_hem  southern_hem  mean_global  year
223  1999.0135        -3.830        13.669        8.833  1999
224  1999.0406        -2.387        21.230       14.662  1999
225  1999.0678       -21.709         2.566          NaN  1999
226  1999.0949        -5.048        13.182       10.119  1999
227  1999.1221       -17.530        20.970       10.080  1999
228  1999.1492       -18.436        16.705        7.655  1999
229  1999.1764       -19.660        22.345       10.832  1999
230  1999.2035       -13.171        20.341       12.249  1999
..         ...           ...           ...          ...   ...
252  1999.8008        34.355         2.509       11.388  1999
253  1999.8279        34.834         5.045       13.432  1999
254  1999.8551        42.485         4.787       16.967  1999
255  1999.8822        36.612         5.828       15.993  1999
256  1999.9094        28.903         5.765       13.563  1999
257  1999.9365         7.769         7.551          NaN  1999
258  1999.9637         7.359        17.264       13.728  1999
259  1999.9908         3.449        24.124       17.333  1999

[37 rows x 5 columns]

2000
          date  northern_hem  southern_hem  mean_global  year
260  2000.0180        10.176        18.375       17.533  2000
261  2000.0451         6.061        20.247       17.858  2000
262  2000.0723        -6.104        24.273       16.159  2000
263  2000.0994       -21.263        24.218       10.783  2000
264  2000.1266       -15.703        21.589       11.807  2000
265  2000.1537        -9.047        26.780       17.805  2000
266  2000.1809       -10.735        29.419       18.971  2000
267  2000.2080        -9.286        27.323       18.474  2000
..         ...           ...           ...          ...   ...
289  2000.8053        43.405         9.175       19.382  2000
290  2000.8324        42.480         3.772          NaN  2000
291  2000.8596        43.878        10.708       21.424  2000
292  2000.8867        30.982        11.342       17.268  2000
293  2000.9139        29.436        11.797       17.882  2000
294  2000.9410        15.358        14.999       15.043  2000
295  2000.9682        17.786        15.890       17.602  2000
296  2000.9953        19.537        20.462       22.057  2000

[37 rows x 5 columns]

2001
          date  northern_hem  southern_hem  mean_global  year
297  2001.0225         7.732        19.694       17.812  2001
298  2001.0496        -1.064         9.399          NaN  2001
299  2001.0767       -13.165        26.607       15.095  2001
300  2001.1039        -6.459        27.718       19.007  2001
301  2001.1310        -1.461        25.099       19.868  2001
302  2001.1582        -1.053        31.460       24.128  2001
303  2001.1853        -5.181        34.418       24.451  2001
304  2001.2125       -15.876        30.493       18.088  2001
..         ...           ...           ...          ...   ...
326  2001.8097        51.379         8.698       22.653  2001
327  2001.8369        50.977        14.050       26.002  2001
328  2001.8640        37.447        17.758       23.384  2001
329  2001.8912        25.848        20.314       20.948  2001
330  2001.9183        19.967        22.864       21.027  2001
331  2001.9455        26.199        18.393       21.930  2001
332  2001.9726        17.269        26.257       24.021  2001
333  2001.9998        10.626        30.408       24.842  2001

[37 rows x 5 columns]

2002
          date  northern_hem  southern_hem  mean_global  year
334  2002.0269        -4.812        28.283       18.262  2002
335  2002.0541        -5.577        30.166       19.437  2002
336  2002.0812        -3.769        38.490       26.850  2002
337  2002.1084         2.032        38.583       29.728  2002
338  2002.1355         1.644        36.029       28.163  2002
339  2002.1627        -8.796        38.791       23.377  2002
340  2002.1898       -19.634        35.139       17.886  2002
341  2002.2170       -15.731        35.783       20.958  2002
..         ...           ...           ...          ...   ...
362  2002.7871        58.563        16.982       30.754  2002
363  2002.8142        56.392        18.809       31.080  2002
364  2002.8414        43.217        14.710       23.692  2002
365  2002.8685        36.530        21.482       25.621  2002
366  2002.8957        37.029        33.326       33.671  2002
367  2002.9228        32.933        28.857       30.318  2002
368  2002.9500        30.865        26.055       28.853  2002
369  2002.9771        15.771        31.447       26.970  2002

[36 rows x 5 columns]

2003
          date  northern_hem  southern_hem  mean_global  year
370  2003.0043         5.963        31.473       24.113  2003
371  2003.0314         2.851        31.508       23.798  2003
372  2003.0586         4.805        43.326       32.515  2003
373  2003.0857        11.383        41.987       34.928  2003
374  2003.1129         3.465        45.671       34.526  2003
375  2003.1400        -7.422        34.322       23.710  2003
376  2003.1672        -7.073        37.813       26.165  2003
377  2003.1943        -6.972        41.401       28.678  2003
..         ...           ...           ...          ...   ...
399  2003.7916        56.984         9.884       26.229  2003
400  2003.8187        51.042        15.270       27.206  2003
401  2003.8459        47.689        17.865       27.779  2003
402  2003.8730        50.523        19.480       30.516  2003
403  2003.9002        38.028        23.465          NaN  2003
404  2003.9273        38.324        24.332       30.318  2003
405  2003.9544        28.413        23.719       26.939  2003
406  2003.9816        20.909        24.111       25.179  2003

[37 rows x 5 columns]

2004
          date  northern_hem  southern_hem  mean_global  year
407  2004.0087        15.141        35.164       30.448  2004
408  2004.0359        15.516        42.777       36.011  2004
409  2004.0630         9.291        45.368       35.822  2004
410  2004.0902         0.779        43.484       31.936  2004
411  2004.1173        -8.565        29.311       20.438  2004
412  2004.1445         4.549        38.408       31.004  2004
413  2004.1716         4.706        37.540       31.054  2004
414  2004.1988         8.429        40.072       34.182  2004
..         ...           ...           ...          ...   ...
436  2004.7960        59.767        14.801       30.720  2004
437  2004.8232        49.215        25.103       32.614  2004
438  2004.8503        52.121        27.575       35.760  2004
439  2004.8775        55.696        23.664       35.617  2004
440  2004.9046        39.797        22.644       29.439  2004
441  2004.9318        33.859        26.647       30.321  2004
442  2004.9589        28.070        34.103       33.447  2004
443  2004.9861        30.828        45.432       42.391  2004

[37 rows x 5 columns]

2005
          date  northern_hem  southern_hem  mean_global  year
444  2005.0132        13.266        44.864       36.016  2005
445  2005.0404         8.487        41.739       33.156  2005
446  2005.0675         5.659        42.850       33.256  2005
447  2005.0947         1.446        42.562       32.125  2005
448  2005.1218         3.810        44.824       34.871  2005
449  2005.1490         2.986        50.571       38.288  2005
450  2005.1761        11.661        53.613       43.796  2005
451  2005.2033         5.552        53.487       41.497  2005
..         ...           ...           ...          ...   ...
473  2005.8005        59.212        27.535       38.238  2005
474  2005.8277        60.640        26.807       38.673  2005
475  2005.8548        62.702        27.678       40.573  2005
476  2005.8820        56.170        29.275       39.607  2005
477  2005.9091        42.528        36.391       39.121  2005
478  2005.9363        37.889        36.448       38.252  2005
479  2005.9634        31.711        46.602       42.833  2005
480  2005.9906        28.838        45.449       42.022  2005

[37 rows x 5 columns]

2006
          date  northern_hem  southern_hem  mean_global  year
481  2006.0177        15.208        40.995       34.920  2006
482  2006.0449         9.615        40.183       33.051  2006
483  2006.0720        11.103        42.116       35.496  2006
484  2006.0992        12.648        43.330       37.377  2006
485  2006.1263         1.121        54.380       39.798  2006
486  2006.1534         4.234        50.949       39.271  2006
487  2006.1806         3.625        51.639       39.790  2006
488  2006.2077        -2.376        41.009       31.129  2006
..         ...           ...           ...          ...   ...
509  2006.7779        72.515        25.209       42.603  2006
510  2006.8050        60.166        34.091       42.843  2006
511  2006.8321        53.757        35.536          NaN  2006
512  2006.8864        39.182        40.578       39.739  2006
513  2006.9136        35.415        38.551       37.942  2006
514  2006.9407        38.152        47.284       45.260  2006
515  2006.9679        34.626        42.577       42.091  2006
516  2006.9950        22.585        39.894       36.713  2006

[36 rows x 5 columns]

2007
          date  northern_hem  southern_hem  mean_global  year
517  2007.0222        14.020        39.105       33.873  2007
518  2007.0493         6.764        39.398       31.887  2007
519  2007.0765        12.670        47.309       39.598  2007
520  2007.1036         7.893        50.473       40.276  2007
521  2007.1308         2.065        50.542       38.291  2007
522  2007.1579         1.877        52.641       39.852  2007
523  2007.1851        -2.946        45.414       33.755  2007
524  2007.2122        -4.152        47.918       34.891  2007
..         ...           ...           ...          ...   ...
546  2007.8095        59.060        22.833       36.094  2007
547  2007.8366        51.554        24.945       34.622  2007
548  2007.8638        47.467        36.165       40.161  2007
549  2007.8909        51.270        40.482       45.064  2007
550  2007.9181        43.922        41.457       43.596  2007
551  2007.9452        38.697        40.180       41.739  2007
552  2007.9724        28.495        31.595       33.547  2007
553  2007.9995        21.967        34.136       33.442  2007

[37 rows x 5 columns]

2008
          date  northern_hem  southern_hem  mean_global  year
554  2008.0267        21.058        40.662       37.931  2008
555  2008.0538        19.814        49.348       43.468  2008
556  2008.0810        21.366        46.670       43.161  2008
557  2008.1081         9.819        39.811       35.051  2008
558  2008.1353         3.636        45.203       36.126  2008
559  2008.1624        -1.445        45.221       34.481  2008
560  2008.1896         0.857        52.063       39.613  2008
561  2008.2167         5.715        53.753       42.747  2008
..         ...           ...           ...          ...   ...
582  2008.7868        62.449        26.224       39.684  2008
583  2008.8140        58.994        31.039       41.257  2008
584  2008.8411        57.145        34.752       43.062  2008
585  2008.8683        57.446        38.421       46.017  2008
586  2008.8954        51.724        37.105       43.723  2008
587  2008.9226        47.520        40.971       45.242  2008
588  2008.9497        35.003        40.331       40.793  2008
589  2008.9769        27.860        42.238       40.079  2008

[36 rows x 5 columns]

2009
          date  northern_hem  southern_hem  mean_global  year
590  2009.0040        23.942        45.255       41.273  2009
591  2009.0311        20.181        52.552       44.988  2009
592  2009.0583        19.316        54.624       46.683  2009
593  2009.0854         9.433        48.872       39.889  2009
594  2009.1126         4.119        46.818       36.979  2009
595  2009.1397         3.346        57.280       43.305  2009
596  2009.1669         8.681        59.119       46.849  2009
597  2009.1940        11.278        59.213       48.000  2009
..         ...           ...           ...          ...   ...
619  2009.7913        63.070        38.871       47.685  2009
620  2009.8184        63.075        46.435       52.336  2009
621  2009.8456        62.655        49.685       54.538  2009
622  2009.8727        52.791        45.958       48.978  2009
623  2009.8999        53.320        41.828       47.605  2009
624  2009.9270        46.465        41.179       45.401  2009
625  2009.9542        37.661        49.314       47.678  2009
626  2009.9813        37.620        56.767       53.142  2009

[37 rows x 5 columns]

2010
          date  northern_hem  southern_hem  mean_global  year
627  2010.0085        31.284        53.366       49.536  2010
628  2010.0356        25.411        51.109       46.686  2010
629  2010.0628        12.342        54.702       44.267  2010
630  2010.0899         8.114        50.866       40.904  2010
631  2010.1171        12.005        58.937       47.728  2010
632  2010.1442        19.827        63.416       53.985  2010
633  2010.1714        21.825        64.558       55.642  2010
634  2010.1985        14.246        54.671       46.863  2010
..         ...           ...           ...          ...   ...
656  2010.7958        73.332        35.395       50.165  2010
657  2010.8229        66.836        36.806       48.586  2010
658  2010.8501        58.128        34.763       44.281  2010
659  2010.8772        52.196        33.227       41.655  2010
660  2010.9044        41.361        35.411       39.390  2010
661  2010.9315        41.026        44.128       45.394  2010
662  2010.9587        38.633        45.985       46.572  2010
663  2010.9858        34.141        46.744       46.239  2010

[37 rows x 5 columns]

2011
          date  northern_hem  southern_hem  mean_global  year
664  2011.0130        30.716        42.938       43.566  2011
665  2011.0401        23.483        41.860       40.867  2011
666  2011.0673        13.902        46.066       40.210  2011
667  2011.0944         5.609        58.039       44.603  2011
668  2011.1216        11.143        57.750       47.179  2011
669  2011.1487        12.212        56.681       47.204  2011
670  2011.1759         5.104        51.587       41.495  2011
671  2011.2030        -3.162        52.335       38.762  2011
..         ...           ...           ...          ...   ...
693  2011.8003        72.225        41.945       53.828  2011
694  2011.8274        63.727        35.358       46.806  2011
695  2011.8546        58.443        33.233       43.885  2011
696  2011.8817        49.444        39.057       44.402  2011
697  2011.9089        48.666        52.376       52.858  2011
698  2011.9360        51.497        49.351       53.230  2011
699  2011.9631        45.172        45.614       49.491  2011
700  2011.9903        36.838        48.626       48.917  2011

[37 rows x 5 columns]

2012
          date  northern_hem  southern_hem  mean_global  year
701  2012.0174        34.488        54.170       52.218  2012
702  2012.0446        26.415        59.324       52.830  2012
703  2012.0717        26.632        66.007       57.621  2012
704  2012.0989        19.153        68.557       56.729  2012
705  2012.1260        18.327        66.719       55.610  2012
706  2012.1532        10.008        64.715       51.437  2012
707  2012.1803        11.120        57.088       47.644  2012
708  2012.2075        21.211        58.073       52.460  2012
..         ...           ...           ...          ...   ...
730  2012.8047        71.692        46.084       56.444  2012
731  2012.8319        68.279        49.719       57.525  2012
732  2012.8590        69.288        51.148       59.318  2012
733  2012.8862        68.955        56.288       63.006  2012
734  2012.9133        65.818        58.031       63.669  2012
735  2012.9405        59.783        57.225       61.746  2012
736  2012.9676        53.291        54.447       58.504  2012
737  2012.9948        48.216        61.680       61.747  2012

[37 rows x 5 columns]

2013
          date  northern_hem  southern_hem  mean_global  year
738  2013.0219        41.826        64.677       61.889  2013
739  2013.0491        42.422        66.982       64.321  2013
740  2013.0762        39.225        67.524       63.984  2013
741  2013.1034        31.808        67.212       61.167  2013
742  2013.1305        25.700        65.306       58.161  2013
743  2013.1577        22.666        66.361       57.881  2013
744  2013.1848        22.479        76.215       63.789  2013
745  2013.2120        30.486        74.929       66.444  2013
..         ...           ...           ...          ...   ...
767  2013.8092        78.021        49.384       61.335  2013
768  2013.8364        75.166        51.062       61.503  2013
769  2013.8635        68.294        56.361       62.432  2013
770  2013.8907        65.842        58.415       63.435  2013
771  2013.9178        55.578        58.389       60.160  2013
772  2013.9450        47.291        53.524       54.927  2013
773  2013.9721        45.383        59.871       58.974  2013
774  2013.9993        46.902        63.605       62.804  2013

[37 rows x 5 columns]

2014
          date  northern_hem  southern_hem  mean_global  year
775  2014.0264        40.417        68.653       64.102  2014
776  2014.0536        34.571        67.932       62.125  2014
777  2014.0807        23.058        67.100       57.643  2014
778  2014.1079        21.168        72.272       60.502  2014
779  2014.1350        23.216        74.869       63.246  2014
780  2014.1621        18.486        82.303       66.032  2014
781  2014.1893        21.125        82.769       67.625  2014
782  2014.2164        21.179        74.717       62.877  2014
..         ...           ...           ...          ...   ...
803  2014.7866        83.396        55.520       67.344  2014
804  2014.8137        82.410        58.394       68.780  2014
805  2014.8409        79.392        53.893       65.301  2014
806  2014.8680        70.305        58.685       65.015  2014
807  2014.8951        59.157        55.438       59.381  2014
808  2014.9223        59.013        57.644       61.591  2014
809  2014.9494        63.004        68.561       70.723  2014
810  2014.9766        56.990        70.551       70.522  2014

[36 rows x 5 columns]

2015
          date  northern_hem  southern_hem  mean_global  year
811  2015.0037        49.183        67.742       66.662  2015
812  2015.0309        39.398        69.746       64.804  2015
813  2015.0580        31.524        69.317       62.115  2015
814  2015.0852        33.878        77.533       68.589  2015
815  2015.1123        29.822        81.370       69.745  2015
816  2015.1395        33.825        86.893       75.007  2015
817  2015.1666        34.251        85.865          NaN  2015
818  2015.1938        33.765        75.001          NaN  2015
..         ...           ...           ...          ...   ...
840  2015.7910        96.789        71.575          NaN  2015
841  2015.8182        92.405        74.234          NaN  2015
842  2015.8453        81.278        72.443          NaN  2015
843  2015.8725        79.341        67.228          NaN  2015
844  2015.8996        78.108        74.586          NaN  2015
845  2015.9268        76.599        77.518          NaN  2015
846  2015.9539        69.691        77.618          NaN  2015
847  2015.9811        59.957        76.646          NaN  2015

[37 rows x 5 columns]

2016
          date  northern_hem  southern_hem  mean_global  year
848  2016.0082        46.712        77.585          NaN  2016
849  2016.0354        42.882        82.549          NaN  2016
850  2016.0625        45.051        90.633          NaN  2016
851  2016.0897        43.793        88.760          NaN  2016
852  2016.1168        46.603        90.561          NaN  2016
853  2016.1440        49.218        84.902          NaN  2016
854  2016.1711        44.274        83.829          NaN  2016
855  2016.1983        33.961        86.020          NaN  2016
856  2016.2254        42.199        88.122          NaN  2016
857  2016.2526        45.004        88.090          NaN  2016
858  2016.2797        49.809        87.806          NaN  2016
859  2016.3068        55.401        77.270          NaN  2016
860  2016.3340        50.416        70.916          NaN  2016

We could have done the same with less effort by grouping by the result of a custom function applied to the index. Let's reset the dataframe:



In [128]:

    
mean_sea_level = mean_sea_level.drop(["year"], axis=1).set_index("date")

So that we can do the groupby on the index:



In [129]:

    
sl_grouped_year = mean_sea_level.groupby(int)

Something else that can be done with such an object is to look at its groups attribute to see the labels mapped to the rows involved:



In [130]:

    
sl_grouped_year.groups









    Out[130]:





{1992: [1992.9323, 1992.9595, 1992.9866],
 1993: [1993.0138,
  1993.0409,
  1993.0681,
  1993.0952,
  1993.1223,
  1993.1495,
  1993.1766,
  1993.2038,
  1993.2309,
  1993.2581,
  1993.2852,
  1993.3124,
  1993.3395,
  1993.3667,
  1993.3938,
  1993.421,
  1993.4481,
  1993.4753,
  1993.5024,
  1993.5296,
  1993.5567,
  1993.5839,
  1993.611,
  1993.6382,
  1993.6653,
  1993.6925,
  1993.7196,
  1993.7468,
  1993.7739,
  1993.801,
  1993.8282,
  1993.8553,
  1993.8825,
  1993.9096,
  1993.9368,
  1993.9639,
  1993.9911],
 1994: [1994.0182,
  1994.0454,
  1994.0725,
  1994.0997,
  1994.1268,
  1994.154,
  1994.1811,
  1994.2083,
  1994.2354,
  1994.2626,
  1994.2897,
  1994.3169,
  1994.344,
  1994.3712,
  1994.3983,
  1994.4255,
  1994.4526,
  1994.4798,
  1994.5069,
  1994.534,
  1994.5612,
  1994.5883,
  1994.6155,
  1994.6426,
  1994.6698,
  1994.6969,
  1994.7241,
  1994.7512,
  1994.7784,
  1994.8055,
  1994.8327,
  1994.8598,
  1994.887,
  1994.9141,
  1994.9413,
  1994.9684,
  1994.9956],
 1995: [1995.0227,
  1995.0499,
  1995.077,
  1995.1042,
  1995.1313,
  1995.1585,
  1995.1856,
  1995.2128,
  1995.2399,
  1995.267,
  1995.2942,
  1995.3213,
  1995.3485,
  1995.3756,
  1995.4028,
  1995.4299,
  1995.4571,
  1995.4842,
  1995.5114,
  1995.5385,
  1995.5657,
  1995.5928,
  1995.62,
  1995.6471,
  1995.6743,
  1995.7014,
  1995.7286,
  1995.7557,
  1995.7829,
  1995.81,
  1995.8372,
  1995.8643,
  1995.8915,
  1995.9458,
  1995.9729],
 1996: [1996.0,
  1996.0272,
  1996.0543,
  1996.0815,
  1996.1086,
  1996.1358,
  1996.1629,
  1996.1901,
  1996.2172,
  1996.2444,
  1996.2715,
  1996.2987,
  1996.3258,
  1996.353,
  1996.3801,
  1996.4073,
  1996.4344,
  1996.4616,
  1996.4887,
  1996.5159,
  1996.543,
  1996.5702,
  1996.5973,
  1996.6245,
  1996.6516,
  1996.6788,
  1996.7059,
  1996.733,
  1996.7602,
  1996.7873,
  1996.8145,
  1996.8416,
  1996.8688,
  1996.8959,
  1996.9231,
  1996.9502,
  1996.9774],
 1997: [1997.0045,
  1997.0317,
  1997.0588,
  1997.086,
  1997.1131,
  1997.1403,
  1997.1674,
  1997.1946,
  1997.2217,
  1997.2489,
  1997.276,
  1997.3032,
  1997.3303,
  1997.3575,
  1997.3846,
  1997.4118,
  1997.4389,
  1997.466,
  1997.4932,
  1997.5203,
  1997.5475,
  1997.5746,
  1997.6018,
  1997.6289,
  1997.6561,
  1997.6832,
  1997.7104,
  1997.7375,
  1997.7647,
  1997.7918,
  1997.819,
  1997.8461,
  1997.8733,
  1997.9004,
  1997.9276,
  1997.9547,
  1997.9819],
 1998: [1998.009,
  1998.0362,
  1998.0633,
  1998.0905,
  1998.1176,
  1998.1448,
  1998.1719,
  1998.199,
  1998.2262,
  1998.2533,
  1998.2805,
  1998.3076,
  1998.3348,
  1998.3619,
  1998.3891,
  1998.4162,
  1998.4434,
  1998.4705,
  1998.4977,
  1998.5248,
  1998.552,
  1998.5791,
  1998.6063,
  1998.6334,
  1998.6606,
  1998.6877,
  1998.7149,
  1998.742,
  1998.7692,
  1998.7963,
  1998.8235,
  1998.8506,
  1998.8777,
  1998.9049,
  1998.932,
  1998.9592,
  1998.9863],
 1999: [1999.0135,
  1999.0406,
  1999.0678,
  1999.0949,
  1999.1221,
  1999.1492,
  1999.1764,
  1999.2035,
  1999.2307,
  1999.2578,
  1999.285,
  1999.3121,
  1999.3393,
  1999.3664,
  1999.3936,
  1999.4207,
  1999.4479,
  1999.475,
  1999.5022,
  1999.5293,
  1999.5565,
  1999.5836,
  1999.6107,
  1999.6379,
  1999.665,
  1999.6922,
  1999.7193,
  1999.7465,
  1999.7736,
  1999.8008,
  1999.8279,
  1999.8551,
  1999.8822,
  1999.9094,
  1999.9365,
  1999.9637,
  1999.9908],
 2000: [2000.018,
  2000.0451,
  2000.0723,
  2000.0994,
  2000.1266,
  2000.1537,
  2000.1809,
  2000.208,
  2000.2352,
  2000.2623,
  2000.2895,
  2000.3166,
  2000.3437,
  2000.3709,
  2000.398,
  2000.4252,
  2000.4523,
  2000.4795,
  2000.5066,
  2000.5338,
  2000.5609,
  2000.5881,
  2000.6152,
  2000.6424,
  2000.6695,
  2000.6967,
  2000.7238,
  2000.751,
  2000.7781,
  2000.8053,
  2000.8324,
  2000.8596,
  2000.8867,
  2000.9139,
  2000.941,
  2000.9682,
  2000.9953],
 2001: [2001.0225,
  2001.0496,
  2001.0767,
  2001.1039,
  2001.131,
  2001.1582,
  2001.1853,
  2001.2125,
  2001.2396,
  2001.2668,
  2001.2939,
  2001.3211,
  2001.3482,
  2001.3754,
  2001.4025,
  2001.4297,
  2001.4568,
  2001.484,
  2001.5111,
  2001.5383,
  2001.5654,
  2001.5926,
  2001.6197,
  2001.6469,
  2001.674,
  2001.7012,
  2001.7283,
  2001.7555,
  2001.7826,
  2001.8097,
  2001.8369,
  2001.864,
  2001.8912,
  2001.9183,
  2001.9455,
  2001.9726,
  2001.9998],
 2002: [2002.0269,
  2002.0541,
  2002.0812,
  2002.1084,
  2002.1355,
  2002.1627,
  2002.1898,
  2002.217,
  2002.2441,
  2002.2713,
  2002.2984,
  2002.3256,
  2002.3527,
  2002.3799,
  2002.407,
  2002.4342,
  2002.4613,
  2002.4885,
  2002.5156,
  2002.5427,
  2002.5699,
  2002.597,
  2002.6242,
  2002.6513,
  2002.6785,
  2002.7056,
  2002.7328,
  2002.7599,
  2002.7871,
  2002.8142,
  2002.8414,
  2002.8685,
  2002.8957,
  2002.9228,
  2002.95,
  2002.9771],
 2003: [2003.0043,
  2003.0314,
  2003.0586,
  2003.0857,
  2003.1129,
  2003.14,
  2003.1672,
  2003.1943,
  2003.2214,
  2003.2486,
  2003.2757,
  2003.3029,
  2003.33,
  2003.3572,
  2003.3843,
  2003.4115,
  2003.4386,
  2003.4658,
  2003.4929,
  2003.5201,
  2003.5472,
  2003.5744,
  2003.6015,
  2003.6287,
  2003.6558,
  2003.683,
  2003.7101,
  2003.7373,
  2003.7644,
  2003.7916,
  2003.8187,
  2003.8459,
  2003.873,
  2003.9002,
  2003.9273,
  2003.9544,
  2003.9816],
 2004: [2004.0087,
  2004.0359,
  2004.063,
  2004.0902,
  2004.1173,
  2004.1445,
  2004.1716,
  2004.1988,
  2004.2259,
  2004.2531,
  2004.2802,
  2004.3074,
  2004.3345,
  2004.3617,
  2004.3888,
  2004.416,
  2004.4431,
  2004.4703,
  2004.4974,
  2004.5246,
  2004.5517,
  2004.5789,
  2004.606,
  2004.6332,
  2004.6603,
  2004.6874,
  2004.7146,
  2004.7417,
  2004.7689,
  2004.796,
  2004.8232,
  2004.8503,
  2004.8775,
  2004.9046,
  2004.9318,
  2004.9589,
  2004.9861],
 2005: [2005.0132,
  2005.0404,
  2005.0675,
  2005.0947,
  2005.1218,
  2005.149,
  2005.1761,
  2005.2033,
  2005.2304,
  2005.2576,
  2005.2847,
  2005.3119,
  2005.339,
  2005.3662,
  2005.3933,
  2005.4204,
  2005.4476,
  2005.4747,
  2005.5019,
  2005.529,
  2005.5562,
  2005.5833,
  2005.6105,
  2005.6376,
  2005.6648,
  2005.6919,
  2005.7191,
  2005.7462,
  2005.7734,
  2005.8005,
  2005.8277,
  2005.8548,
  2005.882,
  2005.9091,
  2005.9363,
  2005.9634,
  2005.9906],
 2006: [2006.0177,
  2006.0449,
  2006.072,
  2006.0992,
  2006.1263,
  2006.1534,
  2006.1806,
  2006.2077,
  2006.2349,
  2006.262,
  2006.2892,
  2006.3163,
  2006.3435,
  2006.3706,
  2006.3978,
  2006.4249,
  2006.4521,
  2006.4792,
  2006.5064,
  2006.5335,
  2006.5607,
  2006.5878,
  2006.615,
  2006.6421,
  2006.6693,
  2006.6964,
  2006.7236,
  2006.7507,
  2006.7779,
  2006.805,
  2006.8321,
  2006.8864,
  2006.9136,
  2006.9407,
  2006.9679,
  2006.995],
 2007: [2007.0222,
  2007.0493,
  2007.0765,
  2007.1036,
  2007.1308,
  2007.1579,
  2007.1851,
  2007.2122,
  2007.2394,
  2007.2665,
  2007.2937,
  2007.3208,
  2007.348,
  2007.3751,
  2007.4023,
  2007.4294,
  2007.4566,
  2007.4837,
  2007.5109,
  2007.538,
  2007.5651,
  2007.5923,
  2007.6194,
  2007.6466,
  2007.6737,
  2007.7009,
  2007.728,
  2007.7552,
  2007.7823,
  2007.8095,
  2007.8366,
  2007.8638,
  2007.8909,
  2007.9181,
  2007.9452,
  2007.9724,
  2007.9995],
 2008: [2008.0267,
  2008.0538,
  2008.081,
  2008.1081,
  2008.1353,
  2008.1624,
  2008.1896,
  2008.2167,
  2008.2439,
  2008.271,
  2008.2981,
  2008.3253,
  2008.3524,
  2008.3796,
  2008.4067,
  2008.4339,
  2008.461,
  2008.4882,
  2008.5153,
  2008.5425,
  2008.5696,
  2008.5968,
  2008.6239,
  2008.6511,
  2008.6782,
  2008.7054,
  2008.7325,
  2008.7597,
  2008.7868,
  2008.814,
  2008.8411,
  2008.8683,
  2008.8954,
  2008.9226,
  2008.9497,
  2008.9769],
 2009: [2009.004,
  2009.0311,
  2009.0583,
  2009.0854,
  2009.1126,
  2009.1397,
  2009.1669,
  2009.194,
  2009.2212,
  2009.2483,
  2009.2755,
  2009.3026,
  2009.3298,
  2009.3569,
  2009.3841,
  2009.4112,
  2009.4384,
  2009.4655,
  2009.4927,
  2009.5198,
  2009.547,
  2009.5741,
  2009.6013,
  2009.6284,
  2009.6556,
  2009.6827,
  2009.7099,
  2009.737,
  2009.7642,
  2009.7913,
  2009.8184,
  2009.8456,
  2009.8727,
  2009.8999,
  2009.927,
  2009.9542,
  2009.9813],
 2010: [2010.0085,
  2010.0356,
  2010.0628,
  2010.0899,
  2010.1171,
  2010.1442,
  2010.1714,
  2010.1985,
  2010.2257,
  2010.2528,
  2010.28,
  2010.3071,
  2010.3343,
  2010.3614,
  2010.3886,
  2010.4157,
  2010.4429,
  2010.47,
  2010.4971,
  2010.5243,
  2010.5514,
  2010.5786,
  2010.6057,
  2010.6329,
  2010.66,
  2010.6872,
  2010.7143,
  2010.7415,
  2010.7686,
  2010.7958,
  2010.8229,
  2010.8501,
  2010.8772,
  2010.9044,
  2010.9315,
  2010.9587,
  2010.9858],
 2011: [2011.013,
  2011.0401,
  2011.0673,
  2011.0944,
  2011.1216,
  2011.1487,
  2011.1759,
  2011.203,
  2011.2301,
  2011.2573,
  2011.2844,
  2011.3116,
  2011.3387,
  2011.3659,
  2011.393,
  2011.4202,
  2011.4473,
  2011.4745,
  2011.5016,
  2011.5288,
  2011.5559,
  2011.5831,
  2011.6102,
  2011.6374,
  2011.6645,
  2011.6917,
  2011.7188,
  2011.746,
  2011.7731,
  2011.8003,
  2011.8274,
  2011.8546,
  2011.8817,
  2011.9089,
  2011.936,
  2011.9631,
  2011.9903],
 2012: [2012.0174,
  2012.0446,
  2012.0717,
  2012.0989,
  2012.126,
  2012.1532,
  2012.1803,
  2012.2075,
  2012.2346,
  2012.2618,
  2012.2889,
  2012.3161,
  2012.3432,
  2012.3704,
  2012.3975,
  2012.4247,
  2012.4518,
  2012.479,
  2012.5061,
  2012.5333,
  2012.5604,
  2012.5876,
  2012.6147,
  2012.6419,
  2012.669,
  2012.6961,
  2012.7233,
  2012.7504,
  2012.7776,
  2012.8047,
  2012.8319,
  2012.859,
  2012.8862,
  2012.9133,
  2012.9405,
  2012.9676,
  2012.9948],
 2013: [2013.0219,
  2013.0491,
  2013.0762,
  2013.1034,
  2013.1305,
  2013.1577,
  2013.1848,
  2013.212,
  2013.2391,
  2013.2663,
  2013.2934,
  2013.3206,
  2013.3477,
  2013.3749,
  2013.402,
  2013.4291,
  2013.4563,
  2013.4834,
  2013.5106,
  2013.5377,
  2013.5649,
  2013.592,
  2013.6192,
  2013.6463,
  2013.6735,
  2013.7006,
  2013.7278,
  2013.7549,
  2013.7821,
  2013.8092,
  2013.8364,
  2013.8635,
  2013.8907,
  2013.9178,
  2013.945,
  2013.9721,
  2013.9993],
 2014: [2014.0264,
  2014.0536,
  2014.0807,
  2014.1079,
  2014.135,
  2014.1621,
  2014.1893,
  2014.2164,
  2014.2436,
  2014.2707,
  2014.2979,
  2014.325,
  2014.3522,
  2014.3793,
  2014.4065,
  2014.4336,
  2014.4608,
  2014.4879,
  2014.5151,
  2014.5422,
  2014.5694,
  2014.5965,
  2014.6237,
  2014.6508,
  2014.678,
  2014.7051,
  2014.7323,
  2014.7594,
  2014.7866,
  2014.8137,
  2014.8409,
  2014.868,
  2014.8951,
  2014.9223,
  2014.9494,
  2014.9766],
 2015: [2015.0037,
  2015.0309,
  2015.058,
  2015.0852,
  2015.1123,
  2015.1395,
  2015.1666,
  2015.1938,
  2015.2209,
  2015.2481,
  2015.2752,
  2015.3024,
  2015.3295,
  2015.3567,
  2015.3838,
  2015.411,
  2015.4381,
  2015.4653,
  2015.4924,
  2015.5196,
  2015.5467,
  2015.5739,
  2015.601,
  2015.6281,
  2015.6553,
  2015.6824,
  2015.7096,
  2015.7367,
  2015.7639,
  2015.791,
  2015.8182,
  2015.8453,
  2015.8725,
  2015.8996,
  2015.9268,
  2015.9539,
  2015.9811],
 2016: [2016.0082,
  2016.0354,
  2016.0625,
  2016.0897,
  2016.1168,
  2016.144,
  2016.1711,
  2016.1983,
  2016.2254,
  2016.2526,
  2016.2797,
  2016.3068,
  2016.334]}

How to aggregate the results of this grouping depends on what we want to see: do we want to see averaged over the years? That is so common that it has been implemented directly as a method on the GroupBy object.



In [131]:

    
sl_grouped_year.mean()









    Out[131]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
  
  
    
      1992
      0.556667
      -2.465000
      -6.671500
    
    
      1993
      -3.787108
      -5.485378
      -4.181882
    
    
      1994
      -4.631432
      -3.523946
      -3.668971
    
    
      1995
      -2.709314
      -1.625171
      -1.755452
    
    
      1996
      -1.184541
      -0.204919
      0.775912
    
    
      1997
      1.205108
      8.180189
      7.155545
    
    
      1998
      10.994865
      4.420541
      10.108182
    
    
      1999
      10.244595
      8.327162
      12.321727
    
    
      ...
      ...
      ...
      ...
    
    
      2009
      37.268270
      43.703973
      46.127108
    
    
      2010
      39.910946
      43.347676
      47.285838
    
    
      2011
      36.664865
      43.639108
      46.442865
    
    
      2012
      48.506541
      53.315595
      57.293730
    
    
      2013
      52.047757
      54.749270
      59.685571
    
    
      2014
      51.709861
      60.794194
      63.649583
    
    
      2015
      61.329568
      71.623676
      67.820333
    
    
      2016
      45.794077
      84.387923
      NaN
    
  

25 rows × 3 columns



In [132]:

    
# We can apply any other reduction function or even a dict of functions using aggregate:
sl_grouped_year.aggregate({"mean_global": np.std})









    Out[132]:






  
    
      
      mean_global
    
  
  
    
      1992
      1.207031
    
    
      1993
      3.411971
    
    
      1994
      3.347750
    
    
      1995
      3.486596
    
    
      1996
      3.138087
    
    
      1997
      5.444541
    
    
      1998
      3.565379
    
    
      1999
      2.960703
    
    
      ...
      ...
    
    
      2009
      3.776398
    
    
      2010
      4.000347
    
    
      2011
      4.277106
    
    
      2012
      4.249205
    
    
      2013
      3.276180
    
    
      2014
      3.716325
    
    
      2015
      4.448240
    
    
      2016
      NaN
    
  

25 rows × 1 columns

Another possibility is to transform each group separately, rather than aggregate. For example, here we group over decades and subtract to each value, the average over that decade:



In [133]:

    
sl_grouped_decade = mean_sea_level.groupby(lambda x: int(x/10.))
sl_grouped_decade.groups.keys()









    Out[133]:





[200, 201, 199]



In [134]:

    
sl_grouped_decade.transform(lambda subframe: (subframe - subframe.mean()/subframe.std()))









    Out[134]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
    
      date
      
      
      
    
  
  
    
      1992.9323
      13.652814
      2.395273
      NaN
    
    
      1992.9595
      1.263814
      -7.766727
      -6.207707
    
    
      1992.9866
      -13.439186
      -2.388727
      -7.914707
    
    
      1993.0138
      -24.787186
      0.733273
      -9.604707
    
    
      1993.0409
      -29.295186
      -2.141727
      -12.185707
    
    
      1993.0681
      -32.602186
      -0.519727
      -11.680707
    
    
      1993.0952
      -35.652186
      3.980273
      -9.958707
    
    
      1993.1223
      -26.643186
      6.665273
      -4.103707
    
    
      ...
      ...
      ...
      ...
    
    
      2016.1440
      47.147536
      81.172989
      NaN
    
    
      2016.1711
      42.203536
      80.099989
      NaN
    
    
      2016.1983
      31.890536
      82.290989
      NaN
    
    
      2016.2254
      40.128536
      84.392989
      NaN
    
    
      2016.2526
      42.933536
      84.360989
      NaN
    
    
      2016.2797
      47.738536
      84.076989
      NaN
    
    
      2016.3068
      53.330536
      73.540989
      NaN
    
    
      2016.3340
      48.345536
      67.186989
      NaN
    
  

861 rows × 3 columns

Pivot_table

Pivot table also allows to summarize the information, allowing to convert repeating columns into axes. For example, let's say that we would like to know how many sea level stations are in various european countries. And we would like to group the answers into 2 categories: the stations that have been updated recently (after 2000) and the others.

Let's first extract only entries located (roughly) in Europe.



In [135]:

    
european_filter = ((local_sea_level_stations["Lat"] > 30) & 
                   (local_sea_level_stations["Lat"] < 70) & 
                   (local_sea_level_stations["Lon"] > -10) & 
                   (local_sea_level_stations["Lon"] < 40) 
                   )

# Let's make a copy to work with a new, clean block of memory 
# (if you are interested, try and remove the copy to see the consequences further down...
european_stations = local_sea_level_stations[european_filter].copy()
european_stations["Country"].unique()









    Out[135]:





[ALA, BEL, BGR, HRV, CYP, ..., ESP, SWE, TUR, UKR, GBR]
Length: 37
Categories (37, object): [ALA, BEL, BGR, HRV, ..., SWE, TUR, UKR, GBR]

The columns of our future table should have 2 values, whether the station was updated recently or not. Let's build a column to store that information:



In [136]:

    
european_stations["Recently updated"] = european_stations["Date"] > pd.to_datetime("2000")

Finally, what value will be displayed inside the table. The values should be extracted from a column, pivot_table allowing an aggregation function to be applied when more than 1 value is found for a given case. Each station should count for 1, and we could aggregate multiple stations by summing these ones:



In [137]:

    
european_stations["Number of stations"] = np.ones(len(european_stations))



In [138]:

    
european_stations.sort_values(by="Country")









    Out[138]:






  
    
      
      Station Name
      ID
      Lat
      Lon
      GLOSS ID
      Country
      Date
      Coastline
      Station
      Recently updated
      Number of stations
    
  
  
    
      0
      HELIGMAN
      231
      60.200
      19.300
      NaN
      ALA
      1980-01-01
      60
      251
      False
      1.0
    
    
      1
      KOBBAKLINTAR
      63
      60.033
      19.883
      NaN
      ALA
      1980-01-01
      60
      261
      False
      1.0
    
    
      2
      LEMSTROM
      84
      60.100
      20.017
      NaN
      ALA
      1980-01-01
      60
      271
      False
      1.0
    
    
      3
      FOGLO / DEGERBY
      249
      60.032
      20.385
      NaN
      ALA
      2015-06-10
      60
      281
      True
      1.0
    
    
      116
      ZEEBRUGGE
      470
      51.350
      3.200
      NaN
      BEL
      2014-01-20
      160
      11
      True
      1.0
    
    
      117
      OOSTENDE
      413
      51.233
      2.917
      NaN
      BEL
      2016-06-14
      160
      21
      True
      1.0
    
    
      118
      NIEUWPOORT
      489
      51.150
      2.733
      NaN
      BEL
      2016-05-26
      160
      31
      True
      1.0
    
    
      134
      BOURGAS
      317
      42.483
      27.483
      NaN
      BGR
      1997-08-27
      295
      21
      False
      1.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      1193
      ANTALYA II
      1681
      36.833
      30.617
      NaN
      TUR
      2013-09-07
      310
      52
      True
      1.0
    
    
      1192
      BODRUM II
      1680
      37.033
      27.417
      NaN
      TUR
      2013-09-07
      310
      46
      True
      1.0
    
    
      1191
      MENTES/IZMIR
      1679
      38.433
      26.717
      NaN
      TUR
      2013-09-07
      310
      42
      True
      1.0
    
    
      1190
      ERDEK
      1598
      40.383
      27.850
      NaN
      TUR
      2013-09-07
      310
      38
      True
      1.0
    
    
      1189
      MARMARA EREGLISI
      2010
      40.967
      27.967
      NaN
      TUR
      2010-06-28
      310
      37
      True
      1.0
    
    
      1188
      AMASRA
      1919
      41.433
      32.233
      NaN
      TUR
      2010-06-24
      310
      26
      True
      1.0
    
    
      1187
      IGNEDA
      1926
      41.883
      28.017
      NaN
      TUR
      2013-09-07
      310
      18
      True
      1.0
    
    
      1198
      SEVASTOPOL
      42
      44.617
      33.533
      NaN
      UKR
      1998-03-02
      298
      41
      False
      1.0
    
  

424 rows × 11 columns



In [139]:

    
station_counts = pd.pivot_table(european_stations, index="Country", columns="Recently updated", 
                                values="Number of stations", aggfunc=np.sum)
# Let's remove from the table the countries for which no station was found:
station_counts.dropna(how="all")









    Out[139]:






  
    
      Recently updated
      False
      True
    
    
      Country
      
      
    
  
  
    
      ALA
      3.0
      1.0
    
    
      BEL
      NaN
      3.0
    
    
      BGR
      2.0
      NaN
    
    
      CYP
      1.0
      1.0
    
    
      DEU
      NaN
      9.0
    
    
      DNK
      1.0
      14.0
    
    
      EGY
      1.0
      1.0
    
    
      ESP
      7.0
      38.0
    
    
      ...
      ...
      ...
    
    
      POL
      1.0
      6.0
    
    
      PRT
      6.0
      2.0
    
    
      ROU
      1.0
      NaN
    
    
      RUS
      6.0
      6.0
    
    
      SVN
      1.0
      1.0
    
    
      SWE
      11.0
      26.0
    
    
      TUR
      NaN
      10.0
    
    
      UKR
      1.0
      NaN
    
  

37 rows × 2 columns

QUIZ: Why is there still some countries with no entries?

EXERCISE: How many recently updated stations? Not recently updated stations? Which country has the most stations? Which country has the most recently updated stations?



In [140]:

    
# Your code here

EXERCISE: How would we build the same dataframe with a groupby operation?



In [141]:

    
# Your code here

EXERCISE: Refer to exercises/pivot_table/pivot_tables.py

10. Correlations and regressions

Correlation coefficients

Both Series and dataframes have a corr method to compute the correlation coefficient between series:



In [142]:

    
# Let's see what how the various sea levels are correlated with each other:
mean_sea_level["northern_hem"].corr(mean_sea_level["southern_hem"])









    Out[142]:





0.347989603115341



In [143]:

    
# If series are already grouped into a DataFrame, computing all correlation coeff is trivial:
mean_sea_level.corr()









    Out[143]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
  
  
    
      northern_hem
      1.000000
      0.347990
      0.649231
    
    
      southern_hem
      0.347990
      1.000000
      0.896886
    
    
      mean_global
      0.649231
      0.896886
      1.000000

Note: by default, the method used is the Pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). Other methods are available (kendall, spearman using the method kwarg).



In [144]:

    
# Visualize the correlation matrix
plt.imshow(mean_sea_level.corr(), interpolation="nearest")









    Out[144]:





<matplotlib.image.AxesImage at 0x11a83e3d0>



In [145]:

    
plt.yticks?



In [146]:

    
# let's make it a little better to confirm that learning about global sea level cannot be done from just 
# looking at stations in the northern hemisphere:
plt.imshow(mean_sea_level.corr(), interpolation="nearest")
plt.xticks(np.arange(3), mean_sea_level.corr().columns)
plt.yticks(np.arange(3), mean_sea_level.corr().index)
plt.colorbar()









    Out[146]:





<matplotlib.colorbar.Colorbar at 0x11aa8fe90>

OLS

The recommeded way to build ordinaty least square regressions is by using statsmodels.



In [147]:

    
import statsmodels.formula.api as sm



In [148]:

    
sm_model = sm.ols(formula="mean_global ~ northern_hem + southern_hem", data=mean_sea_level).fit()



In [149]:

    
sm_model.params









    Out[149]:





Intercept       1.468349
northern_hem    0.333625
southern_hem    0.735599
dtype: float64



In [150]:

    
type(sm_model.params)









    Out[150]:





pandas.core.series.Series



In [151]:

    
sm_model.summary()









    Out[151]:





OLS Regression Results

  Dep. Variable:        mean_global      R-squared:             0.988 


  Model:                    OLS          Adj. R-squared:        0.988 


  Method:              Least Squares     F-statistic:        3.188e+04


  Date:              Sun, 10 Jul 2016    Prob (F-statistic):     0.00  


  Time:                  22:17:18        Log-Likelihood:      -1765.7 


  No. Observations:          780         AIC:                   3537. 


  Df Residuals:              777         BIC:                   3551. 


  Df Model:                    2                                      


  Covariance Type:       nonrobust                                    




                  coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept         1.4683      0.138     10.651   0.000      1.198     1.739


  northern_hem      0.3336      0.003    108.839   0.000      0.328     0.340


  southern_hem      0.7356      0.004    191.198   0.000      0.728     0.743




  Omnibus:        86.727    Durbin-Watson:         0.134


  Prob(Omnibus):   0.000    Jarque-Bera (JB):     29.349


  Skew:            0.206    Prob(JB):           4.24e-07


  Kurtosis:        2.144    Cond. No.               75.8



In [152]:

    
plt.figure(figsize=LARGE_FIGSIZE)
mean_sea_level["mean_global"].plot()
sm_model.fittedvalues.plot(label="OLS prediction")
plt.legend(loc="upper left")









    Out[152]:





<matplotlib.legend.Legend at 0x117e76590>

An interlude: data alignment

Converting the floating point date to a timestamp

Now, we would like to look for correlations between our monthly temperatures and the sea levels we have. For this to be possible, some data alignment must be done since the time scales are very different for the 2 datasets.



In [153]:

    
mean_sea_level["mean_global"].index









    Out[153]:





Float64Index([1992.9323, 1992.9595, 1992.9866, 1993.0138, 1993.0409, 1993.0681,
              1993.0952, 1993.1223, 1993.1495, 1993.1766,
              ...
              2016.0897, 2016.1168,  2016.144, 2016.1711, 2016.1983, 2016.2254,
              2016.2526, 2016.2797, 2016.3068,  2016.334],
             dtype='float64', name=u'date', length=861)



In [154]:

    
giss_temp_series.index









    Out[154]:





PeriodIndex(['1880-01', '1880-02', '1880-03', '1880-04', '1880-05', '1880-06',
             '1880-07', '1880-08', '1880-09', '1880-10',
             ...
             '2015-03', '2015-04', '2015-05', '2015-06', '2015-07', '2015-08',
             '2015-09', '2015-10', '2015-11', '2015-12'],
            dtype='int64', length=1632, freq='M')



In [155]:

    
DAYS_PER_YEAR = {}



In [156]:

    
import calendar
# Let's first convert the floating point dates in the sea level to timestamps:
def floating_year_to_timestamp(float_date):
    """ Convert a date as a floating point year number to a pandas timestamp object.
    """
    year = int(float_date)
    days_per_year = 366 if calendar.isleap(year) else 365
    remainder = float_date - year
    daynum = 1 + remainder * (days_per_year - 1)
    daynum = int(round(daynum))
    # Convert day number to month and day
    day = daynum
    month = 1
    while month < 13:
        month_days = calendar.monthrange(year, month)[1]
        if day <= month_days:
            return pd.Timestamp(str(year)+"/"+str(month)+"/"+str(day))
        day -= month_days
        month += 1
    raise ValueError('{} does not have {} days'.format(year, daynum))



In [157]:

    
floating_year_to_timestamp(1996.0), floating_year_to_timestamp(1996.5), floating_year_to_timestamp(1996.9999)









    Out[157]:





(Timestamp('1996-01-01 00:00:00'),
 Timestamp('1996-07-02 00:00:00'),
 Timestamp('1996-12-31 00:00:00'))



In [158]:

    
dt_index = pd.Series(mean_sea_level["mean_global"].index).apply(floating_year_to_timestamp)
dt_index









    Out[158]:





0     1992-12-06
1     1992-12-16
2     1992-12-26
3     1993-01-06
4     1993-01-16
5     1993-01-26
6     1993-02-05
7     1993-02-15
         ...    
853   2016-02-23
854   2016-03-03
855   2016-03-13
856   2016-03-23
857   2016-04-02
858   2016-04-12
859   2016-04-22
860   2016-05-02
Name: date, dtype: datetime64[ns]



In [159]:

    
mean_sea_level = mean_sea_level.reset_index(drop=True)
mean_sea_level.index = dt_index
mean_sea_level









    Out[159]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
    
      date
      
      
      
    
  
  
    
      1992-12-06
      13.717
      2.517
      NaN
    
    
      1992-12-16
      1.328
      -7.645
      -5.818
    
    
      1992-12-26
      -13.375
      -2.267
      -7.525
    
    
      1993-01-06
      -24.723
      0.855
      -9.215
    
    
      1993-01-16
      -29.231
      -2.020
      -11.796
    
    
      1993-01-26
      -32.538
      -0.398
      -11.291
    
    
      1993-02-05
      -35.588
      4.102
      -9.569
    
    
      1993-02-15
      -26.579
      6.787
      -3.714
    
    
      ...
      ...
      ...
      ...
    
    
      2016-02-23
      49.218
      84.902
      NaN
    
    
      2016-03-03
      44.274
      83.829
      NaN
    
    
      2016-03-13
      33.961
      86.020
      NaN
    
    
      2016-03-23
      42.199
      88.122
      NaN
    
    
      2016-04-02
      45.004
      88.090
      NaN
    
    
      2016-04-12
      49.809
      87.806
      NaN
    
    
      2016-04-22
      55.401
      77.270
      NaN
    
    
      2016-05-02
      50.416
      70.916
      NaN
    
  

861 rows × 3 columns

Now, how to align the 2 series? Is this one sampled regularly so that the month temperatures can be upscaled to that frequency?

Computing the difference between successive values

What is the frequency of that new index?



In [160]:

    
dt_index.dtype









    Out[160]:





dtype('<M8[ns]')



In [161]:

    
# What is the frequency of the new index? The numpy way to compute differences between all values doesn't work:
dt_index[1:] - dt_index[:-1]









    Out[161]:





0        NaT
1     0 days
2     0 days
3     0 days
4     0 days
5     0 days
6     0 days
7     0 days
       ...  
853   0 days
854   0 days
855   0 days
856   0 days
857   0 days
858   0 days
859   0 days
860      NaT
Name: date, dtype: timedelta64[ns]

IMPORTANT Note: The above failure is due to the fact that operations between series automatically align them based on their index.



In [162]:

    
# There is a method for shifting values up/down the index:
dt_index.shift()









    Out[162]:





0            NaT
1     1992-12-06
2     1992-12-16
3     1992-12-26
4     1993-01-06
5     1993-01-16
6     1993-01-26
7     1993-02-05
         ...    
853   2016-02-13
854   2016-02-23
855   2016-03-03
856   2016-03-13
857   2016-03-23
858   2016-04-02
859   2016-04-12
860   2016-04-22
Name: date, dtype: datetime64[ns]



In [163]:

    
# So the distances can be computed with 
dt_index - dt_index.shift()









    Out[163]:





0         NaT
1     10 days
2     10 days
3     11 days
4     10 days
5     10 days
6     10 days
7     10 days
        ...  
853   10 days
854    9 days
855   10 days
856   10 days
857   10 days
858   10 days
859   10 days
860   10 days
Name: date, dtype: timedelta64[ns]



In [164]:

    
# Not constant reads apparently. Let's downscale the frequency of the sea levels 
# to monthly, like the temperature reads we have:
monthly_mean_sea_level = mean_sea_level.resample("MS").mean().to_period()
monthly_mean_sea_level









    Out[164]:






  
    
      
      northern_hem
      southern_hem
      mean_global
    
    
      date
      
      
      
    
  
  
    
      1992-12
      0.556667
      -2.465000
      -6.671500
    
    
      1993-01
      -28.830667
      -0.521000
      -10.767333
    
    
      1993-02
      -30.596333
      7.068000
      -5.251333
    
    
      1993-03
      -32.431667
      6.967000
      -5.136333
    
    
      1993-04
      -23.668000
      1.549667
      -1.118000
    
    
      1993-05
      -21.731333
      -2.863000
      -6.319667
    
    
      1993-06
      -4.644333
      -4.418000
      -0.902333
    
    
      1993-07
      0.172667
      -12.684667
      -4.508000
    
    
      ...
      ...
      ...
      ...
    
    
      2015-10
      96.100667
      70.860667
      NaN
    
    
      2015-11
      79.575667
      71.419000
      NaN
    
    
      2015-12
      68.749000
      77.260667
      NaN
    
    
      2016-01
      44.881667
      83.589000
      NaN
    
    
      2016-02
      46.538000
      88.074333
      NaN
    
    
      2016-03
      40.144667
      85.990333
      NaN
    
    
      2016-04
      50.071333
      84.388667
      NaN
    
    
      2016-05
      50.416000
      70.916000
      NaN
    
  

282 rows × 3 columns



In [165]:

    
monthly_mean_sea_level["mean_global"].align(giss_temp_series)









    Out[165]:





(1880-01   NaN
 1880-02   NaN
 1880-03   NaN
 1880-04   NaN
 1880-05   NaN
 1880-06   NaN
 1880-07   NaN
 1880-08   NaN
            ..
 2015-10   NaN
 2015-11   NaN
 2015-12   NaN
 2016-01   NaN
 2016-02   NaN
 2016-03   NaN
 2016-04   NaN
 2016-05   NaN
 Freq: M, Name: mean_global, dtype: float64, 1880-01   -34.0
 1880-02   -27.0
 1880-03   -22.0
 1880-04   -30.0
 1880-05   -16.0
 1880-06   -24.0
 1880-07   -19.0
 1880-08   -12.0
            ... 
 2015-10     NaN
 2015-11     NaN
 2015-12     NaN
 2016-01     NaN
 2016-02     NaN
 2016-03     NaN
 2016-04     NaN
 2016-05     NaN
 Freq: M, Name: Temp anomaly, dtype: float32)



In [166]:

    
giss_temp_series.align?



In [167]:

    
# Now that the series are using the same type and frequency of indexes, to align them is trivial:
monthly_mean_sea_level["mean_global"].align(giss_temp_series, join='inner')









    Out[167]:





(1992-12    -6.671500
 1993-01   -10.767333
 1993-02    -5.251333
 1993-03    -5.136333
 1993-04    -1.118000
 1993-05    -6.319667
 1993-06    -0.902333
 1993-07    -4.508000
              ...    
 2015-05          NaN
 2015-06          NaN
 2015-07          NaN
 2015-08          NaN
 2015-09          NaN
 2015-10          NaN
 2015-11          NaN
 2015-12          NaN
 Freq: M, Name: mean_global, dtype: float64, 1992-12    18.0
 1993-01    34.0
 1993-02    37.0
 1993-03    33.0
 1993-04    22.0
 1993-05    23.0
 1993-06    19.0
 1993-07    24.0
            ... 
 2015-05     NaN
 2015-06     NaN
 2015-07     NaN
 2015-08     NaN
 2015-09     NaN
 2015-10     NaN
 2015-11     NaN
 2015-12     NaN
 Freq: M, Name: Temp anomaly, dtype: float32)



In [168]:

    
aligned_sl, aligned_temp = monthly_mean_sea_level["mean_global"].align(giss_temp_series, join='inner')
aligned_df = pd.DataFrame({"mean_sea_level": aligned_sl, "mean_global_temp": aligned_temp})

The alignment can even be done on an entire dataframe:



In [169]:

    
monthly_mean_sea_level.align(giss_temp_series, axis=0, join='inner')









    Out[169]:





(         northern_hem  southern_hem  mean_global
 1992-12      0.556667     -2.465000    -6.671500
 1993-01    -28.830667     -0.521000   -10.767333
 1993-02    -30.596333      7.068000    -5.251333
 1993-03    -32.431667      6.967000    -5.136333
 1993-04    -23.668000      1.549667    -1.118000
 1993-05    -21.731333     -2.863000    -6.319667
 1993-06     -4.644333     -4.418000    -0.902333
 1993-07      0.172667    -12.684667    -4.508000
 ...               ...           ...          ...
 2015-05     43.444000     71.664750          NaN
 2015-06     59.716000     71.828667          NaN
 2015-07     72.460667     60.819000          NaN
 2015-08     89.157000     62.902000          NaN
 2015-09     93.290333     61.620333          NaN
 2015-10     96.100667     70.860667          NaN
 2015-11     79.575667     71.419000          NaN
 2015-12     68.749000     77.260667          NaN
 
 [277 rows x 3 columns], 1992-12    18.0
 1993-01    34.0
 1993-02    37.0
 1993-03    33.0
 1993-04    22.0
 1993-05    23.0
 1993-06    19.0
 1993-07    24.0
            ... 
 2015-05     NaN
 2015-06     NaN
 2015-07     NaN
 2015-08     NaN
 2015-09     NaN
 2015-10     NaN
 2015-11     NaN
 2015-12     NaN
 Freq: M, Name: Temp anomaly, dtype: float32)



In [170]:

    
aligned_sea_levels, aligned_temp = monthly_mean_sea_level.align(giss_temp_series, axis=0, join='inner')
aligned_monthly_data = aligned_sea_levels.copy()
aligned_monthly_data["global_temp"] = aligned_temp
aligned_monthly_data









    Out[170]:






  
    
      
      northern_hem
      southern_hem
      mean_global
      global_temp
    
  
  
    
      1992-12
      0.556667
      -2.465000
      -6.671500
      18.0
    
    
      1993-01
      -28.830667
      -0.521000
      -10.767333
      34.0
    
    
      1993-02
      -30.596333
      7.068000
      -5.251333
      37.0
    
    
      1993-03
      -32.431667
      6.967000
      -5.136333
      33.0
    
    
      1993-04
      -23.668000
      1.549667
      -1.118000
      22.0
    
    
      1993-05
      -21.731333
      -2.863000
      -6.319667
      23.0
    
    
      1993-06
      -4.644333
      -4.418000
      -0.902333
      19.0
    
    
      1993-07
      0.172667
      -12.684667
      -4.508000
      24.0
    
    
      ...
      ...
      ...
      ...
      ...
    
    
      2015-05
      43.444000
      71.664750
      NaN
      NaN
    
    
      2015-06
      59.716000
      71.828667
      NaN
      NaN
    
    
      2015-07
      72.460667
      60.819000
      NaN
      NaN
    
    
      2015-08
      89.157000
      62.902000
      NaN
      NaN
    
    
      2015-09
      93.290333
      61.620333
      NaN
      NaN
    
    
      2015-10
      96.100667
      70.860667
      NaN
      NaN
    
    
      2015-11
      79.575667
      71.419000
      NaN
      NaN
    
    
      2015-12
      68.749000
      77.260667
      NaN
      NaN
    
  

277 rows × 4 columns

Correlations between sea levels and temperatures



In [171]:

    
aligned_monthly_data.plot(figsize=LARGE_FIGSIZE)









    Out[171]:





<matplotlib.axes._subplots.AxesSubplot at 0x11506fb90>



In [172]:

    
aligned_monthly_data.corr()









    Out[172]:






  
    
      
      northern_hem
      southern_hem
      mean_global
      global_temp
    
  
  
    
      northern_hem
      1.000000
      0.349101
      0.658666
      0.378089
    
    
      southern_hem
      0.349101
      1.000000
      0.901072
      0.597901
    
    
      mean_global
      0.658666
      0.901072
      1.000000
      0.620717
    
    
      global_temp
      0.378089
      0.597901
      0.620717
      1.000000



In [173]:

    
model = sm.ols("southern_hem ~ global_temp", data=aligned_monthly_data)
params = model.fit()
params.rsquared









    Out[173]:





0.35748527157977106

What if we had done the analysis yearly instead of monthly to remove seasonal variations?



In [174]:

    
aligned_yearly_data = aligned_monthly_data.resample("A").mean()
aligned_yearly_data.plot()









    Out[174]:





<matplotlib.axes._subplots.AxesSubplot at 0x1142f03d0>



In [175]:

    
aligned_yearly_data.corr()









    Out[175]:






  
    
      
      northern_hem
      southern_hem
      mean_global
      global_temp
    
  
  
    
      northern_hem
      1.000000
      0.989688
      0.992187
      0.821351
    
    
      southern_hem
      0.989688
      1.000000
      0.994649
      0.820091
    
    
      mean_global
      0.992187
      0.994649
      1.000000
      0.832198
    
    
      global_temp
      0.821351
      0.820091
      0.832198
      1.000000



In [176]:

    
model = sm.ols("southern_hem ~ global_temp", data=aligned_yearly_data).fit()
model.rsquared









    Out[176]:





0.6725492407603153

11. Predictions from auto regression models

An auto-regresssive model fits existing data and build a (potentially predictive) model of the data fitted. We use the timeseries analysis (tsa) submodule of statsmodels to make out-of-sample predictions for the upcoming decades:



In [177]:

    
from statsmodels.tsa.api import AR



In [178]:

    
import statsmodels as sm
# Let's remove seasonal variations by resampling annually
data = giss_temp_series.resample("A").mean().to_timestamp()
ar_model = AR(data, freq='A')
ar_res = ar_model.fit(maxlag=60, disp=True)



In [179]:

    
plt.figure(figsize=LARGE_FIGSIZE)
pred = ar_res.predict(start='1950-1-1', end='2070')
data.plot(style='k', label="Historical Data")
pred.plot(style='r', label="Predicted Data")
plt.ylabel("Temperature variation (0.01 degC)")
plt.legend()









    



/Users/jrocher/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/statsmodels/base/data.py:503: FutureWarning: TimeSeries is deprecated. Please use Series
  return TimeSeries(result, index=self.predict_dates)






    Out[179]:





<matplotlib.legend.Legend at 0x11b500a90>

EXERCISE: Make another auto-regression on the sea level of the Atlantic ocean to estimate how much New York is going to flood in the coming century.

You can find the historical sea levels of the Atlantic ocean at http://sealevel.colorado.edu/files/current/sl_Atlantic_Ocean.txt or locally in data/sea_levels/sl_Atlantic_Ocean.txt.
A little more work but more precise: extract the ID of a station in NewYork from the local_sea_level_stations dataset, and use it to download timeseries in NY (URL would be http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata).



In [180]:

    
# Your code here

	1880 -0.1591
0	1881 -0.0789
1	1882 -0.1313
2	1883 -0.1675
3	1884 -0.2485
4	1885 -0.2042
5	1886 -0.1695
6	1887 -0.2505
7	1888 -0.1605
...	...
122	2003 0.5818
123	2004 0.5416
124	2005 0.6154
125	2006 0.5601
126	2007 0.5472
127	2008 0.4804
128	2009 0.5551
129	2010 -999.0000

	1880	-0.1591
0	1881	-0.0789
1	1882	-0.1313
2	1883	-0.1675
3	1884	-0.2485
4	1885	-0.2042
5	1886	-0.1695
6	1887	-0.2505
7	1888	-0.1605
...	...	...
122	2003	0.5818
123	2004	0.5416
124	2005	0.6154
125	2006	0.5601
126	2007	0.5472
127	2008	0.4804
128	2009	0.5551
129	2010	-999.0000

	year	mean temp
0	1880	-0.1591
1	1881	-0.0789
2	1882	-0.1313
3	1883	-0.1675
4	1884	-0.2485
5	1885	-0.2042
6	1886	-0.1695
7	1887	-0.2505
...	...	...
123	2003	0.5818
124	2004	0.5416
125	2005	0.6154
126	2006	0.5601
127	2007	0.5472
128	2008	0.4804
129	2009	0.5551
130	2010	-999.0000

	mean temp
year
1880	-0.1591
1881	-0.0789
1882	-0.1313
1883	-0.1675
1884	-0.2485
1885	-0.2042
1886	-0.1695
1887	-0.2505
...	...
2003	0.5818
2004	0.5416
2005	0.6154
2006	0.5601
2007	0.5472
2008	0.4804
2009	0.5551
2010	-999.0000

	mean temp
year
1880-01-01 00:00:00	-0.1591
1881-01-01 00:00:00	-0.0789
1882-01-01 00:00:00	-0.1313
1883-01-01 00:00:00	-0.1675
1884-01-01 00:00:00	-0.2485
1885-01-01 00:00:00	-0.2042
1886-01-01 00:00:00	-0.1695
1887-01-01 00:00:00	-0.2505
...	...
2003-01-01	0.5818
2004-01-01	0.5416
2005-01-01	0.6154
2006-01-01	0.5601
2007-01-01	0.5472
2008-01-01	0.4804
2009-01-01	0.5551
2010-01-01	-999.0000

	Year	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	J-D	D-N	DJF	MAM	JJA	SON	Year.1
0	1880	-34	-27	-22	-30	-16	-24	-19	-12	-20	-19	-16	-21	-22	***	****	-23	-18	-18	1880
1	1881	-13	-16	-2	-3	-3	-27	-12	-8	-18	-23	-28	-18	-14	-14	-17	-3	-15	-23	1881
2	1882	3	4	-2	-24	-20	-32	-27	-11	-11	-25	-25	-37	-17	-16	-4	-15	-23	-20	1882
3	1883	-38	-38	-12	-20	-20	-8	-3	-13	-19	-19	-28	-21	-20	-21	-38	-18	-8	-22	1883
4	1884	-20	-14	-31	-36	-33	-36	-31	-24	-29	-25	-29	-25	-28	-28	-18	-33	-31	-28	1884
5	1885	-57	-29	-19	-36	-35	-40	-28	-24	-17	-14	-14	0	-26	-28	-37	-30	-31	-15	1885
6	1886	-37	-43	-34	-23	-21	-30	-13	-21	-12	-22	-29	-18	-25	-24	-26	-26	-21	-21	1886
7	1887	-60	-43	-26	-34	-28	-25	-19	-28	-24	-33	-29	-40	-32	-30	-40	-29	-24	-28	1887
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
135	2009	56	48	49	57	59	62	66	61	64	58	72	58	59	59	52	55	63	65	2009
136	2010	66	75	87	82	71	60	56	59	55	65	74	44	66	67	66	80	58	65	2010
137	2011	45	44	57	60	47	54	70	69	52	60	50	48	55	55	45	55	64	54	2011
138	2012	38	43	52	62	71	59	50	56	68	73	69	46	57	57	43	62	55	70	2012
139	2013	62	52	60	48	56	61	53	61	73	61	75	61	60	59	53	55	58	70	2013
140	2014	68	44	71	72	79	62	50	74	81	78	64	74	68	67	58	74	62	74	2014
141	2015	75	80	84	71	****	****	****	****	****	****	****	****	****	***	76	****	****	****	2015
142	Year	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	J-D	D-N	DJF	MAM	JJA	SON	Year

	year	msl_ib(mm)	#version_2016_rel3
0	1992.9323	13.717	NaN
1	1992.9595	1.328	NaN
2	1992.9866	-13.375	NaN
3	1993.0138	-24.723	NaN
4	1993.0409	-29.231	NaN
5	1993.0681	-32.538	NaN
6	1993.0952	-35.588	NaN
7	1993.1223	-26.579	NaN
...	...	...	...
853	2016.1440	49.218	NaN
854	2016.1711	44.274	NaN
855	2016.1983	33.961	NaN
856	2016.2254	42.199	NaN
857	2016.2526	45.004	NaN
858	2016.2797	49.809	NaN
859	2016.3068	55.401	NaN
860	2016.3340	50.416	NaN

	year	msl_ib(mm)	#version_2016_rel3
0	1992.9323	2.517	NaN
1	1992.9595	-7.645	NaN
2	1992.9866	-2.267	NaN
3	1993.0138	0.855	NaN
4	1993.0409	-2.020	NaN
5	1993.0681	-0.398	NaN
6	1993.0952	4.102	NaN
7	1993.1223	6.787	NaN
...	...	...	...
853	2016.1440	84.902	NaN
854	2016.1711	83.829	NaN
855	2016.1983	86.020	NaN
856	2016.2254	88.122	NaN
857	2016.2526	88.090	NaN
858	2016.2797	87.806	NaN
859	2016.3068	77.270	NaN
860	2016.3340	70.916	NaN

	year	msl_ib_ns(mm)	#version_2015_rel2
0	1992.9595	-5.818	NaN
1	1992.9866	-7.525	NaN
2	1993.0138	-9.215	NaN
3	1993.0409	-11.796	NaN
4	1993.0681	-11.291	NaN
5	1993.0952	-9.569	NaN
6	1993.1223	-3.714	NaN
7	1993.1495	-2.471	NaN
...	...	...	...
772	2014.9494	70.723	NaN
773	2014.9766	70.522	NaN
774	2015.0037	66.662	NaN
775	2015.0309	64.804	NaN
776	2015.0580	62.115	NaN
777	2015.0852	68.589	NaN
778	2015.1123	69.745	NaN
779	2015.1395	75.007	NaN

	Station Name	ID	Lat.	Lon.	GLOSS ID	Country	Date	Coastline	Station
0	HELIGMAN	231	60.200	19.300	NaN	ALA	01/01/1980	60	251
1	KOBBAKLINTAR	63	60.033	19.883	NaN	ALA	01/01/1980	60	261
2	LEMSTROM	84	60.100	20.017	NaN	ALA	01/01/1980	60	271
3	FOGLO / DEGERBY	249	60.032	20.385	NaN	ALA	06/10/2015	60	281
4	PAGO PAGO	539	-14.280	-170.690	144.0	ASM	08/03/2016	745	1
5	BAHIA ESPERANZA	988	-63.300	-56.917	185.0	ATA	11/05/1999	999	1
6	ARGENTINE ISLANDS	913	-65.246	-64.257	188.0	ATA	21/05/2015	999	3
7	PUERTO SOBERANIA	1603	-62.483	-59.633	189.0	ATA	30/09/2004	999	5
...	...	...	...	...	...	...	...	...	...
1458	DANANG	1475	16.100	108.217	NaN	VNM	17/11/2015	605	51
1459	HONNGU	1003	18.800	105.767	NaN	VNM	17/11/2015	605	61
1460	HONDAU	841	20.667	106.800	NaN	VNM	17/11/2015	605	81
1461	CHARLOTTE AMALIE	1393	18.335	-64.920	NaN	VIR	08/03/2016	939	1
1462	LIME TREE BAY, ST CROIX	1447	17.693	-64.753	NaN	VIR	08/03/2016	939	11
1463	CHRISTIANSTED HARBOUR	2118	17.750	-64.705	NaN	VIR	09/03/2016	939	15
1464	LAMESHUR BAY	2119	18.317	-64.723	NaN	VIR	09/03/2016	939	21
1465	ADEN	44	12.788	44.974	3.0	YEM	01/02/2016	485	1

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
Year
2010	66.0	75.0	87.0	82.0	71.0	60.0	56.0	59.0	55.0	65.0	74.0	44.0
2011	45.0	44.0	57.0	60.0	47.0	54.0	70.0	69.0	52.0	60.0	50.0	48.0
2012	38.0	43.0	52.0	62.0	71.0	59.0	50.0	56.0	68.0	73.0	69.0	46.0
2013	62.0	52.0	60.0	48.0	56.0	61.0	53.0	61.0	73.0	61.0	75.0	61.0
2014	68.0	44.0	71.0	72.0	79.0	62.0	50.0	74.0	81.0	78.0	64.0	74.0

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	totals
Year
1880	-34.0	-27.0	-22.0	-30.0	-16.0	-24.0	-19.0	-12.0	-20.0	-19.0	-16.0	-21.0	-260.0
1881	-13.0	-16.0	-2.0	-3.0	-3.0	-27.0	-12.0	-8.0	-18.0	-23.0	-28.0	-18.0	-171.0
1882	3.0	4.0	-2.0	-24.0	-20.0	-32.0	-27.0	-11.0	-11.0	-25.0	-25.0	-37.0	-207.0
1883	-38.0	-38.0	-12.0	-20.0	-20.0	-8.0	-3.0	-13.0	-19.0	-19.0	-28.0	-21.0	-239.0
1884	-20.0	-14.0	-31.0	-36.0	-33.0	-36.0	-31.0	-24.0	-29.0	-25.0	-29.0	-25.0	-333.0
1885	-57.0	-29.0	-19.0	-36.0	-35.0	-40.0	-28.0	-24.0	-17.0	-14.0	-14.0	0.0	-313.0
1886	-37.0	-43.0	-34.0	-23.0	-21.0	-30.0	-13.0	-21.0	-12.0	-22.0	-29.0	-18.0	-303.0
1887	-60.0	-43.0	-26.0	-34.0	-28.0	-25.0	-19.0	-28.0	-24.0	-33.0	-29.0	-40.0	-389.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
2008	22.0	31.0	70.0	48.0	46.0	43.0	54.0	39.0	59.0	62.0	62.0	51.0	587.0
2009	56.0	48.0	49.0	57.0	59.0	62.0	66.0	61.0	64.0	58.0	72.0	58.0	710.0
2010	66.0	75.0	87.0	82.0	71.0	60.0	56.0	59.0	55.0	65.0	74.0	44.0	794.0
2011	45.0	44.0	57.0	60.0	47.0	54.0	70.0	69.0	52.0	60.0	50.0	48.0	656.0
2012	38.0	43.0	52.0	62.0	71.0	59.0	50.0	56.0	68.0	73.0	69.0	46.0	687.0
2013	62.0	52.0	60.0	48.0	56.0	61.0	53.0	61.0	73.0	61.0	75.0	61.0	723.0
2014	68.0	44.0	71.0	72.0	79.0	62.0	50.0	74.0	81.0	78.0	64.0	74.0	817.0
2015	75.0	80.0	84.0	71.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	310.0

	ID	Station Name
1267	1372	FRENCH FRIGATE SHOALS
1268	2192	FRENCH FRIGATE SHOALS B
1269	2129	PORT ALLEN, HANAPEPE BAY, KAUAI ISLAND
1270	756	NAWILIWILI BAY, KAUAI ISLAND
1271	155	HONOLULU
1272	823	MOKUOLOE ISLAND
1273	2273	LANAI ISLAND, KAUMALAPAU
1274	521	KAHULUI HARBOR, MAUI ISLAND
...	...	...
1437	288	SEAVEY ISLAND
1438	2291	FORT POINT, NEWCASTLE ISLAND
1439	183	PORTLAND (MAINE)
1440	1279	ROCKLAND
1441	525	BAR HARBOR, FRENCHMAN BAY, ME
1442	1524	CUTLER II
1443	1081	CUTLER
1444	332	EASTPORT

	Station Name	ID	Lat	Lon	GLOSS ID	Country	Date	Coastline	Station
912	MATI, DAVAO ORIENTAL	2156	6.950	126.217	NaN	PHL	2016-12-05	660	116
1074	BILBAO	1806	43.352	-3.045	NaN	ESP	2016-12-01	200	6
1077	SANTANDER III	1807	43.461	-3.791	NaN	ESP	2016-12-01	200	13
1078	GIJON II	1871	43.558	-5.698	NaN	ESP	2016-12-01	200	22
1082	FERROL 2	2053	43.476	-8.249	NaN	ESP	2016-12-01	200	35
1083	VILLAGARCIA	1897	42.600	-8.767	NaN	ESP	2016-12-01	200	36
1085	VIGO II	1898	42.243	-8.726	NaN	ESP	2016-12-01	200	42
1088	HUELVA	1883	37.132	-6.834	NaN	ESP	2016-12-01	220	5
...	...	...	...	...	...	...	...	...	...
17	PIRAMIDE	867	-42.583	-64.283	NaN	ARG	1980-01-01	860	41
18	ROSALES	870	-38.933	-62.067	NaN	ARG	1980-01-01	860	61
20	MAR DEL PLATA (PUERTO)	857	-38.033	-57.533	192.0	ARG	1980-01-01	860	91
23	RIO SANTIAGO	897	-34.850	-57.900	NaN	ARG	1980-01-01	860	141
26	ISLA MARTIN GARCIA	864	-34.183	-58.250	NaN	ARG	1980-01-01	860	171
0	HELIGMAN	231	60.200	19.300	NaN	ALA	1980-01-01	60	251
1	KOBBAKLINTAR	63	60.033	19.883	NaN	ALA	1980-01-01	60	261
2	LEMSTROM	84	60.100	20.017	NaN	ALA	1980-01-01	60	271

	northern_hem	southern_hem	mean_global
count	861.000000	861.000000	780.000000
mean	25.050923	28.422725	29.166036
std	29.292476	24.730756	21.226033
min	-38.377000	-28.239000	-11.796000
25%	3.810000	8.946000	11.130500
50%	24.487000	27.836000	31.783000
75%	47.140000	46.066000	45.047500
max	99.108000	90.633000	75.007000

	ID	Lat	Lon	GLOSS ID	Coastline	Station
count	1466.000000	1466.000000	1466.000000	281.000000	1466.000000	1466.000000
mean	1069.452251	28.762511	21.889766	165.327402	532.484993	85.937926
std	655.126610	30.313309	92.947697	95.313053	312.892857	120.120620
min	1.000000	-77.850000	-179.367000	3.000000	10.000000	1.000000
25%	518.250000	13.810500	-64.745500	87.000000	220.000000	12.000000
50%	1021.500000	35.700000	14.448500	154.000000	641.000000	41.000000
75%	1629.750000	49.320500	118.023500	242.000000	822.000000	101.000000
max	2304.000000	82.490000	179.217000	351.000000	999.000000	725.000000

	northern_hem	southern_hem	mean_global
1992	0.556667	-2.465000	-6.671500
1993	-3.787108	-5.485378	-4.181882
1994	-4.631432	-3.523946	-3.668971
1995	-2.709314	-1.625171	-1.755452
1996	-1.184541	-0.204919	0.775912
1997	1.205108	8.180189	7.155545
1998	10.994865	4.420541	10.108182
1999	10.244595	8.327162	12.321727
...	...	...	...
2009	37.268270	43.703973	46.127108
2010	39.910946	43.347676	47.285838
2011	36.664865	43.639108	46.442865
2012	48.506541	53.315595	57.293730
2013	52.047757	54.749270	59.685571
2014	51.709861	60.794194	63.649583
2015	61.329568	71.623676	67.820333
2016	45.794077	84.387923	NaN

	mean_global
1992	1.207031
1993	3.411971
1994	3.347750
1995	3.486596
1996	3.138087
1997	5.444541
1998	3.565379
1999	2.960703
...	...
2009	3.776398
2010	4.000347
2011	4.277106
2012	4.249205
2013	3.276180
2014	3.716325
2015	4.448240
2016	NaN

	northern_hem	southern_hem	mean_global
date
1992.9323	13.652814	2.395273	NaN
1992.9595	1.263814	-7.766727	-6.207707
1992.9866	-13.439186	-2.388727	-7.914707
1993.0138	-24.787186	0.733273	-9.604707
1993.0409	-29.295186	-2.141727	-12.185707
1993.0681	-32.602186	-0.519727	-11.680707
1993.0952	-35.652186	3.980273	-9.958707
1993.1223	-26.643186	6.665273	-4.103707
...	...	...	...
2016.1440	47.147536	81.172989	NaN
2016.1711	42.203536	80.099989	NaN
2016.1983	31.890536	82.290989	NaN
2016.2254	40.128536	84.392989	NaN
2016.2526	42.933536	84.360989	NaN
2016.2797	47.738536	84.076989	NaN
2016.3068	53.330536	73.540989	NaN
2016.3340	48.345536	67.186989	NaN

Recently updated	False	True
Country
ALA	3.0	1.0
BEL	NaN	3.0
BGR	2.0	NaN
CYP	1.0	1.0
DEU	NaN	9.0
DNK	1.0	14.0
EGY	1.0	1.0
ESP	7.0	38.0
...	...	...
POL	1.0	6.0
PRT	6.0	2.0
ROU	1.0	NaN
RUS	6.0	6.0
SVN	1.0	1.0
SWE	11.0	26.0
TUR	NaN	10.0
UKR	1.0	NaN

	northern_hem	southern_hem	mean_global
northern_hem	1.000000	0.347990	0.649231
southern_hem	0.347990	1.000000	0.896886
mean_global	0.649231	0.896886	1.000000

Dep. Variable:	mean_global	R-squared:	0.988
Model:	OLS	Adj. R-squared:	0.988
Method:	Least Squares	F-statistic:	3.188e+04
Date:	Sun, 10 Jul 2016	Prob (F-statistic):	0.00
Time:	22:17:18	Log-Likelihood:	-1765.7
No. Observations:	780	AIC:	3537.
Df Residuals:	777	BIC:	3551.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	1.4683	0.138	10.651	0.000	1.198 1.739
northern_hem	0.3336	0.003	108.839	0.000	0.328 0.340
southern_hem	0.7356	0.004	191.198	0.000	0.728 0.743

Omnibus:	86.727	Durbin-Watson:	0.134
Prob(Omnibus):	0.000	Jarque-Bera (JB):	29.349
Skew:	0.206	Prob(JB):	4.24e-07
Kurtosis:	2.144	Cond. No.	75.8

	northern_hem	southern_hem	mean_global
date
1992-12-06	13.717	2.517	NaN
1992-12-16	1.328	-7.645	-5.818
1992-12-26	-13.375	-2.267	-7.525
1993-01-06	-24.723	0.855	-9.215
1993-01-16	-29.231	-2.020	-11.796
1993-01-26	-32.538	-0.398	-11.291
1993-02-05	-35.588	4.102	-9.569
1993-02-15	-26.579	6.787	-3.714
...	...	...	...
2016-02-23	49.218	84.902	NaN
2016-03-03	44.274	83.829	NaN
2016-03-13	33.961	86.020	NaN
2016-03-23	42.199	88.122	NaN
2016-04-02	45.004	88.090	NaN
2016-04-12	49.809	87.806	NaN
2016-04-22	55.401	77.270	NaN
2016-05-02	50.416	70.916	NaN

	northern_hem	southern_hem	mean_global
date
1992-12	0.556667	-2.465000	-6.671500
1993-01	-28.830667	-0.521000	-10.767333
1993-02	-30.596333	7.068000	-5.251333
1993-03	-32.431667	6.967000	-5.136333
1993-04	-23.668000	1.549667	-1.118000
1993-05	-21.731333	-2.863000	-6.319667
1993-06	-4.644333	-4.418000	-0.902333
1993-07	0.172667	-12.684667	-4.508000
...	...	...	...
2015-10	96.100667	70.860667	NaN
2015-11	79.575667	71.419000	NaN
2015-12	68.749000	77.260667	NaN
2016-01	44.881667	83.589000	NaN
2016-02	46.538000	88.074333	NaN
2016-03	40.144667	85.990333	NaN
2016-04	50.071333	84.388667	NaN
2016-05	50.416000	70.916000	NaN

	northern_hem	southern_hem	mean_global	global_temp
northern_hem	1.000000	0.349101	0.658666	0.378089
southern_hem	0.349101	1.000000	0.901072	0.597901
mean_global	0.658666	0.901072	1.000000	0.620717
global_temp	0.378089	0.597901	0.620717	1.000000

	northern_hem	southern_hem	mean_global	global_temp
northern_hem	1.000000	0.989688	0.992187	0.821351
southern_hem	0.989688	1.000000	0.994649	0.820091
mean_global	0.992187	0.994649	1.000000	0.832198
global_temp	0.821351	0.820091	0.832198	1.000000