Last updated: Jul 06 2015

Climate data exploration: a journey through Pandas

Welcome to a demo of Pandas, Python's data analysis package. Our goal is to learn about data analysis and transformation using Pandas while exploring datasets used to analyze climate change.

The story

The overall goal of this demo is to provide the tools needed to try to reproduce some of the analysis done in the IPCC global climate reports published in the last decade (see for example https://www.ipcc.ch/pdf/assessment-report/ar5/syr/SYR_AR5_FINAL_full.pdf).

We are first going to load a few public datasets containing information about global temperature, global and local sea levels, and the global concentration of greenhouse gases like CO2, to see whether there are correlations and how the trends are likely to evolve, assuming no fundamental change in the system. For all these datasets, we will download them, visualize them, clean them, search through them, merge them, resample them, transform them and summarize them.

In the process, we will learn about:

1. Loading data
2. Pandas datastructures
3. Cleaning and formatting data
4. Basic visualization
5. Accessing data
6. Working with dates and times
7. Transforming datasets
8. Statistical analysis
9. Data aggregation and summarization
10. Correlations and regressions
11. Predictions from auto regression models

Some initial setup


In [ ]:
%matplotlib inline
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

from pandas import set_option
set_option("display.max_rows", 16)

LARGE_FIGSIZE = (12, 8)

In [ ]:
# Change this cell to the demo location on YOUR machine
%cd ~/Projects/SciPy2015_pandas_tutorial/demos/climate_timeseries/
%ls

1. Loading data

To find all reading functions in pandas, use IPython's tab completion:


In [ ]:
#pd.read_<TAB>

In [ ]:
pd.read_table?

From a local text file

Let's first load some temperature data which covers all latitudes. Since read_table is supposed to do the job for a text file, let's just try it:


In [ ]:
filename = "data/temperatures/annual.land_ocean.90S.90N.df_1901-2000mean.dat"
full_globe_temp = pd.read_table(filename)
full_globe_temp

There is only 1 column! Let's try again stating that values are separated by any number of spaces:


In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+")
full_globe_temp

There are now 2 columns, but the first row of data was consumed as the column names (1880 and -0.1591)! Let's pass explicit column names instead:


In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"])
full_globe_temp

Since we only have 2 columns, one of which (the year of the record) would be more useful as an index to access the data, let's try using the index_col option:


In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"], 
                                index_col=0)
full_globe_temp

Last step: the index is made of dates. Let's make that explicit:


In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"], 
                                index_col=0, parse_dates=True)
full_globe_temp

From a chunked file

Since any single dataset can contain mistakes, let's load a second file with temperature data. NASA's GISS dataset is written in chunks: look at it in data/temperatures/GLB.Ts+dSST.txt


In [ ]:
giss_temp = pd.read_table("data/temperatures/GLB.Ts+dSST.txt", sep="\s+", skiprows=7,
                          skip_footer=11, engine="python")
giss_temp

QUIZ: What happens if you remove the skiprows? The skip_footer? The engine option?

EXERCISE: Load some readings of CO2 concentrations in the atmosphere from the data/greenhouse_gaz/co2_mm_global.txt data file.


In [ ]:
# Your code here

From a remote text file

So far, we have only loaded temperature datasets. Climate change also affects sea levels around the globe, so let's load some sea level datasets. The University of Colorado posts updated timeseries for mean sea level globally, per hemisphere, or even per ocean, sea, ... Let's download the global one, and the ones for the northern and southern hemispheres.

That will also illustrate that loading text files that are online requires no more work than replacing the filepath by a URL in read_table:


In [ ]:
# Local backup: data/sea_levels/sl_nh.txt
northern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_nh.txt", 
                                   sep="\s+")
northern_sea_level

In [ ]:
# Local backup: data/sea_levels/sl_sh.txt
southern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_sh.txt", 
                                   sep="\s+")
southern_sea_level

In [ ]:
# The 2015 version of the global dataset:
# Local backup: data/sea_levels/sl_ns_global.txt
url = "http://sealevel.colorado.edu/files/2015_rel2/sl_ns_global.txt"
global_sea_level = pd.read_table(url, sep="\s+")
global_sea_level

There is clearly a lot of cleanup to be done on these datasets. See below...

From a local or remote HTML file

To grab more local data about mean sea levels, we can download and extract data about mean sea level stations around the world from the PSMSL (http://www.psmsl.org/). To download and parse all tables in a web page, just give read_html the URL to parse:


In [ ]:
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
# Local backup in data/sea_levels/Obtaining Tide Gauge Data.html
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")

In [ ]:
# there is 1 table on that page which contains metadata about the stations where 
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stations

That table can be used to search for a station in a region of the world we choose, extract an ID for it and download the corresponding time series with the URL http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata
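
For instance, once an ID has been picked from that table, the monthly series could be fetched with read_table directly from that URL. This is only a sketch: the station ID below is made up, and the semicolon separator and missing header are assumptions about the .metdata format, so adjust them to the file you actually get.


In [ ]:
# Sketch only: the station ID is made up, and the separator/column layout of
# the .metdata files are assumptions -- pick a real ID from the table above.
station_id = 12
url = "http://www.psmsl.org/data/obtaining/met.monthly.data/%s.metdata" % station_id
station_monthly = pd.read_table(url, sep=";", header=None)
station_monthly.head()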

2. Pandas DataStructures

Now that we have used read_* functions to load datasets, we need to understand better what kind of objects we got from them in order to learn to work with them.

DataFrame, the pandas 2D structure


In [ ]:
# Type of the object?
type(giss_temp)

In [ ]:
# Internal nature of the object
print(giss_temp.shape)
print(giss_temp.dtypes)

Descriptors for the vertical axis (axis=0)


In [ ]:
giss_temp.index

Descriptors for the horizontal axis (axis=1)


In [ ]:
giss_temp.columns

A lot of information at once including memory usage:


In [ ]:
giss_temp.info()

Series, the pandas 1D structure

A series can be constructed with the pd.Series constructor (passing a list or array of values) or from a DataFrame, by extracting one of its columns.
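
As a quick illustration of the constructor route (the values and index below are made up):


In [ ]:
# An illustrative Series built from scratch: values and index are made up
pd.Series([0.1, -0.2, 0.3], index=["a", "b", "c"], name="example")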


In [ ]:
# Do we already have a series for the full_globe_temp?
type(full_globe_temp)

In [ ]:
# Since there is only one column of values, we can make this a Series without 
# losing information:
full_globe_temp = full_globe_temp["mean temp"]

Core attributes/information:


In [ ]:
print(type(full_globe_temp))
print(full_globe_temp.dtype)
print(full_globe_temp.shape)
print(full_globe_temp.nbytes)

Probably the most important attribute of a Series or DataFrame is its index since we will use that to, well, index into the structures to access the information:


In [ ]:
full_globe_temp.index

NumPy arrays as backend of Pandas

It is always possible to fall back to a good old NumPy array to pass on to scientific libraries that need them: SciPy, scikit-learn, ...


In [ ]:
full_globe_temp.values

In [ ]:
type(full_globe_temp.values)

Creating new DataFrames manually

DataFrames can also be created manually, by grouping several Series together. Let's make a new frame from the 3 sea level datasets we downloaded above. They will be displayed along the same index. Wait, does it make sense to do that?


In [ ]:
# Are they aligned?
southern_sea_level.year == northern_sea_level.year

In [ ]:
# So, are they aligned?
np.all(southern_sea_level.year == northern_sea_level.year)

So the northern hemisphere and southern hemisphere datasets are aligned. What about the global one?


In [ ]:
len(global_sea_level.year) == len(northern_sea_level.year)

For now, let's just build a DataFrame with the 2 hemisphere datasets then. We will come back to add the global one later...


In [ ]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"], 
                               "date": northern_sea_level.year})
mean_sea_level

Note: there are other ways to create DataFrames manually, for example from a 2D numpy array.
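
For instance, a minimal sketch building a DataFrame from a 2D NumPy array (values and labels made up for illustration):


In [ ]:
# Illustrative only: a DataFrame from a 2D NumPy array with made-up labels
arr = np.arange(6).reshape(3, 2)
pd.DataFrame(arr, columns=["col_a", "col_b"], index=["x", "y", "z"])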

The date is still in a regular column, and the numerical index is not that meaningful. We can specify the index of a DataFrame at creation. Let's try:


In [ ]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"]},
                               index = northern_sea_level.year)
mean_sea_level

The fact that this fails shows that Pandas does auto-alignment of values: for each value of the index, it searches each Series for a value with a matching index entry. Since these series have a plain numerical index, no values are found.

Since we know that the order of the values matches the index we chose, we can pass only the Series' values at creation of the DataFrame:


In [ ]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"].values, 
                               "southern_hem": southern_sea_level["msl_ib(mm)"].values},
                               index = northern_sea_level.year)
mean_sea_level

3. Cleaning and formatting data

The datasets that we obtain straight from the reading functions are pretty raw. A lot of pre-processing can be done at read time, but we haven't used the full power of the reading functions. Let's learn to do a lot of cleaning and formatting of the data.

The GISS temperature dataset has a lot of issues too: a useless numerical index, redundant columns, useless rows, a placeholder (****) for missing values, and the wrong dtype for the columns. Let's fix all this:

Renaming columns


In [ ]:
# The columns of the local_sea_level_stations aren't clean: they contain spaces and dots.
local_sea_level_stations.columns

In [ ]:
# Let's clean them up a bit:
local_sea_level_stations.columns = [name.strip().replace(".", "") 
                                    for name in local_sea_level_stations.columns]
local_sea_level_stations.columns

We can also rename an index by setting its name. For example, the index of the mean_sea_level DataFrame could be called "date" since it contains more than just the year:


In [ ]:
mean_sea_level.index.name = "date"
mean_sea_level

Setting missing values

In the full globe dataset, -999.00 was used to indicate that there was no value for that year. Let's search for all these values and replace them with the missing value that Pandas understands: np.nan


In [ ]:
full_globe_temp == -999.000

In [ ]:
full_globe_temp[full_globe_temp == -999.000] = np.nan
full_globe_temp.tail()

Choosing what is the index


In [ ]:
# We didn't set a column as the index of giss_temp at load time; we can do that afterwards:
giss_temp = giss_temp.set_index("Year")
giss_temp.head()

Dropping rows and columns


In [ ]:
# 1 column is redundant with the index: 
giss_temp.columns

In [ ]:
# Let's drop it:
giss_temp = giss_temp.drop("Year.1", axis=1)
giss_temp

In [ ]:
# We can also just select the columns we want to keep:
giss_temp = giss_temp[[u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', 
                       u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']]
giss_temp

In [ ]:
# Let's remove the repeated header rows (Year  Jan ...) left over from the chunked file. They all have "Year" as their index value
giss_temp = giss_temp.drop("Year")
giss_temp

Let's also set **** to a real missing value (np.nan). We could do it using a boolean mask, but that may trigger a pandas warning. Another way to assign based on a boolean condition is to use the where method:


In [ ]:
#giss_temp[giss_temp == "****"] = np.nan
giss_temp = giss_temp.where(giss_temp != "****", np.nan)

In [ ]:
giss_temp.tail()

Adding columns

While building the mean_sea_level DataFrame earlier, we didn't include the values from global_sea_level since the years were not aligned. Adding a column to a DataFrame is as easy as adding an entry to a dictionary. So let's try:


In [ ]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level

The column is full of NaNs again because the auto-alignment feature of Pandas searches for index values like 1992.9323 in the index of the global_sea_level["msl_ib_ns(mm)"] series and does not find them. Let's set its index to these years so that auto-alignment can work for us and figure out which values we have and which are missing:


In [ ]:
global_sea_level = global_sea_level.set_index("year")
global_sea_level["msl_ib_ns(mm)"]

In [ ]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level

EXERCISE: Create a new series containing the average of the 2 hemispheres minus the global value to see if that is close to 0. Work inside the mean_sea_level dataframe first. Then try with the original Series to see what happens with data alignment while doing computations.


In [ ]:
# Your code here

Changing dtype of series

Now that the sea levels are looking pretty good, let's go back to the GISS temperature dataset. Because of the labels (strings) found in the middle of the timeseries, every column was assumed to contain strings (the values were not converted to floating point):


In [ ]:
giss_temp.dtypes

That can be changed after the fact (and after the cleanup) with the astype method of a Series:


In [ ]:
giss_temp["Jan"].astype("float32")

In [ ]:
for col in giss_temp.columns:
    giss_temp.loc[:, col] = giss_temp[col].astype(np.float32)

An index has a dtype just like any Series and that can be changed after the fact too.


In [ ]:
giss_temp.index.dtype

For now, let's change it to an integer so that values can at least be compared properly. We will learn below to change it to a datetime object.


In [ ]:
giss_temp.index = giss_temp.index.astype(np.int32)

Removing missing values

Removing missing values - once they have been converted to np.nan - is very easy. Entries that contain missing values can be removed (dropped), or filled with many strategies.


In [ ]:
full_globe_temp

In [ ]:
full_globe_temp.dropna()

In [ ]:
# This will remove any year that has a missing value. Use how='all' to keep partial years
giss_temp.dropna(how="any").tail()

In [ ]:
giss_temp.fillna(value=0).tail()

In [ ]:
# This fills them with the value from the previous year. See also giss_temp.interpolate
giss_temp.fillna(method="ffill").tail()

Let's also mention the .interpolate method on a Series:


In [ ]:
giss_temp.Aug.interpolate().tail()

For now, we will leave the missing values in all our datasets, because it wouldn't be meaningful to fill them.

EXERCISE: Go back to the reading functions, and learn more about other options that could have allowed us to fold some of these pre-processing steps into the data loading.

4. Basic visualization

Now that the datasets have been formatted, visualizing them is the next logical step and is trivial with Pandas. The first thing to try is to invoke the .plot method to generate a basic visualization (it uses matplotlib under the covers).

Line plots


In [ ]:
full_globe_temp.plot()

In [ ]:
giss_temp.plot(figsize=LARGE_FIGSIZE)

In [ ]:
mean_sea_level.plot(subplots=True, figsize=(16, 12));

Showing distributions information


In [ ]:
# Distributions of mean sea level globally and per hemisphere?
mean_sea_level.plot(kind="kde", figsize=(12, 8))

QUIZ: How can you list the possible kinds of plots that the plot method allows?


In [ ]:
# Distributions of temperature in each month since 1880
giss_temp.boxplot();

Correlations

There are more plot options inside pandas.tools.plotting:


In [ ]:
# Is there correlations between the northern and southern sea level timeseries we loaded?
from pandas.tools.plotting import scatter_matrix
scatter_matrix(mean_sea_level, figsize=LARGE_FIGSIZE);

We will confirm the correlations we think we see further down...

EXERCISE: Refer to exercises/aapl_adj_close_plot/aapl_adj_close_plot.ipynb

5. Accessing data

The general philosophy for accessing values inside a Pandas data structure is that, unlike a NumPy array, which only allows indexing by integer position, a Series allows indexing with the values of its index. That makes the code more readable.

In a series


In [ ]:
full_globe_temp

In [ ]:
# By default [] on a series accesses values using the index, not the location in the series
# print(full_globe_temp[0])  # This would fail!!

In [ ]:
# This index is non-trivial though (will talk more about these datetime objects further down):
full_globe_temp.index.dtype

In [ ]:
first_date = full_globe_temp.index[0]
first_date == pd.Timestamp('1880')

In [ ]:
# By default [] on a series accesses values using the index, not the location in the series
print(full_globe_temp[pd.Timestamp('1880')])
# print(full_globe_temp[0])  # This would fail!!

In [ ]:
# A more explicit way to access values by index label is to use loc:
print(full_globe_temp.loc[pd.Timestamp('1990')])
print(full_globe_temp.iloc[0], full_globe_temp.iloc[-1])

In [ ]:
# Year of the last record?
full_globe_temp.index[-1]

In [ ]:
# New records can be added:
full_globe_temp[pd.Timestamp('2011')] = np.nan

In a dataframe


In [ ]:
# In 2D, same idea, though in a DF [] accesses columns (Series)
giss_temp["Jan"]

In [ ]:
# while .loc and .iloc allow to access individual values, slices or masked selections:
print(giss_temp.loc[1979, "Dec"])

In [ ]:
# Slicing can be done with .loc and .iloc
print(giss_temp.loc[1979, "Jan":"Jun"])  # Note that the end point is included unlike NumPy!!!
print(giss_temp.loc[1979, ::2])

In [ ]:
# Masking can also be used in one or more dimensions. For example, another way to grab every other month for the first year:
mask = [True, False] * 6
print(giss_temp.iloc[0, mask])
print(giss_temp.loc[1880, mask])

In [ ]:
# We could also add a new column like a new entry in a dictionary
giss_temp["totals"] = giss_temp.sum(axis=1)
giss_temp

In [ ]:
# Let's remove this new column, we will learn to do this differently
giss_temp = giss_temp.drop("totals", axis=1)

More complex queries rely on the same concepts. For example, what are the names and IDs of the sea level stations in the USA?


In [ ]:
local_sea_level_stations.columns

In [ ]:
american_stations = local_sea_level_stations["Country"] == "USA"
local_sea_level_stations.loc[american_stations, ["ID", "Station Name"]]

6. Working with dates and times

Let's work some more with full_globe_temp's index since we saw it is special.


In [ ]:
# Its dtype is NumPy's new 'datetime64[ns]':
full_globe_temp.index.dtype

The advantage of having a real datetime index is that operations can be done efficiently on it. Let's add a flag to signal whether the value is before or after Black Tuesday, the October 1929 stock market crash that ushered in the Great Depression:


In [ ]:
black_tuesday = pd.to_datetime('1929-10-29')
full_globe_temp.index > black_tuesday

Timestamps or periods?


In [ ]:
# Convert its index from timestamps to periods: more meaningful since each value was measured and averaged over the whole year...
full_globe_temp.index = full_globe_temp.index.to_period()
full_globe_temp

See also to_timestamp to convert back to timestamps, and its how argument to specify where inside the period to set the timestamp.
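
For instance, a quick sketch converting the period index back (without overwriting our series), anchoring each timestamp at the start of the period:


In [ ]:
# Convert the PeriodIndex back to timestamps, anchored at the start of each year
full_globe_temp.index.to_timestamp(how="start")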

Resampling

Another thing that can be done is to resample the series, either downsampling or upsampling. Let's see the series upsampled to a monthly frequency, then downsampled to 10-year blocks:


In [ ]:
# Frequencies can be specified as strings: "us", "ms", "S", "T", "H", "D", "B", "W", "M", "A", "3min", "2h20", ...
# More aliases at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
full_globe_temp.resample("M")

In [ ]:
full_globe_temp.resample("10A", how="mean")

Generating DatetimeIndex objects

The index of giss_temp isn't made of datetime objects, so we may want to generate such DatetimeIndex objects ourselves. This can be done with date_range and period_range:


In [ ]:
# Can specify a start date and a number of values desired. By default it will assume an interval of 1 day:
pd.date_range('1/1/1880', periods=4)

In [ ]:
# Can also specify a start and a stop date, as well as a frequency
pd.date_range('1/1/1880', '1/1/2016', freq="A")

Note that "A" by default means the end of the year. Other times in the year can be specified with "AS" (start), "A-JAN" or "A-JUN". Even more options can be imported from pandas.tseries.offsets:


In [ ]:
from pandas.tseries.offsets import YearBegin
pd.date_range('1/1/1880', '1/1/2015', freq=YearBegin())
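
The string aliases mentioned above work too; for instance "AS" anchors each date at the start of the year:


In [ ]:
# Same dates as with YearBegin(), using the "AS" (year start) alias
pd.date_range('1/1/1880', '1/1/2015', freq="AS")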

Actually, we will later convert that dataset to a 1D dataset with a monthly index, so let's build a monthly period index:


In [ ]:
giss_temp_index = pd.period_range('1/1/1880', '12/1/2015', freq="M")
giss_temp_index

7. Transforming datasets: apply, sort, stack/unstack and transpose

Let's look at our local_sea_level_stations dataset some more, to learn more about it and also do some formatting. What is the range of dates and latitudes we have, the list of countries, the range of variations, ...?


In [ ]:
# What about the range of dates?
local_sea_level_stations["Date"].min(), local_sea_level_stations["Date"].max(), local_sea_level_stations["Date"].iloc[-1]

In [ ]:
local_sea_level_stations.dtypes

Apply: transforming Series

We don't see the range of dates because the dates have dtype "object" (usually meaning strings). Let's convert them using apply:


In [ ]:
local_sea_level_stations["Date"].apply(pd.to_datetime)

This apply method is very powerful and general. We have used it to do something we could have done with astype, but any custom function can be provided to apply.
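
As a small illustration with a custom function (made up here just to show the mechanism), we could compute the length of each station name:


In [ ]:
# Any function can be passed to apply; a made-up example computing name lengths
local_sea_level_stations["Station Name"].apply(len)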


In [ ]:
local_sea_level_stations["Date"] = local_sea_level_stations["Date"].apply(pd.to_datetime)

# Now we can really compare the dates, and therefore get a real range:
print(local_sea_level_stations["Date"].min(), local_sea_level_stations["Date"].max())

EXERCISE: Use the apply method to search through the stations names for a station in New York. What is the ID of the station?


In [ ]:
# Your code here

Now that we know the range of dates, we can look at the data sorted by date using sort:


In [ ]:
local_sea_level_stations.sort("Date")

Since many stations were last updated on the same dates, it is logical to want to sort further, for example by Country within a given date:


In [ ]:
local_sea_level_stations.sort(["Date", "Country"], ascending=False)

Stack and unstack

Let's look at the GISS dataset differently. Instead of seeing the months along axis 1 and the years along axis 0, it would be good to convert these into an outer and an inner level along a single time dimension.

Stacking and unstacking allow converting a DataFrame into a Series and vice versa:


In [ ]:
giss_temp.unstack?
unstacked = giss_temp.unstack()
unstacked

In [ ]:
# Note the nature of the result:
type(unstacked)

The result is grouped in the wrong order, since the axis that was unstacked becomes the outermost level. Another transformation that would help us is transposing...


In [ ]:
giss_temp.transpose()

In [ ]:
giss_temp_series = giss_temp.transpose().unstack()
giss_temp_series.name = "Temp anomaly"
giss_temp_series

A side note: Multi-indexes


In [ ]:
# Note the nature of the resulting index:
giss_temp_series.index

In [ ]:
# It is an index made of 2 columns. Let's fix the fact that one of them doesn't have a name:
giss_temp_series.index = giss_temp_series.index.set_names(["year", "month"])

In [ ]:
# We can now access deviations by specifying the year and month:
giss_temp_series[1980, "Jan"]

But this new multi-index isn't very good, because it is not viewed as one date, just as a tuple of values:


In [ ]:
giss_temp_series.plot(figsize=LARGE_FIGSIZE)

To improve on this, let's reuse the index we generated above with period_range:


In [ ]:
giss_temp_series.index = giss_temp_index
giss_temp_series.plot(figsize=LARGE_FIGSIZE)

8. Statistical analysis

Descriptive statistics

Let's temporarily go back to the DataFrame version of the GISS temperature dataset to analyze anomalies month by month. Like most functions on a DataFrame, stats functions are computed column by column. They also ignore missing values:


In [ ]:
monthly_averages = giss_temp.mean()
monthly_averages

It is possible to apply stats functions across rows instead of columns using the axis keyword (just like in NumPy).


In [ ]:
yearly_averages = giss_temp.mean(axis=1)
yearly_averages

describe provides many descriptive stats computed at once:


In [ ]:
mean_sea_level.describe()

Rolling statistics

Let's remove high frequency signal and extract the trend:


In [ ]:
full_globe_temp.plot()
pd.rolling_mean(full_globe_temp, 10).plot(figsize=LARGE_FIGSIZE)

In [ ]:
# To see what all can be done while rolling, 
#pd.rolling_<TAB>

Describing categorical series

Let's look at our local_sea_level_stations dataset some more:


In [ ]:
local_sea_level_stations.describe()

.describe() only displays information about numerical Series. What about categorical ones?


In [ ]:
local_sea_level_stations.columns

In [ ]:
local_sea_level_stations["Country"]

In [ ]:
local_sea_level_stations["Country"].describe()

In [ ]:
# List of unique values:
local_sea_level_stations["Country"].unique()

In [ ]:
local_sea_level_stations["Country"].value_counts()

In [ ]:
# To save memory, we can convert it to a categorical column:
local_sea_level_stations["Country"] = local_sea_level_stations["Country"].astype("category")

We can also create categorical series from continuous ones with the cut function:


In [ ]:
categorized = pd.cut(full_globe_temp, 3, labels=["L", "M", "H"])
categorized

In [ ]:
# The advantage is that we can use labels and control the order they should be treated in (L < M < H)
categorized.cat.categories

QUIZ: How much memory did we save? What if it was categorized but with dtype object instead of category?

9. Data Aggregation/summarization

Now that we have a good grasp on our datasets, let's transform and analyze them some more to prepare them for comparison. The two functionalities to learn about here are groupby and pivot_table.

GroupBy

Let's explore the sea levels, first splitting into calendar years to compute average sea levels for each year:


In [ ]:
mean_sea_level

In [ ]:
mean_sea_level = mean_sea_level.reset_index()
mean_sea_level

In [ ]:
# Groupby with pandas can be done on a column or by applying a custom function to the index. 
# If we want to group the data by year, we can build a year column into the DF:
mean_sea_level["year"] = mean_sea_level["date"].apply(int)
mean_sea_level

In [ ]:
sl_grouped_year = mean_sea_level.groupby("year")

What kind of object did we create?


In [ ]:
type(sl_grouped_year)

What to do with that strange GroupBy object? We can first loop over it to get the labels and the sub-dataframes for each group:


In [ ]:
for group_name, subdf in sl_grouped_year:
    print(group_name)
    print(subdf)
    print("")

We could have done the same with less effort by grouping by the result of a custom function applied to the index. Let's reset the dataframe:


In [ ]:
mean_sea_level = mean_sea_level.drop(["year"], axis=1).set_index("date")

So that we can do the groupby on the index:


In [ ]:
sl_grouped_year = mean_sea_level.groupby(int)

Something else that can be done with such an object is to look at its groups attribute to see the labels mapped to the rows involved:


In [ ]:
sl_grouped_year.groups

How to aggregate the results of this grouping depends on what we want to see: do we want the values averaged over each year? That is so common that it has been implemented directly as a method on the GroupBy object.


In [ ]:
sl_grouped_year.mean()

In [ ]:
# We can apply any other reduction function or even a dict of functions using aggregate:
sl_grouped_year.aggregate({"mean_global": np.std})

Another possibility is to transform each group separately, rather than aggregate. For example, here we group over decades and subtract from each value the average over that decade:


In [ ]:
sl_grouped_decade = mean_sea_level.groupby(lambda x: int(x/10.))
sl_grouped_decade.groups.keys()

In [ ]:
sl_grouped_decade.transform(lambda subframe: subframe - subframe.mean())

Pivot_table

Pivot tables also allow summarizing the information, converting repeating column values into axes. For example, let's say that we would like to know how many sea level stations there are in various European countries, and we would like to group the answers into 2 categories: the stations that have been updated recently (after 2000) and the others.

Let's first extract only entries located (roughly) in Europe.


In [ ]:
european_filter = ((local_sea_level_stations["Lat"] > 30) & 
                   (local_sea_level_stations["Lat"] < 70) & 
                   (local_sea_level_stations["Lon"] > -10) & 
                   (local_sea_level_stations["Lon"] < 40) 
                   )

# Let's make a copy to work with a new, clean block of memory 
# (if you are interested, try removing the copy to see the consequences further down...)
european_stations = local_sea_level_stations[european_filter].copy()
european_stations["Country"].unique()

The columns of our future table should have 2 values, whether the station was updated recently or not. Let's build a column to store that information:


In [ ]:
european_stations["Recently updated"] = european_stations["Date"] > pd.to_datetime("2000")

Finally, what value will be displayed inside the table? The values are extracted from a column, and pivot_table applies an aggregation function when more than 1 value is found for a given cell. Each station should count for 1, and we can aggregate multiple stations by summing these ones:


In [ ]:
european_stations["Number of stations"] = np.ones(len(european_stations))

In [ ]:
european_stations.sort("Country")

In [ ]:
station_counts = pd.pivot_table(european_stations, index="Country", columns="Recently updated", 
                                values="Number of stations", aggfunc=np.sum)
# Let's remove from the table the countries for which no station was found:
station_counts.dropna(how="all")

QUIZ: Why are there still some countries with no entries?

EXERCISE: How many recently updated stations? Not recently updated stations? Which country has the most stations? Which country has the most recently updated stations?


In [ ]:
# Your code here

EXERCISE: How would we build the same dataframe with a groupby operation?


In [ ]:
# Your code here

EXERCISE: Refer to exercises/pivot_table/pivot_tables.py

10. Correlations and regressions

Correlation coefficients

Both Series and dataframes have a corr method to compute the correlation coefficient between series:


In [ ]:
# Let's see how the various sea levels are correlated with each other:
mean_sea_level["northern_hem"].corr(mean_sea_level["southern_hem"])

In [ ]:
# If series are already grouped into a DataFrame, computing all correlation coeff is trivial:
mean_sea_level.corr()

Note: by default, the method used is the Pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). Other methods (kendall, spearman) are available via the method kwarg.
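
For instance, the same correlation matrix computed with the rank-based Spearman method:


In [ ]:
# Same correlation matrix, using rank-based (Spearman) correlations
mean_sea_level.corr(method="spearman")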


In [ ]:
# Visualize the correlation matrix
plt.imshow(mean_sea_level.corr(), interpolation="nearest")

In [ ]:
plt.yticks?

In [ ]:
# let's make it a little better to confirm that learning about global sea level cannot be done from just 
# looking at stations in the northern hemisphere:
plt.imshow(mean_sea_level.corr(), interpolation="nearest")
plt.xticks(np.arange(3), mean_sea_level.corr().columns)
plt.yticks(np.arange(3), mean_sea_level.corr().index)
plt.colorbar()

OLS

There are 2 OLS object constructors, one inside Pandas and one inside statsmodels. There has been talk about merging the 2 into statsmodels, but that hasn't happened yet. OLS in statsmodels allows more complex formulas:


In [ ]:
import statsmodels.formula.api as sm

In [ ]:
sm_model = sm.ols(formula="mean_global ~ northern_hem + southern_hem", data=mean_sea_level).fit()

In [ ]:
sm_model.params

In [ ]:
type(sm_model.params)

In [ ]:
sm_model.summary()

In [ ]:
mean_sea_level["mean_global"].plot()
sm_model.fittedvalues.plot(label="OLS prediction")
plt.legend(loc="upper left")

OLS in pandas requires passing a y series and an x series to do a fit of the form y ~ x. But the formula can be more complex: providing a DataFrame for x reproduces a formula of the form y ~ x1 + x2.

Also, OLS in pandas allows rolling and expanding fits:


In [ ]:
from pandas.stats.api import ols as pdols

In [ ]:
# Same fit as above:
pd_model = pdols(y=mean_sea_level["mean_global"], x=mean_sea_level[["northern_hem", "southern_hem"]])
pd_model

In [ ]:
plt.figure(figsize=LARGE_FIGSIZE)
mean_sea_level["mean_global"].plot()
pd_model.predict().plot(label="OLS prediction")
plt.legend(loc="upper left")

An interlude: data alignment

Converting the floating point date to a timestamp

Now, we would like to look for correlations between our monthly temperatures and the sea levels we have. For this to be possible, some data alignment must be done since the time scales are very different for the 2 datasets.


In [ ]:
mean_sea_level["mean_global"].index

In [ ]:
giss_temp_series.index

In [ ]:
import calendar
# Let's first convert the floating point dates in the sea level to timestamps:
def floating_year_to_timestamp(float_date):
    """ Convert a date as a floating point year number to a pandas timestamp object.
    """
    year = int(float_date)
    days_per_year = 366 if calendar.isleap(year) else 365
    remainder = float_date - year
    daynum = 1 + remainder * (days_per_year - 1)
    daynum = int(round(daynum))
    # Convert day number to month and day
    day = daynum
    month = 1
    while month < 13:
        month_days = calendar.monthrange(year, month)[1]
        if day <= month_days:
            return pd.Timestamp(str(year)+"/"+str(month)+"/"+str(day))
        day -= month_days
        month += 1
    raise ValueError('{} does not have {} days'.format(year, daynum))

In [ ]:
floating_year_to_timestamp(1996.0), floating_year_to_timestamp(1996.5), floating_year_to_timestamp(1996.9999)

In [ ]:
dt_index = pd.Series(mean_sea_level["mean_global"].index).apply(floating_year_to_timestamp)
dt_index

In [ ]:
mean_sea_level = mean_sea_level.reset_index(drop=True)
mean_sea_level.index = dt_index
mean_sea_level

Now, how do we align the 2 series? Is this one sampled regularly, so that the monthly temperatures can be upsampled to that frequency?

Computing the difference between successive values

What is the frequency of that new index?


In [ ]:
dt_index.dtype

In [ ]:
# What is the frequency of the new index? The numpy way to compute differences between all values doesn't work:
dt_index[1:] - dt_index[:-1]

IMPORTANT Note: The above failure is due to the fact that operations between series automatically align them based on their index.


In [ ]:
# There is a method for shifting values up/down the index:
dt_index.shift()

In [ ]:
# So the distances can be computed with 
dt_index - dt_index.shift()

In [ ]:
# Not regularly spaced readings apparently. Let's downsample the sea levels 
# to a monthly frequency, like the temperature readings we have:
monthly_mean_sea_level = mean_sea_level.resample("MS").to_period()
monthly_mean_sea_level

In [ ]:
monthly_mean_sea_level["mean_global"].align(giss_temp_series)

In [ ]:
giss_temp_series.align?

In [ ]:
# Now that the series are using the same type and frequency of indexes, to align them is trivial:
monthly_mean_sea_level["mean_global"].align(giss_temp_series, join='inner')

In [ ]:
aligned_sl, aligned_temp = monthly_mean_sea_level["mean_global"].align(giss_temp_series, join='inner')
aligned_df = pd.DataFrame({"mean_sea_level": aligned_sl, "mean_global_temp": aligned_temp})

The alignment can even be done on an entire dataframe:


In [ ]:
monthly_mean_sea_level.align(giss_temp_series, axis=0, join='inner')

In [ ]:
aligned_sea_levels, aligned_temp = monthly_mean_sea_level.align(giss_temp_series, axis=0, join='inner')
aligned_monthly_data = aligned_sea_levels.copy()
aligned_monthly_data["global_temp"] = aligned_temp
aligned_monthly_data

Correlations between sea levels and temperatures


In [ ]:
aligned_monthly_data.plot(figsize=LARGE_FIGSIZE)

In [ ]:
aligned_monthly_data.corr()

In [ ]:
model = sm.ols("southern_hem ~ global_temp", data=aligned_monthly_data).fit()
model.rsquared

What if we had done the analysis yearly instead of monthly to remove seasonal variations?


In [ ]:
aligned_yearly_data = aligned_monthly_data.resample("A")
aligned_yearly_data.plot()

In [ ]:
aligned_yearly_data.corr()

In [ ]:
model = sm.ols("southern_hem ~ global_temp", data=aligned_yearly_data).fit()
model.rsquared

11. Predictions from auto regression models

An auto-regressive model fits existing data and builds a (potentially predictive) model of the fitted data. We use the timeseries analysis (tsa) submodule of statsmodels to make out-of-sample predictions for the upcoming decades:


In [ ]:
import statsmodels.api as sm
# Let's remove seasonal variations by resampling annually
data = giss_temp_series.resample("A").to_timestamp()
ar_model = sm.tsa.AR(data, freq='A')
ar_res = ar_model.fit(maxlag=60, disp=True)

In [ ]:
plt.figure(figsize=LARGE_FIGSIZE)
pred = ar_res.predict(start='1950-1-1', end='2070')
data.plot(style='k', label="Historical Data")
pred.plot(style='r', label="Predicted Data")
plt.ylabel("Temperature variation (0.01 degC)")
plt.legend()

EXERCISE: Make another auto-regression on the sea level of the Atlantic ocean to estimate how much New York is going to flood in the coming century.

  1. You can find the historical sea levels of the Atlantic ocean at http://sealevel.colorado.edu/files/current/sl_Atlantic_Ocean.txt or locally in data/sea_levels/sl_Atlantic_Ocean.txt.
  2. A little more work but more precise: extract the ID of a station in New York from the local_sea_level_stations dataset, and use it to download the timeseries in NY (the URL would be http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata).

In [ ]:
# Your code here

Want to practice more?

EXERCISE (computations): Refer to exercises/stock_returns/stock_returns.py

EXERCISE (stats, groupby, timeseries): Refer to exercises/pandas_wind_statistics/pandas_wind_statistics.py


In [ ]: