Last updated: June 29th 2016
Welcome to a demo of Python's data analysis package, Pandas. Our goal is to learn about data analysis and transformation using Pandas while exploring datasets used to analyze climate change.
The overarching goal of this demo is to provide the tools to reproduce some of the analyses done in the IPCC global climate reports published in the last decade (see for example https://www.ipcc.ch/pdf/assessment-report/ar5/syr/SYR_AR5_FINAL_full.pdf).
We are first going to load a few public datasets containing information about global temperature, global and local sea level information, and global concentration of greenhouse gases like CO2, to see if there are correlations and how the trends are likely to evolve, assuming no fundamental change in the system. For all these datasets, we will download them, visualize them, clean them, search through them, merge them, resample them, transform them and summarize them.
In the process, we will learn about:
Part 1:
1. Loading data
2. Pandas data structures
3. Cleaning and formatting data
4. Basic visualization
Part 2:
5. Accessing data
6. Working with dates and times
7. Transforming datasets
8. Statistical analysis
9. Data aggregation and summarization
10. Correlations and regressions
11. Predictions from autoregressive models
In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option("display.max_rows", 16)
LARGE_FIGSIZE = (12, 8)
In [ ]:
# Change this cell to the demo location on YOUR machine
%cd ~/Projects/pandas_tutorial/climate_timeseries/
%ls
For more details, see http://pandas.pydata.org/pandas-docs/stable/io.html
To find all reading functions in pandas, ask IPython's tab completion:
In [ ]:
#pd.read_<TAB>
In [ ]:
pd.read_table?
Let's first load some temperature data which covers all latitudes. Since read_table is supposed to do the job for a text file, let's just try it:
In [ ]:
filename = "data/temperatures/annual.land_ocean.90S.90N.df_1901-2000mean.dat"
full_globe_temp = pd.read_table(filename)
full_globe_temp
There is only 1 column! Let's try again, specifying that values are separated by any number of spaces:
In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+")
full_globe_temp
Now there are 2 columns, but their names are 1880 and -0.1591: the first row of data was consumed as the header! Let's pass column names explicitly:
In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"])
full_globe_temp
Since we only have 2 columns, and one of them (the year of the record) would give nicer access to the data, let's make it the index using the index_col option:
In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"],
                                index_col=0)
full_globe_temp
Last step: the index is made of dates. Let's make that explicit:
In [ ]:
full_globe_temp = pd.read_table(filename, sep="\s+", names=["year", "mean temp"],
                                index_col=0, parse_dates=True)
full_globe_temp
Since every dataset can contain mistakes, let's load a different file with temperature data. NASA's GISS dataset is written in chunks: look at it in data/temperatures/GLB.Ts+dSST.txt
In [ ]:
giss_temp = pd.read_table("data/temperatures/GLB.Ts+dSST.txt", sep="\s+", skiprows=7,
                          skipfooter=11, engine="python")
giss_temp
QUIZ: What happens if you remove skiprows? skipfooter? engine?
EXERCISE: Load some readings of CO2 concentrations in the atmosphere from the data/greenhouse_gaz/co2_mm_global.txt data file.
In [ ]:
# Your code here
So far, we have only loaded temperature datasets. Climate change also affects the sea levels on the globe. Let's load some datasets with the sea levels. The University of Colorado posts updated timeseries for mean sea level globally, per hemisphere, or even per ocean, sea, ... Let's download the global one, and the ones for the northern and southern hemisphere.
That will also illustrate that to load text files that are online, there is no more work than replacing the filepath by a URL in read_table:
In [ ]:
# Local backup: data/sea_levels/sl_nh.txt
northern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_nh.txt",
sep="\s+")
northern_sea_level
In [ ]:
# Local backup: data/sea_levels/sl_sh.txt
southern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_sh.txt",
sep="\s+")
southern_sea_level
In [ ]:
# The 2015 version of the global dataset:
# Local backup: data/sea_levels/sl_ns_global.txt
url = "http://sealevel.colorado.edu/files/2015_rel2/sl_ns_global.txt"
global_sea_level = pd.read_table(url, sep="\s+")
global_sea_level
There is clearly a lot of cleanup to be done on these datasets. See below...
To be able to grab more local data about mean sea levels, we can download and extract data about mean sea level stations around the world from the PSMSL (http://www.psmsl.org/). Again, to download and parse all the tables in a webpage, just give read_html the URL to parse:
In [ ]:
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
# Local backup in data/sea_levels/Obtaining Tide Gauge Data.html
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")
In [ ]:
# there is 1 table on that page which contains metadata about the stations where
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stations
That table can be used to search for a station in a region of the world we choose, extract an ID for it and download the corresponding time series with the URL http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata
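For example, here is a sketch of that workflow. The column names ("Station Name", "ID") and the station searched for are assumptions about the scraped table; check them against local_sea_level_stations.columns before relying on this:
In [ ]:
# A hypothetical sketch: find a station by name and build the URL of its monthly data.
# "Station Name" and "ID" are assumed column names of the scraped table.
mask = local_sea_level_stations["Station Name"].str.contains("BREST", na=False)
station_id = local_sea_level_stations.loc[mask, "ID"].iloc[0]
url = "http://www.psmsl.org/data/obtaining/met.monthly.data/%s.metdata" % station_id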
For more details, see http://pandas.pydata.org/pandas-docs/stable/dsintro.html
Now that we have used the read_** functions to load datasets, we need to better understand what kind of objects we got from them in order to learn to work with them.
In [ ]:
# Type of the object?
type(giss_temp)
In [ ]:
# Internal nature of the object
print(giss_temp.shape)
print(giss_temp.dtypes)
Descriptors for the vertical axis (axis=0)
In [ ]:
giss_temp.index
Descriptors for the horizontal axis (axis=1)
In [ ]:
giss_temp.columns
A lot of information at once including memory usage:
In [ ]:
giss_temp.info()
A Series can be constructed with the pd.Series constructor (passing a list or array of values) or from a DataFrame, by extracting one of its columns.
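For instance, here is a minimal Series built directly from a list of values (the numbers are made up, purely to illustrate the constructor):
In [ ]:
# Made-up values, just to show the pd.Series constructor:
pd.Series([0.3, 0.5, 0.2], index=[2000, 2001, 2002], name="example")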
In [ ]:
# Do we already have a series for the full_globe_temp?
type(full_globe_temp)
In [ ]:
# Since there is only one column of values, we can make this a Series without
# losing information:
full_globe_temp = full_globe_temp["mean temp"]
Core attributes/information:
In [ ]:
print(type(full_globe_temp))
print(full_globe_temp.dtype)
print(full_globe_temp.shape)
print(full_globe_temp.nbytes)
Probably the most important attribute of a Series or DataFrame is its index, since we will use that to, well, index into the structures to access the information:
In [ ]:
full_globe_temp.index
It is always possible to fall back to a good old NumPy array to pass on to scientific libraries that need them: SciPy, scikit-learn, ...
In [ ]:
full_globe_temp.values
In [ ]:
type(full_globe_temp.values)
DataFrames can also be created manually by grouping several Series together. Let's make a new frame from the 3 sea level datasets we downloaded above. They will be displayed along the same index. Wait, does it make sense to do that?
In [ ]:
# Are they aligned?
southern_sea_level.year == northern_sea_level.year
In [ ]:
# So, are they aligned?
np.all(southern_sea_level.year == northern_sea_level.year)
So the northern hemisphere and southern hemisphere datasets are aligned. What about the global one?
In [ ]:
len(global_sea_level.year) == len(northern_sea_level.year)
For now, let's just build a DataFrame with the 2 hemisphere datasets. We will come back and add the global one later...
In [ ]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"],
"southern_hem": southern_sea_level["msl_ib(mm)"],
"date": northern_sea_level.year})
mean_sea_level
Note: there are other ways to create DataFrames manually, for example from a 2D numpy array.
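For instance, a quick sketch (random values, purely to illustrate the constructor):
In [ ]:
# A DataFrame built from a 2D NumPy array; column names are arbitrary here:
pd.DataFrame(np.random.rand(4, 2), columns=["a", "b"])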
The date is still in a regular column, and the numerical index is not very meaningful. We can specify the index of a DataFrame at creation. Let's try:
In [ ]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"],
                               "southern_hem": southern_sea_level["msl_ib(mm)"]},
                              index=northern_sea_level.year)
mean_sea_level
Now, the fact that this fails (the columns are full of NaNs) shows that Pandas does auto-alignment of values: for each value of the new index, it searches each Series for a value with the same index label. Since these series have a plain numerical index (0, 1, 2, ...), no matches are found.
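Here is a tiny illustration of that behavior with a made-up series, mirroring what happened above:
In [ ]:
s = pd.Series([1, 2, 3])  # default integer index: 0, 1, 2
# The requested index shares no labels with s, so every value comes out NaN:
pd.DataFrame({"a": s}, index=[1992.0, 1992.5, 1993.0])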
Since we know that the order of the values matches the index we chose, we can pass only the Series' values at creation of the DataFrame:
In [ ]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"].values,
                               "southern_hem": southern_sea_level["msl_ib(mm)"].values},
                              index=northern_sea_level.year)
mean_sea_level
The datasets that we obtain straight from the reading functions are pretty raw. A lot of pre-processing could have been done at load time, but we haven't used the full power of the reading functions. Let's learn to do the cleaning and formatting ourselves.
The GISS temperature dataset has a lot of issues too: a useless numerical index, redundant columns, useless rows, a placeholder (****) for missing values, and the wrong type for the columns. Let's fix all this:
In [ ]:
# The columns of the local_sea_level_stations aren't clean: they contain spaces and dots.
local_sea_level_stations.columns
In [ ]:
# Let's clean them up a bit:
local_sea_level_stations.columns = [name.strip().replace(".", "")
                                    for name in local_sea_level_stations.columns]
local_sea_level_stations.columns
We can also rename an index by setting its name. For example, the index of the mean_sea_level DataFrame could be called date since it contains more than just the year:
In [ ]:
mean_sea_level.index.name = "date"
mean_sea_level
In the full globe dataset, -999.00 was used to indicate that there was no value for that year. Let's search for all these values and replace them with the missing value that Pandas understands: np.nan
In [ ]:
full_globe_temp == -999.000
In [ ]:
full_globe_temp[full_globe_temp == -999.000] = np.nan
full_globe_temp.tail()
In [ ]:
# We didn't set a column as the index of giss_temp while loading; we can do that afterwards:
giss_temp = giss_temp.set_index("Year")
giss_temp.head()
In [ ]:
# 1 column is redundant with the index:
giss_temp.columns
In [ ]:
# Let's drop it:
giss_temp = giss_temp.drop("Year.1", axis=1)
giss_temp
In [ ]:
# We can also just select the columns we want to keep:
giss_temp = giss_temp[[u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul',
u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']]
giss_temp
In [ ]:
# Let's also drop the repeated header rows (Year Jan ...) found in the middle of
# the data: they all have "Year" as their index value
giss_temp = giss_temp.drop("Year")
giss_temp
Let's also set **** to a real missing value (np.nan). We could do it using a boolean mask, but that may trigger a pandas warning. Another way to assign based on a boolean condition is to use the where method:
In [ ]:
#giss_temp[giss_temp == "****"] = np.nan
giss_temp = giss_temp.where(giss_temp != "****", np.nan)
In [ ]:
giss_temp.tail()
While building the mean_sea_level DataFrame earlier, we didn't include the values from global_sea_level since the years were not aligned. Adding a column to a DataFrame is as easy as adding an entry to a dictionary. So let's try:
In [ ]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level
The column is full of NaNs again because the auto-alignment feature of Pandas is searching for index values like 1992.9323 in the index of the global_sea_level["msl_ib_ns(mm)"] series, and not finding them. Let's set its index to these years so that the auto-alignment can work for us and figure out which values we have and which we don't:
In [ ]:
global_sea_level = global_sea_level.set_index("year")
global_sea_level["msl_ib_ns(mm)"]
In [ ]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level
EXERCISE: Create a new series containing the average of the 2 hemispheres minus the global value to see if that is close to 0. Work inside the mean_sea_level dataframe first. Then try with the original Series to see what happens with data alignment while doing computations.
In [ ]:
# Your code here
Now that the sea levels are looking pretty good, let's go back to the GISS temperature dataset. Because of the labels (strings) found in the middle of the timeseries, every column was assumed to contain strings (pandas didn't convert them to floating point values):
In [ ]:
giss_temp.dtypes
That can be changed after the fact (and after the cleanup) with the astype method of a Series:
In [ ]:
giss_temp["Jan"].astype("float32")
In [ ]:
# Convert every column to float32, one at a time:
for col in giss_temp.columns:
    giss_temp.loc[:, col] = giss_temp[col].astype(np.float32)
An index has a dtype just like any Series, and it can be changed after the fact too.
In [ ]:
giss_temp.index.dtype
For now, let's change it to an integer so that values can at least be compared properly. We will learn below how to change it to a datetime object.
In [ ]:
giss_temp.index = giss_temp.index.astype(np.int32)
Removing missing values - once they have been converted to np.nan - is very easy. Entries that contain missing values can be removed (dropped), or filled with one of many strategies.
In [ ]:
full_globe_temp
In [ ]:
full_globe_temp.dropna()
In [ ]:
# This will remove any year that has a missing value. Use how='all' to keep partial years
giss_temp.dropna(how="any").tail()
In [ ]:
giss_temp.fillna(value=0).tail()
In [ ]:
# This fills them with the value from the previous year (forward fill). See also the interpolate method below
giss_temp.fillna(method="ffill").tail()
Let's also mention the .interpolate method on a Series:
In [ ]:
giss_temp.Aug.interpolate().tail()
For now, we will leave the missing values in all our datasets, because it wouldn't be meaningful to fill them.
EXERCISE: Go back to the reading functions, and learn more about other options that could have allowed us to fold some of these pre-processing steps into the data loading.
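As one hint, the na_values option of the reading functions declares missing-value placeholders at load time, which would have saved us the -999.000 replacement we did above. A sketch, reusing the filename from the first section:
In [ ]:
# Declare the missing-value placeholder while loading, instead of fixing it afterwards:
pd.read_table(filename, sep="\s+", names=["year", "mean temp"],
              index_col=0, parse_dates=True, na_values=[-999.000])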
Now that they have been formatted, visualizing your datasets is the next logical step, and it is trivial with Pandas. The first thing to try is to invoke the .plot method to generate a basic visualization (it uses matplotlib under the covers).
In [ ]:
full_globe_temp.plot()
In [ ]:
giss_temp.plot(figsize=LARGE_FIGSIZE)
In [ ]:
mean_sea_level.plot(subplots=True, figsize=(16, 12));
In [ ]:
# Distributions of mean sea level globally and per hemisphere?
mean_sea_level.plot(kind="kde", figsize=(12, 8))
QUIZ: How would you list the possible kinds of plots that the plot method allows?
In [ ]:
# Distributions of temperature in each month since 1880
giss_temp.boxplot();
There are more plot options inside pandas.tools.plotting:
In [ ]:
# Are there correlations between the northern and southern sea level timeseries we loaded?
from pandas.tools.plotting import scatter_matrix
scatter_matrix(mean_sea_level, figsize=LARGE_FIGSIZE);
We will confirm the correlations we think we see further down...
For each read_** function to load data, there is a to_** method attached to Series and DataFrames.
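For instance, writing a DataFrame back to a CSV file is a one-liner (the output filename here is arbitrary):
In [ ]:
# Write a DataFrame back to disk as CSV; the filename is just an example:
mean_sea_level.to_csv("mean_sea_level.csv")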
EXERCISE: Explore how the to_csv method works using IPython's ?, and store the giss_temp dataframe. Do the same to store the full_globe_temp series to another file.
Another commonly used file format is Excel, where multiple datasets can be stored in one file.
In [ ]:
writer = pd.ExcelWriter("test.xls")
In [ ]:
giss_temp.to_excel(writer, sheet_name="GISS temp data")
full_globe_temp.to_excel(writer, sheet_name="NASA temp data")
In [ ]:
writer.close()
If writing the Series directly to Excel fails with your version of pandas, wrap it in a DataFrame first. Using a context manager also guarantees that the file gets saved and closed:
In [ ]:
with pd.ExcelWriter("test.xls") as writer:
    giss_temp.to_excel(writer, sheet_name="GISS temp data")
    pd.DataFrame({"Full Globe Temp": full_globe_temp}).to_excel(writer, sheet_name="FullGlobe temp data")
In [ ]:
%ls
Another, more powerful file format for storing binary data, which allows us to store both Series and DataFrames without converting anything, is HDF5.
In [ ]:
with pd.HDFStore("all_data.h5") as writer:
    giss_temp.to_hdf(writer, "/temperatures/giss")
    full_globe_temp.to_hdf(writer, "/temperatures/full_globe")
    mean_sea_level.to_hdf(writer, "/sea_level/mean_sea_level")
    local_sea_level_stations.to_hdf(writer, "/sea_level/stations")
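To read the data back later, point pd.read_hdf to the same file and key:
In [ ]:
# Round trip: reload one of the stored datasets from the HDF5 file:
giss_temp_restored = pd.read_hdf("all_data.h5", "/temperatures/giss")
giss_temp_restored.head()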
EXERCISE: Add the greenhouse gas dataset in this data store. Store it in a separate folder.