This post serves a dual purpose: it is a follow-on tutorial on using pandas to look at weather data, and an exploratory post on how appropriate the homogenized data might be for some of my projects.
The background is that I've been working on a dynamic, django-based website that would allow users to look at plots of the near real-time weather being recorded by Environment Canada (EC) stations, and to compare it to historical norms and extremes. In a previous post, Weather with Pandas, I discussed some techniques for accessing this EC weather station data directly from the API on the EC weather site. It turns out there is a major problem: the data available on the EC site is basically raw station data. Every few years, old stations are replaced with new ones, the type of equipment used to make the measurements might change, the site's position might change by a few metres or more, or the site might disappear entirely. Whenever such a change is made to a site, EC gives it a new station code, and sometimes even a new name. Thus for any one location there may be tens of "stations", all covering different periods, with slightly different measurement characteristics. All these changes mean that it is basically impossible to use the "raw" data available through the EC weather API to calculate long-term averages, extremes, or trends. In short, you cannot use it for climate analysis.
All is not lost though. Scientists in the EC Climate Research Division have produced "homogenized" data which does not have any of these problems. It's not available for every variable, but surface temperature is covered. To produce the homogenized surface temperatures for any one location, let's say Shawnigan Lake, all the stations are joined into one record, which is carefully checked for consistency and "homogenized" so that there are no jumps due to changes in equipment, location, and so on. There are many challenges in producing this homogenized data, which are fully described in the documentation and associated publications, and I won't consider any of those details. Here, I want to explore the homogenized surface temperature data just to get a feel for it. The data is available here:
http://www.ec.gc.ca/dccha-ahccd/default.asp?lang=en&n=1EEECD01-1
I downloaded the documentation, station list, and "Monthly mean of daily mean temperature". The last of these is a zip archive that you need to extract. So what I have is:
Temperature_Stations.xls
Homog_monthly_mean_temp/
Within Homog_monthly_mean_temp/ there are about 340 .txt files, each starting with "mm" for "monthly mean", followed by a number indicating the station id. Those files contain the station id, the name, and the data, but not much other information on the station. More detail on the stations (location etc.) is provided in Temperature_Stations.xls. We'll use pandas to examine the timeseries as before, and we'll use basemap to map the station locations. The objectives here are to:
- load the station metadata and homogenized temperature records cleanly into pandas;
- map the station coverage with basemap;
- explore the record for a single location (Vancouver): climatology, extremes, and variability;
- average over all the stations to look for a warming signal, and compare it to a global record.
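As a quick sanity check that the archive extracted properly, we can count the files with glob (a minimal sketch; adjust the path if you unzipped somewhere else):

import glob
files = glob.glob('Homog_monthly_mean_temp/mm*.txt')
print len(files)  # should be roughly 340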
Before we start, if you want to download this notebook you can find it on my github.
In [1]:
# Start off importing some basic python stuff that we will need.
import os
import pandas as pd
import numpy as np
from mpl_toolkits.basemap import Basemap
# plotting stuff
%pylab inline
# set the plot fontsize
font = {'size' : 14}
matplotlib.rc('font', **font)
# Set default figure size for this notebook
pylab.rcParams['figure.figsize'] = (8.0, 6.4)
We can load in the station information using the pandas read_excel function. The file has English and French headers. I just want to keep the English ones, so I'll specify the header row as 2 and skip row 3 (the French one).
In [2]:
stns = pd.read_excel('Temperature_Stations.xls', header=2, skiprows=[3])
Let's look at the first ten rows to get a feel for the type of information, and get the shape so that we know how many stations there are.
In [3]:
stns.head(10)
Out[3]:
In [4]:
stns.shape
Out[4]:
Okay, so we have 338 stations across Canada, with information on the station name, id, province, date span, and 3D position. Let's find the Vancouver stations so that we can explore them further. We can use the str accessor's contains method to find a station by name.
In [5]:
stns[ stns["station's name"].str.contains('VANCOUVER') ]
Out[5]:
So now that we know the stnid for Vancouver, we can load in the corresponding .txt file and have a look. I'm going to skip rows 0, 1 and 3, since those contain the French header and other unwanted stuff.
In [6]:
s = pd.read_csv('Homog_monthly_mean_temp/mm1108447.txt',skiprows=[0,1,3]
, index_col=False, skipinitialspace=True, na_values=-9999.9)
In [7]:
s.head(5)
Out[7]:
There are columns for each month, as well as the seasonal and annual means. Above we saw that there are columns ("Unnamed") with letters ("a", "M", etc.) following each column of data. Those letters are quality flags for the data, and are described in the documentation. For this exercise let's just assume that all the data is of good quality, and get rid of the flag columns. One way to filter a dataframe is by using regular expressions. For example, if we just wanted the columns for months starting with an "M", we could do this:
In [8]:
s.filter(regex='^M').head(2)
Out[8]:
So above we used the filter method with a regular expression to find all columns starting with "M". The caret (^) means "starting with". Regular expressions are a powerful pattern-matching tool used in all types of programming. What we want here is to find the columns of s whose names contain the initial letters of the months (plus "Y" for Year and "W" for Winter). To indicate a set of characters to match we use square brackets, like this: []. We'll also use the plus sign, which means match "one or more" of those characters. Here goes:
In [9]:
s = s.filter(regex='[JFMASONDYW]+')
s.head(2)
Out[9]:
So we have been successful in keeping only the columns we want (getting rid of the flags). The data is now clean enough that we can start to look at it. Let's start with a basic description, so we can see the min, max, and average of the annual mean temperature in Vancouver.
In [10]:
s.Annual.describe()
Out[10]:
Okay, so let's find out which is the warmest year(s) on record... and the coldest year...
In [11]:
s.Year[ s.Annual == s.Annual.max() ]
Out[11]:
In [12]:
s.Year[s.Annual == s.Annual.min() ]
Out[12]:
Now let's go and look for the coldest winter and the warmest summer.
In [13]:
s.Year[s.Winter == s.Winter.min()]
Out[13]:
In [14]:
s.Year[s.Summer == s.Summer.max()]
Out[14]:
Ok, so those were some interesting numbers, but now let's produce some visual information on the seasonal cycle. To see what the average temperature change between seasons looks like, we can average the monthly data over several years. Thirty years is a good amount of time to get a reliable average, and such an average is called a climatology. So let's compute the climatology of monthly temperature for the years 1981 to 2010. We won't need the seasonal or annual data for this, so I'll make a new dataframe called s_months with just the first 13 columns (Year plus the 12 months).
In [15]:
s_months = s[ s.columns[0:13]]
s_months.head(2)
Out[15]:
Now that we have just the monthly data, let's select out the years between 1981 and 2010 to compute our climatology. We'll compute the mean over those 30 years and plot it up.
In [16]:
s_months_1981_2010 = s_months[(s_months.Year >= 1981) & (s_months.Year<=2010)]
clim_1981_2010 = s_months_1981_2010.mean(axis=0)
clim_1981_2010.drop('Year').plot(color='k', linewidth=3)
plt.ylabel('Monthly mean temp. ($^{\circ}$C)', fontsize=14)
plt.title('Climatology for Vancouver 1981-2010', fontsize=16)
Out[16]:
So, we have a monthly mean temperature of about 18 degrees C in the summer and about 4 degrees C in the winter. The range is about 14 degrees, which is much smaller than in some other Canadian cities. The climatology gives us a good idea of how temperatures change over the year, on average, but of course any one year can look quite different from the climatology.

At this point it is worth pausing for a second to think about what we are actually plotting. We loaded in the monthly mean of daily mean temperature, and then averaged that over 30 years to produce the climatology. So, the plot above tells us that the daily average temperature is about 18 degrees C in summer. But on any given summer day the temperature will vary quite a lot, normally being warmest in the afternoon and coldest in the early morning; the plot tells us nothing about that. To get an idea of the daily range in temperature, we would need to load in the monthly mean of daily minimum and the monthly mean of daily maximum data, and plot their climatologies along with the daily average.
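As a sketch of how that might look: the daily-minimum and daily-maximum archives are separate downloads from the same page, and I'm assuming here that they unpack to directories and file prefixes like Homog_monthly_min_temp/mn... and Homog_monthly_max_temp/mx..., so check the names against what you actually downloaded:

def clim_monthly(fname):
    """Return the 1981-2010 Jan-Dec climatology from a homogenized .txt file."""
    df = pd.read_csv(fname, skiprows=[0,1,3], index_col=False,
                     skipinitialspace=True, na_values=-9999.9)
    df = df.filter(regex='[JFMASONDYW]+')            # drop the flag columns
    df = df[(df.Year >= 1981) & (df.Year <= 2010)]   # 30-year base period
    return df[df.columns[1:13]].mean(axis=0)         # Jan..Dec means

# The min/max file names below are assumptions -- adjust to match your archives.
clim_monthly('Homog_monthly_min_temp/mn1108447.txt').plot(label='daily min')
clim_monthly('Homog_monthly_mean_temp/mm1108447.txt').plot(label='daily mean')
clim_monthly('Homog_monthly_max_temp/mx1108447.txt').plot(label='daily max')
plt.legend(loc='best')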
Next let's look at how temperature has changed from year to year in Vancouver. We saw above, using s.Annual.describe(), that the record length (count) is 116 points, which in this case is equal to years. We'll make a plot of the annual, summer, and winter mean temperatures over the full record.
In [17]:
plt.plot(s.Year, s.Annual, 'k', linewidth=3, label='Annual')
plt.plot(s.Year, s.Winter, 'b', label='Winter')
plt.plot(s.Year, s.Summer, 'r', label='Summer')
plt.ylabel('Monthly mean temp ($^{\circ}$C)')
plt.xlabel('Year')
plt.title(stns[stns.stnid==1108447]) # Add the info from the excel sheet as the title
plt.legend(bbox_to_anchor=(1.35,1)); # put a legend outside the plot using bbox_to_anchor
Out[17]:
We can see a lot of interesting information here. Again, summer is obviously warmer than winter, with the annual mean in between at around 10 C. What is interesting to me is that the inter-annual variability of the winter temperatures seems much higher than that of the summer or annual means; we can check this formally below. One reason might be increased storminess in the winter, which can vary a lot from year to year.
In [18]:
print s.Annual.std(), s.Summer.std(), s.Winter.std()
Okay, now that we have the hang of things, let's set ourselves up to do some more in-depth analysis. One thing we might be interested in is whether we can see a global warming signal in the homogenized data. Looking for a signal at any one station will be hard because of year-to-year variability; in the Vancouver data alone it's a little hard to see a clear warming signal, though that's partly because of the large y-axis range. But if we average over many stations, the signal should be clearer. To help us, let's set up some functions that we can use to automatically process the data from each station. Then we can process the data from every station in Canada quickly and efficiently. Before we start loading all the data though, let's use the Excel spreadsheet to get the lat/lons of every available station and plot them on a map to see what kind of coverage we have.
Let's map all the stations. We'll use the basemap toolkit to provide a nice map. basemap does have the ability to outline countries, but you get all or nothing. To get just the outline of Canada, download the GIS shapefile from:
http://biogeo.ucdavis.edu/data/gadm2/shp/CAN_adm.zip
and unzip it. Then we can use the shapefile to plot just the outline of Canada for our map.
In [19]:
plt.figure(figsize=(10,10))
# setup Lambert Equal area
m = Basemap(width=5500000,height=5000000,projection='laea',
resolution=None,lat_0=62,lon_0=-90.);
# plot the outline of Canada by reading from the shapefile.
m.readshapefile('CAN_adm/CAN_adm0','Canada',drawbounds=True, linewidth=3);
# if the above did not work, you could instead just use the built-in data:
#m.fillcontinents(color='coral',lake_color='aqua')
#m.drawcountries(linewidth=0.25)
# Get the lons and lats off all EC stations
lons = stns['long (deg)']
lats = stns['lat (deg)']
x, y = m(lons[1::].values, lats[1::].values)
m.plot(x, y, 'ro', markersize=5);
# put on some line of lon and lat
parallels = np.arange(0.,81,10.)
m.drawparallels(parallels,labels=[False,True,True,False]);
meridians = np.arange(10.,351.,20.)
m.drawmeridians(meridians,labels=[True,False,False,True]);
Ok, so we have broad coverage, but the stations are far more concentrated in the south and quite sparse in the north; we'll have to keep that in mind. Now let's define those functions to crunch all the data. I could just load the data without writing functions, but setting up functions breaks the task into manageable steps, and also produces some useful tools that I might be able to reuse later. From here on we'll focus on the annual mean data.
In [20]:
def datetime_from_year(years):
    """Get dates from years. Assign years to June 15 for convenience."""
    if type(years) == str:
        years = int(years)
    dts = pd.datetime(years, 6, 15)
    return dts
In [21]:
def df_from_txt(infile):
    """Produce a pandas dataframe from an EC homogenized .txt file.
    Only the Annual data is returned. The returned dataframe has a datetime
    index, properly assigned missing values, and a stripped column header.

    Parameters:
    -----------
    infile : str
        The name of the file to load; can be a full path.

    Returns:
    --------
    df : pandas dataframe
    """
    df = pd.read_csv(infile, skiprows=[0,1,3], index_col=False,
                     skipinitialspace=True, na_values=-9999.9)  # read into a dataframe
    df.index = df.Year.apply(datetime_from_year)  # make a datetime index from the Year column
    df = df.Annual  # only keep the annual data
    return df
In [22]:
def group_stns(dfs):
    """Concatenate a list of stations, dfs, into a
    single dataframe along axis 1 (i.e. columnwise)."""
    dfo = pd.concat(dfs, axis=1)
    return dfo
In [23]:
def files_from_ids(ids, prefix=''):
    """Given a list of stnids, return the corresponding
    list of filenames. If prefix is provided, prepend it
    to each filename (useful for specifying a path).
    """
    files = []
    for id in ids:
        files.append(prefix + 'mm' + str(id) + '.txt')
    return files
In [24]:
def load_stns(files):
    """Given a list of files, load them into a single pandas dataframe
    and return it. The column headers will be the stnids.
    """
    dfo = df_from_txt(files[0])  # get the first file into a dataframe
    cols = []  # list of column names (stnids)
    head, tail = os.path.split(files[0])  # use os.path to split any leading path off the filename
    cols.append(tail.replace('mm','').replace('.txt',''))  # get the stnid and append it to the col list
    for f in files[1::]:
        dfi = df_from_txt(f)  # files 1...n
        dfo = group_stns([dfo, dfi])  # append to the dataframe
        head, tail = os.path.split(f)
        cols.append(tail.replace('mm','').replace('.txt',''))
    dfo.columns = cols  # assign the column names to be the stnids
    return dfo
Now we have all the functions that we need to process all the Canadian data. If you are unsure how they work, you can try them out one by one; for example, a quick one-off check of the helpers on a single station might look like this:
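files_from_ids([1108447], prefix='Homog_monthly_mean_temp/')
df_from_txt('Homog_monthly_mean_temp/mm1108447.txt').head()

First we'll get a list of all the station ids from the Excel file. Then, from those ids, we'll create a corresponding list of filenames and load them all in using our load_stns function.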
In [25]:
stn_ids = stns['stnid'].values[1::] # Get the stnids
cfiles = files_from_ids(stn_ids, 'Homog_monthly_mean_temp/') # get a list of filenames, with the path prepended
dfc = load_stns(cfiles) # load the data
dfc.shape
Out[25]:
A few stations go back almost to 1845, but only a few. Good coverage only starts after about 1900, so we'll limit ourselves to the period after 1900.
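A quick way to see the coverage is to count how many stations report in each year:

dfc.notnull().sum(axis=1).plot()
plt.ylabel('Number of stations reporting')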
In [26]:
dfc = dfc[dfc.index.year>=1900]
All the stations have quite different average temperatures: think Inuvik vs. Vancouver. If we want to compare changes over time, it is best to do so using anomalies relative to a given base period. Since we used 1981-2010 for the climatology, we'll use that again here as our base period. For every station, we'll calculate the mean temperature over 1981 to 2010, and then compute the temperature anomaly relative to that base period in order to assess how things have changed over time.
In [27]:
dfc_clim = dfc[ (dfc.index.year>=1981) & (dfc.index.year<=2010) ].mean(axis=0)
dfc_anom = dfc - dfc_clim
In [28]:
dfc_anom.mean(axis=1).plot(label='Annual')
dfc_anom.mean(axis=1).resample('5A').plot(color='r', linewidth=3, label='5yr mean')
plt.ylabel('Annual mean temp. anomaly ($^{\circ}$C)')
plt.title('Simple mean of all Canadian Stations')
plt.legend(loc='best')
Out[28]:
The red line is a 5-year mean (non-overlapping block averages from resample). The main pattern that sticks out to me is the warming, which really takes off after about 1970. Strong post-1970 warming is exactly what we would expect from the global warming signal, and is very typical of what is seen in global temperature records and climate model simulations. In fact, we'll compare our stations to a global mean temperature record just below.
The blue, unsmoothed line shows us that temperatures vary a lot from year to year. So even though things are getting warmer on average, any one year might not be that warm. In the same way, while it is getting hotter on average across the globe, one region may experience no warming, or even cooling, for a year or even a decade. This is called interannual variability. It is not well appreciated by the general public, who sometimes expect that global warming means every year will be warmer than the last.
One important thing to note is that it would not be correct to call what I have calculated the "average Canadian" temperature anomaly. What I have done here is take a simple mean across all the Canadian stations. However, as we have already seen, the stations are not evenly distributed, but are far more concentrated in the south, so this mean is biased towards southern Canada. Warming is thought to be occurring faster in the Arctic than in the south, due to a phenomenon called "polar amplification": basically, the Arctic warms faster than the rest of the planet because of sea-ice feedbacks. If we wanted to calculate a proper Canadian average, we would have to ensure spatially even coverage, either by binning the data into boxes or by interpolating it onto a standard grid. That's beyond the scope here, but maybe I'll look into it in the future.
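If I do, a minimal sketch of the binning idea might look something like the following. It bins only in latitude (5-degree bands, each weighted equally) and assumes the stnid strings line up with the columns of dfc_anom as constructed above, so it is a rough illustration rather than a proper area-weighted average:

lats_by_id = pd.Series(stns['lat (deg)'].values[1:],
                       index=[str(i) for i in stns['stnid'].values[1:]])
bands = (lats_by_id[dfc_anom.columns] // 5) * 5      # 5-degree latitude bins
band_means = dfc_anom.groupby(bands, axis=1).mean()  # average stations within each band
dfc_banded = band_means.mean(axis=1)                 # then weight each band equally
dfc_banded.resample('5A').plot(color='b', linewidth=3, label='lat-banded mean')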
For now, as a simple check, we'll just look at a plot of all the individual stations to see if we can still see the warming pattern without doing the averaging.
In [29]:
ax = dfc_anom.resample('5A').mean(axis=1).plot(color='k', alpha=0.25)
for col in dfc_anom.columns:
    dfc_anom[col].resample('5A').plot(ax=ax, color='k', alpha=0.25)
dfc_anom.mean(axis=1).resample('5A').plot(color='r', linewidth=3, label='Simple mean')
plt.ylim([-5, 3])
plt.ylabel('Annual mean temp ($^{\circ}$C)')
plt.title('Individual Stations vs simple mean of all Stations')
#plt.legend(loc='best')
Out[29]:
I think it seems pretty clear that the post-1970s warming has occurred at most individual stations, not just in the mean. As another sanity check, let's compare our simple average of the Canadian stations to the global mean surface temperature anomaly from one of the "official" global temperature series. We'll use the record from NASA GISS. The GISS data file is not very standard, and it takes quite a few options to parse it properly. You can look at the raw file here:
http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts.txt
I'm just going to pull out column 0 (Year) and column 13 (J-D), which is the January-to-December global mean surface temperature anomaly (GMST). As it says in the file header, the anomaly is relative to the 1951 to 1980 base period, so that is what I'll use for the Canadian stations too.
In [30]:
giss = pd.read_csv('http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts.txt', skiprows=7, skipfooter=12
,skipinitialspace=True, sep=' ', usecols=[0, 13], na_values=['****']
).dropna()
giss[ giss['Year'] =='Year' ] = np.nan # Get rid of rows with repeated column headers
giss = giss.dropna()
giss.index = giss.Year.apply( datetime_from_year ) # Get a datetime index
giss = giss['J-D'].astype(float)/100 # convert to floats (from str) and from 100ths of a degree to degrees.
In [31]:
ax = giss.plot(color='k', linewidth=3, label='NASA GISS GMST')
dfc_clim2 = dfc[(dfc.index.year>=1951) & (dfc.index.year<=1980)].mean(axis=0)
dfc_anom2 = dfc - dfc_clim2
dfc_anom2.mean(axis=1).plot(ax=ax, color='r', linewidth=1, label='EC stn simple mean')
dfc_anom2.mean(axis=1).resample('5A').plot(ax=ax, color='r', linewidth=3, label='5-yr mean')
plt.title('EC Cdn stn mean vs GISS global mean')
plt.ylabel('Annual mean temp. anomaly ($^{\circ}$C)')
plt.legend(loc='best')
Out[31]:
The year-to-year variability is higher for the Canadian stations (thin red line), which is expected since we are averaging over a smaller area, which has more weather "noise" that the GISS global average smooths out. That is why I overlaid the 5-year means, which end up looking quite similar to the GISS annual mean. In both records we can see some warming from the early 1900s until about 1945, slight cooling from 1945 to around 1970, and then fairly rapid warming. The rate of warming in the Canadian stations is about the same as, or maybe slightly higher than, the global mean, which might be expected due to polar amplification.
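To put a rough number on that, we can fit linear trends to the post-1970 part of both records; this is a crude least-squares check that ignores autocorrelation and uncertainty, just to get a ballpark comparison:

cdn = dfc_anom2.mean(axis=1)
cdn = cdn[cdn.index.year >= 1970].dropna()
glb = giss[giss.index.year >= 1970].dropna()
print 'Canada: %.2f C/decade' % (10 * np.polyfit(cdn.index.year, cdn.values, 1)[0])
print 'Global: %.2f C/decade' % (10 * np.polyfit(glb.index.year, glb.values, 1)[0])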
It might also be interesting to use our Canadian station data to compare warming rates in the north to those in the south, to see if the so-called "polar amplification" of warming is evident. We'll isolate all stations north of 60N, average them, and compare the result to the average of all stations south of 50N.
In [32]:
# Get a list of stn id's for all stations N of 60
n_stn_ids = stns[stns['lat (deg)'] > 60]['stnid'].values
n_stn_ids = [str(id) for id in n_stn_ids] # convert to str
dfn_anom = dfc_anom2[n_stn_ids].copy()
# Get a list of stn id's for all stations S of 50
s_stn_ids = stns[stns['lat (deg)'] < 50]['stnid'].values
s_stn_ids = [str(id) for id in s_stn_ids]
# make sure all the ids actually exist in the loaded data
# (removing items from a list while iterating over it is buggy,
# so use a list comprehension instead)
s_stn_ids = [id for id in s_stn_ids if id in dfc_anom.columns.values]
dfs_anom = dfc_anom2[s_stn_ids]
dfs_anom.mean(axis=1).resample('5A').plot(color='g', linewidth=3, label='South-simple-mean')
dfn_anom.mean(axis=1).resample('5A').plot(color='b', linewidth=3, label='North-simple-mean')
plt.ylabel('Annual mean temp ($^{\circ}$C)')
plt.title('Canada warming South of 50$^{\circ}$N vs North of 60$^{\circ}$N')
plt.legend(loc='best')
Out[32]:
Okay, so it appears that the north might be warming faster than the south. That is even more reason to be careful about calculating simple means without proper area weighting. Speaking of the north, let's go and find out which is the coldest station in Canada! To do this we'll again compute a climatology over 1981 to 2010.
In [33]:
dfc_clim = dfc[(dfc.index.year>=1981) & (dfc.index.year<=2010)].mean(axis=0)
In [34]:
print dfc_clim.min()
coldest_id = dfc_clim[dfc_clim == dfc_clim.min() ].index
#print coldest_id[0]
stns[stns.stnid==int(coldest_id[0])]
Out[34]:
In [35]:
print dfc_clim.max()
warmest_id = dfc_clim[dfc_clim == dfc_clim.max() ].index
#print coldest_id[0]
stns[stns.stnid==int(warmest_id[0])]
Out[35]:
According to the homogenized temperatures, Vancouver was on average the warmest Canadian station over 1981 to 2010. Maybe that explains a little about the house prices.
I've now got a good feel for the homogenized data: it is quite easy to use, and the CRD scientists have already done the hard work of homogenizing. For my django website I'll probably start off using this data, which at least ensures reliable statistics. It's unfortunate that it will be hard to compare it with the unhomogenized near real-time data. Perhaps at some point I'll think about applying some type of calibration to the near real-time data so that it can be included, but that will likely be tricky.
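If I do try that, the simplest starting point might be a constant-offset adjustment over an overlap period. This is purely hypothetical: raw stands for a near real-time series pulled from the EC API as in the earlier post, and homog for the matching homogenized series, neither of which is defined in this notebook:

# raw and homog are assumed to be pandas Series on the same (e.g. monthly)
# datetime index; neither exists in this notebook -- this is just a sketch.
overlap = raw.index.intersection(homog.index)
offset = (raw[overlap] - homog[overlap]).mean()
raw_adjusted = raw - offset  # crude bias correction; no trend adjustment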