Since the 2016 Presidential Election in the USA, public interest in climate change and the right environmental policy has reached an all-time high. The aim of this project is to use a dataset with temperature data from 1750 to 2015 [1] and check whether global warming is a fact or speculation. The dataset is nicely packaged and allows for slicing into interesting subsets (by country, by city, global temperatures, etc.). It was put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory.
In [39]:
import numpy as np
import os.path
import pandas as pd
import matplotlib.pyplot as plt
# Show matplotlib graphs inside the notebook.
%matplotlib inline
import plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from sklearn import linear_model
from statsmodels.tsa.arima_model import ARIMA
from myutils import makeTimeSeries, differenciate, test_stationarity
In [40]:
import warnings
warnings.filterwarnings("ignore")
In [41]:
folder = os.path.join('data', 'temperatures','GlobalLandTemperatures')
filename_ByCity = os.path.join(folder, 'GlobalLandTemperaturesByCity.csv')
filename_ByCountry = os.path.join(folder, 'GlobalLandTemperaturesByCountry.csv')
filename_ByMajorCity = os.path.join(folder, 'GlobalLandTemperaturesByMajorCity.csv')
filename_ByState = os.path.join(folder, 'GlobalLandTemperaturesByState.csv')
filename_Global = os.path.join(folder, 'GlobalTemperatures.csv')
In [42]:
ByCity = pd.read_csv(filename_ByCity)
ByCountry = pd.read_csv(filename_ByCountry)
ByMajorCity = pd.read_csv(filename_ByMajorCity)
ByState = pd.read_csv(filename_ByState)
Global = pd.read_csv(filename_Global)
In [43]:
ByCity[:10000].to_html('ByCity.html')
ByCountry[:10000].to_html('ByCountry.html')
ByMajorCity[:10000].to_html('ByMajorCity.html')
ByState[:10000].to_html('ByState.html')
Global.to_html('Global.html')
We export part of each dataset as an HTML file for inspection: ByCity, ByCountry, ByMajorCity, ByState, Global.
As we can see by following the links above, the data needs some cleaning:
In [44]:
# Remove duplicates from ByCountry: some countries appear both on their own
# and with an '(Europe)' suffix; keep a single canonical name for each.
ByCountry_clear = ByCountry[~ByCountry['Country'].isin(
    ['Denmark', 'France', 'Europe', 'Netherlands', 'United Kingdom'])]
ByCountry_clear = ByCountry_clear.replace(
    ['Denmark (Europe)', 'France (Europe)', 'Netherlands (Europe)', 'United Kingdom (Europe)'],
    ['Denmark', 'France', 'Netherlands', 'United Kingdom'])
In [45]:
# Remove duplicates from ByCity: 'Guatemala' and 'Guatemala City' refer to the same city.
ByCity_clear = ByCity[~ByCity['City'].isin(['Guatemala'])]
ByCity_clear = ByCity_clear.replace(['Guatemala City'], ['Guatemala'])
As far as the missing data is concerned, we can choose to either ignore the incomplete measurements or fill in the missing values.
For example, if we choose to ignore the global temperature measurement for any month where the value of LandAverageTemperature is missing, we can do it as follows:
In [46]:
Global.dropna(subset=['LandAverageTemperature']).head()
Out[46]:
Or, if we choose to ignore the global temperature measurements for which we don't have all 8 fields:
In [47]:
Global.dropna(axis=0).head()
Out[47]:
If we choose to fill in the missing values with the value of the previous corresponding measurement:
In [48]:
Global.fillna(method='pad').head()
Out[48]:
The method we use will depend on the problem we try to solve with our data.
In [49]:
mean_Global = []
mean_Global_uncertainty = []
years = np.unique(Global['dt'].apply(lambda x: x[:4]))
for year in years:
    mean_Global.append(Global[Global['dt'].apply(
        lambda x: x[:4]) == year]['LandAverageTemperature'].mean())
    mean_Global_uncertainty.append(Global[Global['dt'].apply(
        lambda x: x[:4]) == year]['LandAverageTemperatureUncertainty'].mean())

x = years.astype(int)
y = np.array(mean_Global)
# The reported uncertainty gives an upper and a lower bound around the mean.
upper = y + np.array(mean_Global_uncertainty)
lower = y - np.array(mean_Global_uncertainty)

plt.figure(figsize=(16, 8))
plt.plot(x, upper, 'b')
plt.plot(x, y, 'r')
plt.plot(x, lower, 'b')
plt.fill_between(x, y1=upper, y2=lower)
plt.xlabel('years', fontsize=16)
plt.xlim(1748, 2017)
plt.ylabel('Temperature, °C', fontsize=16)
plt.title('Yearly Global Temperature', fontsize=24)
Out[49]:
As can be observed, the uncertainty of the measurements in the 18th and 19th centuries was very high. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted the measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias. One can choose to ignore, or give smaller weights to, the older, less reliable measurements. For the modeling part we will consider data from 1900 onward.
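As a hypothetical illustration of the weighting idea (it is not used in the rest of this analysis), the yearly means could be averaged with weights inversely proportional to their reported uncertainty:
# Hypothetical sketch: weight each yearly mean by the inverse of its reported
# uncertainty, so that the less reliable early years contribute less.
means = np.array(mean_Global)
uncertainties = np.array(mean_Global_uncertainty)
valid = ~np.isnan(means) & ~np.isnan(uncertainties)  # guard against missing years
weights = 1.0 / uncertainties[valid]
weighted_overall_mean = np.average(means[valid], weights=weights)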
We now draw a map with the average temperature of each country over all years. This serves as a quick check that our data make sense. We can see that the warmest countries are the ones along the Equator and that the coldest are Greenland, Canada and Russia. Countries for which the data is missing are depicted in white. One can hover over countries to see their names and average temperatures.
In [50]:
countries = np.unique(ByCountry_clear['Country'])
mean_temp = []
for country in countries:
    # When taking the mean, missing data are automatically ignored => see the data cleaning section.
    mean_temp.append(ByCountry_clear[ByCountry_clear['Country'] == country]['AverageTemperature'].mean())

# Use the choropleth map provided by plotly.
data = [dict(
    type = 'choropleth',
    locations = countries,
    z = mean_temp,
    locationmode = 'country names',
    text = countries,
    colorbar = dict(autotick = True, tickprefix = '',
                    title = '\n °C')
)]
layout = dict(
    title = 'Average Temperature in Countries',
    geo = dict(
        showframe = False,
        showocean = True,
        oceancolor = 'rgb(0,255,255)',
    ),
)
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)
We now look at the change of temperature in the major cities over the last 50 years. We subtract the oldest temperature $T_{old}$ from the most recent temperature $T_{new}$; therefore, if $dT = T_{new} - T_{old} > 0$, the temperature has increased. It can be observed that for almost all (95%) of the major cities the temperature has increased over the last 50 years.
One can zoom into the map and see the names and coordinates of the cities.
In [51]:
years_in_MajorCities = np.unique(ByMajorCity['dt'].apply(lambda x: x[:4]))
cities = np.unique(ByMajorCity['City'])
dt = [years_in_MajorCities[-51], years_in_MajorCities[-1]]
T1 = []
T2 = []
lon = []
lat = []
for city in cities:
    T1.append(ByMajorCity[(ByMajorCity['City'] == city) & (ByMajorCity['dt'].apply(lambda x: x[:4]) == dt[0])]['AverageTemperature'].mean())
    T2.append(ByMajorCity[(ByMajorCity['City'] == city) & (ByMajorCity['dt'].apply(lambda x: x[:4]) == dt[1])]['AverageTemperature'].mean())
    lon.append(ByMajorCity[ByMajorCity['City'] == city]['Longitude'].iloc[1])
    lat.append(ByMajorCity[ByMajorCity['City'] == city]['Latitude'].iloc[1])
In [52]:
lon=np.array(lon)
lat=np.array(lat)
In [53]:
# Convert longitudes given as e.g. '100.5W' to the equivalent eastern longitude.
for i in range(0, lon.size):
    if lon[i].endswith('W'):
        west = float(lon[i][:-1])
        lon[i] = str(360 - west) + 'E'
In [54]:
# Convert latitudes given as e.g. '30.0S' to the equivalent northern latitude
# (a simple sign flip: 30S == -30N).
for i in range(0, lat.size):
    if lat[i].endswith('S'):
        south = float(lat[i][:-1])
        lat[i] = str(-south) + 'N'
In [55]:
lon = pd.DataFrame(lon)
lat = pd.DataFrame(lat)
In [56]:
long = lon[0].apply(lambda x: x[:-1])  # strip the trailing 'E'
lati = lat[0].apply(lambda x: x[:-1])  # strip the trailing 'N'
dT = np.array(T2) - np.array(T1)
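The share of major cities that warmed, quoted in the text above, can be checked directly (NaN comparisons evaluate to False, so cities with missing data count as not warmed in this quick estimate):
# Fraction of major cities with a positive temperature change over the 50 years.
print('Share of cities that warmed: %.0f%%' % (100 * np.mean(dT > 0)))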
In [57]:
data = [dict(
    type = 'scattergeo',
    lon = long,
    lat = lati,
    text = cities,
    mode = 'markers',
    marker = dict(
        size = 8,
        opacity = 0.8,
        reversescale = True,
        autocolorscale = False,
        symbol = 'square',
        line = dict(
            width = 1,
            color = 'rgb(102, 102, 102)'
        ),
        color = dT,
        colorbar = dict(
            title = "\n °C"
        )
    ))]
layout = dict(
    title = 'Change in temperature over the last 50 years',
    geo = dict(
        showland = True,
        landcolor = "rgb(250, 250, 250)",
        subunitcolor = "rgb(217, 217, 217)",
        countrycolor = "rgb(217, 217, 217)",
        showocean = True,
        oceancolor = 'rgb(0,255,255)',
    ),
)
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)
We now want to build a model that predicts the global temperature based on the temperatures of the previous years. As can be observed in the figure below, the mean of the global temperature data has a positive trend. Therefore, the yearly global temperature is a non-stationary process (its joint probability distribution changes when shifted in time). In order to produce reliable results and good predictions, the process must first be converted to a stationary one.
In [58]:
mean_Global = pd.DataFrame(mean_Global)
mean_Global['dt'] = years
ts = makeTimeSeries(mean_Global)
#print(ts)
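The helper makeTimeSeries comes from the accompanying myutils module. Assuming it simply parses the 'dt' column as dates and uses it as the index, an equivalent sketch would be:
# Hypothetical sketch of myutils.makeTimeSeries: index the frame by parsed dates.
def makeTimeSeries_sketch(df):
    df = df.copy()
    df['dt'] = pd.to_datetime(df['dt'])  # year strings -> Timestamps (Jan 1 of each year)
    return df.set_index('dt')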
In [59]:
plt.figure(figsize=(16,8))
plt.plot(ts)
plt.xlabel('time',fontsize=16)
plt.ylabel('Temperature, °C',fontsize=16)
plt.title('Yearly Global Temperature',fontsize=24)
Out[59]:
An easy way to detrend a time series is by differencing. For a non-stationary time series $X$, its corresponding time series after differencing $X_{diff}$ can be calculated as:
$$
X_{diff}(i)=X(i)-X(i-1)
$$
$X_{diff}$ will obviously have one sample fewer than $X$.
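The helper differenciate is also defined in myutils; assuming it implements exactly the formula above, it could be as simple as:
# Hypothetical sketch of myutils.differenciate: pandas computes X(i) - X(i-1)
# directly; the first (undefined) element is dropped.
def differenciate_sketch(X):
    return X.diff().dropna()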
In [60]:
X = ts[0]['1900':'2000'] #training set, temporal split
#print(X)
X_diff=differenciate(X)
#print(X_diff)
plt.figure(figsize=(16,8))
plt.plot(X_diff)
plt.xlabel('years',fontsize=16)
plt.ylabel('Temperature, °C',fontsize=16)
plt.title('Yearly Global Temperature (after differencing)',fontsize=24)
Out[60]:
Now we can check whether the process after differencing is in fact stationary with the Dickey-Fuller test. Here the null hypothesis is that the time series is non-stationary. The test results consist of a test statistic and critical values for different confidence levels. If the test statistic is less than the critical value, we can reject the null hypothesis and say that the series is stationary.
In [61]:
test_stationarity(X_diff)
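test_stationarity is the third helper from myutils. A minimal sketch of such a function, built on statsmodels' adfuller, might look like this:
from statsmodels.tsa.stattools import adfuller

# Hypothetical sketch of myutils.test_stationarity: run the augmented
# Dickey-Fuller test (null hypothesis: the series is non-stationary)
# and print the test statistic, p-value and critical values.
def test_stationarity_sketch(timeseries):
    result = adfuller(timeseries.dropna(), autolag='AIC')
    output = pd.Series(result[0:4], index=['Test Statistic', 'p-value',
                                           '#Lags Used', 'Number of Observations Used'])
    for key, value in result[4].items():
        output['Critical Value (%s)' % key] = value
    print(output)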
Another way is to try to model the trend and then subtract it from the data.
In [62]:
regressor = linear_model.LinearRegression()
y = np.array(X.dropna())
t = np.arange(y.size)
y = y.reshape(-1, 1)
t = t.reshape(-1, 1)
regressor.fit(t, y)
trend = regressor.predict(t)
# Detrend by subtracting the fitted linear trend from the data.
detrended = [y[i] - trend[i] for i in range(0, y.size)]
In [63]:
y=pd.DataFrame(y)
y.index=X.index
trend=pd.DataFrame(trend)
trend.index=X.index
detrended=pd.DataFrame(detrended)
detrended.index=X.index
In [64]:
print('Coefficients: \n', regressor.coef_)
print("Mean squared error: %.2f" % np.mean((trend - y) ** 2))
In [65]:
# plot trend
plt.figure(figsize=(16,8))
plt.plot(y,color='blue',label='time series')
plt.plot(trend,color='green',label='trend')
plt.xlabel('years',fontsize=16)
plt.ylabel('Temperature, °C',fontsize=16)
plt.title('Trend of Yearly Global Temperature',fontsize=24)
plt.legend()
plt.show()
# plot detrended
plt.figure(figsize=(16,8))
plt.plot(detrended)
plt.xlabel('years',fontsize=16)
plt.ylabel('Temperature, °C',fontsize=16)
plt.title('Detrended Yearly Global Temperature',fontsize=24)
plt.show()
In [66]:
test_stationarity(detrended[0])
Looking at the results of the Dickey-Fuller test for the two methods of making the time series stationary, we can see that differencing gave the better results here. Therefore, in what follows we will use $X_{diff}$. We could have gotten better results for the trend-modeling method if we had allowed a more complex model than the linear one.
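For instance, a quadratic trend (a hypothetical choice, not used in the rest of this analysis) could be fitted with NumPy:
# Hypothetical sketch: fit and remove a quadratic trend instead of a linear one.
coeffs = np.polyfit(t.ravel(), y[0].values, deg=2)  # parabola coefficients
quad_trend = np.polyval(coeffs, t.ravel())          # evaluate the fitted trend
quad_detrended = y[0].values - quad_trend           # subtract it from the data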
For the modeling we will use the Auto-Regressive Integrated Moving Average (ARIMA) model; its order (p, d, q) specifies p autoregressive terms, d levels of differencing, and q moving-average terms. For our training set we will use the global temperatures from 1900 to 2000.
The ARIMA implementation provided by statsmodels differences the time series itself. Therefore, the figure below first shows the results for the differenced time series.
In [67]:
model = ARIMA(ts[0]['1900':'2000'], order=(1, 1, 2))
results_ARIMA = model.fit(disp=-1)
plt.figure(figsize=(16,8))
plt.plot(X_diff,color='blue',label='original')
plt.plot(results_ARIMA.fittedvalues, color='red',label='predicted')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-X_diff)**2),fontsize=20)
plt.legend(loc='best')
plt.xlabel('years',fontsize=16)
plt.ylabel('Temperature, °C',fontsize=16)
Out[67]:
In [68]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
#print(predictions_ARIMA_diff.head())
We now transform the predictions back to the original scale (undoing the differencing).
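Since the fitted values are predicted differences, the series is reconstructed by adding their cumulative sum to the initial value:
$$
X(i) = X(0) + \sum_{k=1}^{i} X_{diff}(k)
$$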
In [69]:
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
#print (predictions_ARIMA_diff_cumsum.head())
In [70]:
predictions_ARIMA = pd.Series(X.iloc[0], index=X.index)
predictions_ARIMA = predictions_ARIMA.add(predictions_ARIMA_diff_cumsum,fill_value=0)
#predictions_ARIMA.head()
We now plot the actual and the predicted time series by our model for our training set. It is not a perfect prediction, but the root mean square error is relatively small.
In [71]:
plt.figure(figsize=(16,8))
plt.plot(X,color='blue',label='original')
plt.plot(predictions_ARIMA,color='green',label='predicted')
plt.title('RMSE= %.4f'% np.sqrt(sum((predictions_ARIMA-X)**2)/len(X)),fontsize=24)
plt.legend(loc='best')
plt.xlabel('years',fontsize=16)
plt.ylabel('Temperature, °C',fontsize=16)
Out[71]:
We now look at the accuracy of our model in predicting the future. We test on the temperatures from 2001 to 2015. Again, the model is not perfectly accurate, but the root mean square error is relatively small.
In [72]:
X_test = ts[0]['2001':] #test set, temporal split
#print(X_test)
In [73]:
preds=results_ARIMA.predict('2001-01-01','2015-01-01')
In [75]:
# Undo the differencing: cumulative sum of the predicted differences,
# plus the last observed value of the training set.
preds_cumsum = preds.cumsum()
preds = preds_cumsum + X[-1]
#print (preds)
In [76]:
#print(X_test)
plt.figure(figsize=(16,8))
plt.plot(X_test,color='blue',label='original')
plt.plot(preds, color='red',label='predicted')
plt.title('RMSE= %.4f'% np.sqrt(sum((preds-X_test)**2)/len(X_test)),fontsize=24)
plt.legend(loc='best')
plt.xlabel('years',fontsize=16)
plt.ylabel('Temperature, °C',fontsize=16)
Out[76]:
In conclusion, as we have seen, global warming is a fact. It is therefore important that nations work towards a systematic and organized fight against climate change.
[1] Berkeley Earth. Climate Change: Earth Surface Temperature Data. Exploring global temperatures since 1750. Available at: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data