Data Analysis on Climate Timeseries Data-set

This is a set of climate time series in which different climate variables (e.g. temperature, wind speed, rainfall) are summarized on a yearly basis. The purpose of this notebook is NOT to be a teaching prototype but to crunch the numbers and derive useful relationships and visualizations, in order to assess the potential of this data for the dds-notebook project.

$ cd /somewhre



In [1]:

    
import numpy as np
import scipy as s
import math
from bokeh.browserlib import view
from bokeh.document import Document
from bokeh.embed import file_html
from bokeh.models.glyphs import Circle, Text
from bokeh.models import (
    BasicTicker, ColumnDataSource, Grid, GridPlot, LinearAxis,
    DataRange1d, PanTool, Plot, WheelZoomTool,HoverTool
)
from bokeh.charts import TimeSeries
from bokeh.resources import INLINE
from bokeh.sampledata.iris import flowers
from bokeh.plotting import *
from bokeh.io import gridplot, output_file, show, vplot
import re
from os import listdir
from os.path import isfile, join
from itertools import product
from collections import OrderedDict
from dds_lab import climdat

Correlation Plots

The following code both cleans up the data and readies it for correlation plots. Some of the code could do with refactoring, and the less tidy routines it requires could be moved into libraries in the scripts package. Most climate variables exhibit no obvious relationship. In the example here only 3 are plotted at a time, since there is not enough space to plot all 64 combinations (this count includes the transposes of plots). [EDIT: Everything is currently being plotted.]

Parsing and pairing up climate variables

The following snippet defines the pair_data function which both parses the timeseries of two variables at a time and pairs them up for a scatter plot of one vs the other. Some abstractions are urgently required in this section.



In [2]:

    
# Read in the html tags string (the data was extracted from an interactive SEPA graph in a rather hacky manner).
# Thus the output consists of a clause of inner html tags, which is actually convenient to exemplify operations
# on strings within python.
# DDS-Notebook API development (debugging both data sets and APIs on a notebook)


def read_tags(path):
    """
    This function takes in a unix path of type String and
    returns a dictionary mapping each .txt file name in that
    directory to a list of html tags, each of type String
    """
    txts = [f for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]
    html_tags = {}
    for x in txts:
        # join() works whether or not path ends with a slash
        with open(join(path, x), 'r') as fh:
            html_tags[x] = list(set(fh.read().split(">,<")))
        # ^ tags contain duplicates due to the process
        # they were retrieved in thus we use set to remove them
    return html_tags


# Regex for values + filtering over the data set
# Year, value parsing functions to parse an individual tag
# i.e.:
# <span style="max-width: 962px; max-height: 992px;">Year: 1961<br>Value: 1011.88<br>Series: City of Edinburgh<br></span>
year = lambda x:(int(re.search(r'(?<=Year:\s)(\d+)', x).group(0))
                     if re.search(r'(?<=Year:\s)(\d+)', x) else None)

val = lambda x: (float(re.search(r'(?<=Value:\s)\d+\.?\d*', x).group(0))
                 if re.search(r'(?<=Value:\s)\d+\.?\d*', x) else None)

# Function that combines the previous two and applies them to a list in order to yield (year, value)
# pairs.
extract = lambda y: dict(zip(filter(lambda x: x is not None,
                          map(year, y)),
                         (filter(lambda x: x is not None,
                          map(val, y)))))

def parse_time_series(html_tags_files, txt):
    """
    This function parses individual timeseries data into
    a dictionary keyed by file name, where each entry is
    a dictionary which contains the year as keys
    """
    prod_dict = {}
    for p in txt:
        prod_dict[str(p)] = extract(html_tags_files[str(p)])
    return prod_dict
        

def pair_data(html_tags1, html_tags2, order, split_by=2010):

    # obtain years and values for every tag in both data sets and place them in a
    # dictionary data structure, since this is probably one of the most efficient
    # ways to pair them up later on
    first_pair = extract(html_tags1)
    second_pair = extract(html_tags2)
    
    first_y = []
    second_y = []
    
    order = list(order)
    # Determine which pair is larger and pair up relative to
    # that pair in order to not miss tags
    if(len(second_pair) < len(first_pair)):
        tmp = dict(second_pair)
        tmp_name = order[0]
        second_pair = dict(first_pair)
        first_pair = dict(tmp)
        order[0] = order[1]
        order[1] = tmp_name
        del tmp
       
    # Pairing up process taking advantage of the dictionary data structure
    for k, v in second_pair.items():
        if k in first_pair:
            first_y.append((k, first_pair[k]))
            second_y.append((k,v))
    
    # Make sure to sort by one axis
    final = list(zip(first_y, second_y))
    final.sort(key=lambda x: x[0][1])
    
    # Split set by  3 year ranges (interesting thought)
    if split_by:
        first_third = []
        second_third = []
        third_third = []
        for x in final:
            # Playing with splits             
            if(x[0][0] >= split_by):
                first_third.append(x)
            # TODO: Make it a param and not a hardcoded val.
            elif(x[0][0] >= 1980):
                second_third.append(x)
            else:
                third_third.append(x)
                
        # Maybe change return to dict format to increase readability.
        return (np.array(first_third)[:,0],
                np.array(first_third)[:,1],
                np.array(second_third)[:,0],
                np.array(second_third)[:,1],
                np.array(third_third)[:,0],
                np.array(third_third)[:,1],
                first_pair,
                second_pair,
                order)
            
    # Raw return (no ranges)
    return (np.array(final)[:,0],
            np.array(final)[:,1],
            first_pair,
            second_pair,
            order)
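
The parsing and pairing steps above can also be sketched in a more compact form. This is only an illustrative alternative, not the code used in this notebook: `TAG_RE`, `parse_tags` and `pair_by_year` are hypothetical names, the regex reads the year and value from the same match so the two can never fall out of step, and support for negative values is an assumption about the data. The sample tags are made up, modelled on the example tag in the comments above.

```python
import re

# Hypothetical pattern: captures the year and the value of one tag together.
# The optional minus sign is an assumption (e.g. for minimum temperatures).
TAG_RE = re.compile(r"Year:\s*(\d+)<br>Value:\s*(-?\d+(?:\.\d+)?)")


def parse_tags(tags):
    """Return {year: value}, skipping tags that lack either field."""
    series = {}
    for tag in tags:
        match = TAG_RE.search(tag)
        if match:
            series[int(match.group(1))] = float(match.group(2))
    return series


def pair_by_year(first, second):
    """Return (year, first_value, second_value) for years present in both."""
    common_years = sorted(set(first) & set(second))
    return [(y, first[y], second[y]) for y in common_years]


# Made-up sample tags, for illustration only.
tags_a = [
    '<span>Year: 1961<br>Value: 1011.88<br>Series: City of Edinburgh<br></span>',
    '<span>Year: 1962<br>Value: 1009.5<br>Series: City of Edinburgh<br></span>',
    '<span>no data</span>',
]
tags_b = [
    '<span>Year: 1962<br>Value: -1.2<br>Series: City of Edinburgh<br></span>',
    '<span>Year: 1963<br>Value: 0.4<br>Series: City of Edinburgh<br></span>',
]

pairs = pair_by_year(parse_tags(tags_a), parse_tags(tags_b))
print(pairs)
```

Only 1962 appears in both sample series, so it is the only year that survives the pairing.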

Grid Plots

This snippet generates grid plots for a range of climate variables (the slice l to j of the file list). I have not yet figured out how to use widgets to display files of choice. Several transformations on the data make it look somewhat better, but no relationship is extremely strong. Some relatively notable relationships are:

This had a small bug when I wrote it (regex + instead of *). After fixing the bug I can spot some faint but interesting relationships: snow_cover and temperature; min/max/avg temperature, which all track each other closely (a good sign for the data); and air pressure and rainy days.

Note: make sure to implement functional transformations via widgets, and a choice of climate variables to display, if possible. Other things to look at are outliers in the time series: check in the news whether it was a particularly snowy or an extremely hot year in Edinburgh, which helps in assessing how valid/useful the data is.
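
One simple way to shortlist candidate outlier years before checking them against news reports is a z-score filter. The sketch below is an assumption about how this could be done, not something this notebook implements: `flag_outlier_years` and the threshold of 2 standard deviations are hypothetical choices, and the sample values are invented.

```python
import statistics


def flag_outlier_years(series, threshold=2.0):
    """Return the years whose value lies more than `threshold`
    population standard deviations away from the mean of the series."""
    values = list(series.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant series has no outliers.
        return []
    return sorted(year for year, value in series.items()
                  if abs(value - mean) / stdev > threshold)


# Invented yearly values with one unusually high year.
sample = {2000: 10.0, 2001: 10.2, 2002: 9.8, 2003: 10.1,
          2004: 9.9, 2005: 10.0, 2006: 14.0}
print(flag_outlier_years(sample))
```

With so few points per series the threshold is a judgment call; a lower value flags more years to look up.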

GOOD IDEA:

Color scatter plots red and blue based on a split point (e.g. before and after 2000); this may give a visual indicator of global warming. The same idea can be used later on in the individual time series when obtaining summary statistics. Average temperature is definitely higher in recent years; nonetheless, one can read this off the time series.
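
A minimal sketch of the split-point idea, assuming a `{year: value}` dictionary like the ones produced by the parsing code. The names `colour_by_split` and `mean_by_split` are hypothetical, and the sample numbers are invented; the resulting colour list could be handed to a scatter plot as per-point fill colours.

```python
def colour_by_split(years, split_year=2000, before="blue", after="red"):
    """Return one colour per year: `after` from split_year onwards."""
    return [after if year >= split_year else before for year in years]


def mean_by_split(series, split_year=2000):
    """Return (mean before split_year, mean from split_year onwards).
    A side with no data yields None."""
    before = [v for y, v in series.items() if y < split_year]
    after = [v for y, v in series.items() if y >= split_year]

    def average(xs):
        return sum(xs) / len(xs) if xs else None

    return average(before), average(after)


# Invented values, for illustration only.
sample = {1998: 8.0, 1999: 9.0, 2001: 10.0}
print(colour_by_split(sorted(sample)))
print(mean_by_split(sample))
```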

Heat Map

Replace time series with heatmaps to make it easier to read high values against year.
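
A heatmap needs each yearly value mapped to a colour. The sketch below shows one way to do that with linear binning into a fixed palette; `to_palette_index` and the five-colour palette are assumptions made for illustration, not part of this notebook.

```python
def to_palette_index(values, n_colours):
    """Map each value linearly onto an index in range(n_colours),
    with the minimum at index 0 and the maximum at the last index."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: use the first colour everywhere.
        return [0] * len(values)
    scale = n_colours / (hi - lo)
    return [min(int((v - lo) * scale), n_colours - 1) for v in values]


# Hypothetical cold-to-hot palette.
palette = ["#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"]
values = [0.0, 5.0, 10.0]
colours = [palette[i] for i in to_palette_index(values, len(palette))]
print(colours)
```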



In [5]:

    
html_tags = read_tags("../../data/SPRI/climate_timeseries/")
prod_dict = OrderedDict()
#print(txts)
#print(len(txts))
# txts = txts[l:j]
# COMMENT THIS LINE TO SEE ALL 64 plots
txts = ['edinburgh_snow_cover.txt','edinburgh_tmp_min.txt','edinburgh_tmp.txt']
#print(txts)
square_flag = False
for p in product(txts, repeat=2):
    prod_dict[str(p)] = pair_data(html_tags[p[0]],
                                  html_tags[p[1]],
                                  p,
                                  1995)

n = len(prod_dict)
k2 = len(txts)

fig_dict = OrderedDict()
cds_dict = OrderedDict()
for k, v in prod_dict.items():
    print( k)
    v[-1][0] = v[-1][0].replace(".txt","").replace("_", " ")
    v[-1][1] = v[-1][1].replace(".txt","").replace("_", " ") 
    fig_dict[k] = figure(width=300, plot_height=300,title=str(v[-1][0]) +" vs "+str(v[-1][1]),
                        title_text_font_size='8pt',
                        tools="reset,hover",
                        x_axis_label=v[-1][0],
                        y_axis_label=v[-1][1])
    fig_dict[k].xaxis.axis_label_text_font_size = '8pt'
    fig_dict[k].yaxis.axis_label_text_font_size = '8pt'
    # xdata = list(np.array(v[0])[:,1]) + list(np.array(v[2])[:,1]) + list(np.array(v[4])[:,1])
    # ydata = list(np.array(v[1])[:,1]) + list(np.array(v[3])[:,1]) + list(np.array(v[5])[:,1])
    # sim = list(np.array(v[1])[:,0]) + list(np.array(v[3])[:,0]) + list(np.array(v[5])[:,0])
    cds_dict[k+'1'] = ColumnDataSource(
    data=dict(
        x=np.array(v[0])[:,1],
        y=np.array(v[1])[:,1],
        desc=np.array(v[1])[:,0]
         )
    )
    cds_dict[k+'2'] = ColumnDataSource(
    data=dict(
        x=np.array(v[2])[:,1],
        y=np.array(v[3])[:,1],
        desc=np.array(v[3])[:,0]
         )
    )
    cds_dict[k+'3'] = ColumnDataSource(
    data=dict(
        x=np.array(v[4])[:,1],
        y=np.array(v[5])[:,1],
        desc=np.array(v[5])[:,0]
         )
    )
    hover = HoverTool()
    s1=fig_dict[k].scatter(np.array(v[0])[:,1], np.array(v[1])[:,1],
                           fill_color='red',size=13,source=cds_dict[k +'1'])
    s1.select(dict(type=HoverTool)).tooltips = {"x":"$x", "y":"$y", "year": "@desc"}
    s2=fig_dict[k].scatter(np.array(v[2])[:,1], np.array(v[3])[:,1],fill_color='green',size=10,source=cds_dict[k +'2'])
    s2.select(dict(type=HoverTool)).tooltips = {"x":"$x", "y":"$y", "year": "@desc"}
    s3=fig_dict[k].scatter(np.array(v[4])[:,1], np.array(v[5])[:,1],fill_color='blue',size=7,source=cds_dict[k+'3'])
    s3.select(dict(type=HoverTool)).tooltips = {"x":"$x", "y":"$y", "year": "@desc"}
    # fig_dict[k].line(np.array(v[0])[:,1], np.array(v[1])[:,1])
    

f_vals = list(fig_dict.values())

pl = [[None]*(k2 -1 - round((i + 1) / k2))+f_vals[i: i +1 + round((i + 1) / k2)][::-1] for i in range(0, n, k2)]
print(pl)
g = gridplot(pl[::-1])
output_notebook()
show(g)

('edinburgh_snow_cover.txt', 'edinburgh_snow_cover.txt')
('edinburgh_snow_cover.txt', 'edinburgh_tmp_min.txt')
('edinburgh_snow_cover.txt', 'edinburgh_tmp.txt')
('edinburgh_tmp_min.txt', 'edinburgh_snow_cover.txt')
('edinburgh_tmp_min.txt', 'edinburgh_tmp_min.txt')
('edinburgh_tmp_min.txt', 'edinburgh_tmp.txt')
('edinburgh_tmp.txt', 'edinburgh_snow_cover.txt')
('edinburgh_tmp.txt', 'edinburgh_tmp_min.txt')
('edinburgh_tmp.txt', 'edinburgh_tmp.txt')
[[None, None, <bokeh.plotting.Figure object at 0x7f935765b9b0>], [None, <bokeh.plotting.Figure object at 0x7f935765c550>, <bokeh.plotting.Figure object at 0x7f935765ca58>], [<bokeh.plotting.Figure object at 0x7f93573587b8>, <bokeh.plotting.Figure object at 0x7f9357366e48>, <bokeh.plotting.Figure object at 0x7f9357366f98>]]

BokehJS successfully loaded.
    
    Warning: BokehJS previously loaded
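
The layout expression in the cell above pads each row with None so that only one triangle of the k-by-k matrix of plots is shown, since the other triangle only contains transposes. A more readable way to build such a layout is sketched here. Note that `lower_triangle_grid` is a hypothetical helper and produces a left-aligned lower triangle, which is a simpler arrangement than the reversed, right-aligned one printed above.

```python
def lower_triangle_grid(figs, k):
    """Arrange a row-major list of k*k items into k rows, keeping the
    items on or below the diagonal and replacing the rest with None."""
    if len(figs) != k * k:
        raise ValueError("expected exactly k*k items")
    return [[figs[row * k + col] if col <= row else None
             for col in range(k)]
            for row in range(k)]


# Integers stand in for the figure objects here.
layout = lower_triangle_grid(list(range(9)), 3)
for row in layout:
    print(row)
```

The resulting list of rows has the same shape as the nested list passed to gridplot in the cell above.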

Vstack Plot of all Timeseries

The following plot stacks the time series vertically.



In [4]:

    
t_fig_dict = {}
n_dict = parse_time_series(html_tags, txts)
print(n_dict.keys())
series = []
for k in list(sorted(n_dict.keys())):
    v = n_dict[k]
    t_fig_dict[k] = figure(title=k.split(",")[-1][0:-1],
                           tools=[HoverTool()])
    # Sort the (year, value) pairs by year before plotting
    tmp = sorted(v.items())
    arr = np.array(tmp)
    date = arr[:,0]
    vals = arr[:,1]
    t_fig_dict[k].line(date, vals)
    t_fig_dict[k].scatter(date, vals)
    series.append(TimeSeries({'dates': date, 'vals': vals},
                             index='dates'))

#v=vplot(*list(t_fig_dict.values()))
v = vplot(*list(series))
#series
show(v)

dict_keys(['edinburgh_tmp_min.txt', 'edinburgh_snow_cover.txt', 'edinburgh_tmp.txt'])



In [7]:

    
c = climdat.ClimPlots(txts, path="../../data/SPRI/climate_timeseries/")



In [8]:

    
c.plot_pairs()