Severe Weather Forecasting with Python and Data Science Tools: Interactive Demo

David John Gagne, University of Oklahoma and NCAR

Introduction

Severe weather forecasting has entered an age of unprecedented access to large model and observational datasets, with even larger volumes of data in the pipeline. With multiple ensembles of convection-allowing models available and an increasing variety of observations derived from radar, satellite, surface, upper air, and crowd-sourcing, forecasters can easily be overwhelmed with guidance. Without ways to organize, synthesize, and visualize the data in a form that is useful to forecasters, new models and observations will languish unused and never reach their full potential. An even worse outcome would be to take human forecasters out of the loop entirely and trust the models blindly, a path fraught with peril. Data science tools offer ways to synthesize essential information from many disparate data sources while also quantifying uncertainty. When forecasters use these tools properly, the output helps them target their domain knowledge, so they can identify potential hazards, and the associated uncertainties in space and time, more quickly.

This module demonstrates how data science tools from the image processing and machine learning families can be used to create a forecast of severe hail. It aims to teach the advantages, challenges, and limitations of these tools through hands-on interaction.


In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from mpl_toolkits.basemap import Basemap
from IPython.display import display, Image
from ipywidgets import widgets, interact
from scipy.ndimage import gaussian_filter, find_objects
from copy import deepcopy
from glob import glob
import matplotlib.patches as patches

Part 1: Storm Track Identification

We will be using the hagelslag library to perform object-based data processing of convection-allowing model output.


In [2]:
from hagelslag.processing.EnhancedWatershedSegmenter import EnhancedWatershed
from hagelslag.data.ModelOutput import ModelOutput
from hagelslag.processing.ObjectMatcher import ObjectMatcher, shifted_centroid_distance, centroid_distance
from hagelslag.processing.STObject import STObject
from hagelslag.processing.tracker import label_storm_objects, extract_storm_objects

We will be using model output from the control run of the Center for Analysis and Prediction of Storms 2015 Storm-Scale Ensemble Forecast system. The model output for this exercise is included with the hagelslag package, but additional variables can be downloaded from the Unidata RAMADDA server or from my personal page. Please untar the data in a local directory and modify the model_path variable below to point to the spring2015_unidata directory.
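
If you download the archive, a minimal sketch for extracting it and pointing model_path at the result could look like the following; the archive filename here is hypothetical, so adjust the paths to match the file you actually saved.

import tarfile

# Hypothetical archive name; use the filename of the tarball you downloaded.
archive = "spring2015_unidata.tar.gz"
with tarfile.open(archive) as tar:
    tar.extractall("../testdata/")  # extract next to the other test data

# This matches the model_path used in the cell below.
model_path = "../testdata/spring2015_unidata/"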


In [3]:
model_path = "../testdata/spring2015_unidata/"
ensemble_name = "SSEF"
member = "wrf-s3cn_arw"
run_date = datetime(2015, 6, 4)
# We will be using the uh_max (hourly max 2-5 km Updraft Helicity) variable for this exercise
# cmpref (simulated composite radar reflectivity) is also available.
variable = "uh_max"
start_date = run_date + timedelta(hours=12)
end_date = run_date + timedelta(hours=29)
model_grid = ModelOutput(ensemble_name, 
                         member, 
                         run_date, 
                         variable, 
                         start_date, 
                         end_date,
                         model_path,
                         map_file="../mapfiles/ssef2015.map",
                         single_step=False)
model_grid.load_data()
model_grid.load_map_info("../mapfiles/ssef2015.map")


/Users/dgagne/miniconda3/envs/ml/lib/python3.6/site-packages/hagelslag-0.2-py3.6.egg/hagelslag/data/ModelGrid.py:39: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated.  Use `pandas.date_range` instead.
  freq=self.frequency)

The max updraft helicity map over the full period shows multiple long and intense tracks in the central Plains.


In [4]:
lon_range = (-105, -91)
lat_range = (35, 44)
basemap = Basemap(projection="cyl", 
                  resolution="l",
                  llcrnrlon=lon_range[0], 
                  urcrnrlon=lon_range[1],
                  llcrnrlat=lat_range[0],
                  urcrnrlat=lat_range[1])
plt.figure(figsize=(10,8))
basemap.drawstates()
plt.contourf(model_grid.lon, 
             model_grid.lat, 
             model_grid.data.max(axis=0), 
             np.arange(25,225,25), 
             extend="max",
             cmap="YlOrRd")
#plt.colorbar(shrink=0.6, fraction=0.05, pad=0.02 )
title_info = plt.title("Max Updraft Helicity {0}-{1}".format(start_date.strftime("%d %B %y %H:%M"), 
                                                             end_date.strftime("%d %B %y %H:%M")),
                      fontweight="bold", fontsize=14)
plt.savefig("uh_swaths.png", dpi=200, bbox_inches="tight")


To investigate the timing of the tracks, we can use this interactive widget to explore the tracks through time and to zoom in on areas of interest.


In [8]:
zoomable_bmap = Basemap(projection="cyl", 
                        resolution="l", 
                        llcrnrlon=model_grid.lon.min(),
                        llcrnrlat=model_grid.lat.min(),
                        urcrnrlon=model_grid.lon.max(),
                        urcrnrlat=model_grid.lat.max(),
                        fix_aspect=False)
def model_time_viewer(lon_range, lat_range, hour):
    #lon_range = (-108, -90)
    #lat_range = (35, 45)
    #basemap = Basemap(projection="cyl", 
    #                  resolution="l",
    #                  llcrnrlon=lon_range[0], 
    #                  urcrnrlon=lon_range[1],
    #                  llcrnrlat=lat_range[0],
    #                  urcrnrlat=lat_range[1])
    plt.figure(figsize=(12,8))
    zoomable_bmap.drawstates()
    zoomable_bmap.drawcoastlines()
    zoomable_bmap.drawcountries()
    plt.contourf(model_grid.lon, 
                 model_grid.lat, 
                 model_grid.data[hour - model_grid.start_hour], 
                 #np.arange(3, 80, 3),
                 np.arange(25,225,25), 
                 extend="max",
                 cmap="YlOrRd")
    plt.colorbar(shrink=0.6, fraction=0.05, pad=0.02)
    title_info = plt.title("Max Updraft Helicity Valid {0}".format((run_date + timedelta(hours=hour)).strftime(
                "%d %B %Y %H UTC")))
    plt.xlim(*lon_range)
    plt.ylim(*lat_range)
    plt.show()
lon_slider = widgets.IntRangeSlider(min=int(model_grid.lon.min()), 
                                    max=int(model_grid.lon.max()), 
                                    step=1, value=(-108, -90))
lat_slider = widgets.IntRangeSlider(min=int(model_grid.lat.min()), 
                                    max=int(model_grid.lat.max()), 
                                    value=(35,45),
                                    step=1)
hour_slider = widgets.IntSlider(min=model_grid.start_hour, max=model_grid.end_hour, step=1, value=model_grid.start_hour)
w = widgets.interactive(model_time_viewer, lon_range=lon_slider, lat_range=lat_slider, hour=hour_slider)
display(w)


Storm Track Identification with the Enhanced Watershed

Our first data science tool is the enhanced watershed (Lakshmanan 2009), which is used for identifying features in gridded data. The original watershed transform identifies regions in an image or grid by finding local maxima and then iteratively growing objects from those maxima, adding adjacent pixels that meet a minimum intensity. Because the traditional watershed uses an intensity threshold as the sole stopping criterion for growth, it can produce unrealistic-looking objects. The enhanced watershed first discretizes the data and then uses size and relative intensity thresholds to identify discrete objects. Buffer regions are also created around each object.

The enhanced watershed has the following tunable parameters (a minimal usage sketch follows the list):

  • min, step, max: parameters to quantize the grid into a discrete number of levels
  • size: growth of an object is stopped after it reaches the specified number of grid points in area
  • delta: the maximum range of values contained within an object
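
As a minimal, non-interactive sketch of how these parameters map onto hagelslag's EnhancedWatershed (the same calls are used in the cells below), with parameter values chosen only for illustration:

ew = EnhancedWatershed(25, 1, 225, 100, 100)        # (min, step, max, size, delta)
smoothed = gaussian_filter(model_grid.data[0], 1)   # light smoothing suppresses noisy maxima
labels = ew.label(smoothed)                         # integer label grid, 0 = background
labels = ew.size_filter(labels, 12)                 # drop labeled objects smaller than 12 grid points
print("Number of distinct objects:", len(np.unique(labels)) - 1)
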
Exercise: Manual Tuning

Pick a model time step and tune the enhanced watershed parameters until the objects look reasonable. Note how changing parameter values affects the shape of the objects. See how your chosen set of parameters handles other time steps. Finally, see what parameter settings produce particularly poor objects. If you find either a particularly good representation or a hilariously bad one, right-click the image, save it, and email the image to me at djgagne@ou.edu.


In [9]:
plt.figure(figsize=(10 ,5))
model_grid.data.shape
label_grid = label_storm_objects(model_grid.data[11], "hyst", 10, 50, min_area=2, max_area=100)
storm_objs = extract_storm_objects(label_grid, model_grid.data[11], model_grid.x, model_grid.y, np.array([0]))
print(storm_objs)
plt.subplot(1, 2, 1)
plt.title("Original UH Track")
plt.contourf(model_grid.x / 1000, model_grid.y / 1000, model_grid.data[11], np.arange(25, 225, 25))
plt.xlim(-600, -500)
plt.ylim(-200, -150)
plt.subplot(1, 2, 2)
for storm_obj in storm_objs[0]:
    print(storm_obj.timesteps[0].max())
    plt.contourf(storm_obj.x[0] / 1000, storm_obj.y[0] / 1000, storm_obj.timesteps[0], np.arange(25, 225, 25))
plt.xlim(-600, -500)
plt.ylim(-200, -150)
plt.colorbar()
plt.title("Extracted UH Track")
plt.savefig("extracted_uh_comp.png", dpi=200, bbox_inches="tight")


[[<hagelslag.processing.STObject.STObject object at 0x13a1214e0>, <hagelslag.processing.STObject.STObject object at 0x13a1219e8>, <hagelslag.processing.STObject.STObject object at 0x13a121198>, <hagelslag.processing.STObject.STObject object at 0x13a121a20>, <hagelslag.processing.STObject.STObject object at 0x13a121a58>, <hagelslag.processing.STObject.STObject object at 0x13a121a90>, <hagelslag.processing.STObject.STObject object at 0x13a121ac8>, <hagelslag.processing.STObject.STObject object at 0x13a121b00>, <hagelslag.processing.STObject.STObject object at 0x13a121b38>, <hagelslag.processing.STObject.STObject object at 0x13a121b70>, <hagelslag.processing.STObject.STObject object at 0x13a121ba8>, <hagelslag.processing.STObject.STObject object at 0x13a121be0>, <hagelslag.processing.STObject.STObject object at 0x13a121c18>, <hagelslag.processing.STObject.STObject object at 0x13a121c50>, <hagelslag.processing.STObject.STObject object at 0x13a121c88>, <hagelslag.processing.STObject.STObject object at 0x13a121cc0>, <hagelslag.processing.STObject.STObject object at 0x13a121cf8>, <hagelslag.processing.STObject.STObject object at 0x13a121d30>, <hagelslag.processing.STObject.STObject object at 0x13a121d68>, <hagelslag.processing.STObject.STObject object at 0x13a121da0>, <hagelslag.processing.STObject.STObject object at 0x13a121dd8>]]
150.84277
77.21777
58.085938
62.760742
67.8916
99.11719
115.78711
53.15918
180.05078
51.18164
68.5791
67.41406
391.03613
104.73828
157.88086
73.10156
87.92578
53.72754
52.783203
76.77051
81.336914

In [10]:
def ew_demo(min_max, step_val, size_val=50, delta_val=5, time=12):
    ew = EnhancedWatershed(min_max[0],step_val,min_max[1],size_val,delta_val)
    fig, ax = plt.subplots(figsize=(12,8))
    basemap.drawstates()
    labels = ew.label(gaussian_filter(model_grid.data[time - model_grid.start_hour], 1))
    objs = find_objects(labels)
    plt.contourf(model_grid.lon, 
                 model_grid.lat, 
                 labels, 
                 np.arange(1,labels.max()), 
                 extend="max",
                 cmap="Set1")
    for obj in objs:
        sy = model_grid.lat[obj[0].start-1, obj[1].start-1]
        sx = model_grid.lon[obj[0].start-1, obj[1].start-1]
        wx = model_grid.lon[obj[0].stop + 1, obj[1].stop + 1] - sx
        wy = model_grid.lat[obj[0].stop + 1, obj[1].stop + 1] - sy 
        ax.add_patch(patches.Rectangle((sx, sy), wx, wy, fill=False, color="red"))
    plt.xlim(*lon_range)
    plt.ylim(*lat_range)
    plt.grid()
    plt.title("Enhanced Watershed Objects Min: {0:d} Step: {1:d} Max: {2:d} Size: {3:d} Delta: {4:d}) Time: {5:d}".format(min_max[0], 
                                                                                      step_val, 
                                                                                      min_max[1], 
                                                                                      size_val, 
                                                                                      delta_val, 
                                                                                      time),
             fontsize=14, fontweight="bold")
    plt.show()
    #plt.savefig("ew_objs.pdf", bbox_inches="tight")
minmax_slider = widgets.IntRangeSlider(min=25, max=225, step=25, value=(25,225))
step_slider = widgets.IntSlider(min=1, max=10, step=1, value=1)
size_slider = widgets.IntSlider(min=5, max=300, step=5, value=50)
delta_slider = widgets.IntSlider(min=10, max=100, step=10, value=20)
time_slider = widgets.IntSlider(min=model_grid.start_hour, max=model_grid.end_hour, step=1, value=24)
w = widgets.interactive(ew_demo, 
                        min_max=minmax_slider,          
                        step_val=step_slider, 
                        size_val=size_slider, 
                        delta_val=delta_slider,
                        time=time_slider)
display(w)


Once you find a desirable set of enhanced watershed parameters, input them below and generate storm objects for all time steps.


In [14]:
def get_forecast_objects(model_grid, ew_params, min_size, gaussian_window):
    ew = EnhancedWatershed(*ew_params)
    model_objects = []
    for h, hour in enumerate(np.arange(model_grid.start_hour, model_grid.end_hour + 1)):
        print("{0:02d}".format(hour))
        hour_labels = ew.size_filter(ew.label(gaussian_filter(model_grid.data[h], gaussian_window)), min_size)
        obj_slices = find_objects(hour_labels)
        num_slices = len(obj_slices)
        model_objects.append([])
        if num_slices > 0:
            for sl in obj_slices:   
                model_objects[-1].append(STObject(model_grid.data[h][sl],
                                                  np.where(hour_labels[sl] > 0, 1, 0),
                                                  model_grid.x[sl], 
                                                  model_grid.y[sl], 
                                                  model_grid.i[sl], 
                                                  model_grid.j[sl],
                                                  hour,
                                                  hour,
                                                  dx=3000))
                if h > 0:
                    dims = model_objects[-1][-1].timesteps[0].shape
                    model_objects[-1][-1].estimate_motion(hour, model_grid.data[h-1], dims[1], dims[0])
    return model_objects

min_thresh = 25
max_thresh = 225
step = 1
max_size = 100
min_size = 12
delta = 100
gaussian_filter_size = 1

model_objects = get_forecast_objects(model_grid, (min_thresh, step, max_thresh, max_size, delta), 
                                     min_size, gaussian_filter_size)


12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

Object Tracking

Tracking storms over time provides useful information about their evolution and threat potential. However, storm tracking is a challenging problem because storms form, split, merge, and die. Basic object-based storm tracking methods compare the locations of storms at one time step with the locations at the previous time step and then find an optimal matching between the two sets; unmatched storms are treated as newly formed or dissipated.
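
As a conceptual sketch of the matching step (not hagelslag's implementation, which uses the ObjectMatcher shown below): build a cost matrix of distances between storm centroids at consecutive time steps, find the minimum-cost assignment, and reject any pair separated by more than a cutoff distance. The centroid values here are hypothetical.

import numpy as np
from scipy.optimize import linear_sum_assignment

prev_centroids = np.array([[10.0, 5.0], [40.0, 12.0]])                 # storms at hour t-1 (x, y in km)
curr_centroids = np.array([[12.0, 6.0], [80.0, 30.0], [41.0, 14.0]])   # storms at hour t

# Pairwise centroid distances form the cost matrix
cost = np.linalg.norm(prev_centroids[:, None, :] - curr_centroids[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)  # minimum total distance assignment

max_distance = 30.0  # km; matches farther apart than this are rejected
matched = {j for i, j in zip(rows, cols) if cost[i, j] <= max_distance}
for i, j in zip(rows, cols):
    if cost[i, j] <= max_distance:
        print("storm {0} continues as storm {1} ({2:.1f} km)".format(i, j, cost[i, j]))
print("new storms at hour t:", sorted(set(range(len(curr_centroids))) - matched))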


In [17]:
def track_forecast_objects(input_model_objects, model_grid, object_matcher):
    model_objects = deepcopy(input_model_objects)
    hours = np.arange(int(model_grid.start_hour), int(model_grid.end_hour) + 1)
    tracked_model_objects = []
    for h, hour in enumerate(hours):
        past_time_objs = []
        for obj in tracked_model_objects:
            # Potential trackable objects are identified
            if obj.end_time == hour - 1:
                past_time_objs.append(obj)
        # If no objects existed in the last time step, then consider objects in current time step all new
        if len(past_time_objs) == 0:
            tracked_model_objects.extend(deepcopy(model_objects[h]))
        # Match from previous time step with current time step
        elif len(past_time_objs) > 0 and len(model_objects[h]) > 0:
            assignments = object_matcher.match_objects(past_time_objs, model_objects[h], hour - 1, hour)
            unpaired = list(range(len(model_objects[h])))
            for pair in assignments:
                past_time_objs[pair[0]].extend(model_objects[h][pair[1]])
                unpaired.remove(pair[1])
            if len(unpaired) > 0:
                for up in unpaired:
                    tracked_model_objects.append(model_objects[h][up])
        #print("Tracked Model Objects: {0:03d} Hour: {1:02d}".format(len(tracked_model_objects), hour))
    return tracked_model_objects

def make_tracks(dist_weight, max_distance):
    global tracked_model_objects
    object_matcher = ObjectMatcher([shifted_centroid_distance, centroid_distance], 
                                   np.array([dist_weight, 1-dist_weight]), np.array([max_distance * 1000] * 2))
    tracked_model_objects = track_forecast_objects(model_objects, model_grid, object_matcher)
    color_list = ["violet", "cyan", "blue", "green", "purple", "darkgreen", "teal", "royalblue"]
    color_arr = np.tile(color_list, len(tracked_model_objects) // len(color_list) + 1)
    plt.figure(figsize=(10, 8))
    basemap.drawstates()
    plt.contourf(model_grid.lon, 
             model_grid.lat, 
             model_grid.data.max(axis=0), 
             np.arange(25,225,25), 
             extend="max",
             cmap="YlOrRd")
    plt.colorbar(shrink=0.6, fraction=0.05, pad=0.02 )
    for t, tracked_model_object in enumerate(tracked_model_objects):
        traj = tracked_model_object.trajectory()
        t_lon, t_lat = model_grid.proj(traj[0], traj[1], inverse=True)
        plt.plot(t_lon, t_lat, marker='o', markersize=4, color=color_arr[t], lw=2)
        #plt.barbs(t_lon, t_lat, tracked_model_object.u /3000, 
        #          tracked_model_object.v / 3000.0, length=6,
        #          barbcolor=color_arr[t])
    plt.title("Forecast Tracks Shifted Centroid: {0:0.1f}, Centroid: {1:0.1f}, Max Distance: {2:3d} km".format(
        dist_weight, 1-dist_weight, max_distance), fontweight="bold", fontsize=14)
    plt.show()
    #plt.savefig("storm_tracks.png", dpi=200, bbox_inches="tight")
tracked_model_objects = None
weight_slider = widgets.FloatSlider(min=0, max=1, step=0.1, value=1)
dist_slider = widgets.IntSlider(min=10, max=1000, step=10, value=50)
track_w = widgets.interactive(make_tracks, dist_weight=weight_slider, max_distance=dist_slider)
display(track_w)


Part 2: Hail Size Prediction with Machine Learning

Once storm tracks have been identified, data from the other model fields can be extracted within each step of the storm track. The forecast tracks are also associated with the observed tracks using a process similar to storm tracking. Storm track data and associated hail sizes have been extracted for a set of model runs from 20 May through 3 June 2015. We will use these data to find a relationship between the model output and hail size and try to account for the uncertainty in that relationship.

First we will import some statistical and machine learning models from the scikit-learn package. The library supports a wide variety of machine learning models, which are described in great detail in the official documentation.


In [3]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KernelDensity
from scipy.stats import norm

Next, we will load the data using the pandas library. The data are stored in DataFrames, a 2-dimensional data structure in which each row is a record and each column contains data of the same type. DataFrames allow arbitrary indexing of the rows and columns and are based on the R data frame. The individual steps of each track are stored in track_step_data, and information about the full tracks is stored in track_total_data. The dataset contains 117 columns of data.
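
As a tiny illustration of the indexing pattern used throughout Part 2 (the values here are made up): .loc selects rows with a boolean mask and columns by name.

import pandas as pd

toy = pd.DataFrame({"uh_max_max": [40.0, 120.0, 75.0],
                    "Hail_Size": [0.0, 45.0, 25.0]})
matched = toy["Hail_Size"] > 0            # boolean row mask
print(toy.loc[matched, ["uh_max_max"]])   # matched rows, one column, returned as a DataFrame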


In [4]:
train_data_dir = "../testdata/track_data_csv_unidata_train/"
forecast_data_dir = "../testdata/track_data_csv_unidata_forecast/"
train_step_files = sorted(glob(train_data_dir + "track_step_SSEF*.csv"))
train_total_files = sorted(glob(train_data_dir + "track_total_SSEF*.csv"))
track_step_data = pd.concat(map(pd.read_csv, train_step_files), ignore_index=True)
track_total_data = pd.concat(map(pd.read_csv, train_total_files), ignore_index=True)
track_forecast_data = pd.read_csv(forecast_data_dir + "track_step_SSEF_wrf-s3cn_arw_20150604.csv")
pd.set_option('display.max_columns', track_step_data.shape[1])
print(track_step_data.shape)
track_step_data.describe()


(834, 117)
Out[4]:
[track_step_data.describe() output: count, mean, std, min, 25%, 50%, 75%, and max for each numeric column. Every column has a count of 834. The columns include Forecast_Hour, Valid_Hour_UTC, Duration, Centroid_Lon, Centroid_Lat, the mean/max/min/std and corresponding _dt statistics for each storm and environment variable (uh_max, r10cmx, wupmax, cqgmax, wdnmax, mlcape, mlcins, sblcl, srh03, shr06, tmp500, tmp700, dewp2m, temp2m, sph850, sph500, lllr, lr75), and Hail_Size.]

The extracted hail sizes show a skewed distribution with a long tail. Storms with a hail size of 0 were not matched with an observed track and should be excluded when building a regression model.


In [5]:
track_step_data["Hail_Size"].hist(bins=np.arange(0,105,5))
plt.xlabel("Hail Size (mm)")
plt.ylabel("Frequency")


Out[5]:
Text(0, 0.5, 'Frequency')

In [38]:
plt.figure(figsize=(8, 4))
from sklearn.preprocessing import PowerTransformer
plt.subplot(1, 2, 1)
idxs = (track_step_data["Hail_Size"] > 0) & (track_step_data["Hail_Size"] < 125)
plt.hist(track_step_data["Hail_Size"][idxs], bins=np.arange(0, 100, 5))
pt = PowerTransformer(method="box-cox")
bc_hail = pt.fit_transform(track_step_data.loc[idxs, ["Hail_Size"]])
plt.title("Raw Hail Sizes")
plt.ylabel("Frequency", fontsize=12)
plt.xlabel("Hail Size (mm)", fontsize=12)
plt.subplot(1, 2, 2)
log_hail= np.log(track_step_data["Hail_Size"][idxs])
plt.hist((log_hail - log_hail.mean()) / log_hail.std(), bins=np.arange(-3, 3, 0.2))
plt.title("Log-Transformed and Standardized")
plt.xlabel("log(Hail Size) Anomaly", fontsize=12)
plt.savefig("hail_size_scaled.png", dpi=200, bbox_inches="tight")
#plt.subplot(1, 3, 3)
#plt.hist(bc_hail, bins=np.arange(-3, 3, 0.2))




The simplest and generally first choice for a statistical model is an ordinary linear regression, which minimizes the mean squared error over the full training set. Since we used updraft helicity to identify the storm tracks, we start by building two linear models that relate the maximum updraft helicity to the observed hail size. We first fit a linear model of the form $y=ax+b$. Then we fit a power-law model (think Z-R relationship) of the form $\ln(y)=a\ln(x) + b$. The training data and the two fitted models are plotted below.
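
Exponentiating the fitted log-space relationship turns the power-law coefficients back into a hail size prediction, which is exactly how the curve is evaluated in the cell below with np.exp(log_lr.intercept_) * uh_test_vals ** log_lr.coef_[0]:

$$\ln(y) = a\ln(x) + b \quad\Longrightarrow\quad y = e^{b}\, x^{a}$$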


In [24]:
# We are filtering the unmatched storm tracks and storm tracks matched with unrealistically high values
filter_idx = (track_step_data['Hail_Size'] > 0) & (track_step_data['Hail_Size'] < 100)
x_var = "uh_max_max"
print("Standard deviation", track_step_data["Hail_Size"][filter_idx].std())
print("Correlation coefficient", np.corrcoef(track_step_data[x_var][filter_idx], 
                                             track_step_data['Hail_Size'][filter_idx])[0,1])
lr = LinearRegression()
log_lr = LinearRegression()
log_lr.fit(np.log(track_step_data.loc[filter_idx,[x_var]]), np.log(track_step_data['Hail_Size'][filter_idx]))
lr.fit(track_step_data.loc[filter_idx,[x_var]], track_step_data['Hail_Size'][filter_idx])
print("Linear model:", "a", lr.coef_[0], "b",lr.intercept_)
print("Power law model:","a",log_lr.coef_[0], "b",log_lr.intercept_)
plt.scatter(track_step_data.loc[filter_idx, x_var], 
            track_step_data.loc[filter_idx, 'Hail_Size'], 10, 'r')
uh_test_vals = np.arange(1 , track_step_data.loc[filter_idx, x_var].max())
power_hail_vals = np.exp(log_lr.intercept_) * uh_test_vals ** log_lr.coef_[0]
hail_vals = lr.intercept_ + lr.coef_[0] * uh_test_vals
plt.plot(uh_test_vals, power_hail_vals)
plt.plot(uh_test_vals, hail_vals)
plt.xlabel(x_var)
plt.ylabel("Hail Size (mm)")


Standard deviation 13.972728715232114
Correlation coefficient 0.24703314933322518
Linear model: a 0.07377552474447038 b 28.734268757919274
Power law model: a 0.23007580575113645 b 2.478056598878192
Out[24]:
Text(0, 0.5, 'Hail Size (mm)')

While the linear regression fit does show a slight positive relationship with hail size, the scatter around the fit is large. We could try to train a multiple linear regression on a subset of all the variables, but finding that subset is time consuming, and the resulting model still relies on the assumptions of constant variance and normally distributed errors.

Alternatively, we could use a decision tree, which is a popular model from the machine learning community that performs variable selection, is robust against outliers and missing data, and does not rely on any parametric data model assumptions. While individual decision trees do not provide very high accuracy, randomized ensembles of decision trees are consistently among the best performing models in many applications. We will experiment with the Random Forest, a popular decision tree ensemble due to its ease of use and generally high accuracy.


In [27]:
rf = RandomForestRegressor(n_estimators=500, min_samples_split=20, max_features="sqrt")
rf.fit(track_step_data.loc[filter_idx, track_step_data.columns[3:-1]], track_step_data['Hail_Size'][filter_idx])


Out[27]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=20,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Random forests provide a measure, called variable importance, of how much each input variable affects the performance of the model. It is a normalized measure of the decrease in error attributed to each variable.


In [28]:
def plot_importances(num_features):
    feature_names = np.array(["{0} ({1:d})".format(f, x) for x, f in enumerate(track_step_data.columns[3:-1].values)])
    feature_ranks = np.argsort(rf.feature_importances_)
    plt.figure(figsize=(5,8))
    plt.barh(np.arange(feature_ranks.size)[-num_features:], 
             rf.feature_importances_[feature_ranks][-num_features:], height=1)
    plt.yticks(np.arange(feature_ranks.size)[-num_features:] + 0.5, feature_names[feature_ranks][-num_features:])
    plt.ylim(feature_names.size-num_features, feature_names.size)
    plt.xlabel("Normalized Importance")
    plt.title("Random Forest Variable Importance")
feature_slider = widgets.IntSlider(min=1, max=100, value=10)
feature_w = widgets.interactive(plot_importances, num_features=feature_slider)
display(feature_w)


We can validate the two models by comparing their hail size predictions for 4 June 2015 against the observed sizes. The root mean squared errors of the two models are similar, but the random forest appears to spread its predictions over a larger range of hail sizes.


In [30]:
ver_idx = (track_forecast_data["Hail_Size"].values > 0) & (track_forecast_data["Hail_Size"].values < 100)
rf_preds = rf.predict(track_forecast_data.loc[ver_idx, track_forecast_data.columns[3:-1]])
lr_preds = lr.predict(track_forecast_data.loc[ver_idx,["uh_max_max"]])
rf_rmse = np.sqrt(np.mean(np.power(rf_preds - track_forecast_data.loc[ver_idx, "Hail_Size"], 2)))
lr_rmse = np.sqrt(np.mean(np.power(lr_preds - track_forecast_data.loc[ver_idx, "Hail_Size"], 2)))
plt.figure(figsize=(12, 6))
plt.subplot(1,2,1)
plt.scatter(rf_preds, track_forecast_data.loc[ver_idx, "Hail_Size"])
plt.plot(np.arange(0, 65, 5), np.arange(0, 65, 5), "k--")
plt.xlabel("Random Forest Hail Size (mm)")
plt.ylabel("Observed Hail Size (mm)")
plt.title("Random Forest Predictions RMSE: {0:0.3f}".format(rf_rmse))
plt.xlim(20,60)
plt.ylim(20,60)
plt.subplot(1,2,2)
plt.scatter(lr_preds, track_forecast_data.loc[ver_idx, "Hail_Size"])
plt.plot(np.arange(0, 65, 5), np.arange(0, 65, 5), "k--")
plt.xlabel("Linear Regression Hail Size (mm)")
plt.ylabel("Observed Hail Size (mm)")
plt.title("Linear Regression Predictions RMSE: {0:0.3f}".format(lr_rmse))
plt.xlim(20,60)
plt.ylim(20,60)


Out[30]:
(20, 60)

Since each tree in the random forest produces an independent prediction, a probability density function can be generated from the ensemble of tree predictions for each storm. Two methods can be used to translate the predictions into probabilities. A kernel density estimate (KDE) uses a moving window to determine the probability density based on the concentration of predictions at particular values. The alternative approach is to assume a parametric distribution, such as a Gaussian, and fit its parameters to the predictions. Because the KDE is non-parametric, it is much better at identifying the longer tails and secondary peaks that may hint at the chance for extreme hail. The example below shows how the KDE and Gaussian distributions compare for all of the June 4 predictions.
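
With scikit-learn's default Gaussian kernel and bandwidth $h$ (set to 4 mm in the cell below), the KDE built from the $n$ tree predictions $x_1, \dots, x_n$ is

$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h\sqrt{2\pi}} \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right),$$

so each tree contributes a small Gaussian bump, and clusters of trees predicting similar sizes add up to peaks in the density.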


In [31]:
kde = KernelDensity(bandwidth=4)
bins = np.arange(0, 100)
bins = bins.reshape((bins.size, 1))
rf_tree_preds = np.array([t.predict(track_forecast_data.loc[ver_idx, track_forecast_data.columns[3:-1]]) for t in rf.estimators_])
mean_preds = rf_tree_preds.mean(axis=0)
sd_preds = rf_tree_preds.std(axis=0)
rf_pdfs = []
for r in range(rf_tree_preds.shape[1]):
    kde.fit(rf_tree_preds[:, r:r+1])
    rf_pdfs.append(np.exp(kde.score_samples(bins)))
rf_pdfs = np.array(rf_pdfs)
rf_cdfs = rf_pdfs.cumsum(axis=1)
pred_sorted = np.argsort(mean_preds)
def plot_pdfs(min_max):
    plt.figure(figsize=(15,5))
    plt.subplot(1,2,1)
    plt.title("Random Forest KDE Prediction PDFs")
    for r in rf_pdfs[pred_sorted][min_max[0]:min_max[1]+1]:
        plt.plot(bins, r)
    plt.ylim(0, 0.1)
    plt.xlabel("Forecast Hail Size (mm)")
    plt.ylabel("Probability Density")
    plt.xticks(np.arange(0, 105, 5))
    plt.grid()
    plt.subplot(1,2,2)
    plt.title("Random Forest Gaussian Prediction PDFs")
    # Pair each mean with its matching standard deviation from the same sorted slice
    sorted_sds = sd_preds[pred_sorted][min_max[0]:min_max[1]+1]
    for mean_pred, sd_pred in zip(mean_preds[pred_sorted][min_max[0]:min_max[1]+1], sorted_sds):
        plt.plot(bins, norm.pdf(bins, loc=mean_pred, scale=sd_pred))
    plt.ylim(0, 0.1)
    plt.xlabel("Forecast Hail Size (mm)")
    plt.ylabel("Probability Density")
    plt.xticks(np.arange(0, 105, 5))
    plt.grid()
    plt.show()
mm_slider = widgets.IntRangeSlider(min=0, max=rf_pdfs.shape[0])
display(widgets.interactive(plot_pdfs, min_max=mm_slider))


Closing

This notebook demonstrated some of the power of data science tools for turning raw data into valuable forecast information. Please continue to experiment with the parameters in this notebook, and try some of the other models found in scikit-learn and elsewhere. If you have any questions or comments, please email me at djgagne@ou.edu or find me on Twitter (@DJGagneDos).

Additional Resources


In [ ]: