Severe weather forecasting has entered an age of unprecedented access to large model and observational datasets, with even more data in the pipeline. With multiple ensembles of convection-allowing models available and an increasing variety of observations derived from radar, satellite, surface, upper-air, and crowd-sourced networks, forecasters can easily be overwhelmed with guidance. Without ways to organize, synthesize, and visualize the data in a useful manner for forecasters, the pile of new models and observations will languish unused and never fulfill its potential. An even worse outcome would be to take human forecasters completely out of the loop and trust the models blindly, a path fraught with peril. Data science tools offer ways to synthesize essential information from many disparate data sources while also quantifying uncertainty. When forecasters use these tools properly, they can identify potential hazards, and the associated spatial and temporal uncertainties, more quickly by using the tools' output to help target their domain knowledge.
This module demonstrates how data science tools from the image processing and machine learning families can be used to create a forecast of severe hail. It aims to teach the advantages, challenges, and limitations of these tools through hands-on interaction.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from mpl_toolkits.basemap import Basemap
from IPython.display import display, Image
from ipywidgets import widgets, interact
from scipy.ndimage import gaussian_filter, find_objects
from copy import deepcopy
from glob import glob
import matplotlib.patches as patches
In [2]:
from hagelslag.processing.EnhancedWatershedSegmenter import EnhancedWatershed
from hagelslag.data.ModelOutput import ModelOutput
from hagelslag.processing.ObjectMatcher import ObjectMatcher, shifted_centroid_distance, centroid_distance
from hagelslag.processing.STObject import STObject
from hagelslag.processing.tracker import label_storm_objects, extract_storm_objects
We will be using model output from the control run of the Center for Analysis and Prediction of Storms 2015 Storm-Scale Ensemble Forecast system. The model output for this exercise is included with the hagelslag package, but additional variables can be downloaded from the Unidata RAMADDA server or from my personal page. Please untar the data in a local directory and modify the model_path variable below to point to the spring2015_unidata directory.
In [3]:
model_path = "../testdata/spring2015_unidata/"
ensemble_name = "SSEF"
member ="wrf-s3cn_arw"
run_date = datetime(2015, 6, 4)
# We will be using the uh_max (hourly max 2-5 km Updraft Helicity) variable for this exercise
# cmpref (simulated composite radar reflectivity) is also available.
variable = "uh_max"
start_date = run_date + timedelta(hours=12)
end_date = run_date + timedelta(hours=29)
model_grid = ModelOutput(ensemble_name,
member,
run_date,
variable,
start_date,
end_date,
model_path,
map_file="../mapfiles/ssef2015.map",
single_step=False)
model_grid.load_data()
model_grid.load_map_info("../mapfiles/ssef2015.map")
The max updraft helicity map over the full period shows multiple long and intense tracks in the central Plains.
In [4]:
lon_range = (-105, -91)
lat_range = (35, 44)
basemap = Basemap(projection="cyl",
resolution="l",
llcrnrlon=lon_range[0],
urcrnrlon=lon_range[1],
llcrnrlat=lat_range[0],
urcrnrlat=lat_range[1])
plt.figure(figsize=(10,8))
basemap.drawstates()
plt.contourf(model_grid.lon,
model_grid.lat,
model_grid.data.max(axis=0),
np.arange(25,225,25),
extend="max",
cmap="YlOrRd")
#plt.colorbar(shrink=0.6, fraction=0.05, pad=0.02 )
title_info = plt.title("Max Updraft Helicity {0}-{1}".format(start_date.strftime("%d %B %Y %H:%M"),
end_date.strftime("%d %B %Y %H:%M")),
fontweight="bold", fontsize=14)
plt.savefig("uh_swaths.png", dpi=200, bbox_inches="tight")
To investigate the timing of the tracks, we can use this interactive widget to explore the tracks through time and to zoom in on areas of interest.
In [8]:
zoomable_bmap = Basemap(projection="cyl",
resolution="l",
llcrnrlon=model_grid.lon.min(),
llcrnrlat=model_grid.lat.min(),
urcrnrlon=model_grid.lon.max(),
urcrnrlat=model_grid.lat.max(),
fix_aspect=False)
def model_time_viewer(lon_range, lat_range, hour):
#lon_range = (-108, -90)
#lat_range = (35, 45)
#basemap = Basemap(projection="cyl",
# resolution="l",
# llcrnrlon=lon_range[0],
# urcrnrlon=lon_range[1],
# llcrnrlat=lat_range[0],
# urcrnrlat=lat_range[1])
plt.figure(figsize=(12,8))
zoomable_bmap.drawstates()
zoomable_bmap.drawcoastlines()
zoomable_bmap.drawcountries()
plt.contourf(model_grid.lon,
model_grid.lat,
model_grid.data[hour - model_grid.start_hour],
#np.arange(3, 80, 3),
np.arange(25,225,25),
extend="max",
cmap="YlOrRd")
plt.colorbar(shrink=0.6, fraction=0.05, pad=0.02)
title_info = plt.title("Max Updraft Helicity Valid {0}".format((run_date + timedelta(hours=hour)).strftime(
"%d %B %Y %H UTC")))
plt.xlim(*lon_range)
plt.ylim(*lat_range)
plt.show()
lon_slider = widgets.IntRangeSlider(min=int(model_grid.lon.min()),
max=int(model_grid.lon.max()),
step=1, value=(-108, -90))
lat_slider = widgets.IntRangeSlider(min=int(model_grid.lat.min()),
max=int(model_grid.lat.max()),
value=(35,45),
step=1)
hour_slider = widgets.IntSlider(min=model_grid.start_hour, max=model_grid.end_hour, step=1, value=0)
w = widgets.interactive(model_time_viewer, lon_range=lon_slider, lat_range=lat_slider, hour=hour_slider)
display(w)
Our first data science tool is the enhanced watershed (Lakshmanan 2009), which is used for identifying features in gridded data. The original watershed transform identifies regions in an image or grid by finding local maxima and then iteratively growing objects from those maxima, adding adjacent pixels that exceed a decreasing sequence of intensity thresholds. The traditional watershed uses an intensity threshold as the stopping criterion for growth, which can produce unrealistic-looking objects. The enhanced watershed first discretizes the data and then uses size and relative-intensity thresholds to identify discrete objects. Buffer regions are also created around each object.
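The region-growing idea can be illustrated with a toy hysteresis segmentation, a simpler cousin of the watershed (and the "hyst" method that label_storm_objects supports): regions grow outward from strong cores to all connected pixels above a lower threshold, and regions with no strong core are discarded. This is a minimal sketch on a made-up field, not hagelslag's implementation:

```python
import numpy as np
from scipy.ndimage import label

def hysteresis_label(data, low, high):
    """Toy hysteresis segmentation: keep connected regions of pixels above
    `low` only if they contain a core pixel above `high`."""
    regions, n = label(data >= low)
    out = np.zeros_like(regions)
    next_id = 1
    for i in range(1, n + 1):
        mask = regions == i
        if data[mask].max() >= high:  # region has a strong core: keep it
            out[mask] = next_id
            next_id += 1
    return out

# Made-up intensity field: one blob with a strong core (60),
# one weak blob in the lower-right corner with no core.
field = np.array([[0,  5, 12, 11,  0],
                  [0, 15, 60, 14,  0],
                  [0,  9, 13,  8,  0],
                  [0, 11, 12,  0,  0],
                  [0,  0,  0, 11, 12]])
labels = hysteresis_label(field, low=10, high=50)
print(labels)  # only the blob containing the 60 survives
```

The enhanced watershed refines this idea by discretizing the field and applying size and relative-intensity criteria during growth, rather than a single pair of fixed thresholds.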
The enhanced watershed has the following tunable parameters (in the order used by hagelslag's EnhancedWatershed below):
min: the minimum intensity threshold a pixel must exceed to be considered for an object.
step: the increment used to discretize the intensity data.
max: the intensity value at which the discretization is capped.
size: the number of grid points an object must reach before growth stops.
delta: the maximum range of discretized intensity over which an object may grow down from its local maximum.
Pick a model time step and tune the enhanced watershed parameters until the objects look reasonable. Note how changing parameter values affects the shape of the objects. See how your chosen set of parameters handles other time steps. Finally, see what parameter settings produce particularly poor objects. If you find either a particularly good representation or a hilariously bad one, right-click the image, save it, and email the image to me at djgagne@ou.edu.
In [9]:
plt.figure(figsize=(10 ,5))
model_grid.data.shape
label_grid = label_storm_objects(model_grid.data[11], "hyst", 10, 50, min_area=2, max_area=100)
storm_objs = extract_storm_objects(label_grid, model_grid.data[11], model_grid.x, model_grid.y, np.array([0]))
print(storm_objs)
plt.subplot(1, 2, 1)
plt.title("Original UH Track")
plt.contourf(model_grid.x / 1000, model_grid.y / 1000, model_grid.data[11], np.arange(25, 225, 25))
plt.xlim(-600, -500)
plt.ylim(-200, -150)
plt.subplot(1, 2, 2)
for storm_obj in storm_objs[0]:
print(storm_obj.timesteps[0].max())
plt.contourf(storm_obj.x[0] / 1000, storm_obj.y[0] / 1000, storm_obj.timesteps[0], np.arange(25, 225, 25))
plt.xlim(-600, -500)
plt.ylim(-200, -150)
plt.colorbar()
plt.title("Extracted UH Track")
plt.savefig("extracted_uh_comp.png", dpi=200, bbox_inches="tight")
In [10]:
def ew_demo(min_max, step_val, size_val=50, delta_val=5, time=12):
ew = EnhancedWatershed(min_max[0],step_val,min_max[1],size_val,delta_val)
fig, ax = plt.subplots(figsize=(12,8))
basemap.drawstates()
labels = ew.label(gaussian_filter(model_grid.data[time - model_grid.start_hour], 1))
objs = find_objects(labels)
plt.contourf(model_grid.lon,
model_grid.lat,
labels,
np.arange(1,labels.max()),
extend="max",
cmap="Set1")
for obj in objs:
# Pad the bounding box by one grid point, clamped to the grid edges
# so slices touching the domain boundary do not index out of bounds
y0 = max(obj[0].start - 1, 0)
x0 = max(obj[1].start - 1, 0)
y1 = min(obj[0].stop + 1, model_grid.lat.shape[0] - 1)
x1 = min(obj[1].stop + 1, model_grid.lat.shape[1] - 1)
sy = model_grid.lat[y0, x0]
sx = model_grid.lon[y0, x0]
wx = model_grid.lon[y1, x1] - sx
wy = model_grid.lat[y1, x1] - sy
ax.add_patch(patches.Rectangle((sx, sy), wx, wy, fill=False, color="red"))
plt.xlim(*lon_range)
plt.ylim(*lat_range)
plt.grid()
plt.title("Enhanced Watershed Objects Min: {0:d} Step: {1:d} Max: {2:d} Size: {3:d} Delta: {4:d} Time: {5:d}".format(min_max[0],
step_val,
min_max[1],
size_val,
delta_val,
time),
fontsize=14, fontweight="bold")
plt.show()
#plt.savefig("ew_objs.pdf", bbox_inches="tight")
minmax_slider = widgets.IntRangeSlider(min=25, max=225, step=25, value=(25,225))
step_slider = widgets.IntSlider(min=1, max=10, step=1, value=1)
size_slider = widgets.IntSlider(min=5, max=300, step=5, value=50)
delta_slider = widgets.IntSlider(min=10, max=100, step=10, value=20)
time_slider = widgets.IntSlider(min=model_grid.start_hour, max=model_grid.end_hour, step=1, value=24)
w = widgets.interactive(ew_demo,
min_max=minmax_slider,
step_val=step_slider,
size_val=size_slider,
delta_val=delta_slider,
time=time_slider)
display(w)
Once you find a desirable set of enhanced watershed parameters, input them below and generate storm objects for all time steps.
In [14]:
def get_forecast_objects(model_grid, ew_params, min_size, gaussian_window):
ew = EnhancedWatershed(*ew_params)
model_objects = []
for h, hour in enumerate(np.arange(model_grid.start_hour, model_grid.end_hour + 1)):
print("{0:02d}".format(hour))
hour_labels = ew.size_filter(ew.label(gaussian_filter(model_grid.data[h], gaussian_window)), min_size)
obj_slices = find_objects(hour_labels)
num_slices = len(obj_slices)
model_objects.append([])
if num_slices > 0:
for sl in obj_slices:
model_objects[-1].append(STObject(model_grid.data[h][sl],
np.where(hour_labels[sl] > 0, 1, 0),
model_grid.x[sl],
model_grid.y[sl],
model_grid.i[sl],
model_grid.j[sl],
hour,
hour,
dx=3000))
if h > 0:
dims = model_objects[-1][-1].timesteps[0].shape
model_objects[-1][-1].estimate_motion(hour, model_grid.data[h-1], dims[1], dims[0])
return model_objects
min_thresh = 25
max_thresh = 225
step = 1
max_size = 100
min_size = 12
delta = 100
gaussian_filter_size = 1
model_objects = get_forecast_objects(model_grid, (min_thresh, step, max_thresh, max_size, delta),
min_size, gaussian_filter_size)
Tracking storms over time provides useful information about their evolution and threat potential. However, storm tracking is a challenging problem because storms form, split, merge, and die. Basic object-based storm tracking methods compare the locations of storms at one time step with those at the previous time step and then find an optimal matching between the two sets.
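The matching step can be sketched with SciPy's Hungarian-algorithm solver on plain centroid distances. hagelslag's ObjectMatcher combines several weighted distance functions, but the core idea is the same; the storm positions below are made up:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up storm centroids (km) at hour t-1 and hour t
prev_centroids = np.array([[100.0, 200.0], [400.0, 250.0]])
curr_centroids = np.array([[410.0, 260.0], [105.0, 210.0], [700.0, 50.0]])

# Cost matrix of pairwise centroid distances
cost = np.linalg.norm(prev_centroids[:, None, :] - curr_centroids[None, :, :],
                      axis=-1)

# Minimum-total-cost assignment of old storms to new storms
rows, cols = linear_sum_assignment(cost)

# Discard pairs separated by more than a maximum distance;
# unmatched current storms would start new tracks
max_distance = 50.0
pairs = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_distance]
print(pairs)  # -> [(0, 1), (1, 0)]
```

Here old storm 0 matches new storm 1 and vice versa, while new storm 2 is too far from everything and would begin a fresh track, mirroring the unpaired-object handling in track_forecast_objects below.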
In [17]:
def track_forecast_objects(input_model_objects, model_grid, object_matcher):
model_objects = deepcopy(input_model_objects)
hours = np.arange(int(model_grid.start_hour), int(model_grid.end_hour) + 1)
tracked_model_objects = []
for h, hour in enumerate(hours):
past_time_objs = []
for obj in tracked_model_objects:
# Potential trackable objects are identified
if obj.end_time == hour - 1:
past_time_objs.append(obj)
# If no objects existed in the last time step, then consider objects in current time step all new
if len(past_time_objs) == 0:
tracked_model_objects.extend(deepcopy(model_objects[h]))
# Match from previous time step with current time step
elif len(past_time_objs) > 0 and len(model_objects[h]) > 0:
assignments = object_matcher.match_objects(past_time_objs, model_objects[h], hour - 1, hour)
unpaired = list(range(len(model_objects[h])))
for pair in assignments:
past_time_objs[pair[0]].extend(model_objects[h][pair[1]])
unpaired.remove(pair[1])
if len(unpaired) > 0:
for up in unpaired:
tracked_model_objects.append(model_objects[h][up])
#print("Tracked Model Objects: {0:03d} Hour: {1:02d}".format(len(tracked_model_objects), hour))
return tracked_model_objects
def make_tracks(dist_weight, max_distance):
global tracked_model_objects
object_matcher = ObjectMatcher([shifted_centroid_distance, centroid_distance],
np.array([dist_weight, 1-dist_weight]), np.array([max_distance * 1000] * 2))
tracked_model_objects = track_forecast_objects(model_objects, model_grid, object_matcher)
color_list = ["violet", "cyan", "blue", "green", "purple", "darkgreen", "teal", "royalblue"]
color_arr = np.tile(color_list, len(tracked_model_objects) // len(color_list) + 1)
plt.figure(figsize=(10, 8))
basemap.drawstates()
plt.contourf(model_grid.lon,
model_grid.lat,
model_grid.data.max(axis=0),
np.arange(25,225,25),
extend="max",
cmap="YlOrRd")
plt.colorbar(shrink=0.6, fraction=0.05, pad=0.02 )
for t, tracked_model_object in enumerate(tracked_model_objects):
traj = tracked_model_object.trajectory()
t_lon, t_lat = model_grid.proj(traj[0], traj[1], inverse=True)
plt.plot(t_lon, t_lat, marker='o', markersize=4, color=color_arr[t], lw=2)
#plt.barbs(t_lon, t_lat, tracked_model_object.u /3000,
# tracked_model_object.v / 3000.0, length=6,
# barbcolor=color_arr[t])
plt.title("Forecast Tracks Shifted Centroid: {0:0.1f}, Centroid: {1:0.1f}, Max Distance: {2:3d} km".format(
dist_weight, 1-dist_weight, max_distance), fontweight="bold", fontsize=14)
plt.show()
#plt.savefig("storm_tracks.png", dpi=200, bbox_inches="tight")
tracked_model_objects = None
weight_slider = widgets.FloatSlider(min=0, max=1, step=0.1, value=1)
dist_slider = widgets.IntSlider(min=10, max=1000, step=10, value=50)
track_w = widgets.interactive(make_tracks, dist_weight=weight_slider, max_distance=dist_slider)
display(track_w)
Once storm tracks have been identified, data from the other model fields can be extracted within each step of a storm track. The forecast tracks are also associated with the observed tracks using a process similar to storm tracking. Storm track data and associated hail sizes have been extracted for a set of model runs from 20 May through 3 June 2015. We will use these data to find a relationship between the model output and hail size and try to account for the uncertainty in that relationship.
First we will import some statistical and machine learning models from the scikit-learn package. The library supports a wide variety of machine learning models, which are described in great detail in the official documentation.
In [3]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KernelDensity
from scipy.stats import norm
Next, we will load the data using the pandas library. The data are stored in DataFrames, a 2-dimensional data structure in which each row is a record and each column contains data of the same type. DataFrames allow arbitrary indexing of the rows and columns and are based on the R data frame. The individual steps of each track are stored in track_step_data, and information about the full tracks is stored in track_total_data. The dataset contains 117 columns of data.
In [4]:
train_data_dir = "../testdata/track_data_csv_unidata_train/"
forecast_data_dir = "../testdata/track_data_csv_unidata_forecast/"
train_step_files = sorted(glob(train_data_dir + "track_step_SSEF*.csv"))
train_total_files = sorted(glob(train_data_dir + "track_total_SSEF*.csv"))
track_step_data = pd.concat(map(pd.read_csv, train_step_files), ignore_index=True)
track_total_data = pd.concat(map(pd.read_csv, train_total_files), ignore_index=True)
track_forecast_data = pd.read_csv(forecast_data_dir + "track_step_SSEF_wrf-s3cn_arw_20150604.csv")
pd.set_option('display.max_columns', track_step_data.shape[1])
print(track_step_data.shape)
track_step_data.describe()
The extracted hail sizes show a skewed distribution with a long tail. Storms with a hail size of 0 were not matched with an observed track and should be excluded when building a regression model.
In [5]:
track_step_data["Hail_Size"].hist(bins=np.arange(0,105,5))
plt.xlabel("Hail Size (mm)")
plt.ylabel("Frequency")
In [38]:
plt.figure(figsize=(8, 4))
from sklearn.preprocessing import PowerTransformer
plt.subplot(1, 2, 1)
idxs = (track_step_data["Hail_Size"] > 0) & (track_step_data["Hail_Size"] < 125)
plt.hist(track_step_data["Hail_Size"][idxs], bins=np.arange(0, 100, 5))
pt = PowerTransformer(method="box-cox")
bc_hail = pt.fit_transform(track_step_data.loc[idxs, ["Hail_Size"]])
plt.title("Raw Hail Sizes")
plt.ylabel("Frequency", fontsize=12)
plt.xlabel("Hail Size (mm)", fontsize=12)
plt.subplot(1, 2, 2)
log_hail= np.log(track_step_data["Hail_Size"][idxs])
plt.hist((log_hail - log_hail.mean()) / log_hail.std(), bins=np.arange(-3, 3, 0.2))
plt.title("Log-Transformed and Standardized")
plt.xlabel("log(Hail Size) Anomaly", fontsize=12)
plt.savefig("hail_size_scaled.png", dpi=200, bbox_inches="tight")
#plt.subplot(1, 3, 3)
#plt.hist(bc_hail, bins=np.arange(-3, 3, 0.2))
The simplest and generally the first choice for a statistical model is ordinary linear regression, which minimizes the mean squared error over the full training set. Since we used updraft helicity to identify the storm tracks, we start by building two linear models relating maximum updraft helicity to observed hail size. We first fit a linear model of the form $y=ax+b$. Then we fit a power-law model (think Z-R relationship) of the form $\ln(y)=a\ln(x) + b$. The training data and the two fitted models are plotted below.
In [24]:
# We are filtering the unmatched storm tracks and storm tracks matched with unrealistically high values
filter_idx = (track_step_data['Hail_Size'] > 0) & (track_step_data['Hail_Size'] < 100)
x_var = "uh_max_max"
print("Standard deviation", track_step_data["Hail_Size"][filter_idx].std())
print("Correlation coefficient", np.corrcoef(track_step_data[x_var][filter_idx],
track_step_data['Hail_Size'][filter_idx])[0,1])
lr = LinearRegression()
log_lr = LinearRegression()
log_lr.fit(np.log(track_step_data.loc[filter_idx,[x_var]]), np.log(track_step_data['Hail_Size'][filter_idx]))
lr.fit(track_step_data.loc[filter_idx,[x_var]], track_step_data['Hail_Size'][filter_idx])
print("Linear model:", "a", lr.coef_[0], "b",lr.intercept_)
print("Power law model:","a",log_lr.coef_[0], "b",log_lr.intercept_)
plt.scatter(track_step_data.loc[filter_idx, x_var],
track_step_data.loc[filter_idx, 'Hail_Size'], 10, 'r')
uh_test_vals = np.arange(1 , track_step_data.loc[filter_idx, x_var].max())
power_hail_vals = np.exp(log_lr.intercept_) * uh_test_vals ** log_lr.coef_[0]
hail_vals = lr.intercept_ + lr.coef_[0] * uh_test_vals
plt.plot(uh_test_vals, power_hail_vals)
plt.plot(uh_test_vals, hail_vals)
plt.xlabel(x_var)
plt.ylabel("Hail Size (mm)")
While the linear regression fit does show a slight positive relationship with hail size, it also shows a large amount of variance. We could try to train a multiple linear regression on a subset of all the variables, but finding that subset is time-consuming, and the resulting model still relies on the assumptions of constant variance and normality.
Alternatively, we could use a decision tree, which is a popular model from the machine learning community that performs variable selection, is robust against outliers and missing data, and does not rely on any parametric data model assumptions. While individual decision trees do not provide very high accuracy, randomized ensembles of decision trees are consistently among the best performing models in many applications. We will experiment with the Random Forest, a popular decision tree ensemble due to its ease of use and generally high accuracy.
In [27]:
rf = RandomForestRegressor(n_estimators=500, min_samples_split=20, max_features="sqrt")
rf.fit(track_step_data.loc[filter_idx, track_step_data.columns[3:-1]], track_step_data['Hail_Size'][filter_idx])
Random forests provide a measure, called variable importance, of how much each input variable affects the performance of the model. It is a normalized measure of the decrease in error attributable to each variable.
In [28]:
def plot_importances(num_features):
feature_names = np.array(["{0} ({1:d})".format(f, x) for x, f in enumerate(track_step_data.columns[3:-1].values)])
feature_ranks = np.argsort(rf.feature_importances_)
plt.figure(figsize=(5,8))
plt.barh(np.arange(feature_ranks.size)[-num_features:],
rf.feature_importances_[feature_ranks][-num_features:], height=1)
plt.yticks(np.arange(feature_ranks.size)[-num_features:] + 0.5, feature_names[feature_ranks][-num_features:])
plt.ylim(feature_names.size-num_features, feature_names.size)
plt.xlabel("Normalized Importance")
plt.title("Random Forest Variable Importance")
feature_slider = widgets.IntSlider(min=1, max=100, value=10)
feature_w = widgets.interactive(plot_importances, num_features=feature_slider)
display(feature_w)
We can validate the accuracy of the two models by comparing each model's hail size predictions for 4 June 2015 against the observed sizes. The root mean squared errors of the two models are similar, but the random forest spreads its predictions over a larger range of hail sizes.
In [30]:
ver_idx = (track_forecast_data["Hail_Size"].values > 0) & (track_forecast_data["Hail_Size"].values < 100)
rf_preds = rf.predict(track_forecast_data.loc[ver_idx, track_forecast_data.columns[3:-1]])
lr_preds = lr.predict(track_forecast_data.loc[ver_idx,["uh_max_max"]])
rf_rmse = np.sqrt(np.mean(np.power(rf_preds - track_forecast_data.loc[ver_idx, "Hail_Size"], 2)))
lr_rmse = np.sqrt(np.mean(np.power(lr_preds - track_forecast_data.loc[ver_idx, "Hail_Size"], 2)))
plt.figure(figsize=(12, 6))
plt.subplot(1,2,1)
plt.scatter(rf_preds, track_forecast_data.loc[ver_idx, "Hail_Size"])
plt.plot(np.arange(0, 65, 5), np.arange(0, 65, 5), "k--")
plt.xlabel("Random Forest Hail Size (mm)")
plt.ylabel("Observed Hail Size (mm)")
plt.title("Random Forest Predictions RMSE: {0:0.3f}".format(rf_rmse))
plt.xlim(20,60)
plt.ylim(20,60)
plt.subplot(1,2,2)
plt.scatter(lr_preds, track_forecast_data.loc[ver_idx, "Hail_Size"])
plt.plot(np.arange(0, 65, 5), np.arange(0, 65, 5), "k--")
plt.xlabel("Linear Regression Hail Size (mm)")
plt.ylabel("Observed Hail Size (mm)")
plt.title("Linear Regression Predictions RMSE: {0:0.3f}".format(lr_rmse))
plt.xlim(20,60)
plt.ylim(20,60)
Since each tree in the random forest produces an independent output, a probability density function can be generated from them for each prediction. To translate the predictions into probabilities, two methods can be used. A kernel density estimate (KDE) uses a moving window to determine probability based on the concentration of events at particular values. The alternative approach is to assume a parametric distribution, such as a Gaussian, and fit the distribution parameters to your predictions. As the KDE is non-parametric, it is much better at identifying the longer tails and secondary peaks that may hint at the chance for extreme hail. The example below shows how the KDE and Gaussian distributions compare for all of the June 4 predictions.
In [31]:
kde = KernelDensity(bandwidth=4)
bins = np.arange(0, 100)
bins = bins.reshape((bins.size, 1))
rf_tree_preds = np.array([t.predict(track_forecast_data.loc[ver_idx, track_forecast_data.columns[3:-1]]) for t in rf.estimators_])
mean_preds = rf_tree_preds.mean(axis=0)
sd_preds = rf_tree_preds.std(axis=0)
rf_pdfs = []
for r in range(rf_tree_preds.shape[1]):
kde.fit(rf_tree_preds[:, r:r+1])
rf_pdfs.append(np.exp(kde.score_samples(bins)))
rf_pdfs = np.array(rf_pdfs)
rf_cdfs = rf_pdfs.cumsum(axis=1)
pred_sorted = np.argsort(mean_preds)
def plot_pdfs(min_max):
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.title("Random Forest KDE Prediction PDFs")
for r in rf_pdfs[pred_sorted][min_max[0]:min_max[1]+1]:
plt.plot(bins, r)
plt.ylim(0, 0.1)
plt.xlabel("Forecast Hail Size (mm)")
plt.ylabel("Probability Density")
plt.xticks(np.arange(0, 105, 5))
plt.grid()
plt.subplot(1,2,2)
plt.title("Random Forest Gaussian Prediction PDFs")
for r2, mean_pred in enumerate(mean_preds[pred_sorted][min_max[0]:min_max[1]+1]):
plt.plot(bins, norm.pdf(bins, loc=mean_pred, scale=sd_preds[pred_sorted][r2]))
plt.ylim(0, 0.1)
plt.xlabel("Forecast Hail Size (mm)")
plt.ylabel("Probability Density")
plt.xticks(np.arange(0, 105, 5))
plt.grid()
plt.show()
mm_slider = widgets.IntRangeSlider(min=0, max=rf_pdfs.shape[0])
display(widgets.interactive(plot_pdfs, min_max=mm_slider))
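The per-storm PDFs can also be turned into exceedance probabilities, such as the chance of severe hail (25 mm or larger). Here is a sketch with synthetic gamma-distributed values standing in for one storm's 500 per-tree predictions; the shape and scale parameters are made up:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
# Synthetic stand-in for the 500 per-tree predictions (mm) for one storm
tree_preds = rng.gamma(shape=9.0, scale=3.5, size=500).reshape(-1, 1)

# Fit a KDE to the tree predictions, as in the cell above
kde = KernelDensity(bandwidth=4).fit(tree_preds)
bins = np.arange(0, 100).reshape(-1, 1)
pdf = np.exp(kde.score_samples(bins))

# Probability of severe hail: mass of the PDF at or beyond 25 mm
p_severe = pdf[bins.ravel() >= 25].sum() / pdf.sum()
print("P(hail >= 25 mm) = {0:0.2f}".format(p_severe))
```

The same summation over rf_pdfs would give a severe-hail probability for every storm in the 4 June forecast.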
This notebook demonstrated some of the ways data science tools can turn raw data into valuable forecast information. Please continue to experiment with the parameters in this notebook, and try some of the other models found in scikit-learn and elsewhere. If you have any questions or comments, please email me at djgagne@ou.edu or find me on Twitter (Username: @DJGagneDos).