Note: This notebook uses Python 3.
In this notebook we will explore the interpolated dataset and test some basic Python functions.
Let's start by opening the file and checking out what data it contains.
In [1]:
from netCDF4 import Dataset
# Define directory where interpolated files are stored.
DATA_DIR = '/project/meteo/w2w/C7/ppnn_data/' # At LMU
# DATA_DIR = '/Users/stephanrasp/repositories/ppnn/data/' # Mac
# Define file name
fn = 'data_interpolated.nc'
In [2]:
# Open NetCDF rootgroup
rg = Dataset(DATA_DIR + fn)
In [3]:
# What does the NetCDF file contain?
rg
Out[3]:
The output above lists the dimensions and variables in the file.
So how much training data do we have?
In [4]:
# Total number of forecast-observation pairs per lead time
# (each day has two time steps, one per lead time, hence the factor 1/2)
rg.dimensions['station'].size * rg.dimensions['time'].size / 2
Out[4]:
In [5]:
# Rough amount of data per month
rg.dimensions['station'].size * rg.dimensions['time'].size / 2 / 12.
Out[5]:
In [6]:
# Per station per month
rg.dimensions['time'].size / 2 / 12.
Out[6]:
In [7]:
time = rg.variables['time']
time
Out[7]:
In [8]:
time[:5]
Out[8]:
In fact, the time is given in seconds rather than hours.
In [9]:
# convert back to dates (http://unidata.github.io/netcdf4-python/#section7)
from netCDF4 import num2date
dates = num2date(time[:], units='seconds since 1970-01-01 00:00 UTC')
dates[:5]
Out[9]:
So the dates are in 12-hour intervals. Since we downloaded 36/48 h forecasts, this means that the 12 UTC dates correspond to the 36 h forecasts, and the following 00 UTC dates correspond to the same forecast at 48 h lead time.
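To make this concrete, here is a minimal sketch (using the dates array computed above) of how the two lead times could be separated by their valid hour; the index names are just for illustration:
# Sketch: split the time step indices by valid hour to separate the lead times
idx_36h = [i for i, d in enumerate(dates) if d.hour == 12]  # 36 h lead time
idx_48h = [i for i, d in enumerate(dates) if d.hour == 0]   # 48 h lead time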
The station and station_id variables are in fact identical and simply contain a number, which neither starts at one nor is continuous.
In [10]:
import numpy as np
# Check whether the two variables are equal
np.array_equal(rg.variables['station'][:], rg.variables['station_id'][:])
Out[10]:
In [11]:
# Then just print the first 5
rg.variables['station'][:5]
Out[11]:
station_alt contains the station altitude in meters.
In [12]:
rg.variables['station_alt'][:5]
Out[12]:
In [13]:
rg.variables['station_loc'][0].data
Out[13]:
Ahhhh, Aachen :D So this leads me to believe that the stations are numbered alphabetically by name.
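As a side note, here is a minimal sketch of how the names could be decoded in bulk, assuming station_loc stores them as NetCDF character arrays (netCDF4 ships a chartostring helper for exactly this case):
from netCDF4 import chartostring
# Decode the first few station names from the character array
print(chartostring(rg.variables['station_loc'][:5]))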
In [14]:
station_lon = rg.variables['station_lon']
station_lat = rg.variables['station_lat']
In [15]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
In [16]:
plt.scatter(station_lon[:], station_lat[:])
Out[16]:
Wohooo, Germany!
In [17]:
# Let's extract the actual data from the NetCDF array
# Then we can manipulate it later.
tfc = rg.variables['t2m_fc'][:]
tobs = rg.variables['t2m_obs'][:]
In [18]:
tobs[:5, :5].data
Out[18]:
So there are actually missing data in the observations. We will need to think about how to deal with those.
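Since netCDF4 returns masked arrays, a quick sketch to count the missing observations directly from the mask:
# Count missing observations via the masked-array mask
n_missing = np.ma.count_masked(tobs)
print(n_missing, 'of', tobs.size, 'observations are missing')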
Sebastian mentioned that in the current version there are some Celsius/Kelvin inconsistencies.
In [19]:
plt.plot(np.mean(tfc, axis=(1, 2)))
Out[19]:
In [20]:
# Since this will be fixed soon, let's just create a little ad hoc fix:
# find the first time step whose mean is clearly in Kelvin (> 100)
# and convert everything from there on to Celsius
idx = np.where(np.mean(tfc, axis=(1, 2)) > 100)[0][0]
tfc[idx:] = tfc[idx:] - 273.15
In [21]:
# Let's create a little function to visualize the ensemble forecast
# and the corresponding observation
def plot_fc_obs_hist(t, s):
    """Plot the ensemble histogram and observation for time step t and station s."""
    plt.hist(tfc[t, :, s])
    plt.axvline(tobs[t, s], c='r')
    plt.title(num2date(time[t], units='seconds since 1970-01-01 00:00 UTC'))
    plt.show()
In [22]:
# Now let's plot some forecasts for random time steps and stations
plot_fc_obs_hist(0, 0)
plot_fc_obs_hist(100, 100)
plot_fc_obs_hist(1000, 200)
In [23]:
# Let's write a handy function which returns the required data
# from the NetCDF object
def get_data_slice(rg, month, utc=0):
    """Return (obs, fc) arrays for the given month and valid hour (UTC)."""
    # Get array of datetime objects
    dates = num2date(rg.variables['time'][:],
                     units='seconds since 1970-01-01 00:00 UTC')
    # Extract months and hours
    months = np.array([d.month for d in dates])
    hours = np.array([d.hour for d in dates])
    # For now I need to include the Kelvin fix
    tfc = rg.variables['t2m_fc'][:]
    idx = np.where(np.mean(tfc, axis=(1, 2)) > 100)[0][0]
    tfc[idx:] = tfc[idx:] - 273.15
    # Extract the requested data (use the month and utc arguments!)
    time_mask = (months == month) & (hours == utc)
    tobs = rg.variables['t2m_obs'][time_mask]
    tfc = tfc[time_mask]
    return tobs, tfc
In [24]:
tobs_jan_00, tfc_jan_00 = get_data_slice(rg, 1, 0)
In [25]:
tfc_jan_00.shape
Out[25]:
In [26]:
from scipy.stats import norm
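For a normal predictive distribution the CRPS has a closed form (Gneiting et al., 2005):
$$\mathrm{CRPS}\left(\mathcal{N}(\mu, \sigma^2),\, y\right) = \sigma \left[ z \left(2\Phi(z) - 1\right) + 2\varphi(z) - \frac{1}{\sqrt{\pi}} \right], \qquad z = \frac{y - \mu}{\sigma},$$
where $\Phi$ and $\varphi$ denote the standard normal CDF and PDF. The function below implements this formula directly.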
In [27]:
def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a normal distribution N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma  # standardized observation
    crps = sigma * (z * (2 * norm.cdf(z) - 1) +
                    2 * norm.pdf(z) - 1. / np.sqrt(np.pi))
    return crps
In [28]:
# Get ensemble mean and standard deviation (axis 1 = ensemble members)
tfc_jan_00_mean = np.mean(tfc_jan_00, axis=1)
tfc_jan_00_std = np.std(tfc_jan_00, axis=1, ddof=1)
In [29]:
# Compute CRPS using the ensemble mean and standard deviation
crps_jan_00 = crps_normal(tfc_jan_00_mean, tfc_jan_00_std, tobs_jan_00)
In [30]:
# Warnings are probably due to missing values
crps_jan_00.mean()
Out[30]:
Nice, this corresponds well to the value Sebastian got for the raw ensemble in January.
In [9]:
import sys
sys.path.append('/Users/stephanrasp/repositories/enstools')
# Import enstools.scores (used below as enstools.scores.crps_sample)
import enstools.scores
In [32]:
??enstools.scores.crps_sample
In [33]:
tfc_jan_00.shape
Out[33]:
In [36]:
tobs_jan_00.shape
Out[36]:
In [52]:
# Move the ensemble member axis to the front: (member, time, station)
tfc_jan_00_flat = np.rollaxis(tfc_jan_00, 1, 0)
tfc_jan_00_flat.shape
Out[52]:
In [53]:
# Collapse the time and station axes into one sample axis
tfc_jan_00_flat = tfc_jan_00_flat.reshape(tfc_jan_00_flat.shape[0], -1)
tfc_jan_00_flat.shape
Out[53]:
In [54]:
# Flatten the observations accordingly
tobs_jan_00_flat = tobs_jan_00.ravel()
In [63]:
# Remove missing observations using the masked-array mask
mask = tobs_jan_00_flat.mask
tobs_jan_00_flat_true = np.array(tobs_jan_00_flat)[~mask]
tfc_jan_00_flat_true = np.array(tfc_jan_00_flat)[:, ~mask]
In [64]:
# Sanity check: all remaining observations should be finite
np.isfinite(tobs_jan_00_flat_true)
Out[64]:
In [65]:
tfc_jan_00_flat_true.shape
Out[65]:
In [66]:
enstools.scores.crps_sample(tobs_jan_00_flat_true, tfc_jan_00_flat_true, mean=True)
Out[66]:
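As a sanity check, here is a small sketch comparing this sample-based CRPS against the Gaussian closed form from crps_normal on the same mask-filtered data:
# Gaussian CRPS on the identical non-missing samples (axis 0 = members)
crps_gauss = crps_normal(tfc_jan_00_flat_true.mean(axis=0),
                         tfc_jan_00_flat_true.std(axis=0, ddof=1),
                         tobs_jan_00_flat_true)
crps_gauss.mean()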