If you routinely analyze data to make better decisions (or if you build machine learning models to provide such guidance automatically), weather should be one of your inputs. Having GHCN data in BigQuery just made that a whole lot easier.
Install basemap (to draw maps), restart the kernel, then start with some Python imports
In [1]:
%bash
echo 'Y' | apt-get install python-mpltoolkits.basemap
In [2]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import datalab.bigquery as bq
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
import shutil
Let’s take a simple example. Let’s say you are a pizza chain based in Chicago and want to grab some weather variables that might affect demand for pizza and pizza delivery times. The first thing to do is to find the GHCN station closest to you. Go to Google Maps and find your latitude and longitude (say this turns out to be 42 degrees latitude and -87.9 degrees longitude)
In [3]:
%sql --module find_stations_close_to
SELECT
name, id,
state,
latitude,
longitude,
DEGREES(ACOS(SIN(RADIANS(latitude)) * SIN(RADIANS($lat)) + COS(RADIANS(latitude)) * COS(RADIANS($lat)) * COS(RADIANS(longitude - $lon)))) * 60 * 1.515 * 1.609344 AS dist_kms
FROM
[bigquery-public-data:ghcn_d.ghcnd_stations]
ORDER BY
dist_kms ASC
LIMIT
1000
In [4]:
stations = bq.Query(find_stations_close_to, lat=42, lon=-87.9).to_dataframe()
stations[:5]
Out[4]:
In [5]:
clat = 42
clon = -87.9
m = Basemap(resolution='h',projection='merc', llcrnrlat=clat-.5,urcrnrlat=clat+.5,llcrnrlon=clon-.5,urcrnrlon=clon+.5,lat_ts=51.0)
m.drawcountries(linewidth=0.5)
m.drawcoastlines(linewidth=0.5)
ax = plt.subplot(111)
x,y = m(stations['longitude'].values,stations['latitude'].values)
m.scatter(x,y,c=stations['dist_kms'].values)
ax.plot([x[0]], [y[0]], 'or')
Out[5]:
Next, we need to pull data from this station (USW00094846) on the dates of interest. Here, I will query the table of 2015 data and pull all the days from that table. To get the rainfall amount (“precipitation” or PRCP) in millimeters, you’d write:
In [6]:
%sql
SELECT
STRING(wx.date) AS date,
wx.value/10.0 AS prcp
FROM
[bigquery-public-data:ghcn_d.ghcnd_2015] AS wx
WHERE
id = 'USW00094846'
AND qflag IS NULL
AND element = 'PRCP'
ORDER BY wx.date
Out[6]:
In [7]:
%sql --module wxquery
SELECT
STRING(wx.date) AS date,
MAX(prcp) AS prcp,
MAX(tmin) AS tmin,
MAX(tmax) AS tmax,
IF(MAX(haswx) = 'True', 'True', 'False') AS haswx
FROM (
SELECT
wx.date,
IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp,
IF (wx.element = 'TMIN', wx.value/10, NULL) AS tmin,
IF (wx.element = 'TMAX', wx.value/10, NULL) AS tmax,
IF (SUBSTR(wx.element, 0, 2) = 'WT', 'True', NULL) AS haswx
FROM
[bigquery-public-data:ghcn_d.ghcnd_$year] AS wx
WHERE
id = 'USW00094846'
AND qflag IS NULL )
GROUP BY
date
ORDER BY
date
In [8]:
wx = bq.Query(wxquery, year=2015).to_dataframe()
wx[:5]
Out[8]:
In [9]:
# Plot the average value
ax = (0.5*(wx['tmin']+wx['tmax'])).plot()
x = np.arange(len(wx.date))
palette = sns.color_palette()
ax.fill_between(x, wx['tmin'].values, wx['tmax'].values, alpha=.2, color=palette.pop(0))
ax.set_ylabel("Temperature (Celsius)")
import matplotlib.ticker as plticker
junk = plt.xticks(x[1::50], wx['date'].values[1::50], rotation='vertical')
In [10]:
%sql --module airwxquery
SELECT
wx.date,
wx.prcp,
f.departure_delay,
f.arrival_airport
FROM (
SELECT
STRING(date) AS date,
value/10 AS prcp
FROM
[bigquery-public-data:ghcn_d.ghcnd_2005]
WHERE
id = 'USW00094846'
AND qflag IS NULL
AND element = 'PRCP') AS wx
JOIN
[bigquery-samples:airline_ontime_data.flights] AS f
ON
f.date = wx.date
WHERE
f.departure_airport = 'ORD'
In [11]:
airwx = bq.Query(airwxquery).to_dataframe()
airwx[:5]
Out[11]:
In [12]:
rainyday = airwx.wx_prcp > 25.4 # 1 inch of rain
ax = sns.violinplot(x=rainyday, y=airwx.f_departure_delay)
ax.set_ylim(0, 200)
ax.set_xlabel('Heavy rain?')
ax.set_ylabel('Flight delays (minutes)')
Out[12]:
In [13]:
limited = airwx[airwx.f_departure_delay > 0]
limited = limited[limited.f_departure_delay < 200]
heavy = limited.wx_prcp > 25.4 # 1 inch of rain
rainy = limited.wx_prcp > 0
ax = sns.kdeplot(limited.f_departure_delay[heavy], shade=True, label='heavy rain')
ax = sns.kdeplot(limited.f_departure_delay[rainy], shade=True, label='rainy')
ax = sns.kdeplot(limited.f_departure_delay[np.logical_not(rainy)], shade=False, label='dry')
ax.set_xlim(0,200)
ax.set_xlabel('Flight delay (minutes)')
Out[13]:
In [14]:
%sql --module all_stations
SELECT
name,
value/10 AS min_temperature,
latitude,
longitude
FROM
[bigquery-public-data:ghcn_d.ghcnd_stations] AS stn
JOIN
[bigquery-public-data:ghcn_d.ghcnd_2016] AS wx
ON
wx.id = stn.id
WHERE
wx.element = 'TMIN'
AND wx.qflag IS NULL
AND STRING(wx.date) = '2016-08-15'
In [15]:
stations = bq.Query(all_stations).to_dataframe()
stations[:5]
Out[15]:
In [16]:
mpl.rcParams['figure.figsize'] = '16, 12'
m = Basemap(projection='kav7', lon_0=-90, resolution = 'l', area_thresh = 1000.0)
m.shadedrelief(scale=0.2)
m.drawcoastlines()
m.drawcountries()
m.drawparallels(np.arange(-90.,99.,30.))
junk = m.drawmeridians(np.arange(-180.,180.,60.))
x,y = m(stations['longitude'].values,stations['latitude'].values)
m.scatter(x,y)
Out[16]:
In [ ]: