ML Challenges

This notebook includes various code snippets mentioned in the first chapter of our Machine Learning Design Patterns book.



In [0]:

    
import pandas as pd
import tensorflow as tf

from sklearn.utils import shuffle
from google.cloud import bigquery

Repeatability

Because ML training involves inherent randomness, extra steps are required to make results repeatable and reproducible across training and evaluation runs.



In [0]:

    
# Setting a random seed in TensorFlow
# Do this before you run training to ensure reproducible evaluation metrics
# You can use whatever value you'd like for the seed
tf.random.set_seed(2)
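The seed above only covers TensorFlow's own random operations. Data preparation code often draws from Python's `random` module or NumPy as well, and those generators are seeded separately. The following is a minimal sketch, not part of the original notebook, of seeding them alongside TensorFlow; the helper name `seed_python_and_numpy` is hypothetical.

```python
import random

import numpy as np


def seed_python_and_numpy(seed: int) -> None:
    """Seed the Python and NumPy generators (hypothetical helper).

    Call this next to tf.random.set_seed(seed), which seeds TensorFlow only.
    """
    random.seed(seed)
    np.random.seed(seed)


# Draw one value from each generator, reseed, and draw again.
seed_python_and_numpy(2)
first = (random.random(), np.random.rand())

seed_python_and_numpy(2)
second = (random.random(), np.random.rand())

print(first == second)  # True
```

Reseeding with the same value reproduces the same draws, which is the property you want before any step that samples or shuffles data.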

You also need to consider randomness when preparing your training, test, and validation datasets. To ensure consistency, prepare a shuffled dataset before training by setting a random seed value.

First, let's look at what happens when you shuffle without setting a random seed. We'll grab some data from the NOAA storms public dataset in BigQuery. You'll need a Google Cloud account to run the cells that use this dataset.



In [0]:

    
from google.colab import auth
auth.authenticate_user()

Replace your-cloud-project below with the name of your Google Cloud project.



In [0]:

    
%%bigquery storms_df --project your-cloud-project
SELECT
  *
FROM
  `bigquery-public-data.noaa_historic_severe_storms.storms_*`
LIMIT 1000

Run the cell below multiple times, and notice that the order of the data changes each time.



In [9]:

    
storms_df = shuffle(storms_df)
storms_df.head()









    Out[9]:

	episode_id	event_id	state	state_fips_code	event_type	cz_type	cz_fips_code	cz_name	wfo	event_begin_time	event_timezone	event_end_time	injuries_direct	injuries_indirect	deaths_direct	deaths_indirect	damage_property	damage_crops	source	magnitude	magnitude_type	flood_cause	tor_f_scale	tor_length	tor_width	tor_other_wfo	location_index	event_range	event_azimuth	reference_location	event_latitude	event_longitude	event_point
875	None	10075537	New york	36	thunderstorm wind	C	13	CHAUTAUQUA	BUF	1990-08-27 23:40:00	CST	1990-08-27 23:40:00	0	0	0	0	0	0	None	0.00	None	None	None	0	0	None	None	None	None	None	None	None	None
449	None	9998455	Florida	12	thunderstorm wind	C	5	BAY	PNS	1990-08-18 17:00:00	CST	1990-08-18 17:00:00	0	0	0	0	0	0	None	0.00	None	None	None	0	0	None	None	None	None	None	None	None	None
464	None	10076219	New mexico	35	hail	C	5	CHAVES	ROW	1990-04-21 22:00:00	CST	1990-04-21 22:00:00	0	0	0	0	0	0	None	1.75	None	None	None	0	0	None	None	None	None	None	None	None	None
698	None	10138226	Texas	48	thunderstorm wind	C	9	ARCHER	SPS	1990-05-29 19:27:00	CST	1990-05-29 19:27:00	0	0	0	0	0	0	None	55.00	None	None	None	0	0	None	None	None	None	None	None	None	None
840	None	10125887	South dakota	46	tornado	C	13	BROWN	None	1990-06-01 20:45:00	CST	1990-06-01 20:45:00	0	0	0	0	0	0	None	0.00	None	None	F0	3	30	None	None	None	None	None	None	None	None

Next, shuffle again, this time passing random_state to set a random seed. As long as storms_df itself is unchanged, the resulting order stays the same no matter how many times you run the cell.



In [16]:

    
shuffled_df = shuffle(storms_df, random_state=2)
shuffled_df.head()









    Out[16]:

	episode_id	event_id	state	state_fips_code	event_type	cz_type	cz_fips_code	cz_name	wfo	event_begin_time	event_timezone	event_end_time	injuries_direct	injuries_indirect	deaths_direct	deaths_indirect	damage_property	damage_crops	source	magnitude	magnitude_type	flood_cause	tor_f_scale	tor_length	tor_width	tor_other_wfo	location_index	event_range	event_azimuth	reference_location	event_latitude	event_longitude	event_point
888	None	10082557	New hampshire	33	thunderstorm wind	C	13	MERRIMACK	CON	1990-10-18 21:00:00	CST	1990-10-18 21:00:00	0	0	0	0	0	0	None	0.0	None	None	None	0	0	None	None	None	None	None	None	None	None
185	None	10111136	South carolina	45	thunderstorm wind	C	3	AIKEN	AGS	1990-05-28 10:30:00	CST	1990-05-28 10:30:00	0	0	0	0	0	0	None	0.0	None	None	None	0	0	None	None	None	None	None	None	None	None
975	None	10049089	Minnesota	27	thunderstorm wind	C	13	BLUE EARTH	RST	1990-06-02 11:40:00	CST	1990-06-02 11:40:00	1	0	0	0	0	0	None	0.0	None	None	None	0	0	None	None	None	None	None	None	None	None
684	None	10147552	West virginia	54	thunderstorm wind	C	9	BROOKE	PIT	1990-09-06 15:30:00	CST	1990-09-06 15:30:00	0	0	0	0	0	0	None	0.0	None	None	None	0	0	None	None	None	None	None	None	None	None
45	None	10054154	Michigan	26	thunderstorm wind	C	1	ALCONA	APN	1990-10-04 13:00:00	CST	1990-10-04 13:00:00	0	0	0	0	0	0	None	0.0	None	None	None	0	0	None	None	None	None	None	None	None	None
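The same idea extends to splitting data into training, validation, and test sets. The sketch below is one possible approach, not code from the original notebook: it uses pandas' `DataFrame.sample` with `random_state` in place of `sklearn.utils.shuffle`, and the helper name `seeded_split`, the 80/10/10 fractions, and the stand-in `demo_df` are all assumptions made for illustration.

```python
import pandas as pd


def seeded_split(df: pd.DataFrame, seed: int = 2,
                 train_frac: float = 0.8, val_frac: float = 0.1):
    """Shuffle df with a fixed seed, then slice it into train/validation/test."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, val, test


# Stand-in frame with 1,000 rows, the same size as the LIMIT 1000 query result.
demo_df = pd.DataFrame({"row_id": range(1000)})

train_a, val_a, test_a = seeded_split(demo_df)
train_b, val_b, test_b = seeded_split(demo_df)

print(len(train_a), len(val_a), len(test_a))  # 800 100 100
print(train_a.index.equals(train_b.index))    # True
```

Because the shuffle is seeded, every run assigns the same rows to each split, provided the input frame is unchanged. In the notebook you would pass the queried `storms_df` instead of `demo_df`, before it has been reshuffled without a seed.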

Data drift

It's important to analyze how your data changes over time so that your ML models are trained on data that still reflects current conditions. To demonstrate this, we'll use the same NOAA storms dataset as above with a slightly different query.

Let's look at how the number of reported storms has increased over time.



In [0]:

    
%%bigquery storm_trends --project your-cloud-project
SELECT
  SUBSTR(CAST(event_begin_time AS string), 1, 4) AS year,
  COUNT(*) AS num_storms
FROM
  `bigquery-public-data.noaa_historic_severe_storms.storms_*`
GROUP BY
  year
ORDER BY
  year ASC



In [18]:

    
storm_trends.head()

As the plot below shows, the number of reported storms has grown substantially over time, so a model trained only on data from before 2000 would likely make poor predictions about storms today.



In [22]:

    
storm_trends.plot(title='Storm trends over time', x='year', y='num_storms')









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fd2dac5b400>
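Beyond eyeballing the plot, a simple numeric check can flag this kind of shift. The sketch below is illustrative only: it uses just the first five rows of storm_trends (the values shown in this notebook's storm_trends.head() output), and both the helper name `mean_ratio_after_cutoff` and the 1953 cutoff are assumptions chosen to fit that small sample. With the full query result you would pick a cutoff such as 2000.

```python
import pandas as pd

# First five rows of storm_trends, copied from the storm_trends.head() output.
storm_trends_sample = pd.DataFrame({
    "year": ["1950", "1951", "1952", "1953", "1954"],
    "num_storms": [223, 269, 272, 492, 609],
})


def mean_ratio_after_cutoff(trends: pd.DataFrame, cutoff_year: int) -> float:
    """Return the mean yearly count from cutoff_year on, divided by the mean before it."""
    years = trends["year"].astype(int)  # the query returns year as a string
    before = trends.loc[years < cutoff_year, "num_storms"].mean()
    after = trends.loc[years >= cutoff_year, "num_storms"].mean()
    return after / before


ratio = mean_ratio_after_cutoff(storm_trends_sample, cutoff_year=1953)
print(f"{ratio:.2f}")  # 2.16
```

A ratio far from 1 suggests that the older period is not representative of the newer one, at least for this one statistic. It does not by itself say why the counts changed.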

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Out[18] (storm_trends.head()):

	year	num_storms
0	1950	223
1	1951	269
2	1952	272
3	1953	492
4	1954	609
