Learning Objectives
In this notebook, we will use AutoML Tables to train a model to predict the weight of a baby before it is born. We will use the AutoML Tables UI to create a training dataset from BigQuery and will then train, evaluate, and predict with an AutoML Tables model.
In this lab, we will set up AutoML Tables, create and import an AutoML Tables dataset from BigQuery, analyze the AutoML Tables dataset, train an AutoML Tables model, check the evaluation metrics of the trained model, deploy the trained model, and finally make both batch and online predictions using the trained model.
Each learning objective will correspond to a series of steps to complete in this student lab notebook.
Run the following cells to verify that we previously created the dataset and data tables. If not, go back to lab 1b_prepare_data_babyweight to create them.
In [ ]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_augmented_data
LIMIT 0
Now that we've created a dataset, let's import our data so that AutoML Tables can use it for training. Our data is already in BigQuery, so we will select the radio button Import data from BigQuery. This will give us some text boxes to fill in with our data's BigQuery Project ID, BigQuery Dataset ID, and BigQuery Table or View ID. Once you are done entering those, click the IMPORT button.
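While the import runs, it can be worth confirming the size of the source table directly in BigQuery, since AutoML Tables needs a reasonably large table to train on (the documented minimum is on the order of 1,000 rows). A minimal sanity check, assuming the babyweight.babyweight_augmented_data table from the previous lab:
In [ ]:
%%bigquery
-- Quick sanity check before importing: how many rows will AutoML Tables get?
SELECT COUNT(*) AS num_rows
FROM babyweight.babyweight_augmented_data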
Awesome! Our dataset has been successfully imported! You can now look at the dataset's schema, which shows each column's name, data type, and nullability. Out of these columns we need to select which one we want to be our target or label column. Click the dropdown for Target column and choose weight_pounds. When you successfully choose your target column, you will see a green checkmark and the tag target added to the column's row on the right. Its nullability will also be disabled, since machine learning doesn't do too well with null labels. Once you've verified everything is correct with your target column and the schema, click the CONTINUE button.
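If you want to cross-check the schema that the UI shows against the source table, BigQuery's INFORMATION_SCHEMA has the same information. A small sketch, assuming the source table is babyweight.babyweight_augmented_data:
In [ ]:
%%bigquery
-- Compare these names, types, and nullability with the AutoML Tables schema page.
SELECT column_name, data_type, is_nullable
FROM babyweight.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = "babyweight_augmented_data"
ORDER BY ordinal_position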
The next tab we're brought to is ANALYZE. This is where some basic statistics are shown. We can see that we have 6 features, 4 of which are numeric and 2 of which are categorical. We can also see that there are 0% missing and 0 invalid values across all of our columns, which is great! We can also see the number of distinct values, which we can compare with our expectations. Additionally, the linear correlation with the target column, weight_pounds in this instance, is shown, as well as the mean and standard deviation for each column. Once you are satisfied with the analysis, click the TRAIN tab.
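Those statistics are also easy to reproduce directly in BigQuery if you want to double-check the ANALYZE tab. A sketch for one numeric feature against the target (swap in other columns as needed):
In [ ]:
%%bigquery
-- Reproduce a few ANALYZE-tab statistics for gestation_weeks.
SELECT
    COUNTIF(gestation_weeks IS NULL) AS missing_values,
    COUNT(DISTINCT gestation_weeks) AS distinct_values,
    CORR(gestation_weeks, weight_pounds) AS correlation_with_target,
    AVG(gestation_weeks) AS mean,
    STDDEV(gestation_weeks) AS standard_deviation
FROM babyweight.babyweight_augmented_data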
We are almost ready to train our model. It took a lot of steps to get here, but those were mainly to import the data and make sure the data is alright. As we all know, data is extremely important for ML, and if it is not what we expect then our model will also not perform as we expect. Garbage in, garbage out. We need to set the Training budget, which is the maximum number of node hours to spend training our model. Thankfully, if improvement stops before that, the training will stop and you'll only be charged for the actual node hours you used. For this dataset, I got decent results with a budget of just 1 to 3 node hours. We also need to select which features we want to use in our model out of the superset of features via the Input feature selection dropdown, which we'll look at in the next step below. Once all of that is set, click the TRAIN MODEL button.
We imported six columns, one of which, weight_pounds, we have set aside to be our target or label column. This leaves five columns left over. Clicking the Input feature selection dropdown provides you with a list of all of the remaining columns. We want is_male, mother_age, plurality, and gestation_weeks as our four features. hashmonth is left over from when we did our repeatable splitting in the 2_prepare_babyweight lab. Whatever is selected will be trained with, so please click its checkbox to de-select it.
Woohoo! Our model is training! We are going to have an awesome model when it finishes! And now we wait. Depending on the size of your dataset, your training budget, and other factors, this could take a while, anywhere from a couple of hours to over a day, so this step is just about waiting and being patient. A good thing to do while you wait is to keep going through the next labs in this series and then come back to this lab once training completes.
Yay! Our model is done training! Now we can check the EVALUATE tab and see how well we did. It reminds you what the target was, weight_pounds, and what the training was optimized for, RMSE, and then lists many evaluation metrics like MAE, RMSE, etc. My training run did great with an RMSE of 1.030 after only an hour of training! It really shows you the amazing power of AutoML! Below you can see a feature importance bar chart. gestation_weeks is by far the most important, which makes sense because usually the longer someone has been pregnant, the longer the baby has had time to grow, and therefore the heavier the baby weighs.
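To put an RMSE of about 1 pound in context, a useful baseline is a model that always predicts the mean weight; its RMSE is simply the standard deviation of the label. A quick sketch to compute that baseline from the training table:
In [ ]:
%%bigquery
-- RMSE of a naive "always predict the mean" model equals the label's standard deviation.
SELECT
    AVG(weight_pounds) AS mean_weight_pounds,
    STDDEV(weight_pounds) AS baseline_rmse
FROM babyweight.babyweight_augmented_data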
So if you are satisfied with how well our brand new AutoML Tables model trained and evaluated, then you'll probably want to do next what ML is all about: making great predictions! To do that, we'll have to deploy our trained model. If you go to the main Models page for AutoML Tables, you'll see your trained model listed. It gives the model name, the dataset used, the problem type, the time of creation, the model size, and whether the model is deployed or not. Since we just finished training our model, Deployed should say No. Click the three vertical dots to the right and then click Deploy model.
Great! Once it is done deploying, Deployed should say Yes, and you can now click your model name and then the PREDICT tab. You'll start out with batch prediction. To make this easy, for now we can just predict on the BigQuery table that we used to train and evaluate on. To do that, select the radio button Data from BigQuery and then enter your BigQuery Project Id, BigQuery Dataset Id, and BigQuery Table or View Id. We could also have used CSVs from Google Cloud Storage. Then we need to select where we want to put our Result. Let's select the radio button BigQuery project and then enter our BigQuery Project Id. We also could have written the results to Google Cloud Storage. Once all of that is set, please click SEND BATCH PREDICTION, which will submit a batch prediction job using our trained AutoML Tables model and the data at the location we chose above.
After just a little bit of waiting, your batch predictions should be done. For me, with my dataset, it took just over 15 minutes. At the bottom of the BATCH PREDICTION page you should see a section labeled Recent Predictions. It shows the data input, where the results are stored, when it was created, and how long it took to process. Let's now move to the BigQuery Console UI to have a look.
On your list of projects on the far left, you will see the project you have been working in. Click the arrow to expand the dropdown list of all of the BigQuery datasets within the project. You'll see a new dataset there, which matches the Results directory shown in the last step. Expanding that dataset's dropdown list, you will see two BigQuery tables that have been created: predictions and errors. Let's first look at the predictions table.
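Besides browsing it in the console, you can preview the table with a query. The dataset name is generated by the batch prediction job, so substitute the one from your own Results directory (the name below is just a placeholder):
In [ ]:
%%bigquery
-- Replace prediction_results_dataset with the generated dataset name from your job.
SELECT *
FROM prediction_results_dataset.predictions
LIMIT 10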
The predictions BigQuery table has essentially taken your input data to the batch prediction job and appended three new columns to it. Notice that even columns you did not use as features in your model are still here, such as hashmonth. You should see the two prediction_interval columns for start and end. The last column is the prediction value, which for us is our predicted weight_pounds, calculated by our trained AutoML Tables model using the corresponding features in the row.
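Since we predicted on the same table we trained and evaluated on, we can sanity-check the batch output by comparing the predicted value against the actual weight_pounds. The exact name of the predicted-value column depends on your target column and the output schema of the job, so treat the column reference below as a placeholder and adjust it to match your predictions table's schema:
In [ ]:
%%bigquery
-- Hypothetical column name: check the predictions table schema for the actual
-- predicted-value column, which is derived from the target column name.
SELECT
    SQRT(AVG(POW(predicted_weight_pounds - weight_pounds, 2))) AS batch_rmse
FROM prediction_results_dataset.predictions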
We can also look at the errors table for any possible errors. When I ran my batch prediction job, thankfully I didn't have any errors, but this is definitely the place to check in case you did. Since my errors table was empty, below you'll see only its schema. Once again it has essentially taken your input data to the batch prediction job and appended three new columns to it: a record as well as an error code and error message. These could be helpful in debugging any unwanted behavior.
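A quick way to check for failed rows is to count them (again substituting your generated dataset name for the placeholder below):
In [ ]:
%%bigquery
-- A non-zero count means some rows failed; inspect their error code and message columns.
SELECT COUNT(*) AS num_errors
FROM prediction_results_dataset.errors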
We can also perform online prediction with our trained AutoML Tables model. To do that, in the PREDICT tab, click ONLINE PREDICTION. You'll see on your screen something similar to below, with a table of our model's features. Each feature has the column name, the column ID, the data type, the status (whether it is required or not), and a prepopulated value. You can leave those values as is or enter your own values. For Categorical features, make sure to use valid values, or else they will just end up in the OOV (out of vocabulary) spill-over and not take full advantage of the training. When you're done setting your values, click the PREDICT button.
In this lab, we set up AutoML Tables, created and imported an AutoML Tables dataset from BigQuery, analyzed the AutoML Tables dataset, trained an AutoML Tables model, checked the evaluation metrics of the trained model, deployed the trained model, and finally made both batch and online predictions using the trained model.
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License
In [ ]: