Learning Objectives
In this notebook, we will use AutoML for Text Classification to train a text model to recognize the source of article titles: New York Times, TechCrunch or GitHub.
First, we will query a public dataset on BigQuery taken from Hacker News (an aggregator that displays tech-related headlines from various sources) to create our training set.
Second, we will use the AutoML UI to upload our dataset, train a text model on it, and evaluate the model we have just trained.
Each learning objective will correspond to a #TODO in this student lab notebook -- try to complete this notebook first and then review the solution notebook.
In [ ]:
import os
from google.cloud import bigquery
import pandas as pd
In [ ]:
%load_ext google.cloud.bigquery
Replace the variable values in the cell below:
In [ ]:
PROJECT = "cloud-training-demos" # Replace with your PROJECT
BUCKET = PROJECT # defaults to PROJECT
REGION = "us-central1" # Replace with your REGION
SEED = 0
# Export to the environment so the %%bash cells below can see these values.
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
In [ ]:
%%bash
gsutil mb -l ${REGION} gs://${BUCKET}
Hacker News headlines are available as a BigQuery public dataset. The dataset contains all headlines from the site's inception in October 2006 until October 2015.
Complete the query below to create a sample dataset containing the url, title, and score of articles from the public dataset bigquery-public-data.hacker_news.stories. Use a WHERE clause to restrict to only those articles with a title longer than 10 characters, a score greater than 10, and a non-empty url. (A possible completion is sketched after the cell.)
In [ ]:
%%bigquery --project $PROJECT
SELECT
# TODO: Your code goes here.
FROM
# TODO: Your code goes here.
WHERE
# TODO: Your code goes here.
# TODO: Your code goes here.
# TODO: Your code goes here.
LIMIT 10
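If you get stuck, one possible completion looks like the cell below. This is a sketch: the LENGTH(title) > 10 filter matches the one used later in this notebook, while the exact score and url conditions are assumptions.
In [ ]:
%%bigquery --project $PROJECT
SELECT
    url, title, score
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    LENGTH(title) > 10  -- same title-length filter used later in this notebook
    AND score > 10      -- assumption: keep only articles with a decent score
    AND LENGTH(url) > 0 -- assumption: keep only articles that have a url
LIMIT 10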
Let's do some regular expression parsing in BigQuery to get the source of the article from the URL. For example, if the url is http://mobile.nytimes.com/...., we want to be left with nytimes.
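To see how this works in isolation, the standalone cell below (a minimal sketch with a hard-coded URL) extracts the domain with REGEXP_EXTRACT, splits it on '.', reverses the array, and keeps the element at OFFSET(1):
In [ ]:
%%bigquery --project $PROJECT
SELECT
    -- 'mobile.nytimes.com' -> ['mobile', 'nytimes', 'com'] -> reversed, so OFFSET(1) is 'nytimes'
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT('http://mobile.nytimes.com/section/article', '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source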
Complete the query below to count the number of titles within each 'source' category. Note that to grab the 'source' of the article we use a regular expression on the url of the article. To count the number of articles you'll use a GROUP BY in SQL, and we'll also restrict our attention to only those articles whose title has more than 10 characters. (A possible completion is sketched after the cell.)
In [ ]:
%%bigquery --project $PROJECT
SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
# TODO: Your code goes here.
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
# TODO: Your code goes here.
GROUP BY
# TODO: Your code goes here.
ORDER BY num_articles DESC
LIMIT 100
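One possible completion of the cell above, as a sketch: the num_articles alias is dictated by the ORDER BY clause, and the title-length filter matches the sub-query used below.
In [ ]:
%%bigquery --project $PROJECT
SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY num_articles DESC
LIMIT 100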
Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.
In [ ]:
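# Regex capturing the domain of a URL, e.g. 'mobile.nytimes.com' in 'http://mobile.nytimes.com/...'.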
regex = '.*://(.[^/]+)/'
sub_query = """
SELECT
title,
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
AND LENGTH(title) > 10
""".format(regex)
query = """
SELECT
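-- Keep letters, digits, spaces, $, . and -; replace everything else with a space, then lowercase.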
LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
source
FROM
({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(sub_query=sub_query)
print(query)
For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML, however, figures out on its own how to create these splits, so we won't need to do that here.
In [ ]:
bq = bigquery.Client(project=PROJECT)
title_dataset = bq.query(query).to_dataframe()
title_dataset.head()
AutoML for text classification requires that the dataset be in CSV form, with the first column containing the texts to classify (or a GCS path to the text) and the last column containing the text labels, which is why we write the CSV files below without a header. The dataset we pulled from BigQuery satisfies these requirements.
In [ ]:
print("The full dataset contains {n} titles".format(n=len(title_dataset)))
Let's make sure we have roughly the same number of examples for each of our three labels:
In [ ]:
title_dataset.source.value_counts()
Finally we will save our data, which is currently in-memory, to disk.
We will create a CSV file containing the full dataset and another containing only 1000 articles for development.
Note: It may take a long time to train AutoML on the full dataset, so we recommend using the sample dataset while learning the tool.
In [ ]:
DATADIR = './data/'
if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)
In [ ]:
FULL_DATASET_NAME = 'titles_full.csv'
FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)
# Let's shuffle the data before writing it to disk.
title_dataset = title_dataset.sample(n=len(title_dataset), random_state=SEED)
title_dataset.to_csv(
FULL_DATASET_PATH, header=False, index=False, encoding='utf-8')
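You can sanity-check the file from the shell (optional):
In [ ]:
%%bash
head -5 data/titles_full.csv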
Now let's sample 1000 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see the AutoML documentation for further details on how to prepare data for AutoML).
In [ ]:
sample_title_dataset = # TODO: Your code goes here.
# TODO: Your code goes here.
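One possible completion, as a sketch using pandas (random_state=SEED is an assumption for reproducibility; value_counts lets you verify that each label is well represented):
In [ ]:
sample_title_dataset = title_dataset.sample(n=1000, random_state=SEED)  # assumption: a simple random sample
sample_title_dataset.source.value_counts()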
Let's write the sample dataset to disk.
In [ ]:
SAMPLE_DATASET_NAME = 'titles_sample.csv'
SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)
sample_title_dataset.to_csv(
SAMPLE_DATASET_PATH, header=False, index=False, encoding='utf-8')
In [ ]:
sample_title_dataset.head()
In [ ]:
%%bash
gsutil cp data/titles_sample.csv gs://$BUCKET
Go to the GCP console, and click on the Natural Language service in the console menu. Click on 'ENABLE API' if the API is not enabled.
Then click on "Get started" in the "AutoML Text & Document Classification" tile. Select "New Dataset", give it a name, and import the titles_sample.csv file we uploaded to the GCS bucket above. This import step may take a while.
When the dataset has been imported, you can inspect it in the AutoML UI, and get statistics about the label distribution. If you are happy with what you see, proceed to train a text model from this dataset. The training step may last a few hours, while AutoML is searching for the best model to crush this dataset.
Once the model is trained, click on "Evaluate" to understand how the model performed. You'll be able to see the overall precision and recall, as well as drill down to performance at the individual label level.
AutoML UI will also show you examples where the model made a mistake for each of the labels.
Now you can test your model directly by entering new text in the UI and having AutoML predict the source of your snippet.
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License