AutoML for Text Classification

Learning Objectives

  1. Learn how to create a text classification dataset for AutoML using BigQuery
  2. Learn how to train AutoML to build a text classification model
  3. Learn how to evaluate a model trained with AutoML
  4. Learn how to predict on new test data with AutoML

Introduction

In this notebook, we will use AutoML for Text Classification to train a text model to recognize the source of article titles: New York Times, TechCrunch or GitHub.

In the first step, we will query a public dataset on BigQuery taken from Hacker News (an aggregator of tech-related headlines from various sources) to create our training set.

In the second step, we will use the AutoML UI to upload our dataset, train a text model on it, and evaluate the model we have just trained.

Each learning objective will correspond to a #TODO in this student lab notebook -- try to complete this notebook first and then review the solution notebook.


In [ ]:
import os

from google.cloud import bigquery
import pandas as pd

In [ ]:
%load_ext google.cloud.bigquery

Replace the variable values in the cell below:


In [ ]:
PROJECT = "cloud-training-demos"  # Replace with your PROJECT
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION
SEED = 0

# Export for the %%bash cells below, which don't see Python variables.
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET

In [ ]:
%%bash
gsutil mb gs://$BUCKET

Create a Dataset from BigQuery

Hacker News headlines are available as a BigQuery public dataset. The dataset contains all headlines from the site's inception in October 2006 until October 2015.

Lab Task 1a:

Complete the query below to create a sample dataset containing the url, title, and score of articles from the public dataset bigquery-public-data.hacker_news.stories. Use a WHERE clause to restrict to only those articles with

  • title length greater than 10 characters
  • score greater than 10
  • url length greater than 0 characters

In [ ]:
%%bigquery --project $PROJECT

SELECT
    # TODO: Your code goes here.
FROM
    # TODO: Your code goes here.
WHERE
    # TODO: Your code goes here.
    # TODO: Your code goes here.
    # TODO: Your code goes here.
LIMIT 10
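
If you get stuck, one possible completion is sketched below, using only the url, title, and score columns described above (treat it as a sketch to check against the solution notebook):


In [ ]:
%%bigquery --project $PROJECT

SELECT
    url, title, score
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    LENGTH(title) > 10
    AND score > 10
    AND LENGTH(url) > 0
LIMIT 10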

Let's do some regular expression parsing in BigQuery to get the source of the article from the URL. For example, if the url is http://mobile.nytimes.com/...., we want to be left with nytimes.

Lab Task 1b:

Complete the query below to count the number of titles within each 'source' category. Note that to grab the 'source' of the article we use a regex on the url of the article. To count the number of articles you'll use a GROUP BY in SQL, and we'll also restrict our attention to only those articles whose title has more than 10 characters.


In [ ]:
%%bigquery --project $PROJECT

SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    # TODO: Your code goes here.
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    # TODO: Your code goes here.
GROUP BY
    # TODO: Your code goes here.
ORDER BY num_articles DESC
  LIMIT 100
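
Again, a possible completion is sketched below: count the titles, add the same title-length filter as in Task 1a, and group by the extracted source:


In [ ]:
%%bigquery --project $PROJECT

SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY num_articles DESC
LIMIT 100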

Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.


In [ ]:
regex = '.*://(.[^/]+)/'  # captures the site domain, e.g. mobile.nytimes.com


sub_query = """
SELECT
    title,
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source
    
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
    AND LENGTH(title) > 10
""".format(regex)


# Clean the titles and keep only the three sources of interest.
query = """
SELECT 
    LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
    source
FROM
  ({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(sub_query=sub_query)

print(query)

For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML however figures out on its own how to create these splits, so we won't need to do that here.


In [ ]:
bq = bigquery.Client(project=PROJECT)
title_dataset = bq.query(query).to_dataframe()
title_dataset.head()

AutoML for text classification requires that

  • the dataset be in CSV form,
  • the first column contain the texts to classify or a GCS path to the text, and
  • the last column contain the text labels.

The dataset we pulled from BigQuery satisfies these requirements.
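
For instance, the first rows of a valid CSV (no header row) could look like this, with hypothetical titles:

    a new way to think about distributed systems,nytimes
    show hn a command line tool for git,github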


In [ ]:
print("The full dataset contains {n} titles".format(n=len(title_dataset)))

Let's make sure we have roughly the same number of examples for each of our three labels:


In [ ]:
title_dataset.source.value_counts()

Finally we will save our data, which is currently in-memory, to disk.

We will create a csv file containing the full dataset and another containing only 1000 articles for development.

Note: It may take a long time to train AutoML on the full dataset, so we recommend using the sample dataset to learn the tool.


In [ ]:
DATADIR = './data/'

if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

In [ ]:
FULL_DATASET_NAME = 'titles_full.csv'
FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)

# Let's shuffle the data before writing it to disk.
title_dataset = title_dataset.sample(n=len(title_dataset), random_state=SEED)

title_dataset.to_csv(
    FULL_DATASET_PATH, header=False, index=False, encoding='utf-8')

Now let's sample 1000 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see here for further details on how to prepare data for AutoML).

Lab Task 1c:

Use .sample to create a sample dataset of 1,000 articles from the full dataset. Use .value_counts to see how many articles are contained in each of the three source categories.


In [ ]:
sample_title_dataset = # TODO: Your code goes here.
# TODO: Your code goes here.
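
One possible completion is sketched below; passing random_state=SEED keeps the sample reproducible:


In [ ]:
sample_title_dataset = title_dataset.sample(n=1000, random_state=SEED)
sample_title_dataset.source.value_counts()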

Let's write the sample dataset to disk.


In [ ]:
SAMPLE_DATASET_NAME = 'titles_sample.csv'
SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)

sample_title_dataset.to_csv(
    SAMPLE_DATASET_PATH, header=False, index=False, encoding='utf-8')

In [ ]:
sample_title_dataset.head()

In [ ]:
%%bash
gsutil cp data/titles_sample.csv gs://$BUCKET

Train a Model with AutoML for Text Classification

Lab Task 2:

Complete Steps 1-3 below to train a text classification model using AutoML.

Step 1: Launch AutoML

Go to the GCP console, and click on the Natural Language service in the console menu. Click on 'ENABLE API' if the API is not enabled.

Then click on "Get started" in the "AutoML Text & Document Classification" tile.

Step 2: Create a Dataset

Select "New Dataset"

Then

  1. Give the new dataset a name
  2. Choose "Multi-label classification"
  3. Hit "Create Dataset"

Then

  1. In the 'Select files to import' section, choose 'Select a CSV file on Cloud Storage'.
  2. Click on 'Browse' and select the titles_sample.csv we copied to the bucket above.
  3. Hit "Import"

This step may take a while.

Step 3: Train an AutoML text model

When the dataset has been imported, you can inspect it in the AutoML UI, and get statistics about the label distribution. If you are happy with what you see, proceed to train a text model from this dataset:

Then

  1. Switch to 'Train' tab.
  2. Click on 'START TRAINING' and confirm again by clicking on 'START TRAINING'.

The training step may last a few hours while AutoML searches for the best model to crush this dataset.


Lab Task 3:

Complete Step 4 below to evaluate the AutoML model.

Step 4: Evaluate the model

Once the model is trained, click on "Evaluate" to understand how the model performed. You'll be able to see the overall precision and recall, as well as drill down to performance at the individual label level.

The AutoML UI will also show you examples where the model made a mistake for each of the labels.

Lab Task 4:

Complete Step 5 below to call prediction on your AutoML text classification model.

Step 5: Predict with the trained AutoML model

Now you can test your model directly by entering new text in the UI and having AutoML predict the source of your snippet.
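
The lab only requires the UI, but for reference, a minimal sketch of calling a deployed model from Python with the google-cloud-automl client library might look like the cell below. Here MODEL_ID is a placeholder for the model ID shown in the AutoML UI, and the exact API surface depends on your client library version:


In [ ]:
from google.cloud import automl

MODEL_ID = "TCN0123456789"  # hypothetical; copy yours from the AutoML UI

prediction_client = automl.PredictionServiceClient()
model_name = prediction_client.model_path(PROJECT, "us-central1", MODEL_ID)

# Wrap the text to classify in an ExamplePayload.
payload = automl.ExamplePayload(
    text_snippet=automl.TextSnippet(
        content="stock markets rally after fed announcement",
        mime_type="text/plain",
    )
)

response = prediction_client.predict(name=model_name, payload=payload)
for annotation in response.payload:
    print(annotation.display_name, annotation.classification.score)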

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.