AutoML for Text Classification

Learning Objectives

  1. Learn how to create a text classification dataset for AutoML using BigQuery
  2. Learn how to train AutoML to build a text classification model
  3. Learn how to evaluate a model trained with AutoML
  4. Learn how to predict on new test data with AutoML

Introduction

In this notebook, we will use AutoML for Text Classification to train a text model to recognize the source of article titles: New York Times, TechCrunch or GitHub.

In a first step, we will query a public dataset on BigQuery taken from Hacker News (an aggregator that displays tech-related headlines from various sources) to create our training set.

In a second step, we will use the AutoML UI to upload our dataset, train a text model on it, and evaluate the model we have just trained.


In [ ]:
import os

from google.cloud import bigquery
import pandas as pd

In [ ]:
%load_ext google.cloud.bigquery

Replace the variable values in the cell below:


In [ ]:
PROJECT = "cloud-training-demos"  # Replace with your PROJECT
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION
SEED = 0

In [ ]:
%%bash
gsutil mb -l $REGION gs://$BUCKET

Create a Dataset from BigQuery

Hacker News headlines are available as a BigQuery public dataset. The dataset contains all headlines from the site's inception in October 2006 until October 2015.

Here is a sample of the dataset:


In [ ]:
%%bigquery --project $PROJECT

SELECT
    url, title, score
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    LENGTH(title) > 10
    AND score > 10
    AND LENGTH(url) > 0
LIMIT 10

Let's do some regular expression parsing in BigQuery to get the source of the article from the URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with nytimes.
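
To see what this parsing does, here is the equivalent logic in plain Python (an illustrative sketch; the URL below is hypothetical):


In [ ]:
import re

url = "http://mobile.nytimes.com/2015/10/some-article/"  # hypothetical URL
domain = re.search(r'.*://(.[^/]+)/', url).group(1)  # mobile.nytimes.com
# Reverse the dot-separated parts and take the element at offset 1,
# i.e., the component just left of the TLD -- same as the SQL below.
source = domain.split('.')[::-1][1]
print(source)  # nytimes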


In [ ]:
%%bigquery --project $PROJECT

SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY num_articles DESC
  LIMIT 100

Now that we can reliably parse the source out of the URL, let's put together a dataset of sources and titles. This will be our labeled dataset for machine learning.


In [ ]:
regex = '.*://(.[^/]+)/'

# Extract the source (e.g., nytimes) from each .com URL.
sub_query = """
SELECT
    title,
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source
    
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
    AND LENGTH(title) > 10
""".format(regex)


query = """
SELECT 
    LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
    source
FROM
  ({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(sub_query=sub_query)

print(query)

For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML however figures out on its own how to create these splits, so we won't need to do that here.


In [ ]:
bq = bigquery.Client(project=PROJECT)
title_dataset = bq.query(query).to_dataframe()
title_dataset.head()
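
As an aside, if we were not using AutoML, a manual split could be done in a few lines of pandas. This is a sketch for reference only, since AutoML creates these splits on its own:


In [ ]:
# For reference only: a simple 80/10/10 train/validation/test split.
# AutoML does this automatically, so this cell is not needed here.
shuffled = title_dataset.sample(frac=1, random_state=SEED)
n = len(shuffled)
train_df = shuffled[:int(0.8 * n)]
valid_df = shuffled[int(0.8 * n):int(0.9 * n)]
test_df = shuffled[int(0.9 * n):]
print(len(train_df), len(valid_df), len(test_df))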

AutoML for text classification requires that

  • the dataset be in CSV form,
  • the first column contain the texts to classify (or a GCS path to the text), and
  • the last column contain the text labels.

The dataset we pulled from BigQuery satisfies these requirements.
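
For illustration, two rows in this format could look like the following (the titles are hypothetical):

    a new framework for distributed builds,github
    the latest smartphone reviewed,techcrunch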


In [ ]:
print("The full dataset contains {n} titles".format(n=len(title_dataset)))

Let's make sure we have roughly the same number of examples for each of our three labels:


In [ ]:
title_dataset.source.value_counts()

Finally, we will save our data, which is currently in memory, to disk.

We will create a CSV file containing the full dataset and another containing only 1000 articles for development.

Note: Training AutoML on the full dataset may take a long time, so we recommend using the sample dataset for the purpose of learning the tool.


In [ ]:
DATADIR = './data/'

if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

In [ ]:
FULL_DATASET_NAME = 'titles_full.csv'
FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)

# Let's shuffle the data before writing it to disk.
title_dataset = title_dataset.sample(n=len(title_dataset), random_state=SEED)

title_dataset.to_csv(
    FULL_DATASET_PATH, header=False, index=False, encoding='utf-8')

Now let's sample 1000 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see the AutoML documentation for further details on how to prepare data for AutoML).


In [ ]:
sample_title_dataset = title_dataset.sample(n=1000, random_state=SEED)
sample_title_dataset.source.value_counts()
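
If the random sample turns out noticeably unbalanced, a stratified sample that draws the same number of titles per source is a simple alternative. This is a sketch, not part of the original flow; balanced_sample is a hypothetical name, and 333 per source is an arbitrary choice that keeps the sample near 1000:


In [ ]:
# Alternative sketch: draw the same number of titles per source.
balanced_sample = (
    title_dataset
    .groupby('source', group_keys=False)
    .apply(lambda df: df.sample(n=333, random_state=SEED))
)
balanced_sample.source.value_counts()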

Let's write the sample dataset to disk.


In [ ]:
SAMPLE_DATASET_NAME = 'titles_sample.csv'
SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)

sample_title_dataset.to_csv(
    SAMPLE_DATASET_PATH, header=False, index=False, encoding='utf-8')

In [ ]:
sample_title_dataset.head()

In [ ]:
%%bash
gsutil cp data/titles_sample.csv gs://$BUCKET

Train a Model with AutoML for Text Classification

Step 1: Launch AutoML

Go to the GCP console, and click on the Natural Language service in the console menu. Click on 'ENABLE API' if the API is not enabled.
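
If you prefer the command line, the same API can be enabled with gcloud (assuming the gcloud CLI is set up for your project):


In [ ]:
%%bash
gcloud services enable automl.googleapis.com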

Then click on "Get started" in the "AutoML Text & Documentation Classification" tile.

Step 2: Create a Dataset

Select "New Dataset"

Then

  1. Give the new dataset a name
  2. Choose "Single-label classification" (each title has exactly one source)
  3. Hit "Create Dataset"

Then

  1. In the 'Select files to import' section, choose 'Select a CSV file on Cloud Storage'.
  2. Click on 'Browse' and select the titles_sample.csv file we copied to the bucket above.
  3. Hit "Import"

This step may take a while.

Step 3: Train an AutoML text model

When the dataset has been imported, you can inspect it in the AutoML UI and get statistics about the label distribution. If you are happy with what you see, proceed to train a text model from this dataset.

Then

  1. Switch to 'Train' tab.
  2. Click on 'START TRAINING' and confirm again by clicking on 'START TRAINING'.

The training step may last a few hours while AutoML is searching for the best model to crush this dataset.


Step 4: Evaluate the model

Once the model is trained, click on "Evaluate" to understand how the model performed. You'll be able to see the overall precision and recall, as well as drill down to performance at the individual label level.

The AutoML UI will also show you examples where the model made a mistake for each of the labels.

Step 5: Predict with the trained AutoML model

Now you can test your model directly by entering new text in the UI and having AutoML predict the source of your snippet.
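
You can also query a deployed model programmatically with the google-cloud-automl client library. The sketch below assumes the model has been deployed; MODEL_ID is a hypothetical placeholder that you would replace with the ID shown in the AutoML UI:


In [ ]:
from google.cloud import automl

MODEL_ID = "TCN1234567890123456789"  # hypothetical -- copy the real ID from the UI

prediction_client = automl.PredictionServiceClient()
model_full_id = automl.AutoMlClient.model_path(PROJECT, REGION, MODEL_ID)

# Wrap the text to classify in the payload expected by the API.
text_snippet = automl.TextSnippet(
    content="a new library for training models on large datasets",
    mime_type="text/plain")
payload = automl.ExamplePayload(text_snippet=text_snippet)

response = prediction_client.predict(name=model_full_id, payload=payload)
for annotation in response.payload:
    print(annotation.display_name, annotation.classification.score)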

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License