Learning Objectives
In this notebook, we will use AutoML for Text Classification to train a text model to recognize the source of article titles: New York Times, TechCrunch or GitHub.
In a first step, we will query a public dataset on BigQuery taken from Hacker News (an aggregator that displays tech-related headlines from various sources) to create our training set.
In a second step, we will use the AutoML UI to upload our dataset, train a text model on it, and evaluate the model we have just trained.
In [ ]:
import os
from google.cloud import bigquery
import pandas as pd
In [ ]:
%load_ext google.cloud.bigquery
Replace the variable values in the cell below:
In [ ]:
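PROJECT = "cloud-training-demos" # Replace with your PROJECT
BUCKET = PROJECT # defaults to PROJECT
REGION = "us-central1" # Replace with your REGION
SEED = 0
# Export BUCKET so that the %%bash cells below can read it as $BUCKET.
os.environ["BUCKET"] = BUCKET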
In [ ]:
%%bash
gsutil mb gs://$BUCKET
Hacker News headlines are available as a BigQuery public dataset. The dataset contains all headlines from the site's inception in October 2006 until October 2015.
Here is a sample of the dataset:
In [ ]:
%%bigquery --project $PROJECT
SELECT
url, title, score
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
LENGTH(title) > 10
AND score > 10
AND LENGTH(url) > 0
LIMIT 10
Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with nytimes.
In [ ]:
%%bigquery --project $PROJECT
SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
COUNT(title) AS num_articles
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
GROUP BY
source
ORDER BY num_articles DESC
LIMIT 100
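To make the parsing logic easier to follow, here is a minimal Python sketch of the same idea, using the regular expression from the query above. The function name extract_source is a hypothetical helper introduced for illustration, and Python's re module is not guaranteed to behave identically to BigQuery's regular expression engine in every edge case.

```python
import re

# Same pattern as in the query: capture the host between "://" and the next "/".
URL_HOST_PATTERN = re.compile(r".*://(.[^/]+)/")


def extract_source(url):
    """Return the second-level domain of a .com URL, or None if it does not apply."""
    match = URL_HOST_PATTERN.match(url)
    if match is None:
        # No "scheme://host/" prefix, similar to REGEXP_EXTRACT returning NULL.
        return None
    host = match.group(1)
    # The query filters hosts with '.com$'. Here the dot is matched literally,
    # which is slightly stricter than the pattern used in the query.
    if not host.endswith(".com"):
        return None
    parts = host.split(".")
    # Reverse the pieces and take index 1, mirroring ARRAY_REVERSE(...)[OFFSET(1)].
    return parts[::-1][1]


print(extract_source("http://mobile.nytimes.com/2015/01/story.html"))
print(extract_source("https://github.com/user/repo"))
```

Note that a URL without a slash after the host, such as https://techcrunch.com, does not match the pattern and is therefore dropped.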
Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.
In [ ]:
regex = '.*://(.[^/]+)/'
sub_query = """
SELECT
title,
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
AND LENGTH(title) > 10
""".format(regex)
query = """
SELECT
LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
source
FROM
({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(sub_query=sub_query)
print(query)
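The title clean-up performed by LOWER(REGEXP_REPLACE(...)) in the query can be sketched in plain Python as follows. The helper name normalize_title and the example titles are made up for illustration; the character class is copied from the query.

```python
import re

# Every character outside letters, digits, space, "$", "." and "-" becomes a space.
DISALLOWED_CHARS = re.compile(r"[^a-zA-Z0-9 $.-]")


def normalize_title(title):
    """Replace disallowed characters with spaces, then lowercase the result."""
    return DISALLOWED_CHARS.sub(" ", title).lower()


print(repr(normalize_title("Show HN: My App!")))
print(repr(normalize_title("Node.js v0.10 - $5 hosting")))
```

Each removed character is replaced by a space rather than deleted, so the cleaned titles may contain runs of spaces.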
For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML, however, figures out on its own how to create these splits, so we won't need to do that here.
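For reference, a manual split could look like the sketch below. This is not how AutoML creates its splits internally; the function name split_dataset and the 80/10/10 fractions are assumptions chosen for illustration.

```python
import random


def split_dataset(rows, train_fraction=0.8, eval_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and cut them into train, evaluation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_eval = int(len(shuffled) * eval_fraction)
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    # Whatever remains after the first two slices becomes the test set.
    test = shuffled[n_train + n_eval:]
    return train, evaluation, test


train, evaluation, test = split_dataset(range(100))
print(len(train), len(evaluation), len(test))
```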
In [ ]:
bq = bigquery.Client(project=PROJECT)
title_dataset = bq.query(query).to_dataframe()
title_dataset.head()
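AutoML for text classification requires that the dataset be in CSV form, with the text to classify in the first column and the label in the last column.
The dataset we pulled from BigQuery satisfies these requirements.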
In [ ]:
print("The full dataset contains {n} titles".format(n=len(title_dataset)))
Let's make sure we have roughly the same number of examples for each of our three labels:
In [ ]:
title_dataset.source.value_counts()
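If raw counts are hard to compare at a glance, the share of each label can be computed as in this small standard-library sketch. The helper label_shares and the toy label list are hypothetical; in the notebook the labels would come from the source column.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of examples that carry each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only.
shares = label_shares(["github", "github", "nytimes", "techcrunch"])
print(shares)
```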
Finally we will save our data, which is currently in-memory, to disk.
We will create a csv file containing the full dataset and another containing only 1000 articles for development.
Note: It may take a long time to train AutoML on the full dataset, so we recommend using the sample dataset for the purpose of learning the tool.
In [ ]:
DATADIR = './data/'
if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)
In [ ]:
FULL_DATASET_NAME = 'titles_full.csv'
FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)
# Let's shuffle the data before writing it to disk.
title_dataset = title_dataset.sample(n=len(title_dataset), random_state=SEED)
title_dataset.to_csv(
    FULL_DATASET_PATH, header=False, index=False, encoding='utf-8')
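To see what a header-less, index-less CSV of this kind looks like, here is a small sketch using only the standard library and an in-memory buffer. The two rows are made-up examples, not rows from the dataset.

```python
import csv
import io

# Made-up (title, source) rows for illustration only.
rows = [
    ("show hn  a tiny web framework", "github"),
    ("a new startup raises $5m", "techcrunch"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
# No header row is written: the text comes first and the label last.
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```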
Now let's sample 1000 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see the AutoML documentation for further details on how to prepare data for AutoML).
In [ ]:
sample_title_dataset = title_dataset.sample(n=1000, random_state=SEED)
sample_title_dataset.source.value_counts()
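The cell above draws a plain random sample. If a rare label ended up with too few examples, one alternative would be stratified sampling, which keeps the label proportions of the full dataset. The sketch below is an illustration of that different technique, not what the notebook does; stratified_sample and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, n, seed=0):
    """Sample about n (title, source) rows while preserving label proportions."""
    by_label = defaultdict(list)
    for title, source in rows:
        by_label[source].append((title, source))
    rng = random.Random(seed)
    sample = []
    for source, group in by_label.items():
        # Each label gets a share of n proportional to its size, and at least one row.
        k = max(1, round(n * len(group) / len(rows)))
        k = min(k, len(group))
        sample.extend(rng.sample(group, k))
    rng.shuffle(sample)
    return sample


# Toy dataset: 60 github, 30 nytimes and 10 techcrunch titles.
toy_rows = (
    [("title %d" % i, "github") for i in range(60)]
    + [("title %d" % i, "nytimes") for i in range(30)]
    + [("title %d" % i, "techcrunch") for i in range(10)]
)
toy_sample = stratified_sample(toy_rows, n=10)
print(Counter(source for _, source in toy_sample))
```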
Let's write the sample dataset to disk.
In [ ]:
SAMPLE_DATASET_NAME = 'titles_sample.csv'
SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)
sample_title_dataset.to_csv(
    SAMPLE_DATASET_PATH, header=False, index=False, encoding='utf-8')
In [ ]:
sample_title_dataset.head()
In [ ]:
%%bash
gsutil cp data/titles_sample.csv gs://$BUCKET
Go to the GCP console, and click on the Natural Language service in the console menu. Click on 'ENABLE API' if the API is not enabled.
Then click on "Get started" in the "AutoML Text & Documentation Classification" tile.
Select "New Dataset"
Then import the titles_sample.csv file we created above. This step may take a while. You should see the following screen:
When the dataset has been imported, you can inspect it in the AutoML UI, and get statistics about the label distribution. If you are happy with what you see, proceed to train a text model from this dataset:
The training step may last a few hours while AutoML searches for the best model for this dataset.
You should see the following screen:
Once the model is trained, click on "Evaluate" to understand how the model performed. You'll be able to see the overall precision and recall, as well as drill down to performance at the individual label level.
The AutoML UI will also show you examples where the model made a mistake for each of the labels.
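As a reminder of what the per-label numbers mean, here is a minimal sketch that computes precision and recall for one label from a list of true and predicted labels. The function precision_recall and the toy predictions are hypothetical and are not taken from an actual AutoML evaluation.

```python
def precision_recall(true_labels, predicted_labels, label):
    """Return (precision, recall) for one label, treating it as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    false_pos = sum(1 for t, p in pairs if t != label and p == label)
    false_neg = sum(1 for t, p in pairs if t == label and p != label)
    # Precision: of the titles predicted as this label, how many were right.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the titles that truly have this label, how many were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy predictions for illustration only.
true_labels = ["github", "github", "nytimes", "techcrunch"]
predicted_labels = ["github", "nytimes", "nytimes", "techcrunch"]
print(precision_recall(true_labels, predicted_labels, "github"))
print(precision_recall(true_labels, predicted_labels, "nytimes"))
```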
Now you can test your model directly by entering new text in the UI and having AutoML predict the source of your snippet:
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License