In [0]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Prepare data for Google Cloud AutoML Natural Language from scikit-learn

Overview

This notebook demonstrates how to prepare text data available in scikit-learn (or other libraries), so that it can be used in Google Cloud AutoML Natural Language.

The script reads the data into a pandas dataframe, then applies a few minor transformations to make it compatible with the AutoML Natural Language input specification. Finally, the dataframe is exported to a CSV file, which can be downloaded from the notebook server.

Dataset

This notebook downloads the 20 newsgroups dataset using scikit-learn. The dataset contains roughly 18,000 posts across 20 newsgroups and is a common benchmark for text classification. More details on the dataset can be found in the scikit-learn documentation.
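As a quick sanity check (a sketch that is not part of the original notebook), you can fetch the dataset and confirm its size and category names before building the dataframe:

```python
from sklearn.datasets import fetch_20newsgroups

# 'all' combines the train and test splits; the first call downloads the data
newsgroups = fetch_20newsgroups(subset='all')

print(len(newsgroups.data))          # total number of posts (~18,000)
print(len(newsgroups.target_names))  # 20 category names, e.g. 'sci.space'
```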

Objectives

There are 3 goals for this notebook:

  1. Introduce scikit-learn datasets
  2. Explore pandas dataframe text manipulation
  3. Import data into AutoML Natural Language for text classification

What's next?

After downloading the CSV at the end of this notebook, import the data into Google Cloud AutoML Natural Language to explore classifying text.

Imports


In [0]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups

Fetch data


In [0]:
newsgroups = fetch_20newsgroups(subset='all')

df = pd.DataFrame(newsgroups.data, columns=['text'])
df['categories'] = [newsgroups.target_names[index] for index in newsgroups.target]
df.head()

Clean data


In [0]:
# Collapse runs of whitespace characters into a single space
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

# Change newsgroup titles to use underscores rather than periods
# (regex=False so '.' is treated literally, not as a regex wildcard)
df['categories'] = df['categories'].str.replace('.', '_', regex=False)

# Trim leading and trailing whitespace
df['text'] = df['text'].str.strip()

# Truncate all fields to the maximum field length of 128 kB (131,072 characters)
df['text'] = df['text'].str.slice(0,131072)

# Remove any rows with empty fields
df = df.replace('', np.nan).dropna()

# Drop duplicates
df = df.drop_duplicates(subset='text')

# Limit rows to maximum of 100,000
df = df.sample(min(100000, len(df)))

df.head()
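To see what each cleaning step does, here is the same chain applied to a tiny hand-made dataframe (an illustrative sketch; the rows and category names are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'text': ['  Hello\n\tworld  ', '', 'Hello world', 'Another post'],
    'categories': ['sci.space', 'sci.space', 'sci.space', 'rec.autos'],
})

# Collapse whitespace and strip, as in the cell above
toy['text'] = toy['text'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Literal '.' -> '_' in category names
toy['categories'] = toy['categories'].str.replace('.', '_', regex=False)

# Drop the empty row, then the now-duplicate 'Hello world' row
toy = toy.replace('', np.nan).dropna().drop_duplicates(subset='text')

print(toy)
# Two rows survive: ('Hello world', 'sci_space') and ('Another post', 'rec_autos')
```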

Export to CSV


In [0]:
csv_str = df.to_csv(index=False, header=False)

# Use write() rather than print() to avoid appending an extra newline
with open("20-newsgroups-dataset.csv", "w") as text_file:
    text_file.write(csv_str)

You're all set! Download 20-newsgroups-dataset.csv and import it into Google Cloud AutoML Natural Language.

If you are using Colab, you will find the file in the left navbar:

  • From the menu, select View > Table of Contents
  • Navigate to the Files tab
  • Select .. and find the file in the /content directory
  • Download the CSV with the context menu