In [0]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This notebook demonstrates how to prepare text data available in scikit-learn (or other libraries) for use in Google Cloud AutoML Natural Language.
The script reads the data into a pandas dataframe, applies some minor transformations to ensure compatibility with the AutoML Natural Language input specification, and finally saves the result to a CSV file, which can be downloaded from the notebook server.
This notebook downloads the 20 newsgroups dataset using scikit-learn. This dataset contains about 18,000 posts across 20 newsgroups and is useful for text classification. More details on the dataset can be found in the scikit-learn documentation.
There are three goals for this notebook: download the dataset, transform it to meet the AutoML Natural Language input requirements, and export it as a CSV file.
After downloading the CSV at the end of this notebook, import the data into Google Cloud AutoML Natural Language to explore classifying text.
In [0]:
import numpy as np
import pandas as pd
import csv
from sklearn.datasets import fetch_20newsgroups
In [0]:
newsgroups = fetch_20newsgroups(subset='all')
df = pd.DataFrame(newsgroups.data, columns=['text'])
df['categories'] = [newsgroups.target_names[index] for index in newsgroups.target]
df.head()
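As a quick sanity check after loading, `value_counts` shows how many posts each newsgroup contributes. A minimal sketch on a small stand-in frame with the same two columns (the real dataframe, built above, would show all 20 classes):

```python
import pandas as pd

# Stand-in for the full dataframe: same column names as above
df = pd.DataFrame({
    'text': ['post one', 'post two', 'post three'],
    'categories': ['sci.space', 'sci.space', 'rec.autos'],
})

# Posts per newsgroup; on the real data this prints 20 classes
counts = df['categories'].value_counts()
print(counts.to_dict())  # → {'sci.space': 2, 'rec.autos': 1}
```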
In [0]:
# Convert multiple whitespace characters into a space
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)
# Change newsgroup titles to use underscores rather than periods
df['categories'] = df['categories'].str.replace('.', '_', regex=False)
# Trim leading and tailing whitespace
df['text'] = df['text'].str.strip()
# Truncate all fields to the 128kB maximum field length (counted here as characters)
df['text'] = df['text'].str.slice(0,131072)
# Remove any rows with empty fields
df = df.replace('', np.nan).dropna()
# Drop duplicates
df = df.drop_duplicates(subset='text')
# Limit rows to maximum of 100,000
df = df.sample(min(100000, len(df)))
df.head()
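The same transformations can be exercised on a tiny hand-made frame to see each step's effect — whitespace collapsed, category names rewritten, empty rows dropped, and duplicates removed. A sketch (column names match the notebook; the sample data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'text': ['  Hello\n\nworld  ', '', 'Same post', 'Same post'],
    'categories': ['sci.space', 'rec.autos', 'comp.graphics', 'comp.graphics'],
})

# Collapse runs of whitespace (including newlines) into single spaces
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)
# Literal-dot replacement; without regex=False, '.' would match any character
df['categories'] = df['categories'].str.replace('.', '_', regex=False)
df['text'] = df['text'].str.strip()
df['text'] = df['text'].str.slice(0, 131072)
# The empty-text row and the duplicate row are removed
df = df.replace('', np.nan).dropna()
df = df.drop_duplicates(subset='text')

print(df.to_dict('records'))
# → [{'text': 'Hello world', 'categories': 'sci_space'},
#    {'text': 'Same post', 'categories': 'comp_graphics'}]
```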
In [0]:
df.to_csv("20-newsgroups-dataset.csv", index=False, header=False)
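The `csv` module imported earlier can confirm that every row of the output file parses back as exactly two fields (text, category), even when posts contain commas. A minimal sketch using a small stand-in frame:

```python
import csv
import pandas as pd

# Stand-in frame; pandas quotes the field containing a comma automatically
df = pd.DataFrame({
    'text': ['Hello world', 'Another, post with a comma'],
    'categories': ['sci_space', 'comp_graphics'],
})
df.to_csv('check.csv', index=False, header=False)

with open('check.csv', newline='') as f:
    rows = list(csv.reader(f))
print([len(r) for r in rows])  # → [2, 2]
```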
You're all set! Download 20-newsgroups-dataset.csv
and import it into Google Cloud AutoML Natural Language.
If you are using Colab, you will find the file in the /content
directory via the file browser in the left navbar.