An autotagger model matches unstructured text queries against a reference set of strings, a.k.a. tags, which are known beforehand. The task is similar to fuzzy matching, but unlike fuzzy matching, autotagging is typically done with a fixed set of tags, and it treats the unstructured documents as the queries. Also in contrast to fuzzy matching or search, the autotagger supports efficient batch tagging of documents.
The primary use case is the following: given a reference set of labels and free-text queries, assign zero or more labels to each query. Each assignment is given a similarity score, and if there are no labels with non-zero scores for a particular record, then none are assigned. The initial implementation uses a nearest neighbor model.
Autotagging is also closely related to multiclass document classification, where training data is used to construct a model that predicts the best tag for each document query. In many cases, however, training data is not available, or it's not appropriate to always assign the single top tag (or top-k tags) to a document.
One common approach to this tagging problem is to build a topic model on some training set of documents, and to use that model to make predictions about topic associations on new data. Typically, unless you're using labeled topics, you would then assign the highest scoring terms from the relevant topics as tags. One drawback of standard topic modeling is that you often end up with noisy terms in your topics. Showing garbage terms to a user -- in the case of end-user applications -- is unhelpful and makes your application look bad.
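To make the comparison concrete, here is a rough sketch of that topic-model workflow using GraphLab Create's topic model toolkit. The toy corpus, column name, and parameter values are assumptions for illustration only; this is not the pipeline used later in the notebook.
import graphlab as gl
# Toy corpus standing in for a real training set; in practice this would be
# a large SFrame of documents with a text column.
docs = gl.SFrame({"text": ["deep learning frameworks for mobile devices",
                           "new smartphone and tablet hardware announced today",
                           "open source databases for web applications"]})
# Convert to bag-of-words counts and fit a topic model.
bow = gl.text_analytics.count_words(docs["text"])
topic_model = gl.topic_model.create(bow, num_topics=2)
# The usual tagging heuristic: take the highest-scoring words from each topic.
# These top words are exactly where noisy, "garbage" terms tend to creep in.
topic_model.get_topics(num_words=5).print_rows()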
In this notebook, we will show you how to collect a high-quality set of entities from the web to use as a tag set, how to train an autotagger model with that reference data, and then how to use that model to tag news articles.
The GraphLab autotagger is part of our new data matching toolkit. This toolkit is currently in beta, and feedback is welcome. Send comments to support@turi.com.
Let's consider one real-life example. It is very common for news aggregators -- such as Hacker News, Prismatic, and Right Relevance, to name a few -- to provide high-level categories with the stories that they show their users. These are useful for two obvious reasons. Firstly, they give us a snapshot of what the article is about, and secondly, they provide an additional mechanism for filtering or navigating the stream of news. For this notebook, we'll use Hacker News as our data set.
The raw source data used for this analysis can be found at the Internet Archive. There are two tables: a stories table and a comments table, both of which went through some initial pre-processing to facilitate ingestion into an SFrame. For the purposes of this analysis, we'll only concern ourselves with the stories SFrame, which you can download with the code below.
In [8]:
import graphlab as gl # this import will be assumed from now on
stories_sf = gl.load_sframe("https://static.turi.com/datasets/hacker_news/stories_with_text.sframe")
stories_sf
Out[8]:
The original data underwent a handful of transformations before we could start tagging. For starters, the source article content was missing for the majority of the records in the stories table. That content is crucial for doing any tagging. Using Distributed, it's pretty straightforward to launch parallel jobs in EC2 to crawl the source for each story in our data set.
Once a large enough subset of the source articles was downloaded, we had to extract the main content from the HTML. For this, we used Boilerpipe, a fast Java library for extracting the primary textual content from HTML pages. The python-boilerpipe module makes it possible to call Boilerpipe from Python. You can find a how-to here for this preprocessing step.
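As a rough illustration of the per-story fetch-and-extract step that those crawl jobs perform, the sketch below downloads one page with requests and strips the boilerplate with python-boilerpipe. The helper name and error handling are our own additions, not part of the original pipeline.
import requests
from boilerpipe.extract import Extractor
def fetch_story_text(url):
    # Download the raw HTML for one story; bail out on non-200 responses.
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return None
    # ArticleExtractor is Boilerpipe's extractor tuned for news-style pages.
    extractor = Extractor(extractor="ArticleExtractor", html=response.text)
    return extractor.getText()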
The end result of the pre-processing pipeline is the stories SFrame downloaded in the previous section, which contains story metadata along with clean content for each Hacker News story in the original dataset for which content was available.
Using the clean content we've crawled and extracted, we're just about ready to do some tagging. All we need is some tags.
When it comes to deciding on a tag set, there are two alternative approaches to consider:
1. using an algorithmically generated set of tags
2. using a curated set of tags
Where do we find the right set of tags? In some cases, you may already have the set of tags you want to use. For example, many users have known and relatively fixed product catalogs, which act as the tags. If you don't have an existing set of tags, maybe someone else does. Freebase is a well-known example of a crowd-sourced database of entities. Being crowd-sourced, it's prone to noise. Even more problematic, it's far more exhaustive than is useful for us. Determining the right level of granularity for a tag set is a difficult, application-specific task.
It's a good idea to experiment with different tag sets on your own, but for the purposes of this notebook, we'll use a curated set of tags rather than trying to generate one algorithmically, to reduce the opportunity for error. In particular, let's use one that we know is of high quality: the set of technology concepts that Google News uses. This data is not publicly available in its entirety, but it's easy enough to scrape the currently trending entities from Google News with a little Python code, using the BeautifulSoup and requests modules.
In [9]:
from bs4 import BeautifulSoup
import requests
url = "http://news.google.com/news/section?pz=1&cf=all&topic=tc"  # Google News technology section
soup = BeautifulSoup(requests.get(url).text, "html.parser")
entities = [t.get_text() for t in soup.select('.topic')]
entities
Out[9]:
Run this snippet as a cron job twice a day, and after a couple of months you'll have a useful collection of high-quality, trending technology entities. You could continually append them to an SFrame in S3, or simply save them to your favorite database -- take your pick. Let's assume we've been saving them to S3; below, we load them back into an SFrame and visualize them using GraphLab Canvas.
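For reference, here is a minimal sketch of what that periodic save step might look like. The S3 path is hypothetical, the tag column is assumed to be named "Tag" to match the model created below, and an accumulated SFrame is assumed to already exist at that path.
S3_PATH = "s3://my-bucket/google_news_tech_concepts"  # hypothetical location
new_batch = gl.SFrame({"Tag": entities})              # entities scraped above
existing = gl.load_sframe(S3_PATH)                    # previously accumulated concepts
combined = existing.append(new_batch).unique()        # drop duplicates across runs
combined.save(S3_PATH)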
In [10]:
google_concepts_sf = gl.load_sframe("https://static.turi.com/datasets/hacker_news/google_news_tech_concepts.sframe")
gl.canvas.set_target("ipynb")
google_concepts_sf.show()
Finally, let's create an autotagger model.
In [11]:
m = gl.autotagger.create(google_concepts_sf, verbose=False, tag_name="Tag")
Before tagging, we create a new column in the stories SFrame by combining the title and text columns into a single text column.
In [12]:
stories_sf["text_title"] = stories_sf.apply(lambda row: row["title"] + " " + row["text"])
example_row = stories_sf[stories_sf["objectID"] == "2019157"]
m.tag(example_row, query_name="text_title").print_rows(max_column_width=100)
In this particular post, the author is looking for advice on designing a web application. Two of the assigned tags ("Web search engines" and "The Internet") are reasonably relevant. On the other hand, "Lumia Series" and "Take-Two Interactive" are pretty specific and not very relevant. (This suggests that our tag set could use some additional curation to be better suited to our application.)
It's illuminating to explore the data this way -- sample some posts at random and look at the tags our model assigns. Perhaps we then modify our tag set, retrain, re-tag, and compare the results. There are bound to be some bad results, but the more care you put into creating the reference set of tags, the better the results will be.
In this initial implementation, there are just three knobs we can tweak. The first, as mentioned above, is the reference set itself. The next parameter to be aware of is the similarity_threshold parameter of the tag method, which filters out results whose similarity score to the query falls below a chosen threshold. This won't help us replace bad tags with good ones, but it will prevent us from showing bad tags in the first place. (Precision is much more important than recall for this application.) The final parameter we can adjust is the query_name parameter, which determines the column of textual data to use as the basis for tagging. Let's try tagging again, this time using only the title of the Hacker News post.
In [13]:
m.tag(example_row, query_name="title").print_rows(max_column_width=100)
The model doesn't have enough informative text to work with when tagging solely based on the title.
Let's try tagging again, this time with a larger sample of our data set, indicating our bias for high precision by setting the similarity_threshold parameter to .5, which amounts to applying tags only where the score is .5 or greater.
In [14]:
sample = stories_sf.sample(.1)
m.tag(sample, query_name="text_title", k=2, similarity_threshold=.5)
Out[14]:
A crucial point when creating autotagger models is that the tags assigned by the model are only ever as good as the reference set. You can always raise the similarity threshold at tagging time to increase precision, but if there are garbage tags in your reference set at model creation time, those same garbage tags will show up at tagging time. So invest time up front to minimize the noise in your reference data. And if you have the resources to collect training data, you can expect a substantial improvement in the accuracy of your tagger.
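As a small illustration of that up-front cleanup, here is a hedged sketch of simple filtering one might apply to the reference SFrame before calling gl.autotagger.create. The "Tag" column name matches the model created above, but the specific filters -- trimming whitespace, dropping very short strings, deduplicating -- are illustrative choices rather than a prescribed recipe.
# Illustrative reference-set cleanup before model creation.
cleaned = gl.SFrame({"Tag": google_concepts_sf["Tag"].apply(lambda t: t.strip())})
cleaned = cleaned[cleaned["Tag"].apply(lambda t: len(t) > 2)]  # drop very short strings
cleaned = cleaned.unique()                                     # drop exact duplicates
m_clean = gl.autotagger.create(cleaned, tag_name="Tag", verbose=False)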
We hope to improve our data matching offerings in the coming months, and several improvements are already on the horizon.
Email us at support@turi.com and let us know if you have a data matching problem that we can help you solve.