Autotagging Hacker News Posts

What is autotagging?

An autotagger model matches unstructured text queries to a reference set of strings, a.k.a. tags, which are known beforehand. The task is similar to fuzzy matching, but unlike fuzzy matching, autotagging is typically done with a fixed set of tags, and it treats the unstructured documents as the queries. Also in contrast to fuzzy matching or search, the autotagger supports efficient batch tagging of documents.

The primary use case is the following: given a reference set of labels and free-text queries, assign zero or more labels to each query. Each assignment is given a similarity score, and if there are no labels with non-zero scores for a particular record, then none are assigned. The initial implementation uses a nearest neighbor model.

Autotagging is also closely related to multiclass document classification, where training data is used to construct a model that predicts the best tag for each document query. In many cases, however, training data is not available, or it's not appropriate to always assign the single top tag (or top-k tags) to a document.

One common approach to this tagging problem is to build a topic model on some training set of documents, and to use that model to make predictions about topic associations on new data. Typically, unless you're using labeled topics, you would then assign the highest scoring terms from the relevant topics as tags. One drawback of standard topic modeling is that you often end up with noisy terms in your topics. Showing garbage terms to a user -- in the case of end-user applications -- is unhelpful and makes your application look bad.

In this notebook, we will show you how to collect a high-quality set of entities from the web to use as a tag set, how to train an autotagger model with that reference data, and then how to use that model to tag news articles.

The GraphLab autotagger is part of our new data matching toolkit. This toolkit is currently in beta, and feedback is welcome. Send comments to support@turi.com.

A motivating example

Let's consider a real-life example. It is very common for news aggregators -- such as Hacker News, Prismatic, and Right Relevance, to name a few -- to provide high-level categories for the stories they show their users. These categories are useful for two reasons: they give us a snapshot of what an article is about, and they provide an additional mechanism for filtering or navigating the stream of news. For this notebook, we'll use Hacker News as our data set.

The raw source data used for this analysis can be found at the Internet Archive. There are two tables: a stories table and a comments table, both of which went through some initial pre-processing to facilitate ingestion into an SFrame. For the purposes of this analysis, we'll only concern ourselves with the stories SFrame, which you can download with the code below.


In [8]:
import graphlab as gl # this import will be assumed from now on
stories_sf = gl.load_sframe("https://static.turi.com/datasets/hacker_news/stories_with_text.sframe")
stories_sf


Out[8]:
+----------------+---------------------------+--------------+----------+--------+
|     author     |         created_at        | num_comments | objectID | points |
+----------------+---------------------------+--------------+----------+--------+
|     Leynos     | 2014-05-29 08:23:46+00:00 |      0       | 7815285  |   1    |
|   darrhiggs    | 2014-05-29 08:21:56+00:00 |      0       | 7815279  |   2    |
|    a159482a    | 2014-05-29 08:18:42+00:00 |      1       | 7815274  |   1    |
|    dveeden2    | 2014-05-29 08:13:23+00:00 |      0       | 7815255  |   1    |
|   JensRantil   | 2014-05-29 08:11:53+00:00 |      0       | 7815249  |   2    |
| connochristou  | 2014-05-29 08:10:51+00:00 |      0       | 7815247  |   1    |
|   stremovsky   | 2014-05-29 08:09:46+00:00 |      0       | 7815245  |   1    |
|   pakostina    | 2014-05-29 08:08:31+00:00 |      0       | 7815241  |   1    |
| dragongraphics | 2014-05-29 08:07:50+00:00 |      0       | 7815238  |   2    |
|    morphics    | 2014-05-29 08:07:22+00:00 |      0       | 7815237  |   2    |
+----------------+---------------------------+--------------+----------+--------+
+------------------------------------------------------+
|                         text                         |
+------------------------------------------------------+
| May 28, 2014 10:55 pm\nMay 28, 2014 10:55 ...        |
| Uber app taxi row referred to London's ...           |
| Young Global Leaders Nomination Form\nThe ...         |
| News, announcements, release info, and ...           |
| Aug 12, 2014\nREADME\nUsage: fm ...                    |
| A Blog About Monetizing Mobile Apps\nAvocarrot ...    |
| Windows Mobile Test\nHello\nWe are ...                 |
| Mobile\nWhy Mobile Developers Should Care ...          |
| About\nAre we getting too Sassy?\nI love Sass. I ...   |
| Warning: You are entering the XSS game ...           |
+------------------------------------------------------+
+-----------------------------------------------------+--------------------------------------------------------+
|                        title                        |                           url                          |
+-----------------------------------------------------+--------------------------------------------------------+
| Making Twitter Easier to Use ...                    | http://bits.blogs.nytimes.com/2014/05/28/making- ...  |
| London refers Uber app row to High Court ...        | http://www.bbc.co.uk/news/technology-27617079 ...     |
| Young Global Leaders, who should be nominated? ...  | http://nomination.younggloballeaders.org/nomin ...    |
| Unicode Security Data: Beta Review ...              | http://unicode-inc.blogspot.nl/2014/05 ...            |
| FileMap: MapReduce on the CLI ...                   | https://github.com/mfisk/filemap ...                  |
| Hybrid App Monetization Example with Mobile Ads ... | http://www.avocarrot.com/blog/hybrid-app- ...         |
| We need oppinion from Android Developers ...        | https://droidresearch.wufoo.com/forms/windows- ...    |
| \t Why Mobile Developers Should Care About Deep ... | http://developer.telerik.com/products/why- ...        |
| Are we getting too Sassy? Weighing up micro- ...    | http://ashleynolan.co.uk/blog/are-we-getting-too- ... |
| Google's XSS game                                   | https://xss-game.appspot.com/ ...                     |
+-----------------------------------------------------+--------------------------------------------------------+
[253973 rows x 8 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Data pre-processing

The original data underwent a handful of transformations before we could start tagging. For starters, the source article content was missing for the majority of the records in the stories table. That content is crucial for doing any tagging. Using Distributed, it's pretty straightforward to launch parallel jobs in EC2 to crawl the source for each story in our data set.
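As a rough, serial illustration of that crawl step (the real pipeline used parallel EC2 jobs), the sketch below fetches the raw HTML for each story; the fetch_html helper, the timeout value, and the "url"/"html" column names are assumptions for illustration.

import requests

def fetch_html(url):
    """Download the raw HTML for a single story URL, returning None on failure."""
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return None

# Serial version shown for illustration only -- far too slow for 250k+ stories.
# stories_sf["html"] = stories_sf["url"].apply(fetch_html)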

Once a large enough subset of the source articles was downloaded, we had to extract the main content from the HTML. For this, we used Boilerpipe, a fast Java library for extracting primary textual content out of HTML pages. Calling Boilerpipe from Python is made possible by the python-boilerpipe module. You can find a how-to here for this preprocessing step.
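For reference, a minimal sketch of that extraction step with python-boilerpipe might look like the following; the extract_content helper and the "html"/"text" column names are assumptions for illustration.

from boilerpipe.extract import Extractor

def extract_content(html):
    # Pull the main article text out of a raw HTML page with Boilerpipe's ArticleExtractor.
    if not html:
        return ""
    return Extractor(extractor="ArticleExtractor", html=html).getText()

# stories_sf["text"] = stories_sf["html"].apply(extract_content)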

The end result of the pre-processing pipeline is the stories SFrame downloaded in the previous section, which contains story metadata, as well as clean content for each Hacker News story in the original dataset (but only those for which content was available).

Ready, set, tag!

Using the clean content we've crawled and extracted, we're just about ready to do some tagging. All we need is some tags.

When it comes to deciding on a tag set, there are two alternative approaches to consider:

1. using an algorithmically generated set of tags
2. using a curated set of tags

Where do we find the right set of tags? In some cases, you may already have the set of tags you want to use. For example, many users have known and relatively fixed product catalogs, which act as the tags. If you don't have an existing set of tags, maybe someone else does. Freebase is a well-known example of a crowd-sourced database of entities. Being crowd-sourced, it's prone to noise. Even more problematic is that it's far more exhaustive than is useful for us. Determining the right level of granularity for a tag set is a difficult task and definitely application-specific.

It's a good idea to experiment with different tag sets on your own, but for the purposes of this notebook, we'll use a curated set of tags rather than trying to generate one algorithmically, to reduce the opportunity for error. In particular, let's use one that we know is of high quality: the set of technology concepts that Google News uses. This data is not publicly available in its entirety, but it's easy enough to get the currently trending entities from Google News with a little Python code, using the BeautifulSoup and requests modules.


In [9]:
from bs4 import BeautifulSoup
import requests

url = "http://news.google.com/news/section?pz=1&cf=all&topic=tc"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
entities = [t.get_text() for t in soup.select('.topic')]
entities


Out[9]:
[u'Facebook',
 u'BlackBerry',
 u'YouTube',
 u'Dropbox',
 u'Comcast',
 u'Google Inbox',
 u'Lytro',
 u'Apple Inc.',
 u'Bethesda Softworks',
 u'Huawei']

Run this snippet as a cron job twice a day, and after a couple of months you'll have a useful collection of high-quality, trending technology entities. You could continually append them to an SFrame in S3, or simply save them to your favorite database -- take your pick (a sketch of the S3 approach is shown below). Let's assume that we've been saving them to S3, and now load them back into an SFrame and visualize them using GraphLab Canvas.
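As a rough sketch of that bookkeeping, the snippet below appends each day's scraped entities to a saved SFrame and de-duplicates; the S3 path and the "Tag" column name are hypothetical, and the exception type raised for a missing SFrame may differ in your environment.

tags_path = "s3://my-bucket/google_news_tech_concepts.sframe"  # hypothetical location
new_tags = gl.SFrame({"Tag": entities})  # today's batch of trending entities

try:
    # append to whatever we've accumulated so far
    all_tags = gl.load_sframe(tags_path).append(new_tags)
except IOError:
    # first run: nothing saved yet
    all_tags = new_tags

all_tags = all_tags.unique()  # drop entities we've already seen
all_tags.save(tags_path)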


In [10]:
google_concepts_sf = gl.load_sframe("https://static.turi.com/datasets/hacker_news/google_news_tech_concepts.sframe")
gl.canvas.set_target("ipynb")
google_concepts_sf.show()


Finally, let's create an autotagger model.


In [11]:
m = gl.autotagger.create(google_concepts_sf, verbose=False, tag_name="Tag")

Before tagging, we create a new column in the stories SFrame by combining the text and title columns into a single text column.


In [12]:
stories_sf["text_title"] = stories_sf.apply(lambda row: row["title"] + " " + row["text"])
example_row = stories_sf[stories_sf["objectID"] == "2019157"]

m.tag(example_row, query_name="text_title").print_rows(max_column_width=100)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.250627    | 717us        |
PROGRESS: | Done         |         | 100         | 2.162ms      |
PROGRESS: +--------------+---------+-------------+--------------+
+---------------+
| text_title_id |
+---------------+
|       0       |
|       0       |
|       0       |
|       0       |
|       0       |
+---------------+
+-----------------------------------------------------------------------------------------------------+
|                                              text_title                                             |
+-----------------------------------------------------------------------------------------------------+
| Ask HN: How do you begin to build your new project? I'm embarking on my first "decent-size" proj... |
| Ask HN: How do you begin to build your new project? I'm embarking on my first "decent-size" proj... |
| Ask HN: How do you begin to build your new project? I'm embarking on my first "decent-size" proj... |
| Ask HN: How do you begin to build your new project? I'm embarking on my first "decent-size" proj... |
| Ask HN: How do you begin to build your new project? I'm embarking on my first "decent-size" proj... |
+-----------------------------------------------------------------------------------------------------+
+----------------------+-----------------+
|         Tag          |      score      |
+----------------------+-----------------+
| Take-Two Interactive |      0.0625     |
|     The Internet     | 0.0377358490566 |
|     Lumia series     | 0.0373831775701 |
|  Internet of Things  | 0.0353982300885 |
|  Web search engines  | 0.0350877192982 |
+----------------------+-----------------+
[5 rows x 4 columns]

In this particular post, the author is looking for advice on designing a web application. Two of the assigned tags ("Web search engines" and "The Internet") are reasonably relevant. On the other hand, "Lumia series" and "Take-Two Interactive" are quite specific and not very relevant. (This suggests that our tag set could use some additional curation to be better suited to our application.)

It's illuminating to explore the data this way -- sample some posts randomly, and look at the tags assigned by our model. Perhaps we modify our tag set, retrain, re-tag and then compare the results. There are bound to be some bad results, but the more care used when creating the reference set of tags, the better the results will be.
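A single iteration of that modify-retrain-re-tag loop might look like the sketch below; the tags being dropped and the "Tag" column name are illustrative assumptions.

# Drop tags we've judged irrelevant for this application, then retrain and re-tag.
bad_tags = {"Take-Two Interactive", "Lumia series"}
pruned_tags = google_concepts_sf[google_concepts_sf["Tag"].apply(lambda t: t not in bad_tags)]

m2 = gl.autotagger.create(pruned_tags, tag_name="Tag", verbose=False)
m2.tag(example_row, query_name="text_title").print_rows(max_column_width=100)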

In this initial implementation, there are just three knobs we can tweak. The first, as mentioned above, is the reference set itself. The next parameter you'll want to be aware of is the similarity_threshold parameter to the tag method, which lets us filter out results whose similarity to the query falls below some threshold. This won't help us replace bad tags with good ones, but it will prevent us from showing bad tags in the first place. (Precision is much more important than recall for this application.) The final parameter we can adjust is the query_name parameter, which determines the column of textual data to use as the basis of tagging. Let's try tagging again, this time using only the title of the Hacker News post.


In [13]:
m.tag(example_row, query_name="title").print_rows(max_column_width=100)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.250627    | 739us        |
PROGRESS: | Done         |         | 100         | 2.173ms      |
PROGRESS: +--------------+---------+-------------+--------------+
+----------+-----------------------------------------------------+-----------+----------------+
| title_id |                        title                        |    Tag    |     score      |
+----------+-----------------------------------------------------+-----------+----------------+
|    0     | Ask HN: How do you begin to build your new project? | Macintosh | 0.142857142857 |
+----------+-----------------------------------------------------+-----------+----------------+
[1 rows x 4 columns]

The model doesn't have enough informative text to work with when tagging solely based on the title. Let's try tagging again, this time with a larger sample of our data set, indicating our bias for high precision by setting the similarity_threshold parameter to .5, which amounts to applying tags only where the score is .5 or greater.


In [14]:
sample = stories_sf.sample(.1)
m.tag(sample, query_name="text_title", k=2, similarity_threshold=.5)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 99      | 0.000976698 | 151.508ms    |
PROGRESS: | 3572         | 1425327 | 14.0618     | 1.17s        |
PROGRESS: | 6202         | 2474943 | 24.4169     | 2.15s        |
PROGRESS: | 9528         | 3802062 | 37.5098     | 3.15s        |
PROGRESS: | 10915        | 4355185 | 42.9667     | 4.15s        |
PROGRESS: | 13826        | 5516871 | 54.4274     | 5.15s        |
PROGRESS: | 17762        | 7087138 | 69.9191     | 6.16s        |
PROGRESS: | 19294        | 7698375 | 75.9494     | 7.15s        |
PROGRESS: | 23121        | 9225378 | 91.0142     | 8.16s        |
PROGRESS: | Done         |         | 100         | 8.42s        |
PROGRESS: +--------------+---------+-------------+--------------+
Out[14]:
+---------------+-------------------------------------------------------+-------------------+----------------+
| text_title_id |                       text_title                      |        Tag        |     score      |
+---------------+-------------------------------------------------------+-------------------+----------------+
|      652      | Gigafactory http://www.teslamotors.c ...              |    Tesla Motors   | 0.666666666667 |
|      675      | Whatsapp down Seems whatsapp is down now ...          |      WhatsApp     |      0.5       |
|      714      | Thoughts on Silicon Valley What are your ...          |   Silicon Valley  | 0.928571428571 |
|      752      | How many computer programmers are in the ...          | Computer programs | 0.590909090909 |
|      983      | Ask HN: This is what is happening in Silicon ...      |   Silicon Valley  | 0.590909090909 |
|      1335     | Facebook down The webpage is down! ...                |      Facebook     |      1.0       |
|      1417     | Santa Claus VS Justin Bieber - The santa is ...       |   Justin Bieber   | 0.666666666667 |
|      1533     | Beauty channel Just made my first youtube video! ...  |      YouTube      | 0.555555555556 |
|      1558     | E canl tv izle ecanlitv - turkey e canl tv izleme ... |      Websites     |     0.625      |
|      2169     | General Motors Hiring Developers We are loo ...       |   General Motors  | 0.722222222222 |
+---------------+-------------------------------------------------------+-------------------+----------------+
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

Key takeaways

One of the crucial points when creating autotagger models is that the tags assigned by the model are only ever as good as the reference set. You can always increase the similarity threshold at tagging time to increase precision, but if there are garbage tags in your reference set at model creation time, then the same garbage tags will show up at tagging time. So invest as much time as possible up front to ensure that you minimize the noise in your reference data. And if you have resources to collect training data, you'll see a huge improvement in the accuracy of your tagger.
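As a small illustration of that up-front hygiene, the sketch below de-duplicates the reference set and drops very short strings before model creation; the three-character cutoff and the "Tag" column name are assumptions for illustration.

# Hypothetical reference-set cleanup before training.
clean_tags = google_concepts_sf.unique()
clean_tags = clean_tags[clean_tags["Tag"].apply(lambda t: len(t.strip()) >= 3)]

m_clean = gl.autotagger.create(clean_tags, tag_name="Tag", verbose=False)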

We hope to improve our data matching offerings in the coming months. Here are some of the improvements on the horizon:

  1. expose the underlying features and distance function
  2. allow multi-field records in the reference set
  3. provide supervised models with learned distance functions

Email us at support@turi.com and let us know if you have a data matching problem that we can help you solve.