Popular Hacker News stories analyzed with AlchemyAPI

Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator.

Hacker News is community based. Articles are posted and upvoted by users. The most popular articles make it to the top of the site. The algorithm that computes the top ranked articles is largely impacted by time, giving new articles a significant boost. After all, it's a news site.

Content that can be submitted to Hacker News is defined as "anything that gratifies one's intellectual curiosity". If you haven't seen the site yet, check it out. You can find more details about the Hacker News algorithm in this blog post.

If you're already an avid reader of Hacker News, you might have wondered: "Is there a topic commonality between the popular Hacker News stories?" Or, put differently, if you're an author: "What topic do I have to write about to make it to the top of Hacker News?"

This notebook is trying to answer these questions using the Hacker News API and the AlchemyAPI to get ranked concepts for textual data. Here are the steps we'll take:

  1. Query the most popular Hacker News stories
  2. Save the current 500 most popular Hacker News stories
  3. Tag concepts of Hacker News stories using AlchemyAPI

To run this notebook, register for a free AlchemyAPI account. After you receive your API key, paste your API key in the cell below:


In [1]:
api_key = 'PASTE_ALCHEMY_API_KEY_HERE'

Import required Python libraries:


In [2]:
import requests
import os
import pandas as pd
from datetime import datetime

The Hacker News API provides an endpoint topstories that returns the ids of the 500 most highly rated stories at this time.

Define Hacker News API request url strings


In [3]:
hacker_news_api_base_url = 'https://hacker-news.firebaseio.com/v0/'
hacker_news_feature_url_item = 'item/'
hacker_news_feature_url_topstories = 'topstories'
hacker_news_api_parameters = '.json?print=pretty'

Define Hacker News API helper functions


In [4]:
def get_story_for_id(story_id):
    ''' Queries the Hacker News API for story information for the given story_id. '''
    story_request_url = hacker_news_api_base_url + hacker_news_feature_url_item + unicode(story_id) + hacker_news_api_parameters
    story = requests.get(story_request_url).json()
    return story

In [5]:
def get_story_details(story):
    ''' Filter relevant story information from the given Hacker News API story object. '''
    # remove the ids of the story's comments ('kids'), because we don't use them
    if 'kids' in story: del story['kids']
    # encode text field content as ascii (work around IPython defect https://github.com/ipython/ipython/issues/6799)
    if 'title' in story: story['title'] = story['title'].encode('ascii', 'ignore')
    if 'text' in story: story['text'] = story['text'].encode('ascii', 'ignore')
    if 'url' in story: story['url'] = story['url'].encode('ascii', 'ignore')
    return story

In [6]:
def get_all_story_details(story_ids):
    ''' Queries Hacker News API for relevant story information for given list of story_ids. '''
    all_story_details = []
    for story_id in story_ids:
        all_story_details.append(get_story_details(get_story_for_id(story_id)))
    return all_story_details

Query stories

With these helper functions in place, let's query the Hacker News API for the current top 500 stories:


In [7]:
current_top_500_stories_url = hacker_news_api_base_url + hacker_news_feature_url_topstories + hacker_news_api_parameters
current_top_500_stories = requests.get(current_top_500_stories_url).json()

Take a look at what we got to make sure we have a list of story ids:


In [ ]:
current_top_500_stories

The top 500 stories provided by the Hacker News API are a snapshot reflecting the currently most popular stories. To enable an analysis of the most popular stories over time, it is helpful to work with a larger corpus of stories.

Define story persistence helper function

Let's save the story ids on disk for future use with the pandas convenience methods read_pickle and to_pickle, which wrap the Python pickle library:


In [9]:
story_ids_file_name = 'hacker_news_story_ids.pickle'

def update_saved_story_ids(story_ids, story_ids_file_name):
    ''' Read story ids from disk, merge with given story_ids, and save back to disk. '''
    file_story_ids = []
    try:
        file_story_ids = pd.read_pickle(story_ids_file_name)
    except IOError as err:
        # file for story ids does not yet exist, move on
        pass
    merged_story_ids = set(file_story_ids).union(set(story_ids))
    pd.Series(list(merged_story_ids)).to_pickle(story_ids_file_name)
    return merged_story_ids

In [10]:
story_ids_up_until_today = update_saved_story_ids(current_top_500_stories, story_ids_file_name)

Query story details

Query details about all stories (e.g. title, url of the linked article, publish time, score, number of descendants). Let's see for how many stories we're going to query details:


In [11]:
len(story_ids_up_until_today)


Out[11]:
1053

Now, query the details and show a sample of the first five stories (if you happen to hit JSON errors, try running the cell again, as these seem to happen intermittently; a small retry sketch follows the output below):


In [14]:
all_story_details = get_all_story_details(list(story_ids_up_until_today))
# optionally, comment the first line and uncomment the two lines below to work with a subset of stories and reduce subsequent AlchemyAPI requests
# top_10_stories = list(story_ids_up_until_today)[0:10]
# all_story_details = get_all_story_details(top_10_stories)
stories_df = pd.DataFrame.from_dict(all_story_details)
stories_df.head(5)


Out[14]:
by dead deleted descendants id parts score text time title type url
0 mparr4 NaN NaN 0 9367553 NaN 5 1428933057 Negotiating HTTP/2: ALPN and the TLS Handshake story http://matthewparrilla.com/post/negotiation-ht...
1 ivank NaN NaN 29 9363458 NaN 102 1428849266 The Essence of Peopling story http://www.ribbonfarm.com/2015/04/08/the-essen...
2 tokai NaN NaN 0 9367556 NaN 10 1428933126 The Word-Space Model (2006) [pdf] story https://www.sics.se/~mange/TheWordSpaceModel.pdf
3 spikels NaN NaN 6 9377110 NaN 39 1429043471 SpaceX Rocket lands on droneship, but too hard... story https://mobile.twitter.com/elonmusk/status/588...
4 gits1225 NaN NaN 6 9367564 NaN 19 1428933203 Programming languages shape the way their user... story http://www.technologyreview.com/review/536356/...
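
The intermittent JSON errors typically mean a response could not be parsed. If they become a nuisance, one option is a small retry wrapper around the existing get_story_for_id helper. This is only a sketch, not part of the original notebook; the helper name, retry count, and sleep interval are our own assumptions:

In [ ]:
import time

def get_story_for_id_with_retry(story_id, max_attempts=3):
    ''' Like get_story_for_id, but retries a few times if the response cannot be parsed as JSON. '''
    for attempt in range(max_attempts):
        try:
            return get_story_for_id(story_id)
        except ValueError:
            # the response body was not valid JSON; wait briefly and try again
            time.sleep(1)
    return None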

Tag concepts of Hacker News stories using AlchemyAPI

One of the features provided by AlchemyAPI is Concept Tagging. It allows extracting concepts from web-based content available at a given URL. We're going to apply concept tagging to the URLs from the Hacker News stories.

Define AlchemyAPI request URL strings


In [15]:
alchemy_api_base_url = 'http://access.alchemyapi.com/calls/url/'
alchemy_api_parameters = '?apikey=' + api_key + '&outputMode=json&url='
alchemy_feature_url_concepts = 'URLGetRankedConcepts'

Define AlchemyAPI helper functions

Implement a function to query the AlchemyAPI for concepts for a given url:


In [16]:
def get_concepts_for_url(story_url, story_urls_and_concepts):
    ''' Query AlchemyAPI concept tagging for given url and add result to given story_urls_and_concepts dictionary. '''
    if story_url in story_urls_and_concepts:
        # reuse concepts previously retrieved for this story url
        concepts = story_urls_and_concepts.get(story_url)
    else:
        # no concepts available yet for this story url, query AlchemyAPI and save the result for future use
        request_url = alchemy_api_base_url + alchemy_feature_url_concepts + alchemy_api_parameters + story_url
        concepts = requests.get(request_url).json()
        story_urls_and_concepts[story_url] = concepts
    return concepts

Let's test the function by running it against a test_url pointing to an article from cnn.com:


In [17]:
story_urls_and_concepts = {}
test_url = 'http://www.cnn.com/2009/CRIME/01/13/missing.pilot/index.html'
get_concepts_for_url(test_url, story_urls_and_concepts)


Out[17]:
{u'concepts': [{u'dbpedia': u'http://dbpedia.org/resource/Marshal',
   u'freebase': u'http://rdf.freebase.com/ns/m.01mz37',
   u'relevance': u'0.979535',
   u'text': u'Marshal'},
  {u'dbpedia': u'http://dbpedia.org/resource/United_States_Marshals_Service',
   u'freebase': u'http://rdf.freebase.com/ns/m.0p6f_',
   u'relevance': u'0.904977',
   u'text': u'United States Marshals Service',
   u'website': u'http://www.usdoj.gov/marshals',
   u'yago': u'http://yago-knowledge.org/resource/United_States_Marshals_Service'},
  {u'dbpedia': u'http://dbpedia.org/resource/Suicide',
   u'freebase': u'http://rdf.freebase.com/ns/m.06z5s',
   u'opencyc': u'http://sw.opencyc.org/concept/Mx4rwQrsYZwpEbGdrcN5Y29ycA',
   u'relevance': u'0.902347',
   u'text': u'Suicide'},
  {u'dbpedia': u'http://dbpedia.org/resource/Sheriff',
   u'freebase': u'http://rdf.freebase.com/ns/m.0mb31',
   u'opencyc': u'http://sw.opencyc.org/concept/Mx4rRWTpvi9AQdicUZy0P3lKVQ',
   u'relevance': u'0.784423',
   u'text': u'Sheriff'},
  {u'dbpedia': u'http://dbpedia.org/resource/Constable',
   u'freebase': u'http://rdf.freebase.com/ns/m.01434f',
   u'relevance': u'0.76973',
   u'text': u'Constable'},
  {u'ciaFactbook': u'http://www4.wiwiss.fu-berlin.de/factbook/resource/United_States',
   u'dbpedia': u'http://dbpedia.org/resource/United_States',
   u'freebase': u'http://rdf.freebase.com/ns/m.09c7w0',
   u'opencyc': u'http://sw.opencyc.org/concept/Mx4rvVikKpwpEbGdrcN5Y29ycA',
   u'relevance': u'0.748903',
   u'text': u'United States',
   u'website': u'http://www.usa.gov/',
   u'yago': u'http://yago-knowledge.org/resource/United_States'},
  {u'dbpedia': u'http://dbpedia.org/resource/Federal_Bureau_of_Investigation',
   u'freebase': u'http://rdf.freebase.com/ns/m.02_1m',
   u'geo': u'38.894465 -77.024503',
   u'opencyc': u'http://sw.opencyc.org/concept/Mx4rvWJE0JwpEbGdrcN5Y29ycA',
   u'relevance': u'0.589856',
   u'text': u'Federal Bureau of Investigation',
   u'website': u'http://www.fbi.gov',
   u'yago': u'http://yago-knowledge.org/resource/Federal_Bureau_of_Investigation'},
  {u'dbpedia': u'http://dbpedia.org/resource/The_Fugitive_(1993_film)',
   u'freebase': u'http://rdf.freebase.com/ns/m.05znxx',
   u'relevance': u'0.586845',
   u'text': u'The Fugitive',
   u'yago': u'http://yago-knowledge.org/resource/The_Fugitive_(1993_film)'}],
 u'language': u'english',
 u'status': u'OK',
 u'url': u'http://www.cnn.com/2009/CRIME/01/13/missing.pilot/index.html',
 u'usage': u'By accessing AlchemyAPI or using information generated by AlchemyAPI, you are agreeing to be bound by the AlchemyAPI Terms of Use: http://www.alchemyapi.com/company/terms.html'}

You should see a JSON document containing a list of concepts extracted from the website at the given url. Each concept is identified by its text and is assigned a relevance score that measures how confident AlchemyAPI is that the website is about this concept. Based on the identified concept, the JSON also contains links to publicly available knowledge bases such as DBpedia and YAGO. Feel free to test AlchemyAPI concept tagging on articles that you're interested in by replacing the test_url.
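
If you only care about the concept names and their relevance, a short loop over the result is enough. This is a quick sketch using only fields visible in the output above; calling get_concepts_for_url again for the same url simply returns the cached result:

In [ ]:
result = get_concepts_for_url(test_url, story_urls_and_concepts)
if result.get('status') == 'OK':
    for concept in result.get('concepts', []):
        # print only the concept text and its relevance score
        print concept.get('text'), concept.get('relevance')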

The free tier of AlchemyAPI allows 1,000 queries per day. To reduce the number of AlchemyAPI requests, we create a dictionary story_urls_and_concepts that maps each story url to its list of detected concepts:


In [26]:
story_urls_and_concepts_file_name = 'story_urls_and_concepts.pickle'

try:
    story_urls_and_concepts = pd.read_pickle(story_urls_and_concepts_file_name)
except IOError as err:
    # file for story urls and concepts does not yet exist, move on
    story_urls_and_concepts = {}
    pass

Now that we can extract concepts for the article at a given url, we need to extract the concepts for a Hacker News story_id. For each identified concept, we need to keep track of how often and from which stories it was extracted, and aggregate popularity measures like score and number of descendants across those stories.

The resulting data structure is a dictionary of dictionaries containing the following information:

{
  'Programming language': {
    'occurs': 11,           # the concept 'Programming language' occurs in 11 of our stories
    'score': 543,           # aggregated score of all stories containing 'Programming language'
    'ids': [123, 456],      # story ids of all stories containing 'Programming language'
    'descendants': 94,      # aggregated number of descendants of all stories containing 'Programming language'
    'links': ['www.cnn.com/programming_language', ...]  # links to all stories about 'Programming language'
  }
}

The following function aggregates all this information about all concepts extracted from all stories:


In [27]:
def get_concepts_for_id(story_id, all_concepts_dicts, story_urls_and_concepts):
    ''' Extracts concepts for given story_id and aggregates story popularity information. '''
    print "Querying concepts for story " + unicode(story_id) + "..."
    request_url = hacker_news_api_base_url + hacker_news_feature_url_item + unicode(story_id) + hacker_news_api_parameters
    print(request_url)
    story = requests.get(request_url).json()
    # ignore "Ask HN" and job posts, only consider actual stories
    if story.get('type') == 'story':
        # make sure story has url that links to article
        if story.get('url') is not None:
            # extract concepts using AlchemyAPI
            concept_result = get_concepts_for_url(story.get('url'), story_urls_and_concepts)
            if concept_result['status'] == 'OK':
                concepts = concept_result.get('concepts')
                for concept in concepts:
                    concept_dict = {}
                    concept_text = concept.get('text')
                    # ignore concepts with low relevance
                    if (float(concept.get('relevance')) > 0.6):
                        concept_dict['occurs'] = 1
                        concept_dict['relevance'] = concept.get('relevance')
                        concept_dict['ids'] = [story_id]
                        concept_dict['score'] = story.get('score')
                        concept_dict['descendants'] = story.get('descendants')
                        concept_dict['links'] = [story.get('url')]
                        if concept_text in all_concepts_dicts:
                            # merge additional concept info with already existing concept info
                            # add up the scores and number of descendants by concept
                            already_existing_concept = all_concepts_dicts.get(concept_text)
                            already_existing_concept['occurs'] = already_existing_concept['occurs'] + 1
                            already_existing_concept['score'] = already_existing_concept['score'] + story.get('score')
                            already_existing_concept['descendants'] = already_existing_concept['descendants'] + story.get('descendants')
                            already_existing_concept['links'] = already_existing_concept['links'] + concept_dict['links']
                            already_existing_concept['ids'] = already_existing_concept['ids'] + concept_dict['ids']
                        else:
                            all_concepts_dicts[concept_text] = concept_dict
    return all_concepts_dicts

Let's test the get_concepts_for_id helper function by providing it a valid Hacker News story id:


In [28]:
all_concepts_dicts = {}
test_story_id = 9226497
all_concepts_dicts = get_concepts_for_id(test_story_id, all_concepts_dicts, story_urls_and_concepts)
print all_concepts_dicts


Querying concepts for story 9226497...
https://hacker-news.firebaseio.com/v0/item/9226497.json?print=pretty
{u'Split Airport': {'links': [u'http://lit.vulf.de/spotify-so-little/'], 'descendants': 295, 'ids': [9226497], 'score': 613, 'relevance': u'0.896593', 'occurs': 1}, u'Pool': {'links': [u'http://lit.vulf.de/spotify-so-little/'], 'descendants': 295, 'ids': [9226497], 'score': 613, 'relevance': u'0.941916', 'occurs': 1}}

You should see a dictionary of concepts with links to stories, score and descendant information. Feel free to enter different test_story_ids.

Now that we know our get_concepts_for_id function works, let's query and aggregate the concepts for all Hacker News stories:


In [29]:
len(story_urls_and_concepts)


Out[29]:
1708

In [30]:
len(stories_df)


Out[30]:
1053

In [ ]:
story_counter = 1
for story_id in story_ids_up_until_today:
# optionally, comment the line above and uncomment the line below to limit requests to 10 stories
# for story_id in top_10_stories:
    all_concepts_dicts = get_concepts_for_id(story_id, all_concepts_dicts, story_urls_and_concepts)
    print 'Done. ' + unicode(story_counter) + ' stories queried.'
    story_counter = story_counter + 1

Save the story_urls_and_concepts dictionary to disk for future use. The dictionary was created and updated while iterating through all story_ids and is valuable at this point, because it contains the concepts returned by AlchemyAPI, which only supports a limited number of requests per day. Without saving story_urls_and_concepts to disk, we would hit the 1,000 requests per day limit after just a few days of collecting story ids.


In [33]:
import pickle

with open(story_urls_and_concepts_file_name, 'wb') as story_urls_and_concepts_file:
    pickle.dump(story_urls_and_concepts, story_urls_and_concepts_file)

Evaluate the result

Create a DataFrame for the extracted concepts for tabular presentation and sort the concepts by score, showing the most popular topics at the top.


In [34]:
all_concepts_df = pd.DataFrame.from_dict(all_concepts_dicts, orient='index')
all_concepts_sorted_by_score_df = all_concepts_df.sort(columns='score', ascending=False)
all_concepts_sorted_by_score_df


Out[34]:
links descendants ids score relevance occurs
Apple Inc. [http://roadlesstravelled.me/2015/04/06/why-st... 1186 [9342994, 9367868, 9360732, 9361163, 9378434, ... 2551 0.93003 16
Operating system [https://www.kickstarter.com/projects/13814379... 808 [9375914, 9359722, 9363857, 9347669, 9380621, ... 2333 0.987744 25
Computer program [https://atmospherejs.com/chipcastledotcom/jsp... 580 [9367859, 9367913, 9367618, 9380338, 9380339, ... 2159 0.780098 22
Java [https://github.com/jackm321/RustNN, https://a... 588 [9277680, 9367859, 9380190, 9368075, 9368137, ... 2097 0.819112 18
Internet [http://www.chinafile.com/conversation/new-chi... 807 [9367601, 9363565, 9359799, 9372431, 9360421, ... 1944 0.94787 32
Programming language [http://www.technologyreview.com/review/536356... 585 [9367564, 9261073, 9363496, 9380089, 9380165, ... 1902 0.987098 49
Steve Jobs [http://roadlesstravelled.me/2015/04/06/why-st... 833 [9342994, 9363714, 9360732, 9366382, 9362328, ... 1871 0.758757 7
Google [http://www.rawstory.com/rs/2015/04/5-worst-th... 928 [9363714, 9196008, 9380421, 9368418, 6312903, ... 1618 0.882041 26
E-mail [http://matthewparrilla.com/post/negotiation-h... 402 [9367553, 9381017, 9364783, 9374335, 9375525, ... 1583 0.628929 10
Threads [http://blog.rust-lang.org/2015/04/10/Fearless... 466 [9355382, 9271246] 1573 0.774183 2
JavaScript [https://github.com/mozumder/HTML6, https://gi... 516 [9368014, 9380339, 9365193, 9373618, 9271246, ... 1469 0.969272 11
Open source [http://www.openwall.com/lists/oss-security/20... 335 [9375884, 9363857, 9351769, 9380621, 9376896, ... 1458 0.954089 21
C [http://probablyfine.co.uk/2015/04/11/announci... 533 [9375767, 9361781, 9277680, 9380089, 9367859, ... 1428 0.66933 31
English-language films [http://arxiv.org/abs/1504.02179, http://sethg... 583 [9367580, 9372086, 9368426, 9364426, 9372918, ... 1398 0.742263 17
Mac OS X [http://sourceforge.net/projects/elementaryos/... 481 [9359722, 9363857, 9347669, 9360803, 9365563, ... 1293 0.673395 10
Google Chrome [https://github.com/mozumder/HTML6, http://blo... 332 [9368014, 9196008, 9367051, 9360588, 9377681, ... 1288 0.862783 14
IP address [http://morris.guru/huthos-the-totally-100-leg... 544 [9363565, 9375886, 9372918, 9374028, 9353785] 1282 0.973105 5
IPhone [http://roadlesstravelled.me/2015/04/06/why-st... 564 [9342994, 9376513, 9368859, 9361163, 9325611, ... 1270 0.610224 6
Abuse [http://roadlesstravelled.me/2015/04/06/why-st... 543 [9342994] 1212 0.786606 1
Cupertino, California [http://roadlesstravelled.me/2015/04/06/why-st... 543 [9342994] 1212 0.61414 1
Computer [https://www.kickstarter.com/projects/13814379... 501 [9375914, 9372066, 9368255, 9368294, 9368443, ... 1190 0.726815 20
File system [http://getgrav.org/, http://jjacky.com/anopa/... 439 [8175797, 9359598, 9351542, 9368353, 9368475, ... 1181 0.928131 12
Free software [https://github.com/blog/1986-announcing-git-l... 345 [9343021, 9380338, 9380621, 9376896, 9373671, ... 1161 0.606101 13
Linux [http://strong-pm.io/, http://www.phoronix.com... 245 [9380338, 9360321, 9372845, 9377066, 9360803, ... 1153 0.62069 21
Unix [http://www.phoronix.com/scan.php?page=news_it... 300 [9360321, 9364658, 9365317, 9354246, 9350206, ... 1055 0.61581 7
Computing platform [https://github.com/facebook/react-native] 287 [9271246] 1039 0.60849 1
Cryptography [http://www.washingtonpost.com/world/national%... 597 [9360553, 9377080, 9365026, 9365054, 9365302, ... 1036 0.733362 13
World Wide Web [https://www.dareboost.com, https://medium.com... 509 [9375815, 9376360, 9380468, 9360421, 9381305, ... 1022 0.9044 16
Python [http://it-ebooks.info/, http://squirrel-lang.... 403 [9377853, 9368075, 9372303, 9368475, 9360421, ... 1008 0.648592 17
2004 albums [http://codepen.io/jakealbaugh/full/PwLXXP/] 77 [9317159] 962 0.834497 1
... ... ... ... ... ... ...
Economic democracy [https://www.jacobinmag.com/2015/04/uber-explo... 0 [9362077] 3 0.781777 1
H. G. Wells [http://blog.macsales.com/29795-owc-tear-down-... 1 [9377621] 3 0.714366 1
Personal health record [http://blogs.wsj.com/venturecapital/2015/04/1... 0 [9368859] 3 0.911095 1
United States presidential election, 2008 [http://www.msn.com/en-us/news/politics/obama-... 2 [9371314] 3 0.740318 1
Dustin Moskovitz [https://www.facebook.com/zuck/posts/101020281... 2 [9377842] 3 0.824491 1
Business terms [http://uk.businessinsider.com/spotify-raising... 1 [9366515] 3 0.649363 1
Laborer [https://www.jacobinmag.com/2015/04/uber-explo... 0 [9362077] 3 0.855691 1
Circuit [http://gocircuit.github.io/circuit/] 0 [9268796] 3 0.624 1
Depression [http://www.feld.com/archives/2015/04/bringing... 0 [9378446] 3 0.736492 1
Solar cell [http://www.bloomberg.com/news/articles/2015-0... 0 [9381209] 3 0.886276 1
Time travel [http://blog.macsales.com/29795-owc-tear-down-... 1 [9377621] 3 0.751752 1
Product management [https://medium.com/galleys/the-minimum-viable... 1 [9367414] 3 0.710645 1
Mutualism [https://www.jacobinmag.com/2015/04/uber-explo... 0 [9362077] 3 0.636801 1
Television [https://medium.com/@ev/sometimes-things-stay-... 0 [9370533] 3 0.72558 1
Grammatical person [https://motivatedgrammar.wordpress.com/2009/0... 0 [9372405] 3 0.824975 1
JQuery [http://noeticforce.com/best-Javascript-framew... 0 [9362861] 3 0.632081 1
Health informatics [http://blogs.wsj.com/venturecapital/2015/04/1... 0 [9368859] 3 0.80838 1
Ignaz Pleyel [https://www.kickstarter.com/projects/opengold... 0 [9378635] 3 0.621621 1
Direct memory access [https://www.parallella.org/2015/04/05/paralle... 0 [9378023] 3 0.989697 1
Special Activities Division [http://www.nytimes.com/2015/04/14/world/middl... 0 [9370503] 3 0.679562 1
Technological singularity [http://conversableeconomist.blogspot.com/2015... 1 [9374784] 3 0.642579 1
Present [https://medium.com/@ev/sometimes-things-stay-... 0 [9370533] 3 0.692987 1
Coaxial cable [https://medium.com/@ev/sometimes-things-stay-... 0 [9370533] 3 0.663549 1
Libertarian socialism [https://www.jacobinmag.com/2015/04/uber-explo... 0 [9362077] 3 0.644318 1
Closed-circuit television [http://www.theverge.com/2015/4/7/8355123/flir... 0 [9362296] 3 0.86142 1
Serfdom [https://www.jacobinmag.com/2015/04/uber-explo... 0 [9362077] 3 0.860459 1
Bo Diddley [http://www.theverge.com/2015/4/7/8355123/flir... 0 [9362296] 3 0.727634 1
Visual effects [http://motherboard.vice.com/read/hollywoods-p... 0 [9374114] 3 0.781674 1
Bo Jackson [http://www.theverge.com/2015/4/7/8355123/flir... 0 [9362296] 3 0.676713 1
Instability [https://medium.com/@alan.imgur/startup-growth... 0 [9374265] 3 0.709443 1

2344 rows × 6 columns
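
Score is only one way to rank the concepts. As a small optional variation (not part of the original analysis), sorting by the occurs column instead shows which concepts are tagged most frequently across stories:

In [ ]:
# sort concepts by how many stories they were extracted from
all_concepts_df.sort(columns='occurs', ascending=False).head(10)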

Most discussed stories by number of comments

Let's see which topics are most discussed and result in the highest number of comments (story descendants):


In [35]:
all_concepts_sorted_by_descendants_df = all_concepts_df.sort(columns='descendants', ascending=False)
all_concepts_sorted_by_descendants_df


Out[35]:
links descendants ids score relevance occurs
Apple Inc. [http://roadlesstravelled.me/2015/04/06/why-st... 1186 [9342994, 9367868, 9360732, 9361163, 9378434, ... 2551 0.93003 16
Google [http://www.rawstory.com/rs/2015/04/5-worst-th... 928 [9363714, 9196008, 9380421, 9368418, 6312903, ... 1618 0.882041 26
Steve Jobs [http://roadlesstravelled.me/2015/04/06/why-st... 833 [9342994, 9363714, 9360732, 9366382, 9362328, ... 1871 0.758757 7
Operating system [https://www.kickstarter.com/projects/13814379... 808 [9375914, 9359722, 9363857, 9347669, 9380621, ... 2333 0.987744 25
Internet [http://www.chinafile.com/conversation/new-chi... 807 [9367601, 9363565, 9359799, 9372431, 9360421, ... 1944 0.94787 32
Cryptography [http://www.washingtonpost.com/world/national%... 597 [9360553, 9377080, 9365026, 9365054, 9365302, ... 1036 0.733362 13
Java [https://github.com/jackm321/RustNN, https://a... 588 [9277680, 9367859, 9380190, 9368075, 9368137, ... 2097 0.819112 18
Programming language [http://www.technologyreview.com/review/536356... 585 [9367564, 9261073, 9363496, 9380089, 9380165, ... 1902 0.987098 49
English-language films [http://arxiv.org/abs/1504.02179, http://sethg... 583 [9367580, 9372086, 9368426, 9364426, 9372918, ... 1398 0.742263 17
Computer program [https://atmospherejs.com/chipcastledotcom/jsp... 580 [9367859, 9367913, 9367618, 9380338, 9380339, ... 2159 0.780098 22
IPhone [http://roadlesstravelled.me/2015/04/06/why-st... 564 [9342994, 9376513, 9368859, 9361163, 9325611, ... 1270 0.610224 6
IP address [http://morris.guru/huthos-the-totally-100-leg... 544 [9363565, 9375886, 9372918, 9374028, 9353785] 1282 0.973105 5
Cupertino, California [http://roadlesstravelled.me/2015/04/06/why-st... 543 [9342994] 1212 0.61414 1
Abuse [http://roadlesstravelled.me/2015/04/06/why-st... 543 [9342994] 1212 0.786606 1
C [http://probablyfine.co.uk/2015/04/11/announci... 533 [9375767, 9361781, 9277680, 9380089, 9367859, ... 1428 0.66933 31
JavaScript [https://github.com/mozumder/HTML6, https://gi... 516 [9368014, 9380339, 9365193, 9373618, 9271246, ... 1469 0.969272 11
World Wide Web [https://www.dareboost.com, https://medium.com... 509 [9375815, 9376360, 9380468, 9360421, 9381305, ... 1022 0.9044 16
Computer [https://www.kickstarter.com/projects/13814379... 501 [9375914, 9372066, 9368255, 9368294, 9368443, ... 1190 0.726815 20
Mac OS X [http://sourceforge.net/projects/elementaryos/... 481 [9359722, 9363857, 9347669, 9360803, 9365563, ... 1293 0.673395 10
Threads [http://blog.rust-lang.org/2015/04/10/Fearless... 466 [9355382, 9271246] 1573 0.774183 2
Want [http://blog.rokkincat.com/three-types-of-harm... 461 [9375810, 9375842, 9367687, 9364389, 9369038, ... 873 0.746904 20
File system [http://getgrav.org/, http://jjacky.com/anopa/... 439 [8175797, 9359598, 9351542, 9368353, 9368475, ... 1181 0.928131 12
App Store [https://blog.branch.io/in-deeplinking-context... 420 [9359791, 9380558, 9372708, 9376895, 9368859, ... 846 0.743945 10
Computer programming [https://www.youtube.com/watch?v=3CwJ0MH-4MA, ... 417 [9380089, 9315503, 9369547, 9361580, 9365980, ... 620 0.630686 6
Telecommuting [https://hbr.org/2014/01/to-raise-productivity... 411 [9367123, 9346726] 844 0.982637 2
Amazon.com [http://citizapp.com, http://www.cgpgrey.com/b... 404 [9360562, 9365198, 9365710, 9349501] 802 0.727798 4
Python [http://it-ebooks.info/, http://squirrel-lang.... 403 [9377853, 9368075, 9372303, 9368475, 9360421, ... 1008 0.648592 17
E-mail [http://matthewparrilla.com/post/negotiation-h... 402 [9367553, 9381017, 9364783, 9374335, 9375525, ... 1583 0.628929 10
Time zone [http://infiniteundo.com/post/25326999628/fals... 401 [4128208, 9346726] 670 0.64268 2
Time [https://mobile.twitter.com/elonmusk/status/58... 392 [9377110, 9372092, 9360732, 9361495, 9366122, ... 785 0.601115 14
... ... ... ... ... ... ...
Inline expansion [http://www.ocamlpro.com/blog/2015/04/13/ocp-m... 0 [9367913] 3 0.690379 1
Quantum mechanics [http://ham.so/2015/04/14/simulating-a-spring-... 0 [9378504] 8 0.943504 1
Decision tree [http://brandonharris.io/kaggle-bike-sharing/] 0 [9363678] 4 0.907974 1
Decision tree learning [http://brandonharris.io/kaggle-bike-sharing/] 0 [9363678] 4 0.924328 1
Deep linking [https://blog.branch.io/in-deeplinking-context... 0 [9359791] 3 0.689424 1
Default [https://scotch.io/tutorials/a-visual-guide-to... 0 [9375696] 5 0.87576 1
Pima County, Arizona [http://www.theguardian.com/us-news/2015/apr/1... 0 [9380030] 6 0.963105 1
Pinaceae [http://thewalrus.ca/wood-is-the-new-steel/] 0 [9371750] 21 0.761346 1
Pinophyta [http://thewalrus.ca/wood-is-the-new-steel/] 0 [9371750] 21 0.968128 1
Direct memory access [https://www.parallella.org/2015/04/05/paralle... 0 [9378023] 3 0.989697 1
Placebo [http://www.bidmc.org/News/In-Research/2015/Ap... 0 [9374527] 8 0.666572 1
Plan 9 from Bell Labs [http://spinroot.com/pico/pjw.html] 0 [9372050] 7 0.764789 1
Dimension [https://cs.stanford.edu/people/karpathy/tsnejs/] 0 [9368255] 10 0.892484 1
Playing card [https://www.quantamagazine.org/20150414-for-p... 0 [9375779] 6 0.981832 1
Differential geometry [https://www.quantamagazine.org/20150408-a-gra... 0 [9367819] 7 0.89041 1
Difference [https://cs.stanford.edu/people/karpathy/tsnejs/] 0 [9368255] 10 0.828773 1
Developing country [https://www.kickstarter.com/projects/13814379... 0 [9375914] 15 0.718201 1
Design pattern [http://codeconnect.io/] 0 [8628334] 4 0.677821 1
Depression [http://www.feld.com/archives/2015/04/bringing... 0 [9378446] 3 0.736492 1
Depend [http://blog.guillermowinkler.com/blog/2015/04... 0 [9365640] 6 0.72435 1
Density [http://datagenetics.com/blog/april32015/index... 0 [9380492] 6 0.707671 1
Pregnancy [http://www.theatlantic.com/health/archive/201... 0 [9367312] 10 0.952931 1
Pregnancy test [http://www.theatlantic.com/health/archive/201... 0 [9367312] 10 0.831481 1
Present [https://medium.com/@ev/sometimes-things-stay-... 0 [9370533] 3 0.692987 1
President of the European Commission [http://www.cnbc.com/id/102573773] 0 [9377030] 9 0.775637 1
Democratic Republic of the Congo [http://spectrum.ieee.org/automaton/robotics/h... 0 [9369752] 8 0.79672 1
Default judgment [https://scotch.io/tutorials/a-visual-guide-to... 0 [9375696] 5 0.8364 1
Lightning [http://www.blitzortung.org/Webpages/index.php... 0 [9370346] 6 0.848056 1
Chatty Cathy [http://www.hopesandfears.com/hopes/now/experi... NaN [9380554] 24 0.833185 1
Feel Good Inc. [http://www.hopesandfears.com/hopes/now/experi... NaN [9380554] 24 0.716204 1

2344 rows × 6 columns

You can be the judge of whether these are the topics you would have expected to be on top. As you run this notebook over time, more stories will become available. You can also run this Hacker News and AlchemyAPI.ipynb notebook on a recurring daily schedule by running the notebook Hacker News Runner.ipynb. This will aggregate data over time and allow for more detailed analysis.

In this notebook we showed how to use the Hacker News API and the AlchemyAPI. We used the AlchemyAPI concept tagging feature to extract topics from Hacker News stories. Finally, we aggregated popularity information about the stories for each concept and presented it in tabular form, sorted by different popularity measures.

The invocation of the AlchemyAPI was rather simple. A lot of the code that writes intermediate results to disk exists to work around the 1,000 requests per day limitation.

Concept detection is only one feature of the AlchemyAPI. Check out more features in the API documentation.

This notebook was created using IBM Knowledge Anyhow Workbench. To learn more, visit us at https://knowledgeanyhow.org.