Visualizing StackOverflow tags

StackOverflow is a question-and-answer website for programmers. Each question can have up to 5 descriptive tags for people to be able to find related tags. In this demo, we will look at 50,000 questions with more than 3 tags to try to visualize what StackOverflow is all about. The data came from http://data.stackexchange.com/stackoverflow/query/new with a query like this:

select tags from posts where PostTypeId=1 and tags like '%><%><%'

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.


In [1]:
from pymldb import Connection
mldb = Connection()

The sequence of procedures below is based on the one explained in the Mapping Reddit demo notebook. In pseudo-code:

dataset so_raw = import("http://public.mldb.ai/stackoverflow.csv.gz")
dataset so_tags = tokenize(so_raw)
dataset so_counts = count(transpose(so_tags))
dataset so_svd_embedding = svd_embed(so_tags)
dataset so_kmeans_clusters = kmeans_cluster(so_svd_embedding)
dataset so_tsne_embedding = tsne_embed(so_svd_embedding)

In [2]:
# dataset so_raw = import("http://public.mldb.ai/stackoverflow.csv.gz")
print mldb.put('/v1/procedures/import_so_raw', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/stackoverflow.csv.gz",
        'delimiter':'', 
        'quoteChar':'',
        'outputDataset': 'so_raw',
        'runOnCreation': True
    }
})

# dataset so_tags = tokenize(so_raw)
print mldb.post('/v1/procedures', {
    "id": "so_import",
    "type": "transform",
    "params": {
        "inputData": "select tokenize(lineText, {splitChars: '\"\r<>', quoteChar: ''}) as * from so_raw",
        "outputDataset": "so_tags",
        "runOnCreation": True
    }
})

# dataset so_counts = count(transpose(so_tags))
print mldb.post('/v1/procedures', {
    "id": "so_counts",
    "type": "transform",
    "params": {
        "inputData": "select columnCount() AS numQuestions from transpose(so_tags)",
        "outputDataset": "so_counts",
        "runOnCreation": True
    }
})

# dataset so_svd_embedding = svd_embed(so_tags)
print mldb.post('/v1/procedures', {
    "id": "so_svd",
    "type" : "svd.train",
    "params" : {
        "trainingData" : """
            select 
                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 6000) 
            from so_tags
        """,
        "columnOutputDataset" : "so_svd_embedding",
        "numSingularValues": 100,
        "runOnCreation": True
    }
})

# dataset so_kmeans_clusters = kmeans_cluster(so_svd_embedding)
print mldb.post('/v1/procedures', {
    "id" : "so_kmeans",
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from so_svd_embedding",
        "outputDataset" : "so_kmeans_clusters",
        "numClusters" : 20,
        "runOnCreation": True
    }
})

# dataset so_tsne_embedding = tsne_embed(so_svd_embedding)
print mldb.post('/v1/procedures', {
    "id": "so_tsne",
    "type" : "tsne.train",
    "params" : {
        "trainingData" : "select * from so_svd_embedding",
        "rowOutputDataset" : "so_tsne_embedding",
        "runOnCreation": True
    }
})


<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>

Visualization with Bokeh

In the program above we built three datasets:

  • so_tsne_embedding: contains x and y coordinates per tag
  • so_kmeans_clusters: contains a cluster per tag
  • so_counts: contains numQuestions per tag

Now we query out a pandas dataframe with the merge() of these three datasets, along with a 'grid' coordinate for each tag which we will use to automatically position labels in the visualization below.


In [4]:
df = mldb.query("""
    select c.* as *, m.* as *, quantize(m.x, 7) as grid_x, quantize(m.y, 7) as grid_y 
    named c.rowName()   
    from merge(so_tsne_embedding, so_kmeans_clusters) as m
        join so_counts as c on (c.rowName() = m.rowPathElement(0))
    where m.cluster is not null
    order by c.numQuestions desc
    """)
df.head()


Out[4]:
cluster grid_x grid_y numQuestions x y
_rowName
c# 8 -21 -14 5929 -18.530771 -13.916371
java 8 -56 0 5449 -54.343040 2.447962
javascript 8 21 -42 4250 23.090406 -41.900581
c++ 8 -21 -14 3231 -22.746912 -13.542056
android 2 -70 -28 3013 -69.587387 -29.107584

In [5]:
import numpy as np
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

import bokeh.plotting as bp
from bokeh.models import HoverTool

In [6]:
#this line must be in its own cell 
bp.output_notebook()


Loading BokehJS ...

In [7]:
x = bp.figure(plot_width=900, plot_height=700, title="Map of StackOverflow Tags",
       tools=[HoverTool( tooltips=[ ("tag", "@tags") ] )], toolbar_location=None,
       x_axis_type=None, y_axis_type=None, min_border=1)
x.scatter(
    x = df.x.values, 
    y = df.y.values, 
    color=colormap[df.cluster.astype(int).values],
    alpha=0.6,
    radius=(df.numQuestions.values ** .35)/5,
    source=bp.ColumnDataSource({"tags": df.index.values})
)

labels = df.reset_index().groupby(['grid_x', 'grid_y'], as_index=False).first()
labels = labels[labels["numQuestions"] > 200]
x.text(
    x = labels.x.values, 
    y = labels.y.values,
    text = labels._rowName.values,
    text_align="center", text_baseline="middle",
    text_font_size="8pt", text_font_style="bold",
    text_color="#333333"
)
bp.show(x)


Out[7]:

<Bokeh Notebook handle for In[16]>

Where to next?

Check out the other Tutorials and Demos.


In [ ]: