Visualizing StackOverflow tags

StackOverflow is a question-and-answer website for programmers. Each question can have up to 5 descriptive tags for people to be able to find related tags. In this demo, we will look at 50,000 questions with more than 3 tags to try to visualize what StackOverflow is all about. The data came from http://data.stackexchange.com/stackoverflow/query/new with a query like this:

select tags from posts where PostTypeId=1 and tags like '%><%><%'

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.



In [1]:

    
from pymldb import Connection
mldb = Connection()

The sequence of procedures below is based on the one explained in the Mapping Reddit demo notebook. In pseudo-code:

dataset so_raw = import("http://public.mldb.ai/stackoverflow.csv.gz")
dataset so_tags = tokenize(so_raw)
dataset so_counts = count(transpose(so_tags))
dataset so_svd_embedding = svd_embed(so_tags)
dataset so_kmeans_clusters = kmeans_cluster(so_svd_embedding)
dataset so_tsne_embedding = tsne_embed(so_svd_embedding)



In [2]:

    
# dataset so_raw = import("http://public.mldb.ai/stackoverflow.csv.gz")
print mldb.put('/v1/procedures/import_so_raw', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/stackoverflow.csv.gz",
        'delimiter':'', 
        'quoteChar':'',
        'outputDataset': 'so_raw',
        'runOnCreation': True
    }
})

# dataset so_tags = tokenize(so_raw)
print mldb.post('/v1/procedures', {
    "id": "so_import",
    "type": "transform",
    "params": {
        "inputData": "select tokenize(lineText, {splitChars: '\"\r<>', quoteChar: ''}) as * from so_raw",
        "outputDataset": "so_tags",
        "runOnCreation": True
    }
})

# dataset so_counts = count(transpose(so_tags))
print mldb.post('/v1/procedures', {
    "id": "so_counts",
    "type": "transform",
    "params": {
        "inputData": "select columnCount() AS numQuestions from transpose(so_tags)",
        "outputDataset": "so_counts",
        "runOnCreation": True
    }
})

# dataset so_svd_embedding = svd_embed(so_tags)
print mldb.post('/v1/procedures', {
    "id": "so_svd",
    "type" : "svd.train",
    "params" : {
        "trainingData" : """
            select 
                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 6000) 
            from so_tags
        """,
        "columnOutputDataset" : "so_svd_embedding",
        "numSingularValues": 100,
        "runOnCreation": True
    }
})

# dataset so_kmeans_clusters = kmeans_cluster(so_svd_embedding)
print mldb.post('/v1/procedures', {
    "id" : "so_kmeans",
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from so_svd_embedding",
        "outputDataset" : "so_kmeans_clusters",
        "numClusters" : 20,
        "runOnCreation": True
    }
})

# dataset so_tsne_embedding = tsne_embed(so_svd_embedding)
print mldb.post('/v1/procedures', {
    "id": "so_tsne",
    "type" : "tsne.train",
    "params" : {
        "trainingData" : "select * from so_svd_embedding",
        "rowOutputDataset" : "so_tsne_embedding",
        "runOnCreation": True
    }
})









    



<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>

Visualization with Bokeh

In the program above we built three datasets:

so_tsne_embedding: contains x and y coordinates per tag
so_kmeans_clusters: contains a cluster per tag
so_counts: contains numQuestions per tag

Now we query out a pandas dataframe with the merge() of these three datasets, along with a 'grid' coordinate for each tag which we will use to automatically position labels in the visualization below.



In [4]:

    
df = mldb.query("""
    select c.* as *, m.* as *, quantize(m.x, 7) as grid_x, quantize(m.y, 7) as grid_y 
    named c.rowName()   
    from merge(so_tsne_embedding, so_kmeans_clusters) as m
        join so_counts as c on (c.rowName() = m.rowPathElement(0))
    where m.cluster is not null
    order by c.numQuestions desc
    """)
df.head()









    Out[4]:






  
    
      
      cluster
      grid_x
      grid_y
      numQuestions
      x
      y
    
    
      _rowName
      
      
      
      
      
      
    
  
  
    
      c#
      8
      -21
      -14
      5929
      -18.530771
      -13.916371
    
    
      java
      8
      -56
      0
      5449
      -54.343040
      2.447962
    
    
      javascript
      8
      21
      -42
      4250
      23.090406
      -41.900581
    
    
      c++
      8
      -21
      -14
      3231
      -22.746912
      -13.542056
    
    
      android
      2
      -70
      -28
      3013
      -69.587387
      -29.107584



In [5]:

    
import numpy as np
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

import bokeh.plotting as bp
from bokeh.models import HoverTool



In [6]:

    
#this line must be in its own cell 
bp.output_notebook()









    





    
        
        Loading BokehJS ...



In [7]:

    
x = bp.figure(plot_width=900, plot_height=700, title="Map of StackOverflow Tags",
       tools=[HoverTool( tooltips=[ ("tag", "@tags") ] )], toolbar_location=None,
       x_axis_type=None, y_axis_type=None, min_border=1)
x.scatter(
    x = df.x.values, 
    y = df.y.values, 
    color=colormap[df.cluster.astype(int).values],
    alpha=0.6,
    radius=(df.numQuestions.values ** .35)/5,
    source=bp.ColumnDataSource({"tags": df.index.values})
)

labels = df.reset_index().groupby(['grid_x', 'grid_y'], as_index=False).first()
labels = labels[labels["numQuestions"] > 200]
x.text(
    x = labels.x.values, 
    y = labels.y.values,
    text = labels._rowName.values,
    text_align="center", text_baseline="middle",
    text_font_size="8pt", text_font_style="bold",
    text_color="#333333"
)
bp.show(x)









    






    







    Out[7]:




<Bokeh Notebook handle for In[16]>

Where to next?

Check out the other Tutorials and Demos.



In [ ]:

	cluster	grid_x	grid_y	numQuestions	x	y
_rowName
c#	8	-21	-14	5929	-18.530771	-13.916371
java	8	-56	0	5449	-54.343040	2.447962
javascript	8	21	-42	4250	23.090406	-41.900581
c++	8	-21	-14	3231	-22.746912	-13.542056
android	2	-70	-28	3013	-69.587387	-29.107584