Reddit is a discussion board that bills itself as the "Front Page of the Internet". It is divided into a large number of topic-specific "subreddits". In this demo, we'll use data about which subreddits a set of active Reddit users contributed to in order to build a visual map of subreddits. The data comes from the paper Navigating the massive world of reddit.
pymldb
In this demo, we will use pymldb to interact with the REST API: see the Using pymldb Tutorial for more details.
In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")
In [2]:
mldb.put('/v1/procedures/import_reddit', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/reddit.csv.gz",
        # blank delimiter and quoteChar: each line is imported whole,
        # into a single column called lineText
        "delimiter": "",
        "quoteChar": "",
        "outputDataset": "reddit_raw",
        "runOnCreation": True
    }
})
Out[2]:
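As an optional aside, you can confirm that the import succeeded by asking the REST API for the dataset's metadata with pymldb's get method; this check is not part of the demo itself.
In [ ]:
# optional sanity check: fetch the dataset's configuration and status
mldb.get('/v1/datasets/reddit_raw')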
And here is what our raw dataset looks like. The lineText column will need to be parsed: it's comma-delimited, with the first token being a user ID and the remaining tokens being the set of subreddits that user contributed to.
In [3]:
mldb.query("select * from reddit_raw limit 5")
Out[3]:
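To make the tokenize call in the next cell concrete, here is a rough plain-Python sketch of the parsing it performs; the sample line is invented for illustration.
In [ ]:
# hypothetical sample line: a user ID followed by that user's subreddits
line = "12345,funny,pics,gaming"
tokens = line.split(",")[1:]             # offset: 1 skips the first token (the user ID)
{subreddit: 1 for subreddit in tokens}   # value: 1 puts a 1 in each subreddit's column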
In [4]:
mldb.put('/v1/procedures/reddit_import', {
    "type": "transform",
    "params": {
        # tokenize the comma-delimited line: offset: 1 skips the user ID,
        # value: 1 puts a 1 in a column named after each subreddit
        "inputData": "select tokenize(lineText, {offset: 1, value: 1}) as * from reddit_raw",
        "outputDataset": "reddit_dataset",
        "runOnCreation": True
    }
})
Out[4]:
Here is the resulting dataset: it's a sparse matrix with a row per user and a column per subreddit, where a cell is 1 if the row's user was a contributor to the column's subreddit, and null otherwise.
In [5]:
mldb.query("select * from reddit_dataset limit 5")
Out[5]:
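As an optional sanity check on the shape of this matrix, we can count the users (rows) directly, and count the subreddits by counting the rows of the transposed dataset:
In [ ]:
# optional: how many users (rows) does the matrix have?
mldb.query("select count(*) as numUsers from reddit_dataset")
In [ ]:
# optional: how many subreddits (columns)? count rows of the transpose
mldb.query("select count(*) as numSubreddits from transpose(reddit_dataset)")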
In [6]:
mldb.put('/v1/procedures/reddit_svd', {
    "type" : "svd.train",
    "params" : {
        # keep only the 4000 columns (subreddits) with the most users,
        # i.e. the highest count of non-null cells
        "trainingData" : """
            SELECT
                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 4000)
            FROM reddit_dataset
        """,
        "columnOutputDataset" : "reddit_svd_embedding",
        "runOnCreation": True
    }
})
Out[6]:
The result of this operation is a new dataset with a row per subreddit for the 4000 most-active subreddits and columns representing coordinates for that subreddit in a 100-dimensional space.
Note: the row names are the subreddit names followed by ".numberEquals.1" because the SVD training procedure interpreted the input matrix as categorical rather than numerical.
In [8]:
mldb.query("select * from reddit_svd_embedding limit 5")
Out[8]:
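For example, you can pull out a single subreddit's coordinates by matching on the first element of the row path, much as the join further below does; 'funny' is simply an assumed example of a popular subreddit, and we assume rowPathElement behaves in a where clause as it does in that join.
In [ ]:
# one subreddit's 100-dimensional coordinates ('funny' is an assumed example)
mldb.query("""
    select * from reddit_svd_embedding
    where rowPathElement(0) = 'funny'
""")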
We will create and run a Procedure of type kmeans.train.
In [9]:
mldb.put('/v1/procedures/reddit_kmeans', {
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "outputDataset" : "reddit_kmeans_clusters",
        "numClusters" : 20,
        "runOnCreation": True
    }
})
Out[9]:
The result of this operation is a simple dataset that assigns each row in the input (i.e. each subreddit) to one of 20 clusters.
In [10]:
mldb.query("select * from reddit_kmeans_clusters limit 5")
Out[10]:
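As an optional aside, a group-by shows how balanced the 20 clusters came out:
In [ ]:
# optional: number of subreddits assigned to each cluster
mldb.query("""
    select cluster, count(*) as numSubreddits
    from reddit_kmeans_clusters
    group by cluster
    order by numSubreddits desc
""")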
We will create and run a Procedure of type tsne.train.
In [11]:
mldb.put('/v1/procedures/reddit_tsne', {
    "type" : "tsne.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "rowOutputDataset" : "reddit_tsne_embedding",
        "runOnCreation": True
    }
})
Out[11]:
The result is similar to the SVD step above: we get a row per subreddit and the columns are coordinates, but this time in a 2-dimensional space appropriate for visualization.
In [12]:
mldb.query("select * from reddit_tsne_embedding limit 5")
Out[12]:
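As a quick optional check that the embedding really is two-dimensional, and to see the range of the x and y coordinates we will plot below, we can pull the coordinates into pandas:
In [ ]:
# optional: fetch the 2-d coordinates and summarize their ranges
coords = mldb.query("select x, y from reddit_tsne_embedding")
coords.describe()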
Next we count, for each subreddit, how many users contributed to it. Transposing reddit_dataset gives a row per subreddit and a column per user, so columnCount() yields the number of contributors.
In [13]:
mldb.put('/v1/procedures/reddit_count_users', {
    "type": "transform",
    "params": {
        # after transpose, rows are subreddits and columns are users, so
        # columnCount() counts the users who contributed to each subreddit
        "inputData": "select columnCount() as numUsers from transpose(reddit_dataset)",
        "outputDataset": "reddit_user_counts",
        "runOnCreation": True
    }
})
Out[13]:
Note: the row names here are plain subreddit names, without the ".numberEquals.1" suffix that the SVD output carries; the join below accounts for this by matching c.rowName() against the first element of m's row path.
In [14]:
mldb.query("select * from reddit_user_counts limit 5")
Out[14]:
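As an optional aside, sorting by numUsers shows the most active subreddits, which are also the ones that will get labels in the plot below:
In [ ]:
# optional: the ten subreddits with the most contributors
mldb.query("""
    select numUsers from reddit_user_counts
    order by numUsers desc
    limit 10
""")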
We'll use the Query API to get the data into a Pandas DataFrame and then use Bokeh to visualize it. In the query below we rename the rows to get rid of the ".numberEquals.1" suffix which the SVD appended to each subreddit name, and we filter out rows where cluster is null, because we only clustered the 4000 most-active subreddits.
In [17]:
df = mldb.query("""
    select c.* as *, m.* as *, quantize(m.x, 7) as grid_x, quantize(m.y, 7) as grid_y
    named c.rowName()
    from merge(reddit_tsne_embedding, reddit_kmeans_clusters) as m
    join reddit_user_counts as c on c.rowName() = m.rowPathElement(0)
    where m.cluster is not null
    order by c.numUsers desc
""")
df.head()
Out[17]:
In [18]:
import numpy as np
import bokeh.plotting as bp
from bokeh.models import HoverTool

# one color per k-means cluster (20 clusters, 20 colors)
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])
In [19]:
#this line must be in its own cell
bp.output_notebook()
In [20]:
# hover tool shows the subreddit name when mousing over a point
fig = bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by t-SNE",
    tools=[HoverTool(tooltips=[("/r/", "@subreddit")])], toolbar_location=None,
    x_axis_type=None, y_axis_type=None, min_border=1)

# one point per subreddit: position from t-SNE, color from its k-means
# cluster, radius scaled by the number of contributors
fig.scatter(
    x=df.x.values,
    y=df.y.values,
    color=colormap[df.cluster.astype(int).values],
    alpha=0.6,
    radius=(df.numUsers.values ** .3) / 15,
    source=bp.ColumnDataSource({"subreddit": df.index.values})
)

# label at most one subreddit per grid cell (the most popular one, since df
# is sorted by numUsers), and only those with more than 10,000 contributors
labels = df.reset_index().groupby(['grid_x', 'grid_y'], as_index=False).first()
labels = labels[labels["numUsers"] > 10000]
fig.text(
    x=labels.x.values,
    y=labels.y.values,
    text=labels["_rowName"].values,
    text_align="center", text_baseline="middle",
    text_font_size="8pt", text_font_style="bold",
    text_color="#333333"
)
bp.show(fig)
Out[20]:
Check out the other Tutorials and Demos.