Reddit is a discussion board that bills itself as the "Front Page of the Internet". It is divided into a large number of topic-specific "subreddits". In this demo, we'll use data about which subreddits a set of active Reddit users contributed to in order to build a visual map of subreddits. The data comes from the paper Navigating the massive world of reddit.
pymldb
In this demo, we will use pymldb to interact with the REST API: see the Using pymldb Tutorial for more details.
In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")
In [2]:
mldb.put('/v1/procedures/import_reddit', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/reddit.csv.gz",
        # blank delimiter and quoteChar: each line is imported whole,
        # into a single column called lineText
        "delimiter": "",
        "quoteChar": "",
        "outputDataset": "reddit_raw",
        "runOnCreation": True
    }
})
Out[2]:
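As an optional aside, you can confirm that the import succeeded by asking the REST API for the dataset's metadata with pymldb's get method; this check is not part of the demo itself.
In [ ]:
# optional sanity check: fetch the dataset's configuration and status
mldb.get('/v1/datasets/reddit_raw')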
And here is what our raw dataset looks like. The lineText column will need to be parsed: it's comma-delimited, with the first token being a user ID and the remaining tokens being the set of subreddits that user contributed to.
In [3]:
mldb.query("select * from reddit_raw limit 5")
Out[3]:
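To make the tokenize call in the next cell concrete, here is a rough plain-Python sketch of the parsing it performs; the sample line is invented for illustration.
In [ ]:
# hypothetical sample line: a user ID followed by that user's subreddits
line = "12345,funny,pics,gaming"
tokens = line.split(",")[1:]             # offset: 1 skips the first token (the user ID)
{subreddit: 1 for subreddit in tokens}   # value: 1 puts a 1 in each subreddit's column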
In [4]:
mldb.put('/v1/procedures/reddit_import', {
    "type": "transform",
    "params": {
        # tokenize the comma-delimited line: offset: 1 skips the user ID,
        # value: 1 puts a 1 in a column named after each subreddit
        "inputData": "select tokenize(lineText, {offset: 1, value: 1}) as * from reddit_raw",
        "outputDataset": "reddit_dataset",
        "runOnCreation": True
    }
})
Out[4]:
Here is the resulting dataset: it's a sparse matrix with a row per user and a column per subreddit, where a cell is 1 if the row's user was a contributor to the column's subreddit, and null otherwise.
In [5]:
mldb.query("select * from reddit_dataset limit 5")
Out[5]:
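As an optional sanity check on the shape of this matrix, we can count the users (rows) directly, and count the subreddits by counting the rows of the transposed dataset:
In [ ]:
# optional: how many users (rows) does the matrix have?
mldb.query("select count(*) as numUsers from reddit_dataset")
In [ ]:
# optional: how many subreddits (columns)? count rows of the transpose
mldb.query("select count(*) as numSubreddits from transpose(reddit_dataset)")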
In [6]:
mldb.put('/v1/procedures/reddit_svd', {
    "type" : "svd.train",
    "params" : {
        # keep only the 4000 columns (subreddits) with the most users,
        # i.e. the highest count of non-null cells
        "trainingData" : """
            SELECT
                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 4000)
            FROM reddit_dataset
        """,
        "columnOutputDataset" : "reddit_svd_embedding",
        "runOnCreation": True
    }
})
Out[6]:
The result of this operation is a new dataset with a row per subreddit for the 4000 most-active subreddits and columns representing coordinates for that subreddit in a 100-dimensional space.
Note: the row names are the subreddit names followed by ".numberEquals.1" because the SVD training procedure interpreted the input matrix as categorical rather than numerical.
In [8]:
mldb.query("select * from reddit_svd_embedding limit 5")
Out[8]:
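For example, you can pull out a single subreddit's coordinates by matching on the first element of the row path, much as the join further below does; 'funny' is simply an assumed example of a popular subreddit, and we assume rowPathElement behaves in a where clause as it does in that join.
In [ ]:
# one subreddit's 100-dimensional coordinates ('funny' is an assumed example)
mldb.query("""
    select * from reddit_svd_embedding
    where rowPathElement(0) = 'funny'
""")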
We will create and run a Procedure of type kmeans.train.
In [9]:
mldb.put('/v1/procedures/reddit_kmeans', {
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "outputDataset" : "reddit_kmeans_clusters",
        "numClusters" : 20,
        "runOnCreation": True
    }
})
Out[9]:
The result of this operation is a simple dataset that assigns each row in the input (i.e. each subreddit) to one of 20 clusters.
In [10]:
mldb.query("select * from reddit_kmeans_clusters limit 5")
Out[10]:
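As an optional aside, a group-by shows how balanced the 20 clusters came out:
In [ ]:
# optional: number of subreddits assigned to each cluster
mldb.query("""
    select cluster, count(*) as numSubreddits
    from reddit_kmeans_clusters
    group by cluster
    order by numSubreddits desc
""")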
We will create and run a Procedure of type tsne.train.
In [11]:
mldb.put('/v1/procedures/reddit_tsne', {
    "type" : "tsne.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "rowOutputDataset" : "reddit_tsne_embedding",
        "runOnCreation": True
    }
})
Out[11]:
The result is similar to the SVD step above: we get a row per subreddit and the columns are coordinates, but this time in a 2-dimensional space appropriate for visualization.
In [12]:
mldb.query("select * from reddit_tsne_embedding limit 5")
Out[12]:
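As a quick optional check that the embedding really is two-dimensional, and to see the range of the x and y coordinates we will plot below, we can pull the coordinates into pandas:
In [ ]:
# optional: fetch the 2-d coordinates and summarize their ranges
coords = mldb.query("select x, y from reddit_tsne_embedding")
coords.describe()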
Next we count, for each subreddit, how many users contributed to it. Transposing reddit_dataset gives a row per subreddit and a column per user, so columnCount() yields the number of contributors.
In [13]:
mldb.put('/v1/procedures/reddit_count_users', {
    "type": "transform",
    "params": {
        # after transpose, rows are subreddits and columns are users, so
        # columnCount() counts the users who contributed to each subreddit
        "inputData": "select columnCount() as numUsers from transpose(reddit_dataset)",
        "outputDataset": "reddit_user_counts",
        "runOnCreation": True
    }
})
Out[13]:
Note: the row names here are plain subreddit names, without the ".numberEquals.1" suffix that the SVD output carries; the join below accounts for this by matching c.rowName() against the first element of m's row path.
In [14]:
mldb.query("select * from reddit_user_counts limit 5")
Out[14]:
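As an optional aside, sorting by numUsers shows the most active subreddits, which are also the ones that will get labels in the plot below:
In [ ]:
# optional: the ten subreddits with the most contributors
mldb.query("""
    select numUsers from reddit_user_counts
    order by numUsers desc
    limit 10
""")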
We'll use the Query API to get the data into a Pandas DataFrame and then use Bokeh to visualize it. In the query below we rename the rows to get rid of the ".numberEquals.1" suffix which the SVD appended to each subreddit name, and we filter out rows where cluster is null, because we only clustered the 4000 most-active subreddits.
In [17]:
df = mldb.query("""
    select c.* as *, m.* as *, quantize(m.x, 7) as grid_x, quantize(m.y, 7) as grid_y
    named c.rowName()
    from merge(reddit_tsne_embedding, reddit_kmeans_clusters) as m
    join reddit_user_counts as c on c.rowName() = m.rowPathElement(0)
    where m.cluster is not null
    order by c.numUsers desc
""")
df.head()
Out[17]:
In [18]:
import numpy as np
import bokeh.plotting as bp
from bokeh.models import HoverTool

# one color per k-means cluster (20 clusters, 20 colors)
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])
In [19]:
#this line must be in its own cell
bp.output_notebook()
In [20]:
# hover tool shows the subreddit name when mousing over a point
fig = bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by t-SNE",
    tools=[HoverTool(tooltips=[("/r/", "@subreddit")])], toolbar_location=None,
    x_axis_type=None, y_axis_type=None, min_border=1)

# one point per subreddit: position from t-SNE, color from its k-means
# cluster, radius scaled by the number of contributors
fig.scatter(
    x=df.x.values,
    y=df.y.values,
    color=colormap[df.cluster.astype(int).values],
    alpha=0.6,
    radius=(df.numUsers.values ** .3) / 15,
    source=bp.ColumnDataSource({"subreddit": df.index.values})
)

# label at most one subreddit per grid cell (the most popular one, since df
# is sorted by numUsers), and only those with more than 10,000 contributors
labels = df.reset_index().groupby(['grid_x', 'grid_y'], as_index=False).first()
labels = labels[labels["numUsers"] > 10000]
fig.text(
    x=labels.x.values,
    y=labels.y.values,
    text=labels["_rowName"].values,
    text_align="center", text_baseline="middle",
    text_font_size="8pt", text_font_style="bold",
    text_color="#333333"
)
bp.show(fig)
Out[20]:
Check out the other Tutorials and Demos.