ReproduceIt is a series of articles that reproduce the results of published data analyses, with a focus on open data and open code.
In this short article I reproduce the results (not the visualization this time) from reddit user /u/fhoffa, who posted a nice word cloud visualization of the most common words in some popular subreddits.
The reddit post can be found in the data is beautiful subreddit: Reddit most common words for /r/politics, /r/movies, /r/trees, /r/science
and the original word cloud was this:
For some context, he mentioned that he used Google BigQuery and Tableau. The data used was the most recent month available (May 2015) from the reddit comments dump that user /u/Stuck_In_the_Matrix made available as a torrent.
To reproduce the results I am using dask, a nice new project from Continuum Analytics that got a lot of attention at the most recent SciPy. A little disclaimer here: I currently work for Continuum, but this post is not sponsored in any way.
Dask enables parallel computing through task scheduling and blocked algorithms. This allows developers to write complex parallel algorithms and execute them in parallel either on a modern multi-core machine or on a distributed cluster.
So basically dask is similar to Spark in the way it applies transformations to data, but it also works amazingly well on a single multi-core machine and reuses battle-tested libraries like numpy and pandas.
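To get a feel for the API before touching the real data, here is a tiny sketch (not from the original post) that builds a small bag in memory and runs the same kind of lazy map/filter/frequencies pipeline; nothing runs until compute() is called:

import dask.bag as db

# A toy bag split into 2 partitions; the real workload below reads files instead
toy = db.from_sequence(["Dask is nice", "dask is lazy", "Spark is nice too"], npartitions=2)

# These transformations are lazy: they only build a task graph
word_counts = toy.map(str.lower).map(str.split).concat().frequencies()

# compute() triggers the actual parallel execution
print(sorted(word_counts.compute(), key=lambda x: x[1], reverse=True))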
The data I am using is the same as in the original post: 33.46 GB of newline-delimited JSON, each line holding one comment plus some metadata. All the code ran on my MacBook Pro (Retina, Mid 2014) - 2.8 GHz Intel Core i7 - 16 GB 1600 MHz DDR3.
Let's jump to the code. As usual, some libraries and their versions for reproducibility:
In [1]:
import re
import json
import time

import dask
import dask.bag as db
import nltk
from nltk.corpus import stopwords
In [2]:
dask.__version__
Out[2]:
Note: At the time of writing you need the master version of dask, since it relies on a feature Matt Rocklin added recently, but it should be available in the next dask release.
Update: Dask 0.6.1 has been released, so you no longer need to use master; just conda install dask.
In [4]:
nltk.__version__
Out[4]:
Create a data variable that points to the big 33 GB file, set the chunkbytes to 100 MB, and map each item through json.loads since I know that each row is a JSON document.
In [5]:
# chunkbytes is in bytes: ~100 MB per block
data = db.from_filenames("RC_2015-05", chunkbytes=100000000).map(json.loads)
Similar to Spark, you can take one or more items to inspect the dataset.
In [6]:
data.take(1)
Out[6]:
Now some simple helper functions that we are going to use as filters.
In [7]:
no_stopwords = lambda x: x not in stopwords.words('english')
In [8]:
is_word = lambda x: re.search("^[0-9a-zA-Z]+$", x) is not None
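As a quick sanity check (not part of the original notebook), this is how the two filters behave on a few sample tokens; note that the NLTK stopwords corpus must be downloaded first (nltk.download('stopwords')):

is_word("hello")        # True: only letters and digits
is_word("?!")           # False: punctuation only
no_stopwords("the")     # False: 'the' is an English stopword
no_stopwords("movies")  # True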
Now comes the dask part, which will look really familiar to anyone who has used Spark.
In [9]:
subreddit = data.filter(lambda x: x['subreddit'] == 'movies')  # keep only /r/movies comments
bodies = subreddit.pluck('body')                               # extract the comment text
words = bodies.map(nltk.word_tokenize).concat()                # tokenize and flatten into one bag of words
words2 = words.map(lambda x: x.lower())                        # lowercase everything
words3 = words2.filter(no_stopwords)                           # drop English stopwords
words4 = words3.filter(is_word)                                # keep alphanumeric tokens only
counts = words4.frequencies()                                  # count occurrences of each word
Call compute and let dask do its work. At this moment, if you take a look at your process manager, you should see a lot of Python processes.
In [10]:
start_time = time.time()
values = counts.compute()
elapsed_time = time.time() - start_time
We can see how long the computation took: in this case around 23 minutes, which works out to about 1.4 GB per minute.
In [11]:
elapsed_time # seconds
Out[11]:
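The back-of-the-envelope throughput comes straight from those two numbers:

# ~23 minutes for 33.46 GB is roughly 1.4 GB per minute
33.46 / (elapsed_time / 60)  # elapsed_time is in seconds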
With the word count done we can just sort the Python list and see the most common words.
In [12]:
len(values)
Out[12]:
In [13]:
sort = sorted(values, key=lambda x: x[1], reverse=True)
In [14]:
sort[:100]
Out[14]:
Save the results to a file:
In [17]:
with open('r_movies.txt', 'w') as f:
    for item in sort:
        f.write("%s %i\n" % item)
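If you want to double-check the output (this step is not in the original post), reading the first few lines back is enough, since each line is "<word> <count>" with the most frequent words first:

from itertools import islice

with open('r_movies.txt') as f:
    for line in islice(f, 5):
        print(line.strip())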
This was a simple post running a computation on 33 GB of data on a single node with around 20 lines of Python. The main objective was to reproduce results that other people have produced with bigger tools like Google BigQuery, using simpler but still powerful tools on a single node. It was also my first time using dask and I was quite impressed by it.
If you have medium-sized data and a big workstation and don't want to mess with clusters, you should check out dask. You can also use dask on a cluster, but let's save that for a different post.