In [1]:
from IPython.display import Image
Image(url='http://python.org/images/python-logo.gif')
Out[1]:
In [2]:
Image(url='http://ipython.org/_static/IPy_header.png')
Out[2]:
In [3]:
Image(url='http://jupyter.org/images/jupyter-sq-text.svg', width=300, height=300)
Out[3]:
IPython Notebook is a web-based interactive computational environment for creating IPython notebooks. An IPython notebook is a JSON document containing an ordered list of input/output cells that can contain code, text, mathematics, plots, and rich media.
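As a rough sketch of that on-disk format, the snippet below opens a notebook file and inspects its structure; the file name is hypothetical, and the top-level keys shown are those of the nbformat 4 schema.
import json
with open('example.ipynb') as f:    # hypothetical notebook file
    nb = json.load(f)
print(sorted(nb.keys()))            # ['cells', 'metadata', 'nbformat', 'nbformat_minor']
print(nb['cells'][0]['cell_type'])  # each cell is 'code', 'markdown', or 'raw'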
matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, and more with just a few lines of code, using a familiar MATLAB-like API.
import numpy as np
import matplotlib.pyplot as plt
people = ('Tom', 'Dick', 'Harry', 'Slim', 'Jim')  # example data
y_pos = np.arange(len(people))
performance = 3 + 10 * np.random.rand(len(people))
plt.barh(y_pos, performance, xerr=np.random.rand(len(people)), align='center', alpha=0.4)
plt.yticks(y_pos, people)
plt.xlabel('Performance')
plt.title('How fast do you want to go today?')
plt.show()
Spark on Python: PySpark serves as the kernel, integrating Spark with IPython.
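In the cells that follow, the kernel supplies a ready-made SparkContext as sc. As a minimal sketch of what that amounts to, constructing one by hand looks roughly like this (the master URL and app name are illustrative):
from pyspark import SparkContext
sc = SparkContext(master='local[*]', appName='notebook-demo')  # illustrative settings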
# Markdown code block
if not full:
    print('eat more!')
In [4]:
import matplotlib
matplotlib.__version__
Out[4]:
In [5]:
import sys
print(sys.version)
print(sc.version)
In [6]:
lines = sc.parallelize(['Its fun to have fun,', 'but you have to know how.'])
wordcounts = (lines
              .map(lambda x: x.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower())  # normalize punctuation and case
              .flatMap(lambda x: x.split())      # split lines into words
              .map(lambda x: (x, 1))             # pair each word with a count of 1
              .reduceByKey(lambda x, y: x + y)   # sum the counts per word
              .map(lambda x: (x[1], x[0]))       # swap to (count, word)
              .sortByKey(False))                 # sort by count, descending
wordcounts.take(10)
Out[6]:
In [7]:
pagecounts = sc.textFile('/user/fcheung/pagecounts') # HDFS
pagecounts.take(10)
Out[7]:
In [8]:
enPages = pagecounts.filter(lambda x: x.split(" ")[1] == "en")  # keep lines for the English project
(enPages.map(lambda x: x.split(" "))
        .map(lambda x: (x[2], int(x[3])))      # (page title, view count)
        .reduceByKey(lambda x, y: x + y, 40)   # sum views per page, 40 partitions
        .filter(lambda x: x[1] > 200000)       # keep pages with over 200,000 views
        .map(lambda x: (x[1], x[0]))           # swap to (views, title)
        .collect())
# This runs on the cluster
Out[8]:
In [9]:
words = sc.textFile('/user/fcheung/hamlet.txt')  # one element per line of the play
words.take(5)
Out[9]:
In [10]:
import re
hamlet = words.flatMap(lambda line: re.split(r'\W+', line.lower().strip()))
hamlet.take(5)
Out[10]:
In [11]:
tmp = hamlet.filter(lambda x: len(x) > 2)  # drop short words like 'a' and 'to'
print(tmp.take(5))
In [12]:
tmp = tmp.map(lambda word: (word, 1))
tmp.take(5)
Out[12]:
In [13]:
tmp = tmp.reduceByKey(lambda a, b: a + b)
tmp.take(5)
Out[13]:
In [14]:
tmp = tmp.map(lambda x: (x[1], x[0])).sortByKey(False)  # swap to (count, word) and sort descending
tmp.take(20)
Out[14]:
In [15]:
tmp = tmp.map(lambda x: (x[1], x[0]))  # swap back to (word, count)
tmp.take(20)
Out[15]:
In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
def plot(words):
    values = [x[1] for x in words]   # counts
    labels = [x[0] for x in words]   # the words themselves
    plt.barh(range(len(values)), values, color='grey')
    plt.yticks(range(len(values)), labels)
    plt.show()
In [17]:
plot(tmp.take(15))
Word2Vec computes distributed vector representations of words. Distributed vector representations have been shown to be useful in many natural language processing applications such as named entity recognition, disambiguation, parsing, tagging and machine translation. https://code.google.com/p/word2vec/
Spark implements the Skip-gram approach. With Skip-gram, we want to predict a window of surrounding words given a single word.
It was recently shown that the word vectors capture many linguistic regularities; for example, the vector operation vector('Paris') - vector('France') + vector('Italy') yields a vector very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1].
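As a hedged sketch of trying one such analogy with the model trained below (assuming 'paris', 'france', and 'italy' are all in its vocabulary; transform and findSynonyms are the same mllib Word2VecModel methods used later in this notebook):
# Combine word vectors, then look up the nearest words to the result
v = (model.transform('paris').toArray()
     - model.transform('france').toArray()
     + model.transform('italy').toArray())
for word, score in model.findSynonyms(v, 5):
    print("{}: {}".format(word, score))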
Wikipedia dump http://mattmahoney.net/dc/textdata
grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
then randomly sampled to ~200k lines
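The sampling step isn't shown here; one way to do it with Spark itself would be something like the following, where the fraction is just an estimate of what yields roughly 200k lines and the output path matches the one read below:
lines = sc.textFile('/user/fcheung/text8_lines')
fraction = 200000.0 / lines.count()   # aim for ~200k of the input lines
lines.sample(False, fraction, seed=42).saveAsTextFile('/user/fcheung/text8_linessmall')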
In [22]:
from pyspark.mllib.feature import Word2Vec
textpath = '/user/fcheung/text8_linessmall'
inp = sc.textFile(textpath).map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
# This takes a while....
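Training time and quality depend on the Word2Vec parameters. A hedged sketch of adjusting them before fitting (these setters exist on pyspark.mllib.feature.Word2Vec; the values here are illustrative, and the defaults are vectorSize=100 and minCount=5):
word2vec = (Word2Vec()
            .setVectorSize(100)   # dimensionality of the word vectors
            .setMinCount(5)       # ignore words seen fewer than 5 times
            .setSeed(42))         # fix the seed for reproducibility
model = word2vec.fit(inp)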
In [24]:
synonyms = model.findSynonyms('car', 40)   # 40 nearest words by cosine similarity
for word, cosine_distance in synonyms:
    print("{}: {}".format(word, cosine_distance))
In [25]:
values = [x[1] for x in synonyms]   # cosine similarity scores
labels = [x[0] for x in synonyms]   # the synonym words
plt.barh(range(len(values)), values, color='blue')
plt.yticks(range(len(values)), labels)
plt.show()
In [26]:
from wordcloud import WordCloud, STOPWORDS
words = " ".join([x[0] for x in synonyms for times in range(0, int(x[1]*10))])
wordcloud = WordCloud(font_path='/home/fcheung/CabinSketch-Bold.ttf',
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=1800,
                      height=1400).generate(words)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()