In [1]:
import nltk
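Before tokenizing or tagging anything, NLTK needs its data packages. A minimal setup sketch (resource names assume a classic NLTK install; newer releases split some of these packages differently):

nltk.download('punkt')                        # models used by nltk.word_tokenize
nltk.download('averaged_perceptron_tagger')   # default tagger behind nltk.pos_tag
nltk.download('tagsets')                      # tag documentation for nltk.help.upenn_tagset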
In [2]:
abstract = """
It's morning, you settle in, check your dashboards, and it looks like there is an increase in load showing up in some of your web server logs. What happened? You're about to deploy code that will hopefully fix some issues; how will you know that it worked? The design team is thinking about changing some of the site icons; do your users prefer big icons or small icons on your site? These scenarios are all too common, and the one thing that helps you answer them is your data.
Pushing data is typically easy. If you're tracking events on a website, you'll probably want to know a lot about click tracking, URL referrals, and user sessions. If you're curious about the number of downloads your users go through per day, you'll probably have some data that you can aggregate into a sum. Your data can be small or large or anything in between, but making it available is the most important piece you'll need to have.
Pulling data can be a bit more complex. Do you have a small amount of data that you're just pulling from a relational database? Or are you processing data through Hadoop or Spark? Data is what you want; how you pull it depends on your architecture's needs.
Presenting data is a simple task, but are you presenting the correct story? Whether you are presenting your web traffic or your user behavior data, you'll want to present your data in a way that tells the story you want to tell.
Push data, pull data, present data; these are your main tasks in a typical cycle of product development and analysis. We built out a fairly quick data pipeline using Airflow, a workflow framework made by Airbnb. We push a lot of data so we can make good data-driven business decisions. Pulling data and presenting it have gone hand in hand for us. We have used Google's BigQuery as a fast, columnar data store on which we build dashboards to visualize our data. This will shed light on what a typical push-pull-present cycle looks like, illustrated with real-world examples."""
In [3]:
tokens = nltk.word_tokenize(abstract)
In [4]:
tokens[:10]
Out[4]:
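nltk.word_tokenize splits off punctuation and breaks contractions apart, so "It's" becomes two tokens. A small illustration (output assumes the standard Treebank-style tokenizer):

nltk.word_tokenize("It's morning, you settle in.")
# ['It', "'s", 'morning', ',', 'you', 'settle', 'in', '.']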
In [5]:
tagged = nltk.pos_tag(tokens)
In [6]:
tagged[:10]
Out[6]:
In [7]:
# Frequency of each POS tag across the abstract.
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
tag_fd.most_common()
Out[7]:
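FreqDist is a subclass of collections.Counter, so the usual Counter API applies; it also has a built-in plot method if you just want a quick look without the matplotlib boilerplate below (a sketch, assuming the default line plot is acceptable):

tag_fd.plot(10)   # line plot of the ten most frequent tags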
In [8]:
nltk.help.upenn_tagset()
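Called with no argument, nltk.help.upenn_tagset() prints documentation for every Penn Treebank tag; it also accepts a regular expression over tag names if you only want a few:

nltk.help.upenn_tagset('NN')     # just the singular-noun tag
nltk.help.upenn_tagset('VB.*')   # all of the verb tags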
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
In [10]:
most_common_pos = tag_fd.most_common()
plt.figure(figsize=(15, 10))
# First pass: bar heights only, with the tick labels left blank.
plt.bar(range(len(most_common_pos)),
        [count for (pos, count) in most_common_pos],
        tick_label=['' for (pos, count) in most_common_pos])
plt.show()
In [11]:
most_common_pos = tag_fd.most_common()
plt.figure(figsize=(15, 10))
# Second pass: same bars, now labeled with their POS tags.
plt.bar(range(len(most_common_pos)),
        [count for (pos, count) in most_common_pos],
        tick_label=[pos for (pos, count) in most_common_pos])
plt.xticks(rotation=45)
plt.show()
In [12]:
most_common_pos = tag_fd.most_common()
plt.figure(figsize=(10, 10))
# Unlabeled pie chart of the tag distribution.
plt.pie([count for (pos, count) in most_common_pos], shadow=True)
plt.show()
In [13]:
most_common_pos = tag_fd.most_common()
plt.figure(figsize=(10, 10))
# Labeled pie chart with each slice's share printed on it.
plt.pie([count for (pos, count) in most_common_pos],
        labels=[pos for (pos, count) in most_common_pos],
        autopct='%.2f%%',   # percentage to two decimal places
        shadow=True)
plt.show()
In [14]:
from collections import Counter
# Most frequent singular nouns (NN), case-folded.
NN_tags = Counter([word.lower() for (word, pos) in tagged if pos == 'NN'])
NN_tags.most_common()
Out[14]:
In [15]:
# Most frequent prepositions and subordinating conjunctions (IN).
IN_tags = Counter([word.lower() for (word, pos) in tagged if pos == 'IN'])
IN_tags.most_common()
Out[15]:
In [16]:
# Most frequent plural nouns (NNS).
NNS_tags = Counter([word.lower() for (word, pos) in tagged if pos == 'NNS'])
NNS_tags.most_common()
Out[16]:
In [17]:
# Base-form (VB) and gerund (VBG) verbs.
verb_tags = Counter([word.lower() for (word, pos) in tagged if pos in {'VB', 'VBG'}])
verb_tags.most_common()
Out[17]:
In [21]:
from itertools import product
# Cross the top five verbs with the top five plural nouns into candidate two-word phrases.
['{} {}'.format(verb[0], noun[0])
 for (verb, noun) in product(verb_tags.most_common()[:5], NNS_tags.most_common()[:5])]
Out[21]:
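The product above generates every pairing of the top verbs and plural nouns, whether or not the phrase occurs in the text. A hypothetical follow-up sketch that keeps only the candidates that actually appear as adjacent tokens in the abstract:

from nltk import bigrams
# Lowercased adjacent-token pairs from the original abstract.
seen = {'{} {}'.format(a.lower(), b.lower()) for (a, b) in bigrams(tokens)}
candidates = ['{} {}'.format(verb[0], noun[0])
              for (verb, noun) in product(verb_tags.most_common()[:5], NNS_tags.most_common()[:5])]
[phrase for phrase in candidates if phrase in seen]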