Explore the Data

Load the data


In [1]:
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
%matplotlib inline

In [3]:
df = pd.read_csv('data_tau_days.csv')

In [4]:
df.head()


Out[4]:
title date days
0 An Exploration of R, Yelp, and the Search for ... 5 points by Rogerh91 6 hours ago | discuss 1
1 Deep Advances in Generative Modeling 7 points by gwulfs 15 hours ago | 1 comment 1
2 Spark Pipelines: Elegant Yet Powerful 3 points by aouyang1 9 hours ago | discuss 1
3 Shit VCs Say 3 points by Argentum01 10 hours ago | discuss 1
4 Python, Machine Learning, and Language Wars 4 points by pmigdal 17 hours ago | discuss 1

Creating a Wordcloud

Now let us visualize which words are most prominent. This requires us to find all the words in the titles and build a frequency count of how many times each word occurs.


In [5]:
import nltk
from wordcloud import WordCloud

Tokenization

Tokenization segments a document into its atomic elements. In this case, we are interested in tokenizing to words, so the first step is to break each sentence into words.


In [6]:
sentence = df["title"][0]
sentence


Out[6]:
'An Exploration of R, Yelp, and the Search for Good Indian Food'

In [7]:
tokens = nltk.wordpunct_tokenize(sentence)
tokens


Out[7]:
['An',
 'Exploration',
 'of',
 'R',
 ',',
 'Yelp',
 ',',
 'and',
 'the',
 'Search',
 'for',
 'Good',
 'Indian',
 'Food']

Let us take every title in the dataframe, tokenize it into words, and build a frequency count for each word


In [8]:
frequency_words = {}

In [9]:
for data in df['title']:
    tokens = nltk.wordpunct_tokenize(data)
    for token in tokens:
        if token in frequency_words:
            count = frequency_words[token]
            count = count + 1
            frequency_words[token] = count
        else:
            frequency_words[token] = 1
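As an aside, the same counts can be built more idiomatically with collections.Counter from the standard library (a minimal sketch, equivalent to the loop above):

from collections import Counter

# Count token frequencies across all titles in one pass
frequency_words_alt = Counter(
    token
    for title in df['title']
    for token in nltk.wordpunct_tokenize(title)
)
frequency_words_alt.most_common(5)  # Counter is a dict with extras like most_common()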

In [10]:
# Let us see the frequency count for each word
frequency_words


Out[10]:
{'!': 2,
 '"': 2,
 '#': 1,
 '&': 1,
 "'": 6,
 '(': 8,
 ')': 8,
 '***': 1,
 '+': 1,
 ',': 23,
 '-': 39,
 '.': 14,
 '.*:': 1,
 '/': 8,
 '0': 4,
 '1': 5,
 '10': 1,
 '101': 1,
 '11': 1,
 '16': 1,
 '2': 2,
 '2016': 1,
 '3': 3,
 '4': 1,
 '50': 3,
 '7': 2,
 '8M': 1,
 ':': 34,
 '?': 16,
 '???': 1,
 '@': 1,
 'A': 13,
 'API': 3,
 'AWS': 1,
 'AYLIEN': 1,
 'About': 1,
 'Advances': 1,
 'Agree': 1,
 'Algorithms': 2,
 'All': 1,
 'Amazon': 2,
 'An': 10,
 'Analysis': 6,
 'Analytics': 1,
 'Analyze': 1,
 'Analyzing': 2,
 'Announcing': 2,
 'Answers': 1,
 'Apache': 5,
 'Apple': 1,
 'Appliance': 1,
 'Arrow': 1,
 'Art': 1,
 'Artists': 1,
 'Ask': 2,
 'Asked': 1,
 'Austin': 1,
 'Authoring': 1,
 'Auto': 1,
 'Automate': 1,
 'Automated': 1,
 'B': 2,
 'BallR': 1,
 'Bay': 1,
 'Bayesian': 7,
 'Be': 1,
 'Become': 1,
 'Beginners': 1,
 'Behavior': 1,
 'Better': 1,
 'Between': 1,
 'Big': 2,
 'Biggest': 1,
 'Billion': 3,
 'Boosting': 2,
 'Bootstrap': 1,
 'Bowl': 1,
 'Building': 1,
 'CMU': 1,
 'Caffe': 1,
 'Campaigns': 1,
 'Can': 1,
 'Careers': 1,
 'Cartoon': 1,
 'Chains': 1,
 'Characters': 1,
 'Charts': 1,
 'Cheat': 1,
 'Classification': 2,
 'Click': 1,
 'Cloud': 1,
 'Code': 2,
 'Cohort': 1,
 'Coll': 1,
 'Collectd': 1,
 'Competition': 1,
 'Component': 2,
 'Computer': 1,
 'Computing': 2,
 'Convolutional': 1,
 'Cooker': 1,
 'Corrosivity': 1,
 'Counts': 1,
 'Course': 2,
 'Coursera': 1,
 'Creating': 1,
 'Cricket': 1,
 'Crisis': 1,
 'Current': 1,
 'Cycle': 1,
 'D3': 2,
 'DT': 3,
 'Daily': 1,
 'Data': 31,
 'DataIsBeautiful': 1,
 'DataRadar': 1,
 'Databases': 1,
 'Dataflow': 1,
 'Datasets': 1,
 'Datathon': 1,
 'Datumbox': 1,
 'Day': 1,
 'DeZyre': 1,
 'Decision': 1,
 'Deep': 5,
 'DeepMind': 1,
 'Density': 1,
 'Departments': 1,
 'Depth': 3,
 'Deriving': 1,
 'Descriptive': 1,
 'Details': 1,
 'Detect': 1,
 'Dimensions': 1,
 'Dirichlet': 1,
 'Distributed': 3,
 'Do': 3,
 'Don': 1,
 'Doodles': 1,
 'Dplyr': 1,
 'Dplython': 1,
 'DrivenData': 1,
 'Dummy': 1,
 'EA': 1,
 'ELG': 1,
 'EMR': 2,
 'ETL': 1,
 'Elasticsearch': 1,
 'Elegant': 1,
 'Engineering': 1,
 'Engineers': 2,
 'Ensemble': 1,
 'Environment': 1,
 'Eric': 1,
 'Estimation': 2,
 'Ethical': 1,
 'Evaluation': 1,
 'Even': 1,
 'Ever': 1,
 'Examples': 1,
 'Expert': 1,
 'Explained': 1,
 'Exploration': 1,
 'Exploring': 1,
 'Extracting': 1,
 'FIFA': 1,
 'Facebook': 1,
 'Fast': 1,
 'Fatigue': 1,
 'Feed': 1,
 'File': 1,
 'Find': 1,
 'Finding': 1,
 'First': 1,
 'Flink': 2,
 'Flint': 1,
 'Flow': 1,
 'FlyElephant': 1,
 'Fog': 1,
 'Follower': 1,
 'Food': 1,
 'Forests': 1,
 'Four': 1,
 'Framework': 1,
 'France': 1,
 'Free': 2,
 'Frequently': 1,
 'From': 1,
 'Fun': 2,
 'Functioning': 1,
 'G': 1,
 'GW150914': 1,
 'Generate': 1,
 'Generation': 1,
 'Generative': 1,
 'Genius': 1,
 'Genomic': 3,
 'Gensim': 1,
 'Geographic': 1,
 'Getting': 1,
 'GloVe': 1,
 'Globe': 1,
 'Golden': 1,
 'Good': 1,
 'Grafana': 1,
 'Graph': 1,
 'GraphFrames': 2,
 'Graphical': 1,
 'Growth': 1,
 'Guide': 3,
 'Hacked': 1,
 'Hacking': 1,
 'Hadoop': 1,
 'Harvard': 1,
 'Has': 1,
 'High': 1,
 'Highly': 1,
 'Hiring': 1,
 'Histogram': 1,
 'Hong': 1,
 'How': 11,
 'I': 3,
 'IBM': 1,
 'IDF': 1,
 'IO': 1,
 'Improved': 1,
 'In': 5,
 'Inception': 1,
 'Indian': 1,
 'Inferring': 1,
 'Initial': 1,
 'Insights': 1,
 'Instacart': 1,
 'Instagram': 1,
 'Intellexer': 1,
 'Interactive': 2,
 'Interests': 1,
 'Interface': 1,
 'International': 1,
 'Interviews': 1,
 'Intro': 2,
 'Introducing': 1,
 'Introduction': 4,
 'Invaders': 1,
 'IoT': 1,
 'Is': 2,
 'It': 2,
 'JSON': 1,
 'Javascript': 1,
 'Julia': 2,
 'Jupyter': 1,
 'K': 2,
 'Kafka': 1,
 'Kitchen': 1,
 'Know': 1,
 'Kong': 1,
 'LIGO': 1,
 'Language': 4,
 'Large': 1,
 'Latency': 1,
 'LeCun': 1,
 'Learn': 1,
 'Learning': 16,
 'Lectures': 1,
 'Lens': 1,
 'Lense': 1,
 'Less': 1,
 'Level': 1,
 'Lift': 1,
 'Limits': 1,
 'Live': 1,
 'Logic': 1,
 'Logstash': 1,
 'Losers': 1,
 'Love': 1,
 'ML': 1,
 'Machine': 12,
 'MachineJS': 1,
 'Made': 1,
 'Mail': 1,
 'Making': 1,
 'Manifold': 1,
 'Map': 1,
 'March': 1,
 'Markov': 1,
 'Math': 1,
 'Means': 1,
 'Meetup': 2,
 'Megaman': 1,
 'Metaprogramming': 1,
 'Methods': 2,
 'Metrics': 1,
 'Michigan': 1,
 'Millions': 1,
 'Minecraft': 1,
 'Mining': 2,
 'Minute': 1,
 'Mistakes': 1,
 'Misusing': 1,
 'Mixtures': 1,
 'Model': 1,
 'Modeling': 2,
 'Models': 2,
 'Moneyball': 1,
 'Monsanto': 1,
 'Months': 1,
 'More': 1,
 'Morocco': 1,
 'My': 2,
 'NBA': 1,
 'NSA': 1,
 'NYC': 2,
 'Natural': 1,
 'Nets': 1,
 'Network': 1,
 'Networks': 2,
 'Neural': 5,
 'Newly': 1,
 'Next': 1,
 'Nine': 1,
 'No': 1,
 'Non': 4,
 'Not': 1,
 'Notification': 1,
 'Numerical': 1,
 'OC': 1,
 'Ode': 1,
 'OkCupid': 1,
 'One': 3,
 'Online': 1,
 'Open': 1,
 'Optimization': 1,
 'Optimizing': 3,
 'Oscars': 1,
 'Outliers': 1,
 'Overoptimizing': 1,
 'Overview': 1,
 'Owned': 1,
 'P': 2,
 'Pandas': 2,
 'Parallel': 1,
 'Parametric': 3,
 'Park': 1,
 'Part': 4,
 'Patterns': 1,
 'Peak': 1,
 'Personality': 1,
 'Pipelines': 1,
 'Platform': 1,
 'Player': 1,
 'Playing': 1,
 'PledgeForParity': 1,
 'Plots': 1,
 'Poets': 1,
 'Pool': 1,
 'Pop': 1,
 'Portable': 1,
 'Powerful': 1,
 'Prescriptive': 1,
 'Presidential': 1,
 'Presto': 1,
 'Primary': 1,
 'Principal': 2,
 'Probabilistic': 1,
 'Process': 1,
 'Processing': 2,
 'Producer': 1,
 'Profit': 1,
 'Project': 1,
 'Projection': 1,
 'Pseudo': 1,
 'PyLearn2': 1,
 'PyMC3': 1,
 'Python': 10,
 'Q': 1,
 'Question': 1,
 'Questions': 1,
 'R': 10,
 'REST': 1,
 'RSS': 1,
 'Ranges': 1,
 'Reason': 1,
 'Reasoning': 1,
 'Redshift': 1,
 'Regression': 1,
 'Released': 1,
 'Republican': 1,
 'Reshaping': 1,
 'Results': 1,
 'Rice': 1,
 'Rides': 3,
 'Rodeo': 1,
 'Role': 1,
 'Roots': 1,
 'Running': 1,
 'SF': 1,
 'SKYNET': 1,
 'SQL': 3,
 'Say': 1,
 'Scala': 2,
 'Scalable': 1,
 'Scammers': 1,
 'Scared': 1,
 'Science': 17,
 'Scientist': 1,
 'Scikit': 1,
 'Screencasts': 1,
 'Search': 2,
 'Sense2vec': 1,
 'Series': 1,
 'Sheets': 1,
 'Shiny': 2,
 'Shit': 1,
 'Shot': 1,
 'Should': 1,
 'Shouldn': 1,
 'Show': 1,
 'Side': 1,
 'Signal': 2,
 'Significance': 1,
 'Simple': 2,
 'Simplified': 1,
 'Skizze': 1,
 'Slack': 2,
 'Smartest': 1,
 'Sneak': 1,
 'Some': 1,
 'Source': 1,
 'South': 1,
 'Space': 2,
 'Spark': 10,
 'Stack': 1,
 'Started': 1,
 'State': 3,
 'Statebins': 1,
 'Statistical': 1,
 'Statisticians': 1,
 'Statistics': 2,
 'Step': 1,
 'Stochastic': 1,
 'Stole': 1,
 'Stop': 1,
 'Stream': 1,
 'Streaming': 1,
 'Structures': 1,
 'Studio': 1,
 'Summarizing': 1,
 'Super': 1,
 'Supervision': 1,
 'Survival': 1,
 'System': 1,
 'TF': 1,
 'TX': 1,
 'Tab': 1,
 'Tau': 1,
 'Taxi': 3,
 'Teaching': 1,
 'Technical': 3,
 'Technologies': 1,
 'Telemetry': 1,
 'TensorFlow': 4,
 'Tensorflow': 1,
 'Testing': 2,
 'Text': 2,
 'The': 13,
 'Theano': 1,
 'Them': 1,
 'Three': 1,
 'Through': 1,
 'Time': 2,
 'Times': 1,
 'Timing': 1,
 'Tiny': 1,
 'To': 5,
 'Tools': 2,
 'Top': 1,
 'Topic': 2,
 'Train': 2,
 'Training': 1,
 'Tree': 1,
 'True': 1,
 'Trump': 1,
 'Tutorial': 2,
 'Tweets': 1,
 'Twelve': 1,
 'Twice': 1,
 'Twilight': 1,
 'Twitter': 5,
 'Twython': 1,
 'US': 1,
 'Uber': 1,
 'Undergrad': 1,
 'Understand': 1,
 'Unsupervised': 2,
 'Up': 1,
 'Upcoming': 1,
 'Us': 1,
 'Use': 2,
 'User': 1,
 'Using': 2,
 'VCs': 1,
 'Value': 1,
 'Values': 1,
 'Vector': 1,
 'Vectorization': 1,
 'Viewing': 1,
 'Vision': 1,
 'Visual': 2,
 'Visualization': 2,
 'Visualize': 1,
 'Visualizing': 1,
 'Visually': 1,
 'Wait': 1,
 'Warriors': 1,
 'Wars': 1,
 'Watch': 2,
 'Water': 1,
 'Web': 1,
 'Webhose': 1,
 'Webinar': 1,
 'What': 6,
 'When': 1,
 'Where': 1,
 'Who': 1,
 'Why': 1,
 'Win': 1,
 'Winners': 1,
 'With': 2,
 'Without': 1,
 'Women': 1,
 'Work': 1,
 'Workflows': 1,
 'Working': 1,
 'Write': 1,
 'XGBoost': 1,
 'XGBoost4J': 1,
 'XGboost': 1,
 'XML': 1,
 'Xing': 1,
 'YARN': 1,
 'Yan': 1,
 'Years': 1,
 'Yelp': 1,
 'Yet': 1,
 'You': 1,
 'Your': 1,
 'Zone': 1,
 '[': 1,
 ']': 1,
 'a': 10,
 'about': 5,
 'affect': 1,
 'age': 1,
 'aka': 1,
 'almost': 1,
 'an': 1,
 'analogies': 1,
 'analysis': 2,
 'analytical': 1,
 'and': 36,
 'animated': 1,
 'anywhere': 1,
 'app': 1,
 'archive': 1,
 'are': 2,
 'article': 1,
 'artificial': 1,
 'at': 5,
 'background': 1,
 'be': 2,
 'become': 1,
 'better': 1,
 'black': 1,
 'blending': 1,
 'box': 1,
 'breweries': 1,
 'by': 2,
 'can': 2,
 'categorical': 1,
 'causal': 1,
 'causality': 1,
 'certified': 1,
 'challenge': 1,
 'change': 1,
 'changed': 1,
 'changes': 1,
 'channel': 1,
 'charts': 1,
 'choice': 1,
 'choices': 1,
 'classifier': 1,
 'classifiers': 1,
 'climbing': 1,
 'clusters': 1,
 'co': 1,
 'code': 2,
 'compatible': 1,
 'completion': 1,
 'complex': 1,
 'condition': 1,
 'conversion': 1,
 'course': 2,
 'courses': 1,
 'creators': 1,
 'd3': 1,
 'data': 10,
 'datasets': 2,
 'de': 1,
 'decision': 1,
 'deep': 1,
 'deepen': 1,
 'demo': 1,
 'demystified': 1,
 'details': 1,
 'detecting': 1,
 'detection': 1,
 'developers': 1,
 'discover': 1,
 'do': 3,
 'docstrings': 1,
 'easy': 2,
 'eight': 1,
 'encoders': 1,
 'enough': 1,
 'ensemble': 1,
 'estimate': 1,
 'estimation': 1,
 'events': 1,
 'excited': 1,
 'experience': 1,
 'experiments': 2,
 'explaining': 1,
 'f': 1,
 'families': 1,
 'file': 1,
 'for': 23,
 'free': 1,
 'from': 2,
 'ge': 1,
 'git': 2,
 'gitnoc': 1,
 'give': 1,
 'greedy': 1,
 'hands': 1,
 'have': 1,
 'heart': 1,
 'hierarchical': 1,
 'high': 1,
 'hill': 1,
 'historical': 1,
 'how': 1,
 'image': 3,
 'impact': 1,
 'import': 2,
 'in': 28,
 'innocent': 1,
 'instead': 1,
 'intelligence': 1,
 'interactive': 1,
 'internships': 1,
 'interpretable': 1,
 'intersection': 1,
 'into': 1,
 'intro': 1,
 'invite': 1,
 'io': 3,
 'it': 1,
 'jobs': 1,
 'js': 2,
 'just': 2,
 'kaggle': 1,
 'killing': 1,
 'large': 1,
 'leaders': 1,
 'learn': 5,
 'learning': 5,
 'lectures': 1,
 'library': 1,
 'lines': 1,
 'links': 1,
 'look': 1,
 'machine': 3,
 'machines': 1,
 'make': 2,
 'mapping': 1,
 'matching': 1,
 'math': 1,
 'may': 1,
 'means': 1,
 'merge': 1,
 'messaging': 1,
 'metadata': 1,
 'mistakes': 1,
 'ml': 1,
 'modelling': 1,
 'models': 1,
 'network': 1,
 'neural': 1,
 'now': 1,
 'of': 27,
 'offers': 1,
 'on': 11,
 'online': 2,
 'open': 2,
 'optional': 1,
 'other': 1,
 'own': 1,
 'owners': 1,
 'pandas': 3,
 'park': 1,
 'passing': 1,
 'people': 1,
 'petersburg': 1,
 'phys': 1,
 'pitfalls': 1,
 'platform': 1,
 'points': 1,
 'polished': 1,
 'predictions': 1,
 'presentations': 1,
 'price': 1,
 'private': 1,
 'probabilistic': 1,
 'processing': 1,
 'program': 1,
 'quality': 1,
 'r': 1,
 'rate': 1,
 'released': 1,
 'repositories': 1,
 'results': 1,
 'revisited': 1,
 'rise': 1,
 'robots': 1,
 'rookie': 1,
 'rules': 1,
 'run': 2,
 'running': 2,
 's': 5,
 'say': 1,
 'scale': 1,
 'scaling': 1,
 'science': 1,
 'scientist': 1,
 'scikit': 3,
 'scraping': 1,
 'secret': 1,
 'security': 1,
 'series': 1,
 'service': 1,
 'shape': 1,
 'share': 1,
 'should': 1,
 'simpler': 1,
 'sklearn': 1,
 'slides': 1,
 'socket': 1,
 'some': 1,
 'sourced': 2,
 'spaCy': 1,
 'statistical': 1,
 'status': 1,
 'steps': 1,
 'storage': 1,
 'story': 1,
 'streams': 1,
 'structural': 1,
 'structure': 1,
 'survival': 1,
 'systems': 1,
 't': 2,
 'talk': 1,
 'than': 1,
 'that': 1,
 'the': 15,
 'thought': 1,
 'thousands': 1,
 'through': 1,
 'throughput': 1,
 'time': 1,
 'timeseries': 1,
 'to': 24,
 'tools': 1,
 'training': 1,
 'transparent': 1,
 'tweets': 1,
 'understanding': 1,
 'undiagnosed': 1,
 'unsupervised': 1,
 'unusual': 1,
 'updates': 1,
 'use': 1,
 'users': 1,
 'using': 7,
 'variations': 1,
 've': 2,
 'vectorization': 1,
 'video': 2,
 'visibility': 1,
 'visualization': 1,
 'vs': 2,
 'want': 1,
 'way': 2,
 'we': 1,
 'weapon': 1,
 'with': 27,
 'word2vec': 1,
 'work': 2,
 'working': 1,
 'worry': 1,
 'you': 2,
 'your': 4}

In [11]:
# Creating a Wordcloud
wordcloud = WordCloud()

In [12]:
wordcloud.generate_from_frequencies(frequency_words.items())


Out[12]:
<wordcloud.wordcloud.WordCloud at 0x1184795c0>
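(A version note: older releases of the wordcloud library accept a list of (word, count) pairs, as passed here; newer releases expect a plain dict, i.e. wordcloud.generate_from_frequencies(frequency_words).)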

In [13]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


Question - What are the two issues with this wordcloud?


In [14]:
# Convert the dict to a dataframe
freq = pd.DataFrame.from_dict(frequency_words, orient = 'index')

In [15]:
# Let us sort them in descending order
freq.sort_values(by = 0, ascending=False).head(10)


Out[15]:
0
- 39
and 36
: 34
Data 31
in 28
of 27
with 27
to 24
for 23
, 23

Stopword Removal

Stop words are words which are filtered out before or after processing of natural language data. Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.


In [19]:
from nltk.corpus import stopwords
# nltk.download()

In [20]:
stop = stopwords.words('english')

In [21]:
stop[0:10]


Out[21]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

We will recreate the word frequency counts with two additional steps:

  • Remove all the stop words in our count
  • Make every word lower case

In [23]:
frequency_words_wo_stop = {}
for data in df['title']:
    tokens = nltk.wordpunct_tokenize(data)
    for token in tokens:
        if token.lower() not in stop:
            if token in frequency_words_wo_stop:
                count = frequency_words_wo_stop[token]
                count = count + 1
                frequency_words_wo_stop[token] = count
            else:
                frequency_words_wo_stop[token] = 1

In [24]:
wordcloud.generate_from_frequencies(frequency_words_wo_stop.items())


Out[24]:
<wordcloud.wordcloud.WordCloud at 0x1184795c0>

In [25]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


We can also extend the stopword list with common punctuation marks to remove those from the counts as well


In [28]:
stop.extend(('.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}','/','-'))
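As a shortcut, the standard library's string.punctuation covers all the common ASCII punctuation characters at once (an equivalent sketch):

import string

# string.punctuation is a string of punctuation characters;
# extending a list with it adds each character individually
stop.extend(string.punctuation)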

In [29]:
frequency_words_wo_stop = {}

In [30]:
def generate_word_frequency(row):
    data = row['title']
    tokens = nltk.wordpunct_tokenize(data)
    token_list = []
    for token in tokens:
        if token.lower() not in stop:
            token_list.append(token.lower())
            if token.lower() in frequency_words_wo_stop:
                count = frequency_words_wo_stop[token.lower()]
                count = count + 1
                frequency_words_wo_stop[token.lower()] = count
            else:
                frequency_words_wo_stop[token.lower()] = 1
    
    return ','.join(token_list)

The apply function takes a function as its input and applies it across all the rows (axis=1) or columns (axis=0) of the dataframe


In [31]:
df['tokens'] = df.apply(generate_word_frequency,axis=1)

In [32]:
df.head()


Out[32]:
title date days tokens
0 An Exploration of R, Yelp, and the Search for ... 5 points by Rogerh91 6 hours ago | discuss 1 exploration,r,yelp,search,good,indian,food
1 Deep Advances in Generative Modeling 7 points by gwulfs 15 hours ago | 1 comment 1 deep,advances,generative,modeling
2 Spark Pipelines: Elegant Yet Powerful 3 points by aouyang1 9 hours ago | discuss 1 spark,pipelines,elegant,yet,powerful
3 Shit VCs Say 3 points by Argentum01 10 hours ago | discuss 1 shit,vcs,say
4 Python, Machine Learning, and Language Wars 4 points by pmigdal 17 hours ago | discuss 1 python,machine,learning,language,wars

In [33]:
wordcloud.generate_from_frequencies(frequency_words_wo_stop.items())


Out[33]:
<wordcloud.wordcloud.WordCloud at 0x1184795c0>

In [34]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


Exercise: Find the frequency count for each word, with stopwords removed


In [ ]:
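One possible approach (a sketch, mirroring the earlier sort on the raw counts):

# Convert the stopword-free counts to a dataframe and sort descending
freq_wo_stop = pd.DataFrame.from_dict(frequency_words_wo_stop, orient='index')
freq_wo_stop.sort_values(by=0, ascending=False).head(10)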

Stemming

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

Stemming words is another common NLP technique to reduce topically similar words to their root. For example, “stemming,” “stemmer,” “stemmed,” all have similar meanings; stemming reduces those terms to “stem.” This is important for topic modeling, which would otherwise view those terms as separate entities and reduce their importance in the model. Stemming programs are commonly referred to as stemming algorithms or stemmers.

Like stopping, stemming is flexible and some methods are more aggressive. The Porter stemming algorithm is the most widely used method. To implement a Porter stemming algorithm, import the Porter Stemmer module from NLTK:


In [35]:
from nltk.stem.porter import PorterStemmer

In [36]:
porter_stemmer = PorterStemmer()

In [37]:
porter_stemmer.stem('dividing')


Out[37]:
'divid'

Lemmatization

Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.

In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word.

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
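To see the difference concretely, here is a small side-by-side comparison (a minimal sketch; the word list is chosen just for illustration):

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'studying', 'dividing', 'meeting']:
    # The stemmer chops suffixes mechanically; the lemmatizer maps to dictionary forms
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))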

We will use a corpus to do the lemmatization. Let us download the wordnet corpus using nltk.download()


In [38]:
from nltk.stem import WordNetLemmatizer

In [39]:
wordnet_lemmatizer = WordNetLemmatizer()

In [40]:
wordnet_lemmatizer.lemmatize('are')


Out[40]:
'are'

In [41]:
wordnet_lemmatizer.lemmatize('is')


Out[41]:
'is'

But we know that the lemma of both 'are' and 'is' is 'be'. They come back unchanged because the lemmatizer treats words as nouns by default; we have to tell it they are verbs via the pos argument


In [45]:
wordnet_lemmatizer.lemmatize('dividing', pos = "v")


Out[45]:
'divide'

In [42]:
wordnet_lemmatizer.lemmatize('is',pos='v')


Out[42]:
'be'

In [46]:
def stem_title(data):
    return porter_stemmer.stem(data['title'])

In [47]:
def lemmatize_title(data):
    return wordnet_lemmatizer.lemmatize(data['title'])

In [48]:
df['stem'] = df.apply(stem_title,axis=1)

In [49]:
df.head()


Out[49]:
title date days tokens stem
0 An Exploration of R, Yelp, and the Search for ... 5 points by Rogerh91 6 hours ago | discuss 1 exploration,r,yelp,search,good,indian,food An Exploration of R, Yelp, and the Search for ...
1 Deep Advances in Generative Modeling 7 points by gwulfs 15 hours ago | 1 comment 1 deep,advances,generative,modeling Deep Advances in Generative Model
2 Spark Pipelines: Elegant Yet Powerful 3 points by aouyang1 9 hours ago | discuss 1 spark,pipelines,elegant,yet,powerful Spark Pipelines: Elegant Yet Pow
3 Shit VCs Say 3 points by Argentum01 10 hours ago | discuss 1 shit,vcs,say Shit VCs Say
4 Python, Machine Learning, and Language Wars 4 points by pmigdal 17 hours ago | discuss 1 python,machine,learning,language,wars Python, Machine Learning, and Language War

In [50]:
df['lemma'] = df.apply(lemmatize_title,axis=1)

In [51]:
df.head()


Out[51]:
title date days tokens stem lemma
0 An Exploration of R, Yelp, and the Search for ... 5 points by Rogerh91 6 hours ago | discuss 1 exploration,r,yelp,search,good,indian,food An Exploration of R, Yelp, and the Search for ... An Exploration of R, Yelp, and the Search for ...
1 Deep Advances in Generative Modeling 7 points by gwulfs 15 hours ago | 1 comment 1 deep,advances,generative,modeling Deep Advances in Generative Model Deep Advances in Generative Modeling
2 Spark Pipelines: Elegant Yet Powerful 3 points by aouyang1 9 hours ago | discuss 1 spark,pipelines,elegant,yet,powerful Spark Pipelines: Elegant Yet Pow Spark Pipelines: Elegant Yet Powerful
3 Shit VCs Say 3 points by Argentum01 10 hours ago | discuss 1 shit,vcs,say Shit VCs Say Shit VCs Say
4 Python, Machine Learning, and Language Wars 4 points by pmigdal 17 hours ago | discuss 1 python,machine,learning,language,wars Python, Machine Learning, and Language War Python, Machine Learning, and Language Wars

In [52]:
df.tail()


Out[52]:
title date days tokens stem lemma
175 Getting Started with Statistics for Data Science 3 points by nickhould 35 days ago | discuss 35 getting,started,statistics,data,science Getting Started with Statistics for Data Sci Getting Started with Statistics for Data Science
176 Rodeo 1.3 - Tab-completion for docstrings 3 points by glamp 35 days ago | discuss 35 rodeo,1,3,tab,completion,docstrings Rodeo 1.3 - Tab-completion for docstr Rodeo 1.3 - Tab-completion for docstrings
177 Teaching D3.js - links 3 points by pmigdal 35 days ago | discuss 35 teaching,d3,js,links Teaching D3.js - link Teaching D3.js - links
178 Parallel scikit-learn on YARN 5 points by stijntonk 39 days ago | discuss 39 parallel,scikit,learn,yarn Parallel scikit-learn on YARN Parallel scikit-learn on YARN
179 Meetup: Free Live Webinar on Prescriptive Anal... 2 points by ann928 32 days ago | discuss 32 meetup,free,live,webinar,prescriptive,analytic... Meetup: Free Live Webinar on Prescriptive Anal... Meetup: Free Live Webinar on Prescriptive Anal...

Note: Stemming and lemmatization matter for recall: collapsing related word forms to one stem or lemma lets a search or model match more variants of the same term.

Part of Speech (POS) tagging

https://displacy.spacy.io/displacy/index.html?full=Click+the+button+to+see+this+sentence+in+displaCy.

Let us go back to school. Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.

Part-of-speech tagging is one of the most important text analysis tasks: it classifies words by their part of speech and labels them according to a tagset, which is a collection of tags used for POS tagging. Parts of speech are also known as word classes or lexical categories. Here is the definition from Wikipedia:

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.


In [53]:
text = 'Calvin harris is a great musician'

In [54]:
text_tokens = nltk.wordpunct_tokenize(text)

In [55]:
text_tokens


Out[55]:
['Calvin', 'harris', 'is', 'a', 'great', 'musician']

We will download the averaged_perceptron_tagger model using nltk.download() to do POS tagging


In [56]:
nltk.pos_tag(text_tokens)


Out[56]:
[('Calvin', 'NNP'),
 ('harris', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('musician', 'NN')]

Tag | Meaning | English Examples

ADJ | adjective | new, good, high, special, big, local
ADP | adposition | on, of, at, with, by, into, under
ADV | adverb | really, already, still, early, now
CONJ | conjunction | and, or, but, if, while, although
DET | determiner | the, a, some, most, every, no, which
NOUN | noun | year, home, costs, time, Africa
NUM | numeral | twenty-four, fourth, 1991, 14:24
PRT | particle | at, on, out, over, per, that, up, with
PRON | pronoun | he, their, her, its, my, I, us
VERB | verb | is, say, told, given, playing, would
. | punctuation marks | . , ; !
X | other | ersatz, esprit, dunno, gr8, univeristy
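The table above lists the universal tagset. nltk.pos_tag returns Penn Treebank tags (NNP, VBZ, ...) by default, but it can map them onto this coarser tagset via the tagset argument (a quick sketch, assuming the universal_tagset resource has been fetched via nltk.download('universal_tagset')):

# Map Penn Treebank tags to the coarser universal tags
nltk.pos_tag(text_tokens, tagset='universal')
# e.g. ('is', 'VBZ') becomes ('is', 'VERB') and ('a', 'DT') becomes ('a', 'DET')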

Let us generate POS tags for each title


In [57]:
def get_pos_tags(data):
    return nltk.pos_tag(nltk.wordpunct_tokenize(data['title']))

In [58]:
df['pos_tags'] = df.apply(get_pos_tags,axis=1)

In [59]:
df.head()


Out[59]:
title date days tokens stem lemma pos_tags
0 An Exploration of R, Yelp, and the Search for ... 5 points by Rogerh91 6 hours ago | discuss 1 exploration,r,yelp,search,good,indian,food An Exploration of R, Yelp, and the Search for ... An Exploration of R, Yelp, and the Search for ... [(An, DT), (Exploration, NN), (of, IN), (R, NN...
1 Deep Advances in Generative Modeling 7 points by gwulfs 15 hours ago | 1 comment 1 deep,advances,generative,modeling Deep Advances in Generative Model Deep Advances in Generative Modeling [(Deep, JJ), (Advances, NNS), (in, IN), (Gener...
2 Spark Pipelines: Elegant Yet Powerful 3 points by aouyang1 9 hours ago | discuss 1 spark,pipelines,elegant,yet,powerful Spark Pipelines: Elegant Yet Pow Spark Pipelines: Elegant Yet Powerful [(Spark, NNP), (Pipelines, NNS), (:, :), (Eleg...
3 Shit VCs Say 3 points by Argentum01 10 hours ago | discuss 1 shit,vcs,say Shit VCs Say Shit VCs Say [(Shit, NNP), (VCs, NNP), (Say, NNP)]
4 Python, Machine Learning, and Language Wars 4 points by pmigdal 17 hours ago | discuss 1 python,machine,learning,language,wars Python, Machine Learning, and Language War Python, Machine Learning, and Language Wars [(Python, NNP), (,, ,), (Machine, NNP), (Learn...

Entity Extraction

Now, using the POS tags, we can extract entities, i.e. find the primary focus of each sentence

We already have POS tags - now all we need is chunking

The basic technique we will use for entity detection is chunking, which segments and labels multi-token sequences. Word-level tokenization and part-of-speech tagging operate on individual tokens, while chunking groups them into higher-level units; each of these groups is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens, and the pieces produced by a chunker do not overlap in the source text.

Named Entity Type | Examples

ORGANIZATION | Georgia-Pacific Corp., WHO
PERSON | Eddy Bonte, President Obama
LOCATION | Murray River, Mount Everest
DATE | June, 2008-06-29
TIME | two fifty a m, 1:30 p.m.
MONEY | 175 million Canadian Dollars, GBP 10.40
PERCENT | twenty pct, 18.75 %
FACILITY | Washington Monument, Stonehenge
GPE | South East Asia, Midlothian

To do the entity identification, we will download the maxent_ne_chunker and words corpora using nltk.download()
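Note that with binary=True every entity is labelled simply 'NE'. As an aside, passing binary=False instead makes ne_chunk assign the typed labels from the table above (PERSON, GPE, ORGANIZATION, ...); a quick sketch:

typed_tree = nltk.ne_chunk(df.pos_tags[0], binary=False)
for node in typed_tree:
    if isinstance(node, nltk.tree.Tree):
        # node.label() is now an entity type such as PERSON or GPE instead of 'NE'
        print(node.label(), ' '.join(word for word, tag in node.leaves()))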


In [60]:
df.pos_tags[0]


Out[60]:
[('An', 'DT'),
 ('Exploration', 'NN'),
 ('of', 'IN'),
 ('R', 'NNP'),
 (',', ','),
 ('Yelp', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('Search', 'NNP'),
 ('for', 'IN'),
 ('Good', 'NNP'),
 ('Indian', 'NNP'),
 ('Food', 'NNP')]

In [61]:
ne_tree = nltk.ne_chunk(df.pos_tags[0],binary=True)
# ne_tree

In [ ]:
# for x in ne_tree:
#    print(x)

In [ ]:
# We want only the NE chunks. Printing the type shows each chunk is an nltk Tree,
# so we iterate over the tree and pull out the NE subtrees.

In [62]:
for x in ne_tree:
    print(type(x),x)
    if type(x) == nltk.tree.Tree:
        if(x.label()) == 'NE':
            print(x)


<class 'tuple'> ('An', 'DT')
<class 'tuple'> ('Exploration', 'NN')
<class 'tuple'> ('of', 'IN')
<class 'tuple'> ('R', 'NNP')
<class 'tuple'> (',', ',')
<class 'nltk.tree.Tree'> (NE Yelp/NNP)
(NE Yelp/NNP)
<class 'tuple'> (',', ',')
<class 'tuple'> ('and', 'CC')
<class 'tuple'> ('the', 'DT')
<class 'tuple'> ('Search', 'NNP')
<class 'tuple'> ('for', 'IN')
<class 'nltk.tree.Tree'> (NE Good/NNP Indian/NNP Food/NNP)
(NE Good/NNP Indian/NNP Food/NNP)

In [63]:
def get_entities(row):
    entities = []
    # Chunk the POS-tagged tokens; binary=True labels every named entity as 'NE'
    chunked_tree = nltk.ne_chunk(row.pos_tags, binary=True)
    for nodes in chunked_tree:
        if type(nodes) == nltk.tree.Tree:
            if nodes.label() == 'NE':
                print("Before zip", nodes.leaves())
                # zip(*leaves) transposes the (word, tag) pairs into (words, tags)
                zipped_list = list(zip(*nodes.leaves()))
                print("After zip", zipped_list)
                entities.append(' '.join(zipped_list[0]))
    return entities

In [64]:
df['named_entities'] = df.apply(get_entities,axis=1)


Before zip [('Yelp', 'NNP')]
After zip [('Yelp',), ('NNP',)]
Before zip [('Good', 'NNP'), ('Indian', 'NNP'), ('Food', 'NNP')]
After zip [('Good', 'Indian', 'Food'), ('NNP', 'NNP', 'NNP')]
Before zip [('Generative', 'NNP'), ('Modeling', 'NNP')]
After zip [('Generative', 'Modeling'), ('NNP', 'NNP')]
Before zip [('Spark', 'NNP')]
After zip [('Spark',), ('NNP',)]
Before zip [('Shit', 'NNP')]
After zip [('Shit',), ('NNP',)]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Machine', 'NNP'), ('Learning', 'NNP')]
After zip [('Machine', 'Learning'), ('NNP', 'NNP')]
Before zip [('Language', 'NNP'), ('Wars', 'NNP')]
After zip [('Language', 'Wars'), ('NNP', 'NNP')]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Markov', 'NNP'), ('Chains', 'NNP')]
After zip [('Markov', 'Chains'), ('NNP', 'NNP')]
Before zip [('Visually', 'NNP')]
After zip [('Visually',), ('NNP',)]
Before zip [('Dplython', 'NN')]
After zip [('Dplython',), ('NN',)]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Bayesian', 'JJ')]
After zip [('Bayesian',), ('JJ',)]
Before zip [('Amazon', 'NNP')]
After zip [('Amazon',), ('NNP',)]
Before zip [('Spark', 'NNP')]
After zip [('Spark',), ('NNP',)]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Data', 'NNP'), ('Structures', 'NNP')]
After zip [('Data', 'Structures'), ('NNP', 'NNP')]
Before zip [('Algorithms', 'NNP')]
After zip [('Algorithms',), ('NNP',)]
Before zip [('Lift', 'NNP')]
After zip [('Lift',), ('NNP',)]
Before zip [('Data', 'NNP'), ('Science', 'NNP'), ('Side', 'NNP')]
After zip [('Data', 'Science', 'Side'), ('NNP', 'NNP', 'NNP')]
Before zip [('Write', 'NNP')]
After zip [('Write',), ('NNP',)]
Before zip [('Simple', 'JJ')]
After zip [('Simple',), ('JJ',)]
Before zip [('Computer', 'NNP'), ('Vision', 'NNP')]
After zip [('Computer', 'Vision'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Engineering', 'NNP')]
After zip [('Data', 'Engineering'), ('NNP', 'NNP')]
Before zip [('Slack', 'NNP')]
After zip [('Slack',), ('NNP',)]
Before zip [('Pandas', 'NNP')]
After zip [('Pandas',), ('NNP',)]
Before zip [('Datumbox', 'NNP'), ('Machine', 'NNP')]
After zip [('Datumbox', 'Machine'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP')]
After zip [('Data',), ('NNP',)]
Before zip [('Neural', 'JJ'), ('Networks', 'NNP')]
After zip [('Neural', 'Networks'), ('JJ', 'NNP')]
Before zip [('Apple', 'NNP'), ('Watch', 'NNP')]
After zip [('Apple', 'Watch'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Science', 'NNP'), ('Tools', 'NNP')]
After zip [('Data', 'Science', 'Tools'), ('NNP', 'NNP', 'NNP')]
Before zip [('Biggest', 'NNP'), ('Winners', 'NNPS')]
After zip [('Biggest', 'Winners'), ('NNP', 'NNPS')]
Before zip [('Open', 'NNP'), ('Source', 'NNP'), ('Machine', 'NNP')]
After zip [('Open', 'Source', 'Machine'), ('NNP', 'NNP', 'NNP')]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Scikit', 'NNP'), ('Flow', 'NNP')]
After zip [('Scikit', 'Flow'), ('NNP', 'NNP')]
Before zip [('XGBoost4J', 'NN')]
After zip [('XGBoost4J',), ('NN',)]
Before zip [('Spark', 'NNP')]
After zip [('Spark',), ('NNP',)]
Before zip [('Flink', 'NNP')]
After zip [('Flink',), ('NNP',)]
Before zip [('Dataflow', 'NNP')]
After zip [('Dataflow',), ('NNP',)]
Before zip [('Deep', 'NNP'), ('Roots', 'NNP')]
After zip [('Deep', 'Roots'), ('NNP', 'NNP')]
Before zip [('Javascript', 'NNP'), ('Fatigue', 'NNP')]
After zip [('Javascript', 'Fatigue'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Tau', 'NNP')]
After zip [('Data', 'Tau'), ('NNP', 'NNP')]
Before zip [('Machine', 'NN'), ('Learning', 'NNP')]
After zip [('Machine', 'Learning'), ('NN', 'NNP')]
Before zip [('Non', 'NNP')]
After zip [('Non',), ('NNP',)]
Before zip [('Technical', 'NNP'), ('Guide', 'NNP')]
After zip [('Technical', 'Guide'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Science', 'NNP'), ('Slack', 'NNP')]
After zip [('Data', 'Science', 'Slack'), ('NNP', 'NNP', 'NNP')]
Before zip [('Click', 'NN')]
After zip [('Click',), ('NN',)]
Before zip [('Intellexer', 'NNP')]
After zip [('Intellexer',), ('NNP',)]
Before zip [('Natural', 'JJ'), ('Language', 'NNP')]
After zip [('Natural', 'Language'), ('JJ', 'NNP')]
Before zip [('Text', 'NNP'), ('Mining', 'NNP')]
After zip [('Text', 'Mining'), ('NNP', 'NNP')]
Before zip [('SQL', 'NNP')]
After zip [('SQL',), ('NNP',)]
Before zip [('Genomic', 'NNP'), ('Data', 'NNP')]
After zip [('Genomic', 'Data'), ('NNP', 'NNP')]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Cohort', 'NNP'), ('Data', 'NNP')]
After zip [('Cohort', 'Data'), ('NNP', 'NNP')]
Before zip [('Analyze', 'NNP'), ('User', 'NNP'), ('Behavior', 'NNP')]
After zip [('Analyze', 'User', 'Behavior'), ('NNP', 'NNP', 'NNP')]
Before zip [('Show', 'NNP')]
After zip [('Show',), ('NNP',)]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Apache', 'NNP'), ('Spark', 'NNP')]
After zip [('Apache', 'Spark'), ('NNP', 'NNP')]
Before zip [('Numerical', 'NNP'), ('Computing', 'NNP')]
After zip [('Numerical', 'Computing'), ('NNP', 'NNP')]
Before zip [('Rice', 'NNP'), ('Cooker', 'NNP')]
After zip [('Rice', 'Cooker'), ('NNP', 'NNP')]
Before zip [('Smartest', 'NNP'), ('Kitchen', 'NNP')]
After zip [('Smartest', 'Kitchen'), ('NNP', 'NNP')]
Before zip [('Golden', 'NNP'), ('State', 'NNP'), ('Warriors', 'NNP')]
After zip [('Golden', 'State', 'Warriors'), ('NNP', 'NNP', 'NNP')]
Before zip [('GraphFrames', 'NNP')]
After zip [('GraphFrames',), ('NNP',)]
Before zip [('Spark', 'NNP')]
After zip [('Spark',), ('NNP',)]
Before zip [('Megaman', 'NN')]
After zip [('Megaman',), ('NN',)]
Before zip [('Millions', 'NNP')]
After zip [('Millions',), ('NNP',)]
Before zip [('Outliers', 'NNP')]
After zip [('Outliers',), ('NNP',)]
Before zip [('Parametric', 'NNP')]
After zip [('Parametric',), ('NNP',)]
Before zip [('Non', 'NNP'), ('Parametric', 'NNP'), ('Methods', 'NNP')]
After zip [('Non', 'Parametric', 'Methods'), ('NNP', 'NNP', 'NNP')]
Before zip [('BallR', 'NN')]
After zip [('BallR',), ('NN',)]
Before zip [('NBA', 'NNP'), ('Shot', 'NNP')]
After zip [('NBA', 'Shot'), ('NNP', 'NNP')]
Before zip [('Shiny', 'NNP')]
After zip [('Shiny',), ('NNP',)]
Before zip [('Amazon', 'NNP')]
After zip [('Amazon',), ('NNP',)]
Before zip [('Minecraft', 'NN')]
After zip [('Minecraft',), ('NN',)]
Before zip [('Deep', 'NNP')]
After zip [('Deep',), ('NNP',)]
Before zip [('Space', 'NNP'), ('Invaders', 'NNP')]
After zip [('Space', 'Invaders'), ('NNP', 'NNP')]
Before zip [('Theano', 'NNP'), ('Tutorial', 'NNP')]
After zip [('Theano', 'Tutorial'), ('NNP', 'NNP')]
Before zip [('Personality', 'NNP'), ('Space', 'NNP')]
After zip [('Personality', 'Space'), ('NNP', 'NNP')]
Before zip [('Cartoon', 'NNP')]
After zip [('Cartoon',), ('NNP',)]
Before zip [('Apache', 'NNP')]
After zip [('Apache',), ('NNP',)]
Before zip [('Telemetry', 'NN')]
After zip [('Telemetry',), ('NN',)]
Before zip [('Collectd', 'NNP')]
After zip [('Collectd',), ('NNP',)]
Before zip [('Logstash', 'NNP')]
After zip [('Logstash',), ('NNP',)]
Before zip [('Elasticsearch', 'NNP')]
After zip [('Elasticsearch',), ('NNP',)]
Before zip [('Grafana', 'NNP')]
After zip [('Grafana',), ('NNP',)]
Before zip [('ELG', 'NNP')]
After zip [('ELG',), ('NNP',)]
Before zip [('Bayesian', 'JJ'), ('Reasoning', 'NNP')]
After zip [('Bayesian', 'Reasoning'), ('JJ', 'NNP')]
Before zip [('Twilight', 'NNP'), ('Zone', 'NN')]
After zip [('Twilight', 'Zone'), ('NNP', 'NN')]
Before zip [('Bayesian', 'JJ'), ('Estimation', 'NNP')]
After zip [('Bayesian', 'Estimation'), ('JJ', 'NNP')]
Before zip [('XGBoost', 'NN')]
After zip [('XGBoost',), ('NN',)]
Before zip [('Hadoop', 'NNP'), ('Pseudo', 'NNP')]
After zip [('Hadoop', 'Pseudo'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Science', 'NNP'), ('Pop', 'NNP')]
After zip [('Data', 'Science', 'Pop'), ('NNP', 'NNP', 'NNP')]
Before zip [('Austin', 'NNP')]
After zip [('Austin',), ('NNP',)]
Before zip [('TensorFlow', 'NNP')]
After zip [('TensorFlow',), ('NNP',)]
Before zip [('Shiny', 'NNP')]
After zip [('Shiny',), ('NNP',)]
Before zip [('Tensorflow', 'NNP')]
After zip [('Tensorflow',), ('NNP',)]
Before zip [('File', 'NN')]
After zip [('File',), ('NN',)]
Before zip [('All', 'NNP'), ('Data', 'NNP'), ('Engineers', 'NNP'), ('Should', 'NNP'), ('Know', 'NNP')]
After zip [('All', 'Data', 'Engineers', 'Should', 'Know'), ('NNP', 'NNP', 'NNP', 'NNP', 'NNP')]
Before zip [('Topic', 'NN')]
After zip [('Topic',), ('NN',)]
Before zip [('TF', 'NNP')]
After zip [('TF',), ('NNP',)]
Before zip [('IDF', 'NNP')]
After zip [('IDF',), ('NNP',)]
Before zip [('Spark', 'NNP')]
After zip [('Spark',), ('NNP',)]
Before zip [('Scala', 'NNP')]
After zip [('Scala',), ('NNP',)]
Before zip [('Next', 'JJ'), ('Generation', 'NNP')]
After zip [('Next', 'Generation'), ('JJ', 'NNP')]
Before zip [('Graph', 'NNP')]
After zip [('Graph',), ('NNP',)]
Before zip [('DataRadar', 'NNP')]
After zip [('DataRadar',), ('NNP',)]
Before zip [('IO', 'NNP')]
After zip [('IO',), ('NNP',)]
Before zip [('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Data', 'Science'), ('NNP', 'NNP')]
Before zip [('International', 'NNP'), ('Women', 'NNP')]
After zip [('International', 'Women'), ('NNP', 'NNP')]
Before zip [('PledgeForParity', 'NN'), ('Means', 'NNPS')]
After zip [('PledgeForParity', 'Means'), ('NN', 'NNPS')]
Before zip [('Ask', 'NNP')]
After zip [('Ask',), ('NNP',)]
Before zip [('Better', 'NNP'), ('Insights', 'NNPS'), ('From', 'NNP'), ('Time', 'NNP'), ('Series', 'NNP'), ('Data', 'NNP')]
After zip [('Better', 'Insights', 'From', 'Time', 'Series', 'Data'), ('NNP', 'NNPS', 'NNP', 'NNP', 'NNP', 'NNP')]
Before zip [('Cycle', 'NNP'), ('Plots', 'NNP')]
After zip [('Cycle', 'Plots'), ('NNP', 'NNP')]
Before zip [('SQL', 'NNP')]
After zip [('SQL',), ('NNP',)]
Before zip [('Data', 'NNP'), ('Analysis', 'NNP')]
After zip [('Data', 'Analysis'), ('NNP', 'NNP')]
Before zip [('Stream', 'NN')]
After zip [('Stream',), ('NN',)]
Before zip [('IoT', 'NNP')]
After zip [('IoT',), ('NNP',)]
Before zip [('Visual', 'NNP'), ('Studio', 'NNP')]
After zip [('Visual', 'Studio'), ('NNP', 'NNP')]
Before zip [('Skizze', 'NNP')]
After zip [('Skizze',), ('NNP',)]
Before zip [('Genomic', 'NNP'), ('Ranges', 'NNP')]
After zip [('Genomic', 'Ranges'), ('NNP', 'NNP')]
Before zip [('Genomic', 'NNP'), ('Data', 'NNP')]
After zip [('Genomic', 'Data'), ('NNP', 'NNP')]
Before zip [('TensorFlow', 'NNP')]
After zip [('TensorFlow',), ('NNP',)]
Before zip [('Even', 'NNP'), ('Less', 'NNP'), ('Supervision', 'NNP'), ('Using', 'NNP'), ('Bayesian', 'NNP')]
After zip [('Even', 'Less', 'Supervision', 'Using', 'Bayesian'), ('NNP', 'NNP', 'NNP', 'NNP', 'NNP')]
Before zip [('JSON', 'NNP')]
After zip [('JSON',), ('NNP',)]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Pandas', 'NNP')]
After zip [('Pandas',), ('NNP',)]
Before zip [('DrivenData', 'NNP')]
After zip [('DrivenData',), ('NNP',)]
Before zip [('Morocco', 'NNP')]
After zip [('Morocco',), ('NNP',)]
Before zip [('Deep', 'JJ')]
After zip [('Deep',), ('JJ',)]
Before zip [('Coll', 'NNP')]
After zip [('Coll',), ('NNP',)]
Before zip [('France', 'NNP')]
After zip [('France',), ('NNP',)]
Before zip [('Yan', 'NNP')]
After zip [('Yan',), ('NNP',)]
Before zip [('Facebook', 'NNP'), ('Campaigns', 'NNP')]
After zip [('Facebook', 'Campaigns'), ('NNP', 'NNP')]
Before zip [('Trump', 'NNP'), ('Tweets', 'NNP')]
After zip [('Trump', 'Tweets'), ('NNP', 'NNP')]
Before zip [('Twitter', 'NNP')]
After zip [('Twitter',), ('NNP',)]
Before zip [('Apache', 'NNP'), ('Arrow', 'NNP')]
After zip [('Apache', 'Arrow'), ('NNP', 'NNP')]
Before zip [('Histogram', 'NNP')]
After zip [('Histogram',), ('NNP',)]
Before zip [('TensorFlow', 'NNP')]
After zip [('TensorFlow',), ('NNP',)]
Before zip [('Regression', 'NN')]
After zip [('Regression',), ('NN',)]
Before zip [('Examples', 'NNP')]
After zip [('Examples',), ('NNP',)]
Before zip [('Free', 'JJ')]
After zip [('Free',), ('JJ',)]
Before zip [('Don', 'NNP')]
After zip [('Don',), ('NNP',)]
Before zip [('Work', 'NN')]
After zip [('Work',), ('NN',)]
Before zip [('FlyElephant', 'NNP')]
After zip [('FlyElephant',), ('NNP',)]
Before zip [('XML', 'NN')]
After zip [('XML',), ('NN',)]
Before zip [('Cricket', 'NNP'), ('Player', 'NNP'), ('Careers', 'NNP')]
After zip [('Cricket', 'Player', 'Careers'), ('NNP', 'NNP', 'NNP')]
Before zip [('Generate', 'NNP')]
After zip [('Generate',), ('NNP',)]
Before zip [('Super', 'NNP'), ('Bowl', 'NNP')]
After zip [('Super', 'Bowl'), ('NNP', 'NNP')]
Before zip [('Twython', 'NNP')]
After zip [('Twython',), ('NNP',)]
Before zip [('Twitter', 'NNP'), ('API', 'NNP')]
After zip [('Twitter', 'API'), ('NNP', 'NNP')]
Before zip [('AYLIEN', 'NNP')]
After zip [('AYLIEN',), ('NNP',)]
Before zip [('Watch', 'NNP'), ('Tiny', 'NNP'), ('Neural', 'NNP'), ('Nets', 'NNP'), ('Learn', 'NNP')]
After zip [('Watch', 'Tiny', 'Neural', 'Nets', 'Learn'), ('NNP', 'NNP', 'NNP', 'NNP', 'NNP')]
Before zip [('Convolutional', 'NNP'), ('Networks', 'NNP')]
After zip [('Convolutional', 'Networks'), ('NNP', 'NNP')]
Before zip [('Models', 'NNP')]
After zip [('Models',), ('NNP',)]
Before zip [('True', 'JJ'), ('Love', 'NNP')]
After zip [('True', 'Love'), ('JJ', 'NNP')]
Before zip [('PyLearn2', 'NN')]
After zip [('PyLearn2',), ('NN',)]
Before zip [('Density', 'NNP'), ('Estimation', 'NNP')]
After zip [('Density', 'Estimation'), ('NNP', 'NNP')]
Before zip [('Dirichlet', 'NNP'), ('Process', 'NNP'), ('Mixtures', 'NNP')]
After zip [('Dirichlet', 'Process', 'Mixtures'), ('NNP', 'NNP', 'NNP')]
Before zip [('PyMC3', 'NNP')]
After zip [('PyMC3',), ('NNP',)]
Before zip [('Flint', 'NNP'), ('Michigan', 'NNP'), ('Water', 'NNP')]
After zip [('Flint', 'Michigan', 'Water'), ('NNP', 'NNP', 'NNP')]
Before zip [('Republican', 'JJ'), ('Twitter', 'NNP'), ('Follower', 'NNP')]
After zip [('Republican', 'Twitter', 'Follower'), ('JJ', 'NNP', 'NNP')]
Before zip [('GloVe', 'NNP')]
After zip [('GloVe',), ('NNP',)]
Before zip [('Undergrad', 'NNP'), ('Data', 'NNP'), ('Analysis', 'NNP')]
After zip [('Undergrad', 'Data', 'Analysis'), ('NNP', 'NNP', 'NNP')]
Before zip [('Statistical', 'NNP'), ('Significance', 'NNP')]
After zip [('Statistical', 'Significance'), ('NNP', 'NNP')]
Before zip [('Growth', 'NNP'), ('Hacking', 'NNP')]
After zip [('Growth', 'Hacking'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Science', 'NNP'), ('Course', 'NNP')]
After zip [('Data', 'Science', 'Course'), ('NNP', 'NNP', 'NNP')]
Before zip [('Principal', 'NNP')]
After zip [('Principal',), ('NNP',)]
Before zip [('Machine', 'NN'), ('Learning', 'NNP')]
After zip [('Machine', 'Learning'), ('NN', 'NNP')]
Before zip [('Non', 'NNP')]
After zip [('Non',), ('NNP',)]
Before zip [('Technical', 'NNP'), ('Guide', 'NNP')]
After zip [('Technical', 'Guide'), ('NNP', 'NNP')]
Before zip [('Stochastic', 'JJ'), ('Dummy', 'NNP')]
After zip [('Stochastic', 'Dummy'), ('JJ', 'NNP')]
Before zip [('Hong', 'NNP')]
After zip [('Hong',), ('NNP',)]
Before zip [('Kong', 'NNP')]
After zip [('Kong',), ('NNP',)]
Before zip [('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Data', 'Science'), ('NNP', 'NNP')]
Before zip [('Monsanto', 'NNP')]
After zip [('Monsanto',), ('NNP',)]
Before zip [('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Data', 'Science'), ('NNP', 'NNP')]
Before zip [('Instacart', 'NNP')]
After zip [('Instacart',), ('NNP',)]
Before zip [('Kafka', 'NNP'), ('Producer', 'NNP'), ('Latency', 'NNP')]
After zip [('Kafka', 'Producer', 'Latency'), ('NNP', 'NNP', 'NNP')]
Before zip [('Large', 'NNP'), ('Topic', 'NNP'), ('Counts', 'NNP')]
After zip [('Large', 'Topic', 'Counts'), ('NNP', 'NNP', 'NNP')]
Before zip [('Sneak', 'NNP'), ('Peak', 'NNP')]
After zip [('Sneak', 'Peak'), ('NNP', 'NNP')]
Before zip [('Win', 'NNP')]
After zip [('Win',), ('NNP',)]
Before zip [('Vector', 'NNP')]
After zip [('Vector',), ('NNP',)]
Before zip [('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Data', 'Science'), ('NNP', 'NNP')]
Before zip [('Machine', 'NNP'), ('Learning', 'NNP'), ('Cheat', 'NNP')]
After zip [('Machine', 'Learning', 'Cheat'), ('NNP', 'NNP', 'NNP')]
Before zip [('Reason', 'NNP')]
After zip [('Reason',), ('NNP',)]
Before zip [('Deep', 'NNP'), ('Learning', 'NNP')]
After zip [('Deep', 'Learning'), ('NNP', 'NNP')]
Before zip [('Visual', 'JJ'), ('Logic', 'NNP')]
After zip [('Visual', 'Logic'), ('JJ', 'NNP')]
Before zip [('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Data', 'Science'), ('NNP', 'NNP')]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('US', 'NNP')]
After zip [('US',), ('NNP',)]
Before zip [('Caffe', 'NNP')]
After zip [('Caffe',), ('NNP',)]
Before zip [('Spark', 'NNP')]
After zip [('Spark',), ('NNP',)]
Before zip [('Ethical', 'NNP'), ('Data', 'NNP')]
After zip [('Ethical', 'Data'), ('NNP', 'NNP')]
Before zip [('Frequently', 'NNP'), ('Asked', 'NNP')]
After zip [('Frequently', 'Asked'), ('NNP', 'NNP')]
Before zip [('Machine', 'NNP'), ('Learning', 'NNP')]
After zip [('Machine', 'Learning'), ('NNP', 'NNP')]
Before zip [('Intro', 'NNP')]
After zip [('Intro',), ('NNP',)]
Before zip [('Eric', 'NNP'), ('Xing', 'NNP')]
After zip [('Eric', 'Xing'), ('NNP', 'NNP')]
Before zip [('CMU', 'NNP')]
After zip [('CMU',), ('NNP',)]
Before zip [('Sense2vec', 'NN')]
After zip [('Sense2vec',), ('NN',)]
Before zip [('spaCy', 'NN')]
After zip [('spaCy',), ('NN',)]
Before zip [('Gensim', 'NNP')]
After zip [('Gensim',), ('NNP',)]
Before zip [('AWS', 'NNP'), ('Redshift', 'NNP')]
After zip [('AWS', 'Redshift'), ('NNP', 'NNP')]
Before zip [('Code', 'NNP')]
After zip [('Code',), ('NNP',)]
Before zip [('Understand', 'NNP'), ('DeepMind', 'NNP')]
After zip [('Understand', 'DeepMind'), ('NNP', 'NNP')]
Before zip [('Python', 'NNP')]
After zip [('Python',), ('NNP',)]
Before zip [('Bayesian', 'JJ')]
After zip [('Bayesian',), ('JJ',)]
Before zip [('Julia', 'NNP')]
After zip [('Julia',), ('NNP',)]
Before zip [('IBM', 'NNP')]
After zip [('IBM',), ('NNP',)]
Before zip [('Apache', 'NNP'), ('Spark', 'NNP'), ('Online', 'NNP')]
After zip [('Apache', 'Spark', 'Online'), ('NNP', 'NNP', 'NNP')]
Before zip [('Geographic', 'NNP'), ('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Geographic', 'Data', 'Science'), ('NNP', 'NNP', 'NNP')]
Before zip [('Daily', 'NNP'), ('Mail', 'NNP'), ('Stole', 'NNP'), ('My', 'NNP'), ('Visualization', 'NNP')]
After zip [('Daily', 'Mail', 'Stole', 'My', 'Visualization'), ('NNP', 'NNP', 'NNP', 'NNP', 'NNP')]
Before zip [('Machine', 'NNP'), ('Learning', 'NNP'), ('Results', 'NNP')]
After zip [('Machine', 'Learning', 'Results'), ('NNP', 'NNP', 'NNP')]
Before zip [('Apache', 'NNP'), ('Spark', 'NNP')]
After zip [('Apache', 'Spark'), ('NNP', 'NNP')]
Before zip [('MachineJS', 'NN')]
After zip [('MachineJS',), ('NN',)]
Before zip [('NSA', 'NNP')]
After zip [('NSA',), ('NNP',)]
Before zip [('Oscars', 'NNP'), ('Pool', 'NNP')]
After zip [('Oscars', 'Pool'), ('NNP', 'NNP')]
Before zip [('LIGO', 'NNP')]
After zip [('LIGO',), ('NNP',)]
Before zip [('Overview', 'NN')]
After zip [('Overview',), ('NN',)]
Before zip [('DeZyre', 'NNP')]
After zip [('DeZyre',), ('NNP',)]
Before zip [('Coursera', 'NNP'), ('Data', 'NNP'), ('Science', 'NNP'), ('Course', 'NNP')]
After zip [('Coursera', 'Data', 'Science', 'Course'), ('NNP', 'NNP', 'NNP', 'NNP')]
Before zip [('Datathon', 'NNP')]
After zip [('Datathon',), ('NNP',)]
Before zip [('NYC', 'NNP')]
After zip [('NYC',), ('NNP',)]
Before zip [('SQL', 'NNP')]
After zip [('SQL',), ('NNP',)]
Before zip [('Highly', 'NNP')]
After zip [('Highly',), ('NNP',)]
Before zip [('Bayesian', 'JJ')]
After zip [('Bayesian',), ('JJ',)]
Before zip [('Auto', 'NNP')]
After zip [('Auto',), ('NNP',)]
Before zip [('Spark', 'NNP')]
After zip [('Spark',), ('NNP',)]
Before zip [('Machine', 'NN'), ('Learning', 'NNP')]
After zip [('Machine', 'Learning'), ('NN', 'NNP')]
Before zip [('Non', 'NNP')]
After zip [('Non',), ('NNP',)]
Before zip [('Technical', 'NNP'), ('Guide', 'NNP')]
After zip [('Technical', 'Guide'), ('NNP', 'NNP')]
Before zip [('Webhose', 'NNP')]
After zip [('Webhose',), ('NNP',)]
Before zip [('Meetup', 'NN')]
After zip [('Meetup',), ('NN',)]
Before zip [('Machine', 'NNP'), ('Learning', 'NNP'), ('Algorithms', 'NNP')]
After zip [('Machine', 'Learning', 'Algorithms'), ('NNP', 'NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Data', 'Science'), ('NNP', 'NNP')]
Before zip [('Language', 'NNP'), ('Modeling', 'NNP')]
After zip [('Language', 'Modeling'), ('NNP', 'NNP')]
Before zip [('Text', 'NNP'), ('Mining', 'NNP'), ('South', 'NNP'), ('Park', 'NNP')]
After zip [('Text', 'Mining', 'South', 'Park'), ('NNP', 'NNP', 'NNP', 'NNP')]
Before zip [('Parametric', 'NNP'), ('Bootstrap', 'NNP')]
After zip [('Parametric', 'Bootstrap'), ('NNP', 'NNP')]
Before zip [('Data', 'NNP'), ('Science', 'NNP')]
After zip [('Data', 'Science'), ('NNP', 'NNP')]
Before zip [('Tab', 'NNP')]
After zip [('Tab',), ('NNP',)]
Before zip [('Parallel', 'NNP')]
After zip [('Parallel',), ('NNP',)]
Before zip [('YARN', 'NN')]
After zip [('YARN',), ('NN',)]
Before zip [('Meetup', 'NN')]
After zip [('Meetup',), ('NN',)]
Before zip [('Fun', 'NNP')]
After zip [('Fun',), ('NNP',)]

In [65]:
df.head()


Out[65]:
title date days tokens stem lemma pos_tags named_entities
0 An Exploration of R, Yelp, and the Search for ... 5 points by Rogerh91 6 hours ago | discuss 1 exploration,r,yelp,search,good,indian,food An Exploration of R, Yelp, and the Search for ... An Exploration of R, Yelp, and the Search for ... [(An, DT), (Exploration, NN), (of, IN), (R, NN... [Yelp, Good Indian Food]
1 Deep Advances in Generative Modeling 7 points by gwulfs 15 hours ago | 1 comment 1 deep,advances,generative,modeling Deep Advances in Generative Model Deep Advances in Generative Modeling [(Deep, JJ), (Advances, NNS), (in, IN), (Gener... [Generative Modeling]
2 Spark Pipelines: Elegant Yet Powerful 3 points by aouyang1 9 hours ago | discuss 1 spark,pipelines,elegant,yet,powerful Spark Pipelines: Elegant Yet Pow Spark Pipelines: Elegant Yet Powerful [(Spark, NNP), (Pipelines, NNS), (:, :), (Eleg... [Spark]
3 Shit VCs Say 3 points by Argentum01 10 hours ago | discuss 1 shit,vcs,say Shit VCs Say Shit VCs Say [(Shit, NNP), (VCs, NNP), (Say, NNP)] [Shit]
4 Python, Machine Learning, and Language Wars 4 points by pmigdal 17 hours ago | discuss 1 python,machine,learning,language,wars Python, Machine Learning, and Language War Python, Machine Learning, and Language Wars [(Python, NNP), (,, ,), (Machine, NNP), (Learn... [Python, Machine Learning, Language Wars]

Now that we have entities, we can understand the titles better
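For example, we can tally the most frequent entities across all the titles to see what the community talks about most (a minimal sketch using collections.Counter):

from collections import Counter

# Flatten the per-title entity lists and count occurrences
entity_counts = Counter(
    entity
    for entities in df['named_entities']
    for entity in entities
)
entity_counts.most_common(10)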


In [ ]:
df.tail()

In [ ]:
df.to_csv('data_tau_ta.csv',index=False)

In [ ]:


In [ ]: