In [1]:
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
%matplotlib inline
In [3]:
df = pd.read_csv('data_tau_days.csv')
In [4]:
df.head()
Out[4]:
In [5]:
import nltk
from wordcloud import WordCloud
In [6]:
sentence = df["title"][0]
sentence
Out[6]:
In [7]:
tokens = nltk.wordpunct_tokenize(sentence)
tokens
Out[7]:
Let us take all the sentences in the dataframe, tokenize them into words, and get a frequency count of each word
In [8]:
frequency_words = {}
In [9]:
for data in df['title']:
    tokens = nltk.wordpunct_tokenize(data)
    for token in tokens:
        # Increment the running count for tokens we have seen before
        if token in frequency_words:
            count = frequency_words[token]
            count = count + 1
            frequency_words[token] = count
        else:
            frequency_words[token] = 1
In [10]:
# Let us see the count for each word occurring
frequency_words
Out[10]:
In [11]:
# Creating a Wordcloud
wordcloud = WordCloud()
In [12]:
# Newer versions of wordcloud expect a dict of word -> count rather than .items()
wordcloud.generate_from_frequencies(frequency_words)
Out[12]:
In [13]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Question - What are the two issues with this wordcloud?
In [14]:
# Convert the dict to a dataframe
freq = pd.DataFrame.from_dict(frequency_words, orient = 'index')
In [15]:
# Let us sort them in descending order
freq.sort_values(by = 0, ascending=False).head(10)
Out[15]:
Stop words are words which are filtered out before or after processing of natural language data. Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
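As a minimal sketch of the idea on a single sentence (assuming the NLTK stopwords corpus has already been fetched with nltk.download('stopwords')):
In [ ]:
# A minimal stop word filtering sketch on one sentence
# (assumes nltk.download('stopwords') has been run).
from nltk.corpus import stopwords

english_stop = set(stopwords.words('english'))
words = nltk.wordpunct_tokenize("This is a simple example of removing stop words")
print([w for w in words if w.lower() not in english_stop])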
In [19]:
from nltk.corpus import stopwords
# nltk.download()
In [20]:
stop = stopwords.words('english')
In [21]:
stop[0:10]
Out[21]:
We will recreate the word frequencies with two additional steps: lowercasing each token for the comparison and skipping stop words
In [23]:
frequency_words_wo_stop = {}
for data in df['title']:
    tokens = nltk.wordpunct_tokenize(data)
    for token in tokens:
        # Compare the lowercased token against the stop list and skip stop words
        if token.lower() not in stop:
            if token in frequency_words_wo_stop:
                count = frequency_words_wo_stop[token]
                count = count + 1
                frequency_words_wo_stop[token] = count
            else:
                frequency_words_wo_stop[token] = 1
In [24]:
wordcloud.generate_from_frequencies(frequency_words_wo_stop)
Out[24]:
In [25]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
We can also extend the stop word list with common punctuation marks to remove those as well
In [28]:
stop.extend(('.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}','/','-'))
In [29]:
frequency_words_wo_stop = {}
In [30]:
def generate_word_frequency(row):
    # Tokenize the title, skip stop words (comparing in lowercase) and
    # count each remaining lowercased token across the whole dataframe.
    data = row['title']
    tokens = nltk.wordpunct_tokenize(data)
    token_list = []
    for token in tokens:
        if token.lower() not in stop:
            token_list.append(token.lower())
            if token.lower() in frequency_words_wo_stop:
                count = frequency_words_wo_stop[token.lower()]
                count = count + 1
                frequency_words_wo_stop[token.lower()] = count
            else:
                frequency_words_wo_stop[token.lower()] = 1
    return ','.join(token_list)
The apply function takes a function as its input and applies it across all the rows or columns of the DataFrame.
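As a quick hedged illustration on a toy DataFrame (separate from our df), axis=1 passes each row to the function:
In [ ]:
# Toy illustration of DataFrame.apply: axis=1 passes each row to the
# function; axis=0 would pass each column instead.
toy = pd.DataFrame({'title': ['Hello world', 'Calvin harris is a great musician']})
toy['word_count'] = toy.apply(lambda row: len(row['title'].split()), axis=1)
toy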
In [31]:
df['tokens'] = df.apply(generate_word_frequency,axis=1)
In [32]:
df.head()
Out[32]:
In [33]:
wordcloud.generate_from_frequencies(frequency_words_wo_stop)
Out[33]:
In [34]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Exercise: Find the frequency count for each word without stop words (one possible approach is sketched below)
In [ ]:
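One possible approach, sketched here, reuses the earlier pattern of converting the dictionary to a DataFrame and sorting it:
In [ ]:
# A sketch of one possible solution: convert frequency_words_wo_stop to a
# DataFrame and sort by the count column, as we did for frequency_words.
pd.DataFrame.from_dict(frequency_words_wo_stop, orient='index') \
    .sort_values(by=0, ascending=False).head(10)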
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
Stemming words is another common NLP technique to reduce topically similar words to their root. For example, “stemming,” “stemmer,” “stemmed,” all have similar meanings; stemming reduces those terms to “stem.” This is important for topic modeling, which would otherwise view those terms as separate entities and reduce their importance in the model. Stemming programs are commonly referred to as stemming algorithms or stemmers.
Like stopping, stemming is flexible and some methods are more aggressive. The Porter stemming algorithm is the most widely used method. To implement a Porter stemming algorithm, import the Porter Stemmer module from NLTK:
In [35]:
from nltk.stem.porter import PorterStemmer
In [36]:
porter_stemmer = PorterStemmer()
In [37]:
porter_stemmer.stem('dividing')
Out[37]:
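A quick sketch with the porter_stemmer defined above: related inflected forms typically collapse to the same stem, and the stem need not be a dictionary word.
In [ ]:
# Related inflections generally map to the same stem; the stem itself
# may not be a valid English word.
for word in ['dividing', 'divided', 'divides', 'stemming', 'stemmed']:
    print(word, '->', porter_stemmer.stem(word))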
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.
In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word.
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
We will use a corpus to do the lemmatization. Let us download the wordnet corpus using nltk.download()
In [38]:
from nltk.stem import WordNetLemmatizer
In [39]:
wordnet_lemmatizer = WordNetLemmatizer()
In [40]:
wordnet_lemmatizer.lemmatize('are')
Out[40]:
In [41]:
wordnet_lemmatizer.lemmatize('is')
Out[41]:
But we know that the root of are and is is be. The reason are and is come back unchanged is that we have to tell the lemmatizer to treat them as verbs by passing the part of speech
In [45]:
wordnet_lemmatizer.lemmatize('dividing', pos = "v")
Out[45]:
In [42]:
wordnet_lemmatizer.lemmatize('is',pos='v')
Out[42]:
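A brief side-by-side sketch using the stemmer and lemmatizer defined above; note that the lemmatizer only returns the base verb form when we pass pos='v'.
In [ ]:
# Compare Porter stemming with WordNet lemmatization on the same words.
for word in ['are', 'is', 'dividing', 'studies']:
    print(word,
          'stem:', porter_stemmer.stem(word),
          'lemma (verb):', wordnet_lemmatizer.lemmatize(word, pos='v'))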
In [46]:
def stem_title(data):
    return porter_stemmer.stem(data['title'])
In [47]:
def lemmatize_title(data):
    return wordnet_lemmatizer.lemmatize(data['title'])
In [48]:
df['stem'] = df.apply(stem_title,axis=1)
In [49]:
df.head()
Out[49]:
In [50]:
df['lemma'] = df.apply(lemmatize_title,axis=1)
In [51]:
df.head()
Out[51]:
In [52]:
df.tail()
Out[52]:
Note: stemming and lemmatization matter in the context of recall: collapsing inflected forms lets a search for one form also match documents containing the others.
https://displacy.spacy.io/displacy/index.html?full=Click+the+button+to+see+this+sentence+in+displaCy.
Let us go back to school. Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.
Part-of-speech tagging is one of the most important text analysis tasks: it classifies words into their parts of speech and labels them according to a tagset, the collection of tags used for POS tagging. Parts of speech are also known as word classes or lexical categories. Here is the definition from Wikipedia:
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.
In [53]:
text = 'Calvin harris is a great musician'
In [54]:
text_tokens = nltk.wordpunct_tokenize(text)
In [55]:
text_tokens
Out[55]:
We will download the averaged perceptron tagger using nltk.download() to do POS tagging
In [56]:
nltk.pos_tag(text_tokens)
Out[56]:
Tag | Meaning | English Examples |
---|---|---|
ADJ | adjective | new, good, high, special, big, local |
ADP | adposition | on, of, at, with, by, into, under |
ADV | adverb | really, already, still, early, now |
CONJ | conjunction | and, or, but, if, while, although |
DET | determiner | the, a, some, most, every, no, which |
NOUN | noun | year, home, costs, time, Africa |
NUM | numeral | twenty-four, fourth, 1991, 14:24 |
PRT | particle | at, on, out, over, per, that, up, with |
PRON | pronoun | he, their, her, its, my, I, us |
VERB | verb | is, say, told, given, playing, would |
. | punctuation marks | . , ; ! |
X | other | ersatz, esprit, dunno, gr8, univeristy |
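The table above lists the universal tagset, while nltk.pos_tag defaults to the finer-grained Penn Treebank tags (NNP, VBZ, ...). As a sketch (assuming the universal_tagset resource has been downloaded via nltk.download), passing tagset='universal' maps the output onto the tags in the table:
In [ ]:
# Map the default Penn Treebank tags onto the universal tagset shown above
# (assumes nltk.download('universal_tagset') has been run).
nltk.pos_tag(text_tokens, tagset='universal')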
In [57]:
def get_pos_tags(data):
    return nltk.pos_tag(nltk.wordpunct_tokenize(data['title']))
In [58]:
df['pos_tags'] = df.apply(get_pos_tags,axis=1)
In [59]:
df.head()
Out[59]:
The basic technique we will use for entity detection is chunking, which segments and labels multi-token sequences. At the word level we have tokenization and part-of-speech tagging; chunking groups sequences of these tagged tokens into higher-level units, and each such unit is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens, and the pieces produced by a chunker do not overlap in the source text.
Named Entity-Type | Examples |
---|---|
ORGANIZATION | Georgia-Pacific Corp., WHO |
PERSON | Eddy Bonte, President Obama |
LOCATION | Murray River, Mount Everest |
DATE | June, 2008-06-29 |
TIME | two fifty a m, 1:30 p.m. |
MONEY | 175 million Canadian Dollars, GBP 10.40 |
PERCENT | twenty pct, 18.75 % |
FACILITY | Washington Monument, Stonehenge |
GPE | South East Asia, Midlothian |
To do the entity identification, we will download the maxent_ne_chunker and words corpora using nltk.download()
In [60]:
df.pos_tags[0]
Out[60]:
In [61]:
ne_tree = nltk.ne_chunk(df.pos_tags[0],binary=True)
# ne_tree
In [ ]:
# for x in ne_tree:
# print(x)
In [ ]:
# We want only the NE chunks; printing the type shows that each chunk is a tree,
# so we need to iterate over the tree and pick out the nodes labelled NE
In [62]:
for x in ne_tree:
    print(type(x), x)
    # Named entity chunks are nltk Tree objects labelled 'NE' (since binary=True)
    if type(x) == nltk.tree.Tree:
        if x.label() == 'NE':
            print(x)
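The loop above only prints chunks labelled NE because we passed binary=True. As a hedged sketch, leaving binary at its default makes nltk.ne_chunk assign the typed labels from the table above (PERSON, GPE, ORGANIZATION, ...):
In [ ]:
# Typed entity labels: without binary=True, ne_chunk labels chunks with
# entity types such as PERSON, GPE or ORGANIZATION instead of plain NE.
typed_tree = nltk.ne_chunk(df.pos_tags[0], binary=False)
for node in typed_tree:
    if type(node) == nltk.tree.Tree:
        print(node.label(), ' '.join(word for word, tag in node.leaves()))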
In [63]:
def get_entities(row):
    entities = []
    chunked_tree = nltk.ne_chunk(row.pos_tags, binary=True)
    for nodes in chunked_tree:
        if type(nodes) == nltk.tree.Tree:
            if nodes.label() == 'NE':
                print("Before zip", nodes.leaves())
                # leaves() is a list of (word, tag) tuples; zip(*...) separates
                # the words from the tags so we can join the words back together
                zipped_list = list(zip(*nodes.leaves()))
                print("After zip", zipped_list)
                entities.append(' '.join(zipped_list[0]))
    return entities
In [64]:
df['named_entities'] = df.apply(get_entities,axis=1)
In [65]:
df.head()
Out[65]:
Now that we have entities, we can understand the statements better
In [ ]:
df.tail()
In [ ]:
df.to_csv('data_tau_ta.csv',index=False)
In [ ]:
In [ ]: