Tutorial Outline

Introduction
Preprerequisites
How does it work?
Authentication
- Authentication keys
MongoDB Collection
Starting a Stream
- Stream Listener
- Connect to a streaming API
Data Access and Analysis
- Load results to a DataFrame
Visualization

Introduction

Twitter provides two types of API to access their data:

RESTful API: Used to get data about existing data objects like statuses "tweets", user, ... etc
Streaming API: Used to get live statuses "tweets" as they are sent

The reason why you would like to use streaming API:

Capture large amount of data because RESTful API has limited access to older data
Real-time analysis like monitoring social discussion about a live event
In house archive like archiving social discussion about your brand(s)
AI response system for a twitter account like automated reply and filing questions or providing answers

Preprerequisites

Python 2 or 3
Jupyter /w IPyWidgets
Pandas
Numpy
Matplotlib
MogoDB Installtion
Pymongo
Scikit-learn
Tweepy
Twitter account

How does it work?

Twitter streaming API can provide data through a streaming HTTP response. This is very similar to downloading a file where you read a number of bytes and store it to disk and repeat until the end of file. The only difference is this stream is endless. The only things that could stop this stream are:

If you closed your connection to the streaming response
If your connection speed is not capable of receiving data and the servers buffer is filling up

This means that this process will be using the thread that it was launched from until it is stopped. In production, you should always start this in a different thread or process to make sure your software doesn't freeze until you stop the stream.

Authentication

You will need four numbers from twitter development to start using streaming API. First, let's import some important libraries for dealing with twitter API, data analysis, data storage ... etc



In [1]:

    
import numpy as np
import pandas as pd
import tweepy
import matplotlib.pyplot as plt
import pymongo
import ipywidgets as wgt
from IPython.display import display
from sklearn.feature_extraction.text import CountVectorizer
import re
from datetime import datetime

%matplotlib inline

Authentication keys

Go to https://apps.twitter.com/
Create an App (if you don't have one yet)
Grant read-only access to your account
Copy the four keys and paste them here:



In [2]:

    
api_key = "yP0yoCitoUNgD63ebMerGyJaE" # <---- Add your API Key
api_secret = "kLO5YUtlth3cd4lOHLy8nlLHW5npVQgUfO4FhsyCn6wCMIz5E6" # <---- Add your API Secret
access_token = "259862037-iMXNjfL8JBApm4LVcdfwc3FcMm7Xta4TKg5cd44K" # <---- Add your access token
access_token_secret = "UIgh08dtmavzlvlWWukIXwN5HDIQD0wNwyn5sPzhrynBf" # <---- Add your access token secret

auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

MongoDB Collection

Connect to MongoDB and create/get a collection.



In [3]:

    
col = pymongo.MongoClient()["tweets"]["StreamingTutorial"]
col.count()









    Out[3]:





2251

Starting a Stream

We need a listener which should extend tweepy.StreamListener class. There is a number of methods that you can extend to instruct the listener class to perform functionality. Some of the important methods are:

on_status(self, status): This will pass a status "tweet" object when a tweet is received
on_data(self, raw_data): Called when any any data is received and the raw data will be passed
on_error(self, status_code): Called when you get a response with code other than 200 (ok)

Stream Listener



In [4]:

    
class MyStreamListener(tweepy.StreamListener):
    
    counter = 0
    
    def __init__(self, max_tweets=1000, *args, **kwargs):
        self.max_tweets = max_tweets
        self.counter = 0
        super().__init__(*args, **kwargs)
    
    def on_connect(self):
        self.counter = 0
        self.start_time = datetime.now()
    
    def on_status(self, status):
        # Increment counter
        self.counter += 1
        
        # Store tweet to MongoDB
        col.insert_one(status._json)
        
        
        if self.counter % 1 == 0:
            value = int(100.00 * self.counter / self.max_tweets)
            mining_time = datetime.now() - self.start_time
            progress_bar.value = value
            html_value = """<span class="label label-primary">Tweets/Sec: %.1f</span>""" % (self.counter / max([1,mining_time.seconds]))
            html_value += """ <span class="label label-success">Progress: %.1f%%</span>""" % (self.counter / self.max_tweets * 100.0)
            html_value += """ <span class="label label-info">ETA: %.1f Sec</span>""" % ((self.max_tweets - self.counter) / (self.counter / max([1,mining_time.seconds])))
            wgt_status.value = html_value
            #print("%s/%s" % (self.counter, self.max_tweets))
            if self.counter >= self.max_tweets:
                myStream.disconnect()
                print("Finished")
                print("Total Mining Time: %s" % (mining_time))
                print("Tweets/Sec: %.1f" % (self.max_tweets / mining_time.seconds))
                progress_bar.value = 0
                
    
myStreamListener = MyStreamListener(max_tweets=100)
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)

Connect to a streaming API

There are two methods to connect to a stream:

filter(follow=None, track=None, async=False, locations=None, stall_warnings=False, languages=None, encoding='utf8', filter_level=None)
firehose(count=None, async=False)

Firehose captures everything. You should make sure that you have connection speed that can handle the stream and you have the storage capacity that can store these tweets at the same rate. We cannot really use firehose for this tutorial but we'll be using filter.

You have to specify one of two things to filter:

follow: A list of user ID to follow. This will stream all their tweets, retweets, and others retweeting their tweets. This doesn't include mentions and manual retweets where the user doesn't press the retweet button.
track: A string or list of string to be used for filtering. If you use multiple words separated by spaces, this will be used for AND operator. If you use multiple words in a string separated by commas or pass a list of words this will be treated as OR operator.

Note: track is case insensitive.

What to track?

I want to collect all tweets that contains any of these words:

Jupyter
Python
Data Mining
Machine Learning
Data Science
Big Data
IoT
#R

This could be done with a string or a list. It is easier to to it with a list to make your code clear to read.



In [5]:

    
keywords = ["Jupyter",
            "Python",
            "Data Mining",
            "Machine Learning",
            "Data Science",
            "Big Data",
            "DataMining",
            "MachineLearning",
            "DataScience",
            "BigData",
            "IoT",
            "#R",
           ]

# Visualize a progress bar to track progress
progress_bar = wgt.IntProgress(value=0)
display(progress_bar)
wgt_status = wgt.HTML(value="""<span class="label label-primary">Tweets/Sec: 0.0</span>""")
display(wgt_status)

# Start a filter with an error counter of 20
for error_counter in range(20):
    try:
        myStream.filter(track=keywords)
        print("Tweets collected: %s" % myStream.listener.counter)
        print("Total tweets in collection: %s" % col.count())
        break
    except:
        print("ERROR# %s" % (error_counter + 1))









    



Finished
Total Mining Time: 0:01:21.477351
Tweets/Sec: 1.2
Tweets collected: 100
Total tweets in collection: 2351

Data Access and Analysis

Now that we have stored all these tweets in a MongoDB collection, let's take a look at one of these tweets



In [6]:

    
col.find_one()









    Out[6]:





{'_id': ObjectId('56937d2e105f1970314720e2'),
 'contributors': None,
 'coordinates': None,
 'created_at': 'Mon Jan 11 10:00:14 +0000 2016',
 'entities': {'hashtags': [{'indices': [22, 27], 'text': 'Rã®æ³•å‰‡'}],
  'symbols': [],
  'urls': [],
  'user_mentions': []},
 'favorite_count': 0,
 'favorited': False,
 'filter_level': 'low',
 'geo': None,
 'id': 686487772970942466,
 'id_str': '686487772970942466',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'ja',
 'place': None,
 'retweet_count': 0,
 'retweeted': False,
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'text': 'ä½“åŠ›è½ã¡ã¦ãã¦ãŠã°ã•ã‚“ã¿ãŸã„ã«ãªã£ã¦ããŸã€‚\n#Rã®æ³•å‰‡',
 'timestamp_ms': '1452506414059',
 'truncated': False,
 'user': {'contributors_enabled': False,
  'created_at': 'Tue Aug 18 16:19:16 +0000 2015',
  'default_profile': True,
  'default_profile_image': False,
  'description': 'â˜® é–¢ã‚¸ãƒ£ãƒ‹âˆž ï¼† å±±ç”°æ¶¼ä»‹ ï¼† Justin Bieber ï¼† Benjamin Lasnier ï¼† Selena Gomez â˜®',
  'favourites_count': 1121,
  'follow_request_sent': None,
  'followers_count': 121,
  'following': None,
  'friends_count': 92,
  'geo_enabled': True,
  'id': 3318871652,
  'id_str': '3318871652',
  'is_translator': False,
  'lang': 'en',
  'listed_count': 0,
  'location': 'The land of dreams',
  'name': 'rena',
  'notifications': None,
  'profile_background_color': 'C0DEED',
  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_tile': False,
  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/3318871652/1452436374',
  'profile_image_url': 'http://pbs.twimg.com/profile_images/683964013558931456/Q1rx1s5b_normal.jpg',
  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/683964013558931456/Q1rx1s5b_normal.jpg',
  'profile_link_color': '0084B4',
  'profile_sidebar_border_color': 'C0DEED',
  'profile_sidebar_fill_color': 'DDEEF6',
  'profile_text_color': '333333',
  'profile_use_background_image': True,
  'protected': False,
  'screen_name': 'Q2HpiJwCX1huBwf',
  'statuses_count': 497,
  'time_zone': None,
  'url': None,
  'utc_offset': None,
  'verified': False}}

Load results to a DataFrame



In [29]:

    
dataset = [{"created_at": item["created_at"],
            "text": item["text"],
            "user": "@%s" % item["user"]["screen_name"],
            "source": item["source"],
           } for item in col.find()]

dataset = pd.DataFrame(dataset)
dataset









    Out[29]:






  
    
      
      created_at
      source
      text
      user
    
  
  
    
      0
      Mon Jan 11 10:00:14 +0000 2016
      <a href="http://twitter.com/download/iphone" r...
      ä½“åŠ›è½ã¡ã¦ãã¦ãŠã°ã•ã‚“ã¿ãŸã„ã«ãªã£ã¦ããŸã€‚\n#Rã®æ³•å‰‡
      @Q2HpiJwCX1huBwf
    
    
      1
      Mon Jan 11 10:09:26 +0000 2016
      <a href="http://twitter.com/download/android" ...
      çš†ã«ãŠã°ã•ã‚“ã¨è¨€ã‚ã‚Œã¦ã†ã‚Œã—ãŒã£ã¦ã‚‹ #Rã®æ³•å‰‡
      @Tamutamu1017
    
    
      2
      Mon Jan 11 10:00:10 +0000 2016
      <a href="http://trendkeyword.blog.jp/" rel="no...
      ã€R.I.Pã€‘æ€¥ä¸Šæ˜‡ãƒ¯ãƒ¼ãƒ‰ã€ŒR.I.Pã€ã®ã¾ã¨ã‚é€Ÿå ± https://t.co/yi1yfC...
      @pickword_matome
    
    
      3
      Mon Jan 11 10:00:10 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      #Rã®æ³•å‰‡ \nã©ã‚Œã‚‚ãŠã°ã•ã‚“è‡ã„ã‘ã‚Œã©ã‚„ã£ã±ã‚Šé»„è‰²ãŒä¸€ç•ªã ãªã
      @kakinotise
    
    
      4
      Mon Jan 11 10:00:10 +0000 2016
      <a href="http://bufferapp.com" rel="nofollow">...
      The New Best Thing HP ATP - Vertica Big Data S...
      @DataCentreNews1
    
    
      5
      Mon Jan 11 10:00:11 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      IoT Now: lâ€™Internet of Things Ã¨ qui, ora https...
      @datamanager_it
    
    
      6
      Mon Jan 11 10:00:11 +0000 2016
      <a href="http://trendkeyword.doorblog.jp/" rel...
      ä»Šè©±é¡Œã®ã€ŒR.I.Pã€ã¾ã¨ã‚ https://t.co/VOc5cwK5hg #R.I.P ...
      @buzz_wadai
    
    
      7
      Mon Jan 11 10:00:11 +0000 2016
      <a href="http://twitterfeed.com" rel="nofollow...
      #oldham #stockport VIDEO: Snake thief hides py...
      @Labour_is_PIE
    
    
      8
      Mon Jan 11 10:00:11 +0000 2016
      <a href="https://about.twitter.com/products/tw...
      Las #startup pioneras de #machinelearning ofre...
      @techreview_es
    
    
      9
      Mon Jan 11 10:00:12 +0000 2016
      <a href="http://www.linkedin.com/" rel="nofoll...
      Lets talk about how to harness the power of ma...
      @jansmit1
    
    
      10
      Mon Jan 11 10:00:13 +0000 2016
      <a href="http://catalystfive.com" rel="nofollo...
      Business Intelligence and Big Data Consulting ...
      @Catalyst5Jobs
    
    
      11
      Mon Jan 11 10:00:13 +0000 2016
      <a href="http://twitter.com/NewsICT" rel="nofo...
      [æƒ…å ±é€šä¿¡]2016å¹´å°åŒ—å›½éš›ã‚³ãƒ³ãƒ”ãƒ¥ãƒ¼ã‚¿ãƒ¼è¦‹æœ¬å¸‚ãŒæ–°ã—ã„ä½ç½®ã¥ã‘ã¨æ–°ã—ã„å±•ç¤ºã§è£…ã„æ–°ãŸã«ï¼...
      @NewsICT
    
    
      12
      Mon Jan 11 10:02:10 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      #bonplan Parties de Laser Quest entre amis Ã  2...
      @Bons_Plans_
    
    
      13
      Mon Jan 11 10:02:10 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      Parties de Laser Quest entre amis Ã  22.00â‚¬ au ...
      @keepmymindfree
    
    
      14
      Mon Jan 11 10:02:10 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @jose_garde: Why a Simple Data Analytics St...
      @martingeldish
    
    
      15
      Mon Jan 11 10:02:11 +0000 2016
      <a href="http://twitter.com/download/iphone" r...
      èŠ¸èƒ½äººã®äººãŸãã•ã‚“æ‰‹ãŸãŸãç¬‘ã„ã—ã¦ã‚‹ã‹ã‚‰ãŠã°ã•ã‚“ãŸãã•ã‚“ã«ãªã£ã¡ã‚ƒã†ã‚ˆwww\n\n#Rã®æ³•å‰‡
      @YK__0704
    
    
      16
      Mon Jan 11 10:02:12 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      DÃ©couvrez le jeu Pure Mission entre amis Ã  22....
      @keepmymindfree
    
    
      17
      Mon Jan 11 10:02:12 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      #bonplan Parties de bowling pour 4 Ã  #POINCY :...
      @Bons_Plans_
    
    
      18
      Mon Jan 11 10:02:12 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      20 min de vol dÃ©couverte ULM pour 1 ou 2 Ã  79....
      @CrationSiteWeb
    
    
      19
      Mon Jan 11 10:02:12 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @hortonworks: Paris is the city of love but...
      @bigdataparis
    
    
      20
      Mon Jan 11 10:02:12 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @hynek: So #emacs / @spacemacs nerds: is th...
      @fdiesch
    
    
      21
      Mon Jan 11 10:02:12 +0000 2016
      <a href="http://www.hootsuite.com" rel="nofoll...
      .@QonexCyber founder member of @IoT_SF is orga...
      @QonexCyber
    
    
      22
      Mon Jan 11 10:02:13 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      Parties de bowling pour 4 Ã  #POINCY : 35.00â‚¬ a...
      @keepmymindfree
    
    
      23
      Mon Jan 11 10:02:13 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      #bonplan 30 sÃ©ances de Squash Ã  #LISSES : 39.9...
      @Bons_Plans_
    
    
      24
      Mon Jan 11 10:02:13 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      App: ExZeus 2 â€“ free to play   https://t.co/ZT...
      @UniversalConsol
    
    
      25
      Mon Jan 11 10:02:13 +0000 2016
      <a href="http://dlvr.it" rel="nofollow">dlvr.i...
      #discount Parties de bowling pour 4 Ã  #POINCY ...
      @PromosPromos
    
    
      26
      Mon Jan 11 10:02:13 +0000 2016
      <a href="http://www.google.com/" rel="nofollow...
      spiegel.de :  Tier macht Sachen: Python beiÃŸt ...
      @arminfischer_de
    
    
      27
      Mon Jan 11 10:09:02 +0000 2016
      <a href="http://twitter.com/download/android" ...
      è‹¥ã„ã£ã¦ã„ã„ããˆâ€¦ã£ã¦ã‚ˆãè¨€ã†w #Rã®æ³•å‰‡
      @naco75x
    
    
      28
      Mon Jan 11 10:09:17 +0000 2016
      <a href="http://twitter.com/download/iphone" r...
      æœ€è¿‘ã®è‹¥ã„åã¯æœ€è¿‘ä½¿ã£ãŸ\n#Rã®æ³•å‰‡
      @K1224West
    
    
      29
      Mon Jan 11 10:09:19 +0000 2016
      <a href="http://twitter.com/download/android" ...
      #Rã®æ³•å‰‡\nè‡ªåˆ†ã‚‚è‹¥è€…ãªã®ã«ç¬‘
      @V6ZRRT7Q22BZ1cF
    
    
      ...
      ...
      ...
      ...
      ...
    
    
      2221
      Mon Jan 11 10:29:40 +0000 2016
      <a href="http://201512291327-7430af.bitnamiapp...
      https://t.co/BTAAq6HuuJ - pcgamer - #machinele...
      @vinceyue
    
    
      2222
      Mon Jan 11 10:29:40 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @jose_garde: The Big Data Analytics Softwar...
      @LJ_Blanchard
    
    
      2223
      Mon Jan 11 10:29:41 +0000 2016
      <a href="http://www.linkedin.com/" rel="nofoll...
      What is data mining? Do you have to be a mathe...
      @ednuwan
    
    
      2224
      Mon Jan 11 10:29:41 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @AgroKnow: A #BigData platform for the futu...
      @albertspijkers
    
    
      2225
      Mon Jan 11 10:29:42 +0000 2016
      <a href="http://www.linkedin.com/" rel="nofoll...
      Big Data: Is It A Tsunami, The New Oil, Or Sim...
      @Summerlovegrove
    
    
      2226
      Mon Jan 11 10:29:43 +0000 2016
      <a href="https://www.jobfindly.com/php-jobs.ht...
      Sr Software Engineer C Php Python Linux Jobs i...
      @jobfindlyphpdev
    
    
      2227
      Mon Jan 11 10:29:43 +0000 2016
      <a href="http://twitter.com/download/iphone" r...
      RT @bigdataparis: #Bigdata bang : un marchÃ© en...
      @LifeIsWeb
    
    
      2228
      Mon Jan 11 10:29:43 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      Learn from the best professors in India .. cou...
      @ashwaniapex
    
    
      2229
      Mon Jan 11 10:29:45 +0000 2016
      <a href="http://getsmoup.com" rel="nofollow">S...
      RT @rebrandtoday: #startup or #rebrand -Buy Cr...
      @SmartData_Fr
    
    
      2230
      Mon Jan 11 10:29:45 +0000 2016
      <a href="http://www.ajaymatharu.com/" rel="nof...
      Â¿CÃ³mo serÃ¡ el futuro del Big Data? https://t.c...
      @eduardogarsanch
    
    
      2231
      Mon Jan 11 10:29:46 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @jose_garde: 3 Ways to Transform Your Compa...
      @LJ_Blanchard
    
    
      2232
      Mon Jan 11 10:29:46 +0000 2016
      <a href="http://publicize.wp.com/" rel="nofoll...
      Woman Tries To Kiss Python, Gets Bitten In The...
      @NAIJA_VIBEZ
    
    
      2233
      Mon Jan 11 10:29:46 +0000 2016
      <a href="http://www.itknowingness.com" rel="no...
      RT @jose_garde: 3 Ways to Transform Your Compa...
      @itknowingness
    
    
      2234
      Mon Jan 11 10:29:48 +0000 2016
      <a href="http://twitter.com/download/iphone" r...
      RT @ErikaPauwels: Building a #BigData platform...
      @impulsater
    
    
      2235
      Mon Jan 11 10:29:49 +0000 2016
      <a href="http://twitter.com/download/android" ...
      En 2016 j'aimerais moins rÃ¢ler. #rÃ©solution. S...
      @ce1ce2makarenko
    
    
      2236
      Mon Jan 11 10:29:51 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @jose_garde: How Marketing Can Be Better Au...
      @LJ_Blanchard
    
    
      2237
      Mon Jan 11 10:29:51 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      Hey @Pontifex accueille ces rÃ©fugiÃ©es dans ta ...
      @Atmosfive
    
    
      2238
      Mon Jan 11 10:29:51 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @Ubixr: .@GroupeLaPoste choisit le toulousa...
      @The_Nextwork
    
    
      2239
      Mon Jan 11 10:29:56 +0000 2016
      <a href="http://twitterfeed.com" rel="nofollow...
      Thanks @hackplayers Blade: un webshell en Pyth...
      @Navarmedia
    
    
      2240
      Mon Jan 11 10:29:56 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @jose_garde: Which big data personality are...
      @LJ_Blanchard
    
    
      2241
      Mon Jan 11 10:29:56 +0000 2016
      <a href="http://ifttt.com" rel="nofollow">IFTT...
      Cybersecurity Forum tackles challenges with th...
      @wulfsec
    
    
      2242
      Mon Jan 11 10:29:56 +0000 2016
      <a href="http://publicize.wp.com/" rel="nofoll...
      Woman Tries To Kiss Python, Gets Bitten In The...
      @Lola2Records
    
    
      2243
      Mon Jan 11 10:29:56 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @TeamAnodot: Join David Drai, CEO of Anodot...
      @iottechexpo
    
    
      2244
      Mon Jan 11 10:29:57 +0000 2016
      <a href="http://getsmoup.com" rel="nofollow">A...
      RT @rebrandtoday: #startup or #rebrand -Buy Cr...
      @AI__news
    
    
      2245
      Mon Jan 11 10:29:57 +0000 2016
      <a href="http://www.twitter.com" rel="nofollow...
      RT @Matthis__VERNON: "@Fred_Poquet Sans #confi...
      @sibueta
    
    
      2246
      Mon Jan 11 10:29:57 +0000 2016
      <a href="https://social.zoho.com" rel="nofollo...
      The right place for #BigData is #Cloud #Storag...
      @TyroneSystems
    
    
      2247
      Mon Jan 11 10:29:59 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @PEBlanrue: Il a toujours le mot pour rire,...
      @lesroisduring
    
    
      2248
      Mon Jan 11 10:29:58 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      RT @DigitalAgendaEU: â‚¬15 million for a #IoT so...
      @ImproveNPA
    
    
      2249
      Mon Jan 11 10:29:59 +0000 2016
      <a href="http://twitter.com" rel="nofollow">Tw...
      Neat IoT innovation https://t.co/atARX0m5Bj
      @sherwinnovator
    
    
      2250
      Mon Jan 11 10:30:00 +0000 2016
      <a href="http://www.hubspot.com/" rel="nofollo...
      Check out our #Mobile App Predicitions for 201...
      @B60uk
    
  

2251 rows Ã— 4 columns

Checking the highest used words



In [30]:

    
cv = CountVectorizer()
count_matrix = cv.fit_transform(dataset.text)

word_count = pd.DataFrame(cv.get_feature_names(), columns=["word"])
word_count["count"] = count_matrix.sum(axis=0).tolist()[0]
word_count = word_count.sort_values("count", ascending=False).reset_index(drop=True)
word_count[:50]









    Out[30]:






  
    
      
      word
      count
    
  
  
    
      0
      https
      1986
    
    
      1
      co
      1907
    
    
      2
      rt
      804
    
    
      3
      de
      550
    
    
      4
      rã®æ³•å‰‡
      408
    
    
      5
      iot
      374
    
    
      6
      the
      358
    
    
      7
      bigdata
      293
    
    
      8
      00
      275
    
    
      9
      data
      250
    
    
      10
      in
      234
    
    
      11
      python
      219
    
    
      12
      to
      212
    
    
      13
      au
      199
    
    
      14
      of
      188
    
    
      15
      lieu
      168
    
    
      16
      rÃ©duction
      166
    
    
      17
      big
      157
    
    
      18
      on
      143
    
    
      19
      is
      142
    
    
      20
      and
      140
    
    
      21
      for
      136
    
    
      22
      analytics
      107
    
    
      23
      le
      89
    
    
      24
      via
      86
    
    
      25
      you
      86
    
    
      26
      thingsexpo
      86
    
    
      27
      by
      85
    
    
      28
      2016
      84
    
    
      29
      snake
      80
    
    
      30
      en
      76
    
    
      31
      bowie
      75
    
    
      32
      la
      74
    
    
      33
      thief
      74
    
    
      34
      video
      73
    
    
      35
      m2m
      70
    
    
      36
      jose_garde
      68
    
    
      37
      19
      67
    
    
      38
      david
      66
    
    
      39
      with
      63
    
    
      40
      how
      61
    
    
      41
      it
      60
    
    
      42
      will
      55
    
    
      43
      un
      54
    
    
      44
      amp
      53
    
    
      45
      des
      53
    
    
      46
      rÃ©paration
      53
    
    
      47
      new
      52
    
    
      48
      39
      52
    
    
      49
      at
      51

Visualization



In [37]:

    
def get_source_name(x):
    value = re.findall(pattern="<[^>]+>([^<]+)</a>", string=x)
    if len(value) > 0:
        return value[0]
    else:
        return ""



In [38]:

    
dataset.source_name = dataset.source.apply(get_source_name)

source_counts = dataset.source_name.value_counts().sort_values()[-10:]

bottom = [index for index, item in enumerate(source_counts.index)]
plt.barh(bottom, width=source_counts, color="orange", linewidth=0)

y_labels = ["%s %.1f%%" % (item, 100.0*source_counts[item]/len(dataset)) for index,item in enumerate(source_counts.index)]
plt.yticks(np.array(bottom)+0.4, y_labels)

source_counts









    Out[38]:





Facebook                25
TweetDeck               37
RoundTeam               41
Hootsuite               46
twitterfeed             81
IFTTT                  134
dlvr.it                200
Twitter Web Client     388
Twitter for Android    392
Twitter for iPhone     515
Name: source, dtype: int64

	created_at	source	text	user
0	Mon Jan 11 10:00:14 +0000 2016	<a href="http://twitter.com/download/iphone" r...	ä½“åŠ›è½ã¡ã¦ãã¦ãŠã°ã•ã‚“ã¿ãŸã„ã«ãªã£ã¦ããŸã€‚\n#Rã®æ³•å‰‡	@Q2HpiJwCX1huBwf
1	Mon Jan 11 10:09:26 +0000 2016	<a href="http://twitter.com/download/android" ...	çš†ã«ãŠã°ã•ã‚“ã¨è¨€ã‚ã‚Œã¦ã†ã‚Œã—ãŒã£ã¦ã‚‹ #Rã®æ³•å‰‡	@Tamutamu1017
2	Mon Jan 11 10:00:10 +0000 2016	<a href="http://trendkeyword.blog.jp/" rel="no...	ã€R.I.Pã€‘æ€¥ä¸Šæ˜‡ãƒ¯ãƒ¼ãƒ‰ã€ŒR.I.Pã€ã®ã¾ã¨ã‚é€Ÿå ± https://t.co/yi1yfC...	@pickword_matome
3	Mon Jan 11 10:00:10 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	#Rã®æ³•å‰‡ \nã©ã‚Œã‚‚ãŠã°ã•ã‚“è‡ã„ã‘ã‚Œã©ã‚„ã£ã±ã‚Šé»„è‰²ãŒä¸€ç•ªã ãªã	@kakinotise
4	Mon Jan 11 10:00:10 +0000 2016	<a href="http://bufferapp.com" rel="nofollow">...	The New Best Thing HP ATP - Vertica Big Data S...	@DataCentreNews1
5	Mon Jan 11 10:00:11 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	IoT Now: lâ€™Internet of Things Ã¨ qui, ora https...	@datamanager_it
6	Mon Jan 11 10:00:11 +0000 2016	<a href="http://trendkeyword.doorblog.jp/" rel...	ä»Šè©±é¡Œã®ã€ŒR.I.Pã€ã¾ã¨ã‚ https://t.co/VOc5cwK5hg #R.I.P ...	@buzz_wadai
7	Mon Jan 11 10:00:11 +0000 2016	<a href="http://twitterfeed.com" rel="nofollow...	#oldham #stockport VIDEO: Snake thief hides py...	@Labour_is_PIE
8	Mon Jan 11 10:00:11 +0000 2016	<a href="https://about.twitter.com/products/tw...	Las #startup pioneras de #machinelearning ofre...	@techreview_es
9	Mon Jan 11 10:00:12 +0000 2016	<a href="http://www.linkedin.com/" rel="nofoll...	Lets talk about how to harness the power of ma...	@jansmit1
10	Mon Jan 11 10:00:13 +0000 2016	<a href="http://catalystfive.com" rel="nofollo...	Business Intelligence and Big Data Consulting ...	@Catalyst5Jobs
11	Mon Jan 11 10:00:13 +0000 2016	<a href="http://twitter.com/NewsICT" rel="nofo...	[æƒ…å ±é€šä¿¡]2016å¹´å°åŒ—å›½éš›ã‚³ãƒ³ãƒ”ãƒ¥ãƒ¼ã‚¿ãƒ¼è¦‹æœ¬å¸‚ãŒæ–°ã—ã„ä½ç½®ã¥ã‘ã¨æ–°ã—ã„å±•ç¤ºã§è£…ã„æ–°ãŸã«ï¼...	@NewsICT
12	Mon Jan 11 10:02:10 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	#bonplan Parties de Laser Quest entre amis Ã 2...	@Bons_Plans_
13	Mon Jan 11 10:02:10 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	Parties de Laser Quest entre amis Ã 22.00â‚¬ au ...	@keepmymindfree
14	Mon Jan 11 10:02:10 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @jose_garde: Why a Simple Data Analytics St...	@martingeldish
15	Mon Jan 11 10:02:11 +0000 2016	<a href="http://twitter.com/download/iphone" r...	èŠ¸èƒ½äººã®äººãŸãã•ã‚“æ‰‹ãŸãŸãç¬‘ã„ã—ã¦ã‚‹ã‹ã‚‰ãŠã°ã•ã‚“ãŸãã•ã‚“ã«ãªã£ã¡ã‚ƒã†ã‚ˆwww\n\n#Rã®æ³•å‰‡	@YK__0704
16	Mon Jan 11 10:02:12 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	DÃ©couvrez le jeu Pure Mission entre amis Ã 22....	@keepmymindfree
17	Mon Jan 11 10:02:12 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	#bonplan Parties de bowling pour 4 Ã #POINCY :...	@Bons_Plans_
18	Mon Jan 11 10:02:12 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	20 min de vol dÃ©couverte ULM pour 1 ou 2 Ã 79....	@CrationSiteWeb
19	Mon Jan 11 10:02:12 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @hortonworks: Paris is the city of love but...	@bigdataparis
20	Mon Jan 11 10:02:12 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @hynek: So #emacs / @spacemacs nerds: is th...	@fdiesch
21	Mon Jan 11 10:02:12 +0000 2016	<a href="http://www.hootsuite.com" rel="nofoll...	.@QonexCyber founder member of @IoT_SF is orga...	@QonexCyber
22	Mon Jan 11 10:02:13 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	Parties de bowling pour 4 Ã #POINCY : 35.00â‚¬ a...	@keepmymindfree
23	Mon Jan 11 10:02:13 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	#bonplan 30 sÃ©ances de Squash Ã #LISSES : 39.9...	@Bons_Plans_
24	Mon Jan 11 10:02:13 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	App: ExZeus 2 â€“ free to play https://t.co/ZT...	@UniversalConsol
25	Mon Jan 11 10:02:13 +0000 2016	<a href="http://dlvr.it" rel="nofollow">dlvr.i...	#discount Parties de bowling pour 4 Ã #POINCY ...	@PromosPromos
26	Mon Jan 11 10:02:13 +0000 2016	<a href="http://www.google.com/" rel="nofollow...	spiegel.de : Tier macht Sachen: Python beiÃŸt ...	@arminfischer_de
27	Mon Jan 11 10:09:02 +0000 2016	<a href="http://twitter.com/download/android" ...	è‹¥ã„ã£ã¦ã„ã„ããˆâ€¦ã£ã¦ã‚ˆãè¨€ã†w #Rã®æ³•å‰‡	@naco75x
28	Mon Jan 11 10:09:17 +0000 2016	<a href="http://twitter.com/download/iphone" r...	æœ€è¿‘ã®è‹¥ã„åã¯æœ€è¿‘ä½¿ã£ãŸ\n#Rã®æ³•å‰‡	@K1224West
29	Mon Jan 11 10:09:19 +0000 2016	<a href="http://twitter.com/download/android" ...	#Rã®æ³•å‰‡\nè‡ªåˆ†ã‚‚è‹¥è€…ãªã®ã«ç¬‘	@V6ZRRT7Q22BZ1cF
...	...	...	...	...
2221	Mon Jan 11 10:29:40 +0000 2016	<a href="http://201512291327-7430af.bitnamiapp...	https://t.co/BTAAq6HuuJ - pcgamer - #machinele...	@vinceyue
2222	Mon Jan 11 10:29:40 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @jose_garde: The Big Data Analytics Softwar...	@LJ_Blanchard
2223	Mon Jan 11 10:29:41 +0000 2016	<a href="http://www.linkedin.com/" rel="nofoll...	What is data mining? Do you have to be a mathe...	@ednuwan
2224	Mon Jan 11 10:29:41 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @AgroKnow: A #BigData platform for the futu...	@albertspijkers
2225	Mon Jan 11 10:29:42 +0000 2016	<a href="http://www.linkedin.com/" rel="nofoll...	Big Data: Is It A Tsunami, The New Oil, Or Sim...	@Summerlovegrove
2226	Mon Jan 11 10:29:43 +0000 2016	<a href="https://www.jobfindly.com/php-jobs.ht...	Sr Software Engineer C Php Python Linux Jobs i...	@jobfindlyphpdev
2227	Mon Jan 11 10:29:43 +0000 2016	<a href="http://twitter.com/download/iphone" r...	RT @bigdataparis: #Bigdata bang : un marchÃ© en...	@LifeIsWeb
2228	Mon Jan 11 10:29:43 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	Learn from the best professors in India .. cou...	@ashwaniapex
2229	Mon Jan 11 10:29:45 +0000 2016	<a href="http://getsmoup.com" rel="nofollow">S...	RT @rebrandtoday: #startup or #rebrand -Buy Cr...	@SmartData_Fr
2230	Mon Jan 11 10:29:45 +0000 2016	<a href="http://www.ajaymatharu.com/" rel="nof...	Â¿CÃ³mo serÃ¡ el futuro del Big Data? https://t.c...	@eduardogarsanch
2231	Mon Jan 11 10:29:46 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @jose_garde: 3 Ways to Transform Your Compa...	@LJ_Blanchard
2232	Mon Jan 11 10:29:46 +0000 2016	<a href="http://publicize.wp.com/" rel="nofoll...	Woman Tries To Kiss Python, Gets Bitten In The...	@NAIJA_VIBEZ
2233	Mon Jan 11 10:29:46 +0000 2016	<a href="http://www.itknowingness.com" rel="no...	RT @jose_garde: 3 Ways to Transform Your Compa...	@itknowingness
2234	Mon Jan 11 10:29:48 +0000 2016	<a href="http://twitter.com/download/iphone" r...	RT @ErikaPauwels: Building a #BigData platform...	@impulsater
2235	Mon Jan 11 10:29:49 +0000 2016	<a href="http://twitter.com/download/android" ...	En 2016 j'aimerais moins rÃ¢ler. #rÃ©solution. S...	@ce1ce2makarenko
2236	Mon Jan 11 10:29:51 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @jose_garde: How Marketing Can Be Better Au...	@LJ_Blanchard
2237	Mon Jan 11 10:29:51 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	Hey @Pontifex accueille ces rÃ©fugiÃ©es dans ta ...	@Atmosfive
2238	Mon Jan 11 10:29:51 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @Ubixr: .@GroupeLaPoste choisit le toulousa...	@The_Nextwork
2239	Mon Jan 11 10:29:56 +0000 2016	<a href="http://twitterfeed.com" rel="nofollow...	Thanks @hackplayers Blade: un webshell en Pyth...	@Navarmedia
2240	Mon Jan 11 10:29:56 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @jose_garde: Which big data personality are...	@LJ_Blanchard
2241	Mon Jan 11 10:29:56 +0000 2016	<a href="http://ifttt.com" rel="nofollow">IFTT...	Cybersecurity Forum tackles challenges with th...	@wulfsec
2242	Mon Jan 11 10:29:56 +0000 2016	<a href="http://publicize.wp.com/" rel="nofoll...	Woman Tries To Kiss Python, Gets Bitten In The...	@Lola2Records
2243	Mon Jan 11 10:29:56 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @TeamAnodot: Join David Drai, CEO of Anodot...	@iottechexpo
2244	Mon Jan 11 10:29:57 +0000 2016	<a href="http://getsmoup.com" rel="nofollow">A...	RT @rebrandtoday: #startup or #rebrand -Buy Cr...	@AI__news
2245	Mon Jan 11 10:29:57 +0000 2016	<a href="http://www.twitter.com" rel="nofollow...	RT @Matthis__VERNON: "@Fred_Poquet Sans #confi...	@sibueta
2246	Mon Jan 11 10:29:57 +0000 2016	<a href="https://social.zoho.com" rel="nofollo...	The right place for #BigData is #Cloud #Storag...	@TyroneSystems
2247	Mon Jan 11 10:29:59 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @PEBlanrue: Il a toujours le mot pour rire,...	@lesroisduring
2248	Mon Jan 11 10:29:58 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	RT @DigitalAgendaEU: â‚¬15 million for a #IoT so...	@ImproveNPA
2249	Mon Jan 11 10:29:59 +0000 2016	<a href="http://twitter.com" rel="nofollow">Tw...	Neat IoT innovation https://t.co/atARX0m5Bj	@sherwinnovator
2250	Mon Jan 11 10:30:00 +0000 2016	<a href="http://www.hubspot.com/" rel="nofollo...	Check out our #Mobile App Predicitions for 201...	@B60uk

	word	count
0	https	1986
1	co	1907
2	rt	804
3	de	550
4	rã®æ³•å‰‡	408
5	iot	374
6	the	358
7	bigdata	293
8	00	275
9	data	250
10	in	234
11	python	219
12	to	212
13	au	199
14	of	188
15	lieu	168
16	rÃ©duction	166
17	big	157
18	on	143
19	is	142
20	and	140
21	for	136
22	analytics	107
23	le	89
24	via	86
25	you	86
26	thingsexpo	86
27	by	85
28	2016	84
29	snake	80
30	en	76
31	bowie	75
32	la	74
33	thief	74
34	video	73
35	m2m	70
36	jose_garde	68
37	19	67
38	david	66
39	with	63
40	how	61
41	it	60
42	will	55
43	un	54
44	amp	53
45	des	53
46	rÃ©paration	53
47	new	52
48	39	52
49	at	51