1. Introduction

We are interested in observing discussions on Twitter to identify vulnerabilities and exposures. We plan to focus on collecting tweets containing particular words of interest. The collected data can then be cleaned and analysed along the same lines as the full disclosure mailing list.

The notebook is inspired by the work in the article Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits, where the idea is to use Twitter analytics for early detection of exploits. The presentation shares instances where vulnerabilities were mentioned and discussed on Twitter before they were officially disclosed, and this is the motivation for using Twitter as part of the research.

The researchers in the article collected tweets based on a list of 50 words. The list of words is not provided in the article, so we start our analysis by collecting tweets for keywords identified manually through our own research into vulnerability-related discussions on Twitter.

This task can be achieved in two ways (a minimal preview of both follows the list):

  1. Search for historical tweets containing specific words of interest using the Search API
  2. Monitor the live Twitter feed for specific words of interest using the Streaming API
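
As a minimal preview (assuming a Twython REST client twitter and a TwythonStreamer instance stream have already been authenticated, as done later in this notebook; the keyword is illustrative), the two entry points look like this:

# 1. Search API: query recent historical tweets for a keyword
results = twitter.search(q="vulnerability", count=100)

# 2. Streaming API: filter the live feed for a keyword
stream.statuses.filter(track="vulnerability")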

In [11]:
# Read the API credentials from Twitter_keys.txt (one "NAME=value" pair per line)
myvars = {}
with open("Twitter_keys.txt") as myfile:
    for line in myfile:
        # Split each line on the first "=" and keep the name and the value
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = var
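
The file Twitter_keys.txt is expected to contain one NAME=value pair per line, with placeholder values shown here:

APP_KEY=xxxxxxxxxxxxxxxxxxxxxxxxx
APP_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OAUTH_TOKEN=xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OAUTH_TOKEN_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx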

In [12]:
APP_KEY = myvars["APP_KEY"].rstrip()
APP_SECRET = myvars["APP_SECRET"].rstrip()
OAUTH_TOKEN = myvars["OAUTH_TOKEN"].rstrip()
OAUTH_TOKEN_SECRET = myvars["OAUTH_TOKEN_SECRET"].rstrip()

A variable save_path is created which contains the path to the folder where the tweet files will be stored in JSON format. The folder name follows the pattern "tweet_MM_DD_YYYY".


In [13]:
import os
import datetime

# Create a folder named tweet_<month>_<day>_<year> in the current working directory
now = datetime.datetime.now()
current_dir = os.getcwd()
save_path = os.path.join(current_dir, 'tweet_%i_%i_%i' % (now.month, now.day, now.year))
if not os.path.exists(save_path):
    os.makedirs(save_path)

2. Tweet Extraction

We use the twython library for the tweet extraction. Twython is an actively maintained, pure Python wrapper for the Twitter API. It supports both normal and streaming Twitter APIs.

The first task is to obtain the keys and tokens required to access the API; the wrapper functions can then be called with these credentials.
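
As a minimal sanity check (not part of the extraction scripts below), a Twython client can be instantiated with the credentials loaded earlier and verified with verify_credentials(), which returns the authenticated account's profile:

from twython import Twython

# Create a REST client from the credentials loaded above
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# Fetch the authenticated account's profile to confirm the keys and tokens work
profile = twitter.verify_credentials()
print(profile["screen_name"])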

The scripts below have two configurable parameters:

  1. The query_word variable needs to be initialized with the keyword(s) we are looking for in the Twitter feed
  2. The max_tweets variable is the number of tweets we plan to extract for the keywords mentioned above

If query_word is initialized to multiple words, the code will retrieve the set of tweets that contain all of the words.
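
For example (illustrative values): with the Streaming API's track parameter, space-separated terms within a phrase must all appear in a matching tweet, while comma-separated phrases act as alternatives; the Search API likewise treats space-separated terms as an AND by default.

query_word = "security flaw"             # tweets must contain both "security" and "flaw"
# query_word = "security flaw,zero day"  # Streaming API only: "security flaw" OR "zero day"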

2.1. Streaming API

The Streaming API is used to access tweets from the live feed. It returns approximately 1% of all tweets, i.e. roughly 60 tweets per second, assuming a maximum of 6,000 users tweet every second. There is no rate limit on the Streaming API.

The following links have details on the Twitter Streaming API limits:

  1. URL 1
  2. URL 2

In [14]:
#import libraries
from twython import TwythonStreamer
import time


#Configurable parameters
query_word = "access-role"
max_tweets = 5


searched_tweets_strm = []

class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        if len(searched_tweets_strm) < max_tweets:
            if 'text' in data:
                # Collect the tweet text; messages without a 'text' field are not tweets
                searched_tweets_strm.append(data['text'].encode('utf-8'))
            else:
                print("No tweets found")
                self.disconnect()
        else:
            # Stop the stream cleanly once the requested number of tweets is collected
            print("Max tweets extracted")
            self.disconnect()

    def on_error(self, status_code, data):
        # Back off for 15 minutes on errors such as rate limiting
        print(status_code, data)
        print("Exception raised, waiting 15 minutes")
        time.sleep(15*60)

# Requires Authentication as of Twitter API v1.1
stream = MyStreamer(APP_KEY, APP_SECRET,
                    OAUTH_TOKEN, OAUTH_TOKEN_SECRET)


stream.statuses.filter(track=query_word)


---------------------------------------------------------------------------
WantReadError                             Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in recv_into(self, *args, **kwargs)
    279         try:
--> 280             return self.connection.recv_into(*args, **kwargs)
    281         except OpenSSL.SSL.SysCallError as e:

~/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py in recv_into(self, buffer, nbytes, flags)
   1714             result = _lib.SSL_read(self._ssl, buf, nbytes)
-> 1715         self._raise_ssl_error(self._ssl, result)
   1716 

~/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py in _raise_ssl_error(self, ssl, result)
   1520         if error == _lib.SSL_ERROR_WANT_READ:
-> 1521             raise WantReadError()
   1522         elif error == _lib.SSL_ERROR_WANT_WRITE:

WantReadError: 

During handling of the above exception, another exception occurred:

timeout                                   Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/urllib3/response.py in _error_catcher(self)
    301             try:
--> 302                 yield
    303 

~/anaconda3/lib/python3.6/site-packages/urllib3/response.py in read_chunked(self, amt, decode_content)
    597             while True:
--> 598                 self._update_chunk_length()
    599                 if self.chunk_left == 0:

~/anaconda3/lib/python3.6/site-packages/urllib3/response.py in _update_chunk_length(self)
    539             return
--> 540         line = self._fp.fp.readline()
    541         line = line.split(b';', 1)[0]

~/anaconda3/lib/python3.6/socket.py in readinto(self, b)
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:

~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in recv_into(self, *args, **kwargs)
    293             if not rd:
--> 294                 raise timeout('The read operation timed out')
    295             else:

timeout: The read operation timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/requests/models.py in generate()
    744                 try:
--> 745                     for chunk in self.raw.stream(chunk_size, decode_content=True):
    746                         yield chunk

~/anaconda3/lib/python3.6/site-packages/urllib3/response.py in stream(self, amt, decode_content)
    431         if self.chunked and self.supports_chunked_reads():
--> 432             for line in self.read_chunked(amt, decode_content=decode_content):
    433                 yield line

~/anaconda3/lib/python3.6/site-packages/urllib3/response.py in read_chunked(self, amt, decode_content)
    625             if self._original_response:
--> 626                 self._original_response.close()

~/anaconda3/lib/python3.6/contextlib.py in __exit__(self, type, value, traceback)
     98             try:
---> 99                 self.gen.throw(type, value, traceback)
    100             except StopIteration as exc:

~/anaconda3/lib/python3.6/site-packages/urllib3/response.py in _error_catcher(self)
    306                 # there is yet no clean way to get at it from this context.
--> 307                 raise ReadTimeoutError(self._pool, None, 'Read timed out.')
    308 

ReadTimeoutError: HTTPSConnectionPool(host='stream.twitter.com', port=443): Read timed out.

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input-14-6179d2d42235> in <module>()
     37 
     38 
---> 39 stream.statuses.filter(track=query_word)

~/anaconda3/lib/python3.6/site-packages/twython/streaming/types.py in filter(self, **params)
     64         url = 'https://stream.twitter.com/%s/statuses/filter.json' \
     65               % self.streamer.api_version
---> 66         self.streamer._request(url, 'POST', params=params)
     67 
     68     def sample(self, **params):

~/anaconda3/lib/python3.6/site-packages/twython/streaming/api.py in _request(self, url, method, params)
    139             response = _send(retry_counter)
    140 
--> 141             for line in response.iter_lines(self.chunk_size):
    142                 if not self.connected:
    143                     break

~/anaconda3/lib/python3.6/site-packages/requests/models.py in iter_lines(self, chunk_size, decode_unicode, delimiter)
    787         pending = None
    788 
--> 789         for chunk in self.iter_content(chunk_size=chunk_size, decode_unicode=decode_unicode):
    790 
    791             if pending is not None:

~/anaconda3/lib/python3.6/site-packages/requests/models.py in generate()
    750                     raise ContentDecodingError(e)
    751                 except ReadTimeoutError as e:
--> 752                     raise ConnectionError(e)
    753             else:
    754                 # Standard file-like object.

ConnectionError: HTTPSConnectionPool(host='stream.twitter.com', port=443): Read timed out.

In [ ]:
print (searched_tweets_strm)

2.2. Search API

The Streaming API collects tweets from the current feed, not historical data. For historical data, the Search API can be used; note that the standard Search API only indexes recent tweets (roughly the past week), so older tweets may not be retrievable. Once these historical tweets have been collected, subsequent tweets can be gathered by running the Streaming API continuously.

Limitations: To collect tweets for two different words, the Twitter API needs to be queried twice; there is no single function call that collects both result sets at the same time.

Example: Suppose we need to collect tweets containing the word "data" and also tweets containing the word "flaw". The Search API needs to be called twice, once for the word "data" and once for the word "flaw" (a looping sketch follows the list below):

  1. twitter.search(q="data")
  2. twitter.search(q="flaw")

Note: The Search API limit is 180 requests every 15 minutes, so the code sleeps (for 16 minutes, to leave a small margin) every time the API limit is reached.
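
The remaining quota can also be checked explicitly. A minimal sketch, again assuming the twitter client from the next cell, uses Twython's get_application_rate_limit_status() wrapper around the application/rate_limit_status endpoint:

# Query the remaining quota for the search/tweets endpoint
status = twitter.get_application_rate_limit_status(resources="search")
search_limit = status["resources"]["search"]["/search/tweets"]
print(search_limit["remaining"], "requests left; window resets at", search_limit["reset"])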


In [ ]:
#import libraries
from twython import Twython, TwythonError
import time
from time import gmtime, strftime
import json
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import datetime
import os

#Configurable parameters
query_word = "CVE"
max_tweets = 800

tweet_cnt = 0
# Requires Authentication as of Twitter API v1.1
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

searched_tweets_srch = []
while len(searched_tweets_srch) < max_tweets:
    # Never request more than the remaining number of tweets (100 max per call)
    remaining_tweets = max_tweets - len(searched_tweets_srch)
    try:
        search_results = twitter.search(q=query_word, count=min(remaining_tweets, 100))

        if not search_results["statuses"]:
            print('no tweets found')
            break

        tweet_cnt = tweet_cnt + len(search_results["statuses"])
        searched_tweets_srch.extend(search_results["statuses"])

    except TwythonError as e:
        # Typically a rate-limit error; wait for the 15-minute window to reset
        print(e)
        print("exception raised, waiting 16 minutes")
        print(strftime("%H:%M:%S", gmtime()))
        time.sleep(16*60)

print("Total tweets extracted for " + query_word + ": " + str(tweet_cnt))

The following links have details on the structure of a tweet:

  1. URL 1
  2. URL 2
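
Each returned tweet is a nested dictionary. As an illustration, the fields used later in this notebook (plus the creation timestamp) can be accessed as follows, assuming at least one tweet was collected:

tweet = searched_tweets_srch[0]

print(tweet['id'])                   # unique tweet id
print(tweet['user']['screen_name'])  # author's screen name
print(tweet['text'])                 # tweet text
print(tweet['created_at'])           # creation timestamp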

We will define a function save_tweets_json which takes save_path as an argument and stores each tweet as a .json file. For example, if you download 200 tweets, there will accordingly be 200 files in the folder. Each file is named <screen_name>_<tweet_id>.json to disambiguate the tweets and avoid storing duplicates.


In [ ]:
def save_tweets_json(save_path):
    # Save each collected tweet as an individual JSON file named
    # <screen_name>_<tweet_id>.json; duplicates map to the same file
    # name and are therefore not stored twice.
    for tweet in searched_tweets_srch:
        file_name = tweet['user']['screen_name'] + '_' + str(tweet['id']) + ".json"
        with open(os.path.join(save_path, file_name), "w") as outfile:
            outfile.write(json.dumps(tweet, indent=4))
            outfile.write("\n")

save_tweets_json(save_path)
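
For later analysis the saved files can be read back into memory. A minimal sketch, assuming the folder created above:

import glob

loaded_tweets = []
for path in glob.glob(os.path.join(save_path, "*.json")):
    with open(path) as infile:
        loaded_tweets.append(json.load(infile))

print(len(loaded_tweets), "tweets loaded from", save_path)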

In [10]:
print (searched_tweets_srch)


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

References:

  1. Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits, https://www.umiacs.umd.edu/~tdumitra/papers/USENIX-SECURITY-2015.pdf

  2. http://www.umiacs.umd.edu/~tdumitra/blog/2015/08/02/predicting-vulnerability-exploits/