Our current dataset suffers from duplicate tweets from bots, hacked accounts, etc.
As such, this notebook will show you how to deal with these duplicates in a manner that does not take $O(N^{2})$ time.
Traditionally, the way to search for duplicates would be to write some code with nested for loops that passes over the list twice, comparing one entry to every other entry in the list. While this approach would work, it is incredibly slow for a list with a large number of elements. In our case, our list is just over 1 million elements long.
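For concreteness, here is a minimal sketch of that naive approach (the function name and toy list are mine, purely for illustration):

def find_exact_duplicate_positions(tweets):
    # Naive O(N^2) scan: compare every tweet against every later tweet.
    duplicate_positions = set()
    for i in range(len(tweets)):
        for j in range(i + 1, len(tweets)):
            if tweets[i] == tweets[j]:
                duplicate_positions.add(j)
    return duplicate_positions

find_exact_duplicate_positions(["a", "b", "a"])  # {2} -- fine for 3 tweets, hopeless for 1 million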
Furthermore, this approach only really works when you are looking for EXACT duplicates. That will not do for our case, as bots re-tweet the original tweet with either a different URL or slightly different formatting. Thus, while a human would be able to tell that the tweets are practically identical, a computer would not.
One possible solution is to use a form of fuzzy matching: if two strings are similar enough to pass a threshold of, say, 0.9 on a similarity ratio (based on, e.g., Levenshtein distance), then we can assume they are duplicate tweets. While this is a fantastic approach, it is still $O(N^{2})$.
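A minimal sketch of that fuzzy-matching idea, using the standard library's difflib (whose ratio is based on matching blocks rather than true Levenshtein distance, but the principle is the same; the example tweets are made up):

from difflib import SequenceMatcher

def is_fuzzy_duplicate(a, b, threshold=0.9):
    # Treat two tweets as duplicates if their similarity ratio passes the threshold.
    return SequenceMatcher(None, a, b).ratio() >= threshold

is_fuzzy_duplicate("Play Multiplayer #Poker now!",
                   "Play Multiplayer #Poker now")  # True -- near-identical strings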
In this notebook, I present a different, more nuanced approach!
In [1]:
import pandas as pd
import arrow # way better than datetime
import numpy as np
import random
import re
%run helper_functions.py
import string
In [2]:
new_df = unpickle_object("new_df.pkl") # this loads up the dataframe from our previous notebook
In [3]:
new_df.head() #sorted first on date and then time!
Out[3]:
In [4]:
new_df.iloc[0, 3]
Out[4]:
In [5]:
#we need to remove all links in a tweet!
regex = r"http\S+"
subset = ""
In [6]:
removed_links = list(map(lambda x: re.sub(regex, subset, x), list(new_df['tweet'])))
removed_links = list(map(str.strip, removed_links))
new_df['tweet'] = removed_links
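For what it's worth, the same step can be written as a single pandas expression (a sketch that should be equivalent to the map above):

new_df['tweet'] = new_df['tweet'].str.replace(regex, subset, regex=True).str.strip()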
In [7]:
new_df.iloc[0, 3] # we can see here that the link has been removed!
Out[7]:
In [8]:
new_df.iloc[1047748, [1, 3]] #example of a duplicate entry - different handles, same tweet
Out[8]:
In [9]:
new_df.iloc[1047749, [1, 3]]
Out[9]:
In [10]:
#this illustrates only one example of duplicates in the data!
duplicate_indices = []
for position, label in enumerate(new_df.index): # position, not the index label, is what .iloc expects
    if "Multiplayer #Poker" in new_df.iloc[position, 3]:
        duplicate_indices.append(position)
new_df.iloc[duplicate_indices, [1, 3]]
Out[10]:
In [11]:
tweet_list = list(new_df['tweet']) #let's first make a list of the tweets we need to remove duplicates from
In [12]:
string.punctuation
Out[12]:
In [13]:
remove_punctuation = '!"$%&\'()*+,-./:;<=>?@[\\]“”^_`{|}~' # like string.punctuation, but without # (I want hashtags!) and with curly quotes added
In [14]:
set_list = []
clean_tweet_list = []
translator = str.maketrans('', '', remove_punctuation) #very fast punctuation remover!
for tweet in tweet_list:
    list_form = tweet.split() #turns the tweet into a list of words
    to_process = [x for x in list_form if not x.startswith("@")] #removes handles
    to_process_2 = [x for x in to_process if not x.startswith("RT")] #removes the retweet indicator
    string_form = " ".join(to_process_2) #back into a string
    set_form = set(string_form.translate(translator).strip().lower().split()) #this is the magic!
    clean_tweet_list.append(string_form.translate(translator).strip().lower())
    set_list.append(tuple(set_form)) #need to make it a tuple so it's hashable!
new_df['tuple_version_tweet'] = set_list
new_df['clean_tweet_V1'] = clean_tweet_list
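To see the normalisation in action, here is a sketch on a hypothetical pair of bot tweets (the texts are made up; regex, subset and translator are as defined above):

a = "RT @bot_one: Play Multiplayer #Poker now! https://t.co/xyz"
b = "Play Multiplayer #Poker now! https://t.co/abc"

def to_word_set(tweet):
    words = [w for w in re.sub(regex, subset, tweet).split()
             if not w.startswith("@") and not w.startswith("RT")]
    return set(" ".join(words).translate(translator).strip().lower().split())

to_word_set(a) == to_word_set(b)  # True -- both collapse to {'play', 'multiplayer', '#poker', 'now'}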
In [15]:
new_df.head()
Out[15]:
In [16]:
new_df.iloc[1047748, 4] # we have extracted the core text from the tweets! YAY!
Out[16]:
In [17]:
new_df.iloc[1047749, 4]
Out[17]:
In [18]:
new_df.iloc[1047748, 4] == new_df.iloc[1047749, 4] #this is perfect!
Out[18]:
In [19]:
new_df.shape #dimensions before duplicate removal!
Out[19]:
In [20]:
test_df = new_df.drop_duplicates(subset='tuple_version_tweet', keep="first") #keep the first occurrence
#otherwise drop rows that have matching tuples!
We see from the above code that I have removed duplicates by creating a tuple of the set of words in each tweet, after having removed the URLs, punctuation, etc.
This way we get to utilise the power of pandas to drop rows that contain duplicate tweets.
It is important to note that this is NOT a foolproof way to drop ALL duplicates. However, I am confident that it will drop the majority!
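The tuple conversion is what makes this work with pandas: drop_duplicates hashes the values in the subset column, and a plain set is unhashable. A quick illustration (frozenset would be an order-insensitive alternative key):

s = {"multiplayer", "#poker"}
# hash(s)          # TypeError: unhashable type: 'set'
hash(tuple(s))     # fine -- this is why each word set was converted to a tuple
hash(frozenset(s)) # also fine, and independent of the set's iteration order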
In [21]:
#let's use the example from before! - it only occurs once now!
for index, value in enumerate(test_df.iloc[:, 3]):
    if "Multiplayer #Poker" in value:
        print(test_df.iloc[index, [1, 3]])
In [22]:
new_df.shape
Out[22]:
In [23]:
test_df.shape
Out[23]:
In [24]:
((612644 - 1049878) / 1049878) * 100 # ~41.6% reduction!
Out[24]:
Note: I added a column called clean_tweet_V1. These are the tweets stripped of punctuation. These will be very useful for our NLP process later on when it comes to lemmatization.
In [31]:
pickle_object(test_df, "no_duplicates_df")