Notebook 2

Our current dataset suffers from duplicate tweets from bots, hacked accounts, etc.

As such, this notebook will show you how to deal with these duplicates in a manner that does not take $O(N^{2})$ time.

Traditionally, the way to search for duplicates would be to write some code with nested for loops that compares each entry to every other entry in the list. While this approach would work, it is incredibly slow for a list with a large number of elements. In our case, the list is just over 1 million elements long, which means roughly half a trillion pairwise comparisons.
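To make the cost concrete, here is a minimal sketch of the nested-loop idea on a toy list. The function name `exact_duplicates_quadratic` and the sample data are made up for illustration and are not part of this notebook's pipeline.

```python
def exact_duplicates_quadratic(items):
    """Return positions of items that exactly repeat an earlier item.

    Every item is compared against all earlier items, so the number of
    comparisons grows as O(N^2) in the worst case.
    """
    duplicate_positions = []
    for i in range(len(items)):
        for j in range(i):
            if items[i] == items[j]:
                duplicate_positions.append(i)
                break  # one earlier match is enough to flag position i
    return duplicate_positions


# Toy data, purely illustrative.
sample = ["a", "b", "a", "c", "b", "a"]
print(exact_duplicates_quadratic(sample))  # → [2, 4, 5]
```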

Furthermore, this approach only really works when you are looking for EXACT duplicates. This will not do for our case as bots will re-tweet the original tweet with either a different URL link or slightly different formatting. Thus, while a human would be able to tell that the tweet is practically identical, a computer would not.

One possible solution is to use a form of fuzzy matching: if two strings are similar enough to pass a threshold of, say, 0.9 (for example, a normalised Levenshtein similarity), then we can assume they are duplicate tweets. While this is a fantastic approach, it is still $O(N^{2})$, because every tweet must still be compared against every other tweet.
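As a rough sketch of what fuzzy matching looks like, the snippet below uses `difflib.SequenceMatcher` from the standard library. Note that this is a stand-in: its `ratio()` is a similarity score between 0 and 1, not a true Levenshtein distance, and the function names and sample tweets are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def is_near_duplicate(a, b, threshold=0.9):
    """True if the similarity ratio of the two strings meets the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def fuzzy_duplicates(tweets, threshold=0.9):
    """Positions of tweets that are near-duplicates of an earlier tweet.

    Still quadratic: each tweet is compared against all earlier tweets.
    """
    duplicate_positions = []
    for i in range(len(tweets)):
        for j in range(i):
            if is_near_duplicate(tweets[i], tweets[j], threshold):
                duplicate_positions.append(i)
                break
    return duplicate_positions


# Illustrative sample: the second tweet differs only by a trailing "!".
sample = [
    "Multiplayer #Poker launching tomorrow on",
    "Multiplayer #Poker launching tomorrow on!",
    "Volatile Bitcoin Nears Its All-Time High",
]
print(fuzzy_duplicates(sample))
```

The nested loop is what keeps the cost quadratic, no matter how fast each individual comparison is.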

In this notebook, I present a different, more nuanced approach!


In [1]:
import pandas as pd
import arrow # way better than datetime
import numpy as np
import random
import re
%run helper_functions.py
import string

In [2]:
new_df = unpickle_object("new_df.pkl") # this loads up the dataframe from our previous notebook

In [3]:
new_df.head() #sorted first on date and then time!


Out[3]:
date handle time tweet
0 2017-02-22 FoabMoab 19:35:43 RT @bitcoinagile: .1 #bitcoin BTC Straight To ...
1 2017-02-22 Bitcoin_Revo 19:35:39 #bitcoin “Is Like Positive Bacteria”: Russian ...
2 2017-02-22 alt_bit_coins 19:35:31 Volatile Bitcoin Nears Its All-Time High - Wal...
3 2017-02-22 Rhino3nity 19:35:26 Missed out on #bitcoin? This #cryptocurrency c...
4 2017-02-22 Siimple_inc 19:35:18 RT @blockchainhelpr: 5 Weak Points Of #Blockch...

In [4]:
new_df.iloc[0, 3]


Out[4]:
'RT @bitcoinagile: .1 #bitcoin BTC Straight To Wallet USA Bitcoin Seller, Immediate Pay Out - BitCoin Gatherer https://t.co/2sslXiNHz1 https…'

In [5]:
#we need to remove all links in a tweet!
regex = r"http\S+"
subset = ""

In [6]:
removed_links = list(map(lambda x: re.sub(regex, subset, x), list(new_df['tweet'])))
removed_links = list(map(str.strip, removed_links))
new_df['tweet'] = removed_links

In [7]:
new_df.iloc[0, 3] # we can see here that the link has been removed!


Out[7]:
'RT @bitcoinagile: .1 #bitcoin BTC Straight To Wallet USA Bitcoin Seller, Immediate Pay Out - BitCoin Gatherer'
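For reference, the same link removal can be written as a small standalone function. This is only a sketch: `URL_PATTERN` and `strip_links` are hypothetical names, and the pattern is the same `http\S+` used above, which also catches truncated links such as `https…`.

```python
import re

# Same pattern as in the cell above: "http" followed by any run of non-whitespace.
URL_PATTERN = re.compile(r"http\S+")


def strip_links(text):
    """Remove every link-like token and trim surrounding whitespace."""
    return URL_PATTERN.sub("", text).strip()


tweet = ("RT @bitcoinagile: .1 #bitcoin BTC Straight To Wallet USA Bitcoin Seller, "
         "Immediate Pay Out - BitCoin Gatherer https://t.co/2sslXiNHz1 https…")
print(strip_links(tweet))
```

One caveat: a link in the middle of a tweet leaves two consecutive spaces behind. That is harmless here, because the later cleaning step splits on whitespace.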

In [8]:
new_df.iloc[1047748, [1, 3]] #example of a duplicate entry - different handles, same tweet


Out[8]:
handle                                FaucetGaming
tweet     Multiplayer #Poker launching tomorrow on
Name: 1047748, dtype: object

In [9]:
new_df.iloc[1047749, [1, 3]]


Out[9]:
handle                                        CryptoPromote
tweet     RT @FaucetGaming: Multiplayer #Poker launching...
Name: 1047749, dtype: object

In [10]:
#this illustrates only one example of duplicates in the data!
duplicate_indices = []
for position, tweet in enumerate(new_df['tweet']): #iterate by position so that .iloc below is always valid
    if "Multiplayer #Poker" in tweet:
        duplicate_indices.append(position)

new_df.iloc[duplicate_indices, [1,3]]


Out[10]:
handle tweet
1047748 FaucetGaming Multiplayer #Poker launching tomorrow on
1047749 CryptoPromote RT @FaucetGaming: Multiplayer #Poker launching...
1047750 CryptoPromote Multiplayer #Poker launching tomorrow on
1047751 EmmaBitcoin Multiplayer #Poker launching tomorrow on
1047752 ClaraBitcoin Multiplayer #Poker launching tomorrow on
1047753 NickolayV7 RT @ClaraBitcoin: Multiplayer #Poker launching...
1047754 ehsminer RT @CryptoPromote: Multiplayer #Poker launchin...
1047756 CryptoMegn Multiplayer #Poker launching tomorrow on
1047757 LenaBitcoin Multiplayer #Poker launching tomorrow on
1047758 StartUpRealTime RT @LenaBitcoin: Multiplayer #Poker launching ...
1047759 CoinCreation Multiplayer #Poker launching tomorrow on
1047760 MariaBitcoin Multiplayer #Poker launching tomorrow on
1047761 CryptoWePromote Multiplayer #Poker launching tomorrow on
1047763 NickolayV10 RT @CryptoWePromote: Multiplayer #Poker launch...

In [11]:
tweet_list = list(new_df['tweet']) #let's first make a list of the tweets we need to remove duplicates from

In [12]:
string.punctuation


Out[12]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
remove_punctuaton = '!"$%&\'()*+,-./:;<=>?@[\\]“”^_`{|}~' # string.punctuation without # (I want hashtags!), plus the curly quotes “ and ”

In [14]:
set_list = []
clean_tweet_list = []
translator = str.maketrans('', '', remove_punctuaton) #very fast punctuation remover!
for tweet in tweet_list:
    list_form = tweet.split() #splits the tweet into a list of tokens

    to_process = [x for x in list_form if not x.startswith("@")] #removes handles

    to_process_2 = [x for x in to_process if not x.startswith("RT")] #removes the retweet indicator (and any other token starting with "RT")

    string_form = " ".join(to_process_2) #back into a string

    clean_string = string_form.translate(translator).strip().lower() #strip punctuation and lowercase, once

    set_form = set(clean_string.split()) #this is the magic! word order and repeated words no longer matter

    clean_tweet_list.append(clean_string)

    set_list.append(tuple(set_form)) #need to make it a tuple so it's hashable!

new_df['tuple_version_tweet'] = set_list
new_df['clean_tweet_V1'] = clean_tweet_list

In [15]:
new_df.head()


Out[15]:
date handle time tweet tuple_version_tweet clean_tweet_V1
0 2017-02-22 FoabMoab 19:35:43 RT @bitcoinagile: .1 #bitcoin BTC Straight To ... (immediate, 1, to, #bitcoin, out, straight, us... 1 #bitcoin btc straight to wallet usa bitcoin ...
1 2017-02-22 Bitcoin_Revo 19:35:39 #bitcoin “Is Like Positive Bacteria”: Russian ... (like, bank, #bitcoin, vice, state, russian, p... #bitcoin is like positive bacteria russian sta...
2 2017-02-22 alt_bit_coins 19:35:31 Volatile Bitcoin Nears Its All-Time High - Wal... (volatile, wall, street, subscription, blog, i... volatile bitcoin nears its alltime high wall ...
3 2017-02-22 Rhino3nity 19:35:26 Missed out on #bitcoin? This #cryptocurrency c... (this, millionaires, over, worldwide, created,... missed out on #bitcoin this #cryptocurrency cr...
4 2017-02-22 Siimple_inc 19:35:18 RT @blockchainhelpr: 5 Weak Points Of #Blockch... (points, of, #bitcoin, technology, #tech, 5, w... 5 weak points of #blockchain technology #tech ...

In [16]:
new_df.iloc[1047748, 4] # we have extracted the core text from the tweets! YAY!


Out[16]:
('multiplayer', 'on', 'launching', 'tomorrow', '#poker')

In [17]:
new_df.iloc[1047749, 4]


Out[17]:
('multiplayer', 'on', 'launching', 'tomorrow', '#poker')

In [18]:
new_df.iloc[1047748, 4] == new_df.iloc[1047749, 4] #the original and its retweet now match - this is perfect!


Out[18]:
True

In [19]:
new_df.shape #dimensions before duplicate removal!


Out[19]:
(1049878, 6)

In [20]:
test_df = new_df.drop_duplicates(subset='tuple_version_tweet', keep="first") #keep the first occurrence
#otherwise drop rows that have matching tuples!

We see from the above code that I have removed duplicates by creating, for each tweet, a tuple of the set of words that remain after removing URLs, handles, punctuation etc.

This way we get to utilise the power of pandas to drop rows that contain duplicate tweets.

It is important to note that this is NOT a foolproof way to drop ALL duplicates. However, I am confident that it will drop the majority!
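The idea can be sketched without pandas as a single pass over the tweets, which is what avoids the $O(N^{2})$ cost. The sketch below is not the notebook's exact code, and it deviates from it in two ways that are my own assumptions. First, it uses a `frozenset` as the key instead of `tuple(set(...))`, because two equal sets are not guaranteed to iterate in the same order, so their tuples could in principle differ (`tuple(sorted(set_form))` would serve the same purpose). Second, it drops only the exact token `RT` rather than every token starting with `RT`. The names `tweet_key` and `drop_near_duplicates` are hypothetical.

```python
import string

# string.punctuation without '#', plus curly quotes (mirrors the notebook's list).
PUNCTUATION = string.punctuation.replace("#", "") + "“”"
TRANSLATOR = str.maketrans("", "", PUNCTUATION)


def tweet_key(tweet):
    """Order-independent key: the set of cleaned, lowercased words in the tweet."""
    tokens = [t for t in tweet.split() if not t.startswith("@") and t != "RT"]
    cleaned = " ".join(tokens).translate(TRANSLATOR).lower()
    return frozenset(cleaned.split())


def drop_near_duplicates(tweets):
    """Keep the first tweet for each key; one pass with O(1) average set lookups."""
    seen = set()
    kept = []
    for tweet in tweets:
        key = tweet_key(tweet)
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept


# Illustrative sample based on the example above.
sample = [
    "Multiplayer #Poker launching tomorrow on",
    "RT @FaucetGaming: Multiplayer #Poker launching tomorrow on",
    "Volatile Bitcoin Nears Its All-Time High",
]
print(drop_near_duplicates(sample))
```

A side effect of using a set as the key is that tweets containing the same words in a different order, or with words repeated, are also treated as duplicates. That is one reason the method is not foolproof.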


In [21]:
#let's use the example from before! - it only occurs once now!
for index, value in enumerate(test_df.iloc[:, 3]):
    if "Multiplayer #Poker" in value:
        print(test_df.iloc[index, [1,3]])


handle                                FaucetGaming
tweet     Multiplayer #Poker launching tomorrow on
Name: 1047748, dtype: object

In [22]:
new_df.shape


Out[22]:
(1049878, 6)

In [23]:
test_df.shape


Out[23]:
(611980, 6)

In [24]:
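((612644-1049878)/1049878)*100 #roughly a 41% reduction! NB: 612644 is hard-coded and differs from test_df.shape[0] (611980) above, presumably left over from an earlier run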


Out[24]:
-41.64617222191531
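To avoid hard-coding row counts, the reduction can be computed from the two shapes directly. A minimal sketch follows, using the row counts printed in Out[22] and Out[23]; `percent_change` is a hypothetical helper, not something defined in helper_functions.py.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative means a reduction)."""
    return (after - before) / before * 100


rows_before = 1049878  # new_df.shape[0], from Out[22]
rows_after = 611980    # test_df.shape[0], from Out[23]

change = percent_change(rows_before, rows_after)
print(f"{change:.2f}% change in row count")
```

In the notebook itself, the same function could be called with `new_df.shape[0]` and `test_df.shape[0]`, so the figure always reflects the current dataframes.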

Note: I added a column called clean_tweet_V1. These are the tweets stripped of handles, retweet markers and punctuation, and converted to lowercase. They will be very useful for our NLP process later on when it comes to lemmatization.


In [31]:
pickle_object(test_df, "no_duplicates_df")
