Our current dataset suffers from duplicate tweets from bots, hacked accounts, etc.
As such, this notebook will show you how to deal with these duplicates in a manner that does not take $O(N^{2})$ time.
Traditionally, the way to search for duplicates would be to write some code with nested for loops that passes over the list twice, comparing one entry to every other entry in the list. While this approach would work, it is incredibly slow for a list with a large number of elements. In our case, our list is just over 1 million elements long.
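For concreteness, here is a minimal sketch of that naive approach (the function name and toy list are mine, purely for illustration):

def find_exact_duplicate_positions(tweets):
    # Naive O(N^2) scan: compare every tweet against every later tweet.
    duplicate_positions = set()
    for i in range(len(tweets)):
        for j in range(i + 1, len(tweets)):
            if tweets[i] == tweets[j]:
                duplicate_positions.add(j)
    return duplicate_positions

find_exact_duplicate_positions(["a", "b", "a"])  # {2} -- fine for 3 tweets, hopeless for 1 million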
Furthermore, this approach only really works when you are looking for EXACT duplicates. That will not do for our case, as bots re-tweet the original tweet with either a different URL or slightly different formatting. Thus, while a human would be able to tell that the tweets are practically identical, a computer would not.
One possible solution is to use a form of fuzzy matching: if two strings are similar enough to pass a threshold of, say, 0.9 on a similarity ratio (based on, e.g., Levenshtein distance), then we can assume they are duplicate tweets. While this is a fantastic approach, it is still $O(N^{2})$.
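A minimal sketch of that fuzzy-matching idea, using the standard library's difflib (whose ratio is based on matching blocks rather than true Levenshtein distance, but the principle is the same; the example tweets are made up):

from difflib import SequenceMatcher

def is_fuzzy_duplicate(a, b, threshold=0.9):
    # Treat two tweets as duplicates if their similarity ratio passes the threshold.
    return SequenceMatcher(None, a, b).ratio() >= threshold

is_fuzzy_duplicate("Play Multiplayer #Poker now!",
                   "Play Multiplayer #Poker now")  # True -- near-identical strings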
In this notebook, I present a different, more nuanced approach!
In [1]:
import pandas as pd
import arrow # way better than datetime
import numpy as np
import random
import re
%run helper_functions.py
import string
In [2]:
new_df = unpickle_object("new_df.pkl") # this loads up the dataframe from our previous notebook
In [3]:
new_df.head() #sorted first on date and then time!
Out[3]:
In [4]:
new_df.iloc[0, 3]
Out[4]:
In [5]:
#we need to remove all links in a tweet!
regex = r"http\S+"
subset = ""
In [6]:
removed_links = list(map(lambda x: re.sub(regex, subset, x), list(new_df['tweet'])))
removed_links = list(map(str.strip, removed_links))
new_df['tweet'] = removed_links
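For what it's worth, the same step can be written as a single pandas expression (a sketch that should be equivalent to the map above):

new_df['tweet'] = new_df['tweet'].str.replace(regex, subset, regex=True).str.strip()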
In [7]:
new_df.iloc[0, 3] # we can see here that the link has been removed!
Out[7]:
In [8]:
new_df.iloc[1047748, [1, 3]] #example of a duplicate entry - different handles, same tweet
Out[8]:
In [9]:
new_df.iloc[1047749, [1, 3]]
Out[9]:
In [10]:
#this illustrates only one example of duplicates in the data!
duplicate_indices = []
for position, label in enumerate(new_df.index): # position, not the index label, is what .iloc expects
    if "Multiplayer #Poker" in new_df.iloc[position, 3]:
        duplicate_indices.append(position)
new_df.iloc[duplicate_indices, [1, 3]]
Out[10]:
In [11]:
tweet_list = list(new_df['tweet']) #let's first make a list of the tweets we need to remove duplicates from
In [12]:
string.punctuation
Out[12]:
In [13]:
remove_punctuation = '!"$%&\'()*+,-./:;<=>?@[\\]“”^_`{|}~' # like string.punctuation, but without # (I want hashtags!) and with curly quotes added
In [14]:
set_list = []
clean_tweet_list = []
translator = str.maketrans('', '', remove_punctuation) #very fast punctuation remover!
for tweet in tweet_list:
    list_form = tweet.split() #turns the tweet into a list of words
    to_process = [x for x in list_form if not x.startswith("@")] #removes handles
    to_process_2 = [x for x in to_process if not x.startswith("RT")] #removes the retweet indicator
    string_form = " ".join(to_process_2) #back into a string
    set_form = set(string_form.translate(translator).strip().lower().split()) #this is the magic!
    clean_tweet_list.append(string_form.translate(translator).strip().lower())
    set_list.append(tuple(set_form)) #need to make it a tuple so it's hashable!
new_df['tuple_version_tweet'] = set_list
new_df['clean_tweet_V1'] = clean_tweet_list
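To see the normalisation in action, here is a sketch on a hypothetical pair of bot tweets (the texts are made up; regex, subset and translator are as defined above):

a = "RT @bot_one: Play Multiplayer #Poker now! https://t.co/xyz"
b = "Play Multiplayer #Poker now! https://t.co/abc"

def to_word_set(tweet):
    words = [w for w in re.sub(regex, subset, tweet).split()
             if not w.startswith("@") and not w.startswith("RT")]
    return set(" ".join(words).translate(translator).strip().lower().split())

to_word_set(a) == to_word_set(b)  # True -- both collapse to {'play', 'multiplayer', '#poker', 'now'}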
In [15]:
new_df.head()
Out[15]:
In [16]:
new_df.iloc[1047748, 4] # we have extracted the core text from the tweets! YAY!
Out[16]:
In [17]:
new_df.iloc[1047749, 4]
Out[17]:
In [18]:
new_df.iloc[1047748, 4] == new_df.iloc[1047749, 4] #this is perfect!
Out[18]:
In [19]:
new_df.shape #dimensions before duplicate removal!
Out[19]:
In [20]:
test_df = new_df.drop_duplicates(subset='tuple_version_tweet', keep="first") #keep the first occurrence
#otherwise drop rows that have matching tuples!
We see from the above code that I have removed duplicates by creating a tuple of the set of words in each tweet, after having removed the URLs, punctuation, etc.
This way we get to utilise the power of pandas to drop rows that contain duplicate tweets.
It is important to note that this is NOT a foolproof way to drop ALL duplicates. However, I am confident that it will drop the majority!
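The tuple conversion is what makes this work with pandas: drop_duplicates hashes the values in the subset column, and a plain set is unhashable. A quick illustration (frozenset would be an order-insensitive alternative key):

s = {"multiplayer", "#poker"}
# hash(s)          # TypeError: unhashable type: 'set'
hash(tuple(s))     # fine -- this is why each word set was converted to a tuple
hash(frozenset(s)) # also fine, and independent of the set's iteration order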
In [21]:
#let's use the example from before! - it only occurs once now!
for index, value in enumerate(test_df.iloc[:, 3]):
    if "Multiplayer #Poker" in value:
        print(test_df.iloc[index, [1, 3]])
In [22]:
new_df.shape
Out[22]:
In [23]:
test_df.shape
Out[23]:
In [24]:
((612644 - 1049878) / 1049878) * 100 # ~41.6% reduction!
Out[24]:
Note: I added a column called clean_tweet_V1. These are the tweets stripped of punctuation. These will be very useful for our NLP process later on when it comes to lemmatization.
In [31]:
pickle_object(test_df, "no_duplicates_df")