Notebook 1

This notebook builds the dataframe that holds our raw tweet data.


In [1]:
import pandas as pd
import arrow # way better than datetime
import numpy as np
import random
import re
%run helper_functions.py

Above, I import the arrow library to use in place of datetime. In my opinion, Arrow overcomes many of the shortcomings and much of the syntactic complexity of the datetime library!

Here is the documentation: https://arrow.readthedocs.io/en/latest/


In [2]:
df = pd.read_csv("tweets_formatted.txt", sep="| |", header=None)

In [3]:
df.shape


Out[3]:
(1049878, 1)

In [4]:
list_of_dicts = []
for raw_line in df.iloc[:, 0]:
    # each raw line looks like handle||tweet||timestamp
    temp_lst = raw_line.split("||")
    temp_dict = {'handle': temp_lst[0], 'tweet': temp_lst[1]}

    try:  # sometimes the date/time is missing - we will have to infer
        timestamp = arrow.get(temp_lst[2])  # parse once, reuse for both fields
        temp_dict['date'] = timestamp.date()
        temp_dict['time'] = timestamp.time()
    except Exception:  # missing or unparseable timestamp
        temp_dict['date'] = np.nan
        temp_dict['time'] = np.nan

    list_of_dicts.append(temp_dict)
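
The loop above boils down to one parsing step per line. Here is a minimal sketch of that step as a standalone function, assuming `handle||tweet||timestamp` fields and an ISO 8601 timestamp. It uses the standard library's `datetime.fromisoformat` as a stand-in for `arrow.get`, stores `None` rather than `np.nan` for missing values, and `parse_line` is a hypothetical name that does not come from helper_functions.py:

```python
from datetime import datetime


def parse_line(raw_line):
    """Split one 'handle||tweet||timestamp' line into a record dict.

    Assumption: the timestamp, when present, is ISO 8601.
    """
    fields = raw_line.split("||")
    record = {"handle": fields[0], "tweet": fields[1], "date": None, "time": None}
    try:
        stamp = datetime.fromisoformat(fields[2])
    except (IndexError, ValueError):
        # Missing or unparseable timestamp: leave date/time empty.
        return record
    record["date"] = stamp.date()
    record["time"] = stamp.time()
    return record


# Hypothetical input lines, not taken from the dataset.
complete = parse_line("some_handle||an example tweet||2016-10-26T17:52:26")
incomplete = parse_line("some_handle||an example tweet")
```

Catching only `IndexError` and `ValueError` keeps unrelated bugs from being silently turned into missing values.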

In [5]:
list_of_dicts[0].keys()


Out[5]:
dict_keys(['handle', 'tweet', 'date', 'time'])

In [6]:
new_df = pd.DataFrame(list_of_dicts) #magic!

In [7]:
new_df.head() #unsorted!


Out[7]:
   date        handle        time      tweet
0  2016-10-26  Bitcoin_City  17:52:26  #btc Hands-on with Microsoft’s Surface Studio:...
1  2016-10-26  Bitcoin_City  17:52:26  #btc Everything announced at Microsoft’s Windo...
2  2016-10-26  aesanto       17:52:26  RT @CloudExpo: Encore Presentation of #Blockch...
3  2016-10-26  Bitcoin_City  17:52:25  #btc Darkstore opens on-demand delivery fulfil...
4  2016-10-26  Bitcoin_City  17:52:25  #btc Microsoft shows off a new $99 input metho...
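
The conversion in In [6] works because pandas treats each dict as one row and each key as a column. A small sketch with hypothetical records (not taken from the dataset) shows this, including what happens when a record lacks a key:

```python
import pandas as pd

# Hypothetical records; the second one has no date/time keys.
records = [
    {"handle": "user_a", "tweet": "first tweet", "date": "2016-10-26", "time": "17:52:26"},
    {"handle": "user_b", "tweet": "second tweet"},
]

frame = pd.DataFrame(records)

# Each dict key becomes a column; keys missing from a record become NaN.
print(sorted(frame.columns))
print(frame["date"].isna().tolist())
```

Depending on the pandas version, the columns come out alphabetically (as in Out[7]) or in insertion order, so selecting columns by name is safer than selecting them by position.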

In [8]:
# newest first; drop=True discards the old index instead of keeping it as a column
new_df.sort_values(by=['date', 'time'], ascending=False, inplace=True)
new_df.reset_index(drop=True, inplace=True)
pickle_object(new_df, "new_df")

In [9]:
new_df.head()  # sorted by date, then by time, newest first


Out[9]:
   date        handle         time      tweet
0  2017-02-22  FoabMoab       19:35:43  RT @bitcoinagile: .1 #bitcoin BTC Straight To ...
1  2017-02-22  Bitcoin_Revo   19:35:39  #bitcoin “Is Like Positive Bacteria”: Russian ...
2  2017-02-22  alt_bit_coins  19:35:31  Volatile Bitcoin Nears Its All-Time High - Wal...
3  2017-02-22  Rhino3nity     19:35:26  Missed out on #bitcoin? This #cryptocurrency c...
4  2017-02-22  Siimple_inc    19:35:18  RT @blockchainhelpr: 5 Weak Points Of #Blockch...
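
Sorting in descending order puts the newest tweets first, while rows without a timestamp sink to the bottom, which is consistent with the NaN rows appearing near the end of the index further down. A minimal sketch of that behaviour, assuming a single combined `timestamp` column (a simplification of the notebook's separate date and time columns) and hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; the middle one has no timestamp.
frame = pd.DataFrame(
    {
        "handle": ["user_a", "user_b", "user_c"],
        "timestamp": pd.to_datetime(
            ["2016-10-26 17:52:26", None, "2017-02-22 19:35:43"]
        ),
    }
)

# Newest first; missing timestamps (NaT) go last by default (na_position="last").
newest_first = frame.sort_values(by="timestamp", ascending=False).reset_index(drop=True)
print(newest_first)
```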

Evidence of Duplicates

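The dataset clearly contains duplicates: in the sample below, the same "Multiplayer #Poker" tweet shows up under many different handles, both as original posts and as retweets. Let's first clean out the URLs.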
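
A minimal sketch of what such a URL cleanup could look like, assuming that a URL is any run of non-space characters starting with `http://` or `https://`. The name `strip_urls` and the sample text are hypothetical:

```python
import re

# Assumption: a URL is any run of non-space characters starting with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


cleaned = strip_urls("Read this https://example.com/abc now http://example.org")
print(cleaned)
```

Collapsing whitespace afterwards means two tweets that differ only in their links end up as identical strings.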


In [15]:
# row positions of every tweet that contains the sample phrase
sample_duplicate_indices = []
for i in new_df.index:
    if "Multiplayer #Poker" in new_df.loc[i, 'tweet']:  # select by name, not position
        sample_duplicate_indices.append(i)

In [25]:
new_df.iloc[sample_duplicate_indices, :]


Out[25]:
         date  handle           time  tweet
1047748  NaN   FaucetGaming     NaN   Multiplayer #Poker launching tomorrow on https...
1047749  NaN   CryptoPromote    NaN   RT @FaucetGaming: Multiplayer #Poker launching...
1047750  NaN   CryptoPromote    NaN   Multiplayer #Poker launching tomorrow on https...
1047751  NaN   EmmaBitcoin      NaN   Multiplayer #Poker launching tomorrow on https...
1047752  NaN   ClaraBitcoin     NaN   Multiplayer #Poker launching tomorrow on https...
1047753  NaN   NickolayV7       NaN   RT @ClaraBitcoin: Multiplayer #Poker launching...
1047754  NaN   ehsminer         NaN   RT @CryptoPromote: Multiplayer #Poker launchin...
1047756  NaN   CryptoMegn       NaN   Multiplayer #Poker launching tomorrow on https...
1047757  NaN   LenaBitcoin      NaN   Multiplayer #Poker launching tomorrow on https...
1047758  NaN   StartUpRealTime  NaN   RT @LenaBitcoin: Multiplayer #Poker launching ...
1047759  NaN   CoinCreation     NaN   Multiplayer #Poker launching tomorrow on https...
1047760  NaN   MariaBitcoin     NaN   Multiplayer #Poker launching tomorrow on https...
1047761  NaN   CryptoWePromote  NaN   Multiplayer #Poker launching tomorrow on https...
1047763  NaN   NickolayV10      NaN   RT @CryptoWePromote: Multiplayer #Poker launch...
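
The sample shows the same text both as original posts and as retweets prefixed with `RT @handle:`. One possible way to see how many copies share a text is to strip that prefix before counting. This is only a sketch on hypothetical tweets; `strip_retweet_prefix` is an illustrative name, and this is not necessarily the approach taken in the follow-up notebook:

```python
import re
from collections import Counter

# Matches a leading "RT @handle: " marker.
RETWEET_PREFIX = re.compile(r"^RT @\w+:\s*")


def strip_retweet_prefix(text):
    """Drop a leading 'RT @handle: ' so a retweet matches its original."""
    return RETWEET_PREFIX.sub("", text)


# Hypothetical tweets, not taken from the dataset.
tweets = [
    "Big launch tomorrow",
    "RT @user_a: Big launch tomorrow",
    "RT @user_b: Big launch tomorrow",
    "Something unrelated",
]

# Count how many tweets share the same text once the prefix is gone.
copies = Counter(strip_retweet_prefix(t) for t in tweets)
print(copies.most_common(1))
```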

Let's remove these duplicates in a separate notebook!

