Notebook 1

This notebook contains the code used to build the dataframe that holds our raw data.



In [1]:

    
import pandas as pd
import arrow # way better than datetime
import numpy as np
import random
import re
%run helper_functions.py

Above, I used the Arrow library instead of datetime. In my opinion, Arrow avoids many of the shortcomings and much of the syntactic complexity of the datetime library!

Here is the documentation: https://arrow.readthedocs.io/en/latest/



In [2]:

    
df = pd.read_csv("tweets_formatted.txt", sep="| |", header=None)



In [3]:

    
df.shape









    Out[3]:





(1049878, 1)



In [4]:

    
list_of_dicts = []
for i in range(df.shape[0]):
    # each raw line looks like handle||tweet||timestamp
    temp_lst = df.iloc[i, 0].split("||")
    temp_dict = {'handle': temp_lst[0], 'tweet': temp_lst[1]}

    try:  # sometimes the date/time is missing - we will have to infer it
        timestamp = arrow.get(temp_lst[2])  # parse once, reuse for both fields
        temp_dict['date'] = timestamp.date()
        temp_dict['time'] = timestamp.time()
    except Exception:  # missing field or unparseable timestamp
        temp_dict['date'] = np.nan
        temp_dict['time'] = np.nan

    list_of_dicts.append(temp_dict)
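
The row-by-row loop above can be slow on a million rows. As a rough sketch of a vectorized alternative, the block below splits every line at once and lets pandas parse the timestamps. Note that it swaps Arrow for `pd.to_datetime`, so missing values come out as `NaT` rather than `NaN`. The sample rows, the ISO-style timestamp format, and the names `raw`, `parts`, `stamps` and `sketch_df` are all assumptions made for illustration; they are not taken from the real file.

```python
import pandas as pd

# Hypothetical sample rows in the assumed "handle||tweet||timestamp" layout;
# the second row has no timestamp field at all.
raw = pd.DataFrame({0: [
    "some_handle||first example tweet||2016-10-26T17:52:26",
    "other_handle||second example tweet",
]})

# Split on the literal "||" (the regex=False argument needs pandas >= 1.4).
parts = raw[0].str.split("||", n=2, expand=True, regex=False)
parts.columns = ["handle", "tweet", "timestamp"]

# Missing or unparseable timestamps become NaT instead of raising.
stamps = pd.to_datetime(parts["timestamp"], errors="coerce")

sketch_df = parts[["handle", "tweet"]].copy()
sketch_df["date"] = stamps.dt.date
sketch_df["time"] = stamps.dt.time
```

Whether this is actually faster on the full file would need to be measured; the loop has the advantage of being easy to debug line by line.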



In [5]:

    
list_of_dicts[0].keys()









    Out[5]:





dict_keys(['handle', 'tweet', 'date', 'time'])



In [6]:

    
new_df = pd.DataFrame(list_of_dicts) #magic!



In [7]:

    
new_df.head() #unsorted!









    Out[7]:

   date        handle        time      tweet
0  2016-10-26  Bitcoin_City  17:52:26  #btc Hands-on with Microsoft’s Surface Studio:...
1  2016-10-26  Bitcoin_City  17:52:26  #btc Everything announced at Microsoft’s Windo...
2  2016-10-26  aesanto       17:52:26  RT @CloudExpo: Encore Presentation of #Blockch...
3  2016-10-26  Bitcoin_City  17:52:25  #btc Darkstore opens on-demand delivery fulfil...
4  2016-10-26  Bitcoin_City  17:52:25  #btc Microsoft shows off a new $99 input metho...



In [8]:

    
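# newest first; rows with a missing date/time (NaN) end up at the bottom
new_df.sort_values(by=['date', 'time'], ascending=False, inplace=True)
new_df.reset_index(drop=True, inplace=True)  # discard the old, unsorted index
pickle_object(new_df, "new_df")  # helper, presumably from helper_functions.py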



In [9]:

    
new_df.head() #sorted first on date and then on time









    Out[9]:

   date        handle         time      tweet
0  2017-02-22  FoabMoab       19:35:43  RT @bitcoinagile: .1 #bitcoin BTC Straight To ...
1  2017-02-22  Bitcoin_Revo   19:35:39  #bitcoin “Is Like Positive Bacteria”: Russian ...
2  2017-02-22  alt_bit_coins  19:35:31  Volatile Bitcoin Nears Its All-Time High - Wal...
3  2017-02-22  Rhino3nity     19:35:26  Missed out on #bitcoin? This #cryptocurrency c...
4  2017-02-22  Siimple_inc    19:35:18  RT @blockchainhelpr: 5 Weak Points Of #Blockch...

Evidence of Duplicates

It is clear that we have some duplicates. Let's first clean out the URLs.
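
One way to strip URLs is a small regex helper, sketched below. The pattern, the `strip_urls` name and the example URL are assumptions made for illustration, not the code actually used on this dataset.

```python
import re

# Assumed pattern: anything starting with http:// or https://,
# up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove URLs from a tweet and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

# Hypothetical tweet text with a placeholder URL.
cleaned = strip_urls("Multiplayer #Poker launching tomorrow on https://example.com now")
```

Applied to the dataframe, this could be something like `new_df['tweet'].apply(strip_urls)`, though tweets whose URLs were cut off mid-link may need extra handling.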



In [15]:

    
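# Find rows whose tweet contains a phrase we know is duplicated.
# The column is selected by name because its position depends on the pandas version.
is_sample_duplicate = new_df['tweet'].str.contains("Multiplayer #Poker", regex=False)
sample_duplicate_indicies = list(new_df.index[is_sample_duplicate])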



In [25]:

    
new_df.iloc[sample_duplicate_indicies, :]









    Out[25]:

         date  handle           time  tweet
1047748  NaN   FaucetGaming     NaN   Multiplayer #Poker launching tomorrow on https...
1047749  NaN   CryptoPromote    NaN   RT @FaucetGaming: Multiplayer #Poker launching...
1047750  NaN   CryptoPromote    NaN   Multiplayer #Poker launching tomorrow on https...
1047751  NaN   EmmaBitcoin      NaN   Multiplayer #Poker launching tomorrow on https...
1047752  NaN   ClaraBitcoin     NaN   Multiplayer #Poker launching tomorrow on https...
1047753  NaN   NickolayV7       NaN   RT @ClaraBitcoin: Multiplayer #Poker launching...
1047754  NaN   ehsminer         NaN   RT @CryptoPromote: Multiplayer #Poker launchin...
1047756  NaN   CryptoMegn       NaN   Multiplayer #Poker launching tomorrow on https...
1047757  NaN   LenaBitcoin      NaN   Multiplayer #Poker launching tomorrow on https...
1047758  NaN   StartUpRealTime  NaN   RT @LenaBitcoin: Multiplayer #Poker launching ...
1047759  NaN   CoinCreation     NaN   Multiplayer #Poker launching tomorrow on https...
1047760  NaN   MariaBitcoin     NaN   Multiplayer #Poker launching tomorrow on https...
1047761  NaN   CryptoWePromote  NaN   Multiplayer #Poker launching tomorrow on https...
1047763  NaN   NickolayV10      NaN   RT @CryptoWePromote: Multiplayer #Poker launch...

Let's remove these duplicates in a separate notebook!
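
To give a rough idea of what that could involve, here is a minimal sketch that treats a retweet as a duplicate of the tweet it repeats. The `RT @handle:` pattern, the `normalize_tweet` helper and the sample tweets are assumptions for illustration only; this is not the deduplication code from the other notebook.

```python
import re

# Assumed retweet marker: "RT @handle: " at the start of the tweet.
RETWEET_PREFIX = re.compile(r"^RT @\w+:\s*")

def normalize_tweet(text):
    """Drop a leading retweet marker and collapse whitespace."""
    return " ".join(RETWEET_PREFIX.sub("", text).split())

# Hypothetical tweets modelled on the retweet pattern seen above.
tweets = [
    "Example announcement #Poker",
    "RT @some_handle: Example announcement #Poker",
    "A different tweet",
]

# Keep only the first tweet for each normalized text.
seen = set()
unique_tweets = []
for tweet in tweets:
    key = normalize_tweet(tweet)
    if key not in seen:
        seen.add(key)
        unique_tweets.append(tweet)
```

Exact matching like this would miss tweets that differ only in a shortened URL or a truncated ending, so real data probably needs URL cleaning first.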


