Notebook 3.

This notebook covers exploratory data analysis (EDA) of our data.

It also removes the duplicates that survived our first filtration pass in Notebook 2.


In [1]:
%run helper_functions.py
%run filters.py # contains the filtration code used in this notebook
%run plotly_functions.py # wrapper around plot.ly
from datetime import date
from tabulate import tabulate
from collections import Counter
from IPython.display import Image
import math
import string
%matplotlib inline
plt.rcParams["figure.figsize"] = (15,20)
plt.rcParams["xtick.labelsize"] = 16
plt.rcParams["ytick.labelsize"] = 16
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['legend.fontsize'] = 20
plt.style.use('fivethirtyeight')
pd.set_option('display.max_colwidth', -1)
import plotly.plotly as py
import plotly.graph_objs as go

In this notebook, we will try to handle missing data!


In [2]:
df = unpickle_object("no_duplicates_df.pkl")

In [3]:
df.shape


Out[3]:
(611980, 6)

In [4]:
df.head()


Out[4]:
date handle time tweet tuple_version_tweet clean_tweet_V1
0 2017-02-22 FoabMoab 19:35:43 RT @bitcoinagile: .1 #bitcoin BTC Straight To Wallet USA Bitcoin Seller, Immediate Pay Out - BitCoin Gatherer (immediate, 1, to, #bitcoin, out, straight, usa, seller, wallet, gatherer, pay, btc, bitcoin) 1 #bitcoin btc straight to wallet usa bitcoin seller immediate pay out bitcoin gatherer
1 2017-02-22 Bitcoin_Revo 19:35:39 #bitcoin “Is Like Positive Bacteria”: Russian State Bank Vice President (like, bank, #bitcoin, vice, state, russian, president, positive, bacteria, is) #bitcoin is like positive bacteria russian state bank vice president
2 2017-02-22 alt_bit_coins 19:35:31 Volatile Bitcoin Nears Its All-Time High - Wall Street Journal (subscription) (blog) (volatile, wall, street, subscription, blog, its, nears, alltime, journal, high, bitcoin) volatile bitcoin nears its alltime high wall street journal subscription blog
3 2017-02-22 Rhino3nity 19:35:26 Missed out on #bitcoin? This #cryptocurrency created OVER 300 MILLIONAIRES WORLDWIDE within 12... by #davidf1982 (this, millionaires, over, worldwide, created, within, on, #bitcoin, out, #davidf1982, 300, 12, by, #cryptocurrency, missed) missed out on #bitcoin this #cryptocurrency created over 300 millionaires worldwide within 12 by #davidf1982
4 2017-02-22 Siimple_inc 19:35:18 RT @blockchainhelpr: 5 Weak Points Of #Blockchain Technology #tech #bitcoin (points, of, #bitcoin, technology, #tech, 5, weak, #blockchain) 5 weak points of #blockchain technology #tech #bitcoin

In [5]:
percentage_missing(df) # it seems that only dates and times are missing from our data, and only 0.2% of them!


0.2% missing from: Column date
0.2% missing from: Column time
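
The source of `percentage_missing` lives in helper_functions.py and is not shown here. The following is only a minimal sketch of how such a helper could be written with pandas; the name `percentage_missing_sketch` and the toy values are hypothetical.

```python
import pandas as pd

def percentage_missing_sketch(frame):
    """Return {column: percent missing} for every column with missing values."""
    # isnull() flags missing cells; the mean of each boolean column is the
    # fraction of rows that are missing in that column.
    pct = frame.isnull().mean() * 100
    return {col: round(float(p), 1) for col, p in pct.items() if p > 0}

# Toy frame with made-up values: 1 of 4 dates and times is missing.
toy = pd.DataFrame({
    "date": ["2017-02-22", None, "2017-02-21", "2017-02-20"],
    "time": ["19:35:43", None, "00:03:59", "17:15:51"],
    "handle": ["a", "b", "c", "d"],
})

report = percentage_missing_sketch(toy)
for col, p in report.items():
    print(f"{p}% missing from: Column {col}")
```

With the counts reported in this notebook, 1286 missing rows out of 611980 is about 0.21%, which rounds to the 0.2% shown above.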

In [6]:
df[df["date"].isnull()].shape


Out[6]:
(1286, 6)

In [7]:
df[df["time"].isnull()].shape


Out[7]:
(1286, 6)

In [8]:
all(df[df["time"].isnull()].index == df[df["date"].isnull()].index) # perfect match for the indices!


Out[8]:
True

In [9]:
df[df["date"].isnull()].iloc[-1, :]


Out[9]:
date                   NaN                                       
handle                 SemperArgentum                            
time                   NaN                                       
tweet                  Spooked by spike in cyber extortion,      
tuple_version_tweet    (in, spike, extortion, spooked, cyber, by)
clean_tweet_V1         spooked by spike in cyber extortion       
Name: 1049876, dtype: object

The last row with a missing date has an index of 1049876


In [10]:
df[df["date"].isnull()].head()


Out[10]:
date handle time tweet tuple_version_tweet clean_tweet_V1
1047747 NaN kointrend NaN RT @FractalNode: Great News #Dash, #Bitcoin & #Steem blockchains in the Smartcoin Wallet (in, news, blockchains, #bitcoin, #dash, amp, the, smartcoin, wallet, #steem, great) great news #dash #bitcoin amp #steem blockchains in the smartcoin wallet
1047748 NaN FaucetGaming NaN Multiplayer #Poker launching tomorrow on (multiplayer, on, launching, tomorrow, #poker) multiplayer #poker launching tomorrow on
1047755 NaN FollowOrFadeMe NaN RT @FollowOrFadeMe: Join me at: (me, at, join) join me at
1047765 NaN bitcoin_experts NaN luke-jr: > So you propose a soft fork to 500 kB limit? (500, propose, soft, fork, kb, limit, so, to, gt, lukejr, a, you) lukejr gt so you propose a soft fork to 500 kb limit
1047766 NaN mikeytea123 NaN Get 2 ETH back with my referral! #KIBO less than 2 weeks left (my, weeks, get, referral, eth, #kibo, 2, less, back, than, with, left) get 2 eth back with my referral #kibo less than 2 weeks left

The first row with a missing date has an index of 1047747


In [11]:
1049876 - 1047747 # the index labels span 2129, which is larger than the 1286 rows with missing values

# so we know that the rows with missing values do not sit on consecutive index labels!


Out[11]:
2129
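
The reasoning in the cell above can be written as a small check. This is a sketch in plain Python; `is_contiguous` is a hypothetical helper, and the three numbers are the ones printed earlier in this notebook.

```python
def is_contiguous(first_label, last_label, n_rows):
    """True if n_rows integer labels fill the whole span first_label..last_label."""
    return (last_label - first_label + 1) == n_rows

first_label, last_label, n_missing = 1047747, 1049876, 1286

# Number of integer labels in the inclusive range covered by the missing rows.
span = last_label - first_label + 1
print(span, is_contiguous(first_label, last_label, n_missing))
```

Note that the span alone only shows that the labels are not contiguous. Because earlier de-duplication already removed rows, some of the gaps may be labels that no longer exist in the dataframe rather than rows that have a date.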

In [12]:
#these are the handles that have missing dates/times
list(set(df[df["date"].isnull()]['handle']))


Out[12]:
['Very spectacular info about Bitcoin',
 'RacataVajasanoh',
 'bitcoinsulting',
 'AfrikanNinja',
 'MakeMoney4rhome',
 'Adeniyiseyi15',
 'HiImYash',
 'India_Bitcoin',
 'BryceWeiner',
 'umukoroemmanu14',
 'JimMarous',
 'CryptoTraders',
 'cuttwood00716',
 'OsazeeKelvinH',
 'sthenc',
 'Freedom_Fever75',
 'kingtin1978',
 'Adela_Kuban',
 'jaytrades16',
 'BitcoinSpeaks',
 'ehpeacekeeper',
 'What are you waiting for?',
 'dmitriybtc',
 'Khaled_SD95',
 'nimblues',
 'mkunaini',
 'WhatIsWAVES',
 'LondonTech_',
 'satoshigallery',
 'andyapdirect',
 'Radi_speculant',
 'widytra',
 '.05 bitcoin out of pocket',
 'JL02048628',
 '1971231KIYA',
 'iGent_BigData',
 '#bitcoin',
 'PesamobCom',
 'AUCharles1',
 'ask4waqar',
 '-> USD',
 'igotfame05',
 'BoardDavid',
 'PendoSystemsInc',
 'ooobo1',
 'Olatorich',
 'kingbash247',
 '*Payoneer',
 'amassalot',
 'zikis2323',
 'Helena_Wish',
 'DigitalGokhan',
 'Dennahz',
 'BuzzLotto',
 'BTCParlay',
 'BitWhizzUK',
 'FairlayDev',
 'ItitzexCrypto',
 'FollowOrFadeMe',
 'europecoinEUORG',
 'CryptoPsych',
 'ArkEcosystem',
 'PascalCoin',
 'kristovatlas',
 'CarpeNoctom',
 'iAm_elkay',
 'dplsurve',
 'kelvinlyttle',
 '100% organic',
 'ReddBazaar',
 'dun3buggi3',
 'freespinsgratis',
 'DonateBitcoin',
 'btcmrkts',
 '1. online drug dealers',
 'darkone1',
 'bit_host',
 'workwithlee',
 'grbroker',
 'davin5515',
 'FaucetGaming',
 'NickolayV19',
 '• 19th century—Railways',
 'FNovelsO',
 'PauloBtc',
 'webhat',
 'int_reserve',
 'EKrediet',
 'miningfield',
 'DanielPox',
 'DavidKa3002',
 'MistahGeorge',
 'CivicCentreMan',
 'redpillmen',
 '1. cash it out',
 'anthony9843',
 '86Someone86',
 'CoinCards',
 'jagira',
 'jacker_27',
 'jhaand',
 'MarleyBrooklyn1',
 'SQLTrooper',
 'maxjcm',
 'decodedlegal',
 'DigitalMintIO',
 'theolynn',
 'CaroleBakerTV',
 'stasuzu',
 'audio_pervert',
 '_portaldoaxe_',
 'Journal_Seminar',
 'lenteracoin',
 'valiktrade',
 'BTCoinSwap',
 'TraceMayer',
 'BlackMambaBitco',
 'Cryppel',
 'dariodipardo',
 '#BITCOIN',
 'ProjectInvesd',
 'nigelmarkdias',
 'vrtxlabs',
 'bognetakau',
 'glued2thescreen',
 'yriysh',
 'OssbergerMarkus',
 'a4ndr3wzone',
 'noshortre',
 'jacaf01jacaf',
 'n_growth',
 'MilanPrucha',
 'LuckyGi20401711',
 '5ub_zer0',
 'bigbluecup721',
 'bitcoin_ch',
 'PoyetBenjamin',
 '_BitcoinNews_',
 'triumphalArthur',
 'Zealot99',
 'kokowtp',
 'dbblackdiamond',
 'marcoporretto',
 '\\(•_•)',
 'FifalpayPro',
 'sweefm',
 'Realisation',
 'redmail99',
 'Godzealous',
 'Rembrandt_972',
 'DigitalNote_XDN',
 'eZwallStTrades',
 'DaRrYl2o0o',
 'BiostarUSA',
 'Multi-Sig Wallet',
 'roadcrypto',
 'see_u_REKT',
 'merlyne93',
 'NWFleissig',
 'armiols',
 'i_instances',
 'gustin_john',
 'MABitcoin',
 'clever_together',
 'tajitj1',
 'stosty',
 'tiupaeng',
 'yassiner2',
 'globalrashid007',
 'manipuflation',
 'FinTechProffitt',
 'DanPejeroni',
 'bitcoin_miner_',
 'cinnamonshelper',
 'DevopsInfo1',
 'makemoney_boi',
 'wasmandj',
 'Teeatrends',
 'LeanLione',
 'bitcoin_kid16',
 'MatrixDirectory',
 'BestPtcPaying',
 'uscointech',
 'Dollar is crashing',
 'BlockChannel',
 'PotomacWill',
 'jennyblacky08',
 'wbe1947',
 'pierre_rochard',
 'DogeofBlessing',
 'Maniac_Bitcoin',
 'equibit',
 'enjoyurdays',
 'D_Zimnoch',
 'Eng_HishamAdel',
 'ymubarak79',
 'Cik_Su',
 'Bitcoin_StarBoy',
 'ZET_mining',
 'jspontes',
 'DoctorSekzi',
 'iam_hidris',
 'BeFrOGnet',
 'ryro_peterson',
 'GPL_Life',
 'globaliqx',
 'mskellytolar',
 'mz_antii',
 'bitstein',
 'Bitcoin_2018',
 'cltucci',
 'AderojuAdelere',
 'mccinnovlab',
 'ana3ig',
 'G3NTbl4ck',
 'onlygodlyson',
 'goldblockscoin',
 'postkneejerk',
 'LaptopPrice24',
 'zebpay',
 'Is to follow',
 'KirkDBorne',
 'vladbitchat',
 'JoshCrumb',
 'MinersBitcoin',
 'sandyme07',
 'CRYPTOBANGer',
 'digiimaxx',
 'Logic_Joker',
 'alison82182',
 'sanctumware',
 'kvandenbrande',
 'quotedian',
 'Swiss Goodness',
 'Betternotrest',
 'randybrito',
 'thatssixzs',
 'musicjem666',
 'plutonash06',
 'Coindata_IO',
 'Dashpay',
 'angelinasarahh',
 'GEOPS22',
 'tachyeonz',
 'BlueST955',
 'M_RappaCEO',
 'BitMedia1',
 'anonmoxygirl',
 '_BCostello',
 'No Sign Up',
 'TechNowOrNever',
 'THE MOST INNOVATIVE AND LUCRATIVE WAY TO EARN BITCOIN',
 'Shieldzzzy84',
 'statspotting',
 'MultiMediaDave',
 'YOC_Crypto',
 'gamersclub0',
 'BlockChainBiz',
 'MGSoundproofing',
 'AltcoinTraders',
 'crypto_lawyers',
 'kingofpayments',
 'ryanstr',
 'smarterway',
 'WundaTraida',
 'BitcoinWanderer',
 '40x market cap growth(2016)',
 'MythicalLucy',
 'DaleGo800',
 'alt_bit_coins',
 'MasterChangz',
 'ottikas2',
 'Crypto_Lord',
 'winnersID',
 'itknowingness',
 'BSAUNIVERSITY',
 'crypto_gangsta',
 'Newusernametime',
 'Altantis12',
 'BITCOIN_Forex',
 'igorklajo',
 'mygulfcoin',
 'SuccessWithLou',
 'OKIE_Actual',
 'anaamh12',
 'sophiaahmed',
 'julia_zekorn',
 'Alexandiver',
 'GG33___',
 'jouslaw',
 'Krippled_Hick',
 'hiroshimaz',
 'GppeVassallo',
 'TheShahzada',
 'aawitan',
 'rboro28',
 'BryanMicon',
 'healthfitnes121',
 'eViLPlunfy',
 'charlitojr',
 'HuttonIF',
 'jm111t',
 'angela_walch',
 'yogesh93112',
 'Target #Bitcoin price if $COIN',
 'cazushi0321',
 'santilestary11',
 'assunitel',
 'ronniegolpark',
 'HQFalwasser',
 'Flo020890',
 'onlinepuspadas',
 'honeyanghel_05',
 'Roh_Sakra',
 'Mansa_Godson',
 'bitcointreff',
 'HilariousTrends',
 'MyFriendonCam',
 '#CreditCard',
 'Uber $68 billion',
 '1Sandella',
 'rusmana62',
 'romerome94',
 'rblk247',
 'btsfav',
 'op_return_ack',
 'badass_network',
 'RashadAlwahaj',
 'get_help_W',
 'Lucky_Bitcoin',
 'finly_india',
 'TheSiliconWave',
 '@Fox_Theatre',
 'gonzoucab',
 'sportmeister',
 'LinosVersion',
 'RicardodeHaan85',
 'btcquantum',
 'tokyobitcoiner',
 'thickopedia',
 'bitcoin1000',
 'satoshibitcoins',
 'otubovictor',
 'success5353',
 'julia_vaingurt',
 'weickmann',
 '_anthony_seger_',
 'd_seaman',
 'deskoffuatbalta',
 'StartUpRealTime',
 '8chpol',
 'karlmoneyinpjs',
 'Investments_bit',
 'lucasmzanella',
 '🔘   Grab-n-Go',
 'thelordoftruth',
 'Coincanna',
 'LONG!!!',
 'web24services',
 'MaximillionRuns',
 '2be_here',
 'TreesCalifornia',
 'freetree64',
 'Much Buuuuhhhhh.',
 'ildike49',
 'gubatron',
 'SeamusConwell',
 'AutoviewExt',
 'PMPSLLC',
 'Fodaroalerts2',
 'Follow_Bitcoin',
 'bitcoin_experts',
 'jaylordcuevas',
 'yugosobolev',
 'FionaAneltaut_',
 'viajeroloco13',
 'crypto_princess',
 'JonasSunshine',
 'eResearch_UCL',
 'xfinstocks2016',
 'dannylmasters',
 'ReviewsAcross',
 'realsheepwolf',
 'distresser',
 'mohitr1810',
 'Ehsanbtc',
 'dontakemycrack',
 'fanfare100',
 '#BitcoinLoan #BitcoinLoans',
 'D3pak',
 'DISRUPT_BANKS',
 'coin_joker',
 'killer_reaper23',
 'Holayincah',
 'buyucoin',
 'SemperArgentum',
 'SteveBTC',
 'Nairaongold',
 'EffizientLink',
 'SantaCracker',
 'Bitcoin $1,054.62',
 'Mutantoide',
 'mayurpa03598706',
 'elivue',
 'iivanov482',
 'coolindark',
 'iDeyPay_web',
 'Sellbitbuy',
 'BrandonKellyND',
 'OnTheBlockChain',
 'knowbtc7',
 'dffss8',
 'devnullius',
 'CloudExpo',
 '➊ Deepavali 2016',
 'OWASPDublin',
 'CNBCi',
 'starteasyway',
 'PayPal',
 'fay_yeen',
 'blockspotio',
 'JulianAgunabor',
 'josetron',
 'CryptoCryptics',
 '_3ch0_',
 '_vjy',
 'CryptoPH',
 'Crypto_God',
 'markmedia',
 'MirakhorHassan',
 'dilaraaltundag',
 'Bitbisiness',
 'AnnguOrngu',
 'Afrotechtv',
 'At_Atiya',
 'RNR_0',
 'CryptoGermanBro',
 'ricardo88rick',
 'Techemist',
 'neoinvest282',
 '10% cashback',
 '201210060',
 'Click2Amit',
 'SpirosMargaris',
 'financialMAVN',
 'MattCRZ22',
 '_bpriy',
 'mikeytea123',
 'TheUltraAliens',
 'JohnRiversX4',
 'mr__austin',
 'rosafutur',
 'AI',
 'canboybs',
 'NelsonMRosario',
 'itraor',
 'Treocian',
 'jawadms',
 'tchinyamakobvu',
 'Omero1972',
 'Big_Bitcoin',
 'Hermetec',
 'erickquiap',
 'VinnieVagabond',
 'Cryptogeek_BTC',
 'HYIP_Dr',
 'MsCharm143',
 'JuanCMinero',
 'jozeran',
 '999BTC',
 'Docbuh',
 'EdmundCMoy',
 'DarioTepoTec',
 '84Bitcoins',
 'AchwaqKhalid',
 'BetChain',
 'CryptoGuides',
 'michaelalex11',
 'landofcrypto',
 'SlonBTM',
 'bhumharit',
 'SylviaJanotta',
 'ArsAuxilia',
 'paditetteh',
 'AorakiPro',
 'Ravi_Bansal5',
 'CD28UK',
 'coolstuffcheap',
 '- sexy spread',
 '2iqresearch',
 'TheRockTrading',
 'RhinoAK',
 'TopBonPlanDuNet',
 '1cyrilblanc1',
 'DLCMyPocketFull',
 'AmarunGSR',
 'AtlanticCent',
 'vio1965',
 'techguard1',
 'HenriMorpatiza',
 '012Bit',
 'James_Laminger',
 'bit_coin_invest',
 'SwCPoker',
 'JRell274',
 'Sire_Charsty',
 'Bitfinex: $1006.6',
 'kyletorpey',
 'basonggogok',
 '🤑💬 #CrytoCurrency rules!',
 'bitcoingeek000',
 'Shopawl_com',
 'Rational_Urge',
 'news_btc',
 'HodgeShopArt',
 'HealthyFood87',
 'iAmMojari',
 'epakravan',
 'aaron_koenig',
 'mBTCPizpie',
 'CashQueen2013',
 'AltexSoft',
 'AdekanyeOpe',
 'programmableweb',
 '@realDonaldTrump',
 'iPraveenKumar_',
 'Peso: -13%',
 'olajide_tv',
 'theonevortex',
 'coollikeme',
 'KPesaBit',
 'vince_weyn',
 'MoonboundLTC',
 'skope1',
 'xavvypls',
 'vidal007',
 'Coinkite',
 'crazytipster_',
 'InstituteCrypto',
 'a) get a job working for it',
 'Bitcoin_Rush',
 'Viremia',
 'GhanmiChokri',
 'Wally_WST',
 '400',
 'DTongSports',
 'bitnewslatest',
 'vakeraj',
 'SmalValera',
 'BareMetalGame',
 '•• bitcoin',
 'localizecrypto',
 'cryptopump_pimp',
 'MadBitcoins',
 'Marijuana_Coin',
 'CollinCrypto',
 'RoniFinTech',
 'koqoo',
 'ThePriceOfBTC',
 'Mrajesin',
 'BitcoinGamble3',
 'ModernApostate',
 'Creator',
 'intowin2017',
 'MindCrypto',
 'Xentagz',
 'JoseGuequi',
 'Bitcoin',
 'bcb_atm',
 'tjpezlo',
 'bitcoinrainnet',
 'hikaru2525ajt',
 'rogersm',
 'hackfest_ca',
 'Ad3lin312',
 'iYuta_channel',
 'Gainbitcoin07_B',
 'LawnmowerIO',
 'impraise',
 'yayoinakano1',
 'UseBtCoin',
 'BitcoinBotTrade',
 'insidestat',
 'cryptohorse',
 'haibitcoincom',
 'kamleshk28',
 'Everybody need wallet .',
 'Metaballo',
 'Free 0.10 #Bitcoin Giveaway Below!',
 'MeerWindEnergy',
 'BitcoinTeo',
 'bdncsgo',
 'mi071630',
 'Rh_Tomal',
 'Stephen1424',
 '1 Signup',
 'csovil',
 'hrobeers',
 'CymKal',
 'vmalevris',
 'TuurDemeester',
 'bitfortip',
 '54pto',
 'LibertyBlitz',
 'vijayrathoda9',
 '🔛Bitcoin',
 'ntppf_',
 'darren006',
 'kotwica666',
 'TheForexIcon',
 'padbor',
 'bitsquare_',
 'ursualboaie',
 '_20BiZ',
 'BTC_Planet',
 'friedrichdundee',
 'hq83bnn9',
 '- 60% of all gun crime',
 'Josh_hoy20',
 'Florentino_Schm',
 'Bitcoin_Genius',
 'agnelmoses',
 'CryptoMoar',
 '_jonasschnelli_',
 'mekysmooth',
 'Amolain',
 'CommoditiesAcad',
 'don_teague',
 'Selim_Kavcik',
 'btc_dance',
 'GaltsGirl',
 'moneytrigz',
 'kwill_tuka',
 'misbahurbd',
 'btcvideospro',
 'yD2pzehRGC7coyZ',
 'guzzler_3',
 'EINS78',
 'zatocent',
 'AudunGulbrands1',
 'AKenyanVoice',
 'Rock_bitcoin',
 'Alhilal_sky',
 'stefioan',
 'PocketMinerAndr',
 'micah_gideon',
 'imNdash',
 'onitaizi3',
 'IngridMKollmann',
 'raphaelshalaby',
 'BitcoinsUP',
 'ThunkinOutloud',
 'BeerdHead',
 'niftygoblin',
 'AdminHOST',
 'Bitcoin_Fellow',
 'el33th4xor',
 'iremsokullu',
 'Zero_Cult',
 'A3coin',
 '100kaprocentovm',
 'maqbool_sm',
 'Itsjoeco',
 'massatofo',
 'BoulevardLP',
 'bishopmarn',
 'sovietskull',
 'freecryptomoney',
 'iamdwinn_',
 'keshab_baruah',
 'vernonstallins',
 'zupdawg_btc',
 'leojavi2',
 'nonameagent001',
 'QcanApp',
 'EarningsR',
 'sulisituaku',
 'RegTech_FinTech',
 'MismeMC',
 'swaggzconcept',
 'TwinklePortal',
 'MariusIluca',
 'kointrend',
 'BitForCoin',
 'DigitalAyrshire',
 'cryptopicasso',
 'neylorZ',
 'AegisFS',
 'Vaultoro',
 '@makadem',
 'ezyremit_IN',
 'mynameisajudin',
 ':)',
 'Themmytaryoh',
 'ICONICTOKEN',
 'Hicks_Associate',
 'laasch',
 'PayPal, Credit/Debit, Bitcoin',
 'pesobit',
 'stacystock',
 'davidogdenseo',
 'smile7031_i',
 'makemoremoney1_',
 'BitcoinOM',
 'cryptoSqueeze',
 'LYMSAMOMOBA',
 'Rasheed_Moore',
 'RlCKYJONES',
 'PrivateBrief',
 'macksonwakama',
 'grandsfx',
 'Win UNLIMITED #Satoshi ♡',
 'valdeano',
 'naveed7naik',
 'leklenche',
 'Item of clothing',
 'CryptoCompare',
 'Vindyne8',
 'boy_bitcoin',
 'timpastoor',
 'czarofrebels',
 'IarikVeliciko',
 'iGent_crypto',
 'willyskye',
 'LECAUSTRALIA',
 'Invest about N2000 (0.002btc) & earn16btc',
 'syedimran01',
 'prince_nrb',
 'My ID  ... DA84306',
 'Chris_Haefner',
 'moolanghat',
 'milos_zikic',
 'Bitcoin-unlimited',
 'Piotr14Tra',
 'DigitalProcess',
 'UmayangaLK',
 'bitcoin4u2',
 'grantrobertb',
 'EMC2Whale',
 'frederikro',
 'aalbertjac',
 'bit_bitcoin',
 'jonest_l',
 'menelaou',
 'JosefValkoun',
 'scriptingfuture',
 'CloudRank',
 'silveretblack',
 'myprtship',
 'BardirDwarf',
 'Gana_Bitcoins',
 'coinsz23',
 'howestreet',
 'maurilioberetta',
 'Matbythesea',
 'AlexSurrett',
 'sparky1213mc',
 'ezCoinAccess',
 'dbb2015',
 '_GetFollower_',
 'Altcoindreams',
 'dash_vibes',
 'pterion2910',
 'ehsminer',
 'ThingsExpo',
 'censorednewsnow',
 'technoapt',
 'DavidVicary',
 'Var_retweets',
 'alexsunnarborg',
 'IljaRadomir',
 'irvpaswan',
 'bitrevolver_net',
 'Bitcoin_kr',
 '* Store data efficiently',
 'Lingo101',
 'amystocks21334',
 'CryptoOz',
 'OrangeCaramelLZ',
 'Diseases desperate grown',
 'Bakom_skynket',
 'CoexistCoin',
 'OttawaStare',
 'hariyono_id',
 'JerryMooney',
 'SkipTerrio',
 'BigBrainArmy',
 'Gary_Price',
 '1. d/l a wallet app',
 'constancewillli',
 'desantis',
 'CrowdfundB0t',
 'Kuwaddo',
 'Blame China',
 'hiroshi_shimo',
 'arthur952010',
 'cryptowanderer',
 'ahier',
 'praisefmnevis',
 'bobbyclee',
 'ajicute1976',
 'OttLegalRebels',
 'Melt_Dem',
 'goldkey0070',
 'wendellbulbaai',
 'DonTravlos',
 'GeekOnTheLoose',
 'don_ejeh',
 'ducatirocks0369',
 'ArtByteMe',
 'jyotimish501',
 '_digitalfutures',
 'OpenECommunity',
 'promotebitcoin']

Having sampled a large number of the handles above, I found that most of the accounts were either bots or had been suspended, which suggests that they were formerly bots. As such, I will drop all rows that are missing a date and time.


In [13]:
df.dropna(inplace=True)

In [14]:
df.shape


Out[14]:
(610694, 6)

In [15]:
1286 + 610694 # looks like we dropped the correct rows!


Out[15]:
611980
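
`dropna()` without arguments drops a row if any of its columns is missing. Since only date and time have gaps here, that gives the same result as dropping on those two columns alone. As a sketch on made-up toy data, the drop can be limited to the affected columns and the row count checked in code rather than by hand:

```python
import pandas as pd

# Toy frame with made-up values; the middle row has no date or time.
toy = pd.DataFrame({
    "date": ["2017-02-22", None, "2017-02-21"],
    "time": ["19:35:43", None, "00:03:59"],
    "handle": ["FoabMoab", "kointrend", "Bitcoin_Revo"],
})

rows_before = len(toy)
n_missing = int(toy["date"].isnull().sum())

# Restrict the drop to the two columns known to contain missing values.
trimmed = toy.dropna(subset=["date", "time"])

# The row count should shrink by exactly the number of missing rows.
assert len(trimmed) == rows_before - n_missing
print(trimmed.shape)
```

Naming the columns makes the intent explicit and protects against accidentally dropping rows if another column ever gains missing values.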

We now have rows that have a date of 1970! They mostly have nonsensical tweets! We will drop these too!


In [16]:
df[df['date'] == date(1970,1,1)] #clearly bad rows of data!


Out[16]:
date handle time tweet tuple_version_tweet clean_tweet_V1
1047740 1970-01-01 BitcoinPlusOrg 00:00:00.136185 Network hashrate. (hash/s) (hashrate, network, hashs) network hashrate hashs
1047741 1970-01-01 DOGE/RUB 00:00:00.013534 0.013619 (0013619,) 0013619
1047742 1970-01-01 DOGE/RUB 00:00:00.013331 0.0134 (00134,) 00134
1047743 1970-01-01 DOGE/RUB 00:00:00.012855 0.01299899 (001299899,) 001299899
1047744 1970-01-01 LTC/BTC 00:00:00.004666 0.00473865 (000473865,) 000473865
1047745 1970-01-01 LTC/BTC 00:00:00.004638 0.00474998 (000474998,) 000474998
1047746 1970-01-01 LTC/BTC 00:00:00.004254 0.00435327 (000435327,) 000435327
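
One plausible explanation for these rows, which this notebook does not confirm, is that a small number ended up in the timestamp field and was interpreted as seconds since the Unix epoch. The sketch below shows that a value such as 0.136185 maps to a timestamp of the same shape as the first row above; `as_epoch_seconds` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch: timestamps are counted in seconds from this instant.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def as_epoch_seconds(value):
    """Interpret a number as seconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(seconds=value)

stamp = as_epoch_seconds(0.136185)
print(stamp.date(), stamp.time())
```

If this assumption holds, any date of 1970-01-01 is a sign that the field never held a real timestamp, which supports dropping the rows.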

In [17]:
to_drop = df[df['date'] == date(1970,1,1)].index

In [18]:
to_drop


Out[18]:
Int64Index([1047740, 1047741, 1047742, 1047743, 1047744, 1047745, 1047746], dtype='int64')

In [19]:
df.loc[to_drop, :]


Out[19]:
date handle time tweet tuple_version_tweet clean_tweet_V1
1047740 1970-01-01 BitcoinPlusOrg 00:00:00.136185 Network hashrate. (hash/s) (hashrate, network, hashs) network hashrate hashs
1047741 1970-01-01 DOGE/RUB 00:00:00.013534 0.013619 (0013619,) 0013619
1047742 1970-01-01 DOGE/RUB 00:00:00.013331 0.0134 (00134,) 00134
1047743 1970-01-01 DOGE/RUB 00:00:00.012855 0.01299899 (001299899,) 001299899
1047744 1970-01-01 LTC/BTC 00:00:00.004666 0.00473865 (000473865,) 000473865
1047745 1970-01-01 LTC/BTC 00:00:00.004638 0.00474998 (000474998,) 000474998
1047746 1970-01-01 LTC/BTC 00:00:00.004254 0.00435327 (000435327,) 000435327

In [20]:
df.drop(to_drop, inplace=True)

In [21]:
df.shape


Out[21]:
(610687, 6)

In [22]:
610694 - 7 # we dropped the right number of rows!


Out[22]:
610687

In [23]:
df.head()


Out[23]:
date handle time tweet tuple_version_tweet clean_tweet_V1
0 2017-02-22 FoabMoab 19:35:43 RT @bitcoinagile: .1 #bitcoin BTC Straight To Wallet USA Bitcoin Seller, Immediate Pay Out - BitCoin Gatherer (immediate, 1, to, #bitcoin, out, straight, usa, seller, wallet, gatherer, pay, btc, bitcoin) 1 #bitcoin btc straight to wallet usa bitcoin seller immediate pay out bitcoin gatherer
1 2017-02-22 Bitcoin_Revo 19:35:39 #bitcoin “Is Like Positive Bacteria”: Russian State Bank Vice President (like, bank, #bitcoin, vice, state, russian, president, positive, bacteria, is) #bitcoin is like positive bacteria russian state bank vice president
2 2017-02-22 alt_bit_coins 19:35:31 Volatile Bitcoin Nears Its All-Time High - Wall Street Journal (subscription) (blog) (volatile, wall, street, subscription, blog, its, nears, alltime, journal, high, bitcoin) volatile bitcoin nears its alltime high wall street journal subscription blog
3 2017-02-22 Rhino3nity 19:35:26 Missed out on #bitcoin? This #cryptocurrency created OVER 300 MILLIONAIRES WORLDWIDE within 12... by #davidf1982 (this, millionaires, over, worldwide, created, within, on, #bitcoin, out, #davidf1982, 300, 12, by, #cryptocurrency, missed) missed out on #bitcoin this #cryptocurrency created over 300 millionaires worldwide within 12 by #davidf1982
4 2017-02-22 Siimple_inc 19:35:18 RT @blockchainhelpr: 5 Weak Points Of #Blockchain Technology #tech #bitcoin (points, of, #bitcoin, technology, #tech, 5, weak, #blockchain) 5 weak points of #blockchain technology #tech #bitcoin

Let's now clean up our tweets further! This will ensure we don't have garbage hashtags or nonsensical words. This will be important for the lemmatization process later on!

To ensure that we have fully removed duplicates, I will again drop duplicates, this time based on the clean_tweet_V2 column.


In [24]:
clean_df = filtration_1(df,"clean_tweet_V1", "clean_tweet_V2")

In [25]:
clean_df.head()


Out[25]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2
0 2017-02-22 FoabMoab 19:35:43 RT @bitcoinagile: .1 #bitcoin BTC Straight To Wallet USA Bitcoin Seller, Immediate Pay Out - BitCoin Gatherer (immediate, 1, to, #bitcoin, out, straight, usa, seller, wallet, gatherer, pay, btc, bitcoin) 1 #bitcoin btc straight to wallet usa bitcoin seller immediate pay out bitcoin gatherer #bitcoin btc straight wallet usa bitcoin seller immediate pay bitcoin gatherer
1 2017-02-22 Bitcoin_Revo 19:35:39 #bitcoin “Is Like Positive Bacteria”: Russian State Bank Vice President (like, bank, #bitcoin, vice, state, russian, president, positive, bacteria, is) #bitcoin is like positive bacteria russian state bank vice president #bitcoin like positive bacteria russian state bank vice president
2 2017-02-22 alt_bit_coins 19:35:31 Volatile Bitcoin Nears Its All-Time High - Wall Street Journal (subscription) (blog) (volatile, wall, street, subscription, blog, its, nears, alltime, journal, high, bitcoin) volatile bitcoin nears its alltime high wall street journal subscription blog volatile bitcoin nears alltime high wall street journal subscription blog
3 2017-02-22 Rhino3nity 19:35:26 Missed out on #bitcoin? This #cryptocurrency created OVER 300 MILLIONAIRES WORLDWIDE within 12... by #davidf1982 (this, millionaires, over, worldwide, created, within, on, #bitcoin, out, #davidf1982, 300, 12, by, #cryptocurrency, missed) missed out on #bitcoin this #cryptocurrency created over 300 millionaires worldwide within 12 by #davidf1982 missed #bitcoin #cryptocurrency created millionaires worldwide within #davidf1982
4 2017-02-22 Siimple_inc 19:35:18 RT @blockchainhelpr: 5 Weak Points Of #Blockchain Technology #tech #bitcoin (points, of, #bitcoin, technology, #tech, 5, weak, #blockchain) 5 weak points of #blockchain technology #tech #bitcoin weak points #blockchain technology #tech #bitcoin

In [26]:
clean_df = filtration_2(clean_df, "clean_tweet_V2")

In [27]:
clean_df.head()


Out[27]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2
0 2017-02-22 FoabMoab 19:35:43 RT @bitcoinagile: .1 #bitcoin BTC Straight To Wallet USA Bitcoin Seller, Immediate Pay Out - BitCoin Gatherer (immediate, 1, to, #bitcoin, out, straight, usa, seller, wallet, gatherer, pay, btc, bitcoin) 1 #bitcoin btc straight to wallet usa bitcoin seller immediate pay out bitcoin gatherer #bitcoin btc straight wallet usa bitcoin seller immediate pay bitcoin gatherer
1 2017-02-22 Bitcoin_Revo 19:35:39 #bitcoin “Is Like Positive Bacteria”: Russian State Bank Vice President (like, bank, #bitcoin, vice, state, russian, president, positive, bacteria, is) #bitcoin is like positive bacteria russian state bank vice president #bitcoin like positive bacteria russian state bank vice president
2 2017-02-22 alt_bit_coins 19:35:31 Volatile Bitcoin Nears Its All-Time High - Wall Street Journal (subscription) (blog) (volatile, wall, street, subscription, blog, its, nears, alltime, journal, high, bitcoin) volatile bitcoin nears its alltime high wall street journal subscription blog volatile bitcoin nears all time high wall street journal subscription blog
3 2017-02-22 Rhino3nity 19:35:26 Missed out on #bitcoin? This #cryptocurrency created OVER 300 MILLIONAIRES WORLDWIDE within 12... by #davidf1982 (this, millionaires, over, worldwide, created, within, on, #bitcoin, out, #davidf1982, 300, 12, by, #cryptocurrency, missed) missed out on #bitcoin this #cryptocurrency created over 300 millionaires worldwide within 12 by #davidf1982 missed #bitcoin #cryptocurrency created millionaires worldwide withinavidf1982
4 2017-02-22 Siimple_inc 19:35:18 RT @blockchainhelpr: 5 Weak Points Of #Blockchain Technology #tech #bitcoin (points, of, #bitcoin, technology, #tech, 5, weak, #blockchain) 5 weak points of #blockchain technology #tech #bitcoin weak points #blockchain technology #tech #bitcoin

In [28]:
clean_df.shape


Out[28]:
(610687, 7)

In [29]:
clean_df.drop_duplicates(subset="clean_tweet_V2", inplace=True)

In [30]:
clean_df.shape #lost around 80K rows! so many duplicates!


Out[30]:
(531257, 7)
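
As a sketch of what the `drop_duplicates` call above does, the toy example below (made-up rows) shows that rows are compared only on the `subset` column and that, with the default `keep="first"`, the first occurrence of each cleaned tweet is the one that survives.

```python
import pandas as pd

# Two different handles posted text that cleans down to the same string.
toy = pd.DataFrame({
    "handle": ["a", "b", "c"],
    "clean_tweet_V2": [
        "#bitcoin gold rising",
        "#bitcoin gold rising",
        "weak points #blockchain",
    ],
})

# Rows are compared on clean_tweet_V2 only; other columns are ignored.
deduped = toy.drop_duplicates(subset="clean_tweet_V2", keep="first")
n_dropped = len(toy) - len(deduped)
print(list(deduped["handle"]), n_dropped)
```

For the real data, the shapes above give 610687 - 531257 = 79430 rows removed.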

It seems that our handle column does not strictly contain Twitter user names. Rather, some entries contain tweet text! This is likely the result of bots. As such, we will remove these entries from our dataset!

Also, if the handle contains the word bot, we will also remove it!
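
The two rules above could be sketched as follows. This is not necessarily the filter used later in this notebook. It assumes that a valid Twitter handle is 1 to 15 characters long and contains only letters, digits and underscores; `looks_like_handle` and `keep_handle` are hypothetical names, and the sample strings are taken from the handle column shown earlier.

```python
import re

# Assumption: a handle is 1-15 characters of letters, digits and underscores.
HANDLE_PATTERN = re.compile(r"[A-Za-z0-9_]{1,15}")

def looks_like_handle(text):
    """True if the whole string has the shape of a Twitter user name."""
    return HANDLE_PATTERN.fullmatch(text) is not None

def keep_handle(text):
    """Keep a row if its handle is well formed and does not contain 'bot'."""
    return looks_like_handle(text) and "bot" not in text.lower()

samples = [
    "kointrend",
    "DOGE/RUB",
    "1. cash it out",
    "#Bitcoin",
    "CrowdfundB0t",
    "BitcoinBotTrade",
]
kept = [s for s in samples if keep_handle(s)]
print(kept)
```

The substring test is deliberately blunt: it misses spellings such as "B0t" and would also drop a legitimate handle that merely contains the letters "bot".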


In [31]:
clean_df.sort_values(by="handle").head(50) # as we can see from the sample, these handles are all nonsense tweet text


Out[31]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2
500708 2016-12-28 "BTC qualifies as money, a federal judge ruled" 22:22:05 h… (h…,) h… h
491015 2016-12-30 "Destinia decides to operate exclusively in Bitcoins in Venezuela" 00:27:16 v… (v…,) v… v
566239 2016-12-21 "I like big blocks & I cannot lie" 07:05:57 previewing at a #blockchain conference near you (#blockchain, near, conference, at, previewing, a, you) previewing at a #blockchain conference near you previewing #blockchain conference near
575012 2016-12-20 "I want my house to be seen from space!": FESTIVE! 07:39:12 (ego)… (ego…,) ego… ego
1006691 2016-10-31 #ARK $ARK #altcoin #Bitcoin #fintech #crowdfund #ICO 06:30:22 #block… (#block…,) #block… #block
572863 2016-12-20 #ATM #Kiosks #Debitcard #Digital$$$ #Bitcoin ready 13:37:54 #Unbanked #payroll #ID Cards… (#unbanked, #id, cards…, #payroll) #unbanked #payroll #id cards… #unbanked #payroll #id cards
650214 2016-12-11 #ATM-#Debit-#Digital-#Kiosks 18:57:39 #BSA✅ #BIN-#Bank✅… (#bin#bank✅…, #bsa✅) #bsa✅ #bin#bank✅… #bsa✅ #bin#bank✅
714070 2016-12-04 #BTC #bitcoin #crypto #bitcoins #marijuana #cannabis 01:40:46 #ganja #weed #hem… (#hem…, #ganja, #weed) #ganja #weed #hem… #ganja #weed #hem
716578 2016-12-03 #BTC #bitcoin #crypto #bitcoins #marijuana #cannabis 18:10:33 #ganja #weed #hemp… (#ganja, #hemp…, #weed) #ganja #weed #hemp… #ganja #weed #hemp
900752 2016-11-12 #BTC #cannabis #bitcoin #crypto #bitcoins #marijuana 15:50:06 #ganja #weed #hemp #domain #domains (#domains, #weed, #domain, #ganja, #hemp) #ganja #weed #hemp #domain #domains #ganja #weed #hempomainomains
877212 2016-11-15 #BULLISH 08:16:12 #UNDERVALUED (#undervalued,) #undervalued #undervalued
352528 2017-01-14 #BitCoin 15:11:17 #TheBillionCoin… (#thebillioncoin…,) #thebillioncoin… #thebillioncoin
327975 2017-01-17 #BitCoin (feat. #drZIUBER) - EP by #drHAJS 11:30:59 →… (→…,) →…
745519 2016-11-30 #BitID & maybe #YoursLogin? @ryanxcharles @YoursNetwork 12:05:22 (demo) (demo,) demo demo
16791 2017-02-21 #Bitcoin 00:03:59 #Fint… (#fint…,) #fint… #fint
372341 2017-01-12 #Bitcoin 06:38:46 #Crypto (#crypto,) #crypto #crypto
248146 2017-01-26 #Bitcoin 03:03:14 #ライトライズ… (#ライトライズ…,) #ライトライズ… #ライトライズ
298184 2017-01-20 #Bitcoin 17:15:51 $BTC $XBT (xbt, btc) btc xbt btc xbt
536744 2016-12-24 #Bitcoin 16:29:20 #Gold (#gold,) #gold #gold
300858 2017-01-20 #Bitcoin 10:44:24 #Golem (#golem,) #golem #golem
623240 2016-12-14 #Bitcoin 21:55:00 #Privacy (#privacy,) #privacy #privacy
648886 2016-12-11 #Bitcoin 22:48:33 #coins (#coins,) #coins #coins
531577 2016-12-25 #Bitcoin #BTC 07:59:01 #Make_More_Bitcoins (#makemorebitcoins,) #makemorebitcoins #makemorebitcoins
111424 2017-02-09 #Bitcoin #Blockchain #FinTech 19:37:31 $BTCCNY $BTC… (btccny, btc…) btccny btc… btccny btc
6205 2017-02-22 #Bitcoin #Blockchain #FinTech 04:10:25 $BTC $BTCCNY @YourBTCC (btccny, btc) btc btccny btc btccny
52254 2017-02-16 #Bitcoin #Blockchain #FinTech. 16:52:55 *#BTC $BTCCNY @YourBTCC (btccny, #btc) #btc btccny #btc btccny
718337 2016-12-03 #Bitcoin #Hacked #Collapse 13:28:02 $BTC $XBTE… see more (see, xbte…, btc, more) btc xbte… see more btc xbte see
914540 2016-11-10 #Bitcoin #Hacked #Ponzi(!) 23:19:37 #Collapse #99DEDseriouz. (#collapse, #99dedseriouz) #collapse #99dedseriouz #collapse #99dedseriouz
353147 2017-01-14 #Bitcoin #Ponzi #Collapse 13:14:29 $BTCCNY @YourBTCC(!) (btccny,) btccny btccny
198698 2017-01-31 #Bitcoin #Ponzi #Collapse(!) 13:57:29 $BTC $XBTEUR… (xbteur…, btc) btc xbteur… btc xbteur
503289 2016-12-28 #Bitcoin $BTC $1350 15:55:24 #Monero $XMR $100 (#monero, 100, xmr) #monero xmr 100 #monero xmr
206719 2017-01-30 #Bitcoin > CALL 17:13:28 #SuicideHotline: 666-R-E-K-T (#suicidehotline, 666rekt) #suicidehotline 666rekt #suicidehotline
701212 2016-12-05 #Bitcoin Distributed Ledger: 16:06:27 • Public #Blockchain = Internet… (#blockchain, internet…, •, public) • public #blockchain internet… public #blockchain internet
900295 2016-11-12 #Bitcoin here 17:00:03 Bestellung per Mail & PN / Page. #CCO #PeggyHetscher #Bitcoin #Geil... (pn, bestellung, #geil, #cco, #bitcoin, amp, per, mail, #peggyhetscher, page) bestellung per mail amp pn page #cco #peggyhetscher #bitcoin #geil bestellung per mail and pn page #cco #peggyhetscher #bitcoin #geil
457522 2017-01-02 #Bitcoin hits $1000 21:57:29 And it's just the beginning (just, beginning, its, the, and) and its just the beginning beginning
437670 2017-01-05 #Bitcoin is at $0.018 Tn at Year 8. 01:39:33 2 more yrs? (2, yrs, more) 2 more yrs yrs
310823 2017-01-19 #Bitcoin is for Everyone! 09:20:29 Tr… (tr…,) tr… tr
915383 2016-11-10 #Bitcoin is rising 20:45:58 Gold is rising (gold, rising, is) gold is rising gold rising
890090 2016-11-13 #Bitcoin payroll 22:23:04 #Equity… (#equity…,) #equity… #equity
890095 2016-11-13 #Bitcoin payroll 22:20:35 #Equity in project (in, project, #equity) #equity in project #equity project
967390 2016-11-04 #Bitcoin sent 24h 16:46:48 6,562,716 BTC (41.14% cap… (4114, cap…, 6562716, btc) 6562716 btc 4114 cap… btc cap
73870 2017-02-14 #BitcoinGenerator #Bitcoinhack 04:57:44 #Freebitcoin (#freebitcoin,) #freebitcoin #freebitcoin
48 2017-02-22 #BitcoinKE 19:30:06 #ReTwita (#retwita,) #retwita twita
700977 2016-12-05 #Bitcoin’s #blockchain #technology 16:42:38 puts #tech before the financial system. (financial, puts, system, #tech, the, before) puts #tech before the financial system puts #tech financial system
57875 2017-02-16 #Blockchain #FinTech 01:43:31 #Bitcoin $BTC $BTCCN… (btccn…, #bitcoin, btc) #bitcoin btc btccn… #bitcoin btc btccn
109444 2017-02-10 #Blockchain #Fintech😟 00:48:29 #Bitcoin $BTC $BTCCNY @YourBTCC (btccny, #bitcoin, btc) #bitcoin btc btccny #bitcoin btc btccny
355273 2017-01-14 #CRBIT is awakening NOW 07:04:44 CRBIT/BTC Last: 0.000128 BTC 24High: 0.00013 BTC... (crbitbtc, 24high, 000013, last, btc, 0000128) crbitbtc last 0000128 btc 24high 000013 btc crbitbtc last btc btc
234499 2017-01-27 #ChurchOfBitcoin 24-25th February @5thBaseGallery 14:16:23 #bitcoin #mass #art (#art, #mass, #bitcoin) #bitcoin #mass #art #bitcoin #mass #art
919182 2016-11-10 #Cryptocurrency #Bittrex #Blockchain now is the 11:32:05 time for a #digital currency #CAPRICOIN IS The Perfect Coin !!!! (perfect, currency, #capricoin, time, #digital, the, coin, for, is, a) time for a #digital currency #capricoin is the perfect coin timeigital currency #capricoin perfect coin
17880 2017-02-20 #Cryptomania = #Tulipmania! 21:05:02 #Blockchain #Fin… (#blockchain, #fin…) #blockchain #fin… #blockchain #fin

In [32]:
clean_df.sort_values(by='handle', inplace=True) #let's prep our dataframe for the cleaning process: sorting by handle groups the malformed handles together
clean_df.reset_index(inplace=True)
del clean_df['index']
clean_df.head()


Out[32]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2
0 2016-12-28 "BTC qualifies as money, a federal judge ruled" 22:22:05 h… (h…,) h… h
1 2016-12-30 "Destinia decides to operate exclusively in Bitcoins in Venezuela" 00:27:16 v… (v…,) v… v
2 2016-12-21 "I like big blocks & I cannot lie" 07:05:57 previewing at a #blockchain conference near you (#blockchain, near, conference, at, previewing, a, you) previewing at a #blockchain conference near you previewing #blockchain conference near
3 2016-12-20 "I want my house to be seen from space!": FESTIVE! 07:39:12 (ego)… (ego…,) ego… ego
4 2016-10-31 #ARK $ARK #altcoin #Bitcoin #fintech #crowdfund #ICO 06:30:22 #block… (#block…,) #block… #block

In [33]:
clean_df.shape


Out[33]:
(531257, 7)

In [34]:
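to_drop = []
for index in clean_df.index: #the index was just reset, so labels and positions coincide
    handle = clean_df.iloc[index, 1] #column 1 is 'handle'
    if handle[0] == "_": #some users have _ at the start of their name
        continue
    if not handle[0].isalnum(): #a first character that is not a letter or digit means this is not a real handle
        to_drop.append(index)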
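
The rule applied in the loop above can also be written as a small predicate, which makes it easy to try on a few handles before running it over the whole dataframe. This is only a sketch, not the code that produced the outputs in this notebook: `is_plausible_handle` is a hypothetical helper name, `_satoshi` is a made-up handle, and treating an empty string as invalid is my own assumption.

```python
def is_plausible_handle(handle):
    """Return True if the first character is a letter, a digit, or an underscore."""
    if not handle:  # assumption: an empty handle is treated as invalid
        return False
    first = handle[0]
    return first == "_" or first.isalnum()


# "_satoshi" is a made-up handle; the others appear in the outputs of this notebook.
sample_handles = ["_satoshi", "000gun", "#Bitcoin", "$BTC +9.8%", "@OKCoin", "ChiefApricot"]

kept = [h for h in sample_handles if is_plausible_handle(h)]
dropped = [h for h in sample_handles if not is_plausible_handle(h)]

print(kept)     # ['_satoshi', '000gun', 'ChiefApricot']
print(dropped)  # ['#Bitcoin', '$BTC +9.8%', '@OKCoin']
```

With pandas, something like `clean_df[clean_df['handle'].map(is_plausible_handle)]` should keep the same rows as the loop. Note that a check on the first character alone cannot catch every malformed handle: tweet text that happens to start with a digit or a letter will still pass.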

In [35]:
len(to_drop) #we need to drop 212 records


Out[35]:
212

In [36]:
for handle in clean_df.iloc[to_drop, :]['handle']:
    print(handle)
    print()


"BTC qualifies as money, a federal judge ruled"

"Destinia decides to operate exclusively in Bitcoins in Venezuela"

"I like big blocks & I cannot lie"

"I want my house to be seen from space!": FESTIVE!

#ARK $ARK #altcoin #Bitcoin #fintech #crowdfund #ICO

#ATM #Kiosks #Debitcard #Digital$$$ #Bitcoin ready

#ATM-#Debit-#Digital-#Kiosks

#BTC #bitcoin #crypto #bitcoins #marijuana #cannabis

#BTC #bitcoin #crypto #bitcoins #marijuana #cannabis

#BTC #cannabis #bitcoin #crypto #bitcoins #marijuana

#BULLISH

#BitCoin

#BitCoin (feat. #drZIUBER) - EP by #drHAJS

#BitID & maybe #YoursLogin? @ryanxcharles @YoursNetwork

#Bitcoin

#Bitcoin

#Bitcoin

#Bitcoin

#Bitcoin

#Bitcoin

#Bitcoin

#Bitcoin

#Bitcoin #BTC

#Bitcoin #Blockchain #FinTech

#Bitcoin #Blockchain #FinTech

#Bitcoin #Blockchain #FinTech.

#Bitcoin #Hacked #Collapse

#Bitcoin #Hacked #Ponzi(!)

#Bitcoin #Ponzi #Collapse

#Bitcoin #Ponzi #Collapse(!)

#Bitcoin $BTC $1350

#Bitcoin > CALL

#Bitcoin Distributed Ledger:

#Bitcoin here

#Bitcoin hits $1000

#Bitcoin is at $0.018 Tn at Year 8.

#Bitcoin is for Everyone!

#Bitcoin is rising

#Bitcoin payroll

#Bitcoin payroll

#Bitcoin sent 24h

#BitcoinGenerator #Bitcoinhack

#BitcoinKE

#Bitcoin’s #blockchain #technology

#Blockchain #FinTech

#Blockchain #Fintech😟

#CRBIT is awakening NOW

#ChurchOfBitcoin 24-25th February @5thBaseGallery

#Cryptocurrency #Bittrex #Blockchain now is the

#Cryptomania = #Tulipmania!

#Digits mined from a digital world are worth $787.84 each

#Digits mined from a digital world are worth $787.84 each

#EH

#ForLimitedTime

#FreeBitcoin Every 5 Minutes!

#FreeBitcoin Every 5 Minutes!

#Hero

#HiddenInflation

#IndigenousSovereignty

#IndigenousSovereignty

#JulianIsDead #proofoflife #bitcoin #ethereum #Linux #MIT #netsec #infosec

#MakeMoreMoney101 😈💯💥

#MovieTVTechGeeks

#NUMEROLOGY

#RT let's see who's interested.

#RT let's see who's interested.

#SBXofficial #SBXfintech

#STEEMIT

#SatoshiNakamoto #disruptivetech

#SatoshiNakamoto #disruptivetech

#SaveEdmund #FindSweetLiberty #CollapseTrx

#SaveEdmund #WEF #Davos2017 #Davos #QE2

#SaveEdmund #WEF #Davos2017 #Davos #QE2

#SaveEdmund @USMC #Davos2017 #Davos #WEF

#SpreadTheWord

#TCC

#TRUMPCOIN

#Thanks

#Trading

#VR #bitcoin #trade #business #money #btc  #bitcoins

#WTC7

#WTC7

#Whales

#WinBitcoin

#WomanCrushWednesday

#ZCLASSIC

#adspace #marketing #promote

#bit #bitcoinbillionaire #bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin

#bitcoin #Cryptocurrency #Crypto

#bitcoin #Cryptocurrency #Crypto

#bitcoin #bitclubnetwork #bitcoinrevolution

#bitcoin #bitcoins #insurance #CyberSecurity #domains

#bitcoin #bitcoins #insurance #CyberSecurity #domains

#bitcoin #bitcoins #insurance #CyberSecurity #domains

#bitcoin #cryptocurrency

#bitcoin #hyip #cash #money #investment #business

#bitcoin #hyip #mlm #M2M #Russia #Vietnam #Pakistan #india #philippines

#bitcoin #investment #inversión #investimento

#bitcoincasino #bitcoin #btc #bitcoins #casino #casinos

#bitcoincasino #bitcoin #btc #bitcoins #casino #casinos

#bitcoins

#blockchain #bitcoin #fintech

#blockchain #bitcoin #fintech #regtech

#blockchain Experts: Mine ZCash But Be Cautious

#blockchain without the #bitcoin

#bloktex

#btc

#btc #bitcoin #insurance #insure #coin #money

#btcc

#coolchickwisdom

#cryptocurrency

#earn 0,4 u #btc by #visiting #websites for a #minute

#earthsfinal321

#fintech

#fintech #AI #bitcoin #IoT

#fintech #bitcoin #blockchain #unbanked

#fintech #bitcoin #blockchain #unbanked

#fintech #bitcoin #payments #Insurtech

#fintech #bitcoin #payments #IoT

#fintech #blockchain

#fintech #blockchain

#fintech #blockchain #payments

#fintech #payments #Russia

#fintech #payments #blockchain

#fintech #payments #cryptocurrency #bitcoin #unbanked

#iOS, now. 😊

#iOS, now. 😊

#litecoin

#litecoin

#mindful #balance #Awareness

#myrant

#newsflash Kids!

#oldschool vs. #newparadigm

#restaurant

#silveretblack

#techfin #blockchain #fintech

#techfin #blockchain #fintech

#techfin #blockchain #fintech

#thefuture

#virtualreality #bitcoin #btc #bitcoins #VRgames

#virtualreality #bitcoin #btc #bitcoins #VRgames

#virtualreality #bitcoin #btc #bitcoins #VRgames

$1,999.99

$5Million/bitcoin: Q: infinitely divisible? Past 8 decimal places?

$ARK #altcoin #Bitcoin #fintech #crowdfund #ICO

$ARK #altcoin #Bitcoin #fintech #crowdfund #ICO

$ARK #altcoin #Bitcoin #fintech #crowdfund #ICO

$ARK-TEC

$ARK-TEC

$BTC #Blockchain #FinTech

$BTC #bitcoin

$BTC $XBTEUR @krakenfx #Bitcoin

$BTC $XBTEUR krakenfx #Bitcoin

$BTC +9.8%

$FacebookSDKs-iOS-4.18.0

> Paypal (Friends and Family)

> https://t.co/7Slymx5lis

> https://t.co/7Slymx5lis via @vidme

> https://t.co/7Slymx5lis via @vidme

> https://t.co/P4cqCnIkCT

(@ezscott48)

(Transmitting payments)

*govt. pushes for digital transactions*

*govt. pushes for digital transactions*

- BitKings educate about Bitcoins

- Bitcoin Network ($1.34B #VC)

- For sale -

- Scheduled Pay

- Win immensely at QueenSpin casino!

- buy starbucks gift card w/ cash

-After $500 - $600USD

-Globally #Accepted. -Lowest #Fees. -#Unbanked👄

-this site is alot of fun, one of the better sites

-this site is alot of fun, one of the better sites

.....

...if not Bitcoin, then what?

.05 bitcoin one time out of pocket ($50.00)

@BitCongress @DemocracyEarth @FEC

@BitCongress @DemocracyEarth @FEC

@BitcoinFullnode @bitcointopia @NODEfather

@BitcoinKinetics @BitSwitchIO @BitCongress @inc_bitcoin

@CryptoHustle @ChrisDunnTV

@KimDotcom

@OKCoin

@VentureScanner

@eBitcoinics.com

@el33th4xor

@realDonaldTrump

| #fintech #money

• 25/25 Love Match Deposit bonus.

• 25/25 Love Match Deposit bonus.

⏩Min deposit $10

⏫⏫⏫⏫⏫⏫⏫⏫⏫⏫ ⏫⏫⏫⏫

▼▼▼

⚫ Your hardware is Already Running :

✔✔ Bitstarz Bonus Packages.

�BECOME A MILLIONAIRE THROUGH BITCOIN

🏆Top Overview and Trends

👉🏿


In [37]:
clean_df.drop(to_drop, inplace=True)

In [38]:
clean_df.head()


Out[38]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2
190 2016-12-01 0.1 #bitcoin will be shared among everyone who'll be present in the chat. 20:31:41 11PM, UK time.… (uk, time…, 11pm) 11pm uk time… uk time
191 2016-12-29 00000000R0B 06:33:11 Building a crypto coin: some FAQs #bitcoin #money (#money, #bitcoin, faqs, building, crypto, coin, some, a) building a crypto coin some faqs #bitcoin #money building crypto coin faqs #bitcoin #money
192 2017-02-20 0000nikolay3 23:33:50 Accept Bitcoin and other crypto in your online store with CoinPayments #bitcoin #altcoins #CoinPayments (in, accept, online, #bitcoin, bitcoin, store, crypto, coinpayments, #coinpayments, and, your, with, #altcoins, other) accept bitcoin and other crypto in your online store with coinpayments #bitcoin #altcoins #coinpayments accept bitcoin crypto online store coinpayments #bitcoin #acoins #coinpayments
193 2017-02-09 000_EVAN_000 15:28:43 RT @Cointelegraph: WireX Buy #Bitcoin - a Step Closer to the World’s First Bankless Personal #Banking @wirexapp htt… (step, personal, #banking, to, closer, #bitcoin, world’s, bankless, buy, first, the, htt…, wirex, a) wirex buy #bitcoin a step closer to the world’s first bankless personal #banking htt… wirex buy #bitcoin step closer worlds first bankless personal #banking htt
194 2016-11-10 000gun 07:41:28 RT @SecNewsBot: Hacker News - Bitcoin’s Civil War Is Ending Soon (news, bitcoin’s, civil, ending, war, hacker, is, soon) hacker news bitcoin’s civil war is ending soon hacker news bitcoins civil war ending soon

In [39]:
clean_df.drop([190], inplace=True) #row 190 starts with a digit, so it passed the check above, but its "handle" is clearly tweet text

In [40]:
clean_df.shape


Out[40]:
(531044, 7)

In [41]:
clean_df.sort_values(by=['date', 'time'], inplace=True)
clean_df.reset_index(inplace=True)
del clean_df['index']
clean_df.head()


Out[41]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2
0 2016-10-26 ChiefApricot 17:51:46 Excellent primer on bitcoin mining (mining, primer, excellent, on, bitcoin) excellent primer on bitcoin mining excellent primer bitcoin mining
1 2016-10-26 bitcoin_experts 17:52:23 theymos: Do you know of a good tutorial for that? (tutorial, of, for, know, theymos, that, do, good, a, you) theymos do you know of a good tutorial for that theymos know good tutorial
2 2016-10-26 Bitcoin_City 17:52:25 #btc Darkstore opens on-demand delivery fulfillment centers in New York and Phoenix:  Darkst... #bitcoin #crypto (in, york, #btc, #bitcoin, fulfillment, centers, darkst, delivery, new, phoenix, and, #crypto, ondemand, darkstore, opens) #btc darkstore opens ondemand delivery fulfillment centers in new york and phoenix darkst #bitcoin #crypto #btc darkstore opens ondemand delivery fulfillment centers new york phoenix darkst #bitcoin #crypto
3 2016-10-26 Bitcoin_City 17:52:25 #btc Javelin Venture Partners just closed its fourth fund with $125 million:  Javelin Ventur... #bitcoin #crypto (closed, just, 125, #btc, million, fourth, #bitcoin, partners, its, fund, venture, #crypto, with, ventur, javelin) #btc javelin venture partners just closed its fourth fund with 125 million javelin ventur #bitcoin #crypto #btc javelin venture partners closed fourth fund million javelin ventur #bitcoin #crypto
4 2016-10-26 Bitcoin_City 17:52:25 #btc Microsoft shows off a new $99 input method with the Surface Dial:  Okay, this is cool. ... #bitcoin #crypto (this, is, shows, microsoft, cool, #btc, method, #bitcoin, #crypto, dial, off, with, the, okay, input, 99, new, a, surface) #btc microsoft shows off a new 99 input method with the surface dial okay this is cool #bitcoin #crypto #btc microsoft shows new input method surface dial okay cool #bitcoin #crypto

In [42]:
clean_df_tweet_by_day_plot = bar_graph(clean_df.groupby("date").count()['tweet'].index, clean_df.groupby("date").count()['tweet'], "tweets by day", "Date", "No. Tweets", "tweets-by-day-V2")
clean_df_tweet_by_day_plot #plot looks pretty uniform!


Out[42]:

In [43]:
Image("tweets_by_day.png")


Out[43]:

In [44]:
diff_df = pd.DataFrame(clean_df.groupby("date").count()['tweet'].diff())
diff_df_plot = line_graph(list(diff_df.index)[2:], list(diff_df['tweet'])[2:],'Difference in tweets by day', 'Date', 'Difference', 'difference-bar-plot')
diff_df_plot # a couple of spikes!


Out[44]:

In [45]:
Image("difference_by_day.png")


Out[45]:
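
To make the difference plot easier to interpret, here is a small sketch of what `diff()` computes on a series of daily counts. The numbers are made up for illustration and are not taken from our data; the reading of the `[2:]` slice in the cell above is my assumption (the first difference is always NaN, and the second is presumably distorted because collection started late on the first day).

```python
import pandas as pd

# Hypothetical daily tweet counts; the first day is deliberately small to mimic a partial day.
daily_counts = pd.Series(
    [1500, 4400, 4500, 4300],
    index=pd.to_datetime(["2016-10-26", "2016-10-27", "2016-10-28", "2016-10-29"]),
    name="tweet",
)

# diff() returns count[t] - count[t-1]; the first entry has no predecessor, so it is NaN.
day_over_day = daily_counts.diff()
print(day_over_day)

# Skipping the first two entries drops the NaN and the jump out of the partial first day.
trimmed = day_over_day.iloc[2:]
print(trimmed.tolist())  # [100.0, -200.0]
```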

In [46]:
#Let's find how many unique hashtags we have over a 4 month period
set_of_all_hashtags = set()
counts = {}
for i in range(clean_df.shape[0]):
    temp_lst = clean_df.iloc[i, 6].split()
    hashtags = [x for x in temp_lst if x.startswith("#")]
    for tag in hashtags:
        if tag not in counts:
            counts[tag] = 1
        else:
            counts[tag] += 1
        set_of_all_hashtags.add(tag)
print("There are {} unique hashtags over the course of 4 months".format(len(set_of_all_hashtags)))


There are 33110 unique hashtags over the course of 4 months
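
The counting loop above can be expressed more compactly with `Counter`, which we already imported at the top of the notebook. The following is a sketch on three sample strings rather than a replacement for the cell above; `count_hashtags` is a hypothetical helper name.

```python
from collections import Counter


def count_hashtags(tweets):
    """Count every whitespace-separated token that starts with '#'."""
    counts = Counter()
    for text in tweets:
        counts.update(tok for tok in text.split() if tok.startswith("#"))
    return counts


sample_tweets = [
    "building crypto coin faqs #bitcoin #money",
    "#btc darkstore opens ondemand delivery #bitcoin #crypto",
    "excellent primer bitcoin mining",
]

hashtag_counts = count_hashtags(sample_tweets)
print(len(hashtag_counts))            # 4 unique hashtags in the sample
print(hashtag_counts.most_common(1))  # [('#bitcoin', 2)]
```

Applied to our data, `count_hashtags(clean_df['clean_tweet_V2'])` should give the same counts as the loop: `len()` of the result is the number of unique hashtags, and `most_common(50)` returns the top 50 already sorted.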

In [47]:
# let's visualize the top 50 hashtags
counts_lst = []
for k, v in counts.items():
    counts_lst.append((k, v))

top_50_hashtags = pd.DataFrame(sorted(counts_lst, key=lambda x: x[1], reverse=True)[:50])
top_50_hashtags.set_index(0, inplace=True)
top_50_hashtags.sort_values(by=1, inplace=True)
top_50_hashtags.index.names = ['hashtag']
top_50_hashtags = top_50_hashtags.rename(columns = {1:'count'})

top_50_hashtags_plot = horizontal_bar_graph(top_50_hashtags['count'], top_50_hashtags.index, 'Top 50 hashtags by count', 'Count', 'Hashtag','top-50-hashtags-barh')
top_50_hashtags_plot


Out[47]:

In [48]:
Image("top_50_hashtags.png")


Out[48]:

Let's look at some summary statistics at the day level:


In [49]:
stats = pd.DataFrame(clean_df.groupby("date")['tweet'].size().describe())
stats.drop(["count"], inplace=True)
stats = stats.rename(columns = {"tweet":'tweets_per_day'})
print(tabulate(stats, headers='keys', tablefmt='fancy_grid'))


╒══════╤══════════════════╕
│      │   tweets_per_day │
╞══════╪══════════════════╡
│ mean │         4425.37  │
├──────┼──────────────────┤
│ std  │          547.018 │
├──────┼──────────────────┤
│ min  │         1453     │
├──────┼──────────────────┤
│ 25%  │         4184.5   │
├──────┼──────────────────┤
│ 50%  │         4468.5   │
├──────┼──────────────────┤
│ 75%  │         4737.5   │
├──────┼──────────────────┤
│ max  │         5593     │
╘══════╧══════════════════╛

Now that we have plotted the hashtags, there is no need to keep the "#" symbol in the text of each tweet. In fact, keeping it would only serve to confuse our sentiment calculations. Note that we only strip the symbol: the word itself stays in the tweet.


In [50]:
clean_df['clean_tweet_V2'] = clean_df['clean_tweet_V2'].apply(lambda x: x.replace("#",""))

In [51]:
clean_df.head()


Out[51]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2
0 2016-10-26 ChiefApricot 17:51:46 Excellent primer on bitcoin mining (mining, primer, excellent, on, bitcoin) excellent primer on bitcoin mining excellent primer bitcoin mining
1 2016-10-26 bitcoin_experts 17:52:23 theymos: Do you know of a good tutorial for that? (tutorial, of, for, know, theymos, that, do, good, a, you) theymos do you know of a good tutorial for that theymos know good tutorial
2 2016-10-26 Bitcoin_City 17:52:25 #btc Darkstore opens on-demand delivery fulfillment centers in New York and Phoenix:  Darkst... #bitcoin #crypto (in, york, #btc, #bitcoin, fulfillment, centers, darkst, delivery, new, phoenix, and, #crypto, ondemand, darkstore, opens) #btc darkstore opens ondemand delivery fulfillment centers in new york and phoenix darkst #bitcoin #crypto btc darkstore opens ondemand delivery fulfillment centers new york phoenix darkst bitcoin crypto
3 2016-10-26 Bitcoin_City 17:52:25 #btc Javelin Venture Partners just closed its fourth fund with $125 million:  Javelin Ventur... #bitcoin #crypto (closed, just, 125, #btc, million, fourth, #bitcoin, partners, its, fund, venture, #crypto, with, ventur, javelin) #btc javelin venture partners just closed its fourth fund with 125 million javelin ventur #bitcoin #crypto btc javelin venture partners closed fourth fund million javelin ventur bitcoin crypto
4 2016-10-26 Bitcoin_City 17:52:25 #btc Microsoft shows off a new $99 input method with the Surface Dial:  Okay, this is cool. ... #bitcoin #crypto (this, is, shows, microsoft, cool, #btc, method, #bitcoin, #crypto, dial, off, with, the, okay, input, 99, new, a, surface) #btc microsoft shows off a new 99 input method with the surface dial okay this is cool #bitcoin #crypto btc microsoft shows new input method surface dial okay cool bitcoin crypto

While it is great that our time column is accurate to the second, that level of granularity is not needed for our analysis. Instead, we will place tweets into hourly buckets. This way, we can analyse the data at both the day level and the hour level!


In [52]:
hours = []

for index in clean_df.index:
    hours.append(clean_df.iloc[index, 2].hour)

In [53]:
clean_df.shape[0] == len(hours) #perfect


Out[53]:
True

In [54]:
clean_df['hour_of_day'] = hours
clean_df.loc[clean_df['hour_of_day'] == 0, 'hour_of_day'] = 24 #relabel midnight as hour 24; .loc replaces set_value, which was removed in pandas 1.0
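
The bucketing rule from the cells above can be captured in one small function. This is a sketch that assumes the `time` column holds `datetime.time` objects (which is what the `.hour` attribute used above suggests); `hour_bucket` is a hypothetical helper name.

```python
from datetime import time


def hour_bucket(t):
    """Map a time to its hour of day, relabelling midnight (hour 0) as 24."""
    return 24 if t.hour == 0 else t.hour


sample_times = [time(17, 51, 46), time(0, 5, 0), time(23, 59, 59)]
print([hour_bucket(t) for t in sample_times])  # [17, 24, 23]
```

On the dataframe this would read `clean_df['hour_of_day'] = clean_df['time'].map(hour_bucket)`. One thing to keep in mind with this labelling: bucket 24 holds tweets posted between 00:00 and 00:59, so it sorts last on an hourly plot even though it is the first hour of its calendar date.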

In [55]:
clean_df.head()


Out[55]:
date handle time tweet tuple_version_tweet clean_tweet_V1 clean_tweet_V2 hour_of_day
0 2016-10-26 ChiefApricot 17:51:46 Excellent primer on bitcoin mining (mining, primer, excellent, on, bitcoin) excellent primer on bitcoin mining excellent primer bitcoin mining 17
1 2016-10-26 bitcoin_experts 17:52:23 theymos: Do you know of a good tutorial for that? (tutorial, of, for, know, theymos, that, do, good, a, you) theymos do you know of a good tutorial for that theymos know good tutorial 17
2 2016-10-26 Bitcoin_City 17:52:25 #btc Darkstore opens on-demand delivery fulfillment centers in New York and Phoenix:  Darkst... #bitcoin #crypto (in, york, #btc, #bitcoin, fulfillment, centers, darkst, delivery, new, phoenix, and, #crypto, ondemand, darkstore, opens) #btc darkstore opens ondemand delivery fulfillment centers in new york and phoenix darkst #bitcoin #crypto btc darkstore opens ondemand delivery fulfillment centers new york phoenix darkst bitcoin crypto 17
3 2016-10-26 Bitcoin_City 17:52:25 #btc Javelin Venture Partners just closed its fourth fund with $125 million:  Javelin Ventur... #bitcoin #crypto (closed, just, 125, #btc, million, fourth, #bitcoin, partners, its, fund, venture, #crypto, with, ventur, javelin) #btc javelin venture partners just closed its fourth fund with 125 million javelin ventur #bitcoin #crypto btc javelin venture partners closed fourth fund million javelin ventur bitcoin crypto 17
4 2016-10-26 Bitcoin_City 17:52:25 #btc Microsoft shows off a new $99 input method with the Surface Dial:  Okay, this is cool. ... #bitcoin #crypto (this, is, shows, microsoft, cool, #btc, method, #bitcoin, #crypto, dial, off, with, the, okay, input, 99, new, a, surface) #btc microsoft shows off a new 99 input method with the surface dial okay this is cool #bitcoin #crypto btc microsoft shows new input method surface dial okay cool bitcoin crypto 17

In [56]:
clean_df_tweet_by_hour_plot = line_graph(clean_df.groupby("hour_of_day").count()['tweet'].index, clean_df.groupby("hour_of_day").count()['tweet'], "tweets by hour", "Hour", "No. Tweets", "tweets-by-hour")
clean_df_tweet_by_hour_plot #plot looks pretty uniform!


Out[56]:

In [57]:
Image("tweets_by_hour.png")


Out[57]:

In [58]:
number_of_tweets = []
number_of_users = []

for k, v in Counter(clean_df.groupby("handle").count()['tweet']).items():
    number_of_tweets.append(k)
    number_of_users.append(v)
scaled_num_users = list(map(lambda x: x/1e4, number_of_users)) #scale to units of 10,000 users (note that 10e3 == 1e4, not 10^3)

tweets_per_user_plot = bar_graph(number_of_tweets, scaled_num_users, "Tweets per user", "Number of tweets per user", "Number of Users (x 10^4)", "tweet-per-user")
tweets_per_user_plot


Out[58]:

In [59]:
Image("tweets_by_user.png") #the first bar has a value of 5.0471 x 10^4, i.e. 50,471 users


Out[59]:
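
The plot above is a "count of counts": for each number of tweets, how many users posted exactly that many. The sketch below shows the same idea on a handful of handles using only `Counter`; the list of handles is a made-up sample, not our data. It also checks the scaling constant, since `10e3` in Python means 10 x 10^3, i.e. 10,000.

```python
from collections import Counter

# Made-up sample: one handle per tweet.
sample_handles = [
    "Bitcoin_City", "Bitcoin_City", "Bitcoin_City",
    "bitcoin_experts", "bitcoin_experts",
    "ChiefApricot",
    "000gun",
]

tweets_per_user = Counter(sample_handles)                   # handle -> number of tweets
users_per_tweet_count = Counter(tweets_per_user.values())   # number of tweets -> number of users

for n_tweets, n_users in sorted(users_per_tweet_count.items()):
    print(n_tweets, n_users)

print(10e3 == 1e4)  # True: 10e3 is ten thousand
```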

This concludes the exploration notebook! In the next notebook, we will gather some additional data and prepare our data for the modelling process!

As far as model building is concerned, we only need the dates, hours and clean_tweet_V2. Everything else is irrelevant. Let's go ahead and make these changes!


In [85]:
clean_df.drop(["handle", "time", "tweet", "tuple_version_tweet", "clean_tweet_V1"], axis=1, inplace=True)

In [86]:
clean_df.head()


Out[86]:
date clean_tweet_V2 hour_of_day
0 2016-10-26 excellent primer bitcoin mining 17
1 2016-10-26 theymos know good tutorial 17
2 2016-10-26 btc darkstore opens ondemand delivery fulfillment centers new york phoenix darkst bitcoin crypto 17
3 2016-10-26 btc javelin venture partners closed fourth fund million javelin ventur bitcoin crypto 17
4 2016-10-26 btc microsoft shows new input method surface dial okay cool bitcoin crypto 17

In [87]:
pickle_object(clean_df, "clean_df_NB3_Complete")

Conclusion

In this notebook, we have attempted to control for bots as well as duplicate tweets. I am sure that some tweets slipped through my filtering and cleaning code. However, their number should be very small and should not have a major impact on my results. At least, that is the hope!

