In [5]:

    
import pandas as pd
import datetime
import numpy as np
import scipy as sp
import os
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
font = {'size'   : 18}
matplotlib.rc('font', **font)
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
#os.chdir("/root/Envs/btc-analysis/btc-price-analysis")
time_format = "%Y-%m-%dT%H:%M:%SZ"

Weekly sentiment score analysis



In [8]:

    
score_data = pd.read_csv("../data/nyt_bitcoin_with_score.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
score_data.head()









    Out[8]:






  
    
      
time                 headline                                           sentiment  sentimentValue
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  Negative   1
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  Negative   1
2012-04-12 14:30:13  Canada Seeks to Turn Coins Into Digital Currency   Neutral    2
2013-03-12 20:28:27  Today's Scuttlebot: Bitcoin Problem and Tracki...  Negative   1
2013-04-08 00:00:00  Bubble or No, This Virtual Currency Is a Lot o...  Negative   1
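The first two rows of this output share the same timestamp and headline, so the same article may be counted twice in the ratios that follow. A minimal sketch of one way to collapse such repeats, assuming that an identical timestamp plus an identical headline means the same article (the notebook itself does not deduplicate, and the names `sample` and `deduped` are only illustrative):

```python
import pandas as pd

# Toy frame built from the first three rows shown in the output above.
sample = pd.DataFrame(
    {
        "headline": [
            "'Good Wife' Watch: Jason Biggs, Jim Cramer and...",
            "'Good Wife' Watch: Jason Biggs, Jim Cramer and...",
            "Canada Seeks to Turn Coins Into Digital Currency",
        ],
        "sentiment": ["Negative", "Negative", "Neutral"],
        "sentimentValue": [1, 1, 2],
    },
    index=pd.to_datetime(
        ["2012-01-16 01:01:41", "2012-01-16 01:01:41", "2012-04-12 14:30:13"]
    ),
)
sample.index.name = "time"

# Assumption: same timestamp + same headline = same article.
deduped = (
    sample.reset_index()
    .drop_duplicates(subset=["time", "headline"])
    .set_index("time")
)
print(len(sample), "->", len(deduped))
```

Whether repeats should be dropped depends on how the headlines were collected; if a repeat reflects a genuine re-publication, keeping it may be the better choice.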

Ratio of "negative", "neutral", "positive"



In [9]:

    
score_data.sentiment.unique()









    Out[9]:





array(['Negative', 'Neutral', 'Positive'], dtype=object)



In [10]:

    
score_data.groupby("sentiment").sentiment.count().plot(kind='bar',rot=0)









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fe2a98a0810>

The ratings are overwhelmingly negative. Is this specific to bitcoin news? To double-check, run the same analysis on news whose headlines include "internet".

Alternate news analysis (digression)



In [10]:

    
internet_news = pd.read_csv("../data/nyt_internet_with_score.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
internet_news.head()









    Out[10]:






  
    
      
time                 headline                                sentiment  sentimentValue
2012-01-05 00:00:00  Internet Access Is Not a Human Right    Positive   3
2012-01-05 10:34:13  Internet Access Is Not a Human Right    Positive   3
2012-01-05 10:34:13  Internet Access Is Not a Human Right    Positive   3
2012-01-06 00:00:00  Students of Online Schools Are Lagging  Negative   1
2012-01-09 15:50:30  Be Nice on the Internet Week            Positive   3



In [12]:

    
internet_news.groupby("sentiment").sentiment.count()









    Out[12]:





sentiment
Negative    527
Neutral     337
Positive    135
Name: sentiment, dtype: int64

So it seems that the Stanford classifier labels most news as negative (527 of the 999 "internet" headlines). How about another classifier?
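To make the ratio explicit, the counts from the groupby output above can be turned into shares. This is only a sketch of the arithmetic; on the raw column, `internet_news.sentiment.value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series({"Negative": 527, "Neutral": 337, "Positive": 135}, name="sentiment")

# Share of each label among all "internet" headlines.
shares = counts / counts.sum()
print(shares.round(3))
```

Slightly more than half of the headlines are labelled negative, and only about one in seven is labelled positive.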

Indico.io sentiment score

Here we analyze the scores generated by the Indico.io API on the same dataset. The score is between 0 and 1, and scores > 0.5 are considered positive.
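The 0.5 threshold can be applied with a single comparison. A minimal sketch using the scores of the first five bitcoin headlines in this dataset; treating a score of exactly 0.5 as not positive is an assumption, since the rule above only says ">0.5".

```python
import pandas as pd

# Scores of the first five bitcoin headlines in the Indico file.
scores = pd.Series(
    [0.599536, 0.599536, 0.429367, 0.486258, 0.469938], name="indico_score"
)

# Assumption: only scores strictly above 0.5 count as positive.
is_positive = scores > 0.5
positive_share = is_positive.mean()
print(is_positive.tolist(), positive_share)
```

The mean of a boolean Series is the fraction of `True` values, which is a compact way to get the positive share for the full column as well.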



In [8]:

    
indico_news = pd.read_csv("../data/indico_nyt_bitcoin.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
indico_news.head()









    Out[8]:






  
    
      
time                 headline                                           indico_score
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  0.599536
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  0.599536
2012-04-12 14:30:13  Canada Seeks to Turn Coins Into Digital Currency   0.429367
2013-03-12 20:28:27  Today's Scuttlebot: Bitcoin Problem and Tracki...  0.486258
2013-04-08 00:00:00  Bubble or No, This Virtual Currency Is a Lot o...  0.469938



In [9]:

    
indico_news.indico_score.describe()









    Out[9]:





count    354.000000
mean       0.508686
std        0.111847
min        0.186255
25%        0.439536
50%        0.509819
75%        0.569890
max        0.890683
Name: indico_score, dtype: float64

Distribution



In [6]:

    
indico_news.indico_score.plot(kind='hist')









    Out[6]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f5c13673e50>

The distribution of the Indico score looks roughly normal, centered near 0.5, which is more balanced than the Stanford labels. So maybe we should try using the Indico score?
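Judging normality from a histogram alone is subjective, so a quick numeric check of skewness and excess kurtosis may help. The sketch below runs on synthetic data drawn with the mean and standard deviation from the describe() output above; it is a stand-in, not the real scores. On the actual data the same two calls would be made on `indico_news.indico_score`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for indico_news.indico_score (NOT the real data):
# 354 draws from a normal distribution with the reported mean and std.
rng = np.random.default_rng(0)
synthetic_scores = pd.Series(rng.normal(loc=0.508686, scale=0.111847, size=354))

# Both statistics are close to 0 for a normal distribution.
skewness = synthetic_scores.skew()
excess_kurtosis = synthetic_scores.kurt()
print("skewness: %.3f, excess kurtosis: %.3f" % (skewness, excess_kurtosis))
```

Values far from 0 on the real column would suggest the "looks normal" impression does not hold; values near 0 are consistent with it but do not prove it.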



In [11]:

    
# Weekly mean of the Indico score (the how= argument of resample was removed from pandas).
indico_news.indico_score.resample('W').mean().plot()









    Out[11]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f81e8f3d290>

Let's try again with news about "internet".



In [12]:

    
indico_news = pd.read_csv("../data/indico_nyt_internet.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
indico_news.head()









    Out[12]:






  
    
      
time                 headline                                indico_score
2012-01-05 00:00:00  Internet Access Is Not a Human Right    0.347786
2012-01-05 10:34:13  Internet Access Is Not a Human Right    0.347786
2012-01-05 10:34:13  Internet Access Is Not a Human Right    0.347786
2012-01-06 00:00:00  Students of Online Schools Are Lagging  0.596841
2012-01-09 15:50:30  Be Nice on the Internet Week            0.420838



In [5]:

    
indico_news.indico_score.plot(kind='hist', bins=20)









    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f81e94397d0>

Again, it looks like a normal distribution. I am not sure whether sentiment about a single topic should reasonably be distributed like this: the topic is not a very neutral one, so we should probably expect the distribution to be positively skewed.

The validity of these scores needs further study.



In [13]:

    
indico_news.indico_score.resample('W').mean().plot()









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f81e8d6e5d0>



In [6]:

    
indico_news.indico_score.describe()









    Out[6]:





count    999.000000
mean       0.512424
std        0.123417
min       -0.132640
25%        0.437790
50%        0.514323
75%        0.590976
max        0.954630
Name: indico_score, dtype: float64
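The minimum of -0.132640 lies outside the 0 to 1 range stated earlier for the Indico score, so it may be worth flagging such values before relying on the data. A minimal sketch on a few values taken from the outputs above; whether out-of-range scores should be clipped, dropped, or investigated is an open choice, and clipping is shown only as one option.

```python
import pandas as pd

# A few values from the outputs above, including the reported min and max.
scores = pd.Series(
    [-0.132640, 0.347786, 0.596841, 0.420838, 0.954630], name="indico_score"
)

# Scores outside the documented [0, 1] range.
out_of_range = scores[(scores < 0) | (scores > 1)]

# One possible treatment (an assumption, not the notebook's method): clip to [0, 1].
clipped = scores.clip(lower=0.0, upper=1.0)
print(len(out_of_range), clipped.min(), clipped.max())
```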

Weekly distribution



In [4]:

    
weekly_news_count = score_data.resample('W').count().fillna(0)
weekly_news_count.sentiment.describe()









    Out[4]:





count    168.000000
mean       2.107143
std        4.115162
min        0.000000
25%        0.000000
50%        0.000000
75%        3.000000
max       36.000000
Name: sentiment, dtype: float64

News distribution by week



In [5]:

    
weekly_news_count.sentiment.plot()









    Out[5]:





<matplotlib.axes.AxesSubplot at 0x107435b10>

News distribution



In [6]:

    
weekly_news_count.sentiment.plot(kind='hist')









    Out[6]:





<matplotlib.axes.AxesSubplot at 0x107637a50>

Average weekly sentiment score



In [18]:

    
# Note: 'D' resamples by day, not by week, so each row of weekly_score is one day.
weekly_score = score_data[['sentimentValue']].resample('D').mean().fillna(0)
weekly_score.head()









    Out[18]:






  
    
      
time        sentimentValue
2012-01-16  1
2012-01-17  0
2012-01-18  0
2012-01-19  0
2012-01-20  0

Score Distribution



In [19]:

    
weekly_score.sentimentValue.plot(kind='hist')









    Out[19]:





<matplotlib.axes.AxesSubplot at 0x108e739d0>

Score distribution by week



In [20]:

    
weekly_score.plot()









    Out[20]:





<matplotlib.axes.AxesSubplot at 0x108f6f850>

Bitcoin news is missing for at least half of the weeks in the period (the median weekly count above is 0). That is why we also try the keyword "internet".



In [21]:

    
# Valid scores are 1-3, so a value of 0 can only come from fillna(0), i.e. a period without news.
missing_news = 100.0 * (weekly_score.sentimentValue == 0).sum() / len(weekly_score)
print("Percentage of weeks without news:  %f%%" % missing_news)









    



Percentage of weeks without news:  82.735043%
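The share of empty periods depends strongly on the resampling frequency, so it helps to compute it for daily and weekly bins side by side. A minimal sketch on made-up timestamps; the helper name `share_without_news` and the toy data are illustrative only. It counts headlines per bin rather than filling missing means with 0, which avoids confusing an empty bin with a real score.

```python
import pandas as pd


def share_without_news(frame, freq):
    """Percentage of resampled periods that contain no headline."""
    counts = frame["sentimentValue"].resample(freq).count()
    return 100.0 * (counts == 0).sum() / len(counts)


# Toy data: three headlines spread over four calendar weeks (made-up timestamps).
toy = pd.DataFrame(
    {"sentimentValue": [1, 2, 1]},
    index=pd.to_datetime(
        ["2012-01-02 09:00", "2012-01-03 12:00", "2012-01-24 08:00"]
    ),
)

daily_gap = share_without_news(toy, "D")   # 20 of 23 days are empty
weekly_gap = share_without_news(toy, "W")  # 2 of 4 weeks are empty
print("daily: %.1f%%, weekly: %.1f%%" % (daily_gap, weekly_gap))
```

On this toy data the daily figure is far higher than the weekly one, which is the same pattern one would expect when comparing a per-day percentage with the weekly counts above.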
