In [5]:
import pandas as pd
import datetime
import numpy as np
import scipy as sp
import os
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
font = {'size'   : 18}
matplotlib.rc('font', **font)
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
#os.chdir("/root/Envs/btc-analysis/btc-price-analysis")
time_format = "%Y-%m-%dT%H:%M:%SZ"

Weekly sentiment score analysis


In [8]:
score_data = pd.read_csv("../data/nyt_bitcoin_with_score.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
score_data.head()


Out[8]:
                     headline                                           sentiment  sentimentValue
time
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  Negative   1
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  Negative   1
2012-04-12 14:30:13  Canada Seeks to Turn Coins Into Digital Currency   Neutral    2
2013-03-12 20:28:27  Today's Scuttlebot: Bitcoin Problem and Tracki...  Negative   1
2013-04-08 00:00:00  Bubble or No, This Virtual Currency Is a Lot o...  Negative   1
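
In recent pandas releases the `date_parser` argument of `read_csv` is deprecated, so the loading cell may need adjusting there. A minimal sketch of one alternative, assuming the same `time_format` and using a small in-memory CSV with made-up rows in place of the real file (`csv_text` and `score_data_sketch` are illustrative names, not part of the original notebook):

```python
import io
import pandas as pd

time_format = "%Y-%m-%dT%H:%M:%SZ"

# Made-up rows standing in for ../data/nyt_bitcoin_with_score.csv
csv_text = (
    "time,headline,sentiment,sentimentValue\n"
    "2012-01-16T01:01:41Z,Example headline A,Negative,1\n"
    "2012-04-12T14:30:13Z,Example headline B,Neutral,2\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
# Parse the timestamp column explicitly, then use it as the index
raw["time"] = pd.to_datetime(raw["time"], format=time_format)
score_data_sketch = raw.set_index("time")
print(score_data_sketch.index[0])
```

Parsing after loading keeps the format string explicit and should behave the same way across pandas versions.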

Ratio of "negative", "neutral", "positive"


In [9]:
score_data.sentiment.unique()


Out[9]:
array(['Negative', 'Neutral', 'Positive'], dtype=object)

In [10]:
score_data.groupby("sentiment").sentiment.count().plot(kind='bar',rot=0)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe2a98a0810>

The ratings are heavily negative. Is this specific to bitcoin news? To double-check, we run the same analysis on news whose headlines include "internet".

Alternate news analysis (digression)


In [10]:
internet_news = pd.read_csv("../data/nyt_internet_with_score.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
internet_news.head()


Out[10]:
                     headline                                sentiment  sentimentValue
time
2012-01-05 00:00:00  Internet Access Is Not a Human Right    Positive   3
2012-01-05 10:34:13  Internet Access Is Not a Human Right    Positive   3
2012-01-05 10:34:13  Internet Access Is Not a Human Right    Positive   3
2012-01-06 00:00:00  Students of Online Schools Are Lagging  Negative   1
2012-01-09 15:50:30  Be Nice on the Internet Week            Positive   3

In [12]:
internet_news.groupby("sentiment").sentiment.count()


Out[12]:
sentiment
Negative    527
Neutral     337
Positive    135
Name: sentiment, dtype: int64

So it seems that most news would be classified as negative by the Stanford classifier: 527 of the 999 "internet" headlines (about 53%) are rated negative. How about another classifier?
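
To put the counts on a common scale, they can be converted into shares. A minimal sketch, using the three counts copied from the output above (the names `counts` and `shares` are illustrative):

```python
import pandas as pd

# Counts copied from the Out[12] cell above
counts = pd.Series({"Negative": 527, "Neutral": 337, "Positive": 135})

# Divide by the total to get each label's share of all headlines
shares = counts / counts.sum()
print((100 * shares).round(1))
```

On the raw column, `internet_news.sentiment.value_counts(normalize=True)` should give the same shares directly.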

Indico.io sentiment score

Here we analyze the scores generated by the Indico.io API on the same dataset. Each score lies between 0 and 1, and scores above 0.5 are considered positive.
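
The 0.5 threshold can be turned into labels for a rough comparison with the Stanford categories. A minimal sketch, using the distinct scores from the first rows of the bitcoin dataset shown in this notebook; labelling everything at or below 0.5 as "Negative" is an assumption made here, since only the positive side of the threshold is described:

```python
import pandas as pd

# Distinct scores copied from the indico_news.head() output in this notebook
scores = pd.Series([0.599536, 0.429367, 0.486258, 0.469938])

# Assumption: anything not above 0.5 is labelled "Negative"
labels = scores.gt(0.5).map({True: "Positive", False: "Negative"})
print(labels.tolist())
```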


In [8]:
indico_news = pd.read_csv("../data/indico_nyt_bitcoin.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
indico_news.head()


Out[8]:
                     headline                                           indico_score
time
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  0.599536
2012-01-16 01:01:41  'Good Wife' Watch: Jason Biggs, Jim Cramer and...  0.599536
2012-04-12 14:30:13  Canada Seeks to Turn Coins Into Digital Currency   0.429367
2013-03-12 20:28:27  Today's Scuttlebot: Bitcoin Problem and Tracki...  0.486258
2013-04-08 00:00:00  Bubble or No, This Virtual Currency Is a Lot o...  0.469938

In [9]:
indico_news.indico_score.describe()


Out[9]:
count    354.000000
mean       0.508686
std        0.111847
min        0.186255
25%        0.439536
50%        0.509819
75%        0.569890
max        0.890683
Name: indico_score, dtype: float64

Distribution


In [6]:
indico_news.indico_score.plot(kind='hist')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c13673e50>

The distribution of the Indico score looks roughly normal and is centered near 0.5, which is better behaved than the heavily negative Stanford result. So maybe we should try using the Indico score instead?
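
One quick way to probe the "looks normal" impression without the raw data is to compare the observed quartiles with those of a normal distribution that has the same mean and standard deviation. A minimal sketch, using the summary statistics copied from the `describe()` output above; this is a rough sanity check, not a formal normality test:

```python
from statistics import NormalDist

# Summary statistics copied from the describe() output above
mean, std = 0.508686, 0.111847
observed = {0.25: 0.439536, 0.75: 0.569890}

# Normal distribution with the same mean and standard deviation
fitted = NormalDist(mu=mean, sigma=std)
for q, obs in observed.items():
    expected = fitted.inv_cdf(q)
    print("q=%.2f observed=%.4f normal=%.4f" % (q, obs, expected))
```

The observed quartiles sit within roughly 0.015 of the normal ones, which is consistent with a roughly normal shape but does not prove it.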


In [11]:
indico_news.resample('w', how='mean').plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81e8f3d290>

Let's try again with news about "internet".


In [12]:
indico_news = pd.read_csv("../data/indico_nyt_internet.csv", index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
indico_news.head()


Out[12]:
                     headline                                indico_score
time
2012-01-05 00:00:00  Internet Access Is Not a Human Right    0.347786
2012-01-05 10:34:13  Internet Access Is Not a Human Right    0.347786
2012-01-05 10:34:13  Internet Access Is Not a Human Right    0.347786
2012-01-06 00:00:00  Students of Online Schools Are Lagging  0.596841
2012-01-09 15:50:30  Be Nice on the Internet Week            0.420838

In [5]:
indico_news.indico_score.plot(kind='hist', bins=20)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81e94397d0>

Again, the distribution looks approximately normal. I am not sure whether a reasonable distribution of sentiment about a topic should look like this: the topic is not a very neutral one, so we would probably expect the distribution to be positively skewed.

This needs to be further studied for validity.


In [13]:
indico_news.indico_score.resample('w', how='mean').plot()


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81e8d6e5d0>

In [6]:
indico_news.indico_score.describe()


Out[6]:
count    999.000000
mean       0.512424
std        0.123417
min       -0.132640
25%        0.437790
50%        0.514323
75%        0.590976
max        0.954630
Name: indico_score, dtype: float64
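
The summary statistics give a rough handle on the skewness question raised earlier. A minimal sketch, assuming Pearson's second skewness coefficient, 3 * (mean - median) / std, is an acceptable stand-in for the sample skewness; the numbers are copied from the two `describe()` outputs in this notebook, and the helper name is illustrative:

```python
# Summary statistics copied from the two describe() outputs in this notebook
summaries = {
    "bitcoin": {"mean": 0.508686, "median": 0.509819, "std": 0.111847},
    "internet": {"mean": 0.512424, "median": 0.514323, "std": 0.123417},
}

def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3.0 * (mean - median) / std

skewness = {name: pearson_median_skewness(**s) for name, s in summaries.items()}
for name, value in skewness.items():
    print("%s: %.4f" % (name, value))
```

Both values come out slightly below zero, so neither dataset shows the positive skew one might expect. Separately, the minimum of -0.132640 falls outside the 0 to 1 range described for the score, which may be worth checking.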

Weekly distribution


In [4]:
weekly_news_count = score_data.resample('w', how='count').fillna(0)
weekly_news_count.sentiment.describe()


Out[4]:
count    168.000000
mean       2.107143
std        4.115162
min        0.000000
25%        0.000000
50%        0.000000
75%        3.000000
max       36.000000
Name: sentiment, dtype: float64

News distribution by week


In [5]:
weekly_news_count.sentiment.plot()


Out[5]:
<matplotlib.axes.AxesSubplot at 0x107435b10>

News distribution


In [6]:
weekly_news_count.sentiment.plot(kind='hist')


Out[6]:
<matplotlib.axes.AxesSubplot at 0x107637a50>

Average weekly sentiment score


In [18]:
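# Note: 'd' resamples by calendar day, so despite its name weekly_score holds
# daily means; days without news become 0 through fillna(0).
weekly_score = score_data.resample('d', how='mean').fillna(0)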
weekly_score.head()


Out[18]:
            sentimentValue
time
2012-01-16  1
2012-01-17  0
2012-01-18  0
2012-01-19  0
2012-01-20  0

Score Distribution


In [19]:
weekly_score.sentimentValue.plot(kind='hist')


Out[19]:
<matplotlib.axes.AxesSubplot at 0x108e739d0>

Score distribution by week


In [20]:
weekly_score.plot()


Out[20]:
<matplotlib.axes.AxesSubplot at 0x108f6f850>
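
Because `fillna(0)` assigns a score of 0 to periods with no news, and the `sentimentValue` codes seen in this notebook run from 1 to 3, the zeros can dominate both the histogram and any average. A minimal sketch of one possible alternative, leaving empty periods as NaN, on made-up scores; it is written with the `.resample(...).mean()` form of current pandas rather than the `how=` argument used in the cells above:

```python
import pandas as pd

# Made-up scores: two headlines on one day, none the next day, one the day after
idx = pd.to_datetime(
    ["2012-01-16 01:01:41", "2012-01-16 09:00:00", "2012-01-18 12:00:00"]
)
scores = pd.Series([1, 1, 3], index=idx, name="sentimentValue")

daily = scores.resample("D").mean()  # days without news are NaN
filled = daily.fillna(0)             # the fillna(0) variant: NaN becomes 0

print(filled.mean())          # zeros pull the average down
print(daily.dropna().mean())  # average over days that actually have news
```

Whether gaps should count as neutral, as missing, or as something else is a modelling choice; the sketch only shows how much the two conventions can differ.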

News about bitcoin is missing for about half of the time period covered. Therefore we try the keyword "internet".


In [21]:
# weekly_score was resampled with 'd', so each row is a day, not a week
missing_news = 100.0 * (weekly_score.sentimentValue == 0).mean()
print("Percentage of days without news:  %f%%" % missing_news)


Percentage of days without news:  82.735043%
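
The share of empty periods depends strongly on the resampling frequency: the cell that builds `weekly_score` uses 'd' (daily bins), while `weekly_news_count` uses 'w'. A minimal sketch on made-up timestamps that computes both shares side by side; `empty_share` is a hypothetical helper, and the method-chaining resample API of current pandas is assumed:

```python
import pandas as pd

# Made-up timestamps: two headlines on Monday 2012-01-16, one on Monday 2012-01-30
idx = pd.to_datetime(
    ["2012-01-16 01:01:41", "2012-01-16 09:00:00", "2012-01-30 14:30:13"]
)
news = pd.Series(1, index=idx, name="headline_count")

def empty_share(series, freq):
    """Fraction of resampling bins of width `freq` that contain no rows."""
    counts = series.resample(freq).count()
    return (counts == 0).mean()

daily_empty = empty_share(news, "D")
weekly_empty = empty_share(news, "W")
print("days without news:  %.1f%%" % (100 * daily_empty))
print("weeks without news: %.1f%%" % (100 * weekly_empty))
```

On this toy input 13 of 15 days are empty but only 1 of 3 weeks, which illustrates why a daily figure should not be read as a weekly one.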