In [5]:
import pandas as pd
import datetime
import numpy as np
import scipy as sp
import os
import matplotlib.pyplot as plt
import matplotlib
# from ggplot import geom_point
%matplotlib inline
# font = {'size'   : 18}
# matplotlib.rc('font', **font)
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
os.chdir("/root/Envs/btc-project/btc-price-analysis")

Reading Google Trends data

Google Trends data seems to have been updated since the original Google Trends paper. Currently, we are unable to get absolute search volume data; instead, for each week we get a normalized search interest index, which appears to range from 0 to 100. The index is normalized against the search region, so it reflects a search "density" rather than a raw count of searches.

This change might make an exact replication impossible.


In [6]:
# parse the 'Week' column from the Google Trends export: each row is labeled
# with a date range, and we keep only the end date of that range
def parseWeek(w):
    return w.split(" - ")[1]
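
For reference, the Google Trends export labels each row with a date range (assumed here to look like "2012-01-01 - 2012-01-07"), and parseWeek keeps only the end date:

parseWeek("2012-01-01 - 2012-01-07")  # -> "2012-01-07", the Saturday ending that week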

In this replication, we focus on a single keyword, 'Bitcoin'.


In [7]:
trend = pd.read_csv("./data/trend.csv", converters={0:parseWeek})
trend['Week'] = pd.to_datetime(trend['Week'])
trend.set_index(['Week'], inplace=True)
trend.columns = ['search']
trend.head()


Out[7]:
search
Week
2012-01-07 2
2012-01-14 2
2012-01-21 4
2012-01-28 2
2012-02-04 2

We use the Bitcoin price index (BPI) data from Coinbase and resample it into weekly bars ending on each Saturday to match the Google Trends data. That makes every Friday our action day, on which we either buy or sell bitcoin.


In [8]:
time_format = "%Y-%m-%dT%H:%M:%S"
data = pd.read_csv("./data/price.csv", names=['time', 'price'], index_col='time',
                   parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
bpi = data.resample('w-sat', how='ohlc')
bpi.index.name = 'Week'
bpi = pd.DataFrame(bpi['price']['close'])
bpi.head()


Out[8]:
close
Week
2011-11-05 2.98
2011-11-12 3.01
2011-11-19 2.19
2011-11-26 2.47
2011-12-03 2.78
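
Note that resample('w-sat', how='ohlc') and the explicit date_parser follow the older pandas API this notebook was written against. On a recent pandas release, an equivalent sketch (untested here) would be:

data = pd.read_csv("./data/price.csv", names=['time', 'price'],
                   index_col='time', parse_dates=[0])
bpi = data.resample('W-SAT').ohlc()            # weekly OHLC bars, weeks ending Saturday
bpi = pd.DataFrame(bpi[('price', 'close')])    # keep only the weekly close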

In [9]:
trend_bpi = pd.merge(trend, bpi, how='right', left_index=True, right_index=True)
trend_bpi.columns = ['search', 'close_price']
trend_bpi = trend_bpi['2012':]
trend_bpi.head()


Out[9]:
search close_price
Week
2012-01-07 2 6.73
2012-01-14 2 6.73
2012-01-21 4 6.16
2012-01-28 2 5.68
2012-02-04 2 5.92

BPI and search interest plot


In [10]:
plt.figure()
ax = trend_bpi.plot(secondary_y=['close_price'])


<matplotlib.figure.Figure at 0x7fc4e578d590>

Correlation given by Pearson's coefficient


In [11]:
trend_bpi.corr()


Out[11]:
search close_price
search 1.000000 0.736974
close_price 0.736974 1.000000

BPI v. search interest (relative change)

Similar to a return index, here we calculate the week-over-week change of both variables (BPI and search interest). We only show 2014 to 2015, as the variance in earlier periods is too large.


In [12]:
trend_bpi.pct_change().plot()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4e5510450>

Again, correlation.


In [13]:
trend_bpi.pct_change().corr()


Out[13]:
search close_price
search 1.000000 0.179871
close_price 0.179871 1.000000

The Pearson correlation coefficient drops once we examine the weekly changes instead of the raw values of the two variables. Does this matter?

Replicating the Google Trends paper

We first take the moving average of the search interest index (SII).
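
Spelled out (notation mine, not from the original paper), with delta_t = 3 the rolling mean in week t is

    rolling_SII(t) = ( SII(t) + SII(t-1) + SII(t-2) ) / 3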


In [14]:
delta_t = 3

trend_bpi['rolling_SII'] = pd.rolling_mean(trend_bpi.search, delta_t)
trend_bpi.head()


Out[14]:
search close_price rolling_SII
Week
2012-01-07 2 6.73 NaN
2012-01-14 2 6.73 NaN
2012-01-21 4 6.16 2.666667
2012-01-28 2 5.68 2.666667
2012-02-04 2 5.92 2.666667
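
pd.rolling_mean is the older pandas interface; on newer pandas versions the same moving average would be written roughly as:

trend_bpi['rolling_SII'] = trend_bpi['search'].rolling(delta_t).mean()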

We shift the moving average forward by one week, since we are trying to predict BPI from the previous weeks' search interest.


In [15]:
trend_bpi['rolling_SII_shifted'] = trend_bpi.rolling_SII.shift(1)
trend_bpi.head()


Out[15]:
search close_price rolling_SII rolling_SII_shifted
Week
2012-01-07 2 6.73 NaN NaN
2012-01-14 2 6.73 NaN NaN
2012-01-21 4 6.16 2.666667 NaN
2012-01-28 2 5.68 2.666667 2.666667
2012-02-04 2 5.92 2.666667 2.666667

Generate order signal

If this week's search interest is lower than the moving average of the past delta_t (three) weeks, people are searching less about bitcoin, so it is likely a good time to buy; conversely, if people start searching more about bitcoin this week, it is time to sell.

We generate the order data. A value of 1 means we buy BTC that week and sell it the next week; a value of -1 means we sell it that week and buy it back the next week.

We assign the order signal by comparing this week's search interest with the shifted rolling mean of the previous three weeks' search interest. Weeks for which the shifted rolling mean is not yet available keep the default order of 0, i.e. no position.


In [16]:
trend_bpi['order']=0
trend_bpi['SII_diff'] = trend_bpi.search - trend_bpi.rolling_SII_shifted
## SII_diff >= 0 => search interest rises this week => expect the price to fall => sell (-1)
trend_bpi.loc[trend_bpi.SII_diff >= 0,'order'] = -1
## SII_diff < 0 => search interest falls this week => expect the price to rise => buy (1)
trend_bpi.loc[trend_bpi.SII_diff < 0,'order'] = 1
trend_bpi.head()


Out[16]:
search close_price rolling_SII rolling_SII_shifted order SII_diff
Week
2012-01-07 2 6.73 NaN NaN 0 NaN
2012-01-14 2 6.73 NaN NaN 0 NaN
2012-01-21 4 6.16 2.666667 NaN 0 NaN
2012-01-28 2 5.68 2.666667 2.666667 1 -0.666667
2012-02-04 2 5.92 2.666667 2.666667 1 -0.666667

Evaluation as returns

Compute log returns as proposed in the paper: the difference of the log prices of two consecutive weeks, multiplied by the order signal.
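
Written out (notation mine), the return credited to week t is

    r(t) = order(t) * ( log close(t+1) - log close(t) )

so a buy signal (+1) earns when the price rises over the following week, and a sell signal (-1) earns when it falls.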


In [17]:
trend_bpi['log_returns'] = trend_bpi.order * np.log(trend_bpi.close_price.shift(-1)) - \
                            trend_bpi.order * np.log(trend_bpi.close_price)

In [18]:
trend_bpi.log_returns.head()


Out[18]:
Week
2012-01-07    0.000000
2012-01-14    0.000000
2012-01-21    0.000000
2012-01-28    0.041385
2012-02-04   -0.050227
Freq: W-SAT, Name: log_returns, dtype: float64

A positive return indicates a gain in that week.


In [19]:
trend_bpi[trend_bpi.log_returns>0].close_price.count()


Out[19]:
65

A negative return indicates a loss in that week.


In [20]:
trend_bpi[trend_bpi.log_returns<0].close_price.count()


Out[20]:
101

Plot cumulative returns over time. Exponentiating the cumulative sum of log returns converts it back to a cumulative simple return, i.e. exp(sum of log returns) - 1.


In [21]:
(np.exp(trend_bpi.log_returns.cumsum()) - 1).plot()


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4e539c350>

Plot cumulative return only for 2015


In [22]:
(np.exp(trend_bpi['2015'].log_returns.cumsum()) - 1).plot()


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4e5514890>

Plot only 2014


In [23]:
(np.exp(trend_bpi['2014'].log_returns.cumsum()) - 1).plot()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4e4f885d0>

Plot only 2013


In [24]:
(np.exp(trend_bpi['2013'].log_returns.cumsum()) - 1).plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4e4dcbd50>

This strategy seems to rely heavily on its performance in the earlier period (e.g. 2013) rather than in later ones. When the earlier period is included, the overall performance looks very good; with the earlier period cut off, the performance is much worse.

Evaluation as prediction of trend

Label each week as up (1) or down (-1) by comparing its closing price with that of the previous week.


In [25]:
def trend_label(cur,prev):
    if cur == prev:
        return 0
    elif cur > prev:
        return 1
    else:
        return -1
trend_bpi['truth'] = np.vectorize(trend_label)(trend_bpi.close_price, trend_bpi.close_price.shift(1))
trend_bpi.head()


Out[25]:
search close_price rolling_SII rolling_SII_shifted order SII_diff log_returns truth
Week
2012-01-07 2 6.73 NaN NaN 0 NaN 0.000000 -1
2012-01-14 2 6.73 NaN NaN 0 NaN 0.000000 0
2012-01-21 4 6.16 2.666667 NaN 0 NaN 0.000000 -1
2012-01-28 2 5.68 2.666667 2.666667 1 -0.666667 0.041385 -1
2012-02-04 2 5.92 2.666667 2.666667 1 -0.666667 -0.050227 1

In [26]:
trend_bpi.groupby('truth').truth.count()


Out[26]:
truth
-1     72
 0      1
 1    100
Name: truth, dtype: int64

In the evaluation below, "1" (price up) is treated as the positive class and "-1" (price down) as the negative class. For comparison, here is the distribution of the predicted order signal:


In [27]:
trend_bpi.groupby('order').truth.count()


Out[27]:
order
-1    92
 0     7
 1    74
Name: truth, dtype: int64

In [28]:
trend_bpi_exclude_init = trend_bpi[3:]  # drop the first delta_t weeks, which have no trading signal
true_prediction = trend_bpi_exclude_init.truth==trend_bpi_exclude_init.order
correct_ratio = trend_bpi_exclude_init[true_prediction].close_price.count()/float(trend_bpi_exclude_init.close_price.count())
print "Correctly predicting trend: %f" % correct_ratio


Correctly predicting trend: 0.423529

In [29]:
true_positive = trend_bpi[(trend_bpi.truth==1)&(trend_bpi.order==1)].order.count()
false_negative = trend_bpi[(trend_bpi.truth==1)&(trend_bpi.order==-1)].order.count()
false_positive = trend_bpi[(trend_bpi.truth==-1)&(trend_bpi.order==1)].order.count()
true_negative = trend_bpi[(trend_bpi.truth==-1)&(trend_bpi.order==-1)].order.count()
print "TP: %d, FN: %d, FP: %d, TN: %d" % (true_positive, false_negative, false_positive, true_negative)


TP: 39, FN: 59, FP: 35, TN: 33

In [30]:
tp_rate = float(true_positive) /(true_positive+false_negative)
fp_rate = float(false_positive) /(true_negative+false_positive)
print "TPR: %f, FPR: %f" % (tp_rate, fp_rate)


TPR: 0.397959, FPR: 0.514706