In [5]:
import pandas as pd
import datetime
import numpy as np
import scipy as sp
import os
import matplotlib.pyplot as plt
import matplotlib
# from ggplot import geom_point
%matplotlib inline
# font = {'size' : 18}
# matplotlib.rc('font', **font)
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
os.chdir("/root/Envs/btc-project/btc-price-analysis")
The Google Trends data appears to have been updated since the original Google Trends paper. Currently, we are unable to get absolute search-volume data. Instead, for each week we get a normalized search interest index, ranging from 0 to 100. The index is normalized against the search region, so it reflects a search "density" rather than a raw count of searches.
This change may make an exact replication impossible.
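As a rough sketch of how such an index behaves (this is an assumed normalization for illustration, not Google's published formula), each week's volume is scaled so that the busiest week in the window maps to 100:
def to_interest_index(weekly_volume):
    # Hypothetical normalization: scale so the peak week maps to 100.
    weekly_volume = np.asarray(weekly_volume, dtype=float)
    return np.round(100 * weekly_volume / weekly_volume.max())
to_interest_index([120, 300, 600, 450])  # -> [20., 50., 100., 75.]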
In [6]:
# Parse the week label from the Google Trends export: each row is labeled
# with a date range, and we keep the end of the range.
def parseWeek(w):
    return w.split(" - ")[1]
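Google Trends weekly exports label each row with a date range running Sunday through Saturday, so parseWeek keeps the Saturday that ends the week (the string below is an illustrative example, not a row from our file):
parseWeek("2014-01-05 - 2014-01-11")  # -> '2014-01-11', the Saturday ending the week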
In this replication, we focus on a single keyword, 'Bitcoin'.
In [7]:
trend = pd.read_csv("./data/trend.csv", converters={0:parseWeek})
trend['Week'] = pd.to_datetime(trend['Week'])
trend.set_index(['Week'], inplace=True)
trend.columns = ['search']
trend.head()
Out[7]:
We use the Bitcoin price index (BPI) data from Coinbase and resample it weekly, with each week ending on Saturday, to match the Google Trends data. That makes every Friday our action day, on which we either buy or sell Bitcoin.
In [8]:
time_format = "%Y-%m-%dT%H:%M:%S"
data = pd.read_csv("./data/price.csv", names=['time', 'price'], index_col='time',
parse_dates=[0], date_parser=lambda x: datetime.datetime.strptime(x, time_format))
bpi = data.resample('w-sat', how='ohlc')
bpi.index.name = 'Week'
bpi = pd.DataFrame(bpi['price']['close'])
bpi.head()
Out[8]:
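A side note for readers on pandas 0.18 or later, where the how= keyword of resample was removed: the equivalent would be the method-chained form below (a sketch, not what this notebook ran).
bpi_modern = data.resample('W-SAT').ohlc()  # 'W-SAT' buckets end on Saturdays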
In [9]:
trend_bpi = pd.merge(trend, bpi, how='right', left_index=True, right_index=True)
trend_bpi.columns = ['search', 'close_price']
trend_bpi = trend_bpi['2012':]
trend_bpi.head()
Out[9]:
In [10]:
# DataFrame.plot creates its own figure, so a bare plt.figure() here would
# only open an empty extra one.
ax = trend_bpi.plot(secondary_y=['close_price'])
Correlation, as given by Pearson's coefficient:
In [11]:
trend_bpi.corr()
Out[11]:
In [12]:
trend_bpi.pct_change().plot()
Out[12]:
Again, the correlation, now on weekly returns:
In [13]:
trend_bpi.pct_change().corr()
Out[13]:
The Pearson correlation coefficient decreases once we examine returns instead of the raw values of the two variables. Does this matter?
We first take the moving average of the search interest index (SII).
In [14]:
delta_t = 3
# Moving average of the search interest index over the past delta_t weeks
# (pd.rolling_mean is the pre-0.18 pandas API).
trend_bpi['rolling_SII'] = pd.rolling_mean(trend_bpi.search, delta_t)
trend_bpi.head()
Out[14]:
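For readers on newer pandas: pd.rolling_mean was deprecated in 0.18 and later removed; the equivalent spelling there would be the .rolling() accessor (a sketch):
trend_bpi['rolling_SII'] = trend_bpi.search.rolling(delta_t).mean()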
We shift the moving average one week forward, as we are trying to predict the BPI based on past search interest.
In [15]:
trend_bpi['rolling_SII_shifted'] = trend_bpi.rolling_SII.shift(1)
trend_bpi.head()
Out[15]:
If this week's search interest is less than the moving average of the interest over the past three weeks (delta_t), people are searching less about Bitcoin, so this is likely the time to buy in; conversely, if people start searching more about Bitcoin this week, it is time to sell.
We generate order data. If the order is 1, we buy BTC that week and sell it the next week. If it is -1, we sell it this week and buy it back the next week.
We assign the order signal by comparing this week's search interest with the rolling mean of the previous three weeks' search interest, as in the sketch below.
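As a toy illustration of the rule with made-up numbers, before applying it to the real data:
search_this_week = 42.0
rolling_mean_prev_3_weeks = 50.0
# Interest fell below its recent average, so the rule says buy (order = 1);
# had it risen to or above the average, the rule would say sell (order = -1).
order = -1 if search_this_week >= rolling_mean_prev_3_weeks else 1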
In [16]:
trend_bpi['order'] = 0
trend_bpi['SII_diff'] = trend_bpi.search - trend_bpi.rolling_SII_shifted
## SII_diff >= 0 => search interest rose this week => we expect the price to fall next week, so sell
trend_bpi.loc[trend_bpi.SII_diff >= 0, 'order'] = -1
## SII_diff < 0 => search interest fell this week => we expect the price to rise next week, so buy
trend_bpi.loc[trend_bpi.SII_diff < 0, 'order'] = 1
trend_bpi.head()
Out[16]:
In [17]:
# Weekly strategy log return: the order sign times the week-over-week log price change.
trend_bpi['log_returns'] = trend_bpi.order * (np.log(trend_bpi.close_price.shift(-1)) -
                                              np.log(trend_bpi.close_price))
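This is order_t * (ln P_{t+1} - ln P_t). A quick check with hypothetical prices: buying (order = 1) at a close of 100 before a rise to 110, and selling (order = -1) at 100 before a fall to 90, both yield positive log returns:
1 * (np.log(110.0) - np.log(100.0))    # ~0.0953: long before a rise
-1 * (np.log(90.0) - np.log(100.0))    # ~0.1054: short before a fall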
In [18]:
trend_bpi.log_returns.head()
Out[18]:
A positive return indicates a gain in that week.
In [19]:
trend_bpi[trend_bpi.log_returns>0].close_price.count()
Out[19]:
A negative return indicates a loss in that week.
In [20]:
trend_bpi[trend_bpi.log_returns<0].close_price.count()
Out[20]:
Plot cumulative returns over time.
In [21]:
(np.exp(trend_bpi.log_returns.cumsum()) - 1).plot()
Out[21]:
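Exponentiating the cumulative sum of log returns gives the compounded gross return, so subtracting 1 yields the overall simple return. A two-week toy check:
r = np.log([1.10, 0.95])          # +10% then -5% weekly returns
np.exp(r.cumsum())[-1] - 1        # ~0.045, i.e. 1.10 * 0.95 - 1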
Plot the cumulative return for 2015 only.
In [22]:
(np.exp(trend_bpi['2015'].log_returns.cumsum()) - 1).plot()
Out[22]:
Plot 2014 only.
In [23]:
(np.exp(trend_bpi['2014'].log_returns.cumsum()) - 1).plot()
Out[23]:
Plot 2013 only.
In [24]:
(np.exp(trend_bpi['2013'].log_returns.cumsum()) - 1).plot()
Out[24]:
This strategy seems to rely heavily on the performance of the earliest periods (e.g. 2013) rather than later ones. If the early periods do well, the overall performance looks very good; with the early periods cut off, the performance is much worse.
Label each week as up (1) or down (-1) by comparing its closing price with that of the previous week.
In [25]:
def trend_label(cur, prev):
    # 1 if the price rose from the previous week, -1 if it fell, 0 if unchanged.
    if cur == prev:
        return 0
    elif cur > prev:
        return 1
    else:
        return -1
trend_bpi['truth'] = np.vectorize(trend_label)(trend_bpi.close_price, trend_bpi.close_price.shift(1))
trend_bpi.head()
Out[25]:
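An equivalent, more idiomatic way to build the same labels is the sign of the first difference. Note that the first week comes out NaN here, whereas the vectorized comparison above labels it -1 (comparisons against NaN are False, so the else branch fires):
labels = np.sign(trend_bpi.close_price.diff())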
In [26]:
trend_bpi.groupby('truth').truth.count()
Out[26]:
Note that the confusion-matrix code below treats "1" (an up week) as the positive class and "-1" (a down week) as the negative class.
In [27]:
trend_bpi.groupby('order').truth.count()
Out[27]:
In [28]:
# Exclude the first delta_t weeks, where the shifted rolling mean is still NaN.
trend_bpi_exclude_init = trend_bpi[delta_t:]
true_prediction = trend_bpi_exclude_init.truth == trend_bpi_exclude_init.order
correct_ratio = trend_bpi_exclude_init[true_prediction].close_price.count() / float(trend_bpi_exclude_init.close_price.count())
print "Correctly predicting trend: %f" % correct_ratio
In [29]:
true_positive = trend_bpi[(trend_bpi.truth==1)&(trend_bpi.order==1)].order.count()
false_negative = trend_bpi[(trend_bpi.truth==1)&(trend_bpi.order==-1)].order.count()
false_positive = trend_bpi[(trend_bpi.truth==-1)&(trend_bpi.order==1)].order.count()
true_negative = trend_bpi[(trend_bpi.truth==-1)&(trend_bpi.order==-1)].order.count()
print "TP: %d, FN: %d, FP: %d, TN: %d" % (true_positive, false_negative, false_positive, true_negative)
In [30]:
tp_rate = float(true_positive) / (true_positive + false_negative)
fp_rate = float(false_positive) / (true_negative + false_positive)
print "TPR: %f, FPR: %f" % (tp_rate, fp_rate)