Loading and Cleaning the Data

Turn on inline matplotlib plotting and import plotting dependencies.


In [ ]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

Import analytic dependencies. See the documentation for spark-timeseries and the source code for tsanalysis.


In [ ]:
import numpy as np
import pandas as pd
import tsanalysis.loaddata as ld
import tsanalysis.tsutil as tsutil
import sparkts.timeseriesrdd as tsrdd
import sparkts.datetimeindex as dtindex
from sklearn import linear_model

Load wiki page view and stock price data into Spark DataFrames.

wiki_obs is a Spark DataFrame of (timestamp, page, views) with types (Timestamp, String, Double). ticker_obs is a Spark DataFrame of (timestamp, symbol, price) with types (Timestamp, String, Double).


In [ ]:
wiki_obs = ld.load_wiki_df(sqlCtx, '/user/srowen/wiki.tsv')
ticker_obs = ld.load_ticker_df(sqlCtx, '/user/srowen/ticker.tsv')

Display the first 5 elements of the wiki_obs DataFrame.

wiki_obs contains Row objects with the fields (timestamp, page, views).


In [ ]:
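# A minimal sketch using the standard pyspark DataFrame API.
for row in wiki_obs.take(5):
    print row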

Display the first 5 elements of the ticker_obs DataFrame.

ticker_obs contains Row objects with the fields (timestamp, symbol, price).


In [ ]:
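for row in ticker_obs.take(5):
    print row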

Create datetime index.

Create time series RDD from observations and index. Remove time instants with NaNs.

Cache the tsrdd.

Examine the first element in the RDD.

Time series have values and a datetime index. We can create a tsrdd of hourly stock prices from an index and a Spark DataFrame of observations. ticker_tsrdd is an RDD of tuples of the form (ticker symbol, stock prices), where the ticker symbol is a string and the stock prices are a 1D np.ndarray. We create a nicely formatted string representation of this pair in print_ticker_info(). Notice how we access the two elements of the tuple.


In [ ]:
def print_ticker_info(ticker):
    # ticker is a (symbol, series) tuple.
    symbol, series = ticker
    print ('The first ticker symbol is: {}\nThe first 20 elements of the '
           'associated series are:\n {}').format(symbol, series[:20])

In [ ]:
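# A sketch assuming the spark-ts Python API described in its docs: uniform(),
# HourFrequency, and time_series_rdd_from_observations(). The date range is an
# assumption, and sc is the notebook's SparkContext.
freq = dtindex.HourFrequency(1, sc)
index = dtindex.uniform(start='2015-08-14T00:00-07:00',
                        end='2015-09-25T00:00-07:00', freq=freq, sc=sc)
ticker_tsrdd = tsrdd.time_series_rdd_from_observations(
    index, ticker_obs, 'timestamp', 'symbol', 'price')
ticker_tsrdd = ticker_tsrdd.remove_instants_with_nans()
ticker_tsrdd.cache()
print_ticker_info(ticker_tsrdd.first())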

Create a wiki page view tsrdd and set the index to match the index of ticker_tsrdd.

Linearly interpolate to impute missing values.

wiki_tsrdd is an RDD of tuples of the form (page title, wiki views), where the page title is a string and the wiki views are a 1D np.ndarray. We cache both RDDs because we will be doing many subsequent operations on them.


In [ ]:
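# Assumes the same spark-ts API as above: with_index() aligns the wiki series
# to the ticker index, and fill('linear') interpolates the missing values.
wiki_tsrdd = tsrdd.time_series_rdd_from_observations(
    index, wiki_obs, 'timestamp', 'page', 'views')
wiki_tsrdd = wiki_tsrdd.with_index(ticker_tsrdd.index()).fill('linear')
wiki_tsrdd.cache()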

Filter out symbols with more than the minimum number of NaNs.

Then filter out instants with NaNs.


In [ ]:
def count_nans(vec):
    return np.count_nonzero(np.isnan(vec))

In [ ]:
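# Keep only the symbols with the fewest NaNs, then drop any instants that
# still contain NaNs. A sketch: depending on the bindings, filter() may return
# a plain RDD of (symbol, series) pairs rather than a TimeSeriesRDD.
min_nans = ticker_tsrdd.map(lambda kv: count_nans(kv[1])).min()
ticker_tsrdd = ticker_tsrdd.filter(lambda kv: count_nans(kv[1]) <= min_nans)
ticker_tsrdd = ticker_tsrdd.remove_instants_with_nans()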

Linking symbols and pages

We need to join together the wiki page and ticker data, but the time series RDDs are not directly joinable on their keys. To overcome this, we create a dict from Wikipedia page title to stock ticker symbol.

Create a dict from ticker symbols to page names.

Create another from page names to ticker symbols.


In [ ]:
# a dict from wiki page name to ticker symbol
page_symbols = {}
for line in open('../symbolnames.tsv').readlines():
    tokens = line[:-1].split('\t')
    page_symbols[tokens[1]] = tokens[0]

def get_page_symbol(page_series):
    if page_series[0] in page_symbols:
        return [(page_symbols[page_series[0]], page_series[1])]
    else:
        return []
# reverse keys and values. a dict from ticker symbol to wiki page name.
symbol_pages = dict(zip(page_symbols.values(), page_symbols.keys()))
print page_symbols.items()[0]
print symbol_pages.items()[0]

Join together wiki_tsrdd and ticker_tsrdd

First, we use this dict to look up the corresponding stock ticker symbol and rekey the wiki page view time series. We then join the data sets together. The result is an RDD of tuples where each element is of the form (ticker_symbol, (wiki_series, ticker_series)). We count the number of elements in the resulting RDD to see how many matches we have.


In [ ]:
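# Rekey each wiki series by its ticker symbol (pages with no known symbol are
# dropped by get_page_symbol), then join against the ticker series.
joined = wiki_tsrdd.flatMap(get_page_symbol).join(ticker_tsrdd)
joined.cache()
print joined.count()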

Correlation and Relationships

Define a function for computing the Pearson r correlation of the stock price and wiki page traffic associated with a company.

Here we look up a specific stock and its corresponding wiki page, and provide an example of computing the Pearson correlation locally. We use scipy.stats.stats.pearsonr to compute the Pearson correlation and the corresponding two-sided p-value. wiki_vol_corr and corr_with_offset both return this as a tuple of (corr, p_value).


In [ ]:
from scipy.stats.stats import pearsonr

def wiki_vol_corr(page_key):
    # lookup individual time series by key.
    ticker = ticker_tsrdd.find_series(page_symbols[page_key]) # numpy array
    wiki = wiki_tsrdd.find_series(page_key) # numpy array
    return pearsonr(ticker, wiki)

def corr_with_offset(page_key, offset):
    """offset is a positive integer that describes how many time intervals we
    have slid the wiki series ahead of the ticker series."""
    ticker = ticker_tsrdd.find_series(page_symbols[page_key]) # numpy array
    wiki = wiki_tsrdd.find_series(page_key) # numpy array
    # Note: wiki[:-offset] would be empty for offset == 0; use wiki_vol_corr
    # for the unshifted case.
    return pearsonr(ticker[offset:], wiki[:-offset])

In [ ]:
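# Example usage; 'Apple Inc.' is an assumed page title -- any key present in
# page_symbols works.
corr, p_value = wiki_vol_corr('Apple Inc.')
print 'correlation: {}, two-sided p-value: {}'.format(corr, p_value)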

Create a plot of the joint distribution of wiki traffic and stock prices for a specific company using seaborn's jointplot function.


In [ ]:
def joint_plot(page_key, ticker, wiki, offset=0):
    with sns.axes_style("white"):
        sns.jointplot(x=ticker, y=wiki, kind="kde", color="b")
        plt.xlabel('Stock Price')
        plt.ylabel('Wikipedia Page Views')
        # The parentheses ensure .format applies to the whole title string,
        # not just the last fragment.
        plt.title(('Joint distribution of {} stock price\nand Wikipedia page '
                   'views, with a {} day offset').format(page_key, offset), y=1.20)

In [ ]:
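# Joint plot for the same assumed page title as above.
page_key = 'Apple Inc.'  # assumed example key
joint_plot(page_key,
           ticker_tsrdd.find_series(page_symbols[page_key]),
           wiki_tsrdd.find_series(page_key))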

Find the companies with the highest correlation between their stock price time series and Wikipedia page traffic.

Note that Python compares tuples lexicographically, so a (corr, p_value) pair sorts by correlation first and p-value second.


In [ ]:
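# Compute a (corr, p_value) tuple per symbol. Each value in joined is a
# (wiki_series, ticker_series) pair, and pearsonr takes (ticker, wiki) to
# match wiki_vol_corr above.
corrs = joined.mapValues(lambda pair: pearsonr(pair[1], pair[0]))
corrs.cache()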

Add filtering to discard the less useful correlation results.

A lot of invalid (NaN) correlations get computed, so let's filter those out.


In [ ]:
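# Drop NaN correlations (produced, for example, by constant series).
corrs = corrs.filter(lambda kv: not np.isnan(kv[1][0]))
print corrs.count()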

Find the top 10 correlations as defined by the ordering on tuples.


In [ ]:
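# top() uses the natural (lexicographic) ordering on the (corr, p_value)
# value tuples.
top_corrs = corrs.top(10, key=lambda kv: kv[1])
for symbol, (corr, p_value) in top_corrs:
    print '{}: corr={}, p-value={}'.format(symbol, corr, p_value)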

Create a joint plot of some of the stronger relationships.


In [ ]:
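# Joint plots for a few of the strongest correlations found above.
for symbol, _ in top_corrs[:3]:
    page_key = symbol_pages[symbol]
    joint_plot(page_key,
               ticker_tsrdd.find_series(symbol),
               wiki_tsrdd.find_series(page_key))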

Volatility

Compute per-day volatility for each symbol.


In [ ]:
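# A sketch: daily volatility as the standard deviation of the hourly returns
# within each calendar day. Reshaping by 24 is an assumption about the hourly
# index; the original may use a tsanalysis helper instead.
def daily_volatility(prices):
    hours_per_day = 24
    days = len(prices) // hours_per_day
    daily = prices[:days * hours_per_day].reshape(days, hours_per_day)
    returns = np.diff(daily, axis=1) / daily[:, :-1]
    return np.nanstd(returns, axis=1)

ticker_daily_vol = ticker_tsrdd.mapValues(daily_volatility)
ticker_daily_vol.cache()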

Make sure we don't have any NaNs.


In [ ]:
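assert ticker_daily_vol.map(lambda kv: count_nans(kv[1])).sum() == 0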

Visualize volatility

Plot daily volatility in stocks over time.


In [ ]:
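# Plot a handful of volatility series; the sample of 10 symbols is arbitrary.
plt.figure(figsize=(12, 6))
for symbol, vol in ticker_daily_vol.take(10):
    plt.plot(vol, label=symbol)
plt.xlabel('Day')
plt.ylabel('Daily volatility')
plt.legend(loc='upper left')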

What does the distribution of volatility for the whole market look like? Sum the volatility of the individual stocks within each datetime bin.


In [ ]:
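# Sum the volatility of all symbols within each daily bin.
market_vol = ticker_daily_vol.values().reduce(np.add)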


In [ ]:
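# distplot was the current seaborn API for univariate distributions at the
# time of writing.
sns.distplot(market_vol)
plt.xlabel('Total daily volatility across symbols')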

Find stocks with the highest average daily volatility.


In [ ]:
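top_vol = ticker_daily_vol.takeOrdered(10, key=lambda kv: -np.mean(kv[1]))
print [symbol for symbol, _ in top_vol]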

Plot stocks with the highest average daily volatility over time.


In [ ]:
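plt.figure(figsize=(12, 6))
for symbol, vol in top_vol:
    plt.plot(vol, label=symbol)
plt.xlabel('Day')
plt.ylabel('Daily volatility')
plt.legend(loc='upper left')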

We first map over ticker_daily_vol to find the index of the value with the highest volatility. We then relate that back to the index set on the RDD to find the corresponding datetime.


In [ ]:
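# np.argmax finds the position of each symbol's most volatile day, and
# datetime_at_loc maps it back to a date. daily_index is an assumed daily
# DateTimeIndex covering the same span as the hourly index above.
daily_index = dtindex.uniform(start='2015-08-14T00:00-07:00',
                              end='2015-09-25T00:00-07:00',
                              freq=dtindex.DayFrequency(1, sc), sc=sc)
peak_dates = ticker_daily_vol.mapValues(
    lambda vol: daily_index.datetime_at_loc(np.argmax(vol)))
print peak_dates.take(10)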

A large number of stock symbols had their most volatile days on August 24th and August 25th of 2015.

Regress volatility against page views

Resample the wiki page view data set so we have total pageviews by day.

Cache the wiki page view RDD.

Resampling means reindexing the time series and aggregating the data into daily buckets. We use np.nansum to add up values while treating NaNs as zero.


In [ ]:
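# Rebucket the hourly wiki series into daily totals with np.nansum. As with
# daily_volatility above, the reshape-by-24 bucketing is an assumption about
# the hourly index.
def daily_totals(views):
    hours_per_day = 24
    days = len(views) // hours_per_day
    return np.nansum(views[:days * hours_per_day].reshape(days, hours_per_day),
                     axis=1)

wiki_daily_views = wiki_tsrdd.mapValues(daily_totals)
wiki_daily_views.cache()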

Validate the data by checking for NaNs.


In [ ]:
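print wiki_daily_views.map(lambda kv: count_nans(kv[1])).sum()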


In [ ]:
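print ticker_daily_vol.map(lambda kv: count_nans(kv[1])).sum()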

Fit a linear regression model to every pair in the joined wiki-ticker RDD and extract R^2 scores.


In [ ]:
def regress(X, y):
    # Fit an ordinary least squares model and score it (R^2) on the training data.
    model = linear_model.LinearRegression()
    model.fit(X, y)
    score = model.score(X, y)
    return (score, model)

lag = 2
lead = 2

# Rekey the daily wiki series by ticker symbol and join on daily volatility.
joined = wiki_daily_views.flatMap(get_page_symbol) \
    .join(ticker_daily_vol)

# Build lead/lag features from the wiki series (x[0]) and regress the
# volatility series (x[1]), trimmed so the two line up.
models = joined.mapValues(lambda x: regress(tsutil.lead_and_lag(lead, lag, x[0]), x[1][lag:-lead]))
models.cache()
models.count()

Print out the symbols with the highest R^2 scores.


In [ ]:
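# Each value in models is a (score, model) pair from regress().
best_models = models.takeOrdered(10, key=lambda kv: -kv[1][0])
for symbol, (score, model) in best_models:
    print '{}: R^2 = {}'.format(symbol, score)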

Plot the results of a linear model.

Plotting a linear model always helps me understand it better. Again, seaborn is super useful with smart defaults built in.


In [ ]:
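# A regplot sketch for the best-scoring symbol: daily page views against
# daily volatility, trimmed to the window the regression used. This shows the
# contemporaneous relationship rather than the full lead/lag feature set.
symbol = best_models[0][0]
wiki_series, vol_series = joined.lookup(symbol)[0]
sns.regplot(x=wiki_series[lag:-lead], y=vol_series[lag:-lead])
plt.xlabel('Daily Wikipedia page views')
plt.ylabel('Daily volatility')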

Box plot / Tukey outlier identification

Tukey originally proposed a method for identifying outliers in box-and-whisker plots. Essentially, we find the 75th percentile of the sample, $P_{75} = \mathrm{percentile}(sample, 75)$, and add a reasonable buffer, expressed in terms of the interquartile range, of $1.5 \cdot IQR = 1.5 \cdot (P_{75} - P_{25})$. Values above the cutoff $P_{75} + 1.5 \cdot IQR$ are treated as outliers.

Write a function that returns the high value cutoff for Tukey's boxplot outlier criterion.


In [ ]:
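def tukey_high_cutoff(sample):
    # High-side cutoff: P75 + 1.5 * IQR.
    p25, p75 = np.percentile(sample, [25, 75])
    return p75 + 1.5 * (p75 - p25)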

Filter out any values below Tukey's boxplot cutoff, keeping only the unusually high values.


In [ ]:
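# Applied here to the R^2 scores; keeping only the scores above the Tukey
# cutoff isolates the unusually strong fits. Applying it to R^2 (rather than,
# say, volatility) is an assumption.
scores = models.mapValues(lambda pair: pair[0])
cutoff = tukey_high_cutoff(scores.values().collect())
high_scores = scores.filter(lambda kv: kv[1] > cutoff)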


In [ ]:
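print high_scores.count()
print high_scores.takeOrdered(10, key=lambda kv: -kv[1])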

Black Monday

Select the date range comprising Black Monday 2015.


In [ ]:
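# Datetime-string slicing of a TimeSeriesRDD follows the spark-ts docs; the
# timezone offset is an assumption.
black_monday = ticker_tsrdd['2015-08-24T00:00-07:00':'2015-08-25T00:00-07:00']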

Which stocks saw the worst return for that day?


In [ ]:
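def day_return(prices):
    # Close-over-open return for the sliced day.
    return (prices[-1] - prices[0]) / prices[0]

worst = black_monday.mapValues(day_return).takeOrdered(10, key=lambda kv: kv[1])
print worst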

Plot wiki page views for one of those stocks


In [ ]:
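symbol, _ = worst[0]
page_key = symbol_pages[symbol]
plt.plot(wiki_tsrdd.find_series(page_key))
plt.xlabel('Hour')
plt.ylabel('Page views')
plt.title('Wikipedia page views for {}'.format(page_key))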