Sentiment Data vs. Stock Return

Written by Dongchen Zou, Jasmine (Xiaosu) Wei

Introduction

Many studies have documented long-term historical phenomena in securities markets that contradict the Efficient Market Hypothesis (EMH). The EMH states that it is impossible to "beat the market" because market efficiency causes existing share prices to always incorporate and reflect all relevant information.

Behavioral finance attempts to fill this void by proposing psychology-based theories to explain market anomalies. Several recent papers focus on what is called "investor sentiment" -- the propensity of individuals to trade on "noise" and emotions rather than facts. Sentiment leads investors to hold beliefs about future cash flows and investment risks that are not justified by the information at hand.

Warren Buffett once said that as an investor it is wise to be “Fearful when others are greedy and greedy when others are fearful.” This is a somewhat contrarian view of stock markets and relates directly to the price of an asset: when others are greedy, prices typically spike, and one should be cautious so as not to overpay for an asset. When others are fearful, however, it may present a good buying opportunity at an undervalued price. This is the intriguing idea we set out to explore in our project.

Abstract

In this project, we ask ourselves one simple question: is the philosophy that aggregate retail investor sentiment is a contrary indicator of future stock market returns correct? To investigate, we explore the possible relationship between investor sentiment and actual stock returns. Our project uses easily accessible public data to examine whether a negative correlation exists between the two variables. We then test Buffett's famous investment philosophy against our actual results.


In [1]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import sys                      # system module, used to get Python version 
import os                       # operating system tools (check files)
import datetime as dt           # date tools, used to note current date 
import seaborn as sns

# plotly imports
from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)

            

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print("Today's date:", dt.date.today())

%matplotlib inline
plotly.offline.init_notebook_mode()


Python version:  3.5.1 |Anaconda 2.5.0 (x86_64)| (default, Dec  7 2015, 11:24:55) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pandas version:  0.17.1
Today's date: 2016-05-11

Packages Imported

We use pandas, a Python package for fast data manipulation and analysis; in pandas, a dataframe stores related columns of data. We use matplotlib, seaborn, and plotly (via cufflinks) to generate a variety of figures and graphics. We also use the sys module to report the Python version, os (operating system tools) to check files, and datetime to note the current date.

Creating the Data Set

Using the sentiment data from the American Association of Individual Investors (AAII) and the weekly stock return data from the Fama-French Research Data Factors (Weekly), we create two dataframes. The sentiment dataframe contains Date, % Bullish, % Neutral, and % Bearish. The stock return dataframe contains Date, Excess Return, SMB, HML, RF, and a Stock Index. One problem with the sentiment data is that it contains some noise: some rows are not relevant to the project, such as annual summary rows. We therefore clean the data by slicing out the portion we would like to examine. We also notice that the dates in the two datasets are not exactly paired for many rows, so we perform a fuzzy pairing, matching stock return dates to the dates of the sentiment data. We then concatenate the two datasets, which leaves us with a clean side-by-side comparison.


In [2]:
# sentiment data
sentiment = pd.read_excel("http://www.aaii.com/files/surveys/sentiment.xls",skiprows=3);
stm = sentiment[["Date","Bullish","Neutral","Bearish"]]

In [3]:
stm.head()


Out[3]:
Date Bullish Neutral Bearish
0 NaN NaN NaN NaN
1 1987-06-26 00:00:00 NaN NaN NaN
2 1987-07-17 00:00:00 NaN NaN NaN
3 1987-07-24 00:00:00 0.36 0.50 0.14
4 1987-07-31 00:00:00 0.26 0.48 0.26

In [4]:
# We have some noises in the dataset, for example:
stm.tail()


Out[4]:
Date Bullish Neutral Bearish
1668 Count '10 52 52 52
1669 Count '11 52 52 52
1670 Count '12 52 52 52
1671 Count '13 52 52 52
1672 Count '14 123 123 123

In [5]:
# clean sentiment data: keep only rows whose Date entry is an actual timestamp
# (row 3 is a known valid example; summary rows such as "Count '10" are dropped)
k = []

for i in range(len(stm.index)):
    if type(stm["Date"][i]) == type(stm["Date"][3]):
        k.append(i)

stm2 = stm.loc[k].reset_index(drop=True)
stm2["Date"] = pd.to_datetime(stm2["Date"])

In [6]:
stm2.head()


Out[6]:
Date Bullish Neutral Bearish
0 1987-06-26 NaN NaN NaN
1 1987-07-17 NaN NaN NaN
2 1987-07-24 0.36 0.50 0.14
3 1987-07-31 0.26 0.48 0.26
4 1987-08-07 0.56 0.15 0.29

In [7]:
# weekly stock return data
from pandas_datareader.famafrench import get_available_datasets
import pandas_datareader.data as web

get_available_datasets();
r = web.DataReader('F-F_Research_Data_factors_weekly', 'famafrench')[0];
names = ["Excess_Return","SMB","HML","RF"];
r.columns = names

# Slice Stock return data starting from sentiment data's beginning date
start = r.index.searchsorted(stm2["Date"][0]);
rd = r.ix[start:]
r2 = rd.reset_index();

Here, we create a synthetic stock index that starts at 1 and compounds the weekly returns from the beginning date of the dataset. Why are we doing this? We need an index to track the stock price movement and to calculate returns around particular dates, and it also gives us some interesting graphs to show later:


In [8]:
# create a stock index that starts at 1 by compounding weekly total returns
# (excess return plus the risk-free rate, both reported in percent)
w = []
kk = 1
for i in range(len(r2)):
    kk = kk * (1 + r2["Excess_Return"][i]/100 + r2["RF"][i]/100)
    w.append(kk)

r2["Stock_Index"] = pd.Series(w)

In [9]:
iplot_mpl(r2.set_index("Date")["Stock_Index"].plot(figsize=(12,6)).get_figure())
r2.head()


Out[9]:
Date Excess_Return SMB HML RF Stock_Index
0 1987-06-26 -0.04 -0.28 0.10 0.120 1.000800
1 1987-07-02 -0.64 0.66 0.87 0.114 0.995536
2 1987-07-10 0.68 0.11 1.29 0.114 1.003440
3 1987-07-17 1.69 -0.01 0.26 0.114 1.021542
4 1987-07-24 -1.65 0.47 -0.23 0.114 1.005852

The sample data below shows that the dates are not paired exactly, so fuzzy pairing, i.e., matching the dates approximately, is necessary here:


In [10]:
#The dates are not exactly paired for many cells. For example:
print(r2["Date"].loc[1496:1499])
print(stm2["Date"].loc[1493:1496])


1496   2016-02-26
1497   2016-03-04
1498   2016-03-11
1499   2016-03-18
Name: Date, dtype: datetime64[ns]
1493   2016-02-25
1494   2016-03-03
1495   2016-03-10
1496   2016-03-17
Name: Date, dtype: datetime64[ns]

In [11]:
# fuzzy-pair the dates in the two datasets (Caution: the nested loop may run for a while. Please be patient!)
dates = stm2["Date"]
dates2 = r2["Date"]

ii = []   # row positions kept from the sentiment data
jj = []   # matching row positions in the stock return data

for i in range(len(dates)):
    for j in range(i, len(dates2)):

        timediff = dates2[j] - dates[i]

        if abs(timediff.days) < 3:
            ii.append(i)
            jj.append(j)

print(len(ii) == len(jj))     # check that they are paired


True
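
Recent versions of pandas (newer than the one used here) also provide pd.merge_asof for exactly this kind of nearest-date matching; a sketch (unlike the loop above, unmatched sentiment rows come back with NaN return columns rather than being dropped):

In [ ]:
# nearest-date join with a 2-day tolerance; both frames must be sorted by their date key
stm_renamed = stm2.rename(columns={"Date": "Report_Date"}).sort_values("Report_Date")
merged = pd.merge_asof(stm_renamed, r2.sort_values("Date"),
                       left_on="Report_Date", right_on="Date",
                       tolerance=pd.Timedelta("2 days"), direction="nearest")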

In [12]:
# Concatenate two datasets

stm3 = stm2.ix[ii].reset_index(drop=True)
stm3 = stm3.rename(columns={"Date": "Report_Date"})
r3 = r2.ix[jj].reset_index(drop=True)

result = pd.concat([stm3, r3], axis=1)

In [13]:
result.tail()


Out[13]:
Report_Date Bullish Neutral Bearish Date Excess_Return SMB HML RF Stock_Index
1492 2016-02-25 0.311947 0.373894 0.314159 2016-02-26 1.84 1.30 -0.23 0.005 12.162560
1493 2016-03-03 0.320158 0.387352 0.292490 2016-03-04 3.08 1.34 2.44 0.005 12.537775
1494 2016-03-10 0.373576 0.382688 0.243736 2016-03-11 1.00 -0.47 1.33 0.005 12.663779
1495 2016-03-17 0.299587 0.431818 0.268595 2016-03-18 1.42 -0.34 0.12 0.005 12.844238
1496 2016-03-24 0.337808 0.425056 0.237136 2016-03-24 -0.80 -0.95 -0.79 0.005 12.742127

In [14]:
# Select Columns we would like to examine
examine = result[["Date","Bullish","Bearish","Excess_Return","Stock_Index"]].set_index("Date")
examine.head()


Out[14]:
Bullish Bearish Excess_Return Stock_Index
Date
1987-06-26 NaN NaN -0.04 1.000800
1987-07-17 NaN NaN 1.69 1.021542
1987-07-24 0.36 0.14 -1.65 1.005852
1987-07-31 0.26 0.26 2.68 1.033955
1987-08-07 0.56 0.29 1.50 1.050684

Boxplot

The box plot of the sentiment data below shows that the proportion of investors feeling bullish is generally higher than the proportion feeling bearish.


In [15]:
#Boxplot of Sentiment Data
long = pd.melt(examine, value_vars=['Bullish','Bearish'], var_name='Sentiment', value_name='Ratio')
plt.figure(figsize=(6,8))
sns.boxplot(data=long, x="Sentiment", y="Ratio",palette=["g","r"])
plt.xlabel("Sentiment",fontsize=18)
plt.show()
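
As a quick numerical check of that visual impression (a sketch using the same examine dataframe):

In [ ]:
# compare the average bullish and bearish ratios directly
examine[["Bullish", "Bearish"]].mean()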


Kernel Density

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. We can see from our plot below that the Bearish sentiment distribution has a heavier right tail, while the Bullish sentiment distribution looks closer to a normal distribution.


In [16]:
#KDE of Sentiment Data
fig, ax1 = plt.subplots(figsize=(10,6))

sns.kdeplot(examine["Bullish"], ax=ax1, color="g")
sns.kdeplot(examine["Bearish"],ax=ax1, color="r")
ax1.legend()

fig.suptitle("Kernel Density",fontsize=18)
plt.show()
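
The tail behavior can also be checked numerically with pandas' built-in skewness estimate (a quick sketch):

In [ ]:
# positive skew indicates a heavier right tail
examine[["Bullish", "Bearish"]].skew()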


Scatterplot

We use scatterplots to examine the correlation between bullish and bearish sentiment, and we can see that the two are negatively correlated. From the scatterplots of stock return versus the sentiment data, we notice that the variance of excess returns is not constant across sentiment levels. In both cases, however, we see essentially no correlation between the sentiment data and stock returns.


In [17]:
#Scatterplot and Correlation of Bullish versus Bearish Data
plt.figure(figsize=(6,6))
k = sns.regplot(x="Bullish", y="Bearish", data=examine)
note = "Correlation is " + str( round(examine["Bullish"].corr(examine["Bearish"]),3))
k.figure.text(0.4, 0.8, note, fontsize=18, weight="bold")


Out[17]:
<matplotlib.text.Text at 0x117b8d208>

In [18]:
sns.jointplot(x="Bullish", y="Excess_Return", data=examine,color="g")
sns.jointplot(x="Bearish", y="Excess_Return", data=examine,color="r")
plt.show()
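
The pairwise correlations behind these plots can also be printed directly (a sketch with the same columns):

In [ ]:
# correlation matrix of the sentiment ratios and weekly excess returns
examine[["Bullish", "Bearish", "Excess_Return"]].corr()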


Regression

The regression results show that the coefficients on the bullish and bearish sentiment data are not significantly different from zero, and the model barely explains the variation in weekly stock returns, with an R-squared of 0.006. This is not surprising: if the relationship were strong, intelligent players would have earned a great deal of money simply by tracking investor sentiment, and the pattern would have been traded away.


In [19]:
# Regression of Excess Return on Bullish and Bearish Sentiment Data
import statsmodels.formula.api as sm
reg = sm.ols(formula="Excess_Return ~ Bullish + Bearish", data=examine).fit()
reg.summary()


Out[19]:
OLS Regression Results
Dep. Variable: Excess_Return R-squared: 0.006
Model: OLS Adj. R-squared: 0.005
Method: Least Squares F-statistic: 4.427
Date: Wed, 11 May 2016 Prob (F-statistic): 0.0121
Time: 18:12:53 Log-Likelihood: -3389.0
No. Observations: 1494 AIC: 6784.
Df Residuals: 1491 BIC: 6800.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.2776 0.485 0.572 0.567 -0.674 1.229
Bullish 0.6577 0.756 0.870 0.384 -0.825 2.140
Bearish -1.3237 0.795 -1.665 0.096 -2.883 0.236
Omnibus: 264.436 Durbin-Watson: 2.122
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2224.995
Skew: -0.570 Prob(JB): 0.00
Kurtosis: 8.869 Cond. No. 20.3
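
Weekly returns often exhibit serial correlation, so as a robustness check one might re-fit the same regression with Newey-West (HAC) standard errors; a sketch (the choice of maxlags=4, roughly one month of weekly data, is ours):

In [ ]:
# same regression with heteroskedasticity- and autocorrelation-consistent standard errors
reg_hac = sm.ols(formula="Excess_Return ~ Bullish + Bearish",
                 data=examine).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
reg_hac.summary()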

Outliers

Although the regression result does not tell us anything useful about the relationship between investor sentiment and stock returns, we may still be interested in whether an extremely bullish or bearish data point tells us something about market timing. In other words, does an extremely bullish reading imply that the market is about to peak, and does an extremely bearish reading suggest that the market is at a bottom? To check, we first locate the very bullish and very bearish data points that lie more than 3.5 standard deviations above the mean. Then we use the synthetic stock index generated earlier to see what happened to the stock market in the year after a super-bullish or super-bearish point is detected.


In [23]:
#Define outliers as more than 3.5 standard deviations above the mean. Subject to change.
toobull = examine[(examine.Bullish - examine.Bullish.mean())>=(3.5*examine.Bullish.std())].reset_index();
toobull


Out[23]:
Date Bullish Bearish Excess_Return Stock_Index
0 2000-01-07 0.75 0.1333 -2.49 6.298392

In [24]:
#Define outliers as more than 3.5 standard deviations above the mean. Subject to change.
toobear = examine[(examine.Bearish - examine.Bearish.mean())>=(3.5*examine.Bearish.std())].reset_index();
toobear


Out[24]:
Date Bullish Bearish Excess_Return Stock_Index
0 1990-10-19 0.1300 0.6700 3.44 1.070872
1 2009-03-06 0.1892 0.7027 -7.03 3.694305
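
Both cells apply the same rule, so the threshold is easy to vary by wrapping it in a small helper; a sketch (the function name extreme_points is ours):

In [ ]:
# rows where a sentiment ratio lies more than z standard deviations above its mean
def extreme_points(df, column, z=3.5):
    return df[(df[column] - df[column].mean()) >= z * df[column].std()].reset_index()

toobull = extreme_points(examine, "Bullish")
toobear = extreme_points(examine, "Bearish")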

In [22]:
for i in range(len(toobull)):
    num = examine.index.get_loc(toobull["Date"][i]);
    num2 = num - 52;
    num3 = num + 53;
    
    toobull_index_before = examine["Stock_Index"][num2:num+1]      #Start from the previous 52nd week to this week
    toobull_index = examine["Stock_Index"][num:num3]               #Start from this week to the next 52nd week
    Annual_Return = str(round((toobull_index[len(toobull_index)-1] -
                               toobull_index[0]) *100 / toobull_index[0],2)) + "%";
    
    plt.figure(figsize=(12,6))
    plt.plot(toobull_index_before,color="b",alpha=0.2)
    plt.plot(toobull_index,color="r")
    plt.ylabel("Stock Index")
    plt.title("Annual Return: "+Annual_Return, fontsize=20, loc="left", weight="bold")
    plt.annotate("Sell",
                 xy=(toobull["Date"][i], toobull["Stock_Index"][i]+0.04),
                 xytext=(toobull["Date"][i], toobull["Stock_Index"][i]+0.2),
                 fontsize=12, 
                 weight="bold", 
                 arrowprops=dict(facecolor='red', shrink=0.05))

    
for i in range(len(toobear)):
    num = examine.index.get_loc(toobear["Date"][i]);
    num2 = num - 52;
    num3 = num + 53;
    
    toobear_index_before = examine["Stock_Index"][num2:num+1]      #Start from the previous 52nd week to this week
    toobear_index = examine["Stock_Index"][num:num3]               #Start from this week to the next 52nd week
    Annual_Return = str(round((toobear_index[len(toobear_index)-1] -
                               toobear_index[0]) *100 / toobear_index[0],2)) + "%";
    
    plt.figure(figsize=(12,6))
    plt.plot(toobear_index_before,color="b",alpha=0.2)
    plt.plot(toobear_index,color="g")
    plt.ylabel("Stock Index")
    plt.title("Annual Return: "+Annual_Return, fontsize=20, loc="left", weight="bold")
    plt.annotate("Buy",
                 xy=(toobear["Date"][i], toobear["Stock_Index"][i]-0.02),
                 xytext=(toobear["Date"][i], toobear["Stock_Index"][i]-0.06),
                 fontsize=12, 
                 weight="bold", 
                 arrowprops=dict(facecolor='green', shrink=0.05))


The results reveal a very interesting market-timing pattern. All three data points look like turning points in the stock index movement. For example, when a highly bullish atmosphere was detected in January 2000, the dot-com bubble was about to collapse; and when everyone was extremely pessimistic about the financial market in March 2009, a multi-year bull market was about to begin. This experimental design is certainly not perfect. First, only a handful of extreme observations exist in the dataset, so there is not sufficient evidence to justify these market-timing patterns. Second, the experiment is retrospective rather than prospective: we used standard deviations computed over the full sample and looked back at the results, rather than using only information available at the time. Nevertheless, these outliers could still be indicative when evaluating market inflection points.

Conclusion

We believe that Buffett's saying is somewhat informative for investors: "be fearful" when the market feels extremely bullish and "be greedy" when the market is far too pessimistic. Our findings from historical data suggest that when an extreme outlier appears in the bullish or bearish sentiment data, its date may mark a point at which the market changes direction. We could extend the study by connecting the sentiment data with macroeconomic factors to gain better predictive power in marking the turning points of the stock market.

Data Sources

"AAII | AAII Investor Sentiment Data." AAII | AAII Investor Sentiment Data. Quandl, n.d. Web. 4 May 2016.

"Efficient Market Hypothesis (EMH) Definition | Investopedia." Investopedia. N.p., 18 Nov. 2003. Web. 2 May 2016.

"How Investor Sentiment Affects Returns." CBSNews. CBS Interactive, n.d. Web. 2 May 2016.

"KFRENCH | Fama/French Factors (Weekly)." KFRENCH | Fama/French Factors (Weekly). Quandl, n.d. Web. 3 May 2016.