Sentiment Data vs. Stock Return

Written by Dongchen Zou, Jasmine (Xiaosu) Wei

Introduction

Many studies have documented long-term historical phenomena in securities markets that contradict the Efficient Market Hypothesis (EMH). The EMH states that it is impossible to "beat the market" because market efficiency causes existing share prices to always incorporate and reflect all relevant information.

Behavioral finance attempts to fill this void by proposing psychology-based theories to explain market anomalies. Several recent papers focus on what is called "investor sentiment" -- the propensity of individuals to trade on "noise" and emotions rather than facts. Sentiment leads investors to hold beliefs about future cash flows and investment risks that are not justified by the information at hand.

Warren Buffett once said that as an investor it is wise to be “Fearful when others are greedy and greedy when others are fearful.” This is a somewhat contrarian view of stock markets and relates directly to the price of an asset: when others are greedy, prices typically spike, and one should be cautious so as not to overpay for an asset. When others are fearful, however, it may present a good buying opportunity at an undervalued price. This is the intriguing idea we set out to explore in our project.

Abstract

In this project, we ask ourselves one simple question: is the philosophy that aggregate retail investor sentiment is a contrary indicator of future stock market returns correct? To investigate, we explore the possible relationship between investor sentiment and actual stock returns. Our project uses easily accessible public data to examine whether a negative correlation exists between the two variables. We then test Buffett's famous investment philosophy against our actual results.


In [1]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import sys                      # system module, used to get Python version 
import os                       # operating system tools (check files)
import datetime as dt           # date tools, used to note current date 
import seaborn as sns

# plotly imports
from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)

            

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print("Today's date:", dt.date.today())

%matplotlib inline
plotly.offline.init_notebook_mode()


Python version:  3.5.1 |Anaconda 2.5.0 (x86_64)| (default, Dec  7 2015, 11:24:55) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pandas version:  0.17.1
Today's date: 2016-05-11

Packages Imported

We use pandas, a Python package for fast data manipulation and analysis; in pandas, a dataframe stores related columns of data. We use matplotlib, seaborn, and plotly (via cufflinks) to generate a variety of figures and graphics. We also use the sys module to report the Python version, os (operating system tools) to check files, and datetime to note the current date.

Creating the Data Set

Using the sentiment data from the American Association of Individual Investors (AAII) and the weekly stock return data from the Fama-French Research Data Factors (Weekly), we create two dataframes. The sentiment dataframe contains Date, % Bullish, % Neutral, and % Bearish. The stock return dataframe contains Date, Excess Return, SMB, HML, RF, and a Stock Index. One problem with the sentiment data is that it contains some noise: some rows are not relevant to the project, such as annual summary rows. We therefore clean the data by slicing out the portion we would like to examine. We also notice that the dates in the two datasets are not exactly paired for many rows, so we perform a fuzzy pairing, matching stock return dates to the dates of the sentiment data. We then concatenate the two datasets, which leaves us with a clean side-by-side comparison.


In [2]:
# sentiment data
sentiment = pd.read_excel("http://www.aaii.com/files/surveys/sentiment.xls",skiprows=3);
stm = sentiment[["Date","Bullish","Neutral","Bearish"]]

In [3]:
stm.head()


Out[3]:
Date Bullish Neutral Bearish
0 NaN NaN NaN NaN
1 1987-06-26 00:00:00 NaN NaN NaN
2 1987-07-17 00:00:00 NaN NaN NaN
3 1987-07-24 00:00:00 0.36 0.50 0.14
4 1987-07-31 00:00:00 0.26 0.48 0.26

In [4]:
# We have some noises in the dataset, for example:
stm.tail()


Out[4]:
Date Bullish Neutral Bearish
1668 Count '10 52 52 52
1669 Count '11 52 52 52
1670 Count '12 52 52 52
1671 Count '13 52 52 52
1672 Count '14 123 123 123

In [5]:
# clean sentiment data: keep only rows whose Date entry is an actual timestamp
# (row 3 is a known valid example; summary rows such as "Count '10" are dropped)
k = []

for i in range(len(stm.index)):
    if type(stm["Date"][i]) == type(stm["Date"][3]):
        k.append(i)

stm2 = stm.loc[k].reset_index(drop=True)
stm2["Date"] = pd.to_datetime(stm2["Date"])

In [6]:
stm2.head()


Out[6]:
Date Bullish Neutral Bearish
0 1987-06-26 NaN NaN NaN
1 1987-07-17 NaN NaN NaN
2 1987-07-24 0.36 0.50 0.14
3 1987-07-31 0.26 0.48 0.26
4 1987-08-07 0.56 0.15 0.29

In [7]:
# weekly stock return data
from pandas_datareader.famafrench import get_available_datasets
import pandas_datareader.data as web

get_available_datasets();
r = web.DataReader('F-F_Research_Data_factors_weekly', 'famafrench')[0];
names = ["Excess_Return","SMB","HML","RF"];
r.columns = names

# Slice Stock return data starting from sentiment data's beginning date
start = r.index.searchsorted(stm2["Date"][0]);
rd = r.ix[start:]
r2 = rd.reset_index();

Here, we create a synthetic stock index that starts at 1 and compounds the weekly returns from the beginning date of the dataset. Why are we doing this? We need an index to track the stock price movement and to calculate returns around particular dates, and it also gives us some interesting graphs to show later:


In [8]:
# create a stock index that starts at 1 by compounding weekly total returns
# (excess return plus the risk-free rate, both reported in percent)
w = []
kk = 1
for i in range(len(r2)):
    kk = kk * (1 + r2["Excess_Return"][i]/100 + r2["RF"][i]/100)
    w.append(kk)

r2["Stock_Index"] = pd.Series(w)

In [9]:
iplot_mpl(r2.set_index("Date")["Stock_Index"].plot(figsize=(12,6)).get_figure())
r2.head()


Out[9]:
Date Excess_Return SMB HML RF Stock_Index
0 1987-06-26 -0.04 -0.28 0.10 0.120 1.000800
1 1987-07-02 -0.64 0.66 0.87 0.114 0.995536
2 1987-07-10 0.68 0.11 1.29 0.114 1.003440
3 1987-07-17 1.69 -0.01 0.26 0.114 1.021542
4 1987-07-24 -1.65 0.47 -0.23 0.114 1.005852

The sample data below shows that the dates are not paired exactly, so fuzzy pairing, i.e., matching the dates approximately, is necessary here:


In [10]:
#The dates are not exactly paired for many cells. For example:
print(r2["Date"].loc[1496:1499])
print(stm2["Date"].loc[1493:1496])


1496   2016-02-26
1497   2016-03-04
1498   2016-03-11
1499   2016-03-18
Name: Date, dtype: datetime64[ns]
1493   2016-02-25
1494   2016-03-03
1495   2016-03-10
1496   2016-03-17
Name: Date, dtype: datetime64[ns]

In [11]:
# fuzzy-pair the dates in the two datasets (Caution: the nested loop may run for a while. Please be patient!)
dates = stm2["Date"]
dates2 = r2["Date"]

ii = []   # row positions kept from the sentiment data
jj = []   # matching row positions in the stock return data

for i in range(len(dates)):
    for j in range(i, len(dates2)):

        timediff = dates2[j] - dates[i]

        if abs(timediff.days) < 3:
            ii.append(i)
            jj.append(j)

print(len(ii) == len(jj))     # check that they are paired


True
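
Recent versions of pandas (newer than the one used here) also provide pd.merge_asof for exactly this kind of nearest-date matching; a sketch (unlike the loop above, unmatched sentiment rows come back with NaN return columns rather than being dropped):

In [ ]:
# nearest-date join with a 2-day tolerance; both frames must be sorted by their date key
stm_renamed = stm2.rename(columns={"Date": "Report_Date"}).sort_values("Report_Date")
merged = pd.merge_asof(stm_renamed, r2.sort_values("Date"),
                       left_on="Report_Date", right_on="Date",
                       tolerance=pd.Timedelta("2 days"), direction="nearest")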

In [12]:
# Concatenate two datasets

stm3 = stm2.ix[ii].reset_index(drop=True)
stm3 = stm3.rename(columns={"Date": "Report_Date"})
r3 = r2.ix[jj].reset_index(drop=True)

result = pd.concat([stm3, r3], axis=1)

In [13]:
result.tail()


Out[13]:
Report_Date Bullish Neutral Bearish Date Excess_Return SMB HML RF Stock_Index
1492 2016-02-25 0.311947 0.373894 0.314159 2016-02-26 1.84 1.30 -0.23 0.005 12.162560
1493 2016-03-03 0.320158 0.387352 0.292490 2016-03-04 3.08 1.34 2.44 0.005 12.537775
1494 2016-03-10 0.373576 0.382688 0.243736 2016-03-11 1.00 -0.47 1.33 0.005 12.663779
1495 2016-03-17 0.299587 0.431818 0.268595 2016-03-18 1.42 -0.34 0.12 0.005 12.844238
1496 2016-03-24 0.337808 0.425056 0.237136 2016-03-24 -0.80 -0.95 -0.79 0.005 12.742127

In [14]:
# Select Columns we would like to examine
examine = result[["Date","Bullish","Bearish","Excess_Return","Stock_Index"]].set_index("Date")
examine.head()


Out[14]:
Bullish Bearish Excess_Return Stock_Index
Date
1987-06-26 NaN NaN -0.04 1.000800
1987-07-17 NaN NaN 1.69 1.021542
1987-07-24 0.36 0.14 -1.65 1.005852
1987-07-31 0.26 0.26 2.68 1.033955
1987-08-07 0.56 0.29 1.50 1.050684

Boxplot

The box plot of the sentiment data below shows that the proportion of investors feeling bullish is generally higher than the proportion feeling bearish.


In [15]:
#Boxplot of Sentiment Data
long = pd.melt(examine, value_vars=['Bullish','Bearish'], var_name='Sentiment', value_name='Ratio')
plt.figure(figsize=(6,8))
sns.boxplot(data=long, x="Sentiment", y="Ratio",palette=["g","r"])
plt.xlabel("Sentiment",fontsize=18)
plt.show()
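
As a quick numerical check of that visual impression (a sketch using the same examine dataframe):

In [ ]:
# compare the average bullish and bearish ratios directly
examine[["Bullish", "Bearish"]].mean()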


Kernel Density

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. We can see from our plot below that the Bearish sentiment distribution has a heavier right tail, while the Bullish sentiment distribution looks closer to a normal distribution.


In [16]:
#KDE of Sentiment Data
fig, ax1 = plt.subplots(figsize=(10,6))

sns.kdeplot(examine["Bullish"], ax=ax1, color="g")
sns.kdeplot(examine["Bearish"],ax=ax1, color="r")
ax1.legend()

fig.suptitle("Kernel Density",fontsize=18)
plt.show()
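
The tail behavior can also be checked numerically with pandas' built-in skewness estimate (a quick sketch):

In [ ]:
# positive skew indicates a heavier right tail
examine[["Bullish", "Bearish"]].skew()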


Scatterplot

We use scatterplots to examine the correlation between bullish and bearish sentiment, and we can see that the two are negatively correlated. From the scatterplots of stock return versus the sentiment data, we notice that the variance of excess returns is not constant across sentiment levels. In both cases, however, we see essentially no correlation between the sentiment data and stock returns.


In [17]:
#Scatterplot and Correlation of Bullish versus Bearish Data
plt.figure(figsize=(6,6))
k = sns.regplot(x="Bullish", y="Bearish", data=examine)
note = "Correlation is " + str( round(examine["Bullish"].corr(examine["Bearish"]),3))
k.figure.text(0.4, 0.8, note, fontsize=18, weight="bold")


Out[17]:
<matplotlib.text.Text at 0x117b8d208>

In [18]:
sns.jointplot(x="Bullish", y="Excess_Return", data=examine,color="g")
sns.jointplot(x="Bearish", y="Excess_Return", data=examine,color="r")
plt.show()
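
The pairwise correlations behind these plots can also be printed directly (a sketch with the same columns):

In [ ]:
# correlation matrix of the sentiment ratios and weekly excess returns
examine[["Bullish", "Bearish", "Excess_Return"]].corr()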


Regression

The regression results show that the coefficients on the bullish and bearish sentiment data are not significantly different from zero, and the model barely explains the variation in weekly stock returns, with an R-squared of 0.006. This is not surprising: if the relationship were strong, intelligent players would have earned a great deal of money simply by tracking investor sentiment, and the pattern would have been traded away.


In [19]:
# Regression of Excess Return on Bullish and Bearish Sentiment Data
import statsmodels.formula.api as sm
reg = sm.ols(formula="Excess_Return ~ Bullish + Bearish", data=examine).fit()
reg.summary()


Out[19]:
OLS Regression Results
Dep. Variable: Excess_Return R-squared: 0.006
Model: OLS Adj. R-squared: 0.005
Method: Least Squares F-statistic: 4.427
Date: Wed, 11 May 2016 Prob (F-statistic): 0.0121
Time: 18:12:53 Log-Likelihood: -3389.0
No. Observations: 1494 AIC: 6784.
Df Residuals: 1491 BIC: 6800.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.2776 0.485 0.572 0.567 -0.674 1.229
Bullish 0.6577 0.756 0.870 0.384 -0.825 2.140
Bearish -1.3237 0.795 -1.665 0.096 -2.883 0.236
Omnibus: 264.436 Durbin-Watson: 2.122
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2224.995
Skew: -0.570 Prob(JB): 0.00
Kurtosis: 8.869 Cond. No. 20.3
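
Weekly returns often exhibit serial correlation, so as a robustness check one might re-fit the same regression with Newey-West (HAC) standard errors; a sketch (the choice of maxlags=4, roughly one month of weekly data, is ours):

In [ ]:
# same regression with heteroskedasticity- and autocorrelation-consistent standard errors
reg_hac = sm.ols(formula="Excess_Return ~ Bullish + Bearish",
                 data=examine).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
reg_hac.summary()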

Outliers

Although the regression result does not tell us anything useful about the relationship between investor sentiment and stock returns, we may still be interested in whether an extremely bullish or bearish data point tells us something about market timing. In other words, does an extremely bullish reading imply that the market is about to peak, and does an extremely bearish reading suggest that the market is at a bottom? To check, we first locate the very bullish and very bearish data points that lie more than 3.5 standard deviations above the mean. Then we use the synthetic stock index generated earlier to see what happened to the stock market in the year after a super-bullish or super-bearish point is detected.


In [23]:
#Define outliers as more than 3.5 standard deviations above the mean. Subject to change.
toobull = examine[(examine.Bullish - examine.Bullish.mean())>=(3.5*examine.Bullish.std())].reset_index();
toobull


Out[23]:
Date Bullish Bearish Excess_Return Stock_Index
0 2000-01-07 0.75 0.1333 -2.49 6.298392

In [24]:
#Define outliers as more than 3.5 standard deviations above the mean. Subject to change.
toobear = examine[(examine.Bearish - examine.Bearish.mean())>=(3.5*examine.Bearish.std())].reset_index();
toobear


Out[24]:
Date Bullish Bearish Excess_Return Stock_Index
0 1990-10-19 0.1300 0.6700 3.44 1.070872
1 2009-03-06 0.1892 0.7027 -7.03 3.694305
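
Both cells apply the same rule, so the threshold is easy to vary by wrapping it in a small helper; a sketch (the function name extreme_points is ours):

In [ ]:
# rows where a sentiment ratio lies more than z standard deviations above its mean
def extreme_points(df, column, z=3.5):
    return df[(df[column] - df[column].mean()) >= z * df[column].std()].reset_index()

toobull = extreme_points(examine, "Bullish")
toobear = extreme_points(examine, "Bearish")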

In [22]:
for i in range(len(toobull)):
    num = examine.index.get_loc(toobull["Date"][i]);
    num2 = num - 52;
    num3 = num + 53;
    
    toobull_index_before = examine["Stock_Index"][num2:num+1]      #Start from the previous 52nd week to this week
    toobull_index = examine["Stock_Index"][num:num3]               #Start from this week to the next 52nd week
    Annual_Return = str(round((toobull_index[len(toobull_index)-1] -
                               toobull_index[0]) *100 / toobull_index[0],2)) + "%";
    
    plt.figure(figsize=(12,6))
    plt.plot(toobull_index_before,color="b",alpha=0.2)
    plt.plot(toobull_index,color="r")
    plt.ylabel("Stock Index")
    plt.title("Annual Return: "+Annual_Return, fontsize=20, loc="left", weight="bold")
    plt.annotate("Sell",
                 xy=(toobull["Date"][i], toobull["Stock_Index"][i]+0.04),
                 xytext=(toobull["Date"][i], toobull["Stock_Index"][i]+0.2),
                 fontsize=12, 
                 weight="bold", 
                 arrowprops=dict(facecolor='red', shrink=0.05))

    
for i in range(len(toobear)):
    num = examine.index.get_loc(toobear["Date"][i]);
    num2 = num - 52;
    num3 = num + 53;
    
    toobear_index_before = examine["Stock_Index"][num2:num+1]      #Start from the previous 52nd week to this week
    toobear_index = examine["Stock_Index"][num:num3]               #Start from this week to the next 52nd week
    Annual_Return = str(round((toobear_index[len(toobear_index)-1] -
                               toobear_index[0]) *100 / toobear_index[0],2)) + "%";
    
    plt.figure(figsize=(12,6))
    plt.plot(toobear_index_before,color="b",alpha=0.2)
    plt.plot(toobear_index,color="g")
    plt.ylabel("Stock Index")
    plt.title("Annual Return: "+Annual_Return, fontsize=20, loc="left", weight="bold")
    plt.annotate("Buy",
                 xy=(toobear["Date"][i], toobear["Stock_Index"][i]-0.02),
                 xytext=(toobear["Date"][i], toobear["Stock_Index"][i]-0.06),
                 fontsize=12, 
                 weight="bold", 
                 arrowprops=dict(facecolor='green', shrink=0.05))


The results reveal a very interesting market-timing pattern. All three data points look like turning points in the stock index movement. For example, when a highly bullish atmosphere was detected in January 2000, the dot-com bubble was about to collapse; and when everyone was extremely pessimistic about the financial market in March 2009, a multi-year bull market was about to begin. This experimental design is certainly not perfect. First, only a handful of extreme observations exist in the dataset, so there is not sufficient evidence to justify these market-timing patterns. Second, the experiment is retrospective rather than prospective: we used standard deviations computed over the full sample and looked back at the results, rather than using only information available at the time. Nevertheless, these outliers could still be indicative when evaluating market inflection points.

Conclusion

We believe that Buffett's saying is somewhat informative for investors: "be fearful" when the market feels extremely bullish and "be greedy" when the market is far too pessimistic. Our findings from historical data suggest that when an extreme outlier appears in the bullish or bearish sentiment data, its date may mark a point at which the market changes direction. We could extend the study by connecting the sentiment data with macroeconomic factors to gain better predictive power in marking the turning points of the stock market.

Data Sources

"AAII | AAII Investor Sentiment Data." AAII | AAII Investor Sentiment Data. Quandl, n.d. Web. 4 May 2016.

"Efficient Market Hypothesis (EMH) Definition | Investopedia." Investopedia. N.p., 18 Nov. 2003. Web. 2 May 2016.

"How Investor Sentiment Affects Returns." CBSNews. CBS Interactive, n.d. Web. 2 May 2016.

"KFRENCH | Fama/French Factors (Weekly)." KFRENCH | Fama/French Factors (Weekly). Quandl, n.d. Web. 3 May 2016.