Data Mining StockTwits

Introduction

Predicting stock market performance is big business. However, most models are incomplete because they do not take social influence into account. Mining social media for insight into the sentiment of stock traders and investors has the potential to lead to a more complete understanding of stock performance. In this project, we conduct several experiments using StockTwits data to see what kinds of relationships and metrics can be found that might be useful for stock traders and investors.

StockTwits is a social media platform similar to Twitter aimed at stock traders and investors. Instead of hashtags, StockTwits has stock ticker tags like \$AMD or \$AAPL that users post in their tweets. In addition, tweets can be tagged with 'bullish' or 'bearish' to indicate if the poster thinks the mentioned stocks will move upwards or downwards in price.

We conducted several analyses using this data:

  • Visualize metrics from both StockTwits and the Stock Market to gain intuition on their relationship.
  • Use bullish/bearish tagged tweets to train a classifier to gauge sentiment based on tweet text.
  • Compare predicted sentiment on specific companies to the actual sentiment to measure the performance of the automatic sentiment classifier.

Ultimately, we decided to focus on sentiment analysis and various visualizations. Though there is much excitement about predicting stock performance from tweets, we were unable to find strong correlations between our StockTwits data and actual stock prices, so we focused on other analyses. Accurate sentiment analysis of tweets is by itself a useful metric that traders and investors could be interested in.

Prior Work

StockTwits provides several whitepapers on previous experiments conducted using their service. In particular, "On the Predictability of Stock Market Behavior using StockTwits Sentiment and Posting Volume" was a useful template and source of motivation for this project.

  • Note: If you don't have pandas_datareader installed, uncomment the line below to install it

In [2]:
#!pip install pandas_datareader

In [2]:
import io, json, requests, time, os, os.path, math, urllib
from sys import stdout
from collections import Counter
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn import linear_model
from pandas_datareader.data import get_data_yahoo
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

StockTwits Data Collection

First we will write a function to query the StockTwits API, which returns up to 30 tweets at a time for a given ticker symbol. The API's max parameter restricts results to tweets with IDs at or below a given value, which is what we need to repeatedly query the server and page backwards through many recent tweets.


In [4]:
# returns the python object representation of the JSON in the response,
# or None if the request fails
def get_response(symbol, older_than, retries=5):
    # 'max' restricts results to messages with IDs at or below the given value
    url = 'https://api.stocktwits.com/api/2/streams/symbol/%s.json?max=%d' % (symbol, older_than-1)
    for _ in range(retries):
        response = requests.get(url)
        if response.status_code == 200:
            return json.loads(response.content)
        elif response.status_code == 429:
            # 429 means we have been rate limited; retrying won't help
            print response.content
            return None
        time.sleep(1.0)
    # couldn't get a response after all retries
    return None

Now we can write a function to build or extend a dataset of tweets for a given symbol. This works by remembering the oldest ID of tweets we have gotten so far, and using that as an option in the API query to get older tweets. By doing this we can iteratively build up a list of recent tweets for a given symbol ordered from most recent to least. The data is stored in JSON form, which is the same format the API returns to us.
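
For reference, a single message in the stored JSON looks roughly like the abridged sketch below. The field values here are invented for illustration, but these are the fields we rely on later:

{
  "id": 123456789,
  "created_at": "2016-11-19T22:43:16Z",
  "body": "$TSLA looking strong into the close",
  "user": {"id": 42, "username": "sometrader", "name": "Some Trader"},
  "entities": {"sentiment": {"basic": "Bullish"}},
  "symbols": [{"symbol": "TSLA"}],
  "reshares": {"reshared_count": 0}
}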


In [5]:
# extends the current dataset for a given symbol with more tweets
def get_older_tweets(symbol, num_queries):    
    path = './data/%s.json' % symbol
    if os.path.exists(path):
        # extending an existing json file
        with open(path, 'r') as f:
            data = json.load(f)
            if len(data) > 0:
                older_than = data[-1]['id']
            else:
                older_than = 1000000000000
    else:
        # creating a new json file
        data = []
        older_than = 1000000000000 # any huge number
    
    for i in range(num_queries):
        content = get_response(symbol, older_than)
        if content is None:
            print 'Error, an API query timed out'
            break
        if len(content['messages']) == 0:
            # no older tweets are available, so stop early
            break
        data.extend(content['messages'])
        older_than = data[-1]['id']
        stdout.write('\rSuccessfully made query %d' % (i+1))
        stdout.flush()
        # sleep to make sure we don't get throttled
        time.sleep(0.5)
        
    # write the new data to the JSON file
    with open(path, 'w') as f:
        json.dump(data, f)
    print
    print 'Done'

Now we fetch data for several ticker symbols. Note that because of API rate limiting, you will have to rerun this cell multiple times, about once an hour, to get all the data. The JSON files are distributed with this notebook, so this cell is only here to show how we originally got the data.


In [6]:
# get some data
# apparently a client can only make 200 requests an hour, so we can't get all the data at once

# make data directory if needed
if not os.path.exists('./data'):
    os.mkdir('./data')
    
symbols = ['AAPL', 'NVDA', 'TSLA', 'AMD', 'JNUG', 'JDST', 'LABU', 'QCOM', 'INTC', 'DGAZ']
tweets_per_symbol = 3000
for symbol in symbols:
    path = './data/%s.json' % symbol
    if os.path.exists(path):
        with open(path, 'r') as f:
            num_tweets = len(json.load(f))
    else:
        num_tweets = 0
    # each query returns up to 30 tweets, so round the number of queries up
    num_queries = (tweets_per_symbol - num_tweets - 1)/30 + 1
    if num_queries > 0:
        print 'Getting tweets for symbol %s' % symbol
        get_older_tweets(symbol, num_queries)


Getting tweets for symbol INTC
Successfully made query 100
Done
Getting tweets for symbol DGAZ
Successfully made query 100
Done

The next cell is mainly just for debugging purposes. There is no need to run it.


In [7]:
# check that we're doing the querying and appending correctly without getting duplicates
# and that message IDs are in descending order
symbol = 'NVDA'
with open('./data/%s.json' % symbol, 'r') as f:
    data = json.load(f)
S = set()
old_id = 1000000000000
for message in data:
    message_id = message['id']
    assert message_id not in S
    assert message_id < old_id
    old_id = message_id
    S.add(message_id)
print 'Passed'


Passed

Stock Market Data Comparison

Next, we'll extract stock market data for the symbols we're interested in. For the purpose of our experiment, we'll use Yahoo Finance's daily stock data. The API takes in a start date, end date, and stock symbol.


In [35]:
enddate=datetime.now()
startdate=datetime(2015,1,1)
stock_data = get_data_yahoo('AAPL',startdate,enddate)
stock_data['Volume'].plot(legend=True,figsize=(10,4));
stock_data.head()


Out[35]:
Open High Low Close Volume Adj Close
Date
2015-01-02 111.389999 111.440002 107.349998 109.330002 53204600 105.158716
2015-01-05 108.290001 108.650002 105.410004 106.250000 64285500 102.196226
2015-01-06 106.540001 107.430000 104.629997 106.260002 65797100 102.205846
2015-01-07 107.199997 108.199997 106.699997 107.750000 40105900 103.638996
2015-01-08 109.230003 112.150002 108.699997 111.889999 59364500 107.621041

In [36]:
stock_data['Adj Close'].plot(legend=True,figsize=(10,4));


As you can see, we can quickly and easily pull both volume and closing prices for the dates of interest. This data was useful in exploring the possibility of predicting market performance.

Data Visualization & Exploration

Next, we parse the JSON data we've collected into a pandas DataFrame to more easily work with it.


In [10]:
# Takes the parsed JSON data and returns a pandas DataFrame for easier operation.
def stocktwits_json_to_df(data, verbose=False):
    columns = ['id','created_at','username','name','user_id','body','basic_sentiment','reshare_count']
    db = pd.DataFrame(index=range(len(data)),columns=columns)
    for i, message in enumerate(data):
        db.loc[i,'id'] = message['id']
        db.loc[i,'created_at'] = message['created_at']
        db.loc[i,'username'] = message['user']['username']
        db.loc[i,'name'] = message['user']['name']
        db.loc[i,'user_id'] = message['user']['id']
        db.loc[i,'body'] = message['body']
        # we classify bullish as +1 and bearish as -1 to make it ready for classification training
        try:
            if message['entities']['sentiment']['basic'] == 'Bullish':
                db.loc[i,'basic_sentiment'] = 1
            elif message['entities']['sentiment']['basic'] == 'Bearish':
                db.loc[i,'basic_sentiment'] = -1
            else:
                db.loc[i,'basic_sentiment'] = 0
        except (KeyError, TypeError):
            # the message carries no sentiment entity
            db.loc[i,'basic_sentiment'] = 0
        db.loc[i,'reshare_count'] = message['reshares']['reshared_count']
        for j, symbol in enumerate(message['symbols']):
            db.loc[i,'symbol'+str(j)] = symbol['symbol']
        if verbose:
            print db.loc[i,:]
    db['created_at'] = pd.to_datetime(db['created_at'])
    return db

We're going to use \$TSLA to visualize the data, since it is the symbol for which our data goes back the furthest. We'll now combine these two data sources so we can generate useful metrics for understanding how StockTwits relates to the stock market over time.


In [11]:
# Load tweets for visualizing data
filename = 'TSLA.json'
path = './tsla_data/%s' % filename
with open(path, 'r') as f:
    data = json.load(f)
db = stocktwits_json_to_df(data)
print '%d examples extracted ' % db.shape[0]


14998 examples extracted 

In [37]:
enddate = db['created_at'].max()
startdate = db['created_at'].min()
print startdate, enddate
stock_data = get_data_yahoo('TSLA', startdate, enddate)


2016-10-17 17:25:28 2016-11-19 22:43:16

We will now combine our datasets. In the process, we also generate statistics on the number of bullish/bearish tweets by grouping tweets by day, paying attention to the totals and their ratios to each other:

  • Mentions: Total number of mentions, with or without bullish/bearish labels.
  • Total Bullish/Bearish: Number of tweets with the bullish/bearish labels on the given date.
  • Total Predictions: The sum of bullish and bearish tweets on the given date.
  • Bull Ratio: The ratio of Total Bullish Tweets to Total Predictions.
  • Bear Ratio: The ratio of Total Bearish Tweets to Total Predictions.

In [38]:
#Counts mentions and bullish/bearish ratio of stock tweets collected
def tweet_metrics(stock_data, stock_tweets):
    stock_data['mentions'] = np.zeros(stock_data.shape[0])
    stock_data['total_bullish'] = np.zeros(stock_data.shape[0])
    stock_data['total_bearish'] = np.zeros(stock_data.shape[0])
    stock_data['total_predictions'] = np.zeros(stock_data.shape[0])
    stock_data['bull_ratio'] = np.zeros(stock_data.shape[0])
    stock_data['bear_ratio'] = np.zeros(stock_data.shape[0])
    for i, d in enumerate(stock_data.index):
        tweets_on_d = stock_tweets[stock_tweets['created_at'].dt.date==d.date()]
        stock_data.loc[d,'mentions'] = tweets_on_d.shape[0]
        stock_data.loc[d,'total_bullish'] = tweets_on_d[tweets_on_d['basic_sentiment']==1].shape[0]
        stock_data.loc[d,'total_bearish'] = tweets_on_d[tweets_on_d['basic_sentiment']==-1].shape[0]
        total = stock_data.loc[d,'total_bullish'] + stock_data.loc[d,'total_bearish']
        stock_data.loc[d,'total_predictions'] = total
        # guard against days with no labeled tweets
        if total > 0:
            stock_data.loc[d,'bull_ratio'] = stock_data.loc[d,'total_bullish']/float(total)
            stock_data.loc[d,'bear_ratio'] = stock_data.loc[d,'total_bearish']/float(total)
    return stock_data

Now we can visualize the results of our analysis.


In [14]:
stock_metrics = tweet_metrics(stock_data, db)
print stock_metrics[['mentions','total_bullish','total_bearish','bull_ratio']]


            mentions  total_bullish  total_bearish  bull_ratio
Date                                                          
2016-10-17     113.0           16.0            9.0    0.640000
2016-10-18     564.0          102.0           61.0    0.625767
2016-10-19    1308.0          353.0           65.0    0.844498
2016-10-20    1417.0          293.0          139.0    0.678241
2016-10-21     236.0           49.0           26.0    0.653333
2016-10-24     220.0           56.0           14.0    0.800000
2016-10-25     195.0           38.0            8.0    0.826087
2016-10-26    1973.0          411.0          135.0    0.752747
2016-10-27    1427.0          228.0          166.0    0.578680
2016-10-28     295.0           37.0           29.0    0.560606
2016-10-31     313.0           52.0           31.0    0.626506
2016-11-01     905.0          100.0          110.0    0.476190
2016-11-02     433.0           58.0           81.0    0.417266
2016-11-03     298.0           31.0           46.0    0.402597
2016-11-04     627.0           92.0          103.0    0.471795
2016-11-07     227.0           48.0           23.0    0.676056
2016-11-08     200.0           52.0           13.0    0.800000
2016-11-09     529.0          108.0           91.0    0.542714
2016-11-10     541.0          105.0           54.0    0.660377
2016-11-11     339.0           68.0           32.0    0.680000
2016-11-14     494.0           96.0           68.0    0.585366
2016-11-15     218.0           45.0           22.0    0.671642
2016-11-16     277.0           67.0           35.0    0.656863
2016-11-17     713.0          195.0           44.0    0.815900
2016-11-18     431.0           80.0           55.0    0.592593

Note that Yahoo Finance data is delayed: the current day won't appear until the market has closed.

Next, we'll compare our metrics to gain an understanding of how StockTwits correlates with the stock market. Our first comparison is between the total number of mentions and the trading volume. In the two graphs below, you will see a clear correlation between the number of mentions and the trading volume.

However, what we don't see is any predictive trend in the data.


In [40]:
stock_metrics[['mentions']].plot(legend=True,figsize=(10,4));
stock_metrics[['Volume']].plot(legend=True,figsize=(10,4));
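
To put a number on that visual impression, a quick check (a minimal sketch using the stock_metrics frame built above) is the Pearson correlation between daily mentions and trading volume:

In [ ]:
# correlation between daily StockTwits mentions and traded share volume
print stock_metrics[['mentions', 'Volume']].corr()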


Finally, we'll compare the closing price to the bullish/bearish predictions of StockTwits. Here, we see the strong correlation between the market and StockTwits begin to break down.

There seems to be an abundance of optimism: the majority of labelled tweets are "bullish". Additionally, not every peak and valley in sentiment is mirrored by the market. With such a small sample size, we cannot say anything with certainty.


In [41]:
stock_metrics[['total_bullish','total_bearish','total_predictions']].plot(legend=True,figsize=(10,4));
stock_metrics[['bull_ratio','bear_ratio']].plot(legend=True,figsize=(10,4));
stock_metrics[['Adj Close']].plot(legend=True,figsize=(10,4));
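
As a quick sanity check on the optimism claim, we can compute the share of labelled \$TSLA tweets that are bullish (a minimal sketch using the db DataFrame built earlier):

In [ ]:
# fraction of labelled tweets that carry the bullish tag
labelled = db[db['basic_sentiment'] != 0]
print 'Bullish share = %f' % (labelled['basic_sentiment'] == 1).mean()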


We will next explore the connections between symbols mentioned in the tweets. The function below counts the co-occurrences of symbols mentioned in StockTwits' tweets.


In [42]:
def countcomentions(df):
    
    def getsymbolset(df):
        symbols = []
        for i, row in df.iterrows():
            for symbol in row:
                if (pd.notnull(symbol)):
                    symbols.append(symbol)
        return set(symbols)
   
    def getallsymbols(df):
        columns = df.columns
        symbolcolumns = []
        for col in columns:
            if col.startswith('symbol'):
                symbolcolumns.append(col)
        return df[symbolcolumns]
        
    df = getallsymbols(df)
    symbolset = getsymbolset(df)
    print len(symbolset), "total symbols found."
    co = np.zeros((len(symbolset), len(symbolset)))
    co = pd.DataFrame(co, index=symbolset, columns=symbolset)
    for i, row in df.iterrows():
        for stock_symbol in row:
            for sym in row:
                if (sym!=stock_symbol) & pd.notnull(stock_symbol) & pd.notnull(sym):
                    co.loc[stock_symbol,sym]+=1
    return co
    
co = countcomentions(db)


788 total symbols found.

We see a clear power-law-like distribution when viewing the histogram of co-mention counts for Tesla Motors: the vast majority of symbols are co-mentioned only a few times.


In [52]:
plt.figure(figsize=(10,4))
sns.distplot(co.loc['TSLA',:], kde=False)


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcd3d5157d0>
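
One informal way to eyeball the power-law claim (a sketch, not a formal fit) is a rank-frequency plot of the same counts on log-log axes, where a power law shows up as a roughly straight line:

In [ ]:
# rank-frequency plot of TSLA co-mention counts on log-log axes
counts = co.loc['TSLA',:]
counts = counts[counts > 0].sort_values(ascending=False)
plt.figure(figsize=(10,4))
plt.loglog(np.arange(1, len(counts)+1), counts.values)
plt.xlabel('rank')
plt.ylabel('co-mentions with TSLA');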

The histogram above tells us that some stocks will have a disproportionate relationship to TSLA. The twenty most commonly co-mentioned symbols are given below.

Unsurprisingly, SolarCity Corporation (SCTY) was mentioned most commonly. SolarCity was recently in the news due to its decision to merge with Tesla Motors following a long and well-publicized lead-up. Investors and stock traders clearly believed the merger was important to Tesla's stock price, as \$SCTY has by far the greatest number of co-mentions with \$TSLA. Other top co-mentions include large tech companies, as well as Ford and General Motors.


In [44]:
co.loc['TSLA',co.loc['TSLA',:]>0].sort_values(ascending=False)[:20]


Out[44]:
SCTY     1005.0
AAPL      480.0
AMZN      366.0
NFLX      244.0
SPY       242.0
FB        238.0
GOOGL     173.0
F         157.0
GM        147.0
BABA      146.0
TWTR      146.0
NVDA      120.0
GOOG      111.0
QQQ        90.0
CMG        86.0
TWLO       57.0
GPRO       54.0
BMY        53.0
UAA        52.0
ACIA       45.0
Name: TSLA, dtype: float64

Very few other symbols come close to the co-mention count of SCTY. Those near the top are either in the same segment as Tesla or are big-name tech stocks: Apple, Amazon, Netflix, Facebook, Google, and Alibaba.

Below is a heatmap of the 788 x 788 co-mentions matrix, with the color scale clipped to the range 0 to 5 since the vast majority of pairs co-occur only a few times. We see a very clear axis representing TSLA. A slightly fuzzier intersection below and to the right represents SCTY. You can guess the rest from the list above.


In [47]:
plt.figure(figsize=(45,10))
sns.heatmap(co, xticklabels=False, vmin=0, vmax=5, yticklabels=False, square=True);


Having successfully mined StockTwits, we explored some of the relationships found in the data. However, we did not find sufficient evidence that stock performance can be predicted from tweets, so we shift our focus.

StockTwits provides a remarkably large set of labeled data for training. We explore sentiment prediction using the 'Bullish' and 'Bearish' labels in the section below.

Sentiment Prediction

Some of the tweets have bullish or bearish labels, indicating if the poster thinks the mentioned stock will go up or down in price respectively. We will extract only those tweets which have such sentiment labels, and convert the labels into either 0 for bearish or 1 for bullish.


In [3]:
def get_tweets_and_labels(data):
    # filter out messages without a bullish/bearish tag
    data = filter(lambda m: m['entities']['sentiment'] is not None, data)
    # get tweets
    tweets = map(lambda m: m['body'], data)
    # get labels
    def create_label(message):
        sentiment = message['entities']['sentiment']['basic']
        if sentiment == 'Bearish':
            return 0
        elif sentiment == 'Bullish':
            return 1
        else:
            raise Exception('Got unexpected sentiment')
    labels = map(create_label, data)
    return tweets, labels

In [4]:
# get all tweets and labels available
tweets = []
labels = []
all_tweets = []
for filename in os.listdir('./data'):
    path = './data/%s' % filename
    with open(path, 'r') as f:
        data = json.load(f)
    all_tweets.extend(map(lambda m: m['body'], data))
    t, l = get_tweets_and_labels(data)
    tweets.extend(t)
    labels.extend(l)
assert len(tweets) == len(labels)
print '%d labeled examples extracted ' % len(tweets)


12231 labeled examples extracted 

The next two cells define functions to create a TF-IDF vectorizer for the tweets and to train a linear SVM classifier to predict bearish or bullish sentiment.


In [5]:
def tfidf_vectorizer(tweets, all_tweets=None):
    vectorizer = TfidfVectorizer()
    if all_tweets is not None:
        # use all tweets, including unlabeled, to learn vocab and tfidf weights
        vectorizer.fit(all_tweets)
    else:
        vectorizer.fit(tweets)
    return vectorizer

In [6]:
def train_svm(X, y):
    model = svm.LinearSVC(penalty='l2', loss='hinge', C=1.0)
    #model = svm.SVC(C=1.0, kernel='rbf')
    model.fit(X, y)
    return model

We first create the TF-IDF feature matrix for all of our labeled data. Then we randomly permute it and split 10% off into a held-out test set. We also print the percentage of labeled tweets that are bullish, because the two classes are likely not balanced, and we want to know how well a classifier that always predicts the most common class would do.


In [9]:
vectorizer = tfidf_vectorizer(tweets, all_tweets)
X = vectorizer.transform(tweets)
words = vectorizer.get_feature_names()
y = np.array(labels)
print X.shape
print y.shape

N = X.shape[0]
num_train = int(math.floor(N*0.9))
P = np.random.permutation(N)
X_tr = X[P[:num_train]]
y_tr = y[P[:num_train]]
X_te = X[P[num_train:]]
y_te = y[P[num_train:]]
print 'Training set size is %d' % num_train
print 'Percent bullish = %f%%' % (100*y.mean())


(12231, 27389)
(12231,)
Training set size is 11007
Percent bullish = 76.698553%

Now it is simple to train the SVM and print out the accuracy for both the training and testing data.


In [10]:
model = train_svm(X_tr, y_tr)
print 'Training set accuracy = %f' % model.score(X_tr, y_tr)
print 'Test set accuracy = %f' % model.score(X_te, y_te)


Training set accuracy = 0.894794
Test set accuracy = 0.839869

We can see that the classifier does several percentage points better than always guessing the most common class. Now that we have a trained SVM, we can use its weights to print out the words most indicative of bearish or bullish sentiment. Because we used a linear SVM, each weight coefficient corresponds to a column in the TF-IDF matrix, which itself corresponds to a word. We get the indices of the weight coefficients with the highest and lowest values, and use them to print the most bullish and bearish words.


In [11]:
weights = np.squeeze(model.coef_)
sorted_weight_indices = np.argsort(weights)
num_words = 30
bearish_indices = sorted_weight_indices[:num_words]
bullish_indices = sorted_weight_indices[-num_words:][::-1]
words = np.array(words)
print 'Bearish words:'
for w in words[bearish_indices]:
    print w
print
print 'Bullish words:'
for w in words[bullish_indices]:
    print w


Bearish words:
puts
short
bearish
below
flush
down
fake
terrible
tank
drop
lower
biib
denial
116
sucked
accounting
180
fade
breakdown
hype
pig
ugly
740
doomed
abandon
downside
scam
160
dump
garbage

Bullish words:
bears
calls
amd
squeeze
buy
long
baba
ng
thiel
want
iphone
higher
relax
apple
111
up
shorts
breakout
bullish
strong
why
million
great
pls
ugaz
fairly
fly
average
love
msft

The results are actually pretty interesting. We'll give a bit of explanation of some of the terms for readers who are not familiar with the stock market. If you expect the price of a stock to fall, you can try to make money off it by shorting it or buying a type of option called puts. If the price is falling, you could say it is tanking, crashing, or on a downtrend. The classifier picked up on all of these terms, and many more, as correctly indicating bearish sentiment.

Likewise, if you expect the price to rise, you can make money by buying the stock and going 'long' on it, or by buying a type of option known as calls. And obviously terms like 'higher' or 'buy' clearly indicated bullishness. The classifier picked up on all these terms and others that are similar as correctly being indicative of bullish sentiment.
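
To see these cues in action, we can score a couple of made-up example phrases with the trained SVM (the sentences are invented for illustration; a positive decision value leans bullish, a negative one bearish):

In [ ]:
# signed distances to the separating hyperplane:
# positive leans bullish (1), negative leans bearish (0)
examples = ['going long, buying calls on the breakout',
            'shorting this garbage, buying puts before it tanks']
print model.decision_function(vectorizer.transform(examples))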

Finally, let's use a trained classifier to predict sentiment on a held-out dataset of Tesla Motors tweets, and plot it along with the actual sentiment. We use logistic regression instead of SVM for this experiment because we find it gives better results.


In [43]:
model = linear_model.LogisticRegression(penalty='l2', C=10.0, class_weight='balanced')
#model = svm.LinearSVC(penalty='l2', loss='hinge', C=1.0, class_weight='balanced')
model.fit(X, y)

with open('./tsla_data/TSLA.json', 'r') as f:
    data = json.load(f)[::-1]

def extract_body(m):
    return m['body']

def extract_date(m):
    return m['created_at']

def extract_sentiment(m):
    if m['entities']['sentiment'] != None:
        sentiment = m['entities']['sentiment']['basic']
        if sentiment == 'Bearish':
            return 0
        else:
            return 1
    else:
        return np.nan
    
d = {'body': map(extract_body, data),
     'date': pd.to_datetime(map(extract_date, data)),
     'sentiment': map(extract_sentiment, data)}
df = pd.DataFrame(data=d)

# use the classifier to predict sentiment for all tweets,
# treating every tweet as unlabeled
features = vectorizer.transform(df['body'])
predictions = model.predict(features)
df['predicted_sentiment'] = pd.Series(predictions)

print df.dtypes
print df.head()

grouped_df = df.groupby(pd.Grouper(key='date', freq='1D')).aggregate(np.mean)

print grouped_df.head()

plt.plot(grouped_df['sentiment'], label='Actual sentiment')
plt.plot(grouped_df['predicted_sentiment'], label='Predicted sentiment')
plt.legend(loc='lower right')

coef = np.corrcoef(grouped_df['sentiment'], grouped_df['predicted_sentiment'])[0,1]
print
print 'Correlation coefficient = %f' % coef


body                           object
date                   datetime64[ns]
sentiment                     float64
predicted_sentiment             int64
dtype: object
                                                body                date  \
0  $TSLA let&#39;s hope the new product contribut... 2016-10-17 17:25:28   
1  $TSLA \nElon&#39;s new product just leaked: it... 2016-10-17 17:27:25   
2  $TSLA how low can this go with an announcement... 2016-10-17 17:28:01   
3  $TSLA every spike gets stuffed . Lows on deck ... 2016-10-17 17:28:02   
4  $TSLA - lets see if LOD holds. Doubtful, but w... 2016-10-17 17:28:04   

   sentiment  predicted_sentiment  
0        NaN                    1  
1        NaN                    1  
2        NaN                    1  
3        NaN                    0  
4        0.0                    1  
            sentiment  predicted_sentiment
date                                      
2016-10-17   0.640000             0.592920
2016-10-18   0.625767             0.640071
2016-10-19   0.844498             0.648318
2016-10-20   0.678241             0.648553
2016-10-21   0.653333             0.682203

Correlation coefficient = 0.343614

To make the plot above we've assigned 0 to bearish and 1 to bullish, and averaged the results for each day; for example, a day with three bullish and one bearish labelled tweets gets an actual sentiment of 0.75. The actual sentiment calculation ignores any unlabeled tweets. The predicted sentiment treats all tweets as unlabeled and uses the classifier to predict 0 or 1. It is clear that while the predicted sentiment does not always exactly match the actual sentiment, the trend generally holds: movements in the predicted sentiment often match movements in the actual sentiment, which is reflected in the positive correlation coefficient. This means that even in the absence of explicitly labeled sentiment data, it is possible to use machine learning techniques to estimate investor sentiment based purely on the textual content of tweets.

Conclusion

StockTwits is potentially a very valuable source of social media data related to the stock market. In this project, we showed how StockTwits posting volume correlates strongly with stock trading volume, how company co-mentions in tweets can identify important events and related companies, and how classifiers can be trained to automatically gauge investor sentiment from raw text. These kinds of techniques and analyses may prove to be valuable for future stock traders in an increasingly data-driven world.