In this project, we use two sets of data to draw insights into how media sentiment can serve as an indicator for the financial sector. For the financial data, we use the daily returns of the market index (^GSPC), a good indicator of market fluctuation; for media sentiment, we use summarized information from news pieces published by nine of the most popular press outlets, chosen for their strong influence in shaping people's perception of world events.
Both sets of data are real-time, which means the source files reflect the current moment and need to be loaded each time the analysis is performed. The sentiment analysis library returns a polarity score (-1.0 to 1.0) and a subjectivity score (0.0 to 1.0) for each news story. Using these quantified sentiment scores, we juxtapose the two time series, observe whether they present any correlation, and search for potential causality. For example, we may test the hypothesis that when polarity among the daily news posts is higher (i.e., more positive), the financial market is more likely to rise that same day. The rest of the notebook is a step-by-step walkthrough.
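As a quick illustration of what the sentiment library returns (a minimal sketch using TextBlob, which is imported below; the headline is made up for illustration):
from textblob import TextBlob
s = TextBlob("Stocks rally as investors cheer upbeat earnings").sentiment
print(s)  # a Sentiment namedtuple: polarity in [-1.0, 1.0], subjectivity in [0.0, 1.0]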
In [69]:
%matplotlib inline
# import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data
from datetime import datetime
import numpy as np
from textblob import TextBlob
import csv
from wordcloud import WordCloud,ImageColorGenerator
#from scipy.misc import imread
import string
We use pd.read_json() to import real-time news information (the top 10 posts from each publisher). These news items are stored as separate dataframes and then combined into one collective dataframe. (News API powered by NewsAPI.org)
The news outlets consist of The Wall Street Journal, CNN, The New York Times, The Washington Post, BBC News, ABC News (Australia), the Financial Times, Bloomberg, and The Economist.
In [70]:
cnn = pd.read_json('https://newsapi.org/v1/articles?source=cnn&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518')
nyt= pd.read_json('https://newsapi.org/v1/articles?source=the-new-york-times&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518')
wsp=pd.read_json('https://newsapi.org/v1/articles?source=the-washington-post&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518')
bbc=pd.read_json("https://newsapi.org/v1/articles?source=bbc-news&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518")
abc=pd.read_json("https://newsapi.org/v1/articles?source=abc-news-au&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518")
#google = pd.read_json(" https://newsapi.org/v1/articles?source=google-news&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518")
ft = pd.read_json("https://newsapi.org/v1/articles?source=financial-times&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518")
bloomberg = pd.read_json("https://newsapi.org/v1/articles?source=bloomberg&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518")
economist = pd.read_json("https://newsapi.org/v1/articles?source=the-economist&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518")
wsj = pd.read_json("https://newsapi.org/v1/articles?source=the-wall-street-journal&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518")
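The nine requests above could equally be made in a loop; a sketch assuming the same endpoint and API key:
# one request per source id, collected into a list of dataframes
base = 'https://newsapi.org/v1/articles?source={}&sortBy=top&apiKey=bdc0623102e94a7586137f02a51e0518'
sources = ['the-wall-street-journal', 'cnn', 'the-new-york-times', 'the-washington-post',
           'bbc-news', 'abc-news-au', 'financial-times', 'bloomberg', 'the-economist']
frames = [pd.read_json(base.format(s)) for s in sources]  # one dataframe per outlet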
In [71]:
total = [wsj, cnn, nyt, wsp, bbc, abc, ft, bloomberg, economist]
total1 = pd.concat(total, ignore_index=True)
total1
Out[71]:
Some values may be missing in the articles column. For example, if a news piece from BBC has no description, the field contains None instead of a string. We therefore convert NoneType entries to the string 'None', because TextBlob expects string input and would raise an error on None when we run sentiment analysis later.
In [72]:
k = 0
while k < len(total1):
    if total1['articles'][k]['description'] is None:
        total1['articles'][k]['description'] = 'None'
    k += 1
j = 0
while j < len(total1):
    print(type(total1['articles'][j]['description']))
    j += 1
# now all entries are of type string, regardless of whether they hold real content.
In [73]:
l = 0
while l < len(total1):
    if total1['articles'][l]['title'] is None:
        total1['articles'][l]['title'] = 'None'
    l += 1
p = 0
while p < len(total1):
    print(type(total1['articles'][p]['title']))
    p += 1
# now all entries are of type string, regardless of whether they hold real content.
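The two cells above could be collapsed into a single pass; a minimal sketch over the same total1 dataframe:
# replace None titles/descriptions with the string 'None' in one loop
# ('or' also catches empty strings, which is harmless here)
for art in total1['articles']:
    art['title'] = art['title'] or 'None'
    art['description'] = art['description'] or 'None'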
Each entry in the column named articles is a dict; each row contains information including author, title, description, url, urlToImage, and publishedAt, of which title is selected for the main analysis.
In [74]:
# write the news posts into a .csv file
# ('a' opens in append mode, so repeated runs accumulate rows in result.csv)
n_rows = len(total1.index)
articles = total1['articles']
result = csv.writer(open('result.csv','a'))
result.writerow(['PublishedAt','Title','description'])
for i in range(0,n_rows):
    line = [articles[i]['publishedAt'],articles[i]['title'],articles[i]['description']]
    result.writerow(line)
# print the first item in the 'articles' series as an example.
articles[0]
Out[74]:
In [75]:
# type of each entry in the 'articles' column is 'dict'
type(articles[0])
Out[75]:
In [76]:
# keys of the 'dict' variables are 'author', 'publishedAt', 'urlToImage', 'description', 'title', 'url'
articles[0].keys()
Out[76]:
The .tags property performs part-of-speech tagging (for example, NNP stands for a singular proper noun).
In [77]:
blob = TextBlob(str(articles[0]['title']))
blob.tags
Out[77]:
A loop prints all the news titles, which are later used for sentiment analysis.
In [78]:
i = 0
while i < n_rows:
    blob = TextBlob(articles[i]['title'])
    print(1 + i, ". ", blob, sep = "")
    i += 1
All descriptions of the news posts are printed in the same way as above; they improve the accuracy of our sentiment analysis by providing more words on the same topics as the titles.
In [79]:
j = 0
while j < n_rows:
    blob1 = TextBlob(str(articles[j]['description']))
    print(1 + j, ". ", blob1, sep = "")
    j += 1
A word cloud of news titles gives a direct and vivid impression of the most frequently discussed topics in today's news reports. The topics, people, and events that prevail among the top news pieces appear in the largest fonts, occupy the center space, and display the most salient colors.
In a visually pleasant way, a word cloud hints at the news sentiment of the day.
In [80]:
# write the descriptions from the csv file into a txt file called entire_text.txt
contents = csv.reader(open('result.csv','r'))
texts = open('entire_text.txt','w')
list_of_text = []
for row in contents:
    line = row[2].encode('utf-8')
    line = str(line.decode())
    list_of_text.append(line)
texts.writelines(list_of_text)
texts.close()  # flush the buffer so the next cell reads the complete file
In [81]:
text=open("entire_text.txt",'r')
text=text.read()
wordcloud = WordCloud().generate(text)
In [82]:
#display the generated image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
Out[82]:
In [83]:
# increase max_words and max_font_size, and change the background color to white
wordcloud = WordCloud(max_words=200,background_color='white',max_font_size=100).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
We use the .sentiment property from TextBlob to calculate the polarity and subjectivity of each title.
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity score is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
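Each result can also be accessed by field name; a minimal sketch with a made-up sentence:
s = TextBlob("The market looks surprisingly strong today").sentiment
print(s.polarity, s.subjectivity)  # the two fields of the Sentiment namedtuple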
In [84]:
# a loop to show sentiment analysis results for all the titles
n = 0
while n < n_rows:
    print(TextBlob(articles[n]['title']).sentiment)
    n += 1
From the TextBlob module, the .sentiment property returns its results as namedtuples. To keep these results available for later processing, we collect them in a list named tests_title that stores the sentiment of every news title.
In [85]:
N = 0
tests_title = []
while N < n_rows:
    tests_title.append(TextBlob(articles[N]['title']).sentiment)
    N += 1
We create a list named list_polarity_title to store polarity scores for news titles.
In [86]:
list_polarity_title = [] # this list contains all titles' polarity scores.
for test in tests_title:
    list_polarity_title.append(test.polarity)
Similarly, we create a list of subjectivity scores for news titles.
In [87]:
list_subjectivity_title = [] # this list contains all titles' subjectivity scores.
for test in tests_title:
    list_subjectivity_title.append(test.subjectivity)
We use the .sentiment property again to calculate the polarity and subjectivity of each description. As mentioned above, analyzing the descriptions makes the final results richer and hopefully more accurate.
In [88]:
m = 0
while m < n_rows:
    print(TextBlob(articles[m]['description']).sentiment)
    m += 1
In [89]:
M = 0
tests_description = []
while M < n_rows:
    tests_description.append(TextBlob(articles[M]['description']).sentiment)
    M += 1
We create a list of polarity scores for news descriptions by appending each polarity score to the list named list_polarity_description.
In [90]:
list_polarity_description = [] # this list contains all descriptions' polarity scores.
for test in tests_description:
    list_polarity_description.append(test.polarity)
In the same way, we create a list of subjectivity scores for the news descriptions.
In [91]:
list_subjectivity_description = [] # this list contains all descriptions' subjectivity scores.
for test in tests_description:
    list_subjectivity_description.append(test.subjectivity)
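The four score lists above can also be built in one pass with list comprehensions over the same tests_title and tests_description lists; a sketch:
# equivalent one-liners for the four append loops above
list_polarity_title = [t.polarity for t in tests_title]
list_subjectivity_title = [t.subjectivity for t in tests_title]
list_polarity_description = [t.polarity for t in tests_description]
list_subjectivity_description = [t.subjectivity for t in tests_description]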
Now we have four lists of data: title polarity, title subjectivity, description polarity, and description subjectivity.
We convert these four lists into one dataframe for plotting.
In [92]:
total_score = [list_polarity_title, list_subjectivity_title, list_polarity_description, list_subjectivity_description]
labels = ['T_polarity', 'T_subjectivity', 'D_polarity', 'D_subjectivity']
df = pd.DataFrame.from_records(total_score, index = labels)
df
Out[92]:
We transpose the dataframe to make it compatible with the .plot() method.
In [93]:
df = df.transpose()
df
Out[93]:
In [94]:
# this plot shows scores for all the news posts.
df.plot()
Out[94]:
Apparently, the individual news posts standing alone don't convey much information. For a better perspective, we group the scores by the press outlet they belong to, under the assumption that posts from the same outlet are much more likely to carry a uniform tone. We create a list named new_T_polarity to store the sum of the title polarity scores for each outlet (the API returns posts in blocks of 10 per outlet). Then we repeat the same operation for the subjectivity scores.
In [95]:
c_T_polarity = df['T_polarity']
new_T_polarity = []
B = 0
C = 0
while B < n_rows:
    subtotal = 0
    while C < B + 10:
        subtotal += c_T_polarity[C]
        C += 1
    new_T_polarity.append(subtotal)
    B += 10
new_T_polarity
# The press are in the order: wsj, cnn, nyt, wsp, bbc, abc, ft, bloomberg and economist.
Out[95]:
In [96]:
c_T_subjectivity = df['T_subjectivity']
new_T_subjectivity = []
D = 0
E = 0
while D < n_rows:
    subtotal = 0
    while E < D + 10:
        subtotal += c_T_subjectivity[E]
        E += 1
    new_T_subjectivity.append(subtotal)
    D += 10
new_T_subjectivity
Out[96]:
In [97]:
c_D_polarity = df['D_polarity']
new_D_polarity = []
F = 0
G = 0
while F < n_rows:
    subtotal = 0
    while G < F + 10:
        subtotal += c_D_polarity[G]
        G += 1
    new_D_polarity.append(subtotal)
    F += 10
new_D_polarity
Out[97]:
In [98]:
c_D_subjectivity = df['D_subjectivity']
new_D_subjectivity = []
H = 0
I = 0
while H < n_rows:
    subtotal = 0
    while I < H + 10:
        subtotal += c_D_subjectivity[I]
        I += 1
    new_D_subjectivity.append(subtotal)
    H += 10
new_D_subjectivity
Out[98]:
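All four per-outlet aggregates can also be computed in one step with pandas; a sketch that, like the loops above, assumes the posts arrive in order, 10 per outlet:
# label each block of 10 consecutive posts with an outlet index, then sum per block
press_sums = df.groupby(df.index // 10).sum()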
In [99]:
total_score_bypublishhouse = [new_T_polarity, new_T_subjectivity, new_D_polarity, new_D_subjectivity]
df1 = pd.DataFrame.from_records(total_score_bypublishhouse, index = labels)
df1
Out[99]:
In [100]:
# change the column labels to press houses (same order as the concatenation above).
new_columns = ['wsj', 'cnn', 'nyt', 'wsp', 'bbc', 'abc', 'ft', 'bloomberg', 'economist']
df1.columns = new_columns
df1
Out[100]:
In [101]:
#colors = [(x/10.0, x/20.0, 0.75) for x in range(n_rows)]
df1.plot(kind = 'bar', legend = True, figsize = (15, 2), colormap='Paired', grid = True)
# place the legend above the subplot and use the full expanded width.
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
           ncol=10, mode="expand", borderaxespad=0.)
Out[101]:
In [102]:
bar_color = 'orange'
row = df1.iloc[0]
row.plot(kind = 'bar', title = "Polarity for news titles by news press", color = bar_color, grid = True)
Out[102]:
In [103]:
# all_news.csv holds the news posts collected so far (date in column 0, description in column 2)
contents = csv.reader(open('all_news.csv','r', encoding = "ISO-8859-1"))
result = csv.writer(open('entire_result.csv','w'))
In [104]:
# score the polarity of each post and write one (Date, polarity) row per post
result.writerow(['Date','polarity'])
for row in contents:
    comment = row[2]
    blob = TextBlob(comment)
    polarity = blob.sentiment.polarity
    line = [row[0],polarity]
    result.writerow(line)
In [105]:
data = pd.read_csv('entire_result.csv')
data
Out[105]:
In [106]:
#group the data by date
data=data.groupby('Date', as_index=False)['polarity'].mean()
#convert column "Date" to a date data type
data['Date'] = pd.to_datetime(data['Date'])
#sort the data by date ascending
data=data.sort_values(by="Date", axis=0, ascending=True, inplace=False, kind='quicksort')
data
Out[106]:
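The read-group-sort sequence above can also be written as a single chain; a sketch:
data = (pd.read_csv('entire_result.csv')
          .groupby('Date', as_index=False)['polarity'].mean()      # mean polarity per day
          .assign(Date=lambda d: pd.to_datetime(d['Date']))        # parse dates
          .sort_values('Date'))                                    # ascending by date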
In [107]:
data.plot(x='Date',kind = 'bar',title='Polarity for news titles by date',grid = True, color = 'orange')
Out[107]:
In [108]:
from yahoo_finance import Share
# '^GSPC' is the ticker symbol for the S&P 500 Index
yahoo = Share('^GSPC')
print(yahoo.get_open())
In [109]:
print(yahoo.get_price())
In [110]:
print(yahoo.get_trade_datetime())
In [111]:
from pprint import pprint
pprint(yahoo.get_historical('2017-04-09', '2017-05-09'))
We create a .csv file called yahoo.csv to store the financial data upon each import.
In [119]:
from yahoo_finance import Share
yahoo = Share('^GSPC')
dataset = yahoo.get_historical('2017-04-27','2017-05-09')
result = csv.writer(open('yahoo.csv','w'))
result.writerow(['Date','Low','High'])
for i in range(0,len(dataset)):
    line = [dataset[i]['Date'],dataset[i]['Low'],dataset[i]['High']]
    result.writerow(line)
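The same file could be produced more directly with pandas; a sketch assuming the same dataset list of dicts:
# build a dataframe from the records, keep only the three columns, and write the csv
pd.DataFrame(dataset)[['Date', 'Low', 'High']].to_csv('yahoo.csv', index=False)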
In [120]:
yahoo = pd.read_csv('yahoo.csv')
yahoo
Out[120]:
In [121]:
#convert column "Date" to a date data type
yahoo['Date'] = pd.to_datetime(yahoo['Date'])
#sort the data by date ascending
yahoo=yahoo.sort_values(by="Date", axis=0, ascending=True, inplace=False, kind='quicksort')
yahoo
Out[121]:
In [122]:
type(data['Date'])
type(yahoo['Date'])
Out[122]:
In [123]:
# join yahoo and data together on "Date" (pd.merge defaults to an inner join,
# so dates present in only one table, e.g. non-trading weekends, are dropped)
result = pd.merge(data, yahoo,on='Date')
result
Out[123]:
In [124]:
result_len = len(result)
In [125]:
yahoo.plot(x="Date",figsize=(6, 2),title='Yahoo Finance')
data.plot(x='Date',figsize=(6, 2),title='News Title Polarity')
Out[125]:
In [126]:
import numpy
low=result['Low']
high=result['High']
polarity=result['polarity']
numpy.corrcoef(low, polarity)
# from the data we have, news polarity and the S&P 500 index appear to be positively correlated
Out[126]:
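numpy.corrcoef reports only the correlation matrix; for a rough significance check, scipy also returns a p-value (a sketch, assuming scipy is installed):
from scipy import stats
r, p = stats.pearsonr(low, polarity)  # Pearson r and two-sided p-value
print(r, p)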
In [127]:
numpy.corrcoef(high, polarity)
Out[127]:
In [128]:
numpy.corrcoef(high, low)
Out[128]:
In [129]:
# a scatterplot of news polarity against the S&P 500 daily low
result.plot.scatter(x="polarity", y="Low")
Out[129]:
In [130]:
# a parametric estimation of the S&P 500 daily low as a function of news polarity
import seaborn as sns
# lmplot plots the data with a fitted regression line through it.
sns.lmplot(x="polarity", y="Low", data=result, ci=95) # ci is the size of the confidence interval, in percent
Out[130]:
In [131]:
import pyqt_fit.nonparam_regression as smooth
from pyqt_fit import npr_methods
In [132]:
k0 = smooth.NonParamRegression(polarity, low, method=npr_methods.SpatialAverage())
k0.fit()
grid = np.r_[-0.05:0.05:0.01]
plt.plot(grid, k0(grid), label="Spatial Averaging", linewidth=2)
plt.legend(loc='best')
Out[132]:
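pyqt_fit can be difficult to install on recent Python versions; a LOWESS smoother from statsmodels gives a comparable nonparametric fit (a sketch, assuming statsmodels is available):
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed = lowess(low, polarity, frac=0.6)  # returns (x, fitted y) pairs sorted by x
plt.plot(smoothed[:, 0], smoothed[:, 1], label="LOWESS", linewidth=2)
plt.legend(loc='best')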