URL for accessing our scraped data: slantparse.com/data/slantparse_data.zip
Provide an overview of the project goals and the motivation for it. Consider that this will be read by people who did not see your project proposal.
Our original goal for this project was to train a classifier to determine the unstated political leanings of satirical news organizations. This idea came up in casual conversation between the authors, and it was (and is) something we very much want to investigate. In drafting our project proposal and in meeting with our TF Olivia, it became apparent that for such a project to be meaningful, we would need a metric for determining the success of our classification. Given that the types of satirical news organizations we wanted to target certainly do not have a stated leaning, and few if any lists of political leanings have been compiled for these publications, no such metric could be readily verified. For this reason, we refined our project idea to determine the political leanings of news organizations in general.
Both of the authors of this project are very interested in text analysis and natural language processing; this interest was part of our motivation for studying classification of political leanings in texts. Additionally, we wanted to explore scraping and cleaning data from the web. We had a lot of fun with that aspect of this project, and we both felt extremely proud of how cleanly the article text came out of our scraper (after computational cleanup, naturally).
The principal result of this work is a classifier trained on more than 600 articles scraped from five Liberal sources and five Conservative sources. The classifier vectorizes each article, counts word frequencies, removes stop words (words like "the" or "again"), and classifies the test data with approximately 75% accuracy. We report the precision, recall, and F1-score for our classifier, and we provide visualizations for interpreting our data and evaluating our features.
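As a preview, the core of this pipeline is quite compact. The following is a condensed sketch using the same scikit-learn components and settings as the full implementation later in this notebook; the two-article training list is an invented placeholder for the scraped data:

from sklearn.feature_extraction import text
from sklearn.ensemble import GradientBoostingClassifier

# placeholder data standing in for the ~600 scraped articles
train_texts = ["first example article text ...", "second example article text ..."]
train_labels = ['L', 'C']

# bag-of-words features: word counts over a capped vocabulary, stop words removed
vectorizer = text.CountVectorizer(max_features=2400, stop_words='english')
X = vectorizer.fit_transform(train_texts).toarray()

# same classifier settings as the full pipeline below
clf = GradientBoostingClassifier(n_estimators=800, max_depth=5)
clf.fit(X, train_labels)
print clf.predict(vectorizer.transform(["text of a new article to classify"]).toarray())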
Anything that inspired you, such as a paper, a web site, or something we discussed in class.
We, the authors of this project, were inspired to work with news data by particular current events that were unfolding at the time of this project, including the grand jury rulings in the Ferguson case and in the case of Eric Garner, among many other interesting, complex, and deep news stories. We set out to learn about the representation of politics in natural language. Further, we are both very interested in natural language processing in general and saw this as an opportunity to explore that interest, as well as an opportunity to learn how to scrape and clean text data from the web.
The main reason the authors embarked on this endeavor was to see how politics is evident in writing. Is it restricted to phrases like "Obamacare" versus the "Affordable Care Act"? Or is it more subtle, affecting adjective choices and emotion words? Because we wanted to see how politics manifested itself in writing, we didn't want to restrict our classification scheme to a single subject (like Obamacare or Ferguson) -- we really wanted to see what features of politics were expressed across topics and across publications.
In short, as news junkies we wanted to work with news data, and as students we wanted to explore natural language processing and classification further.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Initially, we wanted to learn about the political slants of satirical news organizations. Over the course of this project, particularly after discussing it with our TF, we realized that coming up with a valid metric for sites like 'The Onion', which has no official political leaning, was an unattainable goal. We wouldn't be able to classify satirical news organizations with conviction, as we would have no way to verify the results. For this reason, we refined our question to concern only those publications with a known political slant. Over the course of this project, the question became data-focused: we wanted to know whether our classifier was successful, and if it was, which features made it successful. We answer these questions in our visualizations, and we also channeled them into a feature of our website: a web tool that allows a guest to enter article text, which we classify in real time.
In [1]:
GET_NEW_ARTICLES = False  # changing this to True will add new content to the dataset
import urllib
import csv
from bs4 import BeautifulSoup
from sets import Set
from sgmllib import SGMLParser
import pickle
from textstat.textstat import textstat as ts
from sklearn import svm
from sklearn import neighbors
from sklearn import tree
from sklearn import ensemble
from sklearn.feature_extraction import text
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.linear_model
from os import path
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import numpy as np
import enchant, string, cPickle, re, time
from collections import Counter
import stop_words
%matplotlib inline
In [2]:
list_of_political_magazines = ['http://en.wikipedia.org/wiki/List_of_political_magazines', 'http://en.wikipedia.org/wiki/Category:Modern_liberal_American_magazines',
'http://en.wikipedia.org/wiki/Category:Conservative_American_magazines', 'http://usconservatives.about.com/od/gettinginvolved/tp/TopConservativeMagazines.htm',
'http://www.allyoucanread.com/top-10-political-magazines/', 'http://www.conservapedia.com/Conservative_media',
'http://www.dailykos.com/story/2009/04/05/716698/-The-Compleat-Revised-Guide-to-Liberal-and-Progressive-News-and-Politics-on-the-Web',
'http://www.washingtonpost.com/blogs/the-fix/wp/2014/10/21/lets-rank-the-media-from-liberal-to-conservative-based-on-their-audiences/']
Source, scraping method, cleanup, etc.
We use the metric defined above (cross-referencing lists of political leanings) to select 5 liberal sources and 5 conservative sources.
The 5 liberal sources:
- Mother Jones
- The Nation
- Slate
- The New Yorker
- The Washington Post
The 5 conservative sources:
- The Christian Science Monitor
- The Weekly Standard
- TownHall
- The American Conservative
- The American Spectator
We first visit the homepage of each of these publications. From the homepage, we scrape all links and store them in a set. We then visit each link, check whether it is an article from the given publication, and scrape its data into a CSV called articles.csv. Each row of articles.csv has the format: publication title, political leaning (C or L), article title, article text.
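For concreteness, a single row of articles.csv is written as in the following hypothetical call (the values here are invented placeholders):

import csv

# row format: publication title, political leaning (C or L), article title, article text
with open('articles.csv', 'a+') as csvfile:
    article_writer = csv.writer(csvfile, delimiter=',')
    article_writer.writerow(['The Nation', 'L', 'An Example Headline',
                             'The cleaned article text goes here...'])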
In [3]:
# cite: https://github.com/chungy/diveintopython/blob/master/py/urllister.py
class URLLister(SGMLParser):
    """
    Arguments: (implicit) SGMLParser
    Extracts all 'href's from an html document. We store these links in a set,
    and process them individually to retrieve content or additional links.
    """
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)
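A note on this approach: sgmllib is deprecated (and removed in Python 3), so the same href extraction could instead be done with BeautifulSoup, which we already import. A minimal sketch of that alternative, not used in this notebook:

from bs4 import BeautifulSoup

def list_urls(html):
    # collect the href attribute of every anchor tag on the page
    soup = BeautifulSoup(html)
    return [a['href'] for a in soup.find_all('a', href=True)]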
In [4]:
def process_url(url, url_mod, string_start, find_string, start_article_delim, end_article_delim, name, pol):
    """
    Arguments: string url, string url_mod, string string_start, string find_string,
    string start_article_delim, string end_article_delim, string name, string pol
    For each url passed to process_url, open the CSV file articles.csv, which contains all data scraped.
    Process the article by retrieving the title, and by using BeautifulSoup to retrieve the article text.
    Store each article in the CSV as the publication name, the publication's political leaning (either 'L' or 'C'), article title, and article content.
    """
    #with open('articles_scraped.pickle', 'a+') as handle:
    #    pickle.dump(set(), handle)
    # check url format
    if len(url) > 0 and ((str(url)[0] == '/' and string_start == "argument-not-used") or str(url).find(string_start) == 0) and str(url).find(find_string) != -1:
        with open('articles.csv', 'a+') as csvfile:
            article_writer = csv.writer(csvfile, delimiter=',')
            tmp_url = url_mod + str(url)
            usock = urllib.urlopen(tmp_url)
            html = usock.read()
            # use Beautiful Soup
            soup_unprocessed = BeautifulSoup(html)
            # use try/except to catch cases in which articles do not conform to general standards,
            # e.g. when an article does not have a title
            try:
                title = soup_unprocessed.title.string
                title = title.encode("utf-8")
                # If the article is from a specific provider, handle it according to the following rules:
                # e.g. in the case of TownHall, we only want html which has the full article rather than a paged version
                if name == "TownHall" and html.find("View Full Article") != -1:
                    process_url(url + '/page/full', url_mod, string_start, find_string, start_article_delim, end_article_delim, name, pol)
                    usock.close()
                    return
                elif name == "Slate" and html.find('<div class="single-page">') != -1:
                    process_url(url[0:url.find('.html')] + '.single.html', url_mod, string_start, find_string, start_article_delim, end_article_delim, name, pol)
                    usock.close()
                    return
                elif name == "Mother Jones" and (title.find('Issue') != -1 or title.find('map') != -1):
                    print 'Content not text, or article not relevant.\n'
                    usock.close()
                    return
                # note the parentheses: without them, the title checks would apply to every publication
                elif name == "The American Conservative" and (html.find('Author Archives') != -1 or title.find('Web Headlines') != -1 or title.find('Articles') != -1):
                    print 'Content not text, or article not relevant.\n'
                    usock.close()
                    return
                elif name == "The Christian Science Monitor" and (title.find('+video') != -1 or url.find('The-Culture') != -1 or title.find('Photos of the day') != -1 or title.find('How much do you know about') != -1):
                    print 'Content not text, or article not relevant.\n'
                    usock.close()
                    return
                elif name == "The Blaze" and title.find('Video') != -1:
                    print 'Content not text, or article not relevant.\n'
                    usock.close()
                    return
                elif name == "Slate" and (url.find('video') != -1 or url.find('podcast') != -1):
                    print 'Content not text, or article not relevant.\n'
                    usock.close()
                    return
                elif name == "The Washington Post" and (url.find('live') != -1 or html.find('posttv-video-template') != -1):
                    print 'Content not text, or article not relevant.\n'
                    usock.close()
                    return
                else:
                    print title + '\n'
                    content = str(html)[str(html).find(start_article_delim) + len(start_article_delim):str(html).find(end_article_delim)]
                    soup = BeautifulSoup(content)
                    # remove JS and CSS from the html
                    for script in soup(["script", "style"]):
                        script.extract()
                    mod_content = soup.get_text()
                    # check if the article has already been processed
                    with open('articles_scraped.pickle', 'rb') as handle:
                        article_list = pickle.load(handle)
                    if title in article_list:
                        print '\n ALREADY IN SET \n'
                        return
                    else:
                        with open('articles_scraped.pickle', 'wb') as handle:
                            article_list.append(title)
                            pickle.dump(article_list, handle)
                        mod_content = mod_content.encode("utf-8")
                        ascii_title = unicode(title, 'ascii', 'ignore')
                        ascii_content = unicode(mod_content, 'ascii', 'ignore')
                        article_writer.writerow([name, pol, ascii_title, ascii_content])
            except AttributeError:
                print 'Attribute Error'
            usock.close()
In [5]:
def scrape(start_url, iterator, url_list, url_mod):
    """
    Arguments: string start_url, int iterator, set url_list, string url_mod
    Returns: updated set url_list
    For each start_url passed, this function opens that url and retrieves all hrefs from that page.
    If iterator does not exceed the constant ITER_PAGES, scrape is called recursively
    on the set of urls.
    """
    # limit the number of articles retrieved by limiting page depth
    if iterator > ITER_PAGES:
        return set()
    # call URLLister to retrieve all hrefs in the html read from the url
    usock = urllib.urlopen(start_url)
    parser = URLLister()
    html = usock.read()
    parser.feed(html)
    usock.close()
    parser.close()
    url_list = url_list.union(parser.urls)
    # scrape each correctly formatted url
    for url in url_list:
        if len(url) > 0 and str(url)[0] == '/':
            tmp_url = url_mod + str(url)
            try:
                tmp_set = scrape(tmp_url, iterator + 1, url_list, url_mod)
                url_list = url_list.union(tmp_set)
            except IOError:
                print "Couldn't Open"
    return url_list
In [6]:
ITER_PAGES = 1
# pickle setup
articles_to_pickle = []
try:
    with open('articles_scraped.pickle', 'rb') as handle:
        articles_to_pickle = pickle.load(handle)
except IOError:
    with open('articles_scraped.pickle', 'wb') as handle:
        pickle.dump(articles_to_pickle, handle)
# call the scrape function on each source's homepage
url_list = set()
american_conservative_list = scrape('http://www.theamericanconservative.com/', 1, url_list, '')
townhall_list = scrape('http://www.townhall.com/', 1, url_list, 'http://www.townhall.com')
csmonitor_list = scrape('http://www.csmonitor.com/', 1, url_list, 'http://www.csmonitor.com/')
weekly_std_list = scrape('http://www.weeklystandard.com/', 1, url_list, 'http://www.weeklystandard.com/')
spectator_list = scrape('http://spectator.org/', 1, url_list, 'http://spectator.org/')
mother_jones_list = scrape('http://www.motherjones.com/', 1, url_list, 'http://www.motherjones.com')
nation_list = scrape('http://www.thenation.com/', 1, url_list, 'http://www.thenation.com/')
slate_list = scrape('http://www.slate.com/', 1, url_list, '')
newyorker_list = scrape('http://www.newyorker.com/', 1, url_list, '')
wp_list = scrape('http://www.washingtonpost.com/', 1, url_list, '')
# Set this flag to True above if you want to add to the content!
if GET_NEW_ARTICLES:
    # conservative 1
    for url in american_conservative_list:
        process_url(url, '', 'http://www.theamericanconservative.com/', 'theamericanconservative', '<div class="post-content">', '<footer id="articlefooter">', 'The American Conservative', 'C')
    # conservative 2
    for url in townhall_list:
        process_url(url, 'http://www.townhall.com', 'argument-not-used', '2014', '<ul class="breadcrumb">', '<hr class="article-divider"', 'TownHall', 'C')
    # conservative 3
    for url in csmonitor_list:
        process_url(url, 'http://www.csmonitor.com/', 'argument-not-used', '2014/', '<div id="story-body"', '<span id="end-of-story"', 'The Christian Science Monitor', 'C')
    # conservative 4
    for url in weekly_std_list:
        process_url(url, 'http://www.weeklystandard.com/', 'argument-not-used', '/articles/', ' <div class="all_article">', '<div class="article-footer">', 'The Weekly Standard', 'C')
    # conservative 5
    for url in spectator_list:
        process_url(url, 'http://spectator.org/', 'argument-not-used', '/articles/', 'target="_blank" rel="nofollow">', '</iframe></div><div class="field-item even"><p class="label">', 'The American Spectator', 'C')
    # liberal 1
    for url in mother_jones_list:
        process_url(url, 'http://www.motherjones.com', 'argument-not-used', '2014', '<div id="node-header" class="clear-block">', '<div id="node-footer" class="clear-block">', 'Mother Jones', 'L')
    # liberal 2
    for url in nation_list:
        process_url(url, 'http://www.thenation.com/', 'argument-not-used', '/article/', '<div class="field field-type-text field-field-image-caption">', '</p></div><div class="views-field-value byline">', 'The Nation', 'L')
    # liberal 3
    for url in slate_list:
        process_url(url, '', 'http://www.slate.com/', '/2014/', '<div class="text text-1 parbase section">', '<section class="about-the-author', 'Slate', 'L')
    # liberal 4 - slightly more complex
    for url in newyorker_list:
        if url.find('//') == 0 or len(url) < 45:
            print 'URL badly formatted'
            # do nothing for this url
        else:
            process_url(url, '', 'http://www.newyorker.com/', 'news', '<div itemprop="articleBody" class="articleBody" id="articleBody">', '<span class="dingbat">', 'The New Yorker', 'L')
            process_url(url, '', 'http://www.newyorker.com/', 'magazine/2014', '<div itemprop="articleBody" class="articleBody" id="articleBody">', '<span class="dingbat">', 'The New Yorker', 'L')
    # liberal 5
    for url in wp_list:
        process_url(url, '', 'http://www.washingtonpost.com/', '/2014/', '<div id="article-body" class="article-body">', '</article>', 'The Washington Post', 'L')
In [7]:
CHAR_LEN_BOUND = 1500
fd = open('articles.csv', 'rb')
raw_rows = []
reader = csv.reader(fd, delimiter=',')
for reading in reader:
    raw_rows.append(reading)

def basic_strip(article):
    """
    Arguments: string article
    Returns: string
    When passed article text, this function looks for words which are on its 'exclude' list; these
    words include those which ask users to share articles on social media. Comparisons are done in
    lower case. Only the sentences which do not include words from the 'exclude' list are returned
    in the cleaned string.
    """
    exclude = ["facebook", "digg", "tagged", "recommended", "stumbleupon", "share", "blogs", "user agreement", "subscription", "login", "twitter", "topics", "excel", "accessed", "check out", "tweet", "|", "see also", "e-mail", "strongbox",
               "ad choices", "photograph", "about us", "faq", "careers", "view all", "app", "sign in", "contact us", "comment", "follow", "@", "http", "posted", "update", "staff writer", "editor", "advertisement", "clearfix", "eza"]
    article = re.sub('[ \t\n]+', ' ', article)
    sentences = article.split('.')
    new_sentences = []
    for sen in sentences:
        clean = True
        for word in exclude:
            if word in sen.lower():
                clean = False
        if clean:
            new_sentences.append(sen)
    outstr = ""
    for sen in new_sentences:
        if len(sen) > 5:
            outstr += sen.strip() + '. '
    return outstr

rows = []
for i in range(0, len(raw_rows)):
    article_text = basic_strip(raw_rows[i][3])
    if len(article_text) > CHAR_LEN_BOUND:
        rows.append([raw_rows[i][0], raw_rows[i][1], raw_rows[i][2], article_text])
print 'Finished applying the basic_strip function to text.'
For this part of the project, we analyzed several different types of classifiers: random forests, a multinomial naive Bayes classifier, and a traditional SVM. We ultimately decided to use a gradient boosting classifier, first and foremost because this type of classifier is relatively robust to overfitting, which we further control by holding out every other scraped article as test data and training only on the rest.
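An equivalent random hold-out split could be written as follows. This is a sketch only (the next cell uses the deterministic every-other-article split, which serves the same purpose given the arbitrary order in which articles were scraped), and the _alt names are hypothetical:

import random

random.seed(109)            # fix the seed so the split is reproducible
shuffled_rows = rows[:]     # copy so the original row order is preserved
random.shuffle(shuffled_rows)
split_point = len(shuffled_rows) // 2
train_rows_alt = shuffled_rows[:split_point]
test_rows_alt = shuffled_rows[split_point:]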
In [8]:
Xtrain, Ytrain = [], []
Xtest, Ytest = [], []

# function for converting CountVectorizer output (a sparse row) to a plain list for classification
def csr_2_list(csr):
    ints = [csr[0, i] for i in range(0, csr.shape[1])]
    return ints

num_libarticles, num_consarticles = 0, 0
for row in rows:
    if row[1] == 'C':
        num_consarticles += 1
    else:
        num_libarticles += 1

print "***VECTORIZING DOCUMENTS***"
test_rows = rows[::2]
train_rows = rows[1::2]
stop_words = text.ENGLISH_STOP_WORDS.union(['said'])
# construct a CountVectorizer and give it training data
tk_train = text.CountVectorizer(max_features=2400, stop_words=stop_words)
text_doc_matrix_train = tk_train.fit_transform([row[3] for row in train_rows])
# construct another CountVectorizer whose vocabulary is fixed to the training set's vocab
tk_test = text.CountVectorizer(max_features=2400, stop_words=stop_words, vocabulary=tk_train.vocabulary_)
text_doc_matrix_test = tk_test.fit_transform([row[3] for row in test_rows])
for i in range(0, text_doc_matrix_train.shape[0]):
    Xtrain.append(csr_2_list(text_doc_matrix_train[i]))
    Ytrain.append(train_rows[i][1])
for i in range(0, text_doc_matrix_test.shape[0]):
    Xtest.append(csr_2_list(text_doc_matrix_test[i]))
    Ytest.append(test_rows[i][1])
print ">>>DONE VECTORIZING DOCUMENTS<<<\n"

print "***TRAINING CLASSIFIER***"
# define and fit the classifier
clf = GradientBoostingClassifier(n_estimators=800, max_depth=5)
clf.fit(Xtrain, Ytrain)
print ">>>DONE TRAINING CLASSIFIER<<<\n"

print "***DUMPING CLASSIFIER***"
fd = open('./web/classifier', 'wb')
cPickle.dump(clf, fd, cPickle.HIGHEST_PROTOCOL)
fd.close()
fd = open('./web/vocab', 'wb')
cPickle.dump(tk_train.vocabulary_, fd, cPickle.HIGHEST_PROTOCOL)
fd.close()
print ">>>DONE DUMPING CLASSIFIER<<<\n"
What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?
We first look at all of our test data and compare the political slant the classifier predicts for each article to the known slant of its publisher.
After looking at the classification directly, we then look at the precision, recall, and F1-score for our classifier. All of these values tend to be around 80%.
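For reference, with TP, FP, and FN denoting true positives, false positives, and false negatives for a given class, these metrics are the standard definitions:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$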
Having looked at the precision, recall, and F1-scores for our classifier, we then consider the data we scraped from the ten liberal and conservative sources. We visualize this data by creating a word cloud of all of the words (not including stop words) from all of the articles. We then create another word cloud, this time looking at the most important features to the classifier.
In [9]:
successes, trials = 0, 0
# predict on the whole test set once, then reuse the batch predictions
predicted = clf.predict(Xtest)
for i in range(0, len(Xtest)):
    print "FOR: " + test_rows[i][2] + "\n CLF SAID: " + predicted[i] + " ACTUALLY: " + Ytest[i]
    if Ytest[i] == predicted[i]:
        successes += 1
    trials += 1
print "\n The classifier was %.2f%% accurate." % (float(successes) / trials * 100)
print "\n %d liberal articles, %d conservative articles." % (num_libarticles, num_consarticles)
print "\n"
In [10]:
print (metrics.classification_report(Ytest, predicted))
In [11]:
from wordcloud import STOPWORDS
# cite: https://github.com/amueller/word_cloud
mywcstr = ""
for row in test_rows:
    mywcstr += row[3] + ' '
for row in train_rows:
    mywcstr += row[3] + ' '
our_stopword_list = ('one', 'said', 'year', 'will', 'take', 'new', 'say')
our_stopwords = STOPWORDS.union(our_stopword_list)
wordcloud = WordCloud(background_color="white", max_words=100, stopwords=our_stopwords).generate(mywcstr)
plt.figure(figsize=(14, 16))
plt.title("What's in the news? \n Words by Frequency", fontsize=20)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
In [12]:
# word cloud of the gradient boosting classifier's feature importances
features = clf.feature_importances_
vocab = tk_train.vocabulary_
vocabInvDict = {v: k for k, v in vocab.items()}
outstr = ""
for i in range(0, len(features)):
    # look up feature i in the vocabulary and repeat the word in proportion to its importance
    for _ in range(0, int(4000 * features[i])):
        outstr += ' ' + vocabInvDict[i] + ' '
wordcloud = WordCloud(background_color="black", max_words=100, stopwords=['said', 'advertisement', 'photo']).generate(outstr)
plt.figure(figsize=(14, 16))
plt.title("Most Useful Features For Classification of News Articles", fontsize=20)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
print "In the word cloud above, the larger the word, the more important the feature."
In [13]:
freqs, feats = [], []
for i in range(0, len(features)):
    # look up feature i in the vocabulary
    freqs.append(features[i])
    feats.append(str(vocabInvDict[i]))
pairs = zip(feats, freqs)
pairs.sort(key=lambda elem: elem[1], reverse=True)
top_10_features = pairs[:10]
_ = plt.figure(figsize=(10, 10))
graph = plt.bar(range(0, len(top_10_features)), [elt[1] for elt in top_10_features], color="green")
_ = plt.title('Feature versus Level of Importance', fontsize=15)
_ = plt.ylabel('Feature Level of Importance')
_ = plt.xlabel('Feature')
_ = plt.xticks(np.arange(0.5, 10.5, 1), [elt[0] for elt in top_10_features], rotation=30)
In [14]:
import random

# cite: PS5 distribution code
def plot_decision_surface(clf, X_train, Y_train, title, y_l, x_l, os1, os2):
    plot_step = 0.1
    if X_train.shape[1] != 2:
        raise ValueError("X_train should have exactly 2 columns!")
    x_min, x_max = X_train[:, 0].min() - plot_step, X_train[:, 0].max() + plot_step
    y_min, y_max = X_train[:, 1].min() - plot_step, X_train[:, 1].max() + plot_step
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    clf.fit(X_train, Y_train)
    if hasattr(clf, 'predict_proba'):
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    else:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Reds)

    # add a small random jitter so that overlapping integer-valued points are distinguishable
    def offset(f):
        return (random.random() * f - f / 2)

    plt.scatter([x + offset(os1) for x in X_train[:, 0]],
                [x + offset(os2) for x in X_train[:, 1]],
                c=Y_train, cmap=plt.cm.Paired)
    # labelling functionality
    _ = plt.title(title, fontsize=20)
    _ = plt.xlabel(x_l, fontsize=16)
    _ = plt.ylabel(y_l, fontsize=16)
    plt.show()
In [19]:
best_col = [row[vocab[top_10_features[0][0]]] for row in Xtrain]
second_best_col = [row[vocab[top_10_features[1][0]]] for row in Xtrain]
third_best_col = [row[vocab[top_10_features[2][0]]] for row in Xtrain]
fourth_best_col = [row[vocab[top_10_features[3][0]]] for row in Xtrain]
clf_new = GradientBoostingClassifier(n_estimators=800, max_depth=5)
clf_new_ = GradientBoostingClassifier(n_estimators=800, max_depth=5)
best_feats_Xtrain = np.matrix([best_col, second_best_col]).transpose()
good_feats_Xtrain = np.matrix([third_best_col, fourth_best_col]).transpose()
Ytrain2 = []
for elt in Ytrain:
    if elt == 'L':
        Ytrain2.append("Blue")
    elif elt == 'C':
        Ytrain2.append("Red")

def ft(s):
    return "Occurrences of feature " + "'" + s + "'"

_ = plt.figure(figsize=(10, 8))
plot_decision_surface(clf_new, best_feats_Xtrain, Ytrain2, "Decision Surface For Top 2 Features",
                      ft(top_10_features[0][0]), ft(top_10_features[1][0]), .1, 0)
_ = plt.figure(figsize=(10, 8))
# use the second (fresh) classifier for the second surface
plot_decision_surface(clf_new_, good_feats_Xtrain, Ytrain2, "Decision Surface For 2nd Best Feature Pair",
                      ft(top_10_features[2][0]), ft(top_10_features[3][0]), .33, 0)
Here we've looked at the decision surfaces that result from plotting the probabilistic predictions generated from the first and second best pairs of features in our feature vectors, as determined by the feature_importances_ attribute of the Gradient Boosting Classifier we used to separate liberal from conservative articles. Red areas in the graphs above are conservative prediction regions, while lighter colored regions are liberal. We see some interesting trends here, ranging from the unsurprising (repeated usage of the word "foreign" is correlated with conservatism) to the more subtle (for example, "time" appearing more frequently in conservative articles). Since our vectors have integer components, a small random horizontal offset was added to the point values in both graphs. This lets the viewer gauge the quantity of articles at a particular point, as well as the distribution of liberal versus conservative articles that appear at that point.
On the website that corresponds to this project, we will have a text box where a user can enter article text, and we will classify it for them on the spot. The code for this on-demand classification is included here; it is wrapped in a string because the script cannot be run from within the IPython Notebook.
In [16]:
_ = """
from sklearn.feature_extraction import text
import sys, pickle, cPickle
article_to_classify = ""
if __name__ == "__main__":
for line in sys.stdin:
article_to_classify += line
try:
fd = open('classifier', 'rb')
except:
print "classifier failed to load"
sys.exit()
clf = cPickle.load(fd)
fd.close()
try:
vfd = open('vocab', 'rb')
except:
print "vocab failed to load"
sys.exit()
vocab = cPickle.load(vfd)
vfd.close()
tk = text.CountVectorizer(max_features=2400, stop_words='english', vocabulary = vocab)
my_row = tk.fit_transform([article_to_classify])
def csr_2_list(csr):
ints = [csr[0,i] for i in range(0, csr.shape[1]) ]
return ints
print clf.predict(csr_2_list(my_row))
"""
What did you learn about the data? How did you answer the questions? How can you justify your answers?
In analyzing our data, we saw many interesting results regarding the type of language used by Conservative and Liberal publications.
In order to learn about our data, we first took a look at the most frequently used words across all publications, Conservative or Liberal, ignoring stop words and other words that aren't meaningful to the reader. We saw words like "people", "time", "police", "government", "American", and "black". These results are expected, as they reflect the protest movement that followed the grand jury decisions in the recent Ferguson and Eric Garner cases.
Having looked at the most frequently used words across all news, we created a classifier for determining the political slant of news publications, and then analyzed the features that classifier relied on. We created another word cloud of these features, in which the features of highest importance appear largest. In this cloud, we see words like "photos", "stocks", "time", "guilty", "voters", "foreign", "control", "obama", "income", "taxes", and "god". These are, as expected, charged and emotional in most cases, but only for a few of them (e.g. "obama") would it be immediately clear that a trend should be expected.
After creating these two word clouds, we plotted the relative importance of the ten most important features as a bar chart. The most notable feature on that chart is the word "Mr.". It's an oddly simple word, yet its use, seemingly an effort to be respectful, appears to be characteristic of a particular political leaning.
Finally, we looked at some decision surfaces for our classifier. These are explained in detail below the graphs, but the fundamental idea is that each red dot represents a Conservative article in our training set, and each blue dot a Liberal article. These are scattered atop a color contour map based on the decision the classifier would make for the given feature values. The redder the region, the more confidently the classifier predicts Conservative for that combination of the two plotted features.