In this project we're analyzing news headlines written by two journalists, a finance reporter from Business Insider and a celebrity reporter from The Huffington Post, to find similarities and differences in the way these authors write headlines for their news articles and blog posts. Our selected reporters are:
We'll first collect and parse news headlines from each author to obtain their parse trees. Then we'll extract information from these parse trees that is indicative of the overall structure of each headline.
Next, we will define a simple sequence similarity metric to compare any pair of headlines quantitatively, and we will apply the same method to all of the headlines we've gathered for each author, to find out how similar each pair of headlines is.
Finally, we're going to use K-Means and t-SNE to produce a visual map of all the headlines, where we can see the similarities and the differences between the two authors more clearly.
For this project we've used the AYLIEN News API to gather 700 headlines for each author, which we're going to analyze using Python. You can obtain the pickled data files directly from the GitHub repository, or by using the data collection notebook that we've prepared for this project.
We're going to use the Pattern library for Python to parse the headlines and create parse trees for them:
In [1]:
from pattern.en import parsetree
Let's see an example:
In [2]:
s = parsetree('The cat sat on the mat.')
for sentence in s:
    for chunk in sentence.chunks:
        print chunk.type, [(w.string, w.type) for w in chunk.words]
In [3]:
import cPickle as pickle
author1 = pickle.load( open( "author1.p", "rb" ) )
author1[0]
Out[3]:
In [4]:
for story in author1:
    story["title_length"] = len(story["title"])
    story["title_chunks"] = [chunk.type for chunk in parsetree(story["title"])[0].chunks]
    story["title_chunks_length"] = len(story["title_chunks"])
In [5]:
author1[0]
Out[5]:
Let's see what the numeric attributes for headlines written by this author look like. We're going to use Pandas for this.
In [6]:
import pandas as pd
df1 = pd.DataFrame.from_dict(author1)
In [7]:
df1.describe()
Out[7]:
From this information, we're going to extract the chunk type sequence of each headline (i.e. the first level of the parse tree) and use it as an indicator of the overall structure of the headline. So in the above example, we would extract and use the following sequence of chunk types in our analysis:
['NP', 'PP', 'NP', 'VP']
We have loaded all the headlines written by the first author, and created and stored their parse trees. Next, we need a similarity metric that, given two chunk type sequences, tells us how similar the two headlines are from a structural perspective.
For that we're going to use the SequenceMatcher class of difflib, which produces a similarity score between 0 and 1 for any two sequences (Python lists):
In [8]:
import difflib
print "Similarity scores for...\n"
print "Two identical sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["A","B","C"]).ratio()
print "Two similar sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["A","B","D"]).ratio()
print "Two completely different sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["X","Y","Z"]).ratio()
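For reference, the difflib documentation defines `ratio()` as 2.0*M/T, where T is the total number of elements in both sequences and M is the number of matching elements. The sketch below recomputes that value by hand from `get_matching_blocks()`. The helper name `manual_ratio` is ours and exists only for illustration.

```python
import difflib

def manual_ratio(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # M: total number of elements covered by the matching blocks
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of elements in both sequences
    total = len(a) + len(b)
    # difflib returns 1.0 when both sequences are empty
    return 2.0 * matches / total if total else 1.0

a = ["A", "B", "C"]
b = ["A", "B", "D"]
# Two matching elements out of six in total: 2*2/6
print(manual_ratio(a, b))
```

For the "similar" pair above, two of the three positions match, so the score is 2*2/6, roughly 0.67.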
Now let's see how that works with our chunk type sequences, for two arbitrarily chosen headlines from the first author:
In [9]:
v1 = author1[3]["title_chunks"]
v2 = author1[1]["title_chunks"]
print v1, v2, difflib.SequenceMatcher(None,v1,v2).ratio()
In [10]:
import numpy as np
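# One chunk type sequence per headline
chunks = [story["title_chunks"] for story in author1]
# Size the matrix from the data instead of hard-coding 700
n = len(chunks)
m = np.zeros((n, n))
# m[i][j] holds the similarity score between headlines i and j
for i, chunkx in enumerate(chunks):
    for j, chunky in enumerate(chunks):
        m[i][j] = difflib.SequenceMatcher(None,chunkx,chunky).ratio()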
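Many headlines share the same chunk type sequence, so the nested loop above computes the same score repeatedly. A minimal sketch of one way to avoid that is to cache the score for each distinct ordered pair of sequences. Both orders are kept because the difflib documentation notes that `ratio()` may depend on argument order. The function name and the toy sequences are hypothetical, not part of the original notebook.

```python
import difflib

def similarity_matrix(sequences):
    """Pairwise SequenceMatcher ratios as a list of rows (hypothetical helper)."""
    # Tuples are hashable, so they can serve as cache keys
    keys = [tuple(seq) for seq in sequences]
    cache = {}
    matrix = []
    for x in keys:
        row = []
        for y in keys:
            # Cache per ordered pair: ratio() may depend on argument order
            if (x, y) not in cache:
                cache[(x, y)] = difflib.SequenceMatcher(None, x, y).ratio()
            row.append(cache[(x, y)])
        matrix.append(row)
    return matrix

# Toy chunk type sequences, not taken from the data set
toy = [["NP", "VP", "NP"], ["NP", "VP", "PP", "NP"], ["NP", "VP", "NP"]]
for row in similarity_matrix(toy):
    print(row)
```

The result is a plain list of rows. Wrapping it as `np.array(similarity_matrix(chunks))` should give an array of the same shape as `m`.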
To make things clearer and more understandable, let's try and put all the headlines written by the first author on a 2d scatter plot, where similarly structured headlines are close together.
For that we're going to first use t-SNE to reduce the dimensionality of our similarity matrix from 700 down to 2:
In [11]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
In [12]:
tsne = tsne_model.fit_transform(m)
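The cell above treats each row of the similarity matrix as a 700-dimensional feature vector. An alternative that this post does not use is to convert similarities into distances (1 - similarity) and pass them to t-SNE directly with `metric="precomputed"`. The sketch below assumes a scikit-learn version that supports precomputed distances, which requires `init="random"`. The helper name and the toy matrix are made up for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_from_similarity(sim, perplexity=3.0, seed=0):
    """Embed items in 2D from a similarity matrix (hypothetical helper)."""
    sim = np.asarray(sim, dtype=float)
    # Symmetrize, because SequenceMatcher ratios can depend on argument order
    sim = (sim + sim.T) / 2.0
    # Turn similarities in [0, 1] into distances in [0, 1]
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # perplexity must stay below the number of items
    model = TSNE(n_components=2, metric="precomputed", init="random",
                 perplexity=perplexity, random_state=seed)
    return model.fit_transform(dist)

# Toy similarity matrix for 12 items in two groups (illustrative values only)
labels = np.array([0] * 6 + [1] * 6)
toy_sim = np.where(labels[:, None] == labels[None, :], 0.9, 0.1)
np.fill_diagonal(toy_sim, 1.0)
print(embed_from_similarity(toy_sim).shape)
```

Whether this gives a more useful map than the feature-vector approach is something to check visually on the real data.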
And to add a bit of color to our visualization, let's use K-Means to identify 5 clusters of similar headlines, which we will use in our visualization:
In [13]:
from sklearn.cluster import MiniBatchKMeans
kmeans_model = MiniBatchKMeans(n_clusters=5, init='k-means++', n_init=1,
                               init_size=1000, batch_size=1000, verbose=False, max_iter=1000)
kmeans = kmeans_model.fit(m)
kmeans_clusters = kmeans.predict(m)
kmeans_distances = kmeans.transform(m)
Finally let's plot the actual chart using Bokeh:
In [14]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])
output_notebook()
plot_author1 = bp.figure(plot_width=900, plot_height=700, title="Author1",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)
plot_author1.scatter(x=tsne[:,0], y=tsne[:,1],
                    color=colormap[kmeans_clusters],
                    source=bp.ColumnDataSource({
                        "chunks": [x["title_chunks"] for x in author1],
                        "title": [x["title"] for x in author1],
                        "cluster": kmeans_clusters
                    }))
hover = plot_author1.select(dict(type=HoverTool))
hover.tooltips={"chunks": "@chunks (title: \"@title\")", "cluster": "@cluster"}
show(plot_author1)
Out[14]:
The above interactive chart shows a number of dense groups of headlines, as well as some sparse ones. Some of the dense groups that stand out more are:
If you look closely you will find other interesting groups, as well as their similarities and dissimilarities when compared to their neighbors.
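One way to put names on such groups without hovering over every point is to list the most common chunk type sequence in each K-Means cluster. This is a minimal sketch with toy inputs. The helper name is hypothetical and is not part of the original notebook.

```python
from collections import Counter, defaultdict

def top_pattern_per_cluster(chunk_sequences, cluster_labels):
    """Most common chunk sequence in each cluster (hypothetical helper)."""
    by_cluster = defaultdict(Counter)
    for seq, label in zip(chunk_sequences, cluster_labels):
        # int() also accepts NumPy integer labels such as those from predict()
        by_cluster[int(label)][tuple(seq)] += 1
    # Map each cluster to its (pattern, count) pair with the highest count
    return {label: counts.most_common(1)[0] for label, counts in by_cluster.items()}

# Toy inputs, not taken from the data set
seqs = [["NP", "VP"], ["NP", "VP"], ["NP", "VP", "NP"], ["NP", "PP"], ["NP", "PP"]]
labels = [0, 0, 0, 1, 1]
for label, (pattern, count) in sorted(top_pattern_per_cluster(seqs, labels).items()):
    print("cluster %d: %s (%d headlines)" % (label, " ".join(pattern), count))
```

With the notebook's variables, the call would be `top_pattern_per_cluster([x["title_chunks"] for x in author1], kmeans_clusters)`.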
In [15]:
author2 = pickle.load( open( "author2.p", "rb" ) )
for story in author2:
    story["title_length"] = len(story["title"])
    story["title_chunks"] = [chunk.type for chunk in parsetree(story["title"])[0].chunks]
    story["title_chunks_length"] = len(story["title_chunks"])
In [16]:
pd.DataFrame.from_dict(author2).describe()
Out[16]:
The basic stats don't show a significant difference between the headlines written by the two authors.
In [17]:
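# Chunk type sequences for both authors: author1 first, then author2
chunks_joint = [story["title_chunks"] for story in (author1+author2)]
# Size the matrix from the data instead of hard-coding 1400
n_joint = len(chunks_joint)
m_joint = np.zeros((n_joint, n_joint))
for i, chunkx in enumerate(chunks_joint):
    for j, chunky in enumerate(chunks_joint):
        sm = difflib.SequenceMatcher(None,chunkx,chunky)
        m_joint[i][j] = sm.ratio()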
Now that we have analyzed the headlines for the second author, let's see how many common patterns exist between the two authors:
In [18]:
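# Despite their names, set1 and set2 are lists with one chunk sequence per headline
set1 = [story["title_chunks"] for story in author1]
set2 = [story["title_chunks"] for story in author2]
# Headlines by author 1 whose exact chunk sequence also appears for author 2
list_new = [itm for itm in set1 if itm in set2]
len(list_new)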
Out[18]:
We observe that about 50% (347 of 700) of the first author's headlines have a chunk type sequence that also appears among the second author's headlines.
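Note that this count is over headlines, so a pattern that the first author uses often is counted once per headline. A minimal sketch that separates the headline-level count from the number of distinct shared patterns is shown below. It uses tuples so that sequences can be stored in sets. The function name and the toy data are hypothetical.

```python
def pattern_overlap(seqs_a, seqs_b):
    """Compare exact chunk sequences between two authors (hypothetical helper)."""
    # Lists are not hashable, so convert each sequence to a tuple
    patterns_a = set(tuple(s) for s in seqs_a)
    patterns_b = set(tuple(s) for s in seqs_b)
    shared = patterns_a & patterns_b
    # Headlines by the first author whose pattern the second author also uses
    headlines_shared = sum(1 for s in seqs_a if tuple(s) in shared)
    return {
        "shared_patterns": len(shared),
        "distinct_patterns_a": len(patterns_a),
        "headlines_a_with_shared_pattern": headlines_shared,
    }

# Toy data, not taken from the data set
toy_a = [["NP", "VP"], ["NP", "VP"], ["NP", "PP"]]
toy_b = [["NP", "VP"], ["VP", "NP"]]
print(pattern_overlap(toy_a, toy_b))
```

In the toy data the authors share one distinct pattern, but two of the first author's three headlines use it, which is the kind of number reported above.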
In [19]:
tsne_joint = tsne_model.fit_transform(m_joint)
In [20]:
plot_joint = bp.figure(plot_width=900, plot_height=700, title="Author1 vs. Author2",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)
plot_joint.scatter(x=tsne_joint[:,0], y=tsne_joint[:,1],
                    color=colormap[([0] * 700 + [3] * 700)],
                    source=bp.ColumnDataSource({
                        "chunks": [x["title_chunks"] for x in author1] + [x["title_chunks"] for x in author2],
                        "title": [x["title"] for x in author1] + [x["title"] for x in author2]
                    }))
hover = plot_joint.select(dict(type=HoverTool))
hover.tooltips={"chunks": "@chunks (title: \"@title\")"}
show(plot_joint)
Out[20]:
Here we observe the same dense and sparse patterns, as well as groups of points that are somewhat unique to each author, or shared by both authors.
We're sure you can find more interesting observations by looking closely at the above chart.
In this project we've shown how one can retrieve and analyze news headlines, evaluate their structure and similarity, and build an interactive map to visualize them clearly.
Some of the weaknesses of our approach, and ways to improve upon them are:
In a future post, we're going to study the correlations between various headline structures and some external metrics such as the number of Shares and Likes on Social Media platforms, and see if we can uncover any interesting patterns.