In [1]:
import pandas as pd
from datetime import datetime
import dateutil
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import re
from urllib.parse import urlparse
import json
In [2]:
data = pd.read_csv('../data/in/native_ad_data.csv')
In [3]:
data.head()
Out[3]:
As a side note, the headlines from zergnet all contain newlines we need to get rid of, and the headline appears to have been concatenated with the provider name. Let's clean those up.
In [4]:
# Trim whitespace and drop the concatenated provider text (everything from the first capital letter that follows a lowercase letter)
data['headline'] = data['headline'].apply(lambda x: re.sub(r'(?<=[a-z])\.?([A-Z](.*))', '', x.strip()))
data.head()
Out[4]:
OK, that's better.
The img_file column values also have ./imgs/ prefixed to each file name. Let's get rid of that:
In [5]:
# Strip the ./imgs/ path prefix from each image file name
data['img_file'] = data['img_file'].apply(lambda x: re.sub(r'\./imgs/', '', str(x).strip()))
Now, let's check, do we have any null values?
In [6]:
for col in data.columns:
    print((col, sum(data[col].isnull())))
For now, only the orig_article column has nulls, as we had not collected those consistently.
In [7]:
data.describe()
Out[7]:
Already we can see some interesting trends here. Out of 129399 records, only 18022 of the headlines are unique, while 43315 of the links are unique and 23866 of the image files are unique (though some of that gap is certainly down to issues downloading images).
So it already seems that the same headline or image may be reused for different destination articles.
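As a rough check on that idea (just a sketch, not something we ran as a cell), we could count how many distinct final links each headline points to:

# For each headline, count how many distinct destination links it points to;
# headlines pointing to more than one link suggest headline reuse across articles
links_per_headline = data.groupby('headline')['final_link'].nunique()
print(links_per_headline.sort_values(ascending=False).head(10))
print((links_per_headline > 1).sum(), "headlines point to more than one link")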
Also, because we want to inspect the hosts the articles and images are coming from, let's parse those out of the data.
In [8]:
data['img_host'] = data['img'].apply(lambda x: urlparse(x).netloc)
In [9]:
data['link_host'] = data['final_link'].apply(lambda x: urlparse(x).netloc)
Next, let's classify each site with a very relaxed set of tags based on perceived political bias. I might be a little off on some; I referenced https://www.allsides.com/ where possible, but that was not entirely helpful in all cases. Otherwise, I went with my own sense of where a site falls on the political spectrum (e.g., left, right, or center). There is also a tag for tabloids: primarily sites that probably don't have an editorial perspective so much as a desire to publish whatever gets the most traffic.
In [10]:
left = ['http://www.politico.com/magazine/', 'https://www.washingtonpost.com/', 'http://www.huffingtonpost.com/', 'http://gothamist.com/news', 'http://www.metro.us/news', 'http://www.politico.com/politics', 'http://www.nydailynews.com/news', 'http://www.thedailybeast.com/']
right = ['http://www.breitbart.com', 'http://www.rt.com', 'https://nypost.com/news/', 'http://www.infowars.com/', 'https://www.therebel.media/news', 'http://observer.com/latest/']
center = ['http://www.ibtimes.com/', 'http://www.businessinsider.com/', 'http://thehill.com']
tabloid = ['http://tmz.com', 'http://www.dailymail.co.uk/', 'https://downtrend.com/', 'http://reductress.com/', 'http://preventionpulse.com/', 'http://elitedaily.com/', 'http://worldstarhiphop.com/videos/']
In [11]:
def get_classification(source):
    # Returns None for any source we haven't tagged
    if source in left:
        return 'left'
    if source in right:
        return 'right'
    if source in center:
        return 'center'
    if source in tabloid:
        return 'tabloid'
In [12]:
data['source_class'] = data['source'].apply(get_classification)
In [13]:
data.head()
Out[13]:
Now let's remove duplicates based on a subset of the columns using pandas' drop_duplicates for DataFrames.
In [14]:
# Note: keep=False drops every copy of a duplicated row, rather than keeping one of each
deduped = data.drop_duplicates(subset=['headline', 'link', 'img', 'provider', 'source', 'img_file', 'final_link'], keep=False)
In [15]:
deduped.describe()
Out[15]:
And let's just check on those null values again...
In [16]:
for col in deduped.columns:
    print((col, sum(deduped[col].isnull())))
Out of curiosity, as we're only left with 43630 records after deduping, let's take a look at the rate of success for our record collection.
In [17]:
(43630/129399)*100
Out[17]:
Crud. Only about a third of our harvested sample is worth examining further.
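As an aside, the same figure could be computed directly from the two frames rather than hard-coding the counts; a quick sketch:

# Retention rate computed from the frames themselves instead of hard-coded counts
retention_pct = len(deduped) / len(data) * 100
print("{:.1f}% of records survive deduplication".format(retention_pct))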
Let's get the top 10 headlines grouped by img.
In [18]:
deduped['headline'].groupby(deduped['img']).value_counts().nlargest(10)
Out[18]:
But hang on, let's just see what the top headlines are. There's certainly overlap, but it's not a one-to-one relationship between headlines and their images (or at least it may be the same image coming from a different URL).
In [19]:
deduped['headline'].value_counts().nlargest(10)
Out[19]:
Note: perhaps something we will want to look into is how many different headline/image permutations there are. I am particularly interested in the reuse of images across different headlines.
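A rough way to quantify that (a sketch, not run here) is to count distinct headlines per image URL, and distinct image URLs per downloaded file:

# How many distinct headlines does each image URL appear with?
headlines_per_img = deduped.groupby('img')['headline'].nunique()
print(headlines_per_img.sort_values(ascending=False).head(10))

# Does the same downloaded file show up under more than one image URL?
urls_per_file = deduped.groupby('img_file')['img'].nunique()
print((urls_per_file > 1).sum(), "image files appear under multiple URLs")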
And how are our sources distributed?
In [20]:
deduped['source'].value_counts().nlargest(25)
Out[20]:
TMZ is a bit over-represented here.
And what about by classification?
In [21]:
deduped['source_class'].value_counts()
Out[21]:
Looks like the over-representation of TMZ is pushing up the tabloid count a bit. Not terribly even between left, right, and center, either.
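One quick way to gauge that (a sketch, assuming the TMZ source string is the 'http://tmz.com' value from our tabloid list) is to recompute the classification counts with TMZ excluded:

# Classification counts with TMZ removed, to see how much it skews the tabloid bucket
deduped[deduped['source'] != 'http://tmz.com']['source_class'].value_counts()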
Let's take a look at the sources again, broken down by both provider and our classification.
In [22]:
deduped.groupby(['provider', 'source_class'])['source'].value_counts()
Out[22]:
OK, so what are the most and least frequent images per classification?
In [23]:
IMG_MAX=5
In [24]:
topimgs_center = deduped['img'][deduped['source_class'].isin(['center'])].value_counts().nlargest(IMG_MAX).index.tolist()
In [25]:
bottomimgs_center = deduped['img'][deduped['source_class'].isin(['center'])].value_counts().nsmallest(IMG_MAX).index.tolist()
In [26]:
topimgs_left = deduped['img'][deduped['source_class'].isin(['left'])].value_counts().nlargest(IMG_MAX).index.tolist()
In [27]:
bottomimgs_left = deduped['img'][deduped['source_class'].isin(['left'])].value_counts().nsmallest(IMG_MAX).index.tolist()
In [28]:
topimgs_right = deduped['img'][deduped['source_class'].isin(['right'])].value_counts().nlargest(IMG_MAX).index.tolist()
In [29]:
bottomimgs_right = deduped['img'][deduped['source_class'].isin(['right'])].value_counts().nsmallest(IMG_MAX).index.tolist()
In [30]:
topimgs_tabloid = deduped['img'][deduped['source_class'].isin(['tabloid'])].value_counts().nlargest(IMG_MAX).index.tolist()
In [31]:
bottomimgs_tabloid = deduped['img'][deduped['source_class'].isin(['tabloid'])].value_counts().nsmallest(IMG_MAX).index.tolist()
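As an aside, those eight cells could be collapsed into a single loop over the classifications; a sketch, assuming the same deduped frame and IMG_MAX:

# Build the top/bottom image lists for every classification in one pass
top_bottom_imgs = {}
for cls in ['center', 'left', 'right', 'tabloid']:
    counts = deduped['img'][deduped['source_class'] == cls].value_counts()
    top_bottom_imgs[cls] = (counts.nlargest(IMG_MAX).index.tolist(),
                            counts.nsmallest(IMG_MAX).index.tolist())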
In [32]:
for i in topimgs_center:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
In [33]:
for i in bottomimgs_center:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
In [34]:
for i in topimgs_left:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
In [35]:
for i in bottomimgs_left:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
In [36]:
for i in topimgs_right:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
In [37]:
for i in bottomimgs_right:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
In [38]:
for i in topimgs_tabloid:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
In [39]:
for i in bottomimgs_tabloid:
    displaystring = '<img src="{}" width="200"/>'.format(i)
    display(HTML(displaystring))
Yawn! I have to admit this isn't as interesting as I thought it might be.
Next, let's explore trends over time. First we'll want to make a version of the DataFrame that is indexed by date.
In [40]:
deduped_date_idx = deduped.copy(deep=False)
In [41]:
deduped_date_idx['date'] = pd.to_datetime(deduped_date_idx.date)
In [42]:
deduped_date_idx.set_index('date',inplace=True)
Let's see what dates we're working with:
In [43]:
"Start: {} - End: {}".format(deduped_date_idx.index.min(), deduped_date_idx.index.max())
Out[43]:
Let's examine the distribution of the classifications over time.
In [44]:
deduped_date_idx['2017-03-01':'2017-07-07'].groupby('source_class').resample('M').size().plot(kind='bar')
Out[44]:
In [45]:
plt.show()
I think what we're mostly seeing here is that our scraper was most active during the month of June.
Let's see the same distribution for provider.
In [46]:
deduped_date_idx['2017-03-01':'2017-07-07'].groupby(['provider']).resample('M').size().plot(kind='bar')
Out[46]:
In [47]:
plt.show()
Same story here: our results are biased towards June.
What if we check all the results whose headlines mention certain people?
In [48]:
(deduped_date_idx[deduped_date_idx['headline'].str.contains('Trump')]['2017-03-01':'2017-07-07']).groupby('source_class').resample('M').size().plot(title="Headlines Containing 'Trump' By Month and Classification", kind='bar', color="pink")
Out[48]:
In [49]:
plt.show()
In [50]:
(deduped_date_idx[deduped_date_idx['headline'].str.contains('Clinton')]['2017-03-01':'2017-07-07']).groupby('source_class').resample('M').size().plot(title="Headlines Containing 'Clinton' By Month and Classification", kind='bar', color="gray")
Out[50]:
In [51]:
plt.show()
In [52]:
(deduped_date_idx[deduped_date_idx['headline'].str.contains('Hillary')]['2017-03-01':'2017-07-07']).groupby('source_class').resample('M').size().plot(title="Headlines Containing 'Hillary' By Month and Classification" ,kind='bar', color="gray")
Out[52]:
In [53]:
plt.show()
In [54]:
(deduped_date_idx[deduped_date_idx['headline'].str.contains('Obama')]['2017-03-01':'2017-07-07']).groupby('source_class').resample('M').size().plot(title="Headlines Containing 'Obama' By Month and Classification", kind='bar')
Out[54]:
In [55]:
plt.show()
Again, we're mostly seeing a trend in our data collection rather than in the content itself. There is an interesting pattern in that 'Trump' appears in far more tabloid headlines than we might expect. 'Obama' appears a lot in headlines on right-classified sites, but again this is for June, so it might just be an artifact of increased data collection. Finally, we see far more results for 'Hillary' than for 'Clinton', and most of those are on tabloid sites in April.
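One way to control for the uneven collection (a sketch, not run here) would be to divide each month's mention counts by the total number of records collected that month, so collection volume doesn't masquerade as a trend:

# Share of each month's records whose headline mentions 'Trump'
monthly_totals = deduped_date_idx['2017-03-01':'2017-07-07'].resample('M').size()
trump_monthly = (deduped_date_idx[deduped_date_idx['headline'].str.contains('Trump')]
                 ['2017-03-01':'2017-07-07']).resample('M').size()
(trump_monthly / monthly_totals).plot(kind='bar', title="Share of Headlines Mentioning 'Trump' by Month")
plt.show()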
And let's check out some bucketed headline trends: the most and least frequent headlines, both overall and for the various classifications.
In [56]:
(deduped_date_idx['2017-03-27':'2017-07-07'])['headline'].value_counts().nlargest(15)
Out[56]:
In [57]:
(deduped_date_idx['2017-03-27':'2017-07-07'])['headline'].value_counts().nsmallest(15)
Out[57]:
In [58]:
deduped['headline'][deduped['source_class'].isin(['center'])].value_counts().nlargest(25)
Out[58]:
In [59]:
deduped['headline'][deduped['source_class'].isin(['center'])].value_counts().nsmallest(25)
Out[59]:
In [60]:
deduped['headline'][deduped['source_class'].isin(['left'])].value_counts().nlargest(25)
Out[60]:
In [61]:
deduped['headline'][deduped['source_class'].isin(['left'])].value_counts().nsmallest(25)
Out[61]:
In [62]:
deduped['headline'][deduped['source_class'].isin(['right'])].value_counts().nlargest(25)
Out[62]:
In [63]:
deduped['headline'][deduped['source_class'].isin(['right'])].value_counts().nsmallest(25)
Out[63]:
In [64]:
deduped['headline'][deduped['source_class'].isin(['tabloid'])].value_counts().nlargest(25)
Out[64]:
In [65]:
deduped['headline'][deduped['source_class'].isin(['tabloid'])].value_counts().nsmallest(25)
Out[65]:
Next, we wanted to see if any headlines had more than one image. Let's check a few.
In [66]:
def imgs_from_headlines(headline):
    """
    A function to spit out all the different images used for a headline, assuming there's no more than 50/headline
    """
    all_images = deduped['img'][deduped['headline'].isin([headline])].value_counts().nlargest(50).index.tolist()
    for i in all_images:
        displaystring = '<img src="{}" width="200"/>'.format(i)
        display(HTML(displaystring))
In [67]:
imgs_from_headlines("Trump Voters Shocked After Watching This Leaked Video")
In [68]:
imgs_from_headlines("What Tiger Woods' Ex-Wife Looks Like Now Left Us With No Words")
In [69]:
imgs_from_headlines("Nicole Kidman's Yacht Is Far From You'd Expect")
In [70]:
imgs_from_headlines("He Never Mentions His Son, Here's Why")
In [71]:
imgs_from_headlines("Do This Tonight to Make Fungus Disappear by Morning (Try Today)")
Well, that was edifying.
In [72]:
timestamp = datetime.now().strftime('%Y-%m-%d-%H_%M')
In [73]:
datefile = '../data/out/{}_native_ad_data_deduped.csv'.format(timestamp)
In [74]:
deduped.to_csv(datefile, index=False)
Finally, let's generate a JSON file where each item is an individual image, and for each image we list all of its original sources, dates, headlines, classifications, and final locations.
In [75]:
img_json_data = {}
for index, row in deduped.iterrows():
    img_json_data[row['img_file']] = {'url': row['img'],
                                      'dates': [],
                                      'sources': [],
                                      'providers': [],
                                      'classifications': [],
                                      'headlines': [],
                                      'locations': [],
                                      }
In [76]:
print(len(img_json_data.keys()))
In [77]:
for index, row in deduped.iterrows():
    record = img_json_data[row['img_file']]
    if row['date'] not in record['dates']:
        record['dates'].append(row['date'])
    if row['headline'] not in record['headlines']:
        record['headlines'].append(row['headline'])
    if row['provider'] not in record['providers']:
        record['providers'].append(row['provider'])
    if row['source_class'] not in record['classifications']:
        record['classifications'].append(row['source_class'])
    if row['source'] not in record['sources']:
        record['sources'].append(row['source'])
    if row['final_link'] not in record['locations']:
        record['locations'].append(row['final_link'])
In [78]:
for i in list(img_json_data.keys())[0:5]:
    print(img_json_data[i])
In [79]:
hl_json_data = {}
for index, row in deduped.iterrows():
    hl_json_data[row['headline']] = {'img_urls': [],
                                     'dates': [],
                                     'sources': [],
                                     'providers': [],
                                     'classifications': [],
                                     'imgs': [],
                                     'locations': [],
                                     }
In [80]:
print(len(hl_json_data.keys()))
In [81]:
for index, row in deduped.iterrows():
    record = hl_json_data[row['headline']]
    if row['img'] not in record['img_urls']:
        record['img_urls'].append(row['img'])
    if row['date'] not in record['dates']:
        record['dates'].append(row['date'])
    if row['img_file'] not in record['imgs']:
        record['imgs'].append(row['img_file'])
    if row['provider'] not in record['providers']:
        record['providers'].append(row['provider'])
    if row['source_class'] not in record['classifications']:
        record['classifications'].append(row['source_class'])
    if row['source'] not in record['sources']:
        record['sources'].append(row['source'])
    if row['final_link'] not in record['locations']:
        record['locations'].append(row['final_link'])
In [82]:
for i in list(hl_json_data.keys())[0:5]:
    print(i, " = ", hl_json_data[i])
In [83]:
def to_json_file(json_data, prefix):
    filename = "../data/out/{}_grouped_data.json".format(prefix)
    with open(filename, 'w') as outfile:
        json.dump(json_data, outfile, indent=4)
In [84]:
to_json_file(img_json_data, "images")
In [85]:
to_json_file(hl_json_data, "headlines")
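As a quick sanity check (a sketch), either file can be loaded back with json.load and inspected:

# Load the images file back and confirm the number of records written
with open("../data/out/images_grouped_data.json") as infile:
    check = json.load(infile)
print(len(check), "image records written")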