NTDS Project: How Does Fake News Go Viral?

[Or why Bernie Sanders could replace President Trump with little-known loophole](http://www.huffingtonpost.com/entry/bernie-sanders-could-replace-president-trump-with-little_us_5829f25fe4b02b1f5257a6b7)

Authors: Victor Kristof and William Trouleau


The click-bait headline above, published on November 14, 2016 by the Huffington Post, is deliberately fake. It cleverly sheds light on an issue that became mainstream with the recent US election: fake-news propaganda on social networks. Indeed, an analysis of user engagement performed by BuzzFeed concluded that fake-news stories generated more engagement on Facebook than the top election stories from 19 major news outlets combined. The idea of this project is to analyze posts from famous Facebook pages publishing articles related to the US elections and to measure the impact of misinformation on the elections.

In particular, we follow BuzzFeed's analysis of nine famous Facebook pages actively publishing articles related to the US elections (listed in Section 1).

BuzzFeed manually annotated all articles posted by these nine pages during one week in September 2016 and rated them according to the accuracy of the information published. More details on these ratings are given in Section 1. However, their analysis was limited to the exploration of engagement metrics, such as the number of reactions, comments or shares generated by fake-news articles.

In this project, we extract additional features from the posts, such as the text message and the type of content. We then apply Machine Learning techniques to automatically predict truthfulness ratings for unlabelled posts.


  • In Section 1 (Acquisition), we first detail the scraping process used to acquire the Facebook posts analyzed throughout the project. We then clean the data and save it to a CSV file for future use.

  • In Section 2 (Exploration), we explore the scraped data. We first investigate the volume of data collected for each page and for each political orientation. We then dig deeper into the text messages that will later be used for exploitation and explore how they relate to the truthfulness of the articles. Since we also extracted other kinds of features, we investigate them in the same way.

  • In Section 3 (Exploitation/Evaluation), we finally use the scraped data to build a detection algorithm that predicts whether or not a given Facebook post is propaganda. We tackle this problem with three different approaches.

    • The first two approaches aim at detecting the truthfulness of articles based only on the text messages of the posts. We take inspiration from Sentiment Analysis modeling, but here we try to predict truthfulness rather than sentiment. In particular, we first build a simple Naive Bayes classifier, whose limitations are quickly reached. We then implement a more involved model based on a Convolutional Neural Network, the state-of-the-art architecture for Sentiment Analysis.
    • Finally, we also build a last model based on the other, non-textual features that were found to be discriminative in the exploration step.

In [1]:
import numpy as np
import scipy

import matplotlib as mpl
mpl.use('svg')
%matplotlib inline
from matplotlib import pyplot as plt

from cycler import cycler
palette = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
mpl.rcParams.update({
    'axes.prop_cycle': cycler(color=palette),
    'font.sans-serif': 'Helvetica',
    'font.style': 'normal',
    'font.size': 10,
})

import warnings
warnings.filterwarnings("ignore")

import pandas as pd

# Load internal libraries
%load_ext autoreload
%autoreload 2
import lib
from lib.acquisition import Feature, FacebookScraper
from lib import exploration, exploration_helper
from lib import ntds_utils
from lib import utils
from lib import config

1. Data Acquisition

NOTE: In order to run the data acquisition process, one needs to obtain a token to connect to the Facebook Graph API. This token needs to be saved in a credentials file (by default credentials.ini in the current directory).

We collect data from nine famous Facebook pages actively publishing articles related to the US elections (see page_list below).

Each collected post includes the name of the page in addition to:

  • the post id post_id,
  • the URL of the post post_url,
  • the time created_time when the post was created,
  • the type of post type (i.e. status, link, video, photo),
  • the text message message of the post,
  • the number of times the post was shared share_count,
  • the number of times the post received a reaction reaction_count,
  • the number of times the post was commented comment_count.

1.1 Facebook Web Scraper

To have more flexibility over the scraping process, we collect the data with a low-level approach based on the requests library. The code of the scraper FacebookScraper, as well as all the code corresponding to the data collection part, is contained in the acquisition module.

The FacebookScraper takes an iterable of Feature objects representing the fields we want to query on the Facebook Graph API, and provides a convenient way to query, clean and save the raw data into a well-formatted pandas.DataFrame.
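Although the full implementation lives in lib.acquisition, the core of the crawl boils down to paginated GET requests against the Graph API. The snippet below is a simplified, assumption-laden stand-in for what FacebookScraper.run does (no retries, no cleaning, hypothetical function name and API version); only the /{page}/posts endpoint, the access_token/fields/since/until parameters and the paging.next cursor come from the Graph API itself.

# Hypothetical sketch of the paginated crawl performed by FacebookScraper.run.
import requests

def crawl_page_posts(page, token, fields, since, until):
    """Yield raw post dictionaries for one Facebook page."""
    url = 'https://graph.facebook.com/v2.8/{}/posts'.format(page)
    params = {
        'access_token': token,
        'fields': ','.join(fields),
        'since': since,
        'until': until,
        'limit': 100,
    }
    while url is not None:
        response = requests.get(url, params=params)
        response.raise_for_status()
        payload = response.json()
        for post in payload.get('data', []):
            yield post
        # Follow the pagination cursor until no `next` page is returned
        url = payload.get('paging', {}).get('next')
        params = {}  # the `next` URL already embeds all query parameters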

Parameters

Define the parameters for the web scraper:

  • Credential file where your token is located
  • List of the pages to scrape
  • Date range over which we want to collect data
  • Fields to query

In [2]:
# Credential file containing the Facebook authentication token
credentials_file = 'credentials.ini'

# List of pages to scrape
page_list = ['ABCNewsPolitics', 'cnnpolitics', 'politico', 
             'AddictingInfoOrg', 'OccupyDemocrats', 'TheOther98', 
             'FreedomDailyNews', 'OfficialRightWingNews', 'theEagleisRising']

# Define the date range to scrape
start_date = '2016/01/01'
end_date = '2016/12/05'

# Define the fields to query
field_list = [
    Feature('id', name='post_id'),
    Feature('permalink_url', name='post_url'),
    Feature('link'),
    Feature('created_time'),
    Feature('type'),
    Feature('message'),
    Feature('description'),
    Feature('picture'),
    Feature('status_type'),
    Feature('caption'),
    Feature('reactions.limit(0).summary(1)', 
            formatter=lambda data: data.get('reactions', {'summary':{'total_count':0}})['summary']['total_count'], 
            name='reaction_count'),
    Feature('comments.limit(0).summary(1)', 
            formatter=lambda data: data.get('comments', {'summary':{'total_count':0}})['summary']['total_count'], 
            name='comment_count'),
    Feature('shares', 
            formatter=lambda data: data.get('shares', {'count': 0})['count'],
            name='share_count')
]

print('We collect data corresponding to fields:', ', '.join([f.fbquery for f in field_list]))
print('from the Facebook pages: ', ', '.join(page_list))


We collect data corresponding to fields: id, permalink_url, link, created_time, type, message, description, picture, status_type, caption, reactions.limit(0).summary(1), comments.limit(0).summary(1), shares
from the Facebook pages:  ABCNewsPolitics, cnnpolitics, politico, AddictingInfoOrg, OccupyDemocrats, TheOther98, FreedomDailyNews, OfficialRightWingNews, theEagleisRising

Run the web scraper.

WARNING: By running this cell, you will crawl ~60 MB of data from Facebook.


In [ ]:
# Initialize the Facebook web scraper
scraper = FacebookScraper(field_list)
scraper.extract_token(credentials_file)

# Iterate over the Facebook pages
for page in page_list:
    try:
        print('Scrape data from the Facebook page: {}...'.format(page))
        # Collect data in the requested date range
        scraper.run(page=page, since=start_date, until=end_date)
        print('{} articles collected'.format(len(scraper.data)))
    except Exception as e:
        print(e)
        break

1.2 Data Cleaning


In [ ]:
# Split the `account_id` and the `post_id` from the field `post_id`
scraper.data['account_id'] = scraper.data['post_id'].apply(lambda s: s.split('_')[0])
scraper.data['post_id'] = scraper.data['post_id'].apply(lambda s: s.split('_')[1])

# Convert string times to datetime objects
scraper.data['created_time'] = pd.to_datetime(scraper.data['created_time'])

# Clean the name of the pages
page_name_map = {
    'politico': 'Politico', 
    'cnnpolitics': 'CNN Politics', 
    'ABCNewsPolitics': 'ABC News Politics', 
    'TheOther98': 'The Other 98%', 
    'AddictingInfoOrg': 'Addicting Info',
    'OccupyDemocrats': 'Occupy Democrats', 
    'theEagleisRising': 'The Eagle is Rising', 
    'OfficialRightWingNews': 'Right Wing News', 
    'FreedomDailyNews': 'Freedom Daily'
}

scraper.data['page_name'] = scraper.data.apply(lambda row: page_name_map[row['page']], axis=1)

1.3 Data Augmentation

We now augment the dataset with the political orientation of each page, as well as the fact-checking rating manually annotated by BuzzFeed in this article.

Add the political orientation


In [ ]:
# Overall political orientation category of each page: mainstream, left, right
categories = {
    'politico': 'mainstream', 
    'cnnpolitics': 'mainstream', 
    'ABCNewsPolitics': 'mainstream', 
    'TheOther98': 'left', 
    'AddictingInfoOrg': 'left',
    'OccupyDemocrats': 'left', 
    'theEagleisRising': 'right', 
    'OfficialRightWingNews': 'right', 
    'FreedomDailyNews': 'right'
}

scraper.data['category'] = scraper.data.apply(lambda row: categories[row['page']], axis=1)

Add the fact checking annotation

To write their article, BuzzFeed collected this data over one week in September 2016 and manually annotated each post with a truthfulness rating. According to their article, their methodology was the following:

Posts could be rated “mostly true,” “mixture of true and false,” or “mostly false.” If we encountered a post that was satirical or opinion-driven, or that otherwise lacked a factual claim, we rated it “no factual content.” (We chose to rate things as “mostly” true or false in order to allow for smaller errors or accurate facts within otherwise true or false claims or stories.)


In [ ]:
# Load the BuzzFeed dataset
buzzfeed_df = pd.read_csv('data/buzzfeed-fact-check.csv')

# Join both on the `post_id` field
scraper.data = pd.merge(scraper.data, buzzfeed_df[['post_id', 'rating']], on='post_id', how='left')

# Fill the missing ratings with the string 'UNKNOWN'
scraper.data['rating'] = scraper.data['rating'].fillna('UNKNOWN')

In [ ]:
# Save the dataset in a csv file
scraper.data.to_csv('data/dataset.csv', index=False)

In [ ]:
# Free the memory for the next steps
del scraper

2. Data Exploration

Let us now explore the collected data.


In [2]:
# Load the dataset
df = utils.load_dataset('data/dataset.csv')

2.1 Data Volume Analysis

Number of articles published per page


In [3]:
exploration.barplot_count_posts_per_page(df)


The two most active pages are the two mainstream pages Politico (with ~19k articles posted) and CNN (with ~15k articles posted). In contrast, the two least active pages are the pro-democrat pages The Other 98% (with ~4k articles posted) and Occupy Democrats (with ~6k articles posted).

Median number of comments/reactions/shares for each page


In [4]:
exploration.barplot_engagement(df, agg_func='median')


It is interesting to note that the two pages that posted the least are the ones that generated, on average, the most user engagement (in terms of number of reactions, comments and shares).


2.2 Time Series Visualization

Let us now investigate how the articles and their reactions unfold over time. To better visualize the data, we aggregate articles by week, using the week number in year 2016.


In [5]:
# Extract the week number when the article was created
df['created_week'] = df['created_time'].apply(lambda row: row.isocalendar()[1])

Number of Articles Published each Week for each Page


In [6]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()


Loading BokehJS ...

In [7]:
p = figure(title='Number of articles published each week for each page', plot_width=950, plot_height=400)
for i, page_name in enumerate(df.page_name.unique()):
    s = df[df.page_name==page_name]['created_week'].value_counts().sort_index()
    p.line(x=s.index, y=np.array(s), color=palette[i], legend=page_name, line_width=2)
p.xaxis.axis_label = 'Week of the year 2016'
p.yaxis.axis_label = 'Number of articles published'
p.legend.location = 'top_left'
show(p)


The three mainstream pages (i.e. Politico, CNN and ABCNewsPolitics) exhibit the same peak around weeks 29-30. This period corresponds to the Democratic National Convention, which closed the primary season. In addition, the peak at week 45 corresponds to articles released during the week of the election, held on November 8. We can also clearly see a big drop for all pages two weeks after the election.

Number of Comments Published each Week for each Page


In [8]:
p = figure(title='Number of comments published each week for each page', plot_width=950, plot_height=400)
for i, page_name in enumerate(df.page_name.unique()):
    s = df[df.page_name==page_name].groupby('created_week').sum()['comment_count']
    p.line(x=s.index, y=np.array(s), color=palette[i], legend=page_name, line_width=2)
p.xaxis.axis_label = 'Week of the year 2016'
p.yaxis.axis_label = 'Number of comments'
p.legend.location = 'top_left'
show(p)



2.3 Analysis of the Text Messages

Analysis of Message Length

Let us first investigate the length of the messages from different perspectives.


In [9]:
# Compute the length of each text message
df['Message length'] = df['message'].fillna('').apply(lambda s: len(s))

How does the message length differ between political orientation categories?


In [10]:
exploration.aggregate_message_length(df, by='category', agg_funcs=['median']).T


Out[10]:
Category                   left    mainstream    right
Message length (median)      55           134       64

How does the message length differ between each page?


In [11]:
exploration.aggregate_message_length(df, by='page_name', agg_funcs=['median']).T


Out[11]:
Page name               Message length (median)
ABC News Politics                            172
Addicting Info                                64
CNN Politics                                 168
Freedom Daily                                166
Occupy Democrats                              61
Politico                                     104
Right Wing News                               36
The Eagle is Rising                           71
The Other 98%                                 18

In the first table above, we see that the mainstream pages have longer messages than the others. If we look more closely at the median message length per page, we see that The Other 98% (Democrat-oriented) and Right Wing News (Republican-oriented) have much shorter text messages than any other page. Let us look at a few examples of such short messages from the page The Other 98%:


In [12]:
df['message'][(df.page=='TheOther98')&(df['Message length'] < 10)&(df['Message length'] > 0)][:15]


Out[12]:
87136      Better.
87137         Yep.
87138         Amen
87141      MmmHmm.
87143          Ha!
87153         WOW!
87157    BREAKING:
87161       Neato!
87166    BREAKING:
87174        Word.
87185       Ditto.
87186         YES!
87188          Yep
87191         Amen
87196        Word.
Name: message, dtype: object

We can see that these short messages are undeniably subjective and biased. In contrast, we can see below that long messages seem to be more objective. This observation might be interpreted as a contrast between short click-bait messages and longer story-telling articles.


In [13]:
pd.options.display.max_colwidth = 100
print(df['message'][(df.page=='cnnpolitics')&(df['Message length'] > 100)][:5])


15280    Whether it's the messenger or the message, Democrats agree something needs to change following t...
15281    In a statement, Carrier says, “the incentives offered by the state were an important considerati...
15282    The destiny of US diplomacy is coming down to a process of elimination between Republican giants...
15283    President-elect Donald J. Trump has been known to favor Kentucky Fried Chicken, McDonald's and t...
15284    Michael Banerian, the youth vice-chairman of the Michigan Republican Party, says he has been thr...
Name: message, dtype: object

Text Features Analysis

Let us now dive deeper into the text messages and explore the relevant terms that we will use as text features in the data exploitation part. We start by vectorizing the raw text messages into a matrix counting the occurrences of each word in each post.
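The vectorize_messages helper comes from the project's exploration_helper module. As a rough sketch, it is assumed to behave like the snippet below, built on scikit-learn's CountVectorizer; the stop-word and frequency-cutoff settings shown here are illustrative, not necessarily the helper's.

# Hypothetical sketch of `exploration_helper.vectorize_messages`.
from sklearn.feature_extraction.text import CountVectorizer

def vectorize_messages_sketch(messages, vectorizerObj=CountVectorizer):
    """Turn an iterable of raw messages into a sparse (messages x words) count matrix."""
    vectorizer = vectorizerObj(stop_words='english',  # drop common English words
                               min_df=5)              # drop words seen in fewer than 5 messages
    X = vectorizer.fit_transform(messages)
    return X, vectorizer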


In [14]:
X, vectorizer = exploration_helper.vectorize_messages(df.message.fillna(''))
print("There are {} messages with {} bag-of-word features.".format(*X.shape))


There are 90934 messages with 8143 bag-of-word features.

In [15]:
counts = exploration.count_words(X, vectorizer)
counts[:20].to_frame('Count').T


Out[15]:
Top 20 words (count): trump (22089), donald (14601), hillary (10419), clinton (10035), http (8636), obama (7209), president (6642), says (6189), just (5050), like (4900), ws (4790), abcn (4786), cnn (4407), campaign (4190), new (4043), said (3775), video (3497), sanders (3369), republican (3368), bernie (3275)

Vocabulary vs. Rating

Only the labelled messages will actually be useful for training the models in the exploitation section. Do the words used in posts rated as mostly true differ from those rated as mostly false?

In the tables below, we can see that both classes share many top words, such as trump, donald, obama, clinton, hillary, etc. However, it is interesting to note that the 10th most used word of the mostly false articles is muslim and that the 17th one is lie.


In [16]:
# Define the masks for the true and false articles.
mask_true = np.array(df.rating == 'mostly true')
mask_false = np.array(df.rating == 'mostly false')

Top vocabulary words of mostly true articles


In [17]:
# Count the word frequency of posts rated as `mostly true`
counts = exploration.count_words(X[mask_true,:], vectorizer)
# Print the top 50 words
pd.set_option("display.max_columns", 50)
counts[:50].to_frame('Count').T


Out[17]:
trump donald clinton hillary debate president says obama http presidential cnn said new just campaign like police ws abcn people debates democrats did white going election state black think republican video america say voters know don american stop country national night politico york man nominee occupy make candidate news americans
Count 560 449 324 282 158 143 135 111 108 108 98 88 81 77 74 71 66 65 65 59 57 52 47 47 46 45 44 43 43 42 42 41 40 39 38 37 36 36 35 35 34 33 33 33 32 32 31 31 31 31

 Top vocabulary words of mostly false articles


In [18]:
# Count the word frequency of posts rated as `mostly false`
counts = exploration.count_words(X[mask_false,:], vectorizer)
# Print the top 50 words
pd.set_option("display.max_columns", 50)
counts[:50].to_frame('Count').T


Out[18]:
hillary clinton obama just trump media america liberal let muslim new 000 shares going share went lie police said time foundation com come video illegal world charity run disgusting like sued general refugees hell news live caught eaglerising need believe white www completely obvious way think clintons make happening attacked
Count 19 12 11 11 10 8 7 7 6 6 5 5 5 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2

2.4 Exploration of Other Non-Text Discriminative Features in Fake News Articles

Let us now explore other, non-textual features and analyze whether they clearly separate fake-news articles from legitimate ones.


In [19]:
# Select only the articles that have been manually annotated
mask_labelled = df.rating != 'UNKNOWN'

Rating vs. Political Orientation Category

Let us first investigate the proportions of genuine and fake-news articles based on the political orientation of the pages. The Figure below clearly shows that the mainstream pages contain much less fake content than politically oriented pages.


In [20]:
exploration.count_group(df[mask_labelled], by=['rating', 'category'])


Rating vs. Page

Let us now analyze this behavior in more depth, looking at the number of mostly false posts for each page. The posts rated as no factual content on the mainstream pages correspond, in 100% of the cases, to advertisements for their own content; those on the right- and left-leaning pages are all jokes and subjective content. The Figure below clearly shows that the pro-left Occupy Democrats and the pro-right The Eagle is Rising are the two most active distributors of fake news.


In [47]:
mask_false = (df.rating == 'mostly false')
exploration.count_group(df[mask_false], by=['rating', 'page_name'], figsize=(12,3), rot=10)


Rating vs. Type of content

Now that we know which are the most likely pages to publish fake-news, it is natural to wonder whether the type of content published (i.e. status, link, photo, video) distinguishes the truthfulness of the articles.

In the table below, we can see that most articles are links. A manual inspection shows that they are links to full articles hosted on the websites of the respective pages, accompanied by a short summary text message. There is no clear difference between mostly true and mostly false articles. However, we clearly see that articles with no factual content (i.e. articles that are satirical, opinion-driven, or that otherwise lack a factual claim) are mostly photos and videos, whereas both types are mostly absent from all other rating classes.


In [22]:
exploration.count_group(df[mask_labelled], by=['rating', 'type'], figsize=(8,8))


Rating vs. Number of shares/comments/reactions

Let us finally explore which articles raised the most engagement and what their ratings are. In the plots below, we show the number of reactions, comments and shares with respect to the article rating for each political orientation category. The nine plots are organized as follows:

  • The first line corresponds to pro-democrat pages,
  • the second line corresponds to mainstream pages,
  • and the last line corresponds to pro-republican pages.

We see that the posts published by pro-democrat pages that raised the most engagement are the ones with no factual content (i.e. posts that are satirical, opinion-driven, or that otherwise lacked a factual claim). In contrast, pro-republican posts rated as mostly false (i.e. fake-news posts) are the ones that raised the most engagement from their readers. Finally, we also see that the posts published by mainstream pages that raised the most engagement are rated as mixture of true and false.

It is interesting to note that, for each category, results agree for all three engagement metrics (i.e. number of reactions, comments and shares).


In [23]:
RATINGS =  ['no factual content', 'mostly false', 'mixture of true and false', 'mostly true']
CATEGORIES = ['left', 'mainstream', 'right']
ENGAGEMENT_COLUMNS = ['reaction_count','comment_count','share_count']

fig, axs = plt.subplots(len(CATEGORIES), len(ENGAGEMENT_COLUMNS), figsize=(12,12))
for i, cat in enumerate(CATEGORIES):
    for j, col in enumerate(ENGAGEMENT_COLUMNS):
        data = df[df.category==cat][['rating',col]].groupby('rating').agg('median').reindex(RATINGS) 
        data.plot.bar(ax=axs[i][j], rot=20, legend=False)
        axs[i][j].set_title('{} - {}'.format(cat,col))
fig.subplots_adjust(hspace=0.85)



3. Data Exploitation

Now that we have a pretty good idea of the data we are dealing with, we can implement a model to detect fake-news articles from the various features discussed earlier. To tackle this task, we take inspiration from sentiment analysis models, whose goal is to detect positive or negative sentiment in text data.


In [24]:
# (Re)load the dataset
df = utils.load_dataset('data/dataset.csv')

3.1 Fake-News Classification using Naive Bayes

A widely used model for sentiment classification is the Naive Bayes model. The assumption behind this model is that all features (i.e. words in the vocabulary) are conditionally independent given the class. This assumption allows a nice and simple mathematical derivation of the learning algorithm.
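Concretely, for a message represented by its words $w_1, \dots, w_n$ and a class $c$ (here mostly true or mostly false), the model assumes conditional independence of the words given the class and predicts with the maximum a posteriori rule:

$$P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c), \qquad \hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(w_i \mid c).$$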


In [25]:
from lib.exploitation import NaiveBayesClassifier

Preprocessing

Format the dataframe and split the dataset into a training and a held out testing set.

Under the bag-of-words assumption, the messages are tokenized into words and weighted with the widely used tf-idf scheme:

$$\text{tf-idf}(w,m) = \text{tf}(w,m) \cdot \log\frac{N}{\text{df}(w)},$$

where $\text{tf}(w,m)$ is the number of times word $w$ appears in message $m$, $\text{df}(w)$ is the number of messages in which $w$ occurs, and $N$ is the total number of messages.

This measure emphasizes words that occur a lot in a single message but are rare in the whole corpus.
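As a small illustration (not part of the original pipeline), here is the weighting computed on a toy corpus with scikit-learn's TfidfVectorizer. Note that the vectorizer uses a smoothed idf and L2-normalizes each message vector, a minor variant of the formula above, so the numbers differ slightly from a hand computation.

# Toy tf-idf illustration (illustrative only); TfidfVectorizer applies a
# smoothed idf and L2 row normalization by default.
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "trump wins the debate",
    "clinton wins the debate",
    "breaking trump trump trump",
]
vec = TfidfVectorizer()
weights = vec.fit_transform(toy_corpus)

# Vocabulary in column order, then the weighted matrix
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(weights.toarray().round(2))
# Within each message, rare or repeated words ('clinton' in the second message,
# 'trump' in the last one) receive the highest weights, while words spread over
# the whole corpus ('the', 'wins', 'debate') receive lower ones.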


In [26]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [27]:
# Filter the data to consider only posts labeled as 'mostly true' and 'mostly false'
mask_true_false = (df.rating == 'mostly true') | (df.rating == 'mostly false')
X, vectorizer = exploration_helper.vectorize_messages(df[mask_true_false].message.fillna(''), vectorizerObj=TfidfVectorizer)
y = np.array(df[mask_true_false].rating == 'mostly false', dtype='bool')
print("There are {} labeled messages with {} bag-of-word features.".format(*X.shape))
print("There are {} legitimate messages and {} fake ones.".format((y==0).sum(), (y==1).sum()))


There are 1766 labeled messages with 311 bag-of-word features.
There are 1662 legitimate messages and 104 fake ones.

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("There are {} legitimate messages and {} fake ones in the training set.".format((y_train==0).sum(), (y_train==1).sum()))
print("There are {} legitimate messages and {} fake ones in the testing set.".format((y_test==0).sum(), (y_test==1).sum()))


There are 1148 legitimate messages and 88 fake ones in the training set.
There are 514 legitimate messages and 16 fake ones in the testing set.

Hyperparameter tuning

The Naive Bayes model has two hyperparameters that need to be tuned.

  • The first one corresponds to the smoothing parameter alpha used to avoid overfitting due to features with very small probabilities.
  • The second one corresponds to the class_prior parameter which aims at reducing the bias due to unbalanced data.

The first plot (on the left) below shows that a small alpha (e.g. 1.0) should be used. A higher value seems to smooth the features too much, thus reducing their discriminative power. The other plot (on the right) shows that a prior between 0.3 and 0.4 should be used for the fake-news class.
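The tune_alpha and tune_prior methods belong to the project's NaiveBayesClassifier wrapper (in lib.exploitation). As a hedged sketch of what such a search could look like, the snippet below estimates a cross-validated F1-score for each candidate value of alpha with scikit-learn's MultinomialNB; the 5-fold split and the function name are assumptions, not the project's actual code.

# Hypothetical sketch of a grid search over the smoothing parameter `alpha`,
# scoring each candidate by cross-validated F1-score on the training set.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def tune_alpha_sketch(X_train, y_train, alpha_grid, class_prior=(0.7, 0.3)):
    scores = []
    for alpha in alpha_grid:
        clf = MultinomialNB(alpha=alpha, class_prior=list(class_prior))
        # Mean F1-score over 5 cross-validation folds
        scores.append(cross_val_score(clf, X_train, y_train, cv=5, scoring='f1').mean())
    return np.array(scores)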


In [29]:
# Tune the `alpha` hyperparameter
nb_classifier = NaiveBayesClassifier(class_prior=[0.7, 0.3])
alpha_search_space = np.linspace(0.1, 50.0, 50)
alpha_search_perf = nb_classifier.tune_alpha(X_train, y_train, alpha_search_space)

# Tune the `class_prior` hyperparameter
nb_classifier = NaiveBayesClassifier(alpha=1.0)
prior_search_space = np.linspace(0.0, 1.0, 50)
prior_search_perf = nb_classifier.tune_prior(X_train, y_train, prior_search_space)

fig, axs = plt.subplots(1, 2, figsize=(12,4))
axs[0].plot(alpha_search_space, alpha_search_perf)
axs[0].set_ylabel("F1-Score"); axs[0].set_xlabel("Smoothing parameter `alpha`");
axs[0].set_title("`alpha` hyperparameter tuning on validation set")
axs[1].plot(prior_search_space, prior_search_perf)
axs[1].set_ylabel("F1-Score"); axs[1].set_xlabel("Prior probability of class `mostly false`");
axs[1].set_title("`class_prior` hyperparameter tuning on validation set");


Evaluation

Once the hyperparameters are tuned, we can evaluate the performance of the Naive Bayes model on the held out testing dataset. We can see below that the model achieves a good accuracy. However, since our dataset is highly unbalanced, accuracy hides the fact that the model only performs well at classifying the mostly true articles. Looking at the second line of the confusion matrix shows that, among the $16$ fake-news articles of the testing set, the model wrongly classifies $11$ as legitimate.

Due to the class imbalance, it is much more relevant to evaluate the model using the $F_1$-score, defined as:

$$F_1 = 2 \cdot \frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}},$$

where $$\text{precision} = \frac{\text{True positive}}{\text{True positive}+\text{False positive}}, ~\text{and}~ \text{recall} = \frac{\text{True positive}}{\text{True positive}+\text{False negative}}.$$

Taking into account both precision and recall, instead of accuracy alone, avoids misleadingly high scores on imbalanced datasets.


In [30]:
nb_classifier = NaiveBayesClassifier(alpha=1.0, class_prior=[0.6, 0.4])
# Train the model on the training set
nb_classifier.fit(X_train, y_train)
# Evaluate the model on the heldout testing set
_, nb_performance_metrics = nb_classifier.predict(X_test, y_test)
print('Accuracy:', nb_performance_metrics['accuracy'])
print('F1 score:', nb_performance_metrics['f1_score'])
print('Confusion Matrix:\n', nb_performance_metrics['confusion_matrix'])


Accuracy: 0.87358490566
F1 score: 0.12987012987
Confusion Matrix:
 [[458  56]
 [ 11   5]]
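As a sanity check, plugging the confusion matrix above into the definitions (with scikit-learn's convention that rows are true classes and columns are predictions, so $TP = 5$, $FP = 56$, $FN = 11$) recovers the reported score:

$$\text{precision} = \frac{5}{5+56} \approx 0.082, \qquad \text{recall} = \frac{5}{5+11} \approx 0.313, \qquad F_1 = 2\cdot\frac{0.082 \times 0.313}{0.082 + 0.313} \approx 0.13.$$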

3.2 Fake-News Classification using Convolutional Neural Network

We implement here a method to predict the truthfulness of a post using a Convolutional Neural Network. We vectorize the sentences by mapping each word to an index in the vocabulary of the whole corpus. The extracted vocabulary has size 6490. Word embeddings are then learned directly by the network. The architecture is built such that filters of different sizes (2, 3, 4 and 5 in our case) process the embedded words in parallel, each sliding over a different number of consecutive words (i.e. a convolution).

We use a dropout probability $p=0.5$, as well as an $L_2$-regularizer with $\lambda = 0.1$, in order to counteract overfitting.

Reference: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
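The CNNClassifier used below is the project's TensorFlow implementation, following the reference above. For a compact view of the architecture only (embedding, parallel convolutions with filter sizes 2 to 5, max pooling, dropout and an L2-regularized softmax output), here is a hedged tf.keras sketch; it is not the project's code, and the embedding dimension and the number of filters per size are illustrative assumptions.

# Minimal tf.keras sketch of the text-CNN architecture described above
# (illustrative only; embedding_dim and num_filters are assumed values).
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 6490      # vocabulary size reported above
seq_length = 89        # maximum (padded) message length reported above
embedding_dim = 128    # assumed embedding dimension
filter_sizes = [2, 3, 4, 5]
num_filters = 64       # assumed number of filters per size

inputs = layers.Input(shape=(seq_length,), dtype='int32')
embedded = layers.Embedding(vocab_size, embedding_dim)(inputs)

# One convolution + global max-pooling branch per filter size, in parallel
branches = []
for size in filter_sizes:
    conv = layers.Conv1D(num_filters, size, activation='relu')(embedded)
    branches.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Dropout(0.5)(layers.Concatenate()(branches))        # dropout p = 0.5
outputs = layers.Dense(2, activation='softmax',                     # one output column per class
                       kernel_regularizer=tf.keras.regularizers.l2(0.1))(merged)

sketch_model = tf.keras.Model(inputs, outputs)
sketch_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])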

 Preprocess the data

We preprocess the texts by padding them such that they all have the same length.
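The preprocessing helper (in lib.exploitation_helper) builds the word-to-index vocabulary and returns padded sequences. A minimal sketch of the padding idea, assuming the helper does something equivalent, is shown below; the pad token and function name are illustrative.

# Hypothetical sketch of right-padding tokenized messages to a common length.
def pad_messages(token_lists, pad_token='<PAD>'):
    """Right-pad every tokenized message to the length of the longest one."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (max_len - len(tokens)) for tokens in token_lists]

padded = pad_messages([['trump', 'wins'], ['clinton', 'wins', 'the', 'debate']])
# -> [['trump', 'wins', '<PAD>', '<PAD>'], ['clinton', 'wins', 'the', 'debate']]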


In [31]:
from lib.exploitation_helper import preprocessing

X, y, vocabulary = preprocessing(df)

print("There are {} labeled messages whose maximum length is {}.".format(*X.shape))
print("There are {} true messages and {} false ones.".format(y[:, 0].sum(), y[:, 1].sum()))


There are 1766 labeled messages whose maximum length is 89.
There are 1662 true messages and 104 false ones.

Training and Evaluation

Train the CNN on the data. Batches of 64 samples are randomly generated and the whole training set is processed for 10 epochs.

The evaluation is performed periodically during training on a held-out testing dataset.

Note: the nan values displayed for the F1-score at some iterations are due to the highly unbalanced data.


In [32]:
from lib.exploitation import CNNClassifier

model = CNNClassifier(X, 
                      y, 
                      vocabulary, 
                      test_sample_percentage=0.3,
                      filter_sizes=[2, 3, 4, 5],
                      dropout_keep_prob=0.5,
                      l2_reg_lambda=0.1,
                      batch_size=64, 
                      num_epochs=10, 
                      evaluate_every=5)

loss, accuracy, f1_score = model.fit()


Vocabulary size: 6490
Train/test split: 1237/529
Writing to /Users/kristof/GitHub/ntds_2016/prj-fake-news/runs/1484506616

2017-01-15T19:57:00.912270: step 1, loss 1.2138, acc 0.625, f1 0.0769231
2017-01-15T19:57:01.281131: step 2, loss 0.890614, acc 0.78125, f1 0.125
2017-01-15T19:57:01.629120: step 3, loss 1.1517, acc 0.8125, f1 nan
2017-01-15T19:57:01.965870: step 4, loss 0.793367, acc 0.9375, f1 nan
2017-01-15T19:57:02.301126: step 5, loss 0.51815, acc 0.953125, f1 nan

Evaluation:
2017-01-15T19:57:03.280443: step 5, loss 0.736578, acc 0.943289, f1 nan

2017-01-15T19:57:03.643570: step 6, loss 1.08518, acc 0.921875, f1 nan
2017-01-15T19:57:03.990908: step 7, loss 0.649553, acc 0.921875, f1 nan
2017-01-15T19:57:04.334797: step 8, loss 0.895872, acc 0.921875, f1 nan
2017-01-15T19:57:04.723266: step 9, loss 0.559493, acc 0.953125, f1 nan
2017-01-15T19:57:05.068046: step 10, loss 0.808965, acc 0.921875, f1 nan

Evaluation:
2017-01-15T19:57:05.924707: step 10, loss 0.692098, acc 0.943289, f1 nan

2017-01-15T19:57:06.282873: step 11, loss 0.590494, acc 0.96875, f1 0.5
2017-01-15T19:57:06.642686: step 12, loss 0.863735, acc 0.9375, f1 nan
2017-01-15T19:57:06.983077: step 13, loss 0.309892, acc 0.90625, f1 nan
2017-01-15T19:57:07.315234: step 14, loss 0.378857, acc 0.890625, f1 nan
2017-01-15T19:57:07.647243: step 15, loss 0.473085, acc 0.9375, f1 nan

Evaluation:
2017-01-15T19:57:08.406627: step 15, loss 0.641603, acc 0.941399, f1 nan

2017-01-15T19:57:08.811104: step 16, loss 0.811614, acc 0.890625, f1 nan
2017-01-15T19:57:09.157902: step 17, loss 0.658881, acc 0.9375, f1 nan
2017-01-15T19:57:09.499546: step 18, loss 0.509049, acc 0.953125, f1 nan
2017-01-15T19:57:09.838211: step 19, loss 1.10663, acc 0.859375, f1 nan
2017-01-15T19:57:10.035616: step 20, loss 1.19875, acc 0.904762, f1 nan

Evaluation:
2017-01-15T19:57:10.906334: step 20, loss 0.566659, acc 0.941399, f1 nan

2017-01-15T19:57:11.243165: step 21, loss 0.410493, acc 0.90625, f1 0.25
2017-01-15T19:57:11.616063: step 22, loss 0.593436, acc 0.921875, f1 0.285714
2017-01-15T19:57:12.188742: step 23, loss 0.453851, acc 0.90625, f1 0.4
2017-01-15T19:57:12.599674: step 24, loss 0.56331, acc 0.890625, f1 0.461538
2017-01-15T19:57:12.984176: step 25, loss 0.446496, acc 0.953125, f1 nan

Evaluation:
2017-01-15T19:57:13.851361: step 25, loss 0.491606, acc 0.941399, f1 nan

2017-01-15T19:57:14.240662: step 26, loss 0.375342, acc 0.9375, f1 nan
2017-01-15T19:57:14.602736: step 27, loss 0.700206, acc 0.875, f1 0.2
2017-01-15T19:57:14.967426: step 28, loss 0.754919, acc 0.875, f1 nan
2017-01-15T19:57:15.345178: step 29, loss 0.602314, acc 0.859375, f1 0.181818
2017-01-15T19:57:15.714995: step 30, loss 0.503278, acc 0.890625, f1 0.222222

Evaluation:
2017-01-15T19:57:16.574848: step 30, loss 0.480684, acc 0.943289, f1 nan

2017-01-15T19:57:16.960098: step 31, loss 0.455465, acc 0.875, f1 nan
2017-01-15T19:57:17.336040: step 32, loss 0.212647, acc 1, f1 1
2017-01-15T19:57:17.714570: step 33, loss 0.386217, acc 0.9375, f1 nan
2017-01-15T19:57:18.332873: step 34, loss 0.91705, acc 0.890625, f1 nan
2017-01-15T19:57:18.702380: step 35, loss 0.632137, acc 0.859375, f1 nan

Evaluation:
2017-01-15T19:57:19.476472: step 35, loss 0.496589, acc 0.943289, f1 nan

2017-01-15T19:57:19.841174: step 36, loss 0.429221, acc 0.890625, f1 nan
2017-01-15T19:57:20.208625: step 37, loss 0.373169, acc 0.90625, f1 nan
2017-01-15T19:57:20.574048: step 38, loss 0.578765, acc 0.9375, f1 nan
2017-01-15T19:57:20.961282: step 39, loss 0.255044, acc 0.984375, f1 nan
2017-01-15T19:57:21.164522: step 40, loss 0.677893, acc 0.857143, f1 nan

Evaluation:
2017-01-15T19:57:21.963475: step 40, loss 0.519541, acc 0.943289, f1 nan

2017-01-15T19:57:22.505882: step 41, loss 0.301774, acc 0.953125, f1 nan
2017-01-15T19:57:22.866442: step 42, loss 0.353338, acc 0.953125, f1 0.571429
2017-01-15T19:57:23.346447: step 43, loss 0.301723, acc 0.953125, f1 0.4
2017-01-15T19:57:23.710918: step 44, loss 0.447541, acc 0.921875, f1 0.285714
2017-01-15T19:57:24.069176: step 45, loss 0.314902, acc 0.953125, f1 0.4

Evaluation:
2017-01-15T19:57:24.976237: step 45, loss 0.541266, acc 0.943289, f1 nan

2017-01-15T19:57:25.347411: step 46, loss 0.519022, acc 0.9375, f1 nan
2017-01-15T19:57:25.709231: step 47, loss 0.528196, acc 0.921875, f1 0.444444
2017-01-15T19:57:26.080128: step 48, loss 0.336875, acc 0.953125, f1 0.4
2017-01-15T19:57:26.480975: step 49, loss 0.194913, acc 0.984375, f1 0.8
2017-01-15T19:57:26.873369: step 50, loss 0.391851, acc 0.9375, f1 0.5

Evaluation:
2017-01-15T19:57:27.668845: step 50, loss 0.508302, acc 0.943289, f1 nan

2017-01-15T19:57:28.058526: step 51, loss 0.250643, acc 0.953125, f1 0.4
2017-01-15T19:57:28.417011: step 52, loss 0.44705, acc 0.90625, f1 0.5
2017-01-15T19:57:28.764869: step 53, loss 0.523612, acc 0.84375, f1 nan
2017-01-15T19:57:29.108906: step 54, loss 0.499771, acc 0.890625, f1 0.363636
2017-01-15T19:57:29.471088: step 55, loss 0.317959, acc 0.96875, f1 nan

Evaluation:
2017-01-15T19:57:30.244913: step 55, loss 0.505817, acc 0.94518, f1 0.121212

2017-01-15T19:57:30.646779: step 56, loss 0.395384, acc 0.921875, f1 0.285714
2017-01-15T19:57:31.031380: step 57, loss 0.584402, acc 0.875, f1 nan
2017-01-15T19:57:31.389525: step 58, loss 0.415126, acc 0.921875, f1 nan
2017-01-15T19:57:31.753141: step 59, loss 0.521352, acc 0.9375, f1 nan
2017-01-15T19:57:31.936897: step 60, loss 0.190062, acc 1, f1 nan

Evaluation:
2017-01-15T19:57:32.881427: step 60, loss 0.488624, acc 0.94707, f1 0.176471

2017-01-15T19:57:33.273521: step 61, loss 0.248393, acc 0.96875, f1 0.75
2017-01-15T19:57:33.658661: step 62, loss 0.446822, acc 0.9375, f1 0.5
2017-01-15T19:57:34.058358: step 63, loss 0.340754, acc 0.921875, f1 nan
2017-01-15T19:57:34.446368: step 64, loss 0.237077, acc 0.96875, f1 0.5
2017-01-15T19:57:34.810260: step 65, loss 0.311844, acc 0.953125, f1 0.4

Evaluation:
2017-01-15T19:57:35.667771: step 65, loss 0.536604, acc 0.943289, f1 nan

2017-01-15T19:57:36.056455: step 66, loss 0.404244, acc 0.90625, f1 nan
2017-01-15T19:57:36.436272: step 67, loss 0.196946, acc 0.984375, f1 0.666667
2017-01-15T19:57:36.824470: step 68, loss 0.432412, acc 0.9375, f1 nan
2017-01-15T19:57:37.167421: step 69, loss 0.328707, acc 0.9375, f1 0.333333
2017-01-15T19:57:37.505608: step 70, loss 0.218168, acc 0.96875, f1 0.5

Evaluation:
2017-01-15T19:57:38.320828: step 70, loss 0.538155, acc 0.94707, f1 0.125

2017-01-15T19:57:38.685393: step 71, loss 0.285732, acc 0.9375, f1 0.6
2017-01-15T19:57:39.033934: step 72, loss 0.374805, acc 0.9375, f1 nan
2017-01-15T19:57:39.391680: step 73, loss 0.442771, acc 0.890625, f1 0.363636
2017-01-15T19:57:39.731983: step 74, loss 0.236138, acc 0.953125, f1 0.4
2017-01-15T19:57:40.063136: step 75, loss 0.377715, acc 0.953125, f1 0.666667

Evaluation:
2017-01-15T19:57:40.828126: step 75, loss 0.489469, acc 0.94896, f1 0.228571

2017-01-15T19:57:41.188375: step 76, loss 0.363734, acc 0.9375, f1 0.5
2017-01-15T19:57:41.545748: step 77, loss 0.270332, acc 0.96875, f1 0.75
2017-01-15T19:57:41.879443: step 78, loss 0.362656, acc 0.9375, f1 0.6
2017-01-15T19:57:42.206447: step 79, loss 0.320228, acc 0.96875, f1 0.666667
2017-01-15T19:57:42.364759: step 80, loss 0.278259, acc 0.904762, f1 0.5

Evaluation:
2017-01-15T19:57:43.109236: step 80, loss 0.45314, acc 0.950851, f1 0.235294

2017-01-15T19:57:43.442860: step 81, loss 0.265772, acc 0.9375, f1 0.666667
2017-01-15T19:57:43.796833: step 82, loss 0.343141, acc 0.921875, f1 0.285714
2017-01-15T19:57:44.131113: step 83, loss 0.295319, acc 0.921875, f1 0.285714
2017-01-15T19:57:44.492615: step 84, loss 0.195562, acc 0.984375, f1 nan
2017-01-15T19:57:44.823219: step 85, loss 0.32325, acc 0.921875, f1 0.444444

Evaluation:
2017-01-15T19:57:45.583977: step 85, loss 0.496439, acc 0.950851, f1 0.235294

2017-01-15T19:57:45.933255: step 86, loss 0.265223, acc 0.96875, f1 0.5
2017-01-15T19:57:46.285153: step 87, loss 0.295113, acc 0.96875, f1 0.5
2017-01-15T19:57:46.613729: step 88, loss 0.247205, acc 0.984375, f1 0.8
2017-01-15T19:57:46.953105: step 89, loss 0.228499, acc 0.984375, f1 0.666667
2017-01-15T19:57:47.290859: step 90, loss 0.411187, acc 0.9375, f1 0.333333

Evaluation:
2017-01-15T19:57:48.040853: step 90, loss 0.528993, acc 0.94896, f1 0.181818

2017-01-15T19:57:48.396467: step 91, loss 0.169302, acc 1, f1 nan
2017-01-15T19:57:48.730295: step 92, loss 0.190799, acc 0.96875, f1 0.5
2017-01-15T19:57:49.056634: step 93, loss 0.320444, acc 0.921875, f1 0.545455
2017-01-15T19:57:49.384154: step 94, loss 0.260747, acc 0.953125, f1 0.769231
2017-01-15T19:57:49.710397: step 95, loss 0.328978, acc 0.9375, f1 0.666667

Evaluation:
2017-01-15T19:57:50.449595: step 95, loss 0.496342, acc 0.950851, f1 0.235294

2017-01-15T19:57:50.797519: step 96, loss 0.200466, acc 0.96875, f1 0.5
2017-01-15T19:57:51.131595: step 97, loss 0.297932, acc 0.953125, f1 0.571429
2017-01-15T19:57:51.469078: step 98, loss 0.208026, acc 0.984375, f1 0.909091
2017-01-15T19:57:51.821050: step 99, loss 0.279272, acc 0.9375, f1 0.5
2017-01-15T19:57:51.991929: step 100, loss 0.282956, acc 0.952381, f1 0.8

Evaluation:
2017-01-15T19:57:52.763781: step 100, loss 0.465704, acc 0.94896, f1 0.228571

Saved model checkpoint to /Users/kristof/GitHub/ntds_2016/prj-fake-news/runs/1484506616/checkpoints/model-100

2017-01-15T19:57:53.843807: step 101, loss 0.189515, acc 0.984375, f1 nan
2017-01-15T19:57:54.190101: step 102, loss 0.246904, acc 0.953125, f1 0.666667
2017-01-15T19:57:54.531445: step 103, loss 0.198424, acc 0.984375, f1 0.909091
2017-01-15T19:57:54.873641: step 104, loss 0.229382, acc 0.953125, f1 0.4
2017-01-15T19:57:55.236914: step 105, loss 0.353972, acc 0.953125, f1 0.571429

Evaluation:
2017-01-15T19:57:55.993338: step 105, loss 0.491336, acc 0.94707, f1 0.222222

2017-01-15T19:57:56.359117: step 106, loss 0.265302, acc 0.9375, f1 0.666667
2017-01-15T19:57:56.699087: step 107, loss 0.204732, acc 0.984375, f1 0.666667
2017-01-15T19:57:57.028370: step 108, loss 0.263835, acc 0.96875, f1 0.5
2017-01-15T19:57:57.360866: step 109, loss 0.368517, acc 0.921875, f1 0.285714
2017-01-15T19:57:57.702884: step 110, loss 0.232516, acc 0.96875, f1 0.666667

Evaluation:
2017-01-15T19:57:58.486288: step 110, loss 0.500793, acc 0.950851, f1 0.235294

2017-01-15T19:57:58.837775: step 111, loss 0.201347, acc 0.984375, f1 0.666667
2017-01-15T19:57:59.172357: step 112, loss 0.235387, acc 0.953125, f1 0.666667
2017-01-15T19:57:59.523070: step 113, loss 0.296669, acc 0.953125, f1 0.4
2017-01-15T19:57:59.873638: step 114, loss 0.251831, acc 0.984375, f1 0.888889
2017-01-15T19:58:00.228374: step 115, loss 0.19376, acc 0.96875, f1 0.5

Evaluation:
2017-01-15T19:58:01.258753: step 115, loss 0.503284, acc 0.950851, f1 0.235294

2017-01-15T19:58:01.787250: step 116, loss 0.338115, acc 0.9375, f1 0.5
2017-01-15T19:58:02.137969: step 117, loss 0.228713, acc 0.953125, f1 0.666667
2017-01-15T19:58:02.495506: step 118, loss 0.195734, acc 0.96875, f1 0.666667
2017-01-15T19:58:03.003743: step 119, loss 0.26119, acc 0.96875, f1 0.5
2017-01-15T19:58:03.205841: step 120, loss 0.172746, acc 1, f1 1

Evaluation:
2017-01-15T19:58:04.535752: step 120, loss 0.448449, acc 0.94896, f1 0.228571

2017-01-15T19:58:05.001240: step 121, loss 0.215572, acc 0.984375, f1 0.909091
2017-01-15T19:58:05.434980: step 122, loss 0.168813, acc 1, f1 1
2017-01-15T19:58:05.795706: step 123, loss 0.294405, acc 0.953125, f1 0.769231
2017-01-15T19:58:06.209694: step 124, loss 0.243993, acc 0.96875, f1 0.666667
2017-01-15T19:58:06.635938: step 125, loss 0.217016, acc 0.984375, f1 0.8

Evaluation:
2017-01-15T19:58:07.574386: step 125, loss 0.473331, acc 0.950851, f1 0.235294

2017-01-15T19:58:07.914601: step 126, loss 0.156682, acc 1, f1 1
2017-01-15T19:58:08.265190: step 127, loss 0.219759, acc 0.984375, f1 0.857143
2017-01-15T19:58:08.608869: step 128, loss 0.290975, acc 0.953125, f1 0.666667
2017-01-15T19:58:08.943990: step 129, loss 0.224669, acc 0.953125, f1 0.571429
2017-01-15T19:58:09.273731: step 130, loss 0.219486, acc 0.953125, f1 0.571429

Evaluation:
2017-01-15T19:58:10.010079: step 130, loss 0.48101, acc 0.950851, f1 0.235294

2017-01-15T19:58:10.353136: step 131, loss 0.15999, acc 1, f1 1
2017-01-15T19:58:10.692707: step 132, loss 0.249199, acc 0.953125, f1 0.769231
2017-01-15T19:58:11.018353: step 133, loss 0.196284, acc 0.984375, f1 0.666667
2017-01-15T19:58:11.347572: step 134, loss 0.22174, acc 0.953125, f1 0.571429
2017-01-15T19:58:11.675723: step 135, loss 0.226194, acc 0.96875, f1 0.5

Evaluation:
2017-01-15T19:58:12.462679: step 135, loss 0.470761, acc 0.950851, f1 0.235294

2017-01-15T19:58:12.804489: step 136, loss 0.285658, acc 0.9375, f1 0.333333
2017-01-15T19:58:13.140807: step 137, loss 0.265396, acc 0.953125, f1 0.727273
2017-01-15T19:58:13.466084: step 138, loss 0.203425, acc 0.984375, f1 0.888889
2017-01-15T19:58:13.797895: step 139, loss 0.176494, acc 1, f1 1
2017-01-15T19:58:13.955783: step 140, loss 0.151342, acc 1, f1 nan

Evaluation:
2017-01-15T19:58:14.706876: step 140, loss 0.484924, acc 0.950851, f1 0.235294

2017-01-15T19:58:15.046949: step 141, loss 0.243233, acc 0.96875, f1 0.8
2017-01-15T19:58:15.398323: step 142, loss 0.188186, acc 0.984375, f1 0.888889
2017-01-15T19:58:15.729683: step 143, loss 0.181651, acc 0.984375, f1 0.909091
2017-01-15T19:58:16.072989: step 144, loss 0.165745, acc 1, f1 1
2017-01-15T19:58:16.430011: step 145, loss 0.154883, acc 1, f1 1

Evaluation:
2017-01-15T19:58:17.244613: step 145, loss 0.478244, acc 0.950851, f1 0.235294

2017-01-15T19:58:17.763989: step 146, loss 0.171493, acc 0.984375, f1 0.666667
2017-01-15T19:58:18.149863: step 147, loss 0.15315, acc 1, f1 1
2017-01-15T19:58:18.470952: step 148, loss 0.233886, acc 0.9375, f1 0.5
2017-01-15T19:58:18.795054: step 149, loss 0.182659, acc 0.984375, f1 0.888889
2017-01-15T19:58:19.160935: step 150, loss 0.192327, acc 0.96875, f1 0.5

Evaluation:
2017-01-15T19:58:20.019899: step 150, loss 0.470659, acc 0.950851, f1 0.235294

2017-01-15T19:58:20.373162: step 151, loss 0.163573, acc 0.984375, f1 0.666667
2017-01-15T19:58:20.724709: step 152, loss 0.234789, acc 0.96875, f1 0.666667
2017-01-15T19:58:21.179806: step 153, loss 0.204161, acc 0.953125, f1 0.727273
2017-01-15T19:58:21.597070: step 154, loss 0.177659, acc 0.984375, f1 0.857143
2017-01-15T19:58:21.978372: step 155, loss 0.201177, acc 0.96875, f1 0.666667

Evaluation:
2017-01-15T19:58:22.723713: step 155, loss 0.477085, acc 0.950851, f1 0.235294

2017-01-15T19:58:23.052125: step 156, loss 0.257474, acc 0.9375, f1 0.666667
2017-01-15T19:58:23.386350: step 157, loss 0.195367, acc 0.96875, f1 0.5
2017-01-15T19:58:23.711322: step 158, loss 0.156767, acc 1, f1 1
2017-01-15T19:58:24.030182: step 159, loss 0.150544, acc 1, f1 1
2017-01-15T19:58:24.183707: step 160, loss 0.254638, acc 0.952381, f1 nan

Evaluation:
2017-01-15T19:58:24.953037: step 160, loss 0.462473, acc 0.94707, f1 0.222222

2017-01-15T19:58:25.310453: step 161, loss 0.157726, acc 1, f1 1
2017-01-15T19:58:25.652295: step 162, loss 0.145025, acc 1, f1 1
2017-01-15T19:58:26.116129: step 163, loss 0.165706, acc 0.984375, f1 0.666667
2017-01-15T19:58:26.452889: step 164, loss 0.143143, acc 1, f1 1
2017-01-15T19:58:26.783880: step 165, loss 0.191821, acc 0.984375, f1 0.909091

Evaluation:
2017-01-15T19:58:27.518515: step 165, loss 0.494155, acc 0.950851, f1 0.235294

2017-01-15T19:58:27.861946: step 166, loss 0.168067, acc 0.984375, f1 0.909091
2017-01-15T19:58:28.197675: step 167, loss 0.18347, acc 0.984375, f1 0.888889
2017-01-15T19:58:28.524492: step 168, loss 0.17834, acc 0.96875, f1 0.666667
2017-01-15T19:58:28.850879: step 169, loss 0.198878, acc 0.96875, f1 0.875
2017-01-15T19:58:29.173876: step 170, loss 0.16461, acc 0.984375, f1 0.888889

Evaluation:
2017-01-15T19:58:30.042838: step 170, loss 0.480547, acc 0.950851, f1 0.235294

2017-01-15T19:58:30.405389: step 171, loss 0.14073, acc 1, f1 1
2017-01-15T19:58:30.979094: step 172, loss 0.174225, acc 0.984375, f1 0.666667
2017-01-15T19:58:31.574420: step 173, loss 0.183012, acc 0.984375, f1 nan
2017-01-15T19:58:32.065952: step 174, loss 0.166413, acc 0.984375, f1 0.909091
2017-01-15T19:58:32.461964: step 175, loss 0.166081, acc 0.984375, f1 0.857143

Evaluation:
2017-01-15T19:58:33.285542: step 175, loss 0.468504, acc 0.950851, f1 0.235294

2017-01-15T19:58:33.722252: step 176, loss 0.152627, acc 1, f1 1
2017-01-15T19:58:34.154552: step 177, loss 0.142144, acc 1, f1 1
2017-01-15T19:58:34.678616: step 178, loss 0.193421, acc 0.96875, f1 0.875
2017-01-15T19:58:35.064563: step 179, loss 0.253689, acc 0.96875, f1 0.666667
2017-01-15T19:58:35.301129: step 180, loss 0.144671, acc 1, f1 nan

Evaluation:
2017-01-15T19:58:36.348812: step 180, loss 0.464608, acc 0.950851, f1 0.235294

2017-01-15T19:58:36.707042: step 181, loss 0.221752, acc 0.96875, f1 0.8
2017-01-15T19:58:37.064935: step 182, loss 0.132402, acc 1, f1 1
2017-01-15T19:58:37.396235: step 183, loss 0.131999, acc 1, f1 1
2017-01-15T19:58:37.906337: step 184, loss 0.157284, acc 0.984375, f1 0.666667
2017-01-15T19:58:38.319012: step 185, loss 0.158061, acc 0.984375, f1 0.857143

Evaluation:
2017-01-15T19:58:39.318906: step 185, loss 0.458267, acc 0.950851, f1 0.235294

2017-01-15T19:58:39.666073: step 186, loss 0.168663, acc 1, f1 1
2017-01-15T19:58:40.029526: step 187, loss 0.143127, acc 1, f1 1
2017-01-15T19:58:40.352188: step 188, loss 0.234858, acc 0.953125, f1 0.571429
2017-01-15T19:58:40.686645: step 189, loss 0.137505, acc 1, f1 1
2017-01-15T19:58:41.010647: step 190, loss 0.163097, acc 0.984375, f1 0.8

Evaluation:
2017-01-15T19:58:41.752168: step 190, loss 0.469551, acc 0.950851, f1 0.235294

2017-01-15T19:58:42.099269: step 191, loss 0.134454, acc 1, f1 1
2017-01-15T19:58:42.435295: step 192, loss 0.126812, acc 1, f1 1
2017-01-15T19:58:42.759274: step 193, loss 0.162672, acc 0.984375, f1 0.909091
2017-01-15T19:58:43.085062: step 194, loss 0.181842, acc 0.984375, f1 0.666667
2017-01-15T19:58:43.409546: step 195, loss 0.142935, acc 1, f1 1

Evaluation:
2017-01-15T19:58:44.164206: step 195, loss 0.465725, acc 0.950851, f1 0.235294

2017-01-15T19:58:44.505361: step 196, loss 0.190513, acc 0.96875, f1 0.8
2017-01-15T19:58:44.837611: step 197, loss 0.172384, acc 0.984375, f1 0.888889
2017-01-15T19:58:45.187341: step 198, loss 0.142326, acc 1, f1 1
2017-01-15T19:58:45.538330: step 199, loss 0.128841, acc 1, f1 1
2017-01-15T19:58:45.701334: step 200, loss 0.123986, acc 1, f1 1

Evaluation:
2017-01-15T19:58:46.816622: step 200, loss 0.452735, acc 0.950851, f1 0.235294

Saved model checkpoint to /Users/kristof/GitHub/ntds_2016/prj-fake-news/runs/1484506616/checkpoints/model-200

2017-01-15T19:58:48.974729: step 200, loss 0.452735, acc 0.950851, f1 0.235294

In [33]:
print("Test accuracy: %.4f" % accuracy)
print("Test F1-Score: %.4f" % f1_score)


Test accuracy: 0.9509
Test F1-Score: 0.2353

Inspect Results

Open TensorBoard to inspect the results of the training.

Replace MODEL_ID (or the 10-digit number in the path) with the value returned by the following cell:

INFO: works only in Chrome.


In [34]:
print(model.model_id)


1484506616

In [ ]:
!tensorboard --logdir runs/1484506616/summaries/

Predict

Predict the truthfulness of new messages.


In [40]:
x_raw = ["Hillary just admitted Trump should be elected president.", 
         "Trump to resign from the race to the White House.",
         "ALERT: Police Pull Over Islamic Refugee, Horrified To See What Was In The Car!!! The media has swept this under the rug, but it is a severe national security threat.",
         "If Trump becomes President, this will not end well.",]

predictions = model.predict(x_raw)

for pred, message in zip(predictions, x_raw):
    if pred == 0:
        print("TRUE: ", message)
    else:
        print("FAKE: ", message)


TRUE:  Hillary just admitted Trump should be elected president.
TRUE:  Trump to resign from the race to the White House.
FAKE:  ALERT: Police Pull Over Islamic Refugee, Horrified To See What Was In The Car!!! The media has swept this under the rug, but it is a severe national security threat.
FAKE:  If Trump becomes President, this will not end well.

3.3 Fake-News Detection with Non-Textual Features: K-Nearest Neighbors Classifier

The two previous models were based only on the text messages attached to the Facebook posts. However, as seen in the exploration section, other non-textual features, such as the type of content published (i.e. video, photo, link or status) or the number of comments, shares and reactions, are also discriminative of the truthfulness of the information.

Therefore, we take these features, map the categorical ones to dummy variables (a.k.a. one-hot vectors) and try to detect fake-news articles with a K-nearest neighbors model. This model is appealing since it does not make any linearity assumption on the classification boundaries.


In [8]:
from lib.exploitation import KNNClassifier

Preprocessing

Format the dataframe and split the dataset into a training and a heldout testing set.


In [42]:
# Filter the data to consider only posts labeled as 'mostly true' and 'mostly false'
mask_true_false = (df.rating == 'mostly true') | (df.rating == 'mostly false')
filtered_df = df[mask_true_false]

# Choose the features
knn_data = filtered_df[['comment_count', 'share_count', 'reaction_count']]
knn_data = knn_data.join(pd.get_dummies(filtered_df[['type', 'page']]))
# Rescale the features
for col in ['comment_count', 'share_count', 'reaction_count']:
    knn_data[col] = knn_data[col] / knn_data[col].max()

X = knn_data.values
y = np.array(filtered_df['rating'] == 'mostly false', dtype='bool')
print("There are {} labeled messages with {} features.".format(*X.shape))
print("There are {} legitimate messages and {} fake ones.".format((y==0).sum(), (y==1).sum()))


There are 1766 labeled messages with 16 features.
There are 1662 legitimate messages and 104 fake ones.

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("There are {} legitimate messages and {} fake ones in the training set.".format((y_train==0).sum(), (y_train==1).sum()))
print("There are {} legitimate messages and {} fake ones in the testing set.".format((y_test==0).sum(), (y_test==1).sum()))


There are 1148 legitimate messages and 88 fake ones in the training set.
There are 514 legitimate messages and 16 fake ones in the testing set.

 Hyperparameter tuning

The only hyperparameter to tune for the $K$-nearest neighbors model is the number of neighbors $K$ used in the majority vote.

In the plot below, we see that using only $K=1$ neighbor leads to the best F1-score on the validation set under cross-validation. This behavior may be explained by the imbalance of the dataset and the scarcity of fake-news samples in the training set.


In [44]:
# Tune the hyperparameter `n_neighbors` (the number of nearest neighbors)
knn_classifier = KNNClassifier()
n_search_space = np.arange(5)*2 + 1
n_search_perf = knn_classifier.tune_n_neighbors(X_train, y_train, n_search_space)

fig, ax = plt.subplots(1, 1, figsize=(6,4))
ax.plot(n_search_space, n_search_perf)
ax.set_ylabel("F1-Score"); ax.set_xlabel("Number of nearest neighbors");
ax.set_title("Number of neighbors hyperparameter tuning on validation set");


Evaluation

Using the non-textual features with this algorithm leads to roughly the same results as the other models based on the text messages. As stated in the previous sections, this is likely due to the high imbalance of the dataset.


In [45]:
knn_classifier = KNNClassifier(n_neighbors=1)
# Train the model on the training set
knn_classifier.fit(X_train, y_train)
# Evaluate the model on the heldout testing set
_, nb_performance_metrics = knn_classifier.predict(X_test, y_test)
print('Accuracy:', nb_performance_metrics['accuracy'])
print('F1 score:', nb_performance_metrics['f1_score'])
print('Confusion Matrix:\n', nb_performance_metrics['confusion_matrix'])


Accuracy: 0.924528301887
F1 score: 0.2
Confusion Matrix:
 [[485  29]
 [ 11   5]]

Conclusion

In this project, inspired by the work from BuzzFeed, we scraped data from nine Facebook pages posting actively about the U.S. presidential elections in order to analyze the truthfulness of the information published. We saw that pages publishing mostly fake news typically use fewer words in their messages, suggesting that they are trying to craft click-bait posts. We also noticed that right-wing pages are more prone to publishing mostly fake messages, while mainstream pages usually publish mostly true ones.

We then trained two text-based classifiers, namely a Naive Bayes model and a Convolutional Neural Network, on the dataset of labelled messages. Due to the small amount of data, as well as the very unbalanced classes, we could not obtain convincing results. We additionally tried a model based on non-textual, domain-specific features to improve our predictions, but for the same reasons the results were not promising.

To improve upon this work, we should tackle the class-imbalance issue; techniques such as subsampling or cost-sensitive classification should be considered. Moreover, the labelled dataset covers only one week of Facebook activity, whereas we collected almost a year of data: labelling this additional data would increase the size of our training set.