NTDS Project: How Do Fake-News Go Viral?

[Or why Bernie Sanders could replace President Trump with little-known loophole](http://www.huffingtonpost.com/entry/bernie-sanders-could-replace-president-trump-with-little_us_5829f25fe4b02b1f5257a6b7)

Authors: Victor Kristof and William Trouleau

The click-bait headline of the title, published on November 14, 2016 by the Huffington Post, is clearly fake (and on purpose). It cleverly sheds light on an issue that became mainstream with the recent US election: fake-news propaganda on social networks. Indeed, an analysis of user engagement performed by BuzzFeed concluded that fake-news stories generated more engagement on Facebook than the top election stories from 19 major news outlets combined. The idea of this project is to analyze posts from famous Facebook pages publishing articles related to the US elections and to measure the impact of misinformation on the elections.

In particular, we follow the analysis of BuzzFeed on nine famous Facebook pages actively publishing articles related to the US elections:

BuzzFeed manually annotated all articles posted by these nine pages during one week in September 2016, and rated them according to the accuracy of the information published. These ratings are described in more detail in Section 1. However, their analysis was limited to the exploration of engagement metrics such as the number of reactions, comments or shares generated by fake-news articles.

In this project, we extract additional features of the posts like the text message, the type of content, and so on. We then apply Machine Learning techniques to automatically detect fake-news ratings from unlabelled posts.

  • In Section 1. (Acquisition), we first detail the scraping process to acquire the Facebook posts used throughout the project. We then clean the data and save it to a csv file for future use.

  • In Section 2. (Exploration), we explore the scraped data. We first investigate the volume of data that was scraped for each page and for each political orientation. We then dig deeper into the text messages that will later be used for exploitation, and explore how these messages relate to the truthfulness of the articles. Since we also extracted other kinds of features, we investigate them in the same way.

  • In Section 3. (Exploitation/Evaluation), we finally use the scraped data to build a detection algorithm that predicts whether or not a given Facebook post is propaganda. We tackle this problem with three different approaches.

    • The first two approaches aim at detecting the truthfulness of articles based only on the text messages of the post. We take inspiration from Sentiment Analysis modeling, but here we try to predict truthfulness rather than sentiment. In particular, we first build a simple Naive Bayes classifier, but its limitations are quickly reached. We then implement a more involved model based on Convolutional Neural Networks that is the state-of-the-art for Sentiment Analysis.
    • Finally, we build a last model based on other non-textual features that were found to be discriminative in the exploration step.

1. Data Acquisition

NOTE: In order to run the data acquisition process, one needs to obtain a token to connect to the Facebook Graph API.

We collect data from nine famous Facebook pages actively publishing articles related to the US elections:

Each collected post includes the name of the page in addition to:

  • the post id post_id,
  • the URL of the post post_url,
  • the time created_time when the post was created,
  • the type of post type (i.e. status, link, video, photo),
  • the text message of the post,
  • the number of times the post was shared share_count,
  • the number of times the post received a reaction reaction_count,
  • the number of comments on the post comment_count.

1.1 Facebook Web Scraper

For more flexibility in the scraping process, we collect the data with a low-level approach based on the requests library. The code of the scraper FacebookScraper, as well as all the code corresponding to the data collection part, is contained in the acquisition module.

The FacebookScraper takes an iterable of Features, which represents the fields we want to query on the Facebook Graph API, and provides a convenient way to query, clean and save the raw data into a well-formatted pandas.DataFrame.
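
As a rough illustration of what such a low-level query could look like, the sketch below builds the request parameters for the posts of one page and flattens one returned post into the columns listed above. The API version, the field expressions, the sample post and the helper names (`build_query`, `flatten_post`) are assumptions made for illustration. This is not the actual FacebookScraper code, and the Graph API has changed since 2016.

```python
# Assumed endpoint and API version (hypothetical, for illustration only).
GRAPH_URL = "https://graph.facebook.com/v2.8/{page}/posts"

# Assumed field expressions for the quantities listed above.
FIELDS = [
    "id", "permalink_url", "created_time", "type", "message", "shares",
    "reactions.limit(0).summary(true)", "comments.limit(0).summary(true)",
]

def build_query(page, token, since, until):
    """Return the URL and query parameters for one page and date range."""
    params = {"fields": ",".join(FIELDS), "since": since,
              "until": until, "access_token": token}
    return GRAPH_URL.format(page=page), params

def flatten_post(page, post):
    """Turn one (assumed) Graph API post object into a flat row."""
    return {
        "page_name": page,
        "post_id": post["id"],
        "post_url": post.get("permalink_url"),
        "created_time": post.get("created_time"),
        "type": post.get("type"),
        "message": post.get("message", ""),
        "share_count": post.get("shares", {}).get("count", 0),
        "reaction_count": post.get("reactions", {})
                              .get("summary", {}).get("total_count", 0),
        "comment_count": post.get("comments", {})
                             .get("summary", {}).get("total_count", 0),
    }

# With the requests library, the call itself would be along the lines of:
#   url, params = build_query("CNN", token, "2016-09-19", "2016-09-23")
#   payload = requests.get(url, params=params).json()

# Made-up post in the assumed response format.
sample_post = {
    "id": "123_456",
    "created_time": "2016-09-20T14:00:00+0000",
    "type": "link",
    "message": "Example message",
    "shares": {"count": 12},
    "reactions": {"data": [], "summary": {"total_count": 340}},
    "comments": {"data": [], "summary": {"total_count": 27}},
}
row = flatten_post("CNN", sample_post)
```

Posts without any share simply lack the `shares` key in this assumed format, which is why the counts default to 0.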


Define the parameters for the web scraper:

  • Credential file where your token is located
  • List of the pages to scrape
  • Date range over which we want to collect data
  • Fields to query

Run the web scraper.

WARNING: By running this cell, you will crawl ~60 MB of data from Facebook.

1.2 Data Cleaning

1.3 Data Augmentation

We now augment the dataset with the political orientation of each page as well as the manually annotated rating of fact checking performed by BuzzFeed in this article.

Add the political orientation

Add the fact checking annotation

To write their article, BuzzFeed already collected this data for one week in September 2016. Then they manually annotated it to add a rating on the content truthfulness. According to their article, their methodology was the following:

Posts could be rated “mostly true,” “mixture of true and false,” or “mostly false.” If we encountered a post that was satirical or opinion-driven, or that otherwise lacked a factual claim, we rated it “no factual content.” (We chose to rate things as “mostly” true or false in order to allow for smaller errors or accurate facts within otherwise true or false claims or stories.)
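
Both augmentation steps amount to a lookup and a join. The sketch below shows one way to do this with pandas on toy data. The post ids, the category labels (`mainstream`, `left`, `right`) and the layout of the ratings table are assumptions for illustration; only the page names and their orientation come from the text.

```python
import pandas as pd

# Toy posts; the ids are made up for illustration.
posts = pd.DataFrame({
    "post_id": ["p1", "p2", "p3"],
    "page_name": ["CNN", "Occupy Democrats", "The Eagle is Rising"],
})

# Orientation of the pages named in the text (label names are assumed).
orientation = {
    "Politico": "mainstream",
    "CNN": "mainstream",
    "ABCNewsPolitics": "mainstream",
    "Occupy Democrats": "left",
    "The Eagle is Rising": "right",
}
posts["category"] = posts["page_name"].map(orientation)

# Hypothetical ratings table keyed by post id.
ratings = pd.DataFrame({
    "post_id": ["p1", "p3"],
    "rating": ["mostly true", "mostly false"],
})

# A left join keeps every scraped post; unrated posts get a missing rating.
augmented = posts.merge(ratings, on="post_id", how="left")
mask_labelled = augmented["rating"].notna()
```

The left join matters here: only one week of posts carries a rating, so most scraped posts stay unlabelled and can be filtered out with a mask such as `mask_labelled`.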

2. Data Exploration

Let us now explore the collected data.

In [7]:
p = figure(title='Number of articles published each week for each page', plot_width=950, plot_height=400)
for i, page_name in enumerate(df.page_name.unique()):
    s = df[df.page_name==page_name]['created_week'].value_counts().sort_index()
    p.line(x=s.index, y=np.array(s), color=palette[i], legend=page_name, line_width=2)
p.xaxis.axis_label = 'Week of the year 2016'
p.yaxis.axis_label = 'Number of articles published'
p.legend.location = 'top_left'

The three mainstream pages (i.e. Politico, CNN and ABCNewsPolitics) exhibit the same peak around weeks 29-30. This time period corresponds to the national conventions of the two parties: the Republican National Convention (July 18-21) and the Democratic National Convention (July 25-28). In addition, the peak at week 45 corresponds to articles released during the week of the election, which was held on November 8. We can clearly see a big drop for all pages two weeks after the election.
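
To relate week numbers to calendar dates, the following sketch derives a week-of-year column from timestamps. It assumes that `created_week` is the ISO week number, which the text does not state explicitly.

```python
import pandas as pd

# Three dates from 2016: two in late July and the election day.
created_time = pd.Series(pd.to_datetime([
    "2016-07-18 12:00:00",
    "2016-07-25 12:00:00",
    "2016-11-08 12:00:00",
]))

# ISO week of the year (assumed definition of `created_week`).
created_week = created_time.dt.isocalendar().week
weeks = created_week.tolist()  # [29, 30, 45]
```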

Number of Comments Published each Week for each Page

In [8]:
p = figure(title='Number of comments published each week for each page', plot_width=950, plot_height=400)
for i, page_name in enumerate(df.page_name.unique()):
    # Select the column before summing so that non-numeric columns are not aggregated
    s = df[df.page_name==page_name].groupby('created_week')['comment_count'].sum()
    p.line(x=s.index, y=np.array(s), color=palette[i], legend=page_name, line_width=2)
p.xaxis.axis_label = 'Week of the year 2016'
p.yaxis.axis_label = 'Number of comments'
p.legend.location = 'top_left'

2.3 Analysis of the Text Messages

Analysis of Message Length

Let us first investigate the length of the messages from different perspectives.

Text Features Analysis

Let us now dive deeper into the text messages and explore the relevant terms that we will use as text features in the data exploitation part. We start by vectorizing the raw text messages into a matrix counting the occurrences of each word in each post.
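
A minimal sketch of this vectorization step, assuming scikit-learn's `CountVectorizer` with its default tokenization (the text does not say which tool or settings were used). The three messages are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration.
messages = [
    "Trump and Clinton debate",
    "Clinton leads Trump in poll",
    "Trump rally",
]

vectorizer = CountVectorizer()           # lowercases and splits on word boundaries
X = vectorizer.fit_transform(messages)   # sparse matrix: one row per post, one column per word

# Total number of occurrences of each word over the whole corpus.
totals = np.asarray(X.sum(axis=0)).ravel()
word_counts = {word: int(totals[col])
               for word, col in vectorizer.vocabulary_.items()}
top_words = sorted(word_counts, key=word_counts.get, reverse=True)[:2]
```

Sorting the column totals in this way is one simple route to the kind of "top words" tables discussed in this section.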

Rating vs. Political Orientation Category

Only the labelled messages will actually be useful to us to train the models in the exploitation section. Do the words used in posts rated as mostly true differ from the ones rated as mostly false?

In the tables below, we can see that the top words of both classes share many common words like trump, donald, obama, clinton, hillary, ... However, it is interesting to note that the 10th most used word of the mostly false articles is muslim and that the 17th one is lie.

2.4 Exploration of Other Non-Text Discriminative Features in Fake News Articles

Let us now explore other non-textual features to check whether they clearly separate fake news from legitimate news.

Rating vs. Political Orientation Category

Let us first investigate the proportions of genuine and fake-news articles based on the political orientation of the pages. The figure below clearly shows that the mainstream pages contain much less fake content than the politically oriented pages.

In [20]:
exploration.count_group(df[mask_labelled], by=['rating', 'category'])

Rating vs. Page

Let us now analyze this behavior per page in more depth, with the number of mostly false news for each page. The no factual content rating for the mainstream media corresponds in 100% of the cases to posts advertising their own content. The ones for the right and left media are all jokes and subjective content. The figure below clearly shows that the pro-left Occupy Democrats and the pro-right The Eagle is Rising are the two most active distributors of fake news.

Rating vs. Type of content

Now that we know which pages are the most likely to publish fake news, it is natural to wonder whether the type of content published (i.e. status, link, photo, video) helps distinguish truthful articles from fake ones.

In the table below, we can see that most articles are links. A manual inspection shows that they are links to full articles hosted on the websites of the respective pages, with a short summary text message. There is no clear difference between mostly true and mostly false articles. However, we clearly see that articles with no factual content (i.e. articles that are satirical, opinion-driven, or that otherwise lack a factual claim) are mostly photos and videos, whereas both types are mostly absent from all other rating classes.
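
For readers who want to reproduce this kind of table without the project's own helper functions, a plain pandas cross-tabulation is one option. The rows below are made up, and the column names `rating` and `type` follow the names used earlier in the text.

```python
import pandas as pd

# Made-up labelled posts for illustration.
labelled = pd.DataFrame({
    "rating": ["mostly true", "mostly true", "mostly false",
               "no factual content", "no factual content"],
    "type": ["link", "link", "link", "photo", "video"],
})

# Raw counts of posts per (rating, type) pair.
counts = pd.crosstab(labelled["rating"], labelled["type"])

# Share of each content type within a rating class (rows sum to 1).
shares = pd.crosstab(labelled["rating"], labelled["type"], normalize="index")
```

Normalizing per row makes the rating classes comparable even though they contain very different numbers of posts.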

3. Data Exploitation

Now that we have a pretty good idea of the data we are dealing with, we can implement models to detect fake-news articles from the various features we discussed earlier. To tackle this task, we take inspiration from sentiment analysis models, whose goal is to detect sentiments such as happiness and sadness in text data.

3.1 Fake-News Classification using Naive Bayes

A widely used model for sentiment classification is the Naive Bayes model. The assumption behind this model is that all features (i.e. words in the vocabulary) are conditionally independent given the class. This assumption allows a nice and simple mathematical derivation of the learning algorithm.
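
Written out, with notation introduced here for illustration ($c$ for the class and $w_1, \dots, w_n$ for the words of a message $m$), the independence assumption turns the posterior into a product of per-word terms:

$$P(c \mid m) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c), \qquad \hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big].$$

In the standard multinomial variant with additive smoothing (assumed here, since the text does not state which variant is used), the per-word term is estimated as

$$P(w \mid c) = \frac{N_{w,c} + \alpha}{N_c + \alpha\,|V|},$$

where $N_{w,c}$ is the weight of word $w$ in the messages of class $c$, $N_c = \sum_{w} N_{w,c}$, $|V|$ is the vocabulary size and $\alpha$ is the smoothing parameter. The class prior $P(c)$ and $\alpha$ are the two quantities that can be tuned.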

Format the dataframe and split the dataset into a training set and a held-out testing set.

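Under the bag-of-words assumption, the messages are tokenized into words and the features are normalized with the widely used "tf-idf" weighting. In its simplest form (standard implementations additionally take a logarithm of the inverse document frequency), it reads: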

$$\text{tf-idf}(w,m) = \frac{\text{Number of times word } w \text{ appears in message } m}{\text{Number of messages in which word } w \text{ occurs}}.$$

This measure emphasizes words that occur a lot in a single message but are rare in the whole corpus.
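
The ratio above can be checked by hand on a toy corpus. The sketch below implements exactly that simplified ratio (term count divided by document count), not the log-scaled variant found in most libraries. The function name `simple_tf_idf` and the messages are made up.

```python
from collections import Counter

def simple_tf_idf(messages):
    """Term count in a message divided by the number of messages containing the term."""
    tokenized = [m.lower().split() for m in messages]
    # Document frequency: in how many messages does each word occur?
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    return [{w: count / doc_freq[w] for w, count in Counter(tokens).items()}
            for tokens in tokenized]

weights = simple_tf_idf(["trump trump lie", "trump wins", "clinton wins"])
# weights[0] == {"trump": 1.0, "lie": 1.0}
# weights[1] == {"trump": 0.5, "wins": 0.5}
# weights[2] == {"clinton": 1.0, "wins": 0.5}
```

In the first message, trump scores as high as lie only because it is repeated. In the second message, where it appears once, its weight is halved because it occurs in two messages.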

The first plot (on the left) below shows that a small alpha (e.g. 1.0) should be used. A higher value seems to smooth the features too much, thus reducing their discriminative power. The other plot (on the right) shows that a prior between 0.3 and 0.4 should be used for the fake-news class.
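
As a sketch of how these two hyperparameters could be set, assuming a scikit-learn pipeline (the text does not show the actual implementation), the smoothing parameter and the class prior map to the `alpha` and `class_prior` arguments of `MultinomialNB`. The messages and labels below are made up, and the prior of 0.35 is simply one value in the range mentioned above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: label 1 = fake news, label 0 = legitimate.
train_messages = [
    "senate votes on budget bill",
    "candidates meet for first debate",
    "new poll shows tight race",
    "shocking secret they hide from you",
    "you will not believe this lie",
]
train_labels = [0, 0, 0, 1, 1]

# class_prior follows the sorted class order: [P(legitimate), P(fake)].
model = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB(alpha=1.0, class_prior=[0.65, 0.35]),
)
model.fit(train_messages, train_labels)

new_messages = ["poll shows debate", "shocking lie"]
predictions = model.predict(new_messages)
probabilities = model.predict_proba(new_messages)
```

Fixing the prior by hand, instead of estimating it from the class frequencies, is one way to push the classifier towards the minority class.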

Once the hyperparameters are tuned, we can evaluate the performance of the Naive Bayes model on the held-out testing dataset. We can see below that the model achieves a good accuracy. However, since our dataset is highly unbalanced, accuracy hides the fact that the model only performs well at classifying the mostly true articles. The second row of the confusion matrix shows that, among the $19$ fake-news articles of the testing set, the model wrongly classifies $11$ as legitimate.

Due to the class imbalance, it is thus much more relevant to evaluate the model using the $F_1$-score. This metric is defined as:

$$F_1 = 2 \cdot \frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}},$$

where $$\text{precision} = \frac{\text{True positive}}{\text{True positive}+\text{False positive}}, ~\text{and}~ \text{recall} = \frac{\text{True positive}}{\text{True positive}+\text{False negative}}.$$

Taking into account both the precision and recall instead of the accuracy avoids high scores due to imbalanced datasets.
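
To make these definitions concrete, the sketch below computes the three quantities from raw counts. The true positives and false negatives follow from the numbers quoted above (19 fake-news articles, 11 of them missed, hence 8 detected). The number of false positives is not given in the text, so the value used here is hypothetical.

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from counts; NaN where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp > 0 else math.nan
    recall = tp / (tp + fn) if tp + fn > 0 else math.nan
    if math.isnan(precision) or math.isnan(recall) or precision + recall == 0:
        return precision, recall, math.nan
    return precision, recall, 2 * precision * recall / (precision + recall)

TP, FN = 8, 11   # from the text: 19 fake-news articles, 11 misclassified
FP = 4           # hypothetical value, not reported in the text

precision, recall, f1 = precision_recall_f1(TP, FP, FN)
# precision ~ 0.667, recall ~ 0.421, f1 ~ 0.516
```

Whatever the true number of false positives, the recall is fixed at 8/19, i.e. about 0.42, which is what the accuracy alone fails to reveal.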

3.2 Fake-News Classification using Convolutional Neural Network

We implement here a method to predict the truthfulness of a post using a Convolutional Neural Network. We vectorize the sentences by mapping each word to an index in the vocabulary of the whole corpus. The extracted vocabulary has size 6490. Word embeddings are then learned directly by the network. The architecture is built such that filters of different sizes (2, 3, 4 and 5 in our case) consecutively process the embedded words, sliding over different numbers of words at the same time (i.e. convolution).

We use a dropout probability $p=0.5$, as well as an $L_2$-regularizer with $\lambda = 0.1$ in order to counteract overfitting.
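
The convolution and pooling steps can be illustrated with a small NumPy forward pass. This is only a sketch of the mechanics with random weights, not the TensorFlow model used in the project: the embedding dimension, the number of filters and the ReLU and max-over-time pooling choices are assumptions, while the vocabulary size and filter sizes come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 6490             # from the text
FILTER_SIZES = (2, 3, 4, 5)   # from the text
EMBED_DIM = 16                # hypothetical
NUM_FILTERS = 8               # hypothetical, per filter size

# Random stand-ins for the parameters that the network would learn.
embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
filters = {k: rng.normal(size=(NUM_FILTERS, k, EMBED_DIM)) for k in FILTER_SIZES}

def conv_max_features(word_ids):
    """Embed a padded sentence, convolve with each filter size, max-pool over time."""
    x = embeddings[word_ids]                       # (seq_len, EMBED_DIM)
    pooled = []
    for k, w in filters.items():
        # All windows of k consecutive word vectors: (n_windows, k, EMBED_DIM)
        windows = np.stack([x[i:i + k] for i in range(len(word_ids) - k + 1)])
        # One activation per window and filter, followed by a ReLU
        activations = np.maximum(np.einsum("wkd,fkd->wf", windows, w), 0.0)
        pooled.append(activations.max(axis=0))     # max over window positions
    return np.concatenate(pooled)                  # (len(FILTER_SIZES) * NUM_FILTERS,)

# A padded sentence of 8 word indices (made up).
features = conv_max_features(np.array([12, 405, 7, 0, 0, 0, 0, 0]))
```

The resulting fixed-length vector is what a final classification layer would consume, regardless of the original sentence length.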

Reference: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

Preprocess the data

We preprocess the texts by padding them such that they all have the same length.
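
A minimal sketch of the index mapping and padding, in plain Python. Reserving index 0 for padding and splitting on whitespace are assumptions made for illustration.

```python
def build_vocab(messages):
    """Map each word to an integer index; index 0 is reserved for padding."""
    vocab = {"<PAD>": 0}
    for message in messages:
        for word in message.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_and_pad(messages, vocab):
    """Replace words by their indices and right-pad to the longest message."""
    encoded = [[vocab[w] for w in m.lower().split()] for m in messages]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]

messages = ["Trump wins debate", "Clinton wins"]
vocab = build_vocab(messages)
padded = encode_and_pad(messages, vocab)
# padded == [[1, 2, 3], [4, 2, 0]]
```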

Training and Evaluation

Train the CNN on the data. Batches of size 64 samples are randomly generated and the whole training set is processed for 10 epochs.
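
The batching scheme described above can be sketched as a small generator. The name `batch_iter` and the reshuffling at every epoch are assumptions; the batch size and number of epochs are the values given in the text.

```python
import numpy as np

def batch_iter(data, batch_size=64, num_epochs=10, seed=0):
    """Yield shuffled mini-batches, going through the whole dataset once per epoch."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    for _ in range(num_epochs):
        order = rng.permutation(len(data))         # new random order each epoch
        for start in range(0, len(data), batch_size):
            yield data[order[start:start + batch_size]]

# With 150 samples, each epoch has 3 batches (64 + 64 + 22 samples).
batches = list(batch_iter(np.arange(150)))
```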

The evaluation is done simultaneously by testing the model on a held-out testing dataset.

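Note: the NaN values displayed for the F1 score for a couple of iterations are due to the highly unbalanced data. When no fake-news post is correctly detected, the precision and recall are both zero (or undefined), and the F1 score becomes a division of zero by zero.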

Inspect Results

Open TensorBoard to inspect the results of the training.

INFO: works only in Chrome.

Predict the truthfulness of new messages.

3.3 Fake-News Detection with Non-Textual Features: K-Nearest Neighbors Classifier

The two previous models were only based on the text messages written with the Facebook post. However, as seen in the exploration section, other non-textual features like the type of content published (i.e. video, photo, link or status) or the number of comments, shares and reactions are also discriminative of the truthfulness of the information.

Therefore, we take these features, map the categorical ones to dummy variables (a.k.a. one-hot vectors) and try to detect the fake-news articles with a K-nearest neighbors model. This model is appealing since it does not make any linearity assumption on the classification boundaries.
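
The mapping to dummy variables can be sketched with `pandas.get_dummies`. The rows below are made up, and the column names are the ones introduced in the acquisition section.

```python
import pandas as pd

# Made-up posts for illustration.
posts = pd.DataFrame({
    "type": ["link", "photo", "video", "link"],
    "share_count": [10, 250, 40, 3],
    "reaction_count": [120, 900, 310, 15],
    "comment_count": [8, 75, 22, 1],
})

# Each content type becomes its own 0/1 column; numeric columns are kept as-is.
features = pd.get_dummies(posts, columns=["type"])
```

Since K-nearest neighbors relies on distances, the raw counts would dominate the 0/1 columns unless the features are rescaled first. Whether the project rescales them is not stated in the text.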

Format the dataframe and split the dataset into a training set and a held-out testing set.

Hyperparameter tuning

The only hyperparameter to tune for the $K$-nearest neighbors model is the number of neighbors $K$ to use in the majority vote computation.

On the plot below, we see that $K=1$ neighbor leads to the best F1-score under cross-validation. This behavior may be explained by the imbalance of the dataset and the lack of fake-news samples in the training set.
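
A sketch of such a selection procedure with scikit-learn, assuming stratified 5-fold cross-validation and the F1-score as the selection criterion. The synthetic, imbalanced dataset and the candidate values of $K$ are made up and stand in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data: roughly 10% of the samples in the positive class.
X, y = make_classification(n_samples=300, n_features=6,
                           weights=[0.9, 0.1], random_state=0)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidate_ks = [1, 3, 5, 7, 9]
mean_f1 = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                       X, y, cv=cv, scoring="f1").mean()
    for k in candidate_ks
}
best_k = max(mean_f1, key=mean_f1.get)
```

With few positive samples, a large $K$ lets the majority class outvote the minority neighbors, which is consistent with small values of $K$ scoring better on the F1-score.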

Using the non-textual features with this algorithm leads to roughly the same results as the other models based on the text messages. As stated in the previous sections, this result may well come from the high imbalance of the dataset.

4. Conclusion

In this project, inspired by the work from BuzzFeed, we scraped data from nine Facebook pages posting actively about the U.S. presidential elections to analyze the truthfulness of the information published. We saw that pages publishing mostly fake news typically use fewer words in their messages, suggesting that they are trying to craft click-bait posts. We also noticed that right-wing pages are more prone to publish mostly fake messages and that mainstream pages usually publish mostly true ones.

We then tried two text-based classifiers, namely Naive Bayes and a Convolutional Neural Network, in order to fit a dataset of labelled messages. Due to the small amount of data, as well as the very unbalanced classes, we could not obtain convincing results. We also tried to use domain-specific features to improve our predictions. For the same reasons as before, the results were not promising.

To improve upon our work, we should tackle the class imbalance issue. Some techniques exist, such as subsampling or cost-sensitive classification, and should be considered. Moreover, the labelled dataset covers only a week of Facebook activity, whereas we collected data over a year. This fresh data should be labelled in order to increase the size of our training set.
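
As an illustration of the subsampling idea, the sketch below randomly undersamples the majority class so that all classes end up with the same number of samples. The function name `undersample` and the toy data are made up, and this is only one of several possible rebalancing schemes.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly keep as many samples of each class as the rarest class has."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid having the classes in contiguous blocks
    return X[keep], y[keep]

# Toy data: 90 legitimate posts (0) and 10 fake ones (1).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_balanced, y_balanced = undersample(X, y)
```

The price of this scheme is that most of the majority-class samples are discarded, which is a real cost on a dataset that is already small. Cost-sensitive classification keeps all samples and reweights the errors instead.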