Looking for Sneaky Clickbait

The aim of this experiment is to evaluate the clickbait detector model and find out what kinds of clickbait it fails to detect.


In [10]:
from keras.models import load_model
from keras.preprocessing import sequence
import sys
import string 
import re


UNK = "<UNK>"  # token substituted for out-of-vocabulary words
PAD = "<PAD>"  # padding token
MATCH_MULTIPLE_SPACES = re.compile(" {2,}")  # collapses runs of two or more spaces
SEQUENCE_LENGTH = 20  # headlines are padded/truncated to this many tokens

Load the model and vocabulary


In [11]:
model = load_model("../models/detector.h5")


vocabulary = open("../data/vocabulary.txt").read().split("\n")
inverse_vocabulary = dict((word, i) for i, word in enumerate(vocabulary))  # word -> index lookup
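
As a quick sanity check, a few lookups can be printed (a minimal sketch; the example words are hypothetical and the exact contents of vocabulary.txt are an assumption):

for word in ["the", "trump", UNK]:
    print word, "->", inverse_vocabulary.get(word, inverse_vocabulary[UNK])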

Load validation data


In [12]:
clickbait = open("../data/clickbait.valid.txt").read().split("\n")
genuine = open("../data/genuine.valid.txt").read().split("\n")

print "Clickbait: "
for each in clickbait[:5]:
    print each
print "-" * 50

print "Genuine: "
for each in genuine[:5]:
    print each


Clickbait: 
All The Looks At The People's Choice Awards
Does Kylie Jenner Know How To Wear Coats? A Very Serious Investigation
This Is What US Protests Looked Like In The '60s
24 GIFs That Show How Corinne Is The Greatest "Bachelor" Villian Yet
Nene Leakes And Kandi Burruss Finally "See Each Other" In A Good Way
--------------------------------------------------
Genuine: 
Mayawatis risky calculus
L&T Q3 net up 39% at Rs 972 cr, co says note ban a disruptor
Australian Open women's final: Serena beats sister Venus Williams to win 23rd Grand Slam
It's Federer vs Nadal in Australian Open finals
Medical board fails to make any conclusion in report on Sunandas death

In [13]:
def words_to_indices(words):
    # Map each word to its vocabulary index, falling back to the <UNK> index.
    return [inverse_vocabulary.get(word, inverse_vocabulary[UNK]) for word in words]


def clean(text):
    # Surround punctuation and digits with spaces so they become separate tokens,
    # then collapse any resulting runs of spaces.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " " + punctuation + " ")
    for i in range(10):
        text = text.replace(str(i), " " + str(i) + " ")
    text = MATCH_MULTIPLE_SPACES.sub(" ", text)
    return text
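
To make the preprocessing concrete, here is a minimal sketch (the example headline is made up) that pushes a single headline through the same clean(), words_to_indices() and pad_sequences() steps used in the evaluation cells below:

example = "21 Photos That Prove 2017 Can't Get Any Weirder"  # hypothetical headline
cleaned = clean(example.lower()).split()
print cleaned
indices = sequence.pad_sequences([words_to_indices(cleaned)], maxlen=SEQUENCE_LENGTH)
print indices.shape  # (1, 20): one headline, padded/truncated to SEQUENCE_LENGTH tokens
print model.predict(indices)[0, 0]  # the score the detector assigns to the headline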

Genuine news marked as clickbait


In [14]:
wrong_genuine_count = 0
for each in genuine:
    # lowercase, strip non-ASCII, separate punctuation/digits, then tokenise
    cleaned = clean(each.encode("ascii", "ignore").lower()).split()
    indices = words_to_indices(cleaned)
    indices = sequence.pad_sequences([indices], maxlen=SEQUENCE_LENGTH)
    prediction = model.predict(indices)[0, 0]
    if prediction > .5:  # genuine headline incorrectly flagged as clickbait
        print prediction, each
        wrong_genuine_count += 1

print "-" * 50
print "{0} out of {1} wrong.".format(wrong_genuine_count, len(genuine))


0.865757 A look at Trumps executive order on refugees, immigration
0.90955 Malala heartbroken over Trumps ban on most defenceless refugees
0.541996 The White House hints that tax reform could pay for the border wall
0.786731 Understanding the spike in Chinas birth rate
0.685651 President Trumps infrastructure plans probably involve more tolls
0.61867 Digital immortality for the Holocausts last survivors
0.504147 Printed human body parts could soon be available for transplant
0.979102 Germanys Social Democrats pick Martin Schulz as leader
0.843225 Netgears Orbi might be the best Wi-Fi router Ive everused
0.934488 Twitter releases national securityletters
0.736603 Zuckerberg defends immigrants threatened byTrump
0.807549 Doug shows you how to get rid of Amazon Freshtotes
0.925443 Rogue National Park Service Twitter account says its no longer run by government employeesbut maybe it neverwas
0.839666 Watch a Massive Fire Tornado Sweep the Outback
0.991364 3 Things You Need to Know About the Science Rebellion Against Trump
0.729308 These Are the Defiant "Water Protectors" of Standing Rock
0.956611 Watch Alien Worlds Whirl Around a Distant Star
0.563429 Why a Volcano Has Erupted Almost Every Hour for 94 Years
0.93334 4 Key Impacts of the Keystone XL and Dakota Access Pipelines
0.61299 Human-Pig Hybrid Created in the LabHere Are the Facts
--------------------------------------------------
20 out of 74 wrong.

Clickbait not detected


In [15]:
wrong_clickbait_count = 0
for each in clickbait:
    cleaned = clean(each.encode("ascii", "ignore").lower()).split()
    indices = words_to_indices(cleaned)
    indices = sequence.pad_sequences([indices], maxlen=SEQUENCE_LENGTH)
    prediction = model.predict(indices)[0, 0]
    if prediction < .5:  # clickbait headline the detector missed
        print prediction, each
        wrong_clickbait_count += 1

print "-" * 50
print "{0} out of {1} wrong.".format(wrong_clickbait_count, len(clickbait))


0.347193 Nene Leakes And Kandi Burruss Finally "See Each Other" In A Good Way
0.446493 Channing Tatum Is Currently Teaching Himself How To Play Piano
0.244268 Trump signs executive order to 'keep radical Islamic terrorists out' of U.S.
0.486077 A look at Neil Gorsuch, a possible Trump Supreme Court nominee
0.101281 Haley to U.N. allies: back us or we'll take names
0.409009 Do Donald Trump's criticisms of NATO have merit? | Opinion
0.139018 Mexico foreign minister says paying for Trump's border wall "totally unacceptable"
0.0512496 China is stepping up as Donald Trump withdraws from the world stage | Opinion
0.135089 Buffett, Gates express optimism for U.S. in Trump era
0.154335 Can Congos footballers help ease political tensions?
0.113401 Aruba; Five Star Island Goes Green
0.0229274 Michael Wolff: Why the media keeps losing to Donald Trump
0.31196 Vijay Mallya: I begged for help, not loans - Times of India
0.117259 Union Budget 2017: What manufacturing sector expect from Arun Jaitley-  The Times of India
0.183548 Read The Full Text Of Donald Trump's Executive Order Limiting Muslim Entry To The U.S.
0.0772077 Weekend Roundup: A New 'Nationalist International' Challenges The Old Globalization
0.0710645 Anne Frank Was A Refugee, Too
0.437475 Can You Guess Which Drugs Kill The Most People In The UK?
0.389437 GQ Gives Donald Trump An Arguably Much-Needed Makeover
0.193617 Trump To Publish Weekly List Of Crimes Committed By Undocumented Immigrants In Sanctuary Cities
0.446493 Channing Tatum Is Currently Teaching Himself How To Play Piano
0.347193 Nene Leakes And Kandi Burruss Finally "See Each Other" In A Good Way
--------------------------------------------------
22 out of 76 wrong.
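
The two evaluation cells repeat the same preprocessing and prediction steps. A small helper (a sketch, not part of the original notebook) makes that pipeline reusable and summarises the error rates from the counts above:

def clickbait_probability(headline):
    # clean, tokenise, map to indices, pad and predict, exactly as in the cells above
    cleaned = clean(headline.encode("ascii", "ignore").lower()).split()
    indices = sequence.pad_sequences([words_to_indices(cleaned)], maxlen=SEQUENCE_LENGTH)
    return model.predict(indices)[0, 0]

false_positive_rate = float(wrong_genuine_count) / len(genuine)      # 20 / 74, about 27%
false_negative_rate = float(wrong_clickbait_count) / len(clickbait)  # 22 / 76, about 29%
print "Genuine flagged as clickbait: {0:.1%}".format(false_positive_rate)
print "Clickbait not detected: {0:.1%}".format(false_negative_rate)

A call like clickbait_probability(each) could then replace the repeated preprocessing in both loops above.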