This notebook contains the cleaning and exploration of the chan_example csv which is hosted on the far-right s3 bucket. It contains cleaning out the html links from the text of the messages with beautiful soup, grouping the messages into their threads, and an exploratory sentiment analysis.
Further work could be to get the topic modelling for messages working and perhaps look at sentiment regarding different topics.
In [15]:
import boto3
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
In [16]:
session = boto3.Session(profile_name='default')
s3 = session.resource('s3')
bucket = s3.Bucket("far-right")
session.available_profiles
Out[16]:
In [ ]:
# print all objects in bucket
for obj in bucket.objects.all():
if "chan" in obj.key:
#print(obj.key)
pass
In [18]:
bucket.download_file('fourchan/chan_example.csv', 'chan_example.csv')
In [19]:
chan = pd.read_csv("chan_example.csv")
# remove the newline tags. They're not useful for our analysis and just clutter the text.
chan.com = chan.com.astype(str).apply(lambda x: x.replace("<br>", " "))
In [20]:
bucket.download_file('info-source/daily/20170228/fourchan/fourchan_1204.json', '2017-02-28-1204.json')
chan2 = pd.read_json("2017-02-28-1204.json")
In [21]:
soup = BeautifulSoup(chan.com[19], "lxml")
quotes = soup.find("span")
for quote in quotes.contents:
print(quote.replace(">>", ""))
parent = soup.find("a")
print(parent.contents[0].replace(">>", ""))
In [22]:
print(chan.com[19])
In [23]:
# If there's a quote and then the text, this would work.
print(chan.com[19].split("</span>")[-1])
In [24]:
def split_comment(comment):
"""Splits up a comment into parent, quotes, and text"""
# I used lxml to
soup = BeautifulSoup(comment, "lxml")
quotes, quotelink, text = None, None, None
try:
quotes = soup.find("span")
quotes = [quote.replace(">>", "") for quote in quotes.contents]
except:
pass
try:
quotelink = soup.find("a").contents[0].replace(">>", "")
except:
pass
# no quote or parent
if quotes is None and quotelink is None:
text = comment
# Parent but no quote
if quotelink is not None and quotes is None:
text = comment.split("a>")[-1]
# There is a quote
if quotes is not None:
text = comment.split("</span>")[-1]
return {'quotes':quotes, 'quotelink': quotelink, 'text': text}
In [25]:
df = pd.DataFrame({'quotes':[], 'quotelink':[], 'text':[]})
for comment in chan['com']:
df = df.append(split_comment(comment), ignore_index = True)
full = pd.merge(chan, df, left_index = True, right_index = True)
In [26]:
quotes = pd.Series()
quotelinks = pd.Series()
texts = pd.Series()
for comment in chan['com']:
parse = split_comment(comment)
quotes.append(pd.Series(parse['quotes']))
quotelinks.append(pd.Series(parse['quotelink']))
texts.append(pd.Series(parse['text']))
chan['quotes'] = quotes
chan['quotelinks'] = quotelinks
chan['text'] = texts
Forchan messages are all part of a message thread, which can be reassembled by following the parents for each post and chaining them back together. This code creates a thread ID and maps that thread ID to the corresponding messages.
I don't know currently whether or not messages are linear, or if they can be a tree structure. This section of code simply tries to find which messages belong to which threads
Here i'll group the threads into a paragraph like structure and store it in a dictionary with the key being the parent chan_id.
In [27]:
threads = full['parent'].unique()
full_text = {}
for thread in threads:
full_text[int(thread)] = ". ".join(full[full['parent'] == thread]['text'])
Now we can do some topic modeling on the different threads
Following along with the topic modelling tweet exploration, we're going to tokenize our messages and then build a corpus from it. We'll then use the gensim library to run our topic model over the tokenized messages
In [38]:
import gensim
import pyLDAvis.gensim as gensimvis
import pyLDAvis
In [ ]:
tokenized_messages = []
for msg in nlp.pipe(full['text'], n_threads = 100, batch_size = 100):
ents = msg.ents
msg = [token.lemma_ for token in msg if token.is_alpha and not token.is_stop]
tokenized_messages.append(msg)
# Build the corpus using gensim
dictionary = gensim.corpora.Dictionary(tokenized_messages)
msg_corpus = [dictionary.doc2bow(x) for x in tokenized_messages]
msg_dictionary = gensim.corpora.Dictionary([])
# gensim.corpora.MmCorpus.serialize(tweets_corpus_filepath, tweets_corpus)
Labeled dataset provided by @crowdflower hosted on data.world. Dataset contains 40,000 tweets which are labeled as one of 13 emotions. Here I looked at the top 5 emotions, since the bottom few had very tweets by comparison, so it would be hard to get a properly split dataset on for train/testing. Probably the one i'd want to include that wasn't included yet is anger, but neutral, worry, happinness, sadness, love are pretty good starting point for emotion classification regarding news tweets. https://data.world/crowdflower/sentiment-analysis-in-text
In [31]:
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.classify import accuracy
from nltk import WordNetLemmatizer
lemma = nltk.WordNetLemmatizer()
df = pd.read_csv('https://query.data.world/s/8c7bwy8c55zx1t0c4yyrnjyax')
In [49]:
emotions = list(df.groupby("sentiment").agg("count").sort_values(by = "content", ascending = False).head(6).index)
print(emotions)
emotion_subset = df[df['sentiment'].isin(emotions)]
def format_sentence(sent):
ex = [i.lower() for i in sent.split()]
lemmas = [lemma.lemmatize(i) for i in ex]
return {word: True for word in nltk.word_tokenize(" ".join(lemmas))}
def create_train_vector(row):
"""
Formats a row when used in df.apply to create a train vector to be used by a
Naive Bayes Classifier from the nltk library.
"""
sentiment = row[1]
text = row[3]
return [format_sentence(text), sentiment]
train = emotion_subset.apply(create_train_vector, axis = 1)
# Split off 10% of our train vector to be for test.
test = train[:int(0.1*len(train))]
train = train[int(0.9)*len(train):]
emotion_classifier = NaiveBayesClassifier.train(train)
print(accuracy(emotion_classifier, test))
64% test accuracy on the test is nothing to phone home about. It's also likely to be a lot less accurate on our data from the 4chan messages, since those will be using much different language than the messages in our training set.
In [55]:
emotion_classifier.show_most_informative_features()
In [43]:
for comment in full['text'].head(10):
print(emotion_classifier.classify(format_sentence(comment)), ": ", comment)
Looking at this sample of 10 posts, I'm not convinced in the accuracy of this classifier on the far-right data, but out of curiosity, what did it classifer the
In [51]:
full['emotion'] = full['text'].apply(lambda x: emotion_classifier.classify(format_sentence(x)))
In [52]:
grouped_emotion_messages = full.groupby('emotion').count()[[2]]
grouped_emotion_messages.columns = ["count"]
grouped_emotion_messages
Out[52]:
In [53]:
grouped_emotion_messages.plot.bar()
Out[53]:
These results do seem semi logical though, based on some knowledge of the group. Online trolls are well known for their anger and rudeness, which could seemingly be classified as surprise and worry on a more standard data set.