Hate speech refers to statements made specifically to attack or delegitimize particular groups of people based on a demographic category—race, gender, religion, sexual orientation, and so on. Below is a tutorial showcasing a few methods one can use to classify hate speech.
These methods are by no means comprehensive; the lines between hate speech and other offensive language can be blurry and contextual, and existing methods do not always capture this context well. When detecting hate speech, it is important to recognize how audience, author, and context affect the intent of text and its status as hate speech. Failing to do so may result in the censorship of targeted groups as they reclaim the language used to disparage them.
**Content warning**: the code snippet below includes slurs used as part of a lexicon to detect these terms in the document. We elected to partially censor these terms in the ACM XRDS column for which this notebook was created; here, we use a similar censoring scheme to replace vowels. While we acknowledge this does not make the words unrecognizable and still may upset those targeted by them, we hope this can mitigate their use in this document.
In [1]:
# Built-in Python libraries
import csv
import pickle
import re
import string
import sys
# Python libraries that may need to be installed. To install any of these
# with a standard Python installation, you can run
# pip install <package>
# or if you are using Anaconda to manage your Python installation,
# conda install <package>
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import numpy as np
import pandas
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
We import tweets labeled as hate speech, offensive, or neither from a CSV. We extract the class labels and the raw text of the tweet.
You can find this data in the t-davidson/hate-speech-and-offensive-language repository on GitHub. If you use this data, please cite:
Davidson, Thomas, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. “Automated Hate Speech Detection and the Problem of Offensive Language.” Proceedings of the Eleventh AAAI International Conference on Web and Social Media (ICWSM): 512-515.
In [2]:
data_url = 'https://raw.githubusercontent.com/t-davidson/hate-speech-and-offensive-language/master/data/labeled_data.csv'
data = pandas.read_csv(data_url)
tweets = data['tweet']
y = data['class']
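The class proportions matter for everything that follows: the corpus is heavily skewed toward class 1 (offensive language). A quick way to inspect this is `value_counts`. The snippet below uses a made-up toy `Series` so it runs without downloading the data; with the real corpus you would call `data['class'].value_counts()`.

```python
import pandas

# Toy stand-in for data['class']; with the real corpus, inspect the
# imbalance directly via data['class'].value_counts()
y_toy = pandas.Series([1, 1, 1, 1, 0, 2, 1, 2])
counts = y_toy.value_counts()
print(counts)
print('Majority class fraction: {:.2f}'.format(counts.max() / len(y_toy)))
```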
In [3]:
def substitute_char_list(str_list, list_from, list_to):
    for cf, ct in zip(list_from, list_to):
        str_list = [sl.replace(cf, ct) for sl in str_list]
    return str_list
In [4]:
list_orig = 'aeiouy'
list_censored = '#$%&*!'
hate_lexicon = [
'b%tch',
'b%tch$s',
'wh&r$',
'wh&r$s',
'n%gg$r',
'n%gg$rs',
'f#g',
'f#gs',
'f#gg&t',
'f#gg&ts']
hate_lexicon = substitute_char_list(hate_lexicon, list_censored, list_orig)
In [5]:
# Note: the split pattern uses + rather than *; a pattern that can match
# the empty string splits between every character on Python >= 3.7
tokens = [re.split("[^a-zA-Z]+", tweet.lower()) for tweet in tweets]
X_lexicon = np.zeros((len(tweets), len(hate_lexicon) + 1))
for i, tweet in enumerate(tweets):
    for j, term in enumerate(hate_lexicon):
        X_lexicon[i, j] = tweet.count(term)
X_lexicon[:, -1] = X_lexicon.sum(axis=1)
In [6]:
def show_classifier_results(y_actual, y_pred):
    # Obtain the confusion matrix to describe the types of classifier errors
    conf_mat = confusion_matrix(y_actual, y_pred)
    # Some constants for computation later
    n_classes = len(conf_mat)  # should be 3
    n_total = conf_mat.sum()
    n_by_actual_class = conf_mat.sum(axis=1)
    print('Accuracy score: {:.2f}%\n'.format(100 * conf_mat.trace() / n_total))
    print('Recall by class:')
    for cls in range(n_classes):
        print(' {} - {:.2f}%'.format(cls, 100 * conf_mat[cls, cls] / n_by_actual_class[cls]))
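To see concretely what show_classifier_results computes, here is a toy confusion matrix for a three-class problem (the labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for a three-class problem
y_actual = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 1]
conf_mat = confusion_matrix(y_actual, y_pred)
print(conf_mat)  # rows: actual class, columns: predicted class
# Accuracy is the diagonal (correct predictions) over all examples
print('Accuracy:', conf_mat.trace() / conf_mat.sum())  # 6/8 = 0.75
# Recall for class 2 is correct class-2 predictions over actual class-2 examples
print('Recall (class 2):', conf_mat[2, 2] / conf_mat.sum(axis=1)[2])  # 2/3
```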
We test two versions of this classifier. In the first, we weight all examples equally, which results in the classifier naively assigning almost all examples to class 1 due to the class proportions of the corpus. In the second, we force the classifier to balance the three classes by weighting examples inversely to their class frequencies, reducing overall accuracy but improving the recall on the other two classes. In both, we use an L2 penalty (subtracting the sum of the squares of the weights from the objective), which encourages the combined weight of the features to be low. It is also possible to use an L1 penalty (subtracting the sum of the absolute values of the weights), which encourages the classifier to concentrate weight in relatively few features. A low value of C corresponds to more importance placed on the penalty.
In [7]:
model = LogisticRegression(penalty="l2", C=0.01)
y_pred = cross_val_predict(model, X_lexicon, y, cv=10)
show_classifier_results(y, y_pred)
In [8]:
model = LogisticRegression(class_weight="balanced", penalty="l2", C=0.01)
y_pred = cross_val_predict(model, X_lexicon, y, cv=10)
show_classifier_results(y, y_pred)
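The L1 penalty described above is not demonstrated on the tweet data, but its feature-sparsifying effect is easy to see on synthetic counts. Everything below is made up for illustration; only the first of the eleven features carries any signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for lexicon counts: only the first of 11 features
# actually determines the label
X_toy = rng.poisson(1.0, size=(200, 11)).astype(float)
y_toy = (X_toy[:, 0] > 1).astype(int)

# penalty="l1" requires a solver that supports it, such as liblinear or saga
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
l1_model.fit(X_toy, y_toy)
# The L1 penalty tends to drive the weights of uninformative features
# exactly to zero, concentrating weight in the few useful ones
print(np.sum(l1_model.coef_ == 0), 'of', l1_model.coef_.size, 'weights are exactly zero')
```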
In order to improve our classifier's accuracy at distinguishing hate speech from other offensive speech, we expand out to a much larger set of features: counts of each word's frequency in the vocabulary of the whole corpus.
To train a text classifier, one must tokenize the text, or split it into individual words or substrings in order to provide units for the classifier to process. With a relatively small supply of social media data, it is unlikely that many words will show up often enough to produce useful signals. To handle this, we do some pre-processing to ensure that the forms of words are more standardized.
This process uses NLTK to perform stopword removal and stemming, which may require downloading the stopword list through the NLTK download utility. It appears in the Corpora tab under the name stopwords.
In [9]:
# If the stopword list fails to load, you will need to download it
# using the NLTK download utility and then rerun this cell.
try:
    stoplist = stopwords.words('english')
    stemmer = PorterStemmer()
except Exception as e:
    print(str(e))
    import nltk
    nltk.download()
In [10]:
# Adapted from code used in https://github.com/t-davidson/hate-speech-and-offensive-language, written by Tom Davidson
def preprocess(text_string):
    """
    Accepts a text string and:
    1) collapses runs of whitespace into a single space
    2) strips out URLs
    3) strips out @-mentions
    This gives us standardized text without caring about the
    specific URLs linked or people mentioned
    """
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
                       r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, '', parsed_text)
    parsed_text = re.sub(mention_regex, '', parsed_text)
    return parsed_text.lower()

def tokenize(tweet):
    """Removes punctuation & excess whitespace, sets to lowercase,
    and stems tweets. Returns a list of stemmed tokens."""
    tweet = " ".join(re.split("[^a-zA-Z]+", tweet.lower())).strip()
    tokens = [stemmer.stem(t) for t in tweet.split() if t not in stoplist]
    return tokens

def basic_tokenize(tweet):
    """Same as tokenize but without the stemming"""
    tweet = " ".join(re.split("[^a-zA-Z.,!?]+", tweet.lower())).strip()
    return tweet.split()
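Applying these regular expressions to a made-up tweet shows the effect. The sample text is invented; the patterns are reproduced from preprocess above so this cell is self-contained.

```python
import re

sample = 'RT @user_123: check this out https://t.co/abc123 !!!'
# Same steps as preprocess: collapse whitespace, strip URLs, strip mentions
collapsed = re.sub(r'\s+', ' ', sample)
no_urls = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
                 r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', collapsed)
no_mentions = re.sub(r'@[\w\-]+', '', no_urls)
print(no_mentions.lower())  # the URL and the mention are stripped out
```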
We use a TfidfVectorizer to load our data into a scipy sparse matrix representation, where every row corresponds to a document and every column corresponds to a word in the vocabulary. Here, the arguments mean:

- `tokenizer=tokenize`: we tokenize using the tokenize function to split words,
- `preprocessor=preprocess`: we preprocess the text before tokenization using preprocess,
- `use_idf=False`: we are not normalizing by the inverse document frequency (IDF) and are instead just using the term frequency (TF) for each entry,
- `decode_error='replace'`: we replace characters we can't convert into Unicode with the special Unicode replacement character,
- `min_df=5`: we only keep words showing up in at least 5 tweets, and
- `max_df=0.5`: we only keep words showing up in less than half the tweets.
In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
tokenizer=tokenize,
preprocessor=preprocess,
use_idf=False,
decode_error='replace',
min_df=5,
max_df=0.5
)
X_count = vectorizer.fit_transform(tweets)
vocab = vectorizer.vocabulary_
idx_to_vocab = {idx: wd for (wd, idx) in vocab.items()}
We use a new kind of classifier better suited to frequency information: a Multinomial Naive Bayes classifier. We use fit_prior=False to prevent the classifier from simply learning the class prior, namely that class 1 is much more probable than the other two classes.

We can use the per-class feature probabilities this classifier learns to find out which specific features are most indicative of each class. We do this by finding the words corresponding to the highest per-class feature log-probabilities in the model (the feature_log_prob_ attribute; older versions of scikit-learn also exposed this as coef_).
In [12]:
model = MultinomialNB(fit_prior=False)
y_pred = cross_val_predict(model, X_count, y, cv=10)
show_classifier_results(y, y_pred)
In [13]:
feature_model = MultinomialNB(fit_prior=False)
feature_model.fit(X_count, y)
n_top_entries = 15
class_labels = ['hate', 'offensive', 'other']
print('Top features for each class:')
for i, class_label in enumerate(class_labels):
    # feature_log_prob_ holds the per-class log-probability of each feature
    # (its coef_ alias was removed in scikit-learn 1.2)
    top_by_coeff = np.argsort(feature_model.feature_log_prob_[i])[-n_top_entries:]
    print(" {} - {}".format(
        class_label,
        " ".join(substitute_char_list([idx_to_vocab[j] for j in top_by_coeff], list_orig, list_censored))))
To avoid the problems of sparsity from raw term frequencies, we use LSA, or latent semantic analysis, to give us shorter vector representations of each document. These effectively summarize the information in a term frequency matrix.
To more effectively train these models, we slightly modify our TfidfVectorizer to enable the use_idf option. This multiplies each term frequency entry by the logarithm of the inverse of the proportion of documents the term appears in (plus a smoothing constant). Terms that are specific to a few documents have their weight increased, while terms appearing in nearly all documents have their weight sharply reduced. The way LSA summarizes the vectors depends on the magnitude of each weight in the matrix, so downweighting features we don't expect to be informative produces a better summary.
In [14]:
tfvectorizer = TfidfVectorizer(
tokenizer=tokenize,
preprocessor=preprocess,
use_idf=True,
smooth_idf=True,
decode_error='replace',
min_df=5,
max_df=0.5
)
X_tfidf = tfvectorizer.fit_transform(tweets)
tsvd = TruncatedSVD(n_components=20)
X_lsa = tsvd.fit_transform(X_tfidf)
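scikit-learn's smoothed IDF can be checked by hand on a tiny made-up corpus: with smooth_idf=True the weight for a term is ln((1 + n_documents) / (1 + document_frequency)) + 1, so a word appearing in every document gets weight 1 rather than 0.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat', 'the dog sat', 'the cat ran']  # made-up corpus
vec = TfidfVectorizer(use_idf=True, smooth_idf=True, norm=None)
vec.fit(docs)
# 'the' appears in all 3 documents: idf = ln(4/4) + 1 = 1.0
# 'dog' appears in 1 document:      idf = ln(4/2) + 1 ~ 1.69
for word in ['the', 'dog']:
    print(word, vec.idf_[vec.vocabulary_[word]])
```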
We use LSA along with our feature counts to help "smooth out" some of the information about how words are related that the sparse word counts might not include. In the first model, we do this with all possible count features; in the second, we only use words that were in the top 20 most indicative features for one of the three classes.
In [15]:
model = LogisticRegression(class_weight="balanced", penalty="l2", C=0.01)
y_pred_all = cross_val_predict(model, np.hstack((X_lsa, X_count.toarray())), y, cv=10)
show_classifier_results(y, y_pred_all)
In [16]:
top_feature_idxs = set()
for i, class_label in enumerate(class_labels):
    # feature_log_prob_ replaces the coef_ alias removed in scikit-learn 1.2
    top_feature_idxs.update(np.argsort(feature_model.feature_log_prob_[i])[-20:])
X_top_count = X_count[:, list(top_feature_idxs)].toarray()
In [17]:
y_pred_top = cross_val_predict(model, np.hstack((X_lsa, X_top_count)), y, cv=10)
show_classifier_results(y, y_pred_top)