Abydos can be helpful in general machine learning tasks like text classification. The following notebook demonstrates how Abydos's phonetic algoriths, string fingerprint functions, and q-grams can squeeze a little extra accuracy out of a text classification task.
The text classification task below uses customer review text to predict the condition for which the drug in question was prescribed. No other data (the drug name, for example) is used in this task.
Unfortunately, this notebook crashes near the end when run on mybinder.org. But it runs fine on Google Colab, though you'll need to add a cell at the beginning to call !pip install abydos.
This is a toy problem. I have taken a dataset that was already divided into training & test sets and used the test set for validation, not as a genuine test set. On the other hand, I haven't done much hyperparameter tuning. Indeed, all of the classifiers used below have identical parameters: LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337).
However, Abydos was used in a winning submission to a Kaggle (InClass) competition in UC Berkeley's 2015 Applied NLP course. The same notebook (but with its Pseudo-SSK classifier disabled due to memory requirements) was applied to the following year's competition, after the competition deadline, and beat that year's leader (0.89535 to 0.89369) without any tuning. So... Abydos can be useful in generalizing text classification tasks.
We start by importing from standard libraries, Pandas, Abydos, scikit-learn (for the ML algorithms/pipeline), NLTK (for a tokenizer & stopword corpus).
In [1]:
import html
import os
from string import punctuation
import pandas as pd
from abydos.phonetic import DoubleMetaphone, Soundex
from abydos.fingerprint import SkeletonKey, OmissionKey
from abydos.tokenizer import QGrams
from sklearn.base import TransformerMixin
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.ensemble import VotingClassifier
from nltk import wordpunct_tokenize
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english')) | set(punctuation)
In [2]:
# Useful Transformer from
# http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
# This pulls a single column from a supplied pandas dataframe for classification.
class ColumnExtractor(TransformerMixin):
def __init__(self, columns=[]):
self.columns = columns
def fit_transform(self, X, y=None, **fit_params):
self.fit(X, y, **fit_params)
return self.transform(X)
def transform(self, X, **transform_params):
return X[self.columns]
def fit(self, X, y=None, **fit_params):
return self
Below, if the dataset isn't already present, we download it to the working directory.
In [3]:
if not os.path.isfile('drugsComTrain_raw.tsv'):
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
resp = urlopen("https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip")
zipfile = ZipFile(BytesIO(resp.read()))
zipfile.extract('drugsComTrain_raw.tsv')
zipfile.extract('drugsComTest_raw.tsv')
Here, a pair of cleanup functions are defined for the files above.
From the DataFrame's review field, we remove surrounding quotes, unescape HTML escapes, lowercase, strip stopwords, and apply NLTK's word_punct() tokenizer.
From its condition field, we combine a number of conditions into supercategories. (NB: No offense is intended if I've miscategorized any of these diagnoses or if their conflation is inappropriate. And in some cases, these conditions were conflated because they employ the same drugs.) All other conditions are tagged as '' for later removal of these records.
In [4]:
def clean_review(review):
review = review.strip('"')
review = html.unescape(review)
review = review.lower()
review = ' '.join([_ for _ in wordpunct_tokenize(review) if _ not in stopwords])
return review
def clean_condition(condition):
if not isinstance(condition, str):
return ''
if 'Pain' in condition:
condition = 'Pain'
elif condition in {'Insomnia', 'Narcolepsy'} or 'Sleep' in condition:
condition = 'Sleep'
elif condition in {'Weight Loss', 'Obesity'}:
condition = 'Weight'
elif condition in {'Depression', 'Anxiety', 'Bipolar Disorde', 'Anxiety and Stress'
'Panic Disorde', 'Generalized Anxiety Disorde', 'Schizophrenia',
'Major Depressive Disorde',}:
condition = 'Mental Health'
elif condition in {'Birth Control', 'Emergency Contraception', 'Menstrual Disorders'}:
condition = 'Contraception'
elif 'Headache' in condition or 'Migraine' in condition:
condition = 'Headache'
elif condition in {'Acne', 'Rosacea', 'Eczema'}:
condition = 'Dermatalogical'
else:
condition = ''
return condition
Below, the training & test sets are read into a DataFrame and pre-processed as described above.
In [5]:
# Read the TSVs into a DataFrame
drug_train = pd.read_csv('drugsComTrain_raw.tsv', sep='\t', index_col=0, usecols=[0,1,2,3])
drug_test = pd.read_csv('drugsComTest_raw.tsv', sep='\t', index_col=0, usecols=[0,1,2,3])
# Clean the review field
drug_train.review = drug_train.review.apply(clean_review)
drug_test.review = drug_test.review.apply(clean_review)
# Clean the condition field (condense some classes)
drug_train.condition = drug_train.condition.apply(clean_condition)
drug_test.condition = drug_test.condition.apply(clean_condition)
# Drop records that aren't among the 7 condition classes we will consider
drug_train = drug_train[drug_train.condition != '']
drug_test = drug_test[drug_test.condition != '']
In [6]:
drug_train
Out[6]:
Next, we define a dictionary to hold accuracy data and define a function to train a model on the training set, test it on the test set, and store & report the resulting accuracy.
In [7]:
accuracies = {}
def test_pipeline(pipeline, name):
model = pipeline.fit(drug_train, drug_train.condition)
drug_test['prediction'] = model.predict(drug_test)
acc = sum(drug_test['prediction']==
drug_test['condition'])/len(drug_test)
accuracies[name] = acc
print('Accuracy: {:0.3f}%'.format(100*acc))
As a baseline, here is a straightfoward classifier on unigrams from the review text. It achieves an already-enviable 95.139% accuracy.
In [8]:
pipeline_plain = Pipeline([
('extract', ColumnExtractor('review')),
('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_plain, 'plain')
Next, we run each word of the review text through Soundex and through Double Metaphone (separately, not in series) and store the results in new DataFrame columns. And then we run the results through their own pipelines and get somewhat-disappointing results of 92.588% and 93.947% accuracies, respectively.
In [9]:
sdx = Soundex()
drug_train['soundex'] = drug_train.review.apply(lambda review:
' '.join(sdx.encode(word) for
word in review.split()))
dm = DoubleMetaphone()
drug_train['dmetaphone'] = drug_train.review.apply(lambda review:
' '.join(dm.encode(word)[0] for
word in review.split()))
In [10]:
drug_test['soundex'] = drug_test.review.apply(lambda review:
' '.join(sdx.encode(word) for
word in review.split()))
drug_test['dmetaphone'] = drug_test.review.apply(lambda review:
' '.join(dm.encode(word)[0] for
word in review.split()))
In [11]:
pipeline_soundex = Pipeline([
('extract', ColumnExtractor('soundex')),
('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_soundex, 'soundex')
In [12]:
pipeline_dmetaphone = Pipeline([
('extract', ColumnExtractor('dmetaphone')),
('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_dmetaphone, 'double metaphone')
Next, we get the Skeleton and Omission keys of each word of the review text and store the results in new DataFrame columns. And then we run the results through their own pipelines and get more encouraging results of 95.075% and 95.075% accuracy, respectively.
In [13]:
sk = SkeletonKey()
drug_train['skeleton'] = drug_train.review.apply(lambda review:
' '.join(sk.fingerprint(word) for word in
review.split()))
ok = OmissionKey()
drug_train['omission'] = drug_train.review.apply(lambda review:
' '.join(ok.fingerprint(word) for word in
review.split()))
In [14]:
drug_test['skeleton'] = drug_test.review.apply(lambda review:
' '.join(sk.fingerprint(word) for word in
review.split()))
drug_test['omission'] = drug_test.review.apply(lambda review:
' '.join(ok.fingerprint(word) for word in
review.split()))
In [15]:
pipeline_skeleton = Pipeline([
('extract', ColumnExtractor('skeleton')),
('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_skeleton, 'skeleton key')
In [16]:
pipeline_omission = Pipeline([
('extract', ColumnExtractor('omission')),
('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_omission, 'omission key')
As another option, we retrieve character-wise 7-grams, storing them as dictionaries within their own column. This gives a nice improvement over the baseline, with 96.536% accuracy.
In [17]:
tokenizer = QGrams(qval=7, start_stop='')
drug_train['qgrams'] = drug_train.review.apply(lambda review:
dict(tokenizer.tokenize(review).get_counter()))
In [18]:
drug_test['qgrams'] = drug_test.review.apply(lambda review:
dict(tokenizer.tokenize(review).get_counter()))
In [19]:
pipeline_qgrams = Pipeline([
('extract', ColumnExtractor('qgrams')),
('vectorize', DictVectorizer()),
('tfidf', TfidfTransformer(sublinear_tf=True, norm='l2')),
('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_qgrams, 'q-grams')
And finally, we throw all the pipelines together with a voting classifier, minus the worst performing (Soundex) pipeline. We also add weights to bias strongly towards the best performing (q-grams) pipeline. And the resulting pipeline should both generalize well and beat someone using the plain pipeline.
Naturally, there is much room for improvement!
In [20]:
pipeline_voting = VotingClassifier([
('plain', pipeline_plain),
# ('soundex', pipeline_soundex),
('dmetaphone', pipeline_dmetaphone),
('skeleton', pipeline_skeleton),
('omission', pipeline_omission),
('qgrams', pipeline_qgrams),
], weights=[1, 1, 1, 1, 2.5])
test_pipeline(pipeline_voting, 'voting')