Text Classification of Drug Reviews

Abydos can be helpful in general machine learning tasks like text classification. The following notebook demonstrates how Abydos's phonetic algoriths, string fingerprint functions, and q-grams can squeeze a little extra accuracy out of a text classification task.

The text classification task below uses customer review text to predict the condition for which the drug in question was prescribed. No other data (the drug name, for example) is used in this task.

Caveats

Unfortunately, this notebook crashes near the end when run on mybinder.org. But it runs fine on Google Colab, though you'll need to add a cell at the beginning to call !pip install abydos.

This is a toy problem. I have taken a dataset that was already divided into training & test sets and used the test set for validation, not as a genuine test set. On the other hand, I haven't done much hyperparameter tuning. Indeed, all of the classifiers used below have identical parameters: LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337).

However, Abydos was used in a winning submission to a Kaggle (InClass) competition in UC Berkeley's 2015 Applied NLP course. The same notebook (but with its Pseudo-SSK classifier disabled due to memory requirements) was applied to the following year's competition, after the competition deadline, and beat that year's leader (0.89535 to 0.89369) without any tuning. So... Abydos can be useful in generalizing text classification tasks.

Imports

We start by importing from standard libraries, Pandas, Abydos, scikit-learn (for the ML algorithms/pipeline), NLTK (for a tokenizer & stopword corpus).


In [1]:
import html
import os

from string import punctuation

import pandas as pd

from abydos.phonetic import DoubleMetaphone, Soundex
from abydos.fingerprint import SkeletonKey, OmissionKey
from abydos.tokenizer import QGrams

from sklearn.base import TransformerMixin
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.ensemble import VotingClassifier

from nltk import wordpunct_tokenize

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords = set(stopwords.words('english')) | set(punctuation)


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/chrislit/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [2]:
# Useful Transformer from
# http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
# This pulls a single column from a supplied pandas dataframe for classification.
class ColumnExtractor(TransformerMixin):
    def __init__(self, columns=[]):
        self.columns = columns

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def transform(self, X, **transform_params):
        return X[self.columns]

    def fit(self, X, y=None, **fit_params):
        return self

Below, if the dataset isn't already present, we download it to the working directory.


In [3]:
if not os.path.isfile('drugsComTrain_raw.tsv'):
    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen

    resp = urlopen("https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip")
    zipfile = ZipFile(BytesIO(resp.read()))
    zipfile.extract('drugsComTrain_raw.tsv')
    zipfile.extract('drugsComTest_raw.tsv')

Here, a pair of cleanup functions are defined for the files above.

From the DataFrame's review field, we remove surrounding quotes, unescape HTML escapes, lowercase, strip stopwords, and apply NLTK's word_punct() tokenizer.

From its condition field, we combine a number of conditions into supercategories. (NB: No offense is intended if I've miscategorized any of these diagnoses or if their conflation is inappropriate. And in some cases, these conditions were conflated because they employ the same drugs.) All other conditions are tagged as '' for later removal of these records.


In [4]:
def clean_review(review):
    review = review.strip('"')
    review = html.unescape(review)

    review = review.lower()
    review = ' '.join([_ for _ in wordpunct_tokenize(review) if _ not in stopwords])
    
    return review

def clean_condition(condition):
    if not isinstance(condition, str):
        return ''

    if 'Pain' in condition:
        condition = 'Pain'
    elif condition in {'Insomnia', 'Narcolepsy'} or 'Sleep' in condition:
        condition = 'Sleep'
    elif condition in {'Weight Loss', 'Obesity'}:
        condition = 'Weight'
    elif condition in {'Depression', 'Anxiety', 'Bipolar Disorde', 'Anxiety and Stress'
                       'Panic Disorde', 'Generalized Anxiety Disorde', 'Schizophrenia',
                       'Major Depressive Disorde',}:
        condition = 'Mental Health'
    elif condition in {'Birth Control', 'Emergency Contraception', 'Menstrual Disorders'}:
        condition = 'Contraception'
    elif 'Headache' in condition or 'Migraine' in condition:
        condition = 'Headache'
    elif condition in {'Acne', 'Rosacea', 'Eczema'}:
        condition = 'Dermatalogical'
    else:
        condition = ''
    return condition

Below, the training & test sets are read into a DataFrame and pre-processed as described above.


In [5]:
# Read the TSVs into a DataFrame
drug_train = pd.read_csv('drugsComTrain_raw.tsv', sep='\t', index_col=0, usecols=[0,1,2,3])
drug_test = pd.read_csv('drugsComTest_raw.tsv', sep='\t', index_col=0, usecols=[0,1,2,3])

# Clean the review field
drug_train.review = drug_train.review.apply(clean_review)
drug_test.review = drug_test.review.apply(clean_review)

# Clean the condition field (condense some classes)
drug_train.condition = drug_train.condition.apply(clean_condition)
drug_test.condition = drug_test.condition.apply(clean_condition)

# Drop records that aren't among the 7 condition classes we will consider
drug_train = drug_train[drug_train.condition != '']
drug_test = drug_test[drug_test.condition != '']

In [6]:
drug_train


Out[6]:
drugName condition review
92703 Lybrel Contraception used take another oral contraceptive 21 pill c...
138000 Ortho Evra Contraception first time using form birth control glad went ...
165907 Levonorgestrel Contraception pulled cummed bit took plan b 26 hours later t...
102654 Aripiprazole Mental Health abilify changed life hope zoloft clonidine fir...
48928 Ethinyl estradiol / levonorgestrel Contraception pill many years doctor changed rx chateal effe...
29607 Topiramate Headache medication almost two weeks started 25mg worki...
75612 L-methylfolate Mental Health taken anti depressants years improvement mostl...
98494 Nexplanon Contraception started nexplanon 2 months ago minimal amount ...
81890 Liraglutide Weight taking saxenda since july 2016 severe nausea m...
212077 Lamotrigine Mental Health every medicine sun seems manage hypomania mani...
231466 Trazodone Sleep insomnia horrible story begins pcp prescribing...
227020 Etonogestrel Contraception nexplanon job worry free sex thing periods som...
27339 Imitrex Headache first suffered included splitting head pain na...
96233 Sertraline Mental Health 1 week zoloft anxiety mood swings take 50mg mo...
204999 Toradol Pain 30 years old multiple composite spinal injurie...
93678 Morphine Pain morphine least 7 years .. medicine seems manag...
39795 Contrave Weight finishing second week taking contrave lost 10 ...
121333 Venlafaxine Mental Health gp started venlafaxine yesterday help depressi...
188061 Symbyax Mental Health helps sadness strongly counters moderate urges...
69629 Buprenorphine Pain pain management doctor put butrans patches 6 w...
102449 Aripiprazole Mental Health abilify 20 mg patient diagnosed disorganized s...
87285 Latuda Mental Health great experience far latuda started taking 40 ...
106703 Implanon Contraception never depo suppose b ideal candidate first 6 m...
131704 Effexor XR Mental Health med 5 years worked fine great stopped panic at...
192806 Drospirenone / ethinyl estradiol Contraception put yasmin 6 months regulate cycle reduce acne...
69488 Buprenorphine Pain love butrans patch !!! relieved half pain know...
107449 Implanon Contraception 8 months sad say caused nothing self esteem be...
60156 NuvaRing Contraception birth control considering getting pregnant use...
24139 Tretinoin Dermatalogical hit three month point tretinoin 05 happy reall...
131909 Effexor XR Mental Health medicine saved life wits end anti depressants ...
... ... ... ...
104148 Ethinyl estradiol / levonorgestrel Contraception birth control best got heavy periods period li...
66631 Seroquel Mental Health begin seroquel personally think great medicati...
145230 Etonogestrel Contraception highly reccomend implant anyone got mine inser...
77215 Lorcaserin Weight started taking medication yesterday craving su...
96128 Sertraline Mental Health taking wellbutrin depression stopped working d...
205544 Pristiq Mental Health hot flashes blisters heart palpitations bruising
62789 Cafergot Headache diagnosed cluster headaches late 20 prescribed...
160363 Buspirone Mental Health good experience anxiety depression usually man...
132661 Doxepin Sleep read great comments talked doctor tried medici...
128820 Phentermine Weight started adipex 2 weeks ago lost 20 lbs far eve...
131713 Effexor XR Mental Health started take med one week ago gad im feeling b...
215786 Copper Contraception covered family pact california insertion painf...
53802 Zipsor Pain knees arthroscopic last year half ended reinju...
142183 Levonorgestrel Contraception bad reviews kyleena wanted share experience iu...
116889 Lamictal Mental Health medication nearly 10 years generally helpful t...
144132 Etonogestrel Contraception got nexplanon day baby feb 9 2016 gained 24 po...
143487 Etonogestrel Contraception honestly worst birth control ever taken even h...
8477 Zolpidem Sleep zolpidem work fast however right arm goes slee...
76151 Portia Contraception switched portia 12 days ago started spotting a...
73058 Ethinyl estradiol / norethindrone Contraception first starting taking lo loestrin fe first bir...
183202 Cymbalta Mental Health taking cymbalta 15 months first 30mg six month...
148859 Mirena Contraception experience painful insertion expected since ne...
109111 Nexplanon Contraception nexplanon since dec 27 2016 got first period e...
176146 Lorazepam Mental Health 4 years ago started early morning awakening in...
18421 Zolpidem Sleep started taking medication 10 years ago doctor ...
56907 Roxicodone Intensol Pain used throat cancer helped numb throat able eat...
228492 Geodon Mental Health bad place time started taking doctor wanted we...
93069 Vortioxetine Mental Health third med tried anxiety mild depression week h...
132177 Ativan Mental Health super taking medication started dealing anxiet...
164345 Junel 1.5 / 30 Contraception would second month junel birth control 10 year...

84513 rows × 3 columns

Next, we define a dictionary to hold accuracy data and define a function to train a model on the training set, test it on the test set, and store & report the resulting accuracy.


In [7]:
accuracies = {}
def test_pipeline(pipeline, name):
    model = pipeline.fit(drug_train, drug_train.condition)
    drug_test['prediction'] = model.predict(drug_test)
    acc = sum(drug_test['prediction']==
              drug_test['condition'])/len(drug_test)
    accuracies[name] = acc
    print('Accuracy: {:0.3f}%'.format(100*acc))

As a baseline, here is a straightfoward classifier on unigrams from the review text. It achieves an already-enviable 95.139% accuracy.


In [8]:
pipeline_plain = Pipeline([
    ('extract', ColumnExtractor('review')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_plain, 'plain')


Accuracy: 95.139%

Next, we run each word of the review text through Soundex and through Double Metaphone (separately, not in series) and store the results in new DataFrame columns. And then we run the results through their own pipelines and get somewhat-disappointing results of 92.588% and 93.947% accuracies, respectively.


In [9]:
sdx = Soundex()
drug_train['soundex'] = drug_train.review.apply(lambda review:
                                                ' '.join(sdx.encode(word) for
                                                         word in review.split()))

dm = DoubleMetaphone()
drug_train['dmetaphone'] = drug_train.review.apply(lambda review:
                                                   ' '.join(dm.encode(word)[0] for
                                                            word in review.split()))

In [10]:
drug_test['soundex'] = drug_test.review.apply(lambda review:
                                              ' '.join(sdx.encode(word) for
                                                       word in review.split()))

drug_test['dmetaphone'] = drug_test.review.apply(lambda review:
                                                 ' '.join(dm.encode(word)[0] for
                                                          word in review.split()))

In [11]:
pipeline_soundex = Pipeline([
    ('extract', ColumnExtractor('soundex')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_soundex, 'soundex')


Accuracy: 92.588%

In [12]:
pipeline_dmetaphone = Pipeline([
    ('extract', ColumnExtractor('dmetaphone')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_dmetaphone, 'double metaphone')


Accuracy: 93.947%

Next, we get the Skeleton and Omission keys of each word of the review text and store the results in new DataFrame columns. And then we run the results through their own pipelines and get more encouraging results of 95.075% and 95.075% accuracy, respectively.


In [13]:
sk = SkeletonKey()
drug_train['skeleton'] = drug_train.review.apply(lambda review:
                                                 ' '.join(sk.fingerprint(word) for word in
                                                          review.split()))
ok = OmissionKey()
drug_train['omission'] = drug_train.review.apply(lambda review:
                                                 ' '.join(ok.fingerprint(word) for word in
                                                          review.split()))

In [14]:
drug_test['skeleton'] = drug_test.review.apply(lambda review:
                                               ' '.join(sk.fingerprint(word) for word in
                                                        review.split()))
drug_test['omission'] = drug_test.review.apply(lambda review:
                                               ' '.join(ok.fingerprint(word) for word in
                                                        review.split()))

In [15]:
pipeline_skeleton = Pipeline([
    ('extract', ColumnExtractor('skeleton')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_skeleton, 'skeleton key')


Accuracy: 95.075%

In [16]:
pipeline_omission = Pipeline([
    ('extract', ColumnExtractor('omission')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_omission, 'omission key')


Accuracy: 94.856%

As another option, we retrieve character-wise 7-grams, storing them as dictionaries within their own column. This gives a nice improvement over the baseline, with 96.536% accuracy.


In [17]:
tokenizer = QGrams(qval=7, start_stop='')
drug_train['qgrams'] = drug_train.review.apply(lambda review:
                                               dict(tokenizer.tokenize(review).get_counter()))

In [18]:
drug_test['qgrams'] = drug_test.review.apply(lambda review:
                                             dict(tokenizer.tokenize(review).get_counter()))

In [19]:
pipeline_qgrams = Pipeline([
    ('extract', ColumnExtractor('qgrams')),
    ('vectorize', DictVectorizer()),
    ('tfidf', TfidfTransformer(sublinear_tf=True, norm='l2')),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_qgrams, 'q-grams')


Accuracy: 96.536%

And finally, we throw all the pipelines together with a voting classifier, minus the worst performing (Soundex) pipeline. We also add weights to bias strongly towards the best performing (q-grams) pipeline. And the resulting pipeline should both generalize well and beat someone using the plain pipeline.

Naturally, there is much room for improvement!


In [20]:
pipeline_voting = VotingClassifier([
    ('plain', pipeline_plain),
    # ('soundex', pipeline_soundex),
    ('dmetaphone', pipeline_dmetaphone),
    ('skeleton', pipeline_skeleton),
    ('omission', pipeline_omission),
    ('qgrams', pipeline_qgrams),
], weights=[1, 1, 1, 1, 2.5])
test_pipeline(pipeline_voting, 'voting')


Accuracy: 95.560%