Text Classification of Drug Reviews

Abydos can be helpful in general machine learning tasks like text classification. The following notebook demonstrates how Abydos's phonetic algoriths, string fingerprint functions, and q-grams can squeeze a little extra accuracy out of a text classification task.

The text classification task below uses customer review text to predict the condition for which the drug in question was prescribed. No other data (the drug name, for example) is used in this task.

Caveats

Unfortunately, this notebook crashes near the end when run on mybinder.org. But it runs fine on Google Colab, though you'll need to add a cell at the beginning to call !pip install abydos.

This is a toy problem. I have taken a dataset that was already divided into training & test sets and used the test set for validation, not as a genuine test set. On the other hand, I haven't done much hyperparameter tuning. Indeed, all of the classifiers used below have identical parameters: LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337).

However, Abydos was used in a winning submission to a Kaggle (InClass) competition in UC Berkeley's 2015 Applied NLP course. The same notebook (but with its Pseudo-SSK classifier disabled due to memory requirements) was applied to the following year's competition, after the competition deadline, and beat that year's leader (0.89535 to 0.89369) without any tuning. So... Abydos can be useful in generalizing text classification tasks.

Imports

We start by importing from standard libraries, Pandas, Abydos, scikit-learn (for the ML algorithms/pipeline), NLTK (for a tokenizer & stopword corpus).



In [1]:

    
import html
import os

from string import punctuation

import pandas as pd

from abydos.phonetic import DoubleMetaphone, Soundex
from abydos.fingerprint import SkeletonKey, OmissionKey
from abydos.tokenizer import QGrams

from sklearn.base import TransformerMixin
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.ensemble import VotingClassifier

from nltk import wordpunct_tokenize

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords = set(stopwords.words('english')) | set(punctuation)









    



[nltk_data] Downloading package stopwords to
[nltk_data]     /home/chrislit/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



In [2]:

    
# Useful Transformer from
# http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
# This pulls a single column from a supplied pandas dataframe for classification.
class ColumnExtractor(TransformerMixin):
    def __init__(self, columns=[]):
        self.columns = columns

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def transform(self, X, **transform_params):
        return X[self.columns]

    def fit(self, X, y=None, **fit_params):
        return self

Below, if the dataset isn't already present, we download it to the working directory.



In [3]:

    
if not os.path.isfile('drugsComTrain_raw.tsv'):
    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen

    resp = urlopen("https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip")
    zipfile = ZipFile(BytesIO(resp.read()))
    zipfile.extract('drugsComTrain_raw.tsv')
    zipfile.extract('drugsComTest_raw.tsv')

Here, a pair of cleanup functions are defined for the files above.

From the DataFrame's review field, we remove surrounding quotes, unescape HTML escapes, lowercase, strip stopwords, and apply NLTK's word_punct() tokenizer.

From its condition field, we combine a number of conditions into supercategories. (NB: No offense is intended if I've miscategorized any of these diagnoses or if their conflation is inappropriate. And in some cases, these conditions were conflated because they employ the same drugs.) All other conditions are tagged as '' for later removal of these records.



In [4]:

    
def clean_review(review):
    review = review.strip('"')
    review = html.unescape(review)

    review = review.lower()
    review = ' '.join([_ for _ in wordpunct_tokenize(review) if _ not in stopwords])
    
    return review

def clean_condition(condition):
    if not isinstance(condition, str):
        return ''

    if 'Pain' in condition:
        condition = 'Pain'
    elif condition in {'Insomnia', 'Narcolepsy'} or 'Sleep' in condition:
        condition = 'Sleep'
    elif condition in {'Weight Loss', 'Obesity'}:
        condition = 'Weight'
    elif condition in {'Depression', 'Anxiety', 'Bipolar Disorde', 'Anxiety and Stress'
                       'Panic Disorde', 'Generalized Anxiety Disorde', 'Schizophrenia',
                       'Major Depressive Disorde',}:
        condition = 'Mental Health'
    elif condition in {'Birth Control', 'Emergency Contraception', 'Menstrual Disorders'}:
        condition = 'Contraception'
    elif 'Headache' in condition or 'Migraine' in condition:
        condition = 'Headache'
    elif condition in {'Acne', 'Rosacea', 'Eczema'}:
        condition = 'Dermatalogical'
    else:
        condition = ''
    return condition

Below, the training & test sets are read into a DataFrame and pre-processed as described above.



In [5]:

    
# Read the TSVs into a DataFrame
drug_train = pd.read_csv('drugsComTrain_raw.tsv', sep='\t', index_col=0, usecols=[0,1,2,3])
drug_test = pd.read_csv('drugsComTest_raw.tsv', sep='\t', index_col=0, usecols=[0,1,2,3])

# Clean the review field
drug_train.review = drug_train.review.apply(clean_review)
drug_test.review = drug_test.review.apply(clean_review)

# Clean the condition field (condense some classes)
drug_train.condition = drug_train.condition.apply(clean_condition)
drug_test.condition = drug_test.condition.apply(clean_condition)

# Drop records that aren't among the 7 condition classes we will consider
drug_train = drug_train[drug_train.condition != '']
drug_test = drug_test[drug_test.condition != '']



In [6]:

    
drug_train









    Out[6]:







  
    
      
      drugName
      condition
      review
    
  
  
    
      92703
      Lybrel
      Contraception
      used take another oral contraceptive 21 pill c...
    
    
      138000
      Ortho Evra
      Contraception
      first time using form birth control glad went ...
    
    
      165907
      Levonorgestrel
      Contraception
      pulled cummed bit took plan b 26 hours later t...
    
    
      102654
      Aripiprazole
      Mental Health
      abilify changed life hope zoloft clonidine fir...
    
    
      48928
      Ethinyl estradiol / levonorgestrel
      Contraception
      pill many years doctor changed rx chateal effe...
    
    
      29607
      Topiramate
      Headache
      medication almost two weeks started 25mg worki...
    
    
      75612
      L-methylfolate
      Mental Health
      taken anti depressants years improvement mostl...
    
    
      98494
      Nexplanon
      Contraception
      started nexplanon 2 months ago minimal amount ...
    
    
      81890
      Liraglutide
      Weight
      taking saxenda since july 2016 severe nausea m...
    
    
      212077
      Lamotrigine
      Mental Health
      every medicine sun seems manage hypomania mani...
    
    
      231466
      Trazodone
      Sleep
      insomnia horrible story begins pcp prescribing...
    
    
      227020
      Etonogestrel
      Contraception
      nexplanon job worry free sex thing periods som...
    
    
      27339
      Imitrex
      Headache
      first suffered included splitting head pain na...
    
    
      96233
      Sertraline
      Mental Health
      1 week zoloft anxiety mood swings take 50mg mo...
    
    
      204999
      Toradol
      Pain
      30 years old multiple composite spinal injurie...
    
    
      93678
      Morphine
      Pain
      morphine least 7 years .. medicine seems manag...
    
    
      39795
      Contrave
      Weight
      finishing second week taking contrave lost 10 ...
    
    
      121333
      Venlafaxine
      Mental Health
      gp started venlafaxine yesterday help depressi...
    
    
      188061
      Symbyax
      Mental Health
      helps sadness strongly counters moderate urges...
    
    
      69629
      Buprenorphine
      Pain
      pain management doctor put butrans patches 6 w...
    
    
      102449
      Aripiprazole
      Mental Health
      abilify 20 mg patient diagnosed disorganized s...
    
    
      87285
      Latuda
      Mental Health
      great experience far latuda started taking 40 ...
    
    
      106703
      Implanon
      Contraception
      never depo suppose b ideal candidate first 6 m...
    
    
      131704
      Effexor XR
      Mental Health
      med 5 years worked fine great stopped panic at...
    
    
      192806
      Drospirenone / ethinyl estradiol
      Contraception
      put yasmin 6 months regulate cycle reduce acne...
    
    
      69488
      Buprenorphine
      Pain
      love butrans patch !!! relieved half pain know...
    
    
      107449
      Implanon
      Contraception
      8 months sad say caused nothing self esteem be...
    
    
      60156
      NuvaRing
      Contraception
      birth control considering getting pregnant use...
    
    
      24139
      Tretinoin
      Dermatalogical
      hit three month point tretinoin 05 happy reall...
    
    
      131909
      Effexor XR
      Mental Health
      medicine saved life wits end anti depressants ...
    
    
      ...
      ...
      ...
      ...
    
    
      104148
      Ethinyl estradiol / levonorgestrel
      Contraception
      birth control best got heavy periods period li...
    
    
      66631
      Seroquel
      Mental Health
      begin seroquel personally think great medicati...
    
    
      145230
      Etonogestrel
      Contraception
      highly reccomend implant anyone got mine inser...
    
    
      77215
      Lorcaserin
      Weight
      started taking medication yesterday craving su...
    
    
      96128
      Sertraline
      Mental Health
      taking wellbutrin depression stopped working d...
    
    
      205544
      Pristiq
      Mental Health
      hot flashes blisters heart palpitations bruising
    
    
      62789
      Cafergot
      Headache
      diagnosed cluster headaches late 20 prescribed...
    
    
      160363
      Buspirone
      Mental Health
      good experience anxiety depression usually man...
    
    
      132661
      Doxepin
      Sleep
      read great comments talked doctor tried medici...
    
    
      128820
      Phentermine
      Weight
      started adipex 2 weeks ago lost 20 lbs far eve...
    
    
      131713
      Effexor XR
      Mental Health
      started take med one week ago gad im feeling b...
    
    
      215786
      Copper
      Contraception
      covered family pact california insertion painf...
    
    
      53802
      Zipsor
      Pain
      knees arthroscopic last year half ended reinju...
    
    
      142183
      Levonorgestrel
      Contraception
      bad reviews kyleena wanted share experience iu...
    
    
      116889
      Lamictal
      Mental Health
      medication nearly 10 years generally helpful t...
    
    
      144132
      Etonogestrel
      Contraception
      got nexplanon day baby feb 9 2016 gained 24 po...
    
    
      143487
      Etonogestrel
      Contraception
      honestly worst birth control ever taken even h...
    
    
      8477
      Zolpidem
      Sleep
      zolpidem work fast however right arm goes slee...
    
    
      76151
      Portia
      Contraception
      switched portia 12 days ago started spotting a...
    
    
      73058
      Ethinyl estradiol / norethindrone
      Contraception
      first starting taking lo loestrin fe first bir...
    
    
      183202
      Cymbalta
      Mental Health
      taking cymbalta 15 months first 30mg six month...
    
    
      148859
      Mirena
      Contraception
      experience painful insertion expected since ne...
    
    
      109111
      Nexplanon
      Contraception
      nexplanon since dec 27 2016 got first period e...
    
    
      176146
      Lorazepam
      Mental Health
      4 years ago started early morning awakening in...
    
    
      18421
      Zolpidem
      Sleep
      started taking medication 10 years ago doctor ...
    
    
      56907
      Roxicodone Intensol
      Pain
      used throat cancer helped numb throat able eat...
    
    
      228492
      Geodon
      Mental Health
      bad place time started taking doctor wanted we...
    
    
      93069
      Vortioxetine
      Mental Health
      third med tried anxiety mild depression week h...
    
    
      132177
      Ativan
      Mental Health
      super taking medication started dealing anxiet...
    
    
      164345
      Junel 1.5 / 30
      Contraception
      would second month junel birth control 10 year...
    
  

84513 rows × 3 columns

Next, we define a dictionary to hold accuracy data and define a function to train a model on the training set, test it on the test set, and store & report the resulting accuracy.



In [7]:

    
accuracies = {}
def test_pipeline(pipeline, name):
    model = pipeline.fit(drug_train, drug_train.condition)
    drug_test['prediction'] = model.predict(drug_test)
    acc = sum(drug_test['prediction']==
              drug_test['condition'])/len(drug_test)
    accuracies[name] = acc
    print('Accuracy: {:0.3f}%'.format(100*acc))

As a baseline, here is a straightfoward classifier on unigrams from the review text. It achieves an already-enviable 95.139% accuracy.



In [8]:

    
pipeline_plain = Pipeline([
    ('extract', ColumnExtractor('review')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_plain, 'plain')









    



Accuracy: 95.139%

Next, we run each word of the review text through Soundex and through Double Metaphone (separately, not in series) and store the results in new DataFrame columns. And then we run the results through their own pipelines and get somewhat-disappointing results of 92.588% and 93.947% accuracies, respectively.



In [9]:

    
sdx = Soundex()
drug_train['soundex'] = drug_train.review.apply(lambda review:
                                                ' '.join(sdx.encode(word) for
                                                         word in review.split()))

dm = DoubleMetaphone()
drug_train['dmetaphone'] = drug_train.review.apply(lambda review:
                                                   ' '.join(dm.encode(word)[0] for
                                                            word in review.split()))



In [10]:

    
drug_test['soundex'] = drug_test.review.apply(lambda review:
                                              ' '.join(sdx.encode(word) for
                                                       word in review.split()))

drug_test['dmetaphone'] = drug_test.review.apply(lambda review:
                                                 ' '.join(dm.encode(word)[0] for
                                                          word in review.split()))



In [11]:

    
pipeline_soundex = Pipeline([
    ('extract', ColumnExtractor('soundex')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_soundex, 'soundex')









    



Accuracy: 92.588%



In [12]:

    
pipeline_dmetaphone = Pipeline([
    ('extract', ColumnExtractor('dmetaphone')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_dmetaphone, 'double metaphone')









    



Accuracy: 93.947%

Next, we get the Skeleton and Omission keys of each word of the review text and store the results in new DataFrame columns. And then we run the results through their own pipelines and get more encouraging results of 95.075% and 95.075% accuracy, respectively.



In [13]:

    
sk = SkeletonKey()
drug_train['skeleton'] = drug_train.review.apply(lambda review:
                                                 ' '.join(sk.fingerprint(word) for word in
                                                          review.split()))
ok = OmissionKey()
drug_train['omission'] = drug_train.review.apply(lambda review:
                                                 ' '.join(ok.fingerprint(word) for word in
                                                          review.split()))



In [14]:

    
drug_test['skeleton'] = drug_test.review.apply(lambda review:
                                               ' '.join(sk.fingerprint(word) for word in
                                                        review.split()))
drug_test['omission'] = drug_test.review.apply(lambda review:
                                               ' '.join(ok.fingerprint(word) for word in
                                                        review.split()))



In [15]:

    
pipeline_skeleton = Pipeline([
    ('extract', ColumnExtractor('skeleton')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_skeleton, 'skeleton key')









    



Accuracy: 95.075%



In [16]:

    
pipeline_omission = Pipeline([
    ('extract', ColumnExtractor('omission')),
    ('vectorize', TfidfVectorizer(sublinear_tf=True, norm='l2', lowercase=False)),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_omission, 'omission key')









    



Accuracy: 94.856%

As another option, we retrieve character-wise 7-grams, storing them as dictionaries within their own column. This gives a nice improvement over the baseline, with 96.536% accuracy.



In [17]:

    
tokenizer = QGrams(qval=7, start_stop='')
drug_train['qgrams'] = drug_train.review.apply(lambda review:
                                               dict(tokenizer.tokenize(review).get_counter()))



In [18]:

    
drug_test['qgrams'] = drug_test.review.apply(lambda review:
                                             dict(tokenizer.tokenize(review).get_counter()))



In [19]:

    
pipeline_qgrams = Pipeline([
    ('extract', ColumnExtractor('qgrams')),
    ('vectorize', DictVectorizer()),
    ('tfidf', TfidfTransformer(sublinear_tf=True, norm='l2')),
    ('classifier', LinearSVC(loss='hinge', C=1, max_iter=2000, random_state=1337))
])
test_pipeline(pipeline_qgrams, 'q-grams')









    



Accuracy: 96.536%

And finally, we throw all the pipelines together with a voting classifier, minus the worst performing (Soundex) pipeline. We also add weights to bias strongly towards the best performing (q-grams) pipeline. And the resulting pipeline should both generalize well and beat someone using the plain pipeline.

Naturally, there is much room for improvement!



In [20]:

    
pipeline_voting = VotingClassifier([
    ('plain', pipeline_plain),
    # ('soundex', pipeline_soundex),
    ('dmetaphone', pipeline_dmetaphone),
    ('skeleton', pipeline_skeleton),
    ('omission', pipeline_omission),
    ('qgrams', pipeline_qgrams),
], weights=[1, 1, 1, 1, 2.5])
test_pipeline(pipeline_voting, 'voting')









    



Accuracy: 95.560%

	drugName	condition	review
92703	Lybrel	Contraception	used take another oral contraceptive 21 pill c...
138000	Ortho Evra	Contraception	first time using form birth control glad went ...
165907	Levonorgestrel	Contraception	pulled cummed bit took plan b 26 hours later t...
102654	Aripiprazole	Mental Health	abilify changed life hope zoloft clonidine fir...
48928	Ethinyl estradiol / levonorgestrel	Contraception	pill many years doctor changed rx chateal effe...
29607	Topiramate	Headache	medication almost two weeks started 25mg worki...
75612	L-methylfolate	Mental Health	taken anti depressants years improvement mostl...
98494	Nexplanon	Contraception	started nexplanon 2 months ago minimal amount ...
81890	Liraglutide	Weight	taking saxenda since july 2016 severe nausea m...
212077	Lamotrigine	Mental Health	every medicine sun seems manage hypomania mani...
231466	Trazodone	Sleep	insomnia horrible story begins pcp prescribing...
227020	Etonogestrel	Contraception	nexplanon job worry free sex thing periods som...
27339	Imitrex	Headache	first suffered included splitting head pain na...
96233	Sertraline	Mental Health	1 week zoloft anxiety mood swings take 50mg mo...
204999	Toradol	Pain	30 years old multiple composite spinal injurie...
93678	Morphine	Pain	morphine least 7 years .. medicine seems manag...
39795	Contrave	Weight	finishing second week taking contrave lost 10 ...
121333	Venlafaxine	Mental Health	gp started venlafaxine yesterday help depressi...
188061	Symbyax	Mental Health	helps sadness strongly counters moderate urges...
69629	Buprenorphine	Pain	pain management doctor put butrans patches 6 w...
102449	Aripiprazole	Mental Health	abilify 20 mg patient diagnosed disorganized s...
87285	Latuda	Mental Health	great experience far latuda started taking 40 ...
106703	Implanon	Contraception	never depo suppose b ideal candidate first 6 m...
131704	Effexor XR	Mental Health	med 5 years worked fine great stopped panic at...
192806	Drospirenone / ethinyl estradiol	Contraception	put yasmin 6 months regulate cycle reduce acne...
69488	Buprenorphine	Pain	love butrans patch !!! relieved half pain know...
107449	Implanon	Contraception	8 months sad say caused nothing self esteem be...
60156	NuvaRing	Contraception	birth control considering getting pregnant use...
24139	Tretinoin	Dermatalogical	hit three month point tretinoin 05 happy reall...
131909	Effexor XR	Mental Health	medicine saved life wits end anti depressants ...
...	...	...	...
104148	Ethinyl estradiol / levonorgestrel	Contraception	birth control best got heavy periods period li...
66631	Seroquel	Mental Health	begin seroquel personally think great medicati...
145230	Etonogestrel	Contraception	highly reccomend implant anyone got mine inser...
77215	Lorcaserin	Weight	started taking medication yesterday craving su...
96128	Sertraline	Mental Health	taking wellbutrin depression stopped working d...
205544	Pristiq	Mental Health	hot flashes blisters heart palpitations bruising
62789	Cafergot	Headache	diagnosed cluster headaches late 20 prescribed...
160363	Buspirone	Mental Health	good experience anxiety depression usually man...
132661	Doxepin	Sleep	read great comments talked doctor tried medici...
128820	Phentermine	Weight	started adipex 2 weeks ago lost 20 lbs far eve...
131713	Effexor XR	Mental Health	started take med one week ago gad im feeling b...
215786	Copper	Contraception	covered family pact california insertion painf...
53802	Zipsor	Pain	knees arthroscopic last year half ended reinju...
142183	Levonorgestrel	Contraception	bad reviews kyleena wanted share experience iu...
116889	Lamictal	Mental Health	medication nearly 10 years generally helpful t...
144132	Etonogestrel	Contraception	got nexplanon day baby feb 9 2016 gained 24 po...
143487	Etonogestrel	Contraception	honestly worst birth control ever taken even h...
8477	Zolpidem	Sleep	zolpidem work fast however right arm goes slee...
76151	Portia	Contraception	switched portia 12 days ago started spotting a...
73058	Ethinyl estradiol / norethindrone	Contraception	first starting taking lo loestrin fe first bir...
183202	Cymbalta	Mental Health	taking cymbalta 15 months first 30mg six month...
148859	Mirena	Contraception	experience painful insertion expected since ne...
109111	Nexplanon	Contraception	nexplanon since dec 27 2016 got first period e...
176146	Lorazepam	Mental Health	4 years ago started early morning awakening in...
18421	Zolpidem	Sleep	started taking medication 10 years ago doctor ...
56907	Roxicodone Intensol	Pain	used throat cancer helped numb throat able eat...
228492	Geodon	Mental Health	bad place time started taking doctor wanted we...
93069	Vortioxetine	Mental Health	third med tried anxiety mild depression week h...
132177	Ativan	Mental Health	super taking medication started dealing anxiet...
164345	Junel 1.5 / 30	Contraception	would second month junel birth control 10 year...