This code comes straight from a Times project that helps us standardize campaign finance data to enable new types of analyses. Specifically, it tries to categorize a free-form occupation/employer string into a discrete job category (for example, the strings "LAWYER" and "ATTORNEY" would both be categorized under "LAW").
We use this to create one of the many features that feed the larger predictive model we use for standardization. It also shows the power of simple classification in action.
In :
import csv, re, string
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
In :
# Some basic setup for data-cleaning purposes
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']
In :
# Open the training data and clean it up a bit
data = []
with open('data/category-training.csv', 'r') as f:
    inputreader = csv.reader(f, delimiter=',', quotechar='"')
    for r in inputreader:
        # Concatenate the occupation and employer strings together and remove
        # punctuation. Both occupation and employer will be used in prediction.
        text = PUNCTUATION.sub('', ' '.join(r[0:2]))
        # NOTE: the column index was lost in this copy of the notebook. r[2]
        # assumes the industry prefix is the third column of the CSV.
        if len(r) > 2 and r[2] and r[2][0] in VALID_CLASSES:
            # We're only attempting to classify the first character of the
            # industry prefix ("A", "B", etc.) -- not the whole thing. That's
            # what the [0] piece is about.
            data.append([text, r[2][0]])
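To make the cleaning and filtering step concrete, here is a minimal sketch that runs the same logic against a few in-memory rows instead of the training file. The sample rows, the industry codes (`K1000`, `F2000`, `Q9`), the assumed column order (occupation, employer, industry code) and the helper name `load_rows` are all hypothetical, chosen only for illustration.

```python
import csv
import io
import re
import string

PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']

# Hypothetical sample rows: occupation, employer, industry code.
SAMPLE_CSV = (
    'LAWYER,"SKADDEN, ARPS",K1000\n'
    'CEO,H.B. AGENCY,F2000\n'
    'RETIRED,NONE,\n'
    'TEACHER,CITY SCHOOLS,Q9\n'
)

def load_rows(fileobj):
    """Return [text, class] pairs for rows whose industry code starts with a valid class."""
    rows = []
    for r in csv.reader(fileobj, delimiter=',', quotechar='"'):
        # Join occupation and employer, then strip punctuation.
        text = PUNCTUATION.sub('', ' '.join(r[0:2]))
        # Assumption: the industry code lives in the third column.
        if len(r) > 2 and r[2] and r[2][0] in VALID_CLASSES:
            rows.append([text, r[2][0]])
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows)
# [['LAWYER SKADDEN ARPS', 'K'], ['CEO HB AGENCY', 'F']]
```

The row with an empty code and the row whose code starts with a letter outside `VALID_CLASSES` are both dropped, so only labeled examples reach the model.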
In :
# Separate the text of the occupation/employer strings from the correct classification
texts = np.array([el[0] for el in data])
classes = np.array([el[1] for el in data])
In :
print(texts)
['Owner First Priority Title Llc' 'SENIOR PARTNER ARES MANAGEMENT' 'CEO HB AGENCY' ..., 'INVESTMENT EXECUTIVE FEF MANAGEMENT LLC' 'Owner Fair Funeral Home' 'ST MARTIN LIRERRE LAW FIRM ']
In :
print(classes)
['F' 'Z' 'Z' ..., 'F' 'G' 'K']
In :
# Build a simple machine learning pipeline to turn the above arrays into something scikit-learn understands
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        ngram_range=(1,2),
        stop_words='english',
        min_df=2,
        max_df=len(texts))),
    ('classifier', LogisticRegression())
])
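The vectorizer step is what turns each string into countable features. As a rough illustration of what `ngram_range=(1,2)` combined with stop-word removal produces, here is a pure-Python approximation. It is not scikit-learn's implementation: the function name `ngram_counts` and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins, and the real `'english'` stop-word list is much longer.

```python
from collections import Counter

# Hypothetical stand-in for a stop-word list, for illustration only.
TOY_STOP_WORDS = {'at', 'of', 'the', 'and'}

def ngram_counts(text, ngram_range=(1, 2), stop_words=TOY_STOP_WORDS):
    """Count word n-grams after lowercasing and dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter()
    low, high = ngram_range
    for n in range(low, high + 1):
        # Slide a window of n tokens across the filtered token list.
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

features = ngram_counts('ATTORNEY AT LAW FIRM')
print(sorted(features))
# ['attorney', 'attorney law', 'firm', 'law', 'law firm']
```

Bigrams such as "law firm" let the classifier pick up on two-word phrases as well as single words. In the real pipeline, `min_df=2` additionally discards any term that appears in fewer than two training strings, and because `max_df` is given as an integer equal to the number of training strings, no term is discarded for being too common.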
In :
# Fit the model
pipeline.fit(np.asarray(texts), np.asarray(classes))
Out:Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=66923, max_features=None, min_df=2, ngram_range=(1, 2), preprocessor=None, stop_words='english...', penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0))])
In :
# Now, run some predictions. "K" means "LAW" in this case.
print(pipeline.predict(['LAWYER']))

In :
# It also recognizes law firms!
print(pipeline.predict(['SKADDEN ARPS']))

In :
# The "F" category represents business and finance.
print(pipeline.predict(['CEO']))