This code comes straight from a Times project that helps us standardize campaign finance data to enable new types of analyses. Specifically, it tries to categorize a free-form occupation/employer string into a discrete job category (for example, the strings "LAWYER" and "ATTORNEY" would both be categorized under "LAW").
The predicted category is one of many features that feed the larger predictive model we use for standardization. It also shows the power of simple classification in action.
In [2]:
import csv
import re
import string
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
In [30]:
# Some basic setup for data-cleaning purposes
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']
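To make the cleaning step concrete, here is a minimal sketch of what the PUNCTUATION pattern does to a single record. The helper name clean_text and the sample strings are made up for illustration; only the regular expression comes from the cell above.

```python
import re
import string

# Same pattern as in the setup cell: a character class of all ASCII punctuation.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def clean_text(occupation, employer):
    """Hypothetical helper: join the two fields and strip punctuation."""
    return PUNCTUATION.sub('', ' '.join([occupation, employer]))


# Made-up sample record, not taken from the training data.
print(clean_text('ATTORNEY-AT-LAW', 'SMITH & JONES, L.L.P.'))
# -> ATTORNEYATLAW SMITH  JONES LLP
```

Note that punctuation is deleted rather than replaced with a space, so hyphenated words fuse together ("ATTORNEYATLAW") and a removed "&" leaves a double space behind. The double space is harmless here because the vectorizer splits on word boundaries.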
In [16]:
# Open the training data and clean it up a bit
data = []
with open('data/category-training.csv', 'r') as f:
    inputreader = csv.reader(f, delimiter=',', quotechar='"')
    for r in inputreader:
        # Concatenate the occupation and employer strings together and remove
        # punctuation. Both occupation and employer will be used in prediction.
        text = PUNCTUATION.sub('', ' '.join(r[0:2]))
        if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:
            # We're only attempting to classify the first character of the
            # industry prefix ("A", "B", etc.) -- not the whole thing. That's
            # what the r[2][0] piece is about.
            data.append([text, r[2][0]])
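The filtering rule in the loop above is easier to see on a few rows held in memory. The sketch below wraps the same logic in a hypothetical load_rows function and feeds it made-up rows through io.StringIO (Python 3). The column order (occupation, employer, category code) is inferred from the loop, and the codes themselves are invented for illustration.

```python
import csv
import io
import re
import string

PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']


def load_rows(handle):
    """Hypothetical wrapper around the loop above: returns [text, class] pairs."""
    rows = []
    for r in csv.reader(handle, delimiter=',', quotechar='"'):
        text = PUNCTUATION.sub('', ' '.join(r[0:2]))
        # Keep the row only if the code is longer than one character and its
        # first character is a category we want to predict.
        if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:
            rows.append([text, r[2][0]])
    return rows


# Made-up rows: two valid, one with an unlisted prefix, one with no code.
sample = io.StringIO(
    'LAWYER,"SMITH, JONES LLP",K1000\n'
    'CEO,ACME CAPITAL,F2100\n'
    'RETIRED,NONE,Y4000\n'
    'TEACHER,CITY SCHOOLS,\n'
)
print(load_rows(sample))
# -> [['LAWYER SMITH JONES LLP', 'K'], ['CEO ACME CAPITAL', 'F']]
```

Rows whose code is empty or starts with a letter outside VALID_CLASSES are silently dropped, so the training set can end up smaller than the CSV. A row with fewer than three columns would raise an IndexError in both this sketch and the original loop.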
In [18]:
# Separate the text of the occupation/employer strings from the correct classification
texts = np.array([el[0] for el in data])
classes = np.array([el[1] for el in data])
In [19]:
print(texts)
In [20]:
print(classes)
In [31]:
# Build a simple machine learning pipeline to turn the above arrays into something scikit-learn understands
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        ngram_range=(1, 2),
        stop_words='english',
        min_df=2,
        max_df=len(texts))),
    ('classifier', LogisticRegression())
])
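As a rough picture of what ngram_range=(1, 2) asks the vectorizer to do, the sketch below builds unigrams and bigrams by hand. It is an approximation, not CountVectorizer's actual implementation: it borrows the default lowercase step and token pattern (words of two or more characters) but skips stop-word removal and the min_df/max_df filtering. The function name unigrams_and_bigrams is made up for illustration.

```python
import re

# Same shape as CountVectorizer's default token pattern: runs of 2+ word characters.
TOKEN = re.compile(r'\b\w\w+\b')


def unigrams_and_bigrams(text):
    """Hypothetical helper: single tokens plus each adjacent pair of tokens."""
    tokens = TOKEN.findall(text.lower())
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


print(unigrams_and_bigrams('ATTORNEY SKADDEN ARPS'))
# -> ['attorney', 'skadden', 'arps', 'attorney skadden', 'skadden arps']
```

Each distinct n-gram becomes one column of counts in the matrix handed to the classifier. Bigrams let a pair such as "skadden arps" carry weight as a unit, not just as two unrelated words.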
In [32]:
# Fit the model
pipeline.fit(np.asarray(texts), np.asarray(classes))
In [27]:
# Now, run some predictions. "K" means "LAW" in this case.
print(pipeline.predict(['LAWYER']))
In [28]:
# It also recognizes law firms!
print(pipeline.predict(['SKADDEN ARPS']))
In [34]:
# The "F" category represents business and finance.
print(pipeline.predict(['CEO']))
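Since the predicted category is meant to feed a larger model, it can help to see how confident the classifier is, not just which label it picks. The sketch below is one way to do that with predict_proba. It trains on a handful of made-up rows (not the Times data) so that it runs on its own, drops the min_df and stop_words settings because the toy set is tiny, and uses a hypothetical helper named predict_with_confidence.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training rows for illustration only.
toy_texts = np.array([
    'LAWYER SMITH JONES LLP',
    'ATTORNEY SELF EMPLOYED',
    'LAWYER DOE ASSOCIATES',
    'CEO ACME CAPITAL',
    'BANKER ACME CAPITAL',
    'CEO FIRST BANK',
])
toy_classes = np.array(['K', 'K', 'K', 'F', 'F', 'F'])

toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', LogisticRegression()),
])
toy_pipeline.fit(toy_texts, toy_classes)


def predict_with_confidence(model, text):
    """Hypothetical helper: return the top label and its estimated probability."""
    probabilities = model.predict_proba([text])[0]
    labels = model.named_steps['classifier'].classes_
    best = int(np.argmax(probabilities))
    return labels[best], float(probabilities[best])


label, confidence = predict_with_confidence(toy_pipeline, 'LAWYER')
print('%s %.2f' % (label, confidence))
```

On six rows the probabilities mean very little. With the full training set, a low top probability is a reasonable signal that a string should be reviewed rather than trusted.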