In [ ]:
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer

In [ ]:
########## STEP 1: DATA IMPORT AND PREPROCESSING ##########

# Here we're taking in the training data and splitting it into two lists: one with the text of
# each bill title, and one with each title's corresponding category. Order is important: the
# first title in the first list must line up with the first category in the second list.
training = [line.strip().split('|') for line in open('../data/bills_training.txt', 'r').readlines()]
text = [t[0] for t in training if len(t) > 1]
labels = [t[1] for t in training if len(t) > 1]

# A little bit of cleanup for scikit-learn's benefit. Scikit-learn models want our categories to
# be numbers, not strings. The LabelEncoder performs this transformation.
encoder = preprocessing.LabelEncoder()
correct_labels = encoder.fit_transform(labels)
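
Just to make the LabelEncoder step concrete, here is a small sanity check I might run (the sample labels below are made up, not taken from the bills data): the encoder stores the sorted unique label strings in classes_ and can map the integers back with inverse_transform.

In [ ]:
# Quick illustration of what LabelEncoder does, using made-up labels rather
# than the real bill categories.
demo_encoder = preprocessing.LabelEncoder()
demo_numbers = demo_encoder.fit_transform(['Education', 'Health', 'Education', 'Taxation'])
print(demo_numbers)                                  # [0 1 0 2]
print(demo_encoder.classes_)                         # ['Education' 'Health' 'Taxation']
print(demo_encoder.inverse_transform(demo_numbers))  # back to the original strings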

In [ ]:
########## STEP 2: FEATURE EXTRACTION ##########
vectorizer = CountVectorizer(stop_words='english')
data = vectorizer.fit_transform(text)

In [ ]:
########## STEP 3: MODEL BUILDING ##########
model = DecisionTreeClassifier()
fit_model = model.fit(data, correct_labels)

In [ ]:
########## STEP 4: EVALUATION ##########
# Evaluate our model with 5-fold cross-validation
scores = cross_val_score(model, data, correct_labels, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [ ]:
########## STEP 5: APPLYING THE MODEL ##########
docs_new = ["Public postsecondary education: executive officer compensation.",
            "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
            "Political Reform Act of 1974: campaign disclosures.",
            "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
        ]

test_data = vectorizer.transform(docs_new)

for i in range(len(docs_new)):
    # Predict on one row of the sparse matrix at a time, then map the
    # predicted integer label back to its category name.
    print('%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data[i])[0]]))

Overall, this model makes a good amount of sense to me. The idea of keeping the records in order while splitting them into titles and their corresponding categories seems like it could be useful in a variety of ways.

The code seems pretty straightforward; however, the last step gets a bit confusing. I understand everything prior to the for loop, and the for loop itself I only understand to an extent. If we could go over the last step in more detail, I think that would be helpful.
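
In the meantime, to check my own reading of that loop, I tried unpacking it into smaller, named steps; the intermediate variable names (row, predicted_index, category) are mine and not part of the original code.

In [ ]:
# My attempt at unpacking the Step 5 loop into smaller pieces -- the
# intermediate names here are just for readability.
for i, doc in enumerate(docs_new):
    row = test_data[i]                        # bag-of-words vector for this title (1 x n_features, sparse)
    predicted_index = model.predict(row)[0]   # the tree returns the category as an integer label
    category = encoder.inverse_transform([predicted_index])[0]  # map the integer back to its string label
    print('%s -> %s' % (doc, category))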


In [ ]: