unigram_feature: This feature checks for the occurrence of certain unigrams, just as in John's SciKit Learn Notebook. We initially used it with the most frequent words of each category, trying lists of differing sizes; using the 500 most frequent words of each category performed best (a sketch of how such per-category lists could be built follows these descriptions). However, this performance was outstripped by a simple tf-idf, and in combination it only lowered the score.
numeric_feature: The goal of this feature was to check whether a question used numbers. The idea was that certain categories, such as math, would use numbers more than categories such as entertainment. In practice it did not work well.
similarity_feature: Here we use WordNet's Wu-Palmer similarity to see how similar the words in the question are to the question's category. This performed quite poorly, I believe mostly because the similarity function is not very accurate.
pos_feature: We added a feature to count the occurrences of a particular part of speech. We tested it with nouns, verbs, and adjectives; interestingly, verbs performed best. However, in combination with the other features we chose, it seemed to hurt performance.
median_length_feature: Without tf-idf, including the length of the median-length word of a question greatly increased categorization accuracy. However, once tf-idf was used as a feature, the median length only detracted from the score, and since tf-idf performed better we did not include it in the feature set (a sketch of such a tf-idf combination appears after the code below).
names_feature: This feature checks whether a particular question contains a name. This worked better than counting the number of names, likely due to a lack of data: the number of questions with names in the training set is small, so you get better classification by only making the feature return a binary indicator of whether a name is present.
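Below is a minimal sketch (not part of the original notebook) of how the per-category unigram lists could be built; train_questions and train_labels are hypothetical names for the training texts and their category labels.
In [ ]:
from collections import Counter, defaultdict

def top_unigrams_by_category(questions, labels, n=500):
    # Tally word frequencies separately for each category, then keep the
    # n most common words of each one.
    counts = defaultdict(Counter)
    for question, label in zip(questions, labels):
        counts[label].update(question.lower().split(" "))
    return {label: [w for w, _ in counter.most_common(n)]
            for label, counter in counts.items()}

# category_unigrams = top_unigrams_by_category(train_questions, train_labels)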
In [ ]:
import nltk
from nltk.corpus import names
from nltk.corpus import wordnet as wn

def unigram_feature(x, unigrams):
    # Count how many times any of the given unigrams appears in the question.
    word_list = x.lower().split(" ")
    count = 0
    for unigram in unigrams:
        count += word_list.count(unigram)
    return count
def numeric_feature(x):
    # Count the numeric characters in the question.
    count = 0
    for c in x:
        if c.isnumeric():
            count += 1
    return count
def similarity_feature(x, word):
    # `word` is expected to be a WordNet Synset (e.g. wn.synset('history.n.01')).
    # Return the best Wu-Palmer similarity between it and any noun sense of
    # any word in the question.
    word_list = x.lower().split(" ")
    similarity = 0
    for w in word_list:
        for s in wn.synsets(w, pos=wn.NOUN):
            sim = word.wup_similarity(s)
            if sim is not None:
                similarity = max(similarity, sim)
    return similarity
def pos_feature(x, pos):
    # Count the words tagged with the given part-of-speech tag (e.g. 'NN', 'VB').
    word_list = x.lower().split(" ")
    t = nltk.pos_tag(word_list)
    count = 0
    for w in t:
        if w[1] == pos:
            count += 1
    return count
def median_length_feature(x):
    # Length of the median-length word in the question.
    word_list = x.lower().split(" ")
    word_lengths = sorted(len(w) for w in word_list)
    return word_lengths[len(word_lengths) // 2]
# Lowercased set of first names from the NLTK names corpus, for fast lookup.
allNames = set(name.lower() for name in names.words())

def names_feature(x):
    # Binary feature: 1 if the question contains any known name, else 0.
    word_list = x.lower().split(" ")
    for word in word_list:
        if word in allNames:
            return 1
    return 0
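A minimal sketch, assuming scikit-learn, of how one of the hand-built features above could be combined with tf-idf in a single classifier; the logistic regression classifier and the names train_questions and train_labels are illustrative assumptions, not the notebook's actual setup.
In [ ]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def as_column(feature_fn):
    # Wrap a per-question feature function so it yields a 2-D column,
    # which FeatureUnion can stack alongside the tf-idf matrix.
    return FunctionTransformer(
        lambda X: np.array([[feature_fn(x)] for x in X]), validate=False)

model = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("median_length", as_column(median_length_feature)),
    ])),
    ("clf", LogisticRegression()),
])
# model.fit(train_questions, train_labels)

FeatureUnion simply concatenates the tf-idf matrix with the extra column, so further feature functions can be added to the union in the same way.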