Анализ тональности отзывов

Сначала возьмем выборку отзывов на фильмы из NLTK:


In [2]:
from nltk.corpus import movie_reviews
 
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print negids[:5]


[u'neg/cv000_29416.txt', u'neg/cv001_19502.txt', u'neg/cv002_17424.txt', u'neg/cv003_12683.txt', u'neg/cv004_12641.txt']

Приготовим список текстов и классов как обучающую выборку:


In [3]:
negfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in negids]
posfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in posids]

texts = negfeats + posfeats
labels = [0] * len(negfeats) + [1] * len(posfeats)

In [4]:
print texts[0]


plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn ' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it ' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea what ' s going on . there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . now i personally don ' t mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film ' s biggest problem . it ' s obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . and do they make things entertaining , thrilling or even engaging , in the meantime ? not really . the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half - way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn ' t the make the film all that more entertaining . i guess the bottom line with movies like this is that you should always make sure that the audience is " into it " even before they are given the secret password to enter your world of understanding . i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! okay , we get it . . . there are people chasing her and we don ' t know who they are . do we really need to see it over and over again ? how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? apparently , the studio took this film away from its director and chopped it up themselves , and it shows . there might ' ve been a pretty decent teen mind - fuck movie in here somewhere , but i guess " the suits " decided that turning it into a music video with little edge , would make more sense . the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character ' s unraveling . overall , the film doesn ' t stick because it doesn ' t entertain , it ' s confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . oh , and by the way , this is not a horror or teen slasher flick . . . it ' s just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . it also wrapped production two years ago and has been sitting on the shelves ever since . whatever . . . skip it ! where ' s joblo coming from ? a nightmare of elm street 3 ( 7 / 10 ) - blair witch 2 ( 7 / 10 ) - the crow ( 9 / 10 ) - the crow : salvation ( 4 / 10 ) - lost highway ( 10 / 10 ) - memento ( 10 / 10 ) - the others ( 9 / 10 ) - stir of echoes ( 8 / 10 )

Импортируем нужные нам модули


In [5]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline

Оценка качества работы разных классификаторов


In [6]:
def text_classifier(vectorizer, transformer, classifier):
    return Pipeline(
            [("vectorizer", vectorizer),
            ("transformer", transformer),
            ("classifier", classifier)]
        )

In [7]:
for clf in [LogisticRegression, LinearSVC, SGDClassifier]:
    print clf
    print cross_val_score(text_classifier(CountVectorizer(), TfidfTransformer(), clf()), texts, labels).mean()
    print "\n"


<class 'sklearn.linear_model.logistic.LogisticRegression'>
0.813511115906


<class 'sklearn.svm.classes.LinearSVC'>
0.845507183831


<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>
0.840010669352


Подготовка классификатора, обученного на всех данных


In [8]:
clf_pipeline = Pipeline(
            [("vectorizer", TfidfVectorizer()),
            ("classifier", LinearSVC())]
        )


clf_pipeline.fit(texts, labels)

print clf_pipeline


Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_id...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [9]:
print clf_pipeline.predict(["Amazing film! I will advice it to all my friends. Genious",
                           "Awful film! The man who advised me to watch it is really crazy idiot."])


[1 0]

Понижение размерности и ансамбли деревьев


In [23]:
%%time
from sklearn.decomposition import NMF, TruncatedSVD

v = CountVectorizer()
mx = v.fit_transform(texts)
mf = TruncatedSVD(10)
u = mf.fit_transform(mx)


CPU times: user 1.77 s, sys: 0 ns, total: 1.77 s
Wall time: 1.77 s

In [22]:
for transform in [TruncatedSVD, NMF]:
    print transform
    print cross_val_score(text_classifier(CountVectorizer(), transform(n_components=10), LinearSVC()), texts, labels).mean()
    print "\n"


<class 'sklearn.decomposition.truncated_svd.TruncatedSVD'>
0.569963676251


<class 'sklearn.decomposition.nmf.NMF'>
0.650514286742



In [ ]:

Если задать n_components=1000:


In [12]:
%%time
print cross_val_score(text_classifier(TfidfVectorizer(), TruncatedSVD(n_components=1000), LinearSVC()),
                      texts, 
                      labels
                     ).mean()


0.842014169858
CPU times: user 3min 11s, sys: 13.5 s, total: 3min 25s
Wall time: 2min 24s

Ансамбли деревьев на преобразованных признаках


In [14]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [15]:
%%time
print cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TruncatedSVD(100)),
            ("classifier", RandomForestClassifier(100))
        ]),
    texts,
    labels
    )


[ 0.70209581  0.71621622  0.69069069]
CPU times: user 13.9 s, sys: 600 ms, total: 14.5 s
Wall time: 14.3 s

Больше компонент и больше деревьев:


In [19]:
%%time
print cross_val_score(text_classifier(CountVectorizer(), TruncatedSVD(n_components=1000), RandomForestClassifier(1000)),
                      texts, 
                      labels
                     ).mean()


0.561485137832
CPU times: user 4min 43s, sys: 13.6 s, total: 4min 57s
Wall time: 3min 56s

Tf*Idf вместо частот слов:


In [18]:
%%time
print cross_val_score(text_classifier(TfidfVectorizer(), TruncatedSVD(n_components=1000), RandomForestClassifier(1000)),
                      texts, 
                      labels
                     ).mean()


0.590001678325
CPU times: user 4min 39s, sys: 14.3 s, total: 4min 53s
Wall time: 3min 52s

Совмещаем Tf*Idf и SVD


In [16]:
from sklearn.pipeline import FeatureUnion

estimators = [('tfidf', TfidfTransformer()), ('svd', TruncatedSVD(1))]
combined = FeatureUnion(estimators)

In [17]:
%%time
print cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", combined),
            ("classifier", LinearSVC())
        ]),
    texts,
    labels
    )


[ 0.74251497  0.78978979  0.62912913]
CPU times: user 10.7 s, sys: 24 ms, total: 10.7 s
Wall time: 10.7 s

In [ ]: