Chapter 8: Applying Machine Learning to Sentiment Analysis

8.1 Obtaining the IMDb movie review dataset

$ curl -O http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
$ tar xfz aclImdb_v1.tar.gz
  • Install pyprind to display a progress bar while loading the data
$ pip install pyprind

In [18]:
# Version check: GridSearchCV moved from sklearn.grid_search to sklearn.model_selection in scikit-learn 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

In [9]:
# Read the 50,000 reviews from the extracted aclImdb directory into a DataFrame
import pyprind
import pandas as pd
import os
pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
df = pd.DataFrame()
for subset in ('test', 'train'):   # renamed to avoid shadowing the built-in `set`
    for label in ('pos', 'neg'):
        path = os.path.join('.', 'aclImdb', subset, label)
        for file in os.listdir(path):
            with open(os.path.join(path, file), encoding='utf-8') as f:
                txt = f.read()
            # Append the review text together with its class label (pos=1, neg=0)
            df = df.append([[txt, labels[label]]], ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']


0%                          100%
[##############################]
Total time elapsed: 00:02:48

In [10]:
import numpy as np
np.random.seed(0)
# Shuffle the rows so positive and negative reviews are mixed
df = df.reindex(np.random.permutation(df.index))
# Save to CSV
df.to_csv('./movie_data.csv', index=False)

In [2]:
import pandas as pd
df = pd.read_csv('./movie_data.csv')
df.head(3)


Out[2]:
review sentiment
0 In 1974, the teenager Martha Moxley (Maggie Gr... 1
1 OK... so... I really like Kris Kristofferson a... 0
2 ***SPOILER*** Do not read this, if you think a... 0

8.2 Introducing the bag-of-words model

  • Bag-of-words (BoW) model (a minimal hand-rolled sketch follows this list):
  1. Create a vocabulary of unique tokens (e.g. words) from the entire set of documents
  2. Construct a feature vector for each document containing the counts of how often each word occurs in it
  • The resulting feature vectors are mostly zeros, i.e. sparse vectors
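
Before turning to scikit-learn, here is a minimal hand-rolled sketch of the two steps above. It is illustrative only; the two sample sentences are assumptions for this example, not part of the IMDb data.

# Illustrative sketch of the BoW steps (not scikit-learn's implementation)
docs = ['the sun is shining', 'the weather is sweet']

# Step 1: build a vocabulary of unique tokens, mapped to column indices
vocab = {token: idx
         for idx, token in enumerate(sorted({t for d in docs for t in d.split()}))}

# Step 2: one feature vector of raw word counts per document
vectors = [[d.split().count(token) for token in sorted(vocab, key=vocab.get)]
           for d in docs]

print(vocab)    # {'is': 0, 'shining': 1, 'sun': 2, 'sweet': 3, 'the': 4, 'weather': 5}
print(vectors)  # [[1, 1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 1]]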

8.2.1 Transforming words into feature vectors

  • Raw term frequencies tf(t, d): the number of times a term t occurs in a document d

In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [4]:
# Print the vocabulary (mapping from each word to its column index)
print(count.vocabulary_)
# Print the raw term-frequency feature vectors
print(bag.toarray())


{'is': 1, 'the': 6, 'sun': 4, 'two': 7, 'one': 2, 'weather': 8, 'and': 0, 'sweet': 5, 'shining': 3}
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]

8.2.2 Assessing word relevancy via term frequency-inverse document frequency

  • TF-IDF (term frequency-inverse document frequency)
  • TF: term frequency, how often a term occurs in a document
  • IDF: inverse document frequency, which downweights terms that appear in many documents

In [5]:
np.set_printoptions(precision=2)

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())


[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]
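
As a quick sanity check, the tf-idf of the word 'is' in the third document can be reproduced by hand. This is a minimal sketch assuming scikit-learn's smooth_idf=True convention, idf(t) = ln((1 + n_d) / (1 + df(t))) + 1; the 0.45 shown above is this raw value after L2-normalizing the whole row.

# Manual tf-idf for the term 'is' in the third document (tf = 3, present in all 3 docs)
import numpy as np

tf_is = 3        # raw term frequency of 'is' in document 3
n_docs = 3       # total number of documents
df_is = 3        # number of documents containing 'is'

idf_is = np.log((1 + n_docs) / (1 + df_is)) + 1    # smooth_idf=True
raw_tfidf_is = tf_is * idf_is
print('raw tf-idf of "is" = %.2f' % raw_tfidf_is)  # 3.00, before L2 normalization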

8.2.3 Cleaning text data


In [6]:
df.loc[0, 'review'][-50:]


Out[6]:
'is seven.<br /><br />Title (Brazil): Not Available'

In [7]:
import re
def preprocessor(text):
    # Remove HTML markup such as <br />
    text = re.sub('<[^>]*>', '', text)
    # Temporarily store emoticons such as :) :( :-P
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove non-word characters, lowercase, and append the emoticons (nose '-' stripped)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

In [8]:
preprocessor(df.loc[0, 'review'][-50:])


Out[8]:
'is seven title brazil not available'

In [10]:
preprocessor("</a>This :) is :( a test :-)!")


Out[10]:
'this is a test :) :( :)'

In [11]:
df['review'] = df['review'].apply(preprocessor)

8.2.4 Processing documents into tokens

  • Tokenization: splitting text into individual tokens
  • Porter stemming algorithm: reduces words to their root (stem) form
  • Stop-word removal: dropping extremely common words such as "is", "and", "has"

In [12]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')


Out[12]:
['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [13]:
# Apply Porter stemming to each whitespace-separated token
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')


Out[13]:
['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [14]:
# Download the NLTK stop-words corpus
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/takanori/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[14]:
True

In [15]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]


Out[15]:
['runner', 'like', 'run', 'run', 'lot']

8.2.5 Training a logistic regression model for document classification


In [16]:
# Split the 50,000 shuffled reviews into 25,000 training and 25,000 test documents.
# .loc slicing is label-inclusive, so iloc is used here to avoid a one-row overlap.
X_train = df.iloc[:25000]['review'].values
y_train = df.iloc[:25000]['sentiment'].values
X_test = df.iloc[25000:]['review'].values
y_test = df.iloc[25000:]['sentiment'].values

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV
    
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
             ]
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)


Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 41.8min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 206.6min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 263.7min finished
Out[23]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'clf__C': [1.0, 10.0, 100.0], 'vect__tokenizer': [<function tokenizer at 0x10883d7b8>, <function tokenizer_porter at 0x10659e598>], 'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'your...n', 'wouldn'], None], 'clf__penalty': ['l1', 'l2'], 'vect__norm': [None], 'vect__use_idf': [False]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=1)

In [24]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)


Best parameter set: {'clf__C': 10.0, 'vect__tokenizer': <function tokenizer at 0x10883d7b8>, 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'clf__penalty': 'l2'} 
CV Accuracy: 0.897

In [25]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))


Test Accuracy: 0.899

8.3 Working with bigger data: online algorithms and out-of-core learning

  • Out-of-core learning: fit the classifier incrementally on small minibatches of documents so the full dataset never has to be held in memory

In [36]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')

def tokenizer(text):
    # Same cleaning as before: strip HTML, keep emoticons, lowercase, then remove stop words
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

In [37]:
def stream_docs(path):
    # Read and yield one (review text, label) pair at a time from the CSV
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip the header line
        for line in csv:
            # each line ends with ',<label>\n', so the penultimate character is the label
            text, label = line[:-3], int(line[-2])
            yield text, label

In [38]:
next(stream_docs(path='./movie_data.csv'))


Out[38]:
('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.<br /><br />Title (Brazil): Not Available"',
 1)

In [39]:
def get_minibatch(doc_stream, size):
    # Collect `size` documents from the stream; return (None, None) once it is exhausted
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

In [40]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless (no in-memory vocabulary), which makes it suitable
# for out-of-core learning; SGDClassifier with loss='log' is logistic regression
# trained incrementally via partial_fit.
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
doc_stream = stream_docs(path='./movie_data.csv')

In [41]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
# Train on 45 minibatches of 1,000 documents each (45,000 reviews in total)
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()


0%                          100%
[##############################]
Total time elapsed: 00:00:34

In [42]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))


Accuracy: 0.867

In [43]:
# Finally, use the last 5,000 documents to update the model as well
clf = clf.partial_fit(X_test, y_test)
