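70. Obtaining and formatting the data

Using the ground-truth data for sentence polarity analysis, create the labeled dataset (sentiment.txt) as follows.

1. Prepend the string "+1 " to each line of rt-polarity.pos (the polarity label "+1", then a space, then the positive sentence).
2. Prepend the string "-1 " to each line of rt-polarity.neg (the polarity label "-1", then a space, then the negative sentence).
3. Concatenate the results of steps 1 and 2 and shuffle the lines randomly.

After creating sentiment.txt, check the number of positive examples (positive sentences) and negative examples (negative sentences).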



In [ ]:

    
import random

# Prepend the polarity label and a space to every line of each file.
with open('rt-polarity.neg.utf8', 'r') as f:
    negative_list = ['-1 ' + line for line in f]
with open('rt-polarity.pos.utf8', 'r') as f:
    positive_list = ['+1 ' + line for line in f]

# Concatenate both lists and shuffle the lines in place.
concatenate = positive_list + negative_list
random.shuffle(concatenate)

# Lines keep the newline they were read with, so they are written back unchanged.
with open('sentiment.txt', 'w') as f:
    f.write("".join(concatenate))
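
The task also asks for the number of positive and negative examples, which the cell above does not compute. A minimal sketch of one way to count them follows; the helper name count_labels and the sample lines are made up for illustration, and the function only assumes that each line starts with its label followed by a space.

```python
from collections import Counter

def count_labels(lines):
    """Count lines by polarity label (the text before the first space)."""
    return Counter(line.split(' ', 1)[0] for line in lines if line.strip())

# Made-up sample lines in the sentiment.txt format.
sample = ["+1 a charming film .\n", "-1 a tedious mess .\n", "+1 well acted .\n"]
counts = count_labels(sample)
print(counts["+1"], counts["-1"])  # 2 1
```

To count the real data, the open file object can be passed instead of the sample list, since iterating over a file yields its lines.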

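71. Stop words

Create a suitable list of English stop words (a stop list). Then implement a function that returns true if the word (string) given as its argument is in the stop list and false otherwise. Also write tests for that function.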



In [ ]:

    
from nltk.corpus import stopwords
# Requires the NLTK stop word corpus: nltk.download('stopwords')
stopwords_list = stopwords.words('english')
print(stopwords_list)
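
The cell above builds the stop list but leaves out the membership function and its tests. The sketch below shows one possible shape for both. To stay self-contained it uses a small hand-made stop list (an assumption, not the NLTK list), and the name is_stopword is chosen here for illustration.

```python
# A small hand-made stop list; an assumption for illustration only.
STOP_WORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)

def is_stopword(word):
    """Return True if `word` is in the stop list (case-insensitive)."""
    return word.lower() in STOP_WORDS

# Tests
assert is_stopword("the")
assert is_stopword("The")
assert not is_stopword("movie")
assert not is_stopword("")
```

Storing the list as a frozenset keeps the lookup constant-time; the NLTK list from the cell above could be wrapped the same way.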

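72. Feature extraction

Design features that seem useful for polarity analysis and extract them from the training data. A minimal baseline would be the words of each review after removing stop words and stemming each word.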



In [ ]:

    
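from nltk.stem.porter import PorterStemmer

def feature(sentence):
    """Return the label followed by the stemmed tokens of a 'label text' line."""
    porter = PorterStemmer()
    result = []
    label = sentence[0:2]
    # split() without arguments drops empty tokens and the trailing newline
    for s in sentence[3:].split():
        try:
            result.append(porter.stem(s))
        except KeyError:
            # skip tokens for which the stemmer raises KeyError
            pass

    return label + " " + " ".join(result)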



In [ ]:

    
feature("+1 intensely romantic , thought-provoking and even an engaging mystery . ")
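
The baseline described in the task also removes stop words, which the function above does not do. The following sketch combines both steps under stated assumptions: the stop list is a made-up subset, punctuation-only tokens are dropped as an extra choice, and the stemmer is a pluggable argument that defaults to lowercasing so the example runs without NLTK. In practice PorterStemmer().stem could be passed as stem.

```python
# Made-up stop list for illustration; not the NLTK list.
STOP_WORDS = frozenset(["a", "an", "and", "even", "the", "is", "of"])

def extract_features(line, stop_words=STOP_WORDS, stem=str.lower):
    """Split a 'label text' line into (label, list of stemmed content tokens)."""
    label, _, text = line.partition(' ')
    tokens = [
        stem(tok)
        for tok in text.split()
        if tok.lower() not in stop_words          # drop stop words
        and any(ch.isalnum() for ch in tok)       # drop punctuation-only tokens
    ]
    return label, tokens

label, tokens = extract_features(
    "+1 intensely romantic , thought-provoking and even an engaging mystery . ")
print(label, tokens)
```

Returning the label and the token list separately, rather than one joined string, makes it easier to hand the tokens to a vectorizer or classifier later.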

73. Training

Train a logistic regression model using the features extracted in 72.
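
As a rough illustration of what training a logistic regression model involves, here is a dependency-free sketch that fits bag-of-words weights by stochastic gradient descent on the log loss. It is not scikit-learn's LogisticRegression (which adds regularization and uses other solvers); the function names, learning rate, epoch count, and toy data are all assumptions made for this example.

```python
import math

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(examples, labels, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient descent on the log loss.

    examples: list of token lists; labels: list of 1 (positive) or 0 (negative).
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for tokens, y in zip(examples, labels):
            score = bias + sum(weights.get(t, 0.0) for t in tokens)
            error = sigmoid(score) - y  # gradient of the log loss w.r.t. the score
            bias -= lr * error
            for t in tokens:
                weights[t] = weights.get(t, 0.0) - lr * error
    return weights, bias

def predict_proba(weights, bias, tokens):
    """Probability that the token list is a positive example."""
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens))

# Made-up toy data: stemmed tokens, with 1/0 standing in for "+1"/"-1".
examples = [["good", "fun"], ["great"], ["bad", "dull"], ["bore"]]
labels = [1, 1, 0, 0]
weights, bias = train_logreg(examples, labels)
print(predict_proba(weights, bias, ["good"]) > 0.5)  # True
print(predict_proba(weights, bias, ["bad"]) < 0.5)   # True
```

Each token contributes a feature value of 1 per occurrence, so the learned weight of a token can be read directly as its pull toward the positive or negative class.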



In [68]:

    
# Import what is needed to build and train a logistic regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS



In [69]:

    
tfv = TfidfVectorizer(encoding='utf-8', lowercase=True,
                      # pass a list; newer scikit-learn releases validate this parameter's type
                      stop_words=list(ENGLISH_STOP_WORDS),
                      #token_pattern='(?u)\b\w\w+\b',
                      ngram_range=(1, 2))



In [72]:

    
with open('sentiment.txt') as f:
    features = [(s[:2], s[3:]) for s in f]

# make label list
label = [i[0] for i in features]

# make the sentence list: each review text followed by a single space
# (stop words are not removed here; the vectorizer's stop_words setting handles that)
sentence = [text + ' ' for _, text in features]



In [94]:

    
tfv_vector = tfv.fit_transform("".join(sentence).split(' '))

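Note: TfidfVectorizer.fit_transform() expects an iterable of documents, one string per review, not a list of individual words. The call above flattens every review into single tokens, so each token is treated as its own document, which is why the matrix shown further down has 234271 rows. Passing sentence directly would give one row per review instead.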
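
To make the difference between the two input shapes concrete, the pure-Python sketch below contrasts a list with one string per review against the flat token list built by joining and splitting on spaces. The two sample reviews are made up for illustration.

```python
# Two made-up reviews, each followed by a space as in the sentence list above.
sentence = ["a touching and funny film . ", "dull and far too long . "]

# One string per review: a vectorizer would produce one row per review.
documents = sentence

# Flat token list: a vectorizer would treat every token as its own document.
# The trailing space of the last review also leaves one empty string at the end.
tokens = "".join(sentence).split(' ')

print(len(documents))  # 2
print(len(tokens))     # 13
```

With the document-level input, the number of rows in the resulting matrix matches the number of labels, which is what a classifier needs for training.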



In [109]:









    Out[109]:





107312



In [89]:

    
tfv_vector.getH









    Out[89]:





<bound method spmatrix.getH of <234271x20031 sparse matrix of type '<class 'numpy.float64'>'
	with 107312 stored elements in Compressed Sparse Row format>>