Chapter 4: Morphological Analysis

Apply morphological analysis to the text of Natsume Sōseki's novel "Wagahai wa Neko de Aru" (neko.txt) using MeCab, and save the result to a file named neko.txt.mecab.
Using this file, implement programs that answer the problems below.
For problems 37, 38, and 39, matplotlib or Gnuplot is recommended.


In [ ]:
import MeCab

mecab = MeCab.Tagger("")
mecab.parse('')  # prime the tagger; works around a known surface-corruption quirk in mecab-python
with open('neko.txt', 'r') as f:
    neko = f.read()
# Write to neko.txt.mecab, the filename the later problems read from
with open('neko.txt.mecab', 'w') as neko_mecab:
    neko_mecab.write(mecab.parse(neko))

30. Reading the morphological analysis result

Implement a program that reads the morphological analysis result (neko.txt.mecab).
Store each morpheme as a mapping with the keys surface (surface form), base (base form), pos (part of speech), and pos1 (part-of-speech subcategory 1),
and represent each sentence as a list of morphemes (mappings).
Use this program for the remaining problems in Chapter 4.
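Each non-EOS line of neko.txt.mecab follows MeCab's default output layout: the surface form, a tab, then comma-separated features (part of speech first, base form at index 6). A minimal sketch of splitting one such line; the sample line here is illustrative, not taken from the actual file:

```python
# One line of MeCab's default output:
# surface \t pos,pos1,pos2,pos3,inflection-type,inflection-form,base,reading,pronunciation
line = '生れ\t動詞,自立,*,*,一段,連用形,生れる,ウマレ,ウマレ\n'
surface, feature = line.rstrip('\n').split('\t')
fields = feature.split(',')
print(surface)    # 生れ
print(fields[0])  # part of speech: 動詞
print(fields[6])  # base form: 生れる
```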


In [ ]:
import pickle

def analyze_morph(filename):
    sentences = []
    morphs = []
    with open(filename, 'r') as f:
        for line in f:
            if line == 'EOS\n':
                # Close the sentence at EOS rather than at '。', so sentences
                # without a trailing full stop are not silently dropped
                if morphs:
                    sentences.append(morphs)
                    morphs = []
                continue
            surface, result = line.split('\t')
            result = result.split(',')
            morphs.append({'surface': surface, 'base': result[6],
                           'pos': result[0], 'pos1': result[1]})
    return sentences

neko_sample = analyze_morph('neko.txt.mecab')
with open('morph_neko.pickle', 'wb') as f:
    pickle.dump(neko_sample, f, protocol=pickle.HIGHEST_PROTOCOL)

31. Verbs

Extract all surface forms of verbs.


In [ ]:
import pickle

def extract_verb_surface(morph):
    verb_list = []
    for sentence in morph:
        for word in sentence:
            if word['pos'] == '動詞':
                verb_list.append(word['surface'])
    return verb_list

with open('morph_neko.pickle', 'rb') as f:
    neko_sample = pickle.load(f)

# Call the function by its actual name (the original called a
# nonexistent extract_verb)
verbs = extract_verb_surface(neko_sample)

32. Base forms of verbs

Extract all base forms of verbs.


In [ ]:
import pickle

def extract_verb_base(morph):
    return [word['base'] for sentence in morph for word in sentence
            if word['pos'] == '動詞']

with open('morph_neko.pickle', 'rb') as f:
    neko_sample = pickle.load(f)

base_forms = extract_verb_base(neko_sample)

33. Sahen nouns

Extract all nouns of the サ変接続 (sahen-connecting) subcategory.


In [ ]:
import pickle

with open('morph_neko.pickle', 'rb') as f:
    neko_sample = pickle.load(f)

# Use == rather than `in`: `x in ('サ変接続')` is a substring test on a
# bare string, not tuple membership
noun_sahen = [word['base'] for sentence in neko_sample for word in sentence
              if word['pos1'] == 'サ変接続' and word['base'] != '*']

34. 「AのB」

Extract noun phrases in which two nouns are joined by 「の」.
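Before running over the whole corpus, the matching condition (noun, 「の」, noun) can be checked on a tiny hand-built morpheme list; the sample words below are illustrative:

```python
# Toy morpheme list: each entry mimics one mapping from problem 30
morphs = [{'surface': '彼', 'pos': '名詞'},
          {'surface': 'の', 'pos': '助詞'},
          {'surface': '掌', 'pos': '名詞'}]

# range(1, len-1) keeps both neighbours j-1 and j+1 in bounds
phrases = [morphs[j-1]['surface'] + morphs[j]['surface'] + morphs[j+1]['surface']
           for j in range(1, len(morphs) - 1)
           if morphs[j]['surface'] == 'の'
           and morphs[j-1]['pos'] == '名詞'
           and morphs[j+1]['pos'] == '名詞']
print(phrases)  # ['彼の掌']
```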


In [ ]:
import pickle

with open('morph_neko.pickle', 'rb') as f:
    neko_samples = pickle.load(f)

# range(1, len(i) - 1) keeps i[j-1] and i[j+1] in bounds: at j == 0 the
# index j-1 would wrap to the last element, and at the end j+1 would
# raise IndexError. Compare surface with == ('x in ("の")' is a substring test).
noun_phrase = [i[j-1]['surface'] + i[j]['surface'] + i[j+1]['surface']
               for i in neko_samples for j in range(1, len(i) - 1)
               if i[j]['surface'] == 'の'
               and i[j-1]['pos'] == '名詞'
               and i[j+1]['pos'] == '名詞']
noun_phrase

35. Noun sequences

Extract maximal sequences of consecutive nouns (longest match).
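One way to sketch the maximal-run idea on toy data is itertools.groupby, which groups consecutive items sharing a key; the sample morphemes below are illustrative:

```python
from itertools import groupby

# Toy stream of (surface, pos) pairs; extract maximal runs of nouns
morphs = [('彼', '名詞'), ('の', '助詞'),
          ('人間', '名詞'), ('嫌い', '名詞'), ('だ', '助動詞')]

# groupby yields (key, group) for each maximal run of equal keys,
# so consecutive nouns come out as one group
runs = [''.join(surface for surface, _ in group)
        for is_noun, group in groupby(morphs, key=lambda m: m[1] == '名詞')
        if is_noun]
print(runs)  # ['彼', '人間嫌い']
```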


In [ ]:
import pickle

with open('morph_neko.pickle', 'rb') as f:
    neko_samples = pickle.load(f)

# Track the longest run directly. The original keyed a dict by run
# length, so later runs of the same length overwrote earlier ones, and
# a run at the end of a sentence was never flushed.
longest_run = []
for sentence in neko_samples:
    run = []
    for morpheme in sentence:
        if morpheme['pos'] == '名詞':
            run.append(morpheme['surface'])
        else:
            if len(run) > len(longest_run):
                longest_run = run
            run = []
    if len(run) > len(longest_run):  # flush a run that ends the sentence
        longest_run = run
print(''.join(longest_run))

36. Word frequency

Count the frequency of every word appearing in the text, and list the words in descending order of frequency.
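collections.Counter does the counting and the descending sort in one step; a minimal sketch on toy data:

```python
from collections import Counter

# Toy word list; in the real task these are the base forms from problem 30
words = ['猫', 'の', '猫', 'は', 'の', '猫']

# most_common() returns (word, count) pairs, most frequent first
print(Counter(words).most_common())  # [('猫', 3), ('の', 2), ('は', 1)]
```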


In [54]:
import pickle
from collections import Counter

with open('morph_neko.pickle', 'rb') as f:
    neko_samples = pickle.load(f)

words = [word['base'] for sentence in neko_samples for word in sentence
         if word['base'] not in ('。', '、')]
word_count = Counter(words).most_common()  # (word, count) pairs, most frequent first

with open('count_list.pickle', 'wb') as f:
    pickle.dump(word_count, f)

37. Top-10 frequent words

Display the 10 most frequent words and their frequencies in a chart (e.g. a bar chart).


In [45]:
%matplotlib inline
import pickle
import matplotlib.pyplot as plt
import prettyplotlib as ppl

# Load the counts so this cell does not depend on problem 36 having run
with open('count_list.pickle', 'rb') as f:
    word_count = pickle.load(f)

word_list = [i[0] for i in word_count[:10]]
count_list = [i[1] for i in word_count[:10]]

plt.xlim(-0.5, 10)
ppl.bar(range(10), count_list, align='center', alpha=0.8)
plt.xticks(range(10), word_list)
plt.show()


38. Histogram

Draw a histogram of word frequencies (x-axis: frequency; y-axis: number of word types occurring with that frequency, as a bar chart).


In [42]:
%matplotlib inline
import pickle
import matplotlib.pyplot as plt

with open('count_list.pickle', 'rb') as f:
    count_list = pickle.load(f)

# Pass the raw per-word frequencies to hist(); it bins them itself.
# The original computed types-per-frequency first and then histogrammed
# that derived list, which plots the wrong variable.
counts = [i[1] for i in count_list]
plt.figure(figsize=(12, 9))
plt.hist(counts, color='c', bins=239)
plt.xlabel('frequency')
plt.ylabel('number of word types')
plt.show()




39. Zipf's law

Plot a log-log graph with word frequency rank on the x-axis and frequency on the y-axis.


In [59]:
%matplotlib inline
import pickle
import matplotlib.pyplot as plt
import prettyplotlib as ppl

# Load the counts so this cell does not depend on problem 36 having run
with open('count_list.pickle', 'rb') as f:
    word_count = pickle.load(f)

count_list = [i[1] for i in word_count]
ranking = range(1, len(count_list) + 1)
plt.xscale('log')
plt.yscale('log')
ppl.plot(ranking, count_list)


Out[59]:
[<matplotlib.lines.Line2D at 0x136b14f60>]
