Chapter 6: Processing English Text

Perform the following operations on the English text nlp.txt.


In [25]:
import pickle

50. Sentence segmentation

Regard the pattern (. or ; or : or ? or !) → whitespace → uppercase letter as a sentence boundary, and output the input document with one sentence per line.


In [49]:
def separate_sentences(filename):
    # A sentence boundary is (. ; : ? !) followed by a space and an uppercase letter.
    checks = ['. ', '; ', ': ', '? ', '! ']  # ': ' added to match the problem statement
    for sentences in open(filename, 'r'):
        sentences = sentences.strip()
        sep_flag = False
        sep_index = 0
        check_sign = ''
        for i in checks:
            if i in sentences:
                # Remember this separator and whether its first occurrence is
                # followed by an uppercase letter (the last match in checks wins).
                check_sign = i
                sep_index = sentences.find(i)
                sep_flag = sentences[sep_index+2].isupper()

        if sep_flag:
            # Split on every occurrence of the chosen separator and restore
            # the punctuation character that split() removed.
            split_sentences = sentences.split(check_sign)
            for s in split_sentences:
                if not s.endswith(check_sign[0]):
                    print(s+check_sign[0]+'\n')
                else:
                    print(s)
        else:
            print(sentences)

separate_sentences('nlp.txt')


Natural language processing
From Wikipedia, the free encyclopedia

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.

As such, NLP is related to the area of humani-computer interaction.

Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

History

The history of NLP generally starts in the 1950s, although work can be found from earlier periods.

In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.

The authors claimed that within three or five years, machine translation would be a solved problem.

However, real progress was much slower, and after the ALPAC report in 1966, which found that ten year long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.

Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 to 1966.

Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction.

When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".

During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data.

Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981).

During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.

Up to the 1980s, most NLP systems were based on complex sets of hand-written rules.

Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing.

This was due to both the steady increase in computational power resulting from Moore's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g.

transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.

Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.

However, Part of speech tagging introduced the use of Hidden Markov Models to NLP, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.

The cache language models upon which many speech recognition systems now rely are examples of such statistical models.

Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed.

These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.

However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems.

As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms.

Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data.

Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.

However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.

NLP using machine learning

Modern NLP algorithms are based on machine learning, especially statistical machine learning.

The paradigm of machine learning is different from that of most prior attempts at language processing.

Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules.

The machine-learning paradigm calls instead for using general learning algorithms - often, although not always, grounded in statistical inference - to automatically learn such rules through the analysis of large corpora of typical real-world examples.

A corpus (plural, "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.

Many different classes of machine learning algorithms have been applied to NLP tasks.

These algorithms take as input a large set of "features" that are generated from the input data.

Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common.

Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature.

Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.

Systems based on machine-learning algorithms have many advantages over hand-produced rules:

The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not obvious at all where the effort should be directed.
Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with hand-written rules -- or more generally, creating systems of hand-written rules that make soft decisions -- extremely difficult, error-prone and time-consuming.
Systems based on automatically learning the rules can be made more accurate simply by supplying more input data.

However, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task.

In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable.

However, creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.
The subfield of NLP devoted to learning approaches is known as Natural Language Learning (NLL) and its conference CoNLL and peak body SIGNLL are sponsored by ACL, recognizing also their links with Computational Linguistics and Language Acquisition.

When the aims of computational language learning research is to understand more about human language acquisition, or psycholinguistics, NLL overlaps into the related field of Computational Psycholinguistics.


In [74]:
l_sentences = []
def separate_sentences(filename):
    # Same boundary rule as above, but collect the sentences instead of printing them.
    checks = ['. ', '; ', ': ', '? ', '! ']  # ': ' added to match the problem statement
    for sentences in open(filename, 'r'):
        sentences = sentences.strip()
        sep_flag = False
        sep_index = 0
        check_sign = ''
        for i in checks:
            if i in sentences:
                check_sign = i
                sep_index = sentences.find(i)
                sep_flag = sentences[sep_index+2].isupper()

        if sep_flag:
            split_sentences = sentences.split(check_sign)
            for s in split_sentences:
                # Restore the punctuation removed by split() and terminate every
                # sentence with a newline, so the pickled text is one sentence per line.
                if not s.endswith(check_sign[0]):
                    s += check_sign[0]
                l_sentences.append(s + '\n')
        else:
            l_sentences.append(sentences + '\n')

separate_sentences('nlp.txt')
with open('nlp.pickle', 'wb') as f:
    pickle.dump("".join(l_sentences), f)

51. Extracting words

Treat whitespace as word boundaries, take the output of problem 50 as input, and output one word per line. Output a blank line at the end of each sentence.


In [47]:
with open('nlp.pickle', 'rb') as f:
    sentences = pickle.load(f)

# The pickled text has one sentence per line; an empty string in `words`
# marks the end of a sentence.
words = []
for line in sentences.splitlines():
    words.extend(w for w in line.split(' ') if w)
    words.append('')

with open('nlp_words.pickle', 'wb') as f:
    pickle.dump(words, f)

with open('nlp_words.txt', 'w') as f:
    for word in words:
        f.write(word + '\n')  # the empty markers become the required blank lines

52. Stemming

Take the output of problem 51 as input, apply Porter's stemming algorithm, and
output each word and its stem in tab-separated format.
In Python, the stemming module is a convenient implementation of Porter's stemming algorithm.

  • ### Stemming
    The process of treating words that refer to the same topic as a single feature,
    grouping derived forms and the like together.
    e.g. "run", "runs", "ran", and also "runner".

  • ### Porter's stemmer: a stemming method for English. It consists of many rules.
    Example rules (a short demo follows this list):

    • Remove a trailing "ed" (walked → walk)
    • Remove a trailing "ate" (passionate → passion)
    • Remove a trailing "ational" (operational → oper)
  • "operational" and "operate" can both be treated as the same stem "oper",
    but a stem can also get mangled (hundred → hundr), or the meaning can change,
    as in (international → intern).
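
As a quick check of the rules above, NLTK's PorterStemmer (also used in the cell below) can be run on these examples directly. This is only an illustrative sketch; exact outputs may vary slightly between NLTK versions:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
# Examples taken from the notes above.
for w in ['walked', 'passionate', 'operational', 'international', 'hundred']:
    print(w, '->', porter.stem(w))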


In [40]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
with open('nlp_words.pickle', 'rb') as f:
    words = pickle.load(f)

stem_words = []
for word in words:
    if not word:
        continue                    # skip the sentence-boundary markers
    stem = porter.stem(word)
    stem_words.append(word + '\t' + stem + '\n')   # word and stem, tab-separated

with open('nlp_stems.pickle', 'wb') as f:
    pickle.dump(stem_words, f)

with open('nlp_stems.txt', 'w') as f:
    f.write("".join(stem_words))

53. Tokenization

Use Stanford CoreNLP to obtain the analysis of the input text in XML format.
Then read this XML file and output the input text with one word per line.


In [41]:
%%bash
cd stanford-corenlp-full-2015-04-20/
sh corenlp.sh -file ../nlp.txt
cp nlp.txt.xml ../


java -mx3g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file ../nlp.txt
Searching for resource: StanfordCoreNLP.properties
Searching for resource: edu/stanford/nlp/pipeline/StanfordCoreNLP.properties
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.9 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [9.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [5.1 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [6.8 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.9 sec].
Adding annotator dcoref

Ready to process: 1 files, skipped 0, total 1
Processing file /Volumes/share/研究/修士/2016/urushi/notebook/knocks/06/stanford-corenlp-full-2015-04-20/../nlp.txt ... writing to /Volumes/share/研究/修士/2016/urushi/notebook/knocks/06/stanford-corenlp-full-2015-04-20/nlp.txt.xml {
  Annotating file /Volumes/share/研究/修士/2016/urushi/notebook/knocks/06/stanford-corenlp-full-2015-04-20/../nlp.txt [17.832 seconds]
} [18.676 seconds]
Processed 1 documents
Skipped 0 documents, error annotating 0 documents
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.2 sec.
MorphaAnnotator: 0.1 sec.
NERCombinerAnnotator: 4.3 sec.
ParserAnnotator: 10.3 sec.
DeterministicCorefAnnotator: 2.8 sec.
TOTAL: 17.8 sec. for 1452 tokens at 81.4 tokens/sec.
Pipeline setup: 0.1 sec.
Total time for StanfordCoreNLP pipeline: 18.8 sec.
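
The cell above only produces nlp.txt.xml; the second half of problem 53 (one word per line) still needs the XML to be read back. A minimal sketch, assuming the <token>/<word> layout that the later cells also rely on:

import xml.etree.ElementTree as etree

# Print the tokenized text one word per line (second half of problem 53).
root = etree.parse('nlp.txt.xml').getroot()
for token in root.iter('token'):
    print(token.find('word').text)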

54. POS tagging

Read the Stanford CoreNLP analysis XML and output each word, its lemma, and its part of speech in tab-separated format.


In [29]:
import xml.etree.ElementTree as etree

tree = etree.parse('nlp.txt.xml')
root = tree.getroot()
temp = []
for token in root.iter('token'):
    # word, lemma and POS tag for every token, tab-separated
    temp.append(token.find('word').text+'\t'+
                token.find('lemma').text+'\t'+
                token.find('POS').text+'\n')
with open('nlp_tag.pickle', 'wb') as f:
    pickle.dump("".join(temp), f)
with open('nlp_tag.txt', 'w') as f:
    f.write("".join(temp))

55. Named entity extraction

Extract all person names from the input text.


In [38]:
with open('nlp_tag.pickle', 'rb') as f:
    tokens = pickle.load(f)

# Approximation: print every token tagged NNP (proper noun).  This also picks up
# non-person entities; an NER-based variant is sketched after the output below.
for token in tokens.split('\n'):
    split_token = token.split('\t')
    if split_token[-1] == 'NNP':
        print(split_token[0])


Wikipedia
NLP
Alan
Turing
Computing
Machinery
Intelligence
Georgetown
English
ALPAC
SHRDLU
ELIZA
Joseph
Weizenbaum
ELIZA
MARGIE
Schank
SAM
Cullingford
Wilensky
TaleSpin
Meehan
Lehnert
Carbonell
Lehnert
PARRY
Racter
Jabberwacky
Moore
Chomskyan
Hidden
Markov
NLP
IBM
Research
Parliament
Canada
European
Union
World
Modern
NLP
Automatic
NLP
Learning
NLL
SIGNLL
Language
Acquisition
Computational
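
The list above contains every NNP token, which is why non-person entities such as Wikipedia, IBM, and ALPAC appear. Since the pipeline above also ran the ner annotator, a closer match to "person names" is to filter on each token's NER label instead; a minimal sketch, assuming the usual <NER> element in the XML:

import xml.etree.ElementTree as etree

# Print only tokens labelled PERSON by the CoreNLP named entity recognizer.
root = etree.parse('nlp.txt.xml').getroot()
for token in root.iter('token'):
    if token.find('NER').text == 'PERSON':
        print(token.find('word').text)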

56. Coreference resolution

Based on the coreference resolution output of Stanford CoreNLP,
replace each mention in the text with its representative mention.
When replacing, keep the original mention recognizable, e.g. in the form "representative mention (mention)".


In [80]:
import xml.etree.ElementTree as etree

tree = etree.parse('nlp.txt.xml')
root = tree.getroot()
temp = []

for mention in root.iter('mention'):
    # A mention carrying the representative="true" attribute is the
    # representative mention of its coreference chain; the chain's other
    # mentions are paired with it.
    if mention.attrib != {}:
        representative = mention.find('text').text
    else:
        temp.append(mention.find('text').text + '->' + representative)
# temp now holds "mention->representative" pairs.
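
The cell above only collects mention → representative pairs; the task itself asks for the running text with each mention rewritten as "representative mention (mention)". A rough sketch of that replacement, using the mention spans (<sentence>, <start>, <end>) and assuming <end> is exclusive, as is usual in CoreNLP's XML:

import xml.etree.ElementTree as etree

root = etree.parse('nlp.txt.xml').getroot()

# (sentence id, start token id) -> (end token id, representative mention text)
spans = {}
for chain in root.iter('coreference'):
    mentions = chain.findall('mention')
    if not mentions:
        continue            # the outer <coreference> container has no direct mentions
    rep = next((m for m in mentions if m.get('representative') == 'true'), mentions[0])
    rep_text = rep.find('text').text
    for m in mentions:
        if m is rep:
            continue
        key = (m.find('sentence').text, m.find('start').text)
        spans[key] = (int(m.find('end').text), rep_text)

for sentence in root.iter('sentence'):
    tokens = sentence.find('tokens')
    if tokens is None:      # <sentence> elements inside <mention> are just indices
        continue
    out, close_at = [], None
    for token in tokens:
        key = (sentence.get('id'), token.get('id'))
        if key in spans:
            end, rep_text = spans[key]
            out.append(rep_text + ' (')
            close_at = end
        out.append(token.find('word').text)
        if close_at is not None and int(token.get('id')) == close_at - 1:
            out.append(')')
            close_at = None
    print(' '.join(out))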

57. Dependency parsing

Visualize the Stanford CoreNLP dependency parse (collapsed-dependencies) as a directed graph.
For the visualization, convert the dependency tree into the DOT language and use Graphviz.
To visualize directed graphs directly from Python, pydot is convenient.


In [1]:
import xml.etree.ElementTree as etree
import pygraphviz as pgv
tree = etree.parse('nlp.txt.xml')
root = tree.getroot()
# Collect one list of (governor, dependent) pairs per sentence from the
# collapsed-dependencies section of the XML.
collapse_list = []
for dependencies in root.iter('dependencies'):
    if dependencies.attrib['type'] == 'collapsed-dependencies':
        collapse = []
        for dep in dependencies:
            collapse.append((dep.find('governor').text, dep.find('dependent').text))
        collapse_list.append(collapse)

# Draw a directed graph per sentence with Graphviz.
for i, sentence in enumerate(collapse_list):
    g = pgv.AGraph(directed=True, overlap='false', splines='true')
    for governor, dependent in sentence:
        if governor == 'ROOT':
            g.add_node(dependent)          # the root word has no incoming edge
        else:
            g.add_edge(governor, dependent)
    g.layout()
    g.draw('./Untitled Folder/'+str(i+1)+'.png')
    del g
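
The task statement also mentions converting the tree to the DOT language. The same (governor, dependent) pairs can be written out as plain DOT and rendered with the Graphviz command-line tools; a small sketch for the first sentence, reusing collapse_list from the cell above (sentence1.dot is just a hypothetical file name):

# Render afterwards with, e.g., `dot -Tpng sentence1.dot -o sentence1.png`.
with open('sentence1.dot', 'w') as f:
    f.write('digraph sentence1 {\n')
    for governor, dependent in collapse_list[0]:
        if governor == 'ROOT':
            f.write('  "%s";\n' % dependent)
        else:
            f.write('  "%s" -> "%s";\n' % (governor, dependent))
    f.write('}\n')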

58. Extracting tuples

Based on the Stanford CoreNLP dependency parse (collapsed-dependencies),
output "subject predicate object" triples in tab-separated format.
Use the following definitions of subject, predicate and object:

  • Subject: a child (dependent) of the predicate in the nsubj relation
  • Predicate: a word that has children (dependents) in both the nsubj and dobj relations
  • Object: a child (dependent) of the predicate in the dobj relation

In [4]:
from collections import defaultdict

def choice_svo(xml_name):
    # Scan the raw XML line by line and collect the nsubj/dobj relations
    # inside each <dependencies type="collapsed-dependencies"> block.
    tag_collapsed = "<dependencies type=\"collapsed-dependencies\">"
    tag_collapsed_end = "</dependencies>"
    collapsed_flag = False

    tag_governor = "</governor>"
    tag_dependent = "</dependent>"

    tag_dep = "<dep type=\""
    dep_type_flag = False
    dep_type = ''
    # dep_dict["nsubj"][governor] -> subjects, dep_dict["dobj"][governor] -> objects
    dep_dict = defaultdict(lambda: defaultdict(list))

    for line in open(xml_name):
        if tag_collapsed in line:
            collapsed_flag = True
            dep_dict = defaultdict(lambda: defaultdict(list))
        if collapsed_flag:
            if tag_collapsed_end in line:
                # End of this sentence's dependency block: emit its SVO triples.
                make_svo(dep_dict)
                collapsed_flag = False

            if tag_dep in line:
                dep_type = line.replace(tag_dep, '').strip().split("\"")[0]
                dep_type_flag = dep_type in ("nsubj", "dobj")
            if dep_type_flag:
                if tag_governor in line:
                    governor = line.replace(tag_governor, '').strip().split(">")[1]
                elif tag_dependent in line:
                    dependent = line.replace(tag_dependent, '').strip().split(">")[1]
                    dep_dict[dep_type][governor].append(dependent)


def make_svo(dep_dict):
    # A predicate is a governor that has both an nsubj and a dobj dependent.
    predicate_list = [g for g in dep_dict["nsubj"] if g in dep_dict["dobj"]]
    for predicate in predicate_list:
        print("\t".join((",".join(dep_dict["nsubj"][predicate]),
                         predicate,
                         ",".join(dep_dict["dobj"][predicate]))))


if __name__ == "__main__":
    xml_name = "nlp.txt.xml"
    choice_svo(xml_name)


challenges,others	involve	generation
understanding	enabling	computers
Turing	published	article
experiment	involved	translation
ELIZA	provided	interaction
patient	exceeded	base
ELIZA	provide	response
which	structured	information
underpinnings	discouraged	sort
that	underlies	approach
Some	produced	systems
which	make	decisions
systems	rely	which
that	contains	errors
implementations	involved	coding
algorithms	take	set
Some	produced	systems
which	make	decisions
they	express	certainty
models	have	advantage
Systems	have	advantages
Automatic	make	use
that	make	decisions
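
For comparison, the same triples can be extracted without scanning the raw XML text, by walking the collapsed-dependencies elements with ElementTree as in the earlier problems; a minimal sketch:

import xml.etree.ElementTree as etree

root = etree.parse('nlp.txt.xml').getroot()
for deps in root.iter('dependencies'):
    if deps.get('type') != 'collapsed-dependencies':
        continue
    nsubj, dobj = {}, {}
    for dep in deps:
        gov = dep.find('governor').text
        dependent = dep.find('dependent').text
        if dep.get('type') == 'nsubj':
            nsubj.setdefault(gov, []).append(dependent)
        elif dep.get('type') == 'dobj':
            dobj.setdefault(gov, []).append(dependent)
    # A predicate has both an nsubj and a dobj dependent.
    for predicate in nsubj:
        if predicate in dobj:
            print('\t'.join((','.join(nsubj[predicate]),
                             predicate,
                             ','.join(dobj[predicate]))))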

59. Parsing S-expressions

Read the phrase structure parse (S-expressions) produced by Stanford CoreNLP
and display every noun phrase (NP) in the text.
Nested noun phrases must be displayed as well.


In [ ]:
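
The cell above was left empty. As a starting point, here is a rough, untested sketch: read each <parse> element from nlp.txt.xml and scan its S-expression with a simple stack, printing the words of every NP constituent, nested ones included.

import xml.etree.ElementTree as etree

def noun_phrases(sexp):
    # Yield the word sequence of every NP constituent in a Penn-style S-expression.
    tokens = sexp.replace('(', ' ( ').replace(')', ' ) ').split()
    labels = []   # constituent label per open bracket, e.g. 'NP'
    stack = []    # words collected so far per open bracket
    i = 0
    while i < len(tokens):
        if tokens[i] == '(':
            labels.append(tokens[i + 1])   # the label follows the bracket
            stack.append([])
            i += 2
        elif tokens[i] == ')':
            words = stack.pop()
            if labels.pop() == 'NP':
                yield ' '.join(words)
            if stack:
                stack[-1].extend(words)    # words also belong to the parent constituent
            i += 1
        else:
            stack[-1].append(tokens[i])    # a terminal word
            i += 1

root = etree.parse('nlp.txt.xml').getroot()
for parse in root.iter('parse'):
    for np in noun_phrases(parse.text):
        print(np)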