Ch3 Processing Raw Text

This chapter answers a few questions:

  1. How do we access text from local files and from the web?
  2. How do we split documents up into individual words and symbols?
  3. How do we produce formatted output and save it to a file?

Note: from this chapter onward, the examples assume the following libraries have already been imported:


In [3]:
from __future__ import division
import nltk, re, pprint

Accessing Text from the Web and from Disk

Electronic Books

Every ebook on Project Gutenberg has a catalog number; once you know the number, you can download the book directly. For example, "Crime and Punishment" is ebook number 2554 and can be fetched as follows.


In [1]:
from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read()
type(raw)


Out[1]:
str

Note: to download through a proxy, use:

proxy = {'http': 'http://www.someproxy.com:3128'}
raw = urlopen(url, proxies=proxy).read()

The downloaded raw is a plain string that still contains a lot of unwanted material, so we apply tokenization to break it up into individual words.


In [4]:
tokens = nltk.word_tokenize(raw)
tokens[:10]


Out[4]:
['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

The resulting tokens can be converted into an nltk Text object for further processing.


In [5]:
text = nltk.Text(tokens)
text


Out[5]:
<Text: The Project Gutenberg EBook of Crime and Punishment...>

In [6]:
text.collocations()


Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market

Dealing with HTML


In [7]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
html[:60]


Out[7]:
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [15]:
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "lxml")

In [21]:
tokens = nltk.word_tokenize(bs.get_text())
tokens[:5]


Out[21]:
[u'BBC', u'NEWS', u'|', u'Health', u'|']

concordance() lists every occurrence of a word together with its surrounding context.


In [22]:
text = nltk.Text(tokens)
text.concordance('gene')


Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 

Strings: Lowest Level Text

Strings are an immutable data type: once a value is set, it cannot be modified. All string methods therefore return a new copy of the string rather than changing the original.
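
A quick illustration: item assignment on a string raises a TypeError, and replace() returns a new string while the original stays unchanged.

s = 'hello'
# s[0] = 'H'  would raise TypeError: 'str' object does not support item assignment
t = s.replace('h', 'H')
print s, t    # hello Hello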

Defining a string with triple quotes (""") lets it span multiple lines, with the line breaks preserved.


In [31]:
s = """hello, my
friend"""
print s


hello, my
friend

str supports the + and * operators: + concatenates two strings, and * repeats a string.


In [32]:
'must' + 'maybe' + 'hello'


Out[32]:
'mustmaybehello'

In [34]:
'bug-' * 5


Out[34]:
'bug-bug-bug-bug-bug-'

s[i] accesses the (i+1)-th character; for a string of length n, i ranges from 0 to n-1.
s[-i] accesses the i-th character counting from the end; for a string of length n, valid negative indices run from -1 to -n.


In [36]:
s = 'hello, world'
s[0], s[1], s[:5], s[-5:]


Out[36]:
('h', 'e', 'hello', 'world')

Other common string methods (a short demo follows the list):

  • s.find(t): index of the first occurrence of t in s, between 0 and n-1; returns -1 if not found
  • s.rfind(t): index of the last occurrence of t in s (searching from the right); returns -1 if not found
  • s.index(t): like s.find(t), but raises ValueError if t is not found
  • s.rindex(t): like s.rfind(t), but raises ValueError if t is not found
  • s.join([a,b,c]): concatenates the strings a, b, and c with s in between, giving asbsc
  • s.split(t): splits s into a list of strings wherever t occurs (whitespace by default)
  • s.splitlines(): splits s into a list of strings at line breaks
  • s.lower(): converts all letters to lowercase
  • s.upper(): converts all letters to uppercase
  • s.title(): capitalizes the first letter of each word
  • s.strip(): removes leading and trailing whitespace
  • s.replace(t, u): replaces every occurrence of t in s with u
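
A few of these in action:

s = '  Monty Python  '
print s.strip().lower()           # monty python
print s.find('Python')            # 8
print '-'.join(['a', 'b', 'c'])   # a-b-c
print 'a-b-c'.split('-')          # ['a', 'b', 'c']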

Text Processing with Unicode

The file below is a sample of Polish text encoded in Latin-2; codecs.open decodes it into Unicode as it is read.


In [44]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [60]:
import codecs
f = codecs.open(path, encoding='latin2')
for line in f:
    print line, line.encode('utf-8'), line.encode('unicode_escape'), '\n'


Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
Pruska Biblioteka Pa\u0144stwowa. Jej dawne zbiory znane pod nazw\u0105\n 

"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez\n 

Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n 

odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
odnalezione po 1945 r. na terytorium Polski. Trafi\u0142y do Biblioteki\n 

Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
Jagiello\u0144skiej w Krakowie, obejmuj\u0105 ponad 500 tys. zabytkowych\n 

archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
archiwali\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.\n 


In [65]:
# ord() returns the Unicode code point of a character
print ord(u'許'), ord(u'洪'), ord(u'蓋')


35377 27946 33995

In [66]:
print repr(u'許洪蓋')


u'\u8a31\u6d2a\u84cb'
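
Going the other way, encode() turns a unicode string into a byte string and decode() turns bytes back into unicode:

u = u'\u8a31\u6d2a\u84cb'
b = u.encode('utf-8')
print len(u), len(b)            # 3 9  (each of these characters needs 3 bytes in UTF-8)
print b.decode('utf-8') == u    # True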

The unicodedata module can look up the official Unicode name of each character:


In [67]:
import unicodedata
lines = codecs.open(path, encoding='latin2').readlines()
line = lines[2]
print line.encode('unicode_escape')


Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n

In [69]:
for c in line:
    if ord(c) > 127:
        print '%s U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))


ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą U+0105 LATIN SMALL LETTER A WITH OGONEK
ł U+0142 LATIN SMALL LETTER L WITH STROKE

Regular Expressions

Common metacharacters (a few worked examples follow the list):

  • .: matches any single character
  • ^abc: the string starts with abc
  • abc$: the string ends with abc
  • [abc]: matches any one of the characters listed inside the brackets, e.g. a, b, or c
  • [A-Z0-9]: matches any character in the given ranges, e.g. A to Z or 0 to 9
  • ed|ing|s: matches any one of the alternative strings
  • *: zero or more repetitions of the previous item
  • +: one or more repetitions of the previous item
  • ?: zero or one occurrence of the previous item
  • {n}: exactly n repetitions of the previous item
  • {n,}: at least n repetitions of the previous item
  • {,n}: at most n repetitions of the previous item
  • {m,n}: at least m and at most n repetitions of the previous item
  • a(b|c)+: parentheses delimit the scope of the | operator
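
A few worked examples of the quantifiers and character classes above (the word list is just an illustration):

import re
words = ['color', 'colour', 'colouur', 'A1', 'zz']
print [w for w in words if re.search('^colou?r$', w)]    # ['color', 'colour']
print [w for w in words if re.search('^colou{2}r$', w)]  # ['colouur']
print [w for w in words if re.search('^[A-Z0-9]+$', w)]  # ['A1']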

In [70]:
import re

In [77]:
# re.search returns an _sre.SRE_Match object
# converting it to bool tells us whether a match was found
bool(re.search('ed$', 'played')), bool(re.search('ed$', 'happy'))


Out[77]:
(True, False)
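
Instead of converting to bool, the match object can be inspected directly for what matched and where:

m = re.search('ed$', 'played')
print m.group(), m.start(), m.end()   # ed 4 6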

In [98]:
# re.findall extracts substrings; the part extracted is determined by the () group
re.findall('\[\[(.+?)[\]|\|]+', 'My Name is [[Bany]] Hung, this is my [[Dog|Animal]]')


Out[98]:
['Bany', 'Dog']

In [97]:
# re.sub replaces every match with the given string
re.sub('\[\[.+?[\]|\|]+', '###', 'My Name is [[Bany]] Hung, this is my [[Dog|Animal]]')


Out[97]:
'My Name is ### Hung, this is my ###Animal]]'

Normalizing Text

Normalization means converting text to a uniform format, for example lowercasing everything or stripping words down to their stems.


In [105]:
porter = nltk.PorterStemmer()
lanc = nltk.LancasterStemmer()
w = ['I','was','playing','television','in','the','painted','garden']
[(a, porter.stem(a), lanc.stem(a)) for a in w]


Out[105]:
[('I', u'I', 'i'),
 ('was', u'wa', 'was'),
 ('playing', u'play', 'play'),
 ('television', u'televis', u'televid'),
 ('in', u'in', 'in'),
 ('the', u'the', 'the'),
 ('painted', u'paint', 'paint'),
 ('garden', u'garden', 'gard')]

To search over stemmed text, we first build an index: nltk.Index maps each stem to the positions where it occurs (essentially a dict of lists), which makes lookups convenient.


In [120]:
class IndexedText(object):
    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # map each stem to the list of positions where it occurs
        self._index = nltk.Index((self._stem(word), i) for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '%*s'  % (width, lcontext[-width:])   # right-justify the left context
            rdisplay = '%-*s' % (width, rcontext[:width])    # left-justify the right context
            print ldisplay, rdisplay

    def _stem(self, word):
        # normalize a word to its lowercased stem
        return self._stemmer.stem(word).lower()

In [121]:
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')


r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t

Lemmatization

Unlike a stemmer, a lemmatizer only removes affixes when the result is an actual dictionary word. WordNetLemmatizer takes an optional part-of-speech argument: 'v' for verb, 'n' for noun (the default).


In [132]:
s = ['women', 'are', 'living']
wnl = nltk.WordNetLemmatizer()
zip(s, [wnl.lemmatize(t, 'v') for t in s], [wnl.lemmatize(t, 'n') for t in s])


Out[132]:
[('women', 'women', u'woman'),
 ('are', u'be', 'are'),
 ('living', u'live', 'living')]

Formatting: From Lists to Strings


In [135]:
s = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
' '.join(s)


Out[135]:
'We called him Tortoise because he taught us .'

In [137]:
';'.join(s)


Out[137]:
'We;called;him;Tortoise;because;he;taught;us;.'
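
To round off the chapter's third question (producing formatted output and saving it to a file), here is a minimal sketch reusing the word list s from above; output.txt is just an illustrative filename:

output_file = open('output.txt', 'w')
for word in s:
    output_file.write('%-10s\n' % word)   # left-justify each word in a 10-character field
output_file.close()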