Cleaning data for analysis often requires working with text, for example to correct typos, convert to standard nomenclature, and resolve ambiguous labels. In some statistical fields, such as processing electronic medical records, information science, or recommendations based on user feedback, text must be processed before analysis, for example by converting it to a bag of words.
We will use a whimsical example to illustrate Python tools for munging text data with string methods and regular expressions. Finally, we will see how to format text data for reporting.
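As a rough illustration of the bag-of-words idea: word order is discarded and only word counts are kept. A minimal sketch using collections.Counter on an invented sentence:
In [ ]:
import collections

sentence = 'the cat sat on the mat'          # toy sentence, invented for illustration
bag = collections.Counter(sentence.split())  # word -> count; order is discarded
bag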
In [1]:
import requests
try:
    with open('looking_glass.txt') as f:
        text = f.read()
except IOError:
    # not cached locally: download from Project Gutenberg and save a copy
    url = 'http://www.gutenberg.org/cache/epub/12/pg12.txt'
    res = requests.get(url)
    text = res.text
    with open('looking_glass.txt', 'w') as f:
        f.write(text)
In [2]:
start = text.find('JABBERWOCKY')
In [3]:
text[start:start+2000]
Out[3]:
In [4]:
end = text.find('It seems very pretty', start)
In [5]:
poem = text[start:end]
poem
Out[5]:
In [6]:
print(poem)
In [7]:
print(poem.title())
In [8]:
poem.count('the')
Out[8]:
In [9]:
print(poem.replace('the', 'XXX'))
In [10]:
poem = poem.lower()
In [11]:
import string
string.punctuation
Out[11]:
In [12]:
poem = poem.translate(dict.fromkeys(map(ord, string.punctuation)))  # map punctuation code points to None, deleting them
poem
Out[12]:
In [13]:
words = poem.split()
words[:10]
Out[13]:
In [14]:
def is_palindrome(word):
    return word == word[::-1]
In [15]:
{word for word in words if is_palindrome(word)}
Out[15]:
In [16]:
import collections
In [17]:
poem_counter = collections.Counter(words)
In [18]:
poem_counter.most_common(10)
Out[18]:
In [19]:
[(k, v) for (k, v) in poem_counter.items() if v==2]
Out[19]:
In [20]:
list(zip(words, words[1:], words[2:]))[:10]
Out[20]:
In [21]:
import itertools as it
In [22]:
def window(x, n):
    """Sliding window of size n from sequence x."""
    s = (it.islice(x, i, None) for i in range(n))
    return zip(*s)
In [23]:
list(window(words, 3))[:10]
Out[23]:
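Note that the generator expression in window re-slices x from the beginning, so it assumes x is a sequence that can be traversed repeatedly. A possible variant for one-shot iterators, sketched with it.tee (the name window_iter is our own):
In [ ]:
def window_iter(x, n):
    """Sliding window of size n; also works when x is a one-shot iterator."""
    ts = it.tee(x, n)             # n independent copies of the stream
    for i, t in enumerate(ts):
        for _ in range(i):        # advance the i-th copy by i elements
            next(t, None)
    return zip(*ts)

list(window_iter(iter(words), 3))[:10]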
In [24]:
book = text
In [25]:
book = book.lower().translate(dict.fromkeys(map(ord, string.punctuation)))
In [26]:
book_counter = collections.Counter(book.split())
In [27]:
n = sum(book_counter.values())
book_freqs = {k: v/n for k, v in book_counter.items()}  # relative frequency of each word in the book
In [28]:
n = sum(poem_counter.values())
stats = [(k, v, book_freqs.get(k, 0)*n) for k, v in poem_counter.items()]  # (word, observed, expected count)
In [29]:
from pandas import DataFrame
In [30]:
df = DataFrame(stats, columns = ['word', 'observed', 'expected'])
In [31]:
df['score'] = (df.observed - df.expected)**2/df.expected  # (O - E)^2 / E, the Pearson chi-square term
In [32]:
df = df.sort_values(['score'], ascending=False)
df.head(n=10)
Out[32]:
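Each score is the familiar (O − E)²/E term from Pearson's chi-square statistic; summing over the poem's words aggregates them into a single measure of how unusual the poem's vocabulary is relative to the book:
In [ ]:
df.score.sum()  # aggregate chi-square score over the poem's words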
In [33]:
print(poem)
In [34]:
def encode(text, k):
    """Caesar cipher: shift lowercase letters by k positions, wrapping around."""
    table = dict(zip(map(ord, string.ascii_lowercase),
                     string.ascii_lowercase[k:] + string.ascii_lowercase[:k]))
    return text.translate(table)
In [35]:
cipher = encode(poem, 2)
print(cipher)
In [36]:
recovered = encode(cipher, -2)
print(recovered)
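A quick sanity check that decoding undoes encoding for a sample shift:
In [ ]:
assert encode(encode(poem, 5), -5) == poem  # round trip recovers the original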
In [37]:
import re
In [38]:
# match words containing a letter repeated two or more times in a row;
# group 1 captures the whole word, group 2 the repeated letter
regex = re.compile(r'(\w*(\w)\2+\w*)', re.IGNORECASE | re.MULTILINE)
In [39]:
for match in regex.finditer(poem):
    print(match.group(2), match.group(1))
In [40]:
def f(match):
    word, letter = match.groups()
    return word.replace(letter, letter.upper())

print(regex.sub(f, poem))
If you intend to perform statistical analysis on natural language, you should probably use a library such as NLTK to pre-process the text rather than relying on string methods and regular expressions alone. For example, a simple challenge is to first parse the paragraph below into sentences, then parse each sentence into words. A naive split on sentence-ending punctuation will break on decimal numbers such as 16.91, as we will see.
Paragraph from a random PubMed abstract.
In [41]:
para = """When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of confounding variables. For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, as well as age, obesity, smoking and multiparity. Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups. aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG. Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted."""
In [42]:
sep = re.compile(r'[?!.]')  # naive sentence separator: split at every ?, ! or .
In [43]:
ss = sep.split(para)
In [44]:
for i, s in enumerate(ss, 1):
    print(i, ':', s, end='\n\n')
In [45]:
import nltk
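Depending on your NLTK installation, the sentence tokenizer and part-of-speech tagger used below may require a one-time data download; a sketch:
In [ ]:
# one-time downloads, if the models are not already installed locally
nltk.download('punkt')                       # model used by sent_tokenize / word_tokenize
nltk.download('averaged_perceptron_tagger')  # model used by pos_tag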
In [46]:
ss_nltk = nltk.sent_tokenize(para)
In [47]:
for i, s in enumerate(ss_nltk, 1):
    print(i, ':', s, end='\n\n')
In [48]:
s = ss_nltk[1]
s
Out[48]:
In [49]:
# remove punctuation and split on whitespace
table = dict.fromkeys(map(ord, string.punctuation))
s.translate(table).split()
Out[49]:
In [50]:
text = nltk.word_tokenize(s)
text
Out[50]:
See http://www.nltk.org for details.
In [51]:
tagged_text = nltk.pos_tag(text)
tagged_text
Out[51]:
In [52]:
s
Out[52]:
In [53]:
[w for w, t in tagged_text if t.startswith('N')]
Out[53]:
In [54]:
import math
In [55]:
stuff = ('bun', 'shoe', ['bee', 'door'], 2, math.pi, 0.05)
In [56]:
'One: {}, Two {}'.format(*stuff)
Out[56]:
In [57]:
'One: {0}, Two {1}'.format(*stuff)
Out[57]:
In [58]:
'One: {1}, Two {1}'.format(*stuff)
Out[58]:
In [59]:
'One: {0}, Two {2[1]}'.format(*stuff)
Out[59]:
In [60]:
'One: {0:^10s}, Two {1:_>15s}'.format(*stuff)
Out[60]:
In [61]:
'One: {3}, Two {4}'.format(*stuff)
Out[61]:
In [62]:
'One: {3:+10d}, Two {4:.4f}'.format(*stuff)
Out[62]:
In [63]:
'One: {3:04d}, Two {4:.4g}'.format(*stuff)
Out[63]:
In [64]:
'One: {3:.4e}, Two {4:.4e}'.format(*stuff)
Out[64]:
In [65]:
'One: {5:.2%}, Two {5:f}'.format(*stuff)
Out[65]:
In [66]:
'%s, %s, %a, %d, %.4f, %.2f' % stuff
Out[66]:
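On Python 3.6 or later, the same format specifications can also be used inline in f-string literals; a brief sketch:
In [ ]:
f'One: {stuff[3]:+10d}, Two {stuff[4]:.4f}'  # same specs as the format() examples above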
In [67]:
import numpy as np
In [68]:
x = np.arange(1, 13).reshape(3,4)
x
Out[68]:
In [69]:
np.set_printoptions(formatter={'int': lambda x: '%8.2f' % x})  # print int entries with a float format
In [70]:
x
Out[70]:
In [71]:
np.set_printoptions()  # any call to set_printoptions resets the formatter to the default
In [72]:
x
Out[72]:
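Because set_printoptions changes global state, NumPy 1.15 and later also provide np.printoptions, a context manager that confines the change to a block; a sketch, assuming NumPy >= 1.15:
In [ ]:
with np.printoptions(formatter={'int': lambda v: '%8.2f' % v}):
    print(x)  # custom formatting applies only inside the block
print(x)      # default formatting is restored afterwards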