Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.
In [2]:
s = 'colorless'
s[:4]+'u'+s[4:]
Out[2]:
'colourless'
We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.
In [9]:
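print 'dishes'[:-2]       # dish (suffix -es removed)
print 'running'[:-4]      # run (suffix -ning removed)
print 'nationality'[:6]   # nation (suffix -ality removed)
print 'undo'[2:]          # do (prefix un- removed)
print 'preheat'[3:]       # heat (prefix pre- removed)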
We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?
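One way to explore this question is to index just past the left edge. The sketch below reuses the string `s` from the first exercise; the variable names `first`, `went_too_far`, and `clipped` are introduced here only for illustration.

```python
s = 'colorless'

# Negative indices count back from the end, so -len(s) is the first character.
first = s[-len(s)]

# One position further to the left is out of range and raises IndexError.
try:
    s[-len(s) - 1]
    went_too_far = False
except IndexError:
    went_too_far = True

# A slice, unlike a single index, is clipped to the string instead of failing.
clipped = s[-100:3]
```

So an index can go too far to the left, and it fails in the same way as one that goes too far to the right, while an out-of-range slice bound is simply clipped.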
We can specify a "step" size for the slice. The following returns every second character within the slice: monty[6:11:2]. It also works in the reverse direction: monty[10:5:-2] Try these for yourself, then experiment with different step values.
In [21]:
print 'ThisisTest'[::1]
print 'ThisisTest'[::2]
print 'ThisisTest'[3::1]
print 'ThisisTest'[:4:1]
print 'ThisisTest'[::-1]
What happens if you ask the interpreter to evaluate monty[::-1]? Explain why this is a reasonable result.
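A possible way to check the step examples and the reversal is sketched below. It assumes `monty` holds the string 'Monty Python'; the exercise text here does not define it, so treat that value as an assumption.

```python
# Assumption: monty is 'Monty Python' (not defined in the exercise text above).
monty = 'Monty Python'

forward = monty[6:11:2]     # characters at indices 6, 8, 10
backward = monty[10:5:-2]   # characters at indices 10, 8, 6

# With a negative step, an omitted start defaults to the last character and an
# omitted stop to "just before the first", so the whole string is walked
# from right to left.
reversed_monty = monty[::-1]
```

Under that assumption, `monty[::-1]` gives the string reversed. This is a reasonable result because the slice still covers every character and only the direction of travel changes.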
Describe the class of strings matched by the following regular expressions.
- [a-zA-Z]+
- [A-Z][a-z]*
- p[aeiou]{,2}t
- \d+(.\d+)?
- ([^aeiou][aeiou][^aeiou])*
- \w+|[^\w\s]+
Test your answers using nltk.re_show().
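As a rough guide, the expressions above can be read as follows: a run of one or more ASCII letters; an uppercase letter followed by zero or more lowercase letters; `p`, then at most two lowercase vowels, then `t`; a run of digits optionally followed by any single character and more digits (the dot is unescaped as written); zero or more repetitions of non-vowel, vowel, non-vowel, where "non-vowel" means any character other than a lowercase vowel; and either a run of word characters or a run of characters that are neither word characters nor whitespace. The sketch below checks these readings with the standard `re` module instead of `nltk.re_show()`. The helper names `matches` and `is_cvc_sequence` are hypothetical and exist only for this illustration.

```python
import re

def matches(pattern, text):
    """Return every non-overlapping match of pattern in text as a list of strings."""
    return [m.group(0) for m in re.finditer(pattern, text)]

letters = matches(r'[a-zA-Z]+', 'This is test 123')
capitalized = matches(r'[A-Z][a-z]*', 'This IS test 123')
p_vowels_t = matches(r'p[aeiou]{,2}t', 'paat pit pooot pt')
numbers = matches(r'\d+(.\d+)?', '1t23 1 abc')
tokens = matches(r'\w+|[^\w\s]+', '1ab @@@')

def is_cvc_sequence(text):
    """True if the whole text is zero or more non-vowel, vowel, non-vowel triples."""
    return re.match(r'([^aeiou][aeiou][^aeiou])*$', text) is not None
```

Note that the fifth expression also matches the empty string, which is why it is tested here against whole strings instead of being searched for inside a longer text.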
In [1]:
import nltk, re, pprint
In [4]:
nltk.re_show('[a-zA-Z]+','This is test 123')
In [7]:
nltk.re_show('[A-Z][a-z]*','This IS test 123')
In [11]:
nltk.re_show('p[aeiou]{,2}t', 'paat pit pooot 123 abc')
In [14]:
nltk.re_show(r'\d+(.\d+)?','1t23 1 abc')
In [15]:
nltk.re_show('([^aeiou][aeiou][^aeiou])*','hat cat aei')
In [16]:
nltk.re_show(r'\w+|[^\w\s]+','1ab @@@')
Write regular expressions to match the following classes of strings:
In [83]:
import re
re.findall(r'\b(?:an?|the)\b','this is an a the check')
Out[83]:
['an', 'a', 'the']
In [92]:
re.findall(r'\d+[\+\-\/\*]\d+[\+\-\/\*]\d+','Do this 2+93*78 and 5*7+8')
Out[92]:
['2+93*78', '5*7+8']
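The expression in the cell above matches exactly three integers joined by two operators. If expressions of other lengths should also match, one possible generalization is an integer followed by one or more operator-integer pairs. The name `ARITHMETIC` and the sample sentence are made up for this sketch.

```python
import re

# Hypothetical generalization: an integer, then one or more "operator integer" pairs.
ARITHMETIC = r'\d+(?:[+\-*/]\d+)+'

found = re.findall(ARITHMETIC, 'Do this 2+93*78 and 5*7+8 or just 1+2')
```

The group is non-capturing so that `re.findall` returns whole matches and not just the last repeated group.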
Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g. raw_contents = urllib.urlopen('http://www.nltk.org/').read().
In [30]:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
def get_content(url):
    soup = BeautifulSoup(urlopen(url).read())
    return soup.getText()
content = get_content('http://www.nltk.org/')[1000:]
with open('corpus.txt','w') as f:
    f.write(content[:843])
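The cell above depends on Python 2 (`urllib.urlopen`) and on the older `BeautifulSoup` module. For readers on Python 3, a minimal standard-library sketch of the same idea might look like the following. The names `TextExtractor` and `strip_markup` are hypothetical, and the UTF-8 decoding is an assumption about the page encoding, not something the exercise specifies.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_markup(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

def get_content(url):
    """Fetch a URL and return its text with markup removed."""
    raw = urlopen(url).read()
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return strip_markup(raw.decode('utf-8', errors='replace'))
```

For example, `strip_markup('<p>Hello <b>world</b></p>')` yields the text without tags, so the parsing step can be tried on a literal string without any network access.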
Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.
In [35]:
def loadFile(fileName):
    with open(fileName,'r') as f:
        contents = f.read()
    return contents
text = loadFile('corpus.txt')
In [37]:
pattern = r'''(?x) # set verbose flag
[^\w\s]|_ # one punctuation character: neither alphanumeric nor whitespace, or an underscore
'''
nltk.regexp_tokenize(text,pattern)[:3]
Out[37]:
In [52]:
pattern = r'''(?x) # set verbose flag
[A-Z][a-z]+\s[A-Z][a-z]+ # person name: two capitalized words, tried first so the longer match wins
| [A-Z][a-z]+ # organization name: a single capitalized word
'''
nltk.regexp_tokenize(text,pattern)[:5]
Out[52]:
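When a pattern lists alternatives with `|`, Python's `re` module tries them from left to right and keeps the first one that succeeds, so the order of the single-word and two-word name patterns matters. The sketch below illustrates this with `re.findall` on a made-up sentence; the names `single_first`, `pair_first`, and `sample` are introduced only for this example.

```python
import re

single_first = r'''(?x)
    [A-Z][a-z]+                 # a single capitalized word
  | [A-Z][a-z]+\s[A-Z][a-z]+    # two capitalized words (never reached)
'''

pair_first = r'''(?x)
    [A-Z][a-z]+\s[A-Z][a-z]+    # two capitalized words
  | [A-Z][a-z]+                 # a single capitalized word
'''

sample = 'Barack Obama visited Google'  # made-up sample text

split_names = re.findall(single_first, sample)
whole_names = re.findall(pair_first, sample)
```

With the shorter alternative first, the two-word name is split into separate tokens. Listing the longer alternative first keeps it together.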
In [ ]:
pattern = r'''(?x) # set verbose flag
(?:19|20)\d\d[- /.](?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01]) # date, e.g. 2013-05-21
| \$?\d+\.\d+ # currency amount with a decimal part, e.g. $12.50
'''
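A date-or-currency pattern of this kind can be tried out with plain `re.findall` before handing it to `nltk.regexp_tokenize`. The sketch below joins the two sub-patterns with `|`, drops the `^` and `$` anchors so that matches can occur anywhere in the text, and uses non-capturing groups so that whole matches are returned. The name `date_or_money` and the sample sentence are assumptions made for illustration.

```python
import re

date_or_money = r'''(?x)
    (?:19|20)\d\d [- /.] (?:0[1-9]|1[012]) [- /.] (?:0[1-9]|[12][0-9]|3[01])   # date
  | \$?\d+\.\d+                                                                # amount with decimals
'''

sample = 'Paid $12.50 on 2013-05-21 and 3.75 on 1999/12/31.'  # made-up sample text

found = re.findall(date_or_money, sample)
```

In verbose mode, whitespace outside a character class is ignored, but the space inside `[- /.]` is still a literal separator.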
Rewrite the following loop as a list comprehension:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
In [54]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
print [(word, len(word)) for word in sent]