Exercise from http://www.nltk.org/book_1ed/ch03.html

Author : Nirmal kumar Ravi

Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.


In [2]:
s = 'colorless'
s[:4]+'u'+s[4:]


Out[2]:
'colourless'
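
The same slice-and-concatenate step can be wrapped in a small helper. This is only a sketch: insert_at is a hypothetical name chosen for illustration, not something from the NLTK book.

```python
def insert_at(s, index, piece):
    """Return s with piece inserted before position index (hypothetical helper)."""
    # s[:index] is everything before the insertion point, s[index:] is the rest.
    return s[:index] + piece + s[index:]

print(insert_at('colorless', 4, 'u'))  # colourless
```

Strings are immutable, so the helper builds a new string rather than modifying s.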

We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.


In [9]:
print 'dishes'[:-2]
print 'running'[:-4]
print 'nationality'[:6]
print 'undo'[2:]
print 'preheat'[3:]


dish
run
nation
do
heat
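
A suffix is removed by keeping the left part of the word, and a prefix by keeping the right part. The sketch below assumes two hypothetical helpers, with the affix lengths read off the hyphenated forms in the exercise.

```python
def strip_suffix(word, n):
    """Drop the last n characters (hypothetical helper)."""
    # len(word) - n is used instead of -n so that n == 0 returns the whole word.
    return word[:len(word) - n]

def strip_prefix(word, n):
    """Drop the first n characters (hypothetical helper)."""
    return word[n:]

print(strip_suffix('dishes', 2))       # dish
print(strip_suffix('running', 4))      # run
print(strip_suffix('nationality', 5))  # nation
print(strip_prefix('undo', 2))         # do
print(strip_prefix('preheat', 3))      # heat
```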

We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

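  • Yes, it is possible. A negative index whose magnitude exceeds the length of the string goes past the start and raises IndexError, e.g. 'Python'[-7] (the string has six characters, so the valid negative indices are -1 to -6).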
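
A quick way to check this, sketched with a hypothetical helper index_or_error that reports the exception instead of stopping the session:

```python
def index_or_error(s, i):
    """Return s[i], or the text 'IndexError' if i is out of range (hypothetical helper)."""
    try:
        return s[i]
    except IndexError:
        return 'IndexError'

word = 'Python'
print(index_or_error(word, -6))  # P, the leftmost valid negative index
print(index_or_error(word, -7))  # IndexError, one step too far to the left
print(word[-7:2])                # Py, slices clamp out-of-range bounds instead of raising
```

Only indexing raises the error; a slice with the same out-of-range bound is silently clamped to the start of the string.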

We can specify a "step" size for the slice. The following returns every second character within the slice: monty[6:11:2]. It also works in the reverse direction: monty[10:5:-2]. Try these for yourself, then experiment with different step values.


In [21]:
print 'ThisisTest'[::1]
print 'ThisisTest'[::2]
print 'ThisisTest'[3::1]
print 'ThisisTest'[:4:1]
print 'ThisisTest'[::-1]


ThisisTest
TiiTs
sisTest
This
tseTsisihT
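
The two slices quoted in the exercise can be tried directly. The sketch assumes monty = 'Monty Python', which is not defined anywhere in these notes.

```python
monty = 'Monty Python'  # assumed value; not defined in these notes

print(monty[6:11:2])   # Pto: indices 6, 8, 10
print(monty[10:5:-2])  # otP: indices 10, 8, 6
print(monty[::3])      # MtPh: indices 0, 3, 6, 9
```

With a negative step the start must lie to the right of the stop, otherwise the slice is empty.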

What happens if you ask the interpreter to evaluate monty[::-1]? Explain why this is a reasonable result.

  • It returns the string reversed. The step -1 walks through the string from right to left, and because the start and stop are omitted they default to the last character and to just before the first, so every character is included.
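
The defaults that the omitted bounds resolve to can be inspected with slice.indices(). A small sketch, again assuming monty = 'Monty Python':

```python
monty = 'Monty Python'  # assumed sample string

# slice.indices() returns the concrete (start, stop, step) for a sequence of this length.
print(slice(None, None, -1).indices(len(monty)))  # (11, -1, -1)

# Walking those positions with range() visits every index from the last to the first.
by_hand = ''.join(monty[i] for i in range(11, -1, -1))
print(by_hand == monty[::-1])  # True
print(monty[::-1])             # nohtyP ytnoM
```

The stop value -1 in that tuple means "just before index 0" for range(); it is not the same as writing monty[11:-1:-1], where -1 would be read as the last character.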

Describe the class of strings matched by the following regular expressions.

  • [a-zA-Z]+
  • [A-Z][a-z]*
  • p[aeiou]{,2}t
  • \d+(.\d+)?
  • ([^aeiou][aeiou][^aeiou])*
  • \w+|[^\w\s]+

Test your answers using nltk.re_show().

In [1]:
import nltk, re, pprint

  • [a-zA-Z]+ matches one or more consecutive ASCII letters, upper or lower case.

In [4]:
nltk.re_show('[a-zA-Z]+','This is test 123')


{This} {is} {test} 123

  • [A-Z][a-z]* matches one uppercase ASCII letter followed by zero or more lowercase letters.

In [7]:
nltk.re_show('[A-Z][a-z]*','This IS test 123')


{This} {I}{S} test 123

  • p[aeiou]{,2}t matches p, then at most two lowercase vowels, then t (pt, pit and paat match, pooot does not). It is not anchored to word boundaries, so it can also match inside a longer word.

In [11]:
nltk.re_show('p[aeiou]{,2}t', 'paat pit pooot 123 abc')


{paat} {pit} pooot 123 abc

  • \d+(.\d+)? matches one or more digits, optionally followed by any single character and one or more further digits. The dot is unescaped, so it matches any character and not just a decimal point; write \. to match only a literal point.

In [14]:
nltk.re_show(r'\d+(.\d+)?','1t23 1 abc')


{1t23} {1} abc
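
  • ([^aeiou][aeiou][^aeiou])* matches zero or more repetitions of a three-character sequence: a character that is not a lowercase vowel, then a lowercase vowel, then another character that is not a lowercase vowel. Because zero repetitions are allowed, it also matches the empty string, which is why empty {} pairs appear in the output.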

In [15]:
nltk.re_show('([^aeiou][aeiou][^aeiou])*','hat cat aei')


{hat} {cat} {}a{}e{}i{}
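
  • \w+|[^\w\s]+ matches either a run of one or more word characters (letters, digits, underscore) or a run of one or more characters that are neither word characters nor whitespace, i.e. punctuation and symbols.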

In [16]:
nltk.re_show(r'\w+|[^\w\s]+','1ab @@@')


{1ab} {@@@}
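
The effect of the unescaped dot in \d+(.\d+)? can be seen by comparing it with an escaped variant. The sketch uses plain re so that it runs without NLTK; mark_matches is a hypothetical stand-in that wraps each match in braces and is not NLTK's own implementation of re_show().

```python
import re

def mark_matches(pattern, text):
    """Wrap every non-overlapping match in braces (hypothetical helper)."""
    return re.sub(pattern, lambda m: '{' + m.group(0) + '}', text)

sample = '1t23 3.14 42'

# Unescaped dot: any character may sit between the two digit runs.
print(mark_matches(r'\d+(.\d+)?', sample))   # {1t23} {3.14} {42}

# Escaped dot: only a literal decimal point is accepted.
print(mark_matches(r'\d+(\.\d+)?', sample))  # {1}t{23} {3.14} {42}
```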

Write regular expressions to match the following classes of strings:

  • A single determiner (assume that a, an, and the are the only determiners).
  • An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.

In [83]:
import re

re.findall(r'\b(?:an?|the)\b','this is an a the check')


Out[83]:
['an', 'a', 'the']

In [92]:
re.findall(r'\d+(?:[+*]\d+)+','Do this 2+93*78 and 5*7+8')


Out[92]:
['2+93*78', '5*7+8']
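
Two details are worth checking on other inputs: whether the determiner pattern fires inside longer words, and whether the arithmetic pattern accepts any number of operators. The sample sentences below are made up for illustration, and the pattern names are hypothetical.

```python
import re

text = 'the cat sat on an anvil'

# Without word boundaries the pattern also matches letters inside other words.
loose = r'an?|the'
print(re.findall(loose, text))   # ['the', 'a', 'a', 'an', 'an']

# \b on both sides restricts matches to whole words.
strict = r'\b(?:an?|the)\b'
print(re.findall(strict, text))  # ['the', 'an']

# An integer followed by one or more "+ integer" or "* integer" steps.
arith = r'\d+(?:[+*]\d+)+'
print(re.findall(arith, 'Do this 2+93*78 and 5*7+8 or 4*5'))  # ['2+93*78', '5*7+8', '4*5']
```

The groups are written as (?:...) because re.findall returns the group contents instead of the whole match when a pattern contains capturing groups.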

Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g. raw_contents = urllib.urlopen('http://www.nltk.org/').read().


In [30]:
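from urllib import urlopen
from BeautifulSoup import BeautifulSoup

def get_content(url):
    # Download the page and let BeautifulSoup drop the HTML markup
    soup = BeautifulSoup(urlopen(url).read())
    return soup.getText()

# Skip the first 1000 characters of the extracted text
content =  get_content('http://www.nltk.org/')[1000:]
# Save an 843-character excerpt as the corpus for the next exercise
with open('corpus.txt','w') as f:
    f.write(content[:843])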
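
The cell above targets Python 2 with BeautifulSoup 3. As an alternative, markup can be stripped with the Python 3 standard library alone. This is a swapped-in technique, not the approach used above, and the class and function names are hypothetical. The sketch works on an HTML string so that it runs without network access.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def strip_markup(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

print(strip_markup('<p>Hello <b>world</b>!</p><script>var x = 1;</script>'))  # Hello world!
```

To apply it to a live page in Python 3, the HTML string would come from urllib.request.urlopen(url).read() decoded with the page's character encoding.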

Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.

  • Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).
  • Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.

In [35]:
def loadFile(fileName):
    with open(fileName,'r') as f:
        contents = f.read()
        return contents
text =  loadFile('corpus.txt')

In [37]:
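pattern = r'''(?x)  # verbose mode: whitespace and comments in the pattern are ignored
  [^\w\s]|_  # one character that is neither a word character nor whitespace, or an underscore
'''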
nltk.regexp_tokenize(text,pattern)[:3]


Out[37]:
[',', ',', ',']

In [52]:
pattern = r'''(?x)  #set to verbose
  [A-Z][a-z]+ #organization Name
  | [A-Z][a-z]+\s[A-Z][a-z]+  #People name
'''
nltk.regexp_tokenize(text,pattern)[:5]


Out[52]:
['Windows', 'Mac', 'Linux', 'Best', 'Python']
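
The order of the alternatives matters in a pattern like this: the regex engine takes the first alternative that matches, so a one-word pattern listed first prevents the two-word pattern from ever being used. The sketch below illustrates this with plain re.findall, to avoid an NLTK dependency, on a made-up sentence; the variable names are hypothetical.

```python
import re

text = 'Alan Turing worked at Bletchley Park with Python'

short_first = r'''(?x)
    [A-Z][a-z]+                # one capitalised word
  | [A-Z][a-z]+\s[A-Z][a-z]+   # two capitalised words, never reached
'''

long_first = r'''(?x)
    [A-Z][a-z]+\s[A-Z][a-z]+   # two capitalised words, tried first
  | [A-Z][a-z]+                # one capitalised word
'''

print(re.findall(short_first, text))  # ['Alan', 'Turing', 'Bletchley', 'Park', 'Python']
print(re.findall(long_first, text))   # ['Alan Turing', 'Bletchley Park', 'Python']
```

Capitalisation alone cannot tell a person from an organization, or either from a sentence-initial word, so both variants are rough heuristics.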

In [ ]:
pattern = r'''(?x)  # set to verbose
    (?:19|20)\d\d[- /.](?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01])  # date, e.g. 2014-03-27
  | \$?\d+(?:\.\d+)  # monetary amount, e.g. $12.50
'''
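
Capturing groups change what re.findall returns: when a pattern contains a capturing group, the result holds the group contents and not the whole match. The sketch below shows this with plain re on a made-up string; the variable names are hypothetical. Non-capturing groups, written (?:...), keep the whole match.

```python
import re

text = 'cost $12.50 and $3.99'

capturing = r'\$?\d+(\.\d+)'
non_capturing = r'\$?\d+(?:\.\d+)'

print(re.findall(capturing, text))      # ['.50', '.99']
print(re.findall(non_capturing, text))  # ['$12.50', '$3.99']
```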

Rewrite the following loop as a list comprehension:

>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = []
>>> for word in sent:
...     word_len = (word, len(word))
...     result.append(word_len)
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]


In [54]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
print [(word, len(word)) for word in sent]


[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
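
A short check that the loop and the comprehension build the same list, followed by a variant with a filter condition. The threshold of three characters is arbitrary and used only for illustration.

```python
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

# Loop version, as in the exercise
result = []
for word in sent:
    result.append((word, len(word)))

# Comprehension version builds the same list in one expression
pairs = [(word, len(word)) for word in sent]
print(pairs == result)  # True

# An optional trailing condition filters items while the list is built
long_pairs = [(word, len(word)) for word in sent if len(word) > 3]
print(long_pairs)  # [('gave', 4), ('John', 4), ('newspaper', 9)]
```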