Exercise from http://www.nltk.org/book_1ed/ch03.html

Author : Nirmal kumar Ravi

Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.



In [2]:

    
s = 'colorless'
s[:4]+'u'+s[4:]









    Out[2]:





'colourless'

We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.



In [9]:

    
print 'dishes'[:-2]
print 'running'[:-4]
print 'nationality'[:6]
print 'undo'[2:]      # the affix is the prefix 'un', so slice from the left
print 'preheat'[3:]   # the affix is the prefix 'pre'



dish
run
nation
do
heat
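The hyphen in the exercise marks the affix, so suffixes are cut from the right and prefixes from the left. A minimal sketch of that idea, assuming two illustrative helpers (`strip_suffix` and `strip_prefix` are hypothetical names, not part of NLTK or the exercise):

```python
def strip_suffix(word, suffix):
    """Slice the suffix off the end of word, if word really ends with it."""
    if suffix and word.endswith(suffix):
        return word[:-len(suffix)]
    return word

def strip_prefix(word, prefix):
    """Slice the prefix off the start of word, if word really starts with it."""
    if word.startswith(prefix):
        return word[len(prefix):]
    return word

stems = [
    strip_suffix('dishes', 'es'),
    strip_suffix('running', 'ning'),
    strip_suffix('nationality', 'ality'),
    strip_prefix('undo', 'un'),
    strip_prefix('preheat', 'pre'),
]
print(stems)
```

Computing the slice bounds from `len(suffix)` or `len(prefix)` avoids hand-counting characters for each word.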

We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

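Yes, it is possible. A negative index whose magnitude is larger than the length of the string points before the first character and raises an IndexError. For example, 'Python' has six characters, so 'Python'[-6] is 'P' but 'Python'[-7] fails.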
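A short sketch of this behaviour follows. The helper name `safe_index` is an assumption made for illustration; it only wraps ordinary indexing in a try/except so the failing case can be shown without stopping the script.

```python
def safe_index(s, i):
    """Return s[i], or None if i falls outside the string on either side."""
    try:
        return s[i]
    except IndexError:
        return None

word = 'Python'
print(safe_index(word, -6))   # leftmost valid index for a six-character string
print(safe_index(word, -7))   # one step too far to the left
print(word[-7:])              # a slice is clipped to the string instead of raising
```

Indexing is strict about bounds, while slicing silently clips out-of-range bounds, which is why the last line still returns the whole string.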

We can specify a "step" size for the slice. The following returns every second character within the slice: monty[6:11:2]. It also works in the reverse direction: monty[10:5:-2]. Try these for yourself, then experiment with different step values.



In [21]:

    
print 'ThisisTest'[::1]
print 'ThisisTest'[::2]
print 'ThisisTest'[3::1]
print 'ThisisTest'[:4:1]
print 'ThisisTest'[::-1]









    



ThisisTest
TiiTs
sisTest
This
tseTsisihT

What happens if you ask the interpreter to evaluate monty[::-1]? Explain why this is a reasonable result.

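It returns the string in reverse order. The step of -1 tells Python to walk backwards one character at a time. Because the start and end of the slice are omitted, they default to the last and first characters for a negative step, so the slice covers the whole string.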
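The defaults that Python fills in can be inspected with the built-in `slice.indices` method. The sketch below assumes `monty` holds 'Monty Python', as in the book chapter; any other string would work the same way.

```python
monty = 'Monty Python'

# slice.indices(length) reports the concrete (start, stop, step) that
# Python substitutes for the omitted parts of monty[::-1].
start, stop, step = slice(None, None, -1).indices(len(monty))
print((start, stop, step))

# Rebuild the reversed string by walking those indexes explicitly.
rebuilt = ''.join(monty[i] for i in range(start, stop, step))
print(rebuilt == monty[::-1])
```

The walk starts at the last index and stops just before index -1, so every character is visited exactly once, from right to left.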

Describe the class of strings matched by the following regular expressions.

[a-zA-Z]+

[A-Z][a-z]*

p[aeiou]{,2}t

\d+(.\d+)?

([^aeiou][aeiou][^aeiou])*

\w+|[^\w\s]+

Test your answers using nltk.re_show().



In [1]:

    
import nltk, re, pprint

[a-zA-Z]+ matches a run of one or more ASCII letters, in upper or lower case. Digits, spaces, and punctuation end the match.



In [4]:

    
nltk.re_show('[a-zA-Z]+','This is test 123')









    



{This} {is} {test} 123

[A-Z][a-z]* matches one capital letter followed by zero or more lowercase letters. A run of capitals such as IS is therefore split into separate one-letter matches.



In [7]:

    
nltk.re_show('[A-Z][a-z]*','This IS test 123')









    



{This} {I}{S} test 123

p[aeiou]{,2}t matches a p, then at most two lowercase vowels (zero, one, or two), then a t, for example pt, pit, or paat. Three vowels in a row, as in pooot, do not match.



In [11]:

    
nltk.re_show('p[aeiou]{,2}t', 'paat pit pooot 123 abc')









    



{paat} {pit} pooot 123 abc

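\d+(.\d+)? matches one or more digits, optionally followed by any single character and one or more further digits. Because the dot is unescaped it matches any character, not only a decimal point, which is why 1t23 matches below. Writing \. instead would restrict it to a literal dot.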



In [14]:

    
nltk.re_show('\d+(.\d+)?','1t23 1 abc')









    



{1t23} {1} abc
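To see the effect of the unescaped dot, the sketch below compares the exercise pattern with a variant that escapes the dot. Both use a non-capturing group `(?:...)` so that `re.findall` returns whole matches rather than group contents. The sample string is made up for illustration.

```python
import re

loose = re.compile(r'\d+(?:.\d+)?')    # unescaped dot: any single character
strict = re.compile(r'\d+(?:\.\d+)?')  # escaped dot: a literal '.'

text = '1t23 3.14 42'
loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(loose_matches)
print(strict_matches)
```

With the escaped dot, 1t23 is no longer treated as a single number and is split into its two digit runs.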

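([^aeiou][aeiou][^aeiou])* matches zero or more repetitions of a three-character unit: a character that is not a lowercase vowel, then a lowercase vowel, then another character that is not a lowercase vowel. Because of the *, the pattern also matches the empty string, which produces the empty {} matches below. Note that [^aeiou] accepts any non-vowel character, including digits, spaces, and punctuation.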



In [15]:

    
nltk.re_show('([^aeiou][aeiou][^aeiou])*','hat cat aei')









    



{hat} {cat} {}a{}e{}i{}
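The empty matches come from the `*` quantifier, which is satisfied by zero repetitions. A small sketch using `re.match` on a few made-up strings shows how much of each string the pattern consumes from the start:

```python
import re

pattern = re.compile(r'([^aeiou][aeiou][^aeiou])*')

samples = ['hat', 'hatcat', 'aei', '1a ']
consumed = {s: pattern.match(s).group(0) for s in samples}

for s in samples:
    print('%r -> %r' % (s, consumed[s]))
```

The string 'aei' yields an empty match because no three-character unit fits, while '1a ' matches in full because [^aeiou] is not limited to letters.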

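\w+|[^\w\s]+ matches either a run of one or more word characters (letters, digits, or underscore) or a run of one or more characters that are neither word characters nor whitespace, such as punctuation and symbols. Applied repeatedly, it splits text into word tokens and punctuation tokens.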



In [16]:

    
nltk.re_show('\w+|[^\w\s]+','1ab @@@')









    



{1ab} {@@@}
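Used with `re.findall`, this pattern behaves like a simple tokenizer. The sentence below is an invented example, not taken from the book:

```python
import re

tokens = re.findall(r'\w+|[^\w\s]+', "Hello, world!! It's 3_am...")
print(tokens)
```

Note that the apostrophe splits It's into three tokens, and that the underscore counts as a word character, so 3_am stays in one piece.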

Write regular expressions to match the following classes of strings:

A single determiner (assume that a, an, and the are the only determiners).
An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.



In [83]:

    
import re

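# \b anchors the match at word boundaries so that the 'a' in 'check' or 'cat' is not counted
re.findall(r'\b(?:an?|the)\b','this is an a the check')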









    Out[83]:





['an', 'a', 'the']
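A determiner pattern without word boundaries also fires inside longer words. The sketch below contrasts an unanchored pattern with one wrapped in `\b`, on an invented sentence; the `re.IGNORECASE` flag is an added assumption so that a sentence-initial "The" is found too.

```python
import re

text = 'The cat ate an apple and a pear in the bath'

unanchored = re.findall(r'an?|the', text)
anchored = re.findall(r'\b(?:a|an|the)\b', text, flags=re.IGNORECASE)

print(unanchored)
print(anchored)
```

The unanchored version picks up the a in cat, ate, apple, pear, and bath, and the an in and, while the anchored version returns only the four real determiners.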



In [92]:

    
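# an integer followed by one or more "+integer" or "*integer" parts;
# only addition and multiplication are allowed, with any number of operators
re.findall(r'\d+(?:[+*]\d+)+','Do this 2+93*78 and 5*7+8')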









    Out[92]:





['2+93*78', '5*7+8']
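An arithmetic expression can contain any number of operators, so a repeated group is more general than spelling out exactly two of them. A minimal sketch, assuming Python 3 (for `fullmatch`) and the hypothetical helper name `is_arithmetic`:

```python
import re

# One integer, then one or more "+integer" or "*integer" continuations.
ARITH = re.compile(r'\d+(?:[+*]\d+)+')

def is_arithmetic(expr):
    """True if the whole string is an integer expression using + and *."""
    return ARITH.fullmatch(expr) is not None

found = ARITH.findall('Do this 2+93*78 and 5*7+8 and 1+2+3+4')
print(found)
print(is_arithmetic('2*3+8'))
print(is_arithmetic('2-3'))
```

This sketch requires at least one operator, so a bare integer such as 7 is not accepted. Changing the final `+` to `*` would accept it.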

Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g. raw_contents = urllib.urlopen('http://www.nltk.org/').read().



In [30]:

    
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

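def get_content(url):
    # Download the page and let BeautifulSoup drop the HTML markup
    soup = BeautifulSoup(urlopen(url).read())
    return soup.getText()

# Skip the first 1000 characters of the page text, then save an 843-character excerpt
content = get_content('http://www.nltk.org/')[1000:]
with open('corpus.txt','w') as f:
    f.write(content[:843])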
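The cell above relies on Python 2's `urllib` and the third-party BeautifulSoup package. As an alternative, markup can be stripped with only the Python 3 standard library. The sketch below swaps in `html.parser.HTMLParser` for BeautifulSoup and works on an HTML string, so it needs no network access. The names `TextExtractor` and `strip_markup` are illustrative assumptions.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def strip_markup(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with spaces so text from neighbouring tags does not run together.
    return ' '.join(' '.join(parser.chunks).split())

sample_html = '<h1>NLTK</h1><p>Natural <b>Language</b> Toolkit</p>'
print(strip_markup(sample_html))
```

On Python 3 the page itself could be fetched with `urllib.request.urlopen(url).read()` and decoded before being passed to `strip_markup`. Joining chunks with a space is a design choice that keeps headings and paragraphs apart, at the cost of splitting a word whose middle is wrapped in a tag.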

Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.

Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).
Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.



In [35]:

    
def loadFile(fileName):
    with open(fileName,'r') as f:
        contents = f.read()
        return contents
text =  loadFile('corpus.txt')



In [37]:

    
pattern = r'''(?x)  #set to verbose
  [^\w\s]|_  # one character that is neither a word character nor whitespace, or an underscore
'''
nltk.regexp_tokenize(text,pattern)[:3]









    Out[37]:





[',', ',', ',']



In [52]:

    
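pattern = r'''(?x)  # set to verbose
  [A-Z][a-z]+ # organization name: one capitalized word (tried first, so it always wins)
  | [A-Z][a-z]+\s[A-Z][a-z]+  # person name: two capitalized words (cannot match in this order; list it first to capture full names)
'''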
nltk.regexp_tokenize(text,pattern)[:5]









    Out[52]:





['Windows', 'Mac', 'Linux', 'Best', 'Python']
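The order of alternatives matters: a regular expression takes the first alternative that succeeds, so a one-word pattern listed first prevents the two-word pattern from ever matching. The sketch below shows the difference with plain `re.findall` on an invented sample sentence; the variable names are illustrative.

```python
import re

short_first = re.compile(r'''(?x)
    [A-Z][a-z]+                    # single capitalized word
  | [A-Z][a-z]+\s[A-Z][a-z]+       # two capitalized words
''')

long_first = re.compile(r'''(?x)
    [A-Z][a-z]+\s[A-Z][a-z]+       # two capitalized words, tried first
  | [A-Z][a-z]+                    # single capitalized word
''')

sample = 'Steven Bird wrote about Python'
short_result = short_first.findall(sample)
long_result = long_first.findall(sample)

print(short_result)
print(long_result)
```

Listing the longer alternative first keeps the two-word name together as a single token.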



In [ ]:

    
pattern = r'''(?x)  # set to verbose
    (?:19|20)\d\d[- /.](?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01])  # date, e.g. 2014-03-07
  | \$?\d+(?:\.\d+)?  # monetary amount, e.g. $12.50 (the fraction is optional)
'''
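A date-and-money pattern needs a `|` between its alternatives, no `^` or `$` anchors if it is to match inside running text, and non-capturing groups so that whole matches are returned. The sketch below tests such a pattern with `re.findall` instead of `nltk.regexp_tokenize`, on an invented sentence. As an added assumption, the dollar sign is required here so that bare numbers are not reported as amounts.

```python
import re

pattern = re.compile(r'''(?x)
    (?:19|20)\d\d [-/.] (?:0[1-9]|1[012]) [-/.] (?:0[1-9]|[12][0-9]|3[01])  # date: YYYY-MM-DD
  | \$\d+(?:\.\d+)?                                                         # amount: $12 or $12.50
''')

sample = 'Paid $12.50 on 2014-03-07 and $3 on 2014/03/09'
found = pattern.findall(sample)
print(found)
```

The date alternative comes first so that the year of a date is never consumed by a more permissive number pattern.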

Rewrite the following loop as a list comprehension:

    sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
    result = []
    for word in sent:
        word_len = (word, len(word))
        result.append(word_len)
    result
    [('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]



In [54]:

    
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
print [(word, len(word)) for word in sent]









    



[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
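As a quick check that the comprehension is equivalent to the original loop, the sketch below builds the list both ways and compares them. The `zip`/`map` variant is an extra alternative added for illustration, not something the exercise asks for.

```python
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

# Explicit loop, as in the exercise statement.
result = []
for word in sent:
    result.append((word, len(word)))

# Equivalent list comprehension.
pairs = [(word, len(word)) for word in sent]

# zip with map builds the same pairs without naming the loop variable.
zipped = list(zip(sent, map(len, sent)))

print(result == pairs == zipped)
```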