### Author: Nirmal Kumar Ravi

Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

``````

In :

s = 'colorless'
s[:4]+'u'+s[4:]

``````
``````

Out:

'colourless'

``````

We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.

``````

In :

print 'dishes'[:-2]
print 'running'[:-4]
print 'nationality'[:6]
print 'undo'[:2]
print 'preheat'[:3]

``````
``````

dish
run
nation
un
pre

``````

We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

• Yes, it is possible. For example, 'Python'[-7] raises an IndexError: the string has only six characters, so the leftmost valid index is -6.
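To see this concretely, a small check in plain Python (no NLTK needed):

```python
s = 'Python'             # length 6, so valid indexes run from -6 to 5
print(s[-6])             # the leftmost character, 'P'
try:
    s[-7]                # one step too far to the left
except IndexError:
    print('IndexError')  # same error as indexing past the right end
```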

We can specify a "step" size for the slice. The following returns every second character within the slice: monty[6:11:2]. It also works in the reverse direction: monty[10:5:-2]. Try these for yourself, then experiment with different step values.

``````

In :

print 'ThisisTest'[::1]
print 'ThisisTest'[::2]
print 'ThisisTest'[3::1]
print 'ThisisTest'[:4:1]
print 'ThisisTest'[::-1]

``````
``````

ThisisTest
TiiTs
sisTest
This
tseTsisihT

``````

What happens if you ask the interpreter to evaluate monty[::-1]? Explain why this is a reasonable result.

• It prints the string in reverse order. The step of -1 walks from the last character to the first, and because the start and end are omitted they default to the ends of the string, so every character is included. This is reasonable: the slice still means "take everything", just traversed backwards.
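The defaults can be checked directly: with a negative step, the omitted start defaults to the last index, so spelling it out explicitly gives the same result.

```python
s = 'monty'
print(s[::-1])                       # 'ytnom'
# explicit start index gives the same result as the omitted default
print(s[len(s) - 1::-1] == s[::-1])  # True
```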

Describe the class of strings matched by the following regular expressions.

• [a-zA-Z]+
• [A-Z][a-z]*
• p[aeiou]{,2}t
• \d+(.\d+)?
• ([^aeiou][aeiou][^aeiou])*
• \w+|[^\w\s]+
``````

In :

import nltk, re, pprint

``````
• [a-zA-Z]+ One or more letters, matching both uppercase and lowercase.
``````

In :

nltk.re_show('[a-zA-Z]+','This is test 123')

``````
``````

{This} {is} {test} 123

``````
• [A-Z][a-z]* An uppercase letter followed by zero or more lowercase letters.
``````

In :

nltk.re_show('[A-Z][a-z]*','This IS test 123')

``````
``````

{This} {I}{S} test 123

``````
• p[aeiou]{,2}t The letter p, then at most two vowels from 'aeiou', then the letter t.
``````

In :

nltk.re_show('p[aeiou]{,2}t', 'paat pit pooot 123 abc')

``````
``````

{paat} {pit} pooot 123 abc

``````
• \d+(.\d+)? One or more digits, optionally followed by a single arbitrary character and one or more further digits. Note that the dot is unescaped, so it matches any character, which is why '1t23' matches below; to match only decimal numbers it would have to be \d+(\.\d+)?.
``````

In :

nltk.re_show('\d+(.\d+)?','1t23 1 abc')

``````
``````

{1t23} {1} abc

``````
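For comparison, escaping the dot restricts the optional part to a literal decimal point, so only true decimal numbers match. A small check using the standard re module (with a non-capturing group so findall returns whole matches):

```python
import re

# unescaped dot: any character may sit between the digit groups
print(re.findall(r'\d+(?:.\d+)?', '1t23 3.14 abc'))   # ['1t23', '3.14']
# escaped dot: only a literal decimal point is allowed
print(re.findall(r'\d+(?:\.\d+)?', '1t23 3.14 abc'))  # ['1', '23', '3.14']
```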
• ([^aeiou][aeiou][^aeiou])* Zero or more repetitions of a non-vowel, a vowel, and a non-vowel. Since zero repetitions are allowed, it also matches the empty string at every position, which produces the empty {} matches below.
``````

In :

nltk.re_show('([^aeiou][aeiou][^aeiou])*','hat cat aei')

``````
``````

{hat} {cat} {}a{}e{}i{}

``````
• \w+|[^\w\s]+ Either one or more word (alphanumeric) characters, or one or more characters that are neither word characters nor whitespace, i.e. runs of punctuation.
``````

In :

nltk.re_show('\w+|[^\w\s]+','1ab @@@')

``````
``````

{1ab} {@@@}

``````

Write regular expressions to match the following classes of strings:

• A single determiner (assume that a, an, and the are the only determiners).
• An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.
``````

In :

import re

re.findall(r'\b(an?|the)\b', 'this is an a the check')  # \b keeps 'a'/'an'/'the' from matching inside other words

``````
``````

Out:

['an', 'a', 'the']

``````
``````

In :

re.findall(r'\d+(?:[+*]\d+)*', 'Do this 2+93*78 and 5*7+8')  # any number of + and * operations

``````
``````

Out:

['2+93*78', '5*7+8']

``````
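Once matched, such an expression can also be broken into its parts. A small sketch that tokenizes a matched expression into integers and operators:

```python
import re

# split an arithmetic expression into integer and operator tokens
expr = '2*3+8'
print(re.findall(r'\d+|[+*]', expr))  # ['2', '*', '3', '+', '8']
```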

Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g. raw_contents = urllib.urlopen('http://www.nltk.org/').read().

``````

In :

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

def get_content(url):
    # fetch the raw HTML, then let BeautifulSoup strip the markup
    raw = urlopen(url).read()
    soup = BeautifulSoup(raw)
    return soup.getText()

content = get_content('http://www.nltk.org/')[1000:]
with open('corpus.txt', 'w') as f:
    f.write(content[:843])

``````
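Since the old BeautifulSoup package is Python 2 only, the markup-removal step can also be sketched with a plain regular expression. This is a naive approach that ignores script blocks and HTML entities, and strip_html is a hypothetical helper name, not part of any library:

```python
import re

def strip_html(html):
    # replace each tag with a space, then collapse runs of whitespace
    text = re.sub(r'<[^>]+>', ' ', html)
    return ' '.join(text.split())

print(strip_html('<p>Hello <b>world</b>!</p>'))  # Hello world !
```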

Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.

• Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).
• Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.
``````

In :

def load(fileName):
    with open(fileName, 'r') as f:
        return f.read()

text = load('corpus.txt')

``````
``````

In :

pattern = r'''(?x)  # set the verbose flag
[^\w\s]             # not alphanumeric and not whitespace, i.e. punctuation
| _                 # underscore is part of \w, so match it separately
'''
nltk.regexp_tokenize(text,pattern)[:3]

``````
``````

Out:

[',', ',', ',']

``````
``````

In :

pattern = r'''(?x)           # set the verbose flag
[A-Z][a-z]+                  # a single capitalized word (organization name)
| [A-Z][a-z]+\s[A-Z][a-z]+   # two capitalized words (person name); note that
                             # alternation tries branches left to right, so
                             # two-word names get split by the first branch
'''
nltk.regexp_tokenize(text,pattern)[:5]

``````
``````

Out:

['Windows', 'Mac', 'Linux', 'Best', 'Python']

``````
``````

In [ ]:

pattern = r'''(?x)              # set the verbose flag
\d{4}[-/.]\d{1,2}[-/.]\d{1,2}   # date, e.g. 2015-01-31
| \$?\d+(?:\.\d+)?              # monetary amount, e.g. $12.40
'''

``````

Rewrite the following loop as a list comprehension:

``````

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]

``````

``````

In :

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
print [(word, len(word)) for word in sent]

``````
``````

[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]

``````