Laboratory 03

1. Write a regular expression that greedily matches one or more digits. Do not match anything else.



In [1]:

    
import re

patt = re.compile(r'^\d+$')
    
assert(patt.match("123"))
assert(patt.match("123ab") is None)
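One subtlety worth checking here: in Python, `$` also matches just before a trailing newline, so the anchored pattern accepts `"123\n"`. The sketch below contrasts it with `fullmatch()`, which requires the whole string to match. The names `dollar_patt` and `digits_patt` are illustrative and not part of the exercise.

```python
import re

# Pattern from the exercise: anchored with ^ and $.
dollar_patt = re.compile(r'^\d+$')
# Alternative sketch: no anchors, rely on fullmatch() instead.
digits_patt = re.compile(r'\d+')

# `$` matches just before a trailing newline, so this still matches.
assert dollar_patt.match("123\n")

# fullmatch() must consume the entire string, newline included.
assert digits_patt.fullmatch("123\n") is None
assert digits_patt.fullmatch("123")
assert digits_patt.fullmatch("123ab") is None
```

Whether a trailing newline should be rejected depends on where the input comes from; if it should, `fullmatch()` or the `\Z` anchor are the usual choices.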

2. Match Hungarian phone numbers. For this exercise, a number is +36 followed by nine digits, with no spaces or hyphens.



In [2]:

    
patt = re.compile(r'^\+36\d{9}$')

assert(patt.match("+36301234567"))
assert(patt.match("+37000000000") is None)
assert(patt.match("+363012345678") is None)

3. Match Hungarian phone numbers whose digit groups may be separated by spaces or hyphens, as in the tests of this exercise.



In [3]:

    
# +36, then groups of 2, 3 and 4 digits, each optionally preceded by a space or hyphen.
patt = re.compile(r'^\+36[ -]?\d{2}[ -]?\d{3}[ -]?\d{4}$')

assert(patt.match("+36301234567"))
assert(patt.match("+37000000000") is None)
assert(patt.match("+363012345678") is None)
assert(patt.match("+36 30 123 4566"))
assert(patt.match("+36-30-123-4566"))
assert(patt.match("+36-30-123-45667") is None)
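The pattern above treats each separator independently, so it also accepts mixed forms such as `+36-30 123 4566`. If mixing should be rejected (an assumption; the exercise does not say), a backreference can force every separator to equal the first one. A minimal sketch, with `consistent_patt` as an illustrative name:

```python
import re

# Group 1 captures the first separator (space, hyphen, or nothing);
# \1 then requires the same separator before the remaining groups.
consistent_patt = re.compile(r'\+36([ -]?)\d{2}\1\d{3}\1\d{4}')

assert consistent_patt.fullmatch("+36301234567")
assert consistent_patt.fullmatch("+36 30 123 4566")
assert consistent_patt.fullmatch("+36-30-123-4566")
# Mixed separators no longer match.
assert consistent_patt.fullmatch("+36-30 123 4566") is None
assert consistent_patt.fullmatch("+36-30-123-45667") is None
```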

4. Match any floating-point number. Do not match invalid numbers such as 0.34.1.



In [4]:

    
patt = re.compile(r'^\d+(\.\d+)?$')

assert(patt.match("1.9"))
assert(patt.match("1.9.2") is None)
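The pattern above covers unsigned numbers such as `12` and `1.9`. What counts as a valid float is left open by the exercise; the sketch below assumes a broader definition (optional sign, a bare fraction such as `.5`, and an optional exponent). The names `float_patt` and `is_float` are illustrative.

```python
import re

# Assumed definition: optional sign, then either digits with an optional
# fraction or a bare fraction, then an optional exponent.
float_patt = re.compile(r'[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?')

def is_float(text):
    """Return True if the whole string is a float under the assumptions above."""
    return float_patt.fullmatch(text) is not None

assert is_float("1.9")
assert is_float("-0.5")
assert is_float(".5")
assert is_float("1e10")
assert not is_float("1.9.2")
assert not is_float("1e")
assert not is_float(".")
```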

5. Match email addresses.



In [5]:

    
patt = re.compile(r'^[a-zA-Z._][a-zA-Z0-9._]*@[a-zA-Z0-9._]+\.[a-z]{2,3}$')

assert(patt.match("abc@example.com"))
assert(patt.match("abc") is None)
assert(patt.match("abc@example@example.com") is None)
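The pattern above is deliberately strict: it rejects a local part that starts with a digit or contains `+`, and top-level domains longer than three letters. The sketch below is a looser variant under assumed rules; it is still only an approximation and does not implement the full address grammar of the email standards. The name `loose_email` is illustrative.

```python
import re

# Assumed rules: local part of letters, digits and . _ % + -,
# then dot-separated domain labels, then a top-level domain of 2+ letters.
loose_email = re.compile(
    r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}'
)

assert loose_email.fullmatch("abc@example.com")
assert loose_email.fullmatch("first.last+tag@mail.example.info")
assert loose_email.fullmatch("abc") is None
assert loose_email.fullmatch("abc@example@example.com") is None
```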

6. Match any number in hexadecimal format. Hexadecimal numbers are prefixed with 0x.



In [6]:

    
patt = re.compile(r'^0x[a-fA-F0-9]+$')

assert(patt.match("0xa"))
assert(patt.match("0x16FA"))
assert(patt.match("16FA") is None)
assert(patt.match("0x16FG") is None)
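Once a string is known to be valid, a capture group makes it easy to convert the digits to an integer. A minimal sketch, with `hex_patt` and `parse_hex` as illustrative names:

```python
import re

# Group 1 captures the hex digits after the 0x prefix.
hex_patt = re.compile(r'0x([0-9a-fA-F]+)')

def parse_hex(text):
    """Return the integer value of a 0x-prefixed string, or None if invalid."""
    m = hex_patt.fullmatch(text)
    if m is None:
        return None
    return int(m.group(1), 16)

assert parse_hex("0xa") == 10
assert parse_hex("0x16FA") == 5882
assert parse_hex("16FA") is None
assert parse_hex("0x16FG") is None
```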

7. Remove everything between parentheses. Be careful not to match through to the next pair of parentheses.



In [7]:

    
# '[^)]*' cannot cross a ')', so each match ends at the first closing parenthesis.
patt = re.compile(r'\([^)]*\)')

assert(patt.sub("",  "(a)bc") == "bc")
assert(patt.sub("",  "abc") == "abc")
assert(patt.sub("",  "a (bc) de (12)") == "a  de ")
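The pattern above handles separate pairs, but nested parentheses leave residue behind because a single regular expression of this kind cannot count nesting depth. One possible workaround, sketched here under the assumption that nested input should be removed entirely, is to strip innermost pairs repeatedly until nothing changes. The names `innermost` and `strip_parens` are illustrative.

```python
import re

# An innermost pair contains neither '(' nor ')'.
innermost = re.compile(r'\([^()]*\)')

def strip_parens(text):
    """Remove parenthesized spans, including nested ones, pass by pass."""
    while True:
        text, count = innermost.subn("", text)
        if count == 0:
            return text

# The single-pass pattern stops at the first ')' and leaves a fragment.
assert re.sub(r'\([^)]*\)', "", "a (b (c) d) e") == "a  d) e"
# Repeated removal of innermost pairs handles the nesting.
assert strip_parens("a (b (c) d) e") == "a  e"
assert strip_parens("a (bc) de (12)") == "a  de "
```

Unbalanced input such as `"a (b"` is returned unchanged by this sketch.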

8. Create a simple word tokenizer using regular expressions. What patterns should you split on? Add more tests.



In [8]:

    
patt = re.compile(r'[\s.]+')

assert(patt.split("simple sentence.") == ["simple", "sentence", ""])
assert(patt.split("multiple \t whitespaces") == ["multiple", "whitespaces"])
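Splitting leaves an empty string when the text ends with a separator, as the first test shows. An alternative worth comparing is to match the tokens themselves with `findall()` rather than the gaps between them. The sketch below assumes a token is a run of word characters, optionally joined by apostrophes; `word_patt` and `tokenize` are illustrative names.

```python
import re

# A token: word characters, optionally continued by 'suffix parts (e.g. "Don't").
word_patt = re.compile(r"\w+(?:'\w+)*")

def tokenize(text):
    """Return the list of word tokens found in text."""
    return word_patt.findall(text)

assert tokenize("simple sentence.") == ["simple", "sentence"]
assert tokenize("multiple \t whitespaces") == ["multiple", "whitespaces"]
assert tokenize("Don't stop, it's fine!") == ["Don't", "stop", "it's", "fine"]
assert tokenize("") == []
```

Matching tokens drops punctuation entirely; if punctuation marks should be kept as separate tokens, the pattern would need an extra alternative for them.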