wk4.1

A word on virtual environments

conda create -n virtualenv_name anaconda # "anaconda" can be swapped for any other package spec, for instance python=2

source activate virtualenv_name

source deactivate

Regular expressions


In [ ]:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)

In [ ]:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line) :
        print(line)

In [ ]:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^R.+: <.+@.+>', line) : # This is greedy!
        print(line)
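The "greedy" remark in the cell above refers to `.+` grabbing as many characters as it can. Here is a minimal sketch of the effect, using a made-up line with two bracketed addresses (the line is an assumption for illustration and does not come from mbox-short.txt):

```python
import re

# Hypothetical line with two bracketed addresses (not from mbox-short.txt)
line = 'Reply-To: <a@example.org> and <b@example.org>'

# Greedy: each .+ takes as much as it can, so the match runs to the last '>'
greedy = re.findall(r'<.+@.+>', line)

# Non-greedy: each .+? takes as little as it can, so each address matches separately
lazy = re.findall(r'<.+?@.+?>', line)

print(greedy)
print(lazy)
```

The greedy pattern should return a single long match spanning both addresses, while the non-greedy one should return the two addresses separately.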

Extracting information

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
          for <source@collab.sakaiproject.org>;
Received: (from apache@localhost)
Author: stephen.marquard@uct.ac.za

In [ ]:
import re
s = 'Hello from csev@umich.edu to cwen@iupui.edu about the meeting @2PM'
lst = re.findall(r'\S+@\S+', s)  # raw string keeps the backslashes intact
print(lst)

The regular expression would match twice (csev@umich.edu and cwen@iupui.edu) but it would not match the string "@2PM" because there are no non-blank characters before the at-sign. We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an e-mail address as follows:


In [ ]:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line) # some emails contain gross < characters!
    if len(x) > 0 :
        print(x)

Some of our E-mail addresses have incorrect characters like "<" or ";" at the beginning or end. Let's declare that we are only interested in the portion of the string that starts and ends with a letter or a number.

To do this, we use another feature of regular expressions. Square brackets are used to indicate a set of multiple acceptable characters we are willing to consider matching. In a sense, the "\S" is asking to match the set of "non-whitespace characters". Now we will be a little more explicit in terms of the characters we will match.

Here is our new regular expression:

[a-zA-Z0-9]\S*@\S*[a-zA-Z]

In [ ]:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0 :
        print(x)
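To see the difference between the two patterns without opening the file, here is a small sketch that runs both on two of the sample lines shown earlier. The variable names are only illustrative.

```python
import re

# Two sample lines in the style of mbox-short.txt
lines = [
    'Return-Path: <postmaster@collab.sakaiproject.org>',
    '          for <source@collab.sakaiproject.org>;',
]

results = []
for line in lines:
    loose = re.findall(r'\S+@\S+', line)                        # keeps '<', '>' and ';'
    strict = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)    # trims them off
    results.append((loose, strict))
    print(loose, strict)
```

The bracketed sets force the match to begin with a letter or digit and end with a letter, which is why the stray punctuation drops out.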

Combining search and extraction

If we want to find numbers on lines that start with the string "X-" such as:

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

We don't just want any floating point numbers from any lines. We only want to extract numbers from lines that have the above syntax.

To match we write

^X\S*: [0-9.]+

In [ ]:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^X\S*: [0-9.]+', line) :
        print(line)

But now we have to solve the problem of extracting the numbers. While it would be simple enough to use split, we can use another feature of regular expressions to both search and parse the line at the same time.

Parentheses are another special character in regular expressions. When you add parentheses to a regular expression, they are ignored when matching the string. But when you are using findall(), parentheses indicate that while you want the whole expression to match, you are only interested in extracting the portion of the substring that falls inside the parentheses.

So we make the following change to our program:


In [ ]:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9]\.[0-9]+)', line)
    if len(x) > 0 :
        print(x)
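findall() returns the captured text as strings, so one more step is usually needed before doing arithmetic. The sketch below applies the same pattern to in-memory lines and converts the captures to floats. The third line is a hypothetical non-matching line added for illustration.

```python
import re

lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-Plane is behind a Cessna: 4.2 miles',  # hypothetical line that should not match
]

values = []
for line in lines:
    # Only the part inside the parentheses is returned
    found = re.findall(r'^X\S*: ([0-9]\.[0-9]+)', line)
    if found:
        values.append(float(found[0]))

print(values)
```

The third line is skipped because the text right after the leading "X-Plane" is a space rather than ": ", so the pattern fails to match the line at all.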

Escape characters

What if we want to match a literal $, ^, * and so on?

Prefix the character with the escape character, a backslash (\).


In [ ]:
import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)  # \$ matches a literal dollar sign
print(y)

In [ ]:
import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'c\S+$', x)  # unescaped $ anchors the match at the end of the string
print(y)
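When the text to match is a fixed string that happens to contain special characters, the standard library's re.escape() can add the backslashes for you. Here is a minimal sketch, where the variable names are only illustrative:

```python
import re

text = 'We just received $10.00 for cookies.'

# Escape both the dollar sign and the period to match them literally
amounts = re.findall(r'\$[0-9]+\.[0-9]+', text)
print(amounts)

# re.escape() builds an escaped pattern from a literal string
literal = '$10.00'
pattern = re.escape(literal)
print(pattern)
matches = re.findall(pattern, text)
print(matches)
```

Escaping by hand is fine for short patterns. re.escape() is mostly useful when the literal text comes from user input or another variable.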

Summary

^ Matches the beginning of the line.

$ Matches the end of the line.

. Matches any character (a wildcard).

\s Matches a whitespace character.

\S Matches a non-whitespace character (opposite of \s).

* Applies to the immediately preceding character and indicates to match zero or more of the preceding character.

*? Applies to the immediately preceding character and indicates to match zero or more of the preceding character in "non-greedy mode".

+ Applies to the immediately preceding character and indicates to match one or more of the preceding character.

+? Applies to the immediately preceding character and indicates to match one or more of the preceding character in "non-greedy mode".

[aeiou] Matches a single character as long as that character is in the specified set. In this example, it would match "a", "e", "i", "o" or "u" but no other characters.

[a-z0-9] You can specify ranges of characters using the minus sign. This example is a single character that must be a lower case letter or a digit.

[^A-Za-z] When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an upper or lower case letter.

( ) When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().

\b Matches the empty string, but only at the start or end of a word.

\B Matches the empty string, but not at the start or end of a word.

\d Matches any decimal digit; equivalent to the set [0-9].

\D Matches any non-digit character; equivalent to the set [^0-9].
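A few of the entries above are easier to grasp with a quick experiment. The sketch below tries the word-boundary, digit and negated-set entries on a made-up string (the string is an assumption chosen for illustration):

```python
import re

text = 'cat concat 42 cats'

anywhere = re.findall(r'cat', text)           # 'cat' wherever it occurs
whole_word = re.findall(r'\bcat\b', text)     # 'cat' only as a complete word
inside = re.findall(r'\w*\Bcat', text)        # 'cat' that does not start a word
digits = re.findall(r'\d+', text)             # runs of decimal digits
non_letters = re.findall(r'[^A-Za-z]+', text) # runs of anything but letters

print(anywhere)
print(whole_word)
print(inside)
print(digits)
print(non_letters)
```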

Bonus section for Unix users

Support for searching files using regular expressions has been built into the Unix operating system since its early days, and it is available in nearly all programming languages in one form or another.

As a matter of fact, there is a command-line program built into Unix called grep that does pretty much the same thing as the search() examples in this chapter. Its name comes from the ed editor command g/re/p (globally search for a regular expression and print the matching lines). So if you have a Macintosh or Linux system, you can try the following command in your command line window.

$ grep '^From:' mbox-short.txt
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu

Exercises

  • Exercise 1 Write a simple program to simulate the operation of the grep command on Unix. Ask the user to enter a regular expression and count the number of lines that matched the regular expression:
$ python grep.py
Enter a regular expression: ^Author
mbox.txt had 1798 lines that matched ^Author

$ python grep.py
Enter a regular expression: ^X-
mbox.txt had 14368 lines that matched ^X-

$ python grep.py
Enter a regular expression: java$
mbox.txt had 4218 lines that matched java$
  • Exercise 2 Write a program to look for lines of the form
    New Revision: 39772
    and extract the number from each of the lines using a regular expression and the findall() method. Compute the average of the numbers and print out the average.
Enter file:mbox.txt
38549.7949721

Enter file:mbox-short.txt
39756.9259259

World's simplest browser


In [ ]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
# Sockets send bytes in Python 3; HTTP lines end in CRLF and a blank line ends the request
cmd = 'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()

In [ ]:
import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.sendall(b'GET http://www.py4inf.com/cover.jpg HTTP/1.0\r\n\r\n')

count = 0
picture = b""  # accumulate raw bytes, not text
while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")  # binary mode so the image bytes are written unchanged
fhand.write(picture)
fhand.close()
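The header/body split can be tried without a network connection. The sketch below uses a hypothetical canned response (an assumption standing in for what recv() would return as bytes in Python 3) and applies the same "find the blank line" step:

```python
# Hypothetical canned response, standing in for the bytes collected from recv()
response = (b'HTTP/1.1 200 OK\r\n'
            b'Content-Type: image/jpeg\r\n'
            b'Content-Length: 4\r\n'
            b'\r\n'
            b'\xff\xd8\xff\xe0')

# The header ends at the first blank line, i.e. two CRLF pairs in a row
pos = response.find(b'\r\n\r\n')
header = response[:pos].decode()
body = response[pos + 4:]   # skip the 4 bytes of '\r\n\r\n'

print(header)
print(len(body))
```

Only the header is decoded to text. The body stays as bytes because image data is not valid text.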

In [ ]:
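import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()  # read() returns bytes
# Non-greedy .*? stops at the first closing quote after each link
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())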
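The regular expression approach is quick but brittle. Here is a small sketch on a made-up HTML fragment (the fragment is an assumption for illustration) showing which links this pattern picks up and which it misses:

```python
import re

# Hypothetical HTML fragment with three differently written links
html = ('<a href="http://example.com/a">A</a>'
        '<a href="/relative/b">B</a>'
        "<a href='http://example.com/c'>C</a>")

links = re.findall(r'href="(http://.*?)"', html)
print(links)
```

Only the first link is found: the second is a relative URL and the third uses single quotes. Cases like these are one reason to prefer an HTML parser such as BeautifulSoup for real pages.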

In [ ]:
import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

