Since most students in this class use Windows 7, I will use Windows 7 for illustration of the setup. Setting up the environmnet in Mac OS and Linux should be similar. Please note that the code should produce the same results whichever operating system (even on your smart phone) you are using because Python is platform independent.
Download the Python 3.5 version of Anaconda that matches your operating system from this link. You can accept the default options during installation. To see if your Windows is 32 bit or 64 bit, check here
You can save and run this document using the Jupyter notebook (previously known as IPython notebook). Another tool that I recommend would be PyCharm, which has a free community edition.
This is a tutorial based on the official Python Tutorial for Python 3.5.1. If you need a little more motivation to learn this programming language, consider reading this article.
In [ ]:
width = 20
height = 5*9
width * height
In [ ]:
tax = 8.25 / 100
price = 100.50
price * tax
In [ ]:
price + _
In [ ]:
round(_, 2)
In [ ]:
print('spam email')
In [ ]:
# This would cause error
print('doesn't')
In [ ]:
# One way of doing it correctly
print('doesn\'t')
In [ ]:
# Another way of doing it correctly
print("doesn't")
In [ ]:
print('''
Usage: thingy [OPTIONS]
-h Display this usage message
-H hostname Hostname to connect to
''')
In [ ]:
print('''Cheng highly recommends Python programming language''')
In [ ]:
word = 'HELP' + 'A'
word
Index in the Python way
In [ ]:
word[0]
In [ ]:
word[4]
In [ ]:
# endding index not included
word[0:2]
In [ ]:
word[2:4]
In [ ]:
# length of a string
len(word)
In [ ]:
a = ['spam', 'eggs', 100, 1234]
a
In [ ]:
a[0]
In [ ]:
a[3]
In [ ]:
a[2:4]
In [ ]:
sum(a[2:4])
Built-in functions like sum
and len
are explained in the document too. Here is a link to it.
In [ ]:
a
In [ ]:
a[2] = a[2] + 23
a
In [ ]:
q = [2, 3]
p = [1, q, 4]
p
In [ ]:
len(p)
In [ ]:
p[1]
In [ ]:
p[1][0]
In [ ]:
x=(1,2,3,4)
x[0]
In [ ]:
x[0]=7 # it will raise error since tuple is immutable
In [ ]:
tel = {'jack': 4098, 'sam': 4139}
tel['dan'] = 4127
tel
In [ ]:
tel['jack']
In [ ]:
del tel['sam']
tel
In [ ]:
tel['mike'] = 4127
tel
In [ ]:
# Is dan in the dict?
'dan' in tel
In [ ]:
for key in tel:
print('key:', key, '; value:', tel[key])
Quiz: how to print the tel dict sorted by the key?
Ask a user to input a number, if it's negative, x=0, else if it's 1
In [ ]:
x = int(input("Please enter an integer for x: "))
if x < 0:
x = 0
print('Negative; changed to zero')
elif x == 0:
print('Zero')
elif x == 1:
print('Single')
else:
print('More')
In [ ]:
a, b = 0, 1 # multiple assignment
while a < 10:
print(a)
a, b = b, a+b
In [ ]:
# Measure some strings:
words = ['cat', 'window', 'defenestrate']
for i in words:
print(i, len(i))
In [ ]:
def fib(n): # write Fibonacci series up to n
"""Print a Fibonacci series up to n."""
a, b = 0, 1
while a < n:
print(a)
a, b = b, a+b
In [ ]:
fib(200)
In [ ]:
fib(2000000000000000) # do not need to worry about the type of a,b
Create some data in Python and populate the database with the created data. We want to create a table with 3 columns: id, name, and age to store information about 50 kids in a day care.
The various modules that extend the basic Python funtions are indexed here.
In [ ]:
# output for viewing first
import string
import random
# fix the pseudo-random sequences for easy replication
# It will generate the same random sequences
# of nubmers/letters with the same seed.
random.seed(123)
for i in range(50):
# Data values separated by comma(csv file)
print(i+1,random.choice(string.ascii_uppercase),
random.choice(range(6)), sep=',')
In [ ]:
# write the data to a file
random.seed(123)
out_file=open('data.csv','w')
columns=['id','name','age']
out_file.write(','.join(columns)+'\n')
for i in range(50):
row=[str(i+1),random.choice(string.ascii_uppercase),
str(random.choice(range(6)))]
out_file.write(','.join(row)+'\n')
else:
out_file.close()
In [ ]:
# read data into Python
for line in open('data.csv', 'r'):
print(line)
Install MySQL 5.7 Workbench first following this link. You might also need to install the prerequisites listed here before you can install the Workbench. The Workbench is an interface to interact with MySQL database. The actual MySQL database server requires a second step: run the MySQL Installer, then add and intall the MySQL servers using the Installer. You can accept the default options during installation. Later, you will connect to MySQL using the password you set during the installation and configuration. I set the password to be pythonClass
.
The documentation for MySQL is here.
To get comfortable with it, you might find this tutorial of Structured Query Language(SQL) to be helpful.
The University of Texas at Dallas is reviewed on Yelp.com. It shows on this page that it attracted 38 reviews so far from various reviewers. You learn from the webpage that Yelp displays at most 20 recommended reviews per page and we need to go to page 2 to see the review 21 to review 38. You notice that the URL in the address box of your browser changed when you click on the Next page. Previouly, on page 1, the URL is:
http://www.yelp.com/biz/university-of-texas-at-dallas-richardson
On page 2, the URL is:
http://www.yelp.com/biz/university-of-texas-at-dallas-richardson?start=20
You learn that probably Yelp use this ?start=20
to skip(or offset
in MySQL language) the first 20 records to show you the next 18 reviews. You can use this pattern of going to the next page to enumerate all pages of a business in Yelp.com.
In this exmaple, we are going get the rating (number of stars) and the date for each of these 38 reviews.
The general procedure to crawl any web page is the following:
for
or while
loop) all the pages.For this example, I did a screenshot with my annotation to illustrate the critical patterns in the Yelp page for UTD reviews.
review_start_pattern
is a variable to stroe the string of '<div class="review-wrapper">'
to locate the beginning of an individual review.rating_pattern
is a variable to stroe the string of '<i class="star-img stars_'
to locate the rating.date_pattern
is a variable to stroe the string of '"datePublished" content="'
to locate date of the rating.It takes some trails and errors to figure out what are good string patterns to use to locate the information you need in an html. For example, I found that '<div class="review-wrapper">'
appeared exactly 20 times in the webpage, which is a good indication that it corresponds to the 20 individual reviews on the page (the review-wrapper
tag seems to imply that too).
In [ ]:
# crawl_UTD_reviews
# Author: Cheng Nie
# Email: me@chengnie.com
# Date: Feb 8, 2016
# Updated: Feb 12, 2016
from urllib.request import urlopen
num_pages = 2
reviews_per_page = 20
# the file we will save the rating and date
out_file = open('UTD_reviews.csv', 'w')
# the url that we need to locate the page for UTD reviews
url = 'http://www.yelp.com/biz/university-of-texas-at-dallas-\
richardson?start={start_number}'
# the three string patterns we just explained
review_start_pattern = '<div class="review-wrapper">'
rating_pattern = '<i class="star-img stars_'
date_pattern = '"datePublished" content="'
reviews_count = 0
for page in range(num_pages):
print('processing page', page)
# open the url and save the source code string to page_content
html = urlopen(url.format(start_number = page * reviews_per_page))
page_content = html.read().decode('utf-8')
# locate the beginning of an individual review
review_start = page_content.find(review_start_pattern)
while review_start != -1:
# it means there at least one more review to be crawled
reviews_count += 1
# get the rating
cut_front = page_content.find(rating_pattern, review_start) \
+ len(rating_pattern)
cut_end = page_content.find('" title="', cut_front)
rating = page_content[cut_front:cut_end]
# get the date
cut_front = page_content.find(date_pattern, cut_end) \
+ len(date_pattern)
cut_end = page_content.find('">', cut_front)
date = page_content[cut_front:cut_end]
# save the data into out_file
out_file.write(','.join([rating, date]) + '\n')
review_start = page_content.find(review_start_pattern, cut_end)
print('crawled', reviews_count, 'reviews so far')
out_file.close()
Quiz: import the crawled file into a table in your database.
In [ ]:
word
In [ ]:
# first index default to 0 and second index default to the size
word[:2]
In [ ]:
# It's equivalent to
word[0:2]
In [ ]:
# Everything except the first two characters
word[2:]
In [ ]:
# It's equivalent to
word[2:len(word)]
In [ ]:
# start: end: step
word[0::2]
In [ ]:
word[0:len(word):2]
In [ ]:
word[-1] # The last character
In [ ]:
word[-2] # The last-but-one character
In [ ]:
word[-2:] # The last two characters
In [ ]:
word[:-2] # Everything except the last two characters
In [ ]:
a
In [ ]:
a[-2]
In [ ]:
a[1:-1]
In [ ]:
a[:2] + ['bacon', 2*2]
In [ ]:
3*a[:3] + ['Boo!']
In [ ]:
# Replace some items:
a[0:2] = [1, 12]
a
In [ ]:
# Remove some:
a[0:2] = [] # or del a[0:2]
a
In [ ]:
# Insert some:
a[1:1] = ['insert', 'some']
a
In [ ]:
# inserting at one position is not the same as changing one element
# a=[1, 12, 100, 1234]
a = [123, 1234]
sum(a)
a[1] = ['insert', 'some']
a
In [ ]:
# loop way
cubes = []
for x in range(11):
cubes.append(x**3)
cubes
In [ ]:
# map way
def cube(x):
return x*x*x
list(map(cube, range(11)))
In [ ]:
# list comprehension way
[x**3 for x in range(11)]
In [ ]:
result = []
for i in range(11):
if i%2 == 0:
result.append(i)
else:
print(result)
In [ ]:
[i for i in range(11) if i%2==0]
In [ ]:
l=[1,3,5,6,8,10]
[i for i in l if i%2==0]
Since the official MySQL 5.7.11 provides support for Python upto Version 3.4, we need to install a package to provide to support the Python 3.5. Execute the following line in Windows command line to install it.
In [ ]:
#
# ----------------------- In Python ------------------
# access table from Python
# connect to MySQL in Python
import mysql.connector
cnx = mysql.connector.connect(user='root',
password='pythonClass',
database='test')
# All DDL (Data Definition Language) statements are
# executed using a handle structure known as a cursor
cursor = cnx.cursor()
#cursor.execute("")
# write the same data to the example table
query0 = '''insert into example (id, name, age) \
values ({id_num},"{c_name}",{c_age});'''
random.seed(123)
for i in range(50):
query1 = query0.format(id_num = i+1,
c_name = random.choice(string.ascii_uppercase),
c_age = random.choice(range(6)))
print(query1)
cursor.execute(query1)
cnx.commit()
To get better understanding of the table we just created. We will use MySQL command line again.
In [ ]:
#
# ----------------------- In Python ------------------
#
cursor.execute('select * from e_copy;')
for i in cursor:
print(i)
In [ ]:
#
# ----------------------- In Python ------------------
#
# # example for adding new info for existing record
# cursor.execute('alter table e_copy add mother_name varchar(1) default null')
query='update e_copy set mother_name="{m_name}" where id={id_num};'
# random.seed(333)
for i in range(50):
query1=query.format(m_name = random.choice(string.ascii_uppercase),id_num = i+1)
print(query1)
cursor.execute(query1)
cnx.commit()
In [ ]:
#
# ----------------------- In Python ------------------
#
# example for insert new records
query2='insert into e_copy (id, name,age,mother_name) \
values ({id_num},"{c_name}",{c_age},"{m_name}")'
for i in range(10):
query3=query2.format(id_num = i+60,
c_name = random.choice(string.ascii_uppercase),
c_age = random.randint(0,6),
m_name = random.choice(string.ascii_uppercase))
print(query3)
cursor.execute(query3)
cnx.commit()
# check if you've updated the data successfully in MySQL
In [ ]:
import re
# digits
# find all the numbers
infile=open('digits.txt','r')
content=infile.read()
print(content)
In [ ]:
# Find all the numbers in the file
numbers=re.findall(r'\d+',content)
for n in numbers:
print(n)
In [ ]:
# find equations
equations=re.findall(r'(\d+)=\d+',content)
for e in equations:
print(e)
In [ ]:
# subsitute equations to correct them
print(re.sub(r'(\d+)=\d+',r'\1=\1',content))
In [ ]:
# Save to file
print(re.sub(r'(\d+)=\d+',r'\1=\1',content), file = open('digits_corrected.txt', 'w'))
All three books are available online legally through the UTD library. The Amazon links are given because the reviews might help you decide wheter or not to read them.