Since most students in this class use Windows 7, I will use Windows 7 for illustration of the setup. Setting up the environmnet in Mac OS and Linux should be similar. Please note that the code should produce the same results whichever operating system (even on your smart phone) you are using because Python is platform independent.
Download the Python 3.5 version of Anaconda that matches your operating system from this link. You can accept the default options during installation. To see if your Windows is 32 bit or 64 bit, check here
You can save and run this document using the Jupyter notebook (previously known as IPython notebook). Another tool that I recommend would be PyCharm, which has a free community edition.
This is a tutorial based on the official Python Tutorial for Python 3.5.1. If you need a little more motivation to learn this programming language, consider reading this article.
In [ ]:
width = 20
height = 5*9
width * height
In [ ]:
tax = 8.25 / 100
price = 100.50
price * tax
In [ ]:
price + _
In [ ]:
round(_, 2)
In [ ]:
print('spam email')
In [ ]:
# This would cause error
print('doesn't')
In [ ]:
# One way of doing it correctly
print('doesn\'t')
In [ ]:
# Another way of doing it correctly
print("doesn't")
In [ ]:
print('''
Usage: thingy [OPTIONS]
-h Display this usage message
-H hostname Hostname to connect to
''')
In [ ]:
print('''Cheng highly recommends Python programming language''')
In [ ]:
word = 'HELP' + 'A'
word
Index in the Python way
In [ ]:
word[0]
In [ ]:
word[4]
In [ ]:
# endding index not included
word[0:2]
In [ ]:
word[2:4]
In [ ]:
# length of a string
len(word)
In [ ]:
a = ['spam', 'eggs', 100, 1234]
a
In [ ]:
a[0]
In [ ]:
a[3]
In [ ]:
a[2:4]
In [ ]:
sum(a[2:4])
Built-in functions like sum
and len
are explained in the document too. Here is a link to it.
In [ ]:
a
In [ ]:
a[2] = a[2] + 23
a
In [ ]:
q = [2, 3]
p = [1, q, 4]
p
In [ ]:
len(p)
In [ ]:
p[1]
In [ ]:
p[1][0]
In [ ]:
x=(1,2,3,4)
x[0]
In [ ]:
x[0]=7 # it will raise error since tuple is immutable
In [ ]:
tel = {'jack': 4098, 'sam': 4139}
tel['dan'] = 4127
tel
In [ ]:
tel['jack']
In [ ]:
del tel['sam']
tel
In [ ]:
tel['mike'] = 4127
tel
In [ ]:
# Is dan in the dict?
'dan' in tel
In [ ]:
for key in tel:
print('key:', key, '; value:', tel[key])
In [ ]:
import collections
od = collections.OrderedDict(sorted(tel.items()))
In [ ]:
od
Quiz: how to print the tel dict sorted by the key?
Ask a user to input a number, if it's negative, x=0, else if it's 1
In [ ]:
x = int(input("Please enter an integer for x: "))
if x < 0:
x = 0
print('Negative; changed to zero')
elif x == 0:
print('Zero')
elif x == 1:
print('Single')
else:
print('More')
In [ ]:
# multiple assignment to assign two variables at the same time
a, b = 0, 1
while a < 10:
print(a)
a, b = b, a+b
In [ ]:
# Measure some strings:
words = ['cat', 'window', 'defenestrate']
for i in words:
print(i, len(i))
The University of Texas at Dallas is reviewed on Yelp.com. It shows on this page that it attracted 38 reviews so far from various reviewers. You learn from the webpage that Yelp displays at most 20 recommended reviews per page and we need to go to page 2 to see the review 21 to review 38. You notice that the URL in the address box of your browser changed when you click on the Next page. Previouly, on page 1, the URL is:
http://www.yelp.com/biz/university-of-texas-at-dallas-richardson
On page 2, the URL is:
http://www.yelp.com/biz/university-of-texas-at-dallas-richardson?start=20
You learn that probably Yelp use this ?start=20
to skip(or offset
in MySQL language) the first 20 records to show you the next 18 reviews. You can use this pattern of going to the next page to enumerate all pages of a business in Yelp.com.
In this exmaple, we are going get the rating (number of stars) and the date for each of these 38 reviews.
The general procedure to crawl any web page is the following:
for
or while
loop) all the pages.For this example, I did a screenshot with my annotation to illustrate the critical patterns in the Yelp page for UTD reviews.
review_start_pattern
is a variable to stroe the string of '<div class="review-wrapper">'
to locate the beginning of an individual review.rating_pattern
is a variable to stroe the string of '<i class="star-img stars_'
to locate the rating.date_pattern
is a variable to stroe the string of '"datePublished" content="'
to locate date of the rating.It takes some trails and errors to figure out what are good string patterns to use to locate the information you need in an html. For example, I found that '<div class="review-wrapper">'
appeared exactly 20 times in the webpage, which is a good indication that it corresponds to the 20 individual reviews on the page (the review-wrapper
tag seems to imply that too).
In [ ]:
# crawl_UTD_reviews
# Author: Cheng Nie
# Email: me@chengnie.com
# Date: Feb 8, 2016
# Updated: Feb 12, 2016
from urllib.request import urlopen
num_pages = 2
reviews_per_page = 20
# the file we will save the rating and date
out_file = open('UTD_reviews.csv', 'w')
# the url that we need to locate the page for UTD reviews
url = 'http://www.yelp.com/biz/university-of-texas-at-dallas-\
richardson?start={start_number}'
# the three string patterns we just explained
review_start_pattern = '<div class="review-wrapper">'
rating_pattern = '<i class="star-img stars_'
date_pattern = '"datePublished" content="'
reviews_count = 0
for page in range(num_pages):
print('processing page', page)
# open the url and save the source code string to page_content
html = urlopen(url.format(start_number = page * reviews_per_page))
page_content = html.read().decode('utf-8')
# locate the beginning of an individual review
review_start = page_content.find(review_start_pattern)
while review_start != -1:
# it means there at least one more review to be crawled
reviews_count += 1
# get the rating
cut_front = page_content.find(rating_pattern, review_start) \
+ len(rating_pattern)
cut_end = page_content.find('" title="', cut_front)
rating = page_content[cut_front:cut_end]
# get the date
cut_front = page_content.find(date_pattern, cut_end) \
+ len(date_pattern)
cut_end = page_content.find('">', cut_front)
date = page_content[cut_front:cut_end]
# save the data into out_file
out_file.write(','.join([rating, date]) + '\n')
review_start = page_content.find(review_start_pattern, cut_end)
print('crawled', reviews_count, 'reviews so far')
out_file.close()
In [ ]:
def fib(n): # write Fibonacci series up to n
"""Print a Fibonacci series up to n."""
a, b = 0, 1
while a < n:
print(a)
a, b = b, a+b
In [ ]:
fib(200)
In [ ]:
fib(2000000000000000) # do not need to worry about the type of a,b
Create some data in Python and populate the database with the created data. We want to create a table with 3 columns: id, name, and age to store information about 50 kids in a day care.
The various modules that extend the basic Python funtions are indexed here.
In [ ]:
# output for eyeballing the data
import string
import random
# fix the pseudo-random sequences for easy replication
# It will generate the same random sequences
# of nubmers/letters with the same seed.
random.seed(123)
for i in range(50):
# Data values separated by comma(csv file)
print(i+1,random.choice(string.ascii_uppercase),
random.choice(range(6)), sep=',')
In [ ]:
# write the data to a file called data.csv
random.seed(123)
out_file=open('data.csv','w')
columns=['id','name','age']
out_file.write(','.join(columns)+'\n')
for i in range(50):
row=[str(i+1),random.choice(string.ascii_uppercase),
str(random.choice(range(6)))]
out_file.write(','.join(row)+'\n')
else:
out_file.close()
In [ ]:
# load data back into Python
for line in open('data.csv', 'r'):
print(line)
In [ ]:
# To disable to the new line added for each print
# use the end parameter in print function
for line in open('data.csv', 'r'):
print(line, end = '')
Install MySQL 5.7 Workbench first following this link. You might also need to install the prerequisites listed here before you can install the Workbench. The Workbench is an interface to interact with MySQL database. The actual MySQL database server requires a second step: run the MySQL Installer, then add and intall the MySQL servers using the Installer. You can accept the default options during installation. Later, you will connect to MySQL using the password you set during the installation and configuration. I set the password to be pythonClass
.
The documentation for MySQL is here.
To get comfortable with it, you might find this tutorial of Structured Query Language(SQL) to be helpful.
In [ ]:
These commands are executed in MySQL query tab, not in Python.
In mysql, you need to end all commands with ;
#
# ----------------------- In MySQL ------------------
# display the database
show databases;
# create a database named test
create database test;
# choose a database for future commands
use test;
# display the tables in test database
show tables;
# create a new table named example
create table example(
id int not null,
name varchar(30),
age tinyint,
primary key(id));
# now we should have the example table
show tables;
# how was the table example defined again?
desc example;
# is there anything in the example table?
select * from example;
# import csv file into MySQL database
load data local infile "C:\\Users\\cxn123430\\Downloads\\data.csv" into table test.example FIELDS TERMINATED BY ',' lines terminated by '\r\n' ignore 1 lines;
# is there anything now?
select * from example;
# drop the table
drop table example;
# does the example table still exist?
show tables;
Quiz: import the crawled Yelp review file UTD_reviews.csv
into a table in your database.
Since the official MySQL 5.7 provides support for Python upto Version 3.4 as of writing this tutorial, we need to install a package named mysql-connector-python
to provide support for the cutting-edge Python 3.5. Execute the following line in Windows command line to install it.
This is relatively easy since you have the Anancoda installed. We can use the conda
command to intall that package in the Windows command line.
In [ ]:
#
# ----------------------- In Windows command line(cmd) ------------------
conda install mysql-connector-python
Remember that we use Python to save 50 kids' infomation into a csv file named data.csv
first and then use the load
command in MySQL to import the data? We don't actually need to save the data.csv
file to hard disk. And we can "load" the same data into database without leaving Python.
In [ ]:
#
# ----------------------- In Python ------------------
# access table from Python
# connect to MySQL in Python
import mysql.connector
cnx = mysql.connector.connect(user='root',
password='pythonClass',
database='test')
# All DDL (Data Definition Language) statements are
# executed using a handle structure known as a cursor
cursor = cnx.cursor()
# create a table named example
cursor.execute('''create table example(
id int not null,
name varchar(30),
age tinyint,
primary key(id));''')
cnx.commit()
# write the same data to the example table without saving a csv file
query0_template = '''insert into example (id, name, age) \
values ({id_num},"{c_name}",{c_age});'''
random.seed(123)
for i in range(50):
query0 = query0_template.format(id_num = i+1,
c_name = random.choice(string.ascii_uppercase),
c_age = random.choice(range(6)))
print(query0)
cursor.execute(query0)
cnx.commit()
To get better understanding of the table we just created. We will use MySQL command line again.
In [ ]:
#
# ----------------------- In MySQL ------------------
# To get the totoal number of records
select count(*) from example;
# To get age histgram
select distinct age, count(*) from example group by age;
# create a copy of the example table for modifying.
create table e_copy select * from example;
select * from e_copy;
# note that the primary key is not copied to the e_copy table
desc e_copy;
# add the primary key to e_copy table using the alter command
alter table e_copy add primary key(id);
# is it done correctly?
desc e_copy;
# does MySQL take the primary key seriously?
insert into e_copy (id, name, age) values (null,'P',6);
insert into e_copy (id, name, age) values (3,'P',6);
# alright, let's insert something else
insert into e_copy (id, name, age) values (51,'P',6);
insert into e_copy (id, name, age) values (52,'Q',null);
insert into e_copy (id, name, age) values (54,'S',null),(55,'T',null);
insert into e_copy (id, name) values (53,'R');
# who is the child with id of 53?
select * from e_copy where id = 53;
# update the age for this child.
update e_copy set age=3 where id=53;
select * from e_copy where id = 53;
# what's inside the table now?
select * from e_copy;
Again, you can actually do everything in Python without going to the MySQL workbench.
In [ ]:
#
# ----------------------- In Python ------------------
# query all the content in the e_copy table
cursor.execute('select * from e_copy;')
for i in cursor:
print(i)
Now we want to add one new column of mother_name
to record the mother's name for each child in the child care.
In [ ]:
#
# ----------------------- In Python ------------------
# # example for adding new info for existing record
cursor.execute('alter table e_copy add mother_name varchar(1) default null')
cnx.commit()
query1_template='update e_copy set mother_name="{m_name}" where id={id_num};'
random.seed(333)
for i in range(55):
query1=query1_template.format(m_name = random.choice(string.ascii_uppercase),id_num = i+1)
print(query1)
cursor.execute(query1)
cnx.commit()
In [ ]:
#
# ----------------------- In Python ------------------
# example for insert new records
query2_template='insert into e_copy (id, name,age,mother_name) \
values ({id_num},"{c_name}",{c_age},"{m_name}")'
for i in range(10):
query2=query2_template.format(id_num = i+60,
c_name = random.choice(string.ascii_uppercase),
c_age = random.randint(0,6),
m_name = random.choice(string.ascii_uppercase))
print(query2)
cursor.execute(query2)
cnx.commit()
Check if you've updated the data successfully in MySQL database from Python
In [ ]:
#
# ----------------------- In Python ------------------
# query all the content in the e_copy table
cursor.execute('select * from e_copy;')
for i in cursor:
print(i)
In [ ]:
#
# ----------------------- In MySQL ------------------
Use the GUI to export the database into a self-contained file (the extension name would be sql)
Before you run this part, you need to download the digits.txt and spaces.txt files to the same folder as this notebook
What's in the digits.txt
file?
In [ ]:
import re
infile=open('digits.txt','r')
content=infile.read()
print(content)
How can I find all the numbers in a file like digits.txt
?
In [ ]:
# Find all the numbers in the file
numbers=re.findall('\d+',content)
for n in numbers:
print(n)
How can I find all the equations?
In [ ]:
# find equations
equations=re.findall('(\d+)=\d+',content)
for e in equations:
print(e)
The equations seem to be incorrect, how can I correct them without affecting other text information?
In [ ]:
# subsitute equations to correct them
# use the left hand side number
print(re.sub('(\d+)=\d+','\1=\1',content))
In [ ]:
# another way to subsitute equations to correct them
# use the right hand side number
print(re.sub('\d+=(\d+)','\\1=\\1',content))
In [ ]:
# Save to file
print(re.sub('(\d+)=\d+','\\1=\\1',content), file = open('digits_corrected.txt', 'w'))
Preprocessing a text file with various types of spaces.
In [ ]:
infile=open('spaces.txt','r')
content=infile.read()
print(content)
In [ ]:
print(re.sub('[\t ]+','\t',content))
In [ ]:
print(re.sub('[\t ]+','\t',content), file = open('spaces_corrected.txt', 'w'))
In [ ]:
word = 'HELP' + 'A'
word
In [ ]:
# first index default to 0 and second index default to the size
word[:2]
In [ ]:
# It's equivalent to
word[0:2]
In [ ]:
# Everything except the first two characters
word[2:]
In [ ]:
# It's equivalent to
word[2:len(word)]
How about selecting every other character?
In [ ]:
# start: end: step
word[0::2]
In [ ]:
# It's equivalent to
word[0:len(word):2]
In [ ]:
word[-1] # The last character
In [ ]:
word[-2] # The last-but-one character
In [ ]:
word[-2:] # The last two characters
In [ ]:
word[:-2] # Everything except the last two characters
In [ ]:
a = ['spam', 'eggs', 100, 1234]
a
In [ ]:
a[-2]
In [ ]:
a[1:-1]
In [ ]:
a[:2] + ['bacon', 2*2]
In [ ]:
3*a[:3] + ['Boo!']
In [ ]:
# Replace some items:
a[0:2] = [1, 12]
a
In [ ]:
# Remove some:
del a[0:2] # or a[0:2] = []
a
In [ ]:
# create some copies for change
b = a.copy()
c = a.copy()
In [ ]:
# Insert some:
b[1:1] = ['insert', 'some']
b
In [ ]:
# inserting at one position is not the same as changing one element
c[1] = ['insert', 'some']
c
In [ ]:
# loop way
cubes = []
for x in range(11):
cubes.append(x**3)
cubes
In [ ]:
# map way
def cube(x):
return x*x*x
list(map(cube, range(11)))
In [ ]:
# list comprehension way
[x**3 for x in range(11)]
In [ ]:
result = []
for i in range(11):
if i%2 == 0:
result.append(i)
else:
print(result)
In [ ]:
# Use if in list comprehension
[i for i in range(11) if i%2==0]
In [ ]:
l=[1,3,5,6,8,10]
[i for i in l if i%2==0]
All three books are available online legally through the UTD library. The Amazon links are given because the reviews might help you decide wheter or not to read them.