This tutorial was originally drawn from Scipy Lecture Notes by this list of contributors. I've continued to modify it as I use it.
This work is CC-BY. Author: Aaron L. Brenner
Python is a programming language, as are C, Fortran, BASIC, PHP, etc. Some specific features of Python are as follows:
See https://www.python.org/about/ for more information about distinguishing features of Python.
If you are interested in moving forward with learning Python, it is worth your time to get acquainted with all of these resources. The tutorial will step you through more Python, and you should be familiar with the basics of the Python language and its standard library.
Python documentation home https://docs.python.org/3/
Tutorial https://docs.python.org/3/tutorial/index.html
Python Language Reference https://docs.python.org/3/reference/index.html#reference-index
The Python Standard Library https://docs.python.org/3/library/index.html#library-index
The Python Cookbook, 3rd Edition - This is one of many 'cookbooks'. These can be very useful not only for seeing solutions to common problems, but also as a way to read brief examples of idiomatic code. Reading code snippets in this way can be a great complement to language reference documentation and traditional tutorials. http://chimera.labs.oreilly.com/books/1230000000393/
Also, don't be embarrassed to Google your questions! Try some variation of python [thing] example
* hat tip to Mark Pilgrim
This is a script that inspects a CSV data file and reports on some summary characteristics. Take a minute to read over the code before running it. Don't worry if you don't understand all of what's happening. We'll step through some of this code in more detail as we learn the basics of python. For now, just try to get a feel for what a complete script looks like. After you've read it over, go ahead and execute it.
In [ ]:
# open the source CSV file
csv = open("cars.csv")
# create a list with the column names. we assume the first row contains them.
# we strip the carriage return (if there is one) from the line, then split values on the commas.
# Note: this uses a nifty python feature called 'list comprehension' to do it in one line
column_names = [i for i in csv.readline().strip().split(',')]
# read the rest of the file into a matrix (a list of lists). Use the same strip and split methods.
data = [line.strip().split(',') for line in csv.readlines()]
# we're done reading, so close the file
csv.close()
# now, try to infer the data types of each column from the values in the first row.
# the testing here shows some string methods, like isspace(), isalpha(), isdigit().
# we'll save these data type assumptions because we'll use them later in a report.
column_datatypes = []
for value in data[0]:
    if len(value) < 1 or value.isspace():
        column_datatypes.append('string')
    elif value.isalpha():
        column_datatypes.append('string')
    elif '.' in value or value.isdigit():
        column_datatypes.append('numeric')
    else:
        column_datatypes.append('string')
# now let's do some basic reporting on the csv
# overall stats of the file:
print("this csv file has " + str(len(column_names)) + " columns and " + str(len(data)) + " rows.")
# loop over each column name, do some different things depending on whether we've inferred
# it contains string or numeric values. we declare certain variables with 'False' so even if
# we can't fill them we can test them without an error.
for i, value in enumerate(column_names):
    average_value = False
    highest_value = False
    lowest_value = False
    # if it's a numeric column, we'll get all the values for this column out of our data matrix,
    # convert them to float (remember they are all strings by default), and then get the average,
    # high, and low values. If there's an error doing this, just get the values as strings
    if column_datatypes[i] == 'numeric':
        try:
            column_values = [float(data[j][i]) for j in range(len(data))]
            average_value = sum(column_values)/len(column_values)
            highest_value = sorted(column_values)[-1]
            lowest_value = sorted(column_values)[0]
        except ValueError:
            column_values = [data[j][i] for j in range(len(data))]
    else:
        column_values = [data[j][i] for j in range(len(data))]
    # the set function removes duplicates from a list, so taking its length is equivalent
    # to the number of unique values
    unique_value_count = len(set(column_values))
    # now we start printing. First just the field name. The simple way of formatting a string
    # is with the + operator. Note: we add one to the index because we don't want our list
    # to start with zero.
    # Note: using the + style of string formatting all non-string values have to be cast to strings
    print(str(i+1) + ". \"" + value + "\"")
    # next the type we think it is, and the number of unique values
    # Note: also showing a different, more powerful string formatting method here
    print("\t{0} ({1} of {2} unique)".format(column_datatypes[i], unique_value_count, len(data)))
    # now different details if it's numeric and successfully converted to float, if it's
    # numeric and didn't, and otherwise we assume it's a string.
    if column_datatypes[i] == 'numeric':
        # compare against False explicitly so that an average of 0 still counts as a result
        if average_value is not False:
            print("\taverage value: {0:g}".format(average_value))
            print("\tlowest value: {0:g}".format(lowest_value))
            print("\thighest value: {0:g}".format(highest_value))
        else:
            print("\tNOTE: problems converting values to float!")
    else:
        print("\tfirst value: {0:s}".format(column_values[0]))
        print("\tlast value: {0:s}".format(column_values[-1]))
After you've run this as-is, change the name of the CSV file from cars.csv to cities.csv. Run it again.
Follow along with the instructor in typing instructions:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Two variables a and b have been defined above. Note that one does not declare the type of a variable before assigning its value.
In addition, the type of a variable may change, in the sense that at one point in time it can be equal to a value of a certain type, and a second point in time, it can be equal to a value of a different type. b was first equal to an integer, but it became equal to a string when it was assigned the value ’hello’. But you can see that type often matters, as when we try to print an integer in the midst of a string.
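To make that last point concrete, here is a small sketch of what happens when an integer is joined to a string with +, along with two possible ways around it. The variable names and values are made up for illustration only.

```python
# Hypothetical value, chosen only for illustration.
count = 3

try:
    # str + int is not allowed, so this line raises a TypeError
    message = "there are " + count + " columns"
except TypeError as err:
    message = None
    print("TypeError:", err)

# Option 1: convert the integer to a string explicitly.
message_cast = "there are " + str(count) + " columns"

# Option 2: let str.format() do the conversion.
message_fmt = "there are {0} columns".format(count)

print(message_cast)
print(message_fmt)
```

Both options produce the same text; the second tends to be easier to read once several values are involved.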
In [ ]:
1 + 1
Remember how we saw integers in our CSV script example? The number of columns and the number of rows are integers.
In [ ]:
a = len(column_names)
type(a)
Note: most decimal fractions cannot be represented exactly as binary fractions, and certain operations using floats may lead to surprising results. For more details, start here.
In [ ]:
c = 2.1
type(c)
In [ ]:
type(average_value)
In [ ]:
3 > 4
In [ ]:
test = (3 > 4)
test
In [ ]:
type(test)
A Python shell can therefore replace your pocket calculator, with the basic arithmetic operations +, -, *, /, % (modulo) natively implemented.
Try some things here or follow along with the instructor's examples:
In [ ]:
In [ ]:
In [ ]:
Type conversion (casting):
In [ ]:
float(1)
Commenting code is good practice: it helps others understand your code and, often, helps you understand code that you wrote earlier.
In Python, everything following the hash/pound sign # on a line is a comment. Comments can either be on their own line(s), or in-line. Use in-line comments sparingly.
In [ ]:
# this is a comment. We might say, for example, that we're setting the value of Pi:
pi = 3.14
pie = 'pumpkin' # and this is an in-line comment. Setting the value of pie.
In [ ]:
In [ ]:
What happened with those floats? How could we avoid this? There is an explanation and some suggestions in the Python documentation on floating point arithmetic.
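As one possible illustration of the float question, the sketch below adds 0.1 and 0.2 and then shows three common ways of coping with the result: comparing with a tolerance, rounding, and using the standard library's decimal module. The specific numbers are only an example; which approach fits depends on what you are computing.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2
print(total)          # not exactly 0.3
print(total == 0.3)   # the == comparison fails

# Option 1: compare with a tolerance instead of ==
close_enough = math.isclose(total, 0.3)

# Option 2: round the value (for display or coarse comparison)
rounded = round(total, 2)

# Option 3: use Decimal, built from strings, for exact decimal arithmetic
exact = Decimal("0.1") + Decimal("0.2")

print(close_enough, rounded, exact)
```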
In [ ]:
l = ['red', 'blue', 'green', 'black', 'white']
type(l)
And remember in our CSV script example, we used lots of lists! There was a list to store the column names, which we assumed were in the first row of data:
In [ ]:
column_names
And then each row of the CSV was itself a list, and all the rows were another list. So we used a list of lists, or a matrix.
In [ ]:
data
Indexing: accessing individual objects contained in the list:
In [ ]:
column_names[0]
In [ ]:
column_names[-1]
Indexing starts at 0, not 1!
Slicing: obtaining sublists of regularly-spaced elements:
In [ ]:
column_names[1:3]
Note that l[start:stop] contains the elements with indices i such that start <= i < stop (i ranging from start to stop-1). Therefore, l[start:stop] has (stop - start) elements.
Slicing syntax: l[start:stop:stride]
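A short sketch of the slicing syntax follows. It uses a throwaway list named colors (not one of the lists from the CSV example) and shows that each of the three slicing parameters can be left out.

```python
colors = ['red', 'blue', 'green', 'black', 'white']

from_third = colors[3:]      # start given, stop omitted: through the end
first_three = colors[:3]     # start omitted: from the beginning
every_other = colors[::2]    # stride of 2 over the whole list
middle_step = colors[1:4:2]  # indices 1 and 3

print(from_third)
print(first_three)
print(every_other)
print(middle_step)
```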
Lists are mutable objects and can be modified:
In [ ]:
column_names[0] = 'LOOK AT ME!'
column_names
The elements of a list may have different types:
In [ ]:
l = [3, -200, 'hello']
l
Python offers a large set of functions to modify lists, or query them. Here are a few examples; for more details, see https://docs.python.org/tutorial/datastructures.html#more-on-lists
Add and remove elements:
In [ ]:
L = ['red', 'blue', 'green', 'black', 'white']
L.append('pink')
L
In [ ]:
L.pop() # removes and returns the last item
In [ ]:
L
Add a list to the end of a list with extend()
In [ ]:
L.extend(['pink', 'purple']) # extend L, in-place
In [ ]:
L
In [ ]:
L = L[:-2]
L
Two ways to reverse a list:
In [ ]:
r = L[::-1]
r
In [ ]:
r.reverse() # in-place
r
Concatenate lists:
In [ ]:
r + L
Sort:
In [ ]:
sorted(r) # new object
In [ ]:
r
In [ ]:
r.sort() #in-place
r
We used sorted() in our CSV example a few times. That, along with list indexes, helped us get the lowest and highest value. Here's what it looked like:
highest_value = sorted(column_values)[-1]
lowest_value = sorted(column_values)[0]
Try creating your own list of unsorted items. Can you replicate the highest and lowest value expressions above? What happens if your list is not made up of numbers?
In [ ]:
Methods
The notation r.method() (e.g. r.append(3) and L.pop()) is our first example of object-oriented programming (OOP). Being a list, the object r has a method function that is called using the notation .methodname(). We will talk about functions later in this tutorial.
When you're using Jupyter, to see all the different methods available to a variable, type a period after the variable name and hit the tab key.
In [ ]:
In [ ]:
s = 'Hello, how are you?'
s = "Hi, what's up"
# tripling the quotes allows the string to span more than one line
s = '''Hello,
how are you'''
s = """Hi,
what's up?"""
Double quotes are needed when the string itself contains a single quote (an apostrophe); with single quotes around it, you get a syntax error:
In [ ]:
s = 'Hi, what's up?'
The newline character is \n, and the tab character is \t.
Strings are collections like lists. Hence they can be indexed and sliced, using the same syntax and rules.
Indexing strings:
In [ ]:
a = "hello"
a[0]
In [ ]:
a[1]
In [ ]:
a[-1]
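Slicing works on strings just as it does on lists. The sketch below uses a separate variable, greeting, chosen here for illustration so that it does not overwrite a.

```python
greeting = "hello, world!"

first_word = greeting[:5]    # characters 0 through 4
second_word = greeting[7:]   # from index 7 to the end
middle = greeting[3:6]       # characters 3, 4 and 5

print(first_word)
print(second_word)
print(middle)
print(len(greeting))
```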
Accents and special characters can also be handled in strings because, since Python 3, strings are Unicode by default (and source files are assumed to be encoded as UTF-8). (For a lot more on Unicode, character encoding, and how it relates to Python, see https://docs.python.org/3/howto/unicode.html.)
A string is an immutable object and it is not possible to modify its contents. If you want to modify a string, you'll create a new string from the original one (or use a method that returns a new string).
In [ ]:
a = "hello, world!"
a[2] = 'z'
In [ ]:
a.replace('l', 'z', 1)
In [ ]:
a.replace('l', 'z')
Strings have many useful methods, such as a.replace as seen above. Remember the a. object-oriented notation and use tab completion or help(str) to search for new methods.
We used a few string methods in the CSV example at the start. See them in this chunk of code?
if len(value) < 1 or value.isspace():
    column_datatypes.append('string')
elif value.isalpha():
    column_datatypes.append('string')
elif '.' in value or value.isdigit():
    column_datatypes.append('numeric')
Now you try. Create a new variable and assign a string to it. Then, try a few of Python's string methods to see how you can return different versions of your string, or test whether it has certain characteristics.
In [ ]:
String formatting:
We also saw string formatting in the CSV example, when we printed some of the reporting:
print("\t{0} ({1} of {2} unique)".format(column_datatypes[i], unique_value_count, len(data)))
In [ ]:
'An integer: {0} ; a float: {1} ; another string: {2} '.format(1, 0.1, 'string')
In [ ]:
i = 102
filename = 'processing_of_dataset_{0}.txt'.format(i)
filename
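If you are running Python 3.6 or later (an assumption; the examples above do not require it), formatted string literals, or f-strings, offer a more compact spelling of the same idea. The sketch below places the two styles side by side; the variable value is made up for illustration.

```python
i = 102
value = 3.14159  # hypothetical number, used only to show a format spec

# str.format(), as used above
name_format = 'processing_of_dataset_{0}.txt'.format(i)

# the same result with an f-string: the expression goes inside the braces
name_fstring = f'processing_of_dataset_{i}.txt'

# format specifications work the same way in both styles
padded = f'{i:05d}'      # zero-padded to width 5
short = f'{value:.2f}'   # two digits after the decimal point

print(name_format)
print(name_fstring)
print(padded, short)
```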
In [ ]:
tel = {'emmanuelle': 5752, 'sebastian': 5578}
tel['francis'] = 5915
tel
In [ ]:
tel['sebastian']
In [ ]:
tel.keys()
In [ ]:
tel.values()
In [ ]:
'francis' in tel
A dictionary can be used to conveniently store and retrieve values associated with a name (a string for a date, a name, etc.). See https://docs.python.org/tutorial/datastructures.html#dictionaries for more information.
The keys of a dictionary can have different types, and so can its values:
In [ ]:
d = {'a':1, 'b':2, 3:'hello'}
d
The Python Language Reference says:
Assignment statements are used to (re)bind names to values and to modify attributes or items of mutable objects.
In short, it works as follows (simple assignment):
Things to note:
In [ ]:
a = [1, 2, 3]
b = a
a
In [ ]:
b
In [ ]:
a is b
In [ ]:
b[1] = "hi!"
a
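The cells above show that b = a gives the same list a second name rather than making a copy. As a minimal sketch of how to get an independent copy instead, the example below uses a full slice; the name c is introduced here for illustration.

```python
a = [1, 2, 3]
b = a       # a second name for the same list object
c = a[:]    # a full slice builds a new list (list(a) would also work)

b[1] = "hi!"   # modifies the one list that both a and b refer to

print(a)       # changed, because a and b are the same object
print(c)       # unchanged, because c is a separate list
print(a is b, a is c)
```

Note that this is a shallow copy: if the list contains other lists, those inner lists are still shared.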
In [ ]:
if 2**2 == 4:
    print('Obviously')
a in b: for any collection b, check whether b contains a:
In [ ]:
b = [1,2,3]
2 in b
In [ ]:
5 in b
Blocks are delimited by indentation
Type the following lines in your Python interpreter, and be careful to respect the indentation depth. The Jupyter Notebook automatically increases the indentation depth after a colon (:); to decrease the indentation depth, go four spaces to the left with the Backspace key. Press the Enter key twice to leave the logical block.
In [ ]:
a = 10
if a == 1:
    print(1)
elif a == 2:
    print(2)
else:
    print("A lot")
In [ ]:
for i in range(4):
    print(i)
But most often, it is more readable to iterate over values:
In [ ]:
for word in ('cool', 'powerful', 'readable'):
    print('Python is %s ' % word)
In [ ]:
vowels = 'aeiouy'
for i in 'powerful':
    if i in vowels:
        print(i)
In [ ]:
message = "Hello how are you?"
message.split() # returns a list
In [ ]:
for word in message.split():
    print(word)
Few languages (in particular, languages for scientific computing) allow looping over anything but integers/indices.
With Python it is possible to loop exactly over the objects of interest without bothering with indices you often don’t care about. This feature can often be used to make code more readable.
It is not safe to modify the sequence you are iterating over.
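As a rough sketch of how to work around this, the example below removes the even numbers from a list in two ways: by looping over a copy while changing the original, and by building a new list with a list comprehension. The lists are made up for illustration.

```python
# Option 1: iterate over a copy (numbers[:]) and modify the original.
numbers = [1, 2, 3, 4, 5, 6]
for n in numbers[:]:
    if n % 2 == 0:
        numbers.remove(n)

# Option 2: leave the original alone and build a new, filtered list.
values = [1, 2, 3, 4, 5, 6]
odds = [n for n in values if n % 2 != 0]

print(numbers)
print(odds)
```

The second option is usually the easier one to reason about, since nothing changes while the loop is running.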
Keeping track of enumeration number
A common task is to iterate over a sequence while keeping track of the item number.
In [ ]:
words = ['cool', 'powerful', 'readable']
for i in range(0, len(words)):
    print(i, words[i])
In [ ]:
for index, item in enumerate(words):
    print(index, item)
When looping over a dictionary use .items():
In [ ]:
d = {'a': 1, 'b':1.2, 'c':"hi"}
for key, val in sorted(d.items()):
    print('Key: %s has value: %s ' % (key, val))
A dictionary keeps its items in insertion order (guaranteed since Python 3.7), which is not necessarily sorted order; thus we use sorted(), which sorts the (key, value) pairs by key.
In [ ]:
In [ ]:
[i**2 for i in range(4)]
Same as:
In [ ]:
l = []
for i in range(4):
    l.append(i**2)
l
Now that you've done the countdown exercise above, consider how you could have used a list comprehension in a solution:
In [ ]:
[10 - i for i in range(10)]
In [ ]:
def test():
    print('in test function')
test()
In [ ]:
def disk_area(radius):
    return 3.14 * radius * radius
disk_area(1.5)
By default, functions return None.
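Note the syntax to define a function: the def keyword, followed by the function's name and its parameters in parentheses, then a colon; the indented function body; and an optional return statement for returning values.
Mandatory parameters (positional arguments):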
In [ ]:
def double_it(x):
    return x * 2
double_it(3)
In [ ]:
double_it()
Optional parameters (keyword or named arguments)
In [ ]:
def double_it(x=2):
    return x * 2
double_it()
In [ ]:
double_it(3)
Keyword arguments allow you to specify default values.
Default values are evaluated when the function is defined, not when it is called. This can be problematic when using mutable types (e.g. dictionary or list) and modifying them in the function body, since the modifications will be persistent across invocations of the function.
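The following sketch illustrates the pitfall with a mutable default, along with the commonly used workaround of defaulting to None. The function names add_item_buggy and add_item are hypothetical, invented for this example.

```python
def add_item_buggy(item, bucket=[]):
    # the default list is created once, when the function is defined,
    # so every call without a bucket shares that same list
    bucket.append(item)
    return bucket

first = add_item_buggy(1)
second = add_item_buggy(2)   # the 1 from the previous call is still in there
print(second, first is second)

def add_item(item, bucket=None):
    # use None as the default and create a fresh list on each call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

third = add_item(1)
fourth = add_item(2)
print(third, fourth)
```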
In [ ]:
# We're defining a global variable for pi, and it's actually a special kind of global
# because we intend it to be constant (i.e. its value doesn't change). There's a convention
# of using uppercase in naming constants. See https://www.python.org/dev/peps/pep-0008/#constants
PI = 3.14159
def disk_area(radius):
    return PI * radius * radius
disk_area(1.5)
In [ ]:
def funcname(params):
    """Concise one-line sentence describing the function.

    Extended summary which can contain multiple paragraphs.
    """
    # function body
    pass
There's a great help feature built into Jupyter: type a question mark after any object or function to get quick access to its docstring. Try it:
In [ ]:
funcname?
Docstring guidelines
For the sake of standardization, the Docstring Conventions webpage documents the semantics and conventions associated with Python docstrings.
Also, the Numpy and Scipy modules have defined a precise standard for documenting scientific functions, that you may want to follow for your own functions, with a Parameters section, an Examples section, etc. See http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines#docstring-standard and
http://projects.scipy.org/numpy/browser/trunk/doc/example.py#L37
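To give a feel for that style, here is a sketch of a docstring with Parameters, Returns, and Examples sections. The function name documented_disk_area is made up for this illustration, and it uses math.pi rather than the PI constant defined earlier; treat the layout as an approximation of the NumPy convention, not an authoritative template.

```python
import math

def documented_disk_area(radius):
    """Return the area of a disk.

    Parameters
    ----------
    radius : float
        Radius of the disk.

    Returns
    -------
    float
        Area of the disk, computed as pi * radius * radius.

    Examples
    --------
    >>> documented_disk_area(1.0)
    3.141592653589793
    """
    return math.pi * radius * radius

area = documented_disk_area(1.0)
print(area)
print(documented_disk_area.__doc__)
```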
Use any features of Python that we've already worked on, or play with something new from the documentation. Write a function, and include a docstring explaining the function's purpose. Test your function by executing it. If your function uses parameters, try calling the function a few times with different parameters.
Define the function here:
In [ ]:
Call the function here:
In [ ]:
In [ ]:
import os
os
In [ ]:
os.listdir('.')
And also:
In [ ]:
from os import listdir
listdir('.')
Using alias:
In [ ]:
import pandas as pd
Modules are thus a good way to organize code in a hierarchical way. Actually, all the data science tools we are going to use are modules:
In [ ]:
import pandas as pd
pd.Series([0,1,2,3,4,5,6,7,8,9])
Indenting is compulsory in Python! Every command block following a colon bears an additional indentation level with respect to the previous line with a colon. One must therefore indent after def f(): or while:.
At the end of such logical blocks, one decreases the indentation depth (and re-increases it if a new block is entered, etc.)
Strict respect of indentation is the price to pay for getting rid of { or ; characters that delineate logical blocks in other languages. Improper indentation leads to errors such as:
------------------------------------------------------------
IndentationError: unexpected indent (test.py, line 2)
All this indentation business can be a bit confusing in the beginning. However, with the clear indentation, and in the absence of extra characters, the resulting code is very nice to read compared to other languages.
Configure your editor to map the Tab key to a 4-space indentation. In Python(x,y), the editor is already configured this way.
In [ ]:
long_line = "Here is a very very long line \
that we break in two parts."
Spaces
Write well-spaced code: put whitespace after commas, around arithmetic operators, etc.:
In [ ]:
a = 1 # yes
a=1 # too cramped
A certain number of rules for writing “beautiful” code (and more importantly using the same conventions as anybody else!) are given in the PEP-8: Style Guide for Python Code.
In [ ]:
f = open('workfile.txt', 'w') # opens the workfile file in writing mode
type(f)
In [ ]:
f.write('This is a test \nand another test')
f.close() # always use close() after opening a file! Very important!
To read from a file:
In [ ]:
f = open('workfile.txt', 'r')
s = f.read()
print(s)
f.close()
For more details: https://docs.python.org/tutorial/inputoutput.html
In [ ]:
f = open('workfile.txt', 'r')
for line in f:
    print(line)
f.close()
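Because forgetting close() is easy, a common alternative is the with statement, which closes the file automatically when its block ends, even if an error occurs inside it. The sketch below repeats the write and read steps from above in that style; it assumes you can write workfile.txt in the current directory.

```python
# write: the file is closed as soon as the with block ends
with open('workfile.txt', 'w') as f:
    f.write('This is a test \nand another test')

# read: iterate over the lines, stripping the trailing newline from each
with open('workfile.txt', 'r') as f:
    lines = [line.rstrip('\n') for line in f]

print(lines)
print(f.closed)   # the file object still exists, but it is closed
```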
File modes:
For more information about file modes read the documentation for the open() function. https://docs.python.org/3.5/library/functions.html#open
Reference documentation for this section:
os module: operating system functionality
“A portable way of using operating system dependent functionality.”
Directory and file manipulation
Get the current directory:
In [ ]:
import os
os.getcwd()
List a directory:
In [ ]:
os.listdir(os.curdir)
Make a directory:
In [ ]:
os.mkdir('junkdir')
'junkdir' in os.listdir(os.curdir)
Rename the directory:
In [ ]:
os.rename('junkdir', 'foodir')
'junkdir' in os.listdir(os.curdir)
In [ ]:
'foodir' in os.listdir(os.curdir)
In [ ]:
os.rmdir('foodir') #remove directory
'foodir' in os.listdir(os.curdir)
Delete a file:
In [ ]:
fp = open('junk.txt', 'w')
fp.close()
'junk.txt' in os.listdir(os.curdir)
In [ ]:
os.remove('junk.txt')
'junk.txt' in os.listdir(os.curdir)
In [ ]:
import glob
glob.glob('*.txt')
It is likely that you have raised Exceptions if you have typed all the previous commands of the tutorial. For example, you may have raised an exception if you entered a command with a typo.
Exceptions are raised by different kinds of errors arising when executing Python code. In your own code, you may also catch errors, or define custom error types. You may want to look at the descriptions of the built-in Exceptions when looking for the right exception type.
In [ ]:
1/0
In [ ]:
d = {1:1, 2:2}
d[3]
In [ ]:
l = [1, 2, 3]
l[4]
In [ ]:
l.foobar
In [ ]:
while True:
    try:
        x = int(input('Please enter a number: '))
        break
    except ValueError:
        print('That was no valid number. Try again...')
x
try/finally
In [ ]:
try:
    x = int(input('Please enter a number: '))
finally:
    print('Thank you for your input.')
In [ ]:
def filter_name(name):
    try:
        name = name.encode('ascii')
    except UnicodeError as e:
        if name == 'Gaël':
            print("OK, Gaël")
        else:
            raise e
    return name
filter_name("Gaël")
In [ ]:
filter_name('Stéfan')
In [ ]:
def achilles_arrow(x):
    if abs(x - 1) < 1e-3:
        raise StopIteration
    x = 1 - (1-x)/2.
    return x
x = 0
while True:
    try:
        x = achilles_arrow(x)
    except StopIteration:
        break
x
Use exceptions to signal that certain conditions are met (e.g. StopIteration) or not met (e.g. by raising a custom error).
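As a minimal sketch of custom error raising, the example below defines its own exception type by subclassing a built-in one, raises it when an input is invalid, and catches it by name. NegativeRadiusError and checked_disk_area are hypothetical names invented for this illustration.

```python
class NegativeRadiusError(ValueError):
    """Raised when a radius below zero is passed in (hypothetical error type)."""

def checked_disk_area(radius):
    # refuse inputs that make no sense, using our own exception type
    if radius < 0:
        raise NegativeRadiusError(
            "radius must be non-negative, got {0}".format(radius))
    return 3.14159 * radius * radius

try:
    checked_disk_area(-1)
except NegativeRadiusError as err:
    caught = str(err)
    print("Could not compute area:", err)

print(checked_disk_area(1))
```

Subclassing ValueError means that code which already catches ValueError will also catch the new error type.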
In [ ]: