Procedural programming in Python

Topics

  • Flow control, part 2
    • Functions
    • In class exercise:
      • Functionalize this!
    • From nothing to something:
      • Pairwise correlation between rows in a pandas dataframe
      • Sketch of the process
      • In class exercise:
        • Write the code!
      • Rejoining, sharing ideas, problems, thoughts


Flow control

Flow control figure

Flow control refers to how programs do loops, conditional execution, and ordering of functional operations.

If

If statements can be used to execute a line or block of code if a particular condition is satisfied. E.g. let's print something based on the entries in the list.


In [ ]:
instructors = ['Dave', 'Jim', 'Dorkus the Clown']

if 'Dorkus the Clown' in instructors:
    print('#fakeinstructor')

There is a special do-nothing keyword, pass, that fills in an arm of a conditional where no action is needed, e.g.


In [ ]:
if 'Jim' in instructors:
    print("Congratulations!  Jim is teaching, your class won't stink!")
else:
    pass

For

For loops are the standard loop, though while is also common. For has the general form:

for items in list:
    do stuff

For loops and collections like tuples, lists and dictionaries are natural friends.


In [ ]:
for instructor in instructors:
    print(instructor)

You can combine loops and conditionals:


In [ ]:
for instructor in instructors:
    if instructor.endswith('Clown'):
        print(instructor + " doesn't sound like a real instructor name!")
    else:
        print(instructor + " is so smart... all those gooey brains!")

range()

Since for operates over lists, it is common to want to do something like:

NOTE: C-like
for (i = 0; i < 3; ++i) {
    print(i);
}

The Python equivalent is:

for i in [0, 1, 2]:
    do something with i

What happens when the range you want to sample is big, e.g.

NOTE: C-like
for (i = 0; i < 1000000000; ++i) {
    print(i);
}

That would be a real pain in the rear to have to write out the entire list from 0 to 999999999.

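Enter the range() function. E.g. range(3) produces the values 0, 1, 2. In Python 3 it is a lazy sequence rather than a list, so list(range(3)) is [0, 1, 2].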


In [1]:
total = 0  # avoid the name sum, which would shadow the built-in sum()
for i in range(10):
    total += i
print(total)


45
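To make the point about big ranges concrete, here is a small sketch (the variable names are just for illustration). In Python 3, range() stores only its start, stop, and step, so even a billion-element range costs almost nothing to create.

```python
# range() is lazy: it does not build the full list in memory.
small = range(3)
print(list(small))         # materialize the values as a list

big = range(1000000000)    # created instantly; no billion-element list
print(len(big))
print(big[-1])             # the last value is stop - 1

# range(start, stop, step) is also available.
evens = list(range(0, 10, 2))
print(evens)
```

Wrapping a range in list() is only needed when you want to see or store all the values at once; a for loop can use the range directly.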

Functions

For loops let you repeat some code for every item in a list. Functions are similar in that they run the same lines of code for new values of some variable. They are different in that functions are not limited to looping over items.

Functions are a critical part of writing easy to read, reusable code.

Create a function like:

def function_name (parameters):
    """
    docstring
    """
    function expressions
    return [variable]

Note: Sometimes I use the word argument in place of parameter.

Here is a simple example. It prints a string that was passed in and returns nothing.


In [20]:
def print_string(str):
    """This prints out a string passed as the parameter."""
    print(str)
    for c in str:
        print(c)
        if c == 'r':
            break
    print("done")
    return

In [21]:
print_string("string")


string
s
t
r
done

To call the function, use:

print_string("Dave is awesome!")

Note: The function has to be defined before you can call it!


In [ ]:
print_string("Dave is awesome!")

If you provide no argument, or too many arguments, you get an error.


In [7]:
#print_string()

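Parameters (or arguments) in Python are passed as references to objects (sometimes called pass by assignment). This means that if you modify a mutable object such as a list inside the function, the change is visible outside of the function. Assigning a new object to the parameter name inside the function, however, does not affect the caller's variable.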

See the following example:

In [23]:
def change_list(my_list):
    """This changes a list passed into this function"""
    my_list.append('four')
    print('list inside the function: ', my_list)
    return

my_list = [1, 2, 3]
print('list before the function: ', my_list)
change_list(my_list)
print('list after the function: ', my_list)


list before the function:  [1, 2, 3]
list inside the function:  [1, 2, 3, 'four']
list after the function:  [1, 2, 3, 'four']
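One subtlety worth seeing in code: changing the contents of a passed-in list is visible to the caller, but assigning a new object to the parameter name inside the function is not. A small sketch, with hypothetical function names:

```python
def append_item(a_list):
    """Mutates the list object the caller passed in."""
    a_list.append('four')


def rebind_list(a_list):
    """Rebinds the local name only; the caller's list is untouched."""
    a_list = ['something', 'else']
    return a_list


numbers = [1, 2, 3]
append_item(numbers)
print(numbers)    # the append is visible outside the function

rebind_list(numbers)
print(numbers)    # unchanged by the rebinding inside the function
```

If you want a function to leave the caller's list alone, work on a copy (for example, list(a_list)) and return the new list.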

Variables have scope: global and local

In a function, new variables that you create are not saved when the function returns; these are local variables. Variables defined outside of the function can be read inside it, but assigning to one of those names inside the function creates a new local variable instead of changing the outer one; these outer variables are global variables. Note that the global keyword lets a function rebind a global variable. Generally, the use of global variables is discouraged; use parameters instead.

my_global_1 = 'bad idea'
my_global_2 = 'another bad one'
my_global_3 = 'better idea'

def my_function():
    print(my_global_1)
    my_global_2 = 'broke your global, man!'
    global my_global_3
    my_global_3 = 'still a better idea'
    return

my_function()
print(my_global_2)
print(my_global_3)

In [25]:
my_global_1 = 'bad idea'
my_global_2 = 'another bad one'
my_global_3 = 'better idea'

def my_function():
    print(my_global_1)
    my_global_2 = 'broke your global, man!'
    print(my_global_2)
    global my_global_3
    my_global_3 = 'still a better idea'
    return

my_function()
print(my_global_2)
print(my_global_3)


bad idea
broke your global, man!
another bad one
still a better idea

In general, you want to use parameters to provide data to a function and return a result with the return statement. E.g.

def add(x, y):
    # named add rather than sum to avoid shadowing the built-in sum()
    my_sum = x + y
    return my_sum

If you are going to return multiple objects, what data structure that we talked about can be used? Give an example below.


In [30]:
def a_function(parameter):
    return None

In [31]:
foo = a_function('bar')
print(foo)


None
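One possible answer to the question above about returning multiple objects is a tuple. The sketch below uses a hypothetical function, min_and_max, purely for illustration; a list or dictionary would also work.

```python
def min_and_max(values):
    """Return the smallest and largest item as a tuple."""
    return min(values), max(values)


result = min_and_max([3, 1, 4, 1, 5])
print(result)             # a single tuple object

# Tuple unpacking gives each returned value its own name.
lowest, highest = min_and_max([3, 1, 4, 1, 5])
print(lowest, highest)
```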

Parameters have three different types:

type behavior
required positional, must be present or error, e.g. my_func(first_name, last_name)
keyword position independent, e.g. my_func(first_name, last_name) can be called my_func(first_name='Dave', last_name='Beck') or my_func(last_name='Beck', first_name='Dave')
default keyword params that default to a value if not provided

In [32]:
def print_name(first, last='the Clown'):
    print('Your name is %s %s' % (first, last))
    return

Take a minute and play around with the above function. Which are required? Keyword? Default?
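Here are a few calls you might try. The function is repeated so the sketch runs on its own, and try/except is used only so the failing call does not stop the cell.

```python
def print_name(first, last='the Clown'):
    # Same function as in the cell above, repeated so this runs on its own.
    print('Your name is %s %s' % (first, last))
    return


print_name('Dorkus')                     # positional; last uses its default
print_name('Dave', 'Beck')               # both positional
print_name(last='Beck', first='Dave')    # keywords, so order does not matter

try:
    print_name()                         # first is required, so this fails
except TypeError as err:
    print('TypeError:', err)
```

So first is required, last has a default, and either one can be given by keyword.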


In [34]:
def massive_correlation_analysis(data, method='pearson'):
    pass
    return

Functions can contain any code that you put anywhere else including:

  • if...elif...else
  • for...else
  • while
  • other function calls

In [39]:
def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % (age))
    print('Your age is ' + str(age))
    if age > 35:
        print('You are really old.')
    return

In [40]:
print_name_age(age=40, last='Beck', first='Dave')


Your name is Dave Beck
Your age is 40
Your age is 40
You are really old.

Once you have some code that is functionalized and not going to change, you can move it to a file that ends in .py, check it into version control, import it into your notebook and use it!

Let's do this now for the above two functions.
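As a rough sketch of the mechanics: the module name name_tools is hypothetical, and in class you would create the file in an editor next to the notebook, in which case a plain import name_tools is all you need. Here the file is written to a temporary directory so the example is self-contained.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents: the two functions from the cells above.
module_source = '''
def print_name(first, last='the Clown'):
    print('Your name is %s %s' % (first, last))


def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % age)
'''

module_dir = Path(tempfile.mkdtemp())
(module_dir / 'name_tools.py').write_text(module_source)

# Python searches the directories in sys.path when importing.
sys.path.insert(0, str(module_dir))
import name_tools

name_tools.print_name_age('Dave', 'Beck', 40)
```

The directory containing the notebook is normally already on the import path, which is why putting the .py file next to the notebook works without touching sys.path.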

...

See you after the break!

Import the function...


In [ ]:

Call them!


In [ ]:


Hacky Hack Time with Functions!

Notes from last class:

  • The os package has tools for checking if a file exists: os.path.exists
    import os
    filename = 'HCEPDB_moldata.zip'
    if os.path.exists(filename):
      print("wahoo!")
  • Use the requests package to get the file given a url (got this from the requests docs)
    import requests
    url = 'http://faculty.washington.edu/dacb/HCEPDB_moldata.zip'
    req = requests.get(url)
    assert req.status_code == 200 # if the download failed, this line will generate an error
    with open(filename, 'wb') as f:
      f.write(req.content)
  • Use the zipfile package to decompress the file while reading it into pandas
    import pandas as pd
    import zipfile
    csv_filename = 'HCEPDB_moldata.csv'
    zf = zipfile.ZipFile(filename)
    data = pd.read_csv(zf.open(csv_filename))

Here was my solution before functionalizing:

import os
import requests
import pandas as pd
import zipfile

filename = 'HCEPDB_moldata.zip'
url = 'http://faculty.washington.edu/dacb/HCEPDB_moldata.zip'
csv_filename = 'HCEPDB_moldata.csv'

if os.path.exists(filename):
    pass
else:
    req = requests.get(url)
    assert req.status_code == 200 # if the download failed, this line will generate an error
    with open(filename, 'wb') as f:
        f.write(req.content)

zf = zipfile.ZipFile(filename)
data = pd.read_csv(zf.open(csv_filename))

My solution using functions:


In [4]:
def download_if_not_exists(url, filename):
    if os.path.exists(filename):
        pass
    else:
        req = requests.get(url)
        assert req.status_code == 200 # if the download failed, this line will generate an error
        with open(filename, 'wb') as f:
            f.write(req.content)

In [5]:
def load_HCEPDB_data(url, zip_filename, csv_filename):
    download_if_not_exists(url, zip_filename)
    zf = zipfile.ZipFile(zip_filename)
    data = pd.read_csv(zf.open(csv_filename))
    return data

In [6]:
import os
import requests
import pandas as pd
import zipfile

load_HCEPDB_data('http://faculty.washington.edu/dacb/HCEPDB_moldata_set1.zip', 'HCEPDB_moldata_set1.zip', 'HCEPDB_moldata_set1.csv')


Out[6]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...
2 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...
3 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1
4 1310744 C1C=CC=C1c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2ccc... C24H13N3SSe 454.4137 5.605135 0.951911 90.622776 -5.551911 2.029717 -3.522194 C1=CC=C(C1)c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2c...
5 196637 C1C=CC=C1c1cc2[se]c3cc4ccsc4cc3c2[se]1 C17H10SSe2 404.2520 2.644436 0.587932 69.223461 -5.187932 2.201106 -2.986827 C1=CC=C(C1)c1cc2[se]c3cc4ccsc4cc3c2[se]1
6 262174 C1C=CC=C1c1cc2[se]c3c4occc4c4cscc4c3c2[se]1 C19H10OSSe2 444.2730 2.523057 0.397670 97.645325 -4.997670 1.982122 -3.015548 C1=CC=C(C1)c1cc2[se]c3c4occc4c4cscc4c3c2[se]1
7 393249 C1C=CC=C1c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12 C24H15NSe 396.3495 3.115895 0.869140 55.174815 -5.469140 2.331815 -3.137325 C1=CC=C(C1)c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12
8 35 C1C2=C([SiH2]C=C2)C=C1c1cc2occc2c2cscc12 C17H12OSSi 292.4328 2.743214 0.387106 109.062905 -4.987106 1.909966 -3.077141 C1=CC2=C([SiH2]1)C=C(C2)c1cc2occc2c2cscc12
9 1048612 C1C=CC=C1C1=Cc2sc3cc4C=C[SiH2]c4cc3c2C1 C18H14SSi 290.4606 2.408411 0.431315 85.937708 -5.031315 2.065850 -2.965465 C1=CC=C(C1)C1=Cc2sc3cc4C=C[SiH2]c4cc3c2C1
10 917542 C1C=c2ccc3[se]c4c5[se]c(cc5[se]c4c3c2=C1)C1=CC... C20H12Se3 489.1948 2.843278 0.302591 144.614366 -4.902591 1.708198 -3.194393 C1=CC=C(C1)c1cc2[se]c3c([se]c4ccc5=CCC=c5c34)c...
11 1441831 C1C=CC=C1C1=Cc2ncc3c4[se]ccc4cnc3c2C1 C18H12N2Se 335.2668 2.687240 0.675497 61.225278 -5.275497 2.270953 -3.004544 C1=CC=C(C1)C1=Cc2ncc3c4[se]ccc4cnc3c2C1
12 1376296 C1C=CC=C1C1=Cc2c(C1)c1[se]c3ccc4cscc4c3c1c1=C[... C24H16SSeSi 443.5024 2.844637 0.189206 231.387394 -4.789206 1.312334 -3.476872 C1=CC=C(C1)C1=Cc2c(C1)c1[se]c3ccc4cscc4c3c1c1=...
13 1638442 C1C=c2ccc3cnc4c5[SiH2]C(=Cc5c5nsnc5c4c3c2=C1)C... C23H15N3SSi 393.5445 6.462512 0.602405 165.105179 -5.202405 1.603165 -3.599240 C1=CC=C(C1)C1=Cc2c([SiH2]1)c1ncc3ccc4=CCC=c4c3...
14 98350 C1C=CC=C1C1=Cc2ccc3c4CC=Cc4c4cscc4c3c2[SiH2]1 C22H16SSi 340.5204 2.631463 0.410851 98.573546 -5.010851 1.975707 -3.035144 C1=CC=C(C1)C1=Cc2ccc3c4CC=Cc4c4cscc4c3c2[SiH2]1
15 2162747 C1C=CC=C1C1=Cc2c([SiH2]1)c1c3c[nH]cc3c3ccc4=C[... C27H19NOSi2 429.6251 2.039158 0.140744 222.981280 -4.740744 1.361137 -3.379607 C1=CC=C(C1)C1=Cc2c([SiH2]1)c1c3c[nH]cc3c3ccc4=...
16 557119 C1C=c2c3C=C(Cc3c3occc3c2=C1)C1=CC=CC1 C19H14O 258.3186 0.237205 0.024962 146.246545 -4.624962 1.700415 -2.924547 C1=CC=C(C1)C1=Cc2c(C1)c1occc1c1=CCC=c21
17 753728 C1C=CC=C1C1=Cc2c([SiH2]1)c1cc3ncccc3cc1c1c[nH]... C22H16N2Si 336.4684 3.103831 0.409504 116.650708 -5.009504 1.863416 -3.146088 C1=CC=C(C1)C1=Cc2c([SiH2]1)c1cc3ncccc3cc1c1c[n...
18 819265 C1C=CC=C1C1=Cc2c([SiH2]1)c1c(c3cscc23)c2[se]cc... C23H16SSeSi2 459.5774 5.385253 0.368606 224.848916 -4.968606 1.352309 -3.616298 C1=CC=C(C1)C1=Cc2c([SiH2]1)c1c(c3cscc23)c2[se]...
19 1278019 C1C=CC=C1C1=Cc2c([SiH2]1)c1c(c3[SiH2]C=Cc3c3=C... C23H18OSi3 394.6522 5.489489 0.301242 280.455932 -4.901242 1.135619 -3.765623 C1=CC=C(C1)C1=Cc2c([SiH2]1)c1c(c3[SiH2]C=Cc3c3...
20 2096063 C1C=CC=C1c1cc2[se]c3c(c2c2cscc12)c1ccccc1c1ccc... C27H14N2S2Se 509.5136 6.204093 0.570055 167.497914 -5.170055 1.593078 -3.576977 C1=CC=C(C1)c1cc2[se]c3c(c2c2cscc12)c1ccccc1c1c...
21 1572945 C1C=CC=C1C1=Cc2[se]c3c4sccc4c4ccccc4c3c2C1 C22H14SSe 389.3786 2.167252 0.330623 100.884304 -4.930623 1.961253 -2.969370 C1=CC=C(C1)C1=Cc2[se]c3c4sccc4c4ccccc4c3c2C1
22 2359381 C1C=CC=C1C1=Cc2c(C1)c1c3cscc3c3ccc4nsnc4c3c1c1... C26H14N2OS2 434.5416 4.112982 0.299549 211.318161 -4.899549 1.409229 -3.490319 C1=CC=C(C1)C1=Cc2c(C1)c1c3cscc3c3ccc4nsnc4c3c1...
23 1540183 C1C=CC=C1c1cc2[se]c3c([se]c4ccc5cscc5c34)c2cn1 C20H11NSSe2 455.2999 3.212565 0.683568 72.329945 -5.283568 2.174712 -3.108856 C1=CC=C(C1)c1cc2[se]c3c([se]c4ccc5cscc5c34)c2cn1
24 1638500 C1C=CC=C1c1cc2[se]c3ccc4ccccc4c3c2c2cocc12 C23H14OSe 385.3226 3.088844 0.482262 98.573546 -5.082262 1.977235 -3.105027 C1=CC=C(C1)c1cc2[se]c3ccc4ccccc4c3c2c2cocc12
25 2621542 C1C=c2c3ccccc3c3c4ccccc4c4C=C(Cc4c3c2=C1)C1=CC... C29H20 368.4770 2.552886 0.341115 115.180406 -4.941115 1.872759 -3.068355 C1=CC=C(C1)C1=Cc2c(C1)c1c(c3ccccc23)c2ccccc2c2...
26 98411 C1C=CC=C1c1cc2[se]c3cc4cccnc4cc3c2c2cscc12 C22H13NSSe 402.3777 4.247356 0.653960 99.957476 -5.253960 1.967245 -3.286715 C1=CC=C(C1)c1cc2[se]c3cc4cccnc4cc3c2c2cscc12
27 524398 C1C=c2c3C=C([SiH2]c3c3ncc4ccc5nsnc5c4c3c2=C1)C... C23H15N3SSi 393.5445 5.860942 0.497394 181.348711 -5.097394 1.533947 -3.563447 C1=CC=C(C1)C1=Cc2c([SiH2]1)c1ncc3ccc4nsnc4c3c1...
28 131187 C1C=c2c3ccc4nsnc4c3c3cnc4C=C(Cc4c3c2=C1)C1=CC=CC1 C24H15N3S 377.4695 6.517681 0.691659 145.026911 -5.291659 1.706854 -3.584805 C1=CC=C(C1)C1=Cc2ncc3c(c2C1)c1=CCC=c1c1ccc2nsn...
29 163960 C1C=CC=C1C1=Cc2ncc3c4CC=Cc4ccc3c2[SiH2]1 C19H15NSi 285.4205 3.235009 0.585638 85.014628 -5.185638 2.071184 -3.114454 C1=CC=C(C1)C1=Cc2ncc3c4CC=Cc4ccc3c2[SiH2]1
... ... ... ... ... ... ... ... ... ... ... ...
1106468 1779493 c1cc2c3nsnc3c3c(ncc4cc(-c5cccc6c[nH]cc56)c5csc... C25H12N4S2Se 511.4898 4.404175 0.608078 111.468683 -5.208078 1.893235 -3.314843 NaN
1106469 2860840 c1cc2c3nsnc3c3c(ncc4cc(-c5cccc6nsnc56)c5nsnc5c... C21H7N7S3Se 532.4933 6.515421 1.336547 75.024985 -5.936547 2.152054 -3.784493 NaN
1106470 1222442 C1C(=Cc2[se]c3c4occc4c4nsnc4c3c12)c1cccc2ccccc12 C23H12N2OSSe 443.3868 4.398127 0.683511 99.030648 -5.283511 1.973899 -3.309613 c1cc2c3nsnc3c3c4CC(=Cc4[se]c3c2o1)c1cccc2ccccc12
1106471 3090232 [SiH2]1C=Cc2c1csc2-c1cc2ccc3c4occc4c4nsnc4c3c2... C24H12N2O2S2Si 452.5888 4.193127 0.839972 76.828261 -5.439972 2.136325 -3.303646 c1cc2c3nsnc3c3c(ccc4cc(-c5scc6[SiH2]C=Cc56)c5c...
1106472 206659 c1csc(n1)-c1cc2ccc3c4occc4c4nsnc4c3c2c2cscc12 C21H9N3OS3 415.5201 4.589759 0.887008 79.636066 -5.487008 2.114699 -3.372309 c1cc2c3nsnc3c3c(ccc4cc(-c5nccs5)c5cscc5c34)c2o1
1106473 2434889 c1occ2c(cccc12)-c1cc2oc3c(c4nsnc4c4ccncc34)c2c... C25H11N3O2S2 449.5129 5.665579 0.809923 107.658433 -5.409923 1.917514 -3.492409 c1cc2c3nsnc3c3c(oc4cc(-c5cccc6cocc56)c5cscc5c3...
1106474 960331 c1cc2c3nsnc3c3c(ccc4cc(cnc34)-c3scc4[se]ccc34)... C21H9N3S3Se 478.4811 5.081765 0.914993 85.475998 -5.514993 2.067645 -3.447349 NaN
1106475 1681228 [SiH2]1C(=Cc2c1c1c3nsnc3c3ccoc3c1c1ccccc21)c1c... C23H11N5OS2Si 465.5919 10.033001 0.953904 161.872709 -5.553904 1.617546 -3.936358 c1cc2c3nsnc3c3c4[SiH2]C(=Cc4c4ccccc4c3c2o1)c1c...
1106476 1517392 C1C(=Cc2sc3c4sccc4c4nsnc4c3c12)c1scc2C=C[SiH2]c12 C19H10N2S4Si 422.6520 5.013859 0.701342 110.024660 -5.301342 1.904229 -3.397113 c1cc2c3nsnc3c3c4CC(=Cc4sc3c2s1)c1scc2C=C[SiH2]c12
1106477 2598739 C1C=c2cccc(C3=Cc4c([SiH2]3)c3c5nsnc5c5ccoc5c3c... C25H15N3OSSi 433.5655 3.112296 0.268163 178.619700 -4.868163 1.545688 -3.322475 c1cc2c3nsnc3c3c4[SiH2]C(=Cc4c4c[nH]cc4c3c2o1)c...
1106478 763733 [SiH2]1C(=Cc2[se]c3c(c12)c1nsnc1c1ccc2cscc2c31... C19H9N5S2SeSi 478.4931 9.147375 1.082907 130.002865 -5.682907 1.788135 -3.894772 c1cc2c3nsnc3c3c4[SiH2]C(=Cc4[se]c3c2c2cscc12)c...
1106479 42846 c1cc2csc(-c3cc4sc5c6[se]ccc6c6nsnc6c5c4c4cscc3... C22H8N2S5Se 539.6092 5.518285 0.632665 134.238713 -5.232665 1.763726 -3.468940 c1cc2c3nsnc3c3c(sc4cc(-c5scc6ccsc56)c5cscc5c34...
1106480 272226 [SiH2]1C=c2c(cc3sc4c5occc5c5nsnc5c4c3c2=C1)-c1... C20H10N2OS2SeSi 465.4900 6.291725 0.642023 150.822679 -5.242023 1.676692 -3.565331 c1cc2c3nsnc3c3c(sc4cc(-c5ccc[se]5)c5=C[SiH2]C=...
1106481 2271076 c1cc2c3nsnc3c3c(sc4cc(-c5scc6cc[se]c56)c5ccccc... C24H10N2OS3Se 517.5140 4.768060 0.797364 92.030673 -5.397364 2.021933 -3.375431 NaN
1106482 1124198 C1C(=Cc2c1c1c3nsnc3c3ccoc3c1c1ccccc21)c1ccccn1 C24H13N3OS 391.4527 4.974625 0.765936 99.957476 -5.365936 1.964935 -3.401001 c1cc2c3nsnc3c3c4CC(=Cc4c4ccccc4c3c2o1)c1ccccn1
1106483 1582951 [SiH2]1C=c2cccc(C3=Cc4cnc5c6cnccc6c6nsnc6c5c4[... C22H14N4SSi2 422.6186 8.389379 0.910843 141.753503 -5.510843 1.725547 -3.785296 c1cc2c3nsnc3c3c4[SiH2]C(=Cc4cnc3c2cn1)c1cccc2=...
1106484 1058666 [SiH2]1C=Cc2c1csc2-c1cc2ncc3c4occc4c4nsnc4c3c2cn1 C20H10N4OS2Si 414.5440 7.680649 1.171715 100.884304 -5.771715 1.960875 -3.810840 c1cc2c3nsnc3c3c(cnc4cc(ncc34)-c3scc4[SiH2]C=Cc...
1106485 370546 [SiH2]1C=c2c(cc3sc4c5sccc5c5nsnc5c4c3c2=C1)-c1... C22H12N2S3Si 428.6348 6.067688 0.676091 138.123000 -5.276091 1.744040 -3.532051 c1cc2c3nsnc3c3c(sc4cc(-c5ccccc5)c5=C[SiH2]C=c5...
1106486 894837 C1C=c2cccc(-c3cc4ccc5c6sccc6c6nsnc6c5c4cn3)c2=C1 C24H13N3S2 407.5197 5.638177 0.479687 180.895837 -5.079687 1.535536 -3.544151 c1cc2c3nsnc3c3c(ccc4cc(ncc34)-c3cccc4=CCC=c34)...
1106487 2205559 [SiH2]1C=c2cccc(-c3cc4ncc5c6occc6c6nsnc6c5c4c4... C23H11N5OS2Si 465.5919 6.928005 0.823054 129.547072 -5.423054 1.789620 -3.633434 c1cc2c3nsnc3c3c(cnc4cc(-c5cccc6=C[SiH2]C=c56)c...
1106488 141179 [SiH2]1C=Cc2csc(c12)-c1cc2ncc3c4sccc4c4nsnc4c3... C21H9N5S4Si 487.6871 6.762845 1.172927 88.737230 -5.772927 2.044499 -3.728428 c1cc2c3nsnc3c3c(cnc4cc(-c5scc6C=C[SiH2]c56)c5n...
1106489 1091453 C1C(=Cc2cnc3c4sccc4c4nsnc4c3c12)c1cccc2nsnc12 C20H9N5S3 415.5241 7.230940 1.128969 98.573546 -5.728969 1.974714 -3.754254 c1cc2c3nsnc3c3c4CC(=Cc4cnc3c2s1)c1cccc2nsnc12
1106490 2303876 [SiH2]1C=Cc2csc(C3=Cc4c([SiH2]3)c3c5nsnc5c5cc[... C22H12N2S3SeSi2 535.6808 7.524076 0.713512 162.292795 -5.313512 1.615859 -3.697653 c1cc2c3nsnc3c3c4[SiH2]C(=Cc4c4cscc4c3c2[se]1)c...
1106491 1648533 [SiH2]1C=c2c(cc3ccc4c5[se]ccc5c5nsnc5c4c3c2=C1... C23H11N5S2SeSi 528.5529 10.055248 0.886720 174.523481 -5.486720 1.562081 -3.924639 c1cc2c3nsnc3c3c(ccc4cc(-c5cncc6nsnc56)c5=C[SiH...
1106492 829339 c1cc2c3nsnc3c3c(ccc4cc(-c5scc6sccc56)c5cocc5c3... C24H10N2O2S3 454.5530 4.369276 0.695215 96.724891 -5.295215 1.989505 -3.305709 NaN
1106493 2729884 c1cc2c3nsnc3c3c(sc4cc(-c5cccc6nsnc56)c5ccccc5c... C24H10N4S3Se 529.5290 5.785201 1.025143 86.852379 -5.625143 2.056575 -3.568567 NaN
1106494 1779614 [SiH2]1C=Cc2csc(c12)-c1cc2cnc3c4[se]ccc4c4nsnc... C21H9N5S3SeSi 534.5811 7.293623 1.213582 92.495763 -5.813582 2.018996 -3.794586 c1cc2c3nsnc3c3c(ncc4cc(-c5scc6C=C[SiH2]c56)c5n...
1106495 1943455 C1C=c2cccc(-c3cc4ncc5c6sccc6c6nsnc6c5c4c4cscc3... C26H13N3S3 463.6077 5.619779 0.591400 146.246545 -5.191400 1.699286 -3.492114 c1cc2c3nsnc3c3c(cnc4cc(-c5cccc6=CCC=c56)c5cscc...
1106496 1779616 [SiH2]1C(=Cc2c1c1c3nsnc3c3cc[se]c3c1c1ccccc21)... C26H15N3SSeSi 508.5375 4.886015 0.426423 176.344363 -5.026423 1.555438 -3.470986 c1cc2c3nsnc3c3c4[SiH2]C(=Cc4c4ccccc4c3c2[se]1)...
1106497 239522 [SiH2]1C(=Cc2c1c1c3nsnc3c3ccsc3c1c1ccccc21)c1c... C23H13N3S2Si 423.5947 6.313634 1.019342 95.325067 -5.619342 1.997782 -3.621560 c1cc2c3nsnc3c3c4[SiH2]C(=Cc4c4ccccc4c3c2s1)c1c...

1106498 rows × 11 columns

How many functions did you use?

Why did you choose to use functions for these pieces?


From nothing to something

Task: Compute the pairwise Pearson correlation between rows in a dataframe.

Let's say we have three molecules (A, B, C) with three measurements each (v1, v2, v3). So for each molecule we have a vector of measurements:

$$X=\begin{bmatrix} X_{v_{1}} \\ X_{v_{2}} \\ X_{v_{3}} \\ \end{bmatrix} $$

Where X is a molecule and the components are the values for each of the measurements. These make up the rows in our matrix.

Often, we want to compare molecules to determine how similar or different they are. One measure is the Pearson correlation.

Pearson correlation:
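
For two measurement vectors X and Y with n components each, the standard sample Pearson correlation coefficient is written below, using the same component notation as the vector above. Here the bars denote the mean of each vector's components.

$$r_{XY} = \frac{\sum_{i=1}^{n} (X_{v_{i}} - \bar{X})(Y_{v_{i}} - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_{v_{i}} - \bar{X})^{2}} \sqrt{\sum_{i=1}^{n} (Y_{v_{i}} - \bar{Y})^{2}}}$$

As a quick check by hand, take X = (-1, 0, 1) and Y = (1, 0, -1). Both means are 0, the numerator is (-1)(1) + (0)(0) + (1)(-1) = -2, and each square root is the square root of 2, so r = -2 / 2 = -1.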

Expressed graphically, when you plot the paired measurements for two samples (in this case molecules) against each other, you can see whether they are positively correlated, uncorrelated, or negatively correlated.

Simple input dataframe (note when you are writing code it is always a good idea to have a simple test case where you can readily compute by hand or know the output):

index v1 v2 v3
A -1 0 1
B 1 0 -1
C .5 0 .5

  • If the above is a dataframe, what shape and size is the output?
  • What are some unique features of the output?

For our test case, what will the output be?

A B C
A 1 -1 0
B -1 1 0
C 0 0 1

Let's sketch the idea...


In [ ]:

In class exercise

20-30 minutes

Objectives:

  1. Write code using functions to compute the pairwise Pearson correlation between rows in a pandas dataframe. You will have to use for and possibly if.
  2. Use a cell to test each function with an input that yields an expected output. Think about the shape and values of the outputs.
  3. Put the code in a .py file in the directory with the Jupyter notebook, import and run!

To help you get started...

To create the sample dataframe:

df = pd.DataFrame([[-1, 0, 1], [1, 0, -1], [.5, 0, .5]])

To loop over rows in a dataframe, check out (Google is your friend):

DataFrame.iterrows
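
A minimal sketch of what iterrows gives you, as a starting point rather than a solution. The row labels A, B, C and column names v1, v2, v3 are added here to match the table above; they are not in the one-line dataframe given earlier.

```python
import pandas as pd

df = pd.DataFrame([[-1, 0, 1], [1, 0, -1], [.5, 0, .5]],
                  index=['A', 'B', 'C'], columns=['v1', 'v2', 'v3'])

# iterrows() yields (index label, row as a pandas Series) pairs.
for label, row in df.iterrows():
    print(label, row.tolist(), 'mean =', row.mean())

# Nesting two loops visits every pair of rows, which is the shape of the
# pairwise problem: 3 rows give 3 x 3 = 9 pairs.
pairs = [(a, b) for a, _ in df.iterrows() for b, _ in df.iterrows()]
print(len(pairs))
```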

In [11]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


How do we know it is working?

Use the test case!

Our three row example is a useful tool for checking that our code is working. We can write some tests that compare the output of our functions to our expectations.

E.g. The diagonals should be 1, and corr(A, B) = -1, ...

But first, let's talk assert and raise

We've already briefly been exposed to assert in this code:

if os.path.exists(filename):
    pass
else:
    req = requests.get(url)
    # if the download failed, next line will raise an error
    assert req.status_code == 200
    with open(filename, 'wb') as f:
        f.write(req.content)

What is the assert doing there?

Let's play with assert. What should the following asserts do?

assert True == False, "You assert wrongly, sir!"
assert 'Dave' in instructors
assert function_that_returns_True_or_False(parameters)

In [ ]:

So when the condition in an assert statement is true, the code keeps executing, and when it is false, it raises an exception (also known as an error).
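
A small sketch of both cases. The try/except is used only so the failing assert does not stop the cell, and the name caught_message is just for illustration.

```python
instructors = ['Dave', 'Jim', 'Dorkus the Clown']

# A true condition: nothing happens and execution continues.
assert 'Dave' in instructors

# A false condition: AssertionError is raised with the optional message.
caught_message = None
try:
    assert True == False, "You assert wrongly, sir!"
except AssertionError as err:
    caught_message = str(err)
    print('AssertionError:', caught_message)
```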

We've all probably seen lots of exceptions. E.g.

def some_function(parameter):
    return

some_function()
some_dict = { }
print(some_dict['invalid key'])
'forty' + 2

Like C++ and other languages, Python lets you raise your own exceptions. You can do it with raise (surprise!). Exceptions are special objects and you can create your own types of exceptions. For now, we are going to look at the simplest one, Exception.

We create an Exception object by calling the constructor:

Exception()

This isn't very helpful. We really want to supply a description. The Exception constructor accepts any number of arguments, typically strings. One good form if you are using the generic exception object is:

Exception('Short description', 'Long description')

In [ ]:

Creating an exception object isn't useful alone, however. We need to send it up the call stack so that the calling code, or ultimately the Python interpreter, can handle the exception condition. We do this with raise.

raise Exception("An error has occurred.")
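
Here is a sketch of raising from inside a function. The helper check_age is hypothetical, and the try/except (covered properly later) is used only to show what the raised object carries without stopping the cell.

```python
def check_age(age):
    """Hypothetical helper: reject ages that make no sense."""
    if age < 0:
        raise Exception('Invalid age',
                        'Age must be zero or greater, got %d' % age)
    return age


try:
    check_age(-5)
except Exception as err:
    # The strings passed to Exception() are available as err.args.
    error_args = err.args
    print(error_args[0])
    print(error_args[1])
```

Without the try/except, the exception would travel up to the interpreter, which prints a traceback and stops.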

Now you can create your own error messages like a pro!


In [ ]:

DETOUR!

There are lots of types of exceptions beyond the generic class Exception. You can use them in your own code if they make sense. E.g.

import math
my_variable = math.inf
if my_variable == math.inf:
    raise ValueError('my_variable cannot be infinity')

List of standard exceptions:

EXCEPTION NAME DESCRIPTION
Exception Base class for all built-in, non-system-exiting exceptions; user-defined exceptions should also derive from it.
StopIteration Raised by next() when an iterator has no more items.
SystemExit Raised by the sys.exit() function. If not handled in the code, causes the interpreter to exit.
StandardError Python 2 only (removed in Python 3): base class for all built-in exceptions except StopIteration and SystemExit.
ArithmeticError Base class for all errors that occur for numeric calculation.
OverflowError Raised when a calculation exceeds maximum limit for a numeric type.
FloatingPointError Raised when a floating point calculation fails.
ZeroDivisionError Raised when division or modulo by zero takes place for all numeric types.
AssertionError Raised in case of failure of the assert statement.
AttributeError Raised in case of failure of attribute reference or assignment.
EOFError Raised when the input() function (raw_input() in Python 2) reaches the end of file without reading any data.
ImportError Raised when an import statement fails.
KeyboardInterrupt Raised when the user interrupts program execution, usually by pressing Ctrl+c.
LookupError Base class for all lookup errors.
IndexError Raised when an index is not found in a sequence.
KeyError Raised when the specified key is not found in the dictionary.
NameError Raised when an identifier is not found in the local or global namespace.
UnboundLocalError Raised when trying to access a local variable in a function or method but no value has been assigned to it.
EnvironmentError Base class for all exceptions that occur outside the Python environment (an alias of OSError in Python 3).
IOError Raised when an input/output operation fails, such as the open() function when trying to open a file that does not exist (an alias of OSError in Python 3).
OSError Raised for operating system-related errors.
SyntaxError Raised when there is an error in Python syntax.
IndentationError Raised when indentation is not specified properly.
SystemError Raised when the interpreter finds an internal problem, but when this error is encountered the Python interpreter does not exit.
TypeError Raised when an operation or function is attempted that is invalid for the specified data type.
ValueError Raised when the built-in function for a data type has the valid type of arguments, but the arguments have invalid values specified.
RuntimeError Raised when a generated error does not fall into any category.
NotImplementedError Raised when an abstract method that needs to be implemented in an inherited class is not actually implemented.

In [ ]:

Put it all together... assert and raise

Breaking assert down, it is really just an if test followed by a raise. So the code below:

assert <some_test>, <message>

is shorthand for:

if not <some_test>:
    raise AssertionError(<message>)

Prove it? OK.

instructors = ['Dorkus the Clown', 'Jim']
assert 'Dave' in instructors, "Dave isn't in the instructor list!"

instructors = ['Dorkus the Clown', 'Jim']
if not 'Dave' in instructors:
    raise AssertionError("Dave isn't in the instructor list!")

Questions?

All of this was in preparation for some testing...

Can we write some quick tests that make sure our code is doing what we think it is? Something of the form:

corr_matrix = pairwise_row_correlations(my_sample_dataframe)
assert corr_matrix looks like what we expect, "The function is broken!"

What are the smallest units of code that we can test?

What asserts can we make for these pieces of code?

Remember, in floating point arithmetic, a computed value that should be 1.0 is not necessarily exactly equal to 1.

Put the following in an empty cell:

.99999999999999999999

How can we test for two floating point numbers being (almost) equal? Pro tip: Google!
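
One option from the standard library is math.isclose, which compares within a tolerance instead of exactly. A small sketch (the variable names and the abs_tol value are only illustrative):

```python
import math

total = 0.1 + 0.2
exact_match = (total == 0.3)             # exact comparison fails due to rounding
close_match = math.isclose(total, 0.3)   # equal within a relative tolerance
print(exact_match, close_match)

# Near zero a relative tolerance is not enough; supply abs_tol as well.
tiny = 1e-17
zero_default = math.isclose(tiny, 0.0)
zero_abs_tol = math.isclose(tiny, 0.0, abs_tol=1e-12)
print(zero_default, zero_abs_tol)
```

For whole arrays or dataframes, NumPy offers numpy.isclose and numpy.allclose, which apply the same idea element by element.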


In [ ]:

From nothing to something wrap up

Here we created some functions from just a short description of our needs.

  • Before we wrote any code, we walked through the flow control and decided on the parts that were necessary.
  • Before we wrote any code, we created a simple test example with simple predictable output.
  • We wrote some code according to our specifications.
  • We wrote tests using assert to verify our code against the simple test example.
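
The steps above can be sketched as follows. This is one possible version, not the only one: the function names pearson and pairwise_row_correlations, the row labels A, B, C, and the tolerance are assumptions for illustration, and a row whose values are all equal (zero variance) would cause a division by zero that this sketch does not handle.

```python
import math

import pandas as pd


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denom_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    denom_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return numerator / (denom_x * denom_y)


def pairwise_row_correlations(df):
    """Return a square dataframe of correlations between all pairs of rows."""
    result = pd.DataFrame(index=df.index, columns=df.index, dtype=float)
    for label_a, row_a in df.iterrows():
        for label_b, row_b in df.iterrows():
            result.loc[label_a, label_b] = pearson(row_a.tolist(),
                                                   row_b.tolist())
    return result


# The simple test case with a known answer.
my_sample_dataframe = pd.DataFrame([[-1, 0, 1], [1, 0, -1], [.5, 0, .5]],
                                   index=['A', 'B', 'C'])
corr_matrix = pairwise_row_correlations(my_sample_dataframe)

# Smallest unit first: the two-vector function.
assert math.isclose(pearson([-1, 0, 1], [1, 0, -1]), -1.0), "pearson is broken!"

# Then the whole matrix: shape, diagonal, symmetry, and a known zero.
assert corr_matrix.shape == (3, 3), "The output should be square!"
for label in corr_matrix.index:
    assert math.isclose(corr_matrix.loc[label, label], 1.0), "Diagonal should be 1!"
assert math.isclose(corr_matrix.loc['A', 'B'], corr_matrix.loc['B', 'A'])
assert math.isclose(corr_matrix.loc['A', 'C'], 0.0, abs_tol=1e-12)
print(corr_matrix)
```

Testing pearson on its own first means that if the matrix test fails, you already know whether the problem is in the arithmetic or in the looping.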

Next: errors, part 2; unit tests; debugging;

QUESTIONS?


In [ ]: