Software Engineering for Data Scientists

Introduction to Python & Jupyter notebooks

Today's Objectives

1. Opening & Navigating the Jupyter Notebook

2. Simple Math in the Jupyter Notebook

3. Loading data with pandas

4. Cleaning and Manipulating data with pandas

5. Visualizing data with pandas

1. Opening and Navigating the Jupyter Notebook

We will start today with the interactive environment that we will be using often through the course: the Jupyter Notebook.

We will walk through the following steps together:

  1. Download Miniconda (be sure to get the Python 3.5 version) and install it on your system (ideally you will have done this before coming to class)

  2. Use the conda command-line tool to update your package listing and install the Jupyter notebook:

    Update conda's listing of packages for your system:

    $ conda update conda

    Install Jupyter notebook and all its requirements

    $ conda install jupyter
  3. Navigate to the HCEPDB directory. For example:

    $ cd ~/Desktop/HCEPDB/

    Use curl to download the main lecture notebook and the simple breakout notebook:

    # you may skip this next step if you downloaded the file from your web browser
    $ curl -O https://uwdirect.github.io/SEDS_content/02.Python.ipynb
    
    ...
    $ curl -O https://uwdirect.github.io/SEDS_content/02.Simple_Breakout.ipynb
    ...
    
    $ ls
    ...
    02.Python.ipynb
    02.Simple_Breakout.ipynb
    ...
  4. Type jupyter notebook in the terminal to start the notebook

    $ jupyter notebook

    If everything has worked correctly, it should automatically launch the notebook interface in your default browser.

  5. Click on 02.Python.ipynb to open the notebook containing the content for this lecture.

With that, you're set up to use the Jupyter notebook!

2. Simple Math in the Jupyter Notebook

Now that we have the Jupyter notebook up and running, we're going to do a short breakout exploring some of the mathematical functionality that Python offers.

Please open 02.Simple_Breakout.ipynb, find a partner, and make your way through that notebook, typing and executing code along the way.
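To give a flavor of what the breakout covers, here is a small, hypothetical sketch of simple math in Python (the actual breakout notebook may differ; the specific expressions are just for illustration):

import math

# Integer and floating-point arithmetic
print(2 ** 10)            # exponentiation
print(7 // 2, 7 % 2)      # integer division and remainder

# A few functions and constants from the built-in math module
print(math.sqrt(2))
print(math.pi, math.e)
print(math.log(100, 10))  # logarithm with an explicit base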

3. Loading data with pandas

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

Python's Data Science Ecosystem

In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:

numpy: Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.
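A tiny sketch of what working with numpy arrays looks like (not something we need today; the array contents are just for illustration):

import numpy as np

# Arrays support fast, element-wise operations and aggregations
x = np.arange(5)          # array([0, 1, 2, 3, 4])
print(x * 2)              # element-wise multiplication
print(x.mean(), x.sum())  # aggregations over the whole array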

scipy: Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.
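Just to give a flavor (we won't use this today), here is a minimal sketch of the kind of task Scipy handles, numerically integrating sin(x) over [0, pi]:

import numpy as np
from scipy import integrate

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2
result, error = integrate.quad(np.sin, 0, np.pi)
print(result)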

pandas: Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.
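A tiny, hypothetical example of a labeled DataFrame (the column names and values are just for illustration):

import pandas as pd

# A small labeled table; each column behaves like a labeled array
df = pd.DataFrame({'element': ['C', 'H', 'O'],
                   'atomic_mass': [12.011, 1.008, 15.999]})
print(df)
print(df['atomic_mass'].mean())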

matplotlib: Visualization in Python

Matplotlib started out as a MATLAB-style plotting clone in Python, and has grown from there in the 15 years since its creation. It is currently the most popular data visualization tool in the Python data world (though other, more recent packages are starting to encroach on its dominance).
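A minimal sketch of a matplotlib figure (we will see pandas' own plotting wrappers later in this lecture):

import numpy as np
import matplotlib.pyplot as plt

# A minimal figure: a labeled sine curve
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()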

Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood), it is much easier to use a package manager like conda. All it takes is to run

$ conda install numpy scipy pandas matplotlib

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.
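If you want to confirm that the installation worked, an optional sketch is to import each package and print its version:

# Confirm that the packages import and print their versions
import numpy, scipy, pandas, matplotlib
print(numpy.__version__, scipy.__version__,
      pandas.__version__, matplotlib.__version__)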

Loading Data with Pandas


In [49]:
import numpy
numpy.__path__


Out[49]:
['/Users/dacb/miniconda3/lib/python3.5/site-packages/numpy']

In [50]:
import pandas

In [51]:
df = pandas.DataFrame()

Because we'll use it so much, we often import under a shortened name using the import ... as ... pattern:


In [1]:
import pandas as pd

In [8]:
df = pd.DataFrame()

In [9]:
df


Out[9]:

Now we can use the read_csv command to read the comma-separated-value data:


In [19]:
data = pd.read_csv('HCEPDB_moldata.csv')

In [6]:
data.head(10)


Out[6]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc...
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1
5 1310744 C1C=CC=C1c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2ccc... C24H13N3SSe 454.4137 5.605135 0.951911 90.622776 -5.551911 2.029717 -3.522194 C1=CC=C(C1)c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2c...
6 196637 C1C=CC=C1c1cc2[se]c3cc4ccsc4cc3c2[se]1 C17H10SSe2 404.2520 2.644436 0.587932 69.223461 -5.187932 2.201106 -2.986827 C1=CC=C(C1)c1cc2[se]c3cc4ccsc4cc3c2[se]1
7 262174 C1C=CC=C1c1cc2[se]c3c4occc4c4cscc4c3c2[se]1 C19H10OSSe2 444.2730 2.523057 0.397670 97.645325 -4.997670 1.982122 -3.015548 C1=CC=C(C1)c1cc2[se]c3c4occc4c4cscc4c3c2[se]1
8 393249 C1C=CC=C1c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12 C24H15NSe 396.3495 3.115895 0.869140 55.174815 -5.469140 2.331815 -3.137325 C1=CC=C(C1)c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12
9 35 C1C2=C([SiH2]C=C2)C=C1c1cc2occc2c2cscc12 C17H12OSSi 292.4328 2.743214 0.387106 109.062905 -4.987106 1.909966 -3.077141 C1=CC2=C([SiH2]1)C=C(C2)c1cc2occc2c2cscc12

Note: strings in Python can be defined either with double quotes or single quotes
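read_csv has many optional parameters; here is a small sketch of two that can be handy when first exploring a large file (the variable name sample is just for illustration):

import pandas as pd

# Two optional parameters for exploratory loading:
#   usecols -- read only the listed columns
#   nrows   -- read only the first N rows
sample = pd.read_csv('HCEPDB_moldata.csv',
                     usecols=['id', 'mass', 'pce'],
                     nrows=1000)
sample.head()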

Viewing Pandas Dataframes

The head() and tail() methods show us the first and last rows of the data


In [55]:
data.head()


Out[55]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc...
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1

In [56]:
data.tail()


Out[56]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
2322844 2543603 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1cnc(... C22H14N4S3Si2 486.7506 0.000000 0.000000 0.000000 -5.632512 1.454082 -4.178430 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1cnc(...
2322845 2304057 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C22H14N4S3Si2 486.7506 9.335485 1.120744 128.197094 -5.720744 1.798600 -3.922144 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322846 2007035 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C26H18S3Si2 482.7982 2.498209 0.834995 46.046052 -5.434995 2.433160 -3.001835 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322847 1961981 C1ccc2c1c(sc2-c1scc2cc[SiH2]c12)-c1ccc(cc1)-c1... C25H16S3SeSi 519.6454 2.679067 0.659243 62.544032 -5.259243 2.258468 -3.000775 c1sc(c2[SiH2]ccc12)-c1sc(c2Cccc12)-c1ccc(cc1)-...
2322848 2754558 [SiH2]1ccc2csc(c12)-c1sc(-c2sc(-c3scc4ccsc34)c... C24H13NOS5Si 519.7887 1.272400 0.102802 190.489616 -4.702802 1.490950 -3.211851 c1sc(c2[SiH2]ccc12)-c1sc(-c2sc(-c3scc4ccsc34)c...

The shape attribute gives the dimensions of the DataFrame; shape[0] is the number of rows:


In [7]:
data.shape[0]


Out[7]:
2322849

The columns attribute gives us the column names


In [11]:
data.columns


Out[11]:
Index(['id', 'SMILES_str', 'stoich_str', 'mass', 'pce', 'voc', 'jsc',
       'e_homo_alpha', 'e_gap_alpha', 'e_lumo_alpha', 'tmp_smiles_str'],
      dtype='object')

The index attribute gives us the index names


In [12]:
data.index


Out[12]:
RangeIndex(start=0, stop=2322849, step=1)

Let's make our id column the index


In [16]:
data.set_index('id')

Now let's revisit data.index. Note that set_index returned a new DataFrame; because we did not assign the result back to data, the original index is unchanged:


In [20]:
data.index


Out[20]:
RangeIndex(start=0, stop=2322849, step=1)
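If you do want to keep the id index, assign the result back or use inplace=True. A minimal sketch (the name data_by_id is just for illustration):

# Assign the result back to keep the new index
data_by_id = data.set_index('id')
data_by_id.index          # now indexed by the molecule ids

# Or modify data in place:
# data.set_index('id', inplace=True)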

View it with head again; id is still an ordinary column:


In [21]:
data.head()


Out[21]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc...
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1

In [63]:
data.tail()


Out[63]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
2322844 2543603 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1cnc(... C22H14N4S3Si2 486.7506 0.000000 0.000000 0.000000 -5.632512 1.454082 -4.178430 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1cnc(...
2322845 2304057 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C22H14N4S3Si2 486.7506 9.335485 1.120744 128.197094 -5.720744 1.798600 -3.922144 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322846 2007035 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C26H18S3Si2 482.7982 2.498209 0.834995 46.046052 -5.434995 2.433160 -3.001835 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322847 1961981 C1ccc2c1c(sc2-c1scc2cc[SiH2]c12)-c1ccc(cc1)-c1... C25H16S3SeSi 519.6454 2.679067 0.659243 62.544032 -5.259243 2.258468 -3.000775 c1sc(c2[SiH2]ccc12)-c1sc(c2Cccc12)-c1ccc(cc1)-...
2322848 2754558 [SiH2]1ccc2csc(c12)-c1sc(-c2sc(-c3scc4ccsc34)c... C24H13NOS5Si 519.7887 1.272400 0.102802 190.489616 -4.702802 1.490950 -3.211851 c1sc(c2[SiH2]ccc12)-c1sc(-c2sc(-c3scc4ccsc34)c...

The dtypes attribute gives the data types of each column:


In [64]:
data.dtypes


Out[64]:
id                  int64
SMILES_str         object
stoich_str         object
mass              float64
pce               float64
voc               float64
jsc               float64
e_homo_alpha      float64
e_gap_alpha       float64
e_lumo_alpha      float64
tmp_smiles_str     object
dtype: object

4. Manipulating data with pandas

Here we'll cover some key features of manipulating data with pandas

Access columns by name using square-bracket indexing:


In [23]:
data['mass'].head()


Out[23]:
0    394.3151
1    400.4135
2    363.4903
3    319.4448
4    379.3398
Name: mass, dtype: float64

In [26]:
data.stoich_str.tail()


Out[26]:
2322844    C22H14N4S3Si2
2322845    C22H14N4S3Si2
2322846      C26H18S3Si2
2322847     C25H16S3SeSi
2322848     C24H13NOS5Si
Name: stoich_str, dtype: object
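As an aside, passing a list of column names selects several columns at once (a quick sketch):

# A list of column names selects several columns at once
data[['mass', 'pce', 'voc']].head()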

Mathematical operations on columns happen element-wise (note: 18.01528 is the molecular weight of H2O):


In [66]:
data['mass'] / 18.01528


Out[66]:
0          21.887814
1          22.226327
2          20.176778
3          17.731881
4          21.056559
5          25.223793
6          22.439396
7          24.660899
8          22.000740
9          16.232487
10         16.123013
11         27.154438
12         18.610135
13         24.618124
14         21.845039
15         18.901755
16         23.847817
17         14.338861
18         18.676834
19         25.510422
20         21.906526
21         28.282303
22         22.900893
23         21.613797
24         24.120724
25         25.272985
26         21.388655
27         20.453582
28         22.335356
29         21.845039
             ...    
2322819    26.854220
2322820    28.907283
2322821    28.907283
2322822    26.963916
2322823    23.740863
2322824    25.962366
2322825    25.961922
2322826    26.241690
2322827    25.962366
2322828    28.015429
2322829    27.960803
2322830    26.187509
2322831    29.690108
2322832    28.224985
2322833    26.242356
2322834    29.744511
2322835    27.851330
2322836    28.015429
2322837    22.685010
2322838    24.905747
2322839    25.852893
2322840    26.242356
2322841    26.296982
2322842    26.909068
2322843    29.799137
2322844    27.018764
2322845    27.018764
2322846    26.799373
2322847    28.844703
2322848    28.852657
Name: mass, dtype: float64

Columns can be created (or overwritten) with the assignment operator. Let's create a mass_ratio_H2O column with the mass ratio of each molecule to H2O


In [67]:
data['mass_ratio_H2O'] = data['mass'] / 18.01528

In [68]:
data.head()


Out[68]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str mass_ratio_H2O
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 21.887814
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH... 22.226327
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc... 20.176778
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si... 17.731881
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1 21.056559

In preparation for grouping the data, let's bin the molecules by their molecular mass. For that, we'll use pd.cut


In [28]:
data['mass_group'] = pd.cut(data['mass'], 10)

In [31]:
data.head()


Out[31]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str mass_group
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 (375.486, 427.298]
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH... (375.486, 427.298]
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc... (323.674, 375.486]
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si... (271.862, 323.674]
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1 (375.486, 427.298]

In [71]:
data.dtypes


Out[71]:
id                   int64
SMILES_str          object
stoich_str          object
mass               float64
pce                float64
voc                float64
jsc                float64
e_homo_alpha       float64
e_gap_alpha        float64
e_lumo_alpha       float64
tmp_smiles_str      object
mass_ratio_H2O     float64
mass_group        category
dtype: object
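As an aside, pd.cut also accepts explicit bin edges and labels if you prefer to choose the boundaries yourself. A small sketch (the column name mass_class and the bin edges are just for illustration):

# Explicit bin edges and labels; values outside the edges would become NaN
data['mass_class'] = pd.cut(data['mass'],
                            bins=[0, 300, 450, 700],
                            labels=['light', 'medium', 'heavy'])
data['mass_class'].head()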

Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at value counts and the basics of group-by operations.

Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.

The pandas.value_counts function returns the counts of the unique values in a column (a Series), sorted from most to least frequent.

We can use it, for example, to break down the molecules by their mass group that we just created:


In [72]:
pd.value_counts(data['mass_group'])


Out[72]:
(375.486, 427.298]     749908
(427.298, 479.111]     729369
(323.674, 375.486]     395811
(479.111, 530.923]     247833
(271.862, 323.674]     132159
(530.923, 582.735]      41111
(220.0499, 271.862]     21253
(582.735, 634.547]       4176
(167.72, 220.0499]        937
(634.547, 686.359]        292
Name: mass_group, dtype: int64

What happens if we try this on a continuous valued variable?


In [73]:
pd.value_counts(data['mass'])


Out[73]:
390.4646    4420
378.4536    4214
393.5405    3844
396.4928    3834
404.5674    3750
389.4805    3732
443.6003    3670
428.5134    3643
388.4924    3628
391.4527    3600
394.5286    3590
392.5564    3493
407.5197    3441
389.4765    3428
447.5407    3402
395.5047    3338
340.4048    3324
379.4417    3293
394.5206    3274
446.5526    3270
406.5316    3250
408.5078    3236
442.6162    3227
397.4809    3212
395.5087    3202
445.5685    3137
410.5956    3102
444.5884    3102
445.5645    3024
405.5555    2996
            ... 
468.1371       1
420.5824       1
291.3072       1
355.2182       1
519.0435       1
223.2737       1
406.4796       1
279.2541       1
401.4315       1
404.2939       1
322.6622       1
227.3297       1
320.4628       1
363.0374       1
300.5132       1
300.1743       1
263.2631       1
402.4156       1
302.2960       1
363.1353       1
419.3808       1
286.3984       1
390.4086       1
429.4335       1
291.2731       1
391.3927       1
421.3271       1
239.3407       1
290.2890       1
290.2850       1
Name: mass, dtype: int64

Since mass is essentially continuous, almost every value is unique, so the counts are not very informative. Value counts can still be useful for data exploration, for example to look for zeros in a column. Here, let's look at the power conversion efficiency (pce):


In [74]:
pd.value_counts(data['pce'])


Out[74]:
0.000000    109854
1.334476         6
2.229965         4
3.378631         4
0.832335         4
1.757805         4
1.748131         4
0.611385         4
2.347740         4
2.455332         4
2.056930         4
1.608040         4
6.193170         4
4.878460         4
4.075839         4
1.960030         4
3.146475         4
0.313562         4
1.478590         4
1.848397         4
5.641962         4
1.626072         4
2.387791         4
0.879680         4
3.100729         4
2.476426         4
0.490215         4
1.010616         4
2.441131         4
5.540558         4
             ...  
2.729901         1
2.777837         1
3.411103         1
1.502644         1
6.706605         1
4.088114         1
2.680453         1
2.035536         1
6.631585         1
3.110992         1
1.118627         1
1.411595         1
4.541150         1
6.235040         1
2.247999         1
5.689173         1
7.154354         1
1.232087         1
2.282375         1
5.728708         1
5.578350         1
3.953680         1
5.979996         1
3.171306         1
6.382782         1
2.324180         1
5.221554         1
5.986343         1
5.352108         1
1.503284         1
Name: pce, dtype: int64
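If we just want to quantify the zeros, a boolean comparison gives the count and fraction directly (a quick sketch; the name n_zero is just for illustration):

# Count the rows with pce exactly zero, and the fraction of the data they represent
n_zero = (data['pce'] == 0).sum()
print(n_zero, n_zero / data.shape[0])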

Group-by Operation

One of the killer features of the Pandas DataFrame is the ability to do group-by operations. You can visualize the group-by as a split-apply-combine operation (image borrowed from the Python Data Science Handbook).

Let's take this in smaller steps. Recall our mass_group column.


In [75]:
pd.value_counts(data['mass_group'])


Out[75]:
(375.486, 427.298]     749908
(427.298, 479.111]     729369
(323.674, 375.486]     395811
(479.111, 530.923]     247833
(271.862, 323.674]     132159
(530.923, 582.735]      41111
(220.0499, 271.862]     21253
(582.735, 634.547]       4176
(167.72, 220.0499]        937
(634.547, 686.359]        292
Name: mass_group, dtype: int64

groupby followed by count() shows, for each group, the number of non-null values in each column.


In [76]:
data.groupby(['mass_group']).count()


Out[76]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str mass_ratio_H2O
mass_group
(167.72, 220.0499] 937 937 937 937 937 937 937 937 937 937 844 937
(220.0499, 271.862] 21253 21253 21253 21253 21253 21253 21253 21253 21253 21253 20080 21253
(271.862, 323.674] 132159 132159 132159 132159 132159 132159 132159 132159 132159 132159 124329 132159
(323.674, 375.486] 395811 395811 395811 395811 395811 395811 395811 395811 395811 395811 373490 395811
(375.486, 427.298] 749908 749908 749908 749908 749908 749908 749908 749908 749908 749908 715158 749908
(427.298, 479.111] 729369 729369 729369 729369 729369 729369 729369 729369 729369 729369 699620 729369
(479.111, 530.923] 247833 247833 247833 247833 247833 247833 247833 247833 247833 247833 236177 247833
(530.923, 582.735] 41111 41111 41111 41111 41111 41111 41111 41111 41111 41111 39588 41111
(582.735, 634.547] 4176 4176 4176 4176 4176 4176 4176 4176 4176 4176 4093 4176
(634.547, 686.359] 292 292 292 292 292 292 292 292 292 292 291 292

Now let's find the mean of each column for each mass_group. Notice what happens to the non-numeric columns: they are dropped from the result.


In [30]:
data.groupby(['mass_group']).mean()


Out[30]:
id mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha
mass_group
(167.72, 220.0499] 795448 200.883189 0.984688 0.815517 28.384561 -5.401061 2.782885 -2.618176
(220.0499, 271.862] 815134 254.578292 1.658181 0.673716 49.137854 -5.267026 2.492683 -2.774343
(271.862, 323.674] 812740 301.611175 2.372569 0.660835 69.179613 -5.257699 2.287170 -2.970529
(323.674, 375.486] 821670 351.087766 3.210127 0.625780 91.152983 -5.233774 2.091818 -3.141957
(375.486, 427.298] 1026763 401.579728 3.918477 0.566805 116.591717 -5.183407 1.903792 -3.279615
(427.298, 479.111] 1737479 452.051313 4.441766 0.509674 140.703180 -5.136794 1.742163 -3.394631
(479.111, 530.923] 2011742 503.278084 4.480230 0.467202 148.073101 -5.115852 1.667151 -3.448701
(530.923, 582.735] 2184168 550.076470 4.535071 0.434818 148.438136 -5.117461 1.593890 -3.523570
(582.735, 634.547] 2294000 593.530994 4.694802 0.421064 151.808823 -5.106992 1.545764 -3.561228
(634.547, 686.359] 2359797 640.290482 4.716907 0.411735 164.561936 -5.049109 1.541983 -3.507127

In [29]:
data.groupby(['mass_group'])


Out[29]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10cdf1da0>

Note that calling groupby by itself returns a lazy DataFrameGroupBy object; nothing is computed until you apply an aggregation. You can specify a groupby using the names of table columns and compute other functions, such as sum, count, std, and describe.


In [78]:
data.groupby(['mass_group'])['pce'].describe()


Out[78]:
mass_group                
(167.72, 220.0499]   count       937.000000
                     mean          0.984688
                     std           0.873994
                     min           0.000000
                     25%           0.413301
                     50%           0.786599
                     75%           1.325538
                     max           5.746969
(220.0499, 271.862]  count     21253.000000
                     mean          1.658181
                     std           1.342245
                     min           0.000000
                     25%           0.735355
                     50%           1.331621
                     75%           2.252834
                     max          10.956561
(271.862, 323.674]   count    132159.000000
                     mean          2.372569
                     std           1.727534
                     min           0.000000
                     25%           1.173593
                     50%           1.984474
                     75%           3.182373
                     max          11.062819
(323.674, 375.486]   count    395811.000000
                     mean          3.210127
                     std           2.155961
                     min           0.000000
                     25%           1.685650
                     50%           2.728819
                                  ...      
(479.111, 530.923]   std           2.729844
                     min           0.000000
                     25%           2.485322
                     50%           4.070540
                     75%           6.461316
                     max          11.118864
(530.923, 582.735]   count     41111.000000
                     mean          4.535071
                     std           2.936067
                     min           0.000000
                     25%           2.501887
                     50%           4.038260
                     75%           6.801271
                     max          11.112351
(582.735, 634.547]   count      4176.000000
                     mean          4.694802
                     std           2.948578
                     min           0.000000
                     25%           3.017875
                     50%           4.043051
                     75%           7.011431
                     max          11.020831
(634.547, 686.359]   count       292.000000
                     mean          4.716907
                     std           2.454961
                     min           0.000000
                     25%           3.345544
                     50%           3.878462
                     75%           6.152368
                     max          10.858928
Name: pce, dtype: float64

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

<data object>.groupby(<grouping values>).<aggregate>()
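You can also compute several aggregates at once with agg. A quick sketch:

# Several aggregates at once for each mass group
data.groupby('mass_group')['pce'].agg(['mean', 'std', 'count'])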

You can even group by multiple values: for example, we can look at the HOMO-LUMO gap (e_gap_alpha) grouped by mass_group and pce.


In [79]:
grouped = data.groupby(['mass_group', 'pce'])['e_gap_alpha'].mean()
grouped


Out[79]:
mass_group          pce      
(167.72, 220.0499]  0.000000     2.412702
                    0.000245     2.099989
                    0.004285     2.953079
                    0.005825     3.073889
                    0.008490     2.722965
                    0.008899     2.136513
                    0.044463     2.452195
                    0.046568     2.823902
                    0.046691     3.114940
                    0.048831     3.140034
                    0.052351     2.884674
                    0.055327     2.728120
                    0.059029     2.880716
                    0.061336     2.844688
                    0.072549     3.090869
                    0.079122     3.159215
                    0.080511     3.364599
                    0.083184     2.854799
                    0.084001     1.942450
                    0.084225     1.974778
                    0.094322     3.709872
                    0.101431     2.850968
                    0.104110     2.307507
                    0.111652     2.742858
                    0.113507     2.809079
                    0.114355     2.319816
                    0.115878     3.608025
                    0.125624     3.028626
                    0.125981     3.340156
                    0.130970     2.554368
                                   ...   
(634.547, 686.359]  9.066947     1.175464
                    9.185450     1.100996
                    9.186357     1.548446
                    9.270504     1.145993
                    9.347098     1.198710
                    9.349263     1.196206
                    9.360067     1.220964
                    9.362396     1.204888
                    9.364733     1.219149
                    9.391619     1.135501
                    9.413005     1.183518
                    9.424252     1.155483
                    9.463312     1.222076
                    9.534784     1.135481
                    9.548010     1.142618
                    9.571270     1.134874
                    9.598884     1.122950
                    9.603046     1.179339
                    9.661744     1.177666
                    9.710178     1.178226
                    9.715666     1.161244
                    9.761812     1.191025
                    9.807911     1.188336
                    9.814294     1.170697
                    10.012045    1.195555
                    10.036598    1.461847
                    10.401586    1.409304
                    10.430989    1.436652
                    10.698090    1.393320
                    10.858928    1.352444
Name: e_gap_alpha, dtype: float64
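The result has a two-level index (mass_group, pce). If a flat table is easier to work with, reset_index converts it back into an ordinary DataFrame (a quick sketch):

# Convert the (mass_group, pce) multi-index back into ordinary columns
grouped.reset_index().head()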

5. Visualizing data with pandas

Of course, looking at tables of data is not very intuitive. Fortunately Pandas has many useful plotting functions built-in, all of which make use of the matplotlib library to generate plots.

Whenever you do plotting in the Jupyter notebook, you will want to first run this magic command which configures the notebook to work well with plots:


In [82]:
import matplotlib
%matplotlib inline

Now we can simply call the plot() method of any series or dataframe to get a reasonable view of the data:


In [83]:
data.groupby(['mass_group'])['pce'].mean().plot()


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x10da0cc88>
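Because plot() returns a matplotlib Axes object, you can customize the figure with the usual matplotlib calls. A quick sketch adding axis labels:

import matplotlib.pyplot as plt

# plot() returns an Axes object we can keep customizing
ax = data.groupby(['mass_group'])['pce'].mean().plot()
ax.set_xlabel('mass group')
ax.set_ylabel('mean pce')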

Other plot types

Pandas supports a range of other plotting types; you can find these by using the autocomplete on the plot method:


In [84]:
data.groupby(['mass_group'])['SMILES_str'].count().plot.hist(bins=10)
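Another option for these binned counts is a bar chart, selected with the kind argument (a quick sketch):

# A bar chart of the number of molecules in each mass group
data.groupby(['mass_group'])['SMILES_str'].count().plot(kind='bar')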
