Software Engineering for Data Scientists

Introduction to Python & Jupyter notebooks

Today's Objectives

1. Opening & Navigating the Jupyter Notebook

2. Simple Math in the Jupyter Notebook

3. Loading data with pandas

4. Cleaning and Manipulating data with pandas

5. Visualizing data with pandas

1. Opening and Navigating the Jupyter Notebook

We will start today with the interactive environment that we will be using often through the course: the Jupyter Notebook.

We will walk through the following steps together:

  1. Download Miniconda (be sure to get the Python 3.5 version) and install it on your system (ideally you will have done this before coming to class)

  2. Use the conda command-line tool to update your package listing and install the Jupyter notebook:

    Update conda's listing of packages for your system:

    $ conda update conda

    Install Jupyter notebook and all its requirements

    $ conda install jupyter
  3. Navigate to the HCEPDB directory. For example:

    $ cd ~/Desktop/HCEPDB/

    Use curl to download the main lecture notebook and the simple breakout notebook:

    # you may skip this next step if you downloaded the file from your web browser
    $ curl -O https://uwdirect.github.io/SEDS_content/02.Python.ipynb
    
    ...
    $ curl -O https://uwdirect.github.io/SEDS_content/02.Simple_Breakout.ipynb
    ...
    
    $ ls
    ...
    02.Python.ipynb
    02.Simple_Breakout.ipynb
    ...
  4. Type jupyter notebook in the terminal to start the notebook

    $ jupyter notebook

    If everything has worked correctly, it should automatically launch the notebook interface in your default browser.

  5. Click on 02.Python.ipynb to open the notebook containing the content for this lecture.

With that, you're set up to use the Jupyter notebook!

2. Simple Math in the Jupyter Notebook

Now that we have the Jupyter notebook up and running, we're going to do a short breakout exploring some of the mathematical functionality that Python offers.

Please open 02.Simple_Breakout.ipynb, find a partner, and make your way through that notebook, typing and executing code along the way.
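To give a flavor of what the breakout covers, here is a small, hypothetical sketch of simple math in Python (the actual breakout notebook may differ; the specific expressions are just for illustration):

import math

# Integer and floating-point arithmetic
print(2 ** 10)            # exponentiation
print(7 // 2, 7 % 2)      # integer division and remainder

# A few functions and constants from the built-in math module
print(math.sqrt(2))
print(math.pi, math.e)
print(math.log(100, 10))  # logarithm with an explicit base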

3. Loading data with pandas

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

Python's Data Science Ecosystem

In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:

numpy: Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.
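A tiny sketch of what working with numpy arrays looks like (not something we need today; the array contents are just for illustration):

import numpy as np

# Arrays support fast, element-wise operations and aggregations
x = np.arange(5)          # array([0, 1, 2, 3, 4])
print(x * 2)              # element-wise multiplication
print(x.mean(), x.sum())  # aggregations over the whole array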

scipy: Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.
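Just to give a flavor (we won't use this today), here is a minimal sketch of the kind of task Scipy handles, numerically integrating sin(x) over [0, pi]:

import numpy as np
from scipy import integrate

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2
result, error = integrate.quad(np.sin, 0, np.pi)
print(result)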

pandas: Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.
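A tiny, hypothetical example of a labeled DataFrame (the column names and values are just for illustration):

import pandas as pd

# A small labeled table; each column behaves like a labeled array
df = pd.DataFrame({'element': ['C', 'H', 'O'],
                   'atomic_mass': [12.011, 1.008, 15.999]})
print(df)
print(df['atomic_mass'].mean())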

matplotlib: Visualization in Python

Matplotlib started out as a MATLAB-style plotting clone in Python, and has grown from there in the 15 years since its creation. It is currently the most popular data visualization tool in the Python data world (though other, more recent packages are starting to encroach on its dominance).
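A minimal sketch of a matplotlib figure (we will see pandas' own plotting wrappers later in this lecture):

import numpy as np
import matplotlib.pyplot as plt

# A minimal figure: a labeled sine curve
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()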

Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood), it is much easier to use a package manager like conda. All it takes is to run

$ conda install numpy scipy pandas matplotlib

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.
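If you want to confirm that the installation worked, an optional sketch is to import each package and print its version:

# Confirm that the packages import and print their versions
import numpy, scipy, pandas, matplotlib
print(numpy.__version__, scipy.__version__,
      pandas.__version__, matplotlib.__version__)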

Loading Data with Pandas


In [49]:
import numpy
numpy.__path__


Out[49]:
['/Users/dacb/miniconda3/lib/python3.5/site-packages/numpy']

In [50]:
import pandas

In [51]:
df = pandas.DataFrame()

Because we'll use it so much, we often import under a shortened name using the import ... as ... pattern:


In [1]:
import pandas as pd

In [8]:
df = pd.DataFrame()

In [9]:
df


Out[9]:

Now we can use the read_csv command to read the comma-separated-value data:


In [19]:
data = pd.read_csv('HCEPDB_moldata.csv')

In [6]:
data.head(10)


Out[6]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc...
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1
5 1310744 C1C=CC=C1c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2ccc... C24H13N3SSe 454.4137 5.605135 0.951911 90.622776 -5.551911 2.029717 -3.522194 C1=CC=C(C1)c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2c...
6 196637 C1C=CC=C1c1cc2[se]c3cc4ccsc4cc3c2[se]1 C17H10SSe2 404.2520 2.644436 0.587932 69.223461 -5.187932 2.201106 -2.986827 C1=CC=C(C1)c1cc2[se]c3cc4ccsc4cc3c2[se]1
7 262174 C1C=CC=C1c1cc2[se]c3c4occc4c4cscc4c3c2[se]1 C19H10OSSe2 444.2730 2.523057 0.397670 97.645325 -4.997670 1.982122 -3.015548 C1=CC=C(C1)c1cc2[se]c3c4occc4c4cscc4c3c2[se]1
8 393249 C1C=CC=C1c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12 C24H15NSe 396.3495 3.115895 0.869140 55.174815 -5.469140 2.331815 -3.137325 C1=CC=C(C1)c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12
9 35 C1C2=C([SiH2]C=C2)C=C1c1cc2occc2c2cscc12 C17H12OSSi 292.4328 2.743214 0.387106 109.062905 -4.987106 1.909966 -3.077141 C1=CC2=C([SiH2]1)C=C(C2)c1cc2occc2c2cscc12

Note: strings in Python can be defined either with double quotes or single quotes
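read_csv has many optional parameters; here is a small sketch of two that can be handy when first exploring a large file (the variable name sample is just for illustration):

import pandas as pd

# Two optional parameters for exploratory loading:
#   usecols -- read only the listed columns
#   nrows   -- read only the first N rows
sample = pd.read_csv('HCEPDB_moldata.csv',
                     usecols=['id', 'mass', 'pce'],
                     nrows=1000)
sample.head()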

Viewing Pandas Dataframes

The head() and tail() methods show us the first and last rows of the data


In [55]:
data.head()


Out[55]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc...
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1

In [56]:
data.tail()


Out[56]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
2322844 2543603 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1cnc(... C22H14N4S3Si2 486.7506 0.000000 0.000000 0.000000 -5.632512 1.454082 -4.178430 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1cnc(...
2322845 2304057 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C22H14N4S3Si2 486.7506 9.335485 1.120744 128.197094 -5.720744 1.798600 -3.922144 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322846 2007035 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C26H18S3Si2 482.7982 2.498209 0.834995 46.046052 -5.434995 2.433160 -3.001835 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322847 1961981 C1ccc2c1c(sc2-c1scc2cc[SiH2]c12)-c1ccc(cc1)-c1... C25H16S3SeSi 519.6454 2.679067 0.659243 62.544032 -5.259243 2.258468 -3.000775 c1sc(c2[SiH2]ccc12)-c1sc(c2Cccc12)-c1ccc(cc1)-...
2322848 2754558 [SiH2]1ccc2csc(c12)-c1sc(-c2sc(-c3scc4ccsc34)c... C24H13NOS5Si 519.7887 1.272400 0.102802 190.489616 -4.702802 1.490950 -3.211851 c1sc(c2[SiH2]ccc12)-c1sc(-c2sc(-c3scc4ccsc34)c...

The shape attribute gives the dimensions of the DataFrame; shape[0] is the number of rows:


In [7]:
data.shape[0]


Out[7]:
2322849

The columns attribute gives us the column names


In [11]:
data.columns


Out[11]:
Index(['id', 'SMILES_str', 'stoich_str', 'mass', 'pce', 'voc', 'jsc',
       'e_homo_alpha', 'e_gap_alpha', 'e_lumo_alpha', 'tmp_smiles_str'],
      dtype='object')

The index attribute gives us the index names


In [12]:
data.index


Out[12]:
RangeIndex(start=0, stop=2322849, step=1)

Let's make our id column the index


In [16]:
data.set_index('id')

Now let's revisit data.index. Note that set_index returned a new DataFrame; because we did not assign the result back to data, the original index is unchanged:


In [20]:
data.index


Out[20]:
RangeIndex(start=0, stop=2322849, step=1)
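If you do want to keep the id index, assign the result back or use inplace=True. A minimal sketch (the name data_by_id is just for illustration):

# Assign the result back to keep the new index
data_by_id = data.set_index('id')
data_by_id.index          # now indexed by the molecule ids

# Or modify data in place:
# data.set_index('id', inplace=True)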

View it with head again; id is still an ordinary column:


In [21]:
data.head()


Out[21]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc...
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1

In [63]:
data.tail()


Out[63]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str
2322844 2543603 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1cnc(... C22H14N4S3Si2 486.7506 0.000000 0.000000 0.000000 -5.632512 1.454082 -4.178430 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1cnc(...
2322845 2304057 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C22H14N4S3Si2 486.7506 9.335485 1.120744 128.197094 -5.720744 1.798600 -3.922144 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322846 2007035 [SiH2]1ccc2csc(c12)-c1sc(c2[SiH2]ccc12)-c1ccc(... C26H18S3Si2 482.7982 2.498209 0.834995 46.046052 -5.434995 2.433160 -3.001835 c1sc(c2[SiH2]ccc12)-c1sc(c2[SiH2]ccc12)-c1ccc(...
2322847 1961981 C1ccc2c1c(sc2-c1scc2cc[SiH2]c12)-c1ccc(cc1)-c1... C25H16S3SeSi 519.6454 2.679067 0.659243 62.544032 -5.259243 2.258468 -3.000775 c1sc(c2[SiH2]ccc12)-c1sc(c2Cccc12)-c1ccc(cc1)-...
2322848 2754558 [SiH2]1ccc2csc(c12)-c1sc(-c2sc(-c3scc4ccsc34)c... C24H13NOS5Si 519.7887 1.272400 0.102802 190.489616 -4.702802 1.490950 -3.211851 c1sc(c2[SiH2]ccc12)-c1sc(-c2sc(-c3scc4ccsc34)c...

The dtypes attribute gives the data types of each column:


In [64]:
data.dtypes


Out[64]:
id                  int64
SMILES_str         object
stoich_str         object
mass              float64
pce               float64
voc               float64
jsc               float64
e_homo_alpha      float64
e_gap_alpha       float64
e_lumo_alpha      float64
tmp_smiles_str     object
dtype: object

4. Manipulating data with pandas

Here we'll cover some key features of manipulating data with pandas

Access columns by name using square-bracket indexing:


In [23]:
data['mass'].head()


Out[23]:
0    394.3151
1    400.4135
2    363.4903
3    319.4448
4    379.3398
Name: mass, dtype: float64

In [26]:
data.stoich_str.tail()


Out[26]:
2322844    C22H14N4S3Si2
2322845    C22H14N4S3Si2
2322846      C26H18S3Si2
2322847     C25H16S3SeSi
2322848     C24H13NOS5Si
Name: stoich_str, dtype: object
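As an aside, passing a list of column names selects several columns at once (a quick sketch):

# A list of column names selects several columns at once
data[['mass', 'pce', 'voc']].head()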

Mathematical operations on columns happen element-wise (note: 18.01528 is the molecular weight of H2O):


In [66]:
data['mass'] / 18.01528


Out[66]:
0          21.887814
1          22.226327
2          20.176778
3          17.731881
4          21.056559
5          25.223793
6          22.439396
7          24.660899
8          22.000740
9          16.232487
10         16.123013
11         27.154438
12         18.610135
13         24.618124
14         21.845039
15         18.901755
16         23.847817
17         14.338861
18         18.676834
19         25.510422
20         21.906526
21         28.282303
22         22.900893
23         21.613797
24         24.120724
25         25.272985
26         21.388655
27         20.453582
28         22.335356
29         21.845039
             ...    
2322819    26.854220
2322820    28.907283
2322821    28.907283
2322822    26.963916
2322823    23.740863
2322824    25.962366
2322825    25.961922
2322826    26.241690
2322827    25.962366
2322828    28.015429
2322829    27.960803
2322830    26.187509
2322831    29.690108
2322832    28.224985
2322833    26.242356
2322834    29.744511
2322835    27.851330
2322836    28.015429
2322837    22.685010
2322838    24.905747
2322839    25.852893
2322840    26.242356
2322841    26.296982
2322842    26.909068
2322843    29.799137
2322844    27.018764
2322845    27.018764
2322846    26.799373
2322847    28.844703
2322848    28.852657
Name: mass, dtype: float64

Columns can be created (or overwritten) with the assignment operator. Let's create a mass_ratio_H2O column with the mass ratio of each molecule to H2O


In [67]:
data['mass_ratio_H2O'] = data['mass'] / 18.01528

In [68]:
data.head()


Out[68]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str mass_ratio_H2O
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 21.887814
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH... 22.226327
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc... 20.176778
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si... 17.731881
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1 21.056559

In preparation for grouping the data, let's bin the molecules by their molecular mass. For that, we'll use pd.cut


In [28]:
data['mass_group'] = pd.cut(data['mass'], 10)

In [31]:
data.head()


Out[31]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str mass_group
0 655365 C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 C18H9N3OSSe 394.3151 5.161953 0.867601 91.567575 -5.467601 2.022944 -3.444656 C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1 (375.486, 427.298]
1 1245190 C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]... C22H15NSeSi 400.4135 5.261398 0.504824 160.401549 -5.104824 1.630750 -3.474074 C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH... (375.486, 427.298]
2 21847 C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1... C24H17NOSi 363.4903 0.000000 0.000000 197.474780 -4.539526 1.462158 -3.077368 C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc... (323.674, 375.486]
3 65553 [SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1 C12H12SeSi3 319.4448 6.138294 0.630274 149.887545 -5.230274 1.682250 -3.548025 C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si... (271.862, 323.674]
4 720918 C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1 C20H12OSSe 379.3398 1.991366 0.242119 126.581347 -4.842119 1.809439 -3.032680 C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1 (375.486, 427.298]

In [71]:
data.dtypes


Out[71]:
id                   int64
SMILES_str          object
stoich_str          object
mass               float64
pce                float64
voc                float64
jsc                float64
e_homo_alpha       float64
e_gap_alpha        float64
e_lumo_alpha       float64
tmp_smiles_str      object
mass_ratio_H2O     float64
mass_group        category
dtype: object
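As an aside, pd.cut also accepts explicit bin edges and labels if you prefer to choose the boundaries yourself. A small sketch (the column name mass_class and the bin edges are just for illustration):

# Explicit bin edges and labels; values outside the edges would become NaN
data['mass_class'] = pd.cut(data['mass'],
                            bins=[0, 300, 450, 700],
                            labels=['light', 'medium', 'heavy'])
data['mass_class'].head()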

Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at value counts and the basics of group-by operations.

Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.

The pandas.value_counts function returns the counts of the unique values in a column (a Series), sorted from most to least frequent.

We can use it, for example, to break down the molecules by their mass group that we just created:


In [72]:
pd.value_counts(data['mass_group'])


Out[72]:
(375.486, 427.298]     749908
(427.298, 479.111]     729369
(323.674, 375.486]     395811
(479.111, 530.923]     247833
(271.862, 323.674]     132159
(530.923, 582.735]      41111
(220.0499, 271.862]     21253
(582.735, 634.547]       4176
(167.72, 220.0499]        937
(634.547, 686.359]        292
Name: mass_group, dtype: int64

What happens if we try this on a continuous valued variable?


In [73]:
pd.value_counts(data['mass'])


Out[73]:
390.4646    4420
378.4536    4214
393.5405    3844
396.4928    3834
404.5674    3750
389.4805    3732
443.6003    3670
428.5134    3643
388.4924    3628
391.4527    3600
394.5286    3590
392.5564    3493
407.5197    3441
389.4765    3428
447.5407    3402
395.5047    3338
340.4048    3324
379.4417    3293
394.5206    3274
446.5526    3270
406.5316    3250
408.5078    3236
442.6162    3227
397.4809    3212
395.5087    3202
445.5685    3137
410.5956    3102
444.5884    3102
445.5645    3024
405.5555    2996
            ... 
468.1371       1
420.5824       1
291.3072       1
355.2182       1
519.0435       1
223.2737       1
406.4796       1
279.2541       1
401.4315       1
404.2939       1
322.6622       1
227.3297       1
320.4628       1
363.0374       1
300.5132       1
300.1743       1
263.2631       1
402.4156       1
302.2960       1
363.1353       1
419.3808       1
286.3984       1
390.4086       1
429.4335       1
291.2731       1
391.3927       1
421.3271       1
239.3407       1
290.2890       1
290.2850       1
Name: mass, dtype: int64

Since mass is essentially continuous, almost every value is unique, so the counts are not very informative. Value counts can still be useful for data exploration, for example to look for zeros in a column. Here, let's look at the power conversion efficiency (pce):


In [74]:
pd.value_counts(data['pce'])


Out[74]:
0.000000    109854
1.334476         6
2.229965         4
3.378631         4
0.832335         4
1.757805         4
1.748131         4
0.611385         4
2.347740         4
2.455332         4
2.056930         4
1.608040         4
6.193170         4
4.878460         4
4.075839         4
1.960030         4
3.146475         4
0.313562         4
1.478590         4
1.848397         4
5.641962         4
1.626072         4
2.387791         4
0.879680         4
3.100729         4
2.476426         4
0.490215         4
1.010616         4
2.441131         4
5.540558         4
             ...  
2.729901         1
2.777837         1
3.411103         1
1.502644         1
6.706605         1
4.088114         1
2.680453         1
2.035536         1
6.631585         1
3.110992         1
1.118627         1
1.411595         1
4.541150         1
6.235040         1
2.247999         1
5.689173         1
7.154354         1
1.232087         1
2.282375         1
5.728708         1
5.578350         1
3.953680         1
5.979996         1
3.171306         1
6.382782         1
2.324180         1
5.221554         1
5.986343         1
5.352108         1
1.503284         1
Name: pce, dtype: int64
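If we just want to quantify the zeros, a boolean comparison gives the count and fraction directly (a quick sketch; the name n_zero is just for illustration):

# Count the rows with pce exactly zero, and the fraction of the data they represent
n_zero = (data['pce'] == 0).sum()
print(n_zero, n_zero / data.shape[0])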

Group-by Operation

One of the killer features of the Pandas DataFrame is the ability to do group-by operations. You can visualize the group-by as a split-apply-combine operation (image borrowed from the Python Data Science Handbook).

Let's take this in smaller steps. Recall our mass_group column.


In [75]:
pd.value_counts(data['mass_group'])


Out[75]:
(375.486, 427.298]     749908
(427.298, 479.111]     729369
(323.674, 375.486]     395811
(479.111, 530.923]     247833
(271.862, 323.674]     132159
(530.923, 582.735]      41111
(220.0499, 271.862]     21253
(582.735, 634.547]       4176
(167.72, 220.0499]        937
(634.547, 686.359]        292
Name: mass_group, dtype: int64

groupby followed by count() shows, for each group, the number of non-null values in each column.


In [76]:
data.groupby(['mass_group']).count()


Out[76]:
id SMILES_str stoich_str mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha tmp_smiles_str mass_ratio_H2O
mass_group
(167.72, 220.0499] 937 937 937 937 937 937 937 937 937 937 844 937
(220.0499, 271.862] 21253 21253 21253 21253 21253 21253 21253 21253 21253 21253 20080 21253
(271.862, 323.674] 132159 132159 132159 132159 132159 132159 132159 132159 132159 132159 124329 132159
(323.674, 375.486] 395811 395811 395811 395811 395811 395811 395811 395811 395811 395811 373490 395811
(375.486, 427.298] 749908 749908 749908 749908 749908 749908 749908 749908 749908 749908 715158 749908
(427.298, 479.111] 729369 729369 729369 729369 729369 729369 729369 729369 729369 729369 699620 729369
(479.111, 530.923] 247833 247833 247833 247833 247833 247833 247833 247833 247833 247833 236177 247833
(530.923, 582.735] 41111 41111 41111 41111 41111 41111 41111 41111 41111 41111 39588 41111
(582.735, 634.547] 4176 4176 4176 4176 4176 4176 4176 4176 4176 4176 4093 4176
(634.547, 686.359] 292 292 292 292 292 292 292 292 292 292 291 292

Now let's find the mean of each column for each mass_group. Notice what happens to the non-numeric columns: they are dropped from the result.


In [30]:
data.groupby(['mass_group']).mean()


Out[30]:
id mass pce voc jsc e_homo_alpha e_gap_alpha e_lumo_alpha
mass_group
(167.72, 220.0499] 795448 200.883189 0.984688 0.815517 28.384561 -5.401061 2.782885 -2.618176
(220.0499, 271.862] 815134 254.578292 1.658181 0.673716 49.137854 -5.267026 2.492683 -2.774343
(271.862, 323.674] 812740 301.611175 2.372569 0.660835 69.179613 -5.257699 2.287170 -2.970529
(323.674, 375.486] 821670 351.087766 3.210127 0.625780 91.152983 -5.233774 2.091818 -3.141957
(375.486, 427.298] 1026763 401.579728 3.918477 0.566805 116.591717 -5.183407 1.903792 -3.279615
(427.298, 479.111] 1737479 452.051313 4.441766 0.509674 140.703180 -5.136794 1.742163 -3.394631
(479.111, 530.923] 2011742 503.278084 4.480230 0.467202 148.073101 -5.115852 1.667151 -3.448701
(530.923, 582.735] 2184168 550.076470 4.535071 0.434818 148.438136 -5.117461 1.593890 -3.523570
(582.735, 634.547] 2294000 593.530994 4.694802 0.421064 151.808823 -5.106992 1.545764 -3.561228
(634.547, 686.359] 2359797 640.290482 4.716907 0.411735 164.561936 -5.049109 1.541983 -3.507127

In [29]:
data.groupby(['mass_group'])


Out[29]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10cdf1da0>

Note that calling groupby by itself returns a lazy DataFrameGroupBy object; nothing is computed until you apply an aggregation. You can specify a groupby using the names of table columns and compute other functions, such as sum, count, std, and describe.


In [78]:
data.groupby(['mass_group'])['pce'].describe()


Out[78]:
mass_group                
(167.72, 220.0499]   count       937.000000
                     mean          0.984688
                     std           0.873994
                     min           0.000000
                     25%           0.413301
                     50%           0.786599
                     75%           1.325538
                     max           5.746969
(220.0499, 271.862]  count     21253.000000
                     mean          1.658181
                     std           1.342245
                     min           0.000000
                     25%           0.735355
                     50%           1.331621
                     75%           2.252834
                     max          10.956561
(271.862, 323.674]   count    132159.000000
                     mean          2.372569
                     std           1.727534
                     min           0.000000
                     25%           1.173593
                     50%           1.984474
                     75%           3.182373
                     max          11.062819
(323.674, 375.486]   count    395811.000000
                     mean          3.210127
                     std           2.155961
                     min           0.000000
                     25%           1.685650
                     50%           2.728819
                                  ...      
(479.111, 530.923]   std           2.729844
                     min           0.000000
                     25%           2.485322
                     50%           4.070540
                     75%           6.461316
                     max          11.118864
(530.923, 582.735]   count     41111.000000
                     mean          4.535071
                     std           2.936067
                     min           0.000000
                     25%           2.501887
                     50%           4.038260
                     75%           6.801271
                     max          11.112351
(582.735, 634.547]   count      4176.000000
                     mean          4.694802
                     std           2.948578
                     min           0.000000
                     25%           3.017875
                     50%           4.043051
                     75%           7.011431
                     max          11.020831
(634.547, 686.359]   count       292.000000
                     mean          4.716907
                     std           2.454961
                     min           0.000000
                     25%           3.345544
                     50%           3.878462
                     75%           6.152368
                     max          10.858928
Name: pce, dtype: float64

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

<data object>.groupby(<grouping values>).<aggregate>()
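You can also compute several aggregates at once with agg. A quick sketch:

# Several aggregates at once for each mass group
data.groupby('mass_group')['pce'].agg(['mean', 'std', 'count'])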

You can even group by multiple values: for example, we can look at the HOMO-LUMO gap (e_gap_alpha) grouped by mass_group and pce.


In [79]:
grouped = data.groupby(['mass_group', 'pce'])['e_gap_alpha'].mean()
grouped


Out[79]:
mass_group          pce      
(167.72, 220.0499]  0.000000     2.412702
                    0.000245     2.099989
                    0.004285     2.953079
                    0.005825     3.073889
                    0.008490     2.722965
                    0.008899     2.136513
                    0.044463     2.452195
                    0.046568     2.823902
                    0.046691     3.114940
                    0.048831     3.140034
                    0.052351     2.884674
                    0.055327     2.728120
                    0.059029     2.880716
                    0.061336     2.844688
                    0.072549     3.090869
                    0.079122     3.159215
                    0.080511     3.364599
                    0.083184     2.854799
                    0.084001     1.942450
                    0.084225     1.974778
                    0.094322     3.709872
                    0.101431     2.850968
                    0.104110     2.307507
                    0.111652     2.742858
                    0.113507     2.809079
                    0.114355     2.319816
                    0.115878     3.608025
                    0.125624     3.028626
                    0.125981     3.340156
                    0.130970     2.554368
                                   ...   
(634.547, 686.359]  9.066947     1.175464
                    9.185450     1.100996
                    9.186357     1.548446
                    9.270504     1.145993
                    9.347098     1.198710
                    9.349263     1.196206
                    9.360067     1.220964
                    9.362396     1.204888
                    9.364733     1.219149
                    9.391619     1.135501
                    9.413005     1.183518
                    9.424252     1.155483
                    9.463312     1.222076
                    9.534784     1.135481
                    9.548010     1.142618
                    9.571270     1.134874
                    9.598884     1.122950
                    9.603046     1.179339
                    9.661744     1.177666
                    9.710178     1.178226
                    9.715666     1.161244
                    9.761812     1.191025
                    9.807911     1.188336
                    9.814294     1.170697
                    10.012045    1.195555
                    10.036598    1.461847
                    10.401586    1.409304
                    10.430989    1.436652
                    10.698090    1.393320
                    10.858928    1.352444
Name: e_gap_alpha, dtype: float64
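The result has a two-level index (mass_group, pce). If a flat table is easier to work with, reset_index converts it back into an ordinary DataFrame (a quick sketch):

# Convert the (mass_group, pce) multi-index back into ordinary columns
grouped.reset_index().head()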

5. Visualizing data with pandas

Of course, looking at tables of data is not very intuitive. Fortunately Pandas has many useful plotting functions built-in, all of which make use of the matplotlib library to generate plots.

Whenever you do plotting in the Jupyter notebook, you will want to first run this magic command which configures the notebook to work well with plots:


In [82]:
import matplotlib
%matplotlib inline

Now we can simply call the plot() method of any series or dataframe to get a reasonable view of the data:


In [83]:
data.groupby(['mass_group'])['pce'].mean().plot()


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x10da0cc88>
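Because plot() returns a matplotlib Axes object, you can customize the figure with the usual matplotlib calls. A quick sketch adding axis labels:

import matplotlib.pyplot as plt

# plot() returns an Axes object we can keep customizing
ax = data.groupby(['mass_group'])['pce'].mean().plot()
ax.set_xlabel('mass group')
ax.set_ylabel('mean pce')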

Other plot types

Pandas supports a range of other plotting types; you can find these by using the autocomplete on the plot method:


In [84]:
data.groupby(['mass_group'])['SMILES_str'].count().plot.hist(bins=10)
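Another option for these binned counts is a bar chart, selected with the kind argument (a quick sketch):

# A bar chart of the number of molecules in each mass group
data.groupby(['mass_group'])['SMILES_str'].count().plot(kind='bar')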
