We will start today with the interactive environment that we will be using often throughout the course: the Jupyter Notebook.
We will walk through the following steps together:
Download Miniconda (be sure to get version 3.5) and install it on your system (hopefully you have done this before coming to class)
Use the conda command-line tool to update your package listing and install the Jupyter notebook:
Update conda's listing of packages for your system:
$ conda update conda
Install Jupyter notebook and all its requirements
$ conda install jupyter
Navigate to the HCEPDB directory. For example:
$ cd ~/Desktop/HCEPDB/
Use curl to download the main lecture notebook and the simple breakout notebook:
# you may skip this next step if you downloaded the file from your web browser
$ curl -O https://uwdirect.github.io/SEDS_content/02.Python.ipynb
...
$ curl -O https://uwdirect.github.io/SEDS_content/02.Simple_Breakout.ipynb
...
$ ls
...
02.Python.ipynb
02.Simple_Breakout.ipynb
...
Type jupyter notebook in the terminal to start the notebook:
$ jupyter notebook
If everything has worked correctly, it should automatically launch your default browser (if it doesn't, open the URL printed in the terminal, typically http://localhost:8888)
Click on 02.Python.ipynb to open the notebook containing the content for this lecture.
With that, you're set up to use the Jupyter notebook!
Now that we have the Jupyter notebook up and running, we're going to do a short breakout exploring some of the mathematical functionality that Python offers.
Please open 02.Simple_Breakout.ipynb, find a partner, and make your way through that notebook, typing and executing code along the way.
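As a preview of the sort of thing the breakout covers (the actual exercises live in that notebook; this is just a sketch), the built-in math module provides common mathematical functions and constants:
In [ ]:
import math
math.sqrt(2)           # square root
math.sin(math.pi / 2)  # trig functions; math.pi is a built-in constant
math.factorial(5)      # integer functions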
In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python.
Some of the most important ones are:
numpy: Numerical Python
Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.
scipy: Scientific Python
Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.
pandas: Labeled Data Manipulation in Python
Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.
matplotlib: Visualization in Python
Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).
Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood), it is much easier to use a package manager like conda. All it takes is to run
$ conda install numpy scipy pandas matplotlib
and (so long as your conda setup is working) the packages will be downloaded and installed on your system.
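As a quick sanity check (optional, and the version numbers will of course vary from system to system), you can import each package and print its version:
In [ ]:
import numpy, scipy, pandas, matplotlib
for pkg in (numpy, scipy, pandas, matplotlib):
    print(pkg.__name__, pkg.__version__)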
In [49]:
import numpy
numpy.__path__  # show where numpy is installed on this system
Out[49]:
In [50]:
import pandas
In [51]:
df = pandas.DataFrame()
Because we'll use it so much, we often import pandas under a shortened name using the import ... as ... pattern:
In [1]:
import pandas as pd
In [8]:
df = pd.DataFrame()
In [9]:
df
Out[9]:
Now we can use the read_csv function to read the comma-separated-value data:
In [19]:
data = pd.read_csv('HCEPDB_moldata.csv')
In [6]:
data.head(10)
Out[6]:
Note: strings in Python can be defined either with double quotes or single quotes.
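For example, the two spellings below produce identical strings:
In [ ]:
s1 = 'HCEPDB_moldata.csv'
s2 = "HCEPDB_moldata.csv"
s1 == s2  # True: the quote style is not part of the string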
The head() and tail() methods show us the first and last rows of the data:
In [55]:
data.head()
Out[55]:
In [56]:
data.tail()
Out[56]:
The shape attribute gives the dimensions of the data as (number of rows, number of columns); shape[0] is therefore the number of rows:
In [7]:
data.shape[0]
Out[7]:
The columns attribute gives us the column names:
In [11]:
data.columns
Out[11]:
The index attribute gives us the row labels (the index):
In [12]:
data.index
Out[12]:
Let's make our id column the index:
In [16]:
data = data.set_index('id')  # set_index returns a new DataFrame, so reassign it
Now let's revisit data.index:
In [20]:
data.index
Out[20]:
View it with head again:
In [21]:
data.head()
Out[21]:
In [63]:
data.tail()
Out[63]:
The dtypes attribute gives the data types of each column:
In [64]:
data.dtypes
Out[64]:
Access columns by name using square-bracket indexing:
In [23]:
data['mass'].head()
Out[23]:
In [26]:
data.stoich_str.tail()
Out[26]:
Mathematical operations on columns happen element-wise (note: 18.01528 g/mol is the molar mass of water, H2O):
In [66]:
data['mass'] / 18.01528
Out[66]:
Columns can be created (or overwritten) with the assignment operator. Let's create a mass_ratio_H2O column with the mass ratio of each molecule to H2O:
In [67]:
data['mass_ratio_H2O'] = data['mass'] / 18.01528
In [68]:
data.head()
Out[68]:
In preparation for grouping the data, let's bin the molecules by their molecular mass. For that, we'll use pd.cut.
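To see what pd.cut does in isolation, here is a toy example with made-up numbers: it divides the range of the values into equal-width bins and labels each value with the interval it falls in.
In [ ]:
pd.cut([1.0, 7.0, 5.0, 4.0, 6.0, 3.0], 3)  # three equal-width bins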
In [28]:
data['mass_group'] = pd.cut(data['mass'], 10)
In [31]:
data.head()
Out[31]:
In [71]:
data.dtypes
Out[71]:
Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.
The pandas.value_counts function returns the counts of unique values in a column. We can use it, for example, to break down the molecules by the mass group we just created:
In [72]:
pd.value_counts(data['mass_group'])
Out[72]:
What happens if we try this on a continuous-valued variable? (Each distinct value gets counted separately, which is usually not very informative for continuous data.)
In [73]:
pd.value_counts(data['mass'])
Out[73]:
We can do a little data exploration with this to look for 0s in columns. Here, let's look at the power conversion efficiency (pce):
In [74]:
pd.value_counts(data['pce'])
Out[74]:
One of the killer features of the Pandas dataframe is the ability to do group-by operations. You can visualize the group-by like this (image borrowed from the Python Data Science Handbook)
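Since that image is not reproduced here, a toy example with made-up data illustrates the split-apply-combine idea: rows are split into groups by a key, an aggregate is applied to each group, and the results are combined into a new table.
In [ ]:
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'value': [1, 2, 3, 4]})
toy.groupby('key').sum()  # value: A -> 4, B -> 6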
Let's take this in smaller steps.
Recall our mass_group column.
In [75]:
pd.value_counts(data['mass_group'])
Out[75]:
groupby combined with count() shows, for each group, how many non-null values each column contains.
In [76]:
data.groupby(['mass_group']).count()
Out[76]:
Now, let's find the mean of each of the columns for each mass_group. Notice what happens to the non-numeric columns.
In [30]:
data.groupby(['mass_group']).mean()
Out[30]:
In [29]:
data.groupby(['mass_group'])
Out[29]:
You can specify a groupby using the names of table columns and compute other aggregation functions, such as sum, count, std, and describe.
In [78]:
data.groupby(['mass_group'])['pce'].describe()
Out[78]:
The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)
<data object>.groupby(<grouping values>).<aggregate>()
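For instance, with our data (median is just one aggregate you might pick; any of the functions above would work):
In [ ]:
data.groupby('mass_group')['mass'].median()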
You can even group by multiple values: for example, we can look at the LUMO-HOMO gap (e_gap_alpha) grouped by mass_group and pce.
In [79]:
grouped = data.groupby(['mass_group', 'pce'])['e_gap_alpha'].mean()
grouped
Out[79]:
Of course, looking at tables of data is not very intuitive.
Fortunately Pandas has many useful plotting functions built in, all of which make use of the matplotlib library to generate plots.
Whenever you do plotting in the Jupyter notebook, you will want to first run this magic command, which configures the notebook to display plots inline:
In [82]:
import matplotlib
%matplotlib inline
Now we can simply call the plot() method of any series or dataframe to get a reasonable view of the data:
In [83]:
data.groupby(['mass_group'])['pce'].mean().plot()
Out[83]:
In [84]:
# bar chart of the number of molecules in each mass bin
data.groupby(['mass_group'])['SMILES_str'].count().plot(kind='bar')
Out[84]: