Further to Sympy, this lab sheet will give a brief overview of a number of popular Python libraries that can be used to tackle a variety of different problems: numpy, matplotlib, scipy, networkx, pandas and sklearn.
This lab sheet will not give a detailed overview of each library but aims to give a brief introduction to each.
A video describing the library.
This is a fundamental library for efficient numerical computation. The basic building block is the numpy.array, which can also be used to carry out linear algebraic manipulations.
Let us import numpy
and define two 3 by 3 matrices:
In [1]:
import numpy as np
A = np.array([[5, 1, -1], [-1, 2, 4], [1, 1, 1]])
B = np.array([[1, 2, 0], [-4, 2, 2], [1, 3, 1]])
We can access $A_{02}$ for example:
In [2]:
A[0, 2]
Out[2]:
Or an entire row of $A$:
In [3]:
A[0]
Out[3]:
Or an entire column:
In [4]:
A[:,1]
Out[4]:
We can do scalar multiplication:
In [5]:
5 * A
Out[5]:
We can raise the matrix to a high power:
In [6]:
np.linalg.matrix_power(A, 5)
Out[6]:
We can carry out matrix addition:
In [7]:
A + B
Out[7]:
But also more complex things like matrix multiplication:
In [8]:
np.dot(A, B)
Out[8]:
Recent versions of Python have @
as shorthand for matrix multiplication:
In [9]:
A @ B
Out[9]:
We can also get the inverse and the determinant of $A$:
In [10]:
np.linalg.inv(A), np.linalg.det(A)
Out[10]:
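One quick way to sanity check the inverse (a minimal sketch) is to multiply $A$ by it and confirm that the product is numerically close to the identity matrix:
In [ ]:
np.allclose(A @ np.linalg.inv(A), np.eye(3))  # Should evaluate to True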
Numpy also has numerous helpful functions that can be useful even if we are not doing any linear algebra. For example, let us get an array of 100 values between $-2$ and $1$:
In [11]:
np.linspace(-2, 1, 10 ** 2)
Out[11]:
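As a further illustration (the particular functions chosen here are just examples), numpy arrays come with convenient methods for computing summary statistics:
In [ ]:
values = np.linspace(-2, 1, 10 ** 2)
values.mean(), values.max(), values.min()  # Simple summary statistics of the array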
A video describing the library.
This is the most popular Python library for creating plots. We will illustrate first by creating a plot of the function:
$$-x^4 + 9x^2 + 4x - 12$$
We do this by creating a set of $x$ values and computing the corresponding set of $f(x)$ values:
In [12]:
def f(x):
    return - x ** 4 + 9 * x ** 2 + 4 * x - 12
xs = np.linspace(-4, 5, 10 ** 3) # Create 1000 points
ys = [f(x) for x in xs]
We are now ready to import and use matplotlib:
In [13]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
plt.plot(xs, ys)
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.show()
Note the %matplotlib inline
command which is a special Jupyter command to ensure plots are displayed in this notebook.
We can also save that image to a file that can be used elsewhere:
In [14]:
plt.figure()
plt.plot(xs, ys)
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.savefig("plot.pdf") # experiment with `.png`, `.jpg` etc...
Note that there are numerous other types of plots that can be obtained, for example here is a histogram of 1000 random values generated by numpy:
In [15]:
np.random.seed(0) # Setting a seed
values = np.random.normal(size=10 ** 3)
plt.figure()
plt.hist(values, bins=20)
plt.xlabel("$Value$")
plt.ylabel("Frequency")
plt.show()
A video describing the library.
Scipy is a library bringing together many different capabilities. One thing it can do is find roots using numeric approximation techniques:
In [16]:
from scipy import optimize
In [17]:
optimize.root(f, x0=0)
Out[17]:
In [18]:
f(3)
Out[18]:
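The object returned by optimize.root has an x attribute holding the approximated root; as a quick sketch we can substitute it back into f:
In [ ]:
result = optimize.root(f, x0=0)
result.x, f(result.x)  # The value of f at the approximated root should be numerically close to 0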
It can also minimize functions (of more than one variable):
In [19]:
def g(x):
    """Here we assume g is a function of 2 variables and x is a vector"""
    return np.cos(x[1]) / (1 + x[0])
In [20]:
optimize.minimize(g, x0=[0, 0])
Out[20]:
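As with optimize.root, the returned object has an x attribute (the approximate location of the minimum) and a fun attribute (the value of the function there), for example:
In [ ]:
result = optimize.minimize(g, x0=[0, 0])
result.x, result.fun  # Approximate minimiser of g and the corresponding value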
We can also fit a function to data. Here we will create some data that follows a curve, add some noise, and then try to recover the original function:
In [20]:
def func(x, a, b):
    return a * x ** 2 + b
xs = np.linspace(0, 10, 50)
a_value, b_value = 1 / 2, 70
np.random.seed(0)
ys = [func(x, a=a_value, b=b_value) + 2 * np.random.random() for x in xs]
Let us take a look at the data:
In [21]:
plt.figure()
plt.scatter(xs, ys)
plt.show()
In [22]:
popt, pcov = optimize.curve_fit(func, xs, ys)
We recover (approximately) the values of a and b that were originally used:
In [23]:
popt
Out[23]:
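The second returned value, pcov, is the estimated covariance matrix of the parameters; taking the square root of its diagonal gives approximate standard errors for the fitted values of a and b:
In [ ]:
np.sqrt(np.diag(pcov))  # Approximate standard errors of the fitted parameters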
In [24]:
fitted_ys = [func(x, *popt) for x in xs]
plt.scatter(xs, ys, label="Original data")
plt.plot(xs, fitted_ys, label="Fitted data", color="red")
plt.show()
A video describing the library.
Networkx is used to handle graph theoretic objects. We can create a graph in a number of ways; one of the simplest is by passing a collection of edges. Here we import the library and create a graph object:
In [25]:
import networkx as nx
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
G = nx.Graph(edges)
G
Out[25]:
It is possible to obtain the adjacency matrix of the graph:
In [26]:
M = nx.to_numpy_array(G)
M
Out[26]:
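The reverse direction is also possible; assuming a reasonably recent version of networkx, from_numpy_array rebuilds a graph from an adjacency matrix:
In [ ]:
H = nx.from_numpy_array(M)
H.edges()  # Should match the edges of G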
It is also possible to draw the graph:
In [27]:
plt.figure()
nx.draw(G)
We can also compute a graph coloring:
In [28]:
coloring = nx.greedy_color(G)
coloring
Out[28]:
We see that this coloring uses 3 colours (and, since the graph contains a triangle, the chromatic number of this graph is in fact 3):
In [29]:
len(set(coloring.values()))
Out[29]:
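A greedy coloring is only guaranteed to be a valid (proper) coloring, not necessarily an optimal one; as a quick check we can confirm that no edge has the same colour at both of its endpoints:
In [ ]:
all(coloring[u] != coloring[v] for u, v in G.edges())  # True if the coloring is proper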
A video describing the library.
Pandas is Python's main library for data manipulation. As an example let us consider this dataset about extramarital affairs: affairs.csv (download). This is data from the following research paper:
Fair, Ray C. "A theory of extramarital affairs." Journal of Political Economy 86.1 (1978): 45-61.
A pdf can be found here: https://fairmodel.econ.yale.edu/rayfair/pdf/1978A200.PDF
Let us import pandas and read the dataset in:
In [30]:
import pandas as pd
df = pd.read_csv("affairs.csv")
We can look at the top of the data set:
In [31]:
df.head()
Out[31]:
From the paper we read that the variables are:
- sex: The reported gender of the individual;
- age: The age of the individual;
- ym: Number of years married;
- child: Whether or not the individual has a child;
- religious: How religious the individual is (5: "very", 1: "not");
- education: Level of education (9: grade school, 20: PhD or MD);
- occupation: Occupation based on a scale called the "Hollingshead classification";
- rate: Individual rating of the marriage (5: "very happy", 1: "very unhappy").

We can obtain a quick overview of the data using the describe() method:
In [32]:
df.describe()
Out[32]:
We can also choose to slice our data set in specific ways, for example how is the mean number of affairs related to the rating of the marriage and whether or not the individual is male or female?
In [33]:
df.groupby(["sex", "rate"])["nbaffairs"].mean()
Out[33]:
We can also choose to slice our data manually, for example let us only look at the data set for males:
In [34]:
df[df["sex"] == "male"]
Out[34]:
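Conditions like this can also be combined; for example (just as an illustration) to look only at the males who rate their marriage as very happy:
In [ ]:
df[(df["sex"] == "male") & (df["rate"] == 5)]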
We can combine this with matplotlib to get a plot of the mean number of affairs based on rating by gender:
In [35]:
plt.figure()
for sex in ["male", "female"]:
    plt.scatter(range(1, 5 + 1), df[df["sex"] == sex].groupby("rate")["nbaffairs"].mean(), label=sex)
plt.legend()
plt.xlabel("rate")
plt.ylabel("mean nbaffairs")
plt.show()
A video describing the library.
The scikit-learn library is a popular library for machine learning. As an example, let us aim to train a model that will predict whether or not someone will have an affair using the data above.
First let us create a new boolean variable male, change the child variable to be a boolean, and store this data in a new dataframe called X, which will be used to predict whether or not an individual has had an affair (stored in y):
In [36]:
df["bool_child"] = df["child"] == "yes"
df["male"] = df["sex"] == "male"
X = df[["age", "ym", "religious", "education", "rate", "occupation", "bool_child", "male"]]
y = df["nbaffairs"] > 0
Now let us import the specific classifier for the algorithm we want to use:
In [37]:
from sklearn.ensemble import RandomForestClassifier
In [38]:
seed = 3
clf = RandomForestClassifier(random_state=seed)
clf.fit(X, y)
Out[38]:
In [39]:
for feature, importance in zip(X.columns, clf.feature_importances_):
    print(feature, round(importance, 5))
The most important feature seems to be the age
of the individual. Let us use our trained model to predict whether or not some given individual is likely to have an affair over their lifetime:
In [40]:
ages = range(35, 100)
probability_of_affair = []
for age in ages:
    ym = age - 24
    vince_knight = [[age, ym, 1, 20, 5, 3, True, True]]
    probability_of_affair.append(clf.predict_proba(vince_knight)[0][1])  # Column 1 of predict_proba corresponds to the True class (an affair)
plt.figure()
plt.plot(ages, probability_of_affair)
plt.xlabel("Vince's age")
plt.ylabel("Vince's probability of having an affair")
plt.ylim(0, 1)
plt.show()
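Note that the model above is fitted and then queried using the same data. A more careful workflow would hold some data back to evaluate the model on; here is a minimal sketch using scikit-learn's train_test_split (the choice of test_size here is arbitrary):
In [ ]:
from sklearn.model_selection import train_test_split

# Hold back a quarter of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=seed)
clf_evaluation = RandomForestClassifier(random_state=seed)
clf_evaluation.fit(X_train, y_train)
clf_evaluation.score(X_test, y_test)  # Proportion of the held back data classified correctly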
This is a brief overview of the type of thing each library can do. Depending on what you want to do, be sure to explore them more fully; there are also many other libraries in the Python ecosystem.