There are various ways to install Python. Assuming the reader is not (yet) well versed in programming, I suggest downloading the Anaconda distribution, which works on all major operating systems (Mac OS X, Windows, Linux) and provides a fully fledged Python installation including the necessary libraries and Jupyter notebooks.
Be sure to download the latest Python 3.x version (not 2.x; backward compatibility is not always given!).
The installation should not cause any problems. If you wish for a step-by-step guide (incl. some further insights), see Cyrille Rossant's excellent notebook on the topic.
What you see here is a so-called Jupyter notebook. It makes it possible to interactively combine code with output (results, graphics), markdown text and LaTeX (for mathematical expressions). All code discussed in this course will be provided through such notebooks and you will soon understand and appreciate the functionality they provide.
If you are keen on learning more about IPython/Jupyter, consider this notebook, a well-written introduction by Cyrille Rossant.
To analyze data in Python, data has to be stored as some kind of data type. These data types form the structure of the data and make data easily accessible. Python's basic building blocks are:
We will discuss the first four data types above as these are relevant for us.
Python is a dynamically typed language, meaning that, unlike in statically typed languages such as VBA, C, Java etc., you do not need to explicitly declare the data type of a variable. Python will infer it from the assigned value. A few examples will explain this best:
In [1]:
a = 42 # In VBA you would first have to define data type, only then the value: Dim a as integer; a = 42
b = 10.3 # VBA: Dim b as Double; b = 10.3
c = 'hello' # VBA: Dim c as String; c = "hello"
d = True # VBA: Dim d as Boolean; d = True
print('a: ', type(a))
print('b: ', type(b))
print('c: ', type(c))
print('d: ', type(d))
How did I execute this code section? If I only want to run a code cell, I select the cell and hit ctrl + enter. If you wish to run the entire notebook, go to Kernel (dropdown menu) and select "Restart & Run All". There are lots of shortcuts that let you handle a Jupyter notebook just from the keyboard. Press H on your keyboard (or go to Help/Keyboard Shortcuts) to see all of the shortcuts.
In [2]:
a = 2 + 4 - 8 # Addition & Subtraction
b = 6 * 7 / 3 - 2 # Multiplication & Division
c = 2**(1/2) # Exponents & Square root
d = 10 % 3 # Modulus
e = 10 // 3 # Floor division
print(' a =', a, '\n',
'b =', b, '\n',
'c =', c, '\n',
'd =', d, '\n',
'e =', e)
We can even use arithmetic operators to concatenate strings:
In [3]:
a = 'Hello'
b = 'World!'
print(a + ' ' + b)
print(a * 3)
In [4]:
e = ['Calynn', 'Dillon', '10.3', c, d]
print('e: ', type(e))
print([type(item) for item in e])
Note that the third element in list e is set in quotation marks and thus Python interprets it as a string.
As useful as lists appear, their flexibility comes at a high cost. Because each element carries not only the value itself but also information about its data type, storing data in a list consumes a lot of memory. For this reason the Python community introduced NumPy (short for Numerical Python). Among other things, this package provides fixed-type arrays, which store and operate on dense data more efficiently than simple lists.
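The memory argument can be made tangible with a rough comparison. The following is only a sketch: the variable names are chosen for illustration, and `sys.getsizeof` reports CPython-specific object sizes, so the exact figure for the list will differ between Python versions.

```python
import sys
import numpy as np

n = 1000
values_list = list(range(n))                 # plain Python list of int objects
values_array = np.arange(n, dtype='int64')   # fixed-type NumPy array

# A list stores references; each element is a full Python object on top of that
list_bytes = sys.getsizeof(values_list) + sum(sys.getsizeof(v) for v in values_list)

# The array stores the raw values only: 8 bytes per int64 element
array_bytes = values_array.nbytes

print('list :', list_bytes, 'bytes')
print('array:', array_bytes, 'bytes')
```

Note that `nbytes` counts the data buffer only and ignores the small, constant overhead of the array object itself.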
Fixed-type arrays are dense arrays of uniform type. Uniform here means that all entries of the array have the same data type, e.g. all are floating point numbers. We start by importing the NumPy package (following general convention we import this package under the alias name np) and create some simple NumPy arrays from Python lists.
In [5]:
import numpy as np
# Integer array
np.array([3, 18, 12])
Out[5]:
In [6]:
# Floating point array
np.array([3., 18, 12])
Out[6]:
Similarly, we can explicitly set the data type:
In [7]:
np.array([3, 18, 12], dtype='float32')
Out[7]:
In [8]:
# Multidimensional arrays
np.array([range(i, i + 3) for i in [1, 2, 3]])
Out[8]:
We've seen above that we can define the data type for NumPy arrays. A list of available data types can be found in the NumPy documentation.
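To get a feeling for what a data type implies, one can inspect a few of them directly. This is a small sketch; the selection of type names is arbitrary and only meant as illustration.

```python
import numpy as np

# Storage size per element for a few common data types
for name in ['int8', 'int32', 'int64', 'float32', 'float64', 'bool']:
    dt = np.dtype(name)
    print(name, '->', dt.itemsize, 'byte(s) per element')

# Without an explicit dtype, NumPy infers one: mixing integers
# and floats yields a floating-point array
mixed = np.array([3, 18.5, 12])
print(mixed.dtype)
```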
In [9]:
# Integer array with 8 zeros
np.zeros(shape=8, dtype='int')
Out[9]:
In [10]:
# 2x3 floating-point array filled with 1s
np.ones(shape=(2, 3), dtype='float32')
Out[10]:
Note that I do not need to spell out the keywords as in np.ones(shape=(2, 3), dtype='float32') to define the shape and data type. It's ok to go with the short, positional version as long as the order of the arguments is correct; Python will understand. However, going with the explicit version helps to make your code more readable and I encourage students to follow this advice.
Each predefined function is documented in a help page. One can display it by calling `np.ones?` or `help(np.ones)`. If it is not clear what the function is called precisely, it is best to use `*` (wildcard character), as in `np.*ne*?`. This will list all objects in the NumPy namespace which contain 'ne' in their name. In our example, `np.ones?` shows the following details:
In [11]:
# np.ones?
The function description also shows examples of how the function can be used. Often this is very helpful, but for the sake of brevity, these are omitted here.
In the function description we see that the order of arguments is shape, dtype, order. If our inputs are in this order, we do not need to name the arguments. Furthermore, we see that dtype and order are optional, meaning that if left undefined, Python will simply use the default values. This makes for very short code. However, going with the longer explicit version helps to make your code more readable and I encourage students to follow this advice.
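A quick check illustrates that the keyword and the positional call are interchangeable. The variable names below are for illustration only.

```python
import numpy as np

explicit = np.ones(shape=(2, 3), dtype='float32')   # arguments named
positional = np.ones((2, 3), 'float32')             # same arguments, by position
default = np.ones((2, 3))                           # dtype left at its default

print(np.array_equal(explicit, positional))
print(explicit.dtype, positional.dtype, default.dtype)
```

Leaving out `dtype` falls back to the default floating-point type, which is why the third array reports a different data type than the first two.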
In [12]:
# 3x2 array filled with 2.71
np.full(shape=(3, 2), fill_value=2.71)
Out[12]:
In [13]:
# 2x2 boolean array filled with 'True'
np.full((2, 2), 1, bool)
Out[13]:
In [14]:
# Array filled with linear sequence
np.arange(start = 0, stop = 1, step = 0.1) # or simply np.arange(0, 1, 0.1)
Out[14]:
In [15]:
# Array of evenly spaced values
np.linspace(start = 0, stop = 1, num = 4)
Out[15]:
Arrays with random variables are easily created. Below are three examples. See the numpy.random documentation page for details on how to generate other random variable-arrays.
In [16]:
# 4x4 array of uniformly distributed random variables
np.random.random((4, 4))
Out[16]:
In [17]:
# 3x3 array of normally distributed random variables (with mean = 4, sd = 6)
np.random.normal(loc = 4, scale = 6, size = (3, 3))
Out[17]:
In [18]:
# 3x3 array of random integers in interval [0, 15)
np.random.randint(low = 0, high = 15, size = (3, 3))
Out[18]:
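As a side note, NumPy 1.17 and later also provide a generator-based interface via `np.random.default_rng`. The following sketch reproduces the three examples above with it; it assumes a sufficiently recent NumPy, and the seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=1234)   # generator with its own seed

# 4x4 array of uniformly distributed values in [0, 1)
uniform = rng.random((4, 4))

# 3x3 array of normally distributed values (mean = 4, sd = 6)
normal = rng.normal(loc=4, scale=6, size=(3, 3))

# 3x3 array of random integers in interval [0, 15)
integers = rng.integers(low=0, high=15, size=(3, 3))

print(uniform.shape, normal.shape, integers.shape)
```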
In [19]:
# 4x4 identity matrix
np.eye(4)
Out[19]:
Here are some attributes we can call:
| Attribute | Description |
|---|---|
| ndim | No. of dimensions |
| shape | Size of each dimension |
| size | Total size of array |
| dtype | Data type of array |
| itemsize | Size (in bytes) |
| nbytes | Total size (in bytes) |
To show how one can access them we'll define three arrays.
In [20]:
np.random.seed(1234) # Set seed for reproducibility
x = np.random.randint(10, size = 6) # 1-dimensional array (vector)
y = np.random.randint(10, size = (3, 4)) # 2-dimensional array (matrix)
z = np.random.randint(10, size = (3, 4, 5)) # 3-dimensional array
And here's how we call for these properties:
In [21]:
print(' ', x, '\n\n', y, '\n\n', z)
In [22]:
print('ndim: ', z.ndim)
print('shape: ', z.shape)
print('size: ', z.size)
print('data type: ', z.dtype)
print('itemsize: ', z.itemsize)
print('nbytes: ', z.nbytes)
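The attributes are related to each other: `size` is the product of the entries in `shape`, and `nbytes` equals `size` times `itemsize`. A minimal sketch, using an array with an explicitly set data type (the name `arr` is chosen for illustration) so that the numbers do not depend on the platform:

```python
import numpy as np

arr = np.zeros(shape=(3, 4, 5), dtype='int64')

print('ndim:    ', arr.ndim)       # number of dimensions
print('shape:   ', arr.shape)      # size of each dimension
print('size:    ', arr.size)       # 3 * 4 * 5 elements
print('itemsize:', arr.itemsize)   # bytes per int64 element
print('nbytes:  ', arr.nbytes)     # size * itemsize
```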
In [23]:
print(e, '\n') # List from above
print(x, '\n') # One dimensional np array from above
print(y, '\n') # Two dimensional np array from above
In [24]:
e[2]
Out[24]:
In [25]:
e[2] * 2
Out[25]:
In [26]:
x[5]
Out[26]:
In [27]:
y[2, 0] # Note again that [m, n] starts counting for both rows (m) as well as columns (n) from 0
Out[27]:
To access the end of an array, you can also use negative indices:
In [28]:
e[-1]
Out[28]:
In [29]:
y[-2, 2]
Out[29]:
Lists or arrays of indices can also be used as input:
In [30]:
ind = [3, 5, -4]
x[ind]
Out[30]:
In [31]:
x = np.arange(12).reshape((3, 4))
print(x)
row = np.array([1, 2])
col = np.array([0, 3])
x[row, col]
Out[31]:
Knowing the index we can also replace elements of an array:
In [32]:
x[0] = 99
x
Out[32]:
IMPORTANT NOTE:
NumPy arrays have a fixed type. This means that if you, for example, insert a floating-point value into an integer array, the value will be truncated!
In [33]:
x[0] = 3.14159; x
Out[33]:
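If the decimals matter, the array has to be converted to a floating-point type first. A short sketch of both cases, with array names chosen for illustration:

```python
import numpy as np

ints = np.array([1, 2, 3])        # integer array
ints[0] = 3.14159                 # decimals are cut off
print(ints)

floats = ints.astype('float64')   # converted copy with floating-point type
floats[0] = 3.14159               # value is kept as is
print(floats)
```

`astype` returns a new array, so the original integer array remains unchanged.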
In [34]:
x = np.arange(10)
x
Out[34]:
In [35]:
x[:3] # First three elements
Out[35]:
In [36]:
x[7:] # Elements after the 7th element (index 7 onward)
Out[36]:
In [37]:
x[4:8] # Elements 5, 6, 7 and 8 (indices 4 to 7)
Out[37]:
In [38]:
x[::2] # Even elements
Out[38]:
In [39]:
x[1::2] # Odd elements
Out[39]:
In [40]:
x[::-1] # All elements reversed
Out[40]:
In [41]:
x[::-2] # Odd elements reversed
Out[41]:
Array slicing works the same for multidimensional arrays.
In [42]:
y # from above
Out[42]:
In [43]:
y[:2, :3] # Rows 0 and 1, columns 0, 1, 2
Out[43]:
In [44]:
y[:, 2] # Third column
Out[44]:
In [45]:
y[0, :] # First row
Out[45]:
IMPORTANT NOTE:
When slicing and assigning part of an existing array to a new variable, the new variable will only hold a "view" but not a copy. This means that if you change a value in the new array, the original array will also be changed. The idea behind this is to save memory. But fear not: with the ".copy()" method, you can still get a true copy.
Here are a few corresponding examples for better understanding:
In [46]:
ySub = y[:2, :2]
print(ySub)
In [47]:
ySub[0, 0] = 99
print(ySub, '\n')
print(y)
In [48]:
ySubCopy = y[:2, :2].copy()
ySubCopy[0, 0] = 33
print(ySubCopy, '\n')
print(y)
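Whether two arrays refer to the same underlying data can also be checked directly with `np.shares_memory`. A minimal sketch with freshly defined arrays (names chosen for illustration):

```python
import numpy as np

base = np.arange(12).reshape((3, 4))
view = base[:2, :2]            # slice: a view on the data of base
copy = base[:2, :2].copy()     # independent copy

print(np.shares_memory(base, view))   # view and base use the same memory
print(np.shares_memory(base, copy))   # the copy has its own memory

view[0, 0] = 99                # also changes base
copy[0, 0] = 33                # leaves base untouched
print(base[0, 0])
```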
In [49]:
x = np.array([1, 2, 3])
y = np.array([11, 12, 13])
z = np.array([21, 22, 23])
np.concatenate([x, y, z])
Out[49]:
In [50]:
# Stack two vectors horizontally
np.hstack([x, y])
Out[50]:
In [51]:
# Stack two vectors vertically
np.vstack([x, y])
Out[51]:
In [52]:
# Stack matrix with row vector (z is appended as a new row)
m = np.arange(0, 9, 1).reshape((3, 3))
np.vstack([m, z])
Out[52]:
In [53]:
# Stack matrix with column vector (z is appended as a new column)
np.hstack([m, z.reshape(3, 1)])
Out[53]:
The opposite of concatenating is splitting. NumPy has np.split, np.hsplit and np.vsplit functions. Each of these takes a list of indices, giving the split points, as input.
In [54]:
x = np.arange(8.0)
a, b, c = np.split(x, [3, 5])
print(a, b, c)
In [55]:
x = np.arange(16).reshape(4, 4)
upper, lower = np.vsplit(x, [3])
print(upper, '\n\n', lower)
In [56]:
left, right = np.hsplit(x, [2])
print(left, '\n\n', right)
Boolean operators check an input and return either True (equals 1 as value) or False (equals 0). This is often very helpful if one wants to check for conditions or filter out the part of a data set which meets a certain condition. Here are the common comparison and logical operators:
| Operator | Description |
|---|---|
| == | equal ($=$) |
| != | not equal ($\neq$) |
| < | less than ($<$) |
| <= | less or equal ($\leq$) |
| > | greater ($>$) |
| >= | greater or equal ($\geq$) |
| & | Mathematical AND ($\land$) |
| \| | Mathematical OR ($\lor$) |
| in | element of ($\in$) |
The following sections give a glimpse of how these operators can be used.
In [57]:
x = np.arange(start=0, stop=8, step=1)
print(x)
print(x == 2)
print(x != 3)
print((x < 2) | (x > 6))
In [58]:
# Notice the difference
print(x[x <= 4])
print(x <= 4)
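Because True counts as 1, boolean arrays can be summed to count how many elements meet a condition. The sketch below also shows the `&` and `in` operators from the table, plus `np.isin`, which is the element-wise counterpart of `in` (the name `mask` is chosen for illustration).

```python
import numpy as np

x = np.arange(start=0, stop=8, step=1)

# Combine two conditions element by element (mind the parentheses)
mask = (x >= 2) & (x < 6)
print(mask.sum())     # number of elements that meet both conditions
print(x[mask])        # the elements themselves

# 'in' checks membership for a single value
print(3 in x, 42 in x)

# np.isin checks membership element by element
print(np.isin(x, [1, 5, 42]))
```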
In [59]:
x = 3
if x%2 == 0:
    print(x, 'is an even number')
else:
    print(x, 'is an odd number')
It is also possible to have more than one condition as the next example shows.
In [60]:
x = 20
if x > 0:
    print(x, 'is positive')
elif x < 0:
    print(x, 'is negative')
else:
    print(x, 'is neither strictly positive nor strictly negative')
Combining these two statements would make for a nested if ... else statement.
In [61]:
x = -3
if x > 0:
    if (x%2) == 0:
        print(x, 'is positive and even')
    else:
        print(x, 'is positive and odd')
elif x < 0:
    if (x%2) == 0:
        print(x, 'is negative and even')
    else:
        print(x, 'is negative and odd')
else:
    print(x, 'is 0')
"For" loops iterate over a given sequence. They are very easy to implement as the following example shows. We start with an example and give some explanations afterwards.
For our example, let's assume you want to sum up the integer values of a sequence from 10 to 1 with a loop. There are obviously more efficient ways of doing this, but it serves well as an introductory example. From primary school we know the result is easily calculated as
$$ \begin{equation} \sum_{i=1}^n i = \dfrac{n (n+1)}{2} \qquad \Rightarrow \qquad \dfrac{10 \cdot 11}{2} = 55 \end{equation} $$
In [62]:
seq = np.arange(start=10, stop=0, step=-1)
seqSum = 0
for value in seq:
    seqSum = seqSum + value
seqSum
Out[62]:
A few important notes:
- We initialize `seqSum = 0` here. Otherwise, if we run the code repeatedly we add to the previous total!
- `value` takes on every value in array `seq`. In the first loop value=10, second loop value=9, etc.

Loops can be nested, too. Here's an example.
In [63]:
seq = seq.reshape(2, 5)
seqSum = 0
row, col = seq.shape
for rowIndex in range(0, row):
    for colIndex in range(0, col):
        seqSum = seqSum + seq[rowIndex, colIndex]
seqSum
Out[63]:
In [64]:
seqSum = 0
i = 10
while i >= 1:
    seqSum = seqSum + i
    i = i - 1 # Also: i -= 1
print(seqSum)
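All loop variants above should agree with the closed-form result $n(n+1)/2$. The following sketch puts a loop, the formula, Python's built-in `sum` and NumPy's `.sum()` side by side; the variable names are chosen for illustration.

```python
import numpy as np

n = 10

# Explicit loop, counting down from n to 1
loop_total = 0
for i in range(n, 0, -1):
    loop_total += i

closed_form = n * (n + 1) // 2                             # n(n+1)/2
builtin_total = sum(range(1, n + 1))                       # Python built-in
numpy_total = np.arange(start=n, stop=0, step=-1).sum()    # NumPy method

print(loop_total, closed_form, builtin_total, numpy_total)
```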
In [65]:
def sumOddEven(vector):
    """Calculates sum of odd and even numbers in array.
    Args:
        vector: NumPy array of length n
    Returns:
        odd: Sum of odd numbers
        even: Sum of even numbers
    """
    # Initiate values
    odd = 0
    even = 0
    # Loop through values of array; check for each
    # value whether it is odd or even and add to
    # previous total.
    for value in vector:
        if (value % 2) == 0:
            even = even + value
        else:
            odd = odd + value
    return odd, even
# Initiate array [1, 2, ..., 99, 100]
seq = np.arange(1, 101, 1)
# Apply function and print results
odd, even = sumOddEven(seq)
print('Odd: ', odd, ', ', 'Even: ', even)
The above code snippet not only shows how functions are set up but also displays the importance of comments. Comments are preceded by a hash sign (#), such that the interpreter will not parse what follows the hash. When programming, you should always comment your code to document your work. This details your steps/thoughts/ideas not only for other developers but also for you when you pick up your code some time after writing it. Good programmers make heavy use of commenting and I strongly encourage the reader to follow this standard.
In [66]:
%%timeit
seq = np.arange(1,10001, 1)
sumOddEven(seq)
In [67]:
%%timeit
seq[(seq % 2) == 0].sum()
seq[(seq % 2) == 1].sum()
The above timing results show what was hinted at before: in 9,999 out of 10,000 cases it is significantly faster to use built-in functions than loops. The simple reason is that modules such as NumPy or Pandas use (at their core) optimized compiled code to calculate the results, and this is almost certainly faster than a Python loop.
So in summary: the above examples helped introduce if statements, loops and functions. In real life, however, you should first check whether Python already offers a built-in function for your task. If it does, make sure to use it.
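Outside of a notebook, where the `%%timeit` magic is not available, the same comparison can be sketched with Python's standard `timeit` module. The two helper functions below are hypothetical names introduced for this illustration, and the measured times will vary from machine to machine.

```python
import timeit
import numpy as np

seq = np.arange(1, 10001, 1)

def loop_sum_even(values):
    """Sum of even numbers using an explicit loop."""
    total = 0
    for value in values:
        if (value % 2) == 0:
            total = total + value
    return total

def vector_sum_even(values):
    """Sum of even numbers using boolean indexing and NumPy's sum."""
    return values[(values % 2) == 0].sum()

# Run each version 20 times and report the total time in seconds
loop_time = timeit.timeit(lambda: loop_sum_even(seq), number=20)
vector_time = timeit.timeit(lambda: vector_sum_even(seq), number=20)

print('loop:      ', loop_time)
print('vectorized:', vector_time)
```

Both functions return the same total; only the time needed to get there differs.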
In closing this chapter we briefly introduce NumPy's broadcasting functionality. Rules for matrix arithmetic apply to NumPy arrays as one would expect and it is left to the reader to explore it. Broadcasting, however, goes one step further in that it allows for element-by-element operations on arrays (and matrices) of different dimensions - which under normal rules would not be compatible. An example shows this best.
In [68]:
M = np.ones(shape=(3, 3))
v = np.array([1, 2, 3])
M + v
Out[68]:
In [69]:
# Notice the difference
vecAdd = v + v
broadAdd = v.reshape((3, 1)) + v
print(vecAdd, '\n')
print(broadAdd)
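A typical use of broadcasting is subtracting column or row means from a matrix. The sketch below, with names chosen for illustration, also shows a case where the shapes are not compatible and how reshaping resolves it.

```python
import numpy as np

M = np.arange(6.0).reshape((2, 3))    # 2x3 matrix

# Column means have shape (3,) and broadcast across the 2 rows
col_means = M.mean(axis=0)
centered_cols = M - col_means

# Row means have shape (2,), which does not match the 3 columns
row_means = M.mean(axis=1)
try:
    M - row_means
    compatible = True
except ValueError:
    compatible = False
print('Shapes (2, 3) and (2,) compatible:', compatible)

# Reshaping to a (2, 1) column vector makes the shapes compatible
centered_rows = M - row_means.reshape((2, 1))
print(centered_cols, '\n')
print(centered_rows)
```

NumPy compares shapes from the last dimension backwards; two dimensions are compatible if they are equal or if one of them is 1.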
The following resources, which were consulted to write this notebook, are recommended to better acquaint yourself with Python and NumPy: