Getting Started With Python

Installation

There are various ways to install Python. Assuming the reader is not (yet) well versed in programming, I suggest downloading the Anaconda distribution, which works on all major operating systems (Mac OS X, Windows, Linux) and provides a full-fledged Python installation including the necessary libraries and Jupyter notebooks.

Be sure to download the latest Python 3.x version (not 2.x; backward compatibility is not always given!).

The installation should not cause any problems. If you wish to follow a step-by-step guide (incl. some further insights), see Cyrille Rossant's excellent notebook on the topic.

IPython / Jupyter Notebooks

What you see here is a so-called Jupyter notebook. It makes it possible to interactively combine code with output (results, graphics), markdown text and LaTeX (for mathematical expressions). All code discussed in this course will be provided through such notebooks, and you will soon understand and appreciate the functionality they provide.

  • The basic markdown commands are well summarized here.
  • LaTeX is a typesetting language with extensive capabilities to typeset math. For a basic introduction to math in LaTeX see sections 3.3 - 3.4 (p. 22 - 33) of More Math into LaTeX by Grätzer (2007), available as pdf here.

If you are keen on learning more about IPython/Jupyter, consider this notebook - a well-written introduction by Cyrille Rossant.

Data Types in Python

Building Blocks

To analyze data in Python, data has to be stored as some kind of data type. These data types form the structure of the data and make data easily accessible. Python's basic building blocks are:

  • Numbers (integer, floating point, and complex)
  • Booleans (True/False)
  • Strings
  • Lists
  • Dictionaries
  • Tuples

We will discuss the first four data types above as these are relevant for us.

Python is a dynamically typed language, meaning that - unlike in statically typed languages such as VBA, C, Java etc. - you do not need to explicitly assign a data type to a variable. Python will do that for you. A few examples will explain this best:


In [1]:
a = 42       # In VBA you would first have to define data type, only then the value: Dim a as integer; a = 42
b = 10.3     # VBA: Dim b as Double; b = 10.3
c = 'hello'  # VBA: Dim c as String; c = "hello"
d = True     # VBA: Dim d as Boolean; d = True

print('a: ', type(a))
print('b: ', type(b))
print('c: ', type(c))
print('d: ', type(d))


a:  <class 'int'>
b:  <class 'float'>
c:  <class 'str'>
d:  <class 'bool'>
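
Because a name is not tied to a fixed type, the same variable can refer to objects of different types over time. The following minimal sketch illustrates this; the variable names are made up for illustration and are not used elsewhere in this notebook.

```python
# A name simply refers to whatever object was assigned to it last
value = 42
first_type = type(value)        # int

value = 'forty-two'             # same name, now bound to a string
second_type = type(value)       # str

print(first_type, second_type)

# isinstance() checks the type of the object a name currently refers to
print(isinstance(value, str))   # True
print(isinstance(10.3, float))  # True
```

This flexibility is convenient, but it also means that Python will not warn you if a variable silently changes its type.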

Running Code in Jupyter

How did I execute this code section? To run a single code cell, select the cell and hit Ctrl + Enter. To run the entire notebook, go to Kernel (dropdown menu) and select "Restart & Run All". There are lots of shortcuts that let you handle a Jupyter notebook just from the keyboard. Press H on your keyboard (or go to Help/Keyboard Shortcuts) to see all of the shortcuts.

Simple Arithmetic

Simple arithmetic operations are straightforward:


In [2]:
a = 2 + 4 - 8      # Addition & Subtraction
b = 6 * 7 / 3 - 2  # Multiplication & Division
c = 2**(1/2)       # Exponents & Square root
d = 10 % 3         # Modulus
e = 10 // 3        # Floor division

print(' a =', a, '\n',
      'b =', b, '\n',
      'c =', c, '\n',
      'd =', d, '\n',
      'e =', e)


 a = -2 
 b = 12.0 
 c = 1.4142135623730951 
 d = 1 
 e = 3
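
Two details of the output above are worth a closer look: the result of / is a float even when it is a whole number (b = 12.0), while // and % return integers for integer inputs. The short sketch below, which uses only built-in Python operators, also shows how floor division and modulus behave for negative numbers.

```python
# The / operator always returns a float, even for whole results
quotient = 6 / 3
print(quotient, type(quotient))   # 2.0 <class 'float'>

# // rounds down (toward negative infinity), not toward zero
print(7 // 2, -7 // 2)            # 3 -4

# The result of % takes the sign of the divisor
print(7 % 2, -7 % 2)              # 1 1

# divmod() returns floor division and modulus in one call
print(divmod(10, 3))              # (3, 1)
```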

We can even use arithmetic operators to concatenate strings:


In [3]:
a = 'Hello'
b = 'World!'
print(a + ' ' + b)

print(a * 3)


Hello World!
HelloHelloHello

Lists

Now let's look at lists. Lists are capable of combining multiple data types.


In [4]:
e = ['Calynn', 'Dillon', '10.3', c, d]
print('e: ', type(e))
print([type(item) for item in e])


e:  <class 'list'>
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'float'>, <class 'int'>]

Note that the third element in list e is set in quotation marks, and thus Python interprets it as a string.
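
The practical consequence of storing a number as a string can be sketched as follows. The list below is a hypothetical stand-in for the list e from above; float() is Python's built-in conversion function.

```python
names = ['Calynn', 'Dillon', '10.3']

# The third element is a string, so * repeats it instead of multiplying
repeated = names[2] * 2
print(repeated)                  # 10.310.3

# Converting to float first gives the numerical result
as_number = float(names[2])
doubled = as_number * 2
print(doubled)                   # 20.6

# Lists are mutable: elements can be appended or replaced in place
names.append(as_number)
names[0] = names[0].upper()
print(names, len(names))
```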

NumPy Arrays

NumPy Arrays from Lists

As useful as lists appear, their flexibility comes at a high cost. Because each element contains not only the value itself but also information about the data type, storing data in lists consumes a lot of memory. For this reason the Python community introduced NumPy (short for Numerical Python). Among other things, this package provides fixed-type arrays, which are more efficient than simple lists for storing and operating on dense data.

Fixed-type arrays are dense arrays of uniform type. Uniform here means that all entries of the array have the same data type, e.g. all are floating point numbers. We start by importing the NumPy package (following general convention we import this package under the alias name np) and create some simple NumPy arrays from Python lists.


In [5]:
import numpy as np

# Integer array
np.array([3, 18, 12])


Out[5]:
array([ 3, 18, 12])

In [6]:
# Floating point array
np.array([3., 18, 12])


Out[6]:
array([ 3., 18., 12.])

Similarly, we can explicitly set the data type:


In [7]:
np.array([3, 18, 12], dtype='float32')


Out[7]:
array([ 3., 18., 12.], dtype=float32)

In [8]:
# Multidimensional arrays
np.array([range(i, i + 3) for i in [1, 2, 3]])


Out[8]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

We've seen above that we can define the data type for NumPy arrays. A list of available data types can be found in the NumPy documentation.
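
Since all entries of an array must share one data type, NumPy picks a common type when the input list is mixed. The sketch below illustrates this so-called upcasting; the array names are chosen for illustration only.

```python
import numpy as np

ints = np.array([3, 18, 12])
mixed = np.array([3, 18.5, 12])        # one float upcasts all entries to float
text = np.array([3, 'eighteen', 12])   # one string turns all entries into strings

print(ints.dtype.kind)   # 'i' (integer; the exact width depends on the platform)
print(mixed.dtype)       # float64
print(text.dtype.kind)   # 'U' (unicode string)
print(text)
```

Inspecting dtype right after creating an array is a cheap way to catch such unintended conversions early.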

NumPy Arrays from Scratch

Sometimes it is helpful to create arrays from scratch. Here are some examples:


In [9]:
# Integer array with 8 zeros 
np.zeros(shape=8, dtype='int')


Out[9]:
array([0, 0, 0, 0, 0, 0, 0, 0])

In [10]:
# 2x3 floating-point array filled with 1s
np.ones(shape=(2, 3), dtype='float32')


Out[10]:
array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)

Note that I do not need to write np.ones(shape=(2, 3), dtype='float32') in full to define the array. It is fine to go with the short version np.ones((2, 3), 'float32'): as long as the order of the arguments is correct, Python will understand. However, going with the explicit version helps to make your code more readable and I encourage students to follow this advice.

Each predefined function is documented in a help page. One can display it by calling np.ones? or help(np.ones). If the exact name of the function is not known, it is best to use the wildcard character *, as in np.*ne*?. This will list all functions in the NumPy library which contain 'ne' in their name. In our example, np.ones? shows the signature and description of the function:


In [11]:
# np.ones?

The function description also shows examples of how the function can be used. Often this is very helpful, but for the sake of brevity, these are omitted here.

In the function description we see that the order of arguments is shape, dtype, order. If our inputs are in this order, we do not need to name the arguments. Furthermore, we see that dtype and order are optional, meaning that if left undefined, Python will simply use the default values. This makes for very short code. However, going with the longer explicit version helps to make your code more readable and I encourage students to follow this advice.
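
The difference between positional and keyword arguments can be sketched with np.ones itself. The variable names below are chosen for illustration; the calls only use the shape and dtype arguments described above.

```python
import numpy as np

# Keyword arguments: explicit and readable, order does not matter
explicit = np.ones(dtype='float32', shape=(2, 3))

# Positional arguments: shorter, but the order shape, dtype must be respected
short = np.ones((2, 3), 'float32')

# Optional arguments left out fall back to their defaults (float64 for dtype)
default = np.ones((2, 3))

print(np.array_equal(explicit, short))   # True
print(explicit.dtype, default.dtype)     # float32 float64
```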


In [12]:
# 3x2 array filled with 2.71
np.full(shape=(3, 2), fill_value=2.71)


Out[12]:
array([[2.71, 2.71],
       [2.71, 2.71],
       [2.71, 2.71]])

In [13]:
# 2x2 boolean array filled with 'True'
np.full((2, 2), 1, bool)


Out[13]:
array([[ True,  True],
       [ True,  True]])

In [14]:
# Array filled with linear sequence
np.arange(start = 0, stop = 1, step = 0.1)  # or simply np.arange(0, 1, 0.1)


Out[14]:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [15]:
# Array of evenly spaced values
np.linspace(start = 0, stop = 1, num = 4)


Out[15]:
array([0.        , 0.33333333, 0.66666667, 1.        ])

Arrays with random variables are easily created. Below are three examples. See the numpy.random documentation page for details on how to generate other arrays of random variables.


In [16]:
# 4x4 array of uniformly distributed random variables
np.random.random((4, 4))


Out[16]:
array([[0.3100082 , 0.76968533, 0.02175423, 0.80573935],
       [0.58202484, 0.88288993, 0.03688884, 0.44580011],
       [0.68589452, 0.88908094, 0.76026142, 0.51912257],
       [0.34875867, 0.21215201, 0.45880794, 0.55672555]])

In [17]:
# 3x3 array of normally distributed random variables (with mean = 4, sd = 6)
np.random.normal(loc = 4, scale = 6, size = (3, 3))


Out[17]:
array([[ 7.66926691,  2.89977709, 12.71007316],
       [13.32236512,  7.22990312,  7.69309056],
       [ 3.97128896,  0.43381839,  6.57577244]])

In [18]:
# 3x3 array of random integers in interval [0, 15)
np.random.randint(low = 0, high = 15, size = (3, 3))


Out[18]:
array([[ 3,  5,  7],
       [12,  4, 11],
       [ 4,  0,  8]])

In [19]:
# 4x4 identity matrix
np.eye(4)


Out[19]:
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
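
The random arrays above change with every run. As a hedged aside, the sketch below uses NumPy's newer Generator interface (np.random.default_rng, available from NumPy 1.17 onward), which this notebook does not otherwise use, to draw the same three kinds of random arrays reproducibly. The seed 1234 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=1234)   # seeded generator for reproducibility

uniform = rng.random((2, 2))                          # uniform on [0, 1)
normal = rng.normal(loc=4, scale=6, size=(2, 2))      # mean 4, sd 6
integers = rng.integers(low=0, high=15, size=(2, 2))  # integers in [0, 15)

# A second generator with the same seed reproduces the same numbers
same_again = np.random.default_rng(seed=1234).random((2, 2))
print(np.array_equal(uniform, same_again))   # True
```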

NumPy Array Attributes

Each NumPy array has certain attributes.

Here are some attributes we can call:

Attribute   Description
ndim        No. of dimensions
shape       Size of each dimension
size        Total no. of elements
dtype       Data type of array
itemsize    Size of each element (in bytes)
nbytes      Total size of array (in bytes)

To show how one can access them we'll define three arrays.


In [20]:
np.random.seed(1234)  # Set seed for reproducibility

x = np.random.randint(10, size = 6)          # 1-dimensional array (vector)
y = np.random.randint(10, size = (3, 4))     # 2-dimensional array (matrix)
z = np.random.randint(10, size = (3, 4, 5))  # 3-dimensional array

We first print the three arrays and then call their attributes:


In [21]:
print(' ', x, '\n\n', y, '\n\n', z)


  [3 6 5 4 8 9] 

 [[1 7 9 6]
 [8 0 5 0]
 [9 6 2 0]] 

 [[[5 2 6 3 7]
  [0 9 0 3 2]
  [3 1 3 1 3]
  [7 1 7 4 0]]

 [[5 1 5 9 9]
  [4 0 9 8 8]
  [6 8 6 3 1]
  [2 5 2 5 6]]

 [[7 4 3 5 6]
  [4 6 2 4 2]
  [7 9 7 7 2]
  [9 7 4 9 0]]]

In [22]:
print('ndim:      ', z.ndim)
print('shape:     ', z.shape)
print('size:      ', z.size)
print('data type: ', z.dtype)
print('itemsize:  ', z.itemsize)
print('nbytes:    ', z.nbytes)


ndim:       3
shape:      (3, 4, 5)
size:       60
data type:  int32
itemsize:   4
nbytes:     240
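
The attributes are related: nbytes equals size times itemsize. The sketch below checks this on an array with the same shape as z; the dtype is set explicitly because the default integer type depends on the platform.

```python
import numpy as np

arr = np.zeros(shape=(3, 4, 5), dtype='int32')

print(arr.size)                             # 60 = 3 * 4 * 5
print(arr.itemsize)                         # 4 bytes per int32 element
print(arr.nbytes)                           # 240
print(arr.nbytes == arr.size * arr.itemsize)  # True

# The same values stored as int64 need twice the memory
arr64 = arr.astype('int64')
print(arr64.itemsize, arr64.nbytes)         # 8 480
```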

Index: How to Access Elements

What might be a bit counterintuitive at the beginning is that Python's indexing starts at 0. Other than that, accessing the $i$'th element (starting at 0) of a list or an array is straightforward.


In [23]:
print(e, '\n')  # List from above
print(x, '\n')  # One dimensional np array from above
print(y, '\n')  # Two dimensional np array from above


['Calynn', 'Dillon', '10.3', 1.4142135623730951, 1] 

[3 6 5 4 8 9] 

[[1 7 9 6]
 [8 0 5 0]
 [9 6 2 0]] 


In [24]:
e[2]


Out[24]:
'10.3'

In [25]:
e[2] * 2


Out[25]:
'10.310.3'

In [26]:
x[5]


Out[26]:
9

In [27]:
y[2, 0]  # Note again that [m, n] starts counting for both rows (m) as well as columns (n) from 0


Out[27]:
9

To access the end of an array, you can also use negative indices:


In [28]:
e[-1]


Out[28]:
1

In [29]:
y[-2, 2]


Out[29]:
5

Lists or arrays of indices can also be used as inputs:


In [30]:
ind = [3, 5, -4]
x[ind]


Out[30]:
array([4, 9, 5])

In [31]:
x = np.arange(12).reshape((3, 4))
print(x)
row = np.array([1, 2])
col = np.array([0, 3])
x[row, col]


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Out[31]:
array([ 4, 11])

Knowing the index we can also replace elements of an array:


In [32]:
x[0] = 99
x


Out[32]:
array([[99, 99, 99, 99],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

IMPORTANT NOTE:

NumPy arrays have a fixed type. This means that e.g. if you insert a floating-point value into an integer array, the value will be truncated!


In [33]:
x[0] = 3.14159; x


Out[33]:
array([[ 3,  3,  3,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
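
If the decimal places matter, the array has to be converted to a floating-point type before the assignment. The following minimal sketch shows one way to do this with the astype method; the array names are illustrative.

```python
import numpy as np

counts = np.array([1, 2, 3])
counts[0] = 3.99           # stored as 3: the decimal part is cut off, not rounded
print(counts)              # [3 2 3]

# astype returns a new array with the requested data type
measurements = counts.astype('float64')
measurements[0] = 3.99     # now the value is kept as is
print(measurements)
```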

Array Slicing

We can also use square brackets to access a subset of the data. The syntax is:

x[start:stop:step]

The default values are: start=0, stop='size of dimension', step=1


In [34]:
x = np.arange(10)
x


Out[34]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [35]:
x[:3]  # First three elements


Out[35]:
array([0, 1, 2])

In [36]:
x[7:]  # Elements from index 7 onward


Out[36]:
array([7, 8, 9])

In [37]:
x[4:8]  # Indices 4 to 7, i.e. the 5th to 8th element (the stop index is excluded)


Out[37]:
array([4, 5, 6, 7])

In [38]:
x[::2]  # Every second element, starting at index 0


Out[38]:
array([0, 2, 4, 6, 8])

In [39]:
x[1::2]  # Every second element, starting at index 1


Out[39]:
array([1, 3, 5, 7, 9])

In [40]:
x[::-1]  # All elements reversed


Out[40]:
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [41]:
x[::-2]  # Every second element in reverse order, starting at the last element


Out[41]:
array([9, 7, 5, 3, 1])

Array slicing works the same for multidimensional arrays.


In [42]:
y  # from above


Out[42]:
array([[1, 7, 9, 6],
       [8, 0, 5, 0],
       [9, 6, 2, 0]])

In [43]:
y[:2, :3]  # Rows 0 and 1, columns 0, 1, 2


Out[43]:
array([[1, 7, 9],
       [8, 0, 5]])

In [44]:
y[:, 2]  # Third column


Out[44]:
array([9, 5, 2])

In [45]:
y[0, :]  # First row


Out[45]:
array([1, 7, 9, 6])
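
The step argument and negative indices work per dimension as well. The sketch below uses a small illustrative array (not one of the arrays defined above) to show a few combinations.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
print(grid)

every_other_col = grid[:, ::2]   # all rows, columns 0 and 2
last_row = grid[-1, :]           # last row via negative index
flipped_rows = grid[::-1, :]     # rows in reverse order

print(every_other_col)
print(last_row)                  # [ 8  9 10 11]
print(flipped_rows)
```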

IMPORTANT NOTE:

When slicing and assigning part of an existing array to a new variable, the new variable will only hold a "view" but not a copy. This means that if you change a value in the new array, the original array will also be changed. The idea behind this is to save memory. But fear not: with the ".copy()" method, you can still get a true copy.

Here are a few corresponding examples for better understanding:


In [46]:
ySub = y[:2, :2]
print(ySub)


[[1 7]
 [8 0]]

In [47]:
ySub[0, 0] = 99
print(ySub, '\n')
print(y)


[[99  7]
 [ 8  0]] 

[[99  7  9  6]
 [ 8  0  5  0]
 [ 9  6  2  0]]

In [48]:
ySubCopy = y[:2, :2].copy()
ySubCopy[0, 0] = 33
print(ySubCopy, '\n')
print(y)


[[33  7]
 [ 8  0]] 

[[99  7  9  6]
 [ 8  0  5  0]
 [ 9  6  2  0]]
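
Whether two arrays refer to the same data can also be checked directly. The sketch below uses np.shares_memory for this; the array names are illustrative. It also shows that indexing with a list of indices, unlike slicing, returns a copy.

```python
import numpy as np

base = np.arange(6)
view = base[:3]              # slice: a view on the data of base
copy = base[:3].copy()       # explicit copy
picked = base[[0, 1, 2]]     # indexing with a list of indices: a copy

print(np.shares_memory(base, view))     # True
print(np.shares_memory(base, copy))     # False
print(np.shares_memory(base, picked))   # False

view[0] = 99                 # changes base as well, but not the copies
print(base[0], copy[0], picked[0])      # 99 0 0
```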

Concatenating, Stacking and Splitting

Often it is useful to combine multiple arrays into one or to split a single array into multiple arrays. To combine arrays, we can use NumPy's concatenate and vstack/hstack functions.


In [49]:
x = np.array([1, 2, 3])
y = np.array([11, 12, 13])
z = np.array([21, 22, 23])
np.concatenate([x, y, z])


Out[49]:
array([ 1,  2,  3, 11, 12, 13, 21, 22, 23])

In [50]:
# Stack two vectors horizontally
np.hstack([x, y])


Out[50]:
array([ 1,  2,  3, 11, 12, 13])

In [51]:
# Stack two vectors vertically
np.vstack([x, y])


Out[51]:
array([[ 1,  2,  3],
       [11, 12, 13]])

In [52]:
# Stack matrix with row vector (appended as a new row)
m = np.arange(0, 9, 1).reshape((3, 3))
np.vstack([m, z])


Out[52]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [21, 22, 23]])

In [53]:
# Stack matrix with column vector (appended as a new column)
np.hstack([m, z.reshape(3, 1)])


Out[53]:
array([[ 0,  1,  2, 21],
       [ 3,  4,  5, 22],
       [ 6,  7,  8, 23]])

The opposite of concatenating is splitting. NumPy has the np.split, np.hsplit and np.vsplit functions. Each of these takes a list of indices, giving the split points, as input.


In [54]:
x = np.arange(8.0)
a, b, c = np.split(x, [3, 5])
print(a, b, c)


[0. 1. 2.] [3. 4.] [5. 6. 7.]

In [55]:
x = np.arange(16).reshape(4, 4)
upper, lower = np.vsplit(x, [3])
print(upper, '\n\n', lower)


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 

 [[12 13 14 15]]

In [56]:
left, right = np.hsplit(x, [2])
print(left, '\n\n', right)


[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]] 

 [[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
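
Instead of a list of split points, np.split also accepts an integer, in which case the array is divided into that many equal parts. The sketch below shows this, together with np.array_split, which allows parts of unequal length. The array names are illustrative.

```python
import numpy as np

x = np.arange(9)

# An integer splits the array into that many equally sized parts;
# np.split raises a ValueError if this is not possible
thirds = np.split(x, 3)
print(thirds)

# np.array_split allows parts of unequal length
uneven = np.array_split(np.arange(10), 3)
print([len(part) for part in uneven])    # [4, 3, 3]

# Concatenating the parts restores the original array
restored = np.concatenate(thirds)
print(np.array_equal(restored, x))       # True
```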

Conditions

Boolean Operators

Boolean operators check an input and return either True (equals 1 as value) or False (equals 0). This is often very helpful if one wants to check for conditions or select the part of a data set which meets a certain condition. Here are the common comparison and logical operators:

Operator   Description
==         equal ($=$)
!=         not equal ($\neq$)
<          less than ($<$)
<=         less or equal ($\leq$)
>          greater than ($>$)
>=         greater or equal ($\geq$)
&          element-wise logical AND ($\land$)
|          element-wise logical OR ($\lor$)
in         element of ($\in$)

The following sections give a glimpse of how these operators can be used.


In [57]:
x = np.arange(start=0, stop=8, step=1)
print(x)
print(x == 2)
print(x != 3)
print((x < 2) | (x > 6))


[0 1 2 3 4 5 6 7]
[False False  True False False False False False]
[ True  True  True False  True  True  True  True]
[ True  True False False False False False  True]

In [58]:
# Notice the difference
print(x[x <= 4])
print(x <= 4)


[0 1 2 3 4]
[ True  True  True  True  True False False False]
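
Boolean arrays (masks) can be combined, counted and used to select values. The following minimal sketch, with illustrative names, shows the remaining operators from the table; np.where is a NumPy function not introduced above that picks between two values depending on a condition.

```python
import numpy as np

values = np.arange(8)

# Each comparison must be put in parentheses when combined with & or |
mask = (values > 1) & (values < 6)
print(values[mask])          # [2 3 4 5]

# True counts as 1, so the sum of a mask is the number of matches
print(mask.sum())            # 4

# np.where chooses element by element between two alternatives
labels = np.where(values % 2 == 0, 'even', 'odd')
print(labels)

# in checks whether a value is contained in the array
print(3 in values, 42 in values)   # True False
```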

If ... else statements

These statements check a given condition and, depending on the result (True, False), execute the corresponding block of code. As usual, an example will do. Notice that indentation is necessary for Python to correctly interpret the code.


In [59]:
x = 3

if x%2 == 0:
    print(x, 'is an even number')
else:
    print(x, 'is an odd number')


3 is an odd number

It is also possible to have more than one condition as the next example shows.


In [60]:
x = 20

if x > 0:
    print(x, 'is positive')
elif x < 0:
    print(x, 'is negative')
else:
    print(x, 'is neither strictly positive nor strictly negative')


20 is positive

Combining these two statements would make for a nested if ... else statement.


In [61]:
x = -3
if x > 0:
    if (x%2) == 0:
        print(x, 'is positive and even')
    else:
        print(x, 'is positive and odd')
elif x < 0:
    if (x%2) == 0:
        print(x, 'is negative and even')
    else:
        print(x, 'is negative and odd')
else:
    print(x, 'is 0')


-3 is negative and odd
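
For simple either/or decisions, Python also offers a one-line conditional expression. As a sketch, the nested statement above could be written more compactly as follows; the function name describe is hypothetical and chosen for illustration only.

```python
def describe(number):
    """Return a short description of sign and parity of an integer."""
    if number == 0:
        return 'is 0'
    # Conditional expression: <value if True> if <condition> else <value if False>
    sign = 'positive' if number > 0 else 'negative'
    parity = 'even' if number % 2 == 0 else 'odd'
    return 'is ' + sign + ' and ' + parity

for x in [-3, 0, 20]:
    print(x, describe(x))
```

The nested version is easier to follow for beginners; the compact version avoids repeating the parity check for each sign.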

Loops

"For" Loops

"For" loops iterate over a given sequence. They are very easy to implement as the following example shows. We start with an example and give some explanations afterwards.

For our example, let's assume you are asked to sum up the integers from 10 down to 1 with a loop. There are obviously more efficient ways of doing this, but it serves well as an introductory example. From primary school we know the result is easily calculated as

$$ \sum_{i=1}^{n} i = \dfrac{n (n+1)}{2} \qquad \Rightarrow \qquad \sum_{i=1}^{10} i = \dfrac{10 \cdot 11}{2} = 55 $$

In [62]:
seq = np.arange(start=10, stop=0, step=-1)
seqSum = 0
for value in seq:
    seqSum = seqSum + value

seqSum


Out[62]:
55

A few important notes:

  • Indentation is not just here for better readability of the code but it is actually necessary for Python to correctly interpret the code.
  • We initialize seqSum = 0 before the loop. Without this, the first run would fail because seqSum is not yet defined, and if we ran the loop repeatedly we would add to the previous total!
  • value takes on every value in array seq. In the first loop value=10, second loop value=9, etc.

Loops can be nested, too. Here's an example.


In [63]:
seq = seq.reshape(2, 5)
seqSum = 0
row, col = seq.shape

for rowIndex in range(0, row):
    for colIndex in range(0, col):
        seqSum = seqSum + seq[rowIndex, colIndex]
        
seqSum


Out[63]:
55

"While" Loops

"While" loops execute as long as a certain boolean condition is met. Picking up the above example we can formulate the following loop:


In [64]:
seqSum = 0
i = 10
while i >= 1:
    seqSum = seqSum + i
    i = i - 1  # Also: i -= 1
    
print(seqSum)


55
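
Both loop types also work without NumPy. The sketch below uses only built-in Python functions (range and enumerate) and illustrative variable names: it repeats the sum from above and then uses a while loop to find how many integers are needed to reach that total.

```python
# range(start, stop, step) generates 10, 9, ..., 1
total = 0
for i in range(10, 0, -1):
    total += i
print(total)                      # 55

# enumerate() yields the position and the value at the same time
for position, letter in enumerate(['a', 'b', 'c']):
    print(position, letter)

# While loop: add 1, 2, 3, ... until the running total reaches 55
running_total = 0
n = 0
while running_total < 55:
    n += 1
    running_total += n
print(n)                          # 10
```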

Functions

Functions come into play when either a task needs to be performed more than once or when it helps to reduce the complexity of a code.

Following up on our toy examples from above, let us assume we are tasked to write a function which sums up all even and all odd integers of a vector.


In [65]:
def sumOddEven(vector):
    """Calculates sum of odd and even numbers in array.
    
    Args:
        vector: NumPy array of length n
        
    Returns:
        odd: Sum of odd numbers
        even: Sum of even numbers
    """
    
    # Initiate values
    odd = 0
    even = 0
    
    # Loop through values of array; check for each
    # value whether it is odd or even and add to 
    # previous total.
    for value in vector:
        if (value % 2) == 0:
            even = even + value
        else:
            odd = odd + value
    
    return odd, even

# Initiate array [1, 2, ..., 99, 100]
seq = np.arange(1, 101, 1)

# Apply function and print results
odd, even = sumOddEven(seq)
print('Odd: ', odd, ', ', 'Even: ', even)


Odd:  2500 ,  Even:  2550

Commenting

The above code snippet not only shows how functions are set up but also displays the importance of comments. Comments are preceded by a hash sign (#), such that the interpreter will not parse what follows the hash. When programming, you should always comment your code to document your work. This details your steps/thoughts/ideas not only for other developers but also for you when you pick up your code some time after writing it. Good programmers make heavy use of commenting and I strongly encourage the reader to follow this standard.

Slowness of Loops

It is at this point important to note that loops should only be used as a last resort. Below we show why. The first code runs our previously defined function. The second code uses NumPy's built-in function.


In [66]:
%%timeit
seq = np.arange(1,10001, 1)
sumOddEven(seq)


7.02 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [67]:
%%timeit
seq[(seq % 2) == 0].sum()
seq[(seq % 2) == 1].sum()


12.8 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The above timing results show what was hinted at before: In 9'999 out of 10'000 cases it is significantly faster to use built-in functions than loops. The simple reason is that modules such as NumPy or Pandas use (at their core) optimized compiled code to calculate the results, and this is almost always faster than a Python loop.

So in summary: The above examples helped introduce if statements, loops and functions. In real life, however, you should first check whether Python or one of its packages already offers a built-in function for your task. If yes, make sure to use it.
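
Outside of a notebook, where the %%timeit magic is not available, a similar comparison can be sketched with the timeit module from Python's standard library. The two function names below are hypothetical; both functions receive the same input array so that the comparison is like for like. No timings are shown here since they depend on the machine.

```python
import timeit
import numpy as np

def sum_odd_even_loop(vector):
    """Sum odd and even numbers with an explicit Python loop."""
    odd, even = 0, 0
    for value in vector:
        if value % 2 == 0:
            even += value
        else:
            odd += value
    return int(odd), int(even)

def sum_odd_even_vectorized(vector):
    """Sum odd and even numbers with boolean masks (no Python loop)."""
    is_even = (vector % 2) == 0
    return int(vector[~is_even].sum()), int(vector[is_even].sum())

seq = np.arange(1, 10001)

# Both approaches must return the same result
loop_result = sum_odd_even_loop(seq)
vectorized_result = sum_odd_even_vectorized(seq)
print(loop_result == vectorized_result)   # True

# Average run time in seconds over 10 calls each
loop_time = timeit.timeit(lambda: sum_odd_even_loop(seq), number=10) / 10
vectorized_time = timeit.timeit(lambda: sum_odd_even_vectorized(seq), number=10) / 10
print('loop:', loop_time, 'vectorized:', vectorized_time)
```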

Broadcasting

Computations on Arrays

In closing this chapter we briefly introduce NumPy's broadcasting functionality. Rules for matrix arithmetic apply to NumPy arrays as one would expect and it is left to the reader to explore it. Broadcasting, however, goes one step further in that it allows for element-by-element operations on arrays (and matrices) of different dimensions - which under normal rules would not be compatible. An example shows this best.


In [68]:
M = np.ones(shape=(3, 3))
v = np.array([1, 2, 3])
M + v


Out[68]:
array([[2., 3., 4.],
       [2., 3., 4.],
       [2., 3., 4.]])

In [69]:
# Notice the difference
vecAdd = v + v
broadAdd = v.reshape((3, 1)) + v

print(vecAdd, '\n')
print(broadAdd)


[2 4 6] 

[[2 3 4]
 [3 4 5]
 [4 5 6]]
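
A typical use of broadcasting is to subtract column or row means from a data matrix. The sketch below uses a small made-up matrix to show which shapes are compatible and what happens when they are not.

```python
import numpy as np

data = np.array([[1., 2., 3.],
                 [4., 5., 6.]])                  # shape (2, 3)

# Shape (3,) is stretched along the rows: one mean per column
col_means = data.mean(axis=0)                    # [2.5 3.5 4.5]
col_centered = data - col_means

# Shape (2, 1) is stretched along the columns: one mean per row
row_means = data.mean(axis=1, keepdims=True)     # [[2.] [5.]]
row_centered = data - row_means

print(col_centered)
print(row_centered)

# Shapes (2, 3) and (2,) cannot be broadcast: NumPy raises a ValueError
try:
    data + np.ones(2)
except ValueError as error:
    print('Incompatible shapes:', error)
```

The general rule is that shapes are compared from the last dimension backwards, and two dimensions are compatible if they are equal or if one of them is 1.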

Further Resources

The following resources, which were consulted to write this notebook, are recommended to better acquaint yourself with Python and NumPy:

  • Vanderplas, Jake, 2016, Python Data Science Handbook (O'Reilly Media, Sebastopol, CA).
  • Sheppard, Kevin, 2017, Introduction to Python for Econometrics, Statistics and Data Analysis, available at https://www.kevinsheppard.com/images/b/b3/Python_introduction-2016.pdf, accessed 07/07/2017.
  • Paarsch, Harry J., and Golyaev, Konstantin, 2016, A Gentle Introduction to Effective Computing in Quantitative Research: What Every Research Assistant Should Know, MIT Press, Cambridge, MA.