LSESU Applicable Maths Python Lesson 6

29/11/16

Today is all about handling and generating data. We'll be looking at the first principles of two packages you should know about for handling data in Python:

* NumPy
* Pandas

Run the appropriate version of the following commands ASAP to get yourself set up!


In [ ]:
# Run this if you are using a Mac machine or have multiple versions of Python installed
!pip3 install numpy pandas matplotlib pandas_datareader --upgrade

In [ ]:
# Run this if you are using a Windows machine
!pip install numpy==1.11.1 pandas==0.19.0 matplotlib==1.5.3 pandas_datareader==0.2.1 --upgrade

In [ ]:
# Everyone run this block
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from pandas_datareader import data as web
%matplotlib inline

Recap from last week

We looked at the basics of Object Oriented Programming, or OOP, last week. If you couldn't make it, don't be concerned because the content from last week won't affect what we will be looking at today.

class Human(object):
    def __init__(self,name,age,height):
        self.name = name
        self.age = age
        self.height = height

    def __lt__(self,other):
        return self.age < other.age
    def __le__(self,other):
        return self.age <= other.age
    def __gt__(self,other):
        return self.age > other.age
    def __ge__(self,other):
        return self.age >= other.age
    def __eq__(self,other):
        return self.age==other.age

    def age_in_dog_years(self):
        return 7*self.age

NumPy

NumPy is the standard mathematical and scientific computing package for Python, and it is essential if you want to write efficient, readable numerical code. NumPy includes an optimised array type as well as linear algebra, Fourier transform and random number capabilities.

Under the hood, much of NumPy is written in compiled C, with linear algebra handed off to optimised BLAS/LAPACK libraries. Operations on whole arrays therefore run far faster than the equivalent loops in native Python.
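To make the speed claim concrete, here is a minimal sketch that computes a sum of squares twice: once with a native Python loop and once with a single NumPy call. The function names and the array size are illustrative choices, and the timings you see will depend on your machine.

```python
import timeit

import numpy as np

# 100,000 floats: 0.0, 1.0, 2.0, ...
values = np.arange(100_000, dtype=np.float64)


def loop_sum_of_squares(data):
    """Sum of squares using a native Python loop."""
    total = 0.0
    for v in data:
        total += v * v
    return total


def numpy_sum_of_squares(data):
    """Sum of squares using one NumPy call (runs in compiled code)."""
    return float(np.dot(data, data))


loop_result = loop_sum_of_squares(values)
numpy_result = numpy_sum_of_squares(values)

# Time three runs of each approach
loop_time = timeit.timeit(lambda: loop_sum_of_squares(values), number=3)
numpy_time = timeit.timeit(lambda: numpy_sum_of_squares(values), number=3)

print('Same answer:', np.isclose(loop_result, numpy_result))
print('Loop time :', loop_time)
print('NumPy time:', numpy_time)
```

Both versions give the same answer; only the time taken differs.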

Link to NumPy documentation

The main NumPy feature - the Array type

The NumPy array is a grid of values, all of the same type. This is different to Python lists, which can have elements of different types.
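A short sketch of what "all of the same type" means in practice: when a list with mixed types is converted, NumPy picks one type that can hold every element. The variable names here are just for illustration.

```python
import numpy as np

# A Python list can mix ints and floats
mixed_list = [1, 2.5, 3]
print([type(v).__name__ for v in mixed_list])  # ['int', 'float', 'int']

# Converting to an array upgrades every element to float
as_array = np.array(mixed_list)
print(as_array.dtype)  # float64

# Mixing numbers and text turns every element into text
text_array = np.array([1, 'two', 3])
print(text_array.dtype.kind)  # U (unicode text)
print(text_array[0] == '1')   # True
```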

Arrays are indexed similarly to lists, with each dimension being indexed from zero. When declaring an array, be clear in your mind about the dimensions you need.

np.ones((3,4),int) 
    -> 3 is the number of ROWS
    -> 4 is the number of COLUMNS
    -> int is the type of the array elements
--
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])

In [ ]:
# Creating an array of all zeroes

np.zeros((3,4),int)

In [ ]:
print('a. np.zeros((2,2),int)')
a = np.zeros((2,2),int) # Create an array of integer zeroes
print(a)

print('b. np.ones((1,2))')
b = np.ones((1,2)) # Create an array of ones (floats by default)
print(b)

print('c. np.full((2,2), 7)')
c = np.full((2,2), 7) # Create a constant array
print(c)

print('d. np.eye(2)')
d = np.eye(2) # Create a 2 by 2 identity matrix
print(d)
    
print('e. np.random.random((2,2))')
e = np.random.random((2,2)) 
print(e)

In general you can follow the format below for declaring most NumPy arrays:

np.<constructor>(shape, dtype)

Here <constructor> stands for a function such as zeros, ones or full (np.full takes an extra fill_value argument straight after shape). The shape is declared as a tuple like (3,4), and dtype is the type of the data, which is constant across the array but doesn't have to be a number.

Try replacing int in the declaration of array a with str.
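As a rough sketch of what the dtype argument does, the block below builds the same shape with a few different types. The variable names are illustrative and not part of the exercise above.

```python
import numpy as np

a_int = np.zeros((2, 2), int)    # integer zeroes
a_str = np.zeros((2, 2), str)    # "zero" for text is the empty string
a_bool = np.ones((2, 2), bool)   # "one" for booleans is True
c_float = np.full((2, 2), 7, dtype=float)  # fill_value comes after shape

for name, arr in [('int', a_int), ('str', a_str),
                  ('bool', a_bool), ('float', c_float)]:
    print(name, arr.dtype)
    print(arr)
    print()
```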


In [ ]:
# You can also declare NumPy arrays using standard Python
# Lists (and Lists of Lists)

l1 = [1,2,3,4]
l2 = [5,6,7,8]
l3 = [9,10,11,12]
l = [l1,l2,l3]

print(l)

l_array = np.array(l)

#print(l_array)

Indexing a NumPy array


In [ ]:
# For a list of lists we would use the [][] notation
upper_left_val = l[0][0]
print(upper_left_val)
print()

# We use a single [] with np and separate dimensions by ,
upper_left_val_np = l_array[0,0]
print(upper_left_val_np)
print()

# You can also slice arrays as so
print(l_array[0:2,1:3])
print()

# And use the shape attribute of the object to understand
# size 
print(l_array.shape)
print()

# Or the dtype attribute to inspect the type of the array
print(l_array.dtype)
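A few more indexing patterns, sketched on the same 3 by 4 grid of numbers. One thing worth knowing is that a slice of a NumPy array is a view onto the original data rather than a copy, so the last part shows how to take an independent copy. The variable names are illustrative.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

# A colon on its own selects everything along that dimension
second_row = grid[1, :]       # [5 6 7 8]
third_column = grid[:, 2]     # [ 3  7 11]

# Negative indices count from the end, as with lists
bottom_right = grid[-1, -1]   # 12

# A slice is a view: changing it changes the original array
block = grid[0:2, 1:3]
block[0, 0] = 99
print(grid[0, 1])             # 99

# Use .copy() when you want an independent array
independent = grid[0:2, 1:3].copy()
independent[0, 0] = -1
print(grid[0, 1])             # still 99
```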

In [ ]:
# Using the arange function you can generate evenly
# spaced integers (the end value, 10, is not included)
lin_space_int = np.arange(1,10)
print(lin_space_int)
print()

# Or specify a step to create different spacings
lin_space_new = np.arange(1,10,0.5)
print(lin_space_new)
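np.arange asks for a step size and leaves out the end value. If you would rather say how many points you want, np.linspace is a closely related function that includes the end value by default. A small sketch comparing the two:

```python
import numpy as np

# arange: choose the step, the end value (1) is excluded
by_step = np.arange(0, 1, 0.25)
print(by_step)    # [0.   0.25 0.5  0.75]

# linspace: choose the number of points, the end value (1) is included
by_count = np.linspace(0, 1, 5)
print(by_count)   # [0.   0.25 0.5  0.75 1.  ]
```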

Array mathematics


In [ ]:
# Declare two example arrays
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# By default, operations are element wise in NumPy

# Addition, two options
print(x+y)
#print(np.add(x,y))
print()

# Subtraction
print(x-y)
#print(np.subtract(x,y))
print()

# Product
print(x*y)
#print(np.multiply(x,y))
print()

# Division
print(x/y)
#print(np.divide(x,y))
print()

# Square Root
print(np.sqrt(x))
print()

## For Matrix operations, use the set of NumPy functions

# Dot product
print(x.dot(y))
#print(np.dot(x,y))
print()

# You can also sum across dimensions easily
print(np.sum(x)) # For every element
print()

print(np.sum(x,axis=0)) # Down each column
print()

# Or transpose a Matrix
print(x)
print()
print(x.T)
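Two points from the cell above are easy to mix up, so here is a short sketch using the same x and y: the difference between the element-wise product and the matrix product, and what the axis argument of np.sum does. The @ operator used here is an alternative spelling of the matrix product available in Python 3.5 and later.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

# Element-wise product: multiplies matching entries
elementwise = x * y
print(elementwise)       # [[ 5. 12.] [21. 32.]]

# Matrix product: rows of x times columns of y
matrix_product = x @ y   # same result as x.dot(y)
print(matrix_product)    # [[19. 22.] [43. 50.]]

# axis=0 sums down the rows, giving one total per column
column_totals = np.sum(x, axis=0)
print(column_totals)     # [4. 6.]

# axis=1 sums across the columns, giving one total per row
row_totals = np.sum(x, axis=1)
print(row_totals)        # [3. 7.]
```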

Challenge

Declare a 5 by 5 array of any numbers you want using one of the methods we've discussed above. Then read this NumPy documentation and, when you are ready, print the mean, standard deviation, minimum and maximum of your array.


In [ ]:
# TO DO 
# You can declare an array of random numbers or start with a list of lists
# Check your array is the right size by printing the .shape attribute



# Print the mean


# Print the standard deviation 


# Print the minimum


# Print the maximum



# END TODO

Pandas

Pandas is a data manipulation package that we've glimpsed before. If you know R, then the Pandas DataFrame type will be very familiar. If not, you can think of a DataFrame as a spreadsheet-like object which is much easier to manipulate and work with than lists of dictionaries (or dictionaries of lists!).
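To give a feel for the comparison with lists of dictionaries, the sketch below computes the same average both ways. The records are made-up example data.

```python
import pandas as pd

# Made-up example records, stored as a list of dictionaries
records = [
    {'name': 'Ann', 'age': 21, 'height': 1.70},
    {'name': 'Bob', 'age': 25, 'height': 1.82},
    {'name': 'Cat', 'age': 32, 'height': 1.65},
]

# With native Python we loop over the records ourselves
mean_age_manual = sum(r['age'] for r in records) / len(records)

# With Pandas the same records become a table with named columns
people = pd.DataFrame(records)
mean_age_pandas = people['age'].mean()

print(mean_age_manual)   # 26.0
print(mean_age_pandas)   # 26.0
print(people.shape)      # (3, 3)
```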

Link to Pandas documentation

The main Pandas feature - the DataFrame type


In [ ]:
# Pandas DataFrames can be initialised using NumPy objects
# or with native Python objects.

# Using NumPy
df1 = pd.DataFrame(np.random.randn(3,4),columns=list('ABCD'))
print(df1)

# Or with a dictionary of list-like values
my_dict = {
    'A':[1,2,3,4],
    'B':['2016-12-25','2015-12-25','2014-12-25','2013-12-25'],
    'C':pd.Series(1,index=list(range(4)),dtype="float32"),
    'D':pd.Categorical(['Test','Test','Test','Train']),
    'E':[True,True,False,False]
}
df2 = pd.DataFrame(my_dict)
df2

In [ ]:
# Many features of the DataFrame are similar to a NumPy array

df2.dtypes

#df2.shape

In [ ]:
# A fast way to grab quick insights of the numerical features
# of your data is to use .describe()

df2.describe()
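By default .describe() only summarises the numerical columns. Passing include='all' asks for every column, with text columns getting counts of unique values instead of means. A minimal sketch on a small made-up table:

```python
import pandas as pd

# Made-up example data with two numerical columns and one text column
sample = pd.DataFrame({
    'price': [1.5, 2.5, 4.0],
    'qty': [10, 20, 30],
    'label': ['a', 'b', 'a'],
})

# Default: numerical columns only
numeric_summary = sample.describe()
print(numeric_summary)

# include='all' adds the text column as well
full_summary = sample.describe(include='all')
print(full_summary)
```

Entries that do not apply to a column, such as the mean of a text column, show up as NaN in the full summary.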

In [ ]:
# We can sort the columns by their labels (axis=1), here in descending order
df2.sort_index(axis=1,ascending=False)

In [ ]:
# Or sort the rows by the values in a chosen column
df2.sort_values(by='B')
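Sorting returns a new DataFrame and leaves the original untouched, and each row keeps its original index label after the sort. A quick sketch with made-up scores:

```python
import pandas as pd

# Made-up example data
scores = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat'],
    'score': [72, 91, 85],
})

# Highest score first
ranked = scores.sort_values(by='score', ascending=False)

print(list(ranked['name']))   # ['Bob', 'Cat', 'Ann']
print(list(ranked.index))     # [1, 2, 0] (labels travel with the rows)
print(list(scores['name']))   # ['Ann', 'Bob', 'Cat'] (original unchanged)
```

Assign the result to a variable, as done here, if you want to keep the sorted version.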

Selecting elements


In [ ]:
# Use [] indexing and a column name to grab columns
df2['C']

In [ ]:
# Use [] with a slice to grab rows by position
df2[1:3]

In [ ]:
# Or use .loc to retrieve a row by its index label, like the row labelled 0
df2.loc[0]

In [ ]:
# Or use loc and a condition

df2.loc[df2['E']==True]
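.loc selects by index label, while the related .iloc selects by position. On a freshly created DataFrame the two agree, but they differ once the rows have been reordered. A small sketch with made-up data:

```python
import pandas as pd

# Made-up example data, then sorted so labels and positions differ
scores = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat'],
    'score': [72, 91, 85],
})
ranked = scores.sort_values(by='score', ascending=False)

# .loc[0] finds the row whose label is 0
by_label = ranked.loc[0, 'name']
print(by_label)       # Ann

# .iloc[0] finds whichever row is first
by_position = ranked['name'].iloc[0]
print(by_position)    # Bob

# .loc also accepts a condition and a column name together
high_scorers = ranked.loc[ranked['score'] > 80, 'name']
print(list(high_scorers))   # ['Bob', 'Cat']
```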

Revisiting what we did on the first day


In [58]:
# Choose a stock
ticker = 'DRYS'

# Choose a start date in US format MM/DD/YYYY
stock_start = '10/2/2014'
# Choose an end date in US format MM/DD/YYYY
stock_end = '11/28/2016'

# Retrieve the Data from Google's Finance Database
stock = web.DataReader(ticker,data_source='google',
                       start = stock_start,end=stock_end)

# Print a table of the Data to see what we have just fetched
stock.tail()


Out[58]:
             Open   High    Low  Close    Volume
Date
2016-11-18  12.64  21.72  11.05  11.81  28910418
2016-11-21  13.93  16.44   8.73   9.10  23824704
2016-11-22   9.50   9.66   6.10   6.22  18100450
2016-11-23   6.36   6.86   5.35   6.01  24905788
2016-11-25   6.05   6.19   5.46   5.48   7269443

In [59]:
# Generate the daily log return: the logarithm of the ratio of each
# day's closing price to the previous day's closing price
stock['Log_Ret'] = np.log(stock['Close']/stock['Close'].shift(1))

# Generate the 100-day rolling standard deviation of the log returns,
# scaled by sqrt(100) to give the volatility over that window
stock['Volatility'] = (stock['Log_Ret'].rolling(window=100).std())*np.sqrt(100)
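The same two steps can be tried offline on a handful of made-up prices, which makes it easier to see what each line does. This sketch uses a window of 3 instead of 100 so that a short series still produces values. It also checks a useful property of log returns: they add up to the log of the overall price change.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for six consecutive days
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 108.0])

# shift(1) lines each price up with the previous day's price,
# so the first log return is NaN (there is no previous day)
log_ret = np.log(prices / prices.shift(1))
print(log_ret)

# Log returns are additive: their sum equals log(last price / first price)
total_log_ret = log_ret.sum()
direct_log_ret = np.log(prices.iloc[-1] / prices.iloc[0])
print(np.isclose(total_log_ret, direct_log_ret))   # True

# Rolling standard deviation over 3 days, scaled by sqrt(3).
# Windows without 3 valid returns give NaN, so the first three entries are NaN.
window = 3
volatility = log_ret.rolling(window=window).std() * np.sqrt(window)
print(volatility)
print(volatility.notna().sum())   # 3
```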

In [60]:
# Create a plot of changing Closing Price and Volatility
stock[['Close','Volatility']].plot(subplots=True,color='b',figsize=(8,6))


Out[60]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x110ae34a8>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x110b5ce10>], dtype=object)

Experiment with changing the stock ticker from DRYS and the dates. What interesting insights can you find? Is there a company which has become very volatile recently?