Today is all about handling and generating data. We'll be looking at the first principles of two packages you should know about for handling data in Python:
* NumPy
* Pandas
Run the appropriate version of the following commands ASAP to get yourself set up!
In [ ]:
# Run this if you are using a Mac machine or have multiple versions of Python installed
!pip3 install numpy pandas matplotlib pandas_datareader --upgrade
In [ ]:
# Run this if you are using a Windows machine
!pip install numpy==1.11.1 pandas==0.19.0 matplotlib==1.5.3 pandas_datareader==0.2.1 --upgrade
In [ ]:
# Everyone run this block
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from pandas_datareader import data as web
%matplotlib inline
We looked at the basics of Object Oriented Programming, or OOP, last week. If you couldn't make it, don't be concerned because the content from last week won't affect what we will be looking at today.
class Human(object):
    def __init__(self,name,age,height):
        self.name = name
        self.age = age
        self.height = height
    def __lt__(self,other):
        return self.age < other.age
    def __le__(self,other):
        return self.age <= other.age
    def __gt__(self,other):
        return self.age > other.age
    def __ge__(self,other):
        return self.age >= other.age
    def __eq__(self,other):
        return self.age==other.age
    def age_in_dog_years(self):
        return 7*self.age
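As a quick recap of how those comparison methods get used, here is a minimal sketch. A trimmed copy of the class is repeated so the cell runs on its own, and the names, ages and heights are made up for illustration.

```python
class Human(object):
    def __init__(self, name, age, height):
        self.name = name
        self.age = age
        self.height = height

    def __lt__(self, other):
        return self.age < other.age

    def __eq__(self, other):
        return self.age == other.age

    def age_in_dog_years(self):
        return 7 * self.age


# Hypothetical people, for illustration only
alice = Human('Alice', 30, 165)
bob = Human('Bob', 25, 180)

print(bob < alice)      # calls Human.__lt__, which compares ages
print(alice == bob)     # calls Human.__eq__, which also compares ages only

# sorted() relies on __lt__, so the list comes back youngest first
youngest_first = sorted([alice, bob])
print([h.name for h in youngest_first])
print(bob.age_in_dog_years())
```

Because the comparisons only look at age, two different people of the same age would count as equal here.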
NumPy is the standard mathematical and scientific computing package for Python, and it is a must-know if you want to write efficient and readable numerical code. It includes an optimised array type as well as linear algebra, Fourier transform and random number capabilities.
Under the hood, the performance-critical parts of NumPy are written in compiled code (mostly C, with linear algebra handed to optimised BLAS/LAPACK libraries), so array operations usually run far faster than the equivalent loops in native Python.
A NumPy array is a grid of values that all share the same type. This is different from Python lists, which can hold elements of different types.
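A small sketch of what "same type" means in practice: when a list with mixed numeric types is converted, NumPy picks a single type that can hold every value. The variable names are only for illustration.

```python
import numpy as np

mixed_list = [1, 2.5, 3]      # a list keeps each element's own type
arr = np.array(mixed_list)    # NumPy chooses one type that fits all values

print([type(v).__name__ for v in mixed_list])
print(arr.dtype)              # the integers have been converted to floats
print(arr)
```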
Arrays are indexed similarly to lists, with each dimension indexed from zero. When declaring an array, be clear in your mind about the dimensions you need.
np.ones((3,4),int)
-> 3 is the number of ROWS
-> 4 is the number of COLUMNS
-> int is the type of the array elements
--
array([[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]])
In [ ]:
# Creating an array of all zeroes
np.zeros((3,4),int)
In [ ]:
print('a. np.zeros((2,2))')
a = np.zeros((2,2),int)
print(a)
print('b. np.ones((1,2))')
b = np.ones((1,2))
print(b)
print('c. np.full((2,2), 7)')
c = np.full((2,2), 7) # Create a constant array
print(c)
print('d. np.eye(2)')
d = np.eye(2)
print(d)
print('e. np.random.random((2,2))')
e = np.random.random((2,2))
print(e)
In general you can follow the format below when declaring most NumPy arrays:
np.<function>(shape, fill_value, dtype)
Here shape is a tuple like (3,4), fill_value applies only to functions such as np.full that take one, and dtype is the type of the data, which is constant across the array but doesn't have to be a number.
Try replacing int in the declaration of array a with str.
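If you want to check your answer to that exercise, the sketch below shows roughly what to expect. The variable names are hypothetical and the fill value 'seven' is just an example.

```python
import numpy as np

a_int = np.zeros((2, 2), int)
a_str = np.zeros((2, 2), str)       # "zero" for strings is the empty string
c_str = np.full((2, 2), 'seven')    # np.full is the function that takes a fill value

print(a_int.dtype, a_str.dtype, c_str.dtype)
print(a_str)
print(c_str)
```

The string arrays report a Unicode dtype (something like `<U1` or `<U5`), where the number is the maximum string length the array can hold.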
In [ ]:
# You can also declare NumPy arrays using standard Python
# Lists (and Lists of Lists)
l1 = [1,2,3,4]
l2 = [5,6,7,8]
l3 = [9,10,11,12]
l = [l1,l2,l3]
print(l)
l_array = np.array(l)
#print(l_array)
In [ ]:
# For a list of lists we would use the [][] notation
upper_left_val = l[0][0]
print(upper_left_val)
print()
# We use a single [] with np and separate dimensions by ,
upper_left_val_np = l_array[0,0]
print(upper_left_val_np)
print()
# You can also slice arrays as so
print(l_array[0:2,1:3])
print()
# And use the shape attribute of the object to understand
# size
print(l_array.shape)
print()
# Or the dtype attribute to inspect the type of the array
print(l_array.dtype)
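One detail worth knowing about the slicing shown above: a basic slice of a NumPy array is a view onto the same data, not a copy. The sketch below uses a hypothetical array called grid to show the difference.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

window = grid[0:2, 1:3]        # rows 0-1, columns 1-2: a view, not a copy
window[0, 0] = 99              # writes through to grid[0, 1]

safe = grid[0:2, 1:3].copy()   # an independent copy of the same region
safe[0, 0] = -1                # grid is unaffected by this

print(grid)
print(window.shape)
```

Call .copy() whenever you want to change a slice without touching the original array.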
In [ ]:
# Using the arange function you can generate evenly
# spaced integers (the stop value itself is excluded)
lin_space_int = np.arange(1,10)
print(lin_space_int)
print()
# Or specify a step to create different spacings
lin_space_new = np.arange(1,10,0.5)
print(lin_space_new)
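A short sketch comparing np.arange with np.linspace, which is a common alternative when you know how many points you want rather than the step size. The variable names are only for illustration.

```python
import numpy as np

ints = np.arange(1, 10)            # 1..9, the stop value 10 is excluded
halves = np.arange(1, 10, 0.5)     # 1.0, 1.5, ..., 9.5
points = np.linspace(1, 10, 19)    # 19 evenly spaced points, 10 included

print(ints)
print(halves)
print(points)
```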
In [ ]:
# Declare two example arrays
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
# By default, operations are element wise in NumPy
# Addition, two options
print(x+y)
#print(np.add(x,y))
print()
# Subtraction
print(x-y)
#print(np.subtract(x,y))
print()
# Product
print(x*y)
#print(np.multiply(x,y))
print()
# Division
print(x/y)
#print(np.divide(x,y))
print()
# Square Root
print(np.sqrt(x))
print()
## For Matrix operations, use the set of NumPy functions
# Matrix product (for 2-D arrays, dot performs matrix multiplication)
print(x.dot(y))
#print(np.dot(x,y))
print()
# You can also sum across dimensions easily
print(np.sum(x)) # For every element
print()
print(np.sum(x,axis=0)) # Down each column
print()
# Or transpose a Matrix
print(x)
print()
print(x.T)
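To make the difference between element-wise and matrix operations concrete, here is a minimal sketch using the same x and y. It assumes Python 3.5 or later for the @ operator, which gives the same result as x.dot(y) for these arrays.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

elementwise = x * y            # multiplies matching entries
matrix_prod = x @ y            # matrix product, same result as x.dot(y)

col_sums = np.sum(x, axis=0)   # collapse the rows: one sum per column
row_sums = np.sum(x, axis=1)   # collapse the columns: one sum per row

print(elementwise)
print(matrix_prod)
print(col_sums, row_sums)
```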
In [ ]:
# TO DO
# You can declare an array of random numbers or start with a list of lists
# Check your array is the right size by printing the .shape attribute
# Print the mean
# Print the standard deviation
# Print the minimum
# Print the maximum
# END TODO
Pandas is a data manipulation package that we've glimpsed before. If you know R, then the Pandas DataFrame type will be very familiar. If not, you can think of a DataFrame as a spreadsheet-like object which can be manipulated and interfaced with much more easily than lists of dictionaries (or dictionaries of lists!).
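As a rough illustration of that comparison, the sketch below answers the same questions with a plain list of dictionaries and with a DataFrame. The records are made up for illustration.

```python
import pandas as pd

# Hypothetical records, for illustration only
records = [
    {'name': 'Alice', 'age': 30, 'height': 165},
    {'name': 'Bob', 'age': 25, 'height': 180},
    {'name': 'Carol', 'age': 35, 'height': 172},
]

# With plain Python we loop over the records ourselves
mean_age_py = sum(r['age'] for r in records) / len(records)

# With a DataFrame each key becomes a column we can work on directly
people = pd.DataFrame(records)
mean_age_pd = people['age'].mean()
over_26 = people[people['age'] > 26]['name'].tolist()

print(people)
print(mean_age_py, mean_age_pd, over_26)
```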
In [ ]:
# Pandas DataFrames can be initialised using NumPy objects
# or with native Python objects.
# Using Numpy
df1 = pd.DataFrame(np.random.randn(3,4),columns=list('ABCD'))
print(df1)
# Or with a dictionary of list values
my_dict = {
'A':[1,2,3,4],
'B':['2016-12-25','2015-12-25','2014-12-25','2013-12-25'],
'C':pd.Series(1,index=list(range(4)),dtype="float32"),
'D':pd.Categorical(['Test','Test','Test','Train']),
'E':[True,True,False,False]
}
df2 = pd.DataFrame(my_dict)
df2
In [ ]:
# Many features of the DataFrame are similar to a NumPy array
df2.dtypes
#df2.shape
In [ ]:
# A fast way to grab quick insights of the numerical features
# of your data is to use .describe()
df2.describe()
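Note that .describe() only summarises numeric columns by default. The sketch below, on a small made-up frame, shows the default behaviour next to include='all', which also reports on the other columns.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    'score': [1, 2, 3, 4],
    'group': ['Test', 'Test', 'Test', 'Train'],
})

numeric_summary = df.describe()             # numeric columns only by default
full_summary = df.describe(include='all')   # adds unique/top/freq for the rest

print(numeric_summary)
print(full_summary)
```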
In [ ]:
# We can sort the dataframe by its axis labels
# (here the column names, in descending order)
df2.sort_index(axis=1,ascending=False)
In [ ]:
# Or sort the rows by the values in a column
df2.sort_values(by='B')
In [ ]:
# Use [] indexing and a column name to grab columns
df2['C']
In [ ]:
# Use [] with a slice to grab rows by position (here rows 1 and 2)
df2[1:3]
In [ ]:
# Or use .loc to retrieve a row by its index label, like the row labelled 0
df2.loc[0]
In [ ]:
# Or use loc and a condition
df2.loc[df2['E']==True]
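Since .loc works on index labels rather than positions, the difference only becomes visible when the index is not 0, 1, 2, and so on. The sketch below uses a hypothetical frame with letter labels, and also shows .iloc, the position-based counterpart.

```python
import pandas as pd

# Hypothetical frame with a non-default index, for illustration only
df = pd.DataFrame(
    {'A': [10, 20, 30], 'E': [True, False, True]},
    index=['x', 'y', 'z'],
)

by_label = df.loc['y']       # row whose index label is 'y'
by_position = df.iloc[1]     # row at position 1 (the same row here)
flagged = df.loc[df['E']]    # a boolean column works directly as a mask
cell = df.loc['z', 'A']      # single value by row label and column name

print(by_label)
print(flagged)
print(cell)
```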
In [58]:
# Choose a stock
ticker = 'DRYS'
# Choose a start date in US format MM/DD/YYYY
stock_start = '10/2/2014'
# Choose an end date in US format MM/DD/YYYY
stock_end = '11/28/2016'
# Retrieve the Data from Google's Finance Database
stock = web.DataReader(ticker,data_source='google',
start = stock_start,end=stock_end)
# Print a table of the Data to see what we have just fetched
stock.tail()
In [59]:
# Generate the logarithm of the ratio between consecutive days' closing prices
stock['Log_Ret'] = np.log(stock['Close']/stock['Close'].shift(1))
# Generate the rolling standard deviation over a 100-day window,
# scaled by sqrt(100)
stock['Volatility'] = (stock['Log_Ret'].rolling(window=100).std())*np.sqrt(100)
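To see what shift(1) and rolling(...) are doing without needing to download anything, here is a minimal sketch on a handful of made-up closing prices. It uses a window of 3 instead of 100 so the result is easy to inspect, and the Prev_Close column is only added for illustration.

```python
import numpy as np
import pandas as pd

# Made-up closing prices, for illustration only
prices = pd.DataFrame({'Close': [100.0, 110.0, 99.0, 99.0, 108.9]})

# shift(1) moves every price down one row, so each row sees the previous close
prices['Prev_Close'] = prices['Close'].shift(1)
prices['Log_Ret'] = np.log(prices['Close'] / prices['Prev_Close'])

# Rolling standard deviation over a 3-row window (the cell above uses 100)
prices['Volatility'] = prices['Log_Ret'].rolling(window=3).std()

print(prices)
```

The first log return is NaN because there is no previous close, and the rolling column stays NaN until a full window of returns is available.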
In [60]:
# Create a plot of changing Closing Price and Volatility
stock[['Close','Volatility']].plot(subplots=True,color='b',figsize=(8,6))