Class 1: Python for Machine Learning
Deep learning is a group of exciting new technologies for neural networks. By using a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. This course will introduce the student to deep belief neural networks, rectified linear units (ReLU), convolutional neural networks, and recurrent neural networks. High performance computing (HPC) aspects will demonstrate how deep learning can be leveraged both on graphics processing units (GPUs) and on grids. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain. The focus will be primarily upon the application of deep learning, with some introduction to its mathematical foundations. Students will use the Python programming language to architect deep learning models for several real-world data sets and interpret the results of these networks.
Your grade will be calculated according to the following assignments:
Assignment | Weight | Title |
---|---|---|
Class Participation | 10% | Class attendance and participation |
Program 1 | 10% | Python for data science |
Program 2 | 10% | TensorFlow for classification |
Program 3 | 10% | Time series with TensorFlow |
Program 4 | 10% | Computer vision with TensorFlow |
Mid Term | 20% | Understanding of deep learning and TensorFlow |
Final Project | 30% | Adapt deep learning to a past Kaggle competition |
The following book will be used to supplement in-class discussion. Internet resources and papers will augment the text with the latest research.
Heaton, J. (2015). Deep learning and neural networks (Vol. 3, Artificial Intelligence for Humans). St. Louis, MO: Heaton Research.
You do not need the other books in the series.
I will be your instructor for this course. A brief summary of my credentials is given here:
Social media:
The focus of this class is deep learning, a very popular type of machine learning based upon the original neural networks popularized in the 1980s. There is very little difference between how a deep neural network is calculated and how the original neural networks were. A deep neural network is nothing more than a neural network with many layers. While we've always been able to create and calculate deep neural networks, we've lacked an effective means of training them. Deep learning provides an efficient means to train deep neural networks.
If deep learning is a type of machine learning, this raises the question, "What is machine learning?" The following diagram illustrates how machine learning differs from traditional software development.
Researchers have applied machine learning to many different areas. This class will explore three specific domains for the application of deep neural networks:
Regression is when a model, such as a neural network, accepts input and produces a numeric output. Consider if you were tasked with writing a program that predicted how many miles per gallon (MPG) a car could achieve. For the inputs, you would probably want features such as the weight of the car, the horsepower, how large the engine is, and so on. Your program would be a combination of math and if-statements.
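A hand-coded estimator might look something like the following sketch. The constants and cutoffs here are invented purely for illustration, not derived from real data:
In [ ]:
# A hypothetical hand-coded MPG estimator: math plus if-statements.
def estimate_mpg(weight_lbs, horsepower):
    # Made-up coefficients: heavier, more powerful cars get fewer MPG.
    mpg = 50.0 - 0.005 * weight_lbs - 0.05 * horsepower
    if weight_lbs > 4000:
        mpg -= 2.0  # made-up extra penalty for very heavy cars
    return max(mpg, 5.0)

print(estimate_mpg(2800, 90))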
Machine learning lets the computer learn the "formula" for calculating the MPG of a car, using data. Consider this dataset. We can use regression machine learning models to study this data and learn how to predict the MPG for a car.
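As a quick preview of material covered later in this class, a minimal regression sketch might look like the following. It assumes the auto-mpg.csv file used throughout these notes, and it uses a scikit-learn linear regression simply as a stand-in model:
In [ ]:
import os
import pandas as pd
from sklearn.linear_model import LinearRegression

path = "./data/"
df = pd.read_csv(os.path.join(path,"auto-mpg.csv"),na_values=['NA','?'])
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Learn the "formula" for MPG from two of the features.
model = LinearRegression()
model.fit(df[['weight','horsepower']], df['mpg'])
print(model.predict([[2800, 90]])) # estimated MPG for a hypothetical car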
The output of a classification model is what class the input belongs to. For example, consider using four measurements of an iris flower to determine the species that the flower is in. This dataset could be used to perform this.
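As a sketch of classification (again with scikit-learn only as a stand-in; this class will use deep neural networks for the same task), the copy of the iris dataset bundled with scikit-learn can be classified with a simple nearest-neighbor model:
In [ ]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Four measurements in, one of three species out.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(iris.data, iris.target)

sample = [[5.1, 3.5, 1.4, 0.2]] # sepal length/width, petal length/width (cm)
print(iris.target_names[model.predict(sample)[0]])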
Neural networks are one of the earliest types of machine learning model. Neural networks were originally introduced in the 1940's and have risen and fallen several times from popularity. Four researchers have contributed greatly to the development of neural networks. They have consistently pushed neural network research, both through the ups and downs:
The current luminaries of artificial neural network (ANN) research, and ultimately deep learning, in order of appearance in the above picture:
For predictive modeling, neural networks are not that different from other models, such as:
Like these other models, neural networks can perform both classification and regression. When applied to relatively low-dimensional predictive modeling tasks, deep neural networks do not necessarily add significant accuracy over other model types. Andrew Ng describes the advantage of deep neural networks over traditional model types as follows:
Neural networks also have two additional significant advantages over other machine learning models:
Python 3.x is the programming language that will be used for this class. Python, as a programming language, has the widest support for deep learning. The three most popular frameworks for deep learning in Python are:
Some references on popular programming languages for AI/Data Science:
This is a technical class. You will need to be able to compile and execute Python code that makes use of TensorFlow for deep learning. There are two options available to you for accomplishing this:
This option allows you to skip any issues associated with installing Python and TensorFlow on your machine. Installing Python is relatively easy; however, TensorFlow has specific instructions for Windows, Linux, and Mac. It is straightforward to install TensorFlow on a Mac or Linux. Windows is an entirely different prospect, as Google does not offer specific support for Windows at this time.
The IBM Data Scientist Workbench is a website that provides you with your own environment for running Jupyter notebooks. There is nothing proprietary about the workbench; the same code that runs on the IBM system will also run on your local computer. I will be using the Data Scientist Workbench for many of the examples during class. To make use of this website you will need to register at the following URL:
When you first sign up, it will take the workbench some time to set up your environment; this could easily take 30 minutes or more. While your environment is being set up, you will see a cute icon of a dog chasing its tail.
Upon logging into the workbench, you will see a welcome screen similar to the following:
You will primarily make use of the "My Data" and "Jupyter Notebook" buttons on the above page. Clicking "My Data" will reveal all data that is currently held by your account. This includes both CSV data files, as well as any Jupyter notebooks you might have loaded or created.
Clicking "Jupyter Notebook" will start Jupyter Notebook. This allows you to choose which notebook you would like to work with. If you downloaded a notebook from my GitHub site you can simply drag this .ipynb file to the web browser. You can also choose to create a new Jupyter notebook that you can later download. The following screen capture shows Jupyter notebook running in Data Scientist Workbench.
It is also possible to install and run Python/TensorFlow entirely from your own computer. This will be somewhat difficult for Microsoft Windows, as Google has not yet added official TensorFlow support for Windows. Official support is currently only provided for Mac and Linux.
The first step is to install Python 3.x. I recommend using the Anaconda release of Python, as it already includes many of the data science packages that will be needed by this class. Anaconda directly supports Windows, Mac, and Linux. Download Anaconda from the following URL:
Once Anaconda has been downloaded, it is easy to install Jupyter notebooks with the following command:
conda install jupyter
Once Jupyter is installed, it is started with the following command:
jupyter notebook
Space matters in Python: indent code to define blocks.
Jupyter notebooks allow Python and Markdown to coexist.
Even $\LaTeX$:
$ f'(x) = \lim_{h\to0} \frac{f(x+h) - f(x)}{h}. $
If you ever see xrange instead of range, you are dealing with Python 2.
If you ever see print x instead of print(x), you are dealing with Python 2.
In [2]:
# What version of Python do you have?
import sys
import tensorflow as tf
import sklearn as sk
import pandas as pd
print("Python {}".format(sys.version))
print('TensorFlow {}'.format(tf.__version__))
print('Pandas {}'.format(pd.__version__))
print('Scikit-Learn {}'.format(sk.__version__))
Software used in this class:
In [1]:
#Python cares about space! No curly braces.
for x in range(1,10): # If you ever see xrange, you are in Python 2
    print(x) # If you ever see print x (no parenthesis), you are in Python 2
In [2]:
sum = 0
for x in range(1,10):
    sum += x
    print("Adding {}, sum so far is {}".format(x,sum))
print("Final sum: {}".format(sum))
In [3]:
c = ['a', 'b', 'c', 'd']
print(c)
In [4]:
# Iterate over a collection.
for s in c:
    print(s)
In [5]:
# Iterate over a collection and know your index. (Python is zero-based!)
for i,c in enumerate(c):
    print("{}:{}".format(i,c))
In [6]:
# Manually add items, lists allow duplicates
c = []
c.append('a')
c.append('b')
c.append('c')
c.append('c')
print(c)
In [7]:
# Manually add items, sets do not allow duplicates
# Sets add, lists append. I find this annoying.
c = set()
c.add('a')
c.add('b')
c.add('c')
c.add('c')
print(c)
In [8]:
# Insert
c = ['a','b','c']
c.insert(0,'a0')
print(c)
# Remove
c.remove('b')
print(c)
# Remove at index
del c[0]
print(c)
In [9]:
map = { 'name': "Jeff", 'address':"123 Main"}
print(map)
print(map['name'])
if 'name' in map:
    print("Name is defined")
if 'age' in map:
    print("age defined")
else:
    print("age undefined")
In [3]:
map = { 'name': "Jeff", 'address':"123 Main"}
# All of the keys
print("Key: {}".format(map.keys()))
# All of the values
print("Values: {}".format(map.values()))
In [11]:
# Python list & map structures
customers = [
    {'name': 'Jeff & Tracy Heaton', 'pets': ['Wynton','Cricket']},
    {'name': 'John Smith', 'pets': ['rover']},
    {'name': 'Jane Doe'}
]
print(customers)
for customer in customers:
    print("{}:{}".format(customer['name'],customer.get('pets','no pets')))
Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is based on the dataframe concept found in the R programming language. For this class, Pandas will be the primary means by which data is manipulated in conjunction with neural networks.
The dataframe is a key component of Pandas. We will use it to access the auto-mpg dataset. This dataset can be found on the UCI machine learning repository. For this class we will use a version of the Auto MPG dataset where I added column headers. You can find my version here.
This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. It contains data for 398 cars, including mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and the car's name.
The following code loads the MPG dataset into a dataframe:
In [13]:
# Simple dataframe
import os
import pandas as pd
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read)
print(df[0:5])
In [14]:
# Perform basic statistics on a dataframe.
import os
import pandas as pd
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
# Strip non-numerics
df = df.select_dtypes(include=['int', 'float'])
headers = list(df.columns.values)
fields = []
for field in headers:
    fields.append({
        'name' : field,
        'mean': df[field].mean(),
        'var': df[field].var(),
        'sdev': df[field].std()
    })
for field in fields:
    print(field)
In [1]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
#np.random.seed(42) # Uncomment this line to get the same shuffle each time
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
df
Out[1]:
In [16]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.sort_values(by='name',ascending=True)
print("The first car is: {}".format(df['name'].iloc[1]))
df
Out[16]:
In [17]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-shuffle.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index))
df.to_csv(filename_write,index=False) # Specify index = false to not write row numbers
print("Done")
It is possible to add new fields to the dataframe that are calculated from the other fields. We can create a new column that gives the weight in kilograms. The equation to calculate a metric weight, given a weight in pounds is:
$ m_{(kg)} = m_{(lb)} \times 0.45359237 $
This can be used with the following Python code:
In [5]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df.insert(1,'weight_kg',(df['weight']*0.45359237).astype(int))
df
Out[5]:
The data fed into a machine learning model rarely bears much similarity to the data that the data scientist originally received. One common transformation is to normalize the inputs. Normalization puts numbers into a standard form so that two values can easily be compared. Consider if a friend told you that he received a $10 discount. Is this a good deal? Maybe. But the value is not normalized. If your friend purchased a car, then the discount is not that good. If your friend purchased dinner, this is a very good discount!
Percentages are a very common form of normalization. If your friend tells you they got 10% off, we know that this is a better discount than 5%. It does not matter how much the purchase price was. One very common machine learning normalization is the Z-Score:
$z = {x- \mu \over \sigma} $
To calculate the Z-Score you need to also calculate the mean($\mu$) and the standard deviation ($\sigma$). The mean is calculated as follows:
$\mu = \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}$
The standard deviation is calculated as follows:
$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}, {\rm \ \ where\ \ } \mu = \frac{1}{N} \sum_{i=1}^N x_i$
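For example, using approximate values for this dataset (a mean MPG near 23.5 and a standard deviation near 7.8), a car that achieves 30 MPG has a z-score of roughly $(30 - 23.5)/7.8 \approx 0.83$, a bit less than one standard deviation above average.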
The following Python code replaces the mpg column with its z-score. Cars with average MPG will be near zero; above zero is above average, and below zero is below average. Z-scores above 3 or below -3 are very rare; these are outliers.
In [20]:
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df['mpg'] = zscore(df['mpg'])
df
Out[20]:
Missing values are a reality of machine learning. Ideally every row of data will have values for all columns. However, this is rarely the case. Most of the values are present in the MPG database. However, there are missing values in the horsepower column. A common practice is to replace missing values with the median value for that column. The median is calculated as described here. The following code replaces any NA values in horsepower with the median:
In [21]:
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
# df = df.dropna() # you can also simply drop NA values
print("horsepower has na? {}".format(pd.isnull(df['horsepower']).values.any()))
In [22]:
# Create a new dataframe from name and horsepower
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower],axis=1)
result
Out[22]:
In [23]:
# Create a new dataframe from name and horsepower, but this time by row
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower])
result
Out[23]:
It is very important that we evaluate a machine learning model based on its ability to predict data that it has never seen before. Because of this, we often divide the available data into a training set and a validation set. The machine learning model will learn from the training data, but ultimately be evaluated based on the validation data.
There are two predominant means of dealing with training and validation data:
In [24]:
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index)) # Usually a good idea to shuffle
mask = np.random.rand(len(df)) < 0.8
trainDF = pd.DataFrame(df[mask])
validationDF = pd.DataFrame(df[~mask])
print("Training DF: {}".format(len(trainDF)))
print("Validation DF: {}".format(len(validationDF)))
There are several types of cross validation; however, k-fold is the most common. The value K specifies the number of folds. The two most common values for K are either 5 or 10. For this course we will always use a K value of 5, or a 5-fold cross validation. A 5-fold validation is illustrated by the following diagram:
First, the data are split into 5 equal (or nearly equal, after rounding) folds. These folds generate 5 training/validation set combinations. Each fold becomes the validation set exactly once, with the remaining folds serving as the training set. This allows the validated results to be appended together to produce a final out-of-sample prediction for the entire dataset.
The following code demonstrates a 5-fold cross validation:
In [25]:
import os
from sklearn.cross_validation import KFold
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index))
kf = KFold(len(df), n_folds=5)
fold = 1
for train_index, validate_index in kf:
    trainDF = pd.DataFrame(df.ix[train_index,:])
    validateDF = pd.DataFrame(df.ix[validate_index])
    print("Fold #{}, Training Size: {}, Validation Size: {}".format(fold,len(trainDF),len(validateDF)))
    fold+=1
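To illustrate the out-of-sample assembly described above, the following sketch fits a model inside the same KFold loop and collects one prediction per row. A scikit-learn linear regression stands in for the neural networks used later in this course, and the feature columns were chosen only for illustration:
In [ ]:
import os
from sklearn.cross_validation import KFold
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

X = df[['weight','horsepower','displacement']].values
y = df['mpg'].values

# Collect one out-of-sample prediction per row across the 5 folds.
oos_pred = np.zeros(len(df))
kf = KFold(len(df), n_folds=5)
for train_index, validate_index in kf:
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])
    oos_pred[validate_index] = model.predict(X[validate_index])

# Every row now has a prediction from a model that never saw that row.
print("Out-of-sample RMSE: {}".format(np.sqrt(np.mean((oos_pred - y)**2))))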
It is possible to access files directly, rather than using Pandas. Using the csv package, you can read files in line by line and process them. Accessing a file line by line allows you to process very large files that would not fit into memory. For the purposes of this class, all files will fit into memory, and you should use Pandas for all class assignments.
In [27]:
# Read a raw text file (avoid this)
import codecs
import os
path = "./data"
# Always specify your encoding! There is no such thing as "its just a text file".
# See... http://www.joelonsoftware.com/articles/Unicode.html
# Also see... http://www.utf8everywhere.org/
encoding = 'utf-8'
filename = os.path.join(path,"auto-mpg.csv")
c = 0
with codecs.open(filename, "r", encoding) as fh:
    # Iterate over the file line by line...
    for line in fh:
        c+=1 # Only print the first 5 lines
        if c>5: break
        print(line.strip())
In [28]:
# Read a CSV file
import codecs
import os
import csv
encoding = 'utf-8'
path = "./data/"
filename = os.path.join(path,"auto-mpg.csv")
c = 0
with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)
    for row in reader:
        c+=1
        if c>5: break
        print(row)
In [30]:
# Read a CSV, symbolic headers
import codecs
import os
import csv
path = "./data"
encoding = 'utf-8'
filename = os.path.join(path,"auto-mpg.csv")
c = 0
with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)
    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}
    for row in reader:
        c+=1
        if c>5: break
        print("Car Name: {}".format(row[header_idx['name']]))
In [31]:
# Read a CSV, manual stats
import codecs
import os
import csv
import math
path = "./data/"
encoding = 'utf-8'
filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-norm.csv")
c = 0
with codecs.open(filename_read, "r", encoding) as fh:
    reader = csv.reader(fh)
    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}
    headers = header_idx.keys()
    fields = {key: {'count':0,'sum':0,'variance':0} for key in headers}
    # Pass 1, means
    row_count = 0
    for row in reader:
        row_count += 1
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                field['count'] += 1
                field['sum'] += value
            except ValueError:
                pass
    # Calculate means, toss sums (part of pass 1)
    for field in fields.values():
        # If at least 90% of rows are numeric (not missing), calculate a mean
        if (field['count']/row_count)>0.9:
            field['mean'] = field['sum'] / field['count']
        del field['sum']
    # Pass 2, standard deviation & variance
    fh.seek(0)
    for row in reader:
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                # If we failed to calculate a mean, no variance.
                if 'mean' in field:
                    field['variance'] += (value - field['mean'])**2
            except ValueError:
                pass
    # Calculate standard deviation, keep variance (part of pass 2)
    for field in fields.values():
        # If no variance, then no standard deviation
        if 'mean' in field:
            field['variance'] /= field['count']
            field['sdev'] = math.sqrt(field['variance'])
        else:
            del field['variance']
# Print summary stats
for key in sorted(fields.keys()):
    print("{}:{}".format(key,fields[key]))
The first programming assignment will give you a chance to try out Python, Pandas and build some skills that you will use to learn about Deep Learning. You should submit this assignment as either a Jupyter notebook (.ipynb) or a regular Python (.py) file. The following code shows a possible skeleton structure for this assignment:
In [ ]:
# Programming Assignment #1,
# Solution by YOUR NAME
# T81-558: Application of Deep Learning
import os
import sklearn
from sklearn.cross_validation import KFold
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
def question1():
    print()
    print("***Question 1***")

def question2():
    print()
    print("***Question 2***")

def question3():
    print()
    print("***Question 3***")

def question4():
    print()
    print("***Question 4***")

def question5():
    print()
    print("***Question 5***")
question1()
question2()
question3()
question4()
question5()