T81-558: Applications of Deep Neural Networks

Class 1: Python for Machine Learning

Course Description

Deep learning is a group of exciting new technologies for neural networks. By using a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. This course will introduce the student to deep belief neural networks, rectified linear units (ReLU), convolutional neural networks, and recurrent neural networks. High performance computing (HPC) aspects will demonstrate how deep learning can be leveraged both on graphical processing units (GPUs) and grids. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain. Focus will be primarily upon the application of deep learning, with some introduction to its mathematical foundations. Students will use the Python programming language to architect a deep learning model for several real-world data sets and interpret the results of these networks.

Assignments

Your grade will be calculated according to the following assignments:

Assignment           Weight  Title
Class Participation  10%     Class attendance and participation
Program 1            10%     Python for data science
Program 2            10%     TensorFlow for classification
Program 3            10%     Time series with TensorFlow
Program 4            10%     Computer vision with TensorFlow
Mid Term             20%     Understanding of deep learning and TensorFlow
Final Project        30%     Adapt deep learning to a past Kaggle competition

Course Textbook

The following book will be used to supplement in-class discussion. Internet resources and papers will augment the text with the latest research.

Heaton, J. (2015). Deep learning and neural networks (Vol. 3, Artificial Intelligence for Humans). St. Louis, MO: Heaton Research.

You do not need the other books in the series.

Jeff Heaton

I will be your instructor for this course. A brief summary of my credentials is given here:

  • Master of Information Management (MIM), Washington University in St. Louis, MO
  • PhD (candidate) in Computer Science, Nova Southeastern University in Ft. Lauderdale, FL
  • Senior Data Scientist, Reinsurance Group of America (RGA)
  • Senior Member, IEEE
  • jtheaton at domain name of this university
  • Other industry certifications: FLMI, ARA, ACS

Social media:

  • Homepage - My home page. Includes my research interests and publications.
  • Linked In - My Linked In profile, feel free to connect.
  • Twitter - My Twitter feed.
  • Google Scholar - My citations on Google Scholar.
  • Research Gate - My profile/research at Research Gate.
  • Others - About me and other social media sites that I am a member of.

Course Resources

  • IBM Data Scientist Workbench - Free web-based platform that includes Python, Jupyter Notebooks, and TensorFlow. No setup needed.
  • Python Anaconda - Python distribution that includes many data science packages, such as Numpy, Scipy, Scikit-Learn, and Pandas.
  • Jupyter Notebooks - Easy-to-use environment that combines Python, graphics, and text.
  • TensorFlow - Google's mathematics package for deep learning.
  • Kaggle - Competitive data science. Good source of sample data.
  • Course GitHub Repository - All of the course notebooks will be published here.

What is Deep Learning

The focus of this class is deep learning, a very popular type of machine learning based upon the original neural networks popularized in the 1980s. There is very little difference between how a deep neural network is calculated and how the original neural networks were. A deep neural network is nothing more than a neural network with many layers. While we've always been able to create and calculate deep neural networks, we've lacked an effective means of training them. Deep learning provides an efficient means to train deep neural networks.

What is Machine Learning

If deep learning is a type of machine learning, this raises the question, "What is machine learning?" The following diagram illustrates how machine learning differs from traditional software development.

  • Traditional Software Development - Programmers create programs that specify how to transform input into the desired output.
  • Machine Learning - Programmers create models that can learn to produce the desired output for given input. This learning fills the traditional role of the computer program.

Researchers have applied machine learning to many different areas. This class will explore three specific domains for the application of deep neural networks:

  • Predictive Modeling - Several named input values are used to predict another named value that becomes the output. For example, using four measurements of iris flowers to predict the species.
  • Computer Vision - The use of machine learning to detect patterns in visual data. For example, determining whether an image contains a cat or a dog.
  • Time Series - The use of machine learning to detect patterns in time. Common applications of time series include financial applications, speech recognition, and even natural language processing (NLP).

Regression

Regression is when a model, such as a neural network, accepts input and produces a numeric output. Imagine you were tasked with writing a program that predicted how many miles per gallon (MPG) a car could achieve. For the inputs you would probably want features such as the weight of the car, the horsepower, the engine size, etc. Your program would be a combination of math and if-statements.

Machine learning lets the computer learn the "formula" for calculating the MPG of a car from data. Consider this dataset. We can use regression machine learning models to study this data and learn to predict the MPG for a car.
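To make the contrast concrete, here is a minimal sketch: the hand-written estimator uses made-up rules and coefficients (purely hypothetical), while the machine learning version lets scikit-learn learn the relationship from a few rows of the Auto MPG data:

```python
from sklearn.linear_model import LinearRegression

# "Traditional" programming: a hand-written estimator built from guesses.
# The rules and coefficients here are entirely made up, for illustration.
def mpg_by_hand(weight, horsepower):
    mpg = 50.0 - 0.01 * weight       # guess: heavier cars get worse mileage
    if horsepower > 150:
        mpg -= 5.0                   # guess: powerful engines burn more fuel
    return mpg

# Machine learning: let a regression model learn the relationship from data.
# (Tiny sample of the Auto MPG data; the class will use the full dataset.)
X = [[3504, 130], [3693, 165], [2372, 95], [2130, 88]]   # weight, horsepower
y = [18.0, 15.0, 24.0, 27.0]                             # observed mpg
model = LinearRegression().fit(X, y)

print(mpg_by_hand(3000, 100))        # rule-based guess
print(model.predict([[3000, 100]]))  # estimate learned from the data
```

The learned model replaces the programmer's guessed coefficients with coefficients fitted to the data.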

Classification

The output of a classification model is the class to which the input belongs. For example, consider using four measurements of an iris flower to determine the flower's species. This dataset could be used for that task.
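As a minimal sketch of classification, scikit-learn bundles a copy of this iris dataset; the logistic regression model here is just an illustration (the class itself will use neural networks):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()                       # 4 measurements per flower, 3 species
model = LogisticRegression(max_iter=1000)
model.fit(iris.data, iris.target)

# Predict the species of one flower from its four measurements.
pred = model.predict([[5.1, 3.5, 1.4, 0.2]])
print(iris.target_names[pred[0]])
```

The model's output is a class label (a species), not a number, which is what distinguishes classification from regression.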

What are Neural Networks

Neural networks are one of the earliest types of machine learning models. Neural networks were originally introduced in the 1940s and have risen and fallen from popularity several times. Four researchers have contributed greatly to the development of neural networks. They have consistently pushed neural network research, both through the ups and downs:

The current luminaries of artificial neural network (ANN) research, and ultimately deep learning, are:

  • Yann LeCun, Facebook and New York University - Optical character recognition and computer vision using convolutional neural networks (CNN). The founding father of convolutional nets.
  • Geoffrey Hinton, Google and University of Toronto. Extensive work on neural networks. A pioneer of deep learning and an early adopter/creator of backpropagation for neural networks.
  • Yoshua Bengio, University of Montreal. Extensive research into deep learning, neural networks, and machine learning. He has so far remained completely in academia.
  • Andrew Ng, Baidu and Stanford University. Extensive research into deep learning, neural networks, and applications to robotics.

Why Deep Learning?

For predictive modeling, neural networks are not that different from other models, such as:

  • Support Vector Machines
  • Random Forests
  • Gradient Boosted Machines

Like these other models, neural networks can perform both classification and regression. When applied to relatively low-dimensional predictive modeling tasks, deep neural networks do not necessarily add significant accuracy over other model types. Andrew Ng has observed that the advantage of deep neural networks over traditional model types is that they tend to keep improving as the amount of training data grows.

Neural networks have two additional significant advantages over other machine learning models:

  • Convolutional Neural Networks - Can scan an image for patterns within the image.
  • Recurrent Neural Networks - Can find patterns across several inputs, not just within a single input.

Python for Deep Learning

Python 3.x is the programming language that will be used for this class. Python, as a programming language, has the widest support for deep learning. The three most popular frameworks for deep learning in Python are:

Some references on popular programming languages for AI/Data Science:

Software Installation

This is a technical class. You will need to be able to write and execute Python code that makes use of TensorFlow for deep learning. There are two options available to you for accomplishing this:

  • Use IBM Data Scientist Workbench online
  • Install Python, TensorFlow, and an environment such as Jupyter locally

Using IBM Data Scientist Workbench

This option allows you to skip any issues associated with installing Python and TensorFlow on your machine. Installing Python is relatively easy. However, TensorFlow has specific instructions for Windows, Linux and Mac. It is straightforward to install TensorFlow onto a Mac or Linux. Windows is an entirely different prospect, as Google does not offer specific support for Windows at this time.

The IBM Data Scientist Workbench is a website that provides you with your own environment from which to run Jupyter notebooks. There is nothing proprietary about the workbench; the same code that runs on the IBM system will also run on your local computer. I will be using the Data Scientist Workbench for many of the examples during class. To make use of this website, you will need to register at the following URL:

When you first sign up, it will take the workbench some time to set up your environment; this can easily take 30 minutes or more. While your environment is being set up, you will see a cute icon of a dog chasing its tail.

Upon logging into the workbench, you will see a welcome screen similar to the following:

You will primarily make use of the "My Data" and "Jupyter Notebook" buttons on the above page. Clicking "My Data" will reveal all data that is currently held by your account. This includes both CSV data files, as well as any Jupyter notebooks you might have loaded or created.

Clicking "Jupyter Notebook" will start Jupyter Notebook. This allows you to choose which notebook you would like to work with. If you downloaded a notebook from my GitHub site you can simply drag this .ipynb file to the web browser. You can also choose to create a new Jupyter notebook that you can later download. The following screen capture shows Jupyter notebook running in Data Scientist Workbench.

Installing Python and TensorFlow

It is also possible to install and run Python/TensorFlow entirely from your own computer. This will be somewhat difficult for Microsoft Windows, as Google has not yet added official support for TensorFlow. Official support is currently only provided for Mac and Linux.

The first step is to install Python 3.x. I recommend the Anaconda release of Python, as it already includes many of the data science packages that this class will need. Anaconda directly supports Windows, Mac, and Linux. Download Anaconda from the following URL:

Once Anaconda has been downloaded, it is easy to install Jupyter notebooks with the following command:

conda install jupyter

Once Jupyter is installed, it is started with the following command:

jupyter notebook

Python Introduction

Jupyter Notebooks

Whitespace matters in Python: code blocks are defined by indentation.
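For example, the body of an if statement is defined entirely by indentation; there are no curly braces:

```python
x = 5
if x > 3:
    print("big")   # these indented lines form the body of the if
    x = x - 1
print(x)           # back at the left margin, so it always runs
```

Mixing indentation levels (or tabs and spaces) is a syntax error, so pick one style and keep it consistent.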

Jupyter Notebooks allow Python and Markdown to coexist.

Even $\LaTeX$:

$ f'(x) = \lim_{h\to0} \frac{f(x+h) - f(x)}{h}. $

Python Versions

  • If you see xrange instead of range, you are dealing with Python 2
  • If you see print x instead of print(x), you are dealing with Python 2
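If the syntax clues are ambiguous, the version can also be checked programmatically:

```python
import sys

# sys.version_info is a tuple; its first element is the major version.
if sys.version_info[0] >= 3:
    print("Python 3: use range() and print(...)")
else:
    print("Python 2: xrange and 'print x' exist")
```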

In [2]:
# What version of Python do you have?

import sys
import tensorflow as tf
import sklearn as sk
import pandas as pd

print("Python {}".format(sys.version))
print('TensorFlow {}'.format(tf.__version__))
print('Pandas {}'.format(pd.__version__))
print('Scikit-Learn {}'.format(sk.__version__))


Python 3.4.3 (default, Oct 14 2015, 20:28:29) 
[GCC 4.8.4]
TensorFlow 0.8.0
Pandas 0.18.1
Scikit-Learn 0.17.1

Software used in this class:

  • Python - The programming language.
  • TensorFlow - Google's deep learning framework; version 0.8 or higher is required. We will use SKFlow (part of TensorFlow), tutorial [here](https://github.com/tensorflow/skflow)
  • Pandas - Allows for data preprocessing. Tutorial here
  • Scikit-Learn - Machine learning framework for Python. Tutorial here.

Count to 10 in Python

Use a for loop and a range.


In [1]:
#Python cares about space!  No curly braces.
for x in range(1,10):  # If you ever see xrange, you are in Python 2
    print(x)  # If you ever see print x (no parenthesis), you are in Python 2


1
2
3
4
5
6
7
8
9

Printing Numbers and Strings


In [2]:
sum = 0  # note: this name shadows Python's built-in sum() function
for x in range(1,10):
    sum += x
    print("Adding {}, sum so far is {}".format(x,sum))
    
print("Final sum: {}".format(sum))


Adding 1, sum so far is 1
Adding 2, sum so far is 3
Adding 3, sum so far is 6
Adding 4, sum so far is 10
Adding 5, sum so far is 15
Adding 6, sum so far is 21
Adding 7, sum so far is 28
Adding 8, sum so far is 36
Adding 9, sum so far is 45
Final sum: 45

Lists & Sets


In [3]:
c = ['a', 'b', 'c', 'd']
print(c)


['a', 'b', 'c', 'd']

In [4]:
# Iterate over a collection.
for s in c:
    print(s)


a
b
c
d

In [5]:
# Iterate over a collection and know your index.  (Python is zero-based!)
for i,s in enumerate(c):
    print("{}:{}".format(i,s))


0:a
1:b
2:c
3:d

In [6]:
# Manually add items, lists allow duplicates
c = []
c.append('a')
c.append('b')
c.append('c')
c.append('c')
print(c)


['a', 'b', 'c', 'c']

In [7]:
# Manually add items, sets do not allow duplicates
# Sets add, lists append.  I find this annoying.
c = set()
c.add('a')
c.add('b')
c.add('c')
c.add('c')
print(c)


{'b', 'c', 'a'}

In [8]:
# Insert
c = ['a','b','c']
c.insert(0,'a0')
print(c)
# Remove
c.remove('b')
print(c)
# Remove at index
del c[0]
print(c)


['a0', 'a', 'b', 'c']
['a0', 'a', 'c']
['a', 'c']

Maps/Dictionaries/Hash Tables


In [9]:
map = { 'name': "Jeff", 'address':"123 Main"}  # note: 'map' shadows the built-in map()
print(map)
print(map['name'])

if 'name' in map:
    print("Name is defined")
    
if 'age' in map:
    print("age defined")
else:
    print("age undefined")


{'address': '123 Main', 'name': 'Jeff'}
Jeff
Name is defined
age undefined

In [3]:
map = { 'name': "Jeff", 'address':"123 Main"}
# All of the keys
print("Key: {}".format(map.keys()))

# All of the values
print("Values: {}".format(map.values()))


Key: dict_keys(['name', 'address'])
Values: dict_values(['Jeff', '123 Main'])

In [11]:
# Python list & map structures 
customers = [
    {'name': 'Jeff & Tracy Heaton', 'pets': ['Wynton','Cricket']},
    {'name': 'John Smith', 'pets': ['rover']},
    {'name': 'Jane Doe'}
]

print(customers)

for customer in customers:
    print("{}:{}".format(customer['name'],customer.get('pets','no pets')))


[{'pets': ['Wynton', 'Cricket'], 'name': 'Jeff & Tracy Heaton'}, {'pets': ['rover'], 'name': 'John Smith'}, {'name': 'Jane Doe'}]
Jeff & Tracy Heaton:['Wynton', 'Cricket']
John Smith:['rover']
Jane Doe:no pets

Pandas

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is based on the dataframe concept found in the R programming language. For this class, Pandas will be the primary means by which data is manipulated in conjunction with neural networks.

The dataframe is a key component of Pandas. We will use it to access the auto-mpg dataset. This dataset can be found on the UCI machine learning repository. For this class we will use a version of the Auto MPG dataset where I added column headers. You can find my version here.

This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. It contains data for 398 cars, including mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and the car's name.

The following code loads the MPG dataset into a dataframe:


In [13]:
# Simple dataframe
import os
import pandas as pd

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read)
print(df[0:5])


    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0        130    3504          12.0    70   
1  15.0          8         350.0        165    3693          11.5    70   
2  18.0          8         318.0        150    3436          11.0    70   
3  16.0          8         304.0        150    3433          12.0    70   
4  17.0          8         302.0        140    3449          10.5    70   

   origin                       name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  

In [14]:
# Perform basic statistics on a dataframe.

import os
import pandas as pd

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])

# Strip non-numerics
df = df.select_dtypes(include=['int', 'float'])

headers = list(df.columns.values)
fields = []

for field in headers:
    fields.append( {
        'name' : field,
        'mean': df[field].mean(),
        'var': df[field].var(),
        'sdev': df[field].std()
    })
    
for field in fields:
    print(field)


{'sdev': 7.815984312565782, 'mean': 23.514572864321615, 'name': 'mpg', 'var': 61.089610774274405}
{'sdev': 104.26983817119581, 'mean': 193.42587939698493, 'name': 'displacement', 'var': 10872.199152247364}
{'sdev': 38.49115993282855, 'mean': 104.46938775510205, 'name': 'horsepower', 'var': 1481.5693929745862}
{'sdev': 2.7576889298126757, 'mean': 15.568090452261291, 'name': 'acceleration', 'var': 7.604848233611381}

Sorting and Shuffling Dataframes

It is possible to sort and shuffle dataframes.


In [1]:
import os
import pandas as pd
import numpy as np

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
#np.random.seed(42) # Uncomment this line to get the same shuffle each time
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
df


Out[1]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 33.0 4 105.0 74.0 2190 14.2 81 2 volkswagen jetta
1 14.0 8 318.0 150.0 4096 13.0 71 1 plymouth fury iii
2 15.0 8 400.0 150.0 3761 9.5 70 1 chevrolet monte carlo
3 15.0 8 350.0 145.0 4440 14.0 75 1 chevrolet bel air
4 18.0 6 232.0 100.0 2945 16.0 73 1 amc hornet
5 34.4 4 98.0 65.0 2045 16.2 81 1 ford escort 4w
6 24.0 4 90.0 75.0 2108 15.5 74 2 fiat 128
7 17.6 8 302.0 129.0 3725 13.4 79 1 ford ltd landau
8 18.6 6 225.0 110.0 3620 18.7 78 1 dodge aspen
9 21.5 4 121.0 110.0 2600 12.8 77 2 bmw 320i
10 27.9 4 156.0 105.0 2800 14.4 80 1 dodge colt
11 18.0 8 318.0 150.0 3436 11.0 70 1 plymouth satellite
12 16.0 8 318.0 150.0 4190 13.0 76 1 dodge coronet brougham
13 35.1 4 81.0 60.0 1760 16.1 81 3 honda civic 1300
14 30.0 4 97.0 67.0 1985 16.4 77 3 subaru dl
15 14.0 8 302.0 137.0 4042 14.5 73 1 ford gran torino
16 44.6 4 91.0 67.0 1850 13.8 80 3 honda civic 1500 gl
17 15.0 8 318.0 150.0 3777 12.5 73 1 dodge coronet custom
18 23.9 4 119.0 97.0 2405 14.9 78 3 datsun 200-sx
19 28.0 4 116.0 90.0 2123 14.0 71 2 opel 1900
20 14.0 8 318.0 150.0 4077 14.0 72 1 plymouth satellite custom (sw)
21 29.0 4 135.0 84.0 2525 16.0 82 1 dodge aries se
22 18.0 6 232.0 100.0 2789 15.0 73 1 amc gremlin
23 15.0 8 350.0 145.0 4082 13.0 73 1 chevrolet monte carlo s
24 22.0 6 198.0 95.0 2833 15.5 70 1 plymouth duster
25 21.1 4 134.0 95.0 2515 14.8 78 3 toyota celica gt liftback
26 31.0 4 76.0 52.0 1649 16.5 74 3 toyota corona
27 36.0 4 120.0 88.0 2160 14.5 82 3 nissan stanza xe
28 38.0 4 91.0 67.0 1995 16.2 82 3 datsun 310 gx
29 18.0 6 225.0 105.0 3613 16.5 74 1 plymouth satellite sebring
... ... ... ... ... ... ... ... ... ...
368 22.0 6 250.0 105.0 3353 14.5 76 1 chevrolet nova
369 33.0 4 91.0 53.0 1795 17.5 75 3 honda civic cvcc
370 30.0 4 146.0 67.0 3250 21.8 80 2 mercedes-benz 240d
371 24.0 4 121.0 110.0 2660 14.0 73 2 saab 99le
372 30.5 4 98.0 63.0 2051 17.0 77 1 chevrolet chevette
373 19.4 6 232.0 90.0 3210 17.2 78 1 amc concord
374 10.0 8 307.0 200.0 4376 15.0 70 1 chevy c20
375 31.5 4 98.0 68.0 2045 18.5 77 3 honda accord cvcc
376 16.5 8 351.0 138.0 3955 13.2 79 1 mercury grand marquis
377 32.3 4 97.0 67.0 2065 17.8 81 3 subaru
378 13.0 8 351.0 158.0 4363 13.0 73 1 ford ltd
379 29.0 4 97.0 78.0 1940 14.5 77 2 volkswagen rabbit custom
380 23.0 4 120.0 88.0 2957 17.0 75 2 peugeot 504
381 26.6 8 350.0 105.0 3725 19.0 81 1 oldsmobile cutlass ls
382 20.8 6 200.0 85.0 3070 16.7 78 1 mercury zephyr
383 19.1 6 225.0 90.0 3381 18.7 80 1 dodge aspen
384 19.2 8 267.0 125.0 3605 15.0 79 1 chevrolet malibu classic (sw)
385 25.0 6 181.0 110.0 2945 16.4 82 1 buick century limited
386 35.7 4 98.0 80.0 1915 14.4 79 1 dodge colt hatchback custom
387 19.0 4 121.0 112.0 2868 15.5 73 2 volvo 144ea
388 13.0 8 350.0 165.0 4274 12.0 72 1 chevrolet impala
389 29.0 4 90.0 70.0 1937 14.2 76 2 vw rabbit
390 18.0 6 250.0 88.0 3021 16.5 73 1 ford maverick
391 20.5 6 231.0 105.0 3425 16.9 77 1 buick skylark
392 14.0 8 340.0 160.0 3609 8.0 70 1 plymouth 'cuda 340
393 29.9 4 98.0 65.0 2380 20.7 81 1 ford escort 2h
394 28.0 4 151.0 90.0 2678 16.5 80 1 chevrolet citation
395 12.0 8 400.0 167.0 4906 12.5 73 1 ford country
396 15.5 8 318.0 145.0 4140 13.7 77 1 dodge monaco brougham
397 43.4 4 90.0 48.0 2335 23.7 80 2 vw dasher (diesel)

398 rows × 9 columns


In [16]:
import os
import pandas as pd
import numpy as np

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.sort_values(by='name',ascending=True)
print("The first car is: {}".format(df['name'].iloc[0]))
df


The first car is: amc ambassador brougham
Out[16]:
mpg cylinders displacement horsepower weight acceleration year origin name
96 13.0 8 360.0 175.0 3821 11.0 73 1 amc ambassador brougham
9 15.0 8 390.0 190.0 3850 8.5 70 1 amc ambassador dpl
66 17.0 8 304.0 150.0 3672 11.5 72 1 amc ambassador sst
315 24.3 4 151.0 90.0 3003 20.1 80 1 amc concord
257 19.4 6 232.0 90.0 3210 17.2 78 1 amc concord
261 18.1 6 258.0 120.0 3410 15.1 78 1 amc concord d/l
374 23.0 4 151.0 NaN 3035 20.5 82 1 amc concord dl
283 20.2 6 232.0 90.0 3265 18.2 79 1 amc concord dl 6
107 18.0 6 232.0 100.0 2789 15.0 73 1 amc gremlin
33 19.0 6 232.0 100.0 2634 13.0 71 1 amc gremlin
169 20.0 6 232.0 100.0 2914 16.0 75 1 amc gremlin
24 21.0 6 199.0 90.0 2648 15.0 70 1 amc gremlin
127 19.0 6 232.0 100.0 2901 16.0 74 1 amc hornet
16 18.0 6 199.0 97.0 2774 15.5 70 1 amc hornet
194 22.5 6 232.0 90.0 3085 17.6 76 1 amc hornet
99 18.0 6 232.0 100.0 2945 16.0 73 1 amc hornet
45 18.0 6 258.0 110.0 2962 13.5 71 1 amc hornet sportabout (sw)
162 15.0 6 258.0 110.0 3730 19.0 75 1 amc matador
134 16.0 6 258.0 110.0 3632 18.0 74 1 amc matador
86 14.0 8 304.0 150.0 3672 11.5 73 1 amc matador
189 15.5 8 304.0 120.0 3962 13.9 76 1 amc matador
37 18.0 6 232.0 100.0 3288 15.5 71 1 amc matador
72 15.0 8 304.0 150.0 3892 12.5 72 1 amc matador (sw)
140 14.0 8 304.0 150.0 4257 15.5 74 1 amc matador (sw)
176 19.0 6 232.0 90.0 3211 17.0 75 1 amc pacer
202 17.5 6 258.0 95.0 3193 17.8 76 1 amc pacer d/l
3 16.0 8 304.0 150.0 3433 12.0 70 1 amc rebel sst
296 27.4 4 121.0 80.0 2670 15.0 79 1 amc spirit dl
21 24.0 4 107.0 90.0 2430 14.5 70 2 audi 100 ls
177 23.0 4 115.0 95.0 2694 15.0 75 2 audi 100ls
... ... ... ... ... ... ... ... ... ...
82 23.0 4 120.0 97.0 2506 14.5 72 3 toyouta corona mark ii (sw)
335 35.0 4 122.0 88.0 2500 15.1 80 2 triumph tr7 coupe
332 29.8 4 89.0 62.0 1845 15.3 80 2 vokswagen rabbit
19 26.0 4 97.0 46.0 1835 20.5 70 2 volkswagen 1131 deluxe sedan
77 22.0 4 121.0 76.0 2511 18.0 72 2 volkswagen 411 (sw)
172 25.0 4 90.0 71.0 2223 16.5 75 2 volkswagen dasher
142 26.0 4 79.0 67.0 1963 15.5 74 2 volkswagen dasher
240 30.5 4 97.0 78.0 2190 14.1 77 2 volkswagen dasher
353 33.0 4 105.0 74.0 2190 14.2 81 2 volkswagen jetta
55 27.0 4 97.0 60.0 1834 19.0 71 2 volkswagen model 111
175 29.0 4 90.0 70.0 1937 14.0 75 2 volkswagen rabbit
203 29.5 4 97.0 71.0 1825 12.2 76 2 volkswagen rabbit
233 29.0 4 97.0 78.0 1940 14.5 77 2 volkswagen rabbit custom
244 43.1 4 90.0 48.0 1985 21.5 78 2 volkswagen rabbit custom diesel
375 36.0 4 105.0 74.0 1980 15.3 82 2 volkswagen rabbit l
278 31.5 4 89.0 71.0 1990 14.9 78 2 volkswagen scirocco
102 26.0 4 97.0 46.0 1950 21.0 73 2 volkswagen super beetle
59 23.0 4 97.0 54.0 2254 23.5 72 2 volkswagen type 3
120 19.0 4 121.0 112.0 2868 15.5 73 2 volvo 144ea
76 18.0 4 121.0 112.0 2933 14.5 72 2 volvo 145e (sw)
179 22.0 4 121.0 98.0 2945 14.5 75 2 volvo 244dl
207 20.0 4 130.0 102.0 3150 15.7 76 2 volvo 245
275 17.0 6 163.0 125.0 3140 13.6 78 2 volvo 264gl
360 30.7 6 145.0 76.0 3160 19.6 81 2 volvo diesel
326 43.4 4 90.0 48.0 2335 23.7 80 2 vw dasher (diesel)
394 44.0 4 97.0 52.0 2130 24.6 82 2 vw pickup
309 41.5 4 98.0 76.0 2144 14.7 80 2 vw rabbit
197 29.0 4 90.0 70.0 1937 14.2 76 2 vw rabbit
325 44.3 4 90.0 48.0 2085 21.7 80 2 vw rabbit c (diesel)
293 31.9 4 89.0 71.0 1925 14.0 79 2 vw rabbit custom

398 rows × 9 columns

Saving a Dataframe

Many of the assignments in this course will require that you save a dataframe to submit to the instructor. The following code performs a shuffle and then saves a new copy.


In [17]:
import os
import pandas as pd
import numpy as np

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-shuffle.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index))
df.to_csv(filename_write,index=False) # Specify index=False to suppress row numbers
print("Done")


Done

Calculated Fields

It is possible to add new fields to the dataframe that are calculated from the other fields. We can create a new column that gives the weight in kilograms. The equation to calculate a metric weight, given a weight in pounds is:

$ m_{(kg)} = m_{(lb)} \times 0.45359237 $

This can be used with the following Python code:


In [5]:
import os
import pandas as pd
import numpy as np

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df.insert(1,'weight_kg',(df['weight']*0.45359237).astype(int))
df


Out[5]:
mpg weight_kg cylinders displacement horsepower weight acceleration year origin name
0 18.0 1589 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 1675 8 350.0 165.0 3693 11.5 70 1 buick skylark 320
2 18.0 1558 8 318.0 150.0 3436 11.0 70 1 plymouth satellite
3 16.0 1557 8 304.0 150.0 3433 12.0 70 1 amc rebel sst
4 17.0 1564 8 302.0 140.0 3449 10.5 70 1 ford torino
5 15.0 1969 8 429.0 198.0 4341 10.0 70 1 ford galaxie 500
6 14.0 1974 8 454.0 220.0 4354 9.0 70 1 chevrolet impala
7 14.0 1955 8 440.0 215.0 4312 8.5 70 1 plymouth fury iii
8 14.0 2007 8 455.0 225.0 4425 10.0 70 1 pontiac catalina
9 15.0 1746 8 390.0 190.0 3850 8.5 70 1 amc ambassador dpl
10 15.0 1616 8 383.0 170.0 3563 10.0 70 1 dodge challenger se
11 14.0 1637 8 340.0 160.0 3609 8.0 70 1 plymouth 'cuda 340
12 15.0 1705 8 400.0 150.0 3761 9.5 70 1 chevrolet monte carlo
13 14.0 1399 8 455.0 225.0 3086 10.0 70 1 buick estate wagon (sw)
14 24.0 1075 4 113.0 95.0 2372 15.0 70 3 toyota corona mark ii
15 22.0 1285 6 198.0 95.0 2833 15.5 70 1 plymouth duster
16 18.0 1258 6 199.0 97.0 2774 15.5 70 1 amc hornet
17 21.0 1173 6 200.0 85.0 2587 16.0 70 1 ford maverick
18 27.0 966 4 97.0 88.0 2130 14.5 70 3 datsun pl510
19 26.0 832 4 97.0 46.0 1835 20.5 70 2 volkswagen 1131 deluxe sedan
20 25.0 1211 4 110.0 87.0 2672 17.5 70 2 peugeot 504
21 24.0 1102 4 107.0 90.0 2430 14.5 70 2 audi 100 ls
22 25.0 1077 4 104.0 95.0 2375 17.5 70 2 saab 99e
23 26.0 1013 4 121.0 113.0 2234 12.5 70 2 bmw 2002
24 21.0 1201 6 199.0 90.0 2648 15.0 70 1 amc gremlin
25 10.0 2093 8 360.0 215.0 4615 14.0 70 1 ford f250
26 10.0 1984 8 307.0 200.0 4376 15.0 70 1 chevy c20
27 11.0 1987 8 318.0 210.0 4382 13.5 70 1 dodge d200
28 9.0 2146 8 304.0 193.0 4732 18.5 70 1 hi 1200d
29 27.0 966 4 97.0 88.0 2130 14.5 71 3 datsun pl510
... ... ... ... ... ... ... ... ... ... ...
368 27.0 1197 4 112.0 88.0 2640 18.6 82 1 chevrolet cavalier wagon
369 34.0 1086 4 112.0 88.0 2395 18.0 82 1 chevrolet cavalier 2-door
370 31.0 1168 4 112.0 85.0 2575 16.2 82 1 pontiac j2000 se hatchback
371 29.0 1145 4 135.0 84.0 2525 16.0 82 1 dodge aries se
372 27.0 1240 4 151.0 90.0 2735 18.0 82 1 pontiac phoenix
373 24.0 1299 4 140.0 92.0 2865 16.4 82 1 ford fairmont futura
374 23.0 1376 4 151.0 NaN 3035 20.5 82 1 amc concord dl
375 36.0 898 4 105.0 74.0 1980 15.3 82 2 volkswagen rabbit l
376 37.0 918 4 91.0 68.0 2025 18.2 82 3 mazda glc custom l
377 31.0 893 4 91.0 68.0 1970 17.6 82 3 mazda glc custom
378 38.0 963 4 105.0 63.0 2125 14.7 82 1 plymouth horizon miser
379 36.0 963 4 98.0 70.0 2125 17.3 82 1 mercury lynx l
380 36.0 979 4 120.0 88.0 2160 14.5 82 3 nissan stanza xe
381 36.0 1000 4 107.0 75.0 2205 14.5 82 3 honda accord
382 34.0 1018 4 108.0 70.0 2245 16.9 82 3 toyota corolla
383 38.0 891 4 91.0 67.0 1965 15.0 82 3 honda civic
384 32.0 891 4 91.0 67.0 1965 15.7 82 3 honda civic (auto)
385 38.0 904 4 91.0 67.0 1995 16.2 82 3 datsun 310 gx
386 25.0 1335 6 181.0 110.0 2945 16.4 82 1 buick century limited
387 38.0 1367 6 262.0 85.0 3015 17.0 82 1 oldsmobile cutlass ciera (diesel)
388 26.0 1172 4 156.0 92.0 2585 14.5 82 1 chrysler lebaron medallion
389 22.0 1285 6 232.0 112.0 2835 14.7 82 1 ford granada l
390 32.0 1208 4 144.0 96.0 2665 13.9 82 3 toyota celica gt
391 36.0 1075 4 135.0 84.0 2370 13.0 82 1 dodge charger 2.2
392 27.0 1338 4 151.0 90.0 2950 17.3 82 1 chevrolet camaro
393 27.0 1265 4 140.0 86.0 2790 15.6 82 1 ford mustang gl
394 44.0 966 4 97.0 52.0 2130 24.6 82 2 vw pickup
395 32.0 1040 4 135.0 84.0 2295 11.6 82 1 dodge rampage
396 28.0 1190 4 120.0 79.0 2625 18.6 82 1 ford ranger
397 31.0 1233 4 119.0 82.0 2720 19.4 82 1 chevy s-10

398 rows × 10 columns

Field Transformation & Preprocessing

The data fed into a machine learning model rarely bears much similarity to the data that the data scientist originally received. One common transformation is to normalize the inputs. A normalization allows numbers to be put in a standard form so that two values can easily be compared. Consider if a friend told you that he received a $10 discount. Is this a good deal? Maybe. But the value is not normalized. If your friend purchased a car, then the discount is not that good. If your friend purchased dinner, this is a very good discount!

Percentages are a very common form of normalization. If your friend tells you they got 10% off, we know that this is a better discount than 5%, no matter what the purchase price was. One very common machine learning normalization is the z-score:

$z = {x- \mu \over \sigma} $

To calculate the z-score you also need to calculate the mean ($\mu$) and the standard deviation ($\sigma$). The mean is calculated as follows:

$\mu = \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}$

The standard deviation is calculated as follows:

$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}, {\rm \ \ where\ \ } \mu = \frac{1}{N} \sum_{i=1}^N x_i$
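These formulas can be checked by hand against scipy's zscore (the three sample values below are made up purely for illustration):

```python
import numpy as np
from scipy.stats import zscore

x = np.array([10.0, 20.0, 30.0])     # made-up sample values
mu = x.mean()                        # mean: (10 + 20 + 30) / 3 = 20
sigma = x.std()                      # population std (ddof=0), as zscore uses
manual = (x - mu) / sigma

print(manual)
print(zscore(x))                     # should match the manual calculation
```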

The following Python code replaces the mpg column with its z-score. Cars with average MPG will be near zero; above zero is above average, and below zero is below average. Z-scores above 3 or below -3 are very rare; these are outliers.


In [20]:
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df['mpg'] = zscore(df['mpg'])
df


Out[20]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 -0.706439 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu
1 -1.090751 8 350.0 165.0 3693 11.5 70 1 buick skylark 320
2 -0.706439 8 318.0 150.0 3436 11.0 70 1 plymouth satellite
3 -0.962647 8 304.0 150.0 3433 12.0 70 1 amc rebel sst
4 -0.834543 8 302.0 140.0 3449 10.5 70 1 ford torino
5 -1.090751 8 429.0 198.0 4341 10.0 70 1 ford galaxie 500
6 -1.218855 8 454.0 220.0 4354 9.0 70 1 chevrolet impala
7 -1.218855 8 440.0 215.0 4312 8.5 70 1 plymouth fury iii
8 -1.218855 8 455.0 225.0 4425 10.0 70 1 pontiac catalina
9 -1.090751 8 390.0 190.0 3850 8.5 70 1 amc ambassador dpl
10 -1.090751 8 383.0 170.0 3563 10.0 70 1 dodge challenger se
11 -1.218855 8 340.0 160.0 3609 8.0 70 1 plymouth 'cuda 340
12 -1.090751 8 400.0 150.0 3761 9.5 70 1 chevrolet monte carlo
13 -1.218855 8 455.0 225.0 3086 10.0 70 1 buick estate wagon (sw)
14 0.062185 4 113.0 95.0 2372 15.0 70 3 toyota corona mark ii
15 -0.194023 6 198.0 95.0 2833 15.5 70 1 plymouth duster
16 -0.706439 6 199.0 97.0 2774 15.5 70 1 amc hornet
17 -0.322127 6 200.0 85.0 2587 16.0 70 1 ford maverick
18 0.446497 4 97.0 88.0 2130 14.5 70 3 datsun pl510
19 0.318393 4 97.0 46.0 1835 20.5 70 2 volkswagen 1131 deluxe sedan
20 0.190289 4 110.0 87.0 2672 17.5 70 2 peugeot 504
21 0.062185 4 107.0 90.0 2430 14.5 70 2 audi 100 ls
22 0.190289 4 104.0 95.0 2375 17.5 70 2 saab 99e
23 0.318393 4 121.0 113.0 2234 12.5 70 2 bmw 2002
24 -0.322127 6 199.0 90.0 2648 15.0 70 1 amc gremlin
25 -1.731270 8 360.0 215.0 4615 14.0 70 1 ford f250
26 -1.731270 8 307.0 200.0 4376 15.0 70 1 chevy c20
27 -1.603167 8 318.0 210.0 4382 13.5 70 1 dodge d200
28 -1.859374 8 304.0 193.0 4732 18.5 70 1 hi 1200d
29 0.446497 4 97.0 88.0 2130 14.5 71 3 datsun pl510
... ... ... ... ... ... ... ... ... ...
368 0.446497 4 112.0 88.0 2640 18.6 82 1 chevrolet cavalier wagon
369 1.343225 4 112.0 88.0 2395 18.0 82 1 chevrolet cavalier 2-door
370 0.958913 4 112.0 85.0 2575 16.2 82 1 pontiac j2000 se hatchback
371 0.702705 4 135.0 84.0 2525 16.0 82 1 dodge aries se
372 0.446497 4 151.0 90.0 2735 18.0 82 1 pontiac phoenix
373 0.062185 4 140.0 92.0 2865 16.4 82 1 ford fairmont futura
374 -0.065919 4 151.0 NaN 3035 20.5 82 1 amc concord dl
375 1.599433 4 105.0 74.0 1980 15.3 82 2 volkswagen rabbit l
376 1.727537 4 91.0 68.0 2025 18.2 82 3 mazda glc custom l
377 0.958913 4 91.0 68.0 1970 17.6 82 3 mazda glc custom
378 1.855641 4 105.0 63.0 2125 14.7 82 1 plymouth horizon miser
379 1.599433 4 98.0 70.0 2125 17.3 82 1 mercury lynx l
380 1.599433 4 120.0 88.0 2160 14.5 82 3 nissan stanza xe
381 1.599433 4 107.0 75.0 2205 14.5 82 3 honda accord
382 1.343225 4 108.0 70.0 2245 16.9 82 3 toyota corolla
383 1.855641 4 91.0 67.0 1965 15.0 82 3 honda civic
384 1.087017 4 91.0 67.0 1965 15.7 82 3 honda civic (auto)
385 1.855641 4 91.0 67.0 1995 16.2 82 3 datsun 310 gx
386 0.190289 6 181.0 110.0 2945 16.4 82 1 buick century limited
387 1.855641 6 262.0 85.0 3015 17.0 82 1 oldsmobile cutlass ciera (diesel)
388 0.318393 4 156.0 92.0 2585 14.5 82 1 chrysler lebaron medallion
389 -0.194023 6 232.0 112.0 2835 14.7 82 1 ford granada l
390 1.087017 4 144.0 96.0 2665 13.9 82 3 toyota celica gt
391 1.599433 4 135.0 84.0 2370 13.0 82 1 dodge charger 2.2
392 0.446497 4 151.0 90.0 2950 17.3 82 1 chevrolet camaro
393 0.446497 4 140.0 86.0 2790 15.6 82 1 ford mustang gl
394 2.624265 4 97.0 52.0 2130 24.6 82 2 vw pickup
395 1.087017 4 135.0 84.0 2295 11.6 82 1 dodge rampage
396 0.574601 4 120.0 79.0 2625 18.6 82 1 ford ranger
397 0.958913 4 119.0 82.0 2720 19.4 82 1 chevy s-10

398 rows × 9 columns

Missing Values

Missing values are a reality of machine learning. Ideally every row of data will have values for all columns; however, this is rarely the case. Most of the values are present in the MPG dataset, but there are missing values in the horsepower column. A common practice is to replace missing values with the median value for that column, because the median (the middle value of the sorted data) is less sensitive to outliers than the mean. The following code replaces any NA values in horsepower with the median:


In [21]:
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
# df = df.dropna() # you can also simply drop NA values
print("horsepower has na? {}".format(pd.isnull(df['horsepower']).values.any()))


horsepower has na? False
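Before filling, it is often useful to see how many values are missing in each column. The sketch below uses a small hypothetical frame (rather than the full MPG file) so the numbers are easy to check by hand:

```python
import pandas as pd
import numpy as np

# Small hypothetical frame standing in for the MPG data
df = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0],
    'horsepower': [130.0, np.nan, 150.0, np.nan],
})

# How many values are missing in each column?
print(df.isnull().sum())

# Median of the non-missing horsepower values: (130 + 150) / 2 = 140
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
print(df)
```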

Concatenating Rows and Columns

Rows and columns can be concatenated together to form new data frames.


In [22]:
# Create a new dataframe from name and horsepower

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower],axis=1)
result


Out[22]:
name horsepower
0 chevrolet chevelle malibu 130.0
1 buick skylark 320 165.0
2 plymouth satellite 150.0
3 amc rebel sst 150.0
4 ford torino 140.0
5 ford galaxie 500 198.0
6 chevrolet impala 220.0
7 plymouth fury iii 215.0
8 pontiac catalina 225.0
9 amc ambassador dpl 190.0
10 dodge challenger se 170.0
11 plymouth 'cuda 340 160.0
12 chevrolet monte carlo 150.0
13 buick estate wagon (sw) 225.0
14 toyota corona mark ii 95.0
15 plymouth duster 95.0
16 amc hornet 97.0
17 ford maverick 85.0
18 datsun pl510 88.0
19 volkswagen 1131 deluxe sedan 46.0
20 peugeot 504 87.0
21 audi 100 ls 90.0
22 saab 99e 95.0
23 bmw 2002 113.0
24 amc gremlin 90.0
25 ford f250 215.0
26 chevy c20 200.0
27 dodge d200 210.0
28 hi 1200d 193.0
29 datsun pl510 88.0
... ... ...
368 chevrolet cavalier wagon 88.0
369 chevrolet cavalier 2-door 88.0
370 pontiac j2000 se hatchback 85.0
371 dodge aries se 84.0
372 pontiac phoenix 90.0
373 ford fairmont futura 92.0
374 amc concord dl NaN
375 volkswagen rabbit l 74.0
376 mazda glc custom l 68.0
377 mazda glc custom 68.0
378 plymouth horizon miser 63.0
379 mercury lynx l 70.0
380 nissan stanza xe 88.0
381 honda accord 75.0
382 toyota corolla 70.0
383 honda civic 67.0
384 honda civic (auto) 67.0
385 datsun 310 gx 67.0
386 buick century limited 110.0
387 oldsmobile cutlass ciera (diesel) 85.0
388 chrysler lebaron medallion 92.0
389 ford granada l 112.0
390 toyota celica gt 96.0
391 dodge charger 2.2 84.0
392 chevrolet camaro 90.0
393 ford mustang gl 86.0
394 vw pickup 52.0
395 dodge rampage 84.0
396 ford ranger 79.0
397 chevy s-10 82.0

398 rows × 2 columns


In [23]:
# Create a new dataframe from name and horsepower, but this time by row

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower])
result


Out[23]:
0         chevrolet chevelle malibu
1                 buick skylark 320
2                plymouth satellite
3                     amc rebel sst
4                       ford torino
5                  ford galaxie 500
6                  chevrolet impala
7                 plymouth fury iii
8                  pontiac catalina
9                amc ambassador dpl
10              dodge challenger se
11               plymouth 'cuda 340
12            chevrolet monte carlo
13          buick estate wagon (sw)
14            toyota corona mark ii
15                  plymouth duster
16                       amc hornet
17                    ford maverick
18                     datsun pl510
19     volkswagen 1131 deluxe sedan
20                      peugeot 504
21                      audi 100 ls
22                         saab 99e
23                         bmw 2002
24                      amc gremlin
25                        ford f250
26                        chevy c20
27                       dodge d200
28                         hi 1200d
29                     datsun pl510
                   ...             
368                              88
369                              88
370                              85
371                              84
372                              90
373                              92
374                             NaN
375                              74
376                              68
377                              68
378                              63
379                              70
380                              88
381                              75
382                              70
383                              67
384                              67
385                              67
386                             110
387                              85
388                              92
389                             112
390                              96
391                              84
392                              90
393                              86
394                              52
395                              84
396                              79
397                              82
dtype: object
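Note that concatenating two Series by row, as above, simply stacks their values into one long Series. To stack complete rows from two data frames that share the same columns, `pd.concat` can be applied to DataFrames instead; a minimal sketch with hypothetical frames:

```python
import pandas as pd

# Two small hypothetical frames with the same columns
df_a = pd.DataFrame({'name': ['car a', 'car b'], 'horsepower': [100.0, 110.0]})
df_b = pd.DataFrame({'name': ['car c'], 'horsepower': [120.0]})

# axis=0 (the default) stacks rows; ignore_index renumbers them 0..n-1
result = pd.concat([df_a, df_b], ignore_index=True)
print(result)
```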

Training and Validation

It is very important that we evaluate a machine learning model on its ability to predict data that it has never seen before. Because of this, we often divide the available data into a training set and a validation set. The machine learning model learns from the training data, but is ultimately evaluated on the validation data.

  • Training Data - In Sample Data - The data that the machine learning model was fit to/created from.
  • Validation Data - Out of Sample Data - The data that the machine learning model is evaluated upon after it is fit to the training data.

There are two predominant means of dealing with training and validation data:

  • Training/Validation Split - The data are split according to some ratio between a training and validation (hold-out) set. Common ratios are 80% training and 20% validation.
  • K-Fold Cross Validation - The data are split into a number of folds, and one model is trained per fold. Because a number of models equal to the number of folds is created, out-of-sample predictions can be generated for the entire dataset.

Training/Validation Split

The code below splits the MPG data into a training set and a validation set. The training set uses approximately 80% of the data and the validation set the remaining 20%; because the split is made with a random mask, the ratio is approximate.

The following image shows how a model is trained on 80% of the data and then validated against the remaining 20%.


In [24]:
path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index)) # Usually a good idea to shuffle
mask = np.random.rand(len(df)) < 0.8
trainDF = pd.DataFrame(df[mask])
validationDF = pd.DataFrame(df[~mask])

print("Training DF: {}".format(len(trainDF)))
print("Validation DF: {}".format(len(validationDF)))


Training DF: 317
Validation DF: 81
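The random mask above gives an approximately 80/20 split. If an exact ratio is needed, scikit-learn's `train_test_split` (imported from `sklearn.model_selection` in current versions) can be used instead; a sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small hypothetical frame standing in for the MPG data
df = pd.DataFrame({'mpg': np.arange(100, dtype=float)})

# Exactly 80/20; random_state makes the split reproducible
trainDF, validationDF = train_test_split(df, test_size=0.2, random_state=42)

print("Training DF: {}".format(len(trainDF)))
print("Validation DF: {}".format(len(validationDF)))
```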

K-Fold Cross Validation

There are several types of cross validation; however, k-fold is the most common. The value K specifies the number of folds. The two most common values for K are either 5 or 10. For this course we will always use a K value of 5, or a 5-fold cross validation. A 5-fold validation is illustrated by the following diagram:

First, the data are split into 5 equal (or close to, due to rounding) folds. These folds are used to generate 5 training/validation set combinations. Each of the folds becomes the validation set once, and the remaining folds become the training sets. This allows the validated results to be appended together to produce a final out-of-sample prediction for the entire dataset.

The following code demonstrates a 5-fold cross validation:


In [25]:
import os
from sklearn.model_selection import KFold  # sklearn.cross_validation in older versions
import pandas as pd
import numpy as np

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)
kf = KFold(n_splits=5)

fold = 1
for train_index, validate_index in kf.split(df):
    trainDF = pd.DataFrame(df.iloc[train_index,:])
    validateDF = pd.DataFrame(df.iloc[validate_index])
    print("Fold #{}, Training Size: {}, Validation Size: {}".format(fold,len(trainDF),len(validateDF)))
    fold+=1


Fold #1, Training Size: 318, Validation Size: 80
Fold #2, Training Size: 318, Validation Size: 80
Fold #3, Training Size: 318, Validation Size: 80
Fold #4, Training Size: 319, Validation Size: 79
Fold #5, Training Size: 319, Validation Size: 79
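The out-of-sample predictions described above can be assembled by filling each validation fold with predictions from a model that never saw it. The sketch below uses `LinearRegression` on hypothetical data as a stand-in model (the model choice is illustrative, not part of the original notebook) and the current `sklearn.model_selection.KFold` API:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical data: y is an exact linear function of x
x = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel()

kf = KFold(n_splits=5, shuffle=True, random_state=42)
oos_pred = np.zeros(len(y))

for train_index, validate_index in kf.split(x):
    model = LinearRegression()
    model.fit(x[train_index], y[train_index])
    # Each row is predicted exactly once, by a model that never saw it
    oos_pred[validate_index] = model.predict(x[validate_index])

print(oos_pred[:5])
```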

Accessing Files Directly

It is possible to access files directly, rather than using Pandas. Using the csv package, you can read a file line by line and process it as you go; this allows you to handle very large files that would not fit into memory. For the purposes of this class, all files will fit into memory, and you should use Pandas for all class assignments.


In [27]:
# Read a raw text file (avoid this)
import codecs
import os

path = "./data"

# Always specify your encoding! There is no such thing as "it's just a text file".
# See... http://www.joelonsoftware.com/articles/Unicode.html
# Also see... http://www.utf8everywhere.org/
encoding = 'utf-8'
filename = os.path.join(path,"auto-mpg.csv")

c = 0

with codecs.open(filename, "r", encoding) as fh:
    # Iterate over this line by line...
    for line in fh:
        c+=1
        if c>5: break # Only print the first 5 lines
        print(line.strip())


mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
18,8,307,130,3504,12,70,1,chevrolet chevelle malibu
15,8,350,165,3693,11.5,70,1,buick skylark 320
18,8,318,150,3436,11,70,1,plymouth satellite
16,8,304,150,3433,12,70,1,amc rebel sst

In [28]:
# Read a CSV file
import codecs
import os
import csv

encoding = 'utf-8'
path = "./data/"
filename = os.path.join(path,"auto-mpg.csv")

c = 0

with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)
    for row in reader:
        c+=1
        if c>5: break
        print(row)


['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevrolet chevelle malibu']
['15', '8', '350', '165', '3693', '11.5', '70', '1', 'buick skylark 320']
['18', '8', '318', '150', '3436', '11', '70', '1', 'plymouth satellite']
['16', '8', '304', '150', '3433', '12', '70', '1', 'amc rebel sst']

In [30]:
# Read a CSV, symbolic headers
import codecs
import os
import csv

path = "./data"

encoding = 'utf-8'
filename = os.path.join(path,"auto-mpg.csv")

c = 0

with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)

    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}
    
    for row in reader:
        c+=1
        if c>5: break
        print( "Car Name: {}".format(row[header_idx['name']]))


Car Name: chevrolet chevelle malibu
Car Name: buick skylark 320
Car Name: plymouth satellite
Car Name: amc rebel sst
Car Name: ford torino

In [31]:
# Read a CSV, manual stats
import codecs
import os
import csv
import math

path = "./data/"

encoding = 'utf-8'
filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-norm.csv")

c = 0

with codecs.open(filename_read, "r", encoding) as fh:
    reader = csv.reader(fh)

    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}
    headers = header_idx.keys()
    
    fields = {key: {'count':0,'sum':0,'variance':0} for key in headers}
    
    # Pass 1, means
    row_count = 0
    for row in reader:
        row_count += 1
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                field['count'] += 1
                field['sum'] += value
            except ValueError:
                pass
    
    # Calculate means, toss sums (part of pass 1)
    for field in fields.values():
        # If 90% are not missing (or non-numeric) calculate a mean
        if (field['count']/row_count)>0.9:
            field['mean'] = field['sum'] / field['count']
            del field['sum']
    
    # Pass 2, variance (rewind the file; the re-read header row fails float() and is skipped)
    fh.seek(0)
    for row in reader:
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                # If we failed to calculate a mean, no variance.
                if 'mean' in field:
                    field['variance'] += (value - field['mean'])**2
            except ValueError:
                pass
            
    # Calculate standard deviation, keep variance (part of pass 2)
    for field in fields.values():
        # If no variance, then no standard deviation
        if 'mean' in field:
            field['variance'] /= field['count']
            field['sdev'] = math.sqrt(field['variance'])
        else:
            del field['variance']
    
    # Print summary stats
    for key in sorted(fields.keys()):
        print("{}:{}".format(key,fields[key]))


acceleration:{'sdev': 2.7542223175940177, 'mean': 15.568090452261291, 'count': 398, 'variance': 7.585740574732961}
cylinders:{'sdev': 1.698865960539558, 'mean': 5.454773869346734, 'count': 398, 'variance': 2.8861455518799946}
displacement:{'sdev': 104.13876352708563, 'mean': 193.42587939698493, 'count': 398, 'variance': 10844.882068950259}
horsepower:{'sdev': 38.442032714425984, 'mean': 104.46938775510205, 'count': 392, 'variance': 1477.7898792169979}
mpg:{'sdev': 7.806159061274433, 'mean': 23.514572864321615, 'count': 398, 'variance': 60.93611928991693}
name:{'sum': 0, 'count': 0}
origin:{'sdev': 0.801046637381194, 'mean': 1.5728643216080402, 'count': 398, 'variance': 0.6416757152597181}
weight:{'sdev': 845.7772335198177, 'mean': 2970.424623115578, 'count': 398, 'variance': 715339.1287404363}
year:{'sdev': 3.6929784655780975, 'mean': 76.01005025125629, 'count': 398, 'variance': 13.638089947223559}
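For files that do fit in memory, Pandas computes the same summary statistics in a line or two. One caveat worth noting: `Series.std` defaults to the sample (ddof=1) standard deviation, while the two-pass code above divides by $N$; a small hypothetical example:

```python
import pandas as pd

# Small hypothetical column standing in for mpg
s = pd.Series([18.0, 15.0, 18.0, 16.0, 17.0])

mean = s.mean()
pop_sdev = s.std(ddof=0)   # population (divide-by-N) std, matching the manual code
samp_sdev = s.std()        # pandas default: sample (divide-by-N-1) std

print(mean, pop_sdev, samp_sdev)
```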

First Programming Assignment

The first programming assignment will give you a chance to try out Python, Pandas and build some skills that you will use to learn about Deep Learning. You should submit this assignment as either a Jupyter notebook (.ipynb) or a regular Python (.py) file. The following code shows a possible skeleton structure for this assignment:


In [ ]:
# Programming Assignment #1, 
# Solution by YOUR NAME
# T81-558: Application of Deep Learning
import os
import sklearn
from sklearn.model_selection import KFold  # sklearn.cross_validation in older versions
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "./data/"

def question1():
    print()
    print("***Question 1***")
    
def question2():
    print()
    print("***Question 2***")

def question3():
    print()
    print("***Question 3***")
    
def question4():
    print()
    print("***Question 4***")

def question5():
    print()
    print("***Question 5***")
      

question1()
question2()
question3()
question4()
question5()

In [ ]: