Class 1: Python for Machine Learning
Deep learning is a group of exciting new technologies for neural networks. By using a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. This course will introduce the student to deep belief neural networks, rectified linear units (ReLU), convolutional neural networks, and recurrent neural networks. High performance computing (HPC) aspects will demonstrate how deep learning can be leveraged both on graphics processing units (GPUs) and on grids. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain. The focus will be primarily upon the application of deep learning, with some introduction to its mathematical foundations. Students will use the Python programming language to architect deep learning models for several real-world data sets and interpret the results of these networks.
Your grade will be calculated according to the following assignments:
Assignment | Weight | Title |
---|---|---|
Class Participation | 10% | Class attendance and participation |
Program 1 | 10% | Python for data science |
Program 2 | 10% | TensorFlow for classification |
Program 3 | 10% | Time series with TensorFlow |
Program 4 | 10% | Computer vision with TensorFlow |
Mid Term | 20% | Understanding of deep learning and TensorFlow |
Final Project | 30% | Adapt deep learning to a past Kaggle competition |
The following book will be used to supplement in-class discussion. Internet resources and papers will augment the text with the latest research.
Heaton, J. (2015). Deep learning and neural networks (Vol. 3, Artificial Intelligence for Humans). St. Louis, MO: Heaton Research.
You do not need the other books in the series.
I will be your instructor for this course. A brief summary of my credentials is given here:
Social media:
The focus of this class is deep learning, a very popular type of machine learning based upon the original neural networks popularized in the 1980s. There is very little difference between how a deep neural network is calculated and how the original neural networks were. A deep neural network is nothing more than a neural network with many layers. While we've always been able to create and calculate deep neural networks, we've lacked an effective means of training them. Deep learning provides an efficient means to train deep neural networks.
If deep learning is a type of machine learning, this raises the question, "What is machine learning?" The following diagram illustrates how machine learning differs from traditional software development.
Researchers have applied machine learning to many different areas. This class will explore three specific domains for the application of deep neural networks:
Regression is when a model, such as a neural network, accepts input and produces a numeric output. Consider if you were tasked with writing a program that predicted how many miles per gallon (MPG) a car could achieve. For the inputs, you would probably want features such as the weight of the car, the horsepower, how large the engine is, and so on. Your program would be a combination of math and if-statements.
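A hand-coded estimator might look something like the following sketch. The constants and cutoffs here are invented purely for illustration, not derived from real data:
In [ ]:
# A hypothetical hand-coded MPG estimator: math plus if-statements.
def estimate_mpg(weight_lbs, horsepower):
    # Made-up coefficients: heavier, more powerful cars get fewer MPG.
    mpg = 50.0 - 0.005 * weight_lbs - 0.05 * horsepower
    if weight_lbs > 4000:
        mpg -= 2.0  # made-up extra penalty for very heavy cars
    return max(mpg, 5.0)

print(estimate_mpg(2800, 90))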
Machine learning lets the computer learn the "formula" for calculating the MPG of a car, using data. Consider this dataset. We can use regression machine learning models to study this data and learn how to predict the MPG for a car.
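As a quick preview of material covered later in this class, a minimal regression sketch might look like the following. It assumes the auto-mpg.csv file used throughout these notes, and it uses a scikit-learn linear regression simply as a stand-in model:
In [ ]:
import os
import pandas as pd
from sklearn.linear_model import LinearRegression

path = "./data/"
df = pd.read_csv(os.path.join(path,"auto-mpg.csv"),na_values=['NA','?'])
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Learn the "formula" for MPG from two of the features.
model = LinearRegression()
model.fit(df[['weight','horsepower']], df['mpg'])
print(model.predict([[2800, 90]])) # estimated MPG for a hypothetical car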
The output of a classification model is what class the input belongs to. For example, consider using four measurements of an iris flower to determine the species that the flower is in. This dataset could be used to perform this.
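As a sketch of classification (again with scikit-learn only as a stand-in; this class will use deep neural networks for the same task), the copy of the iris dataset bundled with scikit-learn can be classified with a simple nearest-neighbor model:
In [ ]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Four measurements in, one of three species out.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(iris.data, iris.target)

sample = [[5.1, 3.5, 1.4, 0.2]] # sepal length/width, petal length/width (cm)
print(iris.target_names[model.predict(sample)[0]])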
Neural networks are one of the earliest types of machine learning model. Neural networks were originally introduced in the 1940's and have risen and fallen several times from popularity. Four researchers have contributed greatly to the development of neural networks. They have consistently pushed neural network research, both through the ups and downs:
The current luminaries of artificial neural network (ANN) research, and ultimately deep learning, in order of appearance in the above picture:
For predictive modeling, neural networks are not that different from other models, such as:
Like these other models, neural networks can perform both classification and regression. When applied to relatively low-dimensional predictive modeling tasks, deep neural networks do not necessarily add significant accuracy over other model types. Andrew Ng describes the advantage of deep neural networks over traditional model types as follows:
Neural networks also have two additional significant advantages over other machine learning models:
Python 3.x is the programming language that will be used for this class. Python, as a programming language, has the widest support for deep learning. The three most popular frameworks for deep learning in Python are:
Some references on popular programming languages for AI/Data Science:
This is a technical class. You will need to be able to compile and execute Python code that makes use of TensorFlow for deep learning. There are two options available to you for accomplishing this:
This option allows you to skip any issues associated with installing Python and TensorFlow on your machine. Installing Python is relatively easy; however, TensorFlow has specific instructions for Windows, Linux, and Mac. It is straightforward to install TensorFlow on a Mac or Linux. Windows is an entirely different prospect, as Google does not offer specific support for Windows at this time.
The IBM Data Scientist Workbench is a website that provides you with your own environment for running Jupyter notebooks. There is nothing proprietary about the workbench; the same code that runs on the IBM system will also run on your local computer. I will be using the Data Scientist Workbench for many of the examples during class. To make use of this website you will need to register at the following URL:
When you first sign up, it will take the workbench some time to set up your environment; this could easily take 30 minutes or more. While your environment is being set up, you will see a cute icon of a dog chasing its tail.
Upon logging into the workbench, you will see a welcome screen similar to the following:
You will primarily make use of the "My Data" and "Jupyter Notebook" buttons on the above page. Clicking "My Data" will reveal all data that is currently held by your account. This includes both CSV data files, as well as any Jupyter notebooks you might have loaded or created.
Clicking "Jupyter Notebook" will start Jupyter Notebook. This allows you to choose which notebook you would like to work with. If you downloaded a notebook from my GitHub site you can simply drag this .ipynb file to the web browser. You can also choose to create a new Jupyter notebook that you can later download. The following screen capture shows Jupyter notebook running in Data Scientist Workbench.
It is also possible to install and run Python/TensorFlow entirely from your own computer. This will be somewhat difficult for Microsoft Windows, as Google has not yet added official TensorFlow support for Windows. Official support is currently only provided for Mac and Linux.
The first step is to install Python 3.x. I recommend using the Anaconda release of Python, as it already includes many of the data science packages that will be needed by this class. Anaconda directly supports Windows, Mac, and Linux. Download Anaconda from the following URL:
Once Anaconda has been downloaded, it is easy to install Jupyter notebooks with the following command:
conda install jupyter
Once Jupyter is installed, it is started with the following command:
jupyter notebook
Space matters in Python: indent code to define blocks.
Jupyter notebooks allow Python and Markdown to coexist.
Even $\LaTeX$:
$ f'(x) = \lim_{h\to0} \frac{f(x+h) - f(x)}{h}. $
If you ever see xrange instead of range, you are dealing with Python 2.
If you ever see print x instead of print(x), you are dealing with Python 2.
In [2]:
# What version of Python do you have?
import sys
import tensorflow as tf
import sklearn as sk
import pandas as pd
print("Python {}".format(sys.version))
print('TensorFlow {}'.format(tf.__version__))
print('Pandas {}'.format(pd.__version__))
print('Scikit-Learn {}'.format(sk.__version__))
Software used in this class:
In [1]:
#Python cares about space! No curly braces.
for x in range(1,10): # If you ever see xrange, you are in Python 2
    print(x) # If you ever see print x (no parenthesis), you are in Python 2
In [2]:
sum = 0
for x in range(1,10):
    sum += x
    print("Adding {}, sum so far is {}".format(x,sum))
print("Final sum: {}".format(sum))
In [3]:
c = ['a', 'b', 'c', 'd']
print(c)
In [4]:
# Iterate over a collection.
for s in c:
    print(s)
In [5]:
# Iterate over a collection and know your index. (Python is zero-based!)
for i,c in enumerate(c):
    print("{}:{}".format(i,c))
In [6]:
# Manually add items, lists allow duplicates
c = []
c.append('a')
c.append('b')
c.append('c')
c.append('c')
print(c)
In [7]:
# Manually add items, sets do not allow duplicates
# Sets add, lists append. I find this annoying.
c = set()
c.add('a')
c.add('b')
c.add('c')
c.add('c')
print(c)
In [8]:
# Insert
c = ['a','b','c']
c.insert(0,'a0')
print(c)
# Remove
c.remove('b')
print(c)
# Remove at index
del c[0]
print(c)
In [9]:
map = { 'name': "Jeff", 'address':"123 Main"}
print(map)
print(map['name'])
if 'name' in map:
    print("Name is defined")
if 'age' in map:
    print("age defined")
else:
    print("age undefined")
In [3]:
map = { 'name': "Jeff", 'address':"123 Main"}
# All of the keys
print("Key: {}".format(map.keys()))
# All of the values
print("Values: {}".format(map.values()))
In [11]:
# Python list & map structures
customers = [
    {'name': 'Jeff & Tracy Heaton', 'pets': ['Wynton','Cricket']},
    {'name': 'John Smith', 'pets': ['rover']},
    {'name': 'Jane Doe'}
]
print(customers)
for customer in customers:
    print("{}:{}".format(customer['name'],customer.get('pets','no pets')))
Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is based on the dataframe concept found in the R programming language. For this class, Pandas will be the primary means by which data is manipulated in conjunction with neural networks.
The dataframe is a key component of Pandas. We will use it to access the auto-mpg dataset. This dataset can be found on the UCI machine learning repository. For this class we will use a version of the Auto MPG dataset where I added column headers. You can find my version here.
This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. It contains data for 398 cars, including mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and the car's name.
The following code loads the MPG dataset into a dataframe:
In [13]:
# Simple dataframe
import os
import pandas as pd
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read)
print(df[0:5])
In [14]:
# Perform basic statistics on a dataframe.
import os
import pandas as pd
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
# Strip non-numerics
df = df.select_dtypes(include=['int', 'float'])
headers = list(df.columns.values)
fields = []
for field in headers:
    fields.append({
        'name' : field,
        'mean': df[field].mean(),
        'var': df[field].var(),
        'sdev': df[field].std()
    })
for field in fields:
    print(field)
In [1]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
#np.random.seed(42) # Uncomment this line to get the same shuffle each time
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
df
Out[1]:
In [16]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.sort_values(by='name',ascending=True)
print("The first car is: {}".format(df['name'].iloc[1]))
df
Out[16]:
In [17]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-shuffle.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index))
df.to_csv(filename_write,index=False) # Specify index = false to not write row numbers
print("Done")
It is possible to add new fields to the dataframe that are calculated from the other fields. We can create a new column that gives the weight in kilograms. The equation to calculate a metric weight, given a weight in pounds is:
$ m_{(kg)} = m_{(lb)} \times 0.45359237 $
This can be used with the following Python code:
In [5]:
import os
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df.insert(1,'weight_kg',(df['weight']*0.45359237).astype(int))
df
Out[5]:
The data fed into a machine learning model rarely bears much similarity to the data that the data scientist originally received. One common transformation is to normalize the inputs. Normalization puts numbers into a standard form so that two values can easily be compared. Consider if a friend told you that he received a $10 discount. Is this a good deal? Maybe. But the value is not normalized. If your friend purchased a car, then the discount is not that good. If your friend purchased dinner, this is a very good discount!
Percentages are a very common form of normalization. If your friend tells you they got 10% off, we know that this is a better discount than 5%. It does not matter how much the purchase price was. One very common machine learning normalization is the Z-Score:
$z = {x- \mu \over \sigma} $
To calculate the Z-Score you need to also calculate the mean($\mu$) and the standard deviation ($\sigma$). The mean is calculated as follows:
$\mu = \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}$
The standard deviation is calculated as follows:
$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}, {\rm \ \ where\ \ } \mu = \frac{1}{N} \sum_{i=1}^N x_i$
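For example, using approximate values for this dataset (a mean MPG near 23.5 and a standard deviation near 7.8), a car that achieves 30 MPG has a z-score of roughly $(30 - 23.5)/7.8 \approx 0.83$, a bit less than one standard deviation above average.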
The following Python code replaces the mpg column with its z-score. Cars with average MPG will be near zero; above zero is above average, and below zero is below average. Z-scores above 3 or below -3 are very rare; these are outliers.
In [20]:
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df['mpg'] = zscore(df['mpg'])
df
Out[20]:
Missing values are a reality of machine learning. Ideally every row of data will have values for all columns. However, this is rarely the case. Most of the values are present in the MPG database. However, there are missing values in the horsepower column. A common practice is to replace missing values with the median value for that column. The median is calculated as described here. The following code replaces any NA values in horsepower with the median:
In [21]:
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
# df = df.dropna() # you can also simply drop NA values
print("horsepower has na? {}".format(pd.isnull(df['horsepower']).values.any()))
In [22]:
# Create a new dataframe from name and horsepower
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower],axis=1)
result
Out[22]:
In [23]:
# Create a new dataframe from name and horsepower, but this time by row
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower])
result
Out[23]:
It is very important that we evaluate a machine learning model based on its ability to predict data that it has never seen before. Because of this, we often divide the available data into a training set and a validation set. The machine learning model will learn from the training data, but ultimately be evaluated based on the validation data.
There are two predominant means of dealing with training and validation data:
In [24]:
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index)) # Usually a good idea to shuffle
mask = np.random.rand(len(df)) < 0.8
trainDF = pd.DataFrame(df[mask])
validationDF = pd.DataFrame(df[~mask])
print("Training DF: {}".format(len(trainDF)))
print("Validation DF: {}".format(len(validationDF)))
There are several types of cross validation; however, k-fold is the most common. The value K specifies the number of folds. The two most common values for K are either 5 or 10. For this course we will always use a K value of 5, or a 5-fold cross validation. A 5-fold validation is illustrated by the following diagram:
First, the data are split into 5 equal (or nearly equal, after rounding) folds. These folds generate 5 training/validation set combinations. Each fold becomes the validation set exactly once, with the remaining folds serving as the training set. This allows the validated results to be appended together to produce a final out-of-sample prediction for the entire dataset.
The following code demonstrates a 5-fold cross validation:
In [25]:
import os
from sklearn.cross_validation import KFold
import pandas as pd
import numpy as np
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index))
kf = KFold(len(df), n_folds=5)
fold = 1
for train_index, validate_index in kf:
    trainDF = pd.DataFrame(df.ix[train_index,:])
    validateDF = pd.DataFrame(df.ix[validate_index])
    print("Fold #{}, Training Size: {}, Validation Size: {}".format(fold,len(trainDF),len(validateDF)))
    fold+=1
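To illustrate the out-of-sample assembly described above, the following sketch fits a model inside the same KFold loop and collects one prediction per row. A scikit-learn linear regression stands in for the neural networks used later in this course, and the feature columns were chosen only for illustration:
In [ ]:
import os
from sklearn.cross_validation import KFold
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

X = df[['weight','horsepower','displacement']].values
y = df['mpg'].values

# Collect one out-of-sample prediction per row across the 5 folds.
oos_pred = np.zeros(len(df))
kf = KFold(len(df), n_folds=5)
for train_index, validate_index in kf:
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])
    oos_pred[validate_index] = model.predict(X[validate_index])

# Every row now has a prediction from a model that never saw that row.
print("Out-of-sample RMSE: {}".format(np.sqrt(np.mean((oos_pred - y)**2))))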
It is possible to access files directly, rather than using Pandas. Using the csv package, you can read files in line by line and process them. Accessing a file line by line allows you to process very large files that would not fit into memory. For the purposes of this class, all files will fit into memory, and you should use Pandas for all class assignments.
In [27]:
# Read a raw text file (avoid this)
import codecs
import os
path = "./data"
# Always specify your encoding! There is no such thing as "its just a text file".
# See... http://www.joelonsoftware.com/articles/Unicode.html
# Also see... http://www.utf8everywhere.org/
encoding = 'utf-8'
filename = os.path.join(path,"auto-mpg.csv")
c = 0
with codecs.open(filename, "r", encoding) as fh:
    # Iterate over the file line by line...
    for line in fh:
        c+=1 # Only print the first 5 lines
        if c>5: break
        print(line.strip())
In [28]:
# Read a CSV file
import codecs
import os
import csv
encoding = 'utf-8'
path = "./data/"
filename = os.path.join(path,"auto-mpg.csv")
c = 0
with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)
    for row in reader:
        c+=1
        if c>5: break
        print(row)
In [30]:
# Read a CSV, symbolic headers
import codecs
import os
import csv
path = "./data"
encoding = 'utf-8'
filename = os.path.join(path,"auto-mpg.csv")
c = 0
with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)
    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}
    for row in reader:
        c+=1
        if c>5: break
        print("Car Name: {}".format(row[header_idx['name']]))
In [31]:
# Read a CSV, manual stats
import codecs
import os
import csv
import math
path = "./data/"
encoding = 'utf-8'
filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-norm.csv")
c = 0
with codecs.open(filename_read, "r", encoding) as fh:
    reader = csv.reader(fh)
    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}
    headers = header_idx.keys()
    fields = {key: {'count':0,'sum':0,'variance':0} for key in headers}
    # Pass 1, means
    row_count = 0
    for row in reader:
        row_count += 1
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                field['count'] += 1
                field['sum'] += value
            except ValueError:
                pass
    # Calculate means, toss sums (part of pass 1)
    for field in fields.values():
        # If at least 90% of rows are numeric (not missing), calculate a mean
        if (field['count']/row_count)>0.9:
            field['mean'] = field['sum'] / field['count']
        del field['sum']
    # Pass 2, standard deviation & variance
    fh.seek(0)
    for row in reader:
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                # If we failed to calculate a mean, no variance.
                if 'mean' in field:
                    field['variance'] += (value - field['mean'])**2
            except ValueError:
                pass
    # Calculate standard deviation, keep variance (part of pass 2)
    for field in fields.values():
        # If no variance, then no standard deviation
        if 'mean' in field:
            field['variance'] /= field['count']
            field['sdev'] = math.sqrt(field['variance'])
        else:
            del field['variance']
# Print summary stats
for key in sorted(fields.keys()):
    print("{}:{}".format(key,fields[key]))
The first programming assignment will give you a chance to try out Python, Pandas and build some skills that you will use to learn about Deep Learning. You should submit this assignment as either a Jupyter notebook (.ipynb) or a regular Python (.py) file. The following code shows a possible skeleton structure for this assignment:
In [ ]:
# Programming Assignment #1,
# Solution by YOUR NAME
# T81-558: Application of Deep Learning
import os
import sklearn
from sklearn.cross_validation import KFold
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "./data/"
def question1():
    print()
    print("***Question 1***")

def question2():
    print()
    print("***Question 2***")

def question3():
    print()
    print("***Question 3***")

def question4():
    print()
    print("***Question 4***")

def question5():
    print()
    print("***Question 5***")
question1()
question2()
question3()
question4()
question5()