Exercise 13

This Automobile Data Set includes a good mix of categorical and continuous values and serves as a useful example that is relatively easy to understand. Because domain understanding is important when deciding how to encode categorical values, this data set makes a good case study.

Read the data into Pandas



In [1]:

    
import pandas as pd

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()









    Out[1]:







  
    
      
   symboling  normalized_losses         make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  wheel_base  ...  engine_size  fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  highway_mpg    price
0          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  ...          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  13495.0
1          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  ...          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  16500.0
2          1                NaN  alfa-romero        gas         std        two    hatchback           rwd            front        94.5  ...          152         mpfi  2.68    3.47                9.0       154.0    5000.0        19           26  16500.0
3          2              164.0         audi        gas         std       four        sedan           fwd            front        99.8  ...          109         mpfi  3.19    3.40               10.0       102.0    5500.0        24           30  13950.0
4          2              164.0         audi        gas         std       four        sedan           4wd            front        99.4  ...          136         mpfi  3.19    3.40                8.0       115.0    5500.0        18           22  17450.0
    
  

5 rows × 26 columns



In [4]:

    
df.shape









    Out[4]:





(205, 26)



In [2]:

    
df.dtypes









    Out[2]:





symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object



In [3]:

    
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()









    Out[3]:







  
    
      
          make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
1  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
2  alfa-romero        gas         std        two    hatchback           rwd            front         ohcv            six         mpfi
3         audi        gas         std       four        sedan           fwd            front          ohc           four         mpfi
4         audi        gas         std       four        sedan           4wd            front          ohc           five         mpfi



In [ ]:

Exercise 13.1

Does the data set contain missing values? If so, replace them using one of the methods explained in class.
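One possible starting point is sketched below. It counts missing values per column, drops rows that lack the target, then fills numeric columns with the median and text columns with the most frequent value. This imputation strategy is an assumption, since the methods explained in class are not listed here. The `fill_missing` helper and the `toy` frame are hypothetical; the toy values are made up so the sketch runs without downloading the data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `df`, for illustration only
toy = pd.DataFrame({
    "normalized_losses": [np.nan, np.nan, 164.0, 164.0, 158.0],
    "num_doors": ["two", None, "four", "four", "four"],
    "price": [13495.0, 16500.0, 13950.0, np.nan, 17710.0],
})


def fill_missing(frame, target="price"):
    """Drop rows without a target, then impute the remaining gaps.

    Numeric columns get the column median, other columns get the mode.
    """
    out = frame.dropna(subset=[target]).copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Number of missing values in each column before imputation
print(toy.isna().sum())

clean = fill_missing(toy)
print(clean.isna().sum())
```

On the real data the same helper could be called as `fill_missing(df)`. Dropping rows with a missing `price` rather than imputing them is a design choice: an imputed target would otherwise be used for both training and evaluation.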



In [ ]:

Exercise 13.2

Split the data into training and testing sets

Train a Random Forest Regressor to predict the price of a car using the numeric features
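A minimal sketch of these two steps with scikit-learn follows. The `fit_numeric_forest` helper, the 70/30 split, the number of trees, and RMSE as the metric are all assumptions chosen for illustration, and the `toy` frame is synthetic so the sketch runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_numeric_forest(frame, target="price", seed=0):
    """Fit a random forest on the numeric columns and return (model, test RMSE)."""
    # Keep numeric columns only; dropna() is a safeguard in case gaps remain
    numeric = frame.select_dtypes(include="number").dropna()
    X = numeric.drop(columns=[target])
    y = numeric[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return model, rmse


# Synthetic stand-in for `df`; these are not real car data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "engine_size": rng.integers(70, 300, size=60),
    "horsepower": rng.integers(50, 250, size=60).astype(float),
    "make": ["audi"] * 60,  # non-numeric column, ignored by the helper
})
toy["price"] = 100.0 * toy["engine_size"] + rng.normal(0, 500, size=60)

model, rmse = fit_numeric_forest(toy)
print(f"test RMSE: {rmse:.1f}")
```

With the real data the call would be `fit_numeric_forest(df)` once missing values have been handled. Fixing `random_state` keeps the split and the forest reproducible, which matters when the score is later compared against other encodings.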



In [ ]:

Exercise 13.3

Create dummy variables for the categorical features

Train a Random Forest Regressor and compare
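The dummy-variable step can be sketched with `pd.get_dummies`. The `toy` frame below is a hypothetical stand-in that reuses a few column names from the data set; its rows are illustrative only.

```python
import pandas as pd

# Illustrative stand-in for `df` with two categorical and two numeric columns
toy = pd.DataFrame({
    "fuel_type": ["gas", "gas", "diesel", "gas"],
    "drive_wheels": ["rwd", "fwd", "4wd", "fwd"],
    "engine_size": [130, 109, 136, 152],
    "price": [13495.0, 13950.0, 17450.0, 16500.0],
})

# Columns to expand into one indicator column per category
cat_cols = ["fuel_type", "drive_wheels"]
encoded = pd.get_dummies(toy, columns=cat_cols, dtype=int)

print(encoded.columns.tolist())
```

The resulting all-numeric frame can then be passed to a Random Forest Regressor, using the same split and metric as the numeric-only model so that the two scores are comparable.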



In [ ]:

Exercise 13.4

Apply two other methods of categorical encoding

Compare the results
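Which two methods to use is left open, so the sketch below picks two common ones as an assumption: mapping number words to integers, and label encoding through the pandas `category` dtype. The `word_to_int` mapping is hypothetical and covers only the words assumed to occur; any word not in it would become NaN. The sample values are taken from the first five rows shown above.

```python
import pandas as pd

# Values copied from the first five rows of the object columns
toy = pd.DataFrame({
    "num_cylinders": ["four", "four", "six", "four", "five"],
    "body_style": ["convertible", "convertible", "hatchback", "sedan", "sedan"],
})

# Method 1: the words carry a natural order, so map them to their numeric value
word_to_int = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}
toy["num_cylinders_int"] = toy["num_cylinders"].map(word_to_int)

# Method 2: label encoding, one integer code per category (alphabetical order)
toy["body_style_code"] = toy["body_style"].astype("category").cat.codes

print(toy)
```

Label encoding imposes an arbitrary order on `body_style`. Tree-based models such as random forests tolerate this reasonably well, but it is worth keeping in mind when comparing against the dummy-variable results.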



In [ ]:
