Guide To Encoding Categorical Values in Python

Supporting notebook for article on Practical Business Python.

Import the pandas, scikit-learn, numpy and category_encoders libraries.



In [1]:

    
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelBinarizer, LabelEncoder

import category_encoders as ce

Define the column headers, since the data file does not contain a header row.



In [2]:

    
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style",
           "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", 
           "compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

Read in the data from the URL, add the headers, and convert ? to NaN values.



In [3]:

    
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?" )



In [4]:

    
df.head()









    Out[4]:






  
    
      
   symboling  normalized_losses         make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  wheel_base  ...
0          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  ...
1          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  ...
2          1                NaN  alfa-romero        gas         std        two    hatchback           rwd            front        94.5  ...
3          2              164.0         audi        gas         std       four        sedan           fwd            front        99.8  ...
4          2              164.0         audi        gas         std       four        sedan           4wd            front        99.4  ...

   engine_size  fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  highway_mpg    price
0          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  13495.0
1          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  16500.0
2          152         mpfi  2.68    3.47                9.0       154.0    5000.0        19           26  16500.0
3          109         mpfi  3.19    3.40               10.0       102.0    5500.0        24           30  13950.0
4          136         mpfi  3.19    3.40                8.0       115.0    5500.0        18           22  17450.0
    
  

5 rows × 26 columns
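As a minimal sketch of what na_values does in the read_csv call, the example below parses two abbreviated rows from memory instead of downloading the file. The three-column layout and the `raw` and `sample` names are assumptions made for illustration; only the handling of ? mirrors the call above.

```python
import io

import pandas as pd

# Two abbreviated, illustrative rows; "?" marks a missing value in the raw file.
raw = "3,?,alfa-romero\n2,164,audi\n"

sample = pd.read_csv(
    io.StringIO(raw),
    header=None,  # the file has no header row
    names=["symboling", "normalized_losses", "make"],
    na_values="?",  # treat "?" as missing
)

# The "?" entry is parsed as NaN, so the column can be read as a float column.
print(sample["normalized_losses"].isna().tolist())
print(sample["normalized_losses"].dtype)
```

Without na_values, the ? would most likely force the whole column to be read as strings, which is why normalized_losses shows up as float64 in the dtypes listing.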

Look at the data types contained in the dataframe



In [5]:

    
df.dtypes









    Out[5]:





symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

Create a copy of the data with only the object columns.



In [6]:

    
obj_df = df.select_dtypes(include=['object']).copy()



In [7]:

    
obj_df.head()









    Out[7]:






  
    
      
          make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
1  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
2  alfa-romero        gas         std        two    hatchback           rwd            front         ohcv            six         mpfi
3         audi        gas         std       four        sedan           fwd            front          ohc           four         mpfi
4         audi        gas         std       four        sedan           4wd            front          ohc           five         mpfi

Check for null values in the data



In [8]:

    
obj_df[obj_df.isnull().any(axis=1)]









    Out[8]:






  
    
      
     make  fuel_type  aspiration  num_doors  body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
27  dodge        gas       turbo        NaN       sedan           fwd            front          ohc           four         mpfi
63  mazda     diesel         std        NaN       sedan           fwd            front          ohc           four          idi

Since num_doors is the only column that contains null values, look at which values it currently takes.



In [9]:

    
obj_df["num_doors"].value_counts()









    Out[9]:





four    114
two      89
Name: num_doors, dtype: int64

We will fill in the missing num_doors values with the most common value, four.



In [10]:

    
obj_df = obj_df.fillna({"num_doors": "four"})



In [11]:

    
obj_df[obj_df.isnull().any(axis=1)]









    Out[11]:






  
    
      
make  fuel_type  aspiration  num_doors  body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
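The fill value above is typed in by hand after reading the value_counts output. A small sketch of deriving it from the data instead is shown below; the toy frame and the `most_common` name are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the num_doors column, with one missing entry.
frame = pd.DataFrame({"num_doors": ["four", "two", None, "four", "two", "four"]})

# mode() ignores missing values by default and returns the most frequent value(s);
# take the first one in case of ties.
most_common = frame["num_doors"].mode()[0]

filled = frame.fillna({"num_doors": most_common})
print(most_common)
print(filled["num_doors"].tolist())
```

This keeps the notebook correct if the underlying data changes, at the cost of hiding the chosen value from the reader.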

Encoding values using pandas

Convert the num_cylinders and num_doors values to numbers



In [12]:

    
obj_df["num_cylinders"].value_counts()









    Out[12]:





four      159
six        24
five       11
eight       5
two         4
twelve      1
three       1
Name: num_cylinders, dtype: int64



In [13]:

    
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three":3 }}



In [14]:

    
obj_df.replace(cleanup_nums, inplace=True)



In [15]:

    
obj_df.head()









    Out[15]:






  
    
      
          make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
0  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi
1  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi
2  alfa-romero        gas         std          2    hatchback           rwd            front         ohcv              6         mpfi
3         audi        gas         std          4        sedan           fwd            front          ohc              4         mpfi
4         audi        gas         std          4        sedan           4wd            front          ohc              5         mpfi

Check the data types to make sure they are coming through as numbers



In [16]:

    
obj_df.dtypes









    Out[16]:





make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object
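An alternative to the nested replace dictionary, sketched here under the assumption that both columns share one word-to-number vocabulary, is to map each column through a single lookup. The toy frame and the `word_to_number` name are illustrative only.

```python
import pandas as pd

# Toy stand-in for the two word-valued columns.
frame = pd.DataFrame({
    "num_doors": ["two", "four", "four"],
    "num_cylinders": ["four", "six", "twelve"],
})

# One shared lookup table for number words (hypothetical helper name).
word_to_number = {"two": 2, "three": 3, "four": 4, "five": 5,
                  "six": 6, "eight": 8, "twelve": 12}

for column in ["num_doors", "num_cylinders"]:
    # map() returns NaN for any label missing from the lookup.
    frame[column] = frame[column].map(word_to_number)

print(frame)
```

Unlike replace, which leaves unmatched labels untouched, map turns them into NaN, so an unexpected spelling shows up as a missing value rather than a stray string.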

One approach to encoding labels is to convert the values to a pandas category



In [17]:

    
obj_df["body_style"].value_counts()









    Out[17]:





sedan          96
hatchback      70
wagon          25
hardtop         8
convertible     6
Name: body_style, dtype: int64



In [18]:

    
obj_df["body_style"] = obj_df["body_style"].astype('category')



In [19]:

    
obj_df.dtypes









    Out[19]:





make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

We can assign the category codes to a new column so we have a clean numeric representation



In [20]:

    
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes



In [21]:

    
obj_df.head()









    Out[21]:






  
    
      
          make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system  body_style_cat
0  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi               0
1  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi               0
2  alfa-romero        gas         std          2    hatchback           rwd            front         ohcv              6         mpfi               2
3         audi        gas         std          4        sedan           fwd            front          ohc              4         mpfi               3
4         audi        gas         std          4        sedan           4wd            front          ohc              5         mpfi               3



In [22]:

    
obj_df.dtypes









    Out[22]:





make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
body_style_cat         int8
dtype: object
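To see how the category codes relate to the labels, the sketch below uses a short toy Series (an assumption, not the notebook's data). It relies on the default behaviour for string data, where categories are ordered alphabetically and missing values receive the code -1.

```python
import pandas as pd

# Toy stand-in for body_style, including one missing value.
body_style = pd.Series(["sedan", "convertible", None, "hatchback"]).astype("category")

# Codes index into the sorted list of categories; missing values become -1.
codes = body_style.cat.codes

# Build a lookup from code back to label (hypothetical helper name).
code_to_label = dict(enumerate(body_style.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Keeping a lookup like `code_to_label` around makes it possible to translate the numeric column back into readable labels later.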

To do one-hot encoding, use pandas get_dummies.



In [23]:

    
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()









    Out[23]:






  
    
      
          make  fuel_type  aspiration  num_doors   body_style  engine_location  engine_type  num_cylinders  fuel_system  body_style_cat
0  alfa-romero        gas         std          2  convertible            front         dohc              4         mpfi               0
1  alfa-romero        gas         std          2  convertible            front         dohc              4         mpfi               0
2  alfa-romero        gas         std          2    hatchback            front         ohcv              6         mpfi               2
3         audi        gas         std          4        sedan            front          ohc              4         mpfi               3
4         audi        gas         std          4        sedan            front          ohc              5         mpfi               3

   drive_wheels_4wd  drive_wheels_fwd  drive_wheels_rwd
0               0.0               0.0               1.0
1               0.0               0.0               1.0
2               0.0               0.0               1.0
3               0.0               1.0               0.0
4               1.0               0.0               0.0

get_dummies has options for selecting the columns and adding prefixes to make the resulting data easier to understand.



In [24]:

    
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()









    Out[24]:






  
    
      
          make  fuel_type  aspiration  num_doors  engine_location  engine_type  num_cylinders  fuel_system  body_style_cat
0  alfa-romero        gas         std          2            front         dohc              4         mpfi               0
1  alfa-romero        gas         std          2            front         dohc              4         mpfi               0
2  alfa-romero        gas         std          2            front         ohcv              6         mpfi               2
3         audi        gas         std          4            front          ohc              4         mpfi               3
4         audi        gas         std          4            front          ohc              5         mpfi               3

   body_convertible  body_hardtop  body_hatchback  body_sedan  body_wagon  drive_4wd  drive_fwd  drive_rwd
0               1.0           0.0             0.0         0.0         0.0        0.0        0.0        1.0
1               1.0           0.0             0.0         0.0         0.0        0.0        0.0        1.0
2               0.0           0.0             1.0         0.0         0.0        0.0        0.0        1.0
3               0.0           0.0             0.0         1.0         0.0        0.0        1.0        0.0
4               0.0           0.0             0.0         1.0         0.0        1.0        0.0        0.0
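The outputs above show the indicator columns as 0.0 and 1.0, which reflects the pandas version used when the notebook was run. As a hedged sketch on a toy frame (the data is an assumption), the dtype argument can be passed so the result is integer-valued regardless of the version's default.

```python
import pandas as pd

# Toy frame with one column to encode and one to leave alone.
frame = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "drive_wheels": ["fwd", "rwd", "4wd"],
})

# One indicator column per distinct value, named <prefix>_<value>.
dummies = pd.get_dummies(frame, columns=["drive_wheels"], prefix=["drive"], dtype=int)

print(dummies.columns.tolist())
print(dummies[["drive_4wd", "drive_fwd", "drive_rwd"]].values.tolist())
```

Recent pandas releases return boolean indicator columns unless a dtype is given, so being explicit avoids surprises when the same notebook runs on different installations.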

Another approach to encoding values is to select an attribute and convert it to a true/false flag, stored here as 1 or 0. In this case, we can check whether an engine is an overhead cam (OHC) type, meaning that its engine_type value contains "ohc".



In [25]:

    
obj_df["engine_type"].value_counts()









    Out[25]:





ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

Use np.where and the str accessor to do this in one efficient line



In [26]:

    
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)



In [27]:

    
obj_df[["make", "engine_type", "OHC_Code"]].head(20)









    Out[27]:






  
    
      
           make  engine_type  OHC_Code
 0  alfa-romero         dohc         1
 1  alfa-romero         dohc         1
 2  alfa-romero         ohcv         1
 3         audi          ohc         1
 4         audi          ohc         1
 5         audi          ohc         1
 6         audi          ohc         1
 7         audi          ohc         1
 8         audi          ohc         1
 9         audi          ohc         1
10          bmw          ohc         1
11          bmw          ohc         1
12          bmw          ohc         1
13          bmw          ohc         1
14          bmw          ohc         1
15          bmw          ohc         1
16          bmw          ohc         1
17          bmw          ohc         1
18    chevrolet            l         0
19    chevrolet          ohc         1
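The substring test marks every engine type that contains "ohc", so dohc, ohcf and ohcv are flagged along with plain ohc. The sketch below repeats the idea on a toy Series (an assumption) and adds na=False, which is one way to keep missing values from propagating into the condition.

```python
import numpy as np
import pandas as pd

# Toy stand-in for engine_type, including a missing value.
engine_type = pd.Series(["dohc", "ohcv", "l", "rotor", "ohc", None])

# na=False makes missing entries count as "no match" instead of NaN.
has_ohc = engine_type.str.contains("ohc", na=False)

# 1 where the label contains "ohc", 0 everywhere else.
ohc_code = np.where(has_ohc, 1, 0)

print(ohc_code.tolist())
```

The engine_type column in this data set has no missing values, so the notebook's one-liner works as written; the extra argument only matters for columns that do.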

Encoding Values Using Scikit-learn

Instantiate the LabelEncoder



In [28]:

    
lb_make = LabelEncoder()



In [29]:

    
obj_df["make_code"] = lb_make.fit_transform(obj_df["make"])



In [30]:

    
obj_df[["make", "make_code"]].head(11)









    Out[30]:






  
    
      
           make  make_code
 0  alfa-romero          0
 1  alfa-romero          0
 2  alfa-romero          0
 3         audi          1
 4         audi          1
 5         audi          1
 6         audi          1
 7         audi          1
 8         audi          1
 9         audi          1
10          bmw          2
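A fitted LabelEncoder keeps the mapping it learned, which is useful for decoding the integers later. The following sketch uses a short toy list of makes (an assumption) to show the classes_ attribute and inverse_transform.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the make column.
makes = ["audi", "bmw", "alfa-romero", "audi"]

encoder = LabelEncoder()
codes = encoder.fit_transform(makes)

# classes_ holds the sorted labels; a label's position is its code.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the codes follow the sorted label order, they carry no meaning beyond identity, which is worth keeping in mind before feeding them to a model that treats numbers as ordered.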

To accomplish something similar to pandas get_dummies, use LabelBinarizer



In [31]:

    
lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(obj_df["body_style"])

The results are an array that needs to be converted to a DataFrame



In [32]:

    
lb_results









    Out[32]:





array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       ..., 
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0]])



In [33]:

    
pd.DataFrame(lb_results, columns=lb_style.classes_).head()









    Out[33]:






  
    
      
   convertible  hardtop  hatchback  sedan  wagon
0            1        0          0      0      0
1            1        0          0      0      0
2            0        0          1      0      0
3            0        0          0      1      0
4            0        0          0      1      0
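When the binarized array is turned into a DataFrame, it can help to carry over the original index so the new columns line up with the source rows. This is a small sketch on toy data (an assumption), not a step taken in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for body_style with a non-default index.
styles = pd.Series(["sedan", "wagon", "hatchback", "sedan"],
                   index=[10, 11, 12, 13], name="body_style")

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(styles)

# Reuse the source index so the result can be joined back row by row.
encoded_df = pd.DataFrame(encoded, columns=binarizer.classes_, index=styles.index)

print(encoded_df)
```

One caveat: with exactly two distinct labels, LabelBinarizer returns a single column rather than two, so the columns=classes_ pattern only fits the case of three or more labels.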

Advanced Encoding

category_encoders library



In [34]:

    
# Get a new clean dataframe
obj_df = df.select_dtypes(include=['object']).copy()



In [35]:

    
obj_df.head()









    Out[35]:






  
    
      
          make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
1  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
2  alfa-romero        gas         std        two    hatchback           rwd            front         ohcv            six         mpfi
3         audi        gas         std       four        sedan           fwd            front          ohc           four         mpfi
4         audi        gas         std       four        sedan           4wd            front          ohc           five         mpfi

Try out the Backward Difference Encoder on the engine_type column



In [36]:

    
encoder = ce.backward_difference.BackwardDifferenceEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)









    Out[36]:





BackwardDifferenceEncoder(cols=['engine_type'], drop_invariant=False,
             return_df=True, verbose=0)



In [37]:

    
encoder.transform(obj_df).iloc[:,0:7].head()









    Out[37]:






  
    
      
   col_engine_type_0  col_engine_type_1  col_engine_type_2  col_engine_type_3  col_engine_type_4  col_engine_type_5  col_engine_type_6
0                1.0           0.142857           0.285714           0.428571           0.571429           0.714286          -0.142857
1                1.0           0.142857           0.285714           0.428571           0.571429           0.714286          -0.142857
2                1.0           0.142857           0.285714           0.428571           0.571429           0.714286           0.857143
3                1.0           0.142857          -0.714286          -0.571429          -0.428571          -0.285714          -0.142857
4                1.0           0.142857          -0.714286          -0.571429          -0.428571          -0.285714          -0.142857
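The fractions in this output are consistent with the standard backward difference contrast matrix for seven levels, with col_engine_type_0 acting as a constant intercept column. The sketch below builds that matrix with NumPy alone. The function name is hypothetical, and this is not the category_encoders implementation, which may order levels and name columns differently between versions.

```python
import numpy as np


def backward_difference_contrasts(k):
    """Return the k x (k - 1) backward difference contrast matrix (illustrative)."""
    contrasts = np.empty((k, k - 1))
    for j in range(1, k):  # j is the 1-based contrast column
        contrasts[:j, j - 1] = -(k - j) / k  # levels 1..j
        contrasts[j:, j - 1] = j / k         # levels j+1..k
    return contrasts


matrix = backward_difference_contrasts(7)

# Row for the second level; compare with the non-intercept values
# shown for rows 3 and 4 in the output above.
print(np.round(matrix[1], 6))
```

Each column compares the mean of one level with the mean of the level before it, which is why every column of the matrix sums to zero.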

Another approach is to use a polynomial encoding.



In [38]:

    
encoder = ce.polynomial.PolynomialEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)









    Out[38]:





PolynomialEncoder(cols=['engine_type'], drop_invariant=False, return_df=True,
         verbose=0)



In [39]:

    
encoder.transform(obj_df).iloc[:,0:7].head()









    Out[39]:






  
    
      
   col_engine_type_0  col_engine_type_1  col_engine_type_2  col_engine_type_3  col_engine_type_4  col_engine_type_5  col_engine_type_6
0                1.0      -5.669467e-01       5.455447e-01      -4.082483e-01           0.241747      -1.091089e-01           0.032898
1                1.0      -5.669467e-01       5.455447e-01      -4.082483e-01           0.241747      -1.091089e-01           0.032898
2                1.0       3.779645e-01       3.970680e-17      -4.082483e-01          -0.564076      -4.364358e-01          -0.197386
3                1.0       1.347755e-17      -4.364358e-01       1.528598e-17           0.483494       8.990141e-18          -0.657952
4                1.0       1.347755e-17      -4.364358e-01       1.528598e-17           0.483494       8.990141e-18          -0.657952
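The values in this output look like orthogonal polynomial contrasts for seven equally spaced levels: linear, quadratic, cubic and so on, each scaled to unit length. As a rough sketch under that assumption, the columns can be rebuilt with a Gram-Schmidt pass in NumPy. The function name is hypothetical and this is not the library's own code.

```python
import numpy as np


def polynomial_contrasts(k):
    """Orthonormal polynomial contrasts for k equally spaced levels (illustrative)."""
    x = np.arange(1, k + 1, dtype=float)
    x -= x.mean()  # centre the level scores
    basis = [np.ones(k) / np.sqrt(k)]  # constant term, dropped at the end
    for degree in range(1, k):
        v = x ** degree
        for b in basis:  # remove components along all lower-degree terms
            v = v - (v @ b) * b
        basis.append(v / np.linalg.norm(v))
    return np.column_stack(basis[1:])


contrasts = polynomial_contrasts(7)

# Linear, quadratic and cubic terms for the first level; compare with
# col_engine_type_1 to col_engine_type_3 in row 0 of the output above.
print(np.round(contrasts[0, :3], 7))
```

The near-zero entries such as 1.347755e-17 in the output are floating-point noise around an exact zero, which is what the middle level of a linear or cubic contrast should be.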


