Supporting notebook for article on Practical Business Python.
Import the pandas, scikit-learn, numpy and category_encoders libraries.
In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
import category_encoders as ce
The data file does not contain a header row, so we need to define the column names ourselves
In [2]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style",
"drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke",
"compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]
Read in the data from the URL, add the headers and convert ? to NaN values
In [3]:
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
header=None, names=headers, na_values="?" )
In [4]:
df.head()
Out[4]:
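As a small illustration of what na_values does, here is a minimal sketch on a hypothetical three-row sample (the rows and the shortened column list are made up for the example, not taken from the real file). Because the "?" markers are read as NaN, the column can be parsed as a number instead of staying as text.

```python
import io

import pandas as pd

# Hypothetical headerless sample in the same style as the real file
sample = io.StringIO("3,?,alfa-romero\n1,164,audi\n2,?,bmw\n")

toy = pd.read_csv(sample, header=None,
                  names=["symboling", "normalized_losses", "make"],
                  na_values="?")

# The "?" entries are now NaN, so the column is numeric rather than object
print(toy)
print(toy.dtypes)
```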
Look at the data types contained in the dataframe
In [5]:
df.dtypes
Out[5]:
Create a copy of the data with only the object columns.
In [6]:
obj_df = df.select_dtypes(include=['object']).copy()
In [7]:
obj_df.head()
Out[7]:
Check for null values in the data
In [8]:
obj_df[obj_df.isnull().any(axis=1)]
Out[8]:
Since the num_doors column contains the null values, look at which values it currently takes
In [9]:
obj_df["num_doors"].value_counts()
Out[9]:
We will fill in the doors value with the most common element - four.
In [10]:
obj_df = obj_df.fillna({"num_doors": "four"})
In [11]:
obj_df[obj_df.isnull().any(axis=1)]
Out[11]:
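The fill value does not have to be typed in by hand. A minimal sketch, assuming a small made-up stand-in for the num_doors column, looks up the most common value with mode() and uses that instead:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the num_doors column, with two missing entries
doors = pd.DataFrame({"num_doors": ["four", "two", "four", np.nan, "four", np.nan]})

# mode() ignores NaN and returns a Series (ties are possible); take the first entry
most_common = doors["num_doors"].mode().iloc[0]

doors = doors.fillna({"num_doors": most_common})
print(most_common)
print(doors["num_doors"].isnull().sum())
```

If two values tie for most common, mode() returns both, so taking the first entry is an arbitrary choice in that case.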
Convert the num_cylinders and num_doors values to numbers
In [12]:
obj_df["num_cylinders"].value_counts()
Out[12]:
In [13]:
cleanup_nums = {"num_doors": {"four": 4, "two": 2},
"num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
"two": 2, "twelve": 12, "three":3 }}
In [14]:
obj_df.replace(cleanup_nums, inplace=True)
In [15]:
obj_df.head()
Out[15]:
In [16]:
obj_df.dtypes
Out[16]:
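A different way to get the same numbers is Series.map, shown here as an alternative sketch rather than the approach used in this notebook. The toy frame and the word_to_num name are assumptions for the example. One difference worth knowing: map turns any value missing from the dictionary into NaN, while replace leaves it unchanged.

```python
import pandas as pd

# Hypothetical toy frame with the two text-number columns
toy = pd.DataFrame({"num_doors": ["four", "two", "four"],
                    "num_cylinders": ["four", "six", "twelve"]})

# One shared lookup table (illustrative name) for both columns
word_to_num = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}

for col in ["num_doors", "num_cylinders"]:
    # Each word is looked up in the dictionary; unmatched words would become NaN
    toy[col] = toy[col].map(word_to_num)

print(toy)
print(toy.dtypes)
```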
One approach to encoding labels is to convert the values to a pandas category
In [17]:
obj_df["body_style"].value_counts()
Out[17]:
In [18]:
obj_df["body_style"] = obj_df["body_style"].astype('category')
In [19]:
obj_df.dtypes
Out[19]:
We can assign the category codes to a new column so we have a clean numeric representation
In [20]:
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
In [21]:
obj_df.head()
Out[21]:
In [22]:
obj_df.dtypes
Out[22]:
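To see how the codes relate to the labels, here is a minimal sketch on a made-up Series. The codes are positions in the categories index, which is sorted alphabetically when the category type is inferred from strings, so the mapping can be rebuilt at any time.

```python
import pandas as pd

# Hypothetical body styles; the category dtype is inferred from the values
styles = pd.Series(["sedan", "hatchback", "convertible", "sedan"], dtype="category")

# Categories are sorted, and each code is an index into that list
print(list(styles.cat.categories))
codes = styles.cat.codes
print(codes.tolist())

# Build a lookup from code back to label
code_to_label = dict(enumerate(styles.cat.categories))
print(code_to_label)
```

Keep in mind that the numbers carry no real ordering; they only reflect alphabetical position.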
To do one-hot encoding, use pandas get_dummies
In [23]:
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
Out[23]:
get_dummies has options for selecting the columns and adding prefixes to make the resulting data easier to understand.
In [24]:
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()
Out[24]:
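A minimal sketch on a made-up frame shows how the column names are built from the prefix and the category value. The dtype=int argument is an addition for the example: recent pandas versions return boolean columns by default, and passing dtype keeps the output as 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame; only drive_wheels is encoded
toy = pd.DataFrame({"make": ["audi", "bmw", "subaru", "honda"],
                    "drive_wheels": ["rwd", "fwd", "4wd", "fwd"]})

encoded = pd.get_dummies(toy, columns=["drive_wheels"], prefix=["drive"], dtype=int)

# The original column is dropped and replaced by one column per value
print(encoded.columns.tolist())
print(encoded)
```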
In [25]:
obj_df["engine_type"].value_counts()
Out[25]:
To flag whether each engine_type is an overhead cam (OHC) variant, use np.where and the str accessor to do this in one efficient line
In [26]:
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
In [27]:
obj_df[["make", "engine_type", "OHC_Code"]].head(20)
Out[27]:
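One detail to watch with this pattern is missing data. The sketch below uses a made-up Series that includes a missing value and passes na=False so the missing entry is coded as 0; regex=False is also an addition here, to treat "ohc" as a plain substring.

```python
import numpy as np
import pandas as pd

# Hypothetical engine types, including one missing value
engines = pd.Series(["dohc", "ohcv", "rotor", "l", "ohc", None])

# na=False makes the missing entry count as "no match"
is_ohc = engines.str.contains("ohc", regex=False, na=False)
flag = np.where(is_ohc, 1, 0)

print(flag.tolist())
```

The engine_type column in this notebook has no nulls, so the extra arguments only matter when the same line is reused on less tidy data.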
Instantiate the LabelEncoder
In [28]:
lb_make = LabelEncoder()
In [29]:
obj_df["make_code"] = lb_make.fit_transform(obj_df["make"])
In [30]:
obj_df[["make", "make_code"]].head(11)
Out[30]:
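The fitted encoder keeps the mapping, so the codes can be turned back into labels later. A minimal sketch on a made-up list of makes (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical makes
makes = pd.Series(["audi", "bmw", "audi", "volvo"])

enc = LabelEncoder()
codes = enc.fit_transform(makes)

# classes_ holds the sorted labels; each code is a position in that array
print(list(enc.classes_))
print(codes.tolist())

# inverse_transform recovers the original labels from the codes
restored = enc.inverse_transform(codes)
print(restored.tolist())
```

The scikit-learn documentation describes LabelEncoder as intended for target labels; for input features it points to OrdinalEncoder, which is worth considering in a full pipeline.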
To accomplish something similar to pandas get_dummies, use LabelBinarizer
In [31]:
lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(obj_df["body_style"])
The results are an array that needs to be converted to a DataFrame
In [32]:
lb_results
Out[32]:
In [33]:
pd.DataFrame(lb_results, columns=lb_style.classes_).head()
Out[33]:
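Here is the same conversion as a minimal sketch on a made-up Series, so the shape of the result is easy to check by eye. Passing the original index is an addition for the example; it keeps the new frame aligned with the source data if the two are joined later.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical body styles with three distinct values
styles = pd.Series(["sedan", "wagon", "sedan", "hatchback"])

lb = LabelBinarizer()
onehot = pd.DataFrame(lb.fit_transform(styles),
                      columns=lb.classes_,
                      index=styles.index)

# One column per class, in sorted order, with a single 1 per row
print(onehot)
```

Note that with exactly two classes LabelBinarizer returns a single column rather than two, so the columns=lb.classes_ step only works as written for three or more classes.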
category_encoders library
In [34]:
# Get a new clean dataframe
obj_df = df.select_dtypes(include=['object']).copy()
In [35]:
obj_df.head()
Out[35]:
Try out the Backward Difference Encoder on the engine_type column
In [36]:
encoder = ce.backward_difference.BackwardDifferenceEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)
Out[36]:
In [37]:
encoder.transform(obj_df).iloc[:,0:7].head()
Out[37]:
Another approach is to use a polynomial encoding.
In [38]:
encoder = ce.polynomial.PolynomialEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)
Out[38]:
In [39]:
encoder.transform(obj_df).iloc[:,0:7].head()
Out[39]: