We will state our imports, as usual. You will find that we are using the preprocessing module of sklearn, in addition to pandas and numpy.
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn import preprocessing as pp
Let's start by loading the data. For convenience, we will make all column names upper case, so we don't have to worry about capitalization throughout the analysis.
In [2]:
def read_data(path, with_preview=False):
    data = pd.read_csv(path)
    data.columns = data.columns.str.upper()
    return data

data = read_data("../data/shot_logs.csv")
As in many real-world problems, we have a column that holds information about more than one feature.
To use these features throughout the example and the practice exercise, we need to extract each of them into a separate column.
In [3]:
def extract_features(string):
    (date, string) = string.split('-')
    if '@' in string:
        # "A @ B": team A plays away at team B
        (away, home) = string.split('@')
    if 'vs.' in string:
        # "A vs. B": team A hosts team B
        (home, away) = string.split('vs.')
    date = date.strip()
    away = away.strip()
    home = home.strip()
    return (date, away, home)

def extract_date(string):
    features = extract_features(string)
    date = features[0]
    date = pd.to_datetime(date)
    return date

def extract_away_team(string):
    features = extract_features(string)
    away = features[1]
    return away

def extract_home_team(string):
    features = extract_features(string)
    home = features[2]
    return home

def expand_matchup_column(data, with_preview=False):
    data['MATCHUP_DATE'] = data['MATCHUP'].apply(extract_date)
    # Datetimes are stored as nanoseconds; dividing by 1e6 gives milliseconds
    data['MATCHUP_DATE'] = data['MATCHUP_DATE'].astype(np.int64) / int(1e6)
    data['MATCHUP_AWAY_TEAM'] = data['MATCHUP'].apply(extract_away_team)
    data['MATCHUP_HOME_TEAM'] = data['MATCHUP'].apply(extract_home_team)
    data = data.drop('MATCHUP', axis=1)
    return data

data = expand_matchup_column(data)
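To see what this kind of splitting does to a single value, here is a minimal, self-contained sketch. The sample string is an assumption about the MATCHUP format (a date, a dash, then two team codes separated by `@` or `vs.`), and `split_matchup` is a hypothetical helper written for illustration, not part of the notebook.

```python
import pandas as pd

def split_matchup(matchup):
    """Split 'DATE - TEAM_A @ TEAM_B' or 'DATE - TEAM_A vs. TEAM_B' into parts."""
    date, teams = matchup.split('-')
    separator = '@' if '@' in teams else 'vs.'
    first, second = (part.strip() for part in teams.split(separator))
    return pd.to_datetime(date.strip()), first, second

# Hypothetical value, made up to illustrate the assumed format.
date, first, second = split_matchup("MAR 04, 2015 - CHA @ BKN")
print(date.date(), first, second)

# A Timestamp stores nanoseconds since the epoch; integer division by 10**6
# turns that into milliseconds, mirroring the conversion in the cell above.
millis = date.value // 10**6
print(millis)
```

Keeping the date as a number makes it usable by estimators that only accept numeric input, at the cost of readability.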
As you can see below, there are plenty of categorical features in our dataset.
We will grab one of them, the MATCHUP_AWAY_TEAM, and perform different types of encoding on it, so you can get a sense of what they do.
In [4]:
display(data.head(n=3))
data.columns.values
Out[4]:
One of the categorical features in our dataset is the MATCHUP_AWAY_TEAM.
Let's label encode it using factorization, and preview the results in a new dataframe with two columns: the old nominal values and the new, numeric ones.
We will sample 5 rows from the resulting dataframe to interpret the results.
In [5]:
def factorize_column(data, colname):
    categorical_feature = data[colname]
    categorical_feature_encoded = pd.factorize(categorical_feature)[0]
    return categorical_feature_encoded

def compare_with_original_column(data, colname, encoded_column):
    comparison = pd.DataFrame()
    comparison['NOMINAL'] = data[colname]
    comparison['ORDINAL'] = encoded_column
    comparison_sample = comparison.sample(n=5)
    return comparison_sample

encoded_column = factorize_column(data, 'MATCHUP_AWAY_TEAM')
compare_with_original_column(data, 'MATCHUP_AWAY_TEAM', encoded_column)
Out[5]:
As we can see, all our nominal labels are now replaced by numeric values.
Now, we will do the exact same thing, this time using sklearn's LabelEncoder.
In [6]:
def label_encode_column(data, colname):
    categorical_feature = data[colname]
    encoder = pp.LabelEncoder()
    categorical_feature_encoded = encoder.fit_transform(categorical_feature)
    return categorical_feature_encoded

encoded_column = label_encode_column(data, 'MATCHUP_AWAY_TEAM')
compare_with_original_column(data, 'MATCHUP_AWAY_TEAM', encoded_column)
Out[6]:
As we can see, we get similar results: our nominal categorical variable is transformed into a numerical one.
As we know, this is a problem: a numeric encoding implies that teams have a natural order, meaning that Miami is greater than Cleveland, for example.
It also implies that the distances between teams are known, which they are not.
Thus, this is a call for us to use dummy encoding, as we have seen in the lecture!
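A minimal sketch of why the integer codes carry no real meaning, using a made-up column of team abbreviations (the values are assumptions for illustration). `pd.factorize` numbers labels by order of first appearance, while `LabelEncoder` numbers them in sorted order, so the same team ends up with different integers.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

teams = pd.Series(["MIA", "CLE", "MIA", "BOS", "CLE"])  # made-up labels

by_appearance = pd.factorize(teams)[0]              # codes by first appearance
alphabetical = LabelEncoder().fit_transform(teams)  # codes by sorted label

comparison = pd.DataFrame({
    "NOMINAL": teams,
    "FACTORIZE": by_appearance,
    "LABEL_ENCODER": alphabetical,
})
print(comparison)

# The same team gets a different integer under each scheme, so any
# "MIA > CLE" comparison depends on the encoder, not on the data.
```

Under one scheme MIA sorts below CLE and under the other above it, which is exactly the kind of spurious order a distance-based or linear model would pick up.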
As stated above, the away team is not an ordinal feature: it has no natural order, nor known distances.
Therefore we will use dummy encoding to binary encode each level, using 0 or 1 to indicate the absence or presence of some label for each sample.
In [7]:
def dummy_encode_column(data, colname):
    categorical_feature = data[colname]
    number_of_levels = len(categorical_feature.unique())
    dummy_encode_example = pd.get_dummies(categorical_feature, drop_first=True)
    dummy_encode_number_of_cols = dummy_encode_example.shape[1]
    print("There are %s total levels for the categorical variable %s." % (number_of_levels, colname))
    print("There are %s (n-1) columns in the dummy encoded dataframe." % dummy_encode_number_of_cols)
    return dummy_encode_example

dummy_encode_column(data, 'MATCHUP_AWAY_TEAM').sample(n=5)
Out[7]:
Please note that we need to explicitly use drop_first=True to get n-1 (leave one out) columns in the encoded dataframe.
Also, we are using get_dummies directly on a categorical feature containing strings; this is something we will not be able to do in our next example.
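The effect of `drop_first` is easiest to see on a tiny column. The sketch below uses made-up team abbreviations (an assumption for illustration) and compares the full encoding with the leave-one-out one.

```python
import pandas as pd

teams = pd.Series(["MIA", "CLE", "MIA", "BOS", "CLE"])  # made-up labels

full = pd.get_dummies(teams)                      # one column per level
reduced = pd.get_dummies(teams, drop_first=True)  # first level dropped

print(list(full.columns))     # all three levels, in sorted order
print(list(reduced.columns))  # the first level in sorted order is gone

# A row of all zeros in `reduced` now stands for the dropped level.
is_dropped_level = reduced.sum(axis=1) == 0
print(teams[is_dropped_level])
```

No information is lost: the dropped level is implied whenever every remaining column is zero.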
Let's try to apply the one hot encoding similarly to what we did with the label encoding above.
Try to uncomment the line below.
In [8]:
def one_hot_encode_column(data, colname):
    categorical_feature = data[colname]
    number_of_levels = len(np.unique(categorical_feature))
    encoder = pp.OneHotEncoder()
    one_hot_encode_feature = encoder.fit_transform(categorical_feature)
    return one_hot_encode_feature

# one_hot_encode_column(data, 'MATCHUP_AWAY_TEAM')
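Oops! Could not convert string to float! Looks like we will have to label encode our variable first, to make it numeric. (This is the error raised by scikit-learn releases before 0.20; later releases accept string categories directly and instead complain that the input is one-dimensional.)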
In [9]:
def one_hot_encode_column(data, colname):
    data_to_encode = pd.DataFrame()
    data_to_encode[colname] = label_encode_column(data, colname)
    number_of_levels = len(data_to_encode[colname].unique())
    # The frame holds a single column, so the encoder's defaults already
    # encode it; the categorical_features argument is not needed (and was
    # removed from later scikit-learn releases)
    encoder = pp.OneHotEncoder()
    one_hot_encoded_feature = encoder.fit_transform(data_to_encode).toarray()
    one_hot_encoded_feature = pd.DataFrame(one_hot_encoded_feature)
    return one_hot_encoded_feature

one_hot_encode_column(data, 'MATCHUP_AWAY_TEAM').sample(n=5)
Out[9]:
Alternatively, we can use a label binarizer, which returns an array of binary variables indicating whether each label is present for each observation.
In [10]:
def binarize_labels_column(data, colname):
    categorical_feature = data[colname]
    encoder = pp.LabelBinarizer()
    return encoder.fit_transform(categorical_feature)

binarize_labels_column(data, 'MATCHUP_AWAY_TEAM')
Out[10]:
Converted to a dataframe, the output is similar to that of the one hot encoder above.
Such a binarizer can be useful to apply a so-called one-vs-all scheme, which consists of learning one model per class, when combined with sklearn's pipelines.
In [11]:
pd.DataFrame(binarize_labels_column(data, 'MATCHUP_AWAY_TEAM')).sample(n=5)
Out[11]:
Please note that, unlike pandas' get_dummies, LabelBinarizer offers no automatic way to apply a leave-one-out approach (n-1 features), which may lead to collinearity problems.
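One way to get the n-1 columns by hand is to drop a column from the binarizer's output. The sketch below does this on a made-up column of team abbreviations (an assumption for illustration) and checks the result against pandas' `get_dummies` with `drop_first=True`.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

teams = pd.Series(["MIA", "CLE", "MIA", "BOS", "CLE"])  # made-up labels

encoder = LabelBinarizer()
binarized = encoder.fit_transform(teams)  # shape (5, 3), one column per class

# Drop the first class by hand to keep n-1 columns. With only two classes,
# LabelBinarizer already returns a single column, so no slicing is needed.
reduced = pd.DataFrame(binarized[:, 1:], columns=encoder.classes_[1:])

# Compare with pandas' built-in leave-one-out encoding.
expected = pd.get_dummies(teams, drop_first=True).astype(int)
print((reduced.values == expected.values).all())
```

Which class gets dropped is arbitrary; it only changes which level acts as the baseline.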
Let's take two features on very different scales. As discussed in the presentation, many machine learning algorithms rely on distances, which are sensitive to scale.
As a result, features on a bigger scale will have a disproportionate impact.
Let's take two features, touch time and shot distance, and try to get them on the same scale.
Please note there are input errors in the touch time column (negative touch times), and we will clean them before starting.
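To make the scale argument concrete, here is a small sketch with two made-up shots (the numbers are assumptions, not taken from the dataset). It measures how much of the squared Euclidean distance between them comes from each feature.

```python
import numpy as np

# Made-up shots described as (touch time in seconds, shot distance in feet).
a = np.array([1.0, 5.0])
b = np.array([3.0, 25.0])

per_feature_gap = np.abs(a - b)
distance = np.linalg.norm(a - b)

# Share of the squared distance contributed by each feature.
contribution = per_feature_gap ** 2 / distance ** 2
print(contribution.round(3))
```

Almost all of the distance comes from the feature with the larger range, even though both shots also differ in touch time.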
In [12]:
data = data[data['TOUCH_TIME'] > 0].copy()

def scatter_plot_two_features(data, colnames, total_samples):
    scatter_plot_data = data[[colnames[0], colnames[1]]].sample(n=total_samples)
    x = scatter_plot_data[colnames[0]]
    y = scatter_plot_data[colnames[1]]
    plot = plt.scatter(x, y)
    return plot

cols = ['TOUCH_TIME', 'SHOT_DIST']
scatter_plot_two_features(data, cols, 1000)
Out[12]:
A possible standardization technique is scaling features between a given minimum and maximum value, often between zero and one.
In [13]:
def rescale_columns(data, colnames, total_samples_preview):
    scaler = pp.MinMaxScaler()
    rescaled_data = data[colnames]
    rescaled_data = scaler.fit_transform(rescaled_data)
    rescaled_data = pd.DataFrame(rescaled_data, columns=colnames)
    rescaled_data_sample = rescaled_data.sample(n=5)
    scatter_plot_two_features(rescaled_data, colnames, total_samples_preview)
    return rescaled_data_sample

rescale_columns(data, cols, 500)
Out[13]:
Alternatively, the MaxAbsScaler() scaler (used in similar fashion) would return data in the range [-1, 1].
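Both scalers apply a simple per-column formula. The sketch below computes each one by hand on made-up values (an assumption for illustration) and compares the result with sklearn's output.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Made-up values: column 0 is all positive, column 1 mixes signs.
X = np.array([[1.0, -4.0],
              [3.0,  0.0],
              [5.0,  2.0]])

# Min-max: (x - min) / (max - min), computed per column.
minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Max-abs: x / max(|x|), computed per column; signs and zeros are preserved.
maxabs_manual = X / np.abs(X).max(axis=0)

print(np.allclose(minmax_manual, MinMaxScaler().fit_transform(X)))
print(np.allclose(maxabs_manual, MaxAbsScaler().fit_transform(X)))
```

Because max-abs scaling only divides, it never shifts the data, which is why it is often preferred for sparse inputs.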
Standardization is a common way to make features look more or less like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice, we often ignore the shape of the distribution and just transform the data to have zero mean and unit variance, by subtracting each feature's mean and dividing by its standard deviation.
One can assume that, if the dataset is large enough, both the training and validation sets are approximately normally distributed.
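As a minimal sketch of that transformation, the snippet below standardizes a made-up array by hand (the values are assumptions for illustration) and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [6.0, 50.0]])  # made-up values

# z = (x - mean) / std per column; np.std defaults to the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that the transformation recenters and rescales the data but does not change the shape of its distribution.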
In [14]:
def standardize_columns(data, colnames, total_samples_preview):
    scaler = pp.StandardScaler()
    rescaled_data = data[colnames]
    rescaled_data = scaler.fit_transform(rescaled_data)
    rescaled_data = pd.DataFrame(rescaled_data, columns=colnames)
    rescaled_data_sample = rescaled_data.sample(n=5)
    scatter_plot_two_features(rescaled_data, colnames, total_samples_preview)
    return rescaled_data_sample

standardize_columns(data, cols, 500)
Out[14]:
Normalization is the process of scaling individual samples to have unit norm, i.e. into a vector of length 1.
This process can be useful if you plan to quantify the similarity of any pair of samples, using a dot-product for example.
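A short sketch of what this means in practice, on made-up samples (the values are assumptions for illustration): after row-wise L2 normalization, the dot product of two samples is their cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [4.0, -3.0]])  # made-up samples, one per row

unit_rows = Normalizer(norm="l2").fit_transform(X)
print(np.linalg.norm(unit_rows, axis=1))  # length of each normalized row

# Dot products between unit vectors are cosine similarities.
similarity = unit_rows @ unit_rows.T
print(similarity.round(3))
```

The first two samples point in the same direction, so their similarity is 1 despite their different magnitudes; the third is orthogonal to the first, so their similarity is 0.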
In [15]:
def normalize_columns(data, colnames, total_samples_preview):
    scaler = pp.Normalizer()
    rescaled_data = data[colnames]
    rescaled_data = scaler.fit_transform(rescaled_data)
    rescaled_data = pd.DataFrame(rescaled_data, columns=colnames)
    rescaled_data_sample = rescaled_data.sample(n=5)
    scatter_plot_two_features(rescaled_data, colnames, total_samples_preview)
    return rescaled_data_sample

normalize_columns(data, cols, 500)
Out[15]:
Another strategy we can use is binarization.
Feature binarization is the process of thresholding numerical features to get boolean values.
Here, for example, we can place a threshold in shot distance at 23.9 feet to create a new binary variable: the three point line.
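A minimal sketch of the thresholding rule on made-up shot distances (the values are assumptions for illustration), showing how `Binarizer` treats a value that sits exactly on the threshold.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up shot distances in feet, as a single-column 2D array.
distances = np.array([[5.0], [23.9], [24.0], [30.2]])

binarized = Binarizer(threshold=23.9).fit_transform(distances)

# Binarizer maps values strictly greater than the threshold to 1,
# so a shot at exactly 23.9 feet is labeled 0.
manual = (distances > 23.9).astype(float)
print(binarized.ravel(), np.array_equal(binarized, manual))
```

If shots taken exactly at the threshold should count as three-pointers, the threshold would have to be set slightly below 23.9.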
In [16]:
def binarize_columns(data, colname, threshold):
    scaler = pp.Binarizer(threshold=threshold)
    # Copy only the column we need, so the caller's dataframe is not modified
    binarized_data = data[[colname]].copy()
    # Binarizer expects 2D input; ravel() flattens the (n, 1) result into a column
    binarized_data['3_POINTER'] = scaler.fit_transform(binarized_data.values).ravel()
    binarized_data_sample = binarized_data.sample(n=5)
    return binarized_data_sample

three_point_line = 23.9
binarize_columns(data, 'SHOT_DIST', three_point_line)
Out[16]: