Load the Pima diabetes dataset as a pandas dataframe. (Note that the data does not include a header row. You'll have to build that yourself based on the documentation.)
In [61]:
import pandas
names = ['num_times_pregnant', 'glucose_concentration',
         'blood_pressure', 'skin_fold_thickness', 'insulin',
         'bmi', 'diabetes_pedigree', 'age', 'target']
data_url = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
            'pima-indians-diabetes/pima-indians-diabetes.data')
df = pandas.read_csv(data_url, header=None, index_col=False, names=names)
print(df)
Check the dataframe to see which columns contain 0's. Based on the data type of each column, do these 0's all make sense? Which 0's are suspicious?
In [62]:
for name in names:
    print(name, ':', (df[name] == 0).any())
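Answer: Columns 2-6 (glucose concentration, blood pressure, skin fold thickness, insulin, and BMI) all contain zeros, but none of these measurements can be 0 in a living person, so those zeros are suspicious. Zeros in num_times_pregnant and target are plausible: they mean no pregnancies and a negative label, respectively.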
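A per-column count is often more informative than a yes/no check, since it shows how many samples each suspicious column would cost. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the Pima data) to show the idea.

```python
import pandas

# Toy stand-in for the real dataframe; these values are made up for illustration.
toy = pandas.DataFrame({
    'num_times_pregnant': [0, 2, 5, 1],
    'glucose_concentration': [148, 0, 183, 89],
    'bmi': [33.6, 26.6, 0.0, 28.1],
    'target': [1, 0, 1, 0],
})

# Comparing the whole frame to 0 gives a boolean frame; summing counts zeros per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Running the same expression on the real `df` would give the number of zeros in each of the nine columns.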
Assume that 0s indicate missing values, and fix them in the dataset by eliminating samples with missing features. Then run a logistic regression, and measure the performance of the model.
In [63]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Mark the suspicious zeros (columns 2-6) as missing.
zero_cols = names[1:6]
df[zero_cols] = df[zero_cols].replace(0, np.nan)
# Drop every sample that has at least one missing feature.
df_no_nan = df.dropna(axis=0, how='any')
X = df_no_nan.iloc[:, :8].values
y = df_no_nan.iloc[:, 8].values
def fit_and_score_rlr(X, y, normalize=True):
    """Fit a regularized logistic regression and return its test-set accuracy."""
    if normalize:
        scaler = StandardScaler().fit(X)
        X_std = scaler.transform(X)
    else:
        X_std = X
    X_train, X_test, y_train, y_test = train_test_split(X_std, y,
                                                        test_size=0.33,
                                                        random_state=42)
    rlr = LogisticRegression(C=1)
    rlr.fit(X_train, y_train)
    return rlr.score(X_test, y_test)
fit_and_score_rlr(X, y)
Out[63]:
Next, replace the missing features through mean imputation. Run a logistic regression again and measure the performance of the model.
In [64]:
from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column (feature).
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(df.iloc[:, :8].values)
y = df.iloc[:, 8].values
fit_and_score_rlr(X, y)
Out[64]:
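Note that the cells above fit the imputer and the scaler on the full dataset before splitting, so the test rows influence the preprocessing statistics. A minimal sketch of one alternative, assuming scikit-learn's `make_pipeline` and `SimpleImputer` and using synthetic data in place of the Pima dataframe, learns every preprocessing step from the training split only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration): 200 samples, 4 features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
# Knock out roughly 10% of the entries to simulate missing values.
X_demo[rng.uniform(size=X_demo.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Imputation means and scaling statistics are learned from X_train only.
model = make_pipeline(SimpleImputer(strategy='mean'),
                      StandardScaler(),
                      LogisticRegression(C=1))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Whether this changes the score noticeably on a dataset of this size is an empirical question; the point is only that the test set stays unseen until scoring.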
Comment on your results.
Answer: Interestingly, the two approaches perform about the same. In my run, mean imputation scored about 3 percentage points higher in test accuracy than dropping the incomplete samples. Some ideas for why this might be:
Load the TA evaluation dataset. As before, the data and header are split into two files, so you'll have to combine them yourself.
In [65]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tae/tae.data'
names = ['native_speaker', 'instructor', 'course', 'season', 'class_size', 'rating']
df = pandas.read_csv(data_url, header=None, index_col=False, names=names)
print(df)
Which of the features are categorical? Are they ordinal, or nominal? Which features are numeric?
Answer: According to the documentation:
Encode the categorical variables in a naive fashion, by leaving them in place as numerics. Run a classification and measure performance against a test set.
In [70]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
fit_and_score_rlr(X, y, normalize=True)
Out[70]:
Now, encode the categorical variables with a one-hot encoder. Again, run a classification and measure performance.
In [71]:
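from sklearn.preprocessing import OneHotEncoder
# One-hot encode all five feature columns (this treats class_size as categorical too).
enc = OneHotEncoder()
X_encoded = enc.fit_transform(X)
# The encoder returns a sparse matrix, so skip standardization.
fit_and_score_rlr(X_encoded, y, normalize=False)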
Out[71]:
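To see what the encoder does, consider a toy nominal column with three category codes (made up for illustration). Each distinct code becomes its own 0/1 column, so the model no longer treats code 3 as "larger" than code 1.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single nominal feature with three hypothetical category codes.
codes = np.array([[1], [3], [2], [3]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it easy to inspect.
one_hot = encoder.fit_transform(codes).toarray()
print(encoder.categories_)
print(one_hot)
```

The output has one row per sample and one column per distinct code, with a single 1 in each row.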
Comment on your results.
In [ ]:
Raschka mentions that decision trees and random forests do not require standardized features prior to classification, while the rest of the classifiers we've seen so far do. Why might that be? Explain the intuition behind this idea based on the differences between tree-based classifiers and the other classifiers we've seen.
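One way to build intuition is to check the claim empirically. The sketch below (an illustration, not part of the original exercise) trains a decision tree on scikit-learn's bundled copy of the wine data, once on raw features and once on standardized features, and compares the test predictions. A tree splits on one feature at a time using a threshold, and rescaling a feature only moves the threshold without changing which samples fall on each side, so the two trees should agree up to rare floating-point ties at a split threshold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_wine, y_wine = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_wine, y_wine, test_size=0.33, random_state=42)

# Standardize using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Same hyperparameters and seed for both trees; only the feature scale differs.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
tree_std = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train_std, y_train)

# Fraction of test samples on which the two trees predict the same class.
agreement = np.mean(tree_raw.predict(X_test) == tree_std.predict(X_test_std))
print('fraction of identical predictions:', agreement)
```

Repeating the comparison with a distance-based or gradient-based classifier is a useful contrast, since those models combine features numerically and are sensitive to their scales.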
Now, we'll test the two scaling algorithms on the wine dataset. Start by loading the wine dataset.
In [ ]:
Scale the features via "standardization" (as Raschka describes it). Classify and measure performance.
In [ ]:
Scale the features via "normalization" (as Raschka describes it). Again, classify and measure performance.
In [ ]:
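For reference, the two scalings differ only in which statistics they use. Standardization centers each feature at zero with unit variance, (x - mean) / std, while normalization (min-max scaling) maps each feature onto the range [0, 1], (x - min) / (max - min). A minimal NumPy sketch on a made-up feature column:

```python
import numpy as np

# Made-up feature values for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Standardization: zero mean, unit variance (population std, as StandardScaler uses).
x_std = (x - x.mean()) / x.std()

# Normalization (min-max scaling): rescale to the [0, 1] range.
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_std)
print(x_norm)
```

In scikit-learn the same transformations are available as `StandardScaler` and `MinMaxScaler`, which apply them column by column.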
Comment on your results.
In [ ]:
In [68]:
class SBS(object):
    """
    Class to select the k-best features in a dataset via sequential backwards selection.
    """
    def __init__(self):
        """
        Initialize the SBS model.
        """
        pass
    def fit(self):
        """
        Fit SBS to a dataset.
        """
        pass
    def transform(self):
        """
        Transform a dataset based on the model.
        """
        pass
    def fit_transform(self):
        """
        Fit SBS to a dataset and transform it, returning the k-best features.
        """
        pass
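As a starting point for filling in the skeleton, the core loop of sequential backward selection can be sketched as a standalone function. This is one possible reading of the algorithm, not a reference implementation: start from all features, and at each step drop the single feature whose removal hurts a score the least, until k remain. The function name `sbs_select` and the `score_fn` callback are hypothetical names introduced here; in practice `score_fn` would train a classifier on the given columns and return its validation accuracy.

```python
from itertools import combinations

import numpy as np


def sbs_select(X, k, score_fn):
    """Return the indices of k features chosen by sequential backward selection.

    score_fn(X_subset) must return a number where higher is better.
    """
    indices = tuple(range(X.shape[1]))
    while len(indices) > k:
        # Try every subset that drops exactly one of the remaining features.
        candidates = list(combinations(indices, len(indices) - 1))
        scores = [score_fn(X[:, list(subset)]) for subset in candidates]
        # Keep the best-scoring subset, i.e. drop the least useful feature.
        indices = candidates[int(np.argmax(scores))]
    return indices


# Toy usage: the stand-in score is the total variance of the kept columns,
# so the lowest-variance columns are dropped first.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 0.1, 5.0])
selected = sbs_select(X_toy, k=2, score_fn=lambda X_sub: X_sub.var(axis=0).sum())
print(selected)
```

Inside the class, `fit` would run this loop and store the selected indices, and `transform` would return `X[:, list(indices)]`.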
Now, we'll practice feature selection. Start by loading the breast cancer dataset.
In [ ]:
Use a random forest to determine the feature importances. Plot the features and their importances.
In [ ]:
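A minimal sketch of the random forest step, assuming scikit-learn's bundled copy of the breast cancer data (`load_breast_cancer`) is the intended dataset; the number of trees and the random seed are arbitrary choices. Plotting is left out here, but the ranked importances could be passed directly to a bar chart.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
X_bc, y_bc = cancer.data, cancer.target

# Impurity-based importances are computed while the forest is fit.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_bc, y_bc)
importances = forest.feature_importances_

# Print the features from most to least important.
for idx in np.argsort(importances)[::-1]:
    print(f'{cancer.feature_names[idx]:<25} {importances[idx]:.3f}')
```

The importances are normalized to sum to 1, so each value can be read as a share of the forest's total impurity reduction.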
Use L1 regularization with a standard C value (0.1) to eliminate low-information features. Again, plot the feature importances using the coef_ attribute of the model.
In [ ]:
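A sketch of the L1 step, again assuming scikit-learn's bundled breast cancer data. The `liblinear` solver is chosen here because it supports the L1 penalty, and the features are standardized first so the penalty treats them comparably. Coefficients driven exactly to zero mark features the model has discarded.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_bc = StandardScaler().fit_transform(cancer.data)
y_bc = cancer.target

# penalty='l1' with a small C applies strong, sparsity-inducing regularization.
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
l1_model.fit(X_bc, y_bc)

# coef_ has one row for this binary problem; nonzero entries are the kept features.
coefs = l1_model.coef_[0]
kept = [name for name, c in zip(cancer.feature_names, coefs) if c != 0]
print(f'{len(kept)} of {coefs.size} features have nonzero coefficients')
print(kept)
```

The absolute values of `coefs` can be plotted next to the random forest importances to compare which features each method favors.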
How do the feature importances from the random forest/L1 regularization compare?