CSAL4243: Introduction to Machine Learning

Muhammad Mudassir Khan (mudasssir.khan@ucp.edu.pk)

Lecture 10: Data Preprocessing

Overview




In [1]:
from IPython.display import Image
%matplotlib inline

In [2]:
# version check: some scikit-learn module paths changed in release 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

Dealing with missing data


In [3]:
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:
# csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df


Out[3]:
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

In [4]:
df.isnull().sum()


Out[4]:
A    0
B    0
C    1
D    1
dtype: int64



Eliminating samples or features with missing values


In [5]:
df.dropna()


Out[5]:
     A    B    C    D
0  1.0  2.0  3.0  4.0

In [6]:
df.dropna(axis=1)


Out[6]:
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0

In [7]:
# only drop rows where all columns are NaN
df.dropna(how='all')


Out[7]:
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

In [8]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)


Out[8]:
     A    B    C    D
0  1.0  2.0  3.0  4.0

In [9]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])


Out[9]:
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN



Imputing missing values


In [10]:
from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imputed_data = imr.transform(df.values)
imputed_data


Out[10]:
array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])

In [11]:
df.values


Out[11]:
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
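Note: the Imputer class used above exists in scikit-learn up to 0.19; in 0.20 it was deprecated in favor of SimpleImputer (and later removed). A minimal sketch of the equivalent call, assuming scikit-learn >= 0.20:

import numpy as np
from sklearn.impute import SimpleImputer

# mean-impute each column (SimpleImputer always works column-wise)
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)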



Understanding the scikit-learn estimator API


In [12]:
Image(filename='./images/10_01.png', width=400)


Out[12]:

In [13]:
Image(filename='./images/10_02.png', width=400)


Out[13]:

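The two figures above illustrate the transformer API (fit learns parameters, transform applies them) and the estimator API (fit learns a model, predict produces labels). A minimal sketch of the pattern, using the Wine split created later in this lecture; the key point is that fit only ever sees the training data:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# transformer: learn scaling parameters on the training set only,
# then apply the same transformation to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train_t = scaler.transform(X_train)
X_test_t = scaler.transform(X_test)

# estimator: learn a model on the transformed training data,
# then predict labels for the transformed test data
model = LogisticRegression()
model.fit(X_train_t, y_train)
y_pred = model.predict(X_test_t)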


Handling categorical data


In [14]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df


Out[14]:
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1



Mapping ordinal features


In [15]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df


Out[15]:
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

In [16]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)


Out[16]:
0     M
1     L
2    XL
Name: size, dtype: object



Encoding class labels


In [17]:
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping


Out[17]:
{'class1': 0, 'class2': 1}

In [18]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df


Out[18]:
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0

In [19]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df


Out[19]:
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

In [21]:
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y


Out[21]:
array([0, 1, 0])

In [22]:
class_le.inverse_transform(y)


Out[22]:
array(['class1', 'class2', 'class1'], dtype=object)



Performing one-hot encoding on nominal features


In [23]:
X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X


Out[23]:
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

In [24]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()


Out[24]:
array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
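Note: the categorical_features argument shown above was removed in scikit-learn 0.22. In newer versions the same result is obtained with a ColumnTransformer; a sketch, assuming scikit-learn >= 0.20:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 (color) and pass the remaining columns through
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough')
ct.fit_transform(X)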

In [25]:
# An even more convenient way to create those dummy features via one-hot encoding
# is to use the get_dummies function implemented in pandas. Applied to a DataFrame,
# get_dummies will only convert string columns and leave all other
# columns unchanged:

pd.get_dummies(df[['price', 'color', 'size']])


Out[25]:
   price  size  color_blue  color_green  color_red
0   10.1     1         0.0          1.0        0.0
1   13.5     2         0.0          0.0        1.0
2   15.3     3         1.0          0.0        0.0
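One-hot encoding introduces a redundant column: color_blue is fully determined once color_green and color_red are known. For models that are sensitive to such collinearity, pandas can drop the first dummy of each feature (drop_first, available since pandas 0.18):

# keep only color_green and color_red; blue is implied when both are 0
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)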



Partitioning a dataset into training and test sets


In [23]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/'
                      'ml/machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()


Class labels [1 2 3]
Out[23]:
   Class label  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  \
0            1    14.23        1.71  2.43               15.6        127
1            1    13.20        1.78  2.14               11.2        100
2            1    13.16        2.36  2.67               18.6        101
3            1    14.37        1.95  2.50               16.8        113
4            1    13.24        2.59  2.87               21.0        118

   Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  \
0           2.80        3.06                  0.28             2.29
1           2.65        2.76                  0.26             1.28
2           2.80        3.24                  0.30             2.81
3           3.85        3.49                  0.24             2.18
4           2.80        2.69                  0.39             1.82

   Color intensity   Hue  OD280/OD315 of diluted wines  Proline
0             5.64  1.04                          3.92     1065
1             4.38  1.05                          3.40     1050
2             5.68  1.03                          3.17     1185
3             7.80  0.86                          3.45     1480
4             4.32  1.04                          2.93      735


In [26]:
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
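With a random 70/30 split, the class proportions in the training and test sets can drift from those of the full dataset. Passing stratify=y makes train_test_split preserve them; a sketch:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)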



Bringing features onto the same scale


In [27]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

In [28]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
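For reference, min-max scaling (normalization) maps each feature onto [0, 1] via x_norm = (x - x_min) / (x_max - x_min), while standardization gives each feature zero mean and unit variance via z = (x - mu) / sigma, where mu and sigma are computed on the training set.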

A worked example, computing both transformations by hand:


In [29]:
ex = pd.DataFrame([0, 1, 2, 3, 4, 5])

# standardize
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)

# Please note that pandas uses ddof=1 (sample standard deviation) 
# by default, whereas NumPy's std method and the StandardScaler
# uses ddof=0 (population standard deviation)

# normalize
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardized', 'normalized']
ex


Out[29]:
   input  standardized  normalized
0      0      -1.46385         0.0
1      1      -0.87831         0.2
2      2      -0.29277         0.4
3      3       0.29277         0.6
4      4       0.87831         0.8
5      5       1.46385         1.0





Titanic dataset


In [47]:
df_titanic = pd.read_csv('datasets/titanic_kaggle.csv', encoding='utf-8')

In [48]:
df_titanic.head()


Out[48]:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

In [49]:
# info of features
df_titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

In [50]:
# check missing values in each feature
df_titanic.isnull().sum()


Out[50]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [51]:
# drop columns that won't be useful for analysis or prediction
df_titanic = df_titanic.drop(['PassengerId','Name','Ticket'], axis=1)

In [56]:
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())
df_titanic.isnull().sum()


Out[56]:
Survived      0
Pclass        0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
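The mean of Age is sensitive to outliers; the median is a common, more robust alternative. A sketch of the same imputation with the median (not applied here):

# alternative: fill missing ages with the median instead of the mean
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].median())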

In [58]:
# Cabin has too many missing values (687 of 891), so drop it
df_titanic = df_titanic.drop(['Cabin'], axis=1)

In [60]:
df_titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB

In [69]:
# find unique values 
df_titanic["Embarked"].unique()


Out[69]:
array(['S', 'C', 'Q', nan], dtype=object)

In [68]:
# find frequency of each
df_titanic["Embarked"].value_counts()


Out[68]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [72]:
# replace missing values with the most frequent value in Embarked
df_titanic['Embarked'] = df_titanic['Embarked'].fillna("S")

In [74]:
# null values in Embarked after replacement
df_titanic["Embarked"].isnull().sum()


Out[74]:
0
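Hard-coding 'S' works here because value_counts showed it is the most frequent port, but the replacement value can also be computed programmatically. A sketch:

# fill with the most frequent value without hard-coding it
most_frequent = df_titanic['Embarked'].mode()[0]   # 'S' for this dataset
df_titanic['Embarked'] = df_titanic['Embarked'].fillna(most_frequent)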

In [75]:
# all null values
df_titanic.isnull().sum()


Out[75]:
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [77]:
# convert gender from male and female to binary values 0 and 1
gender_mapping = {'male': 0, 'female': 1}

df_titanic['Sex'] = df_titanic['Sex'].map(gender_mapping)

In [80]:
df_titanic["Sex"].value_counts()


Out[80]:
0    577
1    314
Name: Sex, dtype: int64

In [86]:
# one-hot encode the remaining categorical column (Embarked);
# get_dummies leaves the numeric columns unchanged
df_titanic = pd.get_dummies(df_titanic)

In [103]:
df_titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Sex           891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8
dtypes: float64(2), int64(5), uint8(3)
memory usage: 51.4 KB

In [89]:
# split into train and test frames
df_train, df_test = train_test_split(df_titanic, test_size=0.3, random_state=0)

In [90]:
# read X and y
X_train, y_train = df_train.iloc[:, 1:].values, df_train.iloc[:, 0].values
X_test, y_test = df_test.iloc[:, 1:].values, df_test.iloc[:, 0].values

In [91]:
# standardize values
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

In [105]:
# train a logistic regression classifier
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))


Training accuracy: 0.804173354735
Test accuracy: 0.794776119403
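score computes the accuracy from the model's predictions; to inspect the predictions themselves, call predict on the standardized test features. A sketch:

# predicted survival (0/1) for the first ten test passengers
y_pred = lr.predict(X_test_std)
print(y_pred[:10])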

Selecting meaningful features

Sparse solutions with L1-regularization

Note: the cells in this section carry lower execution counts (In [32] to In [42]) than the Titanic cells above because they were run on the standardized Wine data from In [27] and In [28], not on the Titanic features; the 13 coefficients per class shown below correspond to the 13 Wine features.

In [32]:
Image(filename='./images/04_12.png', width=500)


Out[32]:

In [33]:
Image(filename='./images/04_13.png', width=500)


Out[33]:

In [40]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=0.1)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))


Training accuracy: 0.983870967742
Test accuracy: 0.981481481481

In [41]:
lr.intercept_


Out[41]:
array([-0.38380829, -0.15808142, -0.70041622])

In [42]:
lr.coef_


Out[42]:
array([[ 0.28005481,  0.        ,  0.        , -0.0277974 ,  0.        ,
         0.        ,  0.7100353 ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.23667089],
       [-0.64398803, -0.06882797, -0.05718877,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        , -0.92673341,
         0.06018449,  0.        , -0.37111244],
       [ 0.        ,  0.06135625,  0.        ,  0.        ,  0.        ,
         0.        , -0.63650094,  0.        ,  0.        ,  0.49828265,
        -0.35840648, -0.57077367,  0.        ]])
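The zero entries above are the effect of the L1 penalty. A quick sketch to quantify the sparsity of each of the three one-vs-rest weight vectors:

import numpy as np

# number of weights driven exactly to zero, per class
print((lr.coef_ == 0).sum(axis=1))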

In [37]:
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.subplot(111)
    
colors = ['blue', 'green', 'red', 'cyan', 
          'magenta', 'yellow', 'black', 
          'pink', 'lightgreen', 'lightblue', 
          'gray', 'indigo', 'orange']

weights, params = [], []
for c in np.arange(-4, 6):
    lr = LogisticRegression(penalty='l1', C=10**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)

weights = np.array(weights)

for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column + 1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center', 
          bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True)
# plt.savefig('./figures/l1_path.png', dpi=300)
plt.show()
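Reading the plot: moving left (smaller C, stronger regularization), more and more weight coefficients are driven exactly to zero, which is the feature-selection effect of the L1 penalty; for large C the penalty is negligible and most weights stay nonzero.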


Credits

Raschka, Sebastian. Python Machine Learning. Birmingham, UK: Packt Publishing, 2015. Print.

Andrew Ng, Machine Learning, Coursera

Scikit Learn

David Kaleko