CSAL4243: Introduction to Machine Learning

Muhammad Mudassir Khan (mudasssir.khan@ucp.edu.pk)

Lecture 10: Data Preprocessing

Overview




In [1]:
from IPython.display import Image
%matplotlib inline

In [2]:
# version check: some scikit-learn module paths changed in release 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

Dealing with missing data


In [3]:
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:
# csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df


Out[3]:
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

In [4]:
df.isnull().sum()


Out[4]:
A    0
B    0
C    1
D    1
dtype: int64



Eliminating samples or features with missing values


In [5]:
df.dropna()


Out[5]:
     A    B    C    D
0  1.0  2.0  3.0  4.0

In [6]:
df.dropna(axis=1)


Out[6]:
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0

In [7]:
# only drop rows where all columns are NaN
df.dropna(how='all')


Out[7]:
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

In [8]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)


Out[8]:
     A    B    C    D
0  1.0  2.0  3.0  4.0

In [9]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])


Out[9]:
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN



Imputing missing values


In [10]:
from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imputed_data = imr.transform(df.values)
imputed_data


Out[10]:
array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])

In [11]:
df.values


Out[11]:
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
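Note: the Imputer class used above exists in scikit-learn up to 0.19; in 0.20 it was deprecated in favor of SimpleImputer (and later removed). A minimal sketch of the equivalent call, assuming scikit-learn >= 0.20:

import numpy as np
from sklearn.impute import SimpleImputer

# mean-impute each column (SimpleImputer always works column-wise)
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)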



Understanding the scikit-learn estimator API


In [12]:
Image(filename='./images/10_01.png', width=400)


Out[12]:

In [13]:
Image(filename='./images/10_02.png', width=400)


Out[13]:

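The two figures above illustrate the transformer API (fit learns parameters, transform applies them) and the estimator API (fit learns a model, predict produces labels). A minimal sketch of the pattern, using the Wine split created later in this lecture; the key point is that fit only ever sees the training data:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# transformer: learn scaling parameters on the training set only,
# then apply the same transformation to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train_t = scaler.transform(X_train)
X_test_t = scaler.transform(X_test)

# estimator: learn a model on the transformed training data,
# then predict labels for the transformed test data
model = LogisticRegression()
model.fit(X_train_t, y_train)
y_pred = model.predict(X_test_t)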


Handling categorical data


In [14]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df


Out[14]:
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1



Mapping ordinal features


In [15]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df


Out[15]:
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

In [16]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)


Out[16]:
0     M
1     L
2    XL
Name: size, dtype: object



Encoding class labels


In [17]:
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping


Out[17]:
{'class1': 0, 'class2': 1}

In [18]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df


Out[18]:
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0

In [19]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df


Out[19]:
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

In [21]:
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y


Out[21]:
array([0, 1, 0])

In [22]:
class_le.inverse_transform(y)


Out[22]:
array(['class1', 'class2', 'class1'], dtype=object)



Performing one-hot encoding on nominal features


In [23]:
X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X


Out[23]:
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

In [24]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()


Out[24]:
array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
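Note: the categorical_features argument shown above was removed in scikit-learn 0.22. In newer versions the same result is obtained with a ColumnTransformer; a sketch, assuming scikit-learn >= 0.20:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 (color) and pass the remaining columns through
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough')
ct.fit_transform(X)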

In [25]:
# An even more convenient way to create those dummy features via one-hot encoding
# is to use the get_dummies function implemented in pandas. Applied to a DataFrame,
# get_dummies will only convert string columns and leave all other
# columns unchanged:

pd.get_dummies(df[['price', 'color', 'size']])


Out[25]:
   price  size  color_blue  color_green  color_red
0   10.1     1         0.0          1.0        0.0
1   13.5     2         0.0          0.0        1.0
2   15.3     3         1.0          0.0        0.0
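One-hot encoding introduces a redundant column: color_blue is fully determined once color_green and color_red are known. For models that are sensitive to such collinearity, pandas can drop the first dummy of each feature (drop_first, available since pandas 0.18):

# keep only color_green and color_red; blue is implied when both are 0
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)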



Partitioning a dataset into training and test sets


In [23]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/'
                      'ml/machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()


Class labels [1 2 3]
Out[23]:
   Class label  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  \
0            1    14.23        1.71  2.43               15.6        127
1            1    13.20        1.78  2.14               11.2        100
2            1    13.16        2.36  2.67               18.6        101
3            1    14.37        1.95  2.50               16.8        113
4            1    13.24        2.59  2.87               21.0        118

   Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  \
0           2.80        3.06                  0.28             2.29
1           2.65        2.76                  0.26             1.28
2           2.80        3.24                  0.30             2.81
3           3.85        3.49                  0.24             2.18
4           2.80        2.69                  0.39             1.82

   Color intensity   Hue  OD280/OD315 of diluted wines  Proline
0             5.64  1.04                          3.92     1065
1             4.38  1.05                          3.40     1050
2             5.68  1.03                          3.17     1185
3             7.80  0.86                          3.45     1480
4             4.32  1.04                          2.93      735


In [26]:
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
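With a random 70/30 split, the class proportions in the training and test sets can drift from those of the full dataset. Passing stratify=y makes train_test_split preserve them; a sketch:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)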



Bringing features onto the same scale


In [27]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

In [28]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
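For reference, min-max scaling (normalization) maps each feature onto [0, 1] via x_norm = (x - x_min) / (x_max - x_min), while standardization gives each feature zero mean and unit variance via z = (x - mu) / sigma, where mu and sigma are computed on the training set.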

A worked example, computing both transformations by hand:


In [29]:
ex = pd.DataFrame([0, 1, 2, 3, 4, 5])

# standardize
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)

# Please note that pandas uses ddof=1 (sample standard deviation) 
# by default, whereas NumPy's std method and the StandardScaler
# uses ddof=0 (population standard deviation)

# normalize
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardized', 'normalized']
ex


Out[29]:
   input  standardized  normalized
0      0      -1.46385         0.0
1      1      -0.87831         0.2
2      2      -0.29277         0.4
3      3       0.29277         0.6
4      4       0.87831         0.8
5      5       1.46385         1.0





Titanic dataset


In [47]:
df_titanic = pd.read_csv('datasets/titanic_kaggle.csv', encoding='utf-8')

In [48]:
df_titanic.head()


Out[48]:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

In [49]:
# info of features
df_titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

In [50]:
# check missing values in each feature
df_titanic.isnull().sum()


Out[50]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [51]:
# drop columns that won't be useful for analysis or prediction
df_titanic = df_titanic.drop(['PassengerId','Name','Ticket'], axis=1)

In [56]:
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())
df_titanic.isnull().sum()


Out[56]:
Survived      0
Pclass        0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
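The mean of Age is sensitive to outliers; the median is a common, more robust alternative. A sketch of the same imputation with the median (not applied here):

# alternative: fill missing ages with the median instead of the mean
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].median())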

In [58]:
# Cabin has too many missing values (687 of 891), so drop it
df_titanic = df_titanic.drop(['Cabin'], axis=1)

In [60]:
df_titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB

In [69]:
# find unique values 
df_titanic["Embarked"].unique()


Out[69]:
array(['S', 'C', 'Q', nan], dtype=object)

In [68]:
# find frequency of each
df_titanic["Embarked"].value_counts()


Out[68]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [72]:
# replace missing values with the most frequent value in Embarked
df_titanic['Embarked'] = df_titanic['Embarked'].fillna("S")

In [74]:
# null values in Embarked after replacement
df_titanic["Embarked"].isnull().sum()


Out[74]:
0
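Hard-coding 'S' works here because value_counts showed it is the most frequent port, but the replacement value can also be computed programmatically. A sketch:

# fill with the most frequent value without hard-coding it
most_frequent = df_titanic['Embarked'].mode()[0]   # 'S' for this dataset
df_titanic['Embarked'] = df_titanic['Embarked'].fillna(most_frequent)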

In [75]:
# all null values
df_titanic.isnull().sum()


Out[75]:
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [77]:
# convert gender from male and female to binary values 0 and 1
gender_mapping = {'male': 0, 'female': 1}

df_titanic['Sex'] = df_titanic['Sex'].map(gender_mapping)

In [80]:
df_titanic["Sex"].value_counts()


Out[80]:
0    577
1    314
Name: Sex, dtype: int64

In [86]:
# one-hot encode the remaining categorical column (Embarked);
# get_dummies leaves the numeric columns unchanged
df_titanic = pd.get_dummies(df_titanic)

In [103]:
df_titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Sex           891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8
dtypes: float64(2), int64(5), uint8(3)
memory usage: 51.4 KB

In [89]:
# split into train and test frames
df_train, df_test = train_test_split(df_titanic, test_size=0.3, random_state=0)

In [90]:
# read X and y
X_train, y_train = df_train.iloc[:, 1:].values, df_train.iloc[:, 0].values
X_test, y_test = df_test.iloc[:, 1:].values, df_test.iloc[:, 0].values

In [91]:
# standardize values
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

In [105]:
# train a logistic regression classifier
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))


Training accuracy: 0.804173354735
Test accuracy: 0.794776119403
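score computes the accuracy from the model's predictions; to inspect the predictions themselves, call predict on the standardized test features. A sketch:

# predicted survival (0/1) for the first ten test passengers
y_pred = lr.predict(X_test_std)
print(y_pred[:10])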

Selecting meaningful features

Sparse solutions with L1-regularization

Note: the cells in this section carry lower execution counts (In [32] to In [42]) than the Titanic cells above because they were run on the standardized Wine data from In [27] and In [28], not on the Titanic features; the 13 coefficients per class shown below correspond to the 13 Wine features.

In [32]:
Image(filename='./images/04_12.png', width=500)


Out[32]:

In [33]:
Image(filename='./images/04_13.png', width=500)


Out[33]:

In [40]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=0.1)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))


Training accuracy: 0.983870967742
Test accuracy: 0.981481481481

In [41]:
lr.intercept_


Out[41]:
array([-0.38380829, -0.15808142, -0.70041622])

In [42]:
lr.coef_


Out[42]:
array([[ 0.28005481,  0.        ,  0.        , -0.0277974 ,  0.        ,
         0.        ,  0.7100353 ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.23667089],
       [-0.64398803, -0.06882797, -0.05718877,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        , -0.92673341,
         0.06018449,  0.        , -0.37111244],
       [ 0.        ,  0.06135625,  0.        ,  0.        ,  0.        ,
         0.        , -0.63650094,  0.        ,  0.        ,  0.49828265,
        -0.35840648, -0.57077367,  0.        ]])
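The zero entries above are the effect of the L1 penalty. A quick sketch to quantify the sparsity of each of the three one-vs-rest weight vectors:

import numpy as np

# number of weights driven exactly to zero, per class
print((lr.coef_ == 0).sum(axis=1))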

In [37]:
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.subplot(111)
    
colors = ['blue', 'green', 'red', 'cyan', 
          'magenta', 'yellow', 'black', 
          'pink', 'lightgreen', 'lightblue', 
          'gray', 'indigo', 'orange']

weights, params = [], []
for c in np.arange(-4, 6):
    lr = LogisticRegression(penalty='l1', C=10**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)

weights = np.array(weights)

for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column + 1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center', 
          bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True)
# plt.savefig('./figures/l1_path.png', dpi=300)
plt.show()
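Reading the plot: moving left (smaller C, stronger regularization), more and more weight coefficients are driven exactly to zero, which is the feature-selection effect of the L1 penalty; for large C the penalty is negligible and most weights stay nonzero.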


Credits

Raschka, Sebastian. Python Machine Learning. Birmingham, UK: Packt Publishing, 2015. Print.

Andrew Ng, Machine Learning, Coursera

Scikit Learn

David Kaleko