We will primarily be working out of the Scikit Learn Cookbook. I love to use Jupyter notebooks, so this presentation consists of slides generated from the notebook; I recently started using the live-reveal extension to turn my notebooks into slides. You can download the notebook and play around with the examples yourself later. Also, a reminder that this course is intended to be at the intermediate level.
So we will be going over the pre-model workflow, primarily out of the Scikit Learn Cookbook, but I'll also be including other examples not in the book. How many people here have used scikit-learn before? How many people here have had experience with data cleaning? I have had some experience with data preprocessing, and I have learned a lot by dealing with very messy data. I think the best way to appreciate the importance of data cleaning and preprocessing is to deal with it often; you are always learning some new trick or finding some new aberration in the data. I would love for others to share examples of atrocious data, and how they worked around it, as we go through these sections.
In [1]:
%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
%load_ext watermark
In [2]:
%watermark -a "Jaya Zenchenko" -n -t -z -u -h -m -w -v -p scikit-learn,matplotlib,pandas,seaborn,numpy,scipy,conda
Ever since I discovered watermark, I have loved it because it automatically documents the package versions and information about the machine the code was run on. This is great for reproducibility when sharing your work and results with others.
So scikit-learn is a machine learning package for Python. I think it is the gold standard for how packages in the scientific community should be built: it is easy to pick up quickly and use to build models and apply machine learning algorithms. But while it is easy to use, the real work is in two areas: first, we need to make sure our data is cleaned and properly formatted for the various machine learning methods, and second, we need to make sure the data looks the way it should for the given model, so that we satisfy the model's assumptions before using it. Garbage in, garbage out. This is why we need to preprocess our data.
- Noisy data (out-of-range, impossible combinations, human error, etc)
- Missing Data (General noise/error, MCAR, MNAR, MAR)
- Data of different types (numeric, categorical)
- Too many attributes
- Too much data
- Data does not fit the assumed characteristic for a given method
- Imputing (filling in missing values)
- Binary Features
- Categorical
- Binarizing
- Correlation Analysis/Chi-Squared Test of Independence
- Aggregating
- Random Sampling
- Scaling/Normalizing
So what kinds of issues can we find in our data? Data has been known to "lie": it comes from various sources such as people, sensors, and the internet, and all of these have been known to lie sometimes. We can have noisy data, missing data, data of different types, too much data, and data values that are not as expected. I'll be going over a few examples of these today, primarily ways of filling in missing data, dealing with data of different types, and scaling and normalizing.
scikit-learn has many built-in data sets; they are a good place to start playing with these techniques and with the other methods covered in future classes. Other sources of data sets include the UCI Machine Learning Repository, Kaggle data sets, and local government/open data sets.
Another option is to create your own fake data set with different properties (distribution, number of clusters, noise, etc.). This can be a good way to test different algorithms and see how they perform on data sets that do (or do not) behave according to their assumptions, as in the sketch below.
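As a minimal sketch (the parameter values here are arbitrary choices, not from the book), scikit-learn's built-in generators can produce synthetic data with controllable structure:

from sklearn.datasets import make_blobs, make_classification

# Three Gaussian blobs with a chosen spread -- handy for trying out clustering methods.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=40)

# A labeled classification set with a few informative features and some label noise.
X_clf, y_clf = make_classification(n_samples=300, n_features=10, n_informative=4,
                                   flip_y=0.05, random_state=40)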
Remember, when using real data, to always spend enough time doing exploratory data analysis to understand the data before applying different methods.
In [3]:
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy
import seaborn as sns
import pandas as pd
%matplotlib inline
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
sns.set()
In [4]:
iris = load_iris()
In [5]:
iris.feature_names
Out[5]:
In [6]:
iris.data[0:5]
Out[6]:
I like seaborn's visualization capability and integration with pandas, so I'm going to download the dataset from there instead.
In [7]:
df = sns.load_dataset("iris")
In [8]:
df.head()
Out[8]:
Let's randomly select samples to remove so that we have missing values in our data.
In [9]:
numpy.random.seed(seed=40)
sample_idx = numpy.random.random_integers(0, df.shape[0]-1, 10)
feature_idx = numpy.random.random_integers(0, df.shape[1]-2, 10)
In [10]:
print "sample_idx", sample_idx
print "feature_idx", feature_idx
In [11]:
for idx, jdx in zip(sample_idx, feature_idx):
df.ix[idx, jdx] = None
In [12]:
df.head(15)
Out[12]:
scikit-learn has an Imputer. It has a fit_transform method and can be part of a preprocessing pipeline.
In [13]:
imputer = preprocessing.Imputer()
In [14]:
imputer
Out[14]:
The strategy parameter can take the values 'mean', 'median', or 'most_frequent'. You can also tell the Imputer what a missing value looks like in your data; sometimes people use -1 or 99999.
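As a small hedged sketch (the sentinel value and the name sentinel_imputer are made up for illustration), this is how a non-NaN missing-value code could be handled:

# If missing values were coded as 99999 instead of NaN, the Imputer can be told
# to look for that sentinel value and replace it with the column median.
sentinel_imputer = preprocessing.Imputer(missing_values=99999, strategy='median')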
In [15]:
imputed_df = df.copy()
In [16]:
imputed_data = imputer.fit_transform(df.ix[:,0:4])
In [17]:
imputed_df.ix[:,0:4] = imputed_data
In [18]:
imputed_df.head(15)
Out[18]:
In [19]:
print df.mean()
df.groupby('species').mean()
Out[19]:
Now, something to think about might be whether we want to use the mean of all the data as the imputed value. Looking at the data by species, the means for petal length vary vastly between the three species, so we may want to impute using the mean within each species rather than over the whole data set.
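Here is a rough sketch of that idea (not from the book; the name species_imputed_df is just for illustration), using a pandas group-wise fill:

# Fill missing values with the mean of each feature within its species,
# instead of the overall column mean.
species_imputed_df = df.copy()
numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
species_imputed_df[numeric_cols] = (df.groupby('species')[numeric_cols]
                                      .transform(lambda col: col.fillna(col.mean())))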
In [21]:
sns.pairplot(df, hue="species")
plt.show()
Looking at this data, another method could be to use clustering or regression to model the rows that have no missing values and then predict where the rows with missing feature values would fall. The thing to note about imputing the data in a more specialized way like this is that it has to be done on the data itself (with pandas or a separate model), so it could not simply be dropped into the pipeline the way the Imputer can. A rough sketch of a regression-based approach is below.
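This sketch is an assumption, not the notebook's method; the predictor columns and the name regression_imputed_df are just illustrative. It predicts missing petal_length values with a regression fit on the complete rows:

from sklearn.linear_model import LinearRegression

regression_imputed_df = df.copy()
predictors = ['sepal_length', 'sepal_width', 'petal_width']

# Rows where both the predictors and the target are fully observed.
complete = df.dropna(subset=predictors + ['petal_length'])

# Rows where only petal_length is missing.
missing = df[df['petal_length'].isnull() & df[predictors].notnull().all(axis=1)]

model = LinearRegression().fit(complete[predictors], complete['petal_length'])
regression_imputed_df.loc[missing.index, 'petal_length'] = model.predict(missing[predictors])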
Binarizing is the process of converting a variable to a 0 or 1 given a certain threshold.
To show an example for binarizing, I wanted to have a data set with both categorical and numeric data. I downloaded an 'exercise' dataset from seaborn.
In [22]:
sns.set(style="ticks")
exercise = sns.load_dataset("exercise")
In [23]:
exercise.head()
Out[23]:
Let's look at a histogram of the numeric variable - pulse:
In [25]:
exercise.pulse.hist()
plt.title('Histogram of Pulse')
plt.show()
In [26]:
exercise.ix[:,'high_pulse'] = preprocessing.binarize(exercise.pulse, threshold=120)[0]
In [29]:
exercise.head()
Out[29]:
In [30]:
exercise[exercise.high_pulse==1].head()
Out[30]:
Obviously this is a simple example; we could just as easily do this with a one-line pandas call, but again the advantage of doing it through scikit-learn is that it can be part of the pipeline.
pandas one-liner: exercise['high_pulse'] = exercise.pulse > 120
Since we can't just plug the exercise data as is into a machine learning algorithm, we need to transform it so that it contains only numerical data. There are two primary ways of doing this: one is to create a numeric code for each of the categories in a given column; the other is to create new features, very similar to what is called "creating dummy variables" in statistics.
In [31]:
encoder = preprocessing.LabelEncoder()
In [32]:
exercise.diet.unique()
Out[32]:
In [33]:
encoder.fit_transform(exercise.diet)
Out[33]:
OneHotEncoder expects the data to be numeric, so LabelEncoder would need to be applied first to convert everything to a numeric value.
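As a small sketch (the variable names diet_codes and diet_onehot are just for illustration), the two encoders can be chained by hand on a single column:

# Encode the string labels as integers, then one-hot encode the integer codes.
diet_codes = preprocessing.LabelEncoder().fit_transform(exercise['diet'])
diet_onehot = preprocessing.OneHotEncoder(sparse=False).fit_transform(
    diet_codes.reshape(-1, 1))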
Here is the way to do it in pandas. Because the 1 and 0 codes don't really carry any meaning, it doesn't matter which category got the 1 label and which got the 0 label.
In [35]:
exercise.diet.cat.codes.head()
Out[35]:
Let's make a deep copy of the exercise data frame so we can start modifying it.
In [36]:
exercise_numeric_df = exercise.copy()
In [37]:
exercise.columns
Out[37]:
In [38]:
exercise.head()
Out[38]:
Let's identify the categorical columns:
In [39]:
cat_columns = ['diet','kind', 'time']
In [42]:
# Pandas: exercise_numeric_df[cat_columns] = exercise[cat_columns].apply(lambda x: x.cat.codes)
exercise_numeric_df[cat_columns] = exercise[cat_columns].apply(lambda x: encoder.fit_transform(x))
In [43]:
exercise_numeric_df.head()
Out[43]:
Now we need to convert the diet, kind, and time columns into "dummy variables":
In [44]:
one_hot_encoder = preprocessing.OneHotEncoder(categorical_features=[2,4,5])
In [45]:
one_hot_encoder
Out[45]:
In [46]:
exercise_numeric_encoded_matrix = one_hot_encoder.fit_transform(exercise_numeric_df.values)
In [56]:
exercise_numeric_encoded_matrix.toarray()[0:10,:]
Out[56]:
In [48]:
exercise_numeric_encoded_matrix.shape
Out[48]:
It's much easier to visualize what is happening using pandas, so I'll include that here as well.
In [49]:
pd.get_dummies(exercise).head()
Out[49]:
In [50]:
pd.get_dummies(exercise).shape
Out[50]:
So most of the time I do go through the pandas approach because it's more readable, and then at the end I'll use the .values attribute to get the values out of the data frame. pandas can be a way to start exploring the data quickly, but once the algorithm is finalized, the scikit-learn pipeline can be what goes into production.
Z-score standardization rescales each feature so that it is centered around 0 with a standard deviation of 1. Standardizing the features this way is not only important if we are comparing measurements that have different units; it is also a general requirement for many machine learning algorithms. Min-max scaling instead transforms the data into a fixed range (typically 0 to 1), which results in smaller standard deviations and can suppress the effect of outliers. Z-score standardization is performed more frequently than min-max scaling, but min-max scaling is commonly used in image processing, where pixel intensities have to fit within a certain range (e.g., 0 to 255 for the RGB color range).
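Here is a quick toy sketch of the difference (the values are made up), using the scalers from the already-imported preprocessing module:

# One feature containing an outlier.
toy = numpy.array([[1.0], [2.0], [3.0], [100.0]])

# Z-score: centered at 0 with unit variance.
print(preprocessing.StandardScaler().fit_transform(toy).ravel())

# Min-max: squeezed into the [0, 1] range.
print(preprocessing.MinMaxScaler().fit_transform(toy).ravel())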
In [51]:
exercise_numeric_encoded_matrix
Out[51]:
In [52]:
standard_scaler = preprocessing.StandardScaler(with_mean=True)
In [57]:
standard_scaler
Out[57]:
In [59]:
exercise_numeric_encoded_matrix.toarray()[0:5,0:8]
Out[59]:
In [60]:
exercise_data_scaled = standard_scaler.fit_transform(exercise_numeric_encoded_matrix.toarray()[:,0:8])
In [61]:
numpy.mean(exercise_data_scaled, axis=0)
Out[61]:
In [62]:
numpy.linalg.norm(exercise_data_scaled[0,:])
Out[62]:
In [63]:
normalizer = preprocessing.Normalizer()
In [64]:
normalizer
Out[64]:
In [65]:
exercise_data_scaled_normalized = normalizer.fit_transform(exercise_data_scaled)
In [66]:
numpy.linalg.norm(exercise_data_scaled_normalized[0,:])
Out[66]:
In [67]:
exercise_data_scaled_normalized[0:5,:]
Out[67]:
In [68]:
from sklearn import pipeline
I was excited to find the FunctionTransformer functionality; it means we can create our own transformation of the data. It is newly available in scikit-learn v0.17, which was just recently released.
In [69]:
my_function = preprocessing.FunctionTransformer(func=lambda x: x.toarray()[:,0:8], \
validate=True, accept_sparse=True, pass_y=False)
In [70]:
preprocessing_pipeline = pipeline.Pipeline([('one_hot_encoding', one_hot_encoder), \
('my_function', my_function), \
('standard_scaler', standard_scaler), \
('normalizer', normalizer)])
In [71]:
preprocessing_pipeline.fit_transform(exercise_numeric_df.values)[0:5,:]
Out[71]:
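As a final hedged sketch (an assumption, not part of the original notebook; the names full_pipeline and cluster_labels are illustrative), a model such as KMeans could be appended as the last step so that the raw encoded data goes straight to cluster labels:

from sklearn.cluster import KMeans

full_pipeline = pipeline.Pipeline([('one_hot_encoding', one_hot_encoder),
                                   ('my_function', my_function),
                                   ('standard_scaler', standard_scaler),
                                   ('normalizer', normalizer),
                                   ('kmeans', KMeans(n_clusters=3, random_state=40))])

# Fit the whole preprocessing + clustering chain, then read off the cluster labels.
cluster_labels = full_pipeline.fit(exercise_numeric_df.values).predict(exercise_numeric_df.values)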