There are a variety of preprocessing tasks to consider before using numeric data in analysis and predictive models.
Numeric variables are not always directly comparable: variables are often measured on different scales and cover different ranges. Furthermore, large differences in magnitude, say one variable takes values in the range 1-100 while another ranges from 1 to 100,000, can distort certain modeling techniques, e.g., those that combine the values of the two variables in some way, as the sketch below illustrates.
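To see why, here is a minimal sketch with two hypothetical points whose first feature lies in the 1-100 range and whose second feature lies in the 1-100,000 range: a combined measure such as Euclidean distance is driven almost entirely by the larger-scale feature.
In [ ]:
import numpy as np

# Two hypothetical points: feature 1 lies on a 1-100 scale, feature 2 on a 1-100000 scale
point_a = np.array([5, 20000])
point_b = np.array([95, 21000])

# Euclidean distance between the two points
distance = np.sqrt(np.sum((point_a - point_b) ** 2))
print(distance)  # ~1004: dominated by the 1000-unit gap in feature 2,
                 # even though feature 1 differs by almost its entire range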
Some of the issues mentioned above can be alleviated by centering and scaling the data. A common way to center data is to subtract the mean value from each data point, which centers the data around zero (and sets the new mean to zero).
In [23]:
# This line lets me show plots
%matplotlib inline
#import useful modules
import numpy as np
import pandas as pd
from ggplot import mtcars
Let's center the mtcars dataset in the ggplot library. First, let's calculate the means for the data in each column:
In [27]:
print(mtcars.head())
colmeans = mtcars.sum()/mtcars.shape[0] # Get column means
colmeans
Out[27]:
Now, subtract the column means from each row, element-wise, to zero center the data:
In [28]:
centered_mtcars = mtcars - colmeans
print(centered_mtcars.describe())
Notice that in zero-centered data, negative values correspond to original values that were below the mean, while positive values correspond to original values that were above the mean.
To put all values on a common scale, we can divide all values in a column by that column's standard deviation.
In [29]:
# Get column standard deviations
column_deviations = centered_mtcars.std(axis=0)
centered_and_scaled_mtcars = centered_mtcars/column_deviations
print(centered_and_scaled_mtcars.describe())
All columns (variables/features) now have a standard deviation of 1 and a mean of approximately 0. This can also be achieved with the scale() function in scikit-learn's preprocessing module. scale() returns an ndarray, which can be converted into a DataFrame if needed.
In [31]:
from sklearn import preprocessing
scaled_data = preprocessing.scale(mtcars)
#reconstruct a DataFrame from the scaled data
scaled_mtcars = pd.DataFrame(scaled_data,
index=mtcars.index,
columns=mtcars.columns)
print(scaled_mtcars.describe() )
Note that the values are not exactly the same as those calculated "manually": scikit-learn's scale() divides by the population standard deviation (ddof=0), whereas pandas' std() uses the sample standard deviation (ddof=1) by default.
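As a quick check of that explanation (a small sketch reusing the centered_mtcars and scaled_mtcars objects created above), dividing the centered data by the population standard deviation should reproduce scikit-learn's output:
In [ ]:
# Divide the centered data by the population standard deviation (ddof=0);
# this should match the output of preprocessing.scale() above
manually_scaled = centered_mtcars / mtcars.std(axis=0, ddof=0)
print(np.allclose(manually_scaled, scaled_mtcars))  # Expected: True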
The distribution of the data can have a significant impact on analysis and modeling, as many techniques assume or require that the data follow a particular distribution, e.g., Gaussian. Some data sets exhibit significant asymmetry (skewness). To illustrate, let's generate a couple of example distributions: one normal and one skewed.
In [33]:
normally_distributed = np.random.normal(size=10000) # Generate normal data
normally_distributed = pd.DataFrame(normally_distributed) # Convert to DF
normally_distributed.hist(figsize=(8,8), # Plot histogram
bins=30)
skewed = np.random.exponential(scale=2, # Generate skewed data
size=10000)
skewed = pd.DataFrame(skewed) # Convert to DF
skewed.hist(figsize=(8,8), # Plot histogram
bins=50)
Out[33]:
Data with a long right tail is called positively skewed or right skewed. In a skewed dataset, the extreme values in the long tail can have a disproportionately large influence on the tests performed on and the models built from the data.
Reducing skew may be appropriate in some cases. Two simple transformations that can reduce skew are taking the square root of each data point or taking the natural logarithm of each data point.
In [35]:
sqrt_transformed = skewed.apply(np.sqrt) # Get the square root of data points
sqrt_transformed.hist(figsize=(8,8), # Plot histogram
bins=50)
log_transformed = (skewed+1).apply(np.log) # Get the log of the data (+1 to avoid taking the log of 0)
log_transformed.hist(figsize = (8,8), # Plot histogram
bins=50)
Out[35]:
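The improvement can be quantified rather than judged from the histograms alone; for instance, pandas' skew() method reports the sample skewness of each column (values near 0 indicate a roughly symmetric distribution):
In [ ]:
# Compare skewness before and after the transformations
print(skewed.skew())            # Strongly positive for the exponential sample
print(sqrt_transformed.skew())  # Much closer to 0 after the square root transform
print(log_transformed.skew())   # Much closer to 0 after the log transform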
In predictive modeling, each variable used to construct a model would ideally represent some unique feature of the data. In reality, variables often exhibit collinearity, and variables with strong correlations can interfere with the modeling process. We can check the pairwise correlations between numeric variables using the DataFrame corr() method:
In [36]:
mtcars.iloc[:,0:6].corr() # Check the pairwise correlations of the first 6 columns
Out[36]:
A positive correlation implies that when one variable goes up the other tends to go up as well, while negative correlations indicate an inverse relationship.
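Strongly correlated pairs can also be pulled out of the correlation matrix programmatically; here is a small sketch (the 0.8 cutoff is an arbitrary choice for illustration):
In [ ]:
# Flag variable pairs with strong correlations (|r| > 0.8 is an arbitrary cutoff)
corr_matrix = mtcars.iloc[:,0:6].corr()
corr_pairs = corr_matrix.unstack()   # Series indexed by (variable 1, variable 2)
strong_pairs = corr_pairs[(corr_pairs.abs() > 0.8) & (corr_pairs.abs() < 1.0)]
print(strong_pairs)  # Each pair appears twice because the matrix is symmetric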
In the mtcars dataset, the number of cylinders a car has (cyl) and its weight (wt) have fairly strong negative correlations with gas mileage (mpg), i.e., heavier cars and cars with more cylinders tend to get lower gas mileage. A scatter plot matrix can help visualize this; pandas' scatter_matrix() function accomplishes this:
In [38]:
from pandas.plotting import scatter_matrix
scatter_matrix(mtcars.iloc[:,0:6], # Make a scatter matrix of 6 columns
figsize=(10, 10), # Set plot size
diagonal='kde') # Show distribution estimates on diagonal
Out[38]: