Preparing Numeric Data

There are a variety of preprocessing tasks one should consider before using numeric data in analysis and predictive models.

Centering and Scaling

Numeric variables are not always directly comparable, as variables are often measured on different scales and cover different ranges. Furthermore, large differences in magnitude, say one variable takes values in the range 1-100 while another ranges from 1 to 100,000, can affect certain modeling techniques (e.g., techniques where the values of the two variables need to be combined in some way, such as computing distances).
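As a minimal sketch (with made-up values), consider how the larger-scale variable dominates a Euclidean distance when unscaled features are combined:

import numpy as np

# Two observations, each with a small-scale and a large-scale feature
a = np.array([1.0, 100.0])
b = np.array([100.0, 100000.0])

# The distance is driven almost entirely by the large-scale feature
print(np.linalg.norm(a - b))   # ~99900.05, versus 99.0 on the first feature alone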

Some of the issues mentioned above can be alleviated by centering and scaling the data. A common way to center data is to subtract the mean value from each data point, which centers the data around zero (and sets the new mean to zero).


In [23]:
# Display plots inline in the notebook
%matplotlib inline

# Import useful modules
import numpy as np
import pandas as pd
from ggplot import mtcars   # The mtcars dataset ships with the ggplot package

Let's center the mtcars dataset from the ggplot library. First, calculate the mean of each column:


In [27]:
print(mtcars.head())

colmeans = mtcars.sum()/mtcars.shape[0]  # Get column means (column sums / number of rows)
colmeans


                    mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  \
name                                                                         
Mazda RX4          21.0    6  160.0  110  3.90  2.620  16.46   0   1     4   
Mazda RX4 Wag      21.0    6  160.0  110  3.90  2.875  17.02   0   1     4   
Datsun 710         22.8    4  108.0   93  3.85  2.320  18.61   1   1     4   
Hornet 4 Drive     21.4    6  258.0  110  3.08  3.215  19.44   1   0     3   
Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3   

                   carb  
name                     
Mazda RX4             4  
Mazda RX4 Wag         4  
Datsun 710            1  
Hornet 4 Drive        1  
Hornet Sportabout     2  
Out[27]:
mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64
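Note that pandas' built-in mean() method computes the same column means directly:

colmeans = mtcars.mean()   # Equivalent to mtcars.sum()/mtcars.shape[0]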

Now, subtract the column means from each row, element-wise, to zero-center the data:


In [28]:
centered_mtcars = mtcars - colmeans

print(centered_mtcars.describe())


                mpg        cyl          disp          hp          drat  \
count  3.200000e+01  32.000000  3.200000e+01   32.000000  3.200000e+01   
mean   3.996803e-15   0.000000 -3.907985e-14    0.000000 -5.967449e-16   
std    6.026948e+00   1.785922  1.239387e+02   68.562868  5.346787e-01   
min   -9.690625e+00  -2.187500 -1.596219e+02  -94.687500 -8.365625e-01   
25%   -4.665625e+00  -2.187500 -1.098969e+02  -50.187500 -5.165625e-01   
50%   -8.906250e-01  -0.187500 -3.442188e+01  -23.687500  9.843750e-02   
75%    2.709375e+00   1.812500  9.527812e+01   33.312500  3.234375e-01   
max    1.380938e+01   1.812500  2.412781e+02  188.312500  1.333437e+00   

                 wt          qsec         vs         am       gear     carb  
count  3.200000e+01  3.200000e+01  32.000000  32.000000  32.000000  32.0000  
mean   4.787837e-16 -2.609024e-15   0.000000   0.000000   0.000000   0.0000  
std    9.784574e-01  1.786943e+00   0.504016   0.498991   0.737804   1.6152  
min   -1.704250e+00 -3.348750e+00  -0.437500  -0.406250  -0.687500  -1.8125  
25%   -6.360000e-01 -9.562500e-01  -0.437500  -0.406250  -0.687500  -0.8125  
50%    1.077500e-01 -1.387500e-01  -0.437500  -0.406250   0.312500  -0.8125  
75%    3.927500e-01  1.051250e+00   0.562500   0.593750   0.312500   1.1875  
max    2.206750e+00  5.051250e+00   0.562500   0.593750   1.312500   5.1875  

Notice that in zero-centered data, negative values correspond to original values that were below average, while positive values correspond to original values that were above average.

To put the variables on a common scale, we can divide all values in a column by that column's standard deviation.


In [29]:
# Get column standard deviations
column_deviations = centered_mtcars.std(axis=0)   

centered_and_scaled_mtcars = centered_mtcars/column_deviations 
print(centered_and_scaled_mtcars.describe())


                mpg           cyl          disp            hp          drat  \
count  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01   
mean   6.678685e-16 -6.938894e-18 -3.087808e-16 -2.428613e-17 -1.113692e-15   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.607883e+00 -1.224858e+00 -1.287910e+00 -1.381032e+00 -1.564608e+00   
25%   -7.741273e-01 -1.224858e+00 -8.867035e-01 -7.319924e-01 -9.661175e-01   
50%   -1.477738e-01 -1.049878e-01 -2.777331e-01 -3.454858e-01  1.841059e-01   
75%    4.495434e-01  1.014882e+00  7.687521e-01  4.858679e-01  6.049193e-01   
max    2.291272e+00  1.014882e+00  1.946754e+00  2.746567e+00  2.493904e+00   

                 wt          qsec            vs            am          gear  \
count  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01   
mean   4.735795e-16 -1.491862e-15  1.387779e-17  8.326673e-17 -5.030698e-17   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.741772e+00 -1.874010e+00 -8.680278e-01 -8.141431e-01 -9.318192e-01   
25%   -6.500027e-01 -5.351317e-01 -8.680278e-01 -8.141431e-01 -9.318192e-01   
50%    1.101223e-01 -7.764656e-02 -8.680278e-01 -8.141431e-01  4.235542e-01   
75%    4.013971e-01  5.882951e-01  1.116036e+00  1.189901e+00  4.235542e-01   
max    2.255336e+00  2.826755e+00  1.116036e+00  1.189901e+00  1.778928e+00   

               carb  
count  3.200000e+01  
mean   1.387779e-17  
std    1.000000e+00  
min   -1.122152e+00  
25%   -5.030337e-01  
50%   -5.030337e-01  
75%    7.352031e-01  
max    3.211677e+00  

All columns/variables/features now have a standard deviation of 1 and a mean of (essentially) 0. The same result can be achieved with the scale() function in scikit-learn's preprocessing module. scale() returns an ndarray, which can be converted back into a DataFrame if needed.


In [31]:
from sklearn import preprocessing

scaled_data = preprocessing.scale(mtcars)  

# Reconstruct a DataFrame from the scaled data
scaled_mtcars = pd.DataFrame(scaled_data,
                             index=mtcars.index,
                             columns=mtcars.columns)

print(scaled_mtcars.describe() )


                mpg           cyl          disp            hp          drat  \
count  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01   
mean  -5.481726e-16  4.163336e-17  1.387779e-16 -1.734723e-17 -3.122502e-16   
std    1.016001e+00  1.016001e+00  1.016001e+00  1.016001e+00  1.016001e+00   
min   -1.633610e+00 -1.244457e+00 -1.308518e+00 -1.403130e+00 -1.589643e+00   
25%   -7.865141e-01 -1.244457e+00 -9.008917e-01 -7.437050e-01 -9.815764e-01   
50%   -1.501383e-01 -1.066677e-01 -2.821771e-01 -3.510140e-01  1.870518e-01   
75%    4.567366e-01  1.031121e+00  7.810529e-01  4.936423e-01  6.145986e-01   
max    2.327934e+00  1.031121e+00  1.977904e+00  2.790515e+00  2.533809e+00   

                 wt          qsec            vs            am          gear  \
count  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01   
mean   4.683753e-17 -1.469311e-15 -6.938894e-18  5.551115e-17 -1.144917e-16   
std    1.016001e+00  1.016001e+00  1.016001e+00  1.016001e+00  1.016001e+00   
min   -1.769642e+00 -1.903996e+00 -8.819171e-01 -8.271702e-01 -9.467293e-01   
25%   -6.604034e-01 -5.436944e-01 -8.819171e-01 -8.271702e-01 -9.467293e-01   
50%    1.118844e-01 -7.888899e-02 -8.819171e-01 -8.271702e-01  4.303315e-01   
75%    4.078199e-01  5.977084e-01  1.133893e+00  1.208941e+00  4.303315e-01   
max    2.291423e+00  2.871986e+00  1.133893e+00  1.208941e+00  1.807392e+00   

               carb  
count  3.200000e+01  
mean  -6.938894e-18  
std    1.016001e+00  
min   -1.140108e+00  
25%   -5.110827e-01  
50%   -5.110827e-01  
75%    7.469671e-01  
max    3.263067e+00  

Note that the values are not exactly the same as those calculated "manually": scikit-learn's scale() divides by the population standard deviation (normalizing by n), whereas pandas' std() uses the sample standard deviation (normalizing by n-1). That is why describe() now reports a standard deviation of about 1.016, i.e., sqrt(32/31), rather than exactly 1.
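A quick check of this, as a minimal sketch (assuming mtcars is loaded as above):

n = mtcars.shape[0]

print(mtcars['mpg'].std())         # pandas default: sample std (divides by n-1)
print(mtcars['mpg'].std(ddof=0))   # Population std (divides by n), as scale() uses
print(np.sqrt(n/(n - 1)))          # ~1.016, the ratio seen in describe() above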

Dealing With Skewed Data

The distribution of the data can have a significant impact on analysis and modeling, as many techniques assume, or require, that the data follow a particular distribution, e.g., the Gaussian. Some data sets exhibit significant asymmetry (skewness). To illustrate, let's generate a couple of distributions:


In [33]:
normally_distributed = np.random.normal(size=10000)  # Generate normal data

normally_distributed = pd.DataFrame(normally_distributed) # Convert to DF

normally_distributed.hist(figsize=(8,8),                  # Plot histogram
                          bins=30)       

skewed = np.random.exponential(scale=2,      # Generate skewed data
                               size= 10000)  

skewed = pd.DataFrame(skewed)                # Convert to DF

skewed.hist(figsize=(8,8),                   # Plot histogram
            bins=50)


[Output: two histograms, one of the symmetric, normally distributed data and one of the right-skewed exponential data]

Data with a long right tail is called positively skewed or right skewed. In a skewed dataset, the extreme values in the long tail can have a disproportionately large influence on some of the tests performed on, and models built from, the data.
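As a minimal illustration (assuming numpy is imported as above), compare the mean and median of a right-skewed sample; the mean is pulled toward the long tail:

sample = np.random.exponential(scale=2, size=10000)

print(sample.mean())       # Close to 2: the mean is inflated by the long tail
print(np.median(sample))   # Close to 2*ln(2), about 1.39: the median resists the tail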

In some cases it may be appropriate to reduce skew. Two simple transformations that can reduce skew are taking the square root of each data point and taking the natural logarithm of each data point.


In [35]:
sqrt_transformed = skewed.apply(np.sqrt)     # Take the square root of each data point

sqrt_transformed.hist(figsize=(8,8),         # Plot histogram
                      bins=50)

log_transformed = (skewed+1).apply(np.log)   # Add 1 before taking logs to avoid log(0)

log_transformed.hist(figsize=(8,8),          # Plot histogram
                     bins=50)


[Output: histograms of the square-root-transformed and log-transformed data, both noticeably less skewed than the original]

Highly Correlated Variables

In predictive modeling, each variable used to construct a model would ideally represent some unique feature of the data. In reality, variables often exhibit collinearity, and variables with strong correlations can interfere with the modeling process. We can check the pairwise correlations between numeric variables with the DataFrame corr() method:


In [36]:
mtcars.iloc[:,0:6].corr()   # Check the pairwise correlations of the first 6 variables


Out[36]:
           mpg       cyl      disp        hp      drat        wt
mpg   1.000000 -0.852162 -0.847551 -0.776168  0.681172 -0.867659
cyl  -0.852162  1.000000  0.902033  0.832447 -0.699938  0.782496
disp -0.847551  0.902033  1.000000  0.790949 -0.710214  0.887980
hp   -0.776168  0.832447  0.790949  1.000000 -0.448759  0.658748
drat  0.681172 -0.699938 -0.710214 -0.448759  1.000000 -0.712441
wt   -0.867659  0.782496  0.887980  0.658748 -0.712441  1.000000

A positive correlation implies that when one variable goes up, the other tends to go up as well, while a negative correlation indicates an inverse relationship.
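With many variables, scanning the full matrix gets tedious. Here is a minimal sketch (the 0.8 cutoff is an arbitrary choice) that lists the variable pairs whose absolute correlation exceeds a threshold:

corr = mtcars.iloc[:,0:6].corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack()       # Series indexed by (variable, variable) pairs
print(pairs[pairs > 0.8])   # Pairs with absolute correlation above 0.8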

In the mtcars dataset, the number of cylinders a car has (cyl) and its weight (wt) have fairly strong negative correlations with gas mileage (mpg), i.e., heavier cars and cars with more cylinders tend to get lower gas mileage. A scatter plot matrix can help visualize this; pandas' scatter_matrix() function accomplishes it:


In [38]:
from pandas.plotting import scatter_matrix   # pandas.tools.plotting in older pandas

scatter_matrix(mtcars.iloc[:,0:6],  # Make a scatter matrix of the first 6 columns
               figsize=(10, 10),    # Set plot size
               diagonal='kde')      # Show kernel density estimates on the diagonal


[Output: a 6x6 scatter plot matrix of the first six mtcars variables, with kernel density estimates on the diagonal]
