General notes about using Pandas.



I will attempt to generalize them if I have the time; otherwise they will just be lesson dumps to be cleaned up later.



Replacing NaNs in a column with non-identical values

Key point: taking a slice of a df can return a copy of that slice, so changes to its elements may not propagate back into the original df.
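For example, a minimal sketch of this copy-vs-view pitfall (the DataFrame here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan, 35.0]})

# chained indexing assigns into a temporary copy; the original df is untouched
subset = df[df["Age"].isnull()]
subset["Age"] = 0.0  # pandas may warn (SettingWithCopyWarning) here

print(df["Age"].isnull().sum())  # 1 -- the NaN is still in the original df

# writing through .loc on the original df does propagate
df.loc[df["Age"].isnull(), "Age"] = 0.0
print(df["Age"].isnull().sum())  # 0
```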

Steps:

  • first get the number of rows
  • make a vector/list of random values equal in length to the number of rows
  • make a boolean index of all the NaNs in the original data slice
  • update the vector of random values so that wherever the slice has useful (non-NaN) data, it overwrites the random values
  • Now the random vector is the version of the column you want in the df
  • simply push the vector in to the df

the code below assumes an existing DataFrame `orig_df` with an 'Age' column, so it will not run as-is; it is just an example for now
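As a self-contained sketch of those steps on synthetic data (the column name and values are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 28.0, 41.0]})

n = df.shape[0]                             # number of rows
mn, st = df["Age"].mean(), df["Age"].std()  # NaNs are skipped automatically

rands = rng.normal(mn, st, size=n)          # random draws centered on the mean

good = df["Age"].notnull()                  # boolean mask of the non-NaN rows
rands[good.values] = df.loc[good, "Age"].values  # keep the real data, overwrite the rest

df["Age"] = rands                           # push the filled column back in
print(df["Age"].isnull().sum())             # 0
```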


In [ ]:
import pandas as pd 
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")

# original DataFrame (assumed to already exist)
orig_df

# column 'Age' has some NaN values
# A simple approximation of the distribution of ages is a Gaussian, though this is often inaccurate.
# let's make a vector of random ages centered on the mean, with a width of the std
mn = orig_df["Age"].mean()
st = orig_df["Age"].std()

# number of rows
n = orig_df.shape[0]

# vector of random values drawn from the 'standard normal', i.e. centered on 0 with variance = 1.0
rands = np.random.randn(n)
# change to centered on mean and with width equal to std
rands = rands*st + mn #above two steps could be combined

#--------------------------------
### OR
## use a truncated normal distribution to make sure none of the values are outside the input data's range
import scipy.stats as stats

lower, upper = orig_df['Age'].min(), orig_df['Age'].max()
mu, sigma = orig_df["Age"].mean(), orig_df["Age"].std()

# number of rows
n = orig_df.shape[0]

print('max: ', orig_df['Age'].max())
print('min: ', orig_df['Age'].min())

# vector of random values using the truncated normal distribution.  
X = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)
rands = X.rvs(n)

#---------------------------------

# boolean mask that is True for the elements of the original column that are NOT NaN
idx = np.isfinite(orig_df['Age'])

# use the mask to replace the NON-NaNs in the random array with the good values from the original column
rands[idx.values] = orig_df.loc[idx, 'Age'].values

## At this point rands is now the cleaned column of data we wanted, so push it in to the original df
orig_df['Age'] = rands

print('After this Gaussian replacement, the number of NaNs is:', orig_df['Age'].isnull().sum())

In [2]:
## John recommends trying to learn how to merge/join.
## But make their indexes different, so it isn't just a simple thing.
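A possible starting point (hypothetical frames with deliberately mismatched indexes):

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cam"]},
                    index=[10, 11, 12])
right = pd.DataFrame({"user_id": [2, 3, 4], "score": [0.5, 0.9, 0.1]},
                     index=["a", "b", "c"])

# merging on a shared column ignores both indexes entirely
merged = pd.merge(left, right, on="user_id", how="inner")
print(merged)  # rows for user_id 2 and 3 only

# join() aligns on index, so to join on user_id you set it as the index first
joined = left.set_index("user_id").join(right.set_index("user_id"), how="inner")
print(joined)
```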

In [3]:
# NOTE:
# how to delete a column properly
#item_purchase_log_df_clean = item_purchase_log_df_clean.drop("item_id_nm", axis=1)
## where axis=1 selects columns (axis=0 selects rows)
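A runnable version of that note (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({"item_id": [1, 2], "item_id_nm": ["a", "b"]})

# the axis=1 keyword form reads better than the bare positional 1
df = df.drop("item_id_nm", axis=1)
# equivalently: df = df.drop(columns="item_id_nm")
print(df.columns.tolist())  # ['item_id']
```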

In [ ]:
# Convert categorical columns to ordinal
from sklearn.preprocessing import LabelEncoder
# Convert categorical column values to ordinal for model fitting
le_title = LabelEncoder()
# To convert to ordinal:
orig_df.Title = le_title.fit_transform(orig_df.Title)
# To convert back to categorical:
#orig_df.Title = le_title.inverse_transform(orig_df.Title)
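A small round-trip demo on made-up titles (LabelEncoder assigns integer codes in sorted order of the distinct values):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", "Mr"]})

le = LabelEncoder()
codes = le.fit_transform(df["Title"])  # sorted classes: Miss->0, Mr->1, Mrs->2
df["Title"] = codes
print(df["Title"].tolist())            # [1, 2, 0, 1]

# inverse_transform round-trips back to the original strings
df["Title"] = le.inverse_transform(df["Title"])
print(df["Title"].tolist())            # ['Mr', 'Mrs', 'Miss', 'Mr']
```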

In [ ]:
### OR you could add new indicator (dummy) columns with True/False values
titles_dummies = pd.get_dummies(orig_df['Title'],prefix='Title')
orig_df = pd.concat([orig_df,titles_dummies],axis=1)
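A self-contained sketch of the dummy-column approach (made-up titles again):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss"]})

# one indicator column per distinct value, prefixed with 'Title_'
dummies = pd.get_dummies(df["Title"], prefix="Title")
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())  # ['Title', 'Title_Miss', 'Title_Mr', 'Title_Mrs']
```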