In [ ]:
import pandas as pd
%matplotlib inline

A simple example to illustrate the intuition behind dummy variables


In [ ]:
df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})

In [ ]:
df

In [ ]:
pd.get_dummies(df['key'],prefix='key')

Now we have a matrix of indicator values based on the presence or absence of each attribute value in our dataset
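
If we want to use these indicator columns alongside the original values, we can join them back onto the frame. A minimal sketch of that step, using the same df as above:


In [ ]:
#attach the indicator columns to the original frame
df.join(pd.get_dummies(df['key'], prefix='key'))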

Now let's look at another example using our flight data


In [ ]:
df = pd.read_csv('data/ontime_reports_may_2015_ny.csv')

In [ ]:
#count number of NaNs in column
df['DEP_DELAY'].isnull().sum()

In [ ]:
#calculate the fraction of the total number of instances that this represents
df['DEP_DELAY'].isnull().sum()/float(len(df))

We could explore whether the NaNs are actually zero delays, but we'll just filter them out for now, especially since they represent such a small number of instances
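
Before filtering, a quick way to eyeball those rows is to look at a few of them (in the full BTS on-time dataset a CANCELLED column would explain many of these missing delays, but we'll leave that check as an aside):


In [ ]:
#inspect a few of the rows where DEP_DELAY is missing
df[df['DEP_DELAY'].isnull()].head()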


In [ ]:
#filter DEP_DELAY NaNs
df = df[pd.notnull(df['DEP_DELAY'])]
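
A quick sanity check that the filter worked; the count of missing DEP_DELAY values should now be zero:


In [ ]:
#should now return 0
df['DEP_DELAY'].isnull().sum()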

We can discretize the continuous DEP_DELAY value by giving it a value of 1 if the flight is delayed and 0 if it's not, recording the result in a separate column. (We could also code -1 for early, 0 for on time, and 1 for late; a sketch of that alternative follows below.)


In [ ]:
#code whether the flight was delayed (1) or not (0)
df['IS_DELAYED'] = df['DEP_DELAY'].apply(lambda x: 1 if x > 0 else 0)

In [ ]:
#Let's check that our column was created properly
df[['DEP_DELAY','IS_DELAYED']]
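
As mentioned above, we could also use a three-level coding: -1 for early, 0 for on time, and 1 for late. A minimal sketch of that alternative (the EARLY_ONTIME_LATE column name is just illustrative):


In [ ]:
import numpy as np
#-1 = departed early, 0 = on time, 1 = late
df['EARLY_ONTIME_LATE'] = np.sign(df['DEP_DELAY']).astype(int)
df[['DEP_DELAY','EARLY_ONTIME_LATE']].head()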

Dummy variables create a new indicator column for each distinct value of a categorical attribute, with a 1 marking the rows where that value is present and a 0 where it is absent. Let's apply this to the ORIGIN airport code.


In [ ]:
pd.get_dummies(df['ORIGIN'],prefix='origin')

Now let's look at normalizing values, using scikit-learn's preprocessing module


In [ ]:
#Normalize the data attributes for the Iris dataset
#Example from Jump Start Scikit-Learn https://machinelearningmastery.com/jump-start-scikit-learn/
from sklearn.datasets import load_iris
from sklearn import preprocessing

#load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

#normalize the data attributes
normalized_X = preprocessing.normalize(X)

In [ ]:
#compare a few original rows with their normalized versions
list(zip(X, normalized_X))[:5]
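
By default preprocessing.normalize rescales each sample (row) to unit L2 norm, so we can sanity-check the result by computing the row norms, which should all be 1:


In [ ]:
import numpy as np
#each row of normalized_X should have Euclidean length 1
np.linalg.norm(normalized_X, axis=1)[:5]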
