In [ ]:
import pandas as pd
%matplotlib inline

A simple example to illustrate the intuition behind dummy variables


In [ ]:
df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})

In [ ]:
df

In [ ]:
pd.get_dummies(df['key'],prefix='key')

Now we have a matrix of indicator values based on the presence or absence of each attribute value in our dataset
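
If we want to use these indicator columns alongside the original values, we can join them back onto the frame. A minimal sketch of that step, using the same df as above:


In [ ]:
#attach the indicator columns to the original frame
df.join(pd.get_dummies(df['key'], prefix='key'))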

Now let's look at another example using our flight data


In [ ]:
df = pd.read_csv('data/ontime_reports_may_2015_ny.csv')

In [ ]:
#count number of NaNs in column
df['DEP_DELAY'].isnull().sum()

In [ ]:
#calculate the fraction of the total number of instances that this represents
df['DEP_DELAY'].isnull().sum()/float(len(df))

We could explore whether the NaNs are actually zero delays, but we'll just filter them out for now, especially since they represent such a small number of instances
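
Before filtering, a quick way to eyeball those rows is to look at a few of them (in the full BTS on-time dataset a CANCELLED column would explain many of these missing delays, but we'll leave that check as an aside):


In [ ]:
#inspect a few of the rows where DEP_DELAY is missing
df[df['DEP_DELAY'].isnull()].head()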


In [ ]:
#filter DEP_DELAY NaNs
df = df[pd.notnull(df['DEP_DELAY'])]
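
A quick sanity check that the filter worked; the count of missing DEP_DELAY values should now be zero:


In [ ]:
#should now return 0
df['DEP_DELAY'].isnull().sum()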

We can discretize the continuous DEP_DELAY value by giving it a value of 1 if the flight is delayed and 0 if it's not, recording the result in a separate column. (We could also code -1 for early, 0 for on time, and 1 for late; a sketch of that alternative follows below.)


In [ ]:
#code whether the flight was delayed (1) or not (0)
df['IS_DELAYED'] = df['DEP_DELAY'].apply(lambda x: 1 if x > 0 else 0)

In [ ]:
#Let's check that our column was created properly
df[['DEP_DELAY','IS_DELAYED']]
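
As mentioned above, we could also use a three-level coding: -1 for early, 0 for on time, and 1 for late. A minimal sketch of that alternative (the EARLY_ONTIME_LATE column name is just illustrative):


In [ ]:
import numpy as np
#-1 = departed early, 0 = on time, 1 = late
df['EARLY_ONTIME_LATE'] = np.sign(df['DEP_DELAY']).astype(int)
df[['DEP_DELAY','EARLY_ONTIME_LATE']].head()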

Dummy variables create a new indicator column for each distinct value of a categorical attribute, with a 1 marking the rows where that value is present and a 0 where it is absent. Let's apply this to the ORIGIN airport code.


In [ ]:
pd.get_dummies(df['ORIGIN'],prefix='origin')

Now let's look at normalizing values, using scikit-learn's preprocessing module


In [ ]:
#Normalize the data attributes for the Iris dataset
#Example from Jump Start Scikit-Learn https://machinelearningmastery.com/jump-start-scikit-learn/
from sklearn.datasets import load_iris
from sklearn import preprocessing

#load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

#normalize the data attributes
normalized_X = preprocessing.normalize(X)

In [ ]:
#compare a few original rows with their normalized versions
list(zip(X, normalized_X))[:5]
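
By default preprocessing.normalize rescales each sample (row) to unit L2 norm, so we can sanity-check the result by computing the row norms, which should all be 1:


In [ ]:
import numpy as np
#each row of normalized_X should have Euclidean length 1
np.linalg.norm(normalized_X, axis=1)[:5]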
