pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language

http://pandas.pydata.org/



In [ ]:

    
# Series
import numpy as np
import pandas as pd
myArray = np.array([2,3,4])
row_names = ['p','q','r']
mySeries = pd.Series(myArray,index=row_names)
print mySeries
print mySeries[0]
print mySeries['p']



In [ ]:

    
# Dataframes
myArray = np.array([[2,3,4],[5,6,7]])
row_names = ['p','q']
col_names = ['One','Two','Three']
myDataFrame = pd.DataFrame(myArray,index = row_names,columns = col_names)
print myDataFrame
print 'Method 1 :'
print 'One column = \n%s'%myDataFrame['One']
print 'Method 2 :'
print 'One column = \n%s'%myDataFrame.One

Working with Data



In [ ]:

    
# Let's load data from a csv
df = pd.read_csv("../data/diabetes.csv")
df.info()



In [ ]:

    
# Examine data
df.head()

Normalizing data

If we look at two of the features in the data we can see they are of different scales.



In [ ]:

    
%matplotlib inline
import matplotlib.pyplot as plt
# Histogram
bins=range(0,100,10)

plt.hist(df["Age"].values, bins, alpha=0.5, label='age')
plt.show()

plt.hist(df["BMI"].values, bins, alpha=0.5, label='BMI')
plt.show()

plt.hist(df["Age"].values, bins, alpha=0.5, label='age')
plt.hist(df["BMI"].values, bins, alpha=0.5, label='BMI')
plt.show()

We can use standard deviation to normalize data.

Here we generate a random set of data that creates a dataset that follows a Standard deviation from the mean.

https://en.wikipedia.org/wiki/Standard_deviation



In [ ]:

    
from numpy.random import normal
gaussian_numbers = normal(size=5000)
plt.hist(gaussian_numbers, bins=np.linspace(-5.0, 5.0, num=20)) # Set bin bounds
plt.show()

We are now going to normalize the data so we give both data items the same weight.

for each column, we compute the mean and remove the standard deviation
Let's say we have points x1, x2,.. xn in column "AGE"
mean = $(1/n) * (x1+x2+...xn)$
std = $\sqrt{(1/n) * ( (x1-mean)^2 + (x2 -mean)^2 + ...)}$



In [ ]:

    
# Let's start with an example on the AGE feature
# I create a new array for easier manipulation
arr_age = df["Age"].values
arr_age[:10]

with numpy array we can do simple vectorized operations so if i do

    arr = arr - c

it subtracts c to all elements in arr if i do

    arr = arr/c

it divides all elements in arr by c



In [ ]:

    
mean_age = np.mean(arr_age)
std_age = np.std(arr_age)
print 'Age Mean: {} Std:{}'.format(mean_age, std_age)



In [ ]:

    
# So to compute the standardized array, I write :
arr_age_new = (arr_age - mean_age)/std_age
arr_age_new[:10]



In [ ]:

    
# I can now apply the same idea to a pandas dataframe
# using some built in pandas functions :
df_new = (df - df.mean()) / df.std()
df_new.head()



In [ ]:

    
df.head()



In [ ]:

    
# Histogram
bins=np.linspace(-5.0, 5.0, num=20)

plt.hist(df_new["Age"].values, bins, alpha=0.5, label='age')
plt.hist(df_new["BMI"].values, bins, alpha=0.5, label='BMI')
plt.show()



In [ ]: