Introduction to Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to work with panel data.

pandas is well suited for tabular data with heterogeneously-typed colums, as in an SQL table or Excel spreadsheet

Key Features:

Easy handling of missing data
Automatic and explicit data alignment
Intelligent label-based slicing, indexing and subsetting of large data sets
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
Robust IO Tools for loading data from flat files, Excel files, databases etc.



In [ ]:

    
from IPython.core.display import HTML
HTML("<iframe src=http://pandas.pydata.org  width=800 height=350></iframe>")

Before we explore the package pandas, let's import pandas package. We often use pd to refer to pandas in convention.



In [ ]:

    
%matplotlib inline
import pandas as pd
import numpy as np

Series

A Series is a single vector of data (like a Numpy array) with an index that labels each element in the vector.



In [ ]:

    
counts = pd.Series([223, 43, 53, 24, 43])
counts



In [ ]:

    
type(counts)

If an index is not specified, a default sequence of integers is assigned as index.

We can access the values like an array



In [ ]:

    
counts[0]



In [ ]:

    
counts[1:4]

You can get the array representation and index object of the Series via its values and index atrributes, respectively.



In [ ]:

    
counts.values



In [ ]:

    
counts.index

We can assign meaningful labels to the index, if they are available:



In [ ]:

    
fruit = pd.Series([223,  43,  53,  24, 43],
                 index=['apple', 'orange', 'banana', 'pears', 'lemon'])

fruit



In [ ]:

    
fruit.index

These labels can be used to refer to the values in the Series.



In [ ]:

    
fruit['apple']



In [ ]:

    
fruit[['apple', 'lemon']]

We can give both the array of values and the index meaningful labels themselves:



In [ ]:

    
fruit.name = 'counts'
fruit.index.name = 'fruit'
fruit

Operations can be applied to Series without losing the data structure.
Use bool array to filter Series



In [ ]:

    
fruit > 50



In [ ]:

    
fruit[fruit > 50]

Critically, the labels are used to align data when used in operations with other Series objects.



In [ ]:

    
fruit2 = pd.Series([11, 12, 13, 14, 15],
                   index=fruit.index)
fruit2



In [ ]:

    
fruit2 = fruit2.drop('apple')
fruit2



In [ ]:

    
fruit2['grape'] = 18
fruit2



In [ ]:

    
fruit3 = fruit + fruit2
fruit3

Contrast this with arrays, where arrays of the same length will combine values element-wise; Adding Series combined values with the same label in the resulting series.
Notice that the missing values were propogated by addition.



In [ ]:

    
fruit3.dropna()



In [ ]:

    
fruit3



In [ ]:

    
fruit3.isnull()

DataFrame

A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet.Each column can be a different value type (numeric, string, boolean etc).



In [ ]:

    
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
       'year':[2000, 2001, 2002, 2001, 2003],
       'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df



In [ ]:

    
len(df)  # Get the number of rows in the dataframe



In [ ]:

    
df.shape # Get the (rows, cols) of the dataframe



In [ ]:

    
df.T



In [ ]:

    
df.columns  # get the index of columns



In [ ]:

    
df.index  # get the index of the row



In [ ]:

    
df.dtypes



In [ ]:

    
df.describe()

There are three basic ways to access the data in the dataframe
1. use DataFrame[] to access data quickly
2. use DataFrame.iloc[row, col] integer position based selection method
3. use DataFrame.loc[row, col] label based selection method



In [ ]:

    
df



In [ ]:

    
df['state']  # indexing by label



In [ ]:

    
df[['state', 'year']]  # indexing by a list of label



In [ ]:

    
df[:2]  # numpy-style indexing



In [ ]:

    
df.iloc[0, 0]



In [ ]:

    
df.iloc[0, :]



In [ ]:

    
df.iloc[:, 1]



In [ ]:

    
df.iloc[:2, 1:3]



In [ ]:

    
df.loc[:, 'state']



In [ ]:

    
df.loc[:, ['state', 'year']]

Add new column and delete column



In [ ]:

    
df['debt'] = np.random.randn(len(df))
df['rain'] = np.abs(np.random.randn(len(df)))
df



In [ ]:

    
df = df.drop('debt', axis=1)
df



In [ ]:

    
row1 = pd.Series([4.5, 'Nevada', 2005, 2.56], index=df.columns)
df.append(row1,ignore_index=True)



In [ ]:

    
df.drop([0, 1])

data filtering



In [ ]:

    
df['pop'] < 2



In [ ]:

    
df



In [ ]:

    
df.loc[df['pop'] < 2, 'pop'] = 2

df



In [ ]:

    
df['year'] == 2001



In [ ]:

    
(df['pop'] > 3) | (df['year'] == 2001)



In [ ]:

    
df.loc[(df['pop'] > 3) | (df['year'] == 2001), 'pop'] = 3
df

Sorting index



In [ ]:

    
df.sort_index(ascending=False)



In [ ]:

    
df.sort_index(axis=1, ascending=False)

Summarizing and Computing Descriptive Statistics

Built in functions to calculate the values over row or columns.



In [ ]:

    
df



In [ ]:

    
df.loc[:, ['pop', 'rain']].sum()



In [ ]:

    
df.loc[:,['pop', 'rain']].mean()



In [ ]:

    
df.loc[:, ['pop', 'rain']].var()



In [ ]:

    
df.loc[:, ['pop', 'rain']].cumsum()

Apply functions to each column or row of a DataFrame



In [ ]:

    
df



In [ ]:

    
df.loc[:, ['pop', 'rain']].apply(lambda x: x.max() - x.min())  # apply new functions to each row

Grouped and apply



In [ ]:

    
df



In [ ]:

    
df.groupby(df['state']).mean()



In [ ]:

    
df.groupby(df['state'])[['pop', 'rain']].apply(lambda x: x.max() - x.min())



In [ ]:

    
grouped = df.groupby(df['state'])
group_list = []
for name, group in grouped:
    print(name)
    print(group)
    print('\n')

Set Hierarchical indexing



In [ ]:

    
df



In [ ]:

    
df_h = df.set_index(['state', 'year'])
df_h



In [ ]:

    
df_h.index.is_unique



In [ ]:

    
df_h.loc['Ohio', :].max() - df_h.loc['Ohio', :].min()

Import and Store Data

Read and write csv file.



In [ ]:

    
df



In [ ]:

    
df.to_csv('test_csv_file.csv',index=False)



In [ ]:

    
%more test_csv_file.csv



In [ ]:

    
df_csv = pd.read_csv('test_csv_file.csv')
df_csv

Read and write excel file.



In [ ]:

    
writer = pd.ExcelWriter('test_excel_file.xlsx')
df.to_excel(writer, 'sheet1', index=False)
writer.save()



In [ ]:

    
df_excel = pd.read_excel('test_excel_file.xlsx', sheetname='sheet1')
df_excel



In [ ]:

    
pd.read_table??

Filtering out Missing Data

You have a number of options for filtering out missing data.



In [ ]:

    
df = pd.DataFrame([[1, 6.5, 3.], [1., np.nan, np.nan],
                [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

df



In [ ]:

    
cleaned = df.dropna()  # delete rows with Nan value
cleaned



In [ ]:

    
df.dropna(how='all')  # delete rows with all Nan value



In [ ]:

    
df.dropna(thresh=2)  # keep the rows with at least thresh non-Nan value



In [ ]:

    
df.fillna(0)  # fill Nan with a constant

Plotting in DataFrame



In [ ]:

    
variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                       'gamma': np.random.gamma(1, size=100), 
                       'poisson': np.random.poisson(size=100)})
variables.head()



In [ ]:

    
variables.shape



In [ ]:

    
variables.cumsum().plot()



In [ ]:

    
variables.cumsum().plot(subplots=True)

Exercise 1

Write a Python program to create and display a DataFrame from a specified dictionary data which has the index labels. Sample Python dictionary data and list labels:

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']} labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']



In [ ]:

Exercise 2

Select the rows the score is between 15 and 20



In [ ]:

Exercise 3

Write a Python program to change the score in row 'd' to 11.5.



In [ ]:

Exercise 4

Write a Python program to replace the 'qualify' column contains the values 'yes' and 'no' with True and False.



In [ ]: