PANDAS

Module 1: Data Structure and Applications

pandas is an open source, BSD-licensed Python library providing fast, flexible, easy-to-use, data structures and data analysis tools for working with “relational” or “labeled” data
Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
Created by Wes McKinney in 2008 used to manipulate, clean, query data https://github.com/wesm/pydata-book

Documentation

Import: Pandas, NumPy, Matplotlib



In [4]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Overview

Two primary data structures of pandas

Series (1-dimensional) array
DataFrame (tabular, spreadsheet) Indexing and slicing of pandas objects
Arithmetic operations and function applications

Series

One-dimensional array-like object containing:

Array of data (of any NumPy data type)
Associated array of data labels, called its index
Mapping of index values to data values
Array operations preserve the index-value link
Series automatically align differently indexed data in arithmetic operations



In [6]:

    
from pandas import Series, DataFrame
s = Series([3, -1, 0, 5])
s









    Out[6]:





0    3
1   -1
2    0
3    5
dtype: int64

Get index object of the Series via its index attributes, or create our own index:



In [10]:

    
s.index









    Out[10]:





RangeIndex(start=0, stop=4, step=1)



In [16]:

    
s2 = Series([13, -3, 5, 9], 
            index = ["a", "b","c", "d"])
s2









    Out[16]:





a    13
b    -3
c     5
d     9
dtype: int64

Operations

Any operations preserve the index-value link



In [19]:

    
s + 3









    Out[19]:





0    6
1    2
2    3
3    8
dtype: int64



In [20]:

    
s2 * 3









    Out[20]:





a    39
b    -9
c    15
d    27
dtype: int64

Indexing

Selecting items from an object by index



In [22]:

    
s[s > 0]









    Out[22]:





0    3
3    5
dtype: int64



In [23]:

    
s2[["b","c"]]









    Out[23]:





b   -3
c    5
dtype: int64

Automatic Alignment

Series automatically aligns differently indexed data in arithmetic operations



In [26]:

    
names1 = ["Ann", "Bob", "Carl", "Doris"] 
balance1 = [200, 100, 300, 400] 
account1 = Series(balance1, index=names1) 
account1









    Out[26]:





Ann      200
Bob      100
Carl     300
Doris    400
dtype: int64



In [29]:

    
names2 = ["Carl", "Doris", "Ann", "Bob"] 
balance2 = [20, 10, 30, 40] 
account2 = Series(balance2, index=names2)
account2









    Out[29]:





Carl     20
Doris    10
Ann      30
Bob      40
dtype: int64



In [30]:

    
# Automatic alignment by index
account1 + account2









    Out[30]:





Ann      230
Bob      140
Carl     320
Doris    410
dtype: int64

DataFrames

Tabular, spreadsheet-like data structure
Contains an ordered collection of columns, each of which can be a different value type:
- Numeric
- String
- Boolean

DataFrame, has both a row and column index

Extract rows and columns
Handle missing data

Easiest way to construct DataFrame is from a dictionary of equal-length lists or NumPy arrays Example: DataFrame

First, create a dictionary with 3 key-value pairs
Then, create a data frame



In [54]:

    
# values are equal length lists; keys
data = {"Name": ["Ann", "Bob", "Carl", "Doris"],
        "HW1": [ 90, 85, 70, 100],
        "HW2": [ 80, 70, 90, 90]}

# Create a data frame 
grades = DataFrame(data)
grades



In [56]:

    
grades = DataFrame(data, columns = ["Name", "HW1", "HW2"])
grades

Extracting COLUMNS

A column in DataFrame can be retrieved as a Series either by:

dict-like notation: grades[“Name”]
by attribute: grades.Name



In [57]:

    
grades["Name"]









    Out[57]:





0      Ann
1      Bob
2     Carl
3    Doris
Name: Name, dtype: object

Extracting ROWS

Rows can be retrieved by a position or name by a couple of methods:

.loc for label based indexing
.iloc for positional indexing



In [58]:

    
grades.iloc[2]









    Out[58]:





Name    Carl
HW1       70
HW2       90
Name: 2, dtype: object

Modifying Data

Columns can be modified by assignment



In [59]:

    
grades["HW3"] = 0
grades

Assigning a Series

If we assign a Series, it will be conformed exactly to the DataFrame’s index, inserting missing values in any hole.



In [60]:

    
HW3 = Series([70, 90], index = [1, 3])
grades["HW3"] = HW3
grades

Transposing a Matrix – .T method



In [61]:

    
grades.T

Values

The values attribute returns the data contained in the DataFrame as a 2D array



In [62]:

    
grades.values









    Out[62]:





array([['Ann', 90, 80, nan],
       ['Bob', 85, 70, 70.0],
       ['Carl', 70, 90, nan],
       ['Doris', 100, 90, 90.0]], dtype=object)

INDEXING and SLICING with pandas objects

Index Objects

Panda’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).
Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index



In [65]:

    
grades = Series([ 60, 90, 80, 75],
                index= ["a", "b", "c", "d"])
grades









    Out[65]:





a    60
b    90
c    80
d    75
dtype: int64

ReIndexing Series

ReIndexing creates a new object with the data conformed to a new index



In [64]:

    
grades = Series([ 60, 90, 80, 75], 
                index = ["Bob", "Tom", "Ann", "Jane"])
grades









    Out[64]:





Bob     60
Tom     90
Ann     80
Jane    75
dtype: int64

Forward Filling

The method option ffill forward fills the values:



In [71]:

    
a = Series(['A', 'B', 'C'], index = [0, 3, 5])
a









    Out[71]:





0    A
3    B
5    C
dtype: object



In [74]:

    
a.reindex(range(6), method="ffill")









    Out[74]:





0    A
1    A
2    A
3    B
4    B
5    C
dtype: object

Indexing Series:

Series indexing works analogously to NumPy array indexing except you can use the Series’ index values instead of only integers



In [85]:

    
s = Series(np.arange(5), index = ["a", "b", "c", "d", "e"])
s









    Out[85]:





a    0
b    1
c    2
d    3
e    4
dtype: int64



In [86]:

    
s["c"]









    Out[86]:





2



In [87]:

    
s[3]









    Out[87]:





3

Slicing Series:

Count elements in series starting from 0
Slices up to, but not including last item in index
Can specify order of items to return



In [88]:

    
s[1:3]









    Out[88]:





b    1
c    2
dtype: int64



In [89]:

    
s[s >= 2]









    Out[89]:





c    2
d    3
e    4
dtype: int64



In [90]:

    
s[["b", "e", "a"]]









    Out[90]:





b    1
e    4
a    0
dtype: int64

Setting works just as with arrays:

Setting values b through d to 33 inclusive



In [92]:

    
s["b" : "d"] = 33
s









    Out[92]:





a     0
b    33
c    33
d    33
e     4
dtype: int64

INDEXING DataFrames

Indexing a DataFrame retrieves one or more columns either with a single value or sequence

Retrieve single columns
Select multiple columns



In [94]:

    
grades = DataFrame(np.arange(16).reshape((4, 4)), 
                   index = ["Andy", "Brad", "Carla", "Donna"],
                   columns = ["HW1", "HW2", "HW3", "HW4"])
grades



In [95]:

    
grades["HW1"]









    Out[95]:





Andy      0
Brad      4
Carla     8
Donna    12
Name: HW1, dtype: int64



In [96]:

    
grades[["HW2", "HW3"]]

Selecting rows:

Return rows in HW3 with values greater than 6



In [99]:

    
g = DataFrame(np.arange(16).reshape((4, 4)), 
                   index = ["Ann", "Bob", "Carl", "Donna"],
                   columns = ["HW1", "HW2", "HW3", "HW4"])
g[g["HW3"] > 6]



In [101]:

    
# Select rows up to, but not including 2

g[:2]

pandas FUNCTION Applications

How to apply:

Arithmetic operations on DataFrames
Arithmetic operations between differently indexed-objects
Operations between a DataFrames and a Series
Applications of NumPy ufuncs (element-wise array methods) to pandas objects

Arithmetic and data alignment

When adding together two objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs
The internal data alignment introduces NaN values in the indices that don’t overlap



In [106]:

    
a = Series([5, 4, 0, 7], 
           index = ["a", "c", "d", "e"])
b = Series([-1, 3, 4, -2, 1], 
           index = ["a", "c", "e", "f", "g"])
a









    Out[106]:





a    5
c    4
d    0
e    7
dtype: int64



In [107]:

    
b









    Out[107]:





a   -1
c    3
e    4
f   -2
g    1
dtype: int64

Adding Series data objects

Alignment is performed on both the rows and the columns
Adding two Series returns a DataFrame whose index and columns are unions of each Series
Where a value is included in one object, but not the other, will return NaN



In [103]:

    
a + b









    Out[103]:





a     4.0
c     7.0
d     NaN
e    11.0
f     NaN
g     NaN
dtype: float64

Arithmetic methods with fill values

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object, but not in the other
We specify add, with fill_value, and the missing value in other Series assumes value of 0



In [108]:

    
a.add(b, fill_value=0)









    Out[108]:





a     4.0
c     7.0
d     0.0
e    11.0
f    -2.0
g     1.0
dtype: float64

Operations between DataFrame and Series are similar

By default, arithmetic between a DataFrame and a Series:

matches the index of the Series, on the DataFrame’s columns,
broadcasting down the rows



In [116]:

    
st0 = Series([0, 1, 2, 3], 
           index = ["HW1", "HW2", "HW3", "HW4"])
st0









    Out[116]:





HW1    0
HW2    1
HW3    2
HW4    3
dtype: int64



In [117]:

    
grades + st0

Importing from Excel

Pandas support various import and export options, very easy to import file from Excel
For example, suppose we have an Excel file “Excel_table” with contents:
Need to parse the worksheet in which the data is located



In [120]:

    
xls_file = pd.ExcelFile('Excel_table.xlsx')
t = xls_file.parse("Sheet1")
t

Function Applications

NumPy ufuncs (element-wise array methods) work with pandas object
Use abs() function to obtain absolute values of object d
Returns DF with all positive values



In [121]:

    
d = DataFrame(np.random.randn(3, 3), 
              columns=list("xyz"),
              index = ["A", "B", "C"])
d



In [122]:

    
np.abs(d)

Function Definition

We can define and then apply a function to find both min and max values in each column
Returns a Series that will contain pairs of min and max values for each of three columns
Use keyword apply and then classify function to apply in parentheses



In [123]:

    
def minmax(t):
    return Series([t.min(), t.max()],
                   index = ["min", "max"])
d.apply(minmax)

Sorting Series object

sort_index() method returns a new object sorted lexi-cographically by index



In [125]:

    
a = Series(range(5),
           index = ["Bob", "john", "Jane", "Ann", "Cathy"])

a.sort_index()









    Out[125]:





Ann      3
Bob      0
Cathy    4
Jane     2
john     1
dtype: int64

Sorting DataFrame

Sort by index on either axis
With sort_index, rows will be sorted alphabetically, and can specify the axis (axis=0, rows)
if we specify axis = 1, the the columns will be sorted alphabetically



In [126]:

    
df = DataFrame(np.arange(16).reshape((4, 4)), 
                   index = ["B", "Q", "M", "A"],
                   columns = list("pseb"))
df



In [127]:

    
# Sort index alphabetically by ROW
df.sort_index(axis=0)



In [129]:

    
# Sort index alpha by COLUMN
df.sort_index(axis=1)

Sorting by Values

To sort a Series by its values, use its sort_values method:



In [130]:

    
s= Series([7, -2, 0, 8, -1])
s.sort_values()









    Out[130]:





1   -2
4   -1
2    0
0    7
3    8
dtype: int64

Sorting with Missing Values

Using sort_values will sort numerical values first
NaN missing values will be included at end of Series with index of their position in original



In [133]:

    
s2 = Series([7, np.nan, -2, 0, np.nan, 8, -1])
s2
s2.sort_values()









    Out[133]:





2   -2.0
6   -1.0
3    0.0
0    7.0
5    8.0
1    NaN
4    NaN
dtype: float64

Sum a DataFrame

Calling DataFrame’s sum method returns a Series containing column sums
Default returns sum across columns; set axis=1 to sum across rows



In [134]:

    
df = DataFrame(np.arange(9).reshape(3, 3), 
              columns=list("xyz"),
              index = ["A", "B", "C"])
df



In [135]:

    
df.sum()









    Out[135]:





x     9
y    12
z    15
dtype: int64



In [136]:

    
df.sum(axis=1)









    Out[136]:





A     3
B    12
C    21
dtype: int64

idxmin and idxmax

return the index value where the minimum or maximum values are attained:



In [137]:

    
df.idxmax()









    Out[137]:





x    C
y    C
z    C
dtype: object



In [138]:

    
df.idxmin()









    Out[138]:





x    A
y    A
z    A
dtype: object



In [ ]:

	x	y	z
A	1.048421	-1.972915	-1.650191
B	1.016919	-0.267448	-0.093964
C	-0.192132	0.774659	-0.130791

	x	y	z
A	1.048421	1.972915	1.650191
B	1.016919	0.267448	0.093964
C	0.192132	0.774659	0.130791

	x	y	z
min	-0.192132	-1.972915	-1.650191
max	1.048421	0.774659	-0.093964