PANDAS

Module 1: Data Structure and Applications

  • pandas is an open source, BSD-licensed Python library providing fast, flexible, easy-to-use, data structures and data analysis tools for working with “relational” or “labeled” data
  • Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
  • Created by Wes McKinney in 2008 used to manipulate, clean, query data https://github.com/wesm/pydata-book

Documentation

Import: Pandas, NumPy, Matplotlib


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Overview

Two primary data structures of pandas

  • Series (1-dimensional) array
  • DataFrame (tabular, spreadsheet) Indexing and slicing of pandas objects
    Arithmetic operations and function applications

Series

One-dimensional array-like object containing:

  • Array of data (of any NumPy data type)
  • Associated array of data labels, called its index
  • Mapping of index values to data values
  • Array operations preserve the index-value link
  • Series automatically align differently indexed data in arithmetic operations

In [6]:
from pandas import Series, DataFrame
s = Series([3, -1, 0, 5])
s


Out[6]:
0    3
1   -1
2    0
3    5
dtype: int64

Get index object of the Series via its index attributes, or create our own index:


In [10]:
s.index


Out[10]:
RangeIndex(start=0, stop=4, step=1)

In [16]:
s2 = Series([13, -3, 5, 9], 
            index = ["a", "b","c", "d"])
s2


Out[16]:
a    13
b    -3
c     5
d     9
dtype: int64

Operations

  • Any operations preserve the index-value link

In [19]:
s + 3


Out[19]:
0    6
1    2
2    3
3    8
dtype: int64

In [20]:
s2 * 3


Out[20]:
a    39
b    -9
c    15
d    27
dtype: int64

Indexing

  • Selecting items from an object by index

In [22]:
s[s > 0]


Out[22]:
0    3
3    5
dtype: int64

In [23]:
s2[["b","c"]]


Out[23]:
b   -3
c    5
dtype: int64

Automatic Alignment

  • Series automatically aligns differently indexed data in arithmetic operations

In [26]:
names1 = ["Ann", "Bob", "Carl", "Doris"] 
balance1 = [200, 100, 300, 400] 
account1 = Series(balance1, index=names1) 
account1


Out[26]:
Ann      200
Bob      100
Carl     300
Doris    400
dtype: int64

In [29]:
names2 = ["Carl", "Doris", "Ann", "Bob"] 
balance2 = [20, 10, 30, 40] 
account2 = Series(balance2, index=names2)
account2


Out[29]:
Carl     20
Doris    10
Ann      30
Bob      40
dtype: int64

In [30]:
# Automatic alignment by index
account1 + account2


Out[30]:
Ann      230
Bob      140
Carl     320
Doris    410
dtype: int64

DataFrames

  • Tabular, spreadsheet-like data structure
  • Contains an ordered collection of columns, each of which can be a different value type:
    • Numeric
    • String
    • Boolean

DataFrame, has both a row and column index

  • Extract rows and columns
  • Handle missing data

Easiest way to construct DataFrame is from a dictionary of equal-length lists or NumPy arrays Example: DataFrame

  • First, create a dictionary with 3 key-value pairs
  • Then, create a data frame

In [54]:
# values are equal length lists; keys
data = {"Name": ["Ann", "Bob", "Carl", "Doris"],
        "HW1": [ 90, 85, 70, 100],
        "HW2": [ 80, 70, 90, 90]}

# Create a data frame 
grades = DataFrame(data)
grades


Out[54]:
HW1 HW2 Name
0 90 80 Ann
1 85 70 Bob
2 70 90 Carl
3 100 90 Doris

In [56]:
grades = DataFrame(data, columns = ["Name", "HW1", "HW2"])
grades


Out[56]:
Name HW1 HW2
0 Ann 90 80
1 Bob 85 70
2 Carl 70 90
3 Doris 100 90

Extracting COLUMNS

A column in DataFrame can be retrieved as a Series either by:

  • dict-like notation: grades[“Name”]
  • by attribute: grades.Name

In [57]:
grades["Name"]


Out[57]:
0      Ann
1      Bob
2     Carl
3    Doris
Name: Name, dtype: object

Extracting ROWS

Rows can be retrieved by a position or name by a couple of methods:

  • .loc for label based indexing
  • .iloc for positional indexing

In [58]:
grades.iloc[2]


Out[58]:
Name    Carl
HW1       70
HW2       90
Name: 2, dtype: object

Modifying Data

  • Columns can be modified by assignment

In [59]:
grades["HW3"] = 0
grades


Out[59]:
Name HW1 HW2 HW3
0 Ann 90 80 0
1 Bob 85 70 0
2 Carl 70 90 0
3 Doris 100 90 0

Assigning a Series

  • If we assign a Series, it will be conformed exactly to the DataFrame’s index, inserting missing values in any hole.

In [60]:
HW3 = Series([70, 90], index = [1, 3])
grades["HW3"] = HW3
grades


Out[60]:
Name HW1 HW2 HW3
0 Ann 90 80 NaN
1 Bob 85 70 70.0
2 Carl 70 90 NaN
3 Doris 100 90 90.0

Transposing a Matrix – .T method


In [61]:
grades.T


Out[61]:
0 1 2 3
Name Ann Bob Carl Doris
HW1 90 85 70 100
HW2 80 70 90 90
HW3 NaN 70 NaN 90

Values

  • The values attribute returns the data contained in the DataFrame as a 2D array

In [62]:
grades.values


Out[62]:
array([['Ann', 90, 80, nan],
       ['Bob', 85, 70, 70.0],
       ['Carl', 70, 90, nan],
       ['Doris', 100, 90, 90.0]], dtype=object)

INDEXING and SLICING with pandas objects

Index Objects

  • Panda’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).
  • Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index

In [65]:
grades = Series([ 60, 90, 80, 75],
                index= ["a", "b", "c", "d"])
grades


Out[65]:
a    60
b    90
c    80
d    75
dtype: int64

ReIndexing Series

  • ReIndexing creates a new object with the data conformed to a new index

In [64]:
grades = Series([ 60, 90, 80, 75], 
                index = ["Bob", "Tom", "Ann", "Jane"])
grades


Out[64]:
Bob     60
Tom     90
Ann     80
Jane    75
dtype: int64

Forward Filling

  • The method option ffill forward fills the values:

In [71]:
a = Series(['A', 'B', 'C'], index = [0, 3, 5])
a


Out[71]:
0    A
3    B
5    C
dtype: object

In [74]:
a.reindex(range(6), method="ffill")


Out[74]:
0    A
1    A
2    A
3    B
4    B
5    C
dtype: object

Indexing Series:

  • Series indexing works analogously to NumPy array indexing except you can use the Series’ index values instead of only integers

In [85]:
s = Series(np.arange(5), index = ["a", "b", "c", "d", "e"])
s


Out[85]:
a    0
b    1
c    2
d    3
e    4
dtype: int64

In [86]:
s["c"]


Out[86]:
2

In [87]:
s[3]


Out[87]:
3

Slicing Series:

  • Count elements in series starting from 0
  • Slices up to, but not including last item in index
  • Can specify order of items to return

In [88]:
s[1:3]


Out[88]:
b    1
c    2
dtype: int64

In [89]:
s[s >= 2]


Out[89]:
c    2
d    3
e    4
dtype: int64

In [90]:
s[["b", "e", "a"]]


Out[90]:
b    1
e    4
a    0
dtype: int64

Setting works just as with arrays:

  • Setting values b through d to 33 inclusive

In [92]:
s["b" : "d"] = 33
s


Out[92]:
a     0
b    33
c    33
d    33
e     4
dtype: int64

INDEXING DataFrames

Indexing a DataFrame retrieves one or more columns either with a single value or sequence

  • Retrieve single columns
  • Select multiple columns

In [94]:
grades = DataFrame(np.arange(16).reshape((4, 4)), 
                   index = ["Andy", "Brad", "Carla", "Donna"],
                   columns = ["HW1", "HW2", "HW3", "HW4"])
grades


Out[94]:
HW1 HW2 HW3 HW4
Andy 0 1 2 3
Brad 4 5 6 7
Carla 8 9 10 11
Donna 12 13 14 15

In [95]:
grades["HW1"]


Out[95]:
Andy      0
Brad      4
Carla     8
Donna    12
Name: HW1, dtype: int64

In [96]:
grades[["HW2", "HW3"]]


Out[96]:
HW2 HW3
Andy 1 2
Brad 5 6
Carla 9 10
Donna 13 14

Selecting rows:

  • Return rows in HW3 with values greater than 6

In [99]:
g = DataFrame(np.arange(16).reshape((4, 4)), 
                   index = ["Ann", "Bob", "Carl", "Donna"],
                   columns = ["HW1", "HW2", "HW3", "HW4"])
g[g["HW3"] > 6]


Out[99]:
HW1 HW2 HW3 HW4
Carl 8 9 10 11
Donna 12 13 14 15

In [101]:
# Select rows up to, but not including 2

g[:2]


Out[101]:
HW1 HW2 HW3 HW4
Ann 0 1 2 3
Bob 4 5 6 7

pandas FUNCTION Applications

How to apply:

  • Arithmetic operations on DataFrames
  • Arithmetic operations between differently indexed-objects
  • Operations between a DataFrames and a Series
  • Applications of NumPy ufuncs (element-wise array methods) to pandas objects

Arithmetic and data alignment

  • When adding together two objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs
  • The internal data alignment introduces NaN values in the indices that don’t overlap

In [106]:
a = Series([5, 4, 0, 7], 
           index = ["a", "c", "d", "e"])
b = Series([-1, 3, 4, -2, 1], 
           index = ["a", "c", "e", "f", "g"])
a


Out[106]:
a    5
c    4
d    0
e    7
dtype: int64

In [107]:
b


Out[107]:
a   -1
c    3
e    4
f   -2
g    1
dtype: int64

Adding Series data objects

  • Alignment is performed on both the rows and the columns
  • Adding two Series returns a DataFrame whose index and columns are unions of each Series
  • Where a value is included in one object, but not the other, will return NaN

In [103]:
a + b


Out[103]:
a     4.0
c     7.0
d     NaN
e    11.0
f     NaN
g     NaN
dtype: float64

Arithmetic methods with fill values

  • In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object, but not in the other
  • We specify add, with fill_value, and the missing value in other Series assumes value of 0

In [108]:
a.add(b, fill_value=0)


Out[108]:
a     4.0
c     7.0
d     0.0
e    11.0
f    -2.0
g     1.0
dtype: float64

Operations between DataFrame and Series are similar

By default, arithmetic between a DataFrame and a Series:

  • matches the index of the Series, on the DataFrame’s columns,
  • broadcasting down the rows

In [116]:
st0 = Series([0, 1, 2, 3], 
           index = ["HW1", "HW2", "HW3", "HW4"])
st0


Out[116]:
HW1    0
HW2    1
HW3    2
HW4    3
dtype: int64

In [117]:
grades + st0


Out[117]:
HW1 HW2 HW3 HW4
Andy 0 2 4 6
Brad 4 6 8 10
Carla 8 10 12 14
Donna 12 14 16 18

Importing from Excel

  • Pandas support various import and export options, very easy to import file from Excel
  • For example, suppose we have an Excel file “Excel_table” with contents:
  • Need to parse the worksheet in which the data is located

In [120]:
xls_file = pd.ExcelFile('Excel_table.xlsx')
t = xls_file.parse("Sheet1")
t


Out[120]:
x y z
a 1 0 -1
b 2 1 5
c -5 2 4

Function Applications

  • NumPy ufuncs (element-wise array methods) work with pandas object
  • Use abs() function to obtain absolute values of object d
  • Returns DF with all positive values

In [121]:
d = DataFrame(np.random.randn(3, 3), 
              columns=list("xyz"),
              index = ["A", "B", "C"])
d


Out[121]:
x y z
A 1.048421 -1.972915 -1.650191
B 1.016919 -0.267448 -0.093964
C -0.192132 0.774659 -0.130791

In [122]:
np.abs(d)


Out[122]:
x y z
A 1.048421 1.972915 1.650191
B 1.016919 0.267448 0.093964
C 0.192132 0.774659 0.130791

Function Definition

  • We can define and then apply a function to find both min and max values in each column
  • Returns a Series that will contain pairs of min and max values for each of three columns
  • Use keyword apply and then classify function to apply in parentheses

In [123]:
def minmax(t):
    return Series([t.min(), t.max()],
                   index = ["min", "max"])
d.apply(minmax)


Out[123]:
x y z
min -0.192132 -1.972915 -1.650191
max 1.048421 0.774659 -0.093964

Sorting Series object

  • sort_index() method returns a new object sorted lexi-cographically by index

In [125]:
a = Series(range(5),
           index = ["Bob", "john", "Jane", "Ann", "Cathy"])

a.sort_index()


Out[125]:
Ann      3
Bob      0
Cathy    4
Jane     2
john     1
dtype: int64

Sorting DataFrame

  • Sort by index on either axis
  • With sort_index, rows will be sorted alphabetically, and can specify the axis (axis=0, rows)
  • if we specify axis = 1, the the columns will be sorted alphabetically

In [126]:
df = DataFrame(np.arange(16).reshape((4, 4)), 
                   index = ["B", "Q", "M", "A"],
                   columns = list("pseb"))
df


Out[126]:
p s e b
B 0 1 2 3
Q 4 5 6 7
M 8 9 10 11
A 12 13 14 15

In [127]:
# Sort index alphabetically by ROW
df.sort_index(axis=0)


Out[127]:
p s e b
A 12 13 14 15
B 0 1 2 3
M 8 9 10 11
Q 4 5 6 7

In [129]:
# Sort index alpha by COLUMN
df.sort_index(axis=1)


Out[129]:
b e p s
B 3 2 0 1
Q 7 6 4 5
M 11 10 8 9
A 15 14 12 13

Sorting by Values

  • To sort a Series by its values, use its sort_values method:

In [130]:
s= Series([7, -2, 0, 8, -1])
s.sort_values()


Out[130]:
1   -2
4   -1
2    0
0    7
3    8
dtype: int64

Sorting with Missing Values

  • Using sort_values will sort numerical values first
  • NaN missing values will be included at end of Series with index of their position in original

In [133]:
s2 = Series([7, np.nan, -2, 0, np.nan, 8, -1])
s2
s2.sort_values()


Out[133]:
2   -2.0
6   -1.0
3    0.0
0    7.0
5    8.0
1    NaN
4    NaN
dtype: float64

Sum a DataFrame

  • Calling DataFrame’s sum method returns a Series containing column sums
  • Default returns sum across columns; set axis=1 to sum across rows

In [134]:
df = DataFrame(np.arange(9).reshape(3, 3), 
              columns=list("xyz"),
              index = ["A", "B", "C"])
df


Out[134]:
x y z
A 0 1 2
B 3 4 5
C 6 7 8

In [135]:
df.sum()


Out[135]:
x     9
y    12
z    15
dtype: int64

In [136]:
df.sum(axis=1)


Out[136]:
A     3
B    12
C    21
dtype: int64

idxmin and idxmax

  • return the index value where the minimum or maximum values are attained:

In [137]:
df.idxmax()


Out[137]:
x    C
y    C
z    C
dtype: object

In [138]:
df.idxmin()


Out[138]:
x    A
y    A
z    A
dtype: object

In [ ]: