Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Its data manipulation capabilities are built on top of the NumPy library, which makes NumPy a required dependency of pandas.

In this notebook we'll try out various pandas methods and learn more about the library along the way.

Installation

Please follow this link. All the necessary steps are mentioned here.

Importing Pandas

Once Pandas is installed, we can import it into our file.


In [43]:
import numpy as np
import pandas as pd

Series

A Series is similar to a one-dimensional NumPy array. The main difference is that a Series carries axis labels, so its elements can be accessed by label as well as by integer position.
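
As a small illustration of the two access styles, the sketch below builds a Series with string labels and reads the same element once by label and once by position. The variable names and values are made up for this example.

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices = pd.Series([5, 10, 20], index=['label1', 'label2', 'label3'])

by_label = prices['label2']      # label-based lookup
by_position = prices.iloc[1]     # position-based lookup (positions start at 0)

print(by_label, by_position)
```

Using `.iloc` for positions keeps the intent explicit, which helps when the labels themselves happen to be integers.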

Creating Series

There are various ways to create Series. Some of them are listed below.

  1. Using Python List

In [45]:
seriesLabel = ['label1', 'label2', 'label3']
exampleList = [5, 10, 20]

In [46]:
pd.Series(exampleList)


Out[46]:
0     5
1    10
2    20
dtype: int64

In [47]:
pd.Series(exampleList, seriesLabel)


Out[47]:
label1     5
label2    10
label3    20
dtype: int64

  2. Using Numpy Arrays

In [48]:
exampleNumpyArray = np.array([6, 12, 18])

In [49]:
pd.Series(exampleNumpyArray)


Out[49]:
0     6
1    12
2    18
dtype: int64

In [50]:
pd.Series(exampleNumpyArray, seriesLabel)


Out[50]:
label1     6
label2    12
label3    18
dtype: int64

  3. Using Dictionary

In [51]:
exampleDictionary = { 'label4': 7, 'label5': 14, 'label6': 21 }

In [52]:
# No need to pass index labels; the dictionary keys are used as labels
pd.Series(exampleDictionary)


Out[52]:
label4     7
label5    14
label6    21
dtype: int64

In [53]:
# If the index labels do not match any dictionary key, the values come back as NaN
pd.Series(exampleDictionary, seriesLabel)


Out[53]:
label1   NaN
label2   NaN
label3   NaN
dtype: float64
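
The NaN result above comes from pandas looking up each requested label among the dictionary keys. A minimal sketch with one matching and one non-matching label makes this visible; the label 'missing' is invented for the example.

```python
import pandas as pd

exampleDictionary = {'label4': 7, 'label5': 14, 'label6': 21}

# 'label4' is a key of the dictionary, 'missing' is not
partial = pd.Series(exampleDictionary, index=['label4', 'missing'])

print(partial)
```

The matching label keeps its value, the unknown label gets NaN, and the dtype becomes float64 because NaN is a floating-point value.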

Data and Index Parameter in Series

  1. Data

A Series can hold many kinds of data, including arbitrary Python objects such as functions, and values of mixed types.


In [54]:
def sampleFunc1():
    pass

def sampleFunc2():
    pass

def sampleFunc3():
    pass

pd.Series(data=[sampleFunc1, sampleFunc2, sampleFunc3])


Out[54]:
0    <function sampleFunc1 at 0x11835d8c8>
1    <function sampleFunc2 at 0x11835d9d8>
2    <function sampleFunc3 at 0x11835d950>
dtype: object

In [55]:
pd.Series(['a', 2, 'hey'])


Out[55]:
0      a
1      2
2    hey
dtype: object

  2. Index

index is the second parameter of pd.Series; it supplies the labels for the values.


In [56]:
pd.Series(data=[sampleFunc1, sampleFunc2, sampleFunc3], index=['a', 'b', 'c'])


Out[56]:
a    <function sampleFunc1 at 0x11835d8c8>
b    <function sampleFunc2 at 0x11835d9d8>
c    <function sampleFunc3 at 0x11835d950>
dtype: object

In [57]:
pd.Series(['a', 2, 'hey'], ['label', 2, 'key'])


Out[57]:
label      a
2          2
key      hey
dtype: object
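
To make the two parameters more concrete, the sketch below checks which dtype pandas infers for a few inputs and reads the labels back through the `index` attribute. The variable names are illustrative only.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])        # all integers -> an integer dtype
floats = pd.Series([1, 2.5])       # integer mixed with float -> float64
mixed = pd.Series(['a', 2, 'hey'], index=['label', 2, 'key'])  # mixed objects -> object

print(ints.dtype, floats.dtype, mixed.dtype)
print(list(mixed.index))
```

When the values do not share a common numeric type, pandas falls back to the generic `object` dtype, as in the outputs above.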

DataFrames

A DataFrame is a two-dimensional labelled table, much like a spreadsheet or an SQL table. It is the most commonly used pandas object.

Creating a DataFrame

pd.DataFrame( data, index, columns )

data -> content of the cells
index -> labels for rows
columns -> labels for columns

Returns a two-dimensional, size-mutable, potentially heterogeneous tabular data structure, i.e. a DataFrame.


In [71]:
pd.DataFrame(data = np.random.randint(1,51, (4,3)), index = ['row1', 'row2', 'row3', 'row4'], columns = ['col1', 'col2', 'col3'])


Out[71]:
      col1  col2  col3
row1    11    47    33
row2    25    24     2
row3    50    14     4
row4     1    14     7
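
Because the data comes from a random generator, the numbers differ on every run. One possible way to get a reproducible table is a seeded NumPy generator, sketched below; the seed value 0 and the variable names are arbitrary choices for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same table is produced on every run
rng = np.random.default_rng(0)

frame = pd.DataFrame(
    data=rng.integers(1, 51, size=(4, 3)),   # integers from 1 to 50 inclusive
    index=['row1', 'row2', 'row3', 'row4'],
    columns=['col1', 'col2', 'col3'],
)

print(frame.shape)
```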

Selection and Indexing


In [73]:
dataFrame = pd.DataFrame(data = np.random.randint(1,51, (4,3)), index = ['row1', 'row2', 'row3', 'row4'], columns = ['col1', 'col2', 'col3'])
dataFrame


Out[73]:
      col1  col2  col3
row1    44    21    31
row2    37    40     8
row3    46     5    49
row4    19    33    14

Selecting a single column


In [74]:
dataFrame['col1']


Out[74]:
row1    44
row2    37
row3    46
row4    19
Name: col1, dtype: int64
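
Selecting with a single label returns a Series, while selecting with a list of labels returns a DataFrame, even when the list has one entry. The sketch below rebuilds the table with the values shown above so that it runs on its own.

```python
import pandas as pd

# Same values as the dataFrame shown above
dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

single = dataFrame['col1']       # a Series
wrapped = dataFrame[['col1']]    # a one-column DataFrame

print(type(single).__name__, type(wrapped).__name__)
```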

Selecting multiple columns


In [75]:
dataFrame[['col1', 'col2']]


Out[75]:
      col1  col2
row1    44    21
row2    37    40
row3    46     5
row4    19    33

Creation of new columns using arithmetic operators


In [76]:
dataFrame['newCol1'] = dataFrame['col3'] - dataFrame['col2']
dataFrame


Out[76]:
      col1  col2  col3  newCol1
row1    44    21    31       10
row2    37    40     8      -32
row3    46     5    49       44
row4    19    33    14      -19

In [77]:
dataFrame['newCol2'] = dataFrame['col1'] * dataFrame['col3']
dataFrame


Out[77]:
      col1  col2  col3  newCol1  newCol2
row1    44    21    31       10     1364
row2    37    40     8      -32      296
row3    46     5    49       44     2254
row4    19    33    14      -19      266
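
Assigning to `dataFrame['newCol1']` modifies the table in place. As an alternative sketch, `DataFrame.assign` builds the same columns on a new DataFrame and leaves the original untouched; the name `extended` is chosen only for this example.

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# assign() returns a new DataFrame with the extra columns
extended = dataFrame.assign(
    newCol1=dataFrame['col3'] - dataFrame['col2'],
    newCol2=dataFrame['col1'] * dataFrame['col3'],
)

print(list(extended.columns))
print(list(dataFrame.columns))   # still only col1, col2, col3
```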

Removal of columns


In [78]:
# axis -> 0 means that we are targeting the rows
# axis -> 1 means that we are targeting the columns
dataFrame.drop('newCol1', axis=1)


Out[78]:
      col1  col2  col3  newCol2
row1    44    21    31     1364
row2    37    40     8      296
row3    46     5    49     2254
row4    19    33    14      266

In [79]:
# drop() returned a new DataFrame; the original still contains newCol1
dataFrame


Out[79]:
      col1  col2  col3  newCol1  newCol2
row1    44    21    31       10     1364
row2    37    40     8      -32      296
row3    46     5    49       44     2254
row4    19    33    14      -19      266

In [80]:
# By default drop() leaves the original untouched, which protects us from accidental deletion
# In order to delete the column from dataFrame itself, pass inplace=True
dataFrame.drop('newCol1', axis=1, inplace=True)
dataFrame


Out[80]:
      col1  col2  col3  newCol2
row1    44    21    31     1364
row2    37    40     8      296
row3    46     5    49     2254
row4    19    33    14      266

In [81]:
dataFrame.drop('newCol2', axis=1, inplace=True)
dataFrame


Out[81]:
      col1  col2  col3
row1    44    21    31
row2    37    40     8
row3    46     5    49
row4    19    33    14
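
Instead of `inplace=True`, the result of `drop` can simply be stored in a variable. The sketch below also uses the `columns=` keyword, which is equivalent to passing `axis=1`; the small table and the name `trimmed` are made up for illustration.

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37], 'col2': [21, 40], 'newCol1': [10, -32]},
    index=['row1', 'row2'],
)

# drop() returns a new DataFrame; dataFrame itself is not modified
trimmed = dataFrame.drop(columns='newCol1')

print(list(trimmed.columns))
print(list(dataFrame.columns))
```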

Selecting a single row


In [82]:
dataFrame.loc['row1']


Out[82]:
col1    44
col2    21
col3    31
Name: row1, dtype: int64

Selecting multiple rows


In [83]:
dataFrame.loc[['row1', 'row2']]


Out[83]:
      col1  col2  col3
row1    44    21    31
row2    37    40     8

Selecting rows based on their integer position


In [84]:
dataFrame.iloc[1]


Out[84]:
col1    37
col2    40
col3     8
Name: row2, dtype: int64
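
`loc` looks rows up by label and `iloc` by integer position, so the two can address the same row. A short sketch, rebuilding the table with the values shown above:

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

position = dataFrame.index.get_loc('row2')   # integer position of the label
same_row = dataFrame.loc['row2'].equals(dataFrame.iloc[position])

print(position, same_row)
```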

Removal of rows


In [85]:
dataFrame.drop('row1', axis=0)


Out[85]:
      col1  col2  col3
row2    37    40     8
row3    46     5    49
row4    19    33    14

In [86]:
# again, drop() returned a new DataFrame; the original still contains row1
dataFrame


Out[86]:
      col1  col2  col3
row1    44    21    31
row2    37    40     8
row3    46     5    49
row4    19    33    14

In [87]:
# we should use 'inplace' to drop the row
# dataFrame.drop('row1', axis=0, inplace=True)
# dataFrame
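
Several rows can be dropped in one call by passing a list of labels. The sketch below keeps the result in a new variable (the name `remaining` is illustrative) rather than using `inplace=True`.

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# A list of labels drops several rows at once and returns a new DataFrame
remaining = dataFrame.drop(['row1', 'row3'], axis=0)

print(list(remaining.index))
print(len(dataFrame))   # the original still has all four rows
```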

Selecting both columns and rows


In [88]:
dataFrame.loc['row1', 'col2']


Out[88]:
21

In [89]:
dataFrame.loc[['row1', 'row2', 'row3'],['col2', 'col3']]


Out[89]:
      col2  col3
row1    21    31
row2    40     8
row3     5    49

In [90]:
dataFrame.iloc[0,1]


Out[90]:
21

In [91]:
dataFrame.iloc[[0,1]]


Out[91]:
      col1  col2  col3
row1    44    21    31
row2    37    40     8
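
Both `loc` and `iloc` also accept slices. One detail worth knowing is that label slices include the end label, while position slices exclude the end position. The sketch below selects the same block both ways.

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

by_label = dataFrame.loc['row1':'row3', 'col2':'col3']   # end labels are included
by_position = dataFrame.iloc[0:3, 1:3]                   # end positions are excluded

print(by_label.equals(by_position))
```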

Conditional Selection


In [92]:
dataFrame


Out[92]:
      col1  col2  col3
row1    44    21    31
row2    37    40     8
row3    46     5    49
row4    19    33    14

In [93]:
dataFrame > 10


Out[93]:
      col1   col2   col3
row1  True   True   True
row2  True   True  False
row3  True  False   True
row4  True   True   True

Instead of True and False values, we can get the actual values wherever the condition is satisfied. Cells that fail the condition become NaN, and columns that receive a NaN are converted to float.


In [94]:
dataFrame[dataFrame > 10]


Out[94]:
      col1  col2  col3
row1    44  21.0  31.0
row2    37  40.0   NaN
row3    46   NaN  49.0
row4    19  33.0  14.0
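
Indexing with a boolean DataFrame behaves like `DataFrame.where`. As a sketch, `where` can also take a replacement value through its `other` parameter, which avoids NaN; the fill value 0 is an arbitrary choice for this example.

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

masked = dataFrame.where(dataFrame > 10)            # failing cells become NaN
filled = dataFrame.where(dataFrame > 10, other=0)   # failing cells become 0

print(int(masked.isna().sum().sum()))
print(filled.loc['row2', 'col3'], filled.loc['row3', 'col2'])
```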

We can also filter rows using a condition on a single column


In [102]:
dataFrame[dataFrame['col2'] > 10]


Out[102]:
      col1  col2  col3
row1    44    21    31
row2    37    40     8
row4    19    33    14

We can also select only the columns we want from the filtered rows


In [103]:
dataFrame[dataFrame['col2'] > 10]['col3']


Out[103]:
row1    31
row2     8
row4    14
Name: col3, dtype: int64

In [105]:
dataFrame[dataFrame['col2'] > 10][['col1', 'col3']]


Out[105]:
      col1  col3
row1    44    31
row2    37     8
row4    19    14
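
The examples above chain two selections: first the rows, then the columns. A single `.loc` call with a boolean mask and a column list gives the same result in one step, and is generally the safer form when the selection is later used for assignment. A minimal sketch:

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

chained = dataFrame[dataFrame['col2'] > 10][['col1', 'col3']]
single_step = dataFrame.loc[dataFrame['col2'] > 10, ['col1', 'col3']]

print(chained.equals(single_step))
```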

To apply conditions on multiple columns, wrap each condition in parentheses and combine them with & (and) or | (or):


In [109]:
dataFrame[(dataFrame['col1'] > 10) & (dataFrame['col3'] > 10)]


Out[109]:
      col1  col2  col3
row1    44    21    31
row3    46     5    49
row4    19    33    14
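
Besides `&`, conditions can be combined with `|` (or) and inverted with `~` (not). Python's `and`, `or` and `not` keywords do not work element-wise on a Series, which is why the operators and the parentheses are needed. The thresholds below are arbitrary values picked for the example.

```python
import pandas as pd

dataFrame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# Rows where col2 > 30 OR col3 > 40
either = dataFrame[(dataFrame['col2'] > 30) | (dataFrame['col3'] > 40)]

# Rows where col1 is NOT greater than 40
not_large = dataFrame[~(dataFrame['col1'] > 40)]

print(list(either.index))
print(list(not_large.index))
```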

Note: This notebook is not complete; more content will be added soon.