Pandas

pandas is a software library for the Python programming language, used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Its data manipulation capabilities are built on top of the NumPy library, which makes NumPy a dependency of pandas.

In this notebook we'll try out various pandas methods and learn more about the library in the process.

Installation

Please follow this link. All the necessary steps are mentioned here.

Importing Pandas

Once pandas is installed, we can import it in our file.



In [43]:

    
import numpy as np
import pandas as pd

Series

Series are similar to NumPy arrays. The main difference is that a Series can have axis labels, so it can be indexed by label as well as by integer position.
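To make the label-versus-position distinction concrete, here is a minimal sketch. The variable names (`s`, `by_label`, `by_position`) are hypothetical and the values are only illustrative; the same element is reached once through its label and once through its integer position.

```python
import pandas as pd

# Illustrative values; any list of numbers would work here.
s = pd.Series([5, 10, 20], index=['label1', 'label2', 'label3'])

by_label = s['label2']     # look the element up by its axis label
by_position = s.iloc[1]    # look the same element up by integer position

print(by_label, by_position)
```

Using `.iloc` for positional access keeps the intent explicit, which matters once the labels themselves are integers.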

Creating Series

There are various ways to create Series. Some of them are listed below.

Using Python List



In [45]:

    
seriesLabel = ['label1', 'label2', 'label3']
exampleList = [5, 10, 20]



In [46]:

    
pd.Series(exampleList)









    Out[46]:





0     5
1    10
2    20
dtype: int64



In [47]:

    
pd.Series(exampleList, seriesLabel)









    Out[47]:





label1     5
label2    10
label3    20
dtype: int64

Using Numpy Arrays



In [48]:

    
exampleNumpyArray = np.array([6, 12, 18])



In [49]:

    
pd.Series(exampleNumpyArray)









    Out[49]:





0     6
1    12
2    18
dtype: int64



In [50]:

    
pd.Series(exampleNumpyArray, seriesLabel)









    Out[50]:





label1     6
label2    12
label3    18
dtype: int64

Using Dictionary



In [51]:

    
exampleDictionary = { 'label4': 7, 'label5': 14, 'label6': 21 }



In [52]:

    
# The dictionary keys become the labels, so no index argument is needed
pd.Series(exampleDictionary)









    Out[52]:





label4     7
label5    14
label6    21
dtype: int64



In [53]:

    
# Labels that are not keys of the dictionary produce NaN values
pd.Series(exampleDictionary, seriesLabel)









    Out[53]:





label1   NaN
label2   NaN
label3   NaN
dtype: float64
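The all-NaN result above happens because pandas matches the requested labels against the dictionary keys, and none of them match. A small sketch with a partially overlapping index shows the matching more clearly (the names `example_dict` and `partial` are hypothetical):

```python
import pandas as pd

example_dict = {'label4': 7, 'label5': 14, 'label6': 21}

# 'label4' is a key of the dictionary, 'label1' is not.
partial = pd.Series(example_dict, index=['label4', 'label1'])

print(partial)
```

The matching label keeps its value, the missing one becomes NaN, and the dtype becomes float64 because NaN is a floating-point value.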

Data and Index Parameter in Series

Data

Series can hold a variety of data.



In [54]:

    
def sampleFunc1():
    pass

def sampleFunc2():
    pass

def sampleFunc3():
    pass

pd.Series(data=[sampleFunc1, sampleFunc2, sampleFunc3])









    Out[54]:





0    <function sampleFunc1 at 0x11835d8c8>
1    <function sampleFunc2 at 0x11835d9d8>
2    <function sampleFunc3 at 0x11835d950>
dtype: object



In [55]:

    
pd.Series(['a', 2, 'hey'])









    Out[55]:





0      a
1      2
2    hey
dtype: object

Index

index is the second parameter; it supplies the labels for the Series.



In [56]:

    
pd.Series(data=[sampleFunc1, sampleFunc2, sampleFunc3], index=['a', 'b', 'c'])









    Out[56]:





a    <function sampleFunc1 at 0x11835d8c8>
b    <function sampleFunc2 at 0x11835d9d8>
c    <function sampleFunc3 at 0x11835d950>
dtype: object



In [57]:

    
pd.Series(['a', 2, 'hey'], ['label', 2, 'key'])









    Out[57]:





label      a
2          2
key      hey
dtype: object
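When an index mixes strings and integers as above, label lookups and positional lookups can give different answers for the same number. A short sketch, assuming the same data as the previous cell (the name `mixed` is hypothetical):

```python
import pandas as pd

mixed = pd.Series(['a', 2, 'hey'], index=['label', 2, 'key'])

value_at_label_2 = mixed.loc[2]      # label-based: the row whose label is 2
value_at_position_2 = mixed.iloc[2]  # position-based: the third row

print(value_at_label_2, value_at_position_2)
```

Here `.loc[2]` returns the element labelled 2, while `.iloc[2]` returns the third element, which carries the label 'key'.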

DataFrames

DataFrames are like spreadsheets or SQL tables: two-dimensional data with labelled rows and columns. They are used heavily by pandas users.

Creating a DataFrame

pd.DataFrame( data, index, columns )

data -> content of the cells
index -> labels for rows
columns -> labels for columns

Returns a two-dimensional, size-mutable, potentially heterogeneous tabular data structure, i.e. a DataFrame.



In [71]:

    
pd.DataFrame(data = np.random.randint(1,51, (4,3)), index = ['row1', 'row2', 'row3', 'row4'], columns = ['col1', 'col2', 'col3'])
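The cell above draws random integers from 1 to 50 (the upper bound of `np.random.randint` is exclusive), so its output changes on every run. As a reproducible alternative, here is a minimal sketch that builds a DataFrame from a dictionary of columns. The name `frame` is hypothetical, and the fixed values are copied from the outputs shown later in this notebook.

```python
import pandas as pd

# Each dictionary key becomes a column label; each list holds that column's cells.
frame = pd.DataFrame(
    {
        'col1': [44, 37, 46, 19],
        'col2': [21, 40, 5, 33],
        'col3': [31, 8, 49, 14],
    },
    index=['row1', 'row2', 'row3', 'row4'],
)

print(frame.shape)
```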

Selection and Indexing



In [73]:

    
dataFrame = pd.DataFrame(data = np.random.randint(1,51, (4,3)), index = ['row1', 'row2', 'row3', 'row4'], columns = ['col1', 'col2', 'col3'])
dataFrame

Selecting a single column



In [74]:

    
dataFrame['col1']









    Out[74]:





row1    44
row2    37
row3    46
row4    19
Name: col1, dtype: int64

Selecting multiple columns



In [75]:

    
dataFrame[['col1', 'col2']]

Creation of new columns using arithmetic operators



In [76]:

    
dataFrame['newCol1'] = dataFrame['col3'] - dataFrame['col2']
dataFrame



In [77]:

    
dataFrame['newCol2'] = dataFrame['col1'] * dataFrame['col3']
dataFrame

Removal of columns



In [78]:

    
# axis -> 0 means that we are targeting the rows
# axis -> 1 means that we are targeting the columns
dataFrame.drop('newCol1', axis=1)



In [79]:

    
# drop() returned a new DataFrame; the original still has the column
dataFrame



In [80]:

    
# pandas protects us from accidentally dropping columns.
# To delete the column from dataFrame itself, pass inplace=True
dataFrame.drop('newCol1', axis=1, inplace=True)
dataFrame



In [81]:

    
dataFrame.drop('newCol2', axis=1, inplace=True)
dataFrame
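An alternative to `inplace=True` is to keep the DataFrame that `drop` returns. The sketch below uses a small hypothetical frame (`frame`, `trimmed`) and the `columns=` keyword, which names the axis directly instead of relying on `axis=1`.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [44, 37], 'col2': [21, 40], 'newCol1': [10, -32]},
    index=['row1', 'row2'],
)

# drop() leaves `frame` untouched and returns a new DataFrame.
trimmed = frame.drop(columns='newCol1')

print(list(frame.columns))
print(list(trimmed.columns))
```

Assigning the result to a new name makes it easy to keep both versions around while experimenting.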

Selecting a single Row



In [82]:

    
dataFrame.loc['row1']









    Out[82]:





col1    44
col2    21
col3    31
Name: row1, dtype: int64

Selecting multiple rows



In [83]:

    
dataFrame.loc[['row1', 'row2']]

Selecting rows based on their index number



In [84]:

    
dataFrame.iloc[1]









    Out[84]:





col1    37
col2    40
col3     8
Name: row2, dtype: int64
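`loc` and `iloc` also differ in how they treat slices: a label slice with `loc` includes its end label, while a positional slice with `iloc` excludes its end position, as in ordinary Python slicing. A minimal sketch, with hypothetical names and values copied from this notebook's outputs:

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

by_label = frame.loc['row1':'row2']  # end label is included: row1 and row2
by_position = frame.iloc[0:1]        # end position is excluded: row1 only

print(list(by_label.index), list(by_position.index))
```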

Removal of rows



In [85]:

    
dataFrame.drop('row1', axis=0)



In [86]:

    
# again, drop() returned a new DataFrame; the original still contains row1
dataFrame



In [87]:

    
# pass inplace=True to drop the row from dataFrame itself
# (left commented out here so that row1 is still available in the cells below)
# dataFrame.drop('row1', axis=0, inplace=True)
# dataFrame
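Several rows can be removed at once by passing a list of labels. This sketch uses the `index=` keyword instead of `axis=0` and keeps the returned DataFrame under a new, hypothetical name (`remaining`); the values are copied from this notebook's outputs.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# Remove two rows by label; `frame` itself is left unchanged.
remaining = frame.drop(index=['row1', 'row4'])

print(list(remaining.index))
```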

Selecting both columns and rows



In [88]:

    
dataFrame.loc['row1', 'col2']









    Out[88]:





21



In [89]:

    
dataFrame.loc[['row1', 'row2', 'row3'],['col2', 'col3']]



In [90]:

    
dataFrame.iloc[0,1]









    Out[90]:





21



In [91]:

    
dataFrame.iloc[[0,1]]

Conditional Selection



In [92]:

    
dataFrame



In [93]:

    
dataFrame > 10









    Out[93]:







  
    
      
      col1   col2   col3
row1  True   True   True
row2  True   True  False
row3  True  False   True
row4  True   True   True

Instead of True and False values, we can also get the actual values where the condition is satisfied; cells that fail the condition become NaN.



In [94]:

    
dataFrame[dataFrame > 10]
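Indexing a DataFrame with a boolean DataFrame keeps the original shape and replaces every cell that fails the condition with NaN. The sketch below reproduces this with fixed values copied from this notebook's outputs; the names `frame`, `masked` and `filled` are hypothetical, and `fillna(0)` is just one possible way to handle the gaps.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

masked = frame[frame > 10]  # cells that are not greater than 10 become NaN
filled = masked.fillna(0)   # optionally replace the NaN cells with a default

print(masked)
```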

We can also filter rows using a condition on an individual column



In [102]:

    
dataFrame[dataFrame['col2'] > 10]

We can also output only the columns that we want



In [103]:

    
dataFrame[dataFrame['col2'] > 10]['col3']









    Out[103]:





row1    31
row2     8
row4    14
Name: col3, dtype: int64



In [105]:

    
dataFrame[dataFrame['col2'] > 10][['col1', 'col3']]
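The cells above filter the rows first and then select columns from the result, which is two separate indexing steps. A sketch of the same selection in a single step, using `.loc` with a boolean row mask and a list of column labels (names are hypothetical, values copied from this notebook's outputs):

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# Rows where col2 > 10, restricted to col1 and col3, in one indexing call.
selected = frame.loc[frame['col2'] > 10, ['col1', 'col3']]

print(selected)
```

The single-step form is generally the safer habit, because chained indexing can behave unexpectedly when it is used for assignment.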

To combine conditions on multiple columns, wrap each condition in parentheses and join them with &:



In [109]:

    
dataFrame[(dataFrame['col1'] > 10) & (dataFrame['col3'] > 10)]
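Besides `&` (and), conditions can be combined with `|` (or) and inverted with `~` (not). The parentheses are needed because these operators bind more tightly than comparisons such as `>`; Python's own `and` / `or` keywords do not work element-wise on a Series. A minimal sketch with hypothetical names, hypothetical thresholds, and values copied from this notebook's outputs:

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [44, 37, 46, 19], 'col2': [21, 40, 5, 33], 'col3': [31, 8, 49, 14]},
    index=['row1', 'row2', 'row3', 'row4'],
)

both = frame[(frame['col1'] > 10) & (frame['col3'] > 10)]       # both conditions hold
either = frame[(frame['col2'] > 35) | (frame['col3'] > 40)]     # at least one holds
neither = frame[~((frame['col2'] > 35) | (frame['col3'] > 40))] # none holds

print(list(both.index), list(either.index), list(neither.index))
```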




Note: This notebook is not complete; more content will be added soon.

dataFrame after adding newCol1 and newCol2 (output of In [77]):

      col1  col2  col3  newCol1  newCol2
row1    44    21    31       10     1364
row2    37    40     8      -32      296
row3    46     5    49       44     2254
row4    19    33    14      -19      266
