Pandas Practice

Series

Loading packages and initializations


In [2]:
import numpy as np
import pandas as pd

labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}

print ("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)


Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}

Creating a Series (Pandas class)

  • From numerical data only
  • From numerical data and corresponding index (row labels)
  • From NumPy array as the source of numerical data
  • Just using a pre-defined dictionary

In [3]:
pd.Series(data=my_data) # Output looks very similar to a NumPy array


Out[3]:
0    10
1    20
2    30
dtype: int64

In [4]:
pd.Series(data=my_data, index=labels) # Note the extra information about index


Out[4]:
a    10
b    20
c    30
dtype: int64

In [5]:
# Inputs are in order of the expected parameters (not explicitly named), NumPy array is used for data
pd.Series(arr, labels)


Out[5]:
a    10
b    20
c    30
dtype: int32

In [6]:
pd.Series(d) # Using a pre-defined Dictionary object


Out[6]:
a    10
b    20
c    30
dtype: int64

What type of values can a Pandas Series hold?


In [7]:
print ("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(arr))
print ("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))
print ("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))
print ("\nHolding objects from a dictionary\n",'-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))


Holding numerical data
-------------------------
0    10
1    20
2    30
dtype: int32

Holding text labels
--------------------
0    a
1    b
2    c
dtype: object

Holding functions
--------------------
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

Holding objects from a dictionary
----------------------------------------
0    <built-in method keys of dict object at 0x0000...
1    <built-in method items of dict object at 0x000...
2    <built-in method values of dict object at 0x00...
dtype: object

Indexing and slicing


In [8]:
ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1,2,5,4],['CA', 'OR', 'NV', 'AZ'])

print ("\nIndexing by name of the item/object (string identifier)\n",'-'*56, sep='')
print("Value for CA in ser1:", ser1['CA'])
print("Value for AZ in ser1:", ser1['AZ'])
print("Value for NV in ser2:", ser2['NV'])

print ("\nIndexing by number (positional value in the series)\n",'-'*52, sep='')
print("Value for CA in ser1:", ser1[0])
print("Value for AZ in ser1:", ser1[3])
print("Value for NV in ser2:", ser2[2])

print ("\nIndexing by a range\n",'-'*25, sep='')
print ("Value for OR, CO, and AZ in ser1:\n", ser1[1:4], sep='')


Indexing by name of the item/object (string identifier)
--------------------------------------------------------
Value for CA in ser1: 1
Value for AZ in ser1: 4
Value for NV in ser2: 5

Indexing by number (positional value in the series)
----------------------------------------------------
Value for CA in ser1: 1
Value for AZ in ser1: 4
Value for NV in ser2: 5

Indexing by a range
-------------------------
Value for OR, CO, and AZ in ser1:
OR    2
CO    3
AZ    4
dtype: int64

Adding/Merging two series with common indices


In [9]:
ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1,2,5,4],['CA', 'OR', 'NV', 'AZ'])
ser3 = ser1+ser2

print ("\nAfter adding the two series, the result looks like this...\n",'-'*59, sep='')
print(ser3)
print("\nPython tries to add values where it finds common index name, and puts NaN where indices are missing\n")

print ("\nThe idea works even for multiplication...\n",'-'*43, sep='')
print (ser1*ser2)

print ("\nOr even for combination of mathematical operations!\n",'-'*53, sep='')
print (np.exp(ser1)+np.log10(ser2))


After adding the two series, the result looks like this...
-----------------------------------------------------------
AZ    8.0
CA    2.0
CO    NaN
NV    NaN
OR    4.0
dtype: float64

Python tries to add values where it finds common index name, and puts NaN where indices are missing


The idea works even for multiplication...
-------------------------------------------
AZ    16.0
CA     1.0
CO     NaN
NV     NaN
OR     4.0
dtype: float64

Or even for combination of mathematical operations!
-----------------------------------------------------
AZ    55.200210
CA     2.718282
CO          NaN
NV          NaN
OR     7.690086
dtype: float64

DataFrame (the Real Meat!)


In [10]:
from numpy.random import randn as rn

Creating and accessing DataFrame

  • Indexing
  • Adding and deleting rows and columns
  • Subsetting DataFrame

In [11]:
np.random.seed(101)
matrix_data = rn(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)


The data frame looks like
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Indexing and slicing (columns)

  • By bracket method
  • By DOT method (NOT recommended)

In [12]:
print("\nThe 'X' column\n",'-'*25, sep='')
print(df['X'])
print("\nType of the column: ", type(df['X']), sep='')
print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')
print(df[['X','Z']])
print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')
print ("\nSo, for more than one column, the object turns into a DataFrame")


The 'X' column
-------------------------
A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64

Type of the column: <class 'pandas.core.series.Series'>

The 'X' and 'Z' columns indexed by passing a list
-------------------------------------------------------
          X         Z
A  0.628133  0.503826
B -0.319318  0.605965
C  0.740122 -0.589001
D -0.758872  0.955057
E  1.978757  0.683509

Type of the pair of columns: <class 'pandas.core.frame.DataFrame'>

So, for more than one column, the object turns into a DataFrame

In [13]:
print("\nThe 'X' column accessed by DOT method (NOT recommended)\n",'-'*55, sep='')
print(df.X)


The 'X' column accessed by DOT method (NOT recommended)
-------------------------------------------------------
A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64

Creating and deleting a (new) column (or row)


In [14]:
print("\nA column is created by assigning it in relation to an existing column\n",'-'*75, sep='')
df['New'] = df['X']+df['Z']
df['New (Sum of X and Z)'] = df['X']+df['Z']
print(df)
print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')
df = df.drop('New', axis=1) # Notice the axis=1 option, axis = 0 is default, so one has to change it to 1
print(df)
df1=df.drop('A')
print("\nA row (index) is dropped by using df.drop() method and axis=0\n",'-'*65, sep='')
print(df1)
print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)


A column is created by assigning it in relation to an existing column
---------------------------------------------------------------------------
          W         X         Y         Z       New  New (Sum of X and Z)
A  2.706850  0.628133  0.907969  0.503826  1.131958              1.131958
B  0.651118 -0.319318 -0.848077  0.605965  0.286647              0.286647
C -2.018168  0.740122  0.528813 -0.589001  0.151122              0.151122
D  0.188695 -0.758872 -0.933237  0.955057  0.196184              0.196184
E  0.190794  1.978757  2.605967  0.683509  2.662266              2.662266

A column is dropped by using df.drop() method
-------------------------------------------------------
          W         X         Y         Z  New (Sum of X and Z)
A  2.706850  0.628133  0.907969  0.503826              1.131958
B  0.651118 -0.319318 -0.848077  0.605965              0.286647
C -2.018168  0.740122  0.528813 -0.589001              0.151122
D  0.188695 -0.758872 -0.933237  0.955057              0.196184
E  0.190794  1.978757  2.605967  0.683509              2.662266

A row (index) is dropped by using df.drop() method and axis=0
-----------------------------------------------------------------
          W         X         Y         Z  New (Sum of X and Z)
B  0.651118 -0.319318 -0.848077  0.605965              0.286647
C -2.018168  0.740122  0.528813 -0.589001              0.151122
D  0.188695 -0.758872 -0.933237  0.955057              0.196184
E  0.190794  1.978757  2.605967  0.683509              2.662266

An in-place change can be done by making inplace=True in the drop method
---------------------------------------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Selecting/indexing Rows

  • Label-based 'loc' method
  • Index (numeric) 'iloc' method

In [15]:
print("\nLabel-based 'loc' method can be used for selecting row(s)\n",'-'*60, sep='')
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nIndex position based 'iloc' method can be used for selecting row(s)\n",'-'*70, sep='')
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])


Label-based 'loc' method can be used for selecting row(s)
------------------------------------------------------------

Single row

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Multiple rows

          W         X         Y         Z
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001

Index position based 'iloc' method can be used for selecting row(s)
----------------------------------------------------------------------

Single row

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Multiple rows

          W         X         Y         Z
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001

Subsetting DataFrame


In [16]:
np.random.seed(101)
matrix_data = rn(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)

print("\nThe DatFrame\n",'-'*45, sep='')
print(df)
print("\nElement at row 'B' and column 'Y' is\n") 
print(df.loc['B','Y'])
print("\nSubset comprising of rows B and D, and columns W and Y, is\n")
df.loc[['B','D'],['W','Y']]


The DatFrame
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Element at row 'B' and column 'Y' is

-0.848076983404

Subset comprising of rows B and D, and columns W and Y, is

Out[16]:
W Y
B 0.651118 -0.848077
D 0.188695 -0.933237

Conditional selection, index (re)setting, multi-index

Basic idea of conditional check and Boolean DataFrame


In [17]:
print("\nThe DataFrame\n",'-'*45, sep='')
print(df)
print("\nBoolean DataFrame(s) where we are checking if the values are greater than 0\n",'-'*75, sep='')
print(df>0)
print("\n")
print(df.loc[['A','B','C']]>0)
booldf = df>0
print("\nDataFrame indexed by boolean dataframe\n",'-'*45, sep='')
print(df[booldf])


The DataFrame
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Boolean DataFrame(s) where we are checking if the values are greater than 0
---------------------------------------------------------------------------
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False

DataFrame indexed by boolean dataframe
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509

Passing Boolean series to conditionally subset the DataFrame


In [27]:
matrix_data = np.matrix('22,66,140;42,70,148;30,62,125;35,68,160;25,62,152')
row_labels = ['A','B','C','D','E']
column_headings = ['Age', 'Height', 'Weight']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nA new DataFrame\n",'-'*25, sep='')
print(df)
print("\nRows with Height > 65 inch\n",'-'*35, sep='')
print(df[df['Height']>65])

booldf1 = df['Height']>65
booldf2 = df['Weight']>145
print("\nRows with Height > 65 inch and Weight >145 lbs\n",'-'*55, sep='')
print(df[(booldf1) & (booldf2)])

print("\nDataFrame with only Age and Weight columns whose Height > 65 inch\n",'-'*68, sep='')
print(df[booldf1][['Age','Weight']])


A new DataFrame
-------------------------
   Age  Height  Weight
A   22      66     140
B   42      70     148
C   30      62     125
D   35      68     160
E   25      62     152

Rows with Height > 65 inch
-----------------------------------
   Age  Height  Weight
A   22      66     140
B   42      70     148
D   35      68     160

Rows with Height > 65 inch and Weight >145 lbs
-------------------------------------------------------
   Age  Height  Weight
B   42      70     148
D   35      68     160

DataFrame with only Age and Weight columns whose Height > 65 inch
--------------------------------------------------------------------
   Age  Weight
A   22     140
B   42     148
D   35     160

Re-setting and Setting Index


In [50]:
matrix_data = np.matrix('22,66,140;42,70,148;30,62,125;35,68,160;25,62,152')
row_labels = ['A','B','C','D','E']
column_headings = ['Age', 'Height', 'Weight']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe DataFrame\n",'-'*25, sep='')
print(df)
print("\nAfter resetting index\n",'-'*35, sep='')
print(df.reset_index())
print("\nAfter resetting index with 'drop' option TRUE\n",'-'*45, sep='')
print(df.reset_index(drop=True))
print("\nAdding a new column 'Profession'\n",'-'*45, sep='')
df['Profession'] = "Student Teacher Engineer Doctor Nurse".split()
print(df)
print("\nSetting 'Profession' column as index\n",'-'*45, sep='')
print (df.set_index('Profession'))


The DataFrame
-------------------------
   Age  Height  Weight
A   22      66     140
B   42      70     148
C   30      62     125
D   35      68     160
E   25      62     152

After resetting index
-----------------------------------
  index  Age  Height  Weight
0     A   22      66     140
1     B   42      70     148
2     C   30      62     125
3     D   35      68     160
4     E   25      62     152

After resetting index with 'drop' option TRUE
---------------------------------------------
   Age  Height  Weight
0   22      66     140
1   42      70     148
2   30      62     125
3   35      68     160
4   25      62     152

Adding a new column 'Profession'
---------------------------------------------
   Age  Height  Weight Profession
A   22      66     140    Student
B   42      70     148    Teacher
C   30      62     125   Engineer
D   35      68     160     Doctor
E   25      62     152      Nurse

Setting 'Profession' column as index
---------------------------------------------
            Age  Height  Weight
Profession                     
Student      22      66     140
Teacher      42      70     148
Engineer     30      62     125
Doctor       35      68     160
Nurse        25      62     152

Multi-indexing


In [87]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))

print("\nTuple pairs after the zip and list command\n",'-'*45, sep='')
print(hier_index)
hier_index = pd.MultiIndex.from_tuples(hier_index)
print("\nIndex hierarchy\n",'-'*25, sep='')
print(hier_index)
print("\nIndex hierarchy type\n",'-'*25, sep='')
print(type(hier_index))

print("\nCreating DataFrame with multi-index\n",'-'*37, sep='')
np.random.seed(101)
df1 = pd.DataFrame(data=np.round(rn(6,3),2), index= hier_index, columns= ['A','B','C'])
print(df1)

print("\nSubsetting multi-index DataFrame using two 'loc' methods\n",'-'*60, sep='')
print(df1.loc['G2'].loc[[1,3]][['B','C']])

print("\nNaming the indices by 'index.names' method\n",'-'*45, sep='')
df1.index.names=['Outer', 'Inner']
print(df1)


Tuple pairs after the zip and list command
---------------------------------------------
[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

Index hierarchy
-------------------------
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

Index hierarchy type
-------------------------
<class 'pandas.core.indexes.multi.MultiIndex'>

Creating DataFrame with multi-index
-------------------------------------
         A     B     C
G1 1  2.71  0.63  0.91
   2  0.50  0.65 -0.32
   3 -0.85  0.61 -2.02
G2 1  0.74  0.53 -0.59
   2  0.19 -0.76 -0.93
   3  0.96  0.19  1.98

Subsetting multi-index DataFrame using two 'loc' methods
------------------------------------------------------------
      B     C
1  0.53 -0.59
3  0.19  1.98

Naming the indices by 'index.names' method
---------------------------------------------
                A     B     C
Outer Inner                  
G1    1      2.71  0.63  0.91
      2      0.50  0.65 -0.32
      3     -0.85  0.61 -2.02
G2    1      0.74  0.53 -0.59
      2      0.19 -0.76 -0.93
      3      0.96  0.19  1.98

Cross-section ('XS') command


In [109]:
print("\nGrabbing a cross-section from outer level\n",'-'*45, sep='')
print(df1.xs('G1'))
print("\nGrabbing a cross-section from inner level (for all outer levels)\n",'-'*65, sep='')
print(df1.xs(2,level='Inner'))


Grabbing a cross-section from outer level
---------------------------------------------
          A     B     C
Inner                  
1      2.71  0.63  0.91
2      0.50  0.65 -0.32
3     -0.85  0.61 -2.02

Grabbing a cross-section from inner level (for all outer levels)
-----------------------------------------------------------------
          A     B     C
Outer                  
G1     0.50  0.65 -0.32
G2     0.19 -0.76 -0.93

Missing Values


In [95]:
df = pd.DataFrame({'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]})
df['States']="CA NV AZ".split()
df.set_index('States',inplace=True)
print(df)


          A    B  C
States             
CA      1.0  5.0  1
NV      2.0  NaN  2
AZ      NaN  NaN  3

Pandas 'dropna' method


In [108]:
print("\nDropping any rows with a NaN value\n",'-'*35, sep='')
print(df.dropna(axis=0))
print("\nDropping any column with a NaN value\n",'-'*35, sep='')
print(df.dropna(axis=1))
print("\nDropping a row with a minimum 2 NaN value using 'thresh' parameter\n",'-'*68, sep='')
print(df.dropna(axis=0, thresh=2))


Dropping any rows with a NaN value
-----------------------------------
          A    B  C
States             
CA      1.0  5.0  1

Dropping any column with a NaN value
-----------------------------------
        C
States   
CA      1
NV      2
AZ      3

Dropping a row with a minimum 2 NaN value using 'thresh' parameter
--------------------------------------------------------------------
          A    B  C
States             
CA      1.0  5.0  1
NV      2.0  NaN  2

Pandas 'fillna' method


In [114]:
print("\nFilling values with a default value\n",'-'*35, sep='')
print(df.fillna(value='FILL VALUE'))
print("\nFilling values with a computed value (mean of column A here)\n",'-'*60, sep='')
print(df.fillna(value=df['A'].mean()))


Filling values with a default value
-----------------------------------
                 A           B  C
States                           
CA               1           5  1
NV               2  FILL VALUE  2
AZ      FILL VALUE  FILL VALUE  3

Filling values with a computed value (mean of column A here)
------------------------------------------------------------
          A    B  C
States             
CA      1.0  5.0  1
NV      2.0  1.5  2
AZ      1.5  1.5  3

GroupBy method


In [151]:
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df


Out[151]:
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350

In [176]:
byComp = df.groupby('Company')
print("\nGrouping by 'Company' column and listing mean sales\n",'-'*55, sep='')
print(byComp.mean())
print("\nGrouping by 'Company' column and listing sum of sales\n",'-'*55, sep='')
print(byComp.sum())
# Note dataframe conversion of the series and transpose
print("\nAll in one line of command (Stats for 'FB')\n",'-'*65, sep='')
print(pd.DataFrame(df.groupby('Company').describe().loc['FB']).transpose())
print("\nSame type of extraction with little different command\n",'-'*68, sep='')
print(df.groupby('Company').describe().loc[['GOOG', 'MSFT']])


Grouping by 'Company' column and listing mean sales
-------------------------------------------------------
         Sales
Company       
FB       296.5
GOOG     160.0
MSFT     232.0

Grouping by 'Company' column and listing sum of sales
-------------------------------------------------------
         Sales
Company       
FB         593
GOOG       320
MSFT       464

All in one line of command (Stats for 'FB')
-----------------------------------------------------------------
   Sales                                                       
   count   mean        std    min     25%    50%     75%    max
FB   2.0  296.5  75.660426  243.0  269.75  296.5  323.25  350.0

Same type of extraction with little different command
--------------------------------------------------------------------
        Sales                                                      
        count   mean         std    min    25%    50%    75%    max
Company                                                            
GOOG      2.0  160.0   56.568542  120.0  140.0  160.0  180.0  200.0
MSFT      2.0  232.0  152.735065  124.0  178.0  232.0  286.0  340.0

Merging, Joining, Concatenating

Concatenation


In [188]:
# Creating data frames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8,9,10,11])

print("\nThe DataFrame number 1\n",'-'*30, sep='')
print(df1)
print("\nThe DataFrame number 2\n",'-'*30, sep='')
print(df2)
print("\nThe DataFrame number 3\n",'-'*30, sep='')
print(df3)


The DataFrame number 1
------------------------------
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

The DataFrame number 2
------------------------------
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

The DataFrame number 3
------------------------------
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

In [189]:
df_cat1 = pd.concat([df1,df2,df3], axis=0)
print("\nAfter concatenation along row\n",'-'*30, sep='')
print(df_cat1)

df_cat2 = pd.concat([df1,df2,df3], axis=1)
print("\nAfter concatenation along column\n",'-'*60, sep='')
print(df_cat2)
df_cat2.fillna(value=0, inplace=True)
print("\nAfter filling missing values with zero\n",'-'*60, sep='')
print(df_cat2)


After concatenation along row
------------------------------
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

After concatenation along column
------------------------------------------------------------
      A    B    C    D    A    B    C    D    A    B    C    D
0    A0   B0   C0   D0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1    A1   B1   C1   D1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2    A2   B2   C2   D2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3    A3   B3   C3   D3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4   NaN  NaN  NaN  NaN   A4   B4   C4   D4  NaN  NaN  NaN  NaN
5   NaN  NaN  NaN  NaN   A5   B5   C5   D5  NaN  NaN  NaN  NaN
6   NaN  NaN  NaN  NaN   A6   B6   C6   D6  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN   A7   B7   C7   D7  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A8   B8   C8   D8
9   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A9   B9   C9   D9
10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A10  B10  C10  D10
11  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A11  B11  C11  D11

After filling missing values with zero
------------------------------------------------------------
     A   B   C   D   A   B   C   D    A    B    C    D
0   A0  B0  C0  D0   0   0   0   0    0    0    0    0
1   A1  B1  C1  D1   0   0   0   0    0    0    0    0
2   A2  B2  C2  D2   0   0   0   0    0    0    0    0
3   A3  B3  C3  D3   0   0   0   0    0    0    0    0
4    0   0   0   0  A4  B4  C4  D4    0    0    0    0
5    0   0   0   0  A5  B5  C5  D5    0    0    0    0
6    0   0   0   0  A6  B6  C6  D6    0    0    0    0
7    0   0   0   0  A7  B7  C7  D7    0    0    0    0
8    0   0   0   0   0   0   0   0   A8   B8   C8   D8
9    0   0   0   0   0   0   0   0   A9   B9   C9   D9
10   0   0   0   0   0   0   0   0  A10  B10  C10  D10
11   0   0   0   0   0   0   0   0  A11  B11  C11  D11

Merging by a common 'key'

The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together.


In [183]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

print("\nThe DataFrame 'left'\n",'-'*30, sep='')
print(left)
print("\nThe DataFrame 'right'\n",'-'*30, sep='')
print(right)


The DataFrame 'left'
------------------------------
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3

The DataFrame 'right'
------------------------------
    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3

In [185]:
merge1= pd.merge(left,right,how='inner',on='key')
print("\nAfter simple merging with 'inner' method\n",'-'*50, sep='')
print(merge1)


After simple merging with 'inner' method
--------------------------------------------------
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3

Merging on a set of keys


In [190]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})

In [193]:
left


Out[193]:
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K0 K1
2 A2 B2 K1 K0
3 A3 B3 K2 K1

In [194]:
right


Out[194]:
C D key1 key2
0 C0 D0 K0 K0
1 C1 D1 K1 K0
2 C2 D2 K1 K0
3 C3 D3 K2 K0

In [191]:
pd.merge(left, right, on=['key1', 'key2'])


Out[191]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2

In [192]:
pd.merge(left, right, how='outer',on=['key1', 'key2'])


Out[192]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3

In [195]:
pd.merge(left, right, how='left',on=['key1', 'key2'])


Out[195]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN

In [196]:
pd.merge(left, right, how='right',on=['key1', 'key2'])


Out[196]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3

Joining

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single DataFrame based on 'index keys'.


In [197]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [198]:
left


Out[198]:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2

In [199]:
right


Out[199]:
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3

In [200]:
left.join(right)


Out[200]:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2

In [201]:
left.join(right, how='outer')


Out[201]:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3

Useful operations

head() and unique values

  • head()
  • unique()
  • nunique()
  • value_count()

In [232]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                   'col2':[444,555,666,444,333,222,666,777,666,555],
                   'col3':'aaa bb c dd eeee fff gg h iii j'.split()})
df


Out[232]:
col1 col2 col3
0 1 444 aaa
1 2 555 bb
2 3 666 c
3 4 444 dd
4 5 333 eeee
5 6 222 fff
6 7 666 gg
7 8 777 h
8 9 666 iii
9 10 555 j

In [233]:
print("\nMethod head() is for showing first few entries\n",'-'*50, sep='')
df.head()


Method head() is for showing first few entries
--------------------------------------------------
Out[233]:
col1 col2 col3
0 1 444 aaa
1 2 555 bb
2 3 666 c
3 4 444 dd
4 5 333 eeee

In [239]:
print("\nFinding unique values in 'col2'\n",'-'*40, sep='') # Note 'unique' method applies to pd.series only
print(df['col2'].unique())


Finding unique values in 'col2'
----------------------------------------
[444 555 666 333 222 777]

In [238]:
print("\nFinding number of unique values in 'col2'\n",'-'*45, sep='')
print(df['col2'].nunique())


Finding number of unique values in 'col2'
---------------------------------------------
6

In [237]:
print("\nTable of unique values in 'col2'\n",'-'*40, sep='')
t1=df['col2'].value_counts()
print(t1)


Table of unique values in 'col2'
----------------------------------------
666    3
444    2
555    2
222    1
333    1
777    1
Name: col2, dtype: int64

Applying functions

Pandas work with 'apply' method to accept any user-defined function


In [229]:
# Define a function
def testfunc(x):
    if (x> 500):
        return (10*np.log10(x))
    else:
        return (x/10)

In [286]:
df['FuncApplied'] = df['col2'].apply(testfunc)
print(df)


   col1  col2  col3  FuncApplied
0     1   444   aaa    44.400000
1     2   555    bb    27.442930
2     3   666     c    28.234742
3     4   444    dd    44.400000
4     5   333  eeee    33.300000
5     6   222   fff    22.200000
6     7   666    gg    28.234742
7     8   777     h    28.904210
8     9   666   iii    28.234742
9    10   555     j    27.442930

Apply works with built-in function too!


In [287]:
df['col3length']= df['col3'].apply(len)
print(df)


   col1  col2  col3  FuncApplied  col3length
0     1   444   aaa    44.400000           3
1     2   555    bb    27.442930           2
2     3   666     c    28.234742           1
3     4   444    dd    44.400000           2
4     5   333  eeee    33.300000           4
5     6   222   fff    22.200000           3
6     7   666    gg    28.234742           2
7     8   777     h    28.904210           1
8     9   666   iii    28.234742           3
9    10   555     j    27.442930           1

Combine 'apply' with lambda expession for in-line calculations


In [265]:
df['FuncApplied'].apply(lambda x: np.sqrt(x))


Out[265]:
0    6.663332
1    5.238600
2    5.313637
3    6.663332
4    5.770615
5    4.711688
6    5.313637
7    5.376264
8    5.313637
9    5.238600
Name: FuncApplied, dtype: float64

Standard statistical functions directly apply to columns


In [267]:
print("\nSum of the column 'FuncApplied' is: ",df['FuncApplied'].sum())
print("Mean of the column 'FuncApplied' is: ",df['FuncApplied'].mean())
print("Std dev of the column 'FuncApplied' is: ",df['FuncApplied'].std())
print("Min and max of the column 'FuncApplied' are: ",df['FuncApplied'].min(),"and",df['FuncApplied'].max())


Sum of the column 'FuncApplied' is:  312.7942967255717
Mean of the column 'FuncApplied' is:  31.27942967255717
Std dev of the column 'FuncApplied' is:  7.4065059423607895
Min and max of the column 'FuncApplied' are:  22.2 and 44.4

Deletion, sorting, list of column and row names

Getting the names of the columns


In [278]:
print("\nName of columns\n",'-'*20, sep='')
print(df.columns)
l = list(df.columns)
print("\nColumn names in a list of strings for later manipulation:",l)


Name of columns
--------------------
Index(['col1', 'col2', 'col3', 'FuncApplied', 'col3length'], dtype='object')

Column names in a list of strings for later manipulation: ['col1', 'col2', 'col3', 'FuncApplied', 'col3length']

Deleting last column by 'del' command (this affects the dataframe immediately, unlike drop method
----------------------------------------------------------------------------------------------------
   col1  col2  col3  FuncApplied
0     1   444   aaa    44.400000
1     2   555    bb    27.442930
2     3   666     c    28.234742
3     4   444    dd    44.400000
4     5   333  eeee    33.300000
5     6   222   fff    22.200000
6     7   666    gg    28.234742
7     8   777     h    28.904210
8     9   666   iii    28.234742
9    10   555     j    27.442930

Deletion by 'del' command # This affects the dataframe immediately, unlike drop method.


In [289]:
print("\nDeleting last column by 'del' command\n",'-'*50, sep='')
del df['col3length']
print(df)
df['col3length']= df['col3'].apply(len)


Deleting last column by 'del' command
--------------------------------------------------
   col1  col2  col3  FuncApplied
0     1   444   aaa    44.400000
1     2   555    bb    27.442930
2     3   666     c    28.234742
3     4   444    dd    44.400000
4     5   333  eeee    33.300000
5     6   222   fff    22.200000
6     7   666    gg    28.234742
7     8   777     h    28.904210
8     9   666   iii    28.234742
9    10   555     j    27.442930

Sorting and Ordering a DataFrame


In [290]:
df.sort_values(by='col2') #inplace=False by default


Out[290]:
col1 col2 col3 FuncApplied col3length
5 6 222 fff 22.200000 3
4 5 333 eeee 33.300000 4
0 1 444 aaa 44.400000 3
3 4 444 dd 44.400000 2
1 2 555 bb 27.442930 2
9 10 555 j 27.442930 1
2 3 666 c 28.234742 1
6 7 666 gg 28.234742 2
8 9 666 iii 28.234742 3
7 8 777 h 28.904210 1

In [292]:
df.sort_values(by='FuncApplied',ascending=False) #inplace=False by default


Out[292]:
col1 col2 col3 FuncApplied col3length
0 1 444 aaa 44.400000 3
3 4 444 dd 44.400000 2
4 5 333 eeee 33.300000 4
7 8 777 h 28.904210 1
2 3 666 c 28.234742 1
6 7 666 gg 28.234742 2
8 9 666 iii 28.234742 3
1 2 555 bb 27.442930 2
9 10 555 j 27.442930 1
5 6 222 fff 22.200000 3

Find Null Values or Check for Null Values


In [294]:
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()


Out[294]:
col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz

In [295]:
df.isnull()


Out[295]:
col1 col2 col3
0 False True False
1 False False False
2 False False False
3 True False False

In [296]:
df.fillna('FILL')


Out[296]:
col1 col2 col3
0 1 FILL abc
1 2 555 def
2 3 666 ghi
3 FILL 444 xyz

Pivot Table


In [297]:
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)
df


Out[297]:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1

In [303]:
# Index out of 'A' and 'B', columns from 'C', actual numerical values from 'D'
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])


Out[303]:
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN

In [302]:
# Index out of 'A' and 'B', columns from 'C', actual numerical values from 'D'
df.pivot_table(values='D',index=['A', 'B'],columns=['C'], fill_value='FILLED')


Out[302]:
C x y
A B
bar one 4 1
two FILLED 5
foo one 1 3
two 2 FILLED

Pandas built-in Visualization

Import packages


In [304]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Read in the CSV data file


In [308]:
df1=pd.read_csv('df1.csv', index_col=0)
df1.head()


Out[308]:
A B C D
2000-01-01 1.339091 -0.163643 -0.646443 1.041233
2000-01-02 -0.774984 0.137034 -0.882716 -2.253382
2000-01-03 -0.921037 -0.482943 -0.417100 0.478638
2000-01-04 -1.738808 -0.072973 0.056517 0.015085
2000-01-05 -0.905980 1.778576 0.381918 0.291436

In [320]:
df2=pd.read_csv('df2')
df2.head()


Out[320]:
a b c d
0 0.039762 0.218517 0.103423 0.957904
1 0.937288 0.041567 0.899125 0.977680
2 0.780504 0.008948 0.557808 0.797510
3 0.672717 0.247870 0.264071 0.444358
4 0.053829 0.520124 0.552264 0.190008

Histogram of a single column


In [310]:
df1['A'].hist()


Out[310]:
<matplotlib.axes._subplots.AxesSubplot at 0x2998d34be10>

Histogram with a different set of arguments (list of columns, bins, figure size, etc)


In [312]:
df1.hist(column=['B','C'],bins=20,figsize=(10,4))


Out[312]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002998D985E48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002998D9E6F60>]], dtype=object)

Histogram with generic plot method of Pandas


In [325]:
df1.plot(kind='hist', bins=30, grid=True, figsize=(12,7))


Out[325]:
<matplotlib.axes._subplots.AxesSubplot at 0x2999061ac18>

Area plot


In [323]:
import seaborn as sns #Plot style will change to Seaborn package style from now on
df2.plot.area(alpha=0.4)


Out[323]:
<matplotlib.axes._subplots.AxesSubplot at 0x2998db84780>

Bar plot (with and without stacking)


In [326]:
df2.plot.bar()


Out[326]:
<matplotlib.axes._subplots.AxesSubplot at 0x29990bceba8>

In [328]:
df2.plot.bar(stacked=True)


Out[328]:
<matplotlib.axes._subplots.AxesSubplot at 0x29990f4bda0>

Lineplot


In [331]:
df1.plot.line(x=df1.index,y=['B','C'],figsize=(12,4),lw=1) # Note matplotlib arguments like 'lw' and 'figsize'


Out[331]:
<matplotlib.axes._subplots.AxesSubplot at 0x299911693c8>

Scatterplot


In [332]:
df1.plot.scatter(x='A',y='B',figsize=(12,8))


Out[332]:
<matplotlib.axes._subplots.AxesSubplot at 0x299921d1748>

In [335]:
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm',figsize=(12,8)) # Color of the scatter dots set based on column C


Out[335]:
<matplotlib.axes._subplots.AxesSubplot at 0x299923aba20>

In [344]:
df1.plot.scatter(x='A',y='B',s=10*np.exp(df1['C']),c='C',figsize=(12,8)) # Size of the dots set based on column C


Out[344]:
<matplotlib.axes._subplots.AxesSubplot at 0x299934f4208>

Boxplot


In [346]:
df2.plot.box()


Out[346]:
<matplotlib.axes._subplots.AxesSubplot at 0x299939b5860>

Hexagonal bin plot for bivariate data


In [348]:
df=pd.DataFrame(data=np.random.randn(1000,2),columns=['A','B'])
df.head()


Out[348]:
A B
0 1.017515 -0.420704
1 0.576828 1.814651
2 -0.261503 -0.254697
3 0.974428 0.030171
4 0.605864 0.079012

In [350]:
df.plot.hexbin(x='A',y='B',gridsize=30,cmap='coolwarm')


Out[350]:
<matplotlib.axes._subplots.AxesSubplot at 0x29994d46eb8>

Kernel density estimation


In [352]:
df2.plot.density(lw=3)


Out[352]:
<matplotlib.axes._subplots.AxesSubplot at 0x29994f5b6d8>

In [ ]: