Pandas Practice

Series

Loading packages and initializations



In [2]:

    
import numpy as np
import pandas as pd

labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}

print ("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)









    



Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}

Creating a Series (Pandas class)

From numerical data only
From numerical data and corresponding index (row labels)
From NumPy array as the source of numerical data
Just using a pre-defined dictionary



In [3]:

    
pd.Series(data=my_data) # Output looks very similar to a NumPy array









    Out[3]:





0    10
1    20
2    30
dtype: int64



In [4]:

    
pd.Series(data=my_data, index=labels) # Note the extra information about index









    Out[4]:





a    10
b    20
c    30
dtype: int64



In [5]:

    
# Inputs are in order of the expected parameters (not explicitly named), NumPy array is used for data
pd.Series(arr, labels)









    Out[5]:





a    10
b    20
c    30
dtype: int32



In [6]:

    
pd.Series(d) # Using a pre-defined Dictionary object









    Out[6]:





a    10
b    20
c    30
dtype: int64

What type of values can a Pandas Series hold?



In [7]:

    
print ("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(arr))
print ("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))
print ("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))
print ("\nHolding objects from a dictionary\n",'-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))









    



Holding numerical data
-------------------------
0    10
1    20
2    30
dtype: int32

Holding text labels
--------------------
0    a
1    b
2    c
dtype: object

Holding functions
--------------------
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

Holding objects from a dictionary
----------------------------------------
0    <built-in method keys of dict object at 0x0000...
1    <built-in method items of dict object at 0x000...
2    <built-in method values of dict object at 0x00...
dtype: object

Indexing and slicing



In [8]:

    
ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1,2,5,4],['CA', 'OR', 'NV', 'AZ'])

print ("\nIndexing by name of the item/object (string identifier)\n",'-'*56, sep='')
print("Value for CA in ser1:", ser1['CA'])
print("Value for AZ in ser1:", ser1['AZ'])
print("Value for NV in ser2:", ser2['NV'])

print ("\nIndexing by number (positional value in the series)\n",'-'*52, sep='')
print("Value for CA in ser1:", ser1[0])
print("Value for AZ in ser1:", ser1[3])
print("Value for NV in ser2:", ser2[2])

print ("\nIndexing by a range\n",'-'*25, sep='')
print ("Value for OR, CO, and AZ in ser1:\n", ser1[1:4], sep='')









    



Indexing by name of the item/object (string identifier)
--------------------------------------------------------
Value for CA in ser1: 1
Value for AZ in ser1: 4
Value for NV in ser2: 5

Indexing by number (positional value in the series)
----------------------------------------------------
Value for CA in ser1: 1
Value for AZ in ser1: 4
Value for NV in ser2: 5

Indexing by a range
-------------------------
Value for OR, CO, and AZ in ser1:
OR    2
CO    3
AZ    4
dtype: int64

Adding/Merging two series with common indices



In [9]:

    
ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1,2,5,4],['CA', 'OR', 'NV', 'AZ'])
ser3 = ser1+ser2

print ("\nAfter adding the two series, the result looks like this...\n",'-'*59, sep='')
print(ser3)
print("\nPython tries to add values where it finds common index name, and puts NaN where indices are missing\n")

print ("\nThe idea works even for multiplication...\n",'-'*43, sep='')
print (ser1*ser2)

print ("\nOr even for combination of mathematical operations!\n",'-'*53, sep='')
print (np.exp(ser1)+np.log10(ser2))









    



After adding the two series, the result looks like this...
-----------------------------------------------------------
AZ    8.0
CA    2.0
CO    NaN
NV    NaN
OR    4.0
dtype: float64

Python tries to add values where it finds common index name, and puts NaN where indices are missing


The idea works even for multiplication...
-------------------------------------------
AZ    16.0
CA     1.0
CO     NaN
NV     NaN
OR     4.0
dtype: float64

Or even for combination of mathematical operations!
-----------------------------------------------------
AZ    55.200210
CA     2.718282
CO          NaN
NV          NaN
OR     7.690086
dtype: float64

DataFrame (the Real Meat!)



In [10]:

    
from numpy.random import randn as rn

Creating and accessing DataFrame

Indexing
Adding and deleting rows and columns
Subsetting DataFrame



In [11]:

    
np.random.seed(101)
matrix_data = rn(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)









    



The data frame looks like
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Indexing and slicing (columns)

By bracket method
By DOT method (NOT recommended)



In [12]:

    
print("\nThe 'X' column\n",'-'*25, sep='')
print(df['X'])
print("\nType of the column: ", type(df['X']), sep='')
print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')
print(df[['X','Z']])
print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')
print ("\nSo, for more than one column, the object turns into a DataFrame")









    



The 'X' column
-------------------------
A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64

Type of the column: <class 'pandas.core.series.Series'>

The 'X' and 'Z' columns indexed by passing a list
-------------------------------------------------------
          X         Z
A  0.628133  0.503826
B -0.319318  0.605965
C  0.740122 -0.589001
D -0.758872  0.955057
E  1.978757  0.683509

Type of the pair of columns: <class 'pandas.core.frame.DataFrame'>

So, for more than one column, the object turns into a DataFrame



In [13]:

    
print("\nThe 'X' column accessed by DOT method (NOT recommended)\n",'-'*55, sep='')
print(df.X)









    



The 'X' column accessed by DOT method (NOT recommended)
-------------------------------------------------------
A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64

Creating and deleting a (new) column (or row)



In [14]:

    
print("\nA column is created by assigning it in relation to an existing column\n",'-'*75, sep='')
df['New'] = df['X']+df['Z']
df['New (Sum of X and Z)'] = df['X']+df['Z']
print(df)
print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')
df = df.drop('New', axis=1) # Notice the axis=1 option, axis = 0 is default, so one has to change it to 1
print(df)
df1=df.drop('A')
print("\nA row (index) is dropped by using df.drop() method and axis=0\n",'-'*65, sep='')
print(df1)
print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)









    



A column is created by assigning it in relation to an existing column
---------------------------------------------------------------------------
          W         X         Y         Z       New  New (Sum of X and Z)
A  2.706850  0.628133  0.907969  0.503826  1.131958              1.131958
B  0.651118 -0.319318 -0.848077  0.605965  0.286647              0.286647
C -2.018168  0.740122  0.528813 -0.589001  0.151122              0.151122
D  0.188695 -0.758872 -0.933237  0.955057  0.196184              0.196184
E  0.190794  1.978757  2.605967  0.683509  2.662266              2.662266

A column is dropped by using df.drop() method
-------------------------------------------------------
          W         X         Y         Z  New (Sum of X and Z)
A  2.706850  0.628133  0.907969  0.503826              1.131958
B  0.651118 -0.319318 -0.848077  0.605965              0.286647
C -2.018168  0.740122  0.528813 -0.589001              0.151122
D  0.188695 -0.758872 -0.933237  0.955057              0.196184
E  0.190794  1.978757  2.605967  0.683509              2.662266

A row (index) is dropped by using df.drop() method and axis=0
-----------------------------------------------------------------
          W         X         Y         Z  New (Sum of X and Z)
B  0.651118 -0.319318 -0.848077  0.605965              0.286647
C -2.018168  0.740122  0.528813 -0.589001              0.151122
D  0.188695 -0.758872 -0.933237  0.955057              0.196184
E  0.190794  1.978757  2.605967  0.683509              2.662266

An in-place change can be done by making inplace=True in the drop method
---------------------------------------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Selecting/indexing Rows

Label-based 'loc' method
Index (numeric) 'iloc' method



In [15]:

    
print("\nLabel-based 'loc' method can be used for selecting row(s)\n",'-'*60, sep='')
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nIndex position based 'iloc' method can be used for selecting row(s)\n",'-'*70, sep='')
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])









    



Label-based 'loc' method can be used for selecting row(s)
------------------------------------------------------------

Single row

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Multiple rows

          W         X         Y         Z
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001

Index position based 'iloc' method can be used for selecting row(s)
----------------------------------------------------------------------

Single row

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Multiple rows

          W         X         Y         Z
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001

Subsetting DataFrame



In [16]:

    
np.random.seed(101)
matrix_data = rn(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)

print("\nThe DatFrame\n",'-'*45, sep='')
print(df)
print("\nElement at row 'B' and column 'Y' is\n") 
print(df.loc['B','Y'])
print("\nSubset comprising of rows B and D, and columns W and Y, is\n")
df.loc[['B','D'],['W','Y']]









    



The DatFrame
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Element at row 'B' and column 'Y' is

-0.848076983404

Subset comprising of rows B and D, and columns W and Y, is







    Out[16]:







  
    
      
      W
      Y
    
  
  
    
      B
      0.651118
      -0.848077
    
    
      D
      0.188695
      -0.933237

Conditional selection, index (re)setting, multi-index

Basic idea of conditional check and Boolean DataFrame



In [17]:

    
print("\nThe DataFrame\n",'-'*45, sep='')
print(df)
print("\nBoolean DataFrame(s) where we are checking if the values are greater than 0\n",'-'*75, sep='')
print(df>0)
print("\n")
print(df.loc[['A','B','C']]>0)
booldf = df>0
print("\nDataFrame indexed by boolean dataframe\n",'-'*45, sep='')
print(df[booldf])









    



The DataFrame
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Boolean DataFrame(s) where we are checking if the values are greater than 0
---------------------------------------------------------------------------
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False

DataFrame indexed by boolean dataframe
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509

Passing Boolean series to conditionally subset the DataFrame



In [27]:

    
matrix_data = np.matrix('22,66,140;42,70,148;30,62,125;35,68,160;25,62,152')
row_labels = ['A','B','C','D','E']
column_headings = ['Age', 'Height', 'Weight']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nA new DataFrame\n",'-'*25, sep='')
print(df)
print("\nRows with Height > 65 inch\n",'-'*35, sep='')
print(df[df['Height']>65])

booldf1 = df['Height']>65
booldf2 = df['Weight']>145
print("\nRows with Height > 65 inch and Weight >145 lbs\n",'-'*55, sep='')
print(df[(booldf1) & (booldf2)])

print("\nDataFrame with only Age and Weight columns whose Height > 65 inch\n",'-'*68, sep='')
print(df[booldf1][['Age','Weight']])









    



A new DataFrame
-------------------------
   Age  Height  Weight
A   22      66     140
B   42      70     148
C   30      62     125
D   35      68     160
E   25      62     152

Rows with Height > 65 inch
-----------------------------------
   Age  Height  Weight
A   22      66     140
B   42      70     148
D   35      68     160

Rows with Height > 65 inch and Weight >145 lbs
-------------------------------------------------------
   Age  Height  Weight
B   42      70     148
D   35      68     160

DataFrame with only Age and Weight columns whose Height > 65 inch
--------------------------------------------------------------------
   Age  Weight
A   22     140
B   42     148
D   35     160

Re-setting and Setting Index



In [50]:

    
matrix_data = np.matrix('22,66,140;42,70,148;30,62,125;35,68,160;25,62,152')
row_labels = ['A','B','C','D','E']
column_headings = ['Age', 'Height', 'Weight']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe DataFrame\n",'-'*25, sep='')
print(df)
print("\nAfter resetting index\n",'-'*35, sep='')
print(df.reset_index())
print("\nAfter resetting index with 'drop' option TRUE\n",'-'*45, sep='')
print(df.reset_index(drop=True))
print("\nAdding a new column 'Profession'\n",'-'*45, sep='')
df['Profession'] = "Student Teacher Engineer Doctor Nurse".split()
print(df)
print("\nSetting 'Profession' column as index\n",'-'*45, sep='')
print (df.set_index('Profession'))









    



The DataFrame
-------------------------
   Age  Height  Weight
A   22      66     140
B   42      70     148
C   30      62     125
D   35      68     160
E   25      62     152

After resetting index
-----------------------------------
  index  Age  Height  Weight
0     A   22      66     140
1     B   42      70     148
2     C   30      62     125
3     D   35      68     160
4     E   25      62     152

After resetting index with 'drop' option TRUE
---------------------------------------------
   Age  Height  Weight
0   22      66     140
1   42      70     148
2   30      62     125
3   35      68     160
4   25      62     152

Adding a new column 'Profession'
---------------------------------------------
   Age  Height  Weight Profession
A   22      66     140    Student
B   42      70     148    Teacher
C   30      62     125   Engineer
D   35      68     160     Doctor
E   25      62     152      Nurse

Setting 'Profession' column as index
---------------------------------------------
            Age  Height  Weight
Profession                     
Student      22      66     140
Teacher      42      70     148
Engineer     30      62     125
Doctor       35      68     160
Nurse        25      62     152

Multi-indexing



In [87]:

    
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))

print("\nTuple pairs after the zip and list command\n",'-'*45, sep='')
print(hier_index)
hier_index = pd.MultiIndex.from_tuples(hier_index)
print("\nIndex hierarchy\n",'-'*25, sep='')
print(hier_index)
print("\nIndex hierarchy type\n",'-'*25, sep='')
print(type(hier_index))

print("\nCreating DataFrame with multi-index\n",'-'*37, sep='')
np.random.seed(101)
df1 = pd.DataFrame(data=np.round(rn(6,3),2), index= hier_index, columns= ['A','B','C'])
print(df1)

print("\nSubsetting multi-index DataFrame using two 'loc' methods\n",'-'*60, sep='')
print(df1.loc['G2'].loc[[1,3]][['B','C']])

print("\nNaming the indices by 'index.names' method\n",'-'*45, sep='')
df1.index.names=['Outer', 'Inner']
print(df1)









    



Tuple pairs after the zip and list command
---------------------------------------------
[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

Index hierarchy
-------------------------
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

Index hierarchy type
-------------------------
<class 'pandas.core.indexes.multi.MultiIndex'>

Creating DataFrame with multi-index
-------------------------------------
         A     B     C
G1 1  2.71  0.63  0.91
   2  0.50  0.65 -0.32
   3 -0.85  0.61 -2.02
G2 1  0.74  0.53 -0.59
   2  0.19 -0.76 -0.93
   3  0.96  0.19  1.98

Subsetting multi-index DataFrame using two 'loc' methods
------------------------------------------------------------
      B     C
1  0.53 -0.59
3  0.19  1.98

Naming the indices by 'index.names' method
---------------------------------------------
                A     B     C
Outer Inner                  
G1    1      2.71  0.63  0.91
      2      0.50  0.65 -0.32
      3     -0.85  0.61 -2.02
G2    1      0.74  0.53 -0.59
      2      0.19 -0.76 -0.93
      3      0.96  0.19  1.98

Cross-section ('XS') command



In [109]:

    
print("\nGrabbing a cross-section from outer level\n",'-'*45, sep='')
print(df1.xs('G1'))
print("\nGrabbing a cross-section from inner level (for all outer levels)\n",'-'*65, sep='')
print(df1.xs(2,level='Inner'))









    



Grabbing a cross-section from outer level
---------------------------------------------
          A     B     C
Inner                  
1      2.71  0.63  0.91
2      0.50  0.65 -0.32
3     -0.85  0.61 -2.02

Grabbing a cross-section from inner level (for all outer levels)
-----------------------------------------------------------------
          A     B     C
Outer                  
G1     0.50  0.65 -0.32
G2     0.19 -0.76 -0.93

Missing Values



In [95]:

    
df = pd.DataFrame({'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]})
df['States']="CA NV AZ".split()
df.set_index('States',inplace=True)
print(df)









    



          A    B  C
States             
CA      1.0  5.0  1
NV      2.0  NaN  2
AZ      NaN  NaN  3

Pandas 'dropna' method



In [108]:

    
print("\nDropping any rows with a NaN value\n",'-'*35, sep='')
print(df.dropna(axis=0))
print("\nDropping any column with a NaN value\n",'-'*35, sep='')
print(df.dropna(axis=1))
print("\nDropping a row with a minimum 2 NaN value using 'thresh' parameter\n",'-'*68, sep='')
print(df.dropna(axis=0, thresh=2))









    



Dropping any rows with a NaN value
-----------------------------------
          A    B  C
States             
CA      1.0  5.0  1

Dropping any column with a NaN value
-----------------------------------
        C
States   
CA      1
NV      2
AZ      3

Dropping a row with a minimum 2 NaN value using 'thresh' parameter
--------------------------------------------------------------------
          A    B  C
States             
CA      1.0  5.0  1
NV      2.0  NaN  2

Pandas 'fillna' method



In [114]:

    
print("\nFilling values with a default value\n",'-'*35, sep='')
print(df.fillna(value='FILL VALUE'))
print("\nFilling values with a computed value (mean of column A here)\n",'-'*60, sep='')
print(df.fillna(value=df['A'].mean()))









    



Filling values with a default value
-----------------------------------
                 A           B  C
States                           
CA               1           5  1
NV               2  FILL VALUE  2
AZ      FILL VALUE  FILL VALUE  3

Filling values with a computed value (mean of column A here)
------------------------------------------------------------
          A    B  C
States             
CA      1.0  5.0  1
NV      2.0  1.5  2
AZ      1.5  1.5  3

GroupBy method



In [151]:

    
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df



In [176]:

    
byComp = df.groupby('Company')
print("\nGrouping by 'Company' column and listing mean sales\n",'-'*55, sep='')
print(byComp.mean())
print("\nGrouping by 'Company' column and listing sum of sales\n",'-'*55, sep='')
print(byComp.sum())
# Note dataframe conversion of the series and transpose
print("\nAll in one line of command (Stats for 'FB')\n",'-'*65, sep='')
print(pd.DataFrame(df.groupby('Company').describe().loc['FB']).transpose())
print("\nSame type of extraction with little different command\n",'-'*68, sep='')
print(df.groupby('Company').describe().loc[['GOOG', 'MSFT']])









    



Grouping by 'Company' column and listing mean sales
-------------------------------------------------------
         Sales
Company       
FB       296.5
GOOG     160.0
MSFT     232.0

Grouping by 'Company' column and listing sum of sales
-------------------------------------------------------
         Sales
Company       
FB         593
GOOG       320
MSFT       464

All in one line of command (Stats for 'FB')
-----------------------------------------------------------------
   Sales                                                       
   count   mean        std    min     25%    50%     75%    max
FB   2.0  296.5  75.660426  243.0  269.75  296.5  323.25  350.0

Same type of extraction with little different command
--------------------------------------------------------------------
        Sales                                                      
        count   mean         std    min    25%    50%    75%    max
Company                                                            
GOOG      2.0  160.0   56.568542  120.0  140.0  160.0  180.0  200.0
MSFT      2.0  232.0  152.735065  124.0  178.0  232.0  286.0  340.0

Merging, Joining, Concatenating

Concatenation



In [188]:

    
# Creating data frames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8,9,10,11])

print("\nThe DataFrame number 1\n",'-'*30, sep='')
print(df1)
print("\nThe DataFrame number 2\n",'-'*30, sep='')
print(df2)
print("\nThe DataFrame number 3\n",'-'*30, sep='')
print(df3)









    



The DataFrame number 1
------------------------------
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

The DataFrame number 2
------------------------------
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

The DataFrame number 3
------------------------------
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11



In [189]:

    
df_cat1 = pd.concat([df1,df2,df3], axis=0)
print("\nAfter concatenation along row\n",'-'*30, sep='')
print(df_cat1)

df_cat2 = pd.concat([df1,df2,df3], axis=1)
print("\nAfter concatenation along column\n",'-'*60, sep='')
print(df_cat2)
df_cat2.fillna(value=0, inplace=True)
print("\nAfter filling missing values with zero\n",'-'*60, sep='')
print(df_cat2)









    



After concatenation along row
------------------------------
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

After concatenation along column
------------------------------------------------------------
      A    B    C    D    A    B    C    D    A    B    C    D
0    A0   B0   C0   D0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1    A1   B1   C1   D1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2    A2   B2   C2   D2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3    A3   B3   C3   D3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4   NaN  NaN  NaN  NaN   A4   B4   C4   D4  NaN  NaN  NaN  NaN
5   NaN  NaN  NaN  NaN   A5   B5   C5   D5  NaN  NaN  NaN  NaN
6   NaN  NaN  NaN  NaN   A6   B6   C6   D6  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN   A7   B7   C7   D7  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A8   B8   C8   D8
9   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A9   B9   C9   D9
10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A10  B10  C10  D10
11  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A11  B11  C11  D11

After filling missing values with zero
------------------------------------------------------------
     A   B   C   D   A   B   C   D    A    B    C    D
0   A0  B0  C0  D0   0   0   0   0    0    0    0    0
1   A1  B1  C1  D1   0   0   0   0    0    0    0    0
2   A2  B2  C2  D2   0   0   0   0    0    0    0    0
3   A3  B3  C3  D3   0   0   0   0    0    0    0    0
4    0   0   0   0  A4  B4  C4  D4    0    0    0    0
5    0   0   0   0  A5  B5  C5  D5    0    0    0    0
6    0   0   0   0  A6  B6  C6  D6    0    0    0    0
7    0   0   0   0  A7  B7  C7  D7    0    0    0    0
8    0   0   0   0   0   0   0   0   A8   B8   C8   D8
9    0   0   0   0   0   0   0   0   A9   B9   C9   D9
10   0   0   0   0   0   0   0   0  A10  B10  C10  D10
11   0   0   0   0   0   0   0   0  A11  B11  C11  D11

Merging by a common 'key'

The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together.



In [183]:

    
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

print("\nThe DataFrame 'left'\n",'-'*30, sep='')
print(left)
print("\nThe DataFrame 'right'\n",'-'*30, sep='')
print(right)









    



The DataFrame 'left'
------------------------------
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3

The DataFrame 'right'
------------------------------
    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3



In [185]:

    
merge1= pd.merge(left,right,how='inner',on='key')
print("\nAfter simple merging with 'inner' method\n",'-'*50, sep='')
print(merge1)









    



After simple merging with 'inner' method
--------------------------------------------------
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3

Merging on a set of keys



In [190]:

    
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})



In [193]:

    
left



In [194]:

    
right



In [191]:

    
pd.merge(left, right, on=['key1', 'key2'])



In [192]:

    
pd.merge(left, right, how='outer',on=['key1', 'key2'])



In [195]:

    
pd.merge(left, right, how='left',on=['key1', 'key2'])



In [196]:

    
pd.merge(left, right, how='right',on=['key1', 'key2'])

Joining

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single DataFrame based on 'index keys'.



In [197]:

    
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])



In [198]:

    
left



In [199]:

    
right



In [200]:

    
left.join(right)



In [201]:

    
left.join(right, how='outer')

Useful operations

head() and unique values

head()
unique()
nunique()
value_count()



In [232]:

    
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                   'col2':[444,555,666,444,333,222,666,777,666,555],
                   'col3':'aaa bb c dd eeee fff gg h iii j'.split()})
df



In [233]:

    
print("\nMethod head() is for showing first few entries\n",'-'*50, sep='')
df.head()









    



Method head() is for showing first few entries
--------------------------------------------------






    Out[233]:







  
    
      
      col1
      col2
      col3
    
  
  
    
      0
      1
      444
      aaa
    
    
      1
      2
      555
      bb
    
    
      2
      3
      666
      c
    
    
      3
      4
      444
      dd
    
    
      4
      5
      333
      eeee



In [239]:

    
print("\nFinding unique values in 'col2'\n",'-'*40, sep='') # Note 'unique' method applies to pd.series only
print(df['col2'].unique())









    



Finding unique values in 'col2'
----------------------------------------
[444 555 666 333 222 777]



In [238]:

    
print("\nFinding number of unique values in 'col2'\n",'-'*45, sep='')
print(df['col2'].nunique())









    



Finding number of unique values in 'col2'
---------------------------------------------
6



In [237]:

    
print("\nTable of unique values in 'col2'\n",'-'*40, sep='')
t1=df['col2'].value_counts()
print(t1)









    



Table of unique values in 'col2'
----------------------------------------
666    3
444    2
555    2
222    1
333    1
777    1
Name: col2, dtype: int64

Applying functions

Pandas work with 'apply' method to accept any user-defined function



In [229]:

    
# Define a function
def testfunc(x):
    if (x> 500):
        return (10*np.log10(x))
    else:
        return (x/10)



In [286]:

    
df['FuncApplied'] = df['col2'].apply(testfunc)
print(df)









    



   col1  col2  col3  FuncApplied
0     1   444   aaa    44.400000
1     2   555    bb    27.442930
2     3   666     c    28.234742
3     4   444    dd    44.400000
4     5   333  eeee    33.300000
5     6   222   fff    22.200000
6     7   666    gg    28.234742
7     8   777     h    28.904210
8     9   666   iii    28.234742
9    10   555     j    27.442930

Apply works with built-in function too!



In [287]:

    
df['col3length']= df['col3'].apply(len)
print(df)









    



   col1  col2  col3  FuncApplied  col3length
0     1   444   aaa    44.400000           3
1     2   555    bb    27.442930           2
2     3   666     c    28.234742           1
3     4   444    dd    44.400000           2
4     5   333  eeee    33.300000           4
5     6   222   fff    22.200000           3
6     7   666    gg    28.234742           2
7     8   777     h    28.904210           1
8     9   666   iii    28.234742           3
9    10   555     j    27.442930           1

Combine 'apply' with lambda expession for in-line calculations



In [265]:

    
df['FuncApplied'].apply(lambda x: np.sqrt(x))









    Out[265]:





0    6.663332
1    5.238600
2    5.313637
3    6.663332
4    5.770615
5    4.711688
6    5.313637
7    5.376264
8    5.313637
9    5.238600
Name: FuncApplied, dtype: float64

Standard statistical functions directly apply to columns



In [267]:

    
print("\nSum of the column 'FuncApplied' is: ",df['FuncApplied'].sum())
print("Mean of the column 'FuncApplied' is: ",df['FuncApplied'].mean())
print("Std dev of the column 'FuncApplied' is: ",df['FuncApplied'].std())
print("Min and max of the column 'FuncApplied' are: ",df['FuncApplied'].min(),"and",df['FuncApplied'].max())









    



Sum of the column 'FuncApplied' is:  312.7942967255717
Mean of the column 'FuncApplied' is:  31.27942967255717
Std dev of the column 'FuncApplied' is:  7.4065059423607895
Min and max of the column 'FuncApplied' are:  22.2 and 44.4

Deletion, sorting, list of column and row names

Getting the names of the columns



In [278]:

    
print("\nName of columns\n",'-'*20, sep='')
print(df.columns)
l = list(df.columns)
print("\nColumn names in a list of strings for later manipulation:",l)









    



Name of columns
--------------------
Index(['col1', 'col2', 'col3', 'FuncApplied', 'col3length'], dtype='object')

Column names in a list of strings for later manipulation: ['col1', 'col2', 'col3', 'FuncApplied', 'col3length']

Deleting last column by 'del' command (this affects the dataframe immediately, unlike drop method
----------------------------------------------------------------------------------------------------
   col1  col2  col3  FuncApplied
0     1   444   aaa    44.400000
1     2   555    bb    27.442930
2     3   666     c    28.234742
3     4   444    dd    44.400000
4     5   333  eeee    33.300000
5     6   222   fff    22.200000
6     7   666    gg    28.234742
7     8   777     h    28.904210
8     9   666   iii    28.234742
9    10   555     j    27.442930

Deletion by 'del' command # This affects the dataframe immediately, unlike drop method.



In [289]:

    
print("\nDeleting last column by 'del' command\n",'-'*50, sep='')
del df['col3length']
print(df)
df['col3length']= df['col3'].apply(len)









    



Deleting last column by 'del' command
--------------------------------------------------
   col1  col2  col3  FuncApplied
0     1   444   aaa    44.400000
1     2   555    bb    27.442930
2     3   666     c    28.234742
3     4   444    dd    44.400000
4     5   333  eeee    33.300000
5     6   222   fff    22.200000
6     7   666    gg    28.234742
7     8   777     h    28.904210
8     9   666   iii    28.234742
9    10   555     j    27.442930

Sorting and Ordering a DataFrame



In [290]:

    
df.sort_values(by='col2') #inplace=False by default









    Out[290]:







  
    
      
      col1
      col2
      col3
      FuncApplied
      col3length
    
  
  
    
      5
      6
      222
      fff
      22.200000
      3
    
    
      4
      5
      333
      eeee
      33.300000
      4
    
    
      0
      1
      444
      aaa
      44.400000
      3
    
    
      3
      4
      444
      dd
      44.400000
      2
    
    
      1
      2
      555
      bb
      27.442930
      2
    
    
      9
      10
      555
      j
      27.442930
      1
    
    
      2
      3
      666
      c
      28.234742
      1
    
    
      6
      7
      666
      gg
      28.234742
      2
    
    
      8
      9
      666
      iii
      28.234742
      3
    
    
      7
      8
      777
      h
      28.904210
      1



In [292]:

    
df.sort_values(by='FuncApplied',ascending=False) #inplace=False by default









    Out[292]:







  
    
      
      col1
      col2
      col3
      FuncApplied
      col3length
    
  
  
    
      0
      1
      444
      aaa
      44.400000
      3
    
    
      3
      4
      444
      dd
      44.400000
      2
    
    
      4
      5
      333
      eeee
      33.300000
      4
    
    
      7
      8
      777
      h
      28.904210
      1
    
    
      2
      3
      666
      c
      28.234742
      1
    
    
      6
      7
      666
      gg
      28.234742
      2
    
    
      8
      9
      666
      iii
      28.234742
      3
    
    
      1
      2
      555
      bb
      27.442930
      2
    
    
      9
      10
      555
      j
      27.442930
      1
    
    
      5
      6
      222
      fff
      22.200000
      3

Find Null Values or Check for Null Values



In [294]:

    
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()



In [295]:

    
df.isnull()









    Out[295]:







  
    
      
      col1
      col2
      col3
    
  
  
    
      0
      False
      True
      False
    
    
      1
      False
      False
      False
    
    
      2
      False
      False
      False
    
    
      3
      True
      False
      False



In [296]:

    
df.fillna('FILL')

Pivot Table



In [297]:

    
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)
df



In [303]:

    
# Index out of 'A' and 'B', columns from 'C', actual numerical values from 'D'
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])



In [302]:

    
# Index out of 'A' and 'B', columns from 'C', actual numerical values from 'D'
df.pivot_table(values='D',index=['A', 'B'],columns=['C'], fill_value='FILLED')

Pandas built-in Visualization

Import packages



In [304]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Read in the CSV data file



In [308]:

    
df1=pd.read_csv('df1.csv', index_col=0)
df1.head()



In [320]:

    
df2=pd.read_csv('df2')
df2.head()

Histogram of a single column



In [310]:

    
df1['A'].hist()









    Out[310]:





<matplotlib.axes._subplots.AxesSubplot at 0x2998d34be10>

Histogram with a different set of arguments (list of columns, bins, figure size, etc)



In [312]:

    
df1.hist(column=['B','C'],bins=20,figsize=(10,4))









    Out[312]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002998D985E48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002998D9E6F60>]], dtype=object)

Histogram with generic plot method of Pandas



In [325]:

    
df1.plot(kind='hist', bins=30, grid=True, figsize=(12,7))









    Out[325]:





<matplotlib.axes._subplots.AxesSubplot at 0x2999061ac18>

Area plot



In [323]:

    
import seaborn as sns #Plot style will change to Seaborn package style from now on
df2.plot.area(alpha=0.4)









    Out[323]:





<matplotlib.axes._subplots.AxesSubplot at 0x2998db84780>

Bar plot (with and without stacking)



In [326]:

    
df2.plot.bar()









    Out[326]:





<matplotlib.axes._subplots.AxesSubplot at 0x29990bceba8>



In [328]:

    
df2.plot.bar(stacked=True)









    Out[328]:





<matplotlib.axes._subplots.AxesSubplot at 0x29990f4bda0>

Lineplot



In [331]:

    
df1.plot.line(x=df1.index,y=['B','C'],figsize=(12,4),lw=1) # Note matplotlib arguments like 'lw' and 'figsize'









    Out[331]:





<matplotlib.axes._subplots.AxesSubplot at 0x299911693c8>

Scatterplot



In [332]:

    
df1.plot.scatter(x='A',y='B',figsize=(12,8))









    Out[332]:





<matplotlib.axes._subplots.AxesSubplot at 0x299921d1748>



In [335]:

    
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm',figsize=(12,8)) # Color of the scatter dots set based on column C









    Out[335]:





<matplotlib.axes._subplots.AxesSubplot at 0x299923aba20>



In [344]:

    
df1.plot.scatter(x='A',y='B',s=10*np.exp(df1['C']),c='C',figsize=(12,8)) # Size of the dots set based on column C









    Out[344]:





<matplotlib.axes._subplots.AxesSubplot at 0x299934f4208>

Boxplot



In [346]:

    
df2.plot.box()









    Out[346]:





<matplotlib.axes._subplots.AxesSubplot at 0x299939b5860>

Hexagonal bin plot for bivariate data



In [348]:

    
df=pd.DataFrame(data=np.random.randn(1000,2),columns=['A','B'])
df.head()



In [350]:

    
df.plot.hexbin(x='A',y='B',gridsize=30,cmap='coolwarm')









    Out[350]:





<matplotlib.axes._subplots.AxesSubplot at 0x29994d46eb8>

Kernel density estimation



In [352]:

    
df2.plot.density(lw=3)









    Out[352]:





<matplotlib.axes._subplots.AxesSubplot at 0x29994f5b6d8>



In [ ]:

	A	B	C	D
2000-01-01	1.339091	-0.163643	-0.646443	1.041233
2000-01-02	-0.774984	0.137034	-0.882716	-2.253382
2000-01-03	-0.921037	-0.482943	-0.417100	0.478638
2000-01-04	-1.738808	-0.072973	0.056517	0.015085
2000-01-05	-0.905980	1.778576	0.381918	0.291436

	a	b	c	d
0	0.039762	0.218517	0.103423	0.957904
1	0.937288	0.041567	0.899125	0.977680
2	0.780504	0.008948	0.557808	0.797510
3	0.672717	0.247870	0.264071	0.444358
4	0.053829	0.520124	0.552264	0.190008

	A	B
0	1.017515	-0.420704
1	0.576828	1.814651
2	-0.261503	-0.254697
3	0.974428	0.030171
4	0.605864	0.079012

	Company	Person	Sales
0	GOOG	Sam	200
1	GOOG	Charlie	120
2	MSFT	Amy	340
3	MSFT	Vanessa	124
4	FB	Carl	243
5	FB	Sarah	350

	A	B	key1	key2	C	D
0	A0	B0	K0	K0	C0	D0
1	A1	B1	K0	K1	NaN	NaN
2	A2	B2	K1	K0	C1	D1
3	A2	B2	K1	K0	C2	D2
4	A3	B3	K2	K1	NaN	NaN
5	NaN	NaN	K2	K0	C3	D3

	A	B	key1	key2	C	D
0	A0	B0	K0	K0	C0	D0
1	A1	B1	K0	K1	NaN	NaN
2	A2	B2	K1	K0	C1	D1
3	A2	B2	K1	K0	C2	D2
4	A3	B3	K2	K1	NaN	NaN

	col1	col2	col3
0	1	444	aaa
1	2	555	bb
2	3	666	c
3	4	444	dd
4	5	333	eeee
5	6	222	fff
6	7	666	gg
7	8	777	h
8	9	666	iii
9	10	555	j

	col1	col2	col3	FuncApplied	col3length
5	6	222	fff	22.200000	3
4	5	333	eeee	33.300000	4
0	1	444	aaa	44.400000	3
3	4	444	dd	44.400000	2
1	2	555	bb	27.442930	2
9	10	555	j	27.442930	1
2	3	666	c	28.234742	1
6	7	666	gg	28.234742	2
8	9	666	iii	28.234742	3
7	8	777	h	28.904210	1

	col1	col2	col3
0	False	True	False
1	False	False	False
2	False	False	False
3	True	False	False

	A	B	key1	key2	C	D
0	A0	B0	K0	K0	C0	D0
1	A1	B1	K0	K1	NaN	NaN
2	A2	B2	K1	K0	C1	D1
3	A2	B2	K1	K0	C2	D2
4	A3	B3	K2	K1	NaN	NaN
5	NaN	NaN	K2	K0	C3	D3

	A	B	key1	key2	C	D
0	A0	B0	K0	K0	C0	D0
1	A1	B1	K0	K1	NaN	NaN
2	A2	B2	K1	K0	C1	D1
3	A2	B2	K1	K0	C2	D2
4	A3	B3	K2	K1	NaN	NaN

	A	B	key1	key2	C	D
0	A0	B0	K0	K0	C0	D0
1	A1	B1	K0	K1	NaN	NaN
2	A2	B2	K1	K0	C1	D1
3	A2	B2	K1	K0	C2	D2
4	A3	B3	K2	K1	NaN	NaN
5	NaN	NaN	K2	K0	C3	D3

	A	B	key1	key2	C	D
0	A0	B0	K0	K0	C0	D0
1	A1	B1	K0	K1	NaN	NaN
2	A2	B2	K1	K0	C1	D1
3	A2	B2	K1	K0	C2	D2
4	A3	B3	K2	K1	NaN	NaN