Pandas cheat sheet

Pandas is Python Data Analysis library. Series and Dataframes are major data structures in Pandas. Pandas is built on top of NumPy arrays.

ToC

Series
- Create a series
- Operations on series
DataFrames

Series

Series is 1 dimensional data structure. It is similar to numpy array, but each data point has a label in the place of an index.

Create a series



In [1]:

    
import pandas as pd
import numpy as np



In [2]:

    
#from a list
l1 = [1,2,3,4,5]
ser1 = pd.Series(data = l1)  #when you dont specify labels for index, it is autogenerated
ser1









    Out[2]:





0    1
1    2
2    3
3    4
4    5
dtype: int64



In [3]:

    
#from a numpy array
arr1 = np.array(l1)
l2 = ['a', 'b', 'c','e', 'd']
ser2 = pd.Series(data=arr1, index=l2) #indices can of any data type, here string
ser2









    Out[3]:





a    1
b    2
c    3
e    4
d    5
dtype: int32



In [4]:

    
#from a dictionary
d1 = {'usa':1, 'india':2, 'germany':3, 'japan':'china', 'china':4}
ser3 = pd.Series(d1)
ser3









    Out[4]:





china          4
germany        3
india          2
japan      china
usa            1
dtype: object

Thus Series can have different datatypes.

Operations on series

You can add, multiply and other numerical opertions on Series just like on numpy arrays.



In [5]:

    
ser1a = pd.Series(l1)
ser1 + ser1a #each individual element with matching index/label is summed









    Out[5]:





0     2
1     4
2     6
3     8
4    10
dtype: int64

When labels dont match, it puts a nan. Thus when two series are added, you may or may not get the same number of elements



In [6]:

    
ser1 + ser3









    Out[6]:





0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
china      NaN
germany    NaN
india      NaN
japan      NaN
usa        NaN
dtype: object

DataFrames

Creating dataFrames

Pandas DataFrames are built on top of Series. It looks similar to a NumPy array, but has labels for both columns and rows.



In [2]:

    
arr1 = np.random.rand(4,4)
arr1









    Out[2]:





array([[ 0.13430231,  0.62520675,  0.97098126,  0.71760504],
       [ 0.71376565,  0.77318189,  0.05968879,  0.45089927],
       [ 0.05898965,  0.90430093,  0.43148674,  0.08768307],
       [ 0.50989094,  0.50103695,  0.24427914,  0.76313491]])



In [3]:

    
row_lables = ['Car1', 'Car2', 'Car3', 'Car4']
col_labels = ['reliability', 'cost', 'competition', 'halflife']

#create a dataframe
df1 = pd.DataFrame(data=arr1, index=row_lables, columns=col_labels)
df1









    Out[3]:






  
    
      
      reliability
      cost
      competition
      halflife
    
  
  
    
      Car1
      0.134302
      0.625207
      0.970981
      0.717605
    
    
      Car2
      0.713766
      0.773182
      0.059689
      0.450899
    
    
      Car3
      0.058990
      0.904301
      0.431487
      0.087683
    
    
      Car4
      0.509891
      0.501037
      0.244279
      0.763135

Slicing and dicing DataFrames

You can access DataFrames similar to Series and slice it similar to NumPy arrays

Access columns

DataFrameObj['column_name'] ==> returns a pandas.core.series.Series



In [15]:

    
# Accessing a whole column
df1['reliability']









    Out[15]:





Car1    0.894051
Car2    0.014127
Car3    0.351610
Car4    0.601929
Name: reliability, dtype: float64



In [16]:

    
#can access as a property, but this is not advisable 
#since it can clobber builtin methods and properties
df1.reliability









    Out[16]:





Car1    0.894051
Car2    0.014127
Car3    0.351610
Car4    0.601929
Name: reliability, dtype: float64

Access rows

DataFrameobj.loc['row_label'] also returns a Series. Notice the .loc



In [18]:

    
df1.loc['Car4']









    Out[18]:





reliability    0.601929
cost           0.854320
competition    0.391956
halflife       0.759363
Name: Car4, dtype: float64



In [19]:

    
type(df1.loc['Car3'])









    Out[19]:





pandas.core.series.Series

Accessing using index number

If you don't know the labels, but know the index like in an array, use iloc and pass the index number.



In [21]:

    
#get first row, first col
val1 = df1.iloc[0,0]
print(val1)
print(type(val1))









    



0.894051123737
<class 'numpy.float64'>



In [23]:

    
#get full first row
val2 = df1.iloc[0,:]
val2









    Out[23]:





reliability    0.894051
cost           0.849727
competition    0.538400
halflife       0.863986
Name: Car1, dtype: float64



In [24]:

    
type(val2)









    Out[24]:





pandas.core.series.Series

Dicing DataFrames

Dicing using labels ==> use DataFrameObj.loc[[row_labels],[col_labels]]



In [26]:

    
#Get cost and competition of cars 2,3
df1.loc[['Car2', 'Car3'], ['cost', 'competition']]









    Out[26]:






  
    
      
      cost
      competition
    
  
  
    
      Car2
      0.935368
      0.719570
    
    
      Car3
      0.659950
      0.605077

With index number, dice using

DataFrameObj.iloc[[row_indices], [col_indices]]



In [27]:

    
df1.iloc[[1,2], [1,2]]









    Out[27]:






  
    
      
      cost
      competition
    
  
  
    
      Car2
      0.935368
      0.719570
    
    
      Car3
      0.659950
      0.605077

Conditional selection

When running a condition on a DataFrame, you are returned a Bool dataframe.



In [9]:

    
df1









    Out[9]:






  
    
      
      reliability
      cost
      competition
      halflife
    
  
  
    
      Car1
      0.776415
      0.435083
      0.236151
      0.169087
    
    
      Car2
      0.790403
      0.987459
      0.370570
      0.734146
    
    
      Car3
      0.884783
      0.233803
      0.691639
      0.725398
    
    
      Car4
      0.693038
      0.716824
      0.766937
      0.490821



In [10]:

    
# find cars with reliability > 0.85
df1['reliability'] > 0.85









    Out[10]:





Car1    False
Car2    False
Car3     True
Car4    False
Name: reliability, dtype: bool



In [11]:

    
#to get the car select the data elements using the bool series
df1[df1['reliability'] > 0.85]









    Out[11]:






  
    
      
      reliability
      cost
      competition
      halflife
    
  
  
    
      Car3
      0.884783
      0.233803
      0.691639
      0.725398



In [20]:

    
#To get only the car name, which in this case is the index
df1[df1['reliability'] > 0.85].index[0]









    Out[20]:





'Car3'

Chaining conditions

In a Pythonic way, you can chain conditions

df[df condition][selection][selection]



In [21]:

    
#to get the actual value of reliablity for this car
df1[df1['reliability'] > 0.85]['reliability']









    Out[21]:





Car3    0.884783
Name: reliability, dtype: float64



In [22]:

    
# get both reliability and cost
df1[df1['reliability'] > 0.85][['reliability', 'cost']]









    Out[22]:






  
    
      
      reliability
      cost
    
  
  
    
      Car3
      0.884783
      0.233803

Multiple conditions

You can select dataframe elements with multiple conditions. Note cannot use Python and , or. Instead use &, |



In [24]:

    
#select cars that have reliability > 0.7 but competition less than 0.5
df1[(df1['reliability'] > 0.7) & (df1['competition'] < 0.5)]









    Out[24]:






  
    
      
      reliability
      cost
      competition
      halflife
    
  
  
    
      Car1
      0.776415
      0.435083
      0.236151
      0.169087
    
    
      Car2
      0.790403
      0.987459
      0.370570
      0.734146



In [26]:

    
# select cars that have half life > 0.5 or competition < 0.4
df1[(df1['halflife'] > 0.5) | (df1['competition'] < 0.4)]









    Out[26]:






  
    
      
      reliability
      cost
      competition
      halflife
    
  
  
    
      Car1
      0.776415
      0.435083
      0.236151
      0.169087
    
    
      Car2
      0.790403
      0.987459
      0.370570
      0.734146
    
    
      Car3
      0.884783
      0.233803
      0.691639
      0.725398

Operations on DataFrames

Adding new columns

Create new columns just like adding a kvp to a dictionary.

DataFrameObj['new_col'] = Series



In [4]:

    
#add full life column
df1['full_life'] = df1['halflife'] * 2 #similar to array, series broadcast multiplication
df1









    Out[4]:






  
    
      
      reliability
      cost
      competition
      halflife
      full_life
    
  
  
    
      Car1
      0.134302
      0.625207
      0.970981
      0.717605
      1.435210
    
    
      Car2
      0.713766
      0.773182
      0.059689
      0.450899
      0.901799
    
    
      Car3
      0.058990
      0.904301
      0.431487
      0.087683
      0.175366
    
    
      Car4
      0.509891
      0.501037
      0.244279
      0.763135
      1.526270

Dropping rows and columns

DataFrameObj.drop(label, axis, inplace=True / False)

Row labels are axis = 0 and columns are axis = 1



In [5]:

    
df1.drop('full_life', axis=1, inplace=False)









    Out[5]:






  
    
      
      reliability
      cost
      competition
      halflife
    
  
  
    
      Car1
      0.134302
      0.625207
      0.970981
      0.717605
    
    
      Car2
      0.713766
      0.773182
      0.059689
      0.450899
    
    
      Car3
      0.058990
      0.904301
      0.431487
      0.087683
    
    
      Car4
      0.509891
      0.501037
      0.244279
      0.763135



In [6]:

    
df1.drop('Car3') #all else is the default









    Out[6]:






  
    
      
      reliability
      cost
      competition
      halflife
      full_life
    
  
  
    
      Car1
      0.134302
      0.625207
      0.970981
      0.717605
      1.435210
    
    
      Car2
      0.713766
      0.773182
      0.059689
      0.450899
      0.901799
    
    
      Car4
      0.509891
      0.501037
      0.244279
      0.763135
      1.526270

Drop a row based on a condition.



In [7]:

    
df1.drop(df1[df1['cost'] > 0.65].index, inplace=False)









    Out[7]:






  
    
      
      reliability
      cost
      competition
      halflife
      full_life
    
  
  
    
      Car1
      0.134302
      0.625207
      0.970981
      0.717605
      1.43521
    
    
      Car4
      0.509891
      0.501037
      0.244279
      0.763135
      1.52627

DataFrame Index

So far, Car1, Car2.. is the index for rows. If you would like to set a different column as an index, use set_index. If you want to make index as a column rather, and use numerals for index, use reset_index

Set index



In [30]:

    
#set car names as index for the data frame
car_names = 'altima outback taurus mustang'.split()
car_names









    Out[30]:





['altima', 'outback', 'taurus', 'mustang']



In [36]:

    
df1['car_names'] = car_names
df1









    Out[36]:






  
    
      
      reliability
      cost
      competition
      halflife
      car_names
    
  
  
    
      Car1
      0.776415
      0.435083
      0.236151
      0.169087
      altima
    
    
      Car2
      0.790403
      0.987459
      0.370570
      0.734146
      outback
    
    
      Car3
      0.884783
      0.233803
      0.691639
      0.725398
      taurus
    
    
      Car4
      0.693038
      0.716824
      0.766937
      0.490821
      mustang



In [37]:

    
df_new_index = df1.set_index(keys= df1['car_names'], inplace=False)
df_new_index









    Out[37]:






  
    
      
      reliability
      cost
      competition
      halflife
      car_names
    
    
      car_names
      
      
      
      
      
    
  
  
    
      altima
      0.776415
      0.435083
      0.236151
      0.169087
      altima
    
    
      outback
      0.790403
      0.987459
      0.370570
      0.734146
      outback
    
    
      taurus
      0.884783
      0.233803
      0.691639
      0.725398
      taurus
    
    
      mustang
      0.693038
      0.716824
      0.766937
      0.490821
      mustang

Note, the old index is lost.

Rest index



In [38]:

    
#reset df1 index to numerals and convert existing to a column
df1.reset_index()









    Out[38]:






  
    
      
      index
      reliability
      cost
      competition
      halflife
      car_names
    
  
  
    
      0
      Car1
      0.776415
      0.435083
      0.236151
      0.169087
      altima
    
    
      1
      Car2
      0.790403
      0.987459
      0.370570
      0.734146
      outback
    
    
      2
      Car3
      0.884783
      0.233803
      0.691639
      0.725398
      taurus
    
    
      3
      Car4
      0.693038
      0.716824
      0.766937
      0.490821
      mustang

	reliability	cost	competition	halflife
Car1	0.134302	0.625207	0.970981	0.717605
Car2	0.713766	0.773182	0.059689	0.450899
Car3	0.058990	0.904301	0.431487	0.087683
Car4	0.509891	0.501037	0.244279	0.763135

	reliability	cost	competition	halflife
Car1	0.776415	0.435083	0.236151	0.169087
Car2	0.790403	0.987459	0.370570	0.734146
Car3	0.884783	0.233803	0.691639	0.725398
Car4	0.693038	0.716824	0.766937	0.490821

	reliability	cost	competition	halflife	car_names
car_names
altima	0.776415	0.435083	0.236151	0.169087	altima
outback	0.790403	0.987459	0.370570	0.734146	outback
taurus	0.884783	0.233803	0.691639	0.725398	taurus
mustang	0.693038	0.716824	0.766937	0.490821	mustang