Pandas cheat sheet

Pandas is Python Data Analysis library. Series and Dataframes are major data structures in Pandas. Pandas is built on top of NumPy arrays.

Series

Series is 1 dimensional data structure. It is similar to numpy array, but each data point has a label in the place of an index.

Create a series


In [1]:
import pandas as pd
import numpy as np

In [2]:
#from a list
l1 = [1,2,3,4,5]
ser1 = pd.Series(data = l1)  #when you dont specify labels for index, it is autogenerated
ser1


Out[2]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [3]:
#from a numpy array
arr1 = np.array(l1)
l2 = ['a', 'b', 'c','e', 'd']
ser2 = pd.Series(data=arr1, index=l2) #indices can of any data type, here string
ser2


Out[3]:
a    1
b    2
c    3
e    4
d    5
dtype: int32

In [4]:
#from a dictionary
d1 = {'usa':1, 'india':2, 'germany':3, 'japan':'china', 'china':4}
ser3 = pd.Series(d1)
ser3


Out[4]:
china          4
germany        3
india          2
japan      china
usa            1
dtype: object

Thus Series can have different datatypes.

Operations on series

You can add, multiply and other numerical opertions on Series just like on numpy arrays.


In [5]:
ser1a = pd.Series(l1)
ser1 + ser1a #each individual element with matching index/label is summed


Out[5]:
0     2
1     4
2     6
3     8
4    10
dtype: int64

When labels dont match, it puts a nan. Thus when two series are added, you may or may not get the same number of elements


In [6]:
ser1 + ser3


Out[6]:
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
china      NaN
germany    NaN
india      NaN
japan      NaN
usa        NaN
dtype: object

DataFrames

Creating dataFrames

Pandas DataFrames are built on top of Series. It looks similar to a NumPy array, but has labels for both columns and rows.


In [2]:
arr1 = np.random.rand(4,4)
arr1


Out[2]:
array([[ 0.13430231,  0.62520675,  0.97098126,  0.71760504],
       [ 0.71376565,  0.77318189,  0.05968879,  0.45089927],
       [ 0.05898965,  0.90430093,  0.43148674,  0.08768307],
       [ 0.50989094,  0.50103695,  0.24427914,  0.76313491]])

In [3]:
row_lables = ['Car1', 'Car2', 'Car3', 'Car4']
col_labels = ['reliability', 'cost', 'competition', 'halflife']

#create a dataframe
df1 = pd.DataFrame(data=arr1, index=row_lables, columns=col_labels)
df1


Out[3]:
reliability cost competition halflife
Car1 0.134302 0.625207 0.970981 0.717605
Car2 0.713766 0.773182 0.059689 0.450899
Car3 0.058990 0.904301 0.431487 0.087683
Car4 0.509891 0.501037 0.244279 0.763135

Slicing and dicing DataFrames

You can access DataFrames similar to Series and slice it similar to NumPy arrays

Access columns

DataFrameObj['column_name'] ==> returns a pandas.core.series.Series

In [15]:
# Accessing a whole column
df1['reliability']


Out[15]:
Car1    0.894051
Car2    0.014127
Car3    0.351610
Car4    0.601929
Name: reliability, dtype: float64

In [16]:
#can access as a property, but this is not advisable 
#since it can clobber builtin methods and properties
df1.reliability


Out[16]:
Car1    0.894051
Car2    0.014127
Car3    0.351610
Car4    0.601929
Name: reliability, dtype: float64

Access rows

DataFrameobj.loc['row_label'] also returns a Series. Notice the .loc

In [18]:
df1.loc['Car4']


Out[18]:
reliability    0.601929
cost           0.854320
competition    0.391956
halflife       0.759363
Name: Car4, dtype: float64

In [19]:
type(df1.loc['Car3'])


Out[19]:
pandas.core.series.Series

Accessing using index number

If you don't know the labels, but know the index like in an array, use iloc and pass the index number.


In [21]:
#get first row, first col
val1 = df1.iloc[0,0]
print(val1)
print(type(val1))


0.894051123737
<class 'numpy.float64'>

In [23]:
#get full first row
val2 = df1.iloc[0,:]
val2


Out[23]:
reliability    0.894051
cost           0.849727
competition    0.538400
halflife       0.863986
Name: Car1, dtype: float64

In [24]:
type(val2)


Out[24]:
pandas.core.series.Series

Dicing DataFrames

Dicing using labels ==> use DataFrameObj.loc[[row_labels],[col_labels]]


In [26]:
#Get cost and competition of cars 2,3
df1.loc[['Car2', 'Car3'], ['cost', 'competition']]


Out[26]:
cost competition
Car2 0.935368 0.719570
Car3 0.659950 0.605077

With index number, dice using

DataFrameObj.iloc[[row_indices], [col_indices]]

In [27]:
df1.iloc[[1,2], [1,2]]


Out[27]:
cost competition
Car2 0.935368 0.719570
Car3 0.659950 0.605077

Conditional selection

When running a condition on a DataFrame, you are returned a Bool dataframe.


In [9]:
df1


Out[9]:
reliability cost competition halflife
Car1 0.776415 0.435083 0.236151 0.169087
Car2 0.790403 0.987459 0.370570 0.734146
Car3 0.884783 0.233803 0.691639 0.725398
Car4 0.693038 0.716824 0.766937 0.490821

In [10]:
# find cars with reliability > 0.85
df1['reliability'] > 0.85


Out[10]:
Car1    False
Car2    False
Car3     True
Car4    False
Name: reliability, dtype: bool

In [11]:
#to get the car select the data elements using the bool series
df1[df1['reliability'] > 0.85]


Out[11]:
reliability cost competition halflife
Car3 0.884783 0.233803 0.691639 0.725398

In [20]:
#To get only the car name, which in this case is the index
df1[df1['reliability'] > 0.85].index[0]


Out[20]:
'Car3'

Chaining conditions

In a Pythonic way, you can chain conditions

df[df condition][selection][selection]

In [21]:
#to get the actual value of reliablity for this car
df1[df1['reliability'] > 0.85]['reliability']


Out[21]:
Car3    0.884783
Name: reliability, dtype: float64

In [22]:
# get both reliability and cost
df1[df1['reliability'] > 0.85][['reliability', 'cost']]


Out[22]:
reliability cost
Car3 0.884783 0.233803

Multiple conditions

You can select dataframe elements with multiple conditions. Note cannot use Python and , or. Instead use &, |


In [24]:
#select cars that have reliability > 0.7 but competition less than 0.5
df1[(df1['reliability'] > 0.7) & (df1['competition'] < 0.5)]


Out[24]:
reliability cost competition halflife
Car1 0.776415 0.435083 0.236151 0.169087
Car2 0.790403 0.987459 0.370570 0.734146

In [26]:
# select cars that have half life > 0.5 or competition < 0.4
df1[(df1['halflife'] > 0.5) | (df1['competition'] < 0.4)]


Out[26]:
reliability cost competition halflife
Car1 0.776415 0.435083 0.236151 0.169087
Car2 0.790403 0.987459 0.370570 0.734146
Car3 0.884783 0.233803 0.691639 0.725398

Operations on DataFrames

Adding new columns

Create new columns just like adding a kvp to a dictionary.

DataFrameObj['new_col'] = Series

In [4]:
#add full life column
df1['full_life'] = df1['halflife'] * 2 #similar to array, series broadcast multiplication
df1


Out[4]:
reliability cost competition halflife full_life
Car1 0.134302 0.625207 0.970981 0.717605 1.435210
Car2 0.713766 0.773182 0.059689 0.450899 0.901799
Car3 0.058990 0.904301 0.431487 0.087683 0.175366
Car4 0.509891 0.501037 0.244279 0.763135 1.526270

Dropping rows and columns

DataFrameObj.drop(label, axis, inplace=True / False)

Row labels are axis = 0 and columns are axis = 1


In [5]:
df1.drop('full_life', axis=1, inplace=False)


Out[5]:
reliability cost competition halflife
Car1 0.134302 0.625207 0.970981 0.717605
Car2 0.713766 0.773182 0.059689 0.450899
Car3 0.058990 0.904301 0.431487 0.087683
Car4 0.509891 0.501037 0.244279 0.763135

In [6]:
df1.drop('Car3') #all else is the default


Out[6]:
reliability cost competition halflife full_life
Car1 0.134302 0.625207 0.970981 0.717605 1.435210
Car2 0.713766 0.773182 0.059689 0.450899 0.901799
Car4 0.509891 0.501037 0.244279 0.763135 1.526270

Drop a row based on a condition.


In [7]:
df1.drop(df1[df1['cost'] > 0.65].index, inplace=False)


Out[7]:
reliability cost competition halflife full_life
Car1 0.134302 0.625207 0.970981 0.717605 1.43521
Car4 0.509891 0.501037 0.244279 0.763135 1.52627

DataFrame Index

So far, Car1, Car2.. is the index for rows. If you would like to set a different column as an index, use set_index. If you want to make index as a column rather, and use numerals for index, use reset_index

Set index


In [30]:
#set car names as index for the data frame
car_names = 'altima outback taurus mustang'.split()
car_names


Out[30]:
['altima', 'outback', 'taurus', 'mustang']

In [36]:
df1['car_names'] = car_names
df1


Out[36]:
reliability cost competition halflife car_names
Car1 0.776415 0.435083 0.236151 0.169087 altima
Car2 0.790403 0.987459 0.370570 0.734146 outback
Car3 0.884783 0.233803 0.691639 0.725398 taurus
Car4 0.693038 0.716824 0.766937 0.490821 mustang

In [37]:
df_new_index = df1.set_index(keys= df1['car_names'], inplace=False)
df_new_index


Out[37]:
reliability cost competition halflife car_names
car_names
altima 0.776415 0.435083 0.236151 0.169087 altima
outback 0.790403 0.987459 0.370570 0.734146 outback
taurus 0.884783 0.233803 0.691639 0.725398 taurus
mustang 0.693038 0.716824 0.766937 0.490821 mustang

Note, the old index is lost.

Rest index


In [38]:
#reset df1 index to numerals and convert existing to a column
df1.reset_index()


Out[38]:
index reliability cost competition halflife car_names
0 Car1 0.776415 0.435083 0.236151 0.169087 altima
1 Car2 0.790403 0.987459 0.370570 0.734146 outback
2 Car3 0.884783 0.233803 0.691639 0.725398 taurus
3 Car4 0.693038 0.716824 0.766937 0.490821 mustang