Series/DataFrame slicing and function application

Slicing

(from the pandas docs: https://pandas.pydata.org/pandas-docs/stable/indexing.html)

The axis labeling information in pandas objects serves many purposes:

  • Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
  • Enables automatic and explicit data alignment
  • Allows intuitive getting and setting of subsets of the data set

Indexing operator: []

Attribute operator: .

Example:


In [3]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

df["A"] #indexing


Out[3]:
0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64

In [4]:
df.A #attribute


Out[4]:
0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64

In [5]:
type(df.A)


Out[5]:
pandas.core.series.Series

In [6]:
df.A[0]


Out[6]:
1.4297170198955451

In [7]:
df[["A","B"]]


Out[7]:
A B
0 1.429717 -1.431192
1 -0.653524 -0.141962
2 -0.523050 1.060752
3 1.128553 1.259760
4 0.803358 0.845983
5 -0.287873 -0.066727
6 0.246169 0.611750
7 1.083117 0.409958

In [9]:
type(df[["A","B"]])


Out[9]:
pandas.core.frame.DataFrame
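The two access styles are not fully interchangeable. The following is a minimal sketch, assuming a small hypothetical frame `demo` whose column names were chosen only to show the limitation: attribute access works only when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. It also shows that a single label returns a Series while a list of labels returns a DataFrame.

```python
import pandas as pd

# Hypothetical frame; the column names are chosen only to illustrate the point.
demo = pd.DataFrame({"A": [1, 2], "count": [10, 20], "my col": [5, 6]})

col_a = demo.A               # same Series as demo["A"]
count_attr = demo.count      # the DataFrame.count method, not the column
count_col = demo["count"]    # bracket indexing always returns the column
spaced = demo["my col"]      # a name with a space is not reachable as an attribute

single = demo["A"]           # one label -> Series
double = demo[["A"]]         # list of labels -> DataFrame with one column

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
```

For that reason bracket indexing is usually the safer choice in scripts, with attribute access kept for interactive exploration.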

Slicing ranges

The most robust way to slice DataFrames is with the .loc and .iloc indexers. However, plain [] slicing also works on a Series:


In [11]:
s = df["A"]
s[:5]


Out[11]:
0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
Name: A, dtype: float64

In [18]:
s[::2]


Out[18]:
0    1.429717
2   -0.523050
4    0.803358
6    0.246169
Name: A, dtype: float64

In [19]:
s[::-1]


Out[19]:
7    1.083117
6    0.246169
5   -0.287873
4    0.803358
3    1.128553
2   -0.523050
1   -0.653524
0    1.429717
Name: A, dtype: float64

Watch out: on a DataFrame the [] operator is inconsistent. A slice such as df[:3] selects rows, while a label such as df["A"] selects a column. This is why .loc is said to provide more coherent behaviour.


In [22]:
df[:3] # row slicing with [] is kept as a convenience because it is so common


Out[22]:
A B C D
0 1.429717 -1.431192 -0.027299 1.137717
1 -0.653524 -0.141962 -0.134841 0.963939
2 -0.523050 1.060752 0.496447 -0.002856

In [21]:
df["A"]


Out[21]:
0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64
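The explicit equivalents of the two shortcuts above can be sketched as follows. The frame is rebuilt here with the same random recipe as before, so the values differ from run to run and no outputs are shown.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame used above (random values).
df = pd.DataFrame(np.random.randn(8, 4), columns=["A", "B", "C", "D"])

rows_short = df[:3]            # a slice inside [] selects rows
rows_explicit = df.iloc[:3]    # the same rows, selected by position

col_short = df["A"]            # a label inside [] selects a column
col_explicit = df.loc[:, "A"]  # the same column, selected by label

print(rows_short.equals(rows_explicit))  # True
print(col_short.equals(col_explicit))    # True
```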

The .loc attribute is the primary access method. The following are valid inputs:

  • A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index)
  • A list or array of labels ['a', 'b', 'c']
  • A slice object with labels 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!)
  • A boolean array
  • A callable, see Selection By Callable
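A short sketch of the first four input kinds, assuming a hypothetical Series `letters` with string labels. It also contrasts the inclusive label slice of .loc with the usual exclusive positional slice of .iloc.

```python
import pandas as pd

# Hypothetical Series with string labels.
letters = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

single = letters.loc["b"]             # a single label
by_list = letters.loc[["a", "d"]]     # a list of labels
by_label = letters.loc["a":"c"]       # label slice: the stop label "c" is included
by_position = letters.iloc[0:2]       # positional slice: the stop position 2 is excluded
by_mask = letters.loc[letters > 20]   # a boolean array

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```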

In [24]:
df.loc[:,"B":]


Out[24]:
B C D
0 -1.431192 -0.027299 1.137717
1 -0.141962 -0.134841 0.963939
2 1.060752 0.496447 -0.002856
3 1.259760 1.106879 -0.234083
4 0.845983 -1.546642 -0.393005
5 -0.066727 -0.787697 -1.099309
6 0.611750 2.238141 -1.105078
7 0.409958 0.515915 -0.178913

In [30]:
df.loc[4:,"C":] = 0
df


Out[30]:
A B C D
0 1.429717 -1.431192 -0.027299 1.137717
1 -0.653524 -0.141962 -0.134841 0.963939
2 -0.523050 1.060752 0.496447 -0.002856
3 1.128553 1.259760 1.106879 -0.234083
4 0.803358 0.845983 0.000000 0.000000
5 -0.287873 -0.066727 0.000000 0.000000
6 0.246169 0.611750 0.000000 0.000000
7 1.083117 0.409958 0.000000 0.000000

Boolean indexing


In [42]:
df.loc[:,"A"]>0


Out[42]:
0     True
1    False
2    False
3     True
4     True
5    False
6     True
7     True
Name: A, dtype: bool

In [43]:
df.loc[df.loc[:,"A"]>0]


Out[43]:
A B C D
0 1.429717 -1.431192 -0.027299 1.137717
3 1.128553 1.259760 1.106879 -0.234083
4 0.803358 0.845983 0.000000 0.000000
6 0.246169 0.611750 0.000000 0.000000
7 1.083117 0.409958 0.000000 0.000000
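Several conditions can be combined into one mask. This is a minimal sketch on a hypothetical frame `demo` with fixed values. It uses the element-wise operators & (and), | (or) and ~ (not), and each comparison needs its own parentheses because these operators bind more tightly than > or <.

```python
import pandas as pd

# Hypothetical frame with fixed values so the result is reproducible.
demo = pd.DataFrame({"A": [1.5, -0.5, 0.2, -2.0], "B": [-1.0, 0.3, 0.8, 0.1]})

both_positive = demo.loc[(demo["A"] > 0) & (demo["B"] > 0)]
either_positive = demo.loc[(demo["A"] > 0) | (demo["B"] > 0)]
a_not_positive = demo.loc[~(demo["A"] > 0)]

print(list(both_positive.index))    # [2]
print(list(either_positive.index))  # [0, 1, 2, 3]
print(list(a_not_positive.index))   # [1, 3]
```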

Accessing by position with .iloc


In [44]:
df.iloc[0,0]


Out[44]:
1.4297170198955451

In [45]:
df.iloc[3:,2:]


Out[45]:
C D
3 1.106879 -0.234083
4 0.000000 0.000000
5 0.000000 0.000000
6 0.000000 0.000000
7 0.000000 0.000000

In [46]:
df.iloc[[0,1,3],[1,3]]


Out[46]:
B D
0 -1.431192 1.137717
1 -0.141962 0.963939
3 1.259760 -0.234083
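With the default 0..n-1 index, labels and positions coincide, which hides the difference between .loc and .iloc. The sketch below assumes a hypothetical Series `s2` whose integer labels are deliberately out of order, so the two indexers give different answers.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions.
s2 = pd.Series(["x", "y", "z"], index=[5, 3, 1])

print(s2.loc[5])   # x (the row labelled 5)
print(s2.iloc[0])  # x (the first row)
print(s2.loc[1])   # z (the row labelled 1)
print(s2.iloc[1])  # y (the second row)
```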

Selection by callable


In [48]:
df.loc[:,lambda df: df.columns == "A"]


Out[48]:
A
0 1.429717
1 -0.653524
2 -0.523050
3 1.128553
4 0.803358
5 -0.287873
6 0.246169
7 1.083117
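A callable is mostly useful in a method chain, where the intermediate frame has no variable name to refer to. A minimal sketch, assuming a hypothetical frame `demo`; `assign` is used here only to create such an intermediate frame with an extra `total` column.

```python
import pandas as pd

# Hypothetical frame with fixed values.
demo = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [4.0, 5.0, 6.0]})

# Each lambda receives the frame as it exists at that point in the chain.
result = (
    demo
    .assign(total=lambda d: d["A"] + d["B"])
    .loc[lambda d: d["total"] > 4]
)

print(list(result.index))  # [0, 2]
```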

Selection by isin


In [51]:
df["X"] = range(0, df.shape[0])
df


Out[51]:
A B C D X
0 1.429717 -1.431192 -0.027299 1.137717 0
1 -0.653524 -0.141962 -0.134841 0.963939 1
2 -0.523050 1.060752 0.496447 -0.002856 2
3 1.128553 1.259760 1.106879 -0.234083 3
4 0.803358 0.845983 0.000000 0.000000 4
5 -0.287873 -0.066727 0.000000 0.000000 5
6 0.246169 0.611750 0.000000 0.000000 6
7 1.083117 0.409958 0.000000 0.000000 7

In [53]:
df[df["X"].isin([0,2])]


Out[53]:
A B C D X
0 1.429717 -1.431192 -0.027299 1.137717 0
2 -0.523050 1.060752 0.496447 -0.002856 2
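The mask produced by isin can be negated with ~ to keep everything except the listed values. A short sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame; "label" is only there to make the result easy to read.
demo = pd.DataFrame({"X": range(5), "label": list("abcde")})

kept = demo[demo["X"].isin([0, 2])]
dropped = demo[~demo["X"].isin([0, 2])]

print(list(kept["label"]))     # ['a', 'c']
print(list(dropped["label"]))  # ['b', 'd', 'e']
```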

The where() Method


In [54]:
df.where(df["A"]>0)


Out[54]:
A B C D X
0 1.429717 -1.431192 -0.027299 1.137717 0.0
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 1.128553 1.259760 1.106879 -0.234083 3.0
4 0.803358 0.845983 0.000000 0.000000 4.0
5 NaN NaN NaN NaN NaN
6 0.246169 0.611750 0.000000 0.000000 6.0
7 1.083117 0.409958 0.000000 0.000000 7.0

In [56]:
df.where(df["A"]>0,100)


Out[56]:
A B C D X
0 1.429717 -1.431192 -0.027299 1.137717 0
1 100.000000 100.000000 100.000000 100.000000 100
2 100.000000 100.000000 100.000000 100.000000 100
3 1.128553 1.259760 1.106879 -0.234083 3
4 0.803358 0.845983 0.000000 0.000000 4
5 100.000000 100.000000 100.000000 100.000000 100
6 0.246169 0.611750 0.000000 0.000000 6
7 1.083117 0.409958 0.000000 0.000000 7
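Unlike boolean indexing, where() keeps the original shape and replaces the rows that fail the condition; mask() is its inverse and replaces the rows that satisfy it. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with fixed values.
demo = pd.DataFrame({"A": [1.0, -1.0, 2.0], "B": [10.0, 20.0, 30.0]})
cond = demo["A"] > 0

filtered = demo[cond]          # drops the non-matching row
kept_shape = demo.where(cond)  # same shape, the non-matching row becomes NaN
masked = demo.mask(cond, 0)    # inverse: the matching rows are replaced by 0

print(filtered.shape)       # (2, 2)
print(kept_shape.shape)     # (3, 2)
print(list(masked["B"]))    # [0.0, 20.0, 0.0]
```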

Applying Functions

Over series:

  • Vectorized functions
  • Apply/map

In [58]:
s * 2


Out[58]:
0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64

In [59]:
s.max()


Out[59]:
1.4297170198955451

In [60]:
np.max(s)


Out[60]:
1.4297170198955451

In [62]:
s.apply(np.max)  # np.max is called on each element, so the values come back unchanged


Out[62]:
0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64

In [64]:
def multiply_by_2(x):
    return x*2
s.apply(multiply_by_2)


Out[64]:
0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64

In [65]:
s.apply(lambda x: x*2)


Out[65]:
0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64

In [67]:
s.map(lambda x: x*2)


Out[67]:
0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64

In [70]:
mydict={2:"a"}
df["X"].map(mydict)


Out[70]:
0    NaN
1    NaN
2      a
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
Name: X, dtype: object
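The options above can be compared side by side. This is a minimal sketch on a hypothetical Series `nums`: the vectorized expression, apply and map give the same result for a plain function, and the vectorized form is generally preferable when it exists because it avoids one Python call per element. With a dict, map acts as a lookup and values without a matching key become NaN, which fillna can replace.

```python
import pandas as pd

# Hypothetical Series with fixed values.
nums = pd.Series([1, 2, 3])

vectorized = nums * 2                  # operates on the whole Series at once
applied = nums.apply(lambda x: x * 2)  # calls the function once per element
mapped = nums.map(lambda x: x * 2)     # same result for a plain function

lookup = {2: "a"}                      # hypothetical lookup table
by_dict = nums.map(lookup)             # values missing from the dict give NaN
by_dict_filled = by_dict.fillna("other")

print(list(vectorized))      # [2, 4, 6]
print(list(by_dict_filled))  # ['other', 'a', 'other']
```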

Over dataframes:

  • apply (along a chosen axis: axis=0 passes each column to the function, axis=1 passes each row)
  • applymap (element by element)

In [73]:
df.apply(np.max,axis=0)


Out[73]:
A    1.429717
B    1.259760
C    1.106879
D    1.137717
X    7.000000
dtype: float64

In [74]:
df.apply(np.max,axis=1)


Out[74]:
0    1.429717
1    1.000000
2    2.000000
3    3.000000
4    4.000000
5    5.000000
6    6.000000
7    7.000000
dtype: float64
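The axis argument is easier to follow on a tiny frame. A minimal sketch, assuming a hypothetical 2x2 frame `demo`: with axis=0 the function receives one column at a time, with axis=1 one row at a time. For common reductions the built-in methods such as DataFrame.max give the same answer directly.

```python
import pandas as pd

# Hypothetical frame with fixed values.
demo = pd.DataFrame({"A": [1, 5], "B": [7, 2]})

per_column = demo.apply(lambda col: col.max(), axis=0)  # one value per column
per_row = demo.apply(lambda row: row.max(), axis=1)     # one value per row

print(list(per_column))  # [5, 7]
print(list(per_row))     # [7, 5]
print(list(demo.max(axis=0)), list(demo.max(axis=1)))  # [5, 7] [7, 5]
```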

In [75]:
df.applymap(lambda x: x*2)  # deprecated since pandas 2.1 in favour of df.map


Out[75]:
A B C D X
0 2.859434 -2.862384 -0.054597 2.275433 0
1 -1.307049 -0.283924 -0.269682 1.927878 2
2 -1.046100 2.121505 0.992893 -0.005712 4
3 2.257105 2.519519 2.213757 -0.468166 6
4 1.606717 1.691967 0.000000 0.000000 8
5 -0.575746 -0.133454 0.000000 0.000000 10
6 0.492338 1.223500 0.000000 0.000000 12
7 2.166234 0.819917 0.000000 0.000000 14
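For simple arithmetic the element-wise call is not needed, since the vectorized expression gives the same values. The sketch below assumes a hypothetical frame `demo` and picks DataFrame.map when it is available (pandas 2.1 and later), falling back to applymap on older versions.

```python
import pandas as pd

# Hypothetical frame with one float and one integer column.
demo = pd.DataFrame({"A": [1.0, 2.0], "X": [3, 4]})

# DataFrame.map is the newer name; older pandas versions only have applymap.
elementwise = demo.map if hasattr(demo, "map") else demo.applymap

doubled_elementwise = elementwise(lambda x: x * 2)  # one Python call per cell
doubled_vectorized = demo * 2                       # whole frame at once

print((doubled_elementwise == doubled_vectorized).all().all())  # True
```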