Series/DataFrame slicing and function application

Slicing

(from the pandas docs: https://pandas.pydata.org/pandas-docs/stable/indexing.html)

The axis labeling information in pandas objects serves many purposes:

Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
Enables automatic and explicit data alignment
Allows intuitive getting and setting of subsets of the data set

Indexing operator: []

Attribute operator: .

Example:



In [3]:

    
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

df["A"] #indexing









    Out[3]:





0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64



In [4]:

    
df.A #attribute









    Out[4]:





0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64



In [5]:

    
type(df.A)









    Out[5]:





pandas.core.series.Series



In [6]:

    
df.A[0]









    Out[6]:





1.4297170198955451



In [7]:

    
df[["A","B"]]



In [9]:

    
type(df[["A","B"]])









    Out[9]:





pandas.core.frame.DataFrame
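
Attribute access is a shortcut that only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below illustrates this with a hypothetical frame (the names `frame`, `count` and `my col` are made up for the example, not taken from the cells above):

```python
import pandas as pd

# Hypothetical frame: one ordinary column, one that clashes with a
# DataFrame method, and one whose name is not a valid identifier.
frame = pd.DataFrame({"A": [1, 2], "count": [3, 4], "my col": [5, 6]})

same = frame["A"].equals(frame.A)   # both forms return the same Series

method = frame.count                # the DataFrame.count method, NOT the column
count_col = frame["count"]          # [] always returns the column
spaced_col = frame["my col"]        # no attribute form exists for this name

print(same, callable(method), list(count_col), list(spaced_col))
```

For that reason [] is usually the safer choice in scripts, while attribute access stays handy in interactive sessions.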

Slicing ranges

The most robust way to slice DataFrames is with the .loc and .iloc indexers; however, plain [] slicing also works on a Series:



In [11]:

    
s = df["A"]
s[:5]









    Out[11]:





0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
Name: A, dtype: float64



In [18]:

    
s[::2]









    Out[18]:





0    1.429717
2   -0.523050
4    0.803358
6    0.246169
Name: A, dtype: float64



In [19]:

    
s[::-1]









    Out[19]:





7    1.083117
6    0.246169
5   -0.287873
4    0.803358
3    1.128553
2   -0.523050
1   -0.653524
0    1.429717
Name: A, dtype: float64

Watch out: the [] operator is inconsistent on a DataFrame. A slice such as df[:3] selects rows, while a label such as df["A"] selects a column. That is why .loc is said to provide a more coherent interface.



In [22]:

    
df[:3] # a slice inside [] selects rows; kept for convenience because it is so common



In [21]:

    
df["A"]









    Out[21]:





0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64
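
The two behaviours of [] can be put side by side with their explicit equivalents. This is a minimal sketch on a hypothetical frame called `demo` (seeded so that it is repeatable; it is not the `df` used in the cells above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a fixed seed so the run is repeatable.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((8, 4)), columns=["A", "B", "C", "D"])

col = demo["A"]    # a string label inside [] selects a column -> Series
rows = demo[:3]    # a slice inside [] selects rows -> DataFrame

# The same selections with the axis stated explicitly:
col_loc = demo.loc[:, "A"]
rows_iloc = demo.iloc[:3]

print(type(col).__name__, type(rows).__name__)
```

With .loc and .iloc the first position always refers to rows and the second to columns, so the intent does not depend on the type of the key.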

The .loc attribute is the primary access method. The following are valid inputs:

A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along it)
A list or array of labels, e.g. ['a', 'b', 'c']
A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included!)
A boolean array
A callable, see Selection By Callable
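
The first and third points in the list are easy to miss, so here is a small sketch of both. The objects `labeled` and `letters` are hypothetical and exist only for this illustration:

```python
import pandas as pd

# Hypothetical frame whose integer index is deliberately NOT 0..n-1.
labeled = pd.DataFrame({"v": [10, 20, 30, 40]}, index=[5, 3, 1, 0])

by_label = labeled.loc[5, "v"]      # label 5 is the first row
by_position = labeled.iloc[0, 0]    # position 0 is also the first row
label_zero = labeled.loc[0, "v"]    # label 0 is the LAST row here

# Label slices include both endpoints.
letters = pd.Series([1, 2, 3, 4], index=list("abcd"))
inclusive = letters.loc["a":"c"]    # rows a, b and c

print(by_label, by_position, label_zero, list(inclusive.index))
```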



In [24]:

    
df.loc[:,"B":]



In [30]:

    
df.loc[4:,"C":] = 0
df

Boolean indexing



In [42]:

    
df.loc[:,"A"]>0









    Out[42]:





0     True
1    False
2    False
3     True
4     True
5    False
6     True
7     True
Name: A, dtype: bool



In [43]:

    
df.loc[df.loc[:,"A"]>0]
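
Several conditions can be combined in one boolean mask. A minimal sketch, assuming a small hypothetical frame `scores`: the element-wise operators are &, | and ~ (the Python keywords and, or, not do not work on a Series), and each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical data for the illustration.
scores = pd.DataFrame({"A": [1.5, -0.5, 0.3, -2.0],
                       "B": [0.1, 0.9, -0.4, 0.7]})

both_positive = scores.loc[(scores["A"] > 0) & (scores["B"] > 0)]
either_negative = scores.loc[(scores["A"] < 0) | (scores["B"] < 0)]
not_positive_a = scores.loc[~(scores["A"] > 0)]

print(list(both_positive.index))
print(list(either_negative.index))
print(list(not_positive_a.index))
```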

Accessing by position with .iloc



In [44]:

    
df.iloc[0,0]









    Out[44]:





1.4297170198955451



In [45]:

    
df.iloc[3:,2:]



In [46]:

    
df.iloc[[0,1,3],[1,3]]
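
On a default integer index, .loc and .iloc look alike but slice differently. The sketch below uses a hypothetical frame `grid` to show the difference, plus negative positions, which only .iloc understands as counting from the end:

```python
import pandas as pd

# Hypothetical frame with the default 0..4 index.
grid = pd.DataFrame({"A": range(5), "B": range(5, 10)})

by_label = grid.loc[0:2]    # labels 0, 1, 2 -> three rows (stop included)
by_pos = grid.iloc[0:2]     # positions 0, 1 -> two rows (stop excluded)
last_row = grid.iloc[-1]    # negative positions count from the end

print(len(by_label), len(by_pos), last_row["B"])
```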

Selection by callable



In [48]:

    
df.loc[:,lambda df: df.columns == "A"]
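
A callable is most useful in a method chain, where the intermediate DataFrame has no name yet. This is only a sketch with a hypothetical frame `sales` and a made-up `revenue` column: the lambda receives the object being indexed and returns a boolean mask.

```python
import pandas as pd

# Hypothetical data for the illustration.
sales = pd.DataFrame({"units": [3, 0, 5], "price": [2.0, 4.0, 1.5]})

result = (
    sales
    .assign(revenue=lambda d: d["units"] * d["price"])       # new column
    .loc[lambda d: d["revenue"] > 0, ["units", "revenue"]]   # filter on it
)

print(result)
```

Without the callable, the frame produced by assign would have to be stored in a variable before it could be filtered.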

Selection by isin



In [51]:

    
df["X"] = range(0, df.shape[0])
df



In [53]:

    
df[df["X"].isin([0,2])]
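
isin returns a boolean Series, so it can be negated with ~ or used on non-numeric columns in the same way. A short sketch on a hypothetical frame `items`:

```python
import pandas as pd

# Hypothetical data for the illustration.
items = pd.DataFrame({"X": range(5), "tag": list("abcab")})

keep = items[items["X"].isin([0, 2])]           # rows whose X is 0 or 2
drop = items[~items["X"].isin([0, 2])]          # every other row
tagged = items[items["tag"].isin(["a", "c"])]   # works for strings too

print(list(keep.index), list(drop.index), list(tagged.index))
```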

The where() Method



In [54]:

    
df.where(df["A"]>0)



In [56]:

    
df.where(df["A"]>0,100)
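
Unlike boolean indexing, where() keeps the original shape: entries that fail the condition are replaced (by NaN unless another value is given) instead of being dropped. The sketch below shows this on a hypothetical Series `values`, together with mask(), which replaces the entries where the condition is True:

```python
import pandas as pd

# Hypothetical data for the illustration.
values = pd.Series([1.5, -0.5, 0.3, -2.0])

kept = values.where(values > 0)          # NaN where the condition is False
filled = values.where(values > 0, 0.0)   # 0.0 where the condition is False
masked = values.mask(values > 0, 0.0)    # 0.0 where the condition is True
filtered = values[values > 0]            # boolean indexing drops the rows

print(len(kept), len(filtered))
print(list(filled), list(masked))
```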

Applying Functions

Over series:

Vectorized functions
Apply/map



In [58]:

    
s * 2









    Out[58]:





0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64



In [59]:

    
s.max()









    Out[59]:





1.4297170198955451



In [60]:

    
np.max(s)









    Out[60]:





1.4297170198955451



In [62]:

    
s.apply(np.max) # apply calls np.max on each element, so the values come back unchanged









    Out[62]:





0    1.429717
1   -0.653524
2   -0.523050
3    1.128553
4    0.803358
5   -0.287873
6    0.246169
7    1.083117
Name: A, dtype: float64



In [64]:

    
def multiply_by_2(x):
    return x*2
s.apply(multiply_by_2)









    Out[64]:





0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64



In [65]:

    
s.apply(lambda x: x*2)









    Out[65]:





0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64



In [67]:

    
s.map(lambda x: x*2)









    Out[67]:





0    2.859434
1   -1.307049
2   -1.046100
3    2.257105
4    1.606717
5   -0.575746
6    0.492338
7    2.166234
Name: A, dtype: float64
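
The four ways of doubling a Series shown above give the same result. A minimal check on a hypothetical Series `s_demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical data for the illustration.
s_demo = pd.Series([1.0, -2.0, 3.5])

vectorized = s_demo * 2                      # operates on the whole Series
via_numpy = np.multiply(s_demo, 2)           # NumPy function, also vectorized
via_apply = s_demo.apply(lambda x: x * 2)    # Python call per element
via_map = s_demo.map(lambda x: x * 2)        # Python call per element

print(vectorized.equals(via_apply), vectorized.equals(via_map))
```

When a vectorized form exists it is generally the better choice, since apply and map call a Python function once per element.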



In [70]:

    
mydict={2:"a"}
df["X"].map(mydict)









    Out[70]:





0    NaN
1    NaN
2      a
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
Name: X, dtype: object
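
As the output shows, keys that are missing from the dictionary become NaN. If a default value is wanted instead, two possible approaches are sketched here; the names `codes` and `names` and the default "other" are hypothetical:

```python
import pandas as pd

# Hypothetical data mirroring the X column above.
codes = pd.Series([0, 1, 2, 3], name="X")
names = {2: "a"}

mapped = codes.map(names)                                  # NaN for missing keys
with_default = codes.map(names).fillna("other")            # fill afterwards
via_lambda = codes.map(lambda k: names.get(k, "other"))    # default inside the lookup

print(list(with_default))
```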

Over dataframes:

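apply (we can decide which axis: axis=0 passes each column to the function, axis=1 passes each row)
applymap (element-wise; deprecated since pandas 2.1 in favor of DataFrame.map)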



In [73]:

    
df.apply(np.max,axis=0)









    Out[73]:





A    1.429717
B    1.259760
C    1.106879
D    1.137717
X    7.000000
dtype: float64



In [74]:

    
df.apply(np.max,axis=1)









    Out[74]:





0    1.429717
1    1.000000
2    2.000000
3    3.000000
4    4.000000
5    5.000000
6    6.000000
7    7.000000
dtype: float64
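
The axis argument decides what the function receives. A small sketch on a hypothetical frame `table`, using a range function (max minus min) so that the two directions give visibly different results:

```python
import pandas as pd

# Hypothetical data for the illustration.
table = pd.DataFrame({"A": [1.0, 5.0], "B": [4.0, 2.0], "C": [3.0, 6.0]})

def value_range(values):
    # `values` is a Series: one column (axis=0) or one row (axis=1).
    return values.max() - values.min()

col_range = table.apply(value_range, axis=0)   # one result per column
row_range = table.apply(value_range, axis=1)   # one result per row

print(list(col_range.index), list(col_range))
print(list(row_range.index), list(row_range))
```

With axis=0 the result is indexed by column name; with axis=1 it is indexed by row label.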



In [75]:

    
df.applymap(lambda x: x*2)
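
In pandas 2.1 and later, applymap is deprecated and the same element-wise operation is available as DataFrame.map. The sketch below picks whichever one the installed version provides; the frame `frame` and the helper name `elementwise` are hypothetical:

```python
import pandas as pd

# Hypothetical data for the illustration.
frame = pd.DataFrame({"A": [1.0, -2.0], "X": [0, 1]})

# Use DataFrame.map when it exists (pandas >= 2.1), else fall back to applymap.
elementwise = frame.map if hasattr(pd.DataFrame, "map") else frame.applymap

doubled = elementwise(lambda x: x * 2)   # Python call per element
vectorized = frame * 2                   # same values, computed column-wise

print(doubled)
```

For simple arithmetic like this, the vectorized expression gives the same values without calling a Python function for every cell.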

DataFrame outputs of the cells above:

Out[22]: df[:3]

	A	B	C	D
0	1.429717	-1.431192	-0.027299	1.137717
1	-0.653524	-0.141962	-0.134841	0.963939
2	-0.523050	1.060752	0.496447	-0.002856

Out[24]: df.loc[:,"B":]

	B	C	D
0	-1.431192	-0.027299	1.137717
1	-0.141962	-0.134841	0.963939
2	1.060752	0.496447	-0.002856
3	1.259760	1.106879	-0.234083
4	0.845983	-1.546642	-0.393005
5	-0.066727	-0.787697	-1.099309
6	0.611750	2.238141	-1.105078
7	0.409958	0.515915	-0.178913

Out[30]: df after df.loc[4:,"C":] = 0

	A	B	C	D
0	1.429717	-1.431192	-0.027299	1.137717
1	-0.653524	-0.141962	-0.134841	0.963939
2	-0.523050	1.060752	0.496447	-0.002856
3	1.128553	1.259760	1.106879	-0.234083
4	0.803358	0.845983	0.000000	0.000000
5	-0.287873	-0.066727	0.000000	0.000000
6	0.246169	0.611750	0.000000	0.000000
7	1.083117	0.409958	0.000000	0.000000

Out[43]: df.loc[df.loc[:,"A"]>0]

	A	B	C	D
0	1.429717	-1.431192	-0.027299	1.137717
3	1.128553	1.259760	1.106879	-0.234083
4	0.803358	0.845983	0.000000	0.000000
6	0.246169	0.611750	0.000000	0.000000
7	1.083117	0.409958	0.000000	0.000000

Out[45]: df.iloc[3:,2:]

	C	D
3	1.106879	-0.234083
4	0.000000	0.000000
5	0.000000	0.000000
6	0.000000	0.000000
7	0.000000	0.000000

Out[75]: df.applymap(lambda x: x*2)

	A	B	C	D	X
0	2.859434	-2.862384	-0.054597	2.275433	0
1	-1.307049	-0.283924	-0.269682	1.927878	2
2	-1.046100	2.121505	0.992893	-0.005712	4
3	2.257105	2.519519	2.213757	-0.468166	6
4	1.606717	1.691967	0.000000	0.000000	8
5	-0.575746	-0.133454	0.000000	0.000000	10
6	0.492338	1.223500	0.000000	0.000000	12
7	2.166234	0.819917	0.000000	0.000000	14