pandas Series: reindexing, filling, mutating, copying, and maps


In [2]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2

NaN values are treated differently in numpy than in pandas. In numpy, as we saw earlier, if an array contains a NaN value, summary statistics such as the mean are calculated as NaN.


In [3]:
np_array = np.array([1,2,3,np.nan])
np_array


Out[3]:
array([  1.,   2.,   3.,  nan])

In [4]:
np_array.mean()


Out[4]:
nan

In [5]:
pd_series = pd.Series([1,2,3,np.nan])
pd_series


Out[5]:
0     1
1     2
2     3
3   NaN
dtype: float64

A pandas Series treats NaN differently: summary statistics such as the mean simply skip the missing value. We’ll cover filling in those missing values later in this section.


In [6]:
pd_series.mean()


Out[6]:
2.0
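As a small side-by-side sketch of the difference described above (the variable names are illustrative, not from the original notebook): numpy offers `np.nanmean` to skip NaN explicitly, while pandas skips NaN by default and can be told not to with `skipna=False`.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, np.nan]
np_array = np.array(values)
pd_series = pd.Series(values)

# numpy: the plain mean propagates NaN, nanmean skips it
plain_mean = np_array.mean()
nan_aware_mean = np.nanmean(np_array)

# pandas: skipna defaults to True; switch it off to get numpy's behaviour
skipping_mean = pd_series.mean()
strict_mean = pd_series.mean(skipna=False)

print(plain_mean, nan_aware_mean, skipping_mean, strict_mean)
```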

In [7]:
np.random.seed(567)

Sometimes you're going to have to build a new index. For example, say we've got two Series.


In [8]:
s1 = pd.Series(np.random.randn(5))
s1


Out[8]:
0    0.213266
1   -0.091899
2   -0.089349
3    0.265756
4    0.376065
dtype: float64

In [9]:
s2 = pd.Series(np.random.randn(5))
s2


Out[9]:
0    0.688025
1    0.510002
2    1.914120
3    0.724774
4    0.124588
dtype: float64

Now at times you’re going to want to reindex a Series. What does this mean? In the simplest case you replace the current index labels with new ones; more generally, you conform the data to a new set of labels. Let’s walk through a practical example.


In [10]:
combo = pd.concat([s1, s2])
combo


Out[10]:
0    0.213266
1   -0.091899
2   -0.089349
3    0.265756
4    0.376065
0    0.688025
1    0.510002
2    1.914120
3    0.724774
4    0.124588
dtype: float64

When we concatenate them, we can see we’ve got repeated index values. We can still query by these index values as we normally would (each lookup now returns every matching row), but in all likelihood we’ll want to replace them with a new index.


In [11]:
combo[0]


Out[11]:
0    0.213266
0    0.688025
dtype: float64
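With repeated labels it helps to be explicit about whether a lookup is by label or by position. The following is a minimal sketch on made-up data (`left`, `right`, and `stacked` are illustrative names): `.loc` selects by label and returns every match, while `.iloc` selects by position and returns a single row.

```python
import pandas as pd

left = pd.Series([10.0, 20.0])
right = pd.Series([30.0, 40.0])
stacked = pd.concat([left, right])  # index is 0, 1, 0, 1

by_label = stacked.loc[0]      # every row labelled 0: a Series of two values
by_position = stacked.iloc[0]  # the first row only: a scalar

print(by_label)
print(by_position)
```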

In [12]:
combo.index = range(len(combo))
combo


Out[12]:
0    0.213266
1   -0.091899
2   -0.089349
3    0.265756
4    0.376065
5    0.688025
6    0.510002
7    1.914120
8    0.724774
9    0.124588
dtype: float64
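Two other ways to end up with a fresh 0..n-1 index, sketched here on made-up data rather than the notebook's random values: ask `pd.concat` to ignore the incoming indexes, or call `reset_index(drop=True)` afterwards. Both return a new Series instead of modifying one in place.

```python
import pandas as pd

left = pd.Series([10.0, 20.0])
right = pd.Series([30.0, 40.0])

# Renumber while concatenating
renumbered_on_concat = pd.concat([left, right], ignore_index=True)

# Or renumber afterwards; drop=True discards the old labels
renumbered_after = pd.concat([left, right]).reset_index(drop=True)

print(renumbered_on_concat)
```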

However, assigning to index is rather limited: it just overwrites the labels we have now. What if we want to conform the Series to a new set of labels, filling in NaN wherever a label has no data? For that we use reindex, which returns a new Series.


In [13]:
new_combo = combo.reindex([0,2,15,21])
new_combo


Out[13]:
0     0.213266
2    -0.089349
15         NaN
21         NaN
dtype: float64

We can specify how to handle those NaN values with fill_value, or we can specify a method by which they should be filled. This can be performed during the reindexing using the method parameter (just as we pass fill_value), or we can do it after the fact.


In [14]:
combo.reindex([0,2,15,21], fill_value=0)


Out[14]:
0     0.213266
2    -0.089349
15    0.000000
21    0.000000
dtype: float64

In [15]:
new_combo


Out[15]:
0     0.213266
2    -0.089349
15         NaN
21         NaN
dtype: float64

Here’s an example of ffill, which is forward fill: each missing value takes the last valid value before it.


In [16]:
new_combo.ffill()


Out[16]:
0     0.213266
2    -0.089349
15   -0.089349
21   -0.089349
dtype: float64

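And bfill, or backward fill, takes the next valid value after each gap. Here there is no valid value after the trailing NaNs, so nothing is filled until we set one.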


In [17]:
new_combo.bfill()


Out[17]:
0     0.213266
2    -0.089349
15         NaN
21         NaN
dtype: float64

In [18]:
new_combo[21] = 5

In [19]:
new_combo


Out[19]:
0     0.213266
2    -0.089349
15         NaN
21    5.000000
dtype: float64

In [20]:
new_combo.bfill()


Out[20]:
0     0.213266
2    -0.089349
15    5.000000
21    5.000000
dtype: float64

In [21]:
new_combo


Out[21]:
0     0.213266
2    -0.089349
15         NaN
21    5.000000
dtype: float64
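The fills above were applied after the fact. As a sketch of doing the same thing during reindexing, the example below passes `method` to `reindex` on a small made-up Series (`readings` is an illustrative name). Note that this requires a sorted (monotonic) index, and it fills only the labels that reindexing introduces.

```python
import pandas as pd

readings = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

# New labels 1, 3 and 5 take the last known value before them
filled_forward = readings.reindex(range(6), method="ffill")

# New labels take the next known value; label 5 has none, so it stays NaN
filled_backward = readings.reindex(range(6), method="bfill")

print(filled_forward)
print(filled_backward)
```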

fillna simply fills the missing values with whatever value you specify, again returning a new Series.


In [22]:
new_combo.fillna(12)


Out[22]:
0      0.213266
2     -0.089349
15    12.000000
21     5.000000
dtype: float64
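The fill value does not have to be a constant you pick by hand. One common choice, shown here as a sketch on made-up data, is to fill with a statistic of the Series itself, such as its mean. As with the other fill methods, the original Series is left unchanged.

```python
import numpy as np
import pandas as pd

measurements = pd.Series([1.0, np.nan, 3.0])

# mean() skips the NaN, so the gap is filled with (1 + 3) / 2
filled = measurements.fillna(measurements.mean())

print(filled)
print(measurements)  # still contains the NaN
```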

Now I want to cover how we can combine different Series on their index labels and perform simple arithmetic operations.

When s1 and s2 have the same index it’s easy to, say, add them together and get what we expect.


In [23]:
s1


Out[23]:
0    0.213266
1   -0.091899
2   -0.089349
3    0.265756
4    0.376065
dtype: float64

In [24]:
s2


Out[24]:
0    0.688025
1    0.510002
2    1.914120
3    0.724774
4    0.124588
dtype: float64

In [25]:
s1 + s2


Out[25]:
0    0.901292
1    0.418102
2    1.824772
3    0.990529
4    0.500653
dtype: float64

However, things get more complicated when they have different indices. Now when we try to add them, values are summed only on the overlapping index labels; every other label gets NaN. Oftentimes this may be what we want when we’re analyzing data, but other times it’s not. To handle that, we’ve got to do some reindexing and use fill values.


In [26]:
s2.index = list(range(3,8))
s2


Out[26]:
3    0.688025
4    0.510002
5    1.914120
6    0.724774
7    0.124588
dtype: float64

In [27]:
s1 + s2


Out[27]:
0         NaN
1         NaN
2         NaN
3    0.953781
4    0.886067
5         NaN
6         NaN
7         NaN
dtype: float64

In [28]:
s1.reindex(range(10),fill_value=0) + s2.reindex(range(10),fill_value=0)


Out[28]:
0    0.213266
1   -0.091899
2   -0.089349
3    0.953781
4    0.886067
5    1.914120
6    0.724774
7    0.124588
8    0.000000
9    0.000000
dtype: float64
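An alternative to reindexing both sides by hand is the `add` method with `fill_value`, sketched below on made-up data (`a` and `b` are illustrative names). A label missing from one Series is treated as the fill value. Unlike the `reindex(range(10))` approach, the result covers only the union of the two indexes, so no extra all-zero rows appear.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

# Plain + leaves NaN wherever a label is missing from either side
plain_sum = a + b

# add() with fill_value substitutes 0 for the missing side
filled_sum = a.add(b, fill_value=0)

print(plain_sum)
print(filled_sum)
```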

In [29]:
s2.index = range(5)

In [30]:
s1 = pd.Series(range(1,4), index= ['a','a','c'])
s1


Out[30]:
a    1
a    2
c    3
dtype: int64

In [31]:
s2 = pd.Series(range(1,4), index=['a','a','b'])
s2


Out[31]:
a    1
a    2
b    3
dtype: int64

Finally, consider what happens when an index contains repeated labels and we bring two such Series together with some operation: we get multiple rows per repeated label. For example, multiplying them performs a Cartesian product of the two Series on the shared repeated labels, in this example a.


In [32]:
s1 * s2


Out[32]:
a     1
a     2
a     2
a     4
b   NaN
c   NaN
dtype: float64

Adding them together works the same way: each a value in s1 is added to each a value in s2.


In [33]:
s1 + s2


Out[33]:
a     2
a     3
a     3
a     4
b   NaN
c   NaN
dtype: float64
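If that row multiplication is not what you want, one option (a sketch, not the only approach) is to check for repeated labels with `index.is_unique` and collapse them before combining. The example below rebuilds the two Series under illustrative names and sums the duplicates with `groupby(level=0)`; whether summing is the right way to collapse them depends on your data.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "a", "c"])
right = pd.Series([1, 2, 3], index=["a", "a", "b"])

# Two 'a' rows on each side produce 2 x 2 = 4 'a' rows in the result
total = left + right
print(len(total), left.index.is_unique)

# Collapse repeated labels first (here by summing) to get one row per label
left_unique = left.groupby(level=0).sum()
right_unique = right.groupby(level=0).sum()
collapsed_total = left_unique + right_unique
print(collapsed_total)
```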

In [34]:
s1


Out[34]:
a    1
a    2
c    3
dtype: int64

Lastly, sometimes you’re going to want to experiment with modifications to a Series or DataFrame. That can be done with the copy method, which returns a copy of the data, so changes to the copy leave the original untouched.


In [35]:
s1_copy = s1.copy()

In [36]:
s1_copy['a'] = 3

In [37]:
s1_copy


Out[37]:
a    3
a    3
c    3
dtype: int64

In [38]:
s1


Out[38]:
a    1
a    2
c    3
dtype: int64
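To see why copy matters, compare it with plain assignment, which only creates a second name for the same object. This is a small sketch with illustrative names:

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["a", "b", "c"])

alias = original               # same object, just another name
independent = original.copy()  # separate data

alias["a"] = 100        # visible through `original` as well
independent["b"] = 200  # does not touch `original`

print(original)
```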

There are a couple more methods I want to touch on, most specifically map.


In [39]:
s1.map(lambda x: x ** 2)


Out[39]:
a    1
a    4
c    9
dtype: int64

Maps are going to feel familiar from our raw Python section, except we can do something a bit more special with the pandas Series version: we can pass map a dictionary as well. This performs a lookup of each value in the dictionary and returns whatever is there.


In [40]:
s1.map({1:2,2:3,3:12})


Out[40]:
a     2
a     3
c    12
dtype: int64

If it doesn't find the value there, it will return NaN (which is also why the result below becomes float64).


In [41]:
s1.map({2:3,3:12})


Out[41]:
a   NaN
a     3
c    12
dtype: float64
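If you would rather keep the original value when the dictionary has no entry for it, one possible approach is to map with a function that falls back via `dict.get`. This is a sketch with illustrative names (`numbers`, `lookup`), not something the notebook itself does:

```python
import pandas as pd

numbers = pd.Series([1, 2, 3], index=["a", "a", "c"])
lookup = {2: 3, 3: 12}

# Passing the dictionary directly: the unmatched value 1 becomes NaN
with_gaps = numbers.map(lookup)

# Passing a function: unmatched values fall back to themselves
with_fallback = numbers.map(lambda value: lookup.get(value, value))

print(with_gaps)
print(with_fallback)
```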
