Pandas is a high-performance python library that provides a comprehensive set of data structures for manipulating tabular data, providing high-performance indexing, automatic alignment, reshaping, grouping, joining and statistical analyses capabilities.
The two primary data structures in pandas are the Series and the DataFrame objects.
The Series object is the fundamental building block of pandas. A Series represents an one-dimensional array based on the NumPy ndarray but with a labeled index that significantly helps to access the elements.
A Series always has an index even if one is not specified, by default pandas will create an index that consists of sequential integers starting from zero. Access to elements is not by integer position but using values in the index referred as Labels.
Importing pandas into the application is simple. It is common to import both pandas and numpy with their objects mapped into the pd and np namespaces respectively.
In [1]:
import numpy as np
import pandas as pd
pd.__version__
Out[1]:
In [2]:
np.__version__
Out[2]:
In [3]:
# set some options to control output display
pd.set_option('display.notebook_repr_html',False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)
In [4]:
# create one item series
s1 = pd.Series(1)
s1
Out[4]:
'0' is the index and '1' is the value. The data type (dtype) is also shown. We can also retrieve the value using the associated index.
In [5]:
# get value with label 0
s1[0]
Out[5]:
In [6]:
# create from list
s2 = pd.Series([1,2,3,4,5])
s2
Out[6]:
In [8]:
# get the values in the series
s2.values
Out[8]:
In [9]:
# get the index of the series
s2.index
Out[9]:
Pandas will create different index types based on the type of data identified in the index parameter. These different index types are optimized to perform indexing operations for that specific data type. To specify the index at the time of creation of the Series, use the index parameter of the constructor.
In [13]:
# explicitly create an index
# index is alpha, not an integer
s3 = pd.Series([1,2,3], index=['a','b','c'])
s3
Out[13]:
In [14]:
s3.index
Out[14]:
Please note the type of the index items. It is not string but 'object'.
In [15]:
# look up by label value and not object position
s3['b']
Out[15]:
In [18]:
# position also works
s3[2]
Out[18]:
In [19]:
# create series from an existing index
# scalar value will be copied at each index label
s4 = pd.Series(2,index=s2.index)
s4
Out[19]:
It is a common practice to initialize the Series objects using NumPy ndarrays, and with various NumPy functions that create arrays. The following code creates a Series from five normally distributed values:
In [20]:
np.random.seed(123456)
pd.Series(np.random.randn(5))
Out[20]:
In [22]:
# 0 through 9
pd.Series(np.linspace(0,9,10))
Out[22]:
In [23]:
# o through 8
pd.Series(np.arange(0,9))
Out[23]:
A Series can also be created from a Python dictionary. The keys of the dictionary are used as the index lables for the Series:
In [25]:
s6 = pd.Series({'a':1,'b':2,'c':3,'d':4})
s6
Out[25]:
In [26]:
# example series which also contains a NaN
s = pd.Series([0,1,1,2,3,4,5,6,7,np.NaN])
s
Out[26]:
In [27]:
# length of the Series
len(s)
Out[27]:
In [28]:
s.size
Out[28]:
In [29]:
# shape is a tuple with one value
s.shape
Out[29]:
In [30]:
# number of values not part of NaN can be found using count() method
s.count()
Out[30]:
In [31]:
# all unique values
s.unique()
Out[31]:
In [32]:
# count of non-NaN values, returned max to min order
s.value_counts()
Out[32]:
pandas provides the .head() and .tail() methods to examine just the first few or last records in a Series. By default, these return the first five or last rows respectively, but you can use the n parameter or just pass an integer to specify the number of rows:
In [33]:
# first five
s.head()
Out[33]:
In [34]:
# first three
s.head(3)
Out[34]:
In [35]:
# last five
s.tail()
Out[35]:
In [36]:
# last 2
s.tail(n=2) # equivalent to s.tail(2)
Out[36]:
The .take() method will return the rows in a series that correspond to the zero-based positions specified in a list:
In [37]:
# only take specific items
s.take([0,3,9])
Out[37]:
In [39]:
# single item lookup
s3['a']
Out[39]:
In [40]:
# lookup by position since index is not an integer
s3[2]
Out[40]:
In [42]:
# multiple items
s3[['a','c']]
Out[42]:
In [43]:
# series with an integer index but not starting with 0
s5 = pd.Series([1,2,3], index =[11,12,13])
s5[12] # by value as value passed and index are both integer
Out[43]:
To alleviate the potential confusion in determining the label-based lookups versus position-based lookups, index based lookup can be enforced using the .loc[] accessor:
In [44]:
# force lookup by index label
s5.loc[12]
Out[44]:
Lookup by position can be enforced using the iloc[] accessor:
In [45]:
# force lookup by position or location
s5.iloc[1]
Out[45]:
In [46]:
# multiple items by index label
s5.loc[[12,10]]
Out[46]:
In [47]:
# multiple items by position or location
s5.iloc[[1,2]]
Out[47]:
If a location / position passed to .iloc[] in a list is out of bounds, an exception will be thrown. This is different than with .loc[], which if passed a label that does not exist, will return NaN as the value for that label:
In [48]:
s5.loc[[12,-1,15]]
Out[48]:
A Series also has a property .ix that can be used to look up items either by label or by zero-based array position.
In [49]:
s3
Out[49]:
In [50]:
# label based lookup
s3.ix[['a','b']]
Out[50]:
In [51]:
# position based lookup
s3.ix[[1,2]]
Out[51]:
This can become complicated if the indexes are integers and you pass a list of integers to ix. Since they are of the same type, the lookup will be by index label instead of position:
In [52]:
# this looks by label and not position
# note that 1,2 have NaN as those labels do not exist in the index
s5.ix[[1,2,10,11]]
Out[52]:
In [53]:
s6 = pd.Series([1,2,3,4], index=['a','b','c','d'])
s6
Out[53]:
In [54]:
s7 = pd.Series([4,3,2,1], index=['d','c','b','a'])
s7
Out[54]:
In [55]:
s6 + s7
Out[55]:
This is a very different result that what it would have been if it were two pure NumPy arrays being added. A NumPy ndarray would add the items in identical positions of each array resulting in different values.
In [56]:
a1 = np.array([1,2,3,4,5])
a2 = np.array([5,4,3,2,1])
a1 + a2
Out[56]:
The process of adding two Series objects differs from the process of addition of arrays as it first aligns data based on index label values instead of simply applying the operation to elements in the same position. This becomes significantly powerful when using pandas Series to combine data based on labels instead of having to first order the data manually.
In [57]:
# multiply all values in s3 by 2
s3 * 2
Out[57]:
In [58]:
# scalar series using the s3's index
# not efficient as it will no use vectorisation
t = pd.Series(2,s3.index)
s3 * t
Out[58]:
To reinforce the point that alignment is being performed when applying arithmetic operations across two Series objects, look at the following two Series as examples:
In [59]:
# we will add this to s9
s8 = pd.Series({'a':1,'b':2,'c':3,'d':5})
s8
Out[59]:
In [60]:
s9 = pd.Series({'b':6,'c':7,'d':9,'e':10})
s9
Out[60]:
In [61]:
# NaN's result for a and e demonstrates alignment
s8 + s9
Out[61]:
In [62]:
s10 = pd.Series([1.0,2.0,3.0],index=['a','a','b'])
s10
Out[62]:
In [63]:
s11 = pd.Series([4.0,5.0,6.0], index=['a','a','c'])
s11
Out[63]:
In [64]:
# will result in four 'a' index labels
s10 + s11
Out[64]:
The reason for the above result is that during alignment, pandas actually performs a cartesian product of the sets of all the unique index labels in both Series objects, and then applies the specified operation on all items in the products.
To explain why there are four 'a' index values s10 contains two 'a' labels and s11 also contains two 'a' labels. Every combination of 'a' labels in each will be calculated resulting in four 'a' labels. There is one 'b' label from s10 and one 'c' label from s11. Since there is no matching label for either in the other Series object, they only result in a sing row in the resulting Series object.
Each combination of values for 'a' in both Series are computed, resulting in the four values: 1+4,1+5,2+4 and 2+5.
So remember that an index can have duplicate labels, and during alignment this will result in a number of index labels equivalent to the products of the number of the labels in each Series.
In [65]:
nda = np.array([1,2,3,4,5])
nda.mean()
Out[65]:
In [66]:
# mean of numpy array values with a NaN
nda = np.array([1,2,3,4,np.NaN])
nda.mean()
Out[66]:
In [67]:
# Series object ignores NaN values - does not get factored
s = pd.Series(nda)
s.mean()
Out[67]:
In [68]:
# handle NaN values like Numpy
s.mean(skipna=False)
Out[68]:
In [69]:
# which rows have values that are > 5
s = pd.Series(np.arange(0,10))
s > 5
Out[69]:
In [71]:
# select rows where values are > 5
# overloading the Series object [] operator
logicalResults = s > 5
s[logicalResults]
Out[71]:
In [72]:
# a little shorter version
s[s > 5]
Out[72]:
In [73]:
# using & operator
s[(s>5)&(s<9)]
Out[73]:
In [74]:
# using | operator
s[(s > 3) | (s < 5)]
Out[74]:
In [75]:
# are all items >= 0?
(s >=0).all()
Out[75]:
In [76]:
# are any items < 2
s[s < 2].any()
Out[76]:
The result of these logical expressions is a Boolean selection, a Series of True and False values. The .sum() method of a Series, when given a series of Boolean values, will treat True as 1 and False as 0. The following demonstrates using this to determine the number of items in a Series that satisfy a given expression:
In [77]:
(s < 2).sum()
Out[77]:
Reindexing in pandas is a process that makes the data in a Series or DataFrame match a given set of labels.
This process of performing a reindex includes the following steps:
In [78]:
# sample series of five items
s = pd.Series(np.random.randn(5))
s
Out[78]:
In [79]:
# change the index
s.index = ['a','b','c','d','e']
s
Out[79]:
In [80]:
# concat copies index values verbatim
# potentially making duplicates
np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))
combined = pd.concat([s1,s2])
combined
Out[80]:
In [81]:
# reset the index
combined.index = np.arange(0,len(combined))
combined
Out[81]:
Greater flexibility in creating a new index is provided using the .reindex() method. An example of the flexibility of .reindex() over assigning the .index property directly is that the list provided to .reindex() can be of a different length than the number of rows in the Series:
In [82]:
np.random.seed(123456)
s1 = pd.Series(np.random.randn(4),['a','b','c','d'])
# reindex with different number of labels
# results in dropped rows and/or NaN's
s2 = s1.reindex(['a','c','g'])
s2
Out[82]:
There are several things here that are important to point out about .reindex() method.
To demonstrate that the result of .reindex() is a new Series object, changing a value in s2 does not change the values in s1:
In [83]:
# s2 is a different series than s1
s2['a'] = 0
s2
Out[83]:
In [ ]:
# this did not modify s1
s1
Reindex is also useful when you want to align two Series to perform an operation on matching elements from each series; however, for some reason, the two Series has index labels that will not initially align.
In [87]:
# different types for the same values of labels causes big issue
s1 = pd.Series([0,1,2],index=[0,1,2])
s2 = pd.Series([3,4,5],index=['0','1','2'])
s1 + s2
Out[87]:
The reason why this happens in pandas are as follows:
In [88]:
# reindex by casting the label types and we will get the desired result
s2.index = s2.index.values.astype(int)
s1 + s2
Out[88]:
The default action of inserting NaN as a missing value during reindexing can be changed by using the fill_value parameter of the method.
In [89]:
# fill with 0 instead on NaN
s2 = s.copy()
s2.reindex(['a','f'],fill_value=0)
Out[89]:
When performing a reindex on ordered data such as a time series, it is possible to perform interpolation or filling of values. The following example demonstrates forward filling, often referred to as "last known value".
In [90]:
# create example to demonstrate fills
s3 = pd.Series(['red','green','blue'],index=[0,3,5])
s3
Out[90]:
In [91]:
# forward fill using ffill method
s3.reindex(np.arange(0,7), method='ffill')
Out[91]:
In [92]:
# backward fill using bfill method
s3.reindex(np.arange(0,7),method='bfill')
Out[92]:
In [94]:
np.random.seed(123456)
s = pd.Series(np.random.randn(3),index=['a','b','c'])
s
Out[94]:
In [95]:
# change a value in the Series
# this done in-place
# a new Series is not returned that has a modified value
s['d'] = 100
s
Out[95]:
In [96]:
# value at a specific index label can be changed by assignment:
s['d'] = -100
s
Out[96]:
Items can be removed from a Series using the del() function and passing the index label(s) to be removed.
In [97]:
del(s['a'])
s
Out[97]:
In [98]:
# a series to use for slicing
# using index labels not starting at 0 to demonstrate
# position based slicing
s = pd.Series(np.arange(100,110),index=np.arange(10,20))
s
Out[98]:
In [99]:
# items at position 0,2,4
s[0:6:2]
Out[99]:
In [100]:
# equivalent to
s.iloc[[0,2,4]]
Out[100]:
In [101]:
# first five by slicing, same as .head(5)
s[:5]
Out[101]:
In [102]:
# fourth position to the end
s[4:]
Out[102]:
In [103]:
# every other item in the first five positions
s[:5:2]
Out[103]:
In [104]:
# every other item starting at the fourth position
s[4::2]
Out[104]:
In [105]:
# reverse the series
s[::-1]
Out[105]:
In [106]:
# every other starting at position 4, in reverse
s[4::-2]
Out[106]:
In [107]:
# :-2 which means positions 0 through (10-2) which is [8]
s[:-2]
Out[107]:
In [108]:
# last 3 items
# equivalent to tail(3)
s[-3:]
Out[108]:
In [109]:
# equivalent to s.tail(4).head(3)
s[-4:-1]
Out[109]:
An important thing to keep in mind when using slicing, is that the result of the slice is actually a view into the original Series. Modification of values through the result of the slice will modify the original Series.
In [110]:
# preserve s
# slice with first 2 rows
copy = s.copy()
slice = copy[:2]
slice
Out[110]:
Now the assignment of a value to an element of a slice will change the value in the original Series:
In [111]:
slice[11] = 1000
copy
Out[111]:
Slicing can be performed on Series objects with a non-integer index.
In [112]:
# used to demonstrate the next two slices
s = pd.Series(np.arange(0,5),index=['a','b','c','d','e'])
s
Out[112]:
In [113]:
# slicing with integer values will extract items based on position:
s[1:3]
Out[113]:
In [114]:
# with non-integer index, it is also possible to slice with values in the same type of the index:
s['b':'d']
Out[114]: