---
layout: post
title: Introduction to Pandas 4
excerpt:
categories: blog
tags: ["Python", "Data Science", "Pandas"]
published: true
comments: true
share: true
---
Let;s get our hands dirty with another important topic in Pandas. We need to able to handle missing data and we will see some methods to handle them. After making our data more suitable for analysis, or prediction we will learn how to plot it with matplotlib
library.
I am going to use matplotlib library for this post and other pandas tutorials but I normally prefer using bokeh interactive plotting library.
If you like to follow this small tutorial in notebook format you can check out here
Missing data is usually shows up as NULL
or N/A
or NaN
(Pandas represents this way.)
Other than appearing natively in the source dataset, missing values can be added to a dataset by an operation such as reindexing, or changing frequencies in the case of a time series:
In [45]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
date_stngs = ['2014-05-01','2014-05-02','2014-05-05','2014-05-06','2014-05-07']
tradeDates = pd.to_datetime(pd.Series(date_stngs))
closingPrices=[531.35,527.93,527.81,515.14,509.96]
googClosingPrices=pd.DataFrame(data=closingPrices, columns=['closingPrice'], index=tradeDates)
googClosingPrices
Out[2]:
Pandas is also provides a very easy and fast API to read stock data from various finance data providers.
In [5]:
from pandas_datareader import data, wb # pandas data READ API
import datetime
In [9]:
googPrices = data.get_data_yahoo("GOOG", start=datetime.datetime(2014, 5, 1),
end=datetime.datetime(2014, 5, 7))
In [10]:
googFinalPrices=pd.DataFrame(googPrices['Close'], index=tradeDates)
googFinalPrices
Out[10]:
We now have a time series that depicts the closing price of Google's stock from May 1, 2014 to May 7, 2014 with gaps in the date range since the trading only occur on business days. If we want to change the date range so that it shows calendar days (that is, along with the weekend), we can change the frequency of the time series index from business days to calendar days as follow:
In [11]:
googClosingPricesCDays = googClosingPrices.asfreq('D')
googClosingPricesCDays
Out[11]:
Note that we have now introduced NaN values for the closingPrice for the weekend dates of May 3, 2014 and May 4, 2014.
We can check which values are missing by using the isnull()
and notnull()
functions as follows:
In [14]:
googClosingPricesCDays.isnull()
Out[14]:
In [15]:
googClosingPricesCDays.notnull()
Out[15]:
A Boolean DataFrame is returned in each case. In datetime and pandas Timestamps, missing values are represented by the NaT value. This is the equivalent of NaN in pandas for time-based types
In [17]:
tDates=tradeDates.copy()
tDates[1]=np.NaN
tDates[4]=np.NaN
tDates
In [20]:
FBVolume=[82.34,54.11,45.99,55.86,78.5]
TWTRVolume=[15.74,12.71,10.39,134.62,68.84]
socialTradingVolume=pd.concat([pd.Series(FBVolume), pd.Series(TWTRVolume),
tradeDates], axis=1,keys=['FB','TWTR','TradeDate'])
socialTradingVolume
Out[20]:
In [21]:
socialTradingVolTS=socialTradingVolume.set_index('TradeDate')
socialTradingVolTS
Out[21]:
In [22]:
socialTradingVolTSCal=socialTradingVolTS.asfreq('D')
socialTradingVolTSCal
Out[22]:
We can perform arithmetic operations on data containing missing values. For example, we can calculate the total trading volume (in millions of shares) across the two stocks for Facebook and Twitter as follows:
In [23]:
socialTradingVolTSCal['FB']+socialTradingVolTSCal['TWTR']
Out[23]:
By default, any operation performed on an object that contains missing values will return a missing value at that position as shown in the following command:
In [24]:
pd.Series([1.0,np.NaN,5.9,6])+pd.Series([3,5,2,5.6])
Out[24]:
In [25]:
pd.Series([1.0,25.0,5.5,6])/pd.Series([3,np.NaN,2,5.6])
Out[25]:
There is a difference, however, in the way NumPy treats aggregate calculations versus what pandas does.
In pandas, the default is to treat the missing value as 0 and do the aggregate calculation, whereas for NumPy, NaN is returned if any of the values are missing. Here is an illustration:
In [26]:
np.mean([1.0,np.NaN,5.9,6])
Out[26]:
In [27]:
np.sum([1.0,np.NaN,5.9,6])
Out[27]:
However, if this data is in a pandas Series, we will get the following output:
In [28]:
pd.Series([1.0,np.NaN,5.9,6]).sum()
Out[28]:
In [29]:
pd.Series([1.0,np.NaN,5.9,6]).mean()
Out[29]:
It is important to be aware of this difference in behavior between pandas and NumPy. However, if we wish to get NumPy to behave the same way as pandas, we can use the np.nanmean and np.nansum functions, which are illustrated as follows:
In [30]:
np.nanmean([1.0,np.NaN,5.9,6])
Out[30]:
In [31]:
np.nansum([1.0,np.NaN,5.9,6])
Out[31]:
In [32]:
socialTradingVolTSCal
Out[32]:
In [33]:
socialTradingVolTSCal.fillna(100)
Out[33]:
We can also fill forward or backward values using the ffill()
or bfill()
arguments:
In [34]:
socialTradingVolTSCal.fillna(method='ffill')
Out[34]:
In [35]:
socialTradingVolTSCal.fillna(method='bfill')
Out[35]:
2. By using the dropna()
function to drop/delete rows and columns with missing values.
In [36]:
socialTradingVolTSCal.dropna()
Out[36]:
3. We can also interpolate and fill in the missing values by using the interpolate()
function
In [37]:
pd.set_option('display.precision',4)
socialTradingVolTSCal.interpolate()
Out[37]:
The interpolate()
function also takes an argument—method that denotes the method. These methods include linear, quadratic, cubic spline, and so on.
The matplotlib api is imported using the standard convention, as shown in the following command: import matplotlib.pyplot as plt
Series and DataFrame have a plot method, which is simply a wrapper around plt.plot
. Here, we will examine how we can do a simple plot of a sine and cosine function. Suppose we wished to plot the following functions over the interval pi to pi:
In [39]:
X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
f,g = np.cos(X)+np.sin(X), np.sin(X)-np.cos(X)
f_ser=pd.Series(f)
g_ser=pd.Series(g)
In [40]:
plotDF=pd.concat([f_ser,g_ser],axis=1)
plotDF.index=X
plotDF.columns=['sin(x)+cos(x)','sin(x)-cos(x)']
plotDF.head()
Out[40]:
In [47]:
plotDF.columns=['f(x)','g(x)']
plotDF.plot(title='Plot of f(x)=sin(x)+cos(x), \n g(x)=sinx(x)-cos(x)')
plt.show()
In [48]:
plotDF.plot(subplots=True, figsize=(6,6))
plt.show()
There is a lot more to using the plotting functionality of matplotlib within pandas. For more information, take a look at the documentation
To summarize, we have discussed how to handle missing data values and manipulate dates in pandas. We also took a brief detour to investigate the plotting functionality in pandas using matplotlib .
In [ ]: