Getting to know your data with Pandas

What is Pandas

Pandas is Python Data Analysis Library. Pandas provides an R-like DataFrame, produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.

Getting started


In [ ]:
import pandas as pd
import pandas.io.data as web
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns


from datetime import datetime

#pd.__version__

%matplotlib inline

Fetching storing and retrieving your data

More information on what types of data you can fetch

http://pandas.pydata.org/pandas-docs/stable/remote_data.html


In [ ]:
#yahoo data
#stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
stocks = 'YELP'
#ls_key = 'Adj Close'
#yahoo_stocks = pd.io.data.DataReader(stocks,'yahoo', 
#                                 start=datetime.datetime(2014, 1, 1), 
#                                 end=datetime.datetime(2014, 12, 31))

yahoo_stocks = pd.io.data.get_data_yahoo(stocks)
#                                 start=datetime.datetime(2011, 10, 1), 
#                                 end=datetime.datetime(2014, 12, 31))

#yahoo_stocks.head()
yahoo_stocks.info()

Reading data from a .csv file

Some basic commands for reading and getting to know the characteristics of your data


In [ ]:
yahoo_stocks.to_csv('yahoo_data.csv')
#print(open('yahoo_data.csv').read())
df = pd.read_csv('yahoo_data.csv')#, index_col='Date', infer_datetime_format=True, parse_dates=True)
#print df.head()
#print df.info()
#print df.index
print df

In [ ]:
#The number of tuples in your data
print len(df)

Playing with data columns


In [ ]:
#The columns/features of your data
print df.columns

In [ ]:
#Selecting a single column from your data
print df['Open']

In [ ]:
#Another way of selecting a single column from your data
print df.Open

In [ ]:
print df.columns
print (df[['Open','Close']]).info()
#print df[['Open','Close']]

In [ ]:
print df.Date.head(10)
print df.Date.tail(10)

In [ ]:
#alternative
print df['Date'].head(10)
print df['Date'].tail(10)

Changing the column names


In [ ]:
new_column_names = [ x.lower().replace(' ','_') for x in df.columns]
new_column_names
df.columns = new_column_names
print df.info()

Now all columns can be accessed using the dot notation


In [ ]:
print df.adj_close

In [ ]:
print df.adj_close.head()

We can now apply methods on different columns


In [ ]:
print df.mean()
#df.std()
#df.median()
#df.open.mean()
#df.high.mean()

Plotting some columns


In [ ]:
df.adj_close.plot(label='adj')
df.low.plot(label='close')
#df.low.plot()
plt.legend(loc=2)

In [ ]:
df.adj_close.hist()
#df.volume.hist()

Bulk Operations

Methods like sum( ) and std( ) work on entire columns. We can run our own functions across all values in a column (or row) using apply( ).


In [ ]:
print df.date.head()

The values property of the column returns a list of values for the column. Inspecting the first value reveals that these are strings with a particular format.


In [ ]:
print df.date.values
first_date = df.date.values[0]
print first_date

In [ ]:
datetime.strptime(first_date, "%Y-%m-%d")

In [ ]:
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
print df.date.head()

Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.

The row indices so far have been auto-generated by pandas, and are simply integers starting from 0. From now on we will use dates instead of integers for indices-- the benefits of this will show later. Overwriting the index is as easy as assigning to the index property of the DataFrame.


In [ ]:
df.index = df.date
print df.info()

Accessing rows of the DataFrame using index values


In [ ]:
df.ix[datetime(2015,1,23,0,0)]

Accessing individual rows using iloc[ ]


In [ ]:
df.iloc[0,:]

Iterating over the rows using iterrows( )


In [ ]:
num_positive_days = 0
for idx, row in df.iterrows():
    if row.close > row.open:
        num_positive_days += 1
        
print "The total number of positive-gain days : " + str(num_positive_days)

Droping a column

Now that we have made an index based on date, we can drop the actual ``date" column


In [ ]:
df = df.drop(['date'],axis=1)
print df.info()
print df.columns

Filtering

Selecting interesting rows from the data. All operations below return a DataFrame, which can be treated the same way as all DataFrames we have seen so far.


In [ ]:
positive_days = df[df.close > df.open]
print positive_days.head()

In [ ]:
very_positive_days = df[df.close-df.open > 4]
print len(very_positive_days)
#print very_positive_days.head()

In [ ]:
tmp_high = df.high > 60
print type(tmp_high)

In [ ]:
print tmp_high.head()
print len(tmp_high)

Select only the rows of df that correspond to tmp_high


In [ ]:
print df[tmp_high]

Creating new columns


In [ ]:
df['profit'] = (df.open < df.close)
print df.head()

In [ ]:
for idx, row in df.iterrows():
    if row.close > row.open:
        df.ix[idx,'gain']='negative'
    elif (row.open - row.close) < 1:
        df.ix[idx,'gain']='small_gain'
    elif (row.open - row.close) < 6:
        df.ix[idx,'gain']='medium_gain'
    else:
        df.ix[idx,'gain']='large_gain'
print df.head()

Alternatively


In [ ]:
def namerow(row):
    if row.close > row.open:
        return 'negative'
    elif (row.open - row.close) < 1:
        return 'small_gain'
    elif (row.open - row.close) < 6:
        return 'medium_gain'
    else:
        return 'large_gain'

df['test_column'] = df.apply(namerow, axis = 1)

In [ ]:
print df.head()

In [ ]:
df.drop('test_column', axis = 1)

Grouping

Besides apply(), another great DataFrame function is groupby(). It will group a DataFrame by one or more columns, and let you iterate through each group.

Here we will show the average gain among the three categories of negative, small, medium and large gains we defined above and stored in column ``gain"


In [ ]:
gains = {}
for gain, gain_data in df.groupby("gain"):
    gains[gain] = gain_data.close.mean()
gains

Comparing multiple stocks

As a last task, we will use the experience we obtained so far -- and learn some new things -- in order to compare the performance of different stocks we obtained from Yahoo finance.


In [ ]:
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
attr = 'Close'
df = pd.io.data.get_data_yahoo(stocks, 
                               start=datetime(2014, 1, 1), 
                               end=datetime(2014, 12, 31))[attr]
print df.head()

In [ ]:
df.ORCL.plot(label = 'oracle')
df.TSLA.plot(label = 'tesla')
df.IBM.plot(label = 'ibm')
df.MSFT.plot(label = 'msft')
df.YELP.plot(label = 'yelp')
plt.legend(loc=5)

Calculating returns over a period of length T

$$r(t) = \frac{f(t)-f(t-T)}{f(t)} $$

The returns can be computed in python with a simple function pct_returns( )


In [ ]:
rets = df.pct_change(30)
#print rets.head(20)

Plotting again the timeseries of the returns of the different stocks


In [ ]:
rets.ORCL.plot(label = 'oracle')
rets.TSLA.plot(label = 'tesla')
rets.IBM.plot(label = 'ibm')
rets.MSFT.plot(label = 'msft')
rets.YELP.plot(label = 'yelp')
plt.legend()

In [ ]:
plt.scatter(rets.TSLA, rets.ORCL)
plt.xlabel('Returns TESLA')
plt.ylabel('Returns ORCL')

Correlations of columns

The correlation coefficient between variables $X$ and $Y$ is defined as follows: $$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$

In python we can compute the correlation coefficient of all pairs of columns with corr( )


In [ ]:
corr = rets.corr()
print corr

Visualizing the correlation coefficient of all columns

We will learn more about visualization later, but for now this is a simple example


In [ ]:
sns.heatmap(corr, annot=True)

Returns vs risk (standard deviation)

In many applications, we want to know both the returns as well as the standard deviation of the returns of a stock (i.e., its risk). Below we visualize the result of such an analysis


In [ ]:
plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (20, -20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

In [1]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[1]: