Getting to know your data with Pandas

What is Pandas

Pandas is Python Data Analysis Library. Pandas provides an R-like DataFrame, produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.

Getting started



In [ ]:

    
import pandas as pd
import pandas.io.data as web
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns


from datetime import datetime

#pd.__version__

%matplotlib inline

Fetching storing and retrieving your data

More information on what types of data you can fetch

http://pandas.pydata.org/pandas-docs/stable/remote_data.html



In [ ]:

    
#yahoo data
#stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
stocks = 'YELP'
#ls_key = 'Adj Close'
#yahoo_stocks = pd.io.data.DataReader(stocks,'yahoo', 
#                                 start=datetime.datetime(2014, 1, 1), 
#                                 end=datetime.datetime(2014, 12, 31))

yahoo_stocks = pd.io.data.get_data_yahoo(stocks)
#                                 start=datetime.datetime(2011, 10, 1), 
#                                 end=datetime.datetime(2014, 12, 31))

#yahoo_stocks.head()
yahoo_stocks.info()

Reading data from a .csv file

Some basic commands for reading and getting to know the characteristics of your data



In [ ]:

    
yahoo_stocks.to_csv('yahoo_data.csv')
#print(open('yahoo_data.csv').read())
df = pd.read_csv('yahoo_data.csv')#, index_col='Date', infer_datetime_format=True, parse_dates=True)
#print df.head()
#print df.info()
#print df.index
print df



In [ ]:

    
#The number of tuples in your data
print len(df)

Playing with data columns



In [ ]:

    
#The columns/features of your data
print df.columns



In [ ]:

    
#Selecting a single column from your data
print df['Open']



In [ ]:

    
#Another way of selecting a single column from your data
print df.Open



In [ ]:

    
print df.columns
print (df[['Open','Close']]).info()
#print df[['Open','Close']]



In [ ]:

    
print df.Date.head(10)
print df.Date.tail(10)



In [ ]:

    
#alternative
print df['Date'].head(10)
print df['Date'].tail(10)

Changing the column names



In [ ]:

    
new_column_names = [ x.lower().replace(' ','_') for x in df.columns]
new_column_names
df.columns = new_column_names
print df.info()

Now all columns can be accessed using the dot notation



In [ ]:

    
print df.adj_close



In [ ]:

    
print df.adj_close.head()

We can now apply methods on different columns



In [ ]:

    
print df.mean()
#df.std()
#df.median()
#df.open.mean()
#df.high.mean()

Plotting some columns



In [ ]:

    
df.adj_close.plot(label='adj')
df.low.plot(label='close')
#df.low.plot()
plt.legend(loc=2)



In [ ]:

    
df.adj_close.hist()
#df.volume.hist()

Bulk Operations

Methods like sum( ) and std( ) work on entire columns. We can run our own functions across all values in a column (or row) using apply( ).



In [ ]:

    
print df.date.head()

The values property of the column returns a list of values for the column. Inspecting the first value reveals that these are strings with a particular format.



In [ ]:

    
print df.date.values
first_date = df.date.values[0]
print first_date



In [ ]:

    
datetime.strptime(first_date, "%Y-%m-%d")



In [ ]:

    
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
print df.date.head()

Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.

The row indices so far have been auto-generated by pandas, and are simply integers starting from 0. From now on we will use dates instead of integers for indices-- the benefits of this will show later. Overwriting the index is as easy as assigning to the index property of the DataFrame.



In [ ]:

    
df.index = df.date
print df.info()

Accessing rows of the DataFrame using index values



In [ ]:

    
df.ix[datetime(2015,1,23,0,0)]

Accessing individual rows using iloc[ ]



In [ ]:

    
df.iloc[0,:]

Iterating over the rows using iterrows( )



In [ ]:

    
num_positive_days = 0
for idx, row in df.iterrows():
    if row.close > row.open:
        num_positive_days += 1
        
print "The total number of positive-gain days : " + str(num_positive_days)

Droping a column

Now that we have made an index based on date, we can drop the actual ``date" column



In [ ]:

    
df = df.drop(['date'],axis=1)
print df.info()
print df.columns

Filtering

Selecting interesting rows from the data. All operations below return a DataFrame, which can be treated the same way as all DataFrames we have seen so far.



In [ ]:

    
positive_days = df[df.close > df.open]
print positive_days.head()



In [ ]:

    
very_positive_days = df[df.close-df.open > 4]
print len(very_positive_days)
#print very_positive_days.head()



In [ ]:

    
tmp_high = df.high > 60
print type(tmp_high)



In [ ]:

    
print tmp_high.head()
print len(tmp_high)

Select only the rows of df that correspond to tmp_high



In [ ]:

    
print df[tmp_high]

Creating new columns



In [ ]:

    
df['profit'] = (df.open < df.close)
print df.head()



In [ ]:

    
for idx, row in df.iterrows():
    if row.close > row.open:
        df.ix[idx,'gain']='negative'
    elif (row.open - row.close) < 1:
        df.ix[idx,'gain']='small_gain'
    elif (row.open - row.close) < 6:
        df.ix[idx,'gain']='medium_gain'
    else:
        df.ix[idx,'gain']='large_gain'
print df.head()

Alternatively



In [ ]:

    
def namerow(row):
    if row.close > row.open:
        return 'negative'
    elif (row.open - row.close) < 1:
        return 'small_gain'
    elif (row.open - row.close) < 6:
        return 'medium_gain'
    else:
        return 'large_gain'

df['test_column'] = df.apply(namerow, axis = 1)



In [ ]:

    
print df.head()



In [ ]:

    
df.drop('test_column', axis = 1)

Grouping

Besides apply(), another great DataFrame function is groupby(). It will group a DataFrame by one or more columns, and let you iterate through each group.

Here we will show the average gain among the three categories of negative, small, medium and large gains we defined above and stored in column ``gain"



In [ ]:

    
gains = {}
for gain, gain_data in df.groupby("gain"):
    gains[gain] = gain_data.close.mean()
gains

Comparing multiple stocks

As a last task, we will use the experience we obtained so far -- and learn some new things -- in order to compare the performance of different stocks we obtained from Yahoo finance.



In [ ]:

    
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
attr = 'Close'
df = pd.io.data.get_data_yahoo(stocks, 
                               start=datetime(2014, 1, 1), 
                               end=datetime(2014, 12, 31))[attr]
print df.head()



In [ ]:

    
df.ORCL.plot(label = 'oracle')
df.TSLA.plot(label = 'tesla')
df.IBM.plot(label = 'ibm')
df.MSFT.plot(label = 'msft')
df.YELP.plot(label = 'yelp')
plt.legend(loc=5)

Calculating returns over a period of length T

$$r(t) = \frac{f(t)-f(t-T)}{f(t)} $$

The returns can be computed in python with a simple function pct_returns( )



In [ ]:

    
rets = df.pct_change(30)
#print rets.head(20)

Plotting again the timeseries of the returns of the different stocks



In [ ]:

    
rets.ORCL.plot(label = 'oracle')
rets.TSLA.plot(label = 'tesla')
rets.IBM.plot(label = 'ibm')
rets.MSFT.plot(label = 'msft')
rets.YELP.plot(label = 'yelp')
plt.legend()



In [ ]:

    
plt.scatter(rets.TSLA, rets.ORCL)
plt.xlabel('Returns TESLA')
plt.ylabel('Returns ORCL')

Correlations of columns

The correlation coefficient between variables $X$ and $Y$ is defined as follows: $$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$

In python we can compute the correlation coefficient of all pairs of columns with corr( )



In [ ]:

    
corr = rets.corr()
print corr

Visualizing the correlation coefficient of all columns

We will learn more about visualization later, but for now this is a simple example



In [ ]:

    
sns.heatmap(corr, annot=True)

Returns vs risk (standard deviation)

In many applications, we want to know both the returns as well as the standard deviation of the returns of a stock (i.e., its risk). Below we visualize the result of such an analysis



In [ ]:

    
plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (20, -20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))



In [1]:

    
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()









    Out[1]: