Pandas is the Python Data Analysis Library. It provides an R-like DataFrame, produces high-quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.
In [ ]:
import pandas as pd
# pandas.io.data was split out of pandas into the separate pandas_datareader package
from pandas_datareader import data as web
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from datetime import datetime
#pd.__version__
%matplotlib inline
More information on the types of data you can fetch:
http://pandas.pydata.org/pandas-docs/stable/remote_data.html
In [ ]:
#yahoo data
#stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
stocks = 'YELP'
#ls_key = 'Adj Close'
#yahoo_stocks = web.DataReader(stocks, 'yahoo',
#                              start=datetime(2014, 1, 1),
#                              end=datetime(2014, 12, 31))
yahoo_stocks = web.get_data_yahoo(stocks)
#                              start=datetime(2011, 10, 1),
#                              end=datetime(2014, 12, 31))
#yahoo_stocks.head()
yahoo_stocks.info()
More on pandas data frames:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
Some basic commands for reading and getting to know the characteristics of your data
In [ ]:
yahoo_stocks.to_csv('yahoo_data.csv')
#print(open('yahoo_data.csv').read())
df = pd.read_csv('yahoo_data.csv')#, index_col='Date', infer_datetime_format=True, parse_dates=True)
#print(df.head())
#print(df.info())
#print(df.index)
print(df)
In [ ]:
#The number of tuples in your data
print(len(df))
In [ ]:
#The columns/features of your data
print(df.columns)
In [ ]:
#Selecting a single column from your data
print(df['Open'])
In [ ]:
#Another way of selecting a single column from your data
print(df.Open)
In [ ]:
print(df.columns)
# info() prints its report itself and returns None, so call it directly
df[['Open','Close']].info()
#print(df[['Open','Close']])
In [ ]:
print(df.Date.head(10))
print(df.Date.tail(10))
In [ ]:
#alternative
print(df['Date'].head(10))
print(df['Date'].tail(10))
In [ ]:
new_column_names = [ x.lower().replace(' ','_') for x in df.columns]
new_column_names
df.columns = new_column_names
df.info()
In [ ]:
print(df.adj_close)
In [ ]:
print(df.adj_close.head())
In [ ]:
print(df.mean())
#df.std()
#df.median()
#df.open.mean()
#df.high.mean()
In [ ]:
df.adj_close.plot(label='adj')
df.low.plot(label='low')
#df.low.plot()
plt.legend(loc=2)
In [ ]:
df.adj_close.hist()
#df.volume.hist()
Methods like sum() and std() work on entire columns. We can run our own functions across all values in a column (or row) using apply().
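As a self-contained illustration (a toy DataFrame with made-up prices, not the Yahoo data above), apply() can run a function over a single column, or over whole rows with axis=1:

```python
import pandas as pd

toy = pd.DataFrame({'open': [10.0, 20.0], 'close': [12.0, 18.0]})

# apply() on a Series runs the function on every value in the column
doubled = toy.close.apply(lambda x: x * 2)

# apply() on the DataFrame with axis=1 runs the function on every row
spread = toy.apply(lambda row: row.close - row.open, axis=1)

print(doubled.tolist())  # [24.0, 36.0]
print(spread.tolist())   # [2.0, -2.0]
```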
In [ ]:
print(df.date.head())
The values property of the column returns a NumPy array with the column's values. Inspecting the first value reveals that these are strings with a particular format.
In [ ]:
print(df.date.values)
first_date = df.date.values[0]
print(first_date)
In [ ]:
datetime.strptime(first_date, "%Y-%m-%d")
In [ ]:
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
print(df.date.head())
Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.
The row indices so far have been auto-generated by pandas, and are simply integers starting from 0. From now on we will use dates instead of integers as indices -- the benefits of this will show later. Overwriting the index is as easy as assigning to the index property of the DataFrame.
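To preview one of those benefits, here is a minimal sketch (with made-up prices) of what a datetime index enables: label-based selection by date strings, including partial strings for whole months:

```python
import pandas as pd

toy = pd.DataFrame({'close': [1.0, 2.0, 3.0, 4.0]},
                   index=pd.to_datetime(['2015-01-01', '2015-01-02',
                                         '2015-02-01', '2015-02-02']))

# A partial date string selects every row in that period
jan = toy.loc['2015-01']
print(len(jan))  # 2

# A full date string selects a single row
print(toy.loc['2015-02-01', 'close'])  # 3.0
```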
In [ ]:
df.index = df.date
df.info()
In [ ]:
df.loc[datetime(2015, 1, 23)]  # .ix is deprecated; use .loc for label-based lookups
In [ ]:
df.iloc[0,:]
In [ ]:
num_positive_days = 0
for idx, row in df.iterrows():
    if row.close > row.open:
        num_positive_days += 1
print("The total number of positive-gain days: " + str(num_positive_days))
Now that we have made an index based on date, we can drop the actual "date" column.
In [ ]:
df = df.drop(['date'], axis=1)
df.info()
print(df.columns)
Selecting interesting rows from the data. All operations below return a DataFrame, which can be treated the same way as all DataFrames we have seen so far.
In [ ]:
positive_days = df[df.close > df.open]
print(positive_days.head())
In [ ]:
very_positive_days = df[df.close-df.open > 4]
print(len(very_positive_days))
#print very_positive_days.head()
In [ ]:
tmp_high = df.high > 60
print(type(tmp_high))
In [ ]:
print(tmp_high.head())
print(len(tmp_high))
Select only the rows of df where tmp_high is True.
In [ ]:
print(df[tmp_high])
In [ ]:
df['profit'] = (df.open < df.close)
print(df.head())
In [ ]:
for idx, row in df.iterrows():
    if row.close < row.open:
        df.loc[idx, 'gain'] = 'negative'
    elif (row.close - row.open) < 1:
        df.loc[idx, 'gain'] = 'small_gain'
    elif (row.close - row.open) < 6:
        df.loc[idx, 'gain'] = 'medium_gain'
    else:
        df.loc[idx, 'gain'] = 'large_gain'
print(df.head())
In [ ]:
def namerow(row):
    if row.close < row.open:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 6:
        return 'medium_gain'
    else:
        return 'large_gain'
df['test_column'] = df.apply(namerow, axis=1)
In [ ]:
print(df.head())
In [ ]:
df = df.drop('test_column', axis=1)  # drop() returns a new DataFrame, so reassign it
Besides apply(), another great DataFrame function is groupby(). It will group a DataFrame by one or more columns, and let you iterate through each group.
Here we will compute the average closing price within each of the four categories (negative, small, medium, and large gain) that we defined above and stored in the column "gain".
In [ ]:
gains = {}
for gain, gain_data in df.groupby("gain"):
    gains[gain] = gain_data.close.mean()
gains
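The same result can be obtained without an explicit loop: groupby() followed by mean() does the grouping and averaging in one step. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'gain': ['small_gain', 'large_gain', 'small_gain'],
                    'close': [10.0, 30.0, 20.0]})

# One-liner equivalent of the loop: average close within each gain group
means = toy.groupby('gain').close.mean()
print(means['small_gain'])  # 15.0
print(means['large_gain'])  # 30.0
```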
As a last task, we will use the experience we obtained so far -- and learn some new things -- in order to compare the performance of different stocks we obtained from Yahoo finance.
In [ ]:
stocks = ['ORCL', 'TSLA', 'IBM', 'YELP', 'MSFT']
attr = 'Close'
df = web.get_data_yahoo(stocks,
                        start=datetime(2014, 1, 1),
                        end=datetime(2014, 12, 31))[attr]
print(df.head())
In [ ]:
df.ORCL.plot(label = 'oracle')
df.TSLA.plot(label = 'tesla')
df.IBM.plot(label = 'ibm')
df.MSFT.plot(label = 'msft')
df.YELP.plot(label = 'yelp')
plt.legend(loc=5)
The returns can be computed in pandas with the pct_change() method; the argument is the number of periods over which to compute the change (here, 30 trading days).
In [ ]:
rets = df.pct_change(30)
#print(rets.head(20))
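To see exactly what pct_change(k) computes -- (x[t] - x[t-k]) / x[t-k] -- here is a toy series using the default lag of one period:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 121.0])

# Each value is the fractional change from the previous one;
# the first entry is NaN because it has no predecessor
rets = prices.pct_change()
print(round(rets[1], 6), round(rets[2], 6))  # 0.1 0.1
```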
Plotting the time series of the returns of the different stocks
In [ ]:
rets.ORCL.plot(label = 'oracle')
rets.TSLA.plot(label = 'tesla')
rets.IBM.plot(label = 'ibm')
rets.MSFT.plot(label = 'msft')
rets.YELP.plot(label = 'yelp')
plt.legend()
In [ ]:
plt.scatter(rets.TSLA, rets.ORCL)
plt.xlabel('Returns TESLA')
plt.ylabel('Returns ORCL')
The correlation coefficient between variables $X$ and $Y$ is defined as follows: $$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$
In pandas we can compute the correlation coefficient of all pairs of columns with corr()
In [ ]:
corr = rets.corr()
print(corr)
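As a sanity check, the definition above can be evaluated by hand and compared against corr(); a small sketch on toy series (the choice of population vs. sample standard deviation cancels out in the ratio):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

# Corr(X, Y) = E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)
manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))

print(abs(manual - x.corr(y)) < 1e-9)  # True
```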
We will learn more about visualization later, but for now this is a simple example
In [ ]:
sns.heatmap(corr, annot=True)
In many applications, we want to know both the returns as well as the standard deviation of the returns of a stock (i.e., its risk). Below we visualize the result of such an analysis
In [ ]:
plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label,
        xy=(x, y), xytext=(20, -20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
In [1]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[1]: