As pandas is built on Python, any means available in Python can be used to retrieve data from outside sources. This makes the range of data that can be accessed virtually unlimited, including text files, Excel spreadsheets, web sites and services, databases, and cloud-based services.
In [1]:
# import pandas and numpy
import numpy as np
import pandas as pd
# set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)
In [2]:
# view the first five lines of data/msft.csv
! head -n 5 ../../data/msft.csv # OSX/Linux
# !type ..\..\data\msft.csv # on windows
In [3]:
# read in msft.csv into a DataFrame
msft = pd.read_csv("../../data/msft.csv")
msft.head()
Out[3]:
In [4]:
# specifying the index column
msft = pd.read_csv("../../data/msft.csv", index_col=0)
msft.head()
Out[4]:
The Date field is now the index. However, because of this, it is no longer a column of data. If you want to use the date as a column, you will need to create a new column and assign the index labels to that column.
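As a minimal sketch of that approach (using the msft DataFrame just read), the index labels can be copied back into a regular column:

# copy the index labels into a new Date column
msft['Date'] = msft.index
msft.head()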
In [5]:
# examine the types of the columns in the DataFrame
msft.dtypes
Out[5]:
In [6]:
# to force the type of a column, use the dtype parameter
# the following forces the Volume column to be float64
msft = pd.read_csv("../../data/msft.csv", dtype={'Volume': np.float64})
msft.dtypes
Out[6]:
In [7]:
# specify a new set of names for the columns
# all lower case, remove space in Adj Close
# header=0 specifies the header row, which is then replaced by names
df = pd.read_csv("../../data/msft.csv", header=0, names=['open','high','low','close','volume','adjclose'])
df.head()
Out[7]:
In [9]:
# read in data only in the Date and Close columns,
# use Date as the index
df2 = pd.read_csv("../../data/msft.csv",usecols=['Date','Close'],index_col=['Date'])
df2.head()
Out[9]:
In [10]:
# save df2 to a new csv file
# also specify naming the index as date
df2.to_csv("../../data/msft_modified.csv",index_label='date')
It was necessary to tell the method that the index label should be saved with a column name of date using index_label='date'. Otherwise, the index would not have a name added to the first row of the file, which makes it difficult to read the file back properly.
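As a quick check (a sketch using the file just written), the named index column allows the file to be read back cleanly with index_col:

# read the modified file back, using the named date column as the index
df2_check = pd.read_csv("../../data/msft_modified.csv", index_col='date')
df2_check.head()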
In [11]:
# view the start of the file just saved
!head ../../data/msft_modified.csv
In [12]:
# use read_table with sep=',' to read a csv
df=pd.read_table("../../data/msft.csv",sep=',')
df.head()
Out[12]:
In [13]:
# save as pipe delimited
df.to_csv("../../data/msft_piped.txt",sep='|')
# check if it worked
!head -n 5 ../../data/msft_piped.txt
Sometimes, data in a field-delimited file may contain extraneous headers and footers. Examples include company information at the top, such as an invoice number, addresses, and summary footers. Sometimes data is stored on every other line. These situations will cause errors when pandas tries to load the file. To handle these scenarios, several useful parameters are available.
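For the every-other-line case, skiprows also accepts a callable that is evaluated against each row index and returns True for rows to skip. A minimal sketch (the file name below is hypothetical):

# skip every odd-numbered line, keeping the header row (row 0)
# note: the file name here is hypothetical
df = pd.read_csv("../../data/every_other_line.csv",
                 skiprows=lambda i: i % 2 == 1)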
In [14]:
# messy file
!head ../../data/msft2.csv # Linux
In [15]:
# read, but skip rows 0,2 and 3
df = pd.read_csv("../../data/msft2.csv",skiprows=[0,2,3])
df
Out[15]:
Another common situation is a file with content at the end that should be ignored to prevent an error, such as the following:
In [16]:
# another messy file with mess at the end
!cat ../../data/msft_with_footer.csv # osx / Linux
In [20]:
# skip only two lines at the end
# the engine parameter forces the Python parser, as skipfooter is not supported by the default C parser
df = pd.read_csv("../../data/msft_with_footer.csv",skipfooter=2,engine='python')
df
Out[20]:
In [21]:
# only process the first three rows
pd.read_csv("../../data/msft.csv",nrows=3)
Out[21]:
In [22]:
# skip 100 lines, then only process the next five
pd.read_csv("../../data/msft.csv", skiprows=100, nrows=5, header=0,names=['open','high','low','close','vol','adjclose'])
Out[22]:
In [23]:
# read excel file
# only reads first sheet
df = pd.read_excel("../../data/stocks.xlsx")
df.head()
Out[23]:
In [26]:
# read from the aapl worksheet
aapl = pd.read_excel("../../data/stocks.xlsx", sheet_name='aapl')
aapl.head()
Out[26]:
In [27]:
# save to an excel file, in the default worksheet Sheet1
df.to_excel("../../data/stocks2.xlsx")
In [28]:
# write making the worksheet name MSFT
df.to_excel("../../data/stocks_msft.xlsx", sheet_name='MSFT')
To write more than one DataFrame to a single Excel file, with each DataFrame object on a separate worksheet, use the ExcelWriter object along with the with keyword.
In [29]:
from pandas import ExcelWriter
with ExcelWriter("../../data/all_stocks.xlsx") as writer:
    aapl.to_excel(writer, sheet_name='AAPL')
    df.to_excel(writer, sheet_name='MSFT')
In [30]:
# write to xlsx
df.to_excel("../../data/msft2.xlsx")
In [31]:
# write the excel data to a JSON file
df.head().to_json("../../data/stocks.json")
!cat ../../data/stocks.json
In [32]:
# read data in from JSON
df_from_json = pd.read_json("../../data/stocks.json")
df_from_json.head(5)
Out[32]:
Notice two slight differences here, caused by the reading/writing of data from JSON. First, the columns have been reordered alphabetically. Second, the index for the DataFrame, although containing content, is sorted as a string.
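If the original index labels were integers (as with the default RangeIndex here), a sketch of restoring them and the original order:

# convert the string index labels back to integers and re-sort
df_from_json.index = df_from_json.index.astype(int)
df_from_json = df_from_json.sort_index()
df_from_json.head(5)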
In [33]:
# url to read
url = "http://www.fdic.gov/bank/individual/failed/banklist.html"
# read it
banks = pd.read_html(url)
# examine a subset of the first table read
banks[0].iloc[0:5, 0:4]
Out[33]:
In [35]:
# write to html
# read the stock data
df=pd.read_excel("../../data/stocks.xlsx")
# write first 2 rows to HTML
df.head(2).to_html("../../data/stocks.html")
# check
!head -n 28 ../../data/stocks.html
HDF5 is a data model, library, and file format for storing and managing data. It is commonly used in scientific computing environments. It supports an unlimited variety of data types and is designed for flexible and efficient I/O and for high-volume, complex data.
HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5. The HDF5 technology suite includes tools and applications to manage, manipulate, view, and analyse data in the HDF5 format.
HDFStore is a hierarchical, dictionary-like object that reads and writes pandas objects to the HDF5 format.
In [36]:
# seed for replication
np.random.seed(123456)
# create a DataFrame of dates and random numbers in three columns
df = pd.DataFrame(np.random.randn(8,3),index=pd.date_range('1/1/2000', periods=8), columns=['A','B','C'])
# create HDF5 store
store = pd.HDFStore('../../data/store.h5')
store['df'] = df # persisting happens here
store
Out[36]:
In [37]:
# read in data from HDF5
store = pd.HDFStore("../../data/store.h5")
df = store['df']
df
Out[37]:
In [38]:
# this changes the DataFrame in memory, but does not persist it
df.loc[df.index[0], 'A'] = 1
# to persist the change, assign the dataframe to the
# HDF5 store object
store['df'] = df
# it is now persisted
# the following loads the store and
# shows the first two rows, demonstrating
# the persisting was done
pd.HDFStore("../../data/store.h5")['df'].head(2)
Out[38]:
In [44]:
# read a CSV file directly from a URL
df = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv")
df[:5]
Out[44]:
In [42]:
# reference SQLite
import sqlite3
# read in the stock data from csv
msft = pd.read_csv("../../data/msft.csv")
msft['Symbol'] = "MSFT"
aapl = pd.read_csv("../../data/aapl.csv")
aapl['Symbol'] = 'AAPL'
# create connection
connection = sqlite3.connect("../../data/stocks.sqlite")
# .to_sql() will create sql to store the DataFrame
# in the specified table. if_exists specifies
# what to do if the table already exists
msft.to_sql("STOCK DATA", connection, if_exists="replace")
aapl.to_sql("STOCK DATA", connection, if_exists="append")
# commit the sql and close the connection
connection.commit()
connection.close()
In [43]:
# read data
# connect to the database file
connection = sqlite3.connect("../../data/stocks.sqlite")
# query all records in STOCK_DATA
# returns a DataFrame
# index_col specifies which column to make the DataFrame index
stocks = pd.read_sql("SELECT * FROM STOCK_DATA;", connection, index_col="index")
# close the connection
connection.close()
# report the head of the data received
stocks.head()
Out[43]:
In [46]:
# open the connection
connection = sqlite3.connect("../../data/stocks.sqlite")
# construct the query string
query = "SELECT * FROM STOCK_DATA WHERE Volume > 29200100 AND Symbol='MSFT';"
# execute and close connection
items = pd.read_sql(query, connection, index_col='index')
connection.close()
items
Out[46]:
As these functions take a connection object, which can be any Python DB-API-compatible data adapter, you can work with more or less any supported database simply by creating an appropriate connection object. The code at the pandas level should remain the same for any supported database.
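As a sketch of that idea (assuming SQLAlchemy is installed), the same query can go through a SQLAlchemy engine instead of a raw sqlite3 connection; only the connection object changes, not the pandas call:

from sqlalchemy import create_engine
# create an engine for the same SQLite file; swapping this URL
# for another database is the only change needed
engine = create_engine("sqlite:///../../data/stocks.sqlite")
stocks = pd.read_sql("SELECT * FROM STOCK_DATA;", engine, index_col="index")
stocks.head()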
pandas has direct support for various web-based data sources via the pandas-datareader package (originally the pandas.io.data namespace). The primary class of interest is pandas_datareader.data.DataReader, which reads data from the supported sources and returns it to the application directly as a DataFrame.
Currently, support exists via the DataReader class for sources including Yahoo! Finance, Google Finance, the Federal Reserve's FRED data, and Kenneth French's data library; World Bank data is available through the companion wb module.
In [54]:
import pandas_datareader.data as web
import datetime
# start and end dates
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2014,1,27)
# read the MSFT stock data from Yahoo!
yahoo = web.DataReader('MSFT','yahoo',start,end)
yahoo.head()
Out[54]:
In [55]:
# read from google
google = web.DataReader('MSFT','google',start,end)
google.head()
Out[55]:
In [57]:
# specify we want all yahoo options data for AAPL
# this can take a little time...
from pandas_datareader.data import Options
aapl = Options('AAPL','yahoo')
# read all the data
data = aapl.get_all_data()
# examine the first six rows and four columns
data.iloc[0:6,0:4]
Out[57]:
In [59]:
# get all puts at strike price of $80 (first four columns only)
data.loc[(80, slice(None),'put'),:].iloc[0:5,0:4]
Out[59]:
In [61]:
data.loc[(80,slice('20150117','20150417'),'put'),:].iloc[:,0:4]
Out[61]:
In [64]:
# msft calls expiring on 2015-01-05
expiry = datetime.date(2015, 1, 5)
msft_calls = Options('MSFT','yahoo').get_call_data(expiry=expiry)
msft_calls.iloc[0:5,0:5]
Out[64]:
In [65]:
# aapl calls expiring on 2015-01-17
expiry = datetime.date(2015,1,17)
aapl_calls = aapl.get_call_data(expiry=expiry)
aapl_calls.iloc[0:5,0:4]
Out[65]:
In [66]:
gdp = web.DataReader("GDP","fred",datetime.date(2012,1,1),datetime.date(2014,1,27))
gdp
Out[66]:
In [67]:
# get compensation of employees: Wages and Salaries
web.DataReader("A576RC1A027NBEA","fred",datetime.date(1929,1,1),datetime.date(2013,1,1))
Out[67]:
In [68]:
# read the Kenneth French Fama/French Global Factors data set
factors = web.DataReader("Global_Factors","famafrench")
factors
Out[68]:
In [76]:
from pandas_datareader import wb
all_indicators = wb.get_indicators()
# examine some of the indicators
all_indicators.iloc[:,0:1]
Out[76]:
In [77]:
# search for life expectancy indicators
le_indicators = wb.search("life expectancy")
le_indicators.iloc[:3,:2]
Out[77]:
In [78]:
# get the list of countries, with names and ISO codes
countries = wb.get_countries()
# show a subset of the country data
countries.iloc[0:10][['name','capitalcity','iso2c']]
Out[78]:
In [79]:
# get life expectancy at birth for all countries from 1980 to 2014
le_data_all = wb.download(indicator="SP.DYN.LE00.IN", start='1980',end='2014')
le_data_all
Out[79]:
In [80]:
# only US, CAN and MEX are returned by default
le_data_all.index.levels[0]
Out[80]:
In [81]:
# retrieve life expectancy at birth for all countries
# from 1980 to 2012
le_data_all = wb.download(indicator="SP.DYN.LE00.IN",country=countries['iso2c'],start='1980',end='2012')
le_data_all
Out[81]:
We can do some interesting things with this data. The example we will look at determines which country has the lowest life expectancy for each year. To do this, we first need to pivot the data so that the index is the country name and the years are the columns.
In [82]:
# reset the multi-index so that country and year become columns, then pivot
le_data = le_data_all.reset_index().pivot(index='country',columns='year')
# examine pivoted data
le_data.iloc[:,0:3]
Out[82]:
In [83]:
# ask what is the name of the country for each year
# with the least life expectancy
country_with_least_expectancy = le_data.idxmin(axis=0)
country_with_least_expectancy
Out[83]:
In [84]:
# and what is the minimum life expectancy for each year
expectancy_for_least_country = le_data.min(axis=0)
expectancy_for_least_country
Out[84]:
In [85]:
# this merges the two frames together and gives us
# year, country and expectancy where the minimum exists
least = pd.DataFrame(data={'Country':country_with_least_expectancy.values,
'Expectancy':expectancy_for_least_country.values},
index= country_with_least_expectancy.index.levels[1])
least
Out[85]: