pandasとDataFrameによるデータ操作

米国市場の株価終値データを用いる。



In [2]:

    
import pandas as pd
import pandas_datareader.data as web#株価など時系列データをwebから取得するパッケージ
from datetime import datetime



In [2]:

    
ticker = ['AAPL','IBM','MSFT','GOOG']
start = '2016-01-01'#datetime(2016,1,1)
end = '2016-12-31'#datetime(2016,12,31)
df = web.DataReader(ticker,'yahoo',start,end)['Close',:,:]



In [3]:

    
print(df.head(5))









    



                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002

日時を扱うdatetimeについて少しだけ学ぶ。



In [3]:

    
dt = datetime(2016,3,18)
print(dt)
print(datetime(2017,4,10,10,20,30))
print(datetime.now())
print(dt.year)
print(dt.month)
print(dt.day)









    



2016-03-18 00:00:00
2017-04-10 10:20:30
2017-04-16 17:54:23.798576
2016
3
18

列名、行名を調べる

列名を調べる



In [17]:

    
print(df.columns)









    



Index(['AAPL', 'GOOG', 'IBM', 'MSFT'], dtype='object')

行名を調べる



In [18]:

    
print(df.index)









    



DatetimeIndex(['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
               '2016-01-08', '2016-01-11', '2016-01-12', '2016-01-13',
               '2016-01-14', '2016-01-15',
               ...
               '2016-12-16', '2016-12-19', '2016-12-20', '2016-12-21',
               '2016-12-22', '2016-12-23', '2016-12-27', '2016-12-28',
               '2016-12-29', '2016-12-30'],
              dtype='datetime64[ns]', name='Date', length=252, freq=None)

型を調べる



In [20]:

    
print(type(df))
print(type(df.ix[0,0]))
print(type(df.columns))
print(type(df.index))
print(type(df.columns[0]))
print(type(df.index[0]))









    



<class 'pandas.core.frame.DataFrame'>
<class 'pandas.indexes.base.Index'>
<class 'pandas.tseries.index.DatetimeIndex'>
<class 'str'>
<class 'pandas.tslib.Timestamp'>
<class 'numpy.float64'>

サイズを調べる



In [44]:

    
print(len(df))#行の長さのみ
print(len(df.ix[0]))#列の長さ(１列目の長さ)のみ
print(len(df.columns))#列の長さのみ
print(df.shape)#行と列の長さ

データフレームから１列を抽出する



In [4]:

    
df1 = df['AAPL']
print(df1.head(5))
print(type(df1))









    



Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>

次の方法は英単語以外の列名のときは注意が必要



In [6]:

    
df1 = df.AAPL
print(df1.head(5))
print(type(df1))









    



Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>



In [7]:

    
df1 = df.ix[:,'AAPL']
print(df1.head(5))
print(type(df1))









    



Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>

ixは列番号で指定できる。



In [38]:

    
df1 = df.ix[:,0]
print(df1.head(5))
print(type(df1))









    



Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>

次のような抽出方法はない。



In [37]:

    
print(df[0])









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1944             try:
-> 1945                 return self._engine.get_loc(key)
   1946             except KeyError:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-37-92b146e0a0e2> in <module>()
----> 1 print(df[0])

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1995             return self._getitem_multilevel(key)
   1996         else:
-> 1997             return self._getitem_column(key)
   1998 
   1999     def _getitem_column(self, key):

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2002         # get column
   2003         if self.columns.is_unique:
-> 2004             return self._get_item_cache(key)
   2005 
   2006         # duplicate columns & possible reduce dimensionality

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1348         res = cache.get(item)
   1349         if res is None:
-> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 0

データフレームから複数列抽出する



In [8]:

    
df1 = df[['AAPL','MSFT']]
print(df1.head(5))









    



                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002
<class 'pandas.core.frame.DataFrame'>



In [9]:

    
df1 = df.ix[:,['AAPL','MSFT']]
print(df1.head(5))









    



                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002
<class 'pandas.core.frame.DataFrame'>

ixを使うと列番号で指定できる



In [10]:

    
df1 = df.ix[:,[0,2]]
print(df1.head(5))









    



                  AAPL         IBM
Date                              
2016-01-04  105.349998  135.949997
2016-01-05  102.709999  135.850006
2016-01-06  100.699997  135.169998
2016-01-07   96.449997  132.860001
2016-01-08   96.959999  131.630005
<class 'pandas.core.frame.DataFrame'>



In [34]:

    
df1 = df.ix[:,1:3]
print(df1.head(5))









    



                  GOOG         IBM
Date                              
2016-01-04  741.840027  135.949997
2016-01-05  742.580017  135.850006
2016-01-06  743.619995  135.169998
2016-01-07  726.390015  132.860001
2016-01-08  714.469971  131.630005



In [45]:

    
df1 = df.ix[:,'IBM':]
print(df1.head(5))









    



                   IBM       MSFT
Date                             
2016-01-04  135.949997  54.799999
2016-01-05  135.850006  55.049999
2016-01-06  135.169998  54.049999
2016-01-07  132.860001  52.169998
2016-01-08  131.630005  52.330002

データフレームから１行抽出する



In [26]:

    
df0 = df.ix[datetime(2016,1,5)]
print(df0)
print(type(df0))









    



AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>



In [11]:

    
df0 = df.ix['2016-01-05']
print(df0)
print(type(df0))









    



AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>



In [12]:

    
df0 = df.ix['2016-01-05',:]
print(df0)
print(type(df0))









    



AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>



In [46]:

    
df0 = df.ix[1,:]
print(df0)
print(type(df0))









    



AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>

データフレームから複数行抽出する



In [27]:

    
df0 = df.ix[['2016-01-05','2016-01-07']]
print(df0)
print(type(df0))









    



            AAPL  GOOG  IBM  MSFT
Date                             
2016-01-05   NaN   NaN  NaN   NaN
2016-01-07   NaN   NaN  NaN   NaN
<class 'pandas.core.frame.DataFrame'>

複数行では、文字列だとなぜか抽出できない。むしろなぜか１行のときは抽出できた。
型をあわせて抽出する。



In [28]:

    
df0 = df.ix[[datetime(2016,1,5),datetime(2016,1,7)]]
print(df0)









    



                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
<class 'pandas.core.frame.DataFrame'>



In [29]:

    
df0 = df.ix[[datetime(2016,1,5),datetime(2016,1,7)],:]
print(df0)









    



                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
<class 'pandas.core.frame.DataFrame'>



In [15]:

    
df0 = df.ix[[1,3],:]
print(df0)









    



                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
<class 'pandas.core.frame.DataFrame'>

上から５行、下から５行抽出する



In [36]:

    
print(df.head(5))
print(df.tail(5))









    



                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002
                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-12-23  116.519997  789.909973  166.710007  63.240002
2016-12-27  117.260002  791.549988  167.139999  63.279999
2016-12-28  116.760002  785.049988  166.190002  62.990002
2016-12-29  116.730003  782.789978  166.600006  62.900002
2016-12-30  115.820000  771.820007  165.990005  62.139999



In [57]:

    
df0_head = df.ix[:5,:]
df0_tail = df.ix[-5:,:]
print(df0_head)
print(df0_tail)









    



                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002
                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-12-23  116.519997  789.909973  166.710007  63.240002
2016-12-27  117.260002  791.549988  167.139999  63.279999
2016-12-28  116.760002  785.049988  166.190002  62.990002
2016-12-29  116.730003  782.789978  166.600006  62.900002
2016-12-30  115.820000  771.820007  165.990005  62.139999

次の書き方は多少違和感があるかもしれない(実用のために生み出されたものらしい)。



In [59]:

    
print(df[:5])
print(df[2:5])









    



                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002
                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002

柔軟な抽出

行と列を同時に絞り込む



In [30]:

    
df_sub = df.ix[[datetime(2016,1,5),datetime(2016,1,7)],['AAPL','MSFT']]
print(df_sub)









    



                  AAPL       MSFT
Date                             
2016-01-05  102.709999  55.049999
2016-01-07   96.449997  52.169998

インデックス番号で行と列を絞り込む



In [31]:

    
df_sub = df.ix[[1,3],[0,1]]
print(df_sub)









    



                  AAPL        GOOG
Date                              
2016-01-05  102.709999  742.580017
2016-01-06  100.699997  743.619995

行名・列名とインデックス番号の組み合わせも可能



In [47]:

    
df_sub = df.ix[:5,['AAPL','MSFT']]
print(df_sub)









    



                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002



In [53]:

    
df_sub = df.ix[:datetime(2016,1,8),['AAPL','MSFT']]
print(df_sub)









    



                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002

ただし次のような書き方は不可



In [50]:

    
df_sub = df.ix[:5,['AAPL',3]]
print(df_sub)









    



                  AAPL   3
Date                      
2016-01-04  105.349998 NaN
2016-01-05  102.709999 NaN
2016-01-06  100.699997 NaN
2016-01-07   96.449997 NaN
2016-01-08   96.959999 NaN



In [ ]: