pandasとDataFrameによるデータ操作

米国市場の株価終値データを用いる。


In [2]:
import pandas as pd
import pandas_datareader.data as web#株価など時系列データをwebから取得するパッケージ
from datetime import datetime

In [2]:
ticker = ['AAPL','IBM','MSFT','GOOG']
start = '2016-01-01'#datetime(2016,1,1)
end = '2016-12-31'#datetime(2016,12,31)
df = web.DataReader(ticker,'yahoo',start,end)['Close',:,:]

In [3]:
print(df.head(5))


                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002

日時を扱うdatetimeについて少しだけ学ぶ。


In [3]:
dt = datetime(2016,3,18)
print(dt)
print(datetime(2017,4,10,10,20,30))
print(datetime.now())
print(dt.year)
print(dt.month)
print(dt.day)


2016-03-18 00:00:00
2017-04-10 10:20:30
2017-04-16 17:54:23.798576
2016
3
18

列名、行名を調べる

列名を調べる


In [17]:
print(df.columns)


Index(['AAPL', 'GOOG', 'IBM', 'MSFT'], dtype='object')

行名を調べる


In [18]:
print(df.index)


DatetimeIndex(['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
               '2016-01-08', '2016-01-11', '2016-01-12', '2016-01-13',
               '2016-01-14', '2016-01-15',
               ...
               '2016-12-16', '2016-12-19', '2016-12-20', '2016-12-21',
               '2016-12-22', '2016-12-23', '2016-12-27', '2016-12-28',
               '2016-12-29', '2016-12-30'],
              dtype='datetime64[ns]', name='Date', length=252, freq=None)

型を調べる


In [20]:
print(type(df))
print(type(df.ix[0,0]))
print(type(df.columns))
print(type(df.index))
print(type(df.columns[0]))
print(type(df.index[0]))


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.indexes.base.Index'>
<class 'pandas.tseries.index.DatetimeIndex'>
<class 'str'>
<class 'pandas.tslib.Timestamp'>
<class 'numpy.float64'>

サイズを調べる


In [44]:
print(len(df))#行の長さのみ
print(len(df.ix[0]))#列の長さ(1列目の長さ)のみ
print(len(df.columns))#列の長さのみ
print(df.shape)#行と列の長さ


252
4
4
(252, 4)

データフレームから1列を抽出する


In [4]:
df1 = df['AAPL']
print(df1.head(5))
print(type(df1))


Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>

次の方法は英単語以外の列名のときは注意が必要


In [6]:
df1 = df.AAPL
print(df1.head(5))
print(type(df1))


Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>

In [7]:
df1 = df.ix[:,'AAPL']
print(df1.head(5))
print(type(df1))


Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>

ixは列番号で指定できる。


In [38]:
df1 = df.ix[:,0]
print(df1.head(5))
print(type(df1))


Date
2016-01-04    105.349998
2016-01-05    102.709999
2016-01-06    100.699997
2016-01-07     96.449997
2016-01-08     96.959999
Name: AAPL, dtype: float64
<class 'pandas.core.series.Series'>

次のような抽出方法はない。


In [37]:
print(df[0])


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1944             try:
-> 1945                 return self._engine.get_loc(key)
   1946             except KeyError:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-37-92b146e0a0e2> in <module>()
----> 1 print(df[0])

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1995             return self._getitem_multilevel(key)
   1996         else:
-> 1997             return self._getitem_column(key)
   1998 
   1999     def _getitem_column(self, key):

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2002         # get column
   2003         if self.columns.is_unique:
-> 2004             return self._get_item_cache(key)
   2005 
   2006         # duplicate columns & possible reduce dimensionality

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1348         res = cache.get(item)
   1349         if res is None:
-> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/ry8128/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 0

データフレームから複数列抽出する


In [8]:
df1 = df[['AAPL','MSFT']]
print(df1.head(5))


                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002
<class 'pandas.core.frame.DataFrame'>

In [9]:
df1 = df.ix[:,['AAPL','MSFT']]
print(df1.head(5))


                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002
<class 'pandas.core.frame.DataFrame'>

ixを使うと列番号で指定できる


In [10]:
df1 = df.ix[:,[0,2]]
print(df1.head(5))


                  AAPL         IBM
Date                              
2016-01-04  105.349998  135.949997
2016-01-05  102.709999  135.850006
2016-01-06  100.699997  135.169998
2016-01-07   96.449997  132.860001
2016-01-08   96.959999  131.630005
<class 'pandas.core.frame.DataFrame'>

In [34]:
df1 = df.ix[:,1:3]
print(df1.head(5))


                  GOOG         IBM
Date                              
2016-01-04  741.840027  135.949997
2016-01-05  742.580017  135.850006
2016-01-06  743.619995  135.169998
2016-01-07  726.390015  132.860001
2016-01-08  714.469971  131.630005

In [45]:
df1 = df.ix[:,'IBM':]
print(df1.head(5))


                   IBM       MSFT
Date                             
2016-01-04  135.949997  54.799999
2016-01-05  135.850006  55.049999
2016-01-06  135.169998  54.049999
2016-01-07  132.860001  52.169998
2016-01-08  131.630005  52.330002

データフレームから1行抽出する


In [26]:
df0 = df.ix[datetime(2016,1,5)]
print(df0)
print(type(df0))


AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>

In [11]:
df0 = df.ix['2016-01-05']
print(df0)
print(type(df0))


AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>

In [12]:
df0 = df.ix['2016-01-05',:]
print(df0)
print(type(df0))


AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>

In [46]:
df0 = df.ix[1,:]
print(df0)
print(type(df0))


AAPL    102.709999
GOOG    742.580017
IBM     135.850006
MSFT     55.049999
Name: 2016-01-05 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>

データフレームから複数行抽出する


In [27]:
df0 = df.ix[['2016-01-05','2016-01-07']]
print(df0)
print(type(df0))


            AAPL  GOOG  IBM  MSFT
Date                             
2016-01-05   NaN   NaN  NaN   NaN
2016-01-07   NaN   NaN  NaN   NaN
<class 'pandas.core.frame.DataFrame'>

複数行では、文字列だとなぜか抽出できない。むしろなぜか1行のときは抽出できた。
型をあわせて抽出する。


In [28]:
df0 = df.ix[[datetime(2016,1,5),datetime(2016,1,7)]]
print(df0)


                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
<class 'pandas.core.frame.DataFrame'>

In [29]:
df0 = df.ix[[datetime(2016,1,5),datetime(2016,1,7)],:]
print(df0)


                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
<class 'pandas.core.frame.DataFrame'>

In [15]:
df0 = df.ix[[1,3],:]
print(df0)


                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
<class 'pandas.core.frame.DataFrame'>

上から5行、下から5行抽出する


In [36]:
print(df.head(5))
print(df.tail(5))


                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002
                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-12-23  116.519997  789.909973  166.710007  63.240002
2016-12-27  117.260002  791.549988  167.139999  63.279999
2016-12-28  116.760002  785.049988  166.190002  62.990002
2016-12-29  116.730003  782.789978  166.600006  62.900002
2016-12-30  115.820000  771.820007  165.990005  62.139999

In [57]:
df0_head = df.ix[:5,:]
df0_tail = df.ix[-5:,:]
print(df0_head)
print(df0_tail)


                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002
                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-12-23  116.519997  789.909973  166.710007  63.240002
2016-12-27  117.260002  791.549988  167.139999  63.279999
2016-12-28  116.760002  785.049988  166.190002  62.990002
2016-12-29  116.730003  782.789978  166.600006  62.900002
2016-12-30  115.820000  771.820007  165.990005  62.139999

次の書き方は多少違和感があるかもしれない(実用のために生み出されたものらしい)。


In [59]:
print(df[:5])
print(df[2:5])


                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-04  105.349998  741.840027  135.949997  54.799999
2016-01-05  102.709999  742.580017  135.850006  55.049999
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002
                  AAPL        GOOG         IBM       MSFT
Date                                                     
2016-01-06  100.699997  743.619995  135.169998  54.049999
2016-01-07   96.449997  726.390015  132.860001  52.169998
2016-01-08   96.959999  714.469971  131.630005  52.330002

柔軟な抽出

行と列を同時に絞り込む


In [30]:
df_sub = df.ix[[datetime(2016,1,5),datetime(2016,1,7)],['AAPL','MSFT']]
print(df_sub)


                  AAPL       MSFT
Date                             
2016-01-05  102.709999  55.049999
2016-01-07   96.449997  52.169998

インデックス番号で行と列を絞り込む


In [31]:
df_sub = df.ix[[1,3],[0,1]]
print(df_sub)


                  AAPL        GOOG
Date                              
2016-01-05  102.709999  742.580017
2016-01-06  100.699997  743.619995

行名・列名とインデックス番号の組み合わせも可能


In [47]:
df_sub = df.ix[:5,['AAPL','MSFT']]
print(df_sub)


                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002

In [53]:
df_sub = df.ix[:datetime(2016,1,8),['AAPL','MSFT']]
print(df_sub)


                  AAPL       MSFT
Date                             
2016-01-04  105.349998  54.799999
2016-01-05  102.709999  55.049999
2016-01-06  100.699997  54.049999
2016-01-07   96.449997  52.169998
2016-01-08   96.959999  52.330002

ただし次のような書き方は不可


In [50]:
df_sub = df.ix[:5,['AAPL',3]]
print(df_sub)


                  AAPL   3
Date                      
2016-01-04  105.349998 NaN
2016-01-05  102.709999 NaN
2016-01-06  100.699997 NaN
2016-01-07   96.449997 NaN
2016-01-08   96.959999 NaN

In [ ]: