5-1 Pandas IO Data, Different Ways of Indexing Data, Hierarchical Indexing and Panels



In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2

In this section we will be analyzing some financial data. Now pandas gives us access to some data through pandas.io.data

This is basically pandas remote data access: http://pandas.pydata.org/pandas-docs/stable/remote_data.html

Functions from pandas.io.data and pandas.io.ga extract data from various Internet sources into a DataFrame. Currently the following sources are supported:

  • Yahoo! Finance
  • Google Finance
  • St.Louis FED (FRED)
  • Kenneth French’s data library
  • World Bank
  • Google Analytics

So let’s grab some stocks from yahoo data with pandas.io.data. I've seen this list actively change so it's a good idea to see what is available to you - there's likely some really cool plugins that will continue to be built.


In [2]:
import pandas.io.data

In [3]:
?pandas.io.data # <tab>

Now there’s a lot of volatility in oil right now. It’s been rough for producers to say the least. So let’s check out some stocks that are involved in that market.

First we’ll set start and end dates these are just date times. Now I can do this with date times in python but I can also just get a datetime with pandas which can parse a string to pull out a date time. This ends up being pretty handy.


In [7]:
import datetime
print(datetime.datetime(2010,1,1))


2010-01-01 00:00:00

WTI which is W&T Offshore Inc. They drill in the gulf of Mexico.

Let’s also check out

CHK or Chesapeake Energy Corporation.

Tesla Motors

and finally CBAK which is China Bak Battery Incorporated.


In [8]:
start = pd.to_datetime('2010-1-1')
end = datetime.datetime(2015,1,1)
ticker_symbols = ['WTI','CHK','TSLA','CBAK']

In [9]:
wti = pd.io.data.get_data_yahoo(ticker_symbols[0],start=start,end=end)

In [10]:
wti.head()


Out[10]:
Open High Low Close Volume Adj Close
Date
2010-01-04 11.90 12.46 11.86 12.26 838800 9.824887
2010-01-05 12.30 12.63 12.17 12.34 625400 9.888997
2010-01-06 12.41 12.65 12.39 12.58 604700 10.081328
2010-01-07 12.60 12.70 12.24 12.45 565300 9.977149
2010-01-08 12.37 12.54 12.12 12.50 521100 10.017218

Now we can get these one by one in a for loop...


In [12]:
for symbol in ticker_symbols:
    print(symbol)
    df = pd.io.data.get_data_yahoo(symbol,start=start,end=end)


WTI
CHK
TSLA
CBAK

or we can just get them all by passing in the list....


In [13]:
panl = pd.io.data.get_data_yahoo(ticker_symbols,start=start,end=end)

But we get something different back that we haven’t encountered yet. This is a panel. Now panels are advanced and explaining their use case is outside of this video. However, I’ll give you the basics.

Panels are 3 dimensional containers that we can query on each of those dimensions.


In [14]:
panl


Out[14]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1258 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2010-01-04 00:00:00 to 2014-12-31 00:00:00
Minor_axis axis: CBAK to WTI

We can see they’ve got an items axis, a major axis and a minor axis. Panels are a core part of pandas however they are much less used in pandas and therefore are a bit neglected as of now. That’s not just me trying to avoid the topic - which would be my suspicion if I heard that. But that ’s stated almost verbatim in the docs.

http://pandas-docs.github.io/pandas-docs-travis/dsintro.html#from-dataframe-using-to-panel-method


In [15]:
type(panl)


Out[15]:
pandas.core.panel.Panel

However let’s touch on the basis because you may come across them.

We've got a lot of the basic methods like shape.


In [16]:
panl.shape


Out[16]:
(6, 1258, 4)

We’ve got these three axes so we want to query data in them. We've got to do that a bit differently.

Since we know these axis values we can query them.

Now items are done like standard DataFrame columns with dot syntax.


In [17]:
panl.Open.head()


Out[17]:
CBAK CHK TSLA WTI
Date
2010-01-04 2.90 27.429997 NaN 11.90
2010-01-05 2.72 28.300002 NaN 12.30
2010-01-06 2.98 29.209995 NaN 12.41
2010-01-07 2.90 28.629998 NaN 12.60
2010-01-08 2.90 28.389996 NaN 12.37

The major and minor axises are done differently. with the major_xs and minor_xs commands.


In [18]:
panl.major_xs('2013-5-1')


Out[18]:
Open High Low Close Volume Adj Close
CBAK 0.620000 0.790000 0.60 0.770000 56800 0.770000
CHK 19.900002 19.979996 18.86 19.190006 17267300 17.569929
TSLA 55.990002 55.990002 53.00 53.279999 2742800 53.279999
WTI 11.540000 11.540000 11.03 11.220000 720500 10.394898

In [19]:
panl.minor_xs('CHK').head()


Out[19]:
Open High Low Close Volume Adj Close
Date
2010-01-04 27.429997 28.109996 26.920004 28.089999 31146800 24.503810
2010-01-05 28.300002 29.120002 28.199999 28.970004 28692700 25.271467
2010-01-06 29.209995 29.220005 28.530005 28.649996 16055000 24.992314
2010-01-07 28.629998 28.799995 28.180002 28.720002 13906600 25.053382
2010-01-08 28.389996 28.919998 28.050002 28.909998 11656400 25.219122

Some summary statistics are available to us like mean on the panel.


In [20]:
panl.mean()


Out[20]:
Open High Low Close Volume Adj Close
CBAK 1.729793 1.801296 1.670469 1.726248 117253.020668 4.791367
CHK 24.194174 24.530963 23.802076 24.165184 14230891.096979 21.921955
TSLA 88.218820 89.988099 86.329102 88.177315 4136559.595070 88.177315
WTI 16.080199 16.418887 15.720723 16.069173 806965.580286 14.329300

We can perform different kinds of selections and transposition using the major and minor axes however as I said above I'm not going to cover this material. However I’m going to convert this panel to a data frame to show you how to do that and in the process and we’re going to cover a new topic!

Now when we convert the panel to a data frame with the to_frame command - we can see it looks a bit different.


In [21]:
df = panl.to_frame()
df.head()


Out[21]:
Open High Low Close Volume Adj Close
Date minor
2010-01-04 CBAK 2.900000 2.900000 2.670000 2.700000 456600 13.500000
CHK 27.429997 28.109996 26.920004 28.089999 31146800 24.503810
WTI 11.900000 12.460000 11.860000 12.260000 838800 9.824887
2010-01-05 CBAK 2.720000 3.100000 2.690000 2.850000 1179500 14.250000
CHK 28.300002 29.120002 28.199999 28.970004 28692700 25.271467

But when we use the head method to see the first 5 we see things are a little different and that’s because we now have multiple indices or a hierarchical or multi index. Now hierarchical indexes are extremely powerful but they’re beyond the scope of this current video - I'll touch on them a bit later in this section.

What you need to know right now is that there are levels that are stacked on one another and those can be queried.


In [22]:
df.index.levels


Out[22]:
FrozenList([[2010-01-04 00:00:00, 2010-01-05 00:00:00, 2010-01-06 00:00:00, 2010-01-07 00:00:00, 2010-01-08 00:00:00, 2010-01-11 00:00:00, 2010-01-12 00:00:00, 2010-01-13 00:00:00, 2010-01-14 00:00:00, 2010-01-15 00:00:00, 2010-01-19 00:00:00, 2010-01-20 00:00:00, 2010-01-21 00:00:00, 2010-01-22 00:00:00, 2010-01-25 00:00:00, 2010-01-26 00:00:00, 2010-01-27 00:00:00, 2010-01-28 00:00:00, 2010-01-29 00:00:00, 2010-02-01 00:00:00, 2010-02-02 00:00:00, 2010-02-03 00:00:00, 2010-02-04 00:00:00, 2010-02-05 00:00:00, 2010-02-08 00:00:00, 2010-02-09 00:00:00, 2010-02-10 00:00:00, 2010-02-11 00:00:00, 2010-02-12 00:00:00, 2010-02-16 00:00:00, 2010-02-17 00:00:00, 2010-02-18 00:00:00, 2010-02-19 00:00:00, 2010-02-22 00:00:00, 2010-02-23 00:00:00, 2010-02-24 00:00:00, 2010-02-25 00:00:00, 2010-02-26 00:00:00, 2010-03-01 00:00:00, 2010-03-02 00:00:00, 2010-03-03 00:00:00, 2010-03-04 00:00:00, 2010-03-05 00:00:00, 2010-03-08 00:00:00, 2010-03-09 00:00:00, 2010-03-10 00:00:00, 2010-03-11 00:00:00, 2010-03-12 00:00:00, 2010-03-15 00:00:00, 2010-03-16 00:00:00, 2010-03-17 00:00:00, 2010-03-18 00:00:00, 2010-03-19 00:00:00, 2010-03-22 00:00:00, 2010-03-23 00:00:00, 2010-03-24 00:00:00, 2010-03-25 00:00:00, 2010-03-26 00:00:00, 2010-03-29 00:00:00, 2010-03-30 00:00:00, 2010-03-31 00:00:00, 2010-04-01 00:00:00, 2010-04-05 00:00:00, 2010-04-06 00:00:00, 2010-04-07 00:00:00, 2010-04-08 00:00:00, 2010-04-09 00:00:00, 2010-04-12 00:00:00, 2010-04-13 00:00:00, 2010-04-14 00:00:00, 2010-04-15 00:00:00, 2010-04-16 00:00:00, 2010-04-19 00:00:00, 2010-04-20 00:00:00, 2010-04-21 00:00:00, 2010-04-22 00:00:00, 2010-04-23 00:00:00, 2010-04-26 00:00:00, 2010-04-27 00:00:00, 2010-04-28 00:00:00, 2010-04-29 00:00:00, 2010-04-30 00:00:00, 2010-05-03 00:00:00, 2010-05-04 00:00:00, 2010-05-05 00:00:00, 2010-05-06 00:00:00, 2010-05-07 00:00:00, 2010-05-10 00:00:00, 2010-05-11 00:00:00, 2010-05-12 00:00:00, 2010-05-13 00:00:00, 2010-05-14 00:00:00, 2010-05-17 00:00:00, 2010-05-18 00:00:00, 2010-05-19 00:00:00, 2010-05-20 00:00:00, 2010-05-21 00:00:00, 2010-05-24 00:00:00, 2010-05-25 00:00:00, 2010-05-26 00:00:00, ...], ['CBAK', 'CHK', 'TSLA', 'WTI']])

In [23]:
print(len(df.index.levels))


2

Alright we have our dataset in a hierarchical index. but that's not what we want to work with right now which will likely come up when you're analyzing data - you'll want to completely reset your index. Well have no fear, we can do that with the reset index command.

You may find yourself using this often just to get back to square one and start over when performing analysis.


In [24]:
df.reset_index()


Out[24]:
Date minor Open High Low Close Volume Adj Close
0 2010-01-04 CBAK 2.900000 2.900000 2.670000 2.700000 456600 13.500000
1 2010-01-04 CHK 27.429997 28.109996 26.920004 28.089999 31146800 24.503810
2 2010-01-04 WTI 11.900000 12.460000 11.860000 12.260000 838800 9.824887
3 2010-01-05 CBAK 2.720000 3.100000 2.690000 2.850000 1179500 14.250000
4 2010-01-05 CHK 28.300002 29.120002 28.199999 28.970004 28692700 25.271467
5 2010-01-05 WTI 12.300000 12.630000 12.170000 12.340000 625400 9.888997
6 2010-01-06 CBAK 2.980000 3.050000 2.850000 2.880000 633500 14.400000
7 2010-01-06 CHK 29.209995 29.220005 28.530005 28.649996 16055000 24.992314
8 2010-01-06 WTI 12.410000 12.650000 12.390000 12.580000 604700 10.081328
9 2010-01-07 CBAK 2.900000 3.090000 2.780000 2.900000 784600 14.500000
10 2010-01-07 CHK 28.629998 28.799995 28.180002 28.720002 13906600 25.053382
11 2010-01-07 WTI 12.600000 12.700000 12.240000 12.450000 565300 9.977149
12 2010-01-08 CBAK 2.900000 3.140000 2.870000 3.040000 797800 15.200000
13 2010-01-08 CHK 28.389996 28.919998 28.050002 28.909998 11656400 25.219122
14 2010-01-08 WTI 12.370000 12.540000 12.120000 12.500000 521100 10.017218
15 2010-01-11 CBAK 3.110000 3.170000 3.030000 3.060000 510100 15.300000
16 2010-01-11 CHK 28.980002 28.980002 27.730006 28.000005 15041300 24.425306
17 2010-01-11 WTI 12.650000 13.270000 12.140000 12.400000 1177400 9.937080
18 2010-01-12 CBAK 3.080000 3.080000 2.900000 2.920000 383800 14.600001
19 2010-01-12 CHK 27.510000 27.699996 27.170007 27.580006 13724400 24.058927
20 2010-01-12 WTI 12.330000 12.430000 11.800000 11.900000 921200 9.536391
21 2010-01-13 CBAK 2.920000 3.020000 2.910000 2.940000 276000 14.700000
22 2010-01-13 CHK 27.419997 27.939999 26.920004 27.819999 11103000 24.268281
23 2010-01-13 WTI 11.970000 12.000000 11.360000 11.420000 1778300 9.151730
24 2010-01-14 CBAK 3.040000 3.040000 2.900000 2.920000 186800 14.600001
25 2010-01-14 CHK 28.020003 28.379996 27.520000 27.669999 16784700 24.137431
26 2010-01-14 WTI 11.420000 11.900000 11.420000 11.800000 890300 9.456254
27 2010-01-15 CBAK 2.930000 2.950000 2.810000 2.850000 226100 14.250000
28 2010-01-15 CHK 27.520000 28.089999 27.250000 27.910002 18228400 24.346793
29 2010-01-15 WTI 11.820000 11.860000 11.000000 11.100000 2163400 8.895290
... ... ... ... ... ... ... ... ...
4880 2014-12-19 TSLA 220.190002 220.399994 214.500000 219.289993 6910500 219.289993
4881 2014-12-19 WTI 6.800000 7.880000 6.750000 7.630000 5775100 7.630000
4882 2014-12-22 CBAK 1.920000 1.930000 1.880000 1.910000 24100 1.910000
4883 2014-12-22 CHK 19.000000 19.200001 18.299999 18.420000 20323700 18.221374
4884 2014-12-22 TSLA 220.000000 224.059998 218.259995 222.600006 4799400 222.600006
4885 2014-12-22 WTI 7.460000 7.690000 6.990000 7.520000 2758500 7.520000
4886 2014-12-23 CBAK 1.950000 2.350000 1.950000 1.980000 103100 1.980000
4887 2014-12-23 CHK 19.430000 20.350000 19.200001 20.290001 21976500 20.071210
4888 2014-12-23 TSLA 223.809998 224.320007 219.520004 220.970001 4505700 220.970001
4889 2014-12-23 WTI 7.610000 7.950000 7.560000 7.670000 2856200 7.670000
4890 2014-12-24 CBAK 1.950000 1.960000 1.890000 1.930000 7400 1.930000
4891 2014-12-24 CHK 20.190001 20.290001 19.600000 19.840000 7504400 19.626062
4892 2014-12-24 TSLA 219.770004 222.500000 219.250000 222.259995 1332200 222.259995
4893 2014-12-24 WTI 7.510000 7.830000 7.420000 7.550000 1098500 7.550000
4894 2014-12-26 CBAK 1.970000 1.980000 1.870000 1.890000 24900 1.890000
4895 2014-12-26 CHK 20.000000 20.180000 19.500000 19.709999 7746500 19.497462
4896 2014-12-26 TSLA 221.509995 228.500000 221.500000 227.820007 3327000 227.820007
4897 2014-12-26 WTI 7.610000 7.790000 7.370000 7.490000 1154900 7.490000
4898 2014-12-29 CBAK 1.890000 1.970000 1.820000 1.940000 42100 1.940000
4899 2014-12-29 CHK 20.000000 20.160000 19.799999 20.010000 9109900 19.794228
4900 2014-12-29 TSLA 226.899994 227.910004 224.020004 225.710007 2802500 225.710007
4901 2014-12-29 WTI 7.650000 7.870000 7.530000 7.720000 1592400 7.720000
4902 2014-12-30 CBAK 1.960000 1.960000 1.860000 1.880000 27200 1.880000
4903 2014-12-30 CHK 19.799999 20.080000 19.510000 19.639999 9890600 19.428217
4904 2014-12-30 TSLA 223.990005 225.649994 221.399994 222.229996 2903200 222.229996
4905 2014-12-30 WTI 7.600000 7.790000 7.430000 7.430000 1106000 7.430000
4906 2014-12-31 CBAK 1.840000 1.920000 1.830000 1.850000 76200 1.850000
4907 2014-12-31 CHK 19.459999 19.990000 19.230000 19.570000 10677900 19.358973
4908 2014-12-31 TSLA 223.089996 225.679993 222.250000 222.410004 2402100 222.410004
4909 2014-12-31 WTI 7.290000 7.440000 6.940000 7.340000 1434700 7.340000

4910 rows × 8 columns


In [25]:
df.reset_index(inplace=True)

Now we've reset our index. Now I don't want to work with this data set as I'm not an expert on financial data however we'll be working with a really cool data set in our next video. We'll be working with an airplane data set that has flights across the country. This is going to give us the opportunity to work on a ton of cool problems.


In [26]:
df


Out[26]:
Date minor Open High Low Close Volume Adj Close
0 2010-01-04 CBAK 2.900000 2.900000 2.670000 2.700000 456600 13.500000
1 2010-01-04 CHK 27.429997 28.109996 26.920004 28.089999 31146800 24.503810
2 2010-01-04 WTI 11.900000 12.460000 11.860000 12.260000 838800 9.824887
3 2010-01-05 CBAK 2.720000 3.100000 2.690000 2.850000 1179500 14.250000
4 2010-01-05 CHK 28.300002 29.120002 28.199999 28.970004 28692700 25.271467
5 2010-01-05 WTI 12.300000 12.630000 12.170000 12.340000 625400 9.888997
6 2010-01-06 CBAK 2.980000 3.050000 2.850000 2.880000 633500 14.400000
7 2010-01-06 CHK 29.209995 29.220005 28.530005 28.649996 16055000 24.992314
8 2010-01-06 WTI 12.410000 12.650000 12.390000 12.580000 604700 10.081328
9 2010-01-07 CBAK 2.900000 3.090000 2.780000 2.900000 784600 14.500000
10 2010-01-07 CHK 28.629998 28.799995 28.180002 28.720002 13906600 25.053382
11 2010-01-07 WTI 12.600000 12.700000 12.240000 12.450000 565300 9.977149
12 2010-01-08 CBAK 2.900000 3.140000 2.870000 3.040000 797800 15.200000
13 2010-01-08 CHK 28.389996 28.919998 28.050002 28.909998 11656400 25.219122
14 2010-01-08 WTI 12.370000 12.540000 12.120000 12.500000 521100 10.017218
15 2010-01-11 CBAK 3.110000 3.170000 3.030000 3.060000 510100 15.300000
16 2010-01-11 CHK 28.980002 28.980002 27.730006 28.000005 15041300 24.425306
17 2010-01-11 WTI 12.650000 13.270000 12.140000 12.400000 1177400 9.937080
18 2010-01-12 CBAK 3.080000 3.080000 2.900000 2.920000 383800 14.600001
19 2010-01-12 CHK 27.510000 27.699996 27.170007 27.580006 13724400 24.058927
20 2010-01-12 WTI 12.330000 12.430000 11.800000 11.900000 921200 9.536391
21 2010-01-13 CBAK 2.920000 3.020000 2.910000 2.940000 276000 14.700000
22 2010-01-13 CHK 27.419997 27.939999 26.920004 27.819999 11103000 24.268281
23 2010-01-13 WTI 11.970000 12.000000 11.360000 11.420000 1778300 9.151730
24 2010-01-14 CBAK 3.040000 3.040000 2.900000 2.920000 186800 14.600001
25 2010-01-14 CHK 28.020003 28.379996 27.520000 27.669999 16784700 24.137431
26 2010-01-14 WTI 11.420000 11.900000 11.420000 11.800000 890300 9.456254
27 2010-01-15 CBAK 2.930000 2.950000 2.810000 2.850000 226100 14.250000
28 2010-01-15 CHK 27.520000 28.089999 27.250000 27.910002 18228400 24.346793
29 2010-01-15 WTI 11.820000 11.860000 11.000000 11.100000 2163400 8.895290
... ... ... ... ... ... ... ... ...
4880 2014-12-19 TSLA 220.190002 220.399994 214.500000 219.289993 6910500 219.289993
4881 2014-12-19 WTI 6.800000 7.880000 6.750000 7.630000 5775100 7.630000
4882 2014-12-22 CBAK 1.920000 1.930000 1.880000 1.910000 24100 1.910000
4883 2014-12-22 CHK 19.000000 19.200001 18.299999 18.420000 20323700 18.221374
4884 2014-12-22 TSLA 220.000000 224.059998 218.259995 222.600006 4799400 222.600006
4885 2014-12-22 WTI 7.460000 7.690000 6.990000 7.520000 2758500 7.520000
4886 2014-12-23 CBAK 1.950000 2.350000 1.950000 1.980000 103100 1.980000
4887 2014-12-23 CHK 19.430000 20.350000 19.200001 20.290001 21976500 20.071210
4888 2014-12-23 TSLA 223.809998 224.320007 219.520004 220.970001 4505700 220.970001
4889 2014-12-23 WTI 7.610000 7.950000 7.560000 7.670000 2856200 7.670000
4890 2014-12-24 CBAK 1.950000 1.960000 1.890000 1.930000 7400 1.930000
4891 2014-12-24 CHK 20.190001 20.290001 19.600000 19.840000 7504400 19.626062
4892 2014-12-24 TSLA 219.770004 222.500000 219.250000 222.259995 1332200 222.259995
4893 2014-12-24 WTI 7.510000 7.830000 7.420000 7.550000 1098500 7.550000
4894 2014-12-26 CBAK 1.970000 1.980000 1.870000 1.890000 24900 1.890000
4895 2014-12-26 CHK 20.000000 20.180000 19.500000 19.709999 7746500 19.497462
4896 2014-12-26 TSLA 221.509995 228.500000 221.500000 227.820007 3327000 227.820007
4897 2014-12-26 WTI 7.610000 7.790000 7.370000 7.490000 1154900 7.490000
4898 2014-12-29 CBAK 1.890000 1.970000 1.820000 1.940000 42100 1.940000
4899 2014-12-29 CHK 20.000000 20.160000 19.799999 20.010000 9109900 19.794228
4900 2014-12-29 TSLA 226.899994 227.910004 224.020004 225.710007 2802500 225.710007
4901 2014-12-29 WTI 7.650000 7.870000 7.530000 7.720000 1592400 7.720000
4902 2014-12-30 CBAK 1.960000 1.960000 1.860000 1.880000 27200 1.880000
4903 2014-12-30 CHK 19.799999 20.080000 19.510000 19.639999 9890600 19.428217
4904 2014-12-30 TSLA 223.990005 225.649994 221.399994 222.229996 2903200 222.229996
4905 2014-12-30 WTI 7.600000 7.790000 7.430000 7.430000 1106000 7.430000
4906 2014-12-31 CBAK 1.840000 1.920000 1.830000 1.850000 76200 1.850000
4907 2014-12-31 CHK 19.459999 19.990000 19.230000 19.570000 10677900 19.358973
4908 2014-12-31 TSLA 223.089996 225.679993 222.250000 222.410004 2402100 222.410004
4909 2014-12-31 WTI 7.290000 7.440000 6.940000 7.340000 1434700 7.340000

4910 rows × 8 columns


In [ ]: