Copyright 2015 Enthought, Inc. All Rights Reserved

Pandas IO

This folder contains 4 datasets in different ascii-based files:

  1. all values of the S&P500 index since its beginning until July 2014,
  2. some stock values for 5 major stocks between 2005 and 2009,
  3. some recent tick data (one value per minute) for the AAPL stock, and
  4. all historical data for the NASDAQ stock exchange since its beginning downloaded from the web.

The goal of this exercise is to load each of these files into a DataFrame using read_table (and read_html for the bonus question), while doing some clean up. These four DataFrames will represent the basis for upcoming financial tasks.


In [1]:
from pandas import read_table, read_html

In [2]:
read_table?

Question 1.

The first file ("HistoricalS_and_P500.csv") is comma separated and contains historical data for the S&P200 index since 1950. The first column contains the date in the format MM/DD/YY. We would like to make these dates the index of the DataFrame created. The next 6 columns are the usual values for a given day. The name of the columns is stated at the top of the file.

  1. Try to do a simple read_table on the file, parsing the dates in the data (see parse_dates), and using the first column as an index. Look at the date range of the resulting dataframe.
  2. To fix this issue, we will need to help the date parser. Have a look to the read_table documentation (see converters). You will see that that can be done by providing our own date parser, or modifying the string as it comes in to add the century number the way we want. Write a function that converts a date formatted as MM/DD/YY and make it MM/DD/YYYY.
  3. Use that function to improve your read_table call.

In [3]:
def convert_date(bytes_input):
    str_input = bytes_input.decode("utf-8")
    month, day, year = str_input.split("/")
    if int(year) < 50:
        new_year = "20" + year
    else:
        new_year = "19" + year
    return "/".join([month, day, new_year])

In [4]:
read_table("HistoricalS_and_P500.csv", sep=",", index_col=0, parse_dates=True, converters={0:convert_date})


Out[4]:
Open High Low Close Volume Adj Close
Date
2014-07-08 1976.39 1976.39 1959.46 1963.71 3302430000 1963.71
2014-07-07 1984.22 1984.22 1974.88 1977.65 2681260000 1977.65
2014-07-03 1975.88 1985.59 1975.88 1985.44 1998090000 1985.44
2014-07-02 1973.06 1976.67 1972.58 1974.62 2851480000 1974.62
2014-07-01 1962.29 1978.58 1962.29 1973.32 3188240000 1973.32
2014-06-30 1960.79 1964.24 1958.22 1960.23 3037350000 1960.23
2014-06-27 1956.56 1961.47 1952.18 1960.96 4290590000 1960.96
2014-06-26 1959.89 1959.89 1944.69 1957.22 2778840000 1957.22
2014-06-25 1949.27 1960.83 1947.49 1959.53 3106710000 1959.53
2014-06-24 1961.97 1968.17 1948.34 1949.98 3089700000 1949.98
2014-06-23 1962.92 1963.74 1958.89 1962.61 2717630000 1962.61
2014-06-20 1960.45 1963.91 1959.17 1962.87 4336240000 1962.87
2014-06-19 1957.50 1959.87 1952.26 1959.48 2952150000 1959.48
2014-06-18 1942.73 1957.74 1939.29 1956.98 3065220000 1956.98
2014-06-17 1937.15 1943.69 1933.55 1941.99 2971260000 1941.99
2014-06-16 1934.84 1941.15 1930.91 1937.78 2926130000 1937.78
2014-06-13 1930.80 1937.30 1927.69 1936.16 2598230000 1936.16
2014-06-12 1943.35 1943.35 1925.78 1930.11 3040480000 1930.11
2014-06-11 1949.37 1949.37 1940.08 1943.89 2710620000 1943.89
2014-06-10 1950.34 1950.86 1944.64 1950.79 2702360000 1950.79
2014-06-09 1948.97 1955.55 1947.16 1951.27 2812180000 1951.27
2014-06-06 1942.41 1949.44 1942.41 1949.44 2864300000 1949.44
2014-06-05 1928.52 1941.74 1922.93 1940.46 3113270000 1940.46
2014-06-04 1923.06 1928.63 1918.60 1927.88 2793920000 1927.88
2014-06-03 1923.07 1925.07 1918.79 1924.24 2867180000 1924.24
2014-06-02 1923.87 1925.88 1915.98 1924.97 2509020000 1924.97
2014-05-30 1920.33 1924.03 1916.64 1923.57 3263490000 1923.57
2014-05-29 1910.60 1920.03 1909.82 1920.03 2709050000 1920.03
2014-05-28 1911.77 1914.46 1907.30 1909.78 2976450000 1909.78
2014-05-27 1902.01 1912.28 1902.01 1911.91 2911020000 1911.91
... ... ... ... ... ... ...
1950-02-14 17.06 17.06 17.06 17.06 2210000 17.06
1950-02-10 17.24 17.24 17.24 17.24 1790000 17.24
1950-02-09 17.28 17.28 17.28 17.28 1810000 17.28
1950-02-08 17.21 17.21 17.21 17.21 1470000 17.21
1950-02-07 17.23 17.23 17.23 17.23 1360000 17.23
1950-02-06 17.32 17.32 17.32 17.32 1490000 17.32
1950-02-03 17.29 17.29 17.29 17.29 2210000 17.29
1950-02-02 17.23 17.23 17.23 17.23 2040000 17.23
1950-02-01 17.05 17.05 17.05 17.05 1810000 17.05
1950-01-31 17.05 17.05 17.05 17.05 1690000 17.05
1950-01-30 17.02 17.02 17.02 17.02 1640000 17.02
1950-01-27 16.82 16.82 16.82 16.82 1250000 16.82
1950-01-26 16.73 16.73 16.73 16.73 1150000 16.73
1950-01-25 16.74 16.74 16.74 16.74 1700000 16.74
1950-01-24 16.86 16.86 16.86 16.86 1250000 16.86
1950-01-23 16.92 16.92 16.92 16.92 1340000 16.92
1950-01-20 16.90 16.90 16.90 16.90 1440000 16.90
1950-01-19 16.87 16.87 16.87 16.87 1170000 16.87
1950-01-18 16.85 16.85 16.85 16.85 1570000 16.85
1950-01-17 16.86 16.86 16.86 16.86 1790000 16.86
1950-01-16 16.72 16.72 16.72 16.72 1460000 16.72
1950-01-13 16.67 16.67 16.67 16.67 3330000 16.67
1950-01-12 16.76 16.76 16.76 16.76 2970000 16.76
1950-01-11 17.09 17.09 17.09 17.09 2630000 17.09
1950-01-10 17.03 17.03 17.03 17.03 2160000 17.03
1950-01-09 17.08 17.08 17.08 17.08 2520000 17.08
1950-01-06 16.98 16.98 16.98 16.98 2010000 16.98
1950-01-05 16.93 16.93 16.93 16.93 2550000 16.93
1950-01-04 16.85 16.85 16.85 16.85 1890000 16.85
1950-01-03 16.66 16.66 16.66 16.66 1260000 16.66

16232 rows × 6 columns

Question 2.

The second file ("adj_close_stock_data_yahoo_2005_2010.txt") is space delimited. More precisely, the separation can be any number of space characters. This file contains 8 columns. The first 3 are the year, month, and day of the trading day and the next 5 are the open prices for the 5 stocks: AAPL, GOOG, MSFT, PG, XOM. Additionally, the file contains some missing values in 2005. They are encoded as a - when corrupted data was collected for that day. Load the dataset into a DataFrame with a basic numbered index for now, making sure to encode the - symbols as NaN (missing values).

Note: the separator argument of read_table can receive a regular expression. \s is any space character. \s+ is one or more space character.


In [5]:
read_table("adj_close_stock_data_yahoo_2005_2010.txt", sep="\s+", parse_dates=[[0,1,2]], na_values=["-"])


Out[5]:
year_month_day AAPL GOOGL MSFT PG XOM
0 2005-01-03 4.32 101.46 21.88 NaN 40.44
1 2005-01-04 4.37 97.35 21.96 41.79 40.16
2 2005-01-05 4.40 96.85 21.91 42.23 39.95
3 2005-01-06 4.41 94.37 21.89 42.43 40.46
4 2005-01-07 4.73 97.02 21.82 42.88 40.20
5 2005-01-10 4.71 97.63 21.93 43.22 40.35
6 2005-01-11 4.41 96.87 21.87 43.17 40.25
7 2005-01-12 4.47 97.79 21.91 43.33 40.84
8 2005-01-13 4.77 97.76 21.50 42.58 40.81
9 2005-01-14 4.79 100.09 21.37 42.85 41.23
10 2005-01-18 4.82 102.05 21.54 42.93 41.58
11 2005-01-19 4.77 98.75 21.26 43.66 41.21
12 2005-01-20 4.81 97.06 21.16 43.67 40.95
13 2005-01-21 4.81 94.23 20.99 42.86 40.72
14 2005-01-24 4.83 90.45 21.01 42.52 41.28
15 2005-01-25 4.92 88.65 21.29 42.92 41.44
16 2005-01-26 4.93 94.71 21.28 42.70 41.74
17 2005-01-27 4.96 94.13 21.37 42.61 41.78
18 2005-01-28 5.05 95.27 21.42 41.71 41.39
19 2005-01-31 5.25 97.91 21.51 41.00 41.66
20 2005-02-01 5.29 96.05 21.60 40.72 43.01
21 2005-02-02 5.44 103.08 21.65 41.13 43.54
22 2005-02-03 NaN 105.54 21.42 41.01 44.02
23 2005-02-04 5.38 102.28 21.54 40.53 44.64
24 2005-02-07 5.39 98.11 21.41 40.46 44.73
25 2005-02-08 5.52 99.42 21.47 40.04 45.25
26 2005-02-09 5.38 95.89 21.33 39.67 45.06
27 2005-02-10 5.35 94.08 21.33 39.76 45.71
28 2005-02-11 5.54 93.79 21.25 40.17 45.52
29 2005-02-14 5.78 96.59 21.28 40.59 46.11
... ... ... ... ... ... ...
1229 2009-11-18 28.12 288.61 26.73 53.74 67.03
1230 2009-11-19 27.38 286.78 26.44 53.52 66.48
1231 2009-11-20 27.30 285.27 26.30 53.22 66.24
1232 2009-11-23 28.11 291.47 26.58 54.02 67.41
1233 2009-11-24 27.92 291.84 26.55 54.08 67.66
1234 2009-11-25 27.88 293.16 26.45 54.14 68.10
1235 2009-11-27 27.39 290.17 25.94 53.80 66.68
1236 2009-11-30 27.30 291.79 26.11 53.69 66.85
1237 2009-12-01 26.90 295.23 26.64 54.17 67.72
1238 2009-12-02 26.79 294.05 26.44 54.42 67.50
1239 2009-12-03 26.83 293.16 26.48 53.87 66.78
1240 2009-12-04 26.40 292.80 26.62 53.91 66.12
1241 2009-12-07 25.80 293.42 26.45 53.80 65.70
1242 2009-12-08 25.93 293.82 26.25 53.28 64.97
1243 2009-12-09 27.01 294.80 26.38 53.60 64.82
1244 2009-12-10 26.82 296.05 26.52 53.60 64.48
1245 2009-12-11 26.58 295.55 26.50 53.68 64.86
1246 2009-12-14 26.90 298.16 26.73 54.26 62.06
1247 2009-12-15 26.51 296.87 26.65 53.49 61.60
1248 2009-12-16 26.63 299.18 26.72 53.53 60.94
1249 2009-12-17 26.20 297.27 26.28 53.04 60.75
1250 2009-12-18 26.69 298.51 26.95 53.00 60.74
1251 2009-12-21 27.07 299.64 27.10 52.79 61.01
1252 2009-12-22 27.36 300.86 27.36 52.79 61.07
1253 2009-12-23 27.60 306.15 27.45 52.80 60.79
1254 2009-12-24 28.54 309.55 27.52 52.77 61.15
1255 2009-12-28 28.89 311.75 27.67 52.75 61.52
1256 2009-12-29 28.55 310.01 27.87 53.03 61.31
1257 2009-12-30 28.90 311.68 27.49 52.85 61.24
1258 2009-12-31 28.77 310.30 27.06 52.21 60.73

1259 rows × 6 columns

Question 3.

The third file ("TechTickerSampleData.txt") starts with a header of 7 lines. Afterward, the data part is tab delimited and contains tick data for AAPL and MSFT. The first column specifies if the tick information is for AAPL or MSFT. The second column is the timestamp of the tick data in minutes, and the meaning of the following 5 columns can be read off of line number 5: Close, High, Low, Open, Volume.

Create a DataFrame, making the second column the index.


In [6]:
read_table("TechTickerSampleData.txt", sep="\t", skiprows=7, names=["stock", "timestamp", "close", "high", "low", "open", "volume"], 
           index_col=1)


Out[6]:
stock close high low open volume
timestamp
0 AAPL 84.5993 84.6257 84.5714 84.5714 568330
0 MSFT 40.2800 40.2800 40.2100 40.2401 542320
1 MSFT 40.1900 40.3600 40.1700 40.2900 2596758
2 MSFT 40.1750 40.2300 40.0700 40.1900 2433238
2 AAPL 84.8871 84.9714 84.4300 84.5871 5143782
3 MSFT 40.0900 40.1800 40.0800 40.1800 1912087
3 AAPL 84.8457 84.9543 84.7571 84.8786 3119410
4 MSFT 40.1300 40.1500 40.0700 40.0900 1425872
4 AAPL 84.4715 84.9414 84.4300 84.8571 2805446
5 AAPL 84.1600 84.6329 84.0400 84.4817 5049219
6 MSFT 40.1630 40.2000 40.1200 40.1400 1057386
6 AAPL 83.8929 84.2527 83.7657 84.1601 3386593
7 AAPL 84.1800 84.2043 83.8214 83.9143 2441691
8 MSFT 40.0400 40.1500 40.0250 40.1352 830479
8 AAPL 84.0943 84.2071 83.9814 84.1739 1349978
9 MSFT 39.9700 40.0500 39.9500 40.0400 2709454
10 MSFT 39.9950 40.0000 39.9600 39.9700 1688028
10 AAPL 84.5166 84.5814 84.3714 84.4214 1822870
11 MSFT 39.9800 40.0000 39.9600 39.9940 1404386
11 AAPL 84.5143 84.6314 84.4486 84.5014 2166633
12 MSFT 40.0000 40.0450 39.9650 39.9750 2170004
12 AAPL 84.2429 84.5429 84.2271 84.5171 2253258
13 MSFT 40.0000 40.0100 39.9500 39.9900 6820612
13 AAPL 84.4971 84.4986 84.1643 84.2243 4716439
48 MSFT 40.2900 40.3200 40.2800 40.3100 1140301
49 MSFT 39.8750 40.3400 39.8400 40.2900 4144519
50 MSFT 39.9800 40.0200 39.8450 39.8750 3178738
51 MSFT 39.8600 40.0400 39.8400 39.9800 2289461
51 AAPL 84.4614 84.6829 84.2443 84.6143 4026218
52 MSFT 39.7900 39.9600 39.7700 39.8500 1377684
... ... ... ... ... ... ...
348 MSFT 41.7900 41.8150 41.7100 41.7900 1445177
348 AAPL 95.1100 95.1700 94.9700 95.1050 1858039
349 MSFT 41.6850 41.8300 41.6700 41.8200 3404465
349 AAPL 95.0350 95.1700 94.9700 95.1500 2449449
350 MSFT 41.6850 41.6850 41.6850 41.6850 122
384 MSFT 41.7800 41.8000 41.7000 41.7000 475538
384 AAPL 95.3300 95.3800 95.3300 95.3600 473563
385 MSFT 41.5000 41.7800 41.4800 41.7700 3126191
385 AAPL 95.3000 95.6200 94.9600 95.3400 5032743
386 MSFT 41.6050 41.6100 41.4800 41.5100 1715520
386 AAPL 95.5390 95.8900 95.1800 95.3090 4398637
387 MSFT 41.7300 41.7300 41.5500 41.6100 1083479
387 AAPL 95.6700 95.7900 95.4900 95.5350 1789334
388 MSFT 41.8000 41.8500 41.7100 41.7300 1250176
388 AAPL 95.4000 95.7500 95.3300 95.6650 1791034
389 MSFT 41.8000 41.8900 41.8000 41.8100 1049825
389 AAPL 95.3400 95.5000 95.2200 95.4000 1791164
390 MSFT 41.8400 41.8600 41.7800 41.8000 960177
390 AAPL 95.1300 95.3400 95.0800 95.3399 1226239
391 MSFT 41.8810 41.9100 41.8400 41.8400 842344
391 AAPL 95.0277 95.2300 94.9300 95.1200 1412825
392 MSFT 41.8900 41.9500 41.8800 41.8858 950716
392 AAPL 95.1650 95.2500 95.0100 95.0250 1195498
393 MSFT 41.9300 41.9500 41.8800 41.8965 1008343
393 AAPL 95.0050 95.1900 94.9900 95.1763 1116781
394 MSFT 41.9650 41.9750 41.9100 41.9390 1119940
394 AAPL 95.0300 95.0900 94.8625 95.0076 1726397
395 MSFT 41.9600 42.0150 41.9500 41.9680 1359014
395 AAPL 95.0500 95.1600 95.0000 95.0358 1338130
396 MSFT 41.9850 41.9900 41.9500 41.9670 386750

1410 rows × 6 columns

Bonus

Question 4.

The fourth file ("HistoricalNASDAQ.htm") is an html file with daily NASDAQ index returns. The description of the columns is again provided on the first line of the table. Return a DataFrame whose index is the date column.

Advice: Parsing html is always a complicated task. Reading the help on read_html is highly recommended. In particular, the default value for header may cause issues.


In [7]:
read_html("HistoricalNASDAQ.htm", index_col=0, parse_dates=True, header=0)[0]


Out[7]:
Open High Low Close Volume Adj Close
Date
2014-07-08 4442.87 4443.56 4372.04 4391.46 2221820000 4391.46
2014-07-07 4477.74 4478.02 4447.60 4451.53 1691390000 4451.53
2014-07-03 4472.89 4485.93 4463.85 4485.93 1001730000 4485.93
2014-07-02 4457.86 4466.92 4450.87 4457.73 1599480000 4457.73
2014-07-01 4424.71 4471.60 4424.43 4458.65 1942550000 4458.65
2014-06-30 4398.37 4417.46 4396.59 4408.18 1848110000 4408.18
2014-06-27 4371.80 4398.85 4371.60 4397.93 3964930000 4397.93
2014-06-26 4379.44 4379.80 4347.45 4379.05 1554070000 4379.05
2014-06-25 4341.82 4383.55 4339.41 4379.76 1722820000 4379.76
2014-06-24 4367.90 4399.87 4342.90 4350.35 2014700000 4350.35
2014-06-23 4368.96 4371.81 4358.68 4368.68 1712930000 4368.68
2014-06-20 4365.37 4368.80 4354.03 4368.04 2721380000 4368.04
2014-06-19 4370.14 4372.18 4339.61 4359.33 1845460000 4359.33
2014-06-18 4341.18 4365.10 4320.54 4362.84 1860660000 4362.84
2014-06-17 4316.01 4346.12 4311.15 4337.23 1814360000 4337.23
2014-06-16 4303.93 4326.89 4296.26 4321.10 1675120000 4321.10
2014-06-13 4315.31 4317.67 4288.41 4310.65 1754560000 4310.65
2014-06-12 4323.11 4328.35 4284.53 4297.63 1908190000 4297.63
2014-06-11 4322.92 4338.21 4315.49 4331.93 1778460000 4331.93
2014-06-10 4329.20 4338.87 4319.93 4338.00 1787120000 4338.00
2014-06-09 4324.35 4346.74 4320.28 4336.24 1783060000 4336.24
2014-06-06 4312.76 4322.51 4305.74 4321.40 1616650000 4321.40
2014-06-05 4259.12 4299.50 4241.68 4296.23 1926750000 4296.23
2014-06-04 4222.21 4256.19 4216.23 4251.64 1610090000 4251.64
2014-06-03 4222.04 4240.35 4215.80 4234.08 1718640000 4234.08
2014-06-02 4247.96 4247.96 4207.61 4237.20 1631310000 4237.20
2014-05-30 4251.49 4252.08 4221.95 4242.62 1903660000 4242.62
2014-05-29 4238.04 4247.95 4228.96 4247.95 1714000000 4247.95
2014-05-28 4234.96 4238.17 4216.89 4225.08 1785140000 4225.08
2014-05-27 4206.44 4237.07 4204.72 4237.07 1812330000 4237.07
... ... ... ... ... ... ...
1971-03-19 105.77 105.77 105.77 105.77 0 105.77
1971-03-18 105.88 105.88 105.88 105.88 0 105.88
1971-03-17 105.50 105.50 105.50 105.50 0 105.50
1971-03-16 105.56 105.56 105.56 105.56 0 105.56
1971-03-15 105.13 105.13 105.13 105.13 0 105.13
1971-03-12 104.50 104.50 104.50 104.50 0 104.50
1971-03-11 104.26 104.26 104.26 104.26 0 104.26
1971-03-10 104.14 104.14 104.14 104.14 0 104.14
1971-03-09 104.41 104.41 104.41 104.41 0 104.41
1971-03-08 104.23 104.23 104.23 104.23 0 104.23
1971-03-05 103.00 103.00 103.00 103.00 0 103.00
1971-03-04 102.78 102.78 102.78 102.78 0 102.78
1971-03-03 102.07 102.07 102.07 102.07 0 102.07
1971-03-02 101.84 101.84 101.84 101.84 0 101.84
1971-03-01 101.78 101.78 101.78 101.78 0 101.78
1971-02-26 101.34 101.34 101.34 101.34 0 101.34
1971-02-25 101.23 101.23 101.23 101.23 0 101.23
1971-02-24 100.64 100.64 100.64 100.64 0 100.64
1971-02-23 99.72 99.72 99.72 99.72 0 99.72
1971-02-22 99.68 99.68 99.68 99.68 0 99.68
1971-02-19 100.70 100.70 100.70 100.70 0 100.70
1971-02-18 101.42 101.42 101.42 101.42 0 101.42
1971-02-17 101.74 101.74 101.74 101.74 0 101.74
1971-02-16 102.19 102.19 102.19 102.19 0 102.19
1971-02-12 102.05 102.05 102.05 102.05 0 102.05
1971-02-11 101.45 101.45 101.45 101.45 0 101.45
1971-02-10 100.69 100.69 100.69 100.69 0 100.69
1971-02-09 100.76 100.76 100.76 100.76 0 100.76
1971-02-08 100.84 100.84 100.84 100.84 0 100.84
1971-02-05 100.00 100.00 100.00 100.00 0 100.00

10951 rows × 6 columns


In [7]: