Copyright 2015 Enthought, Inc. All Rights Reserved
This folder contains 4 datasets in different ascii-based files:
The goal of this exercise is to load each of these files into a DataFrame using read_table (and read_html for the bonus question), while doing some clean up. These four DataFrames will represent the basis for upcoming financial tasks.
In [1]:
from pandas import read_table, read_html
In [2]:
read_table?
The first file ("HistoricalS_and_P500.csv") is comma separated and contains historical data for the S&P200 index since 1950. The first column contains the date in the format MM/DD/YY. We would like to make these dates the index of the DataFrame created. The next 6 columns are the usual values for a given day. The name of the columns is stated at the top of the file.
read_table on the file, parsing the dates in the data (see parse_dates), and using the first column as an index. Look at the date range of the resulting dataframe.read_table documentation (see converters). You will see that that can be done by providing our own date parser, or modifying the string as it comes in to add the century number the way we want. Write a function that converts a date formatted as MM/DD/YY and make it MM/DD/YYYY.read_table call.
In [3]:
def convert_date(bytes_input):
str_input = bytes_input.decode("utf-8")
month, day, year = str_input.split("/")
if int(year) < 50:
new_year = "20" + year
else:
new_year = "19" + year
return "/".join([month, day, new_year])
In [4]:
read_table("HistoricalS_and_P500.csv", sep=",", index_col=0, parse_dates=True, converters={0:convert_date})
Out[4]:
Open
High
Low
Close
Volume
Adj Close
Date
2014-07-08
1976.39
1976.39
1959.46
1963.71
3302430000
1963.71
2014-07-07
1984.22
1984.22
1974.88
1977.65
2681260000
1977.65
2014-07-03
1975.88
1985.59
1975.88
1985.44
1998090000
1985.44
2014-07-02
1973.06
1976.67
1972.58
1974.62
2851480000
1974.62
2014-07-01
1962.29
1978.58
1962.29
1973.32
3188240000
1973.32
2014-06-30
1960.79
1964.24
1958.22
1960.23
3037350000
1960.23
2014-06-27
1956.56
1961.47
1952.18
1960.96
4290590000
1960.96
2014-06-26
1959.89
1959.89
1944.69
1957.22
2778840000
1957.22
2014-06-25
1949.27
1960.83
1947.49
1959.53
3106710000
1959.53
2014-06-24
1961.97
1968.17
1948.34
1949.98
3089700000
1949.98
2014-06-23
1962.92
1963.74
1958.89
1962.61
2717630000
1962.61
2014-06-20
1960.45
1963.91
1959.17
1962.87
4336240000
1962.87
2014-06-19
1957.50
1959.87
1952.26
1959.48
2952150000
1959.48
2014-06-18
1942.73
1957.74
1939.29
1956.98
3065220000
1956.98
2014-06-17
1937.15
1943.69
1933.55
1941.99
2971260000
1941.99
2014-06-16
1934.84
1941.15
1930.91
1937.78
2926130000
1937.78
2014-06-13
1930.80
1937.30
1927.69
1936.16
2598230000
1936.16
2014-06-12
1943.35
1943.35
1925.78
1930.11
3040480000
1930.11
2014-06-11
1949.37
1949.37
1940.08
1943.89
2710620000
1943.89
2014-06-10
1950.34
1950.86
1944.64
1950.79
2702360000
1950.79
2014-06-09
1948.97
1955.55
1947.16
1951.27
2812180000
1951.27
2014-06-06
1942.41
1949.44
1942.41
1949.44
2864300000
1949.44
2014-06-05
1928.52
1941.74
1922.93
1940.46
3113270000
1940.46
2014-06-04
1923.06
1928.63
1918.60
1927.88
2793920000
1927.88
2014-06-03
1923.07
1925.07
1918.79
1924.24
2867180000
1924.24
2014-06-02
1923.87
1925.88
1915.98
1924.97
2509020000
1924.97
2014-05-30
1920.33
1924.03
1916.64
1923.57
3263490000
1923.57
2014-05-29
1910.60
1920.03
1909.82
1920.03
2709050000
1920.03
2014-05-28
1911.77
1914.46
1907.30
1909.78
2976450000
1909.78
2014-05-27
1902.01
1912.28
1902.01
1911.91
2911020000
1911.91
...
...
...
...
...
...
...
1950-02-14
17.06
17.06
17.06
17.06
2210000
17.06
1950-02-10
17.24
17.24
17.24
17.24
1790000
17.24
1950-02-09
17.28
17.28
17.28
17.28
1810000
17.28
1950-02-08
17.21
17.21
17.21
17.21
1470000
17.21
1950-02-07
17.23
17.23
17.23
17.23
1360000
17.23
1950-02-06
17.32
17.32
17.32
17.32
1490000
17.32
1950-02-03
17.29
17.29
17.29
17.29
2210000
17.29
1950-02-02
17.23
17.23
17.23
17.23
2040000
17.23
1950-02-01
17.05
17.05
17.05
17.05
1810000
17.05
1950-01-31
17.05
17.05
17.05
17.05
1690000
17.05
1950-01-30
17.02
17.02
17.02
17.02
1640000
17.02
1950-01-27
16.82
16.82
16.82
16.82
1250000
16.82
1950-01-26
16.73
16.73
16.73
16.73
1150000
16.73
1950-01-25
16.74
16.74
16.74
16.74
1700000
16.74
1950-01-24
16.86
16.86
16.86
16.86
1250000
16.86
1950-01-23
16.92
16.92
16.92
16.92
1340000
16.92
1950-01-20
16.90
16.90
16.90
16.90
1440000
16.90
1950-01-19
16.87
16.87
16.87
16.87
1170000
16.87
1950-01-18
16.85
16.85
16.85
16.85
1570000
16.85
1950-01-17
16.86
16.86
16.86
16.86
1790000
16.86
1950-01-16
16.72
16.72
16.72
16.72
1460000
16.72
1950-01-13
16.67
16.67
16.67
16.67
3330000
16.67
1950-01-12
16.76
16.76
16.76
16.76
2970000
16.76
1950-01-11
17.09
17.09
17.09
17.09
2630000
17.09
1950-01-10
17.03
17.03
17.03
17.03
2160000
17.03
1950-01-09
17.08
17.08
17.08
17.08
2520000
17.08
1950-01-06
16.98
16.98
16.98
16.98
2010000
16.98
1950-01-05
16.93
16.93
16.93
16.93
2550000
16.93
1950-01-04
16.85
16.85
16.85
16.85
1890000
16.85
1950-01-03
16.66
16.66
16.66
16.66
1260000
16.66
16232 rows × 6 columns
The second file ("adj_close_stock_data_yahoo_2005_2010.txt") is space delimited. More precisely, the separation can be any number of space characters. This file contains 8 columns. The first 3 are the year, month, and day of the trading day and the next 5 are the open prices for the 5 stocks: AAPL, GOOG, MSFT, PG, XOM. Additionally, the file contains some missing values in 2005. They are encoded as a - when corrupted data was collected for that day. Load the dataset into a DataFrame with a basic numbered index for now, making sure to encode the - symbols as NaN (missing values).
Note: the separator argument of read_table can receive a regular expression. \s is any space character. \s+ is one or more space character.
In [5]:
read_table("adj_close_stock_data_yahoo_2005_2010.txt", sep="\s+", parse_dates=[[0,1,2]], na_values=["-"])
Out[5]:
year_month_day
AAPL
GOOGL
MSFT
PG
XOM
0
2005-01-03
4.32
101.46
21.88
NaN
40.44
1
2005-01-04
4.37
97.35
21.96
41.79
40.16
2
2005-01-05
4.40
96.85
21.91
42.23
39.95
3
2005-01-06
4.41
94.37
21.89
42.43
40.46
4
2005-01-07
4.73
97.02
21.82
42.88
40.20
5
2005-01-10
4.71
97.63
21.93
43.22
40.35
6
2005-01-11
4.41
96.87
21.87
43.17
40.25
7
2005-01-12
4.47
97.79
21.91
43.33
40.84
8
2005-01-13
4.77
97.76
21.50
42.58
40.81
9
2005-01-14
4.79
100.09
21.37
42.85
41.23
10
2005-01-18
4.82
102.05
21.54
42.93
41.58
11
2005-01-19
4.77
98.75
21.26
43.66
41.21
12
2005-01-20
4.81
97.06
21.16
43.67
40.95
13
2005-01-21
4.81
94.23
20.99
42.86
40.72
14
2005-01-24
4.83
90.45
21.01
42.52
41.28
15
2005-01-25
4.92
88.65
21.29
42.92
41.44
16
2005-01-26
4.93
94.71
21.28
42.70
41.74
17
2005-01-27
4.96
94.13
21.37
42.61
41.78
18
2005-01-28
5.05
95.27
21.42
41.71
41.39
19
2005-01-31
5.25
97.91
21.51
41.00
41.66
20
2005-02-01
5.29
96.05
21.60
40.72
43.01
21
2005-02-02
5.44
103.08
21.65
41.13
43.54
22
2005-02-03
NaN
105.54
21.42
41.01
44.02
23
2005-02-04
5.38
102.28
21.54
40.53
44.64
24
2005-02-07
5.39
98.11
21.41
40.46
44.73
25
2005-02-08
5.52
99.42
21.47
40.04
45.25
26
2005-02-09
5.38
95.89
21.33
39.67
45.06
27
2005-02-10
5.35
94.08
21.33
39.76
45.71
28
2005-02-11
5.54
93.79
21.25
40.17
45.52
29
2005-02-14
5.78
96.59
21.28
40.59
46.11
...
...
...
...
...
...
...
1229
2009-11-18
28.12
288.61
26.73
53.74
67.03
1230
2009-11-19
27.38
286.78
26.44
53.52
66.48
1231
2009-11-20
27.30
285.27
26.30
53.22
66.24
1232
2009-11-23
28.11
291.47
26.58
54.02
67.41
1233
2009-11-24
27.92
291.84
26.55
54.08
67.66
1234
2009-11-25
27.88
293.16
26.45
54.14
68.10
1235
2009-11-27
27.39
290.17
25.94
53.80
66.68
1236
2009-11-30
27.30
291.79
26.11
53.69
66.85
1237
2009-12-01
26.90
295.23
26.64
54.17
67.72
1238
2009-12-02
26.79
294.05
26.44
54.42
67.50
1239
2009-12-03
26.83
293.16
26.48
53.87
66.78
1240
2009-12-04
26.40
292.80
26.62
53.91
66.12
1241
2009-12-07
25.80
293.42
26.45
53.80
65.70
1242
2009-12-08
25.93
293.82
26.25
53.28
64.97
1243
2009-12-09
27.01
294.80
26.38
53.60
64.82
1244
2009-12-10
26.82
296.05
26.52
53.60
64.48
1245
2009-12-11
26.58
295.55
26.50
53.68
64.86
1246
2009-12-14
26.90
298.16
26.73
54.26
62.06
1247
2009-12-15
26.51
296.87
26.65
53.49
61.60
1248
2009-12-16
26.63
299.18
26.72
53.53
60.94
1249
2009-12-17
26.20
297.27
26.28
53.04
60.75
1250
2009-12-18
26.69
298.51
26.95
53.00
60.74
1251
2009-12-21
27.07
299.64
27.10
52.79
61.01
1252
2009-12-22
27.36
300.86
27.36
52.79
61.07
1253
2009-12-23
27.60
306.15
27.45
52.80
60.79
1254
2009-12-24
28.54
309.55
27.52
52.77
61.15
1255
2009-12-28
28.89
311.75
27.67
52.75
61.52
1256
2009-12-29
28.55
310.01
27.87
53.03
61.31
1257
2009-12-30
28.90
311.68
27.49
52.85
61.24
1258
2009-12-31
28.77
310.30
27.06
52.21
60.73
1259 rows × 6 columns
The third file ("TechTickerSampleData.txt") starts with a header of 7 lines. Afterward, the data part is tab delimited and contains tick data for AAPL and MSFT. The first column specifies if the tick information is for AAPL or MSFT. The second column is the timestamp of the tick data in minutes, and the meaning of the following 5 columns can be read off of line number 5: Close, High, Low, Open, Volume.
Create a DataFrame, making the second column the index.
In [6]:
read_table("TechTickerSampleData.txt", sep="\t", skiprows=7, names=["stock", "timestamp", "close", "high", "low", "open", "volume"],
index_col=1)
Out[6]:
stock
close
high
low
open
volume
timestamp
0
AAPL
84.5993
84.6257
84.5714
84.5714
568330
0
MSFT
40.2800
40.2800
40.2100
40.2401
542320
1
MSFT
40.1900
40.3600
40.1700
40.2900
2596758
2
MSFT
40.1750
40.2300
40.0700
40.1900
2433238
2
AAPL
84.8871
84.9714
84.4300
84.5871
5143782
3
MSFT
40.0900
40.1800
40.0800
40.1800
1912087
3
AAPL
84.8457
84.9543
84.7571
84.8786
3119410
4
MSFT
40.1300
40.1500
40.0700
40.0900
1425872
4
AAPL
84.4715
84.9414
84.4300
84.8571
2805446
5
AAPL
84.1600
84.6329
84.0400
84.4817
5049219
6
MSFT
40.1630
40.2000
40.1200
40.1400
1057386
6
AAPL
83.8929
84.2527
83.7657
84.1601
3386593
7
AAPL
84.1800
84.2043
83.8214
83.9143
2441691
8
MSFT
40.0400
40.1500
40.0250
40.1352
830479
8
AAPL
84.0943
84.2071
83.9814
84.1739
1349978
9
MSFT
39.9700
40.0500
39.9500
40.0400
2709454
10
MSFT
39.9950
40.0000
39.9600
39.9700
1688028
10
AAPL
84.5166
84.5814
84.3714
84.4214
1822870
11
MSFT
39.9800
40.0000
39.9600
39.9940
1404386
11
AAPL
84.5143
84.6314
84.4486
84.5014
2166633
12
MSFT
40.0000
40.0450
39.9650
39.9750
2170004
12
AAPL
84.2429
84.5429
84.2271
84.5171
2253258
13
MSFT
40.0000
40.0100
39.9500
39.9900
6820612
13
AAPL
84.4971
84.4986
84.1643
84.2243
4716439
48
MSFT
40.2900
40.3200
40.2800
40.3100
1140301
49
MSFT
39.8750
40.3400
39.8400
40.2900
4144519
50
MSFT
39.9800
40.0200
39.8450
39.8750
3178738
51
MSFT
39.8600
40.0400
39.8400
39.9800
2289461
51
AAPL
84.4614
84.6829
84.2443
84.6143
4026218
52
MSFT
39.7900
39.9600
39.7700
39.8500
1377684
...
...
...
...
...
...
...
348
MSFT
41.7900
41.8150
41.7100
41.7900
1445177
348
AAPL
95.1100
95.1700
94.9700
95.1050
1858039
349
MSFT
41.6850
41.8300
41.6700
41.8200
3404465
349
AAPL
95.0350
95.1700
94.9700
95.1500
2449449
350
MSFT
41.6850
41.6850
41.6850
41.6850
122
384
MSFT
41.7800
41.8000
41.7000
41.7000
475538
384
AAPL
95.3300
95.3800
95.3300
95.3600
473563
385
MSFT
41.5000
41.7800
41.4800
41.7700
3126191
385
AAPL
95.3000
95.6200
94.9600
95.3400
5032743
386
MSFT
41.6050
41.6100
41.4800
41.5100
1715520
386
AAPL
95.5390
95.8900
95.1800
95.3090
4398637
387
MSFT
41.7300
41.7300
41.5500
41.6100
1083479
387
AAPL
95.6700
95.7900
95.4900
95.5350
1789334
388
MSFT
41.8000
41.8500
41.7100
41.7300
1250176
388
AAPL
95.4000
95.7500
95.3300
95.6650
1791034
389
MSFT
41.8000
41.8900
41.8000
41.8100
1049825
389
AAPL
95.3400
95.5000
95.2200
95.4000
1791164
390
MSFT
41.8400
41.8600
41.7800
41.8000
960177
390
AAPL
95.1300
95.3400
95.0800
95.3399
1226239
391
MSFT
41.8810
41.9100
41.8400
41.8400
842344
391
AAPL
95.0277
95.2300
94.9300
95.1200
1412825
392
MSFT
41.8900
41.9500
41.8800
41.8858
950716
392
AAPL
95.1650
95.2500
95.0100
95.0250
1195498
393
MSFT
41.9300
41.9500
41.8800
41.8965
1008343
393
AAPL
95.0050
95.1900
94.9900
95.1763
1116781
394
MSFT
41.9650
41.9750
41.9100
41.9390
1119940
394
AAPL
95.0300
95.0900
94.8625
95.0076
1726397
395
MSFT
41.9600
42.0150
41.9500
41.9680
1359014
395
AAPL
95.0500
95.1600
95.0000
95.0358
1338130
396
MSFT
41.9850
41.9900
41.9500
41.9670
386750
1410 rows × 6 columns
The fourth file ("HistoricalNASDAQ.htm") is an html file with daily NASDAQ index returns. The description of the columns is again provided on the first line of the table. Return a DataFrame whose index is the date column.
Advice: Parsing html is always a complicated task. Reading the help on read_html is highly recommended. In particular, the default value for header may cause issues.
In [7]:
read_html("HistoricalNASDAQ.htm", index_col=0, parse_dates=True, header=0)[0]
Out[7]:
Open
High
Low
Close
Volume
Adj Close
Date
2014-07-08
4442.87
4443.56
4372.04
4391.46
2221820000
4391.46
2014-07-07
4477.74
4478.02
4447.60
4451.53
1691390000
4451.53
2014-07-03
4472.89
4485.93
4463.85
4485.93
1001730000
4485.93
2014-07-02
4457.86
4466.92
4450.87
4457.73
1599480000
4457.73
2014-07-01
4424.71
4471.60
4424.43
4458.65
1942550000
4458.65
2014-06-30
4398.37
4417.46
4396.59
4408.18
1848110000
4408.18
2014-06-27
4371.80
4398.85
4371.60
4397.93
3964930000
4397.93
2014-06-26
4379.44
4379.80
4347.45
4379.05
1554070000
4379.05
2014-06-25
4341.82
4383.55
4339.41
4379.76
1722820000
4379.76
2014-06-24
4367.90
4399.87
4342.90
4350.35
2014700000
4350.35
2014-06-23
4368.96
4371.81
4358.68
4368.68
1712930000
4368.68
2014-06-20
4365.37
4368.80
4354.03
4368.04
2721380000
4368.04
2014-06-19
4370.14
4372.18
4339.61
4359.33
1845460000
4359.33
2014-06-18
4341.18
4365.10
4320.54
4362.84
1860660000
4362.84
2014-06-17
4316.01
4346.12
4311.15
4337.23
1814360000
4337.23
2014-06-16
4303.93
4326.89
4296.26
4321.10
1675120000
4321.10
2014-06-13
4315.31
4317.67
4288.41
4310.65
1754560000
4310.65
2014-06-12
4323.11
4328.35
4284.53
4297.63
1908190000
4297.63
2014-06-11
4322.92
4338.21
4315.49
4331.93
1778460000
4331.93
2014-06-10
4329.20
4338.87
4319.93
4338.00
1787120000
4338.00
2014-06-09
4324.35
4346.74
4320.28
4336.24
1783060000
4336.24
2014-06-06
4312.76
4322.51
4305.74
4321.40
1616650000
4321.40
2014-06-05
4259.12
4299.50
4241.68
4296.23
1926750000
4296.23
2014-06-04
4222.21
4256.19
4216.23
4251.64
1610090000
4251.64
2014-06-03
4222.04
4240.35
4215.80
4234.08
1718640000
4234.08
2014-06-02
4247.96
4247.96
4207.61
4237.20
1631310000
4237.20
2014-05-30
4251.49
4252.08
4221.95
4242.62
1903660000
4242.62
2014-05-29
4238.04
4247.95
4228.96
4247.95
1714000000
4247.95
2014-05-28
4234.96
4238.17
4216.89
4225.08
1785140000
4225.08
2014-05-27
4206.44
4237.07
4204.72
4237.07
1812330000
4237.07
...
...
...
...
...
...
...
1971-03-19
105.77
105.77
105.77
105.77
0
105.77
1971-03-18
105.88
105.88
105.88
105.88
0
105.88
1971-03-17
105.50
105.50
105.50
105.50
0
105.50
1971-03-16
105.56
105.56
105.56
105.56
0
105.56
1971-03-15
105.13
105.13
105.13
105.13
0
105.13
1971-03-12
104.50
104.50
104.50
104.50
0
104.50
1971-03-11
104.26
104.26
104.26
104.26
0
104.26
1971-03-10
104.14
104.14
104.14
104.14
0
104.14
1971-03-09
104.41
104.41
104.41
104.41
0
104.41
1971-03-08
104.23
104.23
104.23
104.23
0
104.23
1971-03-05
103.00
103.00
103.00
103.00
0
103.00
1971-03-04
102.78
102.78
102.78
102.78
0
102.78
1971-03-03
102.07
102.07
102.07
102.07
0
102.07
1971-03-02
101.84
101.84
101.84
101.84
0
101.84
1971-03-01
101.78
101.78
101.78
101.78
0
101.78
1971-02-26
101.34
101.34
101.34
101.34
0
101.34
1971-02-25
101.23
101.23
101.23
101.23
0
101.23
1971-02-24
100.64
100.64
100.64
100.64
0
100.64
1971-02-23
99.72
99.72
99.72
99.72
0
99.72
1971-02-22
99.68
99.68
99.68
99.68
0
99.68
1971-02-19
100.70
100.70
100.70
100.70
0
100.70
1971-02-18
101.42
101.42
101.42
101.42
0
101.42
1971-02-17
101.74
101.74
101.74
101.74
0
101.74
1971-02-16
102.19
102.19
102.19
102.19
0
102.19
1971-02-12
102.05
102.05
102.05
102.05
0
102.05
1971-02-11
101.45
101.45
101.45
101.45
0
101.45
1971-02-10
100.69
100.69
100.69
100.69
0
100.69
1971-02-09
100.76
100.76
100.76
100.76
0
100.76
1971-02-08
100.84
100.84
100.84
100.84
0
100.84
1971-02-05
100.00
100.00
100.00
100.00
0
100.00
10951 rows × 6 columns
In [7]:
Content source: Sandor-PRA/pandas_tutorial
Similar notebooks: