Data loading, storage, and file formats

Python has become a beloved language for text and file munging due to its simple syntax for interacting with files, intuitive data structures, and convenient features like tuple packing and unpacking. pandas features a number of functions for reading tabular data as a DataFrame object, though read_csv and read_table are likely the ones you’ll use the most.


In [30]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import sys
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)

In [2]:
%pwd


Out[2]:
u'/Users/pmui/datascience/lecture03.more.wrangling'

Reading and Writing Data in Text Format

Type inference is one of the more important features of these functions; that means you don’t have to specify which columns are numeric, integer, boolean, or string. Handling dates and other custom types requires a bit more effort, though. Let’s start with a small comma-separated (CSV) text file:


In [3]:
!cat ch06/ex1.csv


a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [31]:
df = pd.read_csv('ch06/ex1.csv')
df


Out[31]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
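
The type inference mentioned above is easy to check: inspect the resulting dtypes. For this file, the four numeric columns come back as integers while message stays as strings (object dtype):

In [ ]:
df.dtypes   # a, b, c, d -> int64; message -> object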

In [34]:
!cat ch06/test.csv


message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8

In [33]:
dfx = pd.read_csv('ch06/test.csv')
dfx


Out[33]:
message a b c d
0 hello 1 2 3 4
1 world 5 6 7 8

In [5]:
pd.read_table('ch06/ex1.csv', sep=',')


Out[5]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

A file will not always have a header row. Consider this file:


In [6]:
!cat ch06/ex2.csv


1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

To read this in, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:


In [36]:
pd.read_csv('ch06/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])


Out[36]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

In [8]:
pd.read_csv('ch06/ex2.csv', header=None)


Out[8]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Suppose you wanted the message column to be the index of the returned DataFrame. You can either indicate you want the column at index 4 or named 'message' using the index_col argument:


In [37]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('ch06/ex2.csv', names=names, index_col='message')


Out[37]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

In the event that you want to form a hierarchical index from multiple columns, just pass a list of column numbers or names:


In [10]:
!cat ch06/csv_mindex.csv
parsed = pd.read_csv('ch06/csv_mindex.csv', index_col=['key1', 'key2'])
parsed


key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
Out[10]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. In these cases, you can pass a regular expression as a delimiter for read_table. Consider a text file that looks like this:


In [11]:
list(open('ch06/ex3.txt'))


Out[11]:
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

While you could do some munging by hand, in this case fields are separated by a variable amount of whitespace. This can be expressed by the regular expression \s+, so we have:


In [12]:
result = pd.read_table('ch06/ex3.txt', sep='\s+')
result


Out[12]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Because there was one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame’s index in this special case.

The parser functions have many additional arguments to help you handle the wide variety of exceptional file formats that occur. For example, you can skip the first, third, and fourth rows of a file with skiprows:


In [13]:
!cat ch06/ex4.csv
pd.read_csv('ch06/ex4.csv', skiprows=[0, 2, 3])


# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Out[13]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
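
Since the skipped lines here are all comments, the comment option is an equivalent alternative in this particular case: any line beginning with the given character is dropped before parsing.

In [ ]:
pd.read_csv('ch06/ex4.csv', comment='#')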

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:


In [14]:
!cat ch06/ex5.csv
result = pd.read_csv('ch06/ex5.csv')
result
pd.isnull(result)


something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
Out[14]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

The na_values option can take either a list or set of strings to consider missing values:


In [15]:
result = pd.read_csv('ch06/ex5.csv', na_values=['NULL'])
result


Out[15]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

Different NA sentinels can be specified for each column in a dict:


In [16]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('ch06/ex5.csv', na_values=sentinels)


Out[16]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN
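
If you would rather pandas ignore its built-in sentinel list entirely and trust only yours, switch off keep_default_na; a sketch using the same file (empty fields then come through as empty strings rather than NaN):

In [ ]:
# only the strings in na_values are treated as missing now
pd.read_csv('ch06/ex5.csv', keep_default_na=False, na_values=['NA'])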

Reading text files in pieces

When processing very large files, or while figuring out the right set of arguments to correctly process a large file, you may want to read in only a small piece of the file or iterate through it in smaller chunks.


In [38]:
result = pd.read_csv('ch06/ex6.csv')
result


Out[38]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
5 1.817480 0.742273 0.419395 -2.251035 Q
6 -0.776764 0.935518 -0.332872 -1.875641 U
7 -0.913135 1.530624 -0.572657 0.477252 K
8 0.358480 -0.497572 -0.367016 0.507702 S
9 -1.740877 -1.160417 -1.637830 2.172201 G
10 0.240564 -0.328249 1.252155 1.072796 8
11 0.764018 1.165476 -0.639544 1.495258 R
12 0.571035 -0.310537 0.582437 -0.298765 1
13 2.317658 0.430710 -1.334216 0.199679 P
14 1.547771 -1.119753 -2.277634 0.329586 J
15 -1.310608 0.401719 -1.000987 1.156708 E
16 -0.088496 0.634712 0.153324 0.415335 B
17 -0.018663 -0.247487 -1.446522 0.750938 A
18 -0.070127 -1.579097 0.120892 0.671432 F
19 -0.194678 -0.492039 2.359605 0.319810 H
20 -0.248618 0.868707 -0.492226 -0.717959 W
21 -1.091549 -0.867110 -0.647760 -0.832562 C
22 0.641404 -0.138822 -0.621963 -0.284839 C
23 1.216408 0.992687 0.165162 -0.069619 V
24 -0.564474 0.792832 0.747053 0.571675 I
25 1.759879 -0.515666 -0.230481 1.362317 S
26 0.126266 0.309281 0.382820 -0.239199 L
27 1.334360 -0.100152 -0.840731 -0.643967 6
28 -0.737620 0.278087 -0.053235 -0.950972 J
29 -1.148486 -0.986292 -0.144963 0.124362 Y
... ... ... ... ... ...
9970 0.633495 -0.186524 0.927627 0.143164 4
9971 0.308636 -0.112857 0.762842 -1.072977 1
9972 -1.627051 -0.978151 0.154745 -1.229037 Z
9973 0.314847 0.097989 0.199608 0.955193 P
9974 1.666907 0.992005 0.496128 -0.686391 S
9975 0.010603 0.708540 -1.258711 0.226541 K
9976 0.118693 -0.714455 -0.501342 -0.254764 K
9977 0.302616 -2.011527 -0.628085 0.768827 H
9978 -0.098572 1.769086 -0.215027 -0.053076 A
9979 -0.019058 1.964994 0.738538 -0.883776 F
9980 -0.595349 0.001781 -1.423355 -1.458477 M
9981 1.392170 -1.396560 -1.425306 -0.847535 H
9982 -0.896029 -0.152287 1.924483 0.365184 6
9983 -2.274642 -0.901874 1.500352 0.996541 N
9984 -0.301898 1.019906 1.102160 2.624526 I
9985 -2.548389 -0.585374 1.496201 -0.718815 D
9986 -0.064588 0.759292 -1.568415 -0.420933 E
9987 -0.143365 -1.111760 -1.815581 0.435274 2
9988 -0.070412 -1.055921 0.338017 -0.440763 X
9989 0.649148 0.994273 -1.384227 0.485120 Q
9990 -0.370769 0.404356 -1.051628 -1.050899 8
9991 -0.409980 0.155627 -0.818990 1.277350 W
9992 0.301214 -1.111203 0.668258 0.671922 A
9993 1.821117 0.416445 0.173874 0.505118 X
9994 0.068804 1.322759 0.802346 0.223618 H
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns

If you want to read only a small number of rows (avoiding reading the entire file), specify that with nrows:


In [18]:
pd.read_csv('ch06/ex6.csv', nrows=5)


Out[18]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

To read a file in pieces, specify a chunksize as a number of rows:


In [40]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
chunker


Out[40]:
<pandas.io.parsers.TextFileReader at 0x1023d34d0>

The TextFileReader object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:


In [42]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)

tot = Series([], dtype='float64')  # give the empty Series an explicit dtype
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

In [43]:
tot[:10]


Out[43]:
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
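
TextFileReader also has a get_chunk method, which lets you read a piece of an arbitrary size rather than the fixed chunksize; a quick sketch:

In [ ]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)   # DataFrame holding the next 10 rows
piece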

Writing data out to text format

Data can also be exported to delimited format. Let’s consider one of the CSV files read above:


In [22]:
data = pd.read_csv('ch06/ex5.csv')
data


Out[22]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:


In [44]:
data.to_csv('ch06/out.csv')
!cat ch06/out.csv


,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Other delimiters can be used, of course (writing to sys.stdout so it just prints the text result; make sure to import sys):


In [24]:
data.to_csv(sys.stdout, sep='|')


|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:


In [46]:
data.to_csv(sys.stdout, na_rep='HELLO')


,something,a,b,c,d,message
0,one,1,2,3.0,4,HELLO
1,two,5,6,HELLO,8,world
2,three,9,10,11.0,12,foo

With no other options specified, both the row and column labels are written. Both of these can be disabled:


In [26]:
data.to_csv(sys.stdout, index=False, header=False)


one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

You can also write only a subset of the columns, and in an order of your choosing:


In [27]:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])


a,b,c
1,2,3.0
5,6,
9,10,11.0

Series also has a to_csv method:


In [28]:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7), index=dates)
ts.to_csv('ch06/tseries.csv')
!cat ch06/tseries.csv


2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

With a bit of wrangling (no header, first column as index), you can read a CSV version of a Series with read_csv, but there is also a from_csv convenience method that makes it a bit simpler:


In [29]:
Series.from_csv('ch06/tseries.csv', parse_dates=True)


Out[29]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64
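
The read_csv route mentioned above looks like the following; since from_csv was deprecated and later removed in newer pandas versions, this is the durable approach (note it returns a one-column DataFrame rather than a Series):

In [ ]:
pd.read_csv('ch06/tseries.csv', header=None, index_col=0, parse_dates=True)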

Manually working with delimited formats

Most forms of tabular data can be loaded from disk using functions like pandas.read_table. In some cases, however, manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table. To illustrate the basic tools, consider a small CSV file:


In [ ]:
!cat ch06/ex7.csv

In [ ]:
import csv
f = open('ch06/ex7.csv')

reader = csv.reader(f)

In [ ]:
# iterating through the reader yields lists of strings,
# with any quote characters removed
for line in reader:
    print(line)

In [ ]:
lines = list(csv.reader(open('ch06/ex7.csv')))
header, values = lines[0], lines[1:]
# zip(*values) transposes the rows into columns, pairing each with its header
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

In [ ]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

In [ ]:
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))

In [ ]:
%cat mydata.csv
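
Dialect options can also be passed as individual keywords to csv.reader or csv.writer without defining a subclass; a quick sketch reading back the file just written:

In [ ]:
with open('mydata.csv') as f:
    for row in csv.reader(f, delimiter=';'):
        print(row)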

JSON data


In [ ]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""

In [ ]:
import json
result = json.loads(obj)
result

In [ ]:
asjson = json.dumps(result)

In [ ]:
siblings = DataFrame(result['siblings'], columns=['name', 'age'])
siblings
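
pandas can also consume JSON directly with read_json; for a JSON array of objects, the default is to treat each object as a row. A minimal sketch with an inline string (wrapped in StringIO, since newer pandas versions warn on literal JSON strings):

In [ ]:
from io import StringIO
records = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
pd.read_json(StringIO(records))   # each object becomes a row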

XML and HTML, Web scraping

NB. The Yahoo! Finance API has changed and this example no longer works
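
For table extraction specifically, pandas.read_html (which uses lxml or html5lib under the hood) is a far less manual route than what follows: it parses every <table> on a page into a list of DataFrames. A sketch against a hypothetical page:

In [ ]:
# hypothetical URL; returns a list of DataFrames, one per <table> element
# tables = pd.read_html('http://example.com/page_with_tables.html')
# tables[0]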


In [ ]:
from lxml.html import parse
from urllib2 import urlopen   # Python 2; on Python 3 use: from urllib.request import urlopen

parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))

doc = parsed.getroot()

In [ ]:
links = doc.findall('.//a')
links[15:20]

In [ ]:
lnk = links[28]
lnk
lnk.get('href')
lnk.text_content()

In [ ]:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:]

In [ ]:
tables = doc.findall('.//table')
calls = tables[9]
puts = tables[13]

In [ ]:
rows = calls.findall('.//tr')

In [ ]:
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

In [ ]:
_unpack(rows[0], kind='th')
_unpack(rows[1], kind='td')

In [ ]:
from pandas.io.parsers import TextParser

def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()

In [ ]:
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)
call_data[:10]

Parsing XML with lxml.objectify


In [ ]:
%cd ch06/mta_perf/Performance_XML_Data

In [ ]:
!head -21 Performance_MNR.xml

In [ ]:
from lxml import objectify

path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

In [ ]:
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

# build one dict per <INDICATOR> element, skipping the fields above;
# child.pyval converts each leaf element to an appropriate Python type
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

In [ ]:
perf = DataFrame(data)
perf
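
XML data can get much more complicated than this example; each tag can hold metadata, too. Consider an HTML link tag, which is also valid XML; the next few cells inspect the root of such a tag:

In [ ]:
from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()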

In [ ]:
root

In [ ]:
root.get('href')

In [ ]:
root.text

Binary data formats


In [ ]:
cd ../..

In [ ]:
frame = pd.read_csv('ch06/ex1.csv')
frame
frame.to_pickle('ch06/frame_pickle')

In [ ]:
pd.read_pickle('ch06/frame_pickle')
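
Note that pickle is best treated as a short-term storage format only: there is no guarantee that a pickle written with one version of pandas will remain readable by a much later version.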

Using HDF5 format


In [ ]:
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

In [ ]:
store['obj1']

In [ ]:
store.close()
os.remove('mydata.h5')

Interacting with HTML and Web APIs


In [ ]:
import requests
url = 'https://api.github.com/repos/pydata/pandas/milestones/28/labels'
resp = requests.get(url)
resp

In [ ]:
data = resp.json()   # decode the JSON response body into Python objects
data[:5]

In [ ]:
issue_labels = DataFrame(data)
issue_labels

Interacting with databases


In [1]:
import sqlite3

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()

In [ ]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)
con.commit()

In [ ]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

In [ ]:
cursor.description

In [ ]:
# each entry in cursor.description is a tuple whose first field is the column name
DataFrame(rows, columns=[desc[0] for desc in cursor.description])

In [ ]:
import pandas.io.sql as sql
sql.read_sql('select * from test', con)
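
For databases beyond sqlite3, the usual route is an SQLAlchemy engine, which pandas.read_sql accepts as well; a minimal sketch, assuming SQLAlchemy is installed and that a 'test' table exists in the (hypothetical) database file shown:

In [ ]:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')   # hypothetical database file
pd.read_sql('select * from test', db)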