Python has become a beloved language for text and file munging due to its simple syntax for interacting with files, intuitive data structures, and convenient features like tuple packing and unpacking. pandas features a number of functions for reading tabular data as a DataFrame object. Table 6-1 has a summary of all of them, though read_csv and read_table are likely the ones you’ll use the most.
In [30]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import sys
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)
Type inference is one of the more important features of these functions; that means you don’t have to specify which columns are numeric, integer, boolean, or string. Handling dates and other custom types requires a bit more effort, though. Let’s start with a small comma-separated (CSV) text file:
In [3]:
!cat ch06/ex1.csv
In [31]:
df = pd.read_csv('ch06/ex1.csv')
df
Out[31]:
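Because of type inference, we didn’t have to tell pandas which columns are numeric or strings. A quick check of the inferred types (a sketch against the frame just read):
In [ ]:
# Numeric columns come back as int64/float64; strings become object
df.dtypes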
We could also have used read_table and specified the delimiter:
In [5]:
pd.read_table('ch06/ex1.csv', sep=',')
Out[5]:
A file will not always have a header row. Consider this file:
In [6]:
!cat ch06/ex2.csv
To read this in, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:
In [36]:
pd.read_csv('ch06/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[36]:
In [8]:
pd.read_csv('ch06/ex2.csv', header=None)
Out[8]:
Suppose you wanted the message column to be the index of the returned DataFrame. You can either indicate you want the column at index 4 or named 'message' using the index_col argument:
In [37]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('ch06/ex2.csv', names=names, index_col='message')
Out[37]:
In the event that you want to form a hierarchical index from multiple columns, just pass a list of column numbers or names:
In [10]:
!cat ch06/csv_mindex.csv
parsed = pd.read_csv('ch06/csv_mindex.csv', index_col=['key1', 'key2'])
parsed
Out[10]:
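The resulting MultiIndex supports partial indexing; selecting on the outer level pulls out all rows under that key (a quick sketch, assuming key1 contains the value 'one' as in the file above):
In [ ]:
# Partial indexing with the outer index level
parsed.loc['one']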
In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. In these cases, you can pass a regular expression as a delimiter for read_table. Consider a text file that looks like this:
In [11]:
list(open('ch06/ex3.txt'))
Out[11]:
While you could do some munging by hand, in this case fields are separated by a variable amount of whitespace. This can be expressed by the regular expression \s+, so we have then:
In [12]:
result = pd.read_table('ch06/ex3.txt', sep=r'\s+')
result
Out[12]:
Because there was one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame’s index in this special case.
The parser functions have many additional arguments to help you handle the wide variety of exceptional file formats that occur. For example, you can skip the first, third, and fourth rows of a file with skiprows:
In [13]:
!cat ch06/ex4.csv
pd.read_csv('ch06/ex4.csv', skiprows=[0, 2, 3])
Out[13]:
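If the rows to skip are comment lines marked with a fixed character, the comment option is an alternative (a sketch, assuming the skipped lines in ex4.csv all begin with '#'):
In [ ]:
# Drop any line beginning with '#' instead of hard-coding row numbers
pd.read_csv('ch06/ex4.csv', comment='#')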
Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:
In [14]:
!cat ch06/ex5.csv
result = pd.read_csv('ch06/ex5.csv')
result
pd.isnull(result)
Out[14]:
The na_values option can take either a list or set of strings to consider missing values:
In [15]:
result = pd.read_csv('ch06/ex5.csv', na_values=['NULL'])
result
Out[15]:
Different NA sentinels can be specified for each column in a dict:
In [16]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('ch06/ex5.csv', na_values=sentinels)
Out[16]:
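To have only your own sentinels treated as missing, disabling the built-in list, you can additionally pass keep_default_na=False (an assumption that your pandas version includes this argument):
In [ ]:
# Only the strings listed in na_values are treated as NA here
pd.read_csv('ch06/ex5.csv', na_values=['NULL'], keep_default_na=False)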
When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.
In [38]:
result = pd.read_csv('ch06/ex6.csv')
result
Out[38]:
If you want to only read out a small number of rows (avoiding reading the entire file), specify that with nrows:
In [18]:
pd.read_csv('ch06/ex6.csv', nrows=5)
Out[18]:
To read out a file in pieces, specify a chunksize as a number of rows:
In [40]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
chunker
Out[40]:
The TextParser object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:
In [42]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
tot = Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
In [43]:
tot[:10]
Out[43]:
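TextParser is also equipped with a get_chunk method that lets you read pieces of an arbitrary size:
In [ ]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
# Read just the next 10 rows of the file
piece = chunker.get_chunk(10)
piece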
Data can also be exported to delimited format. Let’s consider one of the CSV files read above:
In [22]:
data = pd.read_csv('ch06/ex5.csv')
data
Out[22]:
Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:
In [44]:
data.to_csv('ch06/out.csv')
!cat ch06/out.csv
Other delimiters can be used, of course (writing to sys.stdout so it just prints the text result; make sure to import sys):
In [24]:
data.to_csv(sys.stdout, sep='|')
Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:
In [46]:
data.to_csv(sys.stdout, na_rep='HELLO')
With no other options specified, both the row and column labels are written. Both of these can be disabled:
In [26]:
data.to_csv(sys.stdout, index=False, header=False)
You can also write only a subset of the columns, and in an order of your choosing:
In [27]:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
Series also has a to_csv method:
In [28]:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7), index=dates)
ts.to_csv('ch06/tseries.csv')
!cat ch06/tseries.csv
With a bit of wrangling (no header, first column as index), you can read a CSV version of a Series with read_csv, but there is also a from_csv convenience method that makes it a bit simpler:
In [29]:
Series.from_csv('ch06/tseries.csv', parse_dates=True)
Out[29]:
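The read_csv equivalent of that call looks roughly like this (a sketch; with no header row, column 0 becomes the index and the lone data column is named 1):
In [ ]:
# Read the Series back manually: no header, dates in the first column
frame = pd.read_csv('ch06/tseries.csv', header=None, index_col=0, parse_dates=True)
frame[1]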
Most forms of tabular data can be loaded from disk using functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table. To illustrate the basic tools, consider a small CSV file:
In [ ]:
!cat ch06/ex7.csv
In [ ]:
import csv
f = open('ch06/ex7.csv')
reader = csv.reader(f)
In [ ]:
for line in reader:
    print(line)
In [ ]:
lines = list(csv.reader(open('ch06/ex7.csv')))
header, values = lines[0], lines[1:]
# zip(*values) transposes the rows into columns
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
To define a new file format with a different delimiter, string quoting convention, or line terminator, we define a simple subclass of csv.Dialect:
In [ ]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
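Individual CSV dialect parameters can also be given as keywords to csv.reader without having to define a subclass:
In [ ]:
# Pass format options directly as keyword arguments
reader = csv.reader(open('ch06/ex7.csv'), delimiter=',')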
To write delimited files manually, you can use csv.writer, which accepts the same dialect and format options as csv.reader:
In [ ]:
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
In [ ]:
%cat mydata.csv
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. Consider this example:
In [ ]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
{"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""
In [ ]:
import json
result = json.loads(obj)
result
In [ ]:
asjson = json.dumps(result)
In [ ]:
siblings = DataFrame(result['siblings'], columns=['name', 'age'])
siblings
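Later pandas releases also include native JSON helpers, to_json and read_json (an assumption about your installed version):
In [ ]:
# Round-trip the siblings frame through pandas’s JSON support
siblings.to_json('ch06/siblings.json', orient='records')
pd.read_json('ch06/siblings.json', orient='records')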
Python has many libraries for reading and writing data in HTML and XML formats; lxml is one that parses large files quickly. To get started, find the URL containing the data you want, then parse the stream with lxml.html:
In [ ]:
from lxml.html import parse
from urllib.request import urlopen  # urllib2 in Python 2
parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))
doc = parsed.getroot()
In [ ]:
links = doc.findall('.//a')
links[15:20]
In [ ]:
lnk = links[28]
lnk
lnk.get('href')
lnk.text_content()
In [ ]:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:]
In [ ]:
tables = doc.findall('.//table')
calls = tables[9]
puts = tables[13]
In [ ]:
rows = calls.findall('.//tr')
In [ ]:
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]
In [ ]:
_unpack(rows[0], kind='th')
_unpack(rows[1], kind='td')
In [ ]:
from pandas.io.parsers import TextParser

def parse_options_data(table):
    rows = table.findall('.//tr')
    # the first row contains the header cells (th elements)
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    # TextParser handles automatic type conversion of the string data
    return TextParser(data, names=header).get_chunk()
In [ ]:
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)
call_data[:10]
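Newer pandas versions wrap this sort of table scraping in pandas.read_html, which returns a list of DataFrames, one per table found (an assumption about your version; it requires lxml or html5lib, and the page must serve plain HTML tables):
In [ ]:
# Each <table> element on the page becomes one DataFrame in the list
dfs = pd.read_html('http://finance.yahoo.com/q/op?s=AAPL+Options')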
XML (extensible markup language) is another common structured data format supporting hierarchical, nested data. Here we look at an XML file of New York MTA performance data, parsing it with lxml.objectify:
In [ ]:
%cd ch06/mta_perf/Performance_XML_Data
In [ ]:
!head -21 Performance_MNR.xml
In [ ]:
from lxml import objectify
path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
In [ ]:
data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
In [ ]:
perf = DataFrame(data)
perf
XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:
In [ ]:
from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
You can now access any of the fields (like href) in the tag or the link text:
In [ ]:
root
root.get('href')
root.text
In [ ]:
%cd ../..
One of the easiest ways to store data efficiently in binary format is using pandas objects’ built-in to_pickle method:
In [ ]:
frame = pd.read_csv('ch06/ex1.csv')
frame
frame.to_pickle('ch06/frame_pickle')
In [ ]:
pd.read_pickle('ch06/frame_pickle')
HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. pandas’s HDFStore class provides a dict-like interface for storing pandas objects in an HDF5 file (it requires PyTables):
In [ ]:
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store
In [ ]:
store['obj1']
In [ ]:
store.close()
os.remove('mydata.h5')
Many websites have public APIs providing data feeds via JSON or some other format. One easy-to-use way to access these from Python is the requests package. Here we request a set of GitHub issue labels:
In [ ]:
import requests
url = 'https://api.github.com/repos/pydata/pandas/milestones/28/labels'
resp = requests.get(url)
resp
The response object’s text attribute contains the JSON payload, which we can parse into Python objects:
In [ ]:
data = json.loads(resp.text)
data[:5]
In [ ]:
issue_labels = DataFrame(data)
issue_labels
Loading data from SQL into a DataFrame is fairly straightforward. As an example, we create an in-memory SQLite database using Python’s built-in sqlite3 driver:
In [1]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
In [ ]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
In [ ]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
In [ ]:
cursor.description
In [ ]:
DataFrame(rows, columns=[x[0] for x in cursor.description])
In [ ]:
import pandas.io.sql as sql
sql.read_sql('select * from test', con)
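The same function is also available at the top level as pd.read_sql:
In [ ]:
# Equivalent top-level call
pd.read_sql('select * from test', con)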