Author: 方跃文
Email: fyuewen@gmail.com
Timeline: writing began on March 6, 2018 and finished on July 27, 2018; reviewed once in November 2018.
If data could not be imported and exported, the tools introduced in this book would have nowhere to be used. In daily life and research there are many scenarios that call for reading and writing files.
Input and output typically fall into a few broad categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network resources via web APIs.
pandas provides a number of functions for reading tabular data into a DataFrame object. Table 6-1 summarizes them; among these, read_csv and read_table are the ones we will use most often.
Table 6-1. Parsing functions in pandas
Function | Description | Notes
---|---|---
read_csv | Load delimited data from a file, URL, or file-like object; uses a comma as the default delimiter |
read_table | Load delimited data from a file, URL, or file-like object; uses tab ('\t') as the default delimiter |
read_fwf | Read data in fixed-width column format (i.e., no delimiters) |
read_clipboard | Version of read_table that reads data from the clipboard; useful for converting tables from web pages |
read_excel | Read tabular data from an Excel XLS or XLSX file |
read_hdf | Read HDF5 files written by pandas |
read_html | Read all tables found in a given HTML document |
read_json | Read data from a JSON string representation |
read_msgpack | Read pandas data encoded using the MessagePack binary format |
read_pickle | Read an arbitrary object stored in Python pickle format | fun fact: "pickle" literally means preserved vegetables
read_sas | Read a SAS dataset stored in one of the SAS system's custom storage formats |
read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame |
read_stata | Read a dataset from Stata file format |
read_feather | Read the Feather binary file format |
One of the important features of these functions is type inference: we do not need to specify the data type of each column we read in. Dates and other custom data types do require some extra care, though, because custom data may not be so easy to infer. HDF5, Feather, and msgpack store the data types in the file itself, so reading them back in is even more convenient.
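As a small, hedged sketch: we can inspect the inferred dtypes and override the inference with the dtype argument. The file chapter06/ex1.csv is the one used just below, and the column name 'a' is an assumption about its header, so adjust it to the actual columns.

```python
import pandas as pd

df = pd.read_csv('chapter06/ex1.csv')
print(df.dtypes)  # inspect the dtypes pandas inferred for each column

# Override the inference explicitly; 'a' is a hypothetical column name here.
df = pd.read_csv('chapter06/ex1.csv', dtype={'a': 'float64'})
print(df.dtypes)
```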
Let's start with a small comma-separated (CSV) file:
In [7]:
!cat chapter06/ex1.csv
Since this file is comma-delimited, we can conveniently read it with read_csv:
In [9]:
import pandas as pd
df = pd.read_csv('chapter06/ex1.csv')
In [10]:
df
Out[10]:
In [ ]:
Of course, we could also use read_table, as long as we specify the delimiter as a comma:
In [11]:
import pandas as pd
df1 = pd.read_table('chapter06/ex1.csv', sep=',')
In [12]:
df1
Out[12]:
The file in the example above already contains a header row, but not all data files come with one, for example:
In [15]:
import pandas as pd
df1 = pd.read_table('chapter06/ex2.csv', sep=',')
df2 = pd.read_table('chapter06/ex2.csv', sep=',', header=None)
In [14]:
df1
Out[14]:
In [16]:
df2
Out[16]:
For df1 I did not specify header=None, so the first row of the raw file was wrongly used as the header. To avoid this mistake, header=None must be given explicitly.
Of course, we can also supply the column names ourselves:
In [18]:
import pandas as pd
df1 = pd.read_csv('chapter06/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
df1
Out[18]:
If, when reading the data, we want one of the columns to serve as the index, we can, for example, make the message column the index in the example above:
In [21]:
import pandas as pd
names = ['a', 'b', 'c', 'd', 'message']
df1 = pd.read_csv('chapter06/ex2.csv', names=names, index_col='message')
In [22]:
df1
Out[22]:
If we want to build a hierarchical index from multiple columns, we just pass a list of column numbers or column names.
Let's look at an example first.
In [23]:
!cat ./chapter06/csv_mindex.csv
In [24]:
import pandas as pd
df = pd.read_csv('chapter06/csv_mindex.csv', index_col=['key1', 'key2'])
In [25]:
df
Out[25]:
In practice, some tables are not delimited by commas or by evenly spaced whitespace, as in the following case:
In [27]:
list(open('chapter06/ex3.txt'))
Out[27]:
In the first column, the whitespace between A and B is noticeably wider than the whitespace between the numbers in columns two through five. If we still read the file with read_table's default separator, we get:
In [8]:
import pandas as pd
result = pd.read_table('chapter06/ex3.txt')
result
Out[8]:
Clearly the whitespace between A, B, and C was not recognized correctly, so A, B, and C all ended up in the same column. To fix this, we can pass a regular expression as the separator; here the regular expression \s+ does the job:
In [29]:
import pandas as pd
result = pd.read_table('chapter06/ex3.txt', sep=r'\s+')  # raw string so the regex is passed through unchanged
In [30]:
result
Out[30]:
We can also read the data in correctly by passing delim_whitespace=True:
In [13]:
import pandas as pd
result = pd.read_table('chapter06/ex3.txt', delim_whitespace=True)
"""
delim_whitespace: is a boolean, default False
Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep.
Equivalent to setting sep='\s+'.
"""
print(result)
In [ ]:
The parser functions in Table 6-1 have many additional arguments to help handle the wide variety of exceptional file formats that occur. Let me give an example: skiprows lets us skip the first few lines of a file, or any specific lines we name. This is quite useful, especially for people like me who like to write a few comments in the raw data file describing what the data contain.
In [31]:
!cat chapter06/ex4.csv
In [32]:
import pandas as pd
df = pd.read_csv('./chapter06/ex4.csv', skiprows=[0,2,3])
In [33]:
df
Out[33]:
As I mentioned in the chapter05 notes, real-world data quite often contains missing values. Missing data is usually either not shown at all (an empty field) or replaced by some sentinel value such as NA or NULL.
In [15]:
list(open('chapter06/ex5.csv'))
Out[15]:
In [21]:
import pandas as pd
import numpy as np
result = pd.read_csv('./chapter06/ex5.csv',sep=',')
result
Out[21]:
In [30]:
print(result['c'][1])
In [31]:
pd.isnull(result)
Out[31]:
In [ ]:
In [38]:
pd.notnull(result)
Out[38]:
The na_values option can take either a list or a set of strings to consider missing values. In other words, whenever a string listed in na_values appears in the raw data, it shows up as NaN in the pandas object. Note that if the raw data literally contains the text NA or NULL, those values are read in as plain strings and cannot be converted directly to floating point.
In [39]:
result = pd.read_csv('chapter06/ex5.csv', na_values = 'NULL') # note: the raw data has one spot with two consecutive commas; that is where the NULL (NaN) comes from.
# Because nothing was entered between those commas in the raw data, that empty value is still of floating-point type.
In [40]:
result
Out[40]:
We can use a dict to specify a different set of NA sentinels for each column, which makes it easy to declare, while reading the data, which values should count as null:
In [33]:
import pandas as pd
sentinels = {'message':['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('chapter06/ex5.csv', na_values = sentinels)
In [34]:
result
Out[34]:
In [35]:
pd.isnull(result)
Out[35]:
In [50]:
result = pd.read_csv('./chapter06/ex6.csv', nrows=5) # read only the first 5 rows of the raw data
In [52]:
result
Out[52]:
When inspecting the contents of a very large file in the notebook, it is best to limit the maximum number of rows displayed, so that not every row gets printed:
In [40]:
pd.options.display.max_rows = 8 # print at most 8 rows; the remaining rows are shown as ...
In [42]:
result = pd.read_csv('./chapter06/ex6.csv')
result
Out[42]:
In [ ]:
If we want to read the file in pieces, we can specify a chunksize:
In [53]:
chunker = pd.read_csv('chapter06/ex6.csv', chunksize = 1000)
In [54]:
chunker
Out[54]:
The TextParser object returned here allows us to iterate over the file chunk by chunk.
In [54]:
chunker = pd.read_csv('chapter06/ex6.csv', chunksize = 1000)
tot = pd.Series([])
for piece in chunker:
    # print(piece['key'].value_counts())
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
With that we have:
In [55]:
tot[:10]
Out[55]:
In [46]:
I had somewhat forgotten how value_counts works, so I wrote the simple example below as a reminder. The details are in the chapter05 notes.
In [76]:
pd.options.display.max_rows = 20
result = pd.DataFrame({"Values": [1,2,3,4,22,2,2,3,4]})
In [77]:
result
Out[77]:
In [78]:
result['Values'].value_counts()
Out[78]:
In [83]:
!cat chapter06/ex5.csv
In [2]:
import pandas as pd
data = pd.read_csv('chapter06/ex5.csv')
In [3]:
data
Out[3]:
In [4]:
data.to_csv('chapter06/out.csv')
In [82]:
!cat chapter06/out.csv
Compared with the raw data, notice that the output CSV file also contains the index.
In [80]:
import sys # with sys.stdout the data is printed directly to the screen instead of being written to another text file
data.to_csv(sys.stdout, sep='|')
In [86]:
data.to_csv(sys.stdout, sep='@')
Missing data appears as empty strings in the output above, which is not easy to spot. We might prefer to represent it with some other marker, for example NULL:
In [8]:
import pandas as pd
import sys
data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NULL')
In [10]:
import pandas as pd
import sys
data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN')
From the examples above we can see that, by default, both the index and the header are written out. We can also suppress them:
In [79]:
import pandas as pd
import sys
data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN', index=False, header=False)
We can also write out only a subset of the columns:
In [17]:
import pandas as pd
import sys
data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN', columns=['a', 'b','c'], index=False)
We can also write a Series to a text file:
In [19]:
import pandas as pd
import numpy as np
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
In [20]:
ts
Out[20]:
In [21]:
ts.to_csv('chapter06/tseries.csv')
In [22]:
!cat chapter06/tseries.csv
Method 1 for turning the CSV back into a Series: read it as a DataFrame first, then use loc to convert it to a Series.
In [98]:
pd.read_csv('chapter06/tseries.csv', parse_dates=True,header=None)
Out[98]:
In [101]:
result = pd.read_csv('chapter06/tseries.csv', parse_dates=True, header=None,index_col=0)
In [102]:
result
Out[102]:
In [103]:
x = result.loc[:,1]
In [104]:
x
Out[104]:
Method 2: this was the method used in the first edition of the original book, but it has since been deprecated.
In [83]:
df = pd.Series.from_csv('chapter06/tseries.csv')
# Series.from_csv has been DEPRECATED. That's why I used read_csv in the cells above.
In [84]:
df
Out[84]:
Method 3: use squeeze=True; this is the method I personally recommend. Note that, according to the official documentation, squeeze only returns a Series when the parsed data contains a single column:
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
In [92]:
result = pd.read_csv('chapter06/tseries.csv', parse_dates=True, header=None,index_col=0,squeeze=True)
In [89]:
type(result)
Out[89]:
In [93]:
result
Out[93]:
In [1]:
!cat chapter06/ex7.csv
In [1]:
#For any file with a single-character delimiter,
#we can use Python's built-in csv module. To use it, pass any open file
#or file-like object to csv.reader:
import csv
file = open('chapter06/ex7.csv')
reader = csv.reader(file)
In [3]:
type(reader)
Out[3]:
In [4]:
#Iterating through the reader like a file yields
#tuples of values with any quote characters removed
for line in reader:
    print(line)
From here on we can massage the data into whatever format we want. Let's work through an example step by step.
First, read the file into a list of lines:
In [6]:
import csv
with open('chapter06/ex7.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
In [7]:
header
Out[7]:
In [8]:
values
Out[8]:
In [13]:
#then we create a dictionary of data columns using a dictionary comprehension
#and the expression zip(*values), which transposes rows to columns
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
CSV files come in many different flavors. To define a CSV format with a particular delimiter, string-quoting convention, or line terminator, we can define a simple subclass of csv.Dialect:
In [ ]:
(I should come back to this csv-output part and review it more thoroughly at some point.)
In [27]:
import csv
file = open('chapter06/ex7.csv')

class my_dialect(csv.Dialect):
    lineterminator = '\n'   # note: csv.reader ignores lineterminator; it matters only when writing
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

data = csv.reader(file, dialect=my_dialect)
for line in data:
    print(line)
Of course, instead of subclassing, we can also specify the dialect parameters of the CSV file individually as keywords to csv.reader:
In [50]:
import csv
file = open('chapter06/ex7.csv')
reader1 = csv.reader(file, delimiter='|', quotechar=' ', lineterminator='\r\n')
for line1 in reader1:
    print(','.join(line1))
For more complicated files, or files that use a fixed multi-character delimiter, we may not be able to use the csv module. In those cases we have to fall back on line splitting and other cleanup steps, for example the string split method or the regular-expression function re.split, as in the sketch below.
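A hedged sketch (the sample line and the '::' delimiter below are made up for illustration): re.split can break apart fields that the csv module cannot handle.

```python
import re

line = 'a:: 1::  2.5 ::hello'          # hypothetical line with a multi-character delimiter
fields = re.split(r'::\s*', line)      # split on '::' followed by optional whitespace
print(fields)                          # ['a', '1', '2.5 ', 'hello'] -- further cleanup may still be needed
```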
To write delimited files manually, we can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:
In [59]:
import csv
class my_dialect(csv.Dialect):
    lineterminator = '\t\n' #\m
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

file = open('chapter06/ex7.csv')
with open('chapter06/mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
In [60]:
!cat chapter06/mydata.csv
In [1]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
} """
There are several Python libraries that can read or write JSON data, for example the built-in json module:
In [3]:
import json
result = json.loads(obj)
In [4]:
result
Out[4]:
In [5]:
list(result)
Out[5]:
In [6]:
result['places_lived']
Out[6]:
In [8]:
type(result)
Out[8]:
We can also convert the result object above back to JSON using json.dumps:
In [10]:
asjson = json.dumps(result)
In [11]:
asjson
Out[11]:
We can also pass a list of dicts to the DataFrame constructor and select a subset of the data fields:
In [13]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'], columns = ['name', 'age'])
In [14]:
siblings
Out[14]:
pandas.read_json can automatically convert JSON datasets into a Series or DataFrame, for example:
In [15]:
!cat chapter06/example.json
In [16]:
# import pandas as pd
import json
data = pd.read_json('chapter06/example.json')
data
Out[16]:
If we want to export data from pandas to JSON, one way is to use the to_json method available on Series and DataFrame:
In [17]:
print(data.to_json())
In [20]:
import pandas as pd
frame = pd.read_csv('chapter06/ex1.csv')
frame
Out[20]:
In [23]:
frame.to_pickle('chapter06/frame_pickle') #this will write the data to the binary file named frame_pickle
Using the pickle module we can read any 'pickled' object stored in a file, or, even more easily, we can use pandas.read_pickle, for example:
In [24]:
pd.read_pickle('chapter06/frame_pickle')
Out[24]:
Attention
pickle is recommended only as a short-term storage format, because the pickle data format may not remain stable over time: an object pickled today may fail to unpickle tomorrow with a newer version of the pickle library. The developers work hard to avoid this, but at some point in the future it may still be better to abandon the pickle format.
pandas has built-in support for two more binary data formats, HDF5 and MessagePack. Some other storage formats for pandas or NumPy data include:
bcolz: a compressible column-oriented binary format based on the Blosc compression library
Feather: a cross-language column-oriented file format; Feather uses the Apache Arrow columnar memory format
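A small, hedged sketch of the Feather format (this assumes the pyarrow or feather-format package is installed; the file name is made up):

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(100), 'b': range(100)})
frame.to_feather('chapter06/frame.feather')               # write the Feather file (needs pyarrow)
same_frame = pd.read_feather('chapter06/frame.feather')   # read it back
print(same_frame.head())
```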
HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages. "HDF" stands for "hierarchical data format". Each HDF5 file can store multiple datasets and supporting metadata.
Compared with simpler formats, HDF5 features on-the-fly compression with a variety of compression modes. HDF5 can be a good choice for working with very large datasets that don't fit into memory.
Although the PyTables and h5py libraries make it straightforward to read and write HDF5 files directly, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dict and handles the low-level details:
In [30]:
import pandas as pd
import numpy as np
frame = pd.DataFrame({'a': np.random.randn(1000)})
store = pd.HDFStore('chapter06/mydata.h5')
To solve the above problem, I installed several modules:
pip install --upgrade tables
pip install lxml
pip install wrapt
In [33]:
import pandas as pd
import numpy as np
frame = pd.DataFrame({'a': np.random.randn(1000)})
store = pd.HDFStore('chapter06/mydata.h5py')
In [36]:
store['obj1'] = frame # Store a DataFrame
In [37]:
store['obj1_col'] = frame['a'] # Store a Series
In [38]:
store
Out[38]:
In [ ]:
Objects contained in the HDF5 file can then be retrieved with the
same dict-like API:
In [48]:
x = store['obj1']
type(x)
Out[48]:
In [49]:
y = store['obj1_col']
type(y)
Out[49]:
In [39]:
list(store)
Out[39]:
HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:
In [50]:
store.put('obj2', frame, format='table')
In [51]:
store.select('obj2', where=['index >=10 and index <= 15'])
Out[51]:
In [52]:
list(store)
Out[52]:
In [53]:
store.close()
put is really just the explicit form of store['obj2'] = frame, but it lets us pass extra options, such as the storage format, to control how the data we want to store is laid out.
The pandas.read_hdf function gives us a shortcut to these tools:
In [55]:
frame.to_hdf('chapter06/mydata.h5py', 'obj3', format='table')
In [57]:
pd.read_hdf('chapter06/mydata.h5py', 'obj3', where=['index < 5'])
Out[57]:
Note
If the data we process is stored on remote servers, it can make sense to use a binary format designed for distributed storage, such as Apache Parquet. These formats are still evolving.
If the data mostly lives locally, it is worth exploring whether PyTables and h5py can also meet our needs, and how their performance compares with pd.HDFStore.
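A minimal sketch of Parquet, assuming a pandas version that provides to_parquet/read_parquet and that the pyarrow (or fastparquet) engine is installed; the file name is made up:

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(1000)})
frame.to_parquet('chapter06/frame.parquet')              # needs pyarrow or fastparquet
print(pd.read_parquet('chapter06/frame.parquet').head())
```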
pandas supports reading tabular data in Excel files using either the ExcelFile class or pandas.read_excel function. These tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. We may need to install these manually with pip or conda.
To use ExcelFile, create an instance by passing a path to an xls or xlsx file:
In [61]:
import pandas as pd
xlsx = pd.ExcelFile('chapter06/ex1.xlsx')
# Data stored in a sheet can then be read into DataFrame with parse:
pd.read_excel(xlsx, 'Sheet1')
Out[61]:
For Excel workbooks with multiple sheets, the faster approach is the one shown above: create the ExcelFile first. Alternatively, the file path can simply be passed directly to read_excel; the first approach is the recommended one. A small sketch of reading several sheets at once follows below.
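A hedged sketch of reading every sheet at once (the number and names of the sheets depend on the actual workbook, and sheet_name=None requires a reasonably recent pandas, so treat this as illustrative):

```python
import pandas as pd

xlsx = pd.ExcelFile('chapter06/ex1.xlsx')
print(xlsx.sheet_names)                                  # list the sheets in the workbook

# sheet_name=None asks read_excel to return a dict of {sheet name: DataFrame}
all_sheets = pd.read_excel('chapter06/ex1.xlsx', sheet_name=None)
for name, sheet in all_sheets.items():
    print(name, sheet.shape)
```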
In [62]:
# Here is an example of passing the path directly to read_excel
frame = pd.read_excel('chapter06/ex1.xlsx', 'Sheet1')
In [63]:
frame
Out[63]:
In [65]:
writer = pd.ExcelWriter('chapter06/ex2.xlsx') # internally this is handled by the openpyxl module
In [66]:
frame.to_excel(writer, 'Sheet1')
In [67]:
writer.save()
In [68]:
#We can also pass a file path directly to to_excel:
frame.to_excel('chapter06/ex2.xlsx')
In [69]:
!open chapter06/ex2.xlsx
In [45]:
import requests
url = "https://api.github.com/repos/pandas-dev/pandas/issues"
resp = requests.get(url)
In [3]:
resp
Out[3]:
The "Response" object's json method will return a dictionary contaning JSON parsed into native Python objects:
In [46]:
data = resp.json()
In [47]:
data[0]['url']
Out[47]:
In [49]:
data[0]['title']
Out[49]:
Each element of data is a dictionary containing all of the information found on a GitHub issue (except the comments). We can pass this data directly to DataFrame and extract the fields of interest:
In [53]:
import pandas as pd
issues = pd.DataFrame(data, columns = ['number', 'title', 'labels', 'state'])
In [54]:
issues
Out[54]:
In [ ]:
Now let's see what information requests can dig up from my own GitHub page.
In [34]:
import requests
url = 'https://api.github.com/yw-fang'
resp = requests.get(url)
In [35]:
data = resp.json()
In [39]:
data['message']
Out[39]:
In [41]:
data['documentation_url']
Out[41]:
In [43]:
textdata = resp.text
In [44]:
textdata
Out[44]:
A lot of data is usually stored in databases. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I'll create a SQLite database using Python's built-in sqlite3 driver.
In [56]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
In [57]:
con = sqlite3.connect('mydata.sqlite')
In [58]:
con.execute(query)
Out[58]:
In [60]:
con.commit()
Now, let's insert a few rows of data:
In [61]:
data = [('At', 'Home', 1.25, 6),
('Out', 'Plane', 2.6, 3),
('In', 'Bottle', 1.7, 5)]
In [62]:
stmt = "INSERT INTO test VALUES (?, ?, ?, ?)"
In [63]:
con.executemany(stmt, data)
Out[63]:
In [64]:
con.commit()
Most Python SQL drivers return a list of tuples when selecting data from a table:
In [65]:
cursor = con.execute('select * from test')
In [66]:
rows = cursor.fetchall()
In [67]:
rows
Out[67]:
We can pass this list of tuples to the DataFrame constructor, but we also need the column names, which are contained in the cursor's description attribute:
In [68]:
cursor.description
Out[68]:
In [72]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
Out[72]:
The approach above is rather cumbersome, because we have to issue a query to the database every time. The SQLAlchemy project offers a more convenient abstraction, and pandas has a read_sql function that lets us read data easily from a generic SQLAlchemy connection. Below is an example of using SQLAlchemy to connect to the SQLite database created above and read data from it:
In [76]:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
In [77]:
pd.read_sql('select * from test', db)
Out[77]:
In [ ]: