Reading Notes

Author: Fang Yuewen

Email: fyuewen@gmail.com

Timeline: started on March 6, 2018; finished on July 27, 2018; reviewed once in November 2018.

Chapter 6: Data Loading, Storage, and File Formats

If data could not be imported and exported, the tools introduced in this book would have nowhere to shine. In everyday work and research there are plenty of scenarios that involve reading and writing files.

Input and output typically fall into a few broad categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network resources via web APIs.

Reading and Writing Data in Text Format

pandas provides a number of functions for reading tabular data into a DataFrame object. Table 6-1 summarizes them; read_csv and read_table are the ones we will use most often.

Table 6-1. Parsing functions in pandas

Function        Description
read_csv        Load delimited data from a file, URL, or file-like object; uses a comma as the default delimiter
read_table      Load delimited data from a file, URL, or file-like object; uses tab ('\t') as the default delimiter
read_fwf        Read data in fixed-width column format (i.e., no delimiters)
read_clipboard  Version of read_table that reads data from the clipboard; useful for converting tables from web pages
read_excel      Read tabular data from an Excel XLS or XLSX file
read_hdf        Read HDF5 files written by pandas
read_html       Read all tables found in the given HTML document
read_json       Read data from a JSON string representation
read_msgpack    Read pandas data encoded using the MessagePack binary format
read_pickle     Read an arbitrary object stored in Python pickle format (side note: "pickle" also means pickled vegetables!)
read_sas        Read a SAS dataset stored in one of the SAS system's custom storage formats
read_sql        Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata      Read a dataset from Stata file format
read_feather    Read the Feather binary file format

One of the important features of these functions is type inference: we do not have to specify the type of each column being read. Dates and other custom types do require a bit more care, though, because custom data may not be inferred to the right type so easily.

HDF5, Feather, and msgpack store the data types in the file itself, which makes reading them even more convenient.
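Type inference can also be steered explicitly when needed. Below is a minimal sketch of doing so on the ex1.csv file introduced just below; the dtype mapping and the date-column name mentioned in the comment are my own illustrative choices, not from the book.

import pandas as pd

# Force specific column types instead of relying on inference.
df = pd.read_csv('chapter06/ex1.csv', dtype={'a': 'float64', 'd': str})
print(df.dtypes)   # 'a' becomes float64 and 'd' becomes object, rather than the inferred int64

# For date-like columns, parse_dates asks pandas to build datetime64 values, e.g.
# pd.read_csv(path, parse_dates=['some_date_column'])   # hypothetical column name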

Let's start with a small comma-separated CSV file:


In [7]:
!cat chapter06/ex1.csv


a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Since this text is comma-delimited, we can conveniently use read_csv to read it in:


In [9]:
import pandas as pd

df = pd.read_csv('chapter06/ex1.csv')

In [10]:
df


Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Of course, we could also use read_table; we just need to specify the comma as the delimiter:

In [11]:
import pandas as pd

df1 = pd.read_table('chapter06/ex1.csv', sep=',')

In [12]:
df1


Out[12]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

The file in the example above already includes a header row, but not every dataset comes with one. For example:


In [15]:
import pandas as pd

df1 = pd.read_table('chapter06/ex2.csv', sep=',')
df2 = pd.read_table('chapter06/ex2.csv', sep=',', header=None)

In [14]:
df1


Out[14]:
1 2 3 4 hello
0 5 6 7 8 world
1 9 10 11 12 foo

In [16]:
df2


Out[16]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

For df1 I did not pass header=None, so the first row of the file was wrongly used as the header. To avoid this mistake, header=None must be given explicitly.

We can also supply the column names ourselves:


In [18]:
import pandas as pd

df1 = pd.read_csv('chapter06/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
df1


Out[18]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

If we want one of the columns to serve as the index while reading the data, we can, for the example above, make the message column the index:


In [21]:
import pandas as pd

names = ['a', 'b', 'c', 'd', 'message']
df1 = pd.read_csv('chapter06/ex2.csv', names=names, index_col='message')

In [22]:
df1


Out[22]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

If we want to build a hierarchical index from multiple columns, we simply pass a list of column numbers or column names.

Let's look at an example first.


In [23]:
!cat ./chapter06/csv_mindex.csv


key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

In [24]:
import pandas as pd

df = pd.read_csv('chapter06/csv_mindex.csv', index_col=['key1', 'key2'])

In [25]:
df


Out[25]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

In practice, some tables are not separated by commas, nor by whitespace of a fixed width, as in the following case:


In [27]:
list(open('chapter06/ex3.txt'))


Out[27]:
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

In the header line, the whitespace between A and B is clearly wider than the whitespace between the numbers in the data rows. If we simply call read_table with its default separator, we get:


In [8]:
import pandas as pd

result = pd.read_table('chapter06/ex3.txt')
result


Out[8]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Clearly the whitespace between A, B, and C was not recognized as a separator, so everything landed in a single column. To handle this we can pass a regular expression as the delimiter; here the regular expression \s+ does the job:


In [29]:
import pandas as pd

result = pd.read_table('chapter06/ex3.txt', sep='\s+')

In [30]:
result


Out[30]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

We can also use delim_whitespace=True to read the data in correctly:


In [13]:
import pandas as pd

result = pd.read_table('chapter06/ex3.txt', delim_whitespace=True)


"""
delim_whitespace: is a boolean, default False
Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. 
Equivalent to setting sep='\s+'. 
"""
print(result)


            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

The parser functions in Table 6-1 have many additional arguments to help handle the wide variety of exceptional file formats that occur. Let me give an example: we can use skiprows to skip the first few lines of a file, or any particular lines we choose. This is quite useful, especially for people like me who like to write a few comments inside the raw data to describe what it contains.


In [31]:
!cat chapter06/ex4.csv


# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [32]:
import pandas as pd

df = pd.read_csv('./chapter06/ex4.csv', skiprows=[0,2,3])

In [33]:
df


Out[33]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

As I mentioned in the chapter05 notes, real-world data is very likely to contain missing values. Missing data is usually either not represented at all (an empty string) or marked by some sentinel value such as NA or NULL.


In [15]:
list(open('chapter06/ex5.csv'))


Out[15]:
['something,a,b,c,d,message\n',
 'one,1,2,3,4,NA\n',
 'two,5,6,,8,world\n',
 'three,9,10,11,12,foo']

In [21]:
import pandas as pd
import numpy as np
result = pd.read_csv('./chapter06/ex5.csv',sep=',')
result


Out[21]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

In [30]:
print(result['c'][1])


nan

In [31]:
pd.isnull(result)


Out[31]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

In [38]:
pd.notnull(result)


Out[38]:
something a b c d message
0 True True True True True False
1 True True True False True True
2 True True True True True True

The na_values option can take either a list or a set of strings to consider as missing values: whenever one of the strings given to na_values appears in the raw data, it shows up as NaN in the resulting pandas object. Note that if the raw data literally contains NA or NULL, those are read as plain strings; they are not automatically converted to floating point.


In [39]:
result = pd.read_csv('chapter06/ex5.csv', na_values='NULL')  # note the two consecutive commas in the raw data; that is where the empty field is
# since nothing was entered between those commas, the empty value is still read as floating point (NaN)

In [40]:
result


Out[40]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

We can use a dict to specify a different set of NA sentinels for each column, which makes it easy to declare which values count as null while reading the data:


In [33]:
import pandas as pd
sentinels = {'message':['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('chapter06/ex5.csv', na_values = sentinels)

In [34]:
result


Out[34]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

In [35]:
pd.isnull(result)


Out[35]:
something a b c d message
0 False False False False False True
1 True False False True False False
2 False False False False False True

Reading Text Files in Pieces

For very large files, we usually want to read in only a small piece, or to iterate through the file in small chunks.


In [50]:
result = pd.read_csv('./chapter06/ex6.csv', nrows=5)  # read only the first 5 rows of the raw data

In [52]:
result


Out[52]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

When inspecting a very large file in the notebook, it is best to limit the maximum number of rows displayed so that not every row gets printed:


In [40]:
pd.options.display.max_rows = 8  # show at most 8 rows; the remaining rows are collapsed into ...

In [42]:
result = pd.read_csv('./chapter06/ex6.csv')
result


Out[42]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
... ... ... ... ... ...
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns


To read a file in pieces, we can specify a chunksize:


In [53]:
chunker = pd.read_csv('chapter06/ex6.csv', chunksize = 1000)

In [54]:
chunker


Out[54]:
<pandas.io.parsers.TextFileReader at 0x10c8b5c18>

The TextParser object returned here allows us to iterate over the file chunk by chunk.


In [54]:
chunker = pd.read_csv('chapter06/ex6.csv', chunksize = 1000)
tot = pd.Series([])
for piece in chunker:
#     print(piece['key'].value_counts())
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    
tot = tot.sort_values(ascending=False)

This gives us:


In [55]:
tot[:10]


Out[55]:
E    368.0
X    364.0
L    346.0
O    343.0
     ...  
J    337.0
F    335.0
K    334.0
H    330.0
Length: 10, dtype: float64
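Besides plain iteration, the reader returned when chunksize is given also has a get_chunk method, which reads an arbitrary number of rows at a time. A minimal sketch:

import pandas as pd

chunker = pd.read_csv('chapter06/ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)   # pull the next 10 rows as a DataFrame
print(piece.shape)              # (10, 5)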

I had half forgotten how value_counts works, so I wrote the simple example below as a reminder. See the chapter05 notes for details.


In [76]:
pd.options.display.max_rows = 20
result = pd.DataFrame({"Values": [1,2,3,4,22,2,2,3,4]})

In [77]:
result


Out[77]:
Values
0 1
1 2
2 3
3 4
4 22
5 2
6 2
7 3
8 4

In [78]:
result['Values'].value_counts()


Out[78]:
2     3
4     2
3     2
22    1
1     1
Name: Values, dtype: int64

Writing Data to Text Format

So far we have only covered reading text files, not writing them. Let's look at some simple examples first:


In [83]:
!cat chapter06/ex5.csv


something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [2]:
import pandas as pd

data = pd.read_csv('chapter06/ex5.csv')

In [3]:
data


Out[3]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

In [4]:
data.to_csv('chapter06/out.csv')

In [82]:
!cat chapter06/out.csv


,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Compared with the original file, notice that the output CSV also contains the index.


In [80]:
import sys  # using sys.stdout, so data is printed to the screen instead of being written to another file
data.to_csv(sys.stdout, sep='|')


|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

In [86]:
data.to_csv(sys.stdout, sep='@')


@something@a@b@c@d@message
0@one@1@2@3.0@4@
1@two@5@6@@8@world
2@three@9@10@11.0@12@foo

Missing values appear above as empty strings, which may not be easy to spot. We might prefer them to be written out as some other sentinel, for example NULL:


In [8]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NULL')


,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

In [10]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN')


,something,a,b,c,d,message
0,one,1,2,3.0,4,NAN
1,two,5,6,NAN,8,world
2,three,9,10,11.0,12,foo

From the examples above we can see that, by default, both the index and the header are written out. Both can also be disabled:


In [79]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN', index=False, header=False)


one,1,2,3.0,4,NAN
two,5,6,NAN,8,world
three,9,10,11.0,12,foo

In addition, we can write out only a subset of the columns:


In [17]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN', columns=['a', 'b','c'], index=False)


a,b,c
1,2,3.0
5,6,NAN
9,10,11.0

A Series can be written to text as well:


In [19]:
import pandas as pd
import numpy as np

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)

In [20]:
ts


Out[20]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Freq: D, dtype: int64

In [21]:
ts.to_csv('chapter06/tseries.csv')

In [22]:
!cat chapter06/tseries.csv


2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

Method 1 for converting the CSV back into a Series: read it as a DataFrame first, then use loc to pull out a Series.


In [98]:
pd.read_csv('chapter06/tseries.csv', parse_dates=True,header=None)


Out[98]:
0 1
0 2000-01-01 0
1 2000-01-02 1
2 2000-01-03 2
3 2000-01-04 3
4 2000-01-05 4
5 2000-01-06 5
6 2000-01-07 6

In [101]:
result = pd.read_csv('chapter06/tseries.csv', parse_dates=True, header=None,index_col=0)

In [102]:
result


Out[102]:
1
0
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 5
2000-01-07 6

In [103]:
x = result.loc[:,1]

In [104]:
x


Out[104]:
0
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Name: 1, dtype: int64

Method 2: this is the approach from the first edition of the book, but it has since been deprecated.


In [83]:
df = pd.Series.from_csv('chapter06/tseries.csv')
# Series.from_csv has been DEPRECATED. That's why I use read_csv in the cells above.


/Users/ywfang/miniconda3/envs/plotenv/lib/python3.6/site-packages/pandas/core/series.py:2890: FutureWarning: from_csv is deprecated. Please use read_csv(...) instead. Note that some of the default arguments are different, so please refer to the documentation for from_csv when changing your function calls
  infer_datetime_format=infer_datetime_format)

In [84]:
df


Out[84]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64

Method 3: use squeeze=True; this is the approach I personally recommend most. Note, however, that according to the official documentation squeeze only returns a Series when the parsed data contains a single column:

squeeze : boolean, default False

If the parsed data only contains one column then return a Series


In [92]:
result = pd.read_csv('chapter06/tseries.csv', parse_dates=True, header=None,index_col=0,squeeze=True)

In [89]:
type(result)


Out[89]:
pandas.core.series.Series

In [93]:
result


Out[93]:
0
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Name: 1, dtype: int64

Working with delimited formats

Although we can use functions like pandas.read_table to load most forms of tabular data, manual processing is sometimes necessary. In practice, the data we receive may well contain some malformed lines that functions such as read_table cannot parse.

To illustrate the basic tools, let's look at a few simple examples.


In [1]:
!cat chapter06/ex7.csv


"a","b","c"
"1","2","3"
"1","2","3","4"

In [1]:
# For any file with a single-character delimiter,
# we can use the csv module. To use it, pass any open file
# or file-like object to csv.reader

import csv

file = open('chapter06/ex7.csv')

reader = csv.reader(file)

In [3]:
type(reader)


Out[3]:
_csv.reader

In [4]:
# Iterating through the reader like a file yields
# lists of values with any quote characters removed

for line in reader:
    print(line)


['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']

From here we can wrangle the data into whatever form we want. Let's walk through an example step by step.

First, read the file into a list of lines:


In [6]:
import csv
with open('chapter06/ex7.csv') as f:
    lines = list(csv.reader(f))

header, values = lines[0], lines[1:]

In [7]:
header


Out[7]:
['a', 'b', 'c']

In [8]:
values


Out[8]:
[['1', '2', '3'], ['1', '2', '3', '4']]

In [13]:
# then we create a dictionary of data columns using a dictionary comprehension
# and the expression zip(*values), which transposes rows to columns

data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)


{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV files come in many different flavors. To define a format with a particular delimiter, string quoting convention, or line terminator, we can define a simple subclass of csv.Dialect:


(Note to self: revisit and reinforce this CSV output material at some point in the future.)

In [27]:
import csv
file = open('chapter06/ex7.csv')

class my_dialect(csv.Dialect):
    lineterminator = '\m'
    delimiter = ';' 
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
    
data = csv.reader(file, dialect=my_dialect)

for line in data:
    print(line)


['a,"b","c"']
['1,"2","3"']
['1,"2","3","4"']

Of course, instead of subclassing, we can also pass the individual CSV dialect parameters directly as keywords:


In [50]:
import csv
file = open('chapter06/ex7.csv')
reader1 = csv.reader(file, delimiter='|', quotechar=' ', lineterminator='\r\n')
for line1 in reader1:
    print(','.join(line1))


"a","b","c"
"1","2","3"
"1","2","3","4"

For more complicated files, or files whose delimiter is a fixed string of several characters, the csv module cannot be used. In those cases we have to fall back on line splitting and other cleanup, such as the string split method or the regular-expression function re.split, as sketched below.
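For instance, a line that uses a two-character separator such as '::' (a made-up example) could be taken apart with re.split:

import re

line = 'aaa::-0.264438::-1.026059::-0.619500'   # hypothetical line with a '::' delimiter
fields = re.split(r'::', line)
print(fields)   # ['aaa', '-0.264438', '-1.026059', '-0.619500']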

To write delimited files manually, we can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:


In [59]:
import csv


class my_dialect(csv.Dialect):
    lineterminator = '\t\n'  #\m
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

file = open('chapter06/ex7.csv')
with open('chapter06/mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))

In [60]:
!cat chapter06/mydata.csv


one;two;three	
1;2;3	
4;5;6	
7;8;9	

JSON data

JSON, short for JavaScript Object Notation, has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV.


In [1]:
obj = """
    {"name": "Wes",
     "places_lived": ["United States", "Spain", "Germany"],
     "pet": null,
     "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
                  {"name": "Katie", "age": 38,
                   "pets": ["Sixes", "Stache", "Cisco"]}]
} """

There are several Python libraries that can read or write JSON data. For example, the built-in json module:


In [3]:
import json
result = json.loads(obj)

In [4]:
result


Out[4]:
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [5]:
list(result)


Out[5]:
['name', 'places_lived', 'pet', 'siblings']

In [6]:
result['places_lived']


Out[6]:
['United States', 'Spain', 'Germany']

In [8]:
type(result)


Out[8]:
dict

We can also revert the result object above back to JSON data using json.dumps:


In [10]:
asjson = json.dumps(result)

In [11]:
asjson


Out[11]:
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

We can also pass a list of dicts to the DataFrame constructor and select a subset of the data fields:


In [13]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'], columns = ['name', 'age'])

In [14]:
siblings


Out[14]:
name age
0 Scott 30
1 Katie 38

pandas.read_json can automatically convert JSON datasets into a Series or DataFrame. For example:


In [15]:
!cat chapter06/example.json


[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]

In [16]:
# import pandas as pd
import json 

data = pd.read_json('chapter06/example.json')
data


Out[16]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9

If we want to export data from pandas to JSON, one way is to use the to_json method available on Series and DataFrame:


In [17]:
print(data.to_json())


{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
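By default to_json produces the column-oriented layout shown above; passing orient='records' writes one JSON object per row instead. A small sketch reusing example.json:

import pandas as pd

data = pd.read_json('chapter06/example.json')
print(data.to_json(orient='records'))
# [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]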

Binary Data Formats

The simplest way to store data in binary format is to use Python's built-in pickle serialization. All pandas objects have a to_pickle method that writes the data to disk in pickle format.


In [20]:
import pandas as pd

frame = pd.read_csv('chapter06/ex1.csv')
frame


Out[20]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

In [23]:
frame.to_pickle('chapter06/frame_pickle')  #this will write the data to the binary file named frame_pickle

Using the pickle module, we can read any 'pickled' object stored in a file, or, even more easily, we can use pandas.read_pickle. For example:


In [24]:
pd.read_pickle('chapter06/frame_pickle')


Out[24]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Attention

pickle is recommended only as a short-term storage format, because the pickle format may not be stable over time: an object pickled today may fail to unpickle tomorrow under a newer version of the library. The developers work hard to avoid this, but at some point in the future it may still be better to abandon the pickle format.

pandas has built-in support for two more binary data formats: HDF5 and MessagePack. Some other storage formats for pandas or NumPy data include:

  • bcolz

A compressible column-oriented binary format based on the Blosc compression library

  • Feather

A cross-language column-oriented file format. Feather uses the Apache Arrow columnar memory format.

Using HDF5 Format

HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages. "HDF" stands for "hierarchical data format". Each HDF5 file can store multiple datasets and supporting metadata.

Compared with simpler formats, HDF5 features on-the-fly compression with a variety of compression modes. HDF5 can be a good choice for working with very large datasets that don't fit into memory.
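Because the compression happens on the fly, both HDFStore and to_hdf accept complevel and complib arguments. A minimal sketch, assuming PyTables is installed; the file name compressed.h5 is just an example:

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(1000)})
# complevel ranges from 0 (no compression) to 9; complib selects the compressor, e.g. 'zlib' or 'blosc'
frame.to_hdf('chapter06/compressed.h5', 'obj_comp', complevel=9, complib='zlib')
print(pd.read_hdf('chapter06/compressed.h5', 'obj_comp').head())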

Although HDF5 files can be read and written directly with the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dict and handles the low-level details:


In [30]:
import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(1000)})
store = pd.HDFStore('chapter06/mydata.h5')


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/miniconda3/envs/plotenv/lib/python3.6/site-packages/pandas/io/pytables.py in __init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    444         try:
--> 445             import tables  # noqa
    446         except ImportError as ex:  # pragma: no cover

ModuleNotFoundError: No module named 'tables'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-30-40e32b04374b> in <module>()
      3 
      4 frame = pd.DataFrame({'a': np.random.randn(1000)})
----> 5 store = pd.HDFStore('chapter06/mydata.h5')

~/miniconda3/envs/plotenv/lib/python3.6/site-packages/pandas/io/pytables.py in __init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    446         except ImportError as ex:  # pragma: no cover
    447             raise ImportError('HDFStore requires PyTables, "{ex}" problem '
--> 448                               'importing'.format(ex=str(ex)))
    449 
    450         if complib is not None and complib not in tables.filters.all_complibs:

ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing

To solve the above problem, I installed several modules:

  • pip install --upgrade tables

  • pip install lxml

  • pip install wrapt


In [33]:
import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(1000)})
store = pd.HDFStore('chapter06/mydata.h5py')

In [36]:
store['obj1'] = frame  # Store a DataFrame

In [37]:
store['obj1_col'] = frame['a']  # Store a Series

In [38]:
store


Out[38]:
<class 'pandas.io.pytables.HDFStore'>
File path: chapter06/mydata.h5py

Objects contained in the HDF5 file can then be retrieved with the same dict-like API:

In [48]:
x = store['obj1']
type(x)


Out[48]:
pandas.core.frame.DataFrame

In [49]:
y = store['obj1_col']
type(y)


Out[49]:
pandas.core.series.Series

In [39]:
list(store)


Out[39]:
['/obj1', '/obj1_col']

HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:


In [50]:
store.put('obj2', frame, format='table')

In [51]:
store.select('obj2', where=['index >=10 and index <= 15'])


Out[51]:
a
10 -0.342778
11 -0.204930
12 -0.761121
13 1.199154
14 0.040380
15 0.670995

In [52]:
list(store)


Out[52]:
['/obj1', '/obj1_col', '/obj2']

In [53]:
store.close()

put is the explicit version of store['obj2'] = frame; it simply lets us pass extra options, such as the storage format, to control how the data is stored.

The pandas.read_hdf function gives us a shortcut to these tools:


In [55]:
frame.to_hdf('chapter06/mydata.h5py', 'obj3', format='table')

In [57]:
pd.read_hdf('chapter06/mydata.h5py', 'obj3', where=['index < 5'])


Out[57]:
a
0 1.475226
1 0.352356
2 0.835724
3 -1.079280
4 0.224576

Note

If the data we work with is stored on remote servers, a binary format designed for distributed storage, such as Apache Parquet, may be worth using. These formats are still evolving.
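pandas already ships read_parquet / to_parquet wrappers; a rough sketch, assuming either the pyarrow or the fastparquet engine is installed and using a made-up file name:

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(5)})
frame.to_parquet('chapter06/frame.parquet')        # requires pyarrow or fastparquet
print(pd.read_parquet('chapter06/frame.parquet'))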

If most of the data lives locally, it is worth exploring whether PyTables or h5py can also meet our needs, and how their performance compares with pd.HDFStore.
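For comparison, here is a tiny h5py sketch, assuming h5py is installed; unlike HDFStore it stores plain NumPy arrays rather than DataFrames, and the file name raw_data.h5 is made up:

import numpy as np
import h5py

arr = np.random.randn(1000)
with h5py.File('chapter06/raw_data.h5', 'w') as f:
    f.create_dataset('a', data=arr, compression='gzip')   # write one compressed dataset
with h5py.File('chapter06/raw_data.h5', 'r') as f:
    print(f['a'][:10])                                     # read the first 10 values back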

Reading Microsoft Excel Files

read excel

pandas supports reading tabular data in Excel files using either the ExcelFile class or pandas.read_excel function. These tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. We may need to install these manually with pip or conda.

To use ExcelFile, create an instance by passing a path to an xls or xlsx file:


In [61]:
import pandas as pd

xlsx = pd.ExcelFile('chapter06/ex1.xlsx')

# Data stored in a sheet can then be read into DataFrame with parse:
pd.read_excel(xlsx, 'Sheet1')


Out[61]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

For Excel workbooks with multiple sheets, the faster approach is to create the ExcelFile first, as above. Alternatively, the file path can simply be passed straight to read_excel; the former approach is preferred.


In [62]:
# an example of passing the path directly to read_excel

frame = pd.read_excel('chapter06/ex1.xlsx', 'Sheet1')

In [63]:
frame


Out[63]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
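If a workbook contains several sheets, a reasonably recent pandas also accepts sheet_name=None in read_excel, which returns a dict of DataFrames keyed by sheet name. A small sketch on the same file:

import pandas as pd

sheets = pd.read_excel('chapter06/ex1.xlsx', sheet_name=None)   # None means: read every sheet
for name, df in sheets.items():
    print(name, df.shape)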

write data to excel

To write pandas data to Excel format, first create an ExcelWriter, then write data to it using the pandas objects' to_excel method:


In [65]:
writer = pd.ExcelWriter('chapter06/ex2.xlsx')  # internally this is handled by the openpyxl module

In [66]:
frame.to_excel(writer, 'Sheet1')

In [67]:
writer.save()

In [68]:
# we can also pass a file path directly to to_excel:
frame.to_excel('chapter06/ex2.xlsx')

In [69]:
!open chapter06/ex2.xlsx

Interacting with web APIs

Many websites provide APIs that let others retrieve data from them. There are many ways to access such data; one of them is the requests package.

For example, we can use requests to fetch the 30 most recent GitHub issues of the pandas library:


In [45]:
import requests

url = "https://api.github.com/repos/pandas-dev/pandas/issues"
resp = requests.get(url)

In [3]:
resp


Out[3]:
<Response [200]>

The Response object's json method will return a dictionary containing JSON parsed into native Python objects:


In [46]:
data = resp.json()

In [47]:
data[0]['url']


Out[47]:
'https://api.github.com/repos/pandas-dev/pandas/issues/22077'

In [49]:
data[0]['title']


Out[49]:
' "TypeError: unorderable types" in Python3 when index values are dict keys of tuples or tuples with non-homogeneous dtypes'

Each element of data is a dictionary containing all of the information found on a GitHub issue page (except the comments). We can pass data directly to DataFrame and extract the fields of interest:


In [53]:
import pandas as pd

issues = pd.DataFrame(data, columns = ['number', 'title', 'labels', 'state'])

In [54]:
issues


Out[54]:
number title labels state
0 22077 "TypeError: unorderable types" in Python3 whe... [{'id': 49747336, 'node_id': 'MDU6TGFiZWw0OTc0... open
1 22076 Feature-Request: Allow user defined categories... [] open
2 22075 CLN/STYLE: Lint comprehensions [{'id': 106935113, 'node_id': 'MDU6TGFiZWwxMDY... open
3 22074 [Bug] Fix various DatetimeIndex comparison bugs [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
4 22073 BUG: Column-major DataFrames stored in HDFStor... [{'id': 47229190, 'node_id': 'MDU6TGFiZWw0NzIy... open
5 22072 fix for TypeError: unorderable types" in when ... [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
6 22071 Remove unused, avoid uses of deprecated api [{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE... open
7 22070 Split fastpath IntegerArray constructor and ge... [{'id': 849023693, 'node_id': 'MDU6TGFiZWw4NDk... open
8 22069 BUG: groupby.plot.kde applies to index as well [{'id': 76865106, 'node_id': 'MDU6TGFiZWw3Njg2... open
9 22068 De-duplicate dispatch code, remove unreachable... [{'id': 57296398, 'node_id': 'MDU6TGFiZWw1NzI5... open
10 22067 TypeError in pandas._libs.tslibs.period._ordin... [{'id': 42670965, 'node_id': 'MDU6TGFiZWw0MjY3... open
11 22066 DOC: consistent docstring for compression kwarg [{'id': 2301354, 'node_id': 'MDU6TGFiZWwyMzAxM... open
12 22065 Handling NaN during resampling [{'id': 2822342, 'node_id': 'MDU6TGFiZWwyODIyM... open
13 22064 TST: suppress deprecation warnings for compress [{'id': 87485152, 'node_id': 'MDU6TGFiZWw4NzQ4... open
14 22063 PROPOSAL: use datetime64 and timedelta64 to eval [] open
15 22060 Lost of index information by operation for zer... [{'id': 195647922, 'node_id': 'MDU6TGFiZWwxOTU... open
16 22058 DOC: Reword doc for filepath_or_buffer in read... [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
17 22057 read_csv Documentation Rendering Poorly [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
18 22056 value_counts bin limit truncation [{'id': 35818298, 'node_id': 'MDU6TGFiZWwzNTgx... open
19 22055 Feature Request: Ability to parse comments [{'id': 307649777, 'node_id': 'MDU6TGFiZWwzMDc... open
20 22054 fix raise of TypeError when subtracting timede... [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
21 22053 df.groupby(cols).tshift(0, freq) Drops Index L... [{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj... open
22 22052 pd.testing.assert_frame_equal check_like not w... [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
23 22051 where method results in a different outputs wh... [{'id': 31404521, 'node_id': 'MDU6TGFiZWwzMTQw... open
24 22049 DEPR: deprecated IntervalIndex.itemsize, remov... [{'id': 150096370, 'node_id': 'MDU6TGFiZWwxNTA... open
25 22047 Breaking change multiplication bool and int [] open
26 22046 Replacing multiple columns (or just one) with ... [] open
27 22045 [documentation] merge(suffixes=(False, False))... [{'id': 57296398, 'node_id': 'MDU6TGFiZWw1NzI5... open
28 22044 Multi-index and CategoricalIndex performance [{'id': 71268330, 'node_id': 'MDU6TGFiZWw3MTI2... open
29 22043 [WIP] Imprecise indexer [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=... open

Now let's see what information requests can dig up from my own GitHub page.


In [34]:
import requests

url = 'https://api.github.com/yw-fang'
resp = requests.get(url)

In [35]:
data = resp.json()

In [39]:
data['message']


Out[39]:
'Not Found'

In [41]:
data['documentation_url']


Out[41]:
'https://developer.github.com/v3'

In [43]:
textdata = resp.text

In [44]:
textdata


Out[44]:
'{"message":"Not Found","documentation_url":"https://developer.github.com/v3"}'
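The 'Not Found' reply is simply because the URL above is missing the users/ path segment; GitHub's user profiles live under /users/<username>. A quick sketch, assuming network access and an unexhausted rate limit:

import requests

resp = requests.get('https://api.github.com/users/yw-fang')
data = resp.json()
print(resp.status_code)    # 200 when the profile is found
print(sorted(data)[:5])    # a few of the keys in the profile JSON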

Interacting with databases

Much data in practice is stored in databases. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I'll create a SQLite database using Python's built-in sqlite3 driver.


In [56]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""

In [57]:
con = sqlite3.connect('mydata.sqlite')

In [58]:
con.execute(query)


Out[58]:
<sqlite3.Cursor at 0x10d6dbb90>

In [60]:
con.commit()

Now, let's insert a few rows of data:


In [61]:
data = [('At', 'Home', 1.25, 6),
       ('Out', 'Plane', 2.6, 3),
       ('In', 'Bottle', 1.7, 5)]

In [62]:
stmt = "INSERT INTO test VALUES (?, ?, ?, ?)"

In [63]:
con.executemany(stmt, data)


Out[63]:
<sqlite3.Cursor at 0x10d6dbab0>

In [64]:
con.commit()

Most Python SQL drivers return a list of tuples when selecting data from a table:


In [65]:
cursor = con.execute('select * from test')

In [66]:
rows = cursor.fetchall()

In [67]:
rows


Out[67]:
[('At', 'Home', 1.25, 6), ('Out', 'Plane', 2.6, 3), ('In', 'Bottle', 1.7, 5)]

We can pass the list of tuples to the DataFrame constructor, but we also need the column names, which are contained in the cursor's description attribute:


In [68]:
cursor.description


Out[68]:
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [72]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])


Out[72]:
a b c d
0 At Home 1.25 6
1 Out Plane 2.60 3
2 In Bottle 1.70 5

The approach above is rather cumbersome, since every query has to be assembled by hand. The SQLAlchemy project provides more convenient abstractions, and pandas has a read_sql function that lets us read data easily from a general SQLAlchemy connection. Below is an example of using SQLAlchemy to connect to the SQLite database created above and read data from it.


In [76]:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')

In [77]:
pd.read_sql('select * from test', db)


Out[77]:
a b c d
0 At Home 1.25 6
1 Out Plane 2.60 3
2 In Bottle 1.70 5
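Going in the other direction, DataFrame.to_sql writes a frame into a database table through the same engine. A minimal sketch, where the table name new_test is made up:

import pandas as pd
import sqlalchemy as sqla

db = sqla.create_engine('sqlite:///mydata.sqlite')
frame = pd.read_sql('select * from test', db)
frame.to_sql('new_test', db, if_exists='replace', index=False)   # create or replace the table
print(pd.read_sql('select * from new_test', db))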

Conclusion

Getting access to data is often the first step in data analysis. In this chapter we covered a number of ways to do that. In the chapters that follow, we will dig deeper into the data itself and start analyzing and visualizing it.

