Reading Notes

Author: Fang Yuewen

Email: fyuewen@gmail.com

Timeline: started on March 6, 2018; finished on July 27, 2018; reviewed once in November 2018.

Chapter 6: Data Loading, Storage, and File Formats

If data could not be imported and exported, the tools introduced in this book would have nowhere to shine. In everyday work and research there are plenty of scenarios that involve reading and writing files.

Input and output typically fall into a few broad categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network resources via web APIs.

Reading and Writing Data in Text Format

pandas provides a number of functions for reading tabular data into a DataFrame object. Table 6-1 summarizes them; read_csv and read_table are the ones we will use most often.

Table 6-1. Parsing functions in pandas

Function        Description
read_csv        Load delimited data from a file, URL, or file-like object; uses a comma as the default delimiter
read_table      Load delimited data from a file, URL, or file-like object; uses tab ('\t') as the default delimiter
read_fwf        Read data in fixed-width column format (i.e., no delimiters)
read_clipboard  Version of read_table that reads data from the clipboard; useful for converting tables from web pages
read_excel      Read tabular data from an Excel XLS or XLSX file
read_hdf        Read HDF5 files written by pandas
read_html       Read all tables found in the given HTML document
read_json       Read data from a JSON string representation
read_msgpack    Read pandas data encoded using the MessagePack binary format
read_pickle     Read an arbitrary object stored in Python pickle format (side note: "pickle" also means pickled vegetables!)
read_sas        Read a SAS dataset stored in one of the SAS system's custom storage formats
read_sql        Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata      Read a dataset from Stata file format
read_feather    Read the Feather binary file format

One of the important features of these functions is type inference: we do not have to specify the type of each column being read. Dates and other custom types do require a bit more care, though, because custom data may not be inferred to the right type so easily.

HDF5, Feather, and msgpack store the data types in the file itself, which makes reading them even more convenient.
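Type inference can also be steered explicitly when needed. Below is a minimal sketch of doing so on the ex1.csv file introduced just below; the dtype mapping and the date-column name mentioned in the comment are my own illustrative choices, not from the book.

import pandas as pd

# Force specific column types instead of relying on inference.
df = pd.read_csv('chapter06/ex1.csv', dtype={'a': 'float64', 'd': str})
print(df.dtypes)   # 'a' becomes float64 and 'd' becomes object, rather than the inferred int64

# For date-like columns, parse_dates asks pandas to build datetime64 values, e.g.
# pd.read_csv(path, parse_dates=['some_date_column'])   # hypothetical column name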

Let's start with a small comma-separated CSV file:


In [7]:
!cat chapter06/ex1.csv


a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Since this text is comma-delimited, we can conveniently use read_csv to read it in:


In [9]:
import pandas as pd

df = pd.read_csv('chapter06/ex1.csv')

In [10]:
df


Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Of course, we could also use read_table; we just need to specify the comma as the delimiter:

In [11]:
import pandas as pd

df1 = pd.read_table('chapter06/ex1.csv', sep=',')

In [12]:
df1


Out[12]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

The file in the example above already includes a header row, but not every dataset comes with one. For example:


In [15]:
import pandas as pd

df1 = pd.read_table('chapter06/ex2.csv', sep=',')
df2 = pd.read_table('chapter06/ex2.csv', sep=',', header=None)

In [14]:
df1


Out[14]:
1 2 3 4 hello
0 5 6 7 8 world
1 9 10 11 12 foo

In [16]:
df2


Out[16]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

For df1 I did not pass header=None, so the first row of the file was wrongly used as the header. To avoid this mistake, header=None must be given explicitly.

We can also supply the column names ourselves:


In [18]:
import pandas as pd

df1 = pd.read_csv('chapter06/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
df1


Out[18]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

If we want one of the columns to serve as the index while reading the data, we can, for the example above, make the message column the index:


In [21]:
import pandas as pd

names = ['a', 'b', 'c', 'd', 'message']
df1 = pd.read_csv('chapter06/ex2.csv', names=names, index_col='message')

In [22]:
df1


Out[22]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

If we want to build a hierarchical index from multiple columns, we simply pass a list of column numbers or column names.

Let's look at an example first.


In [23]:
!cat ./chapter06/csv_mindex.csv


key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

In [24]:
import pandas as pd

df = pd.read_csv('chapter06/csv_mindex.csv', index_col=['key1', 'key2'])

In [25]:
df


Out[25]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

In practice, some tables are not separated by commas, nor by whitespace of a fixed width, as in the following case:


In [27]:
list(open('chapter06/ex3.txt'))


Out[27]:
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

In the header line, the whitespace between A and B is clearly wider than the whitespace between the numbers in the data rows. If we simply call read_table with its default separator, we get:


In [8]:
import pandas as pd

result = pd.read_table('chapter06/ex3.txt')
result


Out[8]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Clearly the whitespace between A, B, and C was not recognized as a separator, so everything landed in a single column. To handle this we can pass a regular expression as the delimiter; here the regular expression \s+ does the job:


In [29]:
import pandas as pd

result = pd.read_table('chapter06/ex3.txt', sep='\s+')

In [30]:
result


Out[30]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

We can also use delim_whitespace=True to read the data in correctly:


In [13]:
import pandas as pd

result = pd.read_table('chapter06/ex3.txt', delim_whitespace=True)


"""
delim_whitespace: is a boolean, default False
Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. 
Equivalent to setting sep='\s+'. 
"""
print(result)


            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

The parser functions in Table 6-1 have many additional arguments to help handle the wide variety of exceptional file formats that occur. Let me give an example: we can use skiprows to skip the first few lines of a file, or any particular lines we choose. This is quite useful, especially for people like me who like to write a few comments inside the raw data to describe what it contains.


In [31]:
!cat chapter06/ex4.csv


# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [32]:
import pandas as pd

df = pd.read_csv('./chapter06/ex4.csv', skiprows=[0,2,3])

In [33]:
df


Out[33]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

As I mentioned in the chapter05 notes, real-world data is very likely to contain missing values. Missing data is usually either not represented at all (an empty string) or marked by some sentinel value such as NA or NULL.


In [15]:
list(open('chapter06/ex5.csv'))


Out[15]:
['something,a,b,c,d,message\n',
 'one,1,2,3,4,NA\n',
 'two,5,6,,8,world\n',
 'three,9,10,11,12,foo']

In [21]:
import pandas as pd
import numpy as np
result = pd.read_csv('./chapter06/ex5.csv',sep=',')
result


Out[21]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

In [30]:
print(result['c'][1])


nan

In [31]:
pd.isnull(result)


Out[31]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

In [38]:
pd.notnull(result)


Out[38]:
something a b c d message
0 True True True True True False
1 True True True False True True
2 True True True True True True

The na_values option can take either a list or a set of strings to consider as missing values: whenever one of the strings given to na_values appears in the raw data, it shows up as NaN in the resulting pandas object. Note that if the raw data literally contains NA or NULL, those are read as plain strings; they are not automatically converted to floating point.


In [39]:
result = pd.read_csv('chapter06/ex5.csv', na_values='NULL')  # note the two consecutive commas in the raw data; that is where the empty field is
# since nothing was entered between those commas, the empty value is still read as floating point (NaN)

In [40]:
result


Out[40]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

We can use a dict to specify a different set of NA sentinels for each column, which makes it easy to declare which values count as null while reading the data:


In [33]:
import pandas as pd
sentinels = {'message':['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('chapter06/ex5.csv', na_values = sentinels)

In [34]:
result


Out[34]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

In [35]:
pd.isnull(result)


Out[35]:
something a b c d message
0 False False False False False True
1 True False False True False False
2 False False False False False True

Reading Text Files in Pieces

For very large files, we usually want to read in only a small piece, or to iterate through the file in small chunks.


In [50]:
result = pd.read_csv('./chapter06/ex6.csv', nrows=5)  # read only the first 5 rows of the raw data

In [52]:
result


Out[52]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

When inspecting a very large file in the notebook, it is best to limit the maximum number of rows displayed so that not every row gets printed:


In [40]:
pd.options.display.max_rows = 8  # show at most 8 rows; the remaining rows are collapsed into ...

In [42]:
result = pd.read_csv('./chapter06/ex6.csv')
result


Out[42]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
... ... ... ... ... ...
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns


To read a file in pieces, we can specify a chunksize:


In [53]:
chunker = pd.read_csv('chapter06/ex6.csv', chunksize = 1000)

In [54]:
chunker


Out[54]:
<pandas.io.parsers.TextFileReader at 0x10c8b5c18>

The TextParser object returned here allows us to iterate over the file chunk by chunk.


In [54]:
chunker = pd.read_csv('chapter06/ex6.csv', chunksize = 1000)
tot = pd.Series([])
for piece in chunker:
#     print(piece['key'].value_counts())
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    
tot = tot.sort_values(ascending=False)

This gives us:


In [55]:
tot[:10]


Out[55]:
E    368.0
X    364.0
L    346.0
O    343.0
     ...  
J    337.0
F    335.0
K    334.0
H    330.0
Length: 10, dtype: float64
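Besides plain iteration, the reader returned when chunksize is given also has a get_chunk method, which reads an arbitrary number of rows at a time. A minimal sketch:

import pandas as pd

chunker = pd.read_csv('chapter06/ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)   # pull the next 10 rows as a DataFrame
print(piece.shape)              # (10, 5)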

I had half forgotten how value_counts works, so I wrote the simple example below as a reminder. See the chapter05 notes for details.


In [76]:
pd.options.display.max_rows = 20
result = pd.DataFrame({"Values": [1,2,3,4,22,2,2,3,4]})

In [77]:
result


Out[77]:
Values
0 1
1 2
2 3
3 4
4 22
5 2
6 2
7 3
8 4

In [78]:
result['Values'].value_counts()


Out[78]:
2     3
4     2
3     2
22    1
1     1
Name: Values, dtype: int64

Writing Data to Text Format

So far we have only covered reading text files, not writing them. Let's look at some simple examples first:


In [83]:
!cat chapter06/ex5.csv


something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [2]:
import pandas as pd

data = pd.read_csv('chapter06/ex5.csv')

In [3]:
data


Out[3]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

In [4]:
data.to_csv('chapter06/out.csv')

In [82]:
!cat chapter06/out.csv


,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Compared with the original file, notice that the output CSV also contains the index.


In [80]:
import sys  # using sys.stdout, so data is printed to the screen instead of being written to another file
data.to_csv(sys.stdout, sep='|')


|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

In [86]:
data.to_csv(sys.stdout, sep='@')


@something@a@b@c@d@message
0@one@1@2@3.0@4@
1@two@5@6@@8@world
2@three@9@10@11.0@12@foo

Missing values appear above as empty strings, which may not be easy to spot. We might prefer them to be written out as some other sentinel, for example NULL:


In [8]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NULL')


,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

In [10]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN')


,something,a,b,c,d,message
0,one,1,2,3.0,4,NAN
1,two,5,6,NAN,8,world
2,three,9,10,11.0,12,foo

From the examples above we can see that, by default, both the index and the header are written out. Both can also be disabled:


In [79]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN', index=False, header=False)


one,1,2,3.0,4,NAN
two,5,6,NAN,8,world
three,9,10,11.0,12,foo

In addition, we can write out only a subset of the columns:


In [17]:
import pandas as pd
import sys

data = pd.read_csv('chapter06/ex5.csv')
data.to_csv(sys.stdout, na_rep='NAN', columns=['a', 'b','c'], index=False)


a,b,c
1,2,3.0
5,6,NAN
9,10,11.0

A Series can be written to text as well:


In [19]:
import pandas as pd
import numpy as np

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)

In [20]:
ts


Out[20]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Freq: D, dtype: int64

In [21]:
ts.to_csv('chapter06/tseries.csv')

In [22]:
!cat chapter06/tseries.csv


2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

Method 1 for converting the CSV back into a Series: read it as a DataFrame first, then use loc to pull out a Series.


In [98]:
pd.read_csv('chapter06/tseries.csv', parse_dates=True,header=None)


Out[98]:
0 1
0 2000-01-01 0
1 2000-01-02 1
2 2000-01-03 2
3 2000-01-04 3
4 2000-01-05 4
5 2000-01-06 5
6 2000-01-07 6

In [101]:
result = pd.read_csv('chapter06/tseries.csv', parse_dates=True, header=None,index_col=0)

In [102]:
result


Out[102]:
1
0
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 5
2000-01-07 6

In [103]:
x = result.loc[:,1]

In [104]:
x


Out[104]:
0
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Name: 1, dtype: int64

Method 2: this is the approach from the first edition of the book, but it has since been deprecated.


In [83]:
df = pd.Series.from_csv('chapter06/tseries.csv')
# Series.from_csv has been DEPRECATED. That's why I use read_csv in the cells above.


/Users/ywfang/miniconda3/envs/plotenv/lib/python3.6/site-packages/pandas/core/series.py:2890: FutureWarning: from_csv is deprecated. Please use read_csv(...) instead. Note that some of the default arguments are different, so please refer to the documentation for from_csv when changing your function calls
  infer_datetime_format=infer_datetime_format)

In [84]:
df


Out[84]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64

Method 3: use squeeze=True; this is the approach I personally recommend most. Note, however, that according to the official documentation squeeze only returns a Series when the parsed data contains a single column:

squeeze : boolean, default False

If the parsed data only contains one column then return a Series


In [92]:
result = pd.read_csv('chapter06/tseries.csv', parse_dates=True, header=None,index_col=0,squeeze=True)

In [89]:
type(result)


Out[89]:
pandas.core.series.Series

In [93]:
result


Out[93]:
0
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Name: 1, dtype: int64

Working with delimited formats

Although we can use functions like pandas.read_table to load most forms of tabular data, manual processing is sometimes necessary. In practice, the data we receive may well contain some malformed lines that functions such as read_table cannot parse.

To illustrate the basic tools, let's look at a few simple examples.


In [1]:
!cat chapter06/ex7.csv


"a","b","c"
"1","2","3"
"1","2","3","4"

In [1]:
# For any file with a single-character delimiter,
# we can use the csv module. To use it, pass any open file
# or file-like object to csv.reader

import csv

file = open('chapter06/ex7.csv')

reader = csv.reader(file)

In [3]:
type(reader)


Out[3]:
_csv.reader

In [4]:
# Iterating through the reader like a file yields
# lists of values with any quote characters removed

for line in reader:
    print(line)


['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']

From here we can wrangle the data into whatever form we want. Let's walk through an example step by step.

First, read the file into a list of lines:


In [6]:
import csv
with open('chapter06/ex7.csv') as f:
    lines = list(csv.reader(f))

header, values = lines[0], lines[1:]

In [7]:
header


Out[7]:
['a', 'b', 'c']

In [8]:
values


Out[8]:
[['1', '2', '3'], ['1', '2', '3', '4']]

In [13]:
# then we create a dictionary of data columns using a dictionary comprehension
# and the expression zip(*values), which transposes rows to columns

data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)


{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV files come in many different flavors. To define a format with a particular delimiter, string quoting convention, or line terminator, we can define a simple subclass of csv.Dialect:


(Note to self: revisit and reinforce this CSV output material at some point in the future.)

In [27]:
import csv
file = open('chapter06/ex7.csv')

class my_dialect(csv.Dialect):
    lineterminator = '\m'
    delimiter = ';' 
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
    
data = csv.reader(file, dialect=my_dialect)

for line in data:
    print(line)


['a,"b","c"']
['1,"2","3"']
['1,"2","3","4"']

Of course, instead of subclassing, we can also pass the individual CSV dialect parameters directly as keywords:


In [50]:
import csv
file = open('chapter06/ex7.csv')
reader1 = csv.reader(file, delimiter='|', quotechar=' ', lineterminator='\r\n')
for line1 in reader1:
    print(','.join(line1))


"a","b","c"
"1","2","3"
"1","2","3","4"

For more complicated files, or files whose delimiter is a fixed string of several characters, the csv module cannot be used. In those cases we have to fall back on line splitting and other cleanup, such as the string split method or the regular-expression function re.split, as sketched below.
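For instance, a line that uses a two-character separator such as '::' (a made-up example) could be taken apart with re.split:

import re

line = 'aaa::-0.264438::-1.026059::-0.619500'   # hypothetical line with a '::' delimiter
fields = re.split(r'::', line)
print(fields)   # ['aaa', '-0.264438', '-1.026059', '-0.619500']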

To write delimited files manually, we can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:


In [59]:
import csv


class my_dialect(csv.Dialect):
    lineterminator = '\t\n'  #\m
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

file = open('chapter06/ex7.csv')
with open('chapter06/mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))

In [60]:
!cat chapter06/mydata.csv


one;two;three	
1;2;3	
4;5;6	
7;8;9	

JSON data

JSON, short for JavaScript Object Notation, has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV.


In [1]:
obj = """
    {"name": "Wes",
     "places_lived": ["United States", "Spain", "Germany"],
     "pet": null,
     "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
                  {"name": "Katie", "age": 38,
                   "pets": ["Sixes", "Stache", "Cisco"]}]
} """

There are several Python libraries that can read or write JSON data. For example, the built-in json module:


In [3]:
import json
result = json.loads(obj)

In [4]:
result


Out[4]:
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [5]:
list(result)


Out[5]:
['name', 'places_lived', 'pet', 'siblings']

In [6]:
result['places_lived']


Out[6]:
['United States', 'Spain', 'Germany']

In [8]:
type(result)


Out[8]:
dict

We can also revert the result object above back to JSON data using json.dumps:


In [10]:
asjson = json.dumps(result)

In [11]:
asjson


Out[11]:
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

We can also pass a list of dicts to the DataFrame constructor and select a subset of the data fields:


In [13]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'], columns = ['name', 'age'])

In [14]:
siblings


Out[14]:
name age
0 Scott 30
1 Katie 38

pandas.read_json can automatically convert JSON datasets into a Series or DataFrame. For example:


In [15]:
!cat chapter06/example.json


[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]

In [16]:
# import pandas as pd
import json 

data = pd.read_json('chapter06/example.json')
data


Out[16]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9

If we want to export data from pandas to JSON, one way is to use the to_json method available on Series and DataFrame:


In [17]:
print(data.to_json())


{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
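By default to_json produces the column-oriented layout shown above; passing orient='records' writes one JSON object per row instead. A small sketch reusing example.json:

import pandas as pd

data = pd.read_json('chapter06/example.json')
print(data.to_json(orient='records'))
# [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]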

Binary Data Formats

The simplest way to store data in binary format is to use Python's built-in pickle serialization. All pandas objects have a to_pickle method that writes the data to disk in pickle format.


In [20]:
import pandas as pd

frame = pd.read_csv('chapter06/ex1.csv')
frame


Out[20]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

In [23]:
frame.to_pickle('chapter06/frame_pickle')  #this will write the data to the binary file named frame_pickle

Using the pickle module, we can read any 'pickled' object stored in a file, or, even more easily, we can use pandas.read_pickle. For example:


In [24]:
pd.read_pickle('chapter06/frame_pickle')


Out[24]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Attention

pickle is recommended only as a short-term storage format, because the pickle format may not be stable over time: an object pickled today may fail to unpickle tomorrow under a newer version of the library. The developers work hard to avoid this, but at some point in the future it may still be better to abandon the pickle format.

pandas has built-in support for two more binary data formats: HDF5 and MessagePack. Some other storage formats for pandas or NumPy data include:

  • bcolz

A compressible column-oriented binary format based on the Blosc compression library

  • Feather

A cross-language column-oriented file format. Feather uses the Apache Arrow columnar memory format.

Using HDF5 Format

HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages. "HDF" stands for "hierarchical data format". Each HDF5 file can store multiple datasets and supporting metadata.

Compared with simpler formats, HDF5 features on-the-fly compression with a variety of compression modes. HDF5 can be a good choice for working with very large datasets that don't fit into memory.
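Because the compression happens on the fly, both HDFStore and to_hdf accept complevel and complib arguments. A minimal sketch, assuming PyTables is installed; the file name compressed.h5 is just an example:

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(1000)})
# complevel ranges from 0 (no compression) to 9; complib selects the compressor, e.g. 'zlib' or 'blosc'
frame.to_hdf('chapter06/compressed.h5', 'obj_comp', complevel=9, complib='zlib')
print(pd.read_hdf('chapter06/compressed.h5', 'obj_comp').head())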

Although HDF5 files can be read and written directly with the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dict and handles the low-level details:


In [30]:
import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(1000)})
store = pd.HDFStore('chapter06/mydata.h5')


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/miniconda3/envs/plotenv/lib/python3.6/site-packages/pandas/io/pytables.py in __init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    444         try:
--> 445             import tables  # noqa
    446         except ImportError as ex:  # pragma: no cover

ModuleNotFoundError: No module named 'tables'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-30-40e32b04374b> in <module>()
      3 
      4 frame = pd.DataFrame({'a': np.random.randn(1000)})
----> 5 store = pd.HDFStore('chapter06/mydata.h5')

~/miniconda3/envs/plotenv/lib/python3.6/site-packages/pandas/io/pytables.py in __init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    446         except ImportError as ex:  # pragma: no cover
    447             raise ImportError('HDFStore requires PyTables, "{ex}" problem '
--> 448                               'importing'.format(ex=str(ex)))
    449 
    450         if complib is not None and complib not in tables.filters.all_complibs:

ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing

To solve the above problem, I installed several modules:

  • pip install --upgrade tables

  • pip install lxml

  • pip install wrapt


In [33]:
import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(1000)})
store = pd.HDFStore('chapter06/mydata.h5py')

In [36]:
store['obj1'] = frame  # Store a DataFrame

In [37]:
store['obj1_col'] = frame['a']  # Store a Series

In [38]:
store


Out[38]:
<class 'pandas.io.pytables.HDFStore'>
File path: chapter06/mydata.h5py

Objects contained in the HDF5 file can then be retrieved with the same dict-like API:

In [48]:
x = store['obj1']
type(x)


Out[48]:
pandas.core.frame.DataFrame

In [49]:
y = store['obj1_col']
type(y)


Out[49]:
pandas.core.series.Series

In [39]:
list(store)


Out[39]:
['/obj1', '/obj1_col']

HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:


In [50]:
store.put('obj2', frame, format='table')

In [51]:
store.select('obj2', where=['index >=10 and index <= 15'])


Out[51]:
a
10 -0.342778
11 -0.204930
12 -0.761121
13 1.199154
14 0.040380
15 0.670995

In [52]:
list(store)


Out[52]:
['/obj1', '/obj1_col', '/obj2']

In [53]:
store.close()

put is the explicit version of store['obj2'] = frame; it simply lets us pass extra options, such as the storage format, to control how the data is stored.

The pandas.read_hdf function gives us a shortcut to these tools:


In [55]:
frame.to_hdf('chapter06/mydata.h5py', 'obj3', format='table')

In [57]:
pd.read_hdf('chapter06/mydata.h5py', 'obj3', where=['index < 5'])


Out[57]:
a
0 1.475226
1 0.352356
2 0.835724
3 -1.079280
4 0.224576

Note

If the data we work with is stored on remote servers, a binary format designed for distributed storage, such as Apache Parquet, may be worth using. These formats are still evolving.
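pandas already ships read_parquet / to_parquet wrappers; a rough sketch, assuming either the pyarrow or the fastparquet engine is installed and using a made-up file name:

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(5)})
frame.to_parquet('chapter06/frame.parquet')        # requires pyarrow or fastparquet
print(pd.read_parquet('chapter06/frame.parquet'))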

If most of the data lives locally, it is worth exploring whether PyTables or h5py can also meet our needs, and how their performance compares with pd.HDFStore.
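For comparison, here is a tiny h5py sketch, assuming h5py is installed; unlike HDFStore it stores plain NumPy arrays rather than DataFrames, and the file name raw_data.h5 is made up:

import numpy as np
import h5py

arr = np.random.randn(1000)
with h5py.File('chapter06/raw_data.h5', 'w') as f:
    f.create_dataset('a', data=arr, compression='gzip')   # write one compressed dataset
with h5py.File('chapter06/raw_data.h5', 'r') as f:
    print(f['a'][:10])                                     # read the first 10 values back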

Reading Microsoft Excel Files

read excel

pandas supports reading tabular data in Excel files using either the ExcelFile class or pandas.read_excel function. These tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. We may need to install these manually with pip or conda.

To use ExcelFile, create an instance by passing a path to an xls or xlsx file:


In [61]:
import pandas as pd

xlsx = pd.ExcelFile('chapter06/ex1.xlsx')

# Data stored in a sheet can then be read into DataFrame with parse:
pd.read_excel(xlsx, 'Sheet1')


Out[61]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

For Excel workbooks with multiple sheets, the faster approach is to create the ExcelFile first, as above. Alternatively, the file path can simply be passed straight to read_excel; the former approach is preferred.


In [62]:
# an example of passing the path directly to read_excel

frame = pd.read_excel('chapter06/ex1.xlsx', 'Sheet1')

In [63]:
frame


Out[63]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
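If a workbook contains several sheets, a reasonably recent pandas also accepts sheet_name=None in read_excel, which returns a dict of DataFrames keyed by sheet name. A small sketch on the same file:

import pandas as pd

sheets = pd.read_excel('chapter06/ex1.xlsx', sheet_name=None)   # None means: read every sheet
for name, df in sheets.items():
    print(name, df.shape)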

write data to excel

To write pandas data to Excel format, first create an ExcelWriter, then write data to it using the pandas objects' to_excel method:


In [65]:
writer = pd.ExcelWriter('chapter06/ex2.xlsx')  # internally this is handled by the openpyxl module

In [66]:
frame.to_excel(writer, 'Sheet1')

In [67]:
writer.save()

In [68]:
# we can also pass a file path directly to to_excel:
frame.to_excel('chapter06/ex2.xlsx')

In [69]:
!open chapter06/ex2.xlsx

Interacting with web APIs

Many websites provide APIs that let others retrieve data from them. There are many ways to access such data; one of them is the requests package.

For example, we can use requests to fetch the 30 most recent GitHub issues of the pandas library:


In [45]:
import requests

url = "https://api.github.com/repos/pandas-dev/pandas/issues"
resp = requests.get(url)

In [3]:
resp


Out[3]:
<Response [200]>

The Response object's json method will return a dictionary containing JSON parsed into native Python objects:


In [46]:
data = resp.json()

In [47]:
data[0]['url']


Out[47]:
'https://api.github.com/repos/pandas-dev/pandas/issues/22077'

In [49]:
data[0]['title']


Out[49]:
' "TypeError: unorderable types" in Python3 when index values are dict keys of tuples or tuples with non-homogeneous dtypes'

Each element of data is a dictionary containing all of the information found on a GitHub issue page (except the comments). We can pass data directly to DataFrame and extract the fields of interest:


In [53]:
import pandas as pd

issues = pd.DataFrame(data, columns = ['number', 'title', 'labels', 'state'])

In [54]:
issues


Out[54]:
number title labels state
0 22077 "TypeError: unorderable types" in Python3 whe... [{'id': 49747336, 'node_id': 'MDU6TGFiZWw0OTc0... open
1 22076 Feature-Request: Allow user defined categories... [] open
2 22075 CLN/STYLE: Lint comprehensions [{'id': 106935113, 'node_id': 'MDU6TGFiZWwxMDY... open
3 22074 [Bug] Fix various DatetimeIndex comparison bugs [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
4 22073 BUG: Column-major DataFrames stored in HDFStor... [{'id': 47229190, 'node_id': 'MDU6TGFiZWw0NzIy... open
5 22072 fix for TypeError: unorderable types" in when ... [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
6 22071 Remove unused, avoid uses of deprecated api [{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE... open
7 22070 Split fastpath IntegerArray constructor and ge... [{'id': 849023693, 'node_id': 'MDU6TGFiZWw4NDk... open
8 22069 BUG: groupby.plot.kde applies to index as well [{'id': 76865106, 'node_id': 'MDU6TGFiZWw3Njg2... open
9 22068 De-duplicate dispatch code, remove unreachable... [{'id': 57296398, 'node_id': 'MDU6TGFiZWw1NzI5... open
10 22067 TypeError in pandas._libs.tslibs.period._ordin... [{'id': 42670965, 'node_id': 'MDU6TGFiZWw0MjY3... open
11 22066 DOC: consistent docstring for compression kwarg [{'id': 2301354, 'node_id': 'MDU6TGFiZWwyMzAxM... open
12 22065 Handling NaN during resampling [{'id': 2822342, 'node_id': 'MDU6TGFiZWwyODIyM... open
13 22064 TST: suppress deprecation warnings for compress [{'id': 87485152, 'node_id': 'MDU6TGFiZWw4NzQ4... open
14 22063 PROPOSAL: use datetime64 and timedelta64 to eval [] open
15 22060 Lost of index information by operation for zer... [{'id': 195647922, 'node_id': 'MDU6TGFiZWwxOTU... open
16 22058 DOC: Reword doc for filepath_or_buffer in read... [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
17 22057 read_csv Documentation Rendering Poorly [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
18 22056 value_counts bin limit truncation [{'id': 35818298, 'node_id': 'MDU6TGFiZWwzNTgx... open
19 22055 Feature Request: Ability to parse comments [{'id': 307649777, 'node_id': 'MDU6TGFiZWwzMDc... open
20 22054 fix raise of TypeError when subtracting timede... [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
21 22053 df.groupby(cols).tshift(0, freq) Drops Index L... [{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj... open
22 22052 pd.testing.assert_frame_equal check_like not w... [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
23 22051 where method results in a different outputs wh... [{'id': 31404521, 'node_id': 'MDU6TGFiZWwzMTQw... open
24 22049 DEPR: deprecated IntervalIndex.itemsize, remov... [{'id': 150096370, 'node_id': 'MDU6TGFiZWwxNTA... open
25 22047 Breaking change multiplication bool and int [] open
26 22046 Replacing multiple columns (or just one) with ... [] open
27 22045 [documentation] merge(suffixes=(False, False))... [{'id': 57296398, 'node_id': 'MDU6TGFiZWw1NzI5... open
28 22044 Multi-index and CategoricalIndex performance [{'id': 71268330, 'node_id': 'MDU6TGFiZWw3MTI2... open
29 22043 [WIP] Imprecise indexer [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=... open

Now let's see what information requests can dig up from my own GitHub page.


In [34]:
import requests

url = 'https://api.github.com/yw-fang'
resp = requests.get(url)

In [35]:
data = resp.json()

In [39]:
data['message']


Out[39]:
'Not Found'

In [41]:
data['documentation_url']


Out[41]:
'https://developer.github.com/v3'

In [43]:
textdata = resp.text

In [44]:
textdata


Out[44]:
'{"message":"Not Found","documentation_url":"https://developer.github.com/v3"}'
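The 'Not Found' reply is simply because the URL above is missing the users/ path segment; GitHub's user profiles live under /users/<username>. A quick sketch, assuming network access and an unexhausted rate limit:

import requests

resp = requests.get('https://api.github.com/users/yw-fang')
data = resp.json()
print(resp.status_code)    # 200 when the profile is found
print(sorted(data)[:5])    # a few of the keys in the profile JSON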

Interacting with databases

Much data in practice is stored in databases. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I'll create a SQLite database using Python's built-in sqlite3 driver.


In [56]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""

In [57]:
con = sqlite3.connect('mydata.sqlite')

In [58]:
con.execute(query)


Out[58]:
<sqlite3.Cursor at 0x10d6dbb90>

In [60]:
con.commit()

Now, let's insert a few rows of data:


In [61]:
data = [('At', 'Home', 1.25, 6),
       ('Out', 'Plane', 2.6, 3),
       ('In', 'Bottle', 1.7, 5)]

In [62]:
stmt = "INSERT INTO test VALUES (?, ?, ?, ?)"

In [63]:
con.executemany(stmt, data)


Out[63]:
<sqlite3.Cursor at 0x10d6dbab0>

In [64]:
con.commit()

Most Python SQL drivers return a list of tuples when selecting data from a table:


In [65]:
cursor = con.execute('select * from test')

In [66]:
rows = cursor.fetchall()

In [67]:
rows


Out[67]:
[('At', 'Home', 1.25, 6), ('Out', 'Plane', 2.6, 3), ('In', 'Bottle', 1.7, 5)]

We can pass the list of tuples to the DataFrame constructor, but we also need the column names, which are contained in the cursor's description attribute:


In [68]:
cursor.description


Out[68]:
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [72]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])


Out[72]:
a b c d
0 At Home 1.25 6
1 Out Plane 2.60 3
2 In Bottle 1.70 5

The approach above is rather cumbersome, since every query has to be assembled by hand. The SQLAlchemy project provides more convenient abstractions, and pandas has a read_sql function that lets us read data easily from a general SQLAlchemy connection. Below is an example of using SQLAlchemy to connect to the SQLite database created above and read data from it.


In [76]:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')

In [77]:
pd.read_sql('select * from test', db)


Out[77]:
a b c d
0 At Home 1.25 6
1 Out Plane 2.60 3
2 In Bottle 1.70 5
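Going in the other direction, DataFrame.to_sql writes a frame into a database table through the same engine. A minimal sketch, where the table name new_test is made up:

import pandas as pd
import sqlalchemy as sqla

db = sqla.create_engine('sqlite:///mydata.sqlite')
frame = pd.read_sql('select * from test', db)
frame.to_sql('new_test', db, if_exists='replace', index=False)   # create or replace the table
print(pd.read_sql('select * from new_test', db))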

Conclusion

Getting access to data is often the first step in data analysis. In this chapter we covered a number of ways to do that. In the chapters that follow, we will dig deeper into the data itself and start analyzing and visualizing it.

