Pandas Data Reading and Writing

API

Reading Writing
read_csv to_csv
read_excel to_excel
read_hdf to_hdf
read_sql to_sql
read_json to_json
read_html to_html
read_stata to_stata
read_clipboard to_clipboard
read_pickle to_pickle
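Each read_* function has a matching to_* writer, so data can round-trip through any of these formats. A minimal sketch of the CSV pair (the filename is illustrative):

```python
import pandas as pd

# build a small DataFrame and round-trip it through CSV
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_csv('roundtrip.csv', index=False)   # write, dropping the index
df2 = pd.read_csv('roundtrip.csv')        # read it back
print(df.equals(df2))                     # the round trip is lossless here
```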

CSV File Reading and Writing

Contents of the CSV file (myCSV_01.csv):

white,read,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

In [1]:
import numpy as np
import pandas as pd
csvframe=pd.read_csv('myCSV_01.csv')
csvframe


Out[1]:
white read blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

In [2]:
# the same data can also be read with read_table
pd.read_table('myCSV_01.csv',sep=',')


Out[2]:
white read blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

Reading data without a header

1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

In [5]:
pd.read_csv('myCSV_02.csv',header=None)


Out[5]:
0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

Column names can also be specified explicitly:


In [7]:
pd.read_csv('myCSV_02.csv',names=['white','red','blue','green','animal'])


Out[7]:
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

To create a DataFrame with a hierarchical index, add the index_col option. The data file format:

colors,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
red,up,2,2,2
red,down,1,1,4

In [9]:
pd.read_csv('myCSV_03.csv',index_col=['colors','status'])


Out[9]:
item1 item2 item3
colors status
black up 3 4 6
down 2 6 7
white up 5 5 5
down 3 3 2
red up 2 2 2
down 1 1 4

Parsing TXT Files with Regexps

A regular expression can be passed as sep in order to parse less regular data files.

Regex element Meaning
. any character except newline
\d digit
\D non-digit
\s whitespace character
\S non-whitespace character
\n newline
\t tab
\uxxxx Unicode character given in hexadecimal

The data file is separated by an irregular mix of tabs and spaces:

white  red  blue  green
1  4 3   2
2 4    6 7

In [11]:
pd.read_csv('myCSV_04.csv',sep=r'\s+')


Out[11]:
white red blue green
0 1 4 3 2
1 2 4 6 7
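The same parse can be reproduced without a file on disk by wrapping the raw text in io.StringIO (a self-contained sketch using the values shown above):

```python
import io
import pandas as pd

raw = "white  red  blue  green\n1  4 3   2\n2 4    6 7\n"
# one-or-more whitespace characters as the separator
frame = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(frame)
```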

Reading data separated by letters (non-digit characters)

000end123aaa122
001end125aaa144

In [17]:
pd.read_csv('myCSV_05.csv',sep=r'\D*',header=None,engine='python')


Out[17]:
0 1 2
0 0 123 122
1 1 125 144

Skipping unwanted lines when reading a text file

##### log file #####
this file has been generate by automatic system
white,red,blue,green,animal
12-feb-2015:counting of animals inside the house
1,3,5,2,cat
2,4,8,5,dog
13-feb-2015:counting of animals inside the house
3,3,6,7,horse
2,2,8,3,duck

In [18]:
pd.read_table('myCSV_06.csv',sep=',',skiprows=[0,1,3,6])


Out[18]:
white red blue green animal
0 1 3 5 2 cat
1 2 4 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck

Reading Part of a TXT File

To read only part of a file, specify which rows to parse using the skiprows and nrows options: skiprows excludes the listed rows, and nrows=i reads i rows from the starting point.


In [20]:
pd.read_csv('myCSV_02.csv',skiprows=[2],nrows=3,header=None)


Out[20]:
0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
2 2 2 8 3 duck

Example: read a file in chunks of three rows, sum one column of each chunk, and collect the partial sums in a Series object.


In [26]:
out = pd.Series(dtype='int64')
i = 0
# read the file in chunks of three rows at a time
pieces = pd.read_csv('myCSV_01.csv',chunksize=3)
for piece in pieces:
    print(piece)
    out[i] = piece['white'].sum()   # partial sum of the 'white' column
    i += 1
out


   white  read  blue  green animal
0      1     5     2      3    cat
1      2     7     8      5    dog
2      3     3     6      7  horse
   white  read  blue  green animal
0      2     2     8      3   duck
1      4     4     2      1  mouse
Out[26]:
0    6
1    6
dtype: int64

Writing to a File

  • to_csv(filename)
  • to_csv(filename,index=False,header=False)
  • to_csv(filename,na_rep='NaN')
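A minimal sketch of these three option combinations (the filenames are illustrative; na_rep controls how missing values are written):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'white': [1, np.nan], 'red': [3, 4]})
df.to_csv('out1.csv')                              # default: index and header included
df.to_csv('out2.csv', index=False, header=False)   # values only, no index or header
df.to_csv('out3.csv', na_rep='NaN')                # write 'NaN' for missing values
print(open('out3.csv').read())
```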

HTML File Reading and Writing

Writing an HTML file


In [28]:
frame = pd.DataFrame(np.arange(4).reshape((2,2)))
print(frame.to_html())


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>

Creating a more complex DataFrame


In [29]:
frame = pd.DataFrame(np.random.random((4,4)),
                    index=['white','black','red','blue'],
                    columns=['up','down','left','right'])
frame


Out[29]:
up down left right
white 0.812351 0.909022 0.957337 0.277063
black 0.390331 0.181537 0.697226 0.867162
red 0.647759 0.505687 0.189957 0.217562
blue 0.540899 0.416303 0.681703 0.973349

In [30]:
s = ['<HTML>']
s.append('<HEAD><TITLE>MY DATAFRAME</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html=''.join(s)
with open('myFrame.html','w') as html_file:
    html_file.write(html)

Reading HTML tables


In [31]:
web_frames = pd.read_html('myFrame.html')
web_frames[0]


Out[31]:
Unnamed: 0 up down left right
0 white 0.812351 0.909022 0.957337 0.277063
1 black 0.390331 0.181537 0.697226 0.867162
2 red 0.647759 0.505687 0.189957 0.217562
3 blue 0.540899 0.416303 0.681703 0.973349

In [33]:
# a URL can also be passed as the argument
ranking = pd.read_html('http://www.meccanismocomplesso.org/en/meccanismo-complesso-sito-2/classifica-punteggio/')
ranking[0]


Out[33]:
Position Name Points Rank
0 1 Fabio Nelli 16682 NaN
1 2 admin 9029 NaN
2 3 BrunoOrsini 2078 NaN
3 4 Berserker 700 NaN
4 5 albertosallusti 277 NaN
5 6 Dnocioni 194 NaN
6 7 Jon 181 NaN
7 8 Mr.Y 180 NaN
8 9 michele sisinni 150 NaN
9 10 Selina 133 NaN
10 11 stefano gustin 120 NaN
11 12 Pietro Baima 107 NaN
12 13 Cecilia Lalatta Costerbosa 106 NaN
13 14 Davide Aloisi 106 NaN
14 15 Leonardo Zampi 106 NaN
15 16 gildalombardi 105 NaN
16 17 Marco Contigiani 101 NaN
17 18 Maurizio Andreoli 100 NaN
18 19 Telerobotlabs 100 NaN
19 20 ron 55 NaN
20 21 EnricoSx 30 NaN
21 22 guz 30 NaN
22 23 Frankie842 24 NaN
23 24 dan_bar 23 NaN
24 25 carlo71 23 NaN
25 26 Gilles 23 NaN
26 27 Forno92 23 NaN
27 28 robertozz 21 NaN
28 29 ciro66 20 NaN
29 30 rocruss 20 NaN
... ... ... ... ...
93 94 Andrea Ficicchia 10 NaN
94 95 Nicola Secciani 10 NaN
95 96 vietcusc 10 NaN
96 97 abramelek 10 NaN
97 98 stefanodebiasio 10 NaN
98 99 Fabrizio 10 NaN
99 100 drogo75 10 NaN
100 101 federico 10 NaN
101 102 Archimede_2000 10 NaN
102 103 phuocdai 10 NaN
103 104 darkakushi 10 NaN
104 105 vinicio.nigro 10 NaN
105 106 Enry 10 NaN
106 107 sbnmc 10 NaN
107 108 ANDREAFFI 10 NaN
108 109 brunoli 10 NaN
109 110 alexit71 10 NaN
110 111 Giuseppe 10 NaN
111 112 jo bach 10 NaN
112 113 franco 10 NaN
113 114 toledo 10 NaN
114 115 Michele Domanico 10 NaN
115 116 Goffredo Geymet 10 NaN
116 117 michele007 10 NaN
117 118 francis_bell 10 NaN
118 119 edsel2016 7 NaN
119 120 pierpaolo 6 NaN
120 121 fdaniele 4 NaN
121 122 squarewheel7 4 NaN
122 123 Marco Corbetta 1 NaN

123 rows × 4 columns

Reading and Writing XML Files

This uses the third-party library lxml.


In [35]:
from lxml import objectify
xml = objectify.parse('books.xml')
xml


Out[35]:
<lxml.etree._ElementTree at 0x9bcd8c8>

In [36]:
root = xml.getroot()

In [37]:
root.Book.Author


Out[37]:
'EnRoss Mark'

In [38]:
root.Book.PublishDate


Out[38]:
'2014-22-01'

In [39]:
root.getchildren()


Out[39]:
[<Element Book at 0x9bcaa08>, <Element Book at 0x9bcaac8>]

In [40]:
[child.tag for child in root.Book.getchildren()]


Out[40]:
['Author', 'Title', 'Genre', 'Price', 'PublishDate']

In [41]:
[child.text for child in root.Book.getchildren()]


Out[41]:
['EnRoss Mark', 'xml cookbook', 'Computer', '23.56', '2014-22-01']

In [53]:
def etree2df(root):
    # column names come from the tags of the first record's children
    column_names = [child.tag for child in root.getchildren()[0].getchildren()]
    rows = []
    for record in root.getchildren():
        # one dict per record, mapping tag -> text
        texts = [child.text for child in record.getchildren()]
        rows.append(dict(zip(column_names, texts)))
    return pd.DataFrame(rows, columns=column_names)
etree2df(root)


Out[53]:
Author Title Genre Price PublishDate
0 EnRoss Mark xml cookbook Computer 23.56 2014-22-01
1 Barbara xml for dummies Computer 35.95 2014-12-06

Reading and Writing Excel Files


In [54]:
pd.read_excel('data.xlsx')


Out[54]:
white red green black
a 12 23 17 18
b 22 16 19 18
c 14 23 22 21

In [55]:
pd.read_excel('data.xlsx','Sheet2')


Out[55]:
yellow purple blue orange
A 11 16 44 22
B 20 22 23 44
C 30 31 37 32

In [56]:
frame = pd.DataFrame(np.random.random((4,4)),
                    index=['exp1','exp2','exp3','exp4'],
                    columns=['Jan2015','Feb2015','Mar2015','Apr2015'])
frame


Out[56]:
Jan2015 Feb2015 Mar2015 Apr2015
exp1 0.696097 0.333292 0.799803 0.945701
exp2 0.409402 0.778060 0.695790 0.725668
exp3 0.458230 0.398461 0.064312 0.797225
exp4 0.031586 0.230344 0.446283 0.128528

In [57]:
frame.to_excel('data2.xlsx')

JSON Data


In [58]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['white','black','red','blue'],
                    columns=['up','down','right','left'])
frame.to_json('frame.json')

In [59]:
# read the JSON back
pd.read_json('frame.json')


Out[59]:
down left right up
black 5 7 6 4
blue 13 15 14 12
red 9 11 10 8
white 1 3 2 0

HDF5 Data

HDF (Hierarchical Data Format) files store data in binary form.


In [60]:
from pandas import HDFStore
store = HDFStore('mydata.h5')
store['obj1']=frame

In [61]:
store['obj1']


Out[61]:
up down right left
white 0 1 2 3
black 4 5 6 7
red 8 9 10 11
blue 12 13 14 15

Pickle Data


In [62]:
frame.to_pickle('frame.pkl')

In [63]:
pd.read_pickle('frame.pkl')


Out[63]:
up down right left
white 0 1 2 3
black 4 5 6 7
red 8 9 10 11
blue 12 13 14 15

Database Connections

Using sqlite3 as an example.


In [66]:
frame=pd.DataFrame(np.arange(20).reshape((4,5)),
                  columns=['white','red','blue','black','green'])
frame


Out[66]:
white red blue black green
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19

In [71]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///foo.db')

In [72]:
frame.to_sql('colors',engine)

In [73]:
pd.read_sql('colors',engine)


Out[73]:
index white red blue black green
0 0 0 1 2 3 4
1 1 5 6 7 8 9
2 2 10 11 12 13 14
3 3 15 16 17 18 19
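read_sql also accepts an SQL query instead of a table name, which lets the database do the filtering. A sketch using the standard-library sqlite3 connection directly (in-memory database, illustrative table name; note that with a plain DBAPI connection only queries are supported, not bare table names):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'white': [0, 5], 'red': [1, 6]})
df.to_sql('colors', conn, index=False)
# a query string selects and filters rows on the database side
result = pd.read_sql('SELECT white, red FROM colors WHERE white > 0', conn)
print(result)
```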
