Pandas Data Reading and Writing

API

Reading Writing
read_csv to_csv
read_excel to_excel
read_hdf to_hdf
read_sql to_sql
read_json to_json
read_html to_html
read_stata to_stata
read_clipboard to_clipboard
read_pickle to_pickle
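Each read_* function has a matching to_* writer, so data can round-trip through any of these formats. A minimal sketch of the CSV pair (the filename is illustrative):

```python
import pandas as pd

# build a small DataFrame and round-trip it through CSV
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_csv('roundtrip.csv', index=False)   # write, dropping the index
df2 = pd.read_csv('roundtrip.csv')        # read it back
print(df.equals(df2))                     # the round trip is lossless here
```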

CSV File Reading and Writing

Contents of the CSV file (myCSV_01.csv):

white,read,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

In [1]:
import numpy as np
import pandas as pd
csvframe=pd.read_csv('myCSV_01.csv')
csvframe


Out[1]:
white read blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

In [2]:
# the same data can also be read with read_table
pd.read_table('myCSV_01.csv',sep=',')


Out[2]:
white read blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

Reading data without a header

1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

In [5]:
pd.read_csv('myCSV_02.csv',header=None)


Out[5]:
0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

Column names can also be specified explicitly:


In [7]:
pd.read_csv('myCSV_02.csv',names=['white','red','blue','green','animal'])


Out[7]:
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse

To create a DataFrame with a hierarchical index, add the index_col option. The data file format:

colors,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
red,up,2,2,2
red,down,1,1,4

In [9]:
pd.read_csv('myCSV_03.csv',index_col=['colors','status'])


Out[9]:
item1 item2 item3
colors status
black up 3 4 6
down 2 6 7
white up 5 5 5
down 3 3 2
red up 2 2 2
down 1 1 4

Parsing TXT Files with Regexps

A regular expression can be passed as sep in order to parse less regular data files.

Regex element Meaning
. any character except newline
\d digit
\D non-digit
\s whitespace character
\S non-whitespace character
\n newline
\t tab
\uxxxx Unicode character given in hexadecimal

The data file is separated by an irregular mix of tabs and spaces:

white  red  blue  green
1  4 3   2
2 4    6 7

In [11]:
pd.read_csv('myCSV_04.csv',sep=r'\s+')


Out[11]:
white red blue green
0 1 4 3 2
1 2 4 6 7
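The same parse can be reproduced without a file on disk by wrapping the raw text in io.StringIO (a self-contained sketch using the values shown above):

```python
import io
import pandas as pd

raw = "white  red  blue  green\n1  4 3   2\n2 4    6 7\n"
# one-or-more whitespace characters as the separator
frame = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(frame)
```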

Reading data separated by letters (non-digit characters)

000end123aaa122
001end125aaa144

In [17]:
pd.read_csv('myCSV_05.csv',sep=r'\D*',header=None,engine='python')


Out[17]:
0 1 2
0 0 123 122
1 1 125 144

Skipping unwanted lines when reading a text file

##### log file #####
this file has been generate by automatic system
white,red,blue,green,animal
12-feb-2015:counting of animals inside the house
1,3,5,2,cat
2,4,8,5,dog
13-feb-2015:counting of animals inside the house
3,3,6,7,horse
2,2,8,3,duck

In [18]:
pd.read_table('myCSV_06.csv',sep=',',skiprows=[0,1,3,6])


Out[18]:
white red blue green animal
0 1 3 5 2 cat
1 2 4 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck

Reading Part of a TXT File

To read only part of a file, specify which rows to parse using the skiprows and nrows options: skiprows excludes the listed rows, and nrows=i reads i rows from the starting point.


In [20]:
pd.read_csv('myCSV_02.csv',skiprows=[2],nrows=3,header=None)


Out[20]:
0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
2 2 2 8 3 duck

Example: read a file in chunks of three rows, sum one column of each chunk, and collect the partial sums in a Series object.


In [26]:
out = pd.Series(dtype='int64')
i = 0
# read the file in chunks of three rows at a time
pieces = pd.read_csv('myCSV_01.csv',chunksize=3)
for piece in pieces:
    print(piece)
    out[i] = piece['white'].sum()   # partial sum of the 'white' column
    i += 1
out


   white  read  blue  green animal
0      1     5     2      3    cat
1      2     7     8      5    dog
2      3     3     6      7  horse
   white  read  blue  green animal
0      2     2     8      3   duck
1      4     4     2      1  mouse
Out[26]:
0    6
1    6
dtype: int64

Writing to a File

  • to_csv(filename)
  • to_csv(filename,index=False,header=False)
  • to_csv(filename,na_rep='NaN')
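A minimal sketch of these three option combinations (the filenames are illustrative; na_rep controls how missing values are written):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'white': [1, np.nan], 'red': [3, 4]})
df.to_csv('out1.csv')                              # default: index and header included
df.to_csv('out2.csv', index=False, header=False)   # values only, no index or header
df.to_csv('out3.csv', na_rep='NaN')                # write 'NaN' for missing values
print(open('out3.csv').read())
```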

HTML File Reading and Writing

Writing an HTML file


In [28]:
frame = pd.DataFrame(np.arange(4).reshape((2,2)))
print(frame.to_html())


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>

Creating a more complex DataFrame


In [29]:
frame = pd.DataFrame(np.random.random((4,4)),
                    index=['white','black','red','blue'],
                    columns=['up','down','left','right'])
frame


Out[29]:
up down left right
white 0.812351 0.909022 0.957337 0.277063
black 0.390331 0.181537 0.697226 0.867162
red 0.647759 0.505687 0.189957 0.217562
blue 0.540899 0.416303 0.681703 0.973349

In [30]:
s = ['<HTML>']
s.append('<HEAD><TITLE>MY DATAFRAME</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html=''.join(s)
with open('myFrame.html','w') as html_file:
    html_file.write(html)

Reading HTML tables


In [31]:
web_frames = pd.read_html('myFrame.html')
web_frames[0]


Out[31]:
Unnamed: 0 up down left right
0 white 0.812351 0.909022 0.957337 0.277063
1 black 0.390331 0.181537 0.697226 0.867162
2 red 0.647759 0.505687 0.189957 0.217562
3 blue 0.540899 0.416303 0.681703 0.973349

In [33]:
# a URL can also be passed as the argument
ranking = pd.read_html('http://www.meccanismocomplesso.org/en/meccanismo-complesso-sito-2/classifica-punteggio/')
ranking[0]


Out[33]:
Position Name Points Rank
0 1 Fabio Nelli 16682 NaN
1 2 admin 9029 NaN
2 3 BrunoOrsini 2078 NaN
3 4 Berserker 700 NaN
4 5 albertosallusti 277 NaN
5 6 Dnocioni 194 NaN
6 7 Jon 181 NaN
7 8 Mr.Y 180 NaN
8 9 michele sisinni 150 NaN
9 10 Selina 133 NaN
10 11 stefano gustin 120 NaN
11 12 Pietro Baima 107 NaN
12 13 Cecilia Lalatta Costerbosa 106 NaN
13 14 Davide Aloisi 106 NaN
14 15 Leonardo Zampi 106 NaN
15 16 gildalombardi 105 NaN
16 17 Marco Contigiani 101 NaN
17 18 Maurizio Andreoli 100 NaN
18 19 Telerobotlabs 100 NaN
19 20 ron 55 NaN
20 21 EnricoSx 30 NaN
21 22 guz 30 NaN
22 23 Frankie842 24 NaN
23 24 dan_bar 23 NaN
24 25 carlo71 23 NaN
25 26 Gilles 23 NaN
26 27 Forno92 23 NaN
27 28 robertozz 21 NaN
28 29 ciro66 20 NaN
29 30 rocruss 20 NaN
... ... ... ... ...
93 94 Andrea Ficicchia 10 NaN
94 95 Nicola Secciani 10 NaN
95 96 vietcusc 10 NaN
96 97 abramelek 10 NaN
97 98 stefanodebiasio 10 NaN
98 99 Fabrizio 10 NaN
99 100 drogo75 10 NaN
100 101 federico 10 NaN
101 102 Archimede_2000 10 NaN
102 103 phuocdai 10 NaN
103 104 darkakushi 10 NaN
104 105 vinicio.nigro 10 NaN
105 106 Enry 10 NaN
106 107 sbnmc 10 NaN
107 108 ANDREAFFI 10 NaN
108 109 brunoli 10 NaN
109 110 alexit71 10 NaN
110 111 Giuseppe 10 NaN
111 112 jo bach 10 NaN
112 113 franco 10 NaN
113 114 toledo 10 NaN
114 115 Michele Domanico 10 NaN
115 116 Goffredo Geymet 10 NaN
116 117 michele007 10 NaN
117 118 francis_bell 10 NaN
118 119 edsel2016 7 NaN
119 120 pierpaolo 6 NaN
120 121 fdaniele 4 NaN
121 122 squarewheel7 4 NaN
122 123 Marco Corbetta 1 NaN

123 rows × 4 columns

Reading and Writing XML Files

This uses the third-party library lxml.


In [35]:
from lxml import objectify
xml = objectify.parse('books.xml')
xml


Out[35]:
<lxml.etree._ElementTree at 0x9bcd8c8>

In [36]:
root = xml.getroot()

In [37]:
root.Book.Author


Out[37]:
'EnRoss Mark'

In [38]:
root.Book.PublishDate


Out[38]:
'2014-22-01'

In [39]:
root.getchildren()


Out[39]:
[<Element Book at 0x9bcaa08>, <Element Book at 0x9bcaac8>]

In [40]:
[child.tag for child in root.Book.getchildren()]


Out[40]:
['Author', 'Title', 'Genre', 'Price', 'PublishDate']

In [41]:
[child.text for child in root.Book.getchildren()]


Out[41]:
['EnRoss Mark', 'xml cookbook', 'Computer', '23.56', '2014-22-01']

In [53]:
def etree2df(root):
    # column names come from the tags of the first record's children
    column_names = [child.tag for child in root.getchildren()[0].getchildren()]
    rows = []
    for record in root.getchildren():
        # one dict per record, mapping tag -> text
        texts = [child.text for child in record.getchildren()]
        rows.append(dict(zip(column_names, texts)))
    return pd.DataFrame(rows, columns=column_names)
etree2df(root)


Out[53]:
Author Title Genre Price PublishDate
0 EnRoss Mark xml cookbook Computer 23.56 2014-22-01
1 Barbara xml for dummies Computer 35.95 2014-12-06

Reading and Writing Excel Files


In [54]:
pd.read_excel('data.xlsx')


Out[54]:
white red green black
a 12 23 17 18
b 22 16 19 18
c 14 23 22 21

In [55]:
pd.read_excel('data.xlsx','Sheet2')


Out[55]:
yellow purple blue orange
A 11 16 44 22
B 20 22 23 44
C 30 31 37 32

In [56]:
frame = pd.DataFrame(np.random.random((4,4)),
                    index=['exp1','exp2','exp3','exp4'],
                    columns=['Jan2015','Feb2015','Mar2015','Apr2015'])
frame


Out[56]:
Jan2015 Feb2015 Mar2015 Apr2015
exp1 0.696097 0.333292 0.799803 0.945701
exp2 0.409402 0.778060 0.695790 0.725668
exp3 0.458230 0.398461 0.064312 0.797225
exp4 0.031586 0.230344 0.446283 0.128528

In [57]:
frame.to_excel('data2.xlsx')

JSON Data


In [58]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['white','black','red','blue'],
                    columns=['up','down','right','left'])
frame.to_json('frame.json')

In [59]:
# read the JSON back
pd.read_json('frame.json')


Out[59]:
down left right up
black 5 7 6 4
blue 13 15 14 12
red 9 11 10 8
white 1 3 2 0

HDF5 Data

HDF (Hierarchical Data Format) files store data in binary form.


In [60]:
from pandas import HDFStore
store = HDFStore('mydata.h5')
store['obj1']=frame

In [61]:
store['obj1']


Out[61]:
up down right left
white 0 1 2 3
black 4 5 6 7
red 8 9 10 11
blue 12 13 14 15

Pickle Data


In [62]:
frame.to_pickle('frame.pkl')

In [63]:
pd.read_pickle('frame.pkl')


Out[63]:
up down right left
white 0 1 2 3
black 4 5 6 7
red 8 9 10 11
blue 12 13 14 15

Database Connections

Using sqlite3 as an example.


In [66]:
frame=pd.DataFrame(np.arange(20).reshape((4,5)),
                  columns=['white','red','blue','black','green'])
frame


Out[66]:
white red blue black green
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19

In [71]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///foo.db')

In [72]:
frame.to_sql('colors',engine)

In [73]:
pd.read_sql('colors',engine)


Out[73]:
index white red blue black green
0 0 0 1 2 3 4
1 1 5 6 7 8 9
2 2 10 11 12 13 14
3 3 15 16 17 18 19
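read_sql also accepts an SQL query instead of a table name, which lets the database do the filtering. A sketch using the standard-library sqlite3 connection directly (in-memory database, illustrative table name; note that with a plain DBAPI connection only queries are supported, not bare table names):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'white': [0, 5], 'red': [1, 6]})
df.to_sql('colors', conn, index=False)
# a query string selects and filters rows on the database side
result = pd.read_sql('SELECT white, red FROM colors WHERE white > 0', conn)
print(result)
```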
