Pandas Data Reading and Writing
Contents of the CSV file:
white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
In [1]:
import numpy as np
import pandas as pd
csvframe=pd.read_csv('myCSV_01.csv')
csvframe
Out[1]:
In [2]:
# read_table can also be used to read the data
pd.read_table('myCSV_01.csv',sep=',')
Out[2]:
In [5]:
pd.read_csv('myCSV_02.csv',header=None)
Out[5]:
The column names (header) can also be specified explicitly:
In [7]:
pd.read_csv('myCSV_02.csv',names=['white','red','blue','green','animal'])
Out[7]:
To build a DataFrame with a hierarchical index, add the index_col option. The data file looks like this:
colors,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
red,up,2,2,2
red,down,1,1,4
In [9]:
pd.read_csv('myCSV_03.csv',index_col=['colors','status'])
Out[9]:
A data file whose fields are separated by an arbitrary mix of tabs and spaces:
white red blue green
1 4 3 2
2 4 6 7
In [11]:
pd.read_csv('myCSV_04.csv',sep=r'\s+')
Out[11]:
Reading data whose fields are separated by letters:
000end123aaa122
001end125aaa144
In [17]:
pd.read_csv('myCSV_05.csv',sep=r'\D*',header=None,engine='python')
Out[17]:
Reading a text file while skipping unnecessary lines:
##### log file #####
this file has been generate by automatic system
white,red,blue,green,animal
12-feb-2015:counting of animals inside the house
1,3,5,2,cat
2,4,8,5,dog
13-feb-2015:counting of animals inside the house
3,3,6,7,horse
2,2,8,3,duck
In [18]:
pd.read_table('myCSV_06.csv',sep=',',skiprows=[0,1,3,6])
Out[18]:
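As an aside beyond what the notebook shows, read_csv also has a comment parameter that drops every line beginning with a given character, so the '#' banner line need not be counted by position (the date lines, which do not start with '#', would still need skiprows). A minimal sketch on an inline copy of the log data:

```python
import io
import pandas as pd

# Inline stand-in for the commented log file (illustrative data)
log = """##### log file #####
white,red,blue,green,animal
1,3,5,2,cat
2,4,8,5,dog
"""

# Lines starting with '#' are ignored entirely
frame = pd.read_csv(io.StringIO(log), comment='#')
print(frame)
```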
In [20]:
pd.read_csv('myCSV_02.csv',skiprows=[2],nrows=3,header=None)
Out[20]:
Example: read the file in chunks of three rows, sum the 'white' column within each chunk, and collect the running sums in a Series object.
In [26]:
out = pd.Series(dtype='int64')
i = 0
pieces = pd.read_csv('myCSV_01.csv', chunksize=3)
for piece in pieces:
    print(piece)
    out[i] = piece['white'].sum()  # Series.set_value was removed; plain indexing works
    i += 1
out
Out[26]:
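Related to chunksize (and an aside, not in the original notebook): passing iterator=True to read_csv returns a reader whose get_chunk method pulls an arbitrary number of rows on demand. A small sketch on inline data standing in for myCSV_01.csv:

```python
import io
import pandas as pd

# Inline stand-in for the CSV file (illustrative data)
data = "white,red,blue,green,animal\n1,5,2,3,cat\n2,7,8,5,dog\n3,3,6,7,horse\n"

# iterator=True gives a TextFileReader instead of a DataFrame
reader = pd.read_csv(io.StringIO(data), iterator=True)
first_two = reader.get_chunk(2)  # fetch just the first two data rows
print(first_two)
```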
In [28]:
frame = pd.DataFrame(np.arange(4).reshape((2,2)))
print(frame.to_html())
Creating a more complex DataFrame:
In [29]:
frame = pd.DataFrame(np.random.random((4,4)),
index=['white','black','red','blue'],
columns=['up','down','left','right'])
frame
Out[29]:
In [30]:
s = ['<HTML>']
s.append('<HEAD><TITLE>MY DATAFRAME</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html=''.join(s)
with open('myFrame.html','w') as html_file:
    html_file.write(html)
In [31]:
web_frames = pd.read_html('myFrame.html')
web_frames[0]
Out[31]:
In [33]:
# A URL can also be passed as the argument
ranking = pd.read_html('http://www.meccanismocomplesso.org/en/meccanismo-complesso-sito-2/classifica-punteggio/')
ranking[0]
Out[33]:
In [35]:
from lxml import objectify
xml = objectify.parse('books.xml')
xml
Out[35]:
In [36]:
root =xml.getroot()
In [37]:
root.Book.Author
Out[37]:
In [38]:
root.Book.PublishDate
Out[38]:
In [39]:
root.getchildren()
Out[39]:
In [40]:
[child.tag for child in root.Book.getchildren()]
Out[40]:
In [41]:
[child.text for child in root.Book.getchildren()]
Out[41]:
In [53]:
def etree2df(root):
    # column names come from the tags of the first record's children
    column_names = [child.tag for child in root.getchildren()[0].getchildren()]
    rows = []
    for record in root.getchildren():
        texts = [child.text for child in record.getchildren()]
        rows.append(dict(zip(column_names, texts)))
    # DataFrame.append was removed in pandas 2.0; build from a list of dicts instead
    return pd.DataFrame(rows, columns=column_names)
etree2df(root)
Out[53]:
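In recent pandas versions (1.3+), a hand-written conversion like etree2df is no longer necessary: pd.read_xml parses child elements straight into columns. A minimal sketch on an inline document shaped like books.xml (the element names and values here are illustrative):

```python
import io
import pandas as pd

xml_data = """<?xml version="1.0"?>
<Catalog>
  <Book id="ISBN9872122367564">
    <Author>Ross, Mark</Author>
    <Title>XML Cookbook</Title>
    <Price>23.56</Price>
  </Book>
  <Book id="ISBN9872122367590">
    <Author>Bracket, Barbara</Author>
    <Title>XML for Dummies</Title>
    <Price>35.95</Price>
  </Book>
</Catalog>"""

# parser='etree' uses the standard library, so lxml is not required
books = pd.read_xml(io.StringIO(xml_data), parser='etree')
print(books)
```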
In [54]:
pd.read_excel('data.xlsx')
Out[54]:
In [55]:
pd.read_excel('data.xlsx','Sheet2')
Out[55]:
In [56]:
frame = pd.DataFrame(np.random.random((4,4)),
index=['exp1','exp2','exp3','exp4'],
columns=['Jan2015','Feb2015','Mar2015','Apr2015'])
frame
Out[56]:
In [57]:
frame.to_excel('data2.xlsx')
In [58]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['white','black','red','blue'],
columns=['up','down','right','left'])
frame.to_json('frame.json')
In [59]:
# Reading the JSON back
pd.read_json('frame.json')
Out[59]:
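As an aside beyond the notebook, to_json and read_json accept an orient parameter that controls the JSON layout; orient='records' produces a list of row dictionaries instead of the default column-keyed nesting. A small sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({'up': [0, 4], 'down': [1, 5]}, index=['white', 'black'])

# orient='records' emits one dictionary per row (the index is dropped)
as_records = df.to_json(orient='records')
print(as_records)  # e.g. [{"up":0,"down":1},{"up":4,"down":5}]

# Round-trip: read it back with the same orient
roundtrip = pd.read_json(io.StringIO(as_records), orient='records')
print(roundtrip)
```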
HDF (Hierarchical Data Format) files store data in binary form.
In [60]:
from pandas import HDFStore
store = HDFStore('mydata.h5')
store['obj1']=frame
In [61]:
store['obj1']
Out[61]:
In [62]:
frame.to_pickle('frame.pkl')
In [63]:
pd.read_pickle('frame.pkl')
Out[63]:
In [66]:
frame=pd.DataFrame(np.arange(20).reshape((4,5)),
columns=['white','red','blue','black','green'])
frame
Out[66]:
In [71]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///foo.db')
In [72]:
frame.to_sql('colors', engine)
In [73]:
pd.read_sql('colors', engine)
Out[73]:
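read_sql above fetches a whole table by name; pd.read_sql_query runs an arbitrary SQL statement instead. A minimal sketch using the standard library's sqlite3 with an in-memory database (rather than the notebook's foo.db file):

```python
import sqlite3
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape((4, 5)),
                     columns=['white', 'red', 'blue', 'black', 'green'])

# pandas accepts a plain sqlite3 connection as well as an SQLAlchemy engine
con = sqlite3.connect(':memory:')
frame.to_sql('colors', con, index=False)

# Only the rows and columns the query selects come back
result = pd.read_sql_query('SELECT white, green FROM colors WHERE white > 5', con)
print(result)
con.close()
```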