Pandas is a Python library for data analysis: http://pandas.pydata.org
API quick reference: http://pandas.pydata.org/pandas-docs/stable/api.html
It builds on NumPy and SciPy, adding a large set of data-manipulation features on top of them.
Statistics, grouping, sorting, and pivot tables all flow into one another freely; if you already know your way around relational databases (RDBMS) and Excel, you will find that Pandas can do everything they can, and then some!
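If you have not met a pandas pivot table yet, here is a minimal sketch of that Excel-style capability (the DataFrame and its columns are invented purely for illustration):

import pandas as pd

# hypothetical sales records
sales = pd.DataFrame({'city': ['Tokyo','Tokyo','Seoul','Seoul'],
                      'year': [2014, 2015, 2014, 2015],
                      'sales': [10, 12, 7, 9]})
# rows = city, columns = year, cells = summed sales
print(sales.pivot_table(values='sales', index='city', columns='year', aggfunc='sum'))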
What does an ordinary programmer do when handed a dataset?
In [1]:
import codecs
import requests
import numpy as np
import scipy as sp
import scipy.stats as spstat
import pandas as pd
import datetime
import json
In [2]:
r = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
with codecs.open('S1EP3_Iris.txt','w',encoding='utf-8') as f:
    f.write(r.text)
In [3]:
with codecs.open('S1EP3_Iris.txt','r',encoding='utf-8') as f:
    lines = f.readlines()
for idx, line in enumerate(lines):
    print(line, end='')
    if idx == 10:
        break
In [4]:
import pandas as pd
irisdata = pd.read_csv('S1EP3_Iris.txt',header = None, encoding='utf-8')
irisdata
Out[4]:
In [5]:
cnames = ['sepal_length','sepal_width','petal_length','petal_width','class']
irisdata.columns = cnames
irisdata
Out[5]:
In [6]:
irisdata[irisdata['petal_width']==irisdata.petal_width.max()]
Out[6]:
In [7]:
irisdata.iloc[::30,:2]
Out[7]:
In [8]:
print(irisdata['class'].value_counts())
for x in range(4):
    s = irisdata.iloc[:,x]
    print('{0:<12}'.format(s.name.upper()), " Statistics: ",
          '{0:>5} {1:>5} {2:>5} {3:>5}'.format(s.max(), s.min(), round(s.mean(),2), round(s.std(),2)))
In [9]:
# x*log(x); scipy.log has been removed, so use np.log
slogs = lambda x: np.log(x)*x
# exp of the Shannon entropy of a vector of non-negative values:
# (slogs(sum) - sum of slogs) / sum  ==  -sum(p*log(p)) with p = x/x.sum()
entpy = lambda x: np.exp((slogs(x.sum()) - x.map(slogs).sum()) / x.sum())
irisdata.groupby('class').agg(entpy)
Out[9]:
Important Pandas data types
Build a Series from values:
It consists of a default index plus the values.
In [10]:
Series1 = pd.Series(np.random.randn(4))
print(Series1, type(Series1))
print(Series1.index)
print(Series1.values)
In [11]:
print(Series1 > 0)
print(Series1[Series1 > 0])
In [12]:
print(Series1 * 2)
print(Series1 + 5)
In [13]:
print(np.exp(Series1))
# NumPy universal function built from a Python lambda
f_np = np.frompyfunc(lambda x: np.exp(x*2+5), 1, 1)
print(f_np(Series1))
Putting row labels directly on the Series, rather than building a two-column table, makes it easy to tell data from metadata:
In [14]:
Series2 = pd.Series(Series1.values, index=['norm_'+str(i) for i in range(4)])
print(Series2, type(Series2))
print(Series2.index)
print(type(Series2.index))
print(Series2.values)
The rows are ordered, yet any row can still be reached through the row-level index:
(It is not exactly an OrderedDict either, since index values may even repeat; repetition is discouraged, but that does not mean it cannot be used.)
In [15]:
print(Series2[['norm_0','norm_3']])
In [16]:
print('norm_0' in Series2)
print('norm_6' in Series2)
The default row index behaves just like row numbers:
In [17]:
print(Series1.index)
Defining a Series from a dict (or an OrderedDict with unique keys) means duplicate index values are never a concern:
In [18]:
Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}
Series3_pdSeries = pd.Series(Series3_Dict)
print(Series3_pdSeries)
print(Series3_pdSeries.values)
print(Series3_pdSeries.index)
Difference from a dict, #1: it is ordered
In [19]:
Series4_IndexList = ["Japan","China","Singapore","S.Korea"]
Series4_pdSeries = pd.Series( Series3_Dict ,index = Series4_IndexList)
print(Series4_pdSeries)
print(Series4_pdSeries.values)
print(Series4_pdSeries.index)
print(Series4_pdSeries.isnull())
print(Series4_pdSeries.notnull())
Difference from a dict, #2: index values may repeat, although this is discouraged.
In [20]:
Series5_IndexList = ['A','B','B','C']
Series5 = pd.Series(Series1.values,index = Series5_IndexList)
print(Series5)
print(Series5[['B','A']])
Metadata for the whole series: name
Once the data series and its index both carry names, relating data later on becomes much more convenient!
In [21]:
print(Series4_pdSeries.name)
print(Series4_pdSeries.index.name)
In [22]:
Series4_pdSeries.name = "Capital Series"
Series4_pdSeries.index.name = "Nation"
print(Series4_pdSeries)
pd.DataFrame(Series4_pdSeries)
Out[22]:
In [23]:
dataNumPy = np.asarray([('Japan','Tokyo',4000),\
('S.Korea','Seoul',1300),('China','Beijing',9100)])
DF1 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])
DF1
Out[23]:
In [24]:
dataDict = {'nation':['Japan','S.Korea','China'],\
'capital':['Tokyo','Seoul','Beijing'],'GDP':[4900,1300,9100]}
DF2 = pd.DataFrame(dataDict)
DF2
Out[24]:
In [25]:
DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])
DF21
Out[25]:
In [26]:
DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])
DF22
Out[26]:
In [27]:
print(DF22.nation)
print(DF22.capital)
print(DF22['GDP'])
In [28]:
print(DF22[0:1])   # slicing actually returns a DataFrame
print(DF22.loc[0]) # fetch a row by its index label (.ix is deprecated; use .loc)
In [29]:
print(DF22.iloc[0,:])
print(DF22.iloc[:,0])
In [30]:
DF22['population'] = [1600,130,55]
DF22['region'] = 'East_Asian'
DF22
Out[30]:
In [31]:
index_names = ['a','b','c']
Series_for_Index = pd.Series(index_names)
print(pd.Index(index_names))
print(pd.Index(Series_for_Index))
In [32]:
index_names = ['a','b','c']
index0 = pd.Index(index_names)
print(index0.values)  # get_values() is deprecated; use .values instead
index0[2] = 'd'       # raises TypeError: Index does not support mutable operations
In [33]:
# print([('Row_'+str(x+1),'Col_'+str(y+1)) for x in range(4) for y in range(4)])
multi1 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in range(4) for y in range(4)])
multi1.names = ['index1','index2']  # a MultiIndex takes .names, not .name
print(multi1)
In [34]:
data_for_multi1 = pd.Series(range(16), index=multi1)
data_for_multi1
Out[34]:
In [35]:
data_for_multi1.unstack()
Out[35]:
In [36]:
data_for_multi1.unstack().stack()
Out[36]:
Let's look at an example of unbalanced data:
Row_1..Row_4 and Col_1..Col_4 are not fully crossed.
In [37]:
multi2 = pd.Index([('Row_'+str(x),'Col_'+str(y+1))
                   for x in range(5) for y in range(x)])
multi2
Out[37]:
In [38]:
data_for_multi2 = pd.Series(np.arange(10),index = multi2)
data_for_multi2
Out[38]:
In [39]:
data_for_multi2.unstack()
Out[39]:
In [40]:
data_for_multi2.unstack().stack()
Out[40]:
In [41]:
dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]
pd.DatetimeIndex(dates)
Out[41]:
In [42]:
periodindex1 = pd.period_range('2015-01','2015-04',freq='M')
print(periodindex1)
In [43]:
print(periodindex1.asfreq('D',how='start'))
print(periodindex1.asfreq('D',how='end'))
In [44]:
periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')
periodindex_day = pd.period_range('2015-01-01','2015-03-31',freq='D')
print(periodindex_mon)
print(periodindex_day)
In [45]:
full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day)
full_ts
Out[45]:
In [46]:
full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')
full_ts
Out[46]:
In [47]:
index1 = pd.Index(['A','B','B','C','C'])
index2 = pd.Index(['C','D','E','E','F'])
index3 = pd.Index(['B','C','A'])
print(index1.append(index2))
print(index1.difference(index2))
print(index1.intersection(index2))
print(index1.union(index2))  # works best with unique-valued Indexes
print(index1.isin(index2))
print(index1.delete(2))
print(index1.insert(0,'K'))  # not recommended
print(index3.drop('A'))      # works best with unique-valued Indexes
print(index1.is_monotonic_increasing, index2.is_monotonic_increasing, index3.is_monotonic_increasing)
print(index1.is_unique, index2.is_unique, index3.is_unique)
It's an old story, but back at the basics we still care about how pandas exchanges data with the outside world.
Remember the example at the very beginning?
Question: in the example below, if header=4 and names=cnames were used instead, what data would actually be read?
In [48]:
print(cnames)
irisdata = pd.read_csv('S1EP3_Iris.txt',header = None, names = cnames,\
encoding='utf-8')
irisdata[::30]
Out[48]:
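The quickest way to settle the question posed above is to run it. A sketch with the same file: when names is supplied together with header=4, pandas treats file row 4 (0-based) as a header row to be discarded and starts the data after it, so the first five iris records vanish:

irisdata_q = pd.read_csv('S1EP3_Iris.txt', header=4, names=cnames, encoding='utf-8')
print(len(irisdata), len(irisdata_q))  # five rows fewer: rows 0-4 were consumed as the "header"
irisdata_q[:5]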
If you want to know every parameter, head over to the API:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
Here are some commonly used parameters, grouped by stage (a short sketch follows):
Reading: where the data starts and how it is decoded, e.g. header, names, encoding.
Content handling: how values are interpreted, e.g. dtype, na_values, parse_dates.
Finishing up: how much to keep, e.g. usecols, nrows.
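A minimal sketch putting a few of these together (the file is the one downloaded above; the parameter values are illustrative only):

irisdata_opts = pd.read_csv('S1EP3_Iris.txt',
                            header=None,          # reading: the file has no header row
                            names=cnames,         # reading: supply our own column names
                            encoding='utf-8',     # reading: file encoding
                            na_values=['?'],      # content: treat '?' as NaN
                            usecols=[0,1,2,3,4],  # finishing: keep only these columns
                            nrows=50)             # finishing: read at most 50 rows
irisdata_opts[::10]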
In [49]:
irisdata.to_excel('S1EP3_irisdata.xlsx', index=False)  # .xls writing and the encoding argument are gone from modern pandas
irisdata_from_excel = pd.read_excel('S1EP3_irisdata.xlsx', header=0)
irisdata_from_excel[::30]
Out[49]:
The only really important parameter: sheet_name=k, which pulls out the k-th sheet of the Excel file (counting from 0; older pandas spelled it sheetname).
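For instance, a minimal sketch (the file is the one written above; the alternative sheet title is hypothetical):

pd.read_excel('S1EP3_irisdata.xlsx', sheet_name=0)  # first sheet, by position
# pd.read_excel('S1EP3_irisdata.xlsx', sheet_name='Sheet1')  # or by sheet title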
JSON: a data format commonly used in network transmission.
Look closely: this is exactly the style of the heterogeneous data we collect in daily work:
In [50]:
json_data = [{'name':'Wang','sal':50000,'job':'VP'},\
{'name':'Zhang','job':'Manager','report':'VP'},\
{'name':'Li','sal':5000,'report':'Manager'}]
data_employee = pd.read_json(json.dumps(json_data))
data_employee_ri = data_employee.reindex(columns=['name','job','sal','report'])
data_employee_ri
Out[50]:
In [ ]:
IP = '127.0.0.1'
us = 'root'
pw = '123456'
An example for MySQL:
In [ ]:
import pymysql
import pymysql.cursors
connection = pymysql.connect(host=IP,
                             user=us,
                             password=pw,
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
#pd.read_sql_query("sql", connection)
#df.to_sql('tablename', engine)  # modern pandas wants an SQLAlchemy engine here; the old flavor='mysql' argument is gone
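Current pandas works best with an SQLAlchemy engine rather than a raw DBAPI connection. A minimal sketch, reusing the credentials above and assuming a hypothetical database testdb with a table employee:

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://{0}:{1}@{2}/testdb?charset=utf8mb4'.format(us, pw, IP))
df_sql = pd.read_sql_query('SELECT * FROM employee', engine)    # query straight into a DataFrame
data_employee_ri.to_sql('employee_copy', engine, index=False)   # write a DataFrame back to MySQL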
Building on Part One, there are many more ways to manipulate the data:
In [51]:
pd.DataFrame([np.random.rand(2),np.random.rand(2),np.random.rand(2)],columns=['C1','C2'])
Out[51]:
In [52]:
pd.concat([data_employee_ri,data_employee_ri,data_employee_ri])
Out[52]:
In [53]:
pd.concat([data_employee_ri,data_employee_ri,data_employee_ri],ignore_index=True)
Out[53]:
Joining on data columns: use the on keyword
In [54]:
pd.merge(data_employee_ri,data_employee_ri,on='name')
Out[54]:
In [55]:
pd.merge(data_employee_ri,data_employee_ri,on=['name','job'])
Out[55]:
Joining on the index: use left_index and right_index directly
In [56]:
data_employee_ri.index.name = 'index1'
pd.merge(data_employee_ri, data_employee_ri,
         left_index=True, right_index=True)  # left_index/right_index take booleans, not index names
Out[56]:
TIPS: add the how keyword to specify the join type.
With how, merge essentially reproduces what SQL joins can do, while keeping the code clean.
In [57]:
DF31xA = pd.DataFrame({'name':[u'老王',u'老张',u'老李'],'sal':[5000,3000,1000]})
DF31xA
Out[57]:
In [58]:
DF31xB = pd.DataFrame({'name':[u'老王',u'老刘'],'job':['VP','Manager']})
DF31xB
Out[58]:
how='left': keep all rows of the left table
In [59]:
pd.merge(DF31xA,DF31xB,on='name',how='left')
Out[59]:
how='right': keep all rows of the right table
In [60]:
pd.merge(DF31xA,DF31xB,on='name',how='right')
Out[60]:
how='inner': keep only rows present in both tables, which avoids missing values as far as possible
In [61]:
pd.merge(DF31xA,DF31xB,on='name',how='inner')
Out[61]:
how='outer': keep the union of both tables; this produces missing values, but integrates all available information
In [62]:
pd.merge(DF31xA,DF31xB,on='name',how='outer')
Out[62]:
In [63]:
dataNumPy32 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])
DF32 = pd.DataFrame(dataNumPy32, columns=['nation','capital','GDP'])
DF32
Out[63]:
In [64]:
def GDP_Factorize(v):
    fv = np.float64(v)
    if fv > 6000.0:
        return 'High'
    elif fv < 2000.0:
        return 'Low'
    else:
        return 'Medium'
DF32['GDP_Level'] = DF32['GDP'].map(GDP_Factorize)
DF32['NATION'] = DF32.nation.map(str.upper)
DF32
Out[64]:
In [65]:
DF32.applymap(lambda x: float(x)*2 if x.isdigit() else x.upper())
Out[65]:
In [66]:
DF32.apply(lambda x:x['nation']+x['capital']+'_'+x['GDP'],axis=1)
Out[66]:
In [67]:
dataNumPy33 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])
DF33 = pd.DataFrame(dataNumPy33,columns=['nation','capital','GDP'])
DF33
Out[67]:
In [68]:
DF33.sort_values(['capital','nation'])  # .sort() has been removed; use sort_values
Out[68]:
In [69]:
DF33.sort_values('GDP', ascending=False)
Out[69]:
In [70]:
DF33.sort_values('GDP').sort_index(ascending=False)  # sort by values, then re-sort by index
Out[70]:
In [71]:
DF33.sort_index(axis=1,ascending=True)
Out[71]:
A handy feature: rank
In [72]:
DF33
Out[72]:
In [73]:
DF33.rank()
Out[73]:
In [74]:
DF33.rank(ascending=False)
Out[74]:
Watch how tied data (equal values) are handled:
In [75]:
DF33x = pd.DataFrame({'name':[u'老王',u'老张',u'老李',u'老刘'],'sal':np.array([5000,3000,5000,9000])})
DF33x
Out[75]:
DF33x.rank() defaults to method='average': when rows tie, they all get the average of the tied ranks
In [76]:
DF33x.sal.rank()
Out[76]:
method='min': tied rows all get the smallest of the tied ranks
In [77]:
DF33x.sal.rank(method='min')
Out[77]:
method='max': tied rows all get the largest of the tied ranks
In [78]:
DF33x.sal.rank(method='max')
Out[78]:
method='first': ties are broken by order of appearance, so whichever row comes first gets the smaller rank.
In [79]:
DF33x.sal.rank(method='first')
Out[79]:
In [80]:
DF34 = data_for_multi2.unstack()
DF34
Out[80]:
Ignoring missing values:
In [81]:
DF34.mean(skipna=True)
Out[81]:
In [82]:
DF34.mean(skipna=False)
Out[82]:
If you would rather not ignore missing values, it is time to bring out fillna:
In [83]:
DF34
Out[83]:
In [84]:
DF34.fillna(0).mean(axis=1,skipna=False)
Out[84]:
groupby works like SQL's GROUP BY keyword:
Split-Apply-Combine
The flexibility of pandas groupby (a concrete sketch follows the figure below):
In [85]:
from IPython.display import Image
Image(filename="S1EP3_group.png")
Out[85]:
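To make the SQL analogy concrete, a minimal sketch of the same aggregation written both ways (the SQL line is only for comparison):

# SQL: SELECT class, AVG(sepal_length) FROM irisdata GROUP BY class;
irisdata.groupby('class')['sepal_length'].mean()  # split by 'class', apply mean, combine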
How grouping actually behaves
In [86]:
irisdata_group = irisdata.groupby('class')
irisdata_group
Out[86]:
In [87]:
for level, subsetDF in irisdata_group:
    print(level)
    print(subsetDF[::20])
Grouping lets you implement MapReduce-style logic quickly
In [88]:
# sample skewness computed by hand (compare with scipy.stats.skew in the next cell)
irisdata.groupby('class').agg(
    lambda x: ((x - x.mean())**3).sum() * len(x)
              / (len(x) - 1.0) / (len(x) - 2.0)
              / (x.std() * np.sqrt(len(x) / (len(x) - 1.0)))**3
              if len(x) > 2 else None)
Out[88]:
In [89]:
irisdata.groupby('class').agg(spstat.skew)
Out[89]:
In [90]:
pd.concat([irisdata,irisdata.groupby('class').transform('mean')],axis=1)[::20]
Out[90]:
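Besides agg and transform, groupby also offers apply and filter. A minimal sketch (the column and the threshold are chosen only for illustration):

# apply: run an arbitrary function over each sub-DataFrame and glue the pieces back together
print(irisdata.groupby('class').apply(lambda df: df.nlargest(2, 'petal_width')))
# filter: keep only the groups whose predicate is True
print(irisdata.groupby('class').filter(lambda df: df['petal_width'].mean() > 1.0)[::30])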
In [91]:
factor1 = np.random.randint(0,3,50)
factor2 = np.random.randint(0,2,50)
factor3 = np.random.randint(0,3,50)
values = np.random.randn(50)
In [92]:
hierindexDF = pd.DataFrame({'F1':factor1,'F2':factor2,'F3':factor3,'F4':values})
hierindexDF
Out[92]:
In [93]:
hierindexDF_gbsum = hierindexDF.groupby(['F1','F2','F3']).sum()
hierindexDF_gbsum
Out[93]:
Inspect the index:
In [94]:
hierindexDF_gbsum.index
Out[94]:
unstack:
In [95]:
hierindexDF_gbsum.unstack()
Out[95]:
In [96]:
hierindexDF_gbsum.unstack(0)
Out[96]:
In [97]:
hierindexDF_gbsum.unstack(1)
Out[97]:
In [98]:
hierindexDF_gbsum.unstack([2,0])
Out[98]:
Going one step further, stack is the counterpart of unstack: it moves levels of the column index back onto the row index
In [99]:
hierindexDF_gbsum.unstack([2,0]).stack([1,2])
Out[99]: