Pandas 数据处理
In [4]:
import numpy as np
import pandas as pd
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
'price':[12.33,11.44,33.21,13.23,33.62]})
frame1
Out[4]:
In [6]:
frame2 = pd.DataFrame({'id':['ball','pencil','pen','ashtray'],
'color':['white','red','red','black']})
frame2
Out[6]:
In [7]:
pd.merge(frame1,frame2)
Out[7]:
也可以通过制定on关键词,使用条件join
In [8]:
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
'color':['white','red','red','black','green'],
'brand':['OMG','ABC','ABC','POD','POD']})
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
'brand':['OMG','POD','ABC','POD']})
pd.merge(frame1,frame2,on='id')
Out[8]:
In [9]:
pd.merge(frame1,frame2,on='brand')
Out[9]:
使用left_on和right_on进行有选择的连接
In [10]:
pd.merge(frame1,frame2,left_on='id',right_on='id')
Out[10]:
使用类似left_join,right_join和out_join等方式
In [11]:
pd.merge(frame1,frame2,on='id')
Out[11]:
In [12]:
pd.merge(frame1,frame2,on='id',how='outer')
Out[12]:
In [13]:
pd.merge(frame1,frame2,on='id',how='left')
Out[13]:
In [15]:
pd.merge(frame1,frame2,on='id',how='right')
Out[15]:
In [16]:
# 合并多个键
pd.merge(frame1,frame2,on=['id','brand'],how='outer')
Out[16]:
In [18]:
pd.merge(frame1,frame2,right_index=True,left_index=True)
Out[18]:
join 函数更适合根据索引进行合并
In [23]:
#重命名frame2列的名称
frame2.columns=['brand2','id2']
# frame1.join(frame2)
Out[23]:
In [1]:
import numpy as np
import pandas as pd
array1 = np.arange(9).reshape((3,3))
array1
Out[1]:
In [4]:
array2 = np.arange(9).reshape((3,3))+6
array2
Out[4]:
In [5]:
# 拼接
# axis=1 行拼接
np.concatenate([array1,array2],axis=1)
Out[5]:
In [8]:
# axis = 0 列拼接
np.concatenate([array1,array2],axis=0)
Out[8]:
In [11]:
ser1 = pd.Series(np.random.rand(4),index=[1,2,3,4])
ser1
Out[11]:
In [16]:
ser2 = pd.Series(np.random.rand(4),index=[5,6,7,8])
ser2
Out[16]:
In [17]:
pd.concat([ser1,ser2])
Out[17]:
concat()函数默认是按照axis=0这条轴拼接,返回Series对象,如果指定axis=1,那么将会返回DataFrame对象
In [18]:
pd.concat([ser1,ser2],axis=1)
Out[18]:
上述结果返回的相当于'out_join'操作,可以吧join选项修改为'inner'
In [20]:
pd.concat([ser1,ser1],axis=1,join='inner')
Out[20]:
创建等级索引,借助keys完成
In [24]:
pd.concat([ser1,ser2],keys=[1,2])
Out[24]:
使用axis=1的话,将会DataFrame的各列对象
In [25]:
pd.concat([ser1,ser2],keys=[1,2],axis=1)
Out[25]:
In [29]:
frame1 = pd.DataFrame(np.random.rand(9).reshape(3,3),index=[1,2,3],columns=['A','B','C'])
frame2 = pd.DataFrame(np.random.rand(9).reshape(3,3),index=[4,5,6],columns=['A','B','C'])
pd.concat([frame1,frame2])
Out[29]:
In [30]:
pd.concat([frame1,frame2],axis=1)
Out[30]:
In [31]:
ser1 = pd.Series(np.random.rand(5),index=[1,2,3,4,5])
ser2 = pd.Series(np.random.rand(4),index=[2,4,5,6])
ser1.combine_first(ser2)
Out[31]:
In [32]:
ser2.combine_first(ser1)
Out[32]:
In [34]:
ser1[:3].combine_first(ser2[:3])
Out[34]:
In [36]:
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
index=['white','black','red'],
columns=['ball','pen','pencil'])
frame1
Out[36]:
In [39]:
ser5 = frame1.stack()
ser5
Out[39]:
In [40]:
ser5.unstack()
Out[40]:
In [41]:
ser5.unstack(0)
Out[41]:
从“长”格式和“宽”格式旋转
In [43]:
longframe = pd.DataFrame({'color':['white','white','white',
'red','red','red',
'black','black','black'],
'item':['ball','pen','mug',
'ball','pen','mug',
'ball','pen','mug'],
'value':np.random.rand(9)})
longframe
Out[43]:
In [44]:
longframe.pivot('color','item')
Out[44]:
In [55]:
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
index=['white','black','red'],
columns=['ball','pen','pencil'])
frame1
Out[55]:
In [56]:
del frame1['ball']
frame1
Out[56]:
In [57]:
frame1.drop('white')
Out[57]:
In [60]:
dframe = pd.DataFrame({'color':['white','white','red','red','white'],
'value':[2,1,3,3,2]})
dframe
Out[60]:
使用duplicate()函数,返回Series对象,数值类型为bool类型
In [62]:
dframe.duplicated()
Out[62]:
筛选数据
In [64]:
dframe[dframe.duplicated()]
Out[64]:
In [66]:
dframe.drop_duplicates()
Out[66]:
In [69]:
frame = pd.DataFrame({'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','rosso','verdo','black','yellow'],
'price':[5.56,4.2,1.3,0.56,2.75]})
frame
Out[69]:
如果DataFrame中,有一些元素不正确,在映射关系中,使用旧元素为key,新元素作为value
In [71]:
newcolor = {
'rosso':'red',
'verde':'green'
}
frame.replace(newcolor)
Out[71]:
将NaN转换成0
In [73]:
ser = pd.Series([1,3,np.nan,4,6,np.nan,4])
ser.replace(np.nan,0)
Out[73]:
In [74]:
frame = pd.DataFrame({'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','red','green','black','yellow']})
frame
Out[74]:
In [78]:
price={
'ball':5.56,
'mug':4.2,
'bottle':1.3,
'scissors':3.41,
'pen':1.3,
'pencil':0.56,
'ashtray':2.75
}
frame['price']=frame['item'].map(price)
frame
Out[78]:
In [83]:
# 重命名 index 索引
reindex = {
0:'first',
1:'second',
2:'third',
3:'fourth',
4:'fifth'
}
frame.rename(reindex)
Out[83]:
In [85]:
# 重命名列明,添加columns选项
recolumn = {
'item':'object',
'price':'value'
}
frame.rename(index=reindex,columns=recolumn)
Out[85]:
In [87]:
frame.rename(index={1:'first'},columns={'item':'object'},inplace=True)
frame
Out[87]:
In [92]:
results = np.random.randint(0,100,200)
bins=[0,25,50,75,100]
cat = pd.cut(results,bins)
cat
Out[92]:
In [95]:
cat.categories
Out[95]:
In [100]:
cat.codes
Out[100]:
In [101]:
pd.value_counts(cat)
Out[101]:
In [102]:
bin_names = ['unlikely','less likely','likely','high likely']
pd.cut(result,bins,labels=bin_names)
Out[102]:
In [103]:
pd.cut(result,5)
Out[103]:
In [104]:
pd.qcut(result,5)
Out[104]:
In [106]:
# 随机创建一个只有三列的DataFrame对象,每一列包含了一个100个数据
randFrame = pd.DataFrame(np.random.randn(100,3))
randFrame
Out[106]:
In [107]:
randFrame.describe()
Out[107]:
假定比标准差3倍中误差为异常数据
In [108]:
randFrame.std()
Out[108]:
In [109]:
randFrame[(np.abs(randFrame) > 3 * randFrame.std()).any(1)]
Out[109]:
In [3]:
nframe = pd.DataFrame(np.arange(25).reshape(5,5))
nframe
Out[3]:
In [6]:
new_order = np.random.permutation(5)
new_order
Out[6]:
In [7]:
nframe.take(new_order)
Out[7]:
In [8]:
new_order = [3,4,2]
nframe.take(new_order)
Out[8]:
In [10]:
sample = np.random.randint(0,len(nframe),size=3)
sample
Out[10]:
In [11]:
nframe.take(sample)
Out[11]:
In [12]:
text = '16 bolton Avenue , Boston'
text.split(',')
Out[12]:
In [14]:
tokens = [s.strip() for s in text.split(',')]
tokens
Out[14]:
In [15]:
strings =['A+','A','A+','B','BB','BBB','C+']
';'.join(strings)
Out[15]:
In [16]:
'Boston' in text
Out[16]:
In [17]:
text.index('Boston')
Out[17]:
In [18]:
text.find('Boston')
Out[18]:
In [20]:
text.index('China')
In [21]:
text.find('China')
Out[21]:
In [22]:
text.count('e')
Out[22]:
In [23]:
text.replace('Boston','China')
Out[23]:
In [24]:
text.replace('1','')
Out[24]:
In [26]:
import re
text = 'This is an \t odd \n text'
re.split('\s+',text)
Out[26]:
In [28]:
regex = re.compile('\s+')
regex.split(text)
Out[28]:
In [31]:
# findall 函数匹配文本中所有符合条件的正则表达式
text ='This is my address: 16 bolton Avenue, boston'
re.findall('a\w+',text)
Out[31]:
In [32]:
re.findall('[A,a]\w+',text)
Out[32]:
与findall函数相关的函数还有两个,match()和search(),findall()函数返回所有符合条件的模式的子串,而search()函数返回第一个符合条件的子串
In [33]:
re.search('[A,a]\w+',text)
Out[33]:
In [34]:
search = re.search('[A,a]\w+',text)
search.start()
Out[34]:
match 从字符串的开始匹配,如果第一个字符不匹配,则不再搜索字符串内部。
In [36]:
re.match('[A,a]\w+',text)
In [37]:
re.match('T\w+',text)
Out[37]:
In [41]:
match = re.match('T\w+',text)
text[match.start():match.end()]
Out[41]:
内部机制:SPLIT-APPLY-COMBINE
分组
将数据集分成多个组
用函数处理
用函数处理每一组数据
合并
将不同组得到的数据合并起来
In [42]:
frame = pd.DataFrame({'color':['white','red','green','red','green'],
'object':['pen','pencil','pencil','ashtray','pen'],
'price1':[5.56,4.20,1.30,0.56,2.75],
'price2':[4.75,4.12,1.60,0.75,3.15]})
frame
Out[42]:
In [50]:
group=frame['price1'].groupby(frame['color'])
group.groups
Out[50]:
In [51]:
group.mean()
Out[51]:
In [53]:
group.sum()
Out[53]:
In [54]:
ggroup = frame['price1'].groupby([frame['color'],frame['object']])
ggroup.groups
Out[54]:
In [55]:
ggroup.sum()
Out[55]:
In [56]:
frame[['price1','price2']].groupby(frame['color']).mean()
Out[56]:
In [57]:
frame.groupby(frame['color']).mean()
Out[57]:
In [58]:
for name,group in frame.groupby('color'):
print name
print group
In [60]:
frame['price1'].groupby(frame['color']).mean()
Out[60]:
In [61]:
frame.groupby(frame['color'])['price1'].mean()
Out[61]:
In [62]:
group = frame.groupby('color')
group['price1'].quantile(0.5)
Out[62]:
In [64]:
def range(series):
return series.max() - series.min()
group['price1'].agg(range)
Out[64]:
In [65]:
group.agg(range)
Out[65]:
In [66]:
group['price1'].agg(['mean','std',range])
Out[66]:
In [67]:
frame
Out[67]:
In [68]:
sums = frame.groupby('color').sum().add_prefix('tot_')
sums
Out[68]:
In [70]:
pd.merge(frame,sums,left_on='color',right_index=True)
Out[70]:
In [71]:
frame.groupby('color').transform(np.sum).add_prefix('tot_')
Out[71]:
In [73]:
frame = pd.DataFrame({'color':['white','black','white','white','black','black'],
'status':['up','up','down','down','down','up'],
'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})
frame
Out[73]:
In [74]:
frame.groupby(['color','status']).apply(lambda x:x.max())
Out[74]: