Pandas is a Python data analysis library. Installing Anaconda installs Pandas along with it.
In [1]:
import numpy as np
import pandas as pd
a = pd.Series([12,-4,22,0])
a
Out[1]:
In [2]:
# Replace the default labels with meaningful values
a = pd.Series([12,-4,22,0],index=['a','b','c','d'])
print(a)
In [3]:
print(a.values)
print(a.index)
In [4]:
print(a.iloc[2])  # select by position
print(a['b'])     # select by label
print(a[0:2])
print(a[['b','d']])
In [5]:
a.iloc[2]=10  # assign by position
a['d']=-5
In [6]:
arr =np.array([1,2,3,4])
s = pd.Series(arr)
s
Out[6]:
In [7]:
s1 = pd.Series(s)
s1[0]=-10
print(s)
In [8]:
s[s>2]
Out[8]:
In [9]:
s/2
Out[9]:
In [10]:
np.log(s)
Out[10]:
In [1]:
import numpy as np
import pandas as pd
s2 = pd.Series([5,-3,np.nan,14])
s2
Out[1]:
In [2]:
s2.isnull()
Out[2]:
In [4]:
s2.notnull()
Out[4]:
In [5]:
# Use the boolean mask as a filter condition
s2[s2.notnull()]
Out[5]:
In [6]:
s2[s2.isnull()]
Out[6]:
In [8]:
mydic = {'red':200,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydic)
myseries
Out[8]:
In [9]:
colors = ['red','yellow','orange','blue','green']
mySeries = pd.Series(mydic, index=colors)
mySeries
Out[9]:
In [10]:
mydict2 = {'red':400,'yellow':1000,'black':700}
mySeries2 = pd.Series(mydict2)
mySeries + mySeries2
Out[10]:
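The addition above aligns the two Series by label before adding. The following is a minimal sketch of that behaviour, using made-up Series named `left` and `right` (these names are illustrative and do not appear in the cells above). It also shows `add()` with `fill_value`, which is one way to avoid NaN for labels that exist on only one side.

```python
import pandas as pd

# Two Series whose labels only partly overlap
left = pd.Series({'red': 200, 'blue': 1000, 'yellow': 500})
right = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# Labels found on only one side give NaN in the result
total = left + right

# add() with fill_value treats a label missing on one side as 0
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```

Whether filling with 0 is appropriate depends on what a missing label means in the data at hand.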
A DataFrame is a tabular structure similar to an Excel spreadsheet. It was designed to extend the Series to multiple dimensions.
index | color | object | price |
---|---|---|---|
0 | blue | ball | 1.2 |
1 | green | pen | 1.0 |
2 | yellow | pencil | 0.6 |
Unlike a Series, a DataFrame has two index arrays: one labels the rows and the other labels the columns.
In [3]:
import numpy as np
import pandas as pd
data = {'color':['blue','green','yellow'],'object':['ball','pen','pencil'],'price':[1.2,1.0,0.6]}
frame = pd.DataFrame(data)
frame
Out[3]:
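As a small illustration of the two index arrays, the sketch below rebuilds the same frame and reads both sets of labels. The variable names `row_labels` and `col_labels` are introduced here only for the example.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow'],
        'object': ['ball', 'pen', 'pencil'],
        'price': [1.2, 1.0, 0.6]}
frame = pd.DataFrame(data)

row_labels = list(frame.index)    # default integer row labels
col_labels = list(frame.columns)  # column labels taken from the dict keys

print(row_labels, col_labels)
# A single cell is addressed by one label from each index
print(frame.loc[1, 'price'])
```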
You can also select only the columns you are interested in.
In [3]:
frame2 = pd.DataFrame(data,columns=['object','price'])
frame2
Out[3]:
Setting custom index values
In [4]:
frame2 = pd.DataFrame(data,index=['one','two','three'])
frame2
Out[4]:
Other ways to create a DataFrame
In [7]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame3
Out[7]:
In [9]:
frame.columns
Out[9]:
In [10]:
frame.index
Out[10]:
In [11]:
frame.values
Out[11]:
In [12]:
frame['price']
Out[12]:
In [13]:
frame.iloc[2]  # .ix was removed from pandas; use .iloc for positions
Out[13]:
In [16]:
# Select several rows by position
frame.iloc[[0,2]]
Out[16]:
In [17]:
# Selecting a range of rows works like slicing
frame[0:1]
Out[17]:
In [18]:
frame[0:3]
Out[18]:
In [19]:
# Select a single element
frame['object'][2]
Out[19]:
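The selection cells above can be expressed with the label-based and position-based indexers. The sketch below assumes a frame with string row labels ('one', 'two', 'three'), chosen here so that the two kinds of lookup are easy to tell apart.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow'],
                      'object': ['ball', 'pen', 'pencil'],
                      'price': [1.2, 1.0, 0.6]},
                     index=['one', 'two', 'three'])

by_position = frame.iloc[2]          # third row, whatever its label is
by_label = frame.loc['three']        # row whose label is 'three'
cell = frame.at['three', 'object']   # one cell by row and column label

print(by_position)
print(cell)
```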
In [23]:
frame.index.name='id'
frame.columns.name='item'
frame
Out[23]:
In [24]:
# Add a new column
frame['new']=12 # the same value is assigned to every row
frame
Out[24]:
In [25]:
frame['new'] = [3.0,1.3,2.2]
frame
Out[25]:
In [26]:
# Update the column from a Series
ser = pd.Series(np.arange(3))
frame['new']=ser
frame
Out[26]:
In [30]:
frame.isin([1.0,'pen'])
Out[30]:
In [31]:
frame[frame.isin([1.0,'pen'])]
Out[31]:
In [32]:
del frame['new']
frame
Out[32]:
In [33]:
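# On Python 3, comparing the string columns with a number raises TypeError,
# so filter on the numeric column only
frame[frame['price']<3]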
Out[33]:
In [34]:
nestdict = {'red':{2012:22,2013:33},'white':{2011:13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:18}}
frame2 = pd.DataFrame(nestdict)
frame2
Out[34]:
In [36]:
frame2.T
Out[36]:
In [3]:
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser.index
Out[3]:
In [4]:
ser.idxmin()
Out[4]:
In [5]:
ser.idxmax()
Out[5]:
In [6]:
serd = pd.Series(range(6),index=['white','white','blue','green','green','yellow'])
serd
Out[6]:
In [7]:
serd['white']
Out[7]:
In [10]:
serd.index.is_unique
Out[10]:
In [11]:
ser = pd.Series([2,5,7,4],index=['one','two','three','four'])
ser
Out[11]:
In [13]:
ser.reindex(['three','four','five','one'])
Out[13]:
The reindex() call above drops the 'two' label and adds a 'five' label, whose value is NaN.
Filling labels automatically
In [5]:
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3
Out[5]:
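The index above is incomplete: 1, 2 and 4 are missing. A common need is to fill the gaps and obtain a complete sequence. The method option of reindex() specifies how the missing values are filled.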
In [6]:
ser3.reindex(range(6),method='ffill')
Out[6]:
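With method='ffill', each missing index takes the value of the nearest smaller index. To take the value of the next larger index instead, change the method option to 'bfill'.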
In [7]:
ser3.reindex(range(6),method='bfill')
Out[7]:
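To make the difference between the two fill methods concrete, here is a short sketch that repeats both calls on the same Series and keeps the results in the illustrative names `forward` and `backward`. Note that `range(6)` covers the labels 0 to 5, so the original label 6 is not part of the new index.

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# Forward fill: a missing label copies the value of the previous existing label
forward = ser3.reindex(range(6), method='ffill')

# Backward fill: a missing label copies the value of the next existing label
backward = ser3.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```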
In [8]:
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser
Out[8]:
In [9]:
ser.drop('yellow')
Out[9]:
In [11]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame
Out[11]:
In [12]:
# Delete rows
frame.drop(['blue','yellow'])
Out[12]:
To delete columns, pass the column names and also specify axis=1.
In [14]:
frame.drop(['pen','pencil'],axis=1)
Out[14]:
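The `index=` and `columns=` keywords of `drop()` are an alternative to the `axis` argument and may read more clearly. The sketch below uses them on the same frame. `drop()` returns a new object, so the original frame is left unchanged.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

rows_dropped = frame.drop(index=['blue', 'yellow'])   # same as axis=0
cols_dropped = frame.drop(columns=['pen', 'pencil'])  # same as axis=1

print(rows_dropped)
print(cols_dropped)
```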
In [15]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1+s2
Out[15]:
In [16]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index=['blue','green','white','yellow'],
columns=['mug','pen','ball'])
frame1+frame2
Out[16]:
In [17]:
frame1.add(frame2)
Out[17]:
In [18]:
frame1.sub(frame2)
Out[18]:
In [19]:
frame1.div(frame2)
Out[19]:
In [20]:
frame1.mul(frame2)
Out[20]:
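The method forms `add()`, `sub()`, `div()` and `mul()` accept a `fill_value` argument that the plain operators do not. The sketch below compares `frame1 + frame2` with `frame1.add(frame2, fill_value=0)`. The names `summed` and `filled` are introduced only for this example.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# Cells present in only one frame become NaN
summed = frame1 + frame2

# With fill_value, a cell missing from one frame is treated as 0;
# a cell missing from both frames is still NaN
filled = frame1.add(frame2, fill_value=0)

print(summed)
print(filled)
```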
In [21]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame
Out[21]:
In [22]:
ser = pd.Series(np.arange(4),index=['ball','pen','pencil','paper'])
ser
Out[22]:
In [23]:
frame - ser
Out[23]:
In [24]:
ser['mug']=9
frame - ser
Out[24]:
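The subtraction above matches the Series index against the column labels of the frame and applies the operation to every row. A minimal sketch of both cases follows. `ser_extra` is a hypothetical Series built directly with the extra 'mug' label instead of being modified in place.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])
ser = pd.Series(np.arange(4), index=['ball', 'pen', 'pencil', 'paper'])

# The Series labels line up with the column labels, row by row
diff = frame - ser

# A Series label with no matching column yields an all-NaN column
ser_extra = pd.Series([0, 1, 2, 3, 9],
                      index=['ball', 'pen', 'pencil', 'paper', 'mug'])
diff_extra = frame - ser_extra

print(diff)
print(diff_extra)
```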
In [25]:
frame
Out[25]:
In [26]:
np.sqrt(frame)
Out[26]:
In [27]:
f = lambda x:x.max()-x.min()
frame.apply(f)
Out[27]:
In [28]:
# Apply the function to each row
frame.apply(f, axis=1)
Out[28]:
The function passed to apply() does not have to return a scalar. It can also return a vector, such as a Series.
In [29]:
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
Out[29]:
In [30]:
frame.apply(f,axis=1)
Out[30]:
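The shape of the result depends on the axis. The sketch below applies a function that returns a two-element Series, first per column and then per row. The function name `min_max` is chosen here for readability and is not taken from the cells above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

def min_max(x):
    """Return the smallest and largest value of a row or column."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

per_column = frame.apply(min_max)          # 'min'/'max' become the rows
per_row = frame.apply(min_max, axis=1)     # 'min'/'max' become the columns

print(per_column)
print(per_row)
```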
In [31]:
frame.sum()
Out[31]:
In [33]:
frame.mean()
Out[33]:
In [34]:
frame.describe()
Out[34]:
In [38]:
ser =pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser
Out[38]:
In [39]:
ser.sort_index()
Out[39]:
A DataFrame can be sorted along either axis. To sort by the row index, call sort_index() directly. To sort by the column labels, pass axis=1.
In [40]:
frame
Out[40]:
In [41]:
frame.sort_index()
Out[41]:
In [42]:
frame.sort_index(axis=1)
Out[42]:
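A short sketch of sorting along each axis, with a descending sort added for comparison. The result names (`rows_sorted`, `cols_sorted`, `rows_desc`) are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

rows_sorted = frame.sort_index()                 # sort by row labels
cols_sorted = frame.sort_index(axis=1)           # sort by column labels
rows_desc = frame.sort_index(ascending=False)    # row labels, descending

print(list(rows_sorted.index))
print(list(cols_sorted.columns))
print(list(rows_desc.index))
```

The values move together with their labels, so each cell keeps the same row and column label after sorting.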
In [46]:
ser.sort_values()
Out[46]:
In [47]:
frame.sort_values(by='pen')
Out[47]:
In [48]:
# Sort by multiple columns
frame.sort_values(by=['pen','pencil'])
Out[48]:
Ranking is related to sorting: it assigns each element of the series a position (its rank), starting from 1.
In [49]:
ser.rank()
Out[49]:
In [50]:
ser.rank(method='first')
Out[50]:
In [52]:
# Rank in descending order
ser.rank(ascending=False)
Out[52]:
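The `method` argument only matters when values are tied, which the Series above does not contain. The sketch below uses a made-up Series with one tie to show the difference between the default (average rank) and `method='first'`.

```python
import pandas as pd

# 'red' and 'white' are tied on purpose
tied = pd.Series([5, 0, 3, 5], index=['red', 'blue', 'yellow', 'white'])

average_rank = tied.rank()                  # tied values share the mean rank
first_rank = tied.rank(method='first')      # ties broken by order of appearance
descending_rank = tied.rank(ascending=False)

print(average_rank.tolist())
print(first_rank.tolist())
print(descending_rank.tolist())
```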
Correlation and covariance are two important statistics. pandas computes them with corr() and cov().
In [53]:
seq1 = pd.Series([3,4,3,4,5,4,3,2],index=['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2 = pd.Series([1,2,3,4,4,3,2,1],index=['2006','2007','2008','2009','2010','2011','2012','2013'])
print(seq1.corr(seq2))
print(seq1.cov(seq2))
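As a sanity check on how the two statistics relate, the sketch below recomputes the Pearson correlation from the covariance and the two sample standard deviations. It assumes the defaults of `corr()` (Pearson) and of `std()` and `cov()` (sample statistics with n-1 in the denominator). `manual_corr` is a name introduced here.

```python
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq1 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], index=years)
seq2 = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], index=years)

cov = seq1.cov(seq2)
corr = seq1.corr(seq2)

# Pearson correlation = covariance / (std of seq1 * std of seq2)
manual_corr = cov / (seq1.std() * seq2.std())

print(cov, corr, manual_corr)
```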
In [54]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2
Out[54]:
In [61]:
frame2.corr()
Out[61]:
In [62]:
frame2.cov()
Out[62]:
In [64]:
frame2.corrwith(ser)
Out[64]:
In [65]:
frame2.corrwith(frame)
Out[65]:
In [66]:
ser = pd.Series([0,1,2,np.nan,9],index=['red','blue','yellow','white','green'])
ser
Out[66]:
In [69]:
ser['green']=None
ser
Out[69]:
In [70]:
ser.dropna()
Out[70]:
In [71]:
ser[ser.notnull()]
Out[71]:
In [72]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
index=['blue','green','red'],
columns=['ball','mug','pen'])
frame3
Out[72]:
Calling dropna() on a DataFrame removes every row that contains at least one NaN (or, with axis=1, every column that contains one).
In [73]:
frame3.dropna()
Out[73]:
In [76]:
# Improvement: drop only the rows in which every value is NaN
frame3.dropna(how='all')
Out[76]:
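The sketch below contrasts the default behaviour with `how='all'` on the same frame, and adds the column-wise variant with `axis=1`. The result names are illustrative.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6],
                       [np.nan, np.nan, np.nan],
                       [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

any_nan = frame3.dropna()                        # every row has a NaN in 'mug'
all_nan_rows = frame3.dropna(how='all')          # only 'green' is all NaN
all_nan_cols = frame3.dropna(axis=1, how='all')  # only 'mug' is all NaN

print(any_nan)
print(all_nan_rows)
print(all_nan_cols)
```

With this data the default call returns an empty frame, which is why `how='all'` is the more useful choice here.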
In [75]:
frame3.fillna(0)
Out[75]:
In [77]:
frame3.fillna({'ball':1,'mug':0,'pen':99})
Out[77]:
In [79]:
mser = pd.Series(np.random.rand(8),
index=[['white','white','white','blue','blue','red','red','red'],
['up','down','right','up','down','up','down','left']])
mser
Out[79]:
In [81]:
mser.index
Out[81]:
In [82]:
mser['white']
Out[82]:
In [83]:
mser[:,'up']
Out[83]:
In [85]:
mser['white','up']
Out[85]:
Use unstack() to convert the Series into a DataFrame. The second level of the index becomes the columns.
In [86]:
mser.unstack()
Out[86]:
The inverse operation, stack(), converts a DataFrame into a Series.
In [87]:
frame.stack()
Out[87]:
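A minimal round-trip sketch of `unstack()` and `stack()`. It uses fixed values (`np.arange`) instead of random numbers so the result is reproducible. The explicit `dropna()` after `stack()` is an assumption made so that the example does not depend on whether the installed pandas version keeps or drops the NaN cells created by `unstack()`.

```python
import numpy as np
import pandas as pd

mser = pd.Series(np.arange(8.),
                 index=[['white', 'white', 'white', 'blue', 'blue',
                         'red', 'red', 'red'],
                        ['up', 'down', 'right', 'up', 'down',
                         'up', 'down', 'left']])

# Second index level becomes the columns; missing combinations are NaN
wide = mser.unstack()

# Back to a Series with a two-level index, without the NaN cells
back = wide.stack().dropna()

print(wide)
print(back)
```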
For a DataFrame, hierarchical indexes can be defined on both the rows and the columns.
In [88]:
mframe = pd.DataFrame(np.random.randn(16).reshape((4,4)),
index=[['white','white','red','red'],['up','down','up','down']],
columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe
Out[88]:
In [91]:
# Name the levels of the index and of the columns
mframe.columns.names=['object','id']
mframe.index.names=['colors','status']
mframe
Out[91]:
In [92]:
mframe.swaplevel('colors','status')
Out[92]:
In [93]:
mframe.sort_index(level='colors')  # sortlevel() was removed from pandas
Out[93]:
In [94]:
mframe.groupby(level='colors').sum()  # sum(level=...) was removed in pandas 2.0
Out[94]:
To compute the statistic along a level of the columns instead:
In [95]:
mframe.T.groupby(level='id').sum().T  # group the column level via a transpose
Out[95]:
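The level-wise sums can be written with `groupby(level=...)`, which works in current pandas releases. The sketch below uses fixed values instead of random ones so the totals can be checked by hand. Grouping the column level through a transpose is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_arrays(
    [['white', 'white', 'red', 'red'], ['up', 'down', 'up', 'down']],
    names=['colors', 'status'])
cols = pd.MultiIndex.from_arrays(
    [['pen', 'pen', 'paper', 'paper'], [1, 2, 1, 2]],
    names=['object', 'id'])
mframe = pd.DataFrame(np.arange(16).reshape((4, 4)), index=rows, columns=cols)

# Sum the rows that share the same 'colors' value
by_color = mframe.groupby(level='colors').sum()

# Sum the columns that share the same 'id' value (transpose, group, transpose back)
by_id = mframe.T.groupby(level='id').sum().T

print(by_color)
print(by_id)
```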