Python数值计算库NumPy

—— 一切向量化计算的基础

1. NumPy初探

1.1 开始使用



In [2]:

    
from numpy import cos,sin #避免使用
import numpy as np #np.method()

从三种“列表“定义开始



In [11]:

    
r1 = range(5)
r2 = np.arange(5)
r3 = xrange(5)



In [3]:

    
print r1,r2,r3









    



[0, 1, 2, 3, 4] [0 1 2 3 4] xrange(5)



In [4]:

    
for i in r1:
    print i,
print '\n'
for i in r2:
    print i,
print '\n'
for i in r3:
    print i,



In [5]:

    
print type(r1),type(r2),type(r3)









    



<type 'list'> <type 'numpy.ndarray'> <type 'xrange'>



In [3]:

    
print np.arange(0,5),np.arange(5,0),np.arange(5,0,-1)









    



[0 1 2 3 4] [] [5 4 3 2 1]

IPython能够支持自动补全和帮助：



In [4]:

    
np.arange #Tab









    Out[4]:





<function numpy.core.multiarray.arange>



In [6]:

    
np.arange?

论numpy的正确打开方式：少用原生语法、多用ufunc和broadcasting，少用数据转换



In [7]:

    
def vecadd(n):
    a = range(n)
    b = range(n)
    c = []
    for i in range(n):
        a[i] = i*2
        b[i] = i+3
        c.append(a[i]+b[i])
    return c

def np_vecadd(n):
    a = np.arange(n)*2
    b = np.arange(n)+3
    c = a+b
    return c

N = 10000000

为了比较性能，使用ipython的“魔法函数”timeit，或者datetime库的计时函数：



In [8]:

    
%timeit vecadd(N)









    



1 loops, best of 3: 3.11 s per loop



In [9]:

    
%timeit np_vecadd(N)









    



10 loops, best of 3: 108 ms per loop



In [12]:

    
import datetime
t1 = datetime.datetime.now()
result = vecadd(N)
t2 = datetime.datetime.now()
t2-t1









    Out[12]:





datetime.timedelta(0, 3, 686897)



In [13]:

    
np.arange(5)+3 #ufunc or broadcasting









    Out[13]:





array([3, 4, 5, 6, 7])



In [14]:

    
np.arange(5)*2









    Out[14]:





array([0, 2, 4, 6, 8])



In [15]:

    
a,b,c = 0,2,11
print np.linspace(a,b,c),'\n'*2,np.arange(c)/float(c-1)*(b-a)+a









    



[ 0.   0.2  0.4  0.6  0.8  1.   1.2  1.4  1.6  1.8  2. ] 

[ 0.   0.2  0.4  0.6  0.8  1.   1.2  1.4  1.6  1.8  2. ]

1.2 Numpy数据类型理解



In [12]:

    
#使用np.ndarray类型的r2
print r2,type(r2),r2.dtype,r2.shape









    



[0 1 2 3 4] <type 'numpy.ndarray'> int64 (5,)



In [13]:

    
#Immunity of Mutables
arr1 = np.array([np.arange(5),np.arange(5)])
arr2 = np.array([r2,r2])
arr3 = [r2,r2]
print arr1,'\n'*2,arr2,'\n'*2,arr3
r2[2] = 10
print arr1,'\n'*2,arr2,'\n'*2,arr3
r2[2] = 2
#不需要关心类型是否可变，在np.array创建过程中就已经建立独立拷贝









    



[[0 1 2 3 4]
 [0 1 2 3 4]] 

[[0 1 2 3 4]
 [0 1 2 3 4]] 

[array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])]
[[0 1 2 3 4]
 [0 1 2 3 4]] 

[[0 1 2 3 4]
 [0 1 2 3 4]] 

[array([ 0,  1, 10,  3,  4]), array([ 0,  1, 10,  3,  4])]



In [18]:

    
print arr1,arr1.shape









    



[[0 1 2 3 4]
 [0 1 2 3 4]] (2, 5)



In [19]:

    
print arr1[0,0],arr1[1,2],arr1.dtype



In [20]:

    
print np.float64(arr1) #正确









    



[[ 0.  1.  2.  3.  4.]
 [ 0.  1.  2.  3.  4.]]



In [21]:

    
arr1.dtype = np.float64 #错误
print arr1
arr1.dtype = np.int64
print arr1









    



[[  0.00000000e+000   4.94065646e-324   9.88131292e-324   1.48219694e-323
    1.97626258e-323]
 [  0.00000000e+000   4.94065646e-324   9.88131292e-324   1.48219694e-323
    1.97626258e-323]]
[[0 1 2 3 4]
 [0 1 2 3 4]]



In [22]:

    
# bool : 单个bit的布尔变量，仅存True或者False
# inti : 当前平台的整数类型，int32或int64
import platform
platform.architecture()









    Out[22]:





('64bit', '')

int8: -128 ~ 127
int16: -32768 ~ 32767
int32: -$2^{31}$ ~ $2^{31}$-1
int64: -$2^{63}$ ~ $2^{63}$-1
uint8: 0 ~ 255
uint16: 0 ~ 65535
uint32: 0 ~ $2^{32}$-1
uint64: 0 ~ $2^{64}$-1
float16: 符号＋5bit指数＋10bit小数
float32: 符号＋8bit指数＋23bit小数
float64: 符号＋11bit指数＋52bit小数
complex64: 单精度浮点复数
complex128: 双精度浮点复数



In [24]:

    
i = np.float64(3)
print i

3.0



In [25]:

    
c = np.complex128(1+2j)
print c.real,c.imag



In [26]:

    
print i.dtype.itemsize,c.dtype.itemsize

Python不显示指定数据类型与精度，如何控制数据类型？Numpy自定义



In [27]:

    
sku = np.dtype([('sku_id',np.int32),('desc',np.str_,50),('value',np.float64)])
print sku
sku2 = np.dtype({'names':['sku_id','desc','value'],'formats':['<i4','S50','<f8']})
print sku2









    



[('sku_id', '<i4'), ('desc', 'S50'), ('value', '<f8')]
[('sku_id', '<i4'), ('desc', 'S50'), ('value', '<f8')]



In [28]:

    
online_shop = np.array([(1,'apple',2.3),(2.1,5,5),(3,'banana',True)],dtype=sku)
print online_shop









    



[(1, 'apple', 2.3) (2, '5', 5.0) (3, 'banana', 1.0)]

1.3 Numpy数据操作

数据切片（Slicing）变形（Reshaping）拼接（Stacking）与拆分（Splitting）



In [30]:

    
arr = np.arange(12)
print arr,arr.shape









    



[ 0  1  2  3  4  5  6  7  8  9 10 11] (12,)



In [31]:

    
print arr[3:5],arr[-5:-2],arr[3:28:5]









    



[3 4] [7 8 9] [3 8]



In [32]:

    
arr = arr.reshape(3,2,2)
print arr









    



[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]



In [33]:

    
arr.resize((3,4))
print arr,'\n',arr.ndim,'\n',arr.shape









    



[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 
2 
(3, 4)



In [34]:

    
print arr.T,'\n'*2,arr.transpose()









    



[[ 0  4  8]
 [ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]] 

[[ 0  4  8]
 [ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]]



In [20]:

    
arrflat = arr.flat



In [21]:

    
print arrflat









    



<numpy.flatiter object at 0x1021d2600>



In [22]:

    
print np.array([item for item in arrflat],dtype=np.int64)









    



[ 0  1  2  3  4  5  6  7  8  9 10 11]



In [23]:

    
print arr.flatten(order='C'),'\n'*2, arr.flatten(order='F'),'\n'*2,type(arr.flatten())









    



[ 0  1  2  3  4  5  6  7  8  9 10 11] 

[ 0  4  8  1  5  9  2  6 10  3  7 11] 

<type 'numpy.ndarray'>



In [38]:

    
print arr.ravel(),'\n'*2,type(arr.ravel())









    



[ 0  1  2  3  4  5  6  7  8  9 10 11] 

<type 'numpy.ndarray'>



In [24]:

    
print np.vstack((arr,arr)),'\n'*2,np.row_stack((arr,arr)),'\n'*2, np.concatenate((arr,arr),axis=0)









    



[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]



In [25]:

    
print np.hstack((arr,arr)),'\n'*2,np.column_stack((arr,arr)),'\n'*2, np.concatenate((arr,arr),axis=1)









    



[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]] 

[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]] 

[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]



In [26]:

    
print np.dstack((arr,arr))









    



[[[ 0  0]
  [ 1  1]
  [ 2  2]
  [ 3  3]]

 [[ 4  4]
  [ 5  5]
  [ 6  6]
  [ 7  7]]

 [[ 8  8]
  [ 9  9]
  [10 10]
  [11 11]]]



In [35]:

    
arr.resize((3,4,1))



In [36]:

    
print arr,'\n'*2,np.concatenate((arr,arr),axis=2)









    



[[[ 0]
  [ 1]
  [ 2]
  [ 3]]

 [[ 4]
  [ 5]
  [ 6]
  [ 7]]

 [[ 8]
  [ 9]
  [10]
  [11]]] 

[[[ 0  0]
  [ 1  1]
  [ 2  2]
  [ 3  3]]

 [[ 4  4]
  [ 5  5]
  [ 6  6]
  [ 7  7]]

 [[ 8  8]
  [ 9  9]
  [10 10]
  [11 11]]]



In [37]:

    
print arr,'\n'*2,np.tile(arr,(1,1,3))









    



[[[ 0]
  [ 1]
  [ 2]
  [ 3]]

 [[ 4]
  [ 5]
  [ 6]
  [ 7]]

 [[ 8]
  [ 9]
  [10]
  [11]]] 

[[[ 0  0  0]
  [ 1  1  1]
  [ 2  2  2]
  [ 3  3  3]]

 [[ 4  4  4]
  [ 5  5  5]
  [ 6  6  6]
  [ 7  7  7]]

 [[ 8  8  8]
  [ 9  9  9]
  [10 10 10]
  [11 11 11]]]



In [38]:

    
arr2 = np.arange(8).reshape(2,2,2)
print arr2









    



[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]]



In [39]:

    
print np.vsplit(arr2,2), '\n'*2, np.split(arr2,2,axis=0)









    



[array([[[0, 1],
        [2, 3]]]), array([[[4, 5],
        [6, 7]]])] 

[array([[[0, 1],
        [2, 3]]]), array([[[4, 5],
        [6, 7]]])]



In [40]:

    
print np.hsplit(arr2,2), '\n'*2, np.split(arr2,2,axis=1)









    



[array([[[0, 1]],

       [[4, 5]]]), array([[[2, 3]],

       [[6, 7]]])] 

[array([[[0, 1]],

       [[4, 5]]]), array([[[2, 3]],

       [[6, 7]]])]



In [48]:

    
print np.dsplit(arr2,2), '\n'*2, np.split(arr2,2,axis=2)









    



[array([[[0],
        [2]],

       [[4],
        [6]]]), array([[[1],
        [3]],

       [[5],
        [7]]])] 

[array([[[0],
        [2]],

       [[4],
        [6]]]), array([[[1],
        [3]],

       [[5],
        [7]]])]

2 NumPy数学函数

2.0 使用Random

使用其中的rand、randint等方法生成随机数来方便后续的测试。

随机数生成函数与统计分布将在统计基础课程中另行介绍。



In [41]:

    
arr_random = np.random.randint(1,6,15)
print arr_random









    



[2 3 3 1 5 4 2 3 5 2 3 4 5 4 1]

2.1 数学函数的测试

统计函数返回一个值：



In [42]:

    
print arr_random.max(),arr_random.min(),arr_random.mean(),arr_random.var(),arr_random.std()









    



5 1 3.13333333333 1.71555555556 1.30979218029

Ufunc类函数返回逐个计算值：

很多Ufunc即Universal Functional在Python中是以C语言级别实现的，效率高，建议使用。



In [43]:

    
print np.log(arr_random),'\n'*2,np.exp(arr_random),'\n'*2,arr_random>2,'\n'*2,np.mod(arr_random,2)









    



[ 0.69314718  1.09861229  1.09861229  0.          1.60943791  1.38629436
  0.69314718  1.09861229  1.60943791  0.69314718  1.09861229  1.38629436
  1.60943791  1.38629436  0.        ] 

[   7.3890561    20.08553692   20.08553692    2.71828183  148.4131591
   54.59815003    7.3890561    20.08553692  148.4131591     7.3890561
   20.08553692   54.59815003  148.4131591    54.59815003    2.71828183] 

[False  True  True False  True  True False  True  True False  True  True
  True  True False] 

[0 1 1 1 1 0 0 1 1 0 1 0 1 0 1]

过滤值：类似SAS的Compress（其实是反过来）



In [52]:

    
arr_random.compress(arr_random>2)









    Out[52]:





array([3, 3, 4, 5, 4, 4, 3])

过滤值：类似R语法



In [44]:

    
arr_random[arr_random>2]









    Out[44]:





array([3, 3, 5, 4, 3, 5, 3, 4, 5, 4])

类似R的语法中，用于过滤条件的np.array是可以进行布尔操作的：



In [45]:

    
print arr_random[(arr_random>1) & (arr_random<4)],
print arr_random[(arr_random>1) ^ (arr_random<4)]









    



[2 3 3 2 3 2 3] [1 5 4 5 4 5 4 1]

Python的Clip，对值进行裁剪



In [46]:

    
arr_random.clip(2,4)









    Out[46]:





array([2, 3, 3, 2, 4, 4, 2, 3, 4, 2, 3, 4, 4, 4, 2])

可以方便的进行时间序列操作——差分



In [47]:

    
arr_random = np.random.rand(10)
print arr_random,'\n'*2,np.diff(arr_random)









    



[ 0.50319447  0.85476985  0.69350054  0.85602088  0.39440826  0.82899253
  0.02564264  0.18737798  0.34984262  0.94389631] 

[ 0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427 -0.80334988
  0.16173534  0.16246464  0.59405369]

差分其实是一种最简单的卷积：



In [49]:

    
w = np.array([1,-1]) 
len = 2
print arr_random,'\n'*2,np.convolve(w,arr_random),'\n'*2,np.convolve(w,arr_random)[len-1:-len+1]









    



[ 0.50319447  0.85476985  0.69350054  0.85602088  0.39440826  0.82899253
  0.02564264  0.18737798  0.34984262  0.94389631] 

[ 0.50319447  0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427
 -0.80334988  0.16173534  0.16246464  0.59405369 -0.94389631] 

[ 0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427 -0.80334988
  0.16173534  0.16246464  0.59405369]

差分的反向操作——累计加和（累积乘法类似）：



In [50]:

    
diff = np.convolve(w,arr_random)[0:-1]
print diff
print diff.cumsum(),'\n'*2,diff.cumprod()









    



[ 0.50319447  0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427
 -0.80334988  0.16173534  0.16246464  0.59405369]
[ 0.50319447  0.85476985  0.69350054  0.85602088  0.39440826  0.82899253
  0.02564264  0.18737798  0.34984262  0.94389631] 

[  5.03194468e-01   1.76910787e-01  -2.85302802e-02  -4.63675088e-03
   2.14038276e-03   9.30176676e-04  -7.47257324e-04  -1.20857914e-04
  -1.96351376e-05  -1.16643260e-05]

排序时间到了：给出排序结果，本地排序，并保留原数组下标位置



In [54]:

    
arr_random = np.random.rand(10)



In [55]:

    
print np.sort(arr_random)









    



[ 0.08786745  0.09802729  0.15244481  0.15467132  0.18330197  0.2293121
  0.29645834  0.52844745  0.72392558  0.87188892]



In [56]:

    
print arr_random









    



[ 0.15244481  0.15467132  0.29645834  0.87188892  0.09802729  0.08786745
  0.72392558  0.52844745  0.2293121   0.18330197]



In [58]:

    
arr_random.sort()
print arr_random









    



[ 0.08786745  0.09802729  0.15244481  0.15467132  0.18330197  0.2293121
  0.29645834  0.52844745  0.72392558  0.87188892]



In [63]:

    
arr_random = np.random.rand(10)

使用enumerate和sorted排序：

关于更多匿名函数（Lambda）使用，请参考Python函数式编程，也就是Functional Programming相关内容。



In [64]:

    
s = sorted(enumerate(arr_random),key=lambda x:x[1],reverse=False)
print s









    



[(4, 0.12863450206826932), (9, 0.32980836827086768), (2, 0.33491594197762742), (1, 0.33654387285035214), (3, 0.40929439162767556), (7, 0.42846880907613405), (6, 0.50874654192233604), (5, 0.58841594297657973), (8, 0.67497111346825522), (0, 0.99078322046068001)]



In [65]:

    
zip(*s)[0]









    Out[65]:





(4, 9, 2, 1, 3, 7, 6, 5, 8, 0)



In [66]:

    
print type(enumerate(arr_random))
for idx,value in enumerate(arr_random):
    print idx,value
print zip(xrange(10),arr_random)









    



<type 'enumerate'>
0 0.990783220461
1 0.33654387285
2 0.334915941978
3 0.409294391628
4 0.128634502068
5 0.588415942977
6 0.508746541922
7 0.428468809076
8 0.674971113468
9 0.329808368271
[(0, 0.99078322046068001), (1, 0.33654387285035214), (2, 0.33491594197762742), (3, 0.40929439162767556), (4, 0.12863450206826932), (5, 0.58841594297657973), (6, 0.50874654192233604), (7, 0.42846880907613405), (8, 0.67497111346825522), (9, 0.32980836827086768)]



In [67]:

    
print np.lexsort((np.arange(10),arr_random)) #接受一个参数，传入元组









    



[4 9 2 1 3 7 6 5 8 0]



In [68]:

    
print np.argsort(arr_random) #相当于lexsort参数元组第一列给下标









    



[4 9 2 1 3 7 6 5 8 0]

3 二维数组与矩阵操作

3.1 基本定义

直接定义值：



In [63]:

    
arr2d_random = np.random.rand(10).reshape(5,2)
arr2d_zeros = np.zeros_like(arr2d_random)
arr2d_ones = np.ones_like(arr2d_random)



In [64]:

    
print arr2d_random,'\n'*2,arr2d_zeros,'\n'*2,arr2d_ones









    



[[ 0.67758622  0.37036523]
 [ 0.10966036  0.81607858]
 [ 0.82083969  0.11857728]
 [ 0.6070126   0.83232989]
 [ 0.27118225  0.09858034]] 

[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]] 

[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]

直接填充常用值：



In [71]:

    
print np.zeros((5,2)), '\n'*2 , np.ones((5,2)),'\n'*2,np.eye(3)









    



[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]] 

[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]] 

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

善于利用函数者，可以使用函数，传入下标作为参数，生成二维数组：

”首相卡梅隆“函数：



In [72]:

    
def CameronThePrimeMinister(i,j):
    return (i+1)*(j+1)

print np.fromfunction(CameronThePrimeMinister,(9,9))









    



[[  1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  2.   4.   6.   8.  10.  12.  14.  16.  18.]
 [  3.   6.   9.  12.  15.  18.  21.  24.  27.]
 [  4.   8.  12.  16.  20.  24.  28.  32.  36.]
 [  5.  10.  15.  20.  25.  30.  35.  40.  45.]
 [  6.  12.  18.  24.  30.  36.  42.  48.  54.]
 [  7.  14.  21.  28.  35.  42.  49.  56.  63.]
 [  8.  16.  24.  32.  40.  48.  56.  64.  72.]
 [  9.  18.  27.  36.  45.  54.  63.  72.  81.]]

习惯R者，使用broadcasting，生成二维数组，更具体的原理和应用在后续介绍：

9 * 1的二维数组和长度为9的一维数组broadcasting两次，互相补齐对方较短的维度：



In [59]:

    
x = np.arange(1,10)

print x.reshape(-1,1),'\n'*2,x,'\n'*2,x.reshape(-1,1) * x









    



[[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]] 

[1 2 3 4 5 6 7 8 9] 

[[ 1  2  3  4  5  6  7  8  9]
 [ 2  4  6  8 10 12 14 16 18]
 [ 3  6  9 12 15 18 21 24 27]
 [ 4  8 12 16 20 24 28 32 36]
 [ 5 10 15 20 25 30 35 40 45]
 [ 6 12 18 24 30 36 42 48 54]
 [ 7 14 21 28 35 42 49 56 63]
 [ 8 16 24 32 40 48 56 64 72]
 [ 9 18 27 36 45 54 63 72 81]]

更进一步缩减代码，生成正交栅格用于广播的方法，支持灵活的运算：

状如切片，功能如range/linspace，实数第三参数代表步长，虚数第三参数代表总长度



In [60]:

    
x,y = np.ogrid[0:1:6j,0:1:6j]
print x,y
print np.exp(-x**2-y**2)









    



[[ 0. ]
 [ 0.2]
 [ 0.4]
 [ 0.6]
 [ 0.8]
 [ 1. ]] [[ 0.   0.2  0.4  0.6  0.8  1. ]]
[[ 1.          0.96078944  0.85214379  0.69767633  0.52729242  0.36787944]
 [ 0.96078944  0.92311635  0.81873075  0.67032005  0.50661699  0.35345468]
 [ 0.85214379  0.81873075  0.72614904  0.59452055  0.44932896  0.31348618]
 [ 0.69767633  0.67032005  0.59452055  0.48675226  0.36787944  0.25666078]
 [ 0.52729242  0.50661699  0.44932896  0.36787944  0.2780373   0.19398004]
 [ 0.36787944  0.35345468  0.31348618  0.25666078  0.19398004  0.13533528]]

习惯数据库的笛卡尔积者，可以用outer关键字进行某些运算，支持传入iterable更能节省空间和时间：



In [61]:

    
np.multiply.outer(xrange(1,10),xrange(1,10))









    Out[61]:





array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 2,  4,  6,  8, 10, 12, 14, 16, 18],
       [ 3,  6,  9, 12, 15, 18, 21, 24, 27],
       [ 4,  8, 12, 16, 20, 24, 28, 32, 36],
       [ 5, 10, 15, 20, 25, 30, 35, 40, 45],
       [ 6, 12, 18, 24, 30, 36, 42, 48, 54],
       [ 7, 14, 21, 28, 35, 42, 49, 56, 63],
       [ 8, 16, 24, 32, 40, 48, 56, 64, 72],
       [ 9, 18, 27, 36, 45, 54, 63, 72, 81]])

二维数组和矩阵有一些不同：

矩阵可以方便的求逆、直接做乘法，二维数组的＊只能作为np.multiply而不是np.dot：



In [65]:

    
cov =  np.cov(arr2d_random.T)
print cov









    



[[ 0.08765291 -0.035521  ]
 [-0.035521    0.12995714]]



In [66]:

    
stdiag = np.diag(np.sqrt(np.diag(cov)))
print stdiag









    



[[ 0.29606233  0.        ]
 [ 0.          0.36049569]]



In [67]:

    
invstdiag = np.array(np.mat(stdiag).I)
print invstdiag









    



[[ 3.37766708  0.        ]
 [ 0.          2.77395825]]



In [69]:

    
#错误
invstdiag*cov*invstdiag









    Out[69]:





array([[ 1., -0.],
       [-0.,  1.]])



In [70]:

    
#正确
np.mat(stdiag).I*np.mat(cov)*np.mat(stdiag).I









    Out[70]:





matrix([[ 1.       , -0.3328143],
        [-0.3328143,  1.       ]])



In [71]:

    
#验证
np.corrcoef(arr2d_random.T)









    Out[71]:





array([[ 1.       , -0.3328143],
       [-0.3328143,  1.       ]])

一般最小二乘法求解线性回归（OLS）

要进行OLS，核心是矩阵求逆：

$ y = X \beta + \beta_0 \rightarrow y ＝ X \beta + \beta_0 + \epsilon \rightarrow \hat{\beta} = (X^TX)^{-1}X^Ty $

随机制造数据来做个OLS的例子：



In [72]:

    
x = np.random.rand(30).reshape(15,2)
y = x[:,0]*0.7 - x[:,1]*0.2 + 0.1+0.1*np.random.rand(15)



In [73]:

    
print y,'\n'*2,x









    



[ 0.55541422  0.21316423  0.28526456  0.75681944  0.06493666  0.30003525
  0.26751402  0.70417776  0.19398665  0.51653352  0.22651041  0.30310932
  0.55547587  0.67303199  0.27000489] 

[[ 0.86168706  0.9843286 ]
 [ 0.29516578  0.61655325]
 [ 0.23422653  0.33417053]
 [ 0.96572639  0.52828207]
 [ 0.13055911  0.83228165]
 [ 0.55808258  0.9865468 ]
 [ 0.15394166  0.0973267 ]
 [ 0.76897767  0.16983656]
 [ 0.25024672  0.44084084]
 [ 0.65832659  0.61386939]
 [ 0.13896686  0.12434786]
 [ 0.28149567  0.22562727]
 [ 0.72422763  0.73299455]
 [ 0.91282181  0.50645161]
 [ 0.18151339  0.15076421]]



In [75]:

    
Xcvr = np.mat(np.hstack((np.ones(15).reshape(15,1),x)))
print Xcvr









    



[[ 1.          0.86168706  0.9843286 ]
 [ 1.          0.29516578  0.61655325]
 [ 1.          0.23422653  0.33417053]
 [ 1.          0.96572639  0.52828207]
 [ 1.          0.13055911  0.83228165]
 [ 1.          0.55808258  0.9865468 ]
 [ 1.          0.15394166  0.0973267 ]
 [ 1.          0.76897767  0.16983656]
 [ 1.          0.25024672  0.44084084]
 [ 1.          0.65832659  0.61386939]
 [ 1.          0.13896686  0.12434786]
 [ 1.          0.28149567  0.22562727]
 [ 1.          0.72422763  0.73299455]
 [ 1.          0.91282181  0.50645161]
 [ 1.          0.18151339  0.15076421]]

给出模型参数的估计：$\hat{\beta} = (X^TX)^{-1}X^Ty $



In [76]:

    
H = Xcvr*(Xcvr.T*Xcvr).I*Xcvr.T
betahats = np.dot((Xcvr.T*Xcvr).I*Xcvr.T,y)
preds = np.dot(H,y)
print betahats,'\n'*2,preds









    



[[ 0.16623078  0.74514118 -0.2600523 ]] 

[[ 0.55233238  0.22583487  0.25386079  0.74845231  0.04707899  0.32552733
   0.25562902  0.69506132  0.23805824  0.49713888  0.23744376  0.3173099
   0.51526569  0.71470799  0.2622773 ]]

对照statsmodels的OLS结果：



In [77]:

    
import statsmodels.formula.api as sm
model = sm.OLS(y,Xcvr).fit()
model.summary()









    



/Users/wangweiyang/anaconda/anaconda/lib/python2.7/site-packages/scipy/stats/stats.py:1233: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=15
  int(n))






    Out[77]:





OLS Regression Results

  Dep. Variable:             y           R-squared:             0.987


  Model:                    OLS          Adj. R-squared:        0.985


  Method:              Least Squares     F-statistic:           448.7


  Date:              Sat, 06 Jun 2015    Prob (F-statistic):  5.28e-12


  Time:                  15:42:28        Log-Likelihood:       34.808


  No. Observations:           15         AIC:                  -63.62


  Df Residuals:               12         BIC:                  -61.49


  Df Model:                    2                                     


  Covariance Type:       nonrobust                                   




           coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  const      0.1662      0.015     11.182   0.000      0.134     0.199


  x1         0.7451      0.025     29.856   0.000      0.691     0.800


  x2        -0.2601      0.026    -10.159   0.000     -0.316    -0.204




  Omnibus:         0.516    Durbin-Watson:         3.068


  Prob(Omnibus):   0.773    Jarque-Bera (JB):      0.567


  Skew:           -0.339    Prob(JB):              0.753


  Kurtosis:        2.332    Cond. No.               5.39

Page-Rank算法的核心: 矩阵相乘

关系矩阵$R(i,j)$元素表示从第i个页面到第j个页面发生了跳转。

初始向量w反复乘以按行标准化后的概率转移矩阵（要保证行和全部为1）。

当内存空间足够时，直接用矩阵迭代乘法将矩阵与自己相乘，空间换时间。



In [78]:

    
rand_relation = (np.random.rand(100)>0.7).reshape(10,10)
rand_relation.dtype=np.int8



In [88]:

    
rand_relation









    Out[88]:





array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
       [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int8)



In [89]:

    
rand_relation = rand_relation-np.diag(np.diag(rand_relation))



In [79]:

    
rand_relation









    Out[79]:





array([[1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
       [1, 1, 1, 0, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 1, 0, 0, 1, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int8)



In [80]:

    
invrowsum = 1/(np.sum(rand_relation,axis=1)+1e-12)
print invrowsum
diag_invrowsum = np.diag(invrowsum)









    



[ 0.2         0.16666667  0.16666667  0.2         1.          0.33333333
  0.33333333  0.25        0.33333333  1.        ]



In [81]:

    
tpm = np.dot(diag_invrowsum,rand_relation+1e-12/10)
tpm = (tpm>1e-12/10)*tpm
print tpm









    



[[ 0.2         0.          0.          0.2         0.2         0.          0.2
   0.          0.2         0.        ]
 [ 0.16666667  0.16666667  0.16666667  0.          0.          0.16666667
   0.          0.16666667  0.          0.16666667]
 [ 0.16666667  0.          0.16666667  0.16666667  0.16666667  0.16666667
   0.          0.          0.          0.16666667]
 [ 0.2         0.2         0.          0.          0.2         0.2         0.
   0.          0.          0.2       ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          1.          0.        ]
 [ 0.33333333  0.33333333  0.33333333  0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.33333333  0.          0.33333333
   0.          0.33333333  0.          0.        ]
 [ 0.25        0.25        0.          0.          0.25        0.          0.
   0.          0.25        0.        ]
 [ 0.33333333  0.          0.          0.33333333  0.          0.          0.
   0.          0.          0.33333333]
 [ 0.          0.          0.          0.          1.          0.          0.
   0.          0.          0.        ]]



In [82]:

    
import scipy.sparse as ssp
sp_tpm = ssp.csr_matrix(tpm)



In [83]:

    
sp_tpm









    Out[83]:





<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 37 stored elements in Compressed Sparse Row format>



In [95]:

    
for i in xrange(10):
    sp_tpm = sp_tpm.dot(sp_tpm)
    print repr(sp_tpm)

sp_tpm_a = sp_tpm.toarray()
print sp_tpm_a,'\n'*2,sp_tpm_a[0,:]









    



<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 90 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
[[ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]] 

[ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
  0.02836983  0.20294049  0.10191621  0.08804292]

4. 还有一件事……文件读写

Numpy的分析功能虽然大部分被Pandas替代，但文件读写仍然是一个科学计算库的基本功。

熟悉Matlab和R语言的人对下列语法不会陌生：



In [96]:

    
a = np.random.rand(10).reshape(5,2)
print a
np.save('a.npy',a)
a_copy = np.load('a.npy')
print a_copy









    



[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]]
[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]]

保存多个变量时，等号前面是变量的key，从结果r里面可以取到。

如果不使用key来保存a，那么通过“arr_0”,"arr_1"来取到。



In [97]:

    
b = range(5)
np.savez('a_and_b.npz',a,bdata=b)
r = np.load('a_and_b.npz')
print r["arr_0"],'\n'*2,r["bdata"]









    



[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]] 

[0 1 2 3 4]

最后，对txt，csv，tsv的读写：

csv意为comma separated values

tsv意为tab separated values



In [98]:

    
np.savetxt("a.txt",a,delimiter = '\t')
a_copy2 = np.loadtxt("a.txt",delimiter = '\t')
print a_copy2









    



[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]]



In [ ]:

Dep. Variable:	y	R-squared:	0.987
Model:	OLS	Adj. R-squared:	0.985
Method:	Least Squares	F-statistic:	448.7
Date:	Sat, 06 Jun 2015	Prob (F-statistic):	5.28e-12
Time:	15:42:28	Log-Likelihood:	34.808
No. Observations:	15	AIC:	-63.62
Df Residuals:	12	BIC:	-61.49
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
const	0.1662	0.015	11.182	0.000	0.134 0.199
x1	0.7451	0.025	29.856	0.000	0.691 0.800
x2	-0.2601	0.026	-10.159	0.000	-0.316 -0.204

Omnibus:	0.516	Durbin-Watson:	3.068
Prob(Omnibus):	0.773	Jarque-Bera (JB):	0.567
Skew:	-0.339	Prob(JB):	0.753
Kurtosis:	2.332	Cond. No.	5.39