Python数值计算库NumPy

—— 一切向量化计算的基础

1. NumPy初探

1.1 开始使用


In [2]:
from numpy import cos,sin #避免使用
import numpy as np #np.method()

从三种“列表“定义开始


In [11]:
r1 = range(5)
r2 = np.arange(5)
r3 = xrange(5)

In [3]:
print r1,r2,r3


[0, 1, 2, 3, 4] [0 1 2 3 4] xrange(5)

In [4]:
for i in r1:
    print i,
print '\n'
for i in r2:
    print i,
print '\n'
for i in r3:
    print i,


0 1 2 3 4 

0 1 2 3 4 

0 1 2 3 4

In [5]:
print type(r1),type(r2),type(r3)


<type 'list'> <type 'numpy.ndarray'> <type 'xrange'>

In [3]:
print np.arange(0,5),np.arange(5,0),np.arange(5,0,-1)


[0 1 2 3 4] [] [5 4 3 2 1]

IPython能够支持自动补全和帮助:


In [4]:
np.arange #Tab


Out[4]:
<function numpy.core.multiarray.arange>

In [6]:
np.arange?

论numpy的正确打开方式:少用原生语法、多用ufunc和broadcasting,少用数据转换


In [7]:
def vecadd(n):
    a = range(n)
    b = range(n)
    c = []
    for i in range(n):
        a[i] = i*2
        b[i] = i+3
        c.append(a[i]+b[i])
    return c

def np_vecadd(n):
    a = np.arange(n)*2
    b = np.arange(n)+3
    c = a+b
    return c

N = 10000000

为了比较性能,使用ipython的“魔法函数”timeit,或者datetime库的计时函数:


In [8]:
%timeit vecadd(N)


1 loops, best of 3: 3.11 s per loop

In [9]:
%timeit np_vecadd(N)


10 loops, best of 3: 108 ms per loop

In [12]:
import datetime
t1 = datetime.datetime.now()
result = vecadd(N)
t2 = datetime.datetime.now()
t2-t1


Out[12]:
datetime.timedelta(0, 3, 686897)

In [13]:
np.arange(5)+3 #ufunc or broadcasting


Out[13]:
array([3, 4, 5, 6, 7])

In [14]:
np.arange(5)*2


Out[14]:
array([0, 2, 4, 6, 8])

In [15]:
a,b,c = 0,2,11
print np.linspace(a,b,c),'\n'*2,np.arange(c)/float(c-1)*(b-a)+a


[ 0.   0.2  0.4  0.6  0.8  1.   1.2  1.4  1.6  1.8  2. ] 

[ 0.   0.2  0.4  0.6  0.8  1.   1.2  1.4  1.6  1.8  2. ]

1.2 Numpy数据类型理解


In [12]:
#使用np.ndarray类型的r2
print r2,type(r2),r2.dtype,r2.shape


[0 1 2 3 4] <type 'numpy.ndarray'> int64 (5,)

In [13]:
#Immunity of Mutables
arr1 = np.array([np.arange(5),np.arange(5)])
arr2 = np.array([r2,r2])
arr3 = [r2,r2]
print arr1,'\n'*2,arr2,'\n'*2,arr3
r2[2] = 10
print arr1,'\n'*2,arr2,'\n'*2,arr3
r2[2] = 2
#不需要关心类型是否可变,在np.array创建过程中就已经建立独立拷贝


[[0 1 2 3 4]
 [0 1 2 3 4]] 

[[0 1 2 3 4]
 [0 1 2 3 4]] 

[array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])]
[[0 1 2 3 4]
 [0 1 2 3 4]] 

[[0 1 2 3 4]
 [0 1 2 3 4]] 

[array([ 0,  1, 10,  3,  4]), array([ 0,  1, 10,  3,  4])]

In [18]:
print arr1,arr1.shape


[[0 1 2 3 4]
 [0 1 2 3 4]] (2, 5)

In [19]:
print arr1[0,0],arr1[1,2],arr1.dtype


0 2 int64

In [20]:
print np.float64(arr1) #正确


[[ 0.  1.  2.  3.  4.]
 [ 0.  1.  2.  3.  4.]]

In [21]:
arr1.dtype = np.float64 #错误
print arr1
arr1.dtype = np.int64
print arr1


[[  0.00000000e+000   4.94065646e-324   9.88131292e-324   1.48219694e-323
    1.97626258e-323]
 [  0.00000000e+000   4.94065646e-324   9.88131292e-324   1.48219694e-323
    1.97626258e-323]]
[[0 1 2 3 4]
 [0 1 2 3 4]]

In [22]:
# bool : 单个bit的布尔变量,仅存True或者False
# inti : 当前平台的整数类型,int32或int64
import platform
platform.architecture()


Out[22]:
('64bit', '')
  • int8: -128 ~ 127
  • int16: -32768 ~ 32767
  • int32: -$2^{31}$ ~ $2^{31}$-1
  • int64: -$2^{63}$ ~ $2^{63}$-1
  • uint8: 0 ~ 255
  • uint16: 0 ~ 65535
  • uint32: 0 ~ $2^{32}$-1
  • uint64: 0 ~ $2^{64}$-1
  • float16: 符号+5bit指数+10bit小数
  • float32: 符号+8bit指数+23bit小数
  • float64: 符号+11bit指数+52bit小数
  • complex64: 单精度浮点复数
  • complex128: 双精度浮点复数

In [24]:
i = np.float64(3)
print i


3.0

In [25]:
c = np.complex128(1+2j)
print c.real,c.imag


1.0 2.0

In [26]:
print i.dtype.itemsize,c.dtype.itemsize


8 16

Python不显示指定数据类型与精度,如何控制数据类型?Numpy自定义


In [27]:
sku = np.dtype([('sku_id',np.int32),('desc',np.str_,50),('value',np.float64)])
print sku
sku2 = np.dtype({'names':['sku_id','desc','value'],'formats':['<i4','S50','<f8']})
print sku2


[('sku_id', '<i4'), ('desc', 'S50'), ('value', '<f8')]
[('sku_id', '<i4'), ('desc', 'S50'), ('value', '<f8')]

In [28]:
online_shop = np.array([(1,'apple',2.3),(2.1,5,5),(3,'banana',True)],dtype=sku)
print online_shop


[(1, 'apple', 2.3) (2, '5', 5.0) (3, 'banana', 1.0)]

1.3 Numpy数据操作

数据切片(Slicing)变形(Reshaping)拼接(Stacking)与拆分(Splitting)


In [30]:
arr = np.arange(12)
print arr,arr.shape


[ 0  1  2  3  4  5  6  7  8  9 10 11] (12,)

In [31]:
print arr[3:5],arr[-5:-2],arr[3:28:5]


[3 4] [7 8 9] [3 8]

In [32]:
arr = arr.reshape(3,2,2)
print arr


[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]

In [33]:
arr.resize((3,4))
print arr,'\n',arr.ndim,'\n',arr.shape


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 
2 
(3, 4)

In [34]:
print arr.T,'\n'*2,arr.transpose()


[[ 0  4  8]
 [ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]] 

[[ 0  4  8]
 [ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]]

In [20]:
arrflat = arr.flat

In [21]:
print arrflat


<numpy.flatiter object at 0x1021d2600>

In [22]:
print np.array([item for item in arrflat],dtype=np.int64)


[ 0  1  2  3  4  5  6  7  8  9 10 11]

In [23]:
print arr.flatten(order='C'),'\n'*2, arr.flatten(order='F'),'\n'*2,type(arr.flatten())


[ 0  1  2  3  4  5  6  7  8  9 10 11] 

[ 0  4  8  1  5  9  2  6 10  3  7 11] 

<type 'numpy.ndarray'>

In [38]:
print arr.ravel(),'\n'*2,type(arr.ravel())


[ 0  1  2  3  4  5  6  7  8  9 10 11] 

<type 'numpy.ndarray'>

In [24]:
print np.vstack((arr,arr)),'\n'*2,np.row_stack((arr,arr)),'\n'*2, np.concatenate((arr,arr),axis=0)


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

In [25]:
print np.hstack((arr,arr)),'\n'*2,np.column_stack((arr,arr)),'\n'*2, np.concatenate((arr,arr),axis=1)


[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]] 

[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]] 

[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]

In [26]:
print np.dstack((arr,arr))


[[[ 0  0]
  [ 1  1]
  [ 2  2]
  [ 3  3]]

 [[ 4  4]
  [ 5  5]
  [ 6  6]
  [ 7  7]]

 [[ 8  8]
  [ 9  9]
  [10 10]
  [11 11]]]

In [35]:
arr.resize((3,4,1))

In [36]:
print arr,'\n'*2,np.concatenate((arr,arr),axis=2)


[[[ 0]
  [ 1]
  [ 2]
  [ 3]]

 [[ 4]
  [ 5]
  [ 6]
  [ 7]]

 [[ 8]
  [ 9]
  [10]
  [11]]] 

[[[ 0  0]
  [ 1  1]
  [ 2  2]
  [ 3  3]]

 [[ 4  4]
  [ 5  5]
  [ 6  6]
  [ 7  7]]

 [[ 8  8]
  [ 9  9]
  [10 10]
  [11 11]]]

In [37]:
print arr,'\n'*2,np.tile(arr,(1,1,3))


[[[ 0]
  [ 1]
  [ 2]
  [ 3]]

 [[ 4]
  [ 5]
  [ 6]
  [ 7]]

 [[ 8]
  [ 9]
  [10]
  [11]]] 

[[[ 0  0  0]
  [ 1  1  1]
  [ 2  2  2]
  [ 3  3  3]]

 [[ 4  4  4]
  [ 5  5  5]
  [ 6  6  6]
  [ 7  7  7]]

 [[ 8  8  8]
  [ 9  9  9]
  [10 10 10]
  [11 11 11]]]

In [38]:
arr2 = np.arange(8).reshape(2,2,2)
print arr2


[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]]

In [39]:
print np.vsplit(arr2,2), '\n'*2, np.split(arr2,2,axis=0)


[array([[[0, 1],
        [2, 3]]]), array([[[4, 5],
        [6, 7]]])] 

[array([[[0, 1],
        [2, 3]]]), array([[[4, 5],
        [6, 7]]])]

In [40]:
print np.hsplit(arr2,2), '\n'*2, np.split(arr2,2,axis=1)


[array([[[0, 1]],

       [[4, 5]]]), array([[[2, 3]],

       [[6, 7]]])] 

[array([[[0, 1]],

       [[4, 5]]]), array([[[2, 3]],

       [[6, 7]]])]

In [48]:
print np.dsplit(arr2,2), '\n'*2, np.split(arr2,2,axis=2)


[array([[[0],
        [2]],

       [[4],
        [6]]]), array([[[1],
        [3]],

       [[5],
        [7]]])] 

[array([[[0],
        [2]],

       [[4],
        [6]]]), array([[[1],
        [3]],

       [[5],
        [7]]])]

2 NumPy数学函数

2.0 使用Random

使用其中的rand、randint等方法生成随机数来方便后续的测试。

随机数生成函数与统计分布将在统计基础课程中另行介绍。


In [41]:
arr_random = np.random.randint(1,6,15)
print arr_random


[2 3 3 1 5 4 2 3 5 2 3 4 5 4 1]

2.1 数学函数的测试

统计函数返回一个值:


In [42]:
print arr_random.max(),arr_random.min(),arr_random.mean(),arr_random.var(),arr_random.std()


5 1 3.13333333333 1.71555555556 1.30979218029

Ufunc类函数返回逐个计算值:

很多Ufunc即Universal Functional在Python中是以C语言级别实现的,效率高,建议使用。


In [43]:
print np.log(arr_random),'\n'*2,np.exp(arr_random),'\n'*2,arr_random>2,'\n'*2,np.mod(arr_random,2)


[ 0.69314718  1.09861229  1.09861229  0.          1.60943791  1.38629436
  0.69314718  1.09861229  1.60943791  0.69314718  1.09861229  1.38629436
  1.60943791  1.38629436  0.        ] 

[   7.3890561    20.08553692   20.08553692    2.71828183  148.4131591
   54.59815003    7.3890561    20.08553692  148.4131591     7.3890561
   20.08553692   54.59815003  148.4131591    54.59815003    2.71828183] 

[False  True  True False  True  True False  True  True False  True  True
  True  True False] 

[0 1 1 1 1 0 0 1 1 0 1 0 1 0 1]

过滤值:类似SAS的Compress(其实是反过来)


In [52]:
arr_random.compress(arr_random>2)


Out[52]:
array([3, 3, 4, 5, 4, 4, 3])

过滤值:类似R语法


In [44]:
arr_random[arr_random>2]


Out[44]:
array([3, 3, 5, 4, 3, 5, 3, 4, 5, 4])

类似R的语法中,用于过滤条件的np.array是可以进行布尔操作的:


In [45]:
print arr_random[(arr_random>1) & (arr_random<4)],
print arr_random[(arr_random>1) ^ (arr_random<4)]


[2 3 3 2 3 2 3] [1 5 4 5 4 5 4 1]

Python的Clip,对值进行裁剪


In [46]:
arr_random.clip(2,4)


Out[46]:
array([2, 3, 3, 2, 4, 4, 2, 3, 4, 2, 3, 4, 4, 4, 2])

可以方便的进行时间序列操作——差分


In [47]:
arr_random = np.random.rand(10)
print arr_random,'\n'*2,np.diff(arr_random)


[ 0.50319447  0.85476985  0.69350054  0.85602088  0.39440826  0.82899253
  0.02564264  0.18737798  0.34984262  0.94389631] 

[ 0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427 -0.80334988
  0.16173534  0.16246464  0.59405369]

差分其实是一种最简单的卷积:


In [49]:
w = np.array([1,-1]) 
len = 2
print arr_random,'\n'*2,np.convolve(w,arr_random),'\n'*2,np.convolve(w,arr_random)[len-1:-len+1]


[ 0.50319447  0.85476985  0.69350054  0.85602088  0.39440826  0.82899253
  0.02564264  0.18737798  0.34984262  0.94389631] 

[ 0.50319447  0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427
 -0.80334988  0.16173534  0.16246464  0.59405369 -0.94389631] 

[ 0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427 -0.80334988
  0.16173534  0.16246464  0.59405369]

差分的反向操作——累计加和(累积乘法类似):


In [50]:
diff = np.convolve(w,arr_random)[0:-1]
print diff
print diff.cumsum(),'\n'*2,diff.cumprod()


[ 0.50319447  0.35157538 -0.16126931  0.16252034 -0.46161263  0.43458427
 -0.80334988  0.16173534  0.16246464  0.59405369]
[ 0.50319447  0.85476985  0.69350054  0.85602088  0.39440826  0.82899253
  0.02564264  0.18737798  0.34984262  0.94389631] 

[  5.03194468e-01   1.76910787e-01  -2.85302802e-02  -4.63675088e-03
   2.14038276e-03   9.30176676e-04  -7.47257324e-04  -1.20857914e-04
  -1.96351376e-05  -1.16643260e-05]

排序时间到了:给出排序结果,本地排序,并保留原数组下标位置


In [54]:
arr_random = np.random.rand(10)

In [55]:
print np.sort(arr_random)


[ 0.08786745  0.09802729  0.15244481  0.15467132  0.18330197  0.2293121
  0.29645834  0.52844745  0.72392558  0.87188892]

In [56]:
print arr_random


[ 0.15244481  0.15467132  0.29645834  0.87188892  0.09802729  0.08786745
  0.72392558  0.52844745  0.2293121   0.18330197]

In [58]:
arr_random.sort()
print arr_random


[ 0.08786745  0.09802729  0.15244481  0.15467132  0.18330197  0.2293121
  0.29645834  0.52844745  0.72392558  0.87188892]

In [63]:
arr_random = np.random.rand(10)

使用enumerate和sorted排序:

关于更多匿名函数(Lambda)使用,请参考Python函数式编程,也就是Functional Programming相关内容。


In [64]:
s = sorted(enumerate(arr_random),key=lambda x:x[1],reverse=False)
print s


[(4, 0.12863450206826932), (9, 0.32980836827086768), (2, 0.33491594197762742), (1, 0.33654387285035214), (3, 0.40929439162767556), (7, 0.42846880907613405), (6, 0.50874654192233604), (5, 0.58841594297657973), (8, 0.67497111346825522), (0, 0.99078322046068001)]

In [65]:
zip(*s)[0]


Out[65]:
(4, 9, 2, 1, 3, 7, 6, 5, 8, 0)

In [66]:
print type(enumerate(arr_random))
for idx,value in enumerate(arr_random):
    print idx,value
print zip(xrange(10),arr_random)


<type 'enumerate'>
0 0.990783220461
1 0.33654387285
2 0.334915941978
3 0.409294391628
4 0.128634502068
5 0.588415942977
6 0.508746541922
7 0.428468809076
8 0.674971113468
9 0.329808368271
[(0, 0.99078322046068001), (1, 0.33654387285035214), (2, 0.33491594197762742), (3, 0.40929439162767556), (4, 0.12863450206826932), (5, 0.58841594297657973), (6, 0.50874654192233604), (7, 0.42846880907613405), (8, 0.67497111346825522), (9, 0.32980836827086768)]

In [67]:
print np.lexsort((np.arange(10),arr_random)) #接受一个参数,传入元组


[4 9 2 1 3 7 6 5 8 0]

In [68]:
print np.argsort(arr_random) #相当于lexsort参数元组第一列给下标


[4 9 2 1 3 7 6 5 8 0]

3 二维数组与矩阵操作

3.1 基本定义

直接定义值:


In [63]:
arr2d_random = np.random.rand(10).reshape(5,2)
arr2d_zeros = np.zeros_like(arr2d_random)
arr2d_ones = np.ones_like(arr2d_random)

In [64]:
print arr2d_random,'\n'*2,arr2d_zeros,'\n'*2,arr2d_ones


[[ 0.67758622  0.37036523]
 [ 0.10966036  0.81607858]
 [ 0.82083969  0.11857728]
 [ 0.6070126   0.83232989]
 [ 0.27118225  0.09858034]] 

[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]] 

[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]

直接填充常用值:


In [71]:
print np.zeros((5,2)), '\n'*2 , np.ones((5,2)),'\n'*2,np.eye(3)


[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]] 

[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]] 

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

善于利用函数者,可以使用函数,传入下标作为参数,生成二维数组:

”首相卡梅隆“函数:


In [72]:
def CameronThePrimeMinister(i,j):
    return (i+1)*(j+1)

print np.fromfunction(CameronThePrimeMinister,(9,9))


[[  1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  2.   4.   6.   8.  10.  12.  14.  16.  18.]
 [  3.   6.   9.  12.  15.  18.  21.  24.  27.]
 [  4.   8.  12.  16.  20.  24.  28.  32.  36.]
 [  5.  10.  15.  20.  25.  30.  35.  40.  45.]
 [  6.  12.  18.  24.  30.  36.  42.  48.  54.]
 [  7.  14.  21.  28.  35.  42.  49.  56.  63.]
 [  8.  16.  24.  32.  40.  48.  56.  64.  72.]
 [  9.  18.  27.  36.  45.  54.  63.  72.  81.]]

习惯R者,使用broadcasting,生成二维数组,更具体的原理和应用在后续介绍:

9 * 1的二维数组和长度为9的一维数组broadcasting两次,互相补齐对方较短的维度:


In [59]:
x = np.arange(1,10)

print x.reshape(-1,1),'\n'*2,x,'\n'*2,x.reshape(-1,1) * x


[[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]] 

[1 2 3 4 5 6 7 8 9] 

[[ 1  2  3  4  5  6  7  8  9]
 [ 2  4  6  8 10 12 14 16 18]
 [ 3  6  9 12 15 18 21 24 27]
 [ 4  8 12 16 20 24 28 32 36]
 [ 5 10 15 20 25 30 35 40 45]
 [ 6 12 18 24 30 36 42 48 54]
 [ 7 14 21 28 35 42 49 56 63]
 [ 8 16 24 32 40 48 56 64 72]
 [ 9 18 27 36 45 54 63 72 81]]

更进一步缩减代码,生成正交栅格用于广播的方法,支持灵活的运算:

状如切片,功能如range/linspace,实数第三参数代表步长,虚数第三参数代表总长度


In [60]:
x,y = np.ogrid[0:1:6j,0:1:6j]
print x,y
print np.exp(-x**2-y**2)


[[ 0. ]
 [ 0.2]
 [ 0.4]
 [ 0.6]
 [ 0.8]
 [ 1. ]] [[ 0.   0.2  0.4  0.6  0.8  1. ]]
[[ 1.          0.96078944  0.85214379  0.69767633  0.52729242  0.36787944]
 [ 0.96078944  0.92311635  0.81873075  0.67032005  0.50661699  0.35345468]
 [ 0.85214379  0.81873075  0.72614904  0.59452055  0.44932896  0.31348618]
 [ 0.69767633  0.67032005  0.59452055  0.48675226  0.36787944  0.25666078]
 [ 0.52729242  0.50661699  0.44932896  0.36787944  0.2780373   0.19398004]
 [ 0.36787944  0.35345468  0.31348618  0.25666078  0.19398004  0.13533528]]

习惯数据库的笛卡尔积者,可以用outer关键字进行某些运算,支持传入iterable更能节省空间和时间:


In [61]:
np.multiply.outer(xrange(1,10),xrange(1,10))


Out[61]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 2,  4,  6,  8, 10, 12, 14, 16, 18],
       [ 3,  6,  9, 12, 15, 18, 21, 24, 27],
       [ 4,  8, 12, 16, 20, 24, 28, 32, 36],
       [ 5, 10, 15, 20, 25, 30, 35, 40, 45],
       [ 6, 12, 18, 24, 30, 36, 42, 48, 54],
       [ 7, 14, 21, 28, 35, 42, 49, 56, 63],
       [ 8, 16, 24, 32, 40, 48, 56, 64, 72],
       [ 9, 18, 27, 36, 45, 54, 63, 72, 81]])

二维数组和矩阵有一些不同:

矩阵可以方便的求逆、直接做乘法,二维数组的*只能作为np.multiply而不是np.dot:


In [65]:
cov =  np.cov(arr2d_random.T)
print cov


[[ 0.08765291 -0.035521  ]
 [-0.035521    0.12995714]]

In [66]:
stdiag = np.diag(np.sqrt(np.diag(cov)))
print stdiag


[[ 0.29606233  0.        ]
 [ 0.          0.36049569]]

In [67]:
invstdiag = np.array(np.mat(stdiag).I)
print invstdiag


[[ 3.37766708  0.        ]
 [ 0.          2.77395825]]

In [69]:
#错误
invstdiag*cov*invstdiag


Out[69]:
array([[ 1., -0.],
       [-0.,  1.]])

In [70]:
#正确
np.mat(stdiag).I*np.mat(cov)*np.mat(stdiag).I


Out[70]:
matrix([[ 1.       , -0.3328143],
        [-0.3328143,  1.       ]])

In [71]:
#验证
np.corrcoef(arr2d_random.T)


Out[71]:
array([[ 1.       , -0.3328143],
       [-0.3328143,  1.       ]])

一般最小二乘法求解线性回归(OLS)

要进行OLS,核心是矩阵求逆:

$ y = X \beta + \beta_0 \rightarrow y = X \beta + \beta_0 + \epsilon \rightarrow \hat{\beta} = (X^TX)^{-1}X^Ty $

随机制造数据来做个OLS的例子:


In [72]:
x = np.random.rand(30).reshape(15,2)
y = x[:,0]*0.7 - x[:,1]*0.2 + 0.1+0.1*np.random.rand(15)

In [73]:
print y,'\n'*2,x


[ 0.55541422  0.21316423  0.28526456  0.75681944  0.06493666  0.30003525
  0.26751402  0.70417776  0.19398665  0.51653352  0.22651041  0.30310932
  0.55547587  0.67303199  0.27000489] 

[[ 0.86168706  0.9843286 ]
 [ 0.29516578  0.61655325]
 [ 0.23422653  0.33417053]
 [ 0.96572639  0.52828207]
 [ 0.13055911  0.83228165]
 [ 0.55808258  0.9865468 ]
 [ 0.15394166  0.0973267 ]
 [ 0.76897767  0.16983656]
 [ 0.25024672  0.44084084]
 [ 0.65832659  0.61386939]
 [ 0.13896686  0.12434786]
 [ 0.28149567  0.22562727]
 [ 0.72422763  0.73299455]
 [ 0.91282181  0.50645161]
 [ 0.18151339  0.15076421]]

In [75]:
Xcvr = np.mat(np.hstack((np.ones(15).reshape(15,1),x)))
print Xcvr


[[ 1.          0.86168706  0.9843286 ]
 [ 1.          0.29516578  0.61655325]
 [ 1.          0.23422653  0.33417053]
 [ 1.          0.96572639  0.52828207]
 [ 1.          0.13055911  0.83228165]
 [ 1.          0.55808258  0.9865468 ]
 [ 1.          0.15394166  0.0973267 ]
 [ 1.          0.76897767  0.16983656]
 [ 1.          0.25024672  0.44084084]
 [ 1.          0.65832659  0.61386939]
 [ 1.          0.13896686  0.12434786]
 [ 1.          0.28149567  0.22562727]
 [ 1.          0.72422763  0.73299455]
 [ 1.          0.91282181  0.50645161]
 [ 1.          0.18151339  0.15076421]]

给出模型参数的估计:$\hat{\beta} = (X^TX)^{-1}X^Ty $


In [76]:
H = Xcvr*(Xcvr.T*Xcvr).I*Xcvr.T
betahats = np.dot((Xcvr.T*Xcvr).I*Xcvr.T,y)
preds = np.dot(H,y)
print betahats,'\n'*2,preds


[[ 0.16623078  0.74514118 -0.2600523 ]] 

[[ 0.55233238  0.22583487  0.25386079  0.74845231  0.04707899  0.32552733
   0.25562902  0.69506132  0.23805824  0.49713888  0.23744376  0.3173099
   0.51526569  0.71470799  0.2622773 ]]

对照statsmodels的OLS结果:


In [77]:
import statsmodels.formula.api as sm
model = sm.OLS(y,Xcvr).fit()
model.summary()


/Users/wangweiyang/anaconda/anaconda/lib/python2.7/site-packages/scipy/stats/stats.py:1233: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=15
  int(n))
Out[77]:
OLS Regression Results
Dep. Variable: y R-squared: 0.987
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 448.7
Date: Sat, 06 Jun 2015 Prob (F-statistic): 5.28e-12
Time: 15:42:28 Log-Likelihood: 34.808
No. Observations: 15 AIC: -63.62
Df Residuals: 12 BIC: -61.49
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const 0.1662 0.015 11.182 0.000 0.134 0.199
x1 0.7451 0.025 29.856 0.000 0.691 0.800
x2 -0.2601 0.026 -10.159 0.000 -0.316 -0.204
Omnibus: 0.516 Durbin-Watson: 3.068
Prob(Omnibus): 0.773 Jarque-Bera (JB): 0.567
Skew: -0.339 Prob(JB): 0.753
Kurtosis: 2.332 Cond. No. 5.39

Page-Rank算法的核心: 矩阵相乘

关系矩阵$R(i,j)$元素表示从第i个页面到第j个页面发生了跳转。

初始向量w反复乘以按行标准化后的概率转移矩阵(要保证行和全部为1)。

当内存空间足够时,直接用矩阵迭代乘法将矩阵与自己相乘,空间换时间。


In [78]:
rand_relation = (np.random.rand(100)>0.7).reshape(10,10)
rand_relation.dtype=np.int8

In [88]:
rand_relation


Out[88]:
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
       [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int8)

In [89]:
rand_relation = rand_relation-np.diag(np.diag(rand_relation))

In [79]:
rand_relation


Out[79]:
array([[1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
       [1, 1, 1, 0, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 1, 0, 0, 1, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int8)

In [80]:
invrowsum = 1/(np.sum(rand_relation,axis=1)+1e-12)
print invrowsum
diag_invrowsum = np.diag(invrowsum)


[ 0.2         0.16666667  0.16666667  0.2         1.          0.33333333
  0.33333333  0.25        0.33333333  1.        ]

In [81]:
tpm = np.dot(diag_invrowsum,rand_relation+1e-12/10)
tpm = (tpm>1e-12/10)*tpm
print tpm


[[ 0.2         0.          0.          0.2         0.2         0.          0.2
   0.          0.2         0.        ]
 [ 0.16666667  0.16666667  0.16666667  0.          0.          0.16666667
   0.          0.16666667  0.          0.16666667]
 [ 0.16666667  0.          0.16666667  0.16666667  0.16666667  0.16666667
   0.          0.          0.          0.16666667]
 [ 0.2         0.2         0.          0.          0.2         0.2         0.
   0.          0.          0.2       ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          1.          0.        ]
 [ 0.33333333  0.33333333  0.33333333  0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.33333333  0.          0.33333333
   0.          0.33333333  0.          0.        ]
 [ 0.25        0.25        0.          0.          0.25        0.          0.
   0.          0.25        0.        ]
 [ 0.33333333  0.          0.          0.33333333  0.          0.          0.
   0.          0.          0.33333333]
 [ 0.          0.          0.          0.          1.          0.          0.
   0.          0.          0.        ]]

In [82]:
import scipy.sparse as ssp
sp_tpm = ssp.csr_matrix(tpm)

In [83]:
sp_tpm


Out[83]:
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 37 stored elements in Compressed Sparse Row format>

In [95]:
for i in xrange(10):
    sp_tpm = sp_tpm.dot(sp_tpm)
    print repr(sp_tpm)

sp_tpm_a = sp_tpm.toarray()
print sp_tpm_a,'\n'*2,sp_tpm_a[0,:]


<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 90 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
<10x10 sparse matrix of type '<type 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>
[[ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]
 [ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
   0.02836983  0.20294049  0.10191621  0.08804292]] 

[ 0.15452327  0.11008338  0.09542497  0.08082163  0.09042249  0.04745482
  0.02836983  0.20294049  0.10191621  0.08804292]

4. 还有一件事……文件读写

Numpy的分析功能虽然大部分被Pandas替代,但文件读写仍然是一个科学计算库的基本功。

熟悉Matlab和R语言的人对下列语法不会陌生:


In [96]:
a = np.random.rand(10).reshape(5,2)
print a
np.save('a.npy',a)
a_copy = np.load('a.npy')
print a_copy


[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]]
[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]]

保存多个变量时,等号前面是变量的key,从结果r里面可以取到。

如果不使用key来保存a,那么通过“arr_0”,"arr_1"来取到。


In [97]:
b = range(5)
np.savez('a_and_b.npz',a,bdata=b)
r = np.load('a_and_b.npz')
print r["arr_0"],'\n'*2,r["bdata"]


[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]] 

[0 1 2 3 4]

最后,对txt,csv,tsv的读写:

csv意为comma separated values

tsv意为tab separated values


In [98]:
np.savetxt("a.txt",a,delimiter = '\t')
a_copy2 = np.loadtxt("a.txt",delimiter = '\t')
print a_copy2


[[ 0.93525653  0.76660826]
 [ 0.01380756  0.55349255]
 [ 0.59566385  0.76267376]
 [ 0.91820398  0.32110088]
 [ 0.04019846  0.34415171]]

In [ ]: