Loading packages


In [2]:
import numpy as np
import matplotlib.pylab as py
import pandas as pa
import scipy.stats as st
np.set_printoptions(precision=2)
%matplotlib inline

Discrete Random Variables

In this section we show a few example of discrete random variables using Python.

The documentation for these routines can be found at:

http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html


In [3]:
X=st.bernoulli(p=0.3)
X.rvs(100)


Out[3]:
array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 0])

In [4]:
# Note that "high" is not included.
X=st.randint(low=1,high=5)
X.rvs(100)


Out[4]:
array([4, 2, 4, 1, 4, 1, 4, 3, 3, 2, 2, 2, 1, 1, 2, 3, 3, 2, 1, 4, 1, 3, 3,
       1, 4, 2, 4, 3, 1, 3, 2, 4, 3, 3, 4, 3, 4, 3, 2, 1, 3, 2, 4, 4, 2, 3,
       1, 2, 3, 2, 3, 2, 1, 4, 3, 4, 2, 3, 2, 3, 4, 4, 2, 3, 3, 4, 3, 1, 3,
       1, 1, 2, 3, 1, 3, 4, 4, 3, 2, 4, 1, 2, 3, 3, 3, 4, 4, 4, 4, 1, 3, 4,
       1, 2, 1, 3, 3, 1, 2, 1])

Continuous Random Variables

The documentation for these routines can be found at:

http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html


In [5]:
XUniform=st.uniform(loc=0.7,scale=0.3);
# "bins" tells you how many bars to use
# "normed" says to turn the counts into probability densities
py.hist(XUniform.rvs(1000000),bins=20,normed=True);
x = np.linspace(-0.1,1.1,100)
py.plot(x,XUniform.pdf(x))
#py.savefig('Figures/uniformPDF.png')


Out[5]:
[<matplotlib.lines.Line2D at 0xcb862b0>]

In [7]:
py.plot(XUniform.cdf(x))
#py.savefig('Figures/uniformCDF.png')


Out[7]:
[<matplotlib.lines.Line2D at 0xe910e10>]

In [8]:
XNormal=st.norm(loc=0,scale=1);
# "bins" tells you how many bars to use
# "normed" says to turn the counts into probability densities
py.hist(XNormal.rvs(1000),bins=100,normed=True);
x = np.linspace(-3,3,100)
py.plot(x,XNormal.pdf(x))
#py.savefig('Figures/normalPDF.png')


Out[8]:
[<matplotlib.lines.Line2D at 0xeb762e8>]

In [11]:
py.plot(XNormal.cdf(x))
#py.savefig('Figures/normalCDF.png')


Out[11]:
[<matplotlib.lines.Line2D at 0xf22b518>]

Now we can look at the histograms of some of our data from Case Study 2.


In [14]:
data = pa.read_hdf('data.h5','movies')

In [15]:
data


Out[15]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama
5 18 1193 4 978156168 F 18 3 95825 One Flew Over the Cuckoo's Nest (1975) Drama
6 19 1193 5 982730936 M 1 10 48073 One Flew Over the Cuckoo's Nest (1975) Drama
7 24 1193 5 978136709 F 25 7 10023 One Flew Over the Cuckoo's Nest (1975) Drama
8 28 1193 3 978125194 F 25 1 14607 One Flew Over the Cuckoo's Nest (1975) Drama
9 33 1193 5 978557765 M 45 3 55421 One Flew Over the Cuckoo's Nest (1975) Drama
10 39 1193 5 978043535 M 18 4 61820 One Flew Over the Cuckoo's Nest (1975) Drama
11 42 1193 3 978038981 M 25 8 24502 One Flew Over the Cuckoo's Nest (1975) Drama
12 44 1193 4 978018995 M 45 17 98052 One Flew Over the Cuckoo's Nest (1975) Drama
13 47 1193 4 977978345 M 18 4 94305 One Flew Over the Cuckoo's Nest (1975) Drama
14 48 1193 4 977975061 M 25 4 92107 One Flew Over the Cuckoo's Nest (1975) Drama
15 49 1193 4 978813972 M 18 12 77084 One Flew Over the Cuckoo's Nest (1975) Drama
16 53 1193 5 977946400 M 25 0 96931 One Flew Over the Cuckoo's Nest (1975) Drama
17 54 1193 5 977944039 M 50 1 56723 One Flew Over the Cuckoo's Nest (1975) Drama
18 58 1193 5 977933866 M 25 2 30303 One Flew Over the Cuckoo's Nest (1975) Drama
19 59 1193 4 977934292 F 50 1 55413 One Flew Over the Cuckoo's Nest (1975) Drama
20 62 1193 4 977968584 F 35 3 98105 One Flew Over the Cuckoo's Nest (1975) Drama
21 80 1193 4 977786172 M 56 1 49327 One Flew Over the Cuckoo's Nest (1975) Drama
22 81 1193 5 977785864 F 25 0 60640 One Flew Over the Cuckoo's Nest (1975) Drama
23 88 1193 5 977694161 F 45 1 02476 One Flew Over the Cuckoo's Nest (1975) Drama
24 89 1193 5 977683596 F 56 9 85749 One Flew Over the Cuckoo's Nest (1975) Drama
25 95 1193 5 977626632 M 45 0 98201 One Flew Over the Cuckoo's Nest (1975) Drama
26 96 1193 3 977621789 F 25 16 78028 One Flew Over the Cuckoo's Nest (1975) Drama
27 99 1193 2 982791053 F 1 10 19390 One Flew Over the Cuckoo's Nest (1975) Drama
28 102 1193 5 1040737607 M 35 19 20871 One Flew Over the Cuckoo's Nest (1975) Drama
29 104 1193 2 977546620 M 25 12 00926 One Flew Over the Cuckoo's Nest (1975) Drama
... ... ... ... ... ... ... ... ... ... ...
1000179 4933 3084 3 962757020 M 25 15 94040 Home Page (1999) Documentary
1000180 4802 2218 2 1014866656 M 56 1 40601 Juno and Paycock (1930) Drama
1000181 4812 2308 2 962932391 M 18 14 25301 Detroit 9000 (1973) Action|Crime
1000182 4874 624 4 962781918 F 25 4 70808 Condition Red (1995) Action|Drama|Thriller
1000183 5059 1434 4 962484364 M 45 16 22652 Stranger, The (1994) Action
1000184 5947 1434 4 957190428 F 45 16 97215 Stranger, The (1994) Action
1000185 5077 1868 3 962417299 M 25 2 20037 Truce, The (1996) Drama|War
1000186 5944 1868 1 957197520 F 18 10 27606 Truce, The (1996) Drama|War
1000187 5105 404 3 962337582 M 50 7 18977 Brother Minister: The Assassination of Malcolm... Documentary
1000188 5185 404 4 963402617 F 35 4 44485 Brother Minister: The Assassination of Malcolm... Documentary
1000189 5532 404 5 959619841 M 25 17 27408 Brother Minister: The Assassination of Malcolm... Documentary
1000190 5543 404 3 960127592 M 25 17 97401 Brother Minister: The Assassination of Malcolm... Documentary
1000191 5220 2543 3 961546137 M 25 7 91436 Six Ways to Sunday (1997) Comedy
1000192 5754 2543 4 958272316 F 18 1 60640 Six Ways to Sunday (1997) Comedy
1000193 5227 591 3 961475931 M 18 10 64050 Tough and Deadly (1995) Action|Drama|Thriller
1000194 5795 591 1 958145253 M 25 1 92688 Tough and Deadly (1995) Action|Drama|Thriller
1000195 5313 3656 5 960920392 M 56 0 55406 Lured (1947) Crime
1000196 5328 2438 4 960838075 F 25 4 91740 Outside Ozona (1998) Drama|Thriller
1000197 5334 3323 3 960796159 F 56 13 46140 Chain of Fools (2000) Comedy|Crime
1000198 5334 127 1 960795494 F 56 13 46140 Silence of the Palace, The (Saimt el Qusur) (1... Drama
1000199 5334 3382 5 960796159 F 56 13 46140 Song of Freedom (1936) Drama
1000200 5420 1843 3 960156505 F 1 19 14850 Slappy and the Stinkers (1998) Children's|Comedy
1000201 5433 286 3 960240881 F 35 17 45014 Nemesis 2: Nebula (1995) Action|Sci-Fi|Thriller
1000202 5494 3530 4 959816296 F 35 17 94306 Smoking/No Smoking (1993) Comedy
1000203 5556 2198 3 959445515 M 45 6 92103 Modulations (1998) Documentary
1000204 5949 2198 5 958846401 M 18 17 47901 Modulations (1998) Documentary
1000205 5675 2703 3 976029116 M 35 14 30030 Broken Vessels (1998) Drama
1000206 5780 2845 1 958153068 M 18 17 92886 White Boys (1999) Drama
1000207 5851 3607 5 957756608 F 18 20 55410 One Little Indian (1973) Comedy|Drama|Western
1000208 5938 2909 4 957273353 M 25 1 35401 Five Wives, Three Secretaries and Me (1998) Documentary

1000209 rows × 10 columns


In [16]:
data['title'][100000]


Out[16]:
'Payback (1999)'

In [17]:
X=data.pivot_table('rating',index='timestamp',aggfunc='count')

In [18]:
X.plot()


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0xf5187f0>

In [19]:
# Warning: Some versions of Pandas use "index" and "columns", some use "rows" and "cols"
X=data.pivot_table('rating',index='title',aggfunc='sum')
#X=data.pivot_table('rating',rows='title',aggfunc='sum')

In [20]:
X


Out[20]:
title
$1,000,000 Duck (1971)                             112
'Night Mother (1986)                               236
'Til There Was You (1997)                          140
'burbs, The (1989)                                 882
...And Justice for All (1979)                      739
1-900 (1994)                                         5
10 Things I Hate About You (1999)                 2396
101 Dalmatians (1961)                             2032
101 Dalmatians (1996)                             1109
12 Angry Men (1957)                               2646
13th Warrior, The (1999)                          2369
187 (1997)                                         151
2 Days in the Valley (1996)                        939
20 Dates (1998)                                    397
20,000 Leagues Under the Sea (1954)               2129
200 Cigarettes (1999)                              522
2001: A Space Odyssey (1968)                      6982
2010 (1984)                                       1606
24 7: Twenty Four Seven (1997)                      20
24-hour Woman (1998)                                16
28 Days (2000)                                    1548
3 Ninjas: High Noon On Mega Mountain (1998)         64
3 Strikes (2000)                                    11
301, 302 (1995)                                     26
39 Steps, The (1935)                              1031
400 Blows, The (Les Quatre cents coups) (1959)     808
42 Up (1998)                                       372
52 Pick-Up (1986)                                  462
54 (1998)                                          716
7th Voyage of Sinbad, The (1958)                   933
                                                  ... 
Wrongfully Accused (1998)                          314
Wyatt Earp (1994)                                  882
X-Files: Fight the Future, The (1998)             3479
X-Men (2000)                                      5773
X: The Unknown (1956)                               34
Xiu Xiu: The Sent-Down Girl (Tian yu) (1998)       252
Yankee Zulu (1994)                                   6
Yards, The (1999)                                  248
Year My Voice Broke, The (1987)                    103
Year of Living Dangerously (1982)                 1523
Year of the Horse (1997)                            13
Yellow Submarine (1968)                           1475
Yojimbo (1961)                                     947
You Can't Take It With You (1938)                  309
You So Crazy (1994)                                 34
You've Got Mail (1998)                            2833
Young Doctors in Love (1982)                       206
Young Frankenstein (1974)                         5071
Young Guns (1988)                                 1921
Young Guns II (1990)                              1073
Young Poisoner's Handbook, The (1995)              287
Young Sherlock Holmes (1985)                      1285
Young and Innocent (1937)                           33
Your Friends and Neighbors (1998)                  368
Zachariah (1971)                                     7
Zed & Two Noughts, A (1985)                         99
Zero Effect (1998)                                1129
Zero Kelvin (Kj�rlighetens kj�tere) (1995)           7
Zeus and Roxanne (1997)                             58
eXistenZ (1999)                                   1335
Name: rating, dtype: int64

In [21]:
X.hist()


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x118d9f28>

In [22]:
# Warning: Some versions of Pandas use "index" and "columns", some use "rows" and "cols"
X=data.pivot_table('rating',index='occupation',aggfunc='sum')
#X=data.pivot_table('rating',rows='occupation',aggfunc='sum')

In [23]:
X


Out[23]:
occupation
0     461646
1     305270
2     178897
3     115630
4     463433
5      77295
6     136229
7     379506
8       9381
9      41484
10     82276
11     74384
12    209060
13     52014
14    177700
15     84684
16    165518
17    263126
18     42665
19     50883
20    211232
Name: rating, dtype: int64

Central limit theorem

Here we show an example of the central limit theorem. You can play around with "numberOfDistributions" and "numberOfSamples" to see how quickly this converges to something that looks Gaussian.


In [24]:
numberOfDistributions = 100
numberOfSamples = 1000
XTest = st.uniform(loc=0,scale=1);
# The same thing works with many distributions.
#XTest = st.lognorm(s=1.0);
XCLT=np.zeros([numberOfSamples])
for i in range(numberOfSamples):
    for j in range(numberOfDistributions):
        XCLT[i] += XTest.rvs()
    XCLT[i] = XCLT[i]/numberOfDistributions

In [25]:
py.hist(XCLT,normed=True)


Out[25]:
(array([  0.25,   0.8 ,   3.25,  11.  ,  12.25,  12.1 ,   6.3 ,   2.85,
          1.1 ,   0.1 ]),
 array([ 0.4 ,  0.42,  0.44,  0.46,  0.48,  0.5 ,  0.52,  0.54,  0.56,
         0.58,  0.6 ]),
 <a list of 10 Patch objects>)

Linear Algebra

Some basic ideas in Linear Algebra and how you can use them in Python.


In [26]:
import numpy as np

In [27]:
a=np.array([1,2,3])
a


Out[27]:
array([1, 2, 3])

In [28]:
A=np.matrix(np.random.randint(1,10,size=[3,3]))
A


Out[28]:
matrix([[1, 7, 6],
        [1, 1, 6],
        [4, 2, 5]])

In [29]:
x=np.matrix([[1],[2],[3]])
print x
print x.T


[[1]
 [2]
 [3]]
[[1 2 3]]

In [30]:
a*a


Out[30]:
array([1, 4, 9])

In [31]:
np.dot(a,a)


Out[31]:
14

In [32]:
x.T*x


Out[32]:
matrix([[14]])

In [33]:
A*x


Out[33]:
matrix([[33],
        [21],
        [23]])

In [34]:
b = np.matrix([[5],[6],[7]])
b


Out[34]:
matrix([[5],
        [6],
        [7]])

In [35]:
Ai = np.linalg.inv(A)
print A
print Ai


[[1 7 6]
 [1 1 6]
 [4 2 5]]
[[-0.06 -0.2   0.32]
 [ 0.17 -0.17  0.  ]
 [-0.02  0.23 -0.05]]

In [36]:
A*Ai


Out[36]:
matrix([[ 1.,  0.,  0.],
        [ 0.,  1.,  0.],
        [ 0.,  0.,  1.]])

In [37]:
Ai*A


Out[37]:
matrix([[  1.00e+00,   0.00e+00,  -2.22e-16],
        [  0.00e+00,   1.00e+00,   0.00e+00],
        [  0.00e+00,   0.00e+00,   1.00e+00]])

In [38]:
xHat = Ai*b
xHat


Out[38]:
matrix([[ 0.69],
        [-0.17],
        [ 0.91]])

In [39]:
print A*xHat
print b


[[ 5.]
 [ 6.]
 [ 7.]]
[[5]
 [6]
 [7]]

But matrix inversion can be very expensive.


In [40]:
sizes = range(100,1000,200)
times = np.zeros(len(sizes))
for i in range(len(sizes)):
    A = np.random.random(size=[sizes[i],sizes[i]])
    x = %timeit -o np.linalg.inv(A)
    times[i] = x.best
py.plot(sizes,times)


The slowest run took 15.27 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 345 µs per loop
100 loops, best of 3: 6.5 ms per loop
10 loops, best of 3: 21.9 ms per loop
10 loops, best of 3: 56.1 ms per loop
10 loops, best of 3: 110 ms per loop
Out[40]:
[<matplotlib.lines.Line2D at 0x12134128>]

Something slightly more advanced: Sparse matrices.

Sparse matrices (those with lots of 0s) can often be worked with much more efficiently than general matrices than standard methods.


In [41]:
from scipy.sparse.linalg import spsolve
from scipy.sparse import rand,eye
mySize = 1000
A=rand(mySize,mySize,0.001)+eye(mySize)
b=np.random.random(size=[mySize])

The sparsity structure of A.


In [42]:
py.spy(A,markersize=0.1)


Out[42]:
<matplotlib.lines.Line2D at 0x11f9cac8>

In [43]:
dense = %timeit -o np.linalg.solve(A.todense(),b)


10 loops, best of 3: 42.7 ms per loop

In [44]:
sparse = %timeit -o spsolve(A,b)


The slowest run took 12.11 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.03 ms per loop

In [45]:
dense.best/sparse.best


Out[45]:
41.60531476347737

Descriptive statistics

Pandas provides many routines for computing statistics.


In [59]:
XNormal=st.norm(loc=0.7,scale=2);
x = XNormal.rvs(1000)
print np.mean(x)
print np.std(x)
print np.var(x)


0.741294831708
1.95243659129
3.81200864301

But empirical measures are not always good approximations of the true properties of the distribution.


In [62]:
sizes = np.arange(16)+1
errors = np.zeros(16)
for i in range(16):
    x = XNormal.rvs(2**i)
    errors[i] = np.abs(0.7-np.mean(x))
py.plot(sizes,errors)
py.plot(sizes,2/np.sqrt(sizes))
py.plot(sizes,2*2/np.sqrt(sizes),'r')
#py.savefig('Figures/errorInMean.png')


Out[62]:
[<matplotlib.lines.Line2D at 0x21523160>]

Playing around with data


In [63]:
data.pivot_table?

In [64]:
X=data.pivot_table('rating',index='title',aggfunc='mean')
#X=data.pivot_table('rating',rows='title',aggfunc='mean')

In [65]:
hist(X)

In [66]:
X=data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
#X=data.pivot_table('rating',rows='title',cols='gender',aggfunc='mean')

In [67]:
py.subplot(1,2,1)
X['M'].hist()
py.subplot(1,2,2)
X['F'].hist()


Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x21269b70>

In [68]:
py.plot(X['M'],X['F'],'.')


Out[68]:
[<matplotlib.lines.Line2D at 0x1b82a080>]

In [55]:
X.cov()


Out[55]:
gender F M
gender
F 0.510240 0.351683
M 0.351683 0.470851

In [56]:
X.corr()


Out[56]:
gender F M
gender
F 1.00000 0.76319
M 0.76319 1.00000

In [57]:
X=data.pivot_table('rating',index='occupation',columns='gender',aggfunc='mean')
#X=data.pivot_table('rating',rows='occupation',cols='gender',aggfunc='mean')

In [ ]:
X

In [ ]: