In [1]:

    
# http://ipython.org/ipython-doc/rel-1.1.0/api/generated/IPython.core.magics.pylab.html#
%pylab --no-import-all inline









    



Populating the interactive namespace from numpy and matplotlib

Warm up with sequences, lists

Warm-up exercise I: verifying sum of integers calculated by (young) Gauss

http://mathandmultimedia.com/2010/09/15/sum-first-n-positive-integers/

Gauss displayed his genius at an early age. According to anecdotes, when he was in primary school, he was punished by his teacher due to misbehavior. He was told to add the numbers from 1 to 100. He was able to compute its sum, which is 5050, in a matter of seconds.

Now, how on earth did he do it?

Triangular numbers

$T_n= \sum_{k=1}^n k = 1+2+3+ \dotsb +n = \frac{n(n+1)}{2} = {n+1 \choose 2}$



In [11]:

    
from itertools import islice

def triangular():
    n = 1
    i = 1
    while True:
        yield n
        i +=1
        n += i



In [14]:

    
for i, n in enumerate(islice(triangular(), 10)):
    print i+1, n



In [12]:

    
list(islice(triangular(), 100))[-1]









    Out[12]:





5050



In [13]:

    
list(islice(triangular(),99,100))[0]









    Out[13]:





5050

Warm Up Exercise II: Wheat and chessboard problem

http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem :

If a chessboard were to have wheat placed upon each square such that one grain were placed on the first square, two on the second, four on the third, and so on (doubling the number of grains on each subsequent square), how many grains of wheat would be on the chessboard at the finish?

The total number of grains equals 18,446,744,073,709,551,615, which is a much higher number than most people intuitively expect.

try using pow



In [17]:

    
# Legend of the Chessboard YouTube video

from IPython.display import YouTubeVideo
YouTubeVideo('t3d0Y-JpRRg')









    Out[17]:



In [3]:

    
# generator comprehension

k  = (pow(2,n) for n in xrange(64))
k.next()









    Out[3]:





1



In [14]:

    
__builtin__.sum((pow(2,n) for n in xrange(64)))









    Out[14]:





18446744073709551615L



In [20]:

    
pow(2,64) -1









    Out[20]:





18446744073709551615L

Slicing/Indexing Review

http://stackoverflow.com/a/509295/7782

Use on any of the sequence types (python docs on sequence types):

There are seven sequence types: strings, Unicode strings, lists, tuples, bytearrays, buffers, and xrange objects.

The use of square brackets are for accessing slices of sequence.

Let's remind ourselves of how to use slices

s[i]
s[i:j]
s[i:j:k]
meaning of negative indices
0-base counting



In [21]:

    
m = range(10)
m









    Out[21]:





[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]



In [22]:

    
m[0]









    Out[22]:





0



In [23]:

    
m[-1]









    Out[23]:





9



In [24]:

    
m[::-1]









    Out[24]:





[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]



In [25]:

    
m[2:3]









    Out[25]:





[2]



In [26]:

    
import string
alphabet = string.lowercase

alphabet









    Out[26]:





'abcdefghijklmnopqrstuvwxyz'



In [27]:

    
# 13 letter of the alphabet
alphabet[12]









    Out[27]:





'm'

We will revisit generalized slicing in NumPy.

Import/naming conventions and pylab mode

http://my.safaribooksonline.com/book/programming/python/9781449323592/1dot-preliminaries/id2699702

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame

These imports done for you in pylab mode.

pylab mode

ipython --help

yields

--pylab=<CaselessStrEnum> (InteractiveShellApp.pylab)
    Default: None
    Choices: ['tk', 'qt', 'wx', 'gtk', 'osx', 'inline', 'auto']
    Pre-load matplotlib and numpy for interactive use, selecting a particular
    matplotlib backend and loop integration.



In [24]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame

NumPy

http://www.numpy.org/:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

a powerful N-dimensional array object [let's start with 1 and 2 dimensions]
sophisticated (broadcasting) functions [what is broadcasting?]
tools for integrating C/C++ and Fortran code [why useful?]
useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

See PfDA, Chapter 4

ndarray.ndim, ndarray.shape



In [29]:

    
# first: a numpy array of zero-dimension

a0 = np.array(5)
a0









    Out[29]:





array(5)

use shape to get a tuple of array dimensions



In [30]:

    
a0.ndim, a0.shape









    Out[30]:





(0, ())



In [31]:

    
# 1-d array
a1 = np.array([1,2])
a1.ndim, a1.shape









    Out[31]:





(1, (2,))



In [32]:

    
# 2-d array

a2 = np.array(([1,2], [3,4]))
a2.ndim, a2.shape









    Out[32]:





(2, (2, 2))

dtype: type of given ndarray



In [33]:

    
a2.dtype









    Out[33]:





dtype('int64')

np.arange

arange is one instance of ndarray creating function in NumPy

Compare to xrange.



In [15]:

    
from numpy import arange



In [16]:

    
type(arange(10))









    Out[16]:





numpy.ndarray



In [17]:

    
for k in arange(10):
    print k



In [18]:

    
list(arange(10)) == list(xrange(10))









    Out[18]:





True

NumPy.ndarray.reshape



In [19]:

    
#how to map 0..63 -> 2x2 array
a3 = np.arange(64).reshape(8,8)
a3









    Out[19]:





array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55],
       [56, 57, 58, 59, 60, 61, 62, 63]])



In [39]:

    
# 2nd row, 3rd column --> remember index starts at 0
a3[1,2]









    Out[39]:





10



In [40]:

    
# check that reshape works

for i in range(8):
    for j in range(8):
        if a3[i,j] != i*8 + j:
            print i, j

scalar multiplication

example of broadcasting:

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.



In [41]:

    
2*a3









    Out[41]:





array([[  0,   2,   4,   6,   8,  10,  12,  14],
       [ 16,  18,  20,  22,  24,  26,  28,  30],
       [ 32,  34,  36,  38,  40,  42,  44,  46],
       [ 48,  50,  52,  54,  56,  58,  60,  62],
       [ 64,  66,  68,  70,  72,  74,  76,  78],
       [ 80,  82,  84,  86,  88,  90,  92,  94],
       [ 96,  98, 100, 102, 104, 106, 108, 110],
       [112, 114, 116, 118, 120, 122, 124, 126]])

add 2 to all elements in a3



In [42]:

    
a3+2









    Out[42]:





array([[ 2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23, 24, 25],
       [26, 27, 28, 29, 30, 31, 32, 33],
       [34, 35, 36, 37, 38, 39, 40, 41],
       [42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57],
       [58, 59, 60, 61, 62, 63, 64, 65]])

sorting



In [20]:

    
# reverse sort -- best way?
#http://stackoverflow.com/a/6771620/7782

np.sort(np.arange(100))[::-1]









    Out[20]:





array([99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83,
       82, 81, 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66,
       65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49,
       48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32,
       31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
       14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0])

Boolean slice: important novel type of slicing

This stuff is a bit tricky (see PfDA, pp. 89-92)

Consider example of picking out whole numbers less than 20 that are evenly divisible by 3. Generate a list of such numbers



In [44]:

    
# list comprehension

[i for i in xrange(20) if i % 3 == 0]









    Out[44]:





[0, 3, 6, 9, 12, 15, 18]



In [45]:

    
a3 = np.arange(20) 
a3









    Out[45]:





array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])



In [46]:

    
# basic indexing

print a3[0]
print a3[::-1]
print a3[2:5]









    



0
[19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0]
[2 3 4]



In [47]:

    
np.mod(a3, 3)









    Out[47]:





array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1])



In [48]:

    
np.mod(a3, 3) == 0









    Out[48]:





array([ True, False, False,  True, False, False,  True, False, False,
        True, False, False,  True, False, False,  True, False, False,
        True, False], dtype=bool)



In [49]:

    
divisible_by_3 = np.mod(a3, 3) == 0



In [50]:

    
a3[divisible_by_3]









    Out[50]:





array([ 0,  3,  6,  9, 12, 15, 18])



In [51]:

    
# if you want to understand this in terms of the overloaded operators -- don't worry if you don't get this.
a3.__getitem__(np.mod(a3,3).__eq__(0))









    Out[51]:





array([ 0,  3,  6,  9, 12, 15, 18])

Exercise: Calculate a series that holds all the squares less than 100

Use arange, np.sqrt, astype



In [52]:

    
a4 = arange(100)
a4sqrt = np.sqrt(a4)
a4[a4sqrt == a4sqrt.astype(np.int)]









    Out[52]:





array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

We will come back to indexing later.

http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

Pandas

pandas.Series

Make a series out of an array



In [53]:

    
s1 = Series(arange(5))

confirm that the type of s1 is what you would expect



In [54]:

    
type(s1)









    Out[54]:





pandas.core.series.Series

show that the series is also an array



In [55]:

    
s1.ndim, isinstance(s1, np.ndarray)









    Out[55]:





(1, True)



In [56]:

    
s1.index









    Out[56]:





Int64Index([0, 1, 2, 3, 4], dtype=int64)



In [57]:

    
import string
allTheLetters = string.lowercase
allTheLetters









    Out[57]:





'abcdefghijklmnopqrstuvwxyz'



In [58]:

    
s2 = Series(data=arange(5), index=list(allTheLetters)[:5])
s2









    Out[58]:





a    0
b    1
c    2
d    3
e    4
dtype: int64



In [59]:

    
s2.index









    Out[59]:





Index([u'a', u'b', u'c', u'd', u'e'], dtype=object)

http://my.safaribooksonline.com/book/programming/python/9781449323592/5dot-getting-started-with-pandas/id2828378 :

Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values



In [60]:

    
# can use both numeric indexing and the labels
s2[0], s2['a']









    Out[60]:





(0, 0)



In [61]:

    
for i in range(len(s2)):
    print i, s2[i]

it is possible conflict in indexing -- consider



In [62]:

    
s3 = Series(data=['albert', 'betty', 'cathy'], index=[3,1, 0])
s3









    Out[62]:





3    albert
1     betty
0     cathy
dtype: object



In [63]:

    
s3[0], list(s3)[0]









    Out[63]:





('cathy', 'albert')

but slicing works to return specific numeric index



In [64]:

    
s3[::-1]









    Out[64]:





0     cathy
1     betty
3    albert
dtype: object



In [65]:

    
for i in range(len(s3)):
    print i, s3[i:i+1]









    



0 3    albert
dtype: object
1 1    betty
dtype: object
2 0    cathy
dtype: object



In [66]:

    
s3.name = 'person names'
s3.name









    Out[66]:





'person names'



In [67]:

    
s3.index.name = 'confounding label'
s3.index.name









    Out[67]:





'confounding label'



In [68]:

    
s3









    Out[68]:





confounding label
3                    albert
1                     betty
0                     cathy
Name: person names, dtype: object

Important points remaining:

"NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link"
"Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict"

Gauss & Chess revisited, using Series

You get some nice matplotlib integration via pandas



In [21]:

    
# Gauss addition using np.arange, Series 

from pandas import Series
Series(arange(101).cumsum()).plot()









    Out[21]:





<matplotlib.axes.AxesSubplot at 0x109805890>



In [22]:

    
from pandas import Series
Series((pow(2,k) for k in xrange(64)), dtype=np.float64).cumsum().plot()









    Out[22]:





<matplotlib.axes.AxesSubplot at 0x1098438d0>

Wheat and Chessboard w/ NumPy

http://docs.scipy.org/doc/numpy/reference/ufuncs.html



In [71]:

    
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.ones.html
from numpy import ones



In [72]:

    
2*ones(64, dtype=np.int)









    Out[72]:





array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])



In [73]:

    
arange(64)









    Out[73]:





array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63])



In [74]:

    
sum(np.power(2, arange(64, dtype=np.uint64)))









    Out[74]:





1.8446744073709552e+19



In [75]:

    
sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))









    Out[75]:





1.8446744073709552e+19



In [76]:

    
precise_ans = sum([pow(2,n) for n in xrange(64)])
np_ans = sum(np.power(2*ones(64, dtype=np.uint64), arange(64)))

precise_ans, np_ans









    Out[76]:





(18446744073709551615L, 1.8446744073709552e+19)



In [77]:

    
# Raise an assertion if two items are not equal up to desired precision.
np.testing.assert_almost_equal(precise_ans, np_ans) is None









    Out[77]:





True

DataFrame

so many ways to use DataFrames....let's try them out in context of the census calculations



In [25]:

    
# not really intuitive to me:  reversal of column/row
DataFrame(dict([('06', {'name': 'California', 'abbreviation':'CA'})] ))









    Out[25]:






  
    
      
      06
    
  
  
    
      abbreviation
               CA
    
    
      name
       California
    
  

2 rows × 1 columns



In [79]:

    
DataFrame([{'name': 'California', 'abbreviation':'CA'}], index= ['06'])









    Out[79]:






  
    
      
      abbreviation
      name
    
  
  
    
      06
       CA
       California



In [80]:

    
Series(['06'], name='FIPS')









    Out[80]:





0    06
Name: FIPS, dtype: object



In [81]:

    
DataFrame([{'name': 'California', 'abbreviation':'CA'}], 
          index=Series(['06'], name='FIPS'))









    Out[81]:






  
    
      
      abbreviation
      name
    
    
      FIPS
      
      
    
  
  
    
      06
       CA
       California

Advanced: Operator Overloading



In [82]:

    
n0 = 5
n0 == 5









    Out[82]:





True

Now I thought I'd be able to use a n0.__eq__(5) but nope -- it's complicated -- see http://stackoverflow.com/questions/2281222/why-when-in-python-does-x-y-call-y-eq-x#comment2254663_2282795



In [83]:

    
try:
    n0.__eq__(5)
except Exception as e:
    print e









    



'int' object has no attribute '__eq__'

can do: int.__cmp__(x)



In [84]:

    
(n0.__cmp__(4), n0.__cmp__(5), n0.__cmp__(6))









    Out[84]:





(1, 0, -1)

how about ndarray?



In [85]:

    
arange(5) == 2









    Out[85]:





array([False, False,  True, False, False], dtype=bool)



In [86]:

    
# 
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html
np.array_equal(arange(5) == 2 , arange(5).__eq__(2))









    Out[86]:





True

Appendix: underlying mechanics of slicing

Useful if you want to understand how the slicing syntax really works.



In [87]:

    
isinstance([1,2], list)









    Out[87]:





True



In [88]:

    
isinstance(arange(5), list) # what does that mean -- could still be list-like









    Out[88]:





False



In [89]:

    
l1 = range(5)



In [90]:

    
type(l1)









    Out[90]:





list



In [91]:

    
l1[0], l1.__getitem__(0), l1[0] == l1.__getitem__(0)









    Out[91]:





(0, 0, True)



In [92]:

    
l1[::-1], l1.__getitem__(slice(None, None, -1))









    Out[92]:





([4, 3, 2, 1, 0], [4, 3, 2, 1, 0])



In [93]:

    
ar1 = arange(5)
ar1[3], ar1.__getitem__(3)









    Out[93]:





(3, 3)



In [94]:

    
ar1 == 2









    Out[94]:





array([False, False,  True, False, False], dtype=bool)



In [95]:

    
ar1[ar1 == 2].shape









    Out[95]:





(1,)



In [96]:

    
ar1.__eq__(2)









    Out[96]:





array([False, False,  True, False, False], dtype=bool)



In [97]:

    
ar1.__getitem__(slice(2, 4, None))









    Out[97]:





array([2, 3])



In [98]:

    
slice(ar1.__eq__(2), None, None)









    Out[98]:





slice(array([False, False,  True, False, False], dtype=bool), None, None)



In [99]:

    
ar1.__getitem__(ar1.__eq__(2))









    Out[99]:





array([2])



In [100]:

    
ar1[:2], ar1.__getitem__(slice(2))









    Out[100]:





(array([0, 1]), array([0, 1]))



In [101]:

    
ar1 + 7









    Out[101]:





array([ 7,  8,  9, 10, 11])



In [102]:

    
ar1.__add__(7)









    Out[102]:





array([ 7,  8,  9, 10, 11])



In [103]:

    
min(ar1 + 7)









    Out[103]:





7



In [104]:

    
alphabet[:]









    Out[104]:





'abcdefghijklmnopqrstuvwxyz'