What is IPython Notebook?

Interactive python shell runs in the background and communicates with a browser, supports Matplots, HTML, JS, TeX markup, images inline

Similar to Mathematica, crucial concept is that it allows you to query data interactively and quickly, saving your input and output

IPython site

I am using WinPython which is a portable Python environment.

There are others such as enthought, python x,y.


In [ ]:
# display help
# press ctrl + m + h

In [1]:
print 'hello world!'


hello world!

We can insert cells at any position, remove cells, copy and paste entire cells

supports path completion (on linux and macs only), command completion, docstring and will open the source code (if available)

also supports shell commands


In [2]:
# ping and other shell commands are supported, prefix with !
!ping www.quantel.com


Pinging www.quantel.com [10.54.1.107] with 32 bytes of data:
Reply from 10.54.1.107: bytes=32 time=1ms TTL=64
Reply from 10.54.1.107: bytes=32 time<1ms TTL=64
Reply from 10.54.1.107: bytes=32 time<1ms TTL=64
Reply from 10.54.1.107: bytes=32 time<1ms TTL=64

Ping statistics for 10.54.1.107:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 1ms, Average = 0ms

In [3]:
# display docstring
enumerate?

In [4]:
# can also open the source file
from pandas import *
read_csv

supports 'magic' functions


In [6]:
def for_loop():
    ret = []
    for x in range(10000):
        ret.append(x)
    return ret

def list_comp():
    return [x for x in range(10000)]

# which is faster?
%timeit for_loop()
%timeit list_comp()


100 loops, best of 3: 2.15 ms per loop
1000 loops, best of 3: 729 us per loop

In [7]:
# list of all magic functions
%lsmagic


Available line magics:
%alias  %alias_magic  %autocall  %automagic  %bookmark  %cd  %clear  %cls  %colors  %config  %connect_info  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %install_default_config  %install_ext  %install_profiles  %killbgscripts  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %lsmagic  %macro  %magic  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %run  %save  %sc  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%capture  %%cmd  %%file  %%powershell  %%prun  %%script  %%sx  %%system  %%timeit

Automagic is ON, % prefix IS NOT needed for line magics.

supports inline display of plots


In [8]:
plot(rand(100))


Out[8]:
[<matplotlib.lines.Line2D at 0x87505c0>]

and images


In [9]:
from IPython.core.display import Image
#Image(filename=r'Desktop\kid_f_off.jpg')
Image(filename=r'Downloads\Happy_Holidays.jpg')


---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-6d1a920a8ad5> in <module>()
      1 from IPython.core.display import Image
      2 #Image(filename=r'Desktop\kid_f_off.jpg')
----> 3 Image(filename=r'Downloads\Happy_Holidays.jpg')

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in __init__(self, data, url, filename, format, embed)
    478         self.format = unicode(format).lower()
    479         self.embed = embed if embed is not None else (url is None)
--> 480         super(Image, self).__init__(data=data, url=url, filename=filename)
    481 
    482     def reload(self):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in __init__(self, data, url, filename)
    263             self.url = url
    264             self.filename = None if filename is None else unicode(filename)
--> 265         self.reload()
    266 
    267     def reload(self):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in reload(self)
    483         """Reload the raw data from file or URL."""
    484         if self.embed:
--> 485             super(Image,self).reload()
    486 
    487     def _repr_html_(self):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in reload(self)
    268         """Reload the raw data from file or URL."""
    269         if self.filename is not None:
--> 270             with open(self.filename, self._read_flags) as f:
    271                 self.data = f.read()
    272         elif self.url is not None:

IOError: [Errno 2] No such file or directory: u'Downloads\\Happy_Holidays.jpg'

In [10]:
# the above was a delibarate error to demonstrate traceback output
#Image(filename=r'C:\Users\alanwo\Desktop\kid_f_off.jpg')
Image(filename=r'C:\Downloads\Happy_Holidays.jpg')


Out[10]:

display youtube videos too!


In [11]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('tGvHNNOLnCk')


Out[11]:

and boring math

$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$

It also supports mixing python with other languages such as cython, R, Octave.

Scripting with Bash, Perl and Ruby are also supported.

Clustering is supported, multiple instances can be launched and the work divided between them.

What is Pandas?

Pandas consists of:

• A set of labeled array data structures, the primary of which are Series/TimeSeries and DataFrame

• Index objects enabling both simple axis indexing and multi-level (lookups, data alignment, and reindexing)

• An integrated group by engine for aggregating and transforming data sets

• Date range generation (date_range)

• Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.

• Memory-efficent “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)

• Moving window statistics (rolling mean, rolling standard deviation, etc.)

• Static and moving window linear and panel regression

If you can't be bothered to read the above then you can think of Pandas as an in memory DB tool (but it is much more than this)

Some simple examples


In [12]:
from pandas import *
# Series
# s = Series(data, index=index)
# data => a Python dict, an ndarray, a scalar value 
# NaN for missing data
sample_series = Series(randn(5))
sample_series


Out[12]:
0   -0.169314
1    0.972852
2    0.173603
3    0.549859
4   -1.262332
dtype: float64

In [13]:
# we can pass an index list for the rows
sample_series = Series(randn(5), index=['A','B','C','D','E'])
sample_series


Out[13]:
A    0.267264
B   -0.821806
C    2.045298
D   -0.316652
E    1.313129
dtype: float64

Series can be thought of as 1-d arrays (really a 1-d numpy ndarray underneath)

They are the fundamental datatype in Pandas


In [14]:
# calling .values decomposes to the underlying data
sample_series.values


Out[14]:
array([ 0.26726378, -0.82180644,  2.04529782, -0.31665223,  1.31312935])

In [15]:
# some simple indexing
sample_series[0]


Out[15]:
0.26726378406431067

In [16]:
sample_series[2]


Out[16]:
2.0452978176194083

In [17]:
# DataFrame is one of : 
#  Dict of 1D ndarrays, lists, dicts, or Series
#  2-D numpy.ndarray
#  Structured or record ndarray
#  A Series
#  Another DataFrame
sample_df = DataFrame(randn(5))
sample_df


Out[17]:
0
0 0.077678
1 -0.424129
2 -0.144588
3 1.504509
4 1.015034

In [18]:
# again we can pass the column and row names
sample_df = DataFrame({'A':randn(5)}, index=['1', 'c', '5', 'g', 't']) # use abritrary row values
sample_df


Out[18]:
A
1 0.258513
c -1.457385
5 -1.145670
g 0.264279
t 0.869407

In [19]:
# data can consist of anything and can be hetereogenous for a column
from datetime import datetime
sample_df = DataFrame( {'A':['hello', 9, 12312.00, datetime.now(), [x for x in range(5)]]})
sample_df


Out[19]:
A
0 hello
1 9
2 12312
3 2013-09-20 15:17:20.578000
4 [0, 1, 2, 3, 4]

In [20]:
# we can construct from a dict of Series
d = {'col1':Series(randn(5), index=['a','b','c','d','e']), 'col2':Series(randn(4), index=['a','b','c','d'])}
d


Out[20]:
{'col1': a    0.470946
b    0.585088
c    1.641423
d   -0.207128
e   -0.803649
dtype: float64,
 'col2': a    0.017838
b    0.678610
c   -0.088913
d   -0.738760
dtype: float64}

In [21]:
sample_df = DataFrame(d)
sample_df


Out[21]:
col1 col2
a 0.470946 0.017838
b 0.585088 0.678610
c 1.641423 -0.088913
d -0.207128 -0.738760
e -0.803649 NaN

In [22]:
# column selection
sample_df['col1']


Out[22]:
a    0.470946
b    0.585088
c    1.641423
d   -0.207128
e   -0.803649
Name: col1, dtype: float64

In [23]:
# this is really a series
type(sample_df['col1'])


Out[23]:
pandas.core.series.Series

In [24]:
# different to this, which is a dataframe
type(sample_df[['col1']])


Out[24]:
pandas.core.frame.DataFrame

In [25]:
# row selection in dataframes using fancy indexing
sample_df.ix['a']


Out[25]:
col1    0.470946
col2    0.017838
Name: a, dtype: float64

In [26]:
# can also use irow, integer based indexing
sample_df.irow(1)


Out[26]:
col1    0.585088
col2    0.678610
Name: b, dtype: float64

In [27]:
# add a column
sample_df['new_col'] = randn(5)
sample_df


Out[27]:
col1 col2 new_col
a 0.470946 0.017838 -0.551610
b 0.585088 0.678610 -0.424923
c 1.641423 -0.088913 1.158524
d -0.207128 -0.738760 -1.506391
e -0.803649 NaN -0.754853

In [28]:
# can compute new values based on other cols
sample_df['diff'] = sample_df['col1'] - sample_df['col2']
sample_df


Out[28]:
col1 col2 new_col diff
a 0.470946 0.017838 -0.551610 0.453109
b 0.585088 0.678610 -0.424923 -0.093522
c 1.641423 -0.088913 1.158524 1.730336
d -0.207128 -0.738760 -1.506391 0.531632
e -0.803649 NaN -0.754853 NaN

In [29]:
from datetime import datetime
sample_df['date'] = datetime.now()
sample_df


Out[29]:
col1 col2 new_col diff date
a 0.470946 0.017838 -0.551610 0.453109 2013-09-20 15:20:05.470000
b 0.585088 0.678610 -0.424923 -0.093522 2013-09-20 15:20:05.470000
c 1.641423 -0.088913 1.158524 1.730336 2013-09-20 15:20:05.470000
d -0.207128 -0.738760 -1.506391 0.531632 2013-09-20 15:20:05.470000
e -0.803649 NaN -0.754853 NaN 2013-09-20 15:20:05.470000

In [30]:
# we can apply functions to a column, row or entire dataframe
sample_df['day'] = sample_df['date'].apply(lambda x: x.day)
sample_df


Out[30]:
col1 col2 new_col diff date day
a 0.470946 0.017838 -0.551610 0.453109 2013-09-20 15:20:05.470000 20
b 0.585088 0.678610 -0.424923 -0.093522 2013-09-20 15:20:05.470000 20
c 1.641423 -0.088913 1.158524 1.730336 2013-09-20 15:20:05.470000 20
d -0.207128 -0.738760 -1.506391 0.531632 2013-09-20 15:20:05.470000 20
e -0.803649 NaN -0.754853 NaN 2013-09-20 15:20:05.470000 20

In [31]:
# note that NaN values are propagated, we can deal with this in a few ways
# drop them

copy = sample_df.dropna()
copy


Out[31]:
col1 col2 new_col diff date day
a 0.470946 0.017838 -0.551610 0.453109 2013-09-20 15:20:05.470000 20
b 0.585088 0.678610 -0.424923 -0.093522 2013-09-20 15:20:05.470000 20
c 1.641423 -0.088913 1.158524 1.730336 2013-09-20 15:20:05.470000 20
d -0.207128 -0.738760 -1.506391 0.531632 2013-09-20 15:20:05.470000 20

In [32]:
# or replace with missing values
copy = sample_df
copy.fillna(0, inplace=True)
copy


Out[32]:
col1 col2 new_col diff date day
a 0.470946 0.017838 -0.551610 0.453109 2013-09-20 15:20:05.470000 20
b 0.585088 0.678610 -0.424923 -0.093522 2013-09-20 15:20:05.470000 20
c 1.641423 -0.088913 1.158524 1.730336 2013-09-20 15:20:05.470000 20
d -0.207128 -0.738760 -1.506391 0.531632 2013-09-20 15:20:05.470000 20
e -0.803649 0.000000 -0.754853 0.000000 2013-09-20 15:20:05.470000 20

In [33]:
# we can select multiple columns
copy = sample_df[['col1', 'col2']]
copy


Out[33]:
col1 col2
a 0.470946 0.017838
b 0.585088 0.678610
c 1.641423 -0.088913
d -0.207128 -0.738760
e -0.803649 0.000000

In [34]:
# or a cross section
copy = sample_df[['col1', 'col2']][1:4]
copy
# note that for index selection [beg, end) - beg is included, end is excluded


Out[34]:
col1 col2
b 0.585088 0.678610
c 1.641423 -0.088913
d -0.207128 -0.738760

In [35]:
# we can also do label selection on rows and slice the columns
copy = sample_df.loc['b':, 'col1':'new_col']
copy


Out[35]:
col1 col2 new_col
b 0.585088 0.678610 -0.424923
c 1.641423 -0.088913 1.158524
d -0.207128 -0.738760 -1.506391
e -0.803649 0.000000 -0.754853

In [36]:
# we can create a boolean mask
sample_df['col1']>0


Out[36]:
a     True
b     True
c     True
d    False
e    False
Name: col1, dtype: bool

In [37]:
# then use this for selection
copy = sample_df[sample_df['col1']>0]
copy


Out[37]:
col1 col2 new_col diff date day
a 0.470946 0.017838 -0.551610 0.453109 2013-09-20 15:20:05.470000 20
b 0.585088 0.678610 -0.424923 -0.093522 2013-09-20 15:20:05.470000 20
c 1.641423 -0.088913 1.158524 1.730336 2013-09-20 15:20:05.470000 20

In [38]:
# can also use more advanced criteria
sample_df.ix[(sample_df['col1'] < 0) & (sample_df['diff']<0)]
# Note do not use or/and for boolean comparisons, you must use the bitwise versions |/& also for negation ~


Out[38]:
Index([], dtype=object) Empty DataFrame

In [39]:
# we can get a specific value at row and column
sample_df.loc['c', 'col2']


Out[39]:
-0.088912753762695362

In [40]:
# yes fancy reverse works too
sample_df[::-1]


Out[40]:
col1 col2 new_col diff date day
e -0.803649 0.000000 -0.754853 0.000000 2013-09-20 15:20:05.470000 20
d -0.207128 -0.738760 -1.506391 0.531632 2013-09-20 15:20:05.470000 20
c 1.641423 -0.088913 1.158524 1.730336 2013-09-20 15:20:05.470000 20
b 0.585088 0.678610 -0.424923 -0.093522 2013-09-20 15:20:05.470000 20
a 0.470946 0.017838 -0.551610 0.453109 2013-09-20 15:20:05.470000 20

In [41]:
# we can perform some simple stats
sample_df.describe()


Out[41]:
col1 col2 new_col diff day
count 5.000000 5.000000 5.000000 5.000000 5
mean 0.337336 -0.026245 -0.415850 0.524311 20
std 0.918989 0.502757 0.974855 0.727365 0
min -0.803649 -0.738760 -1.506391 -0.093522 20
25% -0.207128 -0.088913 -0.754853 0.000000 20
50% 0.470946 0.000000 -0.551610 0.453109 20
75% 0.585088 0.017838 -0.424923 0.531632 20
max 1.641423 0.678610 1.158524 1.730336 20

In [42]:
# do some aggregation like sum, count, cumsum and more
sample_df.sum()


Out[42]:
col1         1.686681
col2        -0.131225
new_col     -2.079252
diff         2.621555
day        100.000000
dtype: float64

In [43]:
# plot values
sample_df[['col1', 'col2']].plot()


Out[43]:
<matplotlib.axes.AxesSubplot at 0x88d5d30>

OK. let's load some data

Pandas supports loading data from csv, html (will load from url, detect HTML tables and try to parse),

json, excel, sql, text files, clipboard memory, hdf and pickles too.


In [44]:
names = read_csv(r'c:\pandas-exercises-master\baby-names2.csv')
names.head()


Out[44]:
year name prop sex soundex
0 1880 John 0.081541 boy J500
1 1880 William 0.080511 boy W450
2 1880 James 0.050057 boy J520
3 1880 Charles 0.045167 boy C642
4 1880 George 0.043292 boy G620

In [46]:
names.dtypes


Out[46]:
year         int64
name        object
prop       float64
sex         object
soundex     object
dtype: object

In [47]:
names.tail()


Out[47]:
year name prop sex soundex
257995 2008 Carleigh 0.000128 girl C642
257996 2008 Iyana 0.000128 girl I500
257997 2008 Kenley 0.000127 girl K540
257998 2008 Sloane 0.000127 girl S450
257999 2008 Elianna 0.000127 girl E450

In [48]:
names.dtypes


Out[48]:
year         int64
name        object
prop       float64
sex         object
soundex     object
dtype: object

In [49]:
names.columns


Out[49]:
Index([u'year', u'name', u'prop', u'sex', u'soundex'], dtype=object)

In [50]:
# names are ordered by popularity, for year 1880 the top names were
names[names['year']==1880].head()


Out[50]:
year name prop sex soundex
0 1880 John 0.081541 boy J500
1 1880 William 0.080511 boy W450
2 1880 James 0.050057 boy J520
3 1880 Charles 0.045167 boy C642
4 1880 George 0.043292 boy G620

In [51]:
# and the least popular
names[names['year']==1880].tail()


Out[51]:
year name prop sex soundex
129995 1880 Emaline 0.000041 girl E545
129996 1880 Ester 0.000041 girl E236
129997 1880 Eulah 0.000041 girl E400
129998 1880 Eulalie 0.000041 girl E440
129999 1880 Euna 0.000041 girl E500

In [52]:
# separate the boys from the girls
boys = names[names['sex']=='boy']
girls = names[names['sex']=='girl']
girls


Out[52]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129000 entries, 129000 to 257999
Data columns (total 5 columns):
year       129000  non-null values
name       129000  non-null values
prop       129000  non-null values
sex        129000  non-null values
soundex    129000  non-null values
dtypes: float64(1), int64(1), object(3)

In [53]:
# perform a groupby operation
boys.groupby('year')
# this has created a groupby object, it has not performed any aggregation or grouping yet


Out[53]:
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000A578978>

In [54]:
boys.groupby('year').size()
# shows we have 1000 entries for each year


Out[54]:
year
1880    1000
1881    1000
1882    1000
1883    1000
1884    1000
1885    1000
1886    1000
1887    1000
1888    1000
1889    1000
1890    1000
1891    1000
1892    1000
1893    1000
1894    1000
...
1994    1000
1995    1000
1996    1000
1997    1000
1998    1000
1999    1000
2000    1000
2001    1000
2002    1000
2003    1000
2004    1000
2005    1000
2006    1000
2007    1000
2008    1000
Length: 129, dtype: int64

In [55]:
# this is the same for both boys and girls
names.groupby(['year','sex']).size()


Out[55]:
year  sex 
1880  boy     1000
      girl    1000
1881  boy     1000
      girl    1000
1882  boy     1000
      girl    1000
1883  boy     1000
      girl    1000
1884  boy     1000
      girl    1000
1885  boy     1000
      girl    1000
1886  boy     1000
      girl    1000
1887  boy     1000
...
2001  girl    1000
2002  boy     1000
      girl    1000
2003  boy     1000
      girl    1000
2004  boy     1000
      girl    1000
2005  boy     1000
      girl    1000
2006  boy     1000
      girl    1000
2007  boy     1000
      girl    1000
2008  boy     1000
      girl    1000
Length: 258, dtype: int64

In [56]:
# we can index this object too, this selects top names for year 2001,2002
names.groupby(['year','sex']).size().ix[2000]


Out[56]:
sex
boy     1000
girl    1000
dtype: int64

In [57]:
# so lets filter so we can see the most popular names by year
boys[boys.year == 2000] # note that columns are also attributes


Out[57]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 120000 to 120999
Data columns (total 5 columns):
year       1000  non-null values
name       1000  non-null values
prop       1000  non-null values
sex        1000  non-null values
soundex    1000  non-null values
dtypes: float64(1), int64(1), object(3)

In [58]:
# display the proportion
boys[boys.year==2000].prop


Out[58]:
120000    0.016514
120001    0.015346
120002    0.013695
120003    0.013194
120004    0.011990
120005    0.011815
120006    0.011324
120007    0.010930
120008    0.010681
120009    0.010305
120010    0.009896
120011    0.009745
120012    0.009715
120013    0.009617
120014    0.009516
...
120985    0.000072
120986    0.000072
120987    0.000072
120988    0.000072
120989    0.000071
120990    0.000071
120991    0.000071
120992    0.000071
120993    0.000071
120994    0.000071
120995    0.000071
120996    0.000071
120997    0.000071
120998    0.000071
120999    0.000071
Name: prop, Length: 1000, dtype: float64

In [59]:
# display the top 5
boys[boys.year == 2000][:5]


Out[59]:
year name prop sex soundex
120000 2000 Jacob 0.016514 boy J210
120001 2000 Michael 0.015346 boy M240
120002 2000 Matthew 0.013695 boy M300
120003 2000 Joshua 0.013194 boy J200
120004 2000 Christopher 0.011990 boy C623

In [60]:
# we can get the index value for the top proportion value
boys[boys.year == 2000].prop.idxmax()


Out[60]:
120000

In [61]:
# this can be used for indexing
boys.ix[boys[boys.year == 2000].prop.idxmax()]  # to get the whole row.


Out[61]:
year           2000
name          Jacob
prop       0.016514
sex             boy
soundex        J210
Name: 120000, dtype: object

In [62]:
# using this we can now write a function
def get_max_record(group):
    return group.ix[group.prop.idxmax()]
get_max_record(boys)


Out[62]:
year           1880
name           John
prop       0.081541
sex             boy
soundex        J500
Name: 0, dtype: object

In [63]:
# do this for each year
result = boys.groupby('year').apply(get_max_record)

In [64]:
set_option('display.max_rows', 500)
result


Out[64]:
year name prop sex soundex
year
1880 1880 John 0.081541 boy J500
1881 1881 John 0.080975 boy J500
1882 1882 John 0.078314 boy J500
1883 1883 John 0.079066 boy J500
1884 1884 John 0.076476 boy J500
1885 1885 John 0.075517 boy J500
1886 1886 John 0.075820 boy J500
1887 1887 John 0.074181 boy J500
1888 1888 John 0.071186 boy J500
1889 1889 John 0.071804 boy J500
1890 1890 John 0.071034 boy J500
1891 1891 John 0.070292 boy J500
1892 1892 John 0.068759 boy J500
1893 1893 John 0.066495 boy J500
1894 1894 John 0.065961 boy J500
1895 1895 John 0.065699 boy J500
1896 1896 John 0.063051 boy J500
1897 1897 John 0.061907 boy J500
1898 1898 John 0.061770 boy J500
1899 1899 John 0.060674 boy J500
1900 1900 John 0.060619 boy J500
1901 1901 John 0.059677 boy J500
1902 1902 John 0.059568 boy J500
1903 1903 John 0.058833 boy J500
1904 1904 John 0.058535 boy J500
1905 1905 John 0.056266 boy J500
1906 1906 John 0.057359 boy J500
1907 1907 John 0.056635 boy J500
1908 1908 John 0.056150 boy J500
1909 1909 John 0.054234 boy J500
1910 1910 John 0.054914 boy J500
1911 1911 John 0.055709 boy J500
1912 1912 John 0.054465 boy J500
1913 1913 John 0.054678 boy J500
1914 1914 John 0.055531 boy J500
1915 1915 John 0.054005 boy J500
1916 1916 John 0.054202 boy J500
1917 1917 John 0.054062 boy J500
1918 1918 John 0.053923 boy J500
1919 1919 John 0.052724 boy J500
1920 1920 John 0.051710 boy J500
1921 1921 John 0.051160 boy J500
1922 1922 John 0.050889 boy J500
1923 1923 John 0.050741 boy J500
1924 1924 Robert 0.052025 boy R163
1925 1925 Robert 0.052875 boy R163
1926 1926 Robert 0.053366 boy R163
1927 1927 Robert 0.053075 boy R163
1928 1928 Robert 0.053177 boy R163
1929 1929 Robert 0.054029 boy R163
1930 1930 Robert 0.055021 boy R163
1931 1931 Robert 0.056598 boy R163
1932 1932 Robert 0.055166 boy R163
1933 1933 Robert 0.053140 boy R163
1934 1934 Robert 0.052583 boy R163
1935 1935 Robert 0.052852 boy R163
1936 1936 Robert 0.054959 boy R163
1937 1937 Robert 0.056541 boy R163
1938 1938 Robert 0.054806 boy R163
1939 1939 Robert 0.052645 boy R163
1940 1940 James 0.052662 boy J520
1941 1941 James 0.053182 boy J520
1942 1942 James 0.054812 boy J520
1943 1943 James 0.055183 boy J520
1944 1944 James 0.055406 boy J520
1945 1945 James 0.054288 boy J520
1946 1946 James 0.052999 boy J520
1947 1947 James 0.050987 boy J520
1948 1948 James 0.049697 boy J520
1949 1949 James 0.048244 boy J520
1950 1950 James 0.047336 boy J520
1951 1951 James 0.045662 boy J520
1952 1952 James 0.044132 boy J520
1953 1953 Robert 0.043059 boy R163
1954 1954 Michael 0.042792 boy M240
1955 1955 Michael 0.042293 boy M240
1956 1956 Michael 0.042244 boy M240
1957 1957 Michael 0.042392 boy M240
1958 1958 Michael 0.042029 boy M240
1959 1959 Michael 0.039381 boy M240
1960 1960 David 0.039669 boy D130
1961 1961 Michael 0.040331 boy M240
1962 1962 Michael 0.040453 boy M240
1963 1963 Michael 0.040569 boy M240
1964 1964 Michael 0.040776 boy M240
1965 1965 Michael 0.042762 boy M240
1966 1966 Michael 0.043997 boy M240
1967 1967 Michael 0.046329 boy M240
1968 1968 Michael 0.046184 boy M240
1969 1969 Michael 0.046564 boy M240
1970 1970 Michael 0.044784 boy M240
1971 1971 Michael 0.042687 boy M240
1972 1972 Michael 0.042659 boy M240
1973 1973 Michael 0.042049 boy M240
1974 1974 Michael 0.041444 boy M240
1975 1975 Michael 0.042178 boy M240
1976 1976 Michael 0.041012 boy M240
1977 1977 Michael 0.039554 boy M240
1978 1978 Michael 0.039308 boy M240
1979 1979 Michael 0.037823 boy M240
1980 1980 Michael 0.037039 boy M240
1981 1981 Michael 0.036931 boy M240
1982 1982 Michael 0.036157 boy M240
1983 1983 Michael 0.036506 boy M240
1984 1984 Michael 0.036117 boy M240
1985 1985 Michael 0.033745 boy M240
1986 1986 Michael 0.033429 boy M240
1987 1987 Michael 0.032652 boy M240
1988 1988 Michael 0.032053 boy M240
1989 1989 Michael 0.031213 boy M240
1990 1990 Michael 0.030358 boy M240
1991 1991 Michael 0.028698 boy M240
1992 1992 Michael 0.025914 boy M240
1993 1993 Michael 0.024010 boy M240
1994 1994 Michael 0.021831 boy M240
1995 1995 Michael 0.020594 boy M240
1996 1996 Michael 0.019156 boy M240
1997 1997 Michael 0.018805 boy M240
1998 1998 Michael 0.018066 boy M240
1999 1999 Jacob 0.017346 boy J210
2000 2000 Jacob 0.016514 boy J210
2001 2001 Jacob 0.015734 boy J210
2002 2002 Jacob 0.014792 boy J210
2003 2003 Jacob 0.014104 boy J210
2004 2004 Jacob 0.013196 boy J210
2005 2005 Jacob 0.012148 boy J210
2006 2006 Jacob 0.011331 boy J210
2007 2007 Jacob 0.010948 boy J210
2008 2008 Jacob 0.010355 boy J210

In [65]:
# plot this, this shows that the proportion required to be the most popular name has fallen generally, there are some notable peaks and troughs though.
# can you think why?
result.prop.plot()


Out[65]:
<matplotlib.axes.AxesSubplot at 0x885c2e8>

In [66]:
# lets select a specific name
boys[boys.name == 'Travis']


Out[66]:
year name prop sex soundex
563 1880 Travis 0.000101 boy T612
2648 1882 Travis 0.000082 boy T612
3672 1883 Travis 0.000080 boy T612
4656 1884 Travis 0.000081 boy T612
5602 1885 Travis 0.000095 boy T612
6598 1886 Travis 0.000092 boy T612
9871 1889 Travis 0.000059 boy T612
10733 1890 Travis 0.000075 boy T612
11696 1891 Travis 0.000082 boy T612
12593 1892 Travis 0.000099 boy T612
13831 1893 Travis 0.000066 boy T612
14849 1894 Travis 0.000064 boy T612
15735 1895 Travis 0.000079 boy T612
16774 1896 Travis 0.000077 boy T612
17645 1897 Travis 0.000098 boy T612
18991 1898 Travis 0.000053 boy T612
19658 1899 Travis 0.000095 boy T612
20442 1900 Travis 0.000166 boy T612
22552 1902 Travis 0.000121 boy T612
23778 1903 Travis 0.000077 boy T612
24510 1904 Travis 0.000137 boy T612
25505 1905 Travis 0.000147 boy T612
26587 1906 Travis 0.000118 boy T612
27535 1907 Travis 0.000132 boy T612
28512 1908 Travis 0.000150 boy T612
29451 1909 Travis 0.000181 boy T612
30510 1910 Travis 0.000149 boy T612
31467 1911 Travis 0.000166 boy T612
32448 1912 Travis 0.000171 boy T612
33413 1913 Travis 0.000203 boy T612
34388 1914 Travis 0.000220 boy T612
35426 1915 Travis 0.000182 boy T612
36404 1916 Travis 0.000204 boy T612
37393 1917 Travis 0.000215 boy T612
38381 1918 Travis 0.000233 boy T612
39384 1919 Travis 0.000228 boy T612
40385 1920 Travis 0.000218 boy T612
41383 1921 Travis 0.000219 boy T612
42374 1922 Travis 0.000231 boy T612
43385 1923 Travis 0.000209 boy T612
44382 1924 Travis 0.000210 boy T612
45388 1925 Travis 0.000208 boy T612
46404 1926 Travis 0.000191 boy T612
47364 1927 Travis 0.000220 boy T612
48371 1928 Travis 0.000214 boy T612
49338 1929 Travis 0.000246 boy T612
50360 1930 Travis 0.000225 boy T612
51325 1931 Travis 0.000266 boy T612
52329 1932 Travis 0.000260 boy T612
53311 1933 Travis 0.000290 boy T612
54281 1934 Travis 0.000344 boy T612
55292 1935 Travis 0.000324 boy T612
56295 1936 Travis 0.000314 boy T612
57300 1937 Travis 0.000293 boy T612
58290 1938 Travis 0.000312 boy T612
59295 1939 Travis 0.000312 boy T612
60294 1940 Travis 0.000299 boy T612
61309 1941 Travis 0.000278 boy T612
62319 1942 Travis 0.000256 boy T612
63347 1943 Travis 0.000216 boy T612
64330 1944 Travis 0.000240 boy T612
65342 1945 Travis 0.000225 boy T612
66318 1946 Travis 0.000246 boy T612
67337 1947 Travis 0.000218 boy T612
68347 1948 Travis 0.000211 boy T612
69349 1949 Travis 0.000207 boy T612
70343 1950 Travis 0.000211 boy T612
71353 1951 Travis 0.000197 boy T612
72356 1952 Travis 0.000184 boy T612
73355 1953 Travis 0.000189 boy T612
74370 1954 Travis 0.000176 boy T612
75380 1955 Travis 0.000171 boy T612
76384 1956 Travis 0.000177 boy T612
77409 1957 Travis 0.000164 boy T612
78371 1958 Travis 0.000201 boy T612
79346 1959 Travis 0.000233 boy T612
80327 1960 Travis 0.000256 boy T612
81323 1961 Travis 0.000271 boy T612
82301 1962 Travis 0.000312 boy T612
83257 1963 Travis 0.000430 boy T612
84234 1964 Travis 0.000507 boy T612
85221 1965 Travis 0.000568 boy T612
86187 1966 Travis 0.000718 boy T612
87165 1967 Travis 0.000902 boy T612
88140 1968 Travis 0.001162 boy T612
89112 1969 Travis 0.001567 boy T612
90068 1970 Travis 0.002270 boy T612
91060 1971 Travis 0.002762 boy T612
92056 1972 Travis 0.003068 boy T612
93052 1973 Travis 0.003637 boy T612
94053 1974 Travis 0.003496 boy T612
95050 1975 Travis 0.003674 boy T612
96045 1976 Travis 0.004188 boy T612
97043 1977 Travis 0.004259 boy T612
98045 1978 Travis 0.004072 boy T612
99035 1979 Travis 0.006169 boy T612
100036 1980 Travis 0.005808 boy T612
101039 1981 Travis 0.005435 boy T612
102038 1982 Travis 0.005525 boy T612
103040 1983 Travis 0.005627 boy T612
104041 1984 Travis 0.005461 boy T612
105045 1985 Travis 0.004725 boy T612
106048 1986 Travis 0.004416 boy T612
107041 1987 Travis 0.005335 boy T612
108040 1988 Travis 0.005711 boy T612
109040 1989 Travis 0.005500 boy T612
110041 1990 Travis 0.005325 boy T612
111043 1991 Travis 0.004786 boy T612
112047 1992 Travis 0.004239 boy T612
113049 1993 Travis 0.003845 boy T612
114052 1994 Travis 0.003338 boy T612
115065 1995 Travis 0.002788 boy T612
116072 1996 Travis 0.002576 boy T612
117080 1997 Travis 0.002346 boy T612
118097 1998 Travis 0.001937 boy T612
119101 1999 Travis 0.001770 boy T612
120114 2000 Travis 0.001576 boy T612
121133 2001 Travis 0.001424 boy T612
122143 2002 Travis 0.001306 boy T612
123158 2003 Travis 0.001209 boy T612
124165 2004 Travis 0.001164 boy T612
125173 2005 Travis 0.001114 boy T612
126162 2006 Travis 0.001186 boy T612
127169 2007 Travis 0.001087 boy T612
128176 2008 Travis 0.001025 boy T612

In [67]:
# we can set a mutli index on the data frame and use this
idf = boys.set_index(['name', 'year'])

In [68]:
# select the bottom 50
idf[-50:]


Out[68]:
prop sex soundex
name year
Reilly 2008 0.000097 boy R400
Sheldon 2008 0.000097 boy S435
Abdullah 2008 0.000097 boy A134
Jagger 2008 0.000097 boy J260
Thaddeus 2008 0.000097 boy T320
Case 2008 0.000096 boy C200
Kyson 2008 0.000096 boy K250
Lamont 2008 0.000096 boy L553
Chaz 2008 0.000096 boy C200
Makhi 2008 0.000096 boy M200
Jan 2008 0.000095 boy J500
Marques 2008 0.000095 boy M622
Oswaldo 2008 0.000095 boy O243
Donavan 2008 0.000095 boy D515
Keyon 2008 0.000095 boy K500
Kyan 2008 0.000095 boy K500
Simeon 2008 0.000095 boy S550
Trystan 2008 0.000095 boy T623
Andreas 2008 0.000094 boy A536
Dangelo 2008 0.000094 boy D524
Landin 2008 0.000094 boy L535
Reagan 2008 0.000094 boy R250
Turner 2008 0.000094 boy T656
Arnav 2008 0.000094 boy A651
Brenton 2008 0.000094 boy B653
Callum 2008 0.000094 boy C450
Jayvion 2008 0.000094 boy J150
Bridger 2008 0.000093 boy B632
Sammy 2008 0.000093 boy S500
Deegan 2008 0.000093 boy D250
Jaylan 2008 0.000093 boy J450
Lennon 2008 0.000093 boy L550
Odin 2008 0.000093 boy O350
Abdiel 2008 0.000092 boy A134
Jerimiah 2008 0.000092 boy J650
Eliezer 2008 0.000092 boy E426
Bronson 2008 0.000091 boy B652
Cornelius 2008 0.000091 boy C654
Pierre 2008 0.000091 boy P600
Cortez 2008 0.000091 boy C632
Baron 2008 0.000090 boy B650
Carlo 2008 0.000090 boy C640
Carsen 2008 0.000090 boy C625
Fletcher 2008 0.000090 boy F432
Izayah 2008 0.000090 boy I200
Kolten 2008 0.000090 boy K435
Damari 2008 0.000089 boy D560
Hugh 2008 0.000089 boy H200
Jensen 2008 0.000089 boy J525
Yurem 2008 0.000089 boy Y650

In [69]:
# where's travis?
idf.ix['Travis']


Out[69]:
prop sex soundex
year
1880 0.000101 boy T612
1882 0.000082 boy T612
1883 0.000080 boy T612
1884 0.000081 boy T612
1885 0.000095 boy T612
1886 0.000092 boy T612
1889 0.000059 boy T612
1890 0.000075 boy T612
1891 0.000082 boy T612
1892 0.000099 boy T612
1893 0.000066 boy T612
1894 0.000064 boy T612
1895 0.000079 boy T612
1896 0.000077 boy T612
1897 0.000098 boy T612
1898 0.000053 boy T612
1899 0.000095 boy T612
1900 0.000166 boy T612
1902 0.000121 boy T612
1903 0.000077 boy T612
1904 0.000137 boy T612
1905 0.000147 boy T612
1906 0.000118 boy T612
1907 0.000132 boy T612
1908 0.000150 boy T612
1909 0.000181 boy T612
1910 0.000149 boy T612
1911 0.000166 boy T612
1912 0.000171 boy T612
1913 0.000203 boy T612
1914 0.000220 boy T612
1915 0.000182 boy T612
1916 0.000204 boy T612
1917 0.000215 boy T612
1918 0.000233 boy T612
1919 0.000228 boy T612
1920 0.000218 boy T612
1921 0.000219 boy T612
1922 0.000231 boy T612
1923 0.000209 boy T612
1924 0.000210 boy T612
1925 0.000208 boy T612
1926 0.000191 boy T612
1927 0.000220 boy T612
1928 0.000214 boy T612
1929 0.000246 boy T612
1930 0.000225 boy T612
1931 0.000266 boy T612
1932 0.000260 boy T612
1933 0.000290 boy T612
1934 0.000344 boy T612
1935 0.000324 boy T612
1936 0.000314 boy T612
1937 0.000293 boy T612
1938 0.000312 boy T612
1939 0.000312 boy T612
1940 0.000299 boy T612
1941 0.000278 boy T612
1942 0.000256 boy T612
1943 0.000216 boy T612
1944 0.000240 boy T612
1945 0.000225 boy T612
1946 0.000246 boy T612
1947 0.000218 boy T612
1948 0.000211 boy T612
1949 0.000207 boy T612
1950 0.000211 boy T612
1951 0.000197 boy T612
1952 0.000184 boy T612
1953 0.000189 boy T612
1954 0.000176 boy T612
1955 0.000171 boy T612
1956 0.000177 boy T612
1957 0.000164 boy T612
1958 0.000201 boy T612
1959 0.000233 boy T612
1960 0.000256 boy T612
1961 0.000271 boy T612
1962 0.000312 boy T612
1963 0.000430 boy T612
1964 0.000507 boy T612
1965 0.000568 boy T612
1966 0.000718 boy T612
1967 0.000902 boy T612
1968 0.001162 boy T612
1969 0.001567 boy T612
1970 0.002270 boy T612
1971 0.002762 boy T612
1972 0.003068 boy T612
1973 0.003637 boy T612
1974 0.003496 boy T612
1975 0.003674 boy T612
1976 0.004188 boy T612
1977 0.004259 boy T612
1978 0.004072 boy T612
1979 0.006169 boy T612
1980 0.005808 boy T612
1981 0.005435 boy T612
1982 0.005525 boy T612
1983 0.005627 boy T612
1984 0.005461 boy T612
1985 0.004725 boy T612
1986 0.004416 boy T612
1987 0.005335 boy T612
1988 0.005711 boy T612
1989 0.005500 boy T612
1990 0.005325 boy T612
1991 0.004786 boy T612
1992 0.004239 boy T612
1993 0.003845 boy T612
1994 0.003338 boy T612
1995 0.002788 boy T612
1996 0.002576 boy T612
1997 0.002346 boy T612
1998 0.001937 boy T612
1999 0.001770 boy T612
2000 0.001576 boy T612
2001 0.001424 boy T612
2002 0.001306 boy T612
2003 0.001209 boy T612
2004 0.001164 boy T612
2005 0.001114 boy T612
2006 0.001186 boy T612
2007 0.001087 boy T612
2008 0.001025 boy T612

In [70]:
# plot his popularity
idf.ix['Travis'].prop.plot()


Out[70]:
<matplotlib.axes.AxesSubplot at 0x875cac8>

In [72]:
# calc the mean proportion
boys.groupby('name')['prop'].mean()


Out[72]:
name
Aaden       0.000442
Aarav       0.000101
Aaron       0.002266
Ab          0.000044
Abb         0.000046
Abbie       0.000046
Abbott      0.000044
Abdiel      0.000092
Abdul       0.000064
Abdullah    0.000094
Abe         0.000225
Abel        0.000152
Abelardo    0.000041
Abie        0.000042
Abner       0.000130
...
Zander       0.000369
Zane         0.000175
Zavier       0.000098
Zayden       0.000185
Zayne        0.000091
Zeb          0.000080
Zebulon      0.000055
Zechariah    0.000112
Zed          0.000056
Zeke         0.000055
Zenas        0.000052
Zeno         0.000071
Zigmund      0.000052
Zion         0.000490
Zollie       0.000060
Name: prop, Length: 3437, dtype: float64
Out[72]:
name
Aaden       0.000442
Aarav       0.000101
Aaron       0.002266
Ab          0.000044
Abb         0.000046
Abbie       0.000046
Abbott      0.000044
Abdiel      0.000092
Abdul       0.000064
Abdullah    0.000094
Abe         0.000225
Abel        0.000152
Abelardo    0.000041
Abie        0.000042
Abner       0.000130
...
Zander       0.000369
Zane         0.000175
Zavier       0.000098
Zayden       0.000185
Zayne        0.000091
Zeb          0.000080
Zebulon      0.000055
Zechariah    0.000112
Zed          0.000056
Zeke         0.000055
Zenas        0.000052
Zeno         0.000071
Zigmund      0.000052
Zion         0.000490
Zollie       0.000060
Name: prop, Length: 3437, dtype: float64

In [73]:
# sort it
boys.groupby('name')['prop'].mean().order()


Out[73]:
name
Danniel     0.000027
Kennard     0.000027
Deryl       0.000028
Grayling    0.000028
Michial     0.000028
Cheryl      0.000028
Dwyane      0.000028
Levern      0.000029
Diane       0.000029
Randolf     0.000029
Theodis     0.000029
Kimball     0.000029
Renard      0.000029
Patric      0.000029
Lex         0.000030
...
Paul       0.008322
Daniel     0.008517
Frank      0.010386
Edward     0.011069
Richard    0.014183
Thomas     0.014739
George     0.016254
David      0.016737
Joseph     0.017771
Michael    0.018342
Charles    0.019521
Robert     0.029625
William    0.034182
James      0.035465
John       0.041082
Name: prop, Length: 3437, dtype: float64

In [74]:
# get some stats
boys['prop'].describe()


Out[74]:
count    129000.000000
mean          0.000916
std           0.003441
min           0.000026
25%           0.000071
50%           0.000142
75%           0.000453
max           0.081541
dtype: float64

In [75]:
# groupby by year, call describe on the prop column
result = boys.groupby('year')['prop'].describe()

In [76]:
# display the first 50
result[:50]


Out[76]:
year       
1880  count    1000.000000
      mean        0.000931
      std         0.004871
      min         0.000042
      25%         0.000059
      50%         0.000118
      75%         0.000372
      max         0.081541
1881  count    1000.000000
      mean        0.000930
      std         0.004812
      min         0.000037
      25%         0.000065
      50%         0.000120
      75%         0.000369
      max         0.080975
1882  count    1000.000000
      mean        0.000928
      std         0.004691
      min         0.000041
      25%         0.000066
      50%         0.000115
      75%         0.000377
      max         0.078314
1883  count    1000.000000
      mean        0.000929
      std         0.004675
      min         0.000044
      25%         0.000062
      50%         0.000116
      75%         0.000382
      max         0.079066
1884  count    1000.000000
      mean        0.000927
      std         0.004542
      min         0.000041
      25%         0.000065
      50%         0.000122
      75%         0.000385
      max         0.076476
1885  count    1000.000000
      mean        0.000926
      std         0.004459
      min         0.000043
      25%         0.000069
      50%         0.000121
      75%         0.000373
      max         0.075517
1886  count    1000.000000
      mean        0.000926
dtype: float64

In [77]:
# lets look at 2008
df = boys[boys.year == 2008]
df.prop


Out[77]:
128000    0.010355
128001    0.009437
128002    0.009301
128003    0.008799
128004    0.008702
128005    0.008566
128006    0.008442
128007    0.008438
128008    0.008268
128009    0.008061
128010    0.007877
128011    0.007694
128012    0.007570
128013    0.007467
128014    0.007262
...
128985    0.000092
128986    0.000091
128987    0.000091
128988    0.000091
128989    0.000091
128990    0.000090
128991    0.000090
128992    0.000090
128993    0.000090
128994    0.000090
128995    0.000090
128996    0.000089
128997    0.000089
128998    0.000089
128999    0.000089
Name: prop, Length: 1000, dtype: float64

In [78]:
df = boys[boys.year == 2008].sort_index(by='prop', ascending=False)   # If not in descending order. Can also do ascending=True for ascending.
df.prop


Out[78]:
128000    0.010355
128001    0.009437
128002    0.009301
128003    0.008799
128004    0.008702
128005    0.008566
128006    0.008442
128007    0.008438
128008    0.008268
128009    0.008061
128010    0.007877
128011    0.007694
128012    0.007570
128013    0.007467
128014    0.007262
...
128985    0.000092
128986    0.000091
128987    0.000091
128988    0.000091
128989    0.000091
128992    0.000090
128994    0.000090
128993    0.000090
128995    0.000090
128991    0.000090
128990    0.000090
128997    0.000089
128998    0.000089
128996    0.000089
128999    0.000089
Name: prop, Length: 1000, dtype: float64

In [79]:
# perform a cumulative summation
df.prop.cumsum()   # numpy operation


Out[79]:
128000    0.010355
128001    0.019792
128002    0.029093
128003    0.037892
128004    0.046594
128005    0.055160
128006    0.063602
128007    0.072040
128008    0.080308
128009    0.088369
128010    0.096246
128011    0.103940
128012    0.111510
128013    0.118977
128014    0.126239
...
128985    0.794154
128986    0.794245
128987    0.794336
128988    0.794427
128989    0.794518
128992    0.794608
128994    0.794698
128993    0.794788
128995    0.794878
128991    0.794968
128990    0.795058
128997    0.795147
128998    0.795236
128996    0.795325
128999    0.795414
Length: 1000, dtype: float64

In [80]:
# how many does it take to reach 50%. Also called a measure of diversity.
df.prop.cumsum().searchsorted(0.5)


Out[80]:
127

In [81]:
# plot the rank where 50% proportion is
def get_quantile_count(group, quantile = 0.5):
    df = group.sort_index(by='prop', ascending=False)
    return df.prop.cumsum().searchsorted(quantile)

boys.groupby('year').apply(get_quantile_count).plot()


Out[81]:
<matplotlib.axes.AxesSubplot at 0xa7ab3c8>

In [82]:
# do this for boys and girls
def get_quantile_count(group, quantile=0.5):     # Problem with no different colors for boys and girls.
    group = group.groupby('soundex').sum()
    df = group.sort_index(by='prop', ascending=False)
    return df.prop.cumsum().searchsorted(quantile)

#f = lambda x: get_quantile_count(x, 0.1)
q = 0.5
boy_ct = boys.groupby('year').apply(get_quantile_count, quantile=q)   # to pass different values for quantile
girl_ct = girls.groupby('year').apply(get_quantile_count, quantile=q)
boy_ct.plot(label='boy')
girl_ct.plot(label='girl')
legend(loc='best')    # with --pylab=inline, we don't have to do plt.legend()


Out[82]:
<matplotlib.legend.Legend at 0xad1ae80>

In [83]:
# display the ranking
boys[boys.year == 2008].prop.rank()    # mean rank by default.


Out[83]:
128000    1000
128001     999
128002     998
128003     997
128004     996
128005     995
128006     994
128007     993
128008     992
128009     991
128010     990
128011     989
128012     988
128013     987
128014     986
...
128985    16.0
128986    12.5
128987    12.5
128988    12.5
128989    12.5
128990     7.5
128991     7.5
128992     7.5
128993     7.5
128994     7.5
128995     7.5
128996     2.5
128997     2.5
128998     2.5
128999     2.5
Name: prop, Length: 1000, dtype: float64

In [84]:
# create group object
grouped = boys.groupby('year')['prop']

In [85]:
grouped.transform(Series.rank)   # transform is more rigid than apply. Output the same size as the input.


Out[85]:
0     1000
1      999
2      998
3      997
4      996
5      995
6      994
7      993
8      992
9      991
10     990
11     989
12     988
13     987
14     986
...
128985    16.0
128986    12.5
128987    12.5
128988    12.5
128989    12.5
128990     7.5
128991     7.5
128992     7.5
128993     7.5
128994     7.5
128995     7.5
128996     2.5
128997     2.5
128998     2.5
128999     2.5
Name: prop, Length: 129000, dtype: float64

In [86]:
# add this as a column
boys['year_rank'] = grouped.transform(Series.rank)

In [87]:
# how popular is Alan
boys[boys.name == 'Alan'].year_rank.plot()


Out[87]:
<matplotlib.axes.AxesSubplot at 0xa78e160>

In [89]:
# Other names
boys[boys.name == 'Michael'].year_rank.plot()


Out[89]:
<matplotlib.axes.AxesSubplot at 0xc6e8978>

In [90]:
# lets load some birth data
births = read_csv(r'c:\pandas-exercises-master\births.csv')

In [91]:
merged = merge(names, births, on=['year', 'sex'])   # merge 2 tables: names and births., inner by default

In [92]:
# calc the number of people born with a given name by multiplying proportion by total number of births
merged['persons'] = np.floor(merged.prop * merged.births)

In [94]:
merged.head()


Out[94]:
year name prop sex soundex births persons
0 1880 John 0.081541 boy J500 118405 9654
1 1880 William 0.080511 boy W450 118405 9532
2 1880 James 0.050057 boy J520 118405 5926
3 1880 Charles 0.045167 boy C642 118405 5347
4 1880 George 0.043292 boy G620 118405 5125
Out[94]:
year name prop sex soundex births persons
0 1880 John 0.081541 boy J500 118405 9654
1 1880 William 0.080511 boy W450 118405 9532
2 1880 James 0.050057 boy J520 118405 5926
3 1880 Charles 0.045167 boy C642 118405 5347
4 1880 George 0.043292 boy G620 118405 5125

In [95]:
merged.groupby(['name', 'sex'])['persons'].sum()   # slice and dice. It's a hierarhical labeling.


Out[95]:
name      sex 
Aaden     boy        959
Aaliyah   girl     39660
Aarav     boy        219
Aaron     boy     508034
          girl      1365
Ab        boy         24
Abagail   girl      2680
Abb       boy         15
Abbey     girl     14333
Abbie     boy         10
          girl     16556
Abbigail  girl      6792
Abbott    boy          9
Abby      girl     47845
Abdiel    boy        199
...
Zilpah  girl        8
Zilpha  girl       70
Zina    girl     2263
Zion    boy     11428
        girl     1530
Zita    girl      831
Zoa     girl       73
Zoe     girl    66271
Zoey    girl    19335
Zoie    girl     4688
Zola    girl     4847
Zollie  boy        60
Zona    girl     3218
Zora    girl     4551
Zula    girl     3578
Name: persons, Length: 7455, dtype: float64

In [96]:
merged.groupby(['name', 'sex'])['persons'].sum().order()


Out[96]:
name      sex 
Achsah    girl    3
Ala       girl    3
Alabama   girl    3
Alfred    girl    3
Amey      girl    3
Chester   girl    3
Chestina  girl    3
Clem      girl    3
Crissie   girl    3
Daisye    girl    3
Dema      girl    3
Dollye    girl    3
Eithel    girl    3
Abie      boy     4
Acey      boy     4
...
Elizabeth    girl    1540430
Patricia     girl    1567602
Daniel       boy     1776483
Christopher  boy     1924919
Thomas       boy     2239606
Charles      boy     2320818
Joseph       boy     2490274
Richard      boy     2535681
David        boy     3495940
William      boy     3951365
Mary         girl    4097626
Michael      boy     4207352
Robert       boy     4752198
John         boy     5016124
James        boy     5021269
Name: persons, Length: 7455, dtype: float64

In [97]:
mboys = merge(boys, births)  # inner join by default.

In [98]:
# calc the number of boys born with a given name
mboys['persons'] = np.floor(mboys.prop * mboys.births)

In [99]:
# Select out persons
persons = mboys.set_index(['year', 'name']).persons

In [100]:
persons  # hierarhical index


Out[100]:
year  name   
1880  John       9654
      William    9532
      James      5926
      Charles    5347
      George     5125
      Frank      3241
      Joseph     2632
      Thomas     2533
      Henry      2443
      Robert     2415
      Edward     2363
      Harry      2152
      Walter     1754
      Arthur     1598
      Fred       1568
...
2008  Eliezer      199
      Bronson      197
      Cornelius    197
      Pierre       197
      Cortez       197
      Baron        195
      Carlo        195
      Carsen       195
      Fletcher     195
      Izayah       195
      Kolten       195
      Damari       193
      Hugh         193
      Jensen       193
      Yurem        193
Name: persons, Length: 129000, dtype: float64

In [101]:
# Select out all the people named Chris. Plot is kind of crowded. Matplotlib doesn't go more than 130 in x axis.
persons.ix[:, 'Christopher'].plot(kind='bar', rot=90)


Out[101]:
<matplotlib.axes.AxesSubplot at 0xc70d2b0>

In [102]:
# what about me?
persons.ix[:, 'Alan'].plot(kind='bar', rot=90)


Out[102]:
<matplotlib.axes.AxesSubplot at 0xf54f7f0>

In [103]:
persons.unstack('name')   # Create a data frame whose columns are each unique names, and the row indexes are the years.


Out[103]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129 entries, 1880 to 2008
Columns: 3437 entries, Aaden to Zollie
dtypes: float64(3437)

In [104]:
result = _   # underscore in ipython: the output of the last statement, because we don't want to compute again the same thing.

In [105]:
result['Alan']


Out[105]:
year
1880     NaN
1881     NaN
1882       5
1883     NaN
1884       8
1885      11
1886       4
1887       5
1888       5
1889       9
1890      11
1891       8
1892      16
1893      11
1894      10
1895      21
1896      13
1897      19
1898      14
1899      17
1900      26
1901      19
1902      26
1903      24
1904      28
1905      23
1906      27
1907      38
1908      30
1909      44
1910      59
1911      55
1912     102
1913     158
1914     185
1915     236
1916     251
1917     309
1918     332
1919     331
1920     431
1921     375
1922     479
1923     588
1924     623
1925     803
1926     836
1927     810
1928     963
1929    1100
1930    1304
1931    1418
1932    1339
1933    1341
1934    1367
1935    1424
1936    1756
1937    1840
1938    2240
1939    2331
1940    2377
1941    2624
1942    3136
1943    3855
1944    4278
1945    5077
1946    6482
1947    7813
1948    7391
1949    7233
1950    7737
1951    8382
1952    8314
1953    8433
1954    8705
1955    9027
1956    8720
1957    8213
1958    8221
1959    8150
1960    8357
1961    8616
1962    7702
1963    6938
1964    6807
1965    5617
1966    4943
1967    4566
1968    3986
1969    3862
1970    3554
1971    3167
1972    2576
1973    2268
1974    2229
1975    2249
1976    2188
1977    2131
1978    2116
1979    2321
1980    2665
1981    2583
1982    2678
1983    2767
1984    2698
1985    2683
1986    2640
1987    2583
1988    2751
1989    2723
1990    2640
1991    2541
1992    2602
1993    2559
1994    2433
1995    2291
1996    2442
1997    2175
1998    2137
1999    2246
2000    2413
2001    2642
2002    2618
2003    3094
2004    3062
2005    3204
2006    3467
2007    3244
2008    3020
Name: Alan, Length: 129, dtype: float64

In [106]:
result['Alan'].plot()


Out[106]:
<matplotlib.axes.AxesSubplot at 0xf53b6d8>

What else is Pandas useful for?

As Pandas uses Numpy underneath it can be used as the data structure for many Numpy compatible libraries

Additionally since you can convert the values to a list, dict, json etc.. it is compatible with pretty much everything


In [107]:
# e.g.
sample_df['col2'].tolist()


Out[107]:
[0.017837544444303886,
 0.6786099870029839,
 -0.08891275376269536,
 -0.7387600206972956,
 0.0]

In [108]:
# dataframes become a list of lists
sample_df[['col1','col2']].values.tolist()


Out[108]:
[[0.4709464627097036, 0.017837544444303886],
 [0.5850881841278334, 0.6786099870029839],
 [1.641423457204322, -0.08891275376269536],
 [-0.2071279112214586, -0.7387600206972956],
 [-0.8036489512527647, 0.0]]

In [109]:
# for dictionaries key values are columns, values are dicts with key values as row index value: row value
sample_df.to_dict()


Out[109]:
{'col1': {'a': 0.47094646270970358,
  'b': 0.58508818412783337,
  'c': 1.641423457204322,
  'd': -0.2071279112214586,
  'e': -0.8036489512527647},
 'col2': {'a': 0.017837544444303886,
  'b': 0.67860998700298392,
  'c': -0.088912753762695362,
  'd': -0.73876002069729563,
  'e': 0.0},
 'date': {'a': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'b': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'c': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'd': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'e': Timestamp('2013-09-20 15:20:05.470000', tz=None)},
 'day': {'a': 20, 'b': 20, 'c': 20, 'd': 20, 'e': 20},
 'diff': {'a': 0.45310891826539967,
  'b': -0.09352180287515055,
  'c': 1.7303362109670175,
  'd': 0.53163210947583706,
  'e': 0.0},
 'new_col': {'a': -0.55161032907010144,
  'b': -0.42492282964636496,
  'c': 1.158524297399135,
  'd': -1.5063906936060718,
  'e': -0.75485284391010221}}

A quick funky d3.js visualisation example

Some background about De Bruijn Graph http://en.wikipedia.org/wiki/De_Bruijn_graph

Set-up networkx for visualizing the local structure of such network.


In [110]:
import networkx as nx
from IPython.core.display import display_javascript
#from IPython.frontend.html.notebook import visutils as vis
import json
import time

Load D3.js


In [111]:
%install_ext https://raw.github.com/cschin/ipython_d3_mashup/master/extension/visutils.py
%reload_ext visutils
vis.run_js("$.getScript('http://d3js.org/d3.v2.js')")
vis.run_js("$.getScript('https://raw.github.com/cschin/ipython_d3_mashup/master/extension/vis_extension.js')")
time.sleep(2)
vis.run_js("IPython.vis_init();")


Installed visutils.py. To use it, type:
  %load_ext visutils

Set up the visulization "cell"/widget.


In [112]:
try:
    vis_display.remove()
except:
    pass
plot_area_style = {"position":"absolute",
                   "top":"0px",
                   "width":"850px",
                   "left":"750px",
                   "height":"350px",
                   "border":"9px groove",
                   "background-color":"rgba(200,200,200,0.5)"}

vis_cell = vis.VISCellWidget(name="plot_area", style = plot_area_style)

## attache the container to a "visual display"
vis_display = vis.NotebookVisualDisplay(container = vis_cell)


## create the SVG element for D3
svg_style = {"width":"850px", 
             "height":"300px",
             "border":"2px solid"}

svg = vis.SVGWidget(name = "svg_display", 
                   parent = "plot_area", 
                   style = svg_style,
                   vis = vis_display)



In [113]:
def replace_nodes(G0, Ns, N0):
    G0.add_node(N0, X="X")

    head = Ns[0]
    tail = Ns[-1]
    
    p = G0.predecessors(Ns[0])
    if len(p) == 1:
        p = p[0]
        G0.add_edge(p, N0)
        
    n = G0.successors(Ns[-1])
    if len(n) == 1:
        n = n[0]
        G0.add_edge(N0, n)
    
    G0.remove_nodes_from(Ns)

def reduce_graph(G0):
    G1 = G0.copy()
    G2 = G0.copy()
    nodes_to_remove = []
    for n in G1.nodes():
        if len(G1.successors(n)) > 1 or len(G1.predecessors(n)) > 1:
            if n[0] != "^" and n[-1] != "$":
                nodes_to_remove.append(n)
    G1.remove_nodes_from(nodes_to_remove)

    for ns in nx.weakly_connected_components(G1):
        ns = [n for n in ns if n[0] != "^" and n[-1] != "$"]
        if len(ns) == 0: continue
        contig = []
        n_sorted = nx.topological_sort(G1, ns)
        n_sorted = [n for n in n_sorted if n[0] != "^" and n[-1] != "$"]
        if len(n_sorted) <= 1: continue

        for kmer in n_sorted:
            assert len(G1.successors(kmer)) <= 1
            if len(contig) == 0:
                contig.append(kmer)
            else:
                contig.append(kmer[-1])
        replace_nodes(G2, n_sorted, "".join(contig))
        
    return G2

Generate the json for d3.js for the force layout.


In [114]:
def get_group(w, w1):
    c = 0
    for c1, c2 in zip(w,w1):
        if c1 == c2:
            c += 1
            continue
        break
    return c

def set_g_json(seq, k, reduce_g = False):
    G=nx.DiGraph()
    seq = "^"+seq+"$"
    for i in range(len(seq)-k+1):
        w1 = seq[i:i+k-1]
        w2 = seq[i+1:i+k]
        G.add_edge(w1, w2)
    if reduce_g == True:
        G = reduce_graph(G)
    def generateD3JSONForG(G):
        s = {"nodes":[], "links":[]}
        name2Idx = {}
        c = 0
        for n in G.nodes():
            #print n
            g = len(G.neighbors(n))
            if "^" in n:
                s["nodes"].append({"name":n, "group":g, "fixed":True, "x":0,"y":150})
            elif "$" in n:
                s["nodes"].append({"name":n, "group":g, "fixed":True, "x":850,"y":150})
            else:
                s["nodes"].append({"name":n, "group":g})
            name2Idx[n] = c
            c += 1
        for e in G.edges():
            col = "rgb(0,0,255)"
            width = 1
            s["links"].append({"source":name2Idx[e[0]], "target":name2Idx[e[1]], "color":col, "width":width})
        return json.dumps(s)     
    n_json = generateD3JSONForG(G)
    vis_cell.set_js_var("n_json", n_json)

Set up d3.js code for showing up the network around the word START.


In [115]:
vis_display.js_code=[]

set_g_json("ACGTACGTTGTGCAGTAGTAGTAGT",5)

js = """

(function() {

var plot_neighbor=function(json) {
  var w = 850,
      h = 300,
      fill = d3.scale.category10();
  var vis = d3.select("#plot_area #svg_display")
  var force = d3.layout.force()
      .charge(-40)
      .linkDistance(2)
      .nodes(json.nodes)
      .links(json.links)
      .size([w, h])
      .linkStrength(0.1)
      .start();

  var link = vis.selectAll("line.link")
      .data(json.links)
      .enter().append("svg:line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return d.width; })
      .style("stroke", function(d) { return d.color; })
      .attr("x1", function(d) { return d.source.x; })
      .attr("y1", function(d) { return d.source.y; })
      .attr("x2", function(d) { return d.target.x; })
      .attr("y2", function(d) { return d.target.y; });

  var node = vis.selectAll("circle.node")
      .data(json.nodes)
      .enter().append("svg:circle")
      .attr("class", "node")
      .attr("cx", function(d) { return d.x; })
      .attr("cy", function(d) { return d.y; })
      .attr("r", 4)
      .style("fill", function(d) { return fill(d.group); })
      .call(force.drag);

  node.append("svg:title")
      .text(function(d) { return d.name; });

  vis.style("opacity", 1e-6)
    .transition()
      .duration(1000)
      .style("opacity", 1);


// Per-type markers, as they don't inherit styles.
vis.append("svg:defs").selectAll("marker")
    .data(["arrow"])
    .enter().append("svg:marker")
    .attr("id", String)
    .attr("viewBox", "0 -5 10 10")
    .attr("refX", 15)
    .attr("refY", -1.5)
    .attr("markerWidth", 6)
    .attr("markerHeight", 6)
    .attr("orient", "auto")
    .append("svg:path")
    .attr("d", "M0,-5L10,0L0,5");
    

var path = vis.append("svg:g").selectAll("path")
    .data(force.links())
  .enter().append("svg:path")
    .attr("class", function(d) { return "link arrow"; })
    .attr("marker-end", function(d) { return "url(#arrow)"; })
    .style("stroke-width", "1.5px");

  force.on("tick", function() {
    link.attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });

    node.attr("cx", function(d) { return d.x; })
        .attr("cy", function(d) { return d.y; });

    path.attr("d", function(d) {
      var dx = d.target.x - d.source.x,
          dy = d.target.y - d.source.y,
          dr = Math.sqrt(dx * dx + dy * dy);
      return "M" + d.source.x + "," + d.source.y + "L" + d.target.x + "," + d.target.y;
    })

  });};
var vc = IPython.vis_utils.name_to_viscell["plot_area"];
//alert(vc.data);
var n_json=$.parseJSON(vc.data.n_json);
//var n_json = vc.data.n_json;
//alert(vc.data["n_json"]);
plot_neighbor(n_json)})()
"""
vis_display.attach_js_code(js)
vis_display.refresh()



In [116]:
vis_display.hide()



In [117]:
vis_display.show()



In [118]:
def show_neighbors(w, k, reduce_g = False):
    vis.run_js('$("#svg_display *").remove();')
    set_g_json(w, k, reduce_g = reduce_g)
    for jc in vis_display.js_code:
        vis.run_js(jc)
    
    
## create a test input text box
input_style = {"width":"440px"}
tb = vis.InputWidget(name = "input_1",
                     parent = "plot_area",
                     style = input_style,
                     value = "AATTAATTAAGGTTTTAATTATTAATTGTAATTAATTAATTAATACTGAT",
                     vis = vis_display)

def onchange(self, *argv, **kwargv):
    self.update_value()
    
vis.set_action(tb, "onchange", onchange)


## create a input text box for k
input_style = {"width":"40px"}
kb = vis.InputWidget(name = "input_2",
                     parent = "plot_area",
                     style = input_style,
                     value = "5",
                     vis = vis_display)
    
vis.set_action(kb, "onchange", onchange)

button_style = {"width":"120px"}
b = vis.ButtonWidget(name="button", 
                     parent="plot_area", 
                     style=button_style, 
                     text="show graph",
                     vis = vis_display)
b.argv = [tb, kb]
def onclick(self, *argv, **kwargv):
    self.text = argv[0].value
    self.k = int(argv[1].value)
    show_neighbors(self.text, self.k, reduce_g = False)
vis.set_action(b, "onclick", onclick, "argv")
    
    
button_style = {"width":"120px"}
b_r = vis.ButtonWidget(name="button2", 
                     parent="plot_area", 
                     style=button_style, 
                     text="reduce graph",
                     vis = vis_display)
b_r.argv = [tb, kb]
def onclick(self, *argv, **kwargv):
    self.text = argv[0].value
    self.k = int(argv[1].value)
    show_neighbors(self.text, self.k, reduce_g = True)
vis.set_action(b_r, "onclick", onclick, "argv")    
    

button_style = {"width":"120px"}
b2 = vis.ButtonWidget(name="button3", 
                     parent="plot_area", 
                     style=button_style, 
                     text="close",
                     vis = vis_display)
def onclick(self, *argv, **kwargv):
    vis_display.remove()
    
vis.set_action(b2, "onclick", onclick)

vis_display.refresh()



In [ ]:


In [ ]:

Questions??


In [ ]:


In [ ]: