In [1]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib

Let's say hello to Pandas. The convention is use pd in the import.



In [2]:

    
import pandas as pd

Pandas Series Object

Array/list, one dimensional object. The easiest way is to initialize it with a list.



In [3]:

    
cities = ['London', 'New York', 'Berlin', 'Toronto']



In [4]:

    
s =pd.Series(cities)
s









    Out[4]:





0      London
1    New York
2      Berlin
3     Toronto
dtype: object

We see that we have the values in the list, but also and index, in this case going from 0 to 3. That is the default index.



In [5]:

    
s.index









    Out[5]:





Int64Index([0, 1, 2, 3], dtype=int64)



In [6]:

    
s.values









    Out[6]:





array(['London', 'New York', 'Berlin', 'Toronto'], dtype=object)

We can reference the values by using the appropiate index. In this case there is not much difference with a numpy array.



In [7]:

    
s[1]









    Out[7]:





'New York'

You can assign you own index. Which brings the power to the structure. Later it will be clear why.



In [8]:

    
s = pd.Series(cities, index = ['A', 'B', 'C', 'D'])
s









    Out[8]:





A      London
B    New York
C      Berlin
D     Toronto
dtype: object

In this case we have used labels A to D. And we can now reference the data in the Series by using the label.



In [9]:

    
s['D']









    Out[9]:





'Toronto'

Bogotá

Bogotá Image by Jorge Díaz

Let us try to look at a more meaningful series. The population of Bogotá (the capital of Colombia) as evolving in time. Data taken from wikipedia. In this case it makes sense that the label is the year, and the value is the corresponding population. Then we can easily retrieve by year. Previously we used a list, but we can also use a dictionary to initialize a Series Object. In this case the keys will be the index, and the value will be the population in that year:



In [41]:

    
population_bogota = {1800:21964, 
                     1912:121257, 
                     1951:715250, 
                     1964:1697311, 
                     1973:2855065, 
                     1985:4236490, 
                     1999:6276428, 
                     2012:7571345}



In [42]:

    
series_bogota = pd.Series(population_bogota)
series_bogota









    Out[42]:





1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
dtype: int64

When you are working with different series it may be useful to include some meta-data on the series. Like the name of the series itself, and of the index.



In [43]:

    
series_bogota.name = 'Bogota population'
series_bogota.index.name = 'year'



In [44]:

    
series_bogota









    Out[44]:





year
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64



In [45]:

    
series_bogota.index









    Out[45]:





Int64Index([1800, 1912, 1951, 1964, 1973, 1985, 1999, 2012], dtype=int64)



In [46]:

    
series_bogota.values









    Out[46]:





array([  21964,  121257,  715250, 1697311, 2855065, 4236490, 6276428,
       7571345])

We can obviously reference specific values by using the index



In [47]:

    
series_bogota[1800]









    Out[47]:





21964

Note that the values in the series are numpy arrays. This is part of what makes pandas fast and can be very useful when working with other libraries.



In [48]:

    
type(series_bogota.values)









    Out[48]:





numpy.ndarray

The index, on the other hand is a pandas Object. There is a hierarchy of Index objects that includes specific types for time indexes, hierichical index and other types of indexes.



In [49]:

    
type(series_bogota.index)









    Out[49]:





pandas.core.index.Int64Index

We can query som information based on values. When did the population of bogota went above 1.000.000



In [50]:

    
series_bogota[series_bogota > 1000000]









    Out[50]:





year
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

I need the population from the 60's on



In [51]:

    
series_bogota[series_bogota.index > 1960]









    Out[51]:





year
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

What is actually going on with this way of querying information? Let us see what the bit inside the brackets yields.



In [52]:

    
series_bogota.index > 1965









    Out[52]:





array([False, False, False, False,  True,  True,  True,  True], dtype=bool)

This means that we can also query using an array of booleans, perhaps handy if we have complex programatic conditions.



In [53]:

    
series_bogota[[True, False, True]]









    Out[53]:





year
1800     21964
1951    715250
Name: Bogota population, dtype: int64

You can also query based on the index value itself.



In [54]:

    
series_bogota[[1973, 2012, 2011]]









    Out[54]:





1973    2855065
2012    7571345
2011        NaN
Name: Bogota population, dtype: float64

Note that we get NaN, for 2011, because this label is not in the original index. This is part of the automatic handling of missing values, and will turn out to be extremely valuable when working with real data.

Lets add a ficticious value for 2011.



In [55]:

    
series_bogota = series_bogota.set_value(2011, 6500000)
series_bogota









    Out[55]:





year
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
2011    6500000
Name: Bogota population, dtype: int64

You can apply functions to each element of the series. Remember, this is close to numpy.



In [56]:

    
millions = lambda x: x/1000000.0
series_bogota.apply(millions)









    Out[56]:





year
1800    0.021964
1912    0.121257
1951    0.715250
1964    1.697311
1973    2.855065
1985    4.236490
1999    6.276428
2012    7.571345
2011    6.500000
Name: Bogota population, dtype: float64

Finally, sometimes we need to have a quick idea of what the data looks like. For this we can use the function describe.



In [57]:

    
series_bogota.describe()









    Out[57]:





count          9.000000
mean     3332790.000000
std      2926351.983191
min        21964.000000
25%       715250.000000
50%      2855065.000000
75%      6276428.000000
max      7571345.000000
dtype: float64

Or even cooler, plots



In [58]:

    
pd.Series.plot(series_bogota, kind='bar')









    Out[58]:





<matplotlib.axes.AxesSubplot at 0x10993e150>



In [59]:

    
series_bogota.sort()
pd.Series.plot(series_bogota/1000000.0, kind='bar')









    Out[59]:





<matplotlib.axes.AxesSubplot at 0x109a4b350>

What about population change?



In [60]:

    
series_bogota.pct_change()









    Out[60]:





year
1800         NaN
1912    4.520716
1951    4.898629
1964    1.373032
1973    0.682111
1985    0.483851
1999    0.481516
2011    0.035621
2012    0.164822
Name: Bogota population, dtype: float64



In [61]:

    
series_bogota.pct_change().plot(kind='bar')
plt.ylabel('percentage change')









    Out[61]:





<matplotlib.text.Text at 0x109d0ccd0>

Note that there is a lot of missing data. Is there a way to solve this quickly?

Reindex + Interpolation



In [62]:

    
series_bogota = series_bogota.reindex(range(1800, 2014))



In [63]:

    
series_bogota









    Out[63]:





1800    21964
1801      NaN
1802      NaN
1803      NaN
1804      NaN
1805      NaN
1806      NaN
1807      NaN
1808      NaN
1809      NaN
1810      NaN
1811      NaN
1812      NaN
1813      NaN
1814      NaN
...
1999    6276428
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011    6500000
2012    7571345
2013        NaN
Name: Bogota population, Length: 214, dtype: float64



In [64]:

    
series_bogota = series_bogota.interpolate('values')
series_bogota









    Out[64]:





1800    21964.000000
1801    22850.544643
1802    23737.089286
1803    24623.633929
1804    25510.178571
1805    26396.723214
1806    27283.267857
1807    28169.812500
1808    29056.357143
1809    29942.901786
1810    30829.446429
1811    31715.991071
1812    32602.535714
1813    33489.080357
1814    34375.625000
...
1999    6276428
2000    6295059
2001    6313690
2002    6332321
2003    6350952
2004    6369583
2005    6388214
2006    6406845
2007    6425476
2008    6444107
2009    6462738
2010    6481369
2011    6500000
2012    7571345
2013    7571345
Name: Bogota population, Length: 214, dtype: float64



In [65]:

    
(series_bogota/1000000.0).plot()









    Out[65]:





<matplotlib.axes.AxesSubplot at 0x109dc0750>



In [66]:

    
(series_bogota).pct_change().plot()









    Out[66]:





<matplotlib.axes.AxesSubplot at 0x109dc6e90>

Finally, let us look at adding series.

Cali

Cali Image by Carlos Dussán Gómez



In [67]:

    
population_cali = {1809: 7546, 1938:101883, 1973:991549, 1985:1429026, 2013:2319684}



In [68]:

    
series_cali = pd.Series(population_cali)
series_cali.name = 'Cali (Colombia) Population'
series_cali.index.name = 'year'
series_cali









    Out[68]:





year
1809       7546
1938     101883
1973     991549
1985    1429026
2013    2319684
Name: Cali (Colombia) Population, dtype: int64



In [69]:

    
(series_cali + series_bogota)









    Out[69]:





1800             NaN
1801             NaN
1802             NaN
1803             NaN
1804             NaN
1805             NaN
1806             NaN
1807             NaN
1808             NaN
1809    37488.901786
1810             NaN
1811             NaN
1812             NaN
1813             NaN
1814             NaN
...
1999        NaN
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011        NaN
2012        NaN
2013    9891029
Length: 214, dtype: float64

Probably the most meaningful way to add the series is if they share an index. So I can reindex cali with the index of Bogotá



In [70]:

    
series_cali









    Out[70]:





year
1809       7546
1938     101883
1973     991549
1985    1429026
2013    2319684
Name: Cali (Colombia) Population, dtype: int64



In [71]:

    
series_cali = series_cali.reindex(series_bogota.index)
series_cali









    Out[71]:





1800     NaN
1801     NaN
1802     NaN
1803     NaN
1804     NaN
1805     NaN
1806     NaN
1807     NaN
1808     NaN
1809    7546
1810     NaN
1811     NaN
1812     NaN
1813     NaN
1814     NaN
...
1999        NaN
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011        NaN
2012        NaN
2013    2319684
Name: Cali (Colombia) Population, Length: 214, dtype: float64



In [72]:

    
len(series_bogota) == len(series_cali)









    Out[72]:





True



In [73]:

    
np.alltrue(series_bogota.index == series_cali.index)









    Out[73]:





True



In [74]:

    
series_cali = series_cali.interpolate('values')
series_cali









    Out[74]:





1800             NaN
1801             NaN
1802             NaN
1803             NaN
1804             NaN
1805             NaN
1806             NaN
1807             NaN
1808             NaN
1809     7546.000000
1810     8277.294574
1811     9008.589147
1812     9739.883721
1813    10471.178295
1814    11202.472868
...
1999    1874355.000000
2000    1906164.214286
2001    1937973.428571
2002    1969782.642857
2003    2001591.857143
2004    2033401.071429
2005    2065210.285714
2006    2097019.500000
2007    2128828.714286
2008    2160637.928571
2009    2192447.142857
2010    2224256.357143
2011    2256065.571429
2012    2287874.785714
2013    2319684.000000
Name: Cali (Colombia) Population, Length: 214, dtype: float64



In [75]:

    
series_cali = series_cali.fillna(0.0)
series_cali









    Out[75]:





1800        0.000000
1801        0.000000
1802        0.000000
1803        0.000000
1804        0.000000
1805        0.000000
1806        0.000000
1807        0.000000
1808        0.000000
1809     7546.000000
1810     8277.294574
1811     9008.589147
1812     9739.883721
1813    10471.178295
1814    11202.472868
...
1999    1874355.000000
2000    1906164.214286
2001    1937973.428571
2002    1969782.642857
2003    2001591.857143
2004    2033401.071429
2005    2065210.285714
2006    2097019.500000
2007    2128828.714286
2008    2160637.928571
2009    2192447.142857
2010    2224256.357143
2011    2256065.571429
2012    2287874.785714
2013    2319684.000000
Name: Cali (Colombia) Population, Length: 214, dtype: float64

Now finally add the two series



In [76]:

    
series_bogota + series_cali









    Out[76]:





1800    21964.000000
1801    22850.544643
1802    23737.089286
1803    24623.633929
1804    25510.178571
1805    26396.723214
1806    27283.267857
1807    28169.812500
1808    29056.357143
1809    37488.901786
1810    39106.741002
1811    40724.580219
1812    42342.419435
1813    43960.258652
1814    45578.097868
...
1999    8150783.000000
2000    8201223.214286
2001    8251663.428571
2002    8302103.642857
2003    8352543.857143
2004    8402984.071429
2005    8453424.285714
2006    8503864.500000
2007    8554304.714286
2008    8604744.928571
2009    8655185.142857
2010    8705625.357143
2011    8756065.571429
2012    9859219.785714
2013    9891029.000000
Length: 214, dtype: float64



In [77]:

    
series_bogota.plot(label='Bogota population')
series_cali.plot(label='Cali population')
plt.legend(loc='best')









    Out[77]:





<matplotlib.legend.Legend at 0x109fd1950>

I do not vouch for the statistics here, it is just an example of the tool.

Pandas Data Frame



In [78]:

    
df = pd.DataFrame({'bogotá':series_bogota, 'cali':series_cali})



In [79]:

    
df









    Out[79]:




<class 'pandas.core.frame.DataFrame'>
Int64Index: 214 entries, 1800 to 2013
Data columns (total 2 columns):
bogotá    214  non-null values
cali      214  non-null values
dtypes: float64(2)



In [80]:

    
df.head()









    Out[80]:






  
    
      
      bogotá
      cali
    
  
  
    
      1800
       21964.000000
       0
    
    
      1801
       22850.544643
       0
    
    
      1802
       23737.089286
       0
    
    
      1803
       24623.633929
       0
    
    
      1804
       25510.178571
       0



In [81]:

    
pd.options.display.float_format = '{:20,.2f}'.format
df.index.name = 'year'



In [82]:

    
df.tail()









    Out[82]:






  
    
      
      bogotá
      cali
    
    
      year
      
      
    
  
  
    
      2009
              6,462,738.00
              2,192,447.14
    
    
      2010
              6,481,369.00
              2,224,256.36
    
    
      2011
              6,500,000.00
              2,256,065.57
    
    
      2012
              7,571,345.00
              2,287,874.79
    
    
      2013
              7,571,345.00
              2,319,684.00

Elements of DataFrame



In [83]:

    
df.index









    Out[83]:





Int64Index([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013], dtype=int64)



In [84]:

    
df.columns









    Out[84]:





Index([u'bogotá', u'cali'], dtype=object)



In [85]:

    
np.shape(df.values)









    Out[85]:





(214, 2)



In [86]:

    
df['population difference'] = df['bogotá'] - df['cali']



In [88]:

    
df.tail()









    Out[88]:






  
    
      
      bogotá
      cali
      population difference
    
    
      year
      
      
      
    
  
  
    
      2009
              6,462,738.00
              2,192,447.14
              4,270,290.86
    
    
      2010
              6,481,369.00
              2,224,256.36
              4,257,112.64
    
    
      2011
              6,500,000.00
              2,256,065.57
              4,243,934.43
    
    
      2012
              7,571,345.00
              2,287,874.79
              5,283,470.21
    
    
      2013
              7,571,345.00
              2,319,684.00
              5,251,661.00



In [87]:

    
df.describe()









    Out[87]:






  
    
      
      bogotá
      cali
      population difference
    
  
  
    
      count
                    214.00
                    214.00
                    214.00
    
    
      mean
              1,269,633.63
                440,945.03
                828,688.61
    
    
      std
              2,057,142.86
                658,404.11
              1,404,249.56
    
    
      min
                 21,964.00
                      0.00
                 21,964.00
    
    
      25%
                 69,172.50
                 39,905.78
                 29,266.72
    
    
      50%
                116,381.00
                 78,847.22
                 37,533.78
    
    
      75%
              1,376,252.60
                654,746.87
                721,505.72
    
    
      max
              7,571,345.00
              2,319,684.00
              5,283,470.21



In [89]:

    
df['population difference'].tail()









    Out[89]:





year
2009           4,270,290.86
2010           4,257,112.64
2011           4,243,934.43
2012           5,283,470.21
2013           5,251,661.00
Name: population difference, dtype: float64



In [90]:

    
df[df.index >1990]









    Out[90]:






  
    
      
      bogotá
      cali
      population difference
    
    
      year
      
      
      
    
  
  
    
      1991
              5,110,749.14
              1,619,881.29
              3,490,867.86
    
    
      1992
              5,256,459.00
              1,651,690.50
              3,604,768.50
    
    
      1993
              5,402,168.86
              1,683,499.71
              3,718,669.14
    
    
      1994
              5,547,878.71
              1,715,308.93
              3,832,569.79
    
    
      1995
              5,693,588.57
              1,747,118.14
              3,946,470.43
    
    
      1996
              5,839,298.43
              1,778,927.36
              4,060,371.07
    
    
      1997
              5,985,008.29
              1,810,736.57
              4,174,271.71
    
    
      1998
              6,130,718.14
              1,842,545.79
              4,288,172.36
    
    
      1999
              6,276,428.00
              1,874,355.00
              4,402,073.00
    
    
      2000
              6,295,059.00
              1,906,164.21
              4,388,894.79
    
    
      2001
              6,313,690.00
              1,937,973.43
              4,375,716.57
    
    
      2002
              6,332,321.00
              1,969,782.64
              4,362,538.36
    
    
      2003
              6,350,952.00
              2,001,591.86
              4,349,360.14
    
    
      2004
              6,369,583.00
              2,033,401.07
              4,336,181.93
    
    
      2005
              6,388,214.00
              2,065,210.29
              4,323,003.71
    
    
      2006
              6,406,845.00
              2,097,019.50
              4,309,825.50
    
    
      2007
              6,425,476.00
              2,128,828.71
              4,296,647.29
    
    
      2008
              6,444,107.00
              2,160,637.93
              4,283,469.07
    
    
      2009
              6,462,738.00
              2,192,447.14
              4,270,290.86
    
    
      2010
              6,481,369.00
              2,224,256.36
              4,257,112.64
    
    
      2011
              6,500,000.00
              2,256,065.57
              4,243,934.43
    
    
      2012
              7,571,345.00
              2,287,874.79
              5,283,470.21
    
    
      2013
              7,571,345.00
              2,319,684.00
              5,251,661.00



In [91]:

    
df.plot()









    Out[91]:





<matplotlib.axes.AxesSubplot at 0x10a2c04d0>



In [92]:

    
!ls *.csv









    



cali_and_bogota.csv metropolitan.csv    patients.csv



In [93]:

    
df.to_csv('cali_and_bogota.csv')



In [94]:

    
!head cali_and_bogota.csv









    



year,bogotá,cali,population difference
1800,21964.0,0.0,21964.0
1801,22850.54464285714,0.0,22850.54464285714
1802,23737.089285714286,0.0,23737.089285714286
1803,24623.633928571428,0.0,24623.633928571428
1804,25510.178571428572,0.0,25510.178571428572
1805,26396.723214285714,0.0,26396.723214285714
1806,27283.267857142855,0.0,27283.267857142855
1807,28169.8125,0.0,28169.8125
1808,29056.357142857145,0.0,29056.357142857145

There are many ways to initialize DataFrames

Let's be more cosmopolitan (if we still have 7 more minutes)



In [95]:

    
!cat metropolitan.csv









    



Metropolitan area,Country,Population,Area
Tokyo,Japan,32450000,8014
Seoul,South Korea,20550000,5076
Mexico City,Mexico,20450000,7346
New York City,United States,19750000,17884
Mumbai-Bombay,India,19200000,2350
Jakarta,Indonesia,18900000,5100
São Paulo,Brazil,18850000,8479
New Delhi,India,18600000,3182
Osaka-Kobe-Kyoto,Japan,17375000,6930
Shanghai,China,16650000,5177
Manila,Philippines,16300000,2521
Hong Kong,Hong Kong/China,15800000,3051
Los Angeles,United States,15250000,10780
Kolkata,India,15100000,1785
Moscow,Russia,15000000,14925
Cairo,Egypt,14450000,1600
Buenos Aires,Argentina,13170000,10888
London,United Kingdom,12875000,11391
Beijing,China,12500000,6562
Karachi,Pakistan,11800000,1100



In [96]:

    
df = pd.read_csv('metropolitan.csv')
df









    Out[96]:






  
    
      
      Metropolitan area
      Country
      Population
      Area
    
  
  
    
      0 
                  Tokyo
                 Japan
       32450000
        8014
    
    
      1 
                  Seoul
           South Korea
       20550000
        5076
    
    
      2 
            Mexico City
                Mexico
       20450000
        7346
    
    
      3 
          New York City
         United States
       19750000
       17884
    
    
      4 
          Mumbai-Bombay
                 India
       19200000
        2350
    
    
      5 
                Jakarta
             Indonesia
       18900000
        5100
    
    
      6 
              São Paulo
                Brazil
       18850000
        8479
    
    
      7 
              New Delhi
                 India
       18600000
        3182
    
    
      8 
       Osaka-Kobe-Kyoto
                 Japan
       17375000
        6930
    
    
      9 
               Shanghai
                 China
       16650000
        5177
    
    
      10
                 Manila
           Philippines
       16300000
        2521
    
    
      11
              Hong Kong
       Hong Kong/China
       15800000
        3051
    
    
      12
            Los Angeles
         United States
       15250000
       10780
    
    
      13
                Kolkata
                 India
       15100000
        1785
    
    
      14
                 Moscow
                Russia
       15000000
       14925
    
    
      15
                  Cairo
                 Egypt
       14450000
        1600
    
    
      16
           Buenos Aires
             Argentina
       13170000
       10888
    
    
      17
                 London
        United Kingdom
       12875000
       11391
    
    
      18
                Beijing
                 China
       12500000
        6562
    
    
      19
                Karachi
              Pakistan
       11800000
        1100



In [97]:

    
df = pd.read_csv('metropolitan.csv', index_col='Metropolitan area')
df









    Out[97]:






  
    
      
      Country
      Population
      Area
    
    
      Metropolitan area
      
      
      
    
  
  
    
      Tokyo
                 Japan
       32450000
        8014
    
    
      Seoul
           South Korea
       20550000
        5076
    
    
      Mexico City
                Mexico
       20450000
        7346
    
    
      New York City
         United States
       19750000
       17884
    
    
      Mumbai-Bombay
                 India
       19200000
        2350
    
    
      Jakarta
             Indonesia
       18900000
        5100
    
    
      São Paulo
                Brazil
       18850000
        8479
    
    
      New Delhi
                 India
       18600000
        3182
    
    
      Osaka-Kobe-Kyoto
                 Japan
       17375000
        6930
    
    
      Shanghai
                 China
       16650000
        5177
    
    
      Manila
           Philippines
       16300000
        2521
    
    
      Hong Kong
       Hong Kong/China
       15800000
        3051
    
    
      Los Angeles
         United States
       15250000
       10780
    
    
      Kolkata
                 India
       15100000
        1785
    
    
      Moscow
                Russia
       15000000
       14925
    
    
      Cairo
                 Egypt
       14450000
        1600
    
    
      Buenos Aires
             Argentina
       13170000
       10888
    
    
      London
        United Kingdom
       12875000
       11391
    
    
      Beijing
                 China
       12500000
        6562
    
    
      Karachi
              Pakistan
       11800000
        1100



In [98]:

    
df.describe()









    Out[98]:






  
    
      
      Population
      Area
    
  
  
    
      count
                     20.00
                     20.00
    
    
      mean
             17,251,000.00
                  6,707.05
    
    
      std
              4,485,562.40
                  4,619.33
    
    
      min
             11,800,000.00
                  1,100.00
    
    
      25%
             14,862,500.00
                  2,918.50
    
    
      50%
             16,475,000.00
                  5,869.50
    
    
      75%
             18,975,000.00
                  9,054.25
    
    
      max
             32,450,000.00
                 17,884.00



In [99]:

    
df.dtypes









    Out[99]:





Country       object
Population     int64
Area           int64
dtype: object



In [100]:

    
df['Population']= df['Population'].apply(float)
df['Area']= df['Area'].apply(float)



In [101]:

    
df.dtypes









    Out[101]:





Country        object
Population    float64
Area          float64
dtype: object

Three biggest metro areas



In [102]:

    
df.sort('Population')
df.head(3)









    Out[102]:






  
    
      
      Country
      Population
      Area
    
    
      Metropolitan area
      
      
      
    
  
  
    
      Tokyo
             Japan
             32,450,000.00
                  8,014.00
    
    
      Seoul
       South Korea
             20,550,000.00
                  5,076.00
    
    
      Mexico City
            Mexico
             20,450,000.00
                  7,346.00

Area largest than 10.0000 km2



In [103]:

    
df[df['Area']> 10000]









    Out[103]:






  
    
      
      Country
      Population
      Area
    
    
      Metropolitan area
      
      
      
    
  
  
    
      New York City
        United States
             19,750,000.00
                 17,884.00
    
    
      Los Angeles
        United States
             15,250,000.00
                 10,780.00
    
    
      Moscow
               Russia
             15,000,000.00
                 14,925.00
    
    
      Buenos Aires
            Argentina
             13,170,000.00
                 10,888.00
    
    
      London
       United Kingdom
             12,875,000.00
                 11,391.00



In [104]:

    
df[(df['Area']> 10000) & (df['Population'] > 20000)]









    Out[104]:






  
    
      
      Country
      Population
      Area
    
    
      Metropolitan area
      
      
      
    
  
  
    
      New York City
        United States
             19,750,000.00
                 17,884.00
    
    
      Los Angeles
        United States
             15,250,000.00
                 10,780.00
    
    
      Moscow
               Russia
             15,000,000.00
                 14,925.00
    
    
      Buenos Aires
            Argentina
             13,170,000.00
                 10,888.00
    
    
      London
       United Kingdom
             12,875,000.00
                 11,391.00



In [105]:

    
df['Density'] = df['Population']/df['Area']
df.sort('Density', ascending=False)['Density'].plot(kind='bar')









    Out[105]:





<matplotlib.axes.AxesSubplot at 0x109faac10>



In [106]:

    
df.groupby('Country').head()









    Out[106]:






  
    
      
      
      Country
      Population
      Area
      Density
    
    
      Country
      Metropolitan area
      
      
      
      
    
  
  
    
      Argentina
      Buenos Aires
             Argentina
             13,170,000.00
                 10,888.00
                  1,209.59
    
    
      Brazil
      São Paulo
                Brazil
             18,850,000.00
                  8,479.00
                  2,223.14
    
    
      China
      Shanghai
                 China
             16,650,000.00
                  5,177.00
                  3,216.15
    
    
      Beijing
                 China
             12,500,000.00
                  6,562.00
                  1,904.91
    
    
      Egypt
      Cairo
                 Egypt
             14,450,000.00
                  1,600.00
                  9,031.25
    
    
      Hong Kong/China
      Hong Kong
       Hong Kong/China
             15,800,000.00
                  3,051.00
                  5,178.63
    
    
      India
      Mumbai-Bombay
                 India
             19,200,000.00
                  2,350.00
                  8,170.21
    
    
      New Delhi
                 India
             18,600,000.00
                  3,182.00
                  5,845.38
    
    
      Kolkata
                 India
             15,100,000.00
                  1,785.00
                  8,459.38
    
    
      Indonesia
      Jakarta
             Indonesia
             18,900,000.00
                  5,100.00
                  3,705.88
    
    
      Japan
      Tokyo
                 Japan
             32,450,000.00
                  8,014.00
                  4,049.16
    
    
      Osaka-Kobe-Kyoto
                 Japan
             17,375,000.00
                  6,930.00
                  2,507.22
    
    
      Mexico
      Mexico City
                Mexico
             20,450,000.00
                  7,346.00
                  2,783.83
    
    
      Pakistan
      Karachi
              Pakistan
             11,800,000.00
                  1,100.00
                 10,727.27
    
    
      Philippines
      Manila
           Philippines
             16,300,000.00
                  2,521.00
                  6,465.69
    
    
      Russia
      Moscow
                Russia
             15,000,000.00
                 14,925.00
                  1,005.03
    
    
      South Korea
      Seoul
           South Korea
             20,550,000.00
                  5,076.00
                  4,048.46
    
    
      United Kingdom
      London
        United Kingdom
             12,875,000.00
                 11,391.00
                  1,130.28
    
    
      United States
      New York City
         United States
             19,750,000.00
                 17,884.00
                  1,104.34
    
    
      Los Angeles
         United States
             15,250,000.00
                 10,780.00
                  1,414.66



In [107]:

    
type(df.groupby('Country'))









    Out[107]:





pandas.core.groupby.DataFrameGroupBy



In [108]:

    
print df.groupby('Country').sum().sort('Density', ascending=False).head()









    



                      Population                 Area              Density
Country                                                                   
India              52,900,000.00             7,317.00            22,474.98
Pakistan           11,800,000.00             1,100.00            10,727.27
Egypt              14,450,000.00             1,600.00             9,031.25
Japan              49,825,000.00            14,944.00             6,556.38
Philippines        16,300,000.00             2,521.00             6,465.69

All the data in this notebook was taken from wikipedia.

	bogotá	cali
1800	21964.000000	0
1801	22850.544643	0
1802	23737.089286	0
1803	24623.633929	0
1804	25510.178571	0

	bogotá	cali
year
2009	6,462,738.00	2,192,447.14
2010	6,481,369.00	2,224,256.36
2011	6,500,000.00	2,256,065.57
2012	7,571,345.00	2,287,874.79
2013	7,571,345.00	2,319,684.00

	bogotá	cali	population difference
count	214.00	214.00	214.00
mean	1,269,633.63	440,945.03	828,688.61
std	2,057,142.86	658,404.11	1,404,249.56
min	21,964.00	0.00	21,964.00
25%	69,172.50	39,905.78	29,266.72
50%	116,381.00	78,847.22	37,533.78
75%	1,376,252.60	654,746.87	721,505.72
max	7,571,345.00	2,319,684.00	5,283,470.21

	bogotá	cali	population difference
year
1991	5,110,749.14	1,619,881.29	3,490,867.86
1992	5,256,459.00	1,651,690.50	3,604,768.50
1993	5,402,168.86	1,683,499.71	3,718,669.14
1994	5,547,878.71	1,715,308.93	3,832,569.79
1995	5,693,588.57	1,747,118.14	3,946,470.43
1996	5,839,298.43	1,778,927.36	4,060,371.07
1997	5,985,008.29	1,810,736.57	4,174,271.71
1998	6,130,718.14	1,842,545.79	4,288,172.36
1999	6,276,428.00	1,874,355.00	4,402,073.00
2000	6,295,059.00	1,906,164.21	4,388,894.79
2001	6,313,690.00	1,937,973.43	4,375,716.57
2002	6,332,321.00	1,969,782.64	4,362,538.36
2003	6,350,952.00	2,001,591.86	4,349,360.14
2004	6,369,583.00	2,033,401.07	4,336,181.93
2005	6,388,214.00	2,065,210.29	4,323,003.71
2006	6,406,845.00	2,097,019.50	4,309,825.50
2007	6,425,476.00	2,128,828.71	4,296,647.29
2008	6,444,107.00	2,160,637.93	4,283,469.07
2009	6,462,738.00	2,192,447.14	4,270,290.86
2010	6,481,369.00	2,224,256.36	4,257,112.64
2011	6,500,000.00	2,256,065.57	4,243,934.43
2012	7,571,345.00	2,287,874.79	5,283,470.21
2013	7,571,345.00	2,319,684.00	5,251,661.00

	Metropolitan area	Country	Population	Area
0	Tokyo	Japan	32450000	8014
1	Seoul	South Korea	20550000	5076
2	Mexico City	Mexico	20450000	7346
3	New York City	United States	19750000	17884
4	Mumbai-Bombay	India	19200000	2350
5	Jakarta	Indonesia	18900000	5100
6	São Paulo	Brazil	18850000	8479
7	New Delhi	India	18600000	3182
8	Osaka-Kobe-Kyoto	Japan	17375000	6930
9	Shanghai	China	16650000	5177
10	Manila	Philippines	16300000	2521
11	Hong Kong	Hong Kong/China	15800000	3051
12	Los Angeles	United States	15250000	10780
13	Kolkata	India	15100000	1785
14	Moscow	Russia	15000000	14925
15	Cairo	Egypt	14450000	1600
16	Buenos Aires	Argentina	13170000	10888
17	London	United Kingdom	12875000	11391
18	Beijing	China	12500000	6562
19	Karachi	Pakistan	11800000	1100

	Country	Population	Area
Metropolitan area
Tokyo	Japan	32450000	8014
Seoul	South Korea	20550000	5076
Mexico City	Mexico	20450000	7346
New York City	United States	19750000	17884
Mumbai-Bombay	India	19200000	2350
Jakarta	Indonesia	18900000	5100
São Paulo	Brazil	18850000	8479
New Delhi	India	18600000	3182
Osaka-Kobe-Kyoto	Japan	17375000	6930
Shanghai	China	16650000	5177
Manila	Philippines	16300000	2521
Hong Kong	Hong Kong/China	15800000	3051
Los Angeles	United States	15250000	10780
Kolkata	India	15100000	1785
Moscow	Russia	15000000	14925
Cairo	Egypt	14450000	1600
Buenos Aires	Argentina	13170000	10888
London	United Kingdom	12875000	11391
Beijing	China	12500000	6562
Karachi	Pakistan	11800000	1100

	Population	Area
count	20.00	20.00
mean	17,251,000.00	6,707.05
std	4,485,562.40	4,619.33
min	11,800,000.00	1,100.00
25%	14,862,500.00	2,918.50
50%	16,475,000.00	5,869.50
75%	18,975,000.00	9,054.25
max	32,450,000.00	17,884.00

		Country	Population	Area	Density
Country	Metropolitan area
Argentina	Buenos Aires	Argentina	13,170,000.00	10,888.00	1,209.59
Brazil	São Paulo	Brazil	18,850,000.00	8,479.00	2,223.14
China	Shanghai	China	16,650,000.00	5,177.00	3,216.15
China	Beijing	China	12,500,000.00	6,562.00	1,904.91
Egypt	Cairo	Egypt	14,450,000.00	1,600.00	9,031.25
Hong Kong/China	Hong Kong	Hong Kong/China	15,800,000.00	3,051.00	5,178.63
India	Mumbai-Bombay	India	19,200,000.00	2,350.00	8,170.21
	New Delhi	India	18,600,000.00	3,182.00	5,845.38
	Kolkata	India	15,100,000.00	1,785.00	8,459.38
Indonesia	Jakarta	Indonesia	18,900,000.00	5,100.00	3,705.88
Japan	Tokyo	Japan	32,450,000.00	8,014.00	4,049.16
Japan	Osaka-Kobe-Kyoto	Japan	17,375,000.00	6,930.00	2,507.22
Mexico	Mexico City	Mexico	20,450,000.00	7,346.00	2,783.83
Pakistan	Karachi	Pakistan	11,800,000.00	1,100.00	10,727.27
Philippines	Manila	Philippines	16,300,000.00	2,521.00	6,465.69
Russia	Moscow	Russia	15,000,000.00	14,925.00	1,005.03
South Korea	Seoul	South Korea	20,550,000.00	5,076.00	4,048.46
United Kingdom	London	United Kingdom	12,875,000.00	11,391.00	1,130.28
United States	New York City	United States	19,750,000.00	17,884.00	1,104.34
United States	Los Angeles	United States	15,250,000.00	10,780.00	1,414.66