In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Let's say hello to Pandas. The convention is use pd in the import.


In [2]:
import pandas as pd

Pandas Series Object

Array/list, one dimensional object. The easiest way is to initialize it with a list.


In [3]:
cities = ['London', 'New York', 'Berlin', 'Toronto']

In [4]:
s =pd.Series(cities)
s


Out[4]:
0      London
1    New York
2      Berlin
3     Toronto
dtype: object

We see that we have the values in the list, but also and index, in this case going from 0 to 3. That is the default index.


In [5]:
s.index


Out[5]:
Int64Index([0, 1, 2, 3], dtype=int64)

In [6]:
s.values


Out[6]:
array(['London', 'New York', 'Berlin', 'Toronto'], dtype=object)

We can reference the values by using the appropiate index. In this case there is not much difference with a numpy array.


In [7]:
s[1]


Out[7]:
'New York'

You can assign you own index. Which brings the power to the structure. Later it will be clear why.


In [8]:
s = pd.Series(cities, index = ['A', 'B', 'C', 'D'])
s


Out[8]:
A      London
B    New York
C      Berlin
D     Toronto
dtype: object

In this case we have used labels A to D. And we can now reference the data in the Series by using the label.


In [9]:
s['D']


Out[9]:
'Toronto'

Let us try to look at a more meaningful series. The population of Bogotá (the capital of Colombia) as evolving in time. Data taken from wikipedia. In this case it makes sense that the label is the year, and the value is the corresponding population. Then we can easily retrieve by year. Previously we used a list, but we can also use a dictionary to initialize a Series Object. In this case the keys will be the index, and the value will be the population in that year:


In [41]:
population_bogota = {1800:21964, 
                     1912:121257, 
                     1951:715250, 
                     1964:1697311, 
                     1973:2855065, 
                     1985:4236490, 
                     1999:6276428, 
                     2012:7571345}

In [42]:
series_bogota = pd.Series(population_bogota)
series_bogota


Out[42]:
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
dtype: int64

When you are working with different series it may be useful to include some meta-data on the series. Like the name of the series itself, and of the index.


In [43]:
series_bogota.name = 'Bogota population'
series_bogota.index.name = 'year'

In [44]:
series_bogota


Out[44]:
year
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

In [45]:
series_bogota.index


Out[45]:
Int64Index([1800, 1912, 1951, 1964, 1973, 1985, 1999, 2012], dtype=int64)

In [46]:
series_bogota.values


Out[46]:
array([  21964,  121257,  715250, 1697311, 2855065, 4236490, 6276428,
       7571345])

We can obviously reference specific values by using the index


In [47]:
series_bogota[1800]


Out[47]:
21964

Note that the values in the series are numpy arrays. This is part of what makes pandas fast and can be very useful when working with other libraries.


In [48]:
type(series_bogota.values)


Out[48]:
numpy.ndarray

The index, on the other hand is a pandas Object. There is a hierarchy of Index objects that includes specific types for time indexes, hierichical index and other types of indexes.


In [49]:
type(series_bogota.index)


Out[49]:
pandas.core.index.Int64Index

We can query som information based on values. When did the population of bogota went above 1.000.000


In [50]:
series_bogota[series_bogota > 1000000]


Out[50]:
year
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

I need the population from the 60's on


In [51]:
series_bogota[series_bogota.index > 1960]


Out[51]:
year
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

What is actually going on with this way of querying information? Let us see what the bit inside the brackets yields.


In [52]:
series_bogota.index > 1965


Out[52]:
array([False, False, False, False,  True,  True,  True,  True], dtype=bool)

This means that we can also query using an array of booleans, perhaps handy if we have complex programatic conditions.


In [53]:
series_bogota[[True, False, True]]


Out[53]:
year
1800     21964
1951    715250
Name: Bogota population, dtype: int64

You can also query based on the index value itself.


In [54]:
series_bogota[[1973, 2012, 2011]]


Out[54]:
1973    2855065
2012    7571345
2011        NaN
Name: Bogota population, dtype: float64

Note that we get NaN, for 2011, because this label is not in the original index. This is part of the automatic handling of missing values, and will turn out to be extremely valuable when working with real data.

Lets add a ficticious value for 2011.


In [55]:
series_bogota = series_bogota.set_value(2011, 6500000)
series_bogota


Out[55]:
year
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
2011    6500000
Name: Bogota population, dtype: int64

You can apply functions to each element of the series. Remember, this is close to numpy.


In [56]:
millions = lambda x: x/1000000.0
series_bogota.apply(millions)


Out[56]:
year
1800    0.021964
1912    0.121257
1951    0.715250
1964    1.697311
1973    2.855065
1985    4.236490
1999    6.276428
2012    7.571345
2011    6.500000
Name: Bogota population, dtype: float64

Finally, sometimes we need to have a quick idea of what the data looks like. For this we can use the function describe.


In [57]:
series_bogota.describe()


Out[57]:
count          9.000000
mean     3332790.000000
std      2926351.983191
min        21964.000000
25%       715250.000000
50%      2855065.000000
75%      6276428.000000
max      7571345.000000
dtype: float64

Or even cooler, plots


In [58]:
pd.Series.plot(series_bogota, kind='bar')


Out[58]:
<matplotlib.axes.AxesSubplot at 0x10993e150>

In [59]:
series_bogota.sort()
pd.Series.plot(series_bogota/1000000.0, kind='bar')


Out[59]:
<matplotlib.axes.AxesSubplot at 0x109a4b350>

What about population change?


In [60]:
series_bogota.pct_change()


Out[60]:
year
1800         NaN
1912    4.520716
1951    4.898629
1964    1.373032
1973    0.682111
1985    0.483851
1999    0.481516
2011    0.035621
2012    0.164822
Name: Bogota population, dtype: float64

In [61]:
series_bogota.pct_change().plot(kind='bar')
plt.ylabel('percentage change')


Out[61]:
<matplotlib.text.Text at 0x109d0ccd0>

Note that there is a lot of missing data. Is there a way to solve this quickly?

Reindex + Interpolation


In [62]:
series_bogota = series_bogota.reindex(range(1800, 2014))

In [63]:
series_bogota


Out[63]:
1800    21964
1801      NaN
1802      NaN
1803      NaN
1804      NaN
1805      NaN
1806      NaN
1807      NaN
1808      NaN
1809      NaN
1810      NaN
1811      NaN
1812      NaN
1813      NaN
1814      NaN
...
1999    6276428
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011    6500000
2012    7571345
2013        NaN
Name: Bogota population, Length: 214, dtype: float64

In [64]:
series_bogota = series_bogota.interpolate('values')
series_bogota


Out[64]:
1800    21964.000000
1801    22850.544643
1802    23737.089286
1803    24623.633929
1804    25510.178571
1805    26396.723214
1806    27283.267857
1807    28169.812500
1808    29056.357143
1809    29942.901786
1810    30829.446429
1811    31715.991071
1812    32602.535714
1813    33489.080357
1814    34375.625000
...
1999    6276428
2000    6295059
2001    6313690
2002    6332321
2003    6350952
2004    6369583
2005    6388214
2006    6406845
2007    6425476
2008    6444107
2009    6462738
2010    6481369
2011    6500000
2012    7571345
2013    7571345
Name: Bogota population, Length: 214, dtype: float64

In [65]:
(series_bogota/1000000.0).plot()


Out[65]:
<matplotlib.axes.AxesSubplot at 0x109dc0750>

In [66]:
(series_bogota).pct_change().plot()


Out[66]:
<matplotlib.axes.AxesSubplot at 0x109dc6e90>

Finally, let us look at adding series.


In [67]:
population_cali = {1809: 7546, 1938:101883, 1973:991549, 1985:1429026, 2013:2319684}

In [68]:
series_cali = pd.Series(population_cali)
series_cali.name = 'Cali (Colombia) Population'
series_cali.index.name = 'year'
series_cali


Out[68]:
year
1809       7546
1938     101883
1973     991549
1985    1429026
2013    2319684
Name: Cali (Colombia) Population, dtype: int64

In [69]:
(series_cali + series_bogota)


Out[69]:
1800             NaN
1801             NaN
1802             NaN
1803             NaN
1804             NaN
1805             NaN
1806             NaN
1807             NaN
1808             NaN
1809    37488.901786
1810             NaN
1811             NaN
1812             NaN
1813             NaN
1814             NaN
...
1999        NaN
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011        NaN
2012        NaN
2013    9891029
Length: 214, dtype: float64

Probably the most meaningful way to add the series is if they share an index. So I can reindex cali with the index of Bogotá


In [70]:
series_cali


Out[70]:
year
1809       7546
1938     101883
1973     991549
1985    1429026
2013    2319684
Name: Cali (Colombia) Population, dtype: int64

In [71]:
series_cali = series_cali.reindex(series_bogota.index)
series_cali


Out[71]:
1800     NaN
1801     NaN
1802     NaN
1803     NaN
1804     NaN
1805     NaN
1806     NaN
1807     NaN
1808     NaN
1809    7546
1810     NaN
1811     NaN
1812     NaN
1813     NaN
1814     NaN
...
1999        NaN
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011        NaN
2012        NaN
2013    2319684
Name: Cali (Colombia) Population, Length: 214, dtype: float64

In [72]:
len(series_bogota) == len(series_cali)


Out[72]:
True

In [73]:
np.alltrue(series_bogota.index == series_cali.index)


Out[73]:
True

In [74]:
series_cali = series_cali.interpolate('values')
series_cali


Out[74]:
1800             NaN
1801             NaN
1802             NaN
1803             NaN
1804             NaN
1805             NaN
1806             NaN
1807             NaN
1808             NaN
1809     7546.000000
1810     8277.294574
1811     9008.589147
1812     9739.883721
1813    10471.178295
1814    11202.472868
...
1999    1874355.000000
2000    1906164.214286
2001    1937973.428571
2002    1969782.642857
2003    2001591.857143
2004    2033401.071429
2005    2065210.285714
2006    2097019.500000
2007    2128828.714286
2008    2160637.928571
2009    2192447.142857
2010    2224256.357143
2011    2256065.571429
2012    2287874.785714
2013    2319684.000000
Name: Cali (Colombia) Population, Length: 214, dtype: float64

In [75]:
series_cali = series_cali.fillna(0.0)
series_cali


Out[75]:
1800        0.000000
1801        0.000000
1802        0.000000
1803        0.000000
1804        0.000000
1805        0.000000
1806        0.000000
1807        0.000000
1808        0.000000
1809     7546.000000
1810     8277.294574
1811     9008.589147
1812     9739.883721
1813    10471.178295
1814    11202.472868
...
1999    1874355.000000
2000    1906164.214286
2001    1937973.428571
2002    1969782.642857
2003    2001591.857143
2004    2033401.071429
2005    2065210.285714
2006    2097019.500000
2007    2128828.714286
2008    2160637.928571
2009    2192447.142857
2010    2224256.357143
2011    2256065.571429
2012    2287874.785714
2013    2319684.000000
Name: Cali (Colombia) Population, Length: 214, dtype: float64

Now finally add the two series


In [76]:
series_bogota + series_cali


Out[76]:
1800    21964.000000
1801    22850.544643
1802    23737.089286
1803    24623.633929
1804    25510.178571
1805    26396.723214
1806    27283.267857
1807    28169.812500
1808    29056.357143
1809    37488.901786
1810    39106.741002
1811    40724.580219
1812    42342.419435
1813    43960.258652
1814    45578.097868
...
1999    8150783.000000
2000    8201223.214286
2001    8251663.428571
2002    8302103.642857
2003    8352543.857143
2004    8402984.071429
2005    8453424.285714
2006    8503864.500000
2007    8554304.714286
2008    8604744.928571
2009    8655185.142857
2010    8705625.357143
2011    8756065.571429
2012    9859219.785714
2013    9891029.000000
Length: 214, dtype: float64

In [77]:
series_bogota.plot(label='Bogota population')
series_cali.plot(label='Cali population')
plt.legend(loc='best')


Out[77]:
<matplotlib.legend.Legend at 0x109fd1950>

I do not vouch for the statistics here, it is just an example of the tool.

Pandas Data Frame


In [78]:
df = pd.DataFrame({'bogotá':series_bogota, 'cali':series_cali})

In [79]:
df


Out[79]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 214 entries, 1800 to 2013
Data columns (total 2 columns):
bogotá    214  non-null values
cali      214  non-null values
dtypes: float64(2)

In [80]:
df.head()


Out[80]:
bogotá cali
1800 21964.000000 0
1801 22850.544643 0
1802 23737.089286 0
1803 24623.633929 0
1804 25510.178571 0

In [81]:
pd.options.display.float_format = '{:20,.2f}'.format
df.index.name = 'year'

In [82]:
df.tail()


Out[82]:
bogotá cali
year
2009 6,462,738.00 2,192,447.14
2010 6,481,369.00 2,224,256.36
2011 6,500,000.00 2,256,065.57
2012 7,571,345.00 2,287,874.79
2013 7,571,345.00 2,319,684.00

Elements of DataFrame


In [83]:
df.index


Out[83]:
Int64Index([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013], dtype=int64)

In [84]:
df.columns


Out[84]:
Index([u'bogotá', u'cali'], dtype=object)

In [85]:
np.shape(df.values)


Out[85]:
(214, 2)

In [86]:
df['population difference'] = df['bogotá'] - df['cali']

In [88]:
df.tail()


Out[88]:
bogotá cali population difference
year
2009 6,462,738.00 2,192,447.14 4,270,290.86
2010 6,481,369.00 2,224,256.36 4,257,112.64
2011 6,500,000.00 2,256,065.57 4,243,934.43
2012 7,571,345.00 2,287,874.79 5,283,470.21
2013 7,571,345.00 2,319,684.00 5,251,661.00

In [87]:
df.describe()


Out[87]:
bogotá cali population difference
count 214.00 214.00 214.00
mean 1,269,633.63 440,945.03 828,688.61
std 2,057,142.86 658,404.11 1,404,249.56
min 21,964.00 0.00 21,964.00
25% 69,172.50 39,905.78 29,266.72
50% 116,381.00 78,847.22 37,533.78
75% 1,376,252.60 654,746.87 721,505.72
max 7,571,345.00 2,319,684.00 5,283,470.21

In [89]:
df['population difference'].tail()


Out[89]:
year
2009           4,270,290.86
2010           4,257,112.64
2011           4,243,934.43
2012           5,283,470.21
2013           5,251,661.00
Name: population difference, dtype: float64

In [90]:
df[df.index >1990]


Out[90]:
bogotá cali population difference
year
1991 5,110,749.14 1,619,881.29 3,490,867.86
1992 5,256,459.00 1,651,690.50 3,604,768.50
1993 5,402,168.86 1,683,499.71 3,718,669.14
1994 5,547,878.71 1,715,308.93 3,832,569.79
1995 5,693,588.57 1,747,118.14 3,946,470.43
1996 5,839,298.43 1,778,927.36 4,060,371.07
1997 5,985,008.29 1,810,736.57 4,174,271.71
1998 6,130,718.14 1,842,545.79 4,288,172.36
1999 6,276,428.00 1,874,355.00 4,402,073.00
2000 6,295,059.00 1,906,164.21 4,388,894.79
2001 6,313,690.00 1,937,973.43 4,375,716.57
2002 6,332,321.00 1,969,782.64 4,362,538.36
2003 6,350,952.00 2,001,591.86 4,349,360.14
2004 6,369,583.00 2,033,401.07 4,336,181.93
2005 6,388,214.00 2,065,210.29 4,323,003.71
2006 6,406,845.00 2,097,019.50 4,309,825.50
2007 6,425,476.00 2,128,828.71 4,296,647.29
2008 6,444,107.00 2,160,637.93 4,283,469.07
2009 6,462,738.00 2,192,447.14 4,270,290.86
2010 6,481,369.00 2,224,256.36 4,257,112.64
2011 6,500,000.00 2,256,065.57 4,243,934.43
2012 7,571,345.00 2,287,874.79 5,283,470.21
2013 7,571,345.00 2,319,684.00 5,251,661.00

In [91]:
df.plot()


Out[91]:
<matplotlib.axes.AxesSubplot at 0x10a2c04d0>

In [92]:
!ls *.csv


cali_and_bogota.csv metropolitan.csv    patients.csv

In [93]:
df.to_csv('cali_and_bogota.csv')

In [94]:
!head cali_and_bogota.csv


year,bogotá,cali,population difference
1800,21964.0,0.0,21964.0
1801,22850.54464285714,0.0,22850.54464285714
1802,23737.089285714286,0.0,23737.089285714286
1803,24623.633928571428,0.0,24623.633928571428
1804,25510.178571428572,0.0,25510.178571428572
1805,26396.723214285714,0.0,26396.723214285714
1806,27283.267857142855,0.0,27283.267857142855
1807,28169.8125,0.0,28169.8125
1808,29056.357142857145,0.0,29056.357142857145

There are many ways to initialize DataFrames

Let's be more cosmopolitan (if we still have 7 more minutes)


In [95]:
!cat metropolitan.csv


Metropolitan area,Country,Population,Area
Tokyo,Japan,32450000,8014
Seoul,South Korea,20550000,5076
Mexico City,Mexico,20450000,7346
New York City,United States,19750000,17884
Mumbai-Bombay,India,19200000,2350
Jakarta,Indonesia,18900000,5100
São Paulo,Brazil,18850000,8479
New Delhi,India,18600000,3182
Osaka-Kobe-Kyoto,Japan,17375000,6930
Shanghai,China,16650000,5177
Manila,Philippines,16300000,2521
Hong Kong,Hong Kong/China,15800000,3051
Los Angeles,United States,15250000,10780
Kolkata,India,15100000,1785
Moscow,Russia,15000000,14925
Cairo,Egypt,14450000,1600
Buenos Aires,Argentina,13170000,10888
London,United Kingdom,12875000,11391
Beijing,China,12500000,6562
Karachi,Pakistan,11800000,1100

In [96]:
df = pd.read_csv('metropolitan.csv')
df


Out[96]:
Metropolitan area Country Population Area
0 Tokyo Japan 32450000 8014
1 Seoul South Korea 20550000 5076
2 Mexico City Mexico 20450000 7346
3 New York City United States 19750000 17884
4 Mumbai-Bombay India 19200000 2350
5 Jakarta Indonesia 18900000 5100
6 São Paulo Brazil 18850000 8479
7 New Delhi India 18600000 3182
8 Osaka-Kobe-Kyoto Japan 17375000 6930
9 Shanghai China 16650000 5177
10 Manila Philippines 16300000 2521
11 Hong Kong Hong Kong/China 15800000 3051
12 Los Angeles United States 15250000 10780
13 Kolkata India 15100000 1785
14 Moscow Russia 15000000 14925
15 Cairo Egypt 14450000 1600
16 Buenos Aires Argentina 13170000 10888
17 London United Kingdom 12875000 11391
18 Beijing China 12500000 6562
19 Karachi Pakistan 11800000 1100

In [97]:
df = pd.read_csv('metropolitan.csv', index_col='Metropolitan area')
df


Out[97]:
Country Population Area
Metropolitan area
Tokyo Japan 32450000 8014
Seoul South Korea 20550000 5076
Mexico City Mexico 20450000 7346
New York City United States 19750000 17884
Mumbai-Bombay India 19200000 2350
Jakarta Indonesia 18900000 5100
São Paulo Brazil 18850000 8479
New Delhi India 18600000 3182
Osaka-Kobe-Kyoto Japan 17375000 6930
Shanghai China 16650000 5177
Manila Philippines 16300000 2521
Hong Kong Hong Kong/China 15800000 3051
Los Angeles United States 15250000 10780
Kolkata India 15100000 1785
Moscow Russia 15000000 14925
Cairo Egypt 14450000 1600
Buenos Aires Argentina 13170000 10888
London United Kingdom 12875000 11391
Beijing China 12500000 6562
Karachi Pakistan 11800000 1100

In [98]:
df.describe()


Out[98]:
Population Area
count 20.00 20.00
mean 17,251,000.00 6,707.05
std 4,485,562.40 4,619.33
min 11,800,000.00 1,100.00
25% 14,862,500.00 2,918.50
50% 16,475,000.00 5,869.50
75% 18,975,000.00 9,054.25
max 32,450,000.00 17,884.00

In [99]:
df.dtypes


Out[99]:
Country       object
Population     int64
Area           int64
dtype: object

In [100]:
df['Population']= df['Population'].apply(float)
df['Area']= df['Area'].apply(float)

In [101]:
df.dtypes


Out[101]:
Country        object
Population    float64
Area          float64
dtype: object

Three biggest metro areas


In [102]:
df.sort('Population')
df.head(3)


Out[102]:
Country Population Area
Metropolitan area
Tokyo Japan 32,450,000.00 8,014.00
Seoul South Korea 20,550,000.00 5,076.00
Mexico City Mexico 20,450,000.00 7,346.00

Area largest than 10.0000 km2


In [103]:
df[df['Area']> 10000]


Out[103]:
Country Population Area
Metropolitan area
New York City United States 19,750,000.00 17,884.00
Los Angeles United States 15,250,000.00 10,780.00
Moscow Russia 15,000,000.00 14,925.00
Buenos Aires Argentina 13,170,000.00 10,888.00
London United Kingdom 12,875,000.00 11,391.00

In [104]:
df[(df['Area']> 10000) & (df['Population'] > 20000)]


Out[104]:
Country Population Area
Metropolitan area
New York City United States 19,750,000.00 17,884.00
Los Angeles United States 15,250,000.00 10,780.00
Moscow Russia 15,000,000.00 14,925.00
Buenos Aires Argentina 13,170,000.00 10,888.00
London United Kingdom 12,875,000.00 11,391.00

In [105]:
df['Density'] = df['Population']/df['Area']
df.sort('Density', ascending=False)['Density'].plot(kind='bar')


Out[105]:
<matplotlib.axes.AxesSubplot at 0x109faac10>

In [106]:
df.groupby('Country').head()


Out[106]:
Country Population Area Density
Country Metropolitan area
Argentina Buenos Aires Argentina 13,170,000.00 10,888.00 1,209.59
Brazil São Paulo Brazil 18,850,000.00 8,479.00 2,223.14
China Shanghai China 16,650,000.00 5,177.00 3,216.15
Beijing China 12,500,000.00 6,562.00 1,904.91
Egypt Cairo Egypt 14,450,000.00 1,600.00 9,031.25
Hong Kong/China Hong Kong Hong Kong/China 15,800,000.00 3,051.00 5,178.63
India Mumbai-Bombay India 19,200,000.00 2,350.00 8,170.21
New Delhi India 18,600,000.00 3,182.00 5,845.38
Kolkata India 15,100,000.00 1,785.00 8,459.38
Indonesia Jakarta Indonesia 18,900,000.00 5,100.00 3,705.88
Japan Tokyo Japan 32,450,000.00 8,014.00 4,049.16
Osaka-Kobe-Kyoto Japan 17,375,000.00 6,930.00 2,507.22
Mexico Mexico City Mexico 20,450,000.00 7,346.00 2,783.83
Pakistan Karachi Pakistan 11,800,000.00 1,100.00 10,727.27
Philippines Manila Philippines 16,300,000.00 2,521.00 6,465.69
Russia Moscow Russia 15,000,000.00 14,925.00 1,005.03
South Korea Seoul South Korea 20,550,000.00 5,076.00 4,048.46
United Kingdom London United Kingdom 12,875,000.00 11,391.00 1,130.28
United States New York City United States 19,750,000.00 17,884.00 1,104.34
Los Angeles United States 15,250,000.00 10,780.00 1,414.66

In [107]:
type(df.groupby('Country'))


Out[107]:
pandas.core.groupby.DataFrameGroupBy

In [108]:
print df.groupby('Country').sum().sort('Density', ascending=False).head()


                      Population                 Area              Density
Country                                                                   
India              52,900,000.00             7,317.00            22,474.98
Pakistan           11,800,000.00             1,100.00            10,727.27
Egypt              14,450,000.00             1,600.00             9,031.25
Japan              49,825,000.00            14,944.00             6,556.38
Philippines        16,300,000.00             2,521.00             6,465.69

All the data in this notebook was taken from wikipedia.