Data exploration

We use the dataset found at https://github.com/mkcor/data-wrangling/blob/master/data/tidy_who.csv (see the notebook at the root of that repo for the generation of this dataset).


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/tidy_who.csv')

Loading data

... just works. read_csv() comes with many convenient arguments, such as skiprows, nrows, na_values, etc. Note that, alternatively, we could have run df = pd.read_csv('https://raw.githubusercontent.com/mkcor/data-wrangling/master/data/tidy_who.csv').


In [3]:
df.head()


Out[3]:
country g_whoregion year cases type sex age_range
0 Afghanistan EMR 1980 NaN sp m 14
1 Afghanistan EMR 1981 NaN sp m 14
2 Afghanistan EMR 1982 NaN sp m 14
3 Afghanistan EMR 1983 NaN sp m 14
4 Afghanistan EMR 1984 NaN sp m 14

In [4]:
df.shape


Out[4]:
(429744, 7)

In [5]:
df.sample(10)


Out[5]:
country g_whoregion year cases type sex age_range
179685 India SEA 1983 NaN sn f 2534
291809 Angola AFR 1997 NaN ep f 3544
5802 Saint Vincent and the Grenadines AMR 2005 0.0 sp m 14
66582 Palau WPR 2005 NaN sp f 1524
71576 France EUR 1994 NaN sp f 2534
122880 Algeria AFR 2004 NaN sn m 2534
174358 Republic of Korea WPR 1985 NaN sn f 1524
92852 Benin AFR 1988 NaN sp f 5564
343542 Sao Tome and Principe AFR 1981 NaN rel m 2534
214090 Tunisia EMR 1990 NaN sn f 65

In [6]:
df.describe()


Out[6]:
year cases age_range
count 429744.000000 81381.000000 429744.000000
mean 1997.571540 667.482496 2542.714286
std 10.407887 4490.566875 1990.957917
min 1980.000000 0.000000 14.000000
25% 1989.000000 3.000000 65.000000
50% 1998.000000 28.000000 2534.000000
75% 2007.000000 200.000000 4554.000000
max 2015.000000 250051.000000 5564.000000

In [7]:
df['g_whoregion'].unique()


Out[7]:
array(['EMR', 'EUR', 'AFR', 'WPR', 'AMR', 'SEA'], dtype=object)

In [8]:
df['country'].nunique()


Out[8]:
219

Selecting data

We are already familiar with column selection. The [ ] syntax is the most basic way of indexing.


In [9]:
df['country'].head(3)


Out[9]:
0    Afghanistan
1    Afghanistan
2    Afghanistan
Name: country, dtype: object

Columns can also be accessed as attributes (as long as they have a valid Python name).


In [10]:
df.country[1000:1003]


Out[10]:
1000    Brazil
1001    Brazil
1002    Brazil
Name: country, dtype: object

We can select elements of a DataFrame either by label (with the .loc attribute) or by position (with the .iloc attribute). Row and column indices take the usual order (first and second place, respectively).


In [ ]:
df.loc[0, 'country']

In [11]:
df.loc[df.shape[0] - 1, 'country']


Out[11]:
'Zimbabwe'

In [12]:
df.iloc[0, 0]


Out[12]:
'Afghanistan'

In [13]:
df.iloc[df.shape[0] - 1, 0]


Out[13]:
'Zimbabwe'

Slicing works too.


In [14]:
df.loc[:5, 'country']


Out[14]:
0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
5    Afghanistan
Name: country, dtype: object

Label indexing is more natural than positional indexing (think of a function call, where keyword arguments are easier to work with than positional arguments).


In [15]:
df.loc[:5, 'country':'type']


Out[15]:
country g_whoregion year cases type
0 Afghanistan EMR 1980 NaN sp
1 Afghanistan EMR 1981 NaN sp
2 Afghanistan EMR 1982 NaN sp
3 Afghanistan EMR 1983 NaN sp
4 Afghanistan EMR 1984 NaN sp
5 Afghanistan EMR 1985 NaN sp

In [16]:
df.iloc[:5, :5]


Out[16]:
country g_whoregion year cases type
0 Afghanistan EMR 1980 NaN sp
1 Afghanistan EMR 1981 NaN sp
2 Afghanistan EMR 1982 NaN sp
3 Afghanistan EMR 1983 NaN sp
4 Afghanistan EMR 1984 NaN sp

Often we want to select data based on certain conditions.


In [17]:
cond = df.year < 1981

In [18]:
df[cond].shape


Out[18]:
(11872, 7)

In [19]:
df[cond & (df.country == 'Argentina') & (df.type == 'rel') & (df.sex == 'm')]


Out[19]:
country g_whoregion year cases type sex age_range
322596 Argentina AMR 1980 NaN rel m 14
330270 Argentina AMR 1980 NaN rel m 1524
337944 Argentina AMR 1980 NaN rel m 2534
345618 Argentina AMR 1980 NaN rel m 3544
353292 Argentina AMR 1980 NaN rel m 4554
360966 Argentina AMR 1980 NaN rel m 5564
368640 Argentina AMR 1980 NaN rel m 65

In [20]:
gr_and_it = df.country.isin(['Greece', 'Italy'])

In [21]:
df[gr_and_it].tail()


Out[21]:
country g_whoregion year cases type sex age_range
425497 Italy EUR 2011 NaN rel f 65
425498 Italy EUR 2012 NaN rel f 65
425499 Italy EUR 2013 249.0 rel f 65
425500 Italy EUR 2014 NaN rel f 65
425501 Italy EUR 2015 280.0 rel f 65

Subsets can be selected by callable functions (returning valid indexers). The following function performs a selection by label (along country and g_whoregion).


In [22]:
lambda x: ['country', 'g_whoregion']


Out[22]:
<function __main__.<lambda>>

So it can serve as a column indexer.


In [23]:
df.loc[:3, lambda x: ['country', 'g_whoregion']]


Out[23]:
country g_whoregion
0 Afghanistan EMR
1 Afghanistan EMR
2 Afghanistan EMR
3 Afghanistan EMR

The following function filters for data where the number of cases is greater than 100,000.


In [24]:
lambda x: x.cases > 100000


Out[24]:
<function __main__.<lambda>>

So it can serve as a row indexer.


In [25]:
great = df.loc[lambda x: x.cases > 100000, :]
great


Out[25]:
country g_whoregion year cases type sex age_range
133665 India SEA 2007 250051.0 sn m 3544
187383 India SEA 2007 148811.0 sn f 3544
241101 India SEA 2007 105825.0 ep m 3544
294819 India SEA 2007 101015.0 ep f 3544
333196 India SEA 2014 180319.0 rel m 1524
333197 India SEA 2015 186771.0 rel m 1524
340870 India SEA 2014 190483.0 rel m 2534
340871 India SEA 2015 197298.0 rel m 2534
348544 India SEA 2014 199850.0 rel m 3544
348545 India SEA 2015 207000.0 rel m 3544
354519 China WPR 2013 100297.0 rel m 4554
354520 China WPR 2014 102352.0 rel m 4554
354521 China WPR 2015 103685.0 rel m 4554
356218 India SEA 2014 188106.0 rel m 4554
356219 India SEA 2015 194837.0 rel m 4554
362193 China WPR 2013 112558.0 rel m 5564
362194 China WPR 2014 108902.0 rel m 5564
362195 China WPR 2015 105403.0 rel m 5564
363892 India SEA 2014 148933.0 rel m 5564
363893 India SEA 2015 154262.0 rel m 5564
369867 China WPR 2013 124476.0 rel m 65
369868 China WPR 2014 123436.0 rel m 65
369869 China WPR 2015 125699.0 rel m 65
371566 India SEA 2014 102929.0 rel m 65
371567 India SEA 2015 106612.0 rel m 65
386914 India SEA 2014 149671.0 rel f 1524
386915 India SEA 2015 155026.0 rel f 1524
394588 India SEA 2014 127605.0 rel f 2534
394589 India SEA 2015 132171.0 rel f 2534

In [26]:
df.cases.loc[lambda x: x > 100000]


Out[26]:
133665    250051.0
187383    148811.0
241101    105825.0
294819    101015.0
333196    180319.0
333197    186771.0
340870    190483.0
340871    197298.0
348544    199850.0
348545    207000.0
354519    100297.0
354520    102352.0
354521    103685.0
356218    188106.0
356219    194837.0
362193    112558.0
362194    108902.0
362195    105403.0
363892    148933.0
363893    154262.0
369867    124476.0
369868    123436.0
369869    125699.0
371566    102929.0
371567    106612.0
386914    149671.0
386915    155026.0
394588    127605.0
394589    132171.0
Name: cases, dtype: float64

We may want to select or mask data while preserving the original shape.


In [27]:
great.where(great.country == 'India')


Out[27]:
country g_whoregion year cases type sex age_range
133665 India SEA 2007.0 250051.0 sn m 3544.0
187383 India SEA 2007.0 148811.0 sn f 3544.0
241101 India SEA 2007.0 105825.0 ep m 3544.0
294819 India SEA 2007.0 101015.0 ep f 3544.0
333196 India SEA 2014.0 180319.0 rel m 1524.0
333197 India SEA 2015.0 186771.0 rel m 1524.0
340870 India SEA 2014.0 190483.0 rel m 2534.0
340871 India SEA 2015.0 197298.0 rel m 2534.0
348544 India SEA 2014.0 199850.0 rel m 3544.0
348545 India SEA 2015.0 207000.0 rel m 3544.0
354519 NaN NaN NaN NaN NaN NaN NaN
354520 NaN NaN NaN NaN NaN NaN NaN
354521 NaN NaN NaN NaN NaN NaN NaN
356218 India SEA 2014.0 188106.0 rel m 4554.0
356219 India SEA 2015.0 194837.0 rel m 4554.0
362193 NaN NaN NaN NaN NaN NaN NaN
362194 NaN NaN NaN NaN NaN NaN NaN
362195 NaN NaN NaN NaN NaN NaN NaN
363892 India SEA 2014.0 148933.0 rel m 5564.0
363893 India SEA 2015.0 154262.0 rel m 5564.0
369867 NaN NaN NaN NaN NaN NaN NaN
369868 NaN NaN NaN NaN NaN NaN NaN
369869 NaN NaN NaN NaN NaN NaN NaN
371566 India SEA 2014.0 102929.0 rel m 65.0
371567 India SEA 2015.0 106612.0 rel m 65.0
386914 India SEA 2014.0 149671.0 rel f 1524.0
386915 India SEA 2015.0 155026.0 rel f 1524.0
394588 India SEA 2014.0 127605.0 rel f 2534.0
394589 India SEA 2015.0 132171.0 rel f 2534.0

In [28]:
great.mask(great.country == 'India')


Out[28]:
country g_whoregion year cases type sex age_range
133665 NaN NaN NaN NaN NaN NaN NaN
187383 NaN NaN NaN NaN NaN NaN NaN
241101 NaN NaN NaN NaN NaN NaN NaN
294819 NaN NaN NaN NaN NaN NaN NaN
333196 NaN NaN NaN NaN NaN NaN NaN
333197 NaN NaN NaN NaN NaN NaN NaN
340870 NaN NaN NaN NaN NaN NaN NaN
340871 NaN NaN NaN NaN NaN NaN NaN
348544 NaN NaN NaN NaN NaN NaN NaN
348545 NaN NaN NaN NaN NaN NaN NaN
354519 China WPR 2013.0 100297.0 rel m 4554.0
354520 China WPR 2014.0 102352.0 rel m 4554.0
354521 China WPR 2015.0 103685.0 rel m 4554.0
356218 NaN NaN NaN NaN NaN NaN NaN
356219 NaN NaN NaN NaN NaN NaN NaN
362193 China WPR 2013.0 112558.0 rel m 5564.0
362194 China WPR 2014.0 108902.0 rel m 5564.0
362195 China WPR 2015.0 105403.0 rel m 5564.0
363892 NaN NaN NaN NaN NaN NaN NaN
363893 NaN NaN NaN NaN NaN NaN NaN
369867 China WPR 2013.0 124476.0 rel m 65.0
369868 China WPR 2014.0 123436.0 rel m 65.0
369869 China WPR 2015.0 125699.0 rel m 65.0
371566 NaN NaN NaN NaN NaN NaN NaN
371567 NaN NaN NaN NaN NaN NaN NaN
386914 NaN NaN NaN NaN NaN NaN NaN
386915 NaN NaN NaN NaN NaN NaN NaN
394588 NaN NaN NaN NaN NaN NaN NaN
394589 NaN NaN NaN NaN NaN NaN NaN

Hands-on exercises

  1. Select the rows of df where country is Greece, age is at most 24, and year is 2000. Name it df1.
  2. Write the df1 DataFrame to a CSV file located in the data/ subdirectory. (Hint: The method is named to_csv.)
  3. Read this CSV file into a DataFrame named df2. What do you notice about the index? (Feel free to fire up a Terminal and look at the CSV file.)

In [29]:
df1 = df[(df.country == 'Greece') & (df.year == 2000) & (df.age_range.isin([14, 1524]))]

In [30]:
df1.to_csv('../data/df1.csv')

In [31]:
df2 = pd.read_csv('../data/df1.csv')

Indexing


In [32]:
df1.index


Out[32]:
Int64Index([  2768,  10442,  56486,  64160, 110204, 117878, 163922, 171596,
            217640, 225314, 271358, 279032, 325076, 332750, 378794, 386468],
           dtype='int64')

In [33]:
df2.index


Out[33]:
RangeIndex(start=0, stop=16, step=1)

We could specify that the first (unnamed) column should be used as the index (row labels).


In [34]:
pd.read_csv('../data/df1.csv', index_col=0)


Out[34]:
country g_whoregion year cases type sex age_range
2768 Greece EUR 2000 1.0 sp m 14
10442 Greece EUR 2000 10.0 sp m 1524
56486 Greece EUR 2000 0.0 sp f 14
64160 Greece EUR 2000 2.0 sp f 1524
110204 Greece EUR 2000 NaN sn m 14
117878 Greece EUR 2000 NaN sn m 1524
163922 Greece EUR 2000 NaN sn f 14
171596 Greece EUR 2000 NaN sn f 1524
217640 Greece EUR 2000 NaN ep m 14
225314 Greece EUR 2000 NaN ep m 1524
271358 Greece EUR 2000 NaN ep f 14
279032 Greece EUR 2000 NaN ep f 1524
325076 Greece EUR 2000 NaN rel m 14
332750 Greece EUR 2000 NaN rel m 1524
378794 Greece EUR 2000 NaN rel f 14
386468 Greece EUR 2000 NaN rel f 1524

Remember we learnt set_index() in the previous section? We also have reset_index() at our disposal.


In [35]:
df1.reset_index()


Out[35]:
index country g_whoregion year cases type sex age_range
0 2768 Greece EUR 2000 1.0 sp m 14
1 10442 Greece EUR 2000 10.0 sp m 1524
2 56486 Greece EUR 2000 0.0 sp f 14
3 64160 Greece EUR 2000 2.0 sp f 1524
4 110204 Greece EUR 2000 NaN sn m 14
5 117878 Greece EUR 2000 NaN sn m 1524
6 163922 Greece EUR 2000 NaN sn f 14
7 171596 Greece EUR 2000 NaN sn f 1524
8 217640 Greece EUR 2000 NaN ep m 14
9 225314 Greece EUR 2000 NaN ep m 1524
10 271358 Greece EUR 2000 NaN ep f 14
11 279032 Greece EUR 2000 NaN ep f 1524
12 325076 Greece EUR 2000 NaN rel m 14
13 332750 Greece EUR 2000 NaN rel m 1524
14 378794 Greece EUR 2000 NaN rel f 14
15 386468 Greece EUR 2000 NaN rel f 1524

And we are back to a default index for this DataFrame. The original index is stored in its own column.


In [36]:
df1.reset_index().index


Out[36]:
RangeIndex(start=0, stop=16, step=1)

Hands-on exercises

  1. Write the df1 DataFrame with a default index to another CSV file.
  2. Read this other CSV file into a DataFrame, setting its index to be the original index.

Reference