We use the dataset found at https://github.com/mkcor/data-wrangling/blob/master/data/tidy_who.csv (see the notebook at the root of that repo for the generation of this dataset).
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('../data/tidy_who.csv')
... just works. read_csv()
comes with many convenient arguments, such as skiprows
, nrows
, na_values
, etc. Note that, alternatively, we could have run df = pd.read_csv('https://raw.githubusercontent.com/mkcor/data-wrangling/master/data/tidy_who.csv')
.
In [3]:
df.head()
Out[3]:
In [4]:
df.shape
Out[4]:
In [5]:
df.sample(10)
Out[5]:
In [6]:
df.describe()
Out[6]:
In [7]:
df['g_whoregion'].unique()
Out[7]:
In [8]:
df['country'].nunique()
Out[8]:
We are already familiar with column selection. The [ ]
syntax is the most basic way of indexing.
In [9]:
df['country'].head(3)
Out[9]:
Columns can also be accessed as attributes (as long as they have a valid Python name).
In [10]:
df.country[1000:1003]
Out[10]:
We can select elements of a DataFrame either by label (with the .loc
attribute) or by position (with the .iloc
attribute). Row and column indices take the usual order (first and second place, respectively).
In [ ]:
df.loc[0, 'country']
In [11]:
df.loc[df.shape[0] - 1, 'country']
Out[11]:
In [12]:
df.iloc[0, 0]
Out[12]:
In [13]:
df.iloc[df.shape[0] - 1, 0]
Out[13]:
Slicing works too.
In [14]:
df.loc[:5, 'country']
Out[14]:
Label indexing is more natural than positional indexing (think of a function call, where keyword arguments are easier to work with than positional arguments).
In [15]:
df.loc[:5, 'country':'type']
Out[15]:
In [16]:
df.iloc[:5, :5]
Out[16]:
Often we want to select data based on certain conditions.
In [17]:
cond = df.year < 1981
In [18]:
df[cond].shape
Out[18]:
In [19]:
df[cond & (df.country == 'Argentina') & (df.type == 'rel') & (df.sex == 'm')]
Out[19]:
In [20]:
gr_and_it = df.country.isin(['Greece', 'Italy'])
In [21]:
df[gr_and_it].tail()
Out[21]:
Subsets can be selected by callable functions (returning valid indexers).
The following function performs a selection by label (along country
and g_whoregion
).
In [22]:
lambda x: ['country', 'g_whoregion']
Out[22]:
So it can serve as a column indexer.
In [23]:
df.loc[:3, lambda x: ['country', 'g_whoregion']]
Out[23]:
The following function filters for data where the number of cases is greater than 100,000.
In [24]:
lambda x: x.cases > 100000
Out[24]:
So it can serve as a row indexer.
In [25]:
great = df.loc[lambda x: x.cases > 100000, :]
great
Out[25]:
In [26]:
df.cases.loc[lambda x: x > 100000]
Out[26]:
We may want to select or mask data while preserving the original shape.
In [27]:
great.where(great.country == 'India')
Out[27]:
In [28]:
great.mask(great.country == 'India')
Out[28]:
df
where country is Greece, age is at most 24, and year is 2000. Name it df1
.df1
DataFrame to a CSV file located in the data/
subdirectory. (Hint: The method is named to_csv
.)df2
. What do you notice about the index? (Feel free to fire up a Terminal and look at the CSV file.)
In [29]:
df1 = df[(df.country == 'Greece') & (df.year == 2000) & (df.age_range.isin([14, 1524]))]
In [30]:
df1.to_csv('../data/df1.csv')
In [31]:
df2 = pd.read_csv('../data/df1.csv')
In [32]:
df1.index
Out[32]:
In [33]:
df2.index
Out[33]:
We could specify that the first (unnamed) column should be used as the index (row labels).
In [34]:
pd.read_csv('../data/df1.csv', index_col=0)
Out[34]:
Remember we learnt set_index()
in the previous section? We also have reset_index()
at our disposal.
In [35]:
df1.reset_index()
Out[35]:
And we are back to a default index for this DataFrame. The original index is stored in its own column.
In [36]:
df1.reset_index().index
Out[36]: