02 - Pandas: Basic operations on Series and DataFrames

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.


In [2]:
# redefining the example objects

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [3]:
countries.head()


Out[3]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam
4 United Kingdom 64.9 244820 London

The 'new' concepts

Elementwise-operations

Just like with numpy arrays, many operations are element-wise:


In [4]:
population / 100


Out[4]:
Germany           0.813
Belgium           0.113
France            0.643
United Kingdom    0.649
Netherlands       0.169
dtype: float64

In [5]:
countries['population'] / countries['area']


Out[5]:
0    0.000370
1    0.000096
2    0.000228
3    0.000407
4    0.000265
dtype: float64

In [6]:
np.log(countries['population'])


Out[6]:
0    2.424803
1    4.163560
2    4.398146
3    2.827314
4    4.172848
Name: population, dtype: float64

which can be added as a new column, as follows:


In [7]:
countries["log_population"] = np.log(countries['population'])

In [8]:
countries.columns


Out[8]:
Index(['country', 'population', 'area', 'capital', 'log_population'], dtype='object')

In [9]:
countries['population'] > 40


Out[9]:
0    False
1     True
2     True
3    False
4     True
Name: population, dtype: bool
REMEMBER: * When you have an operation which does NOT work element-wise or you have no idea how to do it directly in Pandas, use the **apply()** function * A typical use case is with a custom written or a **lambda** function

In [11]:
countries["capital"].apply(lambda x: len(x)) # in case you forgot the functionality: countries["capital"].str.len()


Out[11]:
0    8
1    5
2    6
3    9
4    6
Name: capital, dtype: int64

In [12]:
def population_annotater(population):
    """annotate as large or small"""
    if population > 50:
        return 'large'
    else:
        return 'small'

In [13]:
countries["population"].apply(population_annotater) # a custom user function


Out[13]:
0    small
1    large
2    large
3    small
4    large
Name: population, dtype: object

Aggregations (reductions)

Pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column).

The average population number:


In [14]:
population.mean()


Out[14]:
47.739999999999995

The minimum area:


In [15]:
countries['area'].min()


Out[15]:
30510

For dataframes, often only the numeric columns are included in the result:


In [16]:
countries.median()


Out[16]:
population            64.30000
area              244820.00000
log_population         4.16356
dtype: float64

Application on a real dataset

Reading in the titanic data set...


In [17]:
df = pd.read_csv("../data/titanic.csv")

Quick exploration first...


In [18]:
df.head()


Out[18]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [19]:
len(df)


Out[19]:
891

The available metadata of the titanic data set provides the following information:

VARIABLE DESCRIPTION
Survived Survival (0 = No; 1 = Yes)
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name Name
Sex Sex
Age Age
SibSp Number of Siblings/Spouses Aboard
Parch Number of Parents/Children Aboard
Ticket Ticket Number
Fare Passenger Fare
Cabin Cabin
Embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
EXERCISE:
  • What is the average age of the passengers?

In [20]:
df['Age'].mean()


Out[20]:
29.69911764705882
EXERCISE:
  • Plot the age distribution of the titanic passengers

In [21]:
df['Age'].hist() #bins=30, log=True


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efed36e9518>
EXERCISE:
  • What is the survival rate? (the relative number of people that survived)

Note: the 'Survived' column indicates whether someone survived (1) or not (0).

In [22]:
df['Survived'].sum() / len(df['Survived'])


Out[22]:
0.3838383838383838

In [23]:
df['Survived'].mean()


Out[23]:
0.3838383838383838
EXERCISE:
  • What is the maximum Fare? And the median?

In [24]:
df['Fare'].max()


Out[24]:
512.3292

In [25]:
df['Fare'].median()


Out[25]:
14.4542
EXERCISE:
  • Calculate the 75th percentile (`quantile`) of the Fare price (Tip: look in the docstring how to specify the percentile)

In [26]:
df['Fare'].quantile(0.75)


Out[26]:
31.0
EXERCISE:
  • Calculate the normalized Fares (normalized relative to its mean)

In [27]:
df['Fare'] / df['Fare'].mean()


Out[27]:
0      0.225126
1      2.213478
2      0.246086
3      1.648853
4      0.249967
5      0.262646
6      1.610426
7      0.654418
8      0.345709
9      0.933754
10     0.518566
11     0.824426
12     0.249967
13     0.971146
14     0.243887
15     0.496829
16     0.904385
17     0.403674
18     0.558933
19     0.224350
20     0.807348
21     0.403674
22     0.249321
23     1.102340
24     0.654418
25     0.974640
26     0.224350
27     8.166635
28     0.244664
29     0.245179
         ...   
861    0.357096
862    0.805149
863    2.159656
864    0.403674
865    0.403674
866    0.430326
867    1.567988
868    0.294993
869    0.345709
870    0.245179
871    1.631905
872    0.155259
873    0.279467
874    0.745244
875    0.224350
876    0.305730
877    0.245179
878    0.245179
879    2.582218
880    0.807348
881    0.245179
882    0.326563
883    0.326044
884    0.218915
885    0.904385
886    0.403674
887    0.931555
888    0.728166
889    0.931555
890    0.240652
Name: Fare, Length: 891, dtype: float64
EXERCISE:
  • Calculate the log of the Fares, and add this as a new column ('Fare_log') to the DataFrame.

In [28]:
np.log(df['Fare'])


/home/joris/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log
  """Entry point for launching an IPython kernel.
Out[28]:
0      1.981001
1      4.266662
2      2.070022
3      3.972177
4      2.085672
5      2.135148
6      3.948596
7      3.048088
8      2.409941
9      3.403555
10     2.815409
11     3.279030
12     2.085672
13     3.442819
14     2.061048
15     2.772589
16     3.371597
17     2.564949
18     2.890372
19     1.977547
20     3.258097
21     2.564949
22     2.083085
23     3.569533
24     3.048088
25     3.446410
26     1.977547
27     5.572154
28     2.064226
29     2.066331
         ...   
861    2.442347
862    3.255370
863    4.242046
864    2.564949
865    2.564949
866    2.628884
867    3.921890
868    2.251292
869    2.409941
870    2.066331
871    3.961845
872    1.609438
873    2.197225
874    3.178054
875    1.977547
876    2.287045
877    2.066331
878    2.066331
879    4.420746
880    3.258097
881    2.066331
882    2.352964
883    2.351375
884    1.953028
885    3.371597
886    2.564949
887    3.401197
888    3.154870
889    3.401197
890    2.047693
Name: Fare, Length: 891, dtype: float64

In [29]:
df['Fare_log'] = np.log(df['Fare'])
df.head()


/home/joris/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log
  """Entry point for launching an IPython kernel.
Out[29]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Fare_log
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1.981001
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 4.266662
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 2.070022
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 3.972177
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 2.085672

Acknowledgement

This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).