Pandas

The pandas library can be installed with pip and is conventionally imported under the alias pd. This tutorial covers only the basics; for more information on data science in Python, please refer to other books, such as:

https://jakevdp.github.io/PythonDataScienceHandbook/


In [2]:
import pandas as pd

Now we would like to do a small data science exercise using pandas. First, we need a dataset. In the same folder, you will find a file named Titanic.csv. The data comes from The R Datasets Package: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html. Below is the description from the package:

Survival of passengers on the Titanic

Description

This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.

Format

A 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables. The variables and their levels are as follows:

No  Name      Levels
1   Class     1st, 2nd, 3rd, Crew
2   Sex       Male, Female
3   Age       Child, Adult
4   Survived  No, Yes

Details

The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts—from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class—are reflected in the survival rates for various classes of passenger.

These data were originally collected by the British Board of Trade in their investigation of the sinking. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.

Due in particular to the very successful film ‘Titanic’, the last years saw a rise in public interest in the Titanic. Very detailed data about the passengers is now available on the Internet, at sites such as Encyclopedia Titanica (https://www.encyclopedia-titanica.org/).
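
The Format section above describes a cross-tabulation of four variables with 4, 2, 2 and 2 levels, so there are 4 x 2 x 2 x 2 = 32 possible combinations; the CSV used in this tutorial stores one row per combination together with a count (Freq). As a rough sketch in plain Python, the combinations can be enumerated as follows. The level names come from the table above, while the variable names and the ordering trick are only illustrative.

```python
from itertools import product

levels = {
    "Class": ["1st", "2nd", "3rd", "Crew"],
    "Sex": ["Male", "Female"],
    "Age": ["Child", "Adult"],
    "Survived": ["No", "Yes"],
}

# product() varies its LAST argument fastest, so the factors are listed in
# reverse to make Class change fastest, as in the listing used in this tutorial.
names = ["Survived", "Age", "Sex", "Class"]
cells = [dict(zip(names, combo))
         for combo in product(*(levels[name] for name in names))]

print(len(cells))   # number of level combinations
print(cells[0])     # first combination
print(cells[-1])    # last combination
```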

Start playing with our data

The file is a CSV file, which can be opened easily with pd.read_csv() (older pandas releases also offered pd.DataFrame.from_csv(), which was deprecated and then removed in pandas 1.0). The pandas package provides a DataFrame object; its documentation can be viewed by typing %pdoc pd.DataFrame. A DataFrame can be printed with print, or you can just type the variable name in a new cell for a prettier look.
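
A minimal sketch of loading CSV data with pd.read_csv, the general-purpose CSV reader in pandas. It uses a three-row inline sample instead of Titanic.csv so that it runs anywhere. The sample text is an assumption reconstructed from the listing shown in this tutorial (in particular, that the first, unnamed column holds the row labels); it is not copied from the file itself.

```python
import io

import pandas as pd

# Assumed CSV layout: an unnamed first column with row labels, then five fields.
sample = io.StringIO(
    ",Class,Sex,Age,Survived,Freq\n"
    "1,1st,Male,Child,No,0\n"
    "2,2nd,Male,Child,No,0\n"
    "3,3rd,Male,Child,No,35\n"
)

# index_col=0 tells pandas to use the first column as the row index.
df_sample = pd.read_csv(sample, index_col=0)

print(df_sample)
print(df_sample.shape)
```

Without index_col=0 the row labels would be read as an ordinary data column and pandas would add its own 0-based index.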

DataFrame

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure
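
Two points in this description are easier to see in code: a DataFrame behaves like a dict of Series, and arithmetic matches rows by label rather than by position. The sketch below uses two tiny made-up frames (left and right are hypothetical names, not part of the Titanic data).

```python
import pandas as pd

# Each dict key becomes a column; every column is a Series sharing the row index.
left = pd.DataFrame({"Freq": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"Freq": [10, 20, 30]}, index=["c", "b", "a"])

column = left["Freq"]   # dict-style lookup returns a Series
total = left + right    # rows are paired by label, not by position

print(type(column).__name__)
print(total)
```

Because the rows are aligned on their labels, the row "a" of the result adds 1 and 30, even though those values sit at opposite ends of the two frames.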


In [5]:
%pdoc pd.DataFrame

In [22]:
df = pd.read_csv('Titanic.csv', index_col=0)  # replaces DataFrame.from_csv, which was removed in pandas 1.0
print(df)


   Class     Sex    Age Survived  Freq
1    1st    Male  Child       No     0
2    2nd    Male  Child       No     0
3    3rd    Male  Child       No    35
4   Crew    Male  Child       No     0
5    1st  Female  Child       No     0
6    2nd  Female  Child       No     0
7    3rd  Female  Child       No    17
8   Crew  Female  Child       No     0
9    1st    Male  Adult       No   118
10   2nd    Male  Adult       No   154
11   3rd    Male  Adult       No   387
12  Crew    Male  Adult       No   670
13   1st  Female  Adult       No     4
14   2nd  Female  Adult       No    13
15   3rd  Female  Adult       No    89
16  Crew  Female  Adult       No     3
17   1st    Male  Child      Yes     5
18   2nd    Male  Child      Yes    11
19   3rd    Male  Child      Yes    13
20  Crew    Male  Child      Yes     0
21   1st  Female  Child      Yes     1
22   2nd  Female  Child      Yes    13
23   3rd  Female  Child      Yes    14
24  Crew  Female  Child      Yes     0
25   1st    Male  Adult      Yes    57
26   2nd    Male  Adult      Yes    14
27   3rd    Male  Adult      Yes    75
28  Crew    Male  Adult      Yes   192
29   1st  Female  Adult      Yes   140
30   2nd  Female  Adult      Yes    80
31   3rd  Female  Adult      Yes    76
32  Crew  Female  Adult      Yes    20

In [10]:
df


Out[10]:
   Class     Sex    Age Survived  Freq
1    1st    Male  Child       No     0
2    2nd    Male  Child       No     0
3    3rd    Male  Child       No    35
4   Crew    Male  Child       No     0
5    1st  Female  Child       No     0
6    2nd  Female  Child       No     0
7    3rd  Female  Child       No    17
8   Crew  Female  Child       No     0
9    1st    Male  Adult       No   118
10   2nd    Male  Adult       No   154
11   3rd    Male  Adult       No   387
12  Crew    Male  Adult       No   670
13   1st  Female  Adult       No     4
14   2nd  Female  Adult       No    13
15   3rd  Female  Adult       No    89
16  Crew  Female  Adult       No     3
17   1st    Male  Child      Yes     5
18   2nd    Male  Child      Yes    11
19   3rd    Male  Child      Yes    13
20  Crew    Male  Child      Yes     0
21   1st  Female  Child      Yes     1
22   2nd  Female  Child      Yes    13
23   3rd  Female  Child      Yes    14
24  Crew  Female  Child      Yes     0
25   1st    Male  Adult      Yes    57
26   2nd    Male  Adult      Yes    14
27   3rd    Male  Adult      Yes    75
28  Crew    Male  Adult      Yes   192
29   1st  Female  Adult      Yes   140
30   2nd  Female  Adult      Yes    80
31   3rd  Female  Adult      Yes    76
32  Crew  Female  Adult      Yes    20

In [14]:
print(df['Sex'])


1       Male
2       Male
3       Male
4       Male
5     Female
6     Female
7     Female
8     Female
9       Male
10      Male
11      Male
12      Male
13    Female
14    Female
15    Female
16    Female
17      Male
18      Male
19      Male
20      Male
21    Female
22    Female
23    Female
24    Female
25      Male
26      Male
27      Male
28      Male
29    Female
30    Female
31    Female
32    Female
Name: Sex, dtype: object
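
Indexing with a single column name, as in the cell above, returns a Series; indexing with a list of names returns a DataFrame. A small sketch on a two-row frame (demo is a hypothetical name; the values are rows 3 and 7 of the listing above):

```python
import pandas as pd

demo = pd.DataFrame(
    {"Class": ["3rd", "3rd"], "Sex": ["Male", "Female"], "Freq": [35, 17]},
    index=[3, 7],
)

one = demo["Sex"]             # single name: a Series
two = demo[["Sex", "Freq"]]   # list of names: a DataFrame with those columns

print(type(one).__name__, type(two).__name__)
print(two)
```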

In [17]:
dfsurvived = df[df.Survived == 'Yes']; dfsurvived


Out[17]:
   Class     Sex    Age Survived  Freq
17   1st    Male  Child      Yes     5
18   2nd    Male  Child      Yes    11
19   3rd    Male  Child      Yes    13
20  Crew    Male  Child      Yes     0
21   1st  Female  Child      Yes     1
22   2nd  Female  Child      Yes    13
23   3rd  Female  Child      Yes    14
24  Crew  Female  Child      Yes     0
25   1st    Male  Adult      Yes    57
26   2nd    Male  Adult      Yes    14
27   3rd    Male  Adult      Yes    75
28  Crew    Male  Adult      Yes   192
29   1st  Female  Adult      Yes   140
30   2nd  Female  Adult      Yes    80
31   3rd  Female  Adult      Yes    76
32  Crew  Female  Adult      Yes    20

In [27]:
dfNOsurvived = df[df.Survived == 'No']; dfNOsurvived


Out[27]:
   Class     Sex    Age Survived  Freq
1    1st    Male  Child       No     0
2    2nd    Male  Child       No     0
3    3rd    Male  Child       No    35
4   Crew    Male  Child       No     0
5    1st  Female  Child       No     0
6    2nd  Female  Child       No     0
7    3rd  Female  Child       No    17
8   Crew  Female  Child       No     0
9    1st    Male  Adult       No   118
10   2nd    Male  Adult       No   154
11   3rd    Male  Adult       No   387
12  Crew    Male  Adult       No   670
13   1st  Female  Adult       No     4
14   2nd  Female  Adult       No    13
15   3rd  Female  Adult       No    89
16  Crew  Female  Adult       No     3
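
The two cells above filter rows with a boolean mask: the comparison df.Survived == 'Yes' yields a Series of True/False values, and indexing the DataFrame with it keeps only the True rows. The sketch below shows the same idea on a four-row excerpt (rows 3, 7, 19 and 23 of the full table; excerpt is a hypothetical name) and how two conditions might be combined with &, where each condition needs its own parentheses.

```python
import pandas as pd

excerpt = pd.DataFrame(
    {
        "Sex": ["Male", "Female", "Male", "Female"],
        "Survived": ["No", "No", "Yes", "Yes"],
        "Freq": [35, 17, 13, 14],
    },
    index=[3, 7, 19, 23],
)

mask = excerpt["Survived"] == "Yes"   # boolean Series, one value per row
survived = excerpt[mask]              # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); the parentheses are required.
female_survivors = excerpt[(excerpt["Survived"] == "Yes") & (excerpt["Sex"] == "Female")]

print(mask.tolist())
print(survived)
print(female_survivors)
```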

Now we want to answer some questions:

  1. Higher survival rates in children?
  2. Higher survival rates in females?

To tackle these questions, we will use a group-by approach. The basic idea is to split the data into groups based on some value, apply a particular operation (often an aggregation) to the subset of data within each group, and then combine the results into an output DataFrame. The following cells apply this to each question:
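
To make the three steps concrete, here is a minimal sketch in plain Python, without pandas, that sums Freq per age group. The (Age, Freq) pairs are the eight 3rd-class rows of the table above; the variable names are only illustrative.

```python
from collections import defaultdict

# (Age, Freq) pairs for the 3rd-class rows: 3, 7, 11, 15, 19, 23, 27, 31
rows = [("Child", 35), ("Child", 17), ("Adult", 387), ("Adult", 89),
        ("Child", 13), ("Child", 14), ("Adult", 75), ("Adult", 76)]

# Split: collect the values that belong to each group key.
groups = defaultdict(list)
for age, freq in rows:
    groups[age].append(freq)

# Apply: reduce every group to a single number (here, a sum).
sums = {age: sum(freqs) for age, freqs in groups.items()}

# Combine: the per-group results form one small table keyed by group.
for age in sorted(sums):
    print(age, sums[age])
```

A pandas groupby followed by a sum performs the same three steps in a single expression.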


In [31]:
## Higher survival rates in children? Yes
import numpy as np
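# Sum only the numeric Freq column; the other columns are text labels
A = dfsurvived.groupby('Age')[['Freq']].sum()  # survivors per age group
B = df.groupby('Age')[['Freq']].sum()          # everyone per age group
print(A / B)                                   # survival rate per age group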


           Freq
Age            
Adult  0.312620
Child  0.522936

In [33]:
## Higher survival rates in females? Yes
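# Sum only the numeric Freq column; the other columns are text labels
A = dfsurvived.groupby('Sex')[['Freq']].sum()  # survivors per sex
B = df.groupby('Sex')[['Freq']].sum()          # everyone per sex
print(A / B)                                   # survival rate per sex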


            Freq
Sex             
Female  0.731915
Male    0.212016
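
Both cells follow the same pattern, so it could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original notebook: survival_rate and table are hypothetical names, and the frame is rebuilt from the 32 Freq values listed earlier so that the example runs without Titanic.csv.

```python
from itertools import product

import pandas as pd

# Freq values in the row order of the listing above (rows 1 to 32).
freq = [0, 0, 35, 0, 0, 0, 17, 0, 118, 154, 387, 670, 4, 13, 89, 3,
        5, 11, 13, 0, 1, 13, 14, 0, 57, 14, 75, 192, 140, 80, 76, 20]

# product() varies the last factor fastest, so Class changes fastest here.
combos = product(["No", "Yes"], ["Child", "Adult"], ["Male", "Female"],
                 ["1st", "2nd", "3rd", "Crew"])
table = pd.DataFrame(list(combos), columns=["Survived", "Age", "Sex", "Class"])
table["Freq"] = freq


def survival_rate(data, by):
    """Return survivors divided by everyone, for each group of column `by`."""
    survived = data[data["Survived"] == "Yes"].groupby(by)["Freq"].sum()
    total = data.groupby(by)["Freq"].sum()
    return survived / total   # the two Series align on the group labels


print(survival_rate(table, "Age"))
print(survival_rate(table, "Sex"))
```

The same helper can be called with any other column, for example survival_rate(table, "Class").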
