The pandas library can be installed with pip and is conventionally imported under the alias pd. This tutorial covers only the basics; for more information on data science in Python, please refer to dedicated books on the subject.
In [2]:
import pandas as pd
Now we would like to do a data science exercise using pandas. First, we need a dataset. In the same folder, you will find a file named Titanic.csv. The data come from The R Datasets Package: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html. Below is the description from the package:
This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.
A 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables. The variables and their levels are as follows:
No Name Levels
1 Class 1st, 2nd, 3rd, Crew
2 Sex Male, Female
3 Age Child, Adult
4 Survived No, Yes
The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts—from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class—are reflected in the survival rates for various classes of passenger.
These data were originally collected by the British Board of Trade in their investigation of the sinking. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.
Due in particular to the very successful film ‘Titanic’, the last years saw a rise in public interest in the Titanic. Very detailed data about the passengers is now available on the Internet, at sites such as Encyclopedia Titanica (https://www.encyclopedia-titanica.org/).
The file is a CSV file, which can be loaded with pd.read_csv() (the older pd.DataFrame.from_csv() has been removed from recent pandas releases). The pandas package provides a DataFrame object; its documentation can be viewed by typing %pdoc pd.DataFrame. A DataFrame can be printed with print(), or you can type the variable name in a new cell for a prettier display.
DataFrame: Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
In [5]:
%pdoc pd.DataFrame
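Two points in the DataFrame description above, the dict-like container of Series and the label alignment in arithmetic, can be seen in a small sketch. The Series `s1` and `s2` and their values are made up for illustration and have nothing to do with the Titanic file.

```python
import pandas as pd

# Two small, made-up Series whose row labels only partly overlap.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Building a DataFrame from a dict of Series aligns them on the row labels.
frame = pd.DataFrame({'x': s1, 'y': s2})

# Dict-like access: indexing by column name returns a Series.
col_x = frame['x']

# Arithmetic aligns on labels; a label missing on either side gives NaN.
total = frame['x'] + frame['y']

print(frame)
print(total)
```

Labels 'a' and 'd' each appear in only one of the two Series, so their sums come out as NaN rather than raising an error.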
In [22]:
df = pd.read_csv('Titanic.csv', index_col=0)  # index_col=0 matches the old from_csv default
print(df)
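To try the loading step without the file at hand, the sketch below reads a few CSV rows from an in-memory string. The column names (Class, Sex, Age, Survived, Freq) and the counts are assumptions made for illustration; the real Titanic.csv may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical excerpt: column names and counts are illustrative only,
# not copied from Titanic.csv.
sample_csv = io.StringIO(
    ',Class,Sex,Age,Survived,Freq\n'
    '1,1st,Male,Adult,No,5\n'
    '2,2nd,Female,Adult,Yes,7\n'
    '3,3rd,Male,Child,No,9\n'
)

# index_col=0 uses the unnamed first column as the row labels.
sample = pd.read_csv(sample_csv, index_col=0)

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # text columns are read as object, counts as integers
print(sample.head(2))  # first two rows
```

The same calls (`shape`, `dtypes`, `head`) work on `df` once the real file is loaded.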
In [10]:
df
Out[10]:
In [14]:
print(df['Sex'])
In [17]:
dfsurvived = df[df.Survived == 'Yes']; dfsurvived
Out[17]:
In [27]:
dfNOsurvived = df[df.Survived == 'No']; dfNOsurvived
Out[27]:
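The two filters above rely on boolean indexing: the comparison produces a Series of True/False values, and indexing the DataFrame with it keeps only the rows marked True. A minimal sketch on a made-up table (the `toy` frame and its Freq counts are hypothetical):

```python
import pandas as pd

# Made-up table with the same kind of columns as the tutorial's data.
toy = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Male', 'Female'],
    'Survived': ['No', 'No', 'Yes', 'Yes'],
    'Freq': [8, 2, 3, 7],
})

# The comparison is evaluated row by row and yields a boolean Series.
mask = toy['Survived'] == 'Yes'

# Indexing with the mask keeps the True rows; ~mask keeps the False rows.
survived = toy[mask]
not_survived = toy[~mask]

print(mask)
print(survived)
print(not_survived)
```

Using `~mask` for the complement avoids repeating the comparison with the opposite value.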
To compare survival rates between groups, we will use a Group By approach. The basic idea is to split the data into groups based on some value, apply an operation (often an aggregation) to the subset of data within each group, and then combine the results into an output DataFrame. The cells below apply this idea to the Titanic data.
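The split, apply, and combine steps described above can be spelled out by hand and compared with what groupby does in one call. This is only a sketch on a made-up table; the `toy` frame, its column names, and its counts are assumptions for illustration.

```python
import pandas as pd

# Made-up counts, not taken from Titanic.csv.
toy = pd.DataFrame({
    'Age': ['Child', 'Adult', 'Child', 'Adult'],
    'Survived': ['No', 'No', 'Yes', 'Yes'],
    'Freq': [4, 30, 6, 10],
})

# By hand: split on Age, apply a sum to each piece, combine into a Series.
pieces = {}
for age in toy['Age'].unique():
    subset = toy[toy['Age'] == age]      # split
    pieces[age] = subset['Freq'].sum()   # apply
manual = pd.Series(pieces).sort_index()  # combine

# The same three steps in a single groupby call.
grouped = toy.groupby('Age')['Freq'].sum()

print(manual)
print(grouped)
```

Both results hold one total per Age level; groupby simply hides the loop.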
In [31]:
## Higher survival rates in children? Yes
import numpy as np
A = dfsurvived.groupby('Age').sum(numeric_only=True)  # counts among survivors
B = df.groupby('Age').sum(numeric_only=True)          # counts among everyone
print(A / B)                                          # survival rate per age group
In [33]:
## Higher survival rates in females? Yes
A = dfsurvived.groupby('Sex').sum(numeric_only=True)  # counts among survivors
B = df.groupby('Sex').sum(numeric_only=True)          # counts among everyone
print(A / B)                                          # survival rate per sex
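The two cells above follow the same pattern, so it could be wrapped in a small function. The helper name `survival_rate`, the column names 'Survived' and 'Freq', and the toy counts are all assumptions for illustration; adjust them to the actual columns of the file.

```python
import pandas as pd

def survival_rate(table, key, count_col='Freq'):
    """Share of survivors within each level of `key` (hypothetical helper)."""
    # Sum the counts of the 'Yes' rows per group, then the counts of all rows.
    survived = table[table['Survived'] == 'Yes'].groupby(key)[count_col].sum()
    total = table.groupby(key)[count_col].sum()
    # Division aligns on the group labels.
    return survived / total

# Made-up counts, not taken from Titanic.csv.
toy = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Male', 'Female'],
    'Survived': ['No', 'No', 'Yes', 'Yes'],
    'Freq': [8, 2, 2, 8],
})

rates = survival_rate(toy, 'Sex')
print(rates)
```

Selecting the count column explicitly keeps the text columns out of the sum, which is the same job `numeric_only=True` does when no column is named.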