01 - Pandas: Data Structures

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:

    
import pandas as pd



In [2]:

    
import numpy as np
import matplotlib.pyplot as plt

Introduction

Let's directly start with importing some data: the titanic dataset about the passengers of the Titanic and their survival:



In [3]:

    
df = pd.read_csv("../data/titanic.csv")



In [4]:

    
df.head()









    Out[4]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

Starting from reading such a tabular dataset, Pandas provides the functionalities to answer questions about this data in a few lines of code. Let's start with a few examples as illustration:

What is the age distribution of the passengers?



In [5]:

    
df['Age'].hist()









    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8e75bb0a90>

How does the survival rate of the passengers differ between sexes?



In [6]:

    
df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))

Or how does the survival rate differ between the different classes of the Titanic?



In [7]:

    
df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8e75a9e860>

Are young people (e.g. < 25 years) likely to survive?



In [8]:

    
df['Survived'].sum() / df['Survived'].count()









    Out[8]:





0.3838383838383838



In [9]:

    
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])









    Out[9]:





0.4119601328903654

All the needed functionality for the above examples will be explained throughout the course, but as a start: the data types to work with.

The pandas data structures: `DataFrame` and `Series`

Pandas provides two fundamental data objects, for 1D (Series) and 2D data (DataFrame).

DataFrame: 2D tabular data

A DataFrame is a tablular data structure (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.

For the examples here, we are going to create a small DataFrame with some data about a few countries.

When creating a DataFrame manually, a common way to do this is from dictionary of arrays or lists:



In [10]:

    
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries









    Out[10]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      4
      United Kingdom
      64.9
      244820
      London

In practice, you will of course often import your data from an external source (text file, excel, database, ..), which we will see later.

Note that in the IPython notebook, the dataframe will display in a rich HTML view.

Attributes of the DataFrame

The DataFrame has a built-in concept of named rows and columns, the index and columns attributes:



In [11]:

    
countries.index









    Out[11]:





RangeIndex(start=0, stop=5, step=1)

By default, the index is the numbers 0 through N - 1



In [12]:

    
countries.columns









    Out[12]:





Index(['country', 'population', 'area', 'capital'], dtype='object')

To check the data types of the different columns:



In [13]:

    
countries.dtypes









    Out[13]:





country        object
population    float64
area            int64
capital        object
dtype: object

An overview of that information can be given with the info() method:



In [14]:

    
countries.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
country       5 non-null object
population    5 non-null float64
area          5 non-null int64
capital       5 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 240.0+ bytes

__NumPy__ provides * multi-dimensional, homogeneously typed arrays (single data type!)
__Pandas__ provides * 2D, heterogeneous data structure (multiple data types!) * labeled (named) row and column index

One-dimensional data: `Series` (a column of a DataFrame)

A Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:



In [16]:

    
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s









    Out[16]:





0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

And often, you access a Series representing a column in the data, using typical [] indexing syntax and the column name:



In [17]:

    
countries['area']









    Out[17]:





0     30510
1    671308
2    357050
3     41526
4    244820
Name: area, dtype: int64

Attributes of a Series: `index` and `values`

The series also has an index, which by default is the numbers 0 through N - 1 (but no .columns):



In [18]:

    
s.index









    Out[18]:





RangeIndex(start=0, stop=4, step=1)

You can access the underlying numpy array representation with the .values attribute:



In [19]:

    
s.values









    Out[19]:





array([0.1, 0.2, 0.3, 0.4])

We can access series values via the index, just like for NumPy arrays:



In [20]:

    
s[0]









    Out[20]:





0.1

Unlike the NumPy array, though, this index can be something other than integers:



In [21]:

    
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2









    Out[21]:





a    0
b    1
c    2
d    3
dtype: int64



In [22]:

    
s2['c']









    Out[22]:





2

Pandas Series versus dictionaries

In this way, a Series object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.

In fact, it's possible to construct a series directly from a Python dictionary:



In [23]:

    
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
population









    Out[23]:





Germany           81.3
Belgium           11.3
France            64.3
United Kingdom    64.9
Netherlands       16.9
dtype: float64

We can index the populations like a dict as expected ...



In [24]:

    
population['France']









    Out[24]:





64.3

... but with the power of numpy arrays. Many things you can do with numpy arrays, can also be applied on DataFrames / Series.

Eg element-wise operations:



In [25]:

    
population * 1000









    Out[25]:





Germany           81300.0
Belgium           11300.0
France            64300.0
United Kingdom    64900.0
Netherlands       16900.0
dtype: float64

Some useful methods on these data structures

Exploration of the Series and DataFrame is essential (check out what you're dealing with).



In [26]:

    
countries.head() # Top rows









    Out[26]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      4
      United Kingdom
      64.9
      244820
      London



In [27]:

    
countries.tail() # Bottom rows









    Out[27]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      4
      United Kingdom
      64.9
      244820
      London

The describe method computes summary statistics for each column:



In [28]:

    
countries.describe()









    Out[28]:







  
    
      
      population
      area
    
  
  
    
      count
      5.000000
      5.000000
    
    
      mean
      47.740000
      269042.800000
    
    
      std
      31.519645
      264012.827994
    
    
      min
      11.300000
      30510.000000
    
    
      25%
      16.900000
      41526.000000
    
    
      50%
      64.300000
      244820.000000
    
    
      75%
      64.900000
      357050.000000
    
    
      max
      81.300000
      671308.000000

Sorting your data by a specific column is another important first-check:



In [29]:

    
countries.sort_values(by='population')









    Out[29]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      4
      United Kingdom
      64.9
      244820
      London
    
    
      2
      Germany
      81.3
      357050
      Berlin

The plot method can be used to quickly visualize the data in different ways:



In [30]:

    
countries.plot()









    Out[30]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8e759d1940>

However, for this dataset, it does not say that much:



In [31]:

    
countries['population'].plot(kind='barh')









    Out[31]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8e75978c18>

**EXERCISE**: * You can play with the `kind` keyword of the `plot` function in the figure above: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin', 'box'

Importing and exporting data

A wide range of input/output formats are natively supported by pandas:

CSV, text
SQL database
Excel
HDF5
json
html
pickle
sas, stata
Parquet
...



In [32]:

    
# pd.read_



In [33]:

    
# countries.to_

**Note: I/O interface** * All readers are `pd.read_...` * All writers are `DataFrame.to_...`

Application on a real dataset

Throughout the pandas notebooks, many of exercises will use the titanic dataset. This dataset has records of all the passengers of the Titanic, with characteristics of the passengers (age, class, etc. See below), and an indication whether they survived the disaster.

The available metadata of the titanic data set provides the following information:

VARIABLE	DESCRIPTION
Survived	Survival (0 = No; 1 = Yes)
Pclass	Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name	Name
Sex	Sex
Age	Age
SibSp	Number of Siblings/Spouses Aboard
Parch	Number of Parents/Children Aboard
Ticket	Ticket Number
Fare	Passenger Fare
Cabin	Cabin
Embarked	Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

**EXERCISE**: * Read the CVS file (available at `../data/titanic.csv`) into a pandas DataFrame. Call the result `df`.



In [34]:

    
df = pd.read_csv("../data/titanic.csv")

**EXERCISE**: * Quick exploration: show the first 5 rows of the DataFrame.



In [35]:

    
df.head()









    Out[35]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

**EXERCISE**: * How many records (i.e. rows) has the titanic dataset?

Hints

* The length of a DataFrame gives the number of rows (`len(..)`). Alternatively, you can check the "shape" (number of rows, number of columns) of the DataFrame using the `shape` attribute.



In [36]:

    
len(df)









    Out[36]:





891

EXERCISE: * Select the 'Age' column (remember: we can use the [] indexing notation and the column label).



In [37]:

    
df['Age']









    Out[37]:





0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

EXERCISE: * Make a box plot of the Fare column.



In [38]:

    
df['Fare'].plot(kind='box')









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8e759049e8>

**EXERCISE**: * Sort the rows of the DataFrame by 'Age' column, with the oldest passenger at the top. Check the help of the `sort_values` function and find out how to sort from the largest values to the lowest values



In [39]:

    
df.sort_values(by='Age', ascending=False)









    Out[39]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      630
      631
      1
      1
      Barkworth, Mr. Algernon Henry Wilson
      male
      80.0
      0
      0
      27042
      30.0000
      A23
      S
    
    
      851
      852
      0
      3
      Svensson, Mr. Johan
      male
      74.0
      0
      0
      347060
      7.7750
      NaN
      S
    
    
      493
      494
      0
      1
      Artagaveytia, Mr. Ramon
      male
      71.0
      0
      0
      PC 17609
      49.5042
      NaN
      C
    
    
      96
      97
      0
      1
      Goldschmidt, Mr. George B
      male
      71.0
      0
      0
      PC 17754
      34.6542
      A5
      C
    
    
      116
      117
      0
      3
      Connors, Mr. Patrick
      male
      70.5
      0
      0
      370369
      7.7500
      NaN
      Q
    
    
      672
      673
      0
      2
      Mitchell, Mr. Henry Michael
      male
      70.0
      0
      0
      C.A. 24580
      10.5000
      NaN
      S
    
    
      745
      746
      0
      1
      Crosby, Capt. Edward Gifford
      male
      70.0
      1
      1
      WE/P 5735
      71.0000
      B22
      S
    
    
      33
      34
      0
      2
      Wheadon, Mr. Edward H
      male
      66.0
      0
      0
      C.A. 24579
      10.5000
      NaN
      S
    
    
      54
      55
      0
      1
      Ostby, Mr. Engelhart Cornelius
      male
      65.0
      0
      1
      113509
      61.9792
      B30
      C
    
    
      280
      281
      0
      3
      Duane, Mr. Frank
      male
      65.0
      0
      0
      336439
      7.7500
      NaN
      Q
    
    
      456
      457
      0
      1
      Millet, Mr. Francis Davis
      male
      65.0
      0
      0
      13509
      26.5500
      E38
      S
    
    
      438
      439
      0
      1
      Fortune, Mr. Mark
      male
      64.0
      1
      4
      19950
      263.0000
      C23 C25 C27
      S
    
    
      545
      546
      0
      1
      Nicholson, Mr. Arthur Ernest
      male
      64.0
      0
      0
      693
      26.0000
      NaN
      S
    
    
      275
      276
      1
      1
      Andrews, Miss. Kornelia Theodosia
      female
      63.0
      1
      0
      13502
      77.9583
      D7
      S
    
    
      483
      484
      1
      3
      Turkula, Mrs. (Hedwig)
      female
      63.0
      0
      0
      4134
      9.5875
      NaN
      S
    
    
      570
      571
      1
      2
      Harris, Mr. George
      male
      62.0
      0
      0
      S.W./PP 752
      10.5000
      NaN
      S
    
    
      252
      253
      0
      1
      Stead, Mr. William Thomas
      male
      62.0
      0
      0
      113514
      26.5500
      C87
      S
    
    
      829
      830
      1
      1
      Stone, Mrs. George Nelson (Martha Evelyn)
      female
      62.0
      0
      0
      113572
      80.0000
      B28
      NaN
    
    
      555
      556
      0
      1
      Wright, Mr. George
      male
      62.0
      0
      0
      113807
      26.5500
      NaN
      S
    
    
      625
      626
      0
      1
      Sutton, Mr. Frederick
      male
      61.0
      0
      0
      36963
      32.3208
      D50
      S
    
    
      326
      327
      0
      3
      Nysveen, Mr. Johan Hansen
      male
      61.0
      0
      0
      345364
      6.2375
      NaN
      S
    
    
      170
      171
      0
      1
      Van der hoef, Mr. Wyckoff
      male
      61.0
      0
      0
      111240
      33.5000
      B19
      S
    
    
      684
      685
      0
      2
      Brown, Mr. Thomas William Solomon
      male
      60.0
      1
      1
      29750
      39.0000
      NaN
      S
    
    
      694
      695
      0
      1
      Weir, Col. John
      male
      60.0
      0
      0
      113800
      26.5500
      NaN
      S
    
    
      587
      588
      1
      1
      Frolicher-Stehli, Mr. Maxmillian
      male
      60.0
      1
      1
      13567
      79.2000
      B41
      C
    
    
      366
      367
      1
      1
      Warren, Mrs. Frank Manley (Anna Sophia Atkinson)
      female
      60.0
      1
      0
      110813
      75.2500
      D37
      C
    
    
      94
      95
      0
      3
      Coxon, Mr. Daniel
      male
      59.0
      0
      0
      364500
      7.2500
      NaN
      S
    
    
      232
      233
      0
      2
      Sjostedt, Mr. Ernst Adolf
      male
      59.0
      0
      0
      237442
      13.5000
      NaN
      S
    
    
      268
      269
      1
      1
      Graham, Mrs. William Thompson (Edith Junkins)
      female
      58.0
      0
      1
      PC 17582
      153.4625
      C125
      S
    
    
      11
      12
      1
      1
      Bonnell, Miss. Elizabeth
      female
      58.0
      0
      0
      113783
      26.5500
      C103
      S
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      718
      719
      0
      3
      McEvoy, Mr. Michael
      male
      NaN
      0
      0
      36568
      15.5000
      NaN
      Q
    
    
      727
      728
      1
      3
      Mannion, Miss. Margareth
      female
      NaN
      0
      0
      36866
      7.7375
      NaN
      Q
    
    
      732
      733
      0
      2
      Knight, Mr. Robert J
      male
      NaN
      0
      0
      239855
      0.0000
      NaN
      S
    
    
      738
      739
      0
      3
      Ivanoff, Mr. Kanio
      male
      NaN
      0
      0
      349201
      7.8958
      NaN
      S
    
    
      739
      740
      0
      3
      Nankoff, Mr. Minko
      male
      NaN
      0
      0
      349218
      7.8958
      NaN
      S
    
    
      740
      741
      1
      1
      Hawksford, Mr. Walter James
      male
      NaN
      0
      0
      16988
      30.0000
      D45
      S
    
    
      760
      761
      0
      3
      Garfirth, Mr. John
      male
      NaN
      0
      0
      358585
      14.5000
      NaN
      S
    
    
      766
      767
      0
      1
      Brewe, Dr. Arthur Jackson
      male
      NaN
      0
      0
      112379
      39.6000
      NaN
      C
    
    
      768
      769
      0
      3
      Moran, Mr. Daniel J
      male
      NaN
      1
      0
      371110
      24.1500
      NaN
      Q
    
    
      773
      774
      0
      3
      Elias, Mr. Dibo
      male
      NaN
      0
      0
      2674
      7.2250
      NaN
      C
    
    
      776
      777
      0
      3
      Tobin, Mr. Roger
      male
      NaN
      0
      0
      383121
      7.7500
      F38
      Q
    
    
      778
      779
      0
      3
      Kilgannon, Mr. Thomas J
      male
      NaN
      0
      0
      36865
      7.7375
      NaN
      Q
    
    
      783
      784
      0
      3
      Johnston, Mr. Andrew G
      male
      NaN
      1
      2
      W./C. 6607
      23.4500
      NaN
      S
    
    
      790
      791
      0
      3
      Keane, Mr. Andrew "Andy"
      male
      NaN
      0
      0
      12460
      7.7500
      NaN
      Q
    
    
      792
      793
      0
      3
      Sage, Miss. Stella Anna
      female
      NaN
      8
      2
      CA. 2343
      69.5500
      NaN
      S
    
    
      793
      794
      0
      1
      Hoyt, Mr. William Fisher
      male
      NaN
      0
      0
      PC 17600
      30.6958
      NaN
      C
    
    
      815
      816
      0
      1
      Fry, Mr. Richard
      male
      NaN
      0
      0
      112058
      0.0000
      B102
      S
    
    
      825
      826
      0
      3
      Flynn, Mr. John
      male
      NaN
      0
      0
      368323
      6.9500
      NaN
      Q
    
    
      826
      827
      0
      3
      Lam, Mr. Len
      male
      NaN
      0
      0
      1601
      56.4958
      NaN
      S
    
    
      828
      829
      1
      3
      McCormack, Mr. Thomas Joseph
      male
      NaN
      0
      0
      367228
      7.7500
      NaN
      Q
    
    
      832
      833
      0
      3
      Saad, Mr. Amin
      male
      NaN
      0
      0
      2671
      7.2292
      NaN
      C
    
    
      837
      838
      0
      3
      Sirota, Mr. Maurice
      male
      NaN
      0
      0
      392092
      8.0500
      NaN
      S
    
    
      839
      840
      1
      1
      Marechal, Mr. Pierre
      male
      NaN
      0
      0
      11774
      29.7000
      C47
      C
    
    
      846
      847
      0
      3
      Sage, Mr. Douglas Bullen
      male
      NaN
      8
      2
      CA. 2343
      69.5500
      NaN
      S
    
    
      849
      850
      1
      1
      Goldenberg, Mrs. Samuel L (Edwiga Grabowska)
      female
      NaN
      1
      0
      17453
      89.1042
      C92
      C
    
    
      859
      860
      0
      3
      Razi, Mr. Raihed
      male
      NaN
      0
      0
      2629
      7.2292
      NaN
      C
    
    
      863
      864
      0
      3
      Sage, Miss. Dorothy Edith "Dolly"
      female
      NaN
      8
      2
      CA. 2343
      69.5500
      NaN
      S
    
    
      868
      869
      0
      3
      van Melkebeke, Mr. Philemon
      male
      NaN
      0
      0
      345777
      9.5000
      NaN
      S
    
    
      878
      879
      0
      3
      Laleff, Mr. Kristo
      male
      NaN
      0
      0
      349217
      7.8958
      NaN
      S
    
    
      888
      889
      0
      3
      Johnston, Miss. Catherine Helen "Carrie"
      female
      NaN
      1
      2
      W./C. 6607
      23.4500
      NaN
      S
    
  

891 rows × 12 columns

Acknowledgement

This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	country	population	area	capital
0	Belgium	11.3	30510	Brussels
1	France	64.3	671308	Paris
2	Germany	81.3	357050	Berlin
3	Netherlands	16.9	41526	Amsterdam
4	United Kingdom	64.9	244820	London

	population	area
count	5.000000	5.000000
mean	47.740000	269042.800000
std	31.519645	264012.827994
min	11.300000	30510.000000
25%	16.900000	41526.000000
50%	64.300000	244820.000000
75%	64.900000	357050.000000
max	81.300000	671308.000000

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
630	631	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	0	0	27042	30.0000	A23	S
851	852	0	3	Svensson, Mr. Johan	male	74.0	0	0	347060	7.7750	NaN	S
493	494	0	1	Artagaveytia, Mr. Ramon	male	71.0	0	0	PC 17609	49.5042	NaN	C
96	97	0	1	Goldschmidt, Mr. George B	male	71.0	0	0	PC 17754	34.6542	A5	C
116	117	0	3	Connors, Mr. Patrick	male	70.5	0	0	370369	7.7500	NaN	Q
672	673	0	2	Mitchell, Mr. Henry Michael	male	70.0	0	0	C.A. 24580	10.5000	NaN	S
745	746	0	1	Crosby, Capt. Edward Gifford	male	70.0	1	1	WE/P 5735	71.0000	B22	S
33	34	0	2	Wheadon, Mr. Edward H	male	66.0	0	0	C.A. 24579	10.5000	NaN	S
54	55	0	1	Ostby, Mr. Engelhart Cornelius	male	65.0	0	1	113509	61.9792	B30	C
280	281	0	3	Duane, Mr. Frank	male	65.0	0	0	336439	7.7500	NaN	Q
456	457	0	1	Millet, Mr. Francis Davis	male	65.0	0	0	13509	26.5500	E38	S
438	439	0	1	Fortune, Mr. Mark	male	64.0	1	4	19950	263.0000	C23 C25 C27	S
545	546	0	1	Nicholson, Mr. Arthur Ernest	male	64.0	0	0	693	26.0000	NaN	S
275	276	1	1	Andrews, Miss. Kornelia Theodosia	female	63.0	1	0	13502	77.9583	D7	S
483	484	1	3	Turkula, Mrs. (Hedwig)	female	63.0	0	0	4134	9.5875	NaN	S
570	571	1	2	Harris, Mr. George	male	62.0	0	0	S.W./PP 752	10.5000	NaN	S
252	253	0	1	Stead, Mr. William Thomas	male	62.0	0	0	113514	26.5500	C87	S
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0000	B28	NaN
555	556	0	1	Wright, Mr. George	male	62.0	0	0	113807	26.5500	NaN	S
625	626	0	1	Sutton, Mr. Frederick	male	61.0	0	0	36963	32.3208	D50	S
326	327	0	3	Nysveen, Mr. Johan Hansen	male	61.0	0	0	345364	6.2375	NaN	S
170	171	0	1	Van der hoef, Mr. Wyckoff	male	61.0	0	0	111240	33.5000	B19	S
684	685	0	2	Brown, Mr. Thomas William Solomon	male	60.0	1	1	29750	39.0000	NaN	S
694	695	0	1	Weir, Col. John	male	60.0	0	0	113800	26.5500	NaN	S
587	588	1	1	Frolicher-Stehli, Mr. Maxmillian	male	60.0	1	1	13567	79.2000	B41	C
366	367	1	1	Warren, Mrs. Frank Manley (Anna Sophia Atkinson)	female	60.0	1	0	110813	75.2500	D37	C
94	95	0	3	Coxon, Mr. Daniel	male	59.0	0	0	364500	7.2500	NaN	S
232	233	0	2	Sjostedt, Mr. Ernst Adolf	male	59.0	0	0	237442	13.5000	NaN	S
268	269	1	1	Graham, Mrs. William Thompson (Edith Junkins)	female	58.0	0	1	PC 17582	153.4625	C125	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
...	...	...	...	...	...	...	...	...	...	...	...	...
718	719	0	3	McEvoy, Mr. Michael	male	NaN	0	0	36568	15.5000	NaN	Q
727	728	1	3	Mannion, Miss. Margareth	female	NaN	0	0	36866	7.7375	NaN	Q
732	733	0	2	Knight, Mr. Robert J	male	NaN	0	0	239855	0.0000	NaN	S
738	739	0	3	Ivanoff, Mr. Kanio	male	NaN	0	0	349201	7.8958	NaN	S
739	740	0	3	Nankoff, Mr. Minko	male	NaN	0	0	349218	7.8958	NaN	S
740	741	1	1	Hawksford, Mr. Walter James	male	NaN	0	0	16988	30.0000	D45	S
760	761	0	3	Garfirth, Mr. John	male	NaN	0	0	358585	14.5000	NaN	S
766	767	0	1	Brewe, Dr. Arthur Jackson	male	NaN	0	0	112379	39.6000	NaN	C
768	769	0	3	Moran, Mr. Daniel J	male	NaN	1	0	371110	24.1500	NaN	Q
773	774	0	3	Elias, Mr. Dibo	male	NaN	0	0	2674	7.2250	NaN	C
776	777	0	3	Tobin, Mr. Roger	male	NaN	0	0	383121	7.7500	F38	Q
778	779	0	3	Kilgannon, Mr. Thomas J	male	NaN	0	0	36865	7.7375	NaN	Q
783	784	0	3	Johnston, Mr. Andrew G	male	NaN	1	2	W./C. 6607	23.4500	NaN	S
790	791	0	3	Keane, Mr. Andrew "Andy"	male	NaN	0	0	12460	7.7500	NaN	Q
792	793	0	3	Sage, Miss. Stella Anna	female	NaN	8	2	CA. 2343	69.5500	NaN	S
793	794	0	1	Hoyt, Mr. William Fisher	male	NaN	0	0	PC 17600	30.6958	NaN	C
815	816	0	1	Fry, Mr. Richard	male	NaN	0	0	112058	0.0000	B102	S
825	826	0	3	Flynn, Mr. John	male	NaN	0	0	368323	6.9500	NaN	Q
826	827	0	3	Lam, Mr. Len	male	NaN	0	0	1601	56.4958	NaN	S
828	829	1	3	McCormack, Mr. Thomas Joseph	male	NaN	0	0	367228	7.7500	NaN	Q
832	833	0	3	Saad, Mr. Amin	male	NaN	0	0	2671	7.2292	NaN	C
837	838	0	3	Sirota, Mr. Maurice	male	NaN	0	0	392092	8.0500	NaN	S
839	840	1	1	Marechal, Mr. Pierre	male	NaN	0	0	11774	29.7000	C47	C
846	847	0	3	Sage, Mr. Douglas Bullen	male	NaN	8	2	CA. 2343	69.5500	NaN	S
849	850	1	1	Goldenberg, Mrs. Samuel L (Edwiga Grabowska)	female	NaN	1	0	17453	89.1042	C92	C
859	860	0	3	Razi, Mr. Raihed	male	NaN	0	0	2629	7.2292	NaN	C
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S