01 - Pandas: Data Structures

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import pandas as pd

In [2]:
import numpy as np
import matplotlib.pyplot as plt

Introduction

Let's directly start with importing some data: the titanic dataset about the passengers of the Titanic and their survival:


In [3]:
df = pd.read_csv("../data/titanic.csv")

In [4]:
df.head()


Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Starting from reading such a tabular dataset, Pandas provides the functionalities to answer questions about this data in a few lines of code. Let's start with a few examples as illustration:

  • What is the age distribution of the passengers?

In [5]:
df['Age'].hist()


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e75bb0a90>
  • How does the survival rate of the passengers differ between sexes?

In [6]:
df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))


Out[6]:
Survived
Sex
female 0.742038
male 0.188908
  • Or how does the survival rate differ between the different classes of the Titanic?

In [7]:
df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e75a9e860>
  • Are young people (e.g. < 25 years) likely to survive?

In [8]:
df['Survived'].sum() / df['Survived'].count()


Out[8]:
0.3838383838383838

In [9]:
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])


Out[9]:
0.4119601328903654

All the needed functionality for the above examples will be explained throughout the course, but as a start: the data types to work with.

The pandas data structures: DataFrame and Series

Pandas provides two fundamental data objects, for 1D (Series) and 2D data (DataFrame).

DataFrame: 2D tabular data

A DataFrame is a tablular data structure (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.

For the examples here, we are going to create a small DataFrame with some data about a few countries.

When creating a DataFrame manually, a common way to do this is from dictionary of arrays or lists:


In [10]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries


Out[10]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam
4 United Kingdom 64.9 244820 London

In practice, you will of course often import your data from an external source (text file, excel, database, ..), which we will see later.

Note that in the IPython notebook, the dataframe will display in a rich HTML view.

Attributes of the DataFrame

The DataFrame has a built-in concept of named rows and columns, the index and columns attributes:


In [11]:
countries.index


Out[11]:
RangeIndex(start=0, stop=5, step=1)

By default, the index is the numbers 0 through N - 1


In [12]:
countries.columns


Out[12]:
Index(['country', 'population', 'area', 'capital'], dtype='object')

To check the data types of the different columns:


In [13]:
countries.dtypes


Out[13]:
country        object
population    float64
area            int64
capital        object
dtype: object

An overview of that information can be given with the info() method:


In [14]:
countries.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
country       5 non-null object
population    5 non-null float64
area          5 non-null int64
capital       5 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 240.0+ bytes
__NumPy__ provides * multi-dimensional, homogeneously typed arrays (single data type!)
__Pandas__ provides * 2D, heterogeneous data structure (multiple data types!) * labeled (named) row and column index

One-dimensional data: Series (a column of a DataFrame)

A Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:


In [16]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s


Out[16]:
0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

And often, you access a Series representing a column in the data, using typical [] indexing syntax and the column name:


In [17]:
countries['area']


Out[17]:
0     30510
1    671308
2    357050
3     41526
4    244820
Name: area, dtype: int64

Attributes of a Series: index and values

The series also has an index, which by default is the numbers 0 through N - 1 (but no .columns):


In [18]:
s.index


Out[18]:
RangeIndex(start=0, stop=4, step=1)

You can access the underlying numpy array representation with the .values attribute:


In [19]:
s.values


Out[19]:
array([0.1, 0.2, 0.3, 0.4])

We can access series values via the index, just like for NumPy arrays:


In [20]:
s[0]


Out[20]:
0.1

Unlike the NumPy array, though, this index can be something other than integers:


In [21]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2


Out[21]:
a    0
b    1
c    2
d    3
dtype: int64

In [22]:
s2['c']


Out[22]:
2

Pandas Series versus dictionaries

In this way, a Series object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.

In fact, it's possible to construct a series directly from a Python dictionary:


In [23]:
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
population


Out[23]:
Germany           81.3
Belgium           11.3
France            64.3
United Kingdom    64.9
Netherlands       16.9
dtype: float64

We can index the populations like a dict as expected ...


In [24]:
population['France']


Out[24]:
64.3

... but with the power of numpy arrays. Many things you can do with numpy arrays, can also be applied on DataFrames / Series.

Eg element-wise operations:


In [25]:
population * 1000


Out[25]:
Germany           81300.0
Belgium           11300.0
France            64300.0
United Kingdom    64900.0
Netherlands       16900.0
dtype: float64

Some useful methods on these data structures

Exploration of the Series and DataFrame is essential (check out what you're dealing with).


In [26]:
countries.head() # Top rows


Out[26]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam
4 United Kingdom 64.9 244820 London

In [27]:
countries.tail() # Bottom rows


Out[27]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam
4 United Kingdom 64.9 244820 London

The describe method computes summary statistics for each column:


In [28]:
countries.describe()


Out[28]:
population area
count 5.000000 5.000000
mean 47.740000 269042.800000
std 31.519645 264012.827994
min 11.300000 30510.000000
25% 16.900000 41526.000000
50% 64.300000 244820.000000
75% 64.900000 357050.000000
max 81.300000 671308.000000

Sorting your data by a specific column is another important first-check:


In [29]:
countries.sort_values(by='population')


Out[29]:
country population area capital
0 Belgium 11.3 30510 Brussels
3 Netherlands 16.9 41526 Amsterdam
1 France 64.3 671308 Paris
4 United Kingdom 64.9 244820 London
2 Germany 81.3 357050 Berlin

The plot method can be used to quickly visualize the data in different ways:


In [30]:
countries.plot()


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e759d1940>

However, for this dataset, it does not say that much:


In [31]:
countries['population'].plot(kind='barh')


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e75978c18>
**EXERCISE**: * You can play with the `kind` keyword of the `plot` function in the figure above: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin', 'box'

Importing and exporting data

A wide range of input/output formats are natively supported by pandas:

  • CSV, text
  • SQL database
  • Excel
  • HDF5
  • json
  • html
  • pickle
  • sas, stata
  • Parquet
  • ...

In [32]:
# pd.read_

In [33]:
# countries.to_
**Note: I/O interface** * All readers are `pd.read_...` * All writers are `DataFrame.to_...`

Application on a real dataset

Throughout the pandas notebooks, many of exercises will use the titanic dataset. This dataset has records of all the passengers of the Titanic, with characteristics of the passengers (age, class, etc. See below), and an indication whether they survived the disaster.

The available metadata of the titanic data set provides the following information:

VARIABLE DESCRIPTION
Survived Survival (0 = No; 1 = Yes)
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name Name
Sex Sex
Age Age
SibSp Number of Siblings/Spouses Aboard
Parch Number of Parents/Children Aboard
Ticket Ticket Number
Fare Passenger Fare
Cabin Cabin
Embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
**EXERCISE**: * Read the CVS file (available at `../data/titanic.csv`) into a pandas DataFrame. Call the result `df`.

In [34]:
df = pd.read_csv("../data/titanic.csv")
**EXERCISE**: * Quick exploration: show the first 5 rows of the DataFrame.

In [35]:
df.head()


Out[35]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
**EXERCISE**: * How many records (i.e. rows) has the titanic dataset?
Hints * The length of a DataFrame gives the number of rows (`len(..)`). Alternatively, you can check the "shape" (number of rows, number of columns) of the DataFrame using the `shape` attribute.

In [36]:
len(df)


Out[36]:
891
EXERCISE: * Select the 'Age' column (remember: we can use the [] indexing notation and the column label).

In [37]:
df['Age']


Out[37]:
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
EXERCISE: * Make a box plot of the Fare column.

In [38]:
df['Fare'].plot(kind='box')


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e759049e8>
**EXERCISE**: * Sort the rows of the DataFrame by 'Age' column, with the oldest passenger at the top. Check the help of the `sort_values` function and find out how to sort from the largest values to the lowest values

In [39]:
df.sort_values(by='Age', ascending=False)


Out[39]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
672 673 0 2 Mitchell, Mr. Henry Michael male 70.0 0 0 C.A. 24580 10.5000 NaN S
745 746 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0000 B22 S
33 34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S
54 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C
280 281 0 3 Duane, Mr. Frank male 65.0 0 0 336439 7.7500 NaN Q
456 457 0 1 Millet, Mr. Francis Davis male 65.0 0 0 13509 26.5500 E38 S
438 439 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S
545 546 0 1 Nicholson, Mr. Arthur Ernest male 64.0 0 0 693 26.0000 NaN S
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
570 571 1 2 Harris, Mr. George male 62.0 0 0 S.W./PP 752 10.5000 NaN S
252 253 0 1 Stead, Mr. William Thomas male 62.0 0 0 113514 26.5500 C87 S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
555 556 0 1 Wright, Mr. George male 62.0 0 0 113807 26.5500 NaN S
625 626 0 1 Sutton, Mr. Frederick male 61.0 0 0 36963 32.3208 D50 S
326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0 0 0 345364 6.2375 NaN S
170 171 0 1 Van der hoef, Mr. Wyckoff male 61.0 0 0 111240 33.5000 B19 S
684 685 0 2 Brown, Mr. Thomas William Solomon male 60.0 1 1 29750 39.0000 NaN S
694 695 0 1 Weir, Col. John male 60.0 0 0 113800 26.5500 NaN S
587 588 1 1 Frolicher-Stehli, Mr. Maxmillian male 60.0 1 1 13567 79.2000 B41 C
366 367 1 1 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) female 60.0 1 0 110813 75.2500 D37 C
94 95 0 3 Coxon, Mr. Daniel male 59.0 0 0 364500 7.2500 NaN S
232 233 0 2 Sjostedt, Mr. Ernst Adolf male 59.0 0 0 237442 13.5000 NaN S
268 269 1 1 Graham, Mrs. William Thompson (Edith Junkins) female 58.0 0 1 PC 17582 153.4625 C125 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
... ... ... ... ... ... ... ... ... ... ... ... ...
718 719 0 3 McEvoy, Mr. Michael male NaN 0 0 36568 15.5000 NaN Q
727 728 1 3 Mannion, Miss. Margareth female NaN 0 0 36866 7.7375 NaN Q
732 733 0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0000 NaN S
738 739 0 3 Ivanoff, Mr. Kanio male NaN 0 0 349201 7.8958 NaN S
739 740 0 3 Nankoff, Mr. Minko male NaN 0 0 349218 7.8958 NaN S
740 741 1 1 Hawksford, Mr. Walter James male NaN 0 0 16988 30.0000 D45 S
760 761 0 3 Garfirth, Mr. John male NaN 0 0 358585 14.5000 NaN S
766 767 0 1 Brewe, Dr. Arthur Jackson male NaN 0 0 112379 39.6000 NaN C
768 769 0 3 Moran, Mr. Daniel J male NaN 1 0 371110 24.1500 NaN Q
773 774 0 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 NaN C
776 777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.7500 F38 Q
778 779 0 3 Kilgannon, Mr. Thomas J male NaN 0 0 36865 7.7375 NaN Q
783 784 0 3 Johnston, Mr. Andrew G male NaN 1 2 W./C. 6607 23.4500 NaN S
790 791 0 3 Keane, Mr. Andrew "Andy" male NaN 0 0 12460 7.7500 NaN Q
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.5500 NaN S
793 794 0 1 Hoyt, Mr. William Fisher male NaN 0 0 PC 17600 30.6958 NaN C
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S
825 826 0 3 Flynn, Mr. John male NaN 0 0 368323 6.9500 NaN Q
826 827 0 3 Lam, Mr. Len male NaN 0 0 1601 56.4958 NaN S
828 829 1 3 McCormack, Mr. Thomas Joseph male NaN 0 0 367228 7.7500 NaN Q
832 833 0 3 Saad, Mr. Amin male NaN 0 0 2671 7.2292 NaN C
837 838 0 3 Sirota, Mr. Maurice male NaN 0 0 392092 8.0500 NaN S
839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.5500 NaN S
849 850 1 1 Goldenberg, Mrs. Samuel L (Edwiga Grabowska) female NaN 1 0 17453 89.1042 C92 C
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

891 rows × 12 columns


Acknowledgement

This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).