01 - Pandas: Data Structures
DS Data manipulation, analysis and visualisation in Python
December, 2019. © 2016-2019, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons.
In [1]:
import pandas as pd
In [2]:
import numpy as np
import matplotlib.pyplot as plt
Let's start directly by importing some data: the Titanic dataset, which describes the passengers of the Titanic and their survival:
In [3]:
df = pd.read_csv("../data/titanic.csv")
In [4]:
df.head()
Out[4]:
Starting from such a tabular dataset, pandas provides the functionality to answer questions about the data in a few lines of code. Let's start with a few examples as illustration:
In [5]:
df['Age'].hist()
Out[5]:
In [6]:
df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))
Out[6]:
In [7]:
df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')
Out[7]:
In [8]:
df['Survived'].sum() / df['Survived'].count()
Out[8]:
In [9]:
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])
Out[9]:
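The survival ratios above all follow the same pattern: the sum of a 0/1 column divided by its length. As a minimal sketch on hypothetical toy data (the values below are made up and are not taken from the Titanic file), the same fraction can also be obtained with `mean()`:

```python
import pandas as pd

# Hypothetical toy data standing in for a 0/1 'Survived' column
# (1 = survived, 0 = did not survive); not the real dataset.
survived = pd.Series([1, 0, 0, 1, 1, 0, 0, 0])

# Same pattern as in the cells above: number of ones divided by number of rows.
ratio_manual = survived.sum() / len(survived)

# The mean of a 0/1 column is that same fraction.
ratio_mean = survived.mean()

print(ratio_manual, ratio_mean)
```

Note that the two approaches only agree when the column has no missing values: `len` counts missing entries, while `mean` skips them.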
All the needed functionality for the above examples will be explained throughout the course, but as a start: the data types to work with.
A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, a database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.
For the examples here, we are going to create a small DataFrame with some data about a few countries.
When creating a DataFrame manually, a common way to do this is from a dictionary of arrays or lists:
In [10]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
Out[10]:
In practice, you will of course often import your data from an external source (text file, Excel, database, ...), which we will see later.
Note that in the IPython notebook, the dataframe will display in a rich HTML view.
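To illustrate the idea of a DataFrame as several Series sharing one index, here is a minimal sketch that builds a small frame from two Series. The variable name `small` and the choice of using country names as the index are assumptions made for this illustration only:

```python
import pandas as pd

# Two Series with the same index labels (a subset of the country data).
population = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3})
area = pd.Series({'Belgium': 30510, 'France': 671308, 'Germany': 357050})

# A dictionary of Series becomes a DataFrame: each key is a column name,
# and the rows are matched on the index labels.
small = pd.DataFrame({'population': population, 'area': area})

print(small)
# Every column of the frame is itself a Series carrying the shared index.
print(small['population'].index.equals(small['area'].index))
```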
In [11]:
countries.index
Out[11]:
By default, the index is the integers 0 through N - 1, where N is the number of rows.
In [12]:
countries.columns
Out[12]:
To check the data types of the different columns:
In [13]:
countries.dtypes
Out[13]:
An overview of that information can be given with the info() method:
In [14]:
countries.info()
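Once the data types are known, they can also be used to select columns. The following is a small sketch, assuming a shortened version of the countries table, that keeps only the numeric columns with `select_dtypes`:

```python
import pandas as pd

# Shortened version of the countries table, for illustration only.
countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany'],
    'population': [11.3, 64.3, 81.3],
    'area': [30510, 671308, 357050],
})

# One dtype per column: text, floating point and integer here.
print(countries.dtypes)

# Keep only the columns with a numeric dtype.
numeric = countries.select_dtypes(include='number')
print(list(numeric.columns))
```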
In [16]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s
Out[16]:
Often, you access a Series representing a column of the data, using the typical [] indexing syntax and the column name:
In [17]:
countries['area']
Out[17]:
In [18]:
s.index
Out[18]:
You can access the underlying numpy array representation with the .values attribute:
In [19]:
s.values
Out[19]:
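As a side note, more recent pandas versions also offer the `to_numpy()` method for the same purpose. A minimal sketch, reusing the small Series from above:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4])

# to_numpy() returns the data as a NumPy array, like .values does here.
arr = s.to_numpy()

print(type(arr), arr.dtype)
```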
We can access series values via the index, just like for NumPy arrays:
In [20]:
s[0]
Out[20]:
Unlike the NumPy array, though, this index can be something other than integers:
In [21]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2
Out[21]:
In [22]:
s2['c']
Out[22]:
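With a non-integer index, it helps to be explicit about whether a value is looked up by label or by position. The sketch below uses the `.loc` (label-based) and `.iloc` (position-based) indexers on the same `s2`; the variable names for the results are only illustrative:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = s2.loc['c']       # look up by index label
by_position = s2.iloc[2]     # look up by integer position (0-based)

# Label-based slices include both endpoints ...
label_slice = s2.loc['b':'c']
# ... while position-based slices exclude the end, as in NumPy.
pos_slice = s2.iloc[1:2]

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```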
Because the index can hold arbitrary labels, a Series object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.
In fact, it's possible to construct a series directly from a Python dictionary:
In [23]:
pop_dict = {'Germany': 81.3,
'Belgium': 11.3,
'France': 64.3,
'United Kingdom': 64.9,
'Netherlands': 16.9}
population = pd.Series(pop_dict)
population
Out[23]:
We can index the populations like a dict as expected ...
In [24]:
population['France']
Out[24]:
... but with the power of numpy arrays. Many things you can do with numpy arrays can also be applied to DataFrames / Series.
E.g. element-wise operations:
In [25]:
population * 1000
Out[25]:
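Element-wise operations also work between two Series, in which case pandas matches the values on their index labels rather than on their position. The following sketch computes a density from a subset of the country data; reading the population figures as millions is an assumption made for this illustration, since the units are not stated here:

```python
import pandas as pd

# Note the different order of the labels in the two Series.
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3})
area = pd.Series({'Belgium': 30510, 'France': 671308, 'Germany': 357050})

# Values are aligned on the index labels before dividing.
# Assumption: population is expressed in millions.
density = population * 1e6 / area

# Comparisons are element-wise too and can be used to filter a Series.
large = population[population > 50]

print(density)
print(large)
```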
Exploration of the Series and DataFrame is essential (check out what you're dealing with).
In [26]:
countries.head() # Top rows
Out[26]:
In [27]:
countries.tail() # Bottom rows
Out[27]:
The describe method computes summary statistics for each column:
In [28]:
countries.describe()
Out[28]:
Sorting your data by a specific column is another important first check:
In [29]:
countries.sort_values(by='population')
Out[29]:
The plot method can be used to quickly visualize the data in different ways:
In [30]:
countries.plot()
Out[30]:
However, for this dataset, the default plot does not say that much. Plotting a single column can be more informative:
In [31]:
countries['population'].plot(kind='barh')
Out[31]:
A wide range of input/output formats is natively supported by pandas:
In [32]:
# pd.read_
In [33]:
# countries.to_
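The `pd.read_` and `countries.to_` prefixes above hint at the families of reader functions and writer methods. As a minimal sketch of one such pair, the example below writes a small frame to CSV and reads it back. It uses an in-memory text buffer instead of a file on disk, which is a choice made for this illustration only:

```python
import io

import pandas as pd

# Shortened version of the countries table, for illustration only.
countries = pd.DataFrame({'country': ['Belgium', 'France'],
                          'population': [11.3, 64.3]})

# Write the frame as CSV text into an in-memory buffer (no file needed).
buffer = io.StringIO()
countries.to_csv(buffer, index=False)  # index=False: do not write the row labels

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```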
Throughout the pandas notebooks, many of the exercises will use the Titanic dataset. This dataset has records of all the passengers of the Titanic, with characteristics of the passengers (age, class, etc.; see below) and an indication of whether they survived the disaster.
The available metadata of the titanic data set provides the following information:
VARIABLE | DESCRIPTION |
---|---|
Survived | Survival (0 = No; 1 = Yes) |
Pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
Name | Name |
Sex | Sex |
Age | Age |
SibSp | Number of Siblings/Spouses Aboard |
Parch | Number of Parents/Children Aboard |
Ticket | Ticket Number |
Fare | Passenger Fare |
Cabin | Cabin |
Embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
In [34]:
df = pd.read_csv("../data/titanic.csv")
In [35]:
df.head()
Out[35]:
In [36]:
len(df)
Out[36]:
In [37]:
df['Age']
Out[37]:
In [38]:
df['Fare'].plot(kind='box')
Out[38]:
In [39]:
df.sort_values(by='Age', ascending=False)
Out[39]:
This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).