02 - Pandas: Basic operations on Series and DataFrames
DS Data manipulation, analysis and visualisation in Python
December, 2019
© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
As you play around with DataFrames, you'll notice that many operations that work on NumPy arrays also work on DataFrames.
In [2]:
# redefining the example objects
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})
countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
                          'population': [11.3, 64.3, 81.3, 16.9, 64.9],
                          'area': [30510, 671308, 357050, 41526, 244820],
                          'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})
In [3]:
countries.head()
Out[3]:
Just like with NumPy arrays, many operations are element-wise:
In [4]:
population / 100
Out[4]:
In [5]:
countries['population'] / countries['area']
Out[5]:
In [6]:
np.log(countries['population'])
Out[6]:
The result of such an operation can be added as a new column, as follows:
In [7]:
countries["log_population"] = np.log(countries['population'])
In [8]:
countries.columns
Out[8]:
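The same pattern works for any element-wise expression. As a minimal sketch, the cell below derives a population density and stores it as a new column. The column name density is made up for this illustration, and the unit conversion assumes that population is expressed in millions and area in km², which the data above suggests but does not state.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],    # assumed: millions of inhabitants
    'area': [30510, 671308, 357050, 41526, 244820],  # assumed: km2
})

# Element-wise arithmetic on two columns; rows are matched on the index.
# 'density' is a hypothetical column name (inhabitants per km2).
countries['density'] = countries['population'] * 1e6 / countries['area']

print(countries[['country', 'density']])
```

Assigning to a column name that does not exist yet creates the column; assigning to an existing name overwrites it.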
In [9]:
countries['population'] > 40
Out[9]:
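A comparison like the one above returns a boolean Series. One possible use, sketched here with the population Series defined earlier, is to count or average the True values. The variable names is_large, n_large and share_large are chosen only for this example.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

is_large = population > 40     # element-wise comparison, one boolean per country
n_large = is_large.sum()       # True counts as 1, so this is the number of matches
share_large = is_large.mean()  # fraction of countries above the threshold

print(is_large)
print(n_large, share_large)
```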
In [11]:
countries["capital"].apply(lambda x: len(x)) # in case you forgot the functionality: countries["capital"].str.len()
Out[11]:
In [12]:
def population_annotater(population):
    """annotate as large or small"""
    if population > 50:
        return 'large'
    else:
        return 'small'
In [13]:
countries["population"].apply(population_annotater) # a custom user function
Out[13]:
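Because apply calls the Python function once per element, a simple two-way rule like this one can often be written as a single vectorised expression instead. The sketch below uses np.where as one possible alternative (not necessarily the approach intended by this notebook) and checks that both give the same labels. The names via_apply and via_where are made up for the example.

```python
import numpy as np
import pandas as pd

population = pd.Series([11.3, 64.3, 81.3, 16.9, 64.9],
                       index=['Belgium', 'France', 'Germany',
                              'Netherlands', 'United Kingdom'])

def population_annotater(population):
    """annotate as large or small"""
    if population > 50:
        return 'large'
    else:
        return 'small'

# One Python function call per element
via_apply = population.apply(population_annotater)

# The condition is evaluated for the whole Series at once;
# np.where returns a NumPy array, so the index is re-attached explicitly.
via_where = pd.Series(np.where(population > 50, 'large', 'small'),
                      index=population.index)

print(via_where)
```

For more than two categories, or logic that does not reduce to a condition, apply with a custom function remains the more readable option.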
Pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce a single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column).
The average population number:
In [14]:
population.mean()
Out[14]:
The minimum area:
In [15]:
countries['area'].min()
Out[15]:
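For DataFrames, pass numeric_only=True to restrict the result to the numeric columns (recent pandas versions raise an error on non-numeric columns instead of silently skipping them):
In [16]:
countries.median(numeric_only=True)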
Out[16]:
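To make the "one value for each column" behaviour concrete, here is a small sketch that selects the two numeric columns explicitly and inspects what a summary function returns. The names numeric and col_means are chosen for this example only.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
})

numeric = countries[['population', 'area']]  # select the numeric columns explicitly
col_means = numeric.mean()                   # a Series, indexed by column name

print(type(col_means))
print(col_means)

# The entry for one column matches the summary of that column on its own
print(col_means['population'], countries['population'].mean())
```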
Reading in the titanic data set...
In [17]:
df = pd.read_csv("../data/titanic.csv")
Quick exploration first...
In [18]:
df.head()
Out[18]:
In [19]:
len(df)
Out[19]:
The available metadata of the titanic data set provides the following information:
VARIABLE | DESCRIPTION |
---|---|
Survived | Survival (0 = No; 1 = Yes) |
Pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
Name | Name |
Sex | Sex |
Age | Age |
SibSp | Number of Siblings/Spouses Aboard |
Parch | Number of Parents/Children Aboard |
Ticket | Ticket Number |
Fare | Passenger Fare |
Cabin | Cabin |
Embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
In [20]:
df['Age'].mean()
Out[20]:
In [21]:
df['Age'].hist() #bins=30, log=True
Out[21]:
In [22]:
df['Survived'].sum() / len(df['Survived'])
Out[22]:
In [23]:
df['Survived'].mean()
Out[23]:
In [24]:
df['Fare'].max()
Out[24]:
In [25]:
df['Fare'].median()
Out[25]:
In [26]:
df['Fare'].quantile(0.75)
Out[26]:
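The median and the quantiles are closely related: the median is the 0.5 quantile, and quantile also accepts a list of fractions. The sketch below shows this on a small hand-made Series; the values are invented for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Hypothetical fares, for illustration only
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28])

median = fares.median()
q50 = fares.quantile(0.5)                   # same value as the median
quartiles = fares.quantile([0.25, 0.5, 0.75])  # a Series indexed by the fractions

print(median, q50)
print(quartiles)
```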
In [27]:
df['Fare'] / df['Fare'].mean()
Out[27]:
In [28]:
np.log(df['Fare'])
Out[28]:
In [29]:
df['Fare_log'] = np.log(df['Fare'])
df.head()
Out[29]:
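One thing to watch for with a log transform: np.log returns -inf for a value of 0. Whether the Fare column contains zeros is not stated here, so the sketch below uses invented values to show the effect, together with np.log1p (which computes log(1 + x)) as one possible workaround. The column name Fare_log1p is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical fares including a zero, for illustration only
toy = pd.DataFrame({'Fare': [0.0, 7.25, 71.28]})

# log(0) is -inf; errstate only silences NumPy's divide-by-zero warning
with np.errstate(divide='ignore'):
    toy['Fare_log'] = np.log(toy['Fare'])

# log1p(x) = log(1 + x) stays finite at x = 0
toy['Fare_log1p'] = np.log1p(toy['Fare'])

print(toy)
```

Note that log1p changes the scale of the transformed values slightly, so the two columns are not interchangeable.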
This notebook is partly based on material by Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).