Practical Optimisations for Pandas

Eyal Trabelsi

About Me ๐Ÿ™ˆ

  • Software Engineer at Salesforce ๐Ÿ‘ท
  • Big passion for python, data and performance optimisations ๐Ÿ๐Ÿค–

What's Pandas?๐Ÿผ

  • Library for data manipulation
  • Dataset on Memory
  • Widely used

Performance?! Why ?๐Ÿคจ

  • Fast is better than slow ๐Ÿ‡
  • Memory efficiency is good ๐Ÿ’พ
  • Hardware will only take you so far ๐Ÿ’ป

  • OK, now that I have your attention, the next question I want to tackle is when we should optimize our code.

When โฐ

  • Since program readability is our top priority (Python aims to make the programmer's life easier), we should only optimize our code when needed.

  • In other words, all optimization is premature...

  • unless
  • Program doesn't meet requirements ๐Ÿš”
  • whether it is too slow for the user
  • or whether it takes too much memory
  • Program execution affects development pace ๐Ÿ‘ท
  • We should only optimize the program's bottlenecks
  • whether in memory or in execution time, in order to stay productive
  • There is a great talk about how this can be done using profiling, and the link is provided (a minimal profiling sketch follows below)
  • Like every refactoring task, you want to keep the same behaviour and return the same result.
  • The best way to feel safe while refactoring is by writing tests
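
As a minimal sketch of what profiling a suspected bottleneck can look like (the profiled function here is a hypothetical stand-in for your slow code path):

In [ ]:
import cProfile

def slow_pipeline():
    # hypothetical stand-in for the code path you suspect is slow
    return sum(i * i for i in range(10 ** 6))

# confirm where time is actually spent before optimizing anything
cProfile.run("slow_pipeline()", sort="cumulative")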

  • Although people mostly think the problem lies in Python itself, that is not actually the case, and I will show you why today.

  • So hopefully, now that I have gained your trust, we can tackle the one-million-dollar question:
  • How can we optimize our pandas code?

Dataset ๐Ÿ“‰


Inย [60]:
! pip install numba numexpr


Requirement already satisfied: numba in /opt/conda/lib/python3.7/site-packages (0.45.1)
Requirement already satisfied: numexpr in /opt/conda/lib/python3.7/site-packages (2.6.9)
Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from numba) (1.17.3)
Requirement already satisfied: llvmlite>=0.29.0dev0 in /opt/conda/lib/python3.7/site-packages (from numba) (0.29.0)

Inย [6]:
import math
import time
import warnings
from dateutil.parser import parse

import janitor
import numpy as np
import pandas as pd
from numba import jit
from sklearn import datasets
from pandas.api.types import is_datetime64_any_dtype as is_datetime

Inย [7]:
warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)
pd.options.display.max_columns = 999

Inย [8]:
path = 'https://raw.githubusercontent.com/FBosler/you-datascientist/master/invoices.csv'

def load_dataset(naively=False):
    # concatenate 20 copies of the invoices file to get a ~1M row dataframe
    df = (pd.concat([pd.read_csv(path)
                       .clean_names()
                       .remove_columns(["meal_id", "company_id"]) 
                     for i in range(20)])
            .assign(meal_tip=lambda x: x.meal_price.map(lambda price: price * 0.2))
            .astype({"meal_price": int})
            .rename(columns={"meal_price": "meal_price_with_tip"}))

    if naively:
        # cast every column to object to simulate a naively typed dataframe
        for col in df.columns:
            df[col] = df[col].astype(object)
    return df

Inย [9]:
df = load_dataset()
df.head()


Out[9]:
order_id date date_of_meal participants meal_price type_of_meal heroes_adjustment meal_tip
0 839FKFW2LLX4LMBB 2016-05-27 2016-05-31 07:00:00+02:00 ['David Bishop'] 469 Breakfast False 93.8
1 97OX39BGVMHODLJM 2018-09-27 2018-10-01 20:00:00+02:00 ['David Bishop'] 22 Dinner False 4.4
2 041ORQM5OIHTIU6L 2014-08-24 2014-08-23 14:00:00+02:00 ['Karen Stansell'] 314 Lunch False 62.8
3 YT796QI18WNGZ7ZJ 2014-04-12 2014-04-07 21:00:00+02:00 ['Addie Patino'] 438 Dinner False 87.6
4 6YLROQT27B6HRF4E 2015-07-28 2015-07-27 14:00:00+02:00 ['Addie Patino' 'Susan Guerrero'] 690 Lunch False 138.0
  • Our dataset has roughly 1M rows of meal invoices.
  • Before I begin, it's important to state that all of these optimizations depend on the size of your data; for a small dataset they might be irrelevant.

How ๐Ÿ‘€

Use What You Need ๐Ÿง‘

  • Keep only the columns you need
  • For many use cases you don't actually need the entire dataset.
  • There may be columns we never use in our analysis or data manipulation, and dropping them can save a lot of memory.
  • Keep only the rows you need
  • Again, for many use cases you don't need all the rows.
  • For example, if you just want to understand the dataset's layout you don't need all 1M rows.
  • This reduces both your memory footprint and execution time (see the sketch below).
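
For example, a minimal sketch, assuming the raw CSV already uses these column names:

In [ ]:
# load only the columns we care about, and only a sample of the rows
columns = ["date", "meal_price", "type_of_meal"]
sample = pd.read_csv(path, usecols=columns, nrows=100_000)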

Dont Reinvent the Wheel ๐ŸŽก

  • Vast ecosystem
  • Use existing solutions
  • Fewer bugs
  • Highly optimized
  • For example, instead of implementing k-means on your own, use the scipy/scikit-learn implementation: not only will you have fewer bugs, it will probably be highly optimized as well.

Avoid Loops โ™พ

  • Pandas is designed for vector operations, which makes loops inefficient.
  • A rookie mistake in pandas is to simply loop over all the rows, either with iterrows or regular iteration.

Bad Option ๐Ÿ˜ˆ


Inย [1]:
def iterrows_original_meal_price(df):
    # iterate row by row and write the value back one cell at a time
    for i, row in df.iterrows():
        df.loc[i, "original_meal_price"] = row["meal_price_with_tip"] - row["meal_tip"]
    return df

Inย [5]:
%%timeit -r 1 -n 1
iterrows_original_meal_price(df)


35min 13s ยฑ 0 ns per loop (mean ยฑ std. dev. of 1 run, 1 loop each)
  • As you might have guessed, this brings an unsatisfying result of over half an hour.
  • There is a much nicer way: use map or apply.

Better Option ๐Ÿคต

  • apply accepts any user defined function that applies a transformation/aggregation on a DataFrame.

Inย [4]:
def apply_original_meal_price(df):
    df["original_meal_price"] = df.apply(lambda x: x['meal_price_with_tip'] - x['meal_tip'], axis=1)
    return df

Inย [5]:
%%timeit 
apply_original_meal_price(df)


22.5 s ยฑ 170 ms per loop (mean ยฑ std. dev. of 7 runs, 1 loop each)
  • As we can see, the performance looks much better: about 22 seconds, roughly 100 times faster than iterrows.
  • But can we do any better? The obvious answer is yes, using vectorization.

100x Improvement In Execution Time โŒ›

Iterrow is evil ๐Ÿ˜ˆ

Best Option ๐Ÿ‘ผ


Inย [65]:
def vectorized_original_meal_price(df):
    df["original_meal_price"] = df["meal_price_with_tip"] - df["meal_tip"] 
    return df
  • Vectorization is the process of executing operations on entire arrays.
  • Pandas/NumPy/SciPy include a generous collection of vectorized functions, from mathematical operations to aggregations.

Inย [66]:
%%timeit 
vectorized_original_meal_price(df)


2.46 ms ยฑ 18.1 ยตs per loop (mean ยฑ std. dev. of 7 runs, 100 loops each)
  • We can see the benefit of the vectorized function right away: we went from 22.5 s to about 2.5 ms, roughly another 9,000x improvement.

Another 8000x Improvement In Execution Time โŒ›

Use Vectorized Operations ๐Ÿ˜‡

Picking the Right Type ๐ŸŒˆ

Motivation ๐Ÿ†


Inย [67]:
ones = np.ones(shape=5000)
ones


Out[67]:
array([1., 1., 1., ..., 1., 1., 1.])

Inย [84]:
types = ['object', 'complex128', 'float64', 'int64', 'int32', 'int16', 'int8', 'bool']
df = pd.DataFrame(dict([(t, ones.astype(t)) for t in types]))
df.memory_usage(index=False, deep=True)


Out[84]:
object        160000
complex128     80000
float64        40000
int64          40000
int32          20000
int16          10000
int8            5000
bool            5000
dtype: int64

Where We Stand ๐ŸŒˆ


Inย [69]:
df = load_dataset(naively=True)

Dataframe Size ๐Ÿ‹๏ธ


Inย [70]:
df.memory_usage(deep=True).sum()


Out[70]:
478844140

Columns Sizes ๐Ÿ‹๏ธ


Inย [71]:
df.memory_usage(deep=True)


Out[71]:
Index                 8002720
order_id             73024820
date                 67022780
date_of_meal         82027880
participants         84977580
meal_price           36012240
type_of_meal         63688760
heroes_adjustment    32076480
meal_tip             32010880
dtype: int64

Inย [72]:
df.dtypes


Out[72]:
order_id             object
date                 object
date_of_meal         object
participants         object
meal_price           object
type_of_meal         object
heroes_adjustment    object
meal_tip             object
dtype: object
  • I hope it's clear that a column's type affects its memory footprint.
  • The following list shows the types pandas supports.

Supported Types ๐ŸŒˆ

  • int64
  • float64
  • bool
  • object
  • datetime64
  • timedelta
  • category: good when the same values occur over and over again (new in version 0.23.0).
  • Sparse types: good when most of the array consists of nulls (new in version 0.24.0).
  • Nullable types: good when the elements are integers/booleans and include nulls (new in version 0.24.0). NaN is a float, so it forces the whole array to be cast to float and gives a bigger memory footprint; the nullable types avoid that (see the sketch below).
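
A small sketch of the nullable-integer point (the values here are made up):

In [ ]:
floats = pd.Series([1, 2, None] * 100_000)                   # NaN forces float64
nullable = pd.Series([1, 2, None] * 100_000, dtype="Int8")   # nullable integer, nulls become <NA>
floats.memory_usage(deep=True), nullable.memory_usage(deep=True)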

So it's pretty obvious that we should aim for the type with the lowest memory footprint that still gives the same functionality.

Optimizing Types ๐ŸŒˆ

  • Loading dataframes with specific types
  • Use to_numeric/to_datetime/to_timedelta functions with downcast parameter
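
For example, a minimal sketch of the downcast parameter, assuming the naively typed dataframe from above (where meal_price and date are still plain object columns):

In [ ]:
# downcast picks the smallest numeric subtype that can hold the values
prices = pd.to_numeric(df["meal_price"], downcast="integer")
dates = pd.to_datetime(df["date"])
prices.dtype, dates.dtype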

Inย [11]:
df = df.astype({'order_id': 'category',
                'date': 'category',
                'date_of_meal': 'category',
                'participants': 'category',
                'meal_price': 'int16',
                'type_of_meal': 'category',
                'heroes_adjustment': 'bool',
                'meal_tip': 'float32'})

Optimized Types ๐ŸŒˆ

Dataframe Size ๐Ÿ‹๏ธ


Inย [75]:
df.memory_usage(deep=True).sum()


Out[75]:
36999962

Columns Sizes ๐Ÿ‹๏ธ


Inย [76]:
df.memory_usage(deep=True)


Out[76]:
Index                8002720
order_id             8963321
date                 2204942
date_of_meal         3942538
participants         5883450
meal_price           2000680
type_of_meal         1000611
heroes_adjustment    1000340
meal_tip             4001360
dtype: int64

Inย [77]:
df.dtypes


Out[77]:
order_id             category
date                 category
date_of_meal         category
participants         category
meal_price              int16
type_of_meal         category
heroes_adjustment        bool
meal_tip              float32
dtype: object

12x Improvement In Memory โŒ›

  • It's important to note that I loaded the dataframe with very naive types in order to emphasize the change.
  • But in most use cases you can expect to at least halve the memory footprint.

Types Magic Explanation ๐Ÿง™

  • A DataFrame is stored in separate blocks depending on the column types.
  • Under the hood, pandas groups the columns into blocks of values of the same type.
  • For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a single NumPy ndarray. Extension arrays get blocks of their own.
  • The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Due to this storage scheme, accessing a slice of values is incredibly fast.

Custom Types ๐Ÿฆธ๐Ÿผ๐Ÿฆธโ€โ™€๏ธ

  • This optimization is for extreme cases, as it requires a lot of effort and skill.
  • Basically, it means implementing your own custom types using the ExtensionArray interface.

There are open-sourced types like cyberpandas for IP-like objects and geopandas for spatial objects.

Pandas Usage ๐Ÿผ

Chunks ๐Ÿฐ

  • Splitting large data to smaller parts
  • Work with large datasets

Inย [52]:
def process(chunk):
    # placeholder for whatever per-chunk work you need
    pass

Inย [53]:
def process_file(huge_file_path, chunksize=10 ** 6):
    # read the file lazily, one chunk of rows at a time
    for chunk in pd.read_csv(huge_file_path, chunksize=chunksize):
        process(chunk)

Mean/Sum/Mode/Min/etc optimization ๐Ÿงฎ

  • Types matter

Inย [32]:
%%timeit
df["meal_price_with_tip"].astype(object).mean()


96 ms ยฑ 499 ยตs per loop (mean ยฑ std. dev. of 7 runs, 10 loops each)

Inย [33]:
%%timeit
df["meal_price_with_tip"].astype(float).mean()


4.27 ms ยฑ 34.9 ยตs per loop (mean ยฑ std. dev. of 7 runs, 100 loops each)

20x Performance ImprovementโŒ›

DataFrame Serialization ๐Ÿ‹

  • Various file formats
  • Loading time
  • Saving time
  • Disk space
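
As a rough sketch of how to compare formats (to_parquet needs pyarrow or fastparquet installed; the file names are arbitrary):

In [ ]:
import os

df.to_csv("invoices.csv", index=False)
df.to_pickle("invoices.pkl")
df.to_parquet("invoices.parquet")

# compare disk space; loading/saving time can be compared the same way with %timeit
{name: os.path.getsize(name) for name in ["invoices.csv", "invoices.pkl", "invoices.parquet"]}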

Query/Eval ๐Ÿงฌ

  • Improve Execution Time ๐Ÿ‘
  • Expect up to 2x faster execution.
  • Improve Memory ๐Ÿ‘
  • The main benefit of eval/query is the saved memory

  • NumPy allocates memory for every intermediate step.

  • eval/query gives you the ability to compute compound expressions element by element, without allocating full intermediate arrays.

  • Not all Operations are supported ๐Ÿ‘Ž
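
For instance, a sketch of a compound arithmetic expression with eval, using this dataset's columns; the expression is evaluated by numexpr without materializing intermediate arrays:

In [ ]:
df.eval("original_meal_price = meal_price_with_tip - meal_tip")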

Example


Inย [28]:
%%timeit
df[df.type_of_meal=="Breakfast"]


103 ms ยฑ 348 ยตs per loop (mean ยฑ std. dev. of 7 runs, 10 loops each)

Inย [29]:
%%timeit
df.query("type_of_meal=='Breakfast'")


82.4 ms ยฑ 223 ยตs per loop (mean ยฑ std. dev. of 7 runs, 10 loops each)

20% Performance ImprovementโŒ›

Use on big datasets ๐Ÿงž

  • The pandas documentation recommends using eval/query only on dataframes with more than about 10,000 rows; for smaller arrays the traditional methods are faster.

Concat vs Append โž•

  • Every append creates a new dataframe object
  • Multiple appends are inefficient
  • Use concat
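
A minimal sketch of the difference (the list of small frames here is just a stand-in):

In [ ]:
pieces = [df.sample(1_000) for _ in range(10)]   # stand-in list of small frames

# inefficient: every append copies all the rows accumulated so far
out = pd.DataFrame()
for piece in pieces:
    out = out.append(piece)

# efficient: collect the pieces and concatenate once
out = pd.concat(pieces, ignore_index=True)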

Sorting Optimization ๐Ÿ“Ÿ

  • Vanilla python
  • Pandas
  • Numpy
  • Pytorch/Tensorflow
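
A rough sketch of comparing some of these levels on the same data (timings depend on your machine):

In [ ]:
values = df["meal_price_with_tip"].to_numpy()

%timeit sorted(values)                            # vanilla Python
%timeit df["meal_price_with_tip"].sort_values()   # pandas
%timeit np.sort(values)                           # numpy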

GroupBy Optimizations ๐Ÿ‘ฉโ€๐Ÿ‘ฉโ€๐Ÿ‘ง

  • Filter early
  • Custom functions are slow
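
A sketch of both ideas on this dataset: filter before grouping, and prefer built-in aggregations over custom functions:

In [ ]:
# filter early so we only group the rows we actually need
breakfast = df[df.type_of_meal == "Breakfast"]

# built-in aggregation: runs in optimized code
breakfast.groupby("date")["meal_price_with_tip"].mean()

# custom Python function: called once per group, much slower
breakfast.groupby("date")["meal_price_with_tip"].apply(lambda prices: prices.mean())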

Merge Optimization ๐Ÿ”

  • Filter/Aggregate early
  • Semi join
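
A sketch of a semi join: keep only the invoices whose order_id appears in another frame, without pulling that frame's columns in (orders_of_interest is a hypothetical stand-in):

In [ ]:
orders_of_interest = df[["order_id"]].drop_duplicates().sample(1_000)   # hypothetical lookup table

# semi join: filter with isin instead of merging the whole frame
filtered = df[df["order_id"].isin(orders_of_interest["order_id"])]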

Compiled Code ?! Why ๐Ÿคฏ

  • Python dynamic nature
  • No compilation optimization
  • Pure Python can be slow

Inย [17]:
def foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator
  • Cython
  • The Cython project converts annotated Python code into compiled C code.
  • Numba
  • The Numba project compiles Python functions into fast machine code using LLVM.

Inย [18]:
%%timeit
df.meal_price_with_tip.map(foo)


17.9 s ยฑ 25.7 ms per loop (mean ยฑ std. dev. of 7 runs, 1 loop each)

Cython and Numba for the rescue ๐Ÿ‘จโ€๐Ÿš’

Cython ๐Ÿคฏ

  • Up to 50x speedup from pure python ๐Ÿ‘
  • Learning Curve ๐Ÿ‘Ž
  • Separated Compilation Step ๐Ÿ‘Ž ๐Ÿ‘
  • The compilation is both a con and a pro
  • on one hand, there is additional work to integrate the Cython code
  • on the other hand, there is no compilation overhead at runtime

Inย [19]:
%load_ext Cython

Example


Inย [20]:
%%cython
def cython_foo(long N):
    cdef long accumulator
    accumulator = 0

    cdef long i
    for i in range(N):
        accumulator += i

    return accumulator

Inย [21]:
%%timeit
df.meal_price_with_tip.map(cython_foo)


365 ms ยฑ 2 ms per loop (mean ยฑ std. dev. of 7 runs, 1 loop each)

49x Performance ImprovementโŒ›

Numba ๐Ÿคฏ

  • Up to 200x speedup from pure python ๐Ÿ‘
  • Easy ๐Ÿ‘

Using Numba is really easy: you simply add a decorator to a function.

  • Highly Configurable ๐Ÿ‘
  • Debugging ๐Ÿ‘Ž ๐Ÿ‘
  • Debugging is both a con and a pro
  • on one hand, debugging the business logic is easy: we can just remove the decorator and debug it as plain Python code
  • on the other hand, if there is an issue in the compiled code itself, it is not easy to debug.
  • Mostly Numeric ๐Ÿ‘Ž

Example


Inย [22]:
@jit(nopython=True)
def numba_foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

Inย [23]:
%%timeit
df.meal_price_with_tip.map(numba_foo)


414 ms ยฑ 596 ยตs per loop (mean ยฑ std. dev. of 7 runs, 1 loop each)

43x Performance ImprovementโŒ›


Inย [ย ]:
from numba import prange

@jit(nopython=True, parallel=True)
def numba_foo(N):
    # parallel=True only kicks in when the loop uses prange
    accumulator = 0
    for i in prange(N):
        accumulator = accumulator + i
    return accumulator

Inย [ย ]:
%%timeit
df.meal_price_with_tip.map(numba_foo)

1๏ธโƒฃ Vectorized methods

2๏ธโƒฃ Numba

3๏ธโƒฃ Cython

General Python Optimizations ๐Ÿ

  • Since pandas is written in Python, we can also optimize our pandas code with general Python optimizations.

Caching ๐ŸŽ

  • Avoid unnecessary work/computation.
  • Faster code
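
A minimal sketch with functools.lru_cache; the parsing function is a hypothetical example of an expensive computation applied to repeated values, assuming the date column holds raw date strings:

In [ ]:
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_date(raw):
    # each distinct string is parsed only once; repeats hit the cache
    return parse(raw)

df["date"].map(parse_date)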

Intermediate Variables๐Ÿ‘ฉโ€๐Ÿ‘ฉโ€๐Ÿ‘งโ€๐Ÿ‘ง

  • Intermediate calculations
  • Memory footprint of both objects
  • Smarter variable allocation

Inย [32]:
def another_foo(data):
    return data * 2

def foo(data):
    return data + 10

Example


Inย [28]:
%reload_ext memory_profiler

Inย [37]:
def load_data():
    return np.ones((2 ** 30), dtype=np.uint8)

Inย [50]:
%%memit
def proccess():
    data = load_data()
    return another_foo(foo(data))

proccess()


peak memory: 8106.62 MiB, increment: 3042.64 MiB

Inย [51]:
%%memit
def proccess():
    data = load_data()
    data = foo(data)
    data = another_foo(data)
    return data

proccess()


peak memory: 7102.64 MiB, increment: 2038.66 MiB

Concurrency And Parallelism ๐ŸŽธ๐ŸŽบ๐ŸŽป๐ŸŽท

  • pandas methods use a single process
  • CPU-bound work can benefit from parallelism
  • I/O-bound work is very slow without parallelism/concurrency (see the sketch below)
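
A rough sketch of parallelising a CPU-bound transformation with multiprocessing; the worker function is a hypothetical stand-in, and for cheap operations the process overhead can outweigh the gain:

In [ ]:
import multiprocessing as mp

def heavy_transform(part):
    # hypothetical CPU-bound work on one piece of the frame
    return part["meal_price_with_tip"] * 2

if __name__ == "__main__":
    parts = np.array_split(df, mp.cpu_count())
    with mp.Pool() as pool:
        result = pd.concat(pool.map(heavy_transform, parts))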

High Performance Python Book ๐Ÿ“–

Pandas Alternatives ๐Ÿจ๐Ÿป

  • Libraries that provide dataframe apis
  • No free lunch ๐Ÿฅข

Each alternative has trade-offs: cuDF is weak at fillna but the best at groupby, Modin can be slower in some cases, and cuDF is harder to set up.

How ๐Ÿ‘€

  • Use What You Need ๐Ÿ’พโŒ›
  • Dont Reinvent the Wheel โŒ›๐Ÿ’พ
  • Avoid Loops โŒ›
  • Picking the Right Types ๐Ÿ’พโŒ›
  • Pandas Usage โŒ›๐Ÿ’พ
  • Compiled Code โŒ›
  • General Python Optimizations โŒ›๐Ÿ’พ
  • Pandas Alternatives โŒ›๐Ÿ’พ

Additional Resources ๐Ÿ“š