A Simple Function to Process Pandas DataFrames in Parallel Across Multiple Cores

Introduction

Python's Pandas library for data processing is awesome. However, one thing it doesn't support out of the box is parallel processing across multiple cores.

I always thought of multiprocessing as something complicated to implement, but then I found this truly awesome blog post. It shows how to apply an arbitrary Python function to each object in a sequence, in parallel, using Pool.map from the multiprocessing library.

The author's example involves running urllib2.urlopen() across a list of URLs, to scrape HTML from several websites in parallel. But the principle applies equally to mapping a function across several columns in a Pandas DataFrame. Here's an example of how useful that can be.

The Kaggle/Bosch Dataset


In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from datetime import datetime

In [ ]:
start = datetime.now()
bosch_train_cat = pd.read_csv('datasets/bosch_train_categorical.csv.zip', 
                              #usecols=['L0_S1_F25', 'L0_S1_F27'], 
                              dtype='object')
print('loaded train in', datetime.now() - start)

In [30]:
bosch_train_cat.head(5)


Out[30]:
L0_S1_F25 L0_S1_F27
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

In [25]:
col = bosch_train_cat['L0_S1_F25']

factorized = pd.factorize(col)

#le = LabelEncoder()
#factorized2 = le.fit_transform(col)

In [26]:
print(factorized[0])

#for col, representation in zip(col, factorized2):
#    print(col, representation)


#print(factorized2[0:5])


[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
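All those -1s are not a bug: pd.factorize encodes missing values with the sentinel code -1, and this column is almost entirely NaN. A small illustration on made-up values:

```python
import numpy as np
import pandas as pd

# pd.factorize maps each distinct value to an integer code, in order of
# first appearance; missing values (NaN) get the sentinel code -1.
s = pd.Series(['T1', np.nan, 'T2', 'T1', np.nan])
codes, uniques = pd.factorize(s)
print(codes)    # NaN positions become -1; 'T1' -> 0, 'T2' -> 1
print(uniques)  # the distinct non-missing values
```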

Order of operations:

  • In parallel: generate the list of columns which are always NaN in train or test
  • Then: drop those column names
  • Then, in parallel again: factorize train and test
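The final factorize step could be sketched like this; parallel_factorize and the demo frame are hypothetical names for illustration, not from the original notebook:

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd

def factorize_column(col):
    # Return only the integer codes; results are sent back to the
    # parent process, so they must be picklable.
    return pd.factorize(col)[0]

def parallel_factorize(df, n_workers=4):
    """Factorize every column of df in parallel, returning a new DataFrame."""
    columns = [df[name] for name in df.columns]
    with Pool(n_workers) as pool:
        # Pool.map preserves order, so codes line up with df.columns.
        coded = pool.map(factorize_column, columns)
    return pd.DataFrame(dict(zip(df.columns, coded)))

if __name__ == '__main__':
    demo = pd.DataFrame({'a': ['x', 'y', 'x'], 'b': [np.nan, 'z', 'z']})
    print(parallel_factorize(demo, n_workers=2))
```

For a frame as wide as the Bosch categorical data, mapping over columns like this keeps every core busy, at the cost of serializing each column to and from the worker processes.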