Python's Pandas library for data processing is awesome. However, one thing it doesn't support out of the box is parallel processing across multiple cores.
I always thought of multiprocessing as something complicated to implement, but then I found this truly awesome blog post. It shows how to apply an arbitrary Python function to each object in a sequence, in parallel, using Pool.map from the multiprocessing module in the standard library.
The author's example involves running urllib2.urlopen() across a list of URLs, to scrape HTML from several websites in parallel. But the principle applies equally to mapping a function across several columns in a Pandas DataFrame, as the sketch below shows. Here's an example of how useful that can be.
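To make the pattern concrete, here is a minimal sketch of that idea. It is my own illustration, not code from the blog post: the worker function, the toy DataFrame, and the pool size of 4 are all made up.

from multiprocessing import Pool

import pandas as pd

def n_unique(col):
    # any per-column function can go here; this one counts distinct values
    return col.nunique()

if __name__ == '__main__':
    df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'y', 'y']})
    # Pool.map sends each column (a pandas Series) to a worker process
    with Pool(4) as pool:
        results = pool.map(n_unique, [df[name] for name in df.columns])
    print(results)  # one result per column: [2, 2]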
In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
In [ ]:
start = datetime.now()
# read the zipped Bosch categorical training data as strings
bosch_train_cat = pd.read_csv('datasets/bosch_train_categorical.csv.zip',
                              # usecols=['L0_S1_F25', 'L0_S1_F27'],  # uncomment to load only a subset
                              dtype='object')
print('loaded train in', datetime.now() - start)
In [30]:
bosch_train_cat.head(5)
Out[30]:
(first five rows of the categorical training data; output table omitted)
In [25]:
col = bosch_train_cat['L0_S1_F25']
factorized = pd.factorize(col)  # returns a tuple: (integer codes, array of unique values)
# an equivalent encoding using scikit-learn:
#le = LabelEncoder()
#factorized2 = le.fit_transform(col)
In [26]:
print(factorized[0])  # the integer codes; factorized[1] holds the unique values
#for value, code in zip(col, factorized2):
#    print(value, code)
#print(factorized2[0:5])
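Putting this together with the idea from the intro: below is a hedged sketch, mine rather than the original author's code, that maps pd.factorize over every column of bosch_train_cat with Pool.map. The pool size of 4 is arbitrary, and on platforms that spawn rather than fork worker processes the worker function would need to live in an importable module instead of a notebook cell.
In [ ]:
from multiprocessing import Pool

def factorize_codes(col):
    # pd.factorize returns (integer codes, unique values); keep only the codes
    return pd.factorize(col)[0]

with Pool(4) as pool:
    # each worker task factorizes one pandas Series
    codes = pool.map(factorize_codes,
                     [bosch_train_cat[c] for c in bosch_train_cat.columns])
# reassemble the encoded columns into a DataFrame of integer codes
bosch_encoded = pd.DataFrame(dict(zip(bosch_train_cat.columns, codes)))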