Premature optimization is the root of all evil
A good framework to keep in mind is:
Make it work -- first make sure your code is working properly, and make sure you save/version-control the correct copy. Fast but buggy code doesn't help anyone.
Make it clean -- make your code easy to read and understand. This'll help you see where the potential pinch points are, and it'll also help others read your code. If your code is an unmaintainable mess, people (including future-you) aren't going to want to touch it, even if it is fast.
Make it fast -- now, and only now, can you start speeding things up.
I'll assume you've already done #1 and #2. You're now interested in #3.
Run things in parallel. My desktop has 12 cores; if I only run my analysis on 1 core while leaving the other 11 idle, I'll be sitting there all day like a chump. Using multiprocessing.Pool.map is an easy way to get started (see the sketch after this list). For more complex parallelism, check out AMS250 - High Performance Computing.
Switch to a compiled language. This is a pain, especially if you're using a lot of special functions from python packages. But if you're using >100 cores, you have relatively few dependencies, and your thesis depends on large numerical calculations, then this is your best bet.
Cython-ize your code. Cython theoretically lets you get the benefits of compiled C code, while still having easy access from python. Jake VanderPlas has a recent demo. In practice it's a bit of a pain.
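For the parallel option, here's a minimal sketch of multiprocessing.Pool.map. The analyze_one function and the choice of 4 workers are made up for illustration; also note that on platforms that spawn (rather than fork) worker processes, the worker function generally needs to live in an importable .py file rather than a notebook cell.

import numpy as np
from multiprocessing import Pool

def analyze_one(seed):
    # stand-in for one independent unit of work (one file, one snapshot, ...)
    image = np.random.RandomState(seed).random_sample((1000, 1000))
    return (image**2).mean()**.5

if __name__ == "__main__":
    with Pool(processes=4) as pool:   # use up to 4 of your cores
        results = pool.map(analyze_one, range(12))
    print(results)

Each call to analyze_one is independent, so Pool.map can farm them out to separate processes and collect the results in order.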
Vectorize your for loops (or, if that's not possible, numba them -- see below). Any time you use for i in range(<size>):, it should raise a red flag in your mind. Take a moment to think: does the functionality exist within numpy?
This is generally possible if you're doing element-wise algebra (which happens automagically with numpy arrays) or a common operation (max, sum, mean, etc.); if you're doing a common operation, it's worth searching to see whether numpy already has it.
In [1]:
import numpy as np
In [2]:
image_size = 1000
image = np.random.random((image_size, image_size))
In [3]:
%%timeit
sum_sq = 0
for i in range(image_size):
    for j in range(image_size):
        sum_sq += image[i, j]**2
rms = (sum_sq / (image_size**2))**.5
In [4]:
%%timeit
rms = (image**2).mean()**.5
Wow! A factor of ~100 speed-up! (And it's a bit easier to read.)
Note: This works great with algebra, but doesn't tend to work for things like calculus/numerical integration from many different starting points.
numpy.vectorize exists, but it's primarily there to make code pretty, not to make it fast. It makes things look vectorized, but under the hood it's just a for loop.
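A quick sketch of that point (slow_square and vec_square are made-up names for illustration): the np.vectorize version still calls the Python function once per element, while a true ufunc does the loop in C.

import numpy as np

def slow_square(x):
    return x**2

vec_square = np.vectorize(slow_square)  # looks vectorized...
arr = np.random.random(100000)

slow = vec_square(arr)   # ...but still calls slow_square once per element
fast = arr**2            # a true numpy ufunc; the loop happens in C

Timing these two with %timeit should show the np.vectorize version is roughly as slow as an explicit for loop.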
In [5]:
size=1000
xs = range(size)
ys = range(size)
In [6]:
%%timeit
X = np.empty((size, size))
Y = np.empty((size, size))
for i, x in enumerate(xs):
    for j, y in enumerate(ys):
        X[i, j] = x
        Y[i, j] = y
In [7]:
%%timeit
Y, X = np.meshgrid(xs, ys)
Pretty good! Still about a 100x speed-up!
If you look at the numpy documentation for meshgrid, you'll see that it suggests "index_tricks". These can also be powerful, but they quickly become much less readable, so we're not going to talk about them now. Next, let's look at timing bigger chunks of code with %%time.
In [8]:
%%time
size=20000
xs = range(size)
ys = range(size)
Y, X = np.meshgrid(xs, ys)
np.savetxt("tmp.txt", X)
In [9]:
!ls -lh tmp.txt
In [10]:
%%time
size=20000
xs = range(size)
ys = range(size)
Y, X = np.meshgrid(xs, ys)
# np.savetxt("tmp.txt", X)
Comparing the two cells above, nearly all of the runtime comes from np.savetxt (writing a huge text file), not from building the mesh grid. You can also time separate parts of code by hand:
In [11]:
import datetime
In [12]:
size=2000
start_time_computation = datetime.datetime.now()
xs = range(size)
ys = range(size)
Y, X = np.meshgrid(xs, ys)
end_time_computation = datetime.datetime.now()
start_time_io = datetime.datetime.now()
np.savetxt("tmp.txt", X)
end_time_io = datetime.datetime.now()
In [13]:
run_time_computation = end_time_computation - start_time_computation
run_time_io = end_time_io - start_time_io
print("Run time (computation): ", run_time_computation)
print("Run time (io): ", run_time_io)
This works fine for quick checks, but it assumes you already know which parts of the code are worth timing, and it means sprinkling timing boilerplate throughout your code by hand. So it's pretty limited for more complex bits of code. For those, a line profiler (the line_profiler package, used via the %lprun magic below) will time every line of a function for you:
In [14]:
def simple_sum_sq(size):
    image = np.random.random((size, size))
    sum_sq = 0
    for i in range(size):
        for j in range(size):
            sum_sq += image[i, j]**2
In [15]:
%load_ext line_profiler
In [16]:
%lprun -f simple_sum_sq simple_sum_sq(200)
Example output:
Timer unit: 1e-06 s
Total time: 0.060595 s
File: <ipython-input-20-d9bf5ccb538f>
Function: simple_sum_sq at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def simple_sum_sq(size):
2 1 821.0 821.0 1.4 image = np.random.random((size, size))
3
4 1 1.0 1.0 0.0 sum_sq = 0
5 201 95.0 0.5 0.2 for i in range(size):
6 40200 16937.0 0.4 28.0 for j in range(size):
7 40000 42741.0 1.1 70.5 sum_sq += image[i, j]**2
That's a pretty nice overview of how long each line takes and how often it gets run. While the image = ... line takes the longest per hit, the sum_sq += ... line gets run far more often, and ends up taking ~70% of the runtime.
I show a more complex profiling example near the bottom of this notebook ("Real World Example")
Function-level profiling with cProfile is great if you want to see which functions are slow, but don't care where they are located within your code. This is good for a very complex project with lots of nested functions: that way you know that calculate_pressure is taking up a bunch of time, even though it's called from 20 locations in your code (and at each location it's not noticeably slow).
(Note: I haven't used cProfile much, so I don't know too much about it...)
In [17]:
import cProfile
In [18]:
def create_mesh_grid(size):
    xs = range(size)
    ys = range(size)
    Y, X = np.meshgrid(xs, ys)
In [19]:
cProfile.run("create_mesh_grid(20000)")
(Sorry that this example really isn't complicated enough to truly show the power of cProfile)
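To give a better sense of the use case described above, here's a hypothetical sketch (the function names and the toy "physics" are made up): calculate_pressure is called from two different places, and cProfile's per-function totals lump those calls together even though neither call site looks slow on its own.

import cProfile
import numpy as np

def calculate_pressure(density, temperature):
    return density * temperature          # stand-in for a real equation of state

def update_hydro(state):
    return calculate_pressure(state["rho"], state["T"]) * 0.5

def write_diagnostics(state):
    return calculate_pressure(state["rho"], state["T"]).sum()

def run_simulation(n_steps):
    state = {"rho": np.random.random(100000), "T": np.random.random(100000)}
    for _ in range(n_steps):
        update_hydro(state)
        write_diagnostics(state)

cProfile.run("run_simulation(200)", sort="cumulative")

Sorting by cumulative time makes it easier to spot helpers like calculate_pressure that are individually cheap but called all over the place.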
So you found a simple function, which is run a bunch of times, and it's making your code super slow. How do you make it faster?
numba is a great approach for keeping the simplicity + readability of pure python, while still getting the speed-up that you might get from compiled C (either native C or Cython). It uses a "just-in-time" (JIT) compiler: if you use a function repeatedly, with predictable inputs, it'll compile a version of that function for those inputs.
In [20]:
def simple_sum_sq(size):
    image = np.random.random((size, size))
    sum_sq = 0
    for i in range(size):
        for j in range(size):
            sum_sq += image[i, j]**2
In [21]:
%%timeit
simple_sum_sq(1000)
In [22]:
import numba
(all I've changed in the next cell is adding the decorator, @numba.jit())
In [23]:
@numba.jit()
def simple_sum_sq_new(size):
    image = np.random.random((size, size))
    sum_sq = 0
    for i in range(size):
        for j in range(size):
            sum_sq += image[i, j]**2
In [24]:
%%timeit
simple_sum_sq_new(1000)
That's pretty good. We didn't change a single bit of our code; we just added a decorator, and now we have a ~50x speed-up. Not quite the ~100x from vectorizing with numpy, but still pretty good given that it required less work.
In general, this is great for simple, mathematical, repetitive tasks with predictable input types. You probably can't just @numba.jit your entire code. But, hey, I've never tried.
As an added bonus, numba doesn't conflict with vectorizing your code; in some cases you can even get a small extra speed-up from doing both. (The example below uses nopython=True, which asks numba to compile the whole function or fail loudly, rather than silently fall back to slower object mode.)
In [25]:
def calculate_kinetic_energy(mass, velocity):
    E_kin = (1/2) * mass * velocity**2
    return E_kin
In [26]:
%%timeit
masses = np.random.random(10000)
velocities = np.random.random(masses.shape)
calculate_kinetic_energy(masses, velocities)
In [27]:
@numba.jit(nopython=True)
def calculate_kinetic_energy_jit(mass, velocity):
    E_kin = (1/2) * mass * velocity**2
    return E_kin
In [28]:
%%timeit
masses = np.random.random(10000)
velocities = np.random.random(masses.shape)
calculate_kinetic_energy_jit(masses, velocities)
This comes from my 1st+2nd year project: https://github.com/egentry/clustered_SNe/commit/848cca6a08c3acb1b666e094cb5ae387480cd224
I've simplified things here, but the core idea is the same.
In [29]:
import pandas as pd
In [30]:
cols = ["Radius", "dR", "dV", "Density",
"Pressure", "Velocity", "Z",
"Temperature", "Energy", "Entropy",
"Mass", "M_int", "C_ad", "Crossing_time"]
cols_in = cols[:-7]
In [31]:
from helper_functions import calculate_mass, \
calculate_mean_molecular_weight, \
calculate_kinetic_energy, \
calculate_internal_energy, \
calculate_momentum, \
calculate_c_ad, \
calculate_entropy, \
calculate_temperature, \
calculate_w_cell, \
calculate_crossing_time
In [32]:
def process_checkpoints(num_checkpoints):
    df = pd.DataFrame()
    for k in range(num_checkpoints):
        # Create some fake data
        array_tmp = np.random.random((1000, 7))
        array_tmp = array_tmp[1:-1]  # drop cells with bad data
        # transform the array into a dataframe
        index = pd.MultiIndex.from_product([[k], np.arange(array_tmp.shape[0])],
                                           names=["k", "i"])
        df_tmp = pd.DataFrame(array_tmp, columns=cols_in, index=index)
        df_tmp["Mass"] = calculate_mass(df_tmp.Density.values,
                                        df_tmp.dV.values)
        df_tmp["M_int"] = df_tmp.Mass.cumsum()
        df_tmp["Temperature"] = calculate_temperature(df_tmp.Pressure.values,
                                                      df_tmp.Density.values, .67)
        df_tmp["Energy"] = calculate_internal_energy(df_tmp.Mass.values,
                                                     df_tmp.Pressure.values,
                                                     df_tmp.Density.values) \
                           / df_tmp.Mass.values
        df_tmp["C_ad"] = calculate_c_ad(df_tmp.Pressure.values,
                                        df_tmp.Density.values)
        w_cell = calculate_w_cell(df_tmp.Velocity.values)
        df_tmp["Crossing_time"] = calculate_crossing_time(df_tmp.C_ad.values,
                                                          df_tmp.Velocity.values,
                                                          w_cell,
                                                          df_tmp.dR.values)
        df = pd.concat([df, df_tmp])
    return df
In [33]:
%time df_old = process_checkpoints(600)
In [34]:
df_old.head()
Out[34]:
When I was just processing a single simulation, 30sec wasn't too bad, so I didn't worry about it. But then I wanted to process batches of hundreds of simulations. In order to do that I had to speed things up.
But what's the major slow down? Let's try line profiling!
In [35]:
%lprun -f process_checkpoints process_checkpoints(300)
Sample output:
Timer unit: 1e-06 s
Total time: 8.40759 s
File: <ipython-input-34-c7b04c4d293b>
Function: process_checkpoints at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def process_checkpoints(num_checkpoints):
2 1 800.0 800.0 0.0 df = pd.DataFrame()
3 301 863.0 2.9 0.0 for k in range(num_checkpoints):
4
5 # Create some fake data
6 300 27055.0 90.2 0.3 array_tmp = np.random.random((1000,7))
7 300 851.0 2.8 0.0 array_tmp = array_tmp[1:-1] # drop cells with bad data
8
9
10 # transform the array into a dataframe
11 300 2051.0 6.8 0.0 index = pd.MultiIndex.from_product([[k], np.arange(array_tmp.shape[0])],
12 300 496092.0 1653.6 5.9 names=["k","i"])
13 300 102367.0 341.2 1.2 df_tmp = pd.DataFrame(array_tmp, columns=cols_in, index = index)
14
15 300 47172.0 157.2 0.6 df_tmp["Mass"] = calculate_mass(df_tmp.Density.values,
16 300 207859.0 692.9 2.5 df_tmp.dV.values)
17 300 297848.0 992.8 3.5 df_tmp["M_int"] = df_tmp.Mass.cumsum()
18 300 37834.0 126.1 0.4 df_tmp["Temperature"] = calculate_temperature(df_tmp.Pressure.values,
19 300 196617.0 655.4 2.3 df_tmp.Density.values, .67)
20 300 37470.0 124.9 0.4 df_tmp["Energy"] = calculate_internal_energy(df_tmp.Mass.values,
21 300 30342.0 101.1 0.4 df_tmp.Pressure.values,
22 300 32599.0 108.7 0.4 df_tmp.Density.values) \
23 300 171803.0 572.7 2.0 / df_tmp.Mass.values
24 300 37491.0 125.0 0.4 df_tmp["C_ad"] = calculate_c_ad(df_tmp.Pressure.values,
25 300 197602.0 658.7 2.4 df_tmp.Density.values)
26
27 300 53682.0 178.9 0.6 w_cell = calculate_w_cell(df_tmp.Velocity.values)
28 300 32733.0 109.1 0.4 df_tmp["Crossing_time"] = calculate_crossing_time(df_tmp.C_ad.values,
29 300 5134.0 17.1 0.1 df_tmp.Velocity.values,
30 300 349.0 1.2 0.0 w_cell,
31 300 195733.0 652.4 2.3 df_tmp.dR.values )
32
33 300 6195246.0 20650.8 73.7 df = pd.concat([df, df_tmp])
34
35 1 1.0 1.0 0.0 return df
So the real problem was with pd.concat. Since I didn't initially tell it how large the final dataframe would be, it had to keep creating new objects and copying data from the old object. (This led to $\mathcal{O}(n^2)$ scaling with respect to the number of checkpoints.)
If instead I create a dataframe with the correct size to begin with, it should go much faster (and be $\mathcal{O}(n)$).
Before we do that, I'll just show an empty version of the dataframe, to give you a sense of what I mean.
In [36]:
dataframe_columns = ["Radius",
"dR",
"dV",
"Density",
"Pressure",
"Velocity",
"Z",
"Mass",
"M_int",
"Temperature",
"Energy",
"C_ad",
"Crossing_time",
]
num_checkpoints=10
num_zones=20
index = pd.MultiIndex.from_product([np.arange(num_checkpoints),
np.arange(num_zones)],
names=["k","i"])
df = pd.DataFrame(index=index,
columns=dataframe_columns,
dtype=float)
In [37]:
df
Out[37]:
In [38]:
def process_checkpoints_new(num_checkpoints):
    num_zones = 1000
    dataframe_columns = ["Radius",
                         "dR",
                         "dV",
                         "Density",
                         "Pressure",
                         "Velocity",
                         "Z",
                         "Mass",
                         "M_int",
                         "Temperature",
                         "Energy",
                         "C_ad",
                         "Crossing_time",
                         ]
    index = pd.MultiIndex.from_product([np.arange(num_checkpoints),
                                        np.arange(num_zones-2)],
                                       names=["k", "i"])
    df = pd.DataFrame(index=index,
                      columns=dataframe_columns,
                      dtype=float)
    for k in range(num_checkpoints):
        # Create some fake data
        array_tmp = np.random.random((num_zones, 7))
        array_tmp = array_tmp[1:-1]  # drop cells with bad data
        # adding the data into the dataframe is much more complicated now
        df_tmp = df.loc[k]
        df_tmp[cols_in] = array_tmp
        df_tmp["Mass"] = calculate_mass(df_tmp.Density.values,
                                        df_tmp.dV.values)
        df_tmp["M_int"] = df_tmp.Mass.cumsum()
        df_tmp["Temperature"] = calculate_temperature(df_tmp.Pressure.values,
                                                      df_tmp.Density.values, .67)
        df_tmp["Energy"] = calculate_internal_energy(df_tmp.Mass.values,
                                                     df_tmp.Pressure.values,
                                                     df_tmp.Density.values) \
                           / df_tmp.Mass.values
        df_tmp["C_ad"] = calculate_c_ad(df_tmp.Pressure.values,
                                        df_tmp.Density.values)
        w_cell = calculate_w_cell(df_tmp.Velocity.values)
        df_tmp["Crossing_time"] = calculate_crossing_time(df_tmp.C_ad.values,
                                                          df_tmp.Velocity.values,
                                                          w_cell,
                                                          df_tmp.dR.values)
    return df
In [39]:
%time df_new = process_checkpoints_new(600)
Great! A factor of ~10 speed-up, on a mostly-real-world problem.
Note though: if you've used pandas, you probably know that you have to be careful that df_tmp = df.loc[k] gives you a view rather than a copy (otherwise the derived columns never make it back into df). So speeding things up is great, but it often comes at a cost in complexity.
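To make that caveat concrete, here's a minimal sketch of the pitfall on a toy frame (not the real data): chained indexing may silently write to a copy, whereas a single .loc call writes through unambiguously.

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([range(2), range(3)], names=["k", "i"])
df = pd.DataFrame(index=index, columns=["x"], dtype=float)

sub = df.loc[0]
sub["x"] = 1.0                               # may write to a copy; pandas will often warn

df.loc[0, "x"] = np.arange(3, dtype=float)   # writes through to df unambiguously

And if you'd rather avoid the view-vs-copy bookkeeping entirely, another common pattern (not what the original commit did) is to build each per-checkpoint frame as before, collect them in a list, and call pd.concat once at the end; a single concat is roughly O(n) in the number of checkpoints, unlike concatenating inside the loop. A sketch, reusing cols_in from above and leaving out the derived columns for brevity:

def process_checkpoints_list(num_checkpoints):
    frames = []
    for k in range(num_checkpoints):
        array_tmp = np.random.random((1000, 7))[1:-1]
        index = pd.MultiIndex.from_product([[k], np.arange(array_tmp.shape[0])],
                                           names=["k", "i"])
        frames.append(pd.DataFrame(array_tmp, columns=cols_in, index=index))
    return pd.concat(frames)   # one concat at the end, instead of one per iteration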