Sometimes our scripts take a lot of time or memory to run. This happens especially in machine learning, when we train our algorithms on large data sets. Profiling a program is a way of finding out where it becomes slow and where exactly its memory bottlenecks are. Luckily, IPython makes it very easy to profile a Python script.
Let's assume we have a few functions that perform different calculations. We define them here directly, but they could also come from a .py file and be imported into IPython.
In [ ]:
from time import sleep
def foo():
    print('foo: calculating heavy stuff...')
    sleep(1)

def bar():
    print('bar: calculating heavy stuff...')
    sleep(2)

def baz():
    foo()
    bar()
IPython provides a few so-called magic functions that start with % and are not part of the Python language. For profiling there are %time and %timeit.
In [ ]:
%time baz()
In [ ]:
%timeit baz()
As you can see, %time runs a function and measures its execution time. %timeit runs a function several times and outputs statistics about the execution times.
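If the defaults take too long for a slow function like baz, the number of loops and repeats can be set explicitly with %timeit's -n and -r options, for example 3 loops and 2 repeats:
%timeit -n 3 -r 2 baz()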
Very often, though, we know perfectly well that our function foo is slow, we just don't know which part exactly makes it that slow. That's where we can load the line_profiler extension and output statistics about every single line.
Usage:
%lprun -f function_to_profile function_to_call()
Let's call baz() and output the details for foo() and baz():
In [ ]:
%load_ext line_profiler
%lprun -f foo -f baz baz()
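By default the line-by-line report opens in the pager. If you want to keep it around, recent line_profiler versions also accept a -T option that additionally writes the report to a text file (the file name here is only an example):
%lprun -T lprun_report.txt -f foo -f baz baz()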
Memory profiling works very similarly. Unfortunately, the necessary dependencies are not met on the Windows machines here. We still show an example of what it could look like.
Let's start with a function that uses a bit of memory:
def moo():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a
Again there is an IPython magic function for this, called %memit (provided by the memory_profiler extension):
%memit moo()
And the memory_profiler module also provides line-by-line profiling, similar to line_profiler. However, it only works if the function to profile is defined in a file, which is why we can't run it here in the Notebook. Running the following code ...
%load_ext memory_profiler
from my_module import moo
%mprun -f moo moo()
... would look like this:
Line    Mem usage     Increment    Line Contents
   4      5.97 MB       0.00 MB    def my_func():
   5     13.61 MB       7.64 MB        a = [1] * (10 ** 6)
   6    166.20 MB     152.59 MB        b = [2] * (2 * 10 ** 7)
   7     13.61 MB    -152.59 MB        del b
   8     13.61 MB       0.00 MB        return a
In [ ]:
import numpy as np
import nltk
# download Brown corpus if necessary
nltk.download('brown')
# make all words from corpus lower-case
brown = [word.lower() for word in nltk.corpus.brown.words()]
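Just to get a feeling for the input size (the Brown corpus contains roughly one million word tokens):
len(brown)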
Run the following code and get yourself a cup of coffee...
In [ ]:
def print_statistics_slow(corpus):
    # data structures for counting words
    word_list = []
    counter_list = []
    # calculate statistics of words
    for word in corpus:
        if word in word_list:
            index = word_list.index(word)
            counter_list[index] = counter_list[index] + 1
        else:
            word_list.append(word)
            counter_list.append(1)
    # create a list of indices sorted according to word occurrences
    sorted_indices = np.argsort(counter_list)
    # only keep last ten entries of the index list
    sorted_indices = sorted_indices[-10:]
    # switch order of list
    sorted_indices = sorted_indices[::-1]
    print('most common words:')
    for index in sorted_indices:
        print('%s (%d)' % (word_list[index], counter_list[index]))
    print('\nOverall there are %d unique words in the corpus.' % len(word_list))

# statistics for whole corpus (may take 15 minutes)
print_statistics_slow(brown)
# statistics for the first 100,000 words only
#print_statistics_slow(brown[:100000])
Okay, nice. But what went wrong to make the script so slow?! Use the line_profiler to find out.
In [ ]:
# profiling for whole corpus (15 minutes again)
%lprun -f print_statistics_slow print_statistics_slow(brown)
# profiling for the first 100,000 words only
#%lprun -f print_statistics_slow print_statistics_slow(brown[:100000])
It turns out that we have used the data structures in a very naive and inefficient way. Rewrite the function so that it runs faster. line_profiler may help you to gradually remove bottlenecks. In the end the function shouldn't need more than two or three seconds.
Hint: Think about using the data structures dict and set instead of lists where appropriate. A set allows much more efficient membership testing than a list. With a dict you can quickly map from a key to a value, for instance: word -> index. Also, this article about sorting may be useful: https://wiki.python.org/moin/HowTo/Sorting
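A minimal sketch of the counting part only (not a full solution; it assumes corpus is a plain list of strings):
# count word occurrences with a dict: word -> count
counts = {}
for word in corpus:
    counts[word] = counts.get(word, 0) + 1
# sort the (word, count) pairs by count, largest first, and keep the top ten
top_ten = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]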
In [ ]:
def print_statistics_fast(corpus):
    # your code here
    pass
%lprun -f print_statistics_fast print_statistics_fast(brown)
Our solution:
In [ ]:
from solutions import *
decrypt_solution(solution_profiling, passphrase='foo')