Profiling


Julia offers several means of inspecting program execution:

  • viewing execution time and overall memory allocation,
  • viewing lower level code,
  • tracking function calls, and
  • tracking memory allocation.

Prerequisites

Read sections Profiling and Reflection and introspection of the Julia manual (20 min).

Competences

The reader should be able to determine how often individual commands are executed and how much memory a function allocates.


Execution time and overall memory allocation

Let us compute $\alpha*a\cdot b$ for scalar $\alpha$ and (fairly long) vectors $a$ and $b$:


In [1]:
a=rand(1000000)
b=rand(1000000)
α=rand()


Out[1]:
0.4253458264376686

In [2]:
@time α*a⋅b


  0.141322 seconds (37.12 k allocations: 9.511 MiB)
Out[2]:
106452.78208585214

In [3]:
@time α*(a⋅b)


  0.004442 seconds (6 allocations: 192 bytes)
Out[3]:
106452.78208585156

We see that the second evaluation is faster and allocates far less memory. The reason is operator precedence: * and $\cdot$ have the same precedence and associate to the left, so α*a⋅b is evaluated as (α*a)⋅b, which first allocates the temporary vector α*a, whereas α*(a⋅b) computes the dot product first and then multiplies two scalars. Loops in Julia are also very fast:


In [4]:
@time s=0.0; for i=1:1000000; s+=α*a[i]*b[i]; end; s


  0.000004 seconds (5 allocations: 208 bytes)
Out[4]:
106452.78208585549
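Two remarks about this last measurement: a macro applies only to the first expression that follows it, so @time here times just the assignment s=0.0, and loops over untyped global variables carry extra overhead. A more reliable way to time the loop is to wrap it in a function, as in the sketch below (the name mydot is our own):

function mydot(α,a,b)
    s=0.0
    for i=1:length(a)
        s+=α*a[i]*b[i]
    end
    s
end

mydot(α,a,b)        # the first call compiles the function
@time mydot(α,a,b)  # subsequent calls measure execution only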

Lower level code

The functions code_llvm() and code_native() display the LLVM intermediate representation and the compiled machine code of a method, respectively. (There are other useful introspection functions described in the Manual.)

They can also be called as macros, which is the form that we shall use.
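For comparison, the function form takes the generic function and a tuple of argument types; a minimal sketch:

code_llvm(+, (Int64, Int64))      # same output as @code_llvm +(1,2)
code_native(+, (Float64, Int64))  # same output as @code_native +(1.0,2)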

Observe the differences in the small examples below:


In [5]:
?code_llvm


search: code_llvm @code_llvm

Out[5]:
code_llvm([io], f, types)

Prints the LLVM bitcodes generated for running the method matching the given generic function and type signature to io which defaults to STDOUT.

All metadata and dbg.* calls are removed from the printed bitcode. Use code_llvm_raw for the full IR.


In [6]:
?code_native


search: code_native @code_native

Out[6]:
code_native([io], f, types, [syntax])

Prints the native assembly instructions generated for running the method matching the given generic function and type signature to io which defaults to STDOUT. Switch assembly syntax using syntax symbol parameter set to :att for AT&T syntax or :intel for Intel syntax. Output is AT&T syntax by default.


In [7]:
@code_llvm +(1,2)


define i64 @"jlsys_+_57643"(i64, i64) #0 !dbg !5 {
top:
  %2 = add i64 %1, %0
  ret i64 %2
}

In [8]:
@code_llvm +(1.0,2)


define double @"jlsys_+_45973"(double, i64) #0 !dbg !5 {
top:
  %2 = sitofp i64 %1 to double
  %3 = fadd double %2, %0
  ret double %3
}

In [9]:
@code_native +(1,2)


	.text
Filename: int.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 32
	leaq	(%rdi,%rsi), %rax
	popq	%rbp
	retq
	nopw	(%rax,%rax)

In [10]:
@code_native +(1.0,2)


	.text
Filename: promotion.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 249
	xorps	%xmm1, %xmm1
	cvtsi2sdq	%rdi, %xmm1
	addsd	%xmm1, %xmm0
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

Tracking function calls

We shall demonstrate the process on the simple problem of polynomial evaluation. We use the registered package Polynomials.jl for polynomial manipulation, and two functions of our own for polynomial evaluation (the corresponding recurrences are sketched after this list):

  • Horner's scheme, and
  • evaluation by accumulating powers of x.
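
For $p(x)=a_0+a_1x+\cdots+a_nx^n$, Horner's scheme evaluates the nested form

$$p(x)=a_0+x\big(a_1+x\big(a_2+\cdots+x\,a_n\big)\cdots\big),$$

using $n$ multiplications and $n$ additions, while accumulating powers computes $t_i=x\,t_{i-1}$, $t_0=1$, and sums $p(x)=a_0+\sum_{i=1}^{n} a_i t_i$, using $2n$ multiplications and $n$ additions.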

In [11]:
# Pkg.add("Polynomials")
using Polynomials

In [12]:
whos(Polynomials)


                          Pade     80 bytes  UnionAll
                          Poly     40 bytes  UnionAll
                   Polynomials     72 KB     Module
                        coeffs      0 bytes  Polynomials.#coeffs
                        degree      0 bytes  Polynomials.#degree
                       padeval      0 bytes  Polynomials.#padeval
                          poly      0 bytes  Polynomials.#poly
                       polyder      0 bytes  Polynomials.#polyder
                       polyfit      0 bytes  Polynomials.#polyfit
                       polyint      0 bytes  Polynomials.#polyint
                       polyval      0 bytes  Polynomials.#polyval
                     printpoly      0 bytes  Polynomials.#printpoly
                         roots      0 bytes  Polynomials.#roots
                      variable      0 bytes  Polynomials.#variable

In [13]:
# polyval() is just Horner's scheme
methods(polyval)


Out[13]:
2 methods for generic function polyval:
function polyval{T,S}(p::Poly{T}, x::S)
    R = promote_type(T,S)

    lenp = length(p)
    if lenp == 0
        return zero(R) * x
    else
        y = convert(R, p[end])
        for i = (endof(p)-1):-1:0
            y = p[i] + x*y
        end
        return y
    end
end

In [14]:
# Polynomial with given coefficients
p=Poly([1,2,3,4])


Out[14]:
1 + 2∙x + 3∙x^2 + 4∙x^3

In [15]:
# Polynomial with given zeros
q=poly([1,2,3,4])


Out[15]:
24 - 50∙x + 35∙x^2 - 10∙x^3 + x^4

In [16]:
r=p*q


Out[16]:
24 - 2∙x + 7∙x^2 + 6∙x^3 - 114∙x^4 + 112∙x^5 - 37∙x^6 + 4∙x^7

In [17]:
typeof(r)


Out[17]:
Polynomials.Poly{Int64}

In [18]:
p(pi), polyval(p,pi), q(pi), polyval(q,π)


Out[18]:
(160.91710523164693, 160.91710523164693, -0.29715441035788004, -0.29715441035788004)

In [19]:
p[0]


Out[19]:
1

In [20]:
function mypolyval(p::Poly,x::Number) # evaluation by accumulating powers of x
    s=p[0]
    t=one(x)
    for i=1:length(p)-1
        t*=x       # t = x^i
        s+=p[i]*t
    end
    s
end

function myhorner(p::Poly,x::Number) # Horner's scheme
    s=p[end]
    for i=length(p)-2:-1:0
        s*=x
        s+=p[i]
    end
    s
end


Out[20]:
myhorner (generic function with 1 method)

In [21]:
mypolyval(p,map(Float64,π)), myhorner(p,map(Float64,pi))


Out[21]:
(160.91710523164693, 160.91710523164693)

Let us time the functions:


In [22]:
n=1000001
pbig=Poly(rand(n))
x=0.12345;

In [24]:
@time pbig(x)


  0.005116 seconds (5 allocations: 176 bytes)
Out[24]:
0.667166770780437

In [26]:
@time polyval(pbig,x)


  0.006131 seconds (5 allocations: 176 bytes)
Out[26]:
0.667166770780437

In [28]:
@time myhorner(pbig,x)


  0.005131 seconds (5 allocations: 176 bytes)
Out[28]:
0.667166770780437

In [30]:
# This is slightly faster
@time mypolyval(pbig,x)


  0.003465 seconds (5 allocations: 176 bytes)
Out[30]:
0.6671667707804371
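
The differences above are small, and @time measures a single run. For more careful timing one can use the registered package BenchmarkTools.jl, which collects many samples; a sketch, assuming the package is installed:

# Pkg.add("BenchmarkTools")
using BenchmarkTools
@btime polyval($pbig,$x)     # $ interpolates the globals into the benchmark
@btime myhorner($pbig,$x)
@btime mypolyval($pbig,$x)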

In [31]:
# Provided sufficient assembler knowledge, we can
# inspect the details
@code_native mypolyval(pbig,x)


	.text
Filename: In[20]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 264
	movq	(%rdi), %rdi
	movq	8(%rdi), %rax
	xorl	%ecx, %ecx
Source line: 264
	testq	%rax, %rax
	jle	L35
	cmpq	$0, 24(%rdi)
	je	L161
	movq	(%rdi), %rcx
	movq	(%rcx), %rcx
Source line: 4
L35:
	leaq	-1(%rax), %rdx
	testq	%rdx, %rdx
	jle	L117
	movl	$1, %edx
	movabsq	$140098378437456, %rsi  # imm = 0x7F6B32149350
	movsd	(%rsi), %xmm1           # xmm1 = mem[0],zero
	nop
L64:
	xorpd	%xmm2, %xmm2
Source line: 264
	cmpq	%rdx, %rax
	jle	L87
	cmpq	24(%rdi), %rdx
	jae	L127
	movq	(%rdi), %rsi
	movsd	(%rsi,%rdx,8), %xmm2    # xmm2 = mem[0],zero
Source line: 4
L87:
	incq	%rdx
Source line: 5
	mulsd	%xmm0, %xmm1
Source line: 6
	mulsd	%xmm1, %xmm2
	movd	%rcx, %xmm3
	addsd	%xmm2, %xmm3
	movd	%xmm3, %rcx
Source line: 4
	cmpq	%rdx, %rax
	jne	L64
Source line: 8
L117:
	movd	%rcx, %xmm0
	movq	%rbp, %rsp
	popq	%rbp
	retq
Source line: 264
L127:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	incq	%rdx
	movq	%rdx, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	callq	*%rax
Source line: 264
L161:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	callq	*%rax
	nopw	%cs:(%rax,%rax)

Let us track function calls using the macro @profile.


In [33]:
?@profile


Out[33]:
@profile

@profile <expression> runs your expression while taking periodic backtraces. These are appended to an internal buffer of backtraces.


In [34]:
Profile.clear()

In [35]:
@profile (for i = 1:100; mypolyval(pbig,x); end)

In [36]:
Profile.print()


371 ./task.jl:335; (::IJulia.##14#17)()
 371 ...Julia/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
  371 .../Compat/src/Compat.jl:385; (::Compat.#inner#17{Array{Any,1}...
   371 ...rc/execute_request.jl:158; execute_request(::ZMQ.Socket, ::...
    371 .../Compat/src/Compat.jl:71; include_string(::Module, ::Strin...
     371 ./loading.jl:522; include_string(::String, ::String)
      371 ./<missing>:?; anonymous
       371 ./profile.jl:23; macro expansion
        371 ./In[35]:1; macro expansion
         80  ./In[20]:4; mypolyval(::Polynomials.Poly{...
         291 ./In[20]:6; mypolyval(::Polynomials.Poly{...
          60 ...src/Polynomials.jl:264; getindex

In [37]:
Profile.clear()
@profile (for i = 1:100; myhorner(pbig,x); end)
Profile.print()


566 ./task.jl:335; (::IJulia.##14#17)()
 566 ...Julia/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
  566 .../Compat/src/Compat.jl:385; (::Compat.#inner#17{Array{Any,1}...
   566 ...rc/execute_request.jl:158; execute_request(::ZMQ.Socket, ::...
    566 .../Compat/src/Compat.jl:71; include_string(::Module, ::Strin...
     566 ./loading.jl:522; include_string(::String, ::String)
      566 ./<missing>:?; anonymous
       566 ./profile.jl:23; macro expansion
        566 ./In[37]:2; macro expansion
         36  ./In[20]:13; myhorner(::Polynomials.Poly{F...
         210 ./In[20]:14; myhorner(::Polynomials.Poly{F...
         320 ./In[20]:15; myhorner(::Polynomials.Poly{F...
          100 ...src/Polynomials.jl:264; getindex

By inspecting the output, we see that most of the time is spent in the computational lines inside the two loops, part of it in getindex from Polynomials.jl.
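
The tree format shown above can become hard to read for deep call chains; Profile.print() also supports a flat format sorted by sample counts, for example:

Profile.print(format=:flat)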

The above profiles also include the IJulia calls. If profiling is done in terminal mode, the IJulia calls are not present. This can be done with the following commands:

include("myfunctions.jl") # Contains the function definitions
Profile.clear()
@profile (for i = 1:100000; mypolyval(pbig,x); end)
Profile.print()

Profile.clear()
@profile (for i = 1:100000; myhorner(pbig,x); end)
Profile.print()
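
With this many iterations the default backtrace buffer may fill up; the buffer size and the sampling interval can be adjusted with Profile.init() before profiling, e.g.

Profile.init(n=10^7, delay=0.01)  # larger buffer, one sample every 10 ms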

Output can also be viewed using the package ProfileView.jl.


In [38]:
# Pkg.add("ProfileView")

In [39]:
using ProfileView

In [40]:
ProfileView.view()


Out[40]:
[Interactive ProfileView flame graph of the profile results]

Tracking memory allocation

Memory allocation analysis must be performed in terminal mode. The entire code must be stored in a single file, for example myfile.jl.

The command

julia --track-allocation=user myfile.jl

(on Windows, for example, C:\Users\Ivan\AppData\Local\Julia-0.6.2\bin\julia --track-allocation=user myfile.jl)

generates the file myfile.jl.mem (and a .mem file for every other source file that was executed), in which the amount of memory allocated is shown next to each line of code.
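
Since allocations made during compilation are also counted, a common pattern is to run the code once, reset the counters with Profile.clear_malloc_data(), and run it again before exiting. A sketch of such a myfile.jl, reusing myfunctions.jl from above:

using Polynomials
include("myfunctions.jl")             # definitions of mypolyval and myhorner
pbig=Poly(rand(1000001))
x=0.12345
mypolyval(pbig,x); myhorner(pbig,x)   # first calls trigger compilation
Profile.clear_malloc_data()           # discard allocations recorded so far
mypolyval(pbig,x); myhorner(pbig,x)   # only these calls appear in the .mem files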

