Julia -- applying language design lessons to technical computing

Harlan D. Harris, PhD


Background/Caveats:

  • I'm a data scientist -- at best I can fake being a software engineer or programming language expert
  • I don't use Julia in production (yet), although I've used it for several one-off analytics projects
  • Please ask questions! ...about stuff I know about...
  • I got hooked in 2012 -- this talk covers some of the reasons why

Outline

  • ### Influences and Lessons Learned (for technical computing)
  • ### "Script" Language with "Systems" Performance (for technical computing)
  • ### Multiple Dispatch > Object Oriented Programming (for technical computing)
  • ### Community Matters; Easy Learning Curves Matter (demos of stuff I've done)

Influences and Lessons Learned

Stefan Karpinski, Jeff Bezanson, Viral Shaw, Alan Edelman (and the Father of Floating Point):

Matlab

Pros: fast matrix algebra, REPL, easy to start using

Cons: commercial software, slow or impractical for many non-numeric tasks, syntax issues

LISP and Dylan

Pros: multiple dispatch, macros

Cons: obscure syntax, hard to learn

Python

Pros: dynamic, great ecosystem, easy to start using, elegant OO design

Cons: have to extend in C for performance, version and package issues

Fortran

Pros: blazing fast, best-of-class algorithms

Cons: hard to do anything but number crunching

Ruby

Pros: syntax matters, dynamic, great ecosystem

Cons: very slow, especially for numerical work

R

Pros: domain-specific but not domain-limited, huge package ecosystem, great interactivity

Cons: forces vectorization for speed, hard to contribute to core, quirky

Javascript

Pros: blazing fast JIT compilers, forces asynchronous design thinking

Cons: everything else

C

Pros: simple syntax, very fast

Cons: no REPL, error-prone memory handling, static typing means hard to start


Julia

  • dynamic, garbage collected
  • single inheritance type system; only leaf types instantiatable
  • BiC REPL; LLVM JIT; runtime type analysis and specialization
  • within 2x of C/C++, usually
  • multiple dispatch; types have slots, but no methods
  • base library includes BiC linear algebra libraries and matrix types
  • simple syntax resembles Matlab, Ruby
  • modern features: list comprehensions, do blocks, functional features, Unicode
  • almost all of core written in Julia (rest in C with a LISP parser)
  • @macros for metaprogramming, used in libraries for performance
  • package system based on git encourages collaboration
  • distributed computing baked in
  • MIT licensed

"Script" Language but "Systems" Performance

Must read: Graydon Hoare's two-part essay on interactive scientific computing

Common pattern: Outer scripting language wraps inner systems language (JCL + Asm, Matlab + Fortran, R/Python + C, Javascript + C++)

  • This sucks for technical computing -- who's got time?!?

Object oriented programming: Easy to add new types!

  • This sucks for technical computing -- I want to add methods, not types, usually!

Web-centric languages! Yay!

  • This sucks for technical computing -- I need interactivity, iteration, and raw speed!

What if you could take all these lessons and build something better?

Jameson Nash in Julia-Users

Julia was designed from the beginning to have certain features that are necessary for high performance:

  • type stability
  • pervasive type inference
  • execution semantics that closely match the hardware capabilities (pass-by-sharing, machine arithmetic)
  • inlining
  • macros

Additionally, certain oft-requested features were not included which make high performance much more difficult – or impossible – to achieve. These include:

  • pass-by-copy
  • first class local namespaces (aka eval in function scope)
  • allowing overloading of the getfield function (a.b access)

First Example: Collatz step counting


In [1]:
function collatz(n) # unproven: always terminates
    k = 0
    while n > 1
        n = isodd(n) ? 3n+1 : n>>1
        k += 1
    end
    return k
end


Out[1]:
collatz (generic function with 1 method)

Notes:

  • Simple syntax -- blocks, no {}, no ;, no whitespace rules
  • Types will be inferred at runtime; JIT compile a specialized version
  • 3n+1 works like God intended

In [2]:
collatz(89)


Out[2]:
30

In [3]:
for i = 2:2:20
    α = collatz(i)
    println("$i = $α")
end


2 = 1
4 = 2
6 = 8
8 = 3
10 = 6
12 = 9
14 = 17
16 = 4
18 = 20
20 = 7

Notes:

  • range is an iterator; not being expanded
  • Unicode is cool, LaTeX is cooler: \alpha + tab = α
  • string interpolation
  • powers of two take fewest steps

In [4]:
@time for i = 1:1e6 collatz(18) end


elapsed time: 0.041812339 seconds (96 bytes allocated)

Notes:

  • @macro expands into tic/toc/printf sorta thing
  • compact!
  • seems fast... why?

In [5]:
@code_native collatz(123)


	.section	__TEXT,__text,regular,pure_instructions
Filename: In[1]
Source line: 4
	push	RBP
	mov	RBP, RSP
	xor	EAX, EAX
	cmp	RDI, 2
	jl	36
Source line: 4
	test	DIL, 1
	jne	8
	sar	RDI
	jmpq	5
	lea	RDI, QWORD PTR [RDI + 2*RDI + 1]
Source line: 5
	inc	RAX
	cmp	RDI, 1
	jg	-36
Source line: 7
	pop	RBP
	ret

In [6]:
@code_llvm collatz(123)


define i64 @"julia_collatz;20213"(i64) {
top:
  %1 = icmp slt i64 %0, 2, !dbg !1800
  br i1 %1, label %L6, label %L, !dbg !1800

L:                                                ; preds = %top, %L3
  %k.0 = phi i64 [ %7, %L3 ], [ 0, %top ]
  %n.0 = phi i64 [ %n.1, %L3 ], [ %0, %top ]
  %2 = and i64 %n.0, 1, !dbg !1801
  %3 = icmp eq i64 %2, 0, !dbg !1801
  br i1 %3, label %L2, label %if1, !dbg !1801

if1:                                              ; preds = %L
  %4 = mul i64 %n.0, 3, !dbg !1801
  %5 = add i64 %4, 1, !dbg !1801
  br label %L3, !dbg !1801

L2:                                               ; preds = %L
  %6 = ashr i64 %n.0, 1, !dbg !1801
  br label %L3, !dbg !1801

L3:                                               ; preds = %L2, %if1
  %n.1 = phi i64 [ %6, %L2 ], [ %5, %if1 ]
  %7 = add i64 %k.0, 1, !dbg !1802
  %8 = icmp sgt i64 %n.1, 1, !dbg !1802
  br i1 %8, label %L, label %L6, !dbg !1802

L6:                                               ; preds = %L3, %top
  %k.1 = phi i64 [ 0, %top ], [ %7, %L3 ]
  ret i64 %k.1, !dbg !1803
}

Multi-Methods & Multiple Dispatch > Standard OOP

From Stefan Karpinski, The Design Impact of Multiple Dispatch, and Jiahao Chen, Julia Compiler and Community.

Multiple Dispatch -- a type of Polymorphism

  • dynamic — based on actual run-time type, not static type (C++)
  • multiple — based on all arguments, not just the receiver

Tends to be written as function application:

f(a,b,c) ⟸ LIKE THIS

a.f(b,c) ⟸ NOT THIS

OOP: What can I do to or with a thing?

  • SmartTrip Card: buy, recharge, pay metro fare, pay bus fare, lose
  • Metrorail Farecard: buy, pay metro fare, lose
  • Cash: pay bus fare, lose
  • object: method(s)
  • classes are fundamental; methods attach to classes

Multi-Methods (& Multiple Dispatch): What (combinations of) things can do this?

  • buy: farecard
  • recharge: rechargable faircard
  • pay: metro fare + farecard
  • pay: bus fare + SmartTrip Card OR Cash
  • lose: Any
  • generic method: applicable object(s)
  • methods are fundamental; classes or sets of classes define what's legal

Example


In [7]:
f(a::Any, b) = "fallback"
f(a::Number, b::Number) = "a and b are both numbers"
f(a::Number, b) = "a is a number"
f(a, b::Number) = "b is a number"
f(a::Integer, b::Integer) = "a and b are both integers"


Out[7]:
f (generic function with 5 methods)

Notes:

  • single-line function syntactic sugar looks math-y
  • a::Any is unnecessary
  • all of these types are abstract; concrete types are Float64, Int64, etc., inferred

In [8]:
f(1.5, 2)


Out[8]:
"a and b are both numbers"

In [9]:
print(typeof(1.5), ", ", typeof(2))


Float64, Int64

In [10]:
f(1, "bar")


Out[10]:
"a is a number"

In [11]:
f(1, 2)


Out[11]:
"a and b are both integers"

In [12]:
f("foo", [1,2])


Out[12]:
"fallback"

Diagonal Dispatch:


In [13]:
f{T<:Number}(a::T, b::T) = "a and b are both $(T)s"


Out[13]:
f (generic function with 6 methods)

In [14]:
methods(f)


Out[14]:
6 methods for generic function f:
  • f(a::Integer,b::Integer) at In[7]:5
  • f{T<:Number}(a::T<:Number,b::T<:Number) at In[13]:1
  • f(a::Number,b::Number) at In[7]:2
  • f(a::Number,b) at In[7]:3
  • f(a,b::Number) at In[7]:4
  • f(a,b) at In[7]:1

In [15]:
f(big(1.5), big(2.5))


Out[15]:
"a and b are both BigFloats"

In [16]:
f("foo", "bar") #<== still doesn't apply to non-numbers


Out[16]:
"fallback"

Interval Arithmatic


In [17]:
immutable Interval{T<:Real} <: Number
  lo::T
  hi::T
end

(a::Real)..(b::Real) = Interval(a,b)

Base.show(io::IO, iv::Interval) = print(io, "($(iv.lo))..($(iv.hi))")


Out[17]:
show (generic function with 85 methods)

In [18]:
(1..2) + 3 # tries but fails to find a way to reconcile two Numbers


no promotion exists for Interval{Int64} and Int64
while loading In[18], in expression starting on line 1
 in + at promotion.jl:158

Notes:

  • An interval is a structure parameterized by a subclass of Real, subclassing Number ("type lattice")
  • .. is a binary operator/function with syntax but no built-in methods
  • Default constructor is adequate here
  • Extending show() in Base module
  • Note fancypants interpolation of expressions

In [19]:
1..2


Out[19]:
(1)..(2)

In [20]:
typeof(ans)


Out[20]:
Interval{Int64} (constructor with 1 method)

In [21]:
sizeof(1..2) # two 64-bit/8-byte ints


Out[21]:
16

In [22]:
(1//2)..(2//3)


Out[22]:
(1//2)..(2//3)

In [23]:
a::Interval + b::Interval = (a.lo + b.lo)..(a.hi + b.hi)
a::Interval - b::Interval = (a.lo - b.hi)..(a.hi - b.lo)


Out[23]:
- (generic function with 138 methods)

In [24]:
(2..3) + (-1..1)


Out[24]:
(1)..(4)

In [25]:
(2..3) + (1.0..3.14159)  # autoconverts


Out[25]:
(3.0)..(6.14159)

In [26]:
@code_native (2..3) + (-1..1)


	.section	__TEXT,__text,regular,pure_instructions
Filename: In[23]
Source line: 1
	push	RBP
	mov	RBP, RSP
Source line: 1
	add	RDI, RDX
	add	RSI, RCX
	mov	RAX, RDI
	mov	RDX, RSI
	pop	RBP
	ret

Design Impacts of Multiple Dispatch

(below quoted from Stefan Karpinski)

Generic functions in Julia aren't special – they're the default

  • even really basic things like + are generic functions

This means you're free to extend everything

  • not just functions that were planned to be extended
  • and everything's compiled the same way, so it's fast

Since generic functions are open:

  • functions are more like protocols which users can also implement

We're forced to think much harder about the meaning of operations

  • can't just describe their behavior.

Results, if done well, are abstractions, defined generically, that extend easily

  • which is particularly important for mathematical objects.

Community Matters; Easy Learning Curves Matter

  • Users == Developers
  • git(hub)-based package ecosystem encourages
    • pull requests, not forks
    • collaboration and adhocracy

Three personal examples

  • round(number, digits, base)
  • DataFrames
  • VennEuler (shown elsewhere)

In [27]:
methods(round) # click through...


Out[27]:
15 methods for generic function round:

In [28]:
round(123.321)


Out[28]:
123.0

In [29]:
round(123.321, 2)


Out[29]:
123.32

In [30]:
round(123.321, -1)


Out[30]:
120.0

In [31]:
round(123.321, 1, 2)


Out[31]:
123.5

In [32]:
for i = 1:10 println(round(123.321, i, 2)) end


123.5
123.25
123.375
123.3125
123.3125
123.328125
123.3203125
123.3203125
123.3203125
123.3212890625

DataFrames and Packages


In [33]:
Pkg.installed()


Out[33]:
Dict{ASCIIString,VersionNumber} with 22 entries:
  "Homebrew"          => v"0.1.8"
  "Nettle"            => v"0.1.4"
  "REPLCompletions"   => v"0.0.1"
  "SortingAlgorithms" => v"0.0.1"
  "NLopt"             => v"0.1.1"
  "Color"             => v"0.2.11"
  "ZMQ"               => v"0.1.13"
  "ArrayViews"        => v"0.4.6"
  "JSON"              => v"0.3.7"
  "StatsBase"         => v"0.6.3"
  "DataArrays"        => v"0.2.0"
  "Iterators"         => v"0.1.6"
  "IJulia"            => v"0.1.12"
  "RDatasets"         => v"0.1.1"
  "GZip"              => v"0.2.13"
  "MathProgBase"      => v"0.2.5"
  "BinDeps"           => v"0.2.14"
  "Cairo"             => v"0.2.15"
  "DataFrames"        => v"0.5.7"
  "Reexport"          => v"0.0.1"
  "VennEuler"         => v"0.0.0-"
  "URIParser"         => v"0.0.2"

In [34]:
using RDatasets
iris = dataset("datasets", "iris")


Out[34]:
SepalLengthSepalWidthPetalLengthPetalWidthSpecies
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
115.43.71.50.2setosa
124.83.41.60.2setosa
134.83.01.40.1setosa
144.33.01.10.1setosa
155.84.01.20.2setosa
165.74.41.50.4setosa
175.43.91.30.4setosa
185.13.51.40.3setosa
195.73.81.70.3setosa
205.13.81.50.3setosa
215.43.41.70.2setosa
225.13.71.50.4setosa
234.63.61.00.2setosa
245.13.31.70.5setosa
254.83.41.90.2setosa
265.03.01.60.2setosa
275.03.41.60.4setosa
285.23.51.50.2setosa
295.23.41.40.2setosa
304.73.21.60.2setosa
&vellip&vellip&vellip&vellip&vellip&vellip

In [35]:
typeof(iris)


Out[35]:
DataFrame (constructor with 22 methods)

In [36]:
by(iris, :Species, df -> DataFrame(mean_length = mean(df[:PetalLength])))


Out[36]:
Speciesmean_length
1setosa1.462
2versicolor4.26
3virginica5.552

Notes:

  • using pulls dependencies too, in this case including DataFrames
  • show(DataFrame) outputs Markdown-compatible tables
  • by is part of split-apply-combine idiom for DFs
  • :Species is a symbol, ala LISP, used frequently instead of Enums
  • df -> ... is an anonymous function
  • DataFrame constructor gets the new column name by tricky use of named arguments
  • can index DFs with string names, symbol names, or position

In [39]:
using GLM
lm1 = fit(LinearModel, SepalLength ~ SepalWidth + PetalLength, iris)


Out[39]:
DataFrameRegressionModel{LinearModel{DensePredQR{Float64}},Float64}:

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   2.24914   0.24797 9.07022   <1e-15
SepalWidth   0.595525 0.0693282 8.58994   <1e-13
PetalLength   0.47192 0.0171177 27.5692   <1e-59
Warning: could not import Base.add! into NumericExtensions

In [40]:
a ~ b + c


Out[40]:
Formula: a ~ b + c

Notes:

  • fit is a generic function, specialized here on a model type, a Formula, and data
  • type of lm1 is interesting
  • ~ is syntactic sugar that calls a macro that captures the expression in a Formula

See VennEuler elsewhere -- or we're done!