In [1]:
using Distributions
using DataFrames
using Datetime
using Gadfly



In [2]:
VERSION


Out[2]:
v"0.3.0-prerelease+814"

Tricked Out Iterators in Julia

Introduction

I want to spend some time messing about with iterators in Julia. I think they not only provide a familiar and useful entry point into Julia's type system and dispatch model, they're also interesting in their own right. [1] Clever application of iterators can help to simplify complicated loops, better express their intent, and improve memory usage.

A word of warning about the code here. Much of it isn't idiomatic Julia, and I wouldn't necessarily recommend this style in a serious project. I also can't speak to its performance vis-a-vis more obvious Julian alternatives. In some cases the style of the code examples below may help reduce memory usage, but performance is not my main concern. (This may be the first blog post about Julia unconcerned with speed.) Instead, I'm just interested in different ways of expressing iteration problems.

The Iterator Protocol

What do I mean by iterators? [2] I mean any object I in Julia that works on the right-hand side of the statement for i = I .... That is, anything you can for-loop over. This includes not only data collections like Arrays, Dicts, and Sets, but also more abstract types like Ranges, as well as what I'll call "higher-order" iterators such as those returned by the zip or enumerate functions.

As an equivalent definition, an iterator in Julia is any type that implements the iterator protocol. The protocol consists of three methods: start, next, and done. So any type in Julia for which these three methods are defined is an iterator. It might be a dumb iterator or a broken iterator, but it's an iterator.

Since the for statement works on iterators, and iterators are just a collection of methods, we can define any for loop using calls to those methods.

For example, this simple loop


In [3]:
arr = [10:-2:1] # 1D array [10, 8, 6, 4, 2]
for i = arr
    println(i^2)
end


100
64
36
16
4

is equivalent to this


In [4]:
state = start(arr)
while !done(arr, state)
    i, state = next(arr, state)
    println(i^2)
end


100
64
36
16
4

In this example, the start method provides the initial state of the iterator; the next method returns the value of the array at a given state, as well as what the next state is. Finally, the done method returns true when we've gone past the end of the iterator, informing the loop that it should stop.

If you know Python, the idea of the iterator protocol is probably familiar. In Python, any object can be an iterator if it has the methods __iter__ and __next__. But notice the lack of side effects in the Julia implementation -- calling start or next on the array has no effect on the array itself. start is basically a constant, always returning the initial state whenever you pass it the same type of iterator. And next doesn't really increment anything; it's just a mapping from current state → (value, next state). In general, the iterator itself has no internal state being incremented or changed as you pass through a loop.
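To see that contrast concretely, here is a minimal stateful iterator in Python (a sketch; the Countdown class is my own illustration). Its __next__ mutates the object itself, so the iterator can be consumed only once — unlike Julia's arrangement, where state lives outside the iterator.

```python
class Countdown:
    """A stateful Python iterator: each __next__ call mutates self.n."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

it = Countdown(3)
print(list(it))  # [3, 2, 1]
print(list(it))  # [] -- the state was consumed by the first pass
```

In Julia, by contrast, running two loops over the same array works fine, because each loop gets a fresh state from start.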

An iterator's state

More concretely, what is the state of an iterator? How the state is defined, and the sequence of states an iterator passes through, depends on the type of the iterator itself. It's best to look at some examples.

Arrays

Arrays are very intuitive iterators. They have states that are just integer values from 1 to the length of the array. So start returns 1.


In [5]:
arr = ["one", "two", "three", "four", "five", "six"] # 6 items
# The initial state
start(arr) # 1


Out[5]:
1

The next mapping is $i \rightarrow i+1$, and the value of the iterator at any state i is just a[i].


In [6]:
# The value and the next state in states 1, 2, ...
for i = 1:4
    println("next(arr, $i) = ", next(arr, i))
end


next(arr, 1) = ("one",2)
next(arr, 2) = ("two",3)
next(arr, 3) = ("three",4)
next(arr, 4) = ("four",5)

If this were a multidimensional array, say 3×2 instead of 6×1, we'd get the same result; since Julia arrays are stored in column-major order, iteration would just proceed down the columns of the matrix.

The done method returns true when the state is i = length(a) + 1. You might think it'd be length(a), but recall the for-equivalent while loop above. Having done return true at the last index of the array would prevent the loop from executing on the last element. So in our 6-element array, done is true when the state hits 7.


In [7]:
println(done(arr, 6))  # not yet
println(next(arr, 6))


false
("six",7)

In [8]:
println(done(arr, 7)) # true
println(next(arr, 7)) # BoundsError!


true
BoundsError()
while loading In[8], in expression starting on line 2
 in next at abstractarray.jl:236

Ranges

Ranges have states that look similar to those of arrays, except they start at zero.


In [9]:
rng = 11:20  # length 10 range
start(rng) # 0


Out[9]:
0

But the relationship between the current and next state is the same: i → i+1.


In [10]:
for i = [0, 1, 9, 10]
    println("next(rng, $i) = ", next(rng, i))
end


next(rng, 0) = (11,1)
next(rng, 1) = (12,2)
next(rng, 9) = (20,10)
next(rng, 10) = (21,11)

Since we start at zero, the done state is one less than the equivalent array.


In [11]:
done(rng, 10)


Out[11]:
true

Unordered collections: Dicts, Sets, etc.

Arrays and ranges have a natural order, so the evolution of state is straightforward. But what about collections such as dictionaries and sets that have no inherent order? As in many languages, such collections can be iterated over, but the order of iteration is not easily predictable.

For example, here's a dictionary:


In [12]:
dictit = {:one => 1, :three => 3, :five => "five!"}


Out[12]:
{:one=>1,:three=>3,:five=>"five!"}

The starting state isn't 0 or 1, as would be natural for an ordered collection.


In [13]:
s0 = start(dictit)


Out[13]:
3

And while next maps state i to state j, the relationship between i and j is not obvious. Here, while the first state is 3, the second is 11, and the rest are similarly weird.


In [14]:
_, s1 = next(dictit, s0)


Out[14]:
((:one,1),11)

In [15]:
_, s2 = next(dictit, s1)


Out[15]:
((:three,3),13)

In [16]:
_, s3 = next(dictit, s2)


Out[16]:
((:five,"five!"),17)

In [17]:
done(dictit, s3)


Out[17]:
true

The states, you probably and correctly suspect, are tied to the internal implementation of the dictionary, e.g. how the keys are hashed. So the state doesn't follow a predictable 1, 2, 3, ... order, and what order of elements we get when iterating is essentially unpredictable.

But because for loops handle the iterator's states for us, we rarely if ever have to worry about the representation of an iterator's state. The for loop implicitly calls the start, done, and next methods, which do all this bookkeeping for us.

Iterators and delayed evaluation

While many iterators are collections of data in memory, like Arrays, Dicts, or Sets, iterators can also represent abstract collections that aren't held in memory.

Range is a good example. When we iterate over the range 1:10, we get the sequence 1, 2, 3, ..., 10. But in memory, this range is comprised of only two integers, 1 and 10. The values in between are only evaluated when we're looping over it.

From https://github.com/JuliaLang/julia/blob/master/base/range.jl, here's how a Range's iterator protocol is defined:

{.julia}
start(r::Ranges) = 0
next{T}(r::Range{T}, i) = (oftype(T, r.start + i*step(r)), i+1)
next{T}(r::Range1{T}, i) = (oftype(T, r.start + i), i+1)
done(r::Ranges, i) = (length(r) <= i)
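Python's range type works the same way: only the endpoints and step are stored, and elements are computed on demand, so even an astronomically long range costs almost no memory.

```python
r = range(1, 10**12)   # no trillion-element list is allocated
print(r[10**11])       # elements are computed as start + i*step
print(len(r))          # length is arithmetic, too
print(500 in r)        # membership is O(1) arithmetic, not a scan
```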

Application: Iterating over Fibonacci numbers

Here's another example of an iterator which computes values on demand, using the next method to do the work. fibit(n) is an iterator over the first n Fibonacci numbers. When the iterator is constructed, it doesn't calculate all of these numbers. Instead it waits for its next method to be called, providing the next Fibonacci number depending on the current one.


In [18]:
# Iterator produces the first n Fibonacci numbers
type FibIt{T<:Integer}
    last2::(T, T)
    n::Integer
end

In [19]:
fibit(n::Integer) = FibIt((0, 1), n)
# Specify types, e.g. BigInt to prevent overflow.
fibit(n::Integer, T::Type) = FibIt{T}((0, 1), n) 

Base.start(fi::FibIt) = 1

function Base.next(fi::FibIt, state)
    if state == 1
        return (1, 2)
    else
        fi.last2 = fi.last2[2], sum(fi.last2)
        (fi.last2[2], state + 1)
    end
end

Base.done(fi::FibIt, state) = state > fi.n


Out[19]:
done (generic function with 66 methods)

Tasks/Coroutines

This talk of iterators with delayed evaluation may remind Pythonistas of generators. And Julia has a type that is basically equivalent to Python's generators, called a Task. A Task is constructed by calling the Task() constructor (or the @task macro) on a function containing a produce statement; produce is similar to Python's yield.
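For comparison, here is the same iterator as a Python generator, where yield plays the role of produce:

```python
def fib_gen(n):
    """Yield the first n Fibonacci numbers lazily, like a Julia Task."""
    a, b = 0, 1
    for _ in range(n):
        yield b
        a, b = b, a + b

print(list(fib_gen(10)))  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```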

Instead of using the Fibit type above, we could get an equivalent iterator by defining a Task that produces sequential Fibonacci numbers.


In [20]:
function fibtask(n::Integer, T::Type)
    a, b = (zero(T), one(T))
    function _it()
        for i = 1:n
            produce(b)
            a, b = (b, a+b)
        end
    end
    Task(_it)
end

fibtask(n::Integer) = fibtask(n, Int)


Out[20]:
fibtask (generic function with 2 methods)

In [21]:
# We get iteration for free.
for i in fibtask(10)
    print(i, " ")
end


1 1 2 3 5 8 13 21 34 55 

Whether you create an iterator by defining a type with the iterator protocol or by constructing a Task is up to you; each approach has pros and cons. By defining your iterator as a specific type, you can dispatch lots of other functions on it. fibtask, by contrast, is just a Task, so defining methods for it means defining methods for all Tasks, which may be undesirable or infeasible. On the other hand, Tasks give you iterators with less code. Below I'll show an example of an iterator that's hard to define with the iterator protocol methods but easy to define as a Task. And of course, Tasks are coroutines, and can be used in those contexts.

Realizing iterators without for loops

So far, we've talked about iterators in the context of for loops. We saw that for i = I was a construct for calling I's start, done and next methods, letting us realize and operate on the values in the iterator.

But there are functions that take iterators as inputs and implicitly iterate over them to some desired result. This obviates the need for explicit for loops, and can make for cleaner, more functional code. Some examples follow.

collect and reduce

The collect function takes an iterator input, realizes all its values, and collects them into an array.


In [22]:
collect(fibit(10))


Out[22]:
10-element Array{Any,1}:
  1
  1
  2
  3
  5
  8
 13
 21
 34
 55

The reduce function similarly realizes the values of an iterator, but then successively applies an operator to them to give a scalar result.


In [23]:
reduce(+, fibit(10))


Out[23]:
143

That reduce operation is equivalent to the sum function called with an iterator argument.


In [24]:
sum(fibit(10))


Out[24]:
143

In this next line of code, I compute the sum of the reciprocals of the first 10,000 Fibonacci numbers (which should be close to the reciprocal Fibonacci constant, ≈ 3.3599), using collect to first put them into an array.


In [25]:
sum(1 ./ collect(BigInt, fibit(10_000, BigInt)))


Out[25]:
3.359885666243177553172011302918927179688905133731968486495553815325130318996609e+00 with 256 bits of precision

Comprehensions

The collect function may remind you of an array comprehension, and it is similar, but here we see that array comprehensions don't work on our iterator:


In [26]:
[f for f = fibit(10)]


no method length(FibIt{Int64})
while loading In[26], in expression starting on line 1
 in anonymous at no file

What's going on is that the array comprehension wants to allocate an array, then fill it in with the values of the iterator. Since it doesn't know the iterator's length (how many values it will produce), it doesn't know how large an array to allocate. [3] We can fix this for our Fibonacci iterator by giving it a length method.


In [27]:
Base.length(it::FibIt) = it.n
[f for f = fibit(10)]


Out[27]:
10-element Array{Int64,1}:
  1
  1
  2
  3
  5
  8
 13
 21
 34
 55

Now we can redefine our sum-of-reciprocals using a comprehension instead of collect.


In [28]:
sum([1/f for f = fibit(10_000, BigInt)])


Out[28]:
3.359885666243177553172011302918927179688905133731968486495553815325130318996712e+00 with 256 bits of precision

What if we tried this with our Fibonacci task?


In [29]:
[f for f = fibtask(10)]


no method length(Task)
while loading In[29], in expression starting on line 1
 in anonymous at no file

We get the same issue; Tasks don't have a length method. The advantage of using the FibIt type is that we can easily define a length method for it. We could only give our Fibonacci task a length method by giving all Tasks a length method, which wouldn't make sense.
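Python has the same limitation: a generator carries no length, so len() on one raises TypeError, while a class-based iterable can simply define __len__. (The gen and Three names below are illustrative, not from any library.)

```python
def gen():
    yield from (1, 2, 3)

try:
    len(gen())
except TypeError:
    print("generators have no len()")

class Three:
    """A class-based iterable can advertise its length."""
    def __iter__(self):
        return iter((1, 2, 3))
    def __len__(self):
        return 3

print(len(Three()))  # 3
```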

The Iterator Package

When we calculated the sum of the reciprocals of Fibonacci numbers above, we had to realize the values of the Fibonacci iterator before taking the reciprocals, and then sum a collection of all those values. Alternatively, we could have called sum on an iterator that produced $1/x$ for each Fibonacci number $x$.

One way to do this would be to create a new iterator type, called ReciprocalFibIt, and give it its own start, next, and done methods. But that feels a little excessive. Wouldn't it be nicer to construct that iterator from the Fibonacci iterator we already have? Essentially saying, "hey, I want another iterator that gives one over the values of that other iterator."

That would be an example of what I'll call a higher-order iterator: an iterator constructed from one or more other iterators. zip and enumerate are common examples.

It turns out Julia has a neat little package of useful higher-order iterators called, obviously, Iterators. In the rest of (this already very long) post, I'll explore some of the things in the package. Pythonistas will notice similarities with the itertools module in the standard library.


In [30]:
using Iterators

IMap

The IMap iterator, constructed with the imap function, grants the wish above: a new iterator that is the result of applying a function to the values of an existing iterator. In the case of our reciprocal Fibonacci numbers, that function is x -> 1/x.


In [31]:
reciprocalfib = imap(x -> 1/x, fibit(10_000, BigInt)) # A new iterator, composed from a FibIt
psi = sum(reciprocalfib) # No collect needed


Out[31]:
3.359885666243177553172011302918927179688905133731968486495553815325130318996609e+00 with 256 bits of precision

So reciprocalfib is itself an iterator, whose values are only realized when it's passed to the sum function. We didn't have to allocate any arrays before calling sum as with the collect and comprehension examples above.
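Python's built-in map behaves the same way: it returns a lazy iterator and allocates no intermediate list. A sketch of the reciprocal sum over the first 20 Fibonacci numbers, using Fraction to keep the arithmetic exact until the end:

```python
from fractions import Fraction

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        yield b
        a, b = b, a + b

# map is lazy: no list of reciprocals is ever materialized.
s = sum(map(lambda x: Fraction(1, x), fib(20)))
print(float(s))  # already close to the reciprocal Fibonacci constant ≈ 3.3599
```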

An IFilter iterator

Since we have a map-like iterator, why not a filter? How would it work? Given an iterator that produces values v1, v2, v3, ..., the filter iterator would only produce the values that met some predicate, skipping any that didn't.

This isn't implemented in the Iterators package. It's a neat idea, but it turns out to be tricky to define in terms of the iterator protocol. It's easy with a Task, though.


In [32]:
function ifilter(f::Function, itr)
    function _it()
        for i = itr
            if f(i)
                produce(i)
            end
        end
    end
    Task(_it)
end


Out[32]:
ifilter (generic function with 1 method)
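For reference, Python 3's built-in filter is already a lazy iterator of this kind, and the generator form mirrors the Task definition above almost line for line:

```python
# Python 3's filter is lazy, like the Task-based ifilter above.
odds = filter(lambda n: n % 2 == 1, range(10))
print(list(odds))  # [1, 3, 5, 7, 9]

# The generator form, mirroring the Julia Task version:
def ifilter(pred, itr):
    for i in itr:
        if pred(i):
            yield i

print(list(ifilter(lambda n: n % 3 == 0, range(10))))  # [0, 3, 6, 9]
```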

Application: A list of primes whose digits sum to a prime

Here's an example of it in action. We'll begin with a Range iterator from 1 to 1,000. I want to list all of the numbers in that range that are (1) prime and (2) have digits that sum to a prime.

So ifilter takes the predicate test and the original iterator, then produces only those values from the original iterator that pass the test. Turns out there are 89 such primes between 1 and 1,000.


In [33]:
function funnyprimetest(n::Integer)
    sumdigits = sum([parseint(string(c)) for c in string(n)])
    isprime(n) & isprime(sumdigits)
end


Out[33]:
funnyprimetest (generic function with 1 method)

In [34]:
collect(ifilter(funnyprimetest, 1:1000))


Out[34]:
89-element Array{Any,1}:
   2
   3
   5
   7
  11
  23
  29
  41
  43
  47
  61
  67
  83
   ⋮
 829
 863
 881
 883
 887
 911
 919
 937
 953
 971
 977
 991

Repeat and RepeatForever

Another surprisingly useful iterator is Repeat, which simply produces the same object some number of times. Here the iterator produces the string "echo!" five times.


In [35]:
for i = repeated("echo!", 5)
    println(i)
end


echo!
echo!
echo!
echo!
echo!

Extension: Repeat for impure functions

One caveat about the Repeat iterator: the object or value it repeats is fixed at construction. If you pass it a function call, that call is evaluated once, and the iterator then repeatedly returns the result of that first call. For pure functions, that's fine. The first call of sqrt(100) is the same as the second, third, or ten-thousandth call of sqrt(100).

If the function is impure, though, we'll get undesired results.


In [36]:
for i = repeated(rand(), 10) println(i) end


0.4979715948949144
0.4979715948949144
0.4979715948949144
0.4979715948949144
0.4979715948949144
0.4979715948949144
0.4979715948949144
0.4979715948949144
0.4979715948949144
0.4979715948949144

Here, the rand function was called once in the constructor, and that result was repeated again and again. I'd prefer to get 10 separate calls to rand. Here's one way to make that work.


In [37]:
Base.next(it::Iterators.Repeat{Function}, state) = it.x(), state - 1
Base.next(it::Iterators.RepeatForever{Function}, state) = it.x(), nothing


Out[37]:
next (generic function with 83 methods)

In [38]:
for i = repeated(rand, 10) println(i) end # Note the function isn't called; the next function does this.


0.5280218720629153
0.6287778752009863
0.9808561208985482
0.11424183057106774
0.18147790869456637
0.018127429914700066
0.4489191298265862
0.9477518177718107
0.6721105940596026
0.03983884464617704

What I've done is create new next methods for the Repeat and RepeatForever iterators. When the repeated object is a function, these next methods call it. By passing the iterator an uncalled function object, I avoid the call in the constructor and defer it to the next method.
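Python handles this case natively: the two-argument form iter(callable, sentinel) calls the function afresh on every step, deferring all calls to iteration time.

```python
import random
from itertools import islice

# iter(callable, sentinel) calls `callable` anew on each next();
# the sentinel (which random.random never returns) makes it infinite.
random.seed(0)
draws = iter(random.random, None)
ten = list(islice(draws, 10))
print(ten)  # ten distinct random floats, one call per step
```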

Take and Drop

The Take iterator only iterates over some specified number of initial values of its input iterator. It works well in combination with infinite iterators, like RepeatForever.


In [39]:
randsforever = repeated(rand)


Out[39]:
RepeatForever{Function}(rand)

In [40]:
[r for r = take(randsforever, 10)]


Out[40]:
10-element Array{Any,1}:
 0.0964744
 0.500893 
 0.168912 
 0.465092 
 0.201898 
 0.504639 
 0.845752 
 0.727477 
 0.733524 
 0.0383152

In [41]:
[r for r = take(randsforever, 20)]


Out[41]:
20-element Array{Any,1}:
 0.156504 
 0.557105 
 0.183943 
 0.481632 
 0.351313 
 0.269933 
 0.719834 
 0.473762 
 0.541162 
 0.291951 
 0.303935 
 0.0499518
 0.402561 
 0.231805 
 0.866897 
 0.265337 
 0.21437  
 0.140006 
 0.319362 
 0.836157 

The Drop iterator, on the other hand, ignores some specified first values of its input iterator. So, how many values should be printed in this for loop?


In [42]:
for i = drop(take(randsforever, 10_000), 9998)
    println(i)
end


0.9240435159977587
0.672714940852619

Answer: just the last two, since we take 10,000 random numbers, but drop the first 9,998.
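For Pythonistas, itertools.islice rolls take and drop into a single call; here's the same last-two-of-10,000 trick, with an infinite counter standing in for the random stream:

```python
from itertools import count, islice

# Take the first 10,000 values of an infinite counter, drop the first 9,998.
last_two = list(islice(count(), 9998, 10000))
print(last_two)  # [9998, 9999]
```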

Extension: TakeWhile, TakeUntil

In some cases you may not want to take a fixed number of values from an iterator, but instead you want to take values until some condition is met.

To accomplish this, I'll create a TakeWhile iterator, which takes values from its input iterator so long as they pass some test.


In [43]:
immutable TakeWhile{I}
    xs::I
    cond::Function
end

In [44]:
takewhile(xs, cond) = TakeWhile(xs, cond)

Base.start(it::TakeWhile) = start(it.xs)
Base.next(it::TakeWhile, state) = next(it.xs, state)

function Base.done(it::TakeWhile, state)
    done(it.xs, state) && return true  # check first, so next below can't error
    i, _ = next(it, state)
    !it.cond(i)
end


Out[44]:
done (generic function with 67 methods)

In [45]:
tw = takewhile(1:10, x -> x^2 < 25)
collect(tw)


Out[45]:
4-element Array{Any,1}:
 1
 2
 3
 4
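Python ships this iterator in itertools as takewhile, with the same semantics:

```python
from itertools import takewhile

print(list(takewhile(lambda x: x * x < 25, range(1, 11))))  # [1, 2, 3, 4]
```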

Let's also create a TakeUntil iterator, which takes values until it finds one that passes the test. So the last value produced by this iterator will pass the test and all values before that will have failed.


In [46]:
immutable TakeUntil{I}
    xs::I
    cond::Function
end

In [47]:
takeuntil(xs, cond) = TakeUntil(xs, cond)
Base.start(it::TakeUntil) = start(it.xs), false

function Base.next(it::TakeUntil, state)
    i, s = next(it.xs, state[1])
    i, (s, it.cond(i))
end

function Base.done(it::TakeUntil, state)
    s, iscond = state
    iscond || done(it.xs, s)
end


Out[47]:
done (generic function with 68 methods)

In [48]:
collect(takeuntil(1:10, x -> x*x >= 25))  # stops at 5, the first value with x^2 >= 25


Out[48]:
5-element Array{Any,1}:
 1
 2
 3
 4
 5
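itertools has no takeuntil, but as with the Task version of ifilter, a generator makes it a three-liner (a sketch; the name takeuntil mirrors the Julia iterator above):

```python
def takeuntil(itr, cond):
    """Yield values up to and including the first one satisfying cond."""
    for i in itr:
        yield i
        if cond(i):
            return

print(list(takeuntil(range(1, 11), lambda x: x * x >= 25)))  # [1, 2, 3, 4, 5]
```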

Application: How long does it take to get a prime number from a Poisson process?

As an application of the TakeUntil iterator, an experiment. How many draws do we have to make from a Poisson process until we draw a prime number? For this example, I'll use a Poisson with mean 5,000.

In the code, we make a Repeat iterator that repeatedly draws from the Poisson. We pass this into takeuntil, which creates an iterator that draws from the Poisson until we find a prime number. While this is happening, we keep track of the number of steps we took through the iterator.


In [49]:
# Draw random integers from a distribution d until you get a prime number.
# Return the number of draws.
function primetime(d, dparams)
    randgen = () -> rand(d(dparams...))
    tu = takeuntil(repeated(randgen), isprime)
    time = 0
    for i = tu
        time += 1
    end
    time
end

primetime_poiss5k = () -> primetime(Poisson, 5000)


Out[49]:
(anonymous function)

What's the average wait for a prime? Repeating the experiment 10,000 times, we find the average number of draws is between 7 and 8.


In [50]:
mean(repeated(primetime_poiss5k, 10_000))


Out[50]:
7.5997

To see the distribution of waiting times, I'll collect each repetition of the experiment in an array that we can plot.


In [51]:
times = [t::Int for t = repeated(primetime_poiss5k, 10_000)]


Out[51]:
10000-element Array{Int64,1}:
  8
 12
  1
  1
  1
  9
  4
 18
  1
  4
  1
 12
  3
  ⋮
 21
  9
 19
  2
  2
  2
  9
 10
  6
  2
 10
  5

In [52]:
set_default_plot_size(860px, 500px)
p = plot(x=times, Geom.histogram, Scale.x_discrete,
         Guide.title("Distribution of waiting times until a prime number\nfrom a Poisson(5000) process (N=10,000)"),
         Guide.xlabel("Time"))


Out[52]:
[plot: histogram of waiting times until a prime from a Poisson(5000) process]
Partition

The Partition iterator splits its input iterator into pieces, producing an iterator over iterators. For example, we could use it to partition the Range iterator 1:100 into two iterators, 1:50 and 51:100. We can also make overlapping partitions, e.g. 1:50, 2:51, 3:52, and so on.
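In Python, a deque with a maxlen makes a serviceable sketch of the overlapping case (partition1 is my own hypothetical helper, a rough stand-in for Iterators.partition(itr, width, 1), not part of itertools):

```python
from collections import deque

def partition1(itr, width):
    """Overlapping windows of length `width`, advancing one element per step."""
    window = deque(maxlen=width)  # appending to a full deque drops the left end
    for x in itr:
        window.append(x)
        if len(window) == width:
            yield tuple(window)

print(list(partition1(range(1, 6), 3)))  # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print([sum(w) / 3 for w in partition1(range(1, 6), 3)])  # [2.0, 3.0, 4.0]
```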

Application: Moving Averages

One useful application of overlapping partitions is computing moving averages. The following code downloads Google's historical stock prices from Yahoo Finance and computes the 60-day moving average.

First, we download the data, creating a 2-D array containing dates, volumes, and prices.


In [53]:
const googdata =
    "http://ichart.finance.yahoo.com/table.csv?s=GOOG&d=0&e=7&f=2014&g=d&a=0&b=7&c=2013&ignore=.csv" |>
        download |>
        open |>
        readall |>
        s -> split(s, "\n") |> 
        a -> map(l -> split(l, ","), a) |>
        a -> filter(l -> contains(l[1], "201"), a) |>
        reverse  # Dates start at most recent, so reverse for chron order.


Out[53]:
253-element Array{Array{String,1},1}:
 ["2013-01-07","735.45","739.38","730.58","734.75","1655700","734.75"]     
 ["2013-01-08","735.54","736.30","724.43","733.30","1676100","733.30"]     
 ["2013-01-09","732.27","738.35","728.60","738.12","2024700","738.12"]     
 ["2013-01-10","742.83","745.00","733.50","741.48","1835700","741.48"]     
 ["2013-01-11","742.00","742.43","736.30","739.99","1285200","739.99"]     
 ["2013-01-14","737.00","742.20","722.35","723.25","2863900","723.25"]     
 ["2013-01-15","719.33","735.00","712.10","724.93","3927700","724.93"]     
 ["2013-01-16","722.40","724.34","713.67","715.19","2023400","715.19"]     
 ["2013-01-17","717.71","719.64","711.02","711.32","2211500","711.32"]     
 ["2013-01-18","710.36","712.77","701.33","704.51","3226800","704.51"]     
 ["2013-01-22","704.66","705.34","695.52","702.87","3792400","702.87"]     
 ["2013-01-23","735.99","749.00","735.79","741.50","5909100","741.50"]     
 ["2013-01-24","741.24","756.83","740.51","754.21","3382700","754.21"]     
 ⋮                                                                         
 ["2013-12-19","1080.77","1091.99","1079.08","1086.22","1665700","1086.22"]
 ["2013-12-20","1088.30","1101.17","1088.00","1100.62","3261600","1100.62"]
 ["2013-12-23","1107.84","1115.80","1105.12","1115.10","1721600","1115.10"]
 ["2013-12-24","1114.97","1115.24","1108.10","1111.84","734200","1111.84"] 
 ["2013-12-26","1114.01","1119.00","1108.69","1117.46","1337800","1117.46"]
 ["2013-12-27","1120.00","1120.28","1112.94","1118.40","1569700","1118.40"]
 ["2013-12-30","1120.34","1120.50","1109.02","1109.46","1236100","1109.46"]
 ["2013-12-31","1112.24","1121.00","1106.26","1120.71","1357900","1120.71"]
 ["2014-01-02","1115.46","1117.75","1108.26","1113.12","1821400","1113.12"]
 ["2014-01-03","1115.00","1116.93","1104.93","1105.00","1666700","1105.00"]
 ["2014-01-06","1113.01","1118.86","1106.44","1117.32","1769300","1117.32"]
 ["2014-01-07","1125.00","1139.69","1121.16","1138.86","2552600","1138.86"]

We then create iterators over the dates and closing prices in the Array. These iteratively extract and parse values from the relevant columns.


In [54]:
dates = imap(r -> date(r[1]), googdata)
close = imap(r -> parsefloat(r[7]), googdata)


Out[54]:
IMap((anonymous function),{[["2013-01-07","735.45","739.38","730.58","734.75","1655700","734.75"],["2013-01-08","735.54","736.30","724.43","733.30","1676100","733.30"],["2013-01-09","732.27","738.35","728.60","738.12","2024700","738.12"],["2013-01-10","742.83","745.00","733.50","741.48","1835700","741.48"],["2013-01-11","742.00","742.43","736.30","739.99","1285200","739.99"],["2013-01-14","737.00","742.20","722.35","723.25","2863900","723.25"],["2013-01-15","719.33","735.00","712.10","724.93","3927700","724.93"],["2013-01-16","722.40","724.34","713.67","715.19","2023400","715.19"],["2013-01-17","717.71","719.64","711.02","711.32","2211500","711.32"],["2013-01-18","710.36","712.77","701.33","704.51","3226800","704.51"]  …  ["2013-12-23","1107.84","1115.80","1105.12","1115.10","1721600","1115.10"],["2013-12-24","1114.97","1115.24","1108.10","1111.84","734200","1111.84"],["2013-12-26","1114.01","1119.00","1108.69","1117.46","1337800","1117.46"],["2013-12-27","1120.00","1120.28","1112.94","1118.40","1569700","1118.40"],["2013-12-30","1120.34","1120.50","1109.02","1109.46","1236100","1109.46"],["2013-12-31","1112.24","1121.00","1106.26","1120.71","1357900","1120.71"],["2014-01-02","1115.46","1117.75","1108.26","1113.12","1821400","1113.12"],["2014-01-03","1115.00","1116.93","1104.93","1105.00","1666700","1105.00"],["2014-01-06","1113.01","1118.86","1106.44","1117.32","1769300","1117.32"],["2014-01-07","1125.00","1139.69","1121.16","1138.86","2552600","1138.86"]]})

Now we can make 60-day sub-period partitions and compute the average of each. Since I'm using imap, nothing has been calculated yet. These are all just iterators promising to do work when called.


In [55]:
ma60 = imap(mean, partition(close, 60, 1))
# NB: The Julian way to do this would be [mean(close[i-59:i]) for i = 60:length(close)]


Out[55]:
IMap(mean,{Partition{IMap}(IMap((anonymous function),{[["2013-01-07","735.45","739.38","730.58","734.75","1655700","734.75"],["2013-01-08","735.54","736.30","724.43","733.30","1676100","733.30"],["2013-01-09","732.27","738.35","728.60","738.12","2024700","738.12"],["2013-01-10","742.83","745.00","733.50","741.48","1835700","741.48"],["2013-01-11","742.00","742.43","736.30","739.99","1285200","739.99"],["2013-01-14","737.00","742.20","722.35","723.25","2863900","723.25"],["2013-01-15","719.33","735.00","712.10","724.93","3927700","724.93"],["2013-01-16","722.40","724.34","713.67","715.19","2023400","715.19"],["2013-01-17","717.71","719.64","711.02","711.32","2211500","711.32"],["2013-01-18","710.36","712.77","701.33","704.51","3226800","704.51"]  …  ["2013-12-23","1107.84","1115.80","1105.12","1115.10","1721600","1115.10"],["2013-12-24","1114.97","1115.24","1108.10","1111.84","734200","1111.84"],["2013-12-26","1114.01","1119.00","1108.69","1117.46","1337800","1117.46"],["2013-12-27","1120.00","1120.28","1112.94","1118.40","1569700","1118.40"],["2013-12-30","1120.34","1120.50","1109.02","1109.46","1236100","1109.46"],["2013-12-31","1112.24","1121.00","1106.26","1120.71","1357900","1120.71"],["2014-01-02","1115.46","1117.75","1108.26","1113.12","1821400","1113.12"],["2014-01-03","1115.00","1116.93","1104.93","1105.00","1666700","1105.00"],["2014-01-06","1113.01","1118.86","1106.44","1117.32","1769300","1117.32"],["2014-01-07","1125.00","1139.69","1121.16","1138.86","2552600","1138.86"]]}),60,1)})

With all these useful iterators defined, I can just collect them into arrays for plotting.


In [56]:
plot(layer(x=collect(dates), y=collect(close), Geom.line),
     layer(x=collect(dates)[60:end], y = collect(ma60), Geom.line),
     Guide.ylabel("Price"), Guide.title("GOOG Daily Stock Price and 60-Day Moving Avg."))


Out[56]:

Groupby

While the Partition iterator makes partitions of specified lengths, the Groupby iterator splits an iterator based on some condition. One caveat is that the input iterator has to be sorted on the groupby condition, so that values with the same condition are adjacent in the iterator.
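To see why the sorted-input caveat matters, here is a minimal hand-rolled sketch in current Julia syntax (not the Iterators.jl implementation; `group_adjacent` is a hypothetical helper): grouping only merges *adjacent* values with the same key, so an unsorted input splits a group into pieces.

```julia
# Group consecutive elements of xs that share the same key f(x).
# This mimics the adjacency behavior of a groupby-style iterator.
function group_adjacent(f, xs)
    groups = Vector{Vector{eltype(xs)}}()
    for x in xs
        if !isempty(groups) && f(last(groups)[1]) == f(x)
            push!(last(groups), x)   # same key as the current group: extend it
        else
            push!(groups, [x])       # new key: start a fresh group
        end
    end
    groups
end

unsorted = ["apple", "avocado", "banana", "apricot"]
group_adjacent(first, unsorted)        # 3 groups: the 'a' words get split
group_adjacent(first, sort(unsorted))  # 2 groups: one per leading letter
```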

Application: Benford's law in labor force figures

In this example, I'm going to look at Benford's Law using the Groupby iterator. Benford's Law posits that the leading digits of numbers in many data sources follow a regular distribution. I'll use the Groupby iterator to group the data by first digit and check this.

The data I'll examine is the size of the labor force population in each U.S. county in 2012.


In [57]:
const lfdata = "http://www.bls.gov/lau/laucnty12.txt" |>
                   download |>
                   open |>
                   readall |>
                   s -> split(s, "\r\n") |>
                   a -> filter(x -> length(x) == 125, a) |>  # Rows with data
                   a -> map(x -> strip(x[79:92]), a) |>      # Column w/ LF data
                   a -> map(x -> replace(x, ",", ""), a) |>  # 1,000 -> 1000
                   x -> x[2:end] |>                          # Remove header
                   sort


Out[57]:
3220-element Array{ASCIIString,1}:
 "1000"  
 "10016" 
 "1002"  
 "100207"
 "10022" 
 "10028" 
 "10030" 
 "100354"
 "100367"
 "100447"
 "10055" 
 "10070" 
 "10084" 
 ⋮       
 "9927"  
 "99322" 
 "9936"  
 "9956"  
 "9958"  
 "9963"  
 "9968"  
 "99683" 
 "997"   
 "9975"  
 "9994"  
 "9998"  

The analysis is simple with a Groupby iterator. It splits up the data by leading digit, and then I just calculate the frequency of each leading digit by taking the length of each leading-digit group as a share of the total length of the data.


In [58]:
dgroups = Iterators.groupby(lfdata, s -> s[1]) # Groups by first digit
# Extract the digit from the group members
digits = imap(i -> parseint(string(i[1][1])), dgroups)
# Compute the frequency
frequency = imap(x -> length(x) / length(lfdata), dgroups)


Out[58]:
IMap((anonymous function),{GroupBy{Array{ASCIIString,1}}(["1000","10016","1002","100207","10022","10028","10030","100354","100367","100447"  …  "9936","9956","9958","9963","9968","99683","997","9975","9994","9998"],(anonymous function))})

Benford's Law posits that the frequency of digit d in data should be log10(d+1) - log10(d). This function prints out a table of the observed frequencies next to the expected frequencies per Benford's Law.
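The expected frequencies are easy to sanity-check on their own; a quick sketch in current Julia syntax (the `benford` helper name is mine):

```julia
# Benford's Law: P(leading digit = d) = log10(d + 1) - log10(d)
benford(d) = log10(d + 1) - log10(d)

expected = [benford(d) for d in 1:9]

expected[1]     # digit 1 leads roughly 30% of the time (log10(2) ≈ 0.30103)
sum(expected)   # the nine probabilities telescope to log10(10) = 1
```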


In [59]:
function benfordcheck(obs_freqs, digits)
    pred_freqs = map(d -> log10(d+1) - log10(d), digits)
    println("")
    println("Digit Frequency Compared to Benford's Law")
    println("=========================================")
    println("")
    println("Digit  Observed  Expected  Difference")
    for (d, o, e) in zip(digits, obs_freqs, pred_freqs)
        @printf("%5d %9.2f %9.2f %11.2f\n", d, 100*o, 100*e, 100*(o-e))
    end
end

benfordcheck(frequency, digits)


Digit Frequency Compared to Benford's Law
=========================================

Digit  Observed  Expected  Difference
    1     30.09     30.10       -0.01
    2     16.46     17.61       -1.15
    3     12.02     12.49       -0.48
    4      9.72      9.69        0.03
    5      8.29      7.92        0.37
    6      6.30      6.69       -0.39
    7      6.02      5.80        0.23
    8      5.84      5.12        0.72
    9      5.25      4.58        0.67

We can see the labor force data follows Benford's Law quite closely.

To plot the comparison, I can collect the values from our iterators into a DataFrame.


In [60]:
benford_df = DataFrame(digits = collect(digits),
                       observed = collect(frequency),
                       expected = map(d -> log10(d+1) - log10(d), digits))

benford_df = stack(benford_df, [2, 3])


Out[60]:
18x3 DataFrame:
           variable     value digits
[1,]     "observed"  0.300932      1
[2,]     "observed"  0.164596      2
[3,]     "observed"  0.120186      3
[4,]     "observed"  0.097205      4
[5,]     "observed" 0.0829193      5
[6,]     "observed" 0.0630435      6
[7,]     "observed" 0.0602484      7
[8,]     "observed" 0.0583851      8
[9,]     "observed" 0.0524845      9
[10,]    "expected"   0.30103      1
[11,]    "expected"  0.176091      2
[12,]    "expected"  0.124939      3
[13,]    "expected"   0.09691      4
[14,]    "expected" 0.0791812      5
[15,]    "expected" 0.0669468      6
[16,]    "expected" 0.0579919      7
[17,]    "expected" 0.0511525      8
[18,]    "expected" 0.0457575      9

In [61]:
set_default_plot_size(800px, 500px)
plot(benford_df, x="digits", y="value", color="variable",
     Geom.line, Geom.point,
     Scale.x_discrete,
     Guide.title("Benford's Law in 2012 U.S. County Labor Force Data"),
     Guide.xlabel("Leading Digit"),
     Guide.ylabel("Digit Frequency"))


Out[61]:

Iterate

Though its name might be a little confusing, the Iterate iterator is one of my favorites. It recursively applies a function to a starting value, that is, f(...f(f(f(x))).... I come across applications for it all over the place.
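A minimal eager sketch of the idea in plain Julia (current syntax; `orbit` is a hypothetical helper, not the Iterators.jl type, which produces the same sequence lazily):

```julia
# First n values of the orbit x, f(x), f(f(x)), ... computed with a loop.
function orbit(f, x, n)
    out = [x]
    for _ in 1:n-1
        x = f(x)
        push!(out, x)
    end
    out
end

# f(x) = x/2 + 1 has fixed point 2, so the iterates march toward it.
orbit(x -> x / 2 + 1, 0.0, 5)   # [0.0, 1.0, 1.5, 1.75, 1.875]
```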

Application: Autoregressive time series processes

One application is producing autoregressive time series processes. An AR(1) process has the form $x_{t+1} = \varphi x_t + \varepsilon_{t+1}$, where $\varepsilon$ is some random noise. If we define the function $f(x) = \varphi x + \varepsilon$, then $x_{t+2}$ as a function of $x_t$ is $f(f(x_t))$. Subsequent values can be similarly produced by iteratively applying the function.

First, the code for the AR(1) function itself, along with a helper function for plotting a realization of the process.


In [62]:
function ar(phi::Float64, sigma::Float64)
    x -> phi * x + sigma * randn()
end


Out[62]:
ar (generic function with 1 method)

In [63]:
plotar(arseq, title) = plot(x=1:length(arseq), y=arseq, Geom.line,
                            Guide.xlabel("Time"), Guide.ylabel(""), Guide.title(title))


Out[63]:
plotar (generic function with 1 method)

Defining a coefficient and a standard deviation for the noise, I pass them through a pipeline that creates an iterator recursively applying the function, starting from a randomly drawn initial value. Then I collect 250 values from that iterator and plot them.


In [64]:
const ar1coef = 0.9
const ar1sigma = 0.15


Out[64]:
0.15

In [65]:
(ar1coef, ar1sigma) |>
    x -> apply(ar, x) |>
    f -> iterate(f, ar1sigma*rand()) |>
    i -> collect(take(i, 250)) |>
    s -> plotar(s, "AR(1) Time Series")


Out[65]:

This idea can be extended to an AR($p$) process, where the current value of $x$ depends on several past values. Whereas the coefficient was a scalar in the AR(1) case, here the coefficients form a vector, and the recursion is written in companion form: the state is the vector of the last $p$ values, the first row of the matrix holds the coefficients, and the rows below it shift the old values down. Noise enters only through the first component.


In [66]:
function ar(coeffs::AbstractVector{Float64}, sigma::Float64)
    p = length(coeffs)
    Phi = [coeffs', eye(p)[1:(end-1),:]]
    Sigma = [sigma, zeros(p-1)]
    x -> Phi * x + Sigma .* randn(p)                                           
end


Out[66]:
ar (generic function with 2 methods)
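As a sanity check on the companion-matrix form, here is a sketch in current Julia syntax (the post predates Julia 1.0, so this rebuilds the matrix without `eye`): with the noise turned off, one step of the matrix recursion should reproduce the scalar AR(p) recursion and shift the old values down.

```julia
coeffs = [0.9, -0.1, -0.25]
p = length(coeffs)

# First row holds the coefficients; the rows below form a shift matrix
# that moves x_t and x_{t-1} down one slot in the state vector.
Phi = vcat(coeffs', [float(i == j) for i in 1:p-1, j in 1:p])

state = [3.0, 2.0, 1.0]   # [x_t, x_{t-1}, x_{t-2}]
next_state = Phi * state

# Scalar recursion for comparison.
x_next = 0.9 * 3.0 - 0.1 * 2.0 - 0.25 * 1.0

next_state[1] ≈ x_next            # new value matches the scalar form
next_state[2:3] == state[1:2]     # old values shifted down
```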

In [67]:
const ar3coeffs = [.9, -.1, -.25]
const ar3sigma = 0.15


Out[67]:
0.15

In [68]:
(ar3coeffs, ar3sigma) |>
    x -> apply(ar, x) |>
    f -> iterate(f, ar3sigma*rand(3)) |>
    i -> map(first, take(i, 250)) |>
    s -> plotar(s, "AR(3) Time Series")


Out[68]:

Conclusion

Most iteration you'll see in the wild uses simple collections or ranges as the iterator, performing extensive work inside the loop. Sometimes a problem can be better expressed using more complicated iterators whose structure represents the logic of the iteration. One thing to notice in all the examples is that once the iterators were defined, there was very little left to do after iterating over them: typically I was just collecting the iteration values into an array, or reducing them with an operation to a scalar result. We were also able to build the problems in such a way that calculation of the iterators' values was delayed until absolutely necessary.

There are tradeoffs to this sort of style, and much of the material in this post was more cute than practical. But it was a fun exploration of how to create types that interact with protocols in Julia. Julia's type system and dispatch design are very powerful and interesting, and give programmers a lot of flexibility in how they structure their programs.

Footnotes

  1. While we'll see lots of examples of extending Julia's base functions dispatched on newly-defined types, we won't see much multiple dispatch, which is an important design feature of Julia. In fact, pretty much everything here could be implemented in a single-dispatch OO language.
  2. Pythonistas may be thinking about the distinction between iterators and iterables. (See, e.g., this Stack Overflow thread.) That distinction doesn't really apply to Julia, so I won't use the term iterable here; I'll define an iterator in the two ways discussed above: (1) it is valid in a for i = I statement, and (2) it implements the iterator protocol.
  3. This limitation seems to come from the idea that only iterators with known lengths can be counted on to produce multidimensional arrays. This may be changed in future versions of Julia. See, e.g. Issue #550. The collect function uses the push! function to dynamically allocate the array, but collect can only give a 1-D Array output, whereas comprehensions can be multidimensional.