Julia is a high-level, dynamic programming language designed to meet the requirements of high-performance numerical and scientific computing while also being effective for general-purpose programming.
Julia's core is implemented in C and C++ and its parser in Scheme; the LLVM compiler framework is used for just-in-time (JIT) generation of machine code for x86(-64).
Development began in 2009, and the language was open-sourced in February 2012.
It currently has 250+ contributors to the language itself and 400+ overall.
Stable release: v0.2.1 (2014-02-11); pre-release: v0.3 (nightly builds).
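Because methods are compiled on first use via LLVM, the generated machine code can be inspected directly from the REPL. A minimal sketch, assuming v0.2/v0.3 where these tools are plain functions (the @code_native macro form arrived in later releases):
square(x) = x * x
code_native(square, (Int,))   # print the x86 machine code generated for square(::Int)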
In [1]:
subtypes(Number)
Out[1]:
In [2]:
subtypes(Real)
Out[2]:
In [3]:
subtypes(Integer)
Out[3]:
In [4]:
# Floating Point
@show 5/3
# Mathematical Constant
@show pi
# Rational
@show 2//3 + 1
# BigInt
@show big(2) ^ 1000 ;
In [5]:
subtypes(String)
Out[5]:
In [6]:
s = "Hello World"
@show typeof(s)
@show s[7] ;
In [7]:
# Unicode Names and Values
你好 = "(。◕_◕。)ノ "
@show typeof(你好)
@show 你好 ^ 3 ;
In [8]:
type NewType
    i::Integer
    s::String
end
new_t = NewType(33, "this is a NewType")
@show new_t.i
@show new_t.s ;
In [9]:
# Vectors
v = [1, 1]
Out[9]:
In [10]:
# Vector Operations
@show v + [2, 0] # vector addition
@show v + 1 # same as v + [1,1]
@show 5*v # scalar multiplication
Out[10]:
In [11]:
println( "Dot Product : ", dot(v, v) )
println( "Norm : ", norm(v) )
In [12]:
# Matrices
M = [1 1 ; 0 1]
Out[12]:
In [13]:
# Matrix Addition
M + 1 ,
M + [0 0 ; 5 5]
Out[13]:
In [14]:
# Matrix Multiplication
2M ,
M ^ 2 ,
M * v
Out[14]:
In [15]:
# Gaussian Elimination
b = M * v
M \ b # solve back for v
Out[15]:
In [16]:
# Named functions
f(x) = 10x
function g(x)
    return x * 10
end
@show f(5)
@show g(5) ;
In [17]:
# Anonymous functions assigned to variables
h = x -> x * 10
i = function(x)
    x * 10
end
@show h(5)
@show i(5) ;
In [18]:
# Operators are functions
+(4,5)
Out[18]:
In [19]:
p = +
p(2,3)
Out[19]:
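Since operators are ordinary functions, they can also be passed directly to higher-order functions; a small sketch:
reduce(+, [1, 2, 3, 4])    # 10
map(+, [1, 2], [10, 20])   # [11, 22]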
In [20]:
bar(x::String) = println("You entered the string: $x")
bar(x::Integer) = x * 10
bar(x::NewType) = println(x.s)
methods(bar)
Out[20]:
In [21]:
bar("Hello")
bar(new_t)
bar(5)
Out[21]:
In [22]:
# Adding strings (this errors: + is not defined for String values)
"Hello" + "World"
In [23]:
# But the addition operator is a function, so we can use multiple dispatch to define it for strings
+(a::String, b::String) = a * b
"Hello" + "World"
Out[23]:
In [24]:
+(a::Number, b::String) = string(a) + b
+(a::String, b::Number) = a + string(b)
99 + "bottles"
Out[24]:
In [25]:
# Method Overloading
type SimpleObject
    data::Union(Integer, String)
    set::Function
    function SimpleObject()
        this = new()
        this.data = ""
        function setter(x::Integer)
            println("Setting an integer")
            this.data = x
        end
        function setter(x::String)
            println("Setting a string")
            this.data = x
        end
        this.set = setter
        return this
    end
end
obj = SimpleObject()
obj.set(99)
obj.set("hello")
Out[25]:
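The same overloading is more often expressed with plain multiple dispatch on a free function rather than a closure stored inside the object; a sketch (set! is a hypothetical helper name, not part of the example above):
# set! is a hypothetical helper; dispatch picks the method from the argument type
set!(o::SimpleObject, x::Integer) = (println("Setting an integer"); o.data = x)
set!(o::SimpleObject, x::String)  = (println("Setting a string");  o.data = x)
set!(obj, 99)
set!(obj, "hello")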
In [26]:
# Sum of odd integers between 1 and 5
values = 1:5
myMapper = x -> x
myFilter = x -> x % 2 == 1
myReducer = (x,y) -> x + y
mapped = map( myMapper, values )
filtered = filter( myFilter, mapped )
reduced = reduce( myReducer, filtered )
Out[26]:
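For comparison, the same pipeline can be written more compactly with Base helpers; a sketch:
sum(filter(isodd, 1:5))   # 1 + 3 + 5 = 9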
In [27]:
# Code Generation
# Functions for exponentiating to the powers of 1 to 5
for n in 1:5
    s = "power$n(x) = x ^ $n"
    println(s)
    expression = parse(s)
    eval(expression)
end
power5( 2 )
Out[27]:
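Rather than building and parsing strings, the same functions can be generated directly from expressions; a sketch using symbol() and @eval, both in Base for these versions:
for n in 1:5
    fname = symbol("power$n")   # e.g. :power3
    @eval $fname(x) = x ^ $n    # defines power3(x) = x ^ 3, etc.
end
power3(2)                       # 8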
In [28]:
# Macros: Crude Timer Example
macro timeit(expression)
    quote
        t = time()
        result = $expression  # evaluate the expression being timed
        elapsed = time() - t
        println( "elapsed time: ", elapsed )
        result                # last value of the quote block, returned to the caller
    end
end
@timeit cos(2pi)
@timeit cos(2pi)
Out[28]:
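To see exactly what code the macro splices in at the call site, the expansion can be printed; a sketch:
macroexpand(:(@timeit cos(2pi)))   # show the expression the macro generates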
In [29]:
using StatsBase
x = rand(100) # uniform distribution [0,1)
println( "mean: ", mean(x) )
println( "variance: ", var(x) )
println( "skewness: ", skewness(x) )
println( "kurtosis: ", kurtosis(x) )
In [30]:
describe(x)
In [31]:
using Distributions
distr = Normal(0, 2)
println( "pdf @ origin = ", pdf(distr, 0.0) )
println( "cdf @ origin = ", cdf(distr, 0.0) )
In [32]:
x = rand(distr, 1000)
fit_mle(Normal, x)
Out[32]:
In [33]:
using DataFrames
df = DataFrame(
    A = [6, 3, 4],
    B = ["a", "b", "c"],
    C = [1//2, 3//4, 5//6],
    D = [true, true, false]
)
df[:C][2] = NA
df
Out[33]:
In [34]:
# Joins
names = DataFrame(ID = [5, 4], Name = ["Jack", "Jill"])
jobs = DataFrame(ID = [5, 4], Job = ["Lawyer", "Doctor"])
full = join(names, jobs, on = :ID)
Out[34]:
In [35]:
using RDatasets
iris = dataset("datasets", "iris")
head(iris)
Out[35]:
In [36]:
# Group by Species, then compute mean of PetalLength per group
by( iris, :Species, df -> mean(df[:PetalLength]) )
Out[36]:
In [37]:
using ASCIIPlots
x = iris[:PetalLength]
y = iris[:PetalWidth]
scatterplot(x, y)
Out[37]:
In [38]:
using Winston
scatter(x, y, ".")
xlabel("PetalLength")
ylabel("PetalWidth")
Out[38]:
In [39]:
using Gadfly
set_default_plot_size(20cm, 12cm)
plot(iris, x = "PetalLength", y = "PetalWidth", color = "Species", Geom.point)
Out[39]:
In [40]:
# K-means Clustering
using Clustering
features = array(iris[:, 1:4])' # use matrix() on Julia v0.2
result = kmeans( features, 3 ) # onto 3 clusters
plot(iris, x = "PetalLength", y = "PetalWidth", color = result.assignments, Geom.point)
Out[40]:
In [41]:
# Principal Component Analysis
using MultivariateStats
pc = fit(PCA, features; maxoutdim = 2)
reduced = transform(pc, features)
@show size(reduced)
plot(iris, x = reduced[1,:], y = reduced[2,:], color = "Species", Geom.point)
Out[41]:
In [42]:
using MultivariateStats
# Generate a noisy linear system
features = rand(1000, 3) # feature matrix
coeffs = rand(3) # ground truth of weights
targets = features * coeffs + 0.1 * randn(1000) # generate response
# Linear Least Square Regression
coeffs_llsq = llsq(features, targets; bias=false)
# Ridge Regression
coeffs_ridge = ridge(features, targets, 0.1; bias=false) # regularization coef = 0.1
@show coeffs
@show coeffs_llsq
@show coeffs_ridge ;
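For reference, with bias=false the ridge estimate has a closed form; a sketch of computing it by hand, assuming the same regularization coefficient of 0.1:
# (X'X + λI) \ X'y -- the penalized least-squares solution
lambda = 0.1
coeffs_manual = (features' * features + lambda * eye(3)) \ (features' * targets)
@show coeffs_manual ;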
In [43]:
# Cross Validation: K-Fold Example
using MLBase, MultivariateStats
n = length(targets)
# Define training and error evaluation functions
function training(inds)
    coeffs = ridge(features[inds, :], targets[inds], 0.1; bias=false)
    return coeffs
end
function error_evaluation(coeffs, inds)
    y = features[inds, :] * coeffs
    rms_error = sqrt(mean(abs2(targets[inds] .- y)))
    return rms_error
end
# Cross validate
scores = cross_validate(
    inds -> training(inds),
    (coeffs, inds) -> error_evaluation(coeffs, inds),
    n,            # total number of samples
    Kfold(n, 3))  # cross validation plan: 3-fold
# Get the mean and std of scores
@show scores
@show mean_and_std(scores) ;
In [44]:
# Model Tuning: Grid Search
using MLBase, MultivariateStats
# Hold out 20% of records for testing
n_test = int(length(targets) * 0.2)
train_rows = shuffle([1:length(targets)] .> n_test)
features_train, features_test = features[train_rows, :], features[!train_rows, :]
targets_train, targets_test = targets[train_rows], targets[!train_rows]
# Define estimation function
function estfun(regcoef, bias)
    coeffs = ridge(features_train, targets_train, regcoef; bias=bias)
    return bias ? (coeffs[1:end-1], coeffs[end]) : (coeffs, 0.0)
end
# Define error evaluation function as mean squared deviation
evalfun(model) = msd(features_test * model[1] + model[2], targets_test)
result = gridtune(estfun, evalfun,
    ("regcoef", [0.01, 0.1, 1.0]),
    ("bias", [true, false]);
    ord=Reverse,   # smaller msd value indicates a better model
    verbose=true)  # show progress information
best_model, best_config, best_score = result
# Print results
coeffs, bias = best_model
println("Best model:")
println(" coeffs = $(coeffs')"),
println(" bias = $bias")
println("Best config: regcoef = $(best_config[1]), bias = $(best_config[2])")
println("Best score: $(best_score)")
In [45]:
# Regression Tree
using DecisionTree
# Train model, make predictions on test records
model = build_tree(targets_train, features_train)
predictions = apply_tree(model, features_test)
@show cor(targets_test, predictions)
@show R2(targets_test, predictions)
scatter(targets_test, predictions, ".")
xlabel("actual"); ylabel("predicted")
Out[45]:
In [46]:
# Support Vector Machine
using LIBSVM
features = array(iris[:, 1:4])
labels = array(iris[:Species])
# Hold out 20% of records for testing
n_test = int(length(labels) * 0.2)
train_rows = shuffle([1:length(labels)] .> n_test)
features_train, features_test = features[train_rows, :], features[!train_rows, :]
labels_train, labels_test = labels[train_rows], labels[!train_rows]
model = svmtrain(labels_train, features_train')
(predictions, decision_values) = svmpredict(model, features_test')
confusion_matrix(labels_test, predictions)
Out[46]:
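A quick accuracy figure can also be computed directly from the predictions; a sketch:
mean(predictions .== labels_test)   # fraction of test records classified correctly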
In [47]:
# Random Forest
using DecisionTree
# Train forest using 2 random features per split and 10 trees
model = build_forest(labels_train, features_train, 2, 10)
predictions = apply_forest(model, features_test)
# Pretty print of one tree in forest
print_tree(model.trees[1])
confusion_matrix(labels_test, predictions)
Out[47]: