XGBoost.jl - quickest ways to win data science competitions

Here is an example of how a single non-ensembled model can achieve high ranking scores using XGBoost, which is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.

Highlights of XGBoost

### Distributed on Cloud Supports distributed training on multiple machines, including AWS, GCE, Azure, and Yarn clusters. Can be integrated with Flink, Spark and other cloud dataflow systems.
### Battle-tested Wins many data science and machine learning challenges. Used in production by multiple companies.
### Flexible Supports regression, classification, ranking and user defined objectives.

Distinguishing poisonous vs edible mushrooms

Based on 8142 instances and 22 attributes like, odor, habitat, color, etc, we can easily and accurately classify mushrooms as poisonous or edible and in few cases of unknown edibility and hence not recommended.

The Agaricus genus contains the most widely consumed and best-known mushroom today, but there are poisonous ones among them as well. The dataset consists of 8142 observations of Agaricus and Lepiota Family, this is a multivariate dataset with 22 characteristic attributes and classified into 2 classes, edible and poisonous.



In [34]:

    
using XGBoost, DataFrames, Gadfly, GLM
include("$(Pkg.dir())/MLDemos/src/xgboost/mushroom.jl");
path = "$(Pkg.dir())/MLDemos/";

We use auxiliary function to read LIBSVM format into julia Matrix.

Each line represent a single instance, and in the first line '1' is the instance label,'101' and '102' are feature indices, '1.2' and '0.03' are feature values.

Ex. :

1 2:1 9:1 10:1 20:1 29:1 33:1 35:1 39:1 40:1 52:1 57:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 104:1 116:1 123:1

0 2:1 9:1 19:1 20:1 22:1 33:1 35:1 38:1 40:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 115:1 119:1

Attribute Information:

cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
bruises?: bruises=t,no=f
odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
gill-attachment: attached=a,descending=d,free=f,notched=n
gill-spacing: close=c,crowded=w,distant=d
gill-size: broad=b,narrow=n
gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
stalk-shape: enlarging=e,tapering=t
stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
veil-type: partial=p,universal=u
veil-color: brown=n,orange=o,white=w,yellow=y
ring-number: none=n,one=o,two=t
ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d



In [40]:

    
attribute_dict = Dict(1=>"cap-shape:bell",2=>"cap-shape:conical",3=>"cap-shape:convex",4=>"cap-shape:flat",5=>"cap-shape:knobbed",6=>"cap-shape:sunken",
7=>"cap-surface: fibrous",8=>"cap-surface:grooves",9=>"cap-surface:scaly",10=>"cap-surface:smooth",
11=>"cap-color: brown",12=>"cap-color: buff",13=>"cap-color: cinnamon",14=>"cap-color: gray",15=>"cap-color: green", 16=>"cap-color: pink",17=>"cap-color: purple",18=>"cap-color: red",19=>"cap-color: white",20=>"cap-color: yellow",
21=>"bruises?: yes",22=>"bruises?: no",
23=>"odor: almond",24=>"odor: anise",25=>"odor: acreosote",26=>"odor: fishy",27=>"odor: foul", 28=>"odor: musty",29=>"odor: none",30=>"odor: pungent",31=>"odor: spicy",
32=>"gill-attachment: attached",33=>"gill-attachment: descending",34=>"gill-attachment: free",35=>"gill-attachment: notched",
36=>"gill-spacing: close",37=>"gill-spacing: crowded",38=>"gill-spacing: distant",
39=>"gill-size: broad",40=>"gill-size: narrow",
41=>"gill-color: black",42=>"gill-color: brown",43=>"gill-color: buff",44=>"gill-color: chocolate",45=>"gill-color: gray", 46=>"gill-color: green",47=>"gill-color: orange",48=>"gill-color: pink",49=>"gill-color: purple",50=>"gill-color: red", 51=>"gill-color: white",52=>"gill-color: yellow",
53=>"stalk-shape: enlarging",54=>"stalk-shape: tapering",
55=>"stalk-root: bulbous",56=>"stalk-root: club",57=>"stalk-root: cup",58=>"stalk-root: equal", 59=>"stalk-root: rhizomorphs",60=>"stalk-root: rooted",61=>"stalk-root: missing",
62=>"stalk-surface-above-ring: fibrous",63=>"stalk-surface-above-ring: scaly",64=>"stalk-surface-above-ring: silky",65=>"stalk-surface-above-ring: smooth",
66=>"stalk-color-below-ring: brown",67=>"stalk-color-below-ring: buff",68=>"stalk-color-below-ring: cinnamon",69=>"stalk-color-below-ring: gray",70=>"stalk-color-below-ring: orange", 71=>"stalk-color-below-ring: pink",72=>"stalk-color-below-ring: red",73=>"stalk-color-below-ring: white",74=>"stalk-color-below-ring: yellow",
75=>"veil-type: partial",76=>"veil-type: universal",
77=>"veil-color: brown",78=>"veil-color: orange",79=>"veil-color: white",80=>"veil-color: yellow",
81=>"ring-number: none",82=>"ring-number: one",83=>"ring-number: two",
84=>"ring-type: cobwebby",85=>"ring-type: evanescent",86=>"ring-type: flaring",87=>"ring-type: large", 88=>"ring-type: none",89=>"ring-type: pendant",90=>"ring-type: sheathing",91=>"ring-type: zone",
92=>"spore-print-color: black",93=>"spore-print-color: brown",94=>"spore-print-color: buff",95=>"spore-print-color: chocolate",96=>"spore-print-color: green", 97=>"spore-print-color: orange",98=>"spore-print-color: purple",99=>"spore-print-color: white",100=>"spore-print-color: yellow",
101=>"population: abundant",102=>"population: clustered",103=>"population: numerous", 104=>"population: scattered",105=>"population: several",106=>"population: solitary",
107=>"habitat: grasses",108=>"habitat: leaves",109=>"habitat: meadows",110=>"habitat: paths", 111=>"habitat: urban",112=>"habitat: waste",113=>"habitat: woods")









    Out[40]:





Dict{Int64,ASCIIString} with 112 entries:
  68  => "stalk-color-below-ring: cinnamon"
  89  => "ring-type: pendant"
  11  => "cap-color: brown"
  39  => "gill-size: broad"
  46  => "gill-color: green"
  85  => "ring-type: evanescent"
  25  => "odor: acreosote"
  55  => "stalk-root: bulbous"
  42  => "gill-color: brown"
  29  => "odor: none"
  58  => "stalk-root: equal"
  66  => "stalk-color-below-ring: brown"
  59  => "stalk-root: rhizomorphs"
  8   => "cap-surface:grooves"
  74  => "stalk-color-below-ring: yellow"
  95  => "spore-print-color: chocolate"
  57  => "stalk-root: cup"
  20  => "cap-color: yellow"
  90  => "ring-type: sheathing"
  111 => "habitat: urban"
  14  => "cap-color: gray"
  31  => "odor: spicy"
  78  => "veil-color: orange"
  112 => "habitat: waste"
  70  => "stalk-color-below-ring: orange"
  ⋮   => ⋮



In [4]:

    
train_X, train_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.train", (6513, 126));
test_X, test_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.test", (1611, 126));

Basic training using XGBoost :

You can directly pass julia's matrix as data,



In [52]:

    
num_round = 2;

print("training xgboost with dense matrix\n");
@time bst1 = xgboost(train_X, num_round, label = train_Y, eta=1, max_depth=2, objective="binary:logistic");









    



training xgboost with dense matrix
  0.024368 seconds (146 allocations: 3.137 MB)






    



[1]	train-error:0.046522
[2]	train-error:0.022263

Alternatively, you can sparse matrix as data and also pass parameters in as a map



In [8]:

    
print("training xgboost with sparse matrix\n");
sptrain = sparse(train_X);
param = ["max_depth"=>2, "eta"=>1, "objective"=>"binary:logistic"]
@time bst = xgboost(sptrain, num_round, label = train_Y, param=param)









    



training xgboost with sparse matrix
  0.005247 seconds (131 allocations: 1.648 MB)






    



WARNING: deprecated syntax "[a=>b, ...]" at In[8]:3.
Use "Dict(a=>b, ...)" instead.
[1]	train-error:0.046522
[2]	train-error:0.022263






    Out[8]:





XGBoost.Booster(Ptr{Void} @0x00000000045db0c0)

You can also put in xgboost's DMatrix object. DMatrix stores label, data and other metadata needed for advanced features



In [51]:

    
print("training xgboost with DMatrix\n")
dtrain = DMatrix(train_X, label = train_Y)
println(num_round)
@time bst = xgboost(dtrain, num_round, eta = 1, objective = "binary:logistic")









    



training xgboost with DMatrix
4
  0.030044 seconds (196 allocations: 8.766 KB)






    



[1]	train-error:0.000614
[2]	train-error:0.000000
[3]	train-error:0.000000
[4]	train-error:0.000000






    Out[51]:





XGBoost.Booster(Ptr{Void} @0x0000000006009950)

-----Basic prediction using XGBoost-----

You can put in Matrix, SparseMatrix or DMatrix



In [7]:

    
preds1 = predict(bst1, test_X)
print("test-error=", sum((preds1 .> 0.5) .!= test_Y) / float(size(preds1)[1]), "\n")









    



test-error=0.021725636250775917

The error is only 2.1%.

Lets find out if we can eat a few test mushrooms :



In [75]:

    
# To find if the mushroon in the test set(1<n<1611) is edible or not: ~= 1 edible, `= 0 poisonous.
n=3
attribs = find(test_X[n,:])
@show int(preds[n])
describe_mushroom(n);









    



int(preds[n]) = 0
cap-shape:bell
cap-surface:scaly
cap-color: yellow
bruises?: yes
odor: anise
gill-attachment: free
gill-spacing: close
gill-size: broad
gill-color: black
stalk-shape: enlarging
stalk-root: club
stalk-surface-above-ring: smooth
stalk-color-below-ring: gray
veil-color: brown
ring-type: flaring
ring-type: none
spore-print-color: black
spore-print-color: chocolate
population: clustered
population: solitary
stalk-surface-below-ring: smooth






    



LoadError: KeyError: 122 not found
while loading In[75], in expression starting on line 5

 in getindex at dict.jl:724
 in describe_mushroom at In[67]:3



In [67]:

    
function describe_mushroom(n)
    for i =1:length(attribs)
        println(attribute_dict[attribs[i]])
    end
end









    Out[67]:





describe_mushroom (generic function with 1 method)

XGBoost to solve as linear solver :

Solving the same problem not using linear models instead of trees,



In [35]:

    
param_lm = Dict("booster"=>"gblinear", "eta"=>1, "silent"=>0,
         "objective"=>"binary:logistic", "alpha"=>0.0001, "lambda"=>1)









    Out[35]:





Dict{ASCIIString,Any} with 6 entries:
  "alpha"     => 0.0001
  "objective" => "binary:logistic"
  "lambda"    => 1
  "eta"       => 1
  "booster"   => "gblinear"
  "silent"    => 0



In [18]:

    
dtrain = DMatrix("../data/mushroom/agaricus.txt.train")
dtest = DMatrix("../data/mushroom/agaricus.txt.test")
watchlist  = [(dtest,"eval"), (dtrain,"train")]
num_round = 4









    



6513x126 matrix with 143286 entries is loaded from ../data/mushroom/agaricus.txt.train
1611x126 matrix with 35442 entries is loaded from ../data/mushroom/agaricus.txt.test






    Out[18]:





4



In [36]:

    
bst = xgboost(dtrain, num_round, param=param_lm, watchlist=watchlist)









    



[1]	eval-error:0.065798	train-error:0.051589
[2]	eval-error:0.006828	train-error:0.008291
[3]	eval-error:0.001241	train-error:0.003531
[4]	eval-error:0.000621	train-error:0.001382






    Out[36]:





XGBoost.Booster(Ptr{Void} @0x0000000005cbe790)



In [22]:

    
preds_glm = predict(bst, dtest)









    Out[22]:





1611-element Array{Float32,1}:
 0.0119101  
 0.986318   
 0.00978163 
 0.00433835 
 0.0479371  
 0.04251    
 0.972526   
 0.0180784  
 0.975881   
 0.00829592 
 0.981871   
 0.0150381  
 0.000560712
 ⋮          
 0.999467   
 0.0276137  
 0.0266209  
 0.0035387  
 0.993822   
 0.00399743 
 0.00268785 
 0.0376749  
 0.00671114 
 0.998771   
 0.0237168  
 0.997941



In [23]:

    
labels = get_info(dtest, "label")









    Out[23]:





1611-element Array{Float32,1}:
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 0.0
 0.0
 ⋮  
 1.0
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 1.0



In [24]:

    
print("test-error=", sum((preds .> 0.5) .!= labels) / float(size(preds)[1]), "\n")









    



test-error=0.0



In [ ]: