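Introduction

In this notebook we use the Bank Marketing Dataset to try out:

  • aggregation and plotting with DataFrames
  • the first steps of machine learning

First, download the data from the URL below and place bank.csv and bank-full.csv on JuliaBox.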

http://archive.ics.uci.edu/ml/datasets/Bank+Marketing


In [1]:
run(`wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip`)
run(`unzip bank.zip -d data`)


--2015-07-06 02:34:11--  http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.1.95
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.1.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 579043 (565K) [application/zip]
Saving to: 'bank.zip'

     0K .......... .......... .......... .......... ..........  8%  362K 1s
    50K .......... .......... .......... .......... .......... 17%  363K 1s
   100K .......... .......... .......... .......... .......... 26%  726K 1s
   150K .......... .......... .......... .......... .......... 35%  729K 1s
   200K .......... .......... .......... .......... .......... 44% 83.7M 1s
   250K .......... .......... .......... .......... .......... 53%  729K 0s
   300K .......... .......... .......... .......... .......... 61% 36.0M 0s
   350K .......... .......... .......... .......... .......... 70%  739K 0s
   400K .......... .......... .......... .......... .......... 79% 41.1M 0s
   450K .......... .......... .......... .......... .......... 88%  740K 0s
   500K .......... .......... .......... .......... .......... 97%  107M 0s
   550K .......... .....                                      100%  107M=0.6s

2015-07-06 02:34:11 (911 KB/s) - 'bank.zip' saved [579043/579043]

Archive:  bank.zip
  inflating: data/bank-full.csv
  inflating: data/bank-names.txt
  inflating: data/bank.csv

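Trying out DataFrame

In this section we try out DataFrame.

A DataFrame is a data structure that makes structured (tabular) data convenient to work with. Here we walk through an SQL-like processing flow.

The official documentation is somewhat terse, so weda_654's slides are an easier place to start.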


In [2]:
using DataFrames
table = readtable("data/bank.csv",  separator=';')


Out[2]:
    age  job            marital   education  default  balance  housing  loan  contact    day  month  duration  campaign  pdays  previous  poutcome  y
1   30   unemployed     married   primary    no       1787     no       no    cellular   19   oct    79        1         -1     0         unknown   no
2   33   services       married   secondary  no       4789     yes      yes   cellular   11   may    220       1         339    4         failure   no
3   35   management     single    tertiary   no       1350     yes      no    cellular   16   apr    185       1         330    1         failure   no
4   30   management     married   tertiary   no       1476     yes      yes   unknown    3    jun    199       4         -1     0         unknown   no
5   59   blue-collar    married   secondary  no       0        yes      no    unknown    5    may    226       1         -1     0         unknown   no
6   35   management     single    tertiary   no       747      no       no    cellular   23   feb    141       2         176    3         failure   no
7   36   self-employed  married   tertiary   no       307      yes      no    cellular   14   may    341       1         330    2         other     no
8   39   technician     married   secondary  no       147      yes      no    cellular   6    may    151       2         -1     0         unknown   no
9   41   entrepreneur   married   tertiary   no       221      yes      no    unknown    14   may    57        2         -1     0         unknown   no
10  43   services       married   primary    no       -88      yes      yes   cellular   17   apr    313       1         147    2         failure   no
11  39   services       married   secondary  no       9374     yes      no    unknown    20   may    273       1         -1     0         unknown   no
12  43   admin.         married   secondary  no       264      yes      no    cellular   17   apr    113       2         -1     0         unknown   no
13  36   technician     married   tertiary   no       1109     no       no    cellular   13   aug    328       2         -1     0         unknown   no
14  20   student        single    secondary  no       502      no       no    cellular   30   apr    261       1         -1     0         unknown   yes
15  31   blue-collar    married   secondary  no       360      yes      yes   cellular   29   jan    89        1         241    1         failure   no
16  40   management     married   tertiary   no       194      no       yes   cellular   29   aug    189       2         -1     0         unknown   no
17  56   technician     married   secondary  no       4073     no       no    cellular   27   aug    239       5         -1     0         unknown   no
18  37   admin.         single    tertiary   no       2317     yes      no    cellular   20   apr    114       1         152    2         failure   no
19  25   blue-collar    single    primary    no       -221     yes      no    unknown    23   may    250       1         -1     0         unknown   no
20  31   services       married   secondary  no       132      no       no    cellular   7    jul    148       1         152    1         other     no
21  38   management     divorced  unknown    no       0        yes      no    cellular   18   nov    96        2         -1     0         unknown   no
22  42   management     divorced  tertiary   no       16       no       no    cellular   19   nov    140       3         -1     0         unknown   no
23  44   services       single    secondary  no       106      no       no    unknown    12   jun    109       2         -1     0         unknown   no
24  44   entrepreneur   married   secondary  no       93       no       no    cellular   7    jul    125       2         -1     0         unknown   no
25  26   housemaid      married   tertiary   no       543      no       no    cellular   30   jan    169       3         -1     0         unknown   no
26  41   management     married   tertiary   no       5883     no       no    cellular   20   nov    182       2         -1     0         unknown   no
27  55   blue-collar    married   primary    no       627      yes      no    unknown    5    may    247       1         -1     0         unknown   no
28  67   retired        married   unknown    no       696      no       no    telephone  17   aug    119       1         105    2         failure   no
29  56   self-employed  married   secondary  no       784      no       yes   cellular   30   jul    149       2         -1     0         unknown   no
30  53   admin.         married   secondary  no       105      no       yes   cellular   21   aug    74        2         -1     0         unknown   no
⋮

Introducing DataFramesMeta

To make DataFrames more convenient to use, let's add DataFramesMeta. It lets us write queries in a more Pandas-like (or SQL-like) style.


In [3]:
Pkg.add("DataFramesMeta")


INFO: Cloning cache of DataFramesMeta from git://github.com/JuliaStats/DataFramesMeta.jl.git
INFO: Installing DataFramesMeta v0.0.1
INFO: Package database updated

In [4]:
using DataFramesMeta

In [5]:
x_thread = @linq table |> 
    where(:age .> 60) |>
    where(:housing .== "yes") |>
    orderby(:job)


Out[5]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
161admin.marriedunknownno4629yesnocellular27jan1811921successyes
261blue-collarmarriedprimaryno625yesnounknown19may1582-10unknownno
365housemaidmarriedprimaryno2179yesnocellular4sep1127-10unknownno
466managementmarriedtertiaryno1048yesnocellular23jun9712-10unknownno
563retiredmarriedsecondaryno415yesnocellular7oct3231-10unknownno
675retireddivorcedtertiaryno3810yesnocellular16nov26211831failureyes
763retiredmarriedtertiaryno133yesnocellular13feb1042-10unknownno
871retiredmarriedtertiaryno14220yesnocellular9sep3971-10unknownyes
961retiredmarriedprimaryno1060yesnounknown13may1181-10unknownno
1068retiredmarriedsecondaryno19317yesnocellular4aug2491-10unknownyes
1161retiredmarriedsecondaryno76yesnocellular15jul1957-10unknownno
1262self-employeddivorcedtertiaryno6yesnocellular13oct21611834successyes

In [6]:
age_balance = @linq table |> 
    where(:age .> 60) |>
    where(:housing .== "yes") |>
    select(:age, :balance)


Out[6]:
    age  balance
1   61   4629
2   63   415
3   75   3810
4   66   1048
5   61   625
6   63   133
7   71   14220
8   61   1060
9   65   2179
10  68   19317
11  61   76
12  62   6
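
The two @linq pipelines above read much like SQL: where filters rows, orderby sorts them, and select keeps a subset of columns. As a rough illustration of what each verb computes, here is a minimal sketch in plain Julia 1.x with no packages (the notebook itself runs on Julia 0.3 with DataFramesMeta). The `rows` sample is made up for illustration and is not taken from bank.csv.

```julia
# Hypothetical miniature of the bank table; the values are made up.
rows = [
    (age = 61, job = "retired",  housing = "yes", balance = 1000),
    (age = 45, job = "services", housing = "yes", balance = 300),
    (age = 75, job = "admin.",   housing = "yes", balance = 2500),
    (age = 68, job = "retired",  housing = "no",  balance = 9000),
]

# where(:age .> 60) |> where(:housing .== "yes")
selected = filter(r -> r.age > 60 && r.housing == "yes", rows)

# orderby(:job): sort the surviving rows by the job column
by_job = sort(selected, by = r -> r.job)

# select(:age, :balance): keep only two columns
age_balance_demo = [(age = r.age, balance = r.balance) for r in selected]
```

Chaining two where calls behaves like a logical AND of the two conditions, which is why the sketch folds them into a single predicate.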

In [7]:
using Gadfly

In [8]:
plot(age_balance, x=:age, y=:balance, Geom.point)


Out[8]:
[Scatter plot: age on the x-axis, balance on the y-axis, one point per row of age_balance]

Machine Learning with DataFrame

Let's try machine learning with SVM.jl (which is actively looking for maintainers).


In [9]:
bank = readtable("data/bank-full.csv",  separator = ';')


Out[9]:
    age  job            marital   education  default  balance  housing  loan  contact    day  month  duration  campaign  pdays  previous  poutcome  y
1   58   management     married   tertiary   no       2143     yes      no    unknown    5    may    261       1         -1     0         unknown   no
2   44   technician     single    secondary  no       29       yes      no    unknown    5    may    151       1         -1     0         unknown   no
3   33   entrepreneur   married   secondary  no       2        yes      yes   unknown    5    may    76        1         -1     0         unknown   no
4   47   blue-collar    married   unknown    no       1506     yes      no    unknown    5    may    92        1         -1     0         unknown   no
5   33   unknown        single    unknown    no       1        no       no    unknown    5    may    198       1         -1     0         unknown   no
6   35   management     married   tertiary   no       231      yes      no    unknown    5    may    139       1         -1     0         unknown   no
7   28   management     single    tertiary   no       447      yes      yes   unknown    5    may    217       1         -1     0         unknown   no
8   42   entrepreneur   divorced  tertiary   yes      2        yes      no    unknown    5    may    380       1         -1     0         unknown   no
9   58   retired        married   primary    no       121      yes      no    unknown    5    may    50        1         -1     0         unknown   no
10  43   technician     single    secondary  no       593      yes      no    unknown    5    may    55        1         -1     0         unknown   no
11  41   admin.         divorced  secondary  no       270      yes      no    unknown    5    may    222       1         -1     0         unknown   no
12  29   admin.         single    secondary  no       390      yes      no    unknown    5    may    137       1         -1     0         unknown   no
13  53   technician     married   secondary  no       6        yes      no    unknown    5    may    517       1         -1     0         unknown   no
14  58   technician     married   unknown    no       71       yes      no    unknown    5    may    71        1         -1     0         unknown   no
15  57   services       married   secondary  no       162      yes      no    unknown    5    may    174       1         -1     0         unknown   no
16  51   retired        married   primary    no       229      yes      no    unknown    5    may    353       1         -1     0         unknown   no
17  45   admin.         single    unknown    no       13       yes      no    unknown    5    may    98        1         -1     0         unknown   no
18  57   blue-collar    married   primary    no       52       yes      no    unknown    5    may    38        1         -1     0         unknown   no
19  60   retired        married   primary    no       60       yes      no    unknown    5    may    219       1         -1     0         unknown   no
20  33   services       married   secondary  no       0        yes      no    unknown    5    may    54        1         -1     0         unknown   no
21  28   blue-collar    married   secondary  no       723      yes      yes   unknown    5    may    262       1         -1     0         unknown   no
22  56   management     married   tertiary   no       779      yes      no    unknown    5    may    164       1         -1     0         unknown   no
23  32   blue-collar    single    primary    no       23       yes      yes   unknown    5    may    160       1         -1     0         unknown   no
24  25   services       married   secondary  no       50       yes      no    unknown    5    may    342       1         -1     0         unknown   no
25  40   retired        married   primary    no       0        yes      yes   unknown    5    may    181       1         -1     0         unknown   no
26  44   admin.         married   secondary  no       -372     yes      no    unknown    5    may    172       1         -1     0         unknown   no
27  39   management     single    tertiary   no       255      yes      no    unknown    5    may    296       1         -1     0         unknown   no
28  52   entrepreneur   married   secondary  no       113      yes      yes   unknown    5    may    127       1         -1     0         unknown   no
29  46   management     single    secondary  no       -246     yes      no    unknown    5    may    255       2         -1     0         unknown   no
30  36   technician     single    secondary  no       265      yes      yes   unknown    5    may    348       1         -1     0         unknown   no
⋮

Converting the target variable to Float

The target column y is mapped from "yes"/"no" to 1.0/-1.0.


In [10]:
bank[:y] = [y == "yes" ? 1.0  : -1.0  for y in bank[:y]]


Out[10]:
45211-element Array{Float64,1}:
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
  ⋮  
 -1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
 -1.0
 -1.0

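Normalizing the numerical values

Before training, we normalize the numerical columns: each one is standardized by subtracting its mean and dividing by its standard deviation.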


In [11]:
categorical_keys = [:job, :marital, :education, :default, :housing, :loan, :contact, :month, :poutcome]


Out[11]:
9-element Array{Symbol,1}:
 :job      
 :marital  
 :education
 :default  
 :housing  
 :loan     
 :contact  
 :month    
 :poutcome 

In [12]:
# Every column that is neither categorical nor the target is numerical
numerical_keys = setdiff(names(bank), [categorical_keys; :y])


Out[12]:
7-element Array{Symbol,1}:
 :age     
 :balance 
 :day     
 :duration
 :campaign
 :pdays   
 :previous

In [13]:
bank_normalized = deepcopy(bank)
bank_normalized[numerical_keys]


Out[13]:
    age  balance  day  duration  campaign  pdays  previous
1   58   2143     5    261       1         -1     0
2   44   29       5    151       1         -1     0
3   33   2        5    76        1         -1     0
4   47   1506     5    92        1         -1     0
5   33   1        5    198       1         -1     0
6   35   231      5    139       1         -1     0
7   28   447      5    217       1         -1     0
8   42   2        5    380       1         -1     0
9   58   121      5    50        1         -1     0
10  43   593      5    55        1         -1     0
11  41   270      5    222       1         -1     0
12  29   390      5    137       1         -1     0
13  53   6        5    517       1         -1     0
14  58   71       5    71        1         -1     0
15  57   162      5    174       1         -1     0
16  51   229      5    353       1         -1     0
17  45   13       5    98        1         -1     0
18  57   52       5    38        1         -1     0
19  60   60       5    219       1         -1     0
20  33   0        5    54        1         -1     0
21  28   723      5    262       1         -1     0
22  56   779      5    164       1         -1     0
23  32   23       5    160       1         -1     0
24  25   50       5    342       1         -1     0
25  40   0        5    181       1         -1     0
26  44   -372     5    172       1         -1     0
27  39   255      5    296       1         -1     0
28  52   113      5    127       1         -1     0
29  46   -246     5    255       2         -1     0
30  36   265      5    348       1         -1     0
⋮

In [14]:
for key in numerical_keys
    bank_normalized[key] = (bank[:, key] - mean(bank[key])) / std(bank[key])
end
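
The loop above applies z-score standardization column by column. As a small check of what that transformation does, here is a sketch for a single vector in Julia 1.x, using only the Statistics standard library (the notebook itself runs on Julia 0.3, where mean and std live in Base). The helper name `zscore` is an assumption made for this illustration; the input is the first five ages of bank-full.csv shown earlier.

```julia
using Statistics

# Standardize a numeric vector: subtract the mean, divide by the sample
# standard deviation (std uses the n-1 correction by default).
zscore(v) = (v .- mean(v)) ./ std(v)

ages = [58.0, 44.0, 33.0, 47.0, 33.0]   # first five ages of bank-full.csv
z = zscore(ages)

# After standardization the values have mean 0 and standard deviation 1.
z_mean = mean(z)
z_std = std(z)
```

Standardizing puts columns such as balance and age on a comparable scale, which tends to matter for a linear SVM because large-valued features would otherwise dominate the weight vector.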

In [15]:
bank_normalized[numerical_keys]


Out[15]:
    age                    balance                day                    duration               campaign               pdays                  previous
1   1.6069471864824068     0.25641641627596995    -1.2984619713129057    0.011015976074886151   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
2   0.28852607995632196    -0.437889851794998     -1.2984619713129057    -0.41612235524730484   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
3   -0.7473762180284589    -0.4467575288233406    -1.2984619713129057    -0.7073530356942532    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
4   0.5710448884976258     0.04720492490359217    -1.2984619713129057    -0.6452238238655709    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
5   -0.7473762180284589    -0.44708596130587175   -1.2984619713129057    -0.2336177955005505    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
6   -0.5590303456675897    -0.3715464903236945    -1.2984619713129057    -0.4627192641188166    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
7   -1.218240898930632     -0.3006050740969542    -1.2984619713129057    -0.15983935645399025   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
8   0.10018020759545271    -0.4467575288233406    -1.2984619713129057    0.47310198905071094    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
9   1.6069471864824068     -0.40767406340212714   -1.2984619713129057    -0.808313004915862     -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
10  0.19435314377588733    -0.25265393164739824   -1.2984619713129057    -0.7888976262193987    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
11  0.006007271415018085   -0.35873762350497757   -1.2984619713129057    -0.14042397775752702   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
12  -1.1240679627501975    -0.3193257256012329    -1.2984619713129057    -0.47048541559740187   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
13  1.1360825055802335     -0.4454437988932157    -1.2984619713129057    1.0050833653338034     -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
14  1.6069471864824068     -0.42409568752868737   -1.2984619713129057    -0.7267684143907165    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
15  1.512774250301972      -0.39420833161834773   -1.2984619713129057    -0.326811613243574     -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
16  0.9477366332193643     -0.372203355288757     -1.2984619713129057    0.3682589440898095     -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
17  0.38269901613675655    -0.4431447715154973    -1.2984619713129057    -0.621925369429815     -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
18  1.512774250301972      -0.4303359046967803    -1.2984619713129057    -0.8549099137873737    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
19  1.7952930588432758     -0.42770844483653064   -1.2984619713129057    -0.15207320497540494   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
20  -0.7473762180284589    -0.447414393788403     -1.2984619713129057    -0.7927807019586914    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
21  -1.218240898930632     -0.20995770891834156   -1.2984619713129057    0.014899051814178797   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
22  1.4186013141215374     -0.19156548989659405   -1.2984619713129057    -0.36564237063650046   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
23  -0.8415491542088935    -0.43986044669018526   -1.2984619713129057    -0.381174673593671     -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
24  -1.5007597074719359    -0.4309927696618427    -1.2984619713129057    0.3255451109575904     -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
25  -0.08816566476541654   -0.447414393788403     -1.2984619713129057    -0.29963008306852545   -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
26  0.28852607995632196    -0.5695912772900114    -1.2984619713129057    -0.3345777647221593    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
27  -0.18233860094585116   -0.36366411074294563   -1.2984619713129057    0.14692362695012873    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
28  1.0419095693997988     -0.4103015232623768    -1.2984619713129057    -0.5093161729903283    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
29  0.4768719523171912     -0.5282087844910794    -1.2984619713129057    -0.012282478360869719  -0.24655762082480956   -0.4114485561028586    -0.25193758438383734
30  -0.46485740948715504   -0.3603797859176336    -1.2984619713129057    0.34884356539334627    -0.5693443410168078    -0.4114485561028586    -0.25193758438383734
⋮

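Converting categorical variables to dummy variables

Here we convert the categorical variables into dummy variables.
A categorical variable is one that takes a value from a fixed set of labels, much like an enum.
Julia's DataFrame does not do this conversion for us, so we write the conversion code ourselves.

Let's borrow the code posted on the julia-users mailing list: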

https://groups.google.com/d/msg/julia-users/7-Vtpi8w4YI/KvMlKAZSwDkJ


In [16]:
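function getdummy{R}(df::DataFrame, cname::Symbol, ::Type{R})
    darr = df[cname]
    # Drop the first (sorted) level; it serves as the reference category
    vals = sort(levels(darr))[2:end]
    # Map each remaining level to its dummy-column index
    namedict = Dict(vals, 1:length(vals))
    arr = zeros(R, length(darr), length(namedict))
    for i=1:length(darr)
        # Rows holding the reference level stay all-zero
        if haskey(namedict, darr[i])
            arr[i, namedict[darr[i]]] = 1
        end
    end
    newdf = convert(DataFrame, arr)
    # Name the new columns <column>_<level>, e.g. job_management
    names!(newdf, [symbol("$(cname)_$k") for k in vals])
    return newdf
end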

function convertdummy{R}(df::DataFrame, cnames::Array{Symbol}, ::Type{R})
    # consider every variable from cnames as categorical
    # and convert them into set of dummy variables,
    # return new dataframe
    newdf = DataFrame()
    for cname in names(df)
        if !in(cname, cnames)
            newdf[cname] = df[cname]
        else
            dummydf = getdummy(df, cname, R)
            for dummyname in names(dummydf)
                newdf[dummyname] = dummydf[dummyname]
            end
        end
    end
    return newdf
end

convertdummy(df::DataFrame, cnames::Array{Symbol}) = convertdummy(df, cnames, Int32)


Out[16]:
convertdummy (generic function with 2 methods)
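
To see what getdummy produces for a single column, here is a minimal sketch of the same drop-first dummy encoding in plain Julia 1.x, working on a vector of strings instead of a DataFrame. `dummy_encode` is a hypothetical helper written for this illustration; it is not part of DataFrames or of the julia-users code above.

```julia
# Drop-first dummy encoding of one categorical column, using only Base.
# `dummy_encode` is a hypothetical helper for illustration.
function dummy_encode(values::Vector{String})
    lvls = sort(unique(values))[2:end]        # first sorted level is the reference
    index = Dict(l => j for (j, l) in enumerate(lvls))
    X = zeros(Int, length(values), length(lvls))
    for (i, v) in enumerate(values)
        j = get(index, v, 0)                  # 0 marks the reference level
        if j > 0
            X[i, j] = 1
        end
    end
    return lvls, X
end

lvls, D = dummy_encode(["married", "single", "divorced", "married"])
```

With the marital values used here, "divorced" sorts first and becomes the reference level, so only marital_married and marital_single style columns remain. Dropping one level avoids a column that is fully determined by the others.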

In [17]:
bank_dummy = convertdummy(bank_normalized[:, 1:16], categorical_keys)


Out[17]:
age job_blue-collar job_entrepreneur job_housemaid job_management job_retired job_self-employed job_services job_student job_technician job_unemployed job_unknown marital_married marital_single education_secondary education_tertiary education_unknown default_yes balance housing_yes loan_yes contact_telephone contact_unknown day month_aug month_dec month_feb month_jan month_jul month_jun month_mar month_may month_nov month_oct month_sep duration campaign pdays previous poutcome_other poutcome_success poutcome_unknown
11.6069471864824068000100000001001000.256416416275969951001-1.2984619713129057000000010000.011015976074886151-0.5693443410168078-0.4114485561028586-0.25193758438383734001
20.2885260799563219600000000100011000-0.4378898517949981001-1.298461971312905700000001000-0.41612235524730484-0.5693443410168078-0.4114485561028586-0.25193758438383734001
3-0.747376218028458901000000000101000-0.44675752882334061101-1.298461971312905700000001000-0.7073530356942532-0.5693443410168078-0.4114485561028586-0.25193758438383734001
40.5710448884976258100000000001000100.047204924903592171001-1.298461971312905700000001000-0.6452238238655709-0.5693443410168078-0.4114485561028586-0.25193758438383734001
5-0.747376218028458900000000001010010-0.447085961305871750001-1.298461971312905700000001000-0.2336177955005505-0.5693443410168078-0.4114485561028586-0.25193758438383734001
6-0.559030345667589700010000000100100-0.37154649032369451001-1.298461971312905700000001000-0.4627192641188166-0.5693443410168078-0.4114485561028586-0.25193758438383734001
7-1.21824089893063200010000000010100-0.30060507409695421101-1.298461971312905700000001000-0.15983935645399025-0.5693443410168078-0.4114485561028586-0.25193758438383734001
80.1001802075954527101000000000000101-0.44675752882334061001-1.2984619713129057000000010000.47310198905071094-0.5693443410168078-0.4114485561028586-0.25193758438383734001
91.606947186482406800001000000100000-0.407674063402127141001-1.298461971312905700000001000-0.808313004915862-0.5693443410168078-0.4114485561028586-0.25193758438383734001
100.1943531437758873300000000100011000-0.252653931647398241001-1.298461971312905700000001000-0.7888976262193987-0.5693443410168078-0.4114485561028586-0.25193758438383734001
110.00600727141501808500000000000001000-0.358737623504977571001-1.298461971312905700000001000-0.14042397775752702-0.5693443410168078-0.4114485561028586-0.25193758438383734001
12-1.124067962750197500000000000011000-0.31932572560123291001-1.298461971312905700000001000-0.47048541559740187-0.5693443410168078-0.4114485561028586-0.25193758438383734001
131.136082505580233500000000100101000-0.44544379889321571001-1.2984619713129057000000010001.0050833653338034-0.5693443410168078-0.4114485561028586-0.25193758438383734001
141.606947186482406800000000100100010-0.424095687528687371001-1.298461971312905700000001000-0.7267684143907165-0.5693443410168078-0.4114485561028586-0.25193758438383734001
151.51277425030197200000010000101000-0.394208331618347731001-1.298461971312905700000001000-0.326811613243574-0.5693443410168078-0.4114485561028586-0.25193758438383734001
160.947736633219364300001000000100000-0.3722033552887571001-1.2984619713129057000000010000.3682589440898095-0.5693443410168078-0.4114485561028586-0.25193758438383734001
170.3826990161367565500000000000010010-0.44314477151549731001-1.298461971312905700000001000-0.621925369429815-0.5693443410168078-0.4114485561028586-0.25193758438383734001
181.51277425030197210000000000100000-0.43033590469678031001-1.298461971312905700000001000-0.8549099137873737-0.5693443410168078-0.4114485561028586-0.25193758438383734001
191.795293058843275800001000000100000-0.427708444836530641001-1.298461971312905700000001000-0.15207320497540494-0.5693443410168078-0.4114485561028586-0.25193758438383734001
20-0.747376218028458900000010000101000-0.4474143937884031001-1.298461971312905700000001000-0.7927807019586914-0.5693443410168078-0.4114485561028586-0.25193758438383734001
21-1.21824089893063210000000000101000-0.209957708918341561101-1.2984619713129057000000010000.014899051814178797-0.5693443410168078-0.4114485561028586-0.25193758438383734001
221.418601314121537400010000000100100-0.191565489896594051001-1.298461971312905700000001000-0.36564237063650046-0.5693443410168078-0.4114485561028586-0.25193758438383734001
23-0.841549154208893510000000000010000-0.439860446690185261101-1.298461971312905700000001000-0.381174673593671-0.5693443410168078-0.4114485561028586-0.25193758438383734001
24-1.500759707471935900000010000101000-0.43099276966184271001-1.2984619713129057000000010000.3255451109575904-0.5693443410168078-0.4114485561028586-0.25193758438383734001
25-0.0881656647654165400001000000100000-0.4474143937884031101-1.298461971312905700000001000-0.29963008306852545-0.5693443410168078-0.4114485561028586-0.25193758438383734001
260.2885260799563219600000000000101000-0.56959127729001141001-1.298461971312905700000001000-0.3345777647221593-0.5693443410168078-0.4114485561028586-0.25193758438383734001
27-0.1823386009458511600010000000010100-0.363664110742945631001-1.2984619713129057000000010000.14692362695012873-0.5693443410168078-0.4114485561028586-0.25193758438383734001
281.041909569399798801000000000101000-0.41030152326237681101-1.298461971312905700000001000-0.5093161729903283-0.5693443410168078-0.4114485561028586-0.25193758438383734001
290.476871952317191200010000000011000-0.52820878449107941001-1.298461971312905700000001000-0.012282478360869719-0.24655762082480956-0.4114485561028586-0.25193758438383734001
30-0.4648574094871550400000000100011000-0.36037978591763361101-1.2984619713129057000000010000.34884356539334627-0.5693443410168078-0.4114485561028586-0.25193758438383734001
⋮

In [18]:
X = convert(Array, bank_dummy[:, 1:42])'


Out[18]:
42x45211 Array{Real,2}:
  1.60695    0.288526  -0.747376  …  2.92537    1.51277   -0.370684
  0          0          0            0          1          0       
  0          0          1            0          0          1       
  0          0          0            0          0          0       
  1          0          0            0          0          0       
  0          0          0         …  1          0          0       
  0          0          0            0          0          0       
  0          0          0            0          0          0       
  0          0          0            0          0          0       
  0          1          0            0          0          0       
  0          0          0         …  0          0          0       
  0          0          0            0          0          0       
  1          0          1            1          1          1       
  ⋮                               ⋱                        ⋮       
  0          0          0         …  0          0          0       
  1          1          1            0          0          0       
  0          0          0            1          1          1       
  0          0          0            0          0          0       
  0          0          0            0          0          0       
  0.011016  -0.416122  -0.707353  …  3.37376    0.970136   0.399324
 -0.569344  -0.569344  -0.569344     0.721803   0.399016  -0.246558
 -0.411449  -0.411449  -0.411449     1.43617   -0.411449   1.47612 
 -0.251938  -0.251938  -0.251938     1.05046   -0.251938   4.52353 
  0          0          0            0          0          1       
  0          0          0         …  1          0          0       
  1          1          1            0          1          0       

In [19]:
Y = convert(Array, bank_normalized[:y])


Out[19]:
45211-element Array{Float64,1}:
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
  ⋮  
 -1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
 -1.0
 -1.0

In [20]:
attribute_num, sample_num = size(X)


Out[20]:
(42,45211)

In [21]:
train_flags = randbool(sample_num)


Out[21]:
45211-element BitArray{1}:
 false
 false
  true
 false
  true
 false
 false
  true
 false
 false
 false
 false
 false
     ⋮
  true
 false
  true
  true
 false
 false
 false
  true
  true
 false
 false
 false
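
randbool assigns each sample to the training set with probability 1/2, so roughly half of the data is held out for testing. The sketch below shows the same idea in Julia 1.x, where the equivalent calls are rand(Bool, n) and the broadcast negation .!; the fixed seed and the sample count are assumptions made only so the example is reproducible.

```julia
using Random

rng = MersenneTwister(42)             # fixed seed: an assumption for reproducibility
n = 10_000                            # hypothetical number of samples

train_mask = rand(rng, Bool, n)       # true  -> sample goes to the training set
test_mask = .!train_mask              # false -> sample is held out for testing

train_fraction = count(train_mask) / n
```

Because the mask is random, the exact split size changes from run to run unless the generator is seeded, and so does the accuracy measured on the held-out part.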

In [22]:
using SVM
model_svm = svm(X[:, train_flags], Y[train_flags])


Out[22]:
Fitted linear SVM
 * Non-zero weights: 41
 * Iterations: 100
 * Converged: true

In [23]:
accuracy = countnz(predict(model_svm, X[:, ~train_flags]) .== Y[~train_flags]) / countnz(~train_flags)


Out[23]:
0.8870449443148303
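
An accuracy figure is easier to judge next to a baseline. The Bank Marketing labels are imbalanced (roughly 12% "yes" in bank-full.csv, a figure worth re-checking on your own copy), so a classifier that always answers "no" already scores close to the number above. The sketch below, in Julia 1.x with made-up labels, computes that majority-class baseline; `majority_baseline` and `y_demo` are hypothetical names introduced for this illustration.

```julia
# Accuracy of a classifier that always predicts the most frequent label.
# Labels are assumed to be encoded as 1.0 / -1.0, as in the notebook.
function majority_baseline(y)
    n_pos = count(==(1.0), y)
    return max(n_pos, length(y) - n_pos) / length(y)
end

# Hypothetical labels: 12 positives and 88 negatives.
y_demo = vcat(fill(1.0, 12), fill(-1.0, 88))
baseline = majority_baseline(y_demo)
```

Comparing the SVM's accuracy against this baseline on the same test samples, or looking at precision and recall for the "yes" class, gives a clearer picture of how much the model has learned.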