Get your data here. The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).
The smaller datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict whether the client will subscribe to a term deposit (variable y: yes/no).
(Hint: sklearn's LabelEncoder will be useful for the categorical features.)
In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cross_validation as cv
import sys
import re
import os
import pprint
import random
from scipy import stats
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef
from sklearn.grid_search import GridSearchCV
from sklearn.svm import LinearSVC, SVC
from sklearn import preprocessing
from collections import Counter
from datetime import datetime
from fuzzywuzzy import fuzz
from sklearn.neighbors import KNeighborsClassifier
from sklearn.learning_curve import learning_curve
from sklearn.ensemble import RandomForestClassifier
pd.set_option('display.max_rows', 25)
pd.set_option('display.precision', 4)
np.set_printoptions(precision = 4)
%matplotlib inline
print 'Python version ' + sys.version
print 'Pandas version ' + pd.__version__
print 'Numpy version ' + np.__version__
In [2]:
bank = pd.read_csv('./bank-additional/bank-additional-full.csv', delimiter = ';')
In [3]:
bank.head()
Out[3]:
In [4]:
bank.info()
In [5]:
# Missing values - none?! Maybe they are coded with a placeholder. Let's check the unique values per column.
bank.isnull().sum()
Out[5]:
In [6]:
# Get column names.
colNames = bank.columns
In [7]:
# Print unique values per column. Missing values seem coded with 'unknown'.
for col in colNames:
    print col, set(bank[col])
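Before imputing, it is worth quantifying how many 'unknown' entries each column carries, to see how much the imputation will change the data. A minimal sketch (select_dtypes assumes a reasonably recent pandas):
In [ ]:
# Count 'unknown' entries per categorical (object-dtype) column.
for col in bank.select_dtypes(include=['object']).columns:
    n_unknown = (bank[col] == 'unknown').sum()
    print col, n_unknown, '(%.1f%%)' % (100.0 * n_unknown / len(bank))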
In [8]:
# Impute missing values.
col_dist = {}
def get_col_dist(col_name):
    excl_null_mask = bank[col_name] != 'unknown'
    row_count = bank[excl_null_mask][col_name].size
    value_freqs = bank[excl_null_mask][col_name].value_counts() / row_count
    col_data = {}
    col_data['prob'] = value_freqs.values
    col_data['values'] = value_freqs.index.values
    return col_data
In [9]:
col_dist['job'] = get_col_dist('job')
col_dist['marital'] = get_col_dist('marital')
col_dist['education'] = get_col_dist('education')
col_dist['default'] = get_col_dist('default')
col_dist['housing'] = get_col_dist('housing')
col_dist['loan'] = get_col_dist('loan')
In [10]:
print col_dist
In [11]:
def impute_cols(val, options):
    # Replace 'unknown' with a random draw from the observed value distribution.
    if val == 'unknown':
        return np.random.choice(options['values'], p=options['prob'])
    return val
In [12]:
def impute_job(val):
    return impute_cols(val, col_dist['job'])
def impute_marital(val):
    return impute_cols(val, col_dist['marital'])
def impute_edu(val):
    return impute_cols(val, col_dist['education'])
def impute_default(val):
    return impute_cols(val, col_dist['default'])
def impute_housing(val):
    return impute_cols(val, col_dist['housing'])
def impute_loan(val):
    return impute_cols(val, col_dist['loan'])
In [13]:
bank.job = bank.job.map(impute_job)
bank.marital = bank.marital.map(impute_marital)
bank.education = bank.education.map(impute_edu)
bank.default = bank.default.map(impute_default)
bank.housing = bank.housing.map(impute_housing)
bank.loan = bank.loan.map(impute_loan)
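A quick sanity check, as a sketch, that the imputation removed every 'unknown':
In [ ]:
# Confirm no 'unknown' values remain in any object-dtype column.
for col in bank.select_dtypes(include=['object']).columns:
    assert 'unknown' not in set(bank[col]), col
print 'No unknown values left.'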
In [14]:
bank.head()
Out[14]:
In [15]:
# Numeric features.
numFeats = ['age', 'duration', 'campaign', 'pdays', 'previous', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
# Numeric features without pdays (strongly related to target).
numFeatsR = ['age', 'duration', 'campaign', 'previous', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
# Some descriptives.
bank[numFeats].describe()
Out[15]:
In [16]:
# Feature pdays looks weird. Checked the documentation: 999 means the client was not previously contacted.
# Recode into bins stating whether and when someone was contacted before (e.g., never, same day, etc.).
In [18]:
# Create artificial bins for pdays.
def recode_pdays(val):
    if val == 999:
        return 'never contacted'
    elif val == 0:
        return 'same day'
    elif 1 <= val <= 7:
        return 'within 1 week'
    elif 8 <= val <= 14:
        return 'between 1 and 2 weeks'
    elif 15 <= val <= 21:
        return 'between 2 and 3 weeks'
    else:
        return 'more than 3 weeks'
# Recode.
bank['pdays_cat'] = bank.pdays.map(recode_pdays)
# Drop pdays.
bank.drop('pdays', axis = 1, inplace = True)
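The same binning could be written more compactly with pandas' cut. A sketch of that alternative, assuming a hypothetical copy of the original column (pdays_orig) taken before the recode above:
In [ ]:
# Hypothetical alternative: bin pdays with pd.cut instead of a per-value function.
# pdays_orig: a copy of bank.pdays made before the recode above (hypothetical).
days = pdays_orig.replace(999, np.nan)
bins = [-1, 0, 7, 14, 21, np.inf]
labels = ['same day', 'within 1 week', 'between 1 and 2 weeks',
          'between 2 and 3 weeks', 'more than 3 weeks']
pdays_cat = pd.cut(days, bins=bins, labels=labels).astype(object)
pdays_cat[days.isnull()] = 'never contacted'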
In [19]:
# Preprocessing - label encoder.
# Categorical features.
catFeats = ['y', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome',
'pdays_cat']
# Categorical features without target.
catFeatsR = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome',
'pdays_cat']
In [20]:
# A single shared LabelEncoder does not work here: each feature needs its own
# encoder, otherwise the fitted classes get overwritten and the encodings
# cannot be inverted later.
label_encoders = {}
for cat in catFeats:
    label_encoders[cat] = preprocessing.LabelEncoder()
    bank[cat] = label_encoders[cat].fit_transform(bank[cat])
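Keeping one encoder per column means the original labels stay recoverable. Note also that label encoding imposes an arbitrary order on the codes, so a distance-based model like KNN may read structure into them that isn't there. A sketch of both points (the get_dummies line is a hypothetical alternative, not run here):
In [ ]:
# Recover the original labels for a column from its dedicated encoder.
print label_encoders['job'].inverse_transform(bank['job'].values[:5])
# Hypothetical alternative: one-hot encode instead of label-encode, so KNN
# does not treat the integer codes as ordered.
# bank_ohe = pd.get_dummies(bank, columns=catFeatsR)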
In [18]:
bank.head()
Out[18]:
In [24]:
# Calls by month. Some months see more calls than others.
plot = bank['month'].value_counts().order(ascending = False).plot(kind = 'bar', figsize = (10, 10), color = 'r')
plot.set_title('Last contact with customer by month')
plot.set_xlabel('Month')
plot.set_ylabel('Number of customers contacted');
In [25]:
# Calls by day of week: no apparent difference in call volume by day of week.
plot = bank['day_of_week'].value_counts().order(ascending = False).plot(kind = 'bar', figsize = (10, 10), color = 'grey')
plot.set_title('Last customer contact by day of week')
plot.set_xlabel('Day of week')
plot.set_ylabel('Number of customers')
Out[25]:
In [26]:
bank.age.hist(bins = 10) # Age is a bit skewed, which makes sense, as more people probably apply for loans when they're younger.
Out[26]:
In [27]:
bank.duration.hist(bins = 10) # Interesting - call duration is skewed, and most calls are very short.
print 'Median duration of call:', bank.duration.median() # Duration is in seconds, so the median is about 3 minutes.
In [28]:
print 'Spearman correlation age, call duration', stats.spearmanr(bank.age, bank.duration) # No association between age and call duration.
In [29]:
bank.campaign.hist(bins = 10) # Limited number of contacts.
print bank.campaign.median() # Median number of contacts is 2.
In [30]:
print 'Spearman corr age, campaign', stats.spearmanr(bank.age, bank.campaign)
print 'Spearman corr duration, campaign', stats.spearmanr(bank.duration, bank.campaign)
In [31]:
# Scatter plot of call duration, campaign.
plt.scatter(bank.duration, bank.campaign)
Out[31]:
In [32]:
# Correlations between numeric non-binary features. Exclude pdays, as the 999 sentinel could bias the estimates.
# Some look significant; double-check their p values below.
bank[numFeatsR].corr(method = 'spearman')
Out[32]:
In [33]:
print 'p value of Spearman correlation between'
print 'age, consumer confidence idx:', stats.spearmanr(bank['age'], bank['cons.conf.idx'])[1]
print 'campaign, number of employees:', stats.spearmanr(bank.campaign, bank['nr.employed'])[1]
print 'number of previous contacts, consumer confidence idx:', stats.spearmanr(bank.previous, bank['cons.conf.idx'])[1]
print 'number of previous contacts, consumer price idx:', stats.spearmanr(bank.previous, bank['cons.price.idx'])[1]
print 'number of previous contacts, Euribor rate:', stats.spearmanr(bank.previous, bank['euribor3m'])[1]
print 'number of previous contacts, number of employees:', stats.spearmanr(bank.previous, bank['nr.employed'])[1]
print 'consumer price idx, consumer confidence idx:', stats.spearmanr(bank['cons.price.idx'], bank['cons.conf.idx'])[1]
print 'consumer price idx, number of employees:', stats.spearmanr(bank['cons.price.idx'], bank['nr.employed'])[1]
print 'Euribor, consumer confidence index:', stats.spearmanr(bank.euribor3m, bank['cons.conf.idx'])[1]
print 'Euribor, number of employees:', stats.spearmanr(bank.euribor3m, bank['nr.employed'])[1]
In [34]:
# Correlations by target: Differ for some features.
bank[bank.y == 0][numFeatsR].corr(method = 'spearman')
Out[34]:
In [35]:
bank[bank.y == 1][numFeatsR].corr(method = 'spearman')
Out[35]:
In [36]:
# Factorplot by target.
import seaborn as sns
sns.factorplot('y', data = bank, palette = 'Greens') # Target is mostly 0 - the class imbalance could bias classification.
Out[36]:
In [37]:
sns.factorplot('y', data = bank, palette = 'Blues', hue = 'loan')
Out[37]:
In [38]:
sns.factorplot('y', data = bank, palette = 'Reds', hue = 'default')
Out[38]:
In [39]:
sns.factorplot('y', data = bank, palette = 'Greens', hue = 'marital')
Out[39]:
In [40]:
# Get target variable.
target = bank['y']
In [41]:
target.shape
Out[41]:
In [42]:
# Drop the target.
# Drop duration as the documentation suggests: it is only known after the call, so it leaks the outcome.
bank.drop(['y', 'duration'], axis = 1, inplace = True)
In [43]:
features = bank.as_matrix()
In [44]:
features.shape
Out[44]:
In [45]:
# Split.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 12)
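Given the class imbalance seen above, a stratified split would keep the minority-class share equal in both folds. A sketch using the era-appropriate StratifiedShuffleSplit (newer scikit-learn versions accept a stratify argument on train_test_split directly):
In [ ]:
from sklearn.cross_validation import StratifiedShuffleSplit
# Stratified variant of the split above: both folds keep the same yes/no ratio.
sss = StratifiedShuffleSplit(target, n_iter=1, test_size=0.2, random_state=12)
for train_idx, test_idx in sss:
    X_train_s, X_test_s = features[train_idx], features[test_idx]
    y_train_s, y_test_s = target.values[train_idx], target.values[test_idx]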
In [46]:
knn = KNeighborsClassifier()
In [47]:
# Grid Search for KNN parameters to be used.
param_knn = {'n_neighbors' : np.arange(10, 101, 5)}
knn_gs = GridSearchCV(knn, param_grid = param_knn)
knn_gs.fit(X_train, y_train)
print 'Best parameters:', knn_gs.best_params_
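KNN is distance based, so features on large scales (e.g. nr.employed, around 5000) dominate the Euclidean distance. A common fix, sketched here, is to standardize inside a Pipeline so the grid search rescales on each CV fold rather than leaking test-fold statistics:
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Scale features before KNN; the pipeline refits the scaler per CV fold.
knn_pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
param_pipe = {'knn__n_neighbors': np.arange(10, 101, 5)}
knn_pipe_gs = GridSearchCV(knn_pipe, param_grid=param_pipe)
# knn_pipe_gs.fit(X_train, y_train)  # sketch only, not run here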
In [48]:
# Run knn with best parameters.
knn = KNeighborsClassifier(n_neighbors = 50)
knn.fit(X_train, y_train)
Out[48]:
In [49]:
# Score.
knn.score(X_test, y_test)
Out[49]:
In [50]:
# Predicted y.
y_pred_knn = knn.predict(X_test)
In [51]:
# Confusion matrix.
print 'Confusion matrix:\n', confusion_matrix(y_test, y_pred_knn)
In [52]:
# Plot confusion matrix.
plt.matshow(confusion_matrix(y_test, y_pred_knn))
plt.title('KNN Confusion matrix')
plt.colorbar()
plt.ylabel('True')
plt.xlabel('Predicted')
Out[52]:
In [53]:
# Classification report.
print 'classification report:\n', classification_report(y_test, y_pred_knn) # Works well for the 0 class, not so well for the 1 class.
In [54]:
# Learning curve. Thanks to Chad for the code.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
In [55]:
# Plot the learning curve to visualize the bias-variance trade-off in terms of training samples.
plot_learning_curve(KNeighborsClassifier(n_neighbors = 50), 'KNN', X_train, y_train, ylim=None, cv=None, n_jobs=-1)
Out[55]:
In [56]:
# Another plot. This time visualizing best parameter in terms of n_neighbors.
train_scores = []
test_scores = []
ks = range(10, 101, 5)
for i in ks:
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))
plt.plot(ks, train_scores, label="training scores")
plt.plot(ks, test_scores, label="validation scores")
plt.legend(loc="best")
plt.tight_layout()
Out[56]:
In [57]:
# Look at probabilities.
bank_y_pred_df = pd.DataFrame(knn.predict_proba(X_test))
bank_y_pred_df['Predicted'] = y_pred_knn
bank_y_pred_df['True'] = y_test
# Show only FP/ FN examples.
print 'FP, FN examples:\n', bank_y_pred_df[bank_y_pred_df['Predicted'] != bank_y_pred_df['True']].head(20)
print 'Total FP, FN:\n', len(bank_y_pred_df[bank_y_pred_df['Predicted'] != bank_y_pred_df['True']])
In [58]:
# The Matthews coefficient is a measure of the quality of a binary classification. Not very good here - maybe because of the class imbalance?
matthews_corrcoef(y_test, y_pred_knn)
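For reference, the coefficient can be computed directly from the confusion-matrix cells; this sketch should reproduce matthews_corrcoef up to floating-point error:
In [ ]:
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel().astype(float)
print 'MCC by hand:', (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))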
Out[58]:
In [59]:
# Random forest.
rf = RandomForestClassifier(n_jobs = -1)
In [60]:
# Run GridSearch for best parameters.
param_rf = {'n_estimators': np.arange(10, 101, 10), 'criterion': ['gini', 'entropy']}
In [61]:
rf_gs = GridSearchCV(rf, param_grid=param_rf)
In [62]:
rf_gs.fit(features, target) # Note: grid search fit on the full data here, not just X_train.
Out[62]:
In [63]:
rf_gs.grid_scores_
Out[63]:
In [64]:
# Use these parameters in the RF model below.
rf_gs.best_params_
Out[64]:
In [65]:
rf = RandomForestClassifier(n_jobs=-1, n_estimators=30, criterion='entropy')
In [66]:
rf.fit(features, target)
Out[66]:
In [67]:
rf.feature_importances_
Out[67]:
In [68]:
# 10 most important features. Could build a reduced model with just these - see the sketch below.
print 'Most important features:\n', sorted(zip(rf.feature_importances_, bank.columns), reverse = True)[:10]
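A sketch of that reduced model: keep only the top-10 columns by importance and cross-validate a fresh forest on them (an illustration, not a tuned model):
In [ ]:
# Reduced feature matrix from the 10 most important columns.
top10 = [name for score, name in
         sorted(zip(rf.feature_importances_, bank.columns), reverse=True)[:10]]
features_top = bank[top10].as_matrix()
rf_small = RandomForestClassifier(n_jobs=-1, n_estimators=30, criterion='entropy')
print 'CV accuracy, top-10 features:', cross_val_score(rf_small, features_top, target, cv=5).mean()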
In [69]:
# Confusion matrix
y_pred_rf = rf.predict(features)
In [70]:
print 'cf matrix:\n', confusion_matrix(target, y_pred_rf)
In [71]:
# Plot confusion matrix.
plt.matshow(confusion_matrix(target, y_pred_rf))
plt.title('RF Confusion matrix')
plt.colorbar()
plt.ylabel('True')
plt.xlabel('Predicted')
Out[71]:
In [72]:
# Classification report. Note this scores the forest on its own training data, so the numbers are optimistic.
print 'classification report:\n', classification_report(target, y_pred_rf) # Works well for both classes here.
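A fairer read comes from held-out data; a sketch via cross-validation, optionally re-weighting classes to counter the imbalance (the class_weight option assumes a reasonably recent scikit-learn):
In [ ]:
# Held-out estimate instead of scoring the training fit.
print 'RF 5-fold CV accuracy:', cross_val_score(rf, features, target, cv=5).mean()
# Hypothetical re-weighting to counter the class imbalance.
rf_bal = RandomForestClassifier(n_jobs=-1, n_estimators=30, criterion='entropy',
                                class_weight='balanced')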
In [73]:
# Look at probabilities.
bank_y_pred_df_rf = pd.DataFrame(rf.predict_proba(features))
bank_y_pred_df_rf['Predicted'] = y_pred_rf
bank_y_pred_df_rf['True'] = target
# Show only FP/FN examples.
print 'FP, FN examples:\n', bank_y_pred_df_rf[bank_y_pred_df_rf['Predicted'] != bank_y_pred_df_rf['True']].head(20)
print 'Total FP, FN:\n', len(bank_y_pred_df_rf[bank_y_pred_df_rf['Predicted'] != bank_y_pred_df_rf['True']])
In [75]:
# Learning curve. Interestingly, more training samples actually bring the CV score down ...
plot_learning_curve(RandomForestClassifier(n_jobs=-1, n_estimators=30, criterion='entropy'),
'RF', features, target)
Out[75]:
In [76]:
# The Matthews coefficient again. Pretty good - though computed on the training fit, so inflated.
matthews_corrcoef(target, y_pred_rf)
Out[76]:
In [78]:
# Plot visualizing the best params in terms of n_estimators: gini first.
train_scores = []
test_scores = []
ks = range(10, 101, 10)
for i in ks:
    rf_i = RandomForestClassifier(n_estimators = i, criterion = 'gini')
    rf_i.fit(X_train, y_train)
    train_scores.append(rf_i.score(X_train, y_train))
    test_scores.append(rf_i.score(X_test, y_test))
plt.plot(ks, train_scores, label="training scores")
plt.plot(ks, test_scores, label="validation scores")
plt.legend(loc="best")
plt.tight_layout()
Out[78]:
In [79]:
# Plot visualizing the best params in terms of n_estimators: entropy.
train_scores = []
test_scores = []
ks = range(10, 101, 10)
for i in ks:
    rf_i = RandomForestClassifier(n_estimators = i, criterion = 'entropy')
    rf_i.fit(X_train, y_train)
    train_scores.append(rf_i.score(X_train, y_train))
    test_scores.append(rf_i.score(X_test, y_test))
plt.plot(ks, train_scores, label="training scores")
plt.plot(ks, test_scores, label="validation scores")
plt.legend(loc="best")
plt.tight_layout()
Out[79]:
In [ ]: