Lesson 8: Cross-validation

So far, we've learned about splitting our data into training and testing sets to validate our models. This helps ensure that the model we create on one sample performs well on another sample we want to predict.

However, we don't have to use just TWO samples to train and test our models. Instead, we can split our data up into MULTIPLE samples to train and test on multiple segments of the data. This is called CROSS-VALIDATION. This allows us to ensure that our model predicts outcomes over a wider range of circumstances.

Let's begin by importing our packages.


In [1]:
! conda install geopandas -qy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

import geopandas as gpd
from shapely.geometry import Point, Polygon

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold



# All requested packages already installed.
# packages in environment at /opt/conda:
#
geopandas                 0.3.0                    py36_0    conda-forge

In [2]:
import os
os.getcwd()
os.chdir('/home/jovyan/assignment-08-cross-validation-drewgobbi')

Today we'll be looking at 311 service requests for rodent inspection and abatement aggregated at the Census block level. The data set is already prepared for you and available in the same folder as this assignment. Census blocks are a good geographic level to analyze rodent infestations because they are drawn along natural and human-made boundaries, like rivers and roads, that rats tend not to cross.

We will look at the 'activity' variable, which indicates whether inspectors found rat burrows during an inspection (1) or not (0). Here we are looking only at inpsections in 2016. About 43 percent on inspections in 2016 led to inspectors finding and treating rat burrows, as you can see below.


In [3]:
data = pd.read_csv('rat_data_2016.csv')

In [4]:
data.columns


Out[4]:
Index(['activity', 'alley_condition', 'bbl_hotel', 'bbl_multifamily_rental',
       'bbl_restaurant', 'bbl_single_family_rental', 'bbl_storage',
       'bbl_two_family_rental', 'communitygarden_area', 'communitygarden_id',
       'dcrapermit_addition', 'dcrapermit_demolition', 'dcrapermit_excavation',
       'dcrapermit_new_building', 'dcrapermit_raze', 'impervious_area',
       'month', 'num_mixed_use', 'num_non_residential', 'num_residential',
       'park', 'pct_mixed_use', 'pct_non_residential', 'pct_residential',
       'pop_density', 'sidewalk_grates', 'ssl_cndtn_Average_comm',
       'ssl_cndtn_Average_res', 'ssl_cndtn_Excellent_comm',
       'ssl_cndtn_Excellent_res', 'ssl_cndtn_Fair_comm', 'ssl_cndtn_Fair_res',
       'ssl_cndtn_Good_comm', 'ssl_cndtn_Good_res', 'ssl_cndtn_Poor_comm',
       'ssl_cndtn_Poor_res', 'ssl_cndtn_VeryGood_comm',
       'ssl_cndtn_VeryGood_res', 'tot_pop', 'well_activity', 'WARD'],
      dtype='object')

In [5]:
data.describe().T


Out[5]:
count mean std min 25% 50% 75% max
activity 2606.0 0.431696 0.495408 0.000000 0.000000 0.000000 1.000000 1.000000
alley_condition 2606.0 11.111282 8.900166 0.000000 4.000000 10.000000 16.000000 79.000000
bbl_hotel 2606.0 0.082118 0.376073 0.000000 0.000000 0.000000 0.000000 8.000000
bbl_multifamily_rental 2606.0 1.388718 2.376244 0.000000 0.000000 0.000000 2.000000 22.000000
bbl_restaurant 2606.0 0.569455 1.518526 0.000000 0.000000 0.000000 0.000000 17.000000
bbl_single_family_rental 2606.0 4.709133 8.375165 0.000000 1.000000 2.000000 5.000000 147.000000
bbl_storage 2606.0 0.002686 0.051768 0.000000 0.000000 0.000000 0.000000 1.000000
bbl_two_family_rental 2606.0 0.743668 1.378860 0.000000 0.000000 0.000000 1.000000 15.000000
communitygarden_area 2606.0 18.727920 326.332382 0.000000 0.000000 0.000000 0.000000 11004.319881
communitygarden_id 2606.0 0.242134 3.234069 0.000000 0.000000 0.000000 0.000000 80.000000
dcrapermit_addition 2606.0 0.054490 0.265959 0.000000 0.000000 0.000000 0.000000 4.000000
dcrapermit_demolition 2606.0 0.280507 0.738300 0.000000 0.000000 0.000000 0.000000 12.000000
dcrapermit_excavation 2606.0 0.010361 0.105000 0.000000 0.000000 0.000000 0.000000 2.000000
dcrapermit_new_building 2606.0 0.053722 0.286938 0.000000 0.000000 0.000000 0.000000 4.000000
dcrapermit_raze 2606.0 0.044896 0.308390 0.000000 0.000000 0.000000 0.000000 5.000000
impervious_area 2606.0 18450.549489 18774.921307 2150.037473 11356.602831 14532.464194 19920.166029 473222.487756
month 2606.0 7.194935 3.001022 1.000000 5.000000 7.000000 10.000000 12.000000
num_mixed_use 2606.0 0.153876 0.474006 0.000000 0.000000 0.000000 0.000000 4.000000
num_non_residential 2606.0 3.804682 5.956966 0.000000 0.000000 1.000000 5.000000 58.000000
num_residential 2606.0 39.287797 26.661136 0.000000 21.000000 36.000000 53.000000 334.000000
park 2606.0 0.046815 0.225350 0.000000 0.000000 0.000000 0.000000 4.000000
pct_mixed_use 2606.0 0.003899 0.013706 0.000000 0.000000 0.000000 0.000000 0.250000
pct_non_residential 2606.0 0.139389 0.242231 0.000000 0.000000 0.038462 0.150000 1.000000
pct_residential 2606.0 0.854794 0.247308 0.000000 0.838200 0.959184 1.000000 1.000000
pop_density 2606.0 24969.466549 19217.566990 0.000000 13524.013052 21965.055541 30907.352412 182709.507271
sidewalk_grates 2606.0 2.109363 5.186348 0.000000 0.000000 0.000000 2.000000 73.000000
ssl_cndtn_Average_comm 2606.0 0.402058 0.401132 0.000000 0.000000 0.333333 0.800000 1.000000
ssl_cndtn_Average_res 2606.0 0.574651 0.265928 0.000000 0.444444 0.635642 0.767857 1.000000
ssl_cndtn_Excellent_comm 2606.0 0.022413 0.095254 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_Excellent_res 2606.0 0.001871 0.029081 0.000000 0.000000 0.000000 0.000000 0.766990
ssl_cndtn_Fair_comm 2606.0 0.024810 0.101711 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_Fair_res 2606.0 0.018239 0.047230 0.000000 0.000000 0.000000 0.017857 0.666667
ssl_cndtn_Good_comm 2606.0 0.162147 0.261734 0.000000 0.000000 0.000000 0.250000 1.000000
ssl_cndtn_Good_res 2606.0 0.285093 0.206829 0.000000 0.133333 0.266667 0.403509 1.000000
ssl_cndtn_Poor_comm 2606.0 0.002487 0.032450 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_Poor_res 2606.0 0.002067 0.010527 0.000000 0.000000 0.000000 0.000000 0.200000
ssl_cndtn_VeryGood_comm 2606.0 0.104810 0.232111 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_VeryGood_res 2606.0 0.036727 0.071472 0.000000 0.000000 0.000000 0.052632 1.000000
tot_pop 2606.0 188.892172 211.199274 0.000000 82.000000 137.000000 231.000000 3888.000000
well_activity 2606.0 0.572141 1.569373 0.000000 0.000000 0.000000 0.000000 19.000000
WARD 2606.0 3.994244 2.112836 1.000000 2.000000 4.000000 6.000000 8.000000

Recall from last week that, when we do predictive analysis, we usually are not interested in the relationship between two different variables as we are when we do traditional hypothesis testing. Instead, we're interested in training a model that generates predictions that best fit our target population. Therefore, when we are doing any kind of validation, including cross-validation, it is important for us to choose the metric by which we will evaluate the performance of our models.

For this model, we will predict the locations of requests for rodent inspection and abatement in the District of Columbia. When we select a validation metric, it's important for us to think about what we want to optimize. For example, do we want to make sure that our top predictions accurately identify places with rodent infestations, so we don't send our inspectors on a wild goose chase? Then we may to look at the models precision, or what proportion of its positive predictions turn out to be positive. Or do we want to make sure we don't miss any infestations? If so, we may want to look at recall, or the proportion of positive cases that are correctly categorized by the model. If we care a lot about how the model ranks our observations, then we may want to look at the area under the ROC curve, or ROC-AUC, while if we care more about how well the model fits the data, or its "calibration," we may want to look at Brier score or logarithmic loss (log-loss).

In the case of rodent inspections, we most likely want to make sure that we send our inspectors to places where they are most likely to find rats and to avoid sending them on wild goose [rat] chases. Therefore, we will optimize for precision, which we will call from the metrics library in scikit-learn.

The metrics library in scikit-learn provides a number of different options. You should take some time to look at the different metrics that are available to you and consider which ones are most appropriate for your own research


In [6]:
from sklearn.metrics import precision_score

The next important decision we need to make when cross-validating our models is how we will define our "folds." Folds are the independent subsamples on which we train and test the data. Keep in mind that it is important that our folds are INDEPENDENT, which means we must guarantee that there's no overlap between our training and test set (i.e., no observation is in both the training and test set). Independence can also have other implications for how we slice the data, which we will discuss as we progress through this lesson.

One of the most common approaches to cross-validation is to make random splits in the data. This is often referred to as k-fold cross-validation, in which the only thing we define is the number of folds (k) that want to split our sample into. Here, I'll use the KFold function from scikit-learn's model_selection library. Let's begin by importing the library and then taking a look at how it splits our data.


In [7]:
from sklearn.model_selection import KFold

KFold divides our data into a pre-specified number of (approximately) equally-sized folds so that each observation is in the test set once. When we specify that shuffle=True, KFold first shuffles our data into a random order to ensure that the observations are randomly selected. By selecting a random_state, we can ensure that KFold selects observations the same way each time.

While there are other functions in the model_selection library that will do much of this work for us, KFold will allow us to look at what's going on in the background of our cross-validation process. Let's begin by just looking at how KFold splits our data. Here we split our data into 10 folds each with 10 percent of the data (.1).


In [8]:
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)


TRAIN: [   0    1    2 ..., 2603 2604 2605] TEST: [   9   22   27   33   53   70   92  104  109  117  121  135  137  156  182
  192  195  196  215  217  224  227  252  259  271  276  289  314  317  326
  333  351  398  399  418  422  427  436  438  443  452  465  478  480  482
  489  518  547  562  563  567  569  578  581  597  609  616  618  619  674
  682  686  700  704  710  711  720  722  728  743  745  746  748  764  778
  795  817  831  855  868  878  880  899  913  916  921  927  933  961  962
  982  983  988  998 1000 1012 1013 1018 1023 1032 1036 1051 1052 1059 1078
 1079 1096 1100 1101 1106 1108 1109 1147 1187 1192 1213 1263 1264 1285 1287
 1300 1323 1326 1327 1371 1396 1418 1421 1432 1452 1484 1507 1515 1520 1544
 1568 1570 1577 1580 1585 1588 1590 1592 1597 1622 1627 1656 1657 1668 1680
 1681 1686 1689 1708 1710 1719 1729 1732 1735 1757 1761 1762 1763 1765 1780
 1783 1790 1798 1803 1814 1816 1818 1820 1821 1827 1832 1836 1853 1873 1874
 1895 1898 1901 1902 1921 1927 1929 1939 1943 1961 1968 1969 1972 1974 1979
 1982 1989 1997 1998 2019 2027 2030 2033 2055 2074 2085 2087 2090 2115 2123
 2134 2138 2151 2156 2161 2183 2186 2191 2225 2230 2254 2258 2274 2281 2289
 2302 2312 2316 2317 2343 2345 2346 2358 2359 2378 2386 2387 2398 2400 2410
 2416 2417 2419 2429 2450 2468 2498 2499 2502 2510 2520 2543 2544 2550 2555
 2564 2567 2579 2585 2596 2597]
TRAIN: [   0    1    2 ..., 2603 2604 2605] TEST: [   4   10   14   23   37   39   41   57   69   76   98  113  124  132  145
  148  162  179  191  204  232  234  245  248  251  296  302  303  311  320
  330  353  357  361  379  385  390  402  405  414  425  440  446  454  457
  477  486  487  501  526  527  536  543  558  565  582  610  615  621  634
  638  652  653  666  667  672  676  687  688  692  702  703  708  712  713
  715  716  758  776  789  812  828  838  840  847  852  876  891  897  898
  905  909  914  924  926  935  966  989  997 1002 1003 1025 1041 1068 1070
 1091 1093 1110 1118 1122 1138 1146 1150 1161 1173 1185 1193 1197 1199 1210
 1211 1222 1228 1231 1232 1239 1242 1244 1270 1292 1294 1303 1332 1334 1362
 1366 1377 1380 1405 1412 1414 1424 1426 1449 1465 1467 1486 1487 1493 1496
 1504 1506 1512 1525 1535 1539 1543 1548 1549 1553 1555 1573 1594 1599 1601
 1625 1637 1642 1646 1663 1664 1665 1675 1678 1702 1703 1712 1714 1727 1728
 1768 1770 1774 1781 1809 1813 1824 1825 1826 1839 1845 1851 1852 1859 1885
 1928 1937 1946 1949 1952 1955 1988 2001 2007 2013 2037 2038 2045 2054 2067
 2069 2073 2088 2095 2116 2131 2143 2149 2175 2201 2207 2213 2229 2233 2245
 2246 2262 2266 2272 2276 2288 2293 2297 2321 2333 2341 2354 2356 2376 2381
 2395 2399 2405 2409 2442 2482 2491 2497 2508 2512 2515 2526 2531 2556 2559
 2561 2562 2566 2588 2592 2594]
TRAIN: [   0    2    3 ..., 2603 2604 2605] TEST: [   1    6   11   17   18   30   40   47   48   52   58  107  125  133  149
  157  161  173  175  189  194  200  206  220  229  249  254  264  283  286
  300  305  322  342  384  386  391  392  411  442  444  453  458  459  461
  488  503  505  517  529  530  535  538  557  568  570  574  575  579  587
  596  602  641  646  648  651  657  661  665  670  684  723  727  731  757
  762  768  792  805  841  850  886  892  895  900  906  918  936  949  951
  953  963  995  996 1005 1009 1015 1017 1027 1042 1055 1058 1063 1073 1081
 1098 1103 1116 1126 1127 1139 1140 1157 1160 1174 1188 1190 1203 1205 1226
 1236 1246 1256 1267 1271 1273 1283 1295 1302 1317 1322 1328 1357 1367 1370
 1373 1386 1393 1411 1422 1428 1431 1448 1450 1459 1471 1492 1503 1513 1521
 1523 1528 1533 1540 1547 1567 1569 1598 1602 1621 1631 1632 1643 1673 1677
 1697 1704 1713 1715 1724 1725 1726 1736 1737 1753 1754 1773 1802 1819 1829
 1840 1854 1855 1861 1864 1867 1869 1872 1897 1900 1931 1942 1945 1947 1964
 1983 1985 1996 1999 2012 2015 2026 2039 2052 2072 2083 2097 2099 2109 2124
 2144 2145 2165 2188 2190 2198 2208 2223 2235 2238 2240 2249 2256 2279 2284
 2285 2290 2300 2308 2320 2326 2336 2352 2355 2357 2367 2372 2382 2396 2407
 2414 2422 2430 2432 2462 2467 2471 2477 2480 2495 2500 2511 2518 2525 2548
 2549 2560 2569 2576 2582 2601]
TRAIN: [   0    1    2 ..., 2603 2604 2605] TEST: [  15   31   34   43   61   64   77   80   85   87  106  118  139  141  144
  169  187  202  223  233  240  250  253  260  262  267  270  279  287  294
  295  298  299  309  310  359  360  369  376  381  393  415  474  475  483
  485  491  502  506  512  516  519  521  522  532  546  553  564  572  600
  620  629  633  636  643  654  655  663  701  721  733  735  766  775  781
  791  793  794  799  806  810  815  818  820  825  829  832  836  842  861
  882  883  890  896  910  917  923  937  938  944  946  958  964  971  977
  979  980  981  986  991 1010 1038 1043 1045 1047 1049 1056 1074 1077 1083
 1087 1097 1099 1114 1119 1136 1148 1151 1170 1180 1183 1212 1217 1240 1254
 1255 1279 1301 1325 1330 1339 1341 1347 1355 1368 1374 1385 1388 1390 1397
 1399 1417 1453 1458 1474 1485 1490 1491 1516 1559 1560 1571 1581 1587 1593
 1603 1610 1611 1647 1648 1674 1694 1696 1709 1758 1759 1760 1779 1786 1787
 1789 1801 1808 1817 1860 1865 1875 1876 1892 1903 1907 1916 1919 1923 1935
 1948 1980 1994 2020 2032 2044 2050 2053 2057 2061 2063 2064 2075 2078 2092
 2122 2129 2133 2140 2158 2164 2173 2178 2203 2204 2210 2219 2259 2263 2273
 2296 2298 2305 2311 2319 2328 2351 2360 2391 2392 2394 2397 2401 2402 2406
 2408 2413 2441 2444 2447 2457 2458 2469 2478 2479 2493 2507 2509 2519 2522
 2536 2538 2571 2574 2575 2581]
TRAIN: [   0    1    3 ..., 2602 2604 2605] TEST: [   2    5   19   36   42   45   54   55   65   66   72   73   82   89  102
  108  122  140  142  152  154  159  170  171  177  178  184  185  190  198
  203  210  214  219  268  272  278  308  312  315  318  319  335  340  341
  347  364  378  383  408  412  416  434  467  468  473  479  481  498  511
  534  539  551  561  583  590  598  632  644  658  689  717  718  724  726
  740  744  750  759  760  772  773  785  796  801  811  813  823  839  856
  863  866  893  930  955  965  974  978  985  987  992 1001 1008 1021 1026
 1029 1050 1069 1076 1080 1082 1117 1129 1132 1135 1137 1145 1154 1165 1166
 1175 1195 1216 1225 1235 1257 1259 1261 1266 1275 1277 1280 1284 1293 1338
 1343 1349 1358 1359 1363 1376 1378 1379 1387 1423 1427 1454 1456 1457 1460
 1462 1473 1478 1482 1489 1499 1500 1518 1526 1530 1537 1538 1550 1606 1612
 1615 1633 1635 1661 1666 1679 1717 1742 1748 1749 1752 1764 1775 1785 1846
 1848 1858 1878 1886 1890 1909 1914 1934 1950 1957 1960 1976 2000 2005 2010
 2017 2018 2025 2028 2031 2034 2035 2056 2058 2093 2100 2102 2108 2119 2160
 2166 2172 2181 2194 2202 2218 2228 2236 2242 2244 2248 2253 2264 2265 2267
 2269 2270 2278 2283 2303 2310 2324 2325 2349 2364 2365 2393 2421 2434 2438
 2452 2453 2456 2460 2463 2465 2475 2484 2485 2487 2492 2505 2506 2514 2528
 2532 2552 2570 2598 2599 2603]
TRAIN: [   0    1    2 ..., 2602 2603 2605] TEST: [   8   13   16   29   32   35   49   56   60   68   71   75   96  110  114
  134  147  155  186  211  212  218  226  231  241  243  244  247  258  265
  285  293  304  336  339  354  355  358  362  367  372  377  401  406  420
  426  431  432  435  445  450  455  456  464  471  494  499  500  513  520
  524  533  540  542  548  549  552  554  576  580  585  589  593  611  631
  642  649  668  677  678  679  693  706  725  737  752  769  782  788  803
  814  862  871  875  881  884  887  889  939  940  942  943  948  957  960
  970  984  994  999 1014 1016 1031 1044 1060 1064 1065 1075 1105 1120 1123
 1124 1128 1143 1164 1171 1184 1189 1200 1218 1227 1229 1245 1248 1252 1265
 1299 1310 1312 1315 1319 1336 1340 1354 1361 1403 1407 1436 1444 1446 1451
 1455 1463 1464 1476 1495 1498 1501 1505 1511 1519 1554 1572 1586 1591 1604
 1607 1626 1644 1650 1652 1654 1655 1662 1669 1670 1683 1698 1716 1721 1738
 1739 1741 1746 1766 1767 1771 1776 1794 1806 1837 1850 1870 1880 1888 1893
 1894 1922 1936 1944 1951 1953 1981 1984 1992 2016 2029 2043 2060 2062 2065
 2070 2082 2096 2111 2125 2126 2130 2132 2157 2167 2174 2192 2193 2196 2205
 2216 2224 2231 2243 2250 2255 2268 2301 2307 2314 2323 2330 2337 2363 2368
 2403 2404 2423 2436 2440 2461 2466 2481 2516 2537 2540 2541 2547 2557 2563
 2572 2580 2586 2589 2600 2604]
TRAIN: [   0    1    2 ..., 2602 2603 2604] TEST: [  38   44   51   59   78   81   83   88   97  103  111  119  123  129  131
  165  183  188  205  213  238  239  261  263  269  313  316  328  338  349
  356  363  371  374  382  395  397  410  439  448  463  466  472  476  484
  490  493  507  510  514  528  541  571  599  601  608  613  625  635  645
  656  660  681  685  695  697  729  751  761  771  779  783  784  808  819
  826  833  844  846  849  853  858  867  874  879  904  907  911  919  934
  947  969  993 1019 1034 1035 1039 1057 1067 1084 1088 1089 1094 1102 1115
 1121 1125 1156 1158 1191 1196 1220 1223 1224 1233 1238 1260 1262 1268 1269
 1274 1276 1281 1290 1291 1296 1307 1311 1318 1320 1321 1324 1335 1344 1351
 1360 1364 1375 1382 1383 1391 1401 1408 1410 1415 1419 1420 1430 1439 1440
 1442 1475 1477 1494 1502 1510 1524 1527 1529 1534 1557 1564 1609 1613 1614
 1618 1624 1630 1636 1639 1651 1658 1687 1691 1693 1695 1700 1706 1711 1733
 1743 1756 1772 1784 1804 1811 1830 1831 1833 1841 1842 1857 1881 1882 1887
 1906 1912 1918 1959 1977 1991 2004 2014 2086 2094 2101 2103 2104 2106 2110
 2114 2118 2127 2147 2179 2184 2199 2214 2220 2226 2227 2232 2239 2241 2275
 2286 2287 2294 2318 2327 2329 2332 2342 2348 2373 2375 2411 2420 2428 2433
 2437 2439 2451 2459 2464 2472 2474 2489 2503 2521 2529 2530 2534 2535 2545
 2546 2551 2553 2590 2605]
TRAIN: [   0    1    2 ..., 2603 2604 2605] TEST: [  12   20   28   46   50   62   74   79   90   95  101  105  115  116  120
  127  128  138  143  150  158  167  172  181  193  208  222  225  228  230
  235  236  242  255  266  284  288  290  301  306  331  332  337  344  345
  346  352  366  370  380  389  394  396  403  409  413  417  421  441  462
  492  495  496  515  523  531  545  559  566  588  592  612  614  617  622
  626  628  630  662  669  675  683  690  707  719  734  742  747  753  765
  777  787  790  798  822  824  857  864  870  877  901  912  915  920  922
  959  968 1024 1030 1037 1054 1061 1062 1066 1072 1086 1092 1095 1142 1168
 1169 1178 1181 1182 1186 1214 1230 1234 1237 1258 1282 1286 1288 1309 1313
 1342 1356 1365 1372 1394 1402 1404 1406 1433 1437 1481 1508 1509 1517 1522
 1545 1546 1575 1576 1583 1584 1595 1600 1616 1620 1623 1628 1629 1638 1649
 1659 1676 1682 1685 1688 1692 1730 1745 1751 1755 1782 1788 1791 1793 1796
 1797 1799 1807 1810 1812 1815 1835 1843 1856 1866 1884 1891 1911 1917 1926
 1932 1958 1962 1965 1986 1990 1995 2002 2041 2047 2048 2049 2066 2068 2077
 2079 2098 2112 2113 2128 2137 2142 2148 2155 2162 2168 2170 2200 2206 2209
 2211 2212 2221 2234 2252 2261 2277 2291 2299 2313 2350 2353 2361 2366 2369
 2370 2374 2384 2388 2426 2445 2448 2449 2455 2476 2486 2504 2513 2533 2539
 2542 2565 2573 2583 2595]
TRAIN: [   1    2    4 ..., 2603 2604 2605] TEST: [   0    3    7   21   26   63   93  100  112  126  153  160  163  164  174
  237  246  280  281  282  292  321  325  327  329  334  343  348  350  365
  375  387  400  404  407  419  424  428  437  447  449  451  460  470  497
  504  550  573  577  584  586  594  595  603  604  605  606  624  627  640
  647  650  664  671  673  680  691  694  696  698  699  709  732  736  738
  739  741  754  780  786  800  804  827  830  834  837  845  848  851  854
  859  869  873  902  903  929  932  941  945  950  952  975  990 1004 1006
 1011 1028 1040 1046 1048 1090 1111 1113 1130 1131 1133 1144 1149 1159 1163
 1177 1179 1194 1201 1202 1209 1215 1219 1221 1243 1247 1249 1250 1251 1253
 1289 1298 1305 1306 1308 1314 1331 1333 1337 1348 1353 1369 1384 1389 1398
 1409 1413 1416 1425 1438 1441 1443 1461 1468 1479 1480 1497 1514 1532 1541
 1542 1551 1556 1558 1562 1566 1574 1579 1582 1596 1608 1617 1660 1667 1671
 1672 1690 1705 1707 1722 1734 1744 1747 1769 1800 1805 1834 1838 1844 1847
 1849 1862 1868 1879 1883 1889 1915 1924 1933 1938 1941 1963 1967 1971 1975
 1978 1993 2003 2006 2009 2021 2040 2042 2051 2107 2136 2150 2152 2153 2154
 2180 2182 2185 2189 2195 2215 2247 2271 2295 2304 2306 2309 2315 2331 2340
 2347 2377 2379 2380 2385 2390 2412 2424 2427 2443 2454 2470 2473 2488 2501
 2523 2527 2591 2593 2602]
TRAIN: [   0    1    2 ..., 2603 2604 2605] TEST: [  24   25   67   84   86   91   94   99  130  136  146  151  166  168  176
  180  197  199  201  207  209  216  221  256  257  273  274  275  277  291
  297  307  323  324  368  373  388  423  429  430  433  469  508  509  525
  537  544  555  556  560  591  607  623  637  639  659  705  714  730  749
  755  756  763  767  770  774  797  802  807  809  816  821  835  843  860
  865  872  885  888  894  908  925  928  931  954  956  967  972  973  976
 1007 1020 1022 1033 1053 1071 1085 1104 1107 1112 1134 1141 1152 1153 1155
 1162 1167 1172 1176 1198 1204 1206 1207 1208 1241 1272 1278 1297 1304 1316
 1329 1345 1346 1350 1352 1381 1392 1395 1400 1429 1434 1435 1445 1447 1466
 1469 1470 1472 1483 1488 1531 1536 1552 1561 1563 1565 1578 1589 1605 1619
 1634 1640 1641 1645 1653 1684 1699 1701 1718 1720 1723 1731 1740 1750 1777
 1778 1792 1795 1822 1823 1828 1863 1871 1877 1896 1899 1904 1905 1908 1910
 1913 1920 1925 1930 1940 1954 1956 1966 1970 1973 1987 2008 2011 2022 2023
 2024 2036 2046 2059 2071 2076 2080 2081 2084 2089 2091 2105 2117 2120 2121
 2135 2139 2141 2146 2159 2163 2169 2171 2176 2177 2187 2197 2217 2222 2237
 2251 2257 2260 2280 2282 2292 2322 2334 2335 2338 2339 2344 2362 2371 2383
 2389 2415 2418 2425 2431 2435 2446 2483 2490 2494 2496 2517 2524 2554 2558
 2568 2577 2578 2584 2587]

You can see that ShuffleSplit has selected a random set of observations from the index of our data set for each fold of our cross-validation. Let's look at the size of our training and test set for each fold.


In [9]:
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))


TRAIN: 2345 TEST: 261
TRAIN: 2345 TEST: 261
TRAIN: 2345 TEST: 261
TRAIN: 2345 TEST: 261
TRAIN: 2345 TEST: 261
TRAIN: 2345 TEST: 261
TRAIN: 2346 TEST: 260
TRAIN: 2346 TEST: 260
TRAIN: 2346 TEST: 260
TRAIN: 2346 TEST: 260

Now let's try using KFold to train and test our model on 10 different subsets of our data. Below we set our cross-validator as 'cv'. We then loop through the various splits in our data that cv creates and use it to make our training and test sets. We then use our training set to fit a Logistic Regression model and generate predictions from our test set, which we compare to the actual outcomes we observed.


In [10]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)

## Create for-loop
for train_index, test_index in cv.split(data):

    ## Define training and test sets
    X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_train = data.loc[train_index]['activity']
    X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_test = data.loc[test_index]['activity']
        
    ## Fit model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    ## Generate predictions
    predicted = clf.predict(X_test)
    
    ## Compare to actual outcomes and return precision
    print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))


Precision: 56.9
Precision: 55.8
Precision: 60.4
Precision: 63.5
Precision: 55.4
Precision: 52.8
Precision: 56.9
Precision: 57.1
Precision: 65.8
Precision: 41.0

We can see that, for the most part, about 50 to 60 percent of the inspections our model predicts will lead our inspectors to rat burrows actually do. This is a modest improvement over our inspectors' current performance in the field. Based on these results, if we used our models to determine which locations our inspectors go to in the field, we'd probably see a 10 to 20 point increase in their likelihood of finding rat burrows.

Exercise 1

Try running the k-fold cross-validation a few times with the same random state. Then try running it a few times with different random states. How do the results change?


In [11]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)

## Create for-loop
for train_index, test_index in cv.split(data):

    ## Define training and test sets
    X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_train = data.loc[train_index]['activity']
    X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_test = data.loc[test_index]['activity']
        
    ## Fit model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    ## Generate predictions
    predicted = clf.predict(X_test)
    
    ## Compare to actual outcomes and return precision
    print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))


Precision: 56.9
Precision: 55.8
Precision: 60.4
Precision: 63.5
Precision: 55.4
Precision: 52.8
Precision: 56.9
Precision: 57.1
Precision: 65.8
Precision: 41.0

In [12]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=1)

## Create for-loop
for train_index, test_index in cv.split(data):

    ## Define training and test sets
    X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_train = data.loc[train_index]['activity']
    X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_test = data.loc[test_index]['activity']
        
    ## Fit model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    ## Generate predictions
    predicted = clf.predict(X_test)
    
    ## Compare to actual outcomes and return precision
    print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))


Precision: 60.3
Precision: 52.7
Precision: 57.4
Precision: 60.4
Precision: 57.1
Precision: 47.9
Precision: 54.2
Precision: 43.2
Precision: 71.4
Precision: 53.3

In [13]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=17)

## Create for-loop
for train_index, test_index in cv.split(data):

    ## Define training and test sets
    X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_train = data.loc[train_index]['activity']
    X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
    y_test = data.loc[test_index]['activity']
        
    ## Fit model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    ## Generate predictions
    predicted = clf.predict(X_test)
    
    ## Compare to actual outcomes and return precision
    print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))


Precision: 60.0
Precision: 62.2
Precision: 62.7
Precision: 47.7
Precision: 53.2
Precision: 55.7
Precision: 43.6
Precision: 47.5
Precision: 55.4
Precision: 60.4

Different random states create different measures of precision/datapoints across the 10 subsamples. However, the mean still seems to be in the mid 50's,suggesting some level of consitency

It's important to point out here that, because we have TIME SERIES data, the same Census blocks may be appearing in our training AND our test sets. This is a challenge to ensuring that our training and test samples are INDEPENDENT. While Rodent Control does not inspect the same blocks every month, some of the same blocks may be re-inspected from month to month depending on where 311 requests are coming from.

However, this also affords us an opportunity. More than likely, when we make predictions about which inspections will lead our inspectors to rat burrows, we are interested in predicting FUTURE inspections with observations from PAST inspections. In this case, cross-validating over time can be a very good way of looking at how well our models are performing.

Cross-validating over time requires more than just splitting by month. Rather, we will use observations from each month as a test set and train our models on all PRIOR months. Which we do below.

Cross-validation by Month

Let's begin by seeing what our cross-validation sets look like. Below, we loop through each of the sets to see which months end up in our training and test sets. You can see that as we move from month to month, we have more and more past observations in our training set.


In [14]:
months = np.sort(data.month.unique())

for month in range(2,13):
    test = data[data.month==month]
    train = data[(data.month < month)]

    print('Test Month: '+str(test.month.unique()), 'Training Months: '+str(train.month.unique()))


Test Month: [2] Training Months: [1]
Test Month: [3] Training Months: [1 2]
Test Month: [4] Training Months: [1 2 3]
Test Month: [5] Training Months: [1 2 3 4]
Test Month: [6] Training Months: [1 2 3 4 5]
Test Month: [7] Training Months: [1 2 3 4 5 6]
Test Month: [8] Training Months: [1 2 3 4 5 6 7]
Test Month: [9] Training Months: [1 2 3 4 5 6 7 8]
Test Month: [10] Training Months: [1 2 3 4 5 6 7 8 9]
Test Month: [11] Training Months: [ 1  2  3  4  5  6  7  8  9 10]
Test Month: [12] Training Months: [ 1  2  3  4  5  6  7  8  9 10 11]

In [15]:
months = np.sort(data.month.unique())

for month in range(2,13):

    test = data[data.month==month]
    train = data[(data.month < month)]
    X_test = test.drop(['activity', 'month', 'WARD'], axis=1)
    y_test = test['activity']
    X_train = test.drop(['activity', 'month', 'WARD'], axis=1)
    y_train = test['activity']
        
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    print('Precision for Month '+str(month)+': '+str(100*round(precision_score(y_test, predicted),3)))


Precision for Month 2: 79.2
Precision for Month 3: 67.9
Precision for Month 4: 48.9
Precision for Month 5: 63.5
Precision for Month 6: 61.3
Precision for Month 7: 67.9
Precision for Month 8: 67.2
Precision for Month 9: 67.6
Precision for Month 10: 57.3
Precision for Month 11: 68.3
Precision for Month 12: 70.5

Our model seems to be performing even better when we cross-validate over months, possibly because we're structuring the cross-validation such that inspections in some of the same blocks appear consistently over time.

Exercise 2

Try re-creating this cross-validation, but with the training set restricted to only the 3 months prior to the test set. Now do the same with the last 1 and 2 months. Do the results change?


In [16]:
months = np.sort(data.month.unique())

for month in range(2,13):
    test = data[data.month==month]
    train = data[data.month>=month-3] 
    train = train.drop[train.month>=month]
    print('Test Month: '+str(test.month.unique()), 'Training Months: '+str(train.month.unique()))


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-6ab3699e1c10> in <module>()
      4     test = data[data.month==month]
      5     train = data[data.month>=month-3]
----> 6     train = train.drop[train.month>=month]
      7     print('Test Month: '+str(test.month.unique()), 'Training Months: '+str(train.month.unique()))

TypeError: 'method' object is not subscriptable

In [ ]:


In [ ]:


In [ ]:

We may still be concerned about the independence of our training and test sets. In particular, as I've pointed out, the same Census blocks may appear repeatedly in our data over time. In this case, it may be good to cross-validate geographically to make sure that our model is performing well in different parts of the city. In particular, we know that requests for rodent abatement (and rats themselves) are more common in some parts of the city than in others. In particular, rats are more common in the more densely-populated parts of downtown and less common in less densely-populated places like Wards 3, 7, and 8. Therefore, we may be interested in cross-validating by ward.

Again, this is as simple as looping through each of the 8 wards, holding out each ward as a test set and training the models on observations from the remaining wards.

Cross-validate by Ward


In [ ]:
data.WARD.value_counts().sort_index()

In [75]:
for ward in np.sort(data.WARD.unique()):

    test = data[data.WARD == ward]
    train = data[data.WARD != ward]
    X_test = test.drop(['activity', 'month', 'WARD'], axis=1)
    y_test = test['activity']
    X_train = test.drop(['activity', 'month', 'WARD'], axis=1)
    y_train = test['activity']
        
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    print('Precision for Ward '+str(ward)+': '+str(100*round(precision_score(y_test, predicted),3)))


Precision for Ward 1: 60.5
Precision for Ward 2: 60.2
Precision for Ward 3: 81.2
Precision for Ward 4: 62.0
Precision for Ward 5: 62.3
Precision for Ward 6: 55.6
Precision for Ward 7: 0.0
Precision for Ward 8: 0.0
/opt/conda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)

Here we see that the model performs very well predicting the outcomes of inspections in wards 1 through 4, but less well in wards 5 though 8. In wards 7 and 8 in particular, the model fails to predict any positive cases. This means that our model may be overfit to observations in Wards 1 through 6, and we may want to re-evaluate our approach.

Exercise 3

Explore the data and our model and try to come up with some reasons that the model is performing poorly on Wards 7 and 8. Is there a way we can fix the model to perform better on those wards? How might we fix the model?


In [19]:
data.head().T


Out[19]:
0 1 2 3 4
activity 1.000000 0.000000 0.000000 1.000000 1.000000
alley_condition 10.000000 25.000000 15.000000 0.000000 10.000000
bbl_hotel 0.000000 0.000000 0.000000 0.000000 0.000000
bbl_multifamily_rental 1.000000 0.000000 0.000000 1.000000 2.000000
bbl_restaurant 1.000000 0.000000 0.000000 0.000000 0.000000
bbl_single_family_rental 2.000000 8.000000 3.000000 3.000000 1.000000
bbl_storage 0.000000 0.000000 0.000000 0.000000 0.000000
bbl_two_family_rental 0.000000 0.000000 0.000000 0.000000 0.000000
communitygarden_area 0.000000 0.000000 0.000000 0.000000 0.000000
communitygarden_id 0.000000 0.000000 0.000000 0.000000 0.000000
dcrapermit_addition 0.000000 0.000000 0.000000 0.000000 0.000000
dcrapermit_demolition 0.000000 0.000000 0.000000 1.000000 0.000000
dcrapermit_excavation 0.000000 0.000000 0.000000 0.000000 0.000000
dcrapermit_new_building 0.000000 0.000000 0.000000 0.000000 0.000000
dcrapermit_raze 0.000000 0.000000 0.000000 0.000000 0.000000
impervious_area 25753.280567 20933.568033 11002.566472 10855.879189 15560.200354
month 1.000000 1.000000 1.000000 1.000000 1.000000
num_mixed_use 0.000000 0.000000 0.000000 0.000000 1.000000
num_non_residential 17.000000 0.000000 0.000000 4.000000 2.000000
num_residential 54.000000 102.000000 26.000000 14.000000 64.000000
park 0.000000 0.000000 0.000000 0.000000 0.000000
pct_mixed_use 0.000000 0.000000 0.000000 0.000000 0.014925
pct_non_residential 0.239437 0.000000 0.000000 0.222222 0.029851
pct_residential 0.760563 1.000000 1.000000 0.777778 0.955224
pop_density 5069.093059 8738.804149 12064.771513 32435.135124 43599.659702
sidewalk_grates 2.000000 0.000000 0.000000 1.000000 3.000000
ssl_cndtn_Average_comm 0.000000 0.000000 0.000000 0.000000 0.285714
ssl_cndtn_Average_res 0.583333 0.627451 0.720000 0.416667 0.622642
ssl_cndtn_Excellent_comm 0.000000 0.000000 0.000000 0.000000 0.000000
ssl_cndtn_Excellent_res 0.000000 0.000000 0.000000 0.000000 0.000000
ssl_cndtn_Fair_comm 0.000000 0.000000 0.000000 0.000000 0.000000
ssl_cndtn_Fair_res 0.041667 0.000000 0.000000 0.000000 0.037736
ssl_cndtn_Good_comm 0.400000 0.000000 0.000000 0.000000 0.142857
ssl_cndtn_Good_res 0.312500 0.343137 0.240000 0.416667 0.339623
ssl_cndtn_Poor_comm 0.000000 0.000000 0.000000 0.000000 0.000000
ssl_cndtn_Poor_res 0.000000 0.000000 0.000000 0.000000 0.000000
ssl_cndtn_VeryGood_comm 0.600000 0.000000 0.000000 1.000000 0.571429
ssl_cndtn_VeryGood_res 0.062500 0.029412 0.040000 0.166667 0.000000
tot_pop 137.000000 273.000000 81.000000 132.000000 216.000000
well_activity 0.000000 0.000000 0.000000 0.000000 0.000000
WARD 3.000000 3.000000 3.000000 2.000000 2.000000

In [24]:
data.describe().T


Out[24]:
count mean std min 25% 50% 75% max
activity 2606.0 0.431696 0.495408 0.000000 0.000000 0.000000 1.000000 1.000000
alley_condition 2606.0 11.111282 8.900166 0.000000 4.000000 10.000000 16.000000 79.000000
bbl_hotel 2606.0 0.082118 0.376073 0.000000 0.000000 0.000000 0.000000 8.000000
bbl_multifamily_rental 2606.0 1.388718 2.376244 0.000000 0.000000 0.000000 2.000000 22.000000
bbl_restaurant 2606.0 0.569455 1.518526 0.000000 0.000000 0.000000 0.000000 17.000000
bbl_single_family_rental 2606.0 4.709133 8.375165 0.000000 1.000000 2.000000 5.000000 147.000000
bbl_storage 2606.0 0.002686 0.051768 0.000000 0.000000 0.000000 0.000000 1.000000
bbl_two_family_rental 2606.0 0.743668 1.378860 0.000000 0.000000 0.000000 1.000000 15.000000
communitygarden_area 2606.0 18.727920 326.332382 0.000000 0.000000 0.000000 0.000000 11004.319881
communitygarden_id 2606.0 0.242134 3.234069 0.000000 0.000000 0.000000 0.000000 80.000000
dcrapermit_addition 2606.0 0.054490 0.265959 0.000000 0.000000 0.000000 0.000000 4.000000
dcrapermit_demolition 2606.0 0.280507 0.738300 0.000000 0.000000 0.000000 0.000000 12.000000
dcrapermit_excavation 2606.0 0.010361 0.105000 0.000000 0.000000 0.000000 0.000000 2.000000
dcrapermit_new_building 2606.0 0.053722 0.286938 0.000000 0.000000 0.000000 0.000000 4.000000
dcrapermit_raze 2606.0 0.044896 0.308390 0.000000 0.000000 0.000000 0.000000 5.000000
impervious_area 2606.0 18450.549489 18774.921307 2150.037473 11356.602831 14532.464194 19920.166029 473222.487756
month 2606.0 7.194935 3.001022 1.000000 5.000000 7.000000 10.000000 12.000000
num_mixed_use 2606.0 0.153876 0.474006 0.000000 0.000000 0.000000 0.000000 4.000000
num_non_residential 2606.0 3.804682 5.956966 0.000000 0.000000 1.000000 5.000000 58.000000
num_residential 2606.0 39.287797 26.661136 0.000000 21.000000 36.000000 53.000000 334.000000
park 2606.0 0.046815 0.225350 0.000000 0.000000 0.000000 0.000000 4.000000
pct_mixed_use 2606.0 0.003899 0.013706 0.000000 0.000000 0.000000 0.000000 0.250000
pct_non_residential 2606.0 0.139389 0.242231 0.000000 0.000000 0.038462 0.150000 1.000000
pct_residential 2606.0 0.854794 0.247308 0.000000 0.838200 0.959184 1.000000 1.000000
pop_density 2606.0 24969.466549 19217.566990 0.000000 13524.013052 21965.055541 30907.352412 182709.507271
sidewalk_grates 2606.0 2.109363 5.186348 0.000000 0.000000 0.000000 2.000000 73.000000
ssl_cndtn_Average_comm 2606.0 0.402058 0.401132 0.000000 0.000000 0.333333 0.800000 1.000000
ssl_cndtn_Average_res 2606.0 0.574651 0.265928 0.000000 0.444444 0.635642 0.767857 1.000000
ssl_cndtn_Excellent_comm 2606.0 0.022413 0.095254 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_Excellent_res 2606.0 0.001871 0.029081 0.000000 0.000000 0.000000 0.000000 0.766990
ssl_cndtn_Fair_comm 2606.0 0.024810 0.101711 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_Fair_res 2606.0 0.018239 0.047230 0.000000 0.000000 0.000000 0.017857 0.666667
ssl_cndtn_Good_comm 2606.0 0.162147 0.261734 0.000000 0.000000 0.000000 0.250000 1.000000
ssl_cndtn_Good_res 2606.0 0.285093 0.206829 0.000000 0.133333 0.266667 0.403509 1.000000
ssl_cndtn_Poor_comm 2606.0 0.002487 0.032450 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_Poor_res 2606.0 0.002067 0.010527 0.000000 0.000000 0.000000 0.000000 0.200000
ssl_cndtn_VeryGood_comm 2606.0 0.104810 0.232111 0.000000 0.000000 0.000000 0.000000 1.000000
ssl_cndtn_VeryGood_res 2606.0 0.036727 0.071472 0.000000 0.000000 0.000000 0.052632 1.000000
tot_pop 2606.0 188.892172 211.199274 0.000000 82.000000 137.000000 231.000000 3888.000000
well_activity 2606.0 0.572141 1.569373 0.000000 0.000000 0.000000 0.000000 19.000000
WARD 2606.0 3.994244 2.112836 1.000000 2.000000 4.000000 6.000000 8.000000

In [41]:
data.bbl_restaurant.value_counts()


Out[41]:
0.0     2015
1.0      260
2.0      137
3.0       83
4.0       38
5.0       24
8.0       11
6.0       10
9.0        9
7.0        7
12.0       4
10.0       2
15.0       2
11.0       2
17.0       1
14.0       1
Name: bbl_restaurant, dtype: int64

In [42]:
data.groupby('WARD').bbl_restaurant.value_counts()


Out[42]:
WARD  bbl_restaurant
1     0.0               283
      1.0                61
      2.0                37
      3.0                18
      4.0                 8
      6.0                 2
      8.0                 2
      11.0                2
      12.0                2
      5.0                 1
      9.0                 1
      15.0                1
      17.0                1
2     0.0               255
      1.0                70
      2.0                49
      3.0                37
      4.0                23
      5.0                19
      6.0                 7
      7.0                 5
      9.0                 5
      8.0                 4
      10.0                2
      14.0                1
3     0.0                71
      1.0                11
      2.0                10
      3.0                 5
      9.0                 3
      7.0                 2
      4.0                 1
      5.0                 1
      8.0                 1
4     0.0               431
      1.0                44
      3.0                11
      2.0                10
5     0.0               336
      1.0                10
      2.0                 7
      3.0                 1
      4.0                 1
6     0.0               358
      1.0                51
      2.0                22
      3.0                11
      4.0                 5
      8.0                 4
      5.0                 3
      12.0                2
      6.0                 1
      15.0                1
7     0.0               148
      1.0                 5
      2.0                 1
8     0.0               133
      1.0                 8
      2.0                 1
Name: bbl_restaurant, dtype: int64

In [65]:
data.groupby(data.activity==1).WARD.value_counts().sort_values(ascending = False)


Out[65]:
activity  WARD
False     6       353
True      2       271
False     4       260
True      4       236
          1       231
False     2       206
True      5       199
False     1       188
          5       156
          7       134
          8       119
True      6       105
False     3        65
True      3        40
          8        23
          7        20
Name: WARD, dtype: int64

In [79]:
data.groupby('WARD').tot_pop.sum().sort_values(ascending = False)


Out[79]:
WARD
1    123400
2     90829
4     72237
6     66442
5     58077
3     27970
7     27699
8     25599
Name: tot_pop, dtype: int64

Looks like Ward 3, 7, 8 are about the same size. Model is MOST Accurate in Ward 3 and not accurate at all in Wards 7 and 8. Maybe our model is overfit to 3 -- what about these wards are different?


In [96]:
three = data[data.WARD==3]
seven = data[data.WARD==7]
eight = data[data.WARD==8]

three.activity.value_counts()


Out[96]:
0.0    65
1.0    40
Name: activity, dtype: int64

In [97]:
seven.activity.value_counts()


Out[97]:
0.0    134
1.0     20
Name: activity, dtype: int64

In [98]:
eight.activity.value_counts()


Out[98]:
0.0    119
1.0     23
Name: activity, dtype: int64

In [108]:
data.groupby('WARD').activity.value_counts(sort=True)


Out[108]:
WARD  activity
1     1.0         231
      0.0         188
2     1.0         271
      0.0         206
3     0.0          65
      1.0          40
4     0.0         260
      1.0         236
5     1.0         199
      0.0         156
6     0.0         353
      1.0         105
7     0.0         134
      1.0          20
8     0.0         119
      1.0          23
Name: activity, dtype: int64

Ward's 5-8 have different active-not active ratios. They're most out of whack in 6-8, the least accurate of our predictions. Ratio of not active/to active is about 1 (give or take .25) for every value.

Things are about 3 times as inactive and 5-6 times as inactive in wards 6-8. Ward 3, the sample we could be overfitting to is about 1.5 times more inactive. A bit weird, but more reasonable. This could be a source of our issue?

Exercise 4

Now try running some cross-validations with the data from your project. What are some different ways you might slice the data you're using for your project? Try them out here. This will be a good way to begin making progress toward your final submission.

PLEASE REMEMBER TO SUBMIT THIS HOMEWORK BY CLASS TIME ON THURSDAY.


In [124]:
data.describe().T


Out[124]:
count mean std min 25% 50% 75% max
X 53155.0 -7.701390e+01 3.896925e-02 -7.711317e+01 -7.703758e+01 -7.701876e+01 -7.699048e+01 -7.691005e+01
Y 53155.0 3.891153e+01 2.755631e-02 3.881366e+01 3.889805e+01 3.891023e+01 3.892769e+01 3.899398e+01
OBJECTID 53155.0 2.584204e+07 4.772180e+04 2.557148e+07 2.580968e+07 2.583192e+07 2.585243e+07 2.599323e+07
BBL_LICENSE_FACT_ID 53155.0 3.573744e+05 3.762054e+04 3.117700e+05 3.340900e+05 3.484900e+05 3.606180e+05 4.720330e+05
CUST_NUM 53155.0 3.838201e+11 2.744648e+11 1.950124e+07 7.010872e+07 4.103160e+11 5.005168e+11 9.313170e+11
LATITUDE 53155.0 3.891153e+01 2.755631e-02 3.881365e+01 3.889804e+01 3.891022e+01 3.892768e+01 3.899398e+01
LONGITUDE 53155.0 -7.701390e+01 3.896912e-02 -7.711317e+01 -7.703758e+01 -7.701876e+01 -7.699048e+01 -7.691005e+01
XCOORD 53155.0 3.987948e+05 3.379723e+03 3.901879e+05 3.967413e+05 3.983744e+05 4.008254e+05 4.078038e+05
YCOORD 53155.0 1.381857e+05 3.059101e+03 1.273200e+05 1.366886e+05 1.380399e+05 1.399780e+05 1.473384e+05
ZIPCODE 53068.0 2.001509e+04 2.909145e+01 2.000100e+04 2.000500e+04 2.001000e+04 2.001800e+04 2.059300e+04
MARADDRESSREPOSITORYID 53155.0 2.280445e+05 8.922920e+04 0.000000e+00 2.250700e+05 2.445020e+05 2.896090e+05 3.147560e+05
DCSTATADDRESSKEY 53155.0 1.390966e+05 1.283098e+05 2.000000e+00 6.841700e+04 8.885700e+04 1.285030e+05 4.819170e+05
DCSTATLOCATIONKEY 53155.0 1.309599e+05 1.090843e+05 2.000000e+00 6.841700e+04 8.885700e+04 1.285030e+05 4.138170e+05
WARD 53155.0 4.105954e+00 2.080958e+00 1.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00

In [132]:
df1 = pd.read_csv('https://opendata.arcgis.com/datasets/82ab09c9541b4eb8ba4b537e131998ce_22.csv')

df2 = pd.read_csv('https://opendata.arcgis.com/datasets/f2e1c2ef9eb44f2899f4a310a80ecec9_2.csv')

In [131]:



Out[131]:
0 1 2 3 4
X -76.9841 -76.9841 -76.9841 -76.9841 -76.9841
Y 38.8363 38.8363 38.8363 38.8363 38.8363
OBJECTID 25571475 25571476 25571477 25571478 25571479
BBL_LICENSE_FACT_ID 313246 313246 313246 313246 313246
LICENSESTATUS ACTIVE ACTIVE ACTIVE ACTIVE ACTIVE
LICENSECATEGORY Charitable Solicitation Charitable Solicitation Charitable Solicitation Charitable Solicitation Charitable Solicitation
CUST_NUM 400212000180 400212000180 400212000180 400212000180 400212000180
TRADE_NAME BREATHE DC BREATHE DC BREATHE DC BREATHE DC BREATHE DC
LICENSE_START_DATE 2016-03-01T00:00:00.000Z 2016-03-01T00:00:00.000Z 2016-03-01T00:00:00.000Z 2016-03-01T00:00:00.000Z 2016-03-01T00:00:00.000Z
LICENSE_EXPIRATION_DATE 2018-02-28T00:00:00.000Z 2018-02-28T00:00:00.000Z 2018-02-28T00:00:00.000Z 2018-02-28T00:00:00.000Z 2018-02-28T00:00:00.000Z
LICENSE_ISSUE_DATE 2016-02-25T00:00:00.000Z 2016-02-25T00:00:00.000Z 2016-02-25T00:00:00.000Z 2016-02-25T00:00:00.000Z 2016-02-25T00:00:00.000Z
AGENT_PHONE 2027265555 2027265555 2027265555 2027265555 2027265555
LASTMODIFIEDDATE 2017-06-20T15:10:12.000Z 2017-06-20T15:10:12.000Z 2017-06-20T15:10:12.000Z 2017-06-20T15:10:12.000Z 2017-06-20T15:10:12.000Z
CITY WASHINGTON WASHINGTON WASHINGTON WASHINGTON WASHINGTON
STATE DC DC DC DC DC
SITEADDRESS 1310 SOUTHERN AVENUE SE 1310 SOUTHERN AVENUE SE 1310 SOUTHERN AVENUE SE 1310 SOUTHERN AVENUE SE 1310 SOUTHERN AVENUE SE
LATITUDE 38.8363 38.8363 38.8363 38.8363 38.8363
LONGITUDE -76.9841 -76.9841 -76.9841 -76.9841 -76.9841
XCOORD 401383 401383 401383 401383 401383
YCOORD 129834 129834 129834 129834 129834
ZIPCODE 20032 20032 20032 20032 20032
MARADDRESSREPOSITORYID 277936 277936 277936 277936 277936
DCSTATADDRESSKEY 120049 120049 120049 120049 120049
DCSTATLOCATIONKEY 120049 120049 120049 120049 120049
WARD 8 8 8 8 8
ANC 8E 8E 8E 8E 8E
SMD 8E03 8E03 8E03 8E03 8E03
DISTRICT SEVENTH SEVENTH SEVENTH SEVENTH SEVENTH
PSA 706 706 706 706 706
NEIGHBORHOODCLUSTER 38 38 38 38 38
HOTSPOT2006NAME NONE NONE NONE NONE NONE
HOTSPOT2005NAME NONE NONE NONE NONE NONE
HOTSPOT2004NAME NONE NONE NONE NONE NONE
BUSINESSIMPROVEMENTDISTRICT NONE NONE NONE NONE NONE

In [133]:
df2.describe().T


Out[133]:
count mean std min 25% 50% 75% max
OBJECTID 22.0 11.500000 6.493587 1.000000 6.25000 11.500000 16.75 22.00
LAST_UPDAT 22.0 2009.000000 0.000000 2009.000000 2009.00000 2009.000000 2009.00 2009.00
X 22.0 397703.338738 3329.244112 391615.268950 395425.71250 397380.225005 400186.49 406439.23
Y 22.0 138079.263527 2909.785057 130590.010007 136453.63472 137979.804471 139832.70 144234.39
ADDRID 22.0 132709.454545 134544.668464 5616.000000 15750.00000 18006.000000 275965.25 307772.00
PRIORITY_LEVEL 13.0 1.000000 0.000000 1.000000 1.00000 1.000000 1.00 1.00

In [162]:
df1 = pd.read_csv('https://opendata.arcgis.com/datasets/82ab09c9541b4eb8ba4b537e131998ce_22.csv')

df2 = pd.read_csv('https://opendata.arcgis.com/datasets/f2e1c2ef9eb44f2899f4a310a80ecec9_2.csv')


DF = df1.merge(df2, on ='X', how = 'outer')

from sklearn.metrics import precision_score
from sklearn.model_selection import KFold

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))
    
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)

## Create for-loop
for train_index, test_index in cv.split(data):

## Define training and test sets
    X_train = DF.loc[train_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
    y_train = DF.loc[train_index]['BBL_LICENSE_FACT_ID']
    X_test = DF.loc[test_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
    y_test = DF.loc[test_index]['BBL_LICENSE_FACT_ID']
        
    ## Fit model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    ## Generate predictions
    predicted = clf.predict(X_test)
    
    ## Compare to actual outcomes and return precision
    print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))

In [164]:
DF


Out[164]:
X Y_x OBJECTID_x BBL_LICENSE_FACT_ID LICENSESTATUS LICENSECATEGORY CUST_NUM TRADE_NAME LICENSE_START_DATE LICENSE_EXPIRATION_DATE ... WEB_URL WEB_URL2 STEWARD SOURCE LAST_UPDAT Y_y ADDRID PRIORITY_LEVEL DESCRIPTION EMAIL_PHONE
0 -76.984071 38.836306 25571475.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
1 -76.984071 38.836306 25571476.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
2 -76.984071 38.836306 25571477.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
3 -76.984071 38.836306 25571478.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
4 -76.984071 38.836306 25571479.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
5 -76.984071 38.836306 25571480.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
6 -76.984071 38.836306 25571481.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
7 -76.984071 38.836306 25571482.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
8 -76.984071 38.836306 25571483.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
9 -76.984071 38.836306 25571484.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
10 -76.984071 38.836306 25571485.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
11 -76.984071 38.836306 25571486.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
12 -76.984071 38.836306 25571487.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
13 -76.984071 38.836306 25571488.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
14 -76.984071 38.836306 25784321.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
15 -76.984071 38.836306 25784322.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
16 -76.984071 38.836306 25784323.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
17 -76.984071 38.836306 25784324.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
18 -76.984071 38.836306 25784325.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
19 -76.984071 38.836306 25784326.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
20 -76.984071 38.836306 25784327.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
21 -76.984071 38.836306 25784328.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
22 -76.984071 38.836306 25784329.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
23 -76.984071 38.836306 25784330.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
24 -76.984071 38.836306 25784331.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
25 -76.984071 38.836306 25784332.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
26 -76.984071 38.836306 25784333.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
27 -76.984071 38.836306 25784334.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
28 -76.984071 38.836306 25784335.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
29 -76.984071 38.836306 25784336.0 313246.0 ACTIVE Charitable Solicitation 4.002120e+11 BREATHE DC 2016-03-01T00:00:00.000Z 2018-02-28T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
53147 -76.998701 38.903386 25991812.0 470369.0 EXPIRED One Family Rental 5.005149e+11 Matthew Godwin 2014-03-01T00:00:00.000Z 2016-02-29T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53148 -77.065015 38.933024 25991814.0 470371.0 EXPIRED One Family Rental 5.005149e+11 Lindsay Lincoln 2014-03-01T00:00:00.000Z 2016-02-29T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53149 -77.071379 38.933291 25991818.0 470375.0 EXPIRED One Family Rental 5.005149e+11 Cohen/Kornblut residence; Jonathan Cohen 2014-06-01T00:00:00.000Z 2016-05-31T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53150 -77.074664 38.937192 25992005.0 470376.0 EXPIRED One Family Rental 5.005149e+11 Jay Hewlin ; Jay 2014-09-01T00:00:00.000Z 2016-08-31T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53151 -77.042141 38.930821 25992197.0 470281.0 EXPIRED Apartment 6.600022e+07 NIELS THIESS 2014-01-01T00:00:00.000Z 2015-12-31T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53152 -77.030820 38.921718 25992208.0 470294.0 EXPIRED Apartment 6.800379e+07 WARDMAN COURT CO/OF INTERSTATE REALTY MGR CO 2014-06-01T00:00:00.000Z 2016-05-31T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53153 -76.935101 38.890048 25992430.0 470658.0 EXPIRED Two Family Rental 7.010751e+07 NaN 2014-09-01T00:00:00.000Z 2016-08-31T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53154 -76.980528 38.866850 25993234.0 470560.0 EXPIRED One Family Rental 6.800645e+07 HENRY H STRONG 2014-10-01T00:00:00.000Z 2016-09-30T00:00:00.000Z ... NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN
53155 395248.988794 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Jen http://www.dcfoodfinder.org/ 2009.0 137857.987965 15719.0 0 Locally grown fresh fruits and vegetables. Apr... jennifer.guillaume@dc.gov _ 535-2252
53156 402347.350000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Jen http://www.dcfoodfinder.org/ 2009.0 135781.360000 307772.0 0 Locally grown fresh fruits and vegetables. Sum... jennifer.guillaume@dc.gov _ 535-2252
53157 396725.040000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://mtpfm.com/ Jen http://www.dcfoodfinder.org/ 2009.0 140373.590000 226249.0 0 Locally grown fresh fruits and vegetables. May... jennifer.guillaume@dc.gov _ 535-2252
53158 395414.540000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN Jen http://www.dcfoodfinder.org/ 2009.0 139969.700000 219212.0 0 Locally grown fresh fruits and vegetables. Sum... jennifer.guillaume@dc.gov _ 535-2252
53159 400773.519999 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Jen http://www.dcfoodfinder.org/ 2009.0 140627.060000 16406.0 0 Locally grown fresh fruits and vegetables. WIC... jennifer.guillaume@dc.gov _ 535-2252
53160 400134.560000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... www.congressheightsontherise.com Jen http://www.dcfoodfinder.org/ 2009.0 130590.010007 17957.0 0 Locally grown fresh fruits and vegetables. Sat... jennifer.guillaume@dc.gov _ 535-2252
53161 391615.268950 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.palisadesfarmersmarket.com/ Jen http://www.dcfoodfinder.org/ 2009.0 138847.436638 17316.0 0 Locally grown fresh fruits and vegetables. Sun... jennifer.guillaume@dc.gov _ 535-2252
53162 393836.944730 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.newmorningfarm.net/ Jen http://www.dcfoodfinder.org/ 2009.0 142134.741057 12254.0 0 Locally grown fresh fruits and vegetables. Sat... jennifer.guillaume@dc.gov _ 535-2252
53163 406439.230000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Jen http://www.dcfoodfinder.org/ 2009.0 136608.140000 5742.0 0 Locally grown fresh fruits and vegetables. Fri... jennifer.guillaume@dc.gov _ 535-2252
53164 394042.440000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.chevychasefarmersmarket.org/ Jen http://www.dcfoodfinder.org/ 2009.0 144234.390000 263178.0 0 Locally grown fresh fruits and vegetables. Sat... jennifer.guillaume@dc.gov _ 535-2252
53165 398008.015772 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.freshfarmmarkets.org/ http://www.freshfarmmarket.org/ Jen http://www.dcfoodfinder.org/ 2009.0 136402.132960 11946.0 0 Locally grown fresh fruits and vegetables. Apr... jennifer.guillaume@dc.gov _ 535-2252
53166 400330.039998 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.easternmarket-dc.org/ www.goodgenerallink.org Jen http://www.dcfoodfinder.org/ 2009.0 135504.990010 18055.0 0 DC's oldest continually operated fresh food pu... jennifer.guillaume@dc.gov _ 535-2252
53167 397531.320000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.ams.usda.gov/AMSv1.0/farmersmarkets Jen http://www.dcfoodfinder.org/ 2009.0 135490.010000 294873.0 0 Locally grown fresh fruits and vegetables. Fri... jennifer.guillaume@dc.gov _ 535-2252
53168 398132.500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Jen http://www.dcfoodfinder.org/ 2009.0 135138.410000 276612.0 0 Seasonal produce direct from farmers. Tuesday. jennifer.guillaume@dc.gov _ 535-2252
53169 398945.580000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.14andufarmersmarket.com/ Jen http://www.dcfoodfinder.org/ 2009.0 138304.650014 15468.0 0 Locally grown fresh fruits and vegetables. May... jennifer.guillaume@dc.gov _ 535-2252
53170 396107.063994 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://freshfarmmarkets.org/farmers_markets/ma... http://www.freshfarmmarket.org/ Jen http://www.dcfoodfinder.org/ 2009.0 138086.718942 5616.0 0 Locally grown fresh fruits and vegetables. Yea... jennifer.guillaume@dc.gov _ 535-2252
53171 400226.570000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://freshfarmmarkets.org/farmers_markets/ma... http://www.freshfarmmarket.org/ Jen http://www.dcfoodfinder.org/ 2009.0 136899.100000 288803.0 0 Locally grown fresh fruits and vegetables. May... jennifer.guillaume@dc.gov _ 535-2252
53172 395459.230000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.freshfarmmarket.org/markets/foggy_b... http://www.freshfarmmarket.org/ Jen http://www.dcfoodfinder.org/ 2009.0 136966.330000 274025.0 0 Locally grown fresh fruits and vegetables. Apr... jennifer.guillaume@dc.gov _ 535-2252
53173 400203.800000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Jen http://www.dcfoodfinder.org/ 2009.0 137872.890000 301991.0 0 Locally grown fresh fruits and vegetables. WIC... jennifer.guillaume@dc.gov _ 535-2252
53174 396298.579990 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Jen http://www.dcfoodfinder.org/ 2009.0 139421.699999 16877.0 0 Locally grown fresh fruits and vegetables. May... jennifer.guillaume@dc.gov _ 535-2252
53175 397229.130010 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... http://www.14andufarmersmarket.com/ Jen http://www.dcfoodfinder.org/ 2009.0 138792.679998 15843.0 0 Locally grown fresh fruits and vegetables. Apr... jennifer.guillaume@dc.gov _ 535-2252
53176 394423.740000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... https://www.udc.edu/event/farmers-market/ Jen http://www.dcfoodfinder.org/ 2009.0 141839.770000 297694.0 0 Locally grown fresh fruits and vegetables. Jul... jennifer.guillaume@dc.gov _ 535-2252

53177 rows × 51 columns


In [145]:
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))
    
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)

## Create for-loop
for train_index, test_index in cv.split(data):

## Define training and test sets
    X_train = DF.loc[train_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
    y_train = DF.loc[train_index]['BBL_LICENSE_FACT_ID']
    X_test = DF.loc[test_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
    y_test = DF.loc[test_index]['BBL_LICENSE_FACT_ID']
        
    ## Fit model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    ## Generate predictions
    predicted = clf.predict(X_test)
    
    ## Compare to actual outcomes and return precision
    print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))


TRAIN: [    1     2     3 ..., 53152 53153 53154] TEST: [    0    17    18 ..., 53130 53142 53147]
TRAIN: [    0     2     4 ..., 53152 53153 53154] TEST: [    1     3    14 ..., 53140 53150 53151]
TRAIN: [    0     1     2 ..., 53152 53153 53154] TEST: [    6    33    62 ..., 53141 53144 53149]
TRAIN: [    0     1     2 ..., 53151 53152 53153] TEST: [   15    31    37 ..., 53112 53120 53154]
TRAIN: [    0     1     2 ..., 53152 53153 53154] TEST: [    4     7     9 ..., 53077 53132 53146]
TRAIN: [    0     1     2 ..., 53152 53153 53154] TEST: [   29    35    40 ..., 53121 53122 53125]
TRAIN: [    0     1     2 ..., 53151 53152 53154] TEST: [    8    12    23 ..., 53088 53101 53153]
TRAIN: [    0     1     3 ..., 53151 53153 53154] TEST: [    2     5    20 ..., 53143 53145 53152]
TRAIN: [    0     1     2 ..., 53152 53153 53154] TEST: [   11    13    21 ..., 53090 53094 53103]
TRAIN: [    0     1     2 ..., 53152 53153 53154] TEST: [   10    19    43 ..., 53128 53131 53148]
TRAIN: 47839 TEST: 5316
TRAIN: 47839 TEST: 5316
TRAIN: 47839 TEST: 5316
TRAIN: 47839 TEST: 5316
TRAIN: 47839 TEST: 5316
TRAIN: 47840 TEST: 5315
TRAIN: 47840 TEST: 5315
TRAIN: 47840 TEST: 5315
TRAIN: 47840 TEST: 5315
TRAIN: 47840 TEST: 5315
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-145-a51de7a1b7be> in <module>()
     24     ## Fit model
     25     clf = LogisticRegression()
---> 26     clf.fit(X_train, y_train)
     27 
     28     ## Generate predictions

/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
   1171 
   1172         X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64,
-> 1173                          order="C")
   1174         check_classification_targets(y)
   1175         self.classes_ = np.unique(y)

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
    522     if multi_output:
    523         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float: 'HENRY H STRONG'

In [ ]: