K-Nearest Neighbors

Motivation

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data.

~scikit-learn

It's a beautiful day in this neighborhood, A beautiful day for a neighbor. Would you be mine? Could you be mine?

~ Mr. Rogers
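
Before touching the data, here is a minimal sketch of the two flavors described above: a fixed neighbor count versus a fixed radius. The toy points and labels are our own illustration, not Titanic data.

from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Four one-dimensional points with binary labels; purely illustrative.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)    # fixed number of neighbors
rnn = RadiusNeighborsClassifier(radius=1.5).fit(X, y)  # all neighbors within a radius
print(knn.predict([[1.1]]), rnn.predict([[1.1]]))      # both vote among nearby labels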

Data


In [2]:
import pandas
import numpy
from sklearn import neighbors
from sklearn.neighbors import DistanceMetric 
from pprint import pprint

MY_TITANIC_TRAIN = 'train_titanic.csv'
MY_TITANIC_TEST = 'test_titanic.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)
print('length: {0} '.format(len(titanic_dataframe)))
titanic_dataframe.head(5)


length: 891 
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [3]:
titanic_dataframe.describe()


Out[3]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

In [4]:
titanic_dataframe.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

Normalize and Fill


In [5]:
def fix_na(table):
    """Perform necessary in-place modifications."""
    # Fill numeric NaNs with each column's median.
    for col in ["Pclass", "Age", "SibSp", "Parch", "Fare"]:
        table[col] = table[col].fillna(table[col].median())
    # Everything still missing is categorical; mark it explicitly.
    table.fillna("unknown", inplace=True)
    # Integer-code the categorical columns so they can feed distance computations.
    table['Port'] = table['Embarked'].map({'C': 1, 'S': 2, 'Q': 3, 'unknown': 0}).astype(int)
    table['Gender'] = table['Sex'].map({'female': 0, 'male': 1, 'unknown': 2}).astype(int)
    table.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

fix_na(titanic_dataframe)
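
A quick sanity check (our addition, not part of the original run) that the cleaning worked:

# No NaNs should survive fix_na, and the coded columns should be integers.
assert not titanic_dataframe.isnull().values.any()
print(titanic_dataframe[['Port', 'Gender']].dtypes)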

In [33]:
cols = titanic_dataframe.columns.tolist()

# train_target is just the Survived column
train_target = titanic_dataframe[cols[1]]

# train_data is all columns not including ID and Survived
train_data = titanic_dataframe[cols[2: ]]

pprint('column_list: {0}'.format(cols))


("column_list: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', "
 "'Fare', 'Port', 'Gender']")

In [22]:
test_data = pandas.read_csv(MY_TITANIC_TEST)
fix_na(test_data)

passenger_ids = test_data.PassengerId.values

test_data.drop('PassengerId', axis=1, inplace=True)

Evaluation

scikit-learn


In [36]:
from sklearn import neighbors

model = neighbors.KNeighborsClassifier()
model.fit(train_data.values, train_target.values)

output = model.predict(test_data)
result = numpy.c_[passenger_ids.astype(int), output.astype(int)]

df_result = pandas.DataFrame(result, columns=['PassengerId', 'Survived'])

df_result.to_csv('titanic.csv', index=False)

63% accuracy.

Kaggle leaderboard placement: 3734th

Not surprising; we were not expecting much on the first attempt.

Possible variables to modify (a tuning sketch follows the list):

  • null replacement values (mean vs median)
  • k neighbors to compare against
  • different algorithms
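
A hedged sketch for the last two bullets, assuming scikit-learn >= 0.18 for model_selection; the parameter grid is an illustrative guess, not a tuned one.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search over neighbor counts, vote weighting, and index structure,
# scored by 5-fold cross-validated accuracy on the training set.
param_grid = {
    'n_neighbors': [3, 5, 7, 11, 15],
    'weights': ['uniform', 'distance'],
    'algorithm': ['ball_tree', 'kd_tree', 'brute'],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(train_data.values, train_target.values)
print(search.best_params_, search.best_score_)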
The standard confusion-matrix vocabulary for a binary classifier (a false negative is a Type II error, a false positive a Type I error):

  • Prevalence = Σ condition positive / Σ total population
  • True positive rate (TPR), sensitivity, recall = Σ true positive / Σ condition positive
  • False negative rate (FNR), miss rate = Σ false negative / Σ condition positive
  • False positive rate (FPR), fall-out = Σ false positive / Σ condition negative
  • True negative rate (TNR), specificity (SPC) = Σ true negative / Σ condition negative
  • Accuracy (ACC) = (Σ true positive + Σ true negative) / Σ total population
  • Positive predictive value (PPV), precision = Σ true positive / Σ test outcome positive
  • False discovery rate (FDR) = Σ false positive / Σ test outcome positive
  • Negative predictive value (NPV) = Σ true negative / Σ test outcome negative
  • False omission rate (FOR) = Σ false negative / Σ test outcome negative
  • Positive likelihood ratio (LR+) = TPR / FPR
  • Negative likelihood ratio (LR−) = FNR / TNR
  • Diagnostic odds ratio (DOR) = LR+ / LR−
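
To see where those numbers come from in practice, here is a sketch (our addition) that holds out 20% of the training data and derives a few of the rates above from the resulting confusion matrix:

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X_train, X_val, y_train, y_val = train_test_split(
    train_data.values, train_target.values, test_size=0.2, random_state=0)
clf = neighbors.KNeighborsClassifier().fit(X_train, y_train)

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_val, clf.predict(X_val)).ravel()
print('TPR (recall):', tp / (tp + fn))
print('FPR (fall-out):', fp / (fp + tn))
print('PPV (precision):', tp / (tp + fp))
print('ACC:', (tp + tn) / (tp + fp + fn + tn))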

Representation

Distance

Factors/Categorical Data

def hamming_distance(row, train_row):
    # The factor columns are integer-coded by fix_na, so the Hamming
    # distance is simply the number of coordinates that disagree.
    distance = [
        0 if train_row[i] == row[i] else 1
        for i in range(len(row))
    ]
    return sum(distance)

In [25]:
factors = [ 'Pclass', 'Port', 'Gender']
hamming = DistanceMetric.get_metric('hamming')
print(dir(hamming))

def hamming_distance(row, train_row):
    # Count the coded categorical coordinates on which the rows differ.
    distance = [
        int(train_row[i] != row[i])
        for i in range(len(row))
    ]
    return sum(distance)


['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'dist_to_rdist', 'get_metric', 'pairwise', 'rdist_to_dist']
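
The prebuilt metric object can do the same thing in bulk; a small illustration (our addition) over the first few rows' factor columns. Note that scikit-learn's 'hamming' returns the fraction of differing coordinates rather than the raw count.

sample = titanic_dataframe[factors].values[:5]
print(hamming.pairwise(sample))   # 5x5 matrix of pairwise mismatch fractions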

Euclidean

$\sqrt{\sum_i{(q_i - p_i)^2}}$

import numbers
from math import sqrt

def euclidean_distance(row, train_row):
    # Sum squared differences over coordinates that are numeric in both rows.
    distance = [
        (row[i] - train_row[i]) ** 2
        for i in range(len(row))
        if isinstance(row[i], numbers.Number) and isinstance(train_row[i], numbers.Number)
    ]
    return sqrt(sum(distance))

In [9]:
numericals = ['Age', 'SibSp', 'Parch', 'Fare']
euclidian = DistanceMetric.get_metric('euclidean')
euclidish = []
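
One caveat worth flagging (our note, not the notebook's): Fare runs up to 512 while SibSp and Parch are single digits, so raw Euclidean distance is dominated by Fare. Standardizing first puts the numeric columns on a comparable scale:

from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(titanic_dataframe[numericals].values)
print(euclidian.pairwise(scaled[:5]))   # pairwise distances on standardized columns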

KNN

import operator

def knn(row, train, k):
    # Mixed distance: Hamming over the coded factors plus Euclidean over
    # the numeric columns; predict by majority vote of the k nearest.
    distances = []
    for train_row in train:
        ham = hamming_distance(row[factors].values, train_row[factors].values)
        euc = euclidean_distance(row[numericals].values, train_row[numericals].values)
        distances.append((ham + euc, train_row['Survived']))
    distances.sort(key=operator.itemgetter(0))
    votes = sum(label for _, label in distances[:k])
    # Passenger ids were saved separately, so only the label is returned.
    return 1 if votes > k // 2 else 0
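
A possible driver for the hand-rolled version, sketched under the assumptions above (the output filename is hypothetical, and we have not scored this against the leaderboard):

# Pair each prediction with the PassengerId saved before the column was dropped.
train_rows = [r for _, r in titanic_dataframe.iterrows()]
predictions = [knn(row, train_rows, k=5) for _, row in test_data.iterrows()]
df_manual = pandas.DataFrame(
    {'PassengerId': passenger_ids, 'Survived': predictions},
    columns=['PassengerId', 'Survived'])
df_manual.to_csv('titanic_manual.csv', index=False)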

Optimization