The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to a new point and predict its label from them. The number of samples can be a user-defined constant (k-nearest neighbor learning) or can vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data.
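To make the two flavors concrete, here is a minimal sketch (my illustration, not part of the original analysis) contrasting scikit-learn's KNeighborsClassifier (fixed k) with RadiusNeighborsClassifier (fixed radius) on a toy dataset:

# Sketch: fixed-k vs. radius-based neighbor voting on four 1-D points.
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

X = [[0], [1], [2], [3]]   # training points
y = [0, 0, 1, 1]           # training labels

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)    # vote among the 3 nearest
rnn = RadiusNeighborsClassifier(radius=1.5).fit(X, y)  # vote among all points within 1.5

print(knn.predict([[1.1]]))  # [0]
print(rnn.predict([[1.1]]))  # [0]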
It's a beautiful day in this neighborhood, A beautiful day for a neighbor. Would you be mine? Could you be mine?
~ Mr. Rogers
Readings:
In [2]:
import pandas
import numpy
from sklearn import neighbors
from sklearn.neighbors import DistanceMetric
from pprint import pprint
MY_TITANIC_TRAIN = 'train_titanic.csv'
MY_TITANIC_TEST = 'test_titanic.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)
print('length: {0} '.format(len(titanic_dataframe)))
titanic_dataframe.head(5)
Out[2]: (first five rows of titanic_dataframe)
In [3]:
titanic_dataframe.describe()
Out[3]: (summary statistics for the numeric columns)
In [4]:
titanic_dataframe.info()
In [5]:
def fix_na(table):
    """Perform necessary in-place modifications."""
    # Impute missing numeric values with each column's median.
    for numeric in [table[col] for col in ["Pclass", "Age", "SibSp", "Parch", "Fare"]]:
        numeric.fillna(numeric.median(), inplace=True)
    # Any remaining missing values (all non-numeric) get a sentinel string.
    table.fillna("unknown", inplace=True)
    # Encode the categorical columns as integers so distances can be computed.
    table['Port'] = table['Embarked'].map({'C': 1, 'S': 2, 'Q': 3, 'unknown': 0}).astype(int)
    table['Gender'] = table['Sex'].map({'female': 0, 'male': 1, 'unknown': 2}).astype(int)
    # Drop the raw columns we will not feed to the model.
    table.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

fix_na(titanic_dataframe)
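A quick sanity check (my addition, not in the original notebook) that the cleaning left no missing values and only model-ready columns behind:

# Sanity check (assumption: run immediately after fix_na).
assert not titanic_dataframe.isnull().values.any()
print(titanic_dataframe.dtypes)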
In [33]:
cols = titanic_dataframe.columns.tolist()
# train_target is just the Survived column
train_target = titanic_dataframe[cols[1]]
# train_data is all columns not including ID and Survived
train_data = titanic_dataframe[cols[2:]]
pprint('column_list: {0}'.format(cols))
In [22]:
test_data = pandas.read_csv(MY_TITANIC_TEST)
fix_na(test_data)
passenger_ids = test_data.PassengerId.values
test_data.drop('PassengerId', axis=1, inplace=True)
In [36]:
from sklearn import neighbors
model = neighbors.KNeighborsClassifier()
model.fit(train_data.values, train_target.values)
output = model.predict(test_data)
result = numpy.c_[passenger_ids.astype(int), output.astype(int)]
df_result = pandas.DataFrame(result, columns=['PassengerId', 'Survived'])
df_result.to_csv('titanic.csv', index=False)
                         Predicted condition positive         Predicted condition negative
True condition positive  True positive (TP)                   False negative (FN, Type II error)
True condition negative  False positive (FP, Type I error)    True negative (TN)

Derived metrics:
Prevalence = Σ condition positive / Σ total population
True positive rate (TPR), Sensitivity, Recall = Σ TP / Σ condition positive
False negative rate (FNR), Miss rate = Σ FN / Σ condition positive
False positive rate (FPR), Fall-out = Σ FP / Σ condition negative
True negative rate (TNR), Specificity (SPC) = Σ TN / Σ condition negative
Accuracy (ACC) = (Σ TP + Σ TN) / Σ total population
Positive predictive value (PPV), Precision = Σ TP / Σ test outcome positive
False discovery rate (FDR) = Σ FP / Σ test outcome positive
Negative predictive value (NPV) = Σ TN / Σ test outcome negative
False omission rate (FOR) = Σ FN / Σ test outcome negative
Positive likelihood ratio (LR+) = TPR / FPR
Negative likelihood ratio (LR−) = FNR / TNR
Diagnostic odds ratio (DOR) = LR+ / LR−
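To see these metrics on actual predictions, here is a sketch (my addition; it assumes holding out part of the training set rather than submitting to Kaggle) that scores the classifier with scikit-learn's metric helpers:

# Sketch: evaluate on a held-out 20% of the training data.
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

X_train, X_val, y_train, y_val = train_test_split(
    train_data.values, train_target.values, test_size=0.2, random_state=0)
clf = neighbors.KNeighborsClassifier().fit(X_train, y_train)
predicted = clf.predict(X_val)
print(confusion_matrix(y_val, predicted))  # rows: true condition, columns: predicted
print('accuracy: {0:.3f}'.format(accuracy_score(y_val, predicted)))
print('precision: {0:.3f}'.format(precision_score(y_val, predicted)))
print('recall: {0:.3f}'.format(recall_score(y_val, predicted)))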
In [25]:
factors = ['Pclass', 'Port', 'Gender']
hamming = DistanceMetric.get_metric('hamming')
print(dir(hamming))

def hamming_distance(input_row, training_row):
    """Count the positions at which the two rows disagree."""
    distance = [
        int(training_row[i] != input_row[i])
        for i in range(len(input_row))
    ]
    return sum(distance)
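For example (my illustration, using hypothetical [Pclass, Port, Gender] vectors), two passengers who share a class but differ in port and gender are a distance of 2 apart:

print(hamming_distance([3, 1, 0], [3, 2, 1]))  # 2: Port and Gender differ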
In [9]:
import operator

numericals = ['Age', 'SibSp', 'Parch', 'Fare']
euclidean = DistanceMetric.get_metric('euclidean')

def knn(row, train, k):
    """Label `row` by majority vote of its k nearest rows in `train`.

    `train` is expected to still carry its 'Survived' column.
    """
    distances = []
    for _, train_row in train.iterrows():
        # Mixed metric: Hamming over the categorical factors plus
        # Euclidean over the numeric columns.
        ham = hamming_distance(row[factors].values, train_row[factors].values)
        euc = euclidean.pairwise([row[numericals].values.astype(float)],
                                 [train_row[numericals].values.astype(float)])[0][0]
        distances.append((ham + euc, train_row['Survived']))
    # Keep the labels of the k closest neighbors.
    distances.sort(key=operator.itemgetter(0))
    votes = [label for _, label in distances[:k]]
    # Predict survival when more than half of the neighbors survived.
    return 1 if sum(votes) > k // 2 else 0
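Applying it row by row (my sketch; k=5 is an arbitrary choice, and the full training frame is passed so 'Survived' is available for voting):

# Sketch: classify every test passenger with the hand-rolled knn (k=5 assumed).
predictions = [knn(row, titanic_dataframe, 5) for _, row in test_data.iterrows()]
print(predictions[:10])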
Tuning K
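A straightforward way to tune k is cross-validation over a grid of candidate values (a sketch; the candidate range is my assumption, and odd values avoid tied votes):

# Sketch: 5-fold cross-validation over candidate values of k.
from sklearn.model_selection import cross_val_score

for k in [1, 3, 5, 7, 9, 11]:
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, train_data.values, train_target.values, cv=5)
    print('k={0}: mean accuracy {1:.3f}'.format(k, scores.mean()))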