Clustering

  • Author: Fang Zhang
  • Date: 2016.1.5
  • E-mail: fza34@sfu.ca

Generate ground truth

Based on the event list generated by the simulation, patients can be put into clusters.

Five events in the simulation

  1. unresi transmission
  2. resi transmission
  3. substitution
  4. acquisition
  5. removal

Build the clusters forward in time. Maintain three structures: a global list of cluster sets, a dictionary that records the index of each patient's cluster in that list, and a set holding the indices of the resistant clusters in the global list.

For events 1, 2 and 3:

  1. First check whether the source patient is in the clusters.
  2. Then check whether the recipient patient is in the clusters.
  3. Move the recipient patient into the source patient's cluster.

For event 4:

  1. If the patient is in the clusters and its cluster is not in the resistant set, add a new resistant cluster and put the patient into it.
  2. If the patient is not in clusters, add a new resistant cluster and put the patient into it.

For event 5:

  1. If the patient is in the clusters, remove it.
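The event handling above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the event encoding `(event_type, source, recipient)`, the function name `build_ground_truth` and the helper names are hypothetical, and the sketch assumes that a source patient seen for the first time starts its own cluster and that a patient leaving a cluster is simply dropped from that cluster's set.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical event encoding: (event_type, source, recipient).
# Event types follow the numbered list of five events; events 4 and 5
# only use `source`, so `recipient` is None for them.
Event = Tuple[int, str, Optional[str]]


def build_ground_truth(events: List[Event]):
    clusters: List[Set[str]] = []    # global cluster set list
    cluster_of: Dict[str, int] = {}  # patient -> index into `clusters`
    resistant: Set[int] = set()      # indices of resistant clusters

    def new_cluster(patient: str, is_resistant: bool = False) -> None:
        clusters.append({patient})
        cluster_of[patient] = len(clusters) - 1
        if is_resistant:
            resistant.add(len(clusters) - 1)

    def detach(patient: str) -> None:
        # Drop the patient from its current cluster, if it has one.
        # Emptied clusters stay in the list so that indices remain stable.
        idx = cluster_of.pop(patient, None)
        if idx is not None:
            clusters[idx].discard(patient)

    for kind, source, recipient in events:
        if kind in (1, 2, 3):
            # Assumption: an unseen source patient starts its own cluster.
            if source not in cluster_of:
                new_cluster(source)
            detach(recipient)
            idx = cluster_of[source]
            clusters[idx].add(recipient)
            cluster_of[recipient] = idx
        elif kind == 4:
            idx = cluster_of.get(source)
            if idx is None or idx not in resistant:
                detach(source)
                new_cluster(source, is_resistant=True)
        elif kind == 5:
            detach(source)

    return clusters, cluster_of, resistant
```

For instance, the events `[(1, "A", "B"), (4, "B", None), (2, "B", "C"), (5, "A", None)]` leave one resistant cluster containing B and C, while A's original cluster ends up empty.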

Clustering

In reality, we can obtain the sequences of patients and cluster the patients using the information in those sequences. Two properties of the ground truth guide this clustering:

  1. From its sequence, we can tell whether a patient is resistant or not, based on the resistance mutations given by the resi.vcf file. This means that resistant and unresistant patients are never in the same cluster.

  2. In our ground truth clusters, every cluster is the set of nodes of one transmission tree. This means that the sequence distances between nodes in the same cluster are relatively small. Given a sequence, we can find the most similar sequences based on the distance profile and put them into one cluster.

Based on the simulation result, all unremoved patients can easily be split into resistant and unresistant patients.

  1. For resistant patients, a distance profile is calculated. When the two patients of a pair share resistance-related SNPs, the distance between them is the Hamming distance between their sequences. Otherwise, their distance is infinite. Given a distance threshold, patients can then be put into clusters.

  2. For unresistant patients, patients can be put into clusters directly, given a distance threshold.
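The distance rule for resistant patients can be illustrated with a short sketch. It assumes that sequences are aligned strings of equal length and that each patient comes with a set of resistance-related SNP positions; the rule "have resistance related SNPs" is read here as "the two sets overlap", which is an interpretation rather than something the text states. The function names are hypothetical.

```python
import math
from typing import Dict, Set


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def resistant_distance(seq_a: str, seq_b: str,
                       snps_a: Set[int], snps_b: Set[int]) -> float:
    # Assumption: a pair is comparable only if it shares at least one
    # resistance-related SNP position; otherwise the distance is infinite.
    if snps_a & snps_b:
        return float(hamming(seq_a, seq_b))
    return math.inf


def distance_profile(seqs: Dict[str, str],
                     snps: Dict[str, Set[int]]) -> Dict[str, Dict[str, float]]:
    """Pairwise distances between all resistant patients."""
    ids = sorted(seqs)
    return {
        a: {b: resistant_distance(seqs[a], seqs[b], snps[a], snps[b])
            for b in ids if b != a}
        for a in ids
    }
```

With an infinite distance, no finite threshold can ever place two such patients in the same cluster, which is the intended effect of the rule.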

Clustering Performance Evaluation

In the above picture, the symbols X, O and spade each represent a true class of documents. The documents are grouped into 3 clusters.

  • TP (True Positive): 2 documents that are clustered into the same cluster and actually belong to the same cluster.
  • FP (False Positive): 2 documents that are clustered into the same cluster but actually belong to two different clusters.
  • TN (True Negative): 2 documents that are clustered into different clusters and actually belong to two different clusters.
  • FN (False Negative): 2 documents that are clustered into different clusters but actually belong to the same cluster.

We have:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 = 2 × Precision × Recall / (Precision + Recall)

In the above picture,

  • TP+FP = C(6,2) + C(6,2) + C(5,2) = 40
  • TP = C(5,2)+C(4,2)+C(3,2) + C(2,2) = 20
  • FP = 40-20 = 20
  • FN = 1×4 + 1×3 + 5×1 + 2×1 + 5×2 = 24
  • TN = C(17,2) - TP - FP - FN = 72

Therefore,

  • Precision = 20/40 = 0.5
  • Recall = 20/44 = 0.455
  • F1 = 0.476
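The pair-counting evaluation can be checked with a few lines of Python. The sketch below counts TP, FP, TN and FN over all pairs of documents. The cluster composition used in the example (5 X and 1 O, then 1 X, 4 O and 1 spade, then 2 X and 3 spades) is an assumption: it is one composition consistent with the counts in the worked example, since the picture itself is not reproduced here. The function names are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence, Tuple


def pair_counts(truth: Sequence[Hashable],
                predicted: Sequence[Hashable]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) counted over all unordered pairs of items."""
    tp = fp = tn = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    # Assumes at least one predicted pair and one true pair,
    # otherwise the divisions below are undefined.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed composition of the three clusters (X, O, S = spade).
truth = list("XXXXXO" "XOOOOS" "XXSSS")
predicted = [1] * 6 + [2] * 6 + [3] * 5

tp, fp, tn, fn = pair_counts(truth, predicted)
print(tp, fp, tn, fn)  # 20 20 72 24
print(precision_recall_f1(tp, fp, fn))
```

The loop over all pairs is quadratic in the number of items, which is fine for an illustration but may be slow for thousands of patients.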

Experiments

I ran the clustering several times on simulations of 12 to 14 generations. Given an appropriate distance threshold, the precision, recall and F1 measured against the ground truth can be close to 1. A small distance threshold leads to low recall, and a large distance threshold leads to low precision.

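Generations  Patient type  Patients  Distance threshold  Recall  Precision  F1
12           unresistant   1041      40                  0.81    1          0.9
12           unresistant   1041      50                  0.965   1          0.982
12           unresistant   1041      65                  0.999   1          1
12           unresistant   1041      75                  1       1          1
12           unresistant   1041      85                  1       1          1
12           unresistant   1041      100                 1       0.996      0.998
12           unresistant   1041      110                 0.9997  0.995      0.997
12           resistant     149       20                  0.917   1          0.957
12           resistant     149       30                  0.917   1          0.957
12           resistant     149       40                  0.917   1          0.957
12           resistant     149       50                  0.917   1          0.957
12           resistant     149       60                  0.917   1          0.957
12           resistant     149       70                  0.917   1          0.957
12           resistant     149       80                  0.917   1          0.957
12           unresistant   1034      40                  0.924   1          0.96
12           unresistant   1034      50                  0.954   1          0.976
12           unresistant   1034      65                  0.992   1          0.996
12           unresistant   1034      75                  1       1          1
12           unresistant   1034      85                  1       1          1
12           unresistant   1034      100                 1       1          1
12           unresistant   1034      110                 0.988   0.981      0.985
12           resistant     141       20                  1       1          1
12           resistant     141       30                  1       1          1
12           resistant     141       40                  1       1          1
12           resistant     141       50                  1       1          1
12           resistant     141       60                  1       1          1
12           resistant     141       70                  1       1          1
12           resistant     141       80                  1       1          1
13           unresistant   2098      40                  0.8     1          0.889
13           unresistant   2098      50                  0.913   1          0.954
13           unresistant   2098      65                  0.987   1          0.994
13           unresistant   2098      75                  0.994   1          0.997
13           unresistant   2098      85                  1       0.999      1
13           unresistant   2098      100                 0.998   0.994      0.996
13           unresistant   2098      110                 0.999   0.97       0.985
13           resistant     281       20                  0.81    1          0.894
13           resistant     281       30                  0.957   1          0.978
13           resistant     281       40                  1       1          1
13           resistant     281       50                  1       1          1
13           resistant     281       60                  1       1          1
13           resistant     281       70                  1       1          1
13           resistant     281       80                  1       1          1
13           unresistant   2113      40                  0.89    1          0.942
13           unresistant   2113      50                  0.96    1          0.982
13           unresistant   2113      65                  0.9995  1          0.9997
13           unresistant   2113      75                  0.9997  1          0.9998
13           unresistant   2113      85                  1       1          1
13           unresistant   2113      100                 0.9995  0.9993     0.9996
13           unresistant   2113      110                 0.998   0.981      0.99
13           resistant     293       20                  0.732   1          0.845
13           resistant     293       30                  0.83    1          0.907
13           resistant     293       40                  0.83    1          0.907
13           resistant     293       50                  0.902   1          0.95
13           resistant     293       60                  0.902   1          0.95
13           resistant     293       70                  0.902   1          0.95
13           resistant     293       80                  0.902   1          0.95
14           unresistant   4213      40                  0.82    1          0.9
14           unresistant   4213      50                  0.927   1          0.962
14           unresistant   4213      65                  0.984   1          0.992
14           unresistant   4213      75                  0.994   1          0.997
14           unresistant   4213      85                  0.999   0.999      0.999
14           unresistant   4213      100                 0.999   0.996      0.998
14           unresistant   4213      110                 1       0.988      0.944
14           resistant     560       20                  0.899   1          0.947
14           resistant     560       30                  0.937   1          0.967
14           resistant     560       40                  0.949   1          0.974
14           resistant     560       50                  0.949   1          0.974
14           resistant     560       60                  0.949   1          0.974
14           resistant     560       70                  0.949   1          0.974
14           resistant     560       80                  0.949   1          0.974
14           unresistant   4173      40                  0.817   1          0.899
14           unresistant   4173      50                  0.906   1          0.951
14           unresistant   4173      65                  0.973   1          0.986
14           unresistant   4173      75                  0.989   1          0.995
14           unresistant   4173      85                  0.995   1          0.998
14           unresistant   4173      100                 0.999   0.995      0.997
14           unresistant   4173      110                 0.998   0.989      0.994
14           resistant     545       20                  0.965   1          0.982
14           resistant     545       30                  0.976   1          0.998
14           resistant     545       40                  0.976   1          0.998
14           resistant     545       50                  0.976   1          0.998
14           resistant     545       60                  0.976   1          0.998
14           resistant     545       70                  0.976   1          0.998
14           resistant     545       80                  0.976   1          0.998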

Why does Recall decline?

In my clustering algorithm, every patient whose distance to a given patient is lower than the distance threshold is put into that patient's cluster. The convergence condition is that, for every patient in a cluster, its closest patient is in the same cluster.

For example, take patients A, B, C and D, where B, C and D actually belong to the same cluster.
If the distance threshold is 100, B, C and D are clustered into one cluster. However, if the distance threshold increases to 200, A, B and C end up in the same cluster, so D is separated from B and C and recall drops.

Shuffle the order of input patients.

This algorithm is sensitive to the order of patients, so in this experiment the order of patients was shuffled 50 times and the average precision, recall and F1 were calculated.

  • The experiment was based on a 13-generation simulation which generated 2,609 unremoved patients, including 2,322 unresistant patients and 287 resistant patients.
  • For unresistant patients, the distance threshold was set to [30, 60, 90, 120] and the order was shuffled 50 times.
  • For resistant patients, the distance threshold was set to [10, 30, 50, 70] and the order was shuffled 50 times.
  • From the results, it is clear that for unresistant patients the clustering algorithm performs best when the distance threshold is around 90, with mean precision 0.993, mean recall 0.997 and mean F1 0.995.
  • For resistant patients, the clustering algorithm performs best when the distance threshold is larger than 30, with mean precision 1, mean recall 1 and mean F1 1.
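The shuffling procedure can be sketched as a small harness. Here `cluster_fn` and `score_fn` are hypothetical placeholders for the clustering method and for an evaluation measure such as F1; neither is the actual implementation used in this experiment. The summary uses the sample standard deviation, which is consistent with the reported numbers: for example, 15 runs with 260 clusters and 35 runs with 261 clusters give a sample standard deviation of about 0.4629.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation (needs at least two values)."""
    return {"mean": statistics.mean(values), "std": statistics.stdev(values)}


def shuffled_runs(patients: Sequence[str],
                  cluster_fn: Callable[[List[str]], object],
                  score_fn: Callable[[object], float],
                  runs: int = 50,
                  seed: int = 0) -> Dict[str, float]:
    """Cluster the patients `runs` times in random order and summarize the score."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    scores = []
    for _ in range(runs):
        order = list(patients)
        rng.shuffle(order)
        scores.append(score_fn(cluster_fn(order)))
    return summarize(scores)


# Check against the cluster numbers reported for resistant patients
# at distance threshold 10: 15 runs with 260 clusters, 35 with 261.
print(summarize([260] * 15 + [261] * 35))
```

Note that `statistics.stdev` raises an error when given a single value, because a sample standard deviation is undefined for one run.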

  • Resistant patients

Distance_Threshold:10

precision: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
precision mean: 1.0
precision std: 0.0

recall: 0.595744680851 0.595744680851 0.595744680851 0.63829787234 0.595744680851 0.595744680851 0.63829787234 0.595744680851 0.595744680851 0.595744680851 0.595744680851 0.595744680851 0.595744680851 0.63829787234 0.63829787234 0.63829787234 0.63829787234 0.595744680851 0.595744680851 0.595744680851 0.595744680851 0.595744680851 0.595744680851 0.63829787234 0.63829787234 0.595744680851 0.595744680851 0.63829787234 0.595744680851 0.595744680851 0.595744680851 0.595744680851 0.63829787234 0.595744680851 0.63829787234 0.595744680851 0.595744680851 0.63829787234 0.595744680851 0.63829787234 0.595744680851 0.595744680851 0.595744680851 0.63829787234 0.595744680851 0.63829787234 0.595744680851 0.595744680851 0.595744680851 0.595744680851
recall mean: 0.608510638298
recall std: 0.0196982999952

F1: 0.746666666667 0.746666666667 0.746666666667 0.779220779221 0.746666666667 0.746666666667 0.779220779221 0.746666666667 0.746666666667 0.746666666667 0.746666666667 0.746666666667 0.746666666667 0.779220779221 0.779220779221 0.779220779221 0.779220779221 0.746666666667 0.746666666667 0.746666666667 0.746666666667 0.746666666667 0.746666666667 0.779220779221 0.779220779221 0.746666666667 0.746666666667 0.779220779221 0.746666666667 0.746666666667 0.746666666667 0.746666666667 0.779220779221 0.746666666667 0.779220779221 0.746666666667 0.746666666667 0.779220779221 0.746666666667 0.779220779221 0.746666666667 0.746666666667 0.746666666667 0.779220779221 0.746666666667 0.779220779221 0.746666666667 0.746666666667 0.746666666667 0.746666666667
F1 mean: 0.756432900433
F1 std: 0.0150696258664

cluster number: 261 261 261 260 261 261 260 261 261 261 261 261 261 260 260 260 260 261 261 261 261 261 261 260 260 261 261 260 261 261 261 261 260 261 260 261 261 260 261 260 261 261 261 260 261 260 261 261 261 261
cluster number mean: 260.7
cluster number std: 0.462910049886

Distance_Threshold:30

precision: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
precision mean: 1.0
precision std: 0.0

recall: 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255 0.978723404255
recall mean: 0.978723404255
recall std: 2.24298922669e-16

F1: 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828 0.989247311828
F1 mean: 0.989247311828
F1 std: 5.60747306673e-16

cluster number: 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252 252
cluster number mean: 252.0
cluster number std: 0.0

Distance_Threshold:50

precision: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
precision mean: 1.0
precision std: 0.0

recall: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
recall mean: 1.0
recall std: 0.0

F1: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
F1 mean: 1.0
F1 std: 0.0

cluster number: 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251
cluster number mean: 251.0
cluster number std: 0.0

Distance_Threshold:70

precision: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
precision mean: 1.0
precision std: 0.0

recall: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
recall mean: 1.0
recall std: 0.0

F1: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
F1 mean: 1.0
F1 std: 0.0

cluster number: 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251 251
cluster number mean: 251.0
cluster number std: 0.0

  • Unresistant patients

Distance_Threshold:30

precision: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
precision mean: 1.0
precision std: 0.0

recall: 0.683061079545 0.658025568182 0.692649147727 0.665660511364 0.679865056818 0.701171875 0.685901988636 0.6875 0.679154829545 0.67578125 0.678089488636 0.686967329545 0.682350852273 0.665838068182 0.684659090909 0.677911931818 0.691583806818 0.681818181818 0.669211647727 0.682883522727 0.680397727273 0.687855113636 0.673295454545 0.678799715909 0.679332386364 0.681640625 0.688920454545 0.678977272727 0.678622159091 0.697975852273 0.677024147727 0.683238636364 0.690340909091 0.679154829545 0.680575284091 0.692116477273 0.686789772727 0.674538352273 0.688565340909 0.695134943182 0.684659090909 0.671164772727 0.680397727273 0.6953125 0.687677556818 0.691228693182 0.682350852273 0.688920454545 0.702059659091 0.690518465909
recall mean: 0.683153409091
recall std: 0.00884838573748

F1: 0.81168899673 0.793745984151 0.818420224483 0.799275130583 0.809428178839 0.824339839265 0.813691416535 0.814814814815 0.808924606112 0.806526806527 0.808168447783 0.814440585202 0.811187335092 0.799403112343 0.81281618887 0.808042328042 0.817676078514 0.810810810811 0.801829592597 0.811563621017 0.809805579036 0.815064169998 0.804753820034 0.808672659968 0.809050539226 0.81068524971 0.815811606392 0.808798646362 0.808546646922 0.822126947611 0.807411328745 0.811814345992 0.816806722689 0.808924606112 0.809931325938 0.818048268625 0.814315789474 0.805640971265 0.81556256572 0.820152927621 0.81281618887 0.803229919252 0.809805579036 0.820276497696 0.814939505523 0.817427821522 0.811187335092 0.815811606392 0.824953056541 0.816930994643
F1 mean: 0.811721946406
F1 std: 0.00625717877204

cluster number: 691 701 693 700 692 687 691 693 696 695 692 692 694 700 694 693 694 692 697 693 699 694 700 700 696 698 695 698 698 689 695 695 691 692 698 689 689 697 693 688 694 700 696 688 690 689 693 689 685 690
cluster number mean: 693.76
cluster number std: 3.94120048012

Distance_Threshold:60

precision: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
precision mean: 1.0
precision std: 0.0

recall: 0.979580965909 0.972833806818 0.970703125 0.967862215909 0.979936079545 0.976029829545 0.986328125 0.955610795455 0.973721590909 0.983842329545 0.972123579545 0.974609375 0.971590909091 0.971768465909 0.976740056818 0.984552556818 0.976029829545 0.982066761364 0.969815340909 0.976384943182 0.978870738636 0.975142045455 0.967151988636 0.972833806818 0.965376420455 0.976384943182 0.978160511364 0.975319602273 0.971413352273 0.961825284091 0.985262784091 0.974076704545 0.973188920455 0.967862215909 0.980823863636 0.980113636364 0.974964488636 0.976384943182 0.969992897727 0.986150568182 0.979936079545 0.973721590909 0.977805397727 0.979048295455 0.967151988636 0.977095170455 0.982066761364 0.976029829545 0.968039772727 0.969105113636
recall mean: 0.974868607955
recall std: 0.00623172449967

F1: 0.989685173558 0.986229862299 0.985133795837 0.983668681765 0.989866379697 0.987869530057 0.993117010816 0.977301616125 0.986685858222 0.991855365614 0.985864769965 0.987141444115 0.985590778098 0.985682125169 0.988233180634 0.992216158182 0.987869530057 0.990952252979 0.984676401659 0.988051388015 0.989322566173 0.987414599065 0.983301742034 0.986229862299 0.982383232451 0.988051388015 0.988959698411 0.987505617978 0.985499414573 0.98054122545 0.992576692604 0.986868141752 0.986412309907 0.983668681765 0.990319110792 0.989956958393 0.987323563787 0.988051388015 0.984767913475 0.993026998033 0.989866379697 0.986685858222 0.988778166801 0.989413242419 0.983301742034 0.988414907948 0.990952252979 0.987869530057 0.983760375316 0.98431018936
F1 mean: 0.987264501053
F1 std: 0.00320042147178

cluster number: 538 540 540 540 537 539 536 545 541 538 540 541 542 540 540 537 540 538 541 540 539 542 540 540 542 539 538 540 540 543 538 541 539 543 540 539 541 540 541 538 538 541 539 540 540 539 539 539 542 541
cluster number mean: 539.88
cluster number std: 1.67380783039

Distance_Threshold:90

precision: 0.994336283186 0.994336283186 0.991531404375 0.991531404375 1.0 1.0 1.0 0.987582047188 1.0 0.982810033327 0.987582047188 0.998050682261 0.991531404375 0.988740323716 0.99434029006 0.988740323716 0.994336283186 0.991531404375 0.991531404375 0.98567132496 0.980501392758 0.994983876747 0.994336283186 0.994983876747 0.994336283186 0.988740323716 0.994983876747 0.994336283186 0.985658640227 0.988740323716 0.988740323716 0.988740323716 0.988740323716 0.991531404375 0.98961084698 0.991531404375 1.0 0.988740323716 0.994336283186 0.991531404375 1.0 1.0 1.0 0.991531404375 0.991531404375 1.0 0.988740323716 1.0 0.991531404375 0.99499195135
precision mean: 0.992752264608
precision std: 0.00488101356762

recall: 0.997514204545 0.997514204545 0.997869318182 0.997869318182 1.0 1.0 1.0 0.988458806818 1.0 0.994850852273 0.988458806818 1.0 0.997869318182 0.997869318182 0.998224431818 0.997869318182 0.997514204545 0.997869318182 0.997869318182 0.989346590909 1.0 0.986150568182 0.997514204545 0.986150568182 0.997514204545 0.997869318182 0.986150568182 0.997514204545 0.988458806818 0.997869318182 0.997869318182 0.997869318182 0.997869318182 0.997869318182 0.997869318182 0.997869318182 1.0 0.997869318182 0.997514204545 0.997869318182 1.0 1.0 1.0 0.997869318182 0.997869318182 1.0 0.997869318182 1.0 0.997869318182 0.987748579545
recall mean: 0.996637073864
recall std: 0.00414384975034

F1: 0.99592270874 0.99592270874 0.994690265487 0.994690265487 1.0 1.0 1.0 0.988020232496 1.0 0.988793788053 0.988020232496 0.999024390244 0.994690265487 0.993283845882 0.996278575226 0.993283845882 0.99592270874 0.994690265487 0.994690265487 0.987505538325 0.990154711674 0.990547529873 0.99592270874 0.990547529873 0.99592270874 0.993283845882 0.990547529873 0.99592270874 0.987056737589 0.993283845882 0.993283845882 0.993283845882 0.993283845882 0.994690265487 0.993722924587 0.994690265487 1.0 0.993283845882 0.99592270874 0.994690265487 1.0 1.0 1.0 0.994690265487 0.994690265487 1.0 0.993283845882 1.0 0.994690265487 0.991357034661
F1 mean: 0.994683664989
F1 std: 0.00364051726246

cluster number: 529 529 528 528 528 528 528 529 528 529 529 527 528 528 528 528 529 528 528 527 527 531 529 531 529 528 531 529 528 528 528 528 528 528 527 528 528 528 529 528 528 528 528 528 528 528 528 528 528 530
cluster number mean: 528.34
cluster number std: 0.894655332106

Distance_Threshold:120

precision: 0.953767993226 0.952170963365 0.95798605205 0.953350296862 0.955385920271 0.954576271186 0.958964753959 0.96080780421 0.949392712551 0.955385920271 0.957226629872 0.955385920271 0.9519133085 0.956477214542 0.955776482688 0.958964753959 0.948934731146 0.950531825089 0.955385920271 0.950531825089 0.95468483816 0.952981260647 0.955223880597 0.958964753959 0.951030057413 0.957366984993 0.955385920271 0.954576271186 0.959917780062 0.958311888832 0.956285082497 0.947332883187 0.95468483816 0.95468483816 0.956285082497 0.956285082497 0.951030057413 0.952089704383 0.954576271186 0.957366984993 0.953727506427 0.954970263381 0.951030057413 0.955385920271 0.957366984993 0.958964753959 0.953788651036 0.957334693184 0.959877070172 0.952461512434
precision mean: 0.954937783475
precision std: 0.00306086976452

recall: 1.0 0.996803977273 1.0 0.997869318182 1.0 1.0 1.0 0.996803977273 0.999289772727 1.0 0.985440340909 1.0 0.998224431818 0.995028409091 0.990056818182 1.0 0.996448863636 0.999644886364 1.0 0.999644886364 0.995028409091 0.993252840909 0.988636363636 1.0 1.0 0.996803977273 1.0 1.0 0.995028409091 0.991832386364 0.998224431818 0.996448863636 0.995028409091 0.995028409091 0.998224431818 0.998224431818 1.0 0.995028409091 1.0 0.996803977273 0.988103693182 0.997869318182 1.0 1.0 0.996803977273 1.0 0.996803977273 1.0 0.998224431818 0.999644886364
recall mean: 0.997325994318
recall std: 0.0034898043894

F1: 0.976337002687 0.973976405274 0.978542263921 0.975101934588 0.977184002776 0.976760319112 0.979052585832 0.978474945534 0.973702422145 0.977184002776 0.971128608924 0.977184002776 0.974518980759 0.975372030285 0.9726146869 0.979052585832 0.972111553785 0.974469926439 0.977184002776 0.974469926439 0.97443922796 0.97270039993 0.971642963092 0.979052585832 0.974900467371 0.976687543493 0.977184002776 0.976760319112 0.977157802964 0.974784050257 0.976804795413 0.971270335756 0.97443922796 0.97443922796 0.976804795413 0.976804795413 0.974900467371 0.97308560514 0.976760319112 0.976687543493 0.970611319438 0.975948597725 0.974900467371 0.977184002776 0.976687543493 0.979052585832 0.974822017711 0.978202344768 0.978675254591 0.975482976696
F1 mean: 0.975665915516
F1 std: 0.00222625183981

cluster number: 514 514 516 516 515 515 516 517 515 515 520 515 515 516 517 516 515 515 515 515 516 516 519 516 514 516 515 515 517 517 516 514 516 516 516 516 514 515 515 516 518 517 514 515 516 516 515 515 517 515
cluster number mean: 515.7
cluster number std: 1.21638474042

Comparison with DBSCAN

DBSCAN is a density-based clustering method.
DBSCAN has two main parameters, Eps and MinPts. Eps is the maximum radius of the neighbourhood. MinPts is the minimum number of points in the Eps-neighbourhood of a point. In our experiment, MinPts is 1.
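With MinPts = 1, and assuming the Eps-neighbourhood of a point counts the point itself, every point is a core point and no point is labelled as noise. DBSCAN then reduces to finding the connected components of the graph that links two patients whenever their distance is at most Eps. The sketch below illustrates this special case on a precomputed distance matrix; it is not a full DBSCAN implementation and not the code used for the experiment, and the function name is hypothetical.

```python
from typing import List, Sequence


def dbscan_minpts_one(dist: Sequence[Sequence[float]], eps: float) -> List[int]:
    """Cluster labels for DBSCAN with MinPts = 1 on a square distance matrix.

    Every point is a core point, so clusters are the connected components
    of the graph with an edge between i and j whenever dist[i][j] <= eps.
    """
    n = len(dist)
    labels = [-1] * n  # -1 means "not visited yet"
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Grow a new cluster from `start` with a depth-first search.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Four points on a line at positions 0, 50, 100 and 400.
positions = [0, 50, 100, 400]
dist = [[abs(a - b) for b in positions] for a in positions]
print(dbscan_minpts_one(dist, eps=60))  # [0, 0, 0, 1]
print(dbscan_minpts_one(dist, eps=30))  # [0, 1, 2, 3]
```

Because connected components do not depend on the order in which points are visited, the resulting partition is the same for any input order, although the label numbers may differ.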

In the experiment results below, the best performance of DBSCAN for unresistant patients is at Eps = 60, with precision, recall and F1 of 1, 0.995 and 0.998. This is a little higher than my clustering method. For resistant patients, DBSCAN achieved precision, recall and F1 of 1, 1 and 1, which is the same as my clustering method.

  • Unresistant patients

Eps:30

precision: 1.0
precision mean: 1.0
precision std: nan

recall: 0.779296875
recall mean: 0.779296875
recall std: nan

F1: 0.875960482986
F1 mean: 0.875960482986
F1 std: nan

Eps:60

precision: 1.0
precision mean: 1.0
precision std: nan

recall: 0.995383522727
recall mean: 0.995383522727
recall std: nan

F1: 0.997686421071
F1 mean: 0.997686421071
F1 std: nan

Eps:90

precision: 0.978627280626
precision mean: 0.978627280626
precision std: nan

recall: 1.0
recall mean: 1.0
recall std: nan

F1: 0.989198208483
F1 mean: 0.989198208483
F1 std: nan

Eps:120

precision: 0.942121110739
precision mean: 0.942121110739
precision std: nan

recall: 1.0
recall mean: 1.0
recall std: nan

F1: 0.970198105082
F1 mean: 0.970198105082
F1 std: nan

  • Resistant patients

Eps:10

precision: 1.0
precision mean: 1.0
precision std: nan

recall: 0.808510638298
recall mean: 0.808510638298
recall std: nan

F1: 0.894117647059
F1 mean: 0.894117647059
F1 std: nan

Eps:30

precision: 1.0
precision mean: 1.0
precision std: nan

recall: 0.978723404255
recall mean: 0.978723404255
recall std: nan

F1: 0.989247311828
F1 mean: 0.989247311828
F1 std: nan

Eps:50

precision: 1.0
precision mean: 1.0
precision std: nan

recall: 1.0
recall mean: 1.0
recall std: nan

F1: 1.0
F1 mean: 1.0
F1 std: nan

Eps:70

precision: 1.0
precision mean: 1.0
precision std: nan

recall: 1.0
recall mean: 1.0
recall std: nan

F1: 1.0
F1 mean: 1.0
F1 std: nan

