Classification of phishing and benign URLs

  • Loading dataset from CSV file
  • Data exploration with 2D and 3D plots
  • Classification with KNN
  • Drawing a boundary between classes with KNN
  • Dimensionality reduction with PCA and t-SNE
  • Clustering with k-Means

In [1]:
# Load CSV
import pandas as pd
import numpy as np


filename = 'Examples - Phishing clasification2.csv'

# Specify the names of attributes if the header is not available in the CSV file
#names = ['Registrar', 'Lifetime', 'Country', 'Class']

# Loading with NumPy
#raw_data = open(filename, 'rt')
#data = np.loadtxt(raw_data, delimiter=",")

# Loading with Pandas
data = pd.read_csv(filename)
print(data.shape)
#data
#data.dtypes

# Transforming 'object' data to 'categorical' to get numerical (ordinal numbers) representation
data['Registrar'] = data['Registrar'].astype('category')
data['Country'] = data['Country'].astype('category')
data['Protocol'] = data['Protocol'].astype('category')
data['Class'] = data['Class'].astype('category')

data['Registrar_code'] = data['Registrar'].cat.codes
data['Country_code'] = data['Country'].cat.codes
data['Protocol_code'] = data['Protocol'].cat.codes
data['Class_code'] = data['Class'].cat.codes

#data.dtypes
pd.options.display.max_rows=1000
data
#pd.options.display.max_rows=100


(202, 5)
Out[1]:
Registrar Lifetime Country Protocol Class Registrar_code Country_code Protocol_code Class_code
0 godaddy 2 US http phishing 84 28 0 1
1 XXLPOWER.CO.UA 1 NaN http phishing 78 -1 0 1
2 godaddy 2 NaN http phishing 84 -1 0 1
3 LOGINCOMPUTERS.IN 1 US http phishing 37 28 0 1
4 godaddy 1 US http phishing 84 28 0 1
5 REGISTER.COM, INC. 9 US http phishing 59 28 0 1
6 godaddy 2 NaN http phishing 84 -1 0 1
7 PSI-USA, INC. DBA DOMAIN ROBOT 3 Ireland http phishing 55 12 0 1
8 PDR LTD. D/B/A PUBLICDOMAINREGISTRY.COM 5 India http phishing 54 11 0 1
9 PDR LTD. D/B/A PUBLICDOMAINREGISTRY.COM 4 US http phishing 54 28 0 1
10 FASTNETINFORMATICS.COM 1 NaN http phishing 23 -1 0 1
11 MINBSAPERU.COM 6 NaN http phishing 43 -1 0 1
12 TINYURL.COM 16 NaN http phishing 64 -1 0 1
13 TUCOWS 2 US http phishing 67 28 0 1
14 MYPAYPAAL.COM 1 CA http phishing 44 3 0 1
15 CONVERGYSEMPLEOS.COM 1 NaN http phishing 8 -1 0 1
16 NETWORK SOLUTIONS, LLC. 1 France http phishing 49 8 0 1
17 Namesco Limited 1 UK http phishing 52 27 0 1
18 CONLOUW.COM 1 NaN http phishing 7 -1 0 1
19 NAMESILO, LLC 1 India http phishing 47 11 0 1
20 Webiq Domains Solutions Pvt. Ltd. 4 India http phishing 74 11 0 1
21 allegro.pl 18 US https phishing 79 28 1 1
22 Ascio Technologies 2 CZ http phishing 4 4 0 1
23 BAGTHOSEBARGAINS.COM 1 US https phishing 5 28 1 1
24 godaddy 1 NaN http phishing 84 -1 0 1
25 godaddy 1 US http phishing 84 28 0 1
26 IMOMANS.COM 1 US http phishing 32 28 0 1
27 orcadesmarine.co.uk 2 France http phishing 90 8 0 1
28 FBS INC. 1 Turkey http phishing 24 26 0 1
29 EMINENCECOACH.COM 2 NaN http phishing 16 -1 0 1
30 ALMODY.COM 1 US http phishing 1 28 0 1
31 SUNDOWNVACATIONS.COM 1 US http phishing 62 28 0 1
32 godaddy 4 Netherlands http phishing 84 19 0 1
33 godaddy 3 US http phishing 84 28 0 1
34 COMITETUL.INFO 1 Netherlands http phishing 6 19 0 1
35 Godaddy 2 Canada http phishing 28 5 0 1
36 Godaddy 2 US http phishing 28 28 0 1
37 SUNDOWNVACATIONS.COM 1 US http phishing 62 28 0 1
38 ALMODY.COM 1 US http phishing 1 28 0 1
39 FASTDOMAIN, INC 5 US http phishing 22 28 0 1
40 EVERYCITY.CO.UK 7 US http phishing 20 28 0 1
41 EMINENCECOACH.COM 2 US http phishing 16 28 0 1
42 home.pl 3 Poland https phishing 85 20 1 1
43 HOLISTICHEALTH-GUIDE.COM 7 India http phishing 29 11 0 1
44 NaN 1 US http phishing -1 28 0 1
45 godaddy 9 US http phishing 84 28 0 1
46 godaddy 9 US http phishing 84 28 0 1
47 DNC HOLDINGS, INC. 4 US http phishing 14 28 0 1
48 godaddy 1 US http phishing 84 28 0 1
49 enom.com 6 US http phishing 83 28 0 1
50 ALMODY.COM 1 US http phishing 1 28 0 1
51 eib.edu.bd 0 US http phishing 82 28 0 1
52 enom.com 4 NaN http phishing 83 -1 0 1
53 godaddy 1 US http phishing 84 28 0 1
54 TUCOWS 16 US http phishing 67 28 0 1
55 godaddy 12 US http phishing 84 28 0 1
56 godaddy 14 US http phishing 84 28 0 1
57 godaddy 4 Germany http phishing 84 9 0 1
58 NAMESILO, LLC 1 Bangladesh http phishing 47 2 0 1
59 LESPOWER.COM 5 US http phishing 36 28 0 1
60 godaddy 4 NaN http phishing 84 -1 0 1
61 TUCOWS 4 CA http phishing 67 3 0 1
62 lowicz.pl 3 Poland http phishing 87 20 0 1
63 godaddy 2 US http phishing 84 28 0 1
64 godaddy 2 US http phishing 84 28 0 1
65 enom.com 6 US http phishing 83 28 0 1
66 WEBFUSION LTD. 8 UK http phishing 71 27 0 1
67 casamentoparasempre.com.br 1 US http phishing 80 28 0 1
68 enom.com 6 US http phishing 83 28 0 1
69 easyDNS Technologies, Inc. 3 US http phishing 81 28 0 1
70 MANGAVERDEBEACH.COM 2 US http phishing 41 28 0 1
71 PDR LTD. D/B/A PUBLICDOMAINREGISTRY.COM 1 US http phishing 54 28 0 1
72 godaddy 1 US http phishing 84 28 0 1
73 godaddy 4 Germany http phishing 84 9 0 1
74 godaddy 3 NaN http phishing 84 -1 0 1
75 Akky (Una division de NIC Mexico) 3 Mexica http phishing 2 15 0 1
76 godaddy 1 US http phishing 84 28 0 1
77 godaddy 5 US http phishing 84 28 0 1
78 godaddy 1 US http phishing 84 28 0 1
79 enom.com 1 US http phishing 83 28 0 1
80 enom.com 12 US http phishing 83 28 0 1
81 godaddy 7 US http phishing 84 28 0 1
82 WEBFUSION LTD. 8 UK http phishing 71 27 0 1
83 PDR LTD. D/B/A PUBLICDOMAINREGISTRY.COM 7 US http phishing 54 28 0 1
84 WEBFUSION LTD. 8 UK http phishing 71 27 0 1
85 PDR LTD. D/B/A PUBLICDOMAINREGISTRY.COM 1 US http phishing 54 28 0 1
86 godaddy 1 US http phishing 84 28 0 1
87 godaddy 1 US http phishing 84 28 0 1
88 REG-ACTIVE24 2 Czech Republic http phishing 58 7 0 1
89 godaddy 4 Germany http phishing 84 9 0 1
90 HONG KONG TELECOMMUNICATIONS (HKT) LIMITED 12 Hong Kong http phishing 30 10 0 1
91 Comercial Gonalves e Rocha 7 US http phishing 12 28 0 1
92 godaddy 1 US http phishing 84 28 0 1
93 godaddy 1 US http phishing 84 28 0 1
94 TUCOWS 1 South Africa http phishing 67 25 0 1
95 godaddy 1 India http phishing 84 11 0 1
96 godaddy 1 Netherlands http phishing 84 19 0 1
97 godaddy 6 US http phishing 84 28 0 1
98 PDR LTD. D/B/A PUBLICDOMAINREGISTRY.COM 7 India http phishing 54 11 0 1
99 godaddy 1 US http phishing 84 28 0 1
100 godaddy 1 US http phishing 84 28 0 1
101 godaddy 2 US http phishing 84 28 0 1
102 NaN 1 US http phishing -1 28 0 1
103 TUCOWS 11 UK http phishing 67 27 0 1
104 TLDS, LLC DBA SRSPLUS 9 US http phishing 66 28 0 1
105 godaddy 1 US http phishing 84 28 0 1
106 WILD WEST DOMAINS 1 US http phishing 72 28 0 1
107 godaddy 1 US http phishing 84 28 0 1
108 TLDS, LLC DBA SRSPLUS 9 Canada http phishing 66 5 0 1
109 WILD WEST DOMAINS 1 US http phishing 72 28 0 1
110 godaddy 4 Germany http phishing 84 9 0 1
111 PUBLICDOMAINREGISTRY.COM 1 Murrica http phishing 56 16 0 1
112 JOKER.COM 4 Murrica http phishing 33 16 0 1
113 LAUNCHPAD.COM 0 Murrica http phishing 35 16 0 1
114 WhoisGuard Protected 0 Murrica http phishing 76 16 0 1
115 godaddy 6 Murrica http phishing 84 16 0 1
116 home.pl 5 Poland http phishing 85 20 0 1
117 godaddy 1 US http phishing 84 28 0 1
118 godaddy 1 US https phishing 84 28 1 1
119 PDR LTD. D/B/A PUBLICDOMAINREGISTRY.COM 3 US http phishing 54 28 0 1
120 godaddy 5 US http phishing 84 28 0 1
121 Lunar Pages - The Magi Organization 1 US http phishing 40 28 0 1
122 CloudFlare 1 US http phishing 11 28 0 1
123 CyrusOne LLC 5 US http phishing 13 28 0 1
124 TUCOWS 6 US http phishing 67 28 0 1
125 home.pl 8 Poland https phishing 85 20 1 1
126 Hetzner Online GmbH 4 Russia http phishing 31 22 0 1
127 home.pl 1 Poland https phishing 85 20 1 1
128 NaN 0 Chile http phishing -1 6 0 1
129 TLD REGISTRAR SOLUTIONS LTD 1 US http phishing 65 28 0 1
130 Andrew Florides 1 UK http phishing 3 27 0 1
131 EUROPEAN PROJECTS GROUP SP.Z O.O. 13 Poland http phishing 19 20 0 1
132 SaudiNIC 5 US http phishing 63 28 0 1
133 enom.com 4 US http phishing 83 28 0 1
134 Wild West Domains, LLC (R120-LROR) 1 Singapore http phishing 77 24 0 1
135 LAUNCHPAD.COM 3 US http phishing 35 28 0 1
136 CRAZY DOMAINS FZ-LLC 3 Australia http phishing 9 1 0 1
137 godaddy 4 Germany http phishing 84 9 0 1
138 1 & 1 INTERNET AG 2 Germany http phishing 0 9 0 1
139 godaddy 1 Murrica http phishing 84 16 0 1
140 mclink.it 14 Italy http phishing 88 13 0 1
141 godaddy 1 Murrica http phishing 84 16 0 1
142 godaddy 1 Murrica http phishing 84 16 0 1
143 godaddy 1 Murrica http phishing 84 16 0 1
144 Websitewelcome.com 1 Murrica http phishing 75 16 0 1
145 mysitehosted.com 0 Murrica http phishing 89 16 0 1
146 jump.ro 2 Romaina http phishing 86 21 0 1
147 PublicDomainRegistry.com 1 India http phishing 57 11 0 1
148 Registrar IANA ID 1 Murrica http phishing 60 16 0 1
149 TUCOWS 3 Murrica http phishing 67 16 0 1
150 wildwestdomains.com 7 Murrica http phishing 92 16 0 1
151 PublicDomainRegistry.com 1 Murrica http phishing 57 16 0 1
152 registro.br 1 Murrica http phishing 91 16 0 1
153 godaddy 6 Murrica http phishing 84 16 0 1
154 SE Direkt 14 SE https benign 61 23 1 0
155 MARKMONITOR INC. 7 SE https benign 42 23 1 0
156 MARKMONITOR INC. 12 SE https benign 42 23 1 0
157 KEY-SYSTEMS GMBH 12 SE https benign 34 23 1 0
158 NL61-IS 10 US https benign 51 28 1 0
159 Local Register Inc 13 SE https benign 38 23 1 0
160 EURODNS S.A 17 LU https benign 18 14 1 0
161 EURODNS S.A 10 LU https benign 18 14 1 0
162 KEY-SYSTEMS GMBH 10 NZ https benign 34 18 1 0
163 EURODNS S.A 16 LU https benign 18 14 1 0
164 EURODNS S.A 15 LU https benign 18 14 1 0
165 DOMAIN.COM, LLC 17 US https benign 15 28 1 0
166 EURODNS S.A 15 LU https benign 18 14 1 0
167 DNC HOLDINGS, INC. 17 US https benign 14 28 1 0
168 EURODNS S.A 16 LU https benign 18 14 1 0
169 EURODNS S.A 25 LU https benign 18 14 1 0
170 CSC CORPORATE DOMAINS, INC. 22 US https benign 10 28 1 0
171 FABULOUS.COM PTY LTD. 13 AU https benign 21 0 1 0
172 Loopia AB 2 SE https benign 39 23 1 0
173 GODADDY.COM, LLC 11 US https benign 26 28 1 0
174 EURODNS S.A 18 LU https benign 18 14 1 0
175 NETWORKING4ALL B.V. 18 CZ https benign 50 4 1 0
176 GOOGLE INC. 23 NZ https benign 27 18 1 0
177 EURODNS S.A 4 LU https benign 18 14 1 0
178 GODADDY.COM, LLC 8 US https benign 26 28 1 0
179 GODADDY.COM, LLC 16 US https benign 26 28 1 0
180 MARKMONITOR INC. 16 US https benign 42 28 1 0
181 ENOM, INC 7 US https benign 17 28 1 0
182 NAME.COM, INC. 22 US https benign 46 28 1 0
183 GANDI SAS 12 US https benign 25 28 1 0
184 GODADDY.COM, LLC 7 US https benign 26 28 1 0
185 WILD WEST DOMAINS, LLC 10 US https benign 73 28 1 0
186 MARKMONITOR INC. 20 AU https benign 42 0 1 0
187 MARKMONITOR INC. 16 US https benign 42 28 1 0
188 EURODNS S.A 16 LU https benign 18 14 1 0
189 GODADDY.COM, LLC 11 US https benign 26 28 1 0
190 ENOM, INC 13 NL https benign 17 17 1 0
191 NETWORK SOLUTIONS, LLC 19 Canada https benign 48 5 1 0
192 MARKMONITOR INC. 23 US https benign 42 28 1 0
193 Melbourne IT Ltd 8 US https benign 45 28 1 0
194 GANDI SAS 7 US https benign 25 28 1 0
195 CSC CORPORATE DOMAINS, INC. 34 US https benign 10 28 1 0
196 FABULOUS.COM PTY LTD. 23 US https benign 21 28 1 0
197 Vitalwerks Internet Solutions, LLC DBA No-IP (... 14 Canada https benign 70 5 1 0
198 Onlinenic Inc 5 US https benign 53 28 1 0
199 TUCOWS, INC. 7 US https benign 68 28 1 0
200 NETWORK SOLUTIONS, LLC 6 US https benign 48 28 1 0
201 USA 2 US https benign 69 28 1 0
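
The table shows two side effects of `.cat.codes` worth keeping in mind: missing values (NaN) receive the code -1, and spelling variants such as `godaddy` and `Godaddy` become separate categories (codes 84 and 28). A minimal sketch on a toy Series follows; the values are illustrative and not read from the CSV, and lower-casing is only one possible way to merge variants.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with a missing entry and a case variant
registrar = pd.Series(['godaddy', np.nan, 'Godaddy', 'TUCOWS', 'godaddy'])

as_category = registrar.astype('category')
codes = as_category.cat.codes
print(list(as_category.cat.categories))  # categories are sorted; NaN is not one of them
print(codes.tolist())                    # NaN is encoded as -1

# Assumption: case differences are noise, so normalise before encoding
merged = registrar.str.lower().astype('category').cat.codes
print(merged.tolist())
```

Since the codes follow the sorted order of the category names, the numeric distance between two codes carries no meaning of its own, which matters for distance-based methods such as KNN.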

In [2]:
X = data[['Registrar_code', 'Lifetime', 'Country_code', 'Protocol_code']].values   #Feature Matrix
y = data['Class_code'].values          #Target Variable

feature_names = data[['Registrar_code', 'Lifetime', 'Country_code', 'Protocol_code']].columns.values
#print(feature_names)
target_names = data['Class'].cat.categories
country_names = data['Country'].cat.categories
registrar_names = data['Registrar'].cat.categories
protocol_names = data['Protocol'].cat.categories
#print(target_names, country_names, registrar_names)

In [3]:
import matplotlib.pyplot as plt

x_index = 1
y_index = 3

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: target_names[int(i)])
plt.scatter(X[:, x_index], X[:, y_index], c=y, cmap=plt.get_cmap('Paired', 2))
plt.colorbar(ticks=[0, 1], format=formatter)
plt.clim(-0.5, 1.5)
plt.xlabel(feature_names[x_index])
plt.ylabel(feature_names[y_index]);
plt.show()


<Figure size 640x480 with 2 Axes>

In [4]:
from mpl_toolkits.mplot3d import Axes3D
    
fig = plt.figure(1, figsize=(10, 8))
plt.clf()
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_axes([0, 0, 1, 1], projection='3d', elev=30, azim=100)
ax.scatter(X[:, 1], X[:, 2], X[:, 3], lw=2, c=y, cmap='Paired')
ax.set_xlabel(feature_names[1])
ax.set_ylabel(feature_names[2]);
ax.set_zlabel(feature_names[3]);
plt.show()



In [5]:
from sklearn import neighbors

# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# fit the model
knn.fit(X, y)

# call the "predict" method:
registrar_code = 48
lifetime = 2
country_code = 28 
protocol_code = 1

result = knn.predict([[registrar_code, lifetime, country_code, protocol_code],])
#print(target_names)
print(result, target_names[result[0]], ": ", registrar_names[registrar_code], lifetime, country_names[country_code], protocol_names[protocol_code] )


[1] phishing :  NETWORK SOLUTIONS, LLC 2 US https
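
To make the prediction above less of a black box, here is a minimal NumPy sketch of what a k-nearest-neighbours vote does with Euclidean distance and uniform weights (the defaults of `KNeighborsClassifier`). The helper `knn_predict` and the toy points are illustrative assumptions, not part of the notebook or of scikit-learn.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=5):
    """Hypothetical helper: majority vote among the k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = np.bincount(y_train[nearest])                 # number of votes per class code
    return int(np.argmax(votes))

# Toy data (illustrative): columns are [Lifetime, Protocol_code]
X_train = np.array([[1, 0], [2, 0], [1, 0], [3, 0], [12, 1], [16, 1], [20, 1]])
y_train = np.array([1, 1, 1, 1, 0, 0, 0])  # 1 = phishing, 0 = benign

short_lived_https = knn_predict(X_train, y_train, np.array([2, 1]), k=3)
long_lived_https = knn_predict(X_train, y_train, np.array([15, 1]), k=3)
print(short_lived_https, long_lived_https)
```

Because the distance is computed on raw values, a feature with a wide numeric range (here Lifetime) outweighs a 0/1 feature such as Protocol_code, so a short-lived https domain still lands among the phishing neighbours.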

In [6]:
from matplotlib.colors import ListedColormap

n_neighbors = 5
h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['cyan', 'red'])
cmap_bold = ListedColormap(['blue', 'orange'])

# Get '1: Lifetime' and '2: Country' attributes only 
x_index = 1
y_index = 2
X2 = X[:,[x_index, y_index]] 

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    knn = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    knn.fit(X2, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
    y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("2-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.xlabel(feature_names[x_index])
    plt.ylabel(feature_names[y_index]);

plt.show()


Dimensionality reduction with PCA


In [7]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)


Reduced dataset shape: (202, 2)

In [8]:
# PCA only
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='Paired')

print("Meaning of the components:")
for component in pca.components_:
    print(" + ".join("%.3f x %s" % (value, name)
                     for value, name in zip(component, feature_names)))


Meaning of the components:
-0.998 x Registrar_code + 0.065 x Lifetime + -0.022 x Country_code + 0.006 x Protocol_code
0.020 x Registrar_code + -0.035 x Lifetime + -0.999 x Country_code + -0.005 x Protocol_code
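
The first component is almost entirely Registrar_code (weight -0.998). This is what PCA tends to produce when one feature has a much larger numeric spread than the others, because PCA follows variance. The sketch below illustrates the effect on synthetic data with NumPy's SVD. The helper `first_component` and the two toy features are assumptions for illustration, and standardising is shown as one common remedy rather than something the notebook does.

```python
import numpy as np

def first_component(X):
    """Hypothetical helper: first principal axis via SVD of the centred data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
wide = rng.normal(0, 30, size=200)    # large spread, like an integer code over a wide range
narrow = rng.normal(0, 1, size=200)   # small spread
X_toy = np.column_stack([wide, narrow])

raw = first_component(X_toy)
print("raw:   ", np.round(np.abs(raw), 3))

# Standardise each column to zero mean and unit variance, then repeat
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled = first_component(X_std)
print("scaled:", np.round(np.abs(scaled), 3))
```

On the raw data the wide feature dominates the first axis; after standardising, both features contribute with equal magnitude.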

Dimensionality reduction with t-SNE


In [9]:
from sklearn.manifold import TSNE
X_reduced2 = TSNE(n_components=2).fit_transform(X)
# PCA + t-SNE
X_reduced3 = TSNE(n_components=2).fit_transform(X_reduced)
print("Reduced dataset shape:", X_reduced3.shape)


Reduced dataset shape: (202, 2)

In [10]:
# t-SNE only
plt.scatter(X_reduced2[:, 0], X_reduced2[:, 1], c=y, cmap='Paired')


Out[10]:
<matplotlib.collections.PathCollection at 0x21180ec1508>

In [11]:
# PCA + t-SNE
plt.scatter(X_reduced3[:, 0], X_reduced3[:, 1], c=y, cmap='Paired')


Out[11]:
<matplotlib.collections.PathCollection at 0x21183bf5fc8>

Clustering with k-Means


In [12]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=2, random_state=0) # Fixing the RNG in kmeans
k_means.fit(X)
y_pred = k_means.predict(X)

plt.scatter(X_reduced2[:, 0], X_reduced2[:, 1], c=y_pred, cmap='Paired');
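
k-means numbers its clusters arbitrarily, so cluster 1 is not guaranteed to correspond to class code 1 (phishing). Before comparing `y_pred` with `y`, one option is to relabel each cluster with the majority class of its members. A minimal sketch, where `align_clusters` is a hypothetical helper and the arrays are toy values:

```python
import numpy as np

def align_clusters(y_true, cluster_ids):
    """Hypothetical helper: map each cluster id to the majority true class of its members."""
    aligned = np.empty_like(y_true)
    for cluster in np.unique(cluster_ids):
        members = cluster_ids == cluster
        majority = np.bincount(y_true[members]).argmax()
        aligned[members] = majority
    return aligned

# Toy example: the clustering is perfect, but the ids are swapped
y_true = np.array([1, 1, 1, 0, 0])
cluster_ids = np.array([0, 0, 0, 1, 1])

aligned = align_clusters(y_true, cluster_ids)
print(aligned.tolist())
```

Majority relabelling can map both clusters to the same class when one class dominates both; in that case the clustering simply does not separate the classes.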



In [27]:
TP = 0
TN = 0
FP = 0
FN = 0

for i in range (0, len(y)):
    #print(i, ":", y[i])
    if (y[i] == 1): # Positive
        if (y[i] == y_pred[i]):
            TP+=1
        else:
            FN+=1
    else:
        if (y[i] == y_pred[i]):
            TN+=1
        else:
            FP+=1

    
print("TP =", TP, "TN =", TN, "FP =", FP, "FN =", FN) 

TPR = TP / (TP+FN)                  # recall / sensitivity
TNR = TN / (TN+FP)                  # specificity
FPR = FP / (FP+TN)
FNR = FN / (TP+FN)
ACC = (TP+TN) / (TP+TN+FP+FN)       # accuracy
PPV = TP / (TP+FP)                  # precision
NPV = TN / (TN+FN)
Fmeasure = 2*PPV*TPR / (PPV + TPR)

print(f"TPR = {TPR:.3f} TNR = {TNR:.3f} FPR = {FPR:.3f} FNR = {FNR:.3f} "
      f"ACC = {ACC:.3f} PPV = {PPV:.3f} NPV = {NPV:.3f} F-measure = {Fmeasure:.3f}")


TP = 110 TN = 41 FP = 7 FN = 44
TPR = 0.714 TNR = 0.854 FPR = 0.146 FNR = 0.286 ACC = 0.748 PPV = 0.940 NPV = 0.482 F-measure = 0.812
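
As a cross-check, the standard definitions can be applied directly to the counts printed above (TP = 110, TN = 41, FP = 7, FN = 44). The sketch below uses a hypothetical helper `binary_metrics`. Note that precision (PPV) is TP / (TP + FP), whereas (TP + TN) / (TP + TN + FP + FN) is accuracy.

```python
def binary_metrics(tp, tn, fp, fn):
    """Hypothetical helper: common rates from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else float('nan')  # guard against empty classes

    tpr = ratio(tp, tp + fn)   # recall / sensitivity
    ppv = ratio(tp, tp + fp)   # precision
    return {
        'TPR': tpr,
        'TNR': ratio(tn, tn + fp),
        'PPV': ppv,
        'NPV': ratio(tn, tn + fn),
        'ACC': ratio(tp + tn, tp + tn + fp + fn),
        'F1': ratio(2 * ppv * tpr, ppv + tpr),
    }

metrics = binary_metrics(tp=110, tn=41, fp=7, fn=44)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

The inner `ratio` helper returns NaN instead of raising `ZeroDivisionError` when a denominator is zero, which can happen if a clustering puts every sample into one cluster.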