This is a machine learning analysis of the 'Phishing Websites Data Set' hosted in the UCI Machine Learning Repository.
Feature descriptions for this data set are listed here: https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Phishing%20Websites%20Features.docx
This analysis includes data ingestion and wrangling, exploratory visualization, feature selection via regularization, model fitting and evaluation, and an operational prediction interface.
In [1]:
%matplotlib inline
import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets.base import Bunch
import json
import time
import pickle
import csv
In [2]:
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Training%20Dataset.arff"

def fetch_data(fname='Training_Dataset.arff'):
    response = requests.get(URL)
    outpath = os.path.abspath(fname)
    with open(outpath, 'wb') as f:
        f.write(response.content)
    return outpath

data = fetch_data()
print(data)
The .arff file needs to be converted to a .csv file so the data can be read into a pandas dataframe for initial analysis. I did this using the getCSVFromArff() function. Though modified slightly for my purposes here, the code for this function was provided in this blog: http://biggyani.blogspot.com/2014/08/converting-back-and-forth-between-weka.html.
In [3]:
def getCSVFromArff(fileNameArff, fileNameSmoted):
    with open(fileNameArff, 'r') as fin:
        data = fin.read().splitlines(True)
    i = 0
    cols = []
    for line in data:
        if '@data' in line:
            i += 1
            break
        else:
            i += 1
            if line.startswith('@attribute'):
                # The column name sits between '@attribute ' and either the
                # brace-delimited value set or the word 'numeric'.
                if '{' in line:
                    cols.append(line[11:line.index('{') - 1])
                else:
                    cols.append(line[11:line.index('numeric') - 1])
    headers = ",".join(cols)
    with open(fileNameSmoted + '.csv', 'w') as fout:
        fout.write(headers)
        fout.write('\n')
        fout.writelines(data[i:])

getCSVFromArff(data, 'Training_Dataset')
Next, I read in the data with pandas, and used the head() function to look at the dataframe and ensure it doesn't look wonky. I also wrote this data back to disk as a .txt file with the headers and index stripped out. This will be used later to read the data back in and pass it to the machine learning pipeline.
In [2]:
df = pd.read_csv('Training_Dataset.csv')
df.columns = ['having_IP_Address', 'URL_Length', 'Shortining_Service', 'having_At_Symbol', 'double_slash_redirecting', 'Prefix_Suffix', 'having_Sub_Domain', 'SSLfinal_State', 'Domain_registeration_length', 'Favicon', 'port', 'HTTPS_token', 'Request_URL', 'URL_of_Anchor', 'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL', 'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe', 'age_of_domain', 'DNSRecord', 'web_traffic', 'Page_Rank', 'Google_Index', 'Links_pointing_to_page', 'Statistical_report', 'Result']
df.to_csv(path_or_buf='Training_Dataset1.txt', sep=',', header=False, index=False)
df.head()
Out[2]:
The data in the pandas dataframe looks good. I also double-checked the file saved to disk to ensure the data was stored as expected.
In [3]:
with open('Training_Dataset1.txt') as f:
    for idx, line in enumerate(f):
        if idx > 10:
            break
        else:
            print(line)
The data looks okay. As noted in the documentation, all the features are categorical, and as indicated by the output above, these categorical features are already numerically encoded. The alignment of textual descriptions to the numerical encoding was not explicitly provided in the .arff file itself.
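A quick check along these lines can confirm which features are binary and which are ternary before re-encoding anything:
In [ ]:
# List the distinct values per column to confirm the encoding of each feature.
for col in df.columns:
    print(col, sorted(df[col].unique()))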
Next I looped through the data set and replaced the encoded categorical values with string values from the documentation: (-1 = phishing, 1 = legitimate) for binary features and (-1 = legitimate, 0 = suspicious, 1 = phishing) for ternary features. I labeled this df2, and kept df as the numerically encoded data.
In [4]:
df_lists = []
for col in df.columns:
    x = df[col].tolist()
    if 0 not in x:
        # Binary feature
        for n, v in enumerate(x):
            if v == -1:
                x[n] = 'phishing'
            elif v == 1:
                x[n] = 'legitimate'
    else:
        # Ternary feature
        for n, v in enumerate(x):
            if v == -1:
                x[n] = 'legitimate'
            elif v == 1:
                x[n] = 'suspicious'
            else:
                x[n] = 'phishing'
    df_lists.append(x)

df2 = pd.DataFrame(df_lists).transpose()
df2.columns = df.columns
df2.to_csv(path_or_buf='Training_Dataset2.txt', sep=',', header=False, index=False)
df2.head()
Out[4]:
I then re-encoded the numeric data to remove the negative values, using the same script to update the encoding as follows:
Legitimate = 0, Phishing = 2
Legitimate = 0, Suspicious = 1, Phishing = 2
In hindsight this step was not really necessary.
In [5]:
df_lists = []
for col in df2.columns:
    x = df2[col].tolist()
    if 'suspicious' not in x:
        # Binary feature
        for n, v in enumerate(x):
            if v == 'legitimate':
                x[n] = 0
            elif v == 'phishing':
                x[n] = 2
    else:
        # Ternary feature
        for n, v in enumerate(x):
            if v == 'legitimate':
                x[n] = 0
            elif v == 'suspicious':
                x[n] = 1
            else:
                x[n] = 2
    df_lists.append(x)

df = pd.DataFrame(df_lists).transpose()
df.columns = df2.columns
df.to_csv(path_or_buf='Training_Dataset1.txt', sep=',', header=False, index=False)
df.head()
Out[5]:
Since I updated the .txt file, I double-checked this data again.
In [6]:
with open('Training_Dataset1.txt') as f:
    for idx, line in enumerate(f):
        if idx > 10:
            break
        else:
            print(line)
Next I determined the shape of the data.
In [7]:
print ("{} instances with {} features\n".format(*df2.shape))
The number of instances in this dataframe is higher than the number listed in the documentation for the dataset, which is disconcerting. This number is consistent with the rows in the retrieved .arff file though, so the error does not seem to be in my process.
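A check along these lines can verify the row count directly against the downloaded .arff file (assuming fetch_data() left Training_Dataset.arff in the working directory):
In [ ]:
# Count the rows that follow the '@data' marker in the downloaded .arff file.
with open('Training_Dataset.arff', 'r') as f:
    lines = f.read().splitlines()
data_start = next(i for i, line in enumerate(lines) if '@data' in line) + 1
print(sum(1 for line in lines[data_start:] if line.strip()), "data rows")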
Next I displayed a histogram of the features in the dataframe to get an idea of the shape of each feature. Alternatively, if the data were not all categorical, I could have generated scatter plots; with this data, though, scatter plots are not useful.
In [8]:
df.hist(figsize=(15,15))
Out[8]:
Next I viewed the 'Result' data as a factor plot against each of the features to see if any unexpected relationships were immediately apparent.
In [9]:
# Loop over the feature columns (excluding the 'Result' target itself).
for i in df2.columns[:-1]:
    g = sns.factorplot("Result", col=i, data=df2,
                       kind="count", size=4, aspect=1, col_wrap=7)
I originally generated a parallel coordinates chart. Since the dependent variable (Result) is binary, this graph is not useful, and there were probably too many features for it to reveal much anyway. I've commented out the original code.
In [10]:
#from pandas.tools.plotting import parallel_coordinates
#plt.figure(figsize=(20,20))
#parallel_coordinates(df, 'Result')
#plt.show()
Next I generated a RadViz chart to see if any patterns were visually apparent with this approach. This was better, though there were probably still too many features to gain much insight.
In [11]:
from pandas.tools.plotting import radviz
plt.figure(figsize=(20,20))
radviz(df, 'Result')
plt.show()
Since there is an abundance of features in the dataset, I used several regularization methods to identify the most significant among them. I then conducted some additional visualization.
Depending on model selection, this prioritized subset of features could be used to help improve model performance. Ultimately this was not necessary given model performance with all features included, as will be illustrated further in this walkthrough.
The prioritized subset of features could also be advantageous from an operational perspective. If model performance can be maintained with the limited feature set, this could potentially be used to reduce data ingestion and storage requirements when conducting further analysis using the model.
First I separated the features from what will eventually be my target value to predict.
In [12]:
features = df[['having_IP_Address','URL_Length','Shortining_Service','having_At_Symbol','double_slash_redirecting','Prefix_Suffix','having_Sub_Domain','SSLfinal_State','Domain_registeration_length','Favicon','port','HTTPS_token','Request_URL','URL_of_Anchor','Links_in_tags','SFH','Submitting_to_email','Abnormal_URL','Redirect','on_mouseover','RightClick','popUpWidnow','Iframe','age_of_domain','DNSRecord','web_traffic','Page_Rank','Google_Index','Links_pointing_to_page','Statistical_report']]
labels = df['Result']
In [13]:
list (features)
Out[13]:
The three methods I used were Lasso (L1 regularization), Ridge Regression (L2 regularization), and ElasticNet. First I displayed the features and their coefficients. Then I used SciKit-Learn transformer methods to apply these three methods again, displaying just the significant features.
Lasso (L1 Regularization)
In [14]:
model = Lasso()
model.fit(features, labels)
output = list(zip(features, model.coef_.tolist()))
for i in output:
    print(i)
Ridge Regression (L2 Regularization)
In [15]:
model = Ridge()
model.fit(features, labels)
output = list(zip(features, model.coef_.tolist()))
for i in output:
    print(i)
ElasticNet
In [16]:
model = ElasticNet(l1_ratio=0.10)
model.fit(features, labels)
output = list(zip(features, model.coef_.tolist()))
for i in output:
    print(i)
SelectFromModel()
In [17]:
model = Lasso()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
# Map the selected indices back to column names.
output_lasso = list(features.columns[sfm.get_support(indices=True)])
for i in output_lasso:
    print(i)
In [18]:
model = Ridge()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
output_ridge = list(features.columns[sfm.get_support(indices=True)])
for i in output_ridge:
    print(i)
In [19]:
model = ElasticNet()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
output_elasticnet = list(features.columns[sfm.get_support(indices=True)])
for i in output_elasticnet:
    print(i)
Next I repeated the visualization with fewer features. L2 regularization seemed to provide the best result, so I limited the visualization to just these features. The result was still difficult to interpret, but the pattern became more apparent.
In [20]:
mod_features = df[['Shortining_Service', 'Prefix_Suffix', 'SSLfinal_State', 'URL_of_Anchor', 'Links_in_tags', 'SFH', 'Redirect', 'web_traffic', 'Google_Index', 'Links_pointing_to_page', 'Result']]
plt.figure(figsize=(20,20))
radviz(mod_features, 'Result')
plt.show()
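If performance holds up with this reduced set, the operational benefits noted earlier become practical. A rough sketch of that comparison (using cross_val_score, which newer versions of scikit-learn expose via sklearn.model_selection rather than sklearn.cross_validation):
In [ ]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Compare 5-fold cross-validated accuracy with and without feature reduction.
X_full = df.drop('Result', axis=1)
X_sub = mod_features.drop('Result', axis=1)
y = df['Result']
print("All features:   {:.3f}".format(cross_val_score(RandomForestClassifier(), X_full, y, cv=5).mean()))
print("Reduced subset: {:.3f}".format(cross_val_score(RandomForestClassifier(), X_sub, y, cv=5).mean()))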
Next I organized the data to be received by SciKit-Learn as a Bunch object.
First I created a JSON file with the feature and target labels. Since the data in the df dataframe, and the associated .txt file, was of int64 datatype, it was not JSON serializable. Since this data was generated from the same .csv as the data in df2 with the string values, I used df2 to form the meta file.
In [21]:
g = df.columns.to_series().groupby(df.dtypes).groups
print (g)
In [22]:
import json

meta = {
    'target_names': list(df2.Result.unique()),
    'feature_names': list(df2.columns),
    'categorical_features': {
        column: list(df2[column].unique())
        for column in df2.columns
        if df2[column].dtype == 'object'
    },
}

with open('meta.json', 'w') as f:
    json.dump(meta, f, indent=2)
Having created the meta.json file, as well as a readme file which I created separately from this notebook, I read this data into a Bunch object, along with the numerically encoded .txt file storing the feature and target data.
In [23]:
def load_data(root=os.getcwd()):
    # Construct the `Bunch` for the Phishing dataset
    filenames = {
        'meta': os.path.join(root, 'meta.json'),
        'rdme': os.path.join(root, 'README.md'),
        'data': os.path.join(root, 'Training_Dataset1.txt'),
    }
    # Load the meta data from the meta json
    with open(filenames['meta'], 'r') as f:
        meta = json.load(f)
    target_names = meta['target_names']
    feature_names = meta['feature_names']
    # Alternative method for loading in target and feature labels:
    # target_names = df.columns[-1]
    # feature_names = list(df.columns[0:-1])
    # Load the description from the README.
    with open(filenames['rdme'], 'r') as f:
        DESCR = f.read()
    # Load the dataset from the text file.
    dataset = np.genfromtxt(filenames['data'], delimiter=',')
    # Extract the target from the data
    data = dataset[:, 0:-1]
    target = dataset[:, -1]
    # Create the bunch object
    return Bunch(
        data=data,
        target=target,
        filenames=filenames,
        target_names=target_names,
        feature_names=feature_names,
        DESCR=DESCR,
    )

# Save the dataset as a variable we can use.
dataset = load_data()
print(dataset.data.shape)
print(dataset.target.shape)
I double-checked that the data looked as expected.
In [24]:
print (dataset.data)
print (dataset.target)
print (dataset.target_names)
print (dataset.feature_names)
Since the data I read into the bunch object was already numerically encoded, no additional transformation was necessary at this step. Had I been using categorical data labeled with string values, for example, these would need to be re-encoded as numeric values.
Additionally, if any of the features were missing data, I would need to implement a strategy to either drop the instance or impute the missing values. Fortunately the dataset did not have any missing values.
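For illustration only, a minimal sketch of both transformations on a small hypothetical dataframe (the column names here are made up, not from this dataset):
In [ ]:
# Hypothetical example -- this dataset is already numeric and complete.
example = pd.DataFrame({
    'status': ['legitimate', 'phishing', 'suspicious', 'phishing'],
    'age_days': [120.0, np.nan, 45.0, 300.0],
})
# Re-encode string-valued categories with the integer scheme used above.
example['status'] = example['status'].map({'legitimate': 0, 'suspicious': 1, 'phishing': 2})
# Impute the missing numeric value with the column mean.
example['age_days'] = example['age_days'].fillna(example['age_days'].mean())
print(example)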
With the data loaded and prepared for fitting by the machine learning models, I used a function to assess the effectiveness of several different models. The function accepts the bunch object with the data, a selected model, and a string value to label the model as arguments.
The function implements k-fold cross-validation, randomly partitioning the data into k equal-sized subsets (in this case k = 12), retaining one subset as the validation set and using the remaining subsets as training data. It creates an estimator object of the model type designated in the argument and fits it to the data. It then generates precision, recall, accuracy, and F1 scores for the fitted model. Scores are displayed upon execution of the function, and the model itself is pickled, i.e. the Python object is stored on disk.
In [25]:
from sklearn import metrics
from sklearn import cross_validation
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
In [26]:
def fit_and_evaluate(dataset, model, label, **kwargs):
    start = time.time()  # Start the clock!
    scores = {'precision': [], 'recall': [], 'accuracy': [], 'f1': []}
    for train, test in KFold(dataset.data.shape[0], n_folds=12, shuffle=True):
        X_train, X_test = dataset.data[train], dataset.data[test]
        y_train, y_test = dataset.target[train], dataset.target[test]
        estimator = model(**kwargs)
        estimator.fit(X_train, y_train)
        expected = y_test
        predicted = estimator.predict(X_test)
        # Append our scores to the tracker
        scores['precision'].append(metrics.precision_score(expected, predicted, pos_label=2))
        scores['recall'].append(metrics.recall_score(expected, predicted, pos_label=2))
        scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
        scores['f1'].append(metrics.f1_score(expected, predicted, pos_label=2))
    # Report
    print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time() - start))
    print("Validation scores are as follows:\n")
    print(pd.DataFrame(scores).mean())
    # Write official estimator to disk
    estimator = model(**kwargs)
    estimator.fit(dataset.data, dataset.target)
    outpath = label.lower().replace(" ", "-") + ".pickle"
    with open(outpath, 'wb') as f:
        pickle.dump(estimator, f)
    print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))
I selected three different classifier models to fit to the data. The models I selected were:
Random Forest, an ensemble machine learning method that generates a collection of decision trees. Naive Bayes, which applies Bayes' theorem under the naive assumption that all features are independent of one another. Logistic Regression, which uses the logistic function to model a binary dependent variable.
Random Forest
In [27]:
fit_and_evaluate(dataset, RandomForestClassifier, "Phishing Random Forest Classifier")
Naive Bayes
In [28]:
fit_and_evaluate(dataset, GaussianNB, "Phishing Naive Bayes Classifier")
Logistic Regression
In [29]:
fit_and_evaluate(dataset, LogisticRegression, "Phishing Logistic Regression Classifier")
The Random Forest model performed the best, achieving the highest scores in both accuracy and precision (and F1, the harmonic mean of precision and recall), and trailing only Naive Bayes in recall.
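As a cross-check, the feature importances of the fitted Random Forest can be compared against the regularization-based feature selection performed earlier; a sketch, assuming the pickle written above is in the working directory:
In [ ]:
# Load the pickled Random Forest and rank its feature importances.
with open('phishing-random-forest-classifier.pickle', 'rb') as f:
    rf = pickle.load(f)
ranked = sorted(zip(dataset.feature_names[:-1], rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:10]:
    print("{:30s} {:.3f}".format(name, score))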
To operationalize this model, I used a function which retrieves the fitted model stored on disk, then accepts and evaluates new input data provided by a user. The fitted model accepts numerically encoded inputs and returns numerically encoded outputs, so string-value inputs provided by the user are encoded as they are received, and the numerical outputs of the model (0 and 2) are presented as the expected string values, i.e. 0 = 'legitimate' and anything else = 'phishing'.
In [34]:
def load_model(path='phishing-random-forest-classifier.pickle'):
    with open(path, 'rb') as f:
        return pickle.load(f)

def predict(model, meta=meta):
    user_data = {}  # Store the input from the user
    for column in meta['feature_names'][:-1]:
        # Get the valid responses
        valid = meta['categorical_features'].get(column)
        # Prompt the user for an answer
        while True:
            print("Choose one of the following: {}".format(valid))
            val = input("enter {} >".format(column)).strip()
            if val == 'phishing':
                user_data[column] = 2
            elif val == 'suspicious':
                user_data[column] = 1
            else:
                user_data[column] = 0
            break
    # Create the prediction, keeping the columns in training order
    X = pd.DataFrame([user_data], columns=meta['feature_names'][:-1])
    yhat = model.predict(X)
    if 0 in yhat:
        print('\nThe predicted status of this URL is: Legitimate')
    else:
        print('\nThe predicted status of this URL is: Phishing')

# Execute the interface
model = load_model()
predict(model)
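Since the interface above requires interactive input, a quick non-interactive sanity check along these lines can run the loaded model against a few training rows and map the outputs back to string labels:
In [ ]:
# Predict on the first five training rows and translate the numeric output.
for pred in model.predict(dataset.data[:5]):
    print('legitimate' if pred == 0 else 'phishing')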