Machine Learning in Cyber Security

  • Learn to spot an attacker through monitoring TCP logs.

Dataset: KDD Cup 1999

  • Source [1]
  • TCP data dump - array of 41 variables (TCP features)
    • Basic Features (9)
    • Content Features (13)
    • Traffic Features (19)
  • Consists of 23 types of attacks:
    1. DOS: back, land, neptune, pod, surf, teardrop
    2. R2L: ftp_write, guess_passwd, lmap, multihop, phf, spy, warezclient, warezmaster
    3. U2R: buffer_overflow, loadmodule, perl, rootkit
    4. Probing: ipsweep, nmap, portsweep, satan

Basic Features of TCP

  • duration - connection time in seconds.
  • protocal_type - i.e. TCP, UDP
  • service - Netword service on destination (i.e. HTTP, Telnet)
  • source_bytes - amount of data from source to destination.
  • flag - normal or error status of connection.
  • land - Connection is from/to the same host/port: "1", else "0".
  • wrong_fragment - number of "wrong" fragments.
  • urgent - number of urgent packets.

Content Features of TCP

  • host - number of "hot" indicators.
  • num_failed_logins - number of attempts.
  • logged_in - succes: "1", else "0".
  • compromised - number of "compromised" conditions.
  • root_shell - if root shell obtained: "1", else "0".
  • su_attempted - if "su" command attempted: "1", else "0".
  • num_root - number of root accesses.
  • num_file_creations - number of creation operations.
  • num_shells - number of shell prompts.
  • num_access_files - number of operations on access control files.
  • num_outbound_cmds - number per session.
  • is_hot_login - if login on "hot" list: "1", else "0".
  • is_guest_login - if guest: "1", else "0".

Traffic Features of TCP (2 sec window)

  • count - number of connections to the same host.
  • serror_rate - % of connections with "SYN" errors.
  • rerror_rate - % of connections with "REJ" errors.
  • same_srv_rate - % of connections to the same service.
  • diff_srv_rate - % of connections to different services.
  • srv_count - number of connections to the same service.
  • srv_serror_rate - % of connections with "SYN" errors.
  • srv_rerror_rate - % of connections with "REJ" errors.
  • srv_diff_host_rate - % of connections to different hosts.

Models (from scikit-learn)

  • Generative:
    • Naive Bayes
  • Descriminative:
    • kNN
    • kMeans
    • Logistic Regression
    • Decision Tree
    • Random Forest
  • Others to consider:
    • LinearDiscriminantAnalysis
    • GaussianNB
    • NBTree, Random Tree, Multilayer Perceptron (additional models from [2])
    • SVC (Not used for this problem. "The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.")

In [1]:
from array import array
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

%matplotlib inline

In [2]:
dataset_part = fetch_kddcup99(percent10=True)  # Over 600 MB in memeory.
# dataset_full = fetch_kddcup99(percent10=False)  # Crashed my computer with 16 GB of RAM.

In [3]:
dataset_part.data[0]  # Sample of TCP record.


Out[3]:
array([0, b'tcp', b'http', b'SF', 181, 5450, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 8, 8, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 9, 9,
       1.0, 0.0, 0.11, 0.0, 0.0, 0.0, 0.0, 0.0], dtype=object)

In [4]:
len(set(dataset_part.target))  # Number of unique classifications.


Out[4]:
23

In [ ]:

Transform Data

  • Dataset values are all of type "object" => convert to numeric types.
  • Label Encoder - replaces strings with an incrementing integer.

In [5]:
df = pd.DataFrame(dataset_part.data)

In [6]:
df.head(1)


Out[6]:
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
0 0 b'tcp' b'http' b'SF' 181 5450 0 0 0 0 ... 9 9 1 0 0.11 0 0 0 0 0

1 rows × 41 columns


In [7]:
df = df.apply(pd.to_numeric, errors='ignore')

In [8]:
# Example from http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
'''
le = preprocessing.LabelEncoder()
le.fit(list(names))
# le.classes_  # Shows all labels.
print(le.transform([b'icmpeco_iSF', b'icmpecr_iSF', b'icmpred_iSF']) )
print(le.inverse_transform([0, 0, 1, 2]))
'''
# https://datascience.stackexchange.com/questions/16728/could-not-convert-string-to-float-error-on-kddcup99-dataset
for column in df.columns:
    if df[column].dtype == object:
        le = preprocessing.LabelEncoder()
        df[column] = le.fit_transform(df[column])

In [9]:
df.head(1)  # All strings removed.


Out[9]:
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
0 0 1 22 9 181 5450 0 0 0 0 ... 9 9 1.0 0.0 0.11 0.0 0.0 0.0 0.0 0.0

1 rows × 41 columns

Preprocessing Data


In [10]:
X = df.values

le = preprocessing.LabelEncoder()
y = le.fit_transform(dataset_part.target)
y_dict = dict(zip(y,le.classes_))  # Saved for later lookup.

In [11]:
# Test options and evaluation metric
N_SPLITS = 7
SCORING = 'accuracy'

In [12]:
# Split-out validation dataset
test_size=0.33
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=SEED)

Train Model


In [21]:
#  Algorithms
models = [
          #('LR', LogisticRegression()),
          ('LDA', LinearDiscriminantAnalysis()),
          #('KNN', KNeighborsClassifier()),
          #('KMN', KMeans()),
          #('CART', DecisionTreeClassifier()),
          #('NB', GaussianNB()),
         ]

# evaluate each model in turn
results = []
names = []
print('{:8}{:^8}{:^8}'.format('Model','mean','std'))
print('-' * 23)
for name, model in models:
    kfold = KFold(n_splits=N_SPLITS, random_state=SEED)

    %timeit -n1 cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=SCORING) 
    results.append(cv_results)
    names.append(name)
    print('{:8}{:^8.2%}{:^8.2%}'.format(name, cv_results.mean(), cv_results.std()))


Model     mean    std   
-----------------------
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
1 loop, best of 3: 10.8 s per loop
LDA      99.49%  0.05%  
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")

In [19]:
print(*cv_results)


0.994924394628 0.995664587078 0.99452257587 0.994607169293 0.994205350534 0.995241619964 0.995135775315

In [ ]:
previous_results = '''
LR: 98.87% (0.10%)
LDA: 99.49% (0.05%)
KNN: 99.84% (0.01%) <-- slow
CART: 99.94% (0.00%)
NB: 93.96% (0.96%)
SVM:   <-- very slow
'''

In [ ]:


In [ ]:
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(y)
plt.show()

In [ ]:

Use model to make predictions:


In [ ]:
test = [0, 1, 22, 9, 181, 5450, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 9, 9, 1.0, 0.0, 0.11, 0.0, 0.0, 0.0, 0.0, 0.0]
print(neigh.predict([test]))

print(neigh.predict_proba([test]))  # TODO: research this.

In [ ]:

Sources

  • [1] - KDD Cup 99 dataset
  • [2] - M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009. link

Other Resources

Where to go from here

  • Seeking more labelled datasets and determining the potential for other non-labelled datasets.

In [ ]:


In [ ]:
print('{:10}{:10}{:10}'.format('Model','mean','std'))
print('LDA: 99.49% (0.05%)')

In [17]:
print('{:8}{:^8}{:^8}'.format('Model','mean','std'))
print('-' * 23)
print('{:8}{:^8.2%}{:^8.2%}'.format('LDA', .9949, .0005))


Model     mean    std   
-----------------------
LDA      99.49%  0.05%  

In [ ]:

TODO:

* Add `%timeit` to each model being trained.
* Compare times to a compute cluster using PySpark.
* Add more models (KMeans).
* Graph accuracy vs. k (or similar factor for all models).
* Compare all models (accuracy, train time, predict time).
* Refactor code to allow various datasets.