Machine Learning in Cyber Security

Learn to spot an attacker through monitoring TCP logs.

Dataset: KDD Cup 1999

Source [1]
TCP data dump - array of 41 variables (TCP features)
- Basic Features (9)
- Content Features (13)
- Traffic Features (19)
Consists of 23 types of attacks:
1. DOS: back, land, neptune, pod, surf, teardrop
2. R2L: ftp_write, guess_passwd, lmap, multihop, phf, spy, warezclient, warezmaster
3. U2R: buffer_overflow, loadmodule, perl, rootkit
4. Probing: ipsweep, nmap, portsweep, satan

Basic Features of TCP

duration - connection time in seconds.
protocal_type - i.e. TCP, UDP
service - Netword service on destination (i.e. HTTP, Telnet)
source_bytes - amount of data from source to destination.
flag - normal or error status of connection.
land - Connection is from/to the same host/port: "1", else "0".
wrong_fragment - number of "wrong" fragments.
urgent - number of urgent packets.

Content Features of TCP

host - number of "hot" indicators.
num_failed_logins - number of attempts.
logged_in - succes: "1", else "0".
compromised - number of "compromised" conditions.
root_shell - if root shell obtained: "1", else "0".
su_attempted - if "su" command attempted: "1", else "0".
num_root - number of root accesses.
num_file_creations - number of creation operations.
num_shells - number of shell prompts.
num_access_files - number of operations on access control files.
num_outbound_cmds - number per session.
is_hot_login - if login on "hot" list: "1", else "0".
is_guest_login - if guest: "1", else "0".

Traffic Features of TCP (2 sec window)

count - number of connections to the same host.
serror_rate - % of connections with "SYN" errors.
rerror_rate - % of connections with "REJ" errors.
same_srv_rate - % of connections to the same service.
diff_srv_rate - % of connections to different services.
srv_count - number of connections to the same service.
srv_serror_rate - % of connections with "SYN" errors.
srv_rerror_rate - % of connections with "REJ" errors.
srv_diff_host_rate - % of connections to different hosts.

Models (from scikit-learn)

Generative:
- Naive Bayes
Descriminative:
- kNN
- kMeans
- Logistic Regression
- Decision Tree
- Random Forest
Others to consider:
- LinearDiscriminantAnalysis
- GaussianNB
- NBTree, Random Tree, Multilayer Perceptron (additional models from [2])
- SVC (Not used for this problem. "The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.")



In [1]:

    
from array import array
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

%matplotlib inline



In [2]:

    
dataset_part = fetch_kddcup99(percent10=True)  # Over 600 MB in memeory.
# dataset_full = fetch_kddcup99(percent10=False)  # Crashed my computer with 16 GB of RAM.



In [3]:

    
dataset_part.data[0]  # Sample of TCP record.









    Out[3]:





array([0, b'tcp', b'http', b'SF', 181, 5450, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 8, 8, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 9, 9,
       1.0, 0.0, 0.11, 0.0, 0.0, 0.0, 0.0, 0.0], dtype=object)



In [4]:

    
len(set(dataset_part.target))  # Number of unique classifications.









    Out[4]:





23



In [ ]:

Transform Data

Dataset values are all of type "object" => convert to numeric types.
Label Encoder - replaces strings with an incrementing integer.



In [5]:

    
df = pd.DataFrame(dataset_part.data)



In [6]:

    
df.head(1)









    Out[6]:







  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      ...
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
    
  
  
    
      0
      0
      b'tcp'
      b'http'
      b'SF'
      181
      5450
      0
      0
      0
      0
      ...
      9
      9
      1
      0
      0.11
      0
      0
      0
      0
      0
    
  

1 rows × 41 columns



In [7]:

    
df = df.apply(pd.to_numeric, errors='ignore')



In [8]:

    
# Example from http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
'''
le = preprocessing.LabelEncoder()
le.fit(list(names))
# le.classes_  # Shows all labels.
print(le.transform([b'icmpeco_iSF', b'icmpecr_iSF', b'icmpred_iSF']) )
print(le.inverse_transform([0, 0, 1, 2]))
'''
# https://datascience.stackexchange.com/questions/16728/could-not-convert-string-to-float-error-on-kddcup99-dataset
for column in df.columns:
    if df[column].dtype == object:
        le = preprocessing.LabelEncoder()
        df[column] = le.fit_transform(df[column])



In [9]:

    
df.head(1)  # All strings removed.









    Out[9]:







  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      ...
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
    
  
  
    
      0
      0
      1
      22
      9
      181
      5450
      0
      0
      0
      0
      ...
      9
      9
      1.0
      0.0
      0.11
      0.0
      0.0
      0.0
      0.0
      0.0
    
  

1 rows × 41 columns

Preprocessing Data



In [10]:

    
X = df.values

le = preprocessing.LabelEncoder()
y = le.fit_transform(dataset_part.target)
y_dict = dict(zip(y,le.classes_))  # Saved for later lookup.



In [11]:

    
# Test options and evaluation metric
N_SPLITS = 7
SCORING = 'accuracy'



In [12]:

    
# Split-out validation dataset
test_size=0.33
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=SEED)

Train Model



In [21]:

    
#  Algorithms
models = [
          #('LR', LogisticRegression()),
          ('LDA', LinearDiscriminantAnalysis()),
          #('KNN', KNeighborsClassifier()),
          #('KMN', KMeans()),
          #('CART', DecisionTreeClassifier()),
          #('NB', GaussianNB()),
         ]

# evaluate each model in turn
results = []
names = []
print('{:8}{:^8}{:^8}'.format('Model','mean','std'))
print('-' * 23)
for name, model in models:
    kfold = KFold(n_splits=N_SPLITS, random_state=SEED)

    %timeit -n1 cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=SCORING) 
    results.append(cv_results)
    names.append(name)
    print('{:8}{:^8.2%}{:^8.2%}'.format(name, cv_results.mean(), cv_results.std()))









    



Model     mean    std   
-----------------------






    



C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)






    



1 loop, best of 3: 10.8 s per loop
LDA      99.49%  0.05%  






    



C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")



In [19]:

    
print(*cv_results)









    



0.994924394628 0.995664587078 0.99452257587 0.994607169293 0.994205350534 0.995241619964 0.995135775315



In [ ]:

    
previous_results = '''
LR: 98.87% (0.10%)
LDA: 99.49% (0.05%)
KNN: 99.84% (0.01%) <-- slow
CART: 99.94% (0.00%)
NB: 93.96% (0.96%)
SVM:   <-- very slow
'''



In [ ]:



In [ ]:

    
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(y)
plt.show()



In [ ]:

Use model to make predictions:



In [ ]:

    
test = [0, 1, 22, 9, 181, 5450, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 9, 9, 1.0, 0.0, 0.11, 0.0, 0.0, 0.0, 0.0, 0.0]
print(neigh.predict([test]))

print(neigh.predict_proba([test]))  # TODO: research this.



In [ ]:

Sources

[1] - KDD Cup 99 dataset
[2] - M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009. link

Other Resources

PySpark solution to the KDDCup99

link

Logs
- Public PCAP files for PCAP-based evaluation of network-based intrusion detection system (NIDS) evaluation.
- The Cyber Research Center - DataSets - ITOC CDX (2009)
Labelled datasets
- UNB ISCX (2012-) datasets contain a range of "sophisticated" intrusion attacks, botnets and DoS attacks.
- CSIC 2010 HTTP Dataset in CSV format (for Weka Analysis) dataset is from a web penetration testing testbed for anomaly detection training.
- Attack Challenge - ECML/PKDD Workshop (2007) dataset contains web penetration testing data.
- NSL-KDD Data Set (2007) intended to replace the DARPA KDDCup99 dataset for IDS.
- gureKddcup data base (2008) intended to replace the DARPA KDDCup99 dataset for IDS.
- CTU-13 dataset - pcap files (Stratosphere IPS).

Where to go from here

Seeking more labelled datasets and determining the potential for other non-labelled datasets.



In [ ]:



In [ ]:

    
print('{:10}{:10}{:10}'.format('Model','mean','std'))
print('LDA: 99.49% (0.05%)')



In [17]:

    
print('{:8}{:^8}{:^8}'.format('Model','mean','std'))
print('-' * 23)
print('{:8}{:^8.2%}{:^8.2%}'.format('LDA', .9949, .0005))









    



Model     mean    std   
-----------------------
LDA      99.49%  0.05%



In [ ]:

TODO:

* Add `%timeit` to each model being trained.
* Compare times to a compute cluster using PySpark.
* Add more models (KMeans).
* Graph accuracy vs. k (or similar factor for all models).
* Compare all models (accuracy, train time, predict time).
* Refactor code to allow various datasets.