In [3]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ConfusionMatrix
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

Worksheet - Answer - DGA Detection using Machine Learning

This worksheet is a step-by-step guide on how to detect domains that were generated using a "Domain Generation Algorithm" (DGA). We will walk you through the process of transforming raw domain strings into Machine Learning features and creating a decision tree classifier, which you will use to determine whether a given domain is legit or not. Once you have implemented the classifier, the worksheet will walk you through evaluating your model.

Overview of the 2 main steps:

  1. Feature Engineering - from raw domain strings to numeric Machine Learning features using DataFrame manipulations
  2. Machine Learning Classification - predict whether a domain is legit or not using a Decision Tree Classifier

DGA - Background

"Various families of malware use domain generation algorithms (DGAs) to generate a large number of pseudo-random domain names to connect to a command and control (C2) server. In order to block DGA C2 traffic, security organizations must first discover the algorithm by reverse engineering malware samples, then generate a list of domains for a given seed. The domains are then either preregistered, sink-holed or published in a DNS blacklist. This process is not only tedious, but can be readily circumvented by malware authors. An alternative approach to stop malware from using DGAs is to intercept DNS queries on a network and predict whether domains are DGA generated. Much of the previous work in DGA detection is based on finding groupings of like domains and using their statistical properties to determine if they are DGA generated. However, these techniques are run over large time windows and cannot be used for real-time detection and prevention. In addition, many of these techniques also use contextual information such as passive DNS and aggregations of all NXDomains throughout a network. Such requirements are not only costly to integrate, they may not be possible due to real-world constraints of many systems (such as endpoint detection). An alternative to these systems is a much harder problem: detect DGA generation on a per domain basis with no information except for the domain name. Previous work to solve this harder problem exhibits poor performance and many of these systems rely heavily on manual creation of features; a time consuming process that can easily be circumvented by malware authors..."
[Citation: Woodbridge et al. 2016: "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks"]

A better alternative for real-world deployment would be to use "featureless deep learning" - We have a separate notebook where you can see how this can be implemented!

However, let's learn the basics first!!!

Worksheet for Part 2 - Feature Engineering

Breakpoint: Load Features and Labels

If you got stuck in Part 1, please simply load the feature matrix we prepared for you, so you can move on to Part 2 and train a Decision Tree Classifier.


In [5]:
df_final = pd.read_csv('../../data/dga_features_final_df.csv')
print(df_final.isDGA.value_counts())
df_final.head()


1    1000
0    1000
Name: isDGA, dtype: int64
Out[5]:
   isDGA  length  digits   entropy  vowel-cons       ngrams
0      1      13       0  3.546594    0.300000   968.076729
1      1      25      10  3.833270    0.250000   481.067222
2      1      12       0  2.855389    0.090909  1036.365657
3      1      26       6  3.844107    0.052632   708.328718
4      1      12       0  3.084963    0.090909   897.543434

In [6]:
# Load the dictionary of common english words from part 1
from six.moves import cPickle as pickle
with open('../../data/d_common_en_words.pickle', 'rb') as f:
    d = pickle.load(f)

Part 2 - Machine Learning

To learn simple classification procedures using sklearn, we have split the workflow into 5 steps.

Step 1: Prepare the feature matrix and the target vector containing the domain labels

  • In statistics, the feature matrix is often referred to as X
  • target is a vector containing the label for each domain (often also called y in statistics)
  • In sklearn, the input can be a pandas DataFrame or numpy array and the target a pandas Series or numpy vector (they can't be Python lists!)

Tasks:

  • assign the 'isDGA' column to a pandas Series named 'target'
  • drop the 'isDGA' column from the df_final DataFrame and name the resulting pandas DataFrame 'feature_matrix'

In [7]:
target = df_final['isDGA']
feature_matrix = df_final.drop(['isDGA'], axis=1)
print('Final features', feature_matrix.columns)

print( feature_matrix.head())


Final features Index(['length', 'digits', 'entropy', 'vowel-cons', 'ngrams'], dtype='object')
   length  digits   entropy  vowel-cons       ngrams
0      13       0  3.546594    0.300000   968.076729
1      25      10  3.833270    0.250000   481.067222
2      12       0  2.855389    0.090909  1036.365657
3      26       6  3.844107    0.052632   708.328718
4      12       0  3.084963    0.090909   897.543434

Step 2: Simple Cross-Validation

Tasks:

  • Split the feature matrix and the target vector into a training and a test portion using model_selection.train_test_split (the cell below holds out 25% of the samples as the test set)

In [8]:
# Simple Cross-Validation: Split the data set into training and test data
feature_matrix_train, feature_matrix_test, target_train, target_test = model_selection.train_test_split(feature_matrix, target, test_size=0.25, random_state=33)

In [48]:
feature_matrix_train.count()


Out[48]:
length        1500
digits        1500
entropy       1500
vowel-cons    1500
ngrams        1500
dtype: int64

In [49]:
feature_matrix_test.count()


Out[49]:
length        500
digits        500
entropy       500
vowel-cons    500
ngrams        500
dtype: int64

In [50]:
target_train.head()


Out[50]:
1179    0
1529    0
1125    0
1739    0
1303    0
Name: isDGA, dtype: int64

Step 3: Train the model and make a prediction

Finally, we have prepared and segmented the data. Let's start classifying!!

Tasks:

  • Use sklearn's tree.DecisionTreeClassifier() to create a decision tree with default parameters, and train it using the .fit() function with the feature_matrix_train and target_train data.
  • Next, pull a few random rows from the data and see if your classifier got them right (a short sketch for this follows the prediction cell below).

If you are interested in trying a real unknown domain, you'll have to create a function to generate the features for that domain before you run it through the classifier (see function is_dga a few cells below).


In [9]:
# Train a decision tree classifier with default parameters
clf = tree.DecisionTreeClassifier()  # clf means classifier
clf = clf.fit(feature_matrix_train, target_train)

# Extract a row from the test data
test_feature = feature_matrix_test[192:193]
test_target = target_test[192:193]

# Make the prediction
pred = clf.predict(test_feature)
print('Predicted class:', pred)
print('Accurate prediction?', pred[0] == test_target)


Predicted class: [0]
Accurate prediction? 1500    True
Name: isDGA, dtype: bool
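
As referenced in the task above, here is a minimal sketch for spot-checking a few random test rows (it reuses clf, feature_matrix_test and target_test from the cells above; the sample size and random_state are arbitrary choices):

In [ ]:
# Spot-check a handful of randomly sampled test rows against their true labels
sample_features = feature_matrix_test.sample(n=5, random_state=42)
sample_targets = target_test.loc[sample_features.index]
sample_preds = clf.predict(sample_features)
for idx, pred_label, true_label in zip(sample_features.index, sample_preds, sample_targets):
    print('Row %s: predicted %d, actual %d' % (idx, pred_label, true_label))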

In [52]:
feature_matrix_test


Out[52]:
length digits entropy vowel-cons ngrams
766 32 22 3.551109 1.000000 457.838732
182 25 7 4.163856 0.200000 656.989638
1763 12 0 3.251629 0.714286 1508.707071
1814 8 0 2.750000 0.600000 1281.912698
596 12 0 2.855389 0.200000 688.598485
1410 6 0 2.584963 0.500000 1606.894444
1994 13 0 2.661226 0.625000 1645.425214
510 25 7 4.163856 0.636364 778.120676
361 10 0 3.121928 0.428571 1139.212037
1563 13 0 3.238901 0.625000 1249.631313
1917 14 0 3.521641 0.555556 1425.548535
246 13 0 3.238901 0.083333 950.006410
371 32 22 3.593139 0.250000 289.297379
1954 3 1 1.584963 0.000000 766.888889
1314 9 0 2.725481 0.500000 1366.445767
1477 12 0 3.022055 0.714286 1236.629293
1620 8 0 3.000000 0.600000 1488.396825
1639 10 0 2.921928 0.428571 1765.397222
1574 9 0 2.419382 0.800000 1312.410053
884 9 0 2.947703 0.285714 1254.415344
1133 7 0 2.521641 0.400000 1385.012698
1986 5 0 1.370951 1.500000 1201.222222
1118 11 0 2.913977 0.571429 1388.780808
1492 8 0 3.000000 0.333333 1491.962302
317 14 0 3.521641 0.166667 648.047619
633 11 0 3.095795 0.571429 1286.953535
495 12 0 3.418296 0.333333 889.225253
772 10 0 2.446439 0.666667 1362.980556
1120 9 0 2.725481 0.333333 1331.191138
427 26 9 4.161978 0.133333 629.523932
... ... ... ... ... ...
567 15 0 3.373557 0.500000 1274.368376
62 19 0 3.471354 0.461538 1060.748997
1760 10 0 2.921928 1.000000 1829.612037
1057 6 0 2.584963 0.500000 1220.394444
665 27 0 3.902312 0.285714 964.977721
1172 1 0 -0.000000 0.000000 61.333333
1582 7 0 1.664498 1.333333 1109.173016
407 24 10 3.855389 0.272727 297.446860
22 28 0 4.066109 0.555556 1086.298535
1329 10 0 3.121928 0.666667 1285.687963
1017 4 0 1.500000 1.000000 1514.416667
1402 4 0 2.000000 0.333333 922.527778
1100 11 0 3.277613 0.571429 1529.181818
356 12 0 3.022055 0.200000 1118.463131
1350 9 0 2.947703 0.285714 1203.539021
664 26 8 4.132944 0.285714 623.070897
618 32 19 3.632049 0.625000 481.695632
792 27 10 4.088221 0.133333 520.707217
1340 7 0 2.521641 1.333333 1553.320635
1580 2 0 1.000000 1.000000 348.833333
1782 10 0 2.921928 0.666667 1497.694444
191 12 0 3.584963 0.333333 957.512626
1944 13 2 3.700440 0.571429 1265.642968
692 27 14 3.736007 0.444444 466.365622
1096 11 0 2.913977 0.833333 1648.983165
151 21 0 3.272804 0.500000 1371.717168
799 24 4 3.855389 0.250000 688.413977
249 10 0 2.446439 0.666667 1792.381481
260 13 0 3.392747 0.181818 759.509324
1669 14 0 3.182006 0.555556 1606.367521

500 rows × 5 columns


In [10]:
# For simplicity let's just copy the needed function in here again

def H_entropy (x):
    # Calculate Shannon Entropy
    prob = [ float(x.count(c)) / len(x) for c in dict.fromkeys(list(x)) ] 
    H = - sum([ p * np.log2(p) for p in prob ]) 
    return H

def vowel_consonant_ratio(x):
    # Calculate the vowel to consonant ratio
    x = x.lower()
    vowels_pattern = re.compile('([aeiou])')
    consonants_pattern = re.compile('([b-df-hj-np-tv-z])')
    vowels = re.findall(vowels_pattern, x)
    consonants = re.findall(consonants_pattern, x)
    try:
        ratio = len(vowels) / len(consonants)
    except ZeroDivisionError:  # domains without any consonants
        ratio = 0
    return ratio

# ngrams: Implementation according to Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence"
# http://s2lab.isg.rhul.ac.uk/papers/files/dimva2014.pdf

def ngrams(word, n):
    # Extract all ngrams and return a regular Python list
    # Input word: can be a simple string or a list of strings
    # Input n: can be one integer or a list of integers
    # if you want to extract multiple ngrams and have them all in one list

    l_ngrams = []
    if isinstance(word, list):
        for w in word:
            if isinstance(n, list):
                for curr_n in n:
                    l_ngrams.extend([w[i:i+curr_n] for i in range(0, len(w)-curr_n+1)])
            else:
                l_ngrams.extend([w[i:i+n] for i in range(0, len(w)-n+1)])
    else:
        if isinstance(n, list):
            for curr_n in n:
                l_ngrams.extend([word[i:i+curr_n] for i in range(0, len(word)-curr_n+1)])
        else:
            l_ngrams.extend([word[i:i+n] for i in range(0, len(word)-n+1)])
    return l_ngrams

def ngram_feature(domain, d, n):
    # Input is your domain string or list of domain strings,
    # a dictionary object d that contains the counts for the most common english words,
    # and finally n, either an int or a list of ints, defining the ngram length(s)
    
    # Core magic: looks up the domain ngrams in the english dictionary ngrams and sums up the
    # respective english dictionary counts for the respective domain ngram
    # the sum is normalized
    
    l_ngrams = ngrams(domain, n)
    count_sum = 0
    for ngram in l_ngrams:
        count_sum += d.get(ngram, 0)  # .get avoids a KeyError for ngrams not in the dictionary
    try:
        feature = count_sum/(len(domain)-n+1)
    except (ZeroDivisionError, TypeError):  # n may be a list, or len(domain)-n+1 may be 0
        feature = 0
    return feature
    
def average_ngram_feature(l_ngram_feature):
    # input is a list of calls to ngram_feature(domain, d, n)
    # usually you would use various n values, like 1,2,3...
    return sum(l_ngram_feature)/len(l_ngram_feature)
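
To see what the ngrams() helper actually returns, here is a small illustration (no assumptions beyond the helpers defined above):

In [ ]:
# All unigrams, bigrams and trigrams of 'data' in one flat list:
# ['d', 'a', 't', 'a', 'da', 'at', 'ta', 'dat', 'ata']
print(ngrams('data', [1, 2, 3]))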

In [11]:
def is_dga(domain, clf, d):
    # Function that takes new domain string, trained model 'clf' as input and
    # dictionary d of most common english words
    # returns prediction
    
    domain_features = np.empty([1,5])
    # order of features is ['length', 'digits', 'entropy', 'vowel-cons', 'ngrams']
    domain_features[0,0] = len(domain)
    pattern = re.compile('([0-9])')
    domain_features[0,1] = len(re.findall(pattern, domain))
    domain_features[0,2] = H_entropy(domain)
    domain_features[0,3] = vowel_consonant_ratio(domain)
    domain_features[0,4] = average_ngram_feature([ngram_feature(domain, d, 1), 
                                                  ngram_feature(domain, d, 2), 
                                                  ngram_feature(domain, d, 3)])
    pred = clf.predict(domain_features)
    return pred[0]


print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('spardeingeld'), is_dga('spardeingeld', clf, d))
print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('google'), is_dga('google', clf, d))
print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('1vxznov16031kjxneqjk1rtofi6'), is_dga('1vxznov16031kjxneqjk1rtofi6', clf, d))
print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('lthmqglxwmrwex'), is_dga('lthmqglxwmrwex', clf, d))


Prediction for domain spardeingeld [0 means legit, 1 means dga]: 0
Prediction for domain google [0 means legit, 1 means dga]: 0
Prediction for domain 1vxznov16031kjxneqjk1rtofi6 [0 means legit, 1 means dga]: 1
Prediction for domain lthmqglxwmrwex [0 means legit, 1 means dga]: 1

Step 4: Assess model accuracy with simple cross-validation

Tasks:

  • Make predictions on the test data. Call the .predict() method on clf with your test data feature_matrix_test and store the results in a variable called target_pred.
  • Use sklearn metrics.accuracy_score to determine your model's accuracy. Detailed instructions:

    • Use your trained model to predict the labels of the test data: run the .predict() method on clf with feature_matrix_test and store the results in a variable called target_pred.
    • Then calculate the accuracy using target_test (the true labels/ground truth) AND your model's predictions on the test portion, target_pred, as inputs. The advantage is that you see how your model performs on new data it has not seen during the training phase. This fair approach is simple cross-validation!
  • Print out the confusion matrix using metrics.confusion_matrix

  • Use Yellowbrick to visualize the classification report and confusion matrix (http://www.scikit-yb.org/en/latest/examples/modelselect.html#common-metrics-for-evaluating-classifiers)

In [12]:
# fair approach: make prediction on test data portion
target_pred = clf.predict(feature_matrix_test)
print(metrics.accuracy_score(target_test, target_pred))
print('Confusion Matrix\n', metrics.confusion_matrix(target_test, target_pred))


0.854
Confusion Matrix
 [[219  37]
 [ 36 208]]

In [13]:
# Classification Report...neat summary
print(metrics.classification_report(target_test, target_pred, target_names=['legit', 'dga']))


             precision    recall  f1-score   support

      legit       0.86      0.86      0.86       256
        dga       0.85      0.85      0.85       244

avg / total       0.85      0.85      0.85       500


In [14]:
# short-cut
clf.score(feature_matrix_test, target_test)


Out[14]:
0.85399999999999998

In [15]:
viz = ConfusionMatrix(clf)
viz.fit(feature_matrix_train, target_train)
viz.score(feature_matrix_test, target_test)
viz.poof()



In [16]:
viz = ClassificationReport(clf)
viz.fit(feature_matrix_train, target_train)
viz.score(feature_matrix_test, target_test)
viz.poof()


Step 5: Assess model accuracy with k-fold cross-validation

Tasks:

  • Partition the dataset into k different subsets
  • Create k different models by training on k-1 subsets and testing on the remaining subset
  • Measure the performance of each of the models and take the average (a manual sketch of these steps follows below)
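
Done manually, those three steps look roughly like the sketch below (illustrative only; it reuses feature_matrix, target, model_selection, tree and np from earlier cells):

In [ ]:
# Manual k-fold cross-validation, mirroring the three steps above
manual_scores = []
kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=33)
for train_idx, test_idx in kf.split(feature_matrix):
    fold_clf = tree.DecisionTreeClassifier()
    fold_clf.fit(feature_matrix.iloc[train_idx], target.iloc[train_idx])
    manual_scores.append(fold_clf.score(feature_matrix.iloc[test_idx], target.iloc[test_idx]))
print(manual_scores, np.mean(manual_scores))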

Short-Cut: All of these steps can be achieved by simply using sklearn's model_selection.KFold() and model_selection.cross_val_score() functions.


In [59]:
cvKFold = model_selection.KFold(n_splits=3, shuffle=True, random_state=33) 
cvKFold.get_n_splits(feature_matrix)


Out[59]:
3

In [60]:
scores = model_selection.cross_val_score(clf, feature_matrix, target, cv=cvKFold)
print(scores)


[ 0.86656672  0.85457271  0.86786787]

In [61]:
# Get average score +/- standard error (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html)
from scipy.stats import sem
def mean_score( scores ):
    return "Mean score: {0:.3f} (+/- {1:.3f})".format( np.mean(scores), sem( scores ))
print( mean_score( scores))


Mean score: 0.863 (+/- 0.004)

(Optional) Visualizing your Tree

As an optional step, you can visualize your tree. The following code will generate a graph of your decision tree. You will need graphviz (http://www.graphviz.org) and pydotplus (or pydot) installed for this to work. The Griffon VM has these installed already, but if you try this on a Mac or Linux machine, you will need to install graphviz yourself.


In [ ]:
# These libraries are used to visualize the decision tree and require that you have GraphViz
# and pydot or pydotplus installed on your computer.

from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
from IPython.display import Image
import pydotplus as pydot


dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data, feature_names=['length', 'digits', 'entropy', 'vowel-cons', 'ngrams']) 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())
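
If graphviz and pydotplus are not available, newer scikit-learn versions (0.21+) include a matplotlib-based plotter; a minimal sketch, assuming such a version is installed:

In [ ]:
# Graphviz-free alternative using sklearn's built-in matplotlib tree plotter
# (requires scikit-learn >= 0.21); max_depth just keeps the plot readable
plt.figure(figsize=(20, 10))
tree.plot_tree(clf, feature_names=['length', 'digits', 'entropy', 'vowel-cons', 'ngrams'],
               class_names=['legit', 'dga'], filled=True, max_depth=3)
plt.show()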

In [63]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

In [65]:
#Create the Random Forest Classifier
random_forest_clf = RandomForestClassifier(n_estimators=10, 
                             max_depth=None, 
                             min_samples_split=2, 
                             random_state=0)

random_forest_clf = random_forest_clf.fit(feature_matrix_train, target_train)

In [64]:
#Next, create the SVM classifier
svm_classifier = svm.SVC()
svm_classifier = svm_classifier.fit(feature_matrix_train, target_train)
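
Neither of the two new models is evaluated above; a minimal sketch to compare their hold-out accuracy with the decision tree on the same test split (reusing feature_matrix_test and target_test from earlier):

In [ ]:
# Compare hold-out accuracy of the three trained models on the same test split
for name, model in [('Decision Tree', clf),
                    ('Random Forest', random_forest_clf),
                    ('SVM', svm_classifier)]:
    print('%s accuracy: %.3f' % (name, model.score(feature_matrix_test, target_test)))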

In [ ]: