In [3]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ConfusionMatrix
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

Worksheet - Answer - DGA Detection using Machine Learning

This worksheet is a step-by-step guide on how to detect domains that were generated using a "Domain Generation Algorithm" (DGA). We will walk you through the process of transforming raw domain strings into Machine Learning features and creating a decision tree classifier, which you will use to determine whether a given domain is legit or not. Once you have implemented the classifier, the worksheet will walk you through evaluating your model.

Overview of the 2 main steps:

  1. Feature Engineering - from raw domain strings to numeric Machine Learning features using DataFrame manipulations
  2. Machine Learning Classification - predict whether a domain is legit or not using a Decision Tree Classifier

DGA - Background

"Various families of malware use domain generation algorithms (DGAs) to generate a large number of pseudo-random domain names to connect to a command and control (C2) server. In order to block DGA C2 traffic, security organizations must first discover the algorithm by reverse engineering malware samples, then generate a list of domains for a given seed. The domains are then either preregistered, sink-holed or published in a DNS blacklist. This process is not only tedious, but can be readily circumvented by malware authors. An alternative approach to stop malware from using DGAs is to intercept DNS queries on a network and predict whether domains are DGA generated. Much of the previous work in DGA detection is based on finding groupings of like domains and using their statistical properties to determine if they are DGA generated. However, these techniques are run over large time windows and cannot be used for real-time detection and prevention. In addition, many of these techniques also use contextual information such as passive DNS and aggregations of all NXDomains throughout a network. Such requirements are not only costly to integrate, they may not be possible due to real-world constraints of many systems (such as endpoint detection). An alternative to these systems is a much harder problem: detect DGA generation on a per domain basis with no information except for the domain name. Previous work to solve this harder problem exhibits poor performance and many of these systems rely heavily on manual creation of features; a time consuming process that can easily be circumvented by malware authors..."
[Citation: Woodbridge et al. 2016: "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks"]

A better alternative for real-world deployment would be to use "featureless deep learning" - We have a separate notebook where you can see how this can be implemented!

However, let's learn the basics first!!!

Worksheet for Part 2 - Feature Engineering

Breakpoint: Load Features and Labels

If you got stuck in Part 1, please simply load the feature matrix we prepared for you, so you can move on to Part 2 and train a Decision Tree Classifier.


In [5]:
df_final = pd.read_csv('../../data/dga_features_final_df.csv')
print(df_final.isDGA.value_counts())
df_final.head()


1    1000
0    1000
Name: isDGA, dtype: int64
Out[5]:
   isDGA  length  digits   entropy  vowel-cons       ngrams
0      1      13       0  3.546594    0.300000   968.076729
1      1      25      10  3.833270    0.250000   481.067222
2      1      12       0  2.855389    0.090909  1036.365657
3      1      26       6  3.844107    0.052632   708.328718
4      1      12       0  3.084963    0.090909   897.543434

In [6]:
# Load the dictionary of common english words from part 1
from six.moves import cPickle as pickle
with open('../../data/d_common_en_words.pickle', 'rb') as f:
    d = pickle.load(f)

Part 2 - Machine Learning

To learn simple classification procedures using sklearn, we have split the workflow into 5 steps.

Step 1: Prepare the feature matrix and the target vector containing the domain labels

  • In statistics, the feature matrix is often referred to as X
  • target is a vector containing the label for each domain (often also called y in statistics)
  • In sklearn, the input can be a pandas DataFrame or numpy array and the target a pandas Series or numpy vector (they can't be Python lists!)

Tasks:

  • assign the 'isDGA' column to a pandas Series named 'target'
  • drop the 'isDGA' column from the df_final DataFrame and name the resulting pandas DataFrame 'feature_matrix'

In [7]:
target = df_final['isDGA']
feature_matrix = df_final.drop(['isDGA'], axis=1)
print('Final features', feature_matrix.columns)

print( feature_matrix.head())


Final features Index(['length', 'digits', 'entropy', 'vowel-cons', 'ngrams'], dtype='object')
   length  digits   entropy  vowel-cons       ngrams
0      13       0  3.546594    0.300000   968.076729
1      25      10  3.833270    0.250000   481.067222
2      12       0  2.855389    0.090909  1036.365657
3      26       6  3.844107    0.052632   708.328718
4      12       0  3.084963    0.090909   897.543434

Step 2: Simple Cross-Validation

Tasks:

  • Split the feature matrix and the target vector into a training and a test portion using model_selection.train_test_split (the cell below holds out 25% of the samples as the test set)

In [8]:
# Simple Cross-Validation: Split the data set into training and test data
feature_matrix_train, feature_matrix_test, target_train, target_test = model_selection.train_test_split(feature_matrix, target, test_size=0.25, random_state=33)

In [48]:
feature_matrix_train.count()


Out[48]:
length        1500
digits        1500
entropy       1500
vowel-cons    1500
ngrams        1500
dtype: int64

In [49]:
feature_matrix_test.count()


Out[49]:
length        500
digits        500
entropy       500
vowel-cons    500
ngrams        500
dtype: int64

In [50]:
target_train.head()


Out[50]:
1179    0
1529    0
1125    0
1739    0
1303    0
Name: isDGA, dtype: int64

Step 3: Train the model and make a prediction

Finally, we have prepared and segmented the data. Let's start classifying!!

Tasks:

  • Use sklearn's tree.DecisionTreeClassifier() to create a decision tree with default parameters, and train it using the .fit() function with the feature_matrix_train and target_train data.
  • Next, pull a few random rows from the data and see if your classifier got them right (a short sketch for this follows the prediction cell below).

If you are interested in trying a real unknown domain, you'll have to create a function to generate the features for that domain before you run it through the classifier (see function is_dga a few cells below).


In [9]:
# Train a decision tree classifier with default parameters
clf = tree.DecisionTreeClassifier()  # clf means classifier
clf = clf.fit(feature_matrix_train, target_train)

# Extract a row from the test data
test_feature = feature_matrix_test[192:193]
test_target = target_test[192:193]

# Make the prediction
pred = clf.predict(test_feature)
print('Predicted class:', pred)
print('Accurate prediction?', pred[0] == test_target)


Predicted class: [0]
Accurate prediction? 1500    True
Name: isDGA, dtype: bool
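
As referenced in the task above, here is a minimal sketch for spot-checking a few random test rows (it reuses clf, feature_matrix_test and target_test from the cells above; the sample size and random_state are arbitrary choices):

In [ ]:
# Spot-check a handful of randomly sampled test rows against their true labels
sample_features = feature_matrix_test.sample(n=5, random_state=42)
sample_targets = target_test.loc[sample_features.index]
sample_preds = clf.predict(sample_features)
for idx, pred_label, true_label in zip(sample_features.index, sample_preds, sample_targets):
    print('Row %s: predicted %d, actual %d' % (idx, pred_label, true_label))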

In [52]:
feature_matrix_test


Out[52]:
length digits entropy vowel-cons ngrams
766 32 22 3.551109 1.000000 457.838732
182 25 7 4.163856 0.200000 656.989638
1763 12 0 3.251629 0.714286 1508.707071
1814 8 0 2.750000 0.600000 1281.912698
596 12 0 2.855389 0.200000 688.598485
1410 6 0 2.584963 0.500000 1606.894444
1994 13 0 2.661226 0.625000 1645.425214
510 25 7 4.163856 0.636364 778.120676
361 10 0 3.121928 0.428571 1139.212037
1563 13 0 3.238901 0.625000 1249.631313
1917 14 0 3.521641 0.555556 1425.548535
246 13 0 3.238901 0.083333 950.006410
371 32 22 3.593139 0.250000 289.297379
1954 3 1 1.584963 0.000000 766.888889
1314 9 0 2.725481 0.500000 1366.445767
1477 12 0 3.022055 0.714286 1236.629293
1620 8 0 3.000000 0.600000 1488.396825
1639 10 0 2.921928 0.428571 1765.397222
1574 9 0 2.419382 0.800000 1312.410053
884 9 0 2.947703 0.285714 1254.415344
1133 7 0 2.521641 0.400000 1385.012698
1986 5 0 1.370951 1.500000 1201.222222
1118 11 0 2.913977 0.571429 1388.780808
1492 8 0 3.000000 0.333333 1491.962302
317 14 0 3.521641 0.166667 648.047619
633 11 0 3.095795 0.571429 1286.953535
495 12 0 3.418296 0.333333 889.225253
772 10 0 2.446439 0.666667 1362.980556
1120 9 0 2.725481 0.333333 1331.191138
427 26 9 4.161978 0.133333 629.523932
... ... ... ... ... ...
567 15 0 3.373557 0.500000 1274.368376
62 19 0 3.471354 0.461538 1060.748997
1760 10 0 2.921928 1.000000 1829.612037
1057 6 0 2.584963 0.500000 1220.394444
665 27 0 3.902312 0.285714 964.977721
1172 1 0 -0.000000 0.000000 61.333333
1582 7 0 1.664498 1.333333 1109.173016
407 24 10 3.855389 0.272727 297.446860
22 28 0 4.066109 0.555556 1086.298535
1329 10 0 3.121928 0.666667 1285.687963
1017 4 0 1.500000 1.000000 1514.416667
1402 4 0 2.000000 0.333333 922.527778
1100 11 0 3.277613 0.571429 1529.181818
356 12 0 3.022055 0.200000 1118.463131
1350 9 0 2.947703 0.285714 1203.539021
664 26 8 4.132944 0.285714 623.070897
618 32 19 3.632049 0.625000 481.695632
792 27 10 4.088221 0.133333 520.707217
1340 7 0 2.521641 1.333333 1553.320635
1580 2 0 1.000000 1.000000 348.833333
1782 10 0 2.921928 0.666667 1497.694444
191 12 0 3.584963 0.333333 957.512626
1944 13 2 3.700440 0.571429 1265.642968
692 27 14 3.736007 0.444444 466.365622
1096 11 0 2.913977 0.833333 1648.983165
151 21 0 3.272804 0.500000 1371.717168
799 24 4 3.855389 0.250000 688.413977
249 10 0 2.446439 0.666667 1792.381481
260 13 0 3.392747 0.181818 759.509324
1669 14 0 3.182006 0.555556 1606.367521

500 rows × 5 columns


In [10]:
# For simplicity let's just copy the needed function in here again

def H_entropy (x):
    # Calculate Shannon Entropy
    prob = [ float(x.count(c)) / len(x) for c in dict.fromkeys(list(x)) ] 
    H = - sum([ p * np.log2(p) for p in prob ]) 
    return H

def vowel_consonant_ratio(x):
    # Calculate the vowel to consonant ratio
    x = x.lower()
    vowels_pattern = re.compile('([aeiou])')
    consonants_pattern = re.compile('([b-df-hj-np-tv-z])')
    vowels = re.findall(vowels_pattern, x)
    consonants = re.findall(consonants_pattern, x)
    try:
        ratio = len(vowels) / len(consonants)
    except ZeroDivisionError:  # domains without any consonants
        ratio = 0
    return ratio

# ngrams: Implementation according to Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence"
# http://s2lab.isg.rhul.ac.uk/papers/files/dimva2014.pdf

def ngrams(word, n):
    # Extract all ngrams and return a regular Python list
    # Input word: can be a simple string or a list of strings
    # Input n: can be one integer or a list of integers
    # if you want to extract multiple ngrams and have them all in one list

    l_ngrams = []
    if isinstance(word, list):
        for w in word:
            if isinstance(n, list):
                for curr_n in n:
                    l_ngrams.extend([w[i:i+curr_n] for i in range(0, len(w)-curr_n+1)])
            else:
                l_ngrams.extend([w[i:i+n] for i in range(0, len(w)-n+1)])
    else:
        if isinstance(n, list):
            for curr_n in n:
                l_ngrams.extend([word[i:i+curr_n] for i in range(0, len(word)-curr_n+1)])
        else:
            l_ngrams.extend([word[i:i+n] for i in range(0, len(word)-n+1)])
    return l_ngrams

def ngram_feature(domain, d, n):
    # Input is your domain string or list of domain strings,
    # a dictionary object d that contains the counts for the most common english words,
    # and finally n, either an int or a list of ints, defining the ngram length(s)
    
    # Core magic: looks up the domain ngrams in the english dictionary ngrams and sums up the
    # respective english dictionary counts for the respective domain ngram
    # the sum is normalized
    
    l_ngrams = ngrams(domain, n)
    count_sum = 0
    for ngram in l_ngrams:
        count_sum += d.get(ngram, 0)  # .get avoids a KeyError for ngrams not in the dictionary
    try:
        feature = count_sum/(len(domain)-n+1)
    except (ZeroDivisionError, TypeError):  # n may be a list, or len(domain)-n+1 may be 0
        feature = 0
    return feature
    
def average_ngram_feature(l_ngram_feature):
    # input is a list of calls to ngram_feature(domain, d, n)
    # usually you would use various n values, like 1,2,3...
    return sum(l_ngram_feature)/len(l_ngram_feature)
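
To see what the ngrams() helper actually returns, here is a small illustration (no assumptions beyond the helpers defined above):

In [ ]:
# All unigrams, bigrams and trigrams of 'data' in one flat list:
# ['d', 'a', 't', 'a', 'da', 'at', 'ta', 'dat', 'ata']
print(ngrams('data', [1, 2, 3]))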

In [11]:
def is_dga(domain, clf, d):
    # Function that takes new domain string, trained model 'clf' as input and
    # dictionary d of most common english words
    # returns prediction
    
    domain_features = np.empty([1,5])
    # order of features is ['length', 'digits', 'entropy', 'vowel-cons', 'ngrams']
    domain_features[0,0] = len(domain)
    pattern = re.compile('([0-9])')
    domain_features[0,1] = len(re.findall(pattern, domain))
    domain_features[0,2] = H_entropy(domain)
    domain_features[0,3] = vowel_consonant_ratio(domain)
    domain_features[0,4] = average_ngram_feature([ngram_feature(domain, d, 1), 
                                                  ngram_feature(domain, d, 2), 
                                                  ngram_feature(domain, d, 3)])
    pred = clf.predict(domain_features)
    return pred[0]


print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('spardeingeld'), is_dga('spardeingeld', clf, d))
print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('google'), is_dga('google', clf, d))
print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('1vxznov16031kjxneqjk1rtofi6'), is_dga('1vxznov16031kjxneqjk1rtofi6', clf, d))
print('Prediction for domain %s [0 means legit, 1 means dga]:' % ('lthmqglxwmrwex'), is_dga('lthmqglxwmrwex', clf, d))


Prediction for domain spardeingeld [0 means legit, 1 means dga]: 0
Prediction for domain google [0 means legit, 1 means dga]: 0
Prediction for domain 1vxznov16031kjxneqjk1rtofi6 [0 means legit, 1 means dga]: 1
Prediction for domain lthmqglxwmrwex [0 means legit, 1 means dga]: 1

Step 4: Assess model accuracy with simple cross-validation

Tasks:

  • Make predictions on the test data. Call the .predict() method on clf with your test data feature_matrix_test and store the results in a variable called target_pred.
  • Use sklearn metrics.accuracy_score to determine your model's accuracy. Detailed instructions:

    • Use your trained model to predict the labels of the test data: run the .predict() method on clf with feature_matrix_test and store the results in a variable called target_pred.
    • Then calculate the accuracy using target_test (the true labels/ground truth) AND your model's predictions on the test portion, target_pred, as inputs. The advantage is that you see how your model performs on new data it has not seen during the training phase. This fair approach is simple cross-validation!
  • Print out the confusion matrix using metrics.confusion_matrix

  • Use Yellowbrick to visualize the classification report and confusion matrix (http://www.scikit-yb.org/en/latest/examples/modelselect.html#common-metrics-for-evaluating-classifiers)

In [12]:
# fair approach: make prediction on test data portion
target_pred = clf.predict(feature_matrix_test)
print(metrics.accuracy_score(target_test, target_pred))
print('Confusion Matrix\n', metrics.confusion_matrix(target_test, target_pred))


0.854
Confusion Matrix
 [[219  37]
 [ 36 208]]

In [13]:
# Classification Report...neat summary
print(metrics.classification_report(target_test, target_pred, target_names=['legit', 'dga']))


             precision    recall  f1-score   support

      legit       0.86      0.86      0.86       256
        dga       0.85      0.85      0.85       244

avg / total       0.85      0.85      0.85       500


In [14]:
# short-cut
clf.score(feature_matrix_test, target_test)


Out[14]:
0.85399999999999998

In [15]:
viz = ConfusionMatrix(clf)
viz.fit(feature_matrix_train, target_train)
viz.score(feature_matrix_test, target_test)
viz.poof()



In [16]:
viz = ClassificationReport(clf)
viz.fit(feature_matrix_train, target_train)
viz.score(feature_matrix_test, target_test)
viz.poof()


Step 5: Assess model accuracy with k-fold cross-validation

Tasks:

  • Partition the dataset into k different subsets
  • Create k different models by training on k-1 subsets and testing on the remaining subset
  • Measure the performance of each of the models and take the average (a manual sketch of these steps follows below)
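
Done manually, those three steps look roughly like the sketch below (illustrative only; it reuses feature_matrix, target, model_selection, tree and np from earlier cells):

In [ ]:
# Manual k-fold cross-validation, mirroring the three steps above
manual_scores = []
kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=33)
for train_idx, test_idx in kf.split(feature_matrix):
    fold_clf = tree.DecisionTreeClassifier()
    fold_clf.fit(feature_matrix.iloc[train_idx], target.iloc[train_idx])
    manual_scores.append(fold_clf.score(feature_matrix.iloc[test_idx], target.iloc[test_idx]))
print(manual_scores, np.mean(manual_scores))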

Short-Cut: All of these steps can be achieved by simply using sklearn's model_selection.KFold() and model_selection.cross_val_score() functions.


In [59]:
cvKFold = model_selection.KFold(n_splits=3, shuffle=True, random_state=33) 
cvKFold.get_n_splits(feature_matrix)


Out[59]:
3

In [60]:
scores = model_selection.cross_val_score(clf, feature_matrix, target, cv=cvKFold)
print(scores)


[ 0.86656672  0.85457271  0.86786787]

In [61]:
# Get average score +/- standard error (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html)
from scipy.stats import sem
def mean_score( scores ):
    return "Mean score: {0:.3f} (+/- {1:.3f})".format( np.mean(scores), sem( scores ))
print( mean_score( scores))


Mean score: 0.863 (+/- 0.004)

(Optional) Visualizing your Tree

As an optional step, you can visualize your tree. The following code will generate a graph of your decision tree. You will need graphviz (http://www.graphviz.org) and pydotplus (or pydot) installed for this to work. The Griffon VM has these installed already, but if you try this on a Mac or Linux machine, you will need to install graphviz yourself.


In [ ]:
# These libraries are used to visualize the decision tree and require that you have GraphViz
# and pydot or pydotplus installed on your computer.

from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
from IPython.display import Image
import pydotplus as pydot


dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data, feature_names=['length', 'digits', 'entropy', 'vowel-cons', 'ngrams']) 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())
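
If graphviz and pydotplus are not available, newer scikit-learn versions (0.21+) include a matplotlib-based plotter; a minimal sketch, assuming such a version is installed:

In [ ]:
# Graphviz-free alternative using sklearn's built-in matplotlib tree plotter
# (requires scikit-learn >= 0.21); max_depth just keeps the plot readable
plt.figure(figsize=(20, 10))
tree.plot_tree(clf, feature_names=['length', 'digits', 'entropy', 'vowel-cons', 'ngrams'],
               class_names=['legit', 'dga'], filled=True, max_depth=3)
plt.show()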

In [63]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

In [65]:
#Create the Random Forest Classifier
random_forest_clf = RandomForestClassifier(n_estimators=10, 
                             max_depth=None, 
                             min_samples_split=2, 
                             random_state=0)

random_forest_clf = random_forest_clf.fit(feature_matrix_train, target_train)

In [64]:
#Next, create the SVM classifier
svm_classifier = svm.SVC()
svm_classifier = svm_classifier.fit(feature_matrix_train, target_train)
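
Neither of the two new models is evaluated above; a minimal sketch to compare their hold-out accuracy with the decision tree on the same test split (reusing feature_matrix_test and target_test from earlier):

In [ ]:
# Compare hold-out accuracy of the three trained models on the same test split
for name, model in [('Decision Tree', clf),
                    ('Random Forest', random_forest_clf),
                    ('SVM', svm_classifier)]:
    print('%s accuracy: %.3f' % (name, model.score(feature_matrix_test, target_test)))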

In [ ]: