In this notebook we're going to explore, understand and classify PE (Portable Executable) files as being 'benign' or 'malicious'.
http://en.wikipedia.org/wiki/Portable_Executable
The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn with PE File classification as a vehicle for that exploration. The exercise intentionally shows what machine learning experts might call a naive approach, this is for clarity and conciseness. Recommendations for deeper materials and resources are given in the conclusion.
** DISCLAIMER:** This exercise is for illustrative purposes and only uses about 100 samples which is way too small for a generalizable model.
In [190]:
import os
import sklearn.feature_extraction
sklearn.__version__
Out[190]:
In [191]:
import pandas as pd
pd.__version__
Out[191]:
In [192]:
import numpy as np
np.__version__
Out[192]:
In [193]:
# Plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 18.0
plt.rcParams['figure.figsize'] = 16.0, 5.0
In [194]:
def plot_cm(cm, labels):
    # Compute percentages
    percent = (cm*100.0)/np.array(np.matrix(cm.sum(axis=1)).T) # Derp, I'm sure there's a better way
    print 'Confusion Matrix Stats'
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())

    # Show confusion matrix
    # Thanks kermit666 from stackoverflow :)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.grid(b=False)
    cax = ax.matshow(percent, cmap='coolwarm', vmin=0, vmax=100)
    plt.title('Confusion matrix of the classifier')
    fig.colorbar(cax)
    ax.set_xticklabels([''] + labels)
    ax.set_yticklabels([''] + labels)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
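As the "Derp" comment above admits, the row normalization in plot_cm can be done more cleanly. A sketch of the same computation using NumPy broadcasting (keepdims keeps the row sums as a column so the division lines up), shown here on a made-up 2x2 confusion matrix:

```python
import numpy as np

cm = np.array([[18, 2],
               [3, 17]])   # toy confusion matrix, not from the real data

# Row-normalize with broadcasting; no np.matrix gymnastics needed
percent = cm * 100.0 / cm.sum(axis=1, keepdims=True)
print(percent)  # each row now sums to 100
```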
In [195]:
import os, warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
For PE files we want to quickly go from the raw binary file to a feature vector (DataFrame). There are lots of great tools for this: the pefile Python module written by Ero Carrera, and a nice newer GitHub project called peframe (https://github.com/guelfoweb/peframe) by Gianni Amato at http://www.securityside.it. For this exercise we've provided a little wrapper class around the pefile module.
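To make the idea concrete, here's a hypothetical sketch of the kind of mapping such a wrapper performs: flattening a parsed PE object (pefile.PE exposes attributes like FILE_HEADER.NumberOfSections and OPTIONAL_HEADER.CheckSum) into a flat feature dict. This is NOT the actual pe_features module; the feature names and the stub object are illustrative, and the stub just stands in for a real parsed binary:

```python
class _Stub(object):
    """Minimal stand-in for the attribute access a pefile.PE object provides."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

def extract_basic_features(pe):
    # Flatten a parsed PE object into a feature dict (names are illustrative)
    imports = getattr(pe, 'DIRECTORY_ENTRY_IMPORT', [])
    return {
        'number_of_sections': pe.FILE_HEADER.NumberOfSections,
        'number_of_import_symbols': sum(len(entry.imports) for entry in imports),
        'check_sum': pe.OPTIONAL_HEADER.CheckSum,
    }

# Stub standing in for something like pefile.PE(data=open(filename, 'rb').read())
pe = _Stub(FILE_HEADER=_Stub(NumberOfSections=4),
           OPTIONAL_HEADER=_Stub(CheckSum=0xAB12),
           DIRECTORY_ENTRY_IMPORT=[_Stub(imports=['CreateFileA', 'ReadFile'])])
print(extract_basic_features(pe))
```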
In [196]:
import pe_features
my_extractor = pe_features.PEFileFeatures()
# Open a PE File and see what features we get
filename = 'data/bad/0cb9aa6fb9c4aa3afad7a303e21ac0f3'
with open(filename, 'rb') as f:
    features = my_extractor.execute(f.read())
features
Out[196]:
In [197]:
# Load up all our files (files come from various places contagio, around the net...)
def load_files(file_list):
    features_list = []
    for filename in file_list:
        with open(filename, 'rb') as f:
            features_list.append(my_extractor.execute(f.read()))
    return features_list
In [198]:
# Bad (malicious) files
file_list = [os.path.join('data/bad', child) for child in os.listdir('data/bad')]
bad_features = load_files(file_list)
print 'Loaded up %d malicious PE Files' % len(bad_features)
In [199]:
# Good (benign) files
file_list = [os.path.join('data/good', child) for child in os.listdir('data/good')]
good_features = load_files(file_list)
print 'Loaded up %d benign PE Files' % len(good_features)
In [200]:
# Putting the features into a pandas dataframe
import pandas as pd
df_bad = pd.DataFrame.from_records(bad_features)
df_bad['label'] = 'bad'
df_good = pd.DataFrame.from_records(good_features)
df_good['label'] = 'good'
df_good.head()
Out[200]:
In [201]:
# Now we're set and we open up a whole new world!
# Gisting and statistics
df_bad.describe()
Out[201]:
In [202]:
# Visualization I
df_bad['check_sum'].hist(alpha=.5,label='bad',bins=40)
df_good['check_sum'].hist(alpha=.5,label='good',bins=40)
plt.legend()
Out[202]:
In [203]:
# Visualization I
df_bad['generated_check_sum'].hist(alpha=.5,label='bad',bins=40)
df_good['generated_check_sum'].hist(alpha=.5,label='good',bins=40)
plt.legend()
Out[203]:
In [26]:
# Concatenate the info into a big pile!
df = pd.concat([df_bad, df_good], ignore_index=True)
df.replace(np.nan, 0, inplace=True)
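The replace-NaN step above matters because the bad and good DataFrames won't have identical columns (some features only show up in some files), so pd.concat fills the gaps with NaN. A toy sketch of that behavior, using made-up column names:

```python
import pandas as pd

# Two frames with partially overlapping columns (illustrative data)
df_a = pd.DataFrame({'x': [1, 2], 'label': ['bad', 'bad']})
df_b = pd.DataFrame({'x': [3], 'y': [9], 'label': ['good']})

df = pd.concat([df_a, df_b], ignore_index=True)  # 'y' is NaN for df_a's rows
df = df.fillna(0)  # same effect as df.replace(np.nan, 0, inplace=True)
print(df)
```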
In [27]:
# Boxplots show you the distribution of the data (spread).
# http://en.wikipedia.org/wiki/Box_plot
# Get some quick summary stats and plot it!
df.boxplot('number_of_import_symbols','label')
plt.xlabel('bad vs. good files')
plt.ylabel('# Import Symbols')
plt.title('Comparison of # Import Symbols')
plt.suptitle("")
Out[27]:
In [28]:
# Get some quick summary stats and plot it!
df.boxplot('number_of_sections','label')
plt.xlabel('bad vs. good files')
plt.ylabel('Num Sections')
plt.title('Comparison of Number of Sections')
plt.suptitle("")
Out[28]:
In [29]:
# Split the classes up so we can set colors, size, labels
cond = df['label'] == 'good'
good = df[cond]
bad = df[~cond]
plt.scatter(good['number_of_import_symbols'], good['number_of_sections'],
            s=140, c='#aaaaff', label='Good', alpha=.4)
plt.scatter(bad['number_of_import_symbols'], bad['number_of_sections'],
            s=40, c='r', label='Bad', alpha=.5)
plt.legend()
plt.xlabel('Import Symbols')
plt.ylabel('Num Sections')
Out[29]:
In [131]:
# In preparation for using scikit learn we're just going to use
# some handles that help take us from pandas land to scikit land
# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
X = df.as_matrix(['number_of_import_symbols', 'number_of_sections'])
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(df['label'].tolist())
In [132]:
# Random Forest is a popular ensemble machine learning classifier.
# http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
#
import sklearn.ensemble
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=50, compute_importances=True)
In [133]:
# Now we can use scikit learn's cross validation to assess predictive performance.
import sklearn.cross_validation
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=5, n_jobs=4)
print scores
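A self-contained sketch of what cross_val_score is doing, on synthetic data instead of the PE features (note: newer scikit-learn releases moved this to sklearn.model_selection; sklearn.cross_validation is the pre-0.18 location used in this notebook). The two Gaussian blobs and all parameters here are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation pre-0.18

# Two well-separated synthetic classes standing in for good/bad feature vectors
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array(['bad'] * 50 + ['good'] * 50)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # 5 accuracy scores, one per fold
print(scores)
```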
In [134]:
# Typically you train/test on an 80% / 20% split meaning you train on 80%
# of the data and you test against the remaining 20%. In the case of this
# exercise we have so FEW samples (50 good/50 bad) that if we're going
# to play around with predictive performance it's more meaningful
# to train on 60% of the data and test against the remaining 40%.
my_seed = 123
my_tsize = .4 # 40%
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
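A quick sanity check on how the 60/40 split partitions the data, on a tiny synthetic set (again using the modern sklearn.model_selection import path; the numbers here are illustrative, not the notebook's real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation pre-0.18

X = np.arange(20).reshape(10, 2)                 # 10 toy samples
y = np.array(['bad'] * 5 + ['good'] * 5)

# test_size=.4 holds out 40% of the rows; random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)
print(len(X_train), len(X_test))
```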
In [135]:
# Now plot the results of the 60/40 split in a confusion matrix
from sklearn.metrics import confusion_matrix
labels = ['good', 'bad']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
In [140]:
# Okay now try putting in ALL the features (except the label, which would be cheating :)
no_label = list(df.columns.values)
no_label.remove('label')
X = df.as_matrix(no_label)
# 60/40 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
In [141]:
# Feature Selection
# Which features best differentiated the two classes?
# Here we're going to grab the feature_importances from the classifier itself,
# you can also use a Chi Squared Test sklearn.feature_selection.SelectKBest(chi2)
importances = zip(no_label, clf.feature_importances_)
importances.sort(key=lambda k:k[1], reverse=True)
importances[:10]
Out[141]:
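One portability note: list.sort on the result of zip() works on Python 2 (where zip returns a list) but not Python 3 (where it returns an iterator). A small sketch of the same ranking using sorted(), with made-up feature names and importance scores:

```python
import numpy as np

# Illustrative names/scores, not the real classifier output
names = ['check_sum', 'number_of_sections', 'entropy']
scores = np.array([0.10, 0.55, 0.35])

# sorted() materializes the pairs, so this works on both Python 2 and 3
importances = sorted(zip(names, scores), key=lambda k: k[1], reverse=True)
print(importances)
```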
In [142]:
# Produce an X matrix with only the most important features
X = df.as_matrix([item[0] for item in importances[:10]])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
In [143]:
# Compute the prediction probabilities and use them to minimize our false positives
# Note: This is simply a trade-off; it means we'll miss a few of the malicious
# ones, but false alarms are typically a death blow to any new 'fancy stuff', so
# we definitely want to minimize the false alarms.
y_probs = clf.predict_proba(X_test)[:,0]
thres = .8 # This can be set to whatever you'd like
y_pred[y_probs<thres] = 'good'
y_pred[y_probs>=thres] = 'bad'
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
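The thresholding step above relies on column 0 of predict_proba lining up with clf.classes_[0], which for string labels sorted alphabetically is 'bad'. A toy sketch of that re-labeling step on made-up probabilities, showing how a stricter cutoff only flags samples the model is quite sure about:

```python
import numpy as np

# Made-up P('bad') values for four test samples (column 0 of predict_proba
# corresponds to clf.classes_[0], here 'bad')
y_probs = np.array([0.95, 0.55, 0.30, 0.85])
thres = 0.8  # only call it 'bad' when we're at least 80% sure

y_pred = np.where(y_probs >= thres, 'bad', 'good')
print(y_pred)
```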
The combination of IPython, Pandas and scikit-learn lets us pull in PE files, extract features, plot them, understand them and slap them with some machine learning!
As mentioned in the disclaimer, the biggest issue with this particular exercise is the small number of samples.
There are some really great machine learning resources that cover this material on a deeper and more formal level. In particular we highly recommend the work and presentations of Olivier Grisel at INRIA Saclay. http://ogrisel.com/