1. Process ClamAV and Windows Defender Malware Scan Reports

Training labels will be generated from ClamAV, Windows Defender and VirusTotal.com reports.
- vs00251.txt (ClamAV)
- vs00252.txt (ClamAV)
- vs00263.txt (ClamAV)
- vs00264.txt (ClamAV)
- MPDetection2.log (Windows Defender)
- MPDetection3.log (Windows Defender)

In [1]:
from multiprocessing import Pool
import os
from csv import writer
import numpy as np
import pandas as pd
import math
import scipy.misc
import array
import time as tm
import io # compatibility layer between Python 2.x and 3.x: the 2.x built-in open() cannot decode UTF-16 text files.
import re
import matplotlib.pyplot as plt
import seaborn # make it look pretty.

In [2]:
ext_drive = '/opt/vs/'
tfiles1 = os.listdir(ext_drive + "train")
tfiles2 = os.listdir(ext_drive + "train2")

In [ ]:
# First load in the clamav reports and convert to csv files.
file_name = 'data/vs00263.txt'
vfr1 = open(file_name, 'r')
vlines1 = vfr1.readlines()

# Do the next clamav file.
file_name = 'data/vs00264.txt'
vfr2 = open(file_name, 'r')
vlines2 = vfr2.readlines()

# Do the next clamav file.
file_name = 'data/vs00apt.txt'
vfr3 = open(file_name, 'r')
vlines3 = vfr3.readlines()

# Open the output csv file.
fop = open('data/clamav-vs263-264.csv', 'w')
csv_wouter = writer(fop)
cols = ['file_name','malware_type'] # write out the column names.
csv_wouter.writerow(cols)

process_clamav_report(vlines1, csv_wouter)
process_clamav_report(vlines2, csv_wouter)
process_clamav_report(vlines3, csv_wouter)

vfr1.close()
vfr2.close()
vfr3.close()
fop.close()

In [15]:
# First load in the clamav reports and convert to csv files.
file_name = 'data/vs00251.txt'
vfr1 = open(file_name, 'r')
vlines1 = vfr1.readlines()

# Do the next clamav file.
file_name = 'data/vs00252.txt'
vfr2 = open(file_name, 'r')
vlines2 = vfr2.readlines()

# Open the output csv file.
fop = open('data/clamav001.csv', 'w')
csv_wouter = writer(fop)
cols = ['file_name','malware_type'] # write out the column names.
csv_wouter.writerow(cols)

process_clamav_report(vlines1, csv_wouter)
process_clamav_report(vlines2, csv_wouter)

vfr1.close()
vfr2.close()
fop.close()


Processed line number 0 : 46b510e161423a7e626adc3d95440f44 -> Win.Trojan.Dialer-729.
Processed line number 1000 : e616e7a2bfe3f3bea6a3e6d3b6c91c28 -> OK.
Processed line number 2000 : 7fb7c2ec5b58decc43e2331f1517bcda -> Win.Trojan.Morstar-7.
Processed line number 3000 : 1fb2b8ae511fc4a20ef6bc45971b8372 -> OK.
Processed line number 4000 : 62e67f7dd75056c4cda1a95ea5d53127 -> Win.Trojan.Agent-1385451.
Processed line number 5000 : ef559ec5ed3f7281e1e64da2eb96b08b -> Win.Adware.913802-1.
Processed line number 6000 : 20f5f176c8b6f0b28634ff13e36e4d98 -> Win.Adware.Ticnomultibar-1.
Processed line number 7000 : 223c88c24429a84fb3d3774509d05061 -> OK.
Processed line number 8000 : 5039ac8290c69b9118ae747fa5ecb2f8 -> Win.Virus.Elkern-9.
Processed line number 9000 : 6d69e1b1336adbbb0c0ee0d4b00ebd74 -> OK.
Processed line number 10000 : 315762706a9156364caa1e1e4e6922df -> Win.Trojan.Cosmu-4.
Processed line number 11000 : 1eccb6695b7f882fcce1edb70f5535b7 -> Win.Trojan.KillAV-43.
Processed line number 12000 : b6aa84c222bf536271599d39507f5e57 -> Win.Adware.Swiftbrowse-1730.
Processed line number 13000 : 7883ce9ddd8dd4eace12a36d615175a4 -> Win.Virus.Elkern-9.
Processed line number 14000 : 316e4e7e6b3737f77925cdc8369f1282 -> Win.Spyware.78845-2.
Processed line number 15000 : ebecfd8687c5ef1779ecf186f46e3588 -> Win.Worm.Soltern-1.
Processed line number 16000 : d54a2368c27996528e89e5bb43277110 -> Win.Trojan.Llac-7.
Processed line number 17000 : 37d90aa7fcc411a0cc17ea63fd432077 -> OK.
Processed line number 18000 : 087472584651147ba5d81f091405735c -> Win.Adware.Agent-1259944.
Processed line number 19000 : c2a7f1b2587dc45ef41b146f3fc32c95 -> Win.Spyware.78845-2.
Processed line number 20000 : 5662f5e6c9a97814d0383bb61cd722be -> OK.
Processed line number 21000 : 30589eb7e7421977a50977feefbe89b6 -> OK.
Processed line number 22000 : 5e541f64a4df8a69d033dba4489c65ac -> Win.Trojan.Antifw-171.
Processed line number 23000 : fb6f348cdac9a75d7e9ea05975ea20dc -> Win.Adware.Screensaver-1.
Processed line number 24000 : 4b74532a96b9e54abdf7ddbe04b1cf09 -> Win.Trojan.Cosmu-4.
Processed line number 25000 : 261032b0f533002a53b762cfe3e505e4 -> Win.Worm.Fesber-1.
Processed line number 26000 : 693304be2edb52ebb04f207f7b28630d -> Win.Worm.Soltern-1.
Processed line number 27000 : e671e5562baf9caeb3737a3da3665916 -> OK.
Processed line number 28000 : 28a9711cb05d1822c535e265132cb237 -> OK.
Processed line number 29000 : 21a21d57cc84daa2541ec3a4550683e5 -> Win.Adware.Trymedia-3.
Processed line number 30000 : 36d532def6deac3fa9f258dd7e2dceda -> Win.Adware.Downloader-96.
Processed line number 31000 : a49c79bef351028e8041ad84c74e8cf0 -> Win.Trojan.Agent-615299.
Processed line number 32000 : 48c3d4cca7f53efc11f1ce3a30423147 -> Win.Trojan.Agent-1303400.
Processed line number 33000 : 8fadd13f8097dbe4a265ededa201f27d -> Win.Spyware.78845-2.
Processed line number 34000 : 56cc6e996a58bf9072a3d69ab45cd5a1 -> Win.Trojan.11484026-1.
Processed line number 35000 : 2d79e0982c185a07a9b79c7cbc980ba6 -> Win.Worm.Soltern-1.
Processed line number 36000 : 932d99717762f019c5a2a81e76758c69 -> OK.
Processed line number 37000 : 8499b2479ab2c0a75e1462e243e10be6 -> Win.Adware.Firseria-49.
Processed line number 38000 : b65693eb37dc6fb6f216500f0ecf95a5 -> Win.Worm.Soltern-1.
Processed line number 39000 : a9ea9aef3ff4f4fb69bf7e62c75d0da1 -> Win.Adware.Screensaver-1.
Processed line number 40000 : 71fddefc939a5d140d0eaf72d2bf6988 -> Win.Worm.Mydoom-7.
Processed line number 41000 : 98ac9c66e9b527556b9ff39e5d7b146b -> OK.
Processed line number 42000 : c528e2f395e9fc32cbe435171b8dc7bc -> Win.Worm.Soltern-1.
Processed line number 43000 : f0c3c962a6df7f6971f442a1e883df22 -> Win.Trojan.KillAV-43.
Processed line number 44000 : 915d76eb7a79c81ccec0f31dbb3bd620 -> Win.Virus.Elkern-9.
Processed line number 45000 : 950aea1ff71380ceb54aa72ec67d49dc -> OK.
Processed line number 46000 : 0d660a0fe9525fcc9e611a258fde44a3 -> OK.
Processed line number 47000 : 6c870fc138522328c38604d5967c0552 -> OK.
Processed line number 48000 : 61c0551b4ffe7baca2c1d2bc8b2c8faf -> Win.Spyware.78845-2.
Processed line number 49000 : a7e695ec52270669be786f607e9b7abd -> Win.Worm.Mydoom-7.
Processed line number 50000 : 23d9eda0ea20025068ff37077897b3d0 -> Win.Trojan.11484026-1.
Processed line number 51000 : 264814b2ed13705c1c3d054edb659eaf -> Win.Trojan.Morstar-10.
Processed line number 52000 : 891c9f104c187f2faec37926e8decc31 -> Win.Adware.Domaiq-1.
Processed line number 53000 : bd2418c303d8a1e222ad909a75db541a -> OK.
Processed line number 54000 : a8822e02bd298de417abf6078c5cf960 -> OK.
Processed line number 55000 : 8f85d00e031f3e25ecad5aa6b4993dde -> OK.
Processed line number 56000 : 0934dbd32667c1ad7ffc82e23f46d2bb -> OK.
Processed line number 57000 : 2997cc40ed510b082287c03bae908713 -> Win.Trojan.Cosmu-4.
Processed line number 58000 : f2f288185bf843a8f78fd1fda8d70b5b -> Win.Trojan.KillAV-43.
Processed line number 59000 : bfe4aa85fe6ba8aac9f0358693556418 -> Win.Adware.Downloadware-24.
Processed line number 60000 : 1588e0af00cfe84569837e054f3439a2 -> Win.Worm.Soltern-1.
Processed line number 61000 : 47039238d8fa95119967d3f4a4d2d6e8 -> OK.
Processed line number 62000 : d43d907113efd16d61186f50b5b62da6 -> Win.Worm.Soltern-1.
Processed line number 63000 : 52173ed0363f5c4f9b2a375870796668 -> Win.Adware.Screensaver-1.
Processed line number 64000 : de8fee6988694a564b0d583ed7d3416d -> OK.
Processed line number 65000 : 2e32622ad621ef3d8022b0d116687595 -> OK.
Completed processing 65536 lines.
Processed line number 0 : 764bb5e4b9f846c346b5113b2418815c -> Win.Trojan.Cosmu-4.
Processed line number 1000 : 5dc3aa4bf984d4878fcc7c53591bd529 -> OK.
Processed line number 2000 : 2fb19aed6e2505290f804df5bd6c44be -> OK.
Processed line number 3000 : 34f8287b3805d94ed8870189f717b010 -> Win.Trojan.KillAV-43.
Processed line number 4000 : 004534aa078506231c64af27ea2c5da9 -> Win.Adware.Agent-1272749.
Processed line number 5000 : c7ca3432251af15a25d7f43bebb178a7 -> OK.
Processed line number 6000 : 8facc549c37fe12cd9a29bbad777983f -> OK.
Processed line number 7000 : 46138e7ffdf85110164a4f40c5fb51f2 -> Win.Adware.InstallCore-12.
Processed line number 8000 : ce999a9a9883b0be274986d568c64df9 -> Win.Trojan.KillAV-43.
Processed line number 9000 : e551c8a0a62a54208662f8cecdeb3b0c -> Win.Adware.Terkcop-47.
Processed line number 10000 : 76b6e49e04a569a7feaf087d6c004e6a -> Win.Trojan.Dialer-205.
Processed line number 11000 : f661ff9aa90d8c40d5a425d7ebeb7906 -> OK.
Processed line number 12000 : 2be824bcfe205ecd52c0c10734f79f6d -> OK.
Processed line number 13000 : ccdb459c6055b688e023d41f31f80c51 -> OK.
Processed line number 14000 : 2e04a30cdb4816978326743f5d6894dc -> Win.Worm.Brontok-88.
Processed line number 15000 : 755fc7d4f25e625fe00b40b89888f8f4 -> OK.
Processed line number 16000 : 535581c8db98ceeaa7ebf570d8dbaaed -> Win.Worm.Mydoom-7.
Processed line number 17000 : 38534e477ed5b0619e7a1693352840a4 -> Win.Worm.Agent-1297405.
Processed line number 18000 : d0f6ad4cf7209fd61a0eb7f1834f6d02 -> OK.
Processed line number 19000 : 3d725edc309728dfbb73b97911b30c8a -> OK.
Processed line number 20000 : c5f26b48464658dd3070d408ed3aa623 -> Win.Trojan.KillAV-43.
Processed line number 21000 : c17000997e46e3be110786253340ab99 -> Win.Dropper.Delf-2357.
Processed line number 22000 : d904e3a13874a864c3282f44be3749ae -> Win.Trojan.Agent-583204.
Processed line number 23000 : cea5fdd6ccbb62a34e3651d6bdffc905 -> Win.Trojan.KillAV-43.
Processed line number 24000 : e6015921ae3e8fd2d0f955d129fd434b -> OK.
Processed line number 25000 : b612541d0c8083ea427feccf79caa7fb -> Win.Adware.Downloadware-24.
Processed line number 26000 : 0f51c412ca646a38265b6e3200f878af -> Legacy.Trojan.Agent-1388596.
Processed line number 27000 : 77bb3ba6d7b01f9ff744dc7dfdaffad8 -> Win.Trojan.Morstar-10.
Processed line number 28000 : ca07ba50d79c2854482ea28cca78613a -> Legacy.Trojan.Agent-1388596.
Processed line number 29000 : bbe0e9aec0893f581533e635321291f0 -> Win.Trojan.Agent-665233.
Processed line number 30000 : cd21d149edf22d84fa4e4b14f27f0589 -> OK.
Processed line number 31000 : 240404ecbd25346352bbe6257ef77778 -> OK.
Processed line number 32000 : a7f1882ed2fb92b08e1fb5c1dc5385bd -> Win.Trojan.Loadmoney-12128.
Processed line number 33000 : d17e25b40028dbebc22f3831897b2abf -> Win.Adware.Terkcop-8.
Processed line number 34000 : a8f32264e72608335b4cd031ab38fa0e -> Win.Trojan.Morstar-10.
Processed line number 35000 : f5ef392d24dbaf3017a5066391e5dfc9 -> OK.
Processed line number 36000 : 485a0a7e98b8189f640f2b4e3fd003d9 -> Win.Trojan.Agent-1310333.
Processed line number 37000 : c59ff321773477943899d41f882f9a3e -> Win.Trojan.11484026-1.
Processed line number 38000 : d8f8c552af4eabd6cde1b047e212e64f -> Win.Trojan.Agent-120097.
Processed line number 39000 : 42e336071bc7e638be532205ffc762f8 -> Win.Adware.Strictor-731.
Processed line number 40000 : 410ca389583c2223ae4120ace8c03ac9 -> Win.Trojan.KillAV-43.
Processed line number 41000 : 1fe302823462bd945dc4b4bb6c902a26 -> OK.
Processed line number 42000 : 43852b9cbed782ef201146dc84c43fe0 -> Win.Adware.Agent-36731.
Processed line number 43000 : ff085a8cbbffd8b8a080a003624c1e74 -> Win.Trojan.11484026-1.
Processed line number 44000 : 77d5d7acbacc180eaf98d602defea5b0 -> Win.Worm.Mydoom-7.
Processed line number 45000 : ab1be2dc748974874a0003abb57235f2 -> Win.Spyware.78845-2.
Processed line number 46000 : 548c91b6c473fdd720236338ab469d7d -> Win.Trojan.Sality-73159.
Processed line number 47000 : d8ef01e7d4b19fd7b4eeb381053a4dab -> Win.Trojan.KillAV-43.
Processed line number 48000 : 7255dae5f9b2ce33b6e6e412337e4e78 -> OK.
Processed line number 49000 : 246532cffcd3984bd452af5a8ff3f3a3 -> Win.Trojan.Browsefox-2415.
Processed line number 50000 : 22e6eb51d9617d5bcfe2698bfcba58ab -> Win.Trojan.Morstar-10.
Processed line number 51000 : 3bba6bb9ea339162177cc74ee15ee4bd -> Win.Trojan.Cosmu-4.
Processed line number 52000 : 5505f340555301ac50a6ac295765d778 -> Win.Trojan.Cosmu-4.
Processed line number 53000 : 46126d4dc1c41d506d7ff64e9d0109d5 -> Win.Adware.Agent-1381471.
Processed line number 54000 : 236d64108bc882fbad6d1bef25a0cee3 -> Win.Adware.Softpulse-360.
Processed line number 55000 : f8e103042dd3bb7683fbae34bec439aa -> OK.
Processed line number 56000 : 4dc3c8b8a7904e57f0ccc7fd74bf99e8 -> Win.Trojan.Agent-1167536.
Processed line number 57000 : d9c13d1de0aa809ac5f03f92e78dd44b -> Win.Spyware.78845-2.
Processed line number 58000 : 2f7a5628a4ceef68e2ecc136dec0a7fa -> Win.Trojan.Cosmu-4.
Processed line number 59000 : af916fcdb05dbc34f7be5cc201add055 -> Win.Trojan.Installmonster-16.
Processed line number 60000 : d488b88496f3528639904174d6cf9f52 -> Win.Adware.Agent-527253.
Processed line number 61000 : 86547fff445096274e4fdf06084d72c1 -> OK.
Processed line number 62000 : 02e083b07295dc36b8c29f7a57302774 -> Win.Adware.Techsnab-18.
Processed line number 63000 : 2367d86e16dc7a48ac3b043cce6b3514 -> Legacy.Trojan.Agent-1388596.
Processed line number 64000 : 16d9a1601e4e3b57f524f5ab93e47bf2 -> Win.Adware.Screensaver-1.
Processed line number 65000 : d845e56ee4345cbaa059438533857b74 -> Win.Spyware.78845-2.
Completed processing 65536 lines.
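
The conversion cells above open and close every file by hand. A minimal sketch of the same flow with context managers follows, assuming Python 3 (for the newline='' argument). The names convert_reports and toy_parser are hypothetical; toy_parser only stands in for process_clamav_report so that the example runs on its own.

```python
import csv
import os
import tempfile


def convert_reports(report_paths, csv_path, parse_report):
    """Run parse_report over each report and collect the rows in one CSV file."""
    with open(csv_path, 'w', newline='') as fop:
        csv_out = csv.writer(fop)
        csv_out.writerow(['file_name', 'malware_type'])  # column names
        for path in report_paths:
            with open(path, 'r') as report:
                parse_report(report.readlines(), csv_out)


def toy_parser(lines, csv_out):
    # Hypothetical stand-in for process_clamav_report: one "name type" pair per line.
    for line in lines:
        tokens = line.split()
        if len(tokens) == 2:
            csv_out.writerow(tokens)


# Demonstration on a throwaway report file with made-up contents.
tmp_dir = tempfile.mkdtemp()
report_path = os.path.join(tmp_dir, 'report.txt')
with open(report_path, 'w') as f:
    f.write('abc123 OK\ndef456 Win.Worm.Mydoom-7\n')

csv_path = os.path.join(tmp_dir, 'out.csv')
convert_reports([report_path], csv_path, toy_parser)

with open(csv_path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```

With this layout the files are closed even if a parser raises part-way through a report.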

In [14]:
def process_clamav_report(vlines, outfile):
    counter = 0
    outlines = []
    for idx, line in enumerate(vlines):
        if line.startswith('---'): # we hit the scan summary at end of file.
            break
        else:
            line = line.rstrip() # get rid of newlines they are annoying
            line = line.replace('_', ' ').replace(':', ' ') # get rid of these things they are annoying
            tokens = line.split()
            if len(tokens) > 2:
                malware_file_name = tokens[1]
                malware_type = tokens[2]
                outlines.append([malware_file_name, malware_type])
                counter += 1
                if (idx % 1000) == 0: # write out some lines
                    outfile.writerows(outlines)
                    outlines = []
                    print("Processed line number {:d} : {:s} -> {:s}.".format(idx, malware_file_name, malware_type))
            
    # Finish off.
    if (len(outlines) > 0):
        outfile.writerows(outlines)
        outlines = []
        
    print("Completed processing {:d} lines.".format(counter))
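
To make the token indices in process_clamav_report easier to follow, here is a sketch of the same per-line logic applied to two sample lines. The line layout ("<dir>/<prefix>_<hash>: <verdict>") and the VirusShare_ prefix are assumptions inferred from the token positions; parse_clamav_line is a hypothetical helper, not part of the notebook.

```python
def parse_clamav_line(line):
    """Return (file_name, malware_type) for one report line, or None."""
    # Same clean-up as process_clamav_report: '_' and ':' become separators.
    line = line.rstrip().replace('_', ' ').replace(':', ' ')
    tokens = line.split()
    if len(tokens) > 2:
        return tokens[1], tokens[2]
    return None


# Made-up sample lines in the assumed layout.
infected = '/scan/VirusShare_46b510e161423a7e626adc3d95440f44: Win.Trojan.Dialer-729 FOUND'
clean = '/scan/VirusShare_e616e7a2bfe3f3bea6a3e6d3b6c91c28: OK'

print(parse_clamav_line(infected))
print(parse_clamav_line(clean))
```

Because every underscore and colon is turned into a separator, a directory name that contains either character would shift the token positions.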

In [10]:
help(writer)


Help on built-in function writer in module _csv:

writer(...)
    csv_writer = csv.writer(fileobj [, dialect='excel']
                                [optional keyword args])
        for row in sequence:
            csv_writer.writerow(row)
    
        [or]
    
        csv_writer = csv.writer(fileobj [, dialect='excel']
                                [optional keyword args])
        csv_writer.writerows(rows)
    
    The "fileobj" argument can be any object that supports the file API.

2. Load the Training Sample Classifications from ClamAV.

Now generate integer values for the labels based on the malware type. ClamAV does not
recognise all types of malware, so the files it reports as OK are collected and sent to
VirusTotal.com for a second opinion. As there is no standard format for malware type strings,
the VirusTotal results have to be munged into ClamAV-style malware classification strings.
The samples are also scanned with Windows Defender and MalwareBytes Anti-Malware, and the
results are compared.
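
One possible way to turn the type strings into integer labels is sketched here. It is only an illustration: the rule of taking the second dot-separated field as the category, the helper name clamav_category and the toy rows are all assumptions, and the notebook may use a different scheme.

```python
import pandas as pd

# Toy rows; the type strings are taken from the ClamAV output, the file names are made up.
toy = pd.DataFrame({
    'file_name': ['a', 'b', 'c', 'd'],
    'malware_type': ['Win.Trojan.Dialer-729', 'OK', 'Win.Worm.Mydoom-7', 'Win.Trojan.Cosmu-4'],
})


def clamav_category(malware_type):
    """Return the second dot-separated field (e.g. 'Trojan'); short strings such as 'OK' pass through."""
    parts = malware_type.split('.')
    return parts[1] if len(parts) > 2 else malware_type


toy['category'] = toy['malware_type'].map(clamav_category)
# factorize assigns integer codes in order of first appearance.
toy['label'], categories = pd.factorize(toy['category'])
print(toy[['malware_type', 'category', 'label']])
print(list(categories))
```

Strings that do not follow the platform.category.name pattern, such as Heuristics.W32.Parite.B, would need their own handling under this rule.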

In [2]:
# now get the clamav data
clammals = pd.read_csv('data/clamav001.csv')

In [3]:
clammals.head()


Out[3]:
filename malware_type
0 46b510e161423a7e626adc3d95440f44 Win.Trojan.Dialer-729
1 1103c897ed2979339774f48ff47c0203 Win.Trojan.Jorik-10673
2 1835b8c9ed56ca729ad664e4c1725b1c Win.Worm.Mydoom-7
3 da301519b87e8b796ece22b3f4c13429 Win.Trojan.11484026-1
4 579659363281e349a93adfe5cfadf320 Win.Trojan.Sality-8178

In [4]:
clammals.shape


Out[4]:
(131073, 2)

In [5]:
# Now we can assign a numerical value to each malware classification.
# These samples are all classified as OK by ClamAV, so they have to be sent
# to VirusTotal.com for a second opinion.
moks = clammals[clammals['malware_type'] == 'OK']
moks.to_csv('data/malok.csv', index=False)
# Now sort and write out the labels.

In [6]:
moks.head()


Out[6]:
filename malware_type
5 3d91f9da7b6ddd05f7fc3e6854ba51b9 OK
9 afeca052db9266bcdeb97d6f2a61a5e9 OK
12 ba251cd16eb5f6b16efbdd65f28eafc2 OK
17 7d03f1d4bcf044d44dec7396e750bef9 OK
18 9d03b0c2f333fb339e4e47359af759ef OK

In [7]:
moks.shape


Out[7]:
(38898, 2)

3. Process Windows Defender Malware Scan Report.

Windows Defender detected 48678 of the samples. MalwareBytes detected more than 38000, but it
crashed at the end of the scan and its logs could not be recovered.

In [4]:
# First load in the Windows Defender reports and convert to csv files.
# NOTE: Windows Defender logs are UTF-16, so the io module is needed to open them in Python 2.x.
# Two scans were conducted on vs00251 and vs00252.

#file_name = 'data/MPDetection1.log'
#vfr1 = io.open(file_name, mode='r', encoding='utf-16')
#vlines1 = vfr1.readlines()

# print("Read in {:d} lines {:s}".format(len(vlines1), vlines1[0]))

# This log file contains all the detections from vs00251 and vs00252 scans after the second scan.
file_name = 'data/MPDetection2.log'
vfr2 = io.open(file_name, mode='r', encoding='utf-16')
vlines2 = vfr2.readlines()

# Open the output csv file.
fop = open('data/defender-vs251-252.csv', 'w')
csv_wouter = writer(fop)
cols = ['file_name','malware_type'] # write out the column names.
csv_wouter.writerow(cols)

#process_defender_report(vlines1, csv_wouter)
process_defender_report(vlines2, csv_wouter)

#vfr1.close()
vfr2.close()
fop.close()


Processed line number 1000 : bfa8aab3bb5c5084ed0adc2b1874f470 -> Worm:Win32/VB.AT.
Processed line number 2000 : ec9aa46c3fccfaa3bbd01ed4eae73828 -> Worm:Win32/Yuner.A.
Processed line number 3000 : 23b7df82e89fad292f2914d60c19afef -> Worm:Win32/Picsys.C.
Processed line number 4000 : 5b6677a5a4a859c152f01e918940b0e6 -> Adware:Win32/Hotbar.
Processed line number 5000 : 92c99706bd4fbe4ae0bd547177f45ca4 -> Worm:Win32/Mydoom.O@mm.
Processed line number 6000 : 7330785b76d3a07e5907c849faa5123c -> Adware:Win32/Hotbar.
Processed line number 7000 : a7ef64cdca6b96a9938fa5f1567063bb -> Backdoor:Win32/Optixpro.T.
Processed line number 8000 : 62e8f66e9c5134440e2c074aca4c96dc -> Worm:Win32/Yuner.A.
Processed line number 9000 : aa3078efdc8db3fe1b44440cc86add2b -> BrowserModifier:Win32/Diplugem.
Processed line number 10000 : 56a6157b79f6bc5b9e030ca0ba45483c -> Worm:Win32/Soltern.L.
Processed line number 11000 : 6f6de894ec3d98c1dc0bc56470d1e8d5 -> Adware:Win32/Hotbar.
Processed line number 12000 : 1d8b5d6dcec7f29a237ea5e3b07d7ce1 -> Trojan:Win32/Dorv.B!rfn.
Processed line number 13000 : 1e0bfa393477a1889b86bd0310c369c3 -> Trojan:Win32/Skeeyah.A!rfn.
Processed line number 14000 : 4bc70b2454b0b31c9fad2bf37bfdf34c -> TrojanDownloader:Win32/Horst.I.
Processed line number 15000 : 26c90a275f71a7bb83e8db42bb5f36e4 -> PWS:Win32/OnLineGames.IZ.
Processed line number 16000 : f59e0df7abbc1e68baa42b8dccca9a7d -> Worm:Win32/Picsys.C.
Processed line number 17000 : faf31ec08304ff84b7e6d045ab8ba282 -> TrojanDropper:Win32/Sventore.B.
Processed line number 18000 : 201814f8919502ad50bf2350556d7333 -> Worm:Win32/Yuner.A.
Processed line number 19000 : 76d2c2f4d6cf9fadc231203e382e34c9 -> Worm:Win32/Yuner.A.
Processed line number 20000 : feab6dbc065f531ed661eb20558376c8 -> Exploit:HTML/IframeRef.gen.
Processed line number 21000 : 902cdaa00a49e4914a208887c3d62b69 -> Worm:Win32/VB.AT.
Processed line number 22000 : 50b2cc7f2ec4e9b3c2b87d91cea0be47 -> Worm:Win32/Yuner.A.
Processed line number 23000 : a676b3c210e948a38e3076fc584c81ba -> Worm:Win32/Soltern.L.
Processed line number 24000 : c2da66d313eb5047666f90729d8f290c -> Worm:Win32/Rebhip.Z.
Processed line number 25000 : aa97a3e824539cc7798c5f0ce440a4b9 -> Adware:Win32/Hotbar.
Processed line number 26000 : a904a0e5325f81304e4472b93375a899 -> SoftwareBundler:Win32/Ogimant.
Processed line number 27000 : 5819c4f3075849fe5d54398e887e4f3e -> Worm:Win32/Soltern.L.
Processed line number 28000 : b61f7d23a62f578b79fca285ca077d02 -> Adware:Win32/Hotbar.
Processed line number 29000 : 422debc6e552370810f5db1e94ebaa8a -> PWS:Win32/OnLineGames.IZ.
Processed line number 30000 : a4ba4ad313645f0363719af9877ce3e5 -> Worm:Win32/Yuner.A.
Processed line number 31000 : 4ef9cdec90f7352d735dfe5af1385ec0 -> Trojan:Win32/Desurou.A.
Processed line number 32000 : a39469eccbe42568ea4d934fe0848586 -> TrojanSpy:Win32/Flux.V.
Processed line number 33000 : 684263a90f2b82f7b5617bcb24537aca -> Worm:Win32/Soltern.L.
Processed line number 34000 : 49c2862a54d7b5dd6be69b544eb18289 -> SoftwareBundler:Win32/Ogimant.
Processed line number 35000 : d143eecf8464e2de850973bd3c2c1b1e -> TrojanDownloader:Win32/Delf.HO.
Processed line number 36000 : f8482b81ec2d61eae87a580e1c55fdf1 -> PWS:Win32/Gamania.gen!B.
Processed line number 37000 : e7e31eb3bfee24c1a094df7671f3968c -> PWS:Win32/OnLineGames.HM.
Processed line number 38000 : 27bdc33d837f7ecc603cd682c6a8e68b -> PWS:Win32/OnLineGames.IZ.
Processed line number 39000 : 61bb01928ae0e58efcf3e8592202f4f8 -> Worm:Win32/Soltern.L.
Processed line number 40000 : e2405a227ee52f08bf3db36feb227602 -> Trojan:Win32/Dynamer!ac.
Processed line number 41000 : 641f2a224bab961a64ef629225d25cfe -> Adware:Win32/Hotbar.
Processed line number 42000 : 71db368c608d43c185f2134ea49234f3 -> Worm:Win32/VB.AT.
Processed line number 43000 : 568dfc391b39f4bf2aa884eb77cddb9c -> Worm:Win32/Soltern.L.
Processed line number 44000 : 92764a9e0d36aeba8f5edbf8a99d3446 -> Worm:Win32/Mydoom.O@mm.
Processed line number 45000 : cca1fc7f4a88ed67d5d12051846bd4c7 -> TrojanDownloader:Win32/Delf.
Processed line number 46000 : f58f24b18b9da4ed0541c6eba170e3b9 -> Adware:Win32/Hotbar.
Processed line number 47000 : 2734f61320c240035b215119448e6438 -> Trojan:Win32/Startpage.RH.
Processed line number 48000 : 003ae9597da21136ff7eeb3b865d08d8 -> PWS:Win32/OnLineGames.IZ.
Processed line number 49000 : 13d56d43a46cea462ec858e4495e6e87 -> Virus:Win32/Parite.B.
Processed line number 50000 : 3762bd454fd404e307c3ff3f4262ff00 -> Virus:Win32/Viking.IT.
Processed line number 51000 : f18dbb50e8ba07c62674b57cbd36a608 -> TrojanDownloader:Win32/Perkesh.gen!A.
Processed line number 52000 : 2f25ff3702b6d5539c570a77e3e87be4 -> Rogue:Win32/Onescan.
Processed line number 53000 : 4caf51cdccad449fb31f913de1120ee1 -> Worm:Win32/Mydoom.O@mm.
Processed line number 54000 : 0e04c8fc0a6eb715c49bc9c941539f42 -> Trojan:JS/Redirector.QD.
Processed line number 55000 : 78669a403b458e7f1ed3a417856157b6 -> Exploit:HTML/IframeRef.gen.
Processed line number 56000 : 48415fff5a9b6bc48373a1e3bd069762 -> TrojanSpy:Win32/Skeeyah.A!rfn.
Processed line number 57000 : 319afa74786ab7d7f4611f2352a955bf -> Trojan:Win32/Piptea.E.
Processed line number 58000 : c512dc472577a0821a330cd51344ccd1 -> Trojan:JS/Iframeinject.
Processed line number 59000 : f06645f4f9f34b6e911f3f2117d538ee -> TrojanDownloader:Win32/Vxidl.
Processed line number 60000 : 3120b84d32f9a62baf7921443ab28b44 -> Worm:Win32/VB.AT.
Processed line number 61000 : 6b8f76291f3c74505dcf7c3b8f5ad8f8 -> Worm:Win32/VB.AT.
Processed line number 62000 : a5d2fa19443d759ed40b31f2490143b9 -> Worm:Win32/VB.AT.
Processed line number 63000 : e0ae62e3ac1c7fb6f7607169caf78a25 -> Worm:Win32/VB.AT.
Processed line number 64000 : 1e815c88dfd94654800e5a0300f8189a -> Trojan:JS/Redirector.PR.
Processed line number 65000 : d6ba85fe04e6c52e0dc471e6f4ea1450 -> TrojanDownloader:Win32/Banload.
Processed line number 66000 : 19c6595aa10e4233fe6e4743ae6c39a1 -> TrojanDownloader:Win32/VB.LV.
Processed line number 67000 : 221122e70be4e0e8e800bcd88646a75b -> Worm:Win32/Yuner.A.
Processed line number 68000 : 5a626d7c752f715ab501a87d9f69902d -> Worm:Win32/Yuner.A.
Processed line number 69000 : 8f9356422a802c643899aa520d2c21ff -> Worm:Win32/Yuner.A.
Processed line number 70000 : c375f657fb12d6ae6b460622e8bb0a9c -> Worm:Win32/Yuner.A.
Processed line number 71000 : fb786cbd050b4a86c6c75a6619fb1066 -> Worm:Win32/Yuner.A.
Processed line number 72000 : 5fd6c0dfac21bc2cef06e28fcaa82ae7 -> Rogue:Win32/FakeRean.
Processed line number 73000 : 13318190a0209d9e0cf1f68195f086af -> PWS:Win32/Dozmot.D.
Processed line number 74000 : 714da811ba272bce2c7decddf0884dd7 -> Adware:Win32/EoRezo.
Processed line number 75000 : d4cbe8de57d857eabd422f9837fbb6ab -> Adware:Win32/EoRezo.
Processed line number 76000 : 7b17f9ad8aa4b42dc67ea781dbda50a5 -> VirTool:Win32/Obfuscator.ZG.
Processed line number 77000 : 9a72ea2b60340711f33f92433c86d091 -> TrojanClicker:JS/Faceliker.A.
Processed line number 78000 : 375aa6d7f86fa294dde2a05c4a139525 -> PWS:Win32/OnLineGames.IZ.
Processed line number 79000 : c6b157336a1a131a119b77dd9a36109d -> PWS:Win32/OnLineGames.IZ.
Processed line number 80000 : 388face1abda03a5a730e4aa750ad63b -> SoftwareBundler:Win32/Ogimant.
Processed line number 81000 : a68d1f08fc8691f3d0bed2d4711e96af -> SoftwareBundler:Win32/Ogimant.
Processed line number 82000 : 4786460cd2268ee4056df27802f2014e -> Trojan:Win32/Rimecud.A.
Processed line number 83000 : 6682f6eb493824a531cf69da2c4260bd -> Trojan:Win32/Dynamer!ac.
Processed line number 84000 : 5b9f9608433037df89348f33e6bd8519 -> SoftwareBundler:Win32/Dowadmin.
Processed line number 85000 : e4f97546406881d80aceaed5f1810c39 -> Trojan:Win32/Bulta!rfn.
Processed line number 86000 : 089ed3a0fe333e8bfe65bc67ecce1f58 -> PWS:Win32/Lolyda.AT.
Processed line number 87000 : 36733a6300754d7a6d206c758a3e2b2e -> Adware:Win32/Hotbar.
Processed line number 88000 : 78622ad22101e3e21fb0392fc877c879 -> Adware:Win32/Hotbar.
Processed line number 89000 : bc3bf8d11450e63eb02c71968df9f306 -> Adware:Win32/Hotbar.
Processed line number 90000 : b4d224b2041a402f2fd6947b460a862e -> Trojan:Win32/Pornox.A.
Processed line number 91000 : 35c2f1be3aaa833d2d020f2e74199c16 -> BrowserModifier:Win32/Diplugem.
Processed line number 92000 : 7910b003b5001848ba5ca2354e1f7b18 -> BrowserModifier:Win32/Diplugem.
Processed line number 93000 : bfc6ddf8b9e5045e22ea64371c0b762d -> BrowserModifier:Win32/Diplugem.
Processed line number 94000 : 3de6b958c185f51d815721688d43c3c1 -> SoftwareBundler:Win32/ICLoader.
Processed line number 95000 : 863a565e83e0b01d22f9f93dcb915504 -> TrojanDropper:Win32/Sventore.B.
Processed line number 96000 : 009f230b39a622d2ff347758a5cc64ff -> SoftwareBundler:Win32/OutBrowse.
Processed line number 97000 : f4aa2e43f068ebb92f114cef6e468854 -> Worm:Win32/Msblast.A.
Completed processing 97347 lines.
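
The UTF-16 handling used for the Defender logs can be checked in isolation. The sketch below writes a small UTF-16 file with made-up contents and reads it back with io.open, which takes an encoding argument on both Python 2.x and 3.x; the file name toy.log is hypothetical.

```python
import io
import os
import tempfile

# Write a small UTF-16 file to stand in for a Defender log (contents are made up).
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy.log')
with io.open(tmp_path, mode='w', encoding='utf-16') as f:
    f.write(u'first line\nsecond line\n')

# The utf-16 codec consumes the byte order mark, so the lines come back as plain text.
with io.open(tmp_path, mode='r', encoding='utf-16') as f:
    lines = f.readlines()

print(len(lines), repr(lines[0]))
```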

In [2]:
def process_defender_report(vlines, outfile):
    counter = 0
    outlines = []
    for idx, line in enumerate(vlines):
        if line.find('DETECTION') > 0: # only lines that record a detection are processed.
            line = line.rstrip() # get rid of newlines they are annoying
            #line = line.replace('_', ' ').replace(':', ' ') 
            tokens = line.split()
            if len(tokens) > 3: # tokens[3] is read below, so at least four tokens are needed.
                temp_file_name = tokens[3]
                malware_type = tokens[2]
                temp_file_name = temp_file_name.replace('_',' ').replace('->',' ')
                path_tokens = temp_file_name.split()
                malware_file_name = path_tokens[1]
                outlines.append([malware_file_name, malware_type])
                counter += 1
                if (idx % 1000) == 0: # write out some lines
                    outfile.writerows(outlines)
                    outlines = []
                    print("Processed line number {:d} : {:s} -> {:s}.".format(idx, malware_file_name, malware_type))
            
    # Finish off.
    if (len(outlines) > 0):
        outfile.writerows(outlines)
        outlines = []
        
    print("Completed processing {:d} lines.".format(counter))
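
The per-line logic of process_defender_report can be sketched on a single sample line. The layout assumed here (timestamp, the word DETECTION, the threat name, then a file: path ending in <prefix>_<hash>) is inferred from the token indices and is an assumption, as are the sample timestamp, path and the helper name parse_defender_line. This sketch requires four tokens before reading tokens[3].

```python
def parse_defender_line(line):
    """Return (file_name, malware_type) for a DETECTION line, or None."""
    if line.find('DETECTION') <= 0:
        return None  # not a detection record
    tokens = line.rstrip().split()
    if len(tokens) < 4:
        return None
    malware_type = tokens[2]
    # Split the path on '_' and '->' so that the hash becomes its own token.
    path_tokens = tokens[3].replace('_', ' ').replace('->', ' ').split()
    if len(path_tokens) < 2:
        return None
    return path_tokens[1], malware_type


# Made-up sample line in the assumed layout.
sample = ('2016-01-01T00:00:00.000Z DETECTION Worm:Win32/VB.AT '
          'file:C:\\scan\\VirusShare_bfa8aab3bb5c5084ed0adc2b1874f470')
print(parse_defender_line(sample))
print(parse_defender_line('2016-01-01T00:00:00.000Z Scan started'))
```

A path that contains spaces would be split into several tokens by line.split(), so this approach relies on the sample paths being free of whitespace.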

In [4]:
help(pd.DataFrame.drop_duplicates)


Help on method drop_duplicates in module pandas.core.frame:

drop_duplicates(self, cols=None, take_last=False, inplace=False) unbound pandas.core.frame.DataFrame method
    Return DataFrame with duplicate rows removed, optionally only
    considering certain columns
    
    Parameters
    ----------
    cols : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns
    take_last : boolean, default False
        Take the last observed row in a row. Defaults to the first row
    inplace : boolean, default False
        Whether to drop duplicates in place or to return a copy
    
    Returns
    -------
    deduplicated : DataFrame


4. Load the Windows Defender Classifications and Combine with ClamAV Classifications.

- script: combine_av_reports.py

In [8]:
windefmals = pd.read_csv('data/defender001.csv')
windefmals.head()


Out[8]:
filename malware_type
0 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE
1 0004376a62e22f6ad359467eb742b8ff Worm:Win32/Picsys.C
2 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip
3 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG
4 00092d369958b67557da8661cc9093bc Adware:Win32/Hotbar

In [9]:
windefmals.shape


Out[9]:
(97347, 2)

In [10]:
clammals.head()


Out[10]:
filename malware_type
0 46b510e161423a7e626adc3d95440f44 Win.Trojan.Dialer-729
1 1103c897ed2979339774f48ff47c0203 Win.Trojan.Jorik-10673
2 1835b8c9ed56ca729ad664e4c1725b1c Win.Worm.Mydoom-7
3 da301519b87e8b796ece22b3f4c13429 Win.Trojan.11484026-1
4 579659363281e349a93adfe5cfadf320 Win.Trojan.Sality-8178

In [11]:
clammals.shape


Out[11]:
(131073, 2)

In [12]:
131073 - 97347


Out[12]:
33726

In [13]:
moks.head()


Out[13]:
filename malware_type
5 3d91f9da7b6ddd05f7fc3e6854ba51b9 OK
9 afeca052db9266bcdeb97d6f2a61a5e9 OK
12 ba251cd16eb5f6b16efbdd65f28eafc2 OK
17 7d03f1d4bcf044d44dec7396e750bef9 OK
18 9d03b0c2f333fb339e4e47359af759ef OK

In [14]:
moks.shape


Out[14]:
(38898, 2)

In [21]:
allmals = clammals.merge(windefmals, on='file_name', how='outer', indicator=True, sort=True)

In [22]:
allmals.head(20)


Out[22]:
filename malware_type_x malware_type_y _merge
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both
3 00027c21667d9119a454df8cef2dc1c7 OK Trojan:JS/Redirector.QE both
4 0003887ab64b8ae19ffa988638decac2 OK NaN left_only
5 000403e4e488356b7535cc613fbeb80b OK TrojanDownloader:Win32/Fosniw.B both
6 0004376a62e22f6ad359467eb742b8ff Win.Worm.Picsys-1 Worm:Win32/Picsys.C both
7 0004c8b2a0f4680a5694d74199b40ea2 Win.Adware.Loadmoney-12162 SoftwareBundler:Win32/ICLoader both
8 000595d8b586915c12053104cf845097 Win.Adware.Mplug-2637 BrowserModifier:Win32/Diplugem both
9 000634f03457d088c71dbffb897b1315 OK Worm:Win32/Rebhip both
10 00072ed24314e91b63b425b3dc572f50 OK VirTool:Win32/VBInject.UG both
11 00092d369958b67557da8661cc9093bc Win.Trojan.Adinstall-2 Adware:Win32/Hotbar both
12 00093d5fa5cb7ce77f6eaf39962daa12 Win.Adware.Screensaver-1 Adware:Win32/Hotbar both
13 00099926d51b44c6f8c93a48c2567891 OK SoftwareBundler:Win32/OutBrowse both
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 NaN left_only
15 000a2db4762dc06628a086c9e117f884 Win.Trojan.Magania-11227 PWS:Win32/Lolyda.AT both
16 000ac11fa7587b2316470b154254a219 OK NaN left_only
17 000ae2c63ba69fc93dfc395b40bfe03a Heuristics.W32.Parite.B Virus:Win32/Parite.B both
18 000ae90736a51c47543dcc6d8a735362 Win.Adware.Agent-1312925 BrowserModifier:Win32/Diplugem both
19 000b41258d624ef2d6e430822d0c0c8f OK SoftwareBundler:Win32/OutBrowse both
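
The outer merge with indicator=True is easier to read on a toy example. In this sketch the frames clam and defender and their file names are made up; only the merge call mirrors the one used above.

```python
import pandas as pd

clam = pd.DataFrame({
    'file_name': ['aaa', 'bbb', 'ccc'],
    'malware_type': ['Win.Worm.Picsys-1', 'OK', 'OK'],
})
defender = pd.DataFrame({
    'file_name': ['aaa', 'bbb'],
    'malware_type': ['Worm:Win32/Picsys.C', 'Worm:Win32/Rebhip'],
})

# The shared non-key column gets the _x (left) and _y (right) suffixes.
merged = clam.merge(defender, on='file_name', how='outer', indicator=True, sort=True)
print(merged)
# _merge records whether each key was found in both frames or only one of them.
print(merged['_merge'].value_counts())
```

A row marked left_only is a file that appears in the ClamAV table but has no Windows Defender entry, which is why its malware_type_y value is NaN.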

In [27]:
uniq_allmals = allmals.drop_duplicates(subset='file_name', keep='first')

In [28]:
uniq_allmals.head(20)


Out[28]:
filename malware_type_x malware_type_y _merge
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both
3 00027c21667d9119a454df8cef2dc1c7 OK Trojan:JS/Redirector.QE both
4 0003887ab64b8ae19ffa988638decac2 OK NaN left_only
5 000403e4e488356b7535cc613fbeb80b OK TrojanDownloader:Win32/Fosniw.B both
6 0004376a62e22f6ad359467eb742b8ff Win.Worm.Picsys-1 Worm:Win32/Picsys.C both
7 0004c8b2a0f4680a5694d74199b40ea2 Win.Adware.Loadmoney-12162 SoftwareBundler:Win32/ICLoader both
8 000595d8b586915c12053104cf845097 Win.Adware.Mplug-2637 BrowserModifier:Win32/Diplugem both
9 000634f03457d088c71dbffb897b1315 OK Worm:Win32/Rebhip both
10 00072ed24314e91b63b425b3dc572f50 OK VirTool:Win32/VBInject.UG both
11 00092d369958b67557da8661cc9093bc Win.Trojan.Adinstall-2 Adware:Win32/Hotbar both
12 00093d5fa5cb7ce77f6eaf39962daa12 Win.Adware.Screensaver-1 Adware:Win32/Hotbar both
13 00099926d51b44c6f8c93a48c2567891 OK SoftwareBundler:Win32/OutBrowse both
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 NaN left_only
15 000a2db4762dc06628a086c9e117f884 Win.Trojan.Magania-11227 PWS:Win32/Lolyda.AT both
16 000ac11fa7587b2316470b154254a219 OK NaN left_only
17 000ae2c63ba69fc93dfc395b40bfe03a Heuristics.W32.Parite.B Virus:Win32/Parite.B both
18 000ae90736a51c47543dcc6d8a735362 Win.Adware.Agent-1312925 BrowserModifier:Win32/Diplugem both
19 000b41258d624ef2d6e430822d0c0c8f OK SoftwareBundler:Win32/OutBrowse both

In [29]:
uniq_allmals.shape


Out[29]:
(131074, 4)

In [33]:
filled_uniq_allmals = uniq_allmals.replace(np.nan, 'OK') # np.nan is the canonical spelling; the np.NaN alias is gone in NumPy 2.0
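
The NaN values being filled here appear to come from files that only one scanner reported. The following is a minimal sketch of that situation, assuming a left merge with `indicator=True` (which the `_merge` column suggests, but the actual merge call is not shown in this section). The frames `clamav` and `defender` are made up for illustration.

```python
import pandas as pd

clamav = pd.DataFrame({'file_name': ['aaa', 'bbb'],
                       'malware_type': ['OK', 'Win.Worm.Picsys-1']})
defender = pd.DataFrame({'file_name': ['bbb'],
                         'malware_type': ['Worm:Win32/Picsys.C']})

# A left merge keeps every ClamAV row; files Defender did not report get NaN.
merged = clamav.merge(defender, on='file_name', how='left', indicator=True)

# Treat "no Defender report" the same as a clean scan.
merged['malware_type_y'] = merged['malware_type_y'].fillna('OK')
print(merged)
```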

In [34]:
filled_uniq_allmals.head(20)


Out[34]:
filename malware_type_x malware_type_y _merge
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both
3 00027c21667d9119a454df8cef2dc1c7 OK Trojan:JS/Redirector.QE both
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only
5 000403e4e488356b7535cc613fbeb80b OK TrojanDownloader:Win32/Fosniw.B both
6 0004376a62e22f6ad359467eb742b8ff Win.Worm.Picsys-1 Worm:Win32/Picsys.C both
7 0004c8b2a0f4680a5694d74199b40ea2 Win.Adware.Loadmoney-12162 SoftwareBundler:Win32/ICLoader both
8 000595d8b586915c12053104cf845097 Win.Adware.Mplug-2637 BrowserModifier:Win32/Diplugem both
9 000634f03457d088c71dbffb897b1315 OK Worm:Win32/Rebhip both
10 00072ed24314e91b63b425b3dc572f50 OK VirTool:Win32/VBInject.UG both
11 00092d369958b67557da8661cc9093bc Win.Trojan.Adinstall-2 Adware:Win32/Hotbar both
12 00093d5fa5cb7ce77f6eaf39962daa12 Win.Adware.Screensaver-1 Adware:Win32/Hotbar both
13 00099926d51b44c6f8c93a48c2567891 OK SoftwareBundler:Win32/OutBrowse both
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 OK left_only
15 000a2db4762dc06628a086c9e117f884 Win.Trojan.Magania-11227 PWS:Win32/Lolyda.AT both
16 000ac11fa7587b2316470b154254a219 OK OK left_only
17 000ae2c63ba69fc93dfc395b40bfe03a Heuristics.W32.Parite.B Virus:Win32/Parite.B both
18 000ae90736a51c47543dcc6d8a735362 Win.Adware.Agent-1312925 BrowserModifier:Win32/Diplugem both
19 000b41258d624ef2d6e430822d0c0c8f OK SoftwareBundler:Win32/OutBrowse both

In [35]:
filled_uniq_allmals.shape


Out[35]:
(131074, 4)

In [36]:
# Now we have our combined AV results, write to file.
filled_uniq_allmals.to_csv('data/sorted-av-report.csv', index=False)

In [38]:
moks = filled_uniq_allmals[filled_uniq_allmals['malware_type_x'] == 'OK'] 
moks = moks[moks['malware_type_y'] == 'OK']
moks.to_csv('data/malok.csv', index=False)
# These samples are classified as OK by both ClamAV and Windows Defender,
# so they have to be sent to VirusTotal.com for a second opinion.
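
The two-step filter above can also be written as one boolean mask. This is a sketch on a made-up frame; `both_ok` and `n_detected` are names introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'file_name': ['aaa', 'bbb', 'ccc'],
    'malware_type_x': ['OK', 'OK', 'Win.Worm.Picsys-1'],
    'malware_type_y': ['OK', 'Worm:Win32/Rebhip', 'OK'],
})

# True only where both scanners report OK.
both_ok = (toy['malware_type_x'] == 'OK') & (toy['malware_type_y'] == 'OK')
toy_oks = toy[both_ok]

# Samples flagged by at least one scanner.
n_detected = len(toy) - len(toy_oks)
print(toy_oks, n_detected)
```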

In [39]:
moks.head(20)


Out[39]:
filename malware_type_x malware_type_y _merge
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only
16 000ac11fa7587b2316470b154254a219 OK OK left_only
22 000d8dda1d4d1a88276e2b25a064fa43 OK OK left_only
38 001ea100e348e7f72a8f1b5f737dbd0a OK OK left_only
40 001ff0574000822be988193df8166bd2 OK OK left_only
45 00233e467ec20f975b09c3407877f7eb OK OK left_only
47 0025420de5eeae2b56a44366aabdfe7a OK OK left_only
49 0025cc13683331a61986b6433e768f3f OK OK left_only
56 002efa40dbab524e00c66988a51ca1c2 OK OK left_only
88 00423f1656a26c53a787304f27aa60cd OK OK left_only
127 005776d784e6a4e5034bb53ff8f3fd95 OK OK left_only
146 006422cde629ec21311dca5dad8e88c1 OK OK left_only
147 0064f090664d5c8a8b6320558d571922 OK OK left_only
155 006a9a07cf52b8434eb0e7319cb85635 OK OK left_only
157 006b4c72e79e60d10515a64ec6a4e021 OK OK left_only
177 0079676239abcb6cc1619590faf6b9ef OK OK left_only
197 008de8605b54440b784505492bb4dfd1 OK OK left_only
207 00982b38bd54a8ddd63e1cd3bedda310 OK OK left_only
217 009f185063c9e2a2e703fb3aa1ab9065 OK OK left_only
229 00a9482202d47949153783ba4db551aa OK OK left_only

In [40]:
moks.shape


Out[40]:
(16918, 4)

5. Munge the Two Malware Classifications Together and Generate Unique Scalar Values.


In [2]:
mals = pd.read_csv('data/sorted-av-report.csv')
mals.head()


Out[2]:
filename malware_type_x malware_type_y _merge
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both
3 00027c21667d9119a454df8cef2dc1c7 OK Trojan:JS/Redirector.QE both
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only

5 rows × 4 columns


In [3]:
mals.shape


Out[3]:
(131074, 4)

In [5]:
scalar_labels = [0] * mals.shape[0]
len(scalar_labels)


Out[5]:
131074

In [3]:
type_x = mals['malware_type_x']
type_y = mals['malware_type_y']
x_ok = type_x[type_x == 'OK']
y_ok = type_y[type_y == 'OK']
len(x_ok)


Out[3]:
38899

In [5]:
len(y_ok)


Out[5]:
39869

In [6]:
# Now generate the unique scalar label map. ClamAV is the default classification; if ClamAV is OK
# and Defender is not OK, use the Defender classification; if both are OK, default to a label value of 0 for now.
scalar_labels = [0] * mals.shape[0]
label_map = {}
counter = 0
for idx, x_val in enumerate(type_x):
    if x_val == 'OK':
        if type_y.iloc[idx] != 'OK':
            mals.iloc[idx,1] = mals.iloc[idx,2] # copy the defender classification to ClamAV classification
        else:
            continue # leave the scalar label == 0
            
    # now add the classification to the label map with a new scalar value
    if mals.iloc[idx,1] not in label_map.keys():
        counter += 1
        label_map[mals.iloc[idx,1]] = counter
        
    # now get the scalar label for this malware sample
    scalar_labels[idx] = label_map[mals.iloc[idx,1]]
        
mals['label'] = scalar_labels
mals.head(20)


Out[6]:
filename malware_type_x malware_type_y _merge label
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both 1
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both 2
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both 3
3 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE Trojan:JS/Redirector.QE both 4
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only 0
5 000403e4e488356b7535cc613fbeb80b TrojanDownloader:Win32/Fosniw.B TrojanDownloader:Win32/Fosniw.B both 5
6 0004376a62e22f6ad359467eb742b8ff Win.Worm.Picsys-1 Worm:Win32/Picsys.C both 6
7 0004c8b2a0f4680a5694d74199b40ea2 Win.Adware.Loadmoney-12162 SoftwareBundler:Win32/ICLoader both 7
8 000595d8b586915c12053104cf845097 Win.Adware.Mplug-2637 BrowserModifier:Win32/Diplugem both 8
9 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip Worm:Win32/Rebhip both 9
10 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG VirTool:Win32/VBInject.UG both 10
11 00092d369958b67557da8661cc9093bc Win.Trojan.Adinstall-2 Adware:Win32/Hotbar both 11
12 00093d5fa5cb7ce77f6eaf39962daa12 Win.Adware.Screensaver-1 Adware:Win32/Hotbar both 12
13 00099926d51b44c6f8c93a48c2567891 SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 13
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 OK left_only 14
15 000a2db4762dc06628a086c9e117f884 Win.Trojan.Magania-11227 PWS:Win32/Lolyda.AT both 15
16 000ac11fa7587b2316470b154254a219 OK OK left_only 0
17 000ae2c63ba69fc93dfc395b40bfe03a Heuristics.W32.Parite.B Virus:Win32/Parite.B both 16
18 000ae90736a51c47543dcc6d8a735362 Win.Adware.Agent-1312925 BrowserModifier:Win32/Diplugem both 17
19 000b41258d624ef2d6e430822d0c0c8f SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 13

20 rows × 5 columns
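
For comparison, the same labelling rule can be sketched without a Python-level loop. This is an alternative under stated assumptions, not the notebook's method: it relies on `pd.factorize` numbering distinct values in order of first appearance, and the frame `toy` is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'malware_type_x': ['Win.Worm.Tufik-182', 'OK', 'OK', 'Win.Worm.Tufik-182', 'OK'],
    'malware_type_y': ['Virus:Win32/Parite.B', 'Worm:Win32/Rebhip', 'OK', 'OK', 'Worm:Win32/Rebhip'],
})

# Prefer ClamAV; fall back to Defender where ClamAV says OK.
merged = toy['malware_type_x'].where(toy['malware_type_x'] != 'OK', toy['malware_type_y'])

# factorize gives codes 0, 1, 2, ... in order of first appearance;
# shift by one so that 0 stays reserved for samples both scanners call OK.
detected = merged != 'OK'
labels = np.zeros(len(toy), dtype=int)
codes, uniques = pd.factorize(merged[detected])
labels[detected.to_numpy()] = codes + 1
toy['label'] = labels
print(toy)
```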


In [7]:
mals.to_csv('data/sorted_train_labels.csv', index=False)

In [14]:
# Output the malware scalar classifications.
sorted_keys = sorted(label_map.keys()) # sorted() returns a list on both Python 2 and 3.
with open('data/malware-class-labels.csv', 'w') as fop:
    csv_wouter = writer(fop)
    cols = ['malware_type','class'] # write out the column names.
    csv_wouter.writerow(cols)
    # Write one row per classification, in sorted order.
    csv_wouter.writerows([[key, label_map[key]] for key in sorted_keys])

print("Completed processing {:d} labels.".format(counter))


Completed processing 10506 labels.

In [8]:
131074 - 16918


Out[8]:
114156

In [ ]:
help(allmals.replace)

6. Munge the Two Malware Classifications Together and Generate Malware Families.

Experiment 1: use truncated ClamAV or Windows Defender definitions to generate malware families and
assign a scalar training label to each family.

- DEPRECATED: use the code in section 7 below.

In [4]:
mals = pd.read_csv('data/sorted_train_labels.csv')
mals.head(20)


Out[4]:
filename malware_type_x malware_type_y _merge label
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both 1
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both 2
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both 3
3 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE Trojan:JS/Redirector.QE both 4
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only 0
5 000403e4e488356b7535cc613fbeb80b TrojanDownloader:Win32/Fosniw.B TrojanDownloader:Win32/Fosniw.B both 5
6 0004376a62e22f6ad359467eb742b8ff Win.Worm.Picsys-1 Worm:Win32/Picsys.C both 6
7 0004c8b2a0f4680a5694d74199b40ea2 Win.Adware.Loadmoney-12162 SoftwareBundler:Win32/ICLoader both 7
8 000595d8b586915c12053104cf845097 Win.Adware.Mplug-2637 BrowserModifier:Win32/Diplugem both 8
9 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip Worm:Win32/Rebhip both 9
10 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG VirTool:Win32/VBInject.UG both 10
11 00092d369958b67557da8661cc9093bc Win.Trojan.Adinstall-2 Adware:Win32/Hotbar both 11
12 00093d5fa5cb7ce77f6eaf39962daa12 Win.Adware.Screensaver-1 Adware:Win32/Hotbar both 12
13 00099926d51b44c6f8c93a48c2567891 SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 13
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 OK left_only 14
15 000a2db4762dc06628a086c9e117f884 Win.Trojan.Magania-11227 PWS:Win32/Lolyda.AT both 15
16 000ac11fa7587b2316470b154254a219 OK OK left_only 0
17 000ae2c63ba69fc93dfc395b40bfe03a Heuristics.W32.Parite.B Virus:Win32/Parite.B both 16
18 000ae90736a51c47543dcc6d8a735362 Win.Adware.Agent-1312925 BrowserModifier:Win32/Diplugem both 17
19 000b41258d624ef2d6e430822d0c0c8f SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 13

20 rows × 5 columns


In [3]:
mals.shape


Out[3]:
(131074, 4)

In [7]:
# Now generate unique scalar label map for malware families, converting Windows Defender format to ClamAV as necessary.

type_x = mals['malware_type_x']
scalar_labels = [0] * mals.shape[0]
family_labels = [' '] * mals.shape[0]
family_label_map = {}
sample_counter_map = {}
family_counter_map = {}
counter = 0
p1 = re.compile(r'(\w+):(\w+)/(\w+)[!./-]+(\w+)') # Windows Defender malware definition patterns
p2 = re.compile(r'(\w+):(\w+)/(\w+)')
pcav = re.compile(r'(\w+)\.(\w+)\.(\w+)[!./-](\w+)') # ClamAV malware definition pattern
malware_family = 'unknown'

for idx, x_val in enumerate(type_x):
    # first count the sample type
    if x_val in sample_counter_map.keys():
        sample_counter_map[x_val] += 1
    else:
        sample_counter_map[x_val] = 1
        
    if x_val != 'OK':
        # now check if it is a ClamAV definition.
        pos = x_val.find('-')
        if pos > 0:
            malware_family = x_val[0:pos]
        else:
            malware_family = x_val        
        # if it is a defender classification then convert to ClamAV classification.
        m = p1.match(x_val)
        if m != None:
            malware_family = m.group(2) + '.' + m.group(1) + '.' + m.group(3)
        else:
            m = p2.match(x_val)
            if m != None:
                malware_family = m.group(2) + '.' + m.group(1) + '.' + m.group(3)           
    else:
        continue # leave the scalar label == 0, the malware sample has not been classified.
            
    # now add the classification to the label map with a new scalar value
    if malware_family not in family_label_map.keys():
        counter += 1
        family_label_map[malware_family] = counter
        
    # Count the malware family occurrences.
    if (malware_family in family_counter_map.keys()):
        family_counter_map[malware_family] += 1
    else:
        family_counter_map[malware_family] = 1
                         
    # now get the scalar label for this malware sample
    scalar_labels[idx] = family_label_map[malware_family]
    family_labels[idx] = malware_family
        
    if (idx % 1000) == 0: # report progress
        print("Processed family label {:s} -> {:d}.".format(malware_family, family_label_map[malware_family]))
        
# Finish off by adding malware family label to training label set.
mals['family_label'] = scalar_labels
mals['family_label_str'] = family_labels
mals.head(20)


Processed family label Win.Worm.Tufik -> 1.
Processed family label Win.Trojan.Antifw -> 103.
Processed family label Win.Trojan.Sality -> 60.
Processed family label Legacy.Trojan.Agent -> 19.
Processed family label JS.Trojan.Redirector -> 3.
Processed family label Win.Trojan.Small -> 447.
Processed family label Win.Worm.Soltern -> 30.
Processed family label Win.Trojan.11484026 -> 42.
Processed family label Win.Downloader.94061 -> 179.
Processed family label Win.Adware.Zango -> 612.
Processed family label Win.Trojan.Aliser -> 199.
Processed family label Win32.VirTool.CeeInject -> 197.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Adware.Trymedia -> 39.
Processed family label Win.Trojan.Dialer -> 76.
Processed family label Win.Trojan.Firseria -> 92.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win32.TrojanDownloader.Malushka -> 122.
Processed family label Win32.SoftwareBundler.OutBrowse -> 12.
Processed family label Win.Trojan.Adinstall -> 10.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Worm.Wenper -> 383.
Processed family label Win.Trojan.Loadmoney -> 20.
Processed family label Win.Adware.Screensaver -> 11.
Processed family label Win.Adware.Agent -> 16.
Processed family label JS.Rogue.FakeCall -> 1279.
Processed family label Win.Trojan.Loadmoney -> 20.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Virus.Elkern -> 35.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Adware.Trymedia -> 39.
Processed family label Win.Trojan.Magania -> 14.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win32.Worm.Gamarue -> 26.
Processed family label Win.Trojan.Antifw -> 103.
Processed family label Win.Trojan.Xtreme -> 59.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Worm.Mydoom -> 32.
Processed family label Win.Trojan.Hupigon -> 364.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.Agent -> 2.
Processed family label Win.Trojan.Morstar -> 158.
Processed family label Win.Trojan.Morstar -> 158.
Processed family label Win.Adware.Zango -> 612.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win32.SoftwareBundler.Fourthrem -> 294.
Processed family label JS.TrojanClicker.Faceliker -> 53.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.Adinstall -> 10.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win32.TrojanSpy.Delf -> 610.
Processed family label Win.Trojan.Adinstall -> 10.
Processed family label Win.Adware.Agent -> 16.
Processed family label Win.Trojan.Morstar -> 158.
Processed family label Win.Worm.Soltern -> 30.
Processed family label JS.Trojan.Redirector -> 3.
Processed family label Win.Spyware.78845 -> 24.
Processed family label Win32.Trojan.Pugeju -> 350.
Processed family label Win.Trojan.Adinstall -> 10.
Processed family label AutoIt.TrojanDownloader.Vicluder -> 2089.
Processed family label Win32.Rogue.FakeRean -> 104.
Processed family label Win.Trojan.Adinstall -> 10.
Processed family label Win.Adware.Agent -> 16.
Processed family label Win.Spyware.78845 -> 24.
Processed family label Win.Spyware.78845 -> 24.
Processed family label Win32.TrojanDownloader.Fosniw -> 4.
Processed family label Win32.Trojan.Dynamer -> 27.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Worm.Autorun -> 242.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Trojan.Banker -> 136.
Processed family label Win.Trojan.Agent -> 2.
Processed family label Win32.Trojan.Dynamer -> 27.
Processed family label Win.Trojan.Morstar -> 158.
Processed family label Win.Adware.1296193 -> 2339.
Processed family label Win.Adware.Agent -> 16.
Processed family label Win.Adware.Downloadware -> 452.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win32.Trojan.BHO -> 192.
Processed family label Win.Trojan.Loadmoney -> 20.
Processed family label Win.Trojan.Script -> 211.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win32.Rogue.Winwebsec -> 293.
Processed family label Win.Worm.Mydoom -> 32.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.KillAV -> 41.
Processed family label Win.Trojan.Agent -> 2.
Processed family label Js.Trojan.Obfus -> 119.
Processed family label Win32.Trojan.Bulta -> 28.
Processed family label Win.Trojan.Cosmu -> 82.
Processed family label Win.Virus.Elkern -> 35.
Processed family label Win.Worm.Autorun -> 242.
Processed family label Win.Adware.Browsefox -> 21.
Processed family label Win.Trojan.Adinstall -> 10.
Processed family label Win.Worm.Soltern -> 30.
Processed family label Win.Trojan.Kykymber -> 166.
Processed family label Win.Virus.Elkern -> 35.
Processed family label Win32.SoftwareBundler.OutBrowse -> 12.
Processed family label Win.Spyware.78845 -> 24.
Processed family label Win32.VirTool.Vbcrypt -> 629.
Processed family label Legacy.Trojan.Agent -> 19.
Processed family label Win.Trojan.Morstar -> 158.
Processed family label Win.Adware.Browsefox -> 21.
Processed family label Win.Spyware.78845 -> 24.
Processed family label Win.Trojan.Cosmu -> 82.
Out[7]:
filename malware_type_x malware_type_y _merge label family_label family_label_str
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both 1 1 Win.Worm.Tufik
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both 2 2 Win.Trojan.Agent
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both 3 2 Win.Trojan.Agent
3 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE Trojan:JS/Redirector.QE both 4 3 JS.Trojan.Redirector
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only 0 0
5 000403e4e488356b7535cc613fbeb80b TrojanDownloader:Win32/Fosniw.B TrojanDownloader:Win32/Fosniw.B both 5 4 Win32.TrojanDownloader.Fosniw
6 0004376a62e22f6ad359467eb742b8ff Win.Worm.Picsys-1 Worm:Win32/Picsys.C both 6 5 Win.Worm.Picsys
7 0004c8b2a0f4680a5694d74199b40ea2 Win.Adware.Loadmoney-12162 SoftwareBundler:Win32/ICLoader both 7 6 Win.Adware.Loadmoney
8 000595d8b586915c12053104cf845097 Win.Adware.Mplug-2637 BrowserModifier:Win32/Diplugem both 8 7 Win.Adware.Mplug
9 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip Worm:Win32/Rebhip both 9 8 Win32.Worm.Rebhip
10 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG VirTool:Win32/VBInject.UG both 10 9 Win32.VirTool.VBInject
11 00092d369958b67557da8661cc9093bc Win.Trojan.Adinstall-2 Adware:Win32/Hotbar both 11 10 Win.Trojan.Adinstall
12 00093d5fa5cb7ce77f6eaf39962daa12 Win.Adware.Screensaver-1 Adware:Win32/Hotbar both 12 11 Win.Adware.Screensaver
13 00099926d51b44c6f8c93a48c2567891 SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 13 12 Win32.SoftwareBundler.OutBrowse
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 OK left_only 14 13 Win.Trojan.Downloadadmin
15 000a2db4762dc06628a086c9e117f884 Win.Trojan.Magania-11227 PWS:Win32/Lolyda.AT both 15 14 Win.Trojan.Magania
16 000ac11fa7587b2316470b154254a219 OK OK left_only 0 0
17 000ae2c63ba69fc93dfc395b40bfe03a Heuristics.W32.Parite.B Virus:Win32/Parite.B both 16 15 Heuristics.W32.Parite.B
18 000ae90736a51c47543dcc6d8a735362 Win.Adware.Agent-1312925 BrowserModifier:Win32/Diplugem both 17 16 Win.Adware.Agent
19 000b41258d624ef2d6e430822d0c0c8f SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 13 12 Win32.SoftwareBundler.OutBrowse

20 rows × 7 columns
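
A note on the separator character classes used in these patterns: inside `[...]`, a hyphen between two characters denotes a range, so `.-/` means "every character from `.` to `/`" and does not include a literal hyphen. Placing the hyphen last makes it literal. The small check below uses only the standard `re` module; `range_class` and `literal_class` are names introduced here.

```python
import re

range_class = re.compile(r'[!.-/]')    # '.-/' is the range '.' through '/'
literal_class = re.compile(r'[!./-]')  # trailing '-' is a literal hyphen

chars = ['!', '.', '/', '-']
range_hits = [c for c in chars if range_class.fullmatch(c)]
literal_hits = [c for c in chars if literal_class.fullmatch(c)]
print(range_hits, literal_hits)
```

For the Defender names in this report the difference is unlikely to change the family string, since the shorter three-group pattern captures the same first three components.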


In [8]:
mals.to_csv('data/sorted-family-train-labels.csv', index=False)

In [11]:
# Output the malware family scalar classifications.
sorted_keys = sorted(family_label_map.keys()) # sorted() returns a list on both Python 2 and 3.
with open('data/malware-family-labels.csv', 'w') as fop:
    csv_wouter = writer(fop)
    cols = ['malware_type','class'] # write out the column names.
    csv_wouter.writerow(cols)
    # Write one row per family, in sorted order.
    csv_wouter.writerows([[key, family_label_map[key]] for key in sorted_keys])

print("Completed processing {:d} family labels.".format(len(sorted_keys)))


Completed processing 2730 family labels.

In [12]:
# Output the malware classification counts.
sorted_keys = sorted(sample_counter_map.keys()) # sorted() returns a list on both Python 2 and 3.
with open('data/malware-class-counts.csv', 'w') as fop:
    csv_wouter = writer(fop)
    cols = ['malware_type','count'] # write out the column names.
    csv_wouter.writerow(cols)
    # Write one row per classification with its sample count, in sorted order.
    csv_wouter.writerows([[key, sample_counter_map[key]] for key in sorted_keys])

print("Completed processing {:d} samples.".format(len(sorted_keys)))


Completed processing 10507 samples.

In [13]:
# Output the malware family counts.
sorted_keys = sorted(family_counter_map.keys()) # sorted() returns a list on both Python 2 and 3.
with open('data/malware-family-counts.csv', 'w') as fop:
    csv_wouter = writer(fop)
    cols = ['malware_type','count'] # write out the column names.
    csv_wouter.writerow(cols)
    # Write one row per family with its sample count, in sorted order.
    csv_wouter.writerows([[key, family_counter_map[key]] for key in sorted_keys])

print("Completed processing {:d} families.".format(len(sorted_keys)))


Completed processing 2730 families.
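
The per-family tallies kept in a plain dict above could also be built with `collections.Counter` from the standard library. A minimal sketch, using a made-up list of family strings:

```python
from collections import Counter

# Toy stand-in for the per-sample family strings.
families = ['Win.Trojan.KillAV', 'Win.Trojan.Cosmu', 'Win.Trojan.KillAV', 'unknown']

family_counts = Counter(families)   # maps family -> number of samples
top = family_counts.most_common(1)  # list of (family, count), largest first
print(top)
```

`most_common` is a convenient way to see which families dominate the training set before deciding how to handle rare ones.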

7. Munge the Two Malware Classifications Together and Generate Malware Families.

Experiment 2: use truncated ClamAV or Windows Defender definitions to generate malware families and
assign a scalar training label to each family. Use the Windows Defender definitions by default, or the
ClamAV definitions if Windows Defender classifies the sample as OK. Start fresh with sorted-av-report.csv
and generate new malware classification labels and family labels.

- Script: generate-train-labels.py
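
The family-reduction rule described above can be sketched as a small function. This is a hedged illustration, not the contents of generate-train-labels.py: `to_family`, `defender_re` and `clamav_re` are hypothetical names, and a single three-group Defender pattern is used because the variant suffix is discarded either way.

```python
import re

defender_re = re.compile(r'(\w+):(\w+)/(\w+)')              # Type:Platform/Family[.Variant]
clamav_re = re.compile(r'(\w+)\.(\w+)\.(\w+)[!./-](\w+)')   # Platform.Type.Family-Id

def to_family(name):
    """Hypothetical helper: reduce a scanner label to (platform).(type).(family)."""
    if name == 'OK':
        return 'unknown'
    m = defender_re.match(name)
    if m is not None:
        # Reorder Defender's Type:Platform/Family into ClamAV style.
        return '.'.join([m.group(2), m.group(1), m.group(3)])
    m = clamav_re.match(name)
    if m is not None:
        # Drop the trailing signature id.
        return '.'.join(m.group(1, 2, 3))
    return name  # corner cases keep their original name

for label in ['Trojan:JS/Redirector.QE', 'Win.Trojan.Agent-1345346', 'OK']:
    print(label, '->', to_family(label))
```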

In [25]:
mals = pd.read_csv('data/sorted-av-report.csv')
mals.head(20)


Out[25]:
filename malware_type_x malware_type_y _merge
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both
3 00027c21667d9119a454df8cef2dc1c7 OK Trojan:JS/Redirector.QE both
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only
5 000403e4e488356b7535cc613fbeb80b OK TrojanDownloader:Win32/Fosniw.B both
6 0004376a62e22f6ad359467eb742b8ff Win.Worm.Picsys-1 Worm:Win32/Picsys.C both
7 0004c8b2a0f4680a5694d74199b40ea2 Win.Adware.Loadmoney-12162 SoftwareBundler:Win32/ICLoader both
8 000595d8b586915c12053104cf845097 Win.Adware.Mplug-2637 BrowserModifier:Win32/Diplugem both
9 000634f03457d088c71dbffb897b1315 OK Worm:Win32/Rebhip both
10 00072ed24314e91b63b425b3dc572f50 OK VirTool:Win32/VBInject.UG both
11 00092d369958b67557da8661cc9093bc Win.Trojan.Adinstall-2 Adware:Win32/Hotbar both
12 00093d5fa5cb7ce77f6eaf39962daa12 Win.Adware.Screensaver-1 Adware:Win32/Hotbar both
13 00099926d51b44c6f8c93a48c2567891 OK SoftwareBundler:Win32/OutBrowse both
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 OK left_only
15 000a2db4762dc06628a086c9e117f884 Win.Trojan.Magania-11227 PWS:Win32/Lolyda.AT both
16 000ac11fa7587b2316470b154254a219 OK OK left_only
17 000ae2c63ba69fc93dfc395b40bfe03a Heuristics.W32.Parite.B Virus:Win32/Parite.B both
18 000ae90736a51c47543dcc6d8a735362 Win.Adware.Agent-1312925 BrowserModifier:Win32/Diplugem both
19 000b41258d624ef2d6e430822d0c0c8f OK SoftwareBundler:Win32/OutBrowse both

20 rows × 4 columns


In [26]:
# Now generate the unique scalar label map. Windows Defender is the default classification; if Windows Defender
# is OK and ClamAV is not OK, use the ClamAV classification; if both are OK, default to a label value of 0 for now.
type_x = np.array(mals['malware_type_x'])
type_y = np.array(mals['malware_type_y'])
scalar_labels = [0] * mals.shape[0]
scalar_label_map = {}
counter = 0
scalar_label_map['OK'] = 0

for idx, y_val in enumerate(type_y):
    if y_val != 'OK':
        mals.iloc[idx,1] = mals.iloc[idx,2] # copy the defender classification to ClamAV classification
            
    # now add the classification to the label map with a new scalar value
    if mals.iloc[idx,1] not in scalar_label_map.keys():
        counter += 1
        scalar_label_map[mals.iloc[idx,1]] = counter
        
    # now get the scalar label for this malware sample
    scalar_labels[idx] = scalar_label_map[mals.iloc[idx,1]]
        
mals['sample_label'] = scalar_labels
mals.head(20)


Out[26]:
filename malware_type_x malware_type_y _merge sample_label
0 00002e640cafb741bea9a48eaee27d6f Virus:Win32/Parite.B Virus:Win32/Parite.B both 1
1 000118d12cbf9ad6103e8b914a6e1ac3 SoftwareBundler:Win32/Techsnab SoftwareBundler:Win32/Techsnab both 2
2 0001776237ac37a69fcef93c1bac0988 TrojanDropper:Win32/Sventore.B TrojanDropper:Win32/Sventore.B both 3
3 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE Trojan:JS/Redirector.QE both 4
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only 0
5 000403e4e488356b7535cc613fbeb80b TrojanDownloader:Win32/Fosniw.B TrojanDownloader:Win32/Fosniw.B both 5
6 0004376a62e22f6ad359467eb742b8ff Worm:Win32/Picsys.C Worm:Win32/Picsys.C both 6
7 0004c8b2a0f4680a5694d74199b40ea2 SoftwareBundler:Win32/ICLoader SoftwareBundler:Win32/ICLoader both 7
8 000595d8b586915c12053104cf845097 BrowserModifier:Win32/Diplugem BrowserModifier:Win32/Diplugem both 8
9 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip Worm:Win32/Rebhip both 9
10 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG VirTool:Win32/VBInject.UG both 10
11 00092d369958b67557da8661cc9093bc Adware:Win32/Hotbar Adware:Win32/Hotbar both 11
12 00093d5fa5cb7ce77f6eaf39962daa12 Adware:Win32/Hotbar Adware:Win32/Hotbar both 11
13 00099926d51b44c6f8c93a48c2567891 SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 12
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 OK left_only 13
15 000a2db4762dc06628a086c9e117f884 PWS:Win32/Lolyda.AT PWS:Win32/Lolyda.AT both 14
16 000ac11fa7587b2316470b154254a219 OK OK left_only 0
17 000ae2c63ba69fc93dfc395b40bfe03a Virus:Win32/Parite.B Virus:Win32/Parite.B both 1
18 000ae90736a51c47543dcc6d8a735362 BrowserModifier:Win32/Diplugem BrowserModifier:Win32/Diplugem both 8
19 000b41258d624ef2d6e430822d0c0c8f SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 12

20 rows × 5 columns
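
One way to compare the two preference orders (ClamAV first in section 5, Windows Defender first here) is to count the distinct non-OK labels each one produces. The sketch below does this on a made-up frame; `coalesce` is a hypothetical helper and the column names `clamav` and `defender` are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'clamav': ['Win.Adware.Agent-1', 'Win.Adware.Agent-2', 'OK', 'OK'],
    'defender': ['Adware:Win32/Hotbar', 'Adware:Win32/Hotbar', 'Worm:Win32/Rebhip', 'OK'],
})

def coalesce(primary, fallback):
    """Take primary unless it is 'OK', in which case take fallback."""
    return primary.where(primary != 'OK', fallback)

clamav_first = coalesce(toy['clamav'], toy['defender'])
defender_first = coalesce(toy['defender'], toy['clamav'])

# Number of distinct labels, ignoring samples both scanners call OK.
n_clamav_first = clamav_first[clamav_first != 'OK'].nunique()
n_defender_first = defender_first[defender_first != 'OK'].nunique()
print(n_clamav_first, n_defender_first)
```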


In [27]:
mals.to_csv('data/sorted-av-report-labels-wd.csv', index=False)

In [28]:
# Output the malware sample scalar classifications.
sorted_keys = sorted(scalar_label_map.keys()) # sorted() returns a list on both Python 2 and 3.
with open('data/malware-class-labels-wd.csv', 'w') as fop:
    csv_wouter = writer(fop)
    cols = ['malware_type','class'] # write out the column names.
    csv_wouter.writerow(cols)
    # Write one row per classification, in sorted order.
    csv_wouter.writerows([[key, scalar_label_map[key]] for key in sorted_keys])

print("Completed processing {:d} labels.".format(counter))


Completed processing 5986 labels.

In [29]:
# Now generate unique scalar label map for malware families.

type_x = np.array(mals['malware_type_x'])
#type_y = mals['malware_type_y']
family_scalar_labels = [0] * mals.shape[0]
family_labels = [' '] * mals.shape[0]
family_label_map = {}
sample_counter_map = {}
family_counter_map = {}
counter = 0
pwd1 = re.compile(r'(\w+):(\w+)/(\w+)[!./-]+(\w+)') # Windows Defender malware definition patterns.
pwd2 = re.compile(r'(\w+):(\w+)/(\w+)')
pcav = re.compile(r'(\w+)\.(\w+)\.(\w+)[!./-](\w+)') # ClamAV malware definition pattern.
malware_family = 'unknown'
family_label_map['unknown'] = 0 # The default family scalar label.

for idx, x_val in enumerate(type_x):
    # first count the sample type
    if x_val in sample_counter_map.keys():
        sample_counter_map[x_val] += 1
    else:
        sample_counter_map[x_val] = 1
        
    if x_val != 'OK':
        # if it is a defender classification then convert to ClamAV definition style.
        m = pwd1.match(x_val)
        if m is not None:
            malware_family = m.group(2) + '.' + m.group(1) + '.' + m.group(3) # rearrange the components to
        else:                                                                 # (platform).(class).(type)
            m = pwd2.match(x_val)
            if m is not None:
                malware_family = m.group(2) + '.' + m.group(1) + '.' + m.group(3)
            else:
                # then check if it is a ClamAV definition.
                m = pcav.match(x_val)
                if m is not None:    # just truncate the end bit off.
                    malware_family = m.group(1) + '.' + m.group(2) + '.' + m.group(3)
                else:
                    malware_family = x_val  # catch the corner cases and default to original name/definition.
        
    else:
        malware_family = 'unknown' # leave the scalar label == 0, the malware sample has not been classified.
            
    # now add the classification to the label map with a new scalar value
    if malware_family not in family_label_map.keys():
        counter += 1
        family_label_map[malware_family] = counter
        
    # Count the malware family occurrences.
    if (malware_family in family_counter_map.keys()):
        family_counter_map[malware_family] += 1
    else:
        family_counter_map[malware_family] = 1
                         
    # now get the scalar label for this malware sample
    family_scalar_labels[idx] = family_label_map[malware_family]
    family_labels[idx] = malware_family
        
    if (idx % 1000) == 0: # report progress
        print("Processed family label {:s} -> {:d}.".format(malware_family, family_label_map[malware_family]))
        
# Finish off by adding malware family label to training label set.
mals['family_label'] = family_scalar_labels
mals['family_label_str'] = family_labels
mals.head(20)


Processed family label Win32.Virus.Parite -> 1.
Processed family label Win32.BrowserModifier.Diplugem -> 8.
Processed family label Win32.SoftwareBundler.Fourthrem -> 51.
Processed family label VBS.Virus.Ramnit -> 18.
Processed family label JS.Trojan.Redirector -> 4.
Processed family label Win32.Trojan.Flymux -> 479.
Processed family label Win32.Worm.Soltern -> 29.
Processed family label Win.Trojan.11484026 -> 38.
Processed family label Win32.TrojanDownloader.Renos -> 165.
Processed family label Win32.Adware.Hotbar -> 11.
Processed family label Win.Trojan.Aliser -> 183.
Processed family label Win32.VirTool.CeeInject -> 181.
Processed family label Win32.Worm.VB -> 73.
Processed family label unknown -> 0.
Processed family label Win.Adware.Trymedia -> 35.
Processed family label Win32.Dialer.CarpeDiem -> 68.
Processed family label Win.Trojan.Firseria -> 83.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.TrojanDownloader.Malushka -> 113.
Processed family label Win32.SoftwareBundler.OutBrowse -> 12.
Processed family label Win32.Adware.Hotbar -> 11.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label unknown -> 0.
Processed family label Win32.Worm.Wenper -> 365.
Processed family label Win32.SoftwareBundler.Ogimant -> 19.
Processed family label Win.Adware.Screensaver -> 21.
Processed family label unknown -> 0.
Processed family label Win32.BrowserModifier.Diplugem -> 8.
Processed family label unknown -> 0.
Processed family label JS.Rogue.FakeCall -> 1072.
Processed family label unknown -> 0.
Processed family label Win32.SoftwareBundler.Ogimant -> 19.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win32.Worm.Soltern -> 29.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win.Adware.Trymedia -> 35.
Processed family label Win32.PWS.Lolyda -> 14.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.Worm.Gamarue -> 25.
Processed family label Win32.BrowserModifier.Diplugem -> 8.
Processed family label Win32.Backdoor.Xtrat -> 56.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.Worm.Mydoom -> 31.
Processed family label Win32.TrojanDropper.Delfsnif -> 1264.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label unknown -> 0.
Processed family label Win32.Worm.VB -> 73.
Processed family label AutoIt.Worm.Autorun -> 37.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label unknown -> 0.
Processed family label unknown -> 0.
Processed family label Win32.Trojan.Aksula -> 188.
Processed family label Win.Trojan.Morstar -> 342.
Processed family label Win.Trojan.Morstar -> 342.
Processed family label Win32.Adware.Hotbar -> 11.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.SoftwareBundler.Fourthrem -> 51.
Processed family label JS.TrojanClicker.Faceliker -> 28.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label unknown -> 0.
Processed family label Win32.Adware.Hotbar -> 11.
Processed family label Win32.Worm.VB -> 73.
Processed family label unknown -> 0.
Processed family label Win32.TrojanSpy.Delf -> 435.
Processed family label Win32.Adware.Hotbar -> 11.
Processed family label Win.Adware.Agent -> 15.
Processed family label Win.Trojan.Morstar -> 342.
Processed family label Win32.Worm.Soltern -> 29.
Processed family label JS.Trojan.Redirector -> 4.
Processed family label Win32.PWS.OnLineGames -> 23.
Processed family label Win32.Trojan.Pugeju -> 327.
Processed family label Win32.Adware.Hotbar -> 11.
Processed family label AutoIt.TrojanDownloader.Vicluder -> 1615.
Processed family label unknown -> 0.
Processed family label Win32.Rogue.FakeRean -> 93.
Processed family label Win32.Adware.Hotbar -> 11.
Processed family label unknown -> 0.
Processed family label Win32.TrojanDropper.Sventore -> 3.
Processed family label Win32.PWS.OnLineGames -> 23.
Processed family label Win32.PWS.OnLineGames -> 23.
Processed family label Win32.TrojanDownloader.Fosniw -> 5.
Processed family label Win32.Trojan.Dynamer -> 26.
Processed family label unknown -> 0.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win32.VirTool.VBInject -> 10.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win.Trojan.Banker -> 176.
Processed family label Win32.TrojanDropper.Dinwod -> 394.
Processed family label unknown -> 0.
Processed family label Win32.Trojan.Dynamer -> 26.
Processed family label Win.Trojan.Morstar -> 342.
Processed family label Win.Adware.1296193 -> 1782.
Processed family label unknown -> 0.
Processed family label Win32.TrojanDropper.Sventore -> 3.
Processed family label Win.Adware.Downloadware -> 417.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win32.Trojan.BHO -> 177.
Processed family label Win32.SoftwareBundler.Ogimant -> 19.
Processed family label HTML.Trojan.Redirector -> 195.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win32.Rogue.Winwebsec -> 164.
Processed family label Win32.Worm.Mydoom -> 31.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.Worm.Yuner -> 78.
Processed family label Win32.VirTool.VBInject -> 10.
Processed family label JS.Trojan.HideLink -> 109.
Processed family label unknown -> 0.
Processed family label Win32.Trojan.Bulta -> 27.
Processed family label Win32.Worm.VB -> 73.
Processed family label Win32.Worm.Soltern -> 29.
Processed family label unknown -> 0.
Processed family label Win.Worm.Autorun -> 259.
Processed family label unknown -> 0.
Processed family label Win.Adware.Browsefox -> 20.
Processed family label Win32.Adware.ClickPotato -> 124.
Processed family label unknown -> 0.
Processed family label Win32.Worm.Soltern -> 29.
Processed family label Win32.PWS.OnLineGames -> 23.
Processed family label Win32.Worm.Soltern -> 29.
Processed family label Win32.SoftwareBundler.OutBrowse -> 12.
Processed family label Win32.PWS.OnLineGames -> 23.
Processed family label Win32.VirTool.Vbcrypt -> 553.
Processed family label VBS.Virus.Ramnit -> 18.
Processed family label Win.Trojan.Morstar -> 342.
Processed family label Win.Adware.Browsefox -> 20.
Processed family label Win32.PWS.OnLineGames -> 23.
Processed family label Win32.Worm.VB -> 73.
Out[29]:
filename malware_type_x malware_type_y _merge sample_label family_label family_label_str
0 00002e640cafb741bea9a48eaee27d6f Virus:Win32/Parite.B Virus:Win32/Parite.B both 1 1 Win32.Virus.Parite
1 000118d12cbf9ad6103e8b914a6e1ac3 SoftwareBundler:Win32/Techsnab SoftwareBundler:Win32/Techsnab both 2 2 Win32.SoftwareBundler.Techsnab
2 0001776237ac37a69fcef93c1bac0988 TrojanDropper:Win32/Sventore.B TrojanDropper:Win32/Sventore.B both 3 3 Win32.TrojanDropper.Sventore
3 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE Trojan:JS/Redirector.QE both 4 4 JS.Trojan.Redirector
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only 0 0 unknown
5 000403e4e488356b7535cc613fbeb80b TrojanDownloader:Win32/Fosniw.B TrojanDownloader:Win32/Fosniw.B both 5 5 Win32.TrojanDownloader.Fosniw
6 0004376a62e22f6ad359467eb742b8ff Worm:Win32/Picsys.C Worm:Win32/Picsys.C both 6 6 Win32.Worm.Picsys
7 0004c8b2a0f4680a5694d74199b40ea2 SoftwareBundler:Win32/ICLoader SoftwareBundler:Win32/ICLoader both 7 7 Win32.SoftwareBundler.ICLoader
8 000595d8b586915c12053104cf845097 BrowserModifier:Win32/Diplugem BrowserModifier:Win32/Diplugem both 8 8 Win32.BrowserModifier.Diplugem
9 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip Worm:Win32/Rebhip both 9 9 Win32.Worm.Rebhip
10 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG VirTool:Win32/VBInject.UG both 10 10 Win32.VirTool.VBInject
11 00092d369958b67557da8661cc9093bc Adware:Win32/Hotbar Adware:Win32/Hotbar both 11 11 Win32.Adware.Hotbar
12 00093d5fa5cb7ce77f6eaf39962daa12 Adware:Win32/Hotbar Adware:Win32/Hotbar both 11 11 Win32.Adware.Hotbar
13 00099926d51b44c6f8c93a48c2567891 SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 12 12 Win32.SoftwareBundler.OutBrowse
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 OK left_only 13 13 Win.Trojan.Downloadadmin
15 000a2db4762dc06628a086c9e117f884 PWS:Win32/Lolyda.AT PWS:Win32/Lolyda.AT both 14 14 Win32.PWS.Lolyda
16 000ac11fa7587b2316470b154254a219 OK OK left_only 0 0 unknown
17 000ae2c63ba69fc93dfc395b40bfe03a Virus:Win32/Parite.B Virus:Win32/Parite.B both 1 1 Win32.Virus.Parite
18 000ae90736a51c47543dcc6d8a735362 BrowserModifier:Win32/Diplugem BrowserModifier:Win32/Diplugem both 8 8 Win32.BrowserModifier.Diplugem
19 000b41258d624ef2d6e430822d0c0c8f SoftwareBundler:Win32/OutBrowse SoftwareBundler:Win32/OutBrowse both 12 12 Win32.SoftwareBundler.OutBrowse

20 rows × 7 columns
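
The family mapping in the cell above can be condensed into a single helper, which makes the three cases (Defender name, ClamAV name, unclassified) easier to test in isolation. This is a minimal sketch rather than the notebook's own code: the name `to_family` is hypothetical, and the patterns mirror those used above, written as raw strings with the hyphen last in each character class so that it is matched literally.

```python
import re

# Patterns mirroring the cell above (assumed equivalent for the names shown here).
PWD_SUFFIX = re.compile(r'(\w+):(\w+)/(\w+)[!./-]+(\w+)')  # Defender, with variant suffix
PWD_PLAIN = re.compile(r'(\w+):(\w+)/(\w+)')                # Defender, no suffix
PCAV = re.compile(r'(\w+)\.(\w+)\.(\w+)[!./-](\w+)')        # ClamAV, with signature suffix


def to_family(name):
    """Map an AV detection name to a (platform).(class).(type) family string."""
    if name == 'OK':
        return 'unknown'  # the sample has not been classified
    m = PWD_SUFFIX.match(name) or PWD_PLAIN.match(name)
    if m is not None:
        # Defender order is class:platform/type, so swap the first two parts.
        return '.'.join([m.group(2), m.group(1), m.group(3)])
    m = PCAV.match(name)
    if m is not None:
        return '.'.join(m.group(1, 2, 3))
    return name  # corner case: keep the original definition


for sample in ['Virus:Win32/Parite.B', 'Adware:Win32/Hotbar',
               'Win.Trojan.Downloadadmin-3', 'OK']:
    print(sample, '->', to_family(sample))
```

The four names are taken from the first rows of the table above, so the printed families can be checked against the family_label_str column.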


In [30]:
mals.to_csv('data/sorted-family-train-labels-wd.csv', index=False)

In [31]:
# Output the malware family scalar classifications.
fop = open('data/malware-family-labels-wd.csv', 'w')
csv_wouter = writer(fop)
cols = ['malware_type','class'] # write out the column names.
csv_wouter.writerow(cols)
# Sort the family names; dict.keys() returns a view with no sort() method in Python 3.
sorted_keys = sorted(family_label_map.keys())
outlines = [[key, family_label_map[key]] for key in sorted_keys]
csv_wouter.writerows(outlines) # the label map is small enough to write in one call.
        
print("Completed processing {:d} family labels.".format(len(sorted_keys)))    
fop.close()


Completed processing 2055 family labels.

In [32]:
# Output the malware sample classification counts.
fop = open('data/malware-class-counts-wd.csv', 'w')
csv_wouter = writer(fop)
cols = ['malware_type','count'] # write out the column names.
csv_wouter.writerow(cols)
# Sort the sample types; dict.keys() returns a view with no sort() method in Python 3.
sorted_keys = sorted(sample_counter_map.keys())
outlines = [[key, sample_counter_map[key]] for key in sorted_keys]
csv_wouter.writerows(outlines) # the counter map is small enough to write in one call.
        
print("Completed processing {:d} samples.".format(len(sorted_keys)))    
fop.close()


Completed processing 5987 samples.

In [33]:
# Output the malware family counts.
fop = open('data/malware-family-counts-wd.csv', 'w')
csv_wouter = writer(fop)
cols = ['malware_type','count'] # write out the column names.
csv_wouter.writerow(cols)
# Sort the family names; dict.keys() returns a view with no sort() method in Python 3.
sorted_keys = sorted(family_counter_map.keys())
outlines = [[key, family_counter_map[key]] for key in sorted_keys]
csv_wouter.writerows(outlines) # the counter map is small enough to write in one call.
        
print("Completed processing {:d} families.".format(len(sorted_keys)))    
fop.close()


Completed processing 2055 families.
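
The three output cells above share one pattern: count occurrences, sort the keys, write a two-column csv. A minimal sketch of that pattern using only the standard library is shown below. The function name `write_counts` and the toy input are hypothetical; in the notebook the names would come from the malware type or family columns.

```python
import csv
import io
from collections import Counter


def write_counts(names, out, header=('malware_type', 'count')):
    """Count each name and write the sorted (name, count) rows as csv."""
    counts = Counter(names)
    csv_out = csv.writer(out)
    csv_out.writerow(header)
    csv_out.writerows(sorted(counts.items()))  # sorted by name
    return len(counts)


# Toy input written to an in-memory buffer instead of a file under data/.
buf = io.StringIO()
n = write_counts(['Win32.Worm.VB', 'unknown', 'Win32.Worm.VB'], buf)
print("Completed processing {:d} families.".format(n))
print(buf.getvalue())
```

Passing an open file object instead of the buffer writes the same rows to disk.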

In [4]:
# Join the malware family sample scalar classifications and counts.
cldf = pd.read_csv('data/malware-family-labels-wd.csv')
ccdf = pd.read_csv('data/malware-family-counts-wd.csv')
cjdf = pd.merge(cldf,ccdf,on='malware_type')
cjdf.to_csv('data/malware-family-wd.csv', index=False)

# Join the malware sample scalar classifications and counts.
cldf = pd.read_csv('data/malware-class-labels-wd.csv')
ccdf = pd.read_csv('data/malware-class-counts-wd.csv')
cjdf = pd.merge(cldf,ccdf,on='malware_type')
cjdf.to_csv('data/malware-class-wd.csv', index=False)
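
pd.merge performs an inner join by default, so any malware_type present in only one of the two files is silently dropped from the joined output. The sketch below, on toy frames with made-up counts, shows one way to surface such rows. The `how`, `validate` and `indicator` arguments are standard pd.merge parameters, but their use here is an assumption about what the join should guarantee, not something the notebook does.

```python
import pandas as pd

# Toy stand-ins for the label and count files (values are illustrative only).
labels = pd.DataFrame({'malware_type': ['unknown', 'Win32.Worm.VB', 'Win32.Adware.Hotbar'],
                       'class': [0, 73, 11]})
counts = pd.DataFrame({'malware_type': ['unknown', 'Win32.Worm.VB'],
                       'count': [5, 2]})

# A left join keeps every label; validate raises if a key is duplicated on
# either side; indicator adds a _merge column marking where each row came from.
joined = pd.merge(labels, counts, on='malware_type', how='left',
                  validate='one_to_one', indicator=True)

missing = joined[joined['_merge'] == 'left_only']
print(joined)
print("Labels without a count: {:d}".format(len(missing)))
```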

In [ ]:
help(pd.merge)

In [2]:
mals = pd.read_csv('data/sorted-family-train-labels-wd.csv')
mals.head()

In [5]:
mals.drop(['malware_type_y', '_merge'], axis=1, inplace=True)

In [6]:
mals.head(20)


Out[6]:
filename malware_type_x sample_label family_label family_label_str
0 00002e640cafb741bea9a48eaee27d6f Virus:Win32/Parite.B 1 1 Win32.Virus.Parite
1 000118d12cbf9ad6103e8b914a6e1ac3 SoftwareBundler:Win32/Techsnab 2 2 Win32.SoftwareBundler.Techsnab
2 0001776237ac37a69fcef93c1bac0988 TrojanDropper:Win32/Sventore.B 3 3 Win32.TrojanDropper.Sventore
3 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE 4 4 JS.Trojan.Redirector
4 0003887ab64b8ae19ffa988638decac2 OK 0 0 unknown
5 000403e4e488356b7535cc613fbeb80b TrojanDownloader:Win32/Fosniw.B 5 5 Win32.TrojanDownloader.Fosniw
6 0004376a62e22f6ad359467eb742b8ff Worm:Win32/Picsys.C 6 6 Win32.Worm.Picsys
7 0004c8b2a0f4680a5694d74199b40ea2 SoftwareBundler:Win32/ICLoader 7 7 Win32.SoftwareBundler.ICLoader
8 000595d8b586915c12053104cf845097 BrowserModifier:Win32/Diplugem 8 8 Win32.BrowserModifier.Diplugem
9 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip 9 9 Win32.Worm.Rebhip
10 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG 10 10 Win32.VirTool.VBInject
11 00092d369958b67557da8661cc9093bc Adware:Win32/Hotbar 11 11 Win32.Adware.Hotbar
12 00093d5fa5cb7ce77f6eaf39962daa12 Adware:Win32/Hotbar 11 11 Win32.Adware.Hotbar
13 00099926d51b44c6f8c93a48c2567891 SoftwareBundler:Win32/OutBrowse 12 12 Win32.SoftwareBundler.OutBrowse
14 0009a64f786fa29bfa6423278cc74f02 Win.Trojan.Downloadadmin-3 13 13 Win.Trojan.Downloadadmin
15 000a2db4762dc06628a086c9e117f884 PWS:Win32/Lolyda.AT 14 14 Win32.PWS.Lolyda
16 000ac11fa7587b2316470b154254a219 OK 0 0 unknown
17 000ae2c63ba69fc93dfc395b40bfe03a Virus:Win32/Parite.B 1 1 Win32.Virus.Parite
18 000ae90736a51c47543dcc6d8a735362 BrowserModifier:Win32/Diplugem 8 8 Win32.BrowserModifier.Diplugem
19 000b41258d624ef2d6e430822d0c0c8f SoftwareBundler:Win32/OutBrowse 12 12 Win32.SoftwareBundler.OutBrowse

In [8]:
mals.to_csv('data/sorted-train-labels.csv', index=False)

7.1 Validate Munging and Scalar Label Generation Between Runs


In [15]:
# Load the training labels produced by each sample set run, then compare
# the scalar labels of matching malware types across the two runs to ensure
# that the same malware type always receives the same sample label and the
# same family label.
# [filename,malware_type_x,malware_type_y,sample_label,family_name,family_label]


def validate_label_generation():
    mals1_df = pd.read_csv('data/sorted-train-labels-vs251-252.csv')
    mals2_df = pd.read_csv('data/sorted-train-labels-vs263-264-apt.csv')

    counter = 0
    m1_x = np.array(mals1_df['malware_type_x'])
    m1_sl = np.array(mals1_df['sample_label'])
    m1_fl = np.array(mals1_df['family_label'])
    m2_x = np.array(mals2_df['malware_type_x'])
    m2_sl = np.array(mals2_df['sample_label'])
    m2_fl = np.array(mals2_df['family_label'])
    
    # Compare every pair of samples that share a malware type across the two runs.
    for idx1, mname1 in enumerate(m1_x):
        for idx2, mname2 in enumerate(m2_x):
            if mname1 == mname2:
                if m1_sl[idx1] != m2_sl[idx2]:
                    print("Sample label incongruence: {:d} {:d}".format(m1_sl[idx1], m2_sl[idx2]))
                    counter += 1
                    
                if m1_fl[idx1] != m2_fl[idx2]:
                    print("Family label incongruence: {:d} {:d}".format(m1_fl[idx1], m2_fl[idx2]))
                    counter += 1
        
        if (idx1 % 1000) == 0:
            print("Processed {:d} malware names.".format(idx1))

    print("Total Incongruence Errors: {:d}".format(counter))

In [16]:
validate_label_generation()


Total Incongruence Errors: 0
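
The nested loop in validate_label_generation compares every row of one run with every row of the other, which is slow for sample sets of this size. A dictionary keyed on the malware type gives a single-pass check. The sketch below is an alternative, not the notebook's method: the name `count_label_conflicts` is hypothetical, and the check is slightly stricter, since it also flags a type that carries two different labels within a single run. A non-zero total is counted per row rather than per matching pair, so it will not equal the nested loop's total.

```python
def count_label_conflicts(*runs):
    """Count rows whose labels disagree with the first labels seen for that type.

    Each run is an iterable of (malware_type, sample_label, family_label).
    """
    first_seen = {}
    conflicts = 0
    for run in runs:
        for name, sample_label, family_label in run:
            # Remember the first labels seen for this type, then compare.
            first = first_seen.setdefault(name, (sample_label, family_label))
            conflicts += int(first[0] != sample_label)
            conflicts += int(first[1] != family_label)
    return conflicts


# Toy rows; the label values are illustrative only.
run_a = [('Worm:Win32/VB.AT', 5, 73), ('OK', 0, 0)]
run_b = [('Worm:Win32/VB.AT', 5, 73), ('OK', 0, 0), ('Adware:Win32/Hotbar', 11, 11)]
print("Total Incongruence Errors: {:d}".format(count_label_conflicts(run_a, run_b)))
```

With the data frames used above, each run could be built with zip over the malware_type_x, sample_label and family_label columns.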

In [10]:
# Split out the training sample sets.
def split_training_sets(training_set_directory, train_label_file, output_file):
    mals1_df = pd.read_csv(train_label_file)
    
    counter = 0
    file_list = os.listdir(training_set_directory)
    #malnames = np.array(mals1_df['file_name'])
    malnames = np.array(mals1_df['file_name'])
    truncated_filenames = []
    
    for fname in file_list:
        mname = fname[fname.find('_') + 1:]
        truncated_filenames.append(mname)
        counter += 1        
        
    #t1_df = mals1_df[mals1_df['file_name'].isin(truncated_filenames)]
    t1_df = mals1_df[mals1_df['file_name'].isin(truncated_filenames)]
    
    t1_df.to_csv(output_file, index=False)
    
    
    return t1_df

In [11]:
s1_df = split_training_sets('/opt/vs/train1/', 'data/sorted-train-labels-vs251-252.csv', 'data/sorted-train-labels-vs251.csv')
s1_df.head()


Out[11]:
file_name malware_type_x sample_label family_name family_label
3 00027c21667d9119a454df8cef2dc1c7 Trojan:JS/Redirector.QE 4 JS.Trojan.Redirector 4
4 0003887ab64b8ae19ffa988638decac2 OK 0 unknown 0
6 0004376a62e22f6ad359467eb742b8ff Worm:Win32/Picsys.C 6 Win32.Worm.Picsys 6
9 000634f03457d088c71dbffb897b1315 Worm:Win32/Rebhip 9 Win32.Worm.Rebhip 9
10 00072ed24314e91b63b425b3dc572f50 VirTool:Win32/VBInject.UG 10 Win32.VirTool.VBInject 10

5 rows × 5 columns


In [12]:
s1_df.shape


Out[12]:
(65536, 5)
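
The split relies on the slice `fname[fname.find('_') + 1:]` to turn a directory entry into a sample name. The sketch below isolates that step with the standard library only. The helper names and the `prefix_` file names are hypothetical, since the actual naming of the files under /opt/vs/ is not shown here.

```python
def strip_prefix(fname):
    """Drop everything up to and including the first underscore.

    str.find returns -1 when there is no underscore, so the slice starts
    at 0 and the name is returned unchanged.
    """
    return fname[fname.find('_') + 1:]


def select_labelled(dir_listing, label_rows):
    """Keep the label rows whose first field names a file in the listing."""
    wanted = {strip_prefix(f) for f in dir_listing}  # set gives fast lookups
    return [row for row in label_rows if row[0] in wanted]


listing = ['prefix_00027c21667d9119a454df8cef2dc1c7', '0003887ab64b8ae19ffa988638decac2']
rows = [('00002e640cafb741bea9a48eaee27d6f', 'Virus:Win32/Parite.B'),
        ('00027c21667d9119a454df8cef2dc1c7', 'Trojan:JS/Redirector.QE'),
        ('0003887ab64b8ae19ffa988638decac2', 'OK')]
for row in select_labelled(listing, rows):
    print(row)
```

Only the first underscore is treated as the separator, so a name that itself contains an underscore keeps everything after the first one.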

In [13]:
s2_df = split_training_sets('/opt/vs/train2/', 'data/sorted-train-labels-vs251-252.csv', 'data/sorted-train-labels-vs252.csv')
s2_df.head()


Out[13]:
file_name malware_type_x sample_label family_name family_label
0 00002e640cafb741bea9a48eaee27d6f Virus:Win32/Parite.B 1 Win32.Virus.Parite 1
1 000118d12cbf9ad6103e8b914a6e1ac3 SoftwareBundler:Win32/Techsnab 2 Win32.SoftwareBundler.Techsnab 2
2 0001776237ac37a69fcef93c1bac0988 TrojanDropper:Win32/Sventore.B 3 Win32.TrojanDropper.Sventore 3
5 000403e4e488356b7535cc613fbeb80b TrojanDownloader:Win32/Fosniw.B 5 Win32.TrojanDownloader.Fosniw 5
7 0004c8b2a0f4680a5694d74199b40ea2 SoftwareBundler:Win32/ICLoader 7 Win32.SoftwareBundler.ICLoader 7

5 rows × 5 columns


In [14]:
s3_df = split_training_sets('/opt/vs/train3/', 'data/sorted-train-labels-vs263-264-apt.csv', 'data/sorted-train-labels-vs263.csv')
s3_df.head()


Out[14]:
file_name malware_type_x sample_label family_name family_label
2 0002b2f621ea5786be03bf4153532dce PWS:Win32/OnLineGames.LW 59 Win32.PWS.OnLineGames 23
5 000401419eccde59975c713cfadc974c Worm:Win32/Soltern!rfn 36 Win32.Worm.Soltern 29
6 00042f23bc15b89d9c6a7bde0e316f8b Rogue:Win32/FakeRean 117 Win32.Rogue.FakeRean 93
7 0004824a60ff9fe1fb30d669a5baa627 Worm:Win32/Soltern.L 30 Win32.Worm.Soltern 29
8 0004c49071481789f1c8c80656638497 OK 0 unknown 0

5 rows × 5 columns


In [15]:
s4_df = split_training_sets('/opt/vs/train4/', 'data/sorted-train-labels-vs263-264-apt.csv', 'data/sorted-train-labels-vs264.csv')
s4_df.head()


Out[15]:
file_name malware_type_x sample_label family_name family_label
0 000070db76b6dc1ee3497a3f9319848c Trojan:JS/Redirector.QE 4 JS.Trojan.Redirector 4
1 00009cbc0a90337e4c30950a51ae3d67 Win.Adware.ForceStartPage-1 5987 Win.Adware.ForceStartPage 2055
3 0003c05a1320e64fe72438ab48da7ecf TrojanClicker:JS/Faceliker.S 29 JS.TrojanClicker.Faceliker 28
4 0003e52a9267b657d9b08b2cbc0a2593 Trojan:JS/Redirector.QE 4 JS.Trojan.Redirector 4
9 0005743596135fe65f61da7a0eba0bb6 TrojanClicker:JS/Faceliker.D 91 JS.TrojanClicker.Faceliker 28

5 rows × 5 columns


In [16]:
sa_df = split_training_sets('/opt/vs/apt/', 'data/sorted-train-labels-vs263-264-apt.csv', 'data/sorted-train-labels-apt.csv')
sa_df.head()


Out[16]:
file_name malware_type_x sample_label family_name family_label
61 001dd76872d80801692ff942308c64e6 Trojan:Win32/Sluegot.D 5992 Win32.Trojan.Sluegot 2057
75 002325a0a67fded0381b5648d7fe9b8e Trojan:Win32/Sluegot.C 5993 Win32.Trojan.Sluegot 2057
469 00dbb9e1c09dbdafb360f3163ba5a3de Backdoor:Win32/Stradatu 6005 Win32.Backdoor.Stradatu 2064
697 0149b7bd7218aab4e257d28469fddb0d Trojan:Win32/Sluegot.A 6017 Win32.Trojan.Sluegot 2057
990 01e0dc079d4e33d8edd050c4900818da Backdoor:Win32/Stradatu 6005 Win32.Backdoor.Stradatu 2064

5 rows × 5 columns


In [17]:
s1_df = split_training_sets('/opt/vs/train1/', 'data/sorted-entropy-features-vs251-252.csv', 'data/sorted-entropy-features-vs251.csv')
s1_df.head()


Out[17]:
file_name entropy file_size
3 00027c21667d9119a454df8cef2dc1c7 0.666599 18390
4 0003887ab64b8ae19ffa988638decac2 0.903260 1134320
6 0004376a62e22f6ad359467eb742b8ff 0.803515 149720
9 000634f03457d088c71dbffb897b1315 0.957584 1725502
10 00072ed24314e91b63b425b3dc572f50 0.486112 328093

5 rows × 3 columns


In [18]:
s1_df = split_training_sets('/opt/vs/train2/', 'data/sorted-entropy-features-vs251-252.csv', 'data/sorted-entropy-features-vs252.csv')
s1_df.head()


Out[18]:
file_name entropy file_size
0 00002e640cafb741bea9a48eaee27d6f 0.992174 208860
1 000118d12cbf9ad6103e8b914a6e1ac3 0.834382 201600
2 0001776237ac37a69fcef93c1bac0988 0.966021 682192
5 000403e4e488356b7535cc613fbeb80b 0.773787 199168
7 0004c8b2a0f4680a5694d74199b40ea2 0.985592 1165440

5 rows × 3 columns


In [19]:
s1_df.shape


Out[19]:
(65536, 3)

8. Test Code Only


In [5]:
mals1_df = pd.read_csv('data/sorted-av-report-vs251-252.csv')
mals1_df.head()


Out[5]:
filename malware_type_x malware_type_y _merge
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B both
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab both
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B both
3 00027c21667d9119a454df8cef2dc1c7 OK Trojan:JS/Redirector.QE both
4 0003887ab64b8ae19ffa988638decac2 OK OK left_only

In [6]:
mals1_df.drop('_merge', axis=1, inplace=True)
mals1_df.head()


Out[6]:
filename malware_type_x malware_type_y
0 00002e640cafb741bea9a48eaee27d6f Win.Worm.Tufik-182 Virus:Win32/Parite.B
1 000118d12cbf9ad6103e8b914a6e1ac3 Win.Trojan.Agent-1345346 SoftwareBundler:Win32/Techsnab
2 0001776237ac37a69fcef93c1bac0988 Win.Trojan.Agent-1309696 TrojanDropper:Win32/Sventore.B
3 00027c21667d9119a454df8cef2dc1c7 OK Trojan:JS/Redirector.QE
4 0003887ab64b8ae19ffa988638decac2 OK OK

In [9]:
mals1_df.to_csv('data/sorted-av-report-vs251-252.csv', index=False)

In [4]:
malcounts = mals['malware_type_x'].value_counts()
malcounts


Out[4]:
OK                                     16918
Worm:Win32/Yuner.A                      8452
Worm:Win32/VB.AT                        7830
Adware:Win32/Hotbar                     7434
BrowserModifier:Win32/Diplugem          5689
Worm:Win32/Soltern.L                    4957
SoftwareBundler:Win32/Ogimant           4243
PWS:Win32/OnLineGames.IZ                3789
Trojan:Win32/Dynamer!ac                 2527
TrojanDropper:Win32/Sventore.B          2141
Worm:Win32/Mydoom.O@mm                  1937
Win.Trojan.Morstar-7                    1768
SoftwareBundler:Win32/OutBrowse         1726
Win.Adware.Screensaver-1                1387
PWS:Win32/OnLineGames.LW                1384
Win.Trojan.Morstar-10                   1224
Worm:Win32/Soltern!rfn                  1210
Trojan:Win32/Bulta!rfn                  1054
Worm:Win32/Picsys.C                      974
Virus:VBS/Ramnit.gen!C                   958
Win.Adware.Agent-1111578                 904
Trojan:JS/Redirector.QE                  868
Win.Trojan.11484026-1                    834
Worm:Win32/Mydoom.L@mm                   704
Trojan:Win32/Rimecud.A                   629
Win.Trojan.Morstar-12                    613
Win.Adware.913802-1                      611
Win.Adware.Agent-1126070                 565
Win.Adware.Trymedia-3                    539
Worm:Win32/Gamarue.N                     490
                                       ...  
Trojan:Win32/Renocide.A                    1
TrojanDownloader:Win32/Moure               1
TrojanSpy:Win32/Vwealer.O                  1
Trojan:Win32/Blackmon.A                    1
TrojanDownloader:Win32/Renos.gen!KJ        1
Win.Trojan.Outbrowse-1479                  1
Win.Spyware.Banker-6439                    1
TrojanDownloader:Win32/Ciucio.C            1
Win.Trojan.Downloader-1955                 1
Trojan:Win32/Tropid!rts                    1
TrojanDownloader:Win32/Agent.ABC           1
Win.Adware.Browsefox-5617                  1
Trojan:Win32/Fakefolder.C                  1
Win.Trojan.Bho-9360                        1
Win.Trojan.Small-18796                     1
TrojanSpy:Win32/Bancos.ABM                 1
Trojan:Win32/Vxidl.gen!C                   1
Dialer:Win32/Rapido                        1
Win.Trojan.Sality-63897                    1
Win.Trojan.Agent-525476                    1
Win.Trojan.12742548-1                      1
Win.Trojan.Agent-1242907                   1
Win.Adware.Agent-1258520                   1
HackTool:Win32/Bifrostack                  1
Win.Trojan.Bang5mai-2                      1
Worm:Win32/Korgo.Q                         1
Trojan:WinNT/Hookmoot.gen!A                1
TrojanDownloader:Win32/Neglemir.A          1
TrojanSpy:Win32/VB.AAI                     1
TrojanDropper:Win32/Delf.CK                1
Name: malware_type_x, dtype: int64

In [5]:
malcounts[:10].plot(kind='barh', rot=0)
plt.show()

In [8]:
# Windows Defender malware class matching patterns.

p1 = re.compile(r'(\w+):(\w+)/(\w+)[!.-]+(\w+)') # raw strings keep \w from being read as a string escape.
p2 = re.compile(r'(\w+):(\w+)/(\w+)')

m = p1.match('Backdoor:MSIL/Bladabindi!rfn')

# m.group(1) == 'Backdoor'
# m.group(2) == 'MSIL'
# m.group(3) == 'Bladabindi'
# m.group(4) == 'rfn'

# Convert to ClamAV style malware family
malware_family = 'unknown'
if m is not None:
    malware_family = m.group(2) + '.' + m.group(1) + '.' + m.group(3)
else:
    m = p2.match('Backdoor:MSIL/Bladabindi')
    if m is not None:
        malware_family = m.group(2) + '.' + m.group(1) + '.' + m.group(3)

print(malware_family)

# Convert ClamAV malware class to malware family by removing the number and hyphen from the end of the malware string.

malware_str = 'Andr.Adware.Kuguo-2'
pos = malware_str.find('-')
if pos > 0:
    malware_family = malware_str[0:pos]
else:
    malware_family = malware_str
    
print(malware_family)


MSIL.Backdoor.Bladabindi
Andr.Adware.Kuguo
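
The find('-') approach above cuts the string at the first hyphen, which would also truncate a family name that itself contains a hyphen. A pattern anchored at the end of the string removes only the trailing signature number. This is a sketch under that assumption: the helper name `clamav_family` and the hyphenated example name are hypothetical, while the other names are taken from the reports shown in this notebook.

```python
import re

# A hyphen followed by digits, anchored at the end of the name.
CLAMAV_SUFFIX = re.compile(r'-\d+$')


def clamav_family(name):
    """Strip the trailing '-<number>' signature suffix from a ClamAV name."""
    return CLAMAV_SUFFIX.sub('', name)


for name in ['Andr.Adware.Kuguo-2', 'Win.Trojan.Agent-1345346',
             'Win.Trojan.11484026-1', 'Win.Trojan.Foo-Bar-7']:
    print(name, '->', clamav_family(name))
```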

In [3]:
class_labels_df = pd.read_csv('data/av-malware-class-labels.csv')
family_labels_df = pd.read_csv('data/av-malware-family-labels.csv')
vs1_df = pd.read_csv('data/sorted-av-report-vs251-252.csv')
vs2_df = pd.read_csv('data/sorted-av-report-vs263-264-apt.csv')

print("Class Labels = {:d}, Family Labels = {:d}".format(class_labels_df.shape[0], family_labels_df.shape[0]))


Class Labels = 5987, Family Labels = 2055

In [5]:
type_x = np.array(vs2_df['malware_type_x'])
type_y = np.array(vs2_df['malware_type_y'])
scalar_labels = [0] * vs2_df.shape[0]
counter = 0
scalar_label_map = {}

for idx, y_val in enumerate(type_y):
    if y_val != 'OK':
        malware_name = y_val
    else:
        malware_name = type_x[idx] # fall back to malware_type_x when malware_type_y is OK.

    if malware_name not in scalar_label_map.keys():
        counter += 1
        scalar_label_map[malware_name] = counter
        
    # now get the scalar label for this malware sample
    scalar_labels[idx] = scalar_label_map[malware_name]

print("Class Labels: {:d}".format(len(scalar_label_map.keys())))


Class Labels: 4596

In [7]:
sorted_train_labels_df = pd.read_csv('data/sorted-train-labels-vs251-252.csv')
type_x = np.array(sorted_train_labels_df['malware_type_x'])
known_types = set(type_x) # set membership avoids rescanning the array for every name.
counter = 0
for malware_name in scalar_label_map.keys():
    if malware_name not in known_types:
        counter += 1
        
print("New Malware Types: {:d}".format(counter))


New Malware Types: 2347

In [8]:
5987 + 2347


Out[8]:
8334
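
The sum above reads as the size of a set union: the 5987 class labels from the first run plus the 2347 malware types that appear only in the second run. The sketch below shows that bookkeeping on toy data. The function name `merged_label_count` is hypothetical, and whether the combined label file was actually produced this way is an assumption.

```python
def merged_label_count(existing_types, new_run_types):
    """Return the combined number of distinct types and the unseen ones."""
    existing = set(existing_types)
    unseen = set(new_run_types) - existing  # types the first run never saw
    return len(existing) + len(unseen), unseen


# Toy example: two known types, one of them repeated in the new run.
total, unseen = merged_label_count(
    ['OK', 'Worm:Win32/VB.AT'],
    ['Worm:Win32/VB.AT', 'Trojan:Win32/Sluegot.D'])
print("Combined types: {:d}, new types: {:d}".format(total, len(unseen)))
```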

In [2]:
class_labels_df = pd.read_csv('data/av-malware-class-labels.csv')
family_labels_df = pd.read_csv('data/av-malware-family-labels.csv')
vs1_df = pd.read_csv('data/sorted-train-labels-vs251-252.csv')
vs2_df = pd.read_csv('data/sorted-train-labels-vs263-264-apt.csv')

print("Class Labels = {:d}, Family Labels = {:d}".format(class_labels_df.shape[0], family_labels_df.shape[0]))


Class Labels = 5987, Family Labels = 2055

In [3]:
newclass_labels_df = pd.read_csv('data/av-malware-class-labels-wd.csv')
newfamily_labels_df = pd.read_csv('data/av-malware-family-labels-wd.csv')

print("Class Labels = {:d}, Family Labels = {:d}".format(newclass_labels_df.shape[0], newfamily_labels_df.shape[0]))


Class Labels = 8334, Family Labels = 2737

In [4]:
class_labels_df = pd.read_csv('data/av-malware-class-labels.csv')
family_labels_df = pd.read_csv('data/av-malware-family-labels.csv')
vs1_df = pd.read_csv('data/sorted-train-labels-vs251.csv')
vs2_df = pd.read_csv('data/sorted-train-labels-vs252.csv')
vs3_df = pd.read_csv('data/sorted-train-labels-vs263.csv')
vs4_df = pd.read_csv('data/sorted-train-labels-vs264.csv')
vs5_df = pd.read_csv('data/sorted-train-labels-apt.csv')
print("Class Labels = {:d}, Family Labels = {:d}".format(class_labels_df.shape[0], family_labels_df.shape[0]))
ok_count = vs1_df["malware_type_x"].value_counts()
ok_count


Class Labels = 8334, Family Labels = 2737
Out[4]:
OK                                   8007
Worm:Win32/Soltern.L                 4745
Worm:Win32/Yuner.A                   4035
Adware:Win32/Hotbar                  3787
Worm:Win32/VB.AT                     3756
BrowserModifier:Win32/Diplugem       2098
PWS:Win32/OnLineGames.IZ             2028
SoftwareBundler:Win32/Ogimant        1980
Worm:Win32/Soltern!rfn               1206
Trojan:Win32/Dynamer!ac              1182
Worm:Win32/Mydoom.O@mm               1127
Worm:Win32/Picsys.C                   960
Win.Trojan.Morstar-7                  890
TrojanDropper:Win32/Sventore.B        821
Win.Adware.Screensaver-1              759
SoftwareBundler:Win32/OutBrowse       755
PWS:Win32/OnLineGames.LW              750
Win.Trojan.Morstar-10                 583
Trojan:Win32/Bulta!rfn                561
Win.Trojan.11484026-1                 516
Win.Adware.Agent-1111578              447
Worm:Win32/Mydoom.L@mm                417
Win.Trojan.Trymedia-7                 313
Win.Trojan.Downloadware-15            311
Win.Adware.Agent-1126070              302
Win.Trojan.Morstar-12                 295
Trojan:Win32/Rimecud.A                287
Worm:Win32/Gamarue.N                  285
Win.Adware.913802-1                   275
Virus:VBS/Ramnit.gen!C                256
                                     ... 
Win.Adware.Browsefox-2705               1
PWS:Win32/Lolyda.AC                     1
Trojan:Win32/Sipoo.A                    1
Worm:Win32/Rimecud.EL                   1
Trojan:Win32/Koobface.gen!J             1
Trojan:Win32/Koobface.gen!K             1
Trojan:Win32/DriverBypass               1
PWS:Win32/Lolyda.AO                     1
Backdoor:Win32/VB.NR                    1
Win.Trojan.BHO-1                        1
Win.Adware.Agent-1182283                1
Win.Downloader.131518-1                 1
Win.Trojan.Bundlore-72                  1
PWS:Win32/Lolyda.AU                     1
TrojanDownloader:Win32/Small.AWL        1
HackTool:Win32/Mikatz!dha               1
Trojan:Win32/DwLoad                     1
TrojanProxy:Win32/Koobface.gen!A        1
Win.Trojan.Agent-361166                 1
Win.Trojan.Buzus-8462                   1
TrojanDownloader:HTML/Adodb.gen!A       1
Backdoor:Win32/Koceg.gen!C              1
Trojan:Win32/Koobface.gen!M             1
Win.Trojan.Sality-64007                 1
Win.Trojan.Keylogger-1273               1
Backdoor:Win32/Netdevil.1_4             1
Win.Trojan.Sality-37452                 1
Win.Trojan.Sality-64132                 1
Win.Trojan.Sality-64131                 1
Win.Trojan.Sality-64678                 1
Name: malware_type_x, dtype: int64

In [6]:
# Number of samples with a malware label (everything that is not "OK").
(vs1_df["malware_type_x"] != "OK").sum()


Out[6]:
57529
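
The labels above mix two naming styles: Windows Defender names such as Worm:Win32/Soltern.L and ClamAV names such as Win.Trojan.Morstar-7. A minimal sketch of how such labels might be split into a coarse class and a family is shown below. The two layouts in the comments are assumptions based on the vendors' naming schemes, split_label is a hypothetical helper, and this is not necessarily how the "Class Labels" and "Family Labels" counts printed earlier were produced.

```python
import re
from collections import Counter

# Assumed layouts (not taken from this notebook):
#   Windows Defender: Type:Platform/Family[.Variant][!Suffix]
#   ClamAV:           Platform.Category.Family[-SignatureId]
DEFENDER_RE = re.compile(r"^(?P<cls>[^:]+):(?P<platform>[^/]+)/(?P<family>[^.!@]+)")
CLAMAV_RE = re.compile(r"^(?P<platform>[^.]+)\.(?P<cls>[^.]+)\.(?P<family>.+?)(?:-\d+)*$")


def split_label(label):
    """Return (class, family) for one scanner label (hypothetical helper)."""
    if label == "OK":
        return ("OK", "OK")
    match = DEFENDER_RE.match(label) or CLAMAV_RE.match(label)
    if match is None:
        return ("unknown", label)
    return (match.group("cls"), match.group("family"))


# Labels copied from the value_counts() output above.
samples = [
    "OK",
    "Worm:Win32/Soltern.L",
    "Worm:Win32/Soltern!rfn",
    "Worm:Win32/Mydoom.O@mm",
    "Win.Trojan.Morstar-7",
    "Win.Trojan.Morstar-10",
    "Win.Adware.Agent-1111578",
]
parsed = [split_label(s) for s in samples]
class_counts = Counter(cls for cls, _ in parsed)
family_counts = Counter(fam for _, fam in parsed)
print(class_counts)
print(family_counts)
```

On the notebook's frames the same helper could presumably be applied with vs1_df["malware_type_x"].map(split_label), which groups variants such as Soltern.L and Soltern!rfn under one family.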

In [7]:
ok_count = vs2_df["malware_type_x"].value_counts()
ok_count


Out[7]:
OK                                 8911
Worm:Win32/Yuner.A                 4417
Worm:Win32/VB.AT                   4074
Adware:Win32/Hotbar                3647
BrowserModifier:Win32/Diplugem     3591
SoftwareBundler:Win32/Ogimant      2263
PWS:Win32/OnLineGames.IZ           1761
Trojan:Win32/Dynamer!ac            1345
TrojanDropper:Win32/Sventore.B     1320
SoftwareBundler:Win32/OutBrowse     971
Win.Trojan.Morstar-7                878
Worm:Win32/Mydoom.O@mm              810
Virus:VBS/Ramnit.gen!C              702
Trojan:JS/Redirector.QE             667
Win.Trojan.Morstar-10               641
PWS:Win32/OnLineGames.LW            634
Win.Adware.Screensaver-1            628
Trojan:Win32/Bulta!rfn              493
Win.Adware.Agent-1111578            457
Trojan:Win32/Rimecud.A              342
Win.Adware.913802-1                 336
Win.Trojan.11484026-1               318
Win.Trojan.Morstar-12               318
Win.Adware.Trymedia-3               313
Exploit:HTML/IframeRef.gen          307
Virus:Win32/Parite.B                302
Worm:Win32/Mydoom.L@mm              287
Trojan:JS/HideLink.A                278
Win.Adware.Agent-1126070            263
PWS:Win32/Lolyda.AT                 252
                                   ... 
Win.Adware.4shared-67                 1
Win.Adware.Ibryte-6596                1
VirTool:Win32/Injector.AT             1
TrojanDownloader:Win32/Zlob.ANR       1
TrojanDownloader:Win32/VB.AAJ         1
PWS:Win32/OnLineGames.ZEU             1
Win.Trojan.Sality-66267               1
Win.Trojan.Sality-67254               1
TrojanDropper:Win32/Coopop.B          1
Win.Adware.MyWebSearch-4              1
TrojanDropper:Win32/Yangxiay.A        1
Win.Trojan.Visel-54                   1
Win.Trojan.Sality-65101               1
TrojanDownloader:Win32/Zlob.OP        1
Win.Adware.Airadinstaller-753         1
Backdoor:Win32/Agent.AXA              1
Win.Adware.Installcore-1238           1
HackTool:Win32/QQLogin.B              1
Trojan:Win32/Favadd.C                 1
Win.Adware.Agent-1112158              1
Win.Trojan.Sality-67954               1
Win.Adware.Installcore-2070           1
Worm:Win32/Soltern.O                  1
Win.Adware.Airadinstaller-213         1
Win.Trojan.Crypted-38                 1
Win.Downloader.Dadobra-2              1
Dialer:Win32/EGroup.C                 1
TrojanProxy:Win32/Pramro.B            1
Trojan:Win32/Startpage.ABB            1
TrojanDropper:Win32/Delf.CK           1
Name: malware_type_x, dtype: int64

In [8]:
# Number of samples with a malware label (everything that is not "OK").
(vs2_df["malware_type_x"] != "OK").sum()


Out[8]:
56625

In [9]:
ok_count = vs3_df["malware_type_x"].value_counts()
ok_count


Out[9]:
OK                                    13924
Worm:Win32/Soltern.L                   8356
Trojan:JS/Redirector.QE                3571
Adware:Win32/Hotbar                    3340
Worm:Win32/Soltern!rfn                 2474
BrowserModifier:Win32/Diplugem         1800
PWS:Win32/OnLineGames.IZ               1742
Worm:Win32/Picsys.C                    1661
Trojan:Win32/Dynamer!ac                1133
Win.Adware.Screensaver-1                828
Win.Trojan.11484026-1                   655
PWS:Win32/OnLineGames.LW                613
Worm:Win32/Mydoom.O@mm                  550
Virus:VBS/Ramnit.gen!C                  469
Win.Adware.Agent-1126070                455
Win.Trojan.Downloadware-15              455
SoftwareBundler:Win32/Ogimant           447
Win.Adware.Imali-17                     429
Win.Trojan.Trymedia-7                   408
Trojan:JS/HideLink.A                    398
TrojanClicker:JS/Faceliker.A            393
Worm:Win32/Yuner.A                      363
Trojan:JS/Iframe.AE                     352
Worm:Win32/Soltern.M                    320
Trojan:Win32/Bulta!rfn                  290
Worm:Win32/Mydoom.L@mm                  265
Trojan:Win32/Rimecud.A                  256
Exploit:HTML/IframeRef.gen              253
SoftwareBundler:Win32/OutBrowse         244
TrojanClicker:JS/Faceliker.D            242
                                      ...  
VirTool:Win32/DelfInject.gen!BB           1
TrojanDownloader:Win32/VB.SZ              1
Win.Trojan.Sality-107496                  1
Dialer:Win32/Webdialer                    1
Win.Worm.Protoride-1                      1
Win.Trojan.Agent-211529                   1
TrojanDownloader:Win32/Ratecki.A          1
Win.Malware.Agent3369319084/CRDF-1        1
Win.Trojan.Graftor-3089                   1
Win.Trojan.Qhost-151                      1
Trojan:Win32/Simda.gen!E                  1
Worm:Win32/Koobface.I                     1
Backdoor:Win32/Delf.CE                    1
Worm:Win32/Koobface.C                     1
Worm:Win32/Mydoom.F@mm                    1
Win.Trojan.Sality-102162                  1
VirTool:JS/Obfuscator.CP                  1
Win.Proxy.Ranky-11                        1
Andr.Malware.Agent-1515121                1
BrowserModifier:Win32/Sasquor             1
TrojanDropper:Win32/Decay.A               1
Worm:Win32/Autorun.PT                     1
Virus:Win32/Viking.B                      1
TrojanSpy:Win32/Bancos.VI!dll3            1
Backdoor:Win32/Hackdef.DF                 1
Backdoor:Win32/Wollf.1_6                  1
Trojan:Win32/Bojotuc!rfn                  1
VirTool:Win32/Vtub.GN                     1
Andr.Malware.Agent-1513374                1
Trojan:Win32/Regrun.B                     1
Name: malware_type_x, dtype: int64

In [10]:
# Number of samples with a malware label (everything that is not "OK").
(vs3_df["malware_type_x"] != "OK").sum()


Out[10]:
51612

In [11]:
ok_count = vs4_df["malware_type_x"].value_counts()
ok_count


Out[11]:
OK                                   23262
Trojan:JS/Redirector.QE              11217
Worm:Win32/Soltern.L                  2967
BrowserModifier:Win32/Diplugem        1700
Virus:VBS/Ramnit.gen!C                1654
TrojanClicker:JS/Faceliker.A          1480
TrojanClicker:JS/Faceliker.S          1270
Trojan:JS/Iframe.AE                   1168
Trojan:JS/HideLink.A                  1005
Worm:Win32/Soltern!rfn                 830
TrojanClicker:JS/Faceliker.D           763
Adware:Win32/Hotbar                    632
Trojan:JS/Redirector.QD                595
Worm:Win32/Picsys.C                    533
TrojanClicker:JS/Faceliker.C           502
Exploit:HTML/IframeRef.gen             473
Trojan:JS/Iframe.EP                    453
PWS:Win32/OnLineGames.IZ               449
Win.Adware.Imali-17                    444
SoftwareBundler:Win32/Bervisec         438
Trojan:JS/Redirector.ON                410
Trojan:JS/Iframeinject                 358
Virus:VBS/Ramnit.gen!A                 349
SoftwareBundler:Win32/Ogimant          342
Trojan:JS/Redirector.PR                328
Trojan:Win32/Dynamer!ac                310
TrojanClicker:JS/Faceliker.N           284
Trojan:HTML/Redirector.CF              284
Virus:VBS/Ramnit.B                     267
Trojan:JS/Runfile.A                    221
                                     ...  
Win.Trojan.Tufik-313                     1
Andr.Malware.Agent-1541449               1
TrojanSpy:Win32/Ursnif                   1
VirTool:WinNT/Rootkitdrv!rfn             1
PWS:Win32/Lmir.ACY                       1
Andr.Malware.Agent-1481591               1
TrojanProxy:Win32/Bunitu.G               1
Win.Malware.Agent450251929/CRDF-1        1
Dialer:Win32/Holistyc                    1
Backdoor:PHP/C99shell.R                  1
Andr.Malware.Agent-1464082               1
Exploit:HTML/IframeRef.FB                1
HackTool:Win32/Evidpatch.A               1
Andr.Malware.Agent-1475418               1
Win.Trojan.5855873-1                     1
Trojan:Win32/QQFish.A                    1
VirTool:Win32/CeeInject.GF               1
Trojan:Win32/Vundo.HX                    1
TrojanProxy:Win32/Agent.BE               1
Virus:Win32/Xorer.gen!I                  1
TrojanDownloader:Win32/Small.BB          1
Win.Trojan.Loadmoney-11748               1
Worm:Win32/Delf.AU                       1
HackTool:Win32/Gendows                   1
Andr.Malware.Agent-1472374               1
TrojanDropper:Win32/Spacekito.A          1
TrojanDownloader:Win32/Dowgav.B          1
Exploit:HTML/IframeRef.FQ                1
Andr.Malware.Agent-1468815               1
Win.Trojan.Agent-586928                  1
Name: malware_type_x, dtype: int64

In [12]:
# Number of samples with a malware label (everything that is not "OK").
(vs4_df["malware_type_x"] != "OK").sum()


Out[12]:
42274

In [13]:
ok_count = vs5_df["malware_type_x"].value_counts()
ok_count


Out[13]:
Trojan:Win32/Connapts               40
Backdoor:Win32/Likseput.B           25
Backdoor:Win32/Neporoot.A           21
Trojan:Win32/Sluegot.A              13
Backdoor:Win32/Tartober.A           12
Backdoor:Win32/Stradatu             12
TrojanDownloader:Win32/Govdi.A       8
Backdoor:Win32/Neunut.A              8
Backdoor:Win32/Likseput.A            7
Backdoor:Win32/Ecltys.A              7
Trojan:Win32/Dynamer!dtc             7
Backdoor:Win32/Warood.B              7
Backdoor:Win32/Noobot.A              6
TrojanDownloader:Win32/Dalbot.A      5
Backdoor:Win32/Minaps.A              5
Backdoor:Win32/Sharat.gen!A          5
TrojanDownloader:Win32/Pingbed.A     5
Backdoor:Win32/Pingbed.A             4
Backdoor:Win32/Xifos.A               4
Backdoor:Win32/Touasper.A            4
TrojanDownloader:Win32/Small.XR      4
Backdoor:Win32/Miniasroot.A          4
Backdoor:Win32/Tosct.A               4
Trojan:Win32/Sluegot.C               4
TrojanDownloader:Win32/Coswid.A      3
Trojan:Win32/Sluegot.D               3
Backdoor:Win32/Goolelo.A             3
Trojan:Win32/Sisproc!gmb             3
Backdoor:Win32/Agent.RO              2
Trojan:Win32/Orsam!rts               2
                                    ..
Backdoor:Win32/Linsomroot.A          2
Backdoor:Win32/Jepesroot.A           2
TrojanDownloader:Win32/Goosta.A      1
Trojan:Win32/Tapslix.A               1
TrojanDownloader:Win32/Dielel.A      1
Backdoor:Win32/Jabbroot.A            1
Trojan:Win32/Ruce.gen!A              1
HackTool:Win32/Hashenfill.A          1
Trojan:Win32/Tosct.A                 1
Trojan:Win32/Comame!gmb              1
Backdoor:Win32/Touasper.C            1
TrojanDownloader:Win32/Ahrocam.B     1
HackTool:Win32/SamDump               1
Trojan:Win32/Godin.A                 1
Trojan:Win32/Comroki!gmb             1
TrojanDownloader:Win32/Muntsib.A     1
TrojanDownloader:Win32/Namsoth.B     1
Trojan:Win32/Malex.gen!E             1
TrojanDownloader:Win32/Pingbed.C     1
OK                                   1
TrojanDownloader:Win32/Macup.A       1
TrojanDownloader:Win32/Tosct.B       1
TrojanDownloader:Win32/Agent.PM      1
HackTool:Win64/Mikatz!dha            1
PWS:Win32/Cimuz.B.dll                1
PWS:Win32/Maptsc.A                   1
Win.Trojan.Agent-30723               1
Trojan:Win32/Maptsc.A                1
Win.Trojan.Agent-30709               1
TrojanProxy:Win32/Small              1
Name: malware_type_x, dtype: int64

In [14]:
# Number of samples with a malware label (everything that is not "OK").
(vs5_df["malware_type_x"] != "OK").sum()


Out[14]:
292

In [17]:
# vs1_df..vs4_df hold 65536 rows each; the 293 rows of vs5_df are added in the next cell.
vs1_df.shape[0] * 4


Out[17]:
262144

In [18]:
# Total sample count: the four full sets plus the 293 rows of vs5_df.
262144 + 293


Out[18]:
262437
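
The totals above can be cross-checked from the per-frame counts already printed in this notebook. The sketch below only re-adds numbers copied from the outputs above (the OK count from each value_counts() result and the non-OK count from the cell that follows it); the dictionary layout and variable names are illustrative.

```python
# (OK rows, non-OK rows) per frame, copied from the outputs above.
counts = {
    "vs1_df": (8007, 57529),
    "vs2_df": (8911, 56625),
    "vs3_df": (13924, 51612),
    "vs4_df": (23262, 42274),
    "vs5_df": (1, 292),
}

rows_per_frame = {name: ok + bad for name, (ok, bad) in counts.items()}
total_ok = sum(ok for ok, _ in counts.values())
total_detected = sum(bad for _, bad in counts.values())
total_rows = total_ok + total_detected

for name in sorted(rows_per_frame):
    print("{:s}: {:d} rows".format(name, rows_per_frame[name]))
print("OK: {:d}  detected: {:d}  total: {:d}".format(total_ok, total_detected, total_rows))
# float() keeps the division correct under Python 2 as well as Python 3.
print("Fraction labelled OK: {:.3f}".format(float(total_ok) / total_rows))
```

Each of the first four frames sums to 65536 rows, which is why multiplying vs1_df.shape[0] by 4 gives the same 262144 as adding the frames individually.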

In [ ]:
# Check that every disassembled .asm file has a matching header .txt file.
with open('/opt/vs/unpacked_file_list-vs251-252.txt', 'r') as fip:
    unpacked_list = fip.readlines()  # loaded here but not used by this check

file_list = sorted(os.listdir('/opt/vs/asm/'))
asm_list = [fname for fname in file_list if fname.endswith('.asm')]
hdr_list = [fname for fname in file_list if fname.endswith('.txt')]

print("Header list size: {:d}".format(len(hdr_list)))
print("ASM list size: {:d}".format(len(asm_list)))

# Base names of the header files, held in a set for fast membership tests.
hdr_names = {fname[:-len('.txt')] for fname in hdr_list}

counter = 0
errors = 0
for fname in asm_list:
    asm_name = fname[:-len('.asm')]
    if asm_name in hdr_names:
        print("Successful Disassembly for: {:s}".format(asm_name))
        counter += 1
    else:
        errors += 1

print("Total Successful Disassemblies: {:d} Total Disassembly Errors: {:d}".format(counter, errors))

In [13]:
# Check that every header .txt file has a matching disassembled .asm file.
with open('/opt/vs/unpacked_file_list-vs251-252.txt', 'r') as fip:
    unpacked_list = fip.readlines()  # loaded here but not used by this check

file_list = sorted(os.listdir('/opt/vs/asm/'))
asm_list = [fname for fname in file_list if fname.endswith('.asm')]
hdr_list = [fname for fname in file_list if fname.endswith('.txt')]

print("Header list size: {:d}".format(len(hdr_list)))
print("ASM list size: {:d}".format(len(asm_list)))

# Base names of the .asm files, held in a set for fast membership tests.
asm_names = {fname[:-len('.asm')] for fname in asm_list}

counter = 0
errors = 0
for fname in hdr_list:
    hdr_name = fname[:-len('.txt')]
    if hdr_name in asm_names:
        counter += 1
    else:
        errors += 1
        print("Failed Disassembly for: {:s}".format(hdr_name))

print("Total Successful Disassemblies: {:d} Total Disassembly Errors: {:d}".format(counter, errors))


Header list size: 792
ASM list size: 774
Failed Disassembly for: VirusShare_0003887ab64b8ae19ffa988638decac2
Failed Disassembly for: VirusShare_0025cc13683331a61986b6433e768f3f
Failed Disassembly for: VirusShare_006b4c72e79e60d10515a64ec6a4e021
Failed Disassembly for: VirusShare_00d574c8f6fe8453e0c57a8a731f15b4
Failed Disassembly for: VirusShare_01561d7971d10d2192e87b75a74980a4
Failed Disassembly for: VirusShare_018c4ec104af60efebd868c6c96c4015
Failed Disassembly for: VirusShare_027aceafdea60810bd493b91fad6d83b
Failed Disassembly for: VirusShare_028a2651d8a23f8a86c6a0440b817826
Failed Disassembly for: VirusShare_02acf1da2758c291fc377d4ea18efcce
Failed Disassembly for: VirusShare_02b88fab6d6a76e3f00e99d88b42e29e
Failed Disassembly for: VirusShare_02d15c11abb5ef375e9ac3e9f05a1a52
Failed Disassembly for: VirusShare_02e6357bc2e276c4113e6de1a5b1c69c
Failed Disassembly for: VirusShare_038ae293c2dd804f41f7f7305f37ebe2
Failed Disassembly for: VirusShare_03acebfbcabb20a76e707d585aaf8c49
Failed Disassembly for: VirusShare_6a4fbcfb44717eae2145c761c1c99b6a
Failed Disassembly for: VirusShare_af719814507fdca4b96184f33b6b92ea
Failed Disassembly for: VirusShare_d4ba6430996fb4021241efc97c607504
Failed Disassembly for: VirusShare_d8b7b276710127d233abcdb7313aac36
Total Successful Disassemblies: 774 Total Disassembly Errors: 18
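
The comparison of header and ASM file names can also be expressed as set operations. The sketch below is one possible formulation, not the notebook's own code: check_disassembly and stem are hypothetical helpers, and the sample file names are made up for illustration.

```python
def stem(name, suffix):
    """Strip a known suffix from a file name, leaving other names unchanged."""
    return name[:-len(suffix)] if name.endswith(suffix) else name


def check_disassembly(file_names, asm_ext=".asm", hdr_ext=".txt"):
    """Return (matched, missing) base names.

    matched: headers that have a corresponding .asm file.
    missing: headers with no .asm file, i.e. failed disassemblies.
    """
    asm = {stem(f, asm_ext) for f in file_names if f.endswith(asm_ext)}
    hdr = {stem(f, hdr_ext) for f in file_names if f.endswith(hdr_ext)}
    return sorted(hdr & asm), sorted(hdr - asm)


# Made-up directory listing: "b" has a header but no disassembly.
sample = ["a.asm", "a.txt", "b.txt", "c.asm", "notes.md"]
matched, missing = check_disassembly(sample)
print("Total Successful Disassemblies: {:d} Total Disassembly Errors: {:d}".format(
    len(matched), len(missing)))
for name in missing:
    print("Failed Disassembly for: {:s}".format(name))
```

Against the real directory the call would presumably be check_disassembly(os.listdir('/opt/vs/asm/')). Set intersection and difference avoid the nested loop and the found flag, and stripping the suffix by length avoids cutting a name at an earlier ".asm" or ".txt" inside it.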

In [ ]:
# Samples that failed disassembly (copied from the output above).
# VirusShare_0003887ab64b8ae19ffa988638decac2
# VirusShare_0025cc13683331a61986b6433e768f3f
# VirusShare_006b4c72e79e60d10515a64ec6a4e021
# VirusShare_00d574c8f6fe8453e0c57a8a731f15b4
# VirusShare_01561d7971d10d2192e87b75a74980a4
# VirusShare_018c4ec104af60efebd868c6c96c4015
# VirusShare_027aceafdea60810bd493b91fad6d83b
# VirusShare_028a2651d8a23f8a86c6a0440b817826
# VirusShare_02acf1da2758c291fc377d4ea18efcce
# VirusShare_02b88fab6d6a76e3f00e99d88b42e29e
# VirusShare_02d15c11abb5ef375e9ac3e9f05a1a52
# VirusShare_02e6357bc2e276c4113e6de1a5b1c69c
# VirusShare_038ae293c2dd804f41f7f7305f37ebe2
# VirusShare_03acebfbcabb20a76e707d585aaf8c49
# VirusShare_6a4fbcfb44717eae2145c761c1c99b6a
# VirusShare_af719814507fdca4b96184f33b6b92ea
# VirusShare_d4ba6430996fb4021241efc97c607504
# VirusShare_d8b7b276710127d233abcdb7313aac36

In [ ]: