Mach-O Binary Data Analysis

In this notebook we're going to explore, understand, and classify Mach-O files as being 'benign' or 'malicious'. We'll start by talking about how we can extract the artifacts we are interested in. Then we will explore the data, apply machine learning algorithms to the data, and finally use those results to further explore the data.

**DISCLAIMER:** This exercise is for illustrative purposes and only uses about 500 samples, which is too small for a generalizable model.

References

  • https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachORuntime/Reference/reference.html
  • https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachORuntime/Mach-O_File_Format.pdf
  • http://en.wikipedia.org/wiki/Mach-O

    Python Modules Used:

    • macholib: https://pythonhosted.org/macholib/
    • Pandas: Python Data Analysis Library (http://pandas.pydata.org)
    • Scikit Learn (http://scikit-learn.org) Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
    • Matplotlib: Python 2D plotting library (http://matplotlib.org)

    Data generation

      All data for this talk was generated using Artifactr, a highly scalable analysis engine. For more information, go to http://www.clicksecurity.com.

      IPython Notebook for this talk: http://clicksecurity.github.io/data_hacking


      Mach-O Basic Layout

      Basic Structure

      A Mach-O file contains three major regions:

      1. Header Structure
      2. Load Commands
      3. Segments

        Header Structure

        This structure identifies the file as a Mach-O file. The header also contains other basic file type information, indicates the target architecture, and contains flags specifying options that affect the interpretation of the rest of the file.

        Load Commands

        Directly following the header are a series of variable-size load commands that specify the layout and linkage characteristics of the file. Among other information, the load commands can specify:

        • The initial layout of the file in virtual memory
        • The location of the symbol table (used for dynamic linking)
        • The initial execution state of the main thread of the program
        • The names of shared libraries that contain definitions for the main executable’s imported symbols

          Segments

          All Mach-O files contain one or more segments. Each segment contains zero or more sections. Each section of a segment contains code or data. Each segment defines a region of virtual memory that the dynamic linker maps into the address space of the process. The exact number and layout of segments and sections is specified by the load commands and the file type.
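
          To make these regions concrete, here is a minimal sketch (not part of the original analysis pipeline) that walks the headers, load commands, and segment names of a binary with macholib; the attribute names follow the macholib API as I understand it and may differ between versions.

          from macholib.MachO import MachO
          from macholib.mach_o import LC_SEGMENT, LC_SEGMENT_64

          m = MachO('/bin/ls')                     # any Mach-O file on disk
          for header in m.headers:                 # one entry per architecture
              # the mach_header struct: ncmds/sizeofcmds describe the load commands
              print 'ncmds:', header.header.ncmds, 'sizeofcmds:', header.header.sizeofcmds
              for lc, cmd, data in header.commands:
                  # LC_SEGMENT / LC_SEGMENT_64 commands carry the segment layout
                  if lc.cmd in (LC_SEGMENT, LC_SEGMENT_64):
                      print '  segment:', cmd.segname.rstrip('\x00')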

          macholib

          Current owner: Ronald Oussoren
          Description: macholib can be used to analyze and edit Mach-O headers, the executable format used by Mac OS X.

          It wasn't the easiest way to get data from each load command and the other data I was interested in, so we forked the repo and added that ability.

          https://bitbucket.org/trogdorsey/macholib




          Imports and plot defaults

          
          
          In [1]:
          import pandas as pd
          print 'pandas version is', pd.__version__
          import numpy as np
          print 'numpy version is', np.__version__
          import sklearn
          print 'scikit-learn version is', sklearn.__version__
          import matplotlib
          print 'matplotlib version is', matplotlib.__version__
          import matplotlib.pyplot as plt
          
          
          
          
          pandas version is 0.13.1
          numpy version is 1.8.1
          scikit-learn version is 0.14.1
          matplotlib version is 1.3.1
          

          Plotting defaults and helper functions

          
          
          In [2]:
          %matplotlib inline
          plt.rcParams['font.size'] = 18.0
          plt.rcParams['figure.figsize'] = 16.0, 5.0
          
          
          
          In [3]:
          def plot_cm(cm, labels):
              # Compute percentages
              percent = (cm*100.0)/np.array(np.matrix(cm.sum(axis=1)).T)
              print 'Confusion Matrix Stats'
              for i, label_i in enumerate(labels):
                  for j, label_j in enumerate(labels):
                      print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())
          
              # Show confusion matrix
              # Thanks to kermit666 from stackoverflow
              fig = plt.figure()
              ax = fig.add_subplot(111)
              ax.grid(b=False)
              cax = ax.matshow(percent, cmap='coolwarm',vmin=0,vmax=100)
              plt.title('Confusion matrix of the classifier')
              fig.colorbar(cax)
              ax.set_xticklabels([''] + labels)
              ax.set_yticklabels([''] + labels)
              plt.xlabel('Predicted')
              plt.ylabel('True')
              plt.show()
          

          Function to extract the features we are interested in from the JSON blob

          
          
          In [4]:
          def extract_features(data):
              all_features = []
              if not 'error' in data['characteristics']['macho']:
                  for i in range(data['characteristics']['macho']['number of architectures']):
                      features = {}
                      features['entropy'] = data['metadata']['entropy']
                      features['file_size'] = data['metadata']['file_size']
                      features['number of architectures'] = data['characteristics']['macho']['number of architectures']
                      features['header size'] = data['characteristics']['macho']['header'][i]['size']
                      features['cmd count'] = data['characteristics']['macho']['header'][i]['mach header']['ncmds']
                      features['cmd size'] = data['characteristics']['macho']['header'][i]['mach header']['sizeofcmds']
                      features['cputype'] = data['characteristics']['macho']['header'][i]['mach header']['cputype_string']
                      for flag in data['characteristics']['macho']['header'][i]['mach header']['flags']:
                          features[flag['name']] = 1
                      features['string count'] = data['verbose']['macho']['header'][i]['const strings count']
                      if 'const strings' in data['verbose']['macho']['header'][i]:
                          features['const strings'] = data['verbose']['macho']['header'][i]['const strings']
                      else:
                          features['const strings'] = []
                          
                      if 'symbol table strings' in data['verbose']['macho']['header'][i]:
                          features['symbol table strings'] = data['verbose']['macho']['header'][i]['symbol table strings']
                      else:
                          features['symbol table strings'] = []
                      features['segment names'] = []
                      features['section names'] = []
                      for command in data['verbose']['macho']['header'][i]['commands']:
                          if command['cmd_name'] in ['LC_SEGMENT', 'LC_SEGMENT_64']:
                              features['segment names'].append(command['segname'])
                              if command['segname'] == '__PAGEZERO':
                                  features['pz_size'] = command['cmd_size']
                                  features['pz_vmsize'] = command['vmsize']
                                  features['pz_vmaddr'] = command['vmaddr']
                                  features['pz_flags'] = command['flags']
                                  features['pz_filesize'] = command['filesize']
                                  features['pz_nsects'] = command['nsects']
                                  features['pz_fileoff'] = command['fileoff']
                                  for flag in command['initprot']:
                                      features['pz_initprot_'+flag] = 1
                                  for flag in command['maxprot']:
                                      features['pz_maxprot_'+flag] = 1
                              if command['segname'] == '__TEXT':
                                  features['text_size'] = command['cmd_size']
                                  features['text_vmsize'] = command['vmsize']
                                  features['text_vmaddr'] = command['vmaddr']
                                  features['text_flags'] = command['flags']
                                  features['text_filesize'] = command['filesize']
                                  features['text_nsects'] = command['nsects']
                                  features['text_fileoff'] = command['fileoff']
                                  for flag in command['initprot']:
                                      features['text_initprot_'+flag] = 1
                                  for flag in command['maxprot']:
                                      features['text_maxprot_'+flag] = 1
                                  features['text_entropy'] = command['entropy']
                                  for section in command['sections']:
                                      if section['sectname'] == '__text':
                                          features['text_section_reloff'] = section['reloff']
                                          features['text_section_addr'] = section['addr']
                                          features['text_section_align'] = section['align']
                                          features['text_section_nreloc'] = section['nreloc']
                                          features['text_section_offset'] = section['offset']
                                          features['text_section_size'] = section['size']
                                          features['text_section_reserved1'] = section['reserved1']
                                          features['text_section_reserved2'] = section['reserved2']
                                          features['text_section_' + section['flags']['type']] = 1
                                          if 'attributes' in section['flags']:
                                              for attr in section['flags']['attributes']:
                                                  features['text_section_flag_attribute_'+attr] = 1
                                      if section['sectname'] == '__const':
                                          features['const_section_reloff'] = section['reloff']
                                          features['const_section_addr'] = section['addr']
                                          features['const_section_align'] = section['align']
                                          features['const_section_nreloc'] = section['nreloc']
                                          features['const_section_offset'] = section['offset']
                                          features['const_section_size'] = section['size']
                                          features['const_section_reserved1'] = section['reserved1']
                                          features['const_section_reserved2'] = section['reserved2']
                                          features['const_section_' + section['flags']['type']] = 1
                                          if 'attributes' in section['flags']:
                                              for attr in section['flags']['attributes']:
                                                  features['const_section_flag_attribute_'+attr] = 1
                              if command['segname'] == '__DATA' and command['nsects'] > 0:
                                  features['data_size'] = command['cmd_size']
                                  features['data_vmsize'] = command['vmsize']
                                  features['data_vmaddr'] = command['vmaddr']
                                  features['data_flags'] = command['flags']
                                  features['data_filesize'] = command['filesize']
                                  features['data_nsects'] = command['nsects']
                                  features['data_fileoff'] = command['fileoff']
                                  for flag in command['initprot']:
                                      features['data_initprot_'+flag] = 1
                                  for flag in command['maxprot']:
                                      features['data_maxprot_'+flag] = 1
                                  features['data_entropy'] = command['entropy']
                              #if command['segname'] == '__IMPORT':
                              #    features['import_size'] = command['cmd_size']
                              #    features['import_vmsize'] = command['vmsize']
                              #    features['import_vmaddr'] = command['vmaddr']
                              #    features['import_flags'] = command['flags']
                              #    features['import_filesize'] = command['filesize']
                              #    features['import_nsects'] = command['nsects']
                              #    features['import_fileoff'] = command['fileoff']
                              #    for flag in command['initprot']:
                              #        features['import_initprot_'+flag] = 1
                              #    for flag in command['maxprot']:
                              #        features['import_maxprot_'+flag] = 1
                              #    features['import_entropy'] = command['entropy']
                              if command['segname'] == '__LINKEDIT':
                                  features['linkedit_size'] = command['cmd_size']
                                  features['linkedit_vmsize'] = command['vmsize']
                                  features['linkedit_vmaddr'] = command['vmaddr']
                                  features['linkedit_flags'] = command['flags']
                                  features['linkedit_filesize'] = command['filesize']
                                  features['linkedit_nsects'] = command['nsects']
                                  features['linkedit_fileoff'] = command['fileoff']
                                  for flag in command['initprot']:
                                      features['linkedit_initprot_'+flag] = 1
                                  for flag in command['maxprot']:
                                      features['linkedit_maxprot_'+flag] = 1
                              
                              if 'sections' in command:
                                  for section in command['sections']:
                                      features['section names'].append(section['sectname'])
                          if command['cmd_name'] == 'LC_SYMTAB':
                              features['strsize'] = command['strsize']
                              features['stroff'] = command['stroff']
                              features['symoff'] = command['symoff']
                              features['nsyms'] = command['nsyms']
                          if command['cmd_name'] in ['LC_DYLD_INFO_ONLY', 'LC_DYLD_INFO']:
                              features['lazy_bind_size'] = command['lazy_bind_size']
                              features['rebase_size'] = command['rebase_size']
                              features['weak_bind_size'] = command['weak_bind_size']
                              features['lazy_bind_off'] = command['lazy_bind_off']
                              features['export_off'] = command['export_off']
                              features['export_size'] = command['export_size']
                              features['bind_off'] = command['bind_off']
                              features['rebase_off'] = command['rebase_off']
                              features['bind_size'] = command['bind_size']
                              features['weak_bind_off'] = command['weak_bind_off']
                          if command['cmd_name'] == 'LC_DYSYMTAB':
                              features['nextdefsym'] = command['nextdefsym']
                              features['extreloff'] = command['extreloff']
                              features['nlocrel'] = command['nlocrel']
                              features['indirectsymoff'] = command['indirectsymoff']
                              features['modtaboff'] = command['modtaboff']
                              features['iundefsym'] = command['iundefsym']
                              features['ntoc'] = command['ntoc']
                              features['ilocalsym'] = command['ilocalsym']
                              features['nundefsym'] = command['nundefsym']
                              features['nextrefsyms'] = command['nextrefsyms']
                              features['locreloff'] = command['locreloff']
                              features['nmodtab'] = command['nmodtab']
                              features['nlocalsym'] = command['nlocalsym']
                              features['tocoff'] = command['tocoff']
                              features['extrefsymoff'] = command['extrefsymoff']
                              features['nindirectsyms'] = command['nindirectsyms']
                              features['iextdefsym'] = command['iextdefsym']
                              features['nextrel'] = command['nextrel']
          
                      features.update(data['verbose']['macho']['header'][i]['command type count'])
                      if 'LC_SEGMENT' in features:
                          features['number of segments'] = features['LC_SEGMENT']
                      else:
                          features['number of segments'] = features['LC_SEGMENT_64']
                      if 'entry point' in data['verbose']['macho']['header'][i]:
                          for key, value in data['verbose']['macho']['header'][i]['entry point'].iteritems():
                              features['entry point ' + key] = value
                      all_features.append(features)
          
              return all_features
          
          
          
          In [5]:
          def load_files(file_list):
              import json
              features_list = []
              for filename in file_list:
                  with open(filename,'rb') as f:
                      features = extract_features(json.loads(f.read()))
                      features_list.extend(features)
              return features_list
          

          The set of good files came from my MacBook Pro running Mavericks.

          
          
          In [6]:
          # Good files
          import glob
          good_list = glob.glob('data/good/*.results')
          good_features = load_files(good_list)
          print "Files:", len(good_list)
          print "Number of feature vectors:", len(good_features)
          
          
          
          
          Files: 266
          Number of feature vectors: 337
          

          The set of bad files came from Contagio and VirusTotal.

          
          
          In [7]:
          # Bad files
          bad_list = glob.glob('data/bad/*.results')
          bad_features = load_files(bad_list)
          print "Files:", len(bad_list)
          print "Number of feature vectors:", len(bad_features)
          
          
          
          
          Files: 261
          Number of feature vectors: 302
          

          Wait, I have more feature vectors than files?

          Mach-O supports fat binaries: binaries that support multiple architectures. A single fat binary can contain, for example, both x86 and PowerPC code in 32-bit and 64-bit variants.

          There are 3 (or 6 when you take endianness into consideration) different magic values for Mach-O binaries: one (two) for 32-bit binaries, one (two) for 64-bit binaries, and one (two) for fat binaries.

          The magic value for fat binaries is cafebabe, which you might recognize as the magic value for Java class files.
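
          As a rough illustration (this helper is not from the original notebook), the magic value can be checked by reading the first four bytes of a file; the constants below are the standard Mach-O magic values in both byte orders.

          import struct

          # Standard Mach-O magic values in both byte orders. Note that 0xcafebabe
          # is also the Java class file magic, so this check alone is ambiguous for
          # fat binaries vs. class files.
          MACHO_MAGICS = {
              0xfeedface: '32-bit', 0xcefaedfe: '32-bit',
              0xfeedfacf: '64-bit', 0xcffaedfe: '64-bit',
              0xcafebabe: 'fat',    0xbebafeca: 'fat',
          }

          def macho_kind(path):
              with open(path, 'rb') as f:
                  magic = struct.unpack('>I', f.read(4))[0]   # first 4 bytes, big-endian
              return MACHO_MAGICS.get(magic, 'not a Mach-O file')

          print macho_kind('/bin/ls')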

          
          
          In [8]:
          df_good_orig = pd.DataFrame.from_records(good_features)
          df_good_orig['label'] = 'good'
          df_good_orig.head()
          
          
          
          
          Out[8]:
          LC_CODE_SEGMENT_SPLIT_INFO LC_CODE_SIGNATURE LC_DATA_IN_CODE LC_DYLD_INFO LC_DYLD_INFO_ONLY LC_DYLIB_CODE_SIGN_DRS LC_DYSYMTAB LC_FUNCTION_STARTS LC_ID_DYLIB LC_LOAD_DYLIB LC_LOAD_DYLINKER LC_LOAD_WEAK_DYLIB LC_MAIN LC_REEXPORT_DYLIB LC_RPATH LC_SEGMENT LC_SEGMENT_64 LC_SOURCE_VERSION LC_SYMTAB LC_UNIXTHREAD
          0 NaN 1 1 NaN 1 1 1 1 NaN 1 1 NaN 1 NaN NaN NaN 4 1 1 NaN ...
          1 NaN 1 1 NaN 1 1 1 1 NaN 9 1 NaN 1 NaN NaN NaN 4 1 1 NaN ...
          2 NaN 1 1 NaN 1 1 1 1 NaN 9 1 NaN 1 NaN NaN 4 NaN 1 1 NaN ...
          3 NaN 1 1 NaN 1 1 1 1 NaN 5 1 NaN 1 NaN NaN NaN 4 1 1 NaN ...
          4 NaN 1 1 NaN 1 1 1 1 NaN 1 1 NaN 1 NaN NaN NaN 4 1 1 NaN ...

          5 rows × 148 columns

          
          
          In [9]:
          df_good = df_good_orig
          for col in df_good.columns:
              if col[0:3] in ['LC_', 'MH_']:
                  #print col
                  df_good[col].fillna(0, inplace=True)
                  
          df_good.fillna(-1, inplace=True)
          df_good_orig.head()
          
          
          
          
          Out[9]:
          LC_CODE_SEGMENT_SPLIT_INFO LC_CODE_SIGNATURE LC_DATA_IN_CODE LC_DYLD_INFO LC_DYLD_INFO_ONLY LC_DYLIB_CODE_SIGN_DRS LC_DYSYMTAB LC_FUNCTION_STARTS LC_ID_DYLIB LC_LOAD_DYLIB LC_LOAD_DYLINKER LC_LOAD_WEAK_DYLIB LC_MAIN LC_REEXPORT_DYLIB LC_RPATH LC_SEGMENT LC_SEGMENT_64 LC_SOURCE_VERSION LC_SYMTAB LC_UNIXTHREAD
          0 0 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 4 1 1 0 ...
          1 0 1 1 0 1 1 1 1 0 9 1 0 1 0 0 0 4 1 1 0 ...
          2 0 1 1 0 1 1 1 1 0 9 1 0 1 0 0 4 0 1 1 0 ...
          3 0 1 1 0 1 1 1 1 0 5 1 0 1 0 0 0 4 1 1 0 ...
          4 0 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 4 1 1 0 ...

          5 rows × 148 columns

          
          
          In [10]:
          df_bad_orig = pd.DataFrame.from_records(bad_features)
          df_bad_orig['label'] = 'bad'
          df_bad = df_bad_orig
          for col in df_bad.columns:
              if col[0:3] in ['LC_', 'MH_']:
                  #print col
                  df_bad[col].fillna(0, inplace=True)
                  
          df_bad.fillna(-1, inplace=True)
          df_bad.head()
          
          
          
          
          Out[10]:
          LC_CODE_SIGNATURE LC_DATA_IN_CODE LC_DYLD_INFO LC_DYLD_INFO_ONLY LC_DYSYMTAB LC_ENCRYPTION_INFO LC_FUNCTION_STARTS LC_ID_DYLIB LC_LAZY_LOAD_DYLIB LC_LOAD_DYLIB LC_LOAD_DYLINKER LC_SEGMENT LC_SEGMENT_64 LC_SYMTAB LC_TWOLEVEL_HINTS LC_UNIXTHREAD LC_UUID LC_VERSION_MIN_IPHONEOS LC_VERSION_MIN_MACOSX MH_BINDATLOAD
          0 0 0 1 0 1 0 0 0 0 8 1 0 4 1 0 1 1 0 0 0 ...
          1 0 0 1 0 1 0 0 1 0 8 0 0 3 1 0 0 1 0 0 0 ...
          2 1 0 0 1 1 0 0 0 0 2 1 0 4 1 0 1 1 0 0 0 ...
          3 1 0 0 1 1 0 0 0 0 2 1 4 0 1 0 1 1 0 0 0 ...
          4 1 1 0 1 1 0 1 0 0 11 0 4 0 1 0 0 1 0 1 0 ...

          5 rows × 153 columns

          Let's explore the data.

          
          
          In [11]:
          df_good['cputype'].value_counts().plot(kind='bar')
          
          
          
          
          Out[11]:
          <matplotlib.axes.AxesSubplot at 0x112d9ec90>
          
          
          In [12]:
          df_bad['cputype'].value_counts().plot(kind='bar')
          
          
          
          
          Out[12]:
          <matplotlib.axes.AxesSubplot at 0x112db1cd0>

          Let's get rid of the PowerPC cputype because that switch happened 9 years ago. I think we can all agree that malware authors won't be targeting that CPU in the future. And we probably want to examine malware for iOS devices separately.

          
          
          In [13]:
          df_all_orig = pd.concat([df_bad, df_good], ignore_index=True)
          df_all = df_all_orig
          for col in df_all.columns:
              if col[0:3] in ['LC_', 'MH_']:
                  #print col
                  df_all[col].fillna(0, inplace=True)
                  
          df_all.fillna(-1, inplace=True)
          # Break out by cpu type
          cond = df_all['cputype'] == 'x86_64' 
          df_x64 = df_all[cond]
          cond = df_all['cputype'] == 'i386'
          df_x86 = df_all[cond]
          df_all = pd.concat([df_x64, df_x86], ignore_index=True)
          #print df['symbol table strings']
          df = df_all.drop(['const strings', 'symbol table strings', 'segment names',
                            'section names', 'entry point section name', 'entry point segment name',
                            'entry point instruction type'], axis=1)
          df['cputype'].value_counts().plot(kind='bar')
          
          
          
          
          Out[13]:
          <matplotlib.axes.AxesSubplot at 0x112fe7b50>

          Let's start to blindly explore the data

          
          
          In [14]:
          cond = df['label'] == 'good'
          good = df[cond]
          bad  = df[~cond]
          bad['cmd count'].hist(alpha=.5,label='bad',bins=40)
          good['cmd count'].hist(alpha=.5,label='good',bins=40)
          plt.legend()
          
          
          
          
          Out[14]:
          <matplotlib.legend.Legend at 0x11338d1d0>
          
          
          In [15]:
          df.boxplot('cmd count', 'label')
          plt.xlabel('bad vs. good files')
          plt.ylabel('Number of Commands')
          plt.title('Comparison of Number of Commands')
          plt.suptitle('')
          
          
          
          
          Out[15]:
          <matplotlib.text.Text at 0x1135dca90>
          
          
          In [16]:
          df.boxplot(column='cmd size', by='label')
          plt.xlabel('bad vs. good files')
          plt.ylabel('Command Size')
          plt.title('Comparison of Command Sizes')
          plt.suptitle("")
          
          
          
          
          Out[16]:
          <matplotlib.text.Text at 0x11374e810>
          
          
          In [17]:
          # Split the classes up so we can set colors, size, labels
          cond = df['label'] == 'good'
          good = df[cond]
          bad  = df[~cond]
          plt.scatter(good['cmd count'], good['cmd size'], 
                      s=140, c='#aaaaff', label='Good', alpha=.4)
          plt.scatter(bad['cmd count'], bad['cmd size'], 
                      s=40, c='r', label='Bad', alpha=.5)
          plt.legend()
          plt.xlabel('Command Count')
          plt.ylabel('Command Size')
          plt.title('Command Size Vs Command Count')
          
          
          
          
          Out[17]:
          <matplotlib.text.Text at 0x1138d4690>
          
          
          In [18]:
          df.boxplot('entropy', 'label')
          plt.xlabel('bad vs. good files')
          plt.ylabel('Entropy')
          plt.title('Comparison of Entropy')
          plt.suptitle('')
          
          
          
          
          Out[18]:
          <matplotlib.text.Text at 0x113519b90>
          
          
          In [19]:
          df.boxplot('number of segments', 'label')
          plt.xlabel('bad vs. good files')
          plt.ylabel('Number of Segments')
          plt.title('Comparison of Number of Segments')
          plt.suptitle('')
          
          
          
          
          Out[19]:
          <matplotlib.text.Text at 0x11412ce50>
          
          
          In [20]:
          cond = df['label'] == 'good'
          good = df[cond]
          bad  = df[~cond]
          plt.scatter(good['entropy'], good['number of segments'], 
                      s=140, c='#aaaaff', label='Good', alpha=.4)
          plt.scatter(bad['entropy'], bad['number of segments'], 
                      s=40, c='r', label='Bad', alpha=.5)
          plt.legend()
          plt.xlabel('Entropy')
          plt.ylabel('Number of Segments')
          plt.title('Number of Segments Vs Entropy')
          
          
          
          
          Out[20]:
          <matplotlib.text.Text at 0x1142b6790>

          First we try classifying binaries just based on cmd count and cmd size

          
          
          In [21]:
          # List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
          X_cmd = df.as_matrix(['cmd count', 'cmd size'])
          
          # Labels (scikit learn uses 'y' for classification labels)
          y = np.array(df['label'].tolist())
          
          
          
          In [22]:
          # Random Forest is a popular ensemble machine learning classifier.
          # http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
          #
          import sklearn.ensemble
          clf_cmd = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
          
          
          
          In [23]:
          # Now we can use scikit learn's cross validation to assess predictive performance.
          import sklearn.cross_validation
          scores = sklearn.cross_validation.cross_val_score(clf_cmd, X_cmd, y, cv=10)
          print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
          
          
          
          
          Accuracy: 0.89 (+/- 0.07)
          
          
          
          In [31]:
          my_seed = 1022
          my_tsize = .2
          from sklearn.cross_validation import train_test_split
          X_train, X_test, y_train, y_test = train_test_split(X_cmd, y, test_size=my_tsize, random_state=my_seed)
          clf_cmd.fit(X_train, y_train)
          clf_cmd_scores = clf_cmd.score(X_test, y_test)
          print("Accuracy: %0.2f" % clf_cmd_scores)
          y_pred = clf_cmd.predict(X_test)
          
          from sklearn.metrics import confusion_matrix
          labels = ['good', 'bad']
          cm = confusion_matrix(y_test, y_pred, labels)
          plot_cm(cm, labels)
          
          
          
          
          Accuracy: 0.91
          Confusion Matrix Stats
          good/good: 92.75% (64/69)
          good/bad: 7.25% (5/69)
          bad/good: 11.63% (5/43)
          bad/bad: 88.37% (38/43)
          

          Now try number of segments and entropy. This shouldn't work all that great.

          
          
          In [32]:
          clf_2 = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
          X_2 = df.as_matrix(['number of segments', 'entropy'])
          X_train, X_test, y_train, y_test = train_test_split(X_2, y, test_size=my_tsize, random_state=my_seed)
          clf_2.fit(X_train, y_train)
          y_pred = clf_2.predict(X_test)
          
          from sklearn.metrics import confusion_matrix
          labels = ['good', 'bad']
          cm = confusion_matrix(y_test, y_pred, labels)
          plot_cm(cm, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 73.91% (51/69)
          good/bad: 26.09% (18/69)
          bad/good: 18.60% (8/43)
          bad/bad: 81.40% (35/43)
          

          Okay now try putting in ALL the features... except the label, which would be cheating. And the cputype, which is a string.

          
          
          In [33]:
          clf_all = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
          no_label = list(df.columns.values)
          no_label.remove('label')
          no_label.remove('cputype')
          X = df.as_matrix(no_label)
          
          # 80/20 Split for predictive test
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
          clf_all.fit(X_train, y_train)
          y_pred = clf_all.predict(X_test)
          cm = confusion_matrix(y_test, y_pred, labels)
          plot_cm(cm, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 92.75% (64/69)
          good/bad: 7.25% (5/69)
          bad/good: 6.98% (3/43)
          bad/bad: 93.02% (40/43)
          
          
          
          In [34]:
          # Feature Selection
          # Which features best differentiated the two classes?
          # Here we're going to grab the feature_importances from the classifier itself, 
          importances = zip(no_label, clf_all.feature_importances_)
          importances.sort(key=lambda k:k[1], reverse=True)
          for im in importances[0:20]:
              print im[0].ljust(30), im[1]
          
          
          
          
          LC_UNIXTHREAD                  0.11111604246
          LC_CODE_SIGNATURE              0.0500117721965
          rebase_off                     0.0407851853834
          LC_MAIN                        0.0385643285089
          text_section_align             0.0339347105743
          LC_DYLIB_CODE_SIGN_DRS         0.0335313105805
          LC_DYLD_INFO_ONLY              0.0304556902729
          linkedit_filesize              0.029905256752
          MH_PIE                         0.0226196128034
          LC_SOURCE_VERSION              0.0198478375291
          pz_size                        0.0178616171746
          rebase_size                    0.017668781554
          text_section_addr              0.0159335856278
          data_entropy                   0.0156907971405
          LC_FUNCTION_STARTS             0.0150161007743
          data_size                      0.0149447133839
          linkedit_vmsize                0.014562058373
          const_section_size             0.0135111773331
          extreloff                      0.0132383705557
          LC_VERSION_MIN_MACOSX          0.0127062868088
          

          Closer look at important features

          The LC_UNIXTHREAD command defines the entry point into the executable. In OS X 10.8 and above, it was replaced by LC_MAIN. As stated above, I grabbed my clean binaries from my laptop, which is running Mavericks (10.9). The malware was grabbed from Contagio and VirusTotal, with no thought given to what OS version it was compiled for.
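
          Given that provenance, a quick way to see how the two entry-point commands are split across the labels, before looking at the value-count plots below, is a simple cross-tabulation (a sketch that is not in the original notebook; it assumes the df_all frame built earlier).

          # count samples per label that carry at least one LC_MAIN vs. LC_UNIXTHREAD command
          print pd.crosstab(df_all['label'], df_all['LC_MAIN'] > 0)
          print pd.crosstab(df_all['label'], df_all['LC_UNIXTHREAD'] > 0)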

          
          
          In [35]:
          df_good['LC_UNIXTHREAD'].value_counts().plot(kind='bar', label='good')
          plt.legend()
          
          
          
          
          Out[35]:
          <matplotlib.legend.Legend at 0x11410c9d0>
          
          
          In [36]:
          df_bad['LC_UNIXTHREAD'].value_counts().plot(kind='bar')
          
          
          
          
          Out[36]:
          <matplotlib.axes.AxesSubplot at 0x1155239d0>

          My clean set clearly has a sample bias towards LC_MAIN. We do not want to overfit our model, and since we would like our classifier to be agnostic to OS version, we will remove those features.

          Overfitting problems: https://www.youtube.com/watch?v=DQWI1kvmwRg

          Now try all the same features as before, but without LC_UNIXTHREAD and LC_MAIN

          
          
          In [42]:
          clf_some = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
          some = list(df.columns.values)
          some.remove('label')
          some.remove('cputype')
          some.remove('LC_UNIXTHREAD')
          some.remove('LC_MAIN')
          X_some = df.as_matrix(some)
          
          # 80/20 Split for predictive test
          X_train_some, X_test_some, y_train_some, y_test_some = train_test_split(X_some, y, test_size=my_tsize, random_state=my_seed)
          clf_some.fit(X_train_some, y_train_some)
          y_pred_some = clf_some.predict(X_test_some)
          cm = confusion_matrix(y_test_some, y_pred_some, labels)
          plot_cm(cm, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 95.65% (66/69)
          good/bad: 4.35% (3/69)
          bad/good: 6.98% (3/43)
          bad/bad: 93.02% (40/43)
          

          Hey, it's the same(ish). That's kinda nice.

          
          
          In [43]:
          importances = zip(some, clf_some.feature_importances_)
          importances.sort(key=lambda k:k[1], reverse=True)
          for im in importances[0:20]:
              print im[0].ljust(30), im[1]
          
          
          
          
          rebase_off                     0.0833325758996
          text_section_align             0.0428642619275
          LC_SOURCE_VERSION              0.0414863657063
          LC_DYLIB_CODE_SIGN_DRS         0.0390404405528
          linkedit_filesize              0.0343569215656
          rebase_size                    0.0268211203307
          LC_FUNCTION_STARTS             0.0258331878885
          LC_DATA_IN_CODE                0.0257389166901
          LC_VERSION_MIN_MACOSX          0.0248798919781
          LC_CODE_SIGNATURE              0.0231596346234
          pz_size                        0.0230647740647
          LC_DYLD_INFO_ONLY              0.0227836821431
          MH_PIE                         0.0197788472089
          data_nsects                    0.0193926037336
          pz_vmaddr                      0.0180289427176
          lazy_bind_size                 0.0156826479795
          linkedit_vmsize                0.0142413695731
          cmd count                      0.0137541943859
          nlocrel                        0.0130961872487
          bind_off                       0.012773950971
          

          What about the strings?

          
          
          In [44]:
          symbol_strings = []
          for strings, label in zip(df_all['symbol table strings'], df_all['label']):
              for symbol in strings:
                  symbol_strings.append({'symbol string': symbol, 'label': label})
          
          pd_symbols = pd.DataFrame.from_records(symbol_strings)
          pd_symbols.head()
          
          
          
          
          Out[44]:
          label symbol string
          0 bad __mh_execute_header
          1 bad _AuthorizationCreate
          2 bad _AuthorizationExecuteWithPrivileges
          3 bad _BIO_ctrl
          4 bad _BIO_f_base64

          5 rows × 2 columns

          
          
          In [45]:
          import sys
          sys.path.insert(0,'..')
          import data_hacking.simple_stats as ss
          
          
          
          In [46]:
          # Spin up our g_test class
          g_test = ss.GTest()
          
          # Here we'd like to see how strongly various strings are associated with being clean or malware.
          df_ct, df_cd, df_symbol_stats = g_test.highest_gtest_scores(pd_symbols['symbol string'],
                                                                      pd_symbols['label'], N=0, matches=0, min_volume=5)
          
          
          
          In [47]:
          df_symbol_stats.dropna(inplace=True)
          df_symbol_stats.sort('bad_g', ascending=0).head(25)
          
          
          
          
          Out[47]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          H1\x8d5A`\x02 4836 0 1.000000 0.000000 4836 3570.141890 2935.283635 1265.858110 0.000000
          L\x89\xe7\xff\x158`\x02 4836 0 1.000000 0.000000 4836 3570.141890 2935.283635 1265.858110 0.000000
          __ZStL8__ioinit 160 0 1.000000 0.000000 160 118.118838 97.114430 41.881162 0.000000
          ___tcf_0 113 6 0.949580 0.050420 119 87.850886 56.894804 31.149114 -19.764316
          __ZN5boost6systemL14posix_categoryE 92 0 1.000000 0.000000 92 67.918332 55.840797 24.081668 0.000000
          __ZN5boost6systemL11native_ecatE 92 0 1.000000 0.000000 92 67.918332 55.840797 24.081668 0.000000
          __ZN5boost6systemL10errno_ecatE 92 0 1.000000 0.000000 92 67.918332 55.840797 24.081668 0.000000
          __GLOBAL__I_a 94 1 0.989474 0.010526 95 70.133060 55.065291 24.866940 -6.427078
          ___cxx_global_var_init 90 0 1.000000 0.000000 90 66.441847 54.626867 23.558153 0.000000
          __Z41__static_initialization_and_destruction_0ii 111 7 0.940678 0.059322 118 87.112643 53.796854 30.887357 -20.782115
          ___cxx_global_var_init2 88 0 1.000000 0.000000 88 64.965361 53.412936 23.034639 0.000000
          ___cxx_global_var_init3 88 0 1.000000 0.000000 88 64.965361 53.412936 23.034639 0.000000
          ___cxx_global_var_init4 88 0 1.000000 0.000000 88 64.965361 53.412936 23.034639 0.000000
          ___cxx_global_var_init1 88 0 1.000000 0.000000 88 64.965361 53.412936 23.034639 0.000000
          __ZN12_GLOBAL__N_12_1E 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          __ZN12_GLOBAL__N_12_5E 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          ___cxx_global_var_init8 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          __ZN12_GLOBAL__N_12_9E 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          __ZN12_GLOBAL__N_12_2E 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          ___cxx_global_var_init9 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          __ZN12_GLOBAL__N_12_8E 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          ___cxx_global_var_init11 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          __ZN12_GLOBAL__N_12_7E 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          ___cxx_global_var_init6 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000
          ___cxx_global_var_init5 86 0 1.000000 0.000000 86 63.488876 52.199006 22.511124 0.000000

          25 rows × 9 columns
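
          The bad_g and good_g columns above are per-cell G-test contributions, 2 * observed * ln(observed / expected), where the expected counts appear to come from the overall bad/good proportions of the string population. A minimal sketch of that per-cell calculation follows (the real work is done by the GTest class in data_hacking.simple_stats); the example numbers are taken from the '___tcf_0' row above.

          import math

          def g_contribution(observed, expected):
              # per-cell G-test contribution: 2 * O * ln(O / E); zero counts contribute nothing
              if observed == 0 or expected == 0:
                  return 0.0
              return 2.0 * observed * math.log(float(observed) / expected)

          # the '___tcf_0' row above: 113 bad hits vs. ~87.85 expected
          print g_contribution(113, 87.850886)   # ~56.89, matching the bad_g column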

          
          
          In [48]:
          df_symbol_stats.sort('good_g', ascending=0).head(25)
          
          
          
          
          Out[48]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          dyld_stub_binder 173 276 0.385301 0.614699 449 331.470990 -224.986050 117.529010 471.251050
          radr://5614542 107 236 0.311953 0.688047 343 253.217260 -184.343670 89.782740 456.159232
          ___stderrp 35 187 0.157658 0.842342 222 163.889888 -108.069271 58.110112 437.119579
          _strcmp 51 169 0.231818 0.768182 220 162.413403 -118.148571 57.586597 363.893795
          ___stack_chk_fail 92 190 0.326241 0.673759 282 208.184453 -150.261006 73.815547 359.272784
          ___stack_chk_guard 92 190 0.326241 0.673759 282 208.184453 -150.261006 73.815547 359.272784
          _fprintf 39 156 0.200000 0.800000 195 143.957334 -101.864515 51.042666 348.564572
          ___stdoutp 8 120 0.062500 0.937500 128 94.495071 -39.505698 33.504929 306.191801
          __mh_execute_header 218 225 0.492099 0.507901 443 327.041534 -176.838162 115.958466 298.290746
          _printf 46 123 0.272189 0.727811 169 124.763023 -91.795275 44.236977 251.565340
          ___stdinp 0 93 0.000000 1.000000 93 68.656575 0.000000 24.343425 249.302811
          _fwrite 55 116 0.321637 0.678363 171 126.239509 -91.393256 44.760491 220.925324
          _fputc 32 101 0.240602 0.759398 133 98.186284 -71.752361 34.813716 215.152034
          ___error 85 127 0.400943 0.599057 212 156.507461 -103.776912 55.492539 210.296380
          ___snprintf_chk 0 78 0.000000 1.000000 78 57.582934 0.000000 20.417066 209.092680
          _exit 230 189 0.548926 0.451074 419 309.323708 -136.302148 109.676292 205.712812
          _puts 27 89 0.232759 0.767241 116 85.636158 -62.330620 30.363842 191.418329
          _optind 5 74 0.063291 0.936709 79 58.321176 -24.565273 20.678824 188.693330
          _strncmp 28 88 0.241379 0.758621 116 85.636158 -62.602573 30.363842 187.278840
          _putchar 6 74 0.075000 0.925000 80 59.059419 -27.441415 20.940581 186.831670
          _optarg 1 69 0.014286 0.985714 70 51.676992 -7.890025 18.323008 182.980956
          _strtol 8 73 0.098765 0.901235 81 59.797662 -32.184400 21.202338 180.506805
          _stat$INODE64 21 80 0.207921 0.792079 101 74.562517 -53.218850 26.437483 177.159010
          _strlen 171 150 0.532710 0.467290 321 236.975920 -111.592881 84.024080 173.859559
          ___maskrune 15 73 0.170455 0.829545 88 64.965361 -43.974121 23.034639 168.405207

          25 rows × 9 columns

          
          
          In [49]:
          def tokenize_string(string):
              if '___cxx_global_var_init' in string:
                  return '___cxx_global_var_init_TOKEN'
              elif '__ZN12_GLOBAL' in string:
                  return '__ZN12_GLOBAL_TOKEN'
              else:
                  return string
          
          
          
          In [50]:
          pd_symbols['tokenized string'] = pd_symbols['symbol string'].map(lambda x: tokenize_string(x))
          
          
          
          In [51]:
          df_ct, df_cd, df_symbol_stats = g_test.highest_gtest_scores(pd_symbols['tokenized string'],
                                                               pd_symbols['label'], N=10, matches=5, min_volume=5)
          
          
          
          In [52]:
          df_symbol_stats.dropna(inplace=True)
          df_symbol_stats.sort('bad_g', ascending=0).head(10)
          
          
          
          
          Out[52]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          L\x89\xe7\xff\x158`\x02 4836 0 1.000000 0.000000 4836 4428.513498 851.367726 407.486502 0.000000
          H1\x8d5A`\x02 4836 0 1.000000 0.000000 4836 4428.513498 851.367726 407.486502 0.000000
          ___cxx_global_var_init_TOKEN 1560 0 1.000000 0.000000 1560 1428.552741 274.634750 131.447259 0.000000
          __ZN12_GLOBAL_TOKEN 870 5 0.994286 0.005714 875 801.271570 143.190314 73.728430 -26.909506
          _malloc 160 130 0.551724 0.448276 290 265.564292 -162.138608 24.435708 434.587122
          _strlen 171 150 0.532710 0.467290 321 293.952199 -185.279734 27.047801 513.908873
          radr://5614542 107 236 0.311953 0.688047 343 314.098455 -230.451818 28.901545 991.170152
          _exit 230 189 0.548926 0.451074 419 383.694615 -235.413120 35.305385 634.174947
          __mh_execute_header 218 225 0.492099 0.507901 443 405.672349 -270.778127 37.327651 808.364717
          dyld_stub_binder 173 276 0.385301 0.614699 449 411.166783 -299.534738 37.833217 1096.941787

          10 rows × 9 columns

          
          
          In [53]:
          def g_aggregate(df_stats, sequence, name):
              # Average the per-string G-scores for a sample; if any string in the
              # sequence is missing from the stats table, the whole sample falls back to 0
              try:
                  g_scores = [df_stats.ix[tokenize_string(item)][name] for item in sequence]
              except KeyError:
                  return 0
              return sum(g_scores)/len(g_scores) if g_scores else 0 # Average
          
          
          
          In [54]:
          df_all['symbol table strings malicious_g'] = \
              df_all['symbol table strings'].map(lambda x: g_aggregate(df_symbol_stats, x, 'bad_g'))
          df_all['symbol table strings clean_g'] = \
              df_all['symbol table strings'].map(lambda x: g_aggregate(df_symbol_stats, x, 'good_g'))
          
          
          
          In [55]:
          segment_names = []
          for strings, label in zip(df_all['segment names'], df_all['label']):
              for name in strings:
                  segment_names.append({'segment names': name, 'label': label})
          
          pd_segment_names = pd.DataFrame.from_records(segment_names)
          pd_segment_names.head()
          
          
          
          
          Out[55]:
          label segment names
          0 bad __PAGEZERO
          1 bad __TEXT
          2 bad __DATA
          3 bad __LINKEDIT
          4 bad __TEXT

          5 rows × 2 columns

          
          
          In [56]:
          #Spin up our g_test class
          g_test = ss.GTest()
          
          df_ct, df_cd, df_segment_stats = g_test.highest_gtest_scores(pd_segment_names['segment names'],
                                                               pd_segment_names['label'], N=0, matches=0, min_volume=0)
          
          
          
          In [57]:
          df_segment_stats.dropna(inplace=True)
          df_segment_stats.sort('bad_g', ascending=0).head(10)
          
          
          
          
          Out[57]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          __OBJC 55 17 0.763889 0.236111 72 34.263158 52.058877 37.736842 -27.112400
          __IMPORT 48 28 0.631579 0.368421 76 36.166667 27.174060 39.833333 -19.739976
          __PAGEZERO 218 225 0.492099 0.507901 443 210.813596 14.615046 232.186404 -14.148050
          3 0 1.000000 0.000000 3 1.427632 4.455573 1.572368 0.000000
          __INIT_STUB 2 0 1.000000 0.000000 2 0.951754 2.970382 1.048246 0.000000
          .ida 0 4 0.000000 1.000000 4 1.903509 0.000000 2.096491 5.168234
          __LINKEDIT 253 305 0.453405 0.546595 558 265.539474 -24.477253 292.460526 25.609064
          __TEXT 253 305 0.453405 0.546595 558 265.539474 -24.477253 292.460526 25.609064
          __DATA 253 311 0.448582 0.551418 564 268.394737 -29.889069 295.605263 31.577645

          9 rows × 9 columns

          
          
          In [58]:
          s = []
          for strings, label in zip(df_all['section names'], df_all['label']):
              for name in strings:
                  s.append({'section names': name, 'label': label})
          
          pd_section_names = pd.DataFrame.from_records(s)
          print pd_section_names.shape
          pd_section_names.head()
          
          
          
          
          (8289, 2)
          
          Out[58]:
          label section names
          0 bad __text
          1 bad __symbol_stub1
          2 bad __stub_helper
          3 bad __cstring
          4 bad __unwind_info

          5 rows × 2 columns

          
          
          In [59]:
          #Spin up our g_test class
          g_test = ss.GTest()
          
          # Here we'd like to see how strongly various section names are associated
          # with being clean or malware.
          df_ct, df_cd, df_section_stats = g_test.highest_gtest_scores(pd_section_names['section names'],
                                                               pd_section_names['label'], N=0, matches=0, min_volume=5)
          
          
          
          In [60]:
          df_section_stats.dropna(inplace=True)
          df_section_stats.sort('bad_g', ascending=0).head(10)
          
          
          
          
          Out[60]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          __dyld 192 53 0.783673 0.216327 245 125.706961 162.640082 119.293039 -85.996853
          __symbol_stub1 55 1 0.982143 0.017857 56 28.733020 71.421483 27.266980 -6.611353
          __symbol_stub 124 62 0.666667 0.333333 186 95.434672 64.936223 90.565328 -46.988195
          __common 212 156 0.576087 0.423913 368 188.816986 49.102615 179.183014 -43.228130
          __instance_vars 43 6 0.877551 0.122449 49 25.141392 46.154870 23.858608 -16.564627
          __cls_refs 50 12 0.806452 0.193548 62 31.811557 45.219334 30.188443 -22.141260
          __message_refs 50 12 0.806452 0.193548 62 31.811557 45.219334 30.188443 -22.141260
          __module_info 51 13 0.796875 0.203125 64 32.837737 44.905221 31.162263 -22.730721
          __meta_class 44 8 0.846154 0.153846 52 26.680661 44.022056 25.319339 -18.434031
          __symbols 44 8 0.846154 0.153846 52 26.680661 44.022056 25.319339 -18.434031

          10 rows × 9 columns

          
          
          In [61]:
          df_section_stats.sort('good_g', ascending=0).head(10)
          
          
          
          
          Out[61]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          __stubs 28 210 0.117647 0.882353 238 122.115334 -82.474641 115.884666 249.695077
          __got 29 171 0.145000 0.855000 200 102.617927 -73.295575 97.382073 192.553326
          __unwind_info 192 301 0.389452 0.610548 493 252.953191 -105.872288 240.046809 136.218346
          __stub_helper 192 273 0.412903 0.587097 465 238.586681 -83.419129 226.413319 102.161853
          __nl_symbol_ptr 198 274 0.419492 0.580508 472 242.178309 -79.757266 229.821691 96.351745
          __const 272 341 0.443719 0.556281 613 314.523947 -79.020434 298.476053 90.837472
          __la_symbol_ptr 204 273 0.427673 0.572327 477 244.743757 -74.293446 232.256243 88.250273
          __cstring 242 302 0.444853 0.555147 544 279.120762 -69.070092 264.879238 79.216414
          __eh_frame 126 186 0.403846 0.596154 312 160.083967 -60.332974 151.916033 75.299763
          __text 255 304 0.456172 0.543828 559 286.817107 -59.966417 272.182893 67.216504

          10 rows × 9 columns

          
          
          In [62]:
          df_all['segment names malicious_g'] = df_all['segment names'].map(lambda x: g_aggregate(df_segment_stats, x, 'bad_g'))
          df_all['segment names clean_g'] = df_all['segment names'].map(lambda x: g_aggregate(df_segment_stats, x, 'good_g'))
          df_all['section names malicious_g'] = df_all['section names'].map(lambda x: g_aggregate(df_section_stats, x, 'bad_g'))
          df_all['section names clean_g'] = df_all['section names'].map(lambda x: g_aggregate(df_section_stats, x, 'good_g'))
          
          
          
          In [63]:
          g_test = ss.GTest()
          df_entrypoint = pd.DataFrame(df_all, columns=['entry point section name', 'entry point segment name',
                                                        'entry point instruction type', 'label'])
          
          
          
          In [64]:
          df_entrypoint = df_entrypoint.replace(0, np.nan)
          df_entrypoint.dropna(inplace=True)
          df_entrypoint.head()
          
          
          
          
          Out[64]:
          entry point section name entry point segment name entry point instruction type label
          0 __text __TEXT push bad
          1 -1 -1 -1 bad
          2 __text __TEXT pop bad
          3 __text __TEXT push bad
          4 __text __TEXT add bad

          5 rows × 4 columns

          
          
          In [65]:
          df_ct, df_cd, df_ep_section_stats = g_test.highest_gtest_scores(df_entrypoint['entry point section name'],
                                                               df_entrypoint['label'], N=0, matches=0, min_volume=5)
          df_ct, df_cd, df_ep_segment_stats = g_test.highest_gtest_scores(df_entrypoint['entry point segment name'],
                                                               df_entrypoint['label'], N=0, matches=0, min_volume=5)
          df_ct, df_cd, df_ep_instruction_stats = g_test.highest_gtest_scores(df_entrypoint['entry point instruction type'],
                                                               df_entrypoint['label'], N=0, matches=0, min_volume=5)
          
          
          
          In [66]:
          df_ep_section_stats.head()
          
          
          
          
          Out[66]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          __text 214 225 0.487472 0.512528 439 199.901786 29.168132 239.098214 -27.348302
          -1 34 80 0.298246 0.701754 114 51.910714 -28.775199 62.089286 40.552511
          COULD NOT FIND 7 0 1.000000 0.000000 7 3.187500 11.013425 3.812500 0.000000

          3 rows × 9 columns

          
          
          In [67]:
          df_ep_segment_stats.head()
          
          
          
          
          Out[67]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          __TEXT 214 225 0.487472 0.512528 439 199.901786 29.168132 239.098214 -27.348302
          -1 34 80 0.298246 0.701754 114 51.910714 -28.775199 62.089286 40.552511
          COULD NOT FIND 7 0 1.000000 0.000000 7 3.187500 11.013425 3.812500 0.000000

          3 rows × 9 columns

          
          
          In [68]:
          df_ep_instruction_stats.dropna(inplace=True)
          df_ep_instruction_stats.sort('bad_g', ascending=0).head(20)
          
          
          
          
          Out[68]:
          bad good bad_cd good_cd total_cd bad_exp bad_g good_exp good_g
          push 155 153 0.503247 0.496753 308 140.250000 30.999557 167.750000 -28.163277
          COULD NOT FIND 7 0 1.000000 0.000000 7 3.187500 11.013425 3.812500 0.000000
          pop 5 1 0.833333 0.166667 6 2.732143 6.043517 3.267857 -2.368269
          das 0 8 0.000000 1.000000 8 3.642857 0.000000 4.357143 9.722000
          add 42 54 0.437500 0.562500 96 43.714286 -3.360448 52.285714 3.484173
          -1 34 80 0.298246 0.701754 114 51.910714 -28.775199 62.089286 40.552511

          6 rows × 9 columns

          
          
          In [69]:
          df_entrypoint['entry point instruction type'].value_counts().plot(kind='bar')
          
          
          
          
          Out[69]:
          <matplotlib.axes.AxesSubplot at 0x116998210>
          
          
          In [70]:
          clf_final = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
          final_labels = list(df_all.columns.values)
          final_labels.remove('label')
          final_labels.remove('cputype')
          final_labels.remove('LC_UNIXTHREAD')
          final_labels.remove('LC_MAIN')
          final_labels.remove('const strings')
          final_labels.remove('symbol table strings')
          final_labels.remove('segment names')
          final_labels.remove('section names')
          final_labels.remove('entry point section name')
          final_labels.remove('entry point segment name')
          final_labels.remove('entry point instruction type')
          X_final = df_all.as_matrix(final_labels)
          y_final = np.array(df_all['label'].tolist())
          
          
          
          In [71]:
          scores = sklearn.cross_validation.cross_val_score(clf_final, X_final, y_final, cv=10, n_jobs=5)
          print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
          
          
          
          
          Accuracy: 0.96 (+/- 0.06)
          
          
          
          In [74]:
          # 80/20 Split for predictive test
          X_train_final, X_test_final, y_train_final, y_test_final = train_test_split(X_final, y_final,
                                                                                      test_size=my_tsize,
                                                                                      random_state=my_seed)
          clf_final.fit(X_train_final, y_train_final)
          y_pred_final = clf_final.predict(X_test_final)
          cm_final = confusion_matrix(y_test_final, y_pred_final, labels)
          plot_cm(cm_final, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 94.20% (65/69)
          good/bad: 5.80% (4/69)
          bad/good: 2.33% (1/43)
          bad/bad: 97.67% (42/43)
          

          You can adjust the prediction threshold to match your priorities, e.g. accepting a few more missed malware samples in exchange for fewer false positives on clean files (or vice versa).

          
          
          In [75]:
          y_probs_final = clf_final.predict_proba(X_test_final)[:,0]  # column 0 is P('bad'); classes_ are sorted alphabetically
          thres = .8 # This can be set to whatever you'd like
          y_pred_final[y_probs_final<thres] = 'good'
          y_pred_final[y_probs_final>=thres] = 'bad'
          cm = confusion_matrix(y_test_final, y_pred_final, labels)
          plot_cm(cm, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 98.55% (68/69)
          good/bad: 1.45% (1/69)
          bad/good: 16.28% (7/43)
          bad/bad: 83.72% (36/43)
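
          Rather than hard-coding a single threshold, you can inspect the precision/recall trade-off across all thresholds before picking one. This is a minimal sketch, not part of the original notebook, assuming the fitted clf_final, X_test_final, and y_test_final from the cells above:

          from sklearn.metrics import precision_recall_curve

          # Column 1 of predict_proba is P('good'), since classes_ are sorted ('bad', 'good')
          probs_good = clf_final.predict_proba(X_test_final)[:, 1]
          precision, recall, thresholds = precision_recall_curve(y_test_final, probs_good, pos_label='good')
          # Print every 10th point of the curve to keep the output short
          for p, r, t in zip(precision, recall, thresholds)[::10]:
              print 'threshold=%0.2f  precision=%0.2f  recall=%0.2f' % (t, p, r)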
          

          Most important features

          Now that we've added the G-test scores derived from the section name, segment name, and symbol table strings, what are the most important features?

          
          
          In [76]:
          importances = zip(final_labels, clf_final.feature_importances_)
          importances.sort(key=lambda k:k[1], reverse=True)
          for im in importances[0:20]:
              print im[0].ljust(35), im[1]
          
          
          
          
          section names malicious_g           0.0657407559464
          LC_SOURCE_VERSION                   0.0553265015965
          section names clean_g               0.0535257311096
          LC_DYLIB_CODE_SIGN_DRS              0.048614836495
          LC_DATA_IN_CODE                     0.0464634185482
          rebase_size                         0.043812886794
          LC_CODE_SIGNATURE                   0.0334035056079
          rebase_off                          0.0311568256564
          text_section_align                  0.0232383559178
          data_nsects                         0.021464002581
          linkedit_vmsize                     0.0198348290427
          lazy_bind_size                      0.0173961188418
          pz_vmaddr                           0.0165259506433
          pz_vmsize                           0.0156923760629
          nextdefsym                          0.0152103851287
          pz_fileoff                          0.0141570209658
          entry point offset                  0.0140948505669
          nlocrel                             0.0131386093896
          bind_size                           0.0121543445893
          pz_flags                            0.0120473460984
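
          The same information is easier to scan as a plot. Here's a quick sketch (an addition, not the original notebook's code), assuming the importances list from the cell above:

          import matplotlib.pyplot as plt

          top = importances[:20]
          pos = range(len(top))[::-1]          # reversed so the most important feature sits at the top
          plt.figure(figsize=(8, 6))
          plt.barh(pos, [score for name, score in top])
          plt.yticks(pos, [name for name, score in top])
          plt.xlabel('feature importance')
          plt.title('Top 20 RandomForest feature importances')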
          

          Try a new classifier with only the 20 most important features

          
          
          In [77]:
          clf_most = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
          X = df_all.as_matrix([item[0] for item in importances[:20]])
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
          clf_most.fit(X_train, y_train)
          y_pred = clf_most.predict(X_test)
          cm = confusion_matrix(y_test, y_pred, labels)
          plot_cm(cm, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 92.75% (64/69)
          good/bad: 7.25% (5/69)
          bad/good: 2.33% (1/43)
          bad/bad: 97.67% (42/43)
          
          
          
          In [78]:
          y_probs = clf_most.predict_proba(X_test)[:,0]
          thres = .92 # This can be set to whatever you'd like
          y_pred[y_probs<thres] = 'good'
          y_pred[y_probs>=thres] = 'bad'
          cm = confusion_matrix(y_test, y_pred, labels)
          plot_cm(cm, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 98.55% (68/69)
          good/bad: 1.45% (1/69)
          bad/good: 20.93% (9/43)
          bad/bad: 79.07% (34/43)
          

          Using a different classifier

          Let's try a different classifier: the Extra Trees classifier (like a Random Forest, but with even more randomness in how the splits are chosen).
          http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

          
          
          In [88]:
          clf_most = sklearn.ensemble.ExtraTreesClassifier(n_estimators=50)
          X = df_all.as_matrix([item[0] for item in importances[:20]])
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=my_tsize, random_state=my_seed)
          clf_most.fit(X_train, y_train)
          y_pred = clf_most.predict(X_test)
          cm = confusion_matrix(y_test, y_pred, labels)
          plot_cm(cm, labels)
          
          
          
          
          Confusion Matrix Stats
          good/good: 94.20% (65/69)
          good/bad: 5.80% (4/69)
          bad/good: 2.33% (1/43)
          bad/bad: 97.67% (42/43)
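
          With only ~500 samples, a single 80/20 split is noisy, so the two ensembles are easier to compare with cross-validation. A rough sketch (an addition, assuming X and y from the cells above and the same old-style sklearn.cross_validation API used earlier):

          for name, clf in [('RandomForest', sklearn.ensemble.RandomForestClassifier(n_estimators=50)),
                            ('ExtraTrees',   sklearn.ensemble.ExtraTreesClassifier(n_estimators=50))]:
              # 10-fold cross-validated accuracy for each ensemble on the top-20 feature matrix
              scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=10)
              print '%s accuracy: %0.2f (+/- %0.2f)' % (name, scores.mean(), scores.std() * 2)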
          

          Going one step further

          It's nice to predict whether a binary is good or bad, but it would be nicer to explain why that prediction was made. Let's explore the rebase_size and rebase_off features from the LC_DYLD_INFO load command. The dyld_info_command contains the file offsets and sizes of the new compressed form of the information dyld needs to load the image.

          Dyld rebases an image whenever dyld loads it at an address different from its preferred address.

          • rebase_off is the file offset to rebase info
          • rebase_size is the size of rebase info
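
          Before plotting, a quick way to eyeball how these two fields differ by label is a grouped summary. A minimal sketch (an addition), assuming df_all as built earlier:

          # Summary statistics of the rebase fields, split by good/bad label
          print df_all.groupby('label')['rebase_off'].describe()
          print ''
          print df_all.groupby('label')['rebase_size'].describe()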
          
          
          In [89]:
          df_all.boxplot(column='rebase_off', by='label')
          plt.xlabel('bad vs. good files')
          plt.ylabel('rebase_off')
          plt.title('Comparison of rebase offset')
          plt.suptitle("")
          
          
          
          
          Out[89]:
          <matplotlib.text.Text at 0x115e8f8d0>

          Not the most useful plot. Let's try again, this time without showing the outliers.

          
          
          In [90]:
          df_all.boxplot(column='rebase_off', by='label', sym='')
          plt.xlabel('bad vs. good files')
          plt.ylabel('rebase_off')
          plt.title('Comparison of rebase offset')
          plt.suptitle("")
          
          
          
          
          Out[90]:
          <matplotlib.text.Text at 0x115ea0b50>
          
          
          In [91]:
          cond = df_all['label'] == 'good'
          good_all = df_all[cond]
          bad_all  = df_all[~cond]
          bvc = pd.DataFrame(bad_all['rebase_off'].value_counts(), columns=['count'])
          bvc['percentage'] = bvc/bad_all.shape[0]
          gvc = pd.DataFrame(good_all['rebase_off'].value_counts(), columns=['count'])
          gvc['percentage'] = gvc/good_all.shape[0]
          print 'Most common rebase_off values from malicious set'
          print bvc.head(5)
          print ''
          print 'Most common rebase_off values from clean set'
          print gvc.head()
          
          
          
          
          Most common rebase_off values from malicious set
                   count  percentage
           0         125   0.490196
          -1          81   0.317647
           20480      10   0.039216
           8192       10   0.039216
           319488      4   0.015686
          
          [5 rows x 2 columns]
          
          Most common rebase_off values from clean set
                  count  percentage
           8192      53   0.173770
           12288     49   0.160656
          -1         29   0.095082
           16384     27   0.088525
           0         21   0.068852
          
          [5 rows x 2 columns]
          
          
          
          In [92]:
          bad_all['rebase_off'].hist(alpha=1,label='bad',bins=1000)
          good_all['rebase_off'].hist(alpha=1,label='good',bins=1000)
          plt.title('Values for rebase offset')
          plt.legend()
          plt.xlim(-50,3000000)
          
          
          
          
          Out[92]:
          (-50, 3000000)

          Well, that wasn't very helpful. Let's cut the long tail off the plot.

          
          
          In [93]:
          bad_all['rebase_off'].hist(alpha=1,label='bad',bins=2000)
          good_all['rebase_off'].hist(alpha=1,label='good',bins=2000)
          plt.legend()
          plt.title('Values for rebase offset')
          plt.xlim(-50,80000)
          
          
          
          
          Out[93]:
          (-50, 80000)
          
          
          In [94]:
          df_all.boxplot(column='rebase_size', by='label')
          plt.xlabel('bad vs. good files')
          plt.ylabel('rebase_size')
          plt.title('Comparison of rebase size')
          plt.suptitle("")
          
          
          
          
          Out[94]:
          <matplotlib.text.Text at 0x1169b0250>
          
          
          In [95]:
          df_all.boxplot(column='rebase_size', by='label', sym='')
          plt.xlabel('bad vs. good files')
          plt.ylabel('rebase_size')
          plt.title('Comparison of rebase size')
          plt.suptitle("")
          
          
          
          
          Out[95]:
          <matplotlib.text.Text at 0x1169bbbd0>
          
          
          In [96]:
          bvc = pd.DataFrame(bad_all['rebase_size'].value_counts(), columns=['count'])
          bvc['percentage'] = bvc/bad_all.shape[0]
          gvc = pd.DataFrame(good_all['rebase_size'].value_counts(), columns=['count'])
          gvc['percentage'] = gvc/good_all.shape[0]
          print 'Most common rebase_size values from malicious set'
          print bvc.head(5)
          print ''
          print 'Most common rebase_size values from clean set'
          print gvc.head(5)
          
          
          
          
          Most common rebase_size values from malicious set
               count  percentage
           0     125     0.490196
          -1      81     0.317647
           48      6     0.023529
           16      6     0.023529
           28      5     0.019608
          
          [5 rows x 2 columns]
          
          Most common rebase_size values from clean set
               count  percentage
           8      80     0.262295
           16     44     0.144262
          -1      29     0.095082
           0      21     0.068852
           24     17     0.055738
          
          [5 rows x 2 columns]
          
          
          
          In [97]:
          plt.scatter(good_all['rebase_off'], good_all['rebase_size'], 
                      s=140, c='#aaaaff', label='Good', alpha=.4)
          plt.scatter(bad_all['rebase_off'], bad_all['rebase_size'], 
                      s=40, c='r', label='Bad', alpha=.5)
          plt.title('rebase_off vs rebase_size')
          plt.xlabel('rebase_off')
          plt.ylabel('rebase_size')
          
          
          
          
          Out[97]:
          <matplotlib.text.Text at 0x119130550>

          Let's zoom in around the origin.

          
          
          In [98]:
          plt.scatter(good_all['rebase_off'], good_all['rebase_size'], 
                      s=140, c='#aaaaff', label='Good', alpha=.4)
          plt.scatter(bad_all['rebase_off'], bad_all['rebase_size'], 
                      s=40, c='r', label='Bad', alpha=.5)
          plt.xlim(-1000,150000)
          plt.ylim(-100,1100)
          plt.title('rebase_off vs rebase_size')
          plt.xlabel('rebase_off')
          plt.ylabel('rebase_size')
          
          
          
          
          Out[98]:
          <matplotlib.text.Text at 0x118632e10>
          
          
          In [ ]:
          plt.scatter(good_all['rebase_off'], good_all['rebase_size'], 
                      s=140, c='#aaaaff', label='Good', alpha=.4)
          plt.scatter(bad_all['rebase_off'], bad_all['rebase_size'], 
                      s=40, c='r', label='Bad', alpha=.5)
          plt.xlim(-10,10)
          plt.ylim(-10,10)
          plt.title('rebase_off vs rebase_size')
          plt.xlabel('rebase_off')
          plt.ylabel('rebase_size')
          

          How does current AV do on these files?

          Scan data is as of 2014/04/28

          
          
          In [99]:
          def load_vt_data(vt_list):
              # Count, per AV engine, how many of the VirusTotal reports flagged the file as detected
              import json
              results = {}
              for filename in vt_list:
                  with open(filename,'rb') as f:
                      vt_data = f.read()
                      if len(vt_data) == 0:
                          break  # note: this stops at the first empty report rather than skipping it
                      vt = json.loads(vt_data)
                      for engine in vt['scans'].keys():
                          if engine not in results:
                              results[engine] = 0
                          if vt['scans'][engine]['detected']:
                              results[engine] += 1

              # Reshape the per-engine tally into a list of records for DataFrame.from_records
              r = []
              for key, value in results.iteritems():
                  r.append({'Engine': key, 'Count': value})
              return r
          
          
          
          In [100]:
          vt_list = glob.glob('data/bad/vt_data/*.vtdata')
          num_files = len(vt_list)
          vt_results = load_vt_data(vt_list)
          vt_df = pd.DataFrame.from_records(vt_results, columns=['Engine', 'Count'])
          vt_df['Files Scanned'] = num_files
          vt_df['Percentage'] = vt_df['Count']/num_files
          vt_df.sort('Count', ascending=0).head(52)
          
          
          
          
          Out[100]:
          Engine Count Files Scanned Percentage
          18 Avast 261 261 1.000000
          40 GData 261 261 1.000000
          28 F-Secure 261 261 1.000000
          25 Ad-Aware 260 261 0.996169
          1 MicroWorld-eScan 260 261 0.996169
          21 BitDefender 260 261 0.996169
          20 Kaspersky 255 261 0.977011
          34 Emsisoft 254 261 0.973180
          2 nProtect 254 261 0.973180
          45 ESET-NOD32 254 261 0.973180
          31 AntiVir 245 261 0.938697
          27 Comodo 242 261 0.927203
          26 Sophos 240 261 0.919540
          29 DrWeb 240 261 0.919540
          47 Ikarus 238 261 0.911877
          19 ClamAV 231 261 0.885057
          51 Qihoo-360 230 261 0.881226
          49 AVG 228 261 0.873563
          4 CAT-QuickHeal 226 261 0.865900
          12 NANO-Antivirus 224 261 0.858238
          17 TrendMicro-HouseCall 203 261 0.777778
          14 Symantec 190 261 0.727969
          38 Microsoft 175 261 0.670498
          42 AhnLab-V3 170 261 0.651341
          0 Bkav 149 261 0.570881
          30 VIPRE 128 261 0.490421
          32 TrendMicro 102 261 0.390805
          5 McAfee 90 261 0.344828
          33 McAfee-GW-Edition 90 261 0.344828
          23 ViRobot 85 261 0.325670
          15 Norman 55 261 0.210728
          46 Rising 49 261 0.187739
          13 F-Prot 42 261 0.160920
          41 Commtouch 41 261 0.157088
          35 Jiangmin 40 261 0.153257
          9 K7AntiVirus 36 261 0.137931
          10 K7GW 35 261 0.134100
          16 TotalDefense 34 261 0.130268
          48 Fortinet 27 261 0.103448
          22 Agnitum 26 261 0.099617
          43 VBA32 25 261 0.095785
          7 Zillya 21 261 0.080460
          50 Panda 11 261 0.042146
          3 CMC 5 261 0.019157
          36 Antiy-AVL 1 261 0.003831
          24 ByteHero 0 261 0.000000
          44 Baidu-International 0 261 0.000000
          6 Malwarebytes 0 261 0.000000
          11 TheHacker 0 261 0.000000
          39 AegisLab 0 261 0.000000
          37 Kingsoft 0 261 0.000000
          8 SUPERAntiSpyware 0 261 0.000000

          52 rows × 4 columns
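
          The same detection rates are easier to compare visually. A small sketch (an addition; the plot styling here is a guess, not the original), assuming the vt_df from the cell above:

          # Bar chart of the per-engine detection rate on the malicious set
          vt_sorted = vt_df.sort('Percentage', ascending=0)
          vt_sorted.set_index('Engine')['Percentage'].plot(kind='bar', figsize=(14, 4))
          plt.ylabel('Detection rate')
          plt.title('Per-engine detection rate on the malicious set')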

          
          
          In [101]:
          vt_list = glob.glob('data/good/vt_data/*.vtdata')
          num_files = len(vt_list)
          vt_results = load_vt_data(vt_list)
          vt_df = pd.DataFrame.from_records(vt_results, columns=['Engine', 'Count'])
          vt_df['Files Scanned'] = num_files
          vt_df['Percentage'] = vt_df['Count']/num_files
          vt_df.sort('Count', ascending=0).head(15)
          
          
          
          
          Out[101]:
          Engine Count Files Scanned Percentage
          51 Qihoo-360 0 266 0
          50 Panda 0 266 0
          23 SUPERAntiSpyware 0 266 0
          22 NANO-Antivirus 0 266 0
          21 BitDefender 0 266 0
          20 Kaspersky 0 266 0
          19 ClamAV 0 266 0
          18 Avast 0 266 0
          17 TrendMicro-HouseCall 0 266 0
          16 TotalDefense 0 266 0
          15 Norman 0 266 0
          14 Symantec 0 266 0
          13 F-Prot 0 266 0
          12 Agnitum 0 266 0
          11 K7AntiVirus 0 266 0

          15 rows × 4 columns