Data cleaning

After gathering all the data (see the report and Data_wrangling.ipynb), it is time to clean it. The cleaning phase follows these steps:

  • Load all the collection .txt files into a labeled pandas dataframe
  • Remove empty data points and malicious data points of length 1
  • Remove any duplicates
  • Shuffle the dataset to remove any ordering bias
  • Store the result as a .csv file for visualization, and as a .pickle file to easily allow further computations on the data set
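The filtering, deduplication and shuffling steps above can be illustrated on a toy dataframe (the column names match the schema used below; the rows themselves are made up, and the fixed `random_state` is only there to make the sketch reproducible):

```python
import pandas as pd

# Toy dataframe with one duplicate and one empty payload (rows are illustrative only)
toy = pd.DataFrame({
    'payload': ["' OR 1=1--", "hello", "' OR 1=1--", ""],
    'is_malicious': [1, 0, 1, 0],
})

# Drop empty payloads, then duplicates, then shuffle with a fixed seed
toy = toy[toy['payload'].str.len() != 0]
toy = toy.drop_duplicates(subset='payload', keep='last')
toy = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(len(toy))  # 2 rows remain: one malicious, one benign
```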

Step 1

Import dependencies


In [1]:
import numpy as np
import pandas as pd
import csv
from IPython.display import display

Step 2

Compute a pandas dataframe with the payloads from the different collections.

Dataframe columns: <is_malicious> | <injection_type> | <payload>
Example: 1 | SQL | ' OR 1=1 LIMIT 1 #


In [2]:
def from_txt_to_dataframe(src_file,is_malicious,injection_type):
    
    #read file, one payload per line, closing the file handle when done
    with open('data/{}.txt'.format(src_file),'r',encoding='UTF-8') as f:
        payloads_txt = f.readlines()
    
    #create labeled dataframe (scalar assignment broadcasts to every row)
    payloads = pd.DataFrame(payloads_txt,columns=['payload'])
    payloads['is_malicious'] = is_malicious
    payloads['injection_type'] = injection_type

    print('First 5 lines of ' + injection_type)
    display(payloads.head())
    
    return payloads
    
#concatenate all payload dataframes together
#(pd.concat replaces DataFrame.append, which is removed in pandas 2.0)
payloads = pd.concat([
    from_txt_to_dataframe('SQLCollection',1,'SQL'),
    from_txt_to_dataframe('XSSCollection',1,'XSS'),
    from_txt_to_dataframe('ShellCollection',1,'SHELL'),
    from_txt_to_dataframe('non-maliciousCollection',0,'LEGAL'),
])
payloads = payloads.reset_index(drop=True)


First 5 lines of SQL
payload is_malicious injection_type
0 '\n 1 SQL
1 a' or 1=1-- \n 1 SQL
2 "a"" or 1=1--"\n 1 SQL
3 or a = a\n 1 SQL
4 a' or 'a' = 'a\n 1 SQL
First 5 lines of XSS
payload is_malicious injection_type
0 data:text/html;alert(1)/*,<svg%20onload=eval(... 1 XSS
1 '">*/--></title></style></textarea></script%0A... 1 XSS
2 " onclick=alert(1)//<button ‘ onclick=alert(1)... 1 XSS
3 ';alert(String.fromCharCode(88,83,83))//';aler... 1 XSS
4 ">><marquee><img src=x onerror=confirm(1)></ma... 1 XSS
First 5 lines of SHELL
payload is_malicious injection_type
0 () { 0; }; touch /tmp/blns.shellshock1.fail;\n 1 SHELL
1 () { _; } >_[$($())] { touch /tmp/blns.shellsh... 1 SHELL
2 <<< %s(un='%s') = %u\n 1 SHELL
3 '+++ATH0\n 1 SHELL
4 /dev/null; touch /tmp/blns.fail ; echo\n 1 SHELL
First 5 lines of LEGAL
payload is_malicious injection_type
0 569993989\n 0 LEGAL
1 46201\n 0 LEGAL
2 Indianapolis\n 0 LEGAL
3 20354328\n 0 LEGAL
4 A8Cyj4uzrSgkGg4szKuHeI\n 0 LEGAL

Step 3

Clean the data by

  • removing trailing '\n' and surrounding whitespace
  • removing empty data points
  • removing malicious data points of length 1
  • removing duplicates
  • shuffling the data set

In [3]:
#Remove trailing \n and surrounding whitespace (str.strip() covers both)
payloads['payload'] = payloads['payload'].str.strip()

#Remove any empty data points
rows_before = len(payloads['payload'])
payloads = payloads[payloads['payload'].str.len() != 0]
print('Empty data points removed: ' + str(rows_before - len(payloads)))

#Remove any malicious data points of size 1
rows_before = len(payloads['payload'])
payloads = payloads[(payloads['is_malicious'] == 0) | ((payloads['is_malicious'] == 1) & (payloads['payload'].str.len() > 1))]
print('Malicious data points of size 1 removed: ' + str(rows_before-len(payloads)))

#Remove duplicates
rows_before = len(payloads['payload'])
payloads = payloads.drop_duplicates(subset='payload', keep='last')
print('Duplicate data points removed: ' + str(rows_before-len(payloads)))

#Reformat rows that have the format b'<payload>' into <payload>
payloads['payload'] = [payload[2:-1] if payload.startswith("b'") or payload.startswith('b"') 
                        else payload for payload in payloads['payload']]

#Shuffle dataset and reset indices again
payloads = payloads.sample(frac=1).reset_index(drop=True)
payloads.index.name = 'index'

#Round-trip through .csv to flag payloads that pandas cannot serialize;
#after reloading, such rows come back as null/NA/NaN
payloads.to_csv('data/payloads.csv',encoding='UTF-8')
payloads = pd.read_csv("data/payloads.csv",index_col='index',encoding='UTF-8') 
rows_before = len(payloads['payload'])
payloads = payloads[~payloads['payload'].isnull()]
print('null/NaN data points removed: ' + str(rows_before-len(payloads)))

#Lastly, save to .csv
payloads.to_csv('data/payloads.csv',encoding='UTF-8')


Empty data points removed: 1
Malicious data points of size 1 removed: 3
Duplicate data points removed: 26072
null/NaN data points removed: 3
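The step list at the top also mentions persisting the dataframe as a .pickle file so later notebooks can reload it without re-parsing the .csv; a minimal sketch (the file name payloads.pickle is an assumption, and a small stand-in dataframe is used here in place of the cleaned one):

```python
import pandas as pd

# Stand-in for the cleaned dataframe (the real one comes from the cells above)
payloads = pd.DataFrame({
    'payload': ["' OR 1=1--", "hello"],
    'is_malicious': [1, 0],
    'injection_type': ['SQL', 'LEGAL'],
})

# Save as a pickle; unlike .csv, this preserves dtypes and the index exactly
payloads.to_pickle('payloads.pickle')

# Later computations can reload the dataframe in one call
reloaded = pd.read_pickle('payloads.pickle')
```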