The data used in this demo is available at https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset

Citation: Alexander Vergara and Shankar Vembu and Tuba Ayhan and Margaret A. Ryan and Margie L. Homer and Ramón Huerta, Chemical gas sensor drift compensation using classifier ensembles, Sensors and Actuators B: Chemical (2012), doi: 10.1016/j.snb.2012.01.074.



In [1]:

    
from os import listdir
from os.path import isfile, join

This snippet creates the column names: the target label, the 128 attributes (16 sensors, each measuring 8 attributes), and the batch number of the corresponding row.

We append '\n' to the 'Batch_No' label to mark the end of the line. With the 'csv' package this is unnecessary, because the 'writer' object adds the line terminator itself when a row is written.



In [8]:

    
col_names = ['Label']
for x in map(chr,range(65,81)):
    for y in map(str,range(1,9)):
        col_names.append('Sensor_'+x+y)
col_names.append('Batch_No\n')
print('Number of columns -', len(col_names))

Number of columns - 130
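
As a sketch of the 'csv' alternative mentioned earlier, the header can be written with `csv.writer`, in which case the last label does not carry a '\n'. The in-memory `buffer` and the `lineterminator='\n'` setting are assumptions made for this illustration; the notebook itself writes to 'formatted_data.csv' with plain string joins.

```python
import csv
import io

# Rebuild the 130 column names, this time without '\n' on the last one.
col_names = ['Label']
for sensor in map(chr, range(65, 81)):        # 'A' .. 'P'
    for feature in map(str, range(1, 9)):     # '1' .. '8'
        col_names.append('Sensor_' + sensor + feature)
col_names.append('Batch_No')

# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(col_names)   # the writer appends the line terminator itself

header_line = buffer.getvalue()
```

When writing to a real file with the 'csv' module, opening it with `newline=''` lets the writer control line endings on every platform.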

Open a CSV file and write the column names to it.



In [3]:

    
out=open('formatted_data.csv','w')
out.write(','.join(col_names))

Get the names of all the files in the 'raw_data' directory.



In [4]:

    
raw_files = [f for f in listdir('./raw_data') if isfile(join('./raw_data', f))]
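
`listdir` returns names in arbitrary order, so the rows of the output file are not guaranteed to be grouped in batch order. If that matters, one possible approach is to sort the names by their batch number. The sketch below assumes file names of the form 'batch<N>.dat'; the helper `batch_number` and the example listing are hypothetical and not part of the original notebook.

```python
import re

def batch_number(file_name):
    """Return the integer after the 'batch' prefix, e.g. 'batch10.dat' -> 10."""
    match = re.match(r'batch(\d+)', file_name)
    if match is None:
        raise ValueError('unexpected file name: ' + file_name)
    return int(match.group(1))

# Made-up listing, deliberately out of order.
example_files = ['batch10.dat', 'batch2.dat', 'batch1.dat']

# Sort numerically, so 'batch10.dat' comes after 'batch2.dat'.
ordered = sorted(example_files, key=batch_number)
```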

Read the data from each file and format every line as follows:

Strip surrounding whitespace and split the line on spaces
Keep the first field as the target label
Split the remaining key:value pairs on ':' and keep only the values
Append the batch number taken from the file name



In [5]:

    
for file_name in raw_files:
    with open('./raw_data/'+file_name,'r') as f:
        for i in f:
            j=i.strip().split(' ')
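            # Join the fields with commas, then append the newline separately
            # so that no empty trailing field is written after the batch number
            out.write(','.join([j[0]]+[k.split(':')[1] for k in j[1:]]+[file_name.strip('batch').split('.')[0]])+'\n')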
out.close()

The snippet above can be rewritten in long form as:



In [6]:

    
#for file_name in raw_files:
#    with open('./raw_data/'+file_name,'r') as f:
#        for i in f:
#            j=i.strip().split(' ')
#            target_label = [j[0]]
#            attributes = [k.split(':')[1] for k in j[1:]]
#            batch_no = [file_name.strip('batch').split('.')[0]]
#            out.write(','.join(target_label+attributes+batch_no)+'\n')
#out.close()
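
To see what these formatting steps do to a single line, here is a minimal sketch. The function `format_line` and the shortened sample line are made up for illustration (a real line carries 128 key:value pairs), and the batch number is passed in directly rather than derived from a file name.

```python
def format_line(line, batch_no):
    """Turn 'label k1:v1 k2:v2 ...' into 'label,v1,v2,...,batch_no' plus a newline."""
    fields = line.split()                 # split on any run of whitespace
    label = fields[0]                     # first field is the target label
    # Keep only the value part of each 'index:value' pair.
    values = [pair.split(':', 1)[1] for pair in fields[1:]]
    return ','.join([label] + values + [batch_no]) + '\n'

# Made-up, shortened line with a trailing space and newline.
sample = '1 1:15596.16 2:1.86 3:2.37 \n'
row = format_line(sample, '1')
```

Here `row` is `'1,15596.16,1.86,2.37,1\n'`: the label, the three values, and the batch number, with the newline added after the join so the row ends without a trailing comma.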