ETK's Excel Extractor is a cell-based extractor for extracting data from compatible spreadsheets.
The example spreadsheet file named alabama.xml and it has a sheet named 16tbl08al, in which row 1 to row 5 and row 60 to row 62 are metadata, 6A to M59 is a table (which has row and column headers). For this example, I'm going to extract data from C7 to M33 (see the picture attached below).
Excel Extractor will scan cell-by-cell within a region that you specified and populate variables that you defined.
In this particular example, I want to extract value of all cells in region (C7, M33) and I defined a variable called value. Its value will be extracted from a cell located at $col,$row where $col and $row mean current column id and row id that the scanner is traversing at. The return is a list of object which contains user-defined variables.
In [1]:
import pprint
from etk.extractors.excel_extractor import ExcelExtractor
ee = ExcelExtractor()
variables = {
'value': '$col,$row'
}
raw_extractions = ee.extract('alabama.xls', '16tbl08al', ['C,7', 'M,33'], variables)
pprint.pprint(raw_extractions[:10]) # print first 10
Excel Extractor allows you to define multiple variables. This is useful if you want to extract the data from other cells which are associated with current cell. In this example, I also need column header (category) and county name of every cell in the region. It supports constant coordinate like ($B,$1) (which means the cell at column B row 1) or using + and - to caculate relative coordinate like ($B-1,$row+1) (which means the cell at column A and its row id is current row id + 1).
In [2]:
variables = {
'value': '$col,$row',
'county': '$B,$row',
'category': '$col,$6'
}
raw_extractions = ee.extract('alabama.xls', '16tbl08al', ['C,7', 'M,33'], variables)
pprint.pprint(raw_extractions[:10]) # print first 10
In [3]:
variables = {
'value': '$col,$row',
'county': '$B,$row',
'category': '$col,$6',
'from_row': '$row',
'from_col': '$col'
}
raw_extractions = ee.extract('alabama.xls', '16tbl08al', ['C,7', 'M,33'], variables)
pprint.pprint(raw_extractions[:10]) # print first 10
In [4]:
import os, sys
from etk.etk import ETK
from etk.etk_module import ETKModule
from etk.extractors.excel_extractor import ExcelExtractor
from etk.utilities import Utility
class ExampleETKModule(ETKModule):
"""
Abstract class for extraction module
"""
def __init__(self, etk):
ETKModule.__init__(self, etk)
self.ee = ExcelExtractor()
def document_selector(self, doc):
return 'file_path' in doc.cdr_document
def process_document(self, doc):
"""
Add your code for processing the document
"""
variables = {
'value': '$col,$row',
'county': '$B,$row',
'category': '$col,$6',
'from_row': '$row',
'from_col': '$col'
}
raw_extractions = self.ee.extract(doc.cdr_document['file_path'], '16tbl08al', ['C,7', 'M,33'], variables)
extracted_docs = []
for d in raw_extractions:
# post processing
d['category'] = d['category'].replace('\n', ' ').strip()
d['county'] = d['county'].replace('\n', ' ').strip()
d['from_row'] = int(d['from_row'])
d['from_col'] = int(d['from_col'])
# create sub document
d['doc_id'] = Utility.create_doc_id_from_json(d)
extracted_docs.append(etk.create_document(d))
return extracted_docs
# if __name__ == "__main__":
etk = ETK(modules=ExampleETKModule)
doc = etk.create_document({'file_path': 'alabama.xls'})
docs = etk.process_ems(doc)
for d in docs[1:11]: # print first 10
print(d.value)