Function reference

This is a one-page run-down of all the functions and features in the system, kept for reference.

By executing tutorial() in the notebook you are taking a copy of all the tutorial notebooks, including this one.

So, if a function looks useful but you don't quite understand the description, you should experiment with its inputs and outputs within this interactive programming environment.

However, you need to execute at least the first cell in every section for it to work as it imports the libraries.

Table of contents

Loading and saving

tabs = loadxlstabs(inputfile, sheetids="*", verbose=True)

Loads an xls file into a list of tables, which act as bags of cells

savepreviewhtml(tab, htmlfilename=None, verbose=True)

Previews a table -- or a list of cell bags or conversion segments from the same table -- either inline or into a separate file.

writetechnicalCSV(outputfile, conversionsegments)

Outputs a WDA format CSV file from a list of conversion segments or pandas dataframes

readtechnicalCSV(wdafile, bverbose=False, baspandas=True)

Reads an old WDA file into a list of pandas tables, one for each segment


In [7]:
from databaker.framework import *

# put your input-output files here
inputfile = "example1.xls"
outputfile = "example1.csv"
previewfile = "preview.html"

Cell bag selection

These functions generally apply to a table as well as a cell bag, but they always output a cell bag.

A cell bag always has a pointer to its original table as bag.table. However, you can access the underlying unordered set of cells of a bag as bag.unordered_cells.

Note: in the examples below, please use savepreviewhtml(cellbag) or savepreviewhtml([cellbagA, cellbagB, ...]) to see what the selections look like in the context of the table. These calls have been left out of this notebook only to save clutter.


In [2]:
from databaker.framework import *
tab = loadxlstabs("example1.xls", sheetids="stones", verbose=True)[0]
print(tab)


Loading example1.xls which has size 8192 bytes
Table names: ['stones']
{<B4 'Jan'>, <C6 'yes'>, <E7 88.0>, <D2 ''>, <E2 ''>, <A5 ''>, <E1 ''>, <B6 'Feb'>, <E3 'cost'>, <D3 'Rocks'>, <D8 'basalt'>, <C8 'yes'>, <A9 ''>, <B5 'Aug'>, <A2 'Date'>, <B9 'Dec'>, <D6 'limestone'>, <E6 2.0>, <A6 1989.0>, <C7 'no'>, <A8 ''>, <C3 'present'>, <C9 'yes'>, <E9 8.0>, <A1 ''>, <B3 'Month'>, <A4 1972.0>, <B8 'Jun'>, <B1 ''>, <E5 30.0>, <C2 ''>, <E8 96.0>, <C5 'no'>, <B7 'Mar'>, <D7 'shale'>, <B2 ''>, <D4 'chalk'>, <C4 'yes'>, <C1 ''>, <A3 'Year'>, <D5 'granite'>, <D1 ''>, <A7 ''>, <E4 10.0>, <D9 'ice'>}

cellbag.is_XXX()

cellbag.is_not_XXX()

Returns cells which are, or are not, XXX.

Allowable functions:

bold, italic, underline, number, date, whitespace, strikeout, any_border, all_border, richtext

These functions can be chained, eg cellbag.is_not_number().is_not_whitespace().


In [3]:
cellbag = tab
print("Numbered cells only:", cellbag.is_number())
print()
print("Not numbers:", cellbag.is_not_number())
print()
print("Not numbers and not whitespace:", cellbag.is_not_number().is_not_whitespace())
print()
print("Cells that seem to be a date:", cellbag.is_date())


Numbered cells only: {<E5 30.0>, <E9 8.0>, <E7 88.0>, <A4 1972.0>, <E6 2.0>, <A6 1989.0>, <E8 96.0>, <E4 10.0>}

Not numbers: {<B4 'Jan'>, <C6 'yes'>, <D2 ''>, <E2 ''>, <B6 'Feb'>, <A5 ''>, <E1 ''>, <E3 'cost'>, <D3 'Rocks'>, <D8 'basalt'>, <C8 'yes'>, <A9 ''>, <B5 'Aug'>, <A2 'Date'>, <B9 'Dec'>, <D6 'limestone'>, <D1 ''>, <C7 'no'>, <C3 'present'>, <A1 ''>, <B3 'Month'>, <B8 'Jun'>, <B1 ''>, <C2 ''>, <C5 'no'>, <B7 'Mar'>, <B2 ''>, <D5 'granite'>, <A7 ''>, <C4 'yes'>, <C1 ''>, <A3 'Year'>, <C9 'yes'>, <A8 ''>, <D7 'shale'>, <D4 'chalk'>, <D9 'ice'>}

Not numbers and not whitespace: {<B4 'Jan'>, <C6 'yes'>, <B6 'Feb'>, <E3 'cost'>, <D3 'Rocks'>, <D8 'basalt'>, <C8 'yes'>, <B5 'Aug'>, <A2 'Date'>, <B9 'Dec'>, <D6 'limestone'>, <C7 'no'>, <C3 'present'>, <B3 'Month'>, <B8 'Jun'>, <C5 'no'>, <B7 'Mar'>, <D5 'granite'>, <C4 'yes'>, <A3 'Year'>, <C9 'yes'>, <D7 'shale'>, <D4 'chalk'>, <D9 'ice'>}

Cells that seem to be a date: {<A4 1972.0>, <A6 1989.0>}

cellbag.filter(word)

Only cells matching this word exactly

cellbag.filter(function(cell))

Only cells where function(cell) == True

cellbag.one_of([word1, word2])

Only cells matching one of the words

cellbag.regex(regexp)

Only cells whose value matches the regular expression

cellbag.excel_ref(ref)

Selects a cell by its Excel Letter-Number (Column-Row) reference, where 'A1' is the top left-hand corner.

This also works for single columns or rows (eg 'C', or '3') and ranges (eg 'A2:B3').

This way of selecting cells is not recommended unless you know that the spreadsheet you are working with won't have extra rows or columns inserted or deleted.

cellbag.by_index(n)

Selects a single cell from the cell bag by index n, where n=1 is the first element. (n can also be a list of integers.)

cellbag.assert_one()

Throws an exception if there is not exactly one cell in this bag (useful for validation if your filter above was supposed to return only one cell)

cellbag.value

If len(cellbag) == 1 then cellbag.value gives the original value within that cell, otherwise it throws an exception.
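The Letter-Number format that excel_ref accepts is just base-26 arithmetic over the column letters. Here is a minimal decoder sketch (parse_excel_ref is a hypothetical helper for illustration, not part of databaker):

```python
# Hypothetical helper, not part of databaker: decode an 'A1'-style
# reference into one-based (column, row) coordinates.
def parse_excel_ref(ref):
    col = 0
    i = 0
    # accumulate the letters as a base-26 number (A=1 .. Z=26, AA=27, ...)
    while i < len(ref) and ref[i].isalpha():
        col = col * 26 + (ord(ref[i].upper()) - ord('A') + 1)
        i += 1
    row = int(ref[i:])  # the remaining digits are the one-based row
    return (col, row)
```

So parse_excel_ref("B4") gives (2, 4), matching the B4 cell used in the examples below.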


In [20]:
from databaker.framework import *   # restated import so you can run from this cell
cellbag = tab = loadxlstabs("example1.xls", sheetids="stones", verbose=True)[0]

print("Get some matching cells", cellbag.one_of(["Rocks", "ice", "mud"]))
print("A3 is", cellbag.excel_ref("A3"))
print("A3:B4 is", cellbag.excel_ref("A2:B4"))
print()
print("The second cell in the whole table is", tab.by_index(2))

ngreater20 = cellbag.is_number().filter(lambda c: c.value>20)
nlessthan20 = cellbag.is_number().filter(lambda c: c.value<20)
print("Numbers greater than 20", ngreater20)
print("Numbers less than 20", nlessthan20)

# Uncomment this line to see these selections in contents
# savepreviewhtml([ngreater20, nlessthan20])


Loading example1.xls which has size 8192 bytes
Table names: ['stones']
Get some matching cells {<D9 'ice'>, <D3 'Rocks'>}
A3 is {<A3 'Year'>}
A2:B4 is {<A4 1972.0>, <B4 'Jan'>, <A3 'Year'>, <A2 'Date'>, <B2 ''>, <B3 'Month'>}

The second cell in the whole table is {<B1 ''>}
Numbers greater than 20 {<A4 1972.0>, <E5 30.0>, <E7 88.0>, <E8 96.0>, <A6 1989.0>}
Numbers less than 20 {<E6 2.0>, <E4 10.0>, <E9 8.0>}

cellbag1.union(cellbag2)

Union of two bags. Can also be expressed as cellbag1 | cellbag2

cellbag1.difference(cellbag2)

Difference of two bags. Can also be expressed as cellbag1 - cellbag2

cellbag1.intersection(cellbag2)

Intersection of two bags. Can also be expressed as cellbag1 & cellbag2


In [121]:
colC = tab.excel_ref("D3:D5")
rowC = tab.excel_ref("A4:D4")
print("colC", colC)
print("rowC", rowC)
print()
print("Union is", colC.union(rowC))
print("Difference is", colC.difference(rowC))
print("Intersection is", colC.intersection(rowC))
print()
print("Union is", (colC | rowC))
print("Difference is", (colC - rowC))
print("Intersection is", (colC & rowC))


colC {<D5 'granite'>, <D3 'Rocks'>, <D4 'chalk'>}
rowC {<A4 1972.0>, <D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}

Union is {<D5 'granite'>, <A4 1972.0>, <D3 'Rocks'>, <D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}
Difference is {<D5 'granite'>, <D3 'Rocks'>}
Intersection is {<D4 'chalk'>}

Union is {<D5 'granite'>, <A4 1972.0>, <D3 'Rocks'>, <D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}
Difference is {<D5 'granite'>, <D3 'Rocks'>}
Intersection is {<D4 'chalk'>}

cellbag1.waffle(cellbag2)

Get all cells which have a cell from one bag above them and a cell from the other bag to the side. Note that the two bags are interchangeable without changing the output. You can change the direction from its default (DOWN) by specifying direction=LEFT or similar.

cellbag1.junction(cellbag2)

Enumerates the output of waffle as triplets: (cell from the first bag, cell from the second bag, waffle cell).

cellbag1.same_row(cellbag2)

Get cells in this bag which are in the same row as a cell in the second.

cellbag1.same_column(cellbag2)

Get cells in this bag which are in the same column as a cell in the second.


In [7]:
c = tab.excel_ref("D3") | tab.excel_ref("E4")
d = tab.excel_ref("A6:A7")
print("Waffle:")
savepreviewhtml([c,d, c.waffle(d)])


Waffle:
item 0item 1item 2
stones
Date
YearMonthpresentRockscost
1972.0Janyeschalk10.0
Augnogranite30.0
1989.0Febyeslimestone2.0
Marnoshale88.0
Junyesbasalt96.0
Decyesice8.0

In [123]:
print("Junction output:")
for s in c.junction(d):
    print("  ", s)


Junction output:
   ({<D3 'Rocks'>}, {<A6 1989.0>}, {<D6 'limestone'>})
   ({<D3 'Rocks'>}, {<A7 ''>}, {<D7 'shale'>})
   ({<E4 10.0>}, {<A6 1989.0>}, {<E6 2.0>})
   ({<E4 10.0>}, {<A7 ''>}, {<E7 88.0>})

In [128]:
print("Cells column A that are in same row as", c, "are", tab.excel_ref("A").same_row(c))
print("Cells column 7 that are in same column as", c, "are", tab.excel_ref("7").same_col(c))


Cells in column A that are in the same row as {<D3 'Rocks'>, <E4 10.0>} are {<A4 1972.0>, <A3 'Year'>}
Cells in row 7 that are in the same column as {<D3 'Rocks'>, <E4 10.0>} are {<D7 'shale'>, <E7 88.0>}

cellbag.shift(direction)

Move the selected cells UP, DOWN, LEFT or RIGHT by one cell

cellbag.shift((dx, dy))

Move the selected cells dx cells to the RIGHT and dy cells DOWN (the values can be negative)

cellbag.fill(direction)

Take all the cells in one direction from the given cellbag, excluding the cells of the bag itself

cellbag.expand(direction)

All the cells in one direction, including itself.

cellbag.extrude(dx, dy)

Include all cells offset from the bag by between 0 and dx columns and between 0 and dy rows.


In [120]:
c = tab.excel_ref("B4")
print("Shift RIGHT from", c, "is", c.shift(RIGHT))
print("Shift (-1,-2) from", c, "is", c.shift((-1, -2)))
print("Fill UP from", c, "is", c.fill(UP))
print("Expand UP from", c, "is", c.expand(UP))
print()
print("How it works: UP=", UP, "  DOWN=", DOWN, "  LEFT=", LEFT, "  RIGHT=", RIGHT)
print()
print("Extrude two cells rightwards", c.extrude(2,0))


Shift RIGHT from {<B4 'Jan'>} is {<C4 'yes'>}
Shift (-1,-2) from {<B4 'Jan'>} is {<A2 'Date'>}
Fill UP from {<B4 'Jan'>} is {<B3 'Month'>, <B1 ''>, <B2 ''>}
Expand UP from {<B4 'Jan'>} is {<B3 'Month'>, <B4 'Jan'>, <B1 ''>, <B2 ''>}

How it works: UP= (0, -1)   DOWN= (0, 1)   LEFT= (-1, 0)   RIGHT= (1, 0)

Extrude two cells rightwards {<D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}
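As the printout above shows, the direction constants are just (dx, dy) offsets, so shift and extrude can be understood as plain coordinate arithmetic. A toy model on (x, y) cell positions (these helpers illustrate the behaviour for non-negative extrude offsets; they are not databaker's implementation):

```python
# Directions as (dx, dy) offsets, matching the values printed above.
UP, DOWN, LEFT, RIGHT = (0, -1), (0, 1), (-1, 0), (1, 0)

def shift(cells, d):
    # move every cell position by the given offset
    dx, dy = d
    return {(x + dx, y + dy) for x, y in cells}

def extrude(cells, dx, dy):
    # include every cell offset by 0..dx columns and 0..dy rows
    return {(x + i, y + j) for x, y in cells
            for i in range(dx + 1) for j in range(dy + 1)}
```

For example, shift({(2, 4)}, RIGHT) gives {(3, 4)}, mirroring how B4 shifted RIGHT became C4 above.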

Dimensions

A dimension is simply a cellbag with a label and a lookup direction applied to it.

Each dimension represents a column in the output table and basically contains the instructions for how to look up the corresponding value for any given cell in the set of observations.

hdim = HDim(cellbag, label, strict=[DIRECTLY|CLOSEST], direction=[ABOVE|BELOW|LEFT|RIGHT])

The main constructor, taking a set of cells, a string name (the label), a lookup condition and a lookup direction.

The lookup conditions are:

  • CLOSEST (gets the nearest cell at or before the observation in the specified direction; it need not be in the same column or row as the observation);
  • DIRECTLY (gets the nearest cell in the same column or row as the observation, in the specified direction).
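In toy form, for direction=ABOVE the two rules differ in whether the header cell must share the observation's column. A sketch with cells modelled as (x, y) tuples (these are hypothetical helpers for illustration, not databaker's implementation):

```python
def lookup_directly_above(ob, headers):
    # DIRECTLY: only headers in the same column, strictly above the observation
    x, y = ob
    candidates = [h for h in headers if h[0] == x and h[1] < y]
    return max(candidates, key=lambda h: h[1]) if candidates else None

def lookup_closest_above(ob, headers):
    # CLOSEST: ignore the column; take the header with the greatest
    # row number at or above the observation's row
    x, y = ob
    candidates = [h for h in headers if h[1] <= y]
    return max(candidates, key=lambda h: h[1]) if candidates else None
```

This mirrors the year lookup in the worked example further down: with year headers in rows 4 and 6 of column A, an observation in row 5 of column E finds the row-4 year under CLOSEST, even though it is in a different column.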

hdim.cellvalobs(cell)

This function looks up the value of an individual cell in hdim.hbagset (defined in the constructor) according to the lookup condition and direction, and returns the pair (cell, value). The value will always be cell.value, unless it has been overridden by some member of hdim.cellvalueoverride.

hdim.AddCellValueOverride(overridecell, overridevalue)

This function is an interface for changing the returned values; it alters hdim.cellvalueoverride. It can be used to change the spellings of particular dimension values or to insert new heading cells in place of blank ones.

Inserting header cells is sometimes necessary when a heading is centred and you can't look it up with a single (strict=CLOSEST, direction=LEFT|RIGHT) command. (The direction=NEAREST feature proved unreliable in the real world.)

hdim.discardcellsnotlookedup(observationcells)

This function uses a set of observation cells to thin out the list of dimension cells hdim.hbagset to only those which can be looked up. It can be used to quickly trim out footnotes at the bottom of a column that don't make any difference to the final output, while making validation easier (see hdim.checkvalues below).

hdim.valueslist()

Use this function to print the final heading cell values (the values in hdim.hbagset after they have been overridden by hdim.cellvalueoverride) for use in making the validation checks.

hdim.checkvalues(valueslist)

This validates the dimension values against a hard-coded values list that has been generated earlier by hdim.valueslist() and throws an exception with an explanation if they are different.

Use this function if you need to run your code against different spreadsheets and need to check that the outputs are going to be consistent.
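The workflow amounts to snapshotting hdim.valueslist() into your script and diffing against it on later runs. The core of the check, in toy form (check_values is a hypothetical helper, not databaker's implementation):

```python
def check_values(actual, expected):
    # compare the dimension's current values against a hard-coded
    # snapshot, and explain any drift in both directions
    unexpected = sorted(set(actual) - set(expected))
    missing = sorted(set(expected) - set(actual))
    if unexpected or missing:
        raise ValueError("unexpected values: %s, missing values: %s"
                         % (unexpected, missing))
```

A failing check tells you immediately which heading has been renamed, added or dropped in a new edition of the spreadsheet, rather than letting a silently wrong column through.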

hdimc = HDimConst(label, value)

Creates a constant dimension that gives the same value no matter which observation is looked up.


In [24]:
from databaker.framework import *
tab = loadxlstabs("example1.xls", sheetids="stones", verbose=False)[0]

rocks = tab.filter("Rocks").fill(DOWN)
years = tab.filter("Year").fill(DOWN).is_not_whitespace()
cost = tab.filter("cost").fill(DOWN)
print(rocks)

# savepreviewhtml([rocks, years, cost])  # <-- uncomment this line to see the table


{<D4 'chalk'>, <D8 'basalt'>, <D5 'granite'>, <D7 'shale'>, <D9 'ice'>, <D6 'limestone'>}

In [31]:
hrocks = HDim(rocks, "ROCKS!", DIRECTLY, LEFT)
hrocks.AddCellValueOverride("granite", "gneiss")
hyears = HDim(years, "yyyy", CLOSEST, UP)

for ob in cost:
    print(ob, "\t", hyears.cellvalobs(ob), "\t", hrocks.cellvalobs(ob))

# savepreviewhtml([hrocks, hyears, cost])  # <-- uncomment to see as a coloured table


{<E4 10.0>} 	 (<A4 1972.0>, '1972.0') 	 (<D4 'chalk'>, 'chalk')
{<E5 30.0>} 	 (<A4 1972.0>, '1972.0') 	 (<D5 'granite'>, 'gneiss')
{<E6 2.0>} 	 (<A6 1989.0>, '1989.0') 	 (<D6 'limestone'>, 'limestone')
{<E7 88.0>} 	 (<A6 1989.0>, '1989.0') 	 (<D7 'shale'>, 'shale')
{<E8 96.0>} 	 (<A6 1989.0>, '1989.0') 	 (<D8 'basalt'>, 'basalt')
{<E9 8.0>} 	 (<A6 1989.0>, '1989.0') 	 (<D9 'ice'>, 'ice')

Conversion Segments

A ConversionSegment is a set of observations together with a list of dimensions.

ConversionSegment(observations, dimensions)

Constructor for the ConversionSegment, where observations is a bag of cells and dimensions is a list of HDim and HDimConst dimension objects. You can construct the dimensions at the same time as defining the list at the point when you call this function.

ConversionSegment(observations, dimensions, processTIMEUNIT=True, includecellxy=False)

Two default parameters in the ConversionSegment constructor. processTIMEUNIT controls whether a dimension called TIME should be used to automatically set the dimension known as TIMEUNIT. This is required by the WDA output, though the same operation can be done later in pandas.

includecellxy causes the output to include three extra columns, [__x, __y, __tablename] which can be used for debugging purposes.

conversionsegment.topandas()

Turns a ConversionSegment into a pandas.DataFrame. pandas is an extremely powerful, efficient and widely used data-manipulation library.

This marks the place where you depart cleanly from the Databaker library and can go on to further analysis, or it's a temporary entry into a system where the data can be fixed up before outputting it to the WDA format.

savepreviewhtml(conversionsegment, htmlfilename=None, verbose=True)

This function is restated from the Loading-and-saving section to remind you that when you use it on a ConversionSegment the Observation cells are interactive -- click on one to highlight the dimension cells it is looking up to.

Also, overridden values are illustrated by strike-throughs.

writetechnicalCSV(outputfile, conversionsegments)

This function is also restated from the Loading-and-saving section for saving a WDA output file. The argument can be a single ConversionSegment, a list of ConversionSegments or a list of pandas.DataFrames (which have been cleaned up).

Special WDA dimensions

The WDA format contains the following special dimension columns that are output at the front of every row. They are identified by their dimension labels.

For convenience, the variable names have been set to their string names, ie

STATUNIT = "STATUNIT"

See the WDA documentation for their specific uses.

  • OBS - This is not a dimension; it's the observation column. Do not name a dimension as "OBS"
  • DATAMARKER - If OBS is not a number, then the non-numeric part is stripped off and put into the DATAMARKER column
  • STATUNIT
  • MEASURETYPE
  • UNITMULTIPLIER
  • UNITOFMEASURE
  • GEOG
  • TIME - Of the form "2010", "2010 Q1" or "Jan 2010"
  • TIMEUNIT - "Year", "Quarter", "Month" respectively
  • STATPOP
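The TIME → TIMEUNIT mapping listed above can be sketched with a few regular expressions (guess_timeunit is a hypothetical stand-in for illustration, not the real Ldatetimeunitloose function):

```python
import re

def guess_timeunit(time):
    # guess the WDA TIMEUNIT from the look of a TIME value
    t = str(time).strip()
    if re.fullmatch(r"\d{4}(\.0)?", t):   # "2010" or a float like 2017.0
        return "Year"
    if re.search(r"Q[1-4]", t):           # "2010 Q1", "Q32017"
        return "Quarter"
    if re.search(r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", t):
        return "Month"                    # "Jan 2010", "Mar  2017"
    return ""
```

The quarter test has to come before the month test so that a value like "2010 Q1" isn't misread, and the year test tolerates the trailing '.0' that xls floats acquire, as in the real examples below.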

pdguessforceTIMEUNIT(dataframe)

Find and set the TIMEUNIT column from the TIME column in a pandas.DataFrame. This function has two lines. The first line matches the unit from the TIME value:

df["TIMEUNIT"] = df.apply(lambda row: Ldatetimeunitloose(row.TIME), axis=1)

The second line forces the TIME value to conform to the exact format required by the WDA file, given the TIMEUNIT:

df["TIME"] = df.apply(lambda row: Ldatetimeunitforce(row.TIME, row.TIMEUNIT), axis=1)


In [39]:
from databaker.framework import *

times = [2017.0, "Q32017", "Mar  2017"]
for t in times:
    print(t, "is\t", Ldatetimeunitloose(t), "corrected to\t", Ldatetimeunitforce(t, Ldatetimeunitloose(t)))


2017.0 is	 Year corrected to	 2017
Q32017 is	 Quarter corrected to	 2017 Q3
Mar  2017 is	 Month corrected to	 Mar 2017

In [9]:
from databaker.framework import *
tab = loadxlstabs("example1.xls", sheetids="stones", verbose=False)[0]

cs = ConversionSegment(tab.filter("cost").fill(DOWN), [
        HDim(tab.filter("Year").fill(DOWN).is_not_whitespace(), "year", CLOSEST, UP),
        HDim(tab.filter("Month").fill(DOWN).is_not_whitespace(), "month", DIRECTLY, LEFT)
    ])


###################
# savepreviewhtml(cs)   # <-- uncomment this to see the interactive table

dcs = cs.topandas()
# print(dcs)   # uncomment to see the table

# concatenate the month and year into a time
dcs["TIME"] = dcs.month + " " + dcs.year
pdguessforceTIMEUNIT(dcs)   # <-- fixes the date format (removing the '.0's on the years)
# print(dcs)   # uncomment to see the table at this point

# delete the now redundant columns 
dcs.drop(['year', "month"], axis=1, inplace=True)
#print(dcs)  # uncomment to see pandas table

# Output the finished WDA file where the dates should all work!
print(writetechnicalCSV(None, dcs))


observation,data_marking,statistical_unit_eng,statistical_unit_cym,measure_type_eng,measure_type_cym,observation_type,empty,obs_type_value,unit_multiplier,unit_of_measure_eng,unit_of_measure_cym,confidentuality,empty1,geographic_area,empty2,empty3,time_dim_item_id,time_dim_item_label_eng,time_dim_item_label_cym,time_type,empty4,statistical_population_id,statistical_population_label_eng,statistical_population_label_cym,cdid,cdiddescrip,empty5,empty6,empty7,empty8,empty9,empty10,empty11,empty12
10.0,,,,,,,,,,,,,,,,,Jan 1972,Jan 1972,,Month,,,,,,,,,,,,,0,
30.0,,,,,,,,,,,,,,,,,Aug 1972,Aug 1972,,Month,,,,,,,,,,,,,0,
2.0,,,,,,,,,,,,,,,,,Feb 1989,Feb 1989,,Month,,,,,,,,,,,,,0,
88.0,,,,,,,,,,,,,,,,,Mar 1989,Mar 1989,,Month,,,,,,,,,,,,,0,
96.0,,,,,,,,,,,,,,,,,Jun 1989,Jun 1989,,Month,,,,,,,,,,,,,0,
8.0,,,,,,,,,,,,,,,,,Dec 1989,Dec 1989,,Month,,,,,,,,,,,,,0,
*********,6

Downloading excel and unzipping files

If you are doing work on a computer that can actually be done by the computer, then you are not doing real work.

Please automate the webscraping and unzipping of files.

Here are some quick methods for downloading multiple excel spreadsheets linked to from this page.


In [11]:
import urllib, re, os

# url containing the index of a set of spreadsheets
ddurl = "https://www.ons.gov.uk/businessindustryandtrade/constructionindustry/datasets/outputintheconstructionindustry/current"
req1 = urllib.request.Request(ddurl, headers={'User-Agent' : "Sensible code"}) 
dhtml = urllib.request.urlopen(req1).read().decode("utf8")
print("Downloaded a webpage with", len(dhtml), "bytes")


Downloaded a webpage with 31071 bytes

In [20]:
# make the download directory
dfiles = "downloaddir"
if not os.path.isdir(dfiles):
    print("making directory", dfiles)
    os.mkdir(dfiles)

In [30]:
# quick and dirty regular expression for pulling out the links to relevant xls spreadsheets
xllinklist = re.findall('href="(/file\?uri=/businessindustryandtrade.*?/([^/"]*\.xls))"', dhtml)
    
for xl, xln in xllinklist:
    lxln = os.path.join(dfiles, xln)
    if os.path.exists(lxln):
        continue   # <-- we avoid downloading the same file a second time, in this case
    furl = urllib.parse.urljoin(ddurl, xl)
    req = urllib.request.Request(furl, headers={'User-Agent' : "Sensible code"}) 
    xp = urllib.request.urlopen(req).read()
    print("Downloading", xln, len(xp), "bytes")
    fout = open(lxln, "wb")
    fout.write(xp)
    fout.close()

In [31]:
fnames = [ os.path.join(dfiles, f)  for f in os.listdir(dfiles)  if f[-4:] == '.xls' ]

print("Your list of xls files is:\n", "\n ".join(fnames))

What to do when you have zip files

If you find yourself downloading zipfiles and manually instructing the computer to unzip each file, you should think about making the computer do the work itself.

An example of zipfiles containing excel spreadsheets can be found on this page.

First job is to download one of these files, as we did above:


In [36]:
import urllib, re

# fetch the front page and find the link to the zip file we want
iurl = "https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/workplacepensions/datasets/annualsurveyofhoursandearningspensiontablespensiontypebyagegroupandbygrossweeklyearningsbandsp1"
req = urllib.request.Request(iurl, headers={'User-Agent' : "Sensible Code"}) 
ipage = urllib.request.urlopen(req).read()

# find the link to the zip file and "join" it against the base url to get the full url (there's a space -> %20 bug problem)
zyears = [ urllib.parse.urljoin(iurl, z.replace(" ", "%20"))  for z in re.findall('<a href="([^"]*?\.zip)"', str(ipage)) ]
zurl = zyears[0]

print("We are about to download the file:\n", zurl)
zfilename = "downloaded.zip"
zurl = zurl.replace(" ", "%20")   # spaces in the url get escaped in the browser
req = urllib.request.Request(zurl, headers={'User-Agent' : "Sensible Code"}) 
zbytes = urllib.request.urlopen(req).read()
    
fout = open(zfilename, "wb")
fout.write(zbytes)
fout.close()
print(zfilename, "is", len(zbytes), "bytes long.")


We are about to download the file:
 https://www.ons.gov.uk/file?uri=/employmentandlabourmarket/peopleinwork/workplacepensions/datasets/annualsurveyofhoursandearningspensiontablespensiontypebyagegroupandbygrossweeklyearningsbandsp1/2015/2015provisionaltablep1.zip
downloaded.zip is 44560 bytes long.

In [48]:
import zipfile

zfilename = "downloaded.zip"

# open the zipfile
zdir = zipfile.ZipFile(zfilename)

print("The files in", zfilename, "are:\n", "\n ".join(zdir.namelist()))

zmember0 = zdir.namelist()[0]

xlsfilename = "downloaded0.xls"
fout = open(xlsfilename, "wb")
xlsbindata = zdir.read(zmember0)
fout.write(xlsbindata)   
fout.close()

print()
print("We have unzipped:\n", zmember0, "\nand saved it as", xlsfilename, "with", len(xlsbindata), "bytes")


The files in downloaded.zip are:
 PROV - Pension Provision by Earnings & Age Group Table P1.1b   Pension Type 2015 CV.xls
 PROV - Pension Provision by Earnings & Age Group Table P1.1a   Pension Type 2015.xls

We have unzipped:
 PROV - Pension Provision by Earnings & Age Group Table P1.1b   Pension Type 2015 CV.xls 
and saved it as downloaded0.xls with 83968 bytes

In [47]:
# now we can load this file into databaker and continue with our work
from databaker.framework import *
tabs = loadxlstabs(xlsfilename)


Loading downloaded0.xls which has size 83968 bytes
Table names: ['CV notes', 'All', 'Male', 'Female']

Final Automation Notes

The processes for downloading, saving and extracting the excel files from places on the web might appear complex, but are in fact quite simple. They are broken down into nothing more than opening files, listing files, reading files and saving files.

What is very complex is staying organized: deciding where to copy the files, knowing what names they should have, and keeping track of which files are new versus which have already been seen or are possibly being overwritten.

There's no obvious answer to this problem, because it depends on the consistency and form of the source pages -- which may not be as consistent as you'd like them to be. The first step, however, is to take this file management issue seriously and give it the design and thought that it requires. And be prepared to look at it again in the event that your first attempt at automation turns out to be more burdensome than necessary.
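One simple approach to the "which files have we already seen" bookkeeping is to keep a manifest mapping each filename to a hash of its contents, and consult it before reprocessing. A sketch of the idea (these helpers are an illustration of the pattern, not part of databaker):

```python
import hashlib
import json
import os

def load_manifest(path):
    # the manifest is a json file mapping filename -> sha256 of contents
    return json.load(open(path)) if os.path.exists(path) else {}

def is_new_or_changed(manifest, fname, data):
    digest = hashlib.sha256(data).hexdigest()
    if manifest.get(fname) == digest:
        return False          # seen before with identical contents; skip it
    manifest[fname] = digest  # record it (remember to save the manifest after)
    return True
```

This catches the awkward case of a source page silently re-publishing a file under the same name with different contents, which a filename-only check (like the os.path.exists test in the download loop above) would miss.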

Finally, using the same functions as here (urllib.request.Request and so forth), it's possible for the code in a notebook to POST the processed results back into a webservice further down the pipeline, so that no one needs to run this script by hand. In that case it is important to handle error generation and to return any messages about the consistency of the input files to the place where the file was generated, so that fixes can happen as soon as possible while the file is live.


In [ ]: