In [1]:
# Load in the functions
from databaker.framework import *
# Load the spreadsheet
tabs = loadxlstabs("example1.xls")
# Select the first table
tab = tabs[0]
print("The unordered bag of cells for this table looks like:")
print(tab)
A table is also "bag of cells", which just so happens to be a set of all the cells in the table.
A "bag of cells" is like a Python set (and looks like one when you print it), but it has extra selection functions that help you navigate around the table.
We will learn these as we go along, but you can see the full list on the tutorial_reference notebook.
In [2]:
# Preview the table as a table inline
savepreviewhtml(tab)
In [3]:
bb = tab.is_bold()
print("The cells with bold font are", bb)
In [4]:
print("The", len(bb), "cells immediately below these bold font cells are", bb.shift(DOWN))
In [5]:
cc = tab.filter("Cars")
print("The single cell with the text 'Cars' is", cc)
cc.assert_one() # proves there is only one cell in this bag
Out[5]:
In [6]:
print("Everything in the column below the 'Cars' cell is", cc.fill(DOWN))
In [7]:
hcc = tab.filter("Cars").expand(DOWN)
print("If you wanted to include the 'Cars' heading, then use expand", hcc)
In [8]:
print("You can print the cells in row-column order if you don't mind unfriendly code")
shcc = sorted(hcc.unordered_cells, key=lambda Cell:(Cell.y, Cell.x))
print(shcc)
In [9]:
print("It can be easier to see the set of cells coloured within the table")
savepreviewhtml(hcc)
Note: As you work through this tutorial, do please feel free to temporarily insert new Jupyter-Cells in order to give yourself a place to experiment with any of the functions that are available. (Remember, the value of the last line in a Jupyter-Cell is always printed out -- in addition to any earlier print-statements.)
In [10]:
"All the cells that have an 'o' in them:", tab.regex(".*?o")
Out[10]:
Let's get on with some actual work. In our terminology, an "Observation" is a numerical measure (eg anything in the 3x4 array of numbers in the example table), and a "Dimension" is one of the headings.
Both are made up of a bag of cells, however a Dimension also needs to know how to "look up" from the Observation to its dimensional value.
In [11]:
# We get the array of observations by selecting its corner and expanding down and to the right
obs = tab.excel_ref('B4').expand(DOWN).expand(RIGHT)
savepreviewhtml(obs)
In [12]:
# the two main headings are in a row and a column
r1 = tab.excel_ref('B3').expand(RIGHT)
r2 = tab.excel_ref('A3').fill(DOWN)
# here we pass in a list containing two cell bags and get two colours
savepreviewhtml([r1, r2])
In [13]:
# HDim is made from a bag of cells, a name, and an instruction on how to look it up
# from an observation cell.
h1 = HDim(r1, "Vehicles", DIRECTLY, ABOVE)
# Here is an example cell
cc = tab.excel_ref('C5')
# You can preview a dimension as well as just a cell bag
savepreviewhtml([h1, cc])
In [14]:
# !!! This is the important look-up stage from a cell into a dimension
print("Cell", cc, "matches", h1.cellvalobs(cc), "in dimension", h1.label)
In [15]:
# You can start to see through to the final result of all this work when you
# print out the lookup values for every observation in the table at once.
for ob in obs:
print("Obs", ob, "maps to", h1.cellvalobs(ob))
Note the value of h1.cellvalobs(ob)
is actually a pair composed of the heading cell and its value. This is is because we can over-ride its output value without actually rewriting the original table, as we shall see.
In [16]:
# You can change an output value like this:
h1.AddCellValueOverride("Cars", "Horses")
for ob in obs:
print("Obs", ob, "maps to", h1.cellvalobs(ob))
In [17]:
# Alternatively, you can override by the reference to a single cell to a value
# (This will work even if the cell C3 is empty, which helps with filling in blank headings)
h1.AddCellValueOverride(tab.excel_ref('C3'), "Submarines")
for ob in obs:
print("Obs", ob, "maps to", h1.cellvalobs(ob))
In [18]:
# You can override the header value for an individual observation element.
b4cell = tab.excel_ref('B4')
h1.AddCellValueOverride(b4cell, "Clouds")
for ob in obs:
print("Obs", ob, "maps to", h1.cellvalobs(ob))
In [19]:
# The preview table shows how things have changed
savepreviewhtml([h1, obs])
In [20]:
wob = tab.excel_ref('A1')
print("Wrong-Obs", wob, "maps to", h1.cellvalobs(wob), " <--- ie Nothing")
In [21]:
h1.AddCellValueOverride(None, "Who knows?")
print("After giving a default value Wrong-Obs", wob, "now maps to", h1.cellvalobs(wob))
In [22]:
# The default even works if the cell bag set is empty. In which case we have a special
# constant case that maps every observation to the same value
h3 = HDimConst("Category", "Beatles")
for ob in obs:
print("Obs", ob, "maps to", h3.cellvalobs(ob))
A ConversionSegment is a collection of Dimensions with an Observation set that is going to be processed and output as a table all at once.
You can preview them in HTML (just like the cell bags and dimensions), only this time the observation cells can be clicked on interactively to show how they look up.
In [46]:
dimensions = [
HDim(tab.excel_ref('B1'), TIME, CLOSEST, ABOVE),
HDim(r1, "Vehicles", DIRECTLY, ABOVE),
HDim(r2, "Name", DIRECTLY, LEFT),
HDimConst("Category", "Beatles")
]
c1 = ConversionSegment(obs, dimensions, processTIMEUNIT=False)
savepreviewhtml(c1)
In [55]:
# If the table is too big, we can preview it in another file is openable in another browser window.
# (It's very useful if you are using two computer screens.)
savepreviewhtml(c1, "preview.html", verbose=False)
In [47]:
print("Looking up all the observations against all the dimensions and print them out")
for ob in c1.segment:
print(c1.lookupobs(ob))
In [48]:
df = c1.topandas()
df
Out[48]:
The ONS uses their own data system for publishing their time-series data known as WDA.
If you need to output to it, then this next section is for you.
The function which outputs to the WDA format is writetechnicalCSV(filename, [conversionsegments])
The format is very verbose because it repeats each dimension name and its value twice in each row, and every row begins with the following list of column entries, whether or not they exist.
observation, data_marking, statistical_unit_eng, statistical_unit_cym, measure_type_eng, measure_type_cym, observation_type, obs_type_value, unit_multiplier, unit_of_measure_eng, unit_of_measure_cym, confidentuality, geographic_area
The writetechnicalCSV()
function accepts a single conversion segment, a list of conversion segments, or equivalently a pandas dataframe.
In [35]:
print(writetechnicalCSV(None, c1))
In [40]:
# This is how to write to a file
writetechnicalCSV("exampleWDA.csv", c1)
# We can read this file back in to a list of pandas dataframes
dfs = readtechnicalCSV("exampleWDA.csv")
print(dfs[0])
Note If you were wondering what the processTIMEUNIT=False
was all about in the ConversionSegment constructor, it's a feature to help the WDA output automatically set the TIMEUNIT column according to whether it should be Year, Month, or Quarter.
You will note that the TIME column above is 2014.0
when it really should be 2014
with the TIMEUNIT set to Year
.
By setting it to True
the ConversionSegment object will identify the timeunit from the value of the TIME column and then force its format to conform.
In [51]:
# See that the `2014` no longer ends with `.0`
c1 = ConversionSegment(obs, dimensions, processTIMEUNIT=True)
c1.topandas()
Out[51]:
Note Sometimes the TIME value needs to be created by joining two or more other cells (eg one is a month, and the other is the year).
Such an operation can much more easily be done using the pandas column operations than by using the concept of subdimensions which used to exist in Databaker before we took it out.
This will be explained in a later worked example.
In [ ]: