Opening and previewing

This uses the tiny excel spreadsheet example1.xls. It is small enough to preview inline in this notebook. But for bigger spreadsheet tables you will want to open them up in a separate window.


In [1]:
# Load in the functions
from databaker.framework import *

# Load the spreadsheet
tabs = loadxlstabs("example1.xls")

# Select the first table
tab = tabs[0]

print("The unordered bag of cells for this table looks like:")
print(tab)


Loading example1.xls which has size 7168 bytes
Table names: ['beatles', 'stones']
The unordered bag of cells for this table looks like:
{<B4 2.0>, <C6 1.0>, <D1 ''>, <B5 4.0>, <C5 3.0>, <C1 ''>, <B3 'Cars'>, <D3 'Trains'>, <C2 ''>, <D4 1.0>, <D6 3.0>, <B6 4.0>, <D5 2.0>, <B2 ''>, <A1 'Date'>, <B1 2014.0>, <C7 5.0>, <A7 'George'>, <A6 'Ringo'>, <A3 ''>, <A2 ''>, <C4 2.0>, <A4 'John'>, <C3 'Planes'>, <D7 5.0>, <D2 ''>, <A5 'Paul'>, <B7 2.0>}

Selecting cell bags

A table is also "bag of cells", which just so happens to be a set of all the cells in the table.

A "bag of cells" is like a Python set (and looks like one when you print it), but it has extra selection functions that help you navigate around the table.

We will learn these as we go along, but you can see the full list on the tutorial_reference notebook.


In [2]:
# Preview the table as a table inline
savepreviewhtml(tab)


beatles
Date2014.0
CarsPlanesTrains
John2.02.01.0
Paul4.03.02.0
Ringo4.01.03.0
George2.05.05.0

In [3]:
bb = tab.is_bold()
print("The cells with bold font are", bb)


The cells with bold font are {<C3 'Planes'>, <D3 'Trains'>}

In [4]:
print("The", len(bb), "cells immediately below these bold font cells are", bb.shift(DOWN))


The 2 cells immediately below these bold font cells are {<C4 2.0>, <D4 1.0>}

In [5]:
cc = tab.filter("Cars")
print("The single cell with the text 'Cars' is", cc)

cc.assert_one()   # proves there is only one cell in this bag


The single cell with the text 'Cars' is {<B3 'Cars'>}
Out[5]:
{<B3 'Cars'>}

In [6]:
print("Everything in the column below the 'Cars' cell is", cc.fill(DOWN))


Everything in the column below the 'Cars' cell is {<B7 2.0>, <B6 4.0>, <B5 4.0>, <B4 2.0>}

In [7]:
hcc = tab.filter("Cars").expand(DOWN)
print("If you wanted to include the 'Cars' heading, then use expand", hcc)


If you wanted to include the 'Cars' heading, then use expand {<B3 'Cars'>, <B7 2.0>, <B6 4.0>, <B5 4.0>, <B4 2.0>}

In [8]:
print("You can print the cells in row-column order if you don't mind unfriendly code")
shcc = sorted(hcc.unordered_cells, key=lambda Cell:(Cell.y, Cell.x))
print(shcc)


You can print the cells in row-column order if you don't mind unfriendly code
[<B3 'Cars'>, <B4 2.0>, <B5 4.0>, <B6 4.0>, <B7 2.0>]

In [9]:
print("It can be easier to see the set of cells coloured within the table")
savepreviewhtml(hcc)


It can be easier to see the set of cells coloured within the table
item 0
beatles
Date2014.0
CarsPlanesTrains
John2.02.01.0
Paul4.03.02.0
Ringo4.01.03.0
George2.05.05.0

Note: As you work through this tutorial, do please feel free to temporarily insert new Jupyter-Cells in order to give yourself a place to experiment with any of the functions that are available. (Remember, the value of the last line in a Jupyter-Cell is always printed out -- in addition to any earlier print-statements.)


In [10]:
"All the cells that have an 'o' in them:", tab.regex(".*?o")


Out[10]:
("All the cells that have an 'o' in them:",
 {<A6 'Ringo'>, <A4 'John'>, <A7 'George'>})

Observations and dimensions

Let's get on with some actual work. In our terminology, an "Observation" is a numerical measure (eg anything in the 3x4 array of numbers in the example table), and a "Dimension" is one of the headings.

Both are made up of a bag of cells, however a Dimension also needs to know how to "look up" from the Observation to its dimensional value.


In [11]:
# We get the array of observations by selecting its corner and expanding down and to the right
obs = tab.excel_ref('B4').expand(DOWN).expand(RIGHT)
savepreviewhtml(obs)


item 0
beatles
Date2014.0
CarsPlanesTrains
John2.02.01.0
Paul4.03.02.0
Ringo4.01.03.0
George2.05.05.0

In [12]:
# the two main headings are in a row and a column
r1 = tab.excel_ref('B3').expand(RIGHT)
r2 = tab.excel_ref('A3').fill(DOWN)

# here we pass in a list containing two cell bags and get two colours
savepreviewhtml([r1, r2])


item 0item 1
beatles
Date2014.0
CarsPlanesTrains
John2.02.01.0
Paul4.03.02.0
Ringo4.01.03.0
George2.05.05.0

In [13]:
# HDim is made from a bag of cells, a name, and an instruction on how to look it up 
# from an observation cell.  
h1 = HDim(r1, "Vehicles", DIRECTLY, ABOVE)

# Here is an example cell
cc = tab.excel_ref('C5')

# You can preview a dimension as well as just a cell bag
savepreviewhtml([h1, cc])


Vehiclesitem 1
beatles
Date2014.0
CarsPlanesTrains
John2.02.01.0
Paul4.03.02.0
Ringo4.01.03.0
George2.05.05.0

In [14]:
# !!! This is the important look-up stage from a cell into a dimension
print("Cell", cc, "matches", h1.cellvalobs(cc), "in dimension", h1.label)


Cell {<C5 3.0>} matches (<C3 'Planes'>, 'Planes') in dimension Vehicles

In [15]:
# You can start to see through to the final result of all this work when you 
# print out the lookup values for every observation in the table at once.  
for ob in obs:
    print("Obs", ob, "maps to", h1.cellvalobs(ob))


Obs {<B4 2.0>} maps to (<B3 'Cars'>, 'Cars')
Obs {<C4 2.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D4 1.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B5 4.0>} maps to (<B3 'Cars'>, 'Cars')
Obs {<C5 3.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D5 2.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B6 4.0>} maps to (<B3 'Cars'>, 'Cars')
Obs {<C6 1.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D6 3.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B7 2.0>} maps to (<B3 'Cars'>, 'Cars')
Obs {<C7 5.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D7 5.0>} maps to (<D3 'Trains'>, 'Trains')

Note the value of h1.cellvalobs(ob) is actually a pair composed of the heading cell and its value. This is is because we can over-ride its output value without actually rewriting the original table, as we shall see.


In [16]:
# You can change an output value like this:
h1.AddCellValueOverride("Cars", "Horses")

for ob in obs:
    print("Obs", ob, "maps to", h1.cellvalobs(ob))


Obs {<B4 2.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C4 2.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D4 1.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B5 4.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C5 3.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D5 2.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B6 4.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C6 1.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D6 3.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B7 2.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C7 5.0>} maps to (<C3 'Planes'>, 'Planes')
Obs {<D7 5.0>} maps to (<D3 'Trains'>, 'Trains')

In [17]:
# Alternatively, you can override by the reference to a single cell to a value 
# (This will work even if the cell C3 is empty, which helps with filling in blank headings)
h1.AddCellValueOverride(tab.excel_ref('C3'), "Submarines")
for ob in obs:
    print("Obs", ob, "maps to", h1.cellvalobs(ob))


Obs {<B4 2.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C4 2.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D4 1.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B5 4.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C5 3.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D5 2.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B6 4.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C6 1.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D6 3.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B7 2.0>} maps to (<B3 'Cars'>, 'Horses')
Obs {<C7 5.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D7 5.0>} maps to (<D3 'Trains'>, 'Trains')

In [18]:
# You can override the header value for an individual observation element.  
b4cell = tab.excel_ref('B4')
h1.AddCellValueOverride(b4cell, "Clouds")
for ob in obs:
    print("Obs", ob, "maps to", h1.cellvalobs(ob))


Obs {<B4 2.0>} maps to (None, 'Clouds')
Obs {<C4 2.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D4 1.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B5 4.0>} maps to (<B4 2.0>, 'Clouds')
Obs {<C5 3.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D5 2.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B6 4.0>} maps to (<B4 2.0>, 'Clouds')
Obs {<C6 1.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D6 3.0>} maps to (<D3 'Trains'>, 'Trains')
Obs {<B7 2.0>} maps to (<B4 2.0>, 'Clouds')
Obs {<C7 5.0>} maps to (<C3 'Planes'>, 'Submarines')
Obs {<D7 5.0>} maps to (<D3 'Trains'>, 'Trains')

In [19]:
# The preview table shows how things have changed
savepreviewhtml([h1, obs])


Vehiclesitem 1
beatles
Date2014.0
CarsHorsesPlanesSubmarinesTrains
John2.0Clouds2.01.0
Paul4.03.02.0
Ringo4.01.03.0
George2.05.05.0

In [20]:
wob = tab.excel_ref('A1')
print("Wrong-Obs", wob, "maps to", h1.cellvalobs(wob), " <--- ie Nothing")


Wrong-Obs {<A1 'Date'>} maps to (None, None)  <--- ie Nothing

In [21]:
h1.AddCellValueOverride(None, "Who knows?")
print("After giving a default value Wrong-Obs", wob, "now maps to", h1.cellvalobs(wob))


After giving a default value Wrong-Obs {<A1 'Date'>} now maps to (None, 'Who knows?')

In [22]:
# The default even works if the cell bag set is empty.  In which case we have a special 
# constant case that maps every observation to the same value
h3 = HDimConst("Category", "Beatles")
for ob in obs:
    print("Obs", ob, "maps to", h3.cellvalobs(ob))


Obs {<B4 2.0>} maps to (None, 'Beatles')
Obs {<C4 2.0>} maps to (None, 'Beatles')
Obs {<D4 1.0>} maps to (None, 'Beatles')
Obs {<B5 4.0>} maps to (None, 'Beatles')
Obs {<C5 3.0>} maps to (None, 'Beatles')
Obs {<D5 2.0>} maps to (None, 'Beatles')
Obs {<B6 4.0>} maps to (None, 'Beatles')
Obs {<C6 1.0>} maps to (None, 'Beatles')
Obs {<D6 3.0>} maps to (None, 'Beatles')
Obs {<B7 2.0>} maps to (None, 'Beatles')
Obs {<C7 5.0>} maps to (None, 'Beatles')
Obs {<D7 5.0>} maps to (None, 'Beatles')

Conversion segments and output

A ConversionSegment is a collection of Dimensions with an Observation set that is going to be processed and output as a table all at once.

You can preview them in HTML (just like the cell bags and dimensions), only this time the observation cells can be clicked on interactively to show how they look up.


In [46]:
dimensions = [ 
    HDim(tab.excel_ref('B1'), TIME, CLOSEST, ABOVE), 
    HDim(r1, "Vehicles", DIRECTLY, ABOVE), 
    HDim(r2, "Name", DIRECTLY, LEFT), 
    HDimConst("Category", "Beatles")
]

c1 = ConversionSegment(obs, dimensions, processTIMEUNIT=False)
savepreviewhtml(c1)


OBSTIMEVehiclesName
beatles
Date2014.0
CarsPlanesTrains
John2.02.01.0
Paul4.03.02.0
Ringo4.01.03.0
George2.05.05.0

In [55]:
# If the table is too big, we can preview it in another file is openable in another browser window.
# (It's very useful if you are using two computer screens.)
savepreviewhtml(c1, "preview.html", verbose=False)


Written to file: preview.html

In [47]:
print("Looking up all the observations against all the dimensions and print them out")
for ob in c1.segment:
    print(c1.lookupobs(ob))


Looking up all the observations against all the dimensions and print them out
{'Vehicles': 'Cars', 'Category': 'Beatles', 'OBS': 2.0, 'TIME': '2014.0', 'Name': 'John'}
{'Vehicles': 'Planes', 'Category': 'Beatles', 'OBS': 2.0, 'TIME': '2014.0', 'Name': 'John'}
{'Vehicles': 'Trains', 'Category': 'Beatles', 'OBS': 1.0, 'TIME': '2014.0', 'Name': 'John'}
{'Vehicles': 'Cars', 'Category': 'Beatles', 'OBS': 4.0, 'TIME': '2014.0', 'Name': 'Paul'}
{'Vehicles': 'Planes', 'Category': 'Beatles', 'OBS': 3.0, 'TIME': '2014.0', 'Name': 'Paul'}
{'Vehicles': 'Trains', 'Category': 'Beatles', 'OBS': 2.0, 'TIME': '2014.0', 'Name': 'Paul'}
{'Vehicles': 'Cars', 'Category': 'Beatles', 'OBS': 4.0, 'TIME': '2014.0', 'Name': 'Ringo'}
{'Vehicles': 'Planes', 'Category': 'Beatles', 'OBS': 1.0, 'TIME': '2014.0', 'Name': 'Ringo'}
{'Vehicles': 'Trains', 'Category': 'Beatles', 'OBS': 3.0, 'TIME': '2014.0', 'Name': 'Ringo'}
{'Vehicles': 'Cars', 'Category': 'Beatles', 'OBS': 2.0, 'TIME': '2014.0', 'Name': 'George'}
{'Vehicles': 'Planes', 'Category': 'Beatles', 'OBS': 5.0, 'TIME': '2014.0', 'Name': 'George'}
{'Vehicles': 'Trains', 'Category': 'Beatles', 'OBS': 5.0, 'TIME': '2014.0', 'Name': 'George'}

In [48]:
df = c1.topandas()
df



Out[48]:
OBS TIME Vehicles Name Category
0 2.0 2014.0 Cars John Beatles
1 2.0 2014.0 Planes John Beatles
2 1.0 2014.0 Trains John Beatles
3 4.0 2014.0 Cars Paul Beatles
4 3.0 2014.0 Planes Paul Beatles
5 2.0 2014.0 Trains Paul Beatles
6 4.0 2014.0 Cars Ringo Beatles
7 1.0 2014.0 Planes Ringo Beatles
8 3.0 2014.0 Trains Ringo Beatles
9 2.0 2014.0 Cars George Beatles
10 5.0 2014.0 Planes George Beatles
11 5.0 2014.0 Trains George Beatles

WDA Technical CSV

The ONS uses their own data system for publishing their time-series data known as WDA.

If you need to output to it, then this next section is for you.

The function which outputs to the WDA format is writetechnicalCSV(filename, [conversionsegments]) The format is very verbose because it repeats each dimension name and its value twice in each row, and every row begins with the following list of column entries, whether or not they exist.

observation, data_marking, statistical_unit_eng, statistical_unit_cym, measure_type_eng, measure_type_cym, observation_type, obs_type_value, unit_multiplier, unit_of_measure_eng, unit_of_measure_cym, confidentuality, geographic_area

The writetechnicalCSV() function accepts a single conversion segment, a list of conversion segments, or equivalently a pandas dataframe.


In [35]:
print(writetechnicalCSV(None, c1))


observation,data_marking,statistical_unit_eng,statistical_unit_cym,measure_type_eng,measure_type_cym,observation_type,empty,obs_type_value,unit_multiplier,unit_of_measure_eng,unit_of_measure_cym,confidentuality,empty1,geographic_area,empty2,empty3,time_dim_item_id,time_dim_item_label_eng,time_dim_item_label_cym,time_type,empty4,statistical_population_id,statistical_population_label_eng,statistical_population_label_cym,cdid,cdiddescrip,empty5,empty6,empty7,empty8,empty9,empty10,empty11,empty12,dim_id_1,dimension_label_eng_1,dimension_label_cym_1,dim_item_id_1,dimension_item_label_eng_1,dimension_item_label_cym_1,is_total_1,is_sub_total_1,dim_id_2,dimension_label_eng_2,dimension_label_cym_2,dim_item_id_2,dimension_item_label_eng_2,dimension_item_label_cym_2,is_total_2,is_sub_total_2,dim_id_3,dimension_label_eng_3,dimension_label_cym_3,dim_item_id_3,dimension_item_label_eng_3,dimension_item_label_cym_3,is_total_3,is_sub_total_3
2.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Cars,Cars,,,,Name,Name,,John,John,,,,Category,Category,,Beatles,Beatles,,,
2.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Planes,Planes,,,,Name,Name,,John,John,,,,Category,Category,,Beatles,Beatles,,,
1.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Trains,Trains,,,,Name,Name,,John,John,,,,Category,Category,,Beatles,Beatles,,,
4.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Cars,Cars,,,,Name,Name,,Paul,Paul,,,,Category,Category,,Beatles,Beatles,,,
3.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Planes,Planes,,,,Name,Name,,Paul,Paul,,,,Category,Category,,Beatles,Beatles,,,
2.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Trains,Trains,,,,Name,Name,,Paul,Paul,,,,Category,Category,,Beatles,Beatles,,,
4.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Cars,Cars,,,,Name,Name,,Ringo,Ringo,,,,Category,Category,,Beatles,Beatles,,,
1.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Planes,Planes,,,,Name,Name,,Ringo,Ringo,,,,Category,Category,,Beatles,Beatles,,,
3.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Trains,Trains,,,,Name,Name,,Ringo,Ringo,,,,Category,Category,,Beatles,Beatles,,,
2.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Cars,Cars,,,,Name,Name,,George,George,,,,Category,Category,,Beatles,Beatles,,,
5.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Planes,Planes,,,,Name,Name,,George,George,,,,Category,Category,,Beatles,Beatles,,,
5.0,,,,,,,,,,,,,,,,,2014.0,2014.0,,,,,,,,,,,,,,,0,,Vehicles,Vehicles,,Trains,Trains,,,,Name,Name,,George,George,,,,Category,Category,,Beatles,Beatles,,,
*********,12


In [40]:
# This is how to write to a file
writetechnicalCSV("exampleWDA.csv", c1)

# We can read this file back in to a list of pandas dataframes
dfs = readtechnicalCSV("exampleWDA.csv")
print(dfs[0])


writing 1 conversion segments into /home/goatchurch/sensiblecode/src/databaker/databaker/tutorial/exampleWDA.csv
conversionwrite segment size 12 table 'beatles'; 
    OBS    TIME Vehicles    Name Category
0   2.0  2014.0     Cars    John  Beatles
1   2.0  2014.0   Planes    John  Beatles
2   1.0  2014.0   Trains    John  Beatles
3   4.0  2014.0     Cars    Paul  Beatles
4   3.0  2014.0   Planes    Paul  Beatles
5   2.0  2014.0   Trains    Paul  Beatles
6   4.0  2014.0     Cars   Ringo  Beatles
7   1.0  2014.0   Planes   Ringo  Beatles
8   3.0  2014.0   Trains   Ringo  Beatles
9   2.0  2014.0     Cars  George  Beatles
10  5.0  2014.0   Planes  George  Beatles
11  5.0  2014.0   Trains  George  Beatles

Note If you were wondering what the processTIMEUNIT=False was all about in the ConversionSegment constructor, it's a feature to help the WDA output automatically set the TIMEUNIT column according to whether it should be Year, Month, or Quarter.

You will note that the TIME column above is 2014.0 when it really should be 2014 with the TIMEUNIT set to Year.

By setting it to True the ConversionSegment object will identify the timeunit from the value of the TIME column and then force its format to conform.


In [51]:
# See that the `2014` no longer ends with `.0`
c1 = ConversionSegment(obs, dimensions, processTIMEUNIT=True)
c1.topandas()


TIMEUNIT='Year'
Out[51]:
OBS TIME TIMEUNIT Vehicles Name Category
0 2.0 2014 Year Cars John Beatles
1 2.0 2014 Year Planes John Beatles
2 1.0 2014 Year Trains John Beatles
3 4.0 2014 Year Cars Paul Beatles
4 3.0 2014 Year Planes Paul Beatles
5 2.0 2014 Year Trains Paul Beatles
6 4.0 2014 Year Cars Ringo Beatles
7 1.0 2014 Year Planes Ringo Beatles
8 3.0 2014 Year Trains Ringo Beatles
9 2.0 2014 Year Cars George Beatles
10 5.0 2014 Year Planes George Beatles
11 5.0 2014 Year Trains George Beatles

Note Sometimes the TIME value needs to be created by joining two or more other cells (eg one is a month, and the other is the year).

Such an operation can much more easily be done using the pandas column operations than by using the concept of subdimensions which used to exist in Databaker before we took it out.

This will be explained in a later worked example.


In [ ]: