This Getting Started section gives a quick overview of the main objects and features of the LArray library. For a more detailed presentation of all capabilities of LArray, read the next sections of the tutorial.
The API Reference section of the documentation gives you the list of all objects, methods and functions with their individual documentation and examples.
To use the LArray library, the first thing to do is to import it:
In [ ]:
from larray import *
To check which version of the LArray library is installed on your machine, type:
In [ ]:
from larray import __version__
__version__
Working with the LArray library mainly consists of manipulating Array data structures. They represent N-dimensional labelled arrays and are composed of raw data (NumPy ndarray), axes and optionally some metadata.
An Axis object represents a dimension of an array. It contains a list of labels and has a name:
In [ ]:
# define some axes to be used later
age = Axis(['0-9', '10-17', '18-66', '67+'], 'age')
gender = Axis(['female', 'male'], 'gender')
time = Axis([2015, 2016, 2017], 'time')
Labels let you select subsets and manipulate the data without working directly with the positions of array elements.
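To make the idea of labels versus positions concrete, here is a minimal plain-Python sketch. It is only an illustration and not how LArray is implemented: the `values` table holds made-up numbers and the `select` helper is hypothetical.

```python
# Plain-Python illustration only; LArray does this bookkeeping for you.
age_labels = ['0-9', '10-17', '18-66', '67+']
gender_labels = ['female', 'male']

# Made-up values, one row per age label and one column per gender label.
values = [[10, 11],
          [20, 21],
          [30, 31],
          [40, 41]]

def select(age_label, gender_label):
    """Translate labels into positions, then fetch the value."""
    row = age_labels.index(age_label)
    column = gender_labels.index(gender_label)
    return values[row][column]

selected = select('18-66', 'male')
print(selected)  # 31
```

With a labelled array, the translation from labels to positions happens behind the scenes, so the code never has to mention row 2 or column 1.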
To create an array from scratch, you need to supply data and axes:
In [ ]:
# define some data. This is the Belgian population (in thousands). Source: Eurostat.
data = [[[633, 635, 634],
         [663, 665, 664]],
        [[484, 486, 491],
         [505, 511, 516]],
        [[3572, 3581, 3583],
         [3600, 3618, 3616]],
        [[1023, 1038, 1053],
         [756, 775, 793]]]
# create an Array object
population = Array(data, axes=[age, gender, time])
population
You can optionally attach some metadata to an array:
In [ ]:
# attach some metadata to the population array
population.meta.title = 'population by age, gender and year'
population.meta.source = 'Eurostat'
# display metadata
population.meta
To get a short summary of an array, type:
In [ ]:
# Array summary: metadata + dimensions + description of axes
population.info
Arrays filled with predefined values can be generated through dedicated functions:
zeros: creates an array filled with 0
ones: creates an array filled with 1
full: creates an array filled with a given value
sequence: creates an array by sequentially applying modifications to the array along axis
ndtest: creates a test array with increasing numbers as data
In [ ]:
zeros([age, gender])
In [ ]:
ones([age, gender])
In [ ]:
full([age, gender], fill_value=10.0)
In [ ]:
sequence(age)
In [ ]:
ndtest([age, gender])
The LArray library offers many I/O functions to read and write arrays in various formats (CSV, Excel, HDF5). For example, to save an array in a CSV file, call the method to_csv:
In [ ]:
# save our population array to a CSV file
population.to_csv('population_belgium.csv')
The content of the CSV file is then:
age,gender\time,2015,2016,2017
0-9,female,633,635,634
0-9,male,663,665,664
10-17,female,484,486,491
10-17,male,505,511,516
18-66,female,3572,3581,3583
18-66,male,3600,3618,3616
67+,female,1023,1038,1053
67+,male,756,775,793
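The header cell gender\time deserves a second look: in the file above it names both the last row axis (gender) and the axis whose labels are spread over the remaining columns (time). The sketch below reads the same text with Python's standard csv module into a dictionary keyed by (age, gender, time). It is a plain-Python illustration under that reading of the layout, not the way read_csv is implemented, and all variable names are illustrative.

```python
import csv
import io

# Raw string so that the backslash in 'gender\time' is kept literally.
csv_text = r"""age,gender\time,2015,2016,2017
0-9,female,633,635,634
0-9,male,663,665,664
10-17,female,484,486,491
10-17,male,505,511,516
18-66,female,3572,3581,3583
18-66,male,3600,3618,3616
67+,female,1023,1038,1053
67+,male,756,775,793
"""

rows = list(csv.reader(io.StringIO(csv_text)))
header, body = rows[0], rows[1:]

# 'gender\time' names the last row axis and the column axis.
row_axis, column_axis = header[1].split('\\')
years = [int(label) for label in header[2:]]

# One dictionary entry per (age, gender, year) cell.
cells = {}
for age_label, gender_label, *values in body:
    for year, value in zip(years, values):
        cells[age_label, gender_label, year] = int(value)

print(row_axis, column_axis)         # gender time
print(cells['67+', 'female', 2017])  # 1053
```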
To load a saved array, call the function read_csv:
In [ ]:
population = read_csv('population_belgium.csv')
population
Other input/output functions are described in the Input/Output section of the API documentation.
To select an element or a subset of an array, use brackets [ ]. In Python we usually use the term indexing for this operation.
Let us start by selecting a single element:
In [ ]:
population['67+', 'female', 2017]
Labels can be given in arbitrary order:
In [ ]:
population[2017, 'female', '67+']
When selecting a larger subset, the result is an array:
In [ ]:
population['female']
When selecting several labels for the same axis, they must be given as a list (enclosed by [ ]):
In [ ]:
population['female', ['0-9', '10-17']]
You can also select slices, which are all labels between two bounds (we usually call them the start and stop bounds). Specifying the start and stop bounds of a slice is optional: when not given, start is the first label of the corresponding axis, stop the last one:
In [ ]:
# in this case '10-17':'67+' is equivalent to ['10-17', '18-66', '67+']
population['female', '10-17':'67+']
In [ ]:
# :'18-66' selects all labels between the first one and '18-66'
# 2017: selects all labels between 2017 and the last one
population[:'18-66', 2017:]
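As the comment in the first slice example shows, a label slice includes its stop bound, unlike a positional Python slice such as a_list[1:3]. The following plain-Python sketch spells out that rule. The helper name label_slice is hypothetical and the code is an illustration, not LArray's implementation.

```python
def label_slice(labels, start=None, stop=None):
    """Return the labels from start to stop, both bounds included."""
    first = 0 if start is None else labels.index(start)
    last = len(labels) - 1 if stop is None else labels.index(stop)
    return labels[first:last + 1]

age_labels = ['0-9', '10-17', '18-66', '67+']
years = [2015, 2016, 2017]

print(label_slice(age_labels, '10-17', '67+'))  # ['10-17', '18-66', '67+']
print(label_slice(age_labels, stop='18-66'))    # ['0-9', '10-17', '18-66']
print(label_slice(years, start=2017))           # [2017]
```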
Selecting by label alone fails when the same label exists on several axes. For example, imagine you need to work with an 'immigration' array containing two axes sharing some common labels:
In [ ]:
country = Axis(['Belgium', 'Netherlands', 'Germany'], 'country')
citizenship = Axis(['Belgium', 'Netherlands', 'Germany'], 'citizenship')
immigration = ndtest((country, citizenship, time))
immigration
If we try to get the number of Belgians living in the Netherlands for the year 2017, we might try something like:
immigration['Netherlands', 'Belgium', 2017]
... but we receive back a volley of insults:
[some long error message ending with the line below]
[...]
ValueError: Netherlands is ambiguous (valid in country, citizenship)
In that case, we have to specify explicitly which axes the 'Netherlands' and 'Belgium' labels we want to select belong to:
In [ ]:
immigration[country['Netherlands'], citizenship['Belgium'], 2017]
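The reason for the error can be sketched in a few lines of plain Python: when a bare label is given, the axis it belongs to has to be guessed, and the guess fails if more than one axis contains it. The find_axis helper below is hypothetical and only mimics the error message shown above. It is not LArray's actual lookup code.

```python
axes = {
    'country': ['Belgium', 'Netherlands', 'Germany'],
    'citizenship': ['Belgium', 'Netherlands', 'Germany'],
    'time': [2015, 2016, 2017],
}

def find_axis(label, axes):
    """Return the name of the only axis containing label, or raise ValueError."""
    matches = [name for name, labels in axes.items() if label in labels]
    if not matches:
        raise ValueError(f"{label} is not a valid label for any axis")
    if len(matches) > 1:
        raise ValueError(f"{label} is ambiguous (valid in {', '.join(matches)})")
    return matches[0]

print(find_axis(2017, axes))  # time

try:
    find_axis('Netherlands', axes)
except ValueError as error:
    message = str(error)
    print(message)  # Netherlands is ambiguous (valid in country, citizenship)
```

Writing country['Netherlands'] removes the guesswork because the axis is stated explicitly.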
To iterate over an axis, use a for loop:
In [ ]:
for year in time:
    print(year)
The LArray library includes many aggregation methods: sum, mean, min, max, std, var, ...
For example, assuming we still have an array in the population variable:
In [ ]:
population
We can sum along the 'gender' axis using:
In [ ]:
population.sum(gender)
Or sum along both 'age' and 'gender':
In [ ]:
population.sum(age, gender)
It is sometimes more convenient to aggregate along all axes except some. In that case, use the aggregation methods ending with _by. For example:
In [ ]:
population.sum_by(time)
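To check the arithmetic behind these aggregations by hand, the sketch below recomputes two of them on the nested list used to build the population array. It is plain Python, not LArray code, and the variable names are illustrative. Summing along 'age' and 'gender' leaves only 'time', which is the same result that aggregating by time keeps.

```python
# Same nested list as the population data above: data[age][gender][time].
data = [[[633, 635, 634], [663, 665, 664]],
        [[484, 486, 491], [505, 511, 516]],
        [[3572, 3581, 3583], [3600, 3618, 3616]],
        [[1023, 1038, 1053], [756, 775, 793]]]
years = [2015, 2016, 2017]

# Summing along 'gender' keeps the 'age' and 'time' dimensions.
by_age_and_year = [[female[t] + male[t] for t in range(len(years))]
                   for female, male in data]

# Summing along 'age' and 'gender' keeps only 'time'.
by_year = [sum(gender_rows[g][t] for gender_rows in data for g in range(2))
           for t in range(len(years))]

print(by_age_and_year[0])  # [1296, 1300, 1298]
print(by_year)             # [11236, 11309, 11350]
```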
A Group object represents a subset of labels or positions of an axis:
In [ ]:
children = age['0-9', '10-17']
children
It is often useful to give a group an explicit name using the >> operator:
In [ ]:
working = age['18-66'] >> 'working'
working
In [ ]:
nonworking = age['0-9', '10-17', '67+'] >> 'nonworking'
nonworking
Still using the same population array:
In [ ]:
population
Groups can be used in selections:
In [ ]:
population[working]
In [ ]:
population[nonworking]
or aggregations:
In [ ]:
population.sum(nonworking)
When aggregating several groups, the names we set above using >> determine the labels on the aggregated axis.
Since we did not give a name to the children group, the resulting label is generated automatically:
In [ ]:
population.sum((children, working, nonworking))
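The arithmetic of a group aggregation can be checked by hand for one slice of the data. The plain-Python sketch below sums the female population of 2015 over each group of age labels. It is an illustration rather than LArray code: the dictionary keys stand in for group names, and unlike the unnamed children group above, every group is given a name here.

```python
age_labels = ['0-9', '10-17', '18-66', '67+']

# Female population in 2015 per age label, taken from the data above.
female_2015 = dict(zip(age_labels, [633, 484, 3572, 1023]))

# Each group is a named subset of the age labels.
groups = {
    'children': ['0-9', '10-17'],
    'working': ['18-66'],
    'nonworking': ['0-9', '10-17', '67+'],
}

# One total per group; the group name becomes the label of the result.
totals = {name: sum(female_2015[label] for label in labels)
          for name, labels in groups.items()}

print(totals)  # {'children': 1117, 'working': 3572, 'nonworking': 2140}
```

Note that groups may overlap: the '0-9' and '10-17' labels contribute to both the children and nonworking totals.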
Arrays may be grouped in Session objects. A session is an ordered dict-like container of Array objects with special I/O methods. To create a session, pass the arrays as keyword arguments (array_name=array), as in the example below; a list of pairs (array_name, array) is also accepted:
In [ ]:
population = zeros([age, gender, time])
births = zeros([age, gender, time])
deaths = zeros([age, gender, time])
# create a session containing the three arrays 'population', 'births' and 'deaths'
demography_session = Session(population=population, births=births, deaths=deaths)
# displays names of arrays contained in the session
demography_session.names
# get an array (option 1)
demography_session['population']
# get an array (option 2)
demography_session.births
# add/modify an array
demography_session['foreigners'] = zeros([age, gender, time])
One of the main advantages of sessions is that they let you save and load many arrays at once:
In [ ]:
# dump all arrays contained in demography_session in one HDF5 file
demography_session.save('demography.h5')
# load all arrays saved in the HDF5 file 'demography.h5' and store them in the 'demography_session' variable
demography_session = Session('demography.h5')
The LArray project provides an optional package called larray-editor allowing users to explore and edit arrays through a graphical interface.
The larray-editor tool is automatically available when installing the larrayenv metapackage from conda.
To explore the content of arrays in read-only mode, call the view function:
# shows the arrays of a given session in a graphical user interface
view(demography_session)
# the session may be directly loaded from a file
view('demography.h5')
# creates a session with all existing arrays from the current namespace
# and shows its content
view()
To open the user interface in edit mode, call the edit function instead.
Finally, you can also visually compare two arrays or sessions using the compare function:
arr0 = ndtest((3, 3))
arr1 = ndtest((3, 3))
arr1[['a1', 'a2']] = -arr1[['a1', 'a2']]
compare(arr0, arr1)
Installing the larray-editor package on Windows will create a LArray menu in the Windows Start Menu.
Once the graphical interface is open, all LArray objects and functions are directly accessible. There is no need to start with from larray import *.