Example of XML caching for pydov

Introduction

To speed up subsequent queries involving similar data, pydov uses a caching mechanism where raw DOV XML data is cached locally for later reuse. For regular usage of the package and data requests, the cache is a convenient feature that shortens subsequent queries. However, in case you want to alter the cache configuration or handling, this notebook illustrates some use cases.

Use cases:

  • Check cached files
  • Speed up subsequent queries
  • Disabling the cache
  • Changing the location of cached data
  • Changing the maximum age of cached data
  • Cleaning the cache

In [1]:
# silence warnings and import pydov
import warnings; warnings.simplefilter('ignore')
import pydov

Use cases

Check cached files


In [2]:
from pydov.search.boring import BoringSearch
boring = BoringSearch()

The pydov.cache.cachedir defines the directory on the file system used to cache DOV files:


In [3]:
# check the cache dir
import os
import pydov.util.caching
cachedir = pydov.cache.cachedir
print(cachedir)
print('directories: ', os.listdir(cachedir))


c:\users\rhbav33\appdata\local\temp\pydov
('directories: ', [])

Speed up subsequent queries

To illustrate the convenience of the caching during subsequent data requests, consider the following request, while measuring the time:


In [4]:
from pydov.util.location import Within, Box

# Get all borehole data in a bounding box (llx, lly, urx, ury) and time the query
%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))


[000/111] ..................................................
[050/111] ..................................................
[100/111] ...........
Wall time: 38.4 s

In [5]:
# The cache has a separate directory for each query type, since permalinks are not unique across types.
# 'boring' was queried above, so list the XML files cached for the 'boring' type.
print('number of files: ', len(os.listdir(os.path.join(pydov.cache.cachedir, 'boring'))))
print('files present: ', os.listdir(os.path.join(pydov.cache.cachedir, 'boring')))


('number of files: ', 111)
('files present: ', ['1879-119364.xml.gz', '1879-121292.xml.gz', '1879-121293.xml.gz', '1879-121387.xml.gz', '1879-121401.xml.gz', '1879-121412.xml.gz', '1879-121424.xml.gz', '1879-122256.xml.gz', '1894-121258.xml.gz', '1894-122153.xml.gz', '1894-122154.xml.gz', '1894-122155.xml.gz', '1895-121232.xml.gz', '1895-121241.xml.gz', '1895-121242.xml.gz', '1895-121244.xml.gz', '1895-121247.xml.gz', '1895-121248.xml.gz', '1923-121199.xml.gz', '1923-121200.xml.gz', '1932-121315.xml.gz', '1936-122224.xml.gz', '1938-121359.xml.gz', '1938-121360.xml.gz', '1953-121327.xml.gz', '1953-121361.xml.gz', '1953-121362.xml.gz', '1969-033206.xml.gz', '1969-033207.xml.gz', '1969-033208.xml.gz', '1969-033209.xml.gz', '1969-033211.xml.gz', '1969-033212.xml.gz', '1969-033213.xml.gz', '1969-033214.xml.gz', '1969-033215.xml.gz', '1969-033216.xml.gz', '1969-033217.xml.gz', '1969-033218.xml.gz', '1969-033219.xml.gz', '1969-033220.xml.gz', '1969-092685.xml.gz', '1969-092686.xml.gz', '1969-092687.xml.gz', '1969-092688.xml.gz', '1969-092689.xml.gz', '1970-018757.xml.gz', '1970-018762.xml.gz', '1970-018763.xml.gz', '1970-061362.xml.gz', '1970-061363.xml.gz', '1970-061364.xml.gz', '1970-061365.xml.gz', '1970-061366.xml.gz', '1970-061442.xml.gz', '1970-061443.xml.gz', '1970-061444.xml.gz', '1970-061445.xml.gz', '1970-061446.xml.gz', '1970-061447.xml.gz', '1970-061450.xml.gz', '1970-061454.xml.gz', '1970-104897.xml.gz', '1970-104898.xml.gz', '1970-104899.xml.gz', '1970-104900.xml.gz', '1973-018152.xml.gz', '1973-060207.xml.gz', '1973-060208.xml.gz', '1973-081811.xml.gz', '1973-104723.xml.gz', '1973-104727.xml.gz', '1973-104728.xml.gz', '1974-010351.xml.gz', '1975-010345.xml.gz', '1976-014856.xml.gz', '1976-015297.xml.gz', '1976-015298.xml.gz', '1976-015779.xml.gz', '1976-015780.xml.gz', '1976-015781.xml.gz', '1976-015782.xml.gz', '1978-012352.xml.gz', '1978-121458.xml.gz', '1984-081833.xml.gz', '1984-081834.xml.gz', '1985-084552.xml.gz', '1986-005594.xml.gz', '1986-005596.xml.gz', 
'1986-005597.xml.gz', '1986-005598.xml.gz', '1986-059814.xml.gz', '1986-059815.xml.gz', '1986-059816.xml.gz', '1987-119382.xml.gz', '1996-021717.xml.gz', '1996-081802.xml.gz', '2017-148854.xml.gz', '2017-152011.xml.gz', '2017-153161.xml.gz', '2018-153957.xml.gz', '2018-154057.xml.gz', '2018-155266.xml.gz', '2018-155580.xml.gz', '2018-156632.xml.gz', '2018-156633.xml.gz', '2018-156634.xml.gz', '2018-157193.xml.gz', '2018-157294.xml.gz', '2018-157386.xml.gz', '2019-160294.xml.gz'])
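
Because the cache holds one subdirectory per query type, a per-type file count gives a quick overview of its contents. The following is a minimal sketch using only the standard library; `count_cached_files` is a hypothetical helper, not part of pydov, and it runs on a throwaway directory holding empty placeholder files named after two entries of the listing above.

```python
import os
import tempfile


def count_cached_files(cachedir):
    """Return {subdirectory name: number of entries} for a cache directory."""
    counts = {}
    for entry in sorted(os.listdir(cachedir)):
        path = os.path.join(cachedir, entry)
        if os.path.isdir(path):
            counts[entry] = len(os.listdir(path))
    return counts


# Throwaway directory mimicking the layout shown above (assumption: one
# subdirectory per query type, one .xml.gz file per cached record).
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, 'boring'))
for name in ('1879-119364.xml.gz', '1879-121292.xml.gz'):
    open(os.path.join(demo_dir, 'boring', name), 'wb').close()

demo_counts = count_cached_files(demo_dir)
print(demo_counts)
```

On a real cache, the same function could be called as `count_cached_files(pydov.cache.cachedir)`.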

Rerun the previous request and time it again:


In [6]:
%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))


[000/111] cccccccccccccccccccccccccccccccccccccccccccccccccc
[050/111] cccccccccccccccccccccccccccccccccccccccccccccccccc
[100/111] ccccccccccc
Wall time: 980 ms

The use of the cache decreased the runtime by a factor of about 40 in the current example (38.4 s versus 980 ms). The gain grows as more permalinks are queried, since downloading takes much longer than reading from local disk.
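
The idea behind this speed-up can be sketched with the standard library alone. This is not pydov's implementation: `get_cached` and `fake_download` are hypothetical names, and the sketch assumes one gzip-compressed file per record, as the `.xml.gz` listing above suggests. The slow fetch only happens when no cached file exists. It is written for Python 3.

```python
import gzip
import os
import tempfile


def get_cached(cachedir, datatype, key, fetch):
    """Return the text for `key`, calling `fetch(key)` only on a cache miss."""
    path = os.path.join(cachedir, datatype, key + '.xml.gz')
    if os.path.exists(path):
        # Cache hit: only a local disk read is needed.
        with gzip.open(path, 'rt', encoding='utf-8') as f:
            return f.read()
    # Cache miss: take the slow path, then store the result for next time.
    text = fetch(key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        f.write(text)
    return text


calls = []


def fake_download(key):
    """Stand-in for a slow network request; records every call."""
    calls.append(key)
    return '<boring id="%s"/>' % key


demo_cache = tempfile.mkdtemp()
first = get_cached(demo_cache, 'boring', '1879-119364', fake_download)
second = get_cached(demo_cache, 'boring', '1879-119364', fake_download)
print(len(calls))  # the second request never reaches fake_download
```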

Disabling the cache

You can (temporarily!) disable the caching mechanism. This disables both saving newly downloaded data in the cache and reusing data already in the cache. Caching stays disabled until a cache object is assigned to pydov.cache again. Existing data in the cache is not deleted.


In [7]:
# list number of files
print('number of files: ', len(os.listdir(os.path.join(cachedir, 'boring'))))


('number of files: ', 111)

In [8]:
# disable caching
cache_orig = pydov.cache
pydov.cache = None
# new query
df = boring.search(location=Within(Box(151000, 205930, 153000, 206000)))
print(df.head())


[000/002] ..
                                         pkey_boring     boornummer         x  \
0  https://www.dov.vlaanderen.be/data/boring/1895...   kb15d43w-B47  151600.0   
1  https://www.dov.vlaanderen.be/data/boring/1984...  kb15d43w-B403  151041.0   

          y  mv_mtaw  start_boring_mtaw   gemeente  diepte_boring_van  \
0  205998.0    15.00              15.00  Antwerpen                0.0   
1  205933.0    21.07              21.07  Antwerpen                0.0   

   diepte_boring_tot datum_aanvang                                uitvoerder  \
0                3.3    1895-01-04                                  onbekend   
1                7.0    1984-09-26  Universiteit Gent - Geologisch Instituut   

   boorgatmeting  diepte_methode_van  diepte_methode_tot   boormethode  
0          False                 0.0                 3.3      onbekend  
1          False                 0.0                 7.0  droge boring  

In [9]:
# list number of files
print('number of files: ', len(os.listdir(os.path.join(cachedir, 'boring'))))


('number of files: ', 111)

Hence, no new files were added to the cache while it was disabled.

Caching is disabled by setting pydov.cache to None. To enable caching again, assign a cache object to pydov.cache: either the original one, as in the next cell, or a newly instantiated one.


In [10]:
pydov.cache = cache_orig
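
If the cache should only be disabled for a few statements, the save, disable and restore steps above can be wrapped in a context manager so the original cache is restored even when a query raises. This is a sketch, not a pydov feature: `attribute_set_to` is a hypothetical helper, and a stand-in object replaces the pydov module so the example runs without pydov installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def attribute_set_to(obj, name, value):
    """Temporarily set obj.<name> to value, restoring the original on exit."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        setattr(obj, name, original)


# Stand-in for the pydov module (assumption for illustration only).
fake_pydov = SimpleNamespace(cache='original cache object')

with attribute_set_to(fake_pydov, 'cache', None):
    inside = fake_pydov.cache   # None: caching would be disabled here
after = fake_pydov.cache        # restored on leaving the block

print(inside, after)
```

With pydov itself, the equivalent call would presumably be `with attribute_set_to(pydov, 'cache', None):` around the query.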

Changing the location of cached data

By default, pydov stores the cache in a temporary directory provided by the user's operating system. On Windows, the cache is usually located in C:\Users\username\AppData\Local\Temp\pydov\

If you want the cached XML files to be saved in another location, you can define your own cache for the current runtime. Mind that this does not move previously saved data, and no lookup in the old data folder is performed after changing the location. Besides controlling the cache's location, this also allows using separate caches for different scripts or projects.


In [11]:
import pydov.util.caching

pydov.cache = pydov.util.caching.GzipTextFileCache(
    cachedir=r'C:\temp\pydov'
    )

In [12]:
cachedir = pydov.cache.cachedir
print(cachedir)


C:\temp\pydov

In [13]:
# for the sake of the example, change dir location back 
pydov.cache = cache_orig
cachedir = pydov.cache.cachedir
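
To keep separate caches per script or project, one option is to derive the cache directory from a project name. The sketch below only builds the path; `project_cachedir` and the `pydov-projects` folder name are hypothetical choices for illustration, not pydov conventions.

```python
import os
import tempfile


def project_cachedir(project, base=None):
    """Build a cache path for `project`; `base` defaults to the OS temp dir."""
    if base is None:
        base = tempfile.gettempdir()
    return os.path.join(base, 'pydov-projects', project)


path_a = project_cachedir('project_a')
path_b = project_cachedir('project_b')
print(path_a)
print(path_b)

# With pydov installed, the path would be passed on as in In [11]:
# pydov.cache = pydov.util.caching.GzipTextFileCache(cachedir=path_a)
```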

Changing the maximum age of cached data

If you work with rapidly changing data or want to control when cached data is renewed, you can do so by changing the maximum age of cached data to be considered valid for the current runtime. The maximum age is a datetime.timedelta, so you can express it in weeks, days, seconds or any other unit that timedelta accepts. If a cached version exists and is younger than the maximum age, it is used instead of renewing the data from the DOV services. If no cached version exists, or the cached version is older than the maximum age, the data is renewed and saved in the cache. Note that data older than the maximum age is not automatically deleted from the cache.


In [14]:
import pydov.util.caching
import datetime
pydov.cache = pydov.util.caching.GzipTextFileCache(
    max_age=datetime.timedelta(seconds=1)
    )
print(pydov.cache.max_age)


0:00:01

In [15]:
from time import ctime
# show the first cached file and its last modification time
boring_dir = os.path.join(cachedir, 'boring')
first_file = os.listdir(boring_dir)[0]
print(first_file)
ctime(os.path.getmtime(os.path.join(boring_dir, first_file)))


1879-119364.xml.gz
Out[15]:
'Wed Mar 06 14:36:24 2019'

In [16]:
# rerun previous query 
%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))


[000/111] ..................................................
[050/111] ..................................................
[100/111] ...........
Wall time: 35.7 s

In [17]:
# the modification time of the first cached file shows it has been renewed
boring_dir = os.path.join(cachedir, 'boring')
first_file = os.listdir(boring_dir)[0]
print(first_file)
ctime(os.path.getmtime(os.path.join(boring_dir, first_file)))


1879-119364.xml.gz
Out[17]:
'Wed Mar 06 14:38:20 2019'
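
The validity rule used in this section can be sketched as a small check. It assumes that the age of a cached file is judged by its modification time, which is the timestamp the cells above inspect; `is_valid` is a hypothetical helper and not pydov's own code.

```python
import datetime
import os
import tempfile
import time


def is_valid(path, max_age):
    """True if `path` exists and was modified less than `max_age` ago."""
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < max_age.total_seconds()


# Demonstration with a temporary file whose timestamp is set by hand.
fd, demo_file = tempfile.mkstemp(suffix='.xml.gz')
os.close(fd)

max_age = datetime.timedelta(minutes=1)
fresh = is_valid(demo_file, max_age)        # just created, so still valid

an_hour_ago = time.time() - 3600
os.utime(demo_file, (an_hour_ago, an_hour_ago))
stale = is_valid(demo_file, max_age)        # backdated beyond max_age

missing = is_valid(demo_file + '.absent', max_age)  # no cached version
os.remove(demo_file)

print(fresh, stale, missing)
```

Only in the first case would the cached copy be reused; in the other two the data would be downloaded again.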

Cleaning the cache

Since we use a temporary directory provided by the operating system, we rely on the operating system to clean the folder when it deems necessary.

To clean the cache yourself, removing all records older than the maximum age, call pydov.cache.clean():


In [18]:
from time import sleep

In [19]:
print('number of files before clean: ', len(os.listdir(os.path.join(cachedir, 'boring'))))
sleep(2)  # the maximum age was set to 1 second above, so all files are now too old
pydov.cache.clean()
print('number of files after clean: ', len(os.listdir(os.path.join(cachedir, 'boring'))))


('number of files before clean: ', 111)
('number of files after clean: ', 0)
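
What such a clean-up amounts to can be sketched with the standard library. This is an illustration under the assumption that age is measured by file modification time; `remove_older_than` is a hypothetical function, not the implementation behind `pydov.cache.clean()`.

```python
import datetime
import os
import tempfile
import time


def remove_older_than(cachedir, max_age):
    """Delete files under `cachedir` older than `max_age`; return the count."""
    cutoff = time.time() - max_age.total_seconds()
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(cachedir):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    return removed


# Demonstration: one backdated file and one fresh file in a throwaway cache.
demo_dir = tempfile.mkdtemp()
boring_dir = os.path.join(demo_dir, 'boring')
os.makedirs(boring_dir)
old_file = os.path.join(boring_dir, 'old.xml.gz')
new_file = os.path.join(boring_dir, 'new.xml.gz')
for path in (old_file, new_file):
    open(path, 'wb').close()
an_hour_ago = time.time() - 3600
os.utime(old_file, (an_hour_ago, an_hour_ago))

removed = remove_older_than(demo_dir, datetime.timedelta(minutes=1))
remaining = os.listdir(boring_dir)
print(removed, remaining)
```

Files younger than the maximum age are left in place, which is the difference from removing the cache entirely.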

Should you want to remove the pydov cache from code yourself, you can do so as illustrated below. Note that this will erase the entire cache, not only the records older than the maximum age:


In [20]:
pydov.cache.remove()
# check existence of the cache directory:
print(os.path.exists(os.path.join(cachedir, 'boring')))


False