To speed up subsequent queries involving similar data, pydov caches raw DOV XML data locally for later reuse. For regular use of the package, the cache is a convenience that shortens the time needed for repeated queries. If you want to change the configuration or the way the cache is handled, this notebook illustrates some common use cases.
In [1]:
# suppress warnings and import pydov
import warnings; warnings.simplefilter('ignore')
import pydov
In [2]:
from pydov.search.boring import BoringSearch
boring = BoringSearch()
The pydov.cache.cachedir attribute defines the directory on the file system used to cache DOV files:
In [3]:
# check the cache dir
import os
import pydov.util.caching
cachedir = pydov.cache.cachedir
print(cachedir)
print('directories: ', os.listdir(cachedir))
To illustrate the convenience of the caching during subsequent data requests, consider the following request, while measuring the time:
In [4]:
from pydov.util.location import Within, Box
# Get all borehole data in a bounding box (llx, lly, urx, ury) and time it
%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))
In [5]:
# The structure of cachedir implies a separate directory for each query type, since permalinks are not unique across types
# In this example 'boring' will be queried, therefore list xmls in the cache of the 'boring' type
# list files present
print('number of files: ', len(os.listdir(os.path.join(pydov.cache.cachedir, 'boring'))))
print('files present: ', os.listdir(os.path.join(pydov.cache.cachedir, 'boring')))
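The listing above can be extended to every subdirectory to get a quick overview of what the cache holds. The sketch below uses only the standard library. `summarize_cache` is a hypothetical helper, not part of pydov, and it assumes the one-subdirectory-per-type layout described in the comments above.

```python
import os


def summarize_cache(cachedir):
    """Return {subdirectory: (file_count, total_bytes)} for a cache directory.

    Hypothetical helper, not part of pydov; assumes one subdirectory per
    data type (for example 'boring'), as in the listing above.
    """
    summary = {}
    if not os.path.isdir(cachedir):
        return summary
    for name in sorted(os.listdir(cachedir)):
        subdir = os.path.join(cachedir, name)
        if not os.path.isdir(subdir):
            continue  # skip stray files at the top level
        files = [os.path.join(subdir, f) for f in os.listdir(subdir)]
        files = [f for f in files if os.path.isfile(f)]
        summary[name] = (len(files), sum(os.path.getsize(f) for f in files))
    return summary
```

In this notebook it could be called as `summarize_cache(pydov.cache.cachedir)`.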
Rerun the previous request and time it again:
In [6]:
%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))
In the current example, using the cache reduced the runtime by roughly a factor of 100. The gain grows as more permalinks are queried, since downloading the data takes much longer than reading it from disk.
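The reason for the speed-up can be illustrated with a minimal read-through cache: data is fetched only on a miss and read from disk afterwards. This is a simplified sketch, not pydov's implementation. `cached_fetch`, `fake_download` and the key `'example-key'` are all hypothetical names chosen for illustration.

```python
import gzip
import os
import tempfile


def cached_fetch(key, fetch, cachedir):
    """Return the text for `key`, calling `fetch(key)` only on a cache miss.

    Illustrative only: pydov's own cache is more elaborate than this.
    """
    os.makedirs(cachedir, exist_ok=True)
    path = os.path.join(cachedir, key + '.xml.gz')
    if os.path.exists(path):
        # Cache hit: read the compressed copy from disk.
        with gzip.open(path, 'rt', encoding='utf-8') as fh:
            return fh.read()
    # Cache miss: fetch the data, then store it for next time.
    text = fetch(key)
    with gzip.open(path, 'wt', encoding='utf-8') as fh:
        fh.write(text)
    return text


calls = []


def fake_download(key):
    """Stand-in for a slow network request; records every call."""
    calls.append(key)
    return '<boring id="{}"/>'.format(key)


with tempfile.TemporaryDirectory() as demo_dir:
    first = cached_fetch('example-key', fake_download, demo_dir)
    second = cached_fetch('example-key', fake_download, demo_dir)

print(first == second, len(calls))  # prints: True 1
```

The second call returns the same text without invoking the download function again, which is where the time is saved.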
You can (temporarily!) disable the caching mechanism. This disables both saving newly downloaded data in the cache and reusing data already in the cache. Caching stays disabled until a cache object is assigned to pydov.cache again. Disabling the cache does not delete existing data in it.
In [7]:
# list number of files
print('number of files: ', len(os.listdir(os.path.join(cachedir, 'boring'))))
In [8]:
# disable caching
cache_orig = pydov.cache
pydov.cache = None
# new query
df = boring.search(location=Within(Box(151000, 205930, 153000, 206000)))
print(df.head())
In [9]:
# list number of files
print('number of files: ', len(os.listdir(os.path.join(cachedir, 'boring'))))
Hence, no new files were added to the cache when disabling it.
Caching is disabled by setting pydov.cache to None. To enable caching again, assign a cache object to pydov.cache: either restore the original object saved earlier, as done here, or instantiate a new one.
In [10]:
pydov.cache = cache_orig
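The save, disable and restore steps above can be wrapped in a context manager so that the cache is restored even if a query raises an error. This is a sketch under the assumption, taken from the cells above, that setting the `cache` attribute to None disables caching. `caching_disabled` is a hypothetical helper, not part of pydov, and a stand-in object replaces the pydov module so the example runs on its own.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def caching_disabled(module):
    """Temporarily set `module.cache` to None and restore it afterwards.

    Hypothetical helper, not part of pydov.
    """
    original = module.cache
    module.cache = None
    try:
        yield
    finally:
        # Restore the original cache object even if the body raises.
        module.cache = original


# Stand-in for the pydov module so the sketch runs without pydov installed.
fake_pydov = SimpleNamespace(cache='original cache object')

with caching_disabled(fake_pydov):
    inside = fake_pydov.cache  # None while the block is active
after = fake_pydov.cache  # the original object again
```

With the real package, the same pattern would read `with caching_disabled(pydov): df = boring.search(...)`.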
By default, pydov stores the cache in a temporary directory provided by the user's operating system. On Windows, the cache is usually located in: C:\Users\username\AppData\Local\Temp\pydov\
If you want the cached XML files to be saved in another location, you can define your own cache for the current runtime. Note that this does not move previously saved data, and no lookup in the old directory is performed after changing the cache location.
Besides controlling the cache's location, this also allows using a separate cache for different scripts or projects.
In [11]:
import pydov.util.caching
pydov.cache = pydov.util.caching.GzipTextFileCache(
    cachedir=r'C:\temp\pydov'
)
In [12]:
cachedir = pydov.cache.cachedir
print(cachedir)
In [13]:
# for the sake of the example, change dir location back
pydov.cache = cache_orig
cachedir = pydov.cache.cachedir
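To give each script or project its own cache, one option is to derive the directory from a project name. The sketch below only builds the path. `project_cachedir` and the `pydov-<project>` naming are hypothetical choices for illustration, not a pydov convention.

```python
import os
import tempfile


def project_cachedir(project, base=None):
    """Build a cache directory path that is specific to one project.

    Hypothetical helper: `project` and the 'pydov-<project>' naming are
    illustrative choices, not a pydov convention.
    """
    if base is None:
        # Default to the temporary directory of the operating system.
        base = tempfile.gettempdir()
    return os.path.join(base, 'pydov-' + project)


print(project_cachedir('groundwater-study'))
```

The resulting path could then be passed as the `cachedir` argument of `GzipTextFileCache`, as in the cell above.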
If you work with rapidly changing data or want to control when cached data is renewed, you can change the maximum age for which cached data is considered valid during the current runtime. The maximum age is a datetime.timedelta, so you can express it in weeks, days or any other unit that timedelta accepts. If a cached version exists and is younger than the maximum age, it is used instead of renewing the data from the DOV services. If no cached version exists, or the cached version is older than the maximum age, the data is renewed and saved in the cache. Note that data older than the maximum age is not automatically deleted from the cache.
In [14]:
import pydov.util.caching
import datetime
pydov.cache = pydov.util.caching.GzipTextFileCache(
    max_age=datetime.timedelta(seconds=1)
)
print(pydov.cache.max_age)
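The validity rule described above can be sketched with the standard library by comparing a file's modification time with the maximum age. `is_still_valid` is a hypothetical helper that mirrors the rule; pydov's internal check may differ in detail. The last lines show a few ways to express a maximum age as a timedelta.

```python
import datetime
import os
import time


def is_still_valid(path, max_age, now=None):
    """Return True if `path` exists and was modified less than `max_age` ago.

    Hypothetical helper mirroring the rule described above; pydov's
    internal check may differ in detail.
    """
    if not os.path.exists(path):
        return False
    if now is None:
        now = time.time()
    age = datetime.timedelta(seconds=now - os.path.getmtime(path))
    return age < max_age


# timedelta accepts weeks, days, hours, minutes, seconds, ...
two_weeks = datetime.timedelta(weeks=2)
half_day = datetime.timedelta(hours=12)
print(two_weeks, half_day)
```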
In [15]:
from time import ctime
# show the first cached file and its last modification time
boring_dir = os.path.join(cachedir, 'boring')
first_file = os.listdir(boring_dir)[0]
print(first_file)
ctime(os.path.getmtime(os.path.join(boring_dir, first_file)))
In [16]:
# rerun previous query
%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))
In [17]:
from time import ctime
# show the first cached file and its (renewed) modification time
boring_dir = os.path.join(cachedir, 'boring')
first_file = os.listdir(boring_dir)[0]
print(first_file)
ctime(os.path.getmtime(os.path.join(boring_dir, first_file)))
Since we use a temporary directory provided by the operating system, we rely on the operating system to clean the folder when it deems necessary.
To clean the cache, removing all records older than the maximum age, call the clean() method of the cache:
In [18]:
from time import sleep
In [19]:
print('number of files before clean: ', len(os.listdir(os.path.join(cachedir, 'boring'))))
sleep(2)  # remember we set the maximum cache age to 1 second
pydov.cache.clean()
print('number of files after clean: ', len(os.listdir(os.path.join(cachedir, 'boring'))))
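As a rough picture of what such a clean-up involves, the sketch below deletes files whose modification time is older than a given maximum age. `remove_expired` is a hypothetical stand-in written with the standard library only; it is not pydov's implementation of `clean()`.

```python
import datetime
import os
import time


def remove_expired(directory, max_age, now=None):
    """Delete files in `directory` older than `max_age`; return their names.

    Hypothetical stand-in for what a cache clean-up does; it is not
    pydov's implementation.
    """
    if now is None:
        now = time.time()
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # leave subdirectories untouched
        age = datetime.timedelta(seconds=now - os.path.getmtime(path))
        if age > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```

Files younger than the maximum age are left in place, which matches the difference between cleaning the cache and removing it entirely.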
Should you want to remove the pydov cache from code yourself, you can do so as illustrated below. Note that this will erase the entire cache, not only the records older than the maximum age:
In [20]:
pydov.cache.remove()
# check existence of the cache directory:
print(os.path.exists(os.path.join(cachedir, 'boring')))