`provenance` is a Python library for function-level caching and provenance that aids in creating parsimonious pythonic pipelines™. By wrapping functions in the `provenance` decorator, computed results are cached across various stores (disk, S3, SFTP) and provenance (i.e. lineage) information is tracked and stored in an artifact repository. A central artifact repository can be used to enable production pipelines, team collaboration, and reproducible results.
In practice, this means you can easily keep track of how artifacts (models, features, or any object or file) are created and where they are used, and have a central place to store and share these artifacts. This basic plumbing is required (or at least desired!) in any machine learning pipeline or project. `provenance` can be used standalone along with a build server to run pipelines, or in conjunction with more advanced workflow systems (e.g. Airflow, Luigi).
Before you can use `provenance` you need to configure it. The suggested way to do so is via the `load_config` function, which takes a dictionary of settings. (ProTip: for a team environment take a look at ymlconf, which merges configs, e.g. a shared config in a repo and a local one that isn't committed.)
We'll define our configuration map in YAML below. The config specifies a blobstore, where the cached values are stored, and an artifact repository, where the metadata is stored. In this example the cached values are stored on disk and the metadata is stored in a Postgres database.
In [1]:
%load_ext yamlmagic
In [2]:
%%yaml basic_config
blobstores:
  disk:
    type: disk
    cachedir: /tmp/provenance-intro-artifacts
    read: True
    write: True
    delete: True
artifact_repos:
  local:
    type: postgres
    db: postgresql://localhost/provenance-intro
    store: 'disk'
    read: True
    write: True
    delete: True
    # this option will create the database if it doesn't exist
    create_db: True
default_repo: local
In [3]:
import provenance as p
p.load_config(basic_config)
Out[3]:
Now let's define some decorated functions...
In [4]:
import time
@p.provenance()
def expensive_add(a, b):
    time.sleep(2)
    return a + b

@p.provenance()
def expensive_mult(a, b):
    time.sleep(2)
    return a * b
In [5]:
%%time
result = expensive_add(4, 3)
print(result)
As expected, we have a slow addition function. To see the effect of the caching we can repeat the same function invocation:
In [6]:
%%time
result = expensive_add(4, 3)
print(result)
If you have used any caching/memoization decorator or library (e.g. `joblib`) then this is old hat to you. What is different is how the provenance of this result is recorded and how it can be accessed. For example, with the same result we can visualize the associated lineage:
In [7]:
import provenance.vis as vis
vis.visualize_lineage(result)
Out[7]:
In the above visualization the artifact is outlined in red, with the artifact id on the left and the value (result) on the right. How is this possible, and what exactly is an artifact? Well, the result is not a raw 7 but rather an `ArtifactProxy` which wraps the 7.
In [8]:
result
Out[8]:
The hash in the parentheses is the `id` of the artifact, which is a function of the name of the function used and the inputs used to produce the value.
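The exact hashing scheme is internal to the library, but the idea can be sketched in a few lines: hash the function's name together with its inputs to get a deterministic id. (This is a simplified sketch; the real implementation also folds in the function's version and uses its own serialization rather than JSON.)

```python
import hashlib
import json

def artifact_id(fn_name, inputs):
    # Serialize the function name and inputs deterministically,
    # then hash the result; equal calls yield equal ids.
    payload = json.dumps({"fn": fn_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

# The same call always produces the same id...
assert artifact_id("expensive_add", {"a": 4, "b": 3}) == \
       artifact_id("expensive_add", {"a": 4, "b": 3})
# ...while different inputs produce a different id.
assert artifact_id("expensive_add", {"a": 4, "b": 3}) != \
       artifact_id("expensive_add", {"a": 1, "b": 1})
```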
All properties, methods, and operations called against this object will be proxied to the underlying value, 7 in this case. You can treat the result as you normally would:
In [9]:
result + 3
Out[9]:
The one exception to this is the `artifact` property, which returns the `Artifact`.
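A minimal sketch of the proxy idea may help. The real `ArtifactProxy` forwards far more than shown here, and the class names below are purely illustrative:

```python
class Artifact:
    """Holds an id plus the computed value (illustrative only)."""
    def __init__(self, id, value):
        self.id = id
        self.value = value

class ArtifactProxySketch:
    """Wraps a value so it behaves like the value, plus an .artifact handle."""
    def __init__(self, artifact):
        self.artifact = artifact

    def __add__(self, other):
        # Arithmetic is forwarded to the wrapped value.
        return self.artifact.value + other

    def __repr__(self):
        return "<{}> {}".format(self.artifact.id, self.artifact.value)

proxy = ArtifactProxySketch(Artifact("349ded4f", 7))
assert proxy + 3 == 10                  # behaves like the underlying 7
assert proxy.artifact.id == "349ded4f"  # but the artifact is still reachable
```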
In [10]:
artifact = result.artifact
artifact
Out[10]:
On this artifact is where you will find the provenance information for the value.
In [11]:
# What function was used to create the artifact?
artifact.fn_module, artifact.fn_name
Out[11]:
In [12]:
# What inputs were used?
artifact.inputs
Out[12]:
This information is what powered the visualization above.
Each artifact has additional information attached to it, such as the `id`, `value_id`, `value`, and `run_info`. `run_info` captures information about the environment when the artifact was created:
In [13]:
artifact.run_info
Out[13]:
You typically also want the `run_info` to include the git ref or build server job ID that was used to produce the artifact. This is easily done by using the provided `provenance.set_run_info_fn` hook. Please refer to the API documentation for details.
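As a standalone sketch of the pattern (the actual `set_run_info_fn` signature may differ; here we simply assume a hook that receives the base run-info dict and returns an augmented copy):

```python
def add_build_info(run_info):
    # Hypothetical hook: merge in the CI build ID and git ref.
    # In a real pipeline these would come from the build environment
    # (e.g. os.environ) rather than hard-coded values.
    extra = {"git_ref": "abc1234", "build_job_id": "build-42"}
    merged = dict(run_info)
    merged.update(extra)
    return merged

base = {"host": "worker-1", "created_at": "2017-01-01T00:00:00"}
enriched = add_build_info(base)
assert enriched["git_ref"] == "abc1234"
assert enriched["host"] == "worker-1"   # original info is preserved
```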
Aside from calling decorated functions and having cached artifacts returned, you can also explicitly load an artifact by its id. This becomes useful in a team setting since you can have a shared store on S3, e.g. "Check out this amazing model I just built; load artifact 349ded4f...!". The other main use case for this is in production settings. See the Machine Learning Pipeline guide for more details on that.
In [14]:
# get the id of the artifact above
artifact_id = artifact.id
loaded_artifact = p.load_artifact(artifact_id)
loaded_artifact
Out[14]:
You can inspect the `loaded_artifact` as done above or load the value into a proxy:
In [15]:
loaded_artifact.proxy()
Out[15]:
Alternatively, you can load the proxy directly with `load_proxy`. `load_artifact` is still useful when you want the provenance and other metadata about an artifact but not the actual value.
In [16]:
p.load_proxy(artifact_id)
Out[16]:
In any advanced workflow you will have a series of functions pipelined together. For true provenance you need to be able to trace all artifacts back to their sources. This is why the computed results of functions are returned as `ArtifactProxy`s: to enable the flow of artifacts to be tracked.
In [17]:
a1 = expensive_add(4, 3)
a2 = expensive_add(1, 1)
result = expensive_mult(a1, a2)
vis.visualize_lineage(result)
Out[17]:
Note how the final artifact can be traced all the way back to the original inputs! As mentioned above, this is enabled by passing the `ArtifactProxy` results, which means that to take advantage of this you must pass the proxies throughout your entire pipeline. Best practice is to have your outer functions take basic Python data structures and then pass all resulting complex objects (e.g. scikit-learn models) as `ArtifactProxy`s.
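The mechanics can be sketched with a toy decorator: each result remembers which tracked inputs fed into it, so the final artifact can be walked back to the leaves. (All names here are illustrative, not the library's internals.)

```python
import functools
import itertools

_counter = itertools.count()

class TrackedResult:
    """A toy stand-in for ArtifactProxy: a value plus its input lineage."""
    def __init__(self, fn_name, value, parents):
        self.id = "art-{}".format(next(_counter))
        self.fn_name = fn_name
        self.value = value
        self.parents = parents  # TrackedResults this one was computed from

def tracked(fn):
    @functools.wraps(fn)
    def wrapper(*args):
        # Unwrap tracked inputs, remembering them as parents.
        parents = [a for a in args if isinstance(a, TrackedResult)]
        raw = [a.value if isinstance(a, TrackedResult) else a for a in args]
        return TrackedResult(fn.__name__, fn(*raw), parents)
    return wrapper

@tracked
def add(a, b):
    return a + b

@tracked
def mult(a, b):
    return a * b

def lineage_ids(result):
    # Walk the parent links back to the original leaf artifacts.
    ids = {result.id}
    for p in result.parents:
        ids |= lineage_ids(p)
    return ids

a1 = add(4, 3)
a2 = add(1, 1)
final = mult(a1, a2)
assert final.value == 14
assert {a1.id, a2.id} <= lineage_ids(final)  # traceable to its sources
```

Note that if a plain `7` were passed instead of `a1`, the link back to `expensive_add` would be lost, which is exactly why proxies should flow through the whole pipeline.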
Under the hood all of the artifacts are being recorded to the artifact repository we configured above. A repository is comprised of a store that records provenance and metadata and a blobstore that stores the serialized artifact (`joblib` serialization is the default). While you will rarely need to access the blobstore files directly, querying the DB directly is useful and is a hallmark of the library.
In [18]:
repo = p.get_default_repo()
db = repo._db_engine
import pandas as pd
pd.read_sql("select * from artifacts", db)
Out[18]:
A few things to note here...
The `input_artifact_ids` column is an array of all of the artifact ids that were used in the input (even if nested in another data structure). This allows you to efficiently query for the progeny of a particular artifact.
The `inputs_json` and `custom_fields` columns are stored as JSONB in Postgres, allowing you to query the nested structure via SQL. For example, you could find all the addition artifacts that were created with a particular argument:
In [19]:
pd.read_sql("""select * from artifacts
               where fn_name = 'expensive_add'
                 and (inputs_json ->> 'a')::int IN (1, 4)
            """, db)
Out[19]:
In [25]:
# the blobs are written to files with the names matching the hash of the content
!ls -l /tmp/provenance-intro-artifacts | head
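The content-addressed layout can be sketched as follows: a blob's filename is simply the hash of its bytes, so identical content is stored once and the name verifies the content. (This is a common pattern; the library's actual naming and serialization details may differ.)

```python
import hashlib
import os
import tempfile

def put_blob(cachedir, data):
    # Name the file after the hash of its content: identical blobs
    # are written to the same path, and the name doubles as a checksum.
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(cachedir, name)
    with open(path, "wb") as f:
        f.write(data)
    return name

cachedir = tempfile.mkdtemp()
blob_id = put_blob(cachedir, b"serialized artifact bytes")
assert blob_id in os.listdir(cachedir)
# Writing the same content again reuses the same name.
assert put_blob(cachedir, b"serialized artifact bytes") == blob_id
```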
A potential pitfall of using `provenance` is getting stale cached results after updating your function. For example, let's say we update our function definition:
In [21]:
@p.provenance()
def expensive_add(a, b):
    time.sleep(2)
    return a + b + a + b + a
In [22]:
%%time
expensive_add(4, 3)
Out[22]:
Noting the quick return time and the incorrect result, we see that we got an old (stale) cached result from our initial implementation. Rather than trying to determine whether a function has been updated, as `joblib` does, the `provenance` library requires that you explicitly version your functions to force new results to be computed. The rationale is that it is quite hard (impossible in Python?) to tell when a function's definition, or that of a helper function which may be in another library, has semantically changed. For this reason the user of `provenance` must increment the version number of a function in the decorator:
In [23]:
# the default version is 0, so let's set it to 1
@p.provenance(version=1)
def expensive_add(a, b):
    time.sleep(2)
    return a + b + a + b + a
In [24]:
%%time
expensive_add(4, 3)
Out[24]:
While this may seem onerous, in practice it is not a big problem in a mature code base once people are aware of it. For rapidly changing functions you may want to use the `use_cache=False` option temporarily as you iterate on a function. See the docs on the decorator for more information.
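The version-busting mechanism can be sketched with a cache keyed on the function name, version, and arguments; bumping the version changes the key, so the stale entry is simply never looked up again (a simplification of what the library does):

```python
cache = {}

def cached_call(fn, version, *args):
    # The version participates in the cache key, so bumping it
    # forces a fresh computation instead of a stale hit.
    key = (fn.__name__, version, args)
    if key not in cache:
        cache[key] = fn(*args)
    return cache[key]

def expensive_add(a, b):
    return a + b

stale = cached_call(expensive_add, 0, 4, 3)   # cached under version 0

def expensive_add(a, b):  # updated definition
    return a + b + a + b + a

assert cached_call(expensive_add, 0, 4, 3) == stale  # version 0: stale hit
assert cached_call(expensive_add, 1, 4, 3) == 18     # version 1: recomputed
```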
You now know the basics of the `provenance` library and can start using it! Be aware that the `provenance` decorator takes a number of other options (such as `tags`) that can be quite helpful. Please refer to the API documentation for details.