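The Python bindings provide two scalable mechanisms for tracking jobs: poll-based tracking, in which the schedd is queried periodically with Schedd.xquery to get job status information, and event-based tracking, in which job state is reconstructed from the events recorded in each job's user log.
Both poll- and event-based tracking have their strengths and weaknesses; the intrepid user can even combine both methodologies to have extremely reliable, low-latency job status tracking.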
In this module, we outline the important design considerations behind each approach and walk through examples.
Poll-based tracking involves periodically querying the schedd(s) for jobs of interest. We have covered the technical aspects of querying the Schedd in prior tutorials. Besides the technical means of polling, the important aspects to consider are how often the poll should be performed and how much data should be retrieved.
Note: When Schedd.xquery is used, the query will cause the schedd to fork up to SCHEDD_QUERY_WORKERS simultaneous workers. Beyond that point, queries will be handled in a non-blocking manner inside the main condor_schedd process. Thus, the memory used by many concurrent queries can be reduced by decreasing SCHEDD_QUERY_WORKERS.
A job tracking system should not query the Schedd more than once a minute. Aim to minimize the data returned from the query through the use of the projection; minimize the number of jobs returned by using a query constraint. Better yet, use the AutoCluster flag to have Schedd.xquery return a list of job summaries instead of individual jobs.
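The advice above can be sketched as a small polling helper. This is only an illustration under stated assumptions: poll_once, poll_loop, JOB_STATUS_NAMES and MIN_POLL_INTERVAL are hypothetical names, the Owner constraint is just one possible filter, and the constraint and projection are passed to Schedd.xquery positionally, which is assumed to match the installed version of the bindings.

```python
import time
from collections import Counter

# Standard HTCondor JobStatus codes and their conventional names.
JOB_STATUS_NAMES = {
    1: "Idle", 2: "Running", 3: "Removed", 4: "Completed",
    5: "Held", 6: "Transferring Output", 7: "Suspended",
}

# Seconds between polls; the text advises at most one query per minute.
MIN_POLL_INTERVAL = 60


def poll_once(schedd, owner):
    """Count the jobs of `owner` by status, fetching only what is needed."""
    constraint = 'Owner == "%s"' % owner               # limits which jobs are returned
    projection = ["ClusterId", "ProcId", "JobStatus"]  # limits attributes per job
    counts = Counter()
    # Assumption: xquery accepts (constraint, projection) positionally.
    for ad in schedd.xquery(constraint, projection):
        counts[JOB_STATUS_NAMES.get(ad.get("JobStatus"), "Other")] += 1
    return counts


def poll_loop(schedd, owner, cycles, interval=MIN_POLL_INTERVAL, sleep=time.sleep):
    """Hypothetical helper: poll `cycles` times, never faster than once a minute."""
    interval = max(interval, MIN_POLL_INTERVAL)
    snapshots = []
    for i in range(cycles):
        snapshots.append(poll_once(schedd, owner))
        if i + 1 < cycles:
            sleep(interval)
    return snapshots

# Usage (requires the htcondor module and a reachable schedd):
#   import htcondor
#   print(poll_once(htcondor.Schedd(), "alice"))
```

Passing the schedd in as an argument keeps the helper independent of how the Schedd object is located. To request job summaries instead of individual jobs, the same call would presumably take the AutoCluster flag through the opts argument (htcondor.QueryOpts.AutoCluster); check the exact spelling against the installed bindings.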
Advantages:
A single tracker can poll multiple condor_schedd instances in a pool; using htcondor.poll, multiple Schedds can be queried simultaneously.
Disadvantages:
Each job in the Schedd can specify the UserLog attribute; the Schedd will atomically append a machine-parseable event to the specified file for every state transition the job goes through. By keeping track of the events in the logs, we can build an in-memory representation of the job queue state.
Advantages:
No interaction with the condor_schedd process is needed to read the event logs; the job tracking effectively places no burden on the Schedd.
Disadvantages:
Only jobs on the local condor_schedd can be tracked; there is no mechanism to receive the event log remotely.
If an event is missed (for example, if something causes the condor_schedd to fail to write the event), then the job tracker may incorrectly believe a job is stuck in the wrong state.
At a technical level, event tracking is implemented with the htcondor.JobEventLog class.
>>> import htcondor
>>> jel = htcondor.JobEventLog("/tmp/job_one.log")
>>> for event in jel.events(stop_after=0):
...     print(event)
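The return value of JobEventLog.events() is an iterator over htcondor.JobEvent objects. The example above does not block: with stop_after=0, the iterator yields the events already in the log and then stops instead of waiting for new ones.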
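To illustrate how such events could be folded into an in-memory picture of the queue, here is a minimal sketch. The helper apply_events and the table STATE_AFTER_EVENT are hypothetical, the state labels are arbitrary, and the code assumes that each event exposes type, cluster and proc attributes and that event.type.name yields names such as SUBMIT or JOB_TERMINATED, as in htcondor.JobEventType.

```python
# Coarse job state implied by each event type name.
# Assumption: the names match members of htcondor.JobEventType.
STATE_AFTER_EVENT = {
    "SUBMIT": "idle",
    "EXECUTE": "running",
    "JOB_EVICTED": "idle",
    "JOB_HELD": "held",
    "JOB_RELEASED": "idle",
    "JOB_TERMINATED": "completed",
    "JOB_ABORTED": "removed",
}


def apply_events(state, events):
    """Update `state`, a dict keyed by (cluster, proc), from an iterable of events.

    Event types that do not imply a state change (for example image size
    updates) are ignored, so the last state-changing event wins.
    """
    for event in events:
        new_state = STATE_AFTER_EVENT.get(event.type.name)
        if new_state is not None:
            state[(event.cluster, event.proc)] = new_state
    return state

# Usage (requires the htcondor module and an existing job log):
#   import htcondor
#   jel = htcondor.JobEventLog("/tmp/job_one.log")
#   queue = apply_events({}, jel.events(stop_after=0))
```

Because the function only mutates the dictionary it is given, it can be called repeatedly on the same JobEventLog to pick up events written since the previous call.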