Clockwork's controller outputs telemetry about client requests, and about actions sent to workers.
Telemetry is output by the controller process.
The location of the telemetry files is configured by the CLOCKWORK_LOG_DIR environment variable, which should point to a directory where the controller can write its output request logs. Be aware that for long experiments, these files can be GB large.
If CLOCKWORK_LOG_DIR is not specified or if it is left blank, Clockwork's controller will write its telemetry files to /local. If /local does not exist on your machine or is not writable, the controller will not write any telemetry files.
CLOCKWORK_LOG_DIR must already exist; Clockwork will not create the directory for you.
Upon startup, the controller will print the location it is writing its request and action logs to.
Clockwork's controller outputs telemetry for requests and for actions.
Requests are individual inferences sent by clients to Clockwork. A request specifies a model and provides an input payload; it will receive an output payload or an error.
Actions are commands sent by Clockwork's controllers to its workers. There are several kinds of action: Load, Infer, and Evict are the main ones. There is no 1-to-1 correspondence from requests to actions; an Infer actions can potentially batch the input of multiple requests.
clockwork_request_log.tsv is a tab-separated file containing measurements for every client request received by Clockwork's controller. The first row contains column headers.
clockwork_action_log.tsv is a tab-separated file containing measurements for every action sent by clockwork's controller to workers. The first row contains column headers.
The following snippet is taken from a clockwork_request_log.tsv:
t request_id result user_id model_id slo_factor latency deadline deadline_met arrival_count departure_count is_coldstart
1598832443793706643 68 0 357 30 0 11428945 100000000 1 0 2 1
1598832443793770631 330 0 115 9 0 9725697 100000000 1 0 2 1
1598832443794996383 506 0 357 30 0 3124478 100000000 1 2 2 0
1598832443795068092 511 0 115 9 0 2827707 100000000 1 2 2 0
tthe arrival timestamp of the request at the controller in nanosecondsrequest_ida unique ID for this request, generated by the controllerresultthe outcome of the request.0indicates success;4indicates a timeout; all other values indicate an error. The other values correspond to error codes fromworker_api.huser_ida user ID supplied by the client.model_ida model ID supplied by the client.slo_factorthis is a "backdoor" API used for experimentation; it is mostly zero. The client can specify a positive value, which then acts as a scale factor for this request's SLO/deadline (relative to its batchsize-1 latency).latencythe request latency measured by the controller, from request arrival at the network layer, to request departure at the network layer.deadlinethe SLO/deadline actually used by this requestdeadline_metonly true ifresult=0andlatency<=deadlinearrival_countwhen the request arrived at the controller, the number of workers with this model loadeddeparture_countwhen the request departs the controller, the number of workers with this model loadedis_coldstartonly true ifarrival_count=0
The following snippet is taken from a clockwork_action_log.tsv:
t action_id action_type status worker_id gpu_id model_id batch_size expected_exec_duration worker_exec_duration expected_exec_complete worker_exec_complete expected_gpu_clock worker_gpu_clock_before worker_gpu_clock worker_copy_output_complete worker_action_received worker_result_sent controller_result_enqueue controller_action_duration goodput requests_queued copies_loaded
1598832443789472787 2 1 0 9 1 0 1 5637000 5669663 6604169 6830646 0 0 0 0 170122 6840551 7251604 7263700 5669663 2 0
1598832443789454455 0 1 0 8 0 0 1 5637000 5672895 6605935 6826713 0 0 0 0 410550 6836940 7244669 7276903 5672895 2 0
1598832443789437665 7 1 0 3 0 1 1 5637000 5677567 6594394 6818735 0 0 0 0 325569 6829798 7183274 7297271 5677567 1 0
1598832443789463641 4 1 0 6 0 0 1 5637000 5668928 6595782 6839861 0 0 0 0 237703 6850289 7228631 7251874 5668928 2 0
1
tthe time the action was sent by the controller to a workeraction_ida unique ID for this actionaction_type1for Load Weights;2for Infer;3for Evict Weights. Other action types can be found inworker_api.h.statusthe outcome of the action.0indicates success; all other values indicate an error. Error codes correspond to those listed inworker_api.h.worker_idthe worker this action is sent togpu_idthe GPU on ths worker this action is sent tomodel_idthe model this action is forbatch_sizefor infer actions onlyexpected_exec_durationthe worker-side action duration predicted by the controllerworker_exec_durationthe actual worker-side action duration measured by the workerexpected_exec_completethe controller's predicted time when the action will complete; this incorporates theexpected_exec_durationas well as any other outstanding actions. This value is relative tot.worker_exec_completethe actual worker-side action completion time. This value is relative tot.expected_gpu_clockthe GPU clock speed anticipated by the controller.worker_gpu_clock_beforethe GPU clock speed before the action begins executing.worker_gpu_clockthe GPU clock speed after the action finishes executing.worker_copy_output_completethe worker-side time when inference has completed and output has been copied back to host. This value is relative totand is adjusted to account for clock skew.worker_action_receivedthe worker-side time when the action finished receiving over the network. This value is relative totand is adjusted to account for clock skew.worker_result_sentthe worker-side time when the result began sending over the network. This value is relative totand is adjusted to account for clock skew.controller_result_enqueuethe controller-side time when the result was received by the worker.controller_action_durationthe end-to-end action duration as measured by the controllergoodputUsed for infer actions. A goodput of 1 indicates an infer action completed in time for a request's deadline. A goodput of 0 indicates the request timed out before the action completed. Fractional values are possible due to batched inputs.requests_queuedthe number of requests queued on the controller for this model, at the time the action was initiated.copies_loadedthe number of workers with this model already loaded into GPU memory, at the time the action was initiated.