feat: allow to discover runner statuses #1268

Merged
mumoshu merged 24 commits into actions:master from fgalind1:status-runners on Jul 10, 2022

Conversation

@fgalind1 (Contributor) commented Mar 23, 2022

This is a new optional feature that allows setting the .status.phase of a Runner to a more specific/accurate phase based on what the runner is doing. It leverages the recent runner feature actions/runner#1737, which allows setting hooks on job start and job end. The end goal is to make it easier for admins, or whoever manages runners via ARC, to see at a glance what each runner is doing, e.g. Created, Registering, Idle, Running, Idle, Running...

entrypoint.sh will update the runner status to Registering and then Idle as it goes through runner registration. The runner will set the status to Running (along with the run number and the repo using that runner at that moment) on job start via the runner hook, and set it back to Idle on job end.
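For illustration only (the field names and patch shape here are assumptions based on the description above, not the actual Runner CRD schema), the job-started hook could build a JSON merge-patch body for the Runner's status subresource like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runnerStatusPatch builds a JSON merge-patch body that a hook could
// send to the Runner's status subresource. The field names (phase,
// message) are illustrative, taken from the PR description, and may
// not match the actual Runner CRD schema.
func runnerStatusPatch(phase, message string) ([]byte, error) {
	return json.Marshal(map[string]any{
		"status": map[string]string{
			"phase":   phase,
			"message": message,
		},
	})
}

func main() {
	b, err := runnerStatusPatch("Running", "Run 1925397027 from XXXX")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// {"status":{"message":"Run 1925397027 from XXXX","phase":"Running"}}
}
```

The per-runner role described below is what authorizes the pod to send such a patch for its own Runner resource.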

In order to do so, the Runner pod must be able to update its own .status; thus, this PR makes sure that whenever we create a new runner, we also create a custom role, service account, and role binding that has access only to update the runner/status of that runner.

Everything needed to make this work (the RBAC objects and the runner status updates) can be optionally enabled with the chart option useCustomRunnersRBAC. It is disabled by default, so this should have no impact on the current behavior. At some point this may be defaulted to true, or the old behavior removed entirely, but it was implemented this way to keep it opt-in.

Example of how it looks:

NAME                     ENTERPRISE   ORGANIZATION   REPOSITORY                    LABELS         STATUS       MESSAGE                  AGE
test-runner-wdgpj-2s6k7                              XXXX/personal.fgalind1.test   ["fgalind1"]   Idle                                  11m
test-runner-wdgpj-9m6g8                              XXXX/personal.fgalind1.test   ["fgalind1"]   Registering                           11m
test-runner-wdgpj-gngp2                              XXXX/personal.fgalind1.test   ["fgalind1"]   Running      Run 1925397027 from XXXX 10m

@mumoshu mumoshu added the enhancement New feature or request label May 14, 2022

@mumoshu (Collaborator) left a comment

Sorry for the delayed review!
LGTM overall, but it would be great if you could confirm my comments 🙏

@mumoshu mumoshu added this to the v0.26.0 milestone May 14, 2022
@mumoshu mumoshu requested a review from toast-gear as a code owner May 15, 2022 23:57

@mumoshu (Collaborator) commented May 15, 2022

@fgalind1 I tried my best to rebase your awesome work onto our main branch. Would you mind testing it?

@fgalind1 (Contributor, Author) commented

> Sorry for the delayed review!
> LGTM overall, but it would be great if you could confirm my comments 🙏

> @fgalind1 I tried my best to rebase your awesome work onto our main branch. Would you mind testing it?

Thanks @mumoshu as always for the review - I've addressed the comments from you and @Fleshgrinder

I've tested this flow with the feature on and off and it's working as expected 👍

@mumoshu (Collaborator) commented May 17, 2022

@fgalind1 Thanks! We're almost there; the last thing in terms of controller code is the failing TestNewRunnerPodFromRunnerController. It seems you need to add the new env var to the testdata.

@toast-gear (Collaborator) commented Jun 22, 2022

I've updated the workflow to just use a hardcoded Docker Hub name; it's not a secret anyway. @fgalind1 pull and merge master again

@fgalind1 (Contributor, Author) commented

> I've updated the workflow to just use a hardcoded Docker Hub name; it's not a secret anyway. @fgalind1 pull and merge master again

Awesome - thanks @toast-gear - after merging master it's working again 👍

@fgalind1 (Contributor, Author) commented

> @fgalind1 Thanks for the reminder! We'll be merging this next week. We're currently in a code-freeze period, and therefore this week is going to be used for fixes (if any) to the recently released 0.24.0

@mumoshu could this change be processed? :) appreciate the help

@mumoshu (Collaborator) commented Jun 29, 2022

@fgalind1 Hey! This has always been on my mind. Sorry for the delay; I realize this is now 3 months old and I'm very eager to get it merged!

I reviewed this once again this morning and, unfortunately, it turned out that part of this enhancement conceptually conflicts with the new container mode support #1546.

The new container mode expects the user to pre-create a service account used by the runner pod, whereas this enhancement dynamically creates a service account for each runner pod (which lacks the permissions necessary for the new container mode). Please correct me if I'm saying something wrong.

I'm planning to modify this a bit so that (1) the dynamically created service account also gets any permissions required for the new container mode (so that this feature can be used alongside it), and/or (2) this feature can optionally use a pre-created service account, so that the user can grant either or both sets of permissions (for the new container mode and for this feature) to the pre-created service account.

I'm working hard to make it happen sooner, but it probably won't make it before 0.25.0, which I'm planning to cut this weekend if everything goes well. So this might get merged next week or the week after.

Hope it works. Thanks for your contribution and patience 🙏

@mumoshu (Collaborator) commented Jun 30, 2022

I did my best fixing conflicts in 96d7925. Hoping it works

metadata:
creationTimestamp: null
name: manager-role
name: {{ include "actions-runner-controller.managerRoleName" . }}

Collaborator:
This seems messed up. This is a resource loaded by ARC's kustomize config, which has no access to these Helm template expressions

Contributor (Author):

Good call, right. Fixed in 9c0c9ad

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
useRunnerStatusUpdateHookEphemeralRole: false

Collaborator:

I started to wonder if this can be improved after #1546.
One might want to choose whether ARC should create a service account for the runner pod, and separately enable additional features like the Kubernetes container mode and/or these status update hooks.
Considering all the possible variations of configs, perhaps something like this would be better:

runner:
  serviceAccount:
    create: true # This will let ARC create a service account for each runner pod
    #name: <specify the name of an existing SA. Makes sense only when `create: false`>
  statusUpdateHook:
    enabled: true #=> This enables the runner status update hook and adds the role and role binding to either the existing or the dynamically created service account
  # Uncomment if you want the Kubernetes container mode.
  # In case runner.serviceAccount.create=true, runner.statusUpdateHook.enabled=true, containerMode=kubernetes is specified,
  # ARC will create a runner service account for each runner pod and add all the permissions required for both the container mode and the status update hooks
  #containerMode: kubernetes

@fgalind1 (Contributor, Author) commented Jul 1, 2022

I believe these are indeed 2 different features, but the scope is also a bit different.
containerMode is something you want to add to your runners, while the status update hook I believe is a more horizontal feature (e.g. this PR adds a new column to the runner's table to show the running job, which applies to all runners). In other words, I believe it makes more sense for it to be an on/off feature on the controller, rather than a per-runner switch, while for containerMode I do believe it makes sense in the context of the runner. Thoughts?

@mumoshu (Collaborator) commented Jul 2, 2022

Thanks for confirming! You're absolutely correct.

In my previous comment, I made two mistakes.

First, I confused the runner service account vs workflow job pod's service account. This feature affects the former and the new container mode is related to the latter.

Second, my suggested values.yaml snippet looked like it contained settings for runners, not for the controller. Configuring runners is out of the scope of ARC's chart. It might make some sense if it were for default runner configs, as we already have some runner-defaults-related values like image.actionsRunnerRepositoryAndTag. This feature should indeed be a controller-level feature, as you pointed out. At the very least, we'd better not conflate the new container mode settings and this feature under a single umbrella key.

Collaborator:

That said, perhaps it would look ok if we just omitted the containerMode key from my previous example?

runner:
  serviceAccount:
    create: true # This will let ARC create a service account for each runner pod
  statusUpdateHook:
    enabled: true #=> This enables the runner status update hook. If serviceAccount.create=true, this also creates the role and binds it to the dynamically created service account

To let ARC use an existing runner pod service account, you'd say:

runner:
  serviceAccount:
    name: mysa # The name of an existing SA. Valid only when `create: false`
    # In this case you don't need to give ARC the permission to create SA at all, which might be preferable to some folks for security reasons.
  statusUpdateHook:
    enabled: true # This doesn't automatically create a role and rolebinding associated with the existing SA, similarly to the top-level `serviceAccount.name` which is used to specify the existing SA for the controller-manager pod.

Perhaps it might be a good idea to imply runner.serviceAccount.create=true when you set runner.statusUpdateHook.enabled=true:

runner:
  statusUpdateHook:
    enabled: true #=> This enables the runner status update hook. To do so, ARC automatically creates a runner pod service account and a role with the necessary permissions, and binds the role to the SA.

This way, it would look like a controller-wide on/off feature, while providing enough flexibility to let the user decide whether to bring their own SA for runner pods, or not.

WDYT?

Collaborator:

> First, I confused the runner service account vs workflow job pod's service account. This feature affects the former and the new container mode is related to the latter.

Sorry, this was again a complete misunderstanding on my part! The new container mode works by giving an existing SA to every runner pod so that it can create workflow job pods from within runner pods. Never mind this. But I think my suggestion in #1268 (comment) is still valid.
It's just that we won't need to provide the container-mode-related settings in the Helm chart, as it's a runner-level feature.


Collaborator:

I think my suggestion in this thread can be distilled into one sentence, considering the scope of this pull request: "Can we rename useRunnerStatusUpdateHookEphemeralRole to runner.statusUpdateHook.enabled, so that the values remain well-structured even if we later add the ability to let ARC use a pre-created SA as the default runner pod SA?"

In other words, even though I mentioned them in my previous examples, we don't need to add support for runner.serviceAccount.create or runner.serviceAccount.name in this PR. That might be out of its scope.

@fgalind1 (Contributor, Author) commented Jul 2, 2022

Sounds good to me - thanks for the detailed analysis :) I've done the rename to runner.statusUpdateHook.enabled in ec02bb7 (#1268)

@fgalind1 (Contributor, Author) commented Jul 1, 2022

> @fgalind1 Hey! This has always been on my mind. Sorry for the delay; I realize this is now 3 months old and I'm very eager to get it merged!
>
> I reviewed this once again this morning and, unfortunately, it turned out that part of this enhancement conceptually conflicts with the new container mode support #1546.
>
> The new container mode expects the user to pre-create a service account used by the runner pod, whereas this enhancement dynamically creates a service account for each runner pod (which lacks the permissions necessary for the new container mode). Please correct me if I'm saying something wrong.
>
> I'm planning to modify this a bit so that (1) the dynamically created service account also gets any permissions required for the new container mode (so that this feature can be used alongside it), and/or (2) this feature can optionally use a pre-created service account, so that the user can grant either or both sets of permissions (for the new container mode and for this feature) to the pre-created service account.
>
> I'm working hard to make it happen sooner, but it probably won't make it before 0.25.0, which I'm planning to cut this weekend if everything goes well. So this might get merged next week or the week after.
>
> Hope it works. Thanks for your contribution and patience 🙏

@mumoshu I've added option #1 here d3e1f96 - let me know what you think

That way it is consistent: the user can provide a custom service account (which may be used for the kubernetes mode or this feature), or, if one is not provided, we create one with the right permissions for kubernetes and/or statuses.

return ctrl.Result{}, err
}

needsServiceAccount := runner.Spec.ServiceAccountName != "" && (r.UseRunnerStatusUpdateHookEphemeralRole || runner.Spec.ContainerMode == "kubernetes")

@mumoshu (Collaborator) commented Jul 2, 2022

I think runner.Spec.ServiceAccountName != "" is an indication that the user wants ARC to use the specified existing service account, whereas this condition results in the opposite. Perhaps we'd better invert it so that it looks like:

Suggested change
needsServiceAccount := runner.Spec.ServiceAccountName != "" && (r.UseRunnerStatusUpdateHookEphemeralRole || runner.Spec.ContainerMode == "kubernetes")
needsServiceAccount := runner.Spec.ServiceAccountName == "" && (r.UseRunnerStatusUpdateHookEphemeralRole || runner.Spec.ContainerMode == "kubernetes")
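To make the inverted semantics concrete, here is a small standalone sketch of that predicate (names simplified from the controller fields discussed above; this is not the actual controller code):

```go
package main

import "fmt"

// needsServiceAccount reports whether ARC should create a dedicated
// service account for a runner pod: only when the user has NOT supplied
// their own (saName == ""), and at least one feature actually requires one.
func needsServiceAccount(saName string, statusUpdateHook bool, containerMode string) bool {
	return saName == "" && (statusUpdateHook || containerMode == "kubernetes")
}

func main() {
	fmt.Println(needsServiceAccount("", true, ""))            // true: hook enabled, no SA supplied
	fmt.Println(needsServiceAccount("my-sa", true, ""))       // false: user brought their own SA
	fmt.Println(needsServiceAccount("", false, "kubernetes")) // true: container mode needs one
	fmt.Println(needsServiceAccount("", false, ""))           // false: nothing needs one
}
```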

Contributor (Author):

Good catch - fixed in ec02bb7 (#1268)

}...)
}

if runner.Spec.ContainerMode == "kubernetes" {

Collaborator:

Awesome! Thanks for the addition ☺️

Collaborator:

Enabling this resulted in ARC being unable to create the role, due to an error like:

{"level":"error","ts":"2022-07-10T01:31:04Z","logger":"actions-runner-controller.runner","msg":"Retrying as failed to create Role org-runnerdeploy-ngmsw-pwd5r resource","runner":"default/org-runnerdeploy-ngmsw-pwd5r","error":"roles.rbac.authorization.k8s.io \"org-runnerdeploy-ngmsw-pwd5r\" is forbidden: user \"system:serviceaccount:actions-runner-system:actions-runner-controller\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:actions-runner-system\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"pods/exec\"], Verbs:[\"get\" \"create\"]}\n{APIGroups:[\"\"], Resources:[\"pods/log\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"secrets\"], Verbs:[\"create\" \"delete\"]}","stacktrace":"github.com/actions-runner-controller/actions-runner-controller/controllers.(*RunnerReconciler).processRunnerCreation\n\t/workspace/controllers/runner_controller.go:322\ngithub.com/actions-runner-controller/actions-runner-controller/controllers.(*RunnerReconciler).Reconcile\n\t/workspace/controllers/runner_controller.go:133\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234"}

As the error indicates, this is probably because ARC's controller-manager role lacks permissions for pods/exec and pods/log, and for create/delete on secrets. Without those permissions, Kubernetes considers ARC to be attempting a privilege escalation (from the ARC controller-manager's lesser permission set to the runner pod's greater permission set), which is forbidden by Kubernetes.
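The missing rules can be read straight off the error above. A sketch of what would need to be added to the controller-manager's role (where exactly this lands in the chart, and the role name, are assumptions) might look like:

```yaml
# Rules the controller-manager role would itself need to hold before
# Kubernetes allows it to grant them to runner pod roles (escalation
# prevention). Read from the error message above; placement is illustrative.
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["get", "create"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "delete"]
```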

Collaborator:

Probably we'd need another chart value to make this privilege escalation possible, and it should be orthogonal to statusUpdateHook 🤔

@mumoshu (Collaborator) commented Jul 7, 2022

I'm currently testing this on top of the current main branch and will hopefully merge it soon, as long as I don't see any major issues. Thanks for your patience!

APIGroups: []string{""},
Resources: []string{"secrets"},
Verbs: []string{"get", "list", "create", "delete"},
},

Collaborator:

While testing this manually, I realized that this lacks permissions for jobs, resulting in a workflow job failing like this:

[Screenshot: workflow job failure (CleanShot 2022-07-10 at 11 53 35@2x)]

@mumoshu (Collaborator) left a comment

After fixing two minor issues I found (https://github.com/actions-runner-controller/actions-runner-controller/pull/1268/files#r912319021 and https://github.com/actions-runner-controller/actions-runner-controller/pull/1268/files#r917331632), it worked with the "kubernetes" container mode, too 🎉 I'll submit a follow-up PR or two for those issues.

One last thing to note: the hook updates the runner status to e.g. Registering and Idle before/after registration, which is great. But I suspect it might race with the runner controller's logic that syncs the runner pod status to the runner resource:

https://github.com/actions-runner-controller/actions-runner-controller/blob/10b88bf0702ebbf8e0e2f45a5634bc1111d9b0ac/controllers/runner_controller.go#L138-L157

If the runner controller's resync period elapsed and it reconciled the runner resource immediately after the hook updated the runner status to Idle, the runner status might be "corrected" to Running (= pod status phase). So it can look a bit flappy if you watch the runner status phase transitions continuously (e.g. by running kubectl get runner -w and triggering a bunch of workflow runs). It might not be a big deal, but it might be worth a fix.

I'm confident this works almost perfectly, and we know what the remaining issues are and how to address them. That said, I consider this good to merge now!

Thanks again for your patience and contribution @fgalind1 😄

@fgalind1 (Contributor, Author) commented

> this works almost perfectly, and we know what the remaining issues are

Lovely! Thanks also for your help @mumoshu
