feat: allow to discover runner statuses #1268

Merged
mumoshu merged 24 commits into actions:master from fgalind1:status-runners on Jul 10, 2022

Conversation

@fgalind1 (Contributor) commented Mar 23, 2022

This is a new optional feature that allows setting the .status.phase of a Runner to a more specific/accurate phase based on what the runner is doing. It leverages the recent runner feature actions/runner#1737, which allows setting hooks on job start and job end. The end goal is to make it easier for admins, or whoever manages runners via ARC, to see at a glance what each runner is doing, e.g. Created, Registering, Idle, Running, Idle, Running...

entrypoint.sh will update the runner status to Registering and then Idle as it goes through runner registration. The runner will set the status to Running (along with the run number and the repo using that runner at that moment) on job start via the runner hook, and set it back to Idle on job end.
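For illustration only (the field names and patch shape here are assumptions based on the description above, not the actual Runner CRD schema), the job-started hook could build a JSON merge-patch body for the Runner's status subresource like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runnerStatusPatch builds a JSON merge-patch body that a hook could
// send to the Runner's status subresource. The field names (phase,
// message) are illustrative, taken from the PR description, and may
// not match the actual Runner CRD schema.
func runnerStatusPatch(phase, message string) ([]byte, error) {
	return json.Marshal(map[string]any{
		"status": map[string]string{
			"phase":   phase,
			"message": message,
		},
	})
}

func main() {
	b, err := runnerStatusPatch("Running", "Run 1925397027 from XXXX")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// {"status":{"message":"Run 1925397027 from XXXX","phase":"Running"}}
}
```

The per-runner role described below is what authorizes the pod to send such a patch for its own Runner resource.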

In order to do so, the Runner pod must be able to update its own .status; thus, this PR makes sure that whenever we create a new runner, we also create a custom role, service account, and role binding that has access only to update the runner/status of that runner.

Everything needed to make this work (the RBAC objects and the runner status updates) can be optionally enabled with the chart option useCustomRunnersRBAC. It is disabled by default, so this should have no impact on the current behavior. At some point this may be defaulted to true, or the old behavior removed entirely, but it was implemented this way to keep it opt-in.

Example of how it looks:

NAME                     ENTERPRISE   ORGANIZATION   REPOSITORY                    LABELS         STATUS       MESSAGE                  AGE
test-runner-wdgpj-2s6k7                              XXXX/personal.fgalind1.test   ["fgalind1"]   Idle                                  11m
test-runner-wdgpj-9m6g8                              XXXX/personal.fgalind1.test   ["fgalind1"]   Registering                           11m
test-runner-wdgpj-gngp2                              XXXX/personal.fgalind1.test   ["fgalind1"]   Running      Run 1925397027 from XXXX 10m

@mumoshu mumoshu added the enhancement New feature or request label May 14, 2022

@mumoshu (Collaborator) left a comment

Sorry for the delayed review!
LGTM overall, but it would be great if you could confirm my comments 🙏

@mumoshu mumoshu added this to the v0.26.0 milestone May 14, 2022
@mumoshu mumoshu requested a review from toast-gear as a code owner May 15, 2022 23:57

@mumoshu (Collaborator) commented May 15, 2022

@fgalind1 I tried my best to rebase your awesome work onto our main branch. Would you mind testing it?

@fgalind1 (Contributor, Author) commented

> Sorry for the delayed review!
> LGTM overall, but it would be great if you could confirm my comments 🙏

> @fgalind1 I tried my best to rebase your awesome work onto our main branch. Would you mind testing it?

Thanks @mumoshu as always for the review - I've addressed the comments from you and @Fleshgrinder

I've tested this flow with the feature on and off and it's working as expected 👍

@mumoshu (Collaborator) commented May 17, 2022

@fgalind1 Thanks! We're almost there; the last thing in terms of controller code is the failing TestNewRunnerPodFromRunnerController. It seems you need to add the new env var to the testdata.

@toast-gear (Collaborator) commented Jun 22, 2022

I've updated the workflow to just use a hardcoded Docker Hub name; it's not a secret anyway. @fgalind1 pull and merge master again

@fgalind1 (Contributor, Author) commented

> I've updated the workflow to just use a hardcoded Docker Hub name; it's not a secret anyway. @fgalind1 pull and merge master again

Awesome - thanks @toast-gear - after merging master it's working again 👍

@fgalind1 (Contributor, Author) commented

> @fgalind1 Thanks for the reminder! We'll be merging this next week. We're currently in a code-freeze period, and therefore this week is going to be used for fixes (if any) to the recently released 0.24.0

@mumoshu could this change be processed? :) appreciate the help

@mumoshu (Collaborator) commented Jun 29, 2022

@fgalind1 Hey! This has always been on my mind. Sorry for the delay; I realize this is now 3 months old and I'm very eager to get it merged!

I reviewed this once again this morning and, unfortunately, it turned out that part of this enhancement conceptually conflicts with the new container mode support #1546.

The new container mode expects the user to pre-create a service account used by the runner pod, whereas this enhancement dynamically creates a service account for each runner pod (which lacks the permissions necessary for the new container mode). Please correct me if I'm saying something wrong.

I'm planning to modify this a bit so that (1) the dynamically created service account also gets any permissions required for the new container mode (so that this feature can be used alongside it), and/or (2) this feature can optionally use a pre-created service account, so that the user can grant either or both sets of permissions (for the new container mode and for this feature) to the pre-created service account.

I'm working hard to make it happen sooner, but it probably won't make it before 0.25.0, which I'm planning to cut this weekend if everything goes well. So this might get merged next week or the week after.

Hope it works. Thanks for your contribution and patience 🙏

@mumoshu (Collaborator) commented Jun 30, 2022

I did my best fixing conflicts in 96d7925. Hoping it works

metadata:
creationTimestamp: null
name: manager-role
name: {{ include "actions-runner-controller.managerRoleName" . }}

Collaborator:
This seems messed up. This is a resource loaded by ARC's kustomize config, which has no access to these Helm template expressions

Contributor (Author):

Good call, right. Fixed in 9c0c9ad

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
useRunnerStatusUpdateHookEphemeralRole: false

Collaborator:

I started to wonder if this can be improved after #1546.
One might want to choose whether ARC should create a service account for the runner pod, and separately enable additional features like the Kubernetes container mode and/or these status update hooks.
Considering all the possible variations of configs, perhaps something like this would be better:

runner:
  serviceAccount:
    create: true # This will let ARC create a service account for each runner pod
    #name: <specify the name of an existing SA. Makes sense only when `create: false`>
  statusUpdateHook:
    enabled: true #=> This enables the runner status update hook and adds the role and role binding to either the existing or the dynamically created service account
  # Uncomment if you want the Kubernetes container mode.
  # In case runner.serviceAccount.create=true, runner.statusUpdateHook.enabled=true, containerMode=kubernetes is specified,
  # ARC will create a runner service account for each runner pod and add all the permissions required for both the container mode and the status update hooks
  #containerMode: kubernetes

@fgalind1 (Contributor, Author) commented Jul 1, 2022

I believe these are indeed 2 different features, but the scope is also a bit different.
containerMode is something you want to add to your runners, while the status update hook I believe is a more horizontal feature (e.g. this PR adds a new column to the runner's table to show the running job, which applies to all runners). In other words, I believe it makes more sense for it to be an on/off feature on the controller, rather than a per-runner switch, while for containerMode I do believe it makes sense in the context of the runner. Thoughts?

@mumoshu (Collaborator) commented Jul 2, 2022

Thanks for confirming! You're absolutely correct.

In my previous comment, I made two mistakes.

First, I confused the runner service account vs workflow job pod's service account. This feature affects the former and the new container mode is related to the latter.

Second, my suggested values.yaml snippet looked like it contained settings for runners, not for the controller. Configuring runners is out of the scope of ARC's chart. It might make some sense if it were for default runner configs, as we already have some runner-defaults-related values like image.actionsRunnerRepositoryAndTag. This feature should indeed be a controller-level feature, as you pointed out. At the very least, we'd better not conflate the new container mode settings and this feature under a single umbrella key.

Collaborator:

That said, perhaps it would look ok if we just omitted the containerMode key from my previous example?

runner:
  serviceAccount:
    create: true # This will let ARC create a service account for each runner pod
  statusUpdateHook:
    enabled: true #=> This enables the runner status update hook. If serviceAccount.create=true, this also creates the role and binds it to the dynamically created service account

To let ARC use an existing runner pod service account, you'd say:

runner:
  serviceAccount:
    name: mysa # The name of an existing SA. Valid only when `create: false`
    # In this case you don't need to give ARC the permission to create SA at all, which might be preferable to some folks for security reasons.
  statusUpdateHook:
    enabled: true # This doesn't automatically create a role and rolebinding associated with the existing SA, similarly to the top-level `serviceAccount.name` which is used to specify the existing SA for the controller-manager pod.

Perhaps it might be a good idea to imply runner.serviceAccount.create=true when you set runner.statusUpdateHook.enabled=true:

runner:
  statusUpdateHook:
    enabled: true #=> This enables the runner status update hook. To do so, ARC automatically creates a runner pod service account and a role with the necessary permissions, and binds the role to the SA.

This way, it would look like a controller-wide on/off feature, while providing enough flexibility to let the user decide whether to bring their own SA for runner pods, or not.

WDYT?

Collaborator:

> First, I confused the runner service account vs workflow job pod's service account. This feature affects the former and the new container mode is related to the latter.

Sorry, this was again a complete misunderstanding on my part! The new container mode works by giving an existing SA to every runner pod so that it can create workflow job pods from within runner pods. Never mind this. But I think my suggestion in #1268 (comment) is still valid.
It's just that we won't need to provide the container-mode-related settings in the Helm chart, as it's a runner-level feature.


Collaborator:

I think my suggestion in this thread can be distilled into one sentence, considering the scope of this pull request: "Can we rename useRunnerStatusUpdateHookEphemeralRole to runner.statusUpdateHook.enabled, so that the values remain well-structured even if we later add the ability to let ARC use a pre-created SA as the default runner pod SA?"

In other words, even though I mentioned them in my previous examples, we don't need to add support for runner.serviceAccount.create or runner.serviceAccount.name in this PR. That might be out of its scope.

@fgalind1 (Contributor, Author) commented Jul 2, 2022

Sounds good to me - thanks for the detailed analysis :) I've done the rename to runner.statusUpdateHook.enabled in ec02bb7 (#1268)

@fgalind1 (Contributor, Author) commented Jul 1, 2022

> @fgalind1 Hey! This has always been on my mind. Sorry for the delay; I realize this is now 3 months old and I'm very eager to get it merged!
>
> I reviewed this once again this morning and, unfortunately, it turned out that part of this enhancement conceptually conflicts with the new container mode support #1546.
>
> The new container mode expects the user to pre-create a service account used by the runner pod, whereas this enhancement dynamically creates a service account for each runner pod (which lacks the permissions necessary for the new container mode). Please correct me if I'm saying something wrong.
>
> I'm planning to modify this a bit so that (1) the dynamically created service account also gets any permissions required for the new container mode (so that this feature can be used alongside it), and/or (2) this feature can optionally use a pre-created service account, so that the user can grant either or both sets of permissions (for the new container mode and for this feature) to the pre-created service account.
>
> I'm working hard to make it happen sooner, but it probably won't make it before 0.25.0, which I'm planning to cut this weekend if everything goes well. So this might get merged next week or the week after.
>
> Hope it works. Thanks for your contribution and patience 🙏

@mumoshu I've added option #1 here d3e1f96 - let me know what you think

That way it is consistent: the user can provide a custom service account (which may be used for the kubernetes mode or this feature), or, if one is not provided, we create one with the right permissions for kubernetes and/or statuses.

return ctrl.Result{}, err
}

needsServiceAccount := runner.Spec.ServiceAccountName != "" && (r.UseRunnerStatusUpdateHookEphemeralRole || runner.Spec.ContainerMode == "kubernetes")

@mumoshu (Collaborator) commented Jul 2, 2022

I think runner.Spec.ServiceAccountName != "" is an indication that the user wants ARC to use the specified existing service account, whereas this condition results in the opposite. Perhaps we'd better invert it so that it looks like:

Suggested change
needsServiceAccount := runner.Spec.ServiceAccountName != "" && (r.UseRunnerStatusUpdateHookEphemeralRole || runner.Spec.ContainerMode == "kubernetes")
needsServiceAccount := runner.Spec.ServiceAccountName == "" && (r.UseRunnerStatusUpdateHookEphemeralRole || runner.Spec.ContainerMode == "kubernetes")
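To make the inverted semantics concrete, here is a small standalone sketch of that predicate (names simplified from the controller fields discussed above; this is not the actual controller code):

```go
package main

import "fmt"

// needsServiceAccount reports whether ARC should create a dedicated
// service account for a runner pod: only when the user has NOT supplied
// their own (saName == ""), and at least one feature actually requires one.
func needsServiceAccount(saName string, statusUpdateHook bool, containerMode string) bool {
	return saName == "" && (statusUpdateHook || containerMode == "kubernetes")
}

func main() {
	fmt.Println(needsServiceAccount("", true, ""))            // true: hook enabled, no SA supplied
	fmt.Println(needsServiceAccount("my-sa", true, ""))       // false: user brought their own SA
	fmt.Println(needsServiceAccount("", false, "kubernetes")) // true: container mode needs one
	fmt.Println(needsServiceAccount("", false, ""))           // false: nothing needs one
}
```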

Contributor (Author):

Good catch - fixed in ec02bb7 (#1268)

}...)
}

if runner.Spec.ContainerMode == "kubernetes" {

Collaborator:

Awesome! Thanks for the addition ☺️

Collaborator:

Enabling this resulted in ARC being unable to create the role, due to an error like:

{"level":"error","ts":"2022-07-10T01:31:04Z","logger":"actions-runner-controller.runner","msg":"Retrying as failed to create Role org-runnerdeploy-ngmsw-pwd5r resource","runner":"default/org-runnerdeploy-ngmsw-pwd5r","error":"roles.rbac.authorization.k8s.io \"org-runnerdeploy-ngmsw-pwd5r\" is forbidden: user \"system:serviceaccount:actions-runner-system:actions-runner-controller\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:actions-runner-system\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"pods/exec\"], Verbs:[\"get\" \"create\"]}\n{APIGroups:[\"\"], Resources:[\"pods/log\"], Verbs:[\"get\" \"list\" \"watch\"]}\n{APIGroups:[\"\"], Resources:[\"secrets\"], Verbs:[\"create\" \"delete\"]}","stacktrace":"github.com/actions-runner-controller/actions-runner-controller/controllers.(*RunnerReconciler).processRunnerCreation\n\t/workspace/controllers/runner_controller.go:322\ngithub.com/actions-runner-controller/actions-runner-controller/controllers.(*RunnerReconciler).Reconcile\n\t/workspace/controllers/runner_controller.go:133\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234"}

As the error indicates, this is probably because ARC's controller-manager role lacks permissions for pods/exec and pods/log, and for create/delete on secrets. Without those permissions, Kubernetes considers ARC to be attempting a privilege escalation (from the ARC controller-manager's lesser permission set to the runner pod's greater permission set), which is forbidden by Kubernetes.
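The missing rules can be read straight off the error above. A sketch of what would need to be added to the controller-manager's role (where exactly this lands in the chart, and the role name, are assumptions) might look like:

```yaml
# Rules the controller-manager role would itself need to hold before
# Kubernetes allows it to grant them to runner pod roles (escalation
# prevention). Read from the error message above; placement is illustrative.
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["get", "create"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "delete"]
```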

Collaborator:

Probably we'd need another chart value to make this privilege escalation possible, and it should be orthogonal to statusUpdateHook 🤔

@mumoshu (Collaborator) commented Jul 7, 2022

I'm currently testing this on top of the current main branch and will hopefully merge it soon, as long as I don't see any major issues. Thanks for your patience!

APIGroups: []string{""},
Resources: []string{"secrets"},
Verbs: []string{"get", "list", "create", "delete"},
},

Collaborator:

While testing this manually, I realized that this lacks permissions for jobs, resulting in a workflow job failing like this:

[Screenshot: workflow job failure (CleanShot 2022-07-10 at 11 53 35@2x)]

@mumoshu (Collaborator) left a comment

After fixing two minor issues I found (https://github.com/actions-runner-controller/actions-runner-controller/pull/1268/files#r912319021 and https://github.com/actions-runner-controller/actions-runner-controller/pull/1268/files#r917331632), it worked with the "kubernetes" container mode, too 🎉 I'll submit a follow-up PR or two for those issues.

One last thing to note: the hook updates the runner status to e.g. Registering and Idle before/after registration, which is great. But I suspect it might race with the runner controller's logic that syncs the runner pod status to the runner resource:

https://github.com/actions-runner-controller/actions-runner-controller/blob/10b88bf0702ebbf8e0e2f45a5634bc1111d9b0ac/controllers/runner_controller.go#L138-L157

If the runner controller's resync period elapsed and it reconciled the runner resource immediately after the hook updated the runner status to Idle, the runner status might be "corrected" to Running (= pod status phase). So it can look a bit flappy if you watch the runner status phase transitions continuously (e.g. by running kubectl get runner -w and triggering a bunch of workflow runs). It might not be a big deal, but it might be worth a fix.

I'm confident this works almost perfectly, and we know what the remaining issues are and how to address them. That said, I consider this good to merge now!

Thanks again for your patience and contribution @fgalind1 😄

@fgalind1 (Contributor, Author) commented

> this works almost perfectly, and we know what the remaining issues are

Lovely! Thanks also for your help @mumoshu
