OCPBUGS-8277: Use a different internal IP for apiserver connectivity #1478

Merged
openshift-merge-robot merged 2 commits into openshift:main from pacevedom:OCPBUGS-8277 on Mar 13, 2023

Conversation

@pacevedom
Contributor

@pacevedom pacevedom commented Mar 9, 2023

Which issue(s) this PR addresses:

Closes #1460

@pacevedom pacevedom changed the title from "Ocpbugs 8277" to "OCPBUGS-8277: Use a different internal IP for apiserver connectivity" Mar 9, 2023
@openshift-ci-robot openshift-ci-robot added labels Mar 9, 2023:
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
@openshift-ci-robot

@pacevedom: This pull request references Jira Issue OCPBUGS-8277, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.13.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Which issue(s) this PR addresses:

Closes #

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from oglok and sallyom March 9, 2023 14:05
@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Mar 9, 2023
@pacevedom
Contributor Author

/jira refresh

@openshift-ci-robot

@pacevedom: An error was encountered updating to the POST state for bug OCPBUGS-8277 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message. Error marking step #27447694 finished: root cause: Tried to update an entity that does not exist.: request failed. Please analyze the request body for more details. Status code: 400:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

/jira refresh

@openshift-ci-robot openshift-ci-robot changed labels Mar 9, 2023:
  • added jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • added bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • removed jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
@openshift-ci-robot

@pacevedom: This pull request references Jira Issue OCPBUGS-8277, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jogeo

Details

In response to this:

/jira refresh

@openshift-ci openshift-ci bot requested a review from jogeo March 9, 2023 14:36
Contributor

I'm afraid I don't understand what this section is doing now. Can you give an example, maybe using the default service network settings?

Contributor Author

It takes the next immediate subnet after the service CIDR and uses an IP from that subnet.
We need an IP outside the service CIDR for the apiserver endpoint, and this was the simplest approach I could think of.
This is also why both parameters (the address itself and whether the lo interface is configured with it) are now configurable: this is an additional IP, and there might be cases where a different one is needed because of collisions.
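
As an illustration of the "next immediate subnet" idea described above, here is a minimal Python sketch. It is not the code from this PR. The helper names `next_subnet` and `first_host` are hypothetical, and picking the first address after the network address is an assumption made for the example. The service network 10.43.0.0/16 is taken from the route quoted later in this conversation.

```python
import ipaddress


def next_subnet(cidr: str):
    """Return the same-sized subnet that immediately follows `cidr`."""
    net = ipaddress.ip_network(cidr)
    # One past the last address of the current subnet is the base of the next one.
    next_base = net.broadcast_address + 1
    return ipaddress.ip_network(f"{next_base}/{net.prefixlen}")


def first_host(net):
    """Assumption for this sketch: use the first address after the network address."""
    return net.network_address + 1


service_cidr = "10.43.0.0/16"
candidate_net = next_subnet(service_cidr)
candidate_ip = first_host(candidate_net)

print(candidate_net)  # 10.44.0.0/16
print(candidate_ip)   # 10.44.0.1
print(candidate_ip in ipaddress.ip_network(service_cidr))  # False
```

The last line shows the property the comment relies on: the resulting address falls outside the service CIDR.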

Contributor

Could you expand the comment in the code to explain all of that? The text that is there now might have been clear to someone who understood how we were already choosing an IP, but was not enough for me to understand what was going on.

Contributor Author

Done

Contributor

Ideally, we don't want to assign a k8s service IP address to any host interface, because that could cause unexpected problems such as the issue described here: #1478 (comment)

However, it seems that we have to in order to fix the certificate issue: https://issues.redhat.com/browse/OCPBUGS-7442

Given the above, this PR uses an IP from the next service CIDR and assigns it to the lo device.

@pacevedom
Contributor Author

/cc @zshi-redhat

Contributor

Why do we want to make the user make this choice? Is this something we can figure out on our own?

Contributor Author

@pacevedom pacevedom Mar 9, 2023

Not really, this is a bit complex. You have all these possibilities:

  • Custom AdvertiseAddress + SkipInterfaceConfiguration=true: The apiserver is either on a different node or uses an already configured interface.
  • Custom AdvertiseAddress + SkipInterfaceConfiguration=false: This deliberately ignores the default next_subnet_after_service_cidr IP, either because that range collides with something or because a different subnet is required. The lo interface is configured with the custom IP.
  • No AdvertiseAddress + SkipInterfaceConfiguration=true: Same as with a custom advertise address: an interface might already be configured with the first IP from next_subnet_after_service_cidr.
  • No AdvertiseAddress + SkipInterfaceConfiguration=false: Everything is default. The first valid IP from next_subnet_after_service_cidr is configured on the lo interface. This will be the common case.
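
The four combinations listed above can be enumerated with a small Python sketch. This only models the decision as described in this comment and is not MicroShift code: the function `resolve`, the constant `DEFAULT_ADDRESS`, and the sample custom address 192.0.2.10 are all hypothetical, and 10.44.0.1 merely stands in for "first valid IP from next_subnet_after_service_cidr".

```python
from itertools import product
from typing import Optional, Tuple

# Illustrative stand-in for "first valid IP from next_subnet_after_service_cidr".
DEFAULT_ADDRESS = "10.44.0.1"


def resolve(advertise_address: Optional[str],
            skip_interface_configuration: bool) -> Tuple[str, bool]:
    """Return (address to advertise, whether to configure it on lo)."""
    # A custom address, when given, replaces the computed default.
    address = advertise_address or DEFAULT_ADDRESS
    # The boolean alone decides whether the lo interface is touched.
    configure_lo = not skip_interface_configuration
    return address, configure_lo


# Walk through all four combinations from the list.
for custom, skip in product(["192.0.2.10", None], [True, False]):
    address, configure_lo = resolve(custom, skip)
    print(f"AdvertiseAddress={custom!s:<10} skip={skip!s:<5} "
          f"-> advertise {address}, configure lo: {configure_lo}")
```

In this model the two settings are independent, which is why four distinct cases exist.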

Contributor

Should we make SkipInterfaceConfiguration an implicit (devel-only) option and assume it is false for single-node deployments? It seems SkipInterfaceConfiguration=true is meant for multi-node deployments.

Contributor Author

Multi-node does require this option. However, you could also have an IP range available on the node, in which case there is no need to configure the address on the lo interface. I wonder if this is something that could happen.
Also, it defaults to false: https://github.com/openshift/microshift/pull/1478/files#diff-a3d824da3c42420cd5cbb0a4a2c0e7b5bfddd819652788a0596d195dc6e31fa5R251

Contributor

I'm still unclear on why we need the boolean at all. If the user provides a custom address, we can say that they must pre-configure the interface to use with that address, even if they just use the loopback interface. If they do not provide an address, then we will configure the loopback interface with an address we choose.

Contributor Author

@pacevedom pacevedom Mar 10, 2023

So are we ok to say that if not using default then the configuration side of the interface lies on the admin? Having in mind that custom here means "anything different than service-CIDR-next-subnet". If that is ok then I am totally fine to remove the exposed option!

Contributor Author

@pacevedom pacevedom Mar 10, 2023

I removed the option, as it's simpler this way. Thanks!

Contributor

So are we ok to say that if not using default then the configuration side of the interface lies on the admin? Having in mind that custom here means "anything different than service-CIDR-next-subnet". If that is ok then I am totally fine to remove the exposed option!

Yes, I think that's consistent with how we separate responsibilities for the OS setup in other cases.
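
The simplified rule agreed on in this thread (a custom address means the admin pre-configures the interface; no custom address means the loopback interface is configured with a chosen address) can be sketched as follows. This is an illustrative Python model, not the merged implementation. The function name `plan_apiserver_address` and the sample addresses are hypothetical.

```python
from typing import Optional, Tuple


def plan_apiserver_address(custom_address: Optional[str],
                           default_address: str) -> Tuple[str, bool]:
    """Return (address to use, whether the lo interface should be configured)."""
    if custom_address:
        # Custom address: the admin is responsible for the interface setup.
        return custom_address, False
    # No custom address: use the computed default and configure lo with it.
    return default_address, True


# Default case: nothing provided by the user.
print(plan_apiserver_address(None, "10.44.0.1"))
# Custom case: the user supplies an address and owns the interface configuration.
print(plan_apiserver_address("192.0.2.10", "10.44.0.1"))
```

With this rule there is no separate boolean to expose: whether an address was provided is the only input.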

Due to the support for IPs in certificates introduced in openshift#1298, the
apiserver IP is configured as a secondary address in the lo interface.
A VIP is configured by ovnk redirecting 10.43.0.1:443 to 10.43.0.1:6443.
6443 is the port where apiserver listens in the host.
10.43.0.1:443 is used by all pods using client-go, as it is computed
from the env vars we can find in any pod.
If a host network pod or any other tool in the host tries to reach the
apiserver by using 10.43.0.1:443, the address is not translated to the
endpoint; it tries to contact 10.43.0.1:443, which is not the apiserver
but the router. This change computes a new IP endpoint in the next
available /32 subnet from the service IP to ensure ovnk does not
interfere.

@zshi-redhat
Contributor

zshi-redhat commented Mar 10, 2023

/cc @pliurh PTAL.

Why does this matter to ovnk or k8s networking?

  • 10.43.0.1:443 is the k8s apiserver service IP, which needs to be accessible by k8s pods (either a hostnetwork pod or a pod using the CNI overlay network).
  • When 10.43.0.1 is assigned to the lo device, it breaks the traffic flow from hostnetwork pods to the apiserver service (e.g. executing curl 10.43.0.1:443 --insecure directly on the host won't reach the apiserver).
  • This is because curl finds that 10.43.0.1 is a local host address on the lo device, so the traffic is sent to the lo device directly instead of being routed to br-ex according to the route added by ovnkube-node: 10.43.0.0/16 via 169.254.169.4 dev br-ex mtu 1400.
  • When traffic is sent to br-ex, ovnk DNATs the virtual service IP to the backend IP (<node-ip>:6443; notice the port is 6443, which is the port the apiserver listens on).
  • When traffic is sent to the lo device directly with port 443, the apiserver won't respond to the request.
  • This broken flow doesn't cause a failure when bringing up the cluster, because none of the initial pods started by microshift use this traffic flow.
  • But if a hostnetwork pod (hostNetwork: true) tries to access the virtual service IP 10.43.0.1:443, it fails.
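
The effect described in the list above can be mimicked with a toy Python model. It is a deliberately simplified sketch, not how the kernel or ovnk is implemented: it only encodes that a destination matching a locally assigned address is delivered locally before any route is consulted. The function `egress_device` is hypothetical, and 10.44.0.1 is an illustrative address outside the service network.

```python
import ipaddress
from typing import Dict, List, Optional, Tuple


def egress_device(dst: str,
                  local_addrs: Dict[str, str],
                  routes: List[Tuple[str, str]]) -> Optional[str]:
    """Toy lookup: locally assigned addresses win, then longest-prefix match."""
    dst_ip = ipaddress.ip_address(dst)
    # Step 1: an address assigned to a local device is delivered locally.
    for addr, dev in local_addrs.items():
        if ipaddress.ip_address(addr) == dst_ip:
            return dev
    # Step 2: otherwise pick the most specific matching route.
    best = None
    for cidr, dev in routes:
        net = ipaddress.ip_network(cidr)
        if dst_ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    return best[1] if best else None


# Route added by ovnkube-node, as quoted in the comment above.
routes = [("10.43.0.0/16", "br-ex")]

# Service IP assigned to lo: traffic never reaches br-ex, so no DNAT happens.
print(egress_device("10.43.0.1", {"10.43.0.1": "lo"}, routes))  # lo
# A different IP on lo: the service IP is routed to br-ex again.
print(egress_device("10.43.0.1", {"10.44.0.1": "lo"}, routes))  # br-ex
```

The second call corresponds to the approach in this PR: moving the host-side address out of the service network leaves the service IP to the route through br-ex.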

@dhellmann
Contributor

@pacevedom maybe we could take #1442 before this one? It should be easy to add a method to report when the user has provided a value for the IP and update the interface management logic based on that.

@openshift-ci-robot

@pacevedom: This pull request references Jira Issue OCPBUGS-8277, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jogeo

Details

In response to this:

Which issue(s) this PR addresses:

Closes #1460

Contributor

This implies that one extra IP address needs to be reserved for microshift. Shall we document it?

Contributor Author

Yes, that should be documented.

Contributor Author

Added docs commit.

@pacevedom
Contributor Author

/retest-required

@dhellmann
Contributor

/lgtm

I will rebase the config refactoring on top of this.

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Mar 13, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, pacevedom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [dhellmann,pacevedom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2023

@pacevedom: all tests passed!

Full PR test history. Your PR dashboard.

@openshift-merge-robot openshift-merge-robot merged commit c531355 into openshift:main Mar 13, 2023
@openshift-ci-robot

@pacevedom: Jira Issue OCPBUGS-8277: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-8277 has been moved to the MODIFIED state.

Details

In response to this:

Which issue(s) this PR addresses:

Closes #1460

@pacevedom
Contributor Author

/cherry-pick release-4.13

@openshift-cherrypick-robot

@pacevedom: new pull request created: #1501

Details

In response to this:

/cherry-pick release-4.13

@pacevedom pacevedom deleted the OCPBUGS-8277 branch December 18, 2023 22:28

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.

Development

Successfully merging this pull request may close these issues.

[BUG] Cannot validate certificate for 10.43.0.1 because it doesn't contain any IP SANs

7 participants