Skip to content

Conversation

@ustcweizhou
Copy link
Contributor

Description

This PR contains 13 commits below

bugfix #1 vpc: fix ips on wrong interfaces after rebooting vpc vrs
bugfix #1 vr: Force a restart of keepalived if conntrackd is not running or configuration has changed
bugfix #2 vpc: Fix remove first public ip will remove all ips on the nic
bugfix #3 apply ip dessociation before unplugging a nic so ip is marked as add:false in ips.json
bugfix #2 vpc vr: fix issue if static nat is disabled but still other IP used by lb/pf
bugfix #4 vpc vr: Do NOT send Nic plug in/out command to Stopped/Stopping VR
Revert "Handle private gateways more reliably"
Revert "Add private gateway IP to router initialization config"
Revert "Fix Policy Based Routing for private gateway static routes (#3604)"
bugfix #6 vpc vr: Add iptables rules for ACL of private gateway
bugfix #7 vpc vr: allow servers in private gateway to reach internet via the VPC VR if it is gateway
bugfix #8 vpc: add rule for traffic between vm and private gateway
bugfix #9 vpc vr: Add PREROUTING rule for vm with static nat to multiple private gateways

Here are description of bugs
(1) ips on wrong interfaces after rebooting vpc vrs
When add new vpc tier, create private gateway, associated IP in new range to a VPC, nics will be plugged to VPC VRs.
However, when reboot(or start) a VPC VR, the nics will be added by order: Public IP (source nat), other Public IP range, private gateway, VPC tiers. so the device_id of nics are different before stopping VR.

I have created a PR for this issue #4467

(2) remove first public ip will remove all ips on the nic
When use more public IPs in new public IP range (not same as source nat), a nic will be plugged to VPC VR. All public IPs used by the VPC will be attached to the nic.
However, when we release the first public IP of the nic, it will remove the nic, all other IPs on the nic will be gone as well.

(3) public ip is not marked as "add: false" in /etc/cloudstack/ips.json when release it.
When use a public IP in new range to VPC VR, a nic will be plugged to VPC VR.
When remove the public from VPC, the nic will be unplugged. however, the ip is still marked as "add: true" in /etc/cloudstack/ips.json
so when we add a new nic to the VPC, the (old) public ip will be added back to the nic.

(4) When a VPC VR is stopped, we cannot add/remove new nic to VPC VR.

(5) Static NAT with multiple public interfaces uses wrong outgoing IP #4234
This is a regression of the fix for #3604
we need to revert the commit "Fix Policy Based Routing for private gateway static routes (#3604)"

(6) There is no ACL rule for private gateway
This is a regression of the fix for #3402
private gateway is changed to 'public' in commit "vpc: set traffic type of private gateway IP to Public to fix ke… (#3851)"
so we need to add ACL rules.

(7) servers in private gateway cannot reach internet via the VPC VR if it is gateway
When add private gateway and use VPC as gateway (private gateway IP = gateway IP), the servers in private gateway network cannot reach internet via VPC VR.
need to add rule to accept packet if VPC VR is used as gateway.

(8) INBOUND rules for traffic between vm and private gateway servers does not work.
Even rules are added to fix bug (6), the incoming traffic between vm and private gateway network is always accepted.
for example, if the ACL of vm disallow traffic from private gateway network, vm still accepts traffic from private gateway network.
need to add rules to check the INBOUND ACL rules.

(9) vm with static nat cannot connect to private gateway network.
As the fix is reverted to fix (5), the issue described in #3604 is back.
this PR introduced another way to fix the issue.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@ustcweizhou
Copy link
Contributor Author

@div8cn @richardlawley @ravening @DennisKonrad
could you please test this pr ?

@weizhouapache
Copy link
Member

@DaanHoogland @rhtyd could you kick-off Trillian test for this pr ?

@DaanHoogland
Copy link
Contributor

these are a lot @weizhouapache can you add the highest severity to the labels?
@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@DennisKonrad DennisKonrad self-assigned this Nov 20, 2020
@blueorangutan
Copy link

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2414

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@DaanHoogland DaanHoogland added the Severity:Critical Critical bug label Nov 20, 2020
@blueorangutan
Copy link

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@rohityadavcloud
Copy link
Member

@blueorangutan test centos7 vmware-67u3

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-3223)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30312 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4484-t3223-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

Trillian test result (tid-3224)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 35800 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4484-t3224-vmware-67u3.zip
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Smoke tests completed. 82 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_create_template_with_checksum_sha1 Error 5.19 test_templates.py
test_03_create_template_with_checksum_sha256 Error 5.19 test_templates.py
test_04_create_template_with_checksum_md5 Error 5.18 test_templates.py

@DaanHoogland
Copy link
Contributor

hm, this exact set of failures I have seen before. I hope nothing slipped through.

@rohityadavcloud
Copy link
Member

@blueorangutan test centos7 vmware-67u3

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-3230)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33956 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4484-t3230-vmware-67u3.zip
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Smoke tests completed. 82 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_create_template_with_checksum_sha1 Error 5.19 test_templates.py
test_03_create_template_with_checksum_sha256 Error 5.19 test_templates.py
test_04_create_template_with_checksum_md5 Error 5.21 test_templates.py

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see nothing strange or wrong with this code. two remarks thought:

  1. it is a lot in one go and will require verification in a lot of different kinds of environments.
  2. I recognise some changes that already went into master over the last few weeks so merging forward might give some conflicts (nothing that should stop us now.

@DaanHoogland
Copy link
Contributor

@PaulAngus as RM, do we merge this before release? cc @rhtyd

Copy link
Member

@rohityadavcloud rohityadavcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, I see some changes are backported from master and many changes are in tests. The scope of changes are in vpc and private gw. Some manual upgrade tests can be done around that.

@rohityadavcloud
Copy link
Member

@DaanHoogland can you ping and discuss with @PaulAngus thnx

@rohityadavcloud
Copy link
Member

@weizhouapache can you share what tests were done, did you perform any upgrade tests?

@ustcweizhou ustcweizhou force-pushed the apache-4.14-fix-vpc-issues branch from ab3fe1e to 788ed28 Compare November 23, 2020 09:18
@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@weizhouapache
Copy link
Member

weizhouapache commented Nov 23, 2020

I see nothing strange or wrong with this code. two remarks thought:

  1. it is a lot in one go and will require verification in a lot of different kinds of environments.
  2. I recognise some changes that already went into master over the last few weeks so merging forward might give some conflicts (nothing that should stop us now.

@DaanHoogland @rhtyd rebased with latest 4.14
yes, there are indeed a lot of verifications to be done.

I have added some test cases in this PR which verify the ips on nics and UP/DOWN state of public interfaces.
I have tested with kvm, but not on xenserver and vmware.
it takes around 1 hour to finish all 4 integration tests so I do not add them to .travis.yaml.
The integration tests cover vpc/network with/without vr, and some actions (add/remove public ip in multiple ip ranges, add/remove vpc tier,add private gateway, reboot routers, restart vpc tiers, restart vpc/network w/wo cleanup).
https://github.com/apache/cloudstack/blob/788ed28a8c73756a1bc8deb102a1d2506cc2d430/test/integration/component/test_multiple_subnets_in_isolated_network.py
https://github.com/apache/cloudstack/blob/788ed28a8c73756a1bc8deb102a1d2506cc2d430/test/integration/component/test_multiple_subnets_in_isolated_network_rvr.py
https://github.com/apache/cloudstack/blob/788ed28a8c73756a1bc8deb102a1d2506cc2d430/test/integration/component/test_multiple_subnets_in_vpc.py
https://github.com/apache/cloudstack/blob/788ed28a8c73756a1bc8deb102a1d2506cc2d430/test/integration/component/test_multiple_subnets_in_vpc_rvr.py

However, for some changes on iptable rules , they require manual test.
What I have done
(1) create vpc1 and two tiers vpc1-001, vpc1-002, and some vms vpc1-001-001, vpc1-001-002, vpc1-002-001, vpc1-002-002
(2) create a shared network and a vm in it, to simulate server in private rack.
(3) create vpc2, create a tier and vm vpc2-001-001, and enable site-to-site vpn gateway
(4) add multiple public ip ranges in zone/public physical network

setup
(5) create site-to-site vpn between vpc1 and vpc2
(6) create private gateway in vpc1, with same vlan with shared network in step (2) above.
(7) acquire multiple IPs in new public ranges created in step (4) above, and use them for different proposal (eg enable static nat on some vms in vpc, or create port forwarding rules to vms in vpc).

expected results
(1) if ACL is allow_all, then all servers (including vm in shared network, vm in vpc2, vms with/without static nat in vpc1) should be able to reach each other
(2) if ACL is deny_all, vm in vpc tiers and private gateway should not be able to reach each other.

By the way, if you merge his pr, could you use "Rebase and Merge" option so it would be better to track why a line of change is made if there are issues in the future.

@blueorangutan
Copy link

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2423

@DaanHoogland
Copy link
Contributor

@DennisKonrad any conclusions from your testing yet?

@DennisKonrad
Copy link
Contributor

DennisKonrad commented Nov 24, 2020

@DaanHoogland I'm looking into it right now. So far all I tested looked good. Giving my approval today if nothing else turns up

@apache apache deleted a comment from blueorangutan Nov 24, 2020
@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Copy link
Contributor

i think we are good to go here. restarted the failed test env to be more than 100% sure.

@apache apache deleted a comment from blueorangutan Nov 24, 2020
@blueorangutan
Copy link

Trillian test result (tid-3244)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30318 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4484-t3244-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants