Conversation

@rohityadavcloud (Member) commented Oct 14, 2020

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

G1, the default GC algorithm in Java 11, serves well on multiprocessor
machines with large amounts of memory, where GC pauses are kept short and
reasonably predictable and response time matters more than throughput.

The CloudStack management server is largely a multi-threaded server
application that handles and orchestrates several network requests, and
its default max heap size is only 2G, which makes it a small/medium
application from a heap-size perspective. A more throughput-oriented
collector such as ParallelGC, the default in Java 8 and earlier (that is,
in previous CloudStack releases), may serve it better by collecting more
aggressively.

This PR proposes a change in the default GC algorithm to avoid OOM issues.

Reference: https://docs.oracle.com/en/java/javase/11/gctuning/available-collectors.html#GUID-13943556-F521-4287-AAAA-AE5DE68777CD
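
For illustration, the change boils down to the JVM options the management server is started with; a minimal sketch (the exact file and the JAVA_OPTS-style variable carrying these options are packaging details and assumptions here):

    # Java 11 selects G1 by default (equivalent to passing -XX:+UseG1GC);
    # the proposal is to select the throughput (parallel) collector instead,
    # e.g. in the options the management server is launched with:
    JAVA_OPTS="... -Xmx2g -XX:+UseParallelGC ..."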

How this was tested

Heap activity for Java11 with G1 (default GC algorithm):
Screenshot from 2020-10-15 04-03-40

Heap activity seen after switching to ParallelGC:
Screenshot from 2020-10-15 06-19-04
... and over time with some constant load (a listApis call every 1s to simulate Primate logins); all GC/heap graphs look stable:
Screenshot from 2020-10-15 10-47-16

Heap, thread etc activities with the ParallelGC and max pause time setting (500ms):
Screenshot from 2020-10-15 16-00-56
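
For reference, the constant load described above can be reproduced with a trivial shell loop; a sketch assuming CloudMonkey (cmk) is already configured against this management server:

    # issue a listApis call every second to simulate Primate logins
    while true; do
      cmk list apis > /dev/null
      sleep 1
    done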

Signed-off-by: Rohit Yadav <[email protected]>
@rohityadavcloud added this to the 4.14.1.0 milestone Oct 14, 2020
@rohityadavcloud (Member Author)

@blueorangutan package

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2182

@rohityadavcloud (Member Author)

Smoke tests against VMware kicked off

@sureshanaparti (Contributor)

@rhtyd Good to see a change in the GC for the management server process. Even though Parallel GC is the fastest, there are some side effects to using it:

  • Parallel GC is the default GC in Java 8, but the default was changed to G1 from Java 9 onwards [1], due to Parallel GC's higher pauses.
  • All application threads are stopped during GC, which can impact response time/latency. See some use cases for Parallel GC here [2].

[1] http://openjdk.java.net/jeps/248
[2] https://www.informit.com/articles/article.aspx?p=2496621&seqNum=2

I think it is better to bound such pause times using the "-XX:MaxGCPauseMillis" option when we opt for Parallel GC.
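
That is, something like the following (the 500ms value below is only an example; with Parallel GC this flag is a goal that adaptive sizing tries to meet, possibly at some cost to throughput):

    java -XX:+UseParallelGC -XX:MaxGCPauseMillis=500 ...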

@blueorangutan

Trillian test result (tid-2957)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32877 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4407-t2957-vmware-67u3.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@rohityadavcloud (Member Author)

@sureshanaparti okay, I'll add that and do some tests. However, the max heap size here is very small (2G); pauses mainly hurt memory-heavy applications such as databases, or applications with huge heaps (10-100s of GBs). Given that CloudStack is largely a network application with a small default heap of 2G, I think even the pauses won't affect it (or its threads) much.

Signed-off-by: Rohit Yadav <[email protected]>
@rohityadavcloud (Member Author)

@sureshanaparti can you check the changes now? I've added -XX:MaxGCPauseMillis=500. Thanks!
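
For anyone verifying this locally, the effective flags on the running management server JVM can be dumped with jcmd (the pid is a placeholder):

    jcmd <pid> VM.flags
    # expect -XX:+UseParallelGC and -XX:MaxGCPauseMillis=500 in the output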

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2185

@rohityadavcloud (Member Author)

@blueorangutan test centos7 vmware-67u3

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

@rohityadavcloud (Member Author)

Let's do a KVM run too
@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland (Contributor) left a comment

I'm OK with this; it seems users/operators can tweak it however they wish, so I don't think we are hurting anybody.

@blueorangutan

Trillian test result (tid-2968)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31997 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4407-t2968-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-2967)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42818 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4407-t2967-vmware-67u3.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@rohityadavcloud marked this pull request as ready for review October 16, 2020 12:41
@rohityadavcloud (Member Author)

Heap usage seen under control after over 24 hrs:
Screenshot from 2020-10-16 18-02-27

@Pearl1594 (Contributor) left a comment

LGTM

@nvazquez (Contributor) left a comment

LGTM

@harikrishna-patnala (Contributor) left a comment

I've been keeping an eye on a setup with these changes and the MS is stable. So LGTM.
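
For anyone else watching a setup, a quick way to sample GC/heap counters on the management server JVM without attaching a UI (pid is a placeholder):

    # print heap utilisation and GC stats every second
    jstat -gcutil <pid> 1000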

@rohityadavcloud (Member Author)

After 94+hrs with the changes:
Screenshot from 2020-10-19 13-06-02
