Conversation

@rohityadavcloud (Member) commented Oct 14, 2020

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

G1, the default GC algorithm in Java 11, serves well on multiprocessor
machines with large amounts of memory, where GC pauses are kept short and
reasonably predictable and response time matters more than throughput.

The CloudStack management server is largely a multi-threaded server
application that handles and orchestrates several network requests, and
its default max heap size is only 2G, which makes it a small/medium
application from a heap-size perspective. A more throughput-oriented
collector such as ParallelGC, the default in Java 8 and earlier (that is,
in previous CloudStack releases), may serve it better by collecting more
aggressively.

This PR proposes a change in the default GC algorithm to avoid OOM issues.

Reference: https://docs.oracle.com/en/java/javase/11/gctuning/available-collectors.html#GUID-13943556-F521-4287-AAAA-AE5DE68777CD
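
For illustration, the change boils down to the JVM options the management server is started with; a minimal sketch (the exact file and the JAVA_OPTS-style variable carrying these options are packaging details and assumptions here):

    # Java 11 selects G1 by default (equivalent to passing -XX:+UseG1GC);
    # the proposal is to select the throughput (parallel) collector instead,
    # e.g. in the options the management server is launched with:
    JAVA_OPTS="... -Xmx2g -XX:+UseParallelGC ..."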

How this was tested

Heap activity for Java11 with G1 (default GC algorithm):
Screenshot from 2020-10-15 04-03-40

Heap activity seen after switching to ParallelGC:
Screenshot from 2020-10-15 06-19-04
... and over time with some constant load (a listApis call every 1s to simulate Primate logins); all GC/heap graphs look stable:
Screenshot from 2020-10-15 10-47-16

Heap, thread etc activities with the ParallelGC and max pause time setting (500ms):
Screenshot from 2020-10-15 16-00-56
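
For reference, the constant load described above can be reproduced with a trivial shell loop; a sketch assuming CloudMonkey (cmk) is already configured against this management server:

    # issue a listApis call every second to simulate Primate logins
    while true; do
      cmk list apis > /dev/null
      sleep 1
    done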

Signed-off-by: Rohit Yadav <[email protected]>
@rohityadavcloud added this to the 4.14.1.0 milestone Oct 14, 2020
@rohityadavcloud (Member Author)

@blueorangutan package

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2182

@rohityadavcloud (Member Author)

Smoke tests against VMware kicked off

@sureshanaparti (Contributor)

@rhtyd Good to see a change in the GC for the management server process. Even though Parallel GC is the fastest, there are some side effects to using it:

  • Parallel GC is the default GC in Java 8, but the default was changed to G1 from Java 9 onwards [1], due to Parallel GC's higher pauses.
  • All application threads are stopped during GC, which can impact response time/latency. See some use cases for Parallel GC here [2].

[1] http://openjdk.java.net/jeps/248
[2] https://www.informit.com/articles/article.aspx?p=2496621&seqNum=2

I think it is better to bound such pause times using the "-XX:MaxGCPauseMillis" option when we opt for Parallel GC.
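
That is, something like the following (the 500ms value below is only an example; with Parallel GC this flag is a goal that adaptive sizing tries to meet, possibly at some cost to throughput):

    java -XX:+UseParallelGC -XX:MaxGCPauseMillis=500 ...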

@blueorangutan

Trillian test result (tid-2957)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32877 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4407-t2957-vmware-67u3.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@rohityadavcloud (Member Author)

@sureshanaparti okay, I'll add that and do some tests. However, the max heap size here is very small (2G); pauses mainly hurt memory-heavy applications such as databases, or applications with huge heaps (10-100s of GBs). Given that CloudStack is largely a network application with a small default heap of 2G, I think even the pauses won't affect it (or its threads) much.

Signed-off-by: Rohit Yadav <[email protected]>
@rohityadavcloud (Member Author)

@sureshanaparti can you check the changes now? I've added -XX:MaxGCPauseMillis=500. Thanks!
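
For anyone verifying this locally, the effective flags on the running management server JVM can be dumped with jcmd (the pid is a placeholder):

    jcmd <pid> VM.flags
    # expect -XX:+UseParallelGC and -XX:MaxGCPauseMillis=500 in the output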

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2185

@rohityadavcloud (Member Author)

@blueorangutan test centos7 vmware-67u3

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

@rohityadavcloud (Member Author)

Let's do a KVM run too
@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland (Contributor) left a comment

I'm OK with this; it seems users/operators can tweak it however they wish, so I don't think we are hurting anybody.

@blueorangutan

Trillian test result (tid-2968)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31997 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4407-t2968-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-2967)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42818 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4407-t2967-vmware-67u3.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@rohityadavcloud marked this pull request as ready for review October 16, 2020 12:41
@rohityadavcloud (Member Author)

Heap usage seen under control after over 24 hrs:
Screenshot from 2020-10-16 18-02-27

@Pearl1594 (Contributor) left a comment

LGTM

@nvazquez (Contributor) left a comment

LGTM

@harikrishna-patnala (Contributor) left a comment

I've been keeping an eye on a setup with these changes and the MS is stable. So LGTM.
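
For anyone else watching a setup, a quick way to sample GC/heap counters on the management server JVM without attaching a UI (pid is a placeholder):

    # print heap utilisation and GC stats every second
    jstat -gcutil <pid> 1000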

@rohityadavcloud (Member Author)

After 94+hrs with the changes:
Screenshot from 2020-10-19 13-06-02
