fix NPE in volumes statistics #4388
Conversation
Review comment on server/src/main/java/com/cloud/api/query/ViewResponseHelper.java (outdated, resolved)
Force-pushed 55f084d to e4cb5d6
@sureshanaparti are all your concerns addressed?

@blueorangutan package

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2214

@blueorangutan test

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests.

Trillian test result (tid-3018)

@weizhouapache is this lgty?

@slavkap can you fix the conflict?
Force-pushed e4cb5d6 to 4fa9f63
@blueorangutan package

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2304

@blueorangutan test

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests.
Force-pushed 4fa9f63 to a77d518
Considering that the milestone is 4.15 and this PR was manually tested at 4.14, maybe it would be best to move this to 4.14 or 4.15. What do you think @slavkap @DaanHoogland?

I'm OK to move it to 4.14, but I'm not sure how those changes are related to VPC. Here is the failure:

Those could be due to a busy system, @slavkap. I don't think they can be related to your changes, but we can re-run if you want to be sure.

I just re-kicked the Travis checks.

Thanks @GabrielBrascher! @DaanHoogland I saw some of those failures nine days ago in another PR of mine, but didn't find the zip with the results.
Just checked; it seems that the errors are related to Travis running JDK 11, while CloudStack 4.13 uses Java 8.

There was a missing import in one of the classes.

Nice! Travis is also running with JDK 8 now.
Thank you all for the time you spent! |
sureshanaparti left a comment:

Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed / unmanaged / local storage.

@sureshanaparti please explain your concerns, or verify what you can. I am trusting @slavkap to have validated any configuration she has access to. If you know of more that should be done, please do so, or ping the people that can. If we leave it at this, I'm pretty sure no follow-up will happen.

@DaanHoogland the code changes will impact the volume stats for all storage pools. So it is better if this can be tested with volume(s) on unmanaged, managed, and local storage. Otherwise, they can cause a regression.

I understood that, @sureshanaparti. But by whom, and how, should this be tested to address your concerns? @slavkap, can you explain more about your testing? You say you have tested with zone-wide NFS storage.

tnx all
@DaanHoogland @sureshanaparti I have tested it only with non-managed storage on StorPool and NFS (on master and 4.14), with 2 clusters each having 2 KVM hypervisors. I'm now setting up a dev environment with 4.13 and will test with managed (on NFS) and non-managed (on NFS and StorPool) storage. About local storage: how should it be set, managed or unmanaged? It would be appreciated if somebody could test this with OVA files.

Actually, what I see is that NFS with the default logic is only non-managed storage. So I can test this only with StorPool (which is non-managed) and NFS.
I've tested it now with non-managed NFS (zone-wide) and local (scope: host) storage on 4.13, and it works correctly without any failures; when I get volume stats, the logs show the right information. [Screenshots were attached here: the volume on local storage, the volume on NFS, and the list-volume responses in which these params appear with the fix.] I'm not sure whether I should mention it here, but I hit another problem during the setup of CS 4.13. I have to do more tests, but I couldn't create a guest network for a zone with basic networking. I saw that @GabrielBrascher has added this functionality, but the list of isolation methods was empty in my case (during debugging I set one element to an empty string and it worked).
I think it has to be tested with managed storage, but our storage does not support this functionality. With this change, for managed/local storage I think it will work as before; the only difference is that the GetVolumeStats task will list all clusters in the zone even if their hypervisors are of the same type, and GetVolumeStatsCommand will be sent to all hosts, not only to the first one that returns an answer as in the old implementation:
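The dispatch change described in this thread could be sketched roughly as follows. The names below (Cluster, Host, statsFrom) are hypothetical stand-ins for illustration, not the actual CloudStack classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsDispatchSketch {
    // Hypothetical stand-ins for the cluster/host model.
    record Host(String name, List<String> volumes) {}
    record Cluster(String name, List<Host> hosts) {}

    // Old behavior: stop at the first host that returns an answer, so
    // volumes residing on other hosts or clusters were never queried.
    static Map<String, String> oldDispatch(List<Cluster> clusters) {
        for (Cluster c : clusters) {
            for (Host h : c.hosts()) {
                Map<String, String> answer = statsFrom(h);
                if (!answer.isEmpty()) {
                    return answer; // first answer wins
                }
            }
        }
        return Map.of();
    }

    // New behavior: query every host in every cluster and merge the answers.
    static Map<String, String> newDispatch(List<Cluster> clusters) {
        Map<String, String> merged = new HashMap<>();
        for (Cluster c : clusters) {
            for (Host h : c.hosts()) {
                merged.putAll(statsFrom(h));
            }
        }
        return merged;
    }

    // Fake per-host stats source standing in for GetVolumeStatsCommand.
    static Map<String, String> statsFrom(Host h) {
        Map<String, String> stats = new HashMap<>();
        for (String v : h.volumes()) {
            stats.put(v, "stats-from-" + h.name());
        }
        return stats;
    }

    public static void main(String[] args) {
        List<Cluster> clusters = List.of(
            new Cluster("c1", List.of(new Host("h1", List.of("vol-a")))),
            new Cluster("c2", List.of(new Host("h2", List.of("vol-b")))));
        System.out.println(oldDispatch(clusters).size()); // 1: only vol-a
        System.out.println(newDispatch(clusters).size()); // 2: vol-a and vol-b
    }
}
```

With two same-type KVM clusters, the old loop returns after the first host's answer and vol-b is never reported; the new loop collects stats from both hosts.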
@slavkap thanks for sharing the test results, which confirm the changes are good with NFS and local storage. As local and managed storage took the same path earlier, this should work for managed as well.

Great, I'll merge and merge forward.

Thank you all for the code review and your time!
Description
If a volume is migrated between storage pools and "volume.stats.interval"
is set to a more frequent interval, for example 30 seconds, a
NullPointerException is thrown, because the hypervisor looks for the
volume on the storage pool from which it was migrated.
The statistics also did not work correctly before this fix: only one
cluster is listed if there are multiple clusters of the same hypervisor
type (e.g. KVM), so only one host returns volume stats. All volumes are
sent to that host, even though they could reside on another host, so the
logic does not work correctly.
The response was also wrong for QCOW2 images, because it looks up the
volume's UUID, while the statistics in CloudStack are keyed by the path
(because the UUIDs of migrated volumes are changed).
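A minimal sketch of the UUID-vs-path lookup problem described above, using hypothetical names and values rather than CloudStack's actual data structures:

```java
import java.util.HashMap;
import java.util.Map;

public class VolumeStatsLookupSketch {
    public static void main(String[] args) {
        // Stats as the hypervisor reports them for a QCOW2 volume, keyed
        // by its path, which survives a storage migration.
        Map<String, Long> statsByPath = new HashMap<>();
        statsByPath.put("/mnt/primary/f3a1.qcow2", 4096L);

        // The UUID recorded for the volume went stale when it was migrated.
        String staleUuid = "9c0f-old-uuid";

        // Broken lookup, by UUID: get() returns null, and any unboxing or
        // dereference of the result throws a NullPointerException.
        Long byUuid = statsByPath.get(staleUuid);
        System.out.println(byUuid == null); // true

        // Fixed lookup, by path: the entry is found.
        Long byPath = statsByPath.get("/mnt/primary/f3a1.qcow2");
        System.out.println(byPath); // 4096
    }
}
```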
We have discussed the NPE with @Spaceman1984 in PR https://github.com/apache/cloudstack/pull/3949
With this change, all clusters will be listed, even those of the same hypervisor type. Each host will receive only the active volumes that reside on it, and the logs will contain proper information about the volumes.
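The per-host filtering could be sketched like this; the data model below is a hypothetical illustration of the idea, not the real CloudStack code:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PerHostVolumeGrouping {
    public static void main(String[] args) {
        // Hypothetical mapping of each active volume to the host it is on.
        Map<String, String> volumeToHost = Map.of(
            "vol-1", "host-a",
            "vol-2", "host-b",
            "vol-3", "host-a");

        // Group volumes by host, so each host is asked only about the
        // volumes that actually reside on it.
        Map<String, List<String>> perHost = volumeToHost.entrySet().stream()
            .collect(Collectors.groupingBy(Map.Entry::getValue,
                     Collectors.mapping(Map.Entry::getKey, Collectors.toList())));

        System.out.println(perHost.get("host-a").size()); // 2
        System.out.println(perHost.get("host-b")); // [vol-2]
    }
}
```

Sending each host only its own list avoids asking a hypervisor about volumes it cannot see, which is what produced the misleading answers before the fix.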
Types of changes
How Has This Been Tested?
- CS versions: 4.14 and master
- 2 clusters, each having 2 hosts
- Hypervisors: KVM
- 2 NFS primary storages with scope Zone