
Conversation

@slavkap (Contributor) commented Oct 7, 2020

Description

If a volume is migrated between storage pools while "volume.stats.interval"
is set to a frequent interval (for example, 30 seconds), a
NullPointerException is thrown, because the hypervisor looks for the
volume on the storage pool it was migrated from.
The statistics also did not work correctly before this fix: only one
cluster was listed when there were several clusters of the same
hypervisor type (e.g. KVM), so only one host returned volume stats. All
volumes were sent to that host, even though some of them reside on other
hosts, so the logic did not work correctly.
In addition, the response was wrong for QCOW2 images: it looks up the
volume's UUID, but CloudStack keeps the statistics keyed by path
(because migrated volumes get new UUIDs).
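A minimal sketch of the lookup order this implies (the class and member names below are illustrative stand-ins, not the actual CloudStack code): key the statistics by volume path, which stays stable across migrations, and fall back to the UUID.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative stand-in for the stats lookup described above.
    public class VolumeStatsIndex {
        private final Map<String, String> statsByKey = new HashMap<>();

        // Entries are keyed by path, because a migrated QCOW2 volume's UUID changes.
        public void record(String path, String stats) {
            statsByKey.put(path, stats);
        }

        // Prefer the path (stable across migrations); fall back to the UUID
        // for volumes that still carry their original one.
        public String find(String path, String uuid) {
            String byPath = statsByKey.get(path);
            return byPath != null ? byPath : statsByKey.get(uuid);
        }
    }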

We discussed the NPE with @Spaceman1984 in PR https://github.com/apache/cloudstack/pull/3949

With this change, all clusters are listed, even those of the same hypervisor type. Each host receives only the active volumes that reside on it, and the logs report proper information about the volumes.
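A rough sketch of that per-host grouping (hypothetical types; the real change lives in CloudStack's stats collection code): build a map from host to the volumes active on it, so each GetVolumeStatsCommand carries only that host's volumes.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class VolumeGrouping {
        // Illustrative volume record: its path and the host it is active on.
        static class Volume {
            final String path;
            final long hostId;
            Volume(String path, long hostId) { this.path = path; this.hostId = hostId; }
        }

        // Group active volumes by host, so each host is asked only about its own volumes.
        static Map<Long, List<String>> groupByHost(List<Volume> activeVolumes) {
            Map<Long, List<String>> byHost = new HashMap<>();
            for (Volume v : activeVolumes) {
                byHost.computeIfAbsent(v.hostId, k -> new ArrayList<>()).add(v.path);
            }
            return byHost;
        }
    }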

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [x] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

CloudStack versions: 4.14 and master

2 clusters, each with 2 hosts
Hypervisor: KVM
2 NFS primary storages with zone-wide scope

@weizhouapache (Member)

Similar issue as #3884 and #3878.

@DaanHoogland DaanHoogland added this to the 4.15.0.0 milestone Oct 7, 2020
@DaanHoogland (Contributor)

@sureshanaparti are all your concerns addressed?

@DaanHoogland (Contributor)

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2214

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3018)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38770 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4388-t3018-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 83 look OK, 2 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Failure | 310.76 | test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 297.25 | test_vpc_redundant.py
test_hostha_kvm_host_fencing | Error | 175.40 | test_hostha_kvm.py

@DaanHoogland (Contributor)

@weizhouapache does this look good to you?

@yadvr (Member) commented Oct 28, 2020

@slavkap can you fix the conflict?

@GabrielBrascher (Member)

@blueorangutan package

@blueorangutan

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2304

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland DaanHoogland changed the base branch from master to 4.13 October 29, 2020 11:45
@DaanHoogland DaanHoogland changed the base branch from 4.13 to master October 29, 2020 11:45

@GabrielBrascher (Member)

Considering that the milestone is 4.15 and this PR was manually tested on 4.14, maybe it would be best to move this to 4.14 or 4.15.

What do you think @slavkap @DaanHoogland?

@slavkap (Contributor, Author) commented Oct 29, 2020

I'm OK with moving it to 4.14, but I'm not sure how these changes are related to VPC. Here is the failure:
"Ping to VM on Network Tier N from VM in Network Tier A should be successful at least for 2 out of 3 VMs"

@DaanHoogland (Contributor)

> I'm OK with moving it to 4.14, but I'm not sure how these changes are related to VPC. Here is the failure:
> "Ping to VM on Network Tier N from VM in Network Tier A should be successful at least for 2 out of 3 VMs"

Those can be due to a busy system, @slavkap. I don't think they can be related to your changes, but we can re-run if you want to be sure.
Note that the tests nine days ago showed different failures.

@GabrielBrascher (Member)

I just re-kicked Travis checks.

@slavkap (Contributor, Author) commented Oct 29, 2020

Thanks, @GabrielBrascher! @DaanHoogland, I saw some of those failures from nine days ago in another PR of mine, but couldn't find the zip with the results.

@GabrielBrascher (Member) commented Oct 29, 2020

Just checked, it seems that the errors are related to Travis running JDK 11; CloudStack 4.13 uses Java 8.

@slavkap (Contributor, Author) commented Oct 29, 2020

There was a missing import in one of the classes

@GabrielBrascher (Member)

Nice! Travis is also running with JDK 8 now.

@slavkap (Contributor, Author) commented Oct 29, 2020

Thank you all for the time you spent!

@sureshanaparti (Contributor) left a comment

Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.

@DaanHoogland (Contributor)

> Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.

@sureshanaparti please explain your concerns, or verify what you can? I am trusting @slavkap to have validated every configuration she has access to. If you know of more that should be done, please do so, or ping the people who can. If we leave it at this, I'm pretty sure no follow-up will happen.

@sureshanaparti (Contributor)

> Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.
>
> @sureshanaparti please explain your concerns, or verify what you can? [...]

@DaanHoogland the code changes impact the volume stats for all storage pools, so it is better if this can be tested with volumes on unmanaged, managed, and local storage. Otherwise, they can cause a regression.

@DaanHoogland (Contributor)

> Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.
>
> @sureshanaparti please explain your concerns, or verify what you can? [...]
>
> @DaanHoogland the code changes impact the volume stats for all storage pools, so it is better if this can be tested with volumes on unmanaged, managed, and local storage. Otherwise, they can cause a regression.

I understood that, @sureshanaparti, but by whom and how should this be tested to address your concerns?

@slavkap, can you explain more about your testing? You say you have tested with zone-wide NFS storage.

  • Is that all?
  • Have you tested with more types of environment? Or have you, @GabrielBrascher?
  • Is there any reason to think we don't need to test local or managed storage, @slavkap?

Thanks all

@slavkap (Contributor, Author) commented Oct 30, 2020

@DaanHoogland @sureshanaparti I have tested it only with non-managed storage on StorPool and NFS (on master and 4.14), with 2 clusters each having 2 KVM hypervisors. I'm now setting up a dev environment with 4.13 and will test with managed (on NFS) and non-managed (on NFS and StorPool) storage. About local storage: how should it be set up, managed or unmanaged? If there is somebody who could test this with OVA files, it would be appreciated.

@slavkap (Contributor, Author) commented Oct 30, 2020

Actually, what I see is that NFS with the default logic is only non-managed storage. I can test this only with StorPool (which is non-managed) and NFS.

@slavkap (Contributor, Author) commented Oct 30, 2020

I've tested it now with non-managed NFS (zone-wide) and local (host-scoped) storage on 4.13; it works correctly without any failures, and when I get volume stats the logs show the right information.

local storage:
'3', 'cloudstack-cl3-hv2-local-0270e029', '0270e029-baa4-4006-9834-cf5a5bc92851', 'Filesystem', '0', '1', '1', '1', '4392984576', '273804165120', 'host', NULL, '/var/lib/libvirt/images', '2020-10-30 10:57:10', NULL, NULL, 'Up', 'DefaultPrimary', 'HOST', NULL, '0', NULL

volume on local storage:
'10', '2', '1', '3', NULL, '10', '0', 'ROOT-10', 'cd875374-ef02-409b-b6da-87adae43b0ee', '8589934592', NULL, 'cd875374-ef02-409b-b6da-87adae43b0ee', NULL, '1', NULL, NULL, 'ROOT', NULL, '17', '4', NULL, '0', '2020-10-30 10:59:31', NULL, '2020-10-30 10:59:39', NULL, 'Ready', NULL, '2', NULL, NULL, NULL, '1', 'QCOW2', NULL, NULL, NULL, 'thin'

Response when listing a volume; these parameters appear with the fix:
"physicalsize": 209788928,
"utilization": "0.02",
"virtualsize": 8589934592,

NFS storage:
'1', 'primary', '95d3f971-b75a-3e1d-a225-09037d2fa9f2', 'NetworkFilesystem', '2049', '1', NULL, NULL, '146331926528', '214736830464', 'host4', NULL, '/export/primary-13', '2020-10-30 10:22:01', NULL, NULL, 'Up', 'DefaultPrimary', 'ZONE', 'KVM', '0', NULL

volume on NFS:
'6', '2', '1', '1', NULL, '6', '0', 'ROOT-6', 'c86634ca-0c01-427c-b374-f6bf908f9869', '8589934592', NULL, 'c86634ca-0c01-427c-b374-f6bf908f9869', NULL, '1', NULL, NULL, 'ROOT', NULL, '1', '4', NULL, '0', '2020-10-30 10:50:05', NULL, '2020-10-30 10:50:43', NULL, 'Ready', NULL, '2', NULL, NULL, NULL, '1', 'QCOW2', NULL, NULL, NULL, 'thin'

Response when listing the volume:
"physicalsize": 234422272,
"utilization": "0.03",
"virtualsize": 8589934592,

I'm not sure whether I should mention another problem here that I hit while setting up CloudStack 4.13. I have to do more tests, but I couldn't create a guest network for a zone with basic networking. I saw that @GabrielBrascher added this functionality:

    /**
     * Encodes VLAN/VXLAN ID into a Broadcast URI according to the isolation method from the Physical Network.
     * @return Broadcast URI, e.g. 'vlan://vlan_ID' or 'vxlan://vlxan_ID'
     */
    protected URI encodeVlanIdIntoBroadcastUri(String vlanId, PhysicalNetwork pNtwk) {
        if (pNtwk == null) {
            throw new InvalidParameterValueException(String.format("Failed to encode VLAN/VXLAN %s into a Broadcast URI. Physical Network cannot be null.", vlanId));
        }

        if(StringUtils.isNotBlank(pNtwk.getIsolationMethods().get(0))) {
            String isolationMethod = pNtwk.getIsolationMethods().get(0).toLowerCase();
            String vxlan = BroadcastDomainType.Vxlan.toString().toLowerCase();
            if(isolationMethod.equals(vxlan)) {
                return BroadcastDomainType.encodeStringIntoBroadcastUri(vlanId, BroadcastDomainType.Vxlan);
            }
        }
        return BroadcastDomainType.fromString(vlanId);
    }

but the list of isolation methods was empty in my case (during debugging I set one element to an empty string and it worked).

@slavkap slavkap closed this Oct 30, 2020
@slavkap slavkap reopened this Oct 30, 2020
@slavkap (Contributor, Author) commented Oct 30, 2020

> Is there any reason to think we don't need to test local or managed storage, @slavkap?

I think it has to be tested with managed storage, but our storage does not support this functionality. With this change, I think managed/local storage will work as before; the only difference is that the GetVolumeStats task will list all clusters in the zone, even when their hypervisors are of the same type:

    for (final Cluster cluster : _clusterDao.listClustersByDcId(pool.getDataCenterId()))

and GetVolumeStatsCommand will be sent to all hosts, not only to the first one that returns an answer, as in the old implementation:


        for (HostVO neighbor : neighbors) {
            // apply filters:
            // - managed storage
            // - local storage
            if (storagePool.isManaged() || storagePool.isLocal()) {

                volumeLocators = getVolumesByHost(neighbor, storagePool);

            }

            // - zone wide storage for specific hypervisortypes
            if (ScopeType.ZONE.equals(storagePool.getScope()) && storagePool.getHypervisor() != neighbor.getHypervisorType()) {
                // skip this neighbour if their hypervisor type is not the same as that of the store
                continue;
            }

            GetVolumeStatsCommand cmd = new GetVolumeStatsCommand(poolType, poolUuid, volumeLocators);

            if (timeout > 0) {
                cmd.setWait(timeout/1000);
            }

            Answer answer = _agentMgr.easySend(neighbor.getId(), cmd);

            if (answer instanceof GetVolumeStatsAnswer){
                GetVolumeStatsAnswer volstats = (GetVolumeStatsAnswer)answer;
                return volstats.getVolumeStats();
            }
        }
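For contrast, a sketch of the flow after this change, with the same caveat (illustrative interfaces, not the actual CloudStack classes): every host with active volumes is queried, and the per-host answers are merged instead of returning at the first one.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PerHostDispatch {
        interface Agent {
            // Stand-in for sending a GetVolumeStatsCommand to one host.
            Map<String, Long> getVolumeStats(long hostId, List<String> volumePaths);
        }

        // volumesByHost: hostId -> paths of the volumes active on that host.
        static Map<String, Long> collect(Agent agent, Map<Long, List<String>> volumesByHost) {
            Map<String, Long> merged = new HashMap<>();
            for (Map.Entry<Long, List<String>> entry : volumesByHost.entrySet()) {
                if (entry.getValue().isEmpty()) {
                    continue; // nothing on this host, skip the round trip
                }
                Map<String, Long> answer = agent.getVolumeStats(entry.getKey(), entry.getValue());
                if (answer != null) {
                    merged.putAll(answer); // keep going; do not return at the first answer
                }
            }
            return merged;
        }
    }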

@sureshanaparti (Contributor)

> I've tested it now with non-managed NFS (zone-wide) and local (host-scoped) storage on 4.13; it works correctly without any failures, and when I get volume stats the logs show the right information. [...]

@slavkap thanks for sharing the test results, which confirm the changes are good with NFS and local storage. As local and managed storage took the same code path earlier, this should work for managed storage as well.

@DaanHoogland (Contributor)

Great, I'll merge and merge forward.

@DaanHoogland DaanHoogland modified the milestones: 4.15.0.0, 4.13.2.0 Oct 30, 2020
@DaanHoogland DaanHoogland merged commit 8afb451 into apache:4.13 Oct 30, 2020
@slavkap (Contributor, Author) commented Oct 30, 2020

Thank you all for the code review and your time!

@slavkap slavkap deleted the fix_volume_stats branch October 30, 2020 16:28
Pearl1594 pushed a commit to shapeblue/cloudstack that referenced this pull request Nov 4, 2020
Pearl1594 pushed a commit to shapeblue/cloudstack that referenced this pull request Nov 4, 2020