
Conversation

@slavkap (Contributor) commented Oct 7, 2020

Description

If a volume is migrated between storage pools while "volume.stats.interval"
is set to a frequent interval (for example, 30 seconds), a
NullPointerException is thrown, because the hypervisor looks for the
volume on the storage pool it was migrated from.
The statistics also did not work correctly before this fix: only one
cluster was listed when there were several clusters of the same
hypervisor type (e.g. KVM), so only one host returned volume stats. All
volumes were sent to that host, even though some of them reside on other
hosts, so the logic did not work correctly.
In addition, the response was wrong for QCOW2 images: it looks up the
volume's UUID, but CloudStack keeps the statistics keyed by path
(because migrated volumes get new UUIDs).
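A minimal sketch of the lookup order this implies (the class and member names below are illustrative stand-ins, not the actual CloudStack code): key the statistics by volume path, which stays stable across migrations, and fall back to the UUID.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative stand-in for the stats lookup described above.
    public class VolumeStatsIndex {
        private final Map<String, String> statsByKey = new HashMap<>();

        // Entries are keyed by path, because a migrated QCOW2 volume's UUID changes.
        public void record(String path, String stats) {
            statsByKey.put(path, stats);
        }

        // Prefer the path (stable across migrations); fall back to the UUID
        // for volumes that still carry their original one.
        public String find(String path, String uuid) {
            String byPath = statsByKey.get(path);
            return byPath != null ? byPath : statsByKey.get(uuid);
        }
    }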

We discussed the NPE with @Spaceman1984 in PR https://github.com/apache/cloudstack/pull/3949

With this change, all clusters are listed, even those of the same hypervisor type. Each host receives only the active volumes that reside on it, and the logs report proper information about the volumes.
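A rough sketch of that per-host grouping (hypothetical types; the real change lives in CloudStack's stats collection code): build a map from host to the volumes active on it, so each GetVolumeStatsCommand carries only that host's volumes.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class VolumeGrouping {
        // Illustrative volume record: its path and the host it is active on.
        static class Volume {
            final String path;
            final long hostId;
            Volume(String path, long hostId) { this.path = path; this.hostId = hostId; }
        }

        // Group active volumes by host, so each host is asked only about its own volumes.
        static Map<Long, List<String>> groupByHost(List<Volume> activeVolumes) {
            Map<Long, List<String>> byHost = new HashMap<>();
            for (Volume v : activeVolumes) {
                byHost.computeIfAbsent(v.hostId, k -> new ArrayList<>()).add(v.path);
            }
            return byHost;
        }
    }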

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [x] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

CloudStack versions: 4.14 and master

2 clusters, each with 2 hosts
Hypervisor: KVM
2 NFS primary storages with zone-wide scope

@weizhouapache (Member)

Similar issue as #3884 and #3878.

@DaanHoogland DaanHoogland added this to the 4.15.0.0 milestone Oct 7, 2020
@DaanHoogland (Contributor)

@sureshanaparti are all your concerns addressed?

@DaanHoogland (Contributor)

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2214

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3018)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38770 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4388-t3018-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 83 look OK, 2 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Failure | 310.76 | test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 297.25 | test_vpc_redundant.py
test_hostha_kvm_host_fencing | Error | 175.40 | test_hostha_kvm.py

@DaanHoogland (Contributor)

@weizhouapache does this look good to you?

@yadvr (Member) commented Oct 28, 2020

@slavkap can you fix the conflict?

@GabrielBrascher (Member)

@blueorangutan package

@blueorangutan

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2304

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland DaanHoogland changed the base branch from master to 4.13 October 29, 2020 11:45
@DaanHoogland DaanHoogland changed the base branch from 4.13 to master October 29, 2020 11:45

@GabrielBrascher (Member)

Considering that the milestone is 4.15 and this PR was manually tested on 4.14, maybe it would be best to move this to 4.14 or 4.15.

What do you think @slavkap @DaanHoogland?

@slavkap (Contributor, Author) commented Oct 29, 2020

I'm OK with moving it to 4.14, but I'm not sure how these changes are related to VPC. Here is the failure:
"Ping to VM on Network Tier N from VM in Network Tier A should be successful at least for 2 out of 3 VMs"

@DaanHoogland (Contributor)

> I'm OK with moving it to 4.14, but I'm not sure how these changes are related to VPC. Here is the failure:
> "Ping to VM on Network Tier N from VM in Network Tier A should be successful at least for 2 out of 3 VMs"

Those can be due to a busy system, @slavkap. I don't think they can be related to your changes, but we can re-run if you want to be sure.
Note that the tests nine days ago showed different failures.

@GabrielBrascher (Member)

I just re-kicked Travis checks.

@slavkap (Contributor, Author) commented Oct 29, 2020

Thanks, @GabrielBrascher! @DaanHoogland, I saw some of those failures from nine days ago in another PR of mine, but couldn't find the zip with the results.

@GabrielBrascher (Member) commented Oct 29, 2020

Just checked, it seems that the errors are related to Travis running JDK 11; CloudStack 4.13 uses Java 8.

@slavkap (Contributor, Author) commented Oct 29, 2020

There was a missing import in one of the classes

@GabrielBrascher (Member)

Nice! Travis is also running with JDK 8 now.

@slavkap (Contributor, Author) commented Oct 29, 2020

Thank you all for the time you spent!

@sureshanaparti (Contributor) left a comment

Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.

@DaanHoogland (Contributor)

> Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.

@sureshanaparti please explain your concerns, or verify what you can? I am trusting @slavkap to have validated every configuration she has access to. If you know of more that should be done, please do so, or ping the people who can. If we leave it at this, I'm pretty sure no follow-up will happen.

@sureshanaparti (Contributor)

> Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.
>
> @sureshanaparti please explain your concerns, or verify what you can? [...]

@DaanHoogland the code changes impact the volume stats for all storage pools, so it is better if this can be tested with volumes on unmanaged, managed, and local storage. Otherwise, they can cause a regression.

@DaanHoogland (Contributor)

> Code LGTM, haven't tested. Please ensure these changes don't cause any regression for managed/unmanaged/local storage.
>
> @sureshanaparti please explain your concerns, or verify what you can? [...]
>
> @DaanHoogland the code changes impact the volume stats for all storage pools, so it is better if this can be tested with volumes on unmanaged, managed, and local storage. Otherwise, they can cause a regression.

I understood that, @sureshanaparti, but by whom and how should this be tested to address your concerns?

@slavkap, can you explain more about your testing? You say you have tested with zone-wide NFS storage.

  • Is that all?
  • Have you tested with more types of environment? Or have you, @GabrielBrascher?
  • Is there any reason to think we don't need to test local or managed storage, @slavkap?

Thanks all

@slavkap (Contributor, Author) commented Oct 30, 2020

@DaanHoogland @sureshanaparti I have tested it only with non-managed storage on StorPool and NFS (on master and 4.14), with 2 clusters each having 2 KVM hypervisors. I'm now setting up a dev environment with 4.13 and will test with managed (on NFS) and non-managed (on NFS and StorPool) storage. About local storage: how should it be set up, managed or unmanaged? If there is somebody who could test this with OVA files, it would be appreciated.

@slavkap (Contributor, Author) commented Oct 30, 2020

Actually, what I see is that NFS with the default logic is only non-managed storage. I can test this only with StorPool (which is non-managed) and NFS.

@slavkap (Contributor, Author) commented Oct 30, 2020

I've tested it now with non-managed NFS (zone-wide) and local (host-scoped) storage on 4.13; it works correctly without any failures, and when I get volume stats the logs show the right information.

local storage:
'3', 'cloudstack-cl3-hv2-local-0270e029', '0270e029-baa4-4006-9834-cf5a5bc92851', 'Filesystem', '0', '1', '1', '1', '4392984576', '273804165120', 'host', NULL, '/var/lib/libvirt/images', '2020-10-30 10:57:10', NULL, NULL, 'Up', 'DefaultPrimary', 'HOST', NULL, '0', NULL

volume on local storage:
'10', '2', '1', '3', NULL, '10', '0', 'ROOT-10', 'cd875374-ef02-409b-b6da-87adae43b0ee', '8589934592', NULL, 'cd875374-ef02-409b-b6da-87adae43b0ee', NULL, '1', NULL, NULL, 'ROOT', NULL, '17', '4', NULL, '0', '2020-10-30 10:59:31', NULL, '2020-10-30 10:59:39', NULL, 'Ready', NULL, '2', NULL, NULL, NULL, '1', 'QCOW2', NULL, NULL, NULL, 'thin'

Response when listing a volume; these parameters appear with the fix:
"physicalsize": 209788928,
"utilization": "0.02",
"virtualsize": 8589934592,

NFS storage:
'1', 'primary', '95d3f971-b75a-3e1d-a225-09037d2fa9f2', 'NetworkFilesystem', '2049', '1', NULL, NULL, '146331926528', '214736830464', 'host4', NULL, '/export/primary-13', '2020-10-30 10:22:01', NULL, NULL, 'Up', 'DefaultPrimary', 'ZONE', 'KVM', '0', NULL

volume on NFS:
'6', '2', '1', '1', NULL, '6', '0', 'ROOT-6', 'c86634ca-0c01-427c-b374-f6bf908f9869', '8589934592', NULL, 'c86634ca-0c01-427c-b374-f6bf908f9869', NULL, '1', NULL, NULL, 'ROOT', NULL, '1', '4', NULL, '0', '2020-10-30 10:50:05', NULL, '2020-10-30 10:50:43', NULL, 'Ready', NULL, '2', NULL, NULL, NULL, '1', 'QCOW2', NULL, NULL, NULL, 'thin'

Response when listing the volume:
"physicalsize": 234422272,
"utilization": "0.03",
"virtualsize": 8589934592,

I'm not sure whether I should mention another problem here that I hit while setting up CloudStack 4.13. I have to do more tests, but I couldn't create a guest network for a zone with basic networking. I saw that @GabrielBrascher added this functionality:

    /**
     * Encodes VLAN/VXLAN ID into a Broadcast URI according to the isolation method from the Physical Network.
     * @return Broadcast URI, e.g. 'vlan://vlan_ID' or 'vxlan://vlxan_ID'
     */
    protected URI encodeVlanIdIntoBroadcastUri(String vlanId, PhysicalNetwork pNtwk) {
        if (pNtwk == null) {
            throw new InvalidParameterValueException(String.format("Failed to encode VLAN/VXLAN %s into a Broadcast URI. Physical Network cannot be null.", vlanId));
        }

        if(StringUtils.isNotBlank(pNtwk.getIsolationMethods().get(0))) {
            String isolationMethod = pNtwk.getIsolationMethods().get(0).toLowerCase();
            String vxlan = BroadcastDomainType.Vxlan.toString().toLowerCase();
            if(isolationMethod.equals(vxlan)) {
                return BroadcastDomainType.encodeStringIntoBroadcastUri(vlanId, BroadcastDomainType.Vxlan);
            }
        }
        return BroadcastDomainType.fromString(vlanId);
    }

but the list of isolation methods was empty in my case (during debugging I set one element to an empty string and it worked).

@slavkap slavkap closed this Oct 30, 2020
@slavkap slavkap reopened this Oct 30, 2020
@slavkap (Contributor, Author) commented Oct 30, 2020

> Is there any reason to think we don't need to test local or managed storage, @slavkap?

I think it has to be tested with managed storage, but our storage does not support this functionality. With this change, I think managed/local storage will work as before; the only difference is that the GetVolumeStats task will list all clusters in the zone, even when their hypervisors are of the same type:

    for (final Cluster cluster : _clusterDao.listClustersByDcId(pool.getDataCenterId()))

and GetVolumeStatsCommand will be sent to all hosts, not only to the first one that returns an answer, as in the old implementation:


        for (HostVO neighbor : neighbors) {
            // apply filters:
            // - managed storage
            // - local storage
            if (storagePool.isManaged() || storagePool.isLocal()) {

                volumeLocators = getVolumesByHost(neighbor, storagePool);

            }

            // - zone wide storage for specific hypervisortypes
            if (ScopeType.ZONE.equals(storagePool.getScope()) && storagePool.getHypervisor() != neighbor.getHypervisorType()) {
                // skip this neighbour if their hypervisor type is not the same as that of the store
                continue;
            }

            GetVolumeStatsCommand cmd = new GetVolumeStatsCommand(poolType, poolUuid, volumeLocators);

            if (timeout > 0) {
                cmd.setWait(timeout/1000);
            }

            Answer answer = _agentMgr.easySend(neighbor.getId(), cmd);

            if (answer instanceof GetVolumeStatsAnswer){
                GetVolumeStatsAnswer volstats = (GetVolumeStatsAnswer)answer;
                return volstats.getVolumeStats();
            }
        }
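For contrast, a sketch of the flow after this change, with the same caveat (illustrative interfaces, not the actual CloudStack classes): every host with active volumes is queried, and the per-host answers are merged instead of returning at the first one.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PerHostDispatch {
        interface Agent {
            // Stand-in for sending a GetVolumeStatsCommand to one host.
            Map<String, Long> getVolumeStats(long hostId, List<String> volumePaths);
        }

        // volumesByHost: hostId -> paths of the volumes active on that host.
        static Map<String, Long> collect(Agent agent, Map<Long, List<String>> volumesByHost) {
            Map<String, Long> merged = new HashMap<>();
            for (Map.Entry<Long, List<String>> entry : volumesByHost.entrySet()) {
                if (entry.getValue().isEmpty()) {
                    continue; // nothing on this host, skip the round trip
                }
                Map<String, Long> answer = agent.getVolumeStats(entry.getKey(), entry.getValue());
                if (answer != null) {
                    merged.putAll(answer); // keep going; do not return at the first answer
                }
            }
            return merged;
        }
    }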

@sureshanaparti (Contributor)

> I've tested it now with non-managed NFS (zone-wide) and local (host-scoped) storage on 4.13; it works correctly without any failures, and when I get volume stats the logs show the right information. [...]

@slavkap thanks for sharing the test results, which confirm the changes are good with NFS and local storage. As local and managed storage took the same code path earlier, this should work for managed storage as well.

@DaanHoogland (Contributor)

Great, I'll merge and merge forward.

@DaanHoogland DaanHoogland modified the milestones: 4.15.0.0, 4.13.2.0 Oct 30, 2020
@DaanHoogland DaanHoogland merged commit 8afb451 into apache:4.13 Oct 30, 2020
@slavkap (Contributor, Author) commented Oct 30, 2020

Thank you all for the code review and your time!

@slavkap slavkap deleted the fix_volume_stats branch October 30, 2020 16:28
Pearl1594 pushed a commit to shapeblue/cloudstack that referenced this pull request Nov 4, 2020
Pearl1594 pushed a commit to shapeblue/cloudstack that referenced this pull request Nov 4, 2020