Skip to content

Conversation

@slavkap
Copy link
Contributor

@slavkap slavkap commented Mar 9, 2020

Description

When volume or storage pool are not found when LibvirtGetVolumeStatsCommandWrapper.java, CloudRuntimeException is thrown. All methods invoked by this command are throwing in most of the cases CloudRuntimeException and it's not handled in LibvirtGetVolumeStatsCommandWrapper.

For example if volume is not found, CloudStack agent throws this exception and it fills up the agent.log

DB record:
'29', '2', '1', '1', '2', NULL, NULL, 'nfsDisk', 'd644f5c6-bf70-4826-8d2b-d4bc373a9198', '5368709120', '/export/primary', 'efba7e17-dddd-4bbb-8233-347c9a5faaff', '1', '1', NULL, NULL, 'DATADISK', NULL, '3', NULL, NULL, '0', '2020-01-28 08:56:00', NULL, '2020-02-19 12:21:53', NULL, 'Ready', NULL, '21', NULL, NULL, NULL, '1', 'QCOW2', NULL, NULL, NULL, 'thin'

2020-03-06 17:38:06,775 DEBUG [kvm.storage.LibvirtStoragePool] (agentRequest-Handler-1:null) (logid:4858b218) volume: d644f5c6-bf70-4826-8d2b-d4bc373a9198 not exist on storage pool
2020-03-06 17:38:06,776 WARN [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:4858b218) Caught:
com.cloud.utils.exception.CloudRuntimeException: Can't find volume:d644f5c6-bf70-4826-8d2b-d4bc373a9198
at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:149)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.getVolumeStat(LibvirtGetVolumeStatsCommandWrapper.java:63)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.execute(LibvirtGetVolumeStatsCommandWrapper.java:52)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.execute(LibvirtGetVolumeStatsCommandWrapper.java:40)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:78)
at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1483)
at com.cloud.agent.Agent.processRequest(Agent.java:640)
at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1053)
at com.cloud.utils.nio.Task.call(Task.java:83)
at com.cloud.utils.nio.Task.call(Task.java:29)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

If we handle the exception we will receive in management logs information like this for example:

DEBUG [c.c.a.m.AgentManagerImpl] (StatsCollector-2:ctx-bc39a30f) (logid:cd207a3e) Details from executing class com.cloud.agent.api.GetVolumeStatsCommand: Can't get vm disk stats: Unable to attach volume a43deb8f-b8b5-4e73-9043-76298cacd27b. Error: Error: volume 'a43deb8f-b8b5-4e73-9043-76298cacd27b' does not exist.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • [ X] Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

tested with volumes on NFS and StorPool
on CloudStack version 4.11.2.0, 4.13.0.0 and master

@DaanHoogland
Copy link
Contributor

@slavkap can you rebase your commit on latest head of the 4.13 branch, please?

When you migrate volume between data stores CS keeps the original UUID and changes the path of the volume.
When volume is not found by the given path the agent throws CloudRuntimeException but it's not catched in LibvirtGetVolumeStatsCommandWrapper.java
@slavkap slavkap force-pushed the fix_CRE_of_migrated_volume_to_nfs branch from f72ac63 to da33574 Compare March 9, 2020 12:42
DaanHoogland
DaanHoogland previously approved these changes Mar 9, 2020
Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1035

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-1223)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 26719 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3949-t1223-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 76 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_vpc_privategw_static_routes Failure 172.24 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 171.98 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Failure 233.22 test_privategw_acl.py

@Spaceman1984
Copy link
Contributor

Spaceman1984 commented Mar 12, 2020

For testing, I changed the global setting volume.stats.interval to 10000 to make stats run every 10 seconds.

I did not get the exception in the description after migrating storage. A different null pointer was thrown every time stats ran, but as soon as the migration was finished the null pointer was no longer thrown.

To reproduce the runtime exception, I just removed the file from where it was in primary storage.

This PR would hide the runtime exception which I feel is not the desired behavior here if you would agree @slavkap ?

@slavkap
Copy link
Contributor Author

slavkap commented Mar 12, 2020

@Spaceman1984 thanks that you spend time to test it! My bad that I didn't pull the latest changes and the fix is from #3884 where mgmt sends the path rather than the UUID of the volume. Unfortunately I couldn't reproduce the NPE that you hit.
I have tested it with the latest changes and in this case you wouldn't get the runtime exception. I think that the exception is not hidden because we send it back to the management for example like this:
DEBUG [c.c.a.m.AgentManagerImpl] (StatsCollector-2:ctx-bc39a30f) (logid:cd207a3e) Details from executing class com.cloud.agent.api.GetVolumeStatsCommand: Can't get vm disk stats: Unable to attach volume a43deb8f-b8b5-4e73-9043-76298cacd27b. Error: Error: volume 'a43deb8f-b8b5-4e73-9043-76298cacd27b' does not exist.
In such cases when CloudRuntimeException is thrown we won't fill up the agent's log with the stack trace

@Spaceman1984
Copy link
Contributor

This is the null pointer on the agent while migrating a volume, happens when stats are run.

java.lang.NullPointerException
	at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:128)
	at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.getVolumeStat(LibvirtGetVolumeStatsCommandWrapper.java:64)
	at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.execute(LibvirtGetVolumeStatsCommandWrapper.java:53)
	at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.execute(LibvirtGetVolumeStatsCommandWrapper.java:41)
	at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:78)
	at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1500)
	at com.cloud.agent.Agent.processRequest(Agent.java:640)
	at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1053)
	at com.cloud.utils.nio.Task.call(Task.java:83)
	at com.cloud.utils.nio.Task.call(Task.java:29)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2020-03-13 09:03:17,681 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:33b84d86) Seq 1-3499859860420296912:  { Ans: , MgmtId: 2130706689, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NullPointerException\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:128)\n\tat com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.getVolumeStat(LibvirtGetVolumeStatsCommandWrapper.java:64)\n\tat com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.execute(LibvirtGetVolumeStatsCommandWrapper.java:53)\n\tat com.cloud.hypervisor.kvm.resource.wrapper.LibvirtGetVolumeStatsCommandWrapper.execute(LibvirtGetVolumeStatsCommandWrapper.java:41)\n\tat com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:78)\n\tat com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1500)\n\tat com.cloud.agent.Agent.processRequest(Agent.java:640)\n\tat com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1053)\n\tat com.cloud.utils.nio.Task.call(Task.java:83)\n\tat com.cloud.utils.nio.Task.call(Task.java:29)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n","wait":0}}] }

I have also configured nfs in /etc/exports to be 2 separate entries:

/export *(rw,async,no_root_squash,no_subtree_check)
/alternate *(rw,async,no_root_squash,no_subtree_check)

I will test #3884 merged with this PR for the error on the management server, I haven't seen it yet.

@DaanHoogland DaanHoogland dismissed their stale review March 13, 2020 08:35

lgtm was purely code related, ignorant of a functional problem being there

@Spaceman1984
Copy link
Contributor

I have tested #3884 with the code from this PR, I still get the null pointer while migrating a volume and stats run.

I wasn't able to reproduce the error in the description after migrating storage.

These are the steps I followed:

  • Change volume.stats.interval global setting to 10000
  • Create an instance
  • Stop the instance
  • Migrate the volume to a different primary storage
  • Check the management server log for the error.

Results:
No runtime exception

The next step I followed was:

  • Delete the file from primary storage

Result:
I have verified the error in management server logs:

2020-03-13 16:54:51,585 DEBUG [c.c.a.m.AgentManagerImpl] (StatsCollector-6:ctx-96e6db97) (logid:24749a89) Details from executing class com.cloud.agent.api.GetVolumeStatsCommand: Can't get vm disk stats: Can't find volume:76071844-5bab-42cc-a656-9c47cf0b7797

Seems like this PR improves logging of the error when a volume is missing from the storage pool, not when a volume was migrated.

@slavkap
Copy link
Contributor Author

slavkap commented Mar 16, 2020

I will change the description, because when I have created this PR I wasn't updated my local repo with the latest changes with #3884. The CloudRuntimeException could be thrown in a lot of cases when volume stats command is executed, but I have tested it only if the volume is missing(migrated between datastores before the change from 3884). You can get it also, if for some reason can't get storage pool.

About the NPE if I find time I could check it, but this change doesn't causes it.

@Spaceman1984
Copy link
Contributor

@slavkap, I agree the NPE is not caused by this change.

@slavkap slavkap changed the title Fix CloudRuntimeException of migrated volume for Volume stats Fix: catch CloudRuntimeException in LibvirtGetVolumeStatsCommandWrapper.java Mar 19, 2020
@Spaceman1984
Copy link
Contributor

LGTM

@DaanHoogland
Copy link
Contributor

@andrijapanicsb pure bug fix, do we include it given blockers are still out?

@yadvr yadvr added this to the 4.13.2.0 milestone Jun 12, 2020
@yadvr yadvr merged commit 8b234bd into apache:4.13 Jun 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants