Skip to content

Conversation

@FelipeM525
Copy link
Contributor

@FelipeM525 FelipeM525 commented Sep 24, 2024

Description

This PR restricts users from migrating volumes attached to VMs that are in starting state. Migrating volumes while their VMs are starting shouldn't be allowed because sometimes the VM could end up not actually starting, and if ACS starts migrating its volume, this could lead to other issues.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

Image

migrate-volume

How Has This Been Tested?

I tested this PR by deploying an instance and then trying to migrate its volume while the instance was starting. Before the changes, the volume migration process would start. After the migration, an error is thrown, as expected.

…lume that is attached to a vm in starting mode from being migrated
@codecov
Copy link

codecov bot commented Sep 24, 2024

Codecov Report

Attention: Patch coverage is 0% with 9 lines in your changes missing coverage. Please review.

Project coverage is 15.85%. Comparing base (9df783c) to head (7903aa5).
Report is 81 commits behind head on 4.19.

Files with missing lines Patch % Lines
...n/java/com/cloud/storage/VolumeApiServiceImpl.java 0.00% 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19    #9725      +/-   ##
============================================
+ Coverage     15.11%   15.85%   +0.73%     
- Complexity    11184    11269      +85     
============================================
  Files          5402     5042     -360     
  Lines        473120   444385   -28735     
  Branches      58327    52630    -5697     
============================================
- Hits          71507    70441    -1066     
+ Misses       393812   366072   -27740     
- Partials       7801     7872      +71     
Flag Coverage Δ
uitests ?
unittests 15.85% <0.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@JoaoJandre JoaoJandre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLGTM

@JoaoJandre
Copy link
Contributor

Is there any other state that we should be checking besides starting? migrating, for example?

@DaanHoogland
Copy link
Contributor

Is there any other state that we should be checking besides starting? migrating, for example?

Good point, though this is already an improvement;

        Starting(true, "VM is being started.  At this state, you should find host id filled which means it's being started on that host."),
        Running(false, "VM is running.  host id has the host that it is running on."),
        Stopping(true, "VM is being stopped.  host id has the host that it is being stopped on."),
        Stopped(false, "VM is stopped.  host id should be null."),
        Destroyed(false, "VM is marked for destroy."),
        Expunging(true, "VM is being   expunged."),
        Migrating(true, "VM is being migrated.  host id holds to from host"),
        Error(false, "VM is in error"),
        Unknown(false, "VM state is unknown."),
        Shutdown(false, "VM state is shutdown from inside"),
        Restoring(true, "VM is being restored from backup");

I think (at least) Stopping, Destroyed, Expunging, Restoring are good candidate VirtualMachine.States to include. Or in reverse, only Running, Stopped, Shutdown and maybe Error or Unknown are good vm states to migrate volumes for.

Copy link
Member

@bernardodemarco bernardodemarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I manually tested the PR in a local environment.

I verified that CloudStack doesn't allow the operator to migrate a volume when the virtual machine related to it is starting.

image

root@cloudstack:~# grep 'logid:d36d1bbc' /var/log/cloudstack/management/management-server.log
2025-01-03 12:36:08,677 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-11:[ctx-8017eead, job-200]) (logid:d36d1bbc) Executing AsyncJobVO: {id:200, userId: 2, accountId: 2, instanceType: Volume, instanceId: 27, cmd: org.apache.cloudstack.api.command.admin.volume.MigrateVolumeCmdByAdmin, cmdInfo: {"response":"json","ctxUserId":"2","sessionkey":"7FhCQRxwVcXt_poXEs3mLf8akmA","volumeid":"30b7f14e-1f8f-4e9c-b364-a9887bc254f4","httpmethod":"GET","ctxStartEventId":"398","ctxDetails":"{\"interface com.cloud.storage.StoragePool\":\"10d28cdf-71a7-33ad-802e-f4ec9042e4fd\",\"interface com.cloud.storage.Volume\":\"30b7f14e-1f8f-4e9c-b364-a9887bc254f4\"}","ctxAccountId":"2","livemigrate":"true","cmdEventType":"VOLUME.MIGRATE","storageid":"10d28cdf-71a7-33ad-802e-f4ec9042e4fd"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 227341783673585, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2025-01-03 12:36:34,413 DEBUG [c.c.s.VolumeApiServiceImpl] (API-Job-Executor-11:[ctx-8017eead, job-200, ctx-ccb65616]) (logid:d36d1bbc) Unable to migrate volume: [ROOT-24] Id: [27] because the VM: [i-2-24-VM] Id: [6375962a-980e-429e-ba7a-8f164a74b597] has not started yet.
2025-01-03 12:36:37,885 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-11:[ctx-8017eead, job-200]) (logid:d36d1bbc) Unexpected exception while executing org.apache.cloudstack.api.command.admin.volume.MigrateVolumeCmdByAdmin com.cloud.utils.exception.CloudRuntimeException: Volume migration is not allowed while the VM is starting.
2025-01-03 12:36:37,888 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-11:[ctx-8017eead, job-200]) (logid:d36d1bbc) Complete async job-200, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Volume migration is not allowed while the VM is starting."}

@DaanHoogland
Copy link
Contributor

thanks @bernardodemarco

@JoaoJandre , do you agree that this PR is good to go, even when more improvements are possible?

@JoaoJandre
Copy link
Contributor

@DaanHoogland considering it's an easy change, I don't see why introduce this and change it later.
Since @FelipeM525 is no longer working on this PR: @bernardodemarco could you take care of it?

@bernardodemarco
Copy link
Member

@DaanHoogland considering it's an easy change, I don't see why introduce this and change it later. Since @FelipeM525 is no longer working on this PR: @bernardodemarco could you take care of it?

Yes, no problem. I'll update the verification to only allow the migration to be performed when the VM is Running, Stopped or Shutdown.

@bernardodemarco bernardodemarco self-assigned this Jan 3, 2025
@bernardodemarco
Copy link
Member

Yes, no problem. I'll update the verification to only allow the migration to be performed when the VM is Running, Stopped or Shutdown.

@DaanHoogland, @rohityadavcloud, @JoaoJandre I've just applied the changes.

root@cloudstack:~# grep -a 'logid:bf08211b' /var/log/cloudstack/management/management-server.log 
2025-01-06 12:13:48,480 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-7:ctx-8253b178 job-141) (logid:bf08211b) Executing AsyncJobVO: {id:141, userId: 2, accountId: 2, instanceType: Volume, instanceId: 21, cmd: org.apache.cloudstack.api.command.admin.volume.MigrateVolumeCmdByAdmin, cmdInfo: {"response":"json","ctxUserId":"2","sessionkey":"07_Zt7hC_aogIMgiM2EDW7Hs0xk","volumeid":"8dfa4a5e-2e0f-4c57-ad1e-2a19b6a3c6b9","httpmethod":"GET","ctxStartEventId":"273","ctxDetails":"{\"interface com.cloud.storage.Volume\":\"8dfa4a5e-2e0f-4c57-ad1e-2a19b6a3c6b9\",\"interface com.cloud.storage.StoragePool\":\"a0566c34-a49d-31bc-865f-3b57806131d6\"}","ctxAccountId":"2","livemigrate":"false","cmdEventType":"VOLUME.MIGRATE","storageid":"a0566c34-a49d-31bc-865f-3b57806131d6"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 227341783673585, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2025-01-06 12:13:48,488 DEBUG [c.c.s.VolumeApiServiceImpl] (API-Job-Executor-7:ctx-8253b178 job-141 ctx-fe58e892) (logid:bf08211b) Unable to migrate volume: [ROOT-19] Id: [21] because the VM: [i-2-19-VM] Id: [9228084c-7e2f-4e08-b40d-99a96c2ec23a] is in state [Stopping], which is not supported for migration.
2025-01-06 12:13:48,493 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-7:ctx-8253b178 job-141) (logid:bf08211b) Unexpected exception while executing org.apache.cloudstack.api.command.admin.volume.MigrateVolumeCmdByAdmin
2025-01-06 12:13:48,496 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-7:ctx-8253b178 job-141) (logid:bf08211b) Complete async job-141, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Volume migration is not allowed when the VM is in the Stopping state. Supported states are: [Stopped, Running, Shutdown]."}

Screenshot from 2025-01-06 09-14-02

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12002

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-12056)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 45617 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9725-t12056-kvm-ol8.zip
Smoke tests completed. 132 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_secure_vm_migration Error 134.42 test_vm_life_cycle.py
test_01_secure_vm_migration Error 134.43 test_vm_life_cycle.py

@DaanHoogland DaanHoogland merged commit 21416cd into apache:4.19 Jan 8, 2025
23 of 26 checks passed
DaanHoogland added a commit that referenced this pull request Jan 8, 2025
* 4.19:
  Restrict the migration of volumes attached to VMs in Starting state (#9725)
DaanHoogland added a commit that referenced this pull request Jan 8, 2025
* 4.20:
  merge errors fixed
  Restrict the migration of volumes attached to VMs in Starting state (#9725)
  server, plugin: enhance storage stats for IOPS (#10034)
  Introducing granular command timeouts global setting (#9659)
  Improve logging to include more identifiable information (#9873)
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants