Conversation
- With the new Copy-on-Write changes, the guest virtual addresses and guest physical addresses are no longer identity-mapped, so we need a way to do this translation when a new guest trace batch arrives from the guest. Signed-off-by: Doru Blânzeanu <dblnz@pm.me>
- Use registers r12, r13 and r14 to pass trace data information from the guest to the host. This is in use temporarily, until the virtio queues are implemented. Signed-off-by: Doru Blânzeanu <dblnz@pm.me>
- For each call to `internal_dispatch_function` we now create a span manually, because instrumenting the function would be redundant given the call to `new_call` that resets tracing state for each new function call. The same logic is used for `generic_init` to enable it for the initialise call. Signed-off-by: Doru Blânzeanu <dblnz@pm.me>
Force-pushed from 78fb1fe to 6176db1.
Pull request overview
This PR fixes guest→host tracing breakage introduced by CoW by translating the guest-provided trace buffer pointer (GVA) via a page-table walk (using CR3) before decoding the FlatBuffer trace batch. It also refines guest-side spans around initialization/dispatch to improve trace usability and reduce mismatched span warnings.
Changes:
- Translate trace batch pointers from GVA→GPA on the host (CR3-driven page table walk) before decoding trace FlatBuffers.
- Update the guest↔host tracing ABI to pass trace batch metadata via r12/r13/r14 and adjust host handling accordingly.
- Improve guest tracing spans around `generic_init` and `internal_dispatch_function`, and enable `Debug` printing of `FunctionCall`.
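For context on what the CR3-driven page-table walk involves, the sketch below shows how a 4-level x86-64 walk decomposes a guest virtual address into table indices. This is background arithmetic only, not the Hyperlight implementation; the function name `pt_indices` is made up for illustration.

```rust
// Hedged sketch (not the Hyperlight code): index decomposition for a
// 4-level x86-64 page walk. Each table holds 512 entries, so each
// index is 9 bits wide; the low 12 bits are the offset in a 4 KiB page.
fn pt_indices(gva: u64) -> (usize, usize, usize, usize, usize) {
    let pml4 = ((gva >> 39) & 0x1ff) as usize; // index into the PML4 (root, from CR3)
    let pdpt = ((gva >> 30) & 0x1ff) as usize; // page-directory-pointer table
    let pd   = ((gva >> 21) & 0x1ff) as usize; // page directory
    let pt   = ((gva >> 12) & 0x1ff) as usize; // page table
    let off  = (gva & 0xfff) as usize;         // byte offset within the page
    (pml4, pdpt, pd, pt, off)
}

fn main() {
    // GVA 0x2000 lands in the first entry of every upper level,
    // page-table index 2, offset 0.
    assert_eq!(pt_indices(0x2000), (0, 0, 0, 2, 0));
    println!("ok");
}
```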
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/hyperlight_host/src/sandbox/trace/context.rs | Switch trace batch extraction to read via GVA translation and new from_regs API. |
| src/hyperlight_host/src/sandbox/snapshot.rs | Expose helpers for resolving GPAs and reading page tables from shared/scratch memory. |
| src/hyperlight_host/src/mem/mgr.rs | Add read_guest_memory_by_gva that walks page tables and reads from shared/scratch backing. |
| src/hyperlight_host/src/hypervisor/hyperlight_vm.rs | Pass CR3 into trace handling and adjust when trace handling is attempted. |
| src/hyperlight_guest_tracing/src/state.rs | Send trace batch pointer/len via r12/r13/r14 in the OUT instruction. |
| src/hyperlight_guest_bin/src/lib.rs | Add a generic_init span and flush/close ordering to avoid host span warnings. |
| src/hyperlight_guest_bin/src/guest_function/call.rs | Add an internal_dispatch_function span and trace log the FunctionCall in debug builds. |
| src/hyperlight_guest/src/exit.rs | Update OUT path to attach trace batch metadata via r12/r13/r14. |
| src/hyperlight_common/src/vmem.rs | Update comment noting virt_to_phys is now used for tracing reads. |
| src/hyperlight_common/src/flatbuffer_wrappers/function_call.rs | Derive Debug for FunctionCall to support tracing output. |
```rust
match self.vm.sregs().map(|s| s.cr3) {
    Ok(cr3) => {
        tc.handle_trace(&regs, mem_mgr, cr3).unwrap_or_else(|e| {
            tracing::error!("Cannot handle trace data: {}", e);
```
This path logs "Cannot handle trace data" at error level for any `tc.handle_trace` failure. Since `handle_trace` returns an error when the trace magic isn't present, this can become noisy on VM exits that don't carry trace batches. Consider either gating the call on `regs.r12 == OutBAction::TraceBatch as u64` or downgrading the expected "no trace batch" case to debug/trace logging.
Suggested change:
```diff
-tracing::error!("Cannot handle trace data: {}", e);
+tracing::debug!("Cannot handle trace data: {}", e);
```
```rust
/// Read guest memory at a Guest Virtual Address (GVA) by walking the
/// page tables to translate GVA → GPA, then reading from the correct
/// backing memory (shared_mem or scratch_mem).
///
/// This is necessary because with Copy-on-Write (CoW) the guest's
/// virtual pages may be backed by physical pages in the scratch
/// region rather than being identity-mapped.
///
/// # Arguments
/// * `gva` - The Guest Virtual Address to read from
/// * `len` - The number of bytes to read
/// * `root_pt` - The root page table physical address (CR3)
#[cfg(feature = "trace_guest")]
pub(crate) fn read_guest_memory_by_gva(
    &mut self,
    gva: u64,
    len: usize,
    root_pt: u64,
) -> Result<Vec<u8>> {
    use hyperlight_common::vmem::PAGE_SIZE;

    use crate::sandbox::snapshot::{SharedMemoryPageTableBuffer, access_gpa};

    let scratch_size = self.scratch_mem.mem_size();

    self.shared_mem.with_exclusivity(|snap| {
        self.scratch_mem.with_exclusivity(|scratch| {
            let pt_buf = SharedMemoryPageTableBuffer::new(snap, scratch, scratch_size, root_pt);

            // Walk page tables to get all mappings that cover the GVA range
            let mappings: Vec<_> = unsafe {
                hyperlight_common::vmem::virt_to_phys(&pt_buf, gva, len as u64)
            }
            .collect();

            if mappings.is_empty() {
                return Err(new_error!(
                    "No page table mappings found for GVA {:#x} (len {})",
                    gva,
                    len,
                ));
            }

            let mut result = Vec::with_capacity(len);

            for mapping in &mappings {
                // Calculate the offset within this page for the first byte we want
                let page_offset = if mapping.virt_base <= gva {
                    (gva - mapping.virt_base) as usize
                } else {
                    0
                };

                let bytes_remaining = len - result.len();
                let available_in_page = PAGE_SIZE - page_offset;
                let bytes_to_copy = bytes_remaining.min(available_in_page);

                // Translate the GPA to host memory
                let gpa = mapping.phys_base + page_offset as u64;
                let (mem, offset) = access_gpa(snap, scratch, scratch_size, gpa)
                    .ok_or_else(|| {
                        new_error!(
                            "Failed to resolve GPA {:#x} to host memory (GVA {:#x})",
                            gpa,
                            gva
                        )
                    })?;

                let slice = mem
                    .as_slice()
                    .get(offset..offset + bytes_to_copy)
                    .ok_or_else(|| {
                        new_error!(
                            "GPA {:#x} resolved to out-of-bounds host offset {} (need {} bytes)",
                            gpa,
                            offset,
                            bytes_to_copy
                        )
                    })?;

                result.extend_from_slice(slice);
            }

            if result.len() != len {
                return Err(new_error!(
                    "Could not read full GVA range: got {} of {} bytes",
                    result.len(),
                    len
                ));
            }

            Ok(result)
        })
```
New page-table-walk-based trace reading logic is introduced here, but there’s no unit test coverage validating cross-page reads, unmapped gaps, and scratch-vs-snapshot backing resolution. Since this file already has paging-related tests, consider adding tests (under appropriate feature gating) that exercise read_guest_memory_by_gva for both identity-mapped and CoW/scratch-backed pages.
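A test along these lines could start from a flat byte buffer standing in for guest memory. The sketch below is illustrative only and does not call the real `read_guest_memory_by_gva`; the `read_across_pages` helper and the identity mapping are assumptions, but the page-bounded chunking it asserts on mirrors the loop in the hunk above.

```rust
const PAGE_SIZE: usize = 4096;

// Hypothetical helper standing in for a GVA read over identity-mapped
// pages: copy `len` bytes starting at `gva`, one page-bounded chunk at a
// time, returning None if the range runs off the end of memory.
fn read_across_pages(mem: &[u8], gva: usize, len: usize) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(len);
    let mut cur = gva;
    while out.len() < len {
        let page_offset = cur % PAGE_SIZE;
        // Never copy past the end of the current page in one step.
        let chunk = (len - out.len()).min(PAGE_SIZE - page_offset);
        out.extend_from_slice(mem.get(cur..cur + chunk)?);
        cur += chunk;
    }
    Some(out)
}

fn main() {
    // Two pages of memory with a recognizable byte pattern.
    let mem: Vec<u8> = (0..2 * PAGE_SIZE).map(|i| (i % 251) as u8).collect();
    // A 32-byte read straddling the first page boundary must splice
    // both pages together correctly.
    let got = read_across_pages(&mem, PAGE_SIZE - 16, 32).unwrap();
    assert_eq!(got, &mem[PAGE_SIZE - 16..PAGE_SIZE + 16]);
    // A read past the end of memory must fail rather than truncate.
    assert!(read_across_pages(&mem, 2 * PAGE_SIZE - 8, 32).is_none());
    println!("ok");
}
```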
```rust
/// in register r9. With Copy-on-Write enabled, this GVA may not be
/// identity-mapped to its physical address, so we walk the guest page
/// tables to translate GVA → GPA before reading the data.
///
/// # Arguments
/// * `regs` - The guest registers (r8 = magic, r9 = GVA pointer, r10 = length)
```
The rustdoc for EventsBatch::from_regs still refers to r8/r9/r10 and says the GVA pointer is in r9, but the implementation now reads magic/pointer/len from r12/r13/r14. Please update the doc comment (including the # Arguments bullet) to match the new register convention so callers don’t mis-wire the ABI again later.
Suggested change:
```diff
-/// in register r9. With Copy-on-Write enabled, this GVA may not be
+/// in register r13. With Copy-on-Write enabled, this GVA may not be
 /// identity-mapped to its physical address, so we walk the guest page
 /// tables to translate GVA → GPA before reading the data.
 ///
 /// # Arguments
-/// * `regs` - The guest registers (r8 = magic, r9 = GVA pointer, r10 = length)
+/// * `regs` - The guest registers (r12 = magic, r13 = GVA pointer, r14 = length)
```
```rust
pub(crate) fn read_guest_memory_by_gva(
    &mut self,
    gva: u64,
    len: usize,
    root_pt: u64,
) -> Result<Vec<u8>> {
    // ... (page-table walk elided; same hunk as quoted above) ...
    let mut result = Vec::with_capacity(len);
```
read_guest_memory_by_gva allocates Vec::with_capacity(len) using a length provided by the guest (r14). A malicious/buggy guest can request an arbitrarily large len and cause host OOM. Consider enforcing a strict upper bound (e.g., a max trace batch size) before allocating, and return an error if exceeded.
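A minimal sketch of such a bound check, assuming a `MAX_TRACE_BATCH_SIZE` constant (the 4 KiB value is taken from the author's reply later in this thread; the helper name is made up):

```rust
// Hedged sketch: validate a guest-supplied length against a fixed cap
// before allocating on the host. MAX_TRACE_BATCH_SIZE is an assumed
// constant, not an existing Hyperlight symbol.
const MAX_TRACE_BATCH_SIZE: usize = 4096;

fn bounded_trace_len(len: usize) -> Result<usize, String> {
    if len > MAX_TRACE_BATCH_SIZE {
        // Reject before Vec::with_capacity ever sees the guest value.
        return Err(format!(
            "trace batch length {} exceeds maximum {}",
            len, MAX_TRACE_BATCH_SIZE
        ));
    }
    Ok(len)
}

fn main() {
    assert!(bounded_trace_len(1024).is_ok());
    assert!(bounded_trace_len(1 << 20).is_err()); // a 1 MiB request is rejected
    println!("ok");
}
```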
```rust
// to correct execution of the guest
match self.vm.sregs().map(|s| s.cr3) {
    Ok(cr3) => {
        tc.handle_trace(&regs, mem_mgr, cr3).unwrap_or_else(|e| {
```
virt_to_phys expects root_pt to be a page-aligned PML4 base, but the value passed here comes directly from sregs.cr3 and may include low flag/PCID bits. The guest-side root_table() masks with & !0xfff; the host should do the same (or use a shared helper) before walking page tables, otherwise translation can read the wrong entries.
Suggested change:
```diff
-tc.handle_trace(&regs, mem_mgr, cr3).unwrap_or_else(|e| {
+let root_pt = cr3 & !0xfff_u64;
+tc.handle_trace(&regs, mem_mgr, root_pt).unwrap_or_else(|e| {
```
```rust
// This span captures the internal dispatch function only, without tracing internals.
// Close the span before flushing to ensure that the `flush` call is not included in the span.
// NOTE: This is necessary to avoid closing the span twice. Flush closes all the open
// spans when preparing to close a guest function call context.
// It is not mandatory, but it avoids a warning on the host about a span
// that has not been opened but is being closed.
_entered.exit();

// Ensure that any tracing output during the call is flushed to
// the host, if necessary.
hyperlight_guest_tracing::flush();
}
```
Could this be factored out to not be duplicated here and in lib.rs?
not really, we need `flush` to be right before the call to `hlt` so that no other spans/logs are added.
```rust
#[cfg(all(feature = "trace_guest", target_arch = "x86_64"))]
hyperlight_guest_tracing::flush();
{
    // This span captures the internal dispatch function only, without tracing internals.
```
I think this comment should say `generic_init` as opposed to `internal_dispatch_function`, maybe?
```rust
match result.as_ref() {
    // On the Halt exit we don't receive any trace information
    Ok(VmExit::Halt()) => {}
    _ => {
```
nit: could be simplified slightly to something like

```rust
if !matches!(result.as_ref(), Ok(VmExit::Halt())) {
    let regs = self.vm.regs().map_err(RunVmError::GetRegs)?;
    if let Ok(cr3) = self.vm.sregs().map(|s| s.cr3) {
        tc.handle_trace(&regs, mem_mgr, cr3).unwrap_or_else(|e| {
            tracing::error!("Cannot handle trace data: {}", e);
        });
    } else {
        tracing::error!("Failed to get sregs for trace handling");
    }
}
```

```rust
/// tables to translate GVA → GPA before reading the data.
///
/// # Arguments
/// * `regs` - The guest registers (r8 = magic, r9 = GVA pointer, r10 = length)
```
I think this comment is using outdated registers.
```rust
/// backing memory (shared_mem or scratch_mem).
///
/// This is necessary because with Copy-on-Write (CoW) the guest's
/// virtual pages may be backed by physical pages in the scratch
```
```rust
        ));
    }

    let mut result = Vec::with_capacity(len);
```
Is this length controlled by the guest? Coming from context.rs:

```rust
let trace_data_len = regs.r14 as usize;
let buf = mem_mgr.read_guest_memory_by_gva(trace_data_gva, trace_data_len, root_pt)?;
```
very good point, we need to have some max size we allow for sure
yeah, I think I'll limit it at 4K for now, but in the future we'd be able to shrink it because we would use the virtio queues and we wouldn't care about this. We could even not batch them at all; the queue would do that for us.
```rust
Ok(VmExit::Halt()) => {}
_ => {
    // Handle the guest trace data if any
    let regs = self.vm.regs().map_err(RunVmError::GetRegs)?;
```
since we know the outb action (`in("r12") OutBAction::TraceBatch as u64`), could we skip doing this for every instance with something like:

```rust
if regs.r12 == OutBAction::TraceBatch as u64 {
    // do all the tracing stuff here
}
```
```rust
// Walk page tables to get all mappings that cover the GVA range
let mappings: Vec<_> = unsafe {
    hyperlight_common::vmem::virt_to_phys(&pt_buf, gva, len as u64)
```
This length isn't necessarily aligned to a page, I think?
This isn't in your code, but I think this might expose an issue where, if this length goes past one page, it could truncate the result and not give you the right mappings. We are missing a unit test that covers that scenario in vmem.
You could do `len as u64 + page_offset`, but we should probably fix it in vmem so that we don't accidentally misuse this function again.
I'll add some tests for this
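The boundary case being discussed can be made concrete: an unaligned range of `len` bytes can touch one more page than `len / PAGE_SIZE` suggests. A minimal sketch of the page-count arithmetic (not the vmem code itself; `pages_covered` is a made-up name):

```rust
const PAGE_SIZE: u64 = 4096;

// Number of pages the half-open range [gva, gva + len) touches.
fn pages_covered(gva: u64, len: u64) -> u64 {
    if len == 0 {
        return 0;
    }
    let first = gva / PAGE_SIZE;
    let last = (gva + len - 1) / PAGE_SIZE;
    last - first + 1
}

fn main() {
    // A page-aligned 4 KiB read touches exactly one page...
    assert_eq!(pages_covered(0x1000, 4096), 1);
    // ...but a 32-byte read starting 16 bytes before a page boundary
    // touches two, which is why walking `len` alone (without accounting
    // for the starting page offset) can truncate the mapping list.
    assert_eq!(pages_covered(0x1ff0, 32), 2);
    println!("ok");
}
```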
Description
This PR fixes #1242.
The guest tracing data received by the host assumed identity-mapped addresses; the address now needs to be translated to correctly decode the FlatBuffers data.
This can be reviewed commit by commit.
The first commit tackles this issue, the other two improve on the spans and usability.