LDEV-6275 Fix dump on Lucee 6.2 + native getMetadata + exit cleanup #2

Open
zspitzer wants to merge 13 commits into main from fix/dump-crashes-lucee6

Summary

Fixes three stacked bugs discovered via a user report of Dump crashing
Lucee 6.2 in the VS Code CFML debugger.

LDEV: LDEV-6275

Bug 1 — Lucee 6.2 agent: dump kills the JVM

Agent jar is built against jakarta-servlet; 6.2 is still javax.
pc.getServletConfig() in the dump worker throws NoSuchMethodError,
caught as Throwable, then System.exit(1) takes the JVM with it.

Fix: ephemeralPageContextFromOther no longer calls getServletConfig().
It uses ClassUtil.callStaticMethod on ThreadUtil.createPageContext
(the pattern already in use by extension-websocket / dk). ConfigWeb
is servlet-API-agnostic — one agent jar works on both javax (6.2) and
jakarta (7.x) runtimes.

Bug 2 — Lucee 7.1 native: Inspect Metadata fails

NativeLuceeVm was loading lucee.runtime.functions.system.GetMetaData
via cl.loadClass, but lucee.core doesn't self-import
lucee.runtime.functions.system over OSGi.

Fix: BIF calls now resolve via ClassUtil.loadBIF(pc, shortName) — the
FunctionLib fallback bypasses bundle-import issues entirely. Same change
for SerializeJSON used by both getMetadata and the native dumpAsJSON.

Bug 3 — System.exit(1) sprinkled through runtime paths

The dump crash was the most visible symptom of a wider pattern. 41 exits
across the agent code; many in runtime paths (steps, stack pop, bytecode
rewrite, JDWP event pump, BIF-evaluator detection). Any Throwable → dead
JVM.

Fix: ~30 runtime sites now log + continue with enough context to be
actionable (op name, thread/class ids, consequence). The 11 startup/init
sites (classloader sanity, premain, socket bind, instrumentation helper
method lookup) stay as System.exit — if those fail the debugger is
fundamentally broken and failing loud is correct.

Test coverage added

  • DumpTest.cfc: dump / dumpAsJSON struct + array + JSON round-trip
    • server-survives-dump regression guard
  • MetadataTest.cfc: native getMetadata returns a struct; agent mode
    returns the static "not supported in JDWP mode" marker
  • DapClient.cfc: dump / dumpAsJSON / getMetadata wrappers

Test plan

  • CI matrix green on all three lanes:
    • Lucee 6.2 (agent) — was broken, now green
    • Lucee 7.0 (agent) — stays green
    • Lucee 7.1 (native) — was broken on MetadataTest, now green
  • Manual: right-click a var in VARIABLES → Dump on a Lucee 6.2
    debuggee, confirm content comes back and server stays up
  • Manual: right-click a var → Inspect Metadata on a Lucee 7.1
    native debuggee, confirm JSON metadata comes back

zspitzer added 13 commits April 19, 2026 13:03
Adds DapClient.cfc wrappers for the dump/dumpAsJSON custom JSON requests
and a DumpTest.cfc suite covering struct, array, JSON round-trip, and a
server-survives-dump regression guard for the System.exit(1) in the dump
catch blocks.

Expected to fail on agent-mode Lucee 6.2 (JVM kill cascades through the
rest of the run) and pass on jakarta Lucee 7.x — intentional red phase
before the javax/jakarta servlet-API fix.

Swap TestBox toInclude/notToBe for CFML 'contains' operator on the
dump content checks. Drop failure messages — TestBox auto-dumps the
actual value when there's no message, which is more useful than a
static string, so the systemOutput diagnostic lines become redundant.

Native mode returns JSON-serialized GetMetaData result; agent/JDWP mode
returns the static 'not supported in JDWP mode' marker string. Guards
both paths plus valid-JSON shape.

The 'Error:' substring check was brittle (false positives on metadata
that legitimately contains the word) and 'content contains X' reduces
to a boolean before the matcher, so TestBox can't auto-dump the
offending content on failure. Keep: isJSON + agent-mode marker literal
+ server-survives. If GetMetaData throws for a plain struct on native,
the fallback is still valid JSON - that's a separate discussion.

Tighten the native assertion: deserialized response must be a struct.
If GetMetaData throws or doGetMetadataWithPageContext falls back to a
JSON string ("Error: ...", "getMetadata failed", "No PageContext"),
this test will fail - which is what we want to investigate.

systemOutput the raw content each run since 'isJSON' and 'toBeTypeOf'
lose the value in the matcher chain.

GetMetaData lives in lucee.runtime.functions.system, which lucee.core
does not self-import via OSGi - so cl.loadClass on the PageContext's
bundle classloader fails with:
  Class 'lucee.runtime.functions.system.GetMetaData' was not found
  because bundle lucee.core does not import 'lucee.runtime.functions.system'

loadBIF is the loader-level API built for this: it resolves BIF classes
across bundle boundaries and returns a typed BIF reference whose invoke
is a normal interface call, not reflection. Pattern matches
extension-websocket and other sibling extensions.

Applied to getMetadata (GetMetaData + SerializeJSON) and the dumpAsJSON
branch of doDumpWithPageContext (SerializeJSON). ThreadLocalPageContext,
DumpUtil, HTMLDumpWriter etc. stay on cl.loadClass - they're not BIFs,
they're runtime utility classes and currently resolve fine.

The previous commit passed the fully-qualified class name to loadBIF.
ClassUtilImpl.loadBIF tries loadClass first (same OSGi bundle issue)
and only falls through to FunctionLib lookup when that returns null.
The short name bypasses the classloader step entirely - FunctionLib is
the right resolver for BIFs and doesn't care about bundle imports.

Four System.exit(1) sites in doDump and doDumpAsJSON would take the
whole Lucee process down on any Throwable from the dump worker thread
or from thread.join. A debugger feature that kills the host on failure
is indefensible.

Keep printStackTrace so the error is still visible, let the preset
result.value fallback strings ('...something went wrong when calling
writeDump(...)' / 'Something went wrong when calling serializeJSON(...)')
get returned to the DAP client. The containment fix on its own doesn't
make dump work on Lucee 6.2 - it just stops the Lucee 6.2 crash from
being catastrophic.
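
A minimal sketch of the containment pattern described above, not the agent's actual code. The result is preset to a fallback string, a Throwable from the worker or from join is printed rather than escalated to System.exit, and the fallback is what gets returned on failure. The names DumpContainment and runContained are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class DumpContainment {
    // Hypothetical helper: runs `work` on a worker thread and returns its result,
    // or `fallback` if the worker (or the join) fails with any Throwable.
    static String runContained(Supplier<String> work, String fallback) {
        // Preset the fallback so a failed worker leaves something to return.
        AtomicReference<String> result = new AtomicReference<>(fallback);
        Thread worker = new Thread(() -> {
            try {
                result.set(work.get());
            } catch (Throwable t) {
                // Keep the error visible, but do not take the host JVM down.
                t.printStackTrace();
            }
        }, "dump-worker");
        worker.setDaemon(true);
        try {
            worker.start();
            worker.join();
        } catch (Throwable t) {
            t.printStackTrace();
            if (t instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
        }
        return result.get();
    }
}
```

With this shape, a NoSuchMethodError thrown inside the worker surfaces as the fallback string in the DAP response instead of ending the process.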

Drop the direct pc.getServletConfig() call from ephemeralPageContextFromOther
- that method's return type flipped from javax to jakarta between Lucee 6
and 7, and the agent jar is compiled against jakarta, so calling it on 6.x
throws NoSuchMethodError at link time. That was the underlying cause of
the dump crash on Lucee 6.2 (the System.exit containment was only half
the story).

New path uses ClassUtil.callStaticMethod on ThreadUtil.createPageContext,
which takes a ConfigWeb (servlet-API-agnostic on the caller side; Lucee
plumbs the internal ServletConfig itself). Empty cookie array is built
jakarta-first-with-javax-fallback via reflection. Same pattern as
extension-websocket's WSUtil.createPageContext.

One agent jar works on both javax-Lucee (6.2) and jakarta-Lucee (7.x).
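
The jakarta-first-with-javax-fallback cookie array could be built roughly as follows. This is a sketch under assumptions, not the code in this PR: the class CookieArrays and both method names are hypothetical, and only the two Cookie class names come from the servlet APIs themselves.

```java
import java.lang.reflect.Array;

final class CookieArrays {
    // Returns a zero-length array whose component type is the first class name
    // that resolves on the given classloader. Hypothetical helper.
    static Object emptyArrayOfFirstAvailable(ClassLoader cl, String... classNames) {
        ClassNotFoundException last = null;
        for (String name : classNames) {
            try {
                // initialize=false: we only need the type, not its static init.
                Class<?> type = Class.forName(name, false, cl);
                return Array.newInstance(type, 0);
            } catch (ClassNotFoundException e) {
                last = e; // try the next candidate
            }
        }
        throw new IllegalStateException("none of the candidate classes could be loaded", last);
    }

    // jakarta first (Lucee 7.x), then javax (Lucee 6.2).
    static Object emptyCookieArray(ClassLoader cl) {
        return emptyArrayOfFirstAvailable(cl,
                "jakarta.servlet.http.Cookie",
                "javax.servlet.http.Cookie");
    }
}
```

Because the array is only ever handled as an Object and passed on reflectively, the calling code never links against either servlet API.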

Utils.unreachable had only two production callers, both in ExprEvaluator's Lucee5/6 detection
fallbacks. If Renderer.tag signature detection throws something other
than NoSuchMethodException, killing the JVM was indefensible - just
fall through to Optional.empty() with a context-ful log so the caller
tries the other evaluator.

Delete Utils.java; no production caller of Utils.unreachable remains.
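
The fall-through to Optional.empty() described above might look like this in isolation. SignatureProbe and probe are hypothetical names, and the probe works on any class and method name rather than the real Renderer.tag lookup.

```java
import java.lang.reflect.Method;
import java.util.Optional;

final class SignatureProbe {
    // Hypothetical helper: look up a public method by signature. A missing method
    // is the expected "other runtime version" case; anything else is logged with
    // context. Neither case kills the JVM, so the caller can try the next evaluator.
    static Optional<Method> probe(Class<?> owner, String name, Class<?>... params) {
        try {
            return Optional.of(owner.getMethod(name, params));
        } catch (NoSuchMethodException e) {
            return Optional.empty();
        } catch (Throwable t) {
            System.err.println("[debugger] signature probe failed for "
                    + owner.getName() + "." + name + ": " + t
                    + "; falling back to the other evaluator");
            return Optional.empty();
        }
    }
}
```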

All 8 System.exit(1) sites in LuceeTransformer fired on catching a
Throwable from classfile rewrite. Any one bad class - a weird inner
class, an unexpected ASM opcode, anything - would take the whole Lucee
host down. One class failing to instrument should degrade debugging
for that one class, not kill the process.

Each catch now logs a context-rich line (which class, what the
consequence is) and returns the ORIGINAL classfileBuffer, so the class
loads normally. Breakpoints in that specific class won't fire, but
Lucee keeps running.
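
As a rough illustration of that catch shape (not LuceeTransformer itself), a transformer can wrap its rewrite step and hand back the original bytes on failure. ContainedTransformer and the rewrite function are hypothetical stand-ins; ClassFileTransformer is the standard java.lang.instrument interface.

```java
import java.lang.instrument.ClassFileTransformer;
import java.security.ProtectionDomain;
import java.util.function.UnaryOperator;

final class ContainedTransformer implements ClassFileTransformer {
    // Hypothetical stand-in for the real bytecode rewrite.
    private final UnaryOperator<byte[]> rewrite;

    ContainedTransformer(UnaryOperator<byte[]> rewrite) {
        this.rewrite = rewrite;
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain, byte[] classfileBuffer) {
        try {
            return rewrite.apply(classfileBuffer);
        } catch (Throwable t) {
            // Name the class and the consequence, then let the class load unmodified.
            System.err.println("[debugger] failed to instrument " + className + ": " + t
                    + "; loading it uninstrumented, breakpoints in this class will not fire");
            return classfileBuffer;
        }
    }
}
```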

The 'Got class X before PageContextImpl' branch now logs and continues
too, rather than exiting.

Cleanup per remove-runtime-exits.md - 20+ runtime exit sites in
DebugManager, LuceeVm, and KlassMap replaced with context-ful stderr
logs that name the operation, the inputs involved, and the consequence.

Covers:
- step / pop-frame / bad-step-type handlers (user-triggered debug actions)
- JDWP event pump, thread tracking, class-ref tracking
- step in/over/out suspend-count assertions
- KlassMap build failures

*OrFail thread-lookup helpers now throw RuntimeException with context
instead of calling exit - the caller's catch decides how to proceed.
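
A sketch of what an *OrFail helper with this contract could look like. ThreadLookup, threadOrFail, and the map-based tracking are assumptions for illustration; the real helpers in DebugManager and LuceeVm may be shaped differently.

```java
import java.util.Map;

final class ThreadLookup {
    // Hypothetical helper: returns the tracked entry for threadId, or throws a
    // RuntimeException that names the operation and the inputs involved.
    static <T> T threadOrFail(Map<Long, T> threadsById, long threadId, String operation) {
        T found = threadsById.get(threadId);
        if (found == null) {
            throw new RuntimeException(operation + ": no tracked thread with id " + threadId
                    + " (" + threadsById.size() + " threads tracked)");
        }
        return found;
    }
}
```

The caller's catch can then log the message and abandon only the current debug action.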

Startup/init sites (classloader sanity, JDWP connector lookup, premain,
DAP socket bind, instrumentation helper method lookup) are kept as
System.exit - if they fire the debugger is fundamentally broken and
failing loud is correct.

CI was intermittently hanging on 'Warmup debuggee (Lucee Express)' for
up to 6 hours (GitHub's job cap). Root cause: LUCEE_ENABLE_WARMUP=true
tells Lucee to compile bundles and exit, but our agent spawned three
non-daemon threads that stayed parked forever:

- DAP server thread (DebugManager.spawnWorker) - blocked on ServerSocket.accept()
- JDWP worker (LuceeVm.JdwpWorker.spawnThreadForJdwpToSuspend) - suspended in method
- JDWP event pump (LuceeVm.initEventPump) - blocked on vm_.eventQueue().remove()

Any one of them is enough to keep the JVM alive. When warmup 'worked'
it was a race where JDWP self-connect failed fast and the DAP thread
exited via its catch before ServerSocket.accept was reached.

Native mode (ExtensionActivator) already sets dapThread.setDaemon(true).
Match that pattern in agent mode. During a real debug session, Tomcat's
own non-daemon threads keep the JVM alive - daemon status on our
background workers is invisible there.
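
The effect of the daemon flag can be seen in a small standalone sketch. DaemonWorkers and spawnDaemon are hypothetical names, not the agent's spawnWorker; only Thread.setDaemon and ServerSocket.accept are the real JDK calls.

```java
import java.io.IOException;
import java.net.ServerSocket;

final class DaemonWorkers {
    // Hypothetical helper: start a background worker that cannot keep the JVM alive.
    static Thread spawnDaemon(String name, Runnable body) {
        Thread t = new Thread(body, name);
        t.setDaemon(true); // must be set before start()
        t.start();
        return t;
    }

    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(0); // ephemeral port, nobody connects
        spawnDaemon("dap-server", () -> {
            try {
                server.accept(); // blocks indefinitely
            } catch (IOException e) {
                // socket closed during shutdown; nothing to do
            }
        });
        // main returns here. The JVM exits even though accept() is still blocked,
        // because the only thread left running is a daemon.
    }
}
```

Without the setDaemon(true) call, the same program would stay alive on the blocked accept(), which matches the warmup hang described above.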