Skip to content

fix(browse): skip parent-process watchdog in headed and tunnel modes#869

Open
zzyyfff wants to merge 5 commits intogarrytan:mainfrom
zzyyfff:fix/watchdog-headed-crash
Open

fix(browse): skip parent-process watchdog in headed and tunnel modes#869
zzyyfff wants to merge 5 commits intogarrytan:mainfrom
zzyyfff:fix/watchdog-headed-crash

Conversation

@zzyyfff
Copy link
Copy Markdown

@zzyyfff zzyyfff commented Apr 6, 2026

Summary

The parent-process watchdog from #808 (shipped in v0.15.13.0) crashes headed Chromium after exactly 15 seconds on every platform.

  • Root cause: BROWSE_PARENT_PID is set to the CLI's process.pid, which exits within milliseconds of spawning the server. On the first 15s poll, the watchdog finds the parent dead and calls shutdown(). The idle timer correctly skips headed mode (line 730); the watchdog did not.
  • Fix: Shared shouldSuppressAutoShutdown() guard used by both idle timer and watchdog:
    1. Skip in headed mode (user controls the browser lifetime)
    2. Skip in tunnel mode (remote agents reconnect sporadically)
    3. Skip if lastActivity is within 30s grace period (prevents unnecessary cold-start restarts during rapid headless CLI calls)
  • Headless mode is unchanged in practice: the CLI auto-restarts the server on next call, so the watchdog acts as faster garbage collection for idle headless servers.

Additional fixes from cross-model review (Claude + Codex)

  • newtab regression restored: The command !== 'newtab' guard in the tab ownership check was accidentally removed. Scoped tokens with own-only tabPolicy would get 403 on newtab, breaking multi-agent tab isolation. Guard restored with regression test.
  • ESRCH-only error handling: Watchdog catch block now only treats ESRCH as "parent gone". All other error codes (EPERM, EINVAL, platform-specific) are treated as "parent alive", preventing false shutdowns on containers and unusual platforms.
  • BROWSE_PARENT_PID env leak fixed: Pair-agent connect subprocess now explicitly sets BROWSE_PARENT_PID=0 to prevent the parent shell's value from leaking into the new server and causing premature termination during the headed handoff.
  • DRY guard extraction: shouldSuppressAutoShutdown() replaces duplicate headed/tunnel checks in both the idle timer and watchdog. Future connection modes only need one update.

Test plan

  • 14 watchdog regression tests: source-level guards, behavioral logic extraction (ESRCH-only, grace period direction, headed/tunnel suppression), shutdown cleanup
  • 34 server-auth tests pass (0 failures), including restored newtab + BROWSE_PARENT_PID=0 regression tests
  • 255 cross-file tests pass (0 new failures)
  • Manual: $B connect browser stays open past 15s
  • Manual: headless commands still work normally

Review coverage

Review Source Findings Fixed
/codex challenge Codex adversarial newtab regression, ESRCH logic, BROWSE_PARENT_PID leak, grace period semantics 3/4 (grace period is by-design)
/review (Claude structured) Claude + testing/maintainability specialists newtab regression, source-level test gap, VERSION/CHANGELOG, guard duplication 5/5
/review (Claude adversarial subagent) Claude adversarial newtab regression, ESRCH flip, headed mode race, env leak, double-shutdown 3/5 (2 pre-existing, low risk)

Cross-model agreement: All three sources independently identified the newtab regression as the highest-priority fix.

Fixes #867

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

zzyyfff and others added 4 commits April 6, 2026 19:34
The watchdog from garrytan#808 polls BROWSE_PARENT_PID every 15s and calls
shutdown() when the parent is gone. But BROWSE_PARENT_PID is set to the
CLI's process.pid, which exits within milliseconds of spawning the
server. In headed mode this kills the browser window after exactly 15s
on every platform (reported as garrytan#867).

Three guards, mirroring the idle timer:
1. Skip in headed mode (user controls the browser lifetime)
2. Skip in tunnel mode (remote agents reconnect sporadically)
3. Skip if lastActivity is within WATCHDOG_GRACE_MS (30s), meaning a
   client sent a command recently through a new CLI invocation

Headless mode is unchanged in practice: the CLI auto-restarts the
server on the next call, so the watchdog just acts as faster garbage
collection for idle headless servers. The grace period prevents
unnecessary cold-start restarts during rapid command sequences.

Adds 6 source-level regression tests verifying the guards exist and
are ordered correctly relative to the kill/shutdown calls.

Fixes garrytan#867

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ommands

The `domains` variable is defined in `handlePairAgent()` but was
referenced in the connect and disconnect command handlers, causing
`ReferenceError: domains is not defined` and preventing headed mode
from launching. Looks like a merge artifact from the pair-agent body
format being copied into unrelated request bodies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rom ESRCH

Two issues found during review:

1. The watchdog setInterval handle was never stored, so shutdown()
   could not clear it. The callback could fire during teardown against
   partially-torn-down state. Now stored as watchdogInterval and
   cleared alongside the other intervals in shutdown().

2. The catch block treated all process.kill(pid, 0) errors as "parent
   is dead." But EPERM means the process exists, we just lack
   permission to signal it. Now checks err.code and skips shutdown
   for EPERM, only acting on ESRCH (process not found).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Restore newtab exclusion from tab ownership check (server.ts:954).
  Scoped tokens with own-only tabPolicy were getting 403 on newtab
  because the ownership check ran before the newtab handler.
  Confirmed by Claude structured review, Claude adversarial, and Codex.

- Flip EPERM→ESRCH logic in watchdog catch block. Only treat ESRCH
  (process doesn't exist) as "parent gone". All other error codes
  (EPERM, EINVAL, platform-specific) are treated as "parent alive".

- Extract shouldSuppressAutoShutdown() shared guard. Both idle timer
  and watchdog now call the same function, eliminating guard duplication
  and ensuring future connection modes only need one update.

- Restore BROWSE_PARENT_PID=0 for pair-agent connect subprocess.
  Prevents env variable leak from parent shell into the server,
  which could cause premature termination during headed handoff.

- Restore newtab and BROWSE_PARENT_PID=0 regression tests in
  server-auth.test.ts.

- Expand watchdog.test.ts: 14 tests covering source-level guards,
  shared function structure, behavioral logic extraction (ESRCH-only,
  grace period direction, headed/tunnel suppression), and shutdown
  cleanup. Replaces 6 pattern-only tests.

- Add CHANGELOG entry for v0.15.15.1, bump VERSION.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zzyyfff zzyyfff marked this pull request as draft April 7, 2026 03:24
Merge origin/main into fix/watchdog-headed-crash. Conflicts in
CHANGELOG.md (our improved entry supersedes main's) and
server-auth.test.ts (keep all 3 regression tests: domains,
BROWSE_PARENT_PID=0, and newtab).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zzyyfff zzyyfff marked this pull request as ready for review April 7, 2026 03:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Headed Chromium crashes after exactly 15s on macOS 26.5 (Tahoe beta)

1 participant