@dpdiliberto
Contributor

Problem

Long eval jobs (500+ docs) on Azure VMs freeze for ~15 minutes after reaching 100% completion. The freeze occurs when summarize() calls fetch_base_experiment(), which reuses the app_conn() connection (to a Vercel IP) that has sat idle for the duration of the eval run.

Azure NAT gateways apply a 4-minute idle timeout and silently drop connections that exceed it. When fetch_base_experiment() then tries to reuse the stale connection, the request fails with ConnectionError.

Solution

Added retry logic with connection reset to fetch_base_experiment():

  • 3 retry attempts with exponential backoff
  • Explicit timeouts (5s connect, 10s read)
  • Connection reset (conn._reset()) on retry to create a fresh HTTP session
  • Returns None once retries are exhausted, instead of raising
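The retry behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in logger.py: fetch_with_retry, do_request, reset_connection, retry_on, and base_delay are hypothetical names, and the 0.5s base delay is an assumption (the PR does not state the backoff values). In the real change, reset_connection would correspond to conn._reset().

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    do_request: Callable[[], T],          # hypothetical: performs one HTTP call
    reset_connection: Callable[[], None],  # hypothetical: stands in for conn._reset()
    max_attempts: int = 3,
    base_delay: float = 0.5,               # assumed value, not taken from the PR
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call do_request up to max_attempts times; return None if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except retry_on:
            if attempt == max_attempts - 1:
                break  # retries exhausted: fall through and return None
            # Drop the (possibly stale) session so the next attempt reconnects.
            reset_connection()
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

If do_request is built on the requests library, the explicit timeouts would be passed as timeout=(5, 10), which requests interprets as (connect, read) seconds. Note that requests.exceptions.ConnectionError does not derive from the built-in ConnectionError, so in that case the requests exception classes would need to be supplied through retry_on.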

Changes

  • logger.py: Added retry logic to fetch_base_experiment()
  • test_stale_connection.py: Integration test using real HTTP server to simulate NAT timeout

Testing

cd sdk/py
PYTHONPATH=src python3 -m unittest braintrust.test_stale_connection -v

The test uses a real HTTP server that simulates NAT gateway timeout behavior (a 0.5s server-side timeout stands in for the 4-minute Azure NAT idle timeout).
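For readers who want to reproduce the stale-connection condition locally, one possible approach is sketched below. It is not the contents of test_stale_connection.py: IdleClosingHandler, get, demo, and the 0.2s IDLE_TIMEOUT are all assumptions made for illustration. It is also only an approximation, since this server actively closes idle keep-alive connections, whereas a NAT gateway drops them silently.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

IDLE_TIMEOUT = 0.2  # assumed stand-in for the NAT idle timeout


class IdleClosingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive so the client reuses the socket
    timeout = IDLE_TIMEOUT         # server closes the socket after this much idle time

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def get(conn):
    conn.request("GET", "/")
    return conn.getresponse().read()


def demo():
    server = ThreadingHTTPServer(("127.0.0.1", 0), IdleClosingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    try:
        first = get(conn)             # fresh connection: succeeds
        time.sleep(IDLE_TIMEOUT * 3)  # sit idle past the server's timeout
        try:
            get(conn)                 # reuses the now-closed socket
            stale_failed = False
        except ConnectionError:
            stale_failed = True
            conn.close()              # reset: the next request opens a new socket
        recovered = get(conn)
        return first, stale_failed, recovered
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The sketch uses only the standard library so it runs without the SDK installed. The point it demonstrates is the same one the fix relies on: reusing the idle connection fails, while closing it and reconnecting succeeds.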

@dpdiliberto marked this pull request as draft on December 17, 2025 00:09