fix(llmobs): respect configured writer timeout in _send_payload#18575

Open
jeong-hasang wants to merge 2 commits into
DataDog:mainfrom
jeong-hasang:fix/llmobs-writer-timeout

@jeong-hasang jeong-hasang commented Jun 11, 2026

Description

BaseLLMObsWriter._send_payload opens its connection with get_connection(self._intake), passing no timeout argument.
As a result, it ignores the writer's own self._timeout and falls back to get_connection's DEFAULT_TIMEOUT of 2s.

On a high-latency connection to the agent/intake, requests in the latency tail exceed the 2s socket timeout. Every retry hits the same 2s ceiling, so the batch is dropped ("failed to send N LLMObs span events").

No env var works around this. _DD_LLMOBS_WRITER_TIMEOUT only sets self._timeout, which this path ignores.
DD_TRACE_AGENT_TIMEOUT_SECONDS is not consulted by the LLMObs writer.
In production, the only effective mitigation was monkeypatching get_connection to raise its default timeout. The timeouts disappeared after that, confirming the 2s default was the cause.
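For readers unfamiliar with this kind of workaround, here is a rough sketch of what raising a helper's default timeout by monkeypatching can look like. Everything in it is a stand-in: conn_helpers, _get_connection, and raise_default_timeout are hypothetical names for illustration, not ddtrace code, and the 10s value is arbitrary.

```python
import functools
import types

DEFAULT_TIMEOUT = 2.0  # stand-in for the helper's built-in default


def _get_connection(url, timeout=DEFAULT_TIMEOUT):
    # Stand-in helper: returns a dict instead of opening a real connection.
    return {"url": url, "timeout": timeout}


# Hypothetical namespace playing the role of the module that owns the helper.
conn_helpers = types.SimpleNamespace(get_connection=_get_connection)


def raise_default_timeout(namespace, new_default):
    """Replace namespace.get_connection with a wrapper that uses new_default
    whenever the caller does not pass a timeout. Returns the original."""
    original = namespace.get_connection

    @functools.wraps(original)
    def patched(url, timeout=new_default):
        return original(url, timeout=timeout)

    namespace.get_connection = patched
    return original


original = raise_default_timeout(conn_helpers, 10.0)

# A caller that omits timeout now gets the raised default;
# a caller that passes one explicitly is unaffected.
implicit = conn_helpers.get_connection("http://localhost:8126")
explicit = conn_helpers.get_connection("http://localhost:8126", timeout=3.0)
```

A patch like this changes the default for every caller that omits the argument, not only the LLMObs writer, which is one reason the one-line fix in this PR is preferable.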

The fix passes timeout=self._timeout, restoring the 5s default that #9438 intended for the writer.

     def _send_payload(self, payload: bytes, num_events: int):
-        conn = get_connection(self._intake)
+        conn = get_connection(self._intake, timeout=self._timeout)
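To make the failure mode concrete, the following is a minimal self-contained model of it, not the actual ddtrace code. FakeConnection, SketchWriter, and the two open_* methods are hypothetical; only the shape of the call (omitting versus passing timeout=) mirrors the diff above.

```python
DEFAULT_TIMEOUT = 2.0  # stand-in for the connection helper's default


class FakeConnection:
    """Records the timeout it was created with; opens no socket."""

    def __init__(self, url, timeout):
        self.url = url
        self.timeout = timeout


def get_connection(url, timeout=DEFAULT_TIMEOUT):
    # Stand-in helper with the same "optional timeout" shape.
    return FakeConnection(url, timeout)


class SketchWriter:
    def __init__(self, intake, timeout=5.0):
        self._intake = intake
        self._timeout = timeout  # the configured writer timeout

    def open_before_fix(self):
        # timeout omitted: the helper's default silently wins
        return get_connection(self._intake)

    def open_after_fix(self):
        # timeout passed through: the writer's configuration is honored
        return get_connection(self._intake, timeout=self._timeout)


writer = SketchWriter("http://localhost:8126", timeout=5.0)
before = writer.open_before_fix().timeout
after = writer.open_after_fix().timeout
print(before, after)
```

In this model, changing the writer's timeout has no effect on open_before_fix, which is the behavior the description reports for _DD_LLMOBS_WRITER_TIMEOUT.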

Testing

  • Added regression test tests/llmobs/test_llmobs_span_agentless_writer.py::test_send_payload_uses_configured_timeout, which mocks get_connection and asserts _send_payload opens the connection with the writer's configured timeout (not DEFAULT_TIMEOUT).
  • Ran the llmobs suite locally; the new test passes and no related tests regress.
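The general shape of such a regression test can be sketched as follows. This is an illustration against stand-in code, not the test added in this PR: http_utils, SketchWriter, and check_send_payload_uses_configured_timeout are hypothetical, and the real test patches ddtrace's own get_connection.

```python
import types
from unittest import mock

DEFAULT_TIMEOUT = 2.0  # stand-in for the helper's default


def _real_get_connection(url, timeout=DEFAULT_TIMEOUT):
    # The sketch should never reach the network.
    raise RuntimeError("network access is not expected in this sketch")


# Hypothetical namespace playing the role of the module under patch.
http_utils = types.SimpleNamespace(get_connection=_real_get_connection)


class SketchWriter:
    def __init__(self, intake, timeout):
        self._intake = intake
        self._timeout = timeout

    def _send_payload(self, payload, num_events):
        conn = http_utils.get_connection(self._intake, timeout=self._timeout)
        try:
            conn.request("POST", "/", payload)
        finally:
            conn.close()


def check_send_payload_uses_configured_timeout():
    # Use a value that differs from both defaults so a fallback is detectable.
    writer = SketchWriter("http://localhost:8126", timeout=7.5)
    with mock.patch.object(http_utils, "get_connection") as fake_get_connection:
        writer._send_payload(b"{}", 1)
    fake_get_connection.assert_called_once_with(
        "http://localhost:8126", timeout=7.5
    )
    return fake_get_connection.call_args.kwargs["timeout"]


observed = check_send_payload_uses_configured_timeout()
```

Choosing a configured timeout that matches neither the 2s nor the 5s default means the assertion fails if the call site ever falls back to either one.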

Risks

Low. This is a one-line change in a background flush thread, so it does not block the request path. Behavior change: the effective LLMObs socket timeout goes from 2s to 5s (the intended _DD_LLMOBS_WRITER_TIMEOUT default). No public API change.

Additional Notes

Reproduced on 4.10.3; the bug is unchanged on main. Sibling timeout work for reference: #9438 (writer default 2s→5s).

@jeong-hasang jeong-hasang marked this pull request as ready for review June 11, 2026 08:07
@jeong-hasang jeong-hasang requested review from a team as code owners June 11, 2026 08:07
@jeong-hasang jeong-hasang force-pushed the fix/llmobs-writer-timeout branch from 56fef3b to ba97162 Compare June 11, 2026 08:43
jeong-hasang and others added 2 commits June 11, 2026 17:46
BaseLLMObsWriter._send_payload called get_connection(self._intake)
without a timeout, so it ignored the writer's configured self._timeout
(_DD_LLMOBS_WRITER_TIMEOUT, default 5s) and fell back to the 2s
connection default (DEFAULT_TIMEOUT). On high-latency links the 2s
socket timeout was exceeded at the tail, all retries hit the same
ceiling, and the payload was dropped. Every other get_connection call
in the module passes timeout= explicitly; this was the only one that
did not. Pass timeout=self._timeout to match sibling code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeong-hasang jeong-hasang force-pushed the fix/llmobs-writer-timeout branch from ba97162 to 192d87d Compare June 11, 2026 08:46