Set telemetry retention policy asynchronously #10416
bnaecker commented May 8, 2026
- Don't wait for a new TTL to actually take effect, only for the server to ack it. These can take a very long time on large tables, 100s of seconds, and waiting isn't useful in this context.
- Update the clickhouse-admin API to return the configured retention policy for every table rather than assuming they're all the same.
- Recreate the clickhouse client if a TTL request fails, to ensure we can continue to set the TTL on the rest of the tables.
- Slightly better filtering of table names in the usage output
- Fixes #10415 ("Setting the telemetry TTL can take a while, and that should be OK")
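The asynchronous behavior in the first bullet amounts to issuing the ALTER and returning as soon as the server acks it, rather than waiting for the background mutation to rewrite existing parts. A minimal sketch of the idea (the table name and interval here are illustrative, not the actual oximeter schema):

```sql
-- Ask the server to ack the TTL change immediately; the mutation that
-- applies it to already-written parts runs in the background.
ALTER TABLE oximeter.measurements_f64
    MODIFY TTL timestamp + INTERVAL 30 DAY
    SETTINGS mutations_sync = 0;
```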
Ok, I've tested this on [...] And the commands as issued from [...]
```
LIMIT 1;";
WHERE (\
    database = 'oximeter' \
    AND multiSearchAny(name, ['measurements', 'fields']) \
```
This used to be "startsWith", but is now a substring match - do we think we're unlikely to add a table with one of these words used as a non-prefix?
I was thinking we could add an integration test here. We should always know exactly which tables will exist in clickhouse, so an integration test should catch any regressions.
That's mostly because there's no method that I could find for checking whether a string starts with a number of prefixes. So this is searching anywhere within the string. I can change this to a condition like "starts with A or starts with B" if that's preferable.
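For reference, the prefix-only version of that condition is easy enough to spell out by hand, since there's no multi-prefix builtin; a sketch using the same database and name patterns as the snippet above:

```sql
SELECT name
FROM system.tables
WHERE database = 'oximeter'
  AND (startsWith(name, 'measurements') OR startsWith(name, 'fields'))
LIMIT 1;
```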
WDYT about adding an integration test like this one? I didn't catch this, but Claude found that we panic on reading the retention policy from system.query_log if it exists. I think we need at least the fix in that branch, or something like it, to merge.
```rust
    policy: TypedBody<latest::retention::RetentionPolicyRequest>,
) -> Result<HttpResponseUpdatedNoContent, HttpError>;

/// Set the retention policy for timeseries data.
```
Nit: I think it would be worth a note that this update is eventually consistent. IIUC, a user could set the retention policy, immediately read it back, and be surprised that it's not (yet) the expected value.
And a question: with SETTINGS mutations_sync=0, is it possible to queue an update that never succeeds? Speaking of which, have we tested what happens here when the disk is full, which is the issue that triggered this feature in the first place? I don't want to delay merging on this question, but we might want to figure it out sometime if we haven't already.
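One way to see the eventual consistency in practice: the declared TTL shows up in the table metadata as soon as the ALTER is acked, while the data itself catches up whenever the background mutation finishes. A sketch for inspecting the metadata side (table name illustrative):

```sql
-- The new TTL expression appears in the DDL immediately after the ack,
-- even though existing parts may not yet have been rewritten.
SELECT create_table_query
FROM system.tables
WHERE database = 'oximeter' AND name = 'measurements_f64';
```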
Good idea about the note.
Yes, I think it's possible for a background mutation to fail. The description for the system.mutations table lists columns like latest_fail_time, for example. I can't find, on that page or any other, whether mutations are retried. The phrasing "latest fail time" suggests they are, but that's not clear.
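For what it's worth, those failure columns can be inspected directly; something like this would surface in-flight or stuck mutations along with their most recent error (assuming nothing about our schema):

```sql
-- Mutations that haven't completed, with the latest failure, if any.
SELECT database, table, mutation_id, command,
       latest_fail_time, latest_fail_reason
FROM system.mutations
WHERE NOT is_done;
```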
I'm not sure what happens if the disk is full; we haven't attempted it, and it is hard to reproduce. From this page it appears that whole data parts (chunks of tables) are rewritten according to the mutation, and then the original is dropped. Depending on how the TTL is implemented, that could certainly lead to write amplification. It's a valid concern, and we should flag it for testing. I'm fairly confident it will not cause more pain than we've already had, because we're expecting to reduce the retention time, which should lead to all or most of a data part being dropped rather than rewritten with, say, a different order. Or maybe I'm just hopeful.
Actually there's this:

> Unlike merges, mutations can't be rolled back once submitted and will continue to execute even after server restarts unless explicitly cancelled
From this page, ominously named "Avoid mutations". I agree it's a risk, but this is the only way to actually set a TTL other than managing it ourselves, which I think is out of scope.
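If a TTL mutation did get stuck, the escape hatch the docs allude to is explicit cancellation; a sketch with illustrative identifiers:

```sql
-- Cancel a specific stuck mutation; per the docs, this cannot be rolled
-- back or resumed once killed.
KILL MUTATION
WHERE database = 'oximeter'
  AND table = 'measurements_f64'
  AND mutation_id = 'mutation_42.txt';
```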
@smklein I can't respond directly to your question, but no, I don't believe the default was to exec these asynchronously. See the logs from #10415. We're clearly waiting for >530s for the actual mutation to finish before returning to the client. I'm not sure how to square that with the docs, though.
@smklein See https://clickhouse.com/docs/sql-reference/statements/alter#synchronicity-of-alter-queries.