[FLINK-39737][s3] Switch to StandardRetryStrategy with configurable backoff for throttle-aware retry#28236
[FLINK-39737][s3] Switch to StandardRetryStrategy with configurable backoff for throttle-aware retry#28236gaborgsomogyi wants to merge 1 commit into
Conversation
…ackoff for throttle-aware retry
Samrat002
left a comment
There was a problem hiding this comment.
Thank you for the patch @gaborgsomogyi
For S3-heavy distributed workloads, AdaptiveRetryStrategy can sometimes behave better under coordinated throttling. Not necessarily saying it should be used, but I am looking for the reasoning for choosing StandardRetryStrategy over adaptive would be valuable context.
| this.retryThrottleBaseDelay = | ||
| Preconditions.checkNotNull( | ||
| retryThrottleBaseDelay, "retryThrottleBaseDelay must not be null"); | ||
| this.retryMaxBackoff = |
There was a problem hiding this comment.
Invalid values are possible. Add condition check for negative durations, zero durations, maxBackoff < baseDelay, maxBackoff < throttleBaseDelay
| // --- Retry backoff --- | ||
|
|
||
| @Test | ||
| void testRetryBaseDelayDefault() throws Exception { |
There was a problem hiding this comment.
No tests for actual retry strategy construction.
| .withDescription( | ||
| "Base delay for exponential backoff on throttle retries (HTTP 429, 503). " | ||
| + "Each retry waits a random duration between 0 and " | ||
| + "min(throttle-base-delay * 2^attempt, max-backoff)."); |
There was a problem hiding this comment.
Use Uses exponential backoff with jitter capped by max-backoff. rather than a mathematical expression.
In the future, when the logic evolves may lead to misleading.
What is the purpose of the change
The
flink-s3-fs-nativeconnector was using the deprecatedRetryPolicyAPI from AWS SDK v2 in legacy retry mode. The legacy mode does not distinguish throttling errors (HTTP 429) from transient server errors, applies no token-bucket circuit breaking, and has been superseded by theRetryStrategyAPI since SDK v2.25. S3-compatible storage systems commonly signal throttling via HTTP 429, which the legacy mode silently dropped without retrying.This pull request replaces the deprecated
RetryPolicywithStandardRetryStrategyfrom the non-deprecated AWS SDK v2 retries API. The standard strategy natively handles both HTTP 429 and 503 as throttling events with a dedicated backoff path, applies full-jitter exponential backoff separately for throttle and non-throttle errors, and uses a token-bucket circuit breaker to prevent retry storms under sustained load. The key backoff parameters are exposed as Flink configuration options so operators can tune retry behavior for their storage backend.Brief change log
RetryPolicy.builder().numRetries(n)withStandardRetryStrategyviaClientOverrideConfiguration.retryStrategy()inS3ClientProviders3.retry.base-delay(default: 100ms) — backoff base delay for non-throttle retriess3.retry.throttle.base-delay(default: 1s) — backoff base delay for throttle retries (HTTP 429/503)s3.retry.max-backoff(default: 20s) — shared exponential backoff cap for both retry pathsConfigOptiondefault values directly, keeping a single source of truths3.retry.max-num-retries(default: 3) is preserved and maps tomaxAttempts = maxRetries + 1Verifying this change
This change added tests and can be verified as follows:
NativeS3FileSystemFactoryTestcovering default values and explicit configuration for each of the three newDurationconfig options, following the same pattern as the existingtestMaxRetriesDefault/testMaxRetriesExplicitlyConfiguredtestsS3ClientProviderTest(testRetryBuilderDefaultsMatchConfigOptions) verifying thatBuilderfield defaults andConfigOptiondefaults stay in syncDoes this pull request potentially affect one of the following parts:
software.amazon.awssdk:retrieswas already a transitive dependency viasdk-core)@Public(Evolving): noDocumentation
ConfigOptionfields inNativeS3FileSystemFactoryWas generative AI tooling used to co-author this PR?