What version of gRPC-Java are you using?
1.76.3 (regression from 1.70.0).
What is your environment?
Java 21, Linux (GKE), proxyless gRPC with xDS (SotW ADS) from a custom control plane.
One application has startup code that calls LoadBalancerRegistry.deregister() / register() to replace a ServiceLoader-discovered provider with a differently-configured instance.
We acknowledge that our application is likely at fault for creating gRPC channels during framework initialization, before all startup code has finished running.
What did you expect to see?
We'd expect CdsLoadBalancer2 to be resilient to registry mutations.
What did you see instead?
INTERNAL: CdsLb for xdstp://...: Unable to parse the LB config:
Status{code=INTERNAL, description=Failed to parse child policy in wrr_locality LB policy:
{childPolicy=[{els={}}]}}
Cause: None of [els] specified by Service Config are available.
The channel enters TRANSIENT_FAILURE and does not recover.
Steps to reproduce the bug
Setup: The xDS control plane sends a CDS load_balancing_policy with wrr_locality whose endpoint_picking_policy contains three policies in fallback order:
- A custom LB policy (els) registered as a TypedStruct
- LeastRequest
- RoundRobin
The custom policy els has a LoadBalancerProvider registered via ServiceLoader. Application startup code deregisters it and conditionally re-registers a reconfigured instance.
We think the failure sequence is as follows:
- Phase 1 (CDS update): XdsClusterResource parses a CDS response. LoadBalancingPolicyConverter finds the custom provider in the registry, selects it, and produces {"wrr_locality_experimental": {"childPolicy": [{"els": {}}]}}. Validation via WrrLocalityLoadBalancerProvider.parseLoadBalancingPolicyConfig succeeds. The raw JSON map is stored in CdsUpdate.lbPolicyConfig.
- Registry mutation: Application startup code calls registry.deregister(oldProvider), then conditionally registry.register(newProvider). Between the two calls (or permanently, if the re-registration is skipped), the custom provider is absent from the registry.
- Phase 2 (any xDS update): CdsLoadBalancer2.acceptResolvedAddresses re-parses the stored raw config. selectLbPolicyFromList calls getProvider("els"), which returns null and produces the error above. The channel enters TRANSIENT_FAILURE and does not recover.
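The sequence above can be modeled in plain Java. This is a toy sketch, not gRPC-Java code: ToyRegistry, selectFromList, and run are hypothetical stand-ins for LoadBalancerRegistry, selectLbPolicyFromList, and the two parse phases. It assumes that Phase 1 stores only the selected policy rather than the full fallback list, as the childPolicy=[{els={}}] in the error message suggests.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the failure sequence; none of these types are gRPC-Java classes. */
final class StaleRegistryLookupDemo {

  /** Hypothetical stand-in for LoadBalancerRegistry: policy name -> provider. */
  static final class ToyRegistry {
    private final Map<String, Object> providers = new ConcurrentHashMap<>();

    void register(String name, Object provider) {
      providers.put(name, provider);
    }

    void deregister(String name) {
      providers.remove(name);
    }

    Object getProvider(String name) {
      return providers.get(name);
    }
  }

  /** Stand-in for selectLbPolicyFromList: the first registered candidate wins. */
  static String selectFromList(ToyRegistry registry, List<String> candidates) {
    for (String name : candidates) {
      if (registry.getProvider(name) != null) {
        return name;
      }
    }
    throw new IllegalStateException(
        "None of " + candidates + " specified by Service Config are available.");
  }

  /** Runs Phase 1, the registry mutation, and Phase 2; returns the Phase 2 outcome. */
  static String run() {
    ToyRegistry registry = new ToyRegistry();
    registry.register("els", new Object());
    registry.register("least_request_experimental", new Object());
    registry.register("round_robin", new Object());

    // Phase 1: the fallback list collapses to the single policy that was selected.
    List<String> fromControlPlane =
        Arrays.asList("els", "least_request_experimental", "round_robin");
    List<String> storedConfig =
        Collections.singletonList(selectFromList(registry, fromControlPlane));

    // Registry mutation: the provider is removed and not (yet) re-registered.
    registry.deregister("els");

    // Phase 2: the stored config is re-parsed against the mutated registry.
    try {
      return "OK: " + selectFromList(registry, storedConfig);
    } catch (IllegalStateException e) {
      return "ERROR: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

In this model the fallback policies that are still registered cannot help in Phase 2, because the stored list no longer contains them.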
Critically, Phase 1 runs only on CDS updates, which are rare (they happen only when the cluster config changes), whereas Phase 2 runs on every XdsConfig change, including EDS updates, which can be very frequent.
The re-parse is annotated with the comment "Should be impossible, because XdsClusterResource validated this" (CdsLoadBalancer2.java:136), but the underlying assumption that the registry is immutable after validation does not hold.
Why this didn't happen before #12140:
Probably because, before #12140, CdsLoadBalancer2.acceptResolvedAddresses returned early on every call after the first:
// v1.70.0
if (this.resolvedAddresses != null) {
  return Status.OK;
}
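The effect of that guard can be illustrated with a small counter. This is a toy sketch rather than the actual CdsLoadBalancer2 code: ToyCdsBalancer and parsesAfterUpdates are hypothetical names. It assumes, as this report does, that the early return is what kept later updates from reaching the re-parse.

```java
/** Toy model of the two behaviors; not the actual CdsLoadBalancer2 code. */
final class EarlyReturnDemo {

  /** Hypothetical balancer that counts how often it re-parses its stored config. */
  static final class ToyCdsBalancer {
    private final boolean earlyReturnOnRepeatCalls;
    private Object resolvedAddresses;
    private int parseCount;

    ToyCdsBalancer(boolean earlyReturnOnRepeatCalls) {
      this.earlyReturnOnRepeatCalls = earlyReturnOnRepeatCalls;
    }

    void acceptResolvedAddresses(Object addresses) {
      if (earlyReturnOnRepeatCalls && resolvedAddresses != null) {
        return; // models the v1.70.0 guard: later calls never reach the parse
      }
      resolvedAddresses = addresses;
      parseCount++; // models one registry lookup while parsing the stored config
    }

    int parseCount() {
      return parseCount;
    }
  }

  /** Feeds the given number of updates to a balancer and returns how many parses ran. */
  static int parsesAfterUpdates(boolean earlyReturn, int updates) {
    ToyCdsBalancer balancer = new ToyCdsBalancer(earlyReturn);
    for (int i = 0; i < updates; i++) {
      balancer.acceptResolvedAddresses("update-" + i);
    }
    return balancer.parseCount();
  }

  public static void main(String[] args) {
    System.out.println("with early return:    " + parsesAfterUpdates(true, 5));
    System.out.println("without early return: " + parsesAfterUpdates(false, 5));
  }
}
```

With the guard, only the first of five updates consults the registry; without it, all five do, so a registry mutation at any later point becomes visible.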
Suggested action
Ideally, CdsLoadBalancer2 would not re-validate the LB policy config against the registry once XdsClusterResource has already made the selection. If that is not feasible, it would still help to document that deregistering a load balancer provider at runtime can leave existing channels in TRANSIENT_FAILURE.
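One possible shape for the first option is to resolve the provider once at validation time and carry it with the stored config. The sketch below is a toy illustration under that assumption, not a proposed patch: ToyProvider, ResolvedPolicy, validate, and applyUpdate are hypothetical names, and a plain map stands in for LoadBalancerRegistry.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch of "resolve once, reuse later"; none of these types are gRPC-Java classes. */
final class ResolveOnceDemo {

  /** Hypothetical stand-in for a LoadBalancerProvider. */
  static final class ToyProvider {
    final String policyName;

    ToyProvider(String policyName) {
      this.policyName = policyName;
    }
  }

  /** Hypothetical validated config that keeps the provider chosen at validation time. */
  static final class ResolvedPolicy {
    final ToyProvider provider;

    ResolvedPolicy(ToyProvider provider) {
      this.provider = provider;
    }
  }

  /** Validation step: the only place where the registry is consulted. */
  static ResolvedPolicy validate(Map<String, ToyProvider> registry, String policyName) {
    ToyProvider provider = registry.get(policyName);
    if (provider == null) {
      throw new IllegalStateException("Policy not available: " + policyName);
    }
    return new ResolvedPolicy(provider);
  }

  /** Later updates reuse the captured provider and never look at the registry. */
  static String applyUpdate(ResolvedPolicy stored) {
    return stored.provider.policyName;
  }

  /** Validates, mutates the registry, then applies an update with the stored result. */
  static String run() {
    Map<String, ToyProvider> registry = new HashMap<>();
    registry.put("els", new ToyProvider("els"));

    ResolvedPolicy stored = validate(registry, "els");
    registry.remove("els"); // models the application's deregister() call

    return applyUpdate(stored);
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

The trade-off in this sketch is that the channel keeps using the originally resolved provider even after a reconfigured instance is registered. Whether that is acceptable is a design question for the maintainers.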
Thanks!