feat(search): add Google Gemini embedding provider #27974
Adds a fourth embedding provider (google) alongside openai/bedrock/djl, using the Generative Language API with a single API key. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7 tasks covering schema change + regen, client implementation, validation tests, error path tests, request shape tests, switch wiring, and final verification. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ient The string "models/" appeared in both DEFAULT_BASE_URL and the buildRequestBody method. Extract it as a named constant per project standards. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ound Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…shape Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t comment Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds a new google embedding provider (Google Gemini / Generative Language API) to OpenMetadata’s vector search embedding client framework, alongside the existing bedrock, openai, and djl providers.
Changes:
- Extended the ElasticSearch configuration schema with a new `naturalLanguageSearch.google` block and updated provider description text.
- Implemented `GoogleEmbeddingClient` (HTTP call, request/response JSON handling, error extraction, endpoint override support).
- Wired the new provider into `SearchRepository.createEmbeddingClient` and added a dedicated unit test suite.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| openmetadata-spec/src/main/resources/json/schema/configuration/elasticSearchConfiguration.json | Adds google provider config block under naturalLanguageSearch and updates provider description. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClient.java | New embedding client implementation for Gemini (Generative Language API). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | Adds google case to embedding client provider switch. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClientTest.java | Adds unit tests for Google embedding client behavior and request construction. |
| docs/superpowers/specs/2026-05-07-google-gemini-embedding-client-design.md | Design spec documenting the provider, config shape, and behavior. |
| docs/superpowers/plans/2026-05-07-google-gemini-embedding-client.md | Implementation plan and step-by-step checklist for the change. |
private HttpRequest buildRequest(String body) {
  String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
  String url = endpoint + "?key=" + encodedKey;
  return HttpRequest.newBuilder()
      .uri(URI.create(url))
      .header("Content-Type", "application/json")
      .timeout(Duration.ofSeconds(30))
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build();

  "default": 768
},
"endpoint": {
  "description": "Custom endpoint URL. Leave empty for the default Generative Language API.",
java.util.concurrent.atomic.AtomicReference<String> captured =
    new java.util.concurrent.atomic.AtomicReference<>();
java.util.concurrent.atomic.AtomicReference<Throwable> failure =
    new java.util.concurrent.atomic.AtomicReference<>();
request
    .bodyPublisher()
    .ifPresent(
        publisher -> {
          java.util.concurrent.Flow.Subscriber<java.nio.ByteBuffer> subscriber =
              new java.util.concurrent.Flow.Subscriber<>() {
                private final java.io.ByteArrayOutputStream out =
                    new java.io.ByteArrayOutputStream();

                @Override
                public void onSubscribe(java.util.concurrent.Flow.Subscription subscription) {
                  subscription.request(Long.MAX_VALUE);
                }

                @Override
                public void onNext(java.nio.ByteBuffer item) {
                  byte[] arr = new byte[item.remaining()];
                  item.get(arr);
                  out.write(arr, 0, arr.length);
                }

                @Override
                public void onError(Throwable throwable) {
                  failure.set(throwable);
                }

                @Override
                public void onComplete() {
                  captured.set(out.toString(java.nio.charset.StandardCharsets.UTF_8));
                }
              };
          publisher.subscribe(subscriber);
        });
if (failure.get() != null) {
  throw new RuntimeException("Body publisher failed", failure.get());
}
String body = captured.get();
if (body == null) {
  throw new IllegalStateException("Request had no body publisher");
}
return body;
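The helper above reads `captured` right after `subscribe` returns, so it only works if the publisher delivers its data synchronously. A minimal sketch of a variant that waits for `onComplete` through a `CompletableFuture` with a timeout is shown here; the class name `RequestBodyReader` and the 5-second limit are illustrative assumptions, not code taken from the PR.

```java
import java.io.ByteArrayOutputStream;
import java.net.http.HttpRequest;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Flow;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical test helper: drains an HttpRequest body publisher into a string.
final class RequestBodyReader {
  private RequestBodyReader() {}

  static String extractBody(HttpRequest request) {
    HttpRequest.BodyPublisher publisher =
        request
            .bodyPublisher()
            .orElseThrow(() -> new IllegalStateException("Request had no body publisher"));
    CompletableFuture<String> result = new CompletableFuture<>();
    publisher.subscribe(
        new Flow.Subscriber<ByteBuffer>() {
          private final ByteArrayOutputStream out = new ByteArrayOutputStream();

          @Override
          public void onSubscribe(Flow.Subscription subscription) {
            subscription.request(Long.MAX_VALUE); // ask for every chunk at once
          }

          @Override
          public void onNext(ByteBuffer item) {
            byte[] chunk = new byte[item.remaining()];
            item.get(chunk);
            out.write(chunk, 0, chunk.length);
          }

          @Override
          public void onError(Throwable throwable) {
            result.completeExceptionally(throwable); // surfaces through get() below
          }

          @Override
          public void onComplete() {
            result.complete(out.toString(StandardCharsets.UTF_8));
          }
        });
    try {
      // Assumed timeout of 5 seconds; waits even if delivery is asynchronous.
      return result.get(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while reading request body", e);
    } catch (ExecutionException | TimeoutException e) {
      throw new IllegalStateException("Body publisher failed", e);
    }
  }
}
```

Completing the future from both `onError` and `onComplete` means a publisher failure reaches the caller as an exception instead of a silent `null` body.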
✅ TypeScript Types Auto-Updated
The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.
🔴 Playwright Results: 2 failure(s), 15 flaky
✅ 4011 passed · ❌ 2 failed · 🟡 15 flaky · ⏭️ 86 skipped
Genuine Failures (failed on all attempts)
These were workflow scaffolding (design spec + implementation plan) generated by the superpowers brainstorming/planning flow; they belong in the local development trail, not the PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- GoogleEmbeddingClient.buildRequest: handle endpoint with existing query string by switching the key separator from '?' to '&' as needed; document why the API key travels in the URL (Google Generative Language API requirement, not Bearer-header).
- GoogleEmbeddingClient.extractErrorMessage: replace empty catch block with a trace-level log to comply with the 'no empty catch' standard.
- elasticSearchConfiguration.json: clarify google.endpoint description so operators know it must be the full ':embedContent' URL, not a base URL.
- GoogleEmbeddingClientTest.extractBody: await onComplete via CompletableFuture.get(5s) instead of relying on synchronous publisher delivery; surface onError properly.
- New test: testEndpointWithExistingQueryStringUsesAmpersand verifies the '?' / '&' separator logic.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
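The '?' / '&' separator rule described in this commit can be sketched as follows. This is a simplified illustration under assumptions: the class and method names (`ApiKeyUrls`, `withApiKey`) are hypothetical, and the actual `buildRequest` in the PR may differ in detail.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the separator choice; not the PR's code.
final class ApiKeyUrls {
  private ApiKeyUrls() {}

  /** Appends key=<apiKey>, using '&' when the endpoint already carries a query string. */
  static String withApiKey(String endpoint, String apiKey) {
    String separator = endpoint.contains("?") ? "&" : "?";
    // URL-encode the key so characters such as '/' or '+' cannot alter the query.
    return endpoint + separator + "key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
  }
}
```

For example, `withApiKey("https://proxy.example.com/embed?tenant=a", "k/1")` yields `https://proxy.example.com/embed?tenant=a&key=k%2F1`, while an endpoint without a query string gets `?key=...` instead.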
    return configured.replaceAll("/+$", "");
  }
  return DEFAULT_BASE_URL + config.getEmbeddingModelId() + ":embedContent";
}

private HttpRequest buildRequest(String body) {
  // Google's Generative Language API requires the API key as a `key=` query parameter;
  // it does not accept Bearer/Authorization headers for AI Studio keys.
  String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);

  "default": 768
},
"endpoint": {
  "description": "Optional override for the full embedding endpoint URL. Must be the complete URL including the model and `:embedContent` action (e.g. `https://generativelanguage.googleapis.com/v1beta/models/text-embedding-004:embedContent`), not just a base URL. Leave empty to use the default Generative Language API endpoint, which is constructed from `embeddingModelId`. The `key` query parameter is appended automatically.",
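A minimal sketch of the endpoint resolution this description implies: a configured override is used as the full URL (with trailing slashes stripped), otherwise the URL is built from the model ID. The class name `EndpointResolver` is hypothetical, and the value of `DEFAULT_BASE_URL` is an assumption inferred from the example URL in the description.

```java
// Hypothetical illustration of endpoint resolution; names are assumptions.
final class EndpointResolver {
  // Assumed default, inferred from the example URL in the schema description.
  static final String DEFAULT_BASE_URL =
      "https://generativelanguage.googleapis.com/v1beta/models/";

  private EndpointResolver() {}

  static String resolve(String configuredEndpoint, String modelId) {
    if (configuredEndpoint != null && !configuredEndpoint.isBlank()) {
      // An override must already be the complete ':embedContent' URL;
      // only trailing slashes are removed.
      return configuredEndpoint.trim().replaceAll("/+$", "");
    }
    return DEFAULT_BASE_URL + modelId + ":embedContent";
  }
}
```

Note that the override is not combined with the model ID, which is why the description asks operators for the complete URL rather than a base URL.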
- Add `google:` block under naturalLanguageSearch with env-var fallbacks (GOOGLE_API_KEY, GOOGLE_EMBEDDING_MODEL_ID, GOOGLE_EMBEDDING_DIMENSION, GOOGLE_API_ENDPOINT).
- Update embeddingProvider option list comment to include "google".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
    .withApiKey("test-key")
    .withEmbeddingModelId("text-embedding-004")
    .withEmbeddingDimension(768)
    .withEndpoint("https://proxy.example.com/v1/embed/");

"google": {
  "description": "Google Gemini configuration for embedding generation via the Generative Language API.",
  "type": "object",
  "javaType": "org.openmetadata.schema.service.configuration.elasticsearch.Google",
The previous default (text-embedding-004) is rejected on some Google projects with `404: not found for API version v1beta, or is not supported for embedContent`. Switch to gemini-embedding-001, the current GA model, available at v1beta and broadly accessible.
- GoogleEmbeddingClient.buildRequestBody: include outputDimensionality from the configured embeddingDimension. Required for gemini-embedding-001 (defaults to 3072 dims otherwise) and supported as a truncation hint by text-embedding-004.
- elasticSearchConfiguration.json + openmetadata.yaml: change default embeddingModelId to gemini-embedding-001 and document the outputDimensionality semantics on the embeddingDimension field.
- GoogleEmbeddingClientTest.testRequestBodyShape: assert outputDimensionality=768 in the captured body and use gemini-embedding-001 as the test fixture model.
- SystemRepository.getEmbeddingConfigurationMessage: add a `google` case so /api/v1/system/status surfaces the configured model/endpoint instead of "Unknown provider 'google'".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
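To make the request shape concrete, here is a rough sketch of an `embedContent` body carrying `outputDimensionality`. It is an illustration under assumptions, not the PR's `buildRequestBody`: the class `EmbedRequestBodies` is hypothetical, and the JSON is assembled by hand only to keep the example free of dependencies (the real client presumably uses a JSON library).

```java
// Hypothetical sketch of the embedContent request body; not the PR's implementation.
final class EmbedRequestBodies {
  private EmbedRequestBodies() {}

  /** Builds {"model":"models/<id>","content":{"parts":[{"text":...}]},"outputDimensionality":N}. */
  static String build(String modelId, String text, int outputDimensionality) {
    return "{\"model\":\"models/" + escape(modelId) + "\","
        + "\"content\":{\"parts\":[{\"text\":\"" + escape(text) + "\"}]},"
        + "\"outputDimensionality\":" + outputDimensionality + "}";
  }

  /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
  static String escape(String value) {
    StringBuilder sb = new StringBuilder(value.length() + 8);
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }
}
```

Passing the configured `embeddingDimension` (768 in the tests) as `outputDimensionality` keeps the returned vector length aligned with the index mapping, since gemini-embedding-001 would otherwise return 3072 dimensions as noted above.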
case "google" -> {
  String googleEndpoint =
      nullOrEmpty(nlpConfig.getGoogle().getEndpoint())
          ? "generativelanguage.googleapis.com"
          : nlpConfig.getGoogle().getEndpoint();
  yield String.format(
      "Google configuration: endpoint: %s, embeddingModelId: %s, embeddingDimension: %s",
      googleEndpoint,
      nlpConfig.getGoogle().getEmbeddingModelId(),
      nlpConfig.getGoogle().getEmbeddingDimension());
⚠️ Edge Case: NPE if google config is null in SystemRepository switch
The new "google" case at line 769 calls nlpConfig.getGoogle() three times without a null guard. If the provider string is set to "google" but the google block is absent or null in the YAML/JSON config, this will throw a NullPointerException — caught by the generic catch (Exception e) but returning a misleading "Error getting embedding configuration" instead of a clear diagnostic.
The GoogleEmbeddingClient constructor already validates this gracefully with IllegalArgumentException("Google configuration is required"), so this switch branch (which appears to be a diagnostic/info path) should mirror that pattern.
Suggested fix:
case "google" -> {
  Google googleCfg = nlpConfig.getGoogle();
  if (googleCfg == null) {
    yield "Google provider selected but google configuration block is missing";
  }
  String googleEndpoint =
      nullOrEmpty(googleCfg.getEndpoint())
          ? "generativelanguage.googleapis.com"
          : googleCfg.getEndpoint();
  yield String.format(
      "Google configuration: endpoint: %s, embeddingModelId: %s, embeddingDimension: %s",
      googleEndpoint,
      googleCfg.getEmbeddingModelId(),
      googleCfg.getEmbeddingDimension());
}



Summary
Adds a fourth embedding provider, Google Gemini via the Generative Language API, alongside the existing `openai`, `bedrock`, and `djl` providers. Operators can now point natural-language / semantic search at Gemini models using a single API key from Google AI Studio (no GCP project, service account, or OAuth setup required).
- `google` block under `naturalLanguageSearch` in `elasticSearchConfiguration.json` (apiKey, embeddingModelId, embeddingDimension, endpoint).
- `GoogleEmbeddingClient` mirroring `OpenAIEmbeddingClient`: API key in URL query string, `content.parts[].text` body, `embedding.values` response parsing, error-message extraction from Google's standard error envelope.
- `google` case in the `SearchRepository.createEmbeddingClient` switch.
- Unit tests using `HttpClient` stubs (no Mockito), covering construction, validation, success path, HTTP errors, malformed responses, request-shape verification, and URL encoding.

Defaults to `text-embedding-004` / 768 dim; supports `gemini-embedding-001` via configuration.

Test plan
- `mvn test -pl openmetadata-service -Dtest=GoogleEmbeddingClientTest`: 23 tests pass
- `mvn test -pl openmetadata-service -Dtest='*EmbeddingClientTest'`: 53 tests pass, no regressions in OpenAI/EmbeddingClient sibling tests
- `mvn spotless:apply -pl openmetadata-service`: clean

Out of scope
- A `googleVertexAi` provider
- Batch requests (`batchEmbedContents`): uses base-class default serial loop, matching siblings

Summary by Gitar
- Includes `outputDimensionality` in `GoogleEmbeddingClient` to align with the selected Google embedding model.
- Changes the default `embeddingModelId` from `text-embedding-004` to `gemini-embedding-001` across YAML, JSON schemas, and test suites.
- Adds `google` provider support to `SystemRepository` for proper status reporting and configuration validation.

This will update automatically on new commits.