feat(search): add Google Gemini embedding provider#27974

Open
pmbrull wants to merge 20 commits into main from pmbrull/gcp-embedding-client

Conversation

@pmbrull
Collaborator

@pmbrull pmbrull commented May 7, 2026

Summary

Adds a fourth embedding provider — Google Gemini via the Generative Language API — alongside the existing openai, bedrock, and djl providers. Operators can now point natural-language / semantic search at Gemini models using a single API key from Google AI Studio (no GCP project, service account, or OAuth setup required).

  • New google block under naturalLanguageSearch in elasticSearchConfiguration.json (apiKey, embeddingModelId, embeddingDimension, endpoint).
  • New GoogleEmbeddingClient mirroring OpenAIEmbeddingClient: API key in URL query string, content.parts[].text body, embedding.values response parsing, error-message extraction from Google's standard error envelope.
  • One-line wiring in SearchRepository.createEmbeddingClient switch.
  • 23 unit tests using hand-rolled HttpClient stubs (no Mockito) — covering construction, validation, success path, HTTP errors, malformed responses, request-shape verification, and URL encoding.

Defaults to text-embedding-004 / 768 dim; supports gemini-embedding-001 via configuration.
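
The "hand-rolled HttpClient stubs (no Mockito)" mentioned in the summary could look roughly like the sketch below. The names `StubHttpClient` and `StubResponse` are hypothetical and are not taken from the PR's test file; the sketch only assumes that the client under test accepts a `java.net.http.HttpClient` and reads string bodies.

```java
import java.net.Authenticator;
import java.net.CookieHandler;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpHeaders;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSession;

// Hypothetical stub (not the PR's test helper): returns one canned response
// and records the last request so a test can assert on its shape.
final class StubHttpClient extends HttpClient {
  private final int status;
  private final String body;
  HttpRequest lastRequest;

  StubHttpClient(int status, String body) {
    this.status = status;
    this.body = body;
  }

  @Override
  @SuppressWarnings("unchecked") // safe only while callers use BodyHandlers.ofString()
  public <T> HttpResponse<T> send(HttpRequest request, HttpResponse.BodyHandler<T> handler) {
    lastRequest = request;
    HttpResponse<String> response = new StubResponse(request, status, body);
    return (HttpResponse<T>) response;
  }

  @Override
  public <T> CompletableFuture<HttpResponse<T>> sendAsync(
      HttpRequest request, HttpResponse.BodyHandler<T> handler) {
    return CompletableFuture.completedFuture(send(request, handler));
  }

  @Override
  public <T> CompletableFuture<HttpResponse<T>> sendAsync(
      HttpRequest request,
      HttpResponse.BodyHandler<T> handler,
      HttpResponse.PushPromiseHandler<T> pushHandler) {
    return sendAsync(request, handler);
  }

  // The remaining abstract methods are not exercised by the stub.
  @Override public Optional<CookieHandler> cookieHandler() { return Optional.empty(); }
  @Override public Optional<Duration> connectTimeout() { return Optional.empty(); }
  @Override public Redirect followRedirects() { return Redirect.NEVER; }
  @Override public Optional<ProxySelector> proxy() { return Optional.empty(); }
  @Override public SSLContext sslContext() { return null; }
  @Override public SSLParameters sslParameters() { return null; }
  @Override public Optional<Authenticator> authenticator() { return Optional.empty(); }
  @Override public Version version() { return Version.HTTP_1_1; }
  @Override public Optional<Executor> executor() { return Optional.empty(); }
}

// Minimal HttpResponse<String>; the record accessors satisfy request(), statusCode(), body().
record StubResponse(HttpRequest request, int statusCode, String body)
    implements HttpResponse<String> {
  @Override public Optional<HttpResponse<String>> previousResponse() { return Optional.empty(); }
  @Override public HttpHeaders headers() { return HttpHeaders.of(Map.of(), (k, v) -> true); }
  @Override public Optional<SSLSession> sslSession() { return Optional.empty(); }
  @Override public URI uri() { return request.uri(); }
  @Override public HttpClient.Version version() { return HttpClient.Version.HTTP_1_1; }
}
```

A test would construct the stub with the status code and body it wants to simulate (for example a 200 with an `embedding.values` payload, or a 4xx with an error envelope), pass it to the client under test, and then inspect `lastRequest` for URL and body assertions.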

Test plan

  • mvn test -pl openmetadata-service -Dtest=GoogleEmbeddingClientTest — 23 tests pass
  • mvn test -pl openmetadata-service -Dtest='*EmbeddingClientTest' — 53 tests pass, no regressions in OpenAI/EmbeddingClient sibling tests
  • mvn spotless:apply -pl openmetadata-service — clean
  • Smoke test in a local environment with a real Google AI Studio API key

Out of scope

  • Vertex AI / service-account auth (future googleVertexAi provider)
  • Gemini chat-completions for NLQ query transformation (only embeddings here)
  • True batch endpoint (:batchEmbedContents) — uses base-class default serial loop, matching siblings
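
For readers unfamiliar with the "base-class default serial loop" mentioned in the last bullet, the idea can be sketched as follows. `EmbeddingSketch` and its method names are hypothetical; the actual base class in OpenMetadata may be shaped differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface illustrating a serial batch fallback.
interface EmbeddingSketch {
  // One provider call per text (for Gemini this would be one :embedContent request).
  float[] embed(String text);

  // Default batch behaviour: call embed() once per input, preserving order.
  // A provider with a true batch endpoint could override this method.
  default List<float[]> embedBatch(List<String> texts) {
    List<float[]> vectors = new ArrayList<>(texts.size());
    for (String text : texts) {
      vectors.add(embed(text));
    }
    return vectors;
  }
}
```

The trade-off is one HTTP round trip per input, which keeps the provider simple at the cost of throughput on large batches.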

Summary by Gitar

  • Core functionality:
    • Explicitly configured outputDimensionality in GoogleEmbeddingClient to align with the selected Google embedding model.
  • Configuration defaults:
    • Switched default embeddingModelId from text-embedding-004 to gemini-embedding-001 across YAML, JSON schemas, and test suites.
  • System visibility:
    • Added google provider support to SystemRepository for proper status reporting and configuration validation.

This will update automatically on new commits.

pmbrull and others added 13 commits May 7, 2026 16:24
Adds a fourth embedding provider (google) alongside openai/bedrock/djl,
using the Generative Language API with a single API key.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7 tasks covering schema change + regen, client implementation,
validation tests, error path tests, request shape tests, switch
wiring, and final verification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ient

The string "models/" appeared in both DEFAULT_BASE_URL and the buildRequestBody
method. Extract it as a named constant per project standards.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ound

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…shape

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t comment

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 7, 2026 17:34
@github-actions github-actions Bot added the Ingestion and safe to test (Add this label to run secure Github workflows on PRs) labels May 7, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a new google embedding provider (Google Gemini / Generative Language API) to OpenMetadata’s vector search embedding client framework, alongside the existing bedrock, openai, and djl providers.

Changes:

  • Extended the ElasticSearch configuration schema with a new naturalLanguageSearch.google block and updated provider description text.
  • Implemented GoogleEmbeddingClient (HTTP call, request/response JSON handling, error extraction, endpoint override support).
  • Wired the new provider into SearchRepository.createEmbeddingClient and added a dedicated unit test suite.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| openmetadata-spec/src/main/resources/json/schema/configuration/elasticSearchConfiguration.json | Adds google provider config block under naturalLanguageSearch and updates provider description. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClient.java | New embedding client implementation for Gemini (Generative Language API). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | Adds google case to embedding client provider switch. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClientTest.java | Adds unit tests for Google embedding client behavior and request construction. |
| docs/superpowers/specs/2026-05-07-google-gemini-embedding-client-design.md | Design spec documenting the provider, config shape, and behavior. |
| docs/superpowers/plans/2026-05-07-google-gemini-embedding-client.md | Implementation plan and step-by-step checklist for the change. |

Comment on lines +139 to +147
private HttpRequest buildRequest(String body) {
  String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
  String url = endpoint + "?key=" + encodedKey;
  return HttpRequest.newBuilder()
      .uri(URI.create(url))
      .header("Content-Type", "application/json")
      .timeout(Duration.ofSeconds(30))
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build();

    "default": 768
  },
  "endpoint": {
    "description": "Custom endpoint URL. Leave empty for the default Generative Language API.",
Comment on lines +498 to +542
java.util.concurrent.atomic.AtomicReference<String> captured =
    new java.util.concurrent.atomic.AtomicReference<>();
java.util.concurrent.atomic.AtomicReference<Throwable> failure =
    new java.util.concurrent.atomic.AtomicReference<>();
request
    .bodyPublisher()
    .ifPresent(
        publisher -> {
          java.util.concurrent.Flow.Subscriber<java.nio.ByteBuffer> subscriber =
              new java.util.concurrent.Flow.Subscriber<>() {
                private final java.io.ByteArrayOutputStream out =
                    new java.io.ByteArrayOutputStream();

                @Override
                public void onSubscribe(java.util.concurrent.Flow.Subscription subscription) {
                  subscription.request(Long.MAX_VALUE);
                }

                @Override
                public void onNext(java.nio.ByteBuffer item) {
                  byte[] arr = new byte[item.remaining()];
                  item.get(arr);
                  out.write(arr, 0, arr.length);
                }

                @Override
                public void onError(Throwable throwable) {
                  failure.set(throwable);
                }

                @Override
                public void onComplete() {
                  captured.set(out.toString(java.nio.charset.StandardCharsets.UTF_8));
                }
              };
          publisher.subscribe(subscriber);
        });
if (failure.get() != null) {
  throw new RuntimeException("Body publisher failed", failure.get());
}
String body = captured.get();
if (body == null) {
  throw new IllegalStateException("Request had no body publisher");
}
return body;
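
A later commit in this PR reworks this helper to await onComplete through a CompletableFuture with a 5 second timeout instead of relying on synchronous delivery. A minimal sketch of that approach, assuming a hypothetical `BodyCapture` holder class and using imports in place of fully qualified names, might look like this:

```java
import java.io.ByteArrayOutputStream;
import java.net.http.HttpRequest;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;
import java.util.concurrent.TimeUnit;

// Hypothetical test helper; the PR's actual extractBody may differ in detail.
final class BodyCapture {
  private BodyCapture() {}

  static String extractBody(HttpRequest request) throws Exception {
    HttpRequest.BodyPublisher publisher =
        request
            .bodyPublisher()
            .orElseThrow(() -> new IllegalStateException("Request had no body publisher"));
    CompletableFuture<String> done = new CompletableFuture<>();
    publisher.subscribe(
        new Flow.Subscriber<ByteBuffer>() {
          private final ByteArrayOutputStream out = new ByteArrayOutputStream();

          @Override
          public void onSubscribe(Flow.Subscription subscription) {
            subscription.request(Long.MAX_VALUE); // ask for every chunk
          }

          @Override
          public void onNext(ByteBuffer item) {
            byte[] chunk = new byte[item.remaining()];
            item.get(chunk);
            out.write(chunk, 0, chunk.length);
          }

          @Override
          public void onError(Throwable throwable) {
            done.completeExceptionally(throwable); // surfaces through get() below
          }

          @Override
          public void onComplete() {
            done.complete(out.toString(StandardCharsets.UTF_8));
          }
        });
    // Blocks until onComplete or onError fires, or fails after 5 seconds.
    return done.get(5, TimeUnit.SECONDS);
  }
}
```

Completing a future from the subscriber callbacks means the helper behaves the same whether the publisher delivers on the calling thread or on another one.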
@github-actions
Contributor

github-actions Bot commented May 7, 2026

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions github-actions Bot requested a review from a team as a code owner May 7, 2026 17:40
@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔴 Playwright Results — 2 failure(s), 15 flaky

✅ 4011 passed · ❌ 2 failed · 🟡 15 flaky · ⏭️ 86 skipped

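| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| ✅ Shard 1 | 299 | 0 | 0 | 4 |
| 🟡 Shard 2 | 748 | 0 | 7 | 8 |
| 🔴 Shard 3 | 755 | 2 | 2 | 7 |
| ✅ Shard 4 | 790 | 0 | 0 | 18 |
| 🟡 Shard 5 | 685 | 0 | 2 | 41 |
| 🟡 Shard 6 | 734 | 0 | 4 | 8 |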

Genuine Failures (failed on all attempts)

Features/Tasks/TaskNavigation.spec.ts › clicking task notification while on entity task tab refreshes the task list (shard 3)
Error: expect(locator).toBeVisible() failed

Locator: locator('.notification-box').locator('li.ant-list-item.notification-dropdown-list-btn').first()
Expected: visible
Timeout: 15000ms
Error: element(s) not found

Call log:
  - Expect "toBeVisible" with timeout 15000ms
  - waiting for locator('.notification-box').locator('li.ant-list-item.notification-dropdown-list-btn').first()

Features/Tasks/TaskNavigation.spec.ts › two sessions: admin on Columns tab creates task, assignee sees refresh on notification click (shard 3)
Error: expect(locator).toBeVisible() failed

Locator: locator('.notification-box').locator('li.ant-list-item.notification-dropdown-list-btn').first()
Expected: visible
Timeout: 15000ms
Error: element(s) not found

Call log:
  - Expect "toBeVisible" with timeout 15000ms
  - waiting for locator('.notification-box').locator('li.ant-list-item.notification-dropdown-list-btn').first()

🟡 15 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataProductDomainMigration.spec.ts › Data product with no assets can change domain without confirmation (shard 2, 1 retry)
  • Features/DataQuality/IncidentManagerDateFilter.spec.ts › Select preset date range (shard 2, 1 retry)
  • Features/Glossary/GlossaryHierarchy.spec.ts › should cancel move operation (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)
  • Pages/Users.spec.ts › Create and Delete user (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

harshach previously approved these changes May 8, 2026
These were workflow scaffolding (design spec + implementation plan)
generated by the superpowers brainstorming/planning flow; they belong
in the local development trail, not the PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 8, 2026 05:50
- GoogleEmbeddingClient.buildRequest: handle endpoint with existing query
  string by switching the key separator from '?' to '&' as needed; document
  why the API key travels in the URL (Google Generative Language API
  requirement, not Bearer-header).
- GoogleEmbeddingClient.extractErrorMessage: replace empty catch block with
  a trace-level log to comply with the 'no empty catch' standard.
- elasticSearchConfiguration.json: clarify google.endpoint description so
  operators know it must be the full ':embedContent' URL, not a base URL.
- GoogleEmbeddingClientTest.extractBody: await onComplete via
  CompletableFuture.get(5s) instead of relying on synchronous publisher
  delivery; surface onError properly.
- New test: testEndpointWithExistingQueryStringUsesAmpersand verifies the
  '?' / '&' separator logic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
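
The endpoint handling described in this commit can be sketched as below. `EndpointSketch`, `resolveEndpoint`, and `withApiKey` are hypothetical names, and the `DEFAULT_BASE_URL` value is an assumption inferred from the example URL in the schema description rather than copied from the PR.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating endpoint resolution and key placement.
final class EndpointSketch {
  // Assumed value; the PR only states that the constant contains "models/".
  static final String DEFAULT_BASE_URL =
      "https://generativelanguage.googleapis.com/v1beta/models/";

  private EndpointSketch() {}

  // Use the configured override (minus trailing slashes) or build the default URL.
  static String resolveEndpoint(String configured, String modelId) {
    if (configured != null && !configured.isBlank()) {
      return configured.replaceAll("/+$", "");
    }
    return DEFAULT_BASE_URL + modelId + ":embedContent";
  }

  // Append the API key, switching the separator when a query string already exists.
  static String withApiKey(String endpoint, String apiKey) {
    String separator = endpoint.contains("?") ? "&" : "?";
    return endpoint + separator + "key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
  }
}
```

With this shape, an override such as `https://proxy.example.com/v1/embed?alt=json` would receive `&key=...`, while the default endpoint would receive `?key=...`.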
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 6 changed files in this pull request and generated 3 comments.

Comment on lines +96 to +100
return configured.replaceAll("/+$", "");
}
return DEFAULT_BASE_URL + config.getEmbeddingModelId() + ":embedContent";
}

Comment on lines +139 to +142
private HttpRequest buildRequest(String body) {
// Google's Generative Language API requires the API key as a `key=` query parameter;
// it does not accept Bearer/Authorization headers for AI Studio keys.
String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
"default": 768
},
"endpoint": {
"description": "Optional override for the full embedding endpoint URL. Must be the complete URL including the model and `:embedContent` action (e.g. `https://generativelanguage.googleapis.com/v1beta/models/text-embedding-004:embedContent`), not just a base URL. Leave empty to use the default Generative Language API endpoint, which is constructed from `embeddingModelId`. The `key` query parameter is appended automatically.",
@github-actions
Contributor

github-actions Bot commented May 8, 2026

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions
Contributor

github-actions Bot commented May 8, 2026

Jest test Coverage

UI tests summary

| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| Coverage: 62% | 62.44% (63078/101007) | 42.82% (34067/79555) | 45.8% (10063/21967) |

- Add `google:` block under naturalLanguageSearch with env-var fallbacks
  (GOOGLE_API_KEY, GOOGLE_EMBEDDING_MODEL_ID, GOOGLE_EMBEDDING_DIMENSION,
  GOOGLE_API_ENDPOINT).
- Update embeddingProvider option list comment to include "google".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 8, 2026 06:27
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 2 comments.

.withApiKey("test-key")
.withEmbeddingModelId("text-embedding-004")
.withEmbeddingDimension(768)
.withEndpoint("https://proxy.example.com/v1/embed/");
"google": {
"description": "Google Gemini configuration for embedding generation via the Generative Language API.",
"type": "object",
"javaType": "org.openmetadata.schema.service.configuration.elasticsearch.Google",
@sonarqubecloud

sonarqubecloud Bot commented May 8, 2026

The previous default (text-embedding-004) is rejected on some Google
projects with `404: not found for API version v1beta, or is not
supported for embedContent`. Switch to gemini-embedding-001 — the
current GA model, available at v1beta and broadly accessible.

- GoogleEmbeddingClient.buildRequestBody: include outputDimensionality
  from the configured embeddingDimension. Required for gemini-embedding-001
  (defaults to 3072 dims otherwise) and supported as a truncation hint
  by text-embedding-004.
- elasticSearchConfiguration.json + openmetadata.yaml: change default
  embeddingModelId to gemini-embedding-001 and document the
  outputDimensionality semantics on the embeddingDimension field.
- GoogleEmbeddingClientTest.testRequestBodyShape: assert
  outputDimensionality=768 in the captured body and use
  gemini-embedding-001 as the test fixture model.
- SystemRepository.getEmbeddingConfigurationMessage: add a `google` case
  so /api/v1/system/status surfaces the configured model/endpoint
  instead of "Unknown provider 'google'".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
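
To make the request shape concrete, the sketch below builds an embedContent body with the `models/` prefix, the `content.parts[].text` payload, and `outputDimensionality`. It assumes the field sits at the top level of the request, which this PR text does not spell out, and it hand-rolls JSON escaping only to stay dependency-free; the real client presumably uses a JSON library. `RequestBodySketch` and `escape` are hypothetical names.

```java
// Hypothetical body builder; not the PR's buildRequestBody implementation.
final class RequestBodySketch {
  private RequestBodySketch() {}

  static String buildRequestBody(String modelId, String text, int dimension) {
    return "{\"model\":\"models/" + escape(modelId) + "\","
        + "\"content\":{\"parts\":[{\"text\":\"" + escape(text) + "\"}]},"
        + "\"outputDimensionality\":" + dimension + "}";
  }

  // Minimal JSON string escaping: quotes, backslashes, and control characters.
  static String escape(String value) {
    StringBuilder sb = new StringBuilder(value.length() + 8);
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '"' -> sb.append("\\\"");
        case '\\' -> sb.append("\\\\");
        case '\n' -> sb.append("\\n");
        case '\r' -> sb.append("\\r");
        case '\t' -> sb.append("\\t");
        default -> {
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
        }
      }
    }
    return sb.toString();
  }
}
```

Passing the configured embeddingDimension (768 by default) keeps the returned vector length aligned with the index mapping, since the commit notes that gemini-embedding-001 otherwise returns 3072 dimensions.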
Comment on lines +769 to +778
case "google" -> {
String googleEndpoint =
nullOrEmpty(nlpConfig.getGoogle().getEndpoint())
? "generativelanguage.googleapis.com"
: nlpConfig.getGoogle().getEndpoint();
yield String.format(
"Google configuration: endpoint: %s, embeddingModelId: %s, embeddingDimension: %s",
googleEndpoint,
nlpConfig.getGoogle().getEmbeddingModelId(),
nlpConfig.getGoogle().getEmbeddingDimension());

⚠️ Edge Case: NPE if google config is null in SystemRepository switch

The new "google" case at line 769 calls nlpConfig.getGoogle() three times without a null guard. If the provider string is set to "google" but the google block is absent or null in the YAML/JSON config, this will throw a NullPointerException — caught by the generic catch (Exception e) but returning a misleading "Error getting embedding configuration" instead of a clear diagnostic.

The GoogleEmbeddingClient constructor already validates this gracefully with IllegalArgumentException("Google configuration is required"), so this switch branch (which appears to be a diagnostic/info path) should mirror that pattern.

Suggested fix:

case "google" -> {
  Google googleCfg = nlpConfig.getGoogle();
  if (googleCfg == null) {
    yield "Google provider selected but google configuration block is missing";
  }
  String googleEndpoint =
      nullOrEmpty(googleCfg.getEndpoint())
          ? "generativelanguage.googleapis.com"
          : googleCfg.getEndpoint();
  yield String.format(
      "Google configuration: endpoint: %s, embeddingModelId: %s, embeddingDimension: %s",
      googleEndpoint,
      googleCfg.getEmbeddingModelId(),
      googleCfg.getEmbeddingDimension());
}

@github-actions
Contributor

github-actions Bot commented May 8, 2026

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

Copilot AI review requested due to automatic review settings May 8, 2026 07:39
@pmbrull pmbrull review requested due to automatic review settings May 8, 2026 07:39
@gitar-bot

gitar-bot Bot commented May 8, 2026

Code Review ⚠️ Changes requested 2 resolved / 3 findings

Integrates the Google Gemini embedding provider into the SearchRepository, but introduces a potential NullPointerException if the google configuration is missing in SystemRepository. The empty catch block in the error extractor has been successfully addressed.

⚠️ Edge Case: NPE if google config is null in SystemRepository switch

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:769-778

The new "google" case at line 769 calls nlpConfig.getGoogle() three times without a null guard. If the provider string is set to "google" but the google block is absent or null in the YAML/JSON config, this will throw a NullPointerException — caught by the generic catch (Exception e) but returning a misleading "Error getting embedding configuration" instead of a clear diagnostic.

The GoogleEmbeddingClient constructor already validates this gracefully with IllegalArgumentException("Google configuration is required"), so this switch branch (which appears to be a diagnostic/info path) should mirror that pattern.

Suggested fix
case "google" -> {
  Google googleCfg = nlpConfig.getGoogle();
  if (googleCfg == null) {
    yield "Google provider selected but google configuration block is missing";
  }
  String googleEndpoint =
      nullOrEmpty(googleCfg.getEndpoint())
          ? "generativelanguage.googleapis.com"
          : googleCfg.getEndpoint();
  yield String.format(
      "Google configuration: endpoint: %s, embeddingModelId: %s, embeddingDimension: %s",
      googleEndpoint,
      googleCfg.getEmbeddingModelId(),
      googleCfg.getEmbeddingDimension());
}
✅ 2 resolved
Quality: Empty catch block in extractErrorMessage violates guidelines

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClient.java:188
The custom instructions state 'No empty catch blocks. Log exceptions with context.' The catch (Exception ignored) {} at line 188 silently swallows parse failures. While this matches the sibling OpenAIEmbeddingClient pattern and the fallback behavior (returning raw body) is correct, adding a trace-level log would aid debugging when error responses have unexpected formats.

Security: API key in URL query string may leak via server access logs

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClient.java:140-141
The Google Generative Language API requires the key in the query string (?key=), so this is by design and matches Google's documentation. However, it's worth noting that unlike bearer tokens in headers, query-string secrets can appear in proxy/CDN access logs, browser history (not applicable here), and HTTP Referer headers. Since this is a server-side call with no browser or Referer risk, and Google mandates this pattern for AI Studio keys, the practical risk is low. Consider adding a brief comment documenting why the key is in the URL (Google API requirement) to prevent future reviewers from 'fixing' it to a header.
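
One way to limit the log exposure described in this finding is to mask the key before a request URL is ever written to a log or exception message. The helper below is purely illustrative; `UrlRedaction` and `redactKey` are hypothetical names and nothing in this PR indicates that such a helper exists in the codebase.

```java
import java.net.URI;

// Hypothetical helper: masks the value of the `key` query parameter.
final class UrlRedaction {
  private UrlRedaction() {}

  static String redactKey(URI uri) {
    // Matches "?key=..." or "&key=..." up to the next '&' or '#'.
    return uri.toString().replaceAll("([?&]key=)[^&#]*", "$1***");
  }
}
```

This only protects logs produced by the calling service; intermediaries such as proxies still see the full URL.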


Labels

Ingestion, safe to test (Add this label to run secure Github workflows on PRs)

3 participants