2 changes: 1 addition & 1 deletion deploy-manage/_snippets/cc-license-and-payment.md
@@ -2,4 +2,4 @@ Each cloud connected service has its own licensing and payment requirements.

* AutoOps for ECE, ECK, and self-managed clusters is available for free across all [self-managed license types](https://www.elastic.co/subscriptions). It does not consume ECUs.

* The Elastic {{infer-cap}} Service (EIS) for ECE, ECK, and self-managed clusters requires a [self-managed Enterprise license](https://www.elastic.co/subscriptions) or a self-managed free trial. Note that [EIS pricing](/explore-analyze/elastic-inference/eis-supported-models.md#pricing) is usage-based. Using EIS consumes ECUs.
* The Elastic {{infer-cap}} Service (EIS) for ECE, ECK, and self-managed clusters requires a [self-managed Enterprise license](https://www.elastic.co/subscriptions) or a self-managed free trial. Note that [EIS pricing](/explore-analyze/elastic-inference/eis.md#pricing) is usage-based. Using EIS consumes ECUs.
@@ -180,6 +180,6 @@ For these models, you only need to create new {{infer}} endpoints if you want to

## Regions and billing

For information about EIS regions and request routing, refer to [Region and hosting](/explore-analyze/elastic-inference/eis-supported-models.md#eis-regions).
For information about EIS regions and request routing, refer to [Region and hosting](eis-region-and-hosting.md).

EIS is billed per million tokens and consumes ECUs. For details on pricing and usage tracking, refer to [Pricing](/explore-analyze/elastic-inference/eis-supported-models.md#pricing) and [Monitor your token usage](/explore-analyze/elastic-inference/eis-supported-models.md#monitor-your-token-usage).
EIS is billed per million tokens and consumes ECUs. For details on pricing and usage tracking, refer to [Pricing](eis.md#pricing) and [Monitor your token usage](eis.md#monitor-your-token-usage).
25 changes: 25 additions & 0 deletions explore-analyze/elastic-inference/eis-rate-limits.md
@@ -0,0 +1,25 @@
---
navigation_title: Rate limits
applies_to:
  stack: ga
  serverless: ga
description: Learn about rate limits for Elastic Inference Service (EIS) models.
---

# Rate limits [eis-rate-limits]

This page lists the rate limits that apply to Elastic {{infer-cap}} Service (EIS) models.

Exceeding a limit results in HTTP 429 responses from the server until the sliding window advances and part of the limit resets.
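
A client can absorb occasional HTTP 429 responses by retrying with exponential backoff. The following is a minimal sketch, not an official client: `call_with_backoff`, its parameters, and the `send` callable (assumed to return a status code and a body) are hypothetical names introduced for illustration, and the delays are arbitrary starting points.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() and retry with exponential backoff while it returns HTTP 429.

    `send` is a hypothetical callable that performs one request and returns
    (status_code, body). The last response is returned, even if it is a 429.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Double the wait on each retry, cap it at max_delay, and add up to
        # 10% jitter so that parallel clients do not retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay / 10))
        status, body = send()
    return status, body
```

Because the window slides, waiting a short time and retrying is usually enough for part of the budget to become available again. Passing `sleep` as a parameter makes the helper easy to exercise in tests without real delays.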

| Model | Requests/minute | Tokens/minute (ingest) | Tokens/minute (search) | Notes |
|---------------------------------------------------|-----------------|-------------------------|-------------------------|--------------------------|
| Elastic Managed LLMs {applies_to}`stack: ga 9.3+` | 2,000 | - | - | No rate limit on tokens |
| ELSER {applies_to}`stack: ga 9.0+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 Nano {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 Small {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v3 {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 (Small) {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 (Nano) {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Reranker v2 {applies_to}`stack: ga 9.3+` | 600 | - | 6,000,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Reranker v3 {applies_to}`stack: ga 9.3+` | 600 | - | 6,000,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
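
To stay under both limits, a client can track its own requests and tokens over the last minute. The sketch below is a client-side approximation only: the page does not describe how the service implements its sliding window, and `SlidingWindowBudget` and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowBudget:
    """Tracks requests and tokens sent during the last `window_seconds`."""

    def __init__(self, max_requests: int, max_tokens: int,
                 window_seconds: float = 60.0) -> None:
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self._tokens_in_window = 0

    def _evict(self, now: float) -> None:
        # Drop requests that have left the window and free their tokens.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens: int, now: float) -> bool:
        """Record a request if it fits in both budgets; return False otherwise."""
        self._evict(now)
        if len(self._events) + 1 > self.max_requests:
            return False  # request limit reached first
        if self._tokens_in_window + tokens > self.max_tokens:
            return False  # token limit reached first
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True
```

For example, `SlidingWindowBudget(max_requests=6000, max_tokens=600_000)` mirrors the ELSER search limits in the table. Pass `time.monotonic()` as `now` in real use; the explicit parameter keeps the sketch deterministic.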
25 changes: 25 additions & 0 deletions explore-analyze/elastic-inference/eis-region-and-hosting.md
@@ -0,0 +1,25 @@
---
navigation_title: Region and hosting
applies_to:
  stack: ga
  serverless: ga
description: Learn which regions host Elastic Inference Service (EIS) and how inference requests are routed.
---

# Region and hosting [eis-regions]

This page lists the {{aws}} and {{gcp}} regions where Elastic {{infer-cap}} Service (EIS) is available and explains how {{infer}} requests are routed.

**{{aws}}:**

* `us-east-1` (Virginia)

**{{gcp}}:**

* `asia-southeast1` (Singapore)
* `europe-west1` (Belgium)
* `us-east4` (Virginia)

All {{infer}} requests sent through EIS are routed to the nearest region, regardless of where your {{es}} deployment or {{serverless-short}} project is hosted.

Depending on the model being used, request processing might involve Elastic {{infer}} infrastructure and, in some cases, trusted third-party model providers. For example, ELSER and Jina requests are processed entirely within Elastic {{infer}} infrastructure. Other models, such as large language models or third-party embedding models, might involve additional processing by their respective model providers, which can operate in different cloud platforms or regions.

60 changes: 2 additions & 58 deletions explore-analyze/elastic-inference/eis-supported-models.md
@@ -20,6 +20,8 @@ The corresponding {{kib}} connectors and {{infer}} endpoints for these models ar
The **{{infer-cap}} Regions** column shows the regions where {{infer}} requests are processed and where data is sent.
::::

For region availability and request routing, refer to [Region and hosting](eis-region-and-hosting.md). For rate limits, refer to [Rate limits](eis-rate-limits.md).

### LLM chat models

:::{csv-include} chat-models.csv
@@ -45,61 +47,3 @@ The **{{infer-cap}} Regions** column shows the regions where {{infer}} requests
* After the listed end-of-life (EOL) date, the model is no longer available for {{infer}} use and requests will fail. You need to actively transition to another model before the EOL date; there is no automated migration.
* Elastic makes every effort to use third-party providers that do not use inputs to train models and do not retain any data (zero data retention). Browse the tables on this page to double-check the status of a specific model.
::::

## Region and hosting [eis-regions]

Elastic {{infer-cap}} Service is currently available in these regions:

**AWS:**

* `us-east-1` (Virginia)

**GCP:**

* `asia-southeast1` (Singapore)
* `europe-west1` (Belgium)
* `us-east4` (Virginia)

All {{infer}} requests sent through EIS are routed to the nearest region, regardless of where your {{es}} deployment or {{serverless-short}} project is hosted.

Depending on the model being used, request processing might involve Elastic {{infer}} infrastructure and, in some cases, trusted third-party model providers. For example, ELSER and Jina requests are processed entirely within Elastic {{infer}} infrastructure. Other models, such as large language models or third-party embedding models, might involve additional processing by their respective model providers, which can operate in different cloud platforms or regions.

## Rate limits

The service enforces rate limits on an ongoing basis. Exceeding a limit results in HTTP 429 responses from the server until the sliding window advances and part of the limit resets.

| Model | Requests/minute | Tokens/minute (ingest) | Tokens/minute (search) | Notes |
|---------------------------------------------------|-----------------|-------------------------|-------------------------|--------------------------|
| Elastic Managed LLMs {applies_to}`stack: ga 9.3+` | 2,000 | - | - | No rate limit on tokens |
| ELSER {applies_to}`stack: ga 9.0+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 Nano {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 Small {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v3 {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 (Small) {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Embeddings v5 (Nano) {applies_to}`stack: ga 9.3+` | 6,000 | 6,000,000 | 600,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Reranker v2 {applies_to}`stack: ga 9.3+` | 600 | - | 6,000,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |
| Jina Reranker v3 {applies_to}`stack: ga 9.3+` | 600 | - | 6,000,000 | Limits are applied to both requests per minute and tokens per minute, whichever limit is reached first. |

## Pricing

All models on EIS incur a charge per million tokens. Certain LLM providers charge different prices depending on the prompt size. The pricing details are available on our [Pricing page](https://www.elastic.co/pricing/serverless-search).

This pricing model differs from that of the existing [Machine Learning Nodes](https://www.elastic.co/docs/explore-analyze/machine-learning/data-frame-analytics/ml-trained-models), which are billed by VCUs consumed.

### Token-based billing

EIS is billed per million tokens used:

* For **chat** models, input and output tokens are billed. Longer conversations with extensive context or detailed responses will consume more tokens.
* For **embeddings** models, only input tokens are billed.

Tokens are the fundamental units that language models process for both input and output. Tokenizers convert text into numerical data by segmenting it into subword units. A token can be a complete word, part of a word, or a punctuation mark, depending on the model's trained tokenizer and the frequency patterns in its training data.

For example, the sentence `It was the best of times, it was the worst of times.` contains 52 characters but would tokenize into approximately 14 tokens with a typical word-based approach, though the exact count varies by tokenizer.

### Monitor your token usage

To track your token consumption:

1. Navigate to [**Billing > Usage**](https://cloud.elastic.co/billing/usage) in the {{ecloud}} Console.
2. Look for line items where the **Billing dimension** is set to "Inference".
24 changes: 24 additions & 0 deletions explore-analyze/elastic-inference/eis.md
@@ -213,3 +213,27 @@ You can now use `semantic_text` with the new ELSER endpoint on EIS. To learn how
##### Get started with semantic search with ELSER on EIS

[Semantic Search with `semantic_text`](/solutions/search/semantic-search/semantic-search-semantic-text.md) has a detailed tutorial on using the `semantic_text` field and using the ELSER endpoint on EIS instead of the default endpoint. This is a great way to get started and try the new endpoint.

## Pricing [pricing]

All models on EIS incur a charge per million tokens. Certain LLM providers charge different prices depending on the prompt size. The pricing details are available on our [Pricing page](https://www.elastic.co/pricing/serverless-search).

This pricing model differs from that of the existing [Machine Learning Nodes](https://www.elastic.co/docs/explore-analyze/machine-learning/data-frame-analytics/ml-trained-models), which are billed by VCUs consumed.

### Token-based billing

EIS is billed per million tokens used:

* For **chat** models, input and output tokens are billed. Longer conversations with extensive context or detailed responses will consume more tokens.
* For **embeddings** models, only input tokens are billed.
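
The two billing rules above can be sketched as a small calculation. This is an illustration only: the function names are hypothetical, the price passed in is a made-up placeholder rather than an actual EIS rate, and other model types (such as rerankers) are not covered because this section does not describe them.

```python
def billable_tokens(model_type: str, input_tokens: int, output_tokens: int = 0) -> int:
    """Return the token count that is billed for one workload."""
    if model_type == "chat":
        # Chat models: both input and output tokens are billed.
        return input_tokens + output_tokens
    if model_type == "embeddings":
        # Embeddings models: only input tokens are billed.
        return input_tokens
    raise ValueError(f"billing rule not described for model type: {model_type!r}")


def estimate_cost(tokens: int, price_per_million: float) -> float:
    """Scale a per-million-token price (placeholder value) to a token count."""
    return tokens / 1_000_000 * price_per_million
```

For instance, a chat workload with 1,200,000 input tokens and 300,000 output tokens bills 1,500,000 tokens, while the same input sent to an embeddings model bills 1,200,000. Take the real per-million price from the Pricing page.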

Tokens are the fundamental units that language models process for both input and output. Tokenizers convert text into numerical data by segmenting it into subword units. A token can be a complete word, part of a word, or a punctuation mark, depending on the model's trained tokenizer and the frequency patterns in its training data.

For example, the sentence `It was the best of times, it was the worst of times.` contains 52 characters but would tokenize into approximately 14 tokens with a typical word-based approach, though the exact count varies by tokenizer.
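
The counts in this example can be reproduced with a simple word-level split. This is a rough sketch that treats each word and each punctuation mark as one token; it is not the tokenizer of any EIS model, and real subword tokenizers will give different counts. The helper name `word_level_tokens` is hypothetical.

```python
import re
from typing import List

SENTENCE = "It was the best of times, it was the worst of times."


def word_level_tokens(text: str) -> List[str]:
    """Split text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = word_level_tokens(SENTENCE)
print(len(SENTENCE))  # 52 characters
print(len(tokens))    # 14 tokens: 12 words plus a comma and a period
```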

### Monitor your token usage [monitor-your-token-usage]

To track your token consumption:

1. Navigate to [**Billing > Usage**](https://cloud.elastic.co/billing/usage) in the {{ecloud}} Console.
2. Look for line items where the **Billing dimension** is set to "Inference".
2 changes: 2 additions & 0 deletions explore-analyze/toc.yml
@@ -44,6 +44,8 @@ toc:
- file: elastic-inference/eis.md
children:
- file: elastic-inference/eis-supported-models.md
- file: elastic-inference/eis-region-and-hosting.md
- file: elastic-inference/eis-rate-limits.md
- file: elastic-inference/connect-self-managed-cluster-to-eis.md
- hidden: elastic-inference/ml-node-vs-eis.md
- file: elastic-inference/external.md
13 changes: 13 additions & 0 deletions redirects.yml
@@ -874,6 +874,19 @@ redirects:
- to: 'explore-analyze/elastic-inference/eis-supported-models.md'
anchors:
'supported-models':
'explore-analyze/elastic-inference/eis-supported-models.md':
to: 'explore-analyze/elastic-inference/eis-supported-models.md'
many:
- to: 'explore-analyze/elastic-inference/eis-region-and-hosting.md'
anchors:
'eis-regions': 'eis-regions'
- to: 'explore-analyze/elastic-inference/eis-rate-limits.md'
anchors:
'rate-limits': 'eis-rate-limits'
- to: 'explore-analyze/elastic-inference/eis.md'
anchors:
'pricing': 'pricing'
'monitor-your-token-usage': 'monitor-your-token-usage'
# Split off links to inference UI pages
'explore-analyze/elastic-inference/inference-api.md':
to: 'explore-analyze/elastic-inference/inference-api.md'