-
Notifications
You must be signed in to change notification settings - Fork 860
Clarify WriteAsync contract: add explicit IngestionDocument parameter to WriteAsync #7433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: data-ingestion-preview2
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,12 +15,13 @@ namespace Microsoft.Extensions.DataIngestion; | |
| public abstract class IngestionChunkWriter<T> : IDisposable | ||
| { | ||
| /// <summary> | ||
| /// Writes chunks asynchronously. | ||
| /// Writes the chunks of a single document asynchronously. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unimportant: do we still add "asynchronously" to the docs of each and every async function (especially when these don't have a corresponding synchronous overload)? Seems a bit useless to me (but obviously let's follow latest practices and patterns). |
||
| /// </summary> | ||
| /// <param name="chunks">The chunks to write.</param> | ||
| /// <param name="document">The document from which the chunks were extracted.</param> | ||
| /// <param name="cancellationToken">The token to monitor for cancellation requests.</param> | ||
| /// <returns>A task representing the asynchronous write operation.</returns> | ||
| public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default); | ||
| public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, IngestionDocument document, CancellationToken cancellationToken = default); | ||
|
|
||
| /// <summary> | ||
| /// Disposes the writer and releases all associated resources. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -43,9 +43,10 @@ public VectorStoreWriter(VectorStoreCollection<Guid, TRecord> collection, Vector | |
| public VectorStoreCollection<Guid, TRecord> VectorStoreCollection { get; } | ||
|
|
||
| /// <inheritdoc/> | ||
| public override async Task WriteAsync(IAsyncEnumerable<IngestionChunk<TChunk>> chunks, CancellationToken cancellationToken = default) | ||
| public override async Task WriteAsync(IAsyncEnumerable<IngestionChunk<TChunk>> chunks, IngestionDocument document, CancellationToken cancellationToken = default) | ||
| { | ||
| _ = Throw.IfNull(chunks); | ||
| _ = Throw.IfNull(document); | ||
|
|
||
| IReadOnlyList<Guid>? preExistingKeys = null; | ||
| List<TRecord>? batch = null; | ||
|
|
@@ -62,13 +63,13 @@ public override async Task WriteAsync(IAsyncEnumerable<IngestionChunk<TChunk>> c | |
| // We obtain the IDs of the pre-existing chunks for given document, | ||
| // and delete them after we finish inserting the new chunks, | ||
| // to avoid a situation where we delete the chunks and then fail to insert the new ones. | ||
| preExistingKeys ??= await GetPreExistingChunksIdsAsync(chunk.Document, cancellationToken).ConfigureAwait(false); | ||
| preExistingKeys ??= await GetPreExistingChunksIdsAsync(document, cancellationToken).ConfigureAwait(false); | ||
|
|
||
| TRecord record = new() | ||
| { | ||
| Content = chunk.Content, | ||
| Context = chunk.Context, | ||
| DocumentId = chunk.Document.Identifier, | ||
| DocumentId = document.Identifier, | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Assuming we keep the document parameter, add validation that the document of all the chunks is the same as the document argument? |
||
| }; | ||
|
|
||
| if (chunk.HasMetadata) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is beyond the scope of the MEVD writer, but are you sure you want to remove the reference to the document from the chunk? Wouldn't that preclude some legitimate uses of this in processors/erichers/whatever in the ingestion pipeline, e.g. have some metadata at the document level, and then do something to chunks which are part of a document that has certain metadata? Removing the reference between the chunks and their document makes the ingestion data model poorer (and less normailzed), forcing e.g. duplicating document metadata onto each and every chunk.
Just saying, because my original comment wasn't about changing IngestionChunk - just making a clearer (and ideally enforced) contract in IngestionChunkWriter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot please address this feedback and restore
DocumenttoIngestionChunk. Make sure yo revert the changes, so the diff is empty. But keep the changes that extendedIngestionChunkWriter.WriteAsyncwith mandatoryIngestionDocument documentargument and actually make theIngestionDocument documentthe first argument ofWriteAsyncmethod.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restored
Documentproperty onIngestionChunkand reverted all chunker/test changes. The only remaining diff vs base is theWriteAsyncsignature change —documentis now the first argument:WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken). See fbc1730.