@@ -21,19 +21,18 @@ public sealed class IngestionChunk<T>
/// Initializes a new instance of the <see cref="IngestionChunk{T}"/> class.
/// </summary>
/// <param name="content">The content of the chunk.</param>
/// <param name="document">The document from which this chunk was extracted.</param>
/// <param name="tokenCount">The number of tokens used to represent the chunk.</param>
/// <param name="context">Additional context for the chunk.</param>
/// <exception cref="ArgumentNullException">
/// <paramref name="content"/> or <paramref name="document"/> is <see langword="null"/>.
/// <paramref name="content"/> is <see langword="null"/>.
/// </exception>
/// <exception cref="ArgumentException">
/// <paramref name="content"/> is a string that is empty or contains only white-space characters.
/// </exception>
/// <exception cref="ArgumentOutOfRangeException">
/// <paramref name="tokenCount"/> is less than or equal to zero.
/// </exception>
public IngestionChunk(T content, IngestionDocument document, int tokenCount, string? context = null)
public IngestionChunk(T content, int tokenCount, string? context = null)
Member:

This is beyond the scope of the MEVD writer, but are you sure you want to remove the reference to the document from the chunk? Wouldn't that preclude some legitimate uses of this in processors/enrichers/whatever in the ingestion pipeline, e.g. have some metadata at the document level, and then do something to chunks which are part of a document that has certain metadata? Removing the reference between the chunks and their document makes the ingestion data model poorer (and less normalized), forcing e.g. duplicating document metadata onto each and every chunk.

Just saying, because my original comment wasn't about changing IngestionChunk - just making a clearer (and ideally enforced) contract in IngestionChunkWriter.

Member:

@copilot please address this feedback and restore Document to IngestionChunk. Make sure you revert the changes so the diff is empty, but keep the changes that extended IngestionChunkWriter.WriteAsync with a mandatory IngestionDocument document argument, and actually make the IngestionDocument document the first argument of the WriteAsync method.

Contributor (Author):
Restored Document property on IngestionChunk and reverted all chunker/test changes. The only remaining diff vs base is the WriteAsync signature change — document is now the first argument: WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken). See fbc1730.

{
if (typeof(T) == typeof(string))
{
@@ -44,7 +43,6 @@ public IngestionChunk(T content, IngestionDocument document, int tokenCount, str
Content = Throw.IfNull(content);
}

Document = Throw.IfNull(document);
Context = context;
TokenCount = Throw.IfLessThanOrEqual(tokenCount, 0);
}
@@ -54,11 +52,6 @@ public IngestionChunk(T content, IngestionDocument document, int tokenCount, str
/// </summary>
public T Content { get; }

/// <summary>
/// Gets the document from which this chunk was extracted.
/// </summary>
public IngestionDocument Document { get; }

/// <summary>
/// Gets additional context for the chunk.
/// </summary>
@@ -15,12 +15,13 @@ namespace Microsoft.Extensions.DataIngestion;
public abstract class IngestionChunkWriter<T> : IDisposable
{
/// <summary>
/// Writes chunks asynchronously.
/// Writes the chunks of a single document asynchronously.
Member:
Unimportant: do we still add "asynchronously" to the docs of each and every async function (especially when these don't have a corresponding synchronous overload)? Seems a bit useless to me (but obviously let's follow latest practices and patterns).

/// </summary>
/// <param name="chunks">The chunks to write.</param>
/// <param name="document">The document from which the chunks were extracted.</param>
/// <param name="cancellationToken">The token to monitor for cancellation requests.</param>
/// <returns>A task representing the asynchronous write operation.</returns>
public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);
public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, IngestionDocument document, CancellationToken cancellationToken = default);

/// <summary>
/// Disposes the writer and releases all associated resources.
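As a quick illustration of the extended contract in this file's diff, a call site would now pass the document explicitly. This is a hedged sketch only: `writer`, `document`, `chunks`, and `cancellationToken` are hypothetical placeholders, and the final argument order was still under discussion in this review thread.

```csharp
// Hypothetical call site for the extended WriteAsync contract:
// the writer receives the IngestionDocument alongside the chunks,
// instead of relying on a per-chunk Document reference.
IngestionDocument document = new("manual");
IAsyncEnumerable<IngestionChunk<string>> chunks = GetChunksAsync(document); // hypothetical chunk source
await writer.WriteAsync(chunks, document, cancellationToken);
```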
@@ -94,7 +94,6 @@ IngestionChunk<string> FinalizeChunk()
{
IngestionChunk<string> chunk = new IngestionChunk<string>(
content: stringBuilder.ToString(),
document: document,
tokenCount: stringBuilderTokenCount,
context: string.Empty);
_ = stringBuilder.Clear();
@@ -32,7 +32,7 @@ internal ElementsChunker(IngestionChunkerOptions options)
// 1. Create chunks that do not exceed _maxTokensPerChunk when tokenized.
// 2. Maintain context in each chunk.
// 3. If a single IngestionDocumentElement exceeds _maxTokensPerChunk, it should be split intelligently (e.g., paragraphs can be split into sentences, tables into rows).
internal IEnumerable<IngestionChunk<string>> Process(IngestionDocument document, string context, List<IngestionDocumentElement> elements)
internal IEnumerable<IngestionChunk<string>> Process(string context, List<IngestionDocumentElement> elements)
{
// Not using yield return here as we use ref structs.
List<IngestionChunk<string>> chunks = [];
@@ -198,7 +198,7 @@ internal IEnumerable<IngestionChunk<string>> Process(IngestionDocument document,
{
string chunkContent = _currentChunk.ToString();
int chunkTokenCount = CountTokens(chunkContent.AsSpan());
chunks.Add(new(chunkContent, document, chunkTokenCount, context));
chunks.Add(new(chunkContent, chunkTokenCount, context));
}

_currentChunk.Clear();
@@ -209,7 +209,7 @@ void Commit()
{
string chunkContent = _currentChunk.ToString();
int chunkTokenCount = CountTokens(chunkContent.AsSpan());
chunks.Add(new(chunkContent, document, chunkTokenCount, context));
chunks.Add(new(chunkContent, chunkTokenCount, context));

// We keep the context in the current chunk as it's the same for all elements.
_currentChunk.Remove(
@@ -43,7 +43,7 @@ public override async IAsyncEnumerable<IngestionChunk<string>> ProcessAsync(Inge

if (element is IngestionDocumentHeader header)
{
foreach (var chunk in SplitIntoChunks(document, headers, elements))
foreach (var chunk in SplitIntoChunks(headers, elements))
{
yield return chunk;
}
@@ -59,19 +59,19 @@ public override async IAsyncEnumerable<IngestionChunk<string>> ProcessAsync(Inge
}

// take care of any remaining paragraphs
foreach (var chunk in SplitIntoChunks(document, headers, elements))
foreach (var chunk in SplitIntoChunks(headers, elements))
{
yield return chunk;
}
}

private IEnumerable<IngestionChunk<string>> SplitIntoChunks(IngestionDocument document, string?[] headers, List<IngestionDocumentElement> elements)
private IEnumerable<IngestionChunk<string>> SplitIntoChunks(string?[] headers, List<IngestionDocumentElement> elements)
{
if (elements.Count > 0)
{
string chunkHeader = string.Join(" ", headers.Where(h => !string.IsNullOrEmpty(h)));

foreach (var chunk in _elementsChunker.Process(document, chunkHeader, elements))
foreach (var chunk in _elementsChunker.Process(chunkHeader, elements))
{
yield return chunk;
}
@@ -35,7 +35,7 @@ public override async IAsyncEnumerable<IngestionChunk<string>> ProcessAsync(Inge
{
cancellationToken.ThrowIfCancellationRequested();

Process(document, section, chunks);
Process(section, chunks);
foreach (var chunk in chunks)
{
yield return chunk;
@@ -44,7 +44,7 @@ public override async IAsyncEnumerable<IngestionChunk<string>> ProcessAsync(Inge
}
}

private void Process(IngestionDocument document, IngestionDocumentSection section, List<IngestionChunk<string>> chunks, string? parentContext = null)
private void Process(IngestionDocumentSection section, List<IngestionChunk<string>> chunks, string? parentContext = null)
{
List<IngestionDocumentElement> elements = new(section.Elements.Count);
string context = parentContext ?? string.Empty;
@@ -62,7 +62,7 @@ private void Process(IngestionDocument document, IngestionDocumentSection sectio
break;
case IngestionDocumentSection nestedSection:
Commit();
Process(document, nestedSection, chunks, context);
Process(nestedSection, chunks, context);
break;
default:
elements.Add(section.Elements[i]);
@@ -76,7 +76,7 @@ void Commit()
{
if (elements.Count > 0)
{
foreach (var chunk in _elementsChunker.Process(document, context, elements))
foreach (var chunk in _elementsChunker.Process(context, elements))
{
chunks.Add(chunk);
}
@@ -50,7 +50,7 @@ public override async IAsyncEnumerable<IngestionChunk<string>> ProcessAsync(Inge
_ = Throw.IfNull(document);

List<(IngestionDocumentElement, float)> distances = await CalculateDistancesAsync(document, cancellationToken).ConfigureAwait(false);
foreach (var chunk in MakeChunks(document, distances))
foreach (var chunk in MakeChunks(distances))
{
yield return chunk;
}
@@ -93,7 +93,7 @@ public override async IAsyncEnumerable<IngestionChunk<string>> ProcessAsync(Inge
return elementDistances;
}

private IEnumerable<IngestionChunk<string>> MakeChunks(IngestionDocument document, List<(IngestionDocumentElement element, float distance)> elementDistances)
private IEnumerable<IngestionChunk<string>> MakeChunks(List<(IngestionDocumentElement element, float distance)> elementDistances)
{
float distanceThreshold = Percentile(elementDistances);

@@ -106,7 +106,7 @@ private IEnumerable<IngestionChunk<string>> MakeChunks(IngestionDocument documen
elementAccumulator.Add(element);
if (distance > distanceThreshold || i == elementDistances.Count - 1)
{
foreach (var chunk in _elementsChunker.Process(document, context, elementAccumulator))
foreach (var chunk in _elementsChunker.Process(context, elementAccumulator))
{
yield return chunk;
}
@@ -187,7 +187,7 @@ private async Task<IngestionDocument> IngestAsync(IngestionDocument document, Ac
}

_logger?.WritingChunks(GetShortName(_writer));
await _writer.WriteAsync(chunks, cancellationToken).ConfigureAwait(false);
await _writer.WriteAsync(chunks, document, cancellationToken).ConfigureAwait(false);
_logger?.WroteChunks(document.Identifier);

return document;
@@ -43,9 +43,10 @@ public VectorStoreWriter(VectorStoreCollection<Guid, TRecord> collection, Vector
public VectorStoreCollection<Guid, TRecord> VectorStoreCollection { get; }

/// <inheritdoc/>
public override async Task WriteAsync(IAsyncEnumerable<IngestionChunk<TChunk>> chunks, CancellationToken cancellationToken = default)
public override async Task WriteAsync(IAsyncEnumerable<IngestionChunk<TChunk>> chunks, IngestionDocument document, CancellationToken cancellationToken = default)
{
_ = Throw.IfNull(chunks);
_ = Throw.IfNull(document);

IReadOnlyList<Guid>? preExistingKeys = null;
List<TRecord>? batch = null;
@@ -62,13 +63,13 @@ public override async Task WriteAsync(IAsyncEnumerable<IngestionChunk<TChunk>> c
// We obtain the IDs of the pre-existing chunks for given document,
// and delete them after we finish inserting the new chunks,
// to avoid a situation where we delete the chunks and then fail to insert the new ones.
preExistingKeys ??= await GetPreExistingChunksIdsAsync(chunk.Document, cancellationToken).ConfigureAwait(false);
preExistingKeys ??= await GetPreExistingChunksIdsAsync(document, cancellationToken).ConfigureAwait(false);

TRecord record = new()
{
Content = chunk.Content,
Context = chunk.Context,
DocumentId = chunk.Document.Identifier,
DocumentId = document.Identifier,
Member:
Assuming we keep the document parameter, add validation that the document of all the chunks is the same as the document argument?
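If the Document property stays on the chunk, the validation suggested here could look roughly like the following. This is a sketch only, not the repository's code; it assumes the per-chunk Document reference is still available inside the write loop.

```csharp
// Hypothetical guard inside WriteAsync's write loop: every chunk
// must belong to the document that was passed to the method.
if (!ReferenceEquals(chunk.Document, document))
{
    throw new ArgumentException(
        "All chunks must originate from the provided document.",
        nameof(chunks));
}
```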

};

if (chunk.HasMetadata)
@@ -176,7 +176,6 @@ public async Task TwoSeparateTopicsWithAllKindsOfElements()
IReadOnlyList<IngestionChunk<string>> chunks = await chunker.ProcessAsync(doc).ToListAsync();

Assert.Equal(3, chunks.Count);
Assert.All(chunks, chunk => Assert.Same(doc, chunk.Document));
Assert.Equal($@"# .NET Supported Languages
The .NET platform supports multiple programming languages:
{dotNetTableMarkdown}
@@ -11,30 +11,25 @@ public class IngestionChunkTests
[Fact]
public void Constructor_SetsTokenCountProperty()
{
IngestionDocument document = new("test");
IngestionChunk<string> chunk = new("test content", document, 42);
IngestionChunk<string> chunk = new("test content", 42);

Assert.Equal(42, chunk.TokenCount);
}

[Fact]
public void Constructor_ThrowsWhenTokenCountIsNegative()
{
IngestionDocument document = new("test");

ArgumentOutOfRangeException exception = Assert.Throws<ArgumentOutOfRangeException>(
() => new IngestionChunk<string>("test content", document, -1));
() => new IngestionChunk<string>("test content", -1));

Assert.Equal("tokenCount", exception.ParamName);
}

[Fact]
public void Constructor_ThrowsWhenTokenCountIsZero()
{
IngestionDocument document = new("test");

ArgumentOutOfRangeException exception = Assert.Throws<ArgumentOutOfRangeException>(
() => new IngestionChunk<string>("test content", document, 0));
() => new IngestionChunk<string>("test content", 0));

Assert.Equal("tokenCount", exception.ParamName);
}
@@ -190,7 +190,6 @@ public override IAsyncEnumerable<IngestionChunk<DataContent>> ProcessAsync(Inges
.OfType<IngestionDocumentImage>()
.Select(image => new IngestionChunk<DataContent>(
content: new(image.Content.GetValueOrDefault(), image.MediaType!),
document: document,
tokenCount: 123)) // made up number as we currently don't have the ability to easily count exact tokens
.ToAsyncEnumerable();
}
@@ -16,8 +16,6 @@ namespace Microsoft.Extensions.DataIngestion.Processors.Tests;

public class ClassificationEnricherTests
{
private static readonly IngestionDocument _document = new("test");

[Fact]
public void ThrowsOnNullOptions()
{
@@ -122,9 +120,9 @@ private static List<IngestionChunk<string>> CreateChunks() =>
[
TestChunkFactory.CreateChunk(".NET developers need to integrate and interact with a growing variety of artificial intelligence (AI) services in their apps. " +
"The Microsoft.Extensions.AI libraries provide a unified approach for representing generative AI components, and enable seamless" +
" integration and interoperability with various AI services.", _document),
" integration and interoperability with various AI services."),
TestChunkFactory.CreateChunk("Rabbits are small mammals in the family Leporidae of the order Lagomorpha (along with the hare and the pika)." +
"They are herbivorous animals and are known for their long ears, large hind legs, and short fluffy tails.", _document),
TestChunkFactory.CreateChunk("This text does not belong to any category.", _document),
"They are herbivorous animals and are known for their long ears, large hind legs, and short fluffy tails."),
TestChunkFactory.CreateChunk("This text does not belong to any category."),
];
}
@@ -16,8 +16,6 @@ namespace Microsoft.Extensions.DataIngestion.Processors.Tests;

public class KeywordEnricherTests
{
private static readonly IngestionDocument _document = new("test");

[Fact]
public void ThrowsOnNullOptions()
{
@@ -120,7 +118,7 @@ public async Task FailureDoesNotStopTheProcessing()

private static List<IngestionChunk<string>> CreateChunks() =>
[
TestChunkFactory.CreateChunk("The Microsoft.Extensions.AI libraries provide a unified approach for representing generative AI components", _document),
TestChunkFactory.CreateChunk("Rabbits are great pets. They are friendly and make excellent companions.", _document)
TestChunkFactory.CreateChunk("The Microsoft.Extensions.AI libraries provide a unified approach for representing generative AI components"),
TestChunkFactory.CreateChunk("Rabbits are great pets. They are friendly and make excellent companions.")
];
}
@@ -16,8 +16,6 @@ namespace Microsoft.Extensions.DataIngestion.Processors.Tests;

public class SentimentEnricherTests
{
private static readonly IngestionDocument _document = new("test");

[Fact]
public void ThrowsOnNullOptions()
{
@@ -107,9 +105,9 @@ public async Task FailureDoesNotStopTheProcessing()

private static List<IngestionChunk<string>> CreateChunks() =>
[
TestChunkFactory.CreateChunk("I love programming! It's so much fun and rewarding.", _document),
TestChunkFactory.CreateChunk("I hate bugs. They are so frustrating and time-consuming.", _document),
TestChunkFactory.CreateChunk("The weather is okay, not too bad but not great either.", _document),
TestChunkFactory.CreateChunk("I hate you. I am sorry, I actually don't. I am not sure myself what my feelings are.", _document)
TestChunkFactory.CreateChunk("I love programming! It's so much fun and rewarding."),
TestChunkFactory.CreateChunk("I hate bugs. They are so frustrating and time-consuming."),
TestChunkFactory.CreateChunk("The weather is okay, not too bad but not great either."),
TestChunkFactory.CreateChunk("I hate you. I am sorry, I actually don't. I am not sure myself what my feelings are.")
];
}
@@ -16,8 +16,6 @@ namespace Microsoft.Extensions.DataIngestion.Processors.Tests;

public class SummaryEnricherTests
{
private static readonly IngestionDocument _document = new("test");

[Fact]
public void ThrowsOnNullOptions()
{
@@ -104,7 +102,7 @@ public async Task FailureDoesNotStopTheProcessing()

private static List<IngestionChunk<string>> CreateChunks() =>
[
TestChunkFactory.CreateChunk("I love programming! It's so much fun and rewarding.", _document),
TestChunkFactory.CreateChunk("I hate bugs. They are so frustrating and time-consuming.", _document)
TestChunkFactory.CreateChunk("I love programming! It's so much fun and rewarding."),
TestChunkFactory.CreateChunk("I hate bugs. They are so frustrating and time-consuming.")
];
}
@@ -9,9 +9,9 @@ public static class TestChunkFactory
{
private static readonly Tokenizer _tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

public static IngestionChunk<string> CreateChunk(string content, IngestionDocument document)
public static IngestionChunk<string> CreateChunk(string content)
{
int tokenCount = _tokenizer.CountTokens(content, considerNormalization: false);
return new IngestionChunk<string>(content, document, tokenCount);
return new IngestionChunk<string>(content, tokenCount);
}
}