feat: thread Delta scan file paths for FAILED_READ_FILE provenance [Delta contrib split, part 9] #13
Draft
schenksj wants to merge 2 commits into
schenksj added a commit that referenced this pull request Jun 22, 2026
…d-green + reviewed clean (fork #13); SPLIT COMPLETE A.1-A.8
schenksj added a commit that referenced this pull request Jun 22, 2026
… check [fixes #13 3.5 cell]

CometDeltaFailedReadFileSuite compared `perPartitionFilePaths` (which carries URL-form paths from the DeltaScan proto -- a literal `%` is `%25`) against `File.listFiles().getName` (the decoded on-disk name). On Delta 3.3.2 the test harness puts `%` in data-file names (`test%file%prefix-...`), so the basenames differed (`%25` vs `%`) and the subsetOf coverage check spuriously failed -- only on the Spark 3.5 / Delta 3.3.2 cell (4.1 has no `%`, so the encoding difference was invisible). URL-decode the proto basenames before comparing.

Red-green: the suite *** FAILED *** on the 3.5/3.3.2 CI cell ("onDiskDataFiles.subsetOf was false ... named=test%25file..."); GREEN locally on -Pspark-3.5,scala-2.12 + Delta 3.3.2 after the fix (and unchanged on 4.1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BtErWgRQKCDRAg8Mk6qR4G
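The mismatch and the fix described in this commit can be illustrated with a small, self-contained sketch. It is not the suite's actual code: the object name, the helper `decodedBasename`, and the sample paths are hypothetical, and it assumes `java.net.URLDecoder` is an acceptable decoder for these names.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

object BasenameCompare {
  // Take the last path segment of a URL-form path and percent-decode it.
  // Caveat: URLDecoder follows form-encoding rules and also turns '+' into
  // a space, so this is only safe if names never contain a literal '+'.
  def decodedBasename(urlPath: String): String = {
    val encoded = urlPath.substring(urlPath.lastIndexOf('/') + 1)
    URLDecoder.decode(encoded, StandardCharsets.UTF_8.name())
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample values shaped like the failure described above.
    val protoPaths = Seq("file:/tmp/tbl/test%25file%25prefix-part-0.parquet")
    val onDiskDataFiles = Set("test%file%prefix-part-0.parquet")

    // Without decoding, the encoded basename does not match the on-disk name.
    val rawNames = protoPaths.map(p => p.substring(p.lastIndexOf('/') + 1)).toSet
    assert(!onDiskDataFiles.subsetOf(rawNames))

    // After decoding, the non-vacuous coverage check holds.
    val named = protoPaths.map(decodedBasename).toSet
    assert(onDiskDataFiles.nonEmpty && onDiskDataFiles.subsetOf(named))
  }
}
```

Decoding only the basename keeps the comparison independent of how the directory part of the path is spelled.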
…elta contrib split, part 9]

Final unit of the Delta contrib split. Completes the read-error provenance plumbing now that apache#4536 (typed SparkError::CannotReadFile -> cannotReadFilesError) has merged: a Delta native scan exposes its per-partition data-file paths through the shared `CometScanWithPlanData` trait so the unified `CometExecRDD` / `CometExecIterator` path can attribute a per-file read failure to the offending file (`FAILED_READ_FILE.NO_HINT`), exactly as `CometNativeScanExec` already does for plain Parquet.

Core (operators.scala):
- Add `perPartitionFilePaths` (default `Array.empty`) to the `CometScanWithPlanData` trait.
- `CometNativeExec` collects per-partition file paths from every `CometScanWithPlanData` leaf in the tree (covers `CometNativeScanExec` and contrib leaves) and threads them through `NativeExecContext` into the `CometExecRDD` it builds -- so provenance also flows when the scan is fused inside a larger parent native block, not just standalone.
- `CometNativeScanExec.perPartitionFilePaths` gains the `override` modifier now that the trait member is concrete (no behaviour change).

Contrib (CometDeltaNativeScanExec):
- Override `perPartitionFilePaths` to parse each per-partition `DeltaScan` task list into its file paths, and pass them to `CometExecRDD` in the standalone `doExecuteColumnar` path too.

Note on scope: for read errors raised inside the kernel read, the contrib native already carries the path (`map_file_read_error` is always called with `&file.path`, so `CannotReadFile` is path-bearing -- which is why the corrupted-file case in `CometDeltaEdgeCaseRegressionSuite` F6 already surfaces a path-bearing error without this change). `perPartitionFilePaths` is the parity/fallback provenance the shared `CometExecRDD` path uses: `SparkErrorConverter` fills the partition's file paths when a failure reaches the JVM without a native path. This brings the Delta leaf to parity with `CometNativeScanExec`.

Red-green guard (CometDeltaFailedReadFileSuite): asserts the Delta native scan exposes its data-file paths via `perPartitionFilePaths`. Proven RED before the override (`Array() was empty -- provenance not wired`) and GREEN after.

Verification: red-green proven on Spark 4.1 + Delta 4.1.0; targeted regression (CometDeltaNativeSuite + CometDeltaCdcSuite + the new suite) 23/0; Scala 2.12 compile of the core change (spark-3.4) AND the contrib (spark-3.5/scala-2.12); spotless + scalastyle clean; dev/verify-contrib-delta-gate.sh all pass (default libcomet still 0 Delta symbols -- the new trait member is inert in default builds).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BtErWgRQKCDRAg8Mk6qR4G
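The shape of the core change (a concrete default on the trait, plus a tree walk that merges what each scan leaf exposes) can be sketched with simplified stand-in types. Nothing here is the real Comet or Spark API: `PlanNode`, `ScanWithPlanData`, `ScanLeaf`, `Fused`, and `collectFilePaths` are hypothetical names, and the element type `Array[Seq[String]]` and the merge-by-partition-index rule are assumptions made for illustration.

```scala
// Simplified stand-ins for the plan tree; not the real Comet/Spark classes.
trait PlanNode { def children: Seq[PlanNode] }

trait ScanWithPlanData extends PlanNode {
  // Concrete default: a scan that does not override this contributes nothing.
  def perPartitionFilePaths: Array[Seq[String]] = Array.empty
}

// A scan leaf that exposes its per-partition file paths via an override.
final class ScanLeaf(paths: Array[Seq[String]]) extends ScanWithPlanData {
  val children: Seq[PlanNode] = Seq.empty
  override def perPartitionFilePaths: Array[Seq[String]] = paths
}

// A parent native block that fuses one or more children.
final class Fused(val children: Seq[PlanNode]) extends PlanNode

object Provenance {
  private def scanLeaves(node: PlanNode): Seq[ScanWithPlanData] = node match {
    case s: ScanWithPlanData => Seq(s)
    case other => other.children.flatMap(scanLeaves)
  }

  // Merge paths from every scan leaf, partition by partition (assumed rule:
  // a partition's entry is the union of that partition's files across scans).
  def collectFilePaths(root: PlanNode): Array[Seq[String]] = {
    val perScan = scanLeaves(root).map(_.perPartitionFilePaths).filter(_.nonEmpty)
    if (perScan.isEmpty) Array.empty[Seq[String]]
    else {
      val numPartitions = perScan.map(_.length).max
      Array.tabulate(numPartitions) { i =>
        perScan.flatMap(p => if (i < p.length) p(i) else Seq.empty)
      }
    }
  }
}
```

For example, a `Fused` node over two `ScanLeaf` children yields one entry per partition index, each holding the files both scans read for that partition. Because the default is concrete, existing implementers of the trait keep compiling and simply report no paths.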
Part 9 (final) of the Delta Lake contrib PR breakup (stacked on part 8 / #12). Fork-local review draft.

Completes the read-error provenance plumbing now that apache#4536 has merged (typed `SparkError::CannotReadFile` → `cannotReadFilesError`). A Delta native scan exposes its per-partition data-file paths through the shared `CometScanWithPlanData` trait, so the unified `CometExecRDD` / `CometExecIterator` path can attribute a per-file read failure to the offending file (`FAILED_READ_FILE.NO_HINT`), reaching parity with `CometNativeScanExec`.

Changes

Core (`operators.scala`):
- Add `perPartitionFilePaths` (default `Array.empty`) to the `CometScanWithPlanData` trait.
- `CometNativeExec` collects per-partition paths from every `CometScanWithPlanData` leaf and threads them into the `CometExecRDD` it builds, so provenance flows even when the scan is fused inside a larger parent native block, not just standalone. It early-returns `Array.empty` when no such scan is present.
- `CometNativeScanExec.perPartitionFilePaths` gains `override` (the trait member is now concrete; no behaviour change).

Contrib (`CometDeltaNativeScanExec`):
- Override `perPartitionFilePaths` to parse each per-partition `DeltaScan` task list into file paths, and pass them to `CometExecRDD` in the standalone path too.
- The encryption file list reuses the same parse (`perPartitionFilePaths.flatten`) instead of a second pass.

Scope note

For errors raised inside the kernel read, the contrib native already carries the path (`map_file_read_error` is always called with `&file.path`), which is why `CometDeltaEdgeCaseRegressionSuite` F6 (corrupted file) already surfaces a path-bearing error without this change. `perPartitionFilePaths` is the parity/fallback provenance the shared `CometExecRDD` path uses when a failure reaches the JVM without a native path. As with `CometNativeScanExec`, a fused multi-scan (join) partition's hint is the union of that partition's files, which is acceptable for the `NO_HINT` variant.

Red-green guard

`CometDeltaFailedReadFileSuite` asserts the scan exposes its data files via `perPartitionFilePaths` (and that the on-disk files are a non-empty subset, so it can't pass vacuously). Proven RED before the override (`Array() was empty -- provenance not wired`) and GREEN after.

Verification

Red-green on Spark 4.1 + Delta 4.1.0; targeted regression (native + CDF + new suite) 23/0; Scala 2.12 compile of both the core (spark-3.4) and contrib (spark-3.5/scala-2.12) changes; spotless + scalastyle clean; `dev/verify-contrib-delta-gate.sh` all pass (default `libcomet` still 0 Delta symbols; the trait member is inert in default builds).

🤖 This PR was prepared with the assistance of Claude (Anthropic). A human author reviewed and is responsible for the content.
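The fallback rule in the scope note (prefer a path the native side already attached, otherwise fall back to the failing partition's file list) could look roughly like the sketch below. It is only an illustration under stated assumptions: `ReadFailure`, `FallbackProvenance`, and `filesForError` are hypothetical names and do not reflect the real `SparkErrorConverter` signature.

```scala
// Hypothetical model of a read failure arriving on the JVM side.
final case class ReadFailure(message: String, nativePath: Option[String])

object FallbackProvenance {
  // Choose which file(s) to report for a failure in the given partition.
  def filesForError(
      failure: ReadFailure,
      partitionIndex: Int,
      perPartitionFilePaths: Array[Seq[String]]): Seq[String] =
    failure.nativePath match {
      // The native error is already path-bearing: report exactly that file.
      case Some(path) => Seq(path)
      // No native path: fall back to every file the partition was reading.
      case None =>
        if (partitionIndex >= 0 && partitionIndex < perPartitionFilePaths.length)
          perPartitionFilePaths(partitionIndex)
        else Seq.empty
    }
}
```

With a fused multi-scan partition, the `None` branch returns the union of that partition's files, which matches the coarser `NO_HINT` attribution described in the scope note.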