Skip to content

feat: thread Delta scan file paths for FAILED_READ_FILE provenance [Delta contrib split, part 9]#13

Draft
schenksj wants to merge 2 commits into
pr/delta-A7-docsfrom
pr/delta-A8-failed-read-file
Draft

feat: thread Delta scan file paths for FAILED_READ_FILE provenance [Delta contrib split, part 9]#13
schenksj wants to merge 2 commits into
pr/delta-A7-docsfrom
pr/delta-A8-failed-read-file

Conversation

@schenksj

Copy link
Copy Markdown
Owner

Part 9 (final) of the Delta Lake contrib PR breakup (stacked on part 8 / #12). Fork-local review draft.

Completes the read-error provenance plumbing now that apache#4536 has merged (typed SparkError::CannotReadFilecannotReadFilesError). A Delta native scan exposes its per-partition data-file paths through the shared CometScanWithPlanData trait, so the unified CometExecRDD / CometExecIterator path can attribute a per-file read failure to the offending file (FAILED_READ_FILE.NO_HINT) — reaching parity with CometNativeScanExec.

Changes

Core (operators.scala)

  • Add perPartitionFilePaths (default Array.empty) to the CometScanWithPlanData trait.
  • CometNativeExec collects per-partition paths from every CometScanWithPlanData leaf and threads them into the CometExecRDD it builds — so provenance flows even when the scan is fused inside a larger parent native block, not just standalone. Early-returns Array.empty when no such scan is present.
  • CometNativeScanExec.perPartitionFilePaths gains override (the trait member is now concrete; no behaviour change).

Contrib (CometDeltaNativeScanExec)

  • Override perPartitionFilePaths to parse each per-partition DeltaScan task list into file paths; pass them to CometExecRDD standalone too. The encryption file list reuses the same parse (perPartitionFilePaths.flatten) instead of a second pass.

Scope note

For errors raised inside the kernel read, the contrib native already carries the path (map_file_read_error is always called with &file.path), which is why CometDeltaEdgeCaseRegressionSuite F6 (corrupted file) already surfaces a path-bearing error without this change. perPartitionFilePaths is the parity/fallback provenance the shared CometExecRDD path uses when a failure reaches the JVM without a native path. As with CometNativeScanExec, a fused multi-scan (join) partition's hint is the union of that partition's files — acceptable for the NO_HINT variant.

Red-green guard

CometDeltaFailedReadFileSuite asserts the scan exposes its data files via perPartitionFilePaths (and that the on-disk files are a non-empty subset, so it can't pass vacuously). Proven RED before the override (Array() was empty — provenance not wired) and GREEN after.

Verification

Red-green on Spark 4.1 + Delta 4.1.0; targeted regression (native + CDF + new suite) 23/0; Scala 2.12 compile of both the core (spark-3.4) and contrib (spark-3.5/scala-2.12) changes; spotless + scalastyle clean; dev/verify-contrib-delta-gate.sh all pass (default libcomet still 0 Delta symbols — the trait member is inert in default builds).


🤖 This PR was prepared with the assistance of Claude (Anthropic). A human author reviewed and is responsible for the content.

schenksj added a commit that referenced this pull request Jun 22, 2026
…d-green + reviewed clean (fork #13); SPLIT COMPLETE A.1-A.8
@schenksj schenksj force-pushed the pr/delta-A8-failed-read-file branch from 4d36f7f to b0644a6 Compare June 22, 2026 01:52
@schenksj schenksj force-pushed the pr/delta-A8-failed-read-file branch from b0644a6 to 3930a45 Compare June 22, 2026 01:57
@schenksj schenksj force-pushed the pr/delta-A8-failed-read-file branch from 3930a45 to 8e31afa Compare June 22, 2026 02:32
schenksj added a commit that referenced this pull request Jun 22, 2026
schenksj added a commit that referenced this pull request Jun 22, 2026
… check [fixes #13 3.5 cell]

CometDeltaFailedReadFileSuite compared `perPartitionFilePaths` (which carries URL-form
paths from the DeltaScan proto -- a literal `%` is `%25`) against `File.listFiles().getName`
(the decoded on-disk name). On Delta 3.3.2 the test harness puts `%` in data-file names
(`test%file%prefix-...`), so the basenames differed (`%25` vs `%`) and the subsetOf coverage
check spuriously failed -- only on the Spark 3.5 / Delta 3.3.2 cell (4.1 has no `%`, so the
encoding difference was invisible). URL-decode the proto basenames before comparing.

Red-green: the suite *** FAILED *** on the 3.5/3.3.2 CI cell ("onDiskDataFiles.subsetOf was
false ... named=test%25file..."); GREEN locally on -Pspark-3.5,scala-2.12 + Delta 3.3.2 after
the fix (and unchanged on 4.1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BtErWgRQKCDRAg8Mk6qR4G
schenksj and others added 2 commits June 29, 2026 09:27
…elta contrib split, part 9]

Final unit of the Delta contrib split. Completes the read-error provenance
plumbing now that apache#4536 (typed SparkError::CannotReadFile -> cannotReadFilesError)
has merged: a Delta native scan exposes its per-partition data-file paths through
the shared `CometScanWithPlanData` trait so the unified `CometExecRDD` /
`CometExecIterator` path can attribute a per-file read failure to the offending
file (`FAILED_READ_FILE.NO_HINT`), exactly as `CometNativeScanExec` already does
for plain Parquet.

Core (operators.scala):
- Add `perPartitionFilePaths` (default `Array.empty`) to the `CometScanWithPlanData`
  trait.
- `CometNativeExec` collects per-partition file paths from every
  `CometScanWithPlanData` leaf in the tree (covers `CometNativeScanExec` and contrib
  leaves) and threads them through `NativeExecContext` into the `CometExecRDD` it
  builds -- so provenance also flows when the scan is fused inside a larger parent
  native block, not just standalone.
- `CometNativeScanExec.perPartitionFilePaths` gains the `override` modifier now that
  the trait member is concrete (no behaviour change).

Contrib (CometDeltaNativeScanExec):
- Override `perPartitionFilePaths` to parse each per-partition `DeltaScan` task list
  into its file paths, and pass them to `CometExecRDD` in the standalone
  `doExecuteColumnar` path too.

Note on scope: for read errors raised inside the kernel read, the contrib native
already carries the path (`map_file_read_error` is always called with `&file.path`,
so `CannotReadFile` is path-bearing -- which is why the corrupted-file case in
`CometDeltaEdgeCaseRegressionSuite` F6 already surfaces a path-bearing error without
this change). `perPartitionFilePaths` is the parity/fallback provenance the shared
`CometExecRDD` path uses: `SparkErrorConverter` fills the partition's file paths when
a failure reaches the JVM without a native path. This brings the Delta leaf to parity
with `CometNativeScanExec`.

Red-green guard (CometDeltaFailedReadFileSuite): asserts the Delta native scan exposes
its data-file paths via `perPartitionFilePaths`. Proven RED before the override
(`Array() was empty -- provenance not wired`) and GREEN after.

Verification: red-green proven on Spark 4.1 + Delta 4.1.0; targeted regression
(CometDeltaNativeSuite + CometDeltaCdcSuite + the new suite) 23/0; Scala 2.12 compile
of the core change (spark-3.4) AND the contrib (spark-3.5/scala-2.12); spotless +
scalastyle clean; dev/verify-contrib-delta-gate.sh all pass (default libcomet still 0
Delta symbols -- the new trait member is inert in default builds).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BtErWgRQKCDRAg8Mk6qR4G
… check [fixes #13 3.5 cell]

CometDeltaFailedReadFileSuite compared `perPartitionFilePaths` (which carries URL-form
paths from the DeltaScan proto -- a literal `%` is `%25`) against `File.listFiles().getName`
(the decoded on-disk name). On Delta 3.3.2 the test harness puts `%` in data-file names
(`test%file%prefix-...`), so the basenames differed (`%25` vs `%`) and the subsetOf coverage
check spuriously failed -- only on the Spark 3.5 / Delta 3.3.2 cell (4.1 has no `%`, so the
encoding difference was invisible). URL-decode the proto basenames before comparing.

Red-green: the suite *** FAILED *** on the 3.5/3.3.2 CI cell ("onDiskDataFiles.subsetOf was
false ... named=test%25file..."); GREEN locally on -Pspark-3.5,scala-2.12 + Delta 3.3.2 after
the fix (and unchanged on 4.1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BtErWgRQKCDRAg8Mk6qR4G
@schenksj schenksj force-pushed the pr/delta-A8-failed-read-file branch from 828c135 to 22d4ed3 Compare June 29, 2026 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant