Include non-code files in dependency-based analysis via FILE_LEVEL fallback #11

Draft
Copilot wants to merge 14 commits into infosaar-updates from copilot/infosaar-updates-include-non-code-files

Copilot AI commented Apr 28, 2026

  • Diagnose the Python 2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in execute_command
  • Fix execute_command to safely build jcmd in Python 2 when cmd contains mixed unicode and bytes elements (repo path from PyYAML is unicode, filenames from git output are bytes)
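
The failure mode in the second task can be sketched as follows. In Python 2, joining a list that mixes unicode and byte-string elements implicitly decodes the byte strings as ASCII, which fails on a byte such as 0xf0 (the first byte of a four-byte UTF-8 sequence). One way to avoid this is to convert every element to bytes explicitly before joining. The helper names `to_bytes` and `join_command` are hypothetical, and the way `execute_command` actually builds `jcmd` is not shown in this conversation, so treat this as an illustration of the idea rather than the committed fix.

```python
def to_bytes(element, encoding="utf-8"):
    """Return element as a byte string, encoding text elements explicitly.

    Hypothetical helper: byte strings pass through unchanged, so no
    implicit ASCII decoding is ever triggered.
    """
    if isinstance(element, bytes):
        return element
    return element.encode(encoding)


def join_command(cmd):
    """Join a command list that mixes text and byte-string elements."""
    return b" ".join(to_bytes(element) for element in cmd)


# The repo path is text (as it would be when read via PyYAML), while the
# filename is raw bytes (as it would be when taken from git output).
cmd = [u"git", u"--git-dir=/repo/.git", b"blame", b"docs/\xf0\x9f\x98\x80.md"]
joined = join_command(cmd)
```

Normalizing to bytes rather than to unicode means a filename that is not valid UTF-8 still passes through untouched.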
Original prompt

Create a pull request in se-sic/codeface targeting branch infosaar-updates to include non-code files (e.g., Markdown) in the dependency-based analysis pipeline so that commit_dependency is not empty for repos that primarily contain non-code files.

Context:

  • Time series analysis (codeface/R/analyse_ts.r, function gen.full.ts) returns NULL when there are no commits represented in commit_dependency.
  • For projects containing only non-code files such as .md, function extraction via Doxygen/ctags yields no functions, so no dependencies are computed and nothing is inserted into commit_dependency.

Proposed fix (minimal-impact):

  1. In codeface/VCS.py within gitVCS._getFunctionLines(self, file_layout_src, file_commit), after the existing ctags fallback (if not func_lines: func_lines = self._parseSrcFileCtags(...)), add a second fallback:
    • If func_lines is still empty, synthesize a single file-level artifact for the whole file, e.g. func_lines = {0: "FILE_LEVEL"}.
    • Set file_commit.artefact_line_range = True for this synthetic artifact.
    • This ensures downstream logic can map any changed line to the synthetic artifact and thus populate commit_dependency.
  2. Ensure Markdown files are included in the file list for blame/function analysis by extending the implementation file extension filter in gitVCS.addFiles4Analysis() (same file codeface/VCS.py) to include .md (and optionally .markdown if appropriate).
  3. Add or update tests (integration if available) to cover a repository/project with only .md changes so commit_dependency is populated with FILE_LEVEL entries. If creating a new fixture is too heavy, add a focused unit test around _getFunctionLines to assert the fallback behavior for .md content.
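
A minimal sketch of the fallback chain described in steps 1 and 2, assuming the parsers return a dict that maps a start line to an artifact name. Everything here is a stand-in: `FileCommitStub`, `get_function_lines`, `artifact_for_line`, `is_analysed`, and the extension list are hypothetical names for illustration and do not reproduce the actual code in codeface/VCS.py. The line-to-artifact lookup in particular is an assumption about how downstream logic resolves a changed line.

```python
import bisect

FILE_LEVEL = "FILE_LEVEL"

# Hypothetical extension list; the real filter in addFiles4Analysis()
# is not reproduced here.
ANALYSED_EXTENSIONS = (".c", ".py", ".md", ".markdown")


class FileCommitStub(object):
    """Hypothetical stand-in for the file_commit object."""

    def __init__(self):
        self.artefact_line_range = False


def is_analysed(filename):
    """Return True if the file extension passes the (assumed) filter."""
    return filename.lower().endswith(ANALYSED_EXTENSIONS)


def get_function_lines(parsers, file_layout_src, file_commit):
    """Try each parser in order; fall back to one file-level artifact."""
    for parse in parsers:
        func_lines = parse(file_layout_src)
        if func_lines:
            return func_lines
    # No parser found a function: synthesize a single artifact that
    # starts at line 0 and therefore covers the whole file.
    file_commit.artefact_line_range = True
    return {0: FILE_LEVEL}


def artifact_for_line(func_lines, line_number):
    """Map a changed line to the artifact with the closest preceding start."""
    starts = sorted(func_lines)
    index = bisect.bisect_right(starts, line_number) - 1
    return func_lines[starts[index]] if index >= 0 else None


# Both stand-in parsers find nothing in a Markdown file.
no_functions = lambda src: {}
markdown_commit = FileCommitStub()
markdown_lines = get_function_lines(
    [no_functions, no_functions], ["# Title", "Some text"], markdown_commit)
```

Because the synthetic artifact starts at line 0, every changed line in the file resolves to `FILE_LEVEL`, while files for which a parser does find functions take the early return and are unaffected.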

PR Requirements:

  • Title should reflect including non-code files in dependency analysis.
  • In the PR description/introduction comment, explicitly mention: "non-code files become more important these days and therefore should be included".
  • Keep changes minimal and backward compatible for code repositories.
  • Update any documentation/comments as needed to explain the fallback.

Files likely to change:

  • codeface/VCS.py
  • tests under codeface/test/... (choose appropriate location)

Notes:

  • FILE_LEVEL is already used as a special entityId in the repo (e.g., R experiments), so reuse it.
  • The goal is not to parse Markdown into functions; it is to avoid dropping such files entirely from dependency-based analysis.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User:

```r
gen.full.ts <- function(conf) {
  boundaries <- conf$boundaries
  full.series <- vector("list", dim(boundaries)[1])

  tstamps <- conf$tstamps.release

  subset <- c("commitDate", "AddedLines", "DeletedLines")
  ts <- get.commits.by.ranges(conf, subset, make.index.unique)

  if (dim(boundaries)[1] != length(ts)) {
    stop("Internal error: Release boundaries don't match ts list length")
  }

  for (i in 1:length(ts)) {
    ts[[i]]$ChangedLines <- ts[[i]]$AddedLines + ts[[i]]$DeletedLines
    full.series[[i]] <- na.omit(xts(ts[[i]]$ChangedLines,
                                    order.by=ts[[i]]$commitDate))
    full.series[[i]] <- trim.series(full.series[[i]], boundaries$date.start[i],
                                    boundaries$date.end[i])
  }

  full.series <- full.series[sapply(full.series, length)!=0]
  full.series <- do.call(c, full.series)

  return (full.series)
}
```

When does this function return NULL?

Assistant: `gen.full.ts()` returns `NULL` when **every per-release time series ends up empty** after trimming/filtering.

Concretely, that happens when this line removes *all* elements:

```r
full.series <- full.series[sapply(full.series, length)!=0]
```

If no `full.series[[i]]` has `length(...) != 0`, then `full.series` becomes an empty list (`list()`), and then:

```r
full.series <- do.call(c, full.series)
```

In R, `do.call(c, list())` evaluates to `NULL`, so the function returns `NULL`.

Typical reasons all series become empty

  • get.commits.by.ranges(...) returns data for each range, but after trim.series(...) the xts objects contain no points (e.g., commit timestamps fall outside boundaries$date.start[i]..date.end[i], or boundaries are wrong).
  • The xts(...) construction + na.omit(...) produces a 0-length series for every range (e.g., all ChangedLines are NA, or there are no commits at all in each range).

When it doesn’t return NULL

If at least one release range yields a non-empty trimmed xts series, do.call(c, ...) returns an xts object (possibly with only that one segment), not NULL.

User: How can it be that all series become empty? There are definitely commits in the repo, but none that are part of commit_dependency.

Assistant: All series can become empty even if the repository has plenty of commits, because gen.full.ts() does not look at “all commits in the repo”. It only...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Add support for non-code files in dependency analysis" to "Include non-code files in dependency-based analysis via FILE_LEVEL fallback" on Apr 28, 2026
Copilot AI requested a review from bockthom April 28, 2026 23:16
Copilot AI and others added 14 commits April 29, 2026 21:17
Signed-off-by: Thomas Bock <bockthom@cs.uni-saarland.de>
Signed-off-by: Thomas Bock <bockthom@cs.uni-saarland.de>
Agent-Logs-Url: https://github.com/se-sic/codeface/sessions/8fa45b90-96b2-4b56-9e28-b373eb50240a

Co-authored-by: Thomas Bock <bockthom@cmu.edu>
…and ls-tree

Prevent Python 2 UnicodeDecodeError in execute_command when filenames contain non-ASCII bytes

Agent-Logs-Url: https://github.com/se-sic/codeface/sessions/4586695e-7500-4873-aff0-bca899936fd8

Co-authored-by: Thomas Bock <bockthom@cmu.edu>