[PROPOSAL] make2compdb.c #392

Open
g-berthiaume wants to merge 1 commit into skeeto:master from g-berthiaume:make2compdb

@g-berthiaume

This PR introduces make2compdb, a new CLI tool that generates Clang's JSON Compilation Database files (compile_commands.json) from make build systems.

The tool can be used as follows:

$ make -Bwn | make2compdb.exe > compile_commands.json
$ cat compile_commands.json
[
  {
    "directory": "C:\\my_project",
    "file": "main.c",
    "output": "main",
    "arguments": [
      "gcc",
      "-o",
      "main",
      "main.c"
    ]
  }
]

Properties

  • Single C23 source file
  • Compiles on both Windows and Linux
  • Unit tests
  • Basic fuzzing (I'm pretty new to this)
  • CRT-free on Windows
  • Acceptably fast
Expand to see peports
$ peports make2compdb.exe
KERNEL32.dll
        0       ExitProcess
        0       GetCommandLineW
        0       GetConsoleMode
        0       GetCurrentDirectoryW
        0       GetStdHandle
        0       ReadFile
        0       VirtualAlloc
        0       WriteConsoleW
        0       WriteFile
SHELL32.dll
        0       CommandLineToArgvW
Expand to see performance benchmarks

While I'm sure we could do better, the CLI tool seems to be acceptably fast.
It seems to be a bit faster on Linux than on Windows, but I may just be measuring piping speeds.

With a small project (linux)

$ hyperfine "./make2compdb < make_output.txt" --warmup 5
Benchmark 1: ./make2compdb < make_output.txt
  Time (mean ± σ):       5.8 ms ±   1.0 ms    [User: 1.1 ms, System: 0.9 ms]
  Range (min … max):     4.1 ms …  11.9 ms    321 runs

With a small project (windows)

$ hyperfine ".\make2compdb.exe < make_output.txt" --warmup 5
Benchmark 1: .\make2compdb.exe < make_output.txt
  Time (mean ± σ):      13.5 ms ±   1.0 ms    [User: 8.6 ms, System: 7.2 ms]
  Range (min … max):    12.3 ms …  18.1 ms    112 runs

With ffmpeg (windows)

$ hyperfine "make2compdb.exe < ffmpeg_output.txt" --warmup 5
Benchmark 1: make2compdb.exe < ffmpeg_output.txt
  Time (mean ± σ):      55.8 ms ±   4.5 ms    [User: 29.0 ms, System: 25.9 ms]
  Range (min … max):    51.4 ms …  79.0 ms    45 runs

With ffmpeg (linux)

$ hyperfine "make2compdb < ffmpeg_output.txt" --warmup 5 
Benchmark 1: make2compdb < ffmpeg_output.txt
  Time (mean ± σ):      28.2 ms ±   0.6 ms    [User: 22.8 ms, System: 3.7 ms]
  Range (min … max):    27.2 ms …  30.3 ms    102 runs

By comparison, https://github.com/nickdiego/compiledb takes about 18 seconds to analyze ffmpeg.

$ hyperfine "compiledb -p ffmpeg_output.txt"
Benchmark 1: compiledb -p ffmpeg_output.txt
  Time (mean ± σ):     18.469 s ±  0.258 s    [User: 18.128 s, System: 0.205 s]                                        
  Range (min … max):   18.127 s … 19.003 s    10 runs

So that's roughly a 331x speedup over the Windows ffmpeg run (18.469 s versus 55.8 ms).

Features

Two additional features

1. No Make mode

I'm a big fan of the "unity build" compilation technique (like the one used in u-config).
Therefore, I often don't have a use for makefiles in my projects.
That said, I still need compile_commands.json for my IDE to work.

How great would it be to just pipe my build command into make2compdb?
Well, you can!

$ echo "gcc main.c -o main" | make2compdb.exe
[
  {
    "directory": "C:\\my_project",
    "file": "main.c",
    "output": "main",
    "arguments": [
      "gcc",
      "-o",
      "main",
      "main.c"
    ]
  }
]

2. --verbose

I believe there's a lot of value in giving technical users tools to help them self-diagnose issues.
In my experience, this usually leads to better bug reports, which in turn helps with maintaining the project.

This is why I have added the --verbose CLI flag.
When it's passed to make2compdb, the stdout contains debug information.

$ make -Bwn | make2compdb.exe --verbose
make2compdb
Version: 2026-05-20
Verbose mode: true
Directory: "C:\\my_project"
CLI args: ["./make2compdb", "--verbose"]

------
Step 1: Identifying the parsing mode
    Parsing mode is SHELL

Step 2: Parsing shell command
    Invocation
        Input: "gcc -o main main.c\n"
        Tokens: ["gcc", "-o", "main", "main.c"]
        Compiler: {
            .ok = true
            .source = ["main.c"]
            .output = "main"
            .args = ["gcc", "-o", "main"]
        }
[...]

Limitations

  1. It does not support Microsoft CL.exe. I would be open to adding this feature if there's demand for it.
  2. On Linux, we only support C.UTF-8 and other English UTF8 locales.
    As a non-native English speaker, I can empathize with this being a pain point.
    That said, I didn't find a robust way to parse the Makefile output in different languages.

Note for reviewer

I've learned a lot building this project: arenas, fuzzing, CRT-free Windows programming.

When starting this project, I think I underestimated the number of corner cases to handle.
To be honest, the shell parsing humbled me a little bit. :^)

One of the reasons I appreciate w64devkit is its high standard when it comes to programming.
I think I have a lot to learn from the maintainers of this project, so I welcome any of your suggestions on how to improve make2compdb.

Closes #251

@Peter0x44
Collaborator

Epic! I'll be test driving this myself.

@skeeto
Owner

skeeto commented May 22, 2026

Thanks so much, @g-berthiaume! It's interesting to see these concepts through another person's lens.

Make all functions except the entry point static. The program is a single translation unit, and static communicates to the compiler that functions do not need external linkage. For example, it should inline any function, regardless of size, called only from a single location as there's no downside to doing so. But external linkage counts as some unknown number of other call sites. IMHO, C gets this inverted: static ought to be the default, and external linkage ought to be opt-in (via a declared interface).
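
A minimal sketch of the linkage point, using hypothetical function names that are not taken from make2compdb. The helper has internal linkage, so the compiler can see its only call site; the second function stands in for the one entry point that keeps external linkage.

```c
#include <stddef.h>

// Hypothetical helper with internal linkage. Its single call site is visible
// to the compiler, which is then free to inline it regardless of size.
static size_t count_spaces(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += s[i] == ' ';
    }
    return n;
}

// Hypothetical externally visible function, standing in for the entry point.
// The compiler must assume callers exist outside this translation unit.
size_t count_tokens(const char *s, size_t len)
{
    return len ? count_spaces(s, len) + 1 : 0;
}
```

With GCC or Clang, building with and without the static keyword and comparing the symbol tables (for instance with nm) is one way to observe the difference.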

When I compile with GCC 16 in the latest w64dk I get a -Wnonnull-compare (via -Wall) warning on line 769, asserting that a [static] parameter is non-null. This is essentially GCC warning that it's going to eliminate the assertion because UB must have occurred before it could fire. That means this is not a well-placed assertion.

It's "peek" not "peak". I was genuinely confused by this because the latter would normally mean highest value (highest memory use, highest count, etc.). Also "threated" instead of "treated" and "stars" instead of "starts", and "emtpy" instead of "empty".

The program always outputs UTF-16, and there seems to be a misunderstanding about how output is supposed to work. I'm surprised the resulting compile_commands.json actually works with anything (did it?):

$ printf 'cc x.c' | make2compdb | xxd | head -n1
00000000: 5b00 0a00 2000 2000 2000 2000 7b00 0a00  [... . . . .{...

If the output device is a console, use WriteConsoleW with UTF-16. Otherwise use WriteFile with UTF-8. Casting c16 * to u8 * does not make it UTF-8. The program is already UTF-8 most of the way through, and on Windows the output buffer is UTF-16. The common case is writing to a JSON file, not to a console, and so this means the common case would convert UTF-8-to-UTF-16 then UTF-16-to-UTF-8, which is silly. Instead use a UTF-8 buffer and, if a console is detected, convert UTF-8-to-UTF-16 at the last moment (caveat: tricky edge case around straddlers which even Microsoft's CRTs get wrong). It's more important that it outputs UTF-8 than that it prints UTF-16 to the console correctly, which is mostly a bonus to make visual inspection/debugging more reliable.

As a see-it-for-yourself test of the above, these must both produce the same result:

$ printf 'cc π.c' | make2compdb | grep file | xxd
00000000: 2020 2020 2020 2020 2266 696c 6522 3a20          "file": 
00000010: 22cf 802e 6322 2c0a                      "...c",.

$ printf 'cc π.c' | make2compdb >json && grep file json | xxd
00000000: 2020 2020 2020 2020 2266 696c 6522 3a20          "file": 
00000010: 22cf 802e 6322 2c0a                      "...c",.
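
The last-moment UTF-8-to-UTF-16 conversion described above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the function name, the signature, and the use of stdint types are all hypothetical. It leaves an incomplete trailing sequence (a straddler) unconsumed so the caller can retry it once the next chunk arrives.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not from make2compdb. Converts UTF-8 in src[0..len)
// to UTF-16 code units in dst[0..cap). Returns the number of units written
// and stores the number of source bytes consumed in *consumed. A sequence
// cut off by the end of src (a "straddler") is left unconsumed. Invalid
// input is replaced with U+FFFD.
static size_t utf8_to_utf16(uint16_t *dst, size_t cap,
                            const unsigned char *src, size_t len,
                            size_t *consumed)
{
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static const uint32_t min[] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char b = src[i];
        size_t   need = 1;
        uint32_t cp   = b;
        if      ((b & 0xe0) == 0xc0) { need = 2; cp = b & 0x1f; }
        else if ((b & 0xf0) == 0xe0) { need = 3; cp = b & 0x0f; }
        else if ((b & 0xf8) == 0xf0) { need = 4; cp = b & 0x07; }
        else if (b >= 0x80)          { cp = 0xfffd; }  // stray byte

        if (need > len - i) {
            break;  // straddler: wait for the rest of the sequence
        }

        size_t used = 1;
        while (used < need && (src[i + used] & 0xc0) == 0x80) {
            cp = (cp << 6) | (uint32_t)(src[i + used] & 0x3f);
            used++;
        }
        if (used < need || cp < min[need] || cp > 0x10ffff ||
            (cp >= 0xd800 && cp <= 0xdfff)) {
            cp = 0xfffd;  // truncated, overlong, out of range, or surrogate
        }

        size_t units = cp >= 0x10000 ? 2 : 1;
        if (units > cap - n) {
            break;  // output full: caller flushes dst and resumes
        }
        if (units == 2) {
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xd800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xdc00 | (cp & 0x3ff));
        } else {
            dst[n++] = (uint16_t)cp;
        }
        i += used;
    }
    *consumed = i;
    return n;
}
```

On Windows, a caller following the advice above would presumably hand these units to WriteConsoleW only when GetConsoleMode succeeds on the output handle, and otherwise pass the UTF-8 buffer to WriteFile untouched. Any bytes left unconsumed at end of input would still need a final flush, which this sketch does not cover.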

The parser should be more oriented around tokens than substring search. Don't substring search for "clang", match the "clang" token. You can further tokenize tokens as part of this process, first tokenizing to "x86_64-w64-mingw32-gcc", then to "x86_64" "w64" "mingw32" "gcc" to match "gcc".

Because it's substring matching, this doesn't work as intended:

$ printf 'ccache cc -c x.c' | make2compdb

Produces an arguments array of ["ccache", "cc", "-c", "x.c"] because ccache matches cc, whereas, say, mycache works fine. Add a test for this when you fix it.
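
One way the token-oriented match might look, as a rough sketch. The function names and the list of compiler names are assumptions made for illustration, and this version only examines the sub-token after the last '-', so versioned names such as gcc-13 are not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Illustrative list of compiler names (an assumption, not the PR's list).
static bool is_compiler_name(const char *s, size_t len)
{
    static const char *const names[] = {"cc", "gcc", "g++", "clang", "clang++"};
    for (size_t i = 0; i < sizeof(names) / sizeof(*names); i++) {
        if (strlen(names[i]) == len && !memcmp(names[i], s, len)) {
            return true;
        }
    }
    return false;
}

// Hypothetical check: split the command token on '-' and compare the final
// sub-token exactly, so "x86_64-w64-mingw32-gcc" matches via "gcc" while
// "ccache" does not match "cc".
static bool is_compiler_token(const char *tok)
{
    const char *last = tok;
    for (const char *p = tok; *p; p++) {
        if (*p == '-') {
            last = p + 1;
        }
    }
    return is_compiler_name(last, strlen(last));
}
```

With exact matching, a wrapper like ccache is simply not recognized as a compiler, and the scan can move on to the next token of the command.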

print_str_escaped_string produces invalid JSON for some inputs:

$ printf "cc -I'a b' x.c" | make2compdb

Produces output with escaped single quotes, "-I\'a b\'". JSON output could use some unit tests.
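
For reference, JSON only requires escaping the double quote, the backslash, and control characters below U+0020; a single quote passes through unchanged. A minimal escaping sketch, with a hypothetical name and signature rather than those of print_str_escaped_string:

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper, not the PR's function. Writes the JSON-escaped form of
// src[0..len) into dst as a NUL-terminated string. Returns the escaped length,
// or (size_t)-1 if it does not fit in cap bytes.
static size_t json_escape(char *dst, size_t cap, const char *src, size_t len)
{
    static const char hex[] = "0123456789abcdef";
    if (cap == 0) {
        return (size_t)-1;
    }
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        char   buf[6];
        size_t k = 0;
        switch (c) {
        case '"':  buf[k++] = '\\'; buf[k++] = '"';  break;
        case '\\': buf[k++] = '\\'; buf[k++] = '\\'; break;
        case '\n': buf[k++] = '\\'; buf[k++] = 'n';  break;
        case '\r': buf[k++] = '\\'; buf[k++] = 'r';  break;
        case '\t': buf[k++] = '\\'; buf[k++] = 't';  break;
        default:
            if (c < 0x20) {
                // Remaining control characters use the \u00XX form.
                buf[k++] = '\\'; buf[k++] = 'u';
                buf[k++] = '0';  buf[k++] = '0';
                buf[k++] = hex[c >> 4];
                buf[k++] = hex[c & 15];
            } else {
                buf[k++] = (char)c;  // includes ' and UTF-8 bytes, unescaped
            }
        }
        if (k >= cap - n) {
            return (size_t)-1;  // keep one byte for the terminator
        }
        memcpy(dst + n, buf, k);
        n += k;
    }
    dst[n] = 0;
    return n;
}
```

A unit test for the case above would check that the escaped form of -I'a b' is byte-for-byte identical to the input.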

The PARSING_MODE_SHELL loop doesn't seem to accumulate commands, just keep the last, so:

$ printf 'cc a.c; cc b.c' | make2compdb

Only captures the second command. Add a test for this when you fix it. Unless this is intentional?

The &>> branch doesn't peek correctly, though it works out by chance anyway.

Compiler --output flag isn't handled, though there's an aspirational comment about it.

is_source_file should probably also cover at least .s and .S (assembly).
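
A sketch of what an extension check covering assembly might look like. The function name and the extension list are assumptions for illustration, not the PR's is_source_file. The comparison is case-sensitive so that .s and .S remain distinct entries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for the PR's is_source_file. Looks at the extension
// of the final path component only.
static bool has_source_extension(const char *path)
{
    static const char *const exts[] = {".c", ".cc", ".cpp", ".cxx", ".s", ".S"};
    const char *dot = NULL;
    for (const char *p = path; *p; p++) {
        if (*p == '/' || *p == '\\') {
            dot = NULL;  // a dot in a directory name does not count
        } else if (*p == '.') {
            dot = p;
        }
    }
    if (!dot) {
        return false;
    }
    for (size_t i = 0; i < sizeof(exts) / sizeof(*exts); i++) {
        if (!strcmp(dot, exts[i])) {
            return true;
        }
    }
    return false;
}
```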

enum u32 is a strange name. Did you mean enum : u32?

Trivial to adjust, but even 16MiB arena may be a little tight for larger builds. Maybe 64M? Build commands can be quite bloated.


Nothing below is a problem, nor must change. Just pointing them out as commentary.

If you're committed to GNU-style toolchains (e.g. GCC, Clang), as MSVC is excluded, then because you're linking -lmemory, __builtin_memset (and __builtin_memcmp, etc.) is a first-class feature. In release builds, compilers likely figure out memory_set on their own, which is what makes -lmemory necessary, but it's nice to have a fast memory clear primitive even in debug builds.

Might be surprising (or might not given those const *const), but this is a semantically valid function:

void example(Str const s)
{
    s.ptr[0] = 0;
}

But s is const, right? Sure, but it doesn't propagate to pointed-at objects. So the const is doing practically nothing. That const is so pathetic in C and C++ is why I don't bother with it in code I write. Though perhaps you feel it's still useful as documentation. On the other hand, it's not used consistently.

@Peter0x44
Collaborator

> On Linux, we only support C.UTF-8 and other English UTF8 locales.
> As a non-native English speaker, I can empathize with this being a pain point.
> That said, I didn't find a robust way to parse the Makefile output in different languages.

Would calling setlocale before starting the process not help? Then it wouldn't matter what the user's locale is.
