fix: llama-finetune backward pass crashes#21924

Open
System64fumo wants to merge 1 commit into ggml-org:master from System64fumo:master

Conversation

@System64fumo

Overview

Fixes multiple crashes in llama-finetune that prevent training.

Closes: #18499 #21037

Additional information

Note: this doesn't fully fix or improve the overall behavior of the finetune tool; it only fixes the crashes.
@JohannesGaessler I propose adding a message to let users know there are deeper issues with the tool (as stated in #18499).

The changes were tested against a simple dataset to confirm that the crashes were fixed and that no further major crashes remained.
Proper finetune testing was not done, as I don't have the time to run this against a massive dataset to see actual results, but a simple test confirming basic functionality with llama-server/llama-cli passed without issue.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES: I used GLM 5.0 Turbo to analyze the crashes and propose fixes, then reviewed and picked a set of changes.

Fixes multiple crashes in llama-finetune that prevent training.
@System64fumo System64fumo requested a review from ggerganov as a code owner April 14, 2026 21:35
@github-actions github-actions bot added labels: examples, ggml (changes relating to the ggml tensor library for machine learning) Apr 14, 2026
@JohannesGaessler
Contributor

Why are you making changes to the backend scheduler?

@System64fumo
Author

> Why are you making changes to the backend scheduler?

Because otherwise:

Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x0000ffff84ef40a8 in ?? () from /usr/lib/libc.so.6
#0  0x0000ffff84ef40a8 in ?? () from /usr/lib/libc.so.6
#1  0x0000ffff84ee6f34 in ?? () from /usr/lib/libc.so.6
#2  0x0000ffff84ee6f78 in ?? () from /usr/lib/libc.so.6
#3  0x0000ffff84f3e720 in wait4 () from /usr/lib/libc.so.6
#4  0x0000ffff85b05330 in ggml_print_backtrace () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#5  0x0000ffff85b054b0 in ggml_abort () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#6  0x0000ffff85b21468 in ggml_backend_sched_alloc_graph () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#7  0x0000ffff85b35c4c in ggml_opt_alloc () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#8  0x0000ffff899934b8 in llama_context::opt_epoch_iter(ggml_opt_dataset*, ggml_opt_result*, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, llama_batch&, void (*)(bool, ggml_opt_context*, ggml_opt_dataset*, ggml_opt_result*, long, long, long), bool, long, long, long) () from /mnt/nas/llama.cpp/build/bin/libllama.so.0
#9  0x0000ffff89993ac4 in llama_context::opt_epoch(ggml_opt_dataset*, ggml_opt_result*, ggml_opt_result*, long, void (*)(bool, ggml_opt_context*, ggml_opt_dataset*, ggml_opt_result*, long, long, long), void (*)(bool, ggml_opt_context*, ggml_opt_dataset*, ggml_opt_result*, long, long, long)) () from /mnt/nas/llama.cpp/build/bin/libllama.so.0
#10 0x0000aaaadaeb3684 in main ()

Unless you're suggesting this be fixed elsewhere?



Development

Successfully merging this pull request may close these issues.

Misc. bug: llama-finetune won't work even with 17M parameters arch:llama
