fix: llama-finetune backward pass crashes#21924

Open
System64fumo wants to merge 1 commit into ggml-org:master from System64fumo:master

Conversation

@System64fumo

Overview

Fixes multiple crashes in llama-finetune that prevent training.

Closes: #18499 #21037

Additional information

Note: this doesn't fully fix or improve the overall behavior of the finetune tool; it only fixes the crashes.
@JohannesGaessler I propose adding a message to let users know there are deeper issues with the tool (as stated in #18499).

The changes were tested against a simple dataset to confirm that the crashes were fixed and that no further major crashes remained.
Proper finetune testing was not done, as I don't have the time to run this against a massive dataset to see actual results, but a simple test confirming basic functionality with llama-server/llama-cli passed without issue.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES: I used GLM 5.0 Turbo to analyze the crashes and propose fixes, then reviewed and picked a set of changes.

Fixes multiple crashes in llama-finetune that prevent training.
@System64fumo System64fumo requested a review from ggerganov as a code owner April 14, 2026 21:35
@github-actions github-actions bot added labels: examples, ggml (changes relating to the ggml tensor library for machine learning) Apr 14, 2026
@JohannesGaessler
Contributor

Why are you making changes to the backend scheduler?

@System64fumo
Author

> Why are you making changes to the backend scheduler?

Because otherwise:

Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x0000ffff84ef40a8 in ?? () from /usr/lib/libc.so.6
#0  0x0000ffff84ef40a8 in ?? () from /usr/lib/libc.so.6
#1  0x0000ffff84ee6f34 in ?? () from /usr/lib/libc.so.6
#2  0x0000ffff84ee6f78 in ?? () from /usr/lib/libc.so.6
#3  0x0000ffff84f3e720 in wait4 () from /usr/lib/libc.so.6
#4  0x0000ffff85b05330 in ggml_print_backtrace () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#5  0x0000ffff85b054b0 in ggml_abort () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#6  0x0000ffff85b21468 in ggml_backend_sched_alloc_graph () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#7  0x0000ffff85b35c4c in ggml_opt_alloc () from /mnt/nas/llama.cpp/build/bin/libggml-base.so.0
#8  0x0000ffff899934b8 in llama_context::opt_epoch_iter(ggml_opt_dataset*, ggml_opt_result*, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, llama_batch&, void (*)(bool, ggml_opt_context*, ggml_opt_dataset*, ggml_opt_result*, long, long, long), bool, long, long, long) () from /mnt/nas/llama.cpp/build/bin/libllama.so.0
#9  0x0000ffff89993ac4 in llama_context::opt_epoch(ggml_opt_dataset*, ggml_opt_result*, ggml_opt_result*, long, void (*)(bool, ggml_opt_context*, ggml_opt_dataset*, ggml_opt_result*, long, long, long), void (*)(bool, ggml_opt_context*, ggml_opt_dataset*, ggml_opt_result*, long, long, long)) () from /mnt/nas/llama.cpp/build/bin/libllama.so.0
#10 0x0000aaaadaeb3684 in main ()

Unless you're suggesting this be fixed elsewhere?



Development

Successfully merging this pull request may close these issues.

Misc. bug: llama-finetune won't work even with 17M parameters arch:llama
