llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-27 20:43:07 +01:00

Author	SHA1	Message	Date
Jeff Bolz	aea8ddd516	vulkan: fix coopmat2 validation failures (#11284 ) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-01-20 10:38:32 -06:00
Georgi Gerganov	9f7add1cde	examples : fix add_special conditions (#11311 )	2025-01-20 16:36:08 +02:00
Christopher Nielsen	90d987b105	mmap: add include for cerrno (#11296 ) ggml-ci Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-01-20 16:02:43 +02:00
Michael Podvitskiy	a4251edd6f	cmake: fix shell command quoting in build-info script (#11309 )	2025-01-20 16:02:15 +02:00
Xuan Son Nguyen	ec7f3ac9ab	llama : add support for Deepseek-R1-Qwen distill model (#11310 ) * llama : add support for Deepseek-R1-Qwen distill model * coding style	2025-01-20 14:35:07 +01:00
Georgi Gerganov	ef6dada60c	cont : fix whitespaces (#11305 )	2025-01-20 09:29:32 +02:00
Kyle Bruene	ae3c1db2f9	llama : re-add LLM_ARCH_PHIMOE (#11305 ) Phi 3.5 MoE was partially removed during a refactor. The code was originally in llama.cpp and should be in llama-model.cpp after the refactor.	2025-01-20 09:21:01 +02:00
Georgi Gerganov	92bc493917	tests : increase timeout when sanitizers are enabled (#11300 ) * tests : increase timeout when sanitizers are enabled * tests : add DEFAULT_HTTP_TIMEOUT	2025-01-19 20:22:30 +02:00
Georgi Gerganov	b9daaffe02	simple-chat : fix BOS being added to each message (#11278 )	2025-01-19 18:12:09 +02:00
Nicolò Scipione	99487b57d4	SYCL: Introducing memory host pool (#11251 ) * Implement host pool for matrix_info Creating a new memory pool on the host to store memory location for matrix_info needed to launch gemm_batch from oneMKL/oneMath. Removing complex support in gemm_batch since it is not used in llama.cpp * Remove unnecessary headers and cast * Reorder member variable to avoid warning on initialization * Formatting * Remove unused variable * Address PR review feedback - remove warning --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-01-19 21:33:34 +08:00
Eric Curtin	a1649cc13f	Adding linenoise.cpp to llama-run (#11252 ) This is a fork of linenoise that is C++17 compatible. I intend on adding it to llama-run so we can do things like traverse prompt history via the up and down arrows: https://github.com/ericcurtin/linenoise.cpp Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-18 14:42:31 +00:00
Georgi Gerganov	4dd34ff831	cmake : add sanitizer flags for llama.cpp (#11279 ) * cmake : add sanitizer flags for llama.cpp ggml-ci * tests : fix compile warnings ggml-ci * cmake : move sanitizer flags to llama_add_compile_flags ggml-ci * cmake : move llama.cpp compile flags to top level lists ggml-ci * cmake : apply only sanitizer flags at top level ggml-ci * tests : fix gguf context use in same_tensor_data * gguf-test: tensor data comparison * dummy : trigger ggml-ci * unicode : silence gcc warnings ggml-ci * ci : use sanitizer builds only in Debug mode ggml-ci * cmake : add status messages [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-01-18 16:18:15 +02:00
Xuan Son Nguyen	f30f099228	server : implement cancellable request (#11285 ) * server : implement cancellable request * fix typo * httplib 0.18.5 * fix i underflow	2025-01-18 14:12:05 +01:00
Georgi Gerganov	f26c874179	scripts : restore hf.sh (#11288 ) ggml-ci	2025-01-18 13:18:32 +02:00
LostRuins Concedo	6390a998bf	tts : add guide tokens support (#11186 ) * Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences. * applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start	2025-01-18 12:20:57 +02:00
Jeff Bolz	44e18ef939	vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281 ) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-01-18 09:26:50 +01:00
codezjx	3edfa7d375	llama.android: add field formatChat to control whether to parse special tokens when send message (#11270 )	2025-01-17 14:57:56 +02:00
Radoslav Gerganov	667d72846c	rpc : early register backend devices (#11262 ) Early register RPC devices and do not propagate RPC specifics in the llama model structures. ref: #10609	2025-01-17 10:57:09 +02:00
Georgi Gerganov	a133566d34	vocab : fix double-eos check (#11273 ) ggml-ci	2025-01-17 09:28:00 +02:00
David Renshaw	960ec65273	llama : fix deprecation message: vocabable -> vocab (#11269 )	2025-01-17 08:12:01 +01:00
musoles	7a689c415e	README : added kalavai to infrastructure list (#11216 )	2025-01-17 01:10:49 +01:00
Jeff Bolz	bd38ddea01	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (#11166 ) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-01-16 22:47:10 +01:00
Jeff Bolz	466300fe14	vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (#11206 ) Do masking on whole dwords, fetch all scales at once.	2025-01-16 22:23:49 +01:00
Jeff Bolz	206bc53422	vulkan: optimize coopmat2 q2_k dequant function (#11130 )	2025-01-16 22:16:39 +01:00
RunningLeon	4dbc8b9cb7	llama : add internlm3 support (#11233 ) * support internlm3 * fix lint	2025-01-16 20:10:38 +02:00
Johannes Gäßler	9c8dcefe17	CUDA: backwards pass for misc. ops, add tests (#11257 ) * CUDA: backwards pass for misc. ops, add tests * remove restrict from pointers	2025-01-16 16:43:38 +01:00
Xuan Son Nguyen	681149ced2	llama : add `llama_model_load_from_splits` (#11255 ) * llama : add `llama_model_load_from_splits` * update	2025-01-16 13:54:08 +01:00
fj-y-saito	c67cc9837d	ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (#11227 ) * Add SVE support for q4_K_q8_K * Update ggml/src/ggml-cpu/ggml-cpu-quants.c change to use K_SCALE_SIZE Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-16 11:11:49 +02:00
Eve	adc5dd92e8	vulkan: scale caching for k quants + misc fixes (#11081 ) * q6_k scale caching * 16 bit unpack * q4_k test (slow) * revert it * q3_k * q2_k * little stuff * try precalculating products of a and q2_k scales * Revert "try precalculating products of a and q2_k scales" This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b. * unpack should be u16, add vim swap to gitignore (about time) * better q4_k scales * q5_k * better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations * q2_k better dequant * q3_k optimizations * q3_k use hmask simd from cpu avx version * make the caches happy * q3_k separate out calculation * q2_k separate out * little stuff * use calc_superblock everywhere * q2_k optimize scale calculation * more barriers	2025-01-15 19:50:13 +00:00
Georgi Gerganov	f11cfdfd7f	ci : use -no-cnv in gguf-split tests (#11254 ) * ci : use -no-cnv in gguf-split tests ggml-ci * ci : use -no-cnv in requantize tests ggml-ci * scripts : fix [no ci]	2025-01-15 18:28:35 +02:00
Junil Kim	1d8504338e	fix: ggml: fix vulkan-shaders-gen build (#10448 ) * fix: ggml: fix vulkan-shaders-gen build The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup - Add GGML_SHADERS_GEN_TOOLCHAIN CMake option. - Auto-detect host toolchain if not set. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup Use configure_file to generate host_toolchain.cmake from template * fix: ggml: Fix compile error Fix compile error not finding vulkan-shaders-gen * fix: vulkan-shaders-gen build and path handling Fix build issues with vulkan-shaders-gen: - Add target dependency for correct build order - Use CMAKE_HOST_SYSTEM_NAME for executable suffix - Fix MSVC output directory in host toolchain - Normalize path handling for cross-compilation * fix: improve host compiler detection in vulkan shader build Improve host compiler detection for vulkan shader generation: - Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches - Consolidate compiler detection logic - Fix Windows-specific MSVC detection - Ensure correct compiler search in cross-compilation * refactor: Simplify CMake function for detecting host compiler Simplified the CMake function to improve the process of detecting the host compiler. * fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt Since `vulkan-shader-gen.cpp` only requires the `glslc` executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: `ecc93d0558`) * refactor: Rename host_toolchain.cmake.in - Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in * refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN	2025-01-15 14:17:42 +01:00
Johannes Gäßler	432df2d5f9	RoPE: fix back, CUDA support for back + noncont. (#11240 ) * RoPE: fix back, CUDA support for back + noncont. * fix comments reg. non-cont. RoPE support [no-ci]	2025-01-15 12:51:37 +01:00
Daniel Bevenius	0ccd7f3eb2	examples : add embd_to_audio to tts-outetts.py [no ci] (#11235 ) This commit contains a suggestion for adding the missing embd_to_audio function from tts.cpp to tts-outetts.py. This introduces a dependency on numpy, which I was not sure is acceptable (only PyTorch was mentioned in the referenced PR). Also the README has been updated with instructions to run the example with llama-server and the python script. Refs: https://github.com/ggerganov/llama.cpp/pull/10784#issuecomment-2548377734	2025-01-15 05:44:38 +01:00
Akarshan Biswas	f446c2cf6a	SYCL: Add gated linear attention kernel (#11175 ) * SYCL: Add Gated Linear attention kernel * glahpp: add a space at the end of file * gla: Put the barrier inside the main logic loop	2025-01-15 11:20:17 +08:00
Xuan Son Nguyen	b4d92a59a2	ci : add -no-cnv for tests (#11238 )	2025-01-14 16:42:23 +02:00
Georgi Gerganov	bbf3e55e35	vocab : add dummy tokens for "no_vocab" type (#11231 ) * vocab : add dummy tokens for "no_vocab" type ggml-ci * vocab : minor [no ci]	2025-01-14 11:54:58 +01:00
ebraminio	c5bf0d1bd7	server : Improve code snippets direction between RTL text (#11221 )	2025-01-14 11:39:33 +01:00
Olivier Chafik	091592d758	Refactor test-chat-template.cpp (#11224 ) * Refactor test-chat-template * Update test-chat-template.cpp	2025-01-14 10:16:41 +00:00
Georgi Gerganov	44d1e796d0	sync : ggml	2025-01-14 10:39:42 +02:00
Georgi Gerganov	a4f3f5d8e6	scripts : sync gguf (cont)	2025-01-14 09:40:52 +02:00
Georgi Gerganov	48e1ae0e61	scripts : sync gguf	2025-01-14 09:36:58 +02:00
Georgi Gerganov	d00a80e89d	scripts : sync opencl	2025-01-14 09:19:58 +02:00
ebraminio	504af20ee4	server : (UI) Improve messages bubble shape in RTL (#11220 ) I had overlooked the message bubble's tail placement for RTL text because I use dark mode, where it isn't visible; this fixes it.	2025-01-13 20:23:31 +01:00
Xuan Son Nguyen	84a44815f7	cli : auto activate conversation mode if chat template is available (#11214 ) * cli : auto activate conversation mode if chat template is detected * add warn on bad template * update readme (writing with the help of chatgpt) * update readme (2) * do not activate -cnv for non-instruct models	2025-01-13 20:18:12 +01:00
Andreas Kieslinger	39509fb082	cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) (#11042 ) * Refactor: Moves cuda graph executable update step to separate function. * Refactor: Moves cuda graph update check to separate function. * Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability. * Fix: Adds missing reference to maintain_cuda_graph() definition. * Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function. * Refactor: Moves node graph checks and copy ops into individual function for improved readability. * Refactor: Removes code permanently excluded from compilation to increase readability. * Style: Adds missing newline * Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one * Refactor: Makes 'cuda_graph_update_required' a local variable * remove double lines between functions --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-01-13 16:45:53 +01:00
Georgi Gerganov	a29f0870d4	contrib : add naming guidelines (cont) (#11177 )	2025-01-13 15:59:26 +02:00
ebraminio	437e05f714	server : (UI) Support for RTL text as models input or output (#11208 )	2025-01-13 14:46:39 +01:00
Georgi Gerganov	ca001f6656	contrib : add naming guidelines (cont) (#11177 )	2025-01-13 15:08:44 +02:00
Xuan Son Nguyen	00b4c3da62	common : support tag-based --hf-repo like on ollama (#11195 ) * common : support tag-based hf_repo like on ollama * fix build * various fixes * small fixes * fix style * fix windows build? * move common_get_hf_file to common.cpp * fix complain with noreturn	2025-01-13 13:56:23 +01:00
Georgi Gerganov	7426a26b24	contrib : add naming guidelines (#11177 ) * contrib : add naming guidelines * contrib : expand naming guidelines [no ci] * contrib : cont [no ci] * contrib : add `_t` suffix guideline [no ci] * contrib : cont [no ci] * minor [no ci] * contrib : move coding guidelines to correct section [no ci] * contrib : minor reword coding guidelines [no ci] * contrib : add TODO for preprocessor directives [no ci] * contrib : expand [no ci] * minor [no ci] * contrib : clarify `_context` suffix usage [no ci] * contrib : filename guidelines [no ci] * contrib : fix notes [no ci]	2025-01-13 14:46:36 +02:00

4568 Commits