llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-12 21:37:19 +01:00

Author	SHA1	Message	Date
Akarshan Biswas	c6860cc734	SYCL: Refactor ggml_sycl_compute_forward (#11121 ) * SYCL: refactor ggml_sycl_compute_forward * SYCL: add back GGML_USED(dst) to ggml_sycl_cpy * SYCL: add function name to noop debug * SYCL: Some device info print refactoring and add details of XMX availability b4456	2025-01-10 08:13:03 +08:00
Tei Home	1204f97270	doc: add cuda guide for fedora (#11135 ) Since NVIDIA does not release CUDA for in-maintenance versions of Fedora, the process of setting up the CUDA toolkit on Fedora has become quite involved. This guide should help mere mortals install CUDA for development in a Fedora 39 toolbox environment, without affecting the host system.	2025-01-09 11:32:06 +00:00
Daniel Bevenius	8eceb888d7	server : add tooltips to settings and themes btn (#11154 ) * server : add tooltips to settings and themes btn This commit adds tooltips to the settings and themes buttons in the webui. The tooltip will be displayed below the actual buttons when hovered over. The motivation for this change is to clarify the purpose of the themes button. * squash! server : add tooltips to settings and themes btn This commit adds a tooltip to the '...' button when a chat has been started. The tooltip is "Chat options" which I think could be a good description as the dropdown contains options to delete or download the current chat. * rm tooltip for 3 dots button --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-01-09 11:28:29 +01:00
Pierrick Hymbert	f8feb4b01a	model: Add support for PhiMoE arch (#11003 ) * model: support phimoe * python linter * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: add phimoe as supported model ggml-ci --------- Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> b4453	2025-01-09 11:21:41 +01:00
Georgi Gerganov	be0e950c91	media : remove old img [no ci]	2025-01-09 11:15:15 +02:00
Xuan Son Nguyen	d9feae1c06	llama-chat : add phi 4 template (#11148 ) b4451	2025-01-09 10:07:33 +01:00
hydai	8d59d91171	fix: add missing msg in static_assert (#11143 ) Signed-off-by: hydai <z54981220@gmail.com> b4450	2025-01-08 20:03:28 +00:00
Vinesh Janarthanan	8a1d9c25fa	gguf-py : move scripts directory (#11116 ) * Moved scripts dir and fixed pyproject.toml * updated readme * fixed README urls * bump pypi gguf to v0.14.0 * retrigger ci * empty commit - trigger ci gguf-v0.14.0	2025-01-08 20:54:58 +02:00
Eric Curtin	1bf839b1e8	Enhance user input handling for llama-run (#11138 ) The main motivation for this change is that it was not handling ctrl-c/ctrl-d correctly. Modify `read_user_input` to handle EOF, "/bye" command, and empty input cases. Introduce `get_user_input` function to manage user input loop and handle different return cases. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-08 18:47:05 +00:00
Xuan Son Nguyen	f7cd13301c	ci : use actions from ggml-org (#11140 ) b4447	2025-01-08 16:09:20 +01:00
Xuan Son Nguyen	4d2b3d8804	lora : improve compat with `mergekit-extract-lora` (#11131 ) * (wip) support mergekit-extracted lora * support mergekit-extract-lora * use lora->get_scale * correct comment * correct norm name & condition * add some hints b4446	2025-01-08 15:59:53 +01:00
Georgi Gerganov	c07d437bbd	llama : avoid hardcoded QK_K (#11061 ) ggml-ci b4445	2025-01-08 16:19:36 +02:00
Georgi Gerganov	99a3755a3c	sync : ggml	2025-01-08 13:40:30 +02:00
Radoslav Gerganov	c792dcf488	ggml : allow loading backend with env variable (ggml/1059) ref: #1058 b4443	2025-01-08 13:40:18 +02:00
Xuan Son Nguyen	80ccf5d725	ci : pin dependency to specific version (#11137 ) * ci : pin dependency to specific version * will this fix ec?	2025-01-08 12:07:20 +01:00
Georgi Gerganov	a3c1232c3f	arg : option to exclude arguments from specific examples (#11136 ) * arg : option to exclude arguments from specific examples ggml-ci * readme : remove old args [no ci]	2025-01-08 12:55:36 +02:00
amritahs-ibm	8cef75c743	llamafile : ppc64le MMA INT8 implementation (#10912 ) This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for quantised int8 datatype. This change results in 10% - 70% improvement in total speed (i.e. all tokens / total time), across various batch sizes. The patch is tested with Meta-Llama-3-8B, Mistral-7B, Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com> b4440	2025-01-08 12:54:19 +02:00
Georgi Gerganov	0d52a69e4b	ci : fix cmake option (#11125 ) b4439	2025-01-08 11:29:34 +02:00
Mathieu Baudier	02f0430141	Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (#11117 ) * Disable GL_KHR_cooperative_matrix Vulkan extension if not available. * Perform Vulkan extensions checks in a more sensible order * Remove unnecessary #ifdef directive b4438	2025-01-08 09:18:13 +01:00
ag2s20150909	bec2183f2c	fix: Vulkan shader gen binary path when Cross-compiling (#11096 ) * fix: Vulkan shader gen binary path when cross compiling b4437	2025-01-08 09:17:29 +01:00
Johannes Gäßler	53ff6b9b9f	GGUF: C++ refactor, backend support, misc fixes (#11030 ) * GGUF: C++ refactor, backend support, misc fixes remove ggml_tensor.backend update CODEOWNERS [no ci] remove gguf_get_data from API revise GGUF API data types	2025-01-07 18:01:58 +01:00
Diego Devesa	017cc5f446	ggml-backend : only offload from host buffers (fix) (#11124 ) b4435	2025-01-07 16:11:57 +01:00
Diego Devesa	a3d50bc022	ggml-backend : only offload from host buffers (#11120 ) b4434	2025-01-07 12:38:05 +01:00
Radoslav Gerganov	a4dd490069	rpc : code cleanup (#11107 ) Remove duplicated macros, use GGML_LOG_ERROR for errors b4433	2025-01-07 08:37:02 +02:00
Akarshan Biswas	c0d6f790d0	SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 (#11087 ) * SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 * Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6" This reverts commit f62dc45f318e48d375e7734b34cbddee81deed52. * Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6 b4432	2025-01-07 14:26:07 +08:00
Eric Curtin	dc7cef9f37	llama-run : fix context size (#11094 ) Set `n_ctx` equal to `n_batch` in `Opt` class. Now context size is a more reasonable 2048. Signed-off-by: Eric Curtin <ecurtin@redhat.com> b4431	2025-01-06 23:45:28 +01:00
Georgi Gerganov	ecebbd292d	llama : remove unused headers (#11109 ) ggml-ci b4430	2025-01-06 17:52:35 +02:00
Xuan Son Nguyen	96be8c3264	github : add cmd line field to bug report (#11090 ) * github : cmd line to bug report * codeowners : (@ngxson) only watch dockerfile * Apply suggestions from code review [no ci] Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * rm cmd in log output [no ci] * rm 2 [no ci] * no need backticks [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-01-06 16:34:49 +01:00
Georgi Gerganov	e6e7c75d94	server : fix extra BOS in infill endpoint (#11106 ) * server : fix extra BOS in infill endpoint ggml-ci * server : update infill tests b4428	2025-01-06 15:36:08 +02:00
Xuan Son Nguyen	09186fabbe	llama : remove check flash_attn with lora (#11104 )	2025-01-06 13:41:12 +01:00
Asghar Ghorbani	96a1dc27c3	llama : prevent system info string accumulation across calls (#11101 ) b4426	2025-01-06 13:21:46 +02:00
Daniel Bevenius	6369f867a4	llama : rename missed batch params/vars to ubatch (#10059 ) This commit renames the `batch` parameter to `ubatch` in the `llama_kv_cache_find_slot`, `llm_build_inp_embd`, and `llm_build_mamba` functions. The motivation for this is that this should have been done as part of Commit 19d900a7565b8f6b0a708836a57d26966cb9efe2 ("llama : rename batch to ubatch (#9950)") but for some reason I missed these functions in that commit and only noticed them now (sorry). b4425	2025-01-06 11:28:17 +02:00
Georgi Gerganov	47182dd03f	llama : update llama_model API names (#11063 ) * llama : deprecate llama_free_model, add llama_model_free ggml-ci * llama : change `llama_load_model_from_file` -> `llama_model_load_from_file` ggml-ci b4424	2025-01-06 10:55:18 +02:00
Georgi Gerganov	3e6e7a6bc2	tokenize : escape the prompt (#11058 ) * tokenize : escape the prompt * tokenize : update help b4423	2025-01-06 10:54:25 +02:00
Georgi Gerganov	ae2f606bb5	mmap : fix fileno macro clash (#11076 ) * mmap : fix fileno macro clash ggml-ci * cont ggml-ci b4422	2025-01-06 10:52:38 +02:00
Georgi Gerganov	727368c60f	llama : use LLAMA_TOKEN_NULL (#11062 ) ggml-ci b4421	2025-01-06 10:52:15 +02:00
Georgi Gerganov	5047dd3546	llama : use _impl suffix instead of _internal (#11060 ) ggml-ci b4420	2025-01-06 10:52:01 +02:00
Johannes Gäßler	46e3556e01	CUDA: add BF16 support (#11093 ) * CUDA: add BF16 support b4419	2025-01-06 02:33:52 +01:00
0cc4m	b56f079e28	Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (#11074 ) * Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver * Add (TM) to AMD name check b4418	2025-01-04 21:09:59 +01:00
fairydreaming	9394bbd484	llama : Add support for DeepSeek V3 (#11049 ) * convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type * vocab : add DeepSeek V3 pre-tokenizer regexes * unicode : handle ACCENT_MARK and SYMBOL categories in regex * llama : add DeepSeek V3 chat template, handle new model parameters and tensor types --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> b4417	2025-01-04 21:06:11 +01:00
matt23654	f922a9c542	[GGML][RPC] Support for models with non-512-aligned tensors over RPC. (#11047 ) * Added init tensor calling code * Added get_alloc_size forwarding * Cleaned up and improved type/error handling. * fix: remove trailing whitespaces. * Cleanup and use GGML error logging functions. * Handle potentially dangerous edge cases. * Apply suggestions from code review Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com> b4416	2025-01-04 17:10:30 +01:00
DAN™	46be942214	llama : add support for the cohere2 model architecture (#10900 ) b4415	2025-01-04 16:33:31 +02:00
Georgi Gerganov	78c6785175	sync : ggml b4414	2025-01-04 16:09:53 +02:00
Georgi Gerganov	5e3b08d606	ggml : do not install metal source when embed library (ggml/1054)	2025-01-04 16:09:53 +02:00
Daniel Bevenius	db68c93b57	ggml : improve inputs log sched_print_assignments (ggml/1053) This commit attempts to improve the log message for the inputs of the splits in the sched_print_assignments function. The motivation for this change is that currently even if there are no inputs a colon is displayed at the end of the line, which can make it a little confusing when reading the output as it could be interpreted as the line below are inputs when they are in fact nodes. With this change the colon will only be printed if there actually are inputs.	2025-01-04 16:09:53 +02:00
Gilad S.	c31fc8b966	fix: Vulkan shader gen binary path (#11037 ) b4411	2025-01-04 09:17:31 +01:00
Molly Sophia	4b0c638b9a	common : disable KV cache shifting automatically for unsupported models (#11053 ) * Disable KV cache shifting automatically for unsupported models instead of exiting directly Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update common/common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-03 14:13:18 +02:00
Georgi Gerganov	e7da954ecc	metal : avoid uint (#11019 ) b4409	2025-01-03 11:26:14 +02:00
Georgi Gerganov	f66f582927	llama : refactor `src/llama.cpp` (#10902 ) * llama : scatter llama.cpp into multiple modules (wip) * llama : control-vector -> adapter * llama : arch * llama : mmap ggml-ci * ci : remove BUILD_SHARED_LIBS=OFF ggml-ci * llama : arch (cont) ggml-ci * llama : chat ggml-ci * llama : model ggml-ci * llama : hparams ggml-ci * llama : adapter ggml-ci * examples : fix ggml-ci * rebase ggml-ci * minor * llama : kv cache ggml-ci * llama : impl ggml-ci * llama : batch ggml-ci * cont ggml-ci * llama : context ggml-ci * minor * llama : context (cont) ggml-ci * llama : model loader ggml-ci * common : update lora ggml-ci * llama : quant ggml-ci * llama : quant (cont) ggml-ci * minor [no ci]	2025-01-03 10:18:53 +02:00
Pierrick Hymbert	2f0ee84b9b	server: bench: minor fixes (#10765 ) * server/bench: - support openAI streaming standard output with [DONE]\n\n - export k6 raw results in csv - fix too many tcp idle connection in tcp_wait - add metric time to emit first token * server/bench: - fix when prometheus not started - wait for server to be ready before starting bench	2025-01-02 18:06:12 +01:00


4456 Commits
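
The llama-run commit above (1bf839b1e8) describes `read_user_input` handling EOF, the "/bye" command, and empty input, with `get_user_input` driving the loop. The following is only a minimal sketch of that pattern, not the code from the commit: the function names come from the commit message, but the signatures, the `InputResult` enum, and the `collect_prompts` helper are assumptions made for illustration. Ctrl-C (signal) handling is not covered here.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result codes; the real llama-run code may represent these differently.
enum class InputResult { Eof, Bye, Empty, Text };

// Read one line and classify it. The signature is an assumption for illustration.
InputResult read_user_input(std::istream & in, std::string & out) {
    if (!std::getline(in, out)) {
        return InputResult::Eof;   // Ctrl-D or a closed stream
    }
    if (out == "/bye") {
        return InputResult::Bye;   // explicit exit command
    }
    if (out.empty()) {
        return InputResult::Empty; // bare Enter: nothing to send
    }
    return InputResult::Text;
}

// Keep reading until a non-empty prompt arrives.
// Returns false when the session should end (EOF or "/bye").
bool get_user_input(std::istream & in, std::string & out) {
    for (;;) {
        switch (read_user_input(in, out)) {
            case InputResult::Eof:
            case InputResult::Bye:
                return false;
            case InputResult::Empty:
                continue;          // re-prompt instead of sending an empty message
            case InputResult::Text:
                return true;
        }
    }
}

// Hypothetical helper: gather every prompt a session would have processed.
std::vector<std::string> collect_prompts(std::istream & in) {
    std::vector<std::string> prompts;
    std::string line;
    while (get_user_input(in, line)) {
        prompts.push_back(line);
    }
    return prompts;
}
```

Passing `std::cin` to `collect_prompts` gives the interactive behaviour; a `std::istringstream` can stand in for the terminal when testing. Separating classification from the loop keeps each exit condition in one place, which is presumably the motivation for the two-function split the commit message describes.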