llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-31 06:03:11 +01:00

Author	SHA1	Message	Date
DAN™	d62b532c52	Use model->gguf_kv for loading the template instead of using the C API. (#10868 ) * Bump model_template to 16384 bytes to support larger chat templates. * Use `model->gguf_kv` for efficiency.	2024-12-17 23:24:22 +01:00
Billel Mokeddem	382bc7f2e8	llama : add Falcon3 support (#10864 )	2024-12-17 17:24:56 +02:00
Georgi Gerganov	08ea539df2	unicode : improve naming style (#10838 ) * unicode : improve naming style ggml-ci * cont [no ci]	2024-12-16 12:31:45 +02:00
Georgi Gerganov	644fd71b44	sampling : refactor + optimize penalties sampler (#10803 ) * sampling : refactor + optimize penalties sampler ggml-ci * common : apply ignore_eos as logit bias ggml-ci * batched : remove penalties sampler * params : allow penalty_last_n == -1 to be equal to context size ggml-ci * common : by default, move the penalties at the end of the sampling chain ggml-ci * common : ignore all EOG tokens Co-authored-by: Diego Devesa <slarengh@gmail.com> * common : move back the penalties at the front of the sampling chain ggml-ci * readme : restore hint about --ignore-eos flag [no ci] * llama : minor ggml-ci * webui : update --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-16 12:31:14 +02:00
Valentin Mamedov	a0974156f3	llama : add Deepseek MoE v1 & GigaChat models (#10827 ) * Add deepseek v1 arch & gigachat template * improve template code * add readme * delete comments * remove comment * fix format * lint llama.cpp * fix order of deepseek and deepseek2, move gigachat temlate to the end of func * fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need * remove comments * move deepseek above deepseek2 * change placement of gigachat chat template	2024-12-15 19:02:46 +02:00
HimariO	ba1cb19cdd	llama : add Qwen2VL support + multimodal RoPE (#10361 ) * Barebone Qwen2VL LLM convertor * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, out dated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update speical image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixs * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix traililng whitespce * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remote old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-14 14:43:46 +02:00
Diego Devesa	cb13ef85a4	remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797 ) other windows build fixes	2024-12-12 19:02:49 +01:00
Djip007	19d8762ab6	ggml : refactor online repacking (#10446 ) * rename ggml-cpu-aarch64.c to .cpp * reformat extra cpu backend. - clean Q4_0_N_M and IQ4_0_N_M - remove from "file" tensor type - allow only with dynamic repack - extract cpu extra bufts and convert to C++ - hbm - "aarch64" - more generic use of extra buffer - generalise extra_supports_op - new API for "cpu-accel": - amx - aarch64 * clang-format * Clean Q4_0_N_M ref Enable restrict on C++ * add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack * added/corrected control on tensor size for Q4 repacking. * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add debug logs on repacks. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-07 14:37:50 +02:00
Riccardo Orlando	6fe6247831	llama : add Minerva 7B model support (#10673 ) * Support for Minerva 7B * Update convert_hf_to_gguf_update.py	2024-12-05 20:30:59 +02:00
JFLFY2255	8d0cfd554a	llama: Support MiniCPM-1B (with & w/o longrope) (#10559 )	2024-12-04 11:42:50 +02:00
Xuan Son Nguyen	642330ac7c	llama : add enum for built-in chat templates (#10623 ) * llama : add enum for supported chat templates * use "built-in" instead of "supported" * arg: print list of built-in templates * fix test * update server README	2024-12-02 22:10:19 +01:00
Juk Armstrong	917786f43d	Add `mistral-v1`, `mistral-v3`, `mistral-v3-tekken` and `mistral-v7` chat template types (#10572 ) * Templates: `mistral-v1`, `mistral-v2`, `mistral-v3`, `mistral-v3-tekken` * Changed system message logic and added tests for all 4 * Invalid `system_message` instead of `content` fixed * Removed tab-indented lines * Added template code and test for `mistral-v7` * Added all tests. Fixed bug with `tmpl == "llama2"` test. * Replaced tabs with spaces. * Removed `'mistral-v2'` option as no (open) models ever used it * Removed all references to 'v2' template from comments * Update llama.cpp Fixed `trim_assistant_message` bug	2024-12-01 23:09:49 +01:00
Diego Devesa	7cc2d2c889	ggml : move AMX to the CPU backend (#10570 ) * ggml : move AMX to the CPU backend --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-29 21:54:58 +01:00
Georgi Gerganov	4c0a95b107	llama : add missing model types	2024-11-28 20:45:07 +02:00
Georgi Gerganov	ab96610b1e	cmake : enable warnings in llama (#10474 ) * cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci	2024-11-26 14:18:08 +02:00
Shane A	80acb7b430	Rename Olmo1124 to Olmo2 (#10500 )	2024-11-25 19:36:09 +01:00
Diego Devesa	10bce0450f	llama : accept a list of devices to use to offload a model (#10497 ) * llama : accept a list of devices to use to offload a model * accept `--dev none` to completely disable offloading * fix dev list with dl backends * rename env parameter to LLAMA_ARG_DEVICE for consistency	2024-11-25 19:30:06 +01:00
Diego Devesa	5931c1f233	ggml : add support for dynamic loading of backends (#10469 ) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-25 15:13:39 +01:00
Diego Devesa	dc39012cba	llama : fix op mul check with command-r-plus (#10476 )	2024-11-24 16:10:26 +01:00
Georgi Gerganov	1bb30bf28c	llama : handle KV shift for recurrent models (#10402 ) ggml-ci	2024-11-21 10:22:47 +02:00
Georgi Gerganov	8e752a777b	llama : add check for KV cache shifts (#10401 ) ggml-ci	2024-11-19 13:29:26 +02:00
Shane A	a88ad007de	llama : add OLMo November 2024 support (#10394 ) * Add OLMo November 2024 constants * Add OLMo November 2024 converter * Add loading of OLMo November 2024 tensors and hyper parameters * Add building of OLMo November 2024 model	2024-11-19 11:04:08 +02:00
Diego Devesa	be5caccef9	llama : only use default buffer types for the KV cache (#10358 )	2024-11-17 12:25:45 +01:00
Johannes Gäßler	4e54be0ec6	llama/ex: remove --logdir argument (#10339 )	2024-11-16 23:00:41 +01:00
FirstTimeEZ	89e4caaaf0	llama : save number of parameters and the size in llama_model (#10286 ) fixes #10285	2024-11-16 01:42:13 +01:00
Charles Xu	1607a5e5b0	backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921 ) * backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-11-15 01:28:50 +01:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256 ) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
Michael Podvitskiy	fb4a0ec083	llama : propagate the results of `graph_compute` (#9525 ) * llama: propagating the results of `graph_compute` to the user interface * llama: reverting kv_cache in case of failed compute * llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned * llama: restore a kv_cache in case of failed computation * llama: correct reverting of the entire batch. also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models * llama: updated comments * llama : add comments about KV cache state after error --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-13 20:00:35 +02:00
Georgi Gerganov	f018acba22	llama : fix Qwen model type strings	2024-11-09 11:26:34 +02:00
wwoodsTM	5107e8cea3	DRY: Fixes clone functionality (#10192 )	2024-11-07 16:20:25 +01:00
Zhiyuan Li	3bcd40b3c5	Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133 ) * rwkv6: rename to wkv6 * rwkv6: support avx2 avx512 armv8 armv9 * rwkv6: update cuda file name * rwkv6: rename params * wkv on sycl * sycl: add some ops * sycl: Enhance OP support judgment * wkv6: drop armv9 and tranfer to GGML style ggml-ci * sync : ggml * update the function to use appropriate types * fix define error * Update ggml/src/ggml-cpu.c * add appropriate asserts * move element-wise functions outside * put the declaration outside the loop * rewrite to be more inline with the common pattern for distributing threads * use recommended way GGML_TENSOR_LOCALS --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Plamen Minev <pacominev@gmail.com> Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-11-07 15:19:10 +08:00
Diego Devesa	94d8cb8be1	metal : fix from ptr buffer name (#10189 )	2024-11-06 12:10:07 +01:00
Gabe Goodhart	b8deef0ec0	llama : add <|tool_call|> formatting to Granite template (#10177 ) Branch: GraniteToolCallTemplate Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2024-11-05 14:23:04 +02:00
Diego Devesa	9f40989351	ggml : move CPU backend to a separate file (#10144 )	2024-11-03 19:34:08 +01:00
Georgi Gerganov	1926d6e39d	llama : adjust default context size + print warnings (#10136 ) * llama : adjust default context size + print warnings ggml-ci * ggml-ci : add missing gpu-layers + adjust context sizes	2024-11-02 15:18:56 +02:00
Diego Devesa	e991e3127f	llama : use smart pointers for ggml resources (#10117 )	2024-11-01 23:48:26 +01:00
Diego Devesa	85679d37f3	llama : improve output buffer type selection (#10098 )	2024-11-01 00:49:53 +01:00
Diego Devesa	1e9f94994e	quantize : fix --keep-split (#10114 )	2024-11-01 00:45:34 +01:00
Diego Devesa	c02e5ab2a6	llama : fix buffer checks for mamba and rwk (#10111 ) * llama : fix buffer checks for mamba and rwk * llama : fix missing worst case flag during reserve * cuda : fix supports_op for norm * disable sched SET_CAUSE	2024-10-31 22:54:23 +01:00
Zhenwei Jin	ab3d71f97f	loader: refactor tensor weights storage (#9935 ) * loader: refactor tensor weights storage * use sorted map, sort weights by layer --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-10-31 19:50:39 +01:00
Diego Devesa	dea5e86051	ggml : check tensor name lengths in gguf files (#10100 )	2024-10-31 11:40:59 +01:00
Diego Devesa	c5b0f4b5d9	llama : refactor model loader with backend registry (#10026 )	2024-10-30 02:01:23 +01:00
Georgi Gerganov	8d8ff71536	llama : remove Tail-Free sampling (#10071 ) ggml-ci	2024-10-29 10:42:05 +02:00
arch-btw	61715d5cc8	llama : Add IBM granite template (#10013 ) * Add granite template to llama.cpp * Add granite template to test-chat-template.cpp * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * Update tests/test-chat-template.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * Added proper template and expected output * Small change to \n Small change to \n * Add code space & Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * Fix spacing * Apply suggestions from code review * Update src/llama.cpp --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-10-28 18:45:33 +01:00
Georgi Gerganov	8841ce3f43	llama : switch KQ multiplication to F32 precision by default (#10015 ) ggml-ci	2024-10-27 20:59:58 +02:00
wwoodsTM	ff252ea48e	llama : add DRY sampler (#9702 ) * sampling : add DRY sampler (post-refactor) * DRY: Trying to fix coauthors, removed unneeded line * DRY: Fixed redundant code * DRY: Fixed crash issue due to DRY being in chain but uninitialized --------- Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>	2024-10-25 19:07:34 +03:00
Xuan Son Nguyen	c8c07d658a	llama : fix empty batch causing llama_batch_allocr to crash (#9966 ) * llama : fix empty batch cause llama_batch_allocr to crash * move batch_allocr inside decode/encode_internal * fix build * add GGML_ASSERT * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-10-22 16:59:02 +02:00
Daniel Bevenius	19d900a756	llama : rename batch to ubatch (#9950 ) This commit renames the member field batch in llm_build_context to ubatch, and also the parameter batch in llama_build_graph, and llama_set_inputs to ubatch. The motivation for this change is to make the code more readable (considering there are the structs llama_batch and llama_sbatch), and consistent with other parts of the code base where parameters/fields of type llama_ubatch are named ubatch.	2024-10-22 16:31:06 +03:00
Molly Sophia	11d47057a5	Rwkv chat template fix (#10001 ) * llama: remove useless template matching for rwkv-world Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Add comment about the hack for rwkv models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-10-22 15:22:26 +02:00
Molly Sophia	4ff7fe1fb3	llama : add chat template for RWKV-World + fix EOT (#9968 ) * Add chat template for RWKV-World Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Fix the chat template not being used Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV v6: Set EOT token to ``\n\n`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * readme: add rwkv into supported model list Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-10-22 13:33:37 +03:00
leo-pony	6b8447352d	[CANN] Adapt to dynamically loadable backends mechanism (#9970 ) * [CANN] Adapt to dynamically loadable backends mechanism * Fix the Bug: inference running result is garbled in debug running model for LM models who's type is Q4_0 class * Handle the review comments of this pull request	2024-10-22 16:16:01 +08:00
Georgi Gerganov	f594bc80ba	ggml : add asserts for type conversion in fattn kernels (#9971 ) ggml-ci	2024-10-21 16:20:46 +03:00
Georgi Gerganov	55e47786e3	llama : default sampling changes + greedy update (#9897 ) * llama : deprecate softmax sampler + fix dist sampler ggml-ci * tests : replace macros with functions ggml-ci * sampling : change temperature sampler logic For t <= 0.0f, keep the max logit intact and set the rest to -inf * cont : no need for special "greedy" logic top-k == 1 is the same * tests : init prob correctly * llama : handle temp <= 0.0 in the temp_ext sampler too ggml-ci * cont : avoid extra loop in temperature sampler for sub-zero temp ggml-ci	2024-10-21 09:46:40 +03:00
Xuan Son Nguyen	cda0e4b648	llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745 ) * refactor llama_batch_get_one * adapt all examples * fix simple.cpp * fix llama_bench * fix * fix context shifting * free batch before return * use common_batch_add, reuse llama_batch in loop * null terminated seq_id list * fix save-load-state example * fix perplexity * correct token pos in llama_batch_allocr	2024-10-18 23:18:01 +02:00
Ouadie EL FAROUKI	87421a23e8	[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705 ) * implemented missing SYCL event APIs * sycl : Added device and backend reg interfaces * Restructured ggml-sycl.cpp	2024-10-18 06:46:16 +01:00
Ma Mingfei	60ce97c9d8	add amx kernel for gemm (#8998 ) add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend	2024-10-18 13:34:36 +08:00
Daniel Bevenius	6f55bccbb8	llama : rename batch_all to batch (#8881 ) This commit addresses the TODO in the code to rename the `batch_all` parameter to `batch` in `llama_decode_internal`.	2024-10-18 01:41:51 +02:00
Georgi Gerganov	9f45fc1e99	llama : change warning to debug log	2024-10-17 23:27:42 +03:00
Georgi Gerganov	99bd4ac28c	llama : infill sampling handle very long tokens (#9924 ) * llama : infill sampling handle very long tokens ggml-ci * cont : better indices ggml-ci	2024-10-17 22:32:47 +03:00
Diego Devesa	f010b77a37	vulkan : add backend registry / device interfaces (#9721 ) * vulkan : add backend registry / device interfaces * llama : print devices used on model load	2024-10-17 02:46:58 +02:00
Daniel Bevenius	9e04102448	llama : suppress conversion from 'size_t' to 'int' (#9046 ) * llama : suppress conversion from 'size_t' to 'int' This commit updates llm_tokenizer_spm.tokenize to suppress/remove the following warnings that are generated on Windows when using MSVC: ```console src\llama-vocab.cpp(211,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data src\llama-vocab.cpp(517,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data ``` This is done by adding a cast for the size_t returned from symbols.size(). I believe this is safe as it seems unlikely that symbols, which stores an entry for each UTF8 character, would become larger than INT_MAX. The motivation for this change is to reduce the number of warnings that are currently generated when building on Windows. * squash! llama : suppress conversion from 'size_t' to 'int' Move cast into for loop.	2024-10-16 20:34:28 +03:00
Molly Sophia	10433e8b45	llama : add tensor name for "result_norm" (#9907 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-10-16 13:10:21 +03:00
Georgi Gerganov	755a9b2bf0	llama : add infill sampler (#9896 ) ggml-ci	2024-10-15 16:35:33 +03:00
MaggotHATE	fbc98b748e	sampling : add XTC sampler (#9742 ) * Initial XTC commit Adds XTC sampler, not activated by default, but recommended settings by default. * Cleanup * Simplified chances calculation To be more inline with the original implementation, chance is calculated once at the beginning. * First fixes by comments Still need to look into sorting * Fixed trailing backspaces * Fixed RNG to be reproduceable Thanks to @slaren for directions * Fixed forgotten header * Moved `min_keep` Moved from conditions to a simple check at the end. * Fixed broken randomization Thanks to @slaren for explanation * Swapped sorting for a custom algorithm Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable. * Algorithm rework 1. Scan token from top till the first non-penalizable 2. Remove the last captured token (the least probable above threshold) 3. Shift all tokens to override the remaining penalizable 4. Penalize and put them at the the bottom. * Added XTC to `test-sampling` * Simplified algorithm and more tests * Updated info in common and args * Merged back lost commits in common and arg * Update dump info in common * Fixed incorrect min_keep check * Added XTC to README * Renamed parameters, fixed info and defaults * probability is at 0 by default, but XTC is included in sampling queue * threshold higher than 0.5 switches XTC off * Initial server support * Added XTC to server UIs * Fixed labels in old server UI * Made algorithm safer and more readable * Removed xtc_threshold_max * Fixed arg after update * Quick fixes by comments * Simplified algorithm since threshold_max is removed * Renamed random distribution * Fixed tests and outdated README * Small fixes	2024-10-15 12:54:55 +02:00
Georgi Gerganov	d4c19c0f5c	server : accept extra_context for the infill endpoint (#9874 ) * server : accept extra_context for the infill endpoint ggml-ci * server : update readme [no ci] * server : use repo-level FIM pattern if possible ggml-ci	2024-10-13 21:31:35 +03:00
Georgi Gerganov	11ac9800af	llama : improve infill support and special token detection (#9798 ) * llama : improve infill support ggml-ci * llama : add more FIM token strings ggml-ci * server : update prompt on slot restore (#9800) * gguf : deprecate old FIM token KVs	2024-10-12 08:21:51 +03:00
Diego Devesa	0e9f760eb1	rpc : add backend registry / device interfaces (#9812 ) * rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server	2024-10-10 20:14:55 +02:00
Diego Devesa	dca1d4b58a	ggml : fix BLAS with unsupported types (#9775 ) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it	2024-10-08 14:21:43 +02:00
Xuan Son Nguyen	458367a906	server : better security control for public deployments (#9776 ) * server : more explicit endpoint access settings * protect /props endpoint * fix tests * update server docs * fix typo * fix tests	2024-10-08 13:27:04 +02:00
Diego Devesa	6374743747	ggml : add backend registry / device interfaces to BLAS backend (#9752 ) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers	2024-10-07 21:55:08 +02:00
Georgi Gerganov	d5ac8cf2f2	ggml : add metal backend registry / device (#9713 ) * ggml : add metal backend registry / device ggml-ci * metal : fix names [no ci] * metal : global registry and device instances ggml-ci * cont : alternative initialization of global objects ggml-ci * llama : adapt to backend changes ggml-ci * fixes * metal : fix indent * metal : fix build when MTLGPUFamilyApple3 is not available ggml-ci * fix merge * metal : avoid unnecessary singleton accesses ggml-ci * metal : minor fix [no ci] * metal : g_state -> g_ggml_ctx_dev_main [no ci] * metal : avoid reference of device context in the backend context ggml-ci * metal : minor [no ci] * metal : fix maxTransferRate check * metal : remove transfer rate stuff --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-10-07 18:27:51 +03:00
Georgi Gerganov	8c475b97b8	rerank : use [SEP] token instead of [BOS] (#9737 ) * rerank : use [SEP] token instead of [BOS] ggml-ci * common : sanity check for non-NULL tokens ggml-ci * ci : adjust rank score interval ggml-ci * ci : add shebang to run.sh ggml-ci	2024-10-05 15:55:04 +03:00
bandoti	d6fe7abf04	ggml: unify backend logging mechanism (#9709 ) * Add scaffolding for ggml logging macros * Metal backend now uses GGML logging * Cuda backend now uses GGML logging * Cann backend now uses GGML logging * Add enum tag to parameters * Use C memory allocation funcs * Fix compile error * Use GGML_LOG instead of GGML_PRINT * Rename llama_state to llama_logger_state * Prevent null format string * Fix whitespace * Remove log callbacks from ggml backends * Remove cuda log statement	2024-10-03 17:39:03 +02:00
Diego Devesa	c83ad6d01e	ggml-backend : add device and backend reg interfaces (#9707 ) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-03 01:49:47 +02:00
Xuan Son Nguyen	a39ab216aa	llama : reduce compile time and binary size (#9712 ) * llama : speed up compile time * fix build * fix build (2)	2024-10-02 15:49:55 +02:00
Georgi Gerganov	cad341d889	metal : reduce command encoding overhead (#9698 ) * metal : reduce command encoding overhead ggml-ci * metal : add comments	2024-10-01 16:00:25 +03:00
Georgi Gerganov	a90484c6d9	llama : print correct model type for Llama 3.2 1B and 3B	2024-10-01 11:42:01 +03:00
Georgi Gerganov	f4d2b8846a	llama : add reranking support (#9510 ) * py : add XLMRobertaForSequenceClassification [no ci] * py : fix scalar-tensor conversion [no ci] * py : fix position embeddings chop [no ci] * llama : read new cls tensors [no ci] * llama : add classigication head (wip) [no ci] * llama : add "rank" pooling type ggml-ci * server : add rerank endpoint ggml-ci * llama : aboud ggml_repeat during classification * rerank : cleanup + comments * server : accept /rerank endpoint in addition to /v1/rerank [no ci] * embedding : parse special tokens * jina : support v1 reranker * vocab : minor style ggml-ci * server : initiate tests for later ggml-ci * server : add docs * llama : add comment [no ci] * llama : fix uninitialized tensors * ci : add rerank tests ggml-ci * add reranking test * change test data * Update examples/server/server.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * add `--reranking` argument * update server docs * llama : fix comment [no ci] ggml-ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-28 17:42:03 +03:00
Zhenwei Jin	6102037bbb	vocab : refactor tokenizer to reduce init overhead (#9449 ) * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fileds to avoid unused filed build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-28 15:10:58 +03:00
nopperl	9a913110cf	llama : add support for Chameleon (#8543 ) * convert chameleon hf to gguf * add chameleon tokenizer tests * fix lint * implement chameleon graph * add swin norm param * return qk norm weights and biases to original format * implement swin norm * suppress image token output * rem tabs * add comment to conversion * fix ci * check for k norm separately * adapt to new lora implementation * fix layer input for swin norm * move swin_norm in gguf writer * add comment regarding special token regex in chameleon pre-tokenizer * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix punctuation regex in chameleon pre-tokenizer (@compilade) Co-authored-by: compilade <git@compilade.net> * fix lint * trigger ci --------- Co-authored-by: compilade <git@compilade.net>	2024-09-28 15:08:43 +03:00
Gabe Goodhart	3d6bf6919f	llama : add IBM Granite MoE architecture (#9438 ) * feat(gguf-py): Add granitemoe architecture This includes the addition of new tensor names for the new moe layers. These may not be correct at this point due to the need for the hack in gguf_writer.py to double-check the length of the shape for these layers. Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert_hf_to_gguf): Add GraniteMoeModel GraniteMoe has the same configuration deltas as Granite Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(granitemoe convert): Split the double-sized input layer into gate and up After a lot of staring and squinting, it's clear that the standard mixtral expert implementation is equivalent to the vectorized parallel experts in granite. The difference is that in granite, the w1 and w3 are concatenated into a single tensor "input_linear." Rather than reimplementing all of the math on the llama.cpp side, the much simpler route is to just split this tensor during conversion and follow the standard mixtral route. Branch: GraniteMoE Co-Authored-By: alex.brooks@ibm.com Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(granitemoe): Implement granitemoe GraniteMoE follows the mixtral architecture (once the input_linear layers are split into gate_exps/up_exps). The main delta is the addition of the same four multipliers used in Granite. Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Typo fix in docstring Co-Authored-By: ggerganov@gmail.com Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(conversion): Simplify tensor name mapping in conversion Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Remove unused tensor name mappings Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Sanity check on merged FFN tensor sizes Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow "output" layer in granite moe architecture (convert and cpp) Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(granite): Add missing 'output' tensor for Granite This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-25 10:06:52 +03:00
Georgi Gerganov	31ac5834fe	llama : keep track of all EOG tokens in the vocab (#9609 ) ggml-ci	2024-09-24 10:16:06 +03:00
Georgi Gerganov	cea1486ecf	log : add CONT level for continuing previous log entry (#9610 )	2024-09-24 10:15:35 +03:00
Georgi Gerganov	b0f27361f3	sampling : avoid expensive softmax during greedy sampling (#9605 ) * sampling : avoid expensive softmax during greedy sampling ggml-ci * speculative : fix default RNG seed + set sparams.n_probs * Update tests/test-sampling.cpp Co-authored-by: slaren <slarengh@gmail.com> * sampling : add clarifying comment [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 09:03:17 +03:00
Ivan	116efee0ee	cuda: add q8_0->f32 cpy operation (#9571 ) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 02:14:24 +02:00
Shankar	ecd5d6b65b	llama: remove redundant loop when constructing ubatch (#9574 )	2024-09-22 04:30:34 +02:00
Daniel Bevenius	6443ddd985	llama : use reserve/emplace_back in sampler_sample (#9534 ) This commit updates the llama_sampler_sample function to use reserve and emplace_back for the vector of llama_token_data structs. The motivation for this change is to avoid the creation of n_vocab default-constructed llama_token_data structs which are then immediately overwritten.	2024-09-18 14:42:36 +03:00
Michael Podvitskiy	8344ef58f8	llama : fix n_vocab init for 'no_vocab' case (#9511 ) * llama: fixed n_vocab for `no_vocab` models * llama: updated error output for `llama_decode_internal` and `llama_encode_internal` * llama: log warning if there's no vocab_size in metadata * llama: correct vocab size for logging Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-17 13:18:22 +03:00
Yuri Khrustalev	503147a9f9	unicode : add <algorithm> (#9508 )	2024-09-17 09:51:15 +03:00
Gabe Goodhart	0d2ec43833	llama : support IBM Granite architecture (#9412 ) * feat(gguf-py): Add Granite model and params to gguf-py Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert_hf_to_gguf): Add registration and param setup for Granite Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): Add config parsing for Granite multiplier params Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): First pass at full port of granite deviations from llama Something is still not working right since the results are mostly terrible, but on occasion it's producing relevant results at this point, so _something_ is working. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Determine granite language 3b instruct by vocab size Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel The defaults in LlamaModel are needed for Granite as well Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Switch Granite param names to use _scale for consistency Other scalar multipliers are called _scale, so this provides a more consistent naming convention. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale The transformers names with _multiplier will now be converted to the _scale equivalent during conversion. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2024-09-17 09:44:58 +03:00
Michael Podvitskiy	37f3a3810e	llama : add llama_n_head() (#9512 )	2024-09-17 09:23:30 +03:00
Daniel Bevenius	acb2c32c33	llama : rename n_embed to n_embd in rwkv6_time_mix (#9504 ) This commit renames n_embed to n_embd in llm_build_rwkv6_time_mix. The motivation for this change is consistency with the other rwkv6 functions like build_rwkv6 (and other parts of the code base).	2024-09-16 14:07:13 +03:00
Shane A	0aadac10c7	llama : support OLMoE (#9462 )	2024-09-16 09:47:37 +03:00
CarryFun	95ca85168b	llama : support MiniCPM3 (#9322 ) Co-authored-by: 范睿凯 <fanruikai@modelbest.cn>	2024-09-16 09:45:20 +03:00
Georgi Gerganov	6262d13e0b	common : reimplement logging (#9418 ) https://github.com/ggerganov/llama.cpp/pull/9418	2024-09-15 20:46:12 +03:00
Daniel Bevenius	befaf1197f	llama : make cell_id const in inp_s_mask block (#9470 ) This commit makes the cell_id variable const in the inp_s_mask block. The motivation for this change is consistency with the code in the inp_s_copy block.	2024-09-14 10:50:12 +03:00
Georgi Gerganov	0abc6a2c25	llama : llama_perf + option to disable timings during decode (#9355 ) * llama : llama_perf + option to disable timings during decode ggml-ci * common : add llama_arg * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * perf : separate functions in the API ggml-ci * perf : safer pointer handling + naming update ggml-ci * minor : better local var name * perf : abort on invalid sampler pointer ggml-ci --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-13 09:53:38 +03:00
Gilad S.	bd35cb0ae3	feat: remove a sampler from a chain (#9445 ) * feat: remove a sampler from a chain * fix: return removed sampler * fix: safer casting	2024-09-13 03:54:49 +02:00
Dou Xinpeng	e6b7801bd1	cann: Add host buffer type for Ascend NPU (#9406 ) * feat: Add host buffer type for Ascend NPU(CANN backend) * fix some checking errors * Add a few comments	2024-09-12 19:46:43 +08:00
Ahmad Tameem	2b00fa7997	riscv : modify Makefile and add a RISCV_VECT to print log info (#9442 ) - Added ggml_cpu_has_riscv_v() in GGML to print system info in log - Modified Makefile to only use flag when cross compiling for RISC-V	2024-09-12 14:24:31 +03:00
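
Note on 6443ddd985 (#9534): the commit message describes replacing a sized vector construction with reserve plus emplace_back, so that n_vocab default-constructed elements are not created only to be overwritten. The sketch below illustrates that general pattern under stated assumptions. It is not the code from llama_sampler_sample. The names token_data, build_by_overwrite and build_by_emplace are hypothetical stand-ins, and the struct layout is assumed for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a per-token candidate record (assumed layout).
struct token_data {
    int32_t id;
    float   logit;
    float   p;
};

// Pattern the commit message describes as wasteful: the sized constructor
// value-initializes every element, and the loop then overwrites all of them.
std::vector<token_data> build_by_overwrite(const std::vector<float> & logits) {
    std::vector<token_data> cur(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cur[i] = token_data{static_cast<int32_t>(i), logits[i], 0.0f};
    }
    return cur;
}

// Pattern the commit message describes as the replacement: reserve allocates
// capacity without constructing elements, and each element is written once.
std::vector<token_data> build_by_emplace(const std::vector<float> & logits) {
    std::vector<token_data> cur;
    cur.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cur.emplace_back(token_data{static_cast<int32_t>(i), logits[i], 0.0f});
    }
    return cur;
}
```

Both functions return the same contents; the second only skips the initial pass that fills the vector with placeholder values. Whether that saving is measurable depends on the vocabulary size and the compiler, which this sketch does not attempt to quantify.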

251 Commits