llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-12 13:27:21 +01:00

Author	SHA1	Message	Date
Gilad S.	c31fc8b966	fix: Vulkan shader gen binary path (#11037 ) b4411	2025-01-04 09:17:31 +01:00
Molly Sophia	4b0c638b9a	common : disable KV cache shifting automatically for unsupported models (#11053 ) * Disable KV cache shifting automatically for unsupported models instead of exiting directly Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update common/common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-03 14:13:18 +02:00
Georgi Gerganov	e7da954ecc	metal : avoid uint (#11019 ) b4409	2025-01-03 11:26:14 +02:00
Georgi Gerganov	f66f582927	llama : refactor `src/llama.cpp` (#10902 ) * llama : scatter llama.cpp into multiple modules (wip) * llama : control-vector -> adapter * llama : arch * llama : mmap ggml-ci * ci : remove BUILD_SHARED_LIBS=OFF ggml-ci * llama : arch (cont) ggml-ci * llama : chat ggml-ci * llama : model ggml-ci * llama : hparams ggml-ci * llama : adapter ggml-ci * examples : fix ggml-ci * rebase ggml-ci * minor * llama : kv cache ggml-ci * llama : impl ggml-ci * llama : batch ggml-ci * cont ggml-ci * llama : context ggml-ci * minor * llama : context (cont) ggml-ci * llama : model loader ggml-ci * common : update lora ggml-ci * llama : quant ggml-ci * llama : quant (cont) ggml-ci * minor [no ci]	2025-01-03 10:18:53 +02:00
Pierrick Hymbert	2f0ee84b9b	server: bench: minor fixes (#10765 ) * server/bench: - support openAI streaming standard output with [DONE]\n\n - export k6 raw results in csv - fix too many tcp idle connection in tcp_wait - add metric time to emit first token * server/bench: - fix when prometheus not started - wait for server to be ready before starting bench	2025-01-02 18:06:12 +01:00
Xuan Son Nguyen	0da5d86026	server : allow using LoRA adapters per-request (#10994 ) * slot.can_batch_with * lora per request * test: force disable cache prompt * move can_batch_with check * fix condition * add slow test with llama 8b * update docs * move lora change task to queue * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * lora_base * remove redundant check --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b4406	2025-01-02 15:05:18 +01:00
Benson Wong	a45433ba20	readme : add llama-swap to infrastructure section (#11032 ) * list llama-swap under tools in README * readme: add llama-swap to Infrastructure	2025-01-02 09:14:54 +02:00
Srihari-mcw	0827b2c1da	ggml : fixes for AVXVNNI instruction set with MSVC and Clang (#11027 ) * Fixes for clang AVX VNNI * enable AVX VNNI and alder lake build for MSVC * Apply suggestions from code review --------- Co-authored-by: slaren <slarengh@gmail.com> b4404	2024-12-31 15:23:33 +01:00
Xuan Son Nguyen	45095a61bf	server : clean up built-in template detection (#11026 ) * server : clean up built-in template detection * fix compilation * add chat template test * fix condition b4403	2024-12-31 15:22:01 +01:00
Xuan Son Nguyen	5896c65232	server : add OAI compat for /v1/completions (#10974 ) * server : add OAI compat for /v1/completions * add test * add docs * better docs b4402	2024-12-31 12:34:13 +01:00
ymcki	bc7b1f8632	convert : fix Llama-3_1-Nemotron-51B rope settings (#11008 ) * conflict resolution * move comments after bracket to its own line * DeciLMCausalModel now reads rope_theta from config.json properly	2024-12-31 13:04:48 +02:00
Peter	6e1531aca5	common, examples, ggml : fix MSYS2 GCC compiler errors and warnings when building with LLAMA_CURL=ON and GGML_OPENCL=ON (#11013 ) In common/common.cpp: * Convert usage of stat() function call to check if file exists to standard library function std::filesystem::exists (error unable to match to correct function signature) * Additional conditions to check if PATH_MAX is already defined in WIN32 environment (warning it is already defined in MSYS2) In examples/run/run.cpp: * Add io.h header inclusion (error cannot find function _get_osfhandle) * Change initialisers for OVERLAPPED to empty struct (warning about uninitialised members) * Add initialiser for hFile (warning it may be uninitialised) * Add cast for curl_off_t percentage value to long int in generate_progress_prefix function (warning that curl_off_t is long long int) In ggml/src/ggml-opencl/ggml-opencl.cpp: * Initialise certain declared cl_mem variables to nullptr for greater safety (warning about B_d variable possibly used unassigned) b4400	2024-12-31 01:46:06 +01:00
Jeff Bolz	716bd6dec3	vulkan: optimize mul_mat for small values of N (#10991 ) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base. b4399	2024-12-30 18:27:11 +01:00
ag2s20150909	c250ecb315	android : fix llama_batch free (#11014 ) b4398	2024-12-30 14:35:13 +02:00
Jeff Bolz	a813badbbd	vulkan: im2col and matmul optimizations for stable diffusion (#10942 ) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup b4397	2024-12-29 10:16:34 +01:00
Jeff Bolz	fdd2188912	vulkan: Use push constant offset to handle misaligned descriptors (#10987 ) b4396	2024-12-29 09:35:11 +01:00
Isaac McFadyen	f865ea149d	server: added more docs for response_fields field (#10995 )	2024-12-28 16:09:19 +01:00
Alexey Parfenov	16cdce7b68	server : fix token duplication when streaming with stop strings (#10997 ) b4394	2024-12-28 16:08:54 +01:00
Eve	d79d8f39b4	vulkan: multi-row k quants (#10846 ) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default b4393	2024-12-26 16:54:44 +01:00
Peter	d283d02bf2	examples, ggml : fix GCC compiler warnings (#10983 ) Warning types fixed (observed under MSYS2 GCC 14.2.0): * format '%ld' expects argument of type 'long int', but argument has type 'size_t' * llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first) b4392	2024-12-26 14:59:11 +01:00
Reza Kakhki	9ba399dfa7	server : add support for "encoding_format": "base64" to the /embeddings endpoints (#10967 ) add support for base64 * fix base64 test * improve test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b4391	2024-12-24 21:33:04 +01:00
Djip007	2cd43f4900	ggml : more perfo with llamafile tinyblas on x86_64 (#10714 ) * more perfo with llamafile tinyblas on x86_64. - add bf16 suport - change dispache strategie (thanks: https://github.com/ikawrakow/ik_llama.cpp/pull/71 ) - reduce memory bandwidth simple tinyblas dispache and more cache freindly * tinyblas dynamic dispaching * sgemm: add M blocs. * - git 2.47 use short id of len 9. - show-progress is not part of GNU Wget2 * remove not stable test b4390	2024-12-24 18:54:49 +01:00
NeverLucky	09fe2e7613	server: allow filtering llama server response fields (#10940 ) * llama_server_response_fields * llama_server_response_fields_fix_issues * params fixes * fix * clarify docs * change to "response_fields" --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b4389	2024-12-24 17:39:49 +01:00
Georgi Gerganov	30caac3a68	llama : the WPM vocabs use the CLS token as BOS (#10930 ) * llama : the WPM vocabs use the CLS token as BOS ggml-ci * llama : add comment b4388	2024-12-24 09:44:20 +02:00
Diego Devesa	60cfa728e2	ggml : use wstring for backend search paths (#10960 ) ggml-ci b4387	2024-12-24 04:05:27 +01:00
Diego Devesa	3327bb0f8d	ggml : fix arm enabled features check (#10961 ) b4386	2024-12-24 04:05:17 +01:00
Diego Devesa	32d6ee6385	ggml : fix const usage in SSE path (#10962 ) b4385	2024-12-23 20:25:52 +01:00
Xuan Son Nguyen	14b699ecde	server : fix missing model id in /model endpoint (#10957 ) * server : fix missing model id in /model endpoint * fix ci b4384	2024-12-23 12:52:25 +01:00
Xuan Son Nguyen	485dc01214	server : add system_fingerprint to chat/completion (#10917 ) * server : add system_fingerprint to chat/completion * update README b4383	2024-12-23 12:02:44 +01:00
Radoslav Gerganov	86bf31cfe6	rpc-server : add support for the SYCL backend (#10934 ) b4382	2024-12-23 10:39:30 +02:00
Yun Dou	b92a14a841	llama : support InfiniAI Megrez 3b (#10893 ) * Support InfiniAI Megrez 3b * Fix tokenizer_clean_spaces for megrez b4381	2024-12-23 01:35:44 +01:00
ymcki	6f0c9e034b	llama : support for Llama-3_1-Nemotron-51B (#10669 ) * conflict resolution * move comments after bracket to its own line b4380	2024-12-23 01:22:33 +01:00
Eric Curtin	dab76c92cc	llama-run : include temperature option (#10899 ) This commit updates the `examples/run/README.md` file to include a new option for setting the temperature and updates the `run.cpp` file to parse this option. Signed-off-by: Eric Curtin <ecurtin@redhat.com> b4379	2024-12-23 01:21:40 +01:00
yuri@FreeBSD	7024d59e6a	ggml : fix run-time on FreeBSD in get_executable_path() (#10948 ) b4378	2024-12-23 01:20:11 +01:00
Rudi Servo	7c0e285858	devops : add docker-multi-stage builds (#10832 )	2024-12-22 23:22:58 +01:00
Billel Mokeddem	7ae33a616f	llama : add Falcon3 support (#10883 ) * Add Falcon3 model support * Add fix for adding bos to added special tokens * Add comment explaining the logic behind the if statement * Add a log message to better track the when the following line of code is triggered * Update log to only print when input and output characters are different * Fix handling pre-normalized tokens * Refactoring b4376	2024-12-23 00:09:58 +02:00
Jeff Bolz	ebdee9478c	vulkan: build fixes for 32b (#10927 ) * vulkan: build fixes for 32b Should fix #10923 * vulkan: initialize some buffer/offset variables b4375	2024-12-22 10:44:01 +01:00
Georgi Gerganov	5cd85b5e00	convert : add BertForMaskedLM (#10919 )	2024-12-21 10:10:18 +02:00
Jeff Bolz	a91a41364b	vulkan: optimize coopmat2 dequant functions (#10855 ) Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.	2024-12-21 08:04:45 +01:00
Adrien Gallouët	e34c5af43f	ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (#10874 ) * ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml-cpu: format code Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> b4372	2024-12-21 00:33:37 +01:00
Akarshan Biswas	eb5c3dc64b	SYCL: Migrate away from deprecated ggml_tensor->backend (#10840 ) * Migrate to tensor->buffer for checking backend buffer type: 1 * SYCL: common.cpp try to migrate away from tensor->backend * SYCL: fix assertions and add proper comments * SYCL: remove extra space * SYCL: Add back static to ggml_backend_buffer_is_sycl_split function * SYCL: Add pragma directive to suppress warning spam * SYCL: Integrate debug logs with GGML_LOG and other fixes * Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes" This reverts commit 2607b7de0f0d2f4f1f690226f86fa861aa39cb97. Let's keep the current SYCL specific logging mechanism for now * SYCL: Use GGML_SYCL_DEBUG after reverting * SYCL: reg_get_proc_address func, update to the current func signature * SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d b4371	2024-12-20 23:31:28 +08:00
Xuan Son Nguyen	0ca416c91a	server : (UI) fix copy to clipboard function (#10916 )	2024-12-20 14:12:06 +01:00
Diego Devesa	21ae3b9be8	ggml : add test for SVE and disable when it fails (#10906 ) b4369	2024-12-20 13:31:28 +01:00
Molly Sophia	0a11f8b7b5	convert : fix RWKV v6 model conversion (#10913 ) * Enable --no-context-shift for llama-perplexity example Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV 6: Fix error in ggml_cuda_op_bin_bcast Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> b4368	2024-12-20 11:44:58 +02:00
Georgi Gerganov	d408bb9268	clip : disable GPU support (#10896 ) ggml-ci b4367	2024-12-19 18:47:15 +02:00
Georgi Gerganov	5cab3e4aaa	llama : minor grammar refactor (#10897 ) ggml-ci b4366	2024-12-19 17:42:13 +02:00
Georgi Gerganov	36319dec5d	tts : small QoL for easy model fetch (#10903 ) b4365	2024-12-19 17:35:15 +02:00
Xuan Son Nguyen	57bb2c40cd	server : fix logprobs, make it OAI-compatible (#10783 ) * server : fix logprobs, make it openai-compatible * update docs * add std::log * return pre-sampling p * sort before apply softmax * add comment * fix test * set p for sampled token * update docs * add --multi-token-probs * update docs * add `post_sampling_probs` option * update docs [no ci] * remove --multi-token-probs * "top_probs" with "post_sampling_probs" * resolve review comments * rename struct token_prob to prob_info * correct comment placement * fix setting prob for sampled token	2024-12-19 15:40:08 +01:00
Adrien Gallouët	a3c33b1dce	ggml: fix arm build with gcc (#10895 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b4363	2024-12-19 14:20:41 +01:00
Sukriti Sharma	2fffc52b50	llama : fix Roberta embeddings (#10856 ) * fix: Use gpt2 tokenizer for roberta and add eos/bos tokens Branch: RobertaTokenizer Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fixes to position embeddings Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * map roberta-bpe to gpt-2 Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix linting Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> b4362	2024-12-19 15:04:51 +02:00

1 2 3 4 5 ...

4461 Commits