llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-27 06:39:25 +01:00

Author	SHA1	Message	Date
Georgi Gerganov	d39e26741f	examples : flush log upon ctrl+c (#9559 )	2024-09-20 11:46:56 +03:00
Sigbjørn Skjæret	722ec1eb51	perplexity : do not escape input data by default (#9548 )	2024-09-20 09:38:10 +03:00
Georgi Gerganov	6026da52d6	server : clean-up completed tasks from waiting list (#9531 ) ggml-ci	2024-09-19 12:44:53 +03:00
Sigbjørn Skjæret	eca0fab44e	imatrix : disable prompt escape by default (#9543 )	2024-09-19 10:58:14 +03:00
slaren	64c6af3195	ggml : fix n_threads_cur initialization with one thread (#9538 ) * ggml : fix n_threads_cur initialization with one thread * Update ggml/src/ggml.c --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2024-09-18 10:13:08 -07:00
Georgi Gerganov	0d2f22e45c	scripts : verify py deps at the start of compare (#9520 )	2024-09-18 18:34:32 +03:00
Daniel Bevenius	6443ddd985	llama : use reserve/emplace_back in sampler_sample (#9534 ) This commit updates the llama_sampler_sample function to use reserve and emplace_back for the vector of llama_token_data structs. The motivation for this change is to avoid the creation of n_vocab default-constructed llama_token_data structs which are then immediately overwritten.	2024-09-18 14:42:36 +03:00
Vinesh Janarthanan	8a308354f6	server : match OAI structured output response (#9527 )	2024-09-18 09:50:34 +03:00
Eric Zhang	f799155ab8	server : fix OpenSSL build (remove obsolete `LOG_INFO`) (#9529 )	2024-09-18 09:28:20 +03:00
Neo Zhang Jianyu	faf67b3de4	[SYCL]set context default value to avoid memory issue, update guide (#9476 ) * set context default to avoid memory issue, update guide * Update docs/backend/SYCL.md Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-09-18 08:30:31 +08:00
Michael Podvitskiy	7be099fa81	llama-bench: correct argument parsing error message (#9524 )	2024-09-17 22:41:38 +02:00
Bert Wagner	8b836ae731	arg : add env variable for parallel (#9513 ) * add env variable for parallel * Update README.md with env: LLAMA_ARG_N_PARALLEL	2024-09-17 16:35:38 +03:00
Michael Podvitskiy	8344ef58f8	llama : fix n_vocab init for 'no_vocab' case (#9511 ) * llama: fixed n_vocab for `no_vocab` models * llama: updated error output for `llama_decode_internal` and `llama_encode_internal` * llama: log warning if there's no vocab_size in metadata * llama: correct vocab size for logging Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-17 13:18:22 +03:00
Max Krasnyansky	0226613853	threadpool : skip polling for unused threads (#9461 ) * threadpool: skip polling for unused threads Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written from one thread and read from other threads (not a race condition). * threadpool: further simplify and improve ggml_barrier Avoid using strict memory order while polling, yet make sure that all threads go through full memory barrier (memory fence) on ggml_barrier entrance and exit. * threads: add simple barrier test This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead. * threadpool: improve thread sync for new-graphs Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient, once the new graph is detected we do full fence using read-modify-write with strict memory order. * threadpool: improve abort handling Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it. Instead introduce atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it add an explicit atomic_load for n_threads_cur for consistency. * test-barrier: release threadpool before releasing the context fixes use-after-free detected by gcc thread-sanitizer on x86-64 for some reason llvm sanitizer is not detecting this issue.	2024-09-17 11:19:46 +03:00
Yuri Khrustalev	503147a9f9	unicode : add <algorithm> (#9508 )	2024-09-17 09:51:15 +03:00
Gabe Goodhart	0d2ec43833	llama : support IBM Granite architecture (#9412 ) * feat(gguf-py): Add Granite model and params to gguf-py Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert_hf_to_gguf): Add registration and param setup for Granite Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): Add config parsing for Granite multiplier params Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): First pass at full port of granite deviations from llama Something is still not working right since the results are mostly terrible, but on occasion it's producing relevant results at this point, so _something_ is working. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Determine granite language 3b instruct by vocab size Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel The defaults in LlamaModel are needed for Granite as well Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Switch Granite param names to use _scale for consistency Other scalar multipliers are called _scale, so this provides a more consistent naming convention. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale The transformers names with _multiplier will now be converted to the _scale equivalent during conversion. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2024-09-17 09:44:58 +03:00
Michael Podvitskiy	37f3a3810e	llama : add llama_n_head() (#9512 )	2024-09-17 09:23:30 +03:00
slaren	23e0d70bac	ggml : move common CPU backend impl to new header (#9509 )	2024-09-16 16:22:07 +02:00
Daniel Bevenius	acb2c32c33	llama : rename n_embed to n_embd in rwkv6_time_mix (#9504 ) This commit renames n_embed to n_embd in llm_build_rwkv6_time_mix. The motivation for this change is consistency with the other rwkv6 functions like build_rwkv6 (and other parts of the code base).	2024-09-16 14:07:13 +03:00
Michael Podvitskiy	a6a3a5c531	ggml : link MATH_LIBRARY not by its full path (#9339 )	2024-09-16 14:06:50 +03:00
compilade	d54c21df7e	convert : identify missing model files (#9397 )	2024-09-16 10:30:22 +03:00
Georgi Gerganov	19514d632e	cmake : do not hide GGML options + rename option (#9465 ) * cmake : do not hide GGML options ggml-ci * build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS for consistency ggml-ci	2024-09-16 10:27:50 +03:00
Eve	5c3d0f1824	ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422 ) * squashed readd my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049 have ggml_vec_dot_q4_0 do two blocks per loop for avx try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue * shuffle * remove f16c iq4_nl as i cant make it faster than before	2024-09-16 09:48:24 +03:00
Shane A	0aadac10c7	llama : support OLMoE (#9462 )	2024-09-16 09:47:37 +03:00
CarryFun	95ca85168b	llama : support MiniCPM3 (#9322 ) Co-authored-by: 范睿凯 <fanruikai@modelbest.cn>	2024-09-16 09:45:20 +03:00
Vinesh Janarthanan	441b72b91f	main : option to disable context shift (#9484 ) * added cli arg to disable context shift * reverted precommit * updated README.md for main * white space * allow disabling context shift in the server * Update common/arg.cpp no-context-shift only works for main example Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * added server example to --no-context-shift args * removed server changes * white space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-16 09:20:01 +03:00
Georgi Gerganov	c4965a64f7	metal : handle zero-sized allocs (#9466 )	2024-09-16 09:05:56 +03:00
Georgi Gerganov	90a2fff0e7	flake.lock: Update (#9488 )	2024-09-15 19:14:23 -07:00
Georgi Gerganov	6262d13e0b	common : reimplement logging (#9418 ) https://github.com/ggerganov/llama.cpp/pull/9418	2024-09-15 20:46:12 +03:00
slaren	e6deac31f7	gguf-split : add basic checks (#9499 ) * gguf-split : do not overwrite existing files when merging * gguf-split : error when too many arguments are passed	2024-09-15 19:02:27 +02:00
Michael Podvitskiy	6988da94a2	cmake : correct order of sycl flags (#9497 )	2024-09-15 19:55:52 +03:00
Csaba Kecskemeti	3c7989fd29	py : add "LLaMAForCausalLM" conversion support (#9485 ) Co-authored-by: Csaba Kecskemeti <csabakecskemeti@Csabas-Mac-Pro.local>	2024-09-15 10:48:25 +03:00
OSecret	d6b37c881f	readme : update tools list (#9475 ) * Added link to proprietary wrapper for Unity3d into README.md Wrapper has prebuild library and was tested on iOS, Android, WebGL, PC, Mac platforms, has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/). * Update README.md Fixes upon review	2024-09-15 10:36:53 +03:00
Michael Podvitskiy	7596487beb	cmake : try to fix sycl+intel build (#9487 )	2024-09-15 10:06:38 +03:00
Yuri Khrustalev	822b6322de	ggml : ggml_type_name return "NONE" for invalid values (#9458 ) When running on Windows, the quantization utility attempts to print the types that are not set which leads to a crash.	2024-09-14 12:54:37 +03:00
VoidIsVoid	dcdcee3a74	server: add data: [DONE] to /chat/completions stream response (#9459 )	2024-09-14 11:36:44 +02:00
Georgi Gerganov	1f4111e540	cmake : use list(APPEND ...) instead of set() + dedup linker (#9463 ) * cmake : use list(APPEND ...) instead of set() + dedup linker ggml-ci * cmake : try fix sycl * cmake : try to fix sycl 2 * cmake : fix sycl build (#9469) * try fix sycl build * use CMAKE_CXX_FLAGS as a string variable --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * one more CMAKE_CXX_FLAGS fix (#9471) --------- Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>	2024-09-14 10:55:05 +03:00
Daniel Bevenius	befaf1197f	llama : make cell_id const in inp_s_mask block (#9470 ) This commit makes the cell_id variable const in the inp_s_mask block. The motivation for this change is consistency with the code in the inp_s_copy block.	2024-09-14 10:50:12 +03:00
Xuan Son Nguyen	feff4aa846	server : add loading html page while model is loading (#9468 ) * Adding loading page for '/' server requests * set content when model is loading * removed loading html file * updated cmakelist * updated makefile * cleaned up whitespace * cleanup for PR removed error * updated server test to handle 503 HTML * updated server test to handle 503 HTML * catch 503 before parsing json * revert test * account for both api and web browser requests * precommit corrections * eol fix * revert changes to pre-commit * removed print statement * made loading message more descriptive * also support .html files --------- Co-authored-by: VJHack <flymyplane21@gmail.com> Co-authored-by: Vinesh Janarthanan <36610342+VJHack@users.noreply.github.com>	2024-09-13 14:23:11 +02:00
Georgi Gerganov	0abc6a2c25	llama : llama_perf + option to disable timings during decode (#9355 ) * llama : llama_perf + option to disable timings during decode ggml-ci * common : add llama_arg * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * perf : separate functions in the API ggml-ci * perf : safer pointer handling + naming update ggml-ci * minor : better local var name * perf : abort on invalid sampler pointer ggml-ci --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-13 09:53:38 +03:00
Gilad S.	bd35cb0ae3	feat: remove a sampler from a chain (#9445 ) * feat: remove a sampler from a chain * fix: return removed sampler * fix: safer casting	2024-09-13 03:54:49 +02:00
Mathijs Henquet	78203641fe	server : Add option to return token pieces in /tokenize endpoint (#9108 ) * server : added with_pieces functionality to /tokenize endpoint * server : Add tokenize with pieces tests to server.feature * Handle case if tokenizer splits along utf8 continuation bytes * Add example of token splitting * Remove trailing ws * Fix trailing ws * Maybe fix ci * maybe this fix windows ci? --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-09-12 22:30:11 +02:00
Dou Xinpeng	e6b7801bd1	cann: Add host buffer type for Ascend NPU (#9406 ) * feat: Add host buffer type for Ascend NPU(CANN backend) * fix some checking errors * Add a few comments	2024-09-12 19:46:43 +08:00
fengerhu1	e665744317	llava : fix the script error in MobileVLM README (#9054 ) Signed-off-by: Erhu Feng <2748250768@qq.com>	2024-09-12 14:34:22 +03:00
Xuan Son Nguyen	d4c3c10fad	lora : raise error if lm_head is ignored (#9103 ) * lora : raise error if lm_head is ignored * fix style * clarify comment	2024-09-12 14:33:57 +03:00
Michael Podvitskiy	2a825116b6	cmake : fix for builds without `GGML_CDEF_PUBLIC` (#9338 ) * `GGML_TARGET_DEFINES-NOTFOUND` fix for builds without `GGML_CDEF_PUBLIC` * Update CMakeLists.txt, spaces fix	2024-09-12 14:30:01 +03:00
Huang Qi	4dc4f5f14a	ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329 )	2024-09-12 14:28:43 +03:00
daminho	c837981bba	py : add Phi-1.5/Phi-2 tokenizer (#9361 ) * add phi2 tokenizer * add phi name to convert_hf_to_gguf_update.py * make tokenizer_pre consistent; llama.cpp work	2024-09-12 14:28:20 +03:00
Trivikram Kamat	3c26a1644d	ci : bump actions/checkout to v4 (#9377 )	2024-09-12 14:27:45 +03:00
Michael Podvitskiy	ff76e18516	cmake : fixed the order of linking libraries for llama-quantize (#9450 )	2024-09-12 14:27:14 +03:00
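
Note on commit 6443ddd985 (llama : use reserve/emplace_back in sampler_sample): the change it describes can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual llama.cpp code. The struct `token_data` and the functions `build_candidates_overwrite` and `build_candidates_reserve` are hypothetical stand-ins invented here; only the general idea (avoid creating n_vocab default-constructed elements that are immediately overwritten) comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for llama_token_data (assumption, not the real definition).
struct token_data {
    int32_t id;     // token id
    float   logit;  // raw score for this token
    float   p;      // probability, filled in later by a sampler
};

// "Before" pattern: sizing the vector up front value-initializes n_vocab
// elements, and the loop then overwrites every one of them.
std::vector<token_data> build_candidates_overwrite(const float * logits, int32_t n_vocab) {
    assert(n_vocab >= 0);
    std::vector<token_data> cur(static_cast<std::size_t>(n_vocab));
    for (int32_t id = 0; id < n_vocab; ++id) {
        cur[id] = token_data{id, logits[id], 0.0f};
    }
    return cur;
}

// "After" pattern: reserve() allocates capacity without constructing any
// elements, and emplace_back() constructs each element once with its final value.
std::vector<token_data> build_candidates_reserve(const float * logits, int32_t n_vocab) {
    assert(n_vocab >= 0);
    std::vector<token_data> cur;
    cur.reserve(static_cast<std::size_t>(n_vocab));
    for (int32_t id = 0; id < n_vocab; ++id) {
        cur.emplace_back(token_data{id, logits[id], 0.0f});
    }
    return cur;
}
```

Both functions return the same contents; the second simply skips the initial pass that writes placeholder values. Whether the difference is measurable depends on the vocabulary size and the compiler, so treat it as a tidy-up rather than a guaranteed speedup.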

4139 Commits