llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-15 14:50:51 +01:00

Author	SHA1	Message	Date
Xuan Son Nguyen	84a44815f7	cli : auto activate conversation mode if chat template is available (#11214 ) * cli : auto activate conversation mode if chat template is detected * add warn on bad template * update readme (writing with the help of chatgpt) * update readme (2) * do not activate -cnv for non-instruct models	2025-01-13 20:18:12 +01:00
Xuan Son Nguyen	00b4c3da62	common : support tag-based --hf-repo like on ollama (#11195 ) * common : support tag-based hf_repo like on ollama * fix build * various fixes * small fixes * fix style * fix windows build? * move common_get_hf_file to common.cpp * fix complain with noreturn	2025-01-13 13:56:23 +01:00
Xuan Son Nguyen	9a483999a6	llama : fix chat template gguf key (#11201 )	2025-01-12 13:45:14 +01:00
Georgi Gerganov	afa8a9ec9b	llama : add `llama_vocab`, functions -> methods, naming (#11110 ) * llama : functions -> methods (#11110) * llama : add struct llama_vocab to the API (#11156) ggml-ci * hparams : move vocab params to llama_vocab (#11159) ggml-ci * vocab : more pimpl (#11165) ggml-ci * vocab : minor tokenization optimizations (#11160) ggml-ci Co-authored-by: Diego Devesa <slarengh@gmail.com> * lora : update API names (#11167) ggml-ci * llama : update API names to use correct prefix (#11174) * llama : update API names to use correct prefix ggml-ci * cont ggml-ci * cont ggml-ci * minor [no ci] * vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174) ggml-ci * vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174) ggml-ci --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-01-12 11:32:42 +02:00
Georgi Gerganov	a3c1232c3f	arg : option to exclude arguments from specific examples (#11136 ) * arg : option to exclude arguments from specific examples ggml-ci * readme : remove old args [no ci]	2025-01-08 12:55:36 +02:00
Johannes Gäßler	53ff6b9b9f	GGUF: C++ refactor, backend support, misc fixes (#11030 ) * GGUF: C++ refactor, backend support, misc fixes remove ggml_tensor.backend update CODEOWNERS [no ci] remove gguf_get_data from API revise GGUF API data types	2025-01-07 18:01:58 +01:00
Georgi Gerganov	47182dd03f	llama : update llama_model API names (#11063 ) * llama : deprecate llama_free_model, add llama_model_free ggml-ci * llama : change `llama_load_model_from_file` -> `llama_model_load_from_file` ggml-ci	2025-01-06 10:55:18 +02:00
Georgi Gerganov	727368c60f	llama : use LLAMA_TOKEN_NULL (#11062 ) ggml-ci	2025-01-06 10:52:15 +02:00
Molly Sophia	4b0c638b9a	common : disable KV cache shifting automatically for unsupported models (#11053 ) * Disable KV cache shifting automatically for unsupported models instead of exiting directly Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update common/common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-03 14:13:18 +02:00
Georgi Gerganov	f66f582927	llama : refactor `src/llama.cpp` (#10902 ) * llama : scatter llama.cpp into multiple modules (wip) * llama : control-vector -> adapter * llama : arch * llama : mmap ggml-ci * ci : remove BUILD_SHARED_LIBS=OFF ggml-ci * llama : arch (cont) ggml-ci * llama : chat ggml-ci * llama : model ggml-ci * llama : hparams ggml-ci * llama : adapter ggml-ci * examples : fix ggml-ci * rebase ggml-ci * minor * llama : kv cache ggml-ci * llama : impl ggml-ci * llama : batch ggml-ci * cont ggml-ci * llama : context ggml-ci * minor * llama : context (cont) ggml-ci * llama : model loader ggml-ci * common : update lora ggml-ci * llama : quant ggml-ci * llama : quant (cont) ggml-ci * minor [no ci]	2025-01-03 10:18:53 +02:00
Xuan Son Nguyen	45095a61bf	server : clean up built-in template detection (#11026 ) * server : clean up built-in template detection * fix compilation * add chat template test * fix condition	2024-12-31 15:22:01 +01:00
Peter	6e1531aca5	common, examples, ggml : fix MSYS2 GCC compiler errors and warnings when building with LLAMA_CURL=ON and GGML_OPENCL=ON (#11013 ) In common/common.cpp: * Convert usage of stat() function call to check if file exists to standard library function std::filesystem::exists (error unable to match to correct function signature) * Additional conditions to check if PATH_MAX is already defined in WIN32 environment (warning it is already defined in MSYS2) In examples/run/run.cpp: * Add io.h header inclusion (error cannot find function _get_osfhandle) * Change initialisers for OVERLAPPED to empty struct (warning about uninitialised members) * Add initialiser for hFile (warning it may be uninitialised) * Add cast for curl_off_t percentage value to long int in generate_progress_prefix function (warning that curl_off_t is long long int) In ggml/src/ggml-opencl/ggml-opencl.cpp: * Initialise certain declared cl_mem variables to nullptr for greater safety (warning about B_d variable possibly used unassigned)	2024-12-31 01:46:06 +01:00
Molly Sophia	0a11f8b7b5	convert : fix RWKV v6 model conversion (#10913 ) * Enable --no-context-shift for llama-perplexity example Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV 6: Fix error in ggml_cuda_op_bin_bcast Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-20 11:44:58 +02:00
Georgi Gerganov	36319dec5d	tts : small QoL for easy model fetch (#10903 )	2024-12-19 17:35:15 +02:00
Georgi Gerganov	0bf2d10c55	tts : add OuteTTS support (#10784 ) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : be explicit about the pooling type in the tests ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * llama : add OuteTTS support (wip) * wip * extract features * first conv * group norm * resnet conv * resnet * attn * pos net * layer norm * convnext * head * hann window * fix n_embd + remove llama.cpp hacks * compute hann window * fft * spectrum processing * clean-up * tts : receive input text and generate codes * clip : fix new conv name * tts : minor fix * tts : add header + minor fixes ggml-ci * tts : add matchematical constant ggml-ci * tts : fix sampling + cut initial noise * tts : fixes * tts : update default samplers ggml-ci * tts : text pre-processing * tts : outetts-voc -> wavtokenizer-dec * tts : remove hardcoded constants ggml-ci * tts : fix tensor shapes * llama : refactor wavtokenizer tensors ggml-ci * cont ggml-ci * cont [no ci] * llama : update WavTokenizer to non-causal attn * llama : handle no-vocab detokenization * tts : add Python example for OuteTTS (wip) * tts : extend python example to generate spectrogram ggml-ci * server : fix rebase artifacts * tts : enable "return_tokens" in Python example ggml-ci * tts : minor fixes * common : support HF download for vocoder	2024-12-18 19:27:21 +02:00
Georgi Gerganov	152610eda9	server : output embeddings for all tokens when pooling = none (#10861 ) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : update readme [no ci] * server : fix spacing [no ci] Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * server : be explicit about the pooling type in the tests ggml-ci * server : update /embeddings and /v1/embeddings endpoints ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * server : update readme ggml-ci * server : fixes * tests : update server tests ggml-ci * server : update readme [no ci] * server : remove rebase artifact --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-12-18 13:01:41 +02:00
Georgi Gerganov	644fd71b44	sampling : refactor + optimize penalties sampler (#10803 ) * sampling : refactor + optimize penalties sampler ggml-ci * common : apply ignore_eos as logit bias ggml-ci * batched : remove penalties sampler * params : allow penalty_last_n == -1 to be equal to context size ggml-ci * common : by default, move the penalties at the end of the sampling chain ggml-ci * common : ignore all EOG tokens Co-authored-by: Diego Devesa <slarengh@gmail.com> * common : move back the penalties at the front of the sampling chain ggml-ci * readme : restore hint about --ignore-eos flag [no ci] * llama : minor ggml-ci * webui : update --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-16 12:31:14 +02:00
Eric Curtin	c27ac678dd	Opt class for positional argument handling (#10508 ) Added support for positional arguments `model` and `prompt`. Added functionality to download via strings like: llama-run llama3 llama-run ollama://granite-code llama-run ollama://granite-code:8b llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf llama-run https://example.com/some-file1.gguf llama-run some-file2.gguf llama-run file://some-file3.gguf Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-13 19:34:25 +01:00
Xuan Son Nguyen	adffa6ffd5	common : improve -ctv -ctk CLI arguments (#10806 ) * common : improve ctv ctk cli argument * regenerate docs * even better approach * use std::vector	2024-12-12 22:53:05 +01:00
Xuan Son Nguyen	9fdb124304	common : add missing env var for speculative (#10801 )	2024-12-12 16:57:32 +01:00
Bartowski	ae4b922614	imatrix : Add imatrix to --no-context-shift (#10766 ) This allows for setting the --no-context-shift value in llama-imatrix which is required for models like DeepSeek	2024-12-10 18:23:50 +01:00
Yüg	a86ad841f1	server : add flag to disable the web-ui (#10762 ) (#10751 ) Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>	2024-12-10 18:22:34 +01:00
Georgi Gerganov	c2a16c0bdb	server : fix free of spec context and batch (#10651 ) ggml-ci	2024-12-07 11:52:44 +02:00
Xuan Son Nguyen	f162d45a21	common : bring back --no-warmup to server (#10686 )	2024-12-06 13:29:05 +01:00
Xuan Son Nguyen	6c5bc0625f	server : (refactoring) do not rely on JSON internally (#10643 ) * server : (refactoring) reduce usage of json internally * move all response types to struct * wip [no ci] * many fixes * add virtual function * fix index * minor style fix * add std::move * refactor handle_completions_generic * add virtual functions * remove server.hpp * clarify server_sent_event RFC specs * apply review comments * fix model_alias and completion_probabilities * small clean up * remove virtual for to_json_oai_compat() * naming oai_compat --> oaicompat * fix unwanted recursive call * update docs	2024-12-06 11:14:32 +01:00
Xuan Son Nguyen	642330ac7c	llama : add enum for built-in chat templates (#10623 ) * llama : add enum for supported chat templates * use "built-in" instead of "supported" * arg: print list of built-in templates * fix test * update server README	2024-12-02 22:10:19 +01:00
haopeng	64ed2091b2	server: Add "tokens per second" information in the backend (#10548 ) * add cmake rvv support * add timings * remove space * update readme * fix * fix code * remove empty line * add test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-02 14:45:54 +01:00
Diego Devesa	7cc2d2c889	ggml : move AMX to the CPU backend (#10570 ) * ggml : move AMX to the CPU backend --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-29 21:54:58 +01:00
Johannes Gäßler	890719311b	common: fix warning message when no GPU found (#10564 )	2024-11-28 18:15:25 +01:00
Xuan Son Nguyen	9f912511bc	common : fix duplicated file name with hf_repo and hf_file (#10550 )	2024-11-27 22:30:52 +01:00
Georgi Gerganov	ab96610b1e	cmake : enable warnings in llama (#10474 ) * cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci	2024-11-26 14:18:08 +02:00
Georgi Gerganov	9fd8c2687f	server : add more information about error (#10455 )	2024-11-25 22:28:59 +02:00
Diego Devesa	10bce0450f	llama : accept a list of devices to use to offload a model (#10497 ) * llama : accept a list of devices to use to offload a model * accept `--dev none` to completely disable offloading * fix dev list with dl backends * rename env parameter to LLAMA_ARG_DEVICE for consistency	2024-11-25 19:30:06 +01:00
Diego Devesa	5931c1f233	ggml : add support for dynamic loading of backends (#10469 ) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-25 15:13:39 +01:00
Georgi Gerganov	d9d54e498d	speculative : refactor and add a simpler example (#10362 ) * speculative : refactor and add a simpler example ggml-ci * speculative : clean-up and add comments and TODOs [no ci] * speculative : manage context in common_speculative ggml-ci * speculative : simplify ggml-ci * speculative : simplify (cont) ggml-ci * speculative : add --draft-min CLI arg * speculative : minor fixup * make : build fixes * speculative : do not redraft previous drafts ggml-ci * speculative : fix the draft sampling ggml-ci * speculative : fix compile warning * common : refactor args ggml-ci * common : change defaults [no ci] * common : final touches ggml-ci	2024-11-25 09:58:41 +02:00
Georgi Gerganov	8e752a777b	llama : add check for KV cache shifts (#10401 ) ggml-ci	2024-11-19 13:29:26 +02:00
Johannes Gäßler	4e54be0ec6	llama/ex: remove --logdir argument (#10339 )	2024-11-16 23:00:41 +01:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256 ) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
Georgi Gerganov	b141e5f6ef	server : enable KV cache defrag by default (#10233 ) ggml-ci	2024-11-11 08:38:43 +02:00
Georgi Gerganov	5c333e0140	metal : add BF16 support (#8439 ) * ggml : add initial BF16 support ggml-ci * metal : add mul_mat_id BF16 support ggml-ci * metal : check for bfloat support on the Metal device ggml-ci * metal : better var names [no ci] * metal : do not build bfloat kernels when not supported ggml-ci * metal : try to fix BF16 support check ggml-ci * metal : this should correctly check bfloat support	2024-11-06 19:53:51 +02:00
Diego Devesa	9f40989351	ggml : move CPU backend to a separate file (#10144 )	2024-11-03 19:34:08 +01:00
Georgi Gerganov	1926d6e39d	llama : adjust default context size + print warnings (#10136 ) * llama : adjust default context size + print warnings ggml-ci * ggml-ci : add missing gpu-layers + adjust context sizes	2024-11-02 15:18:56 +02:00
Georgi Gerganov	8d8ff71536	llama : remove Tail-Free sampling (#10071 ) ggml-ci	2024-10-29 10:42:05 +02:00
wwoodsTM	ff252ea48e	llama : add DRY sampler (#9702 ) * sampling : add DRY sampler (post-refactor) * DRY: Trying to fix coauthors, removed unneeded line * DRY: Fixed redundant code * DRY: Fixed crash issue due to DRY being in chain but uninitialized --------- Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>	2024-10-25 19:07:34 +03:00
Michael Podvitskiy	d80fb71f8b	llama: string_split fix (#10022 ) * llama: Refactor string_split to use template specialization, fixes parsing strings with spaces * llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string	2024-10-25 17:57:54 +02:00
Daniel Bevenius	674804a996	arg : fix typo in embeddings argument help [no ci] (#9994 ) This commit fixes two typos in the help text for the `--embd-normalize` and `--embd-separator` arguments. It also updates common.h which contain the same typo in two comments.	2024-10-22 10:40:02 +03:00
Daniel Bevenius	94008cc760	arg : fix attention non-causal arg value hint (#9985 ) This commit updates the argument value hint for the `--attention` argument to `non-causal`. The motivation for this change is that the only values for this argument are `causal` and `non-causal`.	2024-10-21 21:12:52 +03:00
Georgi Gerganov	f594bc80ba	ggml : add asserts for type conversion in fattn kernels (#9971 ) ggml-ci	2024-10-21 16:20:46 +03:00
Georgi Gerganov	55e47786e3	llama : default sampling changes + greedy update (#9897 ) * llama : deprecate softmax sampler + fix dist sampler ggml-ci * tests : replace macros with functions ggml-ci * sampling : change temperature sampler logic For t <= 0.0f, keep the max logit intact and set the rest to -inf * cont : no need for special "greedy" logic top-k == 1 is the same * tests : init prob correctly * llama : handle temp <= 0.0 in the temp_ext sampler too ggml-ci * cont : avoid extra loop in temperature sampler for sub-zero temp ggml-ci	2024-10-21 09:46:40 +03:00
Xuan Son Nguyen	cda0e4b648	llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745 ) * refactor llama_batch_get_one * adapt all examples * fix simple.cpp * fix llama_bench * fix * fix context shifting * free batch before return * use common_batch_add, reuse llama_batch in loop * null terminated seq_id list * fix save-load-state example * fix perplexity * correct token pos in llama_batch_allocr	2024-10-18 23:18:01 +02:00

361 Commits