* Changes to server to allow metadata override
* documentation
* flake.nix: expose full scope in legacyPackages
* flake.nix: rocm not yet supported on aarch64, so hide the output
* flake.nix: expose checks
* workflows: nix-ci: init; build flake outputs
* workflows: nix-ci: add a job for eval
* workflows: weekly `nix flake update`
* workflows: nix-flakestry: drop tag filters
...and add a job for flakehub.com
* workflows: nix-ci: add a qemu job for jetsons
* flake.nix: suggest the binary caches
* flake.lock: update
to a commit recently cached by nixpkgs-cuda-ci
---------
Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations round up to
the system scheduling quantum (usually 10ms). Other libc sleep
implementations spin for smaller time intervals, which results in the
server's busy loop consuming all available CPU. Having explicit
notify() / wait() code also improves the readability of the server code.
See mozilla-Ocho/llamafile@711344b
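A minimal sketch of the notify()/wait() pattern described above; the member names (mutex_tasks, condition_tasks, queue_tasks) are illustrative assumptions, not the server's actual identifiers:

```cpp
// Sketch only: a condition variable wakes the scheduler when work arrives,
// instead of polling with sleep(5ms).
#include <condition_variable>
#include <deque>
#include <mutex>

struct task_queue {
    std::mutex              mutex_tasks;
    std::condition_variable condition_tasks;
    std::deque<int>         queue_tasks; // pending task ids

    void post(int task_id) {
        {
            std::lock_guard<std::mutex> lock(mutex_tasks);
            queue_tasks.push_back(task_id);
        }
        condition_tasks.notify_one(); // wake the scheduler immediately
    }

    int next() {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        // sleeps with no CPU use until post() signals; no 10ms quantum round-up
        condition_tasks.wait(lock, [this] { return !queue_tasks.empty(); });
        int id = queue_tasks.front();
        queue_tasks.pop_front();
        return id;
    }
};
```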
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to shrink down to one element, thus
preventing any sampling. Note that this only applies to OpenAI-API-compatible
HTTP server requests.
The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested that this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature", then the output becomes
deterministic once more.
See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
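For illustration, the core of the fix is that these two filters must default to 1.0 (disabled) rather than 0.0. The field names below mirror llama.cpp's sampling parameters, but the struct itself is only a sketch:

```cpp
// Sketch: why a zero default breaks sampling. With tfs_z or typical_p at 0.0
// the corresponding filter keeps (at most) one candidate token; 1.0 disables
// the filter, which is the llama.cpp default.
struct oai_request_defaults {
    float tfs_z     = 1.0f; // tail-free sampling disabled
    float typical_p = 1.0f; // locally-typical sampling disabled
};
```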
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
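A hedged sketch of what such an API-key check can look like with cpp-httplib (which the server example uses); the variable names and the exact header handling are assumptions, not the merged implementation:

```cpp
// Sketch: reject requests whose Authorization header does not carry one of
// the configured keys. Names (api_keys, helper placement) are illustrative.
#include <set>
#include <string>
#include "httplib.h"

static bool is_authorized(const httplib::Request & req,
                          const std::set<std::string> & api_keys) {
    if (api_keys.empty()) {
        return true; // no key configured -> authentication disabled
    }
    const std::string auth   = req.get_header_value("Authorization");
    const std::string prefix = "Bearer ";
    if (auth.rfind(prefix, 0) == 0) {
        return api_keys.count(auth.substr(prefix.size())) > 0;
    }
    return false; // missing or malformed header -> respond with 401
}
```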
* add multiprompt support
* cleanup
* more cleanup
* remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests
* remove all references to mutex_multitasks
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* * change to set
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Add openai-compatible POST /v1/chat/completions API endpoint to server example
* fix code style
* Update server README.md
* Improve server README.md
* Fix server.cpp code style according to review
* server : some style changes
* server : indentation
* server : enable special tokens during tokenization by default
* server : minor code style
* server : change random string generator
* straightforward /v1/models endpoint
---------
Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
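For a sense of how the new OpenAI-compatible routes are wired, here is a minimal, hypothetical cpp-httplib sketch; the handler bodies and the format of the placeholder responses are assumptions, not the example's actual code:

```cpp
#include "httplib.h"
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch: register the OpenAI-compatible endpoints on the existing server.
// The real handlers convert body["messages"] into a prompt and run the
// completion pipeline; that logic is elided here.
void register_oai_routes(httplib::Server & svr) {
    svr.Post("/v1/chat/completions", [](const httplib::Request & req, httplib::Response & res) {
        const json body = json::parse(req.body);
        // ... build prompt from body["messages"], run completion ...
        json out = { {"object", "chat.completion"}, {"choices", json::array()} };
        res.set_content(out.dump(), "application/json");
    });

    svr.Get("/v1/models", [](const httplib::Request &, httplib::Response & res) {
        json out = { {"object", "list"}, {"data", json::array()} };
        res.set_content(out.dump(), "application/json");
    });
}
```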
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
* Update server.cpp with min_p after it was introduced in https://github.com/ggerganov/llama.cpp/pull/3841
* Use spaces instead of tabs
* Update index.html.hpp after running deps.sh
* Fix test - fix line ending
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
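A small illustration of the migration described above. The signatures follow the llama.cpp C API of that period; treat the seq_id == -1 "match any sequence" convention as an assumption if you are on a different version:

```cpp
#include "llama.h"

// Sketch of the replacement pattern:
void example_kv_ops(struct llama_context * ctx) {
    // old: llama_kv_cache_tokens_rm(ctx, -1, -1)  ->  new: clear everything
    llama_kv_cache_clear(ctx);

    // old positional removal -> new: remove positions [p0, p1) and, with
    // seq_id == -1, match tokens in any sequence (p1 == -1 means "to the end")
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/-1, /*p0=*/10, /*p1=*/-1);
}
```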
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
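As a rough sketch of the API shift, token metadata lookups now go through the model rather than the context; the exact signatures below reflect the llama.cpp headers around this change and should be double-checked against your version:

```cpp
#include "llama.h"
#include <cstdio>

// Sketch: query token text and score via the model-based variants.
void print_token_info(const struct llama_model * model, llama_token tok) {
    const char * text  = llama_token_get_text(model, tok);
    const float  score = llama_token_get_score(model, tok);
    printf("token %d: '%s' (score %.3f)\n", tok, text, score);
}
```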
* implementing parallel decoding in server example
* crash fixed
* save dev progress
* refactored sampling function
* completion endpoint working
* multiple client support
* grammar + no stream completion
* cached prompt support
* chat.mjs support cached prompt + some fixes
* server ui now supports multiple clients
* unused change reverted
* fixed timings per slot
* add context swap
* add changes to README.md
* llava multimodal integration
* fixed tokens probs
* add multimodal input - alpha
* refactor code + remove unused comments + improved README.md
* fix compilation errors with llvm
* notify the user from server ui that multimodality is unavailable
* some ci fixes
* fix ci make build undefined ref errors
* fix prompts longer than ctx, as proposed in #3639
* fixed premature end due to stop word
* context shift fixed
* fix llava implementation
* sync README.md changes
* readme change
* update api to be OpenAI-like
* multimodal support enabled by default
* fix make build errors
* fix multiple clients
* fix zig build
* new sampling API
* latest changes of sampling API
* server : coding-style normalization
* server : coding-style normalization (part 2)
* server : remove beam-search functionality
* server : bug fix in ingest_images
n_tokens is incremented internally by llama_batch_add
* server : use refs + use llama_batch_clear()
* server : snake case
* server : minor sync
* added thread safe pipeline
* server : batch has to be allocated for n_parallel sequences (see the llama_batch sketch after this commit's trailer)
* server : no need for atomic int - already using mutex
* server : logs + minor code style
* server : fix multibyte handle in partial response (#3706)
* fix image load + view image in chat
* make : silence stb warnings
* clip : link to ggml, not to llama
* server : fix switch fallthrough
* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)
* server : refactor ctx_sampling init + n_ctx + names
* server : bug fix for prompt caching
* Do not save/load image_data to localStorage
* editorconfig : new line in index.html
* server : completion requests remember slot_id
* Update readme to document multimodal in server
* server : minor style
* Update readme to document multimodal in server
* server : hide ctx_sampling->prev behind API (#3696)
* server : apply fix from #3722
* server : fix slot reuse
* server : add comment about changing slot_state to bool
---------
Co-authored-by: FSSRepo <go778sgt@gmail.com>
Co-authored-by: Damian Stewart <d@damianstewart.com>
Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
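Regarding the ingest_images fix and the n_parallel batch allocation above: llama_batch_add (a helper from llama.cpp's common/ code of that period) bumps batch.n_tokens itself, so callers must not increment it again, and the batch should be reused via llama_batch_clear rather than reallocated. A hedged sketch with an illustrative slot layout:

```cpp
#include "llama.h"
#include "common.h"
#include <vector>

// Sketch: queue a slot's tokens into a shared batch. llama_batch_add()
// increments batch.n_tokens internally, so there is no manual "n_tokens++".
// The batch itself should be allocated once, sized for the context and for
// all n_parallel sequences.
void queue_slot_tokens(llama_batch & batch,
                       const std::vector<llama_token> & tokens,
                       llama_seq_id slot_id, llama_pos start_pos) {
    for (size_t i = 0; i < tokens.size(); ++i) {
        const bool want_logits = (i == tokens.size() - 1); // only the last token
        llama_batch_add(batch, tokens[i], start_pos + (llama_pos) i, { slot_id }, want_logits);
    }
}

// Each decode iteration reuses the batch:
//   llama_batch_clear(batch);   // resets n_tokens to 0
```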
* Fix mirostat state when using multiple sequences
* Fix mirostat by completely refactoring sampling!
* Try to fix zig build.
* Export function to fetch/create default sampler states
Code formatting cleanups and add some comments
Silence a warning about id not being used when logging is disabled
* Apply some renaming suggestions.
Fix comments that were out of sync with the pull.
* Use more consistent naming convention for sampling contexts
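The mirostat fix works because each sequence/slot keeps its own sampling state instead of sharing a global one. A hedged sketch of that idea (llama_sampling_init comes from llama.cpp's common sampling layer of that period; the slot struct is illustrative):

```cpp
#include "common.h"
#include "sampling.h"
#include <vector>

// Sketch: one sampling context per slot so mirostat's internal mu state
// never leaks between sequences.
struct slot_state {
    llama_sampling_context * ctx_sampling = nullptr;
};

void init_slots(std::vector<slot_state> & slots, const llama_sampling_params & sparams) {
    for (auto & slot : slots) {
        slot.ctx_sampling = llama_sampling_init(sparams); // fresh mirostat state per slot
    }
}
```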
* infill tokens correction
* server infill tokens correction
* removing any leading whitespace from infill suffix and removing leading space token from suffix when params.escape
* removing any leading whitespace from infill suffix and removing leading space token from suffix when params.escape
* only rm when params.escape, rm space if possible which is added back or rm added space token
* only rm when params.escape, rm space if possible which is added back or rm added space token
* Revert "only rm when params.escape, rm space if possible which is added back or rm added space token"
This reverts commit 63ba0b621f.
* fix interactive prompt escaping and fix server infill leading space handling
* rm unnecessary bool check
* vvhg-code-infill (#1)
* infill in separate example (#2)
* reverted changes to main and added infill example
* cleanup
* naming improvement
* make : add missing blank line
* fix missing semicolon
* brought infill up to current main code
* cleanup
---------
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
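A rough sketch of the leading-space handling described in the infill commits above; the params.escape flag usage is taken from the commit messages, while the space-token lookup is left abstract because it is vocabulary-specific:

```cpp
#include "llama.h"
#include <vector>

// Sketch: when escaping is enabled, drop a leading space token from the
// tokenized infill suffix so it is not duplicated in the final prompt.
// 'space_tok' is whatever token the model's vocabulary uses for a lone " ".
void trim_suffix_space(std::vector<llama_token> & suffix_tokens,
                       llama_token space_tok, bool escape) {
    if (escape && !suffix_tokens.empty() && suffix_tokens.front() == space_tok) {
        suffix_tokens.erase(suffix_tokens.begin());
    }
}
```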