llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-02-09 09:48:16 +01:00

Author	SHA1	Message	Date
compilade	f7625019c5	server : fix crash when system prompt is bigger than batch size (#5714 ) The system prompt is now decoded in batches. * server : fix off-by-one n_past when start of prompt matches whole cache The tokens right after the matching part would otherwise skip a pos value.	2024-02-25 20:43:50 +02:00
Radosław Gryta	abbabc5e51	ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711 ) * [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility vqtbl1q_u8 is not part of arm v7 neon library * [android-example] Remove abi filter after arm v7a fix * [github-workflows] Do not skip Android armeabi-v7a build	2024-02-25 20:43:00 +02:00
Pierrick Hymbert	930b178026	server: logs - unified format and --log-format option (#5700 ) * server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id * server : skip GH copilot requests from logging * server : change message format of server_log() * server : no need to repeat log in comment * server : log style consistency * server : fix compile warning * server : fix tests regex patterns on M2 Ultra * server: logs: PR feedback on log level * server: logs: allow to choose log format in json or plain text * server: tests: output server logs in text * server: logs switch init logs to server logs macro * server: logs ensure value json value does not raised error * server: logs reduce level VERBOSE to VERB to max 4 chars * server: logs lower case as other log messages * server: logs avoid static in general Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: logs PR feedback: change text log format to: LEVEL [function_name] message \| additional=data --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-25 13:50:32 +01:00
Pierrick Hymbert	d52d7819b8	server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708 ) * server: monitoring - add /metrics prometheus compatible endpoint * server: concurrency issue, when 2 task are waiting for results, only one call thread is notified * server: metrics - move to a dedicated struct	2024-02-25 13:49:43 +01:00
Georgi Gerganov	ab336a9d5e	code : normalize enum names (#5697 ) * coda : normalize enum names ggml-ci * code : cont * code : cont	2024-02-25 12:09:09 +02:00
Pierrick Hymbert	9e359a4f47	server: continue to update other slots on embedding concurrent request (#5699 ) * server: #5655 - continue to update other slots on embedding concurrent request. * server: tests: add multi users embeddings as fixed * server: tests: adding OAI compatible embedding concurrent endpoint * server: tests: adding OAI compatible embedding with multiple inputs	2024-02-24 19:16:04 +01:00
Kawrakow	4c4cb30736	IQ3_S: a much better alternative to Q3_K (#5676 ) * iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scaler dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * Resurrecting iq3_xs After all the experimentation, nothing was better than this. * Minor PPL improvement via a block scale fudge factor * Minor improvement via 3 neighbours * iq3_xs: working scalar and AVX2 dot products * iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s) * iq3_xs: working Metal implementation * Adding IQ3_M - IQ3_XS mix with mostly Q4_K * iiq3_xs: a 3.4375 bpw variant * iq3_xs: make CUDA work for new version * iq3_xs: make scalar and AVX2 work for new version * iq3_s: make ARM_NEON work with new version * iq3_xs: make new version work on metal Performance is very similar to Q3_K_S * iq3_xs: tiny Metal speed improvement * iq3_xs: tiny Metal speed improvement * Fix stupid warning * Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS * iq3_xs: rename to iq3_s * iq3_s: make tests pass * Move Q3_K_XS mix to 3.25 bpw * Attempt to fix failing tests * Another attempt to fix the Windows builds * Attempt to fix ROCm * ROCm again * iq3_s: partial fix for QK_K = 64 * iq3_s: make it work on metal for QK_K = 64 Pleasent surprise: the coding was super-block size independent, so all it took was to delete some QK_K == 256 guards. * Will this fix ROCm? --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-24 16:23:52 +02:00
Pierrick Hymbert	525213d2f5	server: init functional tests (#5566 ) * server: tests: init scenarios - health and slots endpoints - completion endpoint - OAI compatible chat completion requests w/ and without streaming - completion multi users scenario - multi users scenario on OAI compatible endpoint with streaming - multi users with total number of tokens to predict exceeds the KV Cache size - server wrong usage scenario, like in Infinite loop of "context shift" #3969 - slots shifting - continuous batching - embeddings endpoint - multi users embedding endpoint: Segmentation fault #5655 - OpenAI-compatible embeddings API - tokenize endpoint - CORS and api key scenario * server: CI GitHub workflow --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-24 12:28:55 +01:00
AlpinDale	fd43d66f46	server : add KV cache quantization options (#5684 )	2024-02-23 21:31:54 +02:00
Xuan Son Nguyen	a46f50747b	server : fallback to chatml, add AlphaMonarch chat template (#5628 ) * server: fallback to chatml * add new chat template * server: add AlphaMonarch to test chat template * server: only check model template if there is no custom tmpl * remove TODO	2024-02-22 10:33:24 +02:00
Alexey Parfenov	c5688c6250	server : clarify some params in the docs (#5640 )	2024-02-22 10:27:32 +02:00
Xuan Son Nguyen	7c8bcc11dc	Add docs for llama_chat_apply_template (#5645 ) * add docs for llama_chat_apply_template * fix typo	2024-02-22 00:31:00 +01:00
Jared Van Bortel	89febfed93	examples : do not assume BOS when shifting context (#5622 )	2024-02-21 10:33:54 -05:00
Pierrick Hymbert	1ecea255eb	server: health: fix race condition on slots data using tasks queue (#5634 ) * server: health: fix race condition on slots data using tasks queue * server: health: * include_slots only if slots_endpoint * fix compile warning task.target_id not initialized.	2024-02-21 15:47:48 +01:00
Daniel Bevenius	cc6cac08e3	llava : add --skip-unknown to 1.6 convert.py (#5632 ) This commit adds the `--skip-unknown` option to the convert.py script and removes the saving of the updated checkpoints to avoid updating possibly checked out files. The motivation for this change is that this was done for 1.5 in Commit `fc0c8d286a` ("llava : update surgery script to not remove tensors") and makes the examples more consistent. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-21 15:36:57 +02:00
Kawrakow	a14679cc30	IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590 ) * iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scaler dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * iq4_nl: Fix after merging with master * iq4_nl: another fix after merging with master * Use IQ4_NL instead of Q4_K when using k-quants is not possible * Fix typo that makes several tests fail * It was the ggml_vdotq thing missed inside the brackets --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-21 11:39:52 +02:00
CJ Pais	6560bed3f0	server : support llava 1.6 (#5553 ) * server: init working 1.6 * move clip_image to header * remove commented code * remove c++ style from header * remove todo * expose llava_image_embed_make_with_clip_img * fix zig build	2024-02-20 21:07:22 +02:00
Daniel Bevenius	4ed8e4fbef	llava : add explicit instructions for llava-1.6 (#5611 ) This commit contains a suggestion for the README.md in the llava example. The suggestion adds explicit instructions for how to convert a llava-1.6 model and run it using llava-cli. The motivation for this is that having explicit instructions similar to the 1.5 instructions will make it easier for users to try this out. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-20 19:30:27 +02:00
Xuan Son Nguyen	9c405c9f9a	Server: use llama_chat_apply_template (#5593 ) * server: use llama_chat_apply_template * server: remove trailing space * server: fix format_chat * server: fix help message Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: fix formatted_chat --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-20 15:58:27 +01:00
Pierrick Hymbert	c0a8c6db37	server : health endpoint configurable failure on no slot (#5594 )	2024-02-20 09:48:19 +02:00
nopperl	9d679f0fcc	examples : support minItems/maxItems in JSON grammar converter (#5039 ) * support minLength and maxLength in JSON schema grammar converter * Update examples/json-schema-to-grammar.py --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-19 16:14:07 +02:00
Georgi Gerganov	1387cf60f7	llava : remove extra cont (#5587 )	2024-02-19 15:23:17 +02:00
slaren	6fd413791a	llava : replace ggml_cpy with ggml_cont	2024-02-19 15:09:43 +02:00
Daniel Bevenius	7084755396	llava : avoid changing the original BakLLaVA model (#5577 ) This is a follup of Commit `fc0c8d286a` ("llava : update surgery script to not remove tensors") but this time the change is to the BakLLaVA specific part of the surgery script. I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works as expected using the instructions in README.md. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-19 10:31:59 +02:00
NawafAlansari	4480542b22	baby-llama : allocate graphs in ggml_context (#5573 ) * Fixed the baby-llama issue (see issue #4830) * minor : fix whitespaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-19 10:25:38 +02:00
Georgi Gerganov	b1de96824b	ci : fix wikitext url + compile warnings (#5569 ) ggml-ci	2024-02-18 22:39:30 +02:00
Robey Holderith	5ee99c32f5	common, server : surface min_keep as its own parameter (#5567 ) * Feature - surface min_keep as its own parameter * Updated README with min_keep param	2024-02-18 21:11:16 +02:00
Pierrick Hymbert	c145f8a132	server : slots monitoring endpoint (#5550 )	2024-02-18 19:39:57 +02:00
Pierrick Hymbert	e75c6279d1	server : enhanced health endpoint (#5548 ) * server: enrich health endpoint with available slots, return 503 if not slots are available * server: document new status no slot available in the README.md	2024-02-18 18:31:28 +02:00
Pierrick Hymbert	36376abe05	server : --n-predict option document and cap to max value (#5549 ) * server: document --n-predict * server: ensure client request cannot override n_predict if set * server: fix print usage LF in new --n-predict option	2024-02-18 18:30:09 +02:00
Daniel Hiltgen	66c1968f7a	server : graceful server shutdown (#5244 ) This updates the server queue to support graceful shutdown of the server on signals.	2024-02-18 18:23:16 +02:00
Herman Semenov	5d3de51f97	ggml, common, examples, tests : fixed type arguments in printf (#5528 )	2024-02-18 18:20:12 +02:00
Daniel Bevenius	fc0c8d286a	llava : update surgery script to not remove tensors (#5536 ) This commit updates the surgery script to not remove the tensors from the model file. For this to work the `--skip-unknown` flag is added as an argument to the convert.py script in README.md. The motivation for this change is that the surgery script currently removes the projector tensors from the model file. If the model was checked out from a repository, the model file will have been updated and have to be checked out again to reset this effect. If this can be avoided I think it would be preferable. I did not perform this change for BakLLaVA models as I am not sure how that part works.	2024-02-18 18:19:23 +02:00
Kawrakow	bd2d4e393b	1.5 bit quantization (#5453 ) * iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-18 18:16:55 +02:00
Ananta Bastola	6e4e973b26	ci : add an option to fail on compile warning (#3952 ) * feat(ci): add an option to fail on compile warning * Update CMakeLists.txt * minor : fix compile warnings ggml-ci * ggml : fix unreachable code warnings ggml-ci * ci : disable fatal warnings for windows, ios and tvos * ggml : fix strncpy warning * ci : disable fatal warnings for MPI build * ci : add fatal warnings to ggml-ci ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-17 23:03:14 +02:00
Herman Semenov	4cb0727698	llava : removed excess free(NULL) operation (#5531 )	2024-02-16 14:43:23 +02:00
Alexey Parfenov	6dcc02d244	server : add "samplers" param to control the samplers order (#5494 )	2024-02-16 13:33:25 +02:00
Rőczey Barnabás	5f5808ca7b	server : fix system prompt cli (#5516 )	2024-02-16 12:00:56 +02:00
bmwl	f486f6e1e5	ggml : add numa options (#5377 ) * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverted Makefile * Fixed include * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables * removed trailing whitespace * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverting Makefile * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet * Removing MIRROR_MODE code for this PR * Removing last bit of MIRROR_MODE code for this PR * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static * Fixed lingering init_llama_backend() bool calls in tests and examples * Remote enum llama_numa_strategies * Revert bad merge with dynatemp flags * add missing enum ggml_numa_strategies declaration and revert sync problem with master * add missing enum ggml_numa_strategies declaration * fixed ggml_init_numa variable * Update ggml.h Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges * split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples * Fix up some boolean vs enum comparisons * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype * Update ggml.h Align enum values Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c Remove whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c align paremeters Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/common.cpp Remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example * Update ggml.c simplified return for platforms without NUMA support Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * removed redundant else from cli argument processing of --numa * whitespace --------- Co-authored-by: root <root@nenya.lothlorien.ca> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-16 11:31:07 +02:00
Daniel Bevenius	60ed04cf82	llava : fix clip-model-is-vision flag in README.md (#5509 ) * llava: fix clip-model-is-vision flag in README.md This commit fixes the flag `--clip_model_is_vision` in README.md which is does not match the actual flag: ```console $ python convert-image-encoder-to-gguf.py --help ... --clip-model-is-vision The clip model is a pure vision model (ShareGPT4V vision extract for example) ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llava: update link to vit config in README.md Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-16 11:24:39 +02:00
Georgi Gerganov	c06e45d729	clip : fix wrong loop condition	2024-02-15 18:49:08 +02:00
Elbios	0d4177126b	llava : fix memory management bug (#5491 ) * Fix memory management in llava and server code Fixes this error: llama_new_context_with_model: graph splits (measure): 3 Available slots: -> Slot 0 - max context: 6000 {"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"} all slots are idle and system prompt is empty, clear the KV cache slot 0 - loaded image slot 0 is processing [task id: 0] slot 0 : kv cache rm - [0, end) slot 0 - encoding image [id: 1] munmap_chunk(): invalid pointer Aborted * Make it cleaner by checking size in batch free wrapper	2024-02-15 10:01:57 +02:00
John	7930a8a6e8	llaba : hotfix for llava-1.6 image number (#5495 ) Co-authored-by: John <cmt-nct@users.noreply.github.com>	2024-02-15 09:59:18 +02:00
John	ccbb277f46	llava : update README.md (#5489 ) * Update README.md * Update README.md * Update examples/llava/README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-14 16:49:42 +02:00
John	aa23412989	llava : support v1.6 (#5267 ) * Create llava-survery-v2.py * Update convert-image-encoder-to-gguf.py * Update convert-image-encoder-to-gguf.py * Rename llava-survery-v2.py to llava-surgery-v2.py * Update convert-image-encoder-to-gguf.py will now search for projector * Update convert-image-encoder-to-gguf.py whoops * Update llava-surgery-v2.py * Clip: Bugfix for normalization (it did not loat the 3 std and mean values) Clip: bicubic resize function Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6) Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final convert-image-encoder: fixed image-grid flattening * whitespace corrections * ws * Tensors are now properly permuted. Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference. * ws * added verbose_prompt support into cli added stopwords for llava-1.6 into cli * moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed * ws * convert : skip unknown tensors (need for LLaVA) * llava : update readme * llava : fix compile warnings * llava : style * convert : add --skip-unknown CLI arg * server : remove clip structs * bugfix for non llava-1.6 It should now work with llava-1.5 as well * clip : minor code rearrange * llava : update readme a bit --------- Co-authored-by: John <cmt-nct@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-14 09:38:35 +02:00
John	6c00a06692	gguf : add python reader example (#5216 ) * Update CMakeLists.txt * Create reader.py * Update reader.py * Update reader.py another whitespace :\| * Update reader.py * lintlintlint	2024-02-13 19:56:38 +02:00
Daniel Bevenius	263978904c	finetune : rename feed-forward tensors (w1/w2/w3) (#4839 ) * finetune: rename feed-forward tensors (w1/w2/w3) This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be inline with the names used in the llama_layer struct in llama.cpp. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * train-text-from-scratch: rename ff tensors This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be inline with the names used in the llama_layer struct in llama.cpp Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-13 15:15:42 +02:00
Douglas Hanley	03bf161eb6	llama : support batched embeddings (#5466 ) * batched embedding: pool outputs by sequence id. updated embedding example * bring back non-causal attention * embd : minor improvements * llama : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-13 14:06:58 +02:00
Daniel Bevenius	4a46d2b792	llava : remove prog parameter from ArgumentParser (#5457 ) * llava: remove prog parameter from ArgumentParser This commit removes the `prog` parameter from `ArgumentParser` so that it uses the default value which is the name of the script. The motivation for this change is that currently the usage output looks like this: ```console $ python examples/llava/convert-image-encoder-to-gguf.py --help usage: convert_hf_to_gguf.py [-h] ... ``` And with this change it will look like this: ```console $ python examples/llava/convert-image-encoder-to-gguf.py --help usage: convert-image-encoder-to-gguf.py [-h] ... ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * ci: add W503 to flake8 ignore list This commit adds W503 to the ignore list for flake8. This is done to avoid the following error: W503 line break before binary operator Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-12 10:38:44 +02:00
Georgi Gerganov	3b169441df	sync : ggml (#5452 ) * ggml-alloc : v3 (ggml/727) * ggml-alloc v3 ggml-ci * fix ci ggml-ci * whisper : check for backend buffer allocation failures * whisper : avoid leaks when initialization fails * cleanup ggml-ci * style fixes ggml-ci * sync : ggml * update llama.cpp, clip.cpp, export-lora.cpp * update finetune.cpp, train-text-from-scratch.cpp ggml-ci * ggml-backend : reduce alignment to 32 to match gguf and fix mmap --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-12 09:16:06 +02:00
Douglas Hanley	2891c8aa9a	Add support for BERT embedding models (#5423 ) * BERT model graph construction (build_bert) * WordPiece tokenizer (llm_tokenize_wpm) * Add flag for non-causal attention models * Allow for models that only output embeddings * Support conversion of BERT models to GGUF * Based on prior work by @xyzhang626 and @skeskinen --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 11:21:38 -05:00
Alexey Parfenov	684780141a	server : allow to specify tokens as strings in logit_bias (#5003 ) * server: allow to specify tokens as strings in logit_bias * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:38:14 +02:00
Georgi Gerganov	85910c5b30	main : ctrl+C print timing in non-interactive mode (#3873 )	2024-02-11 15:35:50 +02:00
Johannes Gäßler	e4640d8fdf	lookup: add print for drafting performance (#5450 )	2024-02-11 12:44:51 +01:00
Xuan Son Nguyen	907e08c110	server : add llama2 chat template (#5425 ) * server: add mistral chat template * server: fix typo * server: rename template mistral to llama2 * server: format_llama2: remove BOS * server: validate "--chat-template" argument * server: clean up using_chatml variable Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-11 12:16:22 +02:00
Daniel Bevenius	e00d2a62dd	llava : add requirements.txt and update README.md (#5428 ) * llava: add requirements.txt and update README.md This commit adds a `requirements.txt` file to the `examples/llava` directory. This file contains the required Python packages to run the scripts in the `examples/llava` directory. The motivation of this to make it easier for users to run the scripts in `examples/llava`. This will avoid users from having to possibly run into missing package issues if the packages are not installed on their system. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llava: fix typo in llava-surgery.py output Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-09 15:00:59 +02:00
Riley Stewart	7c777fcd5d	server : fix prompt caching for repeated prompts (#5420 )	2024-02-09 12:49:49 +02:00
Daniel Bevenius	ff4ff05c5f	llava : add missing .py, and fix paths in README.md (#5414 ) This commit adds the missing .py extension to the convert-image-encoder-to-gguf script. It also fixes the paths for the `model` and `mmproj` options in the example llava-cli command. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-08 16:20:03 +02:00
Daniel Bevenius	a6e514a85f	llava: fix typo/formatting in README.md (#5405 ) This commit fixes a typo in the README.md file for the llava example which is causing the formatting to look a little off: Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-08 09:58:19 +01:00
Xiao-Yong Jin	0ef46da632	llava-cli : always tokenize special tokens (#5382 ) * llava-cli: tokenize special tokens in prompt * llava-cli: use the escape CLI argument, remove incomplete separate escaping process	2024-02-07 10:17:25 +02:00
Justin Parker	f3e2b4fa3f	server : update `/props` with "total_slots" value (#5373 ) * include total "num_slots" in default_generation_settings_for_props * cleanup total_slots return value in /props endpoint * update /props endpoint docs with total_slots * remove num_slots from default_generation_settings_for_props * update /props endpoint section	2024-02-07 08:15:19 +02:00
Alexey Parfenov	213d1439fa	server : remove model.json endpoint (#5371 )	2024-02-06 20:08:38 +02:00
Justin Parker	8a79c591de	server : include total "num_slots" in props endpoint (#5349 )	2024-02-06 11:20:59 +02:00
Michael Coppola	31e7903221	server : add `dynatemp_range` and `dynatemp_exponent` (#5352 ) * server: added `dynatemp_range` and `dynatemp_exponent` * Update README.md --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-02-06 11:20:00 +02:00
Niall Coates	4ffc7a17d4	server : various fixes for the prompt field in /completion (#5300 ) server : fix deadlock when prompt array contains strings and numbers server : removed an unnecessary generation when generating multi-prompts server : removed an unnecessary assert	2024-02-06 10:16:23 +02:00
Alexey Parfenov	a2d60c9158	server : allow to get default generation settings for completion (#5307 )	2024-02-05 10:10:22 +02:00
Kawrakow	5ed26e1fc9	Adding some imatrix tools (#5302 ) * imatrix: adding --combine and --continue-from * imatrix: be able to start from a specific chunk --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-04 10:39:58 +02:00
Michael Klimenko	52bb63c708	refactor : switch to emplace_back to avoid extra object (#5291 )	2024-02-03 13:23:37 +02:00
kalomaze	191221178f	perplexity : fix KL divergence calculations on Windows (#5273 )	2024-02-02 16:15:30 +02:00
Neo Zhang Jianyu	af3ba5d946	[SYCL] update guide of SYCL backend (#5254 ) * update guide for make installation, memory, gguf model link, rm todo for windows build * add vs install requirement * update for gpu device check * update help of llama-bench * fix grammer issues	2024-02-02 15:53:27 +08:00
Neo Zhang Jianyu	128dcbd3c9	add --no-mmap in llama-bench (#5257 ) * add --no-mmap, show sycl backend * fix conflict * fix code format, change print for --no-mmap * ren no_mmap to mmap, show mmap when not default value in printer * update guide for mmap * mv position to reduce model reload	2024-02-01 20:48:53 +01:00
Georgi Gerganov	5cb04dbc16	llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240 ) * llama : remove LLAMA_MAX_DEVICES from llama.h ggml-ci * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * server : remove LLAMA_MAX_DEVICES ggml-ci * llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD ggml-ci * train : remove LLAMA_SUPPORTS_GPU_OFFLOAD * readme : add deprecation notice * readme : change deprecation notice to "remove" and fix url * llama : remove gpu includes from llama.h ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-31 17:30:17 +02:00
JidongZhang-THU	15606309a0	llava : add MobileVLM support (#5132 ) * New Feature: 1. Sum_Rows: fix cuda kernel overflow fix block shape error when nrows too big 2. Im2Col: Support Batch in cuda Support f32 to f32 both in cpu && cuda 3. DepthWiseConv: Support by Im2Col && MulMat 4. Pool_2d: Supoort avg pooling in cuda 5. HardSigmoid: Imp in cuda 6. HardSwish: Imp in cuda * fix tabs instead of spaces * code clean * CUDA POOL2D * ADD POOL2D test case in test-backend-ops.cpp * code clean * fix pool2d_kernel nits * fix bug in pool2d kernel * fix avg pooling, count_include_pad nits * test-backend-ops : add more pool_2d tests * cuda : fix warnings and formatting * ggml : check types in release builds too in pool_2d * test-backend-ops : remove f16 pool_2d tests * cuda : more style fixes * Add assert in ggml_cuda_op_pool2d * pool2d float padding fallback * test-backend-ops : add dst_type to im2col --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-31 15:10:15 +02:00
Neo Zhang Jianyu	b2b9f025e7	format license text, restore apache license by legal suggestion (#5233 )	2024-01-31 18:34:46 +05:30
Neo Zhang Jianyu	01684139c3	support SYCL backend windows build (#5208 ) * support SYCL backend windows build * add windows build in CI * add for win build CI * correct install oneMKL * fix install issue * fix ci * fix install cmd * fix install cmd * fix install cmd * fix install cmd * fix install cmd * fix win build * fix win build * fix win build * restore other CI part * restore as base * rm no new line * fix no new line issue, add -j * fix grammer issue * allow to trigger manually, fix format issue * fix format * add newline * fix format * fix format * fix format issuse --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-01-31 08:08:07 +05:30
Jared Van Bortel	e8dc55d006	kompute : llama-bench support and ggml_cpu_has_kompute() (#5226 )	2024-01-30 19:04:37 -05:00
Georgi Gerganov	e0085fdf7c	Revert "server : change deps.sh xxd files to string literals (#5221 )" This reverts commit `4003be0e5f`.	2024-01-30 21:19:26 +02:00
Georgi Gerganov	e6f291d158	server : fix context shift (#5195 ) * server : fix context shift + simplify self-extend * server : take system_tokens into account * server : more n_past fixes * server : rever n_past_se changes	2024-01-30 20:17:30 +02:00
JohnnyB	4003be0e5f	server : change deps.sh xxd files to string literals (#5221 ) * Changed ugly xxd to literals. HPP files are much more readable as multiline literals rather than hex arrays. * Dashes in literal variable names. Replace . and - with _ in file names -> variable names. * Comment on removing xxd. XXD-> string literals * XXD to string literals. Replaced these unreadable headers with string literal versions using new deps.sh.	2024-01-30 20:15:05 +02:00
Kawrakow	f4d7e54974	SOTA 3-bit quants (#5196 ) * iq3_xxs: quantize/dequantize RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more. * iq3_xxs: CUDA dequantize works * iq2_xxs: tuning quantization * iq3_xxs: starting to look better PPL on wiki.test.raw LLaMA-v1-7B: 6.4218 LLaMA-v2-7B: 6.3560 Mistral-7B : 6.0717 This is better than Q3_K_XS, with a 5% reduction in quantized model size. * iq3_xxs: CUDA dot product We have PP-512: 5891 t/s TG-128: 143.9 t/s * iq3_xxs: scalar and AVX2 dot products * iq3_xxs: ARM_NEON and Metal Metal performance is decent, ARM_NEON is pathetic * iq3_xxs: slightly better grid points * Faster iq3_xxs and iq2_xs dot products on CUDA * iq3_xxs: add some quant mix * iq3_xxs: fix failing quantization test Dot product still fails. Is this real? * iq3_xxs: hopefully fix ROCm * iq3_xxs: failing tests This time the dot product accuracy did find an actual bug in the AVX2 implementation. * Add IQ3_XXS to test-backend-ops --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 15:14:12 +02:00
Vladimir Malyutin	7359016c7c	quantize : fix typo (#5211 ) Fix misprint in quantize help	2024-01-30 12:57:07 +02:00
divinity76	813416991a	main : allow empty --prompt-cache file (#5176 ) * allow empty --prompt-cache file This allows the use of std::tmpnam(), std::tmpfile(), Python's tempfile.NamedTemporaryFile(), and similar create-empty-file API's for the user. I switched from the C fopen API to the C++ filesystem api to get around the fact that, to the best of my knowledge, C has no portable way to get the file size above LONG_MAX, with std::ftell() returning long? fallback to std::ifstream for c++ < 17 (the project is currently targeting C++11 it seems - file_exists() and file_size() can be removed when we upgrade to c++17) * formatting (requested in codereview) * remove c++17, file_is_empty	2024-01-30 11:18:02 +02:00
Wu Jian Ping	6685cc41c2	server : improve README (#5209 )	2024-01-30 11:11:46 +02:00
Wu Jian Ping	c82d18e863	server : embeddings compatibility for OpenAI (#5190 )	2024-01-29 15:48:10 +02:00
0cc4m	2307523d32	ggml : add Vulkan backend (#2059 ) * Vulkan loader code * Fix matmul kernel, continue implementation * Continue implementation * Vulkan memory management * Vulkan development * Matmul call * Add aligned malloc and free for VMA * Continue implementation * First matmul success * GEMM Kernel optimization * 1D Blocktiling * 2D Blocktiling * Write coalescing * Continue vulkan implementation and optimization * First FP16 attempt, disabled for now * Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel * Enable device extensions properly, restore fp16 matmul op * Fix mulmat_f16 * Output FP32 in fp16 matmul shader * Fix f16_to_f32 kernel * dequant_q4_0 kernel * Add VMA library * Avoid requesting dedicated memory, VMA can decide that by itself * Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly * add cmake commands * Add 2d write operation, profiling code * Fix 2d write * Fix queue selection for AMD RADV * Fix trailing whitespace in vk_mem_alloc.h * Add WIP warp tile mat mul shaders * Disable glslc optimization * Disable glslc optimization for CMake * Optimize warptile matmul shader, replace blocktile with it * Add split-k optimization for small matrix multiplication Use semaphores for synchronization instead of fences or waitidle Rework async write/read for synchronization * Fix validation errors, improve compatibility with AMD GPUs * Rework command buffer handling * Variable matmul kernel using specialization constants * Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints * Reuse semaphores * Handle stage flags during command buffer submission properly * Increase matmul test runs for consistent results * Fix F32 matmul * Add vectorized loading and zeropadding for matrix multiplication * Use pinned memory for f16 preprocessing * Don't force aligned matmul * Don't free before queue done * Replace VMA library with native Vulkan buffer management * Basic offloading support with mul_f32 and dmmv for q4_0 * Run glslc commands in parallel * Unroll loops in dmmv shader * Reduce usage of waitIdle * Reuse pinned allocation for f16 conversion * Handle devices with only a single queue * Fix trailing whitespace in CMakeLists.txt * Allow parallel execution of kernels, parallelize third and fourth dimension calls * Add fallback for devices only supporting one DescriptorSet per DescriptorPool * Move to graph function similar to CUDA implementation * Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function * Add F32 dmmv shaders * Batch submissions * Add .spv to gitignore * Split off matrix vector multiplication for separate optimization * Use single command buffer for matrix vector multiplication ops * Reduce overhead of mul_f32 calls by using a single command buffer * Add submission batching to mul_f32 * Fix tests * Add missing barrier * Add further missing barrier * Add further ops * Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions * Remove unnecessary cblas link * Fix descriptor set pre-allocation assert * Add runtime shader compilation, start transferring shaders to this approach * Transfer remaining shaders to header and compile on runtime * Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16 * Add support for q4_1, q5_0, q5_1 and q8_0 * Remove unnecessary scalar layout extension * Parse graph early to pre-record command buffers * Add q6_k support * Add multi-submit for command buffers * Fix q6_k dequant shader for AMD * Fix q6_k for GPUs without fp16 support * Simplify q6_k fp16 fix * Minor fixes * Fix wg_denom of m-mulmat shaders * Add Python-based Vulkan shader generator * Replace shaderc dependency with precompiled shaders Fix python script to generate shaders * Clean up code * Fix shader generator script Windows compatibility Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> * Close file before deletion * Fix vulkan shader fp32 name * Add q2_k and q3_k support Add validation check to compare shader results to cpu results * Add q4_k support * Add q5_k support * Bake SPIR-V bytecode into the library instead of loading shaders from file * Switch to signal semaphores for flexibility Prepare broadcasting support for mul mat * Finish broadcasting mul mat support for GQA * Clean up unused functions Add repeat op * Add further ops, not yet enabled. Improve semaphore code * Reduce number of used semaphores by utilizing timelines more properly * Remove queue information * Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations * Add Vulkan to llama-bench * Remove cblas dependency * Fix matmul k-split bug * Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader * Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug * Fix issues with float16 overflows in shaders * Fix issues with older Vulkan headers on Ubuntu 22.04 * Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers * Implement further ops, rework op_f32 calls, fix bugs * Finish full offloading support, add last remaining ops, fix bugs, remove redundant code * Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders * Merge upstream changes, fix conflicts, adapt soft_max op * Fix Python and shader header format * Free model gpu buffers on exit * Use single queue per device to simplify code * Add matmul shader support for running multiple calculations in parallel * Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible * Fix missing event cast * Replace uint64_t(-1) with UINT64_MAX, rename function for clarity * Fix warning about empty C function parameters * Fix compiler warnings * Properly implement Vulkan backend buffer handling * Fix oversized host staging buffers * Simplify barrier synchronization calls * Fix gcc warnings * Implement max_size for backend buffer types to limit the size of a single allocation * Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size * refactor multi buf * Disable unsupported ops to fix tests * Check for maintenance4 support before using it * Handle devices with only a single queue * Fix single queue logic * propagate buffer usage in multi buffers * Implement rope_neox op * Cleanup header and other files * Simplify gpu_extras by removing events and putting staging memcpys into contexts * Move queue into context Add not-yet-enabled async backend ops * Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization * Add get_max_size to SYCL backend. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : fix trailing whitespace --------- Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 19:03:59 +02:00
Abhilash Majumder	0f648573dd	ggml : add unified SYCL backend for Intel GPUs (#2690 ) * first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. * Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 17:56:23 +02:00
Kyle Mistele	39baaf55a1	docker : add server-first container images (#5157 ) * feat: add Dockerfiles for each platform that user ./server instead of ./main * feat: update .github/workflows/docker.yml to build server-first docker containers * doc: add information about running the server with Docker to README.md * doc: add information about running with docker to the server README * doc: update n-gpu-layers to show correct GPU usage * fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA	2024-01-28 09:55:31 +02:00
John	6db2b41a76	llava : support for Yi-VL and fix for mobileVLM (#5093 ) * Support for Yi-VL, templating fix for mobileVLM * ws * Update examples/llava/clip.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llava-cli.cpp * Update clip.cpp bugfix for new conversions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-27 17:09:18 +02:00
Georgi Gerganov	753eafed0e	sync : ggml	2024-01-27 17:00:24 +02:00
Michael Klimenko	35a2ee9143	Remove unused data and add fixes (#5154 ) * Remove unused data and add fixes * Add missing file * Address review comments * Replace the scope of vq allocation	2024-01-27 15:25:55 +01:00
Maximilian Winter	ec903c0341	server : add self-extend support (#5104 ) * Ported self extension to server example * Update server.cpp * Fixed prompt caching without self extend * Update server.cpp * Added description to server readme. * Update server.cpp * Update server.cpp * Update server.cpp * Update server.cpp * Update README.md * Changed descriptions * server : formatting * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update server.cpp * Update server.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-27 15:38:05 +02:00
Xuan Son Nguyen	48c857aa10	server : refactored the task processing logic (#5065 ) * server: add llama_server_queue struct * server: add llama_server_response_event * server: add comments * server: move all mutexes away from server.cpp * server: correct multitask response * server: only add back deferred tasks when one slot is available * server: fix a race condition cause by "request_completion"	2024-01-26 14:42:20 +02:00
Jared Van Bortel	d292f4f204	examples : make pydantic scripts pass mypy and support py3.8 (#5099 )	2024-01-25 14:51:24 -05:00
Valentin Konovalov	256d1bb0dd	android : use release cmake build type by default (#5123 )	2024-01-25 19:05:51 +02:00
Kawrakow	44879ee885	Additional KL-divergence statistics (#5081 ) * perplexity: add top-token probability * perplexity: add additional KL-divergence statistics * perplexity: a better organized KL-divergence statistics output --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-23 15:17:20 +02:00
Georgi Gerganov	89758723c7	minor : clean-up some warnings and style (#5094 ) * minor : clean-up some warnings and style ggml-ci * ggml : add comment	2024-01-23 14:12:57 +02:00
Michael Coppola	125d03a503	llama.vim : added api key support (#5090 ) Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-01-23 08:51:27 +02:00
Kawrakow	6f9939d119	KL-divergence (#5076 ) * kl-divergence: be able to save all logits to a file * Add ability to compute KL-divergence --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-22 16:10:14 +02:00
XiaotaoChen	3ce7e8f8e7	llava : MobileVLM support (#4954 ) * MobileVLM native implementation * delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake * move android script to example/llava directory * Fix the editor config checks --------- Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>	2024-01-22 15:09:35 +02:00
Kawrakow	15bceec2d7	imatrix : keep intermediate imatrix results (#5077 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-22 14:18:43 +02:00
Daniel Bevenius	152d9d05e0	finetune : print sample-start/include-sample-start (#5072 ) This commit adds `--sample-start` and `--include-sample-start` to the output from the main function in finetune.cpp. The motivation for this is that even though these are set explicitly by the user via the command line, if one forgets to set them then it is useful to have their values printed out. Otherwise it is possible to go through the whole training process before realizing that the values are not what one expected. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-22 13:11:01 +02:00
Kawrakow	66d575c45c	llama : add Q3_K_XS (#5060 ) * Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S * Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with a 800 MB smaller quantized model size. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-22 12:43:33 +02:00
Kawrakow	7dcbe39d36	Add ability to evauate multiple choice tasks (#5047 ) * TruthfulQA: 1st attempt, does not look like it is working The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right. * TruthfulQA: works but the result is bad I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess, the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with random chance result around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leader board results for these two models are 42.2% and 68.3%, respectively. * TruthfulQA: fix random sample * TruthfulQA: prepare tasks in parallel for large test datasets * Rename truthful_qa to multiple_choice * Make MSVC happy I had forgotten that MSVC does not make constexpr's available inside a lambda. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-21 14:42:44 +02:00
Kawrakow	726c0fa9a2	Slightly faster imatrix (#5050 ) * imatrix: speedup by avoiding unnecessary allocations and copies * imatrix: add --no-ppl option to skip PPL calculations altogether --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-21 08:01:20 +02:00
Jared Van Bortel	97c1549808	perplexity : fix MSVC build after #5020 (#5043 ) * perplexity : fix MSVC build after #5020 * try a differerent fix	2024-01-20 17:08:08 +02:00
Uzo Nweke	381ee19572	finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033 ) * Fix issue with alloc causing max_compute_size to be calculated * remove ggml_allocr_free as suggested in issue #4791	2024-01-19 20:20:50 +02:00
Georgi Gerganov	a5cacb22b2	imatrix : add README.md	2024-01-19 15:24:47 +02:00
Kawrakow	7051aacfac	winogrande: evaluate log-probs in parallel (#5036 ) This is a relatively minor performance tweak resulting in ~10% speedup on my system. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-19 11:39:11 +02:00
Kawrakow	993fba8180	perplexity: avoid unnecessary alloocations and logit copies (#5035 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-19 11:02:39 +02:00
Georgi Gerganov	8b20858e5e	perplexity : faster Winogrande via batching (#5024 ) * perplexity : faster Winogrande via batching ggml-ci * perplexity : remove unused function * perplexity : only tokenize selected tasks for Winogrande	2024-01-19 10:45:06 +02:00
Xuan Son Nguyen	821f0a271e	server : defer tasks when "slot unavailable" (#5018 ) * server: defer task when no slot is available * remove unnecessary log --------- Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>	2024-01-18 22:33:05 +02:00
Georgi Gerganov	2d5419d08a	imatrix : fix assert for src0 non-cont check	2024-01-18 21:45:51 +02:00
Georgi Gerganov	d391ae9b49	perplexity : fix winogrande N tasks option	2024-01-18 20:49:00 +02:00
Kawrakow	3e945cc1e9	HellaSwag: speed up by parallelizing log-prob evaluation (#5020 ) For Mistral-7B and fp16, time on my system goes down from 536 seconds to 423 seconds for the full evaluation dataset (10042 tasks). Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-18 19:18:21 +02:00
Georgi Gerganov	ad19812cda	perplexity : faster HellaSwag via batching (#5017 ) * perplexity : faster HellaSwag ggml-ci * perplexity : clean-up ggml-ci * perplexity : no need for decode_helper ggml-ci * perplexity : add comments * perplexity : option to specify max batched tasks via `n_parallel` * perplexity : remove HellaSwag restruction for n_batch	2024-01-18 15:33:01 +02:00
Kawrakow	682986a08e	Add Winogrande evaluation (#5015 ) * winogrande: simple implementation It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leader board. 1-sigma statistical uncertainty for 1267 tasks is ~1.4, so no way the difference is due to statistics. * winogrande: somewhat better Score for Mistrali7-B is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before. * winogrande: improving Mistral-7B score is now 73.56. Still not quite 78.4 but getting there. We are also getting a lower score on HellaSwag compared to HF leader board, so I'm not expecting we will get up to 78.4 anyway. It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood. This kind of makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing about the follow up context, and this will skew the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence. It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last. * winogrande: add dataset instructions --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-18 13:46:27 +02:00
Georgi Gerganov	ba69bbc84c	imatrix : offload to GPU support (#4957 ) * backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * imatrix : offload to GPU support * imatrix : fix ggml_mul_mat_id hanlding ggml-ci * ci : add imatrix test ggml-ci * ci : rearrange output ggml-ci	2024-01-17 18:46:30 +02:00
Daniel Bevenius	cec8a48470	finetune : add training data file to log message (#4979 ) This commit adds the name of the training data file to the log message printed when the training data is tokenized. The motivation for this change is that it can be useful to show which file is being tokenized when running the finetune example. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-16 19:54:24 +02:00
Maximilian Winter	4feb4b33ee	examples : add complete parallel function calling example (#4974 )	2024-01-16 19:41:42 +02:00
Georgi Gerganov	959ef0c0df	perplexity : fix kv cache handling for hellaswag (#4981 ) ggml-ci	2024-01-16 19:34:54 +02:00
Neuman Vong	862f5e41ab	android : introduce starter project example (#4926 ) * Introduce starter project for Android Based on examples/llama.swiftui. * Add github workflow * Set NDK version * Only build arm64-v8a in CI * Sync bench code * Rename CI prop to skip-armeabi-v7a * Remove unused tests	2024-01-16 15:47:34 +02:00
Maximilian Winter	122ed4840c	examples : fix and improv docs for the grammar generator (#4909 ) * Create pydantic-models-to-grammar.py * Added some comments for usage * Refactored Grammar Generator Added example and usage instruction. * Update pydantic_models_to_grammar.py * Update pydantic-models-to-grammar-examples.py * Renamed module and imported it. * Update pydantic-models-to-grammar.py * Renamed file and fixed grammar generator issue. * Fixed some issues and bugs of the grammar generator. Imporved Documentation * Update pydantic_models_to_grammar.py	2024-01-16 14:10:48 +02:00
Daniel Bevenius	d75c232e1d	finetune : use LLAMA_FILE_MAGIC_GGLA (#4961 ) This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-16 13:14:19 +02:00
stduhpf	e0324285a5	speculative : threading options (#4959 ) * speculative: expose draft threading * fix usage format * accept -td and -tbd args * speculative: revert default behavior when -td is unspecified * fix trailing whitespace	2024-01-16 13:04:32 +02:00
Kawrakow	467a882fd2	Add ability to use importance matrix for all k-quants (#4930 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 16:21:12 +02:00
Kawrakow	147b17ac94	2-bit quantizations (#4897 ) * imatrix: load * imatrix: WIP * imatrix: Add Q2_K quantization * imatrix: also guard against Q2_K_S quantization without importance matrix * imatrix: guard even more against low-bit quantization misuse --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 09:45:56 +02:00
Georgi Gerganov	4be5ef556d	metal : remove old API (#4919 ) ggml-ci	2024-01-13 20:45:45 +02:00
Georgi Gerganov	0ea069b87b	server : fix prompt caching with system prompt (#4914 )	2024-01-13 19:31:26 +02:00
David Friehs	df845cc982	llama : minimize size used for state save/load (#4820 ) * examples : save-load-state: save only required state * llama : only reserve n_vocab * n_batch at most for logits llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is expected to be bigger than n_batch. * llama : always reserve n_vocab * n_batch for logits llama_context de-serialization breaks if the contexts have differing capacity for logits and llama_decode will at maximum resize to n_vocab * n_batch. * llama : only save and restore used logits for batch sizes of 512 this reduces save state in the best case by around 62 MB, which can be a lot if planning to save on each message to allow regenerating messages. * llama : use ostringstream and istringstream for save and load * llama : serialize rng into minimum amount of space required * llama : break session version due to serialization changes	2024-01-13 18:29:43 +02:00
Yann Follet	722d33f34e	main : add parameter --no-display-prompt (#4541 ) * add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens * remove empty line --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-13 18:09:08 +02:00
Ziad Ben Hadj-Alouane	356327feb3	server : fix deadlock that occurs in multi-prompt scenarios (#4905 ) * * fix deadlock * * dont ruint all whitespace	2024-01-13 16:20:46 +02:00
makomk	ee8243adaa	server : fix crash with multimodal models without BOS token (#4904 )	2024-01-13 16:16:11 +02:00
Maximilian Winter	52ee4540c0	examples : add pydantic models to GBNF grammar generator (#4883 ) * Create pydantic-models-to-grammar.py * Added some comments for usage * Refactored Grammar Generator Added example and usage instruction. * Update pydantic_models_to_grammar.py * Update pydantic-models-to-grammar-examples.py * Renamed module and imported it. * Update pydantic-models-to-grammar.py * Renamed file and fixed grammar generator issue.	2024-01-12 21:46:45 +02:00
slaren	e7e4df031b	llama : ggml-backend integration (#4766 ) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Daniel Bevenius	930f907d3e	export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894 ) This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-12 19:54:53 +02:00
Zay	e790eef21c	llama.swiftui : update models layout (#4826 ) * Updated Models Layout - Added a models drawer - Added downloading directly from Hugging Face - Load custom models from local folder - Delete models by swiping left * trimmed trailing white space * Updated Models Layout	2024-01-12 14:48:00 +02:00
Kawrakow	326b418b59	Importance Matrix calculation (#4861 ) * imatrix: 1st version * imatrix: WIP * Cleanup * Update examples/imatrix/imatrix.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-12 06:59:57 +01:00
Georgi Gerganov	1d118386fe	server : fix infill when prompt is empty (#4833 )	2024-01-11 23:23:49 +02:00
Georgi Gerganov	7edefbd79c	main : better name for variable n_print (#4874 )	2024-01-11 22:46:26 +02:00
Georgi Gerganov	3ca63b4538	main : disable token count by default (#4874 )	2024-01-11 22:43:05 +02:00
Kawrakow	469e75d0a3	llama : restore intended k-quants mixes for MoE models (#4872 ) * Restore intended k-quants quantization mixes for MoE models * Update Q2_K_S values in the quantize tool Still using LLaMA-v1 PPL values in the quant description today does not make much sense. But let's leave this update for another PR. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-11 21:43:15 +02:00
Laura	4330bd83fe	server : implement credentialed CORS (#4514 ) * Implement credentialed CORS according to MDN * Fix syntax error * Move validate_api_key up so it is defined before its first usage	2024-01-11 20:02:48 +02:00
Michael Coppola	27379455c3	server : support for multiple api keys (#4864 ) * server: added support for multiple api keys, added loading api keys from file * minor: fix whitespace * added file error handling to --api-key-file, changed code to better reflect current style * server: update README.md for --api-key-file --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-01-11 19:51:17 +02:00
Behnam M	eab6795006	server : add `LOG_INFO` when model is successfully loaded (#4881 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too * used LOG_INFO after successful model loading	2024-01-11 19:41:39 +02:00
pudepiedj	43f76bf1c3	main : print total token count and tokens consumed so far (#4874 ) * Token count changes * Add show token count * Updating before PR * Two requested changes * Move param def posn	2024-01-11 18:14:52 +02:00
Isaac McFadyen	2f043328e3	server : fix typo in model name (#4876 )	2024-01-11 16:33:26 +02:00
Behnam M	7a9f75c38b	server : update readme to document the new `/health` endpoint (#4866 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too	2024-01-11 09:12:05 +02:00
Georgi Gerganov	5c1980d8d4	server : fix build + rename enums (#4870 )	2024-01-11 09:10:34 +02:00
Behnam M	cd108e641d	server : add a `/health` endpoint (#4860 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line	2024-01-10 21:56:05 +02:00
John	d34633d8db	clip : support more quantization types (#4846 ) Uses ggml functions instead of hardcoded names and adds support to quantize into the modern Q-K variants. This is just the bare minimum to get k-types working - a more refined choice of types would be needed to get best quality on low quantizations. I ran a few tests, it doesn't break anything I could notice and a Q6_K ViT works almost as well as Q8_0 but 3 times the inference speed.	2024-01-10 15:37:09 +02:00

1 2 3 4 5 ...

741 Commits