mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-21 00:59:23 +01:00
Commit Graph

2287 Commits

Author SHA1 Message Date
Iwan Kawrakow
f0cbb6ddf6 iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work) 2024-02-28 08:28:10 +02:00
Iwan Kawrakow
47d52b2b24 Q2_K: fixed bug in imatrix quantization for QK_K = 64 2024-02-28 08:15:52 +02:00
Iwan Kawrakow
2540a290ed Make CUDA compile with QK_K = 64
Tests don't pass, plus we get misaligned access
2024-02-27 21:35:11 +02:00
Iwan Kawrakow
de64e061da QK_K = 64 tests pass on ARM_NEON and Metal
Sadly, that does not mean it actually works.
2024-02-27 20:12:54 +02:00
Iwan Kawrakow
28e6146c11 iq2_xs: attempt to fix AVX dot product for QK_K = 64
Tests pass, but I get gibberish.
2024-02-27 18:41:31 +02:00
Iwan Kawrakow
13ba37f1aa WIP: make i-quants work for QK_K = 64 2024-02-27 17:30:11 +02:00
Kawrakow
0becb22ac0
IQ4_XS: a 4.25 bpw quantization ()
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-27 16:34:24 +02:00
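The 4.25 bpw for IQ4_XS follows from the layout described in the bullets above: super-blocks of 256 weights split into eight blocks of 32, each block with a 6-bit scale. A minimal accounting sketch (field names are illustrative; see ggml's block_iq4_xs for the real definition):

```cpp
#include <cstdint>

#define QK_K 256 // super-block size

// Sketch of the IQ4_XS super-block layout.
typedef struct {
    uint16_t d;                  // fp16 super-block scale            ->   16 bits
    uint16_t scales_h;           // high 2 bits of eight 6-bit scales ->   16 bits
    uint8_t  scales_l[QK_K/64];  // low 4 bits of eight 6-bit scales  ->   32 bits
    uint8_t  qs[QK_K/2];         // 256 weights at 4 bits each        -> 1024 bits
} block_iq4_xs;                  // (16 + 16 + 32 + 1024) / 256 = 4.25 bpw
```

Since the eight blocks share the fp16 super-block scale, the per-block 6-bit scales cost only 48/256 = 0.1875 bpw on top of the 4-bit quants.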
Engininja2
c24a2a6e60
cuda : replace remaining shfl_xor with calls to warp_reduce functions () 2024-02-27 14:22:45 +01:00
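For context, the warp_reduce helpers that replace the raw shuffles wrap the classic XOR-butterfly reduction. A minimal CUDA sketch of a warp-wide sum in that style (this mirrors the common pattern, not necessarily ggml-cuda's exact code):

```cpp
// Each of the 32 lanes ends up holding the sum of all 32 inputs:
// XOR-ing the lane id with 16, 8, 4, 2, 1 pairs every lane with a partner.
static __device__ __forceinline__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset, 32);
    }
    return x;
}
```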
Engininja2
1f30b7a9f1
ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc () 2024-02-27 14:50:18 +02:00
Georgi Gerganov
9d533a77d0
llama : fix defrag bugs + add parameter ()
* llama : fix defrag bugs + enable by default

ggml-ci

* llama : add defrag_thold parameter

ggml-ci

* llama : cont

* llama : disable log message

ggml-ci

* llama : fix graph size check during defrag
2024-02-27 14:35:51 +02:00
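A minimal sketch of how the new parameter is meant to be used, assuming the defrag_thold field this PR adds to llama_context_params:

```cpp
#include "llama.h"

// given a previously loaded llama_model * model
llama_context_params cparams = llama_context_default_params();
// defragment the KV cache once holes exceed ~10% of its size;
// a negative value disables automatic defragmentation
cparams.defrag_thold = 0.1f;
llama_context * ctx = llama_new_context_with_model(model, cparams);
```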
le.chang
cbbd1efa06
Makefile: use variables for cublas ()
* make: use arch variable for cublas

* fix UNAME_M

* check opt first

---------

Co-authored-by: lindeer <le.chang118@gmail.com>
2024-02-27 03:03:06 +01:00
Xuan Son Nguyen
b11a93df41
fix server hangs on empty prompt () 2024-02-26 23:15:48 +01:00
Kawrakow
a33e6a0d2a
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range ()
* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-26 18:28:38 +02:00
Johannes Gäßler
47bb7b48c7
CUDA: fix DEBUG_CUDA_MALLOC () 2024-02-26 15:36:38 +01:00
Artem
c4d7f81786
readme : update ui list ()
* Add LLMFarm (ui for iOS) to list
2024-02-26 16:15:28 +02:00
AidanBeltonS
e849078c6e
[SYCL] Add support for soft_max ALiBi ()
* Add support for bias

* Update pre-processor

* rm commented code

* fix format

* fix CI

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-26 19:32:11 +05:30
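For reference, the ALiBi bias that this soft_max variant folds in is the standard head-dependent linear penalty (a hedged summary of the ALiBi formulation, not SYCL-specific):

$$s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} + m_h\,(j - i), \qquad m_h = 2^{-8h/H}, \quad h = 1,\dots,H$$

with the softmax taken over keys $j \le i$, so more distant keys receive a larger negative bias.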
Georgi Gerganov
67fd33132f
unicode : reuse iterator () 2024-02-26 14:02:12 +02:00
Pierrick Hymbert
4804215cb8
server: CI fix trailing space () 2024-02-26 12:41:34 +02:00
Pierrick Hymbert
8a533f0d90
server: CI tests reduce build matrix () 2024-02-26 09:56:10 +01:00
Georgi Gerganov
269de86ba0
llama : fix Gemma rope type () 2024-02-26 08:30:17 +02:00
github-actions[bot]
c393733988 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
  → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
2024-02-25 22:24:22 +00:00
Pierrick Hymbert
e3965cf35a
server: tests - slow inference causes timeout on the CI ()
* server: tests - longer inference timeout for CI
2024-02-25 22:48:33 +01:00
Pierrick Hymbert
8b350356b2
server: docs - refresh and tease the HTTP server a little bit more ()
* server: docs - refresh and tease the HTTP server a little bit more

* Rephrase README.md server doc

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25 21:46:29 +01:00
Georgi Gerganov
bf08e00643
llama : refactor k-shift implementation + KV defragmentation ()
* llama : refactor k-shift implementation

ggml-ci

* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add

* llama : cont k-shift refactoring + normalize type names

ggml-ci

* minor : fix MPI builds

* llama : reuse n_rot from the build context

ggml-ci

* llama : revert enum name changes from this PR

ggml-ci

* llama : update llama_rope_type

* llama : add comment about rope values

* llama : fix build

* passkey : apply kv cache updates explicitly

ggml-ci

* llama : change name to llama_kv_cache_update()

* llama : add llama_kv_cache_seq_pos_max()

* passkey : fix llama_kv_cache_seq_pos_max() usage

* llama : some llama_kv_cell simplifications

* llama : add llama_kv_cache_compress (EXPERIMENTAL)

* llama : add alternative KV cache merging (EXPERIMENTAL)

* llama : add llama_kv_cache_defrag

* llama : comments

* llama : remove llama_kv_cache_compress

will add in a separate PR

ggml-ci

* llama : defragment via non-overlapping moves

* llama : ggml_graph based defrag implementation

ggml-ci

* llama : switch the loop order in build_defrag

* llama : add comments
2024-02-25 22:12:24 +02:00
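A sketch of the refactored API in action, as used for context shifting (function names per llama.h after this PR; the surrounding variables are illustrative):

```cpp
// Drop the oldest n_discard tokens and shift the remainder left, then let
// the backend apply the pending K-shift (and defrag, if scheduled).
const int n_discard = (n_past - n_keep) / 2;
llama_kv_cache_seq_rm (ctx, 0, n_keep, n_keep + n_discard);
llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);
llama_kv_cache_update(ctx);
n_past -= n_discard;
```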
compilade
f7625019c5
server : fix crash when system prompt is bigger than batch size ()
The system prompt is now decoded in batches.

* server : fix off-by-one n_past when start of prompt matches whole cache

The tokens right after the matching part would otherwise skip a pos value.
2024-02-25 20:43:50 +02:00
Radosław Gryta
abbabc5e51
ggml-quants : provide ggml_vqtbl1q_u8 for 64-bit compatibility ()
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64-bit compatibility

vqtbl1q_u8 is not part of the ARMv7 NEON library

* [android-example] Remove abi filter after arm v7a fix

* [github-workflows] Do not skip Android armeabi-v7a build
2024-02-25 20:43:00 +02:00
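On AArch64, vqtbl1q_u8 does a 16-byte table lookup in one instruction; ARMv7 NEON only has the 8-byte vtbl variants, so a compatibility shim has to split the lookup. A minimal sketch of such a fallback (ggml's actual ggml_vqtbl1q_u8 may differ):

```c
#include <arm_neon.h>

#if !defined(__aarch64__)
// Emulate AArch64 vqtbl1q_u8 on ARMv7: vtbl2_u8 indexes a 16-byte table
// (two d-registers) with 8 indices at a time; out-of-range indices yield 0,
// matching vqtbl1q_u8 semantics.
static inline uint8x16_t ggml_vqtbl1q_u8(uint8x16_t t, uint8x16_t idx) {
    uint8x8x2_t tab = { { vget_low_u8(t), vget_high_u8(t) } };
    return vcombine_u8(vtbl2_u8(tab, vget_low_u8(idx)),
                       vtbl2_u8(tab, vget_high_u8(idx)));
}
#endif
```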
kwin1412
f1a98c5254
make : fix handling of empty nvcc version ()
fix the case where the nvcc version is empty
2024-02-25 18:46:49 +02:00
Ashok Gelal
7d548a1827
readme : add Msty to UI list () 2024-02-25 17:57:34 +02:00
Pierrick Hymbert
930b178026
server: logs - unified format and --log-format option ()
* server: logs - always use JSON logger, add thread_id in message, log task_id and slot_id

* server : skip GH copilot requests from logging

* server : change message format of server_log()

* server : no need to repeat log in comment

* server : log style consistency

* server : fix compile warning

* server : fix tests regex patterns on M2 Ultra

* server: logs: PR feedback on log level

* server: logs: allow to choose log format in json or plain text

* server: tests: output server logs in text

* server: logs: switch init logs to the server logs macro

* server: logs: ensure json values do not raise errors

* server: logs: shorten level VERBOSE to VERB (max 4 chars)

* server: logs: lower case like other log messages

* server: logs: avoid static in general

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25 13:50:32 +01:00
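A hypothetical line in the resulting plain-text format, purely to illustrate the `LEVEL [function_name] message | additional=data` shape (all values made up):

```
INFO [launch_slot_with_data] slot is processing task | slot_id=0 task_id=42
```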
Pierrick Hymbert
d52d7819b8
server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint ()
* server: monitoring - add /metrics prometheus compatible endpoint

* server: fix concurrency issue: when 2 tasks are waiting for results, only one caller thread is notified

* server: metrics - move to a dedicated struct
2024-02-25 13:49:43 +01:00
Radosław Gryta
1289408817
cmake : fix compilation for Android armeabi-v7a () 2024-02-25 12:53:11 +02:00
Georgi Gerganov
ab336a9d5e
code : normalize enum names ()
* code : normalize enum names

ggml-ci

* code : cont

* code : cont
2024-02-25 12:09:09 +02:00
Anas Ahouzi
69917dfa55
py : fix StableLM conversion after config.json changes ()
* Fix issues during StableLM model conversion

* Fix hard coded layer_norm_eps

* Support layer_norm_eps for LlavaStableLM

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Add missing parenthesis

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Support rotary_factor for LlavaStableLM

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* fix typo

* Add StableLMEpochForCausalLM for safety

Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>

* Add StableLMEpochForCausalLM for safety 2

Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
2024-02-25 11:54:04 +02:00
Pierrick Hymbert
9e359a4f47
server: continue to update other slots on concurrent embedding requests ()
* server: continue to update other slots on concurrent embedding requests.

* server: tests: add multi users embeddings as fixed

* server: tests: adding OAI compatible embedding concurrent endpoint

* server: tests: adding OAI compatible embedding with multiple inputs
2024-02-24 19:16:04 +01:00
Kawrakow
4c4cb30736
IQ3_S: a much better alternative to Q3_K ()
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasant surprise: the code was super-block-size independent,
so all it took was deleting some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-24 16:23:52 +02:00
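The 3.4375 bpw variant mentioned above is consistent with this per-super-block accounting (a sketch assuming ggml's block_iq3_s layout: fp16 scale, quants, high bits, signs, and block scales at 2 + 64 + 8 + 32 + 4 bytes per 256 weights):

$$\frac{(2 + 64 + 8 + 32 + 4) \times 8\ \text{bits}}{256\ \text{weights}} = \frac{880}{256} = 3.4375\ \text{bpw}$$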
Pierrick Hymbert
525213d2f5
server: init functional tests ()
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI-compatible chat completion requests with and without streaming
 - multi-user completion scenario
 - multi-user scenario on the OAI-compatible endpoint with streaming
 - multi-user scenario where the total number of tokens to predict exceeds the KV cache size
 - server wrong-usage scenario, like in the "infinite loop of context shift" issue
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi-user embedding endpoint: segmentation fault
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and API key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-24 12:28:55 +01:00
AlpinDale
fd43d66f46
server : add KV cache quantization options () 2024-02-23 21:31:54 +02:00
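A sketch of what those server options map to on the C API side (assuming the type_k/type_v fields in llama_context_params; which V-cache types are supported depends on the backend):

```cpp
#include "llama.h"

llama_context_params cparams = llama_context_default_params();
cparams.type_k = GGML_TYPE_Q8_0; // quantize the K cache to 8 bits
cparams.type_v = GGML_TYPE_Q8_0; // quantize the V cache to 8 bits
```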
Jared Van Bortel
54fbcd2ce6
convert : fix missing ftype for gemma () 2024-02-23 20:39:14 +02:00
Jared Van Bortel
15499eb942
mpt : do not duplicate token_embd.weight on disk () 2024-02-22 17:05:23 -05:00
Georgi Gerganov
96633eeca1
gemma : use more bits for the token_embd.weight tensor ()
* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type
2024-02-22 23:23:46 +02:00
Georgi Gerganov
847eedbdb2
py : add Gemma conversion from HF models ()
* py : add gemma conversion from HF models

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update convert-hf-to-gguf.py

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-22 23:22:48 +02:00
Georgi Gerganov
7e4f339c40
ggml : always define ggml_fp16_t as uint16_t ()
* ggml : always define ggml_fp16_t as uint16_t

ggml-ci

* ggml : cont

ggml-ci

* ggml : cont

* ggml : cont

ggml-ci

* ggml : cont

ggml-ci

* cuda : no longer ggml headers last

ggml-ci

* ggml : fix q6_K FP16 -> FP32 conversion

ggml-ci

* ggml : more FP16 -> FP32 conversion fixes

ggml-ci
2024-02-22 23:21:39 +02:00
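Defining ggml_fp16_t as a plain uint16_t makes every FP16 -> FP32 conversion explicit, which is what the fixes above are about. For reference, a standard bit-exact conversion (ggml itself uses lookup tables or hardware intrinsics where available; this sketch assumes IEEE 754 binary16):

```cpp
#include <cstdint>
#include <cstring>

typedef uint16_t ggml_fp16_t; // storage-only: no accidental implicit arithmetic

static float fp16_to_fp32(ggml_fp16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                           // +/- zero
        } else {
            exp = 113;                             // 127 - 15 + 1
            while ((mant & 0x400) == 0) {          // normalize the subnormal
                mant <<= 1;
                exp--;
            }
            bits = sign | (exp << 23) | ((mant & 0x3ff) << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7f800000 | (mant << 13);   // inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13); // 112 = 127 - 15
    }
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```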
Georgi Gerganov
334f76fa38
sync : ggml 2024-02-22 23:21:05 +02:00
Georgi Gerganov
efd56b1c21
ggml : 32-bit arm compat (whisper/1891)
* ggml : 32-bit arm compat

* ggml : add ggml_vqtbl1q_s8 impl

* ggml : cont
2024-02-22 23:20:50 +02:00
Someone
201294ae17
nix: init singularity and docker images ()
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.

Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
2024-02-22 11:44:10 -08:00
Georgi Gerganov
5a9e2f60ba
py : minor fixes () 2024-02-22 20:13:25 +02:00
Xuan Son Nguyen
373ee3fbba
Add Gemma chat template ()
* add gemma chat template

* gemma: only apply system_prompt on non-model message
2024-02-22 19:10:21 +01:00
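For reference, a single user turn under the Gemma template renders roughly like this (a sketch based on the upstream Gemma format; exact whitespace is an assumption):

```cpp
// What applying the gemma template to one user message looks like; the
// trailing "<start_of_turn>model\n" is the assistant prompt appended when
// the template is asked to add the assistant prefix.
const char * gemma_example =
    "<start_of_turn>user\n"
    "Why is the sky blue?<end_of_turn>\n"
    "<start_of_turn>model\n";
```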
Someone
4cb4d8b22d
workflows: nix: hardcode cachix ids, build unconditionally ()
GitHub does not expose environment and repository variables to PRs coming from forks, which means the Nix CI actions have effectively been disabled for most PRs.

The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for untrusted code.
2024-02-22 08:32:09 -08:00
Georgi Gerganov
3a03541ced
minor : fix trailing whitespace () 2024-02-22 13:54:03 +02:00
Georgi Gerganov
56d03d92be
readme : update hot topics 2024-02-22 10:35:54 +02:00