Commit Graph

1667 Commits

Georgi Gerganov
e1241d9b46
metal : switch to execution barriers + fix one of the barriers 2023-12-13 13:56:45 +02:00
Georgi Gerganov
109e7aa8ac
metal : limit kernels to not use more than the allowed threads 2023-12-13 10:57:25 +02:00
Georgi Gerganov
ab558ac2b3
metal : fix soft_max kernels
ref: 1914017863
2023-12-13 10:57:25 +02:00
Radek Pilar
82e4f64578
convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy
2023-12-12 21:04:10 +02:00
Georgi Gerganov
90c12e6b3c
ggml : do not use BLAS with ggml_mul_mat_id 2023-12-12 20:05:58 +02:00
Georgi Gerganov
ea4402bb0e
test-backend-ops : add one more sum_rows test 2023-12-12 17:03:38 +02:00
Georgi Gerganov
a51bc0c1c0
metal : fix binary ops for ne10 % 4 != 0 2023-12-12 15:55:42 +02:00
Georgi Gerganov
08eb99179a
metal : add cpy f16 -> f32 kernel 2023-12-12 14:15:08 +02:00
slaren
a742d9f9b7
gguf-py : bump version 2023-12-12 12:46:33 +01:00
Georgi Gerganov
6a419f4d19
convert : support safetensors format 2023-12-12 13:05:14 +02:00
slaren
f1cbfabd64
convert : fix style 2023-12-11 20:02:55 +01:00
slaren
7dc75e3923
convert : use 1e6 rope_freq_base for mixtral 2023-12-11 20:00:28 +01:00
slaren
296c945de5
cuda : fix mul_mat_id with multi gpu 2023-12-11 16:53:25 +01:00
slaren
33e50f1b53
test-backend-ops : disable MOE test with thread sanitizer 2023-12-11 12:27:48 +01:00
slaren
ffda94c87f
test-backend-ops : simplify and disable slow tests to avoid CI timeout 2023-12-11 12:15:31 +01:00
Georgi Gerganov
8cbaed1d9a
llama : fix hard-coded number of experts 2023-12-11 08:55:27 +02:00
slaren
b0029815e4
test-backend-ops : fix dequantize block offset 2023-12-11 02:43:52 +01:00
slaren
f1380d7897
test-backend-ops : add cpy from f32 -> all types test 2023-12-10 22:58:31 +01:00
slaren
54d254bbed
test-backend-ops : cleanup, add moe test for batches 2023-12-10 21:52:11 +01:00
Georgi Gerganov
54ba263410
test-backend-ops : make experts more evenly probable (test_moe) 2023-12-10 15:28:07 +02:00
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32 2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly 2023-12-10 14:18:14 +02:00
slaren
8614aa736d
cuda : fix get_rows when ncols is odd 2023-12-10 13:12:18 +01:00
slaren
cefebb3660
test-backend-ops : add moe test 2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants 2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors 2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types 2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1 2023-12-10 09:38:21 +02:00
slaren
0710b0f726
llama : offload missing ffn_moe_silu 2023-12-09 23:29:47 +01:00
slaren
62b95f93d0
cuda : support non-contiguous src1 in get_rows 2023-12-09 22:39:34 +01:00
slaren
2e4db48291
ggml : update get_rows f16 and q 2023-12-09 22:38:22 +01:00
slaren
ac3f7d8e23
ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D 2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id 2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont 2023-12-09 14:25:49 +02:00
slaren
06dfde3e94
llama : add basic support for offloading moe with CUDA 2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests 2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11 2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa
ggml : add n_as argument to ggml_mul_mat_id 2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) 2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN 2023-12-09 13:01:42 +02:00
Georgi Gerganov
7ea36953ba
llama : first working version 2023-12-09 12:45:15 +02:00
Georgi Gerganov
af1a096bf8
llama : fix cur -> cur_expert 2023-12-09 12:07:39 +02:00
Georgi Gerganov
aedfad120a
llama : update graph to support MoE 2023-12-09 11:47:40 +02:00
Georgi Gerganov
861cd67899
ggml : sync latest ggml_mul_mat_id 2023-12-09 11:19:46 +02:00
Georgi Gerganov
a3eefe95a8
llama : model loading 2023-12-09 11:14:03 +02:00
Georgi Gerganov
d38e41ee69
convert : fix n_ff typo 2023-12-09 10:59:37 +02:00
Georgi Gerganov
dff8cbeb39
convert : support Mixtral as LLAMA arch 2023-12-09 10:51:58 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)

* sync : ggml (part 2, CUDA)

* sync : ggml (part 3, Metal)

* ggml : build fixes

ggml-ci

* cuda : restore lost changes

* cuda : restore lost changes (StableLM rope)

* cmake : enable separable compilation for CUDA

ggml-ci

* ggml-cuda : remove device side dequantize

* Revert "cmake : enable separable compilation for CUDA"

This reverts commit 09e35d04b1.

* cuda : remove assert for rope

* tests : add test-backend-ops

* ggml : fix bug in ggml_concat

* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`

* ci : try to fix macOS

* ggml-backend : remove backend self-registration

* ci : disable Metal for macOS cmake build

ggml-ci

* metal : fix "supports family" call

* metal : fix assert

* metal : print resource path

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
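
As a rough sketch of the API change described in the commit above (the llama_context_params fields type_k/type_v and their use are inferred from the commit titles, not verified against llama.h at this revision), a quantized "quantum" K cache could be requested when creating a context:

    #include "llama.h"

    // Sketch only: assumes llama_context_params gained ggml_type fields
    // type_k/type_v in #4309, as the commit titles suggest.
    static llama_context * make_ctx_with_q8_k_cache(llama_model * model) {
        llama_context_params cparams = llama_context_default_params();
        cparams.type_k = GGML_TYPE_Q8_0; // quantized ("quantum") K cache
        cparams.type_v = GGML_TYPE_F16;  // keep the V cache in f16
        return llama_new_context_with_model(model, cparams);
    }

Per the same commit titles, the equivalent choice is also exposed on the command line through llama-bench's type_k/type_v options.
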
Hongyu Ouyang
81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4), xaedes added:

    ggml_allocr * alloc = NULL;

    ... (many lines in between)

    if (alloc) {
        ggml_allocr_free(alloc);
    }

This is correct, but it's easy to lose track of the allocation across the many lines in between.

On commit b1287 (0e76a899), xaedes made a big change: from here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never reset to NULL, so many lines below
we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double-free.
2023-12-07 12:25:22 +02:00
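
For reference, a minimal sketch of the usual remedy for the pattern described in the commit above (illustrative only, not the actual patch for #4351): reset the pointer after each eager free, so the trailing guarded free can never run on a stale pointer.

    #include "ggml-alloc.h"

    // Illustrative sketch, not the actual fix in #4351: NULL the pointer
    // after each eager free so the guarded cleanup below cannot double-free.
    static void allocr_lifetime_example() {
        ggml_allocr * alloc = NULL;

        alloc = ggml_allocr_new_measure(/*alignment =*/ 32);
        // ... plan allocations with alloc ...
        ggml_allocr_free(alloc);
        alloc = NULL; // without this reset, the guard below frees alloc again

        // ... many lines later, the original cleanup path ...
        if (alloc) {
            ggml_allocr_free(alloc);
        }
    }

Equivalently, the trailing conditional free can simply be removed once every allocator is freed eagerly.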