Commit Graph

1647 Commits

Author SHA1 Message Date
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32 2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly 2023-12-10 14:18:14 +02:00
slaren
8614aa736d cuda : fix get_rows when ncols is odd 2023-12-10 13:12:18 +01:00
slaren
cefebb3660 test-backend-ops : add moe test 2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants 2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors 2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types 2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1 2023-12-10 09:38:21 +02:00
slaren
0710b0f726 llama : offload missing ffn_moe_silu 2023-12-09 23:29:47 +01:00
slaren
62b95f93d0 cuda : support non-contiguous src1 in get_rows 2023-12-09 22:39:34 +01:00
slaren
2e4db48291 ggml : update get_rows f16 and q 2023-12-09 22:38:22 +01:00
slaren
ac3f7d8e23 ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D 2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id 2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont 2023-12-09 14:25:49 +02:00
slaren
06dfde3e94 llama : add basic support for offloading moe with CUDA 2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests 2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11 2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa ggml : add n_as argument to ggml_mul_mat_id 2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) 2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN 2023-12-09 13:01:42 +02:00
Georgi Gerganov
7ea36953ba
llama : first working version 2023-12-09 12:45:15 +02:00
Georgi Gerganov
af1a096bf8
llama : fix cur -> cur_expert 2023-12-09 12:07:39 +02:00
Georgi Gerganov
aedfad120a
llama : update graph to support MoE 2023-12-09 11:47:40 +02:00
Georgi Gerganov
861cd67899
ggml : sync latest ggml_mul_mat_id 2023-12-09 11:19:46 +02:00
Georgi Gerganov
a3eefe95a8
llama : model loading 2023-12-09 11:14:03 +02:00
Georgi Gerganov
d38e41ee69
convert : fix n_ff typo 2023-12-09 10:59:37 +02:00
Georgi Gerganov
dff8cbeb39
convert : support Mixtral as LLAMA arch 2023-12-09 10:51:58 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)

* sync : ggml (part 2, CUDA)

* sync : ggml (part 3, Metal)

* ggml : build fixes

ggml-ci

* cuda : restore lost changes

* cuda : restore lost changes (StableLM rope)

* cmake : enable separable compilation for CUDA

ggml-ci

* ggml-cuda : remove device side dequantize

* Revert "cmake : enable separable compilation for CUDA"

This reverts commit 09e35d04b1.

* cuda : remove assert for rope

* tests : add test-backend-ops

* ggml : fix bug in ggml_concat

* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`

* ci : try to fix macOS

* ggml-backend : remove backend self-registration

* ci : disable Metal for macOS cmake build

ggml-ci

* metal : fix "supports family" call

* metal : fix assert

* metal : print resource path

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API (see the sketch after this entry)

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
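The "pass KV cache type through API" item in the PR above implies new context parameters for choosing the K/V cache types. Below is a minimal, hedged C sketch of how a caller might request a quantized ("quantum") K cache; the type_k/type_v field names are inferred from the "llama-bench : support type_k/type_v" bullet and the default V type is an assumption, neither is confirmed by this log.

    // Hedged sketch: selecting the K cache type through the public API after this PR.
    // Assumes llama_context_params gained type_k/type_v fields; names/defaults are
    // inferred from the PR bullets above, not confirmed here.
    #include "llama.h"

    int main(void) {
        struct llama_context_params cparams = llama_context_default_params();
        cparams.type_k = GGML_TYPE_Q8_0;  // quantized K cache, served by the
                                          // F32 -> Q8_0 copy kernels added above
        // cparams.type_v is left at its default (presumably F16) in this sketch
        (void) cparams;  // a real caller would pass cparams to llama_new_context_with_model()
        return 0;
    }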
Hongyu Ouyang
81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4), xaedes added:

    ggml_allocr * alloc = NULL;

    ... (many lines in between)

    if (alloc) {
        ggml_allocr_free(alloc);
    }

This is correct, but it is easy to lose track of that guard after the many lines in between.

On commit b1287 (0e76a899), xaedes made a big change: from here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never reset to NULL, so many lines below we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which frees the allocator a second time, causing the double-free (see the sketch after this entry).
2023-12-07 12:25:22 +02:00
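A minimal, self-contained sketch of the failure mode described in the commit above and the usual remedy: reset the pointer after the eager free. The ggml_allocr_* calls match the allocator API quoted in the commit message; the wrapper function and the alignment value are illustrative, not the actual train-text-from-scratch code.

    // Sketch of the double-free described above (illustrative, not the real code).
    // ggml-alloc.h provides ggml_allocr_new_measure() and ggml_allocr_free().
    #include "ggml-alloc.h"
    #include <stddef.h>

    static void allocr_lifetime_sketch(void) {
        struct ggml_allocr * alloc = NULL;

        alloc = ggml_allocr_new_measure(/*alignment =*/ 32);
        // ... use the allocator for a short stretch of code ...
        ggml_allocr_free(alloc);
        alloc = NULL;  // the fix: reset the pointer after the eager free

        // ... many lines later, the original guarded cleanup is still reached ...
        if (alloc) {
            ggml_allocr_free(alloc);  // without the reset above, this frees alloc a second time
        }
    }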
Georgi Gerganov
05cd6e5036
server : recognize cache_prompt parameter in OAI API (#4347) 2023-12-06 20:21:59 +02:00
Georgi Gerganov
caa9249217
common : fix compile warning 2023-12-06 10:41:03 +02:00
stduhpf
da5eaef1f3
speculative : support --color (#4343)
* speculative: add some colors

* minor : add braces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-06 10:08:17 +02:00
Marcus Dunn
5f6e0c0dff
grammar : pre-computed pieces + reserve mem + less string copies (#4330)
* reserve space for codepoints

* improvement for the appended 0

* used precomputed token text for grammar sample

* reserve candidates_decoded

* reserve candidates_grammar

* remove candidates_decoded

* Revert "remove candidates_decoded"

This reverts commit 3773328080.

* changed decode_utf8 to take src by ref
2023-12-05 22:55:12 +02:00
Kerfuffle
5aa365d88f
llama : allow overriding GGUF metadata when loading model (#4092)
* feat: Allow overriding GGUF metadata when loading model

* Fix the one time GCC is stricter than clang about something

* Step1

* Refactor... basically everything!

* Nuke obsolete GetArrayLen struct

* simplify std::string specialization

* Various cleanups

Add informational output when overrides are applied

Warn user when an override with the wrong type is specified

* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes

* llama : rearrange model params

* Update new GET_KEY call

Add note that metadata KV overrides aren't reflected in initial metadata KV info dump

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05 19:19:18 +02:00
MaggotHATE
52c8bc3cf3
sampling : custom samplers order (#4285)
* Samplers sequence order with parameter

* Cleaned commented code

* Fixed formatting

* Rewrote with unordered_map

* Revert and rewrite, too many problems and safeguards would be needed

* Fixed code style

* Code style fixes according to review

* More readable samplers input string, fixed help

* Style fix in sampler_queue

* Formatting fixes

* Fixing whitespaces
2023-12-05 12:05:51 +02:00
kchro3
e4b76bbe31
swift : revert compiler checks for swift package (#4332) 2023-12-05 09:29:46 +02:00
Daniel Bevenius
23b5e12eb5
simple : update error message for KV cache check (#4324)
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-04 18:04:21 +02:00
Miwa / Ensan
d208995c6d
swift : fix concatenation method to avoid invalid UTF8 stringification (#4325) 2023-12-04 18:03:49 +02:00
Miwa / Ensan
5c9f90cba1
swift : fix prompt tokenization logic (#4321) 2023-12-04 15:43:45 +02:00
Ikko Eltociear Ashimine
4fa44e84ad
grammar-parser : fix typo (#4318)
preceeding -> preceding
2023-12-04 09:57:35 +02:00
Georgi Gerganov
fbbc42827b
ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (#4308)
* ggml : fix soft max out-of-bounds access

ggml-ci

* ggml : reuse ggml_get_n_tasks() in ggml_graph_plan()

ggml-ci
2023-12-03 15:56:35 +02:00
Georgi Gerganov
adf3de4f69
ggml : fix soft max out-of-bounds access (#4307)
ggml-ci
2023-12-03 15:56:22 +02:00
Ed Lee
33e171d1e9
server : fix OpenAI API stop field to be optional (#4299)
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)
2023-12-03 11:10:43 +02:00
Rickard Edén
6949b50df5
py : add grammar to oai like api (#4294) 2023-12-03 11:03:25 +02:00
Georgi Gerganov
d7b800b8bc
llama : pad KV cache size (#4280)
* llama : pad KV cache size to 32

* metal : try to improve batched decoding
2023-12-03 10:58:16 +02:00
Georgi Gerganov
5a7d3125e7
llama : avoid using "optional" keyword (#4283) 2023-12-01 20:39:12 +02:00
Georgi Gerganov
d5a1cbde60
llama : support optional tensors (#4283) 2023-12-01 20:35:47 +02:00
Miwa / Ensan
b220222a64
swift : fix token_to_piece implementation (#4278)
* Fix token_to_piece implementation in Swift

* Fix errors
2023-12-01 20:19:45 +02:00
Jared Van Bortel
511f52c334
build : enable libstdc++ assertions for debug builds (#4275) 2023-12-01 20:18:35 +02:00