1653 Commits

Author SHA1 Message Date
slaren
ffda94c87f test-backend-ops : simplify and disable slow tests to avoid CI timeout 2023-12-11 12:15:31 +01:00
Georgi Gerganov
8cbaed1d9a
llama : fix hard-coded number of experts 2023-12-11 08:55:27 +02:00
slaren
b0029815e4 test-backend-ops : fix dequantize block offset 2023-12-11 02:43:52 +01:00
slaren
f1380d7897 test-backend-ops : add cpy from f32 -> all types test 2023-12-10 22:58:31 +01:00
slaren
54d254bbed test-backend-ops : cleanup, add moe test for batches 2023-12-10 21:52:11 +01:00
Georgi Gerganov
54ba263410
test-backend-ops : make experts more evenly probable (test_moe) 2023-12-10 15:28:07 +02:00
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32 2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly 2023-12-10 14:18:14 +02:00
slaren
8614aa736d cuda : fix get_rows when ncols is odd 2023-12-10 13:12:18 +01:00
slaren
cefebb3660 test-backend-ops : add moe test 2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants 2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors 2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types 2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1 2023-12-10 09:38:21 +02:00
slaren
0710b0f726 llama : offload missing ffn_moe_silu 2023-12-09 23:29:47 +01:00
slaren
62b95f93d0 cuda : support non-contiguous src1 in get_rows 2023-12-09 22:39:34 +01:00
slaren
2e4db48291 ggml : update get_rows f16 and q 2023-12-09 22:38:22 +01:00
slaren
ac3f7d8e23 ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D 2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id 2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont 2023-12-09 14:25:49 +02:00
slaren
06dfde3e94 llama : add basic support for offloading moe with CUDA 2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests 2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11 2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa ggml : add n_as argument to ggml_mul_mat_id 2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) 2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN 2023-12-09 13:01:42 +02:00
Georgi Gerganov
7ea36953ba
llama : first working version 2023-12-09 12:45:15 +02:00
Georgi Gerganov
af1a096bf8
llama : fix cur -> cur_expert 2023-12-09 12:07:39 +02:00
Georgi Gerganov
aedfad120a
llama : update graph to support MoE 2023-12-09 11:47:40 +02:00
Georgi Gerganov
861cd67899
ggml : sync latest ggml_mul_mat_id 2023-12-09 11:19:46 +02:00
Georgi Gerganov
a3eefe95a8
llama : model loading 2023-12-09 11:14:03 +02:00
Georgi Gerganov
d38e41ee69
convert : fix n_ff typo 2023-12-09 10:59:37 +02:00
Georgi Gerganov
dff8cbeb39
convert : support Mixtral as LLAMA arch 2023-12-09 10:51:58 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)

* sync : ggml (part 2, CUDA)

* sync : ggml (part 3, Metal)

* ggml : build fixes

ggml-ci

* cuda : restore lost changes

* cuda : restore lost changes (StableLM rope)

* cmake : enable separable compilation for CUDA

ggml-ci

* ggml-cuda : remove device side dequantize

* Revert "cmake : enable separable compilation for CUDA"

This reverts commit 09e35d04b1c4ca67f9685690160b35bc885a89ac.

* cuda : remove assert for rope

* tests : add test-backend-ops

* ggml : fix bug in ggml_concat

* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`

* ci : try to fix macOS

* ggml-backend : remove backend self-registration

* ci : disable Metal for macOS cmake build

ggml-ci

* metal : fix "supports family" call

* metal : fix assert

* metal : print resource path

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
b1620
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
b1619
2023-12-07 13:03:17 +02:00
Hongyu Ouyang
81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4) xaedes added

    ggml_allocr * alloc = NULL;

    ... (many lines in between)

    if (alloc) {
        ggml_allocr_free(alloc);
    }

Which is correct, but it's easy to lose context after many lines in between.

On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never set to NULL, and many lines below,
we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double-free.
b1618
2023-12-07 12:25:22 +02:00
Georgi Gerganov
05cd6e5036
server : recognize cache_prompt parameter in OAI API (#4347) b1617 2023-12-06 20:21:59 +02:00
Georgi Gerganov
caa9249217
common : fix compile warning b1616 2023-12-06 10:41:03 +02:00
stduhpf
da5eaef1f3
speculative : support --color (#4343)
* speculative: add some colors

* minor : add braces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1615
2023-12-06 10:08:17 +02:00
Marcus Dunn
5f6e0c0dff
grammar : pre-computed pieces + reserve mem + less string copies (#4330)
* reserve space for codepoints

* improvement for the appended 0

* used precomputed token text for grammar sample

* reserve canidates_decoded

* reserve canidates_grammar

* remove candidates_decoded

* Revert "remove candidates_decoded"

This reverts commit 3773328080e6a139ee83198329a13cf4ff61d707.

* changed decode_utf8 to take src by ref
b1614
2023-12-05 22:55:12 +02:00
Kerfuffle
5aa365d88f
llama : allow overriding GGUF metadata when loading model (#4092)
* feat: Allow overriding GGUF metadata when loading model

* Fix the one time GCC is stricter than clang about something

* Step1

* Refactor... basically everything!

* Nuke obsolete GetArrayLen struct

* simplify std::string specialization

* Various cleanups

Add informational output when overrides are applied

Warn user when an override with the wrong type is specified

* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes

* llama : rearrange model params

* Update new GET_KEY call

Add note that metadata KV overrides aren't reflected in initial metadata KV info dump

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1613
2023-12-05 19:19:18 +02:00
MaggotHATE
52c8bc3cf3
sampling : custom samplers order (#4285)
* Samplers sequence order w parameter

* Cleaned commented code

* Fixed formatting

* Rewrote with unordered_map

* Revert and rewrite, too many problems and safeguards would be needed

* Fixed code style

* Code style fixes according to review

* More readable samplers input string, fixed help

* Style fix in sampler_queue

* Formatting fixes

* Fixing whitespaces
b1612
2023-12-05 12:05:51 +02:00
kchro3
e4b76bbe31
swift : revert compiler checks for swift package (#4332) b1611 2023-12-05 09:29:46 +02:00
Daniel Bevenius
23b5e12eb5
simple : update error message for KV cache check (#4324)
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
b1610
2023-12-04 18:04:21 +02:00
Miwa / Ensan
d208995c6d
swift : fix concatenation method to avoid invalid UTF8 stringfication (#4325) b1609 2023-12-04 18:03:49 +02:00
Miwa / Ensan
5c9f90cba1
swift : fix prompt tokenization logic (#4321) b1608 2023-12-04 15:43:45 +02:00
Ikko Eltociear Ashimine
4fa44e84ad
grammar-parser : fix typo (#4318)
preceeding -> preceding
b1607
2023-12-04 09:57:35 +02:00
Georgi Gerganov
fbbc42827b
ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (#4308)
* ggml : fix soft max out-of-bounds access

ggml-ci

* ggml : reuse ggml_get_n_tasks() in ggml_graph_plan()

ggml-ci
b1606
2023-12-03 15:56:35 +02:00
Georgi Gerganov
adf3de4f69
ggml : fix soft max out-of-bounds access (#4307)
ggml-ci
b1605
2023-12-03 15:56:22 +02:00
Ed Lee
33e171d1e9
server : fix OpenAI API stop field to be optional (#4299)
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)
b1604
2023-12-03 11:10:43 +02:00