Commit Graph

1665 Commits

Author SHA1 Message Date
Georgi Gerganov
ab558ac2b3
metal : fix soft_max kernels
ref: 1914017863
2023-12-13 10:57:25 +02:00
Radek Pilar
82e4f64578
convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy
2023-12-12 21:04:10 +02:00
Georgi Gerganov
90c12e6b3c
ggml : do not use BLAS with ggml_mul_mat_id 2023-12-12 20:05:58 +02:00
Georgi Gerganov
ea4402bb0e
test-backend-ops : add one more sum_rows test 2023-12-12 17:03:38 +02:00
Georgi Gerganov
a51bc0c1c0
metal : fix binary ops for ne10 % 4 != 0 2023-12-12 15:55:42 +02:00
Georgi Gerganov
08eb99179a
metal : add cpy f16 -> f32 kernel 2023-12-12 14:15:08 +02:00
slaren
a742d9f9b7 gguf-py : bump version 2023-12-12 12:46:33 +01:00
Georgi Gerganov
6a419f4d19
convert : support safetensors format 2023-12-12 13:05:14 +02:00
slaren
f1cbfabd64 convert : fix style 2023-12-11 20:02:55 +01:00
slaren
7dc75e3923 convert : use 1e6 rope_freq_base for mixtral 2023-12-11 20:00:28 +01:00
slaren
296c945de5 cuda : fix mul_mat_id with multi gpu 2023-12-11 16:53:25 +01:00
slaren
33e50f1b53 test-backend-ops : disable MOE test with thread sanitizer 2023-12-11 12:27:48 +01:00
slaren
ffda94c87f test-backend-ops : simplify and disable slow tests to avoid CI timeout 2023-12-11 12:15:31 +01:00
Georgi Gerganov
8cbaed1d9a
llama : fix hard-coded number of experts 2023-12-11 08:55:27 +02:00
slaren
b0029815e4 test-backend-ops : fix dequantize block offset 2023-12-11 02:43:52 +01:00
slaren
f1380d7897 test-backend-ops : add cpy from f32 -> all types test 2023-12-10 22:58:31 +01:00
slaren
54d254bbed test-backend-ops : cleanup, add moe test for batches 2023-12-10 21:52:11 +01:00
Georgi Gerganov
54ba263410
test-backend-ops : make experts more evenly probable (test_moe) 2023-12-10 15:28:07 +02:00
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32 2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly 2023-12-10 14:18:14 +02:00
slaren
8614aa736d cuda : fix get_rows when ncols is odd 2023-12-10 13:12:18 +01:00
slaren
cefebb3660 test-backend-ops : add moe test 2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants 2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors 2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types 2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1 2023-12-10 09:38:21 +02:00
slaren
0710b0f726 llama : offload missing ffn_moe_silu 2023-12-09 23:29:47 +01:00
slaren
62b95f93d0 cuda : support non-contiguous src1 in get_rows 2023-12-09 22:39:34 +01:00
slaren
2e4db48291 ggml : update get_rows f16 and q 2023-12-09 22:38:22 +01:00
slaren
ac3f7d8e23 ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D 2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id 2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont 2023-12-09 14:25:49 +02:00
slaren
06dfde3e94 llama : add basic support for offloading moe with CUDA 2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests 2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11 2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa ggml : add n_as argument to ggml_mul_mat_id 2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) 2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN 2023-12-09 13:01:42 +02:00
Georgi Gerganov
7ea36953ba
llama : first working version 2023-12-09 12:45:15 +02:00
Georgi Gerganov
af1a096bf8
llama : fix cur -> cur_expert 2023-12-09 12:07:39 +02:00
Georgi Gerganov
aedfad120a
llama : update graph to support MoE 2023-12-09 11:47:40 +02:00
Georgi Gerganov
861cd67899
ggml : sync latest ggml_mul_mat_id 2023-12-09 11:19:46 +02:00
Georgi Gerganov
a3eefe95a8
llama : model loading 2023-12-09 11:14:03 +02:00
Georgi Gerganov
d38e41ee69
convert : fix n_ff typo 2023-12-09 10:59:37 +02:00
Georgi Gerganov
dff8cbeb39
convert : support Mixtral as LLAMA arch 2023-12-09 10:51:58 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)

* sync : ggml (part 2, CUDA)

* sync : ggml (part 3, Metal)

* ggml : build fixes

ggml-ci

* cuda : restore lost changes

* cuda : restore lost changes (StableLM rope)

* cmake : enable separable compilation for CUDA

ggml-ci

* ggml-cuda : remove device side dequantize

* Revert "cmake : enable separable compilation for CUDA"

This reverts commit 09e35d04b1.

* cuda : remove assert for rope

* tests : add test-backend-ops

* ggml : fix bug in ggml_concat

* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`

* ci : try to fix macOS

* ggml-backend : remove backend self-registration

* ci : disable Metal for macOS cmake build

ggml-ci

* metal : fix "supports family" call

* metal : fix assert

* metal : print resource path

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Hongyu Ouyang
81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4) xaedes added

    ggml_allocr * alloc = NULL;

    ... (many lines in between)

    if (alloc) {
        ggml_allocr_free(alloc);
    }

This is correct, but with so many lines in between it is easy to lose track of the context.

On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never set to NULL, and many lines below,
we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double-free.
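
The pattern described above can be sketched in isolation. This is a minimal illustration and not necessarily the change made in this commit: `fake_allocr`, `fake_allocr_new`, `fake_allocr_free`, and `run_eager_free_pattern` are hypothetical stand-ins for `ggml_allocr` and its functions, and resetting the pointer to NULL after each eager free is just one common way to keep a trailing `if (alloc)` guard safe.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for ggml_allocr; NOT the real ggml type. */
typedef struct fake_allocr {
    int id;
} fake_allocr;

/* Counts how many times the free function has run. */
static int g_free_calls = 0;

/* Hypothetical constructor, standing in for ggml_allocr_new(...). */
static fake_allocr *fake_allocr_new(int id) {
    fake_allocr *a = malloc(sizeof *a);
    assert(a != NULL);
    a->id = id;
    return a;
}

/* Hypothetical destructor, standing in for ggml_allocr_free(alloc). */
static void fake_allocr_free(fake_allocr *a) {
    assert(a != NULL);
    g_free_calls++;
    free(a);
}

/* Runs the eager-free pattern and returns the number of frees it performed. */
static int run_eager_free_pattern(void) {
    int before = g_free_calls;
    fake_allocr *alloc = NULL;

    /* First eager use: allocate, use, free, then reset the pointer. */
    alloc = fake_allocr_new(1);
    fake_allocr_free(alloc);
    alloc = NULL;

    /* Second eager use, same discipline. */
    alloc = fake_allocr_new(2);
    fake_allocr_free(alloc);
    alloc = NULL;

    /* The old trailing guard: alloc is NULL here, so nothing is freed twice. */
    if (alloc) {
        fake_allocr_free(alloc);
    }

    return g_free_calls - before;
}
```

Without the two `alloc = NULL;` assignments, the trailing guard would see a dangling non-NULL pointer and free the second allocation again. With them, `run_eager_free_pattern()` performs exactly one free per allocation.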
2023-12-07 12:25:22 +02:00
Georgi Gerganov
05cd6e5036
server : recognize cache_prompt parameter in OAI API (#4347) 2023-12-06 20:21:59 +02:00
Georgi Gerganov
caa9249217
common : fix compile warning 2023-12-06 10:41:03 +02:00