Georgi Gerganov
e1241d9b46
metal : switch to execution barriers + fix one of the barriers
2023-12-13 13:56:45 +02:00
Georgi Gerganov
109e7aa8ac
metal : limit kernels to not use more than the allowed threads
2023-12-13 10:57:25 +02:00
Georgi Gerganov
ab558ac2b3
metal : fix soft_max kernels
ref: 1914017863
2023-12-13 10:57:25 +02:00
Radek Pilar
82e4f64578
convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct
* convert : use sentencepiece tokenizer for Mixtral-instruct
* convert : make flake8 happy
2023-12-12 21:04:10 +02:00
Georgi Gerganov
90c12e6b3c
ggml : do not use BLAS with ggml_mul_mat_id
2023-12-12 20:05:58 +02:00
Georgi Gerganov
ea4402bb0e
test-backend-ops : add one more sum_rows test
2023-12-12 17:03:38 +02:00
Georgi Gerganov
a51bc0c1c0
metal : fix binary ops for ne10 % 4 != 0
2023-12-12 15:55:42 +02:00
Georgi Gerganov
08eb99179a
metal : add cpy f16 -> f32 kernel
2023-12-12 14:15:08 +02:00
slaren
a742d9f9b7
gguf-py : bump version
2023-12-12 12:46:33 +01:00
Georgi Gerganov
6a419f4d19
convert : support safetensors format
2023-12-12 13:05:14 +02:00
slaren
f1cbfabd64
convert : fix style
2023-12-11 20:02:55 +01:00
slaren
7dc75e3923
convert : use 1e6 rope_freq_base for mixtral
2023-12-11 20:00:28 +01:00
slaren
296c945de5
cuda : fix mul_mat_id with multi gpu
2023-12-11 16:53:25 +01:00
slaren
33e50f1b53
test-backend-ops : disable MOE test with thread sanitizer
2023-12-11 12:27:48 +01:00
slaren
ffda94c87f
test-backend-ops : simplify and disable slow tests to avoid CI timeout
2023-12-11 12:15:31 +01:00
Georgi Gerganov
8cbaed1d9a
llama : fix hard-coded number of experts
2023-12-11 08:55:27 +02:00
slaren
b0029815e4
test-backend-ops : fix dequantize block offset
2023-12-11 02:43:52 +01:00
slaren
f1380d7897
test-backend-ops : add cpy from f32 -> all types test
2023-12-10 22:58:31 +01:00
slaren
54d254bbed
test-backend-ops : cleanup, add moe test for batches
2023-12-10 21:52:11 +01:00
Georgi Gerganov
54ba263410
test-backend-ops : make experts more evenly probable (test_moe)
2023-12-10 15:28:07 +02:00
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32
2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly
2023-12-10 14:18:14 +02:00
slaren
8614aa736d
cuda : fix get_rows when ncols is odd
2023-12-10 13:12:18 +01:00
slaren
cefebb3660
test-backend-ops : add moe test
2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants
2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors
2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types
2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1
2023-12-10 09:38:21 +02:00
slaren
0710b0f726
llama : offload missing ffn_moe_silu
2023-12-09 23:29:47 +01:00
slaren
62b95f93d0
cuda : support non-contiguous src1 in get_rows
2023-12-09 22:39:34 +01:00
slaren
2e4db48291
ggml : update get_rows f16 and q
2023-12-09 22:38:22 +01:00
slaren
ac3f7d8e23
ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D
2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id
2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont
2023-12-09 14:25:49 +02:00
slaren
06dfde3e94
llama : add basic support for offloading moe with CUDA
2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests
2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11
2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa
ggml : add n_as argument to ggml_mul_mat_id
2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN
2023-12-09 13:01:42 +02:00
Georgi Gerganov
7ea36953ba
llama : first working version
2023-12-09 12:45:15 +02:00
Georgi Gerganov
af1a096bf8
llama : fix cur -> cur_expert
2023-12-09 12:07:39 +02:00
Georgi Gerganov
aedfad120a
llama : update graph to support MoE
2023-12-09 11:47:40 +02:00
Georgi Gerganov
861cd67899
ggml : sync latest ggml_mul_mat_id
2023-12-09 11:19:46 +02:00
Georgi Gerganov
a3eefe95a8
llama : model loading
2023-12-09 11:14:03 +02:00
Georgi Gerganov
d38e41ee69
convert : fix n_ff typo
2023-12-09 10:59:37 +02:00
Georgi Gerganov
dff8cbeb39
convert : support Mixtral as LLAMA arch
2023-12-09 10:51:58 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)
* sync : ggml (part 2, CUDA)
* sync : ggml (part 3, Metal)
* ggml : build fixes
ggml-ci
* cuda : restore lost changes
* cuda : restore lost changes (StableLM rope)
* cmake : enable separable compilation for CUDA
ggml-ci
* ggml-cuda : remove device side dequantize
* Revert "cmake : enable separable compilation for CUDA"
This reverts commit 09e35d04b1.
* cuda : remove assert for rope
* tests : add test-backend-ops
* ggml : fix bug in ggml_concat
* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`
* ci : try to fix macOS
* ggml-backend : remove backend self-registration
* ci : disable Metal for macOS cmake build
ggml-ci
* metal : fix "supports family" call
* metal : fix assert
* metal : print resource path
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312)
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Hongyu Ouyang
81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4), xaedes added:

    ggml_allocr * alloc = NULL;
    ... (many lines in between)
    if (alloc) {
        ggml_allocr_free(alloc);
    }

which is correct, but it is easy to lose that context after the many lines in between.
On commit b1287 (0e76a899), xaedes made a big change: from then on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never set back to NULL, and many lines below we still have:

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double free.
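A minimal self-contained sketch of the pattern and one possible remedy: a stand-in allocr type replaces the real ggml_allocr API, and resetting the pointer to NULL after each eager free is shown only as an illustration, not necessarily the exact change made in #4351.

    #include <stdlib.h>

    /* Stand-in for ggml_allocr, just enough to demonstrate the lifetime bug. */
    typedef struct allocr { int unused; } allocr;
    static allocr * allocr_new(void)        { return malloc(sizeof(allocr)); }
    static void     allocr_free(allocr * a) { free(a); }

    int main(void) {
        allocr * alloc = NULL;

        /* Eager use-and-free, done several times after commit b1287. */
        alloc = allocr_new();
        /* ... build / measure a graph ... */
        allocr_free(alloc);
        alloc = NULL;   /* without this reset, the guard below frees the same pointer again */

        /* Guard left over from commit b1108, many lines later. */
        if (alloc) {
            allocr_free(alloc);
        }
        return 0;
    }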
2023-12-07 12:25:22 +02:00