Georgi Gerganov
6a419f4d19
convert : support safetensors format
2023-12-12 13:05:14 +02:00
slaren
f1cbfabd64
convert : fix style
2023-12-11 20:02:55 +01:00
slaren
7dc75e3923
convert : use 1e6 rope_freq_base for mixtral
2023-12-11 20:00:28 +01:00
slaren
296c945de5
cuda : fix mul_mat_id with multi gpu
2023-12-11 16:53:25 +01:00
slaren
33e50f1b53
test-backend-ops : disable MOE test with thread sanitizer
2023-12-11 12:27:48 +01:00
slaren
ffda94c87f
test-backend-ops : simplify and disable slow tests to avoid CI timeout
2023-12-11 12:15:31 +01:00
Georgi Gerganov
8cbaed1d9a
llama : fix hard-coded number of experts
2023-12-11 08:55:27 +02:00
slaren
b0029815e4
test-backend-ops : fix dequantize block offset
2023-12-11 02:43:52 +01:00
slaren
f1380d7897
test-backend-ops : add cpy from f32 -> all types test
2023-12-10 22:58:31 +01:00
slaren
54d254bbed
test-backend-ops : cleanup, add moe test for batches
2023-12-10 21:52:11 +01:00
Georgi Gerganov
54ba263410
test-backend-ops : make experts more evenly probable (test_moe)
2023-12-10 15:28:07 +02:00
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32
2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly
2023-12-10 14:18:14 +02:00
slaren
8614aa736d
cuda : fix get_rows when ncols is odd
2023-12-10 13:12:18 +01:00
slaren
cefebb3660
test-backend-ops : add moe test
2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants
2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors
2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types
2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1
2023-12-10 09:38:21 +02:00
slaren
0710b0f726
llama : offload missing ffn_moe_silu
2023-12-09 23:29:47 +01:00
slaren
62b95f93d0
cuda : support non-contiguous src1 in get_rows
2023-12-09 22:39:34 +01:00
slaren
2e4db48291
ggml : update get_rows f16 and q
2023-12-09 22:38:22 +01:00
slaren
ac3f7d8e23
ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D
2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id
2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont
2023-12-09 14:25:49 +02:00
slaren
06dfde3e94
llama : add basic support for offloading moe with CUDA
2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests
2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11
2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa
ggml : add n_as argument to ggml_mul_mat_id
2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN
2023-12-09 13:01:42 +02:00
Georgi Gerganov
7ea36953ba
llama : first working version
2023-12-09 12:45:15 +02:00
Georgi Gerganov
af1a096bf8
llama : fix cur -> cur_expert
2023-12-09 12:07:39 +02:00
Georgi Gerganov
aedfad120a
llama : update graph to support MoE
2023-12-09 11:47:40 +02:00
Georgi Gerganov
861cd67899
ggml : sync latest ggml_mul_mat_id
2023-12-09 11:19:46 +02:00
Georgi Gerganov
a3eefe95a8
llama : model loading
2023-12-09 11:14:03 +02:00
Georgi Gerganov
d38e41ee69
convert : fix n_ff typo
2023-12-09 10:59:37 +02:00
Georgi Gerganov
dff8cbeb39
convert : support Mixtral as LLAMA arch
2023-12-09 10:51:58 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) ( #4359 )
* sync : ggml (part 1)
* sync : ggml (part 2, CUDA)
* sync : ggml (part 3, Metal)
* ggml : build fixes
ggml-ci
* cuda : restore lost changes
* cuda : restore lost changes (StableLM rope)
* cmake : enable separable compilation for CUDA
ggml-ci
* ggml-cuda : remove device side dequantize
* Revert "cmake : enable separable compilation for CUDA"
This reverts commit 09e35d04b1.
* cuda : remove assert for rope
* tests : add test-backend-ops
* ggml : fix bug in ggml_concat
* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`
* ci : try to fix macOS
* ggml-backend : remove backend self-registration
* ci : disable Metal for macOS cmake build
ggml-ci
* metal : fix "supports family" call
* metal : fix assert
* metal : print resource path
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache ( #4309 )
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312)
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Hongyu Ouyang
81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) ( #4351 )
On commit b1108 (44c117f4) xaedes added
    ggml_allocr * alloc = NULL;
    ... (many lines in between)
    if (alloc) {
        ggml_allocr_free(alloc);
    }
Which is correct, but it's easy to lose context after many lines in between.
On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.
    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)
This happens a few times, but alloc is never set to NULL, and many lines below,
we still have
    if (alloc) {
        ggml_allocr_free(alloc);
    }
which causes a double-free.
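The failure mode described above can be illustrated with a small, self-contained sketch. It does not use the real ggml API: `demo_allocr`, `demo_allocr_new`, `demo_allocr_free` and `run_demo` are hypothetical stand-ins, and resetting the pointer after each eager free is only one common way to avoid this class of bug, not necessarily the change made in this commit.

```c
#include <stdlib.h>

/* Hypothetical stand-in for ggml_allocr; NOT the real ggml type. */
typedef struct { int unused; } demo_allocr;

static int g_live  = 0; /* allocators currently alive            */
static int g_frees = 0; /* number of calls to demo_allocr_free() */

static demo_allocr * demo_allocr_new(void) {
    demo_allocr * a = malloc(sizeof(*a));
    if (a) {
        g_live++;
    }
    return a;
}

static void demo_allocr_free(demo_allocr * a) {
    g_live--;
    g_frees++;
    free(a);
}

/* Mirrors the shape of the code in the commit message: the allocator is
 * created and freed eagerly a few times, and a guarded free remains at the
 * end. Returns the number of live allocators, or -1 if allocation fails. */
static int run_demo(void) {
    demo_allocr * alloc = NULL;

    for (int i = 0; i < 3; i++) {
        alloc = demo_allocr_new();
        if (!alloc) {
            return -1;
        }
        /* ... short-lived use of alloc ... */
        demo_allocr_free(alloc);
        alloc = NULL; /* without this reset, the guarded free below would
                         release the same pointer a second time */
    }

    if (alloc) { /* skipped, because alloc was reset to NULL */
        demo_allocr_free(alloc);
    }
    return g_live;
}
```

With the reset in place, `run_demo()` performs exactly three frees and leaves no live allocators. Dropping the `alloc = NULL;` line reproduces the pattern the commit message describes, where the trailing guard no longer protects against a second free.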
2023-12-07 12:25:22 +02:00
Georgi Gerganov
05cd6e5036
server : recognize cache_prompt parameter in OAI API ( #4347 )
2023-12-06 20:21:59 +02:00
Georgi Gerganov
caa9249217
common : fix compile warning
2023-12-06 10:41:03 +02:00
stduhpf
da5eaef1f3
speculative : support --color ( #4343 )
* speculative: add some colors
* minor : add braces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-06 10:08:17 +02:00
Marcus Dunn
5f6e0c0dff
grammar : pre-computed pieces + reserve mem + less string copies ( #4330 )
* reserve space for codepoints
* improvement for the appended 0
* used precomputed token text for grammar sample
* reserve candidates_decoded
* reserve candidates_grammar
* remove candidates_decoded
* Revert "remove candidates_decoded"
This reverts commit 3773328080.
* changed decode_utf8 to take src by ref
2023-12-05 22:55:12 +02:00
Kerfuffle
5aa365d88f
llama : allow overriding GGUF metadata when loading model ( #4092 )
* feat: Allow overriding GGUF metadata when loading model
* Fix the one time GCC is stricter than clang about something
* Step1
* Refactor... basically everything!
* Nuke obsolete GetArrayLen struct
* simplify std::string specialization
* Various cleanups
Add informational output when overrides are applied
Warn user when an override with the wrong type is specified
* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes
* llama : rearrange model params
* Update new GET_KEY call
Add note that metadata KV overrides aren't reflected in initial metadata KV info dump
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05 19:19:18 +02:00
MaggotHATE
52c8bc3cf3
sampling : custom samplers order ( #4285 )
* Samplers sequence order with parameter
* Cleaned commented code
* Fixed formatting
* Rewrote with unordered_map
* Revert and rewrite, too many problems and safeguards would be needed
* Fixed code style
* Code style fixes according to review
* More readable samplers input string, fixed help
* Style fix in sampler_queue
* Formatting fixes
* Fixing whitespaces
2023-12-05 12:05:51 +02:00
kchro3
e4b76bbe31
swift : revert compiler checks for swift package ( #4332 )
2023-12-05 09:29:46 +02:00
Daniel Bevenius
23b5e12eb5
simple : update error message for KV cache check ( #4324 )
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-04 18:04:21 +02:00
Miwa / Ensan
d208995c6d
swift : fix concatenation method to avoid invalid UTF8 stringification ( #4325 )
2023-12-04 18:03:49 +02:00