slaren
8614aa736d
cuda : fix get_rows when ncols is odd
2023-12-10 13:12:18 +01:00
slaren
cefebb3660
test-backend-ops : add moe test
2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants
2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors
2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types
2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1
2023-12-10 09:38:21 +02:00
slaren
0710b0f726
llama : offload missing ffn_moe_silu
2023-12-09 23:29:47 +01:00
slaren
62b95f93d0
cuda : support non-contiguous src1 in get_rows
2023-12-09 22:39:34 +01:00
slaren
2e4db48291
ggml : update get_rows f16 and q
2023-12-09 22:38:22 +01:00
slaren
ac3f7d8e23
ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D
2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id
2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont
2023-12-09 14:25:49 +02:00
slaren
06dfde3e94
llama : add basic support for offloading moe with CUDA
2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests
2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11
2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa
ggml : add n_as argument to ggml_mul_mat_id
2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN
2023-12-09 13:01:42 +02:00
Georgi Gerganov
7ea36953ba
llama : first working version
2023-12-09 12:45:15 +02:00
Georgi Gerganov
af1a096bf8
llama : fix cur -> cur_expert
2023-12-09 12:07:39 +02:00
Georgi Gerganov
aedfad120a
llama : update graph to support MoE
2023-12-09 11:47:40 +02:00
Georgi Gerganov
861cd67899
ggml : sync latest ggml_mul_mat_id
2023-12-09 11:19:46 +02:00
Georgi Gerganov
a3eefe95a8
llama : model loading
2023-12-09 11:14:03 +02:00
Georgi Gerganov
d38e41ee69
convert : fix n_ff typo
2023-12-09 10:59:37 +02:00
Georgi Gerganov
dff8cbeb39
convert : support Mixtral as LLAMA arch
2023-12-09 10:51:58 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) ( #4359 )
...
* sync : ggml (part 1)
* sync : ggml (part 2, CUDA)
* sync : ggml (part 3, Metal)
* ggml : build fixes
ggml-ci
* cuda : restore lost changes
* cuda : restore lost changes (StableLM rope)
* cmake : enable separable compilation for CUDA
ggml-ci
* ggml-cuda : remove device side dequantize
* Revert "cmake : enable separable compilation for CUDA"
This reverts commit 09e35d04b1.
* cuda : remove assert for rope
* tests : add test-backend-ops
* ggml : fix bug in ggml_concat
* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`
* ci : try to fix macOS
* ggml-backend : remove backend self-registration
* ci : disable Metal for macOS cmake build
ggml-ci
* metal : fix "supports family" call
* metal : fix assert
* metal : print resource path
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache ( #4309 )
...
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312 )
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Hongyu Ouyang
81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) ( #4351 )
...
On commit b1108 (44c117f4) xaedes added

    ggml_allocr * alloc = NULL;
    ... (many lines in between)
    if (alloc) {
        ggml_allocr_free(alloc);
    }

Which is correct, but it's easy to lose context after many lines in between.
On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never set to NULL, and many lines below,
we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double-free.
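
The pattern described above can be sketched in isolation. This is a minimal illustration, not the actual patch: `demo_alloc`, `demo_alloc_new`, `demo_alloc_free` and `run_demo` are hypothetical stand-ins for the real `ggml_allocr` handle and its functions. It assumes one common remedy, which is to clear the pointer on every free so that the trailing `if (alloc)` guard becomes a no-op.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator handle; NOT the real ggml_allocr. */
typedef struct demo_alloc { int unused; } demo_alloc;

static int g_live  = 0;  /* handles currently alive */
static int g_frees = 0;  /* number of real frees performed */

static demo_alloc * demo_alloc_new(void) {
    demo_alloc * a = malloc(sizeof *a);
    if (a) {
        g_live++;
    }
    return a;
}

/* Free the handle and clear the caller's pointer, so later guards see NULL. */
static void demo_alloc_free(demo_alloc ** pa) {
    if (pa && *pa) {
        free(*pa);
        *pa = NULL;
        g_live--;
        g_frees++;
    }
}

/* Mirrors the flow in the commit message: eager frees, then a guarded free. */
static int run_demo(void) {
    demo_alloc * alloc = NULL;

    for (int i = 0; i < 3; i++) {
        alloc = demo_alloc_new();
        /* ... short section of code that uses alloc ... */
        demo_alloc_free(&alloc);   /* eager free, pointer reset to NULL */
    }

    assert(alloc == NULL);

    /* "many lines below": the guard is now false, so no second free happens */
    if (alloc) {
        demo_alloc_free(&alloc);
    }
    return g_frees;
}
```

Calling `run_demo()` performs one real free per allocation. Without the `*pa = NULL` reset, the final guard would see a dangling non-NULL pointer and free it a second time, which is the bug described in this commit.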
2023-12-07 12:25:22 +02:00
Georgi Gerganov
05cd6e5036
server : recognize cache_prompt parameter in OAI API ( #4347 )
2023-12-06 20:21:59 +02:00
Georgi Gerganov
caa9249217
common : fix compile warning
2023-12-06 10:41:03 +02:00
stduhpf
da5eaef1f3
speculative : support `--color` ( #4343 )
...
* speculative: add some colors
* minor : add braces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-06 10:08:17 +02:00
Marcus Dunn
5f6e0c0dff
grammar : pre-computed pieces + reserve mem + less string copies ( #4330 )
...
* reserve space for codepoints
* improvement for the appended 0
* used precomputed token text for grammar sample
* reserve candidates_decoded
* reserve candidates_grammar
* remove candidates_decoded
* Revert "remove candidates_decoded"
This reverts commit 3773328080.
* changed decode_utf8 to take src by ref
2023-12-05 22:55:12 +02:00
Kerfuffle
5aa365d88f
llama : allow overriding GGUF metadata when loading model ( #4092 )
...
* feat: Allow overriding GGUF metadata when loading model
* Fix the one time GCC is stricter than clang about something
* Step1
* Refactor... basically everything!
* Nuke obsolete GetArrayLen struct
* simplify std::string specialization
* Various cleanups
Add informational output when overrides are applied
Warn user when an override with the wrong type is specified
* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes
* llama : rearrange model params
* Update new GET_KEY call
Add note that metadata KV overrides aren't reflected in initial metadata KV info dump
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05 19:19:18 +02:00
MaggotHATE
52c8bc3cf3
sampling : custom samplers order ( #4285 )
...
* Samplers sequence order w parameter
* Cleaned commented code
* Fixed formatting
* Rewrote with unordered_map
* Revert and rewrite, too many problems and safeguards would be needed
* Fixed code style
* Code style fixes according to review
* More readable samplers input string, fixed help
* Style fix in sampler_queue
* Formatting fixes
* Fixing whitespaces
2023-12-05 12:05:51 +02:00
kchro3
e4b76bbe31
swift : revert compiler checks for swift package ( #4332 )
2023-12-05 09:29:46 +02:00
Daniel Bevenius
23b5e12eb5
simple : update error message for KV cache check ( #4324 )
...
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-04 18:04:21 +02:00
Miwa / Ensan
d208995c6d
swift : fix concatenation method to avoid invalid UTF8 stringification ( #4325 )
2023-12-04 18:03:49 +02:00
Miwa / Ensan
5c9f90cba1
swift : fix prompt tokenization logic ( #4321 )
2023-12-04 15:43:45 +02:00
Ikko Eltociear Ashimine
4fa44e84ad
grammar-parser : fix typo ( #4318 )
...
preceeding -> preceding
2023-12-04 09:57:35 +02:00
Georgi Gerganov
fbbc42827b
ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() ( #4308 )
...
* ggml : fix soft max out-of-bounds access
ggml-ci
* ggml : reuse ggml_get_n_tasks() in ggml_graph_plan()
ggml-ci
2023-12-03 15:56:35 +02:00
Georgi Gerganov
adf3de4f69
ggml : fix soft max out-of-bounds access ( #4307 )
...
ggml-ci
2023-12-03 15:56:22 +02:00
Ed Lee
33e171d1e9
server : fix OpenAI API `stop` field to be optional ( #4299 )
...
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84 )
2023-12-03 11:10:43 +02:00
Rickard Edén
6949b50df5
py : add grammar to oai like api ( #4294 )
2023-12-03 11:03:25 +02:00
Georgi Gerganov
d7b800b8bc
llama : pad KV cache size ( #4280 )
...
* llama : pad KV cache size to 32
* metal : try to improve batched decoding
2023-12-03 10:58:16 +02:00
Georgi Gerganov
5a7d3125e7
llama : avoid using "optional" keyword ( #4283 )
2023-12-01 20:39:12 +02:00
Georgi Gerganov
d5a1cbde60
llama : support optional tensors ( #4283 )
2023-12-01 20:35:47 +02:00
Miwa / Ensan
b220222a64
swift : fix token_to_piece implementation ( #4278 )
...
* Fix token_to_piece implementation in Swift
* Fix errors
2023-12-01 20:19:45 +02:00
Jared Van Bortel
511f52c334
build : enable libstdc++ assertions for debug builds ( #4275 )
2023-12-01 20:18:35 +02:00
CausalLM
03562f3a86
llama : support attention bias on LLaMA architecture ( #4283 )
...
* Support attention_bias on LLaMA architecture
QKVO bias, should fix InternLM (https://github.com/ggerganov/llama.cpp/issues/3133 ) and works for LLaMAfied Qwen models (https://github.com/ggerganov/llama.cpp/pull/3743#issuecomment-1825923608 ).
* check existence of qkvo bias while loading llama models
Tested on LLaMA2, CUDA and CPU.
* Update llama.cpp
2023-12-01 20:17:06 +02:00
Shijie
37c746d687
llama : add Qwen support ( #4281 )
...
* enable qwen to llama.cpp
* llama : do not GPU split bias tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-01 20:16:31 +02:00