VJHack
8fb681bf9a
updated readme
2025-01-13 16:17:39 -06:00
VJHack
bee4c7c9fa
apply parameter only to llama-cli
2025-01-13 15:12:50 -06:00
VJHack
da038d8715
completed top nsigma sampler implementation
2025-01-13 14:46:12 -06:00
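For context: top-nσ sampling keeps only the tokens whose logits lie within n standard deviations of the maximum logit, then samples from the renormalised remainder. A minimal NumPy sketch of the idea (illustrative only, not the C++ implementation in this branch; the function name and the `n_sigma`/`temperature` parameters are assumptions):

```python
import numpy as np

def top_nsigma_sample(logits: np.ndarray, n_sigma: float = 1.0, temperature: float = 1.0) -> int:
    """Keep tokens whose logit is within n_sigma standard deviations of the
    maximum logit, then sample from the renormalised softmax of the rest."""
    logits = logits / temperature
    threshold = logits.max() - n_sigma * logits.std()
    masked = np.where(logits >= threshold, logits, -np.inf)  # drop the low-logit tail
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# toy vocabulary of 8 tokens
logits = np.array([2.0, 1.9, 1.5, 0.2, -0.3, -1.0, -2.5, -4.0])
print(top_nsigma_sample(logits, n_sigma=1.0))
```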
VJHack
ddc3c2208a
initial sampling changes
2025-01-09 23:04:28 -06:00
Xuan Son Nguyen
f7cd13301c
ci : use actions from ggml-org ( #11140 )
2025-01-08 16:09:20 +01:00
Xuan Son Nguyen
4d2b3d8804
lora : improve compat with mergekit-extract-lora ( #11131 )
...
* (wip) support mergekit-extracted lora
* support mergekit-extract-lora
* use lora->get_scale
* correct comment
* correct norm name & condition
* add some hints
2025-01-08 15:59:53 +01:00
Georgi Gerganov
c07d437bbd
llama : avoid hardcoded QK_K ( #11061 )
...
ggml-ci
2025-01-08 16:19:36 +02:00
Georgi Gerganov
99a3755a3c
sync : ggml
2025-01-08 13:40:30 +02:00
Radoslav Gerganov
c792dcf488
ggml : allow loading backend with env variable (ggml/1059)
...
ref: #1058
2025-01-08 13:40:18 +02:00
Xuan Son Nguyen
80ccf5d725
ci : pin dependency to specific version ( #11137 )
...
* ci : pin dependency to specific version
* will this fix ec?
2025-01-08 12:07:20 +01:00
Georgi Gerganov
a3c1232c3f
arg : option to exclude arguments from specific examples ( #11136 )
...
* arg : option to exclude arguments from specific examples
ggml-ci
* readme : remove old args [no ci]
2025-01-08 12:55:36 +02:00
amritahs-ibm
8cef75c743
llamafile : ppc64le MMA INT8 implementation ( #10912 )
...
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.
It results in a 10% - 70% improvement in total
speed (i.e. all tokens / total time) across
various batch sizes.
The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-01-08 12:54:19 +02:00
Georgi Gerganov
0d52a69e4b
ci : fix cmake option ( #11125 )
2025-01-08 11:29:34 +02:00
Mathieu Baudier
02f0430141
Disable GL_KHR_cooperative_matrix Vulkan extension if not available. ( #11117 )
...
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.
* Perform Vulkan extensions checks in a more sensible order
* Remove unnecessary #ifdef directive
2025-01-08 09:18:13 +01:00
ag2s20150909
bec2183f2c
fix: Vulkan shader gen binary path when cross-compiling ( #11096 )
...
* fix: Vulkan shader gen binary path when cross compiling
2025-01-08 09:17:29 +01:00
Johannes Gäßler
53ff6b9b9f
GGUF: C++ refactor, backend support, misc fixes ( #11030 )
...
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types
2025-01-07 18:01:58 +01:00
Diego Devesa
017cc5f446
ggml-backend : only offload from host buffers (fix) ( #11124 )
2025-01-07 16:11:57 +01:00
Diego Devesa
a3d50bc022
ggml-backend : only offload from host buffers ( #11120 )
2025-01-07 12:38:05 +01:00
Radoslav Gerganov
a4dd490069
rpc : code cleanup ( #11107 )
...
Remove duplicated macros, use GGML_LOG_ERROR for errors
2025-01-07 08:37:02 +02:00
Akarshan Biswas
c0d6f790d0
SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 ( #11087 )
...
* SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6
* Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6"
This reverts commit f62dc45f31.
* Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6
2025-01-07 14:26:07 +08:00
Eric Curtin
dc7cef9f37
llama-run : fix context size ( #11094 )
...
Set `n_ctx` equal to `n_batch` in the `Opt` class. The context size is
now a more reasonable 2048.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-06 23:45:28 +01:00
Georgi Gerganov
ecebbd292d
llama : remove unused headers ( #11109 )
...
ggml-ci
2025-01-06 17:52:35 +02:00
Xuan Son Nguyen
96be8c3264
github : add cmd line field to bug report ( #11090 )
...
* github : cmd line to bug report
* codeowners : (@ngxson) only watch dockerfile
* Apply suggestions from code review [no ci]
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* rm cmd in log output [no ci]
* rm 2 [no ci]
* no need backticks [no ci]
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-01-06 16:34:49 +01:00
Georgi Gerganov
e6e7c75d94
server : fix extra BOS in infill endpoint ( #11106 )
...
* server : fix extra BOS in infill endpoint
ggml-ci
* server : update infill tests
2025-01-06 15:36:08 +02:00
Xuan Son Nguyen
09186fabbe
llama : remove check flash_attn with lora ( #11104 )
2025-01-06 13:41:12 +01:00
Asghar Ghorbani
96a1dc27c3
llama : prevent system info string accumulation across calls ( #11101 )
2025-01-06 13:21:46 +02:00
Daniel Bevenius
6369f867a4
llama : rename missed batch params/vars to ubatch ( #10059 )
...
This commit renames the `batch` parameter to `ubatch` in the
`llama_kv_cache_find_slot`, `llm_build_inp_embd`, and
`llm_build_mamba` functions.
The motivation for this is that it should have been done as part of
commit 19d900a756 ("llama : rename batch to ubatch (#9950)"), but for
some reason I missed these functions in that commit and only noticed
them now (sorry).
2025-01-06 11:28:17 +02:00
Georgi Gerganov
47182dd03f
llama : update llama_model API names ( #11063 )
...
* llama : deprecate llama_free_model, add llama_model_free
ggml-ci
* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`
ggml-ci
2025-01-06 10:55:18 +02:00
Georgi Gerganov
3e6e7a6bc2
tokenize : escape the prompt ( #11058 )
...
* tokenize : escape the prompt
* tokenize : update help
2025-01-06 10:54:25 +02:00
Georgi Gerganov
ae2f606bb5
mmap : fix fileno macro clash ( #11076 )
...
* mmap : fix fileno macro clash
ggml-ci
* cont
ggml-ci
2025-01-06 10:52:38 +02:00
Georgi Gerganov
727368c60f
llama : use LLAMA_TOKEN_NULL ( #11062 )
...
ggml-ci
2025-01-06 10:52:15 +02:00
Georgi Gerganov
5047dd3546
llama : use _impl suffix instead of _internal ( #11060 )
...
ggml-ci
2025-01-06 10:52:01 +02:00
Johannes Gäßler
46e3556e01
CUDA: add BF16 support ( #11093 )
...
* CUDA: add BF16 support
2025-01-06 02:33:52 +01:00
0cc4m
b56f079e28
Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver ( #11074 )
...
* Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver
* Add (TM) to AMD name check
2025-01-04 21:09:59 +01:00
fairydreaming
9394bbd484
llama : Add support for DeepSeek V3 ( #11049 )
...
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type
* vocab : add DeepSeek V3 pre-tokenizer regexes
* unicode : handle ACCENT_MARK and SYMBOL categories in regex
* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-04 21:06:11 +01:00
matt23654
f922a9c542
[GGML][RPC] Support for models with non-512-aligned tensors over RPC. ( #11047 )
...
* Added init tensor calling code
* Added get_alloc_size forwarding
* Cleaned up and improved type/error handling.
* fix: remove trailing whitespaces.
* Cleanup and use GGML error logging functions.
* Handle potentially dangerous edge cases.
* Apply suggestions from code review
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-04 17:10:30 +01:00
DAN™
46be942214
llama : add support for the cohere2 model architecture ( #10900 )
2025-01-04 16:33:31 +02:00
Georgi Gerganov
78c6785175
sync : ggml
2025-01-04 16:09:53 +02:00
Georgi Gerganov
5e3b08d606
ggml : do not install metal source when embed library (ggml/1054)
2025-01-04 16:09:53 +02:00
Daniel Bevenius
db68c93b57
ggml : improve inputs log sched_print_assignments (ggml/1053)
...
This commit attempts to improve the log message for the inputs of the
splits in the sched_print_assignments function.
The motivation for this change is that currently a colon is displayed
at the end of the line even when there are no inputs, which can be
confusing when reading the output: it suggests that the lines below are
inputs when they are in fact nodes. With this change the colon is only
printed if there actually are inputs.
2025-01-04 16:09:53 +02:00
Gilad S.
c31fc8b966
fix: Vulkan shader gen binary path ( #11037 )
2025-01-04 09:17:31 +01:00
Molly Sophia
4b0c638b9a
common : disable KV cache shifting automatically for unsupported models ( #11053 )
...
* Disable KV cache shifting automatically for unsupported models
instead of exiting directly
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-03 14:13:18 +02:00
Georgi Gerganov
e7da954ecc
metal : avoid uint ( #11019 )
2025-01-03 11:26:14 +02:00
Georgi Gerganov
f66f582927
llama : refactor src/llama.cpp ( #10902 )
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00
Pierrick Hymbert
2f0ee84b9b
server: bench: minor fixes ( #10765 )
...
* server/bench:
- support the OpenAI streaming standard output with [DONE]\n\n
- export k6 raw results in CSV
- fix too many idle TCP connections in tcp_wait
- add a metric for the time to emit the first token
* server/bench:
- fix the case when Prometheus is not started
- wait for the server to be ready before starting the bench
2025-01-02 18:06:12 +01:00
Xuan Son Nguyen
0da5d86026
server : allow using LoRA adapters per-request ( #10994 )
...
* slot.can_batch_with
* lora per request
* test: force disable cache prompt
* move can_batch_with check
* fix condition
* add slow test with llama 8b
* update docs
* move lora change task to queue
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* lora_base
* remove redundant check
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-02 15:05:18 +01:00
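For context: this change lets a client pick and scale LoRA adapters per request instead of fixing them at server start. A rough sketch of such a request, assuming a llama-server on localhost:8080; the `lora` field name and its `id`/`scale` shape are taken from the docs added in this PR, so verify against your server version:

```python
import requests

# Sketch: per-request LoRA selection against a running llama-server.
# The "lora" request field and its id/scale shape are assumptions based on
# the documentation added in this PR; check your server version's schema.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Write a haiku about autumn.",
        "n_predict": 64,
        "lora": [{"id": 0, "scale": 0.5}],  # scale adapter 0 to 50% for this request only
    },
    timeout=60,
)
print(resp.json()["content"])
```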
Benson Wong
a45433ba20
readme : add llama-swap to infrastructure section ( #11032 )
...
* list llama-swap under tools in README
* readme: add llama-swap to Infrastructure
2025-01-02 09:14:54 +02:00
Srihari-mcw
0827b2c1da
ggml : fixes for AVXVNNI instruction set with MSVC and Clang ( #11027 )
...
* Fixes for Clang AVX VNNI
* enable AVX VNNI and Alder Lake build for MSVC
* Apply suggestions from code review
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-12-31 15:23:33 +01:00
Xuan Son Nguyen
45095a61bf
server : clean up built-in template detection ( #11026 )
...
* server : clean up built-in template detection
* fix compilation
* add chat template test
* fix condition
2024-12-31 15:22:01 +01:00
Xuan Son Nguyen
5896c65232
server : add OAI compat for /v1/completions ( #10974 )
...
* server : add OAI compat for /v1/completions
* add test
* add docs
* better docs
2024-12-31 12:34:13 +01:00
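As a usage note: with this change llama-server accepts OpenAI-style requests on `/v1/completions` alongside its native API. A minimal sketch of a client call, assuming a server on localhost:8080; field defaults and the exact response shape may differ by version:

```python
import requests

# Sketch: OpenAI-style completions request; the endpoint path comes from the
# commit title, and the payload follows the standard OpenAI completions schema.
resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "default",               # often ignored or echoed by llama-server
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.7,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```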