arch-btw
ad76569f8e
common : Update stb_image.h to latest version ( #9161 )
...
* Update stb_image.h to latest version
Fixes https://github.com/ggerganov/llama.cpp/issues/7431
* Update .ecrc
b3632
2024-08-27 08:58:50 +03:00
slaren
7d787ed96c
ggml : do not crash when quantizing q4_x_x with an imatrix ( #9192 )
b3631
2024-08-26 19:44:43 +02:00
Georgi Gerganov
06658ad7c3
metal : separate scale and mask from QKT in FA kernel ( #9189 )
...
* metal : separate scale and mask from QKT in FA kernel
* metal : ne01 check no longer necessary
* metal : keep data in local memory
b3630
2024-08-26 18:31:02 +03:00
Georgi Gerganov
fc18425b6a
ggml : add SSM Metal kernels ( #8546 )
...
* ggml : add ggml_ssm_conv metal impl
* ggml : add ssm_scan metal impl
ggml-ci
b3629
2024-08-26 17:55:36 +03:00
Georgi Gerganov
879275ac98
tests : fix compile warnings for unreachable code ( #9185 )
...
ggml-ci
b3628
2024-08-26 16:30:25 +03:00
Georgi Gerganov
7a3df798fc
ci : add VULKAN support to ggml-ci ( #9055 )
b3627
2024-08-26 12:19:39 +03:00
Georgi Gerganov
e5edb210cd
server : update deps ( #9183 )
2024-08-26 12:16:57 +03:00
slaren
0c41e03ceb
metal : gemma2 flash attention support ( #9159 )
b3625
2024-08-26 11:08:59 +02:00
slaren
f12ceaca0c
ggml-ci : try to improve build time ( #9160 )
2024-08-26 11:03:30 +02:00
Justine Tunney
436787f170
llama : fix time complexity of string replacement ( #9163 )
...
This change fixes a bug where replacing text in a very long string could
cause llama.cpp to hang indefinitely. The previous algorithm was quadratic:
calling s.replace() in a loop triggers a memmove() of the remaining text on
every match. Most search results and LLM responses suggest that same O(n**2)
approach. Appending into a builder string instead makes the replacement
linear (a sketch follows this entry).
b3623
2024-08-26 09:09:53 +03:00
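The commit above replaces an in-place s.replace() loop with a builder-string approach. A minimal sketch of the idea (illustrative helper, not necessarily the exact code merged in #9163):

```cpp
#include <string>
#include <utility>

// Replace every occurrence of `search` in `s` by appending into a separate
// builder string, so each character is copied at most once: O(n) overall
// instead of the O(n^2) behaviour of s.replace() in a loop.
static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
    if (search.empty()) {
        return;
    }
    std::string builder;
    builder.reserve(s.size());
    size_t last = 0;
    size_t pos  = 0;
    while ((pos = s.find(search, last)) != std::string::npos) {
        builder.append(s, last, pos - last); // copy the unmatched chunk
        builder += replace;                  // then the replacement
        last = pos + search.size();
    }
    builder.append(s, last, std::string::npos); // copy the tail
    s = std::move(builder);
}
```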
Herman Semenov
93bc3839f9
common : fix --n-gpu-layers-draft argument not being found ( #9175 )
b3622
2024-08-26 00:54:37 +02:00
Johannes Gäßler
f91fc5639b
CUDA: fix Gemma 2 numerical issues for FA ( #9166 )
b3621
2024-08-25 22:11:48 +02:00
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support ( #8542 )
...
* CPU/CUDA: Gemma 2 FlashAttention support
* apply logit_softcap to scale in kernel
* disable logit softcapping tests on Metal
* remove metal check
b3620
2024-08-24 21:34:59 +02:00
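For context on the logit_softcap bullet above: Gemma 2 soft-caps attention logits with a scaled tanh, and the commit folds the 1/softcap factor into the existing scale. A hedged sketch of the general formula (not the kernel code itself):

```cpp
#include <cmath>

// Soft-capping keeps the scaled QK^T score within (-softcap, +softcap).
// Folding 1/softcap into the existing scale ("apply logit_softcap to scale
// in kernel") means only one extra multiply plus a tanh per score.
static float softcapped_logit(float qk_dot, float scale, float softcap) {
    const float x = qk_dot * (scale / softcap); // scale and 1/softcap folded together
    return softcap * std::tanh(x);
}
```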
João Dinis Ferreira
8f824ffe8e
quantize : fix typo in usage help of quantize.cpp ( #9145 )
b3619
2024-08-24 09:22:45 +03:00
Xuan Son Nguyen
3ba780e2a8
lora : fix llama conversion script with ROPE_FREQS ( #9117 )
b3618
2024-08-23 12:58:53 +02:00
piDack
a07c32ea54
llama : use F32 precision in GLM4 attention and no FA ( #9130 )
gguf-v0.10.0
b3617
2024-08-23 10:27:17 +03:00
Akarshan Biswas
11b84eb457
[SYCL] Add a space to suppress a cmake warning ( #9133 )
b3616
2024-08-22 22:09:47 +08:00
luoyu-intel
1731d4238f
[SYCL] Add oneDNN primitive support ( #9091 )
...
* add onednn
* add sycl_f16
* add dnnl stream
* add engine map
* use dnnl for intel only
* use fp16fp16fp16
* update doc
b3615
2024-08-22 12:50:10 +08:00
compilade
a1631e53f6
llama : simplify Mamba with advanced batch splits ( #8526 )
...
* llama : advanced batch splits
This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.
* llama : always make recurrent state slots contiguous
* ggml : simplify mamba operators
* llama : fix integer signedness mixing
* llama : logits_all has priority over batch->logits
Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.
* llama : apply suggestions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : fix t5 segfault
* llama : fix Mamba session save and restore
* llama : minor cosmetic changes
* llama : rename llama_reorder_outputs to llama_output_reorder
Also move it closer to llama_output_reserve.
* llama : fix pooled embeddings when using batches with equal_seqs
* minor : add struct members for clarity
ggml-ci
* llama : fix T5 segfault again
* llama : fix Mamba pooled embeddings with multiple sequences
Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.
* llama : add llama_model_is_recurrent to simplify figuring that out
This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
* llama : fix simple splits when the batch contains embeddings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b3614
2024-08-21 17:58:11 -04:00
Xuan Son Nguyen
fc54ef0d1c
server : support reading arguments from environment variables ( #9105 )
...
* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
b3613
2024-08-21 11:04:34 +02:00
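The server entry above adds environment-variable fallbacks for command-line arguments. A minimal sketch of the pattern (the variable name below is illustrative; see the server README for the actual names):

```cpp
#include <cstdlib>
#include <string>

// Resolve a setting from the CLI argument if given, otherwise from an
// environment variable, otherwise from a built-in default.
static std::string arg_or_env(const char * cli_value, const char * env_name, const char * def) {
    if (cli_value && *cli_value) {
        return cli_value;                 // explicit CLI argument wins
    }
    if (const char * v = std::getenv(env_name)) {
        return v;                         // otherwise fall back to the environment
    }
    return def;                           // finally, the built-in default
}

// hypothetical usage: arg_or_env(cli_model, "LLAMA_ARG_MODEL", "model.gguf")
```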
Younes Belkada
b40eb84895
llama : support for falcon-mamba architecture ( #9074 )
...
* feat: initial falcon-mamba support in llama.cpp
* fix: lint
* refactor: better refactor
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* fix: address comments
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* fix: add more cleanup and harmonization
* fix: lint
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <git@compilade.net>
* fix: change name
* Apply suggestions from code review
Co-authored-by: compilade <git@compilade.net>
* add in operator
* fix: add `dt_b_c_rms` in `llm_load_print_meta`
* fix: correct printf format for bool
* fix: correct print format
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* llama : quantize more Mamba tensors
* llama : use f16 as the fallback of fallback quant types
---------
Co-authored-by: compilade <git@compilade.net>
b3612
2024-08-21 11:06:36 +03:00
fairydreaming
f63f603c87
llava : zero-initialize clip_ctx structure fields with aggregate initialization 908)
...
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b3611
2024-08-21 09:45:49 +02:00
Daniel Bevenius
8455340b87
llama : std::move llm_bigram_bpe from work_queue ( #9062 )
...
* llama : std::move llm_bigram_bpe from work_queue
This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.
The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.
* squash! llama : std::move llm_bigram_bpe from work_queue
Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.
* squash! llama : std::move llm_bigram_bpe from work_queue
Rename MovablePriorityQueue to lama_priority_queue.
* squash! llama : std::move llm_bigram_bpe from work_queue
Rename lama_priority_queue -> llama_priority_queue.
b3610
2024-08-21 10:32:58 +03:00
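The std::move change above works around std::priority_queue::top() returning a const reference, which forces a copy of each llm_bigram_bpe (and its std::string `text`) when it is taken off the queue. A minimal sketch of a move-enabled queue in the spirit of llama_priority_queue (assumed shape, not the exact merged code):

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Expose the protected container `c` of std::priority_queue so the top
// element can be moved out instead of copied.
template <typename T, typename Container = std::vector<T>,
          typename Compare = std::less<typename Container::value_type>>
struct movable_priority_queue : std::priority_queue<T, Container, Compare> {
    T pop_move() {
        std::pop_heap(this->c.begin(), this->c.end(), this->comp); // move the max element to the back
        T item = std::move(this->c.back());
        this->c.pop_back();
        return item;
    }
};
```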
Changyeon Kim
2f3c1466ff
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. ( #8984 )
...
* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.
- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* fix-up coding style.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* Fix-up the missing initial parameter to resolve the compilation warning.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Add missing parameters.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Use nb1 and nb2 for dst.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* Fix result checks for the ggml_acc call
---------
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>
b3609
2024-08-20 21:00:00 +02:00
Meng, Hengyu
50addec9a5
[SYCL] fallback mmvq ( #9088 )
...
* fallback mmvq to mul_mat
* mmvq in cuda path
* Update ggml/src/ggml-sycl.cpp
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
---------
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
b3608
2024-08-20 23:50:17 +08:00
zhentaoyu
4f8d19ff17
[SYCL] Fix SYCL im2col and convert Overflow with Large Dims ( #9052 )
...
* sycl: fix im2col overflow and sync with cuda
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: fix convert overflow
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: fix convert and dequantize
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: fix ib in dmmv
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl:refine convert
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: move downsample global_range into common
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* test: add im2col and convert test cases
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* test: make new cases only in sycl
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* test: comment new test_cases for only local testing
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
---------
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
b3607
2024-08-20 23:06:51 +08:00
fairydreaming
90db8146d5
tests : add missing comma in grammar integration tests ( #9099 )
...
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b3606
2024-08-20 12:09:55 +03:00
wangshuai09
cfac111e2b
cann: add doc for cann backend ( #8867 )
...
Co-authored-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2024-08-19 16:46:38 +08:00
Radoslav Gerganov
1b6ff90ff8
rpc : print error message when failed to connect endpoint ( #9042 )
b3604
2024-08-19 10:11:45 +03:00
Radoslav Gerganov
18eaf29f4c
rpc : prevent crashes on invalid input ( #9040 )
...
Add more checks which prevent RPC server from crashing if invalid input
is received from client
b3603
2024-08-19 10:10:21 +03:00
Georgi Gerganov
554b049068
flake.lock: Update ( #9068 )
2024-08-18 07:43:32 -07:00
ltoniazzi
2339a0be1c
tests : add integration test for lora adapters ( #8957 )
...
* Add printing to check weights match torch version
* minor code style changes
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-08-18 11:58:04 +02:00
Yoshi Suhara
2fb9267887
Fix incorrect use of ctx_split for bias tensors ( #9063 )
b3600
2024-08-17 15:34:21 +02:00
Xuan Son Nguyen
8b3befc0e2
server : refactor middleware and /health endpoint ( #9056 )
...
* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix server tests
* fix CI
* update server docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b3599
2024-08-16 17:19:05 +02:00
tc-mb
d565bb2fd5
llava : support MiniCPM-V-2.6 ( #8967 )
...
* init
* rename
* add run android for termux in readme
* add android readme
* add instructions in readme
* change name in readme
* Update README.md
* fixed line
* add result in readme
* random pos_embed
* add positions index
* change for ollama
* change for ollama
* better pos_embed in clip
* support ollama
* update cmakelist
* update cmakelist
* rename wrapper
* clear code
* replace and organize code
* add link
* sync master
* fix warnings
* fix warnings
* fix bug in bicubic resize when the image needs to be resized smaller
* receive review comments and modify
* receive review comments and modify
* put all code into llava dir
* fix quality problem in pr code
* change n_layer
* add space in "-1"
* imitate reshape bug of python code
* fix bug in clip
* fix issues for merging
* fix llama-minicpmv-cli in cmake file
* change pr readme
* fix code review
* remove the directory entry on line 33 of the root CMakeLists.txt (not in the example, in the main dir)
* fix cmakefile
* add warn
* fix KEY_HAS_MINICPMV_PROJ
* remove load_image_size into clip_ctx
* remove the extern "C", MINICPMV_API
* fix uhd code for review comment
* delete minicpmv-wrapper in pr
* remove uhd_image_embed
* Modify 2 notes
* support minicpmv2.6
* modify convert script of minicpmv
* modify convert
* modify convert
* add readme
* add resampler of v2.6
* modify clip
* modify readme
* fix type-check
* fix type-check
* fix type-check
* fix type-check
* modify convert script and readme
* fix convert script and readme
* fix convert
* fix num in convert
* fix type-check
---------
Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
b3598
2024-08-16 16:34:41 +03:00
Farbod Bijary
ee2984bdaf
py : fix wrong input type for raw_dtype in ggml to gguf scripts ( #8928 )
...
Co-authored-by: farbod <farbod.bjary82@gmail.com>
2024-08-16 13:36:30 +03:00
Aisuko
c8ddce8560
Fix inference example lacking required parameters ( #9035 )
...
Signed-off-by: Aisuko <urakiny@gmail.com>
2024-08-16 11:08:59 +02:00
compilade
23fd453544
gguf-py : bump version from 0.9.1 to 0.10.0 ( #9051 )
b3595
2024-08-16 09:36:11 +03:00
Minsoo Cheong
c679e0cb5c
llama : add EXAONE model support ( #9025 )
...
* add exaone model support
* add chat template
* fix whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add ftype
* add exaone pre-tokenizer in `llama-vocab.cpp`
Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>
* fix lint
Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>
* add `EXAONE` to supported models in `README.md`
* fix space
Co-authored-by: compilade <git@compilade.net>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
2024-08-16 09:35:18 +03:00
Liu Jia
fb487bb567
common : add support for cpu_get_num_physical_cores() on Windows ( #8771 )
...
* Add support for cpu_get_num_physical_cores() on Windows
* fix build bug on msys2-clang64 and ucrt64
* avoid adding new function
* add new macros to avoid windows+mingw64
* Add error checking to return default value
b3593
2024-08-16 09:23:12 +03:00
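A hedged sketch of how physical cores can be counted on Windows, in the spirit of the commit above (standard Win32 API usage, not necessarily the merged code; the fallback mirrors the "return default value" bullet):

```cpp
#include <windows.h>
#include <vector>

// Count physical cores by enumerating RelationProcessorCore records;
// return `fallback` if the API is unavailable or fails.
static int count_physical_cores_win32(int fallback) {
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
    if (GetLastError() != ERROR_INSUFFICIENT_BUFFER || len == 0) {
        return fallback;
    }
    std::vector<char> buf(len);
    auto * info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buf.data());
    if (!GetLogicalProcessorInformationEx(RelationProcessorCore, info, &len)) {
        return fallback;
    }
    int cores = 0;
    for (DWORD off = 0; off < len; ) {
        auto * rec = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buf.data() + off);
        if (rec->Relationship == RelationProcessorCore) {
            cores++;                     // one record per physical core
        }
        off += rec->Size;
    }
    return cores > 0 ? cores : fallback;
}
```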
Yoshi Suhara
2a24c8caa6
Add Nemotron/Minitron GGUF Conversion & Inference Support ( #8922 )
...
* Add nemotron GGUF conversion & inference support
* Fix formatting issues
* Remove unnecessary write_tensors()
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* Address comments by @compilade
* Replace ggml_mul_mat()->llm_build_lora_mm()
* Remove mutable variable
* Use for bias tensors
* Cover corner case where rope_scaling is not in config.json
---------
Co-authored-by: compilade <git@compilade.net>
b3592
2024-08-16 04:23:33 +02:00
Nico Bosshard
e3f6fd56b1
ggml : dynamic ggml_sched_max_splits based on graph_size ( #9047 )
...
* ggml : Dynamic ggml_sched_max_splits based on graph_size
* Fixed and readded debug code for causes
b3591
2024-08-16 04:22:55 +02:00
gtygo
4b9afbbe90
retrieval : fix memory leak in retrieval query handling ( #8955 )
...
* retrieval
* Reuse querybatch to reduce frequent memory allocation
* delete unused white space
b3590
2024-08-15 10:40:12 +03:00
Riceball LEE
37501d9c79
server : fix duplicated n_predict key in the generation_settings ( #8994 )
b3589
2024-08-15 10:28:05 +03:00
Zhenwei Jin
4af8420afb
common : remove duplicate function llama_should_add_bos_token ( #8778 )
b3588
2024-08-15 10:23:23 +03:00
Esko Toivonen
6bda7ce6c3
llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish ( #8850 )
b3587
2024-08-15 10:17:12 +03:00
Georgi Gerganov
d5492f0525
ci : disable bench workflow ( #9010 )
2024-08-15 10:11:11 +03:00
Jiří Podivín
234b30676a
server : init stop and error fields of the result struct ( #9026 )
...
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
b3585
2024-08-15 09:21:57 +03:00
0cc4m
5fd89a70ea
Vulkan Optimizations and Fixes ( #8959 )
...
* Optimize Vulkan REPEAT performance
* Use Vulkan GLSL fused multiply-add instruction where possible
* Add GGML_VULKAN_PERF option to output performance data per operator
* Rework and fix Vulkan descriptor set and descriptor pool handling
* Fix float32 concat f16 shader validation error
* Add Vulkan GROUP_NORM eps parameter
* Fix validation error with transfer queue memory barrier flags
* Remove trailing whitespaces
b3584
2024-08-14 18:32:53 +02:00
compilade
98a532d474
server : fix segfault on long system prompt ( #8987 )
...
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment
b3583
2024-08-14 09:51:02 +03:00