1
0
mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-26 20:22:25 +01:00
Commit Graph

2147 Commits

Author SHA1 Message Date
Georgi Gerganov
413503d4b9
make : do not add linker flags when compiling static llava lib () 2023-11-07 20:25:32 +03:00
xaedes
e9c1cecb9d
ggml : fix backward rope after YaRN ()
* fix backward process of rope

rope backward process was broken after YaRN RoPE () implementation, due to missing changes in backward functions.

the code for the backward process is nearly identically to the forward process:
the only difference is the sign of the sin-values.

to avoid future regressions remove the near-duplicate backward functions and reuse the forward code:

for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
the sin-values will be negated when forward is false.

* fix finetune rope call to use correct default attn_factor of 1.0f

* remove unused `ggml_rope_xpos_back`

it is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.

* fix comments explaining the sinus sign in ggml_forward_rope

* add missing function arguments in declaration

* fix function argument type in declaration
2023-11-07 10:04:51 +02:00
Matthew Tejo
54b4df8886
Use params when loading models in llava-cli ()
llava-cli was loading models with default params and ignoring settings
from the cli. This switches to a generic function to load the params
from the cli options.
2023-11-07 10:43:59 +03:00
Meng Zhang
46876d2a2c
cuda : supports running on CPU for GGML_USE_CUBLAS=ON build ()
* protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build

* doc: add comments to ggml_cublas_loaded()

* fix defined(...)
2023-11-07 08:49:08 +02:00
Damian Stewart
381efbf480
llava : expose as a shared library for downstream projects ()
* wip llava python bindings compatibility

* add external llava API

* add base64 in-prompt image support

* wip refactor image loading

* refactor image load out of llava init

* cleanup

* further cleanup; move llava-cli into its own file and rename

* move base64.hpp into common/

* collapse clip and llava libraries

* move llava into its own subdir

* wip

* fix bug where base64 string was not removed from the prompt

* get libllava to output in the right place

* expose llava methods in libllama.dylib

* cleanup memory usage around clip_image_*

* cleanup and refactor *again*

* update headerdoc

* build with cmake, not tested (WIP)

* Editorconfig

* Editorconfig

* Build with make

* Build with make

* Fix cyclical depts on Windows

* attempt to fix build on Windows

* attempt to fix build on Windows

* Upd TODOs

* attempt to fix build on Windows+CUDA

* Revert changes in cmake

* Fix according to review comments

* Support building as a shared library

* address review comments

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-11-07 00:36:23 +03:00
slaren
2833a6f63c
ggml-cuda : fix f16 mul mat ()
* ggml-cuda : fix f16 mul mat

ggml-ci

* silence common.cpp warning (bonus)
2023-11-05 18:45:16 +01:00
Kerfuffle
d9ccce2e33
Allow common process_escapes to handle \x sequences ()
* Allow common process_escapes to handle \x sequences

* Fix edge case when second hex digit is NUL
2023-11-05 10:06:06 -07:00
Thái Hoàng Tâm
bb60fd0bf6
server : fix typo for --alias shortcut from -m to -a () 2023-11-05 18:15:27 +02:00
Jared Van Bortel
132d25b8a6
cuda : fix disabling device with --tensor-split 1,0 ()
Co-authored-by: slaren <slarengh@gmail.com>
2023-11-05 10:08:57 -05:00
Meng Zhang
3d48f42efc
llama : mark LLM_ARCH_STARCODER as full offload supported ()
as done in https://github.com/ggerganov/llama.cpp/pull/3827
2023-11-05 14:40:08 +02:00
Eve
c41ea36eaa
cmake : MSVC instruction detection (fixed up ) ()
* Add detection code for avx

* Only check hardware when option is ON

* Modify per code review sugguestions

* Build locally will detect CPU

* Fixes CMake style to use lowercase like everywhere else

* cleanup

* fix merge

* linux/gcc version for testing

* msvc combines avx2 and fma into /arch:AVX2 so check for both

* cleanup

* msvc only version

* style

* Update FindSIMD.cmake

---------

Co-authored-by: Howard Su <howard0su@gmail.com>
Co-authored-by: Jeremy Dunn <jeremydunn123@gmail.com>
2023-11-05 10:03:09 +02:00
Eve
a7fac013cf
ci : use intel sde when ci cpu doesn't support avx512 () 2023-11-05 09:46:44 +02:00
slaren
48ade94538
cuda : revert CUDA pool stuff ()
* Revert "cuda : add ROCM aliases for CUDA pool stuff ()"

This reverts commit 629f917cd6.

* Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available ()"

This reverts commit d6069051de.

ggml-ci
2023-11-05 09:12:13 +02:00
Kerfuffle
f28af0d81a
gguf-py: Support 01.AI Yi models () 2023-11-04 16:20:34 -06:00
Peter Sugihara
d9b33fe95b
metal : round up to 16 to fix MTLDebugComputeCommandEncoder assertion () 2023-11-03 21:18:18 +02:00
Xiao-Yong Jin
5ba3746171
ggml-metal: fix yarn rope () 2023-11-03 14:00:31 -04:00
slaren
abb77e7319
ggml-cuda : move row numbers to x grid dim in mmv kernels () 2023-11-03 12:13:09 +01:00
Georgi Gerganov
8f961abdc4
speculative : change default p_accept to 0.5 + CLI args ()
ggml-ci
2023-11-03 09:41:56 +02:00
Georgi Gerganov
05816027d6
common : YAYF (yet another YARN fix) ()
ggml-ci
2023-11-03 09:24:00 +02:00
cebtenzzre
3fdbe6b66b
llama : change yarn_ext_factor placeholder to -1 () 2023-11-03 08:31:58 +02:00
Kerfuffle
629f917cd6
cuda : add ROCM aliases for CUDA pool stuff () 2023-11-02 21:58:22 +02:00
Andrei
51b2fc11f7
cmake : fix relative path to git submodule index () 2023-11-02 21:40:31 +02:00
Georgi Gerganov
224e7d5b14
readme : add notice about 2023-11-02 20:44:12 +02:00
Georgi Gerganov
c7743fe1c1
cuda : fix const ptrs warning causing ROCm build issues () 2023-11-02 20:32:11 +02:00
Oleksii Maryshchenko
d6069051de
cuda : use CUDA memory pool with async memory allocation/deallocation when available ()
* Using cuda memory pools for async alloc/dealloc.

* If cuda device doesnt support memory pool than use old implementation.

* Removed redundant cublasSetStream

---------

Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>
2023-11-02 19:10:39 +02:00
Georgi Gerganov
4ff1046d75
gguf : print error for GGUFv1 files () 2023-11-02 16:22:30 +02:00
slaren
21958bb393
cmake : disable LLAMA_NATIVE by default () 2023-11-02 14:10:33 +02:00
Georgi Gerganov
2756c4fbff
gguf : remove special-case code for GGUFv1 ()
ggml-ci
2023-11-02 11:20:21 +02:00
Georgi Gerganov
1efae9b7dc
llm : prevent from 1-D tensors being GPU split () 2023-11-02 09:54:44 +02:00
cebtenzzre
b12fa0d1c1
build : link against build info instead of compiling against it ()
* cmake : fix build when .git does not exist

* cmake : simplify BUILD_INFO target

* cmake : add missing dependencies on BUILD_INFO

* build : link against build info instead of compiling against it

* zig : make build info a .cpp source instead of a header

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>

* cmake : revert change to CMP0115

---------

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 08:50:16 +02:00
Georgi Gerganov
4d719a6d4e
cuda : check if this fixes Pascal card regression () 2023-11-02 08:35:10 +02:00
Georgi Gerganov
183b3fac6c
metal : fix build errors and kernel sig after () 2023-11-02 08:33:37 +02:00
cebtenzzre
2fffa0d61f
cuda : fix RoPE after () 2023-11-02 07:49:44 +02:00
cebtenzzre
0eb332a10f
llama : fix llama_context_default_params after () 2023-11-01 19:29:14 -04:00
slaren
d02e98cde0
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel ()
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel

* fix warnings
2023-11-01 23:10:09 +01:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling ()
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Georgi Gerganov
c43c2da8af
llm : fix llm_build_kqv taking unused tensor (benign, ) 2023-11-01 23:08:30 +02:00
Georgi Gerganov
523e49b111
llm : fix falcon norm after refactoring () 2023-11-01 23:00:50 +02:00
Georgi Gerganov
e16b9fa4ba
metal : multi-simd softmax ()
ggml-ci
2023-11-01 21:25:00 +02:00
Georgi Gerganov
ff8f9a88da
common : minor () 2023-11-01 21:15:55 +02:00
Georgi Gerganov
50337961a6
llm : add llm_build_context ()
* llm : add llm_build_context

* llm : deduce norm eps based on type + explict max_alibi_bias, clamp_kqv

* llm : restore the non-graph llm_build_ functional API

ggml-ci

* llm : cleanup + comments
2023-11-01 20:11:02 +02:00
bandoti
0e40806c1c
common : allow caller to handle help/argument exceptions ()
* Allow caller to handle help/argument exceptions

* Prepend newline to usage output

* Add new gpt_params_parse_ex function to hide arg-parse impl

* Fix issue blocking success case

* exit instead of returning false

* Update common/common.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/common.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-01 19:42:01 +02:00
staviq
a2758d08e4
log : make generating separate log files optional ()
* impl --log-new, --log-append

* Update common/log.h

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

* Update common/log.h

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

* Apply suggestions from code review

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 16:18:27 +02:00
l3utterfly
e75dfdd31b
sampling : null grammar field after reset () 2023-11-01 15:40:43 +02:00
Georgi Gerganov
9a3b4f6c86
ggml : fix UNUSED macro () 2023-11-01 13:50:45 +02:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter ()
* Add '-ngl' support to finetune.cpp

* Add fprintf in ggml_cuda_op_add

When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora

* Add 'finetune.sh', which currently fails when using GPU

"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"

* tweak finetune.sh

* Suppress some warnings in ggml.c

* Add f16 implementation to ggml_compute_forward_add_f16_f32

* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs

* finetune.sh: Edit comments

* Add "add_f16_f32_f32_cuda"

* Tweak an error message

* finetune.sh: Add an optional LLAMA_MODEL_DIR variable

* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable

* train : minor

* tabs to spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Georgi Gerganov
f0e209324a
scripts : add server-llm.sh ()
* scripts : add deploy-server.sh

* scripts : rename to server-llm.sh

* scripts : working curl pipe
2023-11-01 11:29:07 +02:00
Adrian Hesketh
ca190bca8e
server : re-enable completion and embedded at the same time () 2023-11-01 11:28:28 +02:00
Georgi Gerganov
71e3718abd
llama : refactor graph build code ()
* llama : factor out ggml-alloc from graph graph build functions

ggml-ci

* metal : disable kernel load log

* llama : factor out tensor offloading outside the build call (wip)

ggml-ci

* llama : offload rest of the models

ggml-ci

* llama : update offload log messages to print node index

* llama : comments

* llama : support offloading result_norm + comments

* llama : factor graph input into a function

* llama : do tensor offload only with CUDA

* llama : fix res_norm offloading

* llama : try to optimize offloading code

* llama : fix non-CUDA build

* llama : try to fix build

* llama : move refact in correct place + optimize graph input

* llama : refactor tensor offloading as callback

* llama : add layer index to all tensor names

* llama : add functional header

* llama : comment

ggml-ci

* llama : remove obsolete map for layer counting

* llama : add llm_build helper functions ()

* llama : add llm_build_norm helper function

ggml-ci

* llama : add llm_build_ffn helper function ()

ggml-ci

* llama : add llm_build_k_shift helper

ggml-ci

* llama : fix offloading after recent changes

* llama : add llm_build_kv_store helper

ggml-ci

* llama : remove obsolete offload names

* llama : fix llm_build_k_shift to use n_head_kv instead of n_head

* llama : simplify falcon Q, K, V computation

* llama : remove obsolete comments in build graphs

* llama : add llm_build_kqv helper

ggml-ci

* llama : minor

* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading

* llama : fix input allocation logic

* llama : update offload functions for KQ tensors

* llama : normalize tensor names

ggml-ci

* llama : enable warning about not offloaded tensors

* llama : remove extra ; + deduplicate gate_b logic

* llama : add llm_build_inp_embd helper
2023-11-01 08:04:02 +02:00
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] ()
* Introduce the new Min-P sampler by @kalomaze
   The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.

* Min-P enabled and set to 0.05 default

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00