Georgi Gerganov
afb3929279
Merge branch 'master' into llama-refactor
2023-10-31 20:35:31 +02:00
Tungsten842
07178c98e1
flake.nix: fix for rocm 5.7 ( #3853 )
2023-10-31 19:24:03 +02:00
Georgi Gerganov
5baefef497
llama : add llm_build helper functions ( #3848 )
...
* llama : add llm_build_norm helper function (see the sketch after this entry)
ggml-ci
* llama : add llm_build_ffn helper function (#3849 )
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
2023-10-31 19:23:12 +02:00
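The helpers listed above all share one shape: take the current graph tensor plus the relevant weights, append the ops, and return the new tensor. A minimal sketch of the llm_build_norm pattern, written against the public ggml C API only; the actual llama.cpp signature may differ:

```c++
#include "ggml.h"

// Hedged sketch of an llm_build_norm-style helper: it factors out the
// norm -> scale -> (optional) bias sequence that every layer repeats.
// The enum and signature are illustrative, not the exact llama.cpp code.
enum llm_norm_type { LLM_NORM, LLM_NORM_RMS };

static ggml_tensor * llm_build_norm(
        ggml_context * ctx,
        ggml_tensor  * cur,
        ggml_tensor  * mw,   // norm weight (scale), may be NULL
        ggml_tensor  * mb,   // norm bias, may be NULL
        llm_norm_type  type,
        float          eps) {
    switch (type) {
        case LLM_NORM:     cur = ggml_norm    (ctx, cur, eps); break;
        case LLM_NORM_RMS: cur = ggml_rms_norm(ctx, cur, eps); break;
    }
    if (mw) cur = ggml_mul(ctx, cur, mw); // elementwise scale
    if (mb) cur = ggml_add(ctx, cur, mb); // optional bias
    return cur;
}
```

Factoring each repeated pattern into one place also gives every intermediate tensor a single naming point, which the tensor-name normalization and the offloading-callback refactor further down this log build on.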
Georgi Gerganov
207b51900e
ggml : move FP16 <-> FP32 code to ggml-impl.h ( #3861 )
...
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
2023-10-30 19:19:15 +02:00
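For context on what moved: ggml's FP16 -> FP32 conversion is, on most platforms, a lookup table indexed by the raw 16-bit pattern. A simplified sketch using the new ggml_ table prefix this commit introduces; the real header differs in detail:

```c++
#include <cstdint>
#include <cstring>

// Simplified sketch of the table-based FP16 -> FP32 conversion that the
// commit moves into ggml-impl.h; not the exact header. The table holds the
// FP32 value of every possible 16-bit pattern (64 Ki entries, 256 KiB) and
// is filled once at init time, so each conversion is one indexed load.
typedef uint16_t ggml_fp16_t;

extern float ggml_table_f32_f16[1 << 16]; // note the new ggml_ prefix

static inline float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    std::memcpy(&s, &f, sizeof(uint16_t)); // bit copy avoids aliasing issues
    return ggml_table_f32_f16[s];
}
```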
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence ( #3843 )
...
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that delete by position to use llama_kv_cache_seq_rm instead (usage sketch below)
2023-10-29 11:31:40 -06:00
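A usage sketch of the extended call, assuming the llama.h convention of this period; the seq_id = -1 wildcard is what this commit adds, and negative p0/p1 were already open-ended:

```c++
#include "llama.h"

// Hedged usage sketch; the exact semantics live in llama.h of this era.
static void kv_cache_examples(llama_context * ctx) {
    // drop positions [10, 20) from sequence 1 only
    llama_kv_cache_seq_rm(ctx, 1, 10, 20);

    // drop the same positions from *any* sequence: the new seq_id = -1 wildcard
    llama_kv_cache_seq_rm(ctx, -1, 10, 20);

    // full clear: the replacement for llama_kv_cache_tokens_rm
    llama_kv_cache_clear(ctx);
}
```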
cebtenzzre
2046eb4345
make : remove unnecessary dependency on build-info.h ( #3842 )
2023-10-29 18:33:47 +02:00
Georgi Gerganov
71a09da301
llama : fix kv shift bug ( #3835 )
...
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring ( #3833 )
...
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
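Conceptually, --pure bypasses the per-tensor type selection that makes a "k-quant mixture". A sketch with hypothetical helper names, not the actual quantize code:

```c++
#include "ggml.h"

// Hypothetical heuristic, declared only to make the sketch self-contained.
ggml_type pick_k_quant_mixture_type(const ggml_tensor * tensor, ggml_type base);

// Illustrative only: what --pure changes.
static ggml_type choose_tensor_type(const ggml_tensor * tensor,
                                    ggml_type default_type, bool pure) {
    if (pure) {
        return default_type; // --pure: every tensor gets the requested type
    }
    // otherwise a heuristic may upgrade sensitive tensors (e.g. output,
    // attn_v) to a higher-bit k-quant -- the "mixture" being disabled
    return pick_k_quant_mixture_type(tensor, default_type);
}
```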
Georgi Gerganov
210e6e5d02
llama : remove obsolete map for layer counting
2023-10-29 13:39:04 +02:00
Georgi Gerganov
79ad734417
llama : comment
...
ggml-ci
2023-10-29 13:27:53 +02:00
Georgi Gerganov
761087932b
llama : add functional header
2023-10-29 13:26:32 +02:00
Georgi Gerganov
8925cf9ef8
llama : add layer index to all tensor names
2023-10-29 13:22:15 +02:00
Georgi Gerganov
1e9c5443c2
llama : refactor tensor offloading as callback
2023-10-29 13:05:10 +02:00
Georgi Gerganov
da936188d8
llama : move refact in correct place + optimize graph input
2023-10-29 11:48:58 +02:00
Georgi Gerganov
739b85c985
llama : try to fix build
2023-10-29 11:25:32 +02:00
Georgi Gerganov
25cfbf6776
llama : fix non-CUDA build
2023-10-29 11:12:03 +02:00
Georgi Gerganov
b4ad03b3a7
llama : try to optimize offloading code
2023-10-29 10:33:11 +02:00
Georgi Gerganov
79617902ea
llama : fix res_norm offloading
2023-10-29 09:20:35 +02:00
Georgi Gerganov
e14aa46151
llama : do tensor offload only with CUDA
2023-10-29 08:03:46 +02:00
Georgi Gerganov
0dc05b8433
llama : factor graph input into a function
2023-10-29 07:52:43 +02:00
Georgi Gerganov
4e98897ede
llama : support offloading result_norm + comments
2023-10-29 07:36:07 +02:00
Georgi Gerganov
51c4f9ee9f
llama : comments
2023-10-28 22:50:08 +03:00
Georgi Gerganov
3af8771389
llama : update offload log messages to print node index
2023-10-28 22:36:44 +03:00
Georgi Gerganov
83d2c43791
llama : offload rest of the models
...
ggml-ci
2023-10-28 22:30:54 +03:00
Georgi Gerganov
38aca9e1ab
llama : factor out tensor offloading outside the build call (wip)
...
ggml-ci
2023-10-28 21:22:31 +03:00
Georgi Gerganov
5946d98fc8
metal : disable kernel load log
2023-10-28 21:22:01 +03:00
Georgi Gerganov
8b2420d249
llama : factor out ggml-alloc from graph build functions
...
ggml-ci
2023-10-28 19:54:28 +03:00
Erik Scholz
ff3bad83e2
flake : update flake.lock for newer transformers version + provide extra dev shell ( #3797 )
...
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02
metal : try cwd for ggml-metal.metal if bundle lookup fails ( #3793 )
...
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `nil` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-28 15:43:01 +03:00
Georgi Gerganov
ba231e8a6d
issues : change label from bug to bug-unconfirmed ( #3748 )
2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29
convert : ignore tokens if their IDs are within [0, vocab_size) ( #3831 )
2023-10-28 06:25:15 -06:00
Kerfuffle
bd6d9e2059
llama : allow quantizing k-quants to fall back when tensor size incompatible ( #3747 )
...
* Allow quantizing k-quants to fall back when the tensor size is incompatible
* quantizing: add a warning when tensors are incompatible with k-quants
Clean up k-quants state passing a bit (fallback sketched below)
2023-10-28 14:54:24 +03:00
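The incompatibility in question: k-quants operate on super-blocks of QK_K = 256 values (in the default build), so a tensor whose row length is not a multiple of 256 cannot be k-quantized. A minimal sketch of the fallback, with illustrative names:

```c++
#include <cstdio>
#include "ggml.h"

// Hedged sketch of the fallback: rows that are not a multiple of the
// k-quant super-block size cannot be k-quantized, so substitute a
// compatible type and warn. Not the actual llama.cpp quantize code.
static ggml_type maybe_fall_back(const ggml_tensor * tensor, ggml_type wanted) {
    const int64_t qk_k = 256; // ggml's QK_K in the default build
    if (tensor->ne[0] % qk_k != 0) {
        std::fprintf(stderr,
                "warning: tensor %s has row size %lld, incompatible with "
                "k-quants; falling back to q8_0\n",
                ggml_get_name(tensor), (long long) tensor->ne[0]);
        return GGML_TYPE_Q8_0; // no super-block constraint
    }
    return wanted;
}
```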
Georgi Gerganov
ee1a0ec9cb
llama : add option for greedy sampling with probs ( #3813 )
...
* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs (see the sketch below)
2023-10-28 14:23:11 +03:00
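The last bullet encodes a convention that looks roughly like this in code (a hedged sketch against the llama.h sampling API of this period, not the exact source):

```c++
#include "llama.h"

// Hedged sketch of the convention:
//   temp == 0.0f -> plain greedy, probabilities never computed
//   temp <  0.0f -> same greedy result, but softmax runs first so callers
//                   can read normalized probabilities from `candidates`
static llama_token sample_with_temp(llama_context * ctx,
                                    llama_token_data_array & candidates,
                                    float temp) {
    if (temp < 0.0f) {
        llama_sample_softmax(ctx, &candidates); // sorts + normalizes probs
        return candidates.data[0].id;           // argmax after the sort
    }
    if (temp == 0.0f) {
        return llama_sample_token_greedy(ctx, &candidates);
    }
    llama_sample_temp(ctx, &candidates, temp);  // usual temperature path
    return llama_sample_token(ctx, &candidates);
}
```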
Henk Poley
177461104b
common : print that one line of the syntax help *also* to standard output ( #3823 )
2023-10-28 13:16:33 +03:00
Georgi Gerganov
fdee152e4e
starcoder : add GPU offloading ( #3827 )
...
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
2023-10-28 12:06:08 +03:00
Kerfuffle
41aee4df82
speculative : ensure draft and target model vocab matches ( #3812 )
...
* speculative: Ensure draft and target model vocab matches
* Tolerate small differences when checking draft vs. target vocab (a sketch follows below)
2023-10-28 00:40:07 +03:00
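A hedged sketch of such a check; the constant values and exact accessor signatures are assumptions, not the commit's code:

```c++
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include "llama.h"

// Hedged sketch of a draft/target vocab compatibility check in the spirit
// of this commit: sizes may differ by a small amount, but token text must
// agree for speculation to be sound.
static bool spec_vocab_matches(llama_context * ctx_tgt, llama_context * ctx_dft) {
    const int max_size_difference  = 100; // assumed tolerated size delta
    const int check_start_token_id = 5;   // assumed: skip low-ID special tokens

    const int n_tgt = llama_n_vocab(llama_get_model(ctx_tgt));
    const int n_dft = llama_n_vocab(llama_get_model(ctx_dft));
    if (std::abs(n_tgt - n_dft) > max_size_difference) {
        return false;
    }
    for (int i = check_start_token_id; i < std::min(n_tgt, n_dft); ++i) {
        if (std::strcmp(llama_token_get_text(ctx_tgt, i),
                        llama_token_get_text(ctx_dft, i)) != 0) {
            return false; // token text diverges -> vocabs don't match
        }
    }
    return true;
}
```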
cebtenzzre
6d459cbfbe
llama : correctly report GGUFv3 format ( #3818 )
2023-10-27 17:33:53 -04:00
Thibault Terrasson
c8d6a1f34a
simple : fix batch handling ( #3803 )
2023-10-27 08:37:41 -06:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance ( #3776 )
...
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
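A rough sketch of the resulting dispatch; function and threshold names are hypothetical, not the actual ggml-cuda.cu:

```c++
#include <cstdint>

// Illustrative dispatch only. Per the bullets above: quantized mat-mul
// (MMQ) kernels for small batches, and dequantize + cuBLAS (tensor cores,
// when CUDA_USE_TENSOR_CORES applies) once the batch is large enough.
static bool should_use_mmq(bool tensor_cores_usable, int64_t batch_size) {
#ifdef GGML_CUDA_FORCE_MMQ
    (void) tensor_cores_usable; (void) batch_size;
    return true;                            // build option: always take MMQ
#else
    const int64_t mmq_max_batch_size = 32;  // assumed threshold value
    return !tensor_cores_usable || batch_size <= mmq_max_batch_size;
#endif
}
```

Roughly, MMQ skips the dequantization pass, which favors single-token decoding; the cuBLAS path amortizes that pass over many rows, which is why it takes over for batched decoding and prompt processing.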
Georgi Gerganov
34b2a5e1ee
server : do not release slot on image input ( #3798 )
2023-10-26 22:54:17 +03:00
Georgi Gerganov
6961c4bd0b
batched-bench : print params at start
2023-10-25 10:26:27 +03:00
Georgi Gerganov
cc44877486
log : disable pid in log filenames
2023-10-25 10:09:16 +03:00
cebtenzzre
ad93962657
server : add parameter -tb N, --threads-batch N ( #3584 ) ( #3768 )
...
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2023-10-24 23:10:43 +03:00
Georgi Gerganov
1717521cdb
server : do not block system prompt update ( #3767 )
...
* server : do not block system prompt update
* server : update state machine logic to process system prompts
* server : minor
2023-10-24 23:08:20 +03:00
Georgi Gerganov
b2f7e04bd3
sync : ggml (conv ops + cuda MSVC fixes) ( #3765 )
...
ggml-ci
2023-10-24 21:51:20 +03:00
John Smith
abd21fc99f
cmake : add missed dependencies ( #3763 )
2023-10-24 20:48:45 +03:00
Georgi Gerganov
2b4ea35e56
cuda : add batched cuBLAS GEMM for faster attention ( #3749 )
...
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCm. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I've tried to create a multi-part review, so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-10-24 16:48:37 +03:00
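For reference, the strided-batched variant mentioned in the bullets has roughly this shape (a sketch, not the actual ggml-cuda call site; with a hipblasGemmStridedBatchedEx define, analogous to the review comment above, the same call builds for ROCm):

```c++
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Minimal sketch of a strided-batched GEMM of the kind the PR adds for
// attention-style products: one GEMM per head/batch entry, strided through
// contiguous buffers. Status-code checking omitted for brevity.
static void gemm_strided_batched_f16(cublasHandle_t handle,
        int m, int n, int k, int batch_count,
        const __half * A, long long strideA,   // e.g. K, transposed per batch
        const __half * B, long long strideB,   // e.g. Q
        __half       * C, long long strideC) { // e.g. KQ
    const __half alpha = __float2half(1.0f);  // FP16 compute -> half scalars
    const __half beta  = __float2half(0.0f);
    cublasGemmStridedBatchedEx(handle,
            CUBLAS_OP_T, CUBLAS_OP_N,
            m, n, k,
            &alpha,
            A, CUDA_R_16F, k, strideA,
            B, CUDA_R_16F, k, strideB,
            &beta,
            C, CUDA_R_16F, m, strideC,
            batch_count,
            CUBLAS_COMPUTE_16F,
            CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```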
Galunid
daab3d7f45
Add more tokenizer tests ( #3742 )
...
* Add more tokenizer tests
* Add starcoder
* Update test vocab files
* Restrict bpe tokenizer tests to unicode planes
* Update comment
* Comment cosmetics
* Remove bloom vocab/test
2023-10-24 09:17:17 +02:00
Georgi Gerganov
469c9addef
metal : handle ggml_scale for n%4 != 0 ( close #3754 )
...
ggml-ci
2023-10-24 09:47:22 +03:00
Georgi Gerganov
e3932593d4
Revert "make : add optional CUDA_NATIVE_ARCH ( #2482 )"
...
This reverts commit 96981f37b1.
See:
https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
2023-10-23 23:46:05 +03:00