Commit Graph

3091 Commits

Author SHA1 Message Date
Francis Couture-Harpin
fee3c1d740 llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.
2024-06-03 13:54:39 -04:00
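A minimal sketch of the idea named in the commit title above: a depthwise 1-D convolution written as an element-wise multiply followed by a row sum. It uses plain Python lists, not ggml tensors. The function name, the data layout (one list per channel, with the previous conv state already prepended to the input), and the sliding-window step are assumptions made for illustration; this is not the code from the commit.

```python
def conv_via_mul_and_sum_rows(x, w):
    """Depthwise 1-D convolution as MUL followed by SUM_ROWS.

    x: one list per channel, length n_tokens + d_conv - 1 (assumed layout)
    w: one list per channel, length d_conv
    Returns one list per channel, length n_tokens.
    """
    d_conv = len(w[0])
    out = []
    for xc, wc in zip(x, w):
        n_tokens = len(xc) - d_conv + 1
        # Sliding windows over the channel, one window per output token.
        windows = [xc[t:t + d_conv] for t in range(n_tokens)]
        # MUL: multiply every window element-wise by the channel's kernel.
        products = [[a * b for a, b in zip(win, wc)] for win in windows]
        # SUM_ROWS: reduce each window of products to a single value.
        out.append([sum(row) for row in products])
    return out


result = conv_via_mul_and_sum_rows(
    [[1, 2, 3, 4], [0, 1, 0, 2]],
    [[1, 0, -1], [1, 1, 1]],
)
```

Here each channel has 4 input values and a kernel of width 3, so each channel produces 2 outputs.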
Francis Couture-Harpin
17f6c1ef3b llama : fix .base() compilation error on Windows 2024-06-03 00:41:15 -04:00
Francis Couture-Harpin
8fb57ac0fb llama : use im2col and mul_mat to perform convolution for Mamba
This removes the need for ggml_ssm_conv!!!
But performance seems slightly worse on my system,
especially for prompt processing.
Maybe ggml_mul_mat isn't optimized for small row sizes?
More performance testing is necessary until GGML_OP_SSM_CONV is removed.

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.
2024-06-03 00:01:41 -04:00
Francis Couture-Harpin
eb589d5e36 llama : avoid copies for simple batch splits 2024-06-02 00:18:56 -04:00
Francis Couture-Harpin
61200ef29f llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.
2024-06-01 16:44:43 -04:00
Francis Couture-Harpin
18d1c14047 llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag
on thousands of sequences with very small (100k-parameter) Mamba models.
2024-06-01 15:06:59 -04:00
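One common way to reorder rows with few swaps is to follow the cycles of the permutation, so that every swap puts at least one row in its final position. The sketch below illustrates that general technique in Python. The function name and the `dest` convention (`dest[i]` is the index where row `i` should end up) are hypothetical, and the commit itself may use a different approach.

```python
def reorder_in_place(rows, dest):
    """Move rows[i] to index dest[i] in place; return the number of swaps.

    dest must be a permutation of range(len(rows)).
    The swap count equals len(rows) minus the number of permutation cycles.
    """
    dest = list(dest)  # work on a copy so the caller's list is untouched
    swaps = 0
    for i in range(len(rows)):
        # Keep sending the row currently at i to its destination until
        # the row that belongs at i has arrived.
        while dest[i] != i:
            j = dest[i]
            rows[i], rows[j] = rows[j], rows[i]
            dest[i], dest[j] = dest[j], dest[i]
            swaps += 1
    return swaps


logits = ["c", "a", "b"]
n_swaps = reorder_in_place(logits, [2, 0, 1])
```

Rows that are already in place cost nothing, which matters when most outputs are already in order.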
Francis Couture-Harpin
72eea49224 llama : fix batch split output count for embeddings 2024-06-01 12:24:19 -04:00
Francis Couture-Harpin
5d3c7b9585 Merge branch 'master' into compilade/refactor-kv-cache 2024-06-01 11:51:41 -04:00
Francis Couture-Harpin
3587a94987 llama : use equal-sequence-length sub-batches for recurrent models
* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch
2024-06-01 11:49:17 -04:00
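To illustrate what equal-sequence-length sub-batches could look like, the sketch below splits a batch so that every sequence present in a sub-batch contributes the same number of tokens. It is a simplified model under stated assumptions: the function name, the dict-of-lists input, and the greedy strategy are hypothetical and are not taken from the llama.cpp implementation.

```python
def split_equal_length(seqs, n_ubatch):
    """Split per-sequence token lists into equal-length sub-batches.

    seqs: dict mapping seq_id -> list of tokens (assumed input shape)
    n_ubatch: maximum number of tokens per sub-batch, must be >= 1
    Returns a list of dicts; within each dict all token lists have
    the same length.
    """
    remaining = {sid: list(toks) for sid, toks in seqs.items() if toks}
    ubatches = []
    while remaining:
        # Take at most n_ubatch sequences so each one gets at least one token.
        chosen = sorted(remaining)[:n_ubatch]
        shortest = min(len(remaining[s]) for s in chosen)
        per_seq = min(shortest, n_ubatch // len(chosen))
        ubatch = {}
        for s in chosen:
            ubatch[s] = remaining[s][:per_seq]
            remaining[s] = remaining[s][per_seq:]
            if not remaining[s]:
                del remaining[s]  # sequence fully consumed
        ubatches.append(ubatch)
    return ubatches


ubatches = split_equal_length({0: [1, 2, 3], 1: [4, 5], 2: [6]}, n_ubatch=4)
```

With uniform lengths inside a sub-batch, a recurrent model can process it as a dense block of shape (number of sequences) by (tokens per sequence).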
Johannes Gäßler
750f60c03e
CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (#7681) 2024-06-01 15:47:04 +02:00
Johannes Gäßler
9b596417af
CUDA: quantized KV support for FA vec (#7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-01 08:44:14 +02:00
Georgi Gerganov
a323ec60af
server : update js (#7670) 2024-05-31 22:23:04 +03:00
Galunid
0515ad93f4
convert-hf : Handle NotImplementedError in convert-hf-to-gguf (#7660) 2024-05-31 17:42:33 +02:00
Johannes Gäßler
c8047d538f
scripts: update compare_llama_bench.py [no ci] (#7673) 2024-05-31 16:26:21 +02:00
Daniele
30e238b246
Improve HIP compatibility (#7672) 2024-05-31 16:00:29 +02:00
Georgi Gerganov
16926dff92
readme : link homebrew discussion 2024-05-31 15:04:58 +03:00
Georgi Gerganov
0c27e6f62e
ggml : fix loongson compile warnings (#7537)
* ggml : fix loongson compile warnings

ggml-ci

* Fix loongarch quantize test fail.

Fix unexpected error introduced during rebase code.

* tests : disable json test due to lack of python on the CI node

ggml-ci

---------

Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>
2024-05-31 14:17:10 +03:00
Galunid
2e32f874e6
Somehow '**' got lost (#7663) 2024-05-31 18:24:41 +10:00
Galunid
1af511fc22
Add convert.py removal to hot topics (#7662) 2024-05-31 10:09:20 +02:00
Sertaç Özercan
0541f06296
[no ci] docs: add aikit to readme (#7650)
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
2024-05-31 09:57:16 +10:00
JohnnyB
9022c33646
Fixed painfully slow single process builds. (#7326)
* Fixed painfully slow single process builds.

* Added nproc for systems that don't default to nproc
2024-05-30 22:32:38 +02:00
Georgi Gerganov
5921b8f089
llama : cache llama_token_to_piece (#7587)
* llama : cache llama_token_to_piece

ggml-ci

* llama : use vectors and avoid has_cache

ggml-ci

* llama : throw on unknown tokenizer types

ggml-ci

* llama : print a log of the total cache size
2024-05-31 02:01:41 +10:00
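The caching idea in this commit can be pictured as precomputing one piece per token id and storing the results in a vector indexed by token id. The Python sketch below is only an illustration of that pattern; `build_piece_cache`, `slow_token_to_piece`, and the toy vocabulary are hypothetical stand-ins, not llama.cpp APIs.

```python
def build_piece_cache(n_vocab, token_to_piece):
    """Call token_to_piece once per token id and keep the results in a list."""
    return [token_to_piece(token) for token in range(n_vocab)]


# Hypothetical stand-in for an expensive detokenization routine.
calls = {"count": 0}
toy_vocab = ["<s>", "Hello", " world"]


def slow_token_to_piece(token):
    calls["count"] += 1
    return toy_vocab[token]


cache = build_piece_cache(len(toy_vocab), slow_token_to_piece)

# Later lookups index the list and never call the slow routine again.
text = "".join(cache[t] for t in [1, 2, 1, 2])
```

A dense list works here because token ids are contiguous integers starting at zero, so no separate "is cached" flag is needed per entry.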
Martin Delille
5dcdf94676
Fix conan badge display [no ci] (#7645) 2024-05-31 01:07:39 +10:00
Manuel
2e2340de17
Add brew installation instruction to README [no ci] (#7616) 2024-05-31 00:58:15 +10:00
Martin Delille
7846540bd2
readme : add Conan badge (#7638) 2024-05-30 15:52:50 +03:00
Brian
e6157f94c8
github: add contact links to issues and convert question into research [no ci] (#7612) 2024-05-30 21:55:36 +10:00
Galunid
9c4c9cc83f
Move convert.py to examples/convert-legacy-llama.py (#7430)
* Move convert.py to examples/convert-no-torch.py

* Fix CI, scripts, readme files

* convert-no-torch -> convert-legacy-llama

* Move vocab thing to vocab.py

* Fix convert-no-torch -> convert-legacy-llama

* Fix lost convert.py in ci/run.sh

* Fix imports

* Fix gguf not imported correctly

* Fix flake8 complaints

* Fix check-requirements.sh

* Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE

* Review fixes
2024-05-30 21:40:00 +10:00
Chris Elrod
59b0d07766
faster avx512 exp implementation (#7551)
* faster avx512 exp implementation

* x->r

* improve accuracy, handle special cases

* remove `e`
2024-05-30 21:32:55 +10:00
junchao-loongson
d5c05821f3
ggml : fix loongarch build (O2 issue) (#7636) 2024-05-30 12:30:10 +03:00
Johannes Gäßler
972b555ab9
README: explain parallel build [no ci] (#7618) 2024-05-30 09:52:39 +02:00
Meng, Hengyu
3854c9d07f
[SYCL] fix intel docker (#7630)
* Update main-intel.Dockerfile

* workaround for https://github.com/intel/oneapi-containers/issues/70

* reset intel docker in CI

* add missed in server
2024-05-30 16:19:08 +10:00
Galunid
eb57fee51f
gguf-py : Add tokenizer.ggml.pre to gguf-new-metadata.py (#7627) 2024-05-30 02:10:40 +02:00
Georgi Gerganov
55d62262a9
metal : remove invalid asserts (#7617) 2024-05-29 22:21:20 +03:00
Georgi Gerganov
975ec63ff2
metal : add missing asserts (#7617) 2024-05-29 20:45:25 +03:00
Georgi Gerganov
fb76ec31a9
ggml : fix YARN + add tests + add asserts (#7617)
* tests : add rope tests

ggml-ci

* ggml : fixes (hopefully)

ggml-ci

* tests : add non-cont tests

ggml-ci

* cuda : add asserts for rope/norm + fix DS2

ggml-ci

* ggml : assert contiguousness

* tests : reduce RoPE tests

ggml-ci
2024-05-29 20:17:31 +03:00
Georgi Gerganov
cce3dcffc5
cuda : non-cont concat support (#7610)
* tests : add non-cont concat tests

* cuda : non-cont concat support

ggml-ci
2024-05-29 15:38:26 +03:00
Radoslav Gerganov
210d99173d
llama-bench : add support for the RPC backend (#7435) 2024-05-29 14:45:44 +03:00
slaren
87bdf2a199
ggml : use atomic_flag for critical section (#7598)
* ggml : use atomic_flag for critical section

* add windows shims
2024-05-29 13:36:39 +02:00
Georgi Gerganov
00281b7be3
scripts : remove mpi remnants 2024-05-29 14:31:18 +03:00
Georgi Gerganov
2ab977282b
sync : ggml 2024-05-29 14:29:52 +03:00
Georgi Gerganov
72de268bec
ggml : restore ggml_rope_xpos_inplace (ggml/0)
ggml-ci
2024-05-29 14:29:33 +03:00
Akarshan Biswas
0e8d8bfd6c
Add Arc A750 and Arch linux to readme-sycl.md as verified GPU model and Linux distro (#7605) 2024-05-29 16:53:47 +10:00
zhouwg
504f0c340f
ggml : fix typo in ggml.c (#7603) 2024-05-29 04:09:31 +02:00
Meng, Hengyu
b864b50ce5
[SYCL] Align GEMM dispatch (#7566)
* align GEMM dispatch
2024-05-29 07:00:24 +08:00
jaime-m-p
02c1ecad07
Tokenizer WPM fixes (#7500)
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no matching found.
2024-05-28 21:46:34 +02:00
Georgi Gerganov
6bd12ce409
sycl : fix assert (#7563) 2024-05-28 22:22:50 +03:00
Francis Couture-Harpin
4e4c41e553 Merge branch 'master' into compilade/refactor-kv-cache 2024-05-28 15:15:18 -04:00
Francis Couture-Harpin
3a414b0be2 llama : sequence-length-aware batch splitting 2024-05-28 15:07:32 -04:00
Francis Couture-Harpin
181dadf294 llama : fix Jamba quantization sanity checks 2024-05-28 15:07:32 -04:00
Giuseppe Scrivano
5442939fcc
llama : support small Granite models (#7481)
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/llama.cpp/issues/7116
Still needs some more changes to properly support Granite.

* llama: honor add_space_prefix from the model configuration

propagate the add_space_prefix configuration from the HF model
configuration to the gguf file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* llama: add support for small granite models

it works only for the small models 3b and 8b.

The convert-hf-to-gguf.py script uses the vocabulary size of the
granite models to detect granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
2024-05-28 21:49:49 +03:00