Commit Graph

2306 Commits

Author SHA1 Message Date
Iwan Kawrakow
31cecc8734 iq3_s_mult_shuffle: use lookup table on Metal
~4% faster TG and ~2% faster PP that way.
2024-03-05 10:19:44 +02:00
Iwan Kawrakow
93034df760 iq3_s_mult_shuffle: use lookup table on CUDA
~4% faster TG that way.
2024-03-05 10:06:07 +02:00
Iwan Kawrakow
6d15da1ec0 iq3_s_mult_shuffle: use new multiplier and cleanup 2024-03-05 08:36:57 +02:00
Iwan Kawrakow
b1d753be34 iq3_s_mult: remove SLOW_MULT option 2024-03-05 08:23:37 +02:00
Iwan Kawrakow
a6a263b919 iq3_s_mult_shuffle: works on ARM_NEON and Metal 2024-03-04 20:10:36 +02:00
Iwan Kawrakow
b587482287 iq3_s_mult_shuffle: mult + shuffle based codebook 2024-03-04 19:43:22 +02:00
Iwan Kawrakow
b48bf8b411 iq3_s_mult: scalar dot product 2024-03-04 08:22:03 +02:00
Iwan Kawrakow
f2c2bd6b26 iq3_s_mult: also CUDA 2024-03-03 19:12:05 +02:00
Iwan Kawrakow
e5e72562c5 iq3_s_mult: back to blocks of 32 2024-03-03 18:50:26 +02:00
Iwan Kawrakow
f4cb4eac45 iq3_s_mult: play with blocks of 16
This brings the bpw to 3.5625. We come close but
don't quite match lookup with 3.4375 bpw (blocks of 32)
2024-03-03 16:43:00 +02:00
Iwan Kawrakow
dbe98dfe70 iq3_s_mult: another alternative multiplier 2024-03-03 13:13:52 +02:00
Iwan Kawrakow
8b713a987e iq3s_mult: quantization tuning 2024-03-03 11:32:53 +02:00
Iwan Kawrakow
5b9c8785fa iq3s_mult: ARM and Metal 2024-03-03 11:30:01 +02:00
Iwan Kawrakow
b6402fa757 iq3_s_mult: ifdef'd slow / fast versions 2024-03-03 10:43:53 +02:00
Iwan Kawrakow
726aed307a iq3_s_mult: alternative multiplier / bit twidling 2024-03-03 08:51:28 +02:00
Iwan Kawrakow
fe3c20b251 iq3_s_mult: quantization tuning 2024-03-03 07:51:20 +02:00
Iwan Kawrakow
3000e0ac9e iq3_s_mult: Metal works - slower than lookup 2024-03-03 06:41:58 +02:00
Iwan Kawrakow
bf90920fb2 iq3_s_mult: ARM_NEON works - 13 t/s 2024-03-02 19:17:27 +02:00
Iwan Kawrakow
0fe9cd488f WIP 2024-03-02 17:56:16 +02:00
Iwan Kawrakow
e43e81a5d7 WIP 2024-03-01 18:48:08 +02:00
Iwan Kawrakow
160acecaba iq3_s_multiplier: CUDA and AVX2 works
CUDA is 153.8 t/s, so faster than lookup table (151 t/s) and Q3_K_S (145 t/s).
AVX2 on Ryzen-5975WX is 13.7 t/s, so faster than lookup (12.7 t/s), but
slower than Q3_K_S (15.5 t/s).
2024-03-01 13:44:06 +02:00
Iwan Kawrakow
4c21c826e1 WIP 2024-03-01 13:28:20 +02:00
Iwan Kawrakow
1cc7cb2b46 iq3_s(multiplier): use SIMD also in dequantize 2024-03-01 12:02:39 +02:00
Iwan Kawrakow
9c752ff0d3 Trying IQ3_S without a lookup table 2024-03-01 11:52:17 +02:00
Kawrakow
cb49e0f8c9
Attempt to fix android build (#5752)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-27 19:16:49 +02:00
Kawrakow
0becb22ac0
IQ4_XS: a 4.25 bpw quantization (#5747)
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-27 16:34:24 +02:00
Engininja2
c24a2a6e60
cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744) 2024-02-27 14:22:45 +01:00
Engininja2
1f30b7a9f1
ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742) 2024-02-27 14:50:18 +02:00
Georgi Gerganov
9d533a77d0
llama : fix defrag bugs + add parameter (#5735)
* llama : fix defrag bugs + enable by default

ggml-ci

* llama : add defrag_thold parameter

ggml-ci

* llama : cont

* llama : disable log message

ggml-ci

* llama : fix graph size check during defrag
2024-02-27 14:35:51 +02:00
le.chang
cbbd1efa06
Makefile: use variables for cublas (#5689)
* make: use arch variable for cublas

* fix UNAME_M

* check opt first

---------

Co-authored-by: lindeer <le.chang118@gmail.com>
2024-02-27 03:03:06 +01:00
Xuan Son Nguyen
b11a93df41
fix server hangs on empty prompt (#5733) 2024-02-26 23:15:48 +01:00
Kawrakow
a33e6a0d2a
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-26 18:28:38 +02:00
Johannes Gäßler
47bb7b48c7
CUDA: fix DEBUG_CUDA_MALLOC (#5729) 2024-02-26 15:36:38 +01:00
Artem
c4d7f81786
readme : update ui list (#5731)
* Add LLMFarm (ui for iOS) to list
2024-02-26 16:15:28 +02:00
AidanBeltonS
e849078c6e
[SYCL] Add support for soft_max ALiBi (#5639)
* Add support for bias

* Update pre-processor

* rm commented code

* fix format

* fix CI

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-26 19:32:11 +05:30
Georgi Gerganov
67fd33132f
unicode : reuse iterator (#5726) 2024-02-26 14:02:12 +02:00
Pierrick Hymbert
4804215cb8
server: CI fix trailing space (#5728) 2024-02-26 12:41:34 +02:00
Pierrick Hymbert
8a533f0d90
server: CI tests reduce build matrix (#5725) 2024-02-26 09:56:10 +01:00
Georgi Gerganov
269de86ba0
llama : fix Gemma rope type (#5691) 2024-02-26 08:30:17 +02:00
github-actions[bot]
c393733988 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
  → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
2024-02-25 22:24:22 +00:00
Pierrick Hymbert
e3965cf35a
server: tests - slow inference causes timeout on the CI (#5715)
* server: tests - longer inference timeout for CI
2024-02-25 22:48:33 +01:00
Pierrick Hymbert
8b350356b2
server: docs - refresh and tease a little bit more the http server (#5718)
* server: docs - refresh and tease a little bit more the http server

* Rephrase README.md server doc

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25 21:46:29 +01:00
Georgi Gerganov
bf08e00643
llama : refactor k-shift implementation + KV defragmentation (#5691)
* llama : refactor k-shift implementation

ggml-ci

* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add

* llama : cont k-shift refactoring + normalize type names

ggml-ci

* minor : fix MPI builds

* llama : reuse n_rot from the build context

ggml-ci

* llama : revert enum name changes from this PR

ggml-ci

* llama : update llama_rope_type

* llama : add comment about rope values

* llama : fix build

* passkey : apply kv cache updates explicitly

ggml-ci

* llama : change name to llama_kv_cache_update()

* llama : add llama_kv_cache_seq_pos_max()

* passkey : fix llama_kv_cache_seq_pos_max() usage

* llama : some llama_kv_cell simplifications

* llama : add llama_kv_cache_compress (EXPERIMENTAL)

* llama : add alternative KV cache merging (EXPERIMENTAL)

* llama : add llama_kv_cache_defrag

* llama : comments

* llama : remove llama_kv_cache_compress

will add in a separate PR

ggml-ci

* llama : defragment via non-overlapping moves

* llama : ggml_graph based defrag implementation

ggml-ci

* llama : switch the loop order in build_defrag

* llama : add comments
2024-02-25 22:12:24 +02:00
compilade
f7625019c5
server : fix crash when system prompt is bigger than batch size (#5714)
The system prompt is now decoded in batches.

* server : fix off-by-one n_past when start of prompt matches whole cache

The tokens right after the matching part would otherwise skip a pos value.
2024-02-25 20:43:50 +02:00
Radosław Gryta
abbabc5e51
ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711)
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility

vqtbl1q_u8 is not part of arm v7 neon library

* [android-example] Remove abi filter after arm v7a fix

* [github-workflows] Do not skip Android armeabi-v7a build
2024-02-25 20:43:00 +02:00
kwin1412
f1a98c5254
make : fix nvcc version is empty (#5713)
fix nvcc version is empty
2024-02-25 18:46:49 +02:00
Ashok Gelal
7d548a1827
readme : add Msty to UI list (#5618) 2024-02-25 17:57:34 +02:00
Pierrick Hymbert
930b178026
server: logs - unified format and --log-format option (#5700)
* server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id

* server : skip GH copilot requests from logging

* server : change message format of server_log()

* server : no need to repeat log in comment

* server : log style consistency

* server : fix compile warning

* server : fix tests regex patterns on M2 Ultra

* server: logs: PR feedback on log level

* server: logs: allow to choose log format in json or plain text

* server: tests: output server logs in text

* server: logs switch init logs to server logs macro

* server: logs ensure value json value does not raised error

* server: logs reduce level VERBOSE to VERB to max 4 chars

* server: logs lower case as other log messages

* server: logs avoid static in general

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25 13:50:32 +01:00
Pierrick Hymbert
d52d7819b8
server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708)
* server: monitoring - add /metrics prometheus compatible endpoint

* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified

* server: metrics - move to a dedicated struct
2024-02-25 13:49:43 +01:00
Radosław Gryta
1289408817
cmake : fix compilation for Android armeabi-v7a (#5702) 2024-02-25 12:53:11 +02:00