Commit Graph

4217 Commits

Author SHA1 Message Date
Jeff Bolz
54ef9cfc72
vulkan: Throttle the number of shader compiles during the build step. (#10222)
Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes"
errors on Linux. This change applies the same throttling we use for
multithreaded pipeline creation.
2024-11-11 18:13:51 +01:00
Georgi Gerganov
b0cefea58a
metal : more precise Q*K in FA vec kernel (#10247) 2024-11-11 08:39:13 +02:00
Georgi Gerganov
b141e5f6ef
server : enable KV cache defrag by default (#10233)
ggml-ci
2024-11-11 08:38:43 +02:00
Georgi Gerganov
4b3a9212b6
flake.lock: Update (#10243)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd?narHash=sha256-l253w0XMT8nWHGXuXqyiIC/bMvh1VRszGXgdpQlfhvU%3D' (2024-10-29)
  → 'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum%2BL%2B0zVAWvXAGbWAuXHax3KzuejaDyo%3D' (2024-11-05)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-10 11:45:25 -08:00
MaggotHATE
505f33274d
server : (web UI) Add back sampler settings (#10239)
* Add back samplers to server

* Added tooltips with basic information

* Fixed stretching of input fields.

* use component for settings input, move help msg to tooltips

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
Jeff Bolz
160687b3ed
vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (#10226) 2024-11-10 12:37:56 +01:00
Georgi Gerganov
6423c65aa8
metal : reorder write loop in mul mat kernel + style (#10231)
* metal : reorder write loop

* metal : int -> short, style

ggml-ci
2024-11-09 11:53:13 +02:00
Georgi Gerganov
39a334a9aa
metal : fix build and some more comments (#10229) 2024-11-09 11:53:02 +02:00
Georgi Gerganov
bb38cdd8ba
metal : fix F32 accumulation in FA vec kernel (#10232) 2024-11-09 11:52:45 +02:00
Georgi Gerganov
f018acba22
llama : fix Qwen model type strings 2024-11-09 11:26:34 +02:00
Georgi Gerganov
46323fa9ef
metal : hide debug messages from normal log 2024-11-09 11:21:49 +02:00
SXX
5b359bb1e3
ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) 2024-11-09 08:35:46 +01:00
amritahs-ibm
e89213492d
ggml : optimize llamafile cpu matrix multiplication for ppc64le (#10156)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for FP32 datatype.

This change results in a consistent 90%
improvement in input processing time, and 20%
to 80% improvement in output processing time,
across various batch sizes.

The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2024-11-09 09:17:50 +02:00
haopeng
8fc393f246
scripts : fix pattern and get n_tokens in one go (#10221) 2024-11-09 09:06:54 +02:00
Georgi Gerganov
ec450d3bbf
metal : opt-in compile flag for BF16 (#10218)
* metal : opt-in compile flag for BF16

ggml-ci

* ci : use BF16

ggml-ci

* swift : switch back to v12

* metal : has_float -> use_float

ggml-ci

* metal : fix BF16 check in MSL

ggml-ci
2024-11-08 21:59:46 +02:00
Georgi Gerganov
695ad752b2
metal : improve clarity (minor) (#10171) 2024-11-08 18:37:41 +02:00
Georgi Gerganov
841f27abdb
metal : optimize FA kernels (#10171)
* ggml : add ggml_flash_attn_ext_get_prec

* metal : use F16 precision in FA kernels

ggml-ci

* metal : minor clean-up

* metal : compile-guard bf16 FA kernels

ggml-ci

* build : remove obsolete compile flag [no ci]

* metal : prevent int overflows [no ci]

* cuda : disable BF16 FA

ggml-ci

* metal : fix BF16 requirement for FA kernels

ggml-ci

* make : clean-up [no ci]
2024-11-08 13:47:22 +02:00
Jhen-Jie Hong
d05b3127bd
swift : exclude ggml-metal-embed.metal (#10211)
* llama.swift : exclude ggml-metal-embed.metal

* swift : exclude build/
2024-11-08 11:34:06 +02:00
Xuan Son Nguyen
76c6e7f105
server : minor UI fix (#10207) 2024-11-07 18:44:38 -04:00
Xuan Son Nguyen
a71d81cf8c
server : revamp chat UI with vuejs and daisyui (#10175)
* server : simple chat UI with vuejs and daisyui

* move old files to legacy folder

* embed deps into binary

* basic markdown support

* add conversation history, save to localStorage

* fix bg-base classes

* save theme preferences

* fix tests

* regenerate, edit, copy buttons

* small fixes

* docs: how to use legacy ui

* better error handling

* make CORS preflight more explicit

* add GET method for CORS

* fix tests

* clean up a bit

* better auto scroll

* small fixes

* use collapse-arrow

* fix closeAndSaveConfigDialog

* small fix

* remove console.log

* fix style for <pre> element

* lighter bubble color (less distract when reading)
2024-11-07 17:31:10 -04:00
Georgi Gerganov
eec4d71737
scripts : add amx to sync-ggml.sh [no ci] 2024-11-07 23:11:36 +02:00
Georgi Gerganov
3b08828674
sync : ggml 2024-11-07 23:08:24 +02:00
Georgi Gerganov
a2c6fd747c
scripts : sync update 2024-11-07 23:07:55 +02:00
Diego Devesa
97404c4a03
ggml : add ggml-cpu.h to the public headers (#10204) 2024-11-07 18:16:08 +01:00
Faisal Zaghloul
60e17ce23c
Remove identical wte/etw logic for jais (#10203) 2024-11-07 08:46:12 -08:00
wwoodsTM
5107e8cea3
DRY: Fixes clone functionality (#10192) 2024-11-07 16:20:25 +01:00
snadampal
2319126a70
fix q4_0_8_8 format for corrupted tokens issue (#10198)
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-62-167.us-west-2.compute.internal>
2024-11-07 09:02:08 +01:00
Zhiyuan Li
3bcd40b3c5
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133)
* rwkv6: rename to wkv6

* rwkv6: support avx2 avx512 armv8 armv9

* rwkv6: update cuda file name

* rwkv6: rename params

* wkv on sycl

* sycl: add some ops

* sycl: Enhance OP support judgment

* wkv6: drop armv9 and tranfer to GGML style

ggml-ci

* sync : ggml

* update the function to use appropriate types

* fix define error

* Update ggml/src/ggml-cpu.c

* add appropriate asserts

* move element-wise functions outside

* put the declaration outside the loop

* rewrite to be more inline with the common pattern for distributing threads

* use recommended way GGML_TENSOR_LOCALS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
2024-11-07 15:19:10 +08:00
Georgi Gerganov
5c333e0140
metal : add BF16 support (#8439)
* ggml : add initial BF16 support

ggml-ci

* metal : add mul_mat_id BF16 support

ggml-ci

* metal : check for bfloat support on the Metal device

ggml-ci

* metal : better var names [no ci]

* metal : do not build bfloat kernels when not supported

ggml-ci

* metal : try to fix BF16 support check

ggml-ci

* metal : this should correctly check bfloat support
2024-11-06 19:53:51 +02:00
Georgi Gerganov
b11f9ba9b8
server : remove hack for extra parallel slot (#10187)
ggml-ci
2024-11-06 13:29:01 +02:00
Diego Devesa
94d8cb8be1
metal : fix from ptr buffer name (#10189) 2024-11-06 12:10:07 +01:00
Georgi Gerganov
1dc04b2dee
ggml : adjust is_first_call init value (#10193)
ggml-ci
2024-11-06 11:20:10 +02:00
Georgi Gerganov
a1eaf6a960
metal : add quantized FA support (#10149)
* metal : add quantized FA (vec) support

ggml-ci

* metal : add quantized FA (non-vec) support

* metal : fix support check

ggml-ci

* metal : clean-up

* metal : clean-up (cont)

* metal : fix shared memory calc + reduce smem + comments

* metal : float-correctness

* metal : minor [no ci]
2024-11-06 10:24:23 +02:00
Gabe Goodhart
b8deef0ec0
llama : add <|tool_call|> formatting to Granite template (#10177)
Branch: GraniteToolCallTemplate

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-05 14:23:04 +02:00
Diego Devesa
a9e8a9a030
ggml : fix arch check in bf16_to_fp32 (#10164) 2024-11-04 23:17:01 +01:00
Eve
3407364776
Q6_K AVX improvements (#10118)
* q6_k instruction reordering attempt

* better subtract method

* should be theoretically faster

small improvement with shuffle lut, likely because all loads are already done at that stage

* optimize bit fiddling

* handle -32 offset separately. bsums exists for a reason!

* use shift

* Update ggml-quants.c

* have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
2024-11-04 23:06:31 +01:00
Diego Devesa
d5a409e57f
ggml : fix gelu tables initialization (#10172) 2024-11-04 20:06:58 +01:00
Diego Devesa
401558b7ba
ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (#10167) 2024-11-04 17:34:08 +01:00
Xuan Son Nguyen
9e0ecfb697
server : clarify /slots endpoint, add is_processing (#10162)
* server : clarify /slots endpoint, add is_processing

* fix tests
2024-11-04 16:33:29 +01:00
snadampal
6a066b9978
fix build break on arm64 linux (#10166)
This fixes the build break from the recent changes
to move the CPU backend to separate files
https://github.com/ggerganov/llama.cpp/pull/10144
2024-11-04 16:08:33 +01:00
Diego Devesa
ea02c753eb
cuda : clear error after changing peer access (#10153) 2024-11-04 13:10:23 +01:00
Georgi Gerganov
05697f670b
metal : simplify f16 and f32 dequant kernels (#0) 2024-11-04 13:49:34 +02:00
Georgi Gerganov
f8e58135cf
metal : move dequantize templates to beginning of MSL source (#0) 2024-11-04 13:44:06 +02:00
leo-pony
329ed914c9
CANN: adjust backend registry refactor. (#10158)
remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR.
2024-11-04 19:08:22 +08:00
Georgi Gerganov
ce027adfb3
sync : ggml 2024-11-04 10:33:37 +02:00
Yuri Khrustalev
284e5b0275
cmake : make it possible linking ggml as external lib (ggml/1003) 2024-11-04 10:33:11 +02:00
Plamen Minev
e2292aaa17
metal : fix minor string leaks (ggml/1004) 2024-11-04 10:33:10 +02:00
Diego Devesa
9f40989351
ggml : move CPU backend to a separate file (#10144) 2024-11-03 19:34:08 +01:00
Georgi Gerganov
08828a6d7d
metal : minor fixup in FA kernel (#10143)
* metal : minor fixup in FA kernel

ggml-ci

* metal : use the unrolled loop variable

* metal : remove unused var
2024-11-03 15:18:40 +02:00
Georgi Gerganov
1839f69130
flake.lock: Update (#10146) 2024-11-03 05:14:15 -08:00