Commit Graph

2145 Commits

Author SHA1 Message Date
Meng, Hengyu
e805f0fa99
[SYCL] get MAX_MEM_ALLOC from device property (#5270)
* get max alloc size from device prop

* fix macro typo
2024-02-02 15:54:14 +08:00
Neo Zhang Jianyu
af3ba5d946
[SYCL] update guide of SYCL backend (#5254)
* update guide for make installation, memory, gguf model link, rm todo for windows build

* add vs install requirement

* update for gpu device check

* update help of llama-bench

* fix grammar issues
2024-02-02 15:53:27 +08:00
Ian Bull
e1e721094d
llama : fix memory leak in llama_batch_free (#5252)
The llama_batch_init allocates memory for a fixed number of tokens.
However, the llama_batch_free only frees memory for the number of
tokens that were added to the batch.

This change-set uses a null terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached. This change-set
also changes the name of the first parameter from `n_tokens` to
`n_tokens_alloc` to more clearly indicate that this value is the number
of tokens allocated to the batch, not the number of tokens in the batch.
2024-02-02 09:20:13 +02:00
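The null-terminated array idea from the llama_batch_free entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual llama.cpp code: toy_batch, toy_batch_init, and toy_batch_free are hypothetical names, and the struct is a simplified stand-in for the real batch type.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical, simplified stand-in for a batch (not the real llama_batch).
struct toy_batch {
    int32_t  *n_seq_id; // one counter per allocated token slot
    int32_t **seq_id;   // n_tokens_alloc per-token arrays plus a nullptr sentinel
};

// Allocates room for n_tokens_alloc tokens, each with up to n_seq_max sequence ids.
toy_batch toy_batch_init(int32_t n_tokens_alloc, int32_t n_seq_max) {
    assert(n_tokens_alloc >= 0 && n_seq_max > 0);
    toy_batch b{};
    b.n_seq_id = static_cast<int32_t *>(std::malloc(sizeof(int32_t) * n_tokens_alloc));
    // One extra slot holds the terminating nullptr.
    b.seq_id = static_cast<int32_t **>(std::malloc(sizeof(int32_t *) * (n_tokens_alloc + 1)));
    for (int32_t i = 0; i < n_tokens_alloc; ++i) {
        b.seq_id[i] = static_cast<int32_t *>(std::malloc(sizeof(int32_t) * n_seq_max));
    }
    b.seq_id[n_tokens_alloc] = nullptr;
    return b;
}

// Frees every per-token array up to the sentinel, regardless of how many
// tokens were actually added to the batch. Returns the number of arrays freed.
int32_t toy_batch_free(toy_batch &b) {
    int32_t freed = 0;
    if (b.seq_id != nullptr) {
        for (int32_t i = 0; b.seq_id[i] != nullptr; ++i) {
            std::free(b.seq_id[i]);
            ++freed;
        }
        std::free(b.seq_id);
        b.seq_id = nullptr;
    }
    std::free(b.n_seq_id);
    b.n_seq_id = nullptr;
    return freed;
}
```

Because the free routine walks to the sentinel instead of relying on a token count, it releases all allocated slots even when only a few tokens were used.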
Neo Zhang Jianyu
128dcbd3c9
add --no-mmap in llama-bench (#5257)
* add --no-mmap, show sycl backend

* fix conflict

* fix code format, change print for --no-mmap

* ren no_mmap to mmap, show mmap when not default value in printer

* update guide for mmap

* mv position to reduce model reload
2024-02-01 20:48:53 +01:00
0cc4m
4d0924a890
Vulkan Phi Fix for AMD Proprietary Drivers (#5260)
* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver

* Fix another Vulkan CPY buffer size bug
2024-02-01 19:25:24 +01:00
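The entry above does not say which expression replaced tanh in the gelu shader. As an assumption, one algebraically equivalent form is tanh(v) = 1 - 2 / (exp(2v) + 1). The C++ sketch below only demonstrates that identity on the CPU and does not reproduce the driver behavior; the function and constant names are hypothetical.

```cpp
#include <cmath>

// Constants of the common tanh-based GELU approximation.
static const float SQRT_2_OVER_PI = 0.79788456080286535588f;
static const float GELU_COEF_A    = 0.044715f;

// Reference form that calls tanh directly.
float gelu_tanh(float x) {
    const float val = SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x);
    return 0.5f * x * (1.0f + std::tanh(val));
}

// Same function with tanh(val) rewritten as 1 - 2 / (exp(2 * val) + 1).
// With IEEE floats, an exp overflow to +inf makes the quotient 0, so the
// result is x; an exp underflow to 0 makes the bracket 0, so the result is 0.
float gelu_exp(float x) {
    const float val = SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x);
    return 0.5f * x * (2.0f - 2.0f / (std::exp(2.0f * val) + 1.0f));
}
```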
slaren
8ca511cade
cuda : fix LLAMA_CUDA_F16 (#5262) 2024-02-01 18:30:17 +01:00
Ali Nehzat
d71ac90985
make : generate .a library for static linking (#5205) 2024-02-01 17:18:53 +02:00
Guoteng
ce32060198
llama : support InternLM2 (#5184)
* support InternLM2 inference
  * add add_space_prefix KV pair
2024-02-01 11:19:51 +02:00
Eve
1cfb5372cf
Fix broken Vulkan Cmake (properly) (#5230)
* build vulkan as object

* vulkan ci
2024-01-31 20:21:55 +01:00
Georgi Gerganov
d3bac7d584
llama : reorder build_orion() at correct place (#5118) 2024-01-31 18:47:10 +02:00
Georgi Gerganov
5cb04dbc16
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)
* llama : remove LLAMA_MAX_DEVICES from llama.h

ggml-ci

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* server : remove LLAMA_MAX_DEVICES

ggml-ci

* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD

ggml-ci

* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD

* readme : add deprecation notice

* readme : change deprecation notice to "remove" and fix url

* llama : remove gpu includes from llama.h

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-31 17:30:17 +02:00
Georgi Gerganov
efb7bdbbd0
metal : add im2col F32 dst support (#5132) 2024-01-31 15:35:41 +02:00
JidongZhang-THU
15606309a0
llava : add MobileVLM support (#5132)
* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Support avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-31 15:10:15 +02:00
Neo Zhang Jianyu
b2b9f025e7
format license text, restore apache license by legal suggestion (#5233) 2024-01-31 18:34:46 +05:30
slaren
dabcc5b471
ggml : limit n_threads to the max n_tasks (#5238) 2024-01-31 13:43:03 +01:00
0cc4m
f8e9140cb4
Vulkan Fixes (#5223)
* Fix Vulkan F16 models

* Fix Vulkan context shift crash

* Add Vulkan to common.cpp dump_non_result_info_yaml function

* Fix bug in Vulkan CPY op

* Fix small matrix multiplication errors in AMD GPUs on Windows or with amdvlk

Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>

---------

Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
2024-01-31 11:44:19 +01:00
Yiming Cui
d62520eb2c
Fix typos of IQ2_XXS and IQ3_XXS in llama.cpp (#5231) 2024-01-30 22:04:21 -05:00
Neo Zhang Jianyu
01684139c3
support SYCL backend windows build (#5208)
* support SYCL backend windows build

* add windows build in CI

* add for win build CI

* correct install oneMKL

* fix install issue

* fix ci

* fix install cmd

* fix install cmd

* fix install cmd

* fix install cmd

* fix install cmd

* fix win build

* fix win build

* fix win build

* restore other CI part

* restore as base

* rm no new line

* fix no new line issue, add -j

* fix grammar issue

* allow to trigger manually, fix format issue

* fix format

* add newline

* fix format

* fix format

* fix format issue

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-01-31 08:08:07 +05:30
Jared Van Bortel
e8dc55d006
kompute : llama-bench support and ggml_cpu_has_kompute() (#5226) 2024-01-30 19:04:37 -05:00
Georgi Gerganov
e0085fdf7c
Revert "server : change deps.sh xxd files to string literals (#5221)"
This reverts commit 4003be0e5f.
2024-01-30 21:19:26 +02:00
Georgi Gerganov
e6f291d158
server : fix context shift (#5195)
* server : fix context shift + simplify self-extend

* server : take system_tokens into account

* server : more n_past fixes

* server : revert n_past_se changes
2024-01-30 20:17:30 +02:00
JohnnyB
4003be0e5f
server : change deps.sh xxd files to string literals (#5221)
* Changed ugly xxd to literals.

HPP files are much more readable as multiline literals rather than hex arrays.

* Dashes in literal variable names.

Replace . and - with _ in file names -> variable names.

* Comment on removing xxd.

XXD-> string literals

* XXD to string literals.

Replaced these unreadable headers with string literal versions using new deps.sh.
2024-01-30 20:15:05 +02:00
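The file-name-to-variable-name rule from the entry above ("Replace . and - with _") can be sketched like this. The real deps.sh is a shell script; this C++ version is only an illustration, and to_variable_name is a hypothetical helper name.

```cpp
#include <string>

// Maps a file name to an identifier-friendly name by replacing
// every '.' and '-' with '_'. All other characters are kept as-is.
std::string to_variable_name(std::string file_name) {
    for (char &c : file_name) {
        if (c == '.' || c == '-') {
            c = '_';
        }
    }
    return file_name;
}
```

For example, to_variable_name("index.html") yields "index_html".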
Kawrakow
fea4fd4ba7
ggml : fix IQ3_XXS on Metal (#5219)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 19:15:28 +02:00
Georgi Gerganov
8f8ddfcfad
sync : ggml (#0) 2024-01-30 16:21:57 +02:00
Georgi Gerganov
6fb50ebbf0
gguf : fix comparison (ggml/715)
ggml-ci
2024-01-30 16:20:25 +02:00
John Balis
625a699b54
ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting (ggml/686)
* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16->float32 upcast and 4d tensor cuda copies

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-30 16:20:25 +02:00
Georgi Gerganov
a4b07c057a
gguf : add input validation, prevent integer overflows (ggml/709)
* gguf : add input validation, prevent integer overflows

ggml-ci

* gguf : fix switch default case

* gguf : sanitize info->n_dims and info->type

ggml-ci

* gguf : assert GGUF_TYPE_SIZE access

ggml-ci

* ggml : assert mallocs are successful

ggml-ci

* gguf : prevent integer overflow

* gguf : sanitize tensor info

ggml-ci

* gguf : stricter limit on the number of items

ggml-ci
2024-01-30 16:20:25 +02:00
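The entry above mentions preventing integer overflows when validating gguf input but does not show how. One common technique, given here as an assumption rather than the code actually used, is to check a size multiplication before performing it. checked_mul is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>

// Computes a * b into *out only if the product fits in size_t.
// Returns false (leaving *out untouched) when the product would overflow.
bool checked_mul(size_t a, size_t b, size_t *out) {
    if (a != 0 && b > SIZE_MAX / a) {
        return false;
    }
    *out = a * b;
    return true;
}
```

A reader that takes an element count from a file could call such a helper before allocating count * element_size bytes, and reject the file when it returns false.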
Georgi Gerganov
549a1e6cd5
ci : fix yolo URLs + fix metal capture (ggml/712) 2024-01-30 16:20:25 +02:00
Jack Mousseau
5f14ee0b0c
metal : add debug capture backend function (ggml/694)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-30 16:20:25 +02:00
Kawrakow
8e14e3ddb3
Faster AVX2 dot product for IQ2_XS (#5187)
* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 improvement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Peter Reid <peter@peterreid.net>
2024-01-30 15:15:07 +02:00
Kawrakow
f4d7e54974
SOTA 3-bit quants (#5196)
* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 15:14:12 +02:00
0cc4m
2256f36b79
Vulkan Windows APU Memory Handling (#5199)
* Add basic UMA memory handling

Improve memory OOM behavior

Fix tests

* Fix UMA handling

* Also fix UMA handling for prealloc buffers

* Remove unnecessary warning message

* Remove outdated comment
2024-01-30 13:59:30 +01:00
Vladimir Malyutin
7359016c7c
quantize : fix typo (#5211)
Fix misprint in quantize help
2024-01-30 12:57:07 +02:00
divinity76
813416991a
main : allow empty --prompt-cache file (#5176)
* allow empty --prompt-cache file

This allows the use of std::tmpnam(), std::tmpfile(), Python's tempfile.NamedTemporaryFile(), and similar create-empty-file APIs for the user.

I switched from the C fopen API to the C++ filesystem API because, to the best of my knowledge, C has no portable way to get the size of a file larger than LONG_MAX, since std::ftell() returns long. For C++ < 17, it falls back to std::ifstream.
(The project currently seems to target C++11; file_exists() and file_size() can be removed when we upgrade to C++17.)

* formatting

(requested in codereview)

* remove c++17, file_is_empty
2024-01-30 11:18:02 +02:00
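A C++11-compatible way to detect an existing but empty file, in the spirit of the --prompt-cache entry above, can be sketched with std::ifstream. This is a minimal sketch under assumptions: the helper names follow the ones mentioned in the commit message, but the bodies are illustrative and may differ from the merged code.

```cpp
#include <fstream>
#include <string>

// True if the file can be opened for reading.
bool file_exists(const std::string &path) {
    std::ifstream f(path.c_str());
    return f.good();
}

// True if the file can be opened and its size is zero.
// Opening with std::ios::ate positions the stream at the end,
// so tellg() reports the file size without using ftell().
bool file_is_empty(const std::string &path) {
    std::ifstream f(path.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!f.is_open()) {
        return false; // a missing or unreadable file is not treated as empty
    }
    return f.tellg() == std::streampos(0);
}
```

With helpers like these, a caller can treat an empty cache file the same as a missing one instead of trying to parse it.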
Romain Neutron
5589921ef8
readme : minor (#5204)
This is about tuning the code formatting of the README file
2024-01-30 11:16:38 +02:00
Georgi Gerganov
49f44b5c55
readme : update hot topics 2024-01-30 11:14:44 +02:00
Wu Jian Ping
6685cc41c2
server : improve README (#5209) 2024-01-30 11:11:46 +02:00
Paul Tsochantaris
ceebbb5b21
ggml alloc: Fix for null dereference on alloc failure (#5200)
* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix
2024-01-29 23:19:29 +01:00
Jared Van Bortel
6daa69ee81
kompute : fix fallback to CPU (#5201) 2024-01-29 17:11:27 -05:00
Jared Van Bortel
fbf1ddec69
Nomic Vulkan backend (#4456)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: niansa <anton-sa@web.de>
Co-authored-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
Co-authored-by: ToKiNoBug <tokinobug@163.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-29 15:50:50 -05:00
divinity76
2aed77eb06
fix typo "RLIMIT_MLOCK" (#5175) 2024-01-29 09:45:41 -05:00
Wu Jian Ping
c82d18e863
server : embeddings compatibility for OpenAI (#5190) 2024-01-29 15:48:10 +02:00
Georgi Gerganov
14fef85e2d
py : fix except (#5194)
ggml-ci
2024-01-29 15:35:54 +02:00
Sang-Kil Park
e76627bcce
py : improve BPE tokenizer support (#5189) 2024-01-29 11:24:19 +02:00
slaren
fbe7dfa53c
ggml : add max buffer sizes to opencl and metal backends (#5181) 2024-01-29 10:05:13 +02:00
Eve
172ac82629
cmake : fix Vulkan build (#5182) 2024-01-29 10:04:47 +02:00
Paul Tsochantaris
d2f650cb5b
metal : free metal objects (#5161)
* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix
2024-01-28 21:50:16 +02:00
Georgi Gerganov
35dec26cc2
sync : ggml 2024-01-28 19:48:05 +02:00
Georgi Gerganov
d460510c72
ggml : minor type fix (int64_t -> size_t) 2024-01-28 19:47:31 +02:00
0cc4m
2307523d32
ggml : add Vulkan backend (#2059)
* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 19:03:59 +02:00
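Two bullets in the Vulkan backend entry above ("Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size" and "Check for maintenance4 support before using it") describe a small selection rule. The sketch below illustrates that rule without Vulkan headers. The device_limits struct and device_max_alloc_size function are hypothetical; in Vulkan the two limits are reported by VkPhysicalDeviceMaintenance3Properties and VkPhysicalDeviceMaintenance4Properties.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical summary of the device limits involved.
struct device_limits {
    bool     has_maintenance4;           // whether maxBufferSize is available
    uint64_t max_memory_allocation_size; // maxMemoryAllocationSize
    uint64_t max_buffer_size;            // maxBufferSize (maintenance4 only)
};

// Picks the largest size a single allocation may have on this device.
// Without maintenance4 only the allocation limit is known, so it is used alone.
uint64_t device_max_alloc_size(const device_limits &lim) {
    if (!lim.has_maintenance4) {
        return lim.max_memory_allocation_size;
    }
    return std::min(lim.max_memory_allocation_size, lim.max_buffer_size);
}
```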