github-actions[bot]
9392ebd49e
flake.lock: Update
...
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
→ 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
→ 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
→ 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
2024-02-04 08:45:35 -08:00
Georgi Gerganov
1846e92a90
cuda : minor
2024-02-04 11:01:01 +02:00
Kawrakow
5ed26e1fc9
Adding some imatrix tools ( #5302 )
...
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-04 10:39:58 +02:00
Welby Seely
277fad30c6
cmake : use set() for LLAMA_WIN_VER ( #5298 )
...
option() is specifically for booleans.
Fixes #5158
2024-02-03 23:18:51 -05:00
Johannes Gäßler
3c0d25c475
make: add nvcc info print ( #5310 )
2024-02-03 20:15:13 +01:00
Johannes Gäßler
3cc5ed353c
make: fix nvcc optimization flags for host code ( #5309 )
2024-02-03 20:14:59 +01:00
Martin Schwaighofer
60ecf099ed
add Vulkan support to Nix flake
2024-02-03 13:13:07 -06:00
0cc4m
e920ed393d
Vulkan Intel Fixes, Optimizations and Debugging Flags ( #5301 )
...
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
2024-02-03 18:15:00 +01:00
Georgi Gerganov
ef68fac2a8
cuda : fix matrix names
2024-02-03 18:36:58 +02:00
Georgi Gerganov
cfd9732b2e
cuda : simplify softmax
2024-02-03 18:31:55 +02:00
Georgi Gerganov
e04ff39181
cuda : fix -INF block check
2024-02-03 16:57:46 +02:00
Georgi Gerganov
5b263dd83a
cuda : unroll Q*K^T loop
2024-02-03 16:12:20 +02:00
Georgi Gerganov
3b1c4e7673
cuda : speed-up reduce part of the kernel
2024-02-03 15:36:05 +02:00
Georgi Gerganov
a7b471569b
cuda : switch to 1 warp for bs > 16
2024-02-03 15:17:49 +02:00
Georgi Gerganov
b958151e3f
cuda : use half2 in softmax
2024-02-03 15:00:25 +02:00
Georgi Gerganov
c51f27c0db
cuda : avoid __hisinf branches
2024-02-03 14:27:36 +02:00
Georgi Gerganov
92472ea22c
cuda : unroll some of the loops
2024-02-03 14:10:01 +02:00
Georgi Gerganov
1f8a592482
cuda : make loops use the same loop values
...
Thanks Johannes again for the tip
2024-02-03 14:01:32 +02:00
Georgi Gerganov
7c34655b36
cuda : use int instead of int64_t
...
Noticeably improves performance (thanks to Johannes)
2024-02-03 13:39:46 +02:00
Michael Klimenko
52bb63c708
refactor : switch to emplace_back to avoid extra object ( #5291 )
2024-02-03 13:23:37 +02:00
Jared Van Bortel
1ec3332ade
YaRN : store rope scaling type as int32_t in memory ( #5285 )
...
* YaRN : store rope scaling type as int32_t in memory
* llama : store mapped names as const char *
2024-02-03 13:22:06 +02:00
BADR
6a66c5071a
readme : add tenere in the ui tools list ( #5284 )
2024-02-03 13:20:26 +02:00
Georgi Gerganov
b150abe83e
cuda : avoid warp_reduce for smax
2024-02-03 13:17:47 +02:00
AidanBeltonS
a305dba8ff
Fix im2col with 32fp ( #5286 )
2024-02-03 16:11:37 +08:00
kalomaze
191221178f
perplexity : fix KL divergence calculations on Windows ( #5273 )
2024-02-02 16:15:30 +02:00
Georgi Gerganov
b68a112204
cuda : fix __hisinf() result check
2024-02-02 15:12:28 +02:00
Georgi Gerganov
e437b37fd0
scripts : parse wtype in server-llm.sh ( #5167 )
...
* scripts : parse wtype in server-llm.sh
* scripts : fix check for wfile
2024-02-02 14:23:40 +02:00
Mirror Azure
2d40085c26
py : add check for '.attn.masked_bias' layers to GPT2model ( #5281 )
2024-02-02 13:39:09 +02:00
Georgi Gerganov
12eaa22628
tests : update dims
2024-02-02 11:55:38 +02:00
AidanBeltonS
b05102fe8c
Tidy ggml-sycl ( #5261 )
...
* Tidy some code in ggml-sycl
* Remove blank space
* Remove std::printf comments
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-02 16:39:48 +08:00
Xuan Son Nguyen
6b91b1e0a9
docker : add build for SYCL, Vulkan + update readme ( #5228 )
...
* add vulkan dockerfile
* intel dockerfile: compile sycl by default
* fix vulkan dockerfile
* add docs for vulkan
* docs: sycl build in docker
* docs: remove trailing spaces
* docs: sycl: add docker section
* docs: clarify install vulkan SDK outside docker
* sycl: use intel/oneapi-basekit docker image
* docs: correct TOC
* docs: correct docker image for Intel oneMKL
2024-02-02 09:56:31 +02:00
Meng, Hengyu
e805f0fa99
[SYCL] get MAX_MEM_ALLOC from device property ( #5270 )
...
* get max alloc size from device prop
* fix macro typo
2024-02-02 15:54:14 +08:00
Neo Zhang Jianyu
af3ba5d946
[SYCL] update guide of SYCL backend ( #5254 )
...
* update guide for make installation, memory, gguf model link, rm todo for windows build
* add vs install requirement
* update for gpu device check
* update help of llama-bench
* fix grammar issues
2024-02-02 15:53:27 +08:00
Ian Bull
e1e721094d
llama : fix memory leak in llama_batch_free ( #5252 )
...
llama_batch_init allocates memory for a fixed number of tokens, but
llama_batch_free only frees memory for the number of tokens that were
actually added to the batch.
This change-set uses a null-terminated array for the batch seq_id and
frees all the elements until the nullptr is reached. It also renames
the first parameter from `n_tokens` to `n_tokens_alloc` to make clear
that this value is the number of tokens allocated for the batch, not
the number of tokens in the batch.
2024-02-02 09:20:13 +02:00
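The null-terminated-array pattern that the commit message above describes can be sketched in plain C. This is an illustrative reconstruction under stated assumptions, not the actual llama.cpp code: the `toy_batch` type and function names are hypothetical stand-ins for `llama_batch`, `llama_batch_init`, and `llama_batch_free`.

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical miniature of the pattern from the commit: seq_id is
// allocated with one extra slot holding a NULL sentinel, so the free
// routine can release every element without being told how many
// tokens were actually added to the batch.
typedef struct {
    int32_t **seq_id; // NULL-terminated array of per-token seq-id lists
} toy_batch;

toy_batch toy_batch_init(int n_tokens_alloc) {
    toy_batch b;
    // n_tokens_alloc + 1 pointers; calloc zeroes the sentinel slot
    b.seq_id = calloc(n_tokens_alloc + 1, sizeof(int32_t *));
    for (int i = 0; i < n_tokens_alloc; i++) {
        b.seq_id[i] = malloc(sizeof(int32_t));
    }
    b.seq_id[n_tokens_alloc] = NULL; // explicit sentinel
    return b;
}

void toy_batch_free(toy_batch b) {
    // Walk until the NULL sentinel instead of trusting a token count,
    // so elements beyond the "used" count are not leaked.
    for (int i = 0; b.seq_id[i] != NULL; i++) {
        free(b.seq_id[i]);
    }
    free(b.seq_id);
}
```

The sentinel makes the free routine independent of how many tokens the caller ended up using, which is exactly the mismatch the original leak came from.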
Georgi Gerganov
db1f3c482e
cuda : avoid zeroing fragments
2024-02-01 22:08:37 +02:00
Neo Zhang Jianyu
128dcbd3c9
add --no-mmap in llama-bench ( #5257 )
...
* add --no-mmap, show sycl backend
* fix conflict
* fix code format, change print for --no-mmap
* ren no_mmap to mmap, show mmap when not default value in printer
* update guide for mmap
* mv position to reduce model reload
2024-02-01 20:48:53 +01:00
Georgi Gerganov
c6769b9422
tests : minor fix
2024-02-01 21:24:26 +02:00
Georgi Gerganov
cda5a60a41
metal : optimize softmax
2024-02-01 21:05:31 +02:00
0cc4m
4d0924a890
Vulkan Phi Fix for AMD Proprietary Drivers ( #5260 )
...
* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver
* Fix another Vulkan CPY buffer size bug
2024-02-01 19:25:24 +01:00
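For context on the NaN fix in the entry above: the commit message does not show the shader change, but the tanh-based GELU approximation that ggml's shaders use is the standard

```latex
\mathrm{gelu}(x) \approx \tfrac{1}{2}\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right)
```

One NaN-free way to compute the tanh term uses the exact identity $\tanh(y) = 1 - \dfrac{2}{e^{2y} + 1}$, which saturates cleanly to $1$ as $e^{2y}$ overflows to infinity and to $-1$ as it underflows to zero. Whether this particular rewrite matches the commit's exact replacement is an assumption; the commit only states that tanh was replaced to avoid NaN on the AMD proprietary driver.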
Georgi Gerganov
56e45a239e
metal : optimize softmax for C > 32
2024-02-01 20:16:32 +02:00
Georgi Gerganov
41d136b602
Merge branch 'master' into gg/flash-attn
2024-02-01 19:51:41 +02:00
Georgi Gerganov
5a19a9f6d0
cuda : add flash_attn kernel (wip)
2024-02-01 19:50:23 +02:00
slaren
8ca511cade
cuda : fix LLAMA_CUDA_F16 ( #5262 )
2024-02-01 18:30:17 +01:00
Ali Nehzat
d71ac90985
make : generate .a library for static linking ( #5205 )
2024-02-01 17:18:53 +02:00
Georgi Gerganov
2e46013749
cuda : fix soft_max to use correct mask size
2024-02-01 16:47:20 +02:00
Georgi Gerganov
910b15bb40
ggml : fix ggml_soft_max mask requirement
2024-02-01 16:41:02 +02:00
Guoteng
ce32060198
llama : support InternLM2 ( #5184 )
...
* support InternLM2 inference
* add add_space_prefix KV pair
2024-02-01 11:19:51 +02:00
Eve
1cfb5372cf
Fix broken Vulkan Cmake (properly) ( #5230 )
...
* build vulkan as object
* vulkan ci
2024-01-31 20:21:55 +01:00
Georgi Gerganov
8ad92dc1ec
ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
2024-01-31 20:39:29 +02:00
Georgi Gerganov
2ddc9bbef1
Merge branch 'master' into gg/flash-attn
2024-01-31 18:49:43 +02:00