llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-27 12:33:06 +01:00

Author	SHA1	Message	Date
Jeff Bolz	b3e585988f	vulkan: Optimize soft_max (#10301 ) * vulkan: Optimize soft_max Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper. Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll. Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H. * vulkan: Further soft_max optimizations Restore the workgroup size of 512 case, use it for >1024. Use unrollable loops for more iteration counts.	2024-11-19 08:25:17 +01:00
Alberto Cabrera Pérez	557924f222	sycl: Revert MUL_MAT_OP support changes (#10385 )	2024-11-19 08:50:04 +08:00
Diego Devesa	d3481e6316	cuda : only use native when supported by cmake (#10389 )	2024-11-18 18:43:40 +01:00
bandoti	531cb1c233	Skip searching root path for cross-compile builds (#10383 )	2024-11-18 16:23:58 +01:00
Jeff Bolz	f139d2ea61	vulkan: remove use of null initializer (#10372 ) Seems like this isn't working for vulkan-over-metal when the array is sized by a spec constant. Maybe a spirv-cross limitation?	2024-11-18 08:28:42 -06:00
Georgi Gerganov	2eb76b2a5e	flake.lock: Update (#10346 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum%2BL%2B0zVAWvXAGbWAuXHax3KzuejaDyo%3D' (2024-11-05) → 'github:NixOS/nixpkgs/5e4fbfb6b3de1aa2872b76d49fafc942626e2add?narHash=sha256-OZiZ3m8SCMfh3B6bfGC/Bm4x3qc1m2SVEAlkV6iY7Yg%3D' (2024-11-15) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-11-18 06:08:20 -08:00
0cc4m	9b75f03cd2	Vulkan: Fix device info output format specifiers (#10366 ) * Vulkan: Fix device info output format specifiers * Vulkan: Use zu printf specifier for size_t instead of ld	2024-11-18 11:02:43 +01:00
Johannes Gäßler	75207b3a88	docker: use GGML_NATIVE=OFF (#10368 )	2024-11-18 00:21:53 +01:00
Johannes Gäßler	76e9e58b78	CUDA: fix MMV kernel being used for FP16 src1 (#10357 )	2024-11-17 23:20:42 +01:00
Johannes Gäßler	ce2e59ba10	CMake: fix typo in comment [no ci] (#10360 )	2024-11-17 12:59:38 +01:00
Diego Devesa	be5caccef9	llama : only use default buffer types for the KV cache (#10358 )	2024-11-17 12:25:45 +01:00
Georgi Gerganov	20a780c7b6	gitignore : ignore local run scripts [no ci]	2024-11-17 13:12:22 +02:00
Georgi Gerganov	cf32a9b93a	metal : refactor kernel args into structs (#10238 ) * metal : add kernel arg structs (wip) * metal : fattn args ggml-ci * metal : cont + avoid potential int overflow [no ci] * metal : mul mat struct (wip) * cont : mul mat vec * cont : pass by reference * cont : args is first argument * cont : use char ptr * cont : shmem style * cont : thread counters style * cont : mul mm id ggml-ci * cont : int safety + register optimizations ggml-ci * metal : GGML_OP_CONCAT ggml-ci * metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV * metal : GGML_OP_REPEAT * metal : GGML_OP_CPY * metal : GGML_OP_RMS_NORM * metal : GGML_OP_NORM * metal : add TODOs for rest of ops * ggml : add ggml-metal-impl.h ggml-ci	2024-11-17 11:23:01 +02:00
FirstTimeEZ	a43178299c	ggml : fix undefined reference to 'getcpu' (#10354 ) https://github.com/ggerganov/llama.cpp/issues/10352	2024-11-17 10:39:22 +02:00
Johannes Gäßler	c3ea58aca4	CUDA: remove DMMV, consolidate F16 mult mat vec (#10318 )	2024-11-17 09:09:55 +01:00
Johannes Gäßler	467576b6cc	CMake: default to -arch=native for CUDA build (#10320 )	2024-11-17 09:06:34 +01:00
Diego Devesa	eda7e1d4f5	ggml : fix possible buffer use after free in sched reserve (#9930 )	2024-11-17 08:31:17 +02:00
Georgi Gerganov	24203e9dd7	ggml : inttypes.h -> cinttypes (#0 ) ggml-ci	2024-11-17 08:30:29 +02:00
Georgi Gerganov	5d9e59979c	ggml : adapt AMX to tensor->grad removal (#0 ) ggml-ci	2024-11-17 08:30:29 +02:00
Georgi Gerganov	a4200cafad	make : add ggml-opt (#0 ) ggml-ci	2024-11-17 08:30:29 +02:00
Georgi Gerganov	84274a10c3	tests : remove test-grad0	2024-11-17 08:30:29 +02:00
Georgi Gerganov	68fcb4759c	ggml : fix compile warnings (#0 ) ggml-ci	2024-11-17 08:30:29 +02:00
Johannes Gäßler	8a43e940ab	ggml: new optimization interface (ggml/988)	2024-11-17 08:30:29 +02:00
Georgi Gerganov	5c9a8b22b1	scripts : update sync	2024-11-17 08:30:29 +02:00
FirstTimeEZ	0fff7fd798	docs : vulkan build instructions to use git bash mingw64 (#10303 )	2024-11-17 00:29:18 +01:00
Johannes Gäßler	4e54be0ec6	llama/ex: remove --logdir argument (#10339 )	2024-11-16 23:00:41 +01:00
Georgi Gerganov	db4cfd5dbc	llamafile : fix include path (#0 ) ggml-ci	2024-11-16 20:36:26 +02:00
Georgi Gerganov	8ee0d09ae6	make : auto-determine dependencies (#0 )	2024-11-16 20:36:26 +02:00
MaggotHATE	bcdb7a2386	server: (web UI) Add samplers sequence customization (#10255 ) * Samplers sequence: simplified and input field. * Removed unused function * Modify and use `settings-modal-short-input` * rename "name" --> "label" --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-11-16 14:26:54 +01:00
Georgi Gerganov	f245cc28d4	scripts : fix missing key in compare-llama-bench.py (#10332 )	2024-11-16 10:32:50 +02:00
Jeff Bolz	772703c8ff	vulkan: Optimize some mat-vec mul quant shaders (#10296 ) Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops. Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before. Optimize the Q4_K shader to vectorize most loads and reduce the number of bit twiddling instructions.	2024-11-16 07:26:57 +01:00
FirstTimeEZ	dd3a6ce9f8	vulkan : add cmake preset debug/release (#10306 )	2024-11-16 02:59:33 +01:00
Dan Johansson	1e58ee1318	ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324 )	2024-11-16 01:53:37 +01:00
FirstTimeEZ	89e4caaaf0	llama : save number of parameters and the size in llama_model (#10286 ) fixes #10285	2024-11-16 01:42:13 +01:00
Srihari-mcw	74d73dc85c	Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314 )	2024-11-15 22:27:00 +01:00
Johannes Gäßler	4047be74da	scripts: update compare-llama-bench.py (#10319 )	2024-11-15 21:19:03 +01:00
slaren	883d206fbd	ggml : fix some build issues	2024-11-15 21:45:32 +02:00
Georgi Gerganov	09ecbcb596	cmake : fix ppc64 check (whisper/0) ggml-ci	2024-11-15 15:44:06 +02:00
thewh1teagle	3225008973	ggml : vulkan logs (whisper/2547)	2024-11-15 15:44:06 +02:00
Georgi Gerganov	cbf5541a82	sync : ggml	2024-11-15 15:44:06 +02:00
Eve	18429220bd	AVX BF16 and single scale quant optimizations (#10212 ) * use 128 bit loads (i've tried 256->128 to death and it's slower) * double accumulator * avx bf16 vec dot * +3% q4_0 inference * +7% tg +5% pp compared to master * slower f16c version, kept for reference * 256b version, also slow. i tried :) * revert f16 * faster with madd * split to functions * Q8_0 and IQ4_NL, 5-7% faster * fix potential overflow (performance reduced) * 16 bit add for q4_0 only * merge	2024-11-15 12:47:58 +01:00
R0CKSTAR	f0204a0ec7	ci: build test musa with cmake (#10298 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-11-15 12:47:25 +01:00
Romain Biessy	57f8355b29	sycl: Update Intel docker images to use DPC++ 2025.0 (#10305 )	2024-11-15 13:10:45 +02:00
Xuan Son Nguyen	9901068ac7	server : (web UI) add copy button for code block, fix api key (#10242 ) * server : (web ui) add copy btn for code blocks * fix problem with api key * use settings-modal-short-input component * always show copy btn for code snippet	2024-11-15 10:48:49 +01:00
Chenguang Li	231f9360d9	cann: dockerfile and doc adjustment (#10302 ) Co-authored-by: noemotiovon <noemotiovon@gmail.com>	2024-11-15 15:09:35 +08:00
Georgi Gerganov	4802ad350b	scripts : fix regex in sync [no ci]	2024-11-15 08:38:43 +02:00
Romain Biessy	5a54af4d4f	sycl: Use syclcompat::dp4a (#10267 ) * sycl: Use syclcompat::dp4a * Using the syclcompat version allows the compiler to optimize the operation with a native function * Update news section * Update CI Windows oneAPI version to 2025.0 * Reword doc * Call syclcompat::dp4a inside dpct::dp4a This reverts commit `90cb61d692`.	2024-11-15 11:09:12 +08:00
Charles Xu	1607a5e5b0	backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921 ) * backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-11-15 01:28:50 +01:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256 ) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
Johannes Gäßler	4a8ccb37ad	CUDA: no -sm row for very small matrices (#10185 )	2024-11-14 13:00:15 +01:00

4128 Commits