llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-30 16:07:17 +01:00

Author	SHA1	Message	Date
Max Krasnyansky	c087b6f11d	threads: fix msvc build without openmp (#9615 ) We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.	2024-09-23 21:18:48 -07:00
Ivan	116efee0ee	cuda: add q8_0->f32 cpy operation (#9571 ) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 02:14:24 +02:00
Max Krasnyansky	f0c7b5edf8	threads: improve ggml_barrier scaling with large number of threads (#9598 ) Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases. Resurect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write in the normal case and just use thread-fence as originally intended. --- Here is the original description and suggestions from Willy Tarreau : There's currently some false sharing between n_barrier and n_barrier_passed that is amplified in ggml_barrier() by the fact that all threads need to increment n_barrier when entering, while all previous threads continue to read n_barrier_passed, waiting for the last one to release them all. The side effect is that all these readers are slowing down all new threads by making the cache line bounce back and forth between readers and writers. Just placing them in two distinct cache lines is sufficient to boost the performance by 21% on a 80-core ARM server compared to the no-openmp version, and by 3% compared to the openmp version. Note that the variables could have been spread apart in the structure as well, but it doesn't seem that the size of this threadpool struct is critical so here we're simply aligning them. Finally, the same issue was present when leaving the barrier since all threads had to update the n_barrier_passed counter, though only one would add a non-zero value. This alone is responsible for half of the cost due to undesired serialization. It might be possible that using a small array of n_barrier counters could make things even faster on many-core systems, but it would likely complicate the logic needed to detect the last thread. Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-09-23 11:42:43 -07:00
Srihari-mcw	1e7b9299c6	ggml : AVX512 gemm for Q4_0_8_8 (#9532 ) * AVX512 version of ggml_gemm_q4_0_8x8_q8_0 * Remove zero vector parameter passing * Rename functions and rearrange order of macros * Edit commments * style : minor adjustments * Update x to start from 0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-23 17:06:38 +03:00
Georgi Gerganov	bf9c1013ac	metal : use F32 prec for K*Q in vec FA (#9595 ) ggml-ci	2024-09-23 11:27:47 +03:00
Akarshan Biswas	e62e9789cd	Revert "[SYCL] fallback mmvq (#9088 )" (#9579 ) This reverts commit `50addec9a5`.	2024-09-23 11:28:06 +08:00
R0CKSTAR	c35e586ea5	musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526 ) * mtgpu: add mp_21 support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable unified memory Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-22 16:55:49 +02:00
Molly Sophia	912c331d3d	Fix merge error in #9454 (#9589 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-22 15:26:50 +02:00
Johannes Gäßler	a5b57b08ce	CUDA: enable Gemma FA for HIP/Pascal (#9581 )	2024-09-22 09:34:52 +02:00
Molly Sophia	2a63caaa69	RWKV v6: RWKV_WKV op CUDA implementation (#9454 ) * ggml: CUDA unary op EXP Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: rwkv_wkv op CUDA impl Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-22 04:29:12 +02:00
slaren	d09770cae7	ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573 )	2024-09-21 14:24:23 +02:00
agray3	41f477879f	Update CUDA graph on scale change plus clear nodes/params (#9550 ) * Avoid using saved CUDA graph if scale changes and reset nodes/params on update Fixes https://github.com/ggerganov/llama.cpp/issues/9451 * clear before resize	2024-09-21 02:41:07 +02:00
Georgi Gerganov	d13edb17ed	ggml : fix builds (#0 ) ggml-ci	2024-09-20 21:15:05 +03:00
Georgi Gerganov	27609c49b9	ggml : fix trailing whitespace (#0 ) ggml-ci	2024-09-20 21:15:05 +03:00
Johannes Gäßler	424c5d00a9	ggml/examples: add backend support for numerical optimization (ggml/949) * CUDA eval works * stochastic gradient descent op * Adam except decay * CUDA CROSS_ENTROPY_LOSS_BACK * CUDA mnist-fc training works * backend CLI arg * refactor gguf load * remove sched from opt_step_adam * implement l1 regularization (weight decay) * extra call to add optimizer * initialize gradients with ggml_graph_reset * gradient accumulation * increment iter per eval instead of epoch * adjust backend interfaces * fix ggml_graph_reset without backend * fix ggml graph export/import * fixup * rename * revert ggml_opt changes * more general CUDA repeat_back * update documentation, fix CNN * validation split * add clarifying comment * optimize PyTorch training * adjust buffer size, thread count * fix 0.0f validation split * Update examples/mnist/mnist-common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix gradient accumulation * tensor flag for accumulators -> tensor hash set * Update include/ggml.h Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * fix test prints * Update src/ggml-backend.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better CUDA support for noncontiguous out_prod * add comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-09-20 21:15:05 +03:00
Georgi Gerganov	a6809c6a2e	examples : add null threadpool args where needed (ggml/0) ggml-ci	2024-09-20 21:15:05 +03:00
Johannes Gäßler	5cb12f6839	CUDA: fix sum.cu compilation for CUDA < 11.7 (#9562 )	2024-09-20 18:35:35 +02:00
slaren	64c6af3195	ggml : fix n_threads_cur initialization with one thread (#9538 ) * ggml : fix n_threads_cur initialization with one thread * Update ggml/src/ggml.c --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2024-09-18 10:13:08 -07:00
Max Krasnyansky	0226613853	threadpool : skip polling for unused threads (#9461 ) * threadpool: skip polling for unused threads Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written from one thread and read from other threads (not a race conditions). * threadpool: further simplify and improve ggml_barrier Avoid using strict memory order while polling, yet make sure that all threads go through full memory barrier (memory fence) on ggml_barrier entrace and exit. * threads: add simple barrier test This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead. * threadpool: improve thread sync for new-graphs Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient, once the new graph is detected we do full fence using read-modify-write with strict memory order. * threadpool: improve abort handling Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it. Instead introduce atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it add an explicit atomic_load for n_threads_cur for consistency. * test-barrier: release threadpool before releasing the context fixes use-after-free detected by gcc thread-sanitizer on x86-64 for some reason llvm sanitizer is not detecting this issue.	2024-09-17 11:19:46 +03:00
slaren	23e0d70bac	ggml : move common CPU backend impl to new header (#9509 )	2024-09-16 16:22:07 +02:00
Michael Podvitskiy	a6a3a5c531	ggml : link MATH_LIBRARY not by its full path (#9339 )	2024-09-16 14:06:50 +03:00
Georgi Gerganov	19514d632e	cmake : do not hide GGML options + rename option (#9465 ) * cmake : do not hide GGML options ggml-ci * build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS for consistency ggml-ci	2024-09-16 10:27:50 +03:00
Eve	5c3d0f1824	ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422 ) * squashed readd my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049 have ggml_vec_dot_q4_0 do two blocks per loop for avx try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue * shuffle * remove f16c iq4_nl as i cant make it faster than before	2024-09-16 09:48:24 +03:00
Georgi Gerganov	c4965a64f7	metal : handle zero-sized allocs (#9466 )	2024-09-16 09:05:56 +03:00
Georgi Gerganov	6262d13e0b	common : reimplement logging (#9418 ) https://github.com/ggerganov/llama.cpp/pull/9418	2024-09-15 20:46:12 +03:00
Michael Podvitskiy	6988da94a2	cmake : correct order of sycl flags (#9497 )	2024-09-15 19:55:52 +03:00
Michael Podvitskiy	7596487beb	cmake : try to fix sycl+intel build (#9487 )	2024-09-15 10:06:38 +03:00
Yuri Khrustalev	822b6322de	ggml : ggml_type_name return "NONE" for invalid values (#9458 ) When running on Windows, the quantization utility attempts to print the types that are not set which leads to a crash.	2024-09-14 12:54:37 +03:00
Georgi Gerganov	1f4111e540	cmake : use list(APPEND ...) instead of set() + dedup linker (#9463 ) * cmake : use list(APPEND ...) instead of set() + dedup linker ggml-ci * cmake : try fix sycl * cmake : try to fix sycl 2 * cmake : fix sycl build (#9469) * try fix sycl build * use CMAKE_CXX_FLAGS as a string variable --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * one more CMAKE_CXX_FLAGS fix (#9471) --------- Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>	2024-09-14 10:55:05 +03:00
Dou Xinpeng	e6b7801bd1	cann: Add host buffer type for Ascend NPU (#9406 ) * feat: Add host buffer type for Ascend NPU(CANN backend) * fix some checking errors * Add a few comments	2024-09-12 19:46:43 +08:00
Ahmad Tameem	2b00fa7997	riscv : modify Makefile and add a RISCV_VECT to print log info (#9442 ) - Added ggml_cpu_has_riscv_v() in GGML to print system info in log - Modified Makefile to only use flag when cross compiling for RISC-V	2024-09-12 14:24:31 +03:00
Georgi Gerganov	d6a04f872d	ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408 ) * ggml : hide ggml_object, ggml_cgraph, ggml_hash_set ggml-ci * ggml : add ggml-impl.h to backends * ggml : fix compiler warnings ggml-ci * ggml : add assert upon adding nodes	2024-09-12 14:23:49 +03:00
Xinpeng Dou	df4b7945ae	cann: Fix error when running a non-exist op (#9424 )	2024-09-12 09:02:35 +08:00
Johannes Gäßler	5af118efda	CUDA: fix --split-mode row race condition (#9413 )	2024-09-11 10:22:40 +02:00
R0CKSTAR	b34e023480	musa: remove Clang builtins mapping (#9421 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-11 03:46:55 +02:00
Alberto Cabrera Pérez	51b6038636	sycl : update support conditions (#9394 ) * sycl : update support condition to im2col Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * Added TODO to remind supporting FP32 im2col --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>	2024-09-11 08:53:42 +08:00
Georgi Gerganov	00ba2ff781	metal : fix compile warning with GGML_METAL_NDEBUG (#0 )	2024-09-10 10:17:43 +03:00
Radoslav Gerganov	293bebe077	rpc : fix segfault with nkvo (#9389 ) * rpc : fix nkvo * rpc : buf_size must not be static ref: #9337 --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-09-09 18:40:10 +03:00
Prashant Vithule	5fac4d5764	ggml : vector length agnostic SVE support (#9290 ) * Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths * Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths * Removed WhiteSpaces * ggml : style changes + fix 512-bit nb loop check - fix local scope in switch cases - consistent predicate names - empty lines when necessary - opening braces, spaces - const-correctness - add asserts * Update ggml/src/ggml-quants.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-09 18:37:18 +03:00
Johannes Gäßler	8e6e2fbe14	CUDA: fix variable name conflict for Windows build (#9382 )	2024-09-09 14:22:53 +02:00
Markus Tavenrath	daa9623ab0	Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (#9118 ) * Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. * fix compile issues * Fix issues where the last submit wasn't executed or handled properly. * remove trailing whitespace * Repair GGML_VULKAN_CHECK_RESULTS * Increase submit counter only if actual work has been submitted and increase submit count to 100. * Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.	2024-09-08 21:43:48 +02:00
Georgi Gerganov	e079bffb66	cuda : fix FA Q src index (1 -> 0) (#9374 )	2024-09-08 22:01:02 +03:00
Neo Zhang Jianyu	2a358fb0c4	[SYCL] add check malloc result on device (#9346 ) * add check malloc result on device * update for review comments, check all malloc_device() result --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2024-09-08 19:05:29 +08:00
Georgi Gerganov	a876861455	metal : update support condition for im2col + fix warning (#0 )	2024-09-08 11:05:55 +03:00
Salvatore Mesoraca	406c1a32a1	vulkan: add dryrun support to sin and cos ops (ggml/947) sin and cos failed test-backend-ops because they tried to dereference a context pointer that is null on dry runs. This commit prevents that segfault. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-09-08 11:05:55 +03:00
Salvatore Mesoraca	9cb9260861	vulkan: correctly report support for OP_CONT (ggml/946) test-backend-ops fails because ggml_cont aborts when invoked passing an unsupported type. This commit makes ggml_cont tests pass Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-09-08 11:05:55 +03:00
Johannes Gäßler	202084d31d	tests: add gradient tests for all backends (ggml/932) * tests: add gradient checking to test-backend-ops * remove old comment * reorder includes * adjust SIN/COS parameters * add documentation, use supports_op if possible	2024-09-08 11:05:55 +03:00
Johannes Gäßler	dbbebcab33	ggml: fix ggml_graph_cpy undefined behavior (ggml/943)	2024-09-08 11:05:55 +03:00
Georgi Gerganov	ba1cf846ed	cann : fix doxy (ggml/0)	2024-09-08 11:05:55 +03:00
Mengqing Cao	d2d3200b38	cann : add Ascend NPU support (whisper/2336) * enable Ascend NPU in src/whisper.cpp * sync test-backend-ops with llama.cpp	2024-09-08 11:05:55 +03:00

1 2 3 4

187 Commits