llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-27 22:59:24 +01:00

Author	SHA1	Message	Date
Diego Devesa	d5a409e57f	ggml : fix gelu tables initialization (#10172 )	2024-11-04 20:06:58 +01:00
Diego Devesa	401558b7ba	ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (#10167 )	2024-11-04 17:34:08 +01:00
snadampal	6a066b9978	fix build break on arm64 linux (#10166 ) This fixes the build break from the recent changes to move the CPU backend to separate files https://github.com/ggerganov/llama.cpp/pull/10144	2024-11-04 16:08:33 +01:00
Diego Devesa	ea02c753eb	cuda : clear error after changing peer access (#10153 )	2024-11-04 13:10:23 +01:00
Georgi Gerganov	05697f670b	metal : simplify f16 and f32 dequant kernels (#0 )	2024-11-04 13:49:34 +02:00
Georgi Gerganov	f8e58135cf	metal : move dequantize templates to beginning of MSL source (#0 )	2024-11-04 13:44:06 +02:00
leo-pony	329ed914c9	CANN: adjust backend registry refactor. (#10158 ) remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR.	2024-11-04 19:08:22 +08:00
Yuri Khrustalev	284e5b0275	cmake : make it possible linking ggml as external lib (ggml/1003)	2024-11-04 10:33:11 +02:00
Plamen Minev	e2292aaa17	metal : fix minor string leaks (ggml/1004)	2024-11-04 10:33:10 +02:00
Diego Devesa	9f40989351	ggml : move CPU backend to a separate file (#10144 )	2024-11-03 19:34:08 +01:00
Georgi Gerganov	08828a6d7d	metal : minor fixup in FA kernel (#10143 ) * metal : minor fixup in FA kernel ggml-ci * metal : use the unrolled loop variable * metal : remove unused var	2024-11-03 15:18:40 +02:00
Diego Devesa	a6744e43e8	llama : add simple-chat example (#10124 ) * llama : add simple-chat example --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-11-01 23:50:59 +01:00
Diego Devesa	e991e3127f	llama : use smart pointers for ggml resources (#10117 )	2024-11-01 23:48:26 +01:00
Shupei Fan	418f5eef26	vulkan : improve ggml_vk_create_buffer error handling (#9898 )	2024-11-01 19:33:14 +01:00
Georgi Gerganov	1804adb0cf	ggml : remove ggml_scratch (#10121 ) ggml-ci	2024-11-01 12:58:45 +02:00
Georgi Gerganov	f221d56220	ggml : alloc ggml_contexts on the heap (whisper/2525)	2024-11-01 10:24:50 +02:00
Zhenwei Jin	e597e50794	build: fix build error in Windows env with OneAPI setup (#10107 )	2024-11-01 11:09:59 +08:00
Diego Devesa	c02e5ab2a6	llama : fix buffer checks for mamba and rwk (#10111 ) * llama : fix buffer checks for mamba and rwk * llama : fix missing worst case flag during reserve * cuda : fix supports_op for norm * disable sched SET_CAUSE	2024-10-31 22:54:23 +01:00
Diego Devesa	dea5e86051	ggml : check tensor name lengths in gguf files (#10100 )	2024-10-31 11:40:59 +01:00
Sergio López	1329c0a75e	kompute: add mul_mat_q4_k shader (#10097 ) This is a more or less direct translation from the Metal implementation to GLSL. Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-10-31 11:09:52 +02:00
Sergio López	61408e7fad	kompute: add backend registry / device interfaces (#10045 ) Get in line with the other backends by supporting the newer backend/device registry interfaces. Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-10-30 17:01:52 +01:00
Diego Devesa	b9e02e8184	ggml : fix memory leaks when loading invalid gguf files (#10094 ) * ggml : fix gguf string leak when reading kv pairs fails * ggml : avoid crashing with GGML_ABORT when the KV has an invalid type * ggml : avoid crashing on failed memory allocations when loading a gguf file	2024-10-30 14:51:21 +01:00
xctan	fc83a9e584	ggml : add Q4_0_8_8 RISC-V GEMV and GEMM kernels (#10029 ) * ggml : RISC-V vector gemv for q4_0_8x8 * ggml : Added WIP rvv q4_0_8x8 gemm * ggml : Added initial implementation of rvv gemm * ggml : optimize gemm to avoid register spillover * ggml : Fix GCC rvv load alignment issue * ggml : Format gemm rvv code * ggml : Fix a typo in RVV q4_0_8_8 GEMM	2024-10-30 09:00:40 +02:00
Diego Devesa	c5b0f4b5d9	llama : refactor model loader with backend registry (#10026 )	2024-10-30 02:01:23 +01:00
Changyeon Kim	8f275a7c45	ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763 ) * ggml: Add POOL2D OP for GPU ACC to the Vulkan. - The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend. - A GGML_OP_POOL_2D shader has been added. (Pooling) - The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU. Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> * [fix] Correct the incorrect order of the parameters. fix casting to int. Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> --------- Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>	2024-10-29 09:52:56 +01:00
R0CKSTAR	524afeec9d	musa: workaround for Guilty Lockup in cleaning src0 (#10042 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-10-28 10:02:48 +01:00
bssrdf	8c60a8a462	increase cuda_cpy block size (ggml/996) Co-authored-by: bssrdf <bssrdf@gmail.com>	2024-10-26 10:33:56 +03:00
Georgi Gerganov	668750357e	metal : support permuted matrix multiplications (#10033 ) * metal : support permuted matrix multiplications ggml-ci * cont : use nb01 directly for row steps ggml-ci * cont : add comments [no ci] * metal : minor refactor * metal : minor	2024-10-25 22:26:15 +03:00
Srihari-mcw	2f8bd2b901	llamafile : extend sgemm.cpp support for Q5_0 models (#10010 )	2024-10-25 10:27:41 +03:00
Johannes Gäßler	167a515651	CUDA: fix insufficient buffer clearing for MMQ (#10032 )	2024-10-24 14:40:23 +02:00
Johannes Gäßler	c39665f589	CUDA: fix MMQ for non-contiguous src0, add tests (#10021 ) * CUDA: fix MMQ for non-contiguous src0, add tests * revise test code	2024-10-24 11:09:36 +02:00
Johannes Gäßler	80273a306d	CUDA: fix 1D im2col, add tests (ggml/993)	2024-10-23 16:50:02 +03:00
Daniel Bevenius	c19af0acb1	ggml : remove redundant set of contexts used field (ggml/978) This commit removes the setting of the `used` field of the contexts in the global state (g_state) in `ggml_init`. The motivation for this change is that I believe that this additional initialization might not be required after the changes in Commit 45fc4fed0b9fb5b1af4a8525cbebb95e11208732 ("sync : latest changes from whisper.cpp"), which changed the initialization of the contexts field from `{ 0 }` to `{ { 0 } }`: ```console g_state = (struct ggml_state) { - /*.contexts =*/ { 0 }, + /*.contexts =*/ { { 0 } }, }; ``` My understanding is that the `{0}` initialization might not have zero-initialized all the nested fields in every array element because of compiler differences, and might have been the reason for having the explicit setting of the `used` fields to false.	2024-10-23 16:50:02 +03:00
Jun Hee Yoo	4c9388fb96	metal : add POOL2D and fix IM2COL (#9943 ) * add pool_2d Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * fix im2col and add unittest for N>=1024 Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * add tests for N % 1024 != 0 Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * remove trailing whitespaces Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * apply suggestions Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * apply more optimization - original IM2COL kernel + _ext with MIN() Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * apply review: change kernel name of pool_2d Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * apply review Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> * fix more formatting and enhance readability Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com> --------- Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>	2024-10-23 13:33:45 +03:00
leo-pony	6b8447352d	[CANN] Adapt to dynamically loadable backends mechanism (#9970 ) * [CANN] Adapt to dynamically loadable backends mechanism * Fix the Bug: inference running result is garbled in debug running model for LM models whose type is Q4_0 class * Handle the review comments of this pull request	2024-10-22 16:16:01 +08:00
Georgi Gerganov	f594bc80ba	ggml : add asserts for type conversion in fattn kernels (#9971 ) ggml-ci	2024-10-21 16:20:46 +03:00
Radoslav Gerganov	d5ebd79c76	rpc : pack only RPC structs (#9959 )	2024-10-21 13:35:40 +03:00
Neo Zhang Jianyu	1db8c84fc6	fix mul_mat_vec_q and *_vec_q error (#9939 ) Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2024-10-21 14:26:09 +08:00
Radoslav Gerganov	afd9909a64	rpc : backend refactoring (#9912 ) * rpc : refactor backend Use structs for RPC request/response messages * rpc : refactor server	2024-10-18 14:33:58 +03:00
Ouadie EL FAROUKI	87421a23e8	[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705 ) * implemented missing SYCL event APIs * sycl : Added device and backend reg interfaces * Restructured ggml-sycl.cpp	2024-10-18 06:46:16 +01:00
Ma Mingfei	60ce97c9d8	add amx kernel for gemm (#8998 ) add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend	2024-10-18 13:34:36 +08:00
Diego Devesa	f010b77a37	vulkan : add backend registry / device interfaces (#9721 ) * vulkan : add backend registry / device interfaces * llama : print devices used on model load	2024-10-17 02:46:58 +02:00
Gilad S.	2194200278	fix: allocating CPU buffer with size `0` (#9917 )	2024-10-17 01:34:22 +02:00
Gilad S.	73afe681aa	fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875 ) * fix: use `vm_allocate` to allocate CPU backend buffer on macOS * fix: switch to `posix_memalign` to keep existing `free()` usages work * feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS * style: formatting * fix: move const outside of `#ifndef` * style: formatting * fix: unused var * fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h` * fix: unused var * fix: page align to `GGUF_DEFAULT_ALIGNMENT` * fix: page align to `TENSOR_ALIGNMENT` * fix: convert `TENSOR_ALIGNMENT` to a macro * fix: increase page size to `32` on iOS * fix: iOS page size * fix: `hbw_posix_memalign` alignment	2024-10-17 00:36:51 +02:00
Daniel Bevenius	cd60b88bf7	ggml-alloc : remove buffer_id from leaf_alloc (ggml/987) This commit removes the buffer_id field from the leaf_alloc struct. The motivation is that this field is only written to and never read/used as far as I can tell. Each tensor_alloc has a buffer_id field and this is what caused me to look into this more closely, to understand what the buffer_id in leaf_alloc was used for.	2024-10-16 11:28:01 +03:00
leo-pony	becfd387f6	[CANN] Fix cann compilation error (#9891 ) Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.	2024-10-16 08:51:46 +08:00
agray3	13dca2a54a	Vectorize load instructions in dmmv f16 CUDA kernel (#9816 ) * Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-14 02:49:08 +02:00
Diego Devesa	96776405a1	ggml : move more prints to the ggml log system (#9839 ) * ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print	2024-10-11 15:34:45 +02:00
Diego Devesa	0e9f760eb1	rpc : add backend registry / device interfaces (#9812 ) * rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server	2024-10-10 20:14:55 +02:00
R0CKSTAR	cf8e0a3bb9	musa: add docker image support (#9685 ) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-10-10 20:10:37 +02:00
Diego Devesa	dca1d4b58a	ggml : fix BLAS with unsupported types (#9775 ) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it	2024-10-08 14:21:43 +02:00
Diego Devesa	6374743747	ggml : add backend registry / device interfaces to BLAS backend (#9752 ) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers	2024-10-07 21:55:08 +02:00
Andrew Minh Nguyen	f1af42fa8c	Update building for Android (#9672 ) * docs : clarify building Android on Termux * docs : update building Android on Termux * docs : add cross-compiling for Android * cmake : link dl explicitly for Android	2024-10-07 09:37:31 -07:00
Georgi Gerganov	d5ac8cf2f2	ggml : add metal backend registry / device (#9713 ) * ggml : add metal backend registry / device ggml-ci * metal : fix names [no ci] * metal : global registry and device instances ggml-ci * cont : alternative initialization of global objects ggml-ci * llama : adapt to backend changes ggml-ci * fixes * metal : fix indent * metal : fix build when MTLGPUFamilyApple3 is not available ggml-ci * fix merge * metal : avoid unnecessary singleton accesses ggml-ci * metal : minor fix [no ci] * metal : g_state -> g_ggml_ctx_dev_main [no ci] * metal : avoid reference of device context in the backend context ggml-ci * metal : minor [no ci] * metal : fix maxTransferRate check * metal : remove transfer rate stuff --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-10-07 18:27:51 +03:00
Paul Tsochantaris	96b6912103	metal : single allocation of encode_async block (#9747 ) * Single allocation of encode_async block with non-ARC capture in ggml-metal.m * Moving Block_release to the deallocation code * Release encode block when re-setting encoding buffer count if needed * Update ggml/src/ggml-metal.m --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-10-07 15:26:31 +03:00
SRHMorris	b0915d5b51	vulkan : retry allocation with fallback flags (whisper/2451) Co-authored-by: Samuel Morris <samuel.morris@artlist.io>	2024-10-06 12:52:11 +03:00
Georgi Gerganov	905f5485b2	metal : zero-init buffer contexts (whisper/0)	2024-10-05 15:53:00 +03:00
Daniel Bevenius	55951c018d	ggml : fix typo in example usage ggml_gallocr_new (ggml/984)	2024-10-04 18:50:05 +03:00
Diego Devesa	ff565769f2	ggml : fixes after sync (ggml/983) ggml : remove test-backend-buffer ggml : fix CUDA build warnings	2024-10-04 18:50:04 +03:00
Georgi Gerganov	d5ed2b929d	metal : remove abort (skip) (ggml/0)	2024-10-03 21:18:19 +03:00
Johannes Gäßler	fabdc3bda3	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-03 21:17:26 +03:00
Johannes Gäßler	eee39bdc96	ggml: refactor cross entropy loss CPU impl. (ggml/976)	2024-10-03 21:17:26 +03:00
Jack Mousseau	5d5ab1e5cc	metal : fix compute pass descriptor autorelease crash (#9718 )	2024-10-03 21:01:46 +03:00
Diego Devesa	a7ad553513	ggml-backend : add device description to CPU backend (#9720 )	2024-10-03 17:39:18 +02:00
bandoti	d6fe7abf04	ggml: unify backend logging mechanism (#9709 ) * Add scaffolding for ggml logging macros * Metal backend now uses GGML logging * Cuda backend now uses GGML logging * Cann backend now uses GGML logging * Add enum tag to parameters * Use C memory allocation funcs * Fix compile error * Use GGML_LOG instead of GGML_PRINT * Rename llama_state to llama_logger_state * Prevent null format string * Fix whitespace * Remove log callbacks from ggml backends * Remove cuda log statement	2024-10-03 17:39:03 +02:00
Ouadie EL FAROUKI	5639971466	Fixed dequant precision issues in Q4_1 and Q5_1 (#9711 )	2024-10-03 07:50:44 +01:00
Diego Devesa	c83ad6d01e	ggml-backend : add device and backend reg interfaces (#9707 ) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-03 01:49:47 +02:00
Alberto Cabrera Pérez	f536f4c439	[SYCL] Initial cmake support of SYCL for AMD GPUs (#9658 ) sycl: initial cmake support of SYCL for AMD GPUs	2024-10-02 13:57:18 +01:00
Radoslav Gerganov	00b7317e63	vulkan : do not use tensor->extra (#9407 ) * vulkan : do not use tensor->extra This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used. Ref: #8536 * Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2) --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-10-02 13:49:16 +03:00
Johannes Gäßler	e98c1c188e	test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)	2024-10-01 16:07:40 +03:00
Salvatore Mesoraca	cb00020504	vulkan : mul_mat: fix UB with small warps (ggml/952) When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0. Because they are calculated as: the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16. And the workgroup size is set to be the same as the warp/subgroup size. The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication. When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0. We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8). Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-01 16:07:39 +03:00
Borislav Stanimirov	6c5322481a	ggml : fix ggml_cast (ggml/973)	2024-10-01 16:07:39 +03:00
Johannes Gäßler	7254cdf7e8	ggml: fix gradient allocation logic (ggml/966) * ggml: fix gradient allocation logic * gradient allocation in ggml_build_backward_expand * fixup * fix test-backend-ops grad * suggestions by slaren * fix test1.c * fix legacy opt API * fix test-grad0 * remove keep arg	2024-10-01 16:07:38 +03:00
Georgi Gerganov	cad341d889	metal : reduce command encoding overhead (#9698 ) * metal : reduce command encoding overhead ggml-ci * metal : add comments	2024-10-01 16:00:25 +03:00
Georgi Gerganov	c919d5db39	ggml : define missing HWCAP flags (#9684 ) ggml-ci Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-09-29 21:18:23 +03:00
Johannes Gäßler	aaa4099925	CUDA: remove bad assert (ggml/972)	2024-09-29 21:15:37 +03:00
Jeff Bolz	641002fba8	vulkan : multithread pipeline creation (ggml/963)	2024-09-29 21:15:37 +03:00
Jeff Bolz	0de8b203f1	vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)	2024-09-29 21:15:37 +03:00
Salvatore Mesoraca	544f409b4b	vulkan : argsort barriers must be under uniform control flow (ggml/951) a return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs). BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-09-29 21:15:37 +03:00
Georgi Gerganov	6084bfb261	ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)	2024-09-29 21:15:35 +03:00
Dan Johansson	6a0f779484	ggml : add run-time detection of neon, i8mm and sve (#9331 ) * ggml: Added run-time detection of neon, i8mm and sve Adds run-time detection of the Arm instructions set features neon, i8mm and sve for Linux and Apple build targets. * ggml: Extend feature detection to include non aarch64 Arm arch * ggml: Move definition of ggml_arm_arch_features to the global data section	2024-09-28 15:06:16 +03:00
Markus Tavenrath	89f9944981	Enable use of the ReBAR feature to upload buffers to the device. (#9251 )	2024-09-28 12:05:05 +02:00
R0CKSTAR	7691654c68	mtgpu: enable VMM (#9597 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-26 03:27:40 +02:00
Charles Xu	1e43630218	ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217 ) * ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels * added fallback mechanism when the offline re-quantized model is not optimized for the underlying target. * fix for build errors * remove prints from the low-level code * Rebase to the latest upstream	2024-09-25 16:12:20 +03:00
Dou Xinpeng	904837e0cb	cann: fix crash when llama-bench is running on multiple cann devices (#9627 )	2024-09-25 11:30:38 +08:00
Eric Zhang	70392f1f81	ggml : add AVX512DQ requirement for AVX512 builds (#9622 )	2024-09-24 11:03:21 +03:00
Georgi Gerganov	c038931615	examples : adapt to ggml.h changes (ggml/0) ggml-ci	2024-09-24 11:00:52 +03:00
Georgi Gerganov	cea1486ecf	log : add CONT level for continuing previous log entry (#9610 )	2024-09-24 10:15:35 +03:00
Max Krasnyansky	c087b6f11d	threads: fix msvc build without openmp (#9615 ) We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.	2024-09-23 21:18:48 -07:00
Ivan	116efee0ee	cuda: add q8_0->f32 cpy operation (#9571 ) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 02:14:24 +02:00
Max Krasnyansky	f0c7b5edf8	threads: improve ggml_barrier scaling with large number of threads (#9598 ) Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases. Resurrect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write in the normal case and just use thread-fence as originally intended. --- Here is the original description and suggestions from Willy Tarreau : There's currently some false sharing between n_barrier and n_barrier_passed that is amplified in ggml_barrier() by the fact that all threads need to increment n_barrier when entering, while all previous threads continue to read n_barrier_passed, waiting for the last one to release them all. The side effect is that all these readers are slowing down all new threads by making the cache line bounce back and forth between readers and writers. Just placing them in two distinct cache lines is sufficient to boost the performance by 21% on a 80-core ARM server compared to the no-openmp version, and by 3% compared to the openmp version. Note that the variables could have been spread apart in the structure as well, but it doesn't seem that the size of this threadpool struct is critical so here we're simply aligning them. Finally, the same issue was present when leaving the barrier since all threads had to update the n_barrier_passed counter, though only one would add a non-zero value. This alone is responsible for half of the cost due to undesired serialization. It might be possible that using a small array of n_barrier counters could make things even faster on many-core systems, but it would likely complicate the logic needed to detect the last thread. Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-09-23 11:42:43 -07:00
Srihari-mcw	1e7b9299c6	ggml : AVX512 gemm for Q4_0_8_8 (#9532 ) * AVX512 version of ggml_gemm_q4_0_8x8_q8_0 * Remove zero vector parameter passing * Rename functions and rearrange order of macros * Edit comments * style : minor adjustments * Update x to start from 0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-23 17:06:38 +03:00
Georgi Gerganov	bf9c1013ac	metal : use F32 prec for K*Q in vec FA (#9595 ) ggml-ci	2024-09-23 11:27:47 +03:00
Akarshan Biswas	e62e9789cd	Revert "[SYCL] fallback mmvq (#9088 )" (#9579 ) This reverts commit `50addec9a5`.	2024-09-23 11:28:06 +08:00
R0CKSTAR	c35e586ea5	musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526 ) * mtgpu: add mp_21 support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable unified memory Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-22 16:55:49 +02:00
Molly Sophia	912c331d3d	Fix merge error in #9454 (#9589 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-22 15:26:50 +02:00
Johannes Gäßler	a5b57b08ce	CUDA: enable Gemma FA for HIP/Pascal (#9581 )	2024-09-22 09:34:52 +02:00
Molly Sophia	2a63caaa69	RWKV v6: RWKV_WKV op CUDA implementation (#9454 ) * ggml: CUDA unary op EXP Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: rwkv_wkv op CUDA impl Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-22 04:29:12 +02:00
slaren	d09770cae7	ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573 )	2024-09-21 14:24:23 +02:00
agray3	41f477879f	Update CUDA graph on scale change plus clear nodes/params (#9550 ) * Avoid using saved CUDA graph if scale changes and reset nodes/params on update Fixes https://github.com/ggerganov/llama.cpp/issues/9451 * clear before resize	2024-09-21 02:41:07 +02:00
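
The integer arithmetic described in commit cb00020504 (vulkan : mul_mat: fix UB with small warps) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the shader code: the function names `load_stride` and `clamped_workgroup_size` are hypothetical, and only the formula (workgroup size multiplied by LOAD_VEC_*, divided by 16) and the lower bound of 16 are taken from the commit message.

```c
#include <stdint.h>

/* Hypothetical helper: the stride formula as described in the commit
 * message, i.e. workgroup size * LOAD_VEC_* / 16 with integer division. */
uint32_t load_stride(uint32_t workgroup_size, uint32_t load_vec) {
    return workgroup_size * load_vec / 16;
}

/* Hypothetical helper: the fix as described, the workgroup size is never
 * allowed to drop below 16 even when the subgroup (warp) size is smaller. */
uint32_t clamped_workgroup_size(uint32_t subgroup_size) {
    return subgroup_size < 16 ? 16 : subgroup_size;
}
```

With a subgroup size of 8 and a load vector width of 1, the unclamped stride is 8 * 1 / 16, which truncates to 0, so a loop that advances by that stride never makes progress. Clamping the workgroup size to 16 first gives a stride of 1.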


325 Commits
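
The false-sharing fix described in commit f0c7b5edf8 (threads: improve ggml_barrier scaling with large number of threads) amounts to placing the two counters in distinct cache lines. The C11 sketch below only illustrates that layout idea. The counter names `n_barrier` and `n_barrier_passed` come from the commit message; the struct names, the helper `same_cache_line`, and the 64-byte `CACHE_LINE_SIZE` are assumptions made for illustration and are not taken from the ggml sources.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; the real value is target dependent. */
#define CACHE_LINE_SIZE 64

/* Hypothetical "before" layout: both counters are adjacent, so threads
 * incrementing one and threads polling the other touch the same line. */
struct barrier_counters_packed {
    atomic_int n_barrier;
    atomic_int n_barrier_passed;
};

/* Hypothetical "after" layout: each counter is aligned to its own line. */
struct barrier_counters_padded {
    alignas(CACHE_LINE_SIZE) atomic_int n_barrier;
    alignas(CACHE_LINE_SIZE) atomic_int n_barrier_passed;
};

/* Returns nonzero when two byte offsets fall in the same cache line,
 * assuming the enclosing struct starts on a cache-line boundary. */
int same_cache_line(size_t offset_a, size_t offset_b) {
    return offset_a / CACHE_LINE_SIZE == offset_b / CACHE_LINE_SIZE;
}
```

Checking the member offsets with `offsetof` shows the difference: in the packed struct both counters land in one line, while in the padded struct they are a full line apart, at the cost of a larger struct.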