llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-28 15:18:26 +01:00

Author	SHA1	Message	Date
Georgi Gerganov	b7ddc8bf12	cuda : fix out-of-bounds-access in `mul_mat_vec_q` ggml-ci	2024-01-16 23:06:18 +02:00
Georgi Gerganov	e9a5d54b7d	cuda : update supports_op for IQ2 ggml-ci	2024-01-16 22:13:17 +02:00
Justine Tunney	a0b3ac8c48	ggml : introduce GGML_CALL function annotation (#4850 ) This change makes it possible to build ggml-cuda.cu and ggml-metal.m as independent dynamic shared objects, that may be conditionally linked at runtime in a multiplatform binary. It introduces a GGML_CALL annotation that documents which functions have a cyclic call relationship, between the application code and GPU modules. This change does nothing, unless the build defines -DGGML_MULTIPLATFORM which causes back-references and function pointers to conform to MS ABI which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms	2024-01-16 13:16:33 +02:00
Georgi Gerganov	ddb008d845	cuda : fix dequantize kernel names (#4938 )	2024-01-15 13:27:00 +02:00
Kawrakow	4a3156de2f	CUDA: faster dequantize kernels for Q4_0 and Q4_1 (#4938 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-15 07:48:06 +02:00
Johannes Gäßler	3fe81781e3	CUDA: faster q8_0 -> f16 dequantization (#4895 )	2024-01-12 20:38:54 +01:00
slaren	e7e4df031b	llama : ggml-backend integration (#4766 ) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Johannes Gäßler	1b280c9fff	CUDA: fix softmax compile for old CUDA versions (#4862 )	2024-01-12 12:30:41 +01:00
Kawrakow	49662cbed3	ggml : SOTA 2-bit quants (add IQ2_XS) (#4856 ) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dot product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:39:39 +02:00
Erik Scholz	f34432ca1e	fix : cuda order of synchronization when setting a buffer (ggml/679) * fix : cuda order of synchronization when setting a buffer * also sync before memcpy --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-11 09:39:05 +02:00
Johannes Gäßler	8f900abfc0	CUDA: faster softmax via shared memory + fp16 math (#4742 )	2024-01-09 08:58:55 +01:00
Kawrakow	dd5ae06405	SOTA 2-bit quants (#4773 ) * iq2_xxs: basics * iq2_xxs: scalar and AVX2 dot products Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have. * iq2_xxs: ARM_NEON dot product Somehow strangely slow (112 ms/token). * iq2_xxs: WIP Metal Dequantize works, something is still wrong with the dot product. * iq2_xxs: Metal dot product now works We have PP-512 = 475 t/s TG-128 = 47.3 t/s Not the greatest performance, but not complete garbage either. * iq2_xxs: slightly faster dot product TG-128 is now 48.4 t/s * iq2_xxs: slightly faster dot product TG-128 is now 50.9 t/s * iq2_xxs: even faster Metal dot product TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory. * iq2_xxs: dequantize CUDA kernel - fix conflict with master * iq2_xxs: quantized CUDA dot product (MMVQ) We get TG-128 = 153.1 t/s * iq2_xxs: slightly faster CUDA dot product TG-128 is now at 155.1 t/s. * iq2_xxs: add to llama ftype enum * iq2_xxs: fix MoE on Metal * Fix missing MMQ ops when on hipBLAS I had put the ggml_supports_mmq call at the wrong place. * Fix bug in qequantize_row_iq2_xxs The 0.25f factor was missing. Great detective work by @ggerganov! * Fixing tests * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-08 16:02:32 +01:00
Johannes Gäßler	d5a410e855	CUDA: fixed redundant value dequantization (#4809 )	2024-01-07 17:24:08 +01:00
Konstantin Zhuravlyov	63ee677efd	ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (#4787 )	2024-01-07 08:52:42 +02:00
Finn Voorhees	1bf681f90e	ggml : add error handling to graph_compute (whisper/1714)	2024-01-05 18:02:06 +02:00
Georgi Gerganov	7bed7eba35	cuda : simplify expression Co-authored-by: slaren <slarengh@gmail.com>	2024-01-03 14:38:38 +02:00
Georgi Gerganov	d55356d3ba	cuda : mark I16 and I32 ops as unsupported ggml-ci	2024-01-03 14:38:38 +02:00
Johannes Gäßler	39d8bc71ed	CUDA: fixed tensor cores not being used on RDNA3 (#4697 )	2023-12-30 13:52:01 +01:00
Johannes Gäßler	a20f3c7465	CUDA: fix tensor core logic for Pascal and HIP (#4682 )	2023-12-29 23:12:53 +01:00
hydai	91bb39cec7	cuda: fix vmm oom issue on NVIDIA AGX Orin (#4687 ) Signed-off-by: hydai <hydai@secondstate.io>	2023-12-29 17:31:19 +01:00
bssrdf	afc8c19291	ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669) * fixed mul-mat error for old GPUs * style fixes * add mul mat src1 f16 test cases, fix more cases ggml-ci --------- Co-authored-by: bssrdf <bssrdf@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-29 14:54:19 +02:00
slaren	dc68f0054c	cuda : fix vmm pool with multi GPU (#4620 ) * cuda : fix vmm pool with multi GPU * hip * use recommended granularity instead of minimum * better error checking * fix mixtral * use cudaMemcpy3DPeerAsync * use cuda_pool_alloc in ggml_cuda_op_mul_mat * consolidate error checking in ggml_cuda_set_device * remove unnecessary inlines ggml-ci * style fixes * only use vmm for the main device * fix scratch buffer size, re-enable vmm pool for all devices * remove unnecessary check id != g_main_device	2023-12-26 21:23:59 +01:00
FantasyGmm	77465dad48	Fix new CUDA10 compilation errors (#4635 )	2023-12-26 11:38:36 +01:00
slaren	5bf3953d7e	cuda : improve cuda pool efficiency using virtual memory (#4606 ) * cuda : improve cuda pool efficiency using virtual memory * fix mixtral * fix cmake build * check for vmm support, disable for hip ggml-ci * fix hip build * clarify granularity * move all caps to g_device_caps * refactor error checking * add cuda_pool_alloc, refactor most pool allocations ggml-ci * fix hip build * CUBLAS_TF32_TENSOR_OP_MATH is not a macro * more hip crap * llama : fix msvc warnings * ggml : fix msvc warnings * minor * minor * cuda : fallback to CPU on host buffer alloc fail * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * ensure allocations are always aligned * act_size -> actual_size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2023-12-24 14:34:22 +01:00
slaren	708e179e85	fallback to CPU buffer if host buffer alloc fails (#4610 )	2023-12-23 16:10:51 +01:00
Johannes Gäßler	e0a4002273	CUDA: fixed row rounding for 0 tensor splits (#4594 )	2023-12-23 09:16:33 +01:00
Georgi Gerganov	ba66175132	sync : ggml (fix im2col) (#4591 ) * cuda : fix im2col_f32_f16 (ggml/#658) ggml-ci * ggml-alloc : fix ggml_tallocr_is_own --------- Co-authored-by: leejet <leejet714@gmail.com>	2023-12-22 17:53:43 +02:00
FantasyGmm	a55876955b	cuda : fix jetson compile error (#4560 ) * fix old jetson compile error * Update Makefile * update jetson detect and cuda version detect * update cuda macro define * update makefile and cuda, fix some issue * Update README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update Makefile * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 17:11:12 +02:00
Henrik Forstén	6724ef1657	Fix CudaMemcpy direction (#4599 )	2023-12-22 14:34:05 +01:00
slaren	48b7ff193e	llama : fix platforms without mmap (#4578 ) * llama : fix platforms without mmap * win32 : limit prefetch size to the file size * fix win32 error clobber, unnecessary std::string in std::runtime_error	2023-12-22 13:12:53 +02:00
Georgi Gerganov	afefa319f1	ggml : change ggml_scale to take a float instead of tensor (#4573 ) * ggml : change ggml_scale to take a float instead of tensor * ggml : fix CPU implementation * tests : fix test-grad0 ggml-ci	2023-12-21 23:20:49 +02:00
slaren	d232aca5a7	llama : initial ggml-backend integration (#4520 ) * llama : initial ggml-backend integration * add ggml-metal * cuda backend can be used through ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST access all tensor data with ggml_backend_tensor_get/set * add ggml_backend_buffer_clear zero-init KV cache buffer * add ggml_backend_buffer_is_hos, used to avoid copies if possible when accessing tensor data * disable gpu backends with ngl 0 * more accurate mlock * unmap offloaded part of the model * use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap * update quantize and lora * update session copy/set to use ggml-backend ggml-ci * use posix_fadvise instead of posix_fadvise64 * ggml_backend_alloc_ctx_tensors_from_buft : remove old print * llama_mmap::align_offset : use pointers instead of references for out parameters * restore progress_callback behavior * move final progress_callback call to load_all_data * cuda : fix fprintf format string (minor) * do not offload scales * llama_mmap : avoid unmapping the same fragments again in the destructor * remove unnecessary unmap * metal : add default log function that prints to stderr, cleanup code ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-21 21:07:46 +01:00
Erik Garrison	0f630fbc92	cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449 ) * AMD ROCm: handle UMA memory VRAM expansions This resolves #2797 by allowing ROCm AMD GPU users with a UMA to dynamically expand the VRAM allocated to the GPU. Without this, AMD ROCm users with shared CPU/GPU memory usually are stuck with the BIOS-set (or fixed) framebuffer VRAM, making it impossible to load more than 1-2 layers. Note that the model is duplicated in RAM because it's loaded once for the CPU and then copied into a second set of allocations that are managed by the HIP UMA system. We can fix this later. * clarify build process for ROCm on linux with cmake * avoid using deprecated ROCm hipMallocHost * keep simplifying the change required for UMA * cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON	2023-12-21 21:45:32 +02:00
arlo-phoenix	562cf222b5	ggml-cuda: Fix HIP build by adding define for __trap (#4569 ) Regression of `1398823922` HIP doesn't have trap, only abort	2023-12-21 20:13:25 +01:00
Johannes Gäßler	9154494808	CUDA: mul_mat_id always on GPU for batches >= 32 (#4553 )	2023-12-21 18:42:59 +01:00
bobqianic	66f35a2f48	cuda : better error message for ggml_get_rows (#4561 ) * Update ggml-cuda.cu * Update ggml-cuda.cu * Update ggml-cuda.cu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-21 19:06:44 +02:00
slaren	1398823922	cuda : replace asserts in wrong architecture checks with __trap (#4556 ) * cuda : replace asserts in wrong architecture checks with __trap * make bad_arch noreturn, remove returns	2023-12-21 18:02:30 +01:00
LoganDark	1d7a1912ce	Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554 )	2023-12-21 10:59:27 +01:00
Johannes Gäßler	799fc22689	CUDA: Faster Mixtral prompt processing (#4538 ) * CUDA: make MoE tensors contiguous for batch size>1 * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-20 15:41:22 +01:00
arlo-phoenix	a7aee47b98	ggml-cuda: Fix HIP build (#4528 ) regression of #4490 Adds defines for two new datatypes cublasComputeType_t, cudaDataType_t. Currently using deprecated hipblasDatatype_t since newer ones very recent.	2023-12-18 22:33:45 +01:00
Ebey Abraham	b9e74f9bca	llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490 ) * phi2 implementation * fix breaking change * phi-2 : various fixes * phi-2 : use layer norm eps * py : whitespaces * llama : fix meta KV override bug * convert : phi don't add BOS token * convert : revert "added_tokens_decoder" change * phi-2 : scale Q instead of KQ for better precision * ggml : fix NeoX rope to rotate just first n_dims * cuda : less diff in the rope_neox kernel * ggml : add ggml_mul_mat_set_prec ggml-ci * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> * cuda : ggml_cuda_op_mul_mat_cublas support F32 precision * cuda : remove obsolete comment --------- Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-18 19:27:47 +02:00
slaren	6744dbe924	ggml : use ggml_row_size where possible (#4472 ) * ggml : use ggml_row_size where possible ggml-ci * ggml : move ggml_nbytes_split to ggml-cuda.cu	2023-12-14 20:05:21 +01:00
Georgi Gerganov	4d98d9a656	sync : ggml (SD ops, tests, kernels) (#4444 ) * sync : ggml (SD ops, tests, kernels) ggml-ci * cuda : restore im2col ggml-ci * metal : fix accuracy of dequantization kernels ggml-ci * cuda : restore correct im2col ggml-ci * metal : try to fix moe test by reducing expert size ggml-ci * cuda : fix bin bcast when src1 and dst have different types ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-13 21:54:54 +02:00
slaren	799a1cb13b	llama : add Mixtral support (#4406 ) * convert : support Mixtral as LLAMA arch * convert : fix n_ff typo * llama : model loading * ggml : sync latest ggml_mul_mat_id * llama : update graph to support MoE * llama : fix cur -> cur_expert * llama : first working version * llama : fix expert weighting in the FFN * ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) * ggml : add n_as argument to ggml_mul_mat_id * ggml : fix ggml_get_rows to take into account ne02 / ne11 * metal : add more general support for ggml_get_rows + tests * llama : add basic support for offloading moe with CUDA * metal : add/mul/div use general kernel when src1 not cont * metal : reduce the kernel launches for ggml_mul_mat_id * ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D * ggml : update get_rows f16 and q * cuda : support non-contiguous src1 in get_rows * llama : offload missing ffn_moe_silu * metal : fix ggml_get_rows to work with non-cont src1 * metal : add indirect mat-vec kernels for all quantization types * llama : do not quantize expert gating tensors * llama : add n_expert and n_expert_used to hparams + change quants * test-backend-ops : add moe test * cuda : fix get_rows when ncols is odd * convert : determine n_ctx correctly * metal : fix ggml_mul_mat_id for F32 * test-backend-ops : make experts more evenly probable (test_moe) * test-backend-ops : cleanup, add moe test for batches * test-backend-ops : add cpy from f32 -> all types test * test-backend-ops : fix dequantize block offset * llama : fix hard-coded number of experts * test-backend-ops : simplify and disable slow tests to avoid CI timeout * test-backend-ops : disable MOE test with thread sanitizer * cuda : fix mul_mat_id with multi gpu * convert : use 1e6 rope_freq_base for mixtral * convert : fix style * convert : support safetensors format * gguf-py : bump version * metal : add cpy f16 -> f32 kernel * metal : fix binary ops for ne10 % 4 != 0 * test-backend-ops : add one more sum_rows test * ggml : do not use BLAS with ggml_mul_mat_id * convert-hf : support for mixtral-instruct (#4428) * convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct * convert : use sentencepiece tokenizer for Mixtral-instruct * convert : make flake8 happy * metal : fix soft_max kernels ref: `1914017863` * metal : limit kernels to not use more than the allowed threads --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Radek Pilar <github@mrkva.eu>	2023-12-13 14:04:25 +02:00
Georgi Gerganov	fe680e3d10	sync : ggml (new ops, tests, backend, etc.) (#4359 ) * sync : ggml (part 1) * sync : ggml (part 2, CUDA) * sync : ggml (part 3, Metal) * ggml : build fixes ggml-ci * cuda : restore lost changes * cuda : restore lost changes (StableLM rope) * cmake : enable separable compilation for CUDA ggml-ci * ggml-cuda : remove device side dequantize * Revert "cmake : enable separable compilation for CUDA" This reverts commit `09e35d04b1`. * cuda : remove assert for rope * tests : add test-backend-ops * ggml : fix bug in ggml_concat * ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()` * ci : try to fix macOS * ggml-backend : remove backend self-registration * ci : disable Metal for macOS cmake build ggml-ci * metal : fix "supports family" call * metal : fix assert * metal : print resource path ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 22:26:54 +02:00
Georgi Gerganov	bcc0eb4591	llama : per-layer KV cache + quantum K cache (#4309 ) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 13:03:17 +02:00
Georgi Gerganov	ef47ec18da	ggml : add ggml_soft_max_ext (#4256 ) * metal : implement soft_max_ext * cuda : implement soft_max_ext * ggml : implement soft_max_ext (CPU) * batched-bench : print threads ggml-ci * metal : simplify soft_max encoding ggml-ci * cuda : use 512 threads for soft_max instead of 32 * ggml : update soft max cpu * cuda : do warp-based block reduce * cuda : increase max block size to 1024 * cuda : fix warp reduction initialization of shared mem * metal : warp-based reduction for soft max kernel * metal : warp-based reduce for rms_norm * metal : simplify soft max kernel ggml-ci * alloc : fix build with debug	2023-12-01 10:51:24 +02:00
slaren	8a052c131e	ggml-cuda : support stablelm rope (#4156 ) * ggml-cuda : support stablelm rope * remove unused freq_base kernel parameter * add n_dims parameter to llm_build_k_shift, default to n_rot via overload * llama : fix llm_build_k_shift args --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-11-24 18:04:31 +01:00
Haohui Mai	55978ce09b	Fix incorrect format strings and uninitialized variables. (#4133 ) * Fix incorrect format strings and uninitialized variables. * Address comments * Add the missing include statement	2023-11-23 22:56:53 +01:00
Kerfuffle	2923f17f6f	Clean up ggml-cuda.cu warnings when compiling with clang (for ROCM) (#4124 ) * ggml-cuda.cu: Clean up warnings when compiling with clang * ggml-cuda.cu: Move static items into anonymous namespace * ggml-cuda.cu: Fix use of namespace start macro * Revert "ggml-cuda.cu: Fix use of namespace start macro" This reverts commit `26c1149026`. * Revert "ggml-cuda.cu: Move static items into anonymous namespace" This reverts commit `e29757e0f7`.	2023-11-18 08:11:18 -07:00

179 Commits