llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-27 06:39:25 +01:00

Author	SHA1	Message	Date
Kawrakow	f4d7e54974	SOTA 3-bit quants (#5196 ) * iq3_xxs: quantize/dequantize RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more. * iq3_xxs: CUDA dequantize works * iq2_xxs: tuning quantization * iq3_xxs: starting to look better PPL on wiki.test.raw LLaMA-v1-7B: 6.4218 LLaMA-v2-7B: 6.3560 Mistral-7B : 6.0717 This is better than Q3_K_XS, with a 5% reduction in quantized model size. * iq3_xxs: CUDA dot product We have PP-512: 5891 t/s TG-128: 143.9 t/s * iq3_xxs: scalar and AVX2 dot products * iq3_xxs: ARM_NEON and Metal Metal performance is decent, ARM_NEON is pathetic * iq3_xxs: slightly better grid points * Faster iq3_xxs and iq2_xs dot products on CUDA * iq3_xxs: add some quant mix * iq3_xxs: fix failing quantization test Dot product still fails. Is this real? * iq3_xxs: hopefully fix ROCm * iq3_xxs: failing tests This time the dot product accuracy did find an actual bug in the AVX2 implementation. * Add IQ3_XXS to test-backend-ops --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 15:14:12 +02:00
0cc4m	2256f36b79	Vulkan Windows APU Memory Handling (#5199 ) * Add basic UMA memory handling Improve memory OOM behavior Fix tests * Fix UMA handling * Also fix UMA handling for prealloc buffers * Remove unnecessary warning message * Remove outdated comment	2024-01-30 13:59:30 +01:00
Vladimir Malyutin	7359016c7c	quantize : fix typo (#5211 ) Fix misprint in quantize help	2024-01-30 12:57:07 +02:00
divinity76	813416991a	main : allow empty --prompt-cache file (#5176 ) * allow empty --prompt-cache file This allows the use of std::tmpnam(), std::tmpfile(), Python's tempfile.NamedTemporaryFile(), and similar create-empty-file API's for the user. I switched from the C fopen API to the C++ filesystem api to get around the fact that, to the best of my knowledge, C has no portable way to get the file size above LONG_MAX, with std::ftell() returning long? fallback to std::ifstream for c++ < 17 (the project is currently targeting C++11 it seems - file_exists() and file_size() can be removed when we upgrade to c++17) * formatting (requested in codereview) * remove c++17, file_is_empty	2024-01-30 11:18:02 +02:00
Romain Neutron	5589921ef8	readme : minor (#5204 ) This is about tuning the code formatting of the README file	2024-01-30 11:16:38 +02:00
Georgi Gerganov	49f44b5c55	readme : update hot topics	2024-01-30 11:14:44 +02:00
Wu Jian Ping	6685cc41c2	server : improve README (#5209 )	2024-01-30 11:11:46 +02:00
Paul Tsochantaris	ceebbb5b21	ggml alloc: Fix for null dereference on alloc failure (#5200 ) * Fix for a null pointer dereference if a metal GGML buffer fails to be allocated * Freeing the allocated buffers rather than the pointer in ggml-alloc.c * Fixed the fix of the fix	2024-01-29 23:19:29 +01:00
Jared Van Bortel	6daa69ee81	kompute : fix fallback to CPU (#5201 )	2024-01-29 17:11:27 -05:00
Jared Van Bortel	fbf1ddec69	Nomic Vulkan backend (#4456 ) Signed-off-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: niansa <anton-sa@web.de> Co-authored-by: Adam Treat <treat.adam@gmail.com> Co-authored-by: Aaron Miller <apage43@ninjawhale.com> Co-authored-by: ToKiNoBug <tokinobug@163.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-01-29 15:50:50 -05:00
divinity76	2aed77eb06	fix typo "RLIMIT_MLOCK" (#5175 )	2024-01-29 09:45:41 -05:00
Wu Jian Ping	c82d18e863	server : embeddings compatibility for OpenAI (#5190 )	2024-01-29 15:48:10 +02:00
Georgi Gerganov	14fef85e2d	py : fix except (#5194 ) ggml-ci	2024-01-29 15:35:54 +02:00
Sang-Kil Park	e76627bcce	py : improve BPE tokenizer support (#5189 )	2024-01-29 11:24:19 +02:00
slaren	fbe7dfa53c	ggml : add max buffer sizes to opencl and metal backends (#5181 )	2024-01-29 10:05:13 +02:00
Eve	172ac82629	cmake : fix Vulkan build (#5182 )	2024-01-29 10:04:47 +02:00
Paul Tsochantaris	d2f650cb5b	metal : free metal objects (#5161 ) * Releasing MTLFunction references after Metal pipeline construction * Keeping the `ggml_metal_kernel` structure * Spacing fix * Whitespace fix	2024-01-28 21:50:16 +02:00
Georgi Gerganov	35dec26cc2	sync : ggml	2024-01-28 19:48:05 +02:00
Georgi Gerganov	d460510c72	ggml : minor type fix (int64_t -> size_t)	2024-01-28 19:47:31 +02:00
0cc4m	2307523d32	ggml : add Vulkan backend (#2059 ) * Vulkan loader code * Fix matmul kernel, continue implementation * Continue implementation * Vulkan memory management * Vulkan development * Matmul call * Add aligned malloc and free for VMA * Continue implementation * First matmul success * GEMM Kernel optimization * 1D Blocktiling * 2D Blocktiling * Write coalescing * Continue vulkan implementation and optimization * First FP16 attempt, disabled for now * Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel * Enable device extensions properly, restore fp16 matmul op * Fix mulmat_f16 * Output FP32 in fp16 matmul shader * Fix f16_to_f32 kernel * dequant_q4_0 kernel * Add VMA library * Avoid requesting dedicated memory, VMA can decide that by itself * Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly * add cmake commands * Add 2d write operation, profiling code * Fix 2d write * Fix queue selection for AMD RADV * Fix trailing whitespace in vk_mem_alloc.h * Add WIP warp tile mat mul shaders * Disable glslc optimization * Disable glslc optimization for CMake * Optimize warptile matmul shader, replace blocktile with it * Add split-k optimization for small matrix multiplication Use semaphores for synchronization instead of fences or waitidle Rework async write/read for synchronization * Fix validation errors, improve compatibility with AMD GPUs * Rework command buffer handling * Variable matmul kernel using specialization constants * Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints * Reuse semaphores * Handle stage flags during command buffer submission properly * Increase matmul test runs for consistent results * Fix F32 matmul * Add vectorized loading and zeropadding for matrix multiplication * Use pinned memory for f16 preprocessing * Don't force aligned matmul * Don't free before queue done * Replace VMA library with native Vulkan buffer 
management * Basic offloading support with mul_f32 and dmmv for q4_0 * Run glslc commands in parallel * Unroll loops in dmmv shader * Reduce usage of waitIdle * Reuse pinned allocation for f16 conversion * Handle devices with only a single queue * Fix trailing whitespace in CMakeLists.txt * Allow parallel execution of kernels, parallelize third and fourth dimension calls * Add fallback for devices only supporting one DescriptorSet per DescriptorPool * Move to graph function similar to CUDA implementation * Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function * Add F32 dmmv shaders * Batch submissions * Add .spv to gitignore * Split off matrix vector multiplication for separate optimization * Use single command buffer for matrix vector multiplication ops * Reduce overhead of mul_f32 calls by using a single command buffer * Add submission batching to mul_f32 * Fix tests * Add missing barrier * Add further missing barrier * Add further ops * Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions * Remove unnecessary cblas link * Fix descriptor set pre-allocation assert * Add runtime shader compilation, start transferring shaders to this approach * Transfer remaining shaders to header and compile on runtime * Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16 * Add support for q4_1, q5_0, q5_1 and q8_0 * Remove unnecessary scalar layout extension * Parse graph early to pre-record command buffers * Add q6_k support * Add multi-submit for command buffers * Fix q6_k dequant shader for AMD * Fix q6_k for GPUs without fp16 support * Simplify q6_k fp16 fix * Minor fixes * Fix wg_denom of m-mulmat shaders * Add Python-based Vulkan shader generator * Replace shaderc dependency with precompiled shaders Fix python script to generate shaders * Clean up code * Fix shader generator script Windows compatibility Co-authored-by: Concedo 
<39025047+LostRuins@users.noreply.github.com> * Close file before deletion * Fix vulkan shader fp32 name * Add q2_k and q3_k support Add validation check to compare shader results to cpu results * Add q4_k support * Add q5_k support * Bake SPIR-V bytecode into the library instead of loading shaders from file * Switch to signal semaphores for flexibility Prepare broadcasting support for mul mat * Finish broadcasting mul mat support for GQA * Clean up unused functions Add repeat op * Add further ops, not yet enabled. Improve semaphore code * Reduce number of used semaphores by utilizing timelines more properly * Remove queue information * Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations * Add Vulkan to llama-bench * Remove cblas dependency * Fix matmul k-split bug * Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader * Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug * Fix issues with float16 overflows in shaders * Fix issues with older Vulkan headers on Ubuntu 22.04 * Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers * Implement further ops, rework op_f32 calls, fix bugs * Finish full offloading support, add last remaining ops, fix bugs, remove redundant code * Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders * Merge upstream changes, fix conflicts, adapt soft_max op * Fix Python and shader header format * Free model gpu buffers on exit * Use single queue per device to simplify code * Add matmul shader support for running multiple calculations in parallel * Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible * Fix missing event cast * Replace uint64_t(-1) with UINT64_MAX, rename function for clarity * Fix warning about empty C function parameters * Fix compiler warnings * Properly implement Vulkan backend buffer handling * Fix oversized host 
staging buffers * Simplify barrier synchronization calls * Fix gcc warnings * Implement max_size for backend buffer types to limit the size of a single allocation * Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size * refactor multi buf * Disable unsupported ops to fix tests * Check for maintenance4 support before using it * Handle devices with only a single queue * Fix single queue logic * propagate buffer usage in multi buffers * Implement rope_neox op * Cleanup header and other files * Simplify gpu_extras by removing events and putting staging memcpys into contexts * Move queue into context Add not-yet-enabled async backend ops * Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization * Add get_max_size to SYCL backend. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : fix trailing whitespace --------- Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 19:03:59 +02:00
Abhilash Majumder	0f648573dd	ggml : add unified SYCL backend for Intel GPUs (#2690 ) * first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct 
definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. * Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 17:56:23 +02:00
Georgi Gerganov	b764b8f1d0	flake.lock: Update (#5162 )	2024-01-28 14:54:54 +00:00
Johannes Gäßler	9241c3a2ac	Apply min_p to unsorted tokens (#5115 )	2024-01-28 09:59:49 +01:00
Johannes Gäßler	b2b2bf988c	Tests for min_p, sampling queue (#5147 )	2024-01-28 09:35:14 +01:00
Marcus Dunn	af4980bfed	readme : add link to rust bindings (#5148 ) * added link to another set of rust bindings with brief note on differences. * fixed link name	2024-01-28 10:30:44 +02:00
sharpHL	f2e69d28c0	llama : add support for Orion-14B (#5118 ) * add support for Orion-14B(https://huggingface.co/OrionStarAI/Orion-14B-Chat) * flake8 support * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update llama.cpp * Update llama.cpp --------- Co-authored-by: lixiaopu <lixiaopu@cmcm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-01-28 10:00:30 +02:00
Kyle Mistele	39baaf55a1	docker : add server-first container images (#5157) * feat: add Dockerfiles for each platform that use ./server instead of ./main * feat: update .github/workflows/docker.yml to build server-first docker containers * doc: add information about running the server with Docker to README.md * doc: add information about running with docker to the server README * doc: update n-gpu-layers to show correct GPU usage * fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA	2024-01-28 09:55:31 +02:00
John	6db2b41a76	llava : support for Yi-VL and fix for mobileVLM (#5093 ) * Support for Yi-VL, templating fix for mobileVLM * ws * Update examples/llava/clip.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llava-cli.cpp * Update clip.cpp bugfix for new conversions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-27 17:09:18 +02:00
Georgi Gerganov	753eafed0e	sync : ggml	2024-01-27 17:00:24 +02:00
Judd	e976423005	ggml : check ggml_add src1 type (ggml/708) Co-authored-by: Judd <foldl@boxvest.com>	2024-01-27 16:59:00 +02:00
Michael Klimenko	35a2ee9143	Remove unused data and add fixes (#5154 ) * Remove unused data and add fixes * Add missing file * Address review comments * Replace the scope of vq allocation	2024-01-27 15:25:55 +01:00
Maximilian Winter	ec903c0341	server : add self-extend support (#5104 ) * Ported self extension to server example * Update server.cpp * Fixed prompt caching without self extend * Update server.cpp * Added description to server readme. * Update server.cpp * Update server.cpp * Update server.cpp * Update server.cpp * Update README.md * Changed descriptions * server : formatting * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update server.cpp * Update server.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-27 15:38:05 +02:00
0cc4m	a1d6df129b	Add OpenCL add kernel (#5151 ) * Add OpenCL add kernel * Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results	2024-01-26 23:07:32 +01:00
Jared Van Bortel	bbe7c56c99	cmake : pass CPU architecture flags to nvcc (#5146 )	2024-01-26 15:34:06 -05:00
slaren	62fead3ea0	cuda : fix tensor size calculation for non-split buffer (#5145 )	2024-01-26 18:59:43 +01:00
slaren	15b4538ff2	ggml-alloc : add 10% margin to the buffer sizes (#5149 )	2024-01-26 19:18:26 +02:00
snadampal	7032f4f634	ggml : update softmax n_task calculation (#5126 ) updated the n_task calculation to use max number of threads possible. This has improved the prompt eval performance by around 5% for DOT kernels and by around 10% for MMLA kernels on AWS Graviton3.	2024-01-26 19:17:59 +02:00
Georgi Gerganov	5f1925a8ce	scripts : move run-with-preset.py from root to scripts folder	2024-01-26 17:09:44 +02:00
Georgi Gerganov	3b7c914de2	tests : gitignore test-c.o	2024-01-26 14:48:15 +02:00
Xuan Son Nguyen	48c857aa10	server : refactored the task processing logic (#5065 ) * server: add llama_server_queue struct * server: add llama_server_response_event * server: add comments * server: move all mutexes away from server.cpp * server: correct multitask response * server: only add back deferred tasks when one slot is available * server: fix a race condition cause by "request_completion"	2024-01-26 14:42:20 +02:00
crasm	413e7b0559	ci : add model tests + script wrapper (#4586 ) * scripts : add lib.sh and lib_test.sh * scripts : stub out new ci-run.sh script * scripts : switch to PascalCase for functions This looks a little odd at first, but I find it very useful as a convention to know if a command is part of our code vs a builtin. * scripts : add some fancy conversion from snake_case to PascalCase * Add venv to ci/run.sh * Revert scripts work * scripts : add wrapper script for local use of ci/run.sh * Simplify .gitignore for tests, clang-tidy fixes * Label all ctest tests * ci : ctest uses -L main * Attempt at writing ctest_with_model * Update test-model-load-cancel * ci : add ctest_with_model for debug and release ggml-ci * Fix gg_get_model function ggml-ci * got stuck on CMake * Add get_model.cpp to tests/CMakeLists.txt ggml-ci * Fix README.md output for ctest_with_model ggml-ci * workflows : use `-L main` for all ctest ggml-ci * Fixes * GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE * Always show warning rather than failing if model file variable is not set * scripts : update usage text for ci-run.sh	2024-01-26 14:18:00 +02:00
Paul Tsochantaris	6dd3c28c9c	metal : remove unused `n_buffers` and `buffers` (#5129 )	2024-01-26 14:16:07 +02:00
Riceball LEE	38b431de23	gguf : fix "general.alignment" type in gguf_reader.py (#5136 )	2024-01-26 11:10:28 +02:00
Georgi Gerganov	aad0b01d73	readme : update hot topics	2024-01-26 10:52:33 +02:00
Kawrakow	1182cf4d4f	Another bucket sort (#5109 ) * Initial bucket sort * Bucket sort: slightly better version * Bucket sort: another minor improvement --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-26 09:14:39 +02:00
XiaotaoChen	fe54033b69	readme : add MobileVLM 1.7B/3B to the supported models list (#5107 ) Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>	2024-01-25 22:14:32 +02:00
l3utterfly	5eaf9964fc	llama : dynamic temperature sampling (#4972 ) * implemented dynamic temperature sampling from koboldcpp * removed trailing whitespace * removed unused temp parameter in llama_sample_entropy * exposed exponent_val in dynamic temp sampler * added debug check for printf statements * use nullptr in llama_sample_softmax call during llama_sample_entropy this avoids counting the time taken stats twice Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * return earlier if there is only 1 candiate (i.e. max_entropy == 0) * reformat 't' case in llama_sample_queue Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * check for one or zero candidates case in llama_sample_entropy --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-01-25 22:06:22 +02:00
Jared Van Bortel	d292f4f204	examples : make pydantic scripts pass mypy and support py3.8 (#5099 )	2024-01-25 14:51:24 -05:00
Valentin Konovalov	256d1bb0dd	android : use release cmake build type by default (#5123 )	2024-01-25 19:05:51 +02:00
Kawrakow	faa3526a1e	Fix Q3_K_XS for MoE models (#5113 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-25 17:58:53 +02:00


2065 Commits
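
Note on commit 9241c3a2ac ("Apply min_p to unsorted tokens"): the sketch below is an illustration of the general idea only, not the code from that commit. Min-p sampling keeps every token whose probability is at least min_p times the probability of the most likely token. Because p_i / p_max = exp(logit_i - logit_max), the same test can be done directly on raw logits with a single pass, so no sort is needed. The function name min_p_filter and the min_keep fallback are assumptions made for this example.

```python
import math


def min_p_filter(logits, min_p, min_keep=1):
    """Return indices of tokens kept by min-p filtering, without sorting.

    Hypothetical helper for illustration; not part of llama.cpp's API.
    """
    if not logits:
        return []
    if min_p <= 0.0:
        # Nothing can be filtered out; keep every token.
        return list(range(len(logits)))

    # p_i >= min_p * p_max  <=>  logit_i >= logit_max + log(min_p)
    threshold = max(logits) + math.log(min_p)
    kept = [i for i, logit in enumerate(logits) if logit >= threshold]

    # Assumed fallback: if too few tokens survive, take the top min_keep
    # by logit. Only this rare path needs a sort.
    if len(kept) < min_keep:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        kept = order[:min_keep]
    return kept
```

With logits [0.0, log(0.5), log(0.05)] and min_p = 0.1, the first two tokens pass the threshold and the third does not, since its probability is only 0.05 times that of the top token.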