llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-11 21:10:24 +01:00

Author	SHA1	Message	Date
Georgi Gerganov	77e15bec62	metal : remove deprecated error code (#7008 ) b2773	2024-04-30 15:52:21 +03:00
Kevin Gibbons	a68a1e7ed0	metal : log more info on error (#6987 ) b2772	2024-04-30 12:34:50 +03:00
Georgi Gerganov	9c67c2773d	ggml : add Flash Attention (#5021 ) * ggml : add ggml_flash_attn_ext API * ggml : fix GQA support in ggml_flash_attn_ext * ggml : online attention (CPU) * metal : initial implementation * metal : f16 precision * metal : reduce branches * metal : specialize for head size * wip : 8 rows per simd group * wip : 4 rows per simd group * wip : template for rows per warp * metal : parallelize across KV size * metal : parallel reduce across heads * metal : efficient flash_attn_f16 implementation * metal : avoid redundant loads of the attention * metal : scale and mask in matrix form * metal : fix comment * llama : avoid ggml_cast, use F32 query * metal : add parallel reduce version (disabled) * metal : move output into local memory + optimize - the result from each simdgroup now stays in the registers - significantly reduced SRAM usage - more efficient skipping of -INF blocks - avoid simdgroup barrier in hot loop - add comments * metal : add tests, fix scaling, support C > 32 * metal : improve precision * ggml : fix f16 mad * metal : minor * metal : support Q > 8 * tests : add ATTN tests * metal : disable buffer allocation logs * tests : more * metal : faster inner loop for C == 32 * metal : fix array initialization * tests : ifdef * ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext * ggml : fix ggml_soft_max mask requirement * cuda : fix soft_max to use correct mask size * cuda : add flash_attn kernel (wip) * metal : optimize softmax for C > 32 * metal : optimize softmax * tests : minor fix * cuda : avoid zeroing fragments * tests : update dims * cuda : fix __hisinf() result check * cuda : avoid warp_reduce for smax * cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes) * cuda : make loops use the same loop values Thanks Johannes again for the tip * cuda : unroll some of the loops * cuda : avoid __hisinf branches * cuda : use half2 in softmax * cuda : switch to 1 warp for bs > 16 * cuda : speed-up reduce part of the kernel * cuda : unroll QK^T loop cuda : fix -INF block check * cuda : simplify softmax * cuda : fix matrix names * cuda : minor * llama : adapt to F16 KQ_pos * llama : adapt new models to F16 KQ_mask * ggml : fix F16 store (ARM NEON) * llama : fix type of KQ_mask and KQ_pos * ggml : fix CPU soft_max * tests : add hs=256 * cuda : fix build * metal : improve perf via smaller int registers * cuda : adapt soft_max to F16 mask and pos * CUDA: faster FlashAttention, kernel for bs == 1 * 16 cols for Phi-2 * no vec for hs, no hs==256 ncols==32 for Volta * adjust kernel selection logic * 4 warps, 256 stride for all D * no ncols == 64 * Multiple parallel blocks for batch size 1 * fix compile warnings * fix excessive KQ_b loads * fix cmake build * fix KV cache padding, NaN from INFINITY (#6438) * llama : flash_attn cparam + fix defrag * server: support flash_attn param * server: bench: enable flash_attn param * CUDA: refactor host code, dyn. par. blocks * fix flash_attn_vec_f16 race condition * flush softmax exp below threshold to 0 * store temp KQ in registers * Calculate KQ as FP32 if KQV has GGML_PREC_F32 * Add __hgt2_mask implementation for CUDA 11 * fix KQ FP32 precision fpr parallel_blocks > 1 * llama-bench : add -fa,--flash-attn arg * metal : add BS=1 kernel for flash attention (#6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel * metal : use F32 attention accumulators * batched-bench : add fattn arg * llama : simplify llama_build_kv_store ggml-ci * llama : adapt build_olmo to changes * ggml : fix arm fp16 store on windows * metal : clean-up * metal : clean-up kernel code * metal : minor * tests : remove benchmarks ggml-ci * ggml : fix avx512 const correctness ggml-ci * ggml : fix soft_max with bias on CPU ggml-ci * common : print --flash-attn in help * ggml : fix num dimensions in ggml_flash_attn_ext * llama : force disable flash attention for incompatible models * ggml : ggml_soft_max support F16/F32 mask/pos ggml-ci * cuda : uint -> uint32_t * cuda : "constexpr dim3" -> "const dim3" ggml-ci * cuda : try to fix __hgt2_mask ggml-ci * ggml : add TODO's for F16/F32 mask/pos support in other backends * llama : replace bool need_kq_pos with use_alibi * llama : prep ALiBi support for BERT models ggml-ci * llama : fix n_batch requirements ggml-ci * cont * server : add help for --flash-attn arg * llama : disable FA for AMD * tests : remove TMP_ATTN_BENCH ggml-ci * llama : support save/load state with FA enabled ggml-ci * ci : add CUDA save-load-state tests ggml-ci * llama : llama_kv_cache_clear zeroes data + fix save-load seq ggml-ci * llama : fix copy-paste errors, add TODO * llama : disallow incompatible states * llama : update llama_state_get_size after v_trans field * metal : remove tmp log * llama : add static reminder for llama_state_get_size * metal : fix max nsg ggml-ci * ci : fix arg order ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com> b2771	2024-04-30 12:16:08 +03:00
Georgi Gerganov	952d03dbea	convert : use utf8 encoding (#7000 ) * convert : use utf8 encoding * convert : update instructions and warning message	2024-04-30 11:05:25 +03:00
Olivier Chafik	8843a98c2b	Improve usability of --model-url & related flags (#6930 ) * args: default --model to models/ + filename from --model-url or --hf-file (or else legacy models/7B/ggml-model-f16.gguf) * args: main & server now call gpt_params_handle_model_default * args: define DEFAULT_MODEL_PATH + update cli docs * curl: check url of previous download (.json metadata w/ url, etag & lastModified) * args: fix update to quantize-stats.cpp * curl: support legacy .etag / .lastModified companion files * curl: rm legacy .etag file support * curl: reuse regex across headers callback calls * curl: unique_ptr to manage lifecycle of curl & outfile * curl: nit: no need for multiline regex flag * curl: update failed test (model file collision) + gitignore *.gguf.json b2769	2024-04-30 00:52:50 +01:00
Clint Herron	b8c1476e44	Extending grammar integration tests (#6644 ) * Cleaning up integration tests to share code between tests and make it simpler to add new tests. * Add tests around quantifiers to ensure both matching and non-matching compliance. * Add slightly more complex grammar with quantifiers to test references with quantifiers. * Fixing build when C++17 is not present. * Separating test calls to give more helpful stack traces on failure. Adding verbose messages to give visibility for what is being tested. * Adding quotes around strings to explicitly show whitespace * Removing trailing whitespace. * Implementing suggestions from @ochafik -- grammars and test strings now print and flush before tests to aid in debugging segfaults and whatnot. * Cleaning up forgotten symbols. Modifying simple test to use test harness. Added comments for more verbose descriptions of what each test is accomplishing. * Unicode symbol modifications to hopefully make log easier to parse visually. b2768	2024-04-29 14:40:14 -04:00
Daniel Bevenius	5539e6fdd1	main : fix typo in comment in main.cpp (#6985 ) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> b2767	2024-04-29 13:56:59 -04:00
Olivier Chafik	b8a7a5a90f	build(cmake): simplify instructions (`cmake -B build && cmake --build build ...`) (#6964 ) * readme: cmake . -B build && cmake --build build * build: fix typo Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * build: drop implicit . from cmake config command * build: remove another superfluous . * build: update MinGW cmake commands * Update README-sycl.md Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * build: reinstate --config Release as not the default w/ some generators + document how to build Debug * build: revert more --config Release * build: nit / remove -H from cmake example * build: reword debug instructions around single/multi config split --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> b2766	2024-04-29 17:02:45 +01:00
Georgi Gerganov	d2c898f746	ci : tmp disable gguf-split (#6983 ) ggml-ci	2024-04-29 18:36:39 +03:00
Georgi Gerganov	544f1f10ad	ggml : fix __MSC_VER -> _MSC_VER (#6977 ) ggml-ci b2764	2024-04-29 17:55:02 +03:00
cpumaxx	ffe666572f	llava-cli : multiple images (#6969 ) Co-authored-by: root <root@nenya.lothlorien.ca> b2763	2024-04-29 17:34:24 +03:00
Georgi Gerganov	24affa7db3	readme : update hot topics	2024-04-29 17:06:19 +03:00
Georgi Gerganov	f4ab2a4147	llama : fix BPE pre-tokenization (#6920 ) * merged the changes from deepseeker models to main branch * Moved regex patterns to unicode.cpp and updated unicode.h * Moved header files * Resolved issues * added and refactored unicode_regex_split and related functions * Updated/merged the deepseek coder pr * Refactored code * Adding unicode regex mappings * Adding unicode regex function * Added needed functionality, testing remains * Fixed issues * Fixed issue with gpt2 regex custom preprocessor * unicode : fix? unicode_wstring_to_utf8 * lint : fix whitespaces * tests : add tokenizer tests for numbers * unicode : remove redundant headers * tests : remove and rename tokenizer test scripts * tests : add sample usage * gguf-py : reader prints warnings on duplicate keys * llama : towards llama3 tokenization support (wip) * unicode : shot in the dark to fix tests on Windows * unicode : first try custom implementations * convert : add "tokenizer.ggml.pre" GGUF KV (wip) * llama : use new pre-tokenizer type * convert : fix pre-tokenizer type writing * lint : fix * make : add test-tokenizer-0-llama-v3 * wip * models : add llama v3 vocab file * llama : adapt punctuation regex + add llama 3 regex * minor * unicode : set bomb * unicode : set bomb * unicode : always use std::wregex * unicode : support \p{N}, \p{L} and \p{P} natively * unicode : try fix windows * unicode : category support via std::regex * unicode : clean-up * unicode : simplify * convert : add convert-hf-to-gguf-update.py ggml-ci * lint : update * convert : add falcon ggml-ci * unicode : normalize signatures * lint : fix * lint : fix * convert : remove unused functions * convert : add comments * convert : exercise contractions ggml-ci * lint : fix * cmake : refactor test targets * tests : refactor vocab tests ggml-ci * tests : add more vocabs and tests ggml-ci * unicode : cleanup * scripts : ignore new update script in check-requirements.sh * models : add phi-3, mpt, gpt-2, starcoder * tests : disable obsolete ggml-ci * tests : use faster bpe test ggml-ci * llama : more prominent warning for old BPE models * tests : disable test-tokenizer-1-bpe due to slowness ggml-ci --------- Co-authored-by: Jaggzh <jaggz.h@gmail.com> Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com> b2761	2024-04-29 16:58:41 +03:00
David Renshaw	3f167476b1	sampling : use std::random_device{}() for default random seed (#6962 ) b2760	2024-04-29 16:35:45 +03:00
Christian Zhou-Zheng	3055a41805	convert : fix conversion of some BERT embedding models (#6937 )	2024-04-29 16:34:41 +03:00
Przemysław Pawełczyk	577277ffd2	make : change GNU make default CXX from g++ to c++ (#6966 )	2024-04-29 16:08:20 +03:00
Przemysław Pawełczyk	ca7f29f568	ci : add building in MSYS2 environments (Windows) (#6967 ) b2757	2024-04-29 15:59:47 +03:00
Johannes Gäßler	c4f708a93f	llama : fix typo LAMMAFILE -> LLAMAFILE (#6974 ) b2756	2024-04-29 15:36:22 +03:00
DAN™	e00b4a8f81	Fix more int overflow during quant (PPL/CUDA). (#6563 ) * Fix more int overflow during quant. * Fix some more int overflow in softmax. * Revert back to int64_t. b2755	2024-04-29 00:38:44 +02:00
Xuan Son Nguyen	7bb36ccf91	gguf : enforce that tensor names are unique (#6905 ) * not allow adding duplicated tensor name * no duplicated tensor while reading gguf * typo * throw exception inside llama_model_loader Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> b2754	2024-04-28 17:36:18 +02:00
Neo Zhang	ce023f6f2f	add device version in device list (#6959 ) Co-authored-by: arthw <> b2753	2024-04-28 22:40:31 +08:00
github-actions[bot]	6e472f58e4	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv%2Bl0FhvfDMiyCmhoRbNB%2B0SeInZkbk%3D' (2024-04-19) → 'github:NixOS/nixpkgs/7bb2ccd8cdc44c91edba16c48d2c8f331fb3d856?narHash=sha256-Drmja/f5MRHZCskS6mvzFqxEaZMeciScCTFxWVLqWEY%3D' (2024-04-25)	2024-04-28 11:12:50 +00:00
mgroeber9110	4dba7e8114	Replace "alternative" boolean operator in conditional compilation directive (#6949 ) b2751	2024-04-27 21:02:06 +02:00
Pierrick Hymbert	b7368332e2	ci: server: tests python env on github container ubuntu latest / fix n_predict (#6935 ) * ci: server: fix python env * ci: server: fix server tests after #6638 * ci: server: fix windows is not building PR branch b2750	2024-04-27 17:50:48 +02:00
agray3	928e0b7013	Reset schedule earlier to allow overlap with ggml graph computation on device (#6933 ) * Reset schedule earlier to allow overlap with graph computation on device b2749	2024-04-26 20:08:30 +02:00
Pierrick Hymbert	0c4d489e29	quantize: add imatrix and dataset metadata in GGUF (#6658 ) * imatrix: save the dataset file used in the output file * llama: support kv overrides type string string * common: factorize KV Overrides parsing between common and server * quantize: add imatrix n entries and dataset KV metadata quantize: factorize KV Overrides parsing between common #6656 * llama: remove kv override str_value initialization as it does not compile on some toolchain * quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count` * quantize: add imatrix filename in KV * llama: add llama_model_kv_override_free * common: add llama_model_kv_override_free common: free kv override if used after model loading * llama: finally move the string KV override value to the stack * llama : minor * no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators. Co-authored-by: slaren <slarengh@gmail.com> * kv override: ensure string termination --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com> b2748	2024-04-26 20:06:33 +02:00
slaren	017e6999b5	add basic tensor data validation function (#6884 ) * add basic tensor data validation function * add --check-tensors command line argument tensor validation is disabled by default and can be enabled by adding `--check-tensors` to the command line arguments. quantize always validates tensors. b2747	2024-04-26 18:39:58 +02:00
slaren	e2764cd7ca	gguf : fix mismatch between alloc and free functions (#6929 ) b2746	2024-04-26 18:07:42 +03:00
Justine Tunney	4b1c3c98b4	llamafile : use 64-bit integers in sgemm (#6928 )	2024-04-26 17:05:33 +03:00
Pierrick Hymbert	bbe3c6e761	ci: server: fix python installation (#6925 )	2024-04-26 12:27:25 +02:00
Pierrick Hymbert	7f5ff558ee	server: stop generation at `n_ctx_train` if `n_predict` is not set (#6638 ) * server: cap n_predict if not set to n_ctx_train * server: fix infinite loop * server: infinite loop, move in process_token server: infinite loop: set stop limit to true * minor: spaces * minor: spaces * server: include prompt tokens in the EOS limit	2024-04-26 12:15:30 +02:00
Pierrick Hymbert	9e4e077ec5	ci: server: fix python installation (#6922 )	2024-04-26 11:11:51 +02:00
Georgi Gerganov	83b72cb086	Merge pull request from GHSA-p5mv-gjc5-mwqv * always use calloc clamp n_kv on failure to read a kv * ggml : alternative ctx->header.n_kv update --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-04-26 10:41:53 +03:00
Pierrick Hymbert	d4a9afc100	ci: server: fix python installation (#6918 ) b2740	2024-04-26 09:27:49 +02:00
Pierrick Hymbert	7d641c26ac	ci: fix concurrency for pull_request_target (#6917 )	2024-04-26 09:26:59 +02:00
Pierrick Hymbert	5790c8dac1	bench: server add stop word for PHI-2 (#6916 )	2024-04-26 09:26:16 +02:00
vik	46e12c4692	llava : add support for moondream vision language model (#6899 ) * add support for moondream vision language model This required making the following changes to the CLIP model: 1. Support for patch embedding bias. 2. Make class embedding and pre-layernorm optional. 3. Add support for post-layernorm. * Update examples/llava/clip.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b2737	2024-04-25 22:38:31 +03:00
Georgi Gerganov	dba497e0c1	cmake : restore LLAMA_LLAMAFILE_DEFAULT b2736	2024-04-25 21:37:27 +03:00
Georgi Gerganov	fa0b4ad252	cmake : remove obsolete ANDROID check b2735	2024-04-25 18:59:51 +03:00
slaren	d6e1d44f16	llama : synchronize before get/set session data (#6911 ) b2734	2024-04-25 17:59:03 +02:00
Georgi Gerganov	853d06ffe2	ci : tmp disable slow tests	2024-04-25 17:06:27 +03:00
BarfingLemurs	3fe0596c18	readme : update model list (#6908 ) * Update README.md * missing space * llama3 !	2024-04-25 16:52:28 +03:00
slaren	0ead1f1072	llama : check that all the tensor data is in the model file (#6885 ) * llama : check that all the tensor data is in the model file * also check for unsigned overflow b2731	2024-04-25 15:23:47 +02:00
Georgi Gerganov	51543729ff	ggml : fix redefinition of vaddvq_f32 for 32-bit ARM (#6906 ) b2730	2024-04-25 15:48:25 +03:00
Daniel Bevenius	4ab99d8d47	clip : rename lerp function to avoid conflict (#6894 ) This commit renames the lerp (linear interpolation) function in clip.cpp to avoid a conflict with the lerp function in the <cmath> standard C++ library when using c++20. The motivation for this change is to enable projects that use c++20 to be able to compile clip.cpp without having to resort to patching it. The lerp function was added to cmath in version C++20 (202002L) and is why this is not causing any issue at the moment as C++11/C++17 is currently used by llama.cpp. I realize that llama.cpp uses either C++11 (or C++17 in the case for SYCL) but wanted to ask if this would be an acceptable change just the same. Refs: https://en.cppreference.com/w/cpp/numeric/lerp Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> b2729	2024-04-25 15:38:14 +03:00
Georgi Gerganov	54770413c4	ggml : fix MIN / MAX macros (#6904 ) ggml-ci b2728	2024-04-25 15:12:28 +03:00
Georgi Gerganov	aa750c1ede	tests : minor bash stuff (#6902 ) * tests : minor bash stuff ggml-ci * llama : fix build ggml-ci * tests : fix CUR_DIR -> ROOT_DIR ggml-ci * tests : fix fname ggml-ci b2727	2024-04-25 14:27:20 +03:00
jiez	1966eb2615	quantize : add '--keep-split' to quantize model into shards (#6688 ) * Implement '--keep-split' to quantize model into several shards * Add test script * Update examples/quantize/quantize.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Split model correctly even if tensor id is out-of-order * Update llama_model_quantize_params * Fix preci failures --------- Co-authored-by: z5269887 <z5269887@unsw.edu.au> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-25 13:29:35 +03:00
Johannes Gäßler	784e11dea1	README: add graphic for matrix multiplication (#6881 )	2024-04-24 21:29:13 +02:00
Douglas Hanley	b4e4b8a935	llama : add llama_get_pooling_type function (#6862 ) * add llama_get_pooling_type function * fix argument name, move with ctx funcs b2724	2024-04-24 16:10:07 +03:00


2923 Commits