llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-01 07:30:17 +01:00

Author	SHA1	Message	Date
Andrew Canis	12247f4c69	llama : add Command-R support (#6033 ) Information about the Command-R 35B model (128k context) can be found at: https://huggingface.co/CohereForAI/c4ai-command-r-v01 Based on the llama2 model with a few changes: 1) New hyper parameter to scale output logits (logit_scale) 2) Uses LayerNorm instead of RMSNorm 3) Transformer layers have a single shared LayerNorm that feeds into both the self-attention and FFN layers in parallel. There is no post-attention LayerNorm. 4) No support for Rotary Position Embeddings (RoPE) scaling 5) No biases used Find GGUF files here: https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF To convert model to GGUF format yourself: 1) Download Command-R Hugging Face safetensors: git lfs install git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01 2) Run: python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01	2024-03-15 22:41:22 +02:00
Ting Lou	4e9a7f7f7f	llava : change API to pure C style for Rust FFI bindgen (#6079 ) Co-authored-by: Lou Ting <louting.t@alibaba-inc.com>	2024-03-15 16:31:05 +02:00
slaren	3020327f6c	cuda : disable unused cudaLaunchHostFunc code (#6078 )	2024-03-15 14:24:03 +02:00
Neo Zhang Jianyu	46acb36767	fix set main gpu error (#6073 )	2024-03-15 18:53:53 +08:00
Georgi Gerganov	131b058409	make : ggml-metal.o depends on ggml.h	2024-03-15 11:38:40 +02:00
AidanBeltonS	753e36f650	[SYCL] Fix non-intel device selection (#6042 ) * Fix non-intel device selection * Update ggml-sycl.cpp Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * Update ggml-sycl.cpp Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2024-03-15 14:56:20 +05:30
Ondřej Čertík	7ce2c77f88	gguf : add support for I64 and F64 arrays (#6062 ) * gguf : add support for I64 and F64 arrays GGML currently does not support I64 or F64 arrays and they are not often used in machine learning, however if in the future the need arises, it would be nice to add them now, so that the types are next to the other types I8, I16, I32 in the enums, and it also reserves their type number. Furthermore, with this addition the GGUF format becomes very usable for most computational applications of NumPy (being compatible with the most common NumPy dtypes: i8, i16, i32, i64, f32, f64), providing a faster, and more versatile alternative to the `npz` format, and a simpler alternative to the `hdf5` format. The change in this PR seems small, not significantly increasing the maintenance burden. I tested this from Python using GGUFWriter/Reader and `gguf-dump`, as well as from C, everything seems to work. * Fix compiler warnings	2024-03-15 10:46:51 +02:00
Xuan Son Nguyen	aab606a11f	llama : add Orion chat template (#6066 )	2024-03-15 10:44:57 +02:00
slaren	b0bc9f4a9d	llama-bench : use random tokens to improve accuracy with mixtral (#6069 )	2024-03-15 10:22:24 +02:00
Georgi Gerganov	4755afd1cb	llama : fix integer overflow during quantization (#6063 )	2024-03-14 22:58:41 +02:00
Steve Grubb	6e0438da3c	gguf : fix resource leaks (#6061 ) There are several places where a gguf context is allocated. A call to gguf_free is missing in some error paths. Also, on Linux, llama-bench was missing an fclose.	2024-03-14 20:29:32 +02:00
Ondřej Čertík	727107707a	gguf-py : bump version to 0.8.0 (#6060 )	2024-03-14 19:57:31 +02:00
Michael Podvitskiy	69ff61397d	llama : support models without vocabulary (#5798 ) * additional methods to read model and ctx parameters * vocab size as a part of a model metadata * models without vocabulary, convert.py part * models without vocabulary, llama.cpp part * PR clean up * converter script fixes * llama_vocab_type update (renamed the new key) * pr review fixes * revert function renaming * one more NoVocab assert	2024-03-14 18:21:56 +02:00
Georgi Gerganov	044ec4b2a5	embedding : add EOS token if not present (#899 )	2024-03-14 15:14:14 +02:00
Georgi Gerganov	77178eedc8	gguf-py : fix dtype check (#6045 )	2024-03-14 13:32:14 +02:00
Jian Liao	15a333260a	readme : improve readme for Llava-1.6 example (#6044 ) Co-authored-by: Jian Liao <jianliao@adobe.com>	2024-03-14 13:18:23 +02:00
Pierrick Hymbert	43241adf22	server: disable debug release type sanitizer, simplify trigger (#6047 ) - increase time out for server - do not fail fast	2024-03-14 13:15:39 +02:00
Georgi Gerganov	a44bc969e4	llama : fix typo	2024-03-14 13:13:06 +02:00
Michael Podvitskiy	2c4fb69246	llama : optimize defrag moves + fix fragmentation calculation (#6037 ) * attempt to reduce the impact of a worst-case scenario * fragmentation calculation fix * Update llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-14 12:56:48 +02:00
Ondřej Čertík	3ca23481dd	gguf-py : add support for I8, I16 and I32 (#6045 ) * Refactor dtype handling to be extensible This code is equivalent as before, but now it is prepared to easily add more NumPy dtypes. * Add support for I8, I16 and I32 These types are allowed in the GGUF specification. * Add support for I8, I16 and I32 to gguf_writer * Add support for I8, I16, I32 to gguf_reader	2024-03-14 12:40:14 +02:00
Georgi Gerganov	3fe8d7a17f	ggml : designate enum vals for integer types (#6050 )	2024-03-14 12:38:37 +02:00
Georgi Gerganov	68265ebfc6	embedding : print all resulting embeddings (#899 )	2024-03-14 12:37:20 +02:00
Georgi Gerganov	381da2d9f0	metal : build metallib + fix embed path (#6015 ) * metal : build metallib + fix embed path ggml-ci * metal : fix embed build + update library load logic ggml-ci * metal : fix embedded library build ggml-ci * ci : fix iOS builds to use embedded library	2024-03-14 11:55:23 +02:00
Georgi Gerganov	0fd6c1f015	embedding : print cosine similarity (#899 )	2024-03-14 10:12:29 +02:00
Linwei Wang	19885d205e	readme : update details about running llama in Termux on Android (#6039 )	2024-03-13 20:34:40 +02:00
Georgi Gerganov	76a936c893	readme : update API changes and hot topics	2024-03-13 20:33:56 +02:00
Clint Herron	463628372d	grammar : handle missing "root" node (#6004 )	2024-03-13 20:10:40 +02:00
slaren	f30ea47a87	llama : add pipeline parallelism support (#6017 ) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-causal attention * llama : do not limit n_batch to n_ctx with non-causal attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-13 18:54:21 +01:00
slaren	d8fd0ccf6a	test-backend-ops : skip CPU backend by default (#6028 )	2024-03-13 15:58:30 +02:00
AidanBeltonS	b3d978600f	Update get version (#6025 )	2024-03-13 18:47:54 +05:30
Xuan Son Nguyen	99b71c068f	Server: Use multi-task for embeddings endpoint (#6001 ) * use multitask for embd endpoint * specify types * remove redundant {"n_predict", 0}	2024-03-13 11:39:11 +01:00
slaren	306d34be7a	ci : remove tidy-review (#6021 )	2024-03-12 17:55:19 +02:00
Georgi Gerganov	8030da7afe	ggml : reuse quantum structs across backends (#5943 ) * ggml : reuse quant blocks across backends ggml-ci * ggml : define helper constants only for CUDA and SYCL ggml-ci * ggml : define helper quantum constants for SYCL ggml-ci	2024-03-12 14:27:20 +02:00
Georgi Gerganov	184215e783	ggml : fix UB in IQ2_S and IQ3_S (#6012 )	2024-03-12 13:49:55 +02:00
Georgi Gerganov	48358b2e5b	sycl : update IQ1_S kernels (WIP - not working!) (#5995 ) * sycl : try to fix after IQ1_S changes * sycl : iq1s_grid -> iq1s_grid_gpu * sycl : fix grid type	2024-03-12 11:15:05 +02:00
gliptic	5cdb371731	grammar : fix unnecessarily retained pointer to rules (#6003 )	2024-03-11 21:59:03 +02:00
Kawrakow	44ca159faf	1.5 bit: we can do even better (#5999 ) * iq1_s: we can do even better Spent one of the 4 scale bits on the sign of a 0.125 shift. I.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85! * iq1_s: make scalar and AVX2 work with the new version * iq1_s: make Neon work with new version. ~10% drop in performance, so will need some more work. * iq1_s: make Metal work with new version * iq1_s: very slightly faster dequantize on Metal * iq1_s: fix dequantize on the CPU --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 17:53:15 +02:00
Georgi Gerganov	05b06210c9	llama : more consistent names of count variables (#5994 ) * llama : more consistent names of count variables ggml-ci * llama : n_parallel -> n_seq_max * common : fix param name * examples : fix param name	2024-03-11 17:49:47 +02:00
Georgi Gerganov	83796e62bc	llama : refactor unicode stuff (#5992 ) * llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref	2024-03-11 17:47:47 +02:00
Jakub N	828defefb6	Update server docker image URLs (#5997 )	2024-03-11 14:40:42 +01:00
Xuan Son Nguyen	caa106d4e0	Server: format error to json (#5961 ) * server: format error to json * server: do not crash on grammar error * fix api key test case * revert limit max n_predict * small fix * correct coding style * update completion.js * launch_slot_with_task * update docs * update_slots * update webui * update readme	2024-03-11 10:56:41 +01:00
Michael Podvitskiy	3202361c5b	ggml, ci : Windows ARM runner and build fixes (#5979 ) * windows arm ci * fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64 * fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned` * fix `error C2065: '__fp16': undeclared identifier`	2024-03-11 11:28:51 +02:00
Minsoo Cheong	332bdfd798	server : maintain chat completion id for streaming responses (#5988 ) * server: maintain chat completion id for streaming responses * Update examples/server/utils.hpp * Update examples/server/utils.hpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-11 10:09:32 +02:00
Gilad S	ecab1c75de	cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985 )	2024-03-11 10:00:08 +02:00
Georgi Gerganov	ee35600b90	llama : fix F16/F32 downcast + improve names (#5980 )	2024-03-11 09:56:47 +02:00
Kawrakow	be858f6205	Better 1.5 bit quantization (#5971 ) * Trying blocks of 16 for IQ1_S - seems slightly better * iq1s_blocks16: Adjust scale fudge factor to 1.125 * iq1s_blocks16: going to blocks of 32 with 2048 lattice points, so same bpw. This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks altogether and just have superblocks of 256 weights. * iq1s_blocks16: Use 2<x^2> as sigma2 in weight adjustment iq1s_blocks16: scalar and AVX2 dot products * iq1s_blocks16: CUDA dot product * iq1s_blocks16: Metal works, Neon does not Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now. * iq1s_blocks16: fixed Neon * iq1s_blocks16: very slightly faster TG on Metal Still pathetic at 37 t/s * iq1s_blocks16: speedup Metal by packing codebook into uint32_t's * Formatting * iq1s_blocks16: uint32_t codebook is also better in CUDA TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants * iq1s_blocks16: slightly faster Neon dot product * iq1s_blocks16: faster AVX2 dot product * iq1s_blocks16: adjust to ggml-common.h --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 07:51:49 +01:00
Abhilash Majumder	ef3ced26a3	[SYCL] Add q3_s and q1_s (#5886 ) * Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space	2024-03-11 10:27:56 +05:30
AidanBeltonS	3814a07392	[SYCL] Add support for SYCL Nvidia target (#5738 ) * Add support for nvidia target in CMake * Update sycl read-me for Nvidia target * Fix errors	2024-03-11 09:13:57 +08:00
Georgi Gerganov	bb6d00bbf9	metal : move mm_id indices to shared mem (#5982 )	2024-03-10 23:12:48 +02:00
Dean	7ab7b733bb	android : fix utf8 decoding error (#5935 ) * examples: fix utf8 decoding error some models have a tokenizer that decodes an id into an incomplete utf8 sequence, need to validate and wait for next token one example would be: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and an example of the token is 18137 * android : minor --------- Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 22:03:17 +02:00
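
Note on 7ab7b733bb (android : fix utf8 decoding error): the message describes validating token output and waiting for the next token when a token decodes to an incomplete UTF-8 sequence. The following is a minimal Python sketch of that idea, not the code from the commit. The names `utf8_missing_bytes` and `TokenTextStream` are hypothetical and chosen only for illustration.

```python
def utf8_missing_bytes(buf: bytes) -> int:
    """Return how many continuation bytes the trailing UTF-8 sequence still lacks.

    0 means the buffer does not end in the middle of a multi-byte character.
    Malformed input also yields 0 and is left to the decoder to handle.
    """
    i = len(buf) - 1
    n_cont = 0
    # Walk back over at most 3 trailing continuation bytes (bit pattern 10xxxxxx).
    while i >= 0 and n_cont < 3 and (buf[i] & 0xC0) == 0x80:
        i -= 1
        n_cont += 1
    if i < 0:
        return 0
    lead = buf[i]
    if lead & 0xE0 == 0xC0:      # 110xxxxx: 2-byte sequence
        need = 1
    elif lead & 0xF0 == 0xE0:    # 1110xxxx: 3-byte sequence
        need = 2
    elif lead & 0xF8 == 0xF0:    # 11110xxx: 4-byte sequence
        need = 3
    else:                        # ASCII or an invalid lead byte
        need = 0
    return max(need - n_cont, 0)


class TokenTextStream:
    """Buffers raw token bytes until they form complete UTF-8 text."""

    def __init__(self) -> None:
        self._pending = b""

    def feed(self, piece: bytes) -> str:
        """Add one token's bytes; return decodable text, or "" if more bytes are needed."""
        self._pending += piece
        if utf8_missing_bytes(self._pending) > 0:
            return ""  # wait for the next token
        text = self._pending.decode("utf-8", errors="replace")
        self._pending = b""
        return text
```

For instance, feeding the three bytes of a CJK character split across two tokens returns an empty string for the first token and the whole character after the second. Python's standard `codecs.getincrementaldecoder("utf-8")` provides the same buffering behaviour and is likely the better choice in real Python code.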


3190 Commits
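
Note on 12247f4c69 (llama : add Command-R support): the message lists a single shared LayerNorm that feeds self-attention and the FFN in parallel, plus a `logit_scale` applied to the output logits. The sketch below illustrates that data flow in plain Python under stated assumptions: the residual sum `x + attn(h) + ffn(h)` is assumed rather than quoted from the commit, the learned LayerNorm weight is omitted, `attn` and `ffn` are stand-in callables, and the function names and the example `logit_scale` value are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Plain LayerNorm (mean and variance), without learned weight or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def parallel_block(
    x: Vector,
    attn: Callable[[Vector], Vector],
    ffn: Callable[[Vector], Vector],
) -> Vector:
    """One layer: a single shared norm feeds both branches; no post-attention norm."""
    h = layer_norm(x)   # computed once, shared by both branches
    a = attn(h)         # self-attention branch (stand-in callable)
    f = ffn(h)          # feed-forward branch (stand-in callable)
    # Assumed residual combination of the input and both branch outputs.
    return [xi + ai + fi for xi, ai, fi in zip(x, a, f)]


def scale_logits(logits: Vector, logit_scale: float) -> Vector:
    """Multiply the output logits by the logit_scale hyperparameter."""
    return [value * logit_scale for value in logits]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    out = parallel_block(x, attn=lambda h: h, ffn=lambda h: [0.0] * len(h))
    print(out)
    print(scale_logits([2.0, -4.0], logit_scale=0.5))  # 0.5 is an arbitrary example value
```

In a sequential llama2-style block, by contrast, the FFN would see the attention output after a second normalization; here both branches read the same normalized input.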