llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-01 15:40:21 +01:00

Author	SHA1	Message	Date
Georgi Gerganov	ee1a0ec9cb	llama : add option for greedy sampling with probs (#3813 ) * llama : add option for greedy sampling with probs * llama : add comment about llama_sample_token_greedy() missing probs * sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs	2023-10-28 14:23:11 +03:00
Henk Poley	177461104b	common : print that one line of the syntax help also to standard output (#3823 )	2023-10-28 13:16:33 +03:00
Georgi Gerganov	fdee152e4e	starcoder : add GPU offloading (#3827 ) * starcoder : do not GPU split 1D bias tensors * starcoder : offload layers to GPU ggml-ci	2023-10-28 12:06:08 +03:00
Kerfuffle	41aee4df82	speculative : ensure draft and target model vocab matches (#3812 ) * speculative: Ensure draft and target model vocab matches * Tolerate small differences when checking dft vs tgt vocab	2023-10-28 00:40:07 +03:00
cebtenzzre	6d459cbfbe	llama : correctly report GGUFv3 format (#3818 )	2023-10-27 17:33:53 -04:00
Thibault Terrasson	c8d6a1f34a	simple : fix batch handling (#3803 )	2023-10-27 08:37:41 -06:00
Georgi Gerganov	2f9ec7e271	cuda : improve text-generation and batched decoding performance (#3776 ) * cuda : prints wip * cuda : new cublas gemm branch for multi-batch quantized src0 * cuda : add F32 sgemm branch * cuda : fine-tune >= VOLTA params + use MMQ only for small batches * cuda : remove duplicated cuBLAS GEMM code * cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros * build : add compile option to force use of MMQ kernels	2023-10-27 17:01:23 +03:00
Georgi Gerganov	34b2a5e1ee	server : do not release slot on image input (#3798 )	2023-10-26 22:54:17 +03:00
Georgi Gerganov	6961c4bd0b	batched-bench : print params at start	2023-10-25 10:26:27 +03:00
Georgi Gerganov	cc44877486	log : disable pid in log filenames	2023-10-25 10:09:16 +03:00
cebtenzzre	ad93962657	server : add parameter -tb N, --threads-batch N (#3584 ) (#3768 ) Co-authored-by: Michael Coppola <m18coppola@gmail.com> Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2023-10-24 23:10:43 +03:00
Georgi Gerganov	1717521cdb	server : do not block system prompt update (#3767 ) * server : do not block system prompt update * server : update state machine logic to process system prompts * server : minor	2023-10-24 23:08:20 +03:00
Georgi Gerganov	b2f7e04bd3	sync : ggml (conv ops + cuda MSVC fixes) (#3765 ) ggml-ci	2023-10-24 21:51:20 +03:00
John Smith	abd21fc99f	cmake : add missed dependencies (#3763 )	2023-10-24 20:48:45 +03:00
Georgi Gerganov	2b4ea35e56	cuda : add batched cuBLAS GEMM for faster attention (#3749 ) * cmake : add helper for faster CUDA builds * batched : add NGL arg * ggml : skip nops in compute_forward * cuda : minor indentation * cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops) * Apply suggestions from code review These changes plus: ```c++ #define cublasGemmBatchedEx hipblasGemmBatchedEx ``` are needed to compile with ROCM. I haven't done performance testing, but it seems to work. I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up. * cuda : add ROCm / hipBLAS cublasGemmBatchedEx define * cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases * cuda : reduce mallocs in cublasGemmBatchedEx branch * cuda : add TODO for calling cublas from kernel + using mem pool --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-10-24 16:48:37 +03:00
Galunid	daab3d7f45	Add more tokenizer tests (#3742 ) * Add more tokenizer tests * Add starcoder * Update test vocab files * Restrict bpe tokenizer tests to unicode planes * Update comment * Comment cosmetics * Remove bloom vocab/test	2023-10-24 09:17:17 +02:00
Georgi Gerganov	469c9addef	metal : handle ggml_scale for n%4 != 0 (close #3754 ) ggml-ci	2023-10-24 09:47:22 +03:00
Georgi Gerganov	e3932593d4	Revert "make : add optional CUDA_NATIVE_ARCH (#2482 )" This reverts commit `96981f37b1`. See: https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866	2023-10-23 23:46:05 +03:00
M. Yusuf Sarıgöz	9d02956443	issues : separate bug and enhancement template + no default title (#3748 )	2023-10-23 22:57:16 +03:00
Galunid	69a6735087	Update special token handling in conversion scripts for gpt2 derived tokenizers (#3746 ) We still have the heads up in `README.md` regarding `bpe` tokenizers and this patch is needed for - a couple of tokenizer tests - some more `special` and `non-special` added tokens handling (as far as I understand it) * Update special token handling * Add mpt	2023-10-23 21:46:00 +02:00
Marcus Dunn	5be6c803fa	llama : remove token functions with `context` args in favor of `model` (#3720 ) * added `llama_model_token_` variants to all the `llama_token_` functions. * added `LLAMA_API` * formatting Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * removed old `llama_token` functions * changed 3 more functions to take in model - `llama_token_get_text` - `llama_token_get_score` - `llama_token_get_type` * added back docs * fixed main.cpp * changed token functions to use new model variants * changed token functions to use new model variants --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-23 22:40:03 +03:00
Galunid	6336701c93	Fix baichuan convert script not detecing model (#3739 ) It seems nobody objects.	2023-10-23 17:47:03 +02:00
Alex	96981f37b1	make : add optional CUDA_NATIVE_ARCH (#2482 ) Use the environment variable `CUDA_NATIVE_ARCH` if present to set NVCC arch. Otherwise, use `native`.	2023-10-22 22:56:53 +03:00
Georgi Gerganov	438c2ca830	server : parallel decoding and multimodal (#3677 ) * implementing parallel decoding in server example * crash fixed * save dev progress * refactored sampling function * completion endpoint working * multiple client support * grammar + no stream completion * cached prompt support * chat.mjs support cached prompt + some fixes * server ui now support multiple clients * unused change reverted * fixed timings per slot * add context swap * add changes to README.md * llava multimodal integration * fixed tokens probs * add multimodal input - alfa * refactor code + remove unused comments + improved README.md * fix compilation errors with llvm * notify the user from server ui that multimodality is unavialable * some ci fixes * fix ci make build undefined ref errors * fix long prompt than ctx proposed in #3639 * fixed premature end due stop word * context shift fixed * fix llava implementation * sync README.md changes * readme change * update api like OpenAI * multimodal support enabled by default * fix make bui;d errors * fix multiple clients * fix zig build * new sampling API * latest changes of sampling API * server : coding-style normalization * server : coding-style normalization (part 2) * server : remove beam-search functionality * server : bug fix in ingest_images n_tokens is incremented internally by llama_batch_add * server : use refs + use llama_batch_clear() * server : snake case * server : minor sync * added thread safe pipeline * server : bach has to be allocated for n_parallel sequences * server : no need for atomic int - already using mutex * server : logs + minor code style * server : fix multibyte handle in partial response (#3706) * fix image load + view image in chat * make : silence stb warnings * clip : link to ggml, not to llama * server : fix switch fallthrough * server : fix crash in Debug on macOS (I have no idea why this fixes it!?) * server : refactor ctx_sampling init + n_ctx + names * server : bug fix for prompt caching * Do not save/load image_data to localStorage * editorconfig : new line in index.html * server : completion requests remember slot_id * Update readme to document multimodal in server * server : minor style * Update readme to document multimodal in server * server : hide ctx_sampling->prev behind API (#3696) * server : apply fix from #3722 * server : fix slot reuse * server : add comment about changing slot_state to bool --------- Co-authored-by: FSSRepo <go778sgt@gmail.com> Co-authored-by: Damian Stewart <d@damianstewart.com> Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com> Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com> Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>	2023-10-22 22:53:08 +03:00
goerch	9e70cc0322	Add test for MPT tokenization (#3728 ) * Add test for MPT tokenization * Revert code motion * Remove unnecessary restriction in test case * Clarify logic in conversion	2023-10-22 21:21:42 +02:00
Ian Scrivener	5a42a5f8e8	readme : remove unsupported node.js library (#3703 ) - https://github.com/Atome-FE/llama-node is quite out of date - doesn't support recent/current llama.cpp functionality	2023-10-22 21:16:43 +03:00
Kerfuffle	a5e7dbd614	llama : validate special token ids are in range when loading GGUF model (#3635 ) * Add validation for special token ids to llama.cpp Small optimization for llama_byte_to_token SPM mode * Fix BPE newline check, only I could break something so simple * Killll meeeeee * Account for GGUF_KEY_KEY only setting when the key exists * Minor code cleanups. * Fix convert.py error msg when added tokens are out of range * Make gguf SpecialVocab vocab size-aware Update conversion scripts accordingly * Avoid a string copy Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-22 21:14:56 +03:00
vvhg1	d3956aea53	main : escape prompt for cfg_negative_prompt and consecutive inputs in main with interactive (#3623 ) * infill tokens correction * serverinfill tokens correction * removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape * removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape * only rm when params.escape, rm space if possible which is added back or rm added space token * only rm when params.escape, rm space if possible which is added back or rm added space token * Revert "only rm when params.escape, rm space if possible which is added back or rm added space token" This reverts commit `63ba0b621f`. * fix interactive prompt escaping and fix server infill leading space handling * rm unnecessary bool check * process escapes for neg prompt and interactive consec prompts * removed unneccessary static string escape	2023-10-22 21:09:51 +03:00
Georgi Gerganov	22c69a2794	batched : add len CLI argument	2023-10-22 08:37:20 +03:00
shibe2	465219b914	CLBlast: Add outer loops over src0 for broadcasting in mulmat Reduce repeated dequantization of the same data.	2023-10-20 22:30:52 +04:00
Georgi Gerganov	d1031cf49c	sampling : refactor init to use llama_sampling_params (#3696 ) * sampling : refactor init to use llama_sampling_params * llama : combine repetition, frequency and presence penalties in 1 call * examples : remove embd-input and gptneox-wip * sampling : rename penalty params + reduce size of "prev" vector * sampling : add llama_sampling_print helper * sampling : hide prev behind API and apply #3661 ggml-ci	2023-10-20 21:07:23 +03:00
Qin Yue Chen	8cf19d60dc	gguf : support big endian platform (#3552 ) * check whether platform is 390x if yes->do not import immintrin.h * support s390x big endian * support --bigendian option for s390x 1. verified with baichuan7b-chat with float 16 on s390x 2. verified with baichuan7b-chat 3. verified with chinese-alpaca-2-13b-f16 * update format based on editor-config checker result * Update convert-baichuan-hf-to-gguf.py * 1. check in ggml.c if endianess is not match 2. update GGUF version 3. change get_pack_prefix to property 4. update information log * always use "GGUF" as beginng of GGUF file * Compare "GGUF" with file header char by char 1. Set GGUF_MAGIC to "GGUF" string instead of int value 2. Compare "GGUF" char by char to ensure its byte order 3. Move bytes swap code from convert.py to gguf.py write_tensor_data --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-20 14:19:40 +03:00
Georgi Gerganov	a0edf73bda	server : fix uninitialized sampling context (close #3685 )	2023-10-20 13:06:10 +03:00
Herman Semenov	f439e506e8	ggml : fix rope + llama minor optimizations (#3560 ) * Minor fixes and fixed memleak * Using const auto references in range-based loop C++17	2023-10-20 13:02:12 +03:00
cebtenzzre	e78f3ef24a	convert : restore compat with old Falcon models (#3680 )	2023-10-20 08:32:08 +03:00
M. Yusuf Sarıgöz	f3b25e4043	multimodal : add BakLLaVA conversion support (#3682 )	2023-10-19 19:40:41 +03:00
M. Yusuf Sarıgöz	60abea9798	llava : avoid segfault in case of non-existent mmproj file (#3674 )	2023-10-19 16:59:11 +03:00
Georgi Gerganov	004797f6ac	readme : update hot topics	2023-10-18 21:44:43 +03:00
Georgi Gerganov	4e82b2ea3f	speculative : bug fixes	2023-10-18 18:49:40 +03:00
Georgi Gerganov	0e89203b51	speculative : add tree-based sampling example (#3624 ) * sampling : one sequence per sampling context ggml-ci * speculative : add tree-based sampling support ggml-ci * speculative : reuse the n_parallel CLI param * speculative : refactor sampling * examples : fix build after sampling refactoring ggml-ci * batched : fix n_seq_id * sampling : fix malloc ggml-ci * swift : fix build ggml-ci * swift : try to fix build ggml-ci * prompts : add assistant.txt * common : add llama_batch_add() and llama_batch_clear() helpers * speculative : minor refactor ggml-ci * minor : comments + rename ggml-ci * speculative : fix off-by-one for n_drafted * speculative : fix the n_drafted fix + p constants	2023-10-18 16:21:57 +03:00
Jhen-Jie Hong	c67fe68e41	metal : implement q5_0 and q5_1 kernels (#3648 ) * metal : implement dequantize_q5_0 * metal : block_q_n_dot_y for block_q5_0 (broken) * metal : revert unnecessary change * metal : implement dequantize_q5_1 * metal : block_q_n_dot_y for q5_1 (broken) * metal : fix block_q_n_dot_y * minor : spaces / formatting --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-18 15:21:48 +03:00
shibe2	1117d06607	opencl : fix element-wise multiplication (#3656 )	2023-10-18 15:09:22 +03:00
slaren	cb33f43a2a	fix embeddings when using CUDA (#3657 )	2023-10-17 22:24:50 +02:00
Georgi Gerganov	e1675d133c	llama : avoid fprintf in favor of LLAMA_LOG (#3538 )	2023-10-17 22:34:26 +03:00
BarfingLemurs	8402566a7c	readme : update hot-topics & models, detail windows release in usage (#3615 ) * Update README.md * Update README.md * Update README.md * move "Running on Windows" section below "Prepare data and run" --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-17 21:13:21 +03:00
shibe2	40e5ce054f	CLBlast: Fix temporary buffer size for f16 conversion (wsize) Fix buffer overflow. Reduce the size to fit just one 2D slice. Assert sufficient size.	2023-10-17 21:02:30 +04:00
slaren	a5e8c1d8c7	train-text-from-scratch : fix assert failure in ggml-alloc (#3618 )	2023-10-17 20:00:58 +03:00
Georgi Gerganov	e74c705e15	editorconfig : remove trailing spaces	2023-10-17 19:52:53 +03:00
coezbek	3ad1e3f1a1	server : documentation of JSON return value of /completion endpoint (#3632 ) * Added documentation of JSON return value of /completion endpoint * Update examples/server/README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-17 19:51:02 +03:00
Georgi Gerganov	1142013da4	save-load-state : fix example + add ci test (#3655 ) * save-load-state : fix example (close #3606) * ci : add test for save-load-state example ggml-ci	2023-10-17 19:12:46 +03:00

... 23 24 25 26 27 ...

2636 Commits