llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-02-10 02:03:07 +01:00

Author	SHA1	Message	Date
klosax	2ba83c8685	Fix spm whitespaces (#2806 ) * llama.cpp : fix spm whitespace escaping + clean up * main.cpp : spm - add whitespace in front of prompt * test-tokenizer-0.cpp : spm - add whitespace in front of prompt	2023-08-26 13:45:53 +02:00
lon	bae5c5f679	examples : skip unnecessary external lib in server README.md how-to (#2804 )	2023-08-26 16:07:43 +08:00
Marcus Dunn	232caf3c15	llama : fix struct decl (#2790 )	2023-08-25 19:17:15 +03:00
Kawrakow	d046dcee08	Faster perplexity computation (#2786 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-08-25 19:05:02 +03:00
Matt Pulver	c82742ac9c	llama : add llama_beam_search() (#2267 ) * Add llama_beam_search(). * Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token(). * Add space around * pointers and & references. * Add spaces around comparison and assignment operators. * Prefer west const. * Use llama_ prefix for structs in global namespace. * Delete obsolete comment from an earlier revision. * Change eos to eob in llama_beam and llama_beam_view structs.	2023-08-25 18:18:48 +03:00
Nigel Bosch	28b2c996ca	convert.py : Get rope scale from HuggingFace models (#2772 ) * Get rope scale from HF models * Save rope scale only for linear scaling * Rewrite for clarity	2023-08-25 16:41:52 +02:00
slaren	154725c543	llama-bench : add model sizes (#2771 ) * llama-bench : add model sizes * more compact markdown output * back to GiB * adjust column sizes	2023-08-25 15:16:19 +02:00
slaren	12e2e33a97	convert.py : export rope freq_base when converting CodeLlama from an HF model (#2773 )	2023-08-25 14:08:53 +02:00
Jhen-Jie Hong	29674ab4e8	server : display token probabilities in the UI (#2489 ) * server : add n_probs param in chat UI * server : keep message data array & show in probabilities component * server : add simple popover component * server : fix completion_probabilities undefined if not set n_probs * server : implement Probabilites * server : handle bytes * server : make n_probs max to 10 for easy scroll * server : adjust for dark/light mode * server : Fix regenerated prompt * server : update index.html.hpp * server : convert prob to percentage + show original value as div title * server : fix Probabilites not used if included empty str * server : skip byte pair in display probabilities * server : remove array check of completion_probabilities in messages * skip empty array or byte pair (> 1) in Probabilites * generate index.html.hpp * fix incorrect prob convert if the str is already a known token * use final response to show probabilities on stop * revert unnecessary change * correct probabilities usage * remove unused function * always send partial response for get correct probs of last to_send * fix typo * fix content of format_final_response * refactor probs render & make pColor transparent if not found * send empty string when got stop_pos in partial * avoid unnecessary empty data event & send rest of partial tokens on stop * use <br /> for new line * skip -1 tok in loop to avoid send '' on end * trim last new lines on stop * revert unnecessary change	2023-08-25 18:32:45 +08:00
Georgi Gerganov	5439a0ab57	ci : pip install gguf in editable mode (#2782 ) ggml-ci	2023-08-25 13:03:25 +03:00
M. Yusuf Sarıgöz	8194cd8772	gguf : export objects to user code (#2780 ) * gguf export more objects to user code * gguf export all objects to user code for now * gguf : bump version	2023-08-25 12:43:41 +03:00
Henri Vasserman	6bbc598a63	ROCm Port (#1087 ) * use hipblas based on cublas * Update Makefile for the Cuda kernels * Expand arch list and make it overrideable * Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5) * add hipBLAS to README * new build arg LLAMA_CUDA_MMQ_Y * fix half2 decomposition * Add intrinsics polyfills for AMD * AMD assembly optimized __dp4a * Allow overriding CC_TURING * use "ROCm" instead of "CUDA" * ignore all build dirs * Add Dockerfiles * fix llama-bench * fix -nommq help for non CUDA/HIP --------- Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com> Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com> Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com> Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com> Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> Co-authored-by: jammm <2500920+jammm@users.noreply.github.com> Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>	2023-08-25 12:09:42 +03:00
Georgi Gerganov	3f460a2b72	cuda : add RoPE kernel for mode == 2 (NeoX) (#2760 ) * cuda : add RoPE kernel for mode == 2 (NeoX) * falcon : do not offload the embeddings layer	2023-08-25 11:55:59 +03:00
M. Yusuf Sarıgöz	87e3733f24	gguf : make gguf pip-installable * gitignore : add dist and rm pyproject.toml * gguf: prepare as Pip package * gguf: prepare as Pip package * gguf : fix line endings * requirements : add gguf * gguf : update readme with build notes * gguf : update readme with build notes * gguf : add notes for tests	2023-08-25 09:26:05 +03:00
Shouzheng Liu	b91ad7f461	ggml-alloc : enlarge size of parse_seq (#2776 ) Since we also store barriers in this array, we need to double its size.	2023-08-25 08:58:00 +03:00
Marcus Dunn	2e5f70a25f	Added `enum` to `llama_token_get_type` return type (#2774 )	2023-08-24 23:49:30 +02:00
slaren	d0f77b1353	convert.py : try to determine n_ctx automatically for CodeLlama (#2770 )	2023-08-24 21:10:39 +02:00
slaren	0d3094f0c7	gguf : add rope_freq_base parameter for CodeLlama (#2769 )	2023-08-24 21:04:05 +03:00
Georgi Gerganov	01f2224682	falcon : write file type	2023-08-24 19:58:30 +03:00
Shouzheng Liu	38b16dfca6	metal : bug-fix when enable ggml-alloc (#2757 ) * metal: better memory alloc w/ concurrency dispatch The ggml-alloc should only free tensors at memory barriers. * ggml-alloc: avoid return silently In certain cases, the allocate_node() function may silently return without performing any memory allocation.	2023-08-24 19:27:25 +03:00
Georgi Gerganov	8f8c28e89c	convert : auto-determine model name based on dir + scripts update	2023-08-24 19:26:47 +03:00
Kerfuffle	7694adda8d	Fix for main example getting stuck when -n -2 and --interactive (#2767 ) * Fix for main example getting stuck when -n -2 and --interactive * Add a comment so future generations may suffer less.	2023-08-24 10:11:13 -06:00
slaren	fea95c682d	fix convert.py for codellama, add llama 34B to the list of recognized models (#2768 )	2023-08-24 17:44:11 +02:00
DannyDaemonic	ef955fbd23	Tag release with build number (#2732 ) * Modified build.yml to use build number for release * Add the short hash back into the tag * Prefix the build number with b	2023-08-24 15:58:02 +02:00
Georgi Gerganov	d67777c202	metal : add Q8_0 support (#2763 ) * metal : add dequantize_q8_0 kernel * metal : add mul_mat_q8_0_f32 kernel * metal : add Q8_0 mul_mm kernel	2023-08-24 16:19:57 +03:00
Georgi Gerganov	c3e53b421a	llama : escape all U+2581 in a string (#2750 )	2023-08-24 12:26:01 +03:00
Evan Jones	6e91a1b070	llama : fix grammar sometimes generating null char (#2756 )	2023-08-24 07:07:13 +03:00
Georgi Gerganov	44d5462b5c	readme : fix link	2023-08-23 23:44:19 +03:00
Georgi Gerganov	c7868b0753	minor : fix trailing whitespace	2023-08-23 23:43:00 +03:00
Georgi Gerganov	79da24b58c	readme : update hot topics	2023-08-23 23:41:16 +03:00
Georgi Gerganov	cf658adc83	llm : add Falcon support (#2717 ) * llama : refactor GGUF constants into static maps * llama : check if model architecture is known * llama : refactor llama_model_load_internal() * gguf : add KV constant maps * llm : read arch-specific KVs * convert : add dummy scores + types * falcon : load tensor data (CPU only) * llama : fix loading progress bar * llama : add arch member to llama_model * falcon : CPU inference working * falcon : support non-40B models * falcon : minor * llama : minor updates ggml-ci * convert-falcon-hf-to-gguf.py : fix special token mapping * llama.cpp : llama default UNK token = id 0 * llama.cpp : fix bpe tokenizer * llama.cpp : fix the fix of bpe tokenizer * ggml : pass eps to ggml_norm * metal : implement RoPE (mode = 2) + avoid ggml_repeat * ggml : ggml_repeat always creates new tensor * falcon : copy-paste self-attention from LLaMA * metal : print extra compute pipeline info * falcon : minor changes (still chasing the Metal problem) * llama.cpp : fix linefeed token * metal : fix GELU kernel numerical stability by using precise::tanh * metal : temporary workaround for the concurrency optimization bug * falcon : add CUDA offloading (#2739) * llama : better model naming and size reporting * llama : prep new tokenizer support * llama : advanced BPE tokenizer based on ggllm.cpp implementation * llama : remove obsolete comment ggml-ci * common : remove obsolete BPE API + disable test-tokenizer-1 * llama : revert BPE special-case in llama_byte_to_token() * cuda : add TODOs for RoPE NeoX implementation * llama : default special tokens based on vocab type * perplexity : add log for start of tokenization --------- Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-08-23 23:08:04 +03:00
Georgi Gerganov	a192860cfe	minor : fix trailing whitespace	2023-08-23 22:37:39 +03:00
Olivier Chafik	95385241a9	examples : restore the functionality to import llama2.c models (#2685 ) * Fix import of llama2.c models that don't share weights between embedding layers * llama2c: reinstate ggmlv3 conversion output + update readme w/ gguf conv * llama2.c: comment out legacy "load from ggml model" logic * llama2.c: convert special-cased "<0xXX>" single byte tokens from tokenizer.bin	2023-08-23 22:33:05 +03:00
slaren	335acd2ffd	fix convert-lora-to-ggml.py (#2738 )	2023-08-23 16:46:54 +02:00
klosax	5290c38e6e	main : insert bos if no tokens (#2727 ) * main.cpp : insert bos if no tokens * Update examples/main/main.cpp * Update examples/main/main.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-08-23 16:46:03 +02:00
akawrykow	cc34dbda96	gitignore : fix for windows (#2729 )	2023-08-23 17:31:34 +03:00
Cebtenzzre	7c2227a197	chmod : make scripts executable (#2675 )	2023-08-23 17:29:09 +03:00
JohnnyB	f19dca04ea	devops : RPM Specs (#2723 ) * Create llama-cpp.srpm * Rename llama-cpp.srpm to llama-cpp.srpm.spec Correcting extension. * Tested spec success. * Update llama-cpp.srpm.spec * Create lamma-cpp-cublas.srpm.spec * Create lamma-cpp-clblast.srpm.spec * Update lamma-cpp-cublas.srpm.spec Added BuildRequires * Moved to devops dir	2023-08-23 17:28:22 +03:00
Kawrakow	8207214b6a	Fix values shown in the quantize tool help (#2735 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-08-23 12:57:12 +03:00
Kawrakow	62959e740e	Strided perplexity (#2714 ) * Implementing strided computation of perplexity * Alternative way to output PPL results --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-08-23 12:56:42 +03:00
IgnacioFDM	7f7ddd5002	Fix ggml to gguf conversion on Windows (#2733 ) This fixes `RuntimeWarning: overflow encountered in long_scalars` Credit: anon (not mine)	2023-08-23 03:31:09 -06:00
Xiao-Yong Jin	b8ad1b66b2	server : allow json array in prompt or content for direct token input (#2306 ) * server: allow json array in prompt or content We accept an array of strings and numbers representing tokens, in addition to the current string valued prompt or content. This allows direct token input, so that any special tokens can be processed and used at the frontend during the construction of the json data, before sending to the server. And the server does not need to know or parse special tokens from textual input. With this, we can use EOS and BOS used in llama-2-chat models. * server: use tokenizePrompt(json) and default "" if empty prompt * server: fix prompt check * server: tokenize endpoint no longer adds BOS	2023-08-23 15:12:12 +08:00
Evan Jones	f5fe98d11b	docs : add grammar docs (#2701 ) * docs : add grammar docs * tweaks to grammar guide * rework GBNF example to be a commented grammar	2023-08-22 21:01:57 -04:00
Kerfuffle	777f42ba18	Improve handling of special tokens in GGML to GGUF converter (#2725 ) * Improve UNK, BOS, EOS token handling when converting without metadata. * Allow importing as a module. * Remove some obsolete code and minor cleanups. * Set default UNK token mapping from -1 to 0 in llama.cpp * Try to handle overflow due to buggy Windows Python with a better error message	2023-08-22 17:39:39 -06:00
goerch	46ef5b5fcf	llama : fix whitespace escaping in tokenizer (#2724 )	2023-08-23 00:10:42 +03:00
Johannes Gäßler	c63bb1d16a	CUDA: use mul_mat_q kernels by default (#2683 )	2023-08-22 22:47:05 +02:00
Alex Petenchea	3b6cfe7c92	convert.py : clarifying error message (#2718 )	2023-08-22 21:58:16 +03:00
Jiahao Li	800c9635b4	Fix CUDA softmax by subtracting max value before exp (#2665 )	2023-08-22 20:27:06 +02:00
Georgi Gerganov	deb7dfca4b	gguf : add ftype meta info to the model (#2710 ) * llama : add ftype meta info to the model ggml-ci * convert.py : add ftype when converting (does not work) * convert.py : fix Enum to IntEnum ggml-ci	2023-08-22 20:05:59 +03:00
Kawrakow	bac66994cf	Quantization improvements for k_quants (#2707 ) * Improve LLaMA-2 2-, 3- and 4-bit quantization * Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2 * Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2 * Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of attention.wv and feed_forward.w2 This leads to a slight model size increase as follows: Q2_K : 2.684G vs 2.670G Q3_K_S: 2.775G vs 2.745G Q3_K_M: 3.071G vs 3.057G Q4_K_S: 3.592G vs 3.563G LLaMA-2 PPL for context 512 changes as follows: Q2_K : 6.6691 vs 6.8201 Q3_K_S: 6.2129 vs 6.2584 Q3_K_M: 6.0387 vs 6.1371 Q4_K_S: 5.9138 vs 6.0041 There are improvements for LLaMA-1 as well, but they are way smaller than the above. * Minor 4-bit quantization improvement For the same model size as the previous commit, we get PPL = 5.9069 vs 5.9138. * Some more fine tuning * Adding make_qkx2_quants With it, we get PPL = 5.8828 for L2-7B Q4_K_S. * Another minor improvement * Q2_K improvement Smaller model, lower perplexity. 7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G PPL = 6.8201 12B: file size = 5.056G, PPL = 5.4577 vs original 5.130G PPL = 5.7178 It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk, which are Q2_K * Iterating * Revert Q5_K back to make_qkx1_quants * Better Q6_K * make_qkx2_quants is better for Q5_K after all * Fix after rebasing on master * Fix for changed tensor names --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-08-22 19:14:09 +03:00


3121 Commits