llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-30 16:07:17 +01:00

Author	SHA1	Message	Date
Someone Serge	5e97ec91ae	nix: add a comment about makeScope	2024-01-22 12:19:30 +00:00
Someone Serge	7251870780	nix: refactor the cleanSource rules	2024-01-22 12:19:30 +00:00
Someone Serge	fe8b3c0d4b	workflows: nix-ci: drop the redundant "paths" filter	2024-01-22 12:19:30 +00:00
Someone Serge	f4dd059259	workflows: nix-build-aarch64: rate limit	2024-01-22 12:19:30 +00:00
Someone Serge	f7276f7500	workflows: nix-ci: rebuild on flake.lock updates	2024-01-22 12:19:30 +00:00
Kawrakow	15bceec2d7	imatrix : keep intermediate imatrix results (#5077 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-22 14:18:43 +02:00
compilade	d6bd4d46dd	llama : support StableLM 2 1.6B (#5052 ) * llama : support StableLM 2 1.6B * convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}] * convert : refactor Qwen's set_vocab to use it for StableLM 2 too * nix : add tiktoken to llama-python-extra * convert : use presence of tokenizer.json to determine StableLM tokenizer loader It's a less arbitrary heuristic than the vocab size.	2024-01-22 13:21:52 +02:00
Daniel Bevenius	152d9d05e0	finetune : print sample-start/include-sample-start (#5072 ) This commit adds `--sample-start` and `--include-sample-start` to the output from the main function in finetune.cpp. The motivation for this is that even though these are set explicitly by the user via the command line, if one forgets to set them then it is useful to have their values printed out. Otherwise it is possible to go through the whole training process before realizing that the values are not what one expected. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-22 13:11:01 +02:00
Kawrakow	66d575c45c	llama : add Q3_K_XS (#5060 ) * Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S * Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with a 800 MB smaller quantized model size. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-22 12:43:33 +02:00
bobqianic	57744932c6	ci : fix Windows CI by updating Intel SDE version (#5053 )	2024-01-22 10:55:05 +02:00
Shijie	3466c6ebcf	llama : add more qwen2 models (#5071 )	2024-01-22 09:33:19 +02:00
iSma	504dc37be8	Revert LLAMA_NATIVE to OFF in flake.nix (#5066 )	2024-01-21 21:37:13 +00:00
kuronekosaiko	05490fad7f	add safetensors support to convert-lora-to-ggml.py (#5062 ) * add safetensors support to convert-lora-to-ggml.py * Update convert-lora-to-ggml.py Remove white space in line 69.	2024-01-21 17:28:14 +01:00
bobqianic	6c5629d4d2	add `#include <string>` to unicode.h (#5051 ) Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-01-21 10:17:35 -05:00
Kawrakow	7dcbe39d36	Add ability to evaluate multiple choice tasks (#5047 ) * TruthfulQA: 1st attempt, does not look like it is working The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right. * TruthfulQA: works but the result is bad I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess, the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with random chance result around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leader board results for these two models are 42.2% and 68.3%, respectively. * TruthfulQA: fix random sample * TruthfulQA: prepare tasks in parallel for large test datasets * Rename truthful_qa to multiple_choice * Make MSVC happy I had forgotten that MSVC does not make constexpr's available inside a lambda. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-21 14:42:44 +02:00
Kawrakow	726c0fa9a2	Slightly faster imatrix (#5050 ) * imatrix: speedup by avoiding unnecessary allocations and copies * imatrix: add --no-ppl option to skip PPL calculations altogether --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-21 08:01:20 +02:00
Georgi Gerganov	942c0107a7	flake.lock: Update (#5054 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/9b19f5e77dd906cb52dade0b7bd280339d2a1f3d' (2024-01-13) → 'github:NixOS/nixpkgs/bbe7d8f876fbbe7c959c90ba2ae2852220573261' (2024-01-19) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-01-21 03:17:27 +00:00
Jared Van Bortel	b43ebde3b0	convert : partially revert PR #4818 (#5041 )	2024-01-20 18:14:18 -05:00
Jared Van Bortel	97c1549808	perplexity : fix MSVC build after #5020 (#5043 ) * perplexity : fix MSVC build after #5020 * try a different fix	2024-01-20 17:08:08 +02:00
slaren	6df465a91d	llama : run all KQV ops on the CPU with no KV offload (#5049 ) ggml-ci	2024-01-20 17:05:49 +02:00
Herman Semenov	77bc1bbd05	cmake : add support for ccache (#5002 ) * Added support ccache for speedup recompilation * cmake : option to disable ccache --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-20 10:11:31 +02:00
adel boussaken	48e2b13372	Add a dart/flutter binding to README.md (#4882 )	2024-01-20 03:05:43 -05:00
Kylin	cca894f16a	cuda : fix compile error in jetson platform (#4975 ) * cuda: fix compile error in jetson platform * cuda: update comment in ggml-cuda.cu * cuda: update ggml-cuda.cu comment	2024-01-20 09:01:46 +02:00
Uzo Nweke	381ee19572	finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033 ) * Fix issue with alloc causing max_compute_size to be calculated * remove ggml_allocr_free as suggested in issue #4791	2024-01-19 20:20:50 +02:00
Georgi Gerganov	a5cacb22b2	imatrix : add README.md	2024-01-19 15:24:47 +02:00
Shijie	9b75cb2b3c	llama : support upcoming Qwen2 (#5037 )	2024-01-19 13:53:13 +02:00
Georgi Gerganov	de9a147df1	py : fix flake8 lint	2024-01-19 13:52:22 +02:00
Kawrakow	7051aacfac	winogrande: evaluate log-probs in parallel (#5036 ) This is a relatively minor performance tweak resulting in ~10% speedup on my system. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-19 11:39:11 +02:00
chiranko	2b3b999cac	llama : add CodeShell support (#5016 ) * llama: add codeshell support * llama.cpp: fix codeshell with NeoX rope Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-19 11:07:27 +02:00
Kawrakow	993fba8180	perplexity: avoid unnecessary allocations and logit copies (#5035 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-19 11:02:39 +02:00
Georgi Gerganov	8b20858e5e	perplexity : faster Winogrande via batching (#5024 ) * perplexity : faster Winogrande via batching ggml-ci * perplexity : remove unused function * perplexity : only tokenize selected tasks for Winogrande	2024-01-19 10:45:06 +02:00
John	57e2a7a52a	llama : fix falcon arch for tied output embeddings (#4978 ) * falcon arch fix for tied output embeddings * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-19 00:12:15 +02:00
Georgi Gerganov	9b6ea4263a	cmake : add ggml public headers (#5011 )	2024-01-18 23:36:07 +02:00
Xuan Son Nguyen	821f0a271e	server : defer tasks when "slot unavailable" (#5018 ) * server: defer task when no slot is available * remove unnecessary log --------- Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>	2024-01-18 22:33:05 +02:00
slaren	96d7f56d29	llama : fix mlock with no-mmap with Metal (#5025 )	2024-01-18 21:12:15 +01:00
Georgi Gerganov	2d5419d08a	imatrix : fix assert for src0 non-cont check	2024-01-18 21:45:51 +02:00
Georgi Gerganov	d391ae9b49	perplexity : fix winogrande N tasks option	2024-01-18 20:49:00 +02:00
Georgi Gerganov	e9240cdfa0	scripts : add get-winogrande.sh	2024-01-18 20:45:39 +02:00
David Sommers	b46757735d	convert.py : fix llama/llama2 conversion due to vocab_size=-1 (#5019 ) PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30). Without the fix, llama2 models can't be converted. The error is: `ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`	2024-01-18 19:20:59 +02:00
Kawrakow	3e945cc1e9	HellaSwag: speed up by parallelizing log-prob evaluation (#5020 ) For Mistral-7B and fp16, time on my system goes down from 536 seconds to 423 seconds for the full evaluation dataset (10042 tasks). Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-18 19:18:21 +02:00
Georgi Gerganov	ad19812cda	perplexity : faster HellaSwag via batching (#5017 ) * perplexity : faster HellaSwag ggml-ci * perplexity : clean-up ggml-ci * perplexity : no need for decode_helper ggml-ci * perplexity : add comments * perplexity : option to specify max batched tasks via `n_parallel` * perplexity : remove HellaSwag restriction for n_batch	2024-01-18 15:33:01 +02:00
Kawrakow	682986a08e	Add Winogrande evaluation (#5015 ) * winogrande: simple implementation It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leader board. 1-sigma statistical uncertainty for 1267 tasks is ~1.4, so no way the difference is due to statistics. * winogrande: somewhat better Score for Mistral-7B is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before. * winogrande: improving Mistral-7B score is now 73.56. Still not quite 78.4 but getting there. We are also getting a lower score on HellaSwag compared to HF leader board, so I'm not expecting we will get up to 78.4 anyway. It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood. This kind of makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing about the follow up context, and this will skew the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence. It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last. * winogrande: add dataset instructions --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-18 13:46:27 +02:00
Georgi Gerganov	dcad445d0c	scripts : add helper script to get hellaswag data in txt format	2024-01-18 11:44:49 +02:00
Paul Tsochantaris	1e605f4102	metal : fix memory leak, dangling pointer and unused autorel (#5007 ) * Metal memory: Small memory leak on init, dangling pointer, and unused autorelease pool in graph compute * SPM header potential fix * Reverting symlinks	2024-01-18 10:47:24 +02:00
Georgi Gerganov	6b6916b215	sync : ggml	2024-01-17 20:54:50 +02:00
Georgi Gerganov	38566680cd	ggml : add IQ2 to test-backend-ops + refactoring (#4990 ) * ggml : add IQ2 to test-backend-ops + refactoring ggml-ci * cuda : update supports_op for IQ2 ggml-ci * ci : enable LLAMA_CUBLAS=1 for CUDA nodes ggml-ci * cuda : fix out-of-bounds-access in `mul_mat_vec_q` ggml-ci * tests : avoid creating RNGs for each Q tensor ggml-ci * tests : avoid creating RNGs for each tensor ggml-ci	2024-01-17 18:54:56 +02:00
Georgi Gerganov	ba69bbc84c	imatrix : offload to GPU support (#4957 ) * backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * imatrix : offload to GPU support * imatrix : fix ggml_mul_mat_id handling ggml-ci * ci : add imatrix test ggml-ci * ci : rearrange output ggml-ci	2024-01-17 18:46:30 +02:00
Georgi Gerganov	44a1a4a41a	backend : add eval callback (#4935 ) * backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * simple : no need for ggml_is_contiguous + fix bool parse * llama : fix callback placement in llama_context_params * backend : avoid double-ask callback calls * simple : restore examples, imatrix will serve as a demo	2024-01-17 18:39:41 +02:00
Georgi Gerganov	c918fe8dca	metal : create autorelease pool during library build (#4970 ) * metal : create autorelease pool during library build ggml-ci * test : simplify ggml-ci	2024-01-17 18:38:39 +02:00
Georgi Gerganov	0f83e727af	py : fix whitespace	2024-01-17 18:37:36 +02:00

2498 Commits