Georgi Gerganov
afb3929279
Merge branch 'master' into llama-refactor
2023-10-31 20:35:31 +02:00
Tungsten842
07178c98e1
flake.nix: fix for rocm 5.7 ( #3853 )
2023-10-31 19:24:03 +02:00
Georgi Gerganov
5baefef497
llama : add llm_build helper functions ( #3848 )
...
* llama : add llm_build_norm helper function (see the sketch after this entry)
ggml-ci
* llama : add llm_build_ffn helper function (#3849 )
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
2023-10-31 19:23:12 +02:00
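The helpers listed above all share one shape: take the current graph tensor plus the relevant weights, append the ops, and return the new tensor. A minimal sketch of the llm_build_norm pattern, written against the public ggml C API only; the actual llama.cpp signature may differ:

```c++
#include "ggml.h"

// Hedged sketch of an llm_build_norm-style helper: it factors out the
// norm -> scale -> (optional) bias sequence that every layer repeats.
// The enum and signature are illustrative, not the exact llama.cpp code.
enum llm_norm_type { LLM_NORM, LLM_NORM_RMS };

static ggml_tensor * llm_build_norm(
        ggml_context * ctx,
        ggml_tensor  * cur,
        ggml_tensor  * mw,   // norm weight (scale), may be NULL
        ggml_tensor  * mb,   // norm bias, may be NULL
        llm_norm_type  type,
        float          eps) {
    switch (type) {
        case LLM_NORM:     cur = ggml_norm    (ctx, cur, eps); break;
        case LLM_NORM_RMS: cur = ggml_rms_norm(ctx, cur, eps); break;
    }
    if (mw) cur = ggml_mul(ctx, cur, mw); // elementwise scale
    if (mb) cur = ggml_add(ctx, cur, mb); // optional bias
    return cur;
}
```

Factoring each repeated pattern into one place also gives every intermediate tensor a single naming point, which the tensor-name normalization and the offloading-callback refactor further down this log build on.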
Georgi Gerganov
207b51900e
ggml : move FP16 <-> FP32 code to ggml-impl.h ( #3861 )
...
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
2023-10-30 19:19:15 +02:00
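For context on what moved: ggml's FP16 -> FP32 conversion is, on most platforms, a lookup table indexed by the raw 16-bit pattern. A simplified sketch using the new ggml_ table prefix this commit introduces; the real header differs in detail:

```c++
#include <cstdint>
#include <cstring>

// Simplified sketch of the table-based FP16 -> FP32 conversion that the
// commit moves into ggml-impl.h; not the exact header. The table holds the
// FP32 value of every possible 16-bit pattern (64 Ki entries, 256 KiB) and
// is filled once at init time, so each conversion is one indexed load.
typedef uint16_t ggml_fp16_t;

extern float ggml_table_f32_f16[1 << 16]; // note the new ggml_ prefix

static inline float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    std::memcpy(&s, &f, sizeof(uint16_t)); // bit copy avoids aliasing issues
    return ggml_table_f32_f16[s];
}
```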
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence ( #3843 )
...
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that delete by position to use llama_kv_cache_seq_rm instead (usage sketch below)
2023-10-29 11:31:40 -06:00
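A usage sketch of the extended call, assuming the llama.h convention of this period; the seq_id = -1 wildcard is what this commit adds, and negative p0/p1 were already open-ended:

```c++
#include "llama.h"

// Hedged usage sketch; the exact semantics live in llama.h of this era.
static void kv_cache_examples(llama_context * ctx) {
    // drop positions [10, 20) from sequence 1 only
    llama_kv_cache_seq_rm(ctx, 1, 10, 20);

    // drop the same positions from *any* sequence: the new seq_id = -1 wildcard
    llama_kv_cache_seq_rm(ctx, -1, 10, 20);

    // full clear: the replacement for llama_kv_cache_tokens_rm
    llama_kv_cache_clear(ctx);
}
```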
cebtenzzre
2046eb4345
make : remove unnecessary dependency on build-info.h ( #3842 )
2023-10-29 18:33:47 +02:00
Georgi Gerganov
71a09da301
llama : fix kv shift bug ( #3835 )
...
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring ( #3833 )
...
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
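Conceptually, --pure bypasses the per-tensor type selection that makes a "k-quant mixture". A sketch with hypothetical helper names, not the actual quantize code:

```c++
#include "ggml.h"

// Hypothetical heuristic, declared only to make the sketch self-contained.
ggml_type pick_k_quant_mixture_type(const ggml_tensor * tensor, ggml_type base);

// Illustrative only: what --pure changes.
static ggml_type choose_tensor_type(const ggml_tensor * tensor,
                                    ggml_type default_type, bool pure) {
    if (pure) {
        return default_type; // --pure: every tensor gets the requested type
    }
    // otherwise a heuristic may upgrade sensitive tensors (e.g. output,
    // attn_v) to a higher-bit k-quant -- the "mixture" being disabled
    return pick_k_quant_mixture_type(tensor, default_type);
}
```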
Georgi Gerganov
210e6e5d02
llama : remove obsolete map for layer counting
2023-10-29 13:39:04 +02:00
Georgi Gerganov
79ad734417
llama : comment
...
ggml-ci
2023-10-29 13:27:53 +02:00
Georgi Gerganov
761087932b
llama : add functional header
2023-10-29 13:26:32 +02:00
Georgi Gerganov
8925cf9ef8
llama : add layer index to all tensor names
2023-10-29 13:22:15 +02:00
Georgi Gerganov
1e9c5443c2
llama : refactor tensor offloading as callback
2023-10-29 13:05:10 +02:00
Georgi Gerganov
da936188d8
llama : move refact in correct place + optimize graph input
2023-10-29 11:48:58 +02:00
Georgi Gerganov
739b85c985
llama : try to fix build
2023-10-29 11:25:32 +02:00
Georgi Gerganov
25cfbf6776
llama : fix non-CUDA build
2023-10-29 11:12:03 +02:00
Georgi Gerganov
b4ad03b3a7
llama : try to optimize offloading code
2023-10-29 10:33:11 +02:00
Georgi Gerganov
79617902ea
llama : fix res_norm offloading
2023-10-29 09:20:35 +02:00
Georgi Gerganov
e14aa46151
llama : do tensor offload only with CUDA
2023-10-29 08:03:46 +02:00
Georgi Gerganov
0dc05b8433
llama : factor graph input into a function
2023-10-29 07:52:43 +02:00
Georgi Gerganov
4e98897ede
llama : support offloading result_norm + comments
2023-10-29 07:36:07 +02:00
Georgi Gerganov
51c4f9ee9f
llama : comments
2023-10-28 22:50:08 +03:00
Georgi Gerganov
3af8771389
llama : update offload log messages to print node index
2023-10-28 22:36:44 +03:00
Georgi Gerganov
83d2c43791
llama : offload rest of the models
...
ggml-ci
2023-10-28 22:30:54 +03:00
Georgi Gerganov
38aca9e1ab
llama : factor out tensor offloading outside the build call (wip)
...
ggml-ci
2023-10-28 21:22:31 +03:00
Georgi Gerganov
5946d98fc8
metal : disable kernel load log
2023-10-28 21:22:01 +03:00
Georgi Gerganov
8b2420d249
llama : factor out ggml-alloc from graph build functions
...
ggml-ci
2023-10-28 19:54:28 +03:00
Erik Scholz
ff3bad83e2
flake : update flake.lock for newer transformers version + provide extra dev shell ( #3797 )
...
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02
metal : try cwd for ggml-metal.metal if bundle lookup fails ( #3793 )
...
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `nil` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-28 15:43:01 +03:00
Georgi Gerganov
ba231e8a6d
issues : change label from bug to bug-unconfirmed ( #3748 )
2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29
convert : ignore tokens if their IDs are within [0, vocab_size) ( #3831 )
2023-10-28 06:25:15 -06:00
Kerfuffle
bd6d9e2059
llama : allow quantizing k-quants to fall back when tensor size incompatible ( #3747 )
...
* Allow quantizing k-quants to fall back when the tensor size is incompatible
* quantizing: add a warning when tensors are incompatible with k-quants
Clean up k-quants state passing a bit (fallback sketched below)
2023-10-28 14:54:24 +03:00
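The incompatibility in question: k-quants operate on super-blocks of QK_K = 256 values (in the default build), so a tensor whose row length is not a multiple of 256 cannot be k-quantized. A minimal sketch of the fallback, with illustrative names:

```c++
#include <cstdio>
#include "ggml.h"

// Hedged sketch of the fallback: rows that are not a multiple of the
// k-quant super-block size cannot be k-quantized, so substitute a
// compatible type and warn. Not the actual llama.cpp quantize code.
static ggml_type maybe_fall_back(const ggml_tensor * tensor, ggml_type wanted) {
    const int64_t qk_k = 256; // ggml's QK_K in the default build
    if (tensor->ne[0] % qk_k != 0) {
        std::fprintf(stderr,
                "warning: tensor %s has row size %lld, incompatible with "
                "k-quants; falling back to q8_0\n",
                ggml_get_name(tensor), (long long) tensor->ne[0]);
        return GGML_TYPE_Q8_0; // no super-block constraint
    }
    return wanted;
}
```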
Georgi Gerganov
ee1a0ec9cb
llama : add option for greedy sampling with probs ( #3813 )
...
* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs (see the sketch below)
2023-10-28 14:23:11 +03:00
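The last bullet encodes a convention that looks roughly like this in code (a hedged sketch against the llama.h sampling API of this period, not the exact source):

```c++
#include "llama.h"

// Hedged sketch of the convention:
//   temp == 0.0f -> plain greedy, probabilities never computed
//   temp <  0.0f -> same greedy result, but softmax runs first so callers
//                   can read normalized probabilities from `candidates`
static llama_token sample_with_temp(llama_context * ctx,
                                    llama_token_data_array & candidates,
                                    float temp) {
    if (temp < 0.0f) {
        llama_sample_softmax(ctx, &candidates); // sorts + normalizes probs
        return candidates.data[0].id;           // argmax after the sort
    }
    if (temp == 0.0f) {
        return llama_sample_token_greedy(ctx, &candidates);
    }
    llama_sample_temp(ctx, &candidates, temp);  // usual temperature path
    return llama_sample_token(ctx, &candidates);
}
```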
Henk Poley
177461104b
common : print that one line of the syntax help *also* to standard output ( #3823 )
2023-10-28 13:16:33 +03:00
Georgi Gerganov
fdee152e4e
starcoder : add GPU offloading ( #3827 )
...
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
2023-10-28 12:06:08 +03:00
Kerfuffle
41aee4df82
speculative : ensure draft and target model vocab matches ( #3812 )
...
* speculative: Ensure draft and target model vocab matches
* Tolerate small differences when checking draft vs. target vocab (a sketch follows below)
2023-10-28 00:40:07 +03:00
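A hedged sketch of such a check; the constant values and exact accessor signatures are assumptions, not the commit's code:

```c++
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include "llama.h"

// Hedged sketch of a draft/target vocab compatibility check in the spirit
// of this commit: sizes may differ by a small amount, but token text must
// agree for speculation to be sound.
static bool spec_vocab_matches(llama_context * ctx_tgt, llama_context * ctx_dft) {
    const int max_size_difference  = 100; // assumed tolerated size delta
    const int check_start_token_id = 5;   // assumed: skip low-ID special tokens

    const int n_tgt = llama_n_vocab(llama_get_model(ctx_tgt));
    const int n_dft = llama_n_vocab(llama_get_model(ctx_dft));
    if (std::abs(n_tgt - n_dft) > max_size_difference) {
        return false;
    }
    for (int i = check_start_token_id; i < std::min(n_tgt, n_dft); ++i) {
        if (std::strcmp(llama_token_get_text(ctx_tgt, i),
                        llama_token_get_text(ctx_dft, i)) != 0) {
            return false; // token text diverges -> vocabs don't match
        }
    }
    return true;
}
```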
cebtenzzre
6d459cbfbe
llama : correctly report GGUFv3 format ( #3818 )
2023-10-27 17:33:53 -04:00
Thibault Terrasson
c8d6a1f34a
simple : fix batch handling ( #3803 )
2023-10-27 08:37:41 -06:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance ( #3776 )
...
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
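A rough sketch of the resulting dispatch; function and threshold names are hypothetical, not the actual ggml-cuda.cu:

```c++
#include <cstdint>

// Illustrative dispatch only. Per the bullets above: quantized mat-mul
// (MMQ) kernels for small batches, and dequantize + cuBLAS (tensor cores,
// when CUDA_USE_TENSOR_CORES applies) once the batch is large enough.
static bool should_use_mmq(bool tensor_cores_usable, int64_t batch_size) {
#ifdef GGML_CUDA_FORCE_MMQ
    (void) tensor_cores_usable; (void) batch_size;
    return true;                            // build option: always take MMQ
#else
    const int64_t mmq_max_batch_size = 32;  // assumed threshold value
    return !tensor_cores_usable || batch_size <= mmq_max_batch_size;
#endif
}
```

Roughly, MMQ skips the dequantization pass, which favors single-token decoding; the cuBLAS path amortizes that pass over many rows, which is why it takes over for batched decoding and prompt processing.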
Georgi Gerganov
34b2a5e1ee
server : do not release slot on image input ( #3798 )
2023-10-26 22:54:17 +03:00
Georgi Gerganov
6961c4bd0b
batched-bench : print params at start
2023-10-25 10:26:27 +03:00
Georgi Gerganov
cc44877486
log : disable pid in log filenames
2023-10-25 10:09:16 +03:00
cebtenzzre
ad93962657
server : add parameter -tb N, --threads-batch N ( #3584 ) ( #3768 )
...
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2023-10-24 23:10:43 +03:00
Georgi Gerganov
1717521cdb
server : do not block system prompt update ( #3767 )
...
* server : do not block system prompt update
* server : update state machine logic to process system prompts
* server : minor
2023-10-24 23:08:20 +03:00
Georgi Gerganov
b2f7e04bd3
sync : ggml (conv ops + cuda MSVC fixes) ( #3765 )
...
ggml-ci
2023-10-24 21:51:20 +03:00
John Smith
abd21fc99f
cmake : add missed dependencies ( #3763 )
2023-10-24 20:48:45 +03:00
Georgi Gerganov
2b4ea35e56
cuda : add batched cuBLAS GEMM for faster attention ( #3749 )
...
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCm. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I've tried to create a multi-part review, so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-10-24 16:48:37 +03:00
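For reference, the strided-batched variant mentioned in the bullets has roughly this shape (a sketch, not the actual ggml-cuda call site; with a hipblasGemmStridedBatchedEx define, analogous to the review comment above, the same call builds for ROCm):

```c++
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Minimal sketch of a strided-batched GEMM of the kind the PR adds for
// attention-style products: one GEMM per head/batch entry, strided through
// contiguous buffers. Status-code checking omitted for brevity.
static void gemm_strided_batched_f16(cublasHandle_t handle,
        int m, int n, int k, int batch_count,
        const __half * A, long long strideA,   // e.g. K, transposed per batch
        const __half * B, long long strideB,   // e.g. Q
        __half       * C, long long strideC) { // e.g. KQ
    const __half alpha = __float2half(1.0f);  // FP16 compute -> half scalars
    const __half beta  = __float2half(0.0f);
    cublasGemmStridedBatchedEx(handle,
            CUBLAS_OP_T, CUBLAS_OP_N,
            m, n, k,
            &alpha,
            A, CUDA_R_16F, k, strideA,
            B, CUDA_R_16F, k, strideB,
            &beta,
            C, CUDA_R_16F, m, strideC,
            batch_count,
            CUBLAS_COMPUTE_16F,
            CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```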
Galunid
daab3d7f45
Add more tokenizer tests ( #3742 )
...
* Add more tokenizer tests
* Add starcoder
* Update test vocab files
* Restrict bpe tokenizer tests to unicode planes
* Update comment
* Comment cosmetics
* Remove bloom vocab/test
2023-10-24 09:17:17 +02:00
Georgi Gerganov
469c9addef
metal : handle ggml_scale for n%4 != 0 ( close #3754 )
...
ggml-ci
2023-10-24 09:47:22 +03:00
Georgi Gerganov
e3932593d4
Revert "make : add optional CUDA_NATIVE_ARCH ( #2482 )"
...
This reverts commit 96981f37b1.
See:
https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
2023-10-23 23:46:05 +03:00