llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-10-30 14:40:16 +01:00

Author	SHA1	Message	Date
automaticcat	24a447e20a	ggml : add ggml_cpu_has_avx_vnni() (#4589 ) * feat: add avx_vnni based on intel documents * ggml: add avx vnni based on intel document * llama: add avx vnni information display * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * Update ggml.c Fix indentation upgate Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-30 10:07:48 +02:00
manikbhandari	ea5497df5d	gpt2 : Add gpt2 architecture integration (#4555 )	2023-12-28 15:03:57 +01:00
Paul Tsochantaris	a206137f92	Adding Emeltal reference to UI list (#4629 )	2023-12-25 18:09:53 +02:00
Shintarou Okada	753be377b6	llama : add PLaMo model (#3557 ) * add plamo mock * add tensor loading * plamo convert * update norm * able to compile * fix norm_rms_eps hparam * runnable * use inp_pos * seems ok * update kqv code * remove develop code * update README * shuffle attn_q.weight and attn_output.weight for broadcasting * remove plamo_llm_build_kqv and use llm_build_kqv * fix style * update * llama : remove obsolete KQ_scale * plamo : fix tensor names for correct GPU offload --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-24 15:35:49 +02:00
FantasyGmm	a55876955b	cuda : fix jetson compile error (#4560 ) * fix old jetson compile error * Update Makefile * update jetson detect and cuda version detect * update cuda marco define * update makefile and cuda,fix some issue * Update README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update Makefile * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 17:11:12 +02:00
Michael Kesper	28cb35a0ec	make : add LLAMA_HIP_UMA option (#4587 ) NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA	2023-12-22 10:03:25 +02:00
Deins	2bb98279c5	readme : add zig bindings (#4581 )	2023-12-22 08:49:54 +02:00
Erik Garrison	0f630fbc92	cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449 ) * AMD ROCm: handle UMA memory VRAM expansions This resolves #2797 by allowing ROCm AMD GPU users with a UMA to dynamically expand the VRAM allocated to the GPU. Without this, AMD ROCm users with shared CPU/GPU memory usually are stuck with the BIOS-set (or fixed) framebuffer VRAM, making it impossible to load more than 1-2 layers. Note that the model is duplicated in RAM because it's loaded once for the CPU and then copied into a second set of allocations that are managed by the HIP UMA system. We can fix this later. * clarify build process for ROCm on linux with cmake * avoid using deprecated ROCm hipMallocHost * keep simplifying the change required for UMA * cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON	2023-12-21 21:45:32 +02:00
Georgi Gerganov	c083718c89	readme : update coding guidelines	2023-12-21 19:27:14 +02:00
Georgi Gerganov	b1306c4394	readme : update hot topics	2023-12-17 20:16:23 +02:00
BarfingLemurs	0353a18401	readme : update supported model list (#4457 )	2023-12-14 09:38:49 +02:00
Georgi Gerganov	113f9942fc	readme : update hot topics	2023-12-13 14:05:38 +02:00
Georgi Gerganov	bcc0eb4591	llama : per-layer KV cache + quantum K cache (#4309 ) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 13:03:17 +02:00
vodkaslime	524907aa76	readme : fix (#4135 ) * fix: readme * chore: resolve comments * chore: resolve comments	2023-11-30 23:49:21 +02:00
Dawid Wysocki	74daabae69	readme : fix typo (#4253 ) llama.cpp uses GitHub Actions, not Gitlab Actions.	2023-11-30 23:43:32 +02:00
Peter Sugihara	4fea3420ee	readme : add FreeChat (#4248 )	2023-11-29 09:16:34 +02:00
Kasumi	0dab8cd7cc	readme : add Amica to UI list (#4230 )	2023-11-27 19:39:42 +02:00
Georgi Gerganov	9656026b53	readme : update hot topics	2023-11-26 20:42:51 +02:00
Georgi Gerganov	04814e718e	readme : update hot topics	2023-11-25 12:02:13 +02:00
Aaryaman Vasishta	b35f3d0def	readme : use PATH for Windows ROCm (#4195 ) * Update README.md to use PATH for Windows ROCm * Update README.md * Update README.md	2023-11-24 09:52:39 +02:00
Georgi Gerganov	d103d935c0	readme : update hot topics	2023-11-23 13:51:22 +02:00
Aaryaman Vasishta	dfc7cd48b1	readme : update ROCm Windows instructions (#4122 ) * Update README.md * Update README.md Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2023-11-20 17:02:46 +02:00
Galunid	36eed0c42c	stablelm : StableLM support (#3586 ) * Add support for stablelm-3b-4e1t * Supports GPU offloading of (n-1) layers	2023-11-14 11:17:12 +01:00
Georgi Gerganov	c049b37d7b	readme : update hot topics	2023-11-13 14:18:08 +02:00
Richard Kiss	532dd74e38	Fix some documentation typos/grammar mistakes (#4032 ) * typos * Update examples/parallel/README.md Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-11-11 23:04:58 -07:00
Georgi Gerganov	224e7d5b14	readme : add notice about #3912	2023-11-02 20:44:12 +02:00
Ian Scrivener	5a42a5f8e8	readme : remove unsupported node.js library (#3703 ) - https://github.com/Atome-FE/llama-node is quite out of date - doesn't support recent/current llama.cpp functionality	2023-10-22 21:16:43 +03:00
Georgi Gerganov	d1031cf49c	sampling : refactor init to use llama_sampling_params (#3696 ) * sampling : refactor init to use llama_sampling_params * llama : combine repetition, frequency and presence penalties in 1 call * examples : remove embd-input and gptneox-wip * sampling : rename penalty params + reduce size of "prev" vector * sampling : add llama_sampling_print helper * sampling : hide prev behind API and apply #3661 ggml-ci	2023-10-20 21:07:23 +03:00
Georgi Gerganov	004797f6ac	readme : update hot topics	2023-10-18 21:44:43 +03:00
BarfingLemurs	8402566a7c	readme : update hot-topics & models, detail windows release in usage (#3615 ) * Update README.md * Update README.md * Update README.md * move "Running on Windows" section below "Prepare data and run" --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-17 21:13:21 +03:00
ldwang	5fe268a4d9	readme : add Aquila2 links (#3610 ) Signed-off-by: ldwang <ftgreat@gmail.com> Co-authored-by: ldwang <ftgreat@gmail.com>	2023-10-17 18:52:33 +03:00
Ian Scrivener	f3040beaab	typo : it is `--n-gpu-layers` not `--gpu-layers` (#3592 ) fixed a typo in the MacOS Metal run doco	2023-10-12 14:10:50 +03:00
Galunid	9f6ede19f3	Add MPT model to supported models in README.md (#3574 )	2023-10-10 19:02:49 -04:00
Xingchen Song(宋星辰)	c5b49360d0	readme : add bloom (#3570 )	2023-10-10 19:28:50 +03:00
BarfingLemurs	1faaae8c2b	readme : update models, cuda + ppl instructions (#3510 )	2023-10-06 22:13:36 +03:00
Georgi Gerganov	beabc8cfb0	readme : add project status link	2023-10-04 16:50:44 +03:00
slaren	40e07a60f9	llama.cpp : add documentation about rope_freq_base and scale values (#3401 ) * llama.cpp : add documentation about rope_freq_base and scale values * add notice to hot topics	2023-09-29 18:42:32 +02:00
BarfingLemurs	0a4a4a0982	readme : update hot topics + model links (#3399 )	2023-09-29 15:50:35 +03:00
Andrew Duffy	569550df20	readme : add link to grammars app (#3388 ) * Add link to grammars app per @ggernagov suggestion Adding a sentence in the Grammars section of README to point to grammar app, per https://github.com/ggerganov/llama.cpp/discussions/2494#discussioncomment-7138211 * Update README.md	2023-09-29 14:15:57 +03:00
Pierre Alexandre SCHEMBRI	4aea3b846e	readme : add Mistral AI release 0.1 (#3362 )	2023-09-28 15:13:37 +03:00
BarfingLemurs	ffe88a36a9	readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (#3340 ) * Update README.md * Update README.md * Update README.md with k-quants bpw measurements	2023-09-27 18:30:36 +03:00
2f38b454	1726f9626f	docs: Fix typo CLBlast_DIR var. (#3330 )	2023-09-25 20:24:52 +02:00
Lee Drake	bc9d3e3971	Update README.md (#3289 ) * Update README.md * Update README.md Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-09-21 21:00:24 +02:00
Georgi Gerganov	7eb41179ed	readme : update hot topics	2023-09-20 20:48:22 +03:00
Johannes Gäßler	111163e246	CUDA: enable peer access between devices (#2470 )	2023-09-17 16:37:53 +02:00
dylan	980ab41afb	docker : add gpu image CI builds (#3103 ) Enables the GPU enabled container images to be built and pushed alongside the CPU containers. Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>	2023-09-14 19:47:00 +03:00
Ikko Eltociear Ashimine	7d99aca759	readme : fix typo (#3043 ) * readme : fix typo acceleation -> acceleration * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-09-08 19:04:32 +03:00
Georgi Gerganov	94f10b91ed	readme : update hot tpoics	2023-09-08 18:18:04 +03:00
Yui	6ff712a6d1	Update deprecated GGML TheBloke links to GGUF (#3079 )	2023-09-08 12:32:55 +02:00
Georgi Gerganov	e36ecdccc8	build : on Mac OS enable Metal by default (#2901 ) * build : on Mac OS enable Metal by default * make : try to fix build on Linux * make : move targets back to the top * make : fix target clean * llama : enable GPU inference by default with Metal * llama : fix vocab_only logic when GPU is enabled * common : better `n_gpu_layers` assignment * readme : update Metal instructions * make : fix merge conflict remnants * gitignore : metal	2023-09-04 22:26:24 +03:00
Ido S	340af42f09	docs : add `catai` to `README.md` (#2967 )	2023-09-03 08:50:51 +03:00
bandoti	52315a4216	readme : update clblast instructions (#2903 ) * Update Windows CLBlast instructions * Update Windows CLBlast instructions * Remove trailing whitespace	2023-09-02 15:53:18 +03:00
Konstantin Herud	49bb9cbe0f	docs : add java-llama.cpp to README.md (#2935 )	2023-09-01 16:36:14 +03:00
Gilad S	35092fb547	docs : add `node-llama-cpp` to `README.md` (#2885 )	2023-08-30 11:40:12 +03:00
slaren	c03a243abf	remove outdated references to -eps and -gqa from README (#2881 )	2023-08-29 23:17:34 +02:00
Jhen-Jie Hong	74e0caeb82	readme : add react-native binding (#2869 )	2023-08-29 12:30:10 +03:00
Georgi Gerganov	da7455d046	readme : fix headings	2023-08-27 15:52:34 +03:00
Georgi Gerganov	c48c5bb0b0	readme : update hot topics	2023-08-27 14:44:35 +03:00
Henri Vasserman	6bbc598a63	ROCm Port (#1087 ) * use hipblas based on cublas * Update Makefile for the Cuda kernels * Expand arch list and make it overrideable * Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5) * add hipBLAS to README * new build arg LLAMA_CUDA_MMQ_Y * fix half2 decomposition * Add intrinsics polyfills for AMD * AMD assembly optimized __dp4a * Allow overriding CC_TURING * use "ROCm" instead of "CUDA" * ignore all build dirs * Add Dockerfiles * fix llama-bench * fix -nommq help for non CUDA/HIP --------- Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com> Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com> Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com> Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com> Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> Co-authored-by: jammm <2500920+jammm@users.noreply.github.com> Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>	2023-08-25 12:09:42 +03:00
Georgi Gerganov	44d5462b5c	readme : fix link	2023-08-23 23:44:19 +03:00
Georgi Gerganov	c7868b0753	minor : fix trailing whitespace	2023-08-23 23:43:00 +03:00
Georgi Gerganov	79da24b58c	readme : update hot topics	2023-08-23 23:41:16 +03:00
Evan Jones	f5fe98d11b	docs : add grammar docs (#2701 ) * docs : add grammar docs * tweaks to grammar guide * rework GBNF example to be a commented grammar	2023-08-22 21:01:57 -04:00
Georgi Gerganov	6381d4e110	gguf : new file format with flexible meta data (beta) (#2398 ) * gguf : first API pass * gguf : read header + meta data * gguf : read tensor info * gguf : initial model loading - not tested * gguf : add gguf_get_tensor_name() * gguf : do not support passing existing ggml_context to gguf_init * gguf : simplify gguf_get_val * gguf : gguf.c is now part of ggml.c * gguf : read / write sample models * gguf : add comments * refactor : reduce code duplication and better API (#2415) * gguf : expose the gguf_type enum through the API for now * gguf : add array support * gguf.py : some code style changes * convert.py : start a new simplified implementation by removing old stuff * convert.py : remove GGML vocab + other obsolete stuff * GGUF : write tensor (#2426) * WIP: Write tensor * GGUF : Support writing tensors in Python * refactor : rm unused import and upd todos * fix : fix errors upd writing example * rm example.gguf * gitignore .gguf undo formatting * gguf : add gguf_find_key (#2438) * gguf.cpp : find key example * ggml.h : add gguf_find_key * ggml.c : add gguf_find_key * gguf : fix writing tensors * gguf : do not hardcode tensor names to read * gguf : write sample tensors to read * gguf : add tokenization constants * quick and dirty conversion example * gguf : fix writing gguf arrays * gguf : write tensors one by one and code reuse * gguf : fix writing gguf arrays * gguf : write tensors one by one * gguf : write tensors one by one * gguf : write tokenizer data * gguf : upd gguf conversion script * Update convert-llama-h5-to-gguf.py * gguf : handle already encoded string * ggml.h : get array str and f32 * ggml.c : get arr str and f32 * gguf.py : support any type * Update convert-llama-h5-to-gguf.py * gguf : fix set is not subscriptable * gguf : update convert-llama-h5-to-gguf.py * constants.py : add layer norm eps * gguf.py : add layer norm eps and merges * ggml.h : increase GGML_MAX_NAME to 64 * ggml.c : add gguf_get_arr_n * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Makefile : add gptneox gguf example * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Update convert-llama-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * gguf : support custom alignment value * gguf : fix typo in function call * gguf : mmap tensor data example * fix : update convert-llama-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * convert-gptneox-h5-to-gguf.py : Special tokens * gptneox-main.cpp : special tokens * Update gptneox-main.cpp * constants.py : special tokens * gguf.py : accumulate kv and tensor info data + special tokens * convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens * gguf : gguf counterpart of llama-util.h * gguf-util.h : update note * convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens * convert-llama-h5-to-gguf.py : special tokens * Delete gptneox-common.cpp * Delete gptneox-common.h * convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer * gptneox-main.cpp : gpt2 bpe tokenizer * gpt2 bpe tokenizer (handles merges and unicode) * Makefile : remove gptneox-common * gguf.py : bytesarray for gpt2bpe tokenizer * cmpnct_gpt2bpe.hpp : comments * gguf.py : use custom alignment if present * gguf : minor stuff * Update gptneox-main.cpp * map tensor names * convert-gptneox-h5-to-gguf.py : map tensor names * convert-llama-h5-to-gguf.py : map tensor names * gptneox-main.cpp : map tensor names * gguf : start implementing libllama in GGUF (WIP) * gguf : start implementing libllama in GGUF (WIP) * rm binary commited by mistake * upd .gitignore * gguf : calculate n_mult * gguf : inference with 7B model working (WIP) * gguf : rm deprecated function * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : add gguf_get_kv_type * gguf : add gguf_get_kv_type * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver * gguf : rm references to old file formats * gguf : shorter name for member variable * gguf : rm redundant method * gguf : get rid of n_mult, read n_ff from file * Update gguf_tensor_map.py * Update gptneox-main.cpp * gguf : rm references to old file magics * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : quantization is working * gguf : roper closing of file * gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : no need to convert tensors twice * convert-llama-h5-to-gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : simplify nbytes * convert-llama-h5-to-gguf.py : simplify nbytes * gptneox-main.cpp : n_layer --> n_block * constants.py : n_layer --> n_block * gguf.py : n_layer --> n_block * convert-gptneox-h5-to-gguf.py : n_layer --> n_block * convert-llama-h5-to-gguf.py : n_layer --> n_block * gptneox-main.cpp : n_layer --> n_block * Update gguf_tensor_map.py * convert-gptneox-h5-to-gguf.py : load model in parts to save memory * convert-llama-h5-to-gguf.py : load model in parts to save memory * convert : write more metadata for LLaMA * convert : rm quantization version * convert-gptneox-h5-to-gguf.py : add file_type key * gptneox-main.cpp : add file_type key * fix conflicts * gguf : add todos and comments * convert-gptneox-h5-to-gguf.py : tensor name map changes * Create gguf_namemap.py : tensor name map changes * Delete gguf_tensor_map.py * gptneox-main.cpp : tensor name map changes * convert-llama-h5-to-gguf.py : fixes * gguf.py : dont add empty strings * simple : minor style changes * gguf : use UNIX line ending * Create convert-llama-7b-pth-to-gguf.py * llama : sync gguf-llama.cpp with latest llama.cpp (#2608) * llama : sync gguf-llama.cpp with latest llama.cpp * minor : indentation + assert * llama : refactor gguf_buffer and gguf_ctx_buffer * llama : minor * gitignore : add gptneox-main * llama : tokenizer fixes (#2549) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * convert : update convert-new.py with tokenizer fixes (#2614) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * llama : sync gguf-llama with llama (#2613) * llama : sync gguf-llama with llama * tests : fix build + warnings (test-tokenizer-1 still fails) * tests : fix wstring_convert * convert : fix layer names * llama : sync gguf-llama.cpp * convert : update HF converter to new tokenizer voodoo magics * llama : update tokenizer style * convert-llama-h5-to-gguf.py : add token types * constants.py : add token types * gguf.py : add token types * convert-llama-7b-pth-to-gguf.py : add token types * gguf-llama.cpp : fix n_head_kv * convert-llama-h5-to-gguf.py : add 70b gqa support * gguf.py : add tensor data layout * convert-llama-h5-to-gguf.py : add tensor data layout * convert-llama-7b-pth-to-gguf.py : add tensor data layout * gptneox-main.cpp : add tensor data layout * convert-llama-h5-to-gguf.py : clarify the reverse permute * llama : refactor model loading code (#2620) * llama : style formatting + remove helper methods * llama : fix quantization using gguf tool * llama : simplify gguf_file_saver * llama : fix method names * llama : simplify write_header() * llama : no need to pass full file loader to the file saver just gguf_ctx * llama : gguf_file_saver write I32 * llama : refactor tensor names (#2622) * gguf: update tensor names searched in quantization * gguf : define tensor names as constants * gguf : initial write API (not tested yet) * gguf : write to file API (not tested) * gguf : initial write API ready + example * gguf : fix header write * gguf : fixes + simplify example + add ggml_nbytes_pad() * gguf : minor * llama : replace gguf_file_saver with new gguf write API * gguf : streaming support when writing files * gguf : remove oboslete write methods * gguf : remove obosolete gguf_get_arr_xxx API * llama : simplify gguf_file_loader * llama : move hparams and vocab from gguf_file_loader to llama_model_loader * llama : merge gguf-util.h in llama.cpp * llama : reorder definitions in .cpp to match .h * llama : minor simplifications * llama : refactor llama_model_loader (WIP) wip : remove ggml_ctx from llama_model_loader wip : merge gguf_file_loader in llama_model_loader * llama : fix shape prints * llama : fix Windows build + fix norm_rms_eps key * llama : throw error on missing KV paris in model meta data * llama : improve printing + log meta data * llama : switch print order of meta data --------- Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com> * gguf : deduplicate (#2629) * gguf : better type names * dedup : CPU + Metal is working * ggml : fix warnings about unused results * llama.cpp : fix line feed and compiler warning * llama : fix strncpy warning + note token_to_str does not write null * llama : restore the original load/save session implementation Will migrate this to GGUF in the future * convert-llama-h5-to-gguf.py : support alt ctx param name * ggml : assert when using ggml_mul with non-F32 src1 * examples : dedup simple --------- Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> * gguf.py : merge all files in gguf.py * convert-new.py : pick #2427 for HF 70B support * examples/gguf : no need to keep q option for quantization any more * llama.cpp : print actual model size * llama.cpp : use ggml_elements() * convert-new.py : output gguf (#2635) * convert-new.py : output gguf (WIP) * convert-new.py : add gguf key-value pairs * llama : add hparams.ctx_train + no longer print ftype * convert-new.py : minor fixes * convert-new.py : vocab-only option should work now * llama : fix tokenizer to use llama_char_to_byte * tests : add new ggml-vocab-llama.gguf * convert-new.py : tensor name mapping * convert-new.py : add map for skipping tensor serialization * convert-new.py : convert script now works * gguf.py : pick some of the refactoring from #2644 * convert-new.py : minor fixes * convert.py : update to support GGUF output * Revert "ci : disable CI temporary to not waste energy" This reverts commit `7e82d25f40`. * convert.py : n_head_kv optional and .gguf file extension * convert.py : better always have n_head_kv and default it to n_head * llama : sync with recent PRs on master * editorconfig : ignore models folder ggml-ci * ci : update ".bin" to ".gguf" extension ggml-ci * llama : fix llama_model_loader memory leak * gptneox : move as a WIP example * llama : fix lambda capture ggml-ci * ggml : fix bug in gguf_set_kv ggml-ci * common.h : .bin --> .gguf * quantize-stats.cpp : .bin --> .gguf * convert.py : fix HF tensor permuting / unpacking ggml-ci * llama.cpp : typo * llama : throw error if gguf fails to init from file ggml-ci * llama : fix tensor name grepping during quantization ggml-ci * gguf.py : write tensors in a single pass (#2644) * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : style fixes in simple conversion script * gguf : refactor gptneox conversion script * gguf : rename h5 to hf (for HuggingFace) * gguf : refactor pth to gguf conversion script * gguf : rm file_type key and method * gguf.py : fix vertical alignment * gguf.py : indentation --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * convert-gptneox-hf-to-gguf.py : fixes * gguf.py : gptneox mapping * convert-llama-hf-to-gguf.py : fixes * convert-llama-7b-pth-to-gguf.py : fixes * ggml.h : reverse GGUF_MAGIC * gguf.py : reverse GGUF_MAGIC * test-tokenizer-0.cpp : fix warning * llama.cpp : print kv general.name * llama.cpp : get special token kv and linefeed token id * llama : print number of tensors per type + print arch + style * tests : update vocab file with new magic * editorconfig : fix whitespaces * llama : re-order functions * llama : remove C++ API + reorganize common source in /common dir * llama : minor API updates * llama : avoid hardcoded special tokens * llama : fix MPI build ggml-ci * llama : introduce enum llama_vocab_type + remove hardcoded string constants * convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested * falcon-main.cpp : falcon inference example * convert-falcon-hf-to-gguf.py : remove extra kv * convert-gptneox-hf-to-gguf.py : remove extra kv * convert-llama-7b-pth-to-gguf.py : remove extra kv * convert-llama-hf-to-gguf.py : remove extra kv * gguf.py : fix for falcon 40b * falcon-main.cpp : fix for falcon 40b * convert-falcon-hf-to-gguf.py : update ref * convert-falcon-hf-to-gguf.py : add tensor data layout * cmpnct_gpt2bpe.hpp : fixes * falcon-main.cpp : fixes * gptneox-main.cpp : fixes * cmpnct_gpt2bpe.hpp : remove non-general stuff * Update examples/server/README.md Co-authored-by: slaren <slarengh@gmail.com> * cmpnct_gpt2bpe.hpp : cleanup * convert-llama-hf-to-gguf.py : special tokens * convert-llama-7b-pth-to-gguf.py : special tokens * convert-permute-debug.py : permute debug print * convert-permute-debug-master.py : permute debug for master * convert-permute-debug.py : change permute type of attn_q * convert.py : 70b model working (change attn_q permute) * Delete convert-permute-debug-master.py * Delete convert-permute-debug.py * convert-llama-hf-to-gguf.py : fix attn_q permute * gguf.py : fix rope scale kv * convert-llama-hf-to-gguf.py : rope scale and added tokens * convert-llama-7b-pth-to-gguf.py : rope scale and added tokens * llama.cpp : use rope scale kv * convert-llama-7b-pth-to-gguf.py : rope scale fix * convert-llama-hf-to-gguf.py : rope scale fix * py : fix whitespace * gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682) * First pass at converting GGMLv3 LLaMA models to GGUF * Cleanups, better output during conversion * Fix vocab space conversion logic * More vocab conversion fixes * Add description to converted GGUF files * Improve help text, expand warning * Allow specifying name and description for output GGUF * Allow overriding vocab and hyperparams from original model metadata * Use correct params override var name * Fix wrong type size for Q8_K Better handling of original style metadata * Set default value for gguf add_tensor raw_shape KW arg * llama : improve token type support (#2668) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * llama : add API for token type ggml-ci * tests : use new tokenizer type API (#2692) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * Improve commentary * Use token type API in test-tokenizer-1.cpp * py : cosmetics * readme : add notice about new file format ggml-ci --------- Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com> Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> Co-authored-by: goerch <jhr.walter@t-online.de> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-08-21 23:07:43 +03:00
Adrian	2d8b76a110	Add link to clojure bindings to Readme. (#2659 )	2023-08-18 21:39:22 +02:00
Georgi Gerganov	7af633aec3	readme : incoming BREAKING CHANGE	2023-08-18 17:48:31 +03:00
mdrokz	eaf98c2649	readme : add link to Rust bindings (#2656 )	2023-08-18 13:17:58 +03:00
Johannes Gäßler	0992a7b8b1	README: fix LLAMA_CUDA_MMV_Y documentation (#2647 )	2023-08-17 23:57:59 +02:00
Henri Vasserman	6ddeefad9b	[Zig] Fixing Zig build and improvements (#2554 ) * Fix zig after console.o was split * Better include and flag management * Change LTO to option	2023-08-17 23:11:18 +03:00
Johannes Gäßler	25d43e0eb5	CUDA: tuned mul_mat_q kernels (#2546 )	2023-08-09 09:42:34 +02:00
ldwang	220d931864	readme : add Aquila-7B model series to supported models (#2487 ) * support bpe tokenizer in convert Signed-off-by: ldwang <ftgreat@gmail.com> * support bpe tokenizer in convert Signed-off-by: ldwang <ftgreat@gmail.com> * support bpe tokenizer in convert, fix Signed-off-by: ldwang <ftgreat@gmail.com> * Add Aquila-7B models in README.md Signed-off-by: ldwang <ftgreat@gmail.com> * Up Aquila-7B models in README.md Signed-off-by: ldwang <ftgreat@gmail.com> --------- Signed-off-by: ldwang <ftgreat@gmail.com> Co-authored-by: ldwang <ftgreat@gmail.com>	2023-08-02 11:21:11 +03:00
Yiming Cui	a312193e18	readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475 ) * add support for chinese llama-2 / alpaca-2 * remove white spaces	2023-08-02 09:18:31 +03:00
Johannes Gäßler	0728c5a8b9	CUDA: mmq CLI option, fixed mmq build issues (#2453 )	2023-07-31 15:44:35 +02:00
Johannes Gäßler	11f3ca06b8	CUDA: Quantized matrix matrix multiplication (#2160 ) * mmq implementation for non k-quants * q6_K * q2_K * q3_k * q4_K * vdr * q5_K * faster q8_1 loading * loop unrolling * add __restrict__ * q2_K sc_high * GGML_CUDA_MMQ_Y * Updated Makefile * Update Makefile * DMMV_F16 -> F16 * Updated README, CMakeLists * Fix CMakeLists.txt * Fix CMakeLists.txt * Fix multi GPU out-of-bounds	2023-07-29 23:04:44 +02:00
niansa/tuxifan	edcc7ae7d2	Obtaining LLaMA 2 instructions (#2308 ) * Obtaining LLaMA 2 instructions * Removed sharing warning for LLaMA 2 * Linked TheBloke's GGML repos * Add LLaMA 2 to list of supported models * Added LLaMA 2 usage instructions * Added links to LLaMA 2 70B models	2023-07-28 03:14:11 +02:00
Johannes Gäßler	70d26ac388	Fix __dp4a documentation (#2348 )	2023-07-23 17:49:06 +02:00
Jose Maldonado	91171b8072	make : fix CLBLAST compile support in FreeBSD (#2331 ) * Fix Makefile for CLBLAST compile support and instructions for compile llama.cpp FreeBSD * More general use-case for CLBLAST support (Linux and FreeBSD)	2023-07-23 14:52:08 +03:00
wzy	78a3d13424	flake : remove intel mkl from flake.nix due to missing files (#2277 ) NixOS's mkl misses some libraries like mkl-sdl.pc. See #2261 Currently NixOS doesn't have intel C compiler (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975 So remove it from flake.nix Some minor changes: - Change pkgs.python310 to pkgs.python3 to keep latest - Add pkgconfig to devShells.default - Remove installPhase because we have `cmake --install` from #2256	2023-07-21 13:26:34 +03:00
wzy	45a1b07e9b	flake : update flake.nix (#2270 ) When `isx86_32 \|\| isx86_64`, it will use mkl, else openblas According to https://discourse.nixos.org/t/rpath-of-binary-contains-a-forbidden-reference-to-build/12200/3, add -DCMAKE_SKIP_BUILD_RPATH=ON Fix #2261, Nix doesn't provide mkl-sdl.pc. When we build with -DBUILD_SHARED_LIBS=ON, -DLLAMA_BLAS_VENDOR=Intel10_lp64 replace mkl-sdl.pc by mkl-dynamic-lp64-iomp.pc	2023-07-19 10:01:55 +03:00
Jiří Podivín	27ab66e437	py : turn verify-checksum-models.py into executable (#2245 ) README.md was adjusted to reflect the change. Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2023-07-16 22:54:47 +03:00
Chad Brewbaker	917831c63a	readme : fix zig build instructions (#2171 )	2023-07-11 19:03:06 +03:00
Evan Miller	5656d10599	mpi : add support for distributed inference via MPI (#2099 ) * MPI support, first cut * fix warnings, update README * fixes * wrap includes * PR comments * Update CMakeLists.txt * Add GH workflow, fix test * Add info to README * mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099) * mpi : add names for layer inputs + prep ggml_mpi_graph_compute() * mpi : move all MPI logic into ggml-mpi Not tested yet * mpi : various fixes - communication now works but results are wrong * mpi : fix output tensor after MPI compute (still not working) * mpi : fix inference * mpi : minor * Add OpenMPI to GH action * [mpi] continue-on-error: true * mpi : fix after master merge * [mpi] Link MPI C++ libraries to fix OpenMPI * tests : fix new llama_backend API * [mpi] use MPI_INT32_T * mpi : factor out recv / send in functions and reuse * mpi : extend API to allow usage with outer backends (e.g. Metal) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-10 18:49:56 +03:00
JackJollimore	18780e0a5e	readme : update Termux instructions (#2147 ) The file pathing is significant when running models inside of Termux on Android devices. llama.cpp performance is improved with loading a .bin from the $HOME directory.	2023-07-09 11:20:43 +03:00
rankaiyx	2492a53fd0	readme : add more docs indexes (#2127 ) * Update README.md to add more docs indexes * Update README.md to add more docs indexes	2023-07-09 10:38:42 +03:00
dylan	84525e7962	docker : add support for CUDA in docker (#1461 ) Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-07 21:25:25 +03:00
Judd	36680f6e40	convert : update for baichuan (#2081 ) 1. guess n_layers; 2. relax warnings on context size; 3. add a note that its derivations are also supported. Co-authored-by: Judd <foldl@boxvest.com>	2023-07-06 19:23:49 +03:00
Johannes Gäßler	924dd22fd3	Quantized dot products for CUDA mul mat vec (#2067 )	2023-07-05 14:19:42 +02:00
Georgi Gerganov	b472f3fca5	readme : add link web chat PR	2023-07-04 22:25:22 +03:00
Judd	471aab6e4c	convert : add support of baichuan-7b (#2055 ) Co-authored-by: Judd <foldl@boxvest.com>	2023-07-01 20:00:25 +03:00
Roman Parykin	d38e451578	readme : add Scala 3 bindings repo (#2010 )	2023-06-26 22:47:59 +03:00
Gustavo Rocha Dias	aa777abbb7	readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007 ) * docs - Alternative way to build at Android, with CLBlast. * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux. * doc- fix typo	2023-06-26 22:34:45 +03:00
Georgi Gerganov	412c60e473	readme : add link to new k-quants for visibility	2023-06-26 19:45:09 +03:00
Georgi Gerganov	447ccbe8c3	readme : add new roadmap + manifesto	2023-06-25 16:08:12 +03:00
Georgi Gerganov	66a2555ba6	readme : add Azure CI discussion link	2023-06-25 09:07:03 +03:00
Georgi Gerganov	11da1a85cd	readme : fix whitespaces	2023-06-24 13:38:18 +03:00
Alberto	235b610d65	readme : fixed termux instructions (#1973 )	2023-06-24 13:32:13 +03:00
eiery	d7b7484f74	Add OpenLLaMA instructions to the README (#1954 ) * add openllama to readme	2023-06-23 10:38:01 +02:00
Rahul Vivek Nair	fb98254f99	Fix typo in README.md (#1961 )	2023-06-21 23:48:43 +02:00
Georgi Gerganov	049aa16b8c	readme : add link to p1	2023-06-20 19:05:54 +03:00
Xiake Sun	2322ec223a	Fix typo (#1949 )	2023-06-20 15:42:40 +03:00

1 2 3 4 5 ...

277 Commits