2023-03-10 20:47:46 +01:00
# llama.cpp
2023-04-05 17:56:20 +02:00
![llama ](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png )
2023-03-26 09:20:49 +02:00
2024-05-30 17:07:39 +02:00
[![License: MIT ](https://img.shields.io/badge/license-MIT-blue.svg )](https://opensource.org/licenses/MIT)
2024-07-19 13:34:55 +02:00
[![Server ](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg )](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
2024-05-30 17:07:39 +02:00
[![Conan Center ](https://shields.io/conan/v/llama-cpp )](https://conan.io/center/llama-cpp)
2023-03-12 21:09:26 +01:00
2023-10-04 15:50:44 +02:00
[Roadmap ](https://github.com/users/ggerganov/projects/7 ) / [Project status ](https://github.com/ggerganov/llama.cpp/discussions/3471 ) / [Manifesto ](https://github.com/ggerganov/llama.cpp/discussions/205 ) / [ggml ](https://github.com/ggerganov/ggml )
2023-06-25 15:08:12 +02:00
2024-02-05 15:55:10 +01:00
Inference of Meta's [LLaMA ](https://arxiv.org/abs/2302.13971 ) model (and others) in pure C/C++
2023-03-10 20:47:46 +01:00
2024-06-13 01:41:52 +02:00
> [!IMPORTANT]
[2024 Jun 12] Binaries have been renamed w/ a `llama-` prefix. `main` is now `llama-cli` , `server` is `llama-server` , etc (https://github.com/ggerganov/llama.cpp/pull/7809)
2024-07-05 18:08:32 +02:00
## Recent API changes
2024-03-03 11:44:03 +01:00
2024-06-26 18:26:13 +02:00
- [2024 Jun 26] The source code and CMake build scripts have been restructured https://github.com/ggerganov/llama.cpp/pull/8006
2024-04-21 17:36:45 +02:00
- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens https://github.com/ggerganov/llama.cpp/pull/6807
2024-04-08 14:43:30 +02:00
- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
llama : greatly reduce output buffer memory usage (#6122)
* llama : greatly reduce logits memory usage
* llama : more compact state saving and reloading
* llama : fix lctx.n_outputs not being set before building graph
* perplexity : adapt to the logits API changes
* perplexity : fix Winogrande, use correct logits for second choice start
The first logits used to evaluate the second choice were not from
the end of the common prefix; instead, they were the logits from the end
of the first choice. This has been corrected.
The previous implementation sometimes had outliers in the scores of
choices for some tasks, and the logic to skip choices words
in the log-likelihood evaluation probably was an attempt to reduce those,
but it was complex and didn't quite seem to be the right thing.
This is simpler now, and the outlier scores aren't there anymore.
* perplexity : normalize spaces and punctuation in Winogrande sentences
* llama : fix embedding conditions
* llama : fix llama_get_embeddings_ith when the resulting id is 0
* llama : fix wrong n_outputs in llama_set_inputs
A mismatch happened when using a smaller n_ubatch than n_batch and then using
llama_batch_get_one(). The decision of what n_outputs should be now almost
fully depends on how lctx.n_outputs is set in llama_decode_internal.
The conditions are simpler this way.
* llama : when saving the state, recalculate n_outputs
This ensures the correct number of outputs for the entire previous batch
is stored in the session file, even when n_ubatch is smaller than n_batch.
* llama : fix not-skipping outputs of non-causal models
* llama : fix running a batch with n_outputs == 0
It previously worked because lctx.inp_out_ids was not initialized,
so it pointed to some garbage address which was somehow still valid when I
ran my tests.
* llama : keep same graph topology even when n_outputs == 0
* ggml : saner ggml_can_repeat with empty tensors
* ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1
* ggml : do not multi-thread ops returning empty tensors
* ggml : make ggml_is_empty public and work with views
* llama : use a vector for ctx->output_ids
* llama : rework reallocation logic for llama_output_reserve
Now comparing the actual size with the new total size of the output buffer
to allow more efficient enabling and disabling of the embeddings
and/or logits output in the future.
* ggml : skip empty tensors in all backends
* llama : fix llama_output_reserve nullptr deref when new_size is 0
* perplexity : make Winogrande work as it does on master
The problems with the Winogrande implementation will
need to be fixed in a separate PR to ease review.
* llama : clearer error messages for invalid logits or embeddings ids
* llama : assert all models that can have inp_out_ids
Since the graph topology is now constant, this presence check
can be done even when there are no outputs.
* llama : assert logits and embd buffers exist before writing to them
* llama : handle errors from llama_output_reserve at call sites
* perplexity : make hellaswag and multiple-choice outputs identical to master
Due to how the KV cache is updated, the logprobs for tokens in a batch
are very slightly affected by the other tokens present in the batch,
so to make hellaswag and multiple-choice return exactly the same results
as on master, the last token of each sequence needs to be evaluated
even though its output is not used at all.
This will probably be changed back in the future to make these benchmarks
a tiny bit faster.
* perplexity : fix division by zero when using less than 100 multiple-choice tasks
* llama : allow loading state saved with a different ctx size
When loading a session file, the context size is now only required to be
at least enough to load the KV cells contained in that session file,
instead of requiring to use exactly the same context size as when saving.
Doing this enables the use-case of extending or shrinking the context size
of a saved session.
This breaks existing session files because the meaning of kv_buf_size
is slightly changed (previously it was the size of the whole KV cache,
now it's only the size of the saved part of it). This allows for
finer-grained sanity checks when loading in an effort to keep kv_buf_size
useful even when the kv_size is changed.
* llama : minor
ggml-ci
* readme : update recent API changes, and warn about Vulkan
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-26 15:46:41 +01:00
- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
2024-03-13 19:33:56 +01:00
- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
2024-03-11 16:49:47 +01:00
- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void` , and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
2024-03-04 21:31:20 +01:00
- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
2024-03-03 11:44:03 +01:00
- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
2024-07-05 18:08:32 +02:00
## Hot topics
2023-08-27 13:44:35 +02:00
2024-07-05 06:53:33 +02:00
- **`convert.py` has been deprecated and moved to `examples/convert_legacy_llama.py` , please use `convert_hf_to_gguf.py` ** https://github.com/ggerganov/llama.cpp/pull/7430
2024-05-31 10:09:20 +02:00
- Initial Flash-Attention support: https://github.com/ggerganov/llama.cpp/pull/5021
2024-05-07 20:43:13 +02:00
- BPE pre-tokenization support has been added: https://github.com/ggerganov/llama.cpp/pull/6920
2024-04-29 16:06:19 +02:00
- MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` https://github.com/ggerganov/llama.cpp/pull/6387
2024-03-31 10:56:30 +02:00
- Model sharding instructions using `gguf-split` https://github.com/ggerganov/llama.cpp/discussions/6404
2024-03-22 10:35:53 +01:00
- Fix major bug in Metal batched inference https://github.com/ggerganov/llama.cpp/pull/6225
2024-04-04 19:16:37 +02:00
- Multi-GPU pipeline parallelism support https://github.com/ggerganov/llama.cpp/pull/6017
2024-03-10 19:58:26 +01:00
- Looking for contributions to add Deepseek support: https://github.com/ggerganov/llama.cpp/issues/5981
- Quantization blind testing: https://github.com/ggerganov/llama.cpp/discussions/5962
2024-03-09 17:14:13 +01:00
- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328
2023-08-18 16:48:31 +02:00
----
2023-03-14 08:43:52 +01:00
2023-03-10 20:47:46 +01:00
## Description
2024-02-05 15:55:10 +01:00
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
variety of hardware - locally and in the cloud.
2023-03-10 20:47:46 +01:00
2024-02-05 15:55:10 +01:00
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
2023-05-05 16:43:36 +02:00
- AVX, AVX2 and AVX512 support for x86 architectures
2024-02-19 08:39:31 +01:00
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
2024-02-05 15:55:10 +01:00
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
2024-06-04 20:23:20 +02:00
- Vulkan and SYCL backend support
2024-02-05 15:55:10 +01:00
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
2023-03-10 20:47:46 +01:00
2024-02-05 15:55:10 +01:00
Since its [inception ](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022 ), the project has
improved significantly thanks to many contributions. It is the main playground for developing new features for the
[ggml ](https://github.com/ggerganov/ggml ) library.
2023-03-11 11:31:21 +01:00
2023-04-05 17:56:20 +02:00
**Supported models:**
2023-03-29 18:37:20 +02:00
2024-02-07 07:21:30 +01:00
Typically finetunes of the base models below are supported as well.
2023-03-30 21:31:54 +02:00
- [X] LLaMA 🦙
2023-07-28 03:14:11 +02:00
- [x] LLaMA 2 🦙🦙
2024-04-25 15:52:28 +02:00
- [x] LLaMA 3 🦙🦙🦙
2024-02-07 07:21:30 +01:00
- [X] [Mistral 7B ](https://huggingface.co/mistralai/Mistral-7B-v0.1 )
2024-02-05 15:55:10 +01:00
- [x] [Mixtral MoE ](https://huggingface.co/models?search=mistral-ai/Mixtral )
2024-04-13 11:33:52 +02:00
- [x] [DBRX ](https://huggingface.co/databricks/dbrx-instruct )
2024-04-21 14:35:40 +02:00
- [X] [Falcon ](https://huggingface.co/models?search=tiiuae/falcon )
2023-08-02 08:18:31 +02:00
- [X] [Chinese LLaMA / Alpaca ](https://github.com/ymcui/Chinese-LLaMA-Alpaca ) and [Chinese LLaMA-2 / Alpaca-2 ](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2 )
2023-03-30 21:31:54 +02:00
- [X] [Vigogne (French) ](https://github.com/bofenghuang/vigogne )
2024-07-01 13:40:58 +02:00
- [X] [BERT ](https://github.com/ggerganov/llama.cpp/pull/5423 )
2023-04-10 22:41:53 +02:00
- [X] [Koala ](https://bair.berkeley.edu/blog/2023/04/03/koala/ )
2023-10-17 20:13:21 +02:00
- [X] [Baichuan 1 & 2 ](https://huggingface.co/models?search=baichuan-inc/Baichuan ) + [derivations ](https://huggingface.co/hiyouga/baichuan-7b-sft )
- [X] [Aquila 1 & 2 ](https://huggingface.co/models?search=BAAI/Aquila )
2023-09-29 14:50:35 +02:00
- [X] [Starcoder models ](https://github.com/ggerganov/llama.cpp/pull/3187 )
2023-10-06 21:13:36 +02:00
- [X] [Refact ](https://huggingface.co/smallcloudai/Refact-1_6B-fim )
2023-10-11 01:02:49 +02:00
- [X] [MPT ](https://github.com/ggerganov/llama.cpp/pull/3417 )
2023-10-17 20:13:21 +02:00
- [X] [Bloom ](https://github.com/ggerganov/llama.cpp/pull/3553 )
2023-12-14 08:38:49 +01:00
- [x] [Yi models ](https://huggingface.co/models?search=01-ai/Yi )
2024-02-06 15:06:48 +01:00
- [X] [StableLM models ](https://huggingface.co/stabilityai )
2023-12-14 08:38:49 +01:00
- [x] [Deepseek models ](https://huggingface.co/models?search=deepseek-ai/deepseek )
- [x] [Qwen models ](https://huggingface.co/models?search=Qwen/Qwen )
2023-12-24 14:35:49 +01:00
- [x] [PLaMo-13B ](https://github.com/ggerganov/llama.cpp/pull/3557 )
2024-02-06 15:06:48 +01:00
- [x] [Phi models ](https://huggingface.co/models?search=microsoft/phi )
2023-12-28 15:03:57 +01:00
- [x] [GPT-2 ](https://huggingface.co/gpt2 )
2024-02-06 15:06:48 +01:00
- [x] [Orion 14B ](https://github.com/ggerganov/llama.cpp/pull/5118 )
- [x] [InternLM2 ](https://huggingface.co/models?search=internlm2 )
2024-02-05 08:41:38 +01:00
- [x] [CodeShell ](https://github.com/WisdomShell/codeshell )
2024-02-21 14:08:22 +01:00
- [x] [Gemma ](https://ai.google.dev/gemma )
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of offical Mamba models
I was using Q-bert/Mamba-* models before, which have a slighlty different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay to high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
2024-03-08 23:31:00 +01:00
- [x] [Mamba ](https://github.com/state-spaces/mamba )
2024-04-25 15:52:28 +02:00
- [x] [Grok-1 ](https://huggingface.co/keyfan/grok-1-hf )
2024-03-29 14:37:03 +01:00
- [x] [Xverse ](https://huggingface.co/models?search=xverse )
2024-04-25 15:52:28 +02:00
- [x] [Command-R models ](https://huggingface.co/models?search=CohereForAI/c4ai-command-r )
2024-04-03 20:05:10 +02:00
- [x] [SEA-LION ](https://huggingface.co/models?search=sea-lion )
2024-04-07 19:33:59 +02:00
- [x] [GritLM-7B ](https://huggingface.co/GritLM/GritLM-7B ) + [GritLM-8x7B ](https://huggingface.co/GritLM/GritLM-8x7B )
2024-04-19 11:35:54 +02:00
- [x] [OLMo ](https://allenai.org/olmo )
2024-08-05 07:54:10 +02:00
- [x] [Granite models ](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330 )
2024-05-23 14:12:43 +02:00
- [x] [GPT-NeoX ](https://github.com/EleutherAI/gpt-neox ) + [Pythia ](https://github.com/EleutherAI/pythia )
2024-08-05 07:54:10 +02:00
- [x] [Snowflake-Arctic MoE ](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520 )
- [x] [Smaug ](https://huggingface.co/models?search=Smaug )
- [x] [Poro 34B ](https://huggingface.co/LumiOpen/Poro-34B )
- [x] [Bitnet b1.58 models ](https://huggingface.co/1bitLLM )
- [x] [Flan T5 ](https://huggingface.co/models?search=flan-t5 )
- [x] [Open Elm models ](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca )
2024-07-08 07:57:19 +02:00
- [x] [ChatGLM3-6b ](https://huggingface.co/THUDM/chatglm3-6b ) + [ChatGLM4-9b ](https://huggingface.co/THUDM/glm-4-9b )
2024-08-05 07:54:10 +02:00
- [x] [SmolLM ](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966 )
2024-08-16 08:35:18 +02:00
- [x] [EXAONE-3.0-7.8B-Instruct ](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct )
2024-08-21 10:06:36 +02:00
- [x] [FalconMamba Models ](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a )
2023-12-14 08:38:49 +01:00
2024-07-08 16:19:24 +02:00
(instructions for supporting more models: [HOWTO-add-model.md ](./docs/development/HOWTO-add-model.md ))
2024-04-10 08:58:48 +02:00
2023-12-14 08:38:49 +01:00
**Multimodal models:**
2024-02-28 09:39:39 +01:00
- [x] [LLaVA 1.5 models ](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e ), [LLaVA 1.6 models ](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2 )
2024-02-05 15:55:10 +01:00
- [x] [BakLLaVA ](https://huggingface.co/models?search=SkunkworksAI/Bakllava )
2023-12-14 08:38:49 +01:00
- [x] [Obsidian ](https://huggingface.co/NousResearch/Obsidian-3B-V0.5 )
- [x] [ShareGPT4V ](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V )
2024-01-25 21:14:32 +01:00
- [x] [MobileVLM 1.7B/3B models ](https://huggingface.co/models?search=mobileVLM )
2024-02-06 15:06:48 +01:00
- [x] [Yi-VL ](https://huggingface.co/models?search=Yi-VL )
2024-04-25 15:52:28 +02:00
- [x] [Mini CPM ](https://huggingface.co/models?search=MiniCPM )
2024-05-10 08:41:10 +02:00
- [x] [Moondream ](https://huggingface.co/vikhyatk/moondream2 )
2024-05-23 16:43:18 +02:00
- [x] [Bunny ](https://github.com/BAAI-DCAI/Bunny )
2023-10-17 20:13:21 +02:00
2023-04-05 17:56:20 +02:00
**Bindings:**
- Python: [abetlen/llama-cpp-python ](https://github.com/abetlen/llama-cpp-python )
- Go: [go-skynet/go-llama.cpp ](https://github.com/go-skynet/go-llama.cpp )
2023-10-22 20:16:43 +02:00
- Node.js: [withcatai/node-llama-cpp ](https://github.com/withcatai/node-llama-cpp )
2024-01-07 21:24:11 +01:00
- JS/TS (llama.cpp server client): [lgrammel/modelfusion ](https://modelfusion.dev/integration/model-provider/llamacpp )
2024-02-09 11:17:00 +01:00
- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm ](https://github.com/tangledgroup/llama-cpp-wasm )
2024-03-16 16:42:08 +01:00
- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama ](https://github.com/ngxson/wllama )
2023-04-17 21:34:35 +02:00
- Ruby: [yoshoku/llama_cpp.rb ](https://github.com/yoshoku/llama_cpp.rb )
2024-04-03 19:53:37 +02:00
- Rust (more features): [edgenai/llama_cpp-rs ](https://github.com/edgenai/llama_cpp-rs )
2024-01-28 09:30:44 +01:00
- Rust (nicer API): [mdrokz/rust-llama.cpp ](https://github.com/mdrokz/rust-llama.cpp )
- Rust (more direct bindings): [utilityai/llama-cpp-rs ](https://github.com/utilityai/llama-cpp-rs )
2023-05-12 07:39:40 +02:00
- C#/.NET: [SciSharp/LLamaSharp ](https://github.com/SciSharp/LLamaSharp )
2023-06-26 21:47:59 +02:00
- Scala 3: [donderom/llm4s ](https://github.com/donderom/llm4s )
2023-08-18 21:39:22 +02:00
- Clojure: [phronmophobic/llama.clj ](https://github.com/phronmophobic/llama.clj )
2023-08-29 11:30:10 +02:00
- React Native: [mybigday/llama.rn ](https://github.com/mybigday/llama.rn )
2023-09-01 15:36:14 +02:00
- Java: [kherud/java-llama.cpp ](https://github.com/kherud/java-llama.cpp )
2023-12-22 07:49:54 +01:00
- Zig: [deins/llama.cpp.zig ](https://github.com/Deins/llama.cpp.zig )
2024-01-20 09:05:43 +01:00
- Flutter/Dart: [netdur/llama_cpp_dart ](https://github.com/netdur/llama_cpp_dart )
2024-03-27 08:08:59 +01:00
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance ](https://github.com/distantmagic/resonance ) [(more info) ](https://github.com/ggerganov/llama.cpp/pull/6326 )
2024-07-07 15:21:37 +02:00
- Guile Scheme: [guile_llama_cpp ](https://savannah.nongnu.org/projects/guile-llama-cpp )
2023-04-05 17:56:20 +02:00
**UI:**
2024-02-05 15:55:10 +01:00
Unless otherwise noted these projects are open-source with permissive licensing:
2024-07-24 14:52:30 +02:00
- [MindWorkAI/AI-Studio ](https://github.com/MindWorkAI/AI-Studio ) (FSL-1.1-MIT)
2024-02-05 15:55:10 +01:00
- [iohub/collama ](https://github.com/iohub/coLLaMA )
- [janhq/jan ](https://github.com/janhq/jan ) (AGPL)
2023-04-05 17:56:20 +02:00
- [nat/openplayground ](https://github.com/nat/openplayground )
2024-02-07 07:16:48 +01:00
- [Faraday ](https://faraday.dev/ ) (proprietary)
2024-02-05 15:55:10 +01:00
- [LMStudio ](https://lmstudio.ai/ ) (proprietary)
2024-05-09 15:32:40 +02:00
- [Layla ](https://play.google.com/store/apps/details?id=com.laylalite ) (proprietary)
2024-08-05 14:45:01 +02:00
- [ramalama ](https://github.com/containers/ramalama ) (MIT)
2024-02-21 15:39:10 +01:00
- [LocalAI ](https://github.com/mudler/LocalAI ) (MIT)
2024-02-05 15:55:10 +01:00
- [LostRuins/koboldcpp ](https://github.com/LostRuins/koboldcpp ) (AGPL)
- [Mozilla-Ocho/llamafile ](https://github.com/Mozilla-Ocho/llamafile )
- [nomic-ai/gpt4all ](https://github.com/nomic-ai/gpt4all )
- [ollama/ollama ](https://github.com/ollama/ollama )
- [oobabooga/text-generation-webui ](https://github.com/oobabooga/text-generation-webui ) (AGPL)
2023-11-29 08:16:34 +01:00
- [psugihara/FreeChat ](https://github.com/psugihara/FreeChat )
2024-02-07 19:44:52 +01:00
- [cztomsik/ava ](https://github.com/cztomsik/ava ) (MIT)
2023-12-25 17:09:53 +01:00
- [ptsochantaris/emeltal ](https://github.com/ptsochantaris/emeltal )
2024-02-05 15:55:10 +01:00
- [pythops/tenere ](https://github.com/pythops/tenere ) (AGPL)
2024-06-16 13:51:18 +02:00
- [RAGNA Desktop ](https://ragna.app/ ) (proprietary)
2024-03-22 12:29:49 +01:00
- [RecurseChat ](https://recurse.chat/ ) (proprietary)
2024-02-05 15:55:10 +01:00
- [semperai/amica ](https://github.com/semperai/amica )
- [withcatai/catai ](https://github.com/withcatai/catai )
2024-02-20 11:00:23 +01:00
- [Mobile-Artificial-Intelligence/maid ](https://github.com/Mobile-Artificial-Intelligence/maid ) (MIT)
2024-02-25 16:57:34 +01:00
- [Msty ](https://msty.app ) (proprietary)
2024-02-26 15:15:28 +01:00
- [LLMFarm ](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file ) (MIT)
2024-03-29 08:33:46 +01:00
- [KanTV ](https://github.com/zhouwg/kantv?tab=readme-ov-file )(Apachev2.0 or later)
2024-04-04 19:22:50 +02:00
- [Dot ](https://github.com/alexpinel/Dot ) (GPL)
2024-04-05 20:39:43 +02:00
- [MindMac ](https://mindmac.app ) (proprietary)
2024-04-08 09:48:29 +02:00
- [KodiBot ](https://github.com/firatkiral/kodibot ) (GPL)
2024-04-10 08:34:00 +02:00
- [eva ](https://github.com/ylsdamxssjxxdd/eva ) (MIT)
2024-04-17 14:47:50 +02:00
- [AI Sublime Text plugin ](https://github.com/yaroslavyaroslav/OpenAI-sublime-text ) (MIT)
2024-05-31 01:57:16 +02:00
- [AIKit ](https://github.com/sozercan/aikit ) (MIT)
2024-06-18 08:57:41 +02:00
- [LARS - The LLM & Advanced Referencing Solution ](https://github.com/abgulati/LARS ) (AGPL)
2024-04-17 14:47:50 +02:00
2024-03-28 21:56:03 +01:00
*(to have a project listed here, it should clearly state that it depends on `llama.cpp` )*
2024-05-26 14:09:42 +02:00
**Tools:**
- [akx/ggify ](https://github.com/akx/ggify ) – download PyTorch models from HuggingFace Hub and convert them to GGML
2024-07-01 18:48:34 +02:00
- [crashr/gppm ](https://github.com/crashr/gppm ) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
2024-08-12 14:45:50 +02:00
- [gpustack/gguf-parser ](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser ) - review/check the GGUF file and estimate the memory usage
2024-05-26 14:09:42 +02:00
2024-07-01 19:13:22 +02:00
**Infrastructure:**
- [Paddler ](https://github.com/distantmagic/paddler ) - Stateful load balancer custom-tailored for llama.cpp
2024-08-12 14:45:50 +02:00
- [GPUStack ](https://github.com/gpustack/gpustack ) - Manage GPU clusters for running LLMs
2024-07-01 19:13:22 +02:00
2024-07-24 18:48:00 +02:00
**Games:**
- [Lucy's Labyrinth ](https://github.com/MorganRO8/Lucys_Labyrinth ) - A simple maze game where agents controlled by an AI model will try to trick you.
2024-07-05 18:08:32 +02:00
## Demo
2023-03-10 20:47:46 +01:00
2024-07-05 18:08:32 +02:00
< details >
< summary > Typical run using LLaMA v2 13B on M2 Ultra< / summary >
2023-03-10 20:47:46 +01:00
2024-02-07 07:21:30 +01:00
```
2024-06-13 01:41:52 +02:00
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
2023-08-23 22:43:00 +02:00
I llama.cpp build info:
2023-03-10 20:47:46 +01:00
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
2023-08-23 22:41:16 +02:00
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
2023-03-10 20:47:46 +01:00
I LDFLAGS: -framework Accelerate
2023-08-23 22:41:16 +02:00
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
2023-03-10 20:47:46 +01:00
2023-03-10 23:09:19 +01:00
make: Nothing to be done for `default'.
2023-08-23 22:41:16 +02:00
main: build = 1041 (cf658ad)
main: seed = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V1 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '< s > '
llm_load_print_meta: EOS token = 2 '< / s > '
llm_load_print_meta: UNK token = 0 '< unk > '
llm_load_print_meta: LF token = 13 '< 0x0A > '
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: mem required = 7024.01 MB (+ 400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.41 MB
2023-08-23 22:43:00 +02:00
system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
2023-08-23 22:41:16 +02:00
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’ s an image or text file (like PDFs). In order for someone else’ s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings: load time = 576.45 ms
llama_print_timings: sample time = 283.10 ms / 400 runs ( 0.71 ms per token, 1412.91 tokens per second)
llama_print_timings: prompt eval time = 599.83 ms / 19 tokens ( 31.57 ms per token, 31.68 tokens per second)
llama_print_timings: eval time = 24513.59 ms / 399 runs ( 61.44 ms per token, 16.28 tokens per second)
llama_print_timings: total time = 25431.49 ms
2023-03-10 20:47:46 +01:00
```
2024-07-05 18:08:32 +02:00
< / details >
< details >
< summary > Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook< / summary >
2023-03-10 23:51:46 +01:00
And here is another demo of running both LLaMA-7B and [whisper.cpp ](https://github.com/ggerganov/whisper.cpp ) on a single M1 Pro MacBook:
https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4
2024-07-05 18:08:32 +02:00
< / details >
2023-03-10 20:47:46 +01:00
## Usage
2024-02-07 07:21:30 +01:00
Here are the end-to-end binary build and model conversion steps for most supported models.
2023-04-13 15:43:22 +02:00
2024-07-05 18:08:32 +02:00
### Basic usage
2023-04-26 22:03:03 +02:00
2024-07-05 18:08:32 +02:00
Firstly, you need to get the binary. There are different methods that you can follow:
- Method 1: Clone this repository and build locally, see [how to build ](./docs/build.md )
- Method 2: If you are using MacOS or Linux, you can install llama.cpp via [brew, flox or nix ](./docs/install.md )
- Method 3: Use a Docker image, see [documentation for Docker ](./docs/docker.md )
- Method 4: Download pre-built binary from [releases ](https://github.com/ggerganov/llama.cpp/releases )
2023-04-26 22:03:03 +02:00
2024-07-05 18:08:32 +02:00
You can run a basic completion using this command:
2023-03-10 20:47:46 +01:00
2024-07-05 18:08:32 +02:00
```bash
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
2024-05-30 16:58:15 +02:00
2024-07-05 18:08:32 +02:00
# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
2024-05-30 16:58:15 +02:00
```
2024-06-17 17:37:55 +02:00
2024-07-05 18:08:32 +02:00
See [this page ](./examples/main/README.md ) for a full list of parameters.
2024-06-17 17:37:55 +02:00
2024-07-05 18:08:32 +02:00
### Conversation mode
2024-06-17 17:37:55 +02:00
2024-07-05 18:08:32 +02:00
If you want a more ChatGPT-like experience, you can run in conversation mode by passing `-cnv` as a parameter:
2024-06-17 17:37:55 +02:00
2024-07-05 18:08:32 +02:00
```bash
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
2024-06-17 17:37:55 +02:00
2024-07-05 18:08:32 +02:00
# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
2024-06-17 17:37:55 +02:00
```
2023-05-20 16:58:31 +02:00
2024-07-05 18:08:32 +02:00
By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates ](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template )
2024-05-05 07:21:46 +02:00
2023-04-13 15:43:22 +02:00
```bash
2024-07-05 18:08:32 +02:00
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
2024-02-07 07:21:30 +01:00
```
2023-09-21 21:00:24 +02:00
2024-07-05 18:08:32 +02:00
You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
2023-09-21 21:00:24 +02:00
2024-02-07 07:21:30 +01:00
```bash
2024-07-05 18:08:32 +02:00
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
2023-03-10 20:47:46 +01:00
```
2024-07-05 18:08:32 +02:00
### Web server
2023-10-17 20:13:21 +02:00
2024-07-05 18:08:32 +02:00
[llama.cpp web server ](./examples/server/README.md ) is a lightweight [OpenAI API ](https://github.com/openai/openai-openapi ) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
2023-10-17 20:13:21 +02:00
2024-07-05 18:08:32 +02:00
Example usage:
2023-10-17 20:13:21 +02:00
2024-07-05 18:08:32 +02:00
```bash
./llama-server -m your_model.gguf --port 8080
2023-10-06 21:13:36 +02:00
2024-07-05 18:08:32 +02:00
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
2023-10-06 21:13:36 +02:00
```
2023-03-12 22:13:28 +01:00
### Interactive mode
2024-07-05 18:08:32 +02:00
> [!NOTE]
> If you prefer basic usage, please consider using conversation mode instead of interactive mode
2024-03-02 18:27:26 +01:00
In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"` . This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"` .
2023-03-12 22:13:28 +01:00
2023-04-19 21:52:14 +02:00
Here is an example of a few-shot interaction, invoked with the command
2023-03-21 17:10:32 +01:00
```bash
2023-04-19 21:52:14 +02:00
# default arguments using a 7B model
2023-03-25 19:36:52 +01:00
./examples/chat.sh
2023-04-19 21:52:14 +02:00
# advanced chat with a 13B model
2023-03-25 19:36:52 +01:00
./examples/chat-13B.sh
2023-03-12 22:39:01 +01:00
2023-04-19 21:52:14 +02:00
# custom arguments using a 13B model
2024-06-13 01:41:52 +02:00
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
2023-03-12 22:13:28 +01:00
```
2023-03-21 17:10:32 +01:00
2024-06-13 01:41:52 +02:00
Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README ](examples/main/README.md ) for the `llama-cli` example program.
2023-03-12 22:13:28 +01:00
2023-03-12 22:39:01 +01:00
![image ](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png )
2023-03-12 22:13:28 +01:00
2023-05-24 08:24:01 +02:00
### Persistent Interaction
2024-06-13 01:41:52 +02:00
The prompt, user inputs, and model generations can be saved and resumed across calls to `./llama-cli` by leveraging `--prompt-cache` and `--prompt-cache-all` . The `./examples/chat-persistent.sh` script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as `chat-13B.sh` . The same prompt cache can be reused for new chat sessions. Note that both prompt cache and chat directory are tied to the initial prompt (`PROMPT_TEMPLATE`) and the model file.
2023-05-24 08:24:01 +02:00
```bash
# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh
# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh
```
2023-08-23 03:01:57 +02:00
### Constrained output with grammars
`llama.cpp` supports grammars to constrain model output. For example, you can force the model to output JSON only:
```bash
2024-06-13 01:41:52 +02:00
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
2023-08-23 03:01:57 +02:00
```
The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide ](./grammars/README.md ).
2023-09-29 13:15:57 +02:00
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo ](http://github.com/intrinsiclabsai/gbnfgen ) and not this one.
2024-07-06 19:01:23 +02:00
## Build
2023-07-28 03:14:11 +02:00
2024-07-06 19:01:23 +02:00
Please refer to [Build llama.cpp locally ](./docs/build.md )
2023-07-28 03:14:11 +02:00
2024-07-06 19:01:23 +02:00
## Supported backends
2023-03-20 21:14:06 +01:00
2024-07-06 19:01:23 +02:00
| Backend | Target devices |
| --- | --- |
| [Metal ](./docs/build.md#metal-build ) | Apple Silicon |
| [BLAS ](./docs/build.md#blas-build ) | All |
| [BLIS ](./docs/backend/BLIS.md ) | All |
| [SYCL ](./docs/backend/SYCL.md ) | Intel and Nvidia GPU |
2024-07-28 01:41:25 +02:00
| [MUSA ](./docs/build.md#musa ) | Moore Threads GPU |
2024-07-06 19:01:23 +02:00
| [CUDA ](./docs/build.md#cuda ) | Nvidia GPU |
| [hipBLAS ](./docs/build.md#hipblas ) | AMD GPU |
| [Vulkan ](./docs/build.md#vulkan ) | GPU |
2024-08-19 10:46:38 +02:00
| [CANN ](./docs/build.md#cann ) | Ascend NPU |
2023-04-11 21:45:44 +02:00
2024-07-05 18:08:32 +02:00
## Tools
2023-09-14 18:47:00 +02:00
2024-07-05 18:08:32 +02:00
### Prepare and Quantize
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
> [!NOTE]
> You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
To obtain the official LLaMA 2 weights please see the < a href = "#obtaining-and-using-the-facebook-llama-2-model" > Obtaining and using the Facebook LLaMA 2 model</ a > section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
Note: `convert.py` has been moved to `examples/convert_legacy_llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
It does not support LLaMA 3, you can use `convert_hf_to_gguf.py` with LLaMA 3 downloaded from Hugging Face.
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
To learn more about quantizing model, [read this documentation ](./examples/quantize/README.md )
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
### Perplexity (measuring model quality)
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see [https://huggingface.co/docs/transformers/perplexity ](https://huggingface.co/docs/transformers/perplexity ).
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
To learn more how to measure perplexity using llama.cpp, [read this documentation ](./examples/perplexity/README.md )
2023-07-07 20:25:25 +02:00
2024-07-05 18:08:32 +02:00
## Contributing
2023-03-13 08:42:26 +01:00
2023-03-13 18:21:51 +01:00
- Contributors can open PRs
2023-03-16 07:55:13 +01:00
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
2023-03-13 08:42:26 +01:00
- Collaborators will be invited based on contributions
2023-03-16 07:55:13 +01:00
- Any help with managing issues and PRs is very appreciated!
2024-07-05 08:09:47 +02:00
- See [good first issues ](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 ) for tasks suitable for first contributions
- Read the [CONTRIBUTING.md ](CONTRIBUTING.md ) for more information
2023-03-17 19:30:04 +01:00
- Make sure to read this: [Inference at the edge ](https://github.com/ggerganov/llama.cpp/discussions/205 )
2023-03-23 09:46:58 +01:00
- A bit of backstory for those who are interested: [Changelog podcast ](https://changelog.com/podcast/532 )
2023-03-13 08:42:26 +01:00
2024-07-05 18:08:32 +02:00
## Other documentations
2023-04-05 17:56:20 +02:00
2024-06-13 01:41:52 +02:00
- [main (cli) ](./examples/main/README.md )
2023-07-09 09:38:42 +02:00
- [server ](./examples/server/README.md )
- [jeopardy ](./examples/jeopardy/README.md )
2024-07-05 18:08:32 +02:00
- [GBNF grammars ](./grammars/README.md )
**Development documentations**
- [How to build ](./docs/build.md )
- [Running on Docker ](./docs/docker.md )
- [Build on Android ](./docs/android.md )
2024-07-09 20:58:44 +02:00
- [Performance troubleshooting ](./docs/development/token_generation_performance_tips.md )
2023-07-09 09:38:42 +02:00
- [GGML tips & tricks ](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks )
2024-07-06 19:01:23 +02:00
**Seminal papers and background on the models**
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
- [Introducing LLaMA: A foundational, 65-billion-parameter large language model ](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ )
- [LLaMA: Open and Efficient Foundation Language Models ](https://arxiv.org/abs/2302.13971 )
- GPT-3
- [Language Models are Few-Shot Learners ](https://arxiv.org/abs/2005.14165 )
- GPT-3.5 / InstructGPT / ChatGPT:
- [Aligning language models to follow instructions ](https://openai.com/research/instruction-following )
- [Training language models to follow instructions with human feedback ](https://arxiv.org/abs/2203.02155 )