llama.cpp/README.md

# llama.cpp

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
[![Conan Center](https://shields.io/conan/v/llama-cpp)](https://conan.io/center/llama-cpp)

[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)

Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++

## Recent API changes

- [Changelog for `libllama` API](https://github.com/ggerganov/llama.cpp/issues/9289)
- [Changelog for `llama-server` REST API](https://github.com/ggerganov/llama.cpp/issues/9291)

## Hot topics

- **Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggerganov/llama.cpp/discussions/9669**
- Hugging Face GGUF editor: [discussion](https://github.com/ggerganov/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)

----

## Description

The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
variety of hardware - locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
improved significantly thanks to many contributions. It is the main playground for developing new features for the
[ggml](https://github.com/ggerganov/ggml) library.

**Supported models:**

Typically finetunes of the base models below are supported as well.

- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [BERT](https://github.com/ggerganov/llama.cpp/pull/5423)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [X] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
- [x] [Smaug](https://huggingface.co/models?search=Smaug)
- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)

(instructions for supporting more models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md))

**Multimodal models:**

- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)

**Bindings:**

- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)

**UI:**

Unless otherwise noted these projects are open-source with permissive licensing:

- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
- [iohub/collama](https://github.com/iohub/coLLaMA)
- [janhq/jan](https://github.com/janhq/jan) (AGPL)
- [nat/openplayground](https://github.com/nat/openplayground)
- [Faraday](https://faraday.dev/) (proprietary)
- [LMStudio](https://lmstudio.ai/) (proprietary)
- [Layla](https://play.google.com/store/apps/details?id=com.laylalite) (proprietary)
- [ramalama](https://github.com/containers/ramalama) (MIT)
- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile)
- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all)
- [ollama/ollama](https://github.com/ollama/ollama)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
- [psugihara/FreeChat](https://github.com/psugihara/FreeChat)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal)
- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
- [RAGNA Desktop](https://ragna.app/) (proprietary)
- [RecurseChat](https://recurse.chat/) (proprietary)
- [semperai/amica](https://github.com/semperai/amica)
- [withcatai/catai](https://github.com/withcatai/catai)
- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- [Msty](https://msty.app) (proprietary)
- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file)(Apachev2.0 or later)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [MindMac](https://mindmac.app) (proprietary)
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [AIKit](https://github.com/sozercan/aikit) (MIT)
- [LARS - The LLM & Advanced Referencing Solution](https://github.com/abgulati/LARS) (AGPL)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)

*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

**Tools:**

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with prebuild Mobile and Web platform wrappers and a model example)

**Infrastructure:**

- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs

**Games:**
- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.

## Demo

<details>
<summary>Typical run using LLaMA v2 13B on M2 Ultra</summary>

```
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed  = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 7024.01 MB (+  400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.41 MB

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings:        load time =   576.45 ms
llama_print_timings:      sample time =   283.10 ms /   400 runs   (    0.71 ms per token,  1412.91 tokens per second)
llama_print_timings: prompt eval time =   599.83 ms /    19 tokens (   31.57 ms per token,    31.68 tokens per second)
llama_print_timings:        eval time = 24513.59 ms /   399 runs   (   61.44 ms per token,    16.28 tokens per second)
llama_print_timings:       total time = 25431.49 ms
```

</details>

<details>
<summary>Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook</summary>

And here is another demo of running both LLaMA-7B and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) on a single M1 Pro MacBook:

https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4

</details>

## Usage

Here are the end-to-end binary build and model conversion steps for most supported models.

### Basic usage

Firstly, you need to get the binary. There are different methods that you can follow:
- Method 1: Clone this repository and build locally, see [how to build](./docs/build.md)
- Method 2: If you are using MacOS or Linux, you can install llama.cpp via [brew, flox or nix](./docs/install.md)
- Method 3: Use a Docker image, see [documentation for Docker](./docs/docker.md)
- Method 4: Download pre-built binary from [releases](https://github.com/ggerganov/llama.cpp/releases)

You can run a basic completion using this command:

```bash
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128

# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
```

See [this page](./examples/main/README.md) for a full list of parameters.

### Conversation mode

If you want a more ChatGPT-like experience, you can run in conversation mode by passing `-cnv` as a parameter:

```bash
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv

# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
```

By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)

```bash
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
```

You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:

```bash
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
```

### Web server

[llama.cpp web server](./examples/server/README.md) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.

Example usage:

```bash
./llama-server -m your_model.gguf --port 8080

# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
```

### Interactive mode

> [!NOTE]
> If you prefer basic usage, please consider using conversation mode instead of interactive mode

In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.

Here is an example of a few-shot interaction, invoked with the command

```bash
# default arguments using a 7B model
./examples/chat.sh

# advanced chat with a 13B model
./examples/chat-13B.sh

# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
```

Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `llama-cli` example program.

![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)

### Persistent Interaction

The prompt, user inputs, and model generations can be saved and resumed across calls to `./llama-cli` by leveraging `--prompt-cache` and `--prompt-cache-all`. The `./examples/chat-persistent.sh` script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as `chat-13B.sh`. The same prompt cache can be reused for new chat sessions. Note that both prompt cache and chat directory are tied to the initial prompt (`PROMPT_TEMPLATE`) and the model file.

```bash
# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh

# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
    CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh
```

### Constrained output with grammars

`llama.cpp` supports grammars to constrain model output. For example, you can force the model to output JSON only:

```bash
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
```

The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).

For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.

## Build

Please refer to [Build llama.cpp locally](./docs/build.md)

## Supported backends

| Backend | Target devices |
| --- | --- |
| [Metal](./docs/build.md#metal-build) | Apple Silicon |
| [BLAS](./docs/build.md#blas-build) | All |
| [BLIS](./docs/backend/BLIS.md) | All |
| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
| [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
| [Vulkan](./docs/build.md#vulkan) | GPU |
| [CANN](./docs/build.md#cann) | Ascend NPU |

## Tools

### Prepare and Quantize

> [!NOTE]
> You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.

To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.

Note: `convert.py` has been moved to `examples/convert_legacy_llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
It does not support LLaMA 3, you can use `convert_hf_to_gguf.py` with LLaMA 3 downloaded from Hugging Face.

To learn more about quantizing model, [read this documentation](./examples/quantize/README.md)

### Perplexity (measuring model quality)

You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).

To learn more how to measure perplexity using llama.cpp, [read this documentation](./examples/perplexity/README.md)

## Contributing

- Contributors can open PRs
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Collaborators will be invited based on contributions
- Any help with managing issues, PRs and projects is very appreciated!
- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)

## Other documentations

- [main (cli)](./examples/main/README.md)
- [server](./examples/server/README.md)
- [jeopardy](./examples/jeopardy/README.md)
- [GBNF grammars](./grammars/README.md)

**Development documentations**

- [How to build](./docs/build.md)
- [Running on Docker](./docs/docker.md)
- [Build on Android](./docs/android.md)
- [Performance troubleshooting](./docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)

**Seminal papers and background on the models**

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								# llama.cpp
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
+								![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)
-												Add logo to README.md
											
										
										
											2023-03-26 09:20:49 +02:00
-												Fix conan badge display [no ci] (#7645)


											
										
										
											2024-05-30 17:07:39 +02:00
+								[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
-												readme : fix server badge
											
										
										
											2024-07-19 13:34:55 +02:00
+								[![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
-												Fix conan badge display [no ci] (#7645)


											
										
										
											2024-05-30 17:07:39 +02:00
+								[![Conan Center](https://shields.io/conan/v/llama-cpp)](https://conan.io/center/llama-cpp)
-												Update README.md
											
										
										
											2023-03-12 21:09:26 +01:00
-												readme : add project status link
											
										
										
											2023-10-04 15:50:44 +02:00
+								[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)
-												readme : add new roadmap + manifesto
											
										
										
											2023-06-25 15:08:12 +02:00
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Recent API changes
-												readme : add API changes section
											
										
										
											2024-03-03 11:44:03 +01:00
-												readme : refactor API section + remove old hot topics
											
										
										
											2024-09-03 09:00:36 +02:00
+								- [Changelog for `libllama` API](https://github.com/ggerganov/llama.cpp/issues/9289)
 								- [Changelog for `llama-server` REST API](https://github.com/ggerganov/llama.cpp/issues/9291)
-												readme : add API changes section
											
										
										
											2024-03-03 11:44:03 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Hot topics
-												readme : update hot topics
											
										
										
											2023-08-27 13:44:35 +02:00
-												readme : update hot topics
											
										
										
											2024-09-27 19:57:51 +02:00
+								- **Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggerganov/llama.cpp/discussions/9669**
 								- Hugging Face GGUF editor: [discussion](https://github.com/ggerganov/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)
-												readme : incoming BREAKING CHANGE
											
										
										
											2023-08-18 16:48:31 +02:00
 								----
-												Add Misc section + update hot topics + minor fixes
											
										
										
											2023-03-14 08:43:52 +01:00
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								## Description
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
 								variety of hardware - locally and in the cloud.
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- Plain C/C++ implementation without any dependencies
 								- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
-												readme: add missing info (#1324)


											
										
										
											2023-05-05 16:43:36 +02:00
+								- AVX, AVX2 and AVX512 support for x86 architectures
-												readme : update (#5572)

Added 1.5-bit on README.md
											
										
										
											2024-02-19 08:39:31 +01:00
+								- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
-												musa : update doc (#9856)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
											
										
										
											2024-10-12 07:09:53 +02:00
+								- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
-												ggml : remove OpenCL (#7735)

ggml-ci
											
										
										
											2024-06-04 20:23:20 +02:00
+								- Vulkan and SYCL backend support
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
 								improved significantly thanks to many contributions. It is the main playground for developing new features for the
 								[ggml](https://github.com/ggerganov/ggml) library.
-												Update README.md
											
										
										
											2023-03-11 11:31:21 +01:00
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
+								**Supported models:**
-												readme : add GPT4All instructions (close #588)
											
										
										
											2023-03-29 18:37:20 +02:00
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								Typically finetunes of the base models below are supported as well.
-												readme : update supported models
											
										
										
											2023-03-30 21:31:54 +02:00
+								- [X] LLaMA 🦙
-												Obtaining LLaMA 2 instructions (#2308)

* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models
											
										
										
											2023-07-28 03:14:11 +02:00
+								- [x] LLaMA 2 🦙🦙
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] LLaMA 3 🦙🦙🦙
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
-												model: support arch `DbrxForCausalLM` (#6515)

* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
											
										
										
											2024-04-13 11:33:52 +02:00
+								- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
-												doc : add link to falcon (#6789)


											
										
										
											2024-04-21 14:35:40 +02:00
+								- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
-												readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475)

* add support for chinese llama-2 / alpaca-2

* remove white spaces
											
										
										
											2023-08-02 08:18:31 +02:00
+								- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
-												readme : update supported models
											
										
										
											2023-03-30 21:31:54 +02:00
+								- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
-												Document BERT support. (#8205)

* Update README.md

document BERT support

* Update README.md
											
										
										
											2024-07-01 13:40:58 +02:00
+								- [X] [BERT](https://github.com/ggerganov/llama.cpp/pull/5423)
-												Add BAIR's Koala to supported models (#877)


											
										
										
											2023-04-10 22:41:53 +02:00
+								- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
+								- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
 								- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
-												readme : update hot topics + model links (#3399)


											
										
										
											2023-09-29 14:50:35 +02:00
+								- [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
-												readme : update models, cuda + ppl instructions (#3510)


											
										
										
											2023-10-06 21:13:36 +02:00
+								- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
-												Add MPT model to supported models in README.md (#3574)


											
										
										
											2023-10-11 01:02:49 +02:00
+								- [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
+								- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
+								- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [X] [StableLM models](https://huggingface.co/stabilityai)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
+								- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
 								- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
-												llama : add PLaMo model (#3557)

* add plamo mock

* add tensor loading

* plamo convert

* update norm

* able to compile

* fix norm_rms_eps hparam

* runnable

* use inp_pos

* seems ok

* update kqv code

* remove develop code

* update README

* shuffle attn_q.weight and attn_output.weight for broadcasting

* remove plamo_llm_build_kqv and use llm_build_kqv

* fix style

* update

* llama : remove obsolete KQ_scale

* plamo : fix tensor names for correct GPU offload

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-12-24 14:35:49 +01:00
+								- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
-												gpt2 : Add gpt2 architecture integration (#4555)


											
										
										
											2023-12-28 15:03:57 +01:00
+								- [x] [GPT-2](https://huggingface.co/gpt2)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
 								- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
-												readme : add CodeShell models to the supported models list (#5330)


											
										
										
											2024-02-05 08:41:38 +01:00
+								- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
-												llama : add `gemma` model (#5631)

There are couple things in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
											
										
										
											2024-02-21 14:08:22 +01:00
+								- [x] [Gemma](https://ai.google.dev/gemma)
-												llama : support Mamba Selective State Space Models (#5328)

* mamba : begin working on support for Mamba SSM

* mamba : begin figuring out how to (ab)use the kv cache for Mamba

* mamba : recurrent inference almost works, but incoherent

* mamba : recurrent inference WORKS!!!

* convert : optionally use d_conv and d_state from config.json for Mamba

* mamba : refactor recurrent conv, resulting in 20% perf increase

It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.

I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.

* ggml : parallelize ggml_exp

This results in 8% faster token generation for Mamba-130M.

* mamba : simplify the conv step with a self-overlapping view

Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.

Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.

Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).

* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32

Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.

* mamba : fix self-overlapping view depth stride

* mamba : handle batches of more than 1 token

This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.

Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.

* ggml: add ggml_ssm_scan to help with parallel selective scan

If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.

* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation

This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.

* mamba : very basic quantization support

Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)

Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.

Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.

* convert : fix wrong name for layer norm weight of offical Mamba models

I was using Q-bert/Mamba-* models before, which have a slighlty different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")

* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator

This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.

However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.

* convert : for Mamba, also consider the "MambaLMHeadModel" arch name

It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json

* mamba : fix vocab size problems with official models

The perplexity was waaaay to high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.

Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.

* ggml : remove ggml_exp and ggml_soft_plus

They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.

* mamba : remove some useless comments

No code change.

* convert : fix flake8 linter errors

* mamba : apply suggestions from code review

* mamba : remove unecessary branch for row-wise ssm_state and C multiplication

It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.

* ggml : in ggml_ssm_scan, use more appropriate asserts

* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32

* mamba : multiple sequences, but one at a time

This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).

The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)

Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.

* mamba : support llama_kv_cache_seq_cp

This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.

Each KV cell is dedicated to the sequence ID corresponding to its own index.

* mamba : use a state mask

It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.

inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).

* llama : replace the usage of n_ctx with kv_self.size in many places

* mamba : use n_tokens directly instead of n_tok

* mamba : in comments, properly refer to KV cells instead of slots

* mamba : reduce memory usage of ggml_ssm_scan

From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.

The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.

* mamba : simultaneous sequence processing

A batch can now contain tokens from multiple sequences.

This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.

However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.

* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba

This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).

Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.

Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.

* llama : add inp_s_seq as a new input tensor

The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.

The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.

Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).

* mamba : support llama_kv_cache_seq_cp copy chains

* mamba : support shifting and dividing the kv cache pos

* mamba : make the server and parallel examples work with whole sequences

A seq_id is dedicated to the system prompt in both cases.

* llama : make llama_kv_cache_seq_rm return whether it succeeded or not

* mamba : dedicate an input tensor for state copy indices

This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.

* mamba : adapt perplexity, batched, and batched-bench examples

* perplexity : limit the max number of sequences

This adapts to what the loaded model can provide.

* llama : add llama_n_max_seq to get the upper limit for seq_ids

Used by the perplexity example.

* batched : pass n_parallel to the model's context params

This should have been there already, but it wasn't.

* batched-bench : reserve sequences to support Mamba

* batched-bench : fix tokens being put in wrong sequences

Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.

* mamba : stop abusing attention metadata

This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.

This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
 will not require breaking existing converted Mamba models again)

* gguf-py : add new KV metadata key-value pairs for Mamba

* llama : add new metadata key-value pairs for Mamba

* llama : guard against divisions by zero when n_head is 0

* mamba : rename "unlimited" KV cache property to "recurrent"

* mamba : more correctly update the "used" field of the KV cache

* ggml : in ggml_ssm_scan, use a threshold for soft_plus

This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.

* convert : for Mamba, fallback to internal NeoX tokenizer

The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.

* mamba : support state saving and restoring

* ggml : implicitly pass src tensors through dst for Mamba-related ops

* mamba : clarify some comments

* server : fix cache_tokens not getting correctly resized

Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.

For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.

* convert-hf : support new metadata keys for Mamba

For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406

* mamba : rename metadata to be more similar to transformers library

This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".

* mamba : support mamba-*-hf models

These models share their token_embd.weight with their output.weight

* mamba : add missing spaces

This is purely a formatting change.

* convert-hf : omit output.weight when identical with token_embd.weight

Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.

* readme : add Mamba to supported models, and add recent API changes

* mamba : move state_seq and state_mask views outside layer loop

A few tensors were also missing `struct` in front of `ggml_tensor`.
											
										
										
											2024-03-08 23:31:00 +01:00
+								- [x] [Mamba](https://github.com/state-spaces/mamba)
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
-												[Model] Add support for xverse (#6301)

* Support xverse model convert to gguf format.

* 1. Convert xverse models to gguf;
2. Add LLM_ARCH_XVERSE inference in llama.cpp;
3. Add xverse item in Supported models in README.md;

* * gguf-py: remove redundant logs
* llama: remove the init_mapping_prefetch custom parameter

* llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers.

* - Fix format issues
- Remove duplicate set kqv_out to llm_build_kv

* Update llama.cpp

---------

Co-authored-by: willhe <willhe@xverse.cn>
Co-authored-by: willhe <hexin@xverse.cn>
											
										
										
											2024-03-29 14:37:03 +01:00
+								- [x] [Xverse](https://huggingface.co/models?search=xverse)
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
-												llama : add SEA-LION support (#6448)

* initial commit for sealion support

* add sealion support

* minor fix

* q/k ln and pos_embd only if required

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* minor : clear whitespaces

---------

Co-authored-by: bryan <bryansiow@aisingapore.org>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-04-03 20:05:10 +02:00
+								- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
-												Add GritLM as supported models. (#6513)


											
										
										
											2024-04-07 19:33:59 +02:00
+								- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
-												Implement the OLMo architecture (#6741)

* implement olmo architecture

* remove unused variable

* remove unused moe branch

* remove check for weight

* remove superfluous moe, bias and rope tensors

* clarified comment

* fix clamp_kqv setting

* remove obsolete parameter name filter
											
										
										
											2024-04-19 11:35:54 +02:00
+								- [x] [OLMo](https://allenai.org/olmo)
-												llama : support OLMoE (#9462)


											
										
										
											2024-09-16 08:47:37 +02:00
+								- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
-												readme : update model list (#8851)


											
										
										
											2024-08-05 07:54:10 +02:00
+								- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
-												readme : add GPT-NeoX + Pythia to the list of supported models (#7491)


											
										
										
											2024-05-23 14:12:43 +02:00
+								- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
-												readme : update model list (#8851)


											
										
										
											2024-08-05 07:54:10 +02:00
+								- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
 								- [x] [Smaug](https://huggingface.co/models?search=Smaug)
 								- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
 								- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
 								- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
 								- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
-												readme : add supported glm models (#8360)


											
										
										
											2024-07-08 07:57:19 +02:00
+								- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
-												readme : update model list (#8851)


											
										
										
											2024-08-05 07:54:10 +02:00
+								- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
-												llama : add EXAONE model support (#9025)

* add exaone model support

* add chat template

* fix whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add ftype

* add exaone pre-tokenizer in `llama-vocab.cpp`

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* fix lint

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* add `EXAONE` to supported models in `README.md`

* fix space

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
											
										
										
											2024-08-16 08:35:18 +02:00
+								- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
-												llama : support for `falcon-mamba` architecture (#9074)

* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <git@compilade.net>
											
										
										
											2024-08-21 10:06:36 +02:00
+								- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
-												Add Jais to list of supported models (#9439)

Co-authored-by: fmz <quic_fzaghlou@quic.com>
											
										
										
											2024-09-12 02:29:53 +02:00
+								- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
-												Update README.md (#9591)

Add Bielik model.
											
										
										
											2024-10-01 19:18:46 +02:00
+								- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
-												readme : fix web link error [no ci] (#8347)


											
										
										
											2024-07-08 16:19:24 +02:00
+								(instructions for supporting more models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md))
-												docs : how to add a model (#6565)

* docs: how to add a model

* docs: model: typo and docs

* docs: model: add prevision on RoPE

* docs: model: rephrasing README.md

* docs: model: rephrasing README.md

* docs: model: README.md fix trailing spaces

* docs : some fixes

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-04-10 08:58:48 +02:00
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
+								**Multimodal models:**
-												readme : add link to LLaVA 1.6 models (#5758)

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
											
										
										
											2024-02-28 09:39:39 +01:00
+								- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
+								- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
 								- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
-												readme : add MobileVLM 1.7B/3B to the supported models list (#5107)

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
											
										
										
											2024-01-25 21:14:32 +01:00
+								- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
-												llava : fix moondream support (#7163)

* Revert "Revert "llava : add support for moondream vision language model (#6899)""

This reverts commit 9da243b36ac0b9d609adfaaa4c8f1cc8c592f737.

* Fix num_positions and embeddings initialization
											
										
										
											2024-05-10 08:41:10 +02:00
+								- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
-												readme : remove trailing space (#7469)

											
										
										
											2024-05-23 16:43:18 +02:00
+								- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
+								**Bindings:**
 								- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
 								- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
-												readme : remove unsupported node.js library (#3703)

- https://github.com/Atome-FE/llama-node is quite out of date
- doesn't support recent/current llama.cpp functionality
											
										
										
											2023-10-22 20:16:43 +02:00
+								- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
-												readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)


											
										
										
											2024-01-07 21:24:11 +01:00
+								- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
-												readme : add programmable prompt engine language CLI (#9599)


											
										
										
											2024-09-23 17:58:17 +02:00
+								- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
-												readme : add JavaScript/Wasm repo (#5415)


											
										
										
											2024-02-09 11:17:00 +01:00
+								- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
-												readme : add wllama as a wasm binding (#6100)


											
										
										
											2024-03-16 16:42:08 +01:00
+								- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
-												readme : add Ruby bindings (#1029)


											
										
										
											2023-04-17 21:34:35 +02:00
+								- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
-												readme : add feature-rich rust bindings (#6465)


											
										
										
											2024-04-03 19:53:37 +02:00
+								- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
-												readme : add link to rust bindings (#5148)

* added link to another set of rust bindings with brief note on differences.

* fixed link name
											
										
										
											2024-01-28 09:30:44 +01:00
+								- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
 								- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
-												readme : add C#/.NET bindings repo (#1409)


											
										
										
											2023-05-12 07:39:40 +02:00
+								- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
-												readme : add Scala 3 bindings repo (#2010)


											
										
										
											2023-06-26 21:47:59 +02:00
+								- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
-												Add link to clojure bindings to Readme. (#2659)


											
										
										
											2023-08-18 21:39:22 +02:00
+								- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
-												readme : add react-native binding (#2869)


											
										
										
											2023-08-29 11:30:10 +02:00
+								- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
-												docs : add java-llama.cpp to README.md (#2935)


											
										
										
											2023-09-01 15:36:14 +02:00
+								- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
-												readme : add zig bindings (#4581)


											
										
										
											2023-12-22 07:49:54 +01:00
+								- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
-												Add a dart/flutter binding to README.md (#4882)


											
										
										
											2024-01-20 09:05:43 +01:00
+								- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
-												readme : add php api bindings (#6326)

* add php bindings to readme

* readme : add link to PR

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-03-27 08:08:59 +01:00
+								- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
-												readme : update bindings list (#8222)

* adding guile_llama_cpp  to binding list

* fix formatting

* fix formatting
											
										
										
											2024-07-07 15:21:37 +02:00
+								- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
 								**UI:**
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								Unless otherwise noted these projects are open-source with permissive licensing:
-												readme : update UI list [no ci] (#8505)


											
										
										
											2024-07-24 14:52:30 +02:00
+								- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [iohub/collama](https://github.com/iohub/coLLaMA)
 								- [janhq/jan](https://github.com/janhq/jan) (AGPL)
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
+								- [nat/openplayground](https://github.com/nat/openplayground)
-												readme : update ui list (#5354)


											
										
										
											2024-02-07 07:16:48 +01:00
+								- [Faraday](https://faraday.dev/) (proprietary)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [LMStudio](https://lmstudio.ai/) (proprietary)
-												readme : add app (#6371)

* added Layla to supported UIs

* Update README.md
											
										
										
											2024-05-09 15:32:40 +02:00
+								- [Layla](https://play.google.com/store/apps/details?id=com.laylalite) (proprietary)
-												readme : add ramalama to the availables UI (#8811)

ramalama is a repo agnostic boring CLI tool that supports pulling from
ollama, huggingface and oci registries.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
											
										
										
											2024-08-05 14:45:01 +02:00
+								- [ramalama](https://github.com/containers/ramalama) (MIT)
-												readme : add LocalAI to the availables UI (#5629)


											
										
										
											2024-02-21 15:39:10 +01:00
+								- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
 								- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile)
 								- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all)
 								- [ollama/ollama](https://github.com/ollama/ollama)
 								- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
-												readme : add FreeChat (#4248)


											
										
										
											2023-11-29 08:16:34 +01:00
+								- [psugihara/FreeChat](https://github.com/psugihara/FreeChat)
-												Add Ava in the list of llama.cpp UIs (#4362)


											
										
										
											2024-02-07 19:44:52 +01:00
+								- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
-												Adding Emeltal reference to UI list (#4629)


											
										
										
											2023-12-25 17:09:53 +01:00
+								- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
-												readme : update UI list [no ci] (#7958)


											
										
										
											2024-06-16 13:51:18 +02:00
+								- [RAGNA Desktop](https://ragna.app/) (proprietary)
-												readme : add RecurseChat to the list of UIs (#6219)


											
										
										
											2024-03-22 12:29:49 +01:00
+								- [RecurseChat](https://recurse.chat/) (proprietary)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [semperai/amica](https://github.com/semperai/amica)
 								- [withcatai/catai](https://github.com/withcatai/catai)
-												readme : update UI list (#5605)

* Add maid to ui list

* Specify licence
											
										
										
											2024-02-20 11:00:23 +01:00
+								- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
-												readme : add Msty to UI list (#5618)


											
										
										
											2024-02-25 16:57:34 +01:00
+								- [Msty](https://msty.app) (proprietary)
-												readme : update ui list (#5731)

* Add LLMFarm (ui for iOS) to list
											
										
										
											2024-02-26 15:15:28 +01:00
+								- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
-												readme : add project (#6356)

* readme: add Android UI binding

* Update README.md
											
										
										
											2024-03-29 08:33:46 +01:00
+								- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file)(Apachev2.0 or later)
-												readme : add Dot to UI list (#6487)


											
										
										
											2024-04-04 19:22:50 +02:00
+								- [Dot](https://github.com/alexpinel/Dot) (GPL)
-												readme : update UI list (#6503)

* Add MindMac to UI list

* Update proprietary description

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
											
										
										
											2024-04-05 20:39:43 +02:00
+								- [MindMac](https://mindmac.app) (proprietary)
-												Adding KodiBot to UI list (#6535)

KodiBot is free and open source ai chat app released under the GNU General Public License.
											
										
										
											2024-04-08 09:48:29 +02:00
+								- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
-												readme : update UI list (#6560)


											
										
										
											2024-04-10 08:34:00 +02:00
+								- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
-												readme : add UI (#6724)

* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-04-17 14:47:50 +02:00
+								- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
-												[no ci] docs: add aikit to readme (#7650)

Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
											
										
										
											2024-05-31 01:57:16 +02:00
+								- [AIKit](https://github.com/sozercan/aikit) (MIT)
-												readme : update UI list (#7943)


											
										
										
											2024-06-18 08:57:41 +02:00
+								- [LARS - The LLM & Advanced Referencing Solution](https://github.com/abgulati/LARS) (AGPL)
-												readme : add LLMUnity to UI projects (#9381)

* add LLMUnity to UI projects

* add newline to examples/rpc/README.md to fix editorconfig-checker unit test
											
										
										
											2024-09-09 13:21:38 +02:00
+								- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
-												Add Llama Assistant (#9744)


											
										
										
											2024-10-04 20:29:35 +02:00
+								- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
-												readme : add UI (#6724)

* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-04-17 14:47:50 +02:00
-												readme : add notice for UI list
											
										
										
											2024-03-28 21:56:03 +01:00
+								*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
-												Readme: add akx/ggify to tools (#1484)


											
										
										
											2024-05-26 14:09:42 +02:00
+								**Tools:**
 								- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
-												readme : add tool (#9655)


											
										
										
											2024-09-28 14:07:14 +02:00
+								- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
-												gemma2: add sliding window mask (#8227)

* gemma2: add sliding window mask

* fix data_swa uninitialized

* better naming

* add co-author

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>

* replace list with single tensor

* update

* llama : minor styling

* convert : add sanity check for query_pre_attn_scalar

* fix small typo in README

---------

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-07-01 18:48:34 +02:00
+								- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
-												docs: introduce gpustack and gguf-parser (#8873)

* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.

Signed-off-by: thxCode <thxcode0824@gmail.com>

* readme: introduce gguf-parser

GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.

Signed-off-by: thxCode <thxcode0824@gmail.com>

---------

Signed-off-by: thxCode <thxcode0824@gmail.com>
											
										
										
											2024-08-12 14:45:50 +02:00
+								- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
-												readme : update tools list (#9475)

* Added link to proprietary wrapper for Unity3d into README.md

Wrapper has prebuild library and was tested on iOS, Android, WebGL, PC, Mac platforms, has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/).

* Update README.md

Fixes upon review
											
										
										
											2024-09-15 09:36:53 +02:00
+								- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with prebuild Mobile and Web platform wrappers and a model example)
-												Readme: add akx/ggify to tools (#1484)


											
										
										
											2024-05-26 14:09:42 +02:00
-												readme: add Paddler to the list of projects (#8239)


											
										
										
											2024-07-01 19:13:22 +02:00
+								**Infrastructure:**
 								- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
-												docs: introduce gpustack and gguf-parser (#8873)

* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.

Signed-off-by: thxCode <thxcode0824@gmail.com>

* readme: introduce gguf-parser

GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.

Signed-off-by: thxCode <thxcode0824@gmail.com>

---------

Signed-off-by: thxCode <thxcode0824@gmail.com>
											
										
										
											2024-08-12 14:45:50 +02:00
+								- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
-												readme: add Paddler to the list of projects (#8239)


											
										
										
											2024-07-01 19:13:22 +02:00
-												readme : update games list (#8673)

Added link to game I made that depends on llama
											
										
										
											2024-07-24 18:48:00 +02:00
+								**Games:**
 								- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Demo
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								<details>
 								<summary>Typical run using LLaMA v2 13B on M2 Ultra</summary>
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								```
-												`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
											
										
										
											2024-06-13 01:41:52 +02:00
+								$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
-												minor : fix trailing whitespace

											
										
										
											2023-08-23 22:43:00 +02:00
+								I llama.cpp build info:
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								I UNAME_S:  Darwin
 								I UNAME_P:  arm
 								I UNAME_M:  arm64
-												readme : update hot topics
											
										
										
											2023-08-23 22:41:16 +02:00
+								I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
 								I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								I LDFLAGS:   -framework Accelerate
-												readme : update hot topics
											
										
										
											2023-08-23 22:41:16 +02:00
+								I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
 								I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												Update README.md
											
										
										
											2023-03-10 23:09:19 +01:00
+								make: Nothing to be done for `default'.
-												readme : update hot topics
											
										
										
											2023-08-23 22:41:16 +02:00
+								main: build = 1041 (cf658ad)
 								main: seed  = 1692823051
 								llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
 								llama_model_loader: - type  f32:   81 tensors
 								llama_model_loader: - type q4_0:  281 tensors
 								llama_model_loader: - type q6_K:    1 tensors
 								llm_load_print_meta: format         = GGUF V1 (latest)
 								llm_load_print_meta: arch           = llama
 								llm_load_print_meta: vocab type     = SPM
 								llm_load_print_meta: n_vocab        = 32000
 								llm_load_print_meta: n_merges       = 0
 								llm_load_print_meta: n_ctx_train    = 4096
 								llm_load_print_meta: n_ctx          = 512
 								llm_load_print_meta: n_embd         = 5120
 								llm_load_print_meta: n_head         = 40
 								llm_load_print_meta: n_head_kv      = 40
 								llm_load_print_meta: n_layer        = 40
 								llm_load_print_meta: n_rot          = 128
 								llm_load_print_meta: n_gqa          = 1
 								llm_load_print_meta: f_norm_eps     = 1.0e-05
 								llm_load_print_meta: f_norm_rms_eps = 1.0e-05
 								llm_load_print_meta: n_ff           = 13824
 								llm_load_print_meta: freq_base      = 10000.0
 								llm_load_print_meta: freq_scale     = 1
 								llm_load_print_meta: model type     = 13B
 								llm_load_print_meta: model ftype    = mostly Q4_0
 								llm_load_print_meta: model size     = 13.02 B
 								llm_load_print_meta: general.name   = LLaMA v2
 								llm_load_print_meta: BOS token = 1 '<s>'
 								llm_load_print_meta: EOS token = 2 '</s>'
 								llm_load_print_meta: UNK token = 0 '<unk>'
 								llm_load_print_meta: LF token  = 13 '<0x0A>'
 								llm_load_tensors: ggml ctx size =    0.11 MB
 								llm_load_tensors: mem required  = 7024.01 MB (+  400.00 MB per state)
 								...................................................................................................
 								llama_new_context_with_model: kv self size  =  400.00 MB
 								llama_new_context_with_model: compute buffer total size =   75.41 MB
-												minor : fix trailing whitespace

											
										
										
											2023-08-23 22:43:00 +02:00
+								system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
-												readme : update hot topics
											
										
										
											2023-08-23 22:41:16 +02:00
+								sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
 								generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
 								 Building a website can be done in 10 simple steps:
 								Step 1: Find the right website platform.
 								Step 2: Choose your domain name and hosting plan.
 								Step 3: Design your website layout.
 								Step 4: Write your website content and add images.
 								Step 5: Install security features to protect your site from hackers or spammers
 								Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
 								Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
 								Step 8: Start marketing and promoting the website via social media channels or paid ads
 								Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
 								Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
 								How does a Website Work?
 								A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
 								The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
 								How to
 								llama_print_timings:        load time =   576.45 ms
 								llama_print_timings:      sample time =   283.10 ms /   400 runs   (    0.71 ms per token,  1412.91 tokens per second)
 								llama_print_timings: prompt eval time =   599.83 ms /    19 tokens (   31.57 ms per token,    31.68 tokens per second)
 								llama_print_timings:        eval time = 24513.59 ms /   399 runs   (   61.44 ms per token,    16.28 tokens per second)
 								llama_print_timings:       total time = 25431.49 ms
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								```
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								</details>
 								<details>
 								<summary>Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook</summary>
-												Update README.md
											
										
										
											2023-03-10 23:51:46 +01:00
+								And here is another demo of running both LLaMA-7B and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) on a single M1 Pro MacBook:
 								https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								</details>
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								## Usage
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								Here are the end-to-end binary build and model conversion steps for most supported models.
-												zig : update build.zig (#872)

* update

* update readme

* minimize the changes.

---------

Co-authored-by: zjli2019 <zhengji.li@ingchips.com>
											
										
										
											2023-04-13 15:43:22 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								### Basic usage
-												Updating build instructions to include BLAS support (#1183)

* Updated build information

First update to the build instructions to include BLAS.

* Update README.md

* Update information about BLAS

* Better BLAS explanation

Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.

* Better BLAS explanation

* BLAS for Mac

Specifying that BLAS is already supported on Macs using the Accelerate Framework.

* Clarify the effect of BLAS

* Windows Make instructions

Added the instructions to build with Make on Windows

* Fixing typo

* Fix trailing whitespace
											
										
										
											2023-04-26 22:03:03 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								Firstly, you need to get the binary. There are different methods that you can follow:
 								- Method 1: Clone this repository and build locally, see [how to build](./docs/build.md)
 								- Method 2: If you are using MacOS or Linux, you can install llama.cpp via [brew, flox or nix](./docs/install.md)
 								- Method 3: Use a Docker image, see [documentation for Docker](./docs/docker.md)
 								- Method 4: Download pre-built binary from [releases](https://github.com/ggerganov/llama.cpp/releases)
-												Updating build instructions to include BLAS support (#1183)

* Updated build information

First update to the build instructions to include BLAS.

* Update README.md

* Update information about BLAS

* Better BLAS explanation

Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.

* Better BLAS explanation

* BLAS for Mac

Specifying that BLAS is already supported on Macs using the Accelerate Framework.

* Clarify the effect of BLAS

* Windows Make instructions

Added the instructions to build with Make on Windows

* Fixing typo

* Fix trailing whitespace
											
										
										
											2023-04-26 22:03:03 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								You can run a basic completion using this command:
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								```bash
 								llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
-												Add brew installation instruction to README [no ci] (#7616)


											
										
										
											2024-05-30 16:58:15 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								# Output:
 								# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
-												Add brew installation instruction to README [no ci] (#7616)


											
										
										
											2024-05-30 16:58:15 +02:00
+								```
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								See [this page](./examples/main/README.md) for a full list of parameters.
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								### Conversation mode
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								If you want a more ChatGPT-like experience, you can run in conversation mode by passing `-cnv` as a parameter:
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								```bash
 								llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								# Output:
 								# > hi, who are you?
 								# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
 								#
 								# > what is 1+1?
 								# Easy peasy! The answer to 1+1 is... 2!
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
+								```
-												feature : support blis and other blas implementation  (#1536)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix: blas changes on ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-05-20 16:58:31 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
-												readme : add note that LLaMA 3 is not supported with convert.py (#7065)


											
										
										
											2024-05-05 07:21:46 +02:00
-												zig : update build.zig (#872)

* update

* update readme

* minimize the changes.

---------

Co-authored-by: zjli2019 <zhengji.li@ingchips.com>
											
										
										
											2023-04-13 15:43:22 +02:00
+								```bash
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								```
-												Update README.md (#3289)

* Update README.md

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
											
										
										
											2023-09-21 21:00:24 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
-												Update README.md (#3289)

* Update README.md

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
											
										
										
											2023-09-21 21:00:24 +02:00
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								```bash
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								```
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								### Web server
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								[llama.cpp web server](./examples/server/README.md) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								Example usage:
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								```bash
 								./llama-server -m your_model.gguf --port 8080
-												readme : update models, cuda + ppl instructions (#3510)


											
										
										
											2023-10-06 21:13:36 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								# Basic web UI can be accessed via browser: http://localhost:8080
 								# Chat completion endpoint: http://localhost:8080/v1/chat/completions
-												readme : update models, cuda + ppl instructions (#3510)


											
										
										
											2023-10-06 21:13:36 +02:00
+								```
-												Add interactive mode (#61)

* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
											
										
										
											2023-03-12 22:13:28 +01:00
+								### Interactive mode
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								> [!NOTE]
 								> If you prefer basic usage, please consider using conversation mode instead of interactive mode
-												convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)


											
										
										
											2024-03-02 18:27:26 +01:00
+								In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.
-												Add interactive mode (#61)

* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
											
										
										
											2023-03-12 22:13:28 +01:00
-												Minor: Readme fixed grammar, spelling, and misc updates (#1071)


											
										
										
											2023-04-19 21:52:14 +02:00
+								Here is an example of a few-shot interaction, invoked with the command
-												Minor style changes
											
										
										
											2023-03-21 17:10:32 +01:00
 								```bash
-												Minor: Readme fixed grammar, spelling, and misc updates (#1071)


											
										
										
											2023-04-19 21:52:14 +02:00
+								# default arguments using a 7B model
-												Move chat scripts into "./examples"

											
										
										
											2023-03-25 19:36:52 +01:00
+								./examples/chat.sh
-												Minor: Readme fixed grammar, spelling, and misc updates (#1071)


											
										
										
											2023-04-19 21:52:14 +02:00
+								# advanced chat with a 13B model
-												Move chat scripts into "./examples"

											
										
										
											2023-03-25 19:36:52 +01:00
+								./examples/chat-13B.sh
-												Update README.md
											
										
										
											2023-03-12 22:39:01 +01:00
-												Minor: Readme fixed grammar, spelling, and misc updates (#1071)


											
										
										
											2023-04-19 21:52:14 +02:00
+								# custom arguments using a 13B model
-												`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
											
										
										
											2024-06-13 01:41:52 +02:00
+								./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
-												Add interactive mode (#61)

* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
											
										
										
											2023-03-12 22:13:28 +01:00
+								```
-												Minor style changes
											
										
										
											2023-03-21 17:10:32 +01:00
-												`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
											
										
										
											2024-06-13 01:41:52 +02:00
+								Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `llama-cli` example program.
-												Add interactive mode (#61)

* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
											
										
										
											2023-03-12 22:13:28 +01:00
-												Update README.md
											
										
										
											2023-03-12 22:39:01 +01:00
+								![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)
-												Add interactive mode (#61)

* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
											
										
										
											2023-03-12 22:13:28 +01:00
-												readme : add docs for chat-persistent.sh (#1568)

* readme : add docs for chat-persistent.sh

* Update README.md
											
										
										
											2023-05-24 08:24:01 +02:00
+								### Persistent Interaction
-												`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
											
										
										
											2024-06-13 01:41:52 +02:00
+								The prompt, user inputs, and model generations can be saved and resumed across calls to `./llama-cli` by leveraging `--prompt-cache` and `--prompt-cache-all`. The `./examples/chat-persistent.sh` script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as `chat-13B.sh`. The same prompt cache can be reused for new chat sessions. Note that both prompt cache and chat directory are tied to the initial prompt (`PROMPT_TEMPLATE`) and the model file.
-												readme : add docs for chat-persistent.sh (#1568)

* readme : add docs for chat-persistent.sh

* Update README.md
											
										
										
											2023-05-24 08:24:01 +02:00
 								```bash
 								# Start a new chat
 								PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
 								# Resume that chat
 								PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
 								# Start a different chat with the same prompt/model
 								PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh
 								# Different prompt cache for different prompt/model
 								PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
 								    CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh
 								```
-												docs : add grammar docs (#2701)

* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
											
										
										
											2023-08-23 03:01:57 +02:00
+								### Constrained output with grammars
 								`llama.cpp` supports grammars to constrain model output. For example, you can force the model to output JSON only:
 								```bash
-												`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
											
										
										
											2024-06-13 01:41:52 +02:00
+								./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
-												docs : add grammar docs (#2701)

* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
											
										
										
											2023-08-23 03:01:57 +02:00
+								```
 								The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
-												readme : add link to grammars app (#3388)

* Add link to grammars app per @ggernagov suggestion

Adding a sentence in the Grammars section of README to point to grammar app, per https://github.com/ggerganov/llama.cpp/discussions/2494#discussioncomment-7138211

* Update README.md
											
										
										
											2023-09-29 13:15:57 +02:00
+								For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
+								## Build
-												Obtaining LLaMA 2 instructions (#2308)

* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models
											
										
										
											2023-07-28 03:14:11 +02:00
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
+								Please refer to [Build llama.cpp locally](./docs/build.md)
-												Obtaining LLaMA 2 instructions (#2308)

* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models
											
										
										
											2023-07-28 03:14:11 +02:00
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
+								## Supported backends
-												Add SHA256SUMS file and instructions to README how to obtain and verify the downloads

Hashes created using:

sha256sum models/*B/*.pth models/*[7136]B/ggml-model-f16.bin* models/*[7136]B/ggml-model-q4_0.bin* > SHA256SUMS

											
										
										
											2023-03-20 21:14:06 +01:00
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
+								| Backend | Target devices |
 								| --- | --- |
 								| [Metal](./docs/build.md#metal-build) | Apple Silicon |
 								| [BLAS](./docs/build.md#blas-build) | All |
 								| [BLIS](./docs/backend/BLIS.md) | All |
 								| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
-												musa : update doc (#9856)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
											
										
										
											2024-10-12 07:09:53 +02:00
+								| [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
+								| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
 								| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
 								| [Vulkan](./docs/build.md#vulkan) | GPU |
-												cann: add doc for cann backend (#8867)

Co-authored-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
											
										
										
											2024-08-19 10:46:38 +02:00
+								| [CANN](./docs/build.md#cann) | Ascend NPU |
-												Fix whitespace, add .editorconfig, add GitHub workflow (#883)


											
										
										
											2023-04-11 21:45:44 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Tools
-												docker : add gpu image CI builds (#3103)

Enables the GPU enabled container images to be built and pushed
alongside the CPU containers.

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
											
										
										
											2023-09-14 18:47:00 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								### Prepare and Quantize
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								> [!NOTE]
 								> You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								Note: `convert.py` has been moved to `examples/convert_legacy_llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
 								It does not support LLaMA 3, you can use `convert_hf_to_gguf.py` with LLaMA 3 downloaded from Hugging Face.
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								To learn more about quantizing model, [read this documentation](./examples/quantize/README.md)
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								### Perplexity (measuring model quality)
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
 								For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								To learn more how to measure perplexity using llama.cpp, [read this documentation](./examples/perplexity/README.md)
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Contributing
-												Add initial contribution guidelines
											
										
										
											2023-03-13 08:42:26 +01:00
-												Update contribution section, hot topics, limitations, etc.
											
										
										
											2023-03-13 18:21:51 +01:00
+								- Contributors can open PRs
-												Expand "Contributing" section
											
										
										
											2023-03-16 07:55:13 +01:00
+								- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
-												Add initial contribution guidelines
											
										
										
											2023-03-13 08:42:26 +01:00
+								- Collaborators will be invited based on contributions
-												contrib : add Resources section (#9675)


											
										
										
											2024-09-29 13:38:18 +02:00
+								- Any help with managing issues, PRs and projects is very appreciated!
-												contributing : update guidelines (#8316)


											
										
										
											2024-07-05 08:09:47 +02:00
+								- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
 								- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
-												Update Contributing section
											
										
										
											2023-03-17 19:30:04 +01:00
+								- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
-												Adjust repetition penalty ..
											
										
										
											2023-03-23 09:46:58 +01:00
+								- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
-												Add initial contribution guidelines
											
										
										
											2023-03-13 08:42:26 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Other documentations
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
-												`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
											
										
										
											2024-06-13 01:41:52 +02:00
+								- [main (cli)](./examples/main/README.md)
-												readme : add more docs indexes (#2127)

* Update README.md to add more docs indexes

* Update README.md to add more docs indexes
											
										
										
											2023-07-09 09:38:42 +02:00
+								- [server](./examples/server/README.md)
 								- [jeopardy](./examples/jeopardy/README.md)
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								- [GBNF grammars](./grammars/README.md)
 								**Development documentations**
 								- [How to build](./docs/build.md)
 								- [Running on Docker](./docs/docker.md)
 								- [Build on Android](./docs/android.md)
-												Update README.md to fix broken link to docs (#8399)

Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'
											
										
										
											2024-07-09 20:58:44 +02:00
+								- [Performance troubleshooting](./docs/development/token_generation_performance_tips.md)
-												readme : add more docs indexes (#2127)

* Update README.md to add more docs indexes

* Update README.md to add more docs indexes
											
										
										
											2023-07-09 09:38:42 +02:00
+								- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
 								**Seminal papers and background on the models**
 								If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
 								- LLaMA:
 								    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
 								    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
 								- GPT-3
 								    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
 								- GPT-3.5 / InstructGPT / ChatGPT:
 								    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
 								    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)