llama.cpp/README.md

# llama.cpp

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)

[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)

Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++

## Recent API changes

- [Changelog for `libllama` API](https://github.com/ggerganov/llama.cpp/issues/9289)
- [Changelog for `llama-server` REST API](https://github.com/ggerganov/llama.cpp/issues/9291)

## Hot topics

- **Introducing GGUF-my-LoRA** https://github.com/ggerganov/llama.cpp/discussions/10123
- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggerganov/llama.cpp/discussions/9669
- Hugging Face GGUF editor: [discussion](https://github.com/ggerganov/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)

----

## Description

The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.

<details>
<summary>Models</summary>

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)

#### Text-only

- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [BERT](https://github.com/ggerganov/llama.cpp/pull/5423)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [X] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [OLMo 2](https://allenai.org/olmo)
- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
- [x] [Smaug](https://huggingface.co/models?search=Smaug)
- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)

#### Multimodal

- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)

</details>

<details>
<summary>Bindings</summary>

- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
- Swift [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
- Swift [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)

</details>

<details>
<summary>UIs</summary>

*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)
- [janhq/jan](https://github.com/janhq/jan) (AGPL)
- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)
- [LARS](https://github.com/abgulati/LARS) (AGPL)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [LMStudio](https://lmstudio.ai/) (proprietary)
- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
- [MindMac](https://mindmac.app) (proprietary)
- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
- [nat/openplayground](https://github.com/nat/openplayground) (MIT)
- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)
- [ollama/ollama](https://github.com/ollama/ollama) (MIT)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)
- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
- [ramalama](https://github.com/containers/ramalama) (MIT)
- [semperai/amica](https://github.com/semperai/amica) (MIT)
- [withcatai/catai](https://github.com/withcatai/catai) (MIT)

</details>

<details>
<summary>Tools</summary>

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)

</details>

<details>
<summary>Infrastructure</summary>

- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly

</details>

<details>
<summary>Games</summary>

- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.

</details>

## Supported backends

| Backend | Target devices |
| --- | --- |
| [Metal](docs/build.md#metal-build) | Apple Silicon |
| [BLAS](docs/build.md#blas-build) | All |
| [BLIS](docs/backend/BLIS.md) | All |
| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
| [MUSA](docs/build.md#musa) | Moore Threads MTT GPU |
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
| [hipBLAS](docs/build.md#hipblas) | AMD GPU |
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |

## Building the project

The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).
The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:

- Clone this repository and build locally, see [how to build](docs/build.md)
- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](docs/install.md)
- Use a Docker image, see [documentation for Docker](docs/docker.md)
- Download pre-built binaries from [releases](https://github.com/ggerganov/llama.cpp/releases)

## Obtaining and quantizing models

The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:

- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)

After downloading a model, use the CLI tools to run it locally - see below.

`llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`:

- Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes
- Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggerganov/llama.cpp/discussions/10123)
- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF meta data in the browser (more info: https://github.com/ggerganov/llama.cpp/discussions/9268)
- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggerganov/llama.cpp/discussions/9669)

To learn more about model quantization, [read this documentation](examples/quantize/README.md)

## [`llama-cli`](examples/main)

#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.

- <details open>
    <summary>Run simple text completion</summary>

    ```bash
    llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128

    # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
    ```

    </details>

- <details>
    <summary>Run in conversation mode</summary>

    ```bash
    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv

    # > hi, who are you?
    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
    #
    # > what is 1+1?
    # Easy peasy! The answer to 1+1 is... 2!
    ```

    </details>

- <details>
    <summary>Run with custom chat template</summary>

    ```bash
    # use the "chatml" template
    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml

    # use a custom template
    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
    ```

    [Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)

    </details>

- <details>
    <summary>Constrain the output with a custom grammar</summary>

    ```bash
    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
    ```

    The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).

    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/

    </details>


## [`llama-server`](examples/server)

#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.

- <details open>
    <summary>Start a local HTTP server with default configuration on port 8080</summary>

    ```bash
    llama-server -m model.gguf --port 8080

    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
    ```

    </details>

- <details>
    <summary>Support multiple-users and parallel decoding</summary>

    ```bash
    # up to 4 concurrent requests, each with 4096 max context
    llama-server -m model.gguf -c 16384 -np 4
    ```

    </details>

- <details>
    <summary>Enable speculative decoding</summary>

    ```bash
    # the draft.gguf model should be a small variant of the target model.gguf
    llama-server -m model.gguf -md draft.gguf
    ```

    </details>

- <details>
    <summary>Serve an embedding model</summary>

    ```bash
    # use the /embedding endpoint
    llama-server -m model.gguf --embedding --pooling cls -ub 8192
    ```

    </details>

- <details>
    <summary>Serve a reranking model</summary>

    ```bash
    # use the /reranking endpoint
    llama-server -m model.gguf --reranking
    ```

    </details>

- <details>
    <summary>Constrain all outputs with a grammar</summary>

    ```bash
    # custom grammar
    llama-server -m model.gguf --grammar-file grammar.gbnf

    # JSON
    llama-server -m model.gguf --grammar-file grammars/json.gbnf
    ```

    </details>


## [`llama-perplexity`](examples/perplexity)

#### A tool for measuring the perplexity [^1][^2] (and other quality metrics) of a model over a given text.

- <details open>
    <summary>Measure the perplexity over a text file</summary>

    ```bash
    llama-perplexity -m model.gguf -f file.txt

    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
    # Final estimate: PPL = 5.4007 +/- 0.67339
    ```

    </details>

- <details>
    <summary>Measure KL divergence</summary>

    ```bash
    # TODO
    ```

    </details>

[^1]: [examples/perplexity/README.md](examples/perplexity/README.md)
[^2]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)

## [`llama-bench`](example/bench)

#### Benchmark the performance of the inference for various parameters.

- <details open>
    <summary>Run default benchmark</summary>

    ```bash
    llama-bench -m model.gguf

    # Output:
    # | model               |       size |     params | backend    | threads |          test |                  t/s |
    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
    #
    # build: 3e0ba0e60 (4229)
    ```

    </details>

## [`llama-run`](examples/run)

#### A comprehensive example for running `llama.cpp` models. Useful for inferencing. Used with RamaLama [^3].

- <details>
    <summary>Run a model with a specific prompt (by default it's pulled from Ollama registry)</summary>

    ```bash
    llama-run granite-code
    ```

    </details>

[^3]: [https://github.com/containers/ramalama](RamaLama)

## [`llama-simple`](examples/simple)

#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.

- <details>
    <summary>Basic text completion</summary>

    ```bash
    llama-simple -m model.gguf

    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
    ```

    </details>


## Contributing

- Contributors can open PRs
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Collaborators will be invited based on contributions
- Any help with managing issues, PRs and projects is very appreciated!
- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)

## Other documentation

- [main (cli)](examples/main/README.md)
- [server](examples/server/README.md)
- [GBNF grammars](grammars/README.md)

#### Development documentation

- [How to build](docs/build.md)
- [Running on Docker](docs/docker.md)
- [Build on Android](docs/android.md)
- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)

#### Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

#### References
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								# llama.cpp
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
+								![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)
-												Add logo to README.md
											
										
										
											2023-03-26 09:20:49 +02:00
-												Fix conan badge display [no ci] (#7645)


											
										
										
											2024-05-30 17:07:39 +02:00
+								[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
-												readme : fix server badge
											
										
										
											2024-07-19 13:34:55 +02:00
+								[![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
-												Update README.md
											
										
										
											2023-03-12 21:09:26 +01:00
-												readme : add project status link
											
										
										
											2023-10-04 15:50:44 +02:00
+								[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)
-												readme : add new roadmap + manifesto
											
										
										
											2023-06-25 15:08:12 +02:00
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Recent API changes
-												readme : add API changes section
											
										
										
											2024-03-03 11:44:03 +01:00
-												readme : refactor API section + remove old hot topics
											
										
										
											2024-09-03 09:00:36 +02:00
+								- [Changelog for `libllama` API](https://github.com/ggerganov/llama.cpp/issues/9289)
 								- [Changelog for `llama-server` REST API](https://github.com/ggerganov/llama.cpp/issues/9291)
-												readme : add API changes section
											
										
										
											2024-03-03 11:44:03 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Hot topics
-												readme : update hot topics
											
										
										
											2023-08-27 13:44:35 +02:00
-												readme : update hot topics
											
										
										
											2024-11-01 16:31:51 +01:00
+								- **Introducing GGUF-my-LoRA** https://github.com/ggerganov/llama.cpp/discussions/10123
 								- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggerganov/llama.cpp/discussions/9669
-												readme : update hot topics
											
										
										
											2024-09-27 19:57:51 +02:00
+								- Hugging Face GGUF editor: [discussion](https://github.com/ggerganov/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)
-												readme : incoming BREAKING CHANGE
											
										
										
											2023-08-18 16:48:31 +02:00
 								----
-												Add Misc section + update hot topics + minor fixes
											
										
										
											2023-03-14 08:43:52 +01:00
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
+								## Description
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								range of hardware - locally and in the cloud.
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- Plain C/C++ implementation without any dependencies
 								- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
-												add amx kernel for gemm (#8998)

add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warning

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as an ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend
											
										
										
											2024-10-18 07:34:36 +02:00
+								- AVX, AVX2, AVX512 and AMX support for x86 architectures
-												readme : update (#5572)

Added 1.5-bit on README.md
											
										
										
											2024-02-19 08:39:31 +01:00
+								- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
-												musa : update doc (#9856)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
											
										
										
											2024-10-12 07:09:53 +02:00
+								- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
-												ggml : remove OpenCL (#7735)

ggml-ci
											
										
										
											2024-06-04 20:23:20 +02:00
+								- Vulkan and SYCL backend support
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								The `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.
-												Update README.md
											
										
										
											2023-03-11 11:31:21 +01:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								<details>
 								<summary>Models</summary>
-												readme : add GPT4All instructions (close #588)
											
										
										
											2023-03-29 18:37:20 +02:00
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								Typically finetunes of the base models below are supported as well.
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								#### Text-only
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
-												readme : update supported models
											
										
										
											2023-03-30 21:31:54 +02:00
+								- [X] LLaMA 🦙
-												Obtaining LLaMA 2 instructions (#2308)

* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models
											
										
										
											2023-07-28 03:14:11 +02:00
+								- [x] LLaMA 2 🦙🦙
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] LLaMA 3 🦙🦙🦙
-												readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
											
										
										
											2024-02-07 07:21:30 +01:00
+								- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
-												model: support arch `DbrxForCausalLM` (#6515)

* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
											
										
										
											2024-04-13 11:33:52 +02:00
+								- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
-												doc : add link to falcon (#6789)


											
										
										
											2024-04-21 14:35:40 +02:00
+								- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
-												readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475)

* add support for chinese llama-2 / alpaca-2

* remove white spaces
											
										
										
											2023-08-02 08:18:31 +02:00
+								- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
-												readme : update supported models
											
										
										
											2023-03-30 21:31:54 +02:00
+								- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
-												Document BERT support. (#8205)

* Update README.md

document BERT support

* Update README.md
											
										
										
											2024-07-01 13:40:58 +02:00
+								- [X] [BERT](https://github.com/ggerganov/llama.cpp/pull/5423)
-												Add BAIR's Koala to supported models (#877)


											
										
										
											2023-04-10 22:41:53 +02:00
+								- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
+								- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
 								- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
-												readme : update hot topics + model links (#3399)


											
										
										
											2023-09-29 14:50:35 +02:00
+								- [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
-												readme : update models, cuda + ppl instructions (#3510)


											
										
										
											2023-10-06 21:13:36 +02:00
+								- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
-												Add MPT model to supported models in README.md (#3574)


											
										
										
											2023-10-11 01:02:49 +02:00
+								- [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
+								- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
+								- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [X] [StableLM models](https://huggingface.co/stabilityai)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
+								- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
 								- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
-												llama : add PLaMo model (#3557)

* add plamo mock

* add tensor loading

* plamo convert

* update norm

* able to compile

* fix norm_rms_eps hparam

* runnable

* use inp_pos

* seems ok

* update kqv code

* remove develop code

* update README

* shuffle attn_q.weight and attn_output.weight for broadcasting

* remove plamo_llm_build_kqv and use llm_build_kqv

* fix style

* update

* llama : remove obsolete KQ_scale

* plamo : fix tensor names for correct GPU offload

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-12-24 14:35:49 +01:00
+								- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
-												gpt2 : Add gpt2 architecture integration (#4555)


											
										
										
											2023-12-28 15:03:57 +01:00
+								- [x] [GPT-2](https://huggingface.co/gpt2)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
 								- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
-												readme : add CodeShell models to the supported models list (#5330)


											
										
										
											2024-02-05 08:41:38 +01:00
+								- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
-												llama : add `gemma` model (#5631)

There are couple things in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
											
										
										
											2024-02-21 14:08:22 +01:00
+								- [x] [Gemma](https://ai.google.dev/gemma)
-												llama : support Mamba Selective State Space Models (#5328)

* mamba : begin working on support for Mamba SSM

* mamba : begin figuring out how to (ab)use the kv cache for Mamba

* mamba : recurrent inference almost works, but incoherent

* mamba : recurrent inference WORKS!!!

* convert : optionally use d_conv and d_state from config.json for Mamba

* mamba : refactor recurrent conv, resulting in 20% perf increase

It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.

I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.

* ggml : parallelize ggml_exp

This results in 8% faster token generation for Mamba-130M.

* mamba : simplify the conv step with a self-overlapping view

Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.

Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.

Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).

* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32

Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.

* mamba : fix self-overlapping view depth stride

* mamba : handle batches of more than 1 token

This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.

Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.

* ggml: add ggml_ssm_scan to help with parallel selective scan

If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.

* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation

This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.

* mamba : very basic quantization support

Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)

Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.

Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.

* convert : fix wrong name for layer norm weight of offical Mamba models

I was using Q-bert/Mamba-* models before, which have a slighlty different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")

* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator

This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.

However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.

* convert : for Mamba, also consider the "MambaLMHeadModel" arch name

It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json

* mamba : fix vocab size problems with official models

The perplexity was waaaay to high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.

Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.

* ggml : remove ggml_exp and ggml_soft_plus

They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.

* mamba : remove some useless comments

No code change.

* convert : fix flake8 linter errors

* mamba : apply suggestions from code review

* mamba : remove unecessary branch for row-wise ssm_state and C multiplication

It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.

* ggml : in ggml_ssm_scan, use more appropriate asserts

* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32

* mamba : multiple sequences, but one at a time

This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).

The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)

Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.

* mamba : support llama_kv_cache_seq_cp

This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.

Each KV cell is dedicated to the sequence ID corresponding to its own index.

* mamba : use a state mask

It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.

inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).

* llama : replace the usage of n_ctx with kv_self.size in many places

* mamba : use n_tokens directly instead of n_tok

* mamba : in comments, properly refer to KV cells instead of slots

* mamba : reduce memory usage of ggml_ssm_scan

From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.

The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.

* mamba : simultaneous sequence processing

A batch can now contain tokens from multiple sequences.

This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.

However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.

* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba

This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).

Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.

Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.

* llama : add inp_s_seq as a new input tensor

The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.

The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.

Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).

* mamba : support llama_kv_cache_seq_cp copy chains

* mamba : support shifting and dividing the kv cache pos

* mamba : make the server and parallel examples work with whole sequences

A seq_id is dedicated to the system prompt in both cases.

* llama : make llama_kv_cache_seq_rm return whether it succeeded or not

* mamba : dedicate an input tensor for state copy indices

This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.

* mamba : adapt perplexity, batched, and batched-bench examples

* perplexity : limit the max number of sequences

This adapts to what the loaded model can provide.

* llama : add llama_n_max_seq to get the upper limit for seq_ids

Used by the perplexity example.

* batched : pass n_parallel to the model's context params

This should have been there already, but it wasn't.

* batched-bench : reserve sequences to support Mamba

* batched-bench : fix tokens being put in wrong sequences

Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.

* mamba : stop abusing attention metadata

This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.

This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
 will not require breaking existing converted Mamba models again)

* gguf-py : add new KV metadata key-value pairs for Mamba

* llama : add new metadata key-value pairs for Mamba

* llama : guard against divisions by zero when n_head is 0

* mamba : rename "unlimited" KV cache property to "recurrent"

* mamba : more correctly update the "used" field of the KV cache

* ggml : in ggml_ssm_scan, use a threshold for soft_plus

This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.

* convert : for Mamba, fallback to internal NeoX tokenizer

The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.

* mamba : support state saving and restoring

* ggml : implicitly pass src tensors through dst for Mamba-related ops

* mamba : clarify some comments

* server : fix cache_tokens not getting correctly resized

Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.

For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.

* convert-hf : support new metadata keys for Mamba

For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406

* mamba : rename metadata to be more similar to transformers library

This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".

* mamba : support mamba-*-hf models

These models share their token_embd.weight with their output.weight

* mamba : add missing spaces

This is purely a formatting change.

* convert-hf : omit output.weight when identical with token_embd.weight

Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.

* readme : add Mamba to supported models, and add recent API changes

* mamba : move state_seq and state_mask views outside layer loop

A few tensors were also missing `struct` in front of `ggml_tensor`.
											
										
										
											2024-03-08 23:31:00 +01:00
+								- [x] [Mamba](https://github.com/state-spaces/mamba)
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
-												[Model] Add support for xverse (#6301)

* Support xverse model convert to gguf format.

* 1. Convert xverse models to gguf;
2. Add LLM_ARCH_XVERSE inference in llama.cpp;
3. Add xverse item in Supported models in README.md;

* * gguf-py: remove redundant logs
* llama: remove the init_mapping_prefetch custom parameter

* llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers.

* - Fix format issues
- Remove duplicate set kqv_out to llm_build_kv

* Update llama.cpp

---------

Co-authored-by: willhe <willhe@xverse.cn>
Co-authored-by: willhe <hexin@xverse.cn>
											
										
										
											2024-03-29 14:37:03 +01:00
+								- [x] [Xverse](https://huggingface.co/models?search=xverse)
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
-												llama : add SEA-LION support (#6448)

* initial commit for sealion support

* add sealion support

* minor fix

* q/k ln and pos_embd only if required

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* minor : clear whitespaces

---------

Co-authored-by: bryan <bryansiow@aisingapore.org>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-04-03 20:05:10 +02:00
+								- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
-												Add GritLM as supported models. (#6513)


											
										
										
											2024-04-07 19:33:59 +02:00
+								- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
-												Implement the OLMo architecture (#6741)

* implement olmo architecture

* remove unused variable

* remove unused moe branch

* remove check for weight

* remove superfluous moe, bias and rope tensors

* clarified comment

* fix clamp_kqv setting

* remove obsolete parameter name filter
											
										
										
											2024-04-19 11:35:54 +02:00
+								- [x] [OLMo](https://allenai.org/olmo)
-												Add OLMo 2 model in docs (#10530)

* Add link to OLMo 2 model in docs

* Change link to landing page
											
										
										
											2024-11-26 21:55:29 +01:00
+								- [x] [OLMo 2](https://allenai.org/olmo)
-												llama : support OLMoE (#9462)


											
										
										
											2024-09-16 08:47:37 +02:00
+								- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
-												readme : update model list (#8851)


											
										
										
											2024-08-05 07:54:10 +02:00
+								- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
-												readme : add GPT-NeoX + Pythia to the list of supported models (#7491)


											
										
										
											2024-05-23 14:12:43 +02:00
+								- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
-												readme : update model list (#8851)


											
										
										
											2024-08-05 07:54:10 +02:00
+								- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
 								- [x] [Smaug](https://huggingface.co/models?search=Smaug)
 								- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
 								- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
 								- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
 								- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
-												readme : add supported glm models (#8360)


											
										
										
											2024-07-08 07:57:19 +02:00
+								- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
-												readme : update model list (#8851)


											
										
										
											2024-08-05 07:54:10 +02:00
+								- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
-												llama : add EXAONE model support (#9025)

* add exaone model support

* add chat template

* fix whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add ftype

* add exaone pre-tokenizer in `llama-vocab.cpp`

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* fix lint

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* add `EXAONE` to supported models in `README.md`

* fix space

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
											
										
										
											2024-08-16 08:35:18 +02:00
+								- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
-												llama : support for `falcon-mamba` architecture (#9074)

* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <git@compilade.net>
											
										
										
											2024-08-21 10:06:36 +02:00
+								- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
-												Add Jais to list of supported models (#9439)

Co-authored-by: fmz <quic_fzaghlou@quic.com>
											
										
										
											2024-09-12 02:29:53 +02:00
+								- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
-												Update README.md (#9591)

Add Bielik model.
											
										
										
											2024-10-01 19:18:46 +02:00
+								- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
-												llama : add chat template for RWKV-World + fix EOT (#9968)

* Add chat template for RWKV-World

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Fix the chat template not being used

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV v6: Set EOT token to ``\n\n``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* readme: add rwkv into supported model list

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
											
										
										
											2024-10-22 12:33:37 +02:00
+								- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								#### Multimodal
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
-												readme : add link to LLaVA 1.6 models (#5758)

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
											
										
										
											2024-02-28 09:39:39 +01:00
+								- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
-												readme : update supported model list (#4457)


											
										
										
											2023-12-14 08:38:49 +01:00
+								- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
 								- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
-												readme : add MobileVLM 1.7B/3B to the supported models list (#5107)

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
											
										
										
											2024-01-25 21:14:32 +01:00
+								- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
-												readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)


											
										
										
											2024-02-06 15:06:48 +01:00
+								- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
-												readme : update model list (#6908)

* Update README.md

* missing space

* llama3 !
											
										
										
											2024-04-25 15:52:28 +02:00
+								- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
-												llava : fix moondream support (#7163)

* Revert "Revert "llava : add support for moondream vision language model (#6899)""

This reverts commit 9da243b36ac0b9d609adfaaa4c8f1cc8c592f737.

* Fix num_positions and embeddings initialization
											
										
										
											2024-05-10 08:41:10 +02:00
+								- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
-												readme : remove trailing space (#7469)

											
										
										
											2024-05-23 16:43:18 +02:00
+								- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
-												readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-10-17 20:13:21 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								</details>
 								<details>
 								<summary>Bindings</summary>
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
 								- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
 								- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
-												readme : remove unsupported node.js library (#3703)

- https://github.com/Atome-FE/llama-node is quite out of date
- doesn't support recent/current llama.cpp functionality
											
										
										
											2023-10-22 20:16:43 +02:00
+								- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
-												readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)


											
										
										
											2024-01-07 21:24:11 +01:00
+								- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
-												readme : add programmable prompt engine language CLI (#9599)


											
										
										
											2024-09-23 17:58:17 +02:00
+								- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
-												readme : add JavaScript/Wasm repo (#5415)


											
										
										
											2024-02-09 11:17:00 +01:00
+								- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
-												readme : add wllama as a wasm binding (#6100)


											
										
										
											2024-03-16 16:42:08 +01:00
+								- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
-												readme : add Ruby bindings (#1029)


											
										
										
											2023-04-17 21:34:35 +02:00
+								- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
-												readme : add feature-rich rust bindings (#6465)


											
										
										
											2024-04-03 19:53:37 +02:00
+								- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
-												readme : add link to rust bindings (#5148)

* added link to another set of rust bindings with brief note on differences.

* fixed link name
											
										
										
											2024-01-28 09:30:44 +01:00
+								- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
 								- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
-												readme : add C#/.NET bindings repo (#1409)


											
										
										
											2023-05-12 07:39:40 +02:00
+								- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
-												readme : update bindings list (#9951)

Update the binding list by adding LM-Kit.NET (C# & VB.NET)
											
										
										
											2024-10-20 18:25:41 +02:00
+								- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
-												readme : add Scala 3 bindings repo (#2010)


											
										
										
											2023-06-26 21:47:59 +02:00
+								- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
-												Add link to clojure bindings to Readme. (#2659)


											
										
										
											2023-08-18 21:39:22 +02:00
+								- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
-												readme : add react-native binding (#2869)


											
										
										
											2023-08-29 11:30:10 +02:00
+								- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
-												docs : add java-llama.cpp to README.md (#2935)


											
										
										
											2023-09-01 15:36:14 +02:00
+								- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
-												readme : add zig bindings (#4581)


											
										
										
											2023-12-22 07:49:54 +01:00
+								- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
-												Add a dart/flutter binding to README.md (#4882)


											
										
										
											2024-01-20 09:05:43 +01:00
+								- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
-												docs : update bindings list (#10261)

Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>
											
										
										
											2024-11-13 12:17:10 +01:00
+								- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)
-												readme : add php api bindings (#6326)

* add php bindings to readme

* readme : add link to PR

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-03-27 08:08:59 +01:00
+								- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
-												readme : update bindings list (#8222)

* adding guile_llama_cpp  to binding list

* fix formatting

* fix formatting
											
										
										
											2024-07-07 15:21:37 +02:00
+								- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
-												readme : update bindings list (#9889)


											
										
										
											2024-10-15 10:20:34 +02:00
+								- Swift [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
-												readme : update bindings list (#9918)

Co-authored-by: Tim Wang <tim.wang@ing.com>
											
										
										
											2024-10-17 08:57:14 +02:00
+								- Swift [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								</details>
 								<details>
 								<summary>UIs</summary>
 								*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
-												cleanup UI link list (#10577)

* cleanup UI link list

* sort list alphabetically

* add missing licenses
											
										
										
											2024-11-29 17:45:08 +01:00
+								- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
 								- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
 								- [Dot](https://github.com/alexpinel/Dot) (GPL)
 								- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
 								- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [janhq/jan](https://github.com/janhq/jan) (AGPL)
-												cleanup UI link list (#10577)

* cleanup UI link list

* sort list alphabetically

* add missing licenses
											
										
										
											2024-11-29 17:45:08 +01:00
+								- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
 								- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)
-												cleanup UI link list (#10577)

* cleanup UI link list

* sort list alphabetically

* add missing licenses
											
										
										
											2024-11-29 17:45:08 +01:00
+								- [LARS](https://github.com/abgulati/LARS) (AGPL)
 								- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
 								- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
 								- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [LMStudio](https://lmstudio.ai/) (proprietary)
-												readme : add LocalAI to the availables UI (#5629)


											
										
										
											2024-02-21 15:39:10 +01:00
+								- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
-												cleanup UI link list (#10577)

* cleanup UI link list

* sort list alphabetically

* add missing licenses
											
										
										
											2024-11-29 17:45:08 +01:00
+								- [MindMac](https://mindmac.app) (proprietary)
 								- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
 								- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
 								- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
 								- [nat/openplayground](https://github.com/nat/openplayground) (MIT)
 								- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)
 								- [ollama/ollama](https://github.com/ollama/ollama) (MIT)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
-												cleanup UI link list (#10577)

* cleanup UI link list

* sort list alphabetically

* add missing licenses
											
										
										
											2024-11-29 17:45:08 +01:00
+								- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
 								- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)
 								- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)
-												README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-02-05 15:55:10 +01:00
+								- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
-												cleanup UI link list (#10577)

* cleanup UI link list

* sort list alphabetically

* add missing licenses
											
										
										
											2024-11-29 17:45:08 +01:00
+								- [ramalama](https://github.com/containers/ramalama) (MIT)
 								- [semperai/amica](https://github.com/semperai/amica) (MIT)
 								- [withcatai/catai](https://github.com/withcatai/catai) (MIT)
-												readme : add UI (#6724)

* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-04-17 14:47:50 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								</details>
-												readme : add notice for UI list
											
										
										
											2024-03-28 21:56:03 +01:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								<details>
 								<summary>Tools</summary>
-												Readme: add akx/ggify to tools (#1484)


											
										
										
											2024-05-26 14:09:42 +02:00
 								- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
-												readme : add tool (#9655)


											
										
										
											2024-09-28 14:07:14 +02:00
+								- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
-												gemma2: add sliding window mask (#8227)

* gemma2: add sliding window mask

* fix data_swa uninitialized

* better naming

* add co-author

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>

* replace list with single tensor

* update

* llama : minor styling

* convert : add sanity check for query_pre_attn_scalar

* fix small typo in README

---------

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-07-01 18:48:34 +02:00
+								- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
-												docs: introduce gpustack and gguf-parser (#8873)

* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.

Signed-off-by: thxCode <thxcode0824@gmail.com>

* readme: introduce gguf-parser

GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.

Signed-off-by: thxCode <thxcode0824@gmail.com>

---------

Signed-off-by: thxCode <thxcode0824@gmail.com>
											
										
										
											2024-08-12 14:45:50 +02:00
+								- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
-												Readme: add akx/ggify to tools (#1484)


											
										
										
											2024-05-26 14:09:42 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								</details>
 								<details>
 								<summary>Infrastructure</summary>
-												readme: add Paddler to the list of projects (#8239)


											
										
										
											2024-07-01 19:13:22 +02:00
 								- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
-												docs: introduce gpustack and gguf-parser (#8873)

* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.

Signed-off-by: thxCode <thxcode0824@gmail.com>

* readme: introduce gguf-parser

GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.

Signed-off-by: thxCode <thxcode0824@gmail.com>

---------

Signed-off-by: thxCode <thxcode0824@gmail.com>
											
										
										
											2024-08-12 14:45:50 +02:00
+								- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
-												readme : update infra list (#9942)

llama_cpp_canister allows you to run llama.cpp as a Smart Contract on the Internet Computer. The smart contract runs as WebAssembly in a so-called 'canister'.
											
										
										
											2024-10-20 18:01:34 +02:00
+								- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
-												readme: add Paddler to the list of projects (#8239)


											
										
										
											2024-07-01 19:13:22 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								</details>
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								<details>
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								<summary>Games</summary>
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								</details>
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								## Supported backends
 								| Backend | Target devices |
 								| --- | --- |
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								| [Metal](docs/build.md#metal-build) | Apple Silicon |
 								| [BLAS](docs/build.md#blas-build) | All |
 								| [BLIS](docs/backend/BLIS.md) | All |
 								| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
 								| [MUSA](docs/build.md#musa) | Moore Threads MTT GPU |
 								| [CUDA](docs/build.md#cuda) | Nvidia GPU |
 								| [hipBLAS](docs/build.md#hipblas) | AMD GPU |
 								| [Vulkan](docs/build.md#vulkan) | GPU |
 								| [CANN](docs/build.md#cann) | Ascend NPU |
 								## Building the project
-												Update README.md
											
										
										
											2023-03-10 23:51:46 +01:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).
 								The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
-												Update README.md
											
										
										
											2023-03-10 23:51:46 +01:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- Clone this repository and build locally, see [how to build](docs/build.md)
 								- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](docs/install.md)
 								- Use a Docker image, see [documentation for Docker](docs/docker.md)
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								- Download pre-built binaries from [releases](https://github.com/ggerganov/llama.cpp/releases)
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								## Obtaining and quantizing models
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
 								The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:
 								- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
 								- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								After downloading a model, use the CLI tools to run it locally - see below.
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								`llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.
-												zig : update build.zig (#872)

* update

* update readme

* minimize the changes.

---------

Co-authored-by: zjli2019 <zhengji.li@ingchips.com>
											
										
										
											2023-04-13 15:43:22 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`:
-												Updating build instructions to include BLAS support (#1183)

* Updated build information

First update to the build instructions to include BLAS.

* Update README.md

* Update information about BLAS

* Better BLAS explanation

Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.

* Better BLAS explanation

* BLAS for Mac

Specifying that BLAS is already supported on Macs using the Accelerate Framework.

* Clarify the effect of BLAS

* Windows Make instructions

Added the instructions to build with Make on Windows

* Fixing typo

* Fix trailing whitespace
											
										
										
											2023-04-26 22:03:03 +02:00
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
+								- Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes
 								- Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggerganov/llama.cpp/discussions/10123)
 								- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF meta data in the browser (more info: https://github.com/ggerganov/llama.cpp/discussions/9268)
 								- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggerganov/llama.cpp/discussions/9669)
-												Updating build instructions to include BLAS support (#1183)

* Updated build information

First update to the build instructions to include BLAS.

* Update README.md

* Update information about BLAS

* Better BLAS explanation

Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.

* Better BLAS explanation

* BLAS for Mac

Specifying that BLAS is already supported on Macs using the Accelerate Framework.

* Clarify the effect of BLAS

* Windows Make instructions

Added the instructions to build with Make on Windows

* Fixing typo

* Fix trailing whitespace
											
										
										
											2023-04-26 22:03:03 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								To learn more about model quantization, [read this documentation](examples/quantize/README.md)
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								## [`llama-cli`](examples/main)
-												readme : refresh (#10587)

* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
											
										
										
											2024-11-30 08:47:07 +01:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.
-												Create README.md
											
										
										
											2023-03-10 20:47:46 +01:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- <details open>
 								    <summary>Run simple text completion</summary>
-												Add brew installation instruction to README [no ci] (#7616)


											
										
										
											2024-05-30 16:58:15 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    ```bash
 								    llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
 								    ```
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    </details>
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- <details>
 								    <summary>Run in conversation mode</summary>
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    ```bash
 								    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv
-												Add Nix and Flox install instructions (#7899)


											
										
										
											2024-06-17 17:37:55 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    # > hi, who are you?
 								    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
 								    #
 								    # > what is 1+1?
 								    # Easy peasy! The answer to 1+1 is... 2!
 								    ```
-												feature : support blis and other blas implementation  (#1536)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix: blas changes on ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-05-20 16:58:31 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    </details>
-												readme : add note that LLaMA 3 is not supported with convert.py (#7065)


											
										
										
											2024-05-05 07:21:46 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- <details>
 								    <summary>Run with custom chat template</summary>
-												Update README.md (#3289)

* Update README.md

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
											
										
										
											2023-09-21 21:00:24 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    ```bash
 								    # use the "chatml" template
 								    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
-												Update README.md (#3289)

* Update README.md

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
											
										
										
											2023-09-21 21:00:24 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    # use a custom template
 								    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
 								    ```
-												readme : add docs for chat-persistent.sh (#1568)

* readme : add docs for chat-persistent.sh

* Update README.md
											
										
										
											2023-05-24 08:24:01 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    [Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
-												docs : add grammar docs (#2701)

* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
											
										
										
											2023-08-23 03:01:57 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    </details>
-												docs : add grammar docs (#2701)

* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
											
										
										
											2023-08-23 03:01:57 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- <details>
 								    <summary>Constrain the output with a custom grammar</summary>
-												docs : add grammar docs (#2701)

* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
											
										
										
											2023-08-23 03:01:57 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    ```bash
 								    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
-												docs : add grammar docs (#2701)

* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
											
										
										
											2023-08-23 03:01:57 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
 								    ```
-												Obtaining LLaMA 2 instructions (#2308)

* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models
											
										
										
											2023-07-28 03:14:11 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).
-												Obtaining LLaMA 2 instructions (#2308)

* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models
											
										
										
											2023-07-28 03:14:11 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
-												Add SHA256SUMS file and instructions to README how to obtain and verify the downloads

Hashes created using:

sha256sum models/*B/*.pth models/*[7136]B/ggml-model-f16.bin* models/*[7136]B/ggml-model-q4_0.bin* > SHA256SUMS

											
										
										
											2023-03-20 21:14:06 +01:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								    </details>
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								## [`llama-server`](examples/server)
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- <details open>
 								    <summary>Start a local HTTP server with default configuration on port 8080</summary>
 								    ```bash
 								    llama-server -m model.gguf --port 8080
 								    # Basic web UI can be accessed via browser: http://localhost:8080
 								    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
 								    ```
 								    </details>
 								- <details>
 								    <summary>Support multiple-users and parallel decoding</summary>
 								    ```bash
 								    # up to 4 concurrent requests, each with 4096 max context
 								    llama-server -m model.gguf -c 16384 -np 4
 								    ```
 								    </details>
 								- <details>
 								    <summary>Enable speculative decoding</summary>
 								    ```bash
 								    # the draft.gguf model should be a small variant of the target model.gguf
 								    llama-server -m model.gguf -md draft.gguf
 								    ```
 								    </details>
 								- <details>
 								    <summary>Serve an embedding model</summary>
 								    ```bash
 								    # use the /embedding endpoint
 								    llama-server -m model.gguf --embedding --pooling cls -ub 8192
 								    ```
 								    </details>
 								- <details>
 								    <summary>Serve a reranking model</summary>
 								    ```bash
 								    # use the /reranking endpoint
 								    llama-server -m model.gguf --reranking
 								    ```
 								    </details>
 								- <details>
 								    <summary>Constrain all outputs with a grammar</summary>
 								    ```bash
 								    # custom grammar
 								    llama-server -m model.gguf --grammar-file grammar.gbnf
 								    # JSON
 								    llama-server -m model.gguf --grammar-file grammars/json.gbnf
 								    ```
 								    </details>
 								## [`llama-perplexity`](examples/perplexity)
 								#### A tool for measuring the perplexity [^1][^2] (and other quality metrics) of a model over a given text.
 								- <details open>
 								    <summary>Measure the perplexity over a text file</summary>
 								    ```bash
 								    llama-perplexity -m model.gguf -f file.txt
 								    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
 								    # Final estimate: PPL = 5.4007 +/- 0.67339
 								    ```
 								    </details>
 								- <details>
 								    <summary>Measure KL divergence</summary>
 								    ```bash
 								    # TODO
 								    ```
 								    </details>
 								[^1]: [examples/perplexity/README.md](examples/perplexity/README.md)
 								[^2]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
 								## [`llama-bench`](example/bench)
 								#### Benchmark the performance of the inference for various parameters.
 								- <details open>
 								    <summary>Run default benchmark</summary>
 								    ```bash
 								    llama-bench -m model.gguf
 								    # Output:
 								    # | model               |       size |     params | backend    | threads |          test |                  t/s |
 								    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
 								    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
 								    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
 								    #
 								    # build: 3e0ba0e60 (4229)
 								    ```
 								    </details>
-												Opt class for positional argument handling (#10508)

Added support for positional arguments `model` and `prompt`. Added
functionality to download via strings like:

  llama-run llama3
  llama-run ollama://granite-code
  llama-run ollama://granite-code:8b
  llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
  llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
  llama-run https://example.com/some-file1.gguf
  llama-run some-file2.gguf
  llama-run file://some-file3.gguf

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
											
										
										
											2024-12-13 19:34:25 +01:00
+								## [`llama-run`](examples/run)
 								#### A comprehensive example for running `llama.cpp` models. Useful for inferencing. Used with RamaLama [^3].
 								- <details>
 								    <summary>Run a model with a specific prompt (by default it's pulled from Ollama registry)</summary>
 								    ```bash
 								    llama-run granite-code
 								    ```
 								    </details>
 								[^3]: [https://github.com/containers/ramalama](RamaLama)
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
 								## [`llama-simple`](examples/simple)
 								#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
 								- <details>
 								    <summary>Basic text completion</summary>
 								    ```bash
 								    llama-simple -m model.gguf
 								    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
 								    ```
 								    </details>
-												docker : add support for CUDA in docker (#1461)

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2023-07-07 20:25:25 +02:00
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
+								## Contributing
-												Add initial contribution guidelines
											
										
										
											2023-03-13 08:42:26 +01:00
-												Update contribution section, hot topics, limitations, etc.
											
										
										
											2023-03-13 18:21:51 +01:00
+								- Contributors can open PRs
-												Expand "Contributing" section
											
										
										
											2023-03-16 07:55:13 +01:00
+								- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
-												Add initial contribution guidelines
											
										
										
											2023-03-13 08:42:26 +01:00
+								- Collaborators will be invited based on contributions
-												contrib : add Resources section (#9675)


											
										
										
											2024-09-29 13:38:18 +02:00
+								- Any help with managing issues, PRs and projects is very appreciated!
-												contributing : update guidelines (#8316)


											
										
										
											2024-07-05 08:09:47 +02:00
+								- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
 								- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
-												Update Contributing section
											
										
										
											2023-03-17 19:30:04 +01:00
+								- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
-												Adjust repetition penalty ..
											
										
										
											2023-03-23 09:46:58 +01:00
+								- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
-												Add initial contribution guidelines
											
										
										
											2023-03-13 08:42:26 +01:00
-												CMake: default to -arch=native for CUDA build (#10320)


											
										
										
											2024-11-17 09:06:34 +01:00
+								## Other documentation
-												readme : change logo + add bindings + add uis + add wiki
											
										
										
											2023-04-05 17:56:20 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- [main (cli)](examples/main/README.md)
 								- [server](examples/server/README.md)
 								- [GBNF grammars](grammars/README.md)
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								#### Development documentation
-												Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
											
										
										
											2024-07-05 18:08:32 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								- [How to build](docs/build.md)
 								- [Running on Docker](docs/docker.md)
 								- [Build on Android](docs/android.md)
 								- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
-												readme : add more docs indexes (#2127)

* Update README.md to add more docs indexes

* Update README.md to add more docs indexes
											
										
										
											2023-07-09 09:38:42 +02:00
+								- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
+								#### Seminal papers and background on the models
-												update main readme (#8333)


											
										
										
											2024-07-06 19:01:23 +02:00
 								If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
 								- LLaMA:
 								    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
 								    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
 								- GPT-3
 								    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
 								- GPT-3.5 / InstructGPT / ChatGPT:
 								    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
 								    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
-												readme : update the usage section with examples (#10596)

* readme : update the usage section with examples

* readme : more examples
											
										
										
											2024-12-01 10:25:17 +01:00
 								#### References