# LLGuidance Support in llama.cpp
[LLGuidance](https://github.com/guidance-ai/llguidance) is a library for constrained decoding (also called constrained sampling or structured outputs) for Large Language Models (LLMs). Initially developed as the backend for the [Guidance](https://github.com/guidance-ai/guidance) library, it can also be used independently.
LLGuidance supports JSON Schemas and arbitrary context-free grammars (CFGs) written in a variant of Lark syntax. It is very fast and has excellent JSON Schema coverage but requires the Rust compiler, which complicates the llama.cpp build process.
## Building
To enable LLGuidance support, build llama.cpp with the `LLAMA_LLGUIDANCE` option:

```sh
cmake -B build -DLLAMA_LLGUIDANCE=ON
make -C build -j
```

This requires the Rust compiler and the `cargo` tool to be installed.
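If Rust is not available on the build machine, it can be installed via rustup, the standard Rust toolchain installer:

```sh
# Install the Rust toolchain, including cargo, via rustup.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```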
## Interface
There are no new command-line arguments or modifications to `common_params`. When enabled, grammars starting with `%llguidance` are passed to LLGuidance instead of the current llama.cpp grammars. Additionally, JSON Schema requests (e.g., using the `-j` argument in `llama-cli`) are also passed to LLGuidance.
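For example, a Lark-style grammar can be supplied inline (a minimal sketch; the model path, prompt, and grammar are illustrative, and the exact directive syntax is described in the LLGuidance docs):

```sh
# The leading %llguidance line routes the grammar to LLGuidance
# instead of the built-in GBNF engine.
llama-cli -m model.gguf -p 'Pick a fruit:' --grammar '%llguidance {}
start: "apple" | "banana" | "cherry"'
```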
For your existing GBNF grammars, you can use the gbnf_to_lark.py script to convert them to the LLGuidance Lark-like format.
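A hypothetical invocation of the converter (the script's actual options are defined in the LLGuidance repository; check them before use):

```sh
# Assumed usage: read a GBNF grammar and emit the Lark-like equivalent.
python gbnf_to_lark.py my_grammar.gbnf
```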
## Performance
Computing a "token mask" (i.e., the set of allowed tokens) for a llama3 tokenizer with 128k tokens takes, on average, 50μs of single-core CPU time for the JSON Schema Bench. The p99 time is 0.5ms, and the p100 time is 20ms. These results are due to the lexer/parser split and several optimizations.
## JSON Schema
LLGuidance adheres closely to the JSON Schema specification. For example:
- `additionalProperties` defaults to `true`, unlike current grammars, though you can set `"additionalProperties": false` if needed.
- Any whitespace is allowed.
- The definition order in the `"properties": {}` object is maintained, regardless of whether properties are required (current grammars always put required properties first).

Unsupported schemas result in an error message; no keywords are silently ignored.
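As an illustration, the following sketch (the model path, prompt, and schema are placeholders) constrains output via `-j`; the generated object lists `name` before `age`, mirroring the schema's declaration order:

```sh
# Sketch: constrain generation to JSON matching this schema.
llama-cli -m model.gguf -p 'Describe a person as JSON.' \
  -j '{
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "age":  { "type": "integer" }
    },
    "required": ["name", "age"]
  }'
```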
## Why Not Reuse GBNF Format?
GBNF lacks the concept of a lexer.
Most programming languages, including JSON, use a two-step process: a lexer (built with regular expressions) converts a byte stream into lexemes, which are then processed by a CFG parser. This approach is faster because lexers are cheaper to evaluate, and there are ~10x fewer lexemes than bytes. LLM tokens often align with lexemes, so the parser is engaged for under 0.5% of tokens, with the lexer handling the rest.
However, the user has to provide the distinction between lexemes and CFG symbols. In Lark, lexeme names are uppercase, while CFG symbols are lowercase. The `gbnf_to_lark.py` script can often take care of this automatically. See the LLGuidance syntax docs for more details.
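For instance, a GBNF grammar like this (an illustrative sketch, not the script's exact output):

```gbnf
root  ::= ident ("," ident)*
ident ::= [a-z]+
```

maps onto Lark syntax with the terminal promoted to an uppercase lexeme:

```lark
start: IDENT ("," IDENT)*
IDENT: /[a-z]+/
```

Here `IDENT` is matched by the lexer as a single lexeme, while the lowercase `start` rule is handled by the CFG parser.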
## Error Handling
Errors are currently printed to `stderr`, and generation continues. Improved error handling may be added in the future.