* Initial XTC commit
Adds XTC sampler, not activated by default, but recommended settings by default.
* Cleanup
* Simplified chances calculation
To be more inline with the original implementation, chance is calculated once at the beginning.
* First fixes by comments
Still need to look into sorting
* Fixed trailing backspaces
* Fixed RNG to be reproduceable
Thanks to @slaren for directions
* Fixed forgotten header
* Moved `min_keep`
Moved from conditions to a simple check at the end.
* Fixed broken randomization
Thanks to @slaren for explanation
* Swapped sorting for a custom algorithm
Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
* Algorithm rework
1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the the bottom.
* Added XTC to `test-sampling`
* Simplified algorithm and more tests
* Updated info in common and args
* Merged back lost commits in common and arg
* Update dump info in common
* Fixed incorrect min_keep check
* Added XTC to README
* Renamed parameters, fixed info and defaults
* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off
* Initial server support
* Added XTC to server UIs
* Fixed labels in old server UI
* Made algorithm safer and more readable
* Removed xtc_threshold_max
* Fixed arg after update
* Quick fixes by comments
* Simplified algorithm since threshold_max is removed
* Renamed random distribution
* Fixed tests and outdated README
* Small fixes
* Vectorize load instructions in dmmv f16 CUDA kernel
Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* server : accept extra_context for the infill endpoint
ggml-ci
* server : update readme [no ci]
* server : use repo-level FIM pattern if possible
ggml-ci
* llama : improve infill support
ggml-ci
* llama : add more FIM token strings
ggml-ci
* server : update prompt on slot restore (#9800)
* gguf : deprecate old FIM token KVs
* ggml : do not use BLAS with types without to_float
* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies
* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits
it's not really internal if everybody uses it
* docs : clarify building Android on Termux
* docs : update building Android on Termux
* docs : add cross-compiling for Android
* cmake : link dl explicitly for Android
* ggml : add metal backend registry / device
ggml-ci
* metal : fix names [no ci]
* metal : global registry and device instances
ggml-ci
* cont : alternative initialization of global objects
ggml-ci
* llama : adapt to backend changes
ggml-ci
* fixes
* metal : fix indent
* metal : fix build when MTLGPUFamilyApple3 is not available
ggml-ci
* fix merge
* metal : avoid unnecessary singleton accesses
ggml-ci
* metal : minor fix [no ci]
* metal : g_state -> g_ggml_ctx_dev_main [no ci]
* metal : avoid reference of device context in the backend context
ggml-ci
* metal : minor [no ci]
* metal : fix maxTransferRate check
* metal : remove transfer rate stuff
---------
Co-authored-by: slaren <slarengh@gmail.com>
* Single allocation of encode_async block with non-ARC capture in ggml-metal.m
* Moving Block_release to the deallocation code
* Release encode block when re-setting encoding buffer count if needed
* Update ggml/src/ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* rerank : use [SEP] token instead of [BOS]
ggml-ci
* common : sanity check for non-NULL tokens
ggml-ci
* ci : adjust rank score interval
ggml-ci
* ci : add shebang to run.sh
ggml-ci
* Add scaffolding for ggml logging macros
* Metal backend now uses GGML logging
* Cuda backend now uses GGML logging
* Cann backend now uses GGML logging
* Add enum tag to parameters
* Use C memory allocation funcs
* Fix compile error
* Use GGML_LOG instead of GGML_PRINT
* Rename llama_state to llama_logger_state
* Prevent null format string
* Fix whitespace
* Remove log callbacks from ggml backends
* Remove cuda log statement