mirror of
https://github.com/ggerganov/llama.cpp.git
synced 2024-12-28 15:18:26 +01:00
213701b51a
* Add llama_detokenize(): - Update header files location - UNKNOWN and CONTROL are 'special pieces' - Remove space after UNKNOWN and CONTROL - Refactor llama_token_to_piece() - Add flag: clean_up_tokenization_spaces - Symmetric params for llama_tokenize() and llama_detokenize() * Update and fix tokenizer tests: - Using llama_detokenize() - Unexpected vocab type as test fail instead of error - Useful when automating tests: - If you don't know in advance the vocab type - Differentiate other loading errors - Skip unicode surrogates and undefined - Gracefully exit threads - Using exit() is throwing random exceptions - Clean old known problematic codepoints - Minor: confusing hexadecimal codepoint * Update bruteforce random tests - Add detokenizer checks - New generator: ascii_lr_strip - New generator: apostrophe - Add more vocabs files - Detokenize special tokens. - Replace errors with '\uFFFD' when detokenizing to 'utf-8' - More edge cases - Better detokenization results check * Fix add_space_prefix, set false by default * Better leading space removal * Do not remove space when decoding special tokens * Bugfix: custom regexs splits undefined unicode codepoints * 'viking' detokenizer clean spaces |
||
---|---|---|
.gitignore | ||
CMakeLists.txt | ||
get-model.cpp | ||
get-model.h | ||
run-json-schema-to-grammar.mjs | ||
test-autorelease.cpp | ||
test-backend-ops.cpp | ||
test-c.c | ||
test-chat-template.cpp | ||
test-double-float.cpp | ||
test-grad0.cpp | ||
test-grammar-integration.cpp | ||
test-grammar-parser.cpp | ||
test-json-schema-to-grammar.cpp | ||
test-llama-grammar.cpp | ||
test-model-load-cancel.cpp | ||
test-opt.cpp | ||
test-quantize-fns.cpp | ||
test-quantize-perf.cpp | ||
test-rope.cpp | ||
test-sampling.cpp | ||
test-tokenizer-0.cpp | ||
test-tokenizer-0.py | ||
test-tokenizer-0.sh | ||
test-tokenizer-1-bpe.cpp | ||
test-tokenizer-1-spm.cpp | ||
test-tokenizer-random.py |
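
Two items in the commit message above, "Skip unicode surrogates and undefined" and "Replace errors with '\uFFFD' when detokenizing to 'utf-8'", can be illustrated with a minimal Python sketch. This is not the code from `test-tokenizer-random.py`. The helper names `is_testable_codepoint` and `decode_lossy` are hypothetical, and the sketch assumes that "undefined" means code points Python's `unicodedata` reports as unassigned.

```python
import unicodedata


def is_testable_codepoint(cp: int) -> bool:
    """Hypothetical helper: reject surrogates (Cs) and unassigned code points (Cn)."""
    return unicodedata.category(chr(cp)) not in ("Cs", "Cn")


def decode_lossy(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, substituting U+FFFD for invalid bytes."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Surrogates cannot be encoded as UTF-8, so a generator would skip them.
    print(is_testable_codepoint(0xD800))
    print(is_testable_codepoint(ord("a")))
    # 0xFF is never valid in UTF-8; it is replaced instead of raising an error.
    print(repr(decode_lossy(b"ab\xff")))
```

Filtering code points before generating test strings keeps the round-trip comparison meaningful, since text containing a lone surrogate cannot be encoded to UTF-8 in the first place.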