* llama.cpp : fix spm whitespace escaping + clean up
* main.cpp : spm - add whitespace in front of prompt
* test-tokenizer-0.cpp : spm - add whitespace in front of prompt
* Add llama_beam_search().
* Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token().
* Add space around * pointers and & references.
* Add spaces around comparison and assignment operators.
* Prefer west const.
* Use llama_ prefix for structs in global namespace.
* Delete obsolete comment from an earlier revision.
* Change eos to eob in llama_beam and llama_beam_view structs.
* server : add n_probs param in chat UI
* server : keep message data array & show in probabilities component
* server : add simple popover component
* server : fix completion_probabilities undefined if n_probs is not set
* server : implement Probabilities
* server : handle bytes
* server : make n_probs max to 10 for easy scroll
* server : adjust for dark/light mode
* server : Fix regenerated prompt
* server : update index.html.hpp
* server : convert prob to percentage + show original value as div title
* server : fix Probabilities not used if an empty str is included
* server : skip byte pair in display probabilities
* server : remove array check of completion_probabilities in messages
* skip empty array or byte pair (> 1) in Probabilities
* generate index.html.hpp
* fix incorrect prob convert if the str is already a known token
* use final response to show probabilities on stop
* revert unnecessary change
* correct probabilities usage
* remove unused function
* always send partial response to get correct probs of last to_send
* fix typo
* fix content of format_final_response
* refactor probs render & make pColor transparent if not found
* send empty string when stop_pos is found in partial
* avoid unnecessary empty data event & send rest of partial tokens on stop
* use <br /> for new line
* skip -1 tok in loop to avoid sending '' at the end
* trim last new lines on stop
* revert unnecessary change
* use hipblas based on cublas
* Update Makefile for the CUDA kernels
* Expand arch list and make it overridable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP
---------
Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
* metal: better memory alloc w/ concurrency dispatch
The ggml-alloc should only free tensors at memory barriers.
* ggml-alloc: avoid return silently
In certain cases, the allocate_node() function may silently return
without performing any memory allocation.
* Implementing strided computation of perplexity
* Alternative way to output PPL results
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
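
The commit message above does not spell out the striding scheme. As a rough illustration only, the sketch below shows the common sliding-window formulation: a window of `n_ctx` tokens advances by `stride`, and each window only scores the tokens that earlier windows have not scored yet. The names `strided_perplexity`, `logprob_fn`, `n_ctx` and `stride` are made up for this sketch and are not the llama.cpp API; the actual implementation may differ.

```python
import math
from typing import Callable, Sequence


def strided_perplexity(
    tokens: Sequence[int],
    logprob_fn: Callable[[Sequence[int], int], float],
    n_ctx: int,
    stride: int,
) -> float:
    """Perplexity over `tokens` using overlapping windows.

    logprob_fn(context, target) is a hypothetical callback returning the
    natural-log probability of `target` given the `context` tokens.
    """
    if not 0 < stride < n_ctx:
        # stride < n_ctx guarantees every token after the first is scored once.
        raise ValueError("need 0 < stride < n_ctx")
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")

    total_logprob = 0.0
    count = 0
    scored_up_to = 1  # token 0 has no context, so scoring starts at index 1
    start = 0
    while scored_up_to < len(tokens):
        end = min(start + n_ctx, len(tokens))
        # Score only the tokens not covered by a previous window.
        for i in range(max(scored_up_to, start + 1), end):
            total_logprob += logprob_fn(tokens[start:i], tokens[i])
            count += 1
        scored_up_to = end
        start += stride
    return math.exp(-total_logprob / count)
```

A smaller `stride` gives each scored token more left context at the cost of more model evaluations.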
* server: allow json array in prompt or content
We accept an array of strings and numbers representing tokens,
in addition to the current string valued prompt or content.
This allows direct token input, so that any special tokens
can be processed and used at the frontend during the construction
of the json data, before sending to the server. And the server
does not need to know or parse special tokens from textual input.
With this, we can use EOS and BOS used in llama-2-chat models.
* server: use tokenizePrompt(json) and default "" if empty prompt
* server: fix prompt check
* server: tokenize endpoint no longer adds BOS
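
To illustrate the array-valued prompt described above, here is a minimal sketch of how a mixed list of strings and token ids could be flattened into one token sequence. It is not the server code: `tokenize_mixed_prompt` and the toy byte-level `toy_tokenize` are hypothetical stand-ins, and the ids 1 and 2 for BOS and EOS are assumptions for the example.

```python
from typing import Callable, List, Sequence, Union

PromptPiece = Union[str, int]


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer for the sketch: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def tokenize_mixed_prompt(
    prompt: Union[str, Sequence[PromptPiece]],
    tokenize: Callable[[str], List[int]] = toy_tokenize,
) -> List[int]:
    """Accept a plain string, or a list mixing strings and raw token ids."""
    if isinstance(prompt, str):
        return tokenize(prompt)
    tokens: List[int] = []
    for piece in prompt:
        if isinstance(piece, bool):
            # bool is a subclass of int in Python; reject it explicitly.
            raise TypeError("prompt pieces must be strings or token ids")
        if isinstance(piece, int):
            tokens.append(piece)  # raw token id, passed through untouched
        elif isinstance(piece, str):
            tokens.extend(tokenize(piece))  # text is tokenized as usual
        else:
            raise TypeError("prompt pieces must be strings or token ids")
    return tokens


# Assumed ids for the example: 1 = BOS, 2 = EOS.
example = tokenize_mixed_prompt([1, "hi", 2])
```

Because special tokens arrive as numbers, the text pieces never need to be scanned for special-token markup.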
* Improve UNK, BOS, EOS token handling when converting without metadata.
* Allow importing as a module.
* Remove some obsolete code and minor cleanups.
* Set default UNK token mapping from -1 to 0 in llama.cpp
* Try to handle overflow due to buggy Windows Python with a better error message
* Improve LLaMA-2 2-, 3- and 4-bit quantization
* Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of
attention.wv and feed_forward.w2
This leads to a slight model size increase as follows:
Q2_K : 2.684G vs 2.670G
Q3_K_S: 2.775G vs 2.745G
Q3_K_M: 3.071G vs 3.057G
Q4_K_S: 3.592G vs 3.563G
LLaMA-2 PPL for context 512 changes as follows:
Q2_K : 6.6691 vs 6.8201
Q3_K_S: 6.2129 vs 6.2584
Q3_K_M: 6.0387 vs 6.1371
Q4_K_S: 5.9138 vs 6.0041
There are improvements for LLaMA-1 as well, but they are
way smaller than the above.
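
The per-layer rule listed in this commit can be sketched as a small lookup. This is only an illustration of the rule as written in the message, not the code in llama.cpp: `boosted_type` is a hypothetical helper, tensors are identified by the short names used above, and layer indices are assumed to be zero-based.

```python
from typing import Optional

# Tensors that get a higher-precision type in their first layers.
BOOSTED_TENSORS = ("attention.wv", "feed_forward.w2")

# Quantization mix -> type used for those tensors in the first two layers.
BOOST_TYPE = {
    "Q2_K": "Q5_K",
    "Q3_K_S": "Q5_K",
    "Q3_K_M": "Q5_K",
    "Q4_K_S": "Q6_K",
}

N_BOOSTED_LAYERS = 2  # "1st 2 layers", assumed to mean layers 0 and 1


def boosted_type(mix: str, tensor: str, layer: int) -> Optional[str]:
    """Return the override type, or None when the mix's default applies."""
    if tensor not in BOOSTED_TENSORS:
        return None
    if not 0 <= layer < N_BOOSTED_LAYERS:
        return None
    return BOOST_TYPE.get(mix)
```

Returning `None` leaves the choice to whatever default the mix already uses, which this sketch does not model.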
* Minor 4-bit quantization improvement
For the same model size as previous commit, we get
PPL = 5.9069 vs 5.9138.
* Some more fine tuning
* Adding make_qkx2_quants
With it, we get PPL = 5.8828 for L2-7B Q4_K_S.
* Another minor improvement
* Q2_K improvement
Smaller model, lower perplexity.
7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G PPL = 6.8201
13B: file size = 5.056G, PPL = 5.4577 vs original 5.130G PPL = 5.7178
It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk,
which are Q2_K
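
For a sense of scale, the Q2_K figures quoted above can be turned into relative changes. The snippet is plain arithmetic on the numbers from the two lines in this commit message; `percent_change` and the row labels are names chosen for the example.

```python
def percent_change(new: float, old: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old


# (new size G, old size G, new PPL, old PPL), copied from the message above.
q2k_rows = {
    "row 1 (7B)": (2.632, 2.670, 6.3772, 6.8201),
    "row 2": (5.056, 5.130, 5.4577, 5.7178),
}

q2k_changes = {
    label: (percent_change(size_new, size_old), percent_change(ppl_new, ppl_old))
    for label, (size_new, size_old, ppl_new, ppl_old) in q2k_rows.items()
}

for label, (size_pct, ppl_pct) in q2k_changes.items():
    print(f"{label}: size {size_pct:+.2f}%, PPL {ppl_pct:+.2f}%")
```

Both rows come out about 1.4% smaller on disk, with perplexity roughly 6.5% and 4.5% lower respectively.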
* Iterating
* Revert Q5_K back to make_qkx1_quants
* Better Q6_K
* make_qkx2_quants is better for Q5_K after all
* Fix after rebasing on master
* Fix for changed tensor names
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>