* Fix ffn_down quantization mix for MoE models
In #4872 I did not consider the part where every third
tensor is quantized with more bits. Fir MoE this leads to tensors
of the same layer being quantized with different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).
* Fix the fix
* Review suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* examples : save-load-state: save only required state
* llama : only reserve n_vocab * n_batch at most for logits
llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.
* llama : always reserve n_vocab * n_batch for logits
llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.
* llama : only save and restore used logits
for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.
* llama : use ostringstream and istringstream for save and load
* llama : serialize rng into minimum amount of space required
* llama : break session version due to serialization changes
* add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens
* remove empty line
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : detect more GPU families
* metal : refactor kernel loading
* metal : set kernel family requirements
* metal : fix kernel init + fix compile options
* metal : take into account simdgroup reduction support
* metal : print only skipped kernels
* metal : fix check for simdgroup reduction support
* metal : check for Metal 3
* metal : free allocations
* metal : normalize encoder:setComputePipelineStatus calls
ggml-ci
* metal : fix Metal3 family check
ggml-ci
* metal : check for simdgroup matrix mul. feature
ggml-ci
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Updated Models Layout
- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left
* trimmed trailing white space
* Updated Models Layout
* common : streamline the formatting of help
- Separate alternative parameters by a comma
- Do not indent `--version` differently
* Update common/common.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Restore intended k-quants quantization mixes for MoE models
* Update Q2_K_S values in the quantize tool
Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* iq2_xs: basics
* iq2_xs: this should have been in the basics
* iq2_xs: CUDA and scalar CPU works
* iq2_xs: WIP Metal
* iq2_xs: Metal now works
* iq2_xs: working, but dog slow, ARM_NEON dot product
* iq2_xs: better ARM_NEON dot product
We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.
* iq2_xs: AVX2 dot product - 19.5 t/s
* iq2_xs: faster AVX2 dit product
21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.
* iq2_xs: had forgotten to delete iq2-data.h
* Add llama enum for IQ2_XS
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: added support for multiple api keys, added loading api keys from file
* minor: fix whitespace
* added file error handling to --api-key-file, changed code to better
reflect current style
* server: update README.md for --api-key-file
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
* used LOG_INFO after successful model loading