* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remote --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverted Makefile
* Fixed include
* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
* removed trailing whitespace
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverting Makefile
* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet
* Removing MIRROR_MODE code for this PR
* Removing last bit of MIRROR_MODE code for this PR
* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
* Fixed lingering init_llama_backend() bool calls in tests and examples
* Remote enum llama_numa_strategies
* Revert bad merge with dynatemp flags
* add missing enum ggml_numa_strategies declaration and revert sync problem with master
* add missing enum ggml_numa_strategies declaration
* fixed ggml_init_numa variable
* Update ggml.h
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h
Align enum values
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
Remove whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
align paremeters
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/server.cpp
remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c
simplified return for platforms without NUMA support
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* removed redundant else from cli argument processing of --numa
* whitespace
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* llama.cpp : split llama_context_params into model and context params
ggml-ci
* fix metal build
* fix freq_base/scale default to model value
* llama-bench : keep the same model between tests when possible
* move n_threads to llama_context_params, add n_threads_batch
* fix mpi build
* remove kv_size(), cuda scratch fixes
* remove low-vram option
* add n_threads_batch to system info, refactor to get_system_info()
* add documentation about --threads-batch to the READMEs
* llama-bench fix
* main : fix rope freq/scale warning
* llama.cpp : add llama_get_model
common : add llama_tokenize from model
* remove duplicated ctx/model functions
ggml-ci
* cuda : print total VRAM used
* detect NUMA systems and pin work threads to nodes (linux)
* disable mmap prefetch/readahead for NUMA systems
* avoid sending finalize op to thread pool if it does nothing
* silence robot
* fix args
* make --numa a param
* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement
* lower synchronization overhead
* statically allocate
* move numa state to g_state
* add description for --numa
* ggml : minor style changes
* ggml : minor style + try fix sanitizer build
* llama : allow to initialize backend with NUMA support
* llama : avoid ggml include in llama-util.h
* ggml : style / formatting
* ggml : fix handling of ops with n_threads > n_tasks > 1
* server : utilize numa parameter
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
Improvements to loading the session with `--prompt-cache` in the `main` example.
1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt.
2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.
* main : add option to save full output to session
* split behavior into --session and --prompt-cache
* restore original implementation with new names
* PR comments
* move the check for incompatible parameters to gpt_params_parse
* Fix whitespace
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
---------
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
- main -> examples
- utils -> examples (renamed to "common")
- quantize -> examples
- separate tools for "perplexity" and "embedding"
Hope I didn't break something !