From e7e03733b201193ab46164b59e71d4a7d1f076ee Mon Sep 17 00:00:00 2001 From: HanClinto Date: Mon, 10 Jun 2024 14:41:56 -0700 Subject: [PATCH 1/4] Updating docs for eval-callback binary to use new `llama-` prefix. --- docs/HOWTO-add-model.md | 2 +- examples/eval-callback/README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/HOWTO-add-model.md b/docs/HOWTO-add-model.md index 138124248..3eec077ea 100644 --- a/docs/HOWTO-add-model.md +++ b/docs/HOWTO-add-model.md @@ -100,7 +100,7 @@ Have a look at existing implementation like `build_llama`, `build_dbrx` or `buil When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR. -Note: to debug the inference graph: you can use [eval-callback](../examples/eval-callback). +Note: to debug the inference graph: you can use [llama-eval-callback](../examples/eval-callback). ## GGUF specification diff --git a/examples/eval-callback/README.md b/examples/eval-callback/README.md index 66a37e878..63a57ad6b 100644 --- a/examples/eval-callback/README.md +++ b/examples/eval-callback/README.md @@ -6,7 +6,7 @@ It simply prints to the console all operations and tensor data. Usage: ```shell -eval-callback \ +llama-eval-callback \ --hf-repo ggml-org/models \ --hf-file phi-2/ggml-model-q4_0.gguf \ --model phi-2-q4_0.gguf \ From 2fd66b2ce28dbb75b26d9b57d4b2aea8a3858ade Mon Sep 17 00:00:00 2001 From: HanClinto Date: Mon, 10 Jun 2024 14:53:23 -0700 Subject: [PATCH 2/4] Updating a few lingering doc references for rename of main to llama-cli --- README.md | 4 ++-- examples/main/README.md | 6 +++--- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index ba1a862dc..6a29899c8 100644 --- a/README.md +++ b/README.md @@ -733,7 +733,7 @@ Here is an example of a few-shot interaction, invoked with the command ./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt ``` -Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `main` example program. +Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `llama-cli` example program. ![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png) @@ -958,7 +958,7 @@ docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m ### Docs -- [main](./examples/main/README.md) +- [main (cli)](./examples/main/README.md) - [server](./examples/server/README.md) - [jeopardy](./examples/jeopardy/README.md) - [BLIS](./docs/BLIS.md) diff --git a/examples/main/README.md b/examples/main/README.md index 4f688358a..61e4a42f7 100644 --- a/examples/main/README.md +++ b/examples/main/README.md @@ -64,7 +64,7 @@ llama-cli.exe -m models\7B\ggml-model.bin --ignore-eos -n -1 ## Common Options -In this section, we cover the most commonly used options for running the `main` program with the LLaMA models: +In this section, we cover the most commonly used options for running the `llama-cli` program with the LLaMA models: - `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`; inferred from `--model-url` if set). 
- `-mu MODEL_URL --model-url MODEL_URL`: Specify a remote http url to download the file (e.g https://huggingface.co/ggml-org/models/resolve/main/phi-2/ggml-model-q4_0.gguf). @@ -74,7 +74,7 @@ In this section, we cover the most commonly used options for running the `main` ## Input Prompts -The `main` program provides several ways to interact with the LLaMA models using input prompts: +The `llama-cli` program provides several ways to interact with the LLaMA models using input prompts: - `--prompt PROMPT`: Provide a prompt directly as a command-line option. - `--file FNAME`: Provide a file containing a prompt or multiple prompts. @@ -82,7 +82,7 @@ The `main` program provides several ways to interact with the LLaMA models using ## Interaction -The `main` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive` and `--interactive-first`. +The `llama-cli` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive` and `--interactive-first`. In interactive mode, users can participate in text generation by injecting their input during the process. Users can press `Ctrl+C` at any time to interject and type their input, followed by pressing `Return` to submit it to the LLaMA model. To submit additional lines without finalizing input, users can end the current line with a backslash (`\`) and continue typing. From 72660c357ca6795106b2cde367751bcc2a50bc5f Mon Sep 17 00:00:00 2001 From: HanClinto Date: Mon, 10 Jun 2024 15:23:32 -0700 Subject: [PATCH 3/4] Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. 
--- examples/perplexity/perplexity.cpp | 2 +- scripts/run-with-preset.py | 16 ++++++++-------- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/examples/perplexity/perplexity.cpp b/examples/perplexity/perplexity.cpp index 0bd78c21a..efde8dfdf 100644 --- a/examples/perplexity/perplexity.cpp +++ b/examples/perplexity/perplexity.cpp @@ -476,7 +476,7 @@ static results_perplexity perplexity(llama_context * ctx, const gpt_params & par } // Download: https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip - // Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw` + // Run `./llama-perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw` // Output: `perplexity: 13.5106 [114/114]` // BOS tokens will be added for each chunk before eval diff --git a/scripts/run-with-preset.py b/scripts/run-with-preset.py index 0d7219113..ee21eab37 100755 --- a/scripts/run-with-preset.py +++ b/scripts/run-with-preset.py @@ -10,7 +10,7 @@ import yaml logger = logging.getLogger("run-with-preset") -CLI_ARGS_MAIN_PERPLEXITY = [ +CLI_ARGS_LLAMA_CLI_PERPLEXITY = [ "batch-size", "cfg-negative-prompt", "cfg-scale", "chunks", "color", "ctx-size", "escape", "export", "file", "frequency-penalty", "grammar", "grammar-file", "hellaswag", "hellaswag-tasks", "ignore-eos", "in-prefix", "in-prefix-bos", "in-suffix", @@ -29,7 +29,7 @@ CLI_ARGS_LLAMA_BENCH = [ "n-prompt", "output", "repetitions", "tensor-split", "threads", "verbose" ] -CLI_ARGS_SERVER = [ +CLI_ARGS_LLAMA_SERVER = [ "alias", "batch-size", "ctx-size", "embedding", "host", "memory-f32", "lora", "lora-base", "low-vram", "main-gpu", "mlock", "model", "n-gpu-layers", "n-probs", "no-mmap", "no-mul-mat-q", "numa", "path", "port", "rope-freq-base", "timeout", "rope-freq-scale", "tensor-split", @@ -37,7 +37,7 @@ CLI_ARGS_SERVER = [ ] description = """Run llama.cpp binaries with presets from YAML file(s). -To specify which binary should be run, specify the "binary" property (main, perplexity, llama-bench, and server are supported). +To specify which binary should be run, specify the "binary" property (llama-cli, llama-perplexity, llama-bench, and llama-server are supported). To get a preset file template, run a llama.cpp binary with the "--logdir" CLI argument. 
 Formatting considerations:
@@ -77,19 +77,19 @@ for yaml_file in known_args.yaml_files:
 
 props = {prop.replace("_", "-"): val for prop, val in props.items()}
 
-binary = props.pop("binary", "main")
+binary = props.pop("binary", "llama-cli")
 if known_args.binary:
     binary = known_args.binary
 
 if os.path.exists(f"./{binary}"):
     binary = f"./{binary}"
 
-if binary.lower().endswith("main") or binary.lower().endswith("perplexity"):
-    cli_args = CLI_ARGS_MAIN_PERPLEXITY
+if binary.lower().endswith("llama-cli") or binary.lower().endswith("llama-perplexity"):
+    cli_args = CLI_ARGS_LLAMA_CLI_PERPLEXITY
 elif binary.lower().endswith("llama-bench"):
     cli_args = CLI_ARGS_LLAMA_BENCH
-elif binary.lower().endswith("server"):
-    cli_args = CLI_ARGS_SERVER
+elif binary.lower().endswith("llama-server"):
+    cli_args = CLI_ARGS_LLAMA_SERVER
 else:
     logger.error(f"Unknown binary: {binary}")
     sys.exit(1)

From 70de0debab4464dd87dc5cb4197ea6d4eb286814 Mon Sep 17 00:00:00 2001
From: HanClinto
Date: Mon, 10 Jun 2024 15:32:21 -0700
Subject: [PATCH 4/4] Updating documentation references for lookup-merge and export-lora

---
 examples/export-lora/README.md   | 2 +-
 examples/lookup/lookup-merge.cpp | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/export-lora/README.md b/examples/export-lora/README.md
index c3a35feac..1fb17feec 100644
--- a/examples/export-lora/README.md
+++ b/examples/export-lora/README.md
@@ -3,7 +3,7 @@
 Apply LORA adapters to base model and export the resulting model.
 
 ```
-usage: export-lora [options]
+usage: llama-export-lora [options]
 
 options:
   -h, --help                         show this help message and exit
diff --git a/examples/lookup/lookup-merge.cpp b/examples/lookup/lookup-merge.cpp
index 07c93eb8d..81e2b0436 100644
--- a/examples/lookup/lookup-merge.cpp
+++ b/examples/lookup/lookup-merge.cpp
@@ -11,14 +11,14 @@
 #include <unordered_map>
 #include <vector>
 
-static void print_usage() {
+static void print_usage(char* argv0) {
     fprintf(stderr, "Merges multiple lookup cache files into a single one.\n");
-    fprintf(stderr, "Usage: lookup-merge [--help] lookup_part_1.bin lookup_part_2.bin ... lookup_merged.bin\n");
+    fprintf(stderr, "Usage: %s [--help] lookup_part_1.bin lookup_part_2.bin ... lookup_merged.bin\n", argv0);
 }
 
 int main(int argc, char ** argv){
     if (argc < 3) {
-        print_usage();
+        print_usage(argv[0]);
         exit(1);
     }
 
@@ -27,7 +27,7 @@ int main(int argc, char ** argv){
     for (int i = 0; i < argc-1; ++i) {
         args[i] = argv[i+1];
         if (args[i] == "-h" || args[i] == "--help") {
-            print_usage();
+            print_usage(argv[0]);
             exit(0);
         }
     }
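
For reference, here is a minimal sketch of driving the renamed binaries through `scripts/run-with-preset.py` after PATCH 3/4. The preset file name and its contents are hypothetical examples; the `binary` property, the supported suffixes (`llama-cli`, `llama-perplexity`, `llama-bench`, `llama-server`), and the `--binary` override are assumed to behave as in the patched script above, with the binaries built in the working directory.

```shell
# chat-preset.yml -- hypothetical preset file; "binary" now takes the new llama- names:
#   binary: llama-cli
#   model: models/7B/ggml-model-q4_0.gguf
#   ctx-size: 2048

# Run the preset from the repo root (assumes ./llama-cli was built there,
# so the script can resolve it via the ./{binary} check):
./scripts/run-with-preset.py chat-preset.yml

# Override the preset's binary on the command line; the script selects the
# argument set by suffix (llama-cli/llama-perplexity, llama-bench, or llama-server):
./scripts/run-with-preset.py chat-preset.yml --binary ./llama-perplexity
```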