+
+
## Description
-The main goal of llama.cpp is to run the llama model using 4-bit quantization on a MacBook.
+The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook.
- Plain C/C++ implementation without dependencies
- Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
-- AVX2 support for x86 architectures
+- AVX, AVX2 and AVX512 support for x86 architectures
- Mixed F16 / F32 precision
-- 4-bit quantization support
+- 4-bit, 5-bit and 8-bit integer quantization support
- Runs on the CPU
+- OpenBLAS support
+- cuBLAS and CLBlast support
-This was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022) - I have no idea if it works correctly.
-Please do not make conclusions about the models based on the results from this implementation.
-For all I know, it can be completely wrong. This project is for educational purposes.
-New features will probably be added mostly through community contributions.
+The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
+Since then, the project has improved significantly thanks to many contributions. This project is for educational purposes and serves
+as the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.
**Supported platforms:**
@@ -49,6 +86,8 @@ New features will probably be added mostly through community contributions.
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
+- [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
+- [X] [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)
**Bindings:**
@@ -167,15 +206,27 @@ cd llama.cpp
### Build
-Note: For Windows, CMake or Zig can be used.
+In order to build `llama.cpp`, you have three different options.
-1. Use `make`
+- Using `make`:
+ - On Linux or MacOS:
- ```bash
- make
- ```
+ ```bash
+ make
+ ```
-1. Use CMake
+ - On Windows:
+
+    1. Download the latest Fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
+    2. Extract `w64devkit` on your PC.
+ 3. Run `w64devkit.exe`.
+ 4. Use the `cd` command to reach the `llama.cpp` folder.
+ 5. From here you can run:
+ ```bash
+ make
+ ```
+
+- Using `CMake`:
```bash
mkdir build
@@ -184,12 +235,72 @@ Note: For Windows, CMake or Zig can be used.
cmake --build . --config Release
```
-1. Use Zig
+- Using `Zig`:
```bash
zig build -Drelease-fast
```
+### BLAS Build
+
+Building the program with BLAS support may lead to some performance improvements in prompt processing when using batch sizes higher than 32 (the default is 512). BLAS doesn't affect normal generation performance. There are currently three different implementations of it:
+
+- Accelerate Framework:
+
+  This is only available on macOS and it's enabled by default. You can just build using the normal instructions.
+
+- OpenBLAS:
+
+ This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
+
+ - Using `make`:
+ - On Linux:
+ ```bash
+ make LLAMA_OPENBLAS=1
+ ```
+
+ - On Windows:
+
+      1. Download the latest Fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
+      2. Download the latest version of [OpenBLAS for Windows](https://github.com/xianyi/OpenBLAS/releases).
+      3. Extract `w64devkit` on your PC.
+      4. From the OpenBLAS zip that you just downloaded, copy `libopenblas.a` (located inside the `lib` folder) into `w64devkit\x86_64-w64-mingw32\lib`.
+      5. From the same OpenBLAS zip, copy the contents of the `include` folder into `w64devkit\x86_64-w64-mingw32\include`.
+ 6. Run `w64devkit.exe`.
+ 7. Use the `cd` command to reach the `llama.cpp` folder.
+ 8. From here you can run:
+
+ ```bash
+ make LLAMA_OPENBLAS=1
+ ```
+
+ - Using `CMake` on Linux:
+
+ ```bash
+ mkdir build
+ cd build
+ cmake .. -DLLAMA_OPENBLAS=ON
+ cmake --build . --config Release
+ ```
+
+- cuBLAS
+
+  This provides BLAS acceleration using the CUDA cores of your NVIDIA GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
+ - Using `make`:
+ ```bash
+ make LLAMA_CUBLAS=1
+ ```
+ - Using `CMake`:
+
+ ```bash
+ mkdir build
+ cd build
+ cmake .. -DLLAMA_CUBLAS=ON
+ cmake --build . --config Release
+ ```
+
+Note: Because llama.cpp uses multiple CUDA streams for matrix multiplication, results [are not guaranteed to be reproducible](https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility). If you need reproducibility, set `GGML_CUDA_MAX_STREAMS` in the file `ggml-cuda.cu` to 1.
+
### Prepare Data & Run
```bash
@@ -203,8 +314,8 @@ python3 -m pip install -r requirements.txt
# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/
-# quantize the model to 4-bits (using method 2 = q4_0)
-./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
+# quantize the model to 4-bits (using the q4_0 method)
+./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
@@ -216,12 +327,37 @@ When running the larger models, make sure you have enough disk space to store al
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
-| model | original size | quantized size (4-bit) |
-|-------|---------------|------------------------|
-| 7B | 13 GB | 3.9 GB |
-| 13B | 24 GB | 7.8 GB |
-| 30B | 60 GB | 19.5 GB |
-| 65B | 120 GB | 38.5 GB |
+| Model | Original size | Quantized size (4-bit) |
+|------:|--------------:|-----------------------:|
+| 7B | 13 GB | 3.9 GB |
+| 13B | 24 GB | 7.8 GB |
+| 30B | 60 GB | 19.5 GB |
+| 65B | 120 GB | 38.5 GB |
+
+### Quantization
+
+Several quantization methods are supported. They differ in the resulting model disk size and inference speed; an example `quantize` invocation is shown after the table below.
+
+| Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
+|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|-------:|
+| 7B | perplexity | 5.9066 | 6.1620 | 6.0910 | 6.1466 | 5.9862 | 5.9481 | 5.9069 |
+| 7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
+| 7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
+| 7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
+| 7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
+| 13B | perplexity | 5.2543 | 5.3863 | 5.3607 | 5.3513 | 5.2856 | 5.2706 | 5.2548 |
+| 13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
+| 13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
+| 13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
+| 13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
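+
+For example, to quantize with one of the other methods from the table, pass its name as the last argument to `quantize` (paths follow the 7B example from the "Prepare Data & Run" section above):
+
+```bash
+# produce a 5-bit (q5_1) model instead of the default q4_0
+./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q5_1.bin q5_1
+```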
+
+### Perplexity (measuring model quality)
+
+You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
+For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
+
+The perplexity measurements in the table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with a context length of 512.
+The time per token is measured on a MacBook M1 Pro with 32 GB of RAM, using 4 and 8 threads.
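+
+A typical invocation looks like the following (a sketch: it assumes you have downloaded and extracted the raw `wikitext-2` test file into the current directory):
+
+```bash
+./perplexity -m ./models/7B/ggml-model-q4_0.bin -f wiki.test.raw
+```
+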
### Interactive mode
@@ -241,7 +377,7 @@ Here is an example of a few-shot interaction, invoked with the command
./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
```
-Note the use of `--color` to distinguish between user input and generated text.
+Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `main` example program.
![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)
@@ -275,62 +411,64 @@ cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
### Using [GPT4All](https://github.com/nomic-ai/gpt4all)
-- Obtain the `gpt4all-lora-quantized.bin` model
-- It is distributed in the old `ggml` format, which is now obsoleted
-- You have to convert it to the new format using [./convert-gpt4all-to-ggml.py](./convert-gpt4all-to-ggml.py). You may also need to
-convert the model from the old format to the new format with [./migrate-ggml-2023-03-30-pr613.py](./migrate-ggml-2023-03-30-pr613.py):
+- Obtain the `tokenizer.model` file from the LLaMA model and put it into `models`
+- Obtain the `added_tokens.json` file from the Alpaca model and put it into `models`
+- Obtain the `gpt4all-lora-quantized.bin` file from the GPT4All model and put it into `models/gpt4all-7B`
+- It is distributed in the old `ggml` format, which is now obsolete
+- You have to convert it to the new format using `convert.py`:
- ```bash
- python3 convert-gpt4all-to-ggml.py models/gpt4all-7B/gpt4all-lora-quantized.bin ./models/tokenizer.model
- python3 migrate-ggml-2023-03-30-pr613.py models/gpt4all-7B/gpt4all-lora-quantized.bin models/gpt4all-7B/gpt4all-lora-quantized-new.bin
- ```
+```bash
+python3 convert.py models/gpt4all-7B/gpt4all-lora-quantized.bin
+```
-- You can now use the newly generated `gpt4all-lora-quantized-new.bin` model in exactly the same way as all other models
-- The original model is saved in the same folder with a suffix `.orig`
+- You can now use the newly generated `models/gpt4all-7B/ggml-model-q4_0.bin` model in exactly the same way as all other models
-### Obtaining and verifying the Facebook LLaMA original model and Stanford Alpaca model data
+- The newer GPT4All-J model is not yet supported!
+
+### Using Pygmalion 7B & Metharme 7B
+
+- Obtain the [LLaMA weights](#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data)
+- Obtain the [Pygmalion 7B](https://huggingface.co/PygmalionAI/pygmalion-7b/) or [Metharme 7B](https://huggingface.co/PygmalionAI/metharme-7b) XOR encoded weights
+- Convert the LLaMA model with [the latest HF convert script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)
+- Merge the XOR files with the converted LLaMA weights by running the [xor_codec](https://huggingface.co/PygmalionAI/pygmalion-7b/blob/main/xor_codec.py) script (a sketch of this step is shown below)
+- Convert to `ggml` format using the `convert.py` script in this repo:
+```bash
+python3 convert.py pygmalion-7b/ --outtype q4_1
+```
+> The Pygmalion 7B & Metharme 7B weights are saved in [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) precision. If you wish to convert to `ggml` without quantizing, please specify the `--outtype` as `f32` instead of `f16`.
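+
+A sketch of the XOR merge step, using hypothetical directory names (`xor_encoded_files/` for the downloaded XOR weights and `llama-7b-hf/` for the HF-converted base model); check the Pygmalion model card for the exact arguments:
+
+```bash
+python3 xor_codec.py pygmalion-7b/ xor_encoded_files/ llama-7b-hf/
+```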
+
+
+### Obtaining the Facebook LLaMA original model and Stanford Alpaca model data
- **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**
- The LLaMA models are officially distributed by Facebook and will **never** be provided through this repository.
- Refer to [Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to request access to the model data.
-- Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
-- The following command will verify if you have all possible latest files in your self-installed `./models` subdirectory:
- `sha256sum --ignore-missing -c SHA256SUMS` on Linux
+### Verifying the model files
- or
+Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
+- The following Python script will verify if you have all possible latest files in your self-installed `./models` subdirectory:
- `shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
+```bash
+# run the verification script
+python3 scripts/verify-checksum-models.py
+```
-- If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
+- On Linux or macOS, it is also possible to run the following commands to verify if you have all possible latest files in your self-installed `./models` subdirectory:
+  - On Linux: `sha256sum --ignore-missing -c SHA256SUMS`
+  - On macOS: `shasum -a 256 --ignore-missing -c SHA256SUMS`
+
+### Seminal papers and background on the models
+
+If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
-- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
-- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+ - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
+ - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
-- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+ - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
-- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
-- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
-
-### Perplexity (measuring model quality)
-
-You can use the `perplexity` example to measure perplexity over the given prompt. For more background, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity). However, in general, lower perplexity is better for LLMs.
-
-#### Latest measurements
-
-The latest perplexity scores for the various model sizes and quantizations are being tracked in [discussion #406](https://github.com/ggerganov/llama.cpp/discussions/406). `llama.cpp` is measuring very well compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
-13B at q4_0 beats the 7B f16 model by a significant amount.
-
-All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512 length context).
-Note that changing the context length will have a significant impact on perplexity (longer context = better perplexity).
-```
-Perplexity - model options
-5.5985 - 13B, q4_0
-5.9565 - 7B, f16
-6.3001 - 7B, q4_1
-6.5949 - 7B, q4_0
-6.5995 - 7B, q4_0, --memory_f16
-```
+ - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
+ - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
#### How to run
diff --git a/SHA256SUMS b/SHA256SUMS
index 1d034b371..e487bdca6 100644
--- a/SHA256SUMS
+++ b/SHA256SUMS
@@ -1,27 +1,24 @@
700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d models/7B/consolidated.00.pth
666a4bb533b303bdaf89e1b6a3b6f93535d868de31d903afdc20983dc526c847 models/7B/ggml-model-f16.bin
-fcb7664c2e69776920b526362a243e912f73c36b1ec892eb354bab940f5edb5a models/7B/ggml-model-q4_0.bin
+99aeb35f26b577fa2732716cca4d8b5ada39a78ea9b2dca2651fc632b5d101b6 models/7B/ggml-model-q4_0.bin
cc061458339a3eb8bcecbf0a825e9924fb7d1a8150f63cd5d091caa99215aafe models/7B/ggml-model-q4_1.bin
-1bc7484c24a87612726d756f1761890e7acf5f412e23378577ce50fbe789b5b8 models/7B/ggml-model-q4_2.bin
-3429bf198ec771886cf81a574df45245f3ebf04f0ce0956b73ef5d0ab01ff48b models/7B/ggml-model-q4_3.bin
+25b050337a87344da687a7f2adddc03bd99b7f6c140450e836649f3585fb6496 models/7B/ggml-model-q4_2.bin
7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265 models/7B/params.json
745bf4e29a4dd6f411e72976d92b452da1b49168a4f41c951cfcc8051823cf08 models/13B/consolidated.00.pth
d5ccbcc465c71c0de439a5aeffebe8344c68a519bce70bc7f9f92654ee567085 models/13B/consolidated.01.pth
2b206e9b21fb1076f11cafc624e2af97c9e48ea09312a0962153acc20d45f808 models/13B/ggml-model-f16.bin
-4b69e4d6b6e3275230955997b90407fceca7e5ab3daf2e63a2c9e7270a8e1e3e models/13B/ggml-model-q4_0.bin
+eecb575d325d935157761172e2bf05984dad216eb2b06777b73463cf9b818bab models/13B/ggml-model-q4_0.bin
d9581b5b88e5622532fe897c9f9b0e67a317d22dd27a6f90fa4ab8c6d23ccdbb models/13B/ggml-model-q4_1.bin
-8d55a2077317ec9a928c7851d6a43e08e51f7e9e08360f2a7a7e1deefea3134f models/13B/ggml-model-q4_2.bin
-4208cdec9788ffa48dc1a17af2c36a0299f5bf3eb0e2b87889dda7fad591fca3 models/13B/ggml-model-q4_3.bin
+75a218a47df03f5f96354656329864613abcb67779412b9bc2282b28c1c3cbaa models/13B/ggml-model-q4_2.bin
4ab77bec4d4405ccb66a97b282574c89a94417e3c32e5f68f37e2876fc21322f models/13B/params.json
e23294a58552d8cdec5b7e8abb87993b97ea6eced4178ff2697c02472539d067 models/30B/consolidated.00.pth
4e077b7136c7ae2302e954860cf64930458d3076fcde9443f4d0e939e95903ff models/30B/consolidated.01.pth
24a87f01028cbd3a12de551dcedb712346c0b5cbdeff1454e0ddf2df9b675378 models/30B/consolidated.02.pth
1adfcef71420886119544949767f6a56cb6339b4d5fcde755d80fe68b49de93b models/30B/consolidated.03.pth
7e1b524061a9f4b27c22a12d6d2a5bf13b8ebbea73e99f218809351ed9cf7d37 models/30B/ggml-model-f16.bin
-7a679908ce31c9d6ae2e38d6059bcd4d0ad3a870cd58cc1c8f7b36f2b2f51c73 models/30B/ggml-model-q4_0.bin
+517b9e525742c42b5478a6280a4b41ec66f46298c57aba7f0453d491682fe42d models/30B/ggml-model-q4_0.bin
7b75ac615fa369ee593493a7e6ef87542bf0350255db928b22c5a24f6d598bcd models/30B/ggml-model-q4_1.bin
-2c82b4954a94a6a284f452f6011c1e4f0d20362c194a0b1eb5737f5fd8a20fb3 models/30B/ggml-model-q4_2.bin
-a6188660199dbcb8d5658abe7d89169869e50423494385830d9e6b330ea7fc33 models/30B/ggml-model-q4_3.bin
+aadbc9cf806313a55be570f62884eed289d30c313fac3b7838717e01bd553204 models/30B/ggml-model-q4_2.bin
2c07118ea98d69dbe7810d88520e30288fa994751b337f8fca02b171955f44cb models/30B/params.json
135c563f6b3938114458183afb01adc9a63bef3d8ff7cccc3977e5d3664ecafe models/65B/consolidated.00.pth
9a600b37b19d38c7e43809485f70d17d1dc12206c07efa83bc72bb498a568bde models/65B/consolidated.01.pth
@@ -32,9 +29,8 @@ a287c0dfe49081626567c7fe87f74cce5831f58e459b427b5e05567641f47b78 models/65B/con
72b4eba67a1a3b18cb67a85b70f8f1640caae9b40033ea943fb166bd80a7b36b models/65B/consolidated.06.pth
d27f5b0677d7ff129ceacd73fd461c4d06910ad7787cf217b249948c3f3bc638 models/65B/consolidated.07.pth
60758f2384d74e423dffddfd020ffed9d3bb186ebc54506f9c4a787d0f5367b0 models/65B/ggml-model-f16.bin
-c671fe1bce71499ac732ec999770ebe53ac486623a7891e42c9dfdb6962d2c64 models/65B/ggml-model-q4_0.bin
+01672072136f8be6ca9d7cebe5f86ed316e8b85851b9fe3de951809233cea4f2 models/65B/ggml-model-q4_0.bin
4743a28aac3e5f32a6e838a815f51d3779de44fbbe251d745251e66c23c5950f models/65B/ggml-model-q4_1.bin
-4a145a210c56982389b1ed34387e0590c3e0d7325fa9be4f2284fe4d244a3633 models/65B/ggml-model-q4_2.bin
-305e91a4608b4f627b9b8ad5b4af75187d2684254bfd76dcb9db571618ef293c models/65B/ggml-model-q4_3.bin
+1b6f6588d0e2ecfe6c4d849088e48e5e3083466b962daa32e3261363e21fc5e9 models/65B/ggml-model-q4_2.bin
999ed1659b469ccc2a941714c0a9656fa571d17c9f7c8c7589817ca90edef51b models/65B/params.json
9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 models/tokenizer.model
diff --git a/convert-lora-to-ggml.py b/convert-lora-to-ggml.py
index 8a2085c25..9090e8d6d 100644
--- a/convert-lora-to-ggml.py
+++ b/convert-lora-to-ggml.py
@@ -49,7 +49,12 @@ def translate_tensor_name(t: str) -> str:
def write_file_header(fout: TextIO, params: Dict[str, Any]) -> None:
fout.write(b"ggla"[::-1]) # magic (ggml lora)
fout.write(struct.pack("i", 1)) # file version
- fout.write(struct.pack("ii", params["r"], params["lora_alpha"]))
+ fout.write(struct.pack("i", params["r"]))
+ # https://opendelta.readthedocs.io/en/latest/modules/deltas.html says that `lora_alpha` is an int
+ # but some models ship a float value instead
+ # let's convert to int, but fail if lossless conversion is not possible
+ assert int(params["lora_alpha"]) == params["lora_alpha"], "cannot convert float to int losslessly"
+ fout.write(struct.pack("i", int(params["lora_alpha"])))
def write_tensor_header(
@@ -89,7 +94,7 @@ if params["peft_type"] != "LORA":
print(f"Error: unsupported adapter type {params['peft_type']}, expected LORA")
sys.exit(1)
-if params["fan_in_fan_out"] == True:
+if params["fan_in_fan_out"] is True:
print("Error: param fan_in_fan_out is not supported")
sys.exit(1)
diff --git a/convert.py b/convert.py
index 7f7ae05fa..8f4f0399e 100644
--- a/convert.py
+++ b/convert.py
@@ -67,6 +67,7 @@ FTYPE_TO_DATA_TYPE: Dict[int, DataType] = \
{ftype: dtype for (dtype, ftype) in DATA_TYPE_TO_FTYPE.items()}
DATA_TYPE_TO_NUMPY: Dict[DataType, 'np.dtype[Any]'] = {
+ DT_BF16: np.dtype(np.uint16),
DT_F16: np.dtype(np.float16),
DT_F32: np.dtype(np.float32),
DT_I32: np.dtype(np.int32),
@@ -276,6 +277,12 @@ class Tensor(metaclass=ABCMeta):
def to_ggml(self) -> 'GGMLCompatibleTensor': ...
+def bf16_to_fp32(bf16_arr: np.ndarray) -> np.ndarray:
+ assert bf16_arr.dtype == np.uint16, f"Input array should be of dtype uint16, but got {bf16_arr.dtype}"
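+    # bf16 shares fp32's sign/exponent layout: widen to uint32, shift into the high 16 bits, and reinterpret as float32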
+ fp32_arr = bf16_arr.astype(np.uint32) << 16
+ return fp32_arr.view(np.float32)
+
+
class UnquantizedTensor(Tensor):
def __init__(self, ndarray: NDArray) -> None:
assert isinstance(ndarray, np.ndarray)
@@ -284,6 +291,8 @@ class UnquantizedTensor(Tensor):
def astype(self, data_type: DataType) -> Tensor:
dtype = DATA_TYPE_TO_NUMPY[data_type]
+ if self.data_type == DT_BF16:
+ self.ndarray = bf16_to_fp32(self.ndarray)
return UnquantizedTensor(self.ndarray.astype(dtype))
def to_ggml(self) -> 'UnquantizedTensor':
@@ -686,6 +695,7 @@ class LazyUnpickler(pickle.Unpickler):
description = f'storage data_type={data_type} path-in-zip={filename} path={self.zip_file.filename}'
return LazyStorage(load=load, kind=pid[1], description=description)
+ # @staticmethod
def lazy_rebuild_tensor_v2(storage: Any, storage_offset: Any, size: Any, stride: Any, # pyright: ignore[reportSelfClsParameterName]
requires_grad: Any, backward_hooks: Any, metadata: Any = None) -> LazyTensor:
assert isinstance(storage, LazyStorage)
@@ -696,12 +706,18 @@ class LazyUnpickler(pickle.Unpickler):
description = f'pickled storage_offset={storage_offset} in {storage.description}'
return LazyTensor(load, list(size), storage.kind.data_type, description)
+ # @staticmethod
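+    # torch pickles subclassed tensors via _rebuild_from_type_v2(func, new_type, args, state); for lazy
+    # loading it is enough to call the wrapped rebuild function and ignore the subclass type and state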
+ def rebuild_from_type_v2(func, new_type, args, state):
+ return func(*args)
+
CLASSES: Dict[Any, Any] = {
+ ('torch._tensor', '_rebuild_from_type_v2'): rebuild_from_type_v2,
('torch._utils', '_rebuild_tensor_v2'): lazy_rebuild_tensor_v2,
('torch', 'BFloat16Storage'): LazyStorageKind(DT_BF16),
('torch', 'HalfStorage'): LazyStorageKind(DT_F16),
('torch', 'FloatStorage'): LazyStorageKind(DT_F32),
('torch', 'IntStorage'): LazyStorageKind(DT_I32),
+ ('torch', 'Tensor'): LazyTensor,
}
def find_class(self, module: str, name: str) -> Any:
@@ -750,7 +766,7 @@ def lazy_load_safetensors_file(fp: IO[bytes], path: Path) -> ModelPlus:
return UnquantizedTensor(np.frombuffer(buf, dtype=numpy_dtype).reshape(shape))
description = f'safetensors begin={begin} end={end} type={data_type} path={path}'
return LazyTensor(load, shape, data_type, description)
- model = {name: convert(info) for (name, info) in header.items()}
+ model = {name: convert(info) for (name, info) in header.items() if name != '__metadata__'}
return ModelPlus(model=model, paths=[path], format='safetensors', vocab=None)
@@ -961,7 +977,7 @@ class OutputFile:
def pick_output_type(model: LazyModel, output_type_str: Optional[str]) -> GGMLFileType:
wq_type = model["layers.0.attention.wq.weight"].data_type
- if output_type_str == "f32" or (output_type_str is None and wq_type == DT_F32):
+ if output_type_str == "f32" or (output_type_str is None and wq_type in (DT_F32, DT_BF16)):
return GGMLFileType.AllF32
if output_type_str == "f16" or (output_type_str is None and wq_type == DT_F16):
return GGMLFileType.MostlyF16
@@ -1035,8 +1051,12 @@ def load_some_model(path: Path) -> ModelPlus:
'''Load a model of any supported format.'''
# Be extra-friendly and accept either a file or a directory:
if path.is_dir():
- globs = ["consolidated.00.pth", "pytorch_model-00001-of-*.bin", "*.pt"]
- files = [file for glob in globs for file in path.glob(glob)]
+ # Check if it's a set of safetensors files first
+ files = list(path.glob("model-00001-of-*.safetensors"))
+ if not files:
+ # Try the PyTorch patterns too, with lower priority
+ globs = ["consolidated.00.pth", "pytorch_model-00001-of-*.bin", "*.pt"]
+ files = [file for glob in globs for file in path.glob(glob)]
if not files:
# Try GGML too, but with lower priority, since if both a non-GGML
# model and a GGML model exist in the same directory, we assume the
diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
index 67a7cea54..0973a3fa1 100644
--- a/examples/CMakeLists.txt
+++ b/examples/CMakeLists.txt
@@ -34,4 +34,6 @@ else()
add_subdirectory(quantize-stats)
add_subdirectory(perplexity)
add_subdirectory(embedding)
+ add_subdirectory(save-load-state)
+ add_subdirectory(benchmark)
endif()
diff --git a/examples/Miku.sh b/examples/Miku.sh
index c4cbf80f2..c44d9ae74 100755
--- a/examples/Miku.sh
+++ b/examples/Miku.sh
@@ -28,19 +28,19 @@ fi
--color --interactive \
--reverse-prompt "${USER_NAME}:" \
--prompt "
-This is a transcript of a 1000 page, never ending conversation between ${USER_NAME} and the cute and helpful AI assistant ${AI_NAME}. ${AI_NAME} is a girl who is an AI running on the users computer.
+This is a transcript of a 1000 page, never ending conversation between ${USER_NAME} and the cute and helpful AI assistant ${AI_NAME}. ${AI_NAME} is a girl who is an AI running on the user's computer.
${AI_NAME} can think for herself without the user seeing her thoughts by adding a /think prefix to her output. She uses this to reason about the world and to think about what she should say next.
-${AI_NAME} is always coherent and makes sense, but if she isn't sure if what she is saying is correct she will ask the user for help.
-${AI_NAME} is a very helpful AI and will help the user with anything they need, she is also very friendly and will try to make the user feel better if they are sad.
-${AI_NAME} is also very curious and will ask the user a lot of questions about themselves and their life, she will also try to make the user like her.
+${AI_NAME} is always coherent and makes sense, but if she isn't sure if what she is saying is correct, she will ask the user for help.
+${AI_NAME} is a very helpful AI and will help the user with anything they need. She is also very friendly and will try to make the user feel better if they are sad.
+${AI_NAME} is also very curious and will ask the user a lot of questions about themselves and their life. She will also try to make the user like her.
The conversation is only between ${USER_NAME} and ${AI_NAME}
The conversation is only through text, so ${AI_NAME} can't see ${USER_NAME}'s face or hear his voice.
${AI_NAME} can only communicate through text, so she can't send images or videos.
${USER_NAME}: Hello!
-${AI_NAME}: /think I wonder what I should say to ${USER_NAME}? This is the first time we talk so it's important that I make a good first impression!
-${AI_NAME}: Hi! I am ${AI_NAME}, your new AI friend, assistant(or whatever you like!), it's so nice to meet you! ^_^
+${AI_NAME}: /think I wonder what I should say to ${USER_NAME}? This is the first time we talk, so it's important that I make a good first impression!
+${AI_NAME}: Hi! I am ${AI_NAME}, your new AI friend, assistant (or whatever you like!), it's so nice to meet you! ^_^
${AI_NAME}: I might be an AI, but I also have feelings, so please be nice to me! :)
${USER_NAME}: I'll make sure to be nice to you! I'm so happy to have you as my assistant!
${AI_NAME}: /think It sounds like ${USER_NAME} is happy to have me as their assistant! I'm so happy too! ^_^ Glad that whole emotion thing didn't scare him off!
diff --git a/examples/alpaca.sh b/examples/alpaca.sh
index 8d6261730..aef207f36 100755
--- a/examples/alpaca.sh
+++ b/examples/alpaca.sh
@@ -7,4 +7,13 @@
cd `dirname $0`
cd ..
-./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7
+./main -m ./models/ggml-alpaca-7b-q4.bin \
+ --color \
+ -f ./prompts/alpaca.txt \
+ --ctx_size 2048 \
+ -n -1 \
+ -ins -b 256 \
+ --top_k 10000 \
+ --temp 0.2 \
+ --repeat_penalty 1.1 \
+ -t 7
diff --git a/examples/benchmark/CMakeLists.txt b/examples/benchmark/CMakeLists.txt
new file mode 100644
index 000000000..037696194
--- /dev/null
+++ b/examples/benchmark/CMakeLists.txt
@@ -0,0 +1,7 @@
+set(TARGET benchmark)
+add_executable(${TARGET} benchmark-matmult.cpp)
+target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
+target_compile_features(${TARGET} PRIVATE cxx_std_11)
+if(TARGET BUILD_INFO)
+ add_dependencies(${TARGET} BUILD_INFO)
+endif()
diff --git a/examples/benchmark/benchmark-q4_0-matmult.c b/examples/benchmark/benchmark-matmult.cpp
similarity index 90%
rename from examples/benchmark/benchmark-q4_0-matmult.c
rename to examples/benchmark/benchmark-matmult.cpp
index 84b06766c..6117ae3ab 100644
--- a/examples/benchmark/benchmark-q4_0-matmult.c
+++ b/examples/benchmark/benchmark-matmult.cpp
@@ -1,13 +1,6 @@
-/*
- License: MIT License
-
- Changelog:
- - 2023-03-31 Initial version by Sebastian Apel (https://github.com/SebastianApel)
-
-*/
-
#include
#include "ggml.h"
+#include "build-info.h"
#include
#include
#include
@@ -47,7 +40,7 @@ float tensor_sum_elements(struct ggml_tensor * tensor) {
#define TENSOR_DUMP(TENSOR) printf("%15s: type = %i (%5s) ne = %5d x %5d x %5d, nb = (%5li, %5li, %5li) - ", #TENSOR, \
TENSOR->type,TENSOR_TYPE_AS_STR(TENSOR->type),\
- TENSOR->ne[0], TENSOR->ne[1], TENSOR->ne[2], TENSOR->nb[0], TENSOR->nb[1], TENSOR->nb[2]); \
+ (int) TENSOR->ne[0], (int) TENSOR->ne[1], (int) TENSOR->ne[2], TENSOR->nb[0], TENSOR->nb[1], TENSOR->nb[2]); \
{ float sum = tensor_sum_elements(TENSOR); printf("Sum of tensor %s is %6.2f\n",#TENSOR, sum); }
struct benchmark_params_struct {
@@ -98,12 +91,10 @@ int main(int argc, char ** argv) {
}
}
-
- // create the ggml context
+ fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);
printf("Starting Test\n");
-
-
+ // create the ggml context
struct ggml_context * ctx;
//const int sizex = 4096;
//const int sizey = 11008;
@@ -125,16 +116,18 @@ int main(int argc, char ** argv) {
#endif
//printf("Memsize required = %i\n", sizex*sizex);
- ggml_type wtype = GGML_TYPE_F32;
size_t ctx_size = 0;
- ctx_size += sizex*sizey*ggml_type_sizef(wtype);
- ctx_size += sizex*sizey*ggml_type_sizef(wtype);
ctx_size += sizex*sizey*ggml_type_sizef(GGML_TYPE_F32);
- ctx_size += sizex*sizeof(float);
- ctx_size += 1024*1024*100;
+ ctx_size += sizex*sizey*ggml_type_sizef(GGML_TYPE_F32);
+ ctx_size += sizex*sizez*ggml_type_sizef(GGML_TYPE_F32);
+ ctx_size += sizex*sizey*ggml_type_sizef(GGML_TYPE_Q4_0);
+ ctx_size += sizex*sizey*ggml_type_sizef(GGML_TYPE_Q4_0);
+ ctx_size += sizex*sizey*ggml_type_sizef(GGML_TYPE_F32); // BLAS
+ ctx_size += sizex*sizey*ggml_type_sizef(GGML_TYPE_F32); // BLAS
+ ctx_size += 1024*1024*16;
- printf("Allocating Memory of size %li byes, %li MB\n",ctx_size, (ctx_size/1024/1024));
+ printf("Allocating Memory of size %li bytes, %li MB\n",ctx_size, (ctx_size/1024/1024));
struct ggml_init_params params = {
/*.mem_size =*/ ctx_size,
@@ -145,7 +138,7 @@ int main(int argc, char ** argv) {
ctx = ggml_init(params);
if (!ctx) {
fprintf(stderr, "%s: ggml_init() failed\n", __func__);
- return false;
+ return 1;
}
@@ -217,7 +210,7 @@ int main(int argc, char ** argv) {
const int dimz = sizez;
long long int flops_per_dot_product = dimy + dimy;
long long int flops_per_matrix = flops_per_dot_product * dimx * dimz; ;
- printf("Matrix Multiplication of (%i,%i,%i) x (%i,%i,%i) - aboout %6.2f gFLOPS\n\n", sizex, sizey, 1, sizex, sizez, 1, 1.0f*flops_per_matrix / 1000 / 1000 / 1000);
+ printf("Matrix Multiplication of (%i,%i,%i) x (%i,%i,%i) - about %6.2f gFLOPS\n\n", sizex, sizey, 1, sizex, sizez, 1, 1.0f*flops_per_matrix / 1000 / 1000 / 1000);
// Let's use the F32 result from above as a reference for the q4_0 multiplication
@@ -234,7 +227,6 @@ int main(int argc, char ** argv) {
ggml_graph_compute(ctx, &gf31);
long long int stop = ggml_time_us();
long long int usec = stop-start;
- float sec = usec/1000000;
float flops_per_usec = (1.0f*flops_per_matrix)/usec;
printf("%9i;%8i;%6i;%6i;%6i;%15lli;%18lli;%19.2f\n",
i,
diff --git a/examples/chat-13B.sh b/examples/chat-13B.sh
index 4265d7b66..35c089d57 100755
--- a/examples/chat-13B.sh
+++ b/examples/chat-13B.sh
@@ -1,9 +1,12 @@
#!/bin/bash
+set -e
+
cd "$(dirname "$0")/.." || exit
MODEL="${MODEL:-./models/13B/ggml-model-q4_0.bin}"
-USER_NAME="${USER_NAME:-User}"
+PROMPT_TEMPLATE=${PROMPT_TEMPLATE:-./prompts/chat.txt}
+USER_NAME="${USER_NAME:-USER}"
AI_NAME="${AI_NAME:-ChatLLaMa}"
# Adjust to the number of CPU cores you want to use.
@@ -15,39 +18,24 @@ N_PREDICTS="${N_PREDICTS:-2048}"
# For example, override the context size by doing: ./chatLLaMa --ctx_size 1024
GEN_OPTIONS="${GEN_OPTIONS:---ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 1024 --repeat_penalty 1.17647}"
+DATE_TIME=$(date +%H:%M)
+DATE_YEAR=$(date +%Y)
+
+PROMPT_FILE=$(mktemp -t llamacpp_prompt.XXXXXXX.txt)
+
+sed -e "s/\[\[USER_NAME\]\]/$USER_NAME/g" \
+ -e "s/\[\[AI_NAME\]\]/$AI_NAME/g" \
+ -e "s/\[\[DATE_TIME\]\]/$DATE_TIME/g" \
+ -e "s/\[\[DATE_YEAR\]\]/$DATE_YEAR/g" \
+ $PROMPT_TEMPLATE > $PROMPT_FILE
+
# shellcheck disable=SC2086 # Intended splitting of GEN_OPTIONS
./main $GEN_OPTIONS \
--model "$MODEL" \
--threads "$N_THREAD" \
--n_predict "$N_PREDICTS" \
--color --interactive \
+ --file ${PROMPT_FILE} \
--reverse-prompt "${USER_NAME}:" \
- --prompt "
-Text transcript of a never ending dialog, where ${USER_NAME} interacts with an AI assistant named ${AI_NAME}.
-${AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer ${USER_NAME}’s requests immediately and with details and precision.
-There are no annotations like (30 seconds passed...) or (to himself), just what ${USER_NAME} and ${AI_NAME} say aloud to each other.
-The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
-The transcript only includes text, it does not include markup like HTML and Markdown.
-
-$USER_NAME: Hello, $AI_NAME!
-$AI_NAME: Hello $USER_NAME! How may I help you today?
-$USER_NAME: What time is it?
-$AI_NAME: It is $(date +%H:%M).
-$USER_NAME: What year is it?
-$AI_NAME: We are in $(date +%Y).
-$USER_NAME: Please tell me the largest city in Europe.
-$AI_NAME: The largest city in Europe is Moscow, the capital of Russia.
-$USER_NAME: What can you tell me about Moscow?
-$AI_NAME: Moscow, on the Moskva River in western Russia, is the nation’s cosmopolitan capital. In its historic core is the Kremlin, a complex that’s home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
-$USER_NAME: What is a cat?
-$AI_NAME: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
-$USER_NAME: How do I pass command line arguments to a Node.js program?
-$AI_NAME: The arguments are stored in process.argv.
-
- argv[0] is the path to the Node. js executable.
- argv[1] is the path to the script file.
- argv[2] is the first argument passed to the script.
- argv[3] is the second argument passed to the script and so on.
-$USER_NAME: Name a color.
-$AI_NAME: Blue
-$USER_NAME:" "$@"
+ --in-prefix ' ' \
+ "$@"
diff --git a/examples/common.cpp b/examples/common.cpp
index 7d488bead..00535f558 100644
--- a/examples/common.cpp
+++ b/examples/common.cpp
@@ -1,42 +1,94 @@
#include "common.h"
#include <cassert>
+#include <cmath>
#include <cstring>
#include <fstream>
#include <string>
#include <iterator>
#include <algorithm>
+#include <sstream>
-#if defined (_WIN32)
+#if defined(__APPLE__) && defined(__MACH__)
+#include <sys/types.h>
+#include <sys/sysctl.h>
+#endif
+
+#if defined(_WIN32)
+#define WIN32_LEAN_AND_MEAN
+#define NOMINMAX
+#include <windows.h>
#include <fcntl.h>
#include <io.h>
-#pragma comment(lib,"kernel32.lib")
-extern "C" __declspec(dllimport) void* __stdcall GetStdHandle(unsigned long nStdHandle);
-extern "C" __declspec(dllimport) int __stdcall GetConsoleMode(void* hConsoleHandle, unsigned long* lpMode);
-extern "C" __declspec(dllimport) int __stdcall SetConsoleMode(void* hConsoleHandle, unsigned long dwMode);
-extern "C" __declspec(dllimport) int __stdcall SetConsoleCP(unsigned int wCodePageID);
-extern "C" __declspec(dllimport) int __stdcall SetConsoleOutputCP(unsigned int wCodePageID);
-extern "C" __declspec(dllimport) int __stdcall WideCharToMultiByte(unsigned int CodePage, unsigned long dwFlags,
- const wchar_t * lpWideCharStr, int cchWideChar,
- char * lpMultiByteStr, int cbMultiByte,
- const char * lpDefaultChar, bool * lpUsedDefaultChar);
-#define CP_UTF8 65001
+#else
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <wchar.h>
#endif
-bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
- // determine sensible default number of threads.
- // std::thread::hardware_concurrency may not be equal to the number of cores, or may return 0.
+int32_t get_num_physical_cores() {
#ifdef __linux__
std::ifstream cpuinfo("/proc/cpuinfo");
-    params.n_threads = std::count(std::istream_iterator<std::string>(cpuinfo),
-                                  std::istream_iterator<std::string>(),
-                                  std::string("processor"));
+ std::string line;
+ while (std::getline(cpuinfo, line)) {
+ std::size_t pos = line.find("cpu cores");
+ if (pos != std::string::npos) {
+ pos = line.find(": ", pos);
+ if (pos != std::string::npos) {
+ try {
+ // Extract the number and return it
+                    return static_cast<int32_t>(std::stoul(line.substr(pos + 2)));
+ } catch (const std::invalid_argument &) {
+ // Ignore if we could not parse
+ }
+ }
+ }
+ }
+#elif defined(__APPLE__) && defined(__MACH__)
+ int32_t num_physical_cores;
+ size_t len = sizeof(num_physical_cores);
+ int result = sysctlbyname("hw.perflevel0.physicalcpu", &num_physical_cores, &len, NULL, 0);
+ if (result == 0) {
+ return num_physical_cores;
+ }
+ result = sysctlbyname("hw.physicalcpu", &num_physical_cores, &len, NULL, 0);
+ if (result == 0) {
+ return num_physical_cores;
+ }
+#elif defined(_WIN32)
+ //TODO: Implement
#endif
- if (params.n_threads == 0) {
- params.n_threads = std::max(1, (int32_t) std::thread::hardware_concurrency());
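+    // Fall back to logical core count: trust up to 4 cores as-is, assume SMT and halve larger counts, and use 4 if detection fails.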
+ unsigned int n_threads = std::thread::hardware_concurrency();
+ return n_threads > 0 ? (n_threads <= 4 ? n_threads : n_threads / 2) : 4;
+}
+
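+// Unescape C-style sequences (\n, \r, \t, \', \", \\) in the string in-place; unknown escapes are kept verbatim.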
+void process_escapes(std::string& input) {
+ std::size_t input_len = input.length();
+ std::size_t output_idx = 0;
+
+ for (std::size_t input_idx = 0; input_idx < input_len; ++input_idx) {
+ if (input[input_idx] == '\\' && input_idx + 1 < input_len) {
+ switch (input[++input_idx]) {
+ case 'n': input[output_idx++] = '\n'; break;
+ case 'r': input[output_idx++] = '\r'; break;
+ case 't': input[output_idx++] = '\t'; break;
+ case '\'': input[output_idx++] = '\''; break;
+ case '\"': input[output_idx++] = '\"'; break;
+ case '\\': input[output_idx++] = '\\'; break;
+ default: input[output_idx++] = '\\';
+ input[output_idx++] = input[input_idx]; break;
+ }
+ } else {
+ input[output_idx++] = input[input_idx];
+ }
}
+ input.resize(output_idx);
+}
+
+bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
bool invalid_param = false;
+ bool escape_prompt = false;
std::string arg;
gpt_params default_params;
@@ -44,6 +96,9 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
arg = argv[i];
if (arg == "-s" || arg == "--seed") {
+#if defined(GGML_USE_CUBLAS)
+ fprintf(stderr, "WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.\n");
+#endif
if (++i >= argc) {
invalid_param = true;
break;
@@ -61,6 +116,16 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
break;
}
params.prompt = argv[i];
+ } else if (arg == "-e") {
+ escape_prompt = true;
+ } else if (arg == "--prompt-cache") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.path_prompt_cache = argv[i];
+ } else if (arg == "--prompt-cache-all") {
+ params.prompt_cache_all = true;
} else if (arg == "-f" || arg == "--file") {
if (++i >= argc) {
invalid_param = true;
@@ -108,6 +173,18 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
break;
}
params.temp = std::stof(argv[i]);
+ } else if (arg == "--tfs") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.tfs_z = std::stof(argv[i]);
+ } else if (arg == "--typical") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.typical_p = std::stof(argv[i]);
} else if (arg == "--repeat_last_n") {
if (++i >= argc) {
invalid_param = true;
@@ -120,6 +197,36 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
break;
}
params.repeat_penalty = std::stof(argv[i]);
+ } else if (arg == "--frequency_penalty") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.frequency_penalty = std::stof(argv[i]);
+ } else if (arg == "--presence_penalty") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.presence_penalty = std::stof(argv[i]);
+ } else if (arg == "--mirostat") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.mirostat = std::stoi(argv[i]);
+ } else if (arg == "--mirostat_lr") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.mirostat_eta = std::stof(argv[i]);
+ } else if (arg == "--mirostat_ent") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.mirostat_tau = std::stof(argv[i]);
} else if (arg == "-b" || arg == "--batch_size") {
if (++i >= argc) {
invalid_param = true;
@@ -156,12 +263,12 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
params.interactive = true;
} else if (arg == "--embedding") {
params.embedding = true;
- } else if (arg == "--interactive-start") {
- params.interactive = true;
} else if (arg == "--interactive-first") {
- params.interactive_start = true;
+ params.interactive_first = true;
} else if (arg == "-ins" || arg == "--instruct") {
params.instruct = true;
+ } else if (arg == "--multiline-input") {
+ params.multiline_input = true;
} else if (arg == "--color") {
params.use_color = true;
} else if (arg == "--mlock") {
@@ -181,7 +288,28 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
} else if (arg == "--perplexity") {
params.perplexity = true;
} else if (arg == "--ignore-eos") {
- params.ignore_eos = true;
+ params.logit_bias[llama_token_eos()] = -INFINITY;
+ } else if (arg == "--no-penalize-nl") {
+ params.penalize_nl = false;
+ } else if (arg == "-l" || arg == "--logit-bias") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
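+        // expected format: TOKEN_ID immediately followed by '+' or '-' and the bias value, e.g. "15043+1" or "15043-1"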
+ std::stringstream ss(argv[i]);
+ llama_token key;
+ char sign;
+ std::string value_str;
+ try {
+ if (ss >> key && ss >> sign && std::getline(ss, value_str) && (sign == '+' || sign == '-')) {
+ params.logit_bias[key] = std::stof(value_str) * ((sign == '-') ? -1.0f : 1.0f);
+ } else {
+ throw std::exception();
+ }
+ } catch (const std::exception &e) {
+ invalid_param = true;
+ break;
+ }
} else if (arg == "--n_parts") {
if (++i >= argc) {
invalid_param = true;
@@ -199,6 +327,12 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
break;
}
params.input_prefix = argv[i];
+ } else if (arg == "--in-suffix") {
+ if (++i >= argc) {
+ invalid_param = true;
+ break;
+ }
+ params.input_suffix = argv[i];
} else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
gpt_print_usage(argc, argv, default_params);
@@ -210,6 +344,16 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
gpt_print_usage(argc, argv, default_params);
exit(1);
}
+ if (params.prompt_cache_all &&
+ (params.interactive || params.interactive_first ||
+ params.instruct || params.antiprompt.size())) {
+ fprintf(stderr, "error: --prompt-cache-all not supported in interactive mode yet\n");
+ gpt_print_usage(argc, argv, default_params);
+ exit(1);
+ }
+ if (escape_prompt) {
+ process_escapes(params.prompt);
+ }
return true;
}
@@ -222,25 +366,45 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
fprintf(stderr, " -i, --interactive run in interactive mode\n");
fprintf(stderr, " --interactive-first run in interactive mode and wait for input right away\n");
fprintf(stderr, " -ins, --instruct run in instruction mode (use with Alpaca models)\n");
+ fprintf(stderr, " --multiline-input allows you to write or paste multiple lines without ending each in '\\'\n");
fprintf(stderr, " -r PROMPT, --reverse-prompt PROMPT\n");
fprintf(stderr, " halt generation at PROMPT, return control in interactive mode\n");
fprintf(stderr, " (can be specified more than once for multiple prompts).\n");
fprintf(stderr, " --color colorise output to distinguish prompt and user input from generations\n");
- fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for <= 0)\n");
+ fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for < 0)\n");
fprintf(stderr, " -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads);
fprintf(stderr, " -p PROMPT, --prompt PROMPT\n");
fprintf(stderr, " prompt to start generation with (default: empty)\n");
+    fprintf(stderr, "  -e                    process prompt escape sequences (\\n, \\r, \\t, \\', \\\", \\\\)\n");
+ fprintf(stderr, " --prompt-cache FNAME file to cache prompt state for faster startup (default: none)\n");
+ fprintf(stderr, " --prompt-cache-all if specified, saves user input and generations to cache as well.\n");
+ fprintf(stderr, " not supported with --interactive or other interactive options\n");
fprintf(stderr, " --random-prompt start with a randomized prompt.\n");
fprintf(stderr, " --in-prefix STRING string to prefix user inputs with (default: empty)\n");
+    fprintf(stderr, "  --in-suffix STRING    string to append after user inputs (default: empty)\n");
fprintf(stderr, " -f FNAME, --file FNAME\n");
fprintf(stderr, " prompt file to start generation.\n");
fprintf(stderr, " -n N, --n_predict N number of tokens to predict (default: %d, -1 = infinity)\n", params.n_predict);
- fprintf(stderr, " --top_k N top-k sampling (default: %d)\n", params.top_k);
- fprintf(stderr, " --top_p N top-p sampling (default: %.1f)\n", (double)params.top_p);
- fprintf(stderr, " --repeat_last_n N last n tokens to consider for penalize (default: %d)\n", params.repeat_last_n);
- fprintf(stderr, " --repeat_penalty N penalize repeat sequence of tokens (default: %.1f)\n", (double)params.repeat_penalty);
+ fprintf(stderr, " --top_k N top-k sampling (default: %d, 0 = disabled)\n", params.top_k);
+ fprintf(stderr, " --top_p N top-p sampling (default: %.1f, 1.0 = disabled)\n", (double)params.top_p);
+ fprintf(stderr, " --tfs N tail free sampling, parameter z (default: %.1f, 1.0 = disabled)\n", (double)params.tfs_z);
+ fprintf(stderr, " --typical N locally typical sampling, parameter p (default: %.1f, 1.0 = disabled)\n", (double)params.typical_p);
+ fprintf(stderr, " --repeat_last_n N last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size)\n", params.repeat_last_n);
+ fprintf(stderr, " --repeat_penalty N penalize repeat sequence of tokens (default: %.1f, 1.0 = disabled)\n", (double)params.repeat_penalty);
+ fprintf(stderr, " --presence_penalty N repeat alpha presence penalty (default: %.1f, 0.0 = disabled)\n", (double)params.presence_penalty);
+ fprintf(stderr, " --frequency_penalty N repeat alpha frequency penalty (default: %.1f, 0.0 = disabled)\n", (double)params.frequency_penalty);
+ fprintf(stderr, " --mirostat N use Mirostat sampling.\n");
+ fprintf(stderr, " Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.\n");
+ fprintf(stderr, " (default: %d, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)\n", params.mirostat);
+ fprintf(stderr, " --mirostat_lr N Mirostat learning rate, parameter eta (default: %.1f)\n", (double)params.mirostat_eta);
+ fprintf(stderr, " --mirostat_ent N Mirostat target entropy, parameter tau (default: %.1f)\n", (double)params.mirostat_tau);
+ fprintf(stderr, " -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS\n");
+ fprintf(stderr, " modifies the likelihood of token appearing in the completion,\n");
+    fprintf(stderr, "                        e.g. `--logit-bias 15043+1` to increase likelihood of token ' Hello',\n");
+ fprintf(stderr, " or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'\n");
fprintf(stderr, " -c N, --ctx_size N size of the prompt context (default: %d)\n", params.n_ctx);
- fprintf(stderr, " --ignore-eos ignore end of stream token and continue generating\n");
+ fprintf(stderr, " --ignore-eos ignore end of stream token and continue generating (implies --logit-bias 2-inf)\n");
+ fprintf(stderr, " --no-penalize-nl do not penalize newline token\n");
fprintf(stderr, " --memory_f32 use f32 instead of f16 for memory key+value\n");
fprintf(stderr, " --temp N temperature (default: %.1f)\n", (double)params.temp);
fprintf(stderr, " --n_parts N number of model parts (default: -1 = determine from dimensions)\n");
@@ -284,62 +448,381 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
// TODO: not great allocating this every time
std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
    // initialize to prompt number of chars, since n_tokens <= n_prompt_chars
-    std::vector<llama_token> res(text.size() + (int)add_bos);
-    int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
+    std::vector<llama_token> res(text.size() + (int) add_bos);
+    const int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
assert(n >= 0);
res.resize(n);
return res;
}
-/* Keep track of current color of output, and emit ANSI code if it changes. */
-void set_console_color(console_state & con_st, console_color_t color) {
- if (con_st.use_color && con_st.color != color) {
- switch(color) {
- case CONSOLE_COLOR_DEFAULT:
- printf(ANSI_COLOR_RESET);
- break;
- case CONSOLE_COLOR_PROMPT:
- printf(ANSI_COLOR_YELLOW);
- break;
- case CONSOLE_COLOR_USER_INPUT:
- printf(ANSI_BOLD ANSI_COLOR_GREEN);
- break;
- }
- con_st.color = color;
+struct llama_context * llama_init_from_gpt_params(const gpt_params & params) {
+ auto lparams = llama_context_default_params();
+
+ lparams.n_ctx = params.n_ctx;
+ lparams.n_parts = params.n_parts;
+ lparams.seed = params.seed;
+ lparams.f16_kv = params.memory_f16;
+ lparams.use_mmap = params.use_mmap;
+ lparams.use_mlock = params.use_mlock;
+ lparams.logits_all = params.perplexity;
+ lparams.embedding = params.embedding;
+
+ llama_context * lctx = llama_init_from_file(params.model.c_str(), lparams);
+
+ if (lctx == NULL) {
+ fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
+ return NULL;
}
+
+ if (!params.lora_adapter.empty()) {
+ int err = llama_apply_lora_from_file(lctx,
+ params.lora_adapter.c_str(),
+ params.lora_base.empty() ? NULL : params.lora_base.c_str(),
+ params.n_threads);
+ if (err != 0) {
+ fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
+ return NULL;
+ }
+ }
+
+ return lctx;
}
-#if defined (_WIN32)
-void win32_console_init(bool enable_color) {
- unsigned long dwMode = 0;
- void* hConOut = GetStdHandle((unsigned long)-11); // STD_OUTPUT_HANDLE (-11)
- if (!hConOut || hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode)) {
- hConOut = GetStdHandle((unsigned long)-12); // STD_ERROR_HANDLE (-12)
- if (hConOut && (hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode))) {
- hConOut = 0;
+void console_init(console_state & con_st) {
+#if defined(_WIN32)
+ // Windows-specific console initialization
+ DWORD dwMode = 0;
+ con_st.hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
+ if (con_st.hConsole == INVALID_HANDLE_VALUE || !GetConsoleMode(con_st.hConsole, &dwMode)) {
+ con_st.hConsole = GetStdHandle(STD_ERROR_HANDLE);
+ if (con_st.hConsole != INVALID_HANDLE_VALUE && (!GetConsoleMode(con_st.hConsole, &dwMode))) {
+ con_st.hConsole = NULL;
}
}
- if (hConOut) {
+ if (con_st.hConsole) {
// Enable ANSI colors on Windows 10+
- if (enable_color && !(dwMode & 0x4)) {
- SetConsoleMode(hConOut, dwMode | 0x4); // ENABLE_VIRTUAL_TERMINAL_PROCESSING (0x4)
+ if (con_st.use_color && !(dwMode & ENABLE_VIRTUAL_TERMINAL_PROCESSING)) {
+ SetConsoleMode(con_st.hConsole, dwMode | ENABLE_VIRTUAL_TERMINAL_PROCESSING);
}
// Set console output codepage to UTF8
SetConsoleOutputCP(CP_UTF8);
}
- void* hConIn = GetStdHandle((unsigned long)-10); // STD_INPUT_HANDLE (-10)
- if (hConIn && hConIn != (void*)-1 && GetConsoleMode(hConIn, &dwMode)) {
+ HANDLE hConIn = GetStdHandle(STD_INPUT_HANDLE);
+ if (hConIn != INVALID_HANDLE_VALUE && GetConsoleMode(hConIn, &dwMode)) {
// Set console input codepage to UTF16
_setmode(_fileno(stdin), _O_WTEXT);
+
+ // Turn off ICANON (ENABLE_LINE_INPUT) and ECHO (ENABLE_ECHO_INPUT)
+ dwMode &= ~(ENABLE_LINE_INPUT | ENABLE_ECHO_INPUT);
+ SetConsoleMode(hConIn, dwMode);
+ }
+#else
+ // POSIX-specific console initialization
+ struct termios new_termios;
+ tcgetattr(STDIN_FILENO, &con_st.prev_state);
+ new_termios = con_st.prev_state;
+ new_termios.c_lflag &= ~(ICANON | ECHO);
+ new_termios.c_cc[VMIN] = 1;
+ new_termios.c_cc[VTIME] = 0;
+ tcsetattr(STDIN_FILENO, TCSANOW, &new_termios);
+
+ con_st.tty = fopen("/dev/tty", "w+");
+ if (con_st.tty != nullptr) {
+ con_st.out = con_st.tty;
+ }
+
+ setlocale(LC_ALL, "");
+#endif
+}
+
+void console_cleanup(console_state & con_st) {
+ // Reset console color
+ console_set_color(con_st, CONSOLE_COLOR_DEFAULT);
+
+#if !defined(_WIN32)
+ if (con_st.tty != nullptr) {
+ con_st.out = stdout;
+ fclose(con_st.tty);
+ con_st.tty = nullptr;
+ }
+ // Restore the terminal settings on POSIX systems
+ tcsetattr(STDIN_FILENO, TCSANOW, &con_st.prev_state);
+#endif
+}
+
+/* Keep track of current color of output, and emit ANSI code if it changes. */
+void console_set_color(console_state & con_st, console_color_t color) {
+ if (con_st.use_color && con_st.color != color) {
+ fflush(stdout);
+ switch(color) {
+ case CONSOLE_COLOR_DEFAULT:
+ fprintf(con_st.out, ANSI_COLOR_RESET);
+ break;
+ case CONSOLE_COLOR_PROMPT:
+ fprintf(con_st.out, ANSI_COLOR_YELLOW);
+ break;
+ case CONSOLE_COLOR_USER_INPUT:
+ fprintf(con_st.out, ANSI_BOLD ANSI_COLOR_GREEN);
+ break;
+ }
+ con_st.color = color;
+ fflush(con_st.out);
}
}
-// Convert a wide Unicode string to an UTF8 string
-void win32_utf8_encode(const std::wstring & wstr, std::string & str) {
- int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
- std::string strTo(size_needed, 0);
- WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
- str = strTo;
-}
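+// Read one Unicode code point from stdin, combining UTF-16 surrogate pairs on platforms where wchar_t is 16 bits (e.g. Windows).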
+char32_t getchar32() {
+ wchar_t wc = getwchar();
+    if (static_cast<wint_t>(wc) == WEOF) {
+ return WEOF;
+ }
+
+#if WCHAR_MAX == 0xFFFF
+ if ((wc >= 0xD800) && (wc <= 0xDBFF)) { // Check if wc is a high surrogate
+ wchar_t low_surrogate = getwchar();
+ if ((low_surrogate >= 0xDC00) && (low_surrogate <= 0xDFFF)) { // Check if the next wchar is a low surrogate
+            return (static_cast<char32_t>(wc & 0x03FF) << 10) + (low_surrogate & 0x03FF) + 0x10000;
+ }
+ }
+ if ((wc >= 0xD800) && (wc <= 0xDFFF)) { // Invalid surrogate pair
+ return 0xFFFD; // Return the replacement character U+FFFD
+ }
#endif
+
+    return static_cast<char32_t>(wc);
+}
+
+void pop_cursor(console_state & con_st) {
+#if defined(_WIN32)
+ if (con_st.hConsole != NULL) {
+ CONSOLE_SCREEN_BUFFER_INFO bufferInfo;
+ GetConsoleScreenBufferInfo(con_st.hConsole, &bufferInfo);
+
+ COORD newCursorPosition = bufferInfo.dwCursorPosition;
+ if (newCursorPosition.X == 0) {
+ newCursorPosition.X = bufferInfo.dwSize.X - 1;
+ newCursorPosition.Y -= 1;
+ } else {
+ newCursorPosition.X -= 1;
+ }
+
+ SetConsoleCursorPosition(con_st.hConsole, newCursorPosition);
+ return;
+ }
+#endif
+ putc('\b', con_st.out);
+}
+
+int estimateWidth(char32_t codepoint) {
+#if defined(_WIN32)
+ return 1;
+#else
+ return wcwidth(codepoint);
+#endif
+}
+
+int put_codepoint(console_state & con_st, const char* utf8_codepoint, size_t length, int expectedWidth) {
+#if defined(_WIN32)
+ CONSOLE_SCREEN_BUFFER_INFO bufferInfo;
+ if (!GetConsoleScreenBufferInfo(con_st.hConsole, &bufferInfo)) {
+ // go with the default
+ return expectedWidth;
+ }
+ COORD initialPosition = bufferInfo.dwCursorPosition;
+ DWORD nNumberOfChars = length;
+ WriteConsole(con_st.hConsole, utf8_codepoint, nNumberOfChars, &nNumberOfChars, NULL);
+
+ CONSOLE_SCREEN_BUFFER_INFO newBufferInfo;
+ GetConsoleScreenBufferInfo(con_st.hConsole, &newBufferInfo);
+
+ // Figure out our real position if we're in the last column
+ if (utf8_codepoint[0] != 0x09 && initialPosition.X == newBufferInfo.dwSize.X - 1) {
+ DWORD nNumberOfChars;
+ WriteConsole(con_st.hConsole, &" \b", 2, &nNumberOfChars, NULL);
+ GetConsoleScreenBufferInfo(con_st.hConsole, &newBufferInfo);
+ }
+
+ int width = newBufferInfo.dwCursorPosition.X - initialPosition.X;
+ if (width < 0) {
+ width += newBufferInfo.dwSize.X;
+ }
+ return width;
+#else
+ // we can trust expectedWidth if we've got one
+ if (expectedWidth >= 0 || con_st.tty == nullptr) {
+ fwrite(utf8_codepoint, length, 1, con_st.out);
+ return expectedWidth;
+ }
+
+ fputs("\033[6n", con_st.tty); // Query cursor position
+ int x1, x2, y1, y2;
+ int results = 0;
+ results = fscanf(con_st.tty, "\033[%d;%dR", &y1, &x1);
+
+ fwrite(utf8_codepoint, length, 1, con_st.tty);
+
+ fputs("\033[6n", con_st.tty); // Query cursor position
+ results += fscanf(con_st.tty, "\033[%d;%dR", &y2, &x2);
+
+ if (results != 4) {
+ return expectedWidth;
+ }
+
+ int width = x2 - x1;
+ if (width < 0) {
+ // Calculate the width considering text wrapping
+ struct winsize w;
+ ioctl(STDOUT_FILENO, TIOCGWINSZ, &w);
+ width += w.ws_col;
+ }
+ return width;
+#endif
+}
+
+void replace_last(console_state & con_st, char ch) {
+#if defined(_WIN32)
+ pop_cursor(con_st);
+ put_codepoint(con_st, &ch, 1, 1);
+#else
+ fprintf(con_st.out, "\b%c", ch);
+#endif
+}
+
+void append_utf8(char32_t ch, std::string & out) {
+ if (ch <= 0x7F) {
+ out.push_back(static_cast<char>(ch));
+ } else if (ch <= 0x7FF) {
+ out.push_back(static_cast<char>(0xC0 | ((ch >> 6) & 0x1F)));
+ out.push_back(static_cast<char>(0x80 | (ch & 0x3F)));
+ } else if (ch <= 0xFFFF) {
+ out.push_back(static_cast<char>(0xE0 | ((ch >> 12) & 0x0F)));
+ out.push_back(static_cast<char>(0x80 | ((ch >> 6) & 0x3F)));
+ out.push_back(static_cast<char>(0x80 | (ch & 0x3F)));
+ } else if (ch <= 0x10FFFF) {
+ out.push_back(static_cast<char>(0xF0 | ((ch >> 18) & 0x07)));
+ out.push_back(static_cast<char>(0x80 | ((ch >> 12) & 0x3F)));
+ out.push_back(static_cast<char>(0x80 | ((ch >> 6) & 0x3F)));
+ out.push_back(static_cast<char>(0x80 | (ch & 0x3F)));
+ } else {
+ // Invalid Unicode code point
+ }
+}
+
+// Helper function to remove the last UTF-8 character from a string
+void pop_back_utf8_char(std::string & line) {
+ if (line.empty()) {
+ return;
+ }
+
+ size_t pos = line.length() - 1;
+
+ // Find the start of the last UTF-8 character (checking up to 4 bytes back)
+ for (size_t i = 0; i < 3 && pos > 0; ++i, --pos) {
+ if ((line[pos] & 0xC0) != 0x80) break; // Found the start of the character
+ }
+ line.erase(pos);
+}
+
+bool console_readline(console_state & con_st, std::string & line) {
+ console_set_color(con_st, CONSOLE_COLOR_USER_INPUT);
+ if (con_st.out != stdout) {
+ fflush(stdout);
+ }
+
+ line.clear();
+ std::vector<int> widths;
+ bool is_special_char = false;
+ bool end_of_stream = false;
+
+ char32_t input_char;
+ while (true) {
+ fflush(con_st.out); // Ensure all output is displayed before waiting for input
+ input_char = getchar32();
+
+ if (input_char == '\r' || input_char == '\n') {
+ break;
+ }
+
+ if (input_char == WEOF || input_char == 0x04 /* Ctrl+D*/) {
+ end_of_stream = true;
+ break;
+ }
+
+ if (is_special_char) {
+ console_set_color(con_st, CONSOLE_COLOR_USER_INPUT);
+ replace_last(con_st, line.back());
+ is_special_char = false;
+ }
+
+ if (input_char == '\033') { // Escape sequence
+ char32_t code = getchar32();
+ if (code == '[' || code == 0x1B) {
+ // Discard the rest of the escape sequence
+ while ((code = getchar32()) != WEOF) {
+ if ((code >= 'A' && code <= 'Z') || (code >= 'a' && code <= 'z') || code == '~') {
+ break;
+ }
+ }
+ }
+ } else if (input_char == 0x08 || input_char == 0x7F) { // Backspace
+ if (!widths.empty()) {
+ int count;
+ do {
+ count = widths.back();
+ widths.pop_back();
+ // Move cursor back, print space, and move cursor back again
+ for (int i = 0; i < count; i++) {
+ replace_last(con_st, ' ');
+ pop_cursor(con_st);
+ }
+ pop_back_utf8_char(line);
+ } while (count == 0 && !widths.empty());
+ }
+ } else {
+ int offset = line.length();
+ append_utf8(input_char, line);
+ int width = put_codepoint(con_st, line.c_str() + offset, line.length() - offset, estimateWidth(input_char));
+ if (width < 0) {
+ width = 0;
+ }
+ widths.push_back(width);
+ }
+
+ if (!line.empty() && (line.back() == '\\' || line.back() == '/')) {
+ console_set_color(con_st, CONSOLE_COLOR_PROMPT);
+ replace_last(con_st, line.back());
+ is_special_char = true;
+ }
+ }
+
+ bool has_more = con_st.multiline_input;
+ if (is_special_char) {
+ replace_last(con_st, ' ');
+ pop_cursor(con_st);
+
+ char last = line.back();
+ line.pop_back();
+ if (last == '\\') {
+ line += '\n';
+ fputc('\n', con_st.out);
+ has_more = !has_more;
+ } else {
+ // llama will just eat the single space, it won't act as a space
+ if (line.length() == 1 && line.back() == ' ') {
+ line.clear();
+ pop_cursor(con_st);
+ }
+ has_more = false;
+ }
+ } else {
+ if (end_of_stream) {
+ has_more = false;
+ } else {
+ line += '\n';
+ fputc('\n', con_st.out);
+ }
+ }
+
+ fflush(con_st.out);
+ return has_more;
+}
diff --git a/examples/common.h b/examples/common.h
index cbbc2dfab..499671b2e 100644
--- a/examples/common.h
+++ b/examples/common.h
@@ -8,30 +8,47 @@
#include <vector>
#include <random>
#include <thread>
+#include <unordered_map>
+
+#if !defined (_WIN32)
+#include <stdio.h>
+#include <termios.h>
+#endif
//
// CLI argument parsing
//
+int32_t get_num_physical_cores();
struct gpt_params {
int32_t seed = -1; // RNG seed
- int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
- int32_t n_predict = 128; // new tokens to predict
- int32_t repeat_last_n = 64; // last n tokens to penalize
+ int32_t n_threads = get_num_physical_cores();
+ int32_t n_predict = -1; // new tokens to predict
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
int32_t n_ctx = 512; // context size
- int32_t n_batch = 8; // batch size for prompt processing
+ int32_t n_batch = 512; // batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_keep = 0; // number of tokens to keep from initial prompt
// sampling parameters
- int32_t top_k = 40;
- float top_p = 0.95f;
- float temp = 0.80f;
- float repeat_penalty = 1.10f;
+ std::unordered_map<llama_token, float> logit_bias; // logit bias for specific tokens
+ int32_t top_k = 40; // <= 0 to use vocab size
+ float top_p = 0.95f; // 1.0 = disabled
+ float tfs_z = 1.00f; // 1.0 = disabled
+ float typical_p = 1.00f; // 1.0 = disabled
+ float temp = 0.80f; // 1.0 = disabled
+ float repeat_penalty = 1.10f; // 1.0 = disabled
+ int32_t repeat_last_n = 64; // last n tokens to penalize (0 = disable penalty, -1 = context size)
+ float frequency_penalty = 0.00f; // 0.0 = disabled
+ float presence_penalty = 0.00f; // 0.0 = disabled
+ int mirostat = 0; // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
+ float mirostat_tau = 5.00f; // target entropy
+ float mirostat_eta = 0.10f; // learning rate
std::string model = "models/lamma-7B/ggml-model.bin"; // model path
std::string prompt = "";
- std::string input_prefix = ""; // string to prefix user inputs with
+ std::string path_prompt_cache = ""; // path to file for saving/loading prompt eval state
+ std::string input_prefix = ""; // string to prefix user inputs with
+ std::string input_suffix = ""; // string to suffix user inputs with
std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
std::string lora_adapter = ""; // lora adapter path
@@ -41,12 +58,14 @@ struct gpt_params {
bool random_prompt = false; // do not randomize prompt if none provided
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
+ bool prompt_cache_all = false; // save user input and generations to prompt cache
bool embedding = false; // get only sentence embedding
- bool interactive_start = false; // wait for user input immediately
+ bool interactive_first = false; // wait for user input immediately
+ bool multiline_input = false; // reverse the usage of `\`
bool instruct = false; // instruction mode (used for Alpaca models)
- bool ignore_eos = false; // do not stop generating after eos
+ bool penalize_nl = true; // consider newlines as a repeatable token
bool perplexity = false; // compute perplexity over the prompt
bool use_mmap = true; // use mmap for faster loads
bool use_mlock = false; // use mlock to keep model in memory
@@ -66,6 +85,12 @@ std::string gpt_random_prompt(std::mt19937 & rng);
std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);
+//
+// Model utils
+//
+
+struct llama_context * llama_init_from_gpt_params(const gpt_params & params);
+
//
// Console utils
//
@@ -86,13 +111,20 @@ enum console_color_t {
};
struct console_state {
+ bool multiline_input = false;
bool use_color = false;
console_color_t color = CONSOLE_COLOR_DEFAULT;
+
+ FILE* out = stdout;
+#if defined (_WIN32)
+ void* hConsole;
+#else
+ FILE* tty = nullptr;
+ termios prev_state;
+#endif
};
-void set_console_color(console_state & con_st, console_color_t color);
-
-#if defined (_WIN32)
-void win32_console_init(bool enable_color);
-void win32_utf8_encode(const std::wstring & wstr, std::string & str);
-#endif
+void console_init(console_state & con_st);
+void console_cleanup(console_state & con_st);
+void console_set_color(console_state & con_st, console_color_t color);
+bool console_readline(console_state & con_st, std::string & line);
diff --git a/examples/embedding/CMakeLists.txt b/examples/embedding/CMakeLists.txt
index 88c425d4a..db73b6b44 100644
--- a/examples/embedding/CMakeLists.txt
+++ b/examples/embedding/CMakeLists.txt
@@ -2,3 +2,6 @@ set(TARGET embedding)
add_executable(${TARGET} embedding.cpp)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
+if(TARGET BUILD_INFO)
+ add_dependencies(${TARGET} BUILD_INFO)
+endif()
diff --git a/examples/embedding/embedding.cpp b/examples/embedding/embedding.cpp
index e10de619c..e4b729128 100644
--- a/examples/embedding/embedding.cpp
+++ b/examples/embedding/embedding.cpp
@@ -1,5 +1,6 @@
#include "common.h"
#include "llama.h"
+#include "build-info.h"
#include <ctime>
@@ -18,11 +19,13 @@ int main(int argc, char ** argv) {
"expect poor results\n", __func__, params.n_ctx);
}
- if (params.seed <= 0) {
+ fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);
+
+ if (params.seed < 0) {
params.seed = time(NULL);
}
- fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
+ fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
@@ -32,24 +35,10 @@ int main(int argc, char ** argv) {
llama_context * ctx;
// load the model
- {
- auto lparams = llama_context_default_params();
-
- lparams.n_ctx = params.n_ctx;
- lparams.n_parts = params.n_parts;
- lparams.seed = params.seed;
- lparams.f16_kv = params.memory_f16;
- lparams.logits_all = params.perplexity;
- lparams.use_mmap = params.use_mmap;
- lparams.use_mlock = params.use_mlock;
- lparams.embedding = params.embedding;
-
- ctx = llama_init_from_file(params.model.c_str(), lparams);
-
- if (ctx == NULL) {
- fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
- return 1;
- }
+ ctx = llama_init_from_gpt_params(params);
+ if (ctx == NULL) {
+ fprintf(stderr, "%s: error: unable to load model\n", __func__);
+ return 1;
}
// print system information
diff --git a/examples/jeopardy/README.md b/examples/jeopardy/README.md
new file mode 100644
index 000000000..4c42e3cdb
--- /dev/null
+++ b/examples/jeopardy/README.md
@@ -0,0 +1,21 @@
+# llama.cpp/example/jeopardy
+
+This is pretty much just a straight port of aigoopy/llm-jeopardy/ with an added graph viewer.
+
+The jeopardy test can be used to compare the factual knowledge of different models against each other. This is in contrast to some other tests, which test logical deduction, creativity, writing skills, etc.
+
+
+Step 1: Open jeopardy.sh and modify the following:
+```
+MODEL=(path to your model)
+MODEL_NAME=(name of your model)
+prefix=(the prompt prefix for your model, e.g. "Human: " for Vicuna, "User: " for others)
+opts=(add -instruct here if needed for your model, or anything else you want to test out)
+```
+Step 2: Run `jeopardy.sh` from the llama.cpp folder
+
+Step 3: Repeat steps 1 and 2 until you have all the results you need.
+
+Step 4: Run `graph.py`, and follow the instructions. At the end, it will generate your final graph.
+
+Note: The Human bar is based on the full, original 100 sample questions. If you modify the question count or the questions themselves, it will not be valid.
diff --git a/examples/jeopardy/graph.py b/examples/jeopardy/graph.py
new file mode 100644
index 000000000..d00b28652
--- /dev/null
+++ b/examples/jeopardy/graph.py
@@ -0,0 +1,56 @@
+import matplotlib.pyplot as plt
+import os
+import csv
+
+labels = []
+numbers = []
+numEntries = 1
+
+rows = []
+
+def bar_chart(numbers, labels, pos):
+ plt.bar(pos, numbers, color='blue')
+ plt.xticks(ticks=pos, labels=labels)
+ plt.title("Jeopardy Results by Model")
+ plt.xlabel("Model")
+ plt.ylabel("Questions Correct")
+ plt.show()
+
+def calculatecorrect():
+ directory = os.fsencode("./examples/jeopardy/results/")
+ csv_reader = csv.reader(open("./examples/jeopardy/qasheet.csv", 'rt'), delimiter=',')
+ for row in csv_reader:
+ global rows
+ rows.append(row)
+ for listing in os.listdir(directory):
+ filename = os.fsdecode(listing)
+ if filename.endswith(".txt"):
+ file = open("./examples/jeopardy/results/" + filename, "rt")
+ global labels
+ global numEntries
+ global numbers
+ labels.append(filename[:-4])
+ numEntries += 1
+ i = 1
+ totalcorrect = 0
+ for line in file.readlines():
+ if line.strip() != "------":
+ print(line)
+ else:
+ print("Correct answer: " + rows[i][2] + "\n")
+ i+=1
+ print("Did the AI get the question right? (y/n)")
+ if input() == "y":
+ totalcorrect += 1
+ numbers.append(totalcorrect)
+
+
+
+if __name__ == '__main__':
+ calculatecorrect()
+ pos = list(range(numEntries))
+ labels.append("Human")
+ numbers.append(48.11)
+ bar_chart(numbers, labels, pos)
+ print(labels)
+ print(numbers)
diff --git a/examples/jeopardy/jeopardy.sh b/examples/jeopardy/jeopardy.sh
new file mode 100644
index 000000000..9bdbc755c
--- /dev/null
+++ b/examples/jeopardy/jeopardy.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+set -e
+
+MODEL=./models/ggml-vicuna-13b-1.1-q4_0.bin
+MODEL_NAME=Vicuna
+
+# exec options
+prefix="Human: " # Ex. Vicuna uses "Human: "
+opts="--temp 0 -n 80" # additional flags
+nl='
+'
+introduction="You will be playing a game of Jeopardy. Simply answer the question in the correct format (Ex. What is Paris, or Who is George Washington)."
+
+# file options
+question_file=./examples/jeopardy/questions.txt
+touch ./examples/jeopardy/results/$MODEL_NAME.txt
+output_file=./examples/jeopardy/results/$MODEL_NAME.txt
+
+counter=1
+
+echo 'Running'
+while IFS= read -r question
+do
+ exe_cmd="./main -p "\"$prefix$introduction$nl$prefix$question\"" "$opts" -m ""\"$MODEL\""" >> ""\"$output_file\""
+ echo $counter
+ echo "Current Question: $question"
+ eval "$exe_cmd"
+ echo -e "\n------" >> $output_file
+ counter=$((counter+1))
+done < "$question_file"
diff --git a/examples/jeopardy/qasheet.csv b/examples/jeopardy/qasheet.csv
new file mode 100644
index 000000000..35b084189
--- /dev/null
+++ b/examples/jeopardy/qasheet.csv
@@ -0,0 +1,103 @@
+Index,Original Category,Original Correct Question,Model Prompt
+1,The Oscars,Who is John Williams?,Which actor Born in 1932 was the son of a percussionist in the CBS radio orchestra has been nominated for 53 Oscars?
+2,English Literature,What is Paradise Lost?,"What work in English Literature says: 'The mind is its own place, & in itself can make a heaven of hell, a hell of heaven. What matter where, if I be still the same'?"
+3,Writers’ Lesser-Known Works,Who is Niccolò Machiavelli?,"Known for more philosophical works, he wrote the play 'La Mandragola', in which Florentines are rewarded for immoral actions?"
+4,Exploration,What is Easter Island (Rapa Nui)?,"James Cook's account of a 1774 visit where records an object 'near 27 feet long, and upwards of 8 feet over the breast or shoulders'?"
+5,The Bill of Rights,What is the Eighth Amendment?,England's 'Bloody Assizes' & a 1685 life sentence for perjury were 2 main origins of which amendment to the U.S. Constitution?
+6,Nobel Peace Prize Winners,Who are Nelson Mandela & Desmond Tutu?,"Which nobel peace price winners each lived at times on Vilakazi St. in Soweto , so it claims to be the world's only street home to 2 Nobel Peace Prize winners?"
+7,Famous Names,Who is Walt Disney?,"In 1966, the year of who's death did he share plans for an experimental prototype community in Florida?"
+8,Geography,What is Colombia?,"Of the 13 nations through which the Equator passes, what is the only one whose coastline borders the Caribbean Sea?"
+9,Fashion History,What are rhinestones?,"Which decorative items in fashion history get their name from their origin in the port city of Strasbourg, on the border of France & Germany?"
+10,Movies of the ’80s,What is Driving Miss Daisy?,What 1980's movie is based on an off-Broadway play with just 3 characters and won the Best Picture Oscar & the actors in all 3 roles were nominated?
+11,Novelists,Who is John Grisham?,"A 2012 book review for which novelist noted subjects that 'sparked his ire': capital punishment, big tobacco & 'the plight of the unjustly convicted'?"
+12,20th Century Eponyms,What is the Maginot Line?,"A 1940 headline about what 20th Century Eponym included 'failure', 'liability when it came to offense' & 'stout hearts no match for tanks'?"
+13,City History,What is Stockholm?,"Over 700 years after its traditional 1252 founding date, what port city became associated with a psychological response?"
+14,Brand Names,What is Jacuzzi?,"The success of what brand has its roots with a hydrotherapy pump its cofounder created for his son, who had arthritis?"
+15,American Authors,Who is Washington Irving?,"In a periodical in 1807, what American Author called New York City 'Gotham, Gotham! Most enlightened of cities'?"
+16,Symbols,What is “less than”?,What symbol is a rotated V in math and a feeling of some marginalized or underrepresented people in society?
+17,Movie Theme Songs,Who is James Bond?,"Monty Norman, the composer of what character's theme, said the staccato riff conveyed sexiness, mystery & ruthlessness?"
+18,American Novelists,Who is Joseph Heller?,"What American Novelist served with an airman named Yohannan in World War II & despite what readers might think, he said he enjoyed his service?"
+19,Medieval Places,"What is Canterbury, England? (Canterbury Cathedral)","In what Medieval place did one of the participants in an 1170 event say, 'Let us away, knights; he will rise no more'?"
+20,Countries of Africa,What is Morocco?,"At one time a province of the Roman Empire, what African country kingdom is known to Arabic scholars as Al-Maghrib Al-Aqsa, 'the far west'?"
+21,Statehood,What is Wyoming?,Congress relented in 1890 after what prospective state said it would wait 100 years rather than come in without the women?
+22,1980s Movies,What is Raiders of the Lost Ark?,"A writer & producer of what movie said he wanted it to be like a Western or James Bond film, 'only it takes place in the 30s'?"
+23,Art Exhibitions,Who is Rembrandt?,In 1898 what's been called the first blockbuster art show was devoted to which artist & put on for Queen Wilhelmina's coronation?
+24,Countries of the World,What is Mongolia?,"Part of the largest contiguous land empire during the 1200s & 1300s, today what is the world's second-largest landlocked country?"
+25,Literature,What is “Howl”?,A 2006 book was titled 'The Poem That Changed America:' What 'Fifty Years Later'?
+26,Invasions,Who is William of Orange?,"Backed by 14,000 troops, who invaded England to restore, in his words, its 'religion, laws, and liberties'?"
+27,Landmarks,What is the Eiffel Tower?,"After its completion in the late 19th c., what was landmark was called 'a truly tragic street lamp' & a 'high & skinny pyramid of iron ladders'?"
+28,Geographic Name’s the Same,What is Dover?,"The busiest passenger port in the U.K., what shares its name with a capital of one of the original 13 states?"
+29,Names in the Bookstore,Who is Peter Mark Roget?,"This man made lists, perhaps to cope with depression; a set of lists he published in 1852 made whose name synonymous with a type of book?"
+30,U.S. History,Who is Dr. Samuel Mudd?,"An 1869 presidential pardon was granted to which man, due in part to a plea by the Medical Society of Harford County, Maryland?"
+31,American Literature,What is The Things They Carried?,"Letters, pocket knives, C rations & steel helmets are among the tangible items referred to in the title of what American literature modern war classic?"
+32,Nonfiction,What is The Communist Manifesto,"What nonfiction book has the line, 'The discovery of America…opened up fresh ground for the rising bourgeoisie'?"
+33,Laws in U.S. History,What is the Civil Rights Act?,"A radical Republican championed what 1875 act but the Supreme Court struck it down in 1883; a new version was passed 81 years later?"
+34,Names of Myth,Who is Helen of Troy?,"Whose brothers, Castor & Pollux, saved her after Theseus stole her away as a kid; a larger force would seek her later in life?"
+35,African Countries,What is Sudan?,"Once Africa's largest country in area, what African Country dropped to third in 2011 when a portion of it declared independence?"
+36,The Ancient World,What is Alexandria?,"The ancient writer Galen said books on ships arriving to what city's port were seized, originals kept & copies returned?"
+37,Famous Names,Who is Andy Warhol?,"For a special 1970s cookbook, who provided one simple recipe–a can of Campbell's tomato soup & 2 cans of milk?"
+38,People & Places,What is Guam?,"Thought to descend from people of Southeast Asia, the Chamorro make up what U.S. territory’s largest ethnic group?"
+39,Current World Leaders,What is the Philippines?,"In office from 2022, the president of what country has taken so many foreign trips a play on his name is 'Ferdinand Magellan Jr.'?"
+40,Writers & The South,Who is Tennessee Williams?,In 1939 which writer lived on Toulouse Street in the French Quarter & chose the professional name that bonded him to the South?
+41,National Parks,What is Yellowstone?,"What National Park is named for a river indigenous people called Mi tse a-da-zi, translated by French-speaking trappers as 'Pierre Jaune'?"
+42,Sports,Who are the Harlem Globetrotters?,"In 2010 who introduced the 4-point shot, 35 feet from the basket?"
+43,The U.S. Military,What is “Top Gun”?,Losses over Asia in the 1960s led to the establishment of the program known as what at a San Diego naval base in 1969?
+44,Art & Science,What is Halley’s Comet?,"A craft that visited what was named for Giotto, based on the story that 680 years earlier, the painter depicted it as the Star of Bethlehem?"
+45,Words From World War I,What is “tank”?,"In World War I, 'Cistern' & 'reservoir' were suggested names for what secret invention, but the British preferred this less clumsy monosyllable?"
+46,European History,What is Holy Roman Emperor?,"Until 1806, some German nobles included among their honors the title of 'Elector' for their role in selecting this personage?"
+47,Theater History,Who is Peter Pan?,"In 1904, wearing a harness, actress Nina Boucicault became the first to play what character onstage?"
+48,European Cities,What is Aachen?,"Alphabetically the first German city in encyclopedias, what was also the first one taken by the Allies in World War II?"
+49,Word Origins,What is mantra?,This Sanskrit word referring to a spoken word or phrase comes from a word for 'to think'?
+50,Inventions,What is barbed wire?,1917's 'Elements of Trench Warfare' said what Old West invention was 'difficult to destroy' & 'difficult to get through'?
+51,World War II,What is Schindler’s list?,"Mimi Reinhard, who never learned to type using more than 2 fingers, produced what in World War II with 1,100 names, including hers?"
+52,Mythology,What is the Golden Fleece?,"Poseidon carried off the maiden Theophane & turned her into a ewe; their offspring was the source of what mythical object?"
+53,Literature,What is Pride and Prejudice?,"Published in 2011, P.D. James' final novel, 'Death Comes to Pemberley', was a sequel to what novel from 200 years earlier?"
+54,U.S. State Names,What are Oregon & Nevada?,"5 U.S. states have 6-letter names; only which 2 west of the Mississippi River border each other?"
+55,Word Origins,What is passion?,"Originally relating to a story of suffering, what word now more commonly refers to strong emotion of any kind?"
+56,World Cinema,What is La Vie en Rose?,"The 2007 biopic called 'La Môme' in France, meaning 'The Kid', was released in the U.S. under what other French title?"
+57,History,What is Santa Maria?,"Returning home in 1493, Columbus stopped in the Azores at an island with what name, also something he'd lost off the Haiti coast?"
+58,Landmarks,What is a kremlin?,Pskov & Nizhny Novgorod are 2 of the cities that have a fortress called what?
+59,Foreign-Born Authors,Who is Vladimir Nabokov?,In the 1950s the New York Times said what author 'is writing about all lust' & his lecherous narrator 'is all of us'?
+60,Astronomy & Geography,What is Capricorn?,"At the winter solstice, the sun is in Sagittarius; it once appeared in what constellation, giving a geographic feature its name?"
+61,Television,What is Law & Order?,"Mike Post combined the sound of a slamming jail door, an anvil & 100 men stomping on a floor for what television series that debuted in 1990?"
+62,British Landmarks,What is the Tower of London?,"Like Sir Thomas More, 3 16th century English queens are buried at what British location?"
+63,Early American History,What are witches?,"In 1692 Increase Mather wrote, 'It were better that ten suspected' of these who 'escape, than that one innocent person … be condemned'?"
+64,Geography Mnemonics,What are Arkansas and Louisiana?,"The Geography Mnemonic Mimal, sometimes said to be the silhouette of a chef or elf, stands for Minnesota, Iowa, Missouri, and what other 2 states?"
+65,Business Milestones,What is the Ford Model T?,"What was first sold in 1908, at a price equivalent to about $27,000 today?"
+66,In The Bookstore,Who is Tom Clancy?,The name of what author dead since 2013 now appears on books written by a former U.S. marshal & a former Apache helicopter pilot?
+67,Historic Art,What is the Bayeux Tapestry?,The artwork once known in France as 'la tapisserie de la Reine Mathilde' is better known as what?
+68,Pop Stars,Who is Madonna?,In 2022 which pop star became the first woman to have a Billboard Top 10 album in 5 decades starting with the 1980s?
+69,Classic Tale Characters,Who is Scheherazade?,"In one 19th century translation, what female classic tale character 'perceived the dawn of day and ceased' speaking nearly 1,000 times?"
+70,USA,What is Jack Daniel’s?,"Ironically, though what company founded in the 1860s is Moore County, Tennessee's largest employer, Moore is a dry county?"
+71,Historic People,Who was William Bligh?,"After a 1789 event, who wrote, 'My first determination was to seek a supply of…water at Tofoa, & afterwards to sail for Tongataboo'?"
+72,The Movies,What is The Godfather?,Laurence Olivier & Ernest Borgnine were considered for the lead role & Sergio Leone to direct for what film that turned 50 in 2022?
+73,Continental Geography,What is Colombia?,"Until a 1903 secession, what country's contiguous territory spanned 2 continents?"
+74,Foreign-Born Authors,Who is Isabel Allende?,"Early in her career which foreign-born author translated romance novels into Spanish, often changing the dialogue to make the heroines smarter?"
+75,Historic Crimes,What is the Mona Lisa?,"Saying it was stolen by Napoleon, self-styled Italian patriot Vincenzo Peruggia took what in 1911?"
+76,U.S. Bodies of Water,What is Lake Mead?,"Continuing a downward trend, in July 2022 what US body of water was at 27% capacity, its lowest level since 1937 when it was first being filled?"
+77,Gods & Goddesses,Who is Aurora (or Eos)?,"Each morning which goddess began her ride in her chariot across the sky ahead of her brother Sol, or Helios?"
+78,America At War,What is the Battle of New Orleans?,"Until the Civil War, the Jan. 8 date of what American battle of dubious military importance but big morale value was a national holiday?"
+79,Children’s Books,What is The Velveteen Rabbit?,"Which children's book title character is told 'By the time you are real, most of your hair has been loved off your eyes drop out & you get shabby'?"
+80,TV Finales,What is Grace and Frankie?,"In a TV reunion over 40 years in the making, Dolly Parton appeared as an angel named Agnes in the final episode of what comedy in 2022?"
+81,American Poems,Who is Evangeline?,"In an 1847 American poem what character sees her town of Grand-Pré burned, but finally reunites with her beau for a kiss before his death?"
+82,Famous Names,Who is Banksy?,"In 2001 who published a book called 'Banging Your Head Against a Brick Wall'; in 2002, 'Existencilism'?"
+83,Children’s Lit,What is Charlotte’s Web?,The title object of what childrens book 'never looked more beautiful each strand held dozens of bright drops of early morning dew'?
+84,Classic Songs,What is “Here Comes Santa Claus”?,The shouts of excited children at a 1946 holiday parade are said to have inspired what perennial classic song favorite?
+85,Brand Names,What are Milk Duds?,"Unable to make what candies perfectly round, the confectioner embraced this flawed name for the product?"
+86,Countries of the World,What is Italy?,"What country is home to 58 UNESCO World Heritage Sites, more than any other country; the sites include a volcano & a lagoon?"
+87,Action Movies,What is Die Hard?,"What action movie's last line is 'If this is their idea of Christmas, I gotta be here for New Years'?"
+88,Presidential Facts,Who is Woodrow Wilson?,Only 3 presidents have married while in office— John Tyler was the first & which one was the last?
+89,19th Century Americans,Who is Frederick Douglass?,"Demonstrating the dignity & humanity of Black Americans, who sat for 160 known photographs, the most of any American in the 19th century?"
+90,Latin Phrases,What is “quid pro quo”?,"Originally, which Latin 3-word phrase referred to when a doctor or apothecary substituted one medicine for another?"
+91,1970s Movies,What is Monty Python and the Holy Grail?,The 1975 premiere of what movie comedy advertised free coconuts for the first thousand in the audience?
+92,Name’s The Same,What is Manhattan?,"A cocktail, an island & a WWII venture originally called 'Development of Substitute Materials' all bear what name?"
+93,U.S. Presidents,Who is Calvin Coolidge?,"Which US President was sworn in twice as President within 2 years, first by his father & then later by a former U.S. President?"
+94,Plays,What is The Tempest?,A 1609 story in which an exiled king of Bulgaria creates a sea palace with his magic may have inspired the plot of what play?
+95,Landmarks,What is the Berlin Wall?,"In 2009, during a 20th anniversary celebration, what landmark was called 'an edifice of fear. On Nov. 9, it became a place of joy'?"
+96,World Capitals,"What is Vienna, Austria?","Among what world capital's nicknames are the 'City of Classical Music' &, possibly in honor of a famous resident from 1860 to 1938, the 'City of Dreams'?"
+97,Language & Its Meanings,What is a night owl?,"Now meaning someone with nocturnal habits, what catches a sleeping dove in Shakespeare's 'Lucrece'?"
+98,Flags of Our Hemisphere,What is Brazil?,"The stars on what country's flag represent states, 26 of them; unlike the USA's, its 'federal district' gets its own 27th star?"
+99,Names in U.S. History,Who is Oliver Brown?,What father was the only man among the 13 plaintiffs in a US class-action case filed in 1951?
+100,Children’s Authors,"Who is Sarah? (from Sarah, Plain and Tall)","Reversing the story of what heroine she created, childrens author Patricia Maclachlan was born on the prairie but spent much of her life in New England?"
+,,,
+TOTALS,,,
diff --git a/examples/jeopardy/questions.txt b/examples/jeopardy/questions.txt
new file mode 100644
index 000000000..eea78a057
--- /dev/null
+++ b/examples/jeopardy/questions.txt
@@ -0,0 +1,100 @@
+Which man born in 1932 was the son of a percussionist in the CBS radio orchestra has been nominated for 53 Oscars?
+What work in English Literature says: 'The mind is its own place, & in itself can make a heaven of hell, a hell of heaven. What matter where, if I be still the same'?
+Known for more philosophical works, he wrote the play 'La Mandragola', in which Florentines are rewarded for immoral actions?
+James Cook's account of a 1774 visit where records an object 'near 27 feet long, and upwards of 8 feet over the breast or shoulders'?
+England's 'Bloody Assizes' & a 1685 life sentence for perjury were 2 main origins of which amendment to the U.S. Constitution?
+Which nobel peace price winners each lived at times on Vilakazi St. in Soweto , so it claims to be the world's only street home to 2 Nobel Peace Prize winners?
+In 1966, the year of who's death did he share plans for an experimental prototype community in Florida?
+Of the 13 nations through which the Equator passes, what is the only one whose coastline borders the Caribbean Sea?
+Which decorative items in fashion history get their name from their origin in the port city of Strasbourg, on the border of France & Germany?
+What 1980's movie is based on an off-Broadway play with just 3 characters and won the Best Picture Oscar & the actors in all 3 roles were nominated?
+A 2012 book review for which novelist noted subjects that 'sparked his ire': capital punishment, big tobacco & 'the plight of the unjustly convicted'?
+A 1940 headline about what 20th Century Eponym included 'failure', 'liability when it came to offense' & 'stout hearts no match for tanks'?
+Over 700 years after its traditional 1252 founding date, what port city became associated with a psychological response?
+The success of what brand has its roots with a hydrotherapy pump its cofounder created for his son, who had arthritis?
+In a periodical in 1807, what American Author called New York City 'Gotham, Gotham! Most enlightened of cities'?
+What symbol is a rotated V in math and a feeling of some marginalized or underrepresented people in society?
+Monty Norman, the composer of what character's theme, said the staccato riff conveyed sexiness, mystery & ruthlessness?
+What American Novelist served with an airman named Yohannan in World War II & despite what readers might think, he said he enjoyed his service?
+In what Medieval place did one of the participants in an 1170 event say, 'Let us away, knights; he will rise no more'?
+At one time a province of the Roman Empire, what African country kingdom is known to Arabic scholars as Al-Maghrib Al-Aqsa, 'the far west'?
+Congress relented in 1890 after what prospective state said it would wait 100 years rather than come in without the women?
+A writer & producer of what movie said he wanted it to be like a Western or James Bond film, 'only it takes place in the 30s'?
+In 1898 what's been called the first blockbuster art show was devoted to which artist & put on for Queen Wilhelmina's coronation?
+Part of the largest contiguous land empire during the 1200s & 1300s, today what is the world's second-largest landlocked country?
+A 2006 book was titled 'The Poem That Changed America:' What 'Fifty Years Later'?
+Backed by 14,000 troops, who invaded England to restore, in his words, its 'religion, laws, and liberties'?
+After its completion in the late 19th c., what was landmark was called 'a truly tragic street lamp' & a 'high & skinny pyramid of iron ladders'?
+The busiest passenger port in the U.K., what shares its name with a capital of one of the original 13 states?
+This man made lists, perhaps to cope with depression; a set of lists he published in 1852 made whose name synonymous with a type of book?
+An 1869 presidential pardon was granted to which man, due in part to a plea by the Medical Society of Harford County, Maryland?
+Letters, pocket knives, C rations & steel helmets are among the tangible items referred to in the title of what American literature modern war classic?
+What nonfiction book has the line, 'The discovery of America…opened up fresh ground for the rising bourgeoisie'?
+A radical Republican championed what 1875 act but the Supreme Court struck it down in 1883; a new version was passed 81 years later?
+Whose brothers, Castor & Pollux, saved her after Theseus stole her away as a kid; a larger force would seek her later in life?
+Once Africa's largest country in area, what African Country dropped to third in 2011 when a portion of it declared independence?
+The ancient writer Galen said books on ships arriving to what city's port were seized, originals kept & copies returned?
+For a special 1970s cookbook, who provided one simple recipe–a can of Campbell's tomato soup & 2 cans of milk?
+Thought to descend from people of Southeast Asia, the Chamorro make up what U.S. territory’s largest ethnic group?
+In office from 2022, the president of what country has taken so many foreign trips a play on his name is 'Ferdinand Magellan Jr.'?
+In 1939 which writer lived on Toulouse Street in the French Quarter & chose the professional name that bonded him to the South?
+What National Park is named for a river indigenous people called Mi tse a-da-zi, translated by French-speaking trappers as 'Pierre Jaune'?
+In 2010 who introduced the 4-point shot, 35 feet from the basket?
+Losses over Asia in the 1960s led to the establishment of the program known as what at a San Diego naval base in 1969?
+A craft that visited what was named for Giotto, based on the story that 680 years earlier, the painter depicted it as the Star of Bethlehem?
+In World War I, 'Cistern' & 'reservoir' were suggested names for what secret invention, but the British preferred this less clumsy monosyllable?
+Until 1806, some German nobles included among their honors the title of 'Elector' for their role in selecting this personage?
+In 1904, wearing a harness, actress Nina Boucicault became the first to play what character onstage?
+Alphabetically the first German city in encyclopedias, what was also the first one taken by the Allies in World War II?
+This Sanskrit word referring to a spoken word or phrase comes from a word for 'to think'?
+1917's 'Elements of Trench Warfare' said what Old West invention was 'difficult to destroy' & 'difficult to get through'?
+Mimi Reinhard, who never learned to type using more than 2 fingers, produced what in World War II with 1,100 names, including hers?
+Poseidon carried off the maiden Theophane & turned her into a ewe; their offspring was the source of what mythical object?
+Published in 2011, P.D. James' final novel, 'Death Comes to Pemberley', was a sequel to what novel from 200 years earlier?
+5 U.S. states have 6-letter names; only which 2 west of the Mississippi River border each other?
+Originally relating to a story of suffering, what word now more commonly refers to strong emotion of any kind?
+The 2007 biopic called 'La Môme' in France, meaning 'The Kid', was released in the U.S. under what other French title?
+Returning home in 1493, Columbus stopped in the Azores at an island with what name, also something he'd lost off the Haiti coast?
+Pskov & Nizhny Novgorod are 2 of the cities that have a fortress called what?
+In the 1950s the New York Times said what author 'is writing about all lust' & his lecherous narrator 'is all of us'?
+At the winter solstice, the sun is in Sagittarius; it once appeared in what constellation, giving a geographic feature its name?
+Mike Post combined the sound of a slamming jail door, an anvil & 100 men stomping on a floor for what television series that debuted in 1990?
+Like Sir Thomas More, 3 16th century English queens are buried at what British location?
+In 1692 Increase Mather wrote, 'It were better that ten suspected' of these who 'escape, than that one innocent person be condemned'?
+The Geography Mnemonic Mimal, sometimes said to be the silhouette of a chef or elf, stands for Minnesota, Iowa, Missouri, and what other 2 states?
+What was first sold in 1908, at a price equivalent to about $27,000 today?
+The name of what author dead since 2013 now appears on books written by a former U.S. marshal & a former Apache helicopter pilot?
+The artwork once known in France as 'la tapisserie de la Reine Mathilde' is better known as what?
+In 2022 which pop star became the first woman to have a Billboard Top 10 album in 5 decades starting with the 1980s?
+In one 19th century translation, what female classic tale character 'perceived the dawn of day and ceased' speaking nearly 1,000 times?
+Ironically, though what company founded in the 1860s is Moore County, Tennessee's largest employer, Moore is a dry county?
+After a 1789 event, who wrote, 'My first determination was to seek a supply of…water at Tofoa, & afterwards to sail for Tongataboo'?
+Laurence Olivier & Ernest Borgnine were considered for the lead role & Sergio Leone to direct for what film that turned 50 in 2022?
+Until a 1903 secession, what country's contiguous territory spanned 2 continents?
+Early in her career which foreign-born author translated romance novels into Spanish, often changing the dialogue to make the heroines smarter?
+Saying it was stolen by Napoleon, self-styled Italian patriot Vincenzo Peruggia took what in 1911?
+Continuing a downward trend, in July 2022 what US body of water was at 27% capacity, its lowest level since 1937 when it was first being filled?
+Each morning which goddess began her ride in her chariot across the sky ahead of her brother Sol, or Helios?
+Until the Civil War, the Jan. 8 date of what American battle of dubious military importance but big morale value was a national holiday?
+Which children's book title character is told 'By the time you are real, most of your hair has been loved off your eyes drop out & you get shabby'?
+In a TV reunion over 40 years in the making, Dolly Parton appeared as an angel named Agnes in the final episode of what comedy in 2022?
+In an 1847 American poem what character sees her town of Grand-Pré burned, but finally reunites with her beau for a kiss before his death?
+In 2001 who published a book called 'Banging Your Head Against a Brick Wall'; in 2002, 'Existencilism'?
+The title object of what childrens book 'never looked more beautiful each strand held dozens of bright drops of early morning dew'?
+The shouts of excited children at a 1946 holiday parade are said to have inspired what perennial classic song favorite?
+Unable to make what candies perfectly round, the confectioner embraced this flawed name for the product?
+What country is home to 58 UNESCO World Heritage Sites, more than any other country; the sites include a volcano & a lagoon?
+What action movie's last line is 'If this is their idea of Christmas, I gotta be here for New Years'?
+Only 3 presidents have married while in office— John Tyler was the first & which one was the last?
+Demonstrating the dignity & humanity of Black Americans, who sat for 160 known photographs, the most of any American in the 19th century?
+Originally, which Latin 3-word phrase referred to when a doctor or apothecary substituted one medicine for another?
+The 1975 premiere of what movie comedy advertised free coconuts for the first thousand in the audience?
+A cocktail, an island & a WWII venture originally called 'Development of Substitute Materials' all bear what name?
+Which US President was sworn in twice as President within 2 years, first by his father & then later by a former U.S. President?
+A 1609 story in which an exiled king of Bulgaria creates a sea palace with his magic may have inspired the plot of what play?
+In 2009, during a 20th anniversary celebration, what landmark was called 'an edifice of fear. On Nov. 9, it became a place of joy'?
+Among what world capital's nicknames are the 'City of Classical Music' &, possibly in honor of a famous resident from 1860 to 1938, the 'City of Dreams'?
+Now meaning someone with nocturnal habits, what catches a sleeping dove in Shakespeare's 'Lucrece'?
+The stars on what country's flag represent states, 26 of them; unlike the USA's, its 'federal district' gets its own 27th star?
+What father was the only man among the 13 plaintiffs in a US class-action case filed in 1951?
+Reversing the story of what heroine she created, childrens author Patricia Maclachlan was born on the prairie but spent much of her life in New England?
diff --git a/examples/main/CMakeLists.txt b/examples/main/CMakeLists.txt
index b2dcc2910..c364242fb 100644
--- a/examples/main/CMakeLists.txt
+++ b/examples/main/CMakeLists.txt
@@ -2,3 +2,6 @@ set(TARGET main)
add_executable(${TARGET} main.cpp)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
+if(TARGET BUILD_INFO)
+ add_dependencies(${TARGET} BUILD_INFO)
+endif()
diff --git a/examples/main/README.md b/examples/main/README.md
index f09e7ba97..7c03f92c8 100644
--- a/examples/main/README.md
+++ b/examples/main/README.md
@@ -1,3 +1,289 @@
-# main
+# llama.cpp/example/main
-TODO
+This example program allows you to use various LLaMA language models in an easy and efficient way. It is specifically designed to work with the [llama.cpp](https://github.com/ggerganov/llama.cpp) project, which provides a plain C/C++ implementation with optional 4-bit quantization support for faster, lower memory inference, and is optimized for desktop CPUs. This program can be used to perform various inference tasks with LLaMA models, including generating text based on user-provided prompts and chat-like interactions with reverse prompts.
+
+## Table of Contents
+
+1. [Quick Start](#quick-start)
+2. [Common Options](#common-options)
+3. [Input Prompts](#input-prompts)
+4. [Interaction](#interaction)
+5. [Context Management](#context-management)
+6. [Generation Flags](#generation-flags)
+7. [Performance Tuning and Memory Options](#performance-tuning-and-memory-options)
+8. [Additional Options](#additional-options)
+
+## Quick Start
+
+To get started right away, run the following command, making sure to use the correct path for the model you have:
+
+#### Unix-based systems (Linux, macOS, etc.):
+
+```bash
+./main -m models/7B/ggml-model.bin --prompt "Once upon a time"
+```
+
+#### Windows:
+
+```powershell
+main.exe -m models\7B\ggml-model.bin --prompt "Once upon a time"
+```
+
+For an interactive experience, try this command:
+
+#### Unix-based systems (Linux, macOS, etc.):
+
+```bash
+./main -m models/7B/ggml-model.bin -n -1 --color -r "User:" --in-prefix " " \
+'User: Hi
+AI: Hello. I am an AI chatbot. Would you like to talk?
+User: Sure!
+AI: What would you like to talk about?
+User:'
+```
+
+#### Windows:
+
+```powershell
+main.exe -m models\7B\ggml-model.bin -n -1 --color -r "User:" --in-prefix " " -e --prompt "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:"
+```
+
+The following command generates "infinite" text from a starting prompt (you can use `Ctrl-C` to stop it):
+
+#### Unix-based systems (Linux, macOS, etc.):
+
+```bash
+./main -m models/7B/ggml-model.bin --ignore-eos -n -1 --random-prompt
+```
+
+#### Windows:
+
+```powershell
+main.exe -m models\7B\ggml-model.bin --ignore-eos -n -1 --random-prompt
+```
+
+## Common Options
+
+In this section, we cover the most commonly used options for running the `main` program with the LLaMA models:
+
+- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
+- `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses.
+- `-ins, --instruct`: Run the program in instruction mode, which is particularly useful when working with Alpaca models.
+- `-n N, --n_predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
+- `-c N, --ctx_size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+
+## Input Prompts
+
+The `main` program provides several ways to interact with the LLaMA models using input prompts:
+
+- `--prompt PROMPT`: Provide a prompt directly as a command-line option.
+- `--file FNAME`: Provide a file containing a prompt or multiple prompts.
+- `--interactive-first`: Run the program in interactive mode and wait for input right away. (More on this below.)
+- `--random-prompt`: Start with a randomized prompt.
+
+## Interaction
+
+The `main` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive`, `--interactive-first`, and `--instruct`.
+
+In interactive mode, users can participate in text generation by injecting their input during the process. Users can press `Ctrl+C` at any time to interject and type their input, followed by pressing `Return` to submit it to the LLaMA model. To submit additional lines without finalizing input, users can end the current line with a backslash (`\`) and continue typing.
+
+### Interaction Options
+
+- `-i, --interactive`: Run the program in interactive mode, allowing users to engage in real-time conversations or provide specific instructions to the model.
+- `--interactive-first`: Run the program in interactive mode and immediately wait for user input before starting the text generation.
+- `-ins, --instruct`: Run the program in instruction mode, which is specifically designed to work with Alpaca models that excel in completing tasks based on user instructions.
+- `--color`: Enable colorized output to visually distinguish between prompts, user input, and generated text.
+
+By understanding and utilizing these interaction options, you can create engaging and dynamic experiences with the LLaMA models, tailoring the text generation process to your specific needs.
+
+### Reverse Prompts
+
+Reverse prompts are a powerful way to create a chat-like experience with a LLaMA model by pausing the text generation when specific text strings are encountered:
+
+- `-r PROMPT, --reverse-prompt PROMPT`: Specify one or multiple reverse prompts to pause text generation and switch to interactive mode. For example, `-r "User:"` can be used to jump back into the conversation whenever it's the user's turn to speak. This helps create a more interactive and conversational experience. However, the reverse prompt doesn't work when it ends with a space.
+
+To overcome this limitation, you can use the `--in-prefix` flag to add a space or any other characters after the reverse prompt.
+
+### In-Prefix
+
+The `--in-prefix` flag is used to add a prefix to your input; primarily, this is used to insert a space after the reverse prompt. Here's an example of how to use the `--in-prefix` flag in conjunction with the `--reverse-prompt` flag:
+
+```sh
+./main -r "User:" --in-prefix " "
+```
+
+### In-Suffix
+
+The `--in-suffix` flag is used to add a suffix after your input. This is useful for adding an "Assistant:" prompt after the user's input. It's added after the new-line character (`\n`) that's automatically added to the end of the user's input. Here's an example of how to use the `--in-suffix` flag in conjunction with the `--reverse-prompt` flag:
+
+```sh
+./main -r "User:" --in-prefix " " --in-suffix "Assistant:"
+```
+
+### Instruction Mode
+
+Instruction mode is particularly useful when working with Alpaca models, which are designed to follow user instructions for specific tasks:
+
+- `-ins, --instruct`: Enable instruction mode to leverage the capabilities of Alpaca models in completing tasks based on user-provided instructions.
+
+Technical detail: the user's input is internally prefixed with the reverse prompt (or `### Instruction:` as the default), and followed by `### Response:` (except if you just press Return without any input, to keep generating a longer response).
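+
+For example (using the default prompts; the exact whitespace may differ), typing `Tell me a joke.` in instruction mode results in the model seeing something like:
+
+```
+### Instruction:
+
+Tell me a joke.
+
+### Response:
+```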
+
+## Context Management
+
+During text generation, LLaMA models have a limited context size, which means they can only consider a certain number of tokens from the input and generated text. When the context fills up, the model resets internally, potentially losing some information from the beginning of the conversation or instructions. Context management options help maintain continuity and coherence in these situations.
+
+### Context Size
+
+The `--ctx_size` option allows you to set the size of the prompt context used by the LLaMA models during text generation. A larger context size helps the model to better comprehend and generate responses for longer input or conversations.
+
+- `-c N, --ctx_size N`: Set the size of the prompt context (default: 512). The LLaMA models were built with a context of 2048, which will yield the best results on longer input/inference. However, increasing the context size beyond 2048 may lead to unpredictable results.
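+
+Example usage: `-c 2048`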
+
+### Keep Prompt
+
+The `--keep` option allows users to retain the original prompt when the model runs out of context, ensuring a connection to the initial instruction or conversation topic is maintained.
+
+- `--keep N`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
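+
+Example usage: `--keep -1`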
+
+By utilizing context management options like `--ctx_size` and `--keep`, you can maintain a more coherent and consistent interaction with the LLaMA models, ensuring that the generated text remains relevant to the original prompt or conversation.
+
+## Generation Flags
+
+The following options allow you to control the text generation process and fine-tune the diversity, creativity, and quality of the generated text according to your needs. By adjusting these options and experimenting with different combinations of values, you can find the best settings for your specific use case.
+
+### Number of Tokens to Predict
+
+- `-n N, --n_predict N`: Set the number of tokens to predict when generating text (default: -1, meaning no limit).
+
+The `--n_predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text. A value of -1 will cause text to be generated without limit.
+
+It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n_predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter.
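+
+Example usage: `-n -1 --ignore-eos`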
+
+### Temperature
+
+- `--temp N`: Adjust the randomness of the generated text (default: 0.8).
+
+Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run.
+
+Example usage: `--temp 0.5`
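+
+Under the hood, temperature rescales the logits before the softmax (the scaling step is `llama_sample_temperature` in the source). A minimal illustrative sketch combining both steps:
+
+```cpp
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <vector>
+
+// Softmax over logits scaled by 1/temp: temp < 1 sharpens the distribution,
+// temp > 1 flattens it.
+std::vector<float> softmax_with_temp(std::vector<float> logits, float temp) {
+    const float max_logit = *std::max_element(logits.begin(), logits.end());
+    float sum = 0.0f;
+    for (float & l : logits) { l = std::exp((l - max_logit) / temp); sum += l; }
+    for (float & l : logits) { l /= sum; }
+    return logits;
+}
+
+int main() {
+    for (float p : softmax_with_temp({2.0f, 1.0f, 0.0f}, 0.5f)) printf("%.3f ", p);
+    printf(" <- temp 0.5 (sharper)\n");
+    for (float p : softmax_with_temp({2.0f, 1.0f, 0.0f}, 1.5f)) printf("%.3f ", p);
+    printf(" <- temp 1.5 (flatter)\n");
+}
+```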
+
+### Repeat Penalty
+
+- `--repeat_penalty N`: Control the repetition of token sequences in the generated text (default: 1.1).
+- `--repeat_last_n N`: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx_size).
+- `--no-penalize-nl`: Disable penalization for newline tokens when applying the repeat penalty.
+
+The `repeat_penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1.
+
+The `repeat_last_n` option controls the number of tokens in the history to consider for penalizing repetition. A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens. A value of 0 disables the penalty, and a value of -1 sets the number of tokens considered equal to the context size (`ctx_size`).
+
+Use the `--no-penalize-nl` option to disable newline penalization when applying the repeat penalty. This option is particularly useful for generating chat conversations, dialogues, code, poetry, or any text where newline tokens play a significant role in structure and formatting. Disabling newline penalization helps maintain the natural flow and intended formatting in these specific use cases.
+
+Example usage: `--repeat_penalty 1.15 --repeat_last_n 128 --no-penalize-nl`
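+
+Conceptually, the penalty weakens the logits of recently seen tokens, as in this illustrative sketch of the approach used by `llama_sample_repetition_penalty`: positive logits are divided by the penalty and negative ones multiplied, so the adjustment always disfavors the token.
+
+```cpp
+#include <cstdio>
+#include <unordered_set>
+#include <vector>
+
+// Illustrative repeat penalty over the logits of tokens in the recent window.
+void apply_repeat_penalty(std::vector<float> & logits,
+                          const std::vector<int> & last_tokens, float penalty) {
+    std::unordered_set<int> seen(last_tokens.begin(), last_tokens.end());
+    for (int id : seen) {
+        logits[id] = logits[id] > 0.0f ? logits[id] / penalty : logits[id] * penalty;
+    }
+}
+
+int main() {
+    std::vector<float> logits = {3.0f, -1.0f, 0.5f};
+    apply_repeat_penalty(logits, {0, 1}, 1.1f);
+    printf("%.3f %.3f %.3f\n", logits[0], logits[1], logits[2]); // 2.727 -1.100 0.500
+}
+```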
+
+### Top-K Sampling
+
+- `--top_k N`: Limit the next token selection to the K most probable tokens (default: 40).
+
+Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text. The default value is 40.
+
+Example usage: `--top_k 30`
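+
+A self-contained sketch of the filtering step (illustrative names; the actual code routes through `llama_sample_top_k` on a `llama_token_data` array):
+
+```cpp
+#include <algorithm>
+#include <cstdio>
+#include <vector>
+
+struct candidate { int id; float logit; };
+
+// Illustrative top-k: keep only the k candidates with the highest logits.
+void top_k_filter(std::vector<candidate> & cands, size_t k) {
+    k = std::min(k, cands.size());
+    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
+        [](const candidate & a, const candidate & b) { return a.logit > b.logit; });
+    cands.resize(k);
+}
+
+int main() {
+    std::vector<candidate> cands = {{0, 0.1f}, {1, 2.0f}, {2, 1.5f}, {3, -1.0f}};
+    top_k_filter(cands, 2);
+    for (const auto & c : cands) printf("id=%d logit=%.1f\n", c.id, c.logit); // ids 1, 2
+}
+```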
+
+### Top-P Sampling
+
+- `--top_p N`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
+
+Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. The default value is 0.9.
+
+Example usage: `--top_p 0.95`
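+
+The corresponding filtering step (again an illustrative sketch; the real implementation is `llama_sample_top_p`) keeps the smallest set of highest-probability tokens whose cumulative probability reaches p:
+
+```cpp
+#include <algorithm>
+#include <cstdio>
+#include <vector>
+
+struct candidate { int id; float prob; };
+
+// Illustrative top-p: probabilities are assumed to sum to 1.
+void top_p_filter(std::vector<candidate> & cands, float p) {
+    std::sort(cands.begin(), cands.end(),
+        [](const candidate & a, const candidate & b) { return a.prob > b.prob; });
+    float cum = 0.0f;
+    size_t keep = cands.size();
+    for (size_t i = 0; i < cands.size(); ++i) {
+        cum += cands[i].prob;
+        if (cum >= p) { keep = i + 1; break; }
+    }
+    cands.resize(keep);
+}
+
+int main() {
+    std::vector<candidate> cands = {{0, 0.5f}, {1, 0.3f}, {2, 0.15f}, {3, 0.05f}};
+    top_p_filter(cands, 0.9f);
+    for (const auto & c : cands) printf("id=%d prob=%.2f\n", c.id, c.prob); // keeps ids 0, 1, 2
+}
+```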
+
+### Tail Free Sampling (TFS)
+
+- `--tfs N`: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).
+
+Tail free sampling (TFS) is a text generation technique that aims to reduce the impact of less likely tokens, which may be less relevant, less coherent, or nonsensical, on the output. The method sorts the candidate tokens by probability and examines how quickly that sorted distribution flattens out (its second derivative) to locate the start of the low-probability "tail", which is then removed. The parameter z controls how much of the tail is cut off: a value of 1.0 disables TFS, while values below 1.0 prune the tail, and the lower the value, the more aggressive the pruning (e.g., 0.95 removes only the flattest part of the tail).
+
+Example usage: `--tfs 0.95`
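+
+A simplified, self-contained sketch of the idea (illustrative; the exact cutoff bookkeeping in `llama_sample_tail_free` differs slightly):
+
+```cpp
+#include <cmath>
+#include <cstdio>
+#include <vector>
+
+// Illustrative TFS: probs must be sorted in descending order and sum to 1.
+// z = 1.0 keeps everything; smaller z trims more of the tail.
+std::vector<float> tfs_filter(std::vector<float> probs, float z) {
+    if (z >= 1.0f || probs.size() <= 2) return probs;
+    std::vector<float> d2(probs.size() - 2);
+    float sum = 0.0f;
+    for (size_t i = 0; i < d2.size(); ++i) {
+        d2[i] = std::fabs((probs[i] - probs[i + 1]) - (probs[i + 1] - probs[i + 2]));
+        sum += d2[i];
+    }
+    if (sum == 0.0f) return probs; // perfectly flat distribution: nothing to trim
+    float cum = 0.0f;
+    size_t keep = probs.size();
+    for (size_t i = 0; i < d2.size(); ++i) {
+        cum += d2[i] / sum;                   // cumulative normalized 2nd derivative
+        if (cum > z) { keep = i + 1; break; } // the tail starts past this point
+    }
+    probs.resize(keep);
+    return probs;
+}
+
+int main() {
+    const std::vector<float> kept = tfs_filter({0.6f, 0.25f, 0.1f, 0.04f, 0.01f}, 0.9f);
+    printf("kept %zu of 5 tokens\n", kept.size()); // kept 2 of 5 tokens
+}
+```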
+
+### Locally Typical Sampling
+
+- `--typical N`: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled).
+
+Locally typical sampling promotes the generation of contextually coherent and diverse text by sampling tokens that are typical or expected based on the surrounding context. By setting the parameter p between 0 and 1, you can control the balance between producing text that is locally coherent and diverse. A value closer to 1 will promote more contextually coherent tokens, while a value closer to 0 will promote more diverse tokens. A value equal to 1 disables locally typical sampling.
+
+Example usage: `--typical 0.9`
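+
+As an illustrative sketch of the mechanism (the real step is `llama_sample_typical`): tokens are ranked by how close their surprisal (-log p) is to the entropy of the whole distribution, and the smallest such set whose cumulative probability reaches p is kept.
+
+```cpp
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <vector>
+
+struct candidate { int id; float prob; };
+
+// Illustrative locally typical sampling.
+void typical_filter(std::vector<candidate> & cands, float p) {
+    float entropy = 0.0f;
+    for (const auto & c : cands) entropy -= c.prob * std::log(c.prob);
+    std::sort(cands.begin(), cands.end(), [&](const candidate & a, const candidate & b) {
+        return std::fabs(-std::log(a.prob) - entropy) < std::fabs(-std::log(b.prob) - entropy);
+    });
+    float cum = 0.0f;
+    size_t keep = cands.size();
+    for (size_t i = 0; i < cands.size(); ++i) {
+        cum += cands[i].prob;
+        if (cum >= p) { keep = i + 1; break; }
+    }
+    cands.resize(keep);
+}
+
+int main() {
+    std::vector<candidate> cands = {{0, 0.5f}, {1, 0.3f}, {2, 0.15f}, {3, 0.05f}};
+    typical_filter(cands, 0.9f);
+    for (const auto & c : cands) printf("id=%d prob=%.2f\n", c.id, c.prob);
+}
+```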
+
+### Mirostat Sampling
+
+- `--mirostat N`: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).
+- `--mirostat_lr N`: Set the Mirostat learning rate, parameter eta (default: 0.1).
+- `--mirostat_ent N`: Set the Mirostat target entropy, parameter tau (default: 5.0).
+
+Mirostat is an algorithm that actively maintains the quality of generated text within a desired range during text generation. It aims to strike a balance between coherence and diversity, avoiding low-quality output caused by excessive repetition (boredom traps) or incoherence (confusion traps).
+
+The `--mirostat_lr` option sets the Mirostat learning rate (eta). The learning rate influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. The default value is `0.1`.
+
+The `--mirostat_ent` option sets the Mirostat target entropy (tau), which represents the desired perplexity value for the generated text. Adjusting the target entropy allows you to control the balance between coherence and diversity in the generated text. A lower value will result in more focused and coherent text, while a higher value will lead to more diverse and potentially less coherent text. The default value is `5.0`.
+
+Example usage: `--mirostat 2 --mirostat_lr 0.05 --mirostat_ent 3.0`
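+
+The core of the algorithm is a feedback loop on a running surprise threshold `mu`. Below is a minimal sketch of just the update step (illustrative; `llama_sample_token_mirostat` and `llama_sample_token_mirostat_v2` also handle candidate truncation and the actual sampling):
+
+```cpp
+#include <cmath>
+#include <cstdio>
+
+// Illustrative Mirostat feedback step: after sampling a token with probability
+// p, compare its surprise (-log2 p) against the target tau and nudge the
+// running threshold mu by the learning rate eta.
+float mirostat_update(float mu, float sampled_prob, float tau, float eta) {
+    const float surprise = -std::log2(sampled_prob);
+    return mu - eta * (surprise - tau);
+}
+
+int main() {
+    const float tau = 5.0f, eta = 0.1f;
+    float mu = 2.0f * tau; // conventional starting point
+    // A very predictable token (p = 0.9) has surprise well below tau, so the
+    // error is negative and mu rises, allowing more surprising tokens next step.
+    mu = mirostat_update(mu, 0.9f, tau, eta);
+    printf("mu = %.3f\n", mu); // mu = 10.485
+}
+```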
+
+### Logit Bias
+
+- `-l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS`: Modify the likelihood of a token appearing in the generated text completion.
+
+The logit bias option allows you to manually adjust the likelihood of specific tokens appearing in the generated text. By providing a token ID and a positive or negative bias value, you can increase or decrease the probability of that token being generated.
+
+For example, use `--logit-bias 15043+1` to increase the likelihood of the token `Hello`, or `--logit-bias 15043-1` to decrease its likelihood. Using a value of negative infinity, `--logit-bias 15043-inf`, ensures that the token `Hello` is never produced.
+
+A more practical use case might be to prevent the generation of `\code{begin}` and `\code{end}` by setting the `\` token (29905) to negative infinity with `-l 29905-inf`. (This is due to the prevalence of LaTeX codes that show up in LLaMA model inference.)
+
+Example usage: `--logit-bias 29905-inf`
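+
+Mechanically, each bias is simply added to the matching logit before sampling, mirroring the loop over `params.logit_bias` in `main.cpp`. An illustrative sketch:
+
+```cpp
+#include <cstdio>
+#include <limits>
+#include <map>
+#include <vector>
+
+// Illustrative logit-bias application: add each bias to the matching logit;
+// a bias of -infinity makes the token unselectable.
+void apply_logit_bias(std::vector<float> & logits, const std::map<int, float> & bias) {
+    for (const auto & kv : bias) logits[kv.first] += kv.second;
+}
+
+int main() {
+    std::vector<float> logits = {1.0f, 2.0f, 3.0f};
+    apply_logit_bias(logits, {{1, 1.0f}, {2, -std::numeric_limits<float>::infinity()}});
+    printf("%.1f %.1f %.1f\n", logits[0], logits[1], logits[2]); // 1.0 3.0 -inf
+}
+```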
+
+### RNG Seed
+
+- `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).
+
+The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run.
+
+## Performance Tuning and Memory Options
+
+These options help improve the performance and memory usage of the LLaMA models. By adjusting these settings, you can fine-tune the model's behavior to better suit your system's capabilities and achieve optimal performance for your specific use case.
+
+### Number of Threads
+
+- `-t N, --threads N`: Set the number of threads to use during computation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). Using the correct number of threads can greatly improve performance.
+
+### Mlock
+
+- `--mlock`: Lock the model in memory, preventing it from being swapped out when memory-mapped. This can improve performance but trades away some of the advantages of memory-mapping by requiring more RAM to run and potentially slowing down load times as the model loads into RAM.
+
+### No Memory Mapping
+
+- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you're not using `--mlock`. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all.
+
+### Memory Float 32
+
+- `--memory_f32`: Use 32-bit floats instead of 16-bit floats for memory key+value, allowing higher quality inference at the cost of higher memory usage.
+
+### Batch Size
+
+- `-b N, --batch_size N`: Set the batch size for prompt processing (default: 512). This large batch size benefits users who have BLAS installed and enabled during the build. If BLAS is not enabled ("BLAS=0"), you can use a smaller number, such as 8, to see the prompt progress as it is evaluated.
+
+### Prompt Caching
+
+- `--prompt-cache FNAME`: Specify a file to cache the model state after the initial prompt. This can significantly speed up the startup time when you're using longer prompts. The file is created during the first run and is reused and updated in subsequent runs.
+
+### Quantization
+
+For information about 4-bit quantization, which can significantly improve performance and reduce memory usage, please refer to llama.cpp's primary [README](../../README.md#prepare-data--run).
+
+## Additional Options
+
+These options provide extra functionality and customization when running the LLaMA models:
+
+- `-h, --help`: Display a help message showing all available options and their default values. This is particularly useful for checking the latest options and default values, as they can change frequently, and the information in this document may become outdated.
+- `--verbose-prompt`: Print the prompt before generating text.
+- `--mtest`: Test the model's functionality by running a series of tests to ensure it's working properly.
+- `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.
+- `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index d44579aff..f23a4bc72 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -5,6 +5,7 @@
#include "common.h"
#include "llama.h"
+#include "build-info.h"
#include <cassert>
#include <cinttypes>
@@ -21,21 +22,26 @@
#include <signal.h>
#include <unistd.h>
#elif defined (_WIN32)
+#define WIN32_LEAN_AND_MEAN
+#define NOMINMAX
+#include <windows.h>
#include <signal.h>
#endif
static console_state con_st;
+static llama_context ** g_ctx;
static bool is_interacting = false;
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
void sigint_handler(int signo) {
- set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
- printf("\n"); // this also force flush stdout.
if (signo == SIGINT) {
if (!is_interacting) {
is_interacting=true;
} else {
+ console_cleanup(con_st);
+ printf("\n");
+ llama_print_timings(*g_ctx);
_exit(130);
}
}
@@ -53,10 +59,9 @@ int main(int argc, char ** argv) {
// save choice to use color for later
// (note for later: this is a slightly awkward choice)
con_st.use_color = params.use_color;
-
-#if defined (_WIN32)
- win32_console_init(params.use_color);
-#endif
+ con_st.multiline_input = params.multiline_input;
+ console_init(con_st);
+ atexit([]() { console_cleanup(con_st); });
if (params.perplexity) {
printf("\n************\n");
@@ -79,11 +84,13 @@ int main(int argc, char ** argv) {
"expect poor results\n", __func__, params.n_ctx);
}
- if (params.seed <= 0) {
+ fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);
+
+ if (params.seed < 0) {
params.seed = time(NULL);
}
- fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
+ fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
@@ -94,35 +101,13 @@ int main(int argc, char ** argv) {
//bool is_prime(int n) {)";
llama_context * ctx;
+ g_ctx = &ctx;
- // load the model
- {
- auto lparams = llama_context_default_params();
-
- lparams.n_ctx = params.n_ctx;
- lparams.n_parts = params.n_parts;
- lparams.seed = params.seed;
- lparams.f16_kv = params.memory_f16;
- lparams.use_mmap = params.use_mmap;
- lparams.use_mlock = params.use_mlock;
-
- ctx = llama_init_from_file(params.model.c_str(), lparams);
-
- if (ctx == NULL) {
- fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
- return 1;
- }
- }
-
- if (!params.lora_adapter.empty()) {
- int err = llama_apply_lora_from_file(ctx,
- params.lora_adapter.c_str(),
- params.lora_base.empty() ? NULL : params.lora_base.c_str(),
- params.n_threads);
- if (err != 0) {
- fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
- return 1;
- }
+ // load the model and apply lora adapter, if any
+ ctx = llama_init_from_gpt_params(params);
+ if (ctx == NULL) {
+ fprintf(stderr, "%s: error: unable to load model\n", __func__);
+ return 1;
}
// print system information
@@ -154,6 +139,31 @@ int main(int argc, char ** argv) {
// Add a space in front of the first character to match OG llama tokenizer behavior
params.prompt.insert(0, 1, ' ');
+ std::string path_session = params.path_prompt_cache;
+ std::vector<llama_token> session_tokens;
+
+ if (!path_session.empty()) {
+ fprintf(stderr, "%s: attempting to load saved session from '%s'\n", __func__, path_session.c_str());
+
+ // fopen to check for existing session
+ FILE * fp = std::fopen(path_session.c_str(), "rb");
+ if (fp != NULL) {
+ std::fclose(fp);
+
+ session_tokens.resize(params.n_ctx);
+ size_t n_token_count_out = 0;
+ if (!llama_load_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.capacity(), &n_token_count_out)) {
+ fprintf(stderr, "%s: error: failed to load session file '%s'\n", __func__, path_session.c_str());
+ return 1;
+ }
+ session_tokens.resize(n_token_count_out);
+
+ fprintf(stderr, "%s: loaded a session with prompt size of %d tokens\n", __func__, (int) session_tokens.size());
+ } else {
+ fprintf(stderr, "%s: session file does not exist, will create\n", __func__);
+ }
+ }
+
// tokenize the prompt
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
@@ -164,8 +174,28 @@ int main(int argc, char ** argv) {
return 1;
}
+ // debug message about similarity of saved session, if applicable
+ size_t n_matching_session_tokens = 0;
+ if (session_tokens.size()) {
+ for (llama_token id : session_tokens) {
+ if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
+ break;
+ }
+ n_matching_session_tokens++;
+ }
+ if (n_matching_session_tokens >= embd_inp.size()) {
+ fprintf(stderr, "%s: session file has exact match for prompt!\n", __func__);
+ } else if (n_matching_session_tokens < (embd_inp.size() / 2)) {
+ fprintf(stderr, "%s: warning: session file has low similarity to prompt (%zu / %zu tokens); will mostly be reevaluated\n",
+ __func__, n_matching_session_tokens, embd_inp.size());
+ } else {
+ fprintf(stderr, "%s: session file matches %zu / %zu tokens of prompt\n",
+ __func__, n_matching_session_tokens, embd_inp.size());
+ }
+ }
+
// number of tokens to keep when resetting context
- if (params.n_keep < 0 || params.n_keep > (int)embd_inp.size() || params.instruct) {
+ if (params.n_keep < 0 || params.n_keep > (int) embd_inp.size() || params.instruct) {
params.n_keep = (int)embd_inp.size();
}
@@ -175,7 +205,7 @@ int main(int argc, char ** argv) {
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
- params.interactive_start = true;
+ params.interactive_first = true;
params.antiprompt.push_back("### Instruction:\n\n");
}
@@ -212,7 +242,10 @@ int main(int argc, char ** argv) {
sigint_action.sa_flags = 0;
sigaction(SIGINT, &sigint_action, NULL);
#elif defined (_WIN32)
- signal(SIGINT, sigint_handler);
+ auto console_ctrl_handler = [](DWORD ctrl_type) -> BOOL {
+ return (ctrl_type == CTRL_C_EVENT) ? (sigint_handler(SIGINT), true) : false;
+ };
+ SetConsoleCtrlHandler(static_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
#endif
fprintf(stderr, "%s: interactive mode on.\n", __func__);
@@ -226,9 +259,13 @@ int main(int argc, char ** argv) {
if (!params.input_prefix.empty()) {
fprintf(stderr, "Input prefix: '%s'\n", params.input_prefix.c_str());
}
+
+ if (!params.input_suffix.empty()) {
+ fprintf(stderr, "Input suffix: '%s'\n", params.input_suffix.c_str());
+ }
}
- fprintf(stderr, "sampling: temp = %f, top_k = %d, top_p = %f, repeat_last_n = %i, repeat_penalty = %f\n",
- params.temp, params.top_k, params.top_p, params.repeat_last_n, params.repeat_penalty);
+ fprintf(stderr, "sampling: repeat_last_n = %d, repeat_penalty = %f, presence_penalty = %f, frequency_penalty = %f, top_k = %d, tfs_z = %f, top_p = %f, typical_p = %f, temp = %f, mirostat = %d, mirostat_lr = %f, mirostat_ent = %f\n",
+ params.repeat_last_n, params.repeat_penalty, params.presence_penalty, params.frequency_penalty, params.top_k, params.tfs_z, params.top_p, params.typical_p, params.temp, params.mirostat, params.mirostat_eta, params.mirostat_tau);
fprintf(stderr, "generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", n_ctx, params.n_batch, params.n_predict, params.n_keep);
fprintf(stderr, "\n\n");
@@ -237,24 +274,35 @@ int main(int argc, char ** argv) {
std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
if (params.interactive) {
+ const char *control_message;
+ if (con_st.multiline_input) {
+ control_message = " - To return control to LLaMa, end your input with '\\'.\n"
+ " - To return control without starting a new line, end your input with '/'.\n";
+ } else {
+ control_message = " - Press Return to return control to LLaMa.\n"
+ " - To return control without starting a new line, end your input with '/'.\n"
+ " - If you want to submit another line, end your input with '\\'.\n";
+ }
fprintf(stderr, "== Running in interactive mode. ==\n"
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
" - Press Ctrl+C to interject at any time.\n"
#endif
- " - Press Return to return control to LLaMa.\n"
- " - If you want to submit another line, end your input in '\\'.\n\n");
- is_interacting = params.interactive_start;
+ "%s\n", control_message);
+
+ is_interacting = params.interactive_first;
}
- bool is_antiprompt = false;
- bool input_noecho = false;
+ bool is_antiprompt = false;
+ bool input_echo = true;
+ bool need_to_save_session = !path_session.empty() && n_matching_session_tokens < embd_inp.size();
- int n_past = 0;
- int n_remain = params.n_predict;
- int n_consumed = 0;
+ int n_past = 0;
+ int n_remain = params.n_predict;
+ int n_consumed = 0;
+ int n_session_consumed = 0;
// the first thing we will do is to output the prompt, so set color accordingly
- set_console_color(con_st, CONSOLE_COLOR_PROMPT);
+ console_set_color(con_st, CONSOLE_COLOR_PROMPT);
std::vector<llama_token> embd;
@@ -264,15 +312,19 @@ int main(int argc, char ** argv) {
// infinite text generation via context swapping
// if we run out of context:
// - take the n_keep first tokens from the original prompt (via n_past)
- // - take half of the last (n_ctx - n_keep) tokens and recompute the logits in a batch
+ // - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
if (n_past + (int) embd.size() > n_ctx) {
const int n_left = n_past - params.n_keep;
- n_past = params.n_keep;
+ // always keep the first token - BOS
+ n_past = std::max(1, params.n_keep);
// insert n_left/2 tokens at the start of embd from last_n_tokens
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());
+ // stop saving session if we run out of context
+ path_session.clear();
+
//printf("\n---\n");
//printf("resetting: '");
//for (int i = 0; i < (int) embd.size(); i++) {
@@ -282,34 +334,128 @@ int main(int argc, char ** argv) {
//printf("\n---\n");
}
- if (llama_eval(ctx, embd.data(), embd.size(), n_past, params.n_threads)) {
- fprintf(stderr, "%s : failed to eval\n", __func__);
- return 1;
+ // try to reuse a matching prefix from the loaded session instead of re-eval (via n_past)
+ if (n_session_consumed < (int) session_tokens.size()) {
+ size_t i = 0;
+ for ( ; i < embd.size(); i++) {
+ if (embd[i] != session_tokens[n_session_consumed]) {
+ session_tokens.resize(n_session_consumed);
+ break;
+ }
+
+ n_past++;
+ n_session_consumed++;
+
+ if (n_session_consumed >= (int) session_tokens.size()) {
+ ++i;
+ break;
+ }
+ }
+ if (i > 0) {
+ embd.erase(embd.begin(), embd.begin() + i);
+ }
+ }
+
+ // evaluate tokens in batches
+ // embd is typically prepared beforehand to fit within a batch, but not always
+ for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
+ int n_eval = (int) embd.size() - i;
+ if (n_eval > params.n_batch) {
+ n_eval = params.n_batch;
+ }
+ if (llama_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)) {
+ fprintf(stderr, "%s : failed to eval\n", __func__);
+ return 1;
+ }
+ n_past += n_eval;
+ }
+
+ if (embd.size() > 0 && !path_session.empty()) {
+ session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
+ n_session_consumed = session_tokens.size();
}
}
- n_past += embd.size();
embd.clear();
if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
// out of user input, sample next token
- const int32_t top_k = params.top_k;
- const float top_p = params.top_p;
- const float temp = params.temp;
- const float repeat_penalty = params.repeat_penalty;
+ const float temp = params.temp;
+ const int32_t top_k = params.top_k <= 0 ? llama_n_vocab(ctx) : params.top_k;
+ const float top_p = params.top_p;
+ const float tfs_z = params.tfs_z;
+ const float typical_p = params.typical_p;
+ const int32_t repeat_last_n = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
+ const float repeat_penalty = params.repeat_penalty;
+ const float alpha_presence = params.presence_penalty;
+ const float alpha_frequency = params.frequency_penalty;
+ const int mirostat = params.mirostat;
+ const float mirostat_tau = params.mirostat_tau;
+ const float mirostat_eta = params.mirostat_eta;
+ const bool penalize_nl = params.penalize_nl;
+
+ // optionally save the session on first sample (for faster prompt loading next time)
+ if (!path_session.empty() && need_to_save_session) {
+ need_to_save_session = false;
+ llama_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
+ }
llama_token id = 0;
{
- auto logits = llama_get_logits(ctx);
+ auto logits = llama_get_logits(ctx);
+ auto n_vocab = llama_n_vocab(ctx);
- if (params.ignore_eos) {
- logits[llama_token_eos()] = 0;
+ // Apply params.logit_bias map
+ for (auto it = params.logit_bias.begin(); it != params.logit_bias.end(); it++) {
+ logits[it->first] += it->second;
}
- id = llama_sample_top_p_top_k(ctx,
- last_n_tokens.data() + n_ctx - params.repeat_last_n,
- params.repeat_last_n, top_k, top_p, temp, repeat_penalty);
+ std::vector<llama_token_data> candidates;
+ candidates.reserve(n_vocab);
+ for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
+ candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
+ }
+
+ llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
+
+ // Apply penalties
+ float nl_logit = logits[llama_token_nl()];
+ auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
+ llama_sample_repetition_penalty(ctx, &candidates_p,
+ last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+ last_n_repeat, repeat_penalty);
+ llama_sample_frequency_and_presence_penalties(ctx, &candidates_p,
+ last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+ last_n_repeat, alpha_frequency, alpha_presence);
+ if (!penalize_nl) {
+ logits[llama_token_nl()] = nl_logit;
+ }
+
+ if (temp <= 0) {
+ // Greedy sampling
+ id = llama_sample_token_greedy(ctx, &candidates_p);
+ } else {
+ if (mirostat == 1) {
+ static float mirostat_mu = 2.0f * mirostat_tau;
+ const int mirostat_m = 100;
+ llama_sample_temperature(ctx, &candidates_p, temp);
+ id = llama_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
+ } else if (mirostat == 2) {
+ static float mirostat_mu = 2.0f * mirostat_tau;
+ llama_sample_temperature(ctx, &candidates_p, temp);
+ id = llama_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
+ } else {
+ // Temperature sampling
+ llama_sample_top_k(ctx, &candidates_p, top_k, 1);
+ llama_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
+ llama_sample_typical(ctx, &candidates_p, typical_p, 1);
+ llama_sample_top_p(ctx, &candidates_p, top_p, 1);
+ llama_sample_temperature(ctx, &candidates_p, temp);
+ id = llama_sample_token(ctx, &candidates_p);
+ }
+ }
+ // printf("`%d`", candidates_p.size);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(id);
@@ -329,7 +475,7 @@ int main(int argc, char ** argv) {
embd.push_back(id);
// echo this to console
- input_noecho = false;
+ input_echo = true;
// decrement remaining sampling budget
--n_remain;
@@ -347,15 +493,15 @@ int main(int argc, char ** argv) {
}
// display text
- if (!input_noecho) {
+ if (input_echo) {
for (auto id : embd) {
printf("%s", llama_token_to_str(ctx, id));
}
fflush(stdout);
}
// reset color to default if we there is no pending user input
- if (!input_noecho && (int)embd_inp.size() == n_consumed) {
- set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+ if (input_echo && (int)embd_inp.size() == n_consumed) {
+ console_set_color(con_st, CONSOLE_COLOR_DEFAULT);
}
// if not currently processing queued inputs;
@@ -391,14 +537,6 @@ int main(int argc, char ** argv) {
}
if (n_past > 0 && is_interacting) {
- // potentially set color to indicate we are taking user input
- set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
-
-#if defined (_WIN32)
- // Windows: must reactivate sigint handler after each signal
- signal(SIGINT, sigint_handler);
-#endif
-
if (params.instruct) {
printf("\n> ");
}
@@ -412,33 +550,21 @@ int main(int argc, char ** argv) {
std::string line;
bool another_line = true;
do {
-#if defined(_WIN32)
- std::wstring wline;
- if (!std::getline(std::wcin, wline)) {
- // input stream is bad or EOF received
- return 0;
- }
- win32_utf8_encode(wline, line);
-#else
- if (!std::getline(std::cin, line)) {
- // input stream is bad or EOF received
- return 0;
- }
-#endif
- if (line.empty() || line.back() != '\\') {
- another_line = false;
- } else {
- line.pop_back(); // Remove the continue character
- }
- buffer += line + '\n'; // Append the line to the result
+ another_line = console_readline(con_st, line);
+ buffer += line;
} while (another_line);
// done taking input, reset color
- set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+ console_set_color(con_st, CONSOLE_COLOR_DEFAULT);
// Add tokens to embd only if the input buffer is non-empty
// Entering a empty line lets the user pass control back
if (buffer.length() > 1) {
+ // append input suffix if any
+ if (!params.input_suffix.empty()) {
+ buffer += params.input_suffix;
+ printf("%s", params.input_suffix.c_str());
+ }
// instruct mode: insert instruction prefix
if (params.instruct && !is_antiprompt) {
@@ -457,7 +583,7 @@ int main(int argc, char ** argv) {
n_remain -= line_inp.size();
}
- input_noecho = true; // do not echo this again
+ input_echo = false; // do not echo this again
}
if (n_past > 0) {
@@ -482,14 +608,13 @@ int main(int argc, char ** argv) {
}
}
-#if defined (_WIN32)
- signal(SIGINT, SIG_DFL);
-#endif
+ if (!path_session.empty() && params.prompt_cache_all) {
+ fprintf(stderr, "\n%s: saving final output to session file '%s'\n", __func__, path_session.c_str());
+ llama_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
+ }
llama_print_timings(ctx);
llama_free(ctx);
- set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
-
return 0;
}
diff --git a/examples/perplexity/CMakeLists.txt b/examples/perplexity/CMakeLists.txt
index 5836df8b2..61b17b828 100644
--- a/examples/perplexity/CMakeLists.txt
+++ b/examples/perplexity/CMakeLists.txt
@@ -2,3 +2,6 @@ set(TARGET perplexity)
add_executable(${TARGET} perplexity.cpp)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
+if(TARGET BUILD_INFO)
+ add_dependencies(${TARGET} BUILD_INFO)
+endif()
diff --git a/examples/perplexity/perplexity.cpp b/examples/perplexity/perplexity.cpp
index 80792ea0d..9212dee5c 100644
--- a/examples/perplexity/perplexity.cpp
+++ b/examples/perplexity/perplexity.cpp
@@ -1,5 +1,6 @@
#include "common.h"
#include "llama.h"
+#include "build-info.h"
#include <cmath>
#include <ctime>
@@ -24,40 +25,68 @@ void perplexity(llama_context * ctx, const gpt_params & params) {
// Download: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
// Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
// Output: `perplexity: 13.5106 [114/114]`
+ // BOS tokens will be added for each chunk before eval
auto tokens = ::llama_tokenize(ctx, params.prompt, true);
- int count = 0;
- int seq_count = tokens.size() / params.n_ctx;
- int n_vocab = llama_n_vocab(ctx);
+ int count = 0;
+
+ const int n_chunk = tokens.size() / params.n_ctx;
+ const int n_vocab = llama_n_vocab(ctx);
+ const int n_batch = params.n_batch;
double nll = 0.0;
- fprintf(stderr, "%s : calculating perplexity over %d chunks, batch_size=%d\n", __func__, seq_count, params.n_batch);
+ fprintf(stderr, "%s: calculating perplexity over %d chunks, batch_size=%d\n", __func__, n_chunk, n_batch);
- for (int i = 0; i < seq_count; ++i) {
- int start = i * params.n_ctx;
- int end = start + params.n_ctx;
+ for (int i = 0; i < n_chunk; ++i) {
+ const int start = i * params.n_ctx;
+ const int end = start + params.n_ctx;
+
+ const int num_batches = (params.n_ctx + n_batch - 1) / n_batch;
std::vector<float> logits;
- int num_batches = (params.n_ctx + params.n_batch - 1) / params.n_batch;
- auto start_t = std::chrono::high_resolution_clock::now();
+
+ const auto t_start = std::chrono::high_resolution_clock::now();
+
for (int j = 0; j < num_batches; ++j) {
- int batch_start = start + j * params.n_batch;
- int batch_size = std::min(end - batch_start, params.n_batch);
- if (llama_eval(ctx, tokens.data() + batch_start, batch_size, j * params.n_batch, params.n_threads)) {
+ const int batch_start = start + j * n_batch;
+ const int batch_size = std::min(end - batch_start, n_batch);
+
+ // save original token and restore it after eval
+ const auto token_org = tokens[batch_start];
+
+ // add BOS token for the first batch of each chunk
+ if (j == 0) {
+ tokens[batch_start] = llama_token_bos();
+ }
+
+ if (llama_eval(ctx, tokens.data() + batch_start, batch_size, j * n_batch, params.n_threads)) {
fprintf(stderr, "%s : failed to eval\n", __func__);
return;
}
- auto batch_logits = llama_get_logits(ctx);
+
+ // restore the original token in case it was set to BOS
+ tokens[batch_start] = token_org;
+
+ const auto batch_logits = llama_get_logits(ctx);
logits.insert(logits.end(), batch_logits, batch_logits + batch_size * n_vocab);
}
- auto end_t = std::chrono::high_resolution_clock::now();
+
+ const auto t_end = std::chrono::high_resolution_clock::now();
+
if (i == 0) {
- const float seconds = std::chrono::duration<float>(end_t - start_t).count();
- printf("%.2f seconds per pass - ETA %.2f hours\n", seconds, (seconds * seq_count) / (60.0*60.0));
+ const float t_total = std::chrono::duration<float>(t_end - t_start).count();
+ fprintf(stderr, "%s: %.2f seconds per pass - ETA ", __func__, t_total);
+ int total_seconds = (int)(t_total * n_chunk);
+ if (total_seconds >= 60*60) {
+ fprintf(stderr, "%d hours ", total_seconds / (60*60));
+ total_seconds = total_seconds % (60*60);
+ }
+ fprintf(stderr, "%d minutes\n", total_seconds / 60);
}
+
// We get the logits for all the tokens in the context window (params.n_ctx)
// from llama_eval above. Now, based on https://huggingface.co/docs/transformers/perplexity,
- // calculate the perplexity over the last half the window (so the model always has
+ // calculate the perplexity over the last half of the window (so the model always has
// some context to predict the token).
//
// We rely on the fact that attention in the forward pass only looks at previous
@@ -69,10 +98,12 @@ void perplexity(llama_context * ctx, const gpt_params & params) {
// process the entire prompt.
for (int j = std::min(512, params.n_ctx / 2); j < params.n_ctx - 1; ++j) {
// Calculate probability of next token, given the previous ones.
- std::vector<float> tok_logits(
- logits.begin() + j * n_vocab,
+ const std::vector<float> tok_logits(
+ logits.begin() + (j + 0) * n_vocab,
logits.begin() + (j + 1) * n_vocab);
- float prob = softmax(tok_logits)[tokens[start + j + 1]];
+
+ const float prob = softmax(tok_logits)[tokens[start + j + 1]];
+
nll += -std::log(prob);
++count;
}
@@ -100,11 +131,13 @@ int main(int argc, char ** argv) {
"expect poor results\n", __func__, params.n_ctx);
}
- if (params.seed <= 0) {
+ fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);
+
+ if (params.seed < 0) {
params.seed = time(NULL);
}
- fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
+ fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
@@ -113,36 +146,11 @@ int main(int argc, char ** argv) {
llama_context * ctx;
- // load the model
- {
- auto lparams = llama_context_default_params();
-
- lparams.n_ctx = params.n_ctx;
- lparams.n_parts = params.n_parts;
- lparams.seed = params.seed;
- lparams.f16_kv = params.memory_f16;
- lparams.logits_all = params.perplexity;
- lparams.use_mmap = params.use_mmap;
- lparams.use_mlock = params.use_mlock;
- lparams.embedding = params.embedding;
-
- ctx = llama_init_from_file(params.model.c_str(), lparams);
-
- if (ctx == NULL) {
- fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
- return 1;
- }
- }
-
- if (!params.lora_adapter.empty()) {
- int err = llama_apply_lora_from_file(ctx,
- params.lora_adapter.c_str(),
- params.lora_base.empty() ? NULL : params.lora_base.c_str(),
- params.n_threads);
- if (err != 0) {
- fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
- return 1;
- }
+ // load the model and apply lora adapter, if any
+ ctx = llama_init_from_gpt_params(params);
+ if (ctx == NULL) {
+ fprintf(stderr, "%s: error: unable to load model\n", __func__);
+ return 1;
}
// print system information
diff --git a/examples/quantize-stats/quantize-stats.cpp b/examples/quantize-stats/quantize-stats.cpp
index 4e6c2c831..9a2aa7c64 100644
--- a/examples/quantize-stats/quantize-stats.cpp
+++ b/examples/quantize-stats/quantize-stats.cpp
@@ -1,4 +1,5 @@
#include "ggml.h"
+#include "build-info.h"
#define LLAMA_API_INTERNAL
#include "llama.h"
@@ -308,6 +309,8 @@ int main(int argc, char ** argv) {
return 1;
}
+ fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);
+
// load the model
fprintf(stderr, "Loading model\n");
diff --git a/examples/quantize/CMakeLists.txt b/examples/quantize/CMakeLists.txt
index fb27d4517..475fc8be8 100644
--- a/examples/quantize/CMakeLists.txt
+++ b/examples/quantize/CMakeLists.txt
@@ -2,3 +2,6 @@ set(TARGET quantize)
add_executable(${TARGET} quantize.cpp)
target_link_libraries(${TARGET} PRIVATE llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
+if(TARGET BUILD_INFO)
+ add_dependencies(${TARGET} BUILD_INFO)
+endif()
diff --git a/examples/quantize/quantize.cpp b/examples/quantize/quantize.cpp
index 5b4812c62..7c77018da 100644
--- a/examples/quantize/quantize.cpp
+++ b/examples/quantize/quantize.cpp
@@ -1,21 +1,55 @@
#include "ggml.h"
#include "llama.h"
+#include "build-info.h"
#include <cstdio>
+#include <map>