diff --git a/docs/BLIS.md b/docs/BLIS.md
index 0bcd6eeef..c933766b7 100644
--- a/docs/BLIS.md
+++ b/docs/BLIS.md
@@ -23,7 +23,7 @@ Install BLIS:
 sudo make install
 ```
 
-We recommend using openmp since it's easier to modify the cores been used.
+We recommend using openmp, since it's easier to change which cores are used.
 
 ### llama.cpp compilation
 
diff --git a/docs/HOWTO-add-model.md b/docs/HOWTO-add-model.md
index a56b78344..48769cdf6 100644
--- a/docs/HOWTO-add-model.md
+++ b/docs/HOWTO-add-model.md
@@ -96,9 +96,9 @@ NOTE: The dimensions in `ggml` are typically in the reverse order of the `pytorc
 
 This is the funniest part, you have to provide the inference graph implementation of the new model architecture in `llama_build_graph`.
 
-Have a look to existing implementation like `build_llama`, `build_dbrx` or `build_bert`.
+Have a look at existing implementations such as `build_llama`, `build_dbrx` or `build_bert`.
 
-When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support of missing backend operations can be added in another PR.
+When implementing a new graph, please note that the underlying `ggml` backends might not support them all; support for missing backend operations can be added in another PR.
 
 Note: to debug the inference graph: you can use [eval-callback](../examples/eval-callback).
 
diff --git a/examples/llava/README.md b/examples/llava/README.md
index d4810d42e..4fb0cf381 100644
--- a/examples/llava/README.md
+++ b/examples/llava/README.md
@@ -56,7 +56,7 @@ python ./examples/llava/convert-image-encoder-to-gguf.py -m ../clip-vit-large-pa
 python ./convert.py ../llava-v1.5-7b --skip-unknown
 ```
 
-Now both the LLaMA part and the image encoder is in the `llava-v1.5-7b` directory.
+Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
 
 ## LLaVA 1.6 gguf conversion
 1) First clone a LLaVA 1.6 model:
diff --git a/examples/main/README.md b/examples/main/README.md
index e7a38743c..97e2ae4c2 100644
--- a/examples/main/README.md
+++ b/examples/main/README.md
@@ -143,7 +143,7 @@ The `--ctx-size` option allows you to set the size of the prompt context used by
 
 ### Extended Context Size
 
-Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model have a context length (max sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.
+Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model has a context length (max sequence length) of 4096 (4k) and the fine-tuned model has 32k, that is a scaling factor of 8, and it should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.
 
 - `--rope-scale N`: Where N is the linear scaling factor used by the fine-tuned model.
 
@@ -286,7 +286,7 @@ These options help improve the performance and memory usage of the LLaMA models.
 
 - `--numa distribute`: Pin an equal proportion of the threads to the cores on each NUMA node. This will spread the load amongst all cores on the system, utilitizing all memory channels at the expense of potentially requiring memory to travel over the slow links between nodes.
 - `--numa isolate`: Pin all threads to the NUMA node that the program starts on. This limits the number of cores and amount of memory that can be used, but guarantees all memory access remains local to the NUMA node.
-- `--numa numactl`: Pin threads to the CPUMAP that is passed to the program by starting it with the numactl utility. This is the most flexible mode, and allow arbitraty core usage patterns, for example a map that uses all the cores on one NUMA nodes, and just enough cores on a second node to saturate the inter-node memory bus.
+- `--numa numactl`: Pin threads to the CPUMAP that is passed to the program by starting it with the numactl utility. This is the most flexible mode, and allows arbitrary core usage patterns, for example a map that uses all the cores on one NUMA node, and just enough cores on a second node to saturate the inter-node memory bus.
 
 These flags attempt optimizations that help on some systems with non-uniform memory access. This currently consists of one of the above strategies, and disabling prefetch and readahead for mmap. The latter causes mapped pages to be faulted in on first access instead of all at once, and in combination with pinning threads to NUMA nodes, more of the pages end up on the NUMA node where they are used. Note that if the model is already in the system page cache, for example because of a previous run without this option, this will have little effect unless you drop the page cache first. This can be done by rebooting the system or on Linux by writing '3' to '/proc/sys/vm/drop_caches' as root.
 
diff --git a/examples/sycl/README.md b/examples/sycl/README.md
index b46f17f39..c589c2d3a 100644
--- a/examples/sycl/README.md
+++ b/examples/sycl/README.md
@@ -1,6 +1,6 @@
 # llama.cpp/example/sycl
 
-This example program provide the tools for llama.cpp for SYCL on Intel GPU.
+This example program provides the tools for running llama.cpp with SYCL on Intel GPUs.
 
 ## Tool
 
diff --git a/grammars/README.md b/grammars/README.md
index c924e8d46..2b8384d9d 100644
--- a/grammars/README.md
+++ b/grammars/README.md
@@ -51,7 +51,7 @@ single-line ::= [^\n]+ "\n"`
 
 ## Sequences and Alternatives
 
-The order of symbols in a sequence matter. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
+The order of symbols in a sequence matters. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
 
 Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`.