* Implement q5_0, q5_1 and q8_0
* Work around q5_0 OpenCL issue
* Fix q8_0 dequant kernel
* Move cl kernels into ggml-opencl.c
* Use two memcpy calls for q5_0 buffer transfer
* llama : minor - remove explicity int64_t cast
* ggml : reduce memory buffer for F16 mul_mat when not using cuBLAS
* ggml : add asserts to guard for incorrect wsize
* Sample interface, new samplers.
New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat
Ignore EOS fix: -inf should be used.
* mirostat
* Added --logit-bias and --no-penalize-nl, removed std::span
* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
* Save and load example adjust
* Tests
* Windows build fix
* Windows test fix
* Basic Setup
* Prevent Results.txt from coming up
* Prefixes, Line separators, etc
* editorcheck
* introduction to give more consistent results
* Basic graph thing
* Grading, ready for testing!
* Y'all ready to get funky?
* fix column removal stuff
* missed a few
* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing
* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of ClBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The llama_set_state_data function restores the rng state to what it
was at the time llama_copy_state_data was called. But users may want
to restore the state and proceed with a different seed.
* Updated build information
First update to the build instructions to include BLAS.
* Update README.md
* Update information about BLAS
* Better BLAS explanation
Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.
* Better BLAS explanation
* BLAS for Mac
Specifying that BLAS is already supported on Macs using the Accelerate Framework.
* Clarify the effect of BLAS
* Windows Make instructions
Added the instructions to build with Make on Windows
* Fixing typo
* Fix trailing whitespace
instead of `int` (while `int` option still being supported)
This allows the following usage:
`./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0`
instead of:
`./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`
* Use full range for q4_0 quantization
By keeping the sign of the highest magnitude, we can make sure the
highest value maps to -8, which is currently unused.
This is a bit of a freebie since it is fully backwards compatible with
the current format.
* Update quantize_row_q4_0 for AVX/AVX2
* Update quantize_row_q4_0 for WASM
Untested
* Update quantize_row_q4_0 for Arm NEON
* Update quantize_row_q4_0 for PowerPC
Untested
* Use full range for q4_2 quantization
* add save_load_state example
* use <cstdio> instead of <iostream> and fprintf / printf instead of cout
* renamed save-load-state example files replacing underscores by dashes