We get slightly better PPL, and we cut quantization time
nearly in half.
The trick is to first quantize without forcing points onto the E8 lattice.
We can then use a narrower search range around the block scale
obtained that way.
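
The two-stage idea above can be illustrated with a toy sketch. This is not the actual quantization code: the "even sum" constraint is only a simplified stand-in for the E8-lattice constraint, and every name here (`round_free`, `round_constrained`, `search_scale`, `two_stage_fit`) as well as the search ranges and step counts are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>
#include <vector>

struct Fit { float scale; float err; };

using Rounder = std::vector<int> (*)(const std::vector<float>&, float, int);

// Stage-1 rounding: nearest integer, clamped to [-qmax, qmax], no constraint.
std::vector<int> round_free(const std::vector<float>& x, float scale, int qmax) {
    std::vector<int> q(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        int v = (int) std::lround(x[i] / scale);
        q[i] = std::max(-qmax, std::min(qmax, v));
    }
    return q;
}

// Stage-2 rounding: same, but the sum of the integers must be even
// (toy stand-in for a lattice constraint). If the sum is odd, the
// coordinate with the largest rounding error is moved by one step.
std::vector<int> round_constrained(const std::vector<float>& x, float scale, int qmax) {
    std::vector<int> q = round_free(x, scale, qmax);
    int sum = 0;
    for (int v : q) sum += v;
    if (sum % 2 != 0) {
        size_t worst = 0;
        float worst_err = -1.0f;
        for (size_t i = 0; i < x.size(); ++i) {
            float e = std::fabs(x[i] / scale - q[i]);
            if (e > worst_err) { worst_err = e; worst = i; }
        }
        int dir = (x[worst] / scale > q[worst]) ? 1 : -1;
        if (std::abs(q[worst] + dir) > qmax) dir = -dir;  // stay inside the range
        q[worst] += dir;
    }
    return q;
}

// Squared error of approximating x by scale * q.
float sq_error(const std::vector<float>& x, const std::vector<int>& q, float scale) {
    float err = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        float d = x[i] - scale * q[i];
        err += d * d;
    }
    return err;
}

// Try steps + 1 evenly spaced scales in [lo, hi] and keep the best one.
Fit search_scale(const std::vector<float>& x, float lo, float hi, int steps,
                 int qmax, Rounder rounder) {
    Fit best = {lo, std::numeric_limits<float>::max()};
    for (int s = 0; s <= steps; ++s) {
        float scale = lo + (hi - lo) * s / steps;
        float err = sq_error(x, rounder(x, scale, qmax), scale);
        if (err < best.err) best = {scale, err};
    }
    return best;
}

Fit two_stage_fit(const std::vector<float>& x, int qmax) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    if (amax == 0.0f) return {1.0f, 0.0f};
    float base = amax / qmax;
    // Stage 1: wide search with the cheap, unconstrained rounding.
    Fit free_fit = search_scale(x, 0.5f * base, 1.5f * base, 40, qmax, round_free);
    // Stage 2: narrow search around the stage-1 scale with the constrained rounding.
    return search_scale(x, 0.9f * free_fit.scale, 1.1f * free_fit.scale, 8, qmax,
                        round_constrained);
}
```

In this sketch the expensive constrained rounding runs only 9 times instead of once per candidate in the wide range, which is the kind of saving the message describes.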
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Update server-llm.sh
Add flag --non-interactive that allows running the script without asking for permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
* update guide for make installation, memory, gguf model link, rm todo for windows build
* add vs install requirement
* update for gpu device check
* update help of llama-bench
* fix grammar issues
llama_batch_init allocates memory for a fixed number of tokens.
However, llama_batch_free only frees memory for the number of
tokens that were added to the batch.
This change-set uses a null-terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached. This change-set
also changes the name of the first parameter from `n_tokens` to
`n_tokens_alloc` to more clearly indicate that this value is the number
of tokens allocated to the batch, not the number of tokens in the batch.
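
A minimal sketch of the null-terminated pattern described above, using a simplified stand-in struct. `toy_batch`, `toy_batch_init`, and `toy_batch_free` are hypothetical names, not the library's API, and the sketch assumes `n_seq_max >= 1` and that every allocation succeeds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical, simplified stand-in for the real batch struct.
struct toy_batch {
    int32_t   n_tokens;  // tokens currently in the batch
    int32_t** seq_id;    // n_tokens_alloc rows followed by a nullptr sentinel
};

toy_batch toy_batch_init(int32_t n_tokens_alloc, int32_t n_seq_max) {
    toy_batch batch = {0, nullptr};
    // One extra slot holds the terminating nullptr.
    batch.seq_id = (int32_t**) std::malloc(sizeof(int32_t*) * (n_tokens_alloc + 1));
    for (int32_t i = 0; i < n_tokens_alloc; ++i) {
        batch.seq_id[i] = (int32_t*) std::malloc(sizeof(int32_t) * n_seq_max);
    }
    batch.seq_id[n_tokens_alloc] = nullptr;
    return batch;
}

// Frees every allocated row, independent of batch.n_tokens.
// Returns the number of rows freed, for illustration only.
int32_t toy_batch_free(toy_batch& batch) {
    int32_t freed = 0;
    if (batch.seq_id != nullptr) {
        for (int32_t i = 0; batch.seq_id[i] != nullptr; ++i) {
            std::free(batch.seq_id[i]);
            ++freed;
        }
        std::free(batch.seq_id);
        batch.seq_id = nullptr;
    }
    return freed;
}
```

Because the free routine walks to the sentinel instead of reading `n_tokens`, a batch that was allocated for 8 tokens but only filled with 3 still releases all 8 rows.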
* add --no-mmap, show sycl backend
* fix conflict
* fix code format, change print for --no-mmap
* ren no_mmap to mmap, show mmap when not default value in printer
* update guide for mmap
* mv position to reduce model reload
* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Support avg pooling in cuda
    5. HardSigmoid:
        Implement in cuda
    6. HardSwish:
        Implement in cuda
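
For reference, the two activations named in items 5 and 6 can be written as scalar CPU functions. This sketch uses the commonly used definitions and hypothetical function names; the CUDA kernels from this change are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>

// hard_sigmoid(x) = clamp((x + 3) / 6, 0, 1)
inline float hard_sigmoid(float x) {
    return std::min(1.0f, std::max(0.0f, (x + 3.0f) / 6.0f));
}

// hard_swish(x) = x * hard_sigmoid(x)
inline float hard_swish(float x) {
    return x * hard_sigmoid(x);
}
```

Both are piecewise-linear approximations that avoid the exponential in the regular sigmoid, so a GPU version can apply the same expression independently to each element.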
* fix tabs instead of spaces
* code clean
* CUDA POOL2D
* ADD POOL2D test case in test-backend-ops.cpp
* code clean
* fix pool2d_kernel
nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad
nits
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d
* test-backend-ops : remove f16 pool_2d tests
* cuda : more style fixes
* Add assert in ggml_cuda_op_pool2d
* pool2d float padding fallback
* test-backend-ops : add dst_type to im2col
---------
Co-authored-by: slaren <slarengh@gmail.com>
* Fix Vulkan F16 models
* Fix Vulkan context shift crash
* Add Vulkan to common.cpp dump_non_result_info_yaml function
* Fix bug in Vulkan CPY op
* Fix small matrix multiplication errors on AMD GPUs on Windows or with amdvlk
---------
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
* support SYCL backend windows build
* add windows build in CI
* add for win build CI
* correct install oneMKL
* fix install issue
* fix ci
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix win build
* fix win build
* fix win build
* restore other CI part
* restore as base
* rm no new line
* fix no new line issue, add -j
* fix grammar issue
* allow to trigger manually, fix format issue
* fix format
* add newline
* fix format
* fix format
* fix format issue
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* server : fix context shift + simplify self-extend
* server : take system_tokens into account
* server : more n_past fixes
* server : revert n_past_se changes
* Changed ugly xxd to literals.
HPP files are much more readable as multiline literals rather than hex arrays.
* Dashes in literal variable names.
Replace . and - with _ when turning file names into variable names.
* Comment on removing xxd.
XXD -> string literals
* XXD to string literals.
Replaced these unreadable headers with string literal versions using new deps.sh.
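
As an illustration of the two points above (the name mapping and the literal form), here is a small sketch. `to_variable_name`, `my_page_html`, and the `LITERAL` delimiter are hypothetical choices for this example and are not taken from deps.sh.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: derive a C++ identifier from a file name
// by replacing '.' and '-' with '_'.
std::string to_variable_name(std::string file_name) {
    for (char& c : file_name) {
        if (c == '.' || c == '-') c = '_';
    }
    return file_name;
}

// xxd-style header content: the bytes of "<h1>" as an opaque hex array.
const unsigned char my_page_html_hex[] = {0x3c, 0x68, 0x31, 0x3e};

// String-literal style: the same kind of content stays readable.
const char my_page_html[] = R"LITERAL(<h1>hello</h1>)LITERAL";
```

A custom raw-string delimiter such as `LITERAL` lets the embedded file contain `)"` sequences without ending the literal early.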
* added cuda float16->float32 upcasting to ggml_cuda_cpy
* added ability to copy 4d tensors with the cuda backend
* added tests for float16->float32 upcast and 4d tensor cuda copies
* added 4d copy test for float32->float16 copy
* applied patch suggested by @iamlemec
* simplify cpy tests
---------
Co-authored-by: slaren <slarengh@gmail.com>