* vulkan : do not use tensor->extra
This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.
Ref: #8536
* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2)
---------
Co-authored-by: 0cc4m <picard12@live.de>
* make sure the --split and --merge params are not specified at the same time (see the sketch after this message)
* update gguf-split params parse logic
* Update examples/gguf-split/gguf-split.cpp
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
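For illustration, a hedged C++ sketch of the mutual-exclusion check from
the first commit above; split_params, split_params_parse and the error
message are assumptions, not the actual gguf-split.cpp code:

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    struct split_params {
        bool split = false; // set by --split
        bool merge = false; // set by --merge
    };

    static void split_params_parse(int argc, const char ** argv, split_params & params) {
        for (int i = 1; i < argc; i++) {
            if (strcmp(argv[i], "--split") == 0) { params.split = true; }
            if (strcmp(argv[i], "--merge") == 0) { params.merge = true; }
        }
        // the two modes are mutually exclusive: reject a command line
        // that asks for both
        if (params.split && params.merge) {
            fprintf(stderr, "error: --split and --merge cannot be specified at the same time\n");
            exit(EXIT_FAILURE);
        }
    }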
When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0.
They are calculated as the workgroup size multiplied
by LOAD_VEC_* (which can be 1) and divided by 16,
and the workgroup size is set equal to the
warp/subgroup size.
The loadstride_* variables are used as increments in
the loops that populate the buffers used for the
multiplication. When they are 0, those loops never
advance. But an infinite loop without side effects is
UB, and the values of loadstride_* are known at
compile time, so the compiler quietly optimizes all
of the loops away. As a consequence, the buffers are
never populated and the multiplication result is a
matrix with all elements set to 0.
We prevent the UB by making sure that the workgroup
size is never less than 16, even if the device has a
smaller warp size (e.g. 8). The failure mode is
sketched below.
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
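For illustration, a minimal C-style sketch of the failure mode; the real
code is GLSL in mul_mm.comp, and the names and loop body below are
placeholders:

    constexpr int LOAD_VEC_A = 1;        // LOAD_VEC_* can be 1
    constexpr int WARP_SIZE  = 8;        // e.g. a device with subgroup size 8
    constexpr int WG_SIZE    = WARP_SIZE; // old behavior: workgroup == warp size

    // workgroup size * LOAD_VEC_* / 16  ==  8 * 1 / 16  ==  0
    constexpr int loadstride_a = WG_SIZE * LOAD_VEC_A / 16;

    void populate(float * dst, const float * src, int n) {
        // with loadstride_a == 0 this loop never advances; a loop that
        // never terminates and performs no I/O, volatile or atomic access
        // is UB, so the compiler silently removes it and the buffer is
        // left unpopulated
        for (int i = 0; i < n; i += loadstride_a) {
            dst[i] = src[i];
        }
    }

    // the fix: never let the workgroup size drop below 16
    constexpr int SAFE_WG_SIZE = WG_SIZE < 16 ? 16 : WG_SIZE;
    static_assert(SAFE_WG_SIZE * LOAD_VEC_A / 16 > 0, "stride must be positive");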
* convert : refactor rope_freqs generation
This should also fix vocab-only conversion for Phi-3.
* convert : adapt MiniCPM3 to separate rope_freqs insertion
MiniCPM3's tokenizer is treated as a SentencePiece
tokenizer to avoid having to run its custom Python
code, which mixes tokenization and tool-call handling
in the same file.
gguf-py : add long and short RoPE factors to tensor mappings
Empty, but the key names are used to populate the mappings.
A return before a barrier that is reached by only
some of the threads in a workgroup leads to UB.
While the old code happens to work on some devices,
it fails on others (e.g. "smaller" GPUs). The hazard
is sketched below.
As an aside, it would be better to set the
specialization constants when the graph is built;
that way the local workgroup could be sized
appropriately. But that would take a lot of work.
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
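For illustration, a hedged C++-style sketch of the hazard, with barrier()
standing in for the GLSL built-in; the guard condition is illustrative:

    void barrier(); // must be reached by all invocations of a workgroup

    void broken(unsigned tid, unsigned n_work) {
        if (tid >= n_work) {
            return; // some threads exit early...
        }
        // ...so the threads that stay reach a barrier that the
        // early-returners never hit: UB that happens to work on some
        // devices and fails on others
        barrier();
    }

    void fixed(unsigned tid, unsigned n_work) {
        if (tid < n_work) {
            // per-thread work goes here, guarded instead of returning
        }
        barrier(); // reached uniformly by the whole workgroup
    }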
A crash was observed when the number of tokens added
to a batch exceeds the llama_batch size. An assertion
was added in llama_batch_add to protect against
llama_batch size overflow, sketched below.
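A minimal sketch of the guard, assuming the capacity passed to
llama_batch_init is available; the real assertion in llama_batch_add may
differ in form:

    #include <cassert>
    #include "llama.h"

    void llama_batch_add_checked(llama_batch & batch, llama_token id,
                                 llama_pos pos, int32_t n_tokens_alloc) {
        // n_tokens_alloc is the capacity passed to llama_batch_init;
        // fail loudly instead of writing past the allocated arrays
        assert(batch.n_tokens < n_tokens_alloc && "llama_batch size exceeded");
        batch.token[batch.n_tokens] = id;
        batch.pos  [batch.n_tokens] = pos;
        batch.n_tokens++;
    }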
* test-backend-ops : use flops for some performance tests
- parallelize tensor quantization
- use a different set of cases for performance and correctness tests
- run each test for at least one second
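For illustration, a hedged sketch of the at-least-one-second timing loop;
benchmark() and flops_per_run are assumptions, not the actual
test-backend-ops harness:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    void benchmark(void (*run_op)(), uint64_t flops_per_run) {
        using clock = std::chrono::steady_clock;
        const auto t0 = clock::now();
        uint64_t n_runs = 0;
        do {
            run_op(); // e.g. one mat-mul costing 2*M*N*K flops
            n_runs++;
        } while (clock::now() - t0 < std::chrono::seconds(1)); // >= 1 s
        const double sec = std::chrono::duration<double>(clock::now() - t0).count();
        printf("%.2f GFLOPS\n", (double) (n_runs * flops_per_run) / sec / 1e9);
    }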
* ggml: Added run-time detection of neon, i8mm and sve
Adds run-time detection of the Arm instruction set features
neon, i8mm and sve for Linux and Apple build targets
(see the sketch after this list).
* ggml: Extend feature detection to include non-AArch64 Arm architectures
* ggml: Move definition of ggml_arm_arch_features to the global data section
* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels
* add a fallback mechanism for when the offline re-quantized model is not
optimized for the underlying target.
* fix for build errors
* remove prints from the low-level code
* Rebase to the latest upstream
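For illustration, a hedged sketch of the run-time detection referenced in
the list above; the struct and function names are assumptions, though
getauxval and sysctlbyname are the standard mechanisms on each platform:

    #if defined(__linux__)
    #include <sys/auxv.h>
    #include <asm/hwcap.h> // HWCAP_* flags (aarch64)
    #elif defined(__APPLE__)
    #include <sys/sysctl.h>
    #endif

    struct arm_arch_features_t {
        int has_neon;
        int has_i8mm;
        int has_sve;
    };

    static void detect_arm_features(arm_arch_features_t * f) {
    #if defined(__linux__)
        const unsigned long hwcap  = getauxval(AT_HWCAP);
        const unsigned long hwcap2 = getauxval(AT_HWCAP2);
        f->has_neon = !!(hwcap  & HWCAP_ASIMD);
        f->has_sve  = !!(hwcap  & HWCAP_SVE);
        f->has_i8mm = !!(hwcap2 & HWCAP2_I8MM);
    #elif defined(__APPLE__)
        int val = 0; size_t sz = sizeof(val);
        f->has_neon = sysctlbyname("hw.optional.AdvSIMD", &val, &sz, NULL, 0) == 0 && val;
        val = 0; sz = sizeof(val);
        f->has_i8mm = sysctlbyname("hw.optional.arm.FEAT_I8MM", &val, &sz, NULL, 0) == 0 && val;
        f->has_sve  = 0; // SVE is not exposed on Apple silicon
    #endif
    }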
* feat(gguf-py): Add granitemoe architecture
This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(convert_hf_to_gguf): Add GraniteMoeModel
GraniteMoe has the same configuration deltas as Granite
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(granitemoe convert): Split the double-sized input layer into gate and up
After a lot of staring and squinting, it's clear that
the standard mixtral expert implementation is
equivalent to the vectorized parallel experts in
granite. The difference is that in granite, w1 and w3
are concatenated into a single tensor "input_linear".
Rather than reimplementing all of the math on the
llama.cpp side, the much simpler route is to split
this tensor during conversion and follow the standard
mixtral route (sketched below).
Branch: GraniteMoE
Co-Authored-By: alex.brooks@ibm.com
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
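For illustration, a hedged conceptual sketch of the split; the real
change is in the Python converter, and the buffer layout, names, and the
order of the two halves here are assumptions:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    struct split_result {
        std::vector<float> gate; // -> ffn_gate_exps (w1)
        std::vector<float> up;   // -> ffn_up_exps   (w3)
    };

    // merged input_linear is assumed to hold n_ff gate rows followed by
    // n_ff up rows, each row of length n_embd
    split_result split_input_linear(const std::vector<float> & merged,
                                    size_t n_ff, size_t n_embd) {
        // sanity check on the merged size (cf. the later fix commit):
        // the tensor must be exactly 2 * n_ff rows
        assert(merged.size() == 2 * n_ff * n_embd);
        const size_t half = n_ff * n_embd;
        return {
            std::vector<float>(merged.begin(),        merged.begin() + half),
            std::vector<float>(merged.begin() + half, merged.end()),
        };
    }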
* feat(granitemoe): Implement granitemoe
GraniteMoE follows the mixtral architecture (once the
input_linear layers are split into gate_exps/up_exps).
The main delta is the addition of the same four
multipliers used in Granite (see the sketch below).
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
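For illustration, a hedged scalar sketch of where the four multipliers
enter the forward pass; the names follow the HF Granite config, and the
block functions are placeholders for the real tensor ops:

    struct granite_scales {
        float embedding_multiplier; // scales the token embeddings
        float attention_multiplier; // softmax scale, replaces 1/sqrt(head_dim)
        float residual_multiplier;  // scales each residual branch
        float logits_scale;         // divides the final logits
    };

    // scalar stand-ins for the real tensor ops
    float attn_block(float h);
    float moe_ffn_block(float h); // mixtral-style experts after the split
    float lm_head(float h);

    float granite_moe_forward(float tok_emb, const granite_scales & s) {
        float h = tok_emb * s.embedding_multiplier;    // 1) embedding scale
        // 2) attention_multiplier is applied inside attn_block as the
        //    softmax scale
        h = h + s.residual_multiplier * attn_block(h); // 3) scaled residuals
        h = h + s.residual_multiplier * moe_ffn_block(h);
        return lm_head(h) / s.logits_scale;            // 4) logit scale
    }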
* Typo fix in docstring
Co-Authored-By: ggerganov@gmail.com
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(conversion): Simplify tensor name mapping in conversion
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Remove unused tensor name mappings
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Sanity check on merged FFN tensor sizes
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow "output" layer in granite moe architecture (convert and cpp)
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(granite): Add missing 'output' tensor for Granite
This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this tensor (`lm_head.weight`) as part of the architecture.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>