* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type (see the sketch after this list)
* vocab : add DeepSeek V3 pre-tokenizer regexes
* unicode : handle ACCENT_MARK and SYMBOL categories in regex
* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types
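For orientation, here is a minimal NumPy sketch (not the llama.cpp implementation) of the sigmoid expert gating these parameters describe; group-limited routing is omitted, and the variable names are only meant to mirror the GGUF keys above.

```python
import numpy as np

def route_experts(logits, exp_probs_bias, n_top_k, norm_weights=True):
    # EXPERT_GATING_FUNC: DeepSeek V3 uses sigmoid gating instead of softmax
    scores = 1.0 / (1.0 + np.exp(-logits))
    # FFN_EXP_PROBS_B: the bias only influences which experts get selected ...
    selected = np.argsort(scores + exp_probs_bias)[-n_top_k:]
    # ... while the mixing weights come from the unbiased scores
    weights = scores[selected]
    if norm_weights:                 # EXPERT_WEIGHTS_NORM
        weights = weights / weights.sum()
    return selected, weights

# usage: 8 experts, pick 2
top_k, w = route_experts(np.random.randn(8), np.zeros(8), 2)
```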
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Add Falcon3 model support
* Add fix for adding bos to added special tokens
* Add comment explaining the logic behind the if statement
* Add a log message to better track when the following line of code is triggered
* Update log to only print when input and output characters are different
* Fix handling of pre-normalized tokens
* Refactoring
* convert : use GPT2 vocab for Phi-4 model
* convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models
* llama : do not use sliding window attention mask for Phi-4 model
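A minimal sketch of the check the second bullet describes (the `sliding_window` key is from the Hugging Face Phi-3 configs; the helper name here is hypothetical):

```python
def looks_like_phi4(hparams: dict) -> bool:
    # Phi-4 ships a Phi-3-style config but with "sliding_window": null,
    # so a missing/None value is what distinguishes it from other Phi-3 models.
    return hparams.get("sliding_window") is None

assert looks_like_phi4({"model_type": "phi3", "sliding_window": None})
assert not looks_like_phi4({"model_type": "phi3", "sliding_window": 2047})
```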
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Add deepseek v1 arch & gigachat template
* improve template code
* add readme
* delete comments
* remove comment
* fix format
* lint llama.cpp
* fix order of deepseek and deepseek2, move gigachat template to the end of the function
* fix order of deepseek and deepseek2 in constants; mark shared experts as needed by the deepseek arch
* remove comments
* move deepseek above deepseek2
* change placement of gigachat chat template
This matches the key used by common BERT-based embedding models and may have a
value other than 1.
Branch: XLMRobertaTypeVocabSize
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Add OLMo November 2024 constants
* Add OLMo November 2024 converter
* Add loading of OLMo November 2024 tensors and hyper parameters
* Add building of OLMo November 2024 model
* convert-lora : make `--base` optional
* lint
* handle case where base_model_name_or_path is invalid
* do not include metadata from base model
* clarify unspecified --base (see the sketch after this list)
* add small comment [no ci]
* trigger ci
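A hedged sketch of the fallback order these changes give convert_lora_to_gguf.py when `--base` is unspecified; the helper and its exact behaviour are hypothetical, only the option and config key names come from the commits above.

```python
def resolve_base_model(cli_base: str | None, adapter_config: dict) -> str | None:
    """Decide where base-model info comes from when --base is optional (sketch)."""
    if cli_base is not None:
        return cli_base                               # explicit --base always wins
    base = adapter_config.get("base_model_name_or_path")
    if base:
        return base                                   # recorded by PEFT, may be invalid
    return None                                       # derive what is needed from the LoRA tensors

# usage
assert resolve_base_model(None, {"base_model_name_or_path": "org/base-model"}) == "org/base-model"
```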
* convert : refactor rope_freqs generation
This should also fix vocab-only conversion for Phi-3.
* convert : adapt MiniCPM3 to separate rope_freqs insertion
MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code, which mixes tokenization with tool
calls in the same file.
gguf-py : add long and short RoPE factors to tensor mappings
Empty, but the key names are used to populate the mappings.
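As a rough illustration of what the separate rope_freqs generation amounts to on the conversion side: the long/short factors come straight from the model's `rope_scaling` config block and are emitted as standalone tensors whose names feed the new mappings. The tensor names and helper below are a hedged sketch, not the converter's actual code.

```python
import numpy as np

def rope_factor_tensors(hparams: dict):
    """Yield long/short RoPE factor tensors from the HF config (sketch only)."""
    rope_scaling = hparams.get("rope_scaling") or {}
    if "long_factor" in rope_scaling and "short_factor" in rope_scaling:
        yield "rope_factors_long.weight",  np.asarray(rope_scaling["long_factor"],  dtype=np.float32)
        yield "rope_factors_short.weight", np.asarray(rope_scaling["short_factor"], dtype=np.float32)
```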
* feat(gguf-py): Add granitemoe architecture
This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(convert_hf_to_gguf): Add GraniteMoeModel
GraniteMoe has the same configuration deltas as Granite
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(granitemoe convert): Split the double-sized input layer into gate and up
After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.
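A hedged NumPy sketch of that split (the converter works on torch tensors and maps the halves to the usual gate_exps/up_exps names; which half is w1 vs. w3 is an assumption here). The assert also reflects the "sanity check on merged FFN tensor sizes" fix further below.

```python
import numpy as np

def split_input_linear(input_linear: np.ndarray, ffn_dim: int):
    """Split GraniteMoE's merged per-expert FFN input tensor into gate/up halves."""
    # sanity check on the merged FFN tensor size: it must hold w1 and w3 stacked
    assert input_linear.shape[-2] == 2 * ffn_dim, "unexpected input_linear shape"
    gate_exps = input_linear[..., :ffn_dim, :]   # w1 (assumed to come first)
    up_exps   = input_linear[..., ffn_dim:, :]   # w3
    return gate_exps, up_exps

# usage: 8 experts, ffn_dim 1024, hidden size 2048
gate, up = split_input_linear(np.zeros((8, 2 * 1024, 2048), dtype=np.float32), ffn_dim=1024)
```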
Branch: GraniteMoE
Co-Authored-By: alex.brooks@ibm.com
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(granitemoe): Implement granitemoe
GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.
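For context, a schematic of where the four multipliers enter the forward pass; the *_scale names anticipate the renaming discussed further below, and this is a sketch rather than the actual graph-building code.

```python
def granite_forward_sketch(tokens, embed, layers, lm_head, s):
    """Schematic forward pass showing where Granite's four multipliers apply."""
    x = s["embedding_scale"] * embed(tokens)                # scale token embeddings
    for attn, ffn in layers:
        # attention_scale replaces the usual 1/sqrt(head_dim) Q.K scaling
        x = x + s["residual_scale"] * attn(x, kq_scale=s["attention_scale"])
        x = x + s["residual_scale"] * ffn(x)                # MoE FFN for GraniteMoE
    return lm_head(x) / s["logit_scale"]                    # scale down the final logits
```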
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Typo fix in docstring
Co-Authored-By: ggerganov@gmail.com
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(conversion): Simplify tensor name mapping in conversion
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Remove unused tensor name mappings
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Sanity check on merged FFN tensor sizes
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow "output" layer in granite moe architecture (convert and cpp)
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(granite): Add missing 'output' tensor for Granite
This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this tensor (`lm_head.weight`) as part of the architecture.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat(gguf-py): Add Granite model and params to gguf-py
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(convert_hf_to_gguf): Add registration and param setup for Granite
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): Add config parsing for Granite multiplier params
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): First pass at full port of granite deviations from llama
Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama.cpp): Determine granite language 3b instruct by vocab size
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel
The defaults in LlamaModel are needed for Granite as well
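A minimal sketch of what basing GraniteModel on LlamaModel looks like in convert_hf_to_gguf.py (the decorator/class pattern follows the file's existing conventions; treat the exact names as assumptions):

```python
# Sketch of the registration inside convert_hf_to_gguf.py; Model, LlamaModel
# and gguf.MODEL_ARCH.GRANITE are assumed from the surrounding module.
@Model.register("GraniteForCausalLM")
class GraniteModel(LlamaModel):
    """Granite reuses LlamaModel's defaults; only the arch tag differs here."""
    model_arch = gguf.MODEL_ARCH.GRANITE
```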
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama.cpp): Switch Granite param names to use _scale for consistency
Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale
The transformers names ending in _multiplier are now converted to their
_scale equivalents during conversion.
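A small sketch of the renaming this describes; the Hugging Face config keys are Granite's, while the gguf-py writer helpers invoked here are assumptions:

```python
# transformers config key   ->  *_scale name used on the GGUF side
GRANITE_SCALE_PARAMS = {
    "embedding_multiplier": "embedding_scale",
    "residual_multiplier":  "residual_scale",
    "attention_multiplier": "attention_scale",
    "logits_scaling":       "logit_scale",
}

def set_granite_scales(hparams: dict, writer) -> None:
    """Translate the HF *_multiplier params to *_scale metadata (sketch)."""
    for hf_key, scale_name in GRANITE_SCALE_PARAMS.items():
        value = hparams.get(hf_key)
        if value is not None:
            # e.g. writer.add_embedding_scale(value); the helper names are assumed
            getattr(writer, f"add_{scale_name}")(value)
```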
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>