Commit Graph

90 Commits

Author SHA1 Message Date
Xuan Son Nguyen
8586d23c8a minicpm working without uhd 2025-01-23 12:14:06 +01:00
Xuan Son Nguyen
c0d93dd509 minicpmv works but missing uhd slices 2025-01-22 22:42:00 +01:00
Xuan Son Nguyen
ba489b4743 wip minicpmv 2025-01-22 22:26:38 +01:00
Xuan Son Nguyen
32daa38333 Merge branch 'master' into xsn/vision_2 2025-01-22 13:28:31 +01:00
Xuan Son Nguyen
431bb08059 change gguf KV from clip to vit 2025-01-21 10:51:26 +01:00
Xuan Son Nguyen
ec7f3ac9ab
llama : add support for Deepseek-R1-Qwen distill model (#11310)
* llama : add support for Deepseek-R1-Qwen distill model

* coding style
2025-01-20 14:35:07 +01:00
Xuan Son Nguyen
4a7ab89d75 wip minicpmv 2025-01-19 22:33:05 +01:00
Xuan Son Nguyen
d0068ef0ed add mobilevlm 2025-01-19 16:29:20 +01:00
Xuan Son Nguyen
6cabdda0df add back convert hf to gguf 2025-01-18 22:56:04 +01:00
RunningLeon
4dbc8b9cb7
llama : add internlm3 support (#11233)
* support internlm3

* fix lint
2025-01-16 20:10:38 +02:00
Daniel Bevenius
2739a71e4b
convert : sort print supported models [no ci] (#11179)
This commit sorts the list of supported models when printing them out.

The motivation for this change is to make it easier to find a specific
model in the list of supported models. For example:
```console
$ ./convert_hf_to_gguf.py --print-supported-models
Supported models:
- ArcticForCausalLM
- BaiChuanForCausalLM
- BaichuanForCausalLM
- BertForMaskedLM
- BertModel
- BitnetForCausalLM
- BloomForCausalLM
- BloomModel
- CamembertModel
- ChameleonForCausalLM
- ChameleonForConditionalGeneration
- ChatGLMForConditionalGeneration
- ChatGLMModel
- CodeShellForCausalLM
- Cohere2ForCausalLM
- CohereForCausalLM
- DbrxForCausalLM
- DeciLMForCausalLM
- DeepseekForCausalLM
- DeepseekV2ForCausalLM
- DeepseekV3ForCausalLM
- ExaoneForCausalLM
- FalconForCausalLM
- FalconMambaForCausalLM
- GPT2LMHeadModel
- GPTBigCodeForCausalLM
- GPTNeoXForCausalLM
- GPTRefactForCausalLM
- Gemma2ForCausalLM
- GemmaForCausalLM
- GraniteForCausalLM
- GraniteMoeForCausalLM
- GrokForCausalLM
- InternLM2ForCausalLM
- JAISLMHeadModel
- JinaBertForMaskedLM
- JinaBertModel
- LLaMAForCausalLM
- LlamaForCausalLM
- LlavaStableLMEpochForCausalLM
- MPTForCausalLM
- MT5ForConditionalGeneration
- MambaForCausalLM
- MambaLMHeadModel
- MiniCPM3ForCausalLM
- MiniCPMForCausalLM
- MistralForCausalLM
- MixtralForCausalLM
- NemotronForCausalLM
- NomicBertModel
- OLMoForCausalLM
- Olmo2ForCausalLM
- OlmoForCausalLM
- OlmoeForCausalLM
- OpenELMForCausalLM
- OrionForCausalLM
- Phi3ForCausalLM
- PhiForCausalLM
- PhiMoEForCausalLM
- PlamoForCausalLM
- QWenLMHeadModel
- Qwen2ForCausalLM
- Qwen2MoeForCausalLM
- Qwen2VLForConditionalGeneration
- RWForCausalLM
- RWKV6Qwen2ForCausalLM
- RobertaModel
- Rwkv6ForCausalLM
- StableLMEpochForCausalLM
- StableLmForCausalLM
- Starcoder2ForCausalLM
- T5EncoderModel
- T5ForConditionalGeneration
- T5WithLMHeadModel
- UMT5ForConditionalGeneration
- WavTokenizerDec
- XLMRobertaForSequenceClassification
- XLMRobertaModel
- XverseForCausalLM
```
2025-01-11 05:50:33 +01:00
Daniel Bevenius
ff3fcabc72
convert : add --print-supported-models option (#11172)
* convert : add --print-supported-models option

This commit adds a new option to the convert_hf_to_gguf.py script to
print the supported models.

The motivation for this is that it can be useful to know which models
are supported by the script without having to look at the code.

Example usage:
```console
$ ./convert_hf_to_gguf.py --print-supported-models
Supported models:
- GPTNeoXForCausalLM
- BloomForCausalLM
- BloomModel
- MPTForCausalLM
- OrionForCausalLM
- BaichuanForCausalLM
- BaiChuanForCausalLM
- XverseForCausalLM
- FalconForCausalLM
- RWForCausalLM
- GPTBigCodeForCausalLM
- GPTRefactForCausalLM
- StableLmForCausalLM
- StableLMEpochForCausalLM
- LlavaStableLMEpochForCausalLM
- LLaMAForCausalLM
- LlamaForCausalLM
- MistralForCausalLM
- MixtralForCausalLM
- DeciLMForCausalLM
- BitnetForCausalLM
- GrokForCausalLM
- DbrxForCausalLM
- MiniCPMForCausalLM
- MiniCPM3ForCausalLM
- QWenLMHeadModel
- Qwen2ForCausalLM
- Qwen2VLForConditionalGeneration
- WavTokenizerDec
- Qwen2MoeForCausalLM
- GPT2LMHeadModel
- PhiForCausalLM
- Phi3ForCausalLM
- PhiMoEForCausalLM
- PlamoForCausalLM
- CodeShellForCausalLM
- InternLM2ForCausalLM
- BertModel
- BertForMaskedLM
- CamembertModel
- RobertaModel
- NomicBertModel
- XLMRobertaModel
- XLMRobertaForSequenceClassification
- GemmaForCausalLM
- Gemma2ForCausalLM
- Starcoder2ForCausalLM
- Rwkv6ForCausalLM
- RWKV6Qwen2ForCausalLM
- MambaForCausalLM
- MambaLMHeadModel
- FalconMambaForCausalLM
- CohereForCausalLM
- Cohere2ForCausalLM
- OLMoForCausalLM
- OlmoForCausalLM
- Olmo2ForCausalLM
- OlmoeForCausalLM
- JinaBertModel
- JinaBertForMaskedLM
- OpenELMForCausalLM
- ArcticForCausalLM
- DeepseekForCausalLM
- DeepseekV3ForCausalLM
- DeepseekV2ForCausalLM
- UMT5ForConditionalGeneration
- MT5ForConditionalGeneration
- T5ForConditionalGeneration
- T5WithLMHeadModel
- T5EncoderModel
- JAISLMHeadModel
- ChatGLMModel
- ChatGLMForConditionalGeneration
- NemotronForCausalLM
- ExaoneForCausalLM
- GraniteForCausalLM
- GraniteMoeForCausalLM
- ChameleonForCausalLM
- ChameleonForConditionalGeneration
```

* squash! convert : add --print-supported-models option

Fix flake8 error.
2025-01-10 11:30:53 +01:00
Molly Sophia
ee7136c6d1
llama: add support for QRWKV6 model architecture (#11001)
llama: add support for QRWKV6 model architecture (#11001)

* WIP: Add support for RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Some graph simplification

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add support for RWKV6Qwen2 with cpu and cuda GLA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix some typos

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix wkv test & add gla test

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix cuda warning

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update README.md

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update ggml/src/ggml-cuda/gla.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix fused lerp weights loading with RWKV6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* better sanity check skipping for QRWKV6 in llama-quant

thanks @compilade

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
Pierrick Hymbert
f8feb4b01a
model: Add support for PhiMoE arch (#11003)
* model: support phimoe

* python linter

* doc: minor

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>

* doc: minor

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>

* doc: add phimoe as supported model

ggml-ci

---------

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>
2025-01-09 11:21:41 +01:00
fairydreaming
9394bbd484
llama : Add support for DeepSeek V3 (#11049)
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type

* vocab : add DeepSeek V3 pre-tokenizer regexes

* unicode : handle ACCENT_MARK and SYMBOL categories in regex

* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-04 21:06:11 +01:00
DAN™
46be942214
llama : add support for the cohere2 model architecture (#10900) 2025-01-04 16:33:31 +02:00
ymcki
bc7b1f8632
convert : fix Llama-3_1-Nemotron-51B rope settings (#11008)
* conflict resolution

* move comments after bracket to its own line

* DeciLMCausalModel now reads rope_theta from config.json properly
2024-12-31 13:04:48 +02:00
Yun Dou
b92a14a841
llama : support InfiniAI Megrez 3b (#10893)
* Support InfiniAI Megrez 3b

* Fix tokenizer_clean_spaces for megrez
2024-12-23 01:35:44 +01:00
ymcki
6f0c9e034b
llama : support for Llama-3_1-Nemotron-51B (#10669)
* conflict resolution

* move comments after bracket to its own line
2024-12-23 01:22:33 +01:00
Billel Mokeddem
7ae33a616f
llama : add Falcon3 support (#10883)
* Add Falcon3 model support

* Add fix for adding bos to added special tokens

* Add comment explaining the logic behind the if statement

* Add a log message to better track the when the following line of code is triggered

* Update log to only print when input and output characters are different

* Fix handling pre-normalized tokens

* Refactoring
2024-12-23 00:09:58 +02:00
Georgi Gerganov
5cd85b5e00
convert : add BertForMaskedLM (#10919) 2024-12-21 10:10:18 +02:00
Molly Sophia
0a11f8b7b5
convert : fix RWKV v6 model conversion (#10913)
* Enable --no-context-shift for llama-perplexity example

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV 6: Fix error in ggml_cuda_op_bin_bcast

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-12-20 11:44:58 +02:00
Sukriti Sharma
2fffc52b50
llama : fix Roberta embeddings (#10856)
* fix: Use gpt2 tokenizer for roberta and add eos/bos tokens

Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fixes to position embeddings

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* map roberta-bpe to gpt-2

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix linting

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-19 15:04:51 +02:00
fairydreaming
7585edbdeb
convert : Add support for Microsoft Phi-4 model (#10817)
* convert : use GPT2 vocab for Phi-4 model

* convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models

* llama : do not use sliding window attention mask for Phi-4 model

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-12-19 10:37:12 +01:00
Georgi Gerganov
0bf2d10c55
tts : add OuteTTS support (#10784)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : be explicit about the pooling type in the tests

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* llama : add OuteTTS support (wip)

* wip

* extract features

* first conv

* group norm

* resnet conv

* resnet

* attn

* pos net

* layer norm

* convnext

* head

* hann window

* fix n_embd + remove llama.cpp hacks

* compute hann window

* fft

* spectrum processing

* clean-up

* tts : receive input text and generate codes

* clip : fix new conv name

* tts : minor fix

* tts : add header + minor fixes

ggml-ci

* tts : add matchematical constant

ggml-ci

* tts : fix sampling + cut initial noise

* tts : fixes

* tts : update default samplers

ggml-ci

* tts : text pre-processing

* tts : outetts-voc -> wavtokenizer-dec

* tts : remove hardcoded constants

ggml-ci

* tts : fix tensor shapes

* llama : refactor wavtokenizer tensors

ggml-ci

* cont

ggml-ci

* cont [no ci]

* llama : update WavTokenizer to non-causal attn

* llama : handle no-vocab detokenization

* tts : add Python example for OuteTTS (wip)

* tts : extend python example to generate spectrogram

ggml-ci

* server : fix rebase artifacts

* tts : enable "return_tokens" in Python example

ggml-ci

* tts : minor fixes

* common : support HF download for vocoder
2024-12-18 19:27:21 +02:00
Diego Devesa
4da69d1abd
Revert "llama : add Falcon3 support (#10864)" (#10876)
This reverts commit 382bc7f2e8.
2024-12-18 01:36:46 +01:00
Billel Mokeddem
382bc7f2e8
llama : add Falcon3 support (#10864) 2024-12-17 17:24:56 +02:00
Valentin Mamedov
a0974156f3
llama : add Deepseek MoE v1 & GigaChat models (#10827)
* Add deepseek v1 arch & gigachat template

* improve template code

* add readme

* delete comments

* remove comment

* fix format

* lint llama.cpp

* fix order of deepseek and deepseek2, move gigachat temlate to the end of func

* fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need

* remove comments

* move deepseek above deepseek2

* change placement of gigachat chat template
2024-12-15 19:02:46 +02:00
HimariO
ba1cb19cdd
llama : add Qwen2VL support + multimodal RoPE (#10361)
* Barebone Qwen2VL LLM convertor

* Add Qwen2VL cli entrypoint

* [WIP] add qwen2vl arch

* Verify m-rope output

* Add vl-rope/2d-rope support for qwen2vl ViT

* update qwen2vl cli tool

* update 5D tensor op workaround

* [WIP] qwen2vl vision model

* make batch and clip utils compatible with qwen2vl

* [WIP] create inference workflow, gguf convert script but fix

* correcting vision-rope behavior, add the missing last layer back to ViT

* add arg parser to qwen2vl_surgery

* replace variable size array with vector

* cuda-gdb cmake preset

* add fp32 mrope, vision rope kernel

* add fp16 support for qwen2vl and m-rope

* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`

* fix rope op mode switching, out dated func args

* update `llama_hparams`

* update to keep up stream changes

* resolve linter, test errors

* add makefile entry, update speical image padding token

* add mrope unit test, fix few compiler warnings

* rename `mrope` related function, params

* minor updates on debug util, bug fixs

* add `m-rope` testcase to `test-backend-ops`

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix traililng whitespce

* store `llama_hparams.rope_sections` with fixed size array

* update position id tensor size check in GGML_OP_ROPE

* minor updates

* update `ggml_backend_*_supports_op` of unsupported backends

* remote old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
Robert Collins
62e84d9848
llama : add 128k yarn context for Qwen (#10698)
* add 128k yarn context for Qwen

* added property for model tensors

* removing useless line
2024-12-07 23:12:27 +02:00
Sukriti Sharma
784a14aa49
convert : add support for Roberta embeddings (#10695) 2024-12-07 09:02:14 +02:00
Riccardo Orlando
6fe6247831
llama : add Minerva 7B model support (#10673)
* Support for Minerva 7B

* Update convert_hf_to_gguf_update.py
2024-12-05 20:30:59 +02:00
JFLFY2255
8d0cfd554a
llama: Support MiniCPM-1B (with & w/o longrope) (#10559) 2024-12-04 11:42:50 +02:00
Shane A
80acb7b430
Rename Olmo1124 to Olmo2 (#10500) 2024-11-25 19:36:09 +01:00
Gabe Goodhart
9336db462c
convert : XLMRoberta Type Vocab Size (#10458)
This matches the key in common bert-based embedding models and may have a
value other than 1 in it.

Branch: XLMRobertaTypeVocabSize

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-24 11:02:34 +02:00
Shane A
a88ad007de
llama : add OLMo November 2024 support (#10394)
* Add OLMo November 2024 constants

* Add OLMo November 2024 converter

* Add loading of OLMo November 2024 tensors and hyper parameters

* Add building of OLMo November 2024 model
2024-11-19 11:04:08 +02:00
Faisal Zaghloul
60e17ce23c
Remove identical wte/etw logic for jais (#10203) 2024-11-07 08:46:12 -08:00
Xuan Son Nguyen
7554aa4655
convert-lora : make --base optional (#10110)
* convert-lora : make `--base` optional

* lint

* handle case where base_model_name_or_path is invalid

* do not include metadata from base model

* clarify unspecified --base

* add small comment [no ci]

* trigger ci
2024-11-02 12:53:17 +01:00
Georgi Gerganov
bc5ba007b2
server : check that the prompt fits in the slot's context (#10030)
ggml-ci
2024-10-25 10:13:46 +03:00
Molly Sophia
11d47057a5
Rwkv chat template fix (#10001)
* llama: remove useless template matching for rwkv-world

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* converter: Add comment about the hack for rwkv models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-10-22 15:22:26 +02:00
Molly Sophia
4ff7fe1fb3
llama : add chat template for RWKV-World + fix EOT (#9968)
* Add chat template for RWKV-World

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Fix the chat template not being used

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV v6: Set EOT token to ``\n\n``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* readme: add rwkv into supported model list

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-10-22 13:33:37 +03:00
compilade
1927378bcc
convert : refactor rope_freqs generation (#9396)
* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.
2024-10-01 09:31:36 +03:00
nopperl
f99d3f8367
py : add model class for Chameleon conversion (#9683) 2024-09-29 15:02:06 +03:00
Georgi Gerganov
f4d2b8846a
llama : add reranking support (#9510)
* py : add XLMRobertaForSequenceClassification [no ci]

* py : fix scalar-tensor conversion [no ci]

* py : fix position embeddings chop [no ci]

* llama : read new cls tensors [no ci]

* llama : add classigication head (wip) [no ci]

* llama : add "rank" pooling type

ggml-ci

* server : add rerank endpoint

ggml-ci

* llama : aboud ggml_repeat during classification

* rerank : cleanup + comments

* server : accept /rerank endpoint in addition to /v1/rerank [no ci]

* embedding : parse special tokens

* jina : support v1 reranker

* vocab : minor style

ggml-ci

* server : initiate tests for later

ggml-ci

* server : add docs

* llama : add comment [no ci]

* llama : fix uninitialized tensors

* ci : add rerank tests

ggml-ci

* add reranking test

* change test data

* Update examples/server/server.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add `--reranking` argument

* update server docs

* llama : fix comment [no ci]

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
nopperl
9a913110cf
llama : add support for Chameleon (#8543)
* convert chameleon hf to gguf

* add chameleon tokenizer tests

* fix lint

* implement chameleon graph

* add swin norm param

* return qk norm weights and biases to original format

* implement swin norm

* suppress image token output

* rem tabs

* add comment to conversion

* fix ci

* check for k norm separately

* adapt to new lora implementation

* fix layer input for swin norm

* move swin_norm in gguf writer

* add comment regarding special token regex in chameleon pre-tokenizer

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix punctuation regex in chameleon pre-tokenizer (@compilade)

Co-authored-by: compilade <git@compilade.net>

* fix lint

* trigger ci

---------

Co-authored-by: compilade <git@compilade.net>
2024-09-28 15:08:43 +03:00
Gabe Goodhart
3d6bf6919f
llama : add IBM Granite MoE architecture (#9438)
* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: alex.brooks@ibm.com

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Typo fix in docstring

Co-Authored-By: ggerganov@gmail.com

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-25 10:06:52 +03:00
Gabe Goodhart
0d2ec43833
llama : support IBM Granite architecture (#9412)
* feat(gguf-py): Add Granite model and params to gguf-py

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add registration and param setup for Granite

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Add config parsing for Granite multiplier params

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): First pass at full port of granite deviations from llama

Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Determine granite language 3b instruct by vocab size

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel

The defaults in LlamaModel are needed for Granite as well

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Switch Granite param names to use _scale for consistency

Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale

The transformers names with _multiplier will now be converted to the _scale
equivalent during conversion.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-09-17 09:44:58 +03:00
compilade
d54c21df7e
convert : identify missing model files (#9397) 2024-09-16 10:30:22 +03:00
Shane A
0aadac10c7
llama : support OLMoE (#9462) 2024-09-16 09:47:37 +03:00
CarryFun
95ca85168b
llama : support MiniCPM3 (#9322)
Co-authored-by: 范睿凯 <fanruikai@modelbest.cn>
2024-09-16 09:45:20 +03:00