llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-02-05 16:10:42 +01:00

Author	SHA1	Message	Date
Xuan Son Nguyen	8586d23c8a	minicpm working without uhd	2025-01-23 12:14:06 +01:00
Xuan Son Nguyen	c0d93dd509	minicpmv works but missing uhd slices	2025-01-22 22:42:00 +01:00
Xuan Son Nguyen	ba489b4743	wip minicpmv	2025-01-22 22:26:38 +01:00
Xuan Son Nguyen	32daa38333	Merge branch 'master' into xsn/vision_2	2025-01-22 13:28:31 +01:00
Xuan Son Nguyen	431bb08059	change gguf KV from clip to vit	2025-01-21 10:51:26 +01:00
Xuan Son Nguyen	ec7f3ac9ab	llama : add support for Deepseek-R1-Qwen distill model (#11310 ) * llama : add support for Deepseek-R1-Qwen distill model * coding style	2025-01-20 14:35:07 +01:00
Xuan Son Nguyen	4a7ab89d75	wip minicpmv	2025-01-19 22:33:05 +01:00
Xuan Son Nguyen	d0068ef0ed	add mobilevlm	2025-01-19 16:29:20 +01:00
Xuan Son Nguyen	6cabdda0df	add back convert hf to gguf	2025-01-18 22:56:04 +01:00
RunningLeon	4dbc8b9cb7	llama : add internlm3 support (#11233 ) * support internlm3 * fix lint	2025-01-16 20:10:38 +02:00
Daniel Bevenius	2739a71e4b	convert : sort print supported models [no ci] (#11179 ) This commit sorts the list of supported models when printing them out. The motivation for this change is to make it easier to find a specific model in the list of supported models. For example: ```console $ ./convert_hf_to_gguf.py --print-supported-models Supported models: - ArcticForCausalLM - BaiChuanForCausalLM - BaichuanForCausalLM - BertForMaskedLM - BertModel - BitnetForCausalLM - BloomForCausalLM - BloomModel - CamembertModel - ChameleonForCausalLM - ChameleonForConditionalGeneration - ChatGLMForConditionalGeneration - ChatGLMModel - CodeShellForCausalLM - Cohere2ForCausalLM - CohereForCausalLM - DbrxForCausalLM - DeciLMForCausalLM - DeepseekForCausalLM - DeepseekV2ForCausalLM - DeepseekV3ForCausalLM - ExaoneForCausalLM - FalconForCausalLM - FalconMambaForCausalLM - GPT2LMHeadModel - GPTBigCodeForCausalLM - GPTNeoXForCausalLM - GPTRefactForCausalLM - Gemma2ForCausalLM - GemmaForCausalLM - GraniteForCausalLM - GraniteMoeForCausalLM - GrokForCausalLM - InternLM2ForCausalLM - JAISLMHeadModel - JinaBertForMaskedLM - JinaBertModel - LLaMAForCausalLM - LlamaForCausalLM - LlavaStableLMEpochForCausalLM - MPTForCausalLM - MT5ForConditionalGeneration - MambaForCausalLM - MambaLMHeadModel - MiniCPM3ForCausalLM - MiniCPMForCausalLM - MistralForCausalLM - MixtralForCausalLM - NemotronForCausalLM - NomicBertModel - OLMoForCausalLM - Olmo2ForCausalLM - OlmoForCausalLM - OlmoeForCausalLM - OpenELMForCausalLM - OrionForCausalLM - Phi3ForCausalLM - PhiForCausalLM - PhiMoEForCausalLM - PlamoForCausalLM - QWenLMHeadModel - Qwen2ForCausalLM - Qwen2MoeForCausalLM - Qwen2VLForConditionalGeneration - RWForCausalLM - RWKV6Qwen2ForCausalLM - RobertaModel - Rwkv6ForCausalLM - StableLMEpochForCausalLM - StableLmForCausalLM - Starcoder2ForCausalLM - T5EncoderModel - T5ForConditionalGeneration - T5WithLMHeadModel - UMT5ForConditionalGeneration - WavTokenizerDec - XLMRobertaForSequenceClassification - XLMRobertaModel - XverseForCausalLM ```	2025-01-11 05:50:33 +01:00
Daniel Bevenius	ff3fcabc72	convert : add --print-supported-models option (#11172 ) * convert : add --print-supported-models option This commit adds a new option to the convert_hf_to_gguf.py script to print the supported models. The motivation for this is that it can be useful to know which models are supported by the script without having to look at the code. Example usage: ```console $ ./convert_hf_to_gguf.py --print-supported-models Supported models: - GPTNeoXForCausalLM - BloomForCausalLM - BloomModel - MPTForCausalLM - OrionForCausalLM - BaichuanForCausalLM - BaiChuanForCausalLM - XverseForCausalLM - FalconForCausalLM - RWForCausalLM - GPTBigCodeForCausalLM - GPTRefactForCausalLM - StableLmForCausalLM - StableLMEpochForCausalLM - LlavaStableLMEpochForCausalLM - LLaMAForCausalLM - LlamaForCausalLM - MistralForCausalLM - MixtralForCausalLM - DeciLMForCausalLM - BitnetForCausalLM - GrokForCausalLM - DbrxForCausalLM - MiniCPMForCausalLM - MiniCPM3ForCausalLM - QWenLMHeadModel - Qwen2ForCausalLM - Qwen2VLForConditionalGeneration - WavTokenizerDec - Qwen2MoeForCausalLM - GPT2LMHeadModel - PhiForCausalLM - Phi3ForCausalLM - PhiMoEForCausalLM - PlamoForCausalLM - CodeShellForCausalLM - InternLM2ForCausalLM - BertModel - BertForMaskedLM - CamembertModel - RobertaModel - NomicBertModel - XLMRobertaModel - XLMRobertaForSequenceClassification - GemmaForCausalLM - Gemma2ForCausalLM - Starcoder2ForCausalLM - Rwkv6ForCausalLM - RWKV6Qwen2ForCausalLM - MambaForCausalLM - MambaLMHeadModel - FalconMambaForCausalLM - CohereForCausalLM - Cohere2ForCausalLM - OLMoForCausalLM - OlmoForCausalLM - Olmo2ForCausalLM - OlmoeForCausalLM - JinaBertModel - JinaBertForMaskedLM - OpenELMForCausalLM - ArcticForCausalLM - DeepseekForCausalLM - DeepseekV3ForCausalLM - DeepseekV2ForCausalLM - UMT5ForConditionalGeneration - MT5ForConditionalGeneration - T5ForConditionalGeneration - T5WithLMHeadModel - T5EncoderModel - JAISLMHeadModel - ChatGLMModel - ChatGLMForConditionalGeneration - NemotronForCausalLM - ExaoneForCausalLM - GraniteForCausalLM - GraniteMoeForCausalLM - ChameleonForCausalLM - ChameleonForConditionalGeneration ``` * squash! convert : add --print-supported-models option Fix flake8 error.	2025-01-10 11:30:53 +01:00
Molly Sophia	ee7136c6d1	llama: add support for QRWKV6 model architecture (#11001 ) llama: add support for QRWKV6 model architecture (#11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-10 09:58:08 +08:00
Pierrick Hymbert	f8feb4b01a	model: Add support for PhiMoE arch (#11003 ) * model: support phimoe * python linter * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: add phimoe as supported model ggml-ci --------- Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>	2025-01-09 11:21:41 +01:00
fairydreaming	9394bbd484	llama : Add support for DeepSeek V3 (#11049 ) * convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type * vocab : add DeepSeek V3 pre-tokenizer regexes * unicode : handle ACCENT_MARK and SYMBOL categories in regex * llama : add DeepSeek V3 chat template, handle new model parameters and tensor types --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2025-01-04 21:06:11 +01:00
DAN™	46be942214	llama : add support for the cohere2 model architecture (#10900 )	2025-01-04 16:33:31 +02:00
ymcki	bc7b1f8632	convert : fix Llama-3_1-Nemotron-51B rope settings (#11008 ) * conflict resolution * move comments after bracket to its own line * DeciLMCausalModel now reads rope_theta from config.json properly	2024-12-31 13:04:48 +02:00
Yun Dou	b92a14a841	llama : support InfiniAI Megrez 3b (#10893 ) * Support InfiniAI Megrez 3b * Fix tokenizer_clean_spaces for megrez	2024-12-23 01:35:44 +01:00
ymcki	6f0c9e034b	llama : support for Llama-3_1-Nemotron-51B (#10669 ) * conflict resolution * move comments after bracket to its own line	2024-12-23 01:22:33 +01:00
Billel Mokeddem	7ae33a616f	llama : add Falcon3 support (#10883 ) * Add Falcon3 model support * Add fix for adding bos to added special tokens * Add comment explaining the logic behind the if statement * Add a log message to better track the when the following line of code is triggered * Update log to only print when input and output characters are different * Fix handling pre-normalized tokens * Refactoring	2024-12-23 00:09:58 +02:00
Georgi Gerganov	5cd85b5e00	convert : add BertForMaskedLM (#10919 )	2024-12-21 10:10:18 +02:00
Molly Sophia	0a11f8b7b5	convert : fix RWKV v6 model conversion (#10913 ) * Enable --no-context-shift for llama-perplexity example Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV 6: Fix error in ggml_cuda_op_bin_bcast Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-20 11:44:58 +02:00
Sukriti Sharma	2fffc52b50	llama : fix Roberta embeddings (#10856 ) * fix: Use gpt2 tokenizer for roberta and add eos/bos tokens Branch: RobertaTokenizer Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fixes to position embeddings Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * map roberta-bpe to gpt-2 Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix linting Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>	2024-12-19 15:04:51 +02:00
fairydreaming	7585edbdeb	convert : Add support for Microsoft Phi-4 model (#10817 ) * convert : use GPT2 vocab for Phi-4 model * convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models * llama : do not use sliding window attention mask for Phi-4 model --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-12-19 10:37:12 +01:00
Georgi Gerganov	0bf2d10c55	tts : add OuteTTS support (#10784 ) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : be explicit about the pooling type in the tests ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * llama : add OuteTTS support (wip) * wip * extract features * first conv * group norm * resnet conv * resnet * attn * pos net * layer norm * convnext * head * hann window * fix n_embd + remove llama.cpp hacks * compute hann window * fft * spectrum processing * clean-up * tts : receive input text and generate codes * clip : fix new conv name * tts : minor fix * tts : add header + minor fixes ggml-ci * tts : add matchematical constant ggml-ci * tts : fix sampling + cut initial noise * tts : fixes * tts : update default samplers ggml-ci * tts : text pre-processing * tts : outetts-voc -> wavtokenizer-dec * tts : remove hardcoded constants ggml-ci * tts : fix tensor shapes * llama : refactor wavtokenizer tensors ggml-ci * cont ggml-ci * cont [no ci] * llama : update WavTokenizer to non-causal attn * llama : handle no-vocab detokenization * tts : add Python example for OuteTTS (wip) * tts : extend python example to generate spectrogram ggml-ci * server : fix rebase artifacts * tts : enable "return_tokens" in Python example ggml-ci * tts : minor fixes * common : support HF download for vocoder	2024-12-18 19:27:21 +02:00
Diego Devesa	4da69d1abd	Revert "llama : add Falcon3 support (#10864 )" (#10876 ) This reverts commit `382bc7f2e8`.	2024-12-18 01:36:46 +01:00
Billel Mokeddem	382bc7f2e8	llama : add Falcon3 support (#10864 )	2024-12-17 17:24:56 +02:00
Valentin Mamedov	a0974156f3	llama : add Deepseek MoE v1 & GigaChat models (#10827 ) * Add deepseek v1 arch & gigachat template * improve template code * add readme * delete comments * remove comment * fix format * lint llama.cpp * fix order of deepseek and deepseek2, move gigachat temlate to the end of func * fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need * remove comments * move deepseek above deepseek2 * change placement of gigachat chat template	2024-12-15 19:02:46 +02:00
HimariO	ba1cb19cdd	llama : add Qwen2VL support + multimodal RoPE (#10361 ) * Barebone Qwen2VL LLM convertor * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, out dated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update speical image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixs * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix traililng whitespce * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remote old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-14 14:43:46 +02:00
Robert Collins	62e84d9848	llama : add 128k yarn context for Qwen (#10698 ) * add 128k yarn context for Qwen * added property for model tensors * removing useless line	2024-12-07 23:12:27 +02:00
Sukriti Sharma	784a14aa49	convert : add support for Roberta embeddings (#10695 )	2024-12-07 09:02:14 +02:00
Riccardo Orlando	6fe6247831	llama : add Minerva 7B model support (#10673 ) * Support for Minerva 7B * Update convert_hf_to_gguf_update.py	2024-12-05 20:30:59 +02:00
JFLFY2255	8d0cfd554a	llama: Support MiniCPM-1B (with & w/o longrope) (#10559 )	2024-12-04 11:42:50 +02:00
Shane A	80acb7b430	Rename Olmo1124 to Olmo2 (#10500 )	2024-11-25 19:36:09 +01:00
Gabe Goodhart	9336db462c	convert : XLMRoberta Type Vocab Size (#10458 ) This matches the key in common bert-based embedding models and may have a value other than 1 in it. Branch: XLMRobertaTypeVocabSize Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2024-11-24 11:02:34 +02:00
Shane A	a88ad007de	llama : add OLMo November 2024 support (#10394 ) * Add OLMo November 2024 constants * Add OLMo November 2024 converter * Add loading of OLMo November 2024 tensors and hyper parameters * Add building of OLMo November 2024 model	2024-11-19 11:04:08 +02:00
Faisal Zaghloul	60e17ce23c	Remove identical wte/etw logic for jais (#10203 )	2024-11-07 08:46:12 -08:00
Xuan Son Nguyen	7554aa4655	convert-lora : make `--base` optional (#10110 ) * convert-lora : make `--base` optional * lint * handle case where base_model_name_or_path is invalid * do not include metadata from base model * clarify unspecified --base * add small comment [no ci] * trigger ci	2024-11-02 12:53:17 +01:00
Georgi Gerganov	bc5ba007b2	server : check that the prompt fits in the slot's context (#10030 ) ggml-ci	2024-10-25 10:13:46 +03:00
Molly Sophia	11d47057a5	Rwkv chat template fix (#10001 ) * llama: remove useless template matching for rwkv-world Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Add comment about the hack for rwkv models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-10-22 15:22:26 +02:00
Molly Sophia	4ff7fe1fb3	llama : add chat template for RWKV-World + fix EOT (#9968 ) * Add chat template for RWKV-World Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Fix the chat template not being used Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV v6: Set EOT token to ``\n\n`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * readme: add rwkv into supported model list Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-10-22 13:33:37 +03:00
compilade	1927378bcc	convert : refactor rope_freqs generation (#9396 ) * convert : refactor rope_freqs generation This should also fix vocab-only conversion for Phi-3. * convert : adapt MiniCPM3 to separate rope_freqs insertion MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls. gguf-py : add long and short RoPE factors to tensor mappings Empty, but the key names are used to populate the mappings.	2024-10-01 09:31:36 +03:00
nopperl	f99d3f8367	py : add model class for Chameleon conversion (#9683 )	2024-09-29 15:02:06 +03:00
Georgi Gerganov	f4d2b8846a	llama : add reranking support (#9510 ) * py : add XLMRobertaForSequenceClassification [no ci] * py : fix scalar-tensor conversion [no ci] * py : fix position embeddings chop [no ci] * llama : read new cls tensors [no ci] * llama : add classigication head (wip) [no ci] * llama : add "rank" pooling type ggml-ci * server : add rerank endpoint ggml-ci * llama : aboud ggml_repeat during classification * rerank : cleanup + comments * server : accept /rerank endpoint in addition to /v1/rerank [no ci] * embedding : parse special tokens * jina : support v1 reranker * vocab : minor style ggml-ci * server : initiate tests for later ggml-ci * server : add docs * llama : add comment [no ci] * llama : fix uninitialized tensors * ci : add rerank tests ggml-ci * add reranking test * change test data * Update examples/server/server.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * add `--reranking` argument * update server docs * llama : fix comment [no ci] ggml-ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-28 17:42:03 +03:00
nopperl	9a913110cf	llama : add support for Chameleon (#8543 ) * convert chameleon hf to gguf * add chameleon tokenizer tests * fix lint * implement chameleon graph * add swin norm param * return qk norm weights and biases to original format * implement swin norm * suppress image token output * rem tabs * add comment to conversion * fix ci * check for k norm separately * adapt to new lora implementation * fix layer input for swin norm * move swin_norm in gguf writer * add comment regarding special token regex in chameleon pre-tokenizer * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix punctuation regex in chameleon pre-tokenizer (@compilade) Co-authored-by: compilade <git@compilade.net> * fix lint * trigger ci --------- Co-authored-by: compilade <git@compilade.net>	2024-09-28 15:08:43 +03:00
Gabe Goodhart	3d6bf6919f	llama : add IBM Granite MoE architecture (#9438 ) * feat(gguf-py): Add granitemoe architecture This includes the addition of new tensor names for the new moe layers. These may not be correct at this point due to the need for the hack in gguf_writer.py to double-check the length of the shape for these layers. Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert_hf_to_gguf): Add GraniteMoeModel GraniteMoe has the same configuration deltas as Granite Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(granitemoe convert): Split the double-sized input layer into gate and up After a lot of staring and squinting, it's clear that the standard mixtral expert implementation is equivalent to the vectorized parallel experts in granite. The difference is that in granite, the w1 and w3 are concatenated into a single tensor "input_linear." Rather than reimplementing all of the math on the llama.cpp side, the much simpler route is to just split this tensor during conversion and follow the standard mixtral route. Branch: GraniteMoE Co-Authored-By: alex.brooks@ibm.com Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(granitemoe): Implement granitemoe GraniteMoE follows the mixtral architecture (once the input_linear layers are split into gate_exps/up_exps). The main delta is the addition of the same four multipliers used in Granite. Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Typo fix in docstring Co-Authored-By: ggerganov@gmail.com Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(conversion): Simplify tensor name mapping in conversion Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Remove unused tensor name mappings Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Sanity check on merged FFN tensor sizes Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow "output" layer in granite moe architecture (convert and cpp) Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(granite): Add missing 'output' tensor for Granite This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-25 10:06:52 +03:00
Gabe Goodhart	0d2ec43833	llama : support IBM Granite architecture (#9412 ) * feat(gguf-py): Add Granite model and params to gguf-py Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert_hf_to_gguf): Add registration and param setup for Granite Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): Add config parsing for Granite multiplier params Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): First pass at full port of granite deviations from llama Something is still not working right since the results are mostly terrible, but on occasion it's producing relevant results at this point, so _something_ is working. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Determine granite language 3b instruct by vocab size Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel The defaults in LlamaModel are needed for Granite as well Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Switch Granite param names to use _scale for consistency Other scalar multipliers are called _scale, so this provides a more consistent naming convention. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale The transformers names with _multiplier will now be converted to the _scale equivalent during conversion. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2024-09-17 09:44:58 +03:00
compilade	d54c21df7e	convert : identify missing model files (#9397 )	2024-09-16 10:30:22 +03:00
Shane A	0aadac10c7	llama : support OLMoE (#9462 )	2024-09-16 09:47:37 +03:00
CarryFun	95ca85168b	llama : support MiniCPM3 (#9322 ) Co-authored-by: 范睿凯 <fanruikai@modelbest.cn>	2024-09-16 09:45:20 +03:00

1 2

90 Commits