llama.cpp

Mirror of https://github.com/ggerganov/llama.cpp.git, synced 2025-01-12 13:27:21 +01:00

Add support for DeepseekV2ForCausalLM (#7519)

* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

2024-05-28 17:07:05 +02:00
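
The scale_w / w_scale arguments described in the commit message above can be illustrated with a small sketch. This is not the llama.cpp implementation: the function name select_expert_weights, the softmax-then-top-k routing, and the example values are assumptions made for illustration. Only the meaning of scale_w (whether to scale the weights of the selected experts) and w_scale (the numerical scaling factor) comes from the commit message.

```python
import math


def select_expert_weights(router_logits, n_expert_used, scale_w=False, w_scale=1.0):
    """Hypothetical sketch of MoE expert selection with optional weight scaling.

    router_logits : one routing score per expert (assumed input)
    n_expert_used : how many experts to keep per token (assumed top-k routing)
    scale_w       : whether to scale the weights of the selected experts
    w_scale       : the scaling factor applied when scale_w is True
    """
    # Numerically stable softmax over all experts (an assumption of this sketch).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the indices of the n_expert_used highest-probability experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n_expert_used]
    weights = [probs[i] for i in top]

    # Optionally multiply the selected weights by a constant factor.
    if scale_w:
        weights = [w * w_scale for w in weights]

    return top, weights


if __name__ == "__main__":
    logits = [0.0, 1.0, 2.0, 3.0]  # made-up routing scores for four experts
    print(select_expert_weights(logits, 2))
    print(select_expert_weights(logits, 2, scale_w=True, w_scale=16.0))
```

Keeping the scaling behind a boolean flag means callers for other architectures can leave the selected weights untouched, while a DeepSeek-V2 style model can pass its expert_weights_scale header value as w_scale.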

__init__.py

convert-hf : support direct Q8_0 conversion (#7234)

2024-05-13 14:10:51 -04:00

constants.py

Add support for DeepseekV2ForCausalLM (#7519)

2024-05-28 17:07:05 +02:00

gguf_reader.py

gguf-py : fix and simplify quantized shape round-trip (#7483)

2024-05-25 11:11:48 +10:00

gguf_writer.py

Add support for DeepseekV2ForCausalLM (#7519)