llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-29 07:34:18 +01:00

Author	SHA1	Message	Date
Francis Couture-Harpin	6840ac0bca	Merge branch 'master' into compilade/refactor-kv-cache	2024-06-08 17:30:49 -04:00
Joan Fontanals	f5d7b268ec	llama : add jina v2 base code (#7596 ) * feat: add changes to handle jina v2 base code * fix: do not complicate things * fix: fix the usage of the code model * fix: fix comments * fix: fix linting issues * fix: remove ollama patches * style : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-06 10:22:41 +03:00
Galunid	7672adeec7	Fix encoding in python scripts (#7733 )	2024-06-06 03:07:24 +10:00
Francis Couture-Harpin	5d3c7b9585	Merge branch 'master' into compilade/refactor-kv-cache	2024-06-01 11:51:41 -04:00
Galunid	0515ad93f4	convert-hf : Handle NotImplementedError in convert-hf-to-gguf (#7660 )	2024-05-31 17:42:33 +02:00
Galunid	9c4c9cc83f	Move convert.py to examples/convert-legacy-llama.py (#7430 ) * Move convert.py to examples/convert-no-torch.py * Fix CI, scripts, readme files * convert-no-torch -> convert-legacy-llama * Move vocab thing to vocab.py * Fix convert-no-torch -> convert-legacy-llama * Fix lost convert.py in ci/run.sh * Fix imports * Fix gguf not imported correctly * Fix flake8 complaints * Fix check-requirements.sh * Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE * Review fixes	2024-05-30 21:40:00 +10:00
Francis Couture-Harpin	4e4c41e553	Merge branch 'master' into compilade/refactor-kv-cache	2024-05-28 15:15:18 -04:00
Giuseppe Scrivano	5442939fcc	llama : support small Granite models (#7481 ) * Add optional MLP bias for Granite models Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses ggerganov/llama.cpp/issues/7116 Still needs some more changes to properly support Granite. * llama: honor add_space_prefix from the model configuration propagate the add_space_prefix configuration from the HF model configuration to the gguf file and honor it with the gpt2 tokenizer. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> * llama: add support for small granite models it works only for the small models 3b and 8b. The convert-hf-to-gguf.py script uses the vocabulary size of the granite models to detect granite and set the correct configuration. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Co-authored-by: Steffen Roecker <sroecker@redhat.com>	2024-05-28 21:49:49 +03:00
fairydreaming	ee3dff6b8e	Add support for DeepseekV2ForCausalLM (#7519 ) * common : increase max number of experts to 160 * common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture * common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier * convert-hf : add model conversion support for DeepseekV2ForCausalLM * llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models * llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor) * llama : add inference support for LLM_ARCH_DEEPSEEK2 --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-05-28 17:07:05 +02:00
Galunid	32a28217f4	Fix aya-23 conversion scripts (#7539 )	2024-05-26 16:02:34 +02:00
Bartowski	c429b33beb	llama : add Smaug 70B support (#7402 )	2024-05-26 15:28:35 +03:00
Francis Couture-Harpin	fc59407efe	convert-hf : support Mini-Jamba conversion	2024-05-25 13:56:21 -04:00
Francis Couture-Harpin	ea2e63e9d2	convert-hf : check for unprocessed Jamba experts	2024-05-25 12:54:30 -04:00
compilade	b83bab15a5	gguf-py : fix and simplify quantized shape round-trip (#7483 ) * gguf-py : fix and simplify quantized shape round-trip * gguf-py : remove unused import	2024-05-25 11:11:48 +10:00
Francis Couture-Harpin	0fd13e9473	Merge branch 'master' into compilade/refactor-kv-cache	2024-05-24 19:35:16 -04:00
Francis Couture-Harpin	cbc743e600	llama : support Jamba	2024-05-24 19:27:27 -04:00
fairydreaming	fbca2f27fc	Add support for ArcticForCausalLM (#7020 ) * common : increase max number of experts to 128 * common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn * gguf-py : add architecture-specific block mappings that override selected general block mappings * convert-hf : add model conversion support for ArcticForCausalLM * convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM) * llama : add inference support for LLM_ARCH_ARCTIC --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-05-24 14:31:13 +02:00
fairydreaming	9b82476ee9	Add missing inference support for GPTNeoXForCausalLM (Pythia and GPT-NeoX base models) (#7461 ) * convert-hf : add conversion of bloom-style qkv tensor to gpt-style qkv (code borrowed from BloomModel) * llama : add inference support for LLM_ARCH_GPTNEOX * llama : add model types for every Pythia variant and GPT-NeoX Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-05-23 11:49:53 +02:00
liuwei-git	201cc11afa	llama : add phi3 128K model support (#7225 ) * add phi3 128k support in convert-hf-to-gguf * add phi3 128k support in cuda * address build warnings on llama.cpp * adjust index value in cuda long rope freq factors * add long rope support in ggml cpu backend * make freq factors only depend on ctx size * remove unused rope scaling type 'su' frin gguf converter * fix flint warnings on convert-hf-to-gguf.py * set to the short freq factor when context size is small than trained context size * add one line of comments * metal : support rope freq_factors * ggml : update ggml_rope_ext API to support freq. factors * backends : add dev messages to support rope freq. factors * minor : style * tests : update to use new rope API * backends : fix pragma semicolons * minor : cleanup * llama : move rope factors from KV header to tensors * llama : remove tmp assert * cuda : fix compile warning * convert : read/write n_head_kv * llama : fix uninitialized tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-21 23:28:32 +03:00
Georgi Gerganov	c3f8d58356	tests : test-tokenizer-0.sh print more info (#7402 )	2024-05-21 19:53:48 +03:00
jaime-m-p	d7e852c1bc	Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425 ) * Update brute force test: add_special * Update brute force test: default values for add_bos_token and add_eos_token * Enable rtrim when pre-inserting BOS Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Revert "server : fix test regexes"	2024-05-21 14:39:48 +02:00
jaime-m-p	917dc8cfa6	Tokenizer SPM fixes for phi-3 and llama-spm (#7375 ) * Update brute force test: special tokens * Fix added tokens - Try to read 'added_tokens.json'. - Try to read 'tokenizer_config.json'. - Try to read 'tokenizer.json'. * Fix special tokens rtrim Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : fix test regexes	2024-05-20 20:15:57 +02:00
Georgi Gerganov	fabf30b4c4	llama : remove Persimmon (#7408 ) * llama : remove Persimmon * requirements : remove	2024-05-21 02:35:28 +10:00
Anas Ahouzi	6aade19ee7	Add StableLM2 pre-tokenizer (#7349 ) * Add StableLM pre-tokenizer * Fix space * Fix trailing whitespace	2024-05-19 22:46:46 +10:00
Georgi Gerganov	b49a13dd2f	convert : fix set_vocab_sentencepiece (#6866 ) * convert : fix set_vocab_sentencepiece * Update convert-hf-to-gguf.py	2024-05-18 08:46:20 +03:00
Aarni Koskela	d273c1402b	py : convert-hf-to-gguf-update improvements (#7340 ) * convert-hf-to-gguf-update: automate updating * convert-hf-to-gguf-update: improve download * share requests session for performance * create directories only when needed, don't skip downloads when empty directory encountered * be more graceful about errors	2024-05-17 15:11:45 +03:00
amd-lalithnc	e18bc6aaf3	convert : fix Qwen/Qwen-7b conversion (#7308 )	2024-05-17 10:01:58 +03:00
compilade	ee52225067	convert-hf : support direct Q8_0 conversion (#7234 ) * convert-hf : support q8_0 conversion * convert-hf : add missing ftype This was messing with the checksums otherwise. * convert-hf : add missing ftype to Baichuan and Xverse I didn't notice these on my first pass.	2024-05-13 14:10:51 -04:00
Joan Fontanals	9aa672490c	llama : rename jina tokenizers to v2 (#7249 ) * refactor: rename jina tokenizers to v2 * refactor: keep refactoring non-breaking	2024-05-13 11:35:14 +03:00
compilade	5a419926b0	convert-hf : support bfloat16 conversion (#7158 ) * convert-hf : support bfloat16 conversion * gguf-py : flake8 fixes * convert-hf : add missing space after comma * convert-hf : get bit-exact same output as ./quantize The quantization version was missing. * convert-hf : don't round bf16 NANs * convert-hf : save some memory with np.int16 intermediate bf16 weights * convert-hf : more closely match llama.cpp with which weights to keep in f32 * convert-hf : add --outtype auto-f16 A reason for this to exist is for model quantizers who want an initial GGUF with the most fidelity to the original model while still using a 16-bit float type instead of 32-bit floats. * convert-hf : remove a semicolon because flake8 doesn't like it It's a reflex from when programming in C/C++, I guess. * convert-hf : support outtype templating in outfile name * convert-hf : rename --outtype auto-f16 to --outtype auto	2024-05-11 11:06:26 -04:00
Joan Fontanals	b83cc3f5b3	llama : add Jina Embeddings architecture (#6826 ) * feat: first things to do * feat: create tensors for Jina architecture * fix: use other tensors * feat: embedding gets results * fix: fix usage of ALIBI * fix: clean prints * fix: do some cleanup unused vars * fix: revert changes to Makefile and CMakeLists * fix: revert some changes * fix: fix small detail * fix: fix convert formatting * fix: fix linting and editor * feat: set proper vocab settings * fix: JinaBertForMaskedLM registration * feat: support q_normalization and k_normalization in Jina arch * feat: handle gpt2 tokenizer with Jina architecture * feat: example comments in embedding * feat: rename Jina Bert to Jina Bert V2 * fix: add some changes as per review * feat: proper KQ_pos for Jina embeddings * feat: add capacity to load models ES and DE for Spanish * llama : fix pre-tokenizers * ggml : full ALiBi support * ggml : update ggml_soft_max_ext() CUDA, SYCL * ggml : ggml_flash_attn_ext() support ALiBi (CPU) * ggml : ggml_flash_attn_ext() support ALiBi (Metal) * ggml : fix warning * ggml : ggml_flash_attn_ext() support ALiBi (CUDA) ggml-ci * minor : clean-up * embedding : add warning about missing SEP --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-11 10:46:09 +03:00
Georgi Gerganov	9cb317f77e	ggml : full ALiBi support (#7192 ) * ggml : full ALiBi support * ggml : update ggml_soft_max_ext() CUDA, SYCL * ggml : ggml_flash_attn_ext() support ALiBi (CPU) * ggml : ggml_flash_attn_ext() support ALiBi (Metal) * ggml : fix warning * ggml : ggml_flash_attn_ext() support ALiBi (CUDA) ggml-ci * ggml : fix assert message * vulkan : add dev notes * ggml : require mask when using ALiBi ggml-ci * convert : fix convert for refact models	2024-05-11 10:32:41 +03:00
compilade	f98eb31c51	convert-hf : save memory with lazy evaluation (#7075 ) * convert-hf : begin refactoring write_tensor * convert : upgrade to sentencepiece v0.2.0 * convert-hf : remove unused n_dims in extra__tensors convert-hf : simplify MoE weights stacking * convert-hf : flake8 linter doesn't like semicolons * convert-hf : allow unusual model part names For example, loading `model-00001-of-00001.safetensors` now works. * convert-hf : fix stacking MoE expert tensors `torch.stack` and `torch.cat` don't do the same thing. * convert-hf : fix Mamba conversion Tested to work even with a SentencePiece-based tokenizer. * convert : use a string for the SentencePiece tokenizer path * convert-hf : display tensor shape * convert-hf : convert norms to f32 by default * convert-hf : sort model part names `os.listdir` is said to list files in arbitrary order. Sorting the file names should let "model-00009-of-00042.safetensors" be loaded before "model-00010-of-00042.safetensors". * convert-hf : use an ABC for Model again It seems Protocol can't be used as a statically type-checked ABC, because its subclasses also can't be instantiated. (why did it seem to work?) At least there's still a way to throw an error when forgetting to define the `model_arch` property of any registered Model subclasses. * convert-hf : use a plain class for Model, and forbid direct instantiation There are no abstract methods used anyway, so using ABC isn't really necessary. * convert-hf : more consistent formatting of cmdline args * convert-hf : align the message logged for converted tensors * convert-hf : fix Refact conversion * convert-hf : save memory with lazy evaluation * convert-hf : flake8 doesn't like lowercase L as a variable name * convert-hf : remove einops requirement for InternLM2 * convert-hf : faster model parts loading Instead of pre-loading them all into a dict, iterate on the tensors in the model parts progressively as needed in Model.write_tensors Conversion for some architectures relies on checking for the presence of specific tensor names, so for multi-part models, the weight map is read from the relevant json file to quickly get these names up-front. * convert-hf : minor changes for consistency * gguf-py : add tqdm as a dependency It's small, and used for a progress bar in GGUFWriter.write_tensors_to_file	2024-05-08 18:16:38 -04:00
Ren Xuancheng	229ffff872	llama : add BPE pre-tokenization for Qwen2 (#7114 ) * Add BPE pre-tokenization for Qwen2. * minor : fixes --------- Co-authored-by: Ren Xuancheng <17811943+jklj077@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-08 15:06:43 +03:00
DAN™	4cd621c26d	convert : add BPE pre-tokenization for DBRX (#7132 ) * Add BPE pre-tokenization for DBRX. * Add vocab GGUFs. * Remove test. * Remove GGUFs.	2024-05-08 13:43:23 +03:00
nopperl	b6aa670203	Fix OLMo HF to GGUF conversion (#6910 )	2024-05-07 21:39:43 +02:00
DAN™	889bdd7686	command-r : add BPE pre-tokenization (#7063 ) * Add BPE pre-tokenization for Command-R/R+. * Bump transformers convert requirement. * command-r : add individual digits regex --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-05 08:19:30 +03:00
Georgi Gerganov	92139b90af	tests : add test-tokenizer-0.sh + fix some tokenizers (#7036 ) * tests : add test-tokenizer-0.sh * unicode : add all unicode number ranges * starcoder : fix pre-tokenizer * tests : add test that fails with DeepSeek tokenizers * falcon : fix regex * unicode : regenerate unicode tables * refact : add tokenizer model * lint : fix * tests : disable failing tests ggml-ci * refact : add tests files ggml-ci * convert : print -> logging ggml-ci * lint : fix * unicode : digit -> number * phi-3 : update	2024-05-04 08:32:32 +03:00
Brian	a2ac89d6ef	convert.py : add python logging instead of print() (#6511 ) * convert.py: add python logging instead of print() * convert.py: verbose flag takes priority over dump flag log suppression * convert.py: named instance logging * convert.py: use explicit logger id string * convert.py: convert extra print() to named logger * convert.py: sys.stderr.write --> logger.error * .py: Convert all python scripts to use logging module requirements.txt: remove extra line * flake8: update flake8 ignore and exclude to match ci settings * gh-actions: add flake8-no-print to flake8 lint step * pre-commit: add flake8-no-print to flake8 and also update pre-commit version * convert-hf-to-gguf.py: print() to logger conversion * .py: logging basiconfig refactor to use conditional expression .py: removed commented out logging fixup! .py: logging basiconfig refactor to use conditional expression constant.py: logger.error then exit should be a raise exception instead * .py: Convert logger error and sys.exit() into a raise exception (for atypical error) gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar * verify-checksum-model.py: This is the result of the program, it should be printed to stdout. * compare-llama-bench.py: add blank line for readability during missing repo response * reader.py: read_gguf_file() use print() over logging * convert.py: warning goes to stderr and won't hurt the dump output * gguf-dump.py: dump_metadata() should print to stdout * convert-hf-to-gguf.py: print --> logger.debug or ValueError() * verify-checksum-models.py: use print() for printing table * .py: refactor logging.basicConfig() gguf-py/gguf/.py: use __name__ as logger name Since they will be imported and not run directly. python-lint.yml: use .flake8 file instead * constants.py: logger no longer required * convert-hf-to-gguf.py: add additional logging * convert-hf-to-gguf.py: print() --> logger * .py: fix flake8 warnings revert changes to convert-hf-to-gguf.py for get_name() * convert-hf-to-gguf-update.py: use triple quoted f-string instead * .py: accidentally corrected the wrong line *.py: add compilade warning suggestions and style fixes	2024-05-03 22:36:41 +03:00
Bartowski	60325fa56f	Remove .attention from skipped tensors to match more accurately (#7051 )	2024-05-03 01:49:09 +02:00
Georgi Gerganov	952d03dbea	convert : use utf8 encoding (#7000 ) * convert : use utf8 encoding * convert : update instructions and warning message	2024-04-30 11:05:25 +03:00
Georgi Gerganov	f4ab2a4147	llama : fix BPE pre-tokenization (#6920 ) * merged the changes from deepseeker models to main branch * Moved regex patterns to unicode.cpp and updated unicode.h * Moved header files * Resolved issues * added and refactored unicode_regex_split and related functions * Updated/merged the deepseek coder pr * Refactored code * Adding unicode regex mappings * Adding unicode regex function * Added needed functionality, testing remains * Fixed issues * Fixed issue with gpt2 regex custom preprocessor * unicode : fix? unicode_wstring_to_utf8 * lint : fix whitespaces * tests : add tokenizer tests for numbers * unicode : remove redundant headers * tests : remove and rename tokenizer test scripts * tests : add sample usage * gguf-py : reader prints warnings on duplicate keys * llama : towards llama3 tokenization support (wip) * unicode : shot in the dark to fix tests on Windows * unicode : first try custom implementations * convert : add "tokenizer.ggml.pre" GGUF KV (wip) * llama : use new pre-tokenizer type * convert : fix pre-tokenizer type writing * lint : fix * make : add test-tokenizer-0-llama-v3 * wip * models : add llama v3 vocab file * llama : adapt punctuation regex + add llama 3 regex * minor * unicode : set bomb * unicode : set bomb * unicode : always use std::wregex * unicode : support \p{N}, \p{L} and \p{P} natively * unicode : try fix windows * unicode : category support via std::regex * unicode : clean-up * unicode : simplify * convert : add convert-hf-to-gguf-update.py ggml-ci * lint : update * convert : add falcon ggml-ci * unicode : normalize signatures * lint : fix * lint : fix * convert : remove unused functions * convert : add comments * convert : exercise contractions ggml-ci * lint : fix * cmake : refactor test targets * tests : refactor vocab tests ggml-ci * tests : add more vocabs and tests ggml-ci * unicode : cleanup * scripts : ignore new update script in check-requirements.sh * models : add phi-3, mpt, gpt-2, starcoder * tests : disable obsolete ggml-ci * tests : use faster bpe test ggml-ci * llama : more prominent warning for old BPE models * tests : disable test-tokenizer-1-bpe due to slowness ggml-ci --------- Co-authored-by: Jaggzh <jaggz.h@gmail.com> Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>	2024-04-29 16:58:41 +03:00
Christian Zhou-Zheng	3055a41805	convert : fix conversion of some BERT embedding models (#6937 )	2024-04-29 16:34:41 +03:00
Junyang Lin	3fec68be4e	convert : add support of codeqwen due to tokenizer (#6707 ) * add support of codeqwen due to tokenizer * override load_hparams * fix typo * fix load_params * convert : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-24 10:16:21 +03:00
liuwei-git	c8297c6af5	llama : add phi3 support (#6852 ) * add explicit phi3 support * add explicit phi3 support * remove unused code * convert : add BOS token * llama : match EOT token <\|end\|> * llama : minor / style * llama : tabs -> spaces * convert : fix lint checks --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-24 10:00:37 +03:00
Pedro Cuenca	b97bc3966e	llama : support Llama 3 HF conversion (#6745 ) * Support Llama 3 conversion The tokenizer is BPE. * style * Accept suggestion Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> * llama : add llama_token_is_eog() ggml-ci * llama : auto-detect more EOT tokens when missing in KV data * convert : replacing EOS token is a hack * llama : fix codegemma EOT token + add TODOs * llama : fix model type string for 8B model --------- Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-21 14:50:41 +03:00
nopperl	9958c81b79	Implement the OLMo architecture (#6741 ) * implement olmo architecture * remove unused variable * remove unused moe branch * remove check for weight * remove superfluous moe, bias and rope tensors * clarified comment * fix clamp_kqv setting * remove obsolete parameter name filter	2024-04-19 11:35:54 +02:00
Zheng.Deng	facb8b56f8	convert : fix autoawq gemma (#6704 ) * fix autoawq quantized gemma model convert error using autoawq to quantize gemma model will include a lm_head.weight tensor in model-00001-of-00002.safetensors. it result in this situation that convert-hf-to-gguf.py can't map lm_head.weight. skip loading this tensor could prevent this error. * change code to full string match and print necessary message change code to full string match and print a short message to inform users that lm_head.weight has been skipped. --------- Co-authored-by: Zheng.Deng <32841220+CUGfred@users.noreply.github.com>	2024-04-16 23:51:07 +03:00
Ashish	dbceec87c0	llama : add StableLM2 12B (#6635 ) * StableLM2 12B support for huggingface -> GGUF * StableLM12 tensormapping and constants * StableLM-2-12b model support * fix * Added 12B support * Removed autoformatting; resolved bug where model_arch was not selecting StableLM2 * Formatting * Do QK norm stacking in model conversion step * Converge StableLM and StableLM2 code to simplify graph construction * Fix accidental removal * Removed warnings * Revert formatter * Move QK norm stack to private function so it's easier to read * refactor stablelm graph builder to support 1.6, 3b and 12b more efficiently * Proper check for None type for new_name to avoid crash; formatting; revert change to base class `write_tensors()` * Format * Formatting * format Co-authored-by: compilade <git@compilade.net> * Fix incorrect check for K norm * space after commas; Keep indentation multiple of 4 spaces * Flake8 format * Removed unnecessary conditional branches * Removed unused comment * Fixed incorrect tensor passing * Format --------- Co-authored-by: compilade <git@compilade.net>	2024-04-16 18:48:35 +03:00
Shijie	f4dea7da18	llama : add qwen2moe (#6074 ) * support qwen2moe * fix-review * metal : support unary ops for nelements % 4 != 0 * metal : require contiguousness for float4 unary kernels * metal : require contiguousness for float4 unary kernels (cont) * fix-review * names : for brevity "SHARED_EXP" -> "SHEXP" * llama : reuse build_moe_ffn() * llama : add model type name --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-16 18:40:48 +03:00

1 2 3

110 Commits