Georgi Gerganov
2917e6b528
Merge branch 'master' into gg/imatrix-gpu-4931
...
ggml-ci
2024-01-17 18:43:45 +02:00
Georgi Gerganov
44a1a4a41a
backend : add eval callback ( #4935 )
...
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when the user doesn't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* simple : no need for ggml_is_contiguous + fix bool parse
* llama : fix callback placement in llama_context_params
* backend : avoid double-ask callback calls
* simple : restore examples, imatrix will serve as a demo
2024-01-17 18:39:41 +02:00
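The ask/observe flow described in the eval callback commit above (#4935) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the ggml implementation: `Node`, `EvalCallback` and `compute_with_callback` are hypothetical names, and the actual backend compute call is replaced by recording which index ranges would be computed together.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a graph node; not the real ggml_tensor.
struct Node {
    std::string name;
};

// ask == true : "does the user want to observe this node?" (asked before compute)
// ask == false: "the node has been computed"; returning false stops the evaluation.
using EvalCallback = std::function<bool(const Node &, bool ask)>;

// Walks the nodes in order. Consecutive nodes the user does not need are grouped
// into a single compute, and each node is asked about exactly once.
// Returns the inclusive [first, last] index ranges that would be computed together.
std::vector<std::pair<int, int>> compute_with_callback(const std::vector<Node> & nodes,
                                                       const EvalCallback & cb) {
    assert(static_cast<bool>(cb));
    std::vector<std::pair<int, int>> batches;
    const int n = static_cast<int>(nodes.size());
    for (int j0 = 0; j0 < n; ++j0) {
        int  j1   = j0;
        bool need = cb(nodes[j1], true);
        // Extend the batch until a node the user wants, or the end of the graph.
        while (!need && j1 < n - 1) {
            ++j1;
            need = cb(nodes[j1], true);
        }
        batches.emplace_back(j0, j1); // a real backend would compute nodes j0..j1 here
        // Hand the last node of the batch to the user only if it was requested.
        if (need && !cb(nodes[j1], false)) {
            break;
        }
        j0 = j1;
    }
    return batches;
}
```

With five nodes where only the fourth is requested, this sketch produces two batches, nodes 0..3 and node 4, and the callback is invoked once per node in "ask" mode.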
Georgi Gerganov
c918fe8dca
metal : create autorelease pool during library build ( #4970 )
...
* metal : create autorelease pool during library build
ggml-ci
* test : simplify
ggml-ci
2024-01-17 18:38:39 +02:00
Georgi Gerganov
0f83e727af
py : fix whitespace
2024-01-17 18:37:36 +02:00
Georgi Gerganov
4f4bf35f46
py : fix missing added_tokens_dict for SPM and BPE vocabs ( #4971 )
...
* py : fix missing added_tokens_dict for SPM vocab
* py : pad with unknown tokens when data is missing
ggml-ci
* py : fix BPE vocab conversion
ggml-ci
* py : fix padded dummy tokens (I hope)
2024-01-17 15:45:03 +02:00
Georgi Gerganov
4fb52843bb
ci : rearrange output
...
ggml-ci
2024-01-17 15:27:34 +02:00
Georgi Gerganov
10b25e0388
ci : add imatrix test
...
ggml-ci
2024-01-17 15:10:38 +02:00
Georgi Gerganov
a722d05a87
imatrix : fix ggml_mul_mat_id handling
...
ggml-ci
2024-01-17 14:43:35 +02:00
Kawrakow
2b3a665d39
llama : use Q4_K for attn_v for Q2_K_S when n_gqa >= 4 ( #4996 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 12:36:37 +02:00
Paul Tsochantaris
7563293665
metal : remove unnecessary nil check ( #4986 )
2024-01-17 10:07:24 +02:00
David Renshaw
f46c0c1b0e
llama : fix copy/paste error in llama_sampling_params comment ( #4994 )
2024-01-17 09:17:50 +02:00
Georgi Gerganov
5c99960901
py : remove unnecessary hasattr ( #4903 )
2024-01-16 20:59:31 +02:00
Philip Taron
bee938da74
nix: remove nixConfig from flake.nix ( #4984 )
2024-01-16 09:56:21 -08:00
Daniel Bevenius
cec8a48470
finetune : add training data file to log message ( #4979 )
...
This commit adds the name of the training data file to the log message
printed when the training data is tokenized.
The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16 19:54:24 +02:00
Kawrakow
334a835a1c
ggml : importance matrix support for legacy quants ( #4969 )
...
* imatrix: adding support for legacy quants
* imatrix: guard Q4_0/Q5_0 against ffn_down craziness
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-16 19:51:26 +02:00
Maximilian Winter
4feb4b33ee
examples : add complete parallel function calling example ( #4974 )
2024-01-16 19:41:42 +02:00
Georgi Gerganov
959ef0c0df
perplexity : fix kv cache handling for hellaswag ( #4981 )
...
ggml-ci
2024-01-16 19:34:54 +02:00
Georgi Gerganov
c37b3474e6
flake.lock: update flake-parts, flake-parts/nixpkgs-lib, and nixpkgs ( #4920 )
...
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/34fed993f1674c8d06d58b37ce1e0fe5eebcb9f5' (2023-12-01)
→ 'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/e92039b55bcd58469325ded85d4f58dd5a4eaf58?dir=lib' (2023-11-29)
→ 'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/cfc3698c31b1fb9cdcf10f36c9643460264d0ca8' (2023-12-27)
→ 'github:NixOS/nixpkgs/317484b1ead87b9c1b8ac5261a8d2dd748a0492d' (2024-01-08)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-01-16 09:13:54 -08:00
Paul Tsochantaris
158f8c9e21
metal : localized logic in ggml_metal_graph_compute ( #4924 )
...
* Metal: Localized logic in `ggml_metal_graph_compute`, minor performance improvement
* Whitespace
* Collecting command buffer completions on single thread
* Whitespace
* Reduce diff noise
2024-01-16 19:05:19 +02:00
Neuman Vong
862f5e41ab
android : introduce starter project example ( #4926 )
...
* Introduce starter project for Android
Based on examples/llama.swiftui.
* Add github workflow
* Set NDK version
* Only build arm64-v8a in CI
* Sync bench code
* Rename CI prop to skip-armeabi-v7a
* Remove unused tests
2024-01-16 15:47:34 +02:00
Alex Azarov
3a48d558a6
metal : replace loop of dispatch_async with dispatch_apply ( #4934 )
...
* Replace loop of dispatch_async with dispatch_apply
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-16 15:41:27 +02:00
Alex Azarov
7c8d3abd1a
metal : log recommendedMaxWorkingSetSize on iOS 16+ ( #4936 )
...
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+
* Only log on iOS and macOS, ignoring tvOS and other platforms
* Check for Xcode version before using recommendedMaxWorkingSetSize
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-16 15:33:02 +02:00
Maximilian Winter
122ed4840c
examples : fix and improve docs for the grammar generator ( #4909 )
...
* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator
Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.
* Fixed some issues and bugs of the grammar generator. Improved documentation
* Update pydantic_models_to_grammar.py
2024-01-16 14:10:48 +02:00
Justine Tunney
a0b3ac8c48
ggml : introduce GGML_CALL function annotation ( #4850 )
...
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.
This change does nothing, unless the build defines -DGGML_MULTIPLATFORM
which causes back-references and function pointers to conform to MS ABI
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms
2024-01-16 13:16:33 +02:00
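The GGML_CALL commit above (#4850) describes an annotation that is a no-op unless the build defines GGML_MULTIPLATFORM. A minimal sketch of how such a macro could be defined and applied is shown below. The `demo_*` names are hypothetical and only illustrate a function pointer crossing the boundary between application code and a GPU module; the attribute spelling assumes GCC or Clang.

```cpp
#include <cassert>

// Sketch of the annotation: expands to nothing unless the build opts in
// with -DGGML_MULTIPLATFORM.
#ifdef GGML_MULTIPLATFORM
#    if defined(_WIN32)
#        define GGML_CALL // MS ABI is already the platform default here
#    else
#        define GGML_CALL __attribute__((__ms_abi__))
#    endif
#else
#    define GGML_CALL
#endif

// Hypothetical callback type: a function pointer handed across the
// application / GPU-module boundary.
typedef int (*GGML_CALL demo_op_fn)(int);

// Hypothetical application-side function that a module calls back into.
GGML_CALL int demo_double(int x) {
    return 2 * x;
}

// Hypothetical module entry point that calls back through the pointer.
GGML_CALL int demo_module_run(demo_op_fn op, int x) {
    assert(op != nullptr);
    return op(x);
}
```

In a default build the macro is empty, so `demo_module_run(demo_double, 21)` behaves like an ordinary call; the annotation only changes the calling convention when the multiplatform define is present.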
Daniel Bevenius
d75c232e1d
finetune : use LLAMA_FILE_MAGIC_GGLA ( #4961 )
...
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16 13:14:19 +02:00
stduhpf
e0324285a5
speculative : threading options ( #4959 )
...
* speculative: expose draft threading
* fix usage format
* accept -td and -tbd args
* speculative: revert default behavior when -td is unspecified
* fix trailing whitespace
2024-01-16 13:04:32 +02:00
ngc92
3e5ca7931c
pass cpu-architecture arguments only to host code (C;C++) ( #4943 )
2024-01-15 19:40:48 +01:00
Georgi Gerganov
0b2fca9a9f
imatrix : offload to GPU support
2024-01-15 16:47:40 +02:00
Georgi Gerganov
e0493800ce
simple : fix
2024-01-15 16:43:46 +02:00
Georgi Gerganov
e1b1db9f09
simple : do not perform tensor data copy if not needed
2024-01-15 16:42:16 +02:00
Georgi Gerganov
83f3d7a83c
backend : clean-up the implementation
...
ggml-ci
2024-01-15 16:24:19 +02:00
Georgi Gerganov
01b6f68a00
backend : group nodes in a single compute when the user doesn't need them
2024-01-15 16:24:19 +02:00
Georgi Gerganov
65648b341f
backend : add eval callback
...
ggml-ci
2024-01-15 16:24:19 +02:00
David Friehs
4483396751
llama : apply classifier-free guidance to logits directly ( #4951 )
2024-01-15 15:06:52 +02:00
Victor Z. Peng
d9aa4ffa6e
awq-py : fix typo in awq-py/README.md ( #4947 )
2024-01-15 14:41:46 +02:00
Georgi Gerganov
ddb008d845
cuda : fix dequantize kernel names ( #4938 )
2024-01-15 13:27:00 +02:00
Kawrakow
2faaef3979
llama : check for 256 divisibility for IQ2_XS, IQ2_XXS ( #4950 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-15 10:09:38 +02:00
Kawrakow
4a3156de2f
CUDA: faster dequantize kernels for Q4_0 and Q4_1 ( #4938 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-15 07:48:06 +02:00
David Pflug
a836c8f534
llama : fix missing quotes ( #4937 )
2024-01-14 17:46:00 +02:00
Kawrakow
467a882fd2
Add ability to use importance matrix for all k-quants ( #4930 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 16:21:12 +02:00
Georgi Gerganov
bb0c139247
llama : check LLAMA_TRACE env for extra logging ( #4929 )
...
* llama : minor fix indent
* llama : check LLAMA_TRACE env for extra logging
ggml-ci
2024-01-14 13:26:53 +02:00
Georgi Gerganov
9408cfdad6
scripts : sync-ggml-am.sh option to skip commits
2024-01-14 11:08:41 +02:00
Georgi Gerganov
03c5267490
llama : use LLAMA_LOG_ macros for logging
2024-01-14 11:03:19 +02:00
Kawrakow
a128c38de8
Fix ffn_down quantization mix for MoE models ( #4927 )
...
* Fix ffn_down quantization mix for MoE models
In #4872 I did not consider the part where every third
tensor is quantized with more bits. For MoE this leads to tensors
of the same layer being quantized with a different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).
* Fix the fix
* Review suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 10:53:39 +02:00
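The problem described in the ffn_down commit above (#4927) can be illustrated with a small sketch. Everything here is a simplified assumption: `use_more_bits` is a stand-in for the real "every third tensor" heuristic, and `plan_ffn_down` is a hypothetical helper. It only shows why a per-tensor counter gives the experts of one layer different treatment, while an index derived per layer keeps them uniform.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the heuristic: every third index gets more bits.
bool use_more_bits(int idx) {
    return idx % 3 == 2;
}

// For each layer and each expert, decide whether its ffn_down tensor gets more bits.
// per_tensor_counter == true reproduces the problem: the index advances with every
// expert tensor, so experts within one layer can end up with different choices.
// per_tensor_counter == false derives the index from the layer instead.
std::vector<std::vector<bool>> plan_ffn_down(int n_layer, int n_expert, bool per_tensor_counter) {
    assert(n_layer > 0 && n_expert > 0);
    std::vector<std::vector<bool>> plan(n_layer, std::vector<bool>(n_expert));
    int counter = 0;
    for (int il = 0; il < n_layer; ++il) {
        for (int ie = 0; ie < n_expert; ++ie) {
            const int idx = per_tensor_counter ? counter : counter / n_expert;
            plan[il][ie] = use_more_bits(idx);
            ++counter;
        }
    }
    return plan;
}

// True when all experts of a layer received the same decision.
bool layer_is_uniform(const std::vector<bool> & experts) {
    for (bool b : experts) {
        if (b != experts[0]) {
            return false;
        }
    }
    return true;
}
```

With 8 experts, the per-tensor variant marks experts 2 and 5 of the first layer for more bits and leaves the others alone, which is the mixed situation the commit message says inference does not support. The per-layer variant gives every expert in a layer the same answer.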
Alex Azarov
5f5fe1bd60
metal : correctly set SIMD support flags on iOS ( #4923 )
...
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad
* log a little bit more info on iOS
2024-01-14 10:44:39 +02:00
Karthik Kumar Viswanathan
ac32902a87
llama : support WinXP build with MinGW 8.1.0 ( #3419 )
2024-01-14 10:41:44 +02:00
Kawrakow
147b17ac94
2-bit quantizations ( #4897 )
...
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:45:56 +02:00
Kawrakow
807179ec58
Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B ( #4906 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:44:30 +02:00
Georgi Gerganov
76484fbfd3
sync : ggml
2024-01-14 00:14:46 +02:00
Johannes Gäßler
c71d608ce7
ggml: cache sin/cos for RoPE ( #4908 )
2024-01-13 21:41:37 +01:00
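The idea behind the RoPE commit above (#4908) can be sketched as follows: the cos/sin values depend only on the position and the dimension pair, so they can be computed once per position and reused for every head. This is a plain RoPE illustration under assumed names (`rope_cache`, `rope_apply`), not the ggml code, and it ignores any frequency-scaling extensions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds an interleaved [cos, sin, cos, sin, ...] table for one position.
// Each pair of dimensions (i, i + 1) shares one rotation angle.
std::vector<float> rope_cache(int pos, int n_dims, float freq_base) {
    assert(n_dims > 0 && n_dims % 2 == 0);
    std::vector<float> cache(n_dims);
    const float theta_scale = std::pow(freq_base, -2.0f / n_dims);
    float theta = static_cast<float>(pos);
    for (int i = 0; i < n_dims; i += 2) {
        cache[i]     = std::cos(theta);
        cache[i + 1] = std::sin(theta);
        theta *= theta_scale; // next pair rotates at a lower frequency
    }
    return cache;
}

// Rotates one head in place using the cached values, so no trigonometric
// functions are evaluated per head.
void rope_apply(std::vector<float> & x, const std::vector<float> & cache) {
    assert(x.size() == cache.size());
    for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
        const float c  = cache[i];
        const float s  = cache[i + 1];
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```

A caller would build the table once for a given position and then pass the same table to `rope_apply` for each head at that position.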