* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (#4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
* metal : optimizing ggml_mul_mat_id (wip)
* metal : minor fix
* metal : opt mul_mm_id
* Add n_key_dim and n_value_dim
Some models use values that are not derived from `n_embd`.
Also remove `n_embd_head` and `n_embd_gqa` because it is not clear
which "head" is referred to (key or value).
Fix issue #4648.
* Fix `llm_build_kqv` to use `n_value_gqa`
* Rebase
* Rename variables
* Fix llm_build_kqv to be more generic wrt n_embd_head_k
* Update default values for n_embd_head_k and n_embd_head_v
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix llm_load_tensors: the asserts were not backcompat
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Changes to server to allow metadata override
* documentation
* flake.nix: expose full scope in legacyPackages
* flake.nix: rocm not yet supported on aarch64, so hide the output
* flake.nix: expose checks
* workflows: nix-ci: init; build flake outputs
* workflows: nix-ci: add a job for eval
* workflows: weekly `nix flake update`
* workflows: nix-flakestry: drop tag filters
...and add a job for flakehub.com
* workflows: nix-ci: add a qemu job for jetsons
* flake.nix: suggest the binary caches
* flake.lock: update
to a commit recently cached by nixpkgs-cuda-ci
---------
Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (#4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
ggml-ci
* feat: add avx_vnni based on intel documents
* ggml: add avx vnni based on intel document
* llama: add avx vnni information display
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* Update ggml.c
Fix indentation upgate
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* python: add check-requirements.sh and GitHub workflow
This script and workflow forces package versions to remain compatible
across all convert*.py scripts, while allowing secondary convert scripts
to import dependencies not wanted in convert.py.
* Move requirements into ./requirements
* Fail on "==" being used for package requirements (but can be suppressed)
* Enforce "compatible release" syntax instead of ==
* Update workflow
* Add upper version bound for transformers and protobuf
* improve check-requirements.sh
* small syntax change
* don't remove venvs if nocleanup is passed
* See if this fixes docker workflow
* Move check-requirements.sh into ./scripts/
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* flake.lock: update to hotfix CUDA::cuda_driver
Required to support https://github.com/ggerganov/llama.cpp/pull/4606
* flake.nix: rewrite
1. Split into separate files per output.
2. Added overlays, so that this flake can be integrated into others.
The names in the overlay are `llama-cpp`, `llama-cpp-opencl`,
`llama-cpp-cuda`, and `llama-cpp-rocm` so that they fit into the
broader set of Nix packages from [nixpkgs](https://github.com/nixos/nixpkgs).
3. Use [callPackage](https://summer.nixos.org/blog/callpackage-a-tool-for-the-lazy/)
rather than `with pkgs;` so that there's dependency injection rather
than dependency lookup.
4. Add a description and meta information for each package.
The description includes a bit about what's trying to accelerate each one.
5. Use specific CUDA packages instead of cudatoolkit on the advice of SomeoneSerge.
6. Format with `serokell/nixfmt` for a consistent style.
7. Update `flake.lock` with the latest goods.
* flake.nix: use finalPackage instead of passing it manually
* nix: unclutter darwin support
* nix: pass most darwin frameworks unconditionally
...for simplicity
* *.nix: nixfmt
nix shell github:piegamesde/nixfmt/rfc101-style --command \
nixfmt flake.nix .devops/nix/*.nix
* flake.nix: add maintainers
* nix: move meta down to follow Nixpkgs style more closely
* nix: add missing meta attributes
nix: clarify the interpretation of meta.maintainers
nix: clarify the meaning of "broken" and "badPlatforms"
nix: passthru: expose the use* flags for inspection
E.g.:
```
❯ nix eval .#cuda.useCuda
true
```
* flake.nix: avoid re-evaluating nixpkgs too many times
* flake.nix: use flake-parts
* nix: migrate to pname+version
* flake.nix: overlay: expose both the namespace and the default attribute
* ci: add the (Nix) flakestry workflow
* nix: cmakeFlags: explicit OFF bools
* nix: cuda: reduce runtime closure
* nix: fewer rebuilds
* nix: respect config.cudaCapabilities
* nix: add the impure driver's location to the DT_RUNPATHs
* nix: clean sources more thoroughly
...this way outPaths change less frequently,
and so there are fewer rebuilds
* nix: explicit mpi support
* nix: explicit jetson support
* flake.nix: darwin: only expose the default
---------
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup deleting a long
standing TODO comment.
This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.
See Mozilla-Ocho/llamafile@1cd334f