slaren
280345968d
cuda : rename build flag to LLAMA_CUDA ( #6299 )
2024-03-26 01:16:01 +01:00
Christian Kögler
b06c16ef9f
nix: fix blas support ( #6281 )
...
Since no blas was provided to buildInputs, the executable is built without blas support.
This is a backport of NixOS/nixpkgs#298567
2024-03-25 10:52:45 -07:00
Kawrakow
1f2fd4e727
tests : include IQ2_XXS and IQ2_XS in test-quantize-fns ( #6303 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b2531
2024-03-25 19:33:15 +02:00
Georgi Gerganov
43139cc528
flake.lock: Update ( #6266 )
...
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
→ 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-25 08:22:27 -07:00
slaren
2f34b865b6
cuda : fix LLAMA_CUDA_F16 build ( #6298 )
b2529
2024-03-25 16:43:22 +02:00
slaren
ae1f211ce2
cuda : refactor into multiple files ( #6269 )
b2528
2024-03-25 13:50:23 +01:00
Xuan Son Nguyen
ad3a0505e3
Server: clean up OAI params parsing function ( #6284 )
...
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
b2527
2024-03-25 09:42:17 +01:00
Neo Zhang Jianyu
95ad616cdd
[SYCL] fix SYCL backend build on windows is break by LOG() error ( #6290 )
...
* fix LOG() error for SYCL, enhance erro check by CI
* rollback to bash
* add newline at end of file
b2526
2024-03-25 15:52:41 +08:00
Minsoo Cheong
64e7b47c69
examples : add "retrieval" ( #6193 )
...
* add `retrieval` example
* add README
* minor fixes
* cast filepos on print
* remove use of variable sized array
* store similarities in separate vector
* print error on insufficient batch size
* fix error message printing
* assign n_batch value to n_ubatch
* fix param definitions
* define retrieval-only parameters in retrieval.cpp
* fix `--context-file` option to be provided multiple times for multiple files
* use vector for `query_emb`
* add usage description in README
* fix merge conflict
* fix usage printing
* remove seed setting
* fix lint
* increase file read buffer size
* retrieval : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-25 09:38:22 +02:00
Justine Tunney
7733f0c760
ggml : support AVX512VNNI ( #6280 )
...
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some
architectures (e.g. AMD Zen 4).
2024-03-25 07:39:56 +02:00
Rick G
a32b77c4b2
Fix heap corruption from wmode out-of-bound writes on windows ( #6272 )
...
* would throw error on VS2022 on GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc wants bytes
* therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* Fixes error possibly introduced by https://github.com/ggerganov/llama.cpp/pull/6248
b2523
2024-03-24 22:45:56 +01:00
Georgi Gerganov
a0e584defd
imatrix : fix wname for mul_mat_id ops ( #6271 )
...
* imatrix : fix wname for mul_mat_id ops
* also filter tensor names in mul_mat_id ops
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-24 16:18:45 +02:00
Johannes Gäßler
7aed0ffe68
Fixed lookup compilation issues on Windows ( #6273 )
b2521
2024-03-24 14:21:17 +01:00
Pierrick Hymbert
ea279d5609
ci : close inactive issue, increase operations per run ( #6270 )
b2520
2024-03-24 10:57:06 +02:00
Minsoo Cheong
586e7bc561
sampling : deduplicated code for probability distribution access ( #6240 )
...
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
2024-03-24 10:54:07 +02:00
Meng, Hengyu
ddf6568510
[SYCL] offload op ( #6217 )
...
* remove no USM methods
* leave the schedule to ggml_backend_sched entirely
b2518
2024-03-24 12:04:25 +08:00
Neo Zhang Jianyu
d03224ac98
Support build win release for SYCL ( #6241 )
...
* support release win
* fix value
* fix value
* fix value
* fix error
* fix error
* fix format
b2517
2024-03-24 09:44:01 +08:00
Jared Van Bortel
94d1b3b411
use _wfopen instead of fopen on Windows ( #6248 )
...
also fix missing #defines before windows.h, and BPE LF token on MSVC
b2516
2024-03-23 18:48:02 -04:00
Georgi Gerganov
95562175f8
gitignore : gguf-split
2024-03-23 21:35:23 +02:00
Pierrick Hymbert
f482bb2e49
common: llama_load_model_from_url split support ( #6192 )
...
* llama: llama_split_prefix fix strncpy does not include string termination
common: llama_load_model_from_url:
- fix header name case sensitive
- support downloading additional split in parallel
- hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url max length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2514
2024-03-23 18:07:00 +01:00
Pierrick Hymbert
1997577d5e
server: docs: --threads
and --threads
, --ubatch-size
, --log-disable
( #6254 )
2024-03-23 18:00:38 +01:00
Julius Arkenberg
476b0251b2
llama : add grok-1 support ( #6204 )
...
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-23 18:41:53 +02:00
Pierrick Hymbert
21cad01b6e
split: add gguf-split in the make build target ( #6262 )
2024-03-23 17:18:13 +01:00
Pierrick Hymbert
1b26aebe4d
server: flush stdout after logging in both text and json layout ( #6253 )
b2510
2024-03-23 13:18:45 +01:00
Johannes Gäßler
50ccaf5eac
lookup: complement data from context with general text statistics ( #5479 )
...
* lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
b2509
2024-03-23 01:24:36 +01:00
Georgi Gerganov
56a00f0a2f
common : default --hf-file to --model ( #6234 )
b2508
2024-03-22 21:10:39 +02:00
fraxy-v
92397d87a4
convert-llama2c-to-ggml : enable conversion of GQA models ( #6237 )
...
* convert-llama2c-to-ggml: enable conversion of multiqueries, #5608
* add test in build action
* Update build.yml
* Update build.yml
* Update build.yml
* gg patch
2024-03-22 20:49:06 +02:00
Kawrakow
1d0331c12a
quantize: options for output and token embedding tensors qtype ( #6239 )
...
* quantize: be able to specify the output tensor type
* quantize: be able to specify the token embedding tensor type
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-22 20:47:14 +02:00
Pierrick Hymbert
dba1af6129
llama_model_loader: support multiple split/shard GGUFs ( #6187 )
...
* split: support in llama_model_loader
* avoid copying the entire vector
Co-authored-by: slaren <slarengh@gmail.com>
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedbacks:
- use only one gguf_context for metadata only
- store all ggml_context in a vector as the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
* avoid copying the entire vector
* Simplify this by making these optional, switch some layer creation tensor optional
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Handle optional tensors
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer if the allocation succeeds only
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast
* llama_model_loader: fail if any of backend buffer cannot be allocated
* spacing
Co-authored-by: slaren <slarengh@gmail.com>
* fix loop over pointer
Co-authored-by: slaren <slarengh@gmail.com>
* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting
* llama_model_loader: ensure mappings vector has the expected size
* llama_model_loader: use at instead of operator[] if this should never add to the map.
* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer
* llama_model_loader: fix map -> unordered map
* llama_split_prefix: use a clearer version, not pass split path len but dest max len.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* llama : minor
ggml-ci
* llama : introduce some typedef helpers
* docs: add model shard in hot topic
* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated
Co-authored-by: slaren <slarengh@gmail.com>
* fix llama_split_prefix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-22 19:00:01 +01:00
Minsoo Cheong
ee804f6223
ci: apply concurrency limit for github workflows ( #6243 )
2024-03-22 19:15:06 +02:00
Georgi Gerganov
80bd33bc2c
common : add HF arg helpers ( #6234 )
...
* common : add HF arg helpers
* common : remove defaults
b2503
2024-03-22 15:33:38 +02:00
Nexesenex
e80f06d2a1
llama : correction of the attn.v.weight quantization for IQ3_XS ( #6209 )
...
IQ3_XS was not mentioned, IQ3_S and IQ3_M were present twice.
That PR corrects this in the manner which was probably intended initially.
b2502
2024-03-22 15:32:02 +02:00
Olivier Chafik
f77a8ffd3b
tests : conditional python & node json schema tests ( #6207 )
...
* json: only attempt python & node schema conversion tests if their bins are present
Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978
disabled in https://github.com/ggerganov/llama.cpp/pull/6198
* json: orange warnings when tests skipped
* json: ensure py/js schema conv tested on ubuntu-focal-make
* json: print env vars in test
b2501
2024-03-22 15:09:07 +02:00
Olivier Chafik
72114edf06
json-schema-to-grammar : fix order of props + non-str const/enum ( #6232 )
...
* json: ordered json in server/schema converter to respect orig order
* json: ws nits
* json: support non-string const / enums
2024-03-22 15:07:44 +02:00
slaren
2f0e81e053
cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy ( #6208 )
...
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy
* add LLAMA_CUDA_NO_PEER_COPY to HIP build
b2499
2024-03-22 14:05:31 +01:00
Xiaoyi Chen
29ab270e65
readme : add RecurseChat to the list of UIs ( #6219 )
2024-03-22 13:29:49 +02:00
Jan Boon
6b8bb3a31d
server : fix n_keep always showing as 0 in response ( #6211 )
b2497
2024-03-22 13:12:05 +02:00
Georgi Gerganov
68e210b354
server : enable continuous batching by default ( #6231 )
b2496
2024-03-22 13:08:28 +02:00
Georgi Gerganov
b3e94f26ba
metal : proper assert for mat-mat memory alignment ( #6225 )
...
* metal : proper assert for mat-mat memory alignment
ggml-ci
* readme : add notice about the bug fix
* metal : fix the fix
ggml-ci
b2495
2024-03-22 11:35:53 +02:00
Vaibhav Srivastav
b2075fd6a5
ci : add CURL flag for the mac builds ( #6214 )
b2494
2024-03-22 09:53:43 +02:00
Georgi Gerganov
95d576b48e
metal : pad n_ctx by 32 ( #6177 )
...
* metal : require ne00 >= 128 for mat-mat kernels
ggml-ci
* llama : pad n_ctx by 32
ggml-ci
b2493
2024-03-22 09:36:03 +02:00
Neo Zhang Jianyu
59c17f02de
add blog link ( #6222 )
2024-03-22 15:19:37 +08:00
DAN™
fa046eafbc
Fix params underscore convert to dash. ( #6203 )
...
* Fix params underscore convert to dash.
* Update common/common.cpp
---------
Co-authored-by: slaren <slarengh@gmail.com>
b2491
2024-03-22 02:32:42 +01:00
Jan Boon
be07a03217
server : update readme doc from slot_id
to id_slot
( #6213 )
2024-03-21 23:41:24 +01:00
slaren
d0a71233fb
cuda : disable host register by default ( #6206 )
b2489
2024-03-21 20:54:28 +02:00
semidark
f372c49ccd
Corrected typo to wrong file ( #6199 )
...
The stated file `./devops/main-server.Dockerfile` does not exist. I figure that `.devops/server-intel.Dockerfile` was meant.
2024-03-21 18:52:35 +01:00
Georgi Gerganov
924ce1dce7
tests : disable system() calls ( #6198 )
...
ggml-ci
b2487
2024-03-21 16:20:05 +02:00
slaren
03a8f8fafe
cuda : fix LLAMA_CUDA_F16 build ( #6197 )
2024-03-21 14:59:53 +02:00
Kawrakow
cfd3be76e3
ggml : same IQ4_NL quantization for CPU/CUDA/Metal ( #6196 )
...
* Make quantize_row_iq4_nl do the same thing is quantization on CUDA
* Make quantize_row_iq4_nl do the same thing is quantization on CUDA
This time for real. backend-ops tests pass.
* Now fix test-quantize-fns
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-21 14:59:38 +02:00
Olivier Chafik
5b7b0ac8df
json-schema-to-grammar improvements (+ added to server) ( #5978 )
...
* json: fix arrays (disallow `[,1]`)
* json: support tuple types (`[number, string]`)
* json: support additionalProperties (`{[k: string]: [string,number][]}`)
* json: support required / optional properties
* json: add support for pattern
* json: resolve $ref (and support https schema urls)
* json: fix $ref resolution
* join: support union types (mostly for nullable types I think)
* json: support allOf + nested anyOf
* json: support any (`{}` or `{type: object}`)
* json: fix merge
* json: temp fix for escapes
* json: spaces in output and unrestricted output spaces
* json: add typings
* json:fix typo
* Create ts-type-to-grammar.sh
* json: fix _format_literal (json.dumps already escapes quotes)
* json: merge lit sequences and handle negatives
{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}
* json: handle pattern repetitions
* Update json-schema-to-grammar.mjs
* Create regex-to-grammar.py
* json: extract repeated regexp patterns to subrule
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* json: handle schema from pydantic Optional fields
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update ts-type-to-grammar.sh
* Update ts-type-to-grammar.sh
* json: simplify nullable fields handling
* json: accept duplicate identical rules
* json: revert space to 1 at most
* json: reuse regexp pattern subrules
* json: handle uuid string format
* json: fix literal escapes
* json: add --allow-fetch
* json: simplify range escapes
* json: support negative ranges in patterns
* Delete commit.txt
* json: custom regex parser, adds dot support & JS-portable
* json: rm trailing spaces
* Update json-schema-to-grammar.mjs
* json: updated server & chat `( cd examples/server && ./deps.sh )`
* json: port fixes from mjs to python
* Update ts-type-to-grammar.sh
* json: support prefixItems alongside array items
* json: add date format + fix uuid
* json: add date, time, date-time formats
* json: preserve order of props from TS defs
* json: port schema converter to C++, wire in ./server
* json: nits
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* json: fix mjs implementation + align outputs
* Update json-schema-to-grammar.mjs.hpp
* json: test C++, JS & Python versions
* json: nits + regen deps
* json: cleanup test
* json: revert from c++17 to 11
* json: nit fixes
* json: dirty include for test
* json: fix zig build
* json: pass static command to std::system in tests (fixed temp files)
* json: fix top-level $refs
* json: don't use c++20 designated initializers
* nit
* json: basic support for reserved names `{number:{number:{root:number}}}`
* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)
* json: re-ran server deps.sh
* json: simplify test
* json: support mix of additional props & required/optional
* json: add tests for some expected failures
* json: fix type=const in c++, add failure expectations for non-str const&enum
* json: test (& simplify output of) empty schema
* json: check parsing in test + fix value & string refs
* json: add server tests for OAI JSON response_format
* json: test/fix top-level anyOf
* json: improve grammar parsing failures
* json: test/fix additional props corner cases
* json: fix string patterns (was missing quotes)
* json: ws nit
* json: fix json handling in server when there's no response_format
* json: catch schema conversion errors in server
* json: don't complain about unknown format type in server if unset
* json: cleaner build of test
* json: create examples/json-schema-pydantic-example.py
* json: fix date pattern
* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common
* json: indent 4 spaces
* json: fix naming of top-level c++ function (+ drop unused one)
* json: avoid using namespace std
* json: fix zig build
* Update server.feature
* json: iostream -> fprintf
* json: space before & refs for consistency
* json: nits
2024-03-21 11:50:43 +00:00