Because web browsers send a lot of garbage requests, we want the server
to multithread so that 404s for favicons etc. can be served concurrently. To avoid
blowing up llama with concurrent calls, we just take a mutex whenever it is invoked.
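A minimal sketch of that arrangement, with hypothetical names (Request, Response, and run_llama_inference are illustrative stand-ins, not the server's real HTTP layer):

```cpp
#include <mutex>
#include <string>

// Hypothetical request/response types standing in for the real HTTP layer.
struct Request  { std::string path, body; };
struct Response { int status = 200; std::string body; };

// Stub for the actual llama call; illustrative only.
static std::string run_llama_inference(const std::string & prompt) {
    return "completion for: " + prompt;
}

// One mutex guards the single llama context. Handlers run on many
// threads, but only one of them may invoke llama at a time.
static std::mutex llama_mutex;

static void handle_completion(const Request & req, Response & res) {
    std::lock_guard<std::mutex> lock(llama_mutex); // serialize llama access
    res.body = run_llama_inference(req.body);
}

// Garbage requests (favicon.ico etc.) never touch the mutex, so their
// 404s are served immediately, even while a long generation holds it.
static void handle_not_found(const Request &, Response & res) {
    res.status = 404;
    res.body   = "404 Not Found";
}
```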
* server: add option to output probabilities for completion
* server: fix probability output for incomplete tokens during multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.
-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.
If there is no specific CPU-tuning target for aarch64, then -mcpu
should be omitted completely. I think that makes sense: there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable every possible optimization for the host device, the
LLAMA_NATIVE option already exists for that.
Fixes #495.
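In CMake terms, the guard might look something like this (an illustrative sketch, not the exact patch; CMAKE_CROSSCOMPILING is CMake's built-in flag for "build host != target"):

```cmake
# Illustrative sketch: drop the hard-coded -mcpu=native for aarch64 and
# let the existing LLAMA_NATIVE option opt into host-specific tuning.
# Under cross-compilation (e.g. the Android NDK targeting aarch64 from
# an x86_64 host), clang rejects -mcpu=native and aborts the build.
if (LLAMA_NATIVE AND NOT CMAKE_CROSSCOMPILING)
    add_compile_options(-mcpu=native)   # safe: we are building on the target
endif()
```

Native builds keep their tuning; cross builds simply fall back to baseline aarch64 code generation.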
* convert checks in llama_load_session_file to throw exceptions and handle them
* make llama_load_session_file_internal static
* address feedback to avoid using exceptions (end state sketched below)
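A rough sketch of where these three commits land, with a simplified signature (the real functions also take the llama_context and token buffers): validation failures return false with a logged message rather than throwing, while the public wrapper still catches anything thrown by lower-level I/O, so the C API never leaks exceptions.

```cpp
#include <cstdio>
#include <exception>

// Simplified sketch of the internal loader, made static so it never
// leaks into the C API. Validation failures log and return false
// instead of throwing.
static bool llama_load_session_file_internal(const char * path_session) {
    std::FILE * fp = std::fopen(path_session, "rb");
    if (fp == nullptr) {
        std::fprintf(stderr, "%s : failed to open session file\n", __func__);
        return false;
    }
    // ... read and validate magic/version, token count, KV state ...
    std::fclose(fp);
    return true;
}

// Public entry point: any exception from lower-level helpers is
// caught here and translated into a plain bool for the C API.
bool llama_load_session_file(const char * path_session) {
    try {
        return llama_load_session_file_internal(path_session);
    } catch (const std::exception & err) {
        std::fprintf(stderr, "error loading session file: %s\n", err.what());
        return false;
    }
}
```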
* Added broken new q4k quant
* xx + ib0
* Fix q2_k fast kernel
* Use preprocessor for QK_K
* Add q6_k fast matmul kernel
* ported q3k speedup successfully
* ported q2k and q5k speedups
* remove old dot kernels and template
* fixed global const struct types
* fixing address spaces
* fixed string too long CI issue
---------
Co-authored-by: 0cc4m <picard12@live.de>
* Remove multiple shards
* Remove multiple file loaders
* Remove llama_load_tensor_shard class
* Simplify load logic
* Remove the dead guess_n_parts function
* Remove vocab_only from constructor of llama_model_loader
* Remove alignment_prevents_mmap, which is no longer needed.
* Remove useless check
* add interface for float input
* fixed inpL shape and type
* add examples of input floats
* add test example for embd input
* fixed sampling
* add free for context
* fix generation by adding an end condition
* add examples for llava.py
* add README for llava.py
* add example of PandaGPT
* refactor the interface and fix the style
* add cmake build for embd-input
* Add MiniGPT-4 example
* change the order of the args of llama_eval_internal
* fix CI error