2024-07-10 13:40:53 +02:00
# if defined(_MSC_VER)
# define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING
# endif
2023-03-25 19:26:40 +01:00
# include "common.h"
2024-05-08 21:53:08 +02:00
// Change JSON_ASSERT from assert() to GGML_ASSERT:
# define JSON_ASSERT GGML_ASSERT
2024-04-15 19:35:21 +02:00
# include "json.hpp"
# include "json-schema-to-grammar.h"
2023-08-28 17:59:39 +02:00
# include "llama.h"
2023-03-24 16:19:05 +01:00
2023-08-28 17:59:39 +02:00
# include <algorithm>
2024-06-04 20:23:39 +02:00
# include <cinttypes>
2023-08-28 17:59:39 +02:00
# include <cmath>
2024-06-04 20:23:39 +02:00
# include <codecvt>
# include <cstdarg>
2023-03-11 00:04:06 +01:00
# include <cstring>
2023-08-28 17:59:39 +02:00
# include <ctime>
2023-03-10 19:40:58 +01:00
# include <fstream>
2023-08-28 17:59:39 +02:00
# include <iostream>
2024-06-04 20:23:39 +02:00
# include <iterator>
2023-08-28 17:59:39 +02:00
# include <regex>
llama : new sampling algorithms (#1126)
* Sample interface, new samplers.
New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat
Ignore EOS fix: -inf should be used.
* mirostat
* Added --logit-bias and --no-penalize-nl, removed std::span
* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
* Save and load example adjust
* Tests
* Windows build fix
* Windows test fix
2023-04-29 07:34:41 +02:00
# include <sstream>
2023-08-28 17:59:39 +02:00
# include <string>
2023-11-23 18:07:56 +01:00
# include <unordered_map>
2023-05-15 04:25:42 +02:00
# include <unordered_set>
2023-08-28 17:59:39 +02:00
# include <vector>
2023-04-30 20:41:35 +02:00
# if defined(__APPLE__) && defined(__MACH__)
# include <sys/types.h>
# include <sys/sysctl.h>
# endif
2023-03-10 19:40:58 +01:00
2023-05-09 04:45:48 +02:00
# if defined(_WIN32)
# define WIN32_LEAN_AND_MEAN
2023-09-01 15:34:50 +02:00
# ifndef NOMINMAX
# define NOMINMAX
# endif
2023-08-28 17:59:39 +02:00
# include <locale>
2023-05-09 04:45:48 +02:00
# include <windows.h>
2023-04-08 17:49:39 +02:00
# include <fcntl.h>
# include <io.h>
2023-05-09 04:45:48 +02:00
# else
# include <sys/ioctl.h>
2023-08-28 17:59:39 +02:00
# include <sys/stat.h>
2023-05-09 04:45:48 +02:00
# include <unistd.h>
2023-03-28 16:09:55 +02:00
# endif
2024-03-17 19:12:37 +01:00
# if defined(LLAMA_USE_CURL)
# include <curl/curl.h>
2024-03-23 18:07:00 +01:00
# include <curl/easy.h>
# include <thread>
# include <future>
2024-03-17 19:12:37 +01:00
# endif
2023-03-12 21:15:00 +01:00
2023-06-16 20:23:53 +02:00
# if defined(_MSC_VER)
# pragma warning(disable: 4244 4267) // possible loss of data
# endif
2024-03-26 01:16:01 +01:00
# if (defined(GGML_USE_CUDA) || defined(GGML_USE_SYCL))
# define GGML_USE_CUDA_SYCL
ggml : add unified SYCL backend for Intel GPUs (#2690)
* first update for migration
* update init_cublas
* add debug functio, commit all help code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add know issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export funciton print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 16:56:23 +01:00
# endif
2024-03-26 01:16:01 +01:00
# if (defined(GGML_USE_CUDA) || defined(GGML_USE_SYCL)) || defined(GGML_USE_VULKAN)
# define GGML_USE_CUDA_SYCL_VULKAN
2024-02-07 07:54:50 +01:00
# endif
2024-03-17 19:12:37 +01:00
# if defined(LLAMA_USE_CURL)
# ifdef __linux__
# include <linux/limits.h>
# elif defined(_WIN32)
# define PATH_MAX MAX_PATH
# else
# include <sys/syslimits.h>
# endif
2024-03-23 18:07:00 +01:00
# define LLAMA_CURL_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
2024-03-17 19:12:37 +01:00
# endif // LLAMA_USE_CURL
2024-04-15 19:35:21 +02:00
using json = nlohmann : : ordered_json ;
2024-08-21 11:04:34 +02:00
//
// Environment variable utils
//
template < typename T >
static typename std : : enable_if < std : : is_same < T , std : : string > : : value , void > : : type
get_env ( std : : string name , T & target ) {
char * value = std : : getenv ( name . c_str ( ) ) ;
target = value ? std : : string ( value ) : target ;
}
template < typename T >
static typename std : : enable_if < ! std : : is_same < T , bool > : : value & & std : : is_integral < T > : : value , void > : : type
get_env ( std : : string name , T & target ) {
char * value = std : : getenv ( name . c_str ( ) ) ;
target = value ? std : : stoi ( value ) : target ;
}
template < typename T >
static typename std : : enable_if < std : : is_floating_point < T > : : value , void > : : type
get_env ( std : : string name , T & target ) {
char * value = std : : getenv ( name . c_str ( ) ) ;
target = value ? std : : stof ( value ) : target ;
}
template < typename T >
static typename std : : enable_if < std : : is_same < T , bool > : : value , void > : : type
get_env ( std : : string name , T & target ) {
char * value = std : : getenv ( name . c_str ( ) ) ;
if ( value ) {
std : : string val ( value ) ;
target = val = = " 1 " | | val = = " true " ;
}
}
2024-05-22 19:04:20 +02:00
//
// CPU utils
//
int32_t cpu_get_num_physical_cores ( ) {
2023-03-17 18:47:35 +01:00
# ifdef __linux__
2023-05-15 04:25:42 +02:00
// enumerate the set of thread siblings, num entries is num cores
std : : unordered_set < std : : string > siblings ;
for ( uint32_t cpu = 0 ; cpu < UINT32_MAX ; + + cpu ) {
2024-05-04 15:26:53 +02:00
std : : ifstream thread_siblings ( " /sys/devices/system/cpu/cpu "
2023-05-15 04:25:42 +02:00
+ std : : to_string ( cpu ) + " /topology/thread_siblings " ) ;
if ( ! thread_siblings . is_open ( ) ) {
break ; // no more cpus
2023-04-30 20:41:35 +02:00
}
2023-05-15 04:25:42 +02:00
std : : string line ;
if ( std : : getline ( thread_siblings , line ) ) {
siblings . insert ( line ) ;
}
}
2023-09-07 19:22:29 +02:00
if ( ! siblings . empty ( ) ) {
2023-05-15 04:25:42 +02:00
return static_cast < int32_t > ( siblings . size ( ) ) ;
2023-04-30 20:41:35 +02:00
}
# elif defined(__APPLE__) && defined(__MACH__)
int32_t num_physical_cores ;
size_t len = sizeof ( num_physical_cores ) ;
int result = sysctlbyname ( " hw.perflevel0.physicalcpu " , & num_physical_cores , & len , NULL , 0 ) ;
if ( result = = 0 ) {
return num_physical_cores ;
}
result = sysctlbyname ( " hw.physicalcpu " , & num_physical_cores , & len , NULL , 0 ) ;
if ( result = = 0 ) {
return num_physical_cores ;
2023-03-17 18:47:35 +01:00
}
2024-08-16 08:23:12 +02:00
# elif defined(_WIN32) && (_WIN32_WINNT >= 0x0601) && !defined(__MINGW64__) // windows 7 and later
// TODO: windows + arm64 + mingw64
unsigned int n_threads_win = std : : thread : : hardware_concurrency ( ) ;
unsigned int default_threads = n_threads_win > 0 ? ( n_threads_win < = 4 ? n_threads_win : n_threads_win / 2 ) : 4 ;
DWORD buffer_size = 0 ;
if ( ! GetLogicalProcessorInformationEx ( RelationProcessorCore , nullptr , & buffer_size ) ) {
if ( GetLastError ( ) ! = ERROR_INSUFFICIENT_BUFFER ) {
return default_threads ;
}
}
std : : vector < char > buffer ( buffer_size ) ;
if ( ! GetLogicalProcessorInformationEx ( RelationProcessorCore , reinterpret_cast < PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX > ( buffer . data ( ) ) , & buffer_size ) ) {
return default_threads ;
}
int32_t num_physical_cores = 0 ;
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = reinterpret_cast < PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX > ( buffer . data ( ) ) ;
while ( buffer_size > 0 ) {
if ( info - > Relationship = = RelationProcessorCore ) {
num_physical_cores + = info - > Processor . GroupCount ;
}
buffer_size - = info - > Size ;
info = reinterpret_cast < PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX > ( reinterpret_cast < char * > ( info ) + info - > Size ) ;
}
return num_physical_cores > 0 ? num_physical_cores : default_threads ;
2023-04-30 20:41:35 +02:00
# endif
unsigned int n_threads = std : : thread : : hardware_concurrency ( ) ;
return n_threads > 0 ? ( n_threads < = 4 ? n_threads : n_threads / 2 ) : 4 ;
}
2023-03-17 18:47:35 +01:00
2024-04-20 12:27:12 +02:00
# if defined(__x86_64__) && defined(__linux__) && !defined(__ANDROID__)
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 20:55:30 +02:00
# include <pthread.h>
static void cpuid ( unsigned leaf , unsigned subleaf ,
unsigned * eax , unsigned * ebx , unsigned * ecx , unsigned * edx ) {
__asm__ ( " movq \t %%rbx,%%rsi \n \t "
" cpuid \n \t "
" xchgq \t %%rbx,%%rsi "
: " =a " ( * eax ) , " =S " ( * ebx ) , " =c " ( * ecx ) , " =d " ( * edx )
: " 0 " ( leaf ) , " 2 " ( subleaf ) ) ;
}
static int pin_cpu ( int cpu ) {
cpu_set_t mask ;
CPU_ZERO ( & mask ) ;
CPU_SET ( cpu , & mask ) ;
return pthread_setaffinity_np ( pthread_self ( ) , sizeof ( mask ) , & mask ) ;
}
static bool is_hybrid_cpu ( void ) {
unsigned eax , ebx , ecx , edx ;
cpuid ( 7 , 0 , & eax , & ebx , & ecx , & edx ) ;
return ! ! ( edx & ( 1u < < 15 ) ) ;
}
static bool is_running_on_efficiency_core ( void ) {
unsigned eax , ebx , ecx , edx ;
cpuid ( 0x1a , 0 , & eax , & ebx , & ecx , & edx ) ;
int intel_atom = 0x20 ;
int core_type = ( eax & 0xff000000u ) > > 24 ;
return core_type = = intel_atom ;
}
2024-05-22 19:04:20 +02:00
static int cpu_count_math_cpus ( int n_cpu ) {
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 20:55:30 +02:00
int result = 0 ;
2024-05-22 19:04:20 +02:00
for ( int cpu = 0 ; cpu < n_cpu ; + + cpu ) {
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 20:55:30 +02:00
if ( pin_cpu ( cpu ) ) {
return - 1 ;
}
if ( is_running_on_efficiency_core ( ) ) {
continue ; // efficiency cores harm lockstep threading
}
+ + cpu ; // hyperthreading isn't useful for linear algebra
+ + result ;
}
return result ;
}
# endif // __x86_64__ && __linux__
/**
* Returns number of CPUs on system that are useful for math .
*/
2024-05-22 19:04:20 +02:00
int32_t cpu_get_num_math ( ) {
2024-04-20 12:27:12 +02:00
# if defined(__x86_64__) && defined(__linux__) && !defined(__ANDROID__)
2024-05-22 19:04:20 +02:00
int n_cpu = sysconf ( _SC_NPROCESSORS_ONLN ) ;
if ( n_cpu < 1 ) {
return cpu_get_num_physical_cores ( ) ;
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 20:55:30 +02:00
}
if ( is_hybrid_cpu ( ) ) {
cpu_set_t affinity ;
if ( ! pthread_getaffinity_np ( pthread_self ( ) , sizeof ( affinity ) , & affinity ) ) {
2024-05-22 19:04:20 +02:00
int result = cpu_count_math_cpus ( n_cpu ) ;
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 20:55:30 +02:00
pthread_setaffinity_np ( pthread_self ( ) , sizeof ( affinity ) , & affinity ) ;
if ( result > 0 ) {
return result ;
}
}
}
# endif
2024-05-22 19:04:20 +02:00
return cpu_get_num_physical_cores ( ) ;
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 20:55:30 +02:00
}
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
// Helper for setting process priority
# if defined(_WIN32)
bool set_process_priority ( enum ggml_sched_priority prio ) {
if ( prio = = GGML_SCHED_PRIO_NORMAL ) {
return true ;
}
DWORD p = NORMAL_PRIORITY_CLASS ;
switch ( prio ) {
case GGML_SCHED_PRIO_NORMAL : p = NORMAL_PRIORITY_CLASS ; break ;
case GGML_SCHED_PRIO_MEDIUM : p = ABOVE_NORMAL_PRIORITY_CLASS ; break ;
case GGML_SCHED_PRIO_HIGH : p = HIGH_PRIORITY_CLASS ; break ;
case GGML_SCHED_PRIO_REALTIME : p = REALTIME_PRIORITY_CLASS ; break ;
}
if ( ! SetPriorityClass ( GetCurrentProcess ( ) , p ) ) {
fprintf ( stderr , " warn: failed to set process priority class %d : (%d) \n " , prio , ( int ) GetLastError ( ) ) ;
return false ;
}
return true ;
}
# else // MacOS and POSIX
# include <sys/types.h>
# include <sys/resource.h>
bool set_process_priority ( enum ggml_sched_priority prio ) {
if ( prio = = GGML_SCHED_PRIO_NORMAL ) {
return true ;
}
int p = 0 ;
switch ( prio ) {
case GGML_SCHED_PRIO_NORMAL : p = 0 ; break ;
case GGML_SCHED_PRIO_MEDIUM : p = - 5 ; break ;
case GGML_SCHED_PRIO_HIGH : p = - 10 ; break ;
case GGML_SCHED_PRIO_REALTIME : p = - 20 ; break ;
}
if ( ! setpriority ( PRIO_PROCESS , 0 , p ) ) {
fprintf ( stderr , " warn: failed to set process priority %d : %s (%d) \n " , prio , strerror ( errno ) , errno ) ;
return false ;
}
return true ;
}
# endif
2024-05-22 19:04:20 +02:00
//
// CLI argument parsing
//
2023-05-03 03:46:20 +02:00
2024-05-22 19:04:20 +02:00
void gpt_params_handle_model_default ( gpt_params & params ) {
if ( ! params . hf_repo . empty ( ) ) {
// short-hand to avoid specifying --hf-file -> default it to --model
if ( params . hf_file . empty ( ) ) {
if ( params . model . empty ( ) ) {
throw std : : invalid_argument ( " error: --hf-repo requires either --hf-file or --model \n " ) ;
2023-05-03 03:46:20 +02:00
}
2024-05-22 19:04:20 +02:00
params . hf_file = params . model ;
} else if ( params . model . empty ( ) ) {
2024-06-08 21:21:08 +02:00
params . model = fs_get_cache_file ( string_split ( params . hf_file , ' / ' ) . back ( ) ) ;
2024-05-22 19:04:20 +02:00
}
} else if ( ! params . model_url . empty ( ) ) {
if ( params . model . empty ( ) ) {
auto f = string_split ( params . model_url , ' # ' ) . front ( ) ;
f = string_split ( f , ' ? ' ) . front ( ) ;
2024-06-08 21:21:08 +02:00
params . model = fs_get_cache_file ( string_split ( f , ' / ' ) . back ( ) ) ;
2023-05-03 03:46:20 +02:00
}
2024-05-22 19:04:20 +02:00
} else if ( params . model . empty ( ) ) {
params . model = DEFAULT_MODEL_PATH ;
2023-05-03 03:46:20 +02:00
}
2024-05-22 19:04:20 +02:00
}
2023-05-03 03:46:20 +02:00
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
void postprocess_cpu_params ( cpu_params & cpuparams , const cpu_params * role_model ) {
int32_t n_set = 0 ;
if ( cpuparams . n_threads < 0 ) {
// Assuming everything about cpuparams is invalid
if ( role_model ! = nullptr ) {
cpuparams = * role_model ;
} else {
cpuparams . n_threads = cpu_get_num_math ( ) ;
}
}
for ( int32_t i = 0 ; i < GGML_MAX_N_THREADS ; i + + ) {
if ( cpuparams . cpumask [ i ] ) {
n_set + + ;
}
}
if ( n_set & & n_set < cpuparams . n_threads ) {
// Not enough set bits, may experience performance issues.
fprintf ( stderr , " warn: Not enough set bits in CPU mask (%d) to satisfy requested thread count: %d \n " , n_set , cpuparams . n_threads ) ;
}
}
2024-05-22 19:04:20 +02:00
bool gpt_params_parse_ex ( int argc , char * * argv , gpt_params & params ) {
bool invalid_param = false ;
std : : string arg ;
const std : : string arg_prefix = " -- " ;
llama_sampling_params & sparams = params . sparams ;
for ( int i = 1 ; i < argc ; i + + ) {
arg = argv [ i ] ;
if ( arg . compare ( 0 , arg_prefix . size ( ) , arg_prefix ) = = 0 ) {
std : : replace ( arg . begin ( ) , arg . end ( ) , ' _ ' , ' - ' ) ;
}
if ( ! gpt_params_find_arg ( argc , argv , arg , params , i , invalid_param ) ) {
throw std : : invalid_argument ( " error: unknown argument: " + arg ) ;
}
if ( invalid_param ) {
throw std : : invalid_argument ( " error: invalid parameter for argument: " + arg ) ;
}
}
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
postprocess_cpu_params ( params . cpuparams , nullptr ) ;
postprocess_cpu_params ( params . cpuparams_batch , & params . cpuparams ) ;
postprocess_cpu_params ( params . draft_cpuparams , & params . cpuparams ) ;
postprocess_cpu_params ( params . draft_cpuparams_batch , & params . cpuparams_batch ) ;
2024-06-04 20:23:39 +02:00
if ( params . prompt_cache_all & & ( params . interactive | | params . interactive_first ) ) {
2024-05-22 19:04:20 +02:00
throw std : : invalid_argument ( " error: --prompt-cache-all not supported in interactive mode yet \n " ) ;
}
gpt_params_handle_model_default ( params ) ;
2024-08-21 11:04:34 +02:00
if ( params . hf_token . empty ( ) ) {
get_env ( " HF_TOKEN " , params . hf_token ) ;
}
2024-07-06 22:32:04 +02:00
2024-05-22 19:04:20 +02:00
if ( params . escape ) {
string_process_escapes ( params . prompt ) ;
string_process_escapes ( params . input_prefix ) ;
string_process_escapes ( params . input_suffix ) ;
string_process_escapes ( sparams . cfg_negative_prompt ) ;
for ( auto & antiprompt : params . antiprompt ) {
string_process_escapes ( antiprompt ) ;
}
}
if ( ! params . kv_overrides . empty ( ) ) {
params . kv_overrides . emplace_back ( ) ;
params . kv_overrides . back ( ) . key [ 0 ] = 0 ;
}
return true ;
2023-05-03 03:46:20 +02:00
}
2024-08-21 11:04:34 +02:00
void gpt_params_parse_from_env ( gpt_params & params ) {
// we only care about server-related params for now
get_env ( " LLAMA_ARG_MODEL " , params . model ) ;
2024-08-27 11:07:01 +02:00
get_env ( " LLAMA_ARG_MODEL_URL " , params . model_url ) ;
get_env ( " LLAMA_ARG_MODEL_ALIAS " , params . model_alias ) ;
get_env ( " LLAMA_ARG_HF_REPO " , params . hf_repo ) ;
get_env ( " LLAMA_ARG_HF_FILE " , params . hf_file ) ;
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
get_env ( " LLAMA_ARG_THREADS " , params . cpuparams . n_threads ) ;
2024-08-21 11:04:34 +02:00
get_env ( " LLAMA_ARG_CTX_SIZE " , params . n_ctx ) ;
get_env ( " LLAMA_ARG_N_PARALLEL " , params . n_parallel ) ;
get_env ( " LLAMA_ARG_BATCH " , params . n_batch ) ;
get_env ( " LLAMA_ARG_UBATCH " , params . n_ubatch ) ;
get_env ( " LLAMA_ARG_N_GPU_LAYERS " , params . n_gpu_layers ) ;
get_env ( " LLAMA_ARG_THREADS_HTTP " , params . n_threads_http ) ;
get_env ( " LLAMA_ARG_CHAT_TEMPLATE " , params . chat_template ) ;
get_env ( " LLAMA_ARG_N_PREDICT " , params . n_predict ) ;
get_env ( " LLAMA_ARG_ENDPOINT_METRICS " , params . endpoint_metrics ) ;
get_env ( " LLAMA_ARG_ENDPOINT_SLOTS " , params . endpoint_slots ) ;
get_env ( " LLAMA_ARG_EMBEDDINGS " , params . embedding ) ;
get_env ( " LLAMA_ARG_FLASH_ATTN " , params . flash_attn ) ;
get_env ( " LLAMA_ARG_DEFRAG_THOLD " , params . defrag_thold ) ;
2024-08-27 11:07:01 +02:00
get_env ( " LLAMA_ARG_CONT_BATCHING " , params . cont_batching ) ;
get_env ( " LLAMA_ARG_HOST " , params . hostname ) ;
get_env ( " LLAMA_ARG_PORT " , params . port ) ;
2024-08-21 11:04:34 +02:00
}
2023-04-30 20:41:35 +02:00
bool gpt_params_parse ( int argc , char * * argv , gpt_params & params ) {
2024-06-04 20:23:39 +02:00
const auto params_org = params ; // the example can modify the default params
2023-11-01 18:42:01 +01:00
try {
2024-06-04 20:23:39 +02:00
if ( ! gpt_params_parse_ex ( argc , argv , params ) | | params . usage ) {
params = params_org ;
params . usage = true ;
return false ;
2023-11-01 18:42:01 +01:00
}
2024-06-04 20:23:39 +02:00
} catch ( const std : : invalid_argument & ex ) {
2023-11-01 20:15:55 +01:00
fprintf ( stderr , " %s \n " , ex . what ( ) ) ;
2024-06-06 15:30:58 +02:00
params = params_org ;
2024-06-04 20:23:39 +02:00
return false ;
2023-11-01 18:42:01 +01:00
}
2024-06-04 20:23:39 +02:00
return true ;
2023-11-01 18:42:01 +01:00
}
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
bool parse_cpu_range ( const std : : string & range , bool ( & boolmask ) [ GGML_MAX_N_THREADS ] ) {
size_t dash_loc = range . find ( ' - ' ) ;
if ( dash_loc = = std : : string : : npos ) {
fprintf ( stderr , " Format of CPU range is invalid! Expected [<start>]-[<end>]. \n " ) ;
return false ;
}
size_t start_i ;
size_t end_i ;
if ( dash_loc = = 0 ) {
start_i = 0 ;
} else {
start_i = std : : stoull ( range . substr ( 0 , dash_loc ) ) ;
if ( start_i > = GGML_MAX_N_THREADS ) {
fprintf ( stderr , " Start index out of bounds! \n " ) ;
return false ;
}
}
if ( dash_loc = = range . length ( ) - 1 ) {
end_i = GGML_MAX_N_THREADS - 1 ;
} else {
end_i = std : : stoull ( range . substr ( dash_loc + 1 ) ) ;
if ( end_i > = GGML_MAX_N_THREADS ) {
fprintf ( stderr , " End index out of bounds! \n " ) ;
return false ;
}
}
for ( size_t i = start_i ; i < = end_i ; i + + ) {
boolmask [ i ] = true ;
}
return true ;
}
bool parse_cpu_mask ( const std : : string & mask , bool ( & boolmask ) [ GGML_MAX_N_THREADS ] ) {
// Discard potential 0x prefix
size_t start_i = 0 ;
if ( mask . length ( ) > = 2 & & mask . substr ( 0 , 2 ) = = " 0x " ) {
start_i = 2 ;
}
size_t num_digits = mask . length ( ) - start_i ;
if ( num_digits > 128 ) num_digits = 128 ;
size_t end_i = num_digits + start_i ;
for ( size_t i = start_i , n = ( num_digits * 4 - 1 ) ; i < end_i ; i + + , n - = 4 ) {
char c = mask . at ( i ) ;
int8_t id = c ;
if ( ( c > = ' 0 ' & & c < = ' 9 ' ) ) {
id - = ' 0 ' ;
} else if ( c > = ' a ' & & c < = ' f ' ) {
id - = ' a ' - 10 ;
} else if ( c > = ' A ' & & c < = ' F ' ) {
id - = ' A ' - 10 ;
} else {
fprintf ( stderr , " Invalid hex character '%c' at position %d \n " , c , int32_t ( i ) ) ;
return false ;
}
boolmask [ n ] = boolmask [ n ] | | ( ( id & 8 ) ! = 0 ) ;
boolmask [ n - 1 ] = boolmask [ n - 1 ] | | ( ( id & 4 ) ! = 0 ) ;
boolmask [ n - 2 ] = boolmask [ n - 2 ] | | ( ( id & 2 ) ! = 0 ) ;
boolmask [ n - 3 ] = boolmask [ n - 3 ] | | ( ( id & 1 ) ! = 0 ) ;
}
return true ;
}
2024-06-24 07:30:24 +02:00
# define CHECK_ARG if (++i >= argc) { invalid_param = true; return true; }
2024-03-25 08:38:22 +01:00
bool gpt_params_find_arg ( int argc , char * * argv , const std : : string & arg , gpt_params & params , int & i , bool & invalid_param ) {
2024-06-04 20:23:39 +02:00
const char split_delim = ' , ' ;
2024-04-26 20:06:33 +02:00
llama_sampling_params & sparams = params . sparams ;
2023-04-02 04:41:12 +02:00
2024-03-18 09:27:44 +01:00
if ( arg = = " -s " | | arg = = " --seed " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
// TODO: this is temporary, in the future the sampling state will be moved fully to llama_sampling_context.
2024-03-18 09:27:44 +01:00
params . seed = std : : stoul ( argv [ i ] ) ;
2024-04-24 11:08:36 +02:00
sparams . seed = std : : stoul ( argv [ i ] ) ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " -t " | | arg = = " --threads " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
params . cpuparams . n_threads = std : : stoi ( argv [ i ] ) ;
if ( params . cpuparams . n_threads < = 0 ) {
params . cpuparams . n_threads = std : : thread : : hardware_concurrency ( ) ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
}
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
if ( arg = = " -C " | | arg = = " --cpu-mask " ) {
CHECK_ARG
std : : string mask = argv [ i ] ;
params . cpuparams . mask_valid = true ;
invalid_param = ! parse_cpu_mask ( mask , params . cpuparams . cpumask ) ;
return true ;
}
if ( arg = = " -Cr " | | arg = = " --cpu-range " ) {
CHECK_ARG
std : : string range = argv [ i ] ;
params . cpuparams . mask_valid = true ;
invalid_param = ! parse_cpu_range ( range , params . cpuparams . cpumask ) ;
return true ;
}
if ( arg = = " --prio " ) {
CHECK_ARG
params . cpuparams . priority = ( enum ggml_sched_priority ) std : : stoul ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --cpu-strict " ) {
CHECK_ARG
params . cpuparams . strict_cpu = std : : stoul ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --poll " ) {
CHECK_ARG
params . cpuparams . poll = std : : stoul ( argv [ i ] ) ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " -tb " | | arg = = " --threads-batch " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
params . cpuparams_batch . n_threads = std : : stoi ( argv [ i ] ) ;
if ( params . cpuparams_batch . n_threads < = 0 ) {
params . cpuparams_batch . n_threads = std : : thread : : hardware_concurrency ( ) ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
}
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
if ( arg = = " -Cb " | | arg = = " --cpu-mask-batch " ) {
CHECK_ARG
std : : string mask = argv [ i ] ;
params . cpuparams_batch . mask_valid = true ;
invalid_param = ! parse_cpu_mask ( mask , params . cpuparams_batch . cpumask ) ;
return true ;
}
if ( arg = = " -Crb " | | arg = = " --cpu-range_batch " ) {
CHECK_ARG
std : : string range = argv [ i ] ;
params . cpuparams_batch . mask_valid = true ;
invalid_param = ! parse_cpu_range ( range , params . cpuparams_batch . cpumask ) ;
return true ;
}
if ( arg = = " --prio-batch " ) {
CHECK_ARG
params . cpuparams_batch . priority = ( enum ggml_sched_priority ) std : : stoul ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --cpu-strict-batch " ) {
params . cpuparams_batch . strict_cpu = true ;
return true ;
}
if ( arg = = " --poll-batch " ) {
CHECK_ARG
params . cpuparams_batch . poll = std : : stoul ( argv [ i ] ) ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " -td " | | arg = = " --threads-draft " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
params . draft_cpuparams . n_threads = std : : stoi ( argv [ i ] ) ;
if ( params . draft_cpuparams . n_threads < = 0 ) {
params . draft_cpuparams . n_threads = std : : thread : : hardware_concurrency ( ) ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
}
if ( arg = = " -Cd " | | arg = = " --cpu-mask-draft " ) {
CHECK_ARG
std : : string mask = argv [ i ] ;
params . draft_cpuparams . mask_valid = true ;
invalid_param = ! parse_cpu_mask ( mask , params . draft_cpuparams . cpumask ) ;
return true ;
}
if ( arg = = " -Crd " | | arg = = " --cpu-range-draft " ) {
CHECK_ARG
std : : string range = argv [ i ] ;
params . draft_cpuparams . mask_valid = true ;
invalid_param = ! parse_cpu_range ( range , params . draft_cpuparams . cpumask ) ;
return true ;
}
if ( arg = = " --prio-draft " ) {
CHECK_ARG
params . draft_cpuparams . priority = ( enum ggml_sched_priority ) std : : stoul ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --cpu-strict-draft " ) {
params . draft_cpuparams . strict_cpu = true ;
return true ;
}
if ( arg = = " --poll-draft " ) {
CHECK_ARG
params . draft_cpuparams . poll = std : : stoul ( argv [ i ] ) ;
return true ;
2024-03-18 09:27:44 +01:00
}
if ( arg = = " -tbd " | | arg = = " --threads-batch-draft " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
params . draft_cpuparams_batch . n_threads = std : : stoi ( argv [ i ] ) ;
if ( params . draft_cpuparams_batch . n_threads < = 0 ) {
params . draft_cpuparams_batch . n_threads = std : : thread : : hardware_concurrency ( ) ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
}
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
if ( arg = = " -Crbd " | | arg = = " --cpu-range-batch-draft " ) {
CHECK_ARG
std : : string range = argv [ i ] ;
params . draft_cpuparams_batch . mask_valid = true ;
invalid_param = ! parse_cpu_range ( range , params . draft_cpuparams_batch . cpumask ) ;
return true ;
}
if ( arg = = " --prio-batch-draft " ) {
CHECK_ARG
params . draft_cpuparams_batch . priority = ( enum ggml_sched_priority ) std : : stoul ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --cpu-strict-batch-draft " ) {
params . draft_cpuparams_batch . strict_cpu = true ;
return true ;
}
if ( arg = = " --poll-batch-draft " ) {
CHECK_ARG
params . draft_cpuparams_batch . poll = std : : stoul ( argv [ i ] ) ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " -p " | | arg = = " --prompt " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . prompt = argv [ i ] ;
return true ;
}
if ( arg = = " -e " | | arg = = " --escape " ) {
params . escape = true ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " --no-escape " ) {
params . escape = false ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " --prompt-cache " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . path_prompt_cache = argv [ i ] ;
return true ;
}
if ( arg = = " --prompt-cache-all " ) {
params . prompt_cache_all = true ;
return true ;
}
if ( arg = = " --prompt-cache-ro " ) {
params . prompt_cache_ro = true ;
return true ;
}
if ( arg = = " -bf " | | arg = = " --binary-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : ifstream file ( argv [ i ] , std : : ios : : binary ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
// store the external file name in params
params . prompt_file = argv [ i ] ;
std : : ostringstream ss ;
ss < < file . rdbuf ( ) ;
params . prompt = ss . str ( ) ;
fprintf ( stderr , " Read %zu bytes from binary file %s \n " , params . prompt . size ( ) , argv [ i ] ) ;
return true ;
}
if ( arg = = " -f " | | arg = = " --file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : ifstream file ( argv [ i ] ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
// store the external file name in params
params . prompt_file = argv [ i ] ;
std : : copy ( std : : istreambuf_iterator < char > ( file ) , std : : istreambuf_iterator < char > ( ) , back_inserter ( params . prompt ) ) ;
if ( ! params . prompt . empty ( ) & & params . prompt . back ( ) = = ' \n ' ) {
params . prompt . pop_back ( ) ;
}
return true ;
}
2024-06-06 15:30:58 +02:00
if ( arg = = " --in-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-06 15:30:58 +02:00
std : : ifstream file ( argv [ i ] ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
params . in_files . push_back ( argv [ i ] ) ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -n " | | arg = = " --predict " | | arg = = " --n-predict " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_predict = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --top-k " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . top_k = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " -c " | | arg = = " --ctx-size " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_ctx = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --grp-attn-n " | | arg = = " -gan " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . grp_attn_n = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --grp-attn-w " | | arg = = " -gaw " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . grp_attn_w = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --rope-freq-base " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . rope_freq_base = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --rope-freq-scale " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . rope_freq_scale = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --rope-scaling " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : string value ( argv [ i ] ) ;
/**/ if ( value = = " none " ) { params . rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_NONE ; }
else if ( value = = " linear " ) { params . rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_LINEAR ; }
else if ( value = = " yarn " ) { params . rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_YARN ; }
else { invalid_param = true ; }
return true ;
}
if ( arg = = " --rope-scale " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . rope_freq_scale = 1.0f / std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --yarn-orig-ctx " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . yarn_orig_ctx = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --yarn-ext-factor " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . yarn_ext_factor = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --yarn-attn-factor " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . yarn_attn_factor = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --yarn-beta-fast " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . yarn_beta_fast = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --yarn-beta-slow " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . yarn_beta_slow = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --pooling " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : string value ( argv [ i ] ) ;
/**/ if ( value = = " none " ) { params . pooling_type = LLAMA_POOLING_TYPE_NONE ; }
else if ( value = = " mean " ) { params . pooling_type = LLAMA_POOLING_TYPE_MEAN ; }
else if ( value = = " cls " ) { params . pooling_type = LLAMA_POOLING_TYPE_CLS ; }
2024-06-21 07:38:22 +02:00
else if ( value = = " last " ) { params . pooling_type = LLAMA_POOLING_TYPE_LAST ; }
2024-03-18 09:27:44 +01:00
else { invalid_param = true ; }
return true ;
}
2024-07-05 09:05:56 +02:00
if ( arg = = " --attention " ) {
CHECK_ARG
std : : string value ( argv [ i ] ) ;
/**/ if ( value = = " causal " ) { params . attention_type = LLAMA_ATTENTION_TYPE_CAUSAL ; }
else if ( value = = " non-causal " ) { params . attention_type = LLAMA_ATTENTION_TYPE_NON_CAUSAL ; }
else { invalid_param = true ; }
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " --defrag-thold " | | arg = = " -dt " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . defrag_thold = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --samplers " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
const auto sampler_names = string_split ( argv [ i ] , ' ; ' ) ;
2024-05-22 19:04:20 +02:00
sparams . samplers_sequence = llama_sampling_types_from_names ( sampler_names , true ) ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --sampling-seq " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-05-22 19:04:20 +02:00
sparams . samplers_sequence = llama_sampling_types_from_chars ( argv [ i ] ) ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --top-p " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . top_p = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --min-p " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . min_p = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --temp " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . temp = std : : stof ( argv [ i ] ) ;
sparams . temp = std : : max ( sparams . temp , 0.0f ) ;
return true ;
}
if ( arg = = " --tfs " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . tfs_z = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --typical " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . typical_p = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --repeat-last-n " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . penalty_last_n = std : : stoi ( argv [ i ] ) ;
sparams . n_prev = std : : max ( sparams . n_prev , sparams . penalty_last_n ) ;
return true ;
}
if ( arg = = " --repeat-penalty " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . penalty_repeat = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --frequency-penalty " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . penalty_freq = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --presence-penalty " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . penalty_present = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --dynatemp-range " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . dynatemp_range = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --dynatemp-exp " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . dynatemp_exponent = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --mirostat " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . mirostat = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --mirostat-lr " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . mirostat_eta = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --mirostat-ent " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . mirostat_tau = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --cfg-negative-prompt " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . cfg_negative_prompt = argv [ i ] ;
return true ;
}
if ( arg = = " --cfg-negative-prompt-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : ifstream file ( argv [ i ] ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
std : : copy ( std : : istreambuf_iterator < char > ( file ) , std : : istreambuf_iterator < char > ( ) , back_inserter ( sparams . cfg_negative_prompt ) ) ;
if ( ! sparams . cfg_negative_prompt . empty ( ) & & sparams . cfg_negative_prompt . back ( ) = = ' \n ' ) {
sparams . cfg_negative_prompt . pop_back ( ) ;
}
return true ;
}
if ( arg = = " --cfg-scale " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . cfg_scale = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " -b " | | arg = = " --batch-size " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_batch = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " -ub " | | arg = = " --ubatch-size " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_ubatch = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --keep " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_keep = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --draft " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_draft = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --chunks " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_chunks = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " -np " | | arg = = " --parallel " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_parallel = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " -ns " | | arg = = " --sequences " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_sequences = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --p-split " | | arg = = " -ps " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . p_split = std : : stof ( argv [ i ] ) ;
return true ;
}
if ( arg = = " -m " | | arg = = " --model " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . model = argv [ i ] ;
return true ;
}
2024-03-22 14:33:38 +01:00
if ( arg = = " -md " | | arg = = " --model-draft " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-22 14:33:38 +01:00
params . model_draft = argv [ i ] ;
return true ;
}
if ( arg = = " -a " | | arg = = " --alias " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-22 14:33:38 +01:00
params . model_alias = argv [ i ] ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " -mu " | | arg = = " --model-url " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . model_url = argv [ i ] ;
return true ;
}
2024-07-06 22:32:04 +02:00
if ( arg = = " -hft " | | arg = = " --hf-token " ) {
if ( + + i > = argc ) {
invalid_param = true ;
return true ;
}
params . hf_token = argv [ i ] ;
return true ;
}
2024-03-22 14:33:38 +01:00
if ( arg = = " -hfr " | | arg = = " --hf-repo " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-22 14:33:38 +01:00
params . hf_repo = argv [ i ] ;
2024-03-18 09:27:44 +01:00
return true ;
}
2024-03-22 14:33:38 +01:00
if ( arg = = " -hff " | | arg = = " --hf-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-22 14:33:38 +01:00
params . hf_file = argv [ i ] ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --lora " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-08-06 17:33:39 +02:00
params . lora_adapters . push_back ( {
std : : string ( argv [ i ] ) ,
1.0 ,
} ) ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --lora-scaled " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-08-06 17:33:39 +02:00
std : : string lora_adapter = argv [ i ] ;
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-08-06 17:33:39 +02:00
params . lora_adapters . push_back ( {
lora_adapter ,
std : : stof ( argv [ i ] ) ,
} ) ;
2024-03-18 09:27:44 +01:00
return true ;
}
2024-08-06 17:33:39 +02:00
if ( arg = = " --lora-init-without-apply " ) {
params . lora_init_without_apply = true ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --control-vector " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . control_vectors . push_back ( { 1.0f , argv [ i ] , } ) ;
return true ;
}
if ( arg = = " --control-vector-scaled " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
const char * fname = argv [ i ] ;
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . control_vectors . push_back ( { std : : stof ( argv [ i ] ) , fname , } ) ;
return true ;
}
if ( arg = = " --control-vector-layer-range " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . control_vector_layer_start = std : : stoi ( argv [ i ] ) ;
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . control_vector_layer_end = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --mmproj " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . mmproj = argv [ i ] ;
return true ;
}
if ( arg = = " --image " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-04-29 16:34:24 +02:00
params . image . emplace_back ( argv [ i ] ) ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " -i " | | arg = = " --interactive " ) {
params . interactive = true ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -sp " | | arg = = " --special " ) {
2024-05-26 16:10:17 +02:00
params . special = true ;
2024-05-25 11:04:03 +02:00
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " --embedding " | | arg = = " --embeddings " ) {
2024-03-18 09:27:44 +01:00
params . embedding = true ;
return true ;
}
2024-06-24 07:30:24 +02:00
if ( arg = = " --embd-normalize " ) {
CHECK_ARG
params . embd_normalize = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --embd-output-format " ) {
CHECK_ARG
params . embd_out = argv [ i ] ;
return true ;
}
if ( arg = = " --embd-separator " ) {
CHECK_ARG
params . embd_sep = argv [ i ] ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -if " | | arg = = " --interactive-first " ) {
2024-03-18 09:27:44 +01:00
params . interactive_first = true ;
return true ;
}
2024-05-08 16:32:32 +02:00
if ( arg = = " -cnv " | | arg = = " --conversation " ) {
params . conversation = true ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " --infill " ) {
params . infill = true ;
return true ;
}
if ( arg = = " -dkvc " | | arg = = " --dump-kv-cache " ) {
params . dump_kv_cache = true ;
return true ;
}
if ( arg = = " -nkvo " | | arg = = " --no-kv-offload " ) {
params . no_kv_offload = true ;
return true ;
}
if ( arg = = " -ctk " | | arg = = " --cache-type-k " ) {
params . cache_type_k = argv [ + + i ] ;
return true ;
}
if ( arg = = " -ctv " | | arg = = " --cache-type-v " ) {
params . cache_type_v = argv [ + + i ] ;
return true ;
}
2024-07-02 22:56:46 +02:00
if ( arg = = " -mli " | | arg = = " --multiline-input " ) {
2024-03-18 09:27:44 +01:00
params . multiline_input = true ;
return true ;
}
if ( arg = = " --simple-io " ) {
params . simple_io = true ;
return true ;
}
if ( arg = = " -cb " | | arg = = " --cont-batching " ) {
params . cont_batching = true ;
return true ;
}
2024-07-15 13:54:58 +02:00
if ( arg = = " -nocb " | | arg = = " --no-cont-batching " ) {
params . cont_batching = false ;
return true ;
}
ggml : add Flash Attention (#5021)
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision fpr parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-04-30 11:16:08 +02:00
if ( arg = = " -fa " | | arg = = " --flash-attn " ) {
params . flash_attn = true ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -co " | | arg = = " --color " ) {
2024-03-18 09:27:44 +01:00
params . use_color = true ;
return true ;
}
if ( arg = = " --mlock " ) {
params . use_mlock = true ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -ngl " | | arg = = " --gpu-layers " | | arg = = " --n-gpu-layers " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_gpu_layers = std : : stoi ( argv [ i ] ) ;
if ( ! llama_supports_gpu_offload ( ) ) {
2024-06-04 20:23:39 +02:00
fprintf ( stderr , " warning: not compiled with GPU offload support, --gpu-layers option will be ignored \n " ) ;
2024-03-18 09:27:44 +01:00
fprintf ( stderr , " warning: see main README.md for information on enabling GPU BLAS support \n " ) ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
}
2024-08-26 00:54:37 +02:00
if ( arg = = " -ngld " | | arg = = " --gpu-layers-draft " | | arg = = " --n-gpu-layers-draft " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . n_gpu_layers_draft = std : : stoi ( argv [ i ] ) ;
if ( ! llama_supports_gpu_offload ( ) ) {
2024-06-04 20:23:39 +02:00
fprintf ( stderr , " warning: not compiled with GPU offload support, --gpu-layers-draft option will be ignored \n " ) ;
2024-03-18 09:27:44 +01:00
fprintf ( stderr , " warning: see main README.md for information on enabling GPU BLAS support \n " ) ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --main-gpu " | | arg = = " -mg " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . main_gpu = std : : stoi ( argv [ i ] ) ;
2024-06-03 10:59:14 +02:00
# ifndef GGML_USE_CUDA_SYCL_VULKAN
fprintf ( stderr , " warning: llama.cpp was compiled without CUDA/SYCL/Vulkan. Setting the main GPU has no effect. \n " ) ;
# endif // GGML_USE_CUDA_SYCL_VULKAN
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --split-mode " | | arg = = " -sm " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : string arg_next = argv [ i ] ;
if ( arg_next = = " none " ) {
params . split_mode = LLAMA_SPLIT_MODE_NONE ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
else if ( arg_next = = " layer " ) {
params . split_mode = LLAMA_SPLIT_MODE_LAYER ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
else if ( arg_next = = " row " ) {
# ifdef GGML_USE_SYCL
fprintf ( stderr , " warning: The split mode value:[row] is not supported by llama.cpp with SYCL. It's developing. \n Exit! \n " ) ;
exit ( 1 ) ;
# endif // GGML_USE_SYCL
params . split_mode = LLAMA_SPLIT_MODE_ROW ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
else {
invalid_param = true ;
return true ;
2024-03-16 16:39:15 +01:00
}
2024-06-03 10:59:14 +02:00
# ifndef GGML_USE_CUDA_SYCL_VULKAN
fprintf ( stderr , " warning: llama.cpp was compiled without CUDA/SYCL/Vulkan. Setting the split mode has no effect. \n " ) ;
# endif // GGML_USE_CUDA_SYCL_VULKAN
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --tensor-split " | | arg = = " -ts " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : string arg_next = argv [ i ] ;
ggml : add unified SYCL backend for Intel GPUs (#2690)
* first update for migration
* update init_cublas
* add debug functio, commit all help code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add know issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export funciton print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 16:56:23 +01:00
2024-03-18 09:27:44 +01:00
// split string by , and /
const std : : regex regex { R " ([,/]+) " } ;
std : : sregex_token_iterator it { arg_next . begin ( ) , arg_next . end ( ) , regex , - 1 } ;
std : : vector < std : : string > split_arg { it , { } } ;
if ( split_arg . size ( ) > = llama_max_devices ( ) ) {
invalid_param = true ;
return true ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
for ( size_t i = 0 ; i < llama_max_devices ( ) ; + + i ) {
if ( i < split_arg . size ( ) ) {
params . tensor_split [ i ] = std : : stof ( split_arg [ i ] ) ;
2024-01-12 20:07:38 +01:00
}
2024-03-18 09:27:44 +01:00
else {
params . tensor_split [ i ] = 0.0f ;
2023-06-06 21:33:23 +02:00
}
2024-03-18 09:27:44 +01:00
}
2024-03-26 01:16:01 +01:00
# ifndef GGML_USE_CUDA_SYCL_VULKAN
fprintf ( stderr , " warning: llama.cpp was compiled without CUDA/SYCL/Vulkan. Setting a tensor split has no effect. \n " ) ;
# endif // GGML_USE_CUDA_SYCL_VULKAN
2024-03-18 09:27:44 +01:00
return true ;
}
2024-05-14 13:27:19 +02:00
if ( arg = = " --rpc " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-05-14 13:27:19 +02:00
params . rpc_servers = argv [ i ] ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " --no-mmap " ) {
params . use_mmap = false ;
return true ;
}
if ( arg = = " --numa " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : string value ( argv [ i ] ) ;
/**/ if ( value = = " distribute " | | value = = " " ) { params . numa = GGML_NUMA_STRATEGY_DISTRIBUTE ; }
else if ( value = = " isolate " ) { params . numa = GGML_NUMA_STRATEGY_ISOLATE ; }
else if ( value = = " numactl " ) { params . numa = GGML_NUMA_STRATEGY_NUMACTL ; }
else { invalid_param = true ; }
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -v " | | arg = = " --verbose " ) {
2024-06-06 15:30:58 +02:00
params . verbosity = 1 ;
return true ;
}
if ( arg = = " --verbosity " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-06 15:30:58 +02:00
params . verbosity = std : : stoi ( argv [ i ] ) ;
2024-06-04 20:23:39 +02:00
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " --verbose-prompt " ) {
params . verbose_prompt = true ;
return true ;
}
if ( arg = = " --no-display-prompt " ) {
params . display_prompt = false ;
return true ;
}
if ( arg = = " -r " | | arg = = " --reverse-prompt " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . antiprompt . emplace_back ( argv [ i ] ) ;
return true ;
}
if ( arg = = " -ld " | | arg = = " --logdir " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . logdir = argv [ i ] ;
2023-08-28 17:59:39 +02:00
2024-03-18 09:27:44 +01:00
if ( params . logdir . back ( ) ! = DIRECTORY_SEPARATOR ) {
params . logdir + = DIRECTORY_SEPARATOR ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
}
2024-03-23 01:24:36 +01:00
if ( arg = = " -lcs " | | arg = = " --lookup-cache-static " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-23 01:24:36 +01:00
params . lookup_cache_static = argv [ i ] ;
return true ;
}
if ( arg = = " -lcd " | | arg = = " --lookup-cache-dynamic " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-23 01:24:36 +01:00
params . lookup_cache_dynamic = argv [ i ] ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " --save-all-logits " | | arg = = " --kl-divergence-base " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . logits_file = argv [ i ] ;
return true ;
}
if ( arg = = " --perplexity " | | arg = = " --all-logits " ) {
params . logits_all = true ;
return true ;
}
if ( arg = = " --ppl-stride " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . ppl_stride = std : : stoi ( argv [ i ] ) ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " --ppl-output-type " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . ppl_output_type = std : : stoi ( argv [ i ] ) ;
2024-04-26 18:39:58 +02:00
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -ptc " | | arg = = " --print-token-count " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . n_print = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --check-tensors " ) {
params . check_tensors = true ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --hellaswag " ) {
params . hellaswag = true ;
return true ;
}
if ( arg = = " --hellaswag-tasks " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . hellaswag_tasks = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --winogrande " ) {
params . winogrande = true ;
return true ;
}
if ( arg = = " --winogrande-tasks " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . winogrande_tasks = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --multiple-choice " ) {
params . multiple_choice = true ;
return true ;
}
if ( arg = = " --multiple-choice-tasks " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . multiple_choice_tasks = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --kl-divergence " ) {
params . kl_divergence = true ;
return true ;
}
if ( arg = = " --ignore-eos " ) {
params . ignore_eos = true ;
return true ;
}
2024-03-27 08:23:10 +01:00
if ( arg = = " --penalize-nl " ) {
sparams . penalize_nl = true ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " -l " | | arg = = " --logit-bias " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : stringstream ss ( argv [ i ] ) ;
llama_token key ;
char sign ;
std : : string value_str ;
try {
if ( ss > > key & & ss > > sign & & std : : getline ( ss , value_str ) & & ( sign = = ' + ' | | sign = = ' - ' ) ) {
sparams . logit_bias [ key ] = std : : stof ( value_str ) * ( ( sign = = ' - ' ) ? - 1.0f : 1.0f ) ;
llama : new sampling algorithms (#1126)
* Sample interface, new samplers.
New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat
Ignore EOS fix: -inf should be used.
* mirostat
* Added --logit-bias and --no-penalize-nl, removed std::span
* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
* Save and load example adjust
* Tests
* Windows build fix
* Windows test fix
2023-04-29 07:34:41 +02:00
}
2024-03-18 09:27:44 +01:00
else {
throw std : : exception ( ) ;
llama : new sampling algorithms (#1126)
* Sample interface, new samplers.
New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat
Ignore EOS fix: -inf should be used.
* mirostat
* Added --logit-bias and --no-penalize-nl, removed std::span
* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
* Save and load example adjust
* Tests
* Windows build fix
* Windows test fix
2023-04-29 07:34:41 +02:00
}
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
catch ( const std : : exception & ) {
invalid_param = true ;
return true ;
2024-03-16 16:39:15 +01:00
}
2024-03-18 09:27:44 +01:00
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -h " | | arg = = " --help " | | arg = = " --usage " ) {
params . usage = true ;
return true ;
2024-03-18 09:27:44 +01:00
}
if ( arg = = " --version " ) {
fprintf ( stderr , " version: %d (%s) \n " , LLAMA_BUILD_NUMBER , LLAMA_COMMIT ) ;
fprintf ( stderr , " built with %s for %s \n " , LLAMA_COMPILER , LLAMA_BUILD_TARGET ) ;
exit ( 0 ) ;
}
if ( arg = = " --in-prefix-bos " ) {
params . input_prefix_bos = true ;
2024-06-30 20:27:13 +02:00
params . enable_chat_template = false ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --in-prefix " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . input_prefix = argv [ i ] ;
2024-06-30 20:27:13 +02:00
params . enable_chat_template = false ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --in-suffix " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
params . input_suffix = argv [ i ] ;
2024-06-30 20:27:13 +02:00
params . enable_chat_template = false ;
2024-03-18 09:27:44 +01:00
return true ;
}
2024-06-28 12:53:43 +02:00
if ( arg = = " --spm-infill " ) {
params . spm_infill = true ;
2024-03-18 09:27:44 +01:00
return true ;
}
if ( arg = = " --grammar " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
sparams . grammar = argv [ i ] ;
return true ;
}
if ( arg = = " --grammar-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
std : : ifstream file ( argv [ i ] ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
std : : copy (
std : : istreambuf_iterator < char > ( file ) ,
std : : istreambuf_iterator < char > ( ) ,
std : : back_inserter ( sparams . grammar )
) ;
return true ;
}
2024-04-15 19:35:21 +02:00
if ( arg = = " -j " | | arg = = " --json-schema " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-04-15 19:35:21 +02:00
sparams . grammar = json_schema_to_grammar ( json : : parse ( argv [ i ] ) ) ;
return true ;
}
2024-03-18 09:27:44 +01:00
if ( arg = = " --override-kv " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-05-22 19:04:20 +02:00
if ( ! string_parse_kv_override ( argv [ i ] , params . kv_overrides ) ) {
2024-03-18 09:27:44 +01:00
fprintf ( stderr , " error: Invalid type for KV override: %s \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " --host " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . hostname = argv [ i ] ;
return true ;
}
if ( arg = = " --port " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . port = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --path " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . public_path = argv [ i ] ;
return true ;
}
if ( arg = = " --api-key " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . api_keys . push_back ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --api-key-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
std : : ifstream key_file ( argv [ i ] ) ;
if ( ! key_file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
std : : string key ;
while ( std : : getline ( key_file , key ) ) {
if ( ! key . empty ( ) ) {
params . api_keys . push_back ( key ) ;
}
}
key_file . close ( ) ;
return true ;
}
if ( arg = = " --ssl-key-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . ssl_file_key = argv [ i ] ;
return true ;
}
if ( arg = = " --ssl-cert-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . ssl_file_cert = argv [ i ] ;
return true ;
}
if ( arg = = " --timeout " | | arg = = " -to " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . timeout_read = std : : stoi ( argv [ i ] ) ;
params . timeout_write = std : : stoi ( argv [ i ] ) ;
return true ;
}
2024-06-06 18:19:59 +02:00
if ( arg = = " --threads-http " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-06 18:19:59 +02:00
params . n_threads_http = std : : stoi ( argv [ i ] ) ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -spf " | | arg = = " --system-prompt-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
std : : ifstream file ( argv [ i ] ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
std : : string system_prompt ;
std : : copy (
std : : istreambuf_iterator < char > ( file ) ,
std : : istreambuf_iterator < char > ( ) ,
std : : back_inserter ( system_prompt )
) ;
params . system_prompt = system_prompt ;
return true ;
}
if ( arg = = " --log-format " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
if ( std : : strcmp ( argv [ i ] , " json " ) = = 0 ) {
params . log_json = true ;
} else if ( std : : strcmp ( argv [ i ] , " text " ) = = 0 ) {
params . log_json = false ;
} else {
invalid_param = true ;
return true ;
}
return true ;
}
if ( arg = = " --no-slots " ) {
params . endpoint_slots = false ;
return true ;
}
if ( arg = = " --metrics " ) {
params . endpoint_metrics = true ;
return true ;
}
if ( arg = = " --slot-save-path " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . slot_save_path = argv [ i ] ;
// if doesn't end with DIRECTORY_SEPARATOR, add it
if ( ! params . slot_save_path . empty ( ) & & params . slot_save_path [ params . slot_save_path . size ( ) - 1 ] ! = DIRECTORY_SEPARATOR ) {
params . slot_save_path + = DIRECTORY_SEPARATOR ;
}
return true ;
}
if ( arg = = " --chat-template " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
if ( ! llama_chat_verify_template ( argv [ i ] ) ) {
fprintf ( stderr , " error: the supplied chat template is not supported: %s \n " , argv [ i ] ) ;
fprintf ( stderr , " note: llama.cpp does not use jinja parser, we only support commonly used templates \n " ) ;
invalid_param = true ;
return true ;
}
params . chat_template = argv [ i ] ;
return true ;
}
2024-06-08 09:50:31 +02:00
if ( arg = = " --slot-prompt-similarity " | | arg = = " -sps " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-08 09:50:31 +02:00
params . slot_prompt_similarity = std : : stof ( argv [ i ] ) ;
return true ;
}
2024-06-04 20:23:39 +02:00
if ( arg = = " -pps " ) {
params . is_pp_shared = true ;
return true ;
}
if ( arg = = " -npp " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
auto p = string_split < int > ( argv [ i ] , split_delim ) ;
params . n_pp . insert ( params . n_pp . end ( ) , p . begin ( ) , p . end ( ) ) ;
return true ;
}
if ( arg = = " -ntg " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
auto p = string_split < int > ( argv [ i ] , split_delim ) ;
params . n_tg . insert ( params . n_tg . end ( ) , p . begin ( ) , p . end ( ) ) ;
return true ;
}
if ( arg = = " -npl " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
auto p = string_split < int > ( argv [ i ] , split_delim ) ;
params . n_pl . insert ( params . n_pl . end ( ) , p . begin ( ) , p . end ( ) ) ;
return true ;
}
if ( arg = = " --context-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
std : : ifstream file ( argv [ i ] , std : : ios : : binary ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
return true ;
}
params . context_files . push_back ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --chunk-size " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . chunk_size = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --chunk-separator " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . chunk_separator = argv [ i ] ;
return true ;
}
if ( arg = = " --junk " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . n_junk = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --pos " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-04 20:23:39 +02:00
params . i_pos = std : : stoi ( argv [ i ] ) ;
return true ;
}
2024-06-06 15:30:58 +02:00
if ( arg = = " -o " | | arg = = " --output " | | arg = = " --output-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-06 15:30:58 +02:00
params . out_file = argv [ i ] ;
2024-06-15 18:53:40 +02:00
params . cvector_outfile = argv [ i ] ;
2024-07-23 23:48:37 +02:00
params . lora_outfile = argv [ i ] ;
2024-06-06 15:30:58 +02:00
return true ;
}
if ( arg = = " -ofreq " | | arg = = " --output-frequency " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-06 15:30:58 +02:00
params . n_out_freq = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --save-frequency " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-06 15:30:58 +02:00
params . n_save_freq = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --process-output " ) {
params . process_output = true ;
return true ;
}
if ( arg = = " --no-ppl " ) {
params . compute_ppl = false ;
return true ;
}
if ( arg = = " --chunk " | | arg = = " --from-chunk " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-06 15:30:58 +02:00
params . i_chunk = std : : stoi ( argv [ i ] ) ;
return true ;
}
2024-06-15 18:53:40 +02:00
// cvector params
if ( arg = = " --positive-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-15 18:53:40 +02:00
params . cvector_positive_file = argv [ i ] ;
return true ;
}
if ( arg = = " --negative-file " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-15 18:53:40 +02:00
params . cvector_negative_file = argv [ i ] ;
return true ;
}
if ( arg = = " --pca-batch " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-15 18:53:40 +02:00
params . n_pca_batch = std : : stoi ( argv [ i ] ) ;
return true ;
}
if ( arg = = " --pca-iter " ) {
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-06-15 18:53:40 +02:00
params . n_pca_iterations = std : : stoi ( argv [ i ] ) ;
return true ;
}
2024-06-25 13:59:54 +02:00
if ( arg = = " --method " ) {
CHECK_ARG
std : : string value ( argv [ i ] ) ;
/**/ if ( value = = " pca " ) { params . cvector_dimre_method = DIMRE_METHOD_PCA ; }
else if ( value = = " mean " ) { params . cvector_dimre_method = DIMRE_METHOD_MEAN ; }
else { invalid_param = true ; }
return true ;
}
2024-07-27 12:45:02 +02:00
if ( arg = = " --no-warmup " ) {
params . warmup = false ;
return true ;
}
main : log file (#2748)
* initial, base LOG macro
* add *.log to .gitignore
* added basic log file handler
* reverted log auto endline to better mimic printf
* remove atomics and add dynamic log target
* log_enable/disable, LOG_TEE, basic usage doc
* update .gitignore
* mv include to common, params, help msg
* log tostring helpers, token vectors pretty prints
* main: replaced fprintf/LOG_TEE, some trace logging
* LOG_DISABLE_LOGS compile flag, wrapped f in macros
* fix LOG_TEELN and configchecker
* stub LOG_DUMP_CMDLINE for WIN32 for now
* fix msvc
* cleanup main.cpp:273
* fix stray whitespace after master sync
* log : fix compile warnings
- do not use C++20 stuff
- use PRIu64 to print uint64_t
- avoid string copies by using const ref
- fix ", ##__VA_ARGS__" warnings
- compare strings with == and !=
* log : do not append to existing log + disable file line func by default
* log : try to fix Windows build
* main : wip logs
* main : add trace log
* review: macro f lowercase, str append to sstream
* review: simplify ifs and str comparisons
* fix MSVC, formatting, FMT/VAL placeholders
* review: if/else cleanup
* review: if/else cleanup (2)
* replace _ prefix with _impl suffix
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-30 08:29:32 +02:00
# ifndef LOG_DISABLE_LOGS
2024-03-18 09:27:44 +01:00
// Parse args for logging parameters
if ( log_param_single_parse ( argv [ i ] ) ) {
// Do nothing, log_param_single_parse automatically does it's thing
// and returns if a match was found and parsed.
return true ;
}
if ( log_param_pair_parse ( /*check_but_dont_parse*/ true , argv [ i ] ) ) {
// We have a matching known parameter requiring an argument,
// now we need to check if there is anything after this argv
// and flag invalid_param or parse it.
2024-06-24 07:30:24 +02:00
CHECK_ARG
2024-03-18 09:27:44 +01:00
if ( ! log_param_pair_parse ( /*check_but_dont_parse*/ false , argv [ i - 1 ] , argv [ i ] ) ) {
invalid_param = true ;
return true ;
}
return true ;
}
// End of Parse args for logging parameters
main : log file (#2748)
* initial, base LOG macro
* add *.log to .gitignore
* added basic log file handler
* reverted log auto endline to better mimic printf
* remove atomics and add dynamic log target
* log_enable/disable, LOG_TEE, basic usage doc
* update .gitignore
* mv include to common, params, help msg
* log tostring helpers, token vectors pretty prints
* main: replaced fprintf/LOG_TEE, some trace logging
* LOG_DISABLE_LOGS compile flag, wrapped f in macros
* fix LOG_TEELN and configchecker
* stub LOG_DUMP_CMDLINE for WIN32 for now
* fix msvc
* cleanup main.cpp:273
* fix stray whitespace after master sync
* log : fix compile warnings
- do not use C++20 stuff
- use PRIu64 to print uint64_t
- avoid string copies by using const ref
- fix ", ##__VA_ARGS__" warnings
- compare strings with == and !=
* log : do not append to existing log + disable file line func by default
* log : try to fix Windows build
* main : wip logs
* main : add trace log
* review: macro f lowercase, str append to sstream
* review: simplify ifs and str comparisons
* fix MSVC, formatting, FMT/VAL placeholders
* review: if/else cleanup
* review: if/else cleanup (2)
* replace _ prefix with _impl suffix
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-30 08:29:32 +02:00
# endif // LOG_DISABLE_LOGS
2024-03-18 09:27:44 +01:00
return false ;
}
2024-06-04 20:23:39 +02:00
# ifdef __GNUC__
# ifdef __MINGW32__
# define LLAMA_COMMON_ATTRIBUTE_FORMAT(...) __attribute__((format(gnu_printf, __VA_ARGS__)))
# else
# define LLAMA_COMMON_ATTRIBUTE_FORMAT(...) __attribute__((format(printf, __VA_ARGS__)))
# endif
# else
# define LLAMA_COMMON_ATTRIBUTE_FORMAT(...)
# endif
2024-05-22 19:04:20 +02:00
void gpt_params_print_usage ( int /*argc*/ , char * * argv , const gpt_params & params ) {
2023-10-20 20:07:23 +02:00
const llama_sampling_params & sparams = params . sparams ;
2023-10-11 21:35:46 +02:00
2024-02-11 14:43:31 +01:00
std : : string sampler_type_chars ;
std : : string sampler_type_names ;
for ( const auto sampler_type : sparams . samplers_sequence ) {
sampler_type_chars + = static_cast < char > ( sampler_type ) ;
2024-05-22 19:04:20 +02:00
sampler_type_names + = llama_sampling_type_to_str ( sampler_type ) + " ; " ;
2024-02-11 14:43:31 +01:00
}
sampler_type_names . pop_back ( ) ;
2024-06-04 20:23:39 +02:00
struct option_info {
LLAMA_COMMON_ATTRIBUTE_FORMAT ( 4 , 5 )
option_info ( const std : : string & tags , const char * args , const char * desc , . . . ) : tags ( tags ) , args ( args ) , desc ( desc ) {
va_list args_list ;
va_start ( args_list , desc ) ;
char buffer [ 1024 ] ;
vsnprintf ( buffer , sizeof ( buffer ) , desc , args_list ) ;
va_end ( args_list ) ;
this - > desc = buffer ;
}
option_info ( const std : : string & grp ) : grp ( grp ) { }
std : : string tags ;
std : : string args ;
std : : string desc ;
std : : string grp ;
} ;
std : : vector < option_info > options ;
// TODO: filter by tags
options . push_back ( { " general " } ) ;
options . push_back ( { " * " , " -h, --help, --usage " , " print usage and exit " } ) ;
options . push_back ( { " * " , " --version " , " show version and build info " } ) ;
options . push_back ( { " * " , " -v, --verbose " , " print verbose information " } ) ;
2024-06-06 15:30:58 +02:00
options . push_back ( { " * " , " --verbosity N " , " set specific verbosity level (default: %d) " , params . verbosity } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " --verbose-prompt " , " print a verbose prompt before generation (default: %s) " , params . verbose_prompt ? " true " : " false " } ) ;
options . push_back ( { " * " , " --no-display-prompt " , " don't print prompt at generation (default: %s) " , ! params . display_prompt ? " true " : " false " } ) ;
options . push_back ( { " * " , " -co, --color " , " colorise output to distinguish prompt and user input from generations (default: %s) " , params . use_color ? " true " : " false " } ) ;
options . push_back ( { " * " , " -s, --seed SEED " , " RNG seed (default: %d, use random seed for < 0) " , params . seed } ) ;
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
options . push_back ( { " * " , " -t, --threads N " , " number of threads to use during generation (default: %d) " , params . cpuparams . n_threads } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " -tb, --threads-batch N " , " number of threads to use during batch and prompt processing (default: same as --threads) " } ) ;
options . push_back ( { " speculative " , " -td, --threads-draft N " , " number of threads to use during generation (default: same as --threads) " } ) ;
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
options . push_back ( { " speculative " , " -tbd, --threads-batch-draft N " , " number of threads to use during batch and prompt processing (default: same as --threads-draft) " } ) ;
# ifndef GGML_USE_OPENMP
// these options are available only with the internal threadpool
options . push_back ( { " * " , " -C, --cpu-mask M " , " CPU affinity mask: arbitrarily long hex. Complements cpu-range (default: \" \" ) " } ) ;
options . push_back ( { " * " , " -Cr, --cpu-range lo-hi " , " range of CPUs for affinity. Complements --cpu-mask " } ) ;
options . push_back ( { " * " , " --cpu-strict <0|1> " , " use strict CPU placement (default: %u) \n " , ( unsigned ) params . cpuparams . strict_cpu } ) ;
options . push_back ( { " * " , " --priority N " , " set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: %d) \n " , params . cpuparams . priority } ) ;
options . push_back ( { " * " , " --poll <0...100> " , " use polling level to wait for work (0 - no polling, default: %u) \n " , ( unsigned ) params . cpuparams . poll } ) ;
options . push_back ( { " * " , " -Cb, --cpu-mask-batch M " , " CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch (default: same as --cpu-mask) " } ) ;
options . push_back ( { " * " , " -Crb, --cpu-range-batch lo-hi " , " ranges of CPUs for affinity. Complements --cpu-mask-batch " } ) ;
options . push_back ( { " * " , " --cpu-strict-batch <0|1> " , " use strict CPU placement (default: same as --cpu-strict) " } ) ;
options . push_back ( { " * " , " --priority-batch N " , " set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: --priority) " } ) ;
options . push_back ( { " * " , " --poll-batch <0|1> " , " use polling to wait for work (default: same as --poll " } ) ;
options . push_back ( { " speculative " , " -Cd, --cpu-mask-draft M " , " Draft model CPU affinity mask. Complements cpu-range-draft (default: same as --cpu-mask) " } ) ;
options . push_back ( { " speculative " , " -Crd, --cpu-range-draft lo-hi " , " Ranges of CPUs for affinity. Complements --cpu-mask-draft " } ) ;
options . push_back ( { " speculative " , " --cpu-strict-draft <0|1> " , " Use strict CPU placement for draft model (default: same as --cpu-strict) " } ) ;
options . push_back ( { " speculative " , " --priority-draft N " , " Set draft process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: same as --priority) " } ) ;
options . push_back ( { " speculative " , " --poll-draft <0|1> " , " Use polling to wait for draft model work (default: same as --poll]) " } ) ;
options . push_back ( { " speculative " , " -Cbd, --cpu-mask-batch-draft M " , " Draft model CPU affinity mask. Complements cpu-range-draft-batch (default: same as --cpu-mask-draft) " } ) ;
options . push_back ( { " speculative " , " -Crbd, --cpu-range-batch-draft lo-hi " ,
" Ranges of CPUs for affinity. Complements --cpu-mask-draft-batch) " } ) ;
options . push_back ( { " speculative " , " --cpu-strict-batch-draft <0|1> " ,
" Use strict CPU placement for draft model (default: --cpu-strict-draft) " } ) ;
options . push_back ( { " speculative " , " --priority-batch-draft N " , " Set draft process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: --priority-draft) " } ) ;
options . push_back ( { " speculative " , " --poll-batch-draft <0|1> " , " Use polling to wait for draft model work (default: --poll-draft) " } ) ;
# endif // GGML_USE_OPENMP
2024-06-04 20:23:39 +02:00
options . push_back ( { " speculative " , " --draft N " , " number of tokens to draft for speculative decoding (default: %d) " , params . n_draft } ) ;
options . push_back ( { " speculative " , " -ps, --p-split N " , " speculative decoding split probability (default: %.1f) " , ( double ) params . p_split } ) ;
options . push_back ( { " * " , " -lcs, --lookup-cache-static FNAME " ,
" path to static lookup cache to use for lookup decoding (not updated by generation) " } ) ;
options . push_back ( { " * " , " -lcd, --lookup-cache-dynamic FNAME " ,
" path to dynamic lookup cache to use for lookup decoding (updated by generation) " } ) ;
options . push_back ( { " * " , " -c, --ctx-size N " , " size of the prompt context (default: %d, 0 = loaded from model) " , params . n_ctx } ) ;
options . push_back ( { " * " , " -n, --predict N " , " number of tokens to predict (default: %d, -1 = infinity, -2 = until context filled) " , params . n_predict } ) ;
options . push_back ( { " * " , " -b, --batch-size N " , " logical maximum batch size (default: %d) " , params . n_batch } ) ;
options . push_back ( { " * " , " -ub, --ubatch-size N " , " physical maximum batch size (default: %d) " , params . n_ubatch } ) ;
options . push_back ( { " * " , " --keep N " , " number of tokens to keep from the initial prompt (default: %d, -1 = all) " , params . n_keep } ) ;
options . push_back ( { " * " , " --chunks N " , " max number of chunks to process (default: %d, -1 = all) " , params . n_chunks } ) ;
options . push_back ( { " * " , " -fa, --flash-attn " , " enable Flash Attention (default: %s) " , params . flash_attn ? " enabled " : " disabled " } ) ;
2024-07-04 20:55:03 +02:00
options . push_back ( { " * " , " -p, --prompt PROMPT " , " prompt to start generation with \n "
" in conversation mode, this will be used as system prompt \n "
" (default: '%s') " , params . prompt . c_str ( ) } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " -f, --file FNAME " , " a file containing the prompt (default: none) " } ) ;
2024-06-06 15:30:58 +02:00
options . push_back ( { " * " , " --in-file FNAME " , " an input file (repeat to specify multiple files) " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " -bf, --binary-file FNAME " , " binary file containing the prompt (default: none) " } ) ;
options . push_back ( { " * " , " -e, --escape " , " process escapes sequences ( \\ n, \\ r, \\ t, \\ ', \\ \" , \\ \\ ) (default: %s) " , params . escape ? " true " : " false " } ) ;
options . push_back ( { " * " , " --no-escape " , " do not process escape sequences " } ) ;
options . push_back ( { " main " , " -ptc, --print-token-count N " , " print token count every N tokens (default: %d) " , params . n_print } ) ;
options . push_back ( { " main " , " --prompt-cache FNAME " , " file to cache prompt state for faster startup (default: none) " } ) ;
options . push_back ( { " main " , " --prompt-cache-all " , " if specified, saves user input and generations to cache as well \n "
" not supported with --interactive or other interactive options " } ) ;
options . push_back ( { " main " , " --prompt-cache-ro " , " if specified, uses the prompt cache but does not update it " } ) ;
options . push_back ( { " main " , " -r, --reverse-prompt PROMPT " ,
" halt generation at PROMPT, return control in interactive mode \n "
" can be specified more than once for multiple prompts " } ) ;
options . push_back ( { " main " , " -sp, --special " , " special tokens output enabled (default: %s) " , params . special ? " true " : " false " } ) ;
2024-07-04 20:55:03 +02:00
options . push_back ( { " main " , " -cnv, --conversation " , " run in conversation mode, does not print special tokens and suffix/prefix \n "
" if suffix/prefix are not specified, default chat template will be used \n "
" (default: %s) " , params . conversation ? " true " : " false " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " main infill " , " -i, --interactive " , " run in interactive mode (default: %s) " , params . interactive ? " true " : " false " } ) ;
options . push_back ( { " main infill " , " -if, --interactive-first " , " run in interactive mode and wait for input right away (default: %s) " , params . interactive_first ? " true " : " false " } ) ;
options . push_back ( { " main infill " , " -mli, --multiline-input " , " allows you to write or paste multiple lines without ending each in ' \\ ' " } ) ;
options . push_back ( { " main infill " , " --in-prefix-bos " , " prefix BOS to user inputs, preceding the `--in-prefix` string " } ) ;
options . push_back ( { " main infill " , " --in-prefix STRING " , " string to prefix user inputs with (default: empty) " } ) ;
options . push_back ( { " main infill " , " --in-suffix STRING " , " string to suffix after user inputs with (default: empty) " } ) ;
2024-07-27 12:45:02 +02:00
options . push_back ( { " main " , " --no-warmup " , " skip warming up the model with an empty run " } ) ;
2024-06-28 12:53:43 +02:00
options . push_back ( { " server infill " ,
" --spm-infill " , " use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. (default: %s) " , params . spm_infill ? " enabled " : " disabled " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " sampling " } ) ;
options . push_back ( { " * " , " --samplers SAMPLERS " , " samplers that will be used for generation in the order, separated by \' ; \' \n "
" (default: %s) " , sampler_type_names . c_str ( ) } ) ;
options . push_back ( { " * " , " --sampling-seq SEQUENCE " ,
" simplified sequence for samplers that will be used (default: %s) " , sampler_type_chars . c_str ( ) } ) ;
options . push_back ( { " * " , " --ignore-eos " , " ignore end of stream token and continue generating (implies --logit-bias EOS-inf) " } ) ;
options . push_back ( { " * " , " --penalize-nl " , " penalize newline tokens (default: %s) " , sparams . penalize_nl ? " true " : " false " } ) ;
options . push_back ( { " * " , " --temp N " , " temperature (default: %.1f) " , ( double ) sparams . temp } ) ;
options . push_back ( { " * " , " --top-k N " , " top-k sampling (default: %d, 0 = disabled) " , sparams . top_k } ) ;
options . push_back ( { " * " , " --top-p N " , " top-p sampling (default: %.1f, 1.0 = disabled) " , ( double ) sparams . top_p } ) ;
options . push_back ( { " * " , " --min-p N " , " min-p sampling (default: %.1f, 0.0 = disabled) " , ( double ) sparams . min_p } ) ;
options . push_back ( { " * " , " --tfs N " , " tail free sampling, parameter z (default: %.1f, 1.0 = disabled) " , ( double ) sparams . tfs_z } ) ;
options . push_back ( { " * " , " --typical N " , " locally typical sampling, parameter p (default: %.1f, 1.0 = disabled) " , ( double ) sparams . typical_p } ) ;
options . push_back ( { " * " , " --repeat-last-n N " , " last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size) " , sparams . penalty_last_n } ) ;
options . push_back ( { " * " , " --repeat-penalty N " , " penalize repeat sequence of tokens (default: %.1f, 1.0 = disabled) " , ( double ) sparams . penalty_repeat } ) ;
options . push_back ( { " * " , " --presence-penalty N " , " repeat alpha presence penalty (default: %.1f, 0.0 = disabled) " , ( double ) sparams . penalty_present } ) ;
options . push_back ( { " * " , " --frequency-penalty N " , " repeat alpha frequency penalty (default: %.1f, 0.0 = disabled) " , ( double ) sparams . penalty_freq } ) ;
options . push_back ( { " * " , " --dynatemp-range N " , " dynamic temperature range (default: %.1f, 0.0 = disabled) " , ( double ) sparams . dynatemp_range } ) ;
options . push_back ( { " * " , " --dynatemp-exp N " , " dynamic temperature exponent (default: %.1f) " , ( double ) sparams . dynatemp_exponent } ) ;
options . push_back ( { " * " , " --mirostat N " , " use Mirostat sampling. \n "
" Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used. \n "
" (default: %d, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) " , sparams . mirostat } ) ;
options . push_back ( { " * " , " --mirostat-lr N " , " Mirostat learning rate, parameter eta (default: %.1f) " , ( double ) sparams . mirostat_eta } ) ;
options . push_back ( { " * " , " --mirostat-ent N " , " Mirostat target entropy, parameter tau (default: %.1f) " , ( double ) sparams . mirostat_tau } ) ;
options . push_back ( { " * " , " -l TOKEN_ID(+/-)BIAS " , " modifies the likelihood of token appearing in the completion, \n "
" i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello', \n "
" or `--logit-bias 15043-1` to decrease likelihood of token ' Hello' " } ) ;
options . push_back ( { " main " , " --cfg-negative-prompt PROMPT " ,
" negative prompt to use for guidance (default: '%s') " , sparams . cfg_negative_prompt . c_str ( ) } ) ;
options . push_back ( { " main " , " --cfg-negative-prompt-file FNAME " ,
" negative prompt file to use for guidance " } ) ;
options . push_back ( { " main " , " --cfg-scale N " , " strength of guidance (default: %.1f, 1.0 = disable) " , ( double ) sparams . cfg_scale } ) ;
2024-06-25 13:56:49 +02:00
options . push_back ( { " main " , " --chat-template JINJA_TEMPLATE " ,
" set custom jinja chat template (default: template taken from model's metadata) \n "
2024-07-04 20:55:03 +02:00
" if suffix/prefix are specified, template will be disabled \n "
2024-06-25 13:56:49 +02:00
" only commonly used templates are accepted: \n "
" https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " grammar " } ) ;
options . push_back ( { " * " , " --grammar GRAMMAR " , " BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '%s') " , sparams . grammar . c_str ( ) } ) ;
options . push_back ( { " * " , " --grammar-file FNAME " , " file to read grammar from " } ) ;
options . push_back ( { " * " , " -j, --json-schema SCHEMA " ,
" JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object \n "
" For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead " } ) ;
options . push_back ( { " embedding " } ) ;
2024-07-05 09:05:56 +02:00
options . push_back ( { " embedding " , " --pooling {none,mean,cls,last} " ,
2024-06-04 20:23:39 +02:00
" pooling type for embeddings, use model default if unspecified " } ) ;
2024-07-05 09:05:56 +02:00
options . push_back ( { " embedding " , " --attention {causal,non-causal} " ,
" attention type for embeddings, use model default if unspecified " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " context hacking " } ) ;
options . push_back ( { " * " , " --rope-scaling {none,linear,yarn} " ,
" RoPE frequency scaling method, defaults to linear unless specified by the model " } ) ;
options . push_back ( { " * " , " --rope-scale N " , " RoPE context scaling factor, expands context by a factor of N " } ) ;
options . push_back ( { " * " , " --rope-freq-base N " , " RoPE base frequency, used by NTK-aware scaling (default: loaded from model) " } ) ;
options . push_back ( { " * " , " --rope-freq-scale N " , " RoPE frequency scaling factor, expands context by a factor of 1/N " } ) ;
options . push_back ( { " * " , " --yarn-orig-ctx N " , " YaRN: original context size of model (default: %d = model training context size) " , params . yarn_orig_ctx } ) ;
options . push_back ( { " * " , " --yarn-ext-factor N " , " YaRN: extrapolation mix factor (default: %.1f, 0.0 = full interpolation) " , ( double ) params . yarn_ext_factor } ) ;
options . push_back ( { " * " , " --yarn-attn-factor N " , " YaRN: scale sqrt(t) or attention magnitude (default: %.1f) " , ( double ) params . yarn_attn_factor } ) ;
options . push_back ( { " * " , " --yarn-beta-slow N " , " YaRN: high correction dim or alpha (default: %.1f) " , ( double ) params . yarn_beta_slow } ) ;
options . push_back ( { " * " , " --yarn-beta-fast N " , " YaRN: low correction dim or beta (default: %.1f) " , ( double ) params . yarn_beta_fast } ) ;
options . push_back ( { " * " , " -gan, --grp-attn-n N " , " group-attention factor (default: %d) " , params . grp_attn_n } ) ;
options . push_back ( { " * " , " -gaw, --grp-attn-w N " , " group-attention width (default: %.1f) " , ( double ) params . grp_attn_w } ) ;
options . push_back ( { " * " , " -dkvc, --dump-kv-cache " , " verbose print of the KV cache " } ) ;
options . push_back ( { " * " , " -nkvo, --no-kv-offload " , " disable KV offload " } ) ;
options . push_back ( { " * " , " -ctk, --cache-type-k TYPE " , " KV cache data type for K (default: %s) " , params . cache_type_k . c_str ( ) } ) ;
options . push_back ( { " * " , " -ctv, --cache-type-v TYPE " , " KV cache data type for V (default: %s) " , params . cache_type_v . c_str ( ) } ) ;
options . push_back ( { " perplexity " } ) ;
options . push_back ( { " perplexity " , " --all-logits " , " return logits for all tokens in the batch (default: %s) " , params . logits_all ? " true " : " false " } ) ;
options . push_back ( { " perplexity " , " --hellaswag " , " compute HellaSwag score over random tasks from datafile supplied with -f " } ) ;
options . push_back ( { " perplexity " , " --hellaswag-tasks N " , " number of tasks to use when computing the HellaSwag score (default: %zu) " , params . hellaswag_tasks } ) ;
options . push_back ( { " perplexity " , " --winogrande " , " compute Winogrande score over random tasks from datafile supplied with -f " } ) ;
options . push_back ( { " perplexity " , " --winogrande-tasks N " , " number of tasks to use when computing the Winogrande score (default: %zu) " , params . winogrande_tasks } ) ;
options . push_back ( { " perplexity " , " --multiple-choice " , " compute multiple choice score over random tasks from datafile supplied with -f " } ) ;
options . push_back ( { " perplexity " , " --multiple-choice-tasks N " ,
" number of tasks to use when computing the multiple choice score (default: %zu) " , params . multiple_choice_tasks } ) ;
options . push_back ( { " perplexity " , " --kl-divergence " , " computes KL-divergence to logits provided via --kl-divergence-base " } ) ;
options . push_back ( { " perplexity " , " --ppl-stride N " , " stride for perplexity calculation (default: %d) " , params . ppl_stride } ) ;
options . push_back ( { " perplexity " , " --ppl-output-type {0,1} " ,
" output type for perplexity calculation (default: %d) " , params . ppl_output_type } ) ;
options . push_back ( { " parallel " } ) ;
options . push_back ( { " * " , " -dt, --defrag-thold N " , " KV cache defragmentation threshold (default: %.1f, < 0 - disabled) " , ( double ) params . defrag_thold } ) ;
options . push_back ( { " * " , " -np, --parallel N " , " number of parallel sequences to decode (default: %d) " , params . n_parallel } ) ;
options . push_back ( { " * " , " -ns, --sequences N " , " number of sequences to decode (default: %d) " , params . n_sequences } ) ;
options . push_back ( { " * " , " -cb, --cont-batching " , " enable continuous batching (a.k.a dynamic batching) (default: %s) " , params . cont_batching ? " enabled " : " disabled " } ) ;
2024-07-15 13:54:58 +02:00
options . push_back ( { " * " , " -nocb, --no-cont-batching " , " disable continuous batching " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " multi-modality " } ) ;
options . push_back ( { " * " , " --mmproj FILE " , " path to a multimodal projector file for LLaVA. see examples/llava/README.md " } ) ;
options . push_back ( { " * " , " --image FILE " , " path to an image file. use with multimodal models. Specify multiple times for batching " } ) ;
options . push_back ( { " backend " } ) ;
options . push_back ( { " * " , " --rpc SERVERS " , " comma separated list of RPC servers " } ) ;
2024-06-21 07:38:22 +02:00
2024-01-31 16:30:17 +01:00
if ( llama_supports_mlock ( ) ) {
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " --mlock " , " force system to keep model in RAM rather than swapping or compressing " } ) ;
2023-03-24 16:19:05 +01:00
}
2024-01-31 16:30:17 +01:00
if ( llama_supports_mmap ( ) ) {
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " --no-mmap " , " do not memory-map model (slower load but may reduce pageouts if not using mlock) " } ) ;
}
options . push_back ( { " * " , " --numa TYPE " , " attempt optimizations that help on some NUMA systems \n "
" - distribute: spread execution evenly over all nodes \n "
" - isolate: only spawn threads on CPUs on the node that execution started on \n "
" - numactl: use the CPU map provided by numactl \n "
" if run without this previously, it is recommended to drop the system page cache before using this \n "
" see https://github.com/ggerganov/llama.cpp/issues/1437 " } ) ;
2024-01-31 16:30:17 +01:00
if ( llama_supports_gpu_offload ( ) ) {
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " -ngl, --gpu-layers N " ,
" number of layers to store in VRAM " } ) ;
options . push_back ( { " * " , " -ngld, --gpu-layers-draft N " ,
" number of layers to store in VRAM for the draft model " } ) ;
options . push_back ( { " * " , " -sm, --split-mode SPLIT_MODE " ,
" how to split the model across multiple GPUs, one of: \n "
" - none: use one GPU only \n "
" - layer (default): split layers and KV across GPUs \n "
" - row: split rows across GPUs " } ) ;
options . push_back ( { " * " , " -ts, --tensor-split SPLIT " ,
" fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1 " } ) ;
options . push_back ( { " * " , " -mg, --main-gpu i " , " the GPU to use for the model (with split-mode = none), \n "
" or for intermediate results and KV (with split-mode = row) (default: %d) " , params . main_gpu } ) ;
}
options . push_back ( { " model " } ) ;
options . push_back ( { " * " , " --check-tensors " , " check model tensor data for invalid values (default: %s) " , params . check_tensors ? " true " : " false " } ) ;
options . push_back ( { " * " , " --override-kv KEY=TYPE:VALUE " ,
" advanced option to override model metadata by key. may be specified multiple times. \n "
" types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false " } ) ;
2024-07-23 23:48:37 +02:00
options . push_back ( { " * " , " --lora FNAME " , " apply LoRA adapter (can be repeated to use multiple adapters) " } ) ;
options . push_back ( { " * " , " --lora-scaled FNAME S " , " apply LoRA adapter with user defined scaling S (can be repeated to use multiple adapters) " } ) ;
2024-06-25 10:44:48 +02:00
options . push_back ( { " * " , " --control-vector FNAME " , " add a control vector \n "
" note: this argument can be repeated to add multiple control vectors " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " --control-vector-scaled FNAME SCALE " ,
2024-06-25 10:44:48 +02:00
" add a control vector with user defined scaling SCALE \n "
" note: this argument can be repeated to add multiple scaled control vectors " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " * " , " --control-vector-layer-range START END " ,
" layer range to apply the control vector(s) to, start and end inclusive " } ) ;
options . push_back ( { " * " , " -m, --model FNAME " , " model path (default: models/$filename with filename from --hf-file \n "
" or --model-url if set, otherwise %s) " , DEFAULT_MODEL_PATH } ) ;
options . push_back ( { " * " , " -md, --model-draft FNAME " , " draft model for speculative decoding (default: unused) " } ) ;
options . push_back ( { " * " , " -mu, --model-url MODEL_URL " , " model download url (default: unused) " } ) ;
options . push_back ( { " * " , " -hfr, --hf-repo REPO " , " Hugging Face model repository (default: unused) " } ) ;
options . push_back ( { " * " , " -hff, --hf-file FILE " , " Hugging Face model file (default: unused) " } ) ;
2024-07-06 22:32:04 +02:00
options . push_back ( { " * " , " -hft, --hf-token TOKEN " , " Hugging Face access token (default: value from HF_TOKEN environment variable) " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " retrieval " } ) ;
options . push_back ( { " retrieval " , " --context-file FNAME " , " file to load context from (repeat to specify multiple files) " } ) ;
options . push_back ( { " retrieval " , " --chunk-size N " , " minimum length of embedded text chunks (default: %d) " , params . chunk_size } ) ;
options . push_back ( { " retrieval " , " --chunk-separator STRING " ,
" separator between chunks (default: '%s') " , params . chunk_separator . c_str ( ) } ) ;
options . push_back ( { " passkey " } ) ;
options . push_back ( { " passkey " , " --junk N " , " number of times to repeat the junk text (default: %d) " , params . n_junk } ) ;
options . push_back ( { " passkey " , " --pos N " , " position of the passkey in the junk text (default: %d) " , params . i_pos } ) ;
2024-06-06 15:30:58 +02:00
options . push_back ( { " imatrix " } ) ;
options . push_back ( { " imatrix " , " -o, --output FNAME " , " output file (default: '%s') " , params . out_file . c_str ( ) } ) ;
options . push_back ( { " imatrix " , " --output-frequency N " , " output the imatrix every N iterations (default: %d) " , params . n_out_freq } ) ;
options . push_back ( { " imatrix " , " --save-frequency N " , " save an imatrix copy every N iterations (default: %d) " , params . n_save_freq } ) ;
options . push_back ( { " imatrix " , " --process-output " , " collect data for the output tensor (default: %s) " , params . process_output ? " true " : " false " } ) ;
options . push_back ( { " imatrix " , " --no-ppl " , " do not compute perplexity (default: %s) " , params . compute_ppl ? " true " : " false " } ) ;
options . push_back ( { " imatrix " , " --chunk N " , " start processing the input from chunk N (default: %d) " , params . i_chunk } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " bench " } ) ;
options . push_back ( { " bench " , " -pps " , " is the prompt shared across parallel sequences (default: %s) " , params . is_pp_shared ? " true " : " false " } ) ;
options . push_back ( { " bench " , " -npp n0,n1,... " , " number of prompt tokens " } ) ;
options . push_back ( { " bench " , " -ntg n0,n1,... " , " number of text generation tokens " } ) ;
options . push_back ( { " bench " , " -npl n0,n1,... " , " number of parallel prompts " } ) ;
2024-06-24 07:30:24 +02:00
options . push_back ( { " embedding " } ) ;
options . push_back ( { " embedding " , " --embd-normalize " , " normalisation for embendings (default: %d) (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm) " , params . embd_normalize } ) ;
options . push_back ( { " embedding " , " --embd-output-format " , " empty = default, \" array \" = [[],[]...], \" json \" = openai style, \" json+ \" = same \" json \" + cosine similarity matrix " } ) ;
options . push_back ( { " embedding " , " --embd-separator " , " separator of embendings (default \\ n) for example \" <#sep#> \" " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " server " } ) ;
options . push_back ( { " server " , " --host HOST " , " ip address to listen (default: %s) " , params . hostname . c_str ( ) } ) ;
options . push_back ( { " server " , " --port PORT " , " port to listen (default: %d) " , params . port } ) ;
options . push_back ( { " server " , " --path PATH " , " path to serve static files from (default: %s) " , params . public_path . c_str ( ) } ) ;
2024-08-01 01:59:09 +02:00
options . push_back ( { " server " , " --embedding(s) " , " restrict to only support embedding use case; use only with dedicated embedding models (default: %s) " , params . embedding ? " enabled " : " disabled " } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " server " , " --api-key KEY " , " API key to use for authentication (default: none) " } ) ;
options . push_back ( { " server " , " --api-key-file FNAME " , " path to file containing API keys (default: none) " } ) ;
options . push_back ( { " server " , " --ssl-key-file FNAME " , " path to file a PEM-encoded SSL private key " } ) ;
options . push_back ( { " server " , " --ssl-cert-file FNAME " , " path to file a PEM-encoded SSL certificate " } ) ;
options . push_back ( { " server " , " --timeout N " , " server read/write timeout in seconds (default: %d) " , params . timeout_read } ) ;
2024-06-06 18:19:59 +02:00
options . push_back ( { " server " , " --threads-http N " , " number of threads used to process HTTP requests (default: %d) " , params . n_threads_http } ) ;
2024-06-04 20:23:39 +02:00
options . push_back ( { " server " , " --system-prompt-file FNAME " ,
" set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications " } ) ;
options . push_back ( { " server " , " --log-format {text,json} " ,
" log output format: json or text (default: json) " } ) ;
options . push_back ( { " server " , " --metrics " , " enable prometheus compatible metrics endpoint (default: %s) " , params . endpoint_metrics ? " enabled " : " disabled " } ) ;
options . push_back ( { " server " , " --no-slots " , " disables slots monitoring endpoint (default: %s) " , params . endpoint_slots ? " enabled " : " disabled " } ) ;
options . push_back ( { " server " , " --slot-save-path PATH " , " path to save slot kv cache (default: disabled) " } ) ;
options . push_back ( { " server " , " --chat-template JINJA_TEMPLATE " ,
" set custom jinja chat template (default: template taken from model's metadata) \n "
" only commonly used templates are accepted: \n "
" https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template " } ) ;
2024-06-08 09:50:31 +02:00
options . push_back ( { " server " , " -sps, --slot-prompt-similarity SIMILARITY " ,
" how much the prompt of a request must match the prompt of a slot in order to use that slot (default: %.2f, 0.0 = disabled) \n " , params . slot_prompt_similarity } ) ;
2024-08-06 17:33:39 +02:00
options . push_back ( { " server " , " --lora-init-without-apply " , " load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: %s) " , params . lora_init_without_apply ? " enabled " : " disabled " } ) ;
2024-06-04 20:23:39 +02:00
2023-11-01 18:42:01 +01:00
# ifndef LOG_DISABLE_LOGS
2024-06-04 20:23:39 +02:00
options . push_back ( { " logging " } ) ;
options . push_back ( { " * " , " --simple-io " , " use basic IO for better compatibility in subprocesses and limited consoles " } ) ;
options . push_back ( { " * " , " -ld, --logdir LOGDIR " , " path under which to save YAML logs (no logging if unset) " } ) ;
options . push_back ( { " logging " , " --log-test " , " Run simple logging test " } ) ;
options . push_back ( { " logging " , " --log-disable " , " Disable trace logs " } ) ;
options . push_back ( { " logging " , " --log-enable " , " Enable trace logs " } ) ;
options . push_back ( { " logging " , " --log-file FNAME " , " Specify a log filename (without extension) " } ) ;
options . push_back ( { " logging " , " --log-new " , " Create a separate new log file on start. "
" Each log file will have unique name: \" <name>.<ID>.log \" " } ) ;
options . push_back ( { " logging " , " --log-append " , " Don't truncate the old log file. " } ) ;
2023-11-01 18:42:01 +01:00
# endif // LOG_DISABLE_LOGS
2024-06-04 20:23:39 +02:00
2024-06-15 18:53:40 +02:00
options . push_back ( { " cvector " } ) ;
options . push_back ( { " cvector " , " -o, --output FNAME " , " output file (default: '%s') " , params . cvector_outfile . c_str ( ) } ) ;
options . push_back ( { " cvector " , " --positive-file FNAME " , " positive prompts file, one prompt per line (default: '%s') " , params . cvector_positive_file . c_str ( ) } ) ;
options . push_back ( { " cvector " , " --negative-file FNAME " , " negative prompts file, one prompt per line (default: '%s') " , params . cvector_negative_file . c_str ( ) } ) ;
2024-06-22 18:11:30 +02:00
options . push_back ( { " cvector " , " --pca-batch N " , " batch size used for PCA. Larger batch runs faster, but uses more memory (default: %d) " , params . n_pca_batch } ) ;
options . push_back ( { " cvector " , " --pca-iter N " , " number of iterations used for PCA (default: %d) " , params . n_pca_iterations } ) ;
2024-06-25 13:59:54 +02:00
options . push_back ( { " cvector " , " --method {pca,mean} " , " dimensionality reduction method to be used (default: pca) " } ) ;
2024-06-15 18:53:40 +02:00
2024-07-23 23:48:37 +02:00
options . push_back ( { " export-lora " } ) ;
options . push_back ( { " export-lora " , " -m, --model " , " model path from which to load base model (default '%s') " , params . model . c_str ( ) } ) ;
options . push_back ( { " export-lora " , " --lora FNAME " , " path to LoRA adapter (can be repeated to use multiple adapters) " } ) ;
options . push_back ( { " export-lora " , " --lora-scaled FNAME S " , " path to LoRA adapter with user defined scaling S (can be repeated to use multiple adapters) " } ) ;
options . push_back ( { " export-lora " , " -o, --output FNAME " , " output file (default: '%s') " , params . lora_outfile . c_str ( ) } ) ;
2024-06-04 20:23:39 +02:00
printf ( " usage: %s [options] \n " , argv [ 0 ] ) ;
for ( const auto & o : options ) {
if ( ! o . grp . empty ( ) ) {
printf ( " \n %s: \n \n " , o . grp . c_str ( ) ) ;
continue ;
}
printf ( " %-32s " , o . args . c_str ( ) ) ;
if ( o . args . length ( ) > 30 ) {
printf ( " \n %34s " , " " ) ;
}
const auto desc = o . desc ;
size_t start = 0 ;
size_t end = desc . find ( ' \n ' ) ;
while ( end ! = std : : string : : npos ) {
printf ( " %s \n %34s " , desc . substr ( start , end - start ) . c_str ( ) , " " ) ;
start = end + 1 ;
end = desc . find ( ' \n ' , start ) ;
}
printf ( " %s \n " , desc . substr ( start ) . c_str ( ) ) ;
}
printf ( " \n " ) ;
2023-03-10 19:40:58 +01:00
}
2024-05-22 19:04:20 +02:00
std : : string gpt_params_get_system_info ( const gpt_params & params ) {
2023-09-28 21:42:38 +02:00
std : : ostringstream os ;
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
os < < " system_info: n_threads = " < < params . cpuparams . n_threads ;
if ( params . cpuparams_batch . n_threads ! = - 1 ) {
os < < " (n_threads_batch = " < < params . cpuparams_batch . n_threads < < " ) " ;
2023-09-28 21:42:38 +02:00
}
2024-08-16 08:23:12 +02:00
# if defined(_WIN32) && (_WIN32_WINNT >= 0x0601) && !defined(__MINGW64__) // windows 7 and later
// TODO: windows + arm64 + mingw64
DWORD logicalProcessorCount = GetActiveProcessorCount ( ALL_PROCESSOR_GROUPS ) ;
os < < " / " < < logicalProcessorCount < < " | " < < llama_print_system_info ( ) ;
# else
2023-09-28 21:42:38 +02:00
os < < " / " < < std : : thread : : hardware_concurrency ( ) < < " | " < < llama_print_system_info ( ) ;
2024-08-16 08:23:12 +02:00
# endif
2023-09-28 21:42:38 +02:00
return os . str ( ) ;
}
2024-05-22 19:04:20 +02:00
//
// String utils
//
std : : vector < std : : string > string_split ( std : : string input , char separator ) {
std : : vector < std : : string > parts ;
size_t separator_pos = input . find ( separator ) ;
while ( separator_pos ! = std : : string : : npos ) {
std : : string part = input . substr ( 0 , separator_pos ) ;
parts . emplace_back ( part ) ;
input = input . substr ( separator_pos + 1 ) ;
separator_pos = input . find ( separator ) ;
}
parts . emplace_back ( input ) ;
return parts ;
}
std : : string string_strip ( const std : : string & str ) {
size_t start = 0 ;
size_t end = str . size ( ) ;
while ( start < end & & std : : isspace ( str [ start ] ) ) {
start + + ;
}
while ( end > start & & std : : isspace ( str [ end - 1 ] ) ) {
end - - ;
}
return str . substr ( start , end - start ) ;
}
std : : string string_get_sortable_timestamp ( ) {
using clock = std : : chrono : : system_clock ;
const clock : : time_point current_time = clock : : now ( ) ;
const time_t as_time_t = clock : : to_time_t ( current_time ) ;
char timestamp_no_ns [ 100 ] ;
std : : strftime ( timestamp_no_ns , 100 , " %Y_%m_%d-%H_%M_%S " , std : : localtime ( & as_time_t ) ) ;
const int64_t ns = std : : chrono : : duration_cast < std : : chrono : : nanoseconds > (
current_time . time_since_epoch ( ) % 1000000000 ) . count ( ) ;
char timestamp_ns [ 11 ] ;
snprintf ( timestamp_ns , 11 , " %09 " PRId64 , ns ) ;
return std : : string ( timestamp_no_ns ) + " . " + std : : string ( timestamp_ns ) ;
}
2024-08-09 17:23:52 +02:00
void string_replace_all ( std : : string & s , const std : : string & search , const std : : string & replace ) {
if ( search . empty ( ) ) {
2024-08-26 08:09:53 +02:00
return ;
2024-08-09 17:23:52 +02:00
}
2024-08-26 08:09:53 +02:00
std : : string builder ;
builder . reserve ( s . length ( ) ) ;
2024-08-09 17:23:52 +02:00
size_t pos = 0 ;
2024-08-26 08:09:53 +02:00
size_t last_pos = 0 ;
while ( ( pos = s . find ( search , last_pos ) ) ! = std : : string : : npos ) {
builder . append ( s , last_pos , pos - last_pos ) ;
builder . append ( replace ) ;
last_pos = pos + search . length ( ) ;
}
builder . append ( s , last_pos , std : : string : : npos ) ;
s = std : : move ( builder ) ;
2024-08-09 17:23:52 +02:00
}
2024-05-22 19:04:20 +02:00
void string_process_escapes ( std : : string & input ) {
std : : size_t input_len = input . length ( ) ;
std : : size_t output_idx = 0 ;
for ( std : : size_t input_idx = 0 ; input_idx < input_len ; + + input_idx ) {
if ( input [ input_idx ] = = ' \\ ' & & input_idx + 1 < input_len ) {
switch ( input [ + + input_idx ] ) {
case ' n ' : input [ output_idx + + ] = ' \n ' ; break ;
case ' r ' : input [ output_idx + + ] = ' \r ' ; break ;
case ' t ' : input [ output_idx + + ] = ' \t ' ; break ;
case ' \' ' : input [ output_idx + + ] = ' \' ' ; break ;
case ' \" ' : input [ output_idx + + ] = ' \" ' ; break ;
case ' \\ ' : input [ output_idx + + ] = ' \\ ' ; break ;
case ' x ' :
// Handle \x12, etc
if ( input_idx + 2 < input_len ) {
const char x [ 3 ] = { input [ input_idx + 1 ] , input [ input_idx + 2 ] , 0 } ;
char * err_p = nullptr ;
const long val = std : : strtol ( x , & err_p , 16 ) ;
if ( err_p = = x + 2 ) {
input_idx + = 2 ;
input [ output_idx + + ] = char ( val ) ;
break ;
}
}
// fall through
default : input [ output_idx + + ] = ' \\ ' ;
input [ output_idx + + ] = input [ input_idx ] ; break ;
}
} else {
input [ output_idx + + ] = input [ input_idx ] ;
}
}
input . resize ( output_idx ) ;
}
bool string_parse_kv_override ( const char * data , std : : vector < llama_model_kv_override > & overrides ) {
const char * sep = strchr ( data , ' = ' ) ;
if ( sep = = nullptr | | sep - data > = 128 ) {
fprintf ( stderr , " %s: malformed KV override '%s' \n " , __func__ , data ) ;
return false ;
}
llama_model_kv_override kvo ;
std : : strncpy ( kvo . key , data , sep - data ) ;
kvo . key [ sep - data ] = 0 ;
sep + + ;
if ( strncmp ( sep , " int: " , 4 ) = = 0 ) {
sep + = 4 ;
kvo . tag = LLAMA_KV_OVERRIDE_TYPE_INT ;
kvo . val_i64 = std : : atol ( sep ) ;
} else if ( strncmp ( sep , " float: " , 6 ) = = 0 ) {
sep + = 6 ;
kvo . tag = LLAMA_KV_OVERRIDE_TYPE_FLOAT ;
kvo . val_f64 = std : : atof ( sep ) ;
} else if ( strncmp ( sep , " bool: " , 5 ) = = 0 ) {
sep + = 5 ;
kvo . tag = LLAMA_KV_OVERRIDE_TYPE_BOOL ;
if ( std : : strcmp ( sep , " true " ) = = 0 ) {
kvo . val_bool = true ;
} else if ( std : : strcmp ( sep , " false " ) = = 0 ) {
kvo . val_bool = false ;
} else {
fprintf ( stderr , " %s: invalid boolean value for KV override '%s' \n " , __func__ , data ) ;
return false ;
}
} else if ( strncmp ( sep , " str: " , 4 ) = = 0 ) {
sep + = 4 ;
kvo . tag = LLAMA_KV_OVERRIDE_TYPE_STR ;
if ( strlen ( sep ) > 127 ) {
fprintf ( stderr , " %s: malformed KV override '%s', value cannot exceed 127 chars \n " , __func__ , data ) ;
return false ;
}
strncpy ( kvo . val_str , sep , 127 ) ;
kvo . val_str [ 127 ] = ' \0 ' ;
} else {
fprintf ( stderr , " %s: invalid type for KV override '%s' \n " , __func__ , data ) ;
return false ;
}
overrides . emplace_back ( std : : move ( kvo ) ) ;
return true ;
}
//
// Filesystem utils
//
// Validate if a filename is safe to use
// To validate a full path, split the path by the OS-specific path separator, and validate each part with this function
bool fs_validate_filename ( const std : : string & filename ) {
if ( ! filename . length ( ) ) {
// Empty filename invalid
return false ;
2024-04-08 14:43:30 +02:00
}
if ( filename . length ( ) > 255 ) {
// Limit at common largest possible filename on Linux filesystems
// to avoid unnecessary further validation
// (On systems with smaller limits it will be caught by the OS)
return false ;
}
std : : u32string filename_utf32 ;
try {
std : : wstring_convert < std : : codecvt_utf8 < char32_t > , char32_t > converter ;
filename_utf32 = converter . from_bytes ( filename ) ;
// If the reverse conversion mismatches, it means overlong UTF-8 sequences were used,
// or invalid encodings were encountered. Reject such attempts
std : : string filename_reencoded = converter . to_bytes ( filename_utf32 ) ;
if ( filename_reencoded ! = filename ) {
return false ;
}
} catch ( const std : : exception & ) {
return false ;
}
// Check for forbidden codepoints:
// - Control characters
// - Unicode equivalents of illegal characters
// - UTF-16 surrogate pairs
// - UTF-8 replacement character
// - Byte order mark (BOM)
// - Illegal characters: / \ : * ? " < > |
for ( char32_t c : filename_utf32 ) {
if ( c < = 0x1F // Control characters (C0)
| | c = = 0x7F // Control characters (DEL)
| | ( c > = 0x80 & & c < = 0x9F ) // Control characters (C1)
| | c = = 0xFF0E // Fullwidth Full Stop (period equivalent)
| | c = = 0x2215 // Division Slash (forward slash equivalent)
| | c = = 0x2216 // Set Minus (backslash equivalent)
| | ( c > = 0xD800 & & c < = 0xDFFF ) // UTF-16 surrogate pairs
| | c = = 0xFFFD // Replacement Character (UTF-8)
| | c = = 0xFEFF // Byte Order Mark (BOM)
| | c = = ' / ' | | c = = ' \\ ' | | c = = ' : ' | | c = = ' * ' // Illegal characters
| | c = = ' ? ' | | c = = ' " ' | | c = = ' < ' | | c = = ' > ' | | c = = ' | ' ) {
return false ;
}
}
// Reject any leading or trailing ' ', or any trailing '.', these are stripped on Windows and will cause a different filename
// Unicode and other whitespace is not affected, only 0x20 space
if ( filename . front ( ) = = ' ' | | filename . back ( ) = = ' ' | | filename . back ( ) = = ' . ' ) {
return false ;
}
// Reject any ".." (currently stricter than necessary, it should be fine to just check for == ".." instead)
if ( filename . find ( " .. " ) ! = std : : string : : npos ) {
return false ;
}
// Reject "."
if ( filename = = " . " ) {
return false ;
}
return true ;
}
2024-05-22 19:04:20 +02:00
// returns true if successful, false otherwise
bool fs_create_directory_with_parents ( const std : : string & path ) {
# ifdef _WIN32
std : : wstring_convert < std : : codecvt_utf8 < wchar_t > > converter ;
std : : wstring wpath = converter . from_bytes ( path ) ;
2023-12-05 11:05:51 +01:00
2024-05-22 19:04:20 +02:00
// if the path already exists, check whether it's a directory
const DWORD attributes = GetFileAttributesW ( wpath . c_str ( ) ) ;
if ( ( attributes ! = INVALID_FILE_ATTRIBUTES ) & & ( attributes & FILE_ATTRIBUTE_DIRECTORY ) ) {
return true ;
2024-02-11 14:43:31 +01:00
}
2024-05-22 19:04:20 +02:00
size_t pos_slash = 0 ;
2024-04-29 15:58:41 +02:00
2024-05-22 19:04:20 +02:00
// process path from front to back, procedurally creating directories
while ( ( pos_slash = path . find ( ' \\ ' , pos_slash ) ) ! = std : : string : : npos ) {
const std : : wstring subpath = wpath . substr ( 0 , pos_slash ) ;
const wchar_t * test = subpath . c_str ( ) ;
2024-02-16 12:33:25 +01:00
2024-05-22 19:04:20 +02:00
const bool success = CreateDirectoryW ( test , NULL ) ;
if ( ! success ) {
const DWORD error = GetLastError ( ) ;
2023-12-05 11:05:51 +01:00
2024-05-22 19:04:20 +02:00
// if the path already exists, ensure that it's a directory
if ( error = = ERROR_ALREADY_EXISTS ) {
const DWORD attributes = GetFileAttributesW ( subpath . c_str ( ) ) ;
if ( attributes = = INVALID_FILE_ATTRIBUTES | | ! ( attributes & FILE_ATTRIBUTE_DIRECTORY ) ) {
return false ;
2024-02-16 12:33:25 +01:00
}
2024-05-22 19:04:20 +02:00
} else {
return false ;
2024-02-16 12:33:25 +01:00
}
}
2024-05-22 19:04:20 +02:00
pos_slash + = 1 ;
2023-12-05 11:05:51 +01:00
}
2024-02-11 14:43:31 +01:00
2024-05-22 19:04:20 +02:00
return true ;
# else
// if the path already exists, check whether it's a directory
struct stat info ;
if ( stat ( path . c_str ( ) , & info ) = = 0 ) {
return S_ISDIR ( info . st_mode ) ;
}
size_t pos_slash = 1 ; // skip leading slashes for directory creation
// process path from front to back, procedurally creating directories
while ( ( pos_slash = path . find ( ' / ' , pos_slash ) ) ! = std : : string : : npos ) {
const std : : string subpath = path . substr ( 0 , pos_slash ) ;
struct stat info ;
2024-02-11 14:43:31 +01:00
2024-05-22 19:04:20 +02:00
// if the path already exists, ensure that it's a directory
if ( stat ( subpath . c_str ( ) , & info ) = = 0 ) {
if ( ! S_ISDIR ( info . st_mode ) ) {
return false ;
}
} else {
// create parent directories
const int ret = mkdir ( subpath . c_str ( ) , 0755 ) ;
if ( ret ! = 0 ) {
return false ;
}
2024-02-11 14:43:31 +01:00
}
2024-05-22 19:04:20 +02:00
pos_slash + = 1 ;
2024-02-11 14:43:31 +01:00
}
2024-05-22 19:04:20 +02:00
return true ;
# endif // _WIN32
2024-02-11 14:43:31 +01:00
}
2024-05-22 19:04:20 +02:00
std : : string fs_get_cache_directory ( ) {
std : : string cache_directory = " " ;
2024-05-25 05:30:59 +02:00
auto ensure_trailing_slash = [ ] ( std : : string p ) {
// Make sure to add trailing slash
if ( p . back ( ) ! = DIRECTORY_SEPARATOR ) {
p + = DIRECTORY_SEPARATOR ;
}
return p ;
} ;
2024-05-22 19:04:20 +02:00
if ( getenv ( " LLAMA_CACHE " ) ) {
cache_directory = std : : getenv ( " LLAMA_CACHE " ) ;
} else {
# ifdef __linux__
if ( std : : getenv ( " XDG_CACHE_HOME " ) ) {
cache_directory = std : : getenv ( " XDG_CACHE_HOME " ) ;
} else {
cache_directory = std : : getenv ( " HOME " ) + std : : string ( " /.cache/ " ) ;
}
# elif defined(__APPLE__)
cache_directory = std : : getenv ( " HOME " ) + std : : string ( " /Library/Caches/ " ) ;
# elif defined(_WIN32)
2024-05-25 05:30:59 +02:00
cache_directory = std : : getenv ( " LOCALAPPDATA " ) ;
2024-05-22 19:04:20 +02:00
# endif // __linux__
2024-05-25 05:30:59 +02:00
cache_directory = ensure_trailing_slash ( cache_directory ) ;
2024-05-22 19:04:20 +02:00
cache_directory + = " llama.cpp " ;
2023-12-05 11:05:51 +01:00
}
2024-05-25 05:30:59 +02:00
return ensure_trailing_slash ( cache_directory ) ;
2023-12-05 11:05:51 +01:00
}
2024-06-08 21:21:08 +02:00
std : : string fs_get_cache_file ( const std : : string & filename ) {
GGML_ASSERT ( filename . find ( DIRECTORY_SEPARATOR ) = = std : : string : : npos ) ;
std : : string cache_directory = fs_get_cache_directory ( ) ;
const bool success = fs_create_directory_with_parents ( cache_directory ) ;
if ( ! success ) {
throw std : : runtime_error ( " failed to create cache directory: " + cache_directory ) ;
}
return cache_directory + filename ;
}
2024-05-22 19:04:20 +02:00
2023-08-21 22:07:43 +02:00
//
// Model utils
//
2024-08-05 18:14:10 +02:00
struct llama_init_result llama_init_from_gpt_params ( gpt_params & params ) {
llama_init_result iparams ;
2024-05-22 19:04:20 +02:00
auto mparams = llama_model_params_from_gpt_params ( params ) ;
2023-05-02 22:39:51 +02:00
2024-05-22 19:04:20 +02:00
llama_model * model = nullptr ;
if ( ! params . hf_repo . empty ( ) & & ! params . hf_file . empty ( ) ) {
2024-07-06 22:32:04 +02:00
model = llama_load_model_from_hf ( params . hf_repo . c_str ( ) , params . hf_file . c_str ( ) , params . model . c_str ( ) , params . hf_token . c_str ( ) , mparams ) ;
2024-05-22 19:04:20 +02:00
} else if ( ! params . model_url . empty ( ) ) {
2024-07-06 22:32:04 +02:00
model = llama_load_model_from_url ( params . model_url . c_str ( ) , params . model . c_str ( ) , params . hf_token . c_str ( ) , mparams ) ;
2023-12-05 18:19:18 +01:00
} else {
2024-05-22 19:04:20 +02:00
model = llama_load_model_from_file ( params . model . c_str ( ) , mparams ) ;
2023-12-05 18:19:18 +01:00
}
2023-09-28 21:42:38 +02:00
2024-05-22 19:04:20 +02:00
if ( model = = NULL ) {
fprintf ( stderr , " %s: error: failed to load model '%s' \n " , __func__ , params . model . c_str ( ) ) ;
2024-08-05 18:14:10 +02:00
return iparams ;
2023-12-07 12:03:17 +01:00
}
2024-05-22 19:04:20 +02:00
auto cparams = llama_context_params_from_gpt_params ( params ) ;
llama_context * lctx = llama_new_context_with_model ( model , cparams ) ;
if ( lctx = = NULL ) {
fprintf ( stderr , " %s: error: failed to create context with model '%s' \n " , __func__ , params . model . c_str ( ) ) ;
llama_free_model ( model ) ;
2024-08-05 18:14:10 +02:00
return iparams ;
2024-05-22 19:04:20 +02:00
}
if ( ! params . control_vectors . empty ( ) ) {
if ( params . control_vector_layer_start < = 0 ) params . control_vector_layer_start = 1 ;
if ( params . control_vector_layer_end < = 0 ) params . control_vector_layer_end = llama_n_layer ( model ) ;
const auto cvec = llama_control_vector_load ( params . control_vectors ) ;
if ( cvec . n_embd = = - 1 ) {
llama_free ( lctx ) ;
llama_free_model ( model ) ;
2024-08-05 18:14:10 +02:00
return iparams ;
2024-05-22 19:04:20 +02:00
}
int err = llama_control_vector_apply ( lctx ,
cvec . data . data ( ) ,
cvec . data . size ( ) ,
cvec . n_embd ,
params . control_vector_layer_start ,
params . control_vector_layer_end ) ;
if ( err ) {
llama_free ( lctx ) ;
llama_free_model ( model ) ;
2024-08-05 18:14:10 +02:00
return iparams ;
2024-05-22 19:04:20 +02:00
}
}
2024-08-06 17:33:39 +02:00
// load and optionally apply lora adapters
for ( auto & la : params . lora_adapters ) {
llama_lora_adapter_container loaded_la ;
loaded_la . path = la . path ;
loaded_la . scale = la . scale ;
loaded_la . adapter = llama_lora_adapter_init ( model , la . path . c_str ( ) ) ;
if ( loaded_la . adapter = = nullptr ) {
fprintf ( stderr , " %s: error: failed to apply lora adapter '%s' \n " , __func__ , la . path . c_str ( ) ) ;
2024-05-22 19:04:20 +02:00
llama_free ( lctx ) ;
llama_free_model ( model ) ;
2024-08-05 18:14:10 +02:00
return iparams ;
2024-05-22 19:04:20 +02:00
}
2024-08-06 17:33:39 +02:00
iparams . lora_adapters . push_back ( loaded_la ) ; // copy to list of loaded adapters
}
if ( ! params . lora_init_without_apply ) {
llama_lora_adapters_apply ( lctx , iparams . lora_adapters ) ;
2024-05-22 19:04:20 +02:00
}
if ( params . ignore_eos ) {
params . sparams . logit_bias [ llama_token_eos ( model ) ] = - INFINITY ;
}
if ( params . warmup ) {
LOG ( " warming up the model with an empty run \n " ) ;
2024-07-04 15:46:11 +02:00
std : : vector < llama_token > tmp ;
llama_token bos = llama_token_bos ( model ) ;
llama_token eos = llama_token_eos ( model ) ;
// some models (e.g. T5) don't have a BOS token
if ( bos ! = - 1 ) {
tmp . push_back ( bos ) ;
}
tmp . push_back ( eos ) ;
if ( llama_model_has_encoder ( model ) ) {
llama_encode ( lctx , llama_batch_get_one ( tmp . data ( ) , tmp . size ( ) , 0 , 0 ) ) ;
llama_token decoder_start_token_id = llama_model_decoder_start_token ( model ) ;
if ( decoder_start_token_id = = - 1 ) {
decoder_start_token_id = bos ;
}
tmp . clear ( ) ;
tmp . push_back ( decoder_start_token_id ) ;
}
2024-08-10 11:43:26 +02:00
if ( llama_model_has_decoder ( model ) ) {
llama_decode ( lctx , llama_batch_get_one ( tmp . data ( ) , std : : min ( tmp . size ( ) , ( size_t ) params . n_batch ) , 0 , 0 ) ) ;
}
2024-06-10 20:44:42 +02:00
llama_past_clear ( lctx ) ;
2024-05-22 19:04:20 +02:00
llama_synchronize ( lctx ) ;
llama_reset_timings ( lctx ) ;
}
2024-08-05 18:14:10 +02:00
iparams . model = model ;
iparams . context = lctx ;
return iparams ;
2024-05-22 19:04:20 +02:00
}
2024-08-06 17:33:39 +02:00
void llama_lora_adapters_apply ( struct llama_context * ctx , std : : vector < llama_lora_adapter_container > & lora_adapters ) {
llama_lora_adapter_clear ( ctx ) ;
for ( auto & la : lora_adapters ) {
if ( la . scale ! = 0.0f ) {
llama_lora_adapter_set ( ctx , la . adapter , la . scale ) ;
}
}
2024-05-22 19:04:20 +02:00
}
struct llama_model_params llama_model_params_from_gpt_params ( const gpt_params & params ) {
auto mparams = llama_model_default_params ( ) ;
if ( params . n_gpu_layers ! = - 1 ) {
mparams . n_gpu_layers = params . n_gpu_layers ;
}
mparams . rpc_servers = params . rpc_servers . c_str ( ) ;
mparams . main_gpu = params . main_gpu ;
mparams . split_mode = params . split_mode ;
mparams . tensor_split = params . tensor_split ;
mparams . use_mmap = params . use_mmap ;
mparams . use_mlock = params . use_mlock ;
mparams . check_tensors = params . check_tensors ;
if ( params . kv_overrides . empty ( ) ) {
mparams . kv_overrides = NULL ;
} else {
GGML_ASSERT ( params . kv_overrides . back ( ) . key [ 0 ] = = 0 & & " KV overrides not terminated with empty key " ) ;
mparams . kv_overrides = params . kv_overrides . data ( ) ;
}
return mparams ;
}
static ggml_type kv_cache_type_from_str ( const std : : string & s ) {
if ( s = = " f32 " ) {
return GGML_TYPE_F32 ;
}
if ( s = = " f16 " ) {
return GGML_TYPE_F16 ;
}
if ( s = = " q8_0 " ) {
return GGML_TYPE_Q8_0 ;
}
if ( s = = " q4_0 " ) {
return GGML_TYPE_Q4_0 ;
}
if ( s = = " q4_1 " ) {
return GGML_TYPE_Q4_1 ;
}
if ( s = = " iq4_nl " ) {
return GGML_TYPE_IQ4_NL ;
}
if ( s = = " q5_0 " ) {
return GGML_TYPE_Q5_0 ;
}
if ( s = = " q5_1 " ) {
return GGML_TYPE_Q5_1 ;
}
throw std : : runtime_error ( " Invalid cache type: " + s ) ;
}
2023-12-07 12:03:17 +01:00
2023-09-28 21:42:38 +02:00
struct llama_context_params llama_context_params_from_gpt_params ( const gpt_params & params ) {
auto cparams = llama_context_default_params ( ) ;
2023-11-01 23:04:33 +01:00
cparams . n_ctx = params . n_ctx ;
2024-03-11 16:49:47 +01:00
cparams . n_seq_max = params . n_parallel ;
2024-03-13 18:54:21 +01:00
cparams . n_batch = params . n_batch ;
cparams . n_ubatch = params . n_ubatch ;
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
cparams . n_threads = params . cpuparams . n_threads ;
cparams . n_threads_batch = params . cpuparams_batch . n_threads = = - 1 ?
params . cpuparams . n_threads : params . cpuparams_batch . n_threads ;
2023-11-01 23:04:33 +01:00
cparams . seed = params . seed ;
cparams . logits_all = params . logits_all ;
2024-03-04 21:31:20 +01:00
cparams . embeddings = params . embedding ;
2023-11-01 23:04:33 +01:00
cparams . rope_scaling_type = params . rope_scaling_type ;
cparams . rope_freq_base = params . rope_freq_base ;
cparams . rope_freq_scale = params . rope_freq_scale ;
cparams . yarn_ext_factor = params . yarn_ext_factor ;
cparams . yarn_attn_factor = params . yarn_attn_factor ;
cparams . yarn_beta_fast = params . yarn_beta_fast ;
cparams . yarn_beta_slow = params . yarn_beta_slow ;
cparams . yarn_orig_ctx = params . yarn_orig_ctx ;
2024-03-03 11:40:27 +01:00
cparams . pooling_type = params . pooling_type ;
2024-07-05 09:05:56 +02:00
cparams . attention_type = params . attention_type ;
2024-02-27 13:35:51 +01:00
cparams . defrag_thold = params . defrag_thold ;
2024-04-11 14:51:07 +02:00
cparams . cb_eval = params . cb_eval ;
cparams . cb_eval_user_data = params . cb_eval_user_data ;
2023-12-07 12:03:17 +01:00
cparams . offload_kqv = ! params . no_kv_offload ;
ggml : add Flash Attention (#5021)
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision fpr parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-04-30 11:16:08 +02:00
cparams . flash_attn = params . flash_attn ;
2023-12-07 12:03:17 +01:00
cparams . type_k = kv_cache_type_from_str ( params . cache_type_k ) ;
cparams . type_v = kv_cache_type_from_str ( params . cache_type_v ) ;
2023-09-28 21:42:38 +02:00
return cparams ;
2023-07-11 18:18:43 +02:00
}
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
struct ggml_threadpool_params ggml_threadpool_params_from_cpu_params ( const cpu_params & params ) {
struct ggml_threadpool_params tpp ;
ggml_threadpool_params_init ( & tpp , params . n_threads ) ; // setup the defaults
if ( params . mask_valid ) {
std : : memcpy ( & tpp . cpumask , & params . cpumask , GGML_MAX_N_THREADS ) ;
}
tpp . prio = params . priority ;
tpp . poll = params . poll ;
tpp . strict_cpu = params . strict_cpu ;
return tpp ;
}
2024-03-17 19:12:37 +01:00
# ifdef LLAMA_USE_CURL
2024-04-30 01:52:50 +02:00
static bool starts_with ( const std : : string & str , const std : : string & prefix ) {
// While we wait for C++20's std::string::starts_with...
return str . rfind ( prefix , 0 ) = = 0 ;
}
2024-07-06 22:32:04 +02:00
static bool llama_download_file ( const std : : string & url , const std : : string & path , const std : : string & hf_token ) {
2024-04-30 01:52:50 +02:00
// Initialize libcurl
std : : unique_ptr < CURL , decltype ( & curl_easy_cleanup ) > curl ( curl_easy_init ( ) , & curl_easy_cleanup ) ;
if ( ! curl ) {
fprintf ( stderr , " %s: error initializing libcurl \n " , __func__ ) ;
return false ;
}
2024-03-23 18:07:00 +01:00
bool force_download = false ;
2024-03-17 19:12:37 +01:00
// Set the URL, allow to follow http redirection
2024-04-30 01:52:50 +02:00
curl_easy_setopt ( curl . get ( ) , CURLOPT_URL , url . c_str ( ) ) ;
curl_easy_setopt ( curl . get ( ) , CURLOPT_FOLLOWLOCATION , 1L ) ;
2024-03-23 18:07:00 +01:00
2024-07-06 22:32:04 +02:00
// Check if hf-token or bearer-token was specified
if ( ! hf_token . empty ( ) ) {
std : : string auth_header = " Authorization: Bearer " ;
auth_header + = hf_token . c_str ( ) ;
struct curl_slist * http_headers = NULL ;
http_headers = curl_slist_append ( http_headers , auth_header . c_str ( ) ) ;
curl_easy_setopt ( curl . get ( ) , CURLOPT_HTTPHEADER , http_headers ) ;
}
2024-03-17 19:12:37 +01:00
# if defined(_WIN32)
// CURLSSLOPT_NATIVE_CA tells libcurl to use standard certificate store of
// operating system. Currently implemented under MS-Windows.
2024-04-30 01:52:50 +02:00
curl_easy_setopt ( curl . get ( ) , CURLOPT_SSL_OPTIONS , CURLSSLOPT_NATIVE_CA ) ;
2024-03-17 19:12:37 +01:00
# endif
// Check if the file already exists locally
struct stat model_file_info ;
2024-04-30 01:52:50 +02:00
auto file_exists = ( stat ( path . c_str ( ) , & model_file_info ) = = 0 ) ;
2024-03-17 19:12:37 +01:00
2024-04-30 01:52:50 +02:00
// If the file exists, check its JSON metadata companion file.
std : : string metadata_path = path + " .json " ;
nlohmann : : json metadata ;
std : : string etag ;
std : : string last_modified ;
2024-03-17 19:12:37 +01:00
if ( file_exists ) {
2024-04-30 01:52:50 +02:00
// Try and read the JSON metadata file (note: stream autoclosed upon exiting this block).
std : : ifstream metadata_in ( metadata_path ) ;
if ( metadata_in . good ( ) ) {
try {
metadata_in > > metadata ;
fprintf ( stderr , " %s: previous metadata file found %s: %s \n " , __func__ , metadata_path . c_str ( ) , metadata . dump ( ) . c_str ( ) ) ;
2024-05-08 21:53:08 +02:00
if ( metadata . contains ( " url " ) & & metadata . at ( " url " ) . is_string ( ) ) {
auto previous_url = metadata . at ( " url " ) . get < std : : string > ( ) ;
2024-04-30 01:52:50 +02:00
if ( previous_url ! = url ) {
fprintf ( stderr , " %s: Model URL mismatch: %s != %s \n " , __func__ , url . c_str ( ) , previous_url . c_str ( ) ) ;
return false ;
}
}
2024-05-08 21:53:08 +02:00
if ( metadata . contains ( " etag " ) & & metadata . at ( " etag " ) . is_string ( ) ) {
etag = metadata . at ( " etag " ) ;
2024-04-30 01:52:50 +02:00
}
2024-05-08 21:53:08 +02:00
if ( metadata . contains ( " lastModified " ) & & metadata . at ( " lastModified " ) . is_string ( ) ) {
last_modified = metadata . at ( " lastModified " ) ;
2024-04-30 01:52:50 +02:00
}
} catch ( const nlohmann : : json : : exception & e ) {
fprintf ( stderr , " %s: error reading metadata file %s: %s \n " , __func__ , metadata_path . c_str ( ) , e . what ( ) ) ;
return false ;
2024-03-17 19:12:37 +01:00
}
}
2024-04-30 01:52:50 +02:00
} else {
fprintf ( stderr , " %s: no previous model file found %s \n " , __func__ , path . c_str ( ) ) ;
2024-03-17 19:12:37 +01:00
}
// Send a HEAD request to retrieve the etag and last-modified headers
struct llama_load_model_from_url_headers {
2024-04-30 01:52:50 +02:00
std : : string etag ;
std : : string last_modified ;
2024-03-17 19:12:37 +01:00
} ;
llama_load_model_from_url_headers headers ;
{
typedef size_t ( * CURLOPT_HEADERFUNCTION_PTR ) ( char * , size_t , size_t , void * ) ;
auto header_callback = [ ] ( char * buffer , size_t /*size*/ , size_t n_items , void * userdata ) - > size_t {
llama_load_model_from_url_headers * headers = ( llama_load_model_from_url_headers * ) userdata ;
2024-04-30 01:52:50 +02:00
static std : : regex header_regex ( " ([^:]+): (.*) \r \n " ) ;
static std : : regex etag_regex ( " ETag " , std : : regex_constants : : icase ) ;
static std : : regex last_modified_regex ( " Last-Modified " , std : : regex_constants : : icase ) ;
std : : string header ( buffer , n_items ) ;
std : : smatch match ;
if ( std : : regex_match ( header , match , header_regex ) ) {
const std : : string & key = match [ 1 ] ;
const std : : string & value = match [ 2 ] ;
if ( std : : regex_match ( key , match , etag_regex ) ) {
headers - > etag = value ;
} else if ( std : : regex_match ( key , match , last_modified_regex ) ) {
headers - > last_modified = value ;
}
2024-03-17 19:12:37 +01:00
}
return n_items ;
} ;
2024-04-30 01:52:50 +02:00
curl_easy_setopt ( curl . get ( ) , CURLOPT_NOBODY , 1L ) ; // will trigger the HEAD verb
curl_easy_setopt ( curl . get ( ) , CURLOPT_NOPROGRESS , 1L ) ; // hide head request progress
curl_easy_setopt ( curl . get ( ) , CURLOPT_HEADERFUNCTION , static_cast < CURLOPT_HEADERFUNCTION_PTR > ( header_callback ) ) ;
curl_easy_setopt ( curl . get ( ) , CURLOPT_HEADERDATA , & headers ) ;
2024-03-17 19:12:37 +01:00
2024-04-30 01:52:50 +02:00
CURLcode res = curl_easy_perform ( curl . get ( ) ) ;
2024-03-17 19:12:37 +01:00
if ( res ! = CURLE_OK ) {
fprintf ( stderr , " %s: curl_easy_perform() failed: %s \n " , __func__ , curl_easy_strerror ( res ) ) ;
2024-03-23 18:07:00 +01:00
return false ;
2024-03-17 19:12:37 +01:00
}
long http_code = 0 ;
2024-04-30 01:52:50 +02:00
curl_easy_getinfo ( curl . get ( ) , CURLINFO_RESPONSE_CODE , & http_code ) ;
2024-03-17 19:12:37 +01:00
if ( http_code ! = 200 ) {
// HEAD not supported, we don't know if the file has changed
// force trigger downloading
2024-03-23 18:07:00 +01:00
force_download = true ;
2024-03-17 19:12:37 +01:00
fprintf ( stderr , " %s: HEAD invalid http status code received: %ld \n " , __func__ , http_code ) ;
}
}
2024-04-30 01:52:50 +02:00
bool should_download = ! file_exists | | force_download ;
if ( ! should_download ) {
if ( ! etag . empty ( ) & & etag ! = headers . etag ) {
fprintf ( stderr , " %s: ETag header is different (%s != %s): triggering a new download \n " , __func__ , etag . c_str ( ) , headers . etag . c_str ( ) ) ;
should_download = true ;
} else if ( ! last_modified . empty ( ) & & last_modified ! = headers . last_modified ) {
fprintf ( stderr , " %s: Last-Modified header is different (%s != %s): triggering a new download \n " , __func__ , last_modified . c_str ( ) , headers . last_modified . c_str ( ) ) ;
should_download = true ;
}
}
2024-03-23 18:07:00 +01:00
if ( should_download ) {
2024-04-30 01:52:50 +02:00
std : : string path_temporary = path + " .downloadInProgress " ;
2024-03-17 19:12:37 +01:00
if ( file_exists ) {
2024-04-30 01:52:50 +02:00
fprintf ( stderr , " %s: deleting previous downloaded file: %s \n " , __func__ , path . c_str ( ) ) ;
if ( remove ( path . c_str ( ) ) ! = 0 ) {
fprintf ( stderr , " %s: unable to delete file: %s \n " , __func__ , path . c_str ( ) ) ;
2024-03-23 18:07:00 +01:00
return false ;
2024-03-17 19:12:37 +01:00
}
}
// Set the output file
2024-06-20 16:40:13 +02:00
struct FILE_deleter {
void operator ( ) ( FILE * f ) const {
fclose ( f ) ;
}
} ;
std : : unique_ptr < FILE , FILE_deleter > outfile ( fopen ( path_temporary . c_str ( ) , " wb " ) ) ;
2024-03-17 19:12:37 +01:00
if ( ! outfile ) {
2024-04-30 01:52:50 +02:00
fprintf ( stderr , " %s: error opening local file for writing: %s \n " , __func__ , path . c_str ( ) ) ;
2024-03-23 18:07:00 +01:00
return false ;
2024-03-17 19:12:37 +01:00
}
typedef size_t ( * CURLOPT_WRITEFUNCTION_PTR ) ( void * data , size_t size , size_t nmemb , void * fd ) ;
auto write_callback = [ ] ( void * data , size_t size , size_t nmemb , void * fd ) - > size_t {
return fwrite ( data , size , nmemb , ( FILE * ) fd ) ;
} ;
2024-04-30 01:52:50 +02:00
curl_easy_setopt ( curl . get ( ) , CURLOPT_NOBODY , 0L ) ;
curl_easy_setopt ( curl . get ( ) , CURLOPT_WRITEFUNCTION , static_cast < CURLOPT_WRITEFUNCTION_PTR > ( write_callback ) ) ;
curl_easy_setopt ( curl . get ( ) , CURLOPT_WRITEDATA , outfile . get ( ) ) ;
2024-03-17 19:12:37 +01:00
// display download progress
2024-04-30 01:52:50 +02:00
curl_easy_setopt ( curl . get ( ) , CURLOPT_NOPROGRESS , 0L ) ;
2024-03-17 19:12:37 +01:00
2024-03-23 18:07:00 +01:00
// helper function to hide password in URL
auto llama_download_hide_password_in_url = [ ] ( const std : : string & url ) - > std : : string {
std : : size_t protocol_pos = url . find ( " :// " ) ;
if ( protocol_pos = = std : : string : : npos ) {
return url ; // Malformed URL
}
std : : size_t at_pos = url . find ( ' @ ' , protocol_pos + 3 ) ;
if ( at_pos = = std : : string : : npos ) {
return url ; // No password in URL
}
return url . substr ( 0 , protocol_pos + 3 ) + " ******** " + url . substr ( at_pos ) ;
} ;
2024-03-17 19:12:37 +01:00
// start the download
2024-03-23 18:07:00 +01:00
fprintf ( stderr , " %s: downloading from %s to %s (server_etag:%s, server_last_modified:%s)... \n " , __func__ ,
2024-04-30 01:52:50 +02:00
llama_download_hide_password_in_url ( url ) . c_str ( ) , path . c_str ( ) , headers . etag . c_str ( ) , headers . last_modified . c_str ( ) ) ;
auto res = curl_easy_perform ( curl . get ( ) ) ;
2024-03-17 19:12:37 +01:00
if ( res ! = CURLE_OK ) {
fprintf ( stderr , " %s: curl_easy_perform() failed: %s \n " , __func__ , curl_easy_strerror ( res ) ) ;
2024-03-23 18:07:00 +01:00
return false ;
2024-03-17 19:12:37 +01:00
}
long http_code = 0 ;
2024-04-30 01:52:50 +02:00
curl_easy_getinfo ( curl . get ( ) , CURLINFO_RESPONSE_CODE , & http_code ) ;
2024-03-17 19:12:37 +01:00
if ( http_code < 200 | | http_code > = 400 ) {
fprintf ( stderr , " %s: invalid http status code received: %ld \n " , __func__ , http_code ) ;
2024-03-23 18:07:00 +01:00
return false ;
2024-03-17 19:12:37 +01:00
}
2024-04-30 01:52:50 +02:00
// Causes file to be closed explicitly here before we rename it.
outfile . reset ( ) ;
2024-03-17 19:12:37 +01:00
2024-04-30 01:52:50 +02:00
// Write the updated JSON metadata file.
metadata . update ( {
{ " url " , url } ,
{ " etag " , headers . etag } ,
{ " lastModified " , headers . last_modified }
} ) ;
std : : ofstream ( metadata_path ) < < metadata . dump ( 4 ) ;
fprintf ( stderr , " %s: file metadata saved: %s \n " , __func__ , metadata_path . c_str ( ) ) ;
2024-03-17 19:12:37 +01:00
2024-04-30 01:52:50 +02:00
if ( rename ( path_temporary . c_str ( ) , path . c_str ( ) ) ! = 0 ) {
fprintf ( stderr , " %s: unable to rename file: %s to %s \n " , __func__ , path_temporary . c_str ( ) , path . c_str ( ) ) ;
2024-03-23 18:07:00 +01:00
return false ;
}
}
return true ;
}
struct llama_model * llama_load_model_from_url (
const char * model_url ,
const char * path_model ,
2024-07-06 22:32:04 +02:00
const char * hf_token ,
2024-03-23 18:07:00 +01:00
const struct llama_model_params & params ) {
// Basic validation of the model_url
if ( ! model_url | | strlen ( model_url ) = = 0 ) {
fprintf ( stderr , " %s: invalid model_url \n " , __func__ ) ;
return NULL ;
}
2024-07-06 22:32:04 +02:00
if ( ! llama_download_file ( model_url , path_model , hf_token ) ) {
2024-03-23 18:07:00 +01:00
return NULL ;
}
// check for additional GGUFs split to download
int n_split = 0 ;
{
struct gguf_init_params gguf_params = {
/*.no_alloc = */ true ,
/*.ctx = */ NULL ,
} ;
auto * ctx_gguf = gguf_init_from_file ( path_model , gguf_params ) ;
if ( ! ctx_gguf ) {
fprintf ( stderr , " \n %s: failed to load input GGUF from %s \n " , __func__ , path_model ) ;
2024-03-17 19:12:37 +01:00
return NULL ;
}
2024-03-23 18:07:00 +01:00
auto key_n_split = gguf_find_key ( ctx_gguf , LLM_KV_SPLIT_COUNT ) ;
if ( key_n_split > = 0 ) {
n_split = gguf_get_val_u16 ( ctx_gguf , key_n_split ) ;
}
gguf_free ( ctx_gguf ) ;
2024-03-17 19:12:37 +01:00
}
2024-03-23 18:07:00 +01:00
if ( n_split > 1 ) {
char split_prefix [ PATH_MAX ] = { 0 } ;
char split_url_prefix [ LLAMA_CURL_MAX_URL_LENGTH ] = { 0 } ;
// Verify the first split file format
// and extract split URL and PATH prefixes
{
if ( ! llama_split_prefix ( split_prefix , sizeof ( split_prefix ) , path_model , 0 , n_split ) ) {
fprintf ( stderr , " \n %s: unexpected model file name: %s "
" n_split=%d \n " , __func__ , path_model , n_split ) ;
return NULL ;
}
if ( ! llama_split_prefix ( split_url_prefix , sizeof ( split_url_prefix ) , model_url , 0 , n_split ) ) {
fprintf ( stderr , " \n %s: unexpected model url: %s "
" n_split=%d \n " , __func__ , model_url , n_split ) ;
return NULL ;
}
}
// Prepare download in parallel
std : : vector < std : : future < bool > > futures_download ;
for ( int idx = 1 ; idx < n_split ; idx + + ) {
2024-07-06 22:32:04 +02:00
futures_download . push_back ( std : : async ( std : : launch : : async , [ & split_prefix , & split_url_prefix , & n_split , hf_token ] ( int download_idx ) - > bool {
2024-03-23 18:07:00 +01:00
char split_path [ PATH_MAX ] = { 0 } ;
llama_split_path ( split_path , sizeof ( split_path ) , split_prefix , download_idx , n_split ) ;
char split_url [ LLAMA_CURL_MAX_URL_LENGTH ] = { 0 } ;
llama_split_path ( split_url , sizeof ( split_url ) , split_url_prefix , download_idx , n_split ) ;
2024-07-06 22:32:04 +02:00
return llama_download_file ( split_url , split_path , hf_token ) ;
2024-03-23 18:07:00 +01:00
} , idx ) ) ;
}
// Wait for all downloads to complete
for ( auto & f : futures_download ) {
if ( ! f . get ( ) ) {
return NULL ;
}
}
}
2024-03-17 19:12:37 +01:00
return llama_load_model_from_file ( path_model , params ) ;
}
2024-03-22 14:33:38 +01:00
struct llama_model * llama_load_model_from_hf (
const char * repo ,
const char * model ,
const char * path_model ,
2024-07-06 22:32:04 +02:00
const char * hf_token ,
2024-03-22 14:33:38 +01:00
const struct llama_model_params & params ) {
// construct hugging face model url:
//
// --repo ggml-org/models --file tinyllama-1.1b/ggml-model-f16.gguf
// https://huggingface.co/ggml-org/models/resolve/main/tinyllama-1.1b/ggml-model-f16.gguf
//
// --repo TheBloke/Mixtral-8x7B-v0.1-GGUF --file mixtral-8x7b-v0.1.Q4_K_M.gguf
// https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/resolve/main/mixtral-8x7b-v0.1.Q4_K_M.gguf
//
std : : string model_url = " https://huggingface.co/ " ;
model_url + = repo ;
model_url + = " /resolve/main/ " ;
model_url + = model ;
2024-07-06 22:32:04 +02:00
return llama_load_model_from_url ( model_url . c_str ( ) , path_model , hf_token , params ) ;
2024-03-22 14:33:38 +01:00
}
2024-03-17 19:12:37 +01:00
# else
2024-03-22 14:33:38 +01:00
struct llama_model * llama_load_model_from_url (
const char * /*model_url*/ ,
const char * /*path_model*/ ,
2024-07-06 22:32:04 +02:00
const char * /*hf_token*/ ,
2024-03-22 14:33:38 +01:00
const struct llama_model_params & /*params*/ ) {
2024-03-17 19:12:37 +01:00
fprintf ( stderr , " %s: llama.cpp built without libcurl, downloading from an url not supported. \n " , __func__ ) ;
return nullptr ;
}
2024-03-22 14:33:38 +01:00
struct llama_model * llama_load_model_from_hf (
const char * /*repo*/ ,
const char * /*model*/ ,
const char * /*path_model*/ ,
2024-07-06 22:32:04 +02:00
const char * /*hf_token*/ ,
2024-03-22 14:33:38 +01:00
const struct llama_model_params & /*params*/ ) {
fprintf ( stderr , " %s: llama.cpp built without libcurl, downloading from Hugging Face not supported. \n " , __func__ ) ;
return nullptr ;
}
2024-03-17 19:12:37 +01:00
# endif // LLAMA_USE_CURL
2024-05-22 19:04:20 +02:00
//
// Batch utils
//
2023-08-21 22:07:43 +02:00
2024-05-22 19:04:20 +02:00
void llama_batch_clear ( struct llama_batch & batch ) {
batch . n_tokens = 0 ;
}
2023-09-03 12:42:56 +02:00
2024-05-22 19:04:20 +02:00
void llama_batch_add (
struct llama_batch & batch ,
llama_token id ,
llama_pos pos ,
const std : : vector < llama_seq_id > & seq_ids ,
bool logits ) {
batch . token [ batch . n_tokens ] = id ;
batch . pos [ batch . n_tokens ] = pos ;
batch . n_seq_id [ batch . n_tokens ] = seq_ids . size ( ) ;
for ( size_t i = 0 ; i < seq_ids . size ( ) ; + + i ) {
batch . seq_id [ batch . n_tokens ] [ i ] = seq_ids [ i ] ;
2023-09-03 12:42:56 +02:00
}
2024-05-22 19:04:20 +02:00
batch . logits [ batch . n_tokens ] = logits ;
2023-09-03 12:42:56 +02:00
2024-05-22 19:04:20 +02:00
batch . n_tokens + + ;
2023-05-02 22:39:51 +02:00
}
2023-08-21 22:07:43 +02:00
//
// Vocab utils
//
std : : vector < llama_token > llama_tokenize (
2023-09-28 21:42:38 +02:00
const struct llama_context * ctx ,
const std : : string & text ,
2024-04-09 19:44:08 +02:00
bool add_special ,
bool parse_special ) {
return llama_tokenize ( llama_get_model ( ctx ) , text , add_special , parse_special ) ;
2023-09-28 21:42:38 +02:00
}
std : : vector < llama_token > llama_tokenize (
const struct llama_model * model ,
2023-08-21 22:07:43 +02:00
const std : : string & text ,
2024-04-09 19:44:08 +02:00
bool add_special ,
bool parse_special ) {
2023-08-21 22:07:43 +02:00
// upper limit for the number of tokens
2024-04-09 19:44:08 +02:00
int n_tokens = text . length ( ) + 2 * add_special ;
2023-08-21 22:07:43 +02:00
std : : vector < llama_token > result ( n_tokens ) ;
2024-04-09 19:44:08 +02:00
n_tokens = llama_tokenize ( model , text . data ( ) , text . length ( ) , result . data ( ) , result . size ( ) , add_special , parse_special ) ;
2023-08-21 22:07:43 +02:00
if ( n_tokens < 0 ) {
result . resize ( - n_tokens ) ;
2024-04-09 19:44:08 +02:00
int check = llama_tokenize ( model , text . data ( ) , text . length ( ) , result . data ( ) , result . size ( ) , add_special , parse_special ) ;
2023-08-21 22:07:43 +02:00
GGML_ASSERT ( check = = - n_tokens ) ;
} else {
result . resize ( n_tokens ) ;
}
return result ;
}
2024-04-24 12:15:29 +02:00
std : : string llama_token_to_piece ( const struct llama_context * ctx , llama_token token , bool special ) {
2023-08-27 13:19:19 +02:00
std : : string piece ;
2024-07-05 19:01:35 +02:00
piece . resize ( piece . capacity ( ) ) ; // using string internal cache, 15 bytes + '\n'
const int n_chars = llama_token_to_piece ( llama_get_model ( ctx ) , token , & piece [ 0 ] , piece . size ( ) , 0 , special ) ;
if ( n_chars < 0 ) {
piece . resize ( - n_chars ) ;
int check = llama_token_to_piece ( llama_get_model ( ctx ) , token , & piece [ 0 ] , piece . size ( ) , 0 , special ) ;
GGML_ASSERT ( check = = - n_chars ) ;
}
else {
piece . resize ( n_chars ) ;
2023-08-28 17:59:39 +02:00
}
2024-07-05 19:01:35 +02:00
return piece ;
2024-05-22 19:04:20 +02:00
}
2023-08-28 17:59:39 +02:00
2024-07-05 19:01:35 +02:00
std : : string llama_detokenize ( llama_context * ctx , const std : : vector < llama_token > & tokens , bool special ) {
std : : string text ;
text . resize ( std : : max ( text . capacity ( ) , tokens . size ( ) ) ) ;
int32_t n_chars = llama_detokenize ( llama_get_model ( ctx ) , tokens . data ( ) , ( int32_t ) tokens . size ( ) , & text [ 0 ] , ( int32_t ) text . size ( ) , false , special ) ;
if ( n_chars < 0 ) {
text . resize ( - n_chars ) ;
n_chars = llama_detokenize ( llama_get_model ( ctx ) , tokens . data ( ) , ( int32_t ) tokens . size ( ) , & text [ 0 ] , ( int32_t ) text . size ( ) , false , special ) ;
GGML_ASSERT ( n_chars < = ( int32_t ) text . size ( ) ) ; // whitespace trimming is performed after per-token detokenization
2024-05-22 19:04:20 +02:00
}
2024-07-05 19:01:35 +02:00
text . resize ( n_chars ) ;
2024-05-22 19:04:20 +02:00
// NOTE: the original tokenizer decodes bytes after collecting the pieces.
2024-07-05 19:01:35 +02:00
return text ;
2023-08-28 17:59:39 +02:00
}
2023-11-23 18:07:56 +01:00
2024-06-25 13:56:49 +02:00
//
// Chat template utils
//
2024-06-04 20:23:39 +02:00
bool llama_chat_verify_template ( const std : : string & tmpl ) {
llama_chat_message chat [ ] = { { " user " , " test " } } ;
int res = llama_chat_apply_template ( nullptr , tmpl . c_str ( ) , chat , 1 , true , nullptr , 0 ) ;
return res > = 0 ;
}
2024-06-25 13:56:49 +02:00
std : : string llama_chat_apply_template ( const struct llama_model * model ,
const std : : string & tmpl ,
const std : : vector < llama_chat_msg > & msgs ,
bool add_ass ) {
int alloc_size = 0 ;
2024-06-27 18:14:19 +02:00
bool fallback = false ; // indicate if we must fallback to default chatml
2024-06-25 13:56:49 +02:00
std : : vector < llama_chat_message > chat ;
for ( auto & msg : msgs ) {
chat . push_back ( { msg . role . c_str ( ) , msg . content . c_str ( ) } ) ;
alloc_size + = ( msg . role . size ( ) + msg . content . size ( ) ) * 1.25 ;
}
const char * ptr_tmpl = tmpl . empty ( ) ? nullptr : tmpl . c_str ( ) ;
std : : vector < char > buf ( alloc_size ) ;
// run the first time to get the total output length
int32_t res = llama_chat_apply_template ( model , ptr_tmpl , chat . data ( ) , chat . size ( ) , add_ass , buf . data ( ) , buf . size ( ) ) ;
2024-06-27 18:14:19 +02:00
// error: chat template is not supported
if ( res < 0 ) {
if ( ptr_tmpl ! = nullptr ) {
// if the custom "tmpl" is not supported, we throw an error
// this is a bit redundant (for good), since we're not sure if user validated the custom template with llama_chat_verify_template()
throw std : : runtime_error ( " this custom template is not supported " ) ;
} else {
// If the built-in template is not supported, we default to chatml
res = llama_chat_apply_template ( nullptr , " chatml " , chat . data ( ) , chat . size ( ) , add_ass , buf . data ( ) , buf . size ( ) ) ;
fallback = true ;
}
}
2024-06-25 13:56:49 +02:00
// if it turns out that our buffer is too small, we resize it
if ( ( size_t ) res > buf . size ( ) ) {
buf . resize ( res ) ;
2024-06-27 18:14:19 +02:00
res = llama_chat_apply_template (
fallback ? nullptr : model ,
fallback ? " chatml " : ptr_tmpl ,
chat . data ( ) , chat . size ( ) , add_ass , buf . data ( ) , buf . size ( ) ) ;
2024-06-25 13:56:49 +02:00
}
std : : string formatted_chat ( buf . data ( ) , res ) ;
return formatted_chat ;
}
std : : string llama_chat_format_single ( const struct llama_model * model ,
const std : : string & tmpl ,
const std : : vector < llama_chat_msg > & past_msg ,
const llama_chat_msg & new_msg ,
bool add_ass ) {
2024-06-30 20:27:13 +02:00
std : : ostringstream ss ;
2024-07-24 13:48:46 +02:00
auto fmt_past_msg = past_msg . empty ( ) ? " " : llama_chat_apply_template ( model , tmpl , past_msg , false ) ;
2024-06-25 13:56:49 +02:00
std : : vector < llama_chat_msg > chat_new ( past_msg ) ;
2024-06-30 20:27:13 +02:00
// if the past_msg ends with a newline, we must preserve it in the formatted version
if ( add_ass & & ! fmt_past_msg . empty ( ) & & fmt_past_msg . back ( ) = = ' \n ' ) {
ss < < " \n " ;
} ;
// format chat with new_msg
2024-06-25 13:56:49 +02:00
chat_new . push_back ( new_msg ) ;
auto fmt_new_msg = llama_chat_apply_template ( model , tmpl , chat_new , add_ass ) ;
2024-06-30 20:27:13 +02:00
// get the diff part
ss < < fmt_new_msg . substr ( fmt_past_msg . size ( ) , fmt_new_msg . size ( ) - fmt_past_msg . size ( ) ) ;
return ss . str ( ) ;
2024-06-25 13:56:49 +02:00
}
std : : string llama_chat_format_example ( const struct llama_model * model ,
const std : : string & tmpl ) {
std : : vector < llama_chat_msg > msgs = {
{ " system " , " You are a helpful assistant " } ,
{ " user " , " Hello " } ,
{ " assistant " , " Hi there " } ,
{ " user " , " How are you? " } ,
} ;
return llama_chat_apply_template ( model , tmpl , msgs , true ) ;
}
2023-11-23 18:07:56 +01:00
//
// KV cache utils
//
2024-05-22 19:04:20 +02:00
void llama_kv_cache_dump_view ( const llama_kv_cache_view & view , int row_size ) {
2023-11-23 18:07:56 +01:00
static const char slot_chars [ ] = " .123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+ " ;
printf ( " === Dumping KV cache. total cells %d, max sequences per cell %d, populated cells %d, total tokens in cache %d, largest empty slot=%d @ %d " ,
2024-03-11 16:49:47 +01:00
view . n_cells , view . n_seq_max , view . used_cells , view . token_count , view . max_contiguous , view . max_contiguous_idx ) ;
2023-11-23 18:07:56 +01:00
llama_kv_cache_view_cell * c_curr = view . cells ;
llama_seq_id * cs_curr = view . cells_sequences ;
2024-03-11 16:49:47 +01:00
for ( int i = 0 ; i < view . n_cells ; i + + , c_curr + + , cs_curr + = view . n_seq_max ) {
2023-11-23 18:07:56 +01:00
if ( i % row_size = = 0 ) {
printf ( " \n %5d: " , i ) ;
}
int seq_count = 0 ;
2024-03-11 16:49:47 +01:00
for ( int j = 0 ; j < view . n_seq_max ; j + + ) {
2023-11-23 18:07:56 +01:00
if ( cs_curr [ j ] > = 0 ) { seq_count + + ; }
}
putchar ( slot_chars [ std : : min ( sizeof ( slot_chars ) - 2 , size_t ( seq_count ) ) ] ) ;
}
printf ( " \n === Done dumping \n " ) ;
}
2024-05-22 19:04:20 +02:00
void llama_kv_cache_dump_view_seqs ( const llama_kv_cache_view & view , int row_size ) {
2023-11-23 18:07:56 +01:00
static const char slot_chars [ ] = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz " ;
printf ( " === Dumping KV cache. total cells %d, max sequences per cell %d, populated cells %d, total tokens in cache %d, largest empty slot=%d @ %d \n " ,
2024-03-11 16:49:47 +01:00
view . n_cells , view . n_seq_max , view . used_cells , view . token_count , view . max_contiguous , view . max_contiguous_idx ) ;
2023-11-23 18:07:56 +01:00
std : : unordered_map < llama_seq_id , size_t > seqs ;
llama_kv_cache_view_cell * c_curr = view . cells ;
llama_seq_id * cs_curr = view . cells_sequences ;
2024-03-11 16:49:47 +01:00
for ( int i = 0 ; i < view . n_cells ; i + + , c_curr + + , cs_curr + = view . n_seq_max ) {
for ( int j = 0 ; j < view . n_seq_max ; j + + ) {
2023-11-23 18:07:56 +01:00
if ( cs_curr [ j ] < 0 ) { continue ; }
if ( seqs . find ( cs_curr [ j ] ) = = seqs . end ( ) ) {
if ( seqs . size ( ) + 1 > = sizeof ( slot_chars ) ) { break ; }
2024-02-18 17:21:52 +01:00
const size_t sz = seqs . size ( ) ;
seqs [ cs_curr [ j ] ] = sz ;
2023-11-23 18:07:56 +01:00
}
}
if ( seqs . size ( ) + 1 > = sizeof ( slot_chars ) ) { break ; }
}
printf ( " === Sequence legend: " ) ;
for ( const auto & it : seqs ) {
printf ( " %zu=%d, " , it . second , it . first ) ;
}
printf ( " '+'=other sequence ids " ) ;
c_curr = view . cells ;
cs_curr = view . cells_sequences ;
2024-03-11 16:49:47 +01:00
for ( int i = 0 ; i < view . n_cells ; i + + , c_curr + + , cs_curr + = view . n_seq_max ) {
2023-11-23 18:07:56 +01:00
if ( i % row_size = = 0 ) {
printf ( " \n %5d: " , i ) ;
}
2024-03-11 16:49:47 +01:00
for ( int j = 0 ; j < view . n_seq_max ; j + + ) {
2023-11-23 18:07:56 +01:00
if ( cs_curr [ j ] > = 0 ) {
const auto & it = seqs . find ( cs_curr [ j ] ) ;
putchar ( it ! = seqs . end ( ) ? int ( slot_chars [ it - > second ] ) : ' + ' ) ;
} else {
putchar ( ' . ' ) ;
}
}
putchar ( ' ' ) ;
}
printf ( " \n === Done dumping \n " ) ;
}
2024-03-09 13:27:58 +01:00
2024-05-22 19:04:20 +02:00
//
// Embedding utils
//
2024-06-24 07:30:24 +02:00
void llama_embd_normalize ( const float * inp , float * out , int n , int embd_norm ) {
2024-03-09 13:27:58 +01:00
double sum = 0.0 ;
2024-06-24 07:30:24 +02:00
switch ( embd_norm ) {
case - 1 : // no normalisation
sum = 1.0 ;
break ;
case 0 : // max absolute
for ( int i = 0 ; i < n ; i + + ) {
if ( sum < std : : abs ( inp [ i ] ) ) sum = std : : abs ( inp [ i ] ) ;
}
sum / = 32760.0 ; // make an int16 range
break ;
case 2 : // euclidean
for ( int i = 0 ; i < n ; i + + ) {
sum + = inp [ i ] * inp [ i ] ;
}
sum = std : : sqrt ( sum ) ;
break ;
default : // p-norm (euclidean is p-norm p=2)
for ( int i = 0 ; i < n ; i + + ) {
sum + = std : : pow ( std : : abs ( inp [ i ] ) , embd_norm ) ;
}
sum = std : : pow ( sum , 1.0 / embd_norm ) ;
break ;
2024-03-09 13:27:58 +01:00
}
2024-06-24 07:30:24 +02:00
const float norm = sum > 0.0 ? 1.0 / sum : 0.0f ;
2024-03-09 13:27:58 +01:00
for ( int i = 0 ; i < n ; i + + ) {
out [ i ] = inp [ i ] * norm ;
}
}
2024-03-14 09:12:29 +01:00
float llama_embd_similarity_cos ( const float * embd1 , const float * embd2 , int n ) {
double sum = 0.0 ;
double sum1 = 0.0 ;
double sum2 = 0.0 ;
for ( int i = 0 ; i < n ; i + + ) {
sum + = embd1 [ i ] * embd2 [ i ] ;
sum1 + = embd1 [ i ] * embd1 [ i ] ;
sum2 + = embd2 [ i ] * embd2 [ i ] ;
}
2024-06-24 07:30:24 +02:00
// Handle the case where one or both vectors are zero vectors
if ( sum1 = = 0.0 | | sum2 = = 0.0 ) {
if ( sum1 = = 0.0 & & sum2 = = 0.0 ) {
return 1.0f ; // two zero vectors are similar
}
return 0.0f ;
}
2024-03-14 09:12:29 +01:00
return sum / ( sqrt ( sum1 ) * sqrt ( sum2 ) ) ;
}
2024-03-15 21:43:02 +01:00
//
// Control vector utils
//
static llama_control_vector_data llama_control_vector_load_one ( const llama_control_vector_load_info & load_info ) {
llama_control_vector_data result = { - 1 , { } } ;
2024-06-27 16:48:07 +02:00
ggml_context * ctx = nullptr ;
struct gguf_init_params meta_gguf_params = {
/* .no_alloc = */ false ,
/* .ctx = */ & ctx ,
} ;
struct gguf_context * ctx_gguf = gguf_init_from_file ( load_info . fname . c_str ( ) , meta_gguf_params ) ;
if ( ! ctx_gguf ) {
fprintf ( stderr , " %s: failed to load control vector file from %s \n " , __func__ , load_info . fname . c_str ( ) ) ;
return result ;
2024-03-15 21:43:02 +01:00
}
2024-06-27 16:48:07 +02:00
int32_t n_tensors = gguf_get_n_tensors ( ctx_gguf ) ;
2024-03-15 21:43:02 +01:00
if ( n_tensors = = 0 ) {
fprintf ( stderr , " %s: no direction tensors found in %s \n " , __func__ , load_info . fname . c_str ( ) ) ;
}
2024-06-27 16:48:07 +02:00
for ( int i = 0 ; i < n_tensors ; i + + ) {
std : : string name = gguf_get_tensor_name ( ctx_gguf , i ) ;
2024-03-15 21:43:02 +01:00
2024-06-27 16:48:07 +02:00
int layer_idx = - 1 ;
2024-03-15 21:43:02 +01:00
2024-06-27 16:48:07 +02:00
// split on '.'
size_t dotpos = name . find ( ' . ' ) ;
if ( dotpos ! = std : : string : : npos & & name . substr ( 0 , dotpos ) = = " direction " ) {
try {
layer_idx = std : : stoi ( name . substr ( dotpos + 1 ) ) ;
} catch ( . . . ) {
layer_idx = - 1 ;
}
}
if ( layer_idx < 0 ) {
fprintf ( stderr , " %s: invalid/unparsable direction tensor layer index in %s \n " , __func__ , load_info . fname . c_str ( ) ) ;
result . n_embd = - 1 ;
break ;
} else if ( layer_idx = = 0 ) {
fprintf ( stderr , " %s: invalid (zero) direction tensor layer index in %s \n " , __func__ , load_info . fname . c_str ( ) ) ;
result . n_embd = - 1 ;
break ;
}
2024-03-15 21:43:02 +01:00
2024-06-27 16:48:07 +02:00
struct ggml_tensor * tensor = ggml_get_tensor ( ctx , name . c_str ( ) ) ;
if ( tensor - > type ! = GGML_TYPE_F32 ) {
fprintf ( stderr , " %s: invalid (non-F32) direction tensor type in %s \n " , __func__ , load_info . fname . c_str ( ) ) ;
result . n_embd = - 1 ;
break ;
}
if ( ggml_n_dims ( tensor ) ! = 1 ) {
fprintf ( stderr , " %s: invalid (non-1D) direction tensor shape in %s \n " , __func__ , load_info . fname . c_str ( ) ) ;
result . n_embd = - 1 ;
break ;
}
2024-03-15 21:43:02 +01:00
2024-06-27 16:48:07 +02:00
if ( result . n_embd = = - 1 ) {
result . n_embd = ggml_nelements ( tensor ) ;
} else if ( ggml_nelements ( tensor ) ! = result . n_embd ) {
fprintf ( stderr , " %s: direction tensor in %s does not match previous dimensions \n " , __func__ , load_info . fname . c_str ( ) ) ;
result . n_embd = - 1 ;
break ;
}
2024-03-15 21:43:02 +01:00
2024-06-27 16:48:07 +02:00
// extend if necessary - do not store data for layer 0 (it's not used)
result . data . resize ( std : : max ( result . data . size ( ) , static_cast < size_t > ( result . n_embd * layer_idx ) ) , 0.0f ) ;
2024-03-15 21:43:02 +01:00
2024-06-27 16:48:07 +02:00
const float * src = ( const float * ) tensor - > data ;
float * dst = result . data . data ( ) + result . n_embd * ( layer_idx - 1 ) ; // layer 1 at [0]
for ( int j = 0 ; j < result . n_embd ; j + + ) {
dst [ j ] + = src [ j ] * load_info . strength ; // allows multiple directions for same layer in same file
2024-03-15 21:43:02 +01:00
}
2024-06-27 16:48:07 +02:00
2024-03-15 21:43:02 +01:00
}
2024-06-27 16:48:07 +02:00
if ( result . n_embd = = - 1 ) {
fprintf ( stderr , " %s: skipping %s due to invalid direction tensors \n " , __func__ , load_info . fname . c_str ( ) ) ;
result . data . clear ( ) ;
}
gguf_free ( ctx_gguf ) ;
ggml_free ( ctx ) ;
2024-03-15 21:43:02 +01:00
return result ;
}
llama_control_vector_data llama_control_vector_load ( const std : : vector < llama_control_vector_load_info > & load_infos ) {
llama_control_vector_data result = { - 1 , { } } ;
for ( const auto & info : load_infos ) {
auto cur = llama_control_vector_load_one ( info ) ;
if ( cur . n_embd = = - 1 ) {
2024-06-27 16:48:07 +02:00
result . n_embd = - 1 ;
break ;
2024-03-15 21:43:02 +01:00
}
2024-06-27 16:48:07 +02:00
if ( result . n_embd ! = - 1 & & result . n_embd ! = cur . n_embd ) {
fprintf ( stderr , " %s: control vectors in %s does not match previous dimensions \n " , __func__ , info . fname . c_str ( ) ) ;
result . n_embd = - 1 ;
break ;
2024-03-15 21:43:02 +01:00
}
if ( result . n_embd = = - 1 ) {
result = std : : move ( cur ) ;
} else {
2024-06-27 16:48:07 +02:00
result . data . resize ( std : : max ( result . data . size ( ) , cur . data . size ( ) ) , 0.0f ) ; // extend if necessary
2024-03-15 21:43:02 +01:00
for ( size_t i = 0 ; i < cur . data . size ( ) ; i + + ) {
result . data [ i ] + = cur . data [ i ] ;
}
}
}
if ( result . n_embd = = - 1 ) {
2024-06-27 16:48:07 +02:00
fprintf ( stderr , " %s: no valid control vector files passed \n " , __func__ ) ;
result . data . clear ( ) ;
2024-03-15 21:43:02 +01:00
}
return result ;
}
2024-05-22 19:04:20 +02:00
//
// YAML utils
//
void yaml_dump_vector_float ( FILE * stream , const char * prop_name , const std : : vector < float > & data ) {
if ( data . empty ( ) ) {
fprintf ( stream , " %s: \n " , prop_name ) ;
return ;
}
fprintf ( stream , " %s: [ " , prop_name ) ;
for ( size_t i = 0 ; i < data . size ( ) - 1 ; + + i ) {
fprintf ( stream , " %e, " , data [ i ] ) ;
}
fprintf ( stream , " %e] \n " , data . back ( ) ) ;
}
void yaml_dump_vector_int ( FILE * stream , const char * prop_name , const std : : vector < int > & data ) {
if ( data . empty ( ) ) {
fprintf ( stream , " %s: \n " , prop_name ) ;
return ;
}
fprintf ( stream , " %s: [ " , prop_name ) ;
for ( size_t i = 0 ; i < data . size ( ) - 1 ; + + i ) {
fprintf ( stream , " %d, " , data [ i ] ) ;
}
fprintf ( stream , " %d] \n " , data . back ( ) ) ;
}
void yaml_dump_string_multiline ( FILE * stream , const char * prop_name , const char * data ) {
std : : string data_str ( data = = NULL ? " " : data ) ;
if ( data_str . empty ( ) ) {
fprintf ( stream , " %s: \n " , prop_name ) ;
return ;
}
size_t pos_start = 0 ;
size_t pos_found = 0 ;
if ( std : : isspace ( data_str [ 0 ] ) | | std : : isspace ( data_str . back ( ) ) ) {
data_str = std : : regex_replace ( data_str , std : : regex ( " \n " ) , " \\ n " ) ;
data_str = std : : regex_replace ( data_str , std : : regex ( " \" " ) , " \\ \" " ) ;
data_str = std : : regex_replace ( data_str , std : : regex ( R " ( \\ [^n " ] ) " ), R " ( \ $ & ) " );
data_str = " \" " + data_str + " \" " ;
fprintf ( stream , " %s: %s \n " , prop_name , data_str . c_str ( ) ) ;
return ;
}
if ( data_str . find ( ' \n ' ) = = std : : string : : npos ) {
fprintf ( stream , " %s: %s \n " , prop_name , data_str . c_str ( ) ) ;
return ;
}
fprintf ( stream , " %s: | \n " , prop_name ) ;
while ( ( pos_found = data_str . find ( ' \n ' , pos_start ) ) ! = std : : string : : npos ) {
fprintf ( stream , " %s \n " , data_str . substr ( pos_start , pos_found - pos_start ) . c_str ( ) ) ;
pos_start = pos_found + 1 ;
}
}
void yaml_dump_non_result_info ( FILE * stream , const gpt_params & params , const llama_context * lctx ,
const std : : string & timestamp , const std : : vector < int > & prompt_tokens , const char * model_desc ) {
const llama_sampling_params & sparams = params . sparams ;
fprintf ( stream , " build_commit: %s \n " , LLAMA_COMMIT ) ;
fprintf ( stream , " build_number: %d \n " , LLAMA_BUILD_NUMBER ) ;
fprintf ( stream , " cpu_has_arm_fma: %s \n " , ggml_cpu_has_arm_fma ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_avx: %s \n " , ggml_cpu_has_avx ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_avx_vnni: %s \n " , ggml_cpu_has_avx_vnni ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_avx2: %s \n " , ggml_cpu_has_avx2 ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_avx512: %s \n " , ggml_cpu_has_avx512 ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_avx512_vbmi: %s \n " , ggml_cpu_has_avx512_vbmi ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_avx512_vnni: %s \n " , ggml_cpu_has_avx512_vnni ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_cuda: %s \n " , ggml_cpu_has_cuda ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_vulkan: %s \n " , ggml_cpu_has_vulkan ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_kompute: %s \n " , ggml_cpu_has_kompute ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_fma: %s \n " , ggml_cpu_has_fma ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_gpublas: %s \n " , ggml_cpu_has_gpublas ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_neon: %s \n " , ggml_cpu_has_neon ( ) ? " true " : " false " ) ;
2024-05-25 10:42:31 +02:00
fprintf ( stream , " cpu_has_sve: %s \n " , ggml_cpu_has_sve ( ) ? " true " : " false " ) ;
2024-05-22 19:04:20 +02:00
fprintf ( stream , " cpu_has_f16c: %s \n " , ggml_cpu_has_f16c ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_fp16_va: %s \n " , ggml_cpu_has_fp16_va ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_wasm_simd: %s \n " , ggml_cpu_has_wasm_simd ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_blas: %s \n " , ggml_cpu_has_blas ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_sse3: %s \n " , ggml_cpu_has_sse3 ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_vsx: %s \n " , ggml_cpu_has_vsx ( ) ? " true " : " false " ) ;
fprintf ( stream , " cpu_has_matmul_int8: %s \n " , ggml_cpu_has_matmul_int8 ( ) ? " true " : " false " ) ;
# ifdef NDEBUG
fprintf ( stream , " debug: false \n " ) ;
# else
fprintf ( stream , " debug: true \n " ) ;
# endif // NDEBUG
fprintf ( stream , " model_desc: %s \n " , model_desc ) ;
fprintf ( stream , " n_vocab: %d # output size of the final layer, 32001 for some models \n " , llama_n_vocab ( llama_get_model ( lctx ) ) ) ;
# ifdef __OPTIMIZE__
fprintf ( stream , " optimize: true \n " ) ;
# else
fprintf ( stream , " optimize: false \n " ) ;
# endif // __OPTIMIZE__
fprintf ( stream , " time: %s \n " , timestamp . c_str ( ) ) ;
fprintf ( stream , " \n " ) ;
fprintf ( stream , " ############### \n " ) ;
fprintf ( stream , " # User Inputs # \n " ) ;
fprintf ( stream , " ############### \n " ) ;
fprintf ( stream , " \n " ) ;
fprintf ( stream , " alias: %s # default: unknown \n " , params . model_alias . c_str ( ) ) ;
fprintf ( stream , " batch_size: %d # default: 512 \n " , params . n_batch ) ;
yaml_dump_string_multiline ( stream , " cfg_negative_prompt " , sparams . cfg_negative_prompt . c_str ( ) ) ;
fprintf ( stream , " cfg_scale: %f # default: 1.0 \n " , sparams . cfg_scale ) ;
fprintf ( stream , " chunks: %d # default: -1 (unlimited) \n " , params . n_chunks ) ;
fprintf ( stream , " color: %s # default: false \n " , params . use_color ? " true " : " false " ) ;
fprintf ( stream , " ctx_size: %d # default: 512 \n " , params . n_ctx ) ;
fprintf ( stream , " escape: %s # default: false \n " , params . escape ? " true " : " false " ) ;
fprintf ( stream , " file: # never logged, see prompt instead. Can still be specified for input. \n " ) ;
fprintf ( stream , " frequency_penalty: %f # default: 0.0 \n " , sparams . penalty_freq ) ;
yaml_dump_string_multiline ( stream , " grammar " , sparams . grammar . c_str ( ) ) ;
fprintf ( stream , " grammar-file: # never logged, see grammar instead. Can still be specified for input. \n " ) ;
fprintf ( stream , " hellaswag: %s # default: false \n " , params . hellaswag ? " true " : " false " ) ;
fprintf ( stream , " hellaswag_tasks: %zu # default: 400 \n " , params . hellaswag_tasks ) ;
const auto logit_bias_eos = sparams . logit_bias . find ( llama_token_eos ( llama_get_model ( lctx ) ) ) ;
const bool ignore_eos = logit_bias_eos ! = sparams . logit_bias . end ( ) & & logit_bias_eos - > second = = - INFINITY ;
fprintf ( stream , " ignore_eos: %s # default: false \n " , ignore_eos ? " true " : " false " ) ;
yaml_dump_string_multiline ( stream , " in_prefix " , params . input_prefix . c_str ( ) ) ;
fprintf ( stream , " in_prefix_bos: %s # default: false \n " , params . input_prefix_bos ? " true " : " false " ) ;
yaml_dump_string_multiline ( stream , " in_suffix " , params . input_prefix . c_str ( ) ) ;
fprintf ( stream , " interactive: %s # default: false \n " , params . interactive ? " true " : " false " ) ;
fprintf ( stream , " interactive_first: %s # default: false \n " , params . interactive_first ? " true " : " false " ) ;
fprintf ( stream , " keep: %d # default: 0 \n " , params . n_keep ) ;
fprintf ( stream , " logdir: %s # default: unset (no logging) \n " , params . logdir . c_str ( ) ) ;
fprintf ( stream , " logit_bias: \n " ) ;
for ( std : : pair < llama_token , float > lb : sparams . logit_bias ) {
if ( ignore_eos & & lb . first = = logit_bias_eos - > first ) {
continue ;
}
fprintf ( stream , " %d: %f " , lb . first , lb . second ) ;
}
fprintf ( stream , " lora: \n " ) ;
2024-08-06 17:33:39 +02:00
for ( auto & la : params . lora_adapters ) {
if ( la . scale = = 1.0f ) {
fprintf ( stream , " - %s \n " , la . path . c_str ( ) ) ;
2024-05-22 19:04:20 +02:00
}
}
fprintf ( stream , " lora_scaled: \n " ) ;
2024-08-06 17:33:39 +02:00
for ( auto & la : params . lora_adapters ) {
if ( la . scale ! = 1.0f ) {
fprintf ( stream , " - %s: %f \n " , la . path . c_str ( ) , la . scale ) ;
2024-05-22 19:04:20 +02:00
}
}
2024-08-06 17:33:39 +02:00
fprintf ( stream , " lora_init_without_apply: %s # default: false \n " , params . lora_init_without_apply ? " true " : " false " ) ;
2024-05-22 19:04:20 +02:00
fprintf ( stream , " main_gpu: %d # default: 0 \n " , params . main_gpu ) ;
fprintf ( stream , " min_keep: %d # default: 0 (disabled) \n " , sparams . min_keep ) ;
fprintf ( stream , " mirostat: %d # default: 0 (disabled) \n " , sparams . mirostat ) ;
fprintf ( stream , " mirostat_ent: %f # default: 5.0 \n " , sparams . mirostat_tau ) ;
fprintf ( stream , " mirostat_lr: %f # default: 0.1 \n " , sparams . mirostat_eta ) ;
fprintf ( stream , " mlock: %s # default: false \n " , params . use_mlock ? " true " : " false " ) ;
fprintf ( stream , " model: %s # default: %s \n " , params . model . c_str ( ) , DEFAULT_MODEL_PATH ) ;
fprintf ( stream , " model_draft: %s # default: \n " , params . model_draft . c_str ( ) ) ;
fprintf ( stream , " multiline_input: %s # default: false \n " , params . multiline_input ? " true " : " false " ) ;
fprintf ( stream , " n_gpu_layers: %d # default: -1 \n " , params . n_gpu_layers ) ;
fprintf ( stream , " n_predict: %d # default: -1 (unlimited) \n " , params . n_predict ) ;
fprintf ( stream , " n_probs: %d # only used by server binary, default: 0 \n " , sparams . n_probs ) ;
fprintf ( stream , " no_mmap: %s # default: false \n " , ! params . use_mmap ? " true " : " false " ) ;
fprintf ( stream , " penalize_nl: %s # default: false \n " , sparams . penalize_nl ? " true " : " false " ) ;
fprintf ( stream , " ppl_output_type: %d # default: 0 \n " , params . ppl_output_type ) ;
fprintf ( stream , " ppl_stride: %d # default: 0 \n " , params . ppl_stride ) ;
fprintf ( stream , " presence_penalty: %f # default: 0.0 \n " , sparams . penalty_present ) ;
yaml_dump_string_multiline ( stream , " prompt " , params . prompt . c_str ( ) ) ;
fprintf ( stream , " prompt_cache: %s \n " , params . path_prompt_cache . c_str ( ) ) ;
fprintf ( stream , " prompt_cache_all: %s # default: false \n " , params . prompt_cache_all ? " true " : " false " ) ;
fprintf ( stream , " prompt_cache_ro: %s # default: false \n " , params . prompt_cache_ro ? " true " : " false " ) ;
yaml_dump_vector_int ( stream , " prompt_tokens " , prompt_tokens ) ;
fprintf ( stream , " repeat_penalty: %f # default: 1.1 \n " , sparams . penalty_repeat ) ;
fprintf ( stream , " reverse_prompt: \n " ) ;
for ( std : : string ap : params . antiprompt ) {
size_t pos = 0 ;
while ( ( pos = ap . find ( ' \n ' , pos ) ) ! = std : : string : : npos ) {
ap . replace ( pos , 1 , " \\ n " ) ;
pos + = 1 ;
}
fprintf ( stream , " - %s \n " , ap . c_str ( ) ) ;
}
fprintf ( stream , " rope_freq_base: %f # default: 10000.0 \n " , params . rope_freq_base ) ;
fprintf ( stream , " rope_freq_scale: %f # default: 1.0 \n " , params . rope_freq_scale ) ;
fprintf ( stream , " seed: %u # default: -1 (random seed) \n " , params . seed ) ;
fprintf ( stream , " simple_io: %s # default: false \n " , params . simple_io ? " true " : " false " ) ;
fprintf ( stream , " cont_batching: %s # default: false \n " , params . cont_batching ? " true " : " false " ) ;
fprintf ( stream , " flash_attn: %s # default: false \n " , params . flash_attn ? " true " : " false " ) ;
fprintf ( stream , " temp: %f # default: 0.8 \n " , sparams . temp ) ;
const std : : vector < float > tensor_split_vector ( params . tensor_split , params . tensor_split + llama_max_devices ( ) ) ;
yaml_dump_vector_float ( stream , " tensor_split " , tensor_split_vector ) ;
fprintf ( stream , " tfs: %f # default: 1.0 \n " , sparams . tfs_z ) ;
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
fprintf ( stream , " threads: %d # default: %u \n " , params . cpuparams . n_threads , std : : thread : : hardware_concurrency ( ) ) ;
2024-05-22 19:04:20 +02:00
fprintf ( stream , " top_k: %d # default: 40 \n " , sparams . top_k ) ;
fprintf ( stream , " top_p: %f # default: 0.95 \n " , sparams . top_p ) ;
fprintf ( stream , " min_p: %f # default: 0.0 \n " , sparams . min_p ) ;
fprintf ( stream , " typical_p: %f # default: 1.0 \n " , sparams . typical_p ) ;
fprintf ( stream , " verbose_prompt: %s # default: false \n " , params . verbose_prompt ? " true " : " false " ) ;
fprintf ( stream , " display_prompt: %s # default: true \n " , params . display_prompt ? " true " : " false " ) ;
}