llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-11 21:10:24 +01:00

Author	SHA1	Message	Date
Johannes Gäßler	061f5f8d21	CUDA: add __restrict__ to mul mat vec kernels (#2140 ) master-061f5f8	2023-07-08 00:25:15 +02:00
dylan	84525e7962	docker : add support for CUDA in docker (#1461 ) Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> master-84525e7	2023-07-07 21:25:25 +03:00
Georgi Gerganov	a7e20edf22	ci : switch threads to 1 (#2138 ) master-a7e20ed	2023-07-07 21:23:57 +03:00
Qingyou Meng	1d656d6360	ggml : change ggml_graph_compute() API to not require context (#1999 ) * ggml_graph_compute: deprecate using ggml_context, try to resolve issue #287 * rewrite: no longer consider backward compatibility; plan and make_plan * minor: rename ctx as plan; const * remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward * add static ggml_graph_compute_sugar() * minor: update comments * reusable buffers * ggml : more consistent naming + metal fixes * ggml : fix docs * tests : disable grad / opt + minor naming changes * ggml : add ggml_graph_compute_with_ctx() - backwards compatible API - deduplicates a lot of copy-paste * ci : enable test-grad0 * examples : factor out plan allocation into a helper function * llama : factor out plan stuff into a helper function * ci : fix env * llama : fix duplicate symbols + refactor example benchmark * ggml : remove obsolete assert + refactor n_tasks section * ggml : fix indentation in switch * llama : avoid unnecessary bool * ggml : remove comments from source file and match order in header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-07 19:24:01 +03:00
Georgi Gerganov	7242140283	ggml : remove sched_yield() call in ggml_graph_compute_thread() (#2134 ) master-7242140	2023-07-07 18:37:10 +03:00
Aarni Koskela	3e08ae99ce	convert.py: add mapping for safetensors bf16 (#1598 ) Fixes #1473	2023-07-07 09:12:49 -04:00
Howard Su	481f793acc	Fix opencl by wrap #if-else-endif with \n (#2086 ) master-481f793	2023-07-07 05:34:18 +02:00
Georgi Gerganov	dfd9fce6d6	ggml : fix restrict usage master-dfd9fce	2023-07-06 19:41:31 +03:00
Judd	36680f6e40	convert : update for baichuan (#2081 ) 1. guess n_layers; 2. relax warnings on context size; 3. add a note that its derivations are also supported. Co-authored-by: Judd <foldl@boxvest.com> master-36680f6	2023-07-06 19:23:49 +03:00
tslmy	a17a2683d8	alpaca.sh : update model file name (#2074 ) The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `ggmlv3.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and fewer people would waste time downloading the old Alpaca model.	2023-07-06 19:17:50 +03:00
Tobias Lütke	31cfbb1013	Expose generation timings from server & update completions.js (#2116 ) * use javascript generators as much cleaner API Also add ways to access completion as promise and EventSource * export llama_timings as struct and expose them in server * update readme, update baked includes * llama : uniform variable names + struct init --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> master-31cfbb1	2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson	983b555e9d	Update Server Instructions (#2113 ) * Update server instructions for web front end * Update server README * Remove duplicate OAI instructions * Fix duplicate text --------- Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 21:03:19 +03:00
Georgi Gerganov	ec326d350c	ggml : fix bug introduced in #1237 master-ec326d3	2023-07-05 20:44:11 +03:00
Georgi Gerganov	1b6efeab82	tests : fix test-grad0 master-1b6efea	2023-07-05 20:20:25 +03:00
Stephan Walter	1b107b8550	ggml : generalize `quantize_fns` for simpler FP16 handling (#1237 ) * Generalize quantize_fns for simpler FP16 handling * Remove call to ggml_cuda_mul_mat_get_wsize * ci : disable FMA for mac os actions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> master-1b107b8	2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson	8567c76b53	Update server instructions for web front end (#2103 ) Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 18:13:35 +03:00
Johannes Gäßler	924dd22fd3	Quantized dot products for CUDA mul mat vec (#2067 ) master-924dd22	2023-07-05 14:19:42 +02:00
Howard Su	051c70dcd5	llama: Don't double count the sampling time (#2107 ) master-051c70d	2023-07-05 18:31:23 +08:00
Johannes Gäßler	9e4475f5cf	Fixed OpenCL offloading prints (#2082 ) master-9e4475f	2023-07-05 08:58:05 +02:00
Nigel Bosch	7f0e9a775e	embd-input: Fix input embedding example unsigned int seed (#2105 ) master-7f0e9a7	2023-07-05 07:33:33 +08:00
Georgi Gerganov	b472f3fca5	readme : add link web chat PR	2023-07-04 22:25:22 +03:00
Georgi Gerganov	ed9a54e512	ggml : sync latest (new ops, macros, refactoring) (#2106 ) - add ggml_argmax() - add ggml_tanh() - add ggml_elu() - refactor ggml_conv_1d() and variants - refactor ggml_conv_2d() and variants - add helper macros to reduce code duplication in ggml.c master-ed9a54e	2023-07-04 21:54:11 +03:00
jwj7140	f257fd2550	Add an API example using server.cpp similar to OAI. (#2009 ) * add api_like_OAI.py * add evaluated token count to server * add /v1/ endpoints binding master-f257fd2	2023-07-04 21:06:12 +03:00
Tobias Lütke	7ee76e45af	Simple webchat for server (#1998 ) * expose simple web interface on root domain * embed index and add --path for choosing static dir * allow server to multithread because web browsers send a lot of garbage requests we want the server to multithread when serving 404s for favicon's etc. To avoid blowing up llama we just take a mutex when it's invoked. * let's try this with the xxd tool instead and see if msvc is happier with that * enable server in Makefiles * add /completion.js file to make it easy to use the server from js * slightly nicer css * rework state management into session, expose historyTemplate to settings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> master-7ee76e4	2023-07-04 16:05:27 +02:00
Henri Vasserman	acc111caf9	Allow old Make to build server. (#2098 ) Also make server build by default. Tested with Make 3.82 master-acc111c	2023-07-04 15:38:04 +03:00
ZhouYuChen	23c7c6fc91	Update Makefile: clean simple (#2097 ) master-23c7c6f	2023-07-04 14:15:16 +02:00
Erik Scholz	698efad5fb	CI: make the brew update temporarily optional. (#2092 ) until they decide to fix the brew installation in the macos runners. see the open issues. eg https://github.com/actions/runner-images/pull/7710 master-698efad	2023-07-04 01:50:12 +02:00
Govlzkoy	14a2cc71f6	[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088 )	2023-07-04 07:50:00 +08:00
Henri Vasserman	1cf14ccef1	fix server crashes (#2076 )	2023-07-04 00:05:23 +03:00
Howard Su	cc45a7feb8	Fix crash of test-tokenizer-0 under Debug build (#2064 ) * Fix crash of test-tokenizer-0 under Debug build * Change per comment	2023-07-03 20:43:55 +02:00
Howard Su	55dbb915cc	[llama] No need to check file version when loading vocab score (#2079 )	2023-07-03 19:58:58 +08:00
WangHaoranRobin	d7d2e6a0f0	server: add option to output probabilities for completion (#1962 ) * server: add option to output probabilities for completion * server: fix issue when handling probability output for incomplete tokens for multibyte character generation * server: fix llama_sample_top_k order * examples/common.h: put all bool variables in gpt_params together master-d7d2e6a	2023-07-03 00:38:44 +03:00
Georgi Gerganov	46088f7231	ggml : fix build with OpenBLAS (close #2066 ) master-46088f7	2023-07-02 09:46:46 +03:00
Johannes Gäßler	0bc2cdfc87	Better CUDA synchronization logic (#2057 ) master-0bc2cdf	2023-07-01 21:49:44 +02:00
Johannes Gäßler	befb3a3562	Test-based VRAM scratch size + context adjustment (#2056 )	2023-07-01 21:47:26 +02:00
Daniel Drake	b213227067	cmake : don't force -mcpu=native on aarch64 (#2063 ) It's currently not possible to cross-compile llama.cpp for aarch64 because CMakeLists.txt forces -mcpu=native for that target. -mcpu=native doesn't make sense if your build host is not the target architecture, and clang rejects it for that reason, aborting the build. This can be easily reproduced using the current Android NDK to build for aarch64 on an x86_64 host. If there is not a specific CPU-tuning target for aarch64 then -mcpu should be omitted completely. I think that makes sense, there is not enough variance in the aarch64 instruction set to warrant a fixed -mcpu optimization at this point. And if someone is building natively and wishes to enable any possible optimizations for the host device, then there is already the LLAMA_NATIVE option available. Fixes #495.	2023-07-01 21:31:44 +03:00
Aaron Miller	2f8cd979ec	metal : release buffers when freeing metal context (#2062 ) master-2f8cd97	2023-07-01 21:14:59 +03:00
Judd	471aab6e4c	convert : add support of baichuan-7b (#2055 ) Co-authored-by: Judd <foldl@boxvest.com>	2023-07-01 20:00:25 +03:00
Georgi Gerganov	463f2f4c4f	llama : fix return value of llama_load_session_file_internal (#2022 )	2023-07-01 19:05:09 +03:00
Rand Xie	cb44dbc7de	llama : catch llama_load_session_file_internal exceptions (#2022 ) * convert checks in llama_load_session_file to throw and handle them * make llama_load_session_file_internal static * address feedbacks to avoid using exceptions	2023-07-01 19:02:58 +03:00
Georgi Gerganov	79f634a19d	embd-input : fix returning ptr to temporary master-79f634a	2023-07-01 18:46:00 +03:00
Georgi Gerganov	04606a1599	train : fix compile warning	2023-07-01 18:45:44 +03:00
Qingyou Meng	b1ca8f36a9	ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995 ) Will not be scheduled unless explicitly enabled.	2023-07-01 18:42:43 +03:00
Howard Su	b8c8dda75f	Use unsigned for random seed (#2006 ) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> master-b8c8dda	2023-06-29 06:15:15 -07:00
LostRuins	96a712ca1b	Porting the improved K-Quant CUDA kernels to OpenCL (#1966 ) * Added broken new q4k quant * xx + ib0 * Fix q2_k fast kernel * Use preprocessor for QK_K * Add q6_k fast matmul kernel * ported q3k speedup successfully * ported q2k and q5k speedups * remove old dot kernels and template * fixed global const struct types * fixing address spaces * fixed string too long CI issue --------- Co-authored-by: 0cc4m <picard12@live.de>	2023-06-29 05:56:43 +02:00
m3ndax	d3494bb86b	llama : replacing auto &kv with const auto &kv (#2041 ) * Replacing auto &kv with const auto &kv * Create codacy.yml * Delete codacy.yml master-d3494bb	2023-06-28 21:39:08 +03:00
Salvador E. Tropea	5b351e94d0	cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028 ) - Not used master-5b351e9	2023-06-28 20:27:31 +03:00
Salvador E. Tropea	6432aabb6d	cuda : fix missing const qualifier in casts (#2027 ) master-6432aab	2023-06-28 20:26:26 +03:00
Howard Su	b922bc351b	llama : remove shards weight file support (#2000 ) * Remove multiple shards * Remove multiple file loaders * Remove llama_load_tensor_shard class * Simplify load logic * Remove dead code guess_n_parts function * Remove vocab_only from constructor of llama_model_loader * Remove alignment_prevents_mmap which is no longer needed. * Remove useless check master-b922bc3	2023-06-28 20:13:02 +03:00
Johannes Gäßler	7f9753fa12	CUDA GPU acceleration for LoRAs + f16 models (#1970 ) master-7f9753f	2023-06-28 18:35:54 +02:00

3006 Commits