Commit Graph

813 Commits

Author SHA1 Message Date
Georgi Gerganov
ef37dd14e7
mpi : fix output tensor after MPI compute (still not working) 2023-07-09 17:01:08 +03:00
Georgi Gerganov
c717c5185f
mpi : various fixes - communication now works but results are wrong 2023-07-09 16:40:16 +03:00
Georgi Gerganov
01abb3b3b9
mpi : move all MPI logic into ggml-mpi
Not tested yet
2023-07-09 16:04:27 +03:00
Georgi Gerganov
e339d35579
mpi : add names for layer inputs + prep ggml_mpi_graph_compute() 2023-07-09 14:42:36 +03:00
Georgi Gerganov
3232db628c
mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099) 2023-07-09 14:08:53 +03:00
Evan Miller
ef61acfbf5 Add info to README 2023-07-07 09:02:23 -04:00
Evan Miller
55207ba2b8 Add GH workflow, fix test 2023-07-06 21:40:18 -04:00
Evan Miller
1f0a2cfeda Update CMakeLists.txt 2023-07-06 21:25:34 -04:00
Evan Miller
06a239343c PR comments 2023-07-06 20:18:41 -04:00
Evan Miller
32deabfdc8 Merge branch 'master' into mpi 2023-07-06 19:04:50 -04:00
Georgi Gerganov
dfd9fce6d6
ggml : fix restrict usage 2023-07-06 19:41:31 +03:00
Judd
36680f6e40
convert : update for baichuan (#2081)
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivatives are also supported.

Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06 19:23:49 +03:00
tslmy
a17a2683d8
alpaca.sh : update model file name (#2074)
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` now requires GGML V3. Those model files are named `*ggmlv3*.bin`. We should change the example to a model file that actually works, so that this is more likely to run out-of-the-box for more people, and fewer people waste time downloading the old Alpaca model.
2023-07-06 19:17:50 +03:00
Tobias Lütke
31cfbb1013
Expose generation timings from server & update completions.js (#2116)
* use javascript generators as much cleaner API

Also add ways to access completion as promise and EventSource

* export llama_timings as struct and expose them in server

* update readme, update baked includes

* llama : uniform variable names + struct init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 16:51:13 -04:00
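
The generator-based API described in the commit above can be illustrated with a minimal sketch. This is not the completions.js code; it is a Python analogue under stated assumptions, and the names `stream_completion` and `complete` are hypothetical. It shows the general pattern: a generator yields chunks as they arrive, and a small wrapper collects them for callers who only want the final text (roughly the role a promise plays on the JavaScript side).

```python
from typing import Iterator


def stream_completion(prompt: str) -> Iterator[str]:
    """Hypothetical token stream: yields one chunk at a time."""
    # Stand-in for reading chunks from a streaming HTTP response;
    # here each whitespace-separated word of the prompt is one chunk.
    for word in prompt.split():
        yield word + " "


def complete(prompt: str) -> str:
    """Collect the whole stream into a single string."""
    return "".join(stream_completion(prompt)).rstrip()


if __name__ == "__main__":
    # Streaming use: handle each chunk as soon as it is produced.
    for chunk in stream_completion("hello there world"):
        print(chunk, end="", flush=True)
    print()
    # One-shot use: wait for the full text.
    print(complete("hello there world"))
```

The same generator serves both callers, which is probably why a generator reads as the cleaner primitive: the one-shot form is a thin wrapper over the streaming form.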
Jesse Jojo Johnson
983b555e9d
Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
Georgi Gerganov
ec326d350c
ggml : fix bug introduced in #1237 2023-07-05 20:44:11 +03:00
Georgi Gerganov
1b6efeab82
tests : fix test-grad0 2023-07-05 20:20:25 +03:00
Stephan Walter
1b107b8550
ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson
8567c76b53
Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
Johannes Gäßler
924dd22fd3
Quantized dot products for CUDA mul mat vec (#2067) 2023-07-05 14:19:42 +02:00
Howard Su
051c70dcd5
llama: Don't double count the sampling time (#2107) 2023-07-05 18:31:23 +08:00
Johannes Gäßler
9e4475f5cf
Fixed OpenCL offloading prints (#2082) 2023-07-05 08:58:05 +02:00
Nigel Bosch
7f0e9a775e
embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
Georgi Gerganov
b472f3fca5
readme : add link to web chat PR 2023-07-04 22:25:22 +03:00
Georgi Gerganov
ed9a54e512
ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
2023-07-04 21:54:11 +03:00
jwj7140
f257fd2550
Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke
7ee76e45af
Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

Because web browsers send a lot of garbage requests, we want the server
to multithread when serving 404s for favicons etc. To avoid blowing up
llama, we just take a mutex when it's invoked.


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
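
The locking approach described in the commit above (serve cheap requests on many threads, but take a mutex around the model) can be sketched as follows. This is an illustration in Python, not the server's C++ code; `FakeModel`, `handle_request`, and `serve_all` are hypothetical names, and the "model" is a stand-in that only upper-cases its input.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for a non-thread-safe inference context."""

    def __init__(self):
        self.active = 0       # callers currently inside complete()
        self.max_active = 0   # highest concurrency ever observed
        self.calls = 0

    def complete(self, prompt):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        self.calls += 1
        result = prompt.upper()
        self.active -= 1
        return result


model = FakeModel()
model_lock = threading.Lock()  # the single mutex guarding the model


def handle_request(path, body=""):
    # Cheap requests (favicon and similar) never touch the model or the lock.
    if path != "/completion":
        return 404, ""
    # Only one thread at a time may invoke the model.
    with model_lock:
        return 200, model.complete(body)


def serve_all(requests, workers=8):
    # Handle requests concurrently; results keep the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: handle_request(*r), requests))
```

With this shape, a burst of 404s does not queue behind a long completion, while completions themselves are still serialized.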
Henri Vasserman
acc111caf9
Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
2023-07-04 15:38:04 +03:00
ZhouYuChen
23c7c6fc91
Update Makefile: clean simple (#2097) 2023-07-04 14:15:16 +02:00
Evan Miller
042c5b278f wrap includes 2023-07-04 00:13:20 -04:00
Evan Miller
668ba5fe0b fixes 2023-07-04 00:09:02 -04:00
Evan Miller
d05ca74dd8 fix warnings, update README 2023-07-03 23:53:43 -04:00
Evan Miller
f85785f650 MPI support, first cut 2023-07-03 21:51:05 -04:00
Erik Scholz
698efad5fb
CI: make the brew update temporarily optional. (#2092)
until they decide to fix the brew installation in the macos runners.
See the open issues, e.g. https://github.com/actions/runner-images/pull/7710
2023-07-04 01:50:12 +02:00
Govlzkoy
14a2cc71f6
[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00
Henri Vasserman
1cf14ccef1
fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
Howard Su
cc45a7feb8
Fix crash of test-tokenizer-0 under Debug build (#2064)
* Fix crash of test-tokenizer-0 under Debug build

* Change per comment
2023-07-03 20:43:55 +02:00
Howard Su
55dbb915cc
[llama] No need to check file version when loading vocab score (#2079) 2023-07-03 19:58:58 +08:00
WangHaoranRobin
d7d2e6a0f0
server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
2023-07-03 00:38:44 +03:00
Georgi Gerganov
46088f7231 ggml : fix build with OpenBLAS (close #2066) 2023-07-02 09:46:46 +03:00
Johannes Gäßler
0bc2cdfc87
Better CUDA synchronization logic (#2057) 2023-07-01 21:49:44 +02:00
Johannes Gäßler
befb3a3562
Test-based VRAM scratch size + context adjustment (#2056) 2023-07-01 21:47:26 +02:00
Daniel Drake
b213227067
cmake : don't force -mcpu=native on aarch64 (#2063)
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.

-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.

If there is not a specific CPU-tuning target for aarch64 then -mcpu
should be omitted completely. I think that makes sense: there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, then there is
already the LLAMA_NATIVE option available.

Fixes #495.
2023-07-01 21:31:44 +03:00
Aaron Miller
2f8cd979ec
metal : release buffers when freeing metal context (#2062) 2023-07-01 21:14:59 +03:00
Judd
471aab6e4c
convert : add support of baichuan-7b (#2055)
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
Georgi Gerganov
463f2f4c4f
llama : fix return value of llama_load_session_file_internal (#2022) 2023-07-01 19:05:09 +03:00
Rand Xie
cb44dbc7de
llama : catch llama_load_session_file_internal exceptions (#2022)
* convert checks in llama_load_session_file to throw and handle them

* make llama_load_session_file_internal static

* address feedback to avoid using exceptions
2023-07-01 19:02:58 +03:00
Georgi Gerganov
79f634a19d
embd-input : fix returning ptr to temporary 2023-07-01 18:46:00 +03:00
Georgi Gerganov
04606a1599
train : fix compile warning 2023-07-01 18:45:44 +03:00
Qingyou Meng
b1ca8f36a9
ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)
Will not be scheduled unless explicitly enabled.
2023-07-01 18:42:43 +03:00