2023-04-20 03:14:14 +02:00
# Define the default target now so that it is always the first target
2023-06-28 17:53:37 +02:00
BUILD_TARGETS = main quantize quantize-stats perplexity embedding vdot train-text-from-scratch simple libembdinput.so embd-input-test
2023-05-27 19:04:14 +02:00
i f d e f L L A M A _ B U I L D _ S E R V E R
BUILD_TARGETS += server
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
LLAMA_SERVER_VERBOSE ?= 1
server : private CXXFLAGS += -DSERVER_VERBOSE =$( LLAMA_SERVER_VERBOSE )
2023-05-27 19:04:14 +02:00
e n d i f
default : $( BUILD_TARGETS )
2023-04-20 03:14:14 +02:00
2023-03-10 19:40:58 +01:00
i f n d e f U N A M E _ S
UNAME_S := $( shell uname -s)
e n d i f
i f n d e f U N A M E _ P
UNAME_P := $( shell uname -p)
e n d i f
i f n d e f U N A M E _ M
UNAME_M := $( shell uname -m)
e n d i f
CCV := $( shell $( CC) --version | head -n 1)
CXXV := $( shell $( CXX) --version | head -n 1)
# Mac OS + Arm can report x86_64
# ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
i f e q ( $( UNAME_S ) , D a r w i n )
ifneq ( $( UNAME_P) ,arm)
2023-03-21 16:44:11 +01:00
SYSCTL_M := $( shell sysctl -n hw.optional.arm64 2>/dev/null)
2023-03-10 19:40:58 +01:00
ifeq ( $( SYSCTL_M) ,1)
# UNAME_P := arm
# UNAME_M := arm64
warn := $( warning Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: https://github.com/ggerganov/whisper.cpp/issues/66\# issuecomment-1282546789)
endif
endif
e n d i f
#
# Compile flags
#
2023-03-21 16:29:41 +01:00
# keep standard at C11 and C++11
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml
I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.
* Adding Q3_K and Q8_K (de)-quantization
* Q3_K now working on CUDA and AVX2/scalar
CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).
* Some improvement for Q3_K on CUDA
It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
* Some more CUDA optimizations for Q3_K
Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.
* Adding Q4_K - scalar, AVX2, CUDA
Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity is about the same).
* Adding Q6_K - scalar, AVX2, CUDA
Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).
* Adding Q5_K - scalar, AVX2, CUDA
Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, single token prediction is about the same as Q4_0
for both, single token and batch prediction.
* Per convention, all QX_K quantizations use Q5_K for output.weight
* Adding quantization mixes
* Quantization mixes: didn't quite get what I wanted in the last commit
* Q4_K dot product for ARM_NEON
* Q6_K dot product for ARM_NEON
* Q5_K dot product for ARM_NEON
* Adding Q3_K dot for ARM_NEON
It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.
* A very slightly faster ARM_NEON Q3_K dot
* Adding Q2_K - just CUDA for now
Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.
* Adding scalar and AVX2 Q2_K dot
* Adding ARM_NEON Q2_K dot
About the same performance as Q4_K.
* A slightly faster ARM_NEON Q2_K dot
Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.
* Fixed bug in Q2_K CUDA dot product kernel
Stranegly enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonse back.
In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X
box iusing CUDA and model fully loaded on the GPU are
~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for The 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).
* Don't print zeros/NaNs when no count histogram has been collected
* A 10% faster CUDA vector dot kernel for Q3_K
Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems memory acccess pattern is more important for
performance than the amount of computation the kernel
does.
* A slightly daster Q4_K AVX2 dot product
For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.
* A slightly faster ARM_NEON A4_K dot product
* Minor
* Fix quantization error test
We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.
* Fix docker build
I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.
* Added forgotten ggml.o dependence on k_quants.h to the Makefile
* Had unintentionally committed the Makefile with -Ofast enabled
* ggml : rename k_quants -> ggml-quants-k, use lowercase in code
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 21:56:18 +02:00
# -Ofast tends to produce faster code, but may not be available for some compilers.
k-quants : support for super-block size of 64 (#2001)
* k_quants: WIP super-blocks with 64 weights
* k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)
* k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 works.
* k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar
* k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,
* k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
* k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
* k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
* k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
* k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
* k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
* k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
* k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
* k_quants: call them _K, not _k, also on Metal
* k_quants: correctly define QK_K in llama.cpp
* Fixed bug in q4_K quantization added with the 64-block addition
* Simplify via lambda
* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
* k_quants: forgot to add the Metal changes in last commit
* k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
* k_quants: AVX2 implementation for new 64-weight Q5_K
* k_quants: 10% faster ARM_NEON Q5_K dot product
* k_quants: fixed issue caused by merging with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26 18:43:07 +02:00
i f d e f L L A M A _ F A S T
OPT = -Ofast
e l s e
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml
I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.
* Adding Q3_K and Q8_K (de)-quantization
* Q3_K now working on CUDA and AVX2/scalar
CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).
* Some improvement for Q3_K on CUDA
It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
* Some more CUDA optimizations for Q3_K
Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.
* Adding Q4_K - scalar, AVX2, CUDA
Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity is about the same).
* Adding Q6_K - scalar, AVX2, CUDA
Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).
* Adding Q5_K - scalar, AVX2, CUDA
Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, single token prediction is about the same as Q4_0
for both, single token and batch prediction.
* Per convention, all QX_K quantizations use Q5_K for output.weight
* Adding quantization mixes
* Quantization mixes: didn't quite get what I wanted in the last commit
* Q4_K dot product for ARM_NEON
* Q6_K dot product for ARM_NEON
* Q5_K dot product for ARM_NEON
* Adding Q3_K dot for ARM_NEON
It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.
* A very slightly faster ARM_NEON Q3_K dot
* Adding Q2_K - just CUDA for now
Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.
* Adding scalar and AVX2 Q2_K dot
* Adding ARM_NEON Q2_K dot
About the same performance as Q4_K.
* A slightly faster ARM_NEON Q2_K dot
Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.
* Fixed bug in Q2_K CUDA dot product kernel
Stranegly enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonse back.
In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X
box iusing CUDA and model fully loaded on the GPU are
~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for The 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).
* Don't print zeros/NaNs when no count histogram has been collected
* A 10% faster CUDA vector dot kernel for Q3_K
Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems memory acccess pattern is more important for
performance than the amount of computation the kernel
does.
* A slightly daster Q4_K AVX2 dot product
For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.
* A slightly faster ARM_NEON A4_K dot product
* Minor
* Fix quantization error test
We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.
* Fix docker build
I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.
* Added forgotten ggml.o dependence on k_quants.h to the Makefile
* Had unintentionally committed the Makefile with -Ofast enabled
* ggml : rename k_quants -> ggml-quants-k, use lowercase in code
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 21:56:18 +02:00
OPT = -O3
k-quants : support for super-block size of 64 (#2001)
* k_quants: WIP super-blocks with 64 weights
* k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)
* k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 works.
* k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar
* k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,
* k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
* k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
* k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
* k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
* k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
* k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
* k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
* k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
* k_quants: call them _K, not _k, also on Metal
* k_quants: correctly define QK_K in llama.cpp
* Fixed bug in q4_K quantization added with the 64-block addition
* Simplify via lambda
* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
* k_quants: forgot to add the Metal changes in last commit
* k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
* k_quants: AVX2 implementation for new 64-weight Q5_K
* k_quants: 10% faster ARM_NEON Q5_K dot product
* k_quants: fixed issue caused by merging with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26 18:43:07 +02:00
e n d i f
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml
I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.
* Adding Q3_K and Q8_K (de)-quantization
* Q3_K now working on CUDA and AVX2/scalar
CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).
* Some improvement for Q3_K on CUDA
It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
* Some more CUDA optimizations for Q3_K
Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.
* Adding Q4_K - scalar, AVX2, CUDA
Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity is about the same).
* Adding Q6_K - scalar, AVX2, CUDA
Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).
* Adding Q5_K - scalar, AVX2, CUDA
Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, single token prediction is about the same as Q4_0
for both, single token and batch prediction.
* Per convention, all QX_K quantizations use Q5_K for output.weight
* Adding quantization mixes
* Quantization mixes: didn't quite get what I wanted in the last commit
* Q4_K dot product for ARM_NEON
* Q6_K dot product for ARM_NEON
* Q5_K dot product for ARM_NEON
* Adding Q3_K dot for ARM_NEON
It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.
* A very slightly faster ARM_NEON Q3_K dot
* Adding Q2_K - just CUDA for now
Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.
* Adding scalar and AVX2 Q2_K dot
* Adding ARM_NEON Q2_K dot
About the same performance as Q4_K.
* A slightly faster ARM_NEON Q2_K dot
Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.
* Fixed bug in Q2_K CUDA dot product kernel
Stranegly enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonse back.
In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X
box iusing CUDA and model fully loaded on the GPU are
~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for The 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).
* Don't print zeros/NaNs when no count histogram has been collected
* A 10% faster CUDA vector dot kernel for Q3_K
Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems memory acccess pattern is more important for
performance than the amount of computation the kernel
does.
* A slightly daster Q4_K AVX2 dot product
For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.
* A slightly faster ARM_NEON A4_K dot product
* Minor
* Fix quantization error test
We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.
* Fix docker build
I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.
* Added forgotten ggml.o dependence on k_quants.h to the Makefile
* Had unintentionally committed the Makefile with -Ofast enabled
* ggml : rename k_quants -> ggml-quants-k, use lowercase in code
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 21:56:18 +02:00
CFLAGS = -I. $( OPT) -std= c11 -fPIC
CXXFLAGS = -I. -I./examples $( OPT) -std= c++11 -fPIC
2023-03-10 19:40:58 +01:00
LDFLAGS =
2023-05-28 21:01:02 +02:00
i f d e f L L A M A _ D E B U G
CFLAGS += -O0 -g
CXXFLAGS += -O0 -g
LDFLAGS += -g
e l s e
2023-04-29 17:43:28 +02:00
CFLAGS += -DNDEBUG
CXXFLAGS += -DNDEBUG
e n d i f
2023-03-28 18:48:20 +02:00
# warnings
2023-04-19 18:06:37 +02:00
CFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith
Rewrite loading code to try to satisfy everyone:
- Support all three formats (ggml, ggmf, ggjt). (However, I didn't
include the hack needed to support GPT4All files without conversion.
Those can still be used after converting them with convert.py from my
other PR.)
- Support both mmap and read (mmap is used by default, but can be
disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the
number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just
relying on the per-tensor type/size fields. This has no immediate
benefit, but makes it easier to experiment with different formats, and
should make it easier to support the new GPTQ-for-LLaMa models in the
future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on
Unix).
- Indicate loading progress when using mmap + mlock. (Which led me
to the interesting observation that on my Linux machine, with a
warm file cache, mlock actually takes some time, whereas mmap
without mlock starts almost instantly...)
- To help implement this, move mlock support from ggml to the
loading code.
- madvise/PrefetchVirtualMemory support (based on #740)
- Switch from ifstream to the `fopen` family of functions to avoid
unnecessary copying and, when mmap is enabled, allow reusing the same
file descriptor for both metadata reads and mmap (whereas the existing
implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file
inputs (not really a feature as much as 'it was easier this way').
Implementation notes:
I tried to factor the code into more discrete pieces than before.
Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:
- Destructors to make it easier to ensure everything gets cleaned up.
- Exceptions. I don't even usually use exceptions when writing C++, and
I can remove them if desired... but here they make the loading code
much more succinct while still properly handling a variety of errors,
ranging from API calls failing to integer overflow and allocation
failure. The exceptions are converted to error codes at the
API boundary.)
Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)
2023-04-08 21:24:37 +02:00
CXXFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar
2023-03-28 18:48:20 +02:00
2023-03-10 19:40:58 +01:00
# OS specific
# TODO: support Windows
i f e q ( $( UNAME_S ) , L i n u x )
CFLAGS += -pthread
CXXFLAGS += -pthread
e n d i f
i f e q ( $( UNAME_S ) , D a r w i n )
CFLAGS += -pthread
CXXFLAGS += -pthread
e n d i f
i f e q ( $( UNAME_S ) , F r e e B S D )
CFLAGS += -pthread
CXXFLAGS += -pthread
e n d i f
2023-03-13 17:40:54 +01:00
i f e q ( $( UNAME_S ) , N e t B S D )
CFLAGS += -pthread
CXXFLAGS += -pthread
e n d i f
2023-03-21 16:50:09 +01:00
i f e q ( $( UNAME_S ) , O p e n B S D )
CFLAGS += -pthread
CXXFLAGS += -pthread
e n d i f
2023-03-10 19:40:58 +01:00
i f e q ( $( UNAME_S ) , H a i k u )
CFLAGS += -pthread
CXXFLAGS += -pthread
e n d i f
2023-05-13 16:25:09 +02:00
i f d e f L L A M A _ G P R O F
CFLAGS += -pg
CXXFLAGS += -pg
e n d i f
i f d e f L L A M A _ P E R F
CFLAGS += -DGGML_PERF
CXXFLAGS += -DGGML_PERF
e n d i f
2023-03-10 19:40:58 +01:00
# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
# feel free to update the Makefile for your architecture and send a pull request or issue
i f e q ( $( UNAME_M ) , $( filter $ ( UNAME_M ) ,x 86_ 64 i 686) )
2023-04-02 09:17:05 +02:00
# Use all CPU extensions that are available:
2023-04-22 10:08:12 +02:00
CFLAGS += -march= native -mtune= native
2023-04-05 16:38:37 +02:00
CXXFLAGS += -march= native -mtune= native
2023-04-22 10:08:12 +02:00
# Usage AVX-only
#CFLAGS += -mfma -mf16c -mavx
#CXXFLAGS += -mfma -mf16c -mavx
2023-06-10 08:41:59 +02:00
# Usage SSSE3-only (Not is SSE3!)
#CFLAGS += -mssse3
#CXXFLAGS += -mssse3
2023-03-10 19:40:58 +01:00
e n d i f
2023-06-04 22:34:30 +02:00
2023-03-10 19:40:58 +01:00
i f n e q ( $( filter ppc 64%,$ ( UNAME_M ) ) , )
POWER9_M := $( shell grep "POWER9" /proc/cpuinfo)
ifneq ( ,$( findstring POWER9,$( POWER9_M) ) )
2023-04-22 10:08:12 +02:00
CFLAGS += -mcpu= power9
2023-03-24 16:19:26 +01:00
CXXFLAGS += -mcpu= power9
2023-03-10 19:40:58 +01:00
endif
# Require c++23's std::byteswap for big-endian support.
ifeq ( $( UNAME_M) ,ppc64)
CXXFLAGS += -std= c++23 -DGGML_BIG_ENDIAN
endif
e n d i f
2023-06-04 22:34:30 +02:00
2023-06-07 09:59:52 +02:00
i f n d e f L L A M A _ N O _ K _ Q U A N T S
CFLAGS += -DGGML_USE_K_QUANTS
2023-06-13 12:23:23 +02:00
CXXFLAGS += -DGGML_USE_K_QUANTS
2023-06-07 09:59:52 +02:00
OBJS += k_quants.o
k-quants : support for super-block size of 64 (#2001)
* k_quants: WIP super-blocks with 64 weights
* k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)
* k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 works.
* k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar
* k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,
* k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
* k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
* k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
* k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
* k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
* k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
* k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
* k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
* k_quants: call them _K, not _k, also on Metal
* k_quants: correctly define QK_K in llama.cpp
* Fixed bug in q4_K quantization added with the 64-block addition
* Simplify via lambda
* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
* k_quants: forgot to add the Metal changes in last commit
* k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
* k_quants: AVX2 implementation for new 64-weight Q5_K
* k_quants: 10% faster ARM_NEON Q5_K dot product
* k_quants: fixed issue caused by merging with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26 18:43:07 +02:00
i f d e f L L A M A _ Q K K _ 6 4
CFLAGS += -DGGML_QKK_64
CXXFLAGS += -DGGML_QKK_64
e n d i f
2023-06-07 09:59:52 +02:00
e n d i f
2023-03-11 11:26:16 +01:00
i f n d e f L L A M A _ N O _ A C C E L E R A T E
2023-03-21 16:44:11 +01:00
# Mac M1 - include Accelerate framework.
# `-framework Accelerate` works on Mac Intel as well, with negliable performance boost (as of the predict time).
2023-03-10 19:40:58 +01:00
ifeq ( $( UNAME_S) ,Darwin)
CFLAGS += -DGGML_USE_ACCELERATE
LDFLAGS += -framework Accelerate
endif
2023-06-04 22:34:30 +02:00
e n d i f # LLAMA_NO_ACCELERATE
2023-03-11 11:26:16 +01:00
i f d e f L L A M A _ O P E N B L A S
2023-05-16 10:30:15 +02:00
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -I/usr/include/openblas
2023-06-17 18:17:22 +02:00
LDFLAGS += -lopenblas
2023-06-04 22:34:30 +02:00
e n d i f # LLAMA_OPENBLAS
2023-05-20 16:58:31 +02:00
i f d e f L L A M A _ B L I S
2023-06-07 09:59:52 +02:00
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/blis -I/usr/include/blis
2023-05-20 16:58:31 +02:00
LDFLAGS += -lblis -L/usr/local/lib
2023-06-04 22:34:30 +02:00
e n d i f # LLAMA_BLIS
2023-04-19 11:22:45 +02:00
i f d e f L L A M A _ C U B L A S
ggml : add CLBlast support (#1164)
* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing
* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of ClBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 16:57:16 +02:00
CFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$( CUDA_PATH) /targets/x86_64-linux/include
2023-04-29 02:04:18 +02:00
CXXFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$( CUDA_PATH) /targets/x86_64-linux/include
ggml : add CLBlast support (#1164)
* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing
* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of ClBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 16:57:16 +02:00
LDFLAGS += -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L$( CUDA_PATH) /targets/x86_64-linux/lib
2023-04-21 21:59:17 +02:00
OBJS += ggml-cuda.o
NVCC = nvcc
2023-04-24 17:29:58 +02:00
NVCCFLAGS = --forward-unknown-to-host-compiler -arch= native
2023-05-25 23:07:29 +02:00
i f d e f L L A M A _ C U D A _ D M M V _ X
NVCCFLAGS += -DGGML_CUDA_DMMV_X= $( LLAMA_CUDA_DMMV_X)
e l s e
NVCCFLAGS += -DGGML_CUDA_DMMV_X= 32
e n d i f # LLAMA_CUDA_DMMV_X
i f d e f L L A M A _ C U D A _ D M M V _ Y
NVCCFLAGS += -DGGML_CUDA_DMMV_Y= $( LLAMA_CUDA_DMMV_Y)
e l s e
NVCCFLAGS += -DGGML_CUDA_DMMV_Y= 1
e n d i f # LLAMA_CUDA_DMMV_Y
2023-06-19 10:23:56 +02:00
i f d e f L L A M A _ C U D A _ D M M V _ F 1 6
NVCCFLAGS += -DGGML_CUDA_DMMV_F16
e n d i f # LLAMA_CUDA_DMMV_F16
2023-06-16 19:08:44 +02:00
i f d e f L L A M A _ C U D A _ K Q U A N T S _ I T E R
NVCCFLAGS += -DK_QUANTS_PER_ITERATION= $( LLAMA_CUDA_KQUANTS_ITER)
e l s e
NVCCFLAGS += -DK_QUANTS_PER_ITERATION= 2
e n d i f
2023-04-20 03:14:14 +02:00
ggml-cuda.o : ggml -cuda .cu ggml -cuda .h
2023-04-24 17:29:58 +02:00
$( NVCC) $( NVCCFLAGS) $( CXXFLAGS) -Wno-pedantic -c $< -o $@
2023-05-25 23:07:29 +02:00
e n d i f # LLAMA_CUBLAS
2023-06-04 22:34:30 +02:00
ggml : add CLBlast support (#1164)
* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing
* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of ClBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 16:57:16 +02:00
i f d e f L L A M A _ C L B L A S T
2023-06-04 22:34:30 +02:00
CFLAGS += -DGGML_USE_CLBLAST
CXXFLAGS += -DGGML_USE_CLBLAST
2023-05-05 14:18:21 +02:00
# Mac provides OpenCL as a framework
ifeq ( $( UNAME_S) ,Darwin)
LDFLAGS += -lclblast -framework OpenCL
else
LDFLAGS += -lclblast -lOpenCL
endif
ggml : add CLBlast support (#1164)
* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing
* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of ClBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 16:57:16 +02:00
OBJS += ggml-opencl.o
2023-06-04 22:34:30 +02:00
2023-05-22 23:33:24 +02:00
ggml-opencl.o : ggml -opencl .cpp ggml -opencl .h
$( CXX) $( CXXFLAGS) -c $< -o $@
2023-06-04 22:34:30 +02:00
e n d i f # LLAMA_CLBLAST
i f d e f L L A M A _ M E T A L
CFLAGS += -DGGML_USE_METAL -DGGML_METAL_NDEBUG
CXXFLAGS += -DGGML_USE_METAL
LDFLAGS += -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
OBJS += ggml-metal.o
ggml-metal.o : ggml -metal .m ggml -metal .h
$( CC) $( CFLAGS) -c $< -o $@
e n d i f # LLAMA_METAL
2023-03-10 19:40:58 +01:00
i f n e q ( $( filter aarch 64%,$ ( UNAME_M ) ) , )
2023-04-30 20:48:38 +02:00
# Apple M1, M2, etc.
# Raspberry Pi 3, 4, Zero 2 (64-bit)
2023-04-22 10:08:12 +02:00
CFLAGS += -mcpu= native
2023-03-10 19:40:58 +01:00
CXXFLAGS += -mcpu= native
e n d i f
2023-06-04 22:34:30 +02:00
2023-03-10 19:40:58 +01:00
i f n e q ( $( filter armv 6%,$ ( UNAME_M ) ) , )
2023-04-30 20:48:38 +02:00
# Raspberry Pi 1, Zero
2023-03-10 19:40:58 +01:00
CFLAGS += -mfpu= neon-fp-armv8 -mfp16-format= ieee -mno-unaligned-access
e n d i f
2023-06-04 22:34:30 +02:00
2023-03-10 19:40:58 +01:00
i f n e q ( $( filter armv 7%,$ ( UNAME_M ) ) , )
2023-04-30 20:48:38 +02:00
# Raspberry Pi 2
2023-03-10 19:40:58 +01:00
CFLAGS += -mfpu= neon-fp-armv8 -mfp16-format= ieee -mno-unaligned-access -funsafe-math-optimizations
e n d i f
2023-06-04 22:34:30 +02:00
2023-03-10 19:40:58 +01:00
i f n e q ( $( filter armv 8%,$ ( UNAME_M ) ) , )
2023-04-30 20:48:38 +02:00
# Raspberry Pi 3, 4, Zero 2 (32-bit)
2023-03-10 19:40:58 +01:00
CFLAGS += -mfp16-format= ieee -mno-unaligned-access
e n d i f
2023-06-07 09:59:52 +02:00
i f d e f L L A M A _ N O _ K _ Q U A N T S
k_quants.o : k_quants .c k_quants .h
$( CC) $( CFLAGS) -c $< -o $@
e n d i f # LLAMA_NO_K_QUANTS
2023-03-10 19:40:58 +01:00
#
# Print build information
#
$(info I llama.cpp build info : )
$(info I UNAME_S : $( UNAME_S ) )
$(info I UNAME_P : $( UNAME_P ) )
$(info I UNAME_M : $( UNAME_M ) )
$(info I CFLAGS : $( CFLAGS ) )
$(info I CXXFLAGS : $( CXXFLAGS ) )
$(info I LDFLAGS : $( LDFLAGS ) )
$(info I CC : $( CCV ) )
$(info I CXX : $( CXXV ) )
$( info )
#
# Build library
#
2023-06-07 09:59:52 +02:00
ggml.o : ggml .c ggml .h ggml -cuda .h
2023-04-14 21:39:48 +02:00
$( CC) $( CFLAGS) -c $< -o $@
2023-03-10 19:40:58 +01:00
2023-06-18 08:09:47 +02:00
llama.o : llama .cpp ggml .h ggml -cuda .h ggml -metal .h llama .h llama -util .h
2023-04-14 21:39:48 +02:00
$( CXX) $( CXXFLAGS) -c $< -o $@
2023-03-22 06:32:36 +01:00
2023-03-25 19:26:40 +01:00
common.o : examples /common .cpp examples /common .h
2023-04-14 21:39:48 +02:00
$( CXX) $( CXXFLAGS) -c $< -o $@
2023-03-10 19:40:58 +01:00
2023-06-07 09:59:52 +02:00
libllama.so : llama .o ggml .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) -shared -fPIC -o $@ $^ $( LDFLAGS)
2023-03-10 19:40:58 +01:00
clean :
2023-06-28 17:53:37 +02:00
rm -vf *.o *.so main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server vdot train-text-from-scratch embd-input-test build-info.h
2023-03-10 19:40:58 +01:00
2023-05-01 18:23:47 +02:00
#
# Examples
#
2023-06-07 09:59:52 +02:00
main : examples /main /main .cpp build -info .h ggml .o llama .o common .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-03-23 12:41:32 +01:00
@echo
@echo '==== Run ./main -h for help. ===='
@echo
2023-03-10 19:40:58 +01:00
2023-06-16 20:58:09 +02:00
simple : examples /simple /simple .cpp build -info .h ggml .o llama .o common .o $( OBJS )
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-06-07 09:59:52 +02:00
quantize : examples /quantize /quantize .cpp build -info .h ggml .o llama .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-03-25 19:26:40 +01:00
2023-06-07 09:59:52 +02:00
quantize-stats : examples /quantize -stats /quantize -stats .cpp build -info .h ggml .o llama .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-04-08 00:09:18 +02:00
2023-06-07 09:59:52 +02:00
perplexity : examples /perplexity /perplexity .cpp build -info .h ggml .o llama .o common .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-03-10 19:40:58 +01:00
2023-06-07 09:59:52 +02:00
embedding : examples /embedding /embedding .cpp build -info .h ggml .o llama .o common .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-03-28 08:11:09 +02:00
2023-06-07 09:59:52 +02:00
save-load-state : examples /save -load -state /save -load -state .cpp build -info .h ggml .o llama .o common .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-04-18 21:00:14 +02:00
2023-06-07 09:59:52 +02:00
server : examples /server /server .cpp examples /server /httplib .h examples /server /json .hpp build -info .h ggml .o llama .o common .o $( OBJS )
2023-05-27 19:04:14 +02:00
$( CXX) $( CXXFLAGS) -Iexamples/server $( filter-out %.h,$( filter-out %.hpp,$^) ) -o $@ $( LDFLAGS)
2023-06-28 17:53:37 +02:00
libembdinput.so : examples /embd -input /embd -input .h examples /embd -input /embd -input -lib .cpp build -info .h ggml .o llama .o common .o $( OBJS )
$( CXX) --shared $( CXXFLAGS) $( filter-out %.h,$( filter-out %.hpp,$^) ) -o $@ $( LDFLAGS)
embd-input-test : libembdinput .so examples /embd -input /embd -input -test .cpp build -info .h ggml .o llama .o common .o $( OBJS )
$( CXX) $( CXXFLAGS) $( filter-out %.so,$( filter-out %.h,$( filter-out %.hpp,$^) ) ) -o $@ $( LDFLAGS) -L. -lembdinput
2023-06-15 19:42:48 +02:00
train-text-from-scratch : examples /train -text -from -scratch /train -text -from -scratch .cpp build -info .h ggml .o llama .o $( OBJS )
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-05-01 18:23:47 +02:00
build-info.h : $( wildcard .git /index ) scripts /build -info .sh
2023-05-03 02:52:35 +02:00
@sh scripts/build-info.sh > $@ .tmp
2023-05-01 18:23:47 +02:00
@if ! cmp -s $@ .tmp $@ ; then \
mv $@ .tmp $@ ; \
else \
rm $@ .tmp; \
fi
2023-04-13 16:03:57 +02:00
2023-03-10 19:40:58 +01:00
#
# Tests
#
2023-06-07 09:59:52 +02:00
benchmark-matmult : examples /benchmark /benchmark -matmult .cpp build -info .h ggml .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $( filter-out %.h,$^) -o $@ $( LDFLAGS)
2023-04-30 14:32:37 +02:00
./$@
2023-04-13 16:03:57 +02:00
2023-06-07 09:59:52 +02:00
vdot : pocs /vdot /vdot .cpp ggml .o $( OBJS )
2023-05-01 18:23:47 +02:00
$( CXX) $( CXXFLAGS) $^ -o $@ $( LDFLAGS)
2023-05-21 16:03:44 +02:00
.PHONY : tests clean
2023-03-10 19:40:58 +01:00
tests :
bash ./tests/run-tests.sh