llama.cpp for SYCL
Background
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.
oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.
Intel uses SYCL as its direct programming language to support CPUs, GPUs, and FPGAs.
To avoid reinventing the wheel, this code refers to other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use the open-source tool SYCLomatic (commercial release: Intel® DPC++ Compatibility Tool) to migrate the code to SYCL.
llama.cpp for SYCL is used to support Intel GPUs.
For Intel CPUs, we recommend using llama.cpp for X86 (Intel MKL build).
News
- 2024.3
  - New baseline is ready: tag b2437.
  - Support multiple cards: --split-mode: [none|layer]; [row] is not supported yet and is under development.
  - Support assigning the main GPU with --main-gpu, which replaces $GGML_SYCL_DEVICE.
  - Support detecting all GPUs that use level-zero and share the same top Max compute units.
  - Support OPs:
    - hardsigmoid
    - hardswish
    - pool2d
- 2024.1
  - Create SYCL backend for Intel GPU.
  - Support Windows build.
OS
OS | Status | Verified |
---|---|---|
Linux | Support | Ubuntu 22.04, Fedora Silverblue 39 |
Windows | Support | Windows 11 |
Intel GPU
Verified
Intel GPU | Status | Verified Model |
---|---|---|
Intel Data Center Max Series | Support | Max 1550 |
Intel Data Center Flex Series | Support | Flex 170 |
Intel Arc Series | Support | Arc 770, 730M |
Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
Intel iGPU | Support | iGPU in i5-1250P, i7-1260P, i7-1165G7 |
Note: If the iGPU has fewer than 80 EUs (Execution Units), inference will be too slow to be usable.
Memory
Memory is a limiting factor when running LLMs on GPUs.
When llama.cpp runs, it prints a log line showing how much memory is allocated on the GPU, so you can see how much memory your case needs. For example: llm_load_tensors: buffer size = 3577.56 MiB.
For iGPU, please make sure the shared memory from host memory is enough. For llama-2-7b.Q4_0, at least 8GB of host memory is recommended.
For dGPU, please make sure the device memory is enough. For llama-2-7b.Q4_0, at least 4GB of device memory is recommended.
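The buffer size reported in the log can be checked against the memory you have available. The following is a minimal sketch, assuming the log line keeps the format shown in the example; the function names buffer_size_mib and fits_in_mib are hypothetical and not part of llama.cpp.

```shell
# Hypothetical helpers; the log format is assumed from the example line in this guide.

# Print the MiB figure from every "buffer size = N MiB" line read on stdin.
buffer_size_mib() {
    sed -n 's/.*buffer size = *\([0-9][0-9.]*\) MiB.*/\1/p'
}

# Succeed if the required MiB (may be fractional) fits in the available MiB.
fits_in_mib() {
    awk -v need="$1" -v have="$2" 'BEGIN { exit !(need + 0 <= have + 0) }'
}
```

For example, you could save the startup log to a file, run buffer_size_mib < that file, and pass the result to fits_in_mib together with the memory size of your device in MiB.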
Nvidia GPU
Verified
Nvidia GPU | Status | Verified Model |
---|---|---|
Ampere Series | Support | A100 |
oneMKL for CUDA
The current oneMKL release does not contain the oneMKL cuBLAS backend. As a result, for Nvidia GPUs, oneMKL must be built from source.
git clone https://github.com/oneapi-src/oneMKL
cd oneMKL
mkdir build
cd build
cmake -G Ninja .. -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_CUBLAS_BACKEND=ON
ninja
# Add paths as necessary
Docker
Note:
- Only docker on Linux is tested. Docker on WSL may not work.
- You may need to install the Intel GPU driver on the host machine (see the Linux section for instructions).
Build the image
You can choose between F16 and F32 build. F16 is faster for long-prompt inference.
# For F16:
#docker build -t llama-cpp-sycl --build-arg="LLAMA_SYCL_F16=ON" -f .devops/main-intel.Dockerfile .
# Or, for F32:
docker build -t llama-cpp-sycl -f .devops/main-intel.Dockerfile .
# Note: you can also use the ".devops/server-intel.Dockerfile", which compiles the "server" example
Run
# Firstly, find all the DRI cards:
ls -la /dev/dri
# Then, pick the card that you want to use.
# For example with "/dev/dri/card1"
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-sycl -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
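Once you have picked the DRI nodes from the listing, the --device flags can be generated instead of typed by hand. This is a small sketch; dri_device_flags is a hypothetical helper name, and the node names you pass in must come from your own ls -la /dev/dri output.

```shell
# Hypothetical helper: print one "--device" flag per DRI node name given as an argument.
# Each node is mapped to the same path inside the container.
dri_device_flags() {
    flags=""
    for name in "$@"; do
        flags="${flags:+$flags }--device /dev/dri/$name:/dev/dri/$name"
    done
    printf '%s\n' "$flags"
}

# Usage (left unquoted on purpose so the flags are split into separate words):
#   docker run -it --rm -v "$(pwd):/app:Z" $(dri_device_flags renderD128 card1) llama-cpp-sycl ...
```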
Linux
Setup Environment
- Install Intel GPU driver.
a. Please install the Intel GPU driver by following the official guide: Install GPU Drivers.
Note: for iGPU, please install the client GPU driver.
b. Add the user to the video and render groups.
sudo usermod -aG render username
sudo usermod -aG video username
Note: log in again for the change to take effect.
c. Check
sudo apt install clinfo
sudo clinfo -l
Output (example):
Platform #0: Intel(R) OpenCL Graphics
`-- Device #0: Intel(R) Arc(TM) A770 Graphics
Platform #0: Intel(R) OpenCL HD Graphics
`-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
- Install Intel® oneAPI Base toolkit.
a. Please follow the procedure in Get the Intel® oneAPI Base Toolkit.
We recommend installing to the default folder: /opt/intel/oneapi.
The following guide uses the default folder as an example. If you use another folder, please adjust the paths in the guide accordingly.
b. Check
source /opt/intel/oneapi/setvars.sh
sycl-ls
There should be one or more level-zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].
Output (example):
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
- Build locally:
Note:
- You can choose between F16 and F32 build. F16 is faster for long-prompt inference.
- By default, all binaries are built, which takes more time. To reduce the build time, we recommend building example/main only.
mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh
# For FP16:
#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON
# Or, for FP32:
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
# Or, for Nvidia GPUs:
#cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
# Build example/main only
#cmake --build . --config Release --target main
# Or, build all binaries
cmake --build . --config Release -v
cd ..
or
./examples/sycl/build.sh
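The three configure variants above differ only in one extra flag, so a build script can select them by name. The sketch below is an illustration only: sycl_cmake_flags is a hypothetical helper, and it uses nothing but the flags already listed in this guide.

```shell
# Hypothetical helper: print the cmake configure flags for one build variant.
# Variants: fp32 (default), fp16, nvidia.
sycl_cmake_flags() {
    base="-DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx"
    case "${1:-fp32}" in
        fp32)   printf '%s\n' "$base" ;;
        fp16)   printf '%s\n' "$base -DLLAMA_SYCL_F16=ON" ;;
        nvidia) printf '%s\n' "$base -DLLAMA_SYCL_TARGET=NVIDIA" ;;
        *)      echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Usage, from inside the build folder after sourcing setvars.sh:
#   cmake .. $(sycl_cmake_flags fp16)
```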
Run
- Put the model file in the models folder
You could download llama-2-7b.Q4_0.gguf as an example.
- Enable the oneAPI running environment
source /opt/intel/oneapi/setvars.sh
- List device IDs
Run without parameters:
./build/bin/ls-sycl-device
# or run the "main" executable and look at the output log:
./build/bin/main
Check the ID in the startup log, like:
found 6 SYCL devices:
| | | |Compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
| 2| [opencl:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 3| [opencl:gpu:1]| Intel(R) UHD Graphics 770| 3.0| 32| 512| 32| 53651849216|
| 4| [opencl:cpu:0]| 13th Gen Intel(R) Core(TM) i7-13700K| 3.0| 24| 8192| 64| 67064815616|
| 5| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 24|67108864| 64| 67064815616|
Attribute | Note |
---|---|
compute capability 1.3 | Level-zero runtime, recommended |
compute capability 3.0 | OpenCL runtime, slower than level-zero in most cases |
- Device selection and execution of llama.cpp
There are two device selection modes:
- Single device: Use one device assigned by the user.
- Multiple devices: Automatically choose the devices that share the highest Max compute units value.
Device selection | Parameter |
---|---|
Single device | --split-mode none --main-gpu DEVICE_ID |
Multiple devices | --split-mode layer (default) |
Examples:
- Use device 0:
ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
or run by script:
./examples/sycl/run_llama2.sh 0
- Use multiple devices:
ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
or run by script:
./examples/sycl/run_llama2.sh
Note:
- By default, mmap is used to read the model file. In some cases, this causes a hang. We recommend using the --no-mmap parameter to disable mmap() and avoid this issue.
- Verify the device ID in output
Check that the selected GPU is shown in the output, like:
detect 1 SYCL GPUs: [0] with top Max compute units:512
Or
use 1 SYCL GPUs: [0] with Max compute units:512
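To cross-check the selection, you can predict which IDs multi-device mode should pick from the device table and compare them with the IDs in the startup line. This is only a sketch of the rule described above (level-zero GPUs sharing the highest Max compute units), not the actual llama.cpp implementation. The function names are hypothetical, and the parsing assumes the table and log formats shown in this guide.

```shell
# Hypothetical helpers; both read log text on stdin.

# Print the IDs of the level-zero GPUs that share the highest "Max compute units".
# Table columns after splitting on "|": $2 = ID, $3 = device type, $6 = Max compute units.
top_level_zero_ids() {
    awk -F'|' '
        $3 ~ /level_zero:gpu/ {
            id = $2 + 0; units = $6 + 0
            if (!seen || units > best) { best = units; ids = id ""; seen = 1 }
            else if (units == best)    { ids = ids " " id }
        }
        END { if (seen) print ids }
    '
}

# Print the bracketed ID list from a "... SYCL GPUs: [..] ..." startup line.
selected_gpu_ids() {
    sed -n 's/.*SYCL GPUs: \[\([^]]*\)\].*/\1/p'
}
```

For the example table in this section, top_level_zero_ids prints 0 because the Arc A770 (512 units) outranks the UHD Graphics 770 (32 units), which matches the [0] in the startup line.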
Windows
Setup Environment
- Install Intel GPU driver.
Please install the Intel GPU driver by following the official guide: Install GPU Drivers.
Note: The driver is mandatory for compute functionality.
- Install Visual Studio.
Please install Visual Studio, which is needed to enable the oneAPI environment on Windows.
- Install Intel® oneAPI Base toolkit.
a. Please follow the procedure in Get the Intel® oneAPI Base Toolkit.
We recommend installing to the default folder: C:\Program Files (x86)\Intel\oneAPI.
The following guide uses the default folder as an example. If you use another folder, please adjust the paths in the guide accordingly.
b. Enable oneAPI running environment:
- In Search, input 'oneAPI'.
Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"
- In Run:
In CMD:
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
c. Check GPU
In oneAPI command line:
sycl-ls
There should be one or more level-zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].
Output (example):
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]
- Install cmake & make
a. Download & install cmake for Windows: https://cmake.org/download/
b. Download & install mingw-w64 make for Windows provided by w64devkit
- Download the 1.19.0 version of w64devkit.
- Extract w64devkit on your PC.
- Add the bin folder path to the Windows system PATH environment variable, like C:\xxx\w64devkit\bin\.
Build locally:
In oneAPI command line window:
if not exist build mkdir build
cd build
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
:: for FP16
:: faster for long-prompt inference
:: cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON
:: for FP32
cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
:: build example/main only
:: make main
:: build all binaries
make -j
cd ..
or
.\examples\sycl\win-build-sycl.bat
Note:
- By default, all binaries are built, which takes more time. To reduce the build time, we recommend building example/main only.
Run
- Put the model file in the models folder
You could download llama-2-7b.Q4_0.gguf as an example.
- Enable the oneAPI running environment
- In Search, input 'oneAPI'.
Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"
- In Run:
In CMD:
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
- List device IDs
Run without parameters:
build\bin\ls-sycl-device.exe
or
build\bin\main.exe
Check the ID in the startup log, like:
found 6 SYCL devices:
| | | |Compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
| 2| [opencl:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 3| [opencl:gpu:1]| Intel(R) UHD Graphics 770| 3.0| 32| 512| 32| 53651849216|
| 4| [opencl:cpu:0]| 13th Gen Intel(R) Core(TM) i7-13700K| 3.0| 24| 8192| 64| 67064815616|
| 5| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 24|67108864| 64| 67064815616|
Attribute | Note |
---|---|
compute capability 1.3 | Level-zero runtime, recommended |
compute capability 3.0 | OpenCL runtime, slower than level-zero in most cases |
- Device selection and execution of llama.cpp
There are two device selection modes:
- Single device: Use one device assigned by the user.
- Multiple devices: Automatically choose the devices that share the highest Max compute units value.
Device selection | Parameter |
---|---|
Single device | --split-mode none --main-gpu DEVICE_ID |
Multiple devices | --split-mode layer (default) |
Examples:
- Use device 0:
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm none -mg 0
- Use multiple devices:
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
or run by script:
.\examples\sycl\win-run-llama2.bat
Note:
- By default, mmap is used to read the model file. In some cases, this causes a hang. We recommend using the --no-mmap parameter to disable mmap() and avoid this issue.
- Verify the device ID in output
Check that the selected GPU is shown in the output, like:
detect 1 SYCL GPUs: [0] with top Max compute units:512
Or
use 1 SYCL GPUs: [0] with Max compute units:512
Environment Variable
Build
Name | Value | Function |
---|---|---|
LLAMA_SYCL | ON (mandatory) | Enable build with SYCL code path. For FP32/FP16, LLAMA_SYCL=ON is mandatory. |
LLAMA_SYCL_F16 | ON (optional) | Enable FP16 build with SYCL code path. Faster for long-prompt inference. For FP32, do not set it. |
CMAKE_C_COMPILER | icx | Use icx compiler for SYCL code path |
CMAKE_CXX_COMPILER | icpx (Linux), icx (Windows) | Use icpx/icx for SYCL code path |
Running
Name | Value | Function |
---|---|---|
GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
ZES_ENABLE_SYSMAN | 0 (default) or 1 | Enables querying the GPU's free memory through sycl::aspect::ext_intel_free_memory. Recommended when --split-mode = layer |
Known Issue
- Hang during startup
llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy behaves abnormally and blocks.
Solution: add --no-mmap or --mmap 0.
- Split-mode: [row] is not supported
It is under development.
Q&A
Note: please add the prefix [SYCL] to the issue title so that we can check it as soon as possible.
- Error: error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory.
The oneAPI running environment has not been enabled.
Install the oneAPI Base Toolkit and enable it with:
source /opt/intel/oneapi/setvars.sh
- On Windows, there is no result and no error.
The oneAPI running environment has not been enabled.
- A compile error occurs.
Remove the build folder and try again.
- I cannot see [ext_oneapi_level_zero:gpu:0] after installing the GPU driver on Linux.
Please run sudo sycl-ls.
If you see it in the result, please add your user to the video and render groups:
sudo usermod -aG render username
sudo usermod -aG video username
Then log in again.
If you do not see it, please check the GPU driver installation steps again.
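The last check can be scripted. The sketch below assumes that sycl-ls prints one device per line in the format shown earlier and that the group list comes from id -nG; the function names are hypothetical.

```shell
# Hypothetical helpers for the level-zero and group checks.

# Count level-zero GPU entries in `sycl-ls` output read from stdin.
count_level_zero_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
    grep -c 'level_zero:gpu' || true
}

# Print which of the render/video groups are missing from a group list.
# Pass the output of `id -nG username` as arguments.
missing_gpu_groups() {
    missing=""
    for want in render video; do
        found=0
        for have in "$@"; do
            if [ "$have" = "$want" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            missing="${missing:+$missing }$want"
        fi
    done
    printf '%s\n' "$missing"
}
```

For example, sycl-ls | count_level_zero_gpus should print at least 1, and missing_gpu_groups $(id -nG "$USER") should print an empty line once both groups are set up.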
Todo
- Support row layer split for multiple card runs.