mirror of
https://github.com/ggerganov/llama.cpp.git
synced 2025-01-26 03:12:23 +01:00
# llama.cpp for SYCL

[Background](#background)

[OS](#os)

[Intel GPU](#intel-gpu)

[Linux](#linux)

[Environment Variable](#environment-variable)

[Known Issue](#known-issue)

[Todo](#todo)

## Background

SYCL is a higher-level programming model that improves programming productivity on hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.

oneAPI is an open, standards-based specification that supports multiple architecture types, including but not limited to GPUs, CPUs, and FPGAs. The specification covers both direct programming and API-based programming paradigms.

Intel uses SYCL as its direct programming language to support CPUs, GPUs, and FPGAs.

To avoid reinventing the wheel, this code refers to the other code paths in llama.cpp (such as OpenBLAS, cuBLAS, and CLBlast). The open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (commercial release: [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) was used to migrate the code to SYCL.

llama.cpp for SYCL is used to support Intel GPUs.

For Intel CPUs, we recommend using llama.cpp for x86 (built with Intel MKL).

## OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Ongoing| |

## Intel GPU

|Intel GPU| Status | Verified Model|
|-|-|-|
|Intel Data Center Max Series| Support| Max 1550|
|Intel Data Center Flex Series| Support| Flex 170|
|Intel Arc Series| Support| Arc A770|
|Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
|Intel iGPU| Support| iGPU in i5-1250P, i7-1165G7|

## Linux

### Setup Environment

1. Install Intel GPU driver.

a. Install the Intel GPU driver by following the official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).

Note: for iGPU, install the client GPU driver.

b. Add the user to the groups video and render.

```
sudo usermod -aG render username
sudo usermod -aG video username
```

Note: log in again for the change to take effect.

c. Check

```
sudo apt install clinfo
sudo clinfo -l
```

Output (example):

```
Platform #0: Intel(R) OpenCL Graphics
`-- Device #0: Intel(R) Arc(TM) A770 Graphics

Platform #0: Intel(R) OpenCL HD Graphics
`-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```

2. Install Intel® oneAPI Base toolkit.

a. Follow the procedure in [Get the Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).

We recommend installing to the default folder: **/opt/intel/oneapi**.

The rest of this guide uses the default folder as an example. If you install to another folder, adjust the paths in the commands accordingly.

b. Check

```
source /opt/intel/oneapi/setvars.sh

sycl-ls
```

There should be one or more Level-Zero devices, such as **[ext_oneapi_level_zero:gpu:0]**.

Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]

```

3. Build locally:

```
mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh

# for FP16
#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference

# for FP32
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# build examples/main only
#cmake --build . --config Release --target main

# build all binaries
cmake --build . --config Release -v

```

or

```
./examples/sycl/build.sh
```

Note:

- By default, all binaries are built, which takes more time. To reduce the build time, we recommend building **examples/main** only.

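The `sycl-ls` check from the setup steps can be scripted. The following is a minimal sketch, assuming the device lines keep the `[ext_oneapi_level_zero:gpu:N]` prefix shown in the example output; the function name `count_level_zero_gpus` is hypothetical and is not part of llama.cpp or oneAPI.

```shell
#!/bin/sh
# Hypothetical helper: count Level-Zero GPU entries in sycl-ls style
# output read from stdin. Assumes each device line starts with a
# "[ext_oneapi_level_zero:gpu:N]" tag, as in the example output.
count_level_zero_gpus() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the function's status 0.
    grep -c '^\[ext_oneapi_level_zero:gpu:[0-9][0-9]*\]' || true
}

# Demo on two lines copied from the example sycl-ls output.
count_level_zero_gpus <<'EOF'
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
EOF
```

On a real system, run `sycl-ls | count_level_zero_gpus` after sourcing `setvars.sh`. A result of 0 suggests that the driver or toolkit setup needs another look.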
### Run

1. Put the model file in the folder **models**

2. Enable the oneAPI runtime environment

```
source /opt/intel/oneapi/setvars.sh
```

3. List device ID

Run without parameters:

```
./build/bin/ls-sycl-device
```

or

```
./build/bin/main
```

Check the ID in the startup log, like:

```
found 4 SYCL devices:
Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
max compute_units 24, max work group size 67108864, max sub group size 64, global mem size 67065057280
Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0,
max compute_units 24, max work group size 8192, max sub group size 64, global mem size 67065057280
Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136

```

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-Zero runtime, recommended |
|compute capability 3.0|OpenCL runtime, slower than Level-Zero in most cases|

4. Set device ID and execute llama.cpp

Set device ID = 0 with **GGML_SYCL_DEVICE=0**

```
GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```

or run the script:

```
./examples/sycl/run-llama2.sh
```

Note:

- By default, mmap is used to read the model file. In some cases, this causes a hang. We recommend using the parameter **--no-mmap** to disable mmap() and avoid this issue.

5. Check the device ID in the output

Like:
```
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
```

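Picking the device ID by eye works, but the lookup can also be scripted. The sketch below assumes the startup log keeps the `Device N: ..., compute capability X.Y,` line format shown in the example, and that capability 1.3 marks the Level-Zero runtime as in the table. The function name `level_zero_device_ids` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of devices reported with
# "compute capability 1.3" (the Level-Zero runtime, per the table in
# this README) from a startup log read on stdin. Assumes the
# "Device N: ..." line format shown in the example log.
level_zero_device_ids() {
    sed -n 's/^[[:space:]]*Device \([0-9][0-9]*\):.*compute capability 1\.3,.*$/\1/p'
}

# Demo on lines copied from the example log.
level_zero_device_ids <<'EOF'
found 4 SYCL devices:
Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
EOF
```

With a real build, something like `./build/bin/ls-sycl-device 2>&1 | level_zero_device_ids` could feed the result into **GGML_SYCL_DEVICE**. Whether the list is written to stdout or stderr is not stated here, so the `2>&1` is a precaution.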
## Environment Variable

#### Build

|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable build with SYCL code path. <br>For FP32/FP16, LLAMA_SYCL=ON is mandatory.|
|LLAMA_SYCL_F16|ON (optional)|Enable FP16 build with SYCL code path. Faster for long-prompt inference. <br>For FP32, do not set it.|
|CMAKE_C_COMPILER|icx|Use the icx compiler for the SYCL code path|
|CMAKE_CXX_COMPILER|icpx|Use the icpx compiler for the SYCL code path|

#### Running

|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Select the device ID to use. The available IDs are listed in the default startup output.|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable the log output controlled by the macro GGML_SYCL_DEBUG|

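The runtime variables can be combined with the run command from this README in a small wrapper. This is only a sketch: `run_llama_sycl`, `DEVICE_ID`, `MODEL`, and `DRY_RUN` are names made up for illustration, and the model path and options are copied from the example command above. It always adds `--no-mmap`, following the recommendation in this README.

```shell
#!/bin/sh
# Hypothetical wrapper: assemble the run command from the README example.
# DEVICE_ID, MODEL and DRY_RUN are names made up for this sketch.
run_llama_sycl() {
    device_id="${DEVICE_ID:-0}"
    model="${MODEL:-models/llama-2-7b.Q4_0.gguf}"
    # Prepend the fixed options; extra arguments (e.g. -p "...") follow.
    set -- ./build/bin/main -m "$model" -n 400 -e -ngl 33 --no-mmap "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print what would be executed.
        echo "GGML_SYCL_DEVICE=$device_id GGML_SYCL_DEBUG=${GGML_SYCL_DEBUG:-0} $*"
    else
        GGML_SYCL_DEVICE="$device_id" GGML_SYCL_DEBUG="${GGML_SYCL_DEBUG:-0}" "$@"
    fi
}

# Dry run: prints the command line without starting llama.cpp.
DRY_RUN=1 run_llama_sycl -p "Hello"
```

Dropping `DRY_RUN=1` runs the binary for real, which requires a finished build and a sourced oneAPI environment.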
## Known Issue

- Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.

  The oneAPI runtime environment has not been enabled.

  Install the oneAPI Base Toolkit and enable it with: `source /opt/intel/oneapi/setvars.sh`.

- Hang during startup

  llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy behaves abnormally and blocks.

  Solution: add **--no-mmap**.

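The first issue can be caught before launching llama.cpp. The check below is a heuristic sketch, not an official oneAPI test: it assumes that `sycl-ls` is only found on `PATH` once `setvars.sh` has been sourced, as in the setup steps. The function name `oneapi_env_active` is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the libsycl.so error: treat the oneAPI
# environment as active when sycl-ls can be found on PATH. This is a
# heuristic assumed for this sketch, not an official oneAPI check.
oneapi_env_active() {
    command -v sycl-ls >/dev/null 2>&1
}

if oneapi_env_active; then
    echo "oneAPI environment looks active"
else
    echo "oneAPI environment not found; run: source /opt/intel/oneapi/setvars.sh"
fi
```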
## Todo

- Support building on Windows.

- Support multiple cards.