mirror of
https://github.com/ggerganov/llama.cpp.git
synced 2025-01-26 03:12:23 +01:00
update guide
This commit is contained in:
parent
be55695eff
commit
7764ab911d
@ -80,7 +80,14 @@ The following release is verified with good quality:
|
|||||||
|
|
||||||
### Intel GPU
|
### Intel GPU
|
||||||
|
|
||||||
**Verified devices**
|
SYCL backend supports Intel GPU Family:
|
||||||
|
|
||||||
|
- Intel Data Center Max Series
|
||||||
|
- Intel Flex Series, Arc Series
|
||||||
|
- Intel Built-in Arc GPU
|
||||||
|
- Intel iGPU in Core CPU (11th Generation Core CPU and newer, refer to [oneAPI supported GPU](https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html#inpage-nav-1-1)).
|
||||||
|
|
||||||
|
#### Verified devices
|
||||||
|
|
||||||
| Intel GPU | Status | Verified Model |
|
| Intel GPU | Status | Verified Model |
|
||||||
|-------------------------------|---------|---------------------------------------|
|
|-------------------------------|---------|---------------------------------------|
|
||||||
@ -88,7 +95,7 @@ The following release is verified with good quality:
|
|||||||
| Intel Data Center Flex Series | Support | Flex 170 |
|
| Intel Data Center Flex Series | Support | Flex 170 |
|
||||||
| Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
|
| Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
|
||||||
| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
|
| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
|
||||||
| Intel iGPU | Support | iGPU in i5-1250P, i7-1260P, i7-1165G7 |
|
| Intel iGPU | Support | iGPU in 13700k, i5-1250P, i7-1260P, i7-1165G7 |
|
||||||
|
|
||||||
*Notes:*
|
*Notes:*
|
||||||
|
|
||||||
@ -237,6 +244,13 @@ Similarly, user targeting Nvidia GPUs should expect at least one SYCL-CUDA devic
|
|||||||
### II. Build llama.cpp
|
### II. Build llama.cpp
|
||||||
|
|
||||||
#### Intel GPU
|
#### Intel GPU
|
||||||
|
|
||||||
|
```
|
||||||
|
./examples/sycl/build.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
or
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
# Export relevant ENV variables
|
# Export relevant ENV variables
|
||||||
source /opt/intel/oneapi/setvars.sh
|
source /opt/intel/oneapi/setvars.sh
|
||||||
@ -276,23 +290,26 @@ cmake --build build --config Release -j -v
|
|||||||
|
|
||||||
### III. Run the inference
|
### III. Run the inference
|
||||||
|
|
||||||
1. Retrieve and prepare model
|
#### Retrieve and prepare model
|
||||||
|
|
||||||
You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
|
You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
|
||||||
|
|
||||||
2. Enable oneAPI running environment
|
##### Check device
|
||||||
|
|
||||||
|
1. Enable oneAPI running environment
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
source /opt/intel/oneapi/setvars.sh
|
source /opt/intel/oneapi/setvars.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
3. List devices information
|
2. List devices information
|
||||||
|
|
||||||
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
|
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
./build/bin/llama-ls-sycl-device
|
./build/bin/llama-ls-sycl-device
|
||||||
```
|
```
|
||||||
|
|
||||||
This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
|
This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
|
||||||
```
|
```
|
||||||
found 2 SYCL devices:
|
found 2 SYCL devices:
|
||||||
@ -304,12 +321,37 @@ found 2 SYCL devices:
|
|||||||
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
|
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
#### Choose level-zero devices
|
||||||
|
|
||||||
4. Launch inference
|
|Chosen Device ID|Setting|
|
||||||
|
|-|-|
|
||||||
|
|0|`export ONEAPI_DEVICE_SELECTOR="level_zero:1"` or no action|
|
||||||
|
|1|`export ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
|
||||||
|
|0 & 1|`export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
|
||||||
|
|
||||||
|
#### Execute
|
||||||
|
|
||||||
|
Choose one of following methods to run.
|
||||||
|
|
||||||
|
1. Script
|
||||||
|
|
||||||
|
- Use device 0:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
./examples/sycl/run_llama2.sh 0
|
||||||
|
```
|
||||||
|
- Use multiple devices:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
./examples/sycl/run_llama2.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Command line
|
||||||
|
Launch inference
|
||||||
|
|
||||||
There are two device selection modes:
|
There are two device selection modes:
|
||||||
|
|
||||||
- Single device: Use one device target specified by the user.
|
- Single device: Use one device assigned by user. Default device id is 0.
|
||||||
- Multiple devices: Automatically choose the devices with the same backend.
|
- Multiple devices: Automatically choose the devices with the same backend.
|
||||||
|
|
||||||
In two device selection modes, the default SYCL backend is level_zero, you can choose other backend supported by SYCL by setting environment variable ONEAPI_DEVICE_SELECTOR.
|
In two device selection modes, the default SYCL backend is level_zero, you can choose other backend supported by SYCL by setting environment variable ONEAPI_DEVICE_SELECTOR.
|
||||||
@ -326,11 +368,6 @@ Examples:
|
|||||||
```sh
|
```sh
|
||||||
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
|
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
|
||||||
```
|
```
|
||||||
or run by script:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
./examples/sycl/run_llama2.sh 0
|
|
||||||
```
|
|
||||||
|
|
||||||
- Use multiple devices:
|
- Use multiple devices:
|
||||||
|
|
||||||
@ -338,12 +375,6 @@ or run by script:
|
|||||||
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
|
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
|
||||||
```
|
```
|
||||||
|
|
||||||
Otherwise, you can run the script:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
./examples/sycl/run_llama2.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
*Notes:*
|
*Notes:*
|
||||||
|
|
||||||
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
|
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
|
||||||
@ -390,7 +421,7 @@ c. Verify installation
|
|||||||
In the oneAPI command line, run the following to print the available SYCL devices:
|
In the oneAPI command line, run the following to print the available SYCL devices:
|
||||||
|
|
||||||
```
|
```
|
||||||
sycl-ls
|
sycl-ls.exe
|
||||||
```
|
```
|
||||||
|
|
||||||
There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device:
|
There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device:
|
||||||
@ -411,6 +442,18 @@ b. The new Visual Studio will install Ninja as default. (If not, please install
|
|||||||
|
|
||||||
### II. Build llama.cpp
|
### II. Build llama.cpp
|
||||||
|
|
||||||
|
You could download the release package for Windows directly, which including binary files and depended oneAPI dll files.
|
||||||
|
|
||||||
|
Choose one of following methods to build from source code.
|
||||||
|
|
||||||
|
1. Script
|
||||||
|
|
||||||
|
```sh
|
||||||
|
.\examples\sycl\win-build-sycl.bat
|
||||||
|
```
|
||||||
|
|
||||||
|
2. CMake
|
||||||
|
|
||||||
On the oneAPI command line window, step into the llama.cpp main directory and run the following:
|
On the oneAPI command line window, step into the llama.cpp main directory and run the following:
|
||||||
|
|
||||||
```
|
```
|
||||||
@ -425,12 +468,8 @@ cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPI
|
|||||||
cmake --build build --config Release -j
|
cmake --build build --config Release -j
|
||||||
```
|
```
|
||||||
|
|
||||||
Otherwise, run the `win-build-sycl.bat` wrapper which encapsulates the former instructions:
|
|
||||||
```sh
|
|
||||||
.\examples\sycl\win-build-sycl.bat
|
|
||||||
```
|
|
||||||
|
|
||||||
Or, use CMake presets to build:
|
Or, use CMake presets to build:
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
cmake --preset x64-windows-sycl-release
|
cmake --preset x64-windows-sycl-release
|
||||||
cmake --build build-x64-windows-sycl-release -j --target llama-cli
|
cmake --build build-x64-windows-sycl-release -j --target llama-cli
|
||||||
@ -442,7 +481,9 @@ cmake --preset x64-windows-sycl-debug
|
|||||||
cmake --build build-x64-windows-sycl-debug -j --target llama-cli
|
cmake --build build-x64-windows-sycl-debug -j --target llama-cli
|
||||||
```
|
```
|
||||||
|
|
||||||
Or, you can use Visual Studio to open llama.cpp folder as a CMake project. Choose the sycl CMake presets (`x64-windows-sycl-release` or `x64-windows-sycl-debug`) before you compile the project.
|
3. Visual Studio
|
||||||
|
|
||||||
|
You can use Visual Studio to open llama.cpp folder as a CMake project. Choose the sycl CMake presets (`x64-windows-sycl-release` or `x64-windows-sycl-debug`) before you compile the project.
|
||||||
|
|
||||||
*Notes:*
|
*Notes:*
|
||||||
|
|
||||||
@ -450,23 +491,25 @@ Or, you can use Visual Studio to open llama.cpp folder as a CMake project. Choos
|
|||||||
|
|
||||||
### III. Run the inference
|
### III. Run the inference
|
||||||
|
|
||||||
1. Retrieve and prepare model
|
#### Retrieve and prepare model
|
||||||
|
|
||||||
You can refer to the general [*Prepare and Quantize*](README#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
|
You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
|
||||||
|
|
||||||
2. Enable oneAPI running environment
|
##### Check device
|
||||||
|
|
||||||
|
1. Enable oneAPI running environment
|
||||||
|
|
||||||
On the oneAPI command line window, run the following and step into the llama.cpp directory:
|
On the oneAPI command line window, run the following and step into the llama.cpp directory:
|
||||||
```
|
```
|
||||||
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
|
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
|
||||||
```
|
```
|
||||||
|
|
||||||
3. List devices information
|
2. List devices information
|
||||||
|
|
||||||
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
|
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
|
||||||
|
|
||||||
```
|
```
|
||||||
build\bin\ls-sycl-device.exe
|
build\bin\llama-ls-sycl-device.exe
|
||||||
```
|
```
|
||||||
|
|
||||||
This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
|
This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
|
||||||
@ -479,9 +522,27 @@ found 2 SYCL devices:
|
|||||||
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
|
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
|
||||||
|
|
||||||
```
|
```
|
||||||
|
#### Choose level-zero devices
|
||||||
|
|
||||||
|
|Chosen Device ID|Setting|
|
||||||
|
|-|-|
|
||||||
|
|0|`set ONEAPI_DEVICE_SELECTOR="level_zero:1"` or no action|
|
||||||
|
|1|`set ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
|
||||||
|
|0 & 1|`set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
|
||||||
|
|
||||||
4. Launch inference
|
#### Execute
|
||||||
|
|
||||||
|
Choose one of following methods to run.
|
||||||
|
|
||||||
|
1. Script
|
||||||
|
|
||||||
|
```
|
||||||
|
examples\sycl\win-run-llama2.bat
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Command line
|
||||||
|
|
||||||
|
Launch inference
|
||||||
|
|
||||||
There are two device selection modes:
|
There are two device selection modes:
|
||||||
|
|
||||||
@ -508,11 +569,7 @@ build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website ca
|
|||||||
```
|
```
|
||||||
build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
|
build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
|
||||||
```
|
```
|
||||||
Otherwise, run the following wrapper script:
|
|
||||||
|
|
||||||
```
|
|
||||||
.\examples\sycl\win-run-llama2.bat
|
|
||||||
```
|
|
||||||
|
|
||||||
Note:
|
Note:
|
||||||
|
|
||||||
@ -526,17 +583,18 @@ Or
|
|||||||
use 1 SYCL GPUs: [0] with Max compute units:512
|
use 1 SYCL GPUs: [0] with Max compute units:512
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Environment Variable
|
## Environment Variable
|
||||||
|
|
||||||
#### Build
|
#### Build
|
||||||
|
|
||||||
| Name | Value | Function |
|
| Name | Value | Function |
|
||||||
|--------------------|-----------------------------------|---------------------------------------------|
|
|--------------------|-----------------------------------|---------------------------------------------|
|
||||||
| GGML_SYCL | ON (mandatory) | Enable build with SYCL code path. |
|
| GGML_SYCL | ON (mandatory) | Enable build with SYCL code path.<br>FP32 path - recommended for better perforemance than FP16 on quantized model|
|
||||||
| GGML_SYCL_TARGET | INTEL *(default)* \| NVIDIA | Set the SYCL target device type. |
|
| GGML_SYCL_TARGET | INTEL *(default)* \| NVIDIA | Set the SYCL target device type. |
|
||||||
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. |
|
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. |
|
||||||
| CMAKE_C_COMPILER | icx | Set *icx* compiler for SYCL code path. |
|
| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
|
||||||
| CMAKE_CXX_COMPILER | icpx *(Linux)*, icx *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
|
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
|
||||||
|
|
||||||
#### Runtime
|
#### Runtime
|
||||||
|
|
||||||
@ -572,9 +630,18 @@ use 1 SYCL GPUs: [0] with Max compute units:512
|
|||||||
```
|
```
|
||||||
Otherwise, please double-check the GPU driver installation steps.
|
Otherwise, please double-check the GPU driver installation steps.
|
||||||
|
|
||||||
|
- Can I report Ollama issue on Intel GPU to llama.cpp SYCL backend?
|
||||||
|
|
||||||
|
No. We can't support Ollama issue directly, because we aren't familiar with Ollama.
|
||||||
|
|
||||||
|
Sugguest reproducing on llama.cpp and report similar issue to llama.cpp. We will surpport it.
|
||||||
|
|
||||||
|
It's same for other projects including llama.cpp SYCL backend.
|
||||||
|
|
||||||
|
|
||||||
### **GitHub contribution**:
|
### **GitHub contribution**:
|
||||||
Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.
|
Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.
|
||||||
|
|
||||||
## TODO
|
## TODO
|
||||||
|
|
||||||
- Support row layer split for multiple card runs.
|
- NA
|
||||||
|
Loading…
Reference in New Issue
Block a user