mirror of
https://github.com/ggerganov/llama.cpp.git
synced 2025-01-26 03:12:23 +01:00
server: docs - refresh and tease a little bit more the http server (#5718)
* server: docs - refresh and tease a little bit more the http server * Rephrase README.md server doc Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit is contained in:
parent
bf08e00643
commit
8b350356b2
@ -114,6 +114,9 @@ Typically finetunes of the base models below are supported as well.
|
|||||||
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
|
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
|
||||||
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
|
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
|
||||||
|
|
||||||
|
**HTTP server**
|
||||||
|
|
||||||
|
[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
|
||||||
|
|
||||||
**Bindings:**
|
**Bindings:**
|
||||||
|
|
||||||
|
@ -1,8 +1,20 @@
|
|||||||
# llama.cpp/example/server
|
# LLaMA.cpp HTTP Server
|
||||||
|
|
||||||
This example demonstrates a simple HTTP API server and a simple web front end to interact with llama.cpp.
|
Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
|
||||||
|
|
||||||
Command line options:
|
Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
* LLM inference of F16 and quantum models on GPU and CPU
|
||||||
|
* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
|
||||||
|
* Parallel decoding with multi-user support
|
||||||
|
* Continuous batching
|
||||||
|
* Multimodal (wip)
|
||||||
|
* Monitoring endpoints
|
||||||
|
|
||||||
|
The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
|
||||||
|
|
||||||
|
**Command line options:**
|
||||||
|
|
||||||
- `--threads N`, `-t N`: Set the number of threads to use during generation.
|
- `--threads N`, `-t N`: Set the number of threads to use during generation.
|
||||||
- `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
|
- `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
|
||||||
|
Loading…
Reference in New Issue
Block a user