text-generation-webui/extensions/openai
2023-05-05 18:53:03 -03:00
..
README.md add openai compatible api (#1475) 2023-05-02 22:49:53 -03:00
requirements.txt add openai compatible api (#1475) 2023-05-02 22:49:53 -03:00
script.py Refactor text_generation.py, add support for custom generation functions (#1817) 2023-05-05 18:53:03 -03:00

An OpenedAI API (openai like)

This extension creates an API that works kind of like openai (ie. api.openai.com). It's incomplete so far but perhaps is functional enough for you.

Setup & installation

Optional (for flask_cloudflared, embeddings):

pip3 install -r requirements.txt

Embeddings (alpha)

Embeddings requires sentence-transformers installed, but chat and completions will function without it loaded. The embeddings endpoint is currently using the HuggingFace model: sentence-transformers/all-mpnet-base-v2 for embeddings. This produces 768 dimensional embeddings (the same as the text-davinci-002 embeddings), which is different from OpenAI's current default text-embedding-ada-002 model which produces 1536 dimensional embeddings. The model is small-ish and fast-ish. This model and embedding size may change in the future.

model name dimensions input max tokens speed size Avg. performance
text-embedding-ada-002 1536 8192 - - -
text-davinci-002 768 2046 - - -
all-mpnet-base-v2 768 384 2800 420M 63.3
all-MiniLM-L6-v2 384 256 14200 80M 58.8

In short, the all-MiniLM-L6-v2 model is 5x faster, 5x smaller ram, 2x smaller storage, and still offers good quality. Stats from (https://www.sbert.net/docs/pretrained_models.html). To change the model from the default you can set the environment variable OPENEDAI_EMBEDDING_MODEL, ex. "OPENEDAI_EMBEDDING_MODEL=all-MiniLM-L6-v2".

Warning: You cannot mix embeddings from different models even if they have the same dimensions. They are not comparable.

Client Application Setup

Almost everything you use it with will require you to set a dummy OpenAI API key environment variable.

With the official python openai client, you can set the OPENAI_API_BASE environment variable before you import the openai module, like so:

OPENAI_API_KEY=dummy
OPENAI_API_BASE=http://127.0.0.1:5001/v1

If needed, replace 127.0.0.1 with the IP/port of your server.

If using .env files to save the OPENAI_API_BASE and OPENAI_API_KEY variables, you can ensure compatibility by loading the .env file before loading the openai module, like so in python:

from dotenv import load_dotenv
load_dotenv()
import openai

With the official Node.js openai client it is slightly more more complex because the environment variables are not used by default, so small source code changes may be required to use the environment variables, like so:

const openai = OpenAI(Configuration({
  apiKey: process.env.OPENAI_API_KEY,
  basePath: process.env.OPENAI_API_BASE,
}));

For apps made with the chatgpt-api Node.js client library:

const api = new ChatGPTAPI({
  apiKey: process.env.OPENAI_API_KEY,
  apiBaseUrl: process.env.OPENAI_API_BASE,
})

Compatibility & not so compatibility

What's working:

API endpoint tested with notes
/v1/models openai.Model.list() returns the currently loaded model_name and some mock compatibility options
/v1/models/{id} openai.Model.get() returns whatever you ask for, model does nothing yet anyways
/v1/text_completion openai.Completion.create() the most tested, only supports single string input so far
/v1/chat/completions openai.ChatCompletion.create() depending on the model, this may add leading linefeeds
/v1/embeddings openai.Embedding.create() Using Sentence Transformer, dimensions are different and may never be directly comparable to openai embeddings.
/v1/moderations openai.Moderation.create() does nothing. successfully.
/v1/engines/*/... completions, embeddings, generate python-openai v0.25 and earlier Legacy engines endpoints

The model name setting is ignored in completions, but you may need to adjust the maximum token length to fit the model (ie. set to <2048 tokens instead of 4096, 8k, etc). To mitigate some of this, the max_tokens value is halved until it is less than truncation_length for the model (typically 2k).

Streaming, temperature, top_p, max_tokens, stop, should all work as expected, but not all parameters are mapped correctly.

Some hacky mappings:

OpenAI text-generation-webui note
frequency_penalty encoder_repetition_penalty this seems to operate with a different scale and defaults, I tried to scale it based on range & defaults, but the results are terrible. hardcoded to 1.18 until there is a better way
presence_penalty repetition_penalty same issues as frequency_penalty, hardcoded to 1.0
best_of top_k
stop custom_stopping_strings this is also stuffed with ['\nsystem:', '\nuser:', '\nhuman:', '\nassistant:', '\n###', ] for good measure.
n 1 hardcoded, it may be worth implementing this but I'm not sure how yet
1.0 typical_p hardcoded
1 num_beams hardcoded
max_tokens max_new_tokens max_tokens is scaled down by powers of 2 until it's smaller than truncation length.
logprobs - ignored

defaults are mostly from openai, so are different. I use the openai defaults where I can and try to scale them to the webui defaults with the same intent.

Applications

Everything needs OPENAI_API_KEY=dummy set.

Compatibility Application/Library url notes / setting
openai-python https://github.com/openai/openai-python only the endpoints from above are working. OPENAI_API_BASE=http://127.0.0.1:5001/v1
openai-node https://github.com/openai/openai-node only the endpoints from above are working. environment variables don't work by default, but can be configured (see above)
chatgpt-api https://github.com/transitive-bullshit/chatgpt-api only the endpoints from above are working. environment variables don't work by default, but can be configured (see above)
shell_gpt https://github.com/TheR1D/shell_gpt OPENAI_API_HOST=http://127.0.0.1:5001
gpt-shell https://github.com/jla/gpt-shell OPENAI_API_BASE=http://127.0.0.1:5001/v1
gpt-discord-bot https://github.com/openai/gpt-discord-bot OPENAI_API_BASE=http://127.0.0.1:5001/v1
langchain https://github.com/hwchase17/langchain OPENAI_API_BASE=http://127.0.0.1:5001/v1 even with a good 30B-4bit model the result is poor so far. It assumes zero shot python/json coding. Some model tailored prompt formatting improves results greatly.
Auto-GPT https://github.com/Significant-Gravitas/Auto-GPT OPENAI_API_BASE=http://127.0.0.1:5001/v1 Same issues as langchain. Also assumes a 4k+ context
babyagi https://github.com/yoheinakajima/babyagi OPENAI_API_BASE=http://127.0.0.1:5001/v1

Future plans

  • better error handling
  • model changing, esp. something for swapping loras or embedding models
  • consider switching to FastAPI + starlette for SSE (openai SSE seems non-standard)
  • do something about rate limiting or locking requests for completions, most systems will only be able handle a single request at a time before OOM
  • the whole api, images (stable diffusion), audio (whisper), fine-tunes (training), edits, files, etc.