Commit Graph

1350 Commits

Author SHA1 Message Date
oobabooga
a43c210617
Improved past chats menu (#6158) 2024-06-25 00:07:22 -03:00
oobabooga
96ba53d916 Handle another fix after 57119c1b30 2024-06-24 15:51:12 -07:00
oobabooga
577a8cd3ee
Add TensorRT-LLM support (#5715) 2024-06-24 02:30:03 -03:00
oobabooga
536f8d58d4 Do not expose alpha_value to llama.cpp & rope_freq_base to transformers
To avoid confusion
2024-06-23 22:09:24 -07:00
oobabooga
b48ab482f8 Remove obsolete "gptq_for_llama_info" message 2024-06-23 22:05:19 -07:00
oobabooga
5e8dc56f8a Fix after previous commit 2024-06-23 21:58:28 -07:00
Louis Del Valle
57119c1b30
Update block_requests.py to resolve unexpected type error (500 error) (#5976) 2024-06-24 01:56:51 -03:00
CharlesCNorton
5993904acf
Fix several typos in the codebase (#6151) 2024-06-22 21:40:25 -03:00
GodEmperor785
2c5a9eb597
Change limits of RoPE scaling sliders in UI (#6142) 2024-06-19 21:42:17 -03:00
Guanghua Lu
229d89ccfb
Make logs more readable, no more \u7f16\u7801 (#6127) 2024-06-15 23:00:13 -03:00
Forkoz
1576227f16
Fix GGUFs with no BOS token present, mainly qwen2 models. (#6119)
---------

Co-authored-by: oobabooga <112222186+oobabooga@users.noreply.github.com>
2024-06-14 13:51:01 -03:00
oobabooga
10601850d9 Fix after previous commit 2024-06-13 19:54:12 -07:00
oobabooga
0f3a423de1 Alternative solution to "get next logits" deadlock (#6106) 2024-06-13 19:34:16 -07:00
oobabooga
386500aa37 Avoid unnecessary calls UI -> backend, to make it faster 2024-06-12 20:52:42 -07:00
Forkoz
1d79aa67cf
Fix flash-attn UI parameter to actually store true. (#6076) 2024-06-13 00:34:54 -03:00
Belladore
3abafee696
DRY sampler improvements (#6053) 2024-06-12 23:39:11 -03:00
oobabooga
a36fa73071 Lint 2024-06-12 19:00:21 -07:00
oobabooga
2d196ed2fe Remove obsolete pre_layer parameter 2024-06-12 18:56:44 -07:00
Belladore
46174a2d33
Fix error when bos_token_id is None. (#6061) 2024-06-12 22:52:27 -03:00
Belladore
a363cdfca1
Fix missing bos token for some models (including Llama-3) (#6050) 2024-05-27 09:21:30 -03:00
oobabooga
8df68b05e9 Remove MinPLogitsWarper (it's now a transformers built-in) 2024-05-27 05:03:30 -07:00
oobabooga
4f1e96b9e3 Downloader: Add --model-dir argument, respect --model-dir in the UI 2024-05-23 20:42:46 -07:00
oobabooga
ad54d524f7 Revert "Fix stopping strings for llama-3 and phi (#6043)"
This reverts commit 5499bc9bc8.
2024-05-22 17:18:08 -07:00
oobabooga
5499bc9bc8
Fix stopping strings for llama-3 and phi (#6043) 2024-05-22 13:53:59 -03:00
oobabooga
9e189947d1 Minor fix after bd7cc4234d (thanks @belladoreai) 2024-05-21 10:37:30 -07:00
oobabooga
ae86292159 Fix getting Phi-3-small-128k-instruct logits 2024-05-21 10:35:00 -07:00
oobabooga
bd7cc4234d
Backend cleanup (#6025) 2024-05-21 13:32:02 -03:00
Philipp Emanuel Weidmann
852c943769
DRY: A modern repetition penalty that reliably prevents looping (#5677) 2024-05-19 23:53:47 -03:00
oobabooga
9f77ed1b98
--idle-timeout flag to unload the model if unused for N minutes (#6026) 2024-05-19 23:29:39 -03:00
altoiddealer
818b4e0354
Let grammar escape backslashes (#5865) 2024-05-19 20:26:09 -03:00
Tisjwlf
907702c204
Fix gguf multipart file loading (#5857) 2024-05-19 20:22:09 -03:00
A0nameless0man
5cb59707f3
fix: grammar not support utf-8 (#5900) 2024-05-19 20:10:39 -03:00
Samuel Wein
b63dc4e325
UI: Warn user if they are trying to load a model from no path (#6006) 2024-05-19 20:05:17 -03:00
chr
6b546a2c8b
llama.cpp: increase the max threads from 32 to 256 (#5889) 2024-05-19 20:02:19 -03:00
oobabooga
a38a37b3b3 llama.cpp: default n_gpu_layers to the maximum value for the model automatically 2024-05-19 10:57:42 -07:00
oobabooga
a4611232b7 Make --verbose output less spammy 2024-05-18 09:57:00 -07:00
oobabooga
e9c9483171 Improve the logging messages while loading models 2024-05-03 08:10:44 -07:00
oobabooga
e61055253c Bump llama-cpp-python to 0.2.69, add --flash-attn option 2024-05-03 04:31:22 -07:00
oobabooga
51fb766bea
Add back my llama-cpp-python wheels, bump to 0.2.65 (#5964) 2024-04-30 09:11:31 -03:00
oobabooga
dfdb6fee22 Set llm_int8_enable_fp32_cpu_offload=True for --load-in-4bit
To allow for 32-bit CPU offloading (it's very slow).
2024-04-26 09:39:27 -07:00
oobabooga
70845c76fb
Add back the max_updates_second parameter (#5937) 2024-04-26 10:14:51 -03:00
oobabooga
6761b5e7c6
Improved instruct style (with syntax highlighting & LaTeX rendering) (#5936) 2024-04-26 10:13:11 -03:00
oobabooga
4094813f8d Lint 2024-04-24 09:53:41 -07:00
oobabooga
64e2a9a0a7 Fix the Phi-3 template when used in the UI 2024-04-24 01:34:11 -07:00
oobabooga
f0538efb99 Remove obsolete --tensorcores references 2024-04-24 00:31:28 -07:00
Colin
f3c9103e04
Revert walrus operator for params['max_memory'] (#5878) 2024-04-24 01:09:14 -03:00
oobabooga
9b623b8a78
Bump llama-cpp-python to 0.2.64, use official wheels (#5921) 2024-04-23 23:17:05 -03:00
oobabooga
f27e1ba302
Add a /v1/internal/chat-prompt endpoint (#5879) 2024-04-19 00:24:46 -03:00
oobabooga
e158299fb4 Fix loading sharted GGUF models through llamacpp_HF 2024-04-11 14:50:05 -07:00
wangshuai09
fd4e46bce2
Add Ascend NPU support (basic) (#5541) 2024-04-11 18:42:20 -03:00