* SimpleChat: A placeholder system prompt, Use usage msg in code
Just have a alert msg wrt needing javascript enabled in html. And
have usage message from js file. Update the usage message a bit.
So also enable switch session wrt setup_ui call.
Add a possible system prompt as a placeholder for the system-input.
* SimpleChat:CompletionMode: Allow control of Role: prefix
* SimpleChat:Completion: Avoid Role: prefix; Newline only in between
In completion mode
* avoid inserting Role: prefix before each role's message
* avoid inserting newline at the begin and end of the prompt
message. However if there are multiple role messages, then
insert newline when going from one role's message to the
next role's message.
* SimpleChat:CompletionMode: Update readme/usage, trim textarea newline
Readme update wrt completion mode behavior.
Usage help updated wrt completion mode behavior.
When changing from input to textarea elment wrt user input, the last
newline at the end of the user input wrt textarea, was forgotten to be
filtered, this is fixed now. However if user wants to have a explicit
newline they can using shift+enter to insert a newline, that wont be
removed. The extra newline removal logic uses substring and keyup to
keep things simple and avoid some previously noted bugs wrt other
events in the key path as well as IME composition etal.
* SimpleChat:SC: Ensure proper clearing/reseting
previous logic would have cleared/reset the xchat, without doing
the same wrt iLastSys, thus leading to it pointing to a now non
existent role-content entry.
So if a user set a system prompt and used completion mode, it would
have done the half stupid clear, after the model response was got.
Inturn when user tries to send a new completion query, it would
inturn lead to handle_user_submit trying to add/update system prompt
if any, which will fail, bcas iLastSys will be still pointing to a
non existant entry.
This is fixed now, by having a proper clear helper wrt SC class.
* SimpleChat: Update usage note and readme a bit
* SimpleChat:Completion: clear any prev chat history at begining
Previously any chat history including model response to a completion
query would have got cleared, after showing the same to the user,
at the end of handle_user_submit, rather than at the begining.
This gave the flexibility that user could switch from chat mode
to completion mode and have the chat history till then sent to
the ai model, as part of the completion query. However this flow
also had the issue that, if user switches between different chat
sessions, after getting a completion response, they can no longer
see the completion query and its response that they had just got.
The new flow changes the clearing of chat history wrt completion
mode to the begining of handle_user_submit, so that user doesnt
lose the last completion mode query and response, till a new
completion mode query is sent to the model, even if they were to
switch between the chat sessions. At the same time the loss of
flexibility wrt converting previous chat history into being part
of the completion query implicitly doesnt matter, because now
the end user can enter multiline queries.
* SimpleChat:Try read json early, if available
For later
the server flow doesnt seem to be sending back data early, atleast
for the request (inc options) that is currently sent.
if able to read json data early on in future, as and when ai model
is generating data, then this helper needs to indirectly update
the chat div with the recieved data, without waiting for the
overall data to be available.
* SimpleChat: Rename the half asleep mis-spelled global var
* SimpleChat: Common chat request options from a global object
* SimpleChat: Update title, usage and readme a bit
Keep the title simple so that print file name doesnt have chars
that need to be removed.
Update readme wrt some of the new helpers and options.
Change Usage list to a list of lists, add few items and style it
to reduce the margin wrt lists.
* SimpleChat:ChatRequestOptions: max_tokens
As some times based on the query from the user, the ai model may get
into a run away kind of generation with repeatations etal, so adding
max_tokens to try and limit this run away behaviour, if possible.
* SimpleChat: Reduce max_tokens to be small but still sufficient
* SimpleChat: Consolidate global vars into gMe, Display to user
This allows the end user to see the settings used by the logic,
as well as allows users to change/update the settings if they
want to by using devel-tools/console
* SimpleChat:SlidingWindow: iRecentUserMsgCnt to limit context load
This is disabled by default. However if enabled, then in addition
to latest system message, only the last N user messages, after the
latest system message and its reponses from the ai model will be sent
to the ai-model, when querying for a new response.
This specified N also includes the latest user query.
* SimpleChat: placeholder based usage hint for user-in textarea
* SimpleChat: Try make user experience better, if possible
Reduce chat history context sent to the server/ai-model to be
just the system-prompt, prev-user-request-and-ai-response and
cur-user-request, instead of the previous full chat history.
This way if there is any response with garbage/repeatation, it
doesnt mess with things beyond the next question, in some ways.
Increase max_tokens to 1024, so that a relatively large previous
reponse doesnt eat up the space available wrt next query-response.
However dont forget that the server when started should also
be started with a model context size of 1k or more, to be on
safe side.
Add frequency and presence penalty fields set to 1.2 to the set
of fields sent to server along with the user query. So that
the model is partly set to try avoid repeating text in its
response.
* SimpleChat:Add n_predict (equiv max_tokens) for llamacpp server
The /completions endpoint of examples/server doesnt take max_tokens,
instead it takes the internal n_predict, for now add the same on
the client side, maybe later add max_tokens to /completions endpoint
handling.
* SimpleChat: Note about trying to keep things simple yet flexible
* SimpleChat: Add a skeletal html page
Contains a div placeholder for showing chat messages till now
a text-input for allowing user to enter next chat message/query
to the model.
a submit button to allow sending of the user entered message and
chat till now to the model.
* SimpleChat: A js skeleton with SimpleChat class
Allows maintaining an array of chat message.
Allows adding chat message (from any of the roles be it system,
user, assistant, ...)
Allows showing chat messages till now, in a given div element.
* SimpleChat: request_json, globals, startme
* SimpleChatJS: Roles Class, submitClick
Define Role class with static members corresponding to the roles.
Update startme to
* Get hold of the ui elements.
* Attach a click handler to submit button, which adds the user input
to xchats array and shows the chat messages till now in chat div
element.
Trap DOMContentLoaded to trigger startme
* SimpleChat:HTML: Bring in the js file
* SimpleChat: Rather value wrt input text element
* SimpleChat: Also add completions related prompt
* SimpleChat: Use common helper logic wrt json data
* SimpleChat: Move handling of submit request into its own func
* SimpleChat: Try handshake with llm over its web service endpoint
* SimpleChat:JS: Extract model response and show to user
* SimpleChat:JS: Messages/Prompt, indicate working to end user
* SimpleChat: Try keep input element in view
* SimpleChat: Diff user/assistant msgs, Make input wider
Also show a default message to user
Also add some metas
* SimpleChat: Move into its own sub directory to avoid confusion
* SimpleChat:sh: Add simple shell script to run python3 http.server
So one needs to run the llm server locally
then run this script and access it using a local browser
* SimpleChat:JS: Try trap enter key press wrt input text field
So user can either press submit button or press enter key
* SimpleChat: Allow user to select chat or completion mode
* SimpleChat: Dont submit if already submitted and waiting
Also make chat the default selection wrt mode
* SimpleChat:JS: Handle difference in response
Try read the assistance response from appropriate field in the
response got.
Also examples/server seems to return the response in a slightly
different field, so try account for that also.
* SimpleChat:JS: Force completion mode be single message by default
* SimpleChat: Add a simple readme file
* SimpleChat:HTML: Cleanup/structure UI a bit, Add input for system
* SimpleChat:Allow system prompt to be set, if provided before user
* SimpleChat: Ignore empty user input, without trimming
* SimpleChat:Alert user if they provide sysprompt late or change it
* SimpleChat: Move handling systemprompt into its own func
* SimpleChat:HTML: Add a style for system role message
* SimpleChat: Update the readme file
* SimpleChat:CSS: Move style info into its own css file
To keep it simple, clean and seperate so that things are not
unnecessarily cluttered.
* SimpleChat:CSS: Allow for chat div to be scrollable
* SimpleChat:JS: Try ensure the last entry in chat is visible
Needed because now only the chat div is scrollable and not the full
page.
In last commit the chat div size was fixed to 75% vertical height,
so the full page no longer scrolls, so the old bring user-input
element to view wont work, instead now the last element in the
chat div should be brought into view.
* SimpleChat:JS: bottom of element visible, Set focus to user input
As the generated text could be multiple lines and occupy more space
that the full scrollable div's vertical space, make the bottom of
the last element (which can be such a generated text) in the div
visible by scrolling.
Ensure that the user input box has focus
* SimpleChat: Update notes a bit. Try keep browser happy
Avoid browser quirk mode with DOCTYPE.
Help with accessibility a bit by specifying the language explicitly.
Specify the char encoding explicitly, inturn utf-8 is a safe bet,
even with intermixing of languages if reqd in future.
Add a cache-control http-equiv meta tag, which in all probability
will be ignored.
Defer js loading and execution, just for fun and future, not that
critical here as it stands now.
* SimpleChat:HTML:Group user input+btn together; Note about multichat
* SimpleChat:JS: Allow for changing system prompt anytime for future
* SimpleChat:Readme: Note about handle_systemprompt begin/anytime
* SimpleChat:HTML: Add viewport meta for better mobile friendliness
Without this the page content may look too small.
* SimpleChat:HtmlCss: Cleanup UI flow
set margin wrt vmin rather than vw or vh so portrait/landscape ok.
Use flex and flex-grow to put things on the same line as well as
distribute available space as needed. Given two main elements/line
so it remains simple.
In each line have one element with grows and one sits with a basic
comfortably fixed size.
* SimpleChat: textarea for multiline user chat, inturn shift+enter 4 enter
* SimpleChat: Make vertical layout better responsive (flex based)
Also needed to make things cleaner and properly usable whether
landscape or portrait, after changing to multiline textarea rather
than single line user input.
Avoid hardcoding the chat-till-now display area height, instead
make it a flex-growable within a flex column of ui elements within
a fixed vertical area.
* SimpleChat: Rename simplechat.html to index.html, update readme
Instead of providing a seperate shell script, update the readme wrt
how to run/use this web front end.
* SimpleChat: Screen fixed view and scrolling, Printing full
* SimpleChat:JS:CI: Avoid space at end of jsdoc param line
* SimpleChat:JS: MultiChat initial skeleton
Will help maintain multiple independent chats in future
* SimpleChat:JS: Move system prompt begin/anytime into SimpleChat
* SimpleChat:JS:Keep MultiChatUI simple for now
Worry about different chats with different servers for later.
* SimpleChat:JS: Move handle submit into MultiChat, build on same
Create an instance of MultiChatUI and inturn a instance of chat
session, which is what the UI will inturn work on.
* SimpleChat:JS: Move to dictionary of SimpleChat, instead of array
* SimpleChat: Move ui elements into MultiChatUI, Update el IDs
Move ui elements into MultiChatUI, so that current handleUserSubmit
doesnt need to take the element arguments. Also in future, when
user is allowed to switch between different chat sessions, the
UI can be updated as needed by using the elements in UI already
known to MultiChatUI instance.
Rename the element ids' so that they follow a common convention,
as well as one can identify what the element represents in a more
consistant manner.
* SimpleChat:MCUI:Show available chat sessions, try switch btw them
Previous commits brought in / consolidated existing logic into
MultiChatUI class.
Now start adding logic towards multichat support
* show buttons indicating available chat sessions
* on sessin button click, try switch to that session
* SimpleChat:MCUI: Store and use current chat session id
Also
allow to switch chat session optionally, wrt some of the related
helpers.
setup for two chat sessions by default.
* SimpleChat:MCUI: Delay enabling user-input to avoid race
Re-enable user-input, only after response to a user query has been
updated to the chat-div. This ensures that if user tries to switch
chat session, it wont be allowed till chat-request-response flow is
done.
* SimpleChat: Take care of system prompt
Helper to get the latest system prompt and inturn use same to
set the system prompt ui, when switching.
Ensure that system prompt is set if and when enter key is pressed.
* SimpleChat:GetSystemLatest, fix a oversight.
* SimpleChat:MCUI: Allow selected chat-session btn to be highlighted
Also have a general helper for setting class of children.
* SimpleChat:Cleanup corners
Show system prompt in chat space, when it is set by pressing enter,
as a feedback to user.
Alert user, if they try to switch chat session in the middle of
waiting for a response from the ai model.
* SimpleChat:MCUI: Ensure req-resp failure doesnt lock up things
* SimpleChat:MCUI: Support for new chat sessions
Also a general create button helper.
* SimpleChat:MCUI: CreateSessionBtn helper, use wrt NewChat
Also fix a oversight wrt using stale data wrt the list of chat
sessions.
* SimpleChat:MCUI: NewChat btn first before existing chat sessions
* SimpleChat:MCUI:CornerCases:Skip new chat, show only if current
Skip NewChat if user cancels or if one waiting for response from
the ai model.
Dont show a chat with newly got ai model response, if current chat
session has changed, some how. Chat session shouldnt be allowed to
change, if there is a pending response, but still as a additional
sanity check.
* SimpleChat: Update readme, title, show usage if no chat to show
* SimpleChat: Cleanup the log/dialog messages a bit
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
* Update brute force test: special tokens
* Fix added tokens
- Try to read 'added_tokens.json'.
- Try to read 'tokenizer_config.json'.
- Try to read 'tokenizer.json'.
* Fix special tokens rtrim
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
* [server] Cleanup a memory leak on exit
There are a couple memory leaks on exit of the server. This hides others.
After cleaning this up, you can see leaks on slots. But that is another
patch to be sent after this.
* make tab into spaces
* convert-hf : begin refactoring write_tensor
* convert : upgrade to sentencepiece v0.2.0
* convert-hf : remove unused n_dims in extra_*_tensors
* convert-hf : simplify MoE weights stacking
* convert-hf : flake8 linter doesn't like semicolons
* convert-hf : allow unusual model part names
For example, loading `model-00001-of-00001.safetensors` now works.
* convert-hf : fix stacking MoE expert tensors
`torch.stack` and `torch.cat` don't do the same thing.
* convert-hf : fix Mamba conversion
Tested to work even with a SentencePiece-based tokenizer.
* convert : use a string for the SentencePiece tokenizer path
* convert-hf : display tensor shape
* convert-hf : convert norms to f32 by default
* convert-hf : sort model part names
`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".
* convert-hf : use an ABC for Model again
It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)
At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.
* convert-hf : use a plain class for Model, and forbid direct instantiation
There are no abstract methods used anyway,
so using ABC isn't really necessary.
* convert-hf : more consistent formatting of cmdline args
* convert-hf : align the message logged for converted tensors
* convert-hf : fix Refact conversion
* convert-hf : save memory with lazy evaluation
* convert-hf : flake8 doesn't like lowercase L as a variable name
* convert-hf : remove einops requirement for InternLM2
* convert-hf : faster model parts loading
Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors
Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.
* convert-hf : minor changes for consistency
* gguf-py : add tqdm as a dependency
It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
* Added themes support with two sample themes and a favicon.
* Newline
* Newline
* Newline
* Trailing whitespace
* Increased opacity for contrast
* Increase opacity.
Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY
* Opacity action trigger.
Trying to re-trigger the cancelled action.
* One more opacity adjustment
This Actions pipeline is failing for random issues.
* Delete examples/server/themes/buttons_top/completion.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/buttons_top/index.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/completion.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs
This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/index.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/json-schema-to-grammar.mjs
This will be served from the static string built-in to server.
* Replaced underscore.
This will reproduce the issue in llama13b
{
'prompt': 'Q: hello world \nA: ',
'stop': ['\n'],
'temperature': 0.0,
'n_predict': 10,
'cache_prompt': True,
'n_probs': 10
}
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision fpr parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
* imatrix: save the dataset file used in the output file
* llama: support kv overrides type string string
* common: factorize KV Overrides parsing between common and server
* quantize: add imatrix n entries and dataset KV metadata
quantize: factorize KV Overrides parsing between common
#6656
* llama: remove kv override str_value initialization as it does not compile on some toolchain
* quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count`
* quantize: add imatrix filename in KV
* llama: add llama_model_kv_override_free
* common: add llama_model_kv_override_free
common: free kv override if used after model loading
* llama: finally move the string KV override value to the stack
* llama : minor
* no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators.
Co-authored-by: slaren <slarengh@gmail.com>
* kv override: ensure string termination
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
* server: cap n_predict if not set to n_ctx_train
* server: fix infinite loop
* server: infinite loop, move in process_token
server: infinite loop: set stop limit to true
* minor: spaces
* minor: spaces
* server: include prompt tokens in the EOS limit
* fix: revert showing control tokens by default
* feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens
* feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses"
* common : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* `build`: generate hex dumps of server assets on the fly
* build: workaround lack of -n on gnu xxd
* build: don't use xxd in cmake
* build: don't call xxd from build.zig
* build: more idiomatic hexing
* build: don't use xxd in Makefile (od hackery instead)
* build: avoid exceeding max cmd line limit in makefile hex dump
* build: hex dump assets at cmake build time (not config time)
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>