llama.cpp/examples/tts/README.md

# llama.cpp/example/tts
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart
If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
For details about the models and how to convert them to the required format
see the following sections.

### Model conversion
Checkout or download the model that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.

Next we do something simlar for the audio decoder. First download or checkout
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is PyTorch checkpoint (.ckpt) and we first need to convert it to
huggingface format:
```console
(venv) python examples/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75tokenconfig.json
```
Then we can convert the huggingface format to gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```

### Running the example

With both of the models generated, the LLM model and the voice decoder model,
we can run the example:
```console
$ build/bin/llama-tts -m  ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'
```
The output.wav file will contain the audio of the prompt. This can be heard
by playing the file with a media player. On Linux the following command will
play the audio:
```console
$ aplay output.wav
```

### Running the example with llama-server
Running this example with `llama-server` is also possible and requires two
server instances to be started. One will serve the LLM model and the other
will serve the voice decoder model.

The LLM model server can be started with the following command:
```console
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020
```

And the voice decoder model server can be started using:
```console
./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none
```

Then we can run [tts-outetts.py](tts-outetts.py) to generate the audio.

First create a virtual environment for python and install the required
dependencies (this in only required to be done once):
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) pip install requests numpy
```

And then run the python script using:
```conole
(venv) python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"
spectrogram generated: n_codes: 90, n_embd: 1282
converting to audio ...
audio generated: 28800 samples
audio written to file "output.wav"
```
And to play the audio we can again use aplay or any other media player:
```console
$ aplay output.wav
```
examples : add README.md to tts example [no ci] (#11155) * examples : add README.md to tts example [no ci] * squash! examples : add README.md to tts example [no ci] Fix heading to be consistent with other examples, and add a quickstart section to README.md. * squash! examples : add README.md to tts example [no ci] Fix spelling mistake. 2025-01-10 13:16:16 +01:00			`# llama.cpp/example/tts`
			`This example demonstrates the Text To Speech feature. It uses a`
			`[model](https://www.outeai.com/blog/outetts-0.2-500m) from`
			`[outeai](https://www.outeai.com/).`

			`## Quickstart`
			If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
			`following command and the required models will be downloaded automatically:`
			```console
			`$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav`
			```
			`For details about the models and how to convert them to the required format`
			`see the following sections.`

			`### Model conversion`
			`Checkout or download the model that contains the LLM model:`
			```console
			`$ pushd models`
			`$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M`
			`$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull`
			`$ popd`
			```
			`Convert the model to .gguf format:`
			```console
			`(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \`
			`--outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16`
			```
			The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.

			`We can optionally quantize this to Q8_0 using the following command:`
			```console
			`$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \`
			`models/outetts-0.2-0.5B-q8_0.gguf q8_0`
			```
			The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.

			`Next we do something simlar for the audio decoder. First download or checkout`
			`the model for the voice decoder:`
			```console
			`$ pushd models`
			`$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token`
			`$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull`
			`$ popd`
			```
			`This model file is PyTorch checkpoint (.ckpt) and we first need to convert it to`
			`huggingface format:`
			```console
			`(venv) python examples/tts/convert_pt_to_hf.py \`
			`models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt`
			`...`
			`Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors`
			`Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json`
			`Config has been saved to models/WavTokenizer-large-speech-75tokenconfig.json`
			```
			`Then we can convert the huggingface format to gguf:`
			```console
			`(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \`
			`--outfile models/wavtokenizer-large-75-f16.gguf --outtype f16`
			`...`
			`INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf`
			```

			`### Running the example`

			`With both of the models generated, the LLM model and the voice decoder model,`
			`we can run the example:`
			```console
			`$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \`
			`-mv ./models/wavtokenizer-large-75-f16.gguf \`
			`-p "Hello world"`
			`...`
			`main: audio written to file 'output.wav'`
			```
			`The output.wav file will contain the audio of the prompt. This can be heard`
			`by playing the file with a media player. On Linux the following command will`
			`play the audio:`
			```console
			`$ aplay output.wav`
			```

examples : add embd_to_audio to tts-outetts.py [no ci] (#11235) This commit contains a suggestion for adding the missing embd_to_audio function from tts.cpp to tts-outetts.py. This introduces a depencency numpy which I was not sure if that is acceptable or not (only PyTorch was mentioned in referened PR). Also the README has been updated with instructions to run the example with llama-server and the python script. Refs: https://github.com/ggerganov/llama.cpp/pull/10784#issuecomment-2548377734 2025-01-15 05:44:38 +01:00			`### Running the example with llama-server`
			Running this example with `llama-server` is also possible and requires two
			`server instances to be started. One will serve the LLM model and the other`
			`will serve the voice decoder model.`

			`The LLM model server can be started with the following command:`
			```console
			`$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020`
			```

			`And the voice decoder model server can be started using:`
			```console
			`./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none`
			```

			`Then we can run [tts-outetts.py](tts-outetts.py) to generate the audio.`

			`First create a virtual environment for python and install the required`
			`dependencies (this in only required to be done once):`
			```console
			`$ python3 -m venv venv`
			`$ source venv/bin/activate`
			`(venv) pip install requests numpy`
			```

			`And then run the python script using:`
			```conole
			`(venv) python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"`
			`spectrogram generated: n_codes: 90, n_embd: 1282`
			`converting to audio ...`
			`audio generated: 28800 samples`
			`audio written to file "output.wav"`
			```
			`And to play the audio we can again use aplay or any other media player:`
			```console
			`$ aplay output.wav`
			```