Sure. Apologies ahead of time for the wall of text.

1. Download and set up Oobabooga (text-generation-webui) first. It's basically a Gradio interface that lets you chat with local LLMs you download. It feels a lot like auto1111. Link: https://github.com/oobabooga/text-generation-webui

2. In the interface, go to the Model tab and download an EXL2 model. I like these because it seems to be a newer format and is much faster to run locally, especially in combination with chat-to-voice. I had read that AWQ is a graphics card format and GGUF is a CPU format (though GGUF can also offload layers to the GPU). I think EXL2 is the newer thing that also runs on the graphics card.

3. Note that some EXL2 models have trouble downloading from the main Hugging Face branch link, because the uploader splits the quants into separate branches for different VRAM sizes. If that happens, it will only download a .json/readme. Here's what a user told me about why they didn't download from the Model tab originally: "bartowski compiles all of their quants of a particular model into branches rather than separate HuggingFace uploads.
So, if you want the 8-bit quant for example, you should paste this into textgen's Download Model field: bartowski/Sensualize-Solar-10.7B-exl2:8_0" I tried that, but pasting still didn't work on some models. If that happens, make a folder in \text-generation-webui-main\models\ (folder name), download all the files individually from the branch, and put them there. Great, now you can see what other kinds of models I was looking up in addition to my main idea.

4. Well, now that that's out in the open: if you want something uncensored and spicy you can use this model. Pasting the link doesn't work on this one either, so again choose the 4, 5, 6, or 8bpw branch on Hugging Face, download all the files individually, and put them in a folder you make in \text-generation-webui-main\models\ (folder name). (I do not talk to myself using this model or train company data on this, nor recommend this lol.) When I say uncensored in certain ways... it really is, and it was fully tested for science. You can also download any of the other models from this guy's Hugging Face page to try. There are tons of models for specific needs, like coding models, obscure stuff, etc. Model selection depends on your VRAM and the model's parameter count; lower bpw means less VRAM usage.

5. In the Model tab, after the download finishes, select the model from the dropdown, select ExLlamav2 in the "Model loader" dropdown, and click Load. You can now go back to the chat and start chatting with the example assistant.

6. Create a character. Under the Parameters tab up top, create a character (context, image, etc.) and save it. Go back to the chat and select the character in the gallery. You can now chat with the character.

7. Download AllTalk TTS for the voice cloning here: https://github.com/erew123/alltalk_tts It's well documented and has detailed instructions (read the readme in extensions\alltalk_tts or on the repo). Start Oobabooga again. The cmd window shows a local address with more info, plus extension settings that I didn't really mess with.

8. The default address for the Oobabooga interface is http://127.0.0.1:7860, or :7861 if you already have auto1111/SD Forge open. In the Chat tab the extension is located at the bottom, and there is a link to the documentation there too, covering things like how to finetune (you finetune the base Coqui XTTSv2 model outside of Oobabooga, in a separate Gradio app he created). There is also a YouTube link on starting that interface somewhere in his issues section, but I can't find it now.

9. You may want to disable the Coqui TTS telemetry before training. It's the setting he mentions in the readme, for when you want to train locally and disconnected from the internet. Run cmd_windows.bat from the \text-generation-webui-main folder (it opens a prompt inside the webui's environment), then type "set TRAINER_TELEMETRY=0" in that cmd prompt. Do this before starting the Gradio interface with 'python finetune.py' from the \extensions\alltalk_tts folder in the same cmd prompt.

10. The extension will download the base model automatically. I recommend using mp3s to train, and it seems Coqui does too.

11. Again, the training is done outside of Oobabooga, per the instructions. When it's done, delete the voices like arnold, etc. in text-generation-webui-main\extensions\alltalk_tts\voices and replace them with the voices from the wavs folder in the new finetuned model folder (\text-generation-webui-main\extensions\alltalk_tts\models\trainedmodel\wavs). This is all pretty well explained in the documentation, and check the issues section on the repo if you have problems. He answers all the questions.

12. After the finetune, start Oobabooga again. At the bottom of the chat window, in the alltalk_tts section, load the finetune you just made by selecting the "XTTSv2 FT" button. You may have to load your LLM again in the Model tab; it doesn't autoload for me even with autoload selected and the page saved, so maybe that's a bug.

13. Save your settings in Oobabooga so you don't have to keep re-entering them. You can do this under the Session tab at the top by clicking "save ui settings to settings.yaml".

14. Try the different wav files from your training in the character voice dropdown and pick the one that sounds most like the original voice.

15. If it's not quite exact, mess with the "temperature" and "repetition penalty" sliders; you can preview audio in the little preview text box at the bottom. On the first training over the 2.0.2 base, I messed with sliders and different wav files for about 4 hours. It sucked me in, until I just trained again using two 2-minute ElevenLabs v2 voice samples mixed with about five 1.5-2 minute original clips of my voice, and using the Whisper v2 model (not v3). (P.S. You can upload basically any voice to ElevenLabs v2 if you need more samples of something; they just have a disclaimer.) The second model was nearly perfect, and I didn't really have to mess with the sliders much anymore. I am unsure whether the first training turned out worse because it used Whisper v3, or because my samples needed some AI-cloned samples mixed in that were easier for the AI to learn from. The dev confirmed Whisper v2 is more accurate. If you still get bad results, you can try training the first model with only the original samples and Whisper v3, then the second with ElevenLabs v2 samples mixed in and Whisper v2. That worked well, and more of the samples sounded like my original voice when chosen on the double-trained model. If your results are still meh, it could also be your samples; follow his link on making a better dataset. You can also try combining the two or three samples that sound most like the voice into one file and using that as the voice selection (place it in the voices folder as a .wav file). This definitely enhanced things for me and got rid of the script-reading flow that was left after messing with the sliders. I went through all 20 I had and tested them, and the 3 best combined works.

16. The sd_api_pictures extension is great. If you turn on the --api flag in automatic1111, the LLM can generate AI pictures as you chat. I feel like it's sort of a better "wildcards" because it's a live chat. You could Dreambooth-train an original AI character you made, chat with it, and have it send you pictures and talk to you, basically. To make alltalk_tts work with sd_api_pictures, load alltalk_tts first and then sd_api_pictures, or it won't work. With Oobabooga, the order you load extensions in matters.

17. If you want to talk over the mic with whisper_stt instead of typing, first run update_wizard_windows.bat and install the whisper requirements from the list. Then enable it in the Session tab. This guy's info on adding a shortcut key to start and stop the mic (instead of clicking the button in the GUI each time) was useful for me, and it worked well. It now feels more realtime with the shortcut, but still kind of like a walkie-talkie conversation. Edit: I'm having better luck skipping whisper and just using Dragon NaturallySpeaking. It types in the box, and then I say "enter" and it presses Enter (a Dragon custom command does this).

18. I saw in a YouTube video that the multimodal extension seems to let you upload pictures to the chatbot, and it'll tell you what's in them and talk to you about it using the "instruct mode" chat option, kind of like ChatGPT 4 does. I haven't tried that yet, but it could be interesting for conversations.

19. If you're struggling with VRAM, you can try a lower-bpw model like 4bpw. I'm not sure how this affects chat quality; the number of parameters also matters. Enable DeepSpeed to save VRAM (you must install it via the alltalk instructions first): in alltalk in Oobabooga, check the DeepSpeed box, wait 20 seconds for the narrator voice confirmation, then enable the low VRAM option. You can also try choosing ExLlamav2_HF with the cache_8bit box checked, which seems to help. When the chat starts getting longer I really struggle on the 8bpw one if I'm doing pictures and whisper/alltalk together. I'm now using 4bpw and it's not too bad. DeepSpeed did hurt the quality of voice likeness for me, though.

20. A bit different from the dev's findings, but I just tried training over the Coqui XTTS v2 2.0.3 model and got less of that "reading from a script" feel when the voice talks. Accuracy and flow were similar or better. I'd recommend trying it; get it here: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.3 Back up the files in \extensions\alltalk_tts\models\xttsv2_2.0.2, download all the new files, and put them in there. Then train over the base again.
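
For the branch downloads in steps 3 and 4, here is a minimal Python sketch, assuming the huggingface_hub package is installed. It splits a "user/repo:branch" string like the one quoted in step 3 and pulls every file from that branch into a folder under models. The helper names and the folder naming scheme are my own assumptions for illustration; any folder name should do.

```python
from pathlib import Path


def parse_model_spec(spec: str) -> tuple[str, str]:
    """Split 'user/repo:branch' into (repo_id, branch); branch defaults to 'main'."""
    repo_id, _, branch = spec.strip().partition(":")
    return repo_id, branch or "main"


def local_folder(models_dir: str, spec: str) -> Path:
    """Hypothetical folder name: 'user_repo_branch' under the models directory."""
    repo_id, branch = parse_model_spec(spec)
    name = repo_id.replace("/", "_")
    if branch != "main":
        name += "_" + branch
    return Path(models_dir) / name


def download_branch(models_dir: str, spec: str) -> Path:
    """Download every file on the branch (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    repo_id, branch = parse_model_spec(spec)
    target = local_folder(models_dir, spec)
    # `revision` selects the branch, `local_dir` puts real files in the target folder.
    snapshot_download(repo_id=repo_id, revision=branch, local_dir=str(target))
    return target
```

Calling download_branch(r"text-generation-webui-main\models", "bartowski/Sensualize-Solar-10.7B-exl2:8_0") should then fetch the 8-bit branch in one go instead of file by file; adjust the models path to wherever your install lives.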
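
On the VRAM point in step 19, a rough back-of-the-envelope sketch: the quantized weights alone take about parameters × bpw / 8 bytes. This ignores the context cache, the TTS model, and Stable Diffusion, so treat it as a lower bound rather than a real measurement. The function name is mine.

```python
def weight_size_gb(params_billion: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bpw
    return total_bits / 8 / 1e9


# A 10.7B model (like the one quoted in step 3) at the bpw branches from step 4.
for bpw in (4, 5, 6, 8):
    print(f"{bpw} bpw: ~{weight_size_gb(10.7, bpw):.2f} GB")
```

So a 10.7B model is roughly 5.35 GB of weights at 4bpw versus 10.7 GB at 8bpw, which is why dropping to 4bpw frees up room for pictures and voice.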

My sdxl dreambooth guide found here

UPDATE 03/16/24: It's been about a week, and I've learned a few things. I now get the best results during the finetuning process by editing the Whisper v2 results and replacing the samples Whisper makes.

If you were just letting it autosegment before and want to get more detailed with this, then during Part 1 of the finetuning app go to \extensions\alltalk_tts\finetune\tmp-trn\wav, delete all the wav files Whisper just generated, and make your own. (You'll also get two .csv files here to edit.)

The new wav files you make should each be one or two sentences and not exceed 250 characters per sample. Then you need to edit the two .csv files under \finetune\ with the exact words being spoken in each .wav file. Use 22050 Hz or so for the wavs. I did about 74 samples in total in my newest training.
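
A small checker for those two rules, as a sketch: it flags wav files whose sample rate isn't 22050 Hz and transcripts longer than 250 characters. It takes the transcripts as a plain dict of file name to text rather than reading the .csv files, since I'm not assuming anything about their exact layout. The function names are made up for illustration, and Python's wave module only reads plain PCM .wav files.

```python
import wave
from pathlib import Path

MAX_CHARS = 250       # per-sample transcript limit from the post
TARGET_RATE = 22050   # sample rate suggested in the post


def check_transcript(text: str, max_chars: int = MAX_CHARS) -> bool:
    """True if the transcript fits within the character limit."""
    return len(text.strip()) <= max_chars


def wav_sample_rate(path) -> int:
    """Read the sample rate from a PCM .wav file header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def find_problems(wav_dir, transcripts: dict) -> list:
    """Return human-readable problems for each wav named in `transcripts`."""
    problems = []
    for name, text in transcripts.items():
        wav_path = Path(wav_dir) / name
        if not wav_path.exists():
            problems.append(f"{name}: file missing")
            continue
        rate = wav_sample_rate(wav_path)
        if rate != TARGET_RATE:
            problems.append(f"{name}: sample rate is {rate}, expected {TARGET_RATE}")
        if not check_transcript(text):
            problems.append(f"{name}: transcript is {len(text.strip())} characters")
    return problems
```

An empty list back from find_problems means every listed sample passed both checks.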

I occasionally typed stretched-out words like "oooh" instead of simply "oh" if it sounded like I really drew them out, and it seemed to work (not sure if this is right, though). I got very detailed with it and typed "I I don't know" if I said "I" twice. This is basically the equivalent of Whisper v6, I think ;) Then I started the training. Doing it manually this way still seems better, but if you want something fast and still good, Whisper v2 is totally fine.

Then, when training is done, use the last training page to copy the model to the training folder (ignore the page that only shows you a few samples to test it).

Copy the samples you made to the voices folder. In Oobabooga, take notes on every sample and on which ones sound the most like you over the trained model.

Once you find the best ones, say 5 of them that each sound exactly like you but a little different, combine them into one single sample and put that file in \extensions\alltalk_tts\voices. Select the "XTTSv2 FT" button again (the new finetuned model) and use that single wav as your voice sample. It should now sound much better and more accurate, with fewer instances of a robotic flow. It also helps not to use .wavs of you reading from a script, and instead to use ones of you just speaking naturally. For some reason I also got the best results on this specific finetune with the temperature slider at 0.95 and the repetition penalty slider at 4.5.
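
Combining the best samples into one file can be done in any audio editor. As a hedged alternative, here is a minimal Python sketch using only the standard wave module. It assumes all the clips are plain PCM .wav files with the same channel count, sample width, and sample rate, because it joins the audio as-is without resampling. The function name is made up for illustration.

```python
import wave


def combine_wavs(input_paths, output_path) -> int:
    """Concatenate PCM .wav files into one; return the total number of frames.

    All inputs must share channel count, sample width, and sample rate,
    because the frames are joined as-is without any resampling.
    """
    if not input_paths:
        raise ValueError("no input files given")
    fmt = None
    chunks = []
    for path in input_paths:
        with wave.open(str(path), "rb") as wav:
            this_fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"{path} has format {this_fmt}, expected {fmt}")
            chunks.append(wav.readframes(wav.getnframes()))
    # Only write the output once every input has been read and checked.
    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for chunk in chunks:
            out.writeframes(chunk)
    with wave.open(str(output_path), "rb") as check:
        return check.getnframes()
```

Pass it the list of your best clips and an output path inside the voices folder, then pick the combined file in the voice dropdown. If it raises a ValueError, one of the clips has a different format and needs converting first.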

I recently remade the Star Trek computer LCARS with the original TNG computer voice, and it works perfectly. It was a pretty fun test project. I talk to it using Dragon NaturallySpeaking without using my hands. The post also shows the easiest way to edit Oobabooga's colors, if you're interested.
