Purpose
Develop an integrated AI-driven virtual live-streaming host system that enables users to rapidly create and customize an AI VTuber, offering more interactive and diverse live content.
Key Features
Language Model: Integrates Google Gemini and OpenAI GPT to provide natural dialog.
Text-to-Speech (TTS): Utilizes Microsoft and OpenAI technologies for human-like voice output.
Automatic Speech Recognition (ASR): Employs OpenAI Whisper for accurate voice input from users or viewers.
VTube Studio: Drives a Live2D avatar and controls its animations via the VTube Studio API.
OBS: Streams audio and video, and displays live-chat subtitles via the OBS WebSocket protocol.
2. System Architecture and Core Flow
2.1 Core Operational Flow
Audio/Text Input
Collected from the user’s microphone or from messages in the live chat.
AI Processing
The chosen large language model (Gemini or GPT) generates a reply.
Voice Output (TTS)
The text reply is synthesized into a spoken voice response.
Virtual Character
VTube Studio is used to animate the avatar’s facial expressions and motions; OBS handles video layout.
Subtitle Display
Subtitles are updated in real time via the OBS WebSocket interface.
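The flow above can be sketched as a simple pipeline. Everything here is a placeholder sketch, not the project's actual API: the function names and return values are assumptions used only to show the ordering of the steps.

```python
# Minimal sketch of the core loop. All functions are hypothetical
# stand-ins for the real LLM / TTS / OBS integrations.

def generate_reply(text: str) -> str:
    """Placeholder for the Gemini/GPT call."""
    return f"AI reply to: {text}"

def synthesize_speech(reply: str) -> bytes:
    """Placeholder for Edge TTS / OpenAI TTS synthesis."""
    return reply.encode("utf-8")  # pretend this is audio data

def update_subtitles(reply: str) -> str:
    """Placeholder for the OBS WebSocket subtitle update."""
    return f"[subtitle] {reply}"

def handle_input(text: str) -> str:
    """One pass through the pipeline: input -> LLM -> TTS -> subtitles."""
    reply = generate_reply(text)
    synthesize_speech(reply)        # audio would be played on the output device
    return update_subtitles(reply)  # subtitle text shown in OBS
```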
Setting
Adjust window opacity, audio output device, microphone input, etc.
Define a “What’s doing now” status as context for the AI (e.g., the current livestream topic).
LiveChat
Connect with YouTube Live / Twitch.
Configure whether to respond to chat, 1-to-1 replies, VIP lists, blacklists, etc.
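One plausible way to combine the reply toggle, VIP list, and blacklist is sketched below; the precedence rules (blacklist beats VIP, VIP beats the global toggle) are an assumption, not necessarily the system's actual policy.

```python
def should_reply(author: str, vip_list: set, blacklist: set,
                 reply_to_all: bool) -> bool:
    """Decide whether the AI should answer a chat message.

    Hypothetical policy: blacklisted viewers are always ignored,
    VIPs are always answered, everyone else follows the global toggle.
    """
    if author in blacklist:
        return False
    if author in vip_list:
        return True
    return reply_to_all
```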
LLM
Select the large language model (Gemini or GPT) and model version.
Set token limits (Max Input/Output Tokens) and randomness (Temperature).
Use “Instruction Enhance” for extra role/prompt reinforcement.
TTS
Choose voice synthesis engine (Edge TTS or OpenAI TTS).
Adjust voice pitch, speed, and volume.
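Edge TTS expects pitch, rate, and volume as signed strings (e.g. "+5Hz", "+10%", "-20%"). How the Control Panel sliders map onto those strings is an assumption; this sketch just shows the formatting:

```python
def edge_tts_params(pitch_hz: int = 0, speed_pct: int = 0,
                    volume_pct: int = 0) -> dict:
    """Format pitch/speed/volume as the signed strings Edge TTS expects.

    The slider-to-parameter mapping is hypothetical; only the string
    format (sign, number, unit) reflects what edge-tts accepts.
    """
    def fmt(n: int, unit: str) -> str:
        return f"{n:+d}{unit}"  # +d always emits a leading sign

    return {
        "pitch": fmt(pitch_hz, "Hz"),
        "rate": fmt(speed_pct, "%"),
        "volume": fmt(volume_pct, "%"),
    }
```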
Whisper
Choose between local inference and the OpenAI API for speech recognition.
Load/unload specific Whisper models; choose language and optional prompt text.
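The local-vs-API choice can be a single dispatch function. In this sketch the backends are passed in as objects so it stays testable; the call shapes loosely mirror openai-whisper's model.transcribe(...) and the OpenAI transcription endpoint, but treat both as assumptions rather than the project's actual wiring.

```python
def transcribe(audio_path: str, use_api: bool, local_model=None,
               api_client=None, language: str = "en", prompt: str = "") -> str:
    """Dispatch ASR to a local Whisper model or the OpenAI API.

    Hypothetical glue code: backend objects are injected by the caller.
    """
    if use_api:
        # OpenAI-style transcription call (sketch; requires a real client)
        with open(audio_path, "rb") as f:
            result = api_client.audio.transcriptions.create(
                model="whisper-1", file=f, language=language, prompt=prompt)
        return result.text
    # openai-whisper-style local call returns a dict with a "text" key
    result = local_model.transcribe(audio_path, language=language,
                                    initial_prompt=prompt)
    return result["text"]
```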
OBS
Update subtitles via OBS WebSocket.
Configure text formatting, clearing, display duration, etc.
Set filter names and delays used when showing or hiding subtitles.
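In the obs-websocket v5 protocol, updating a text source is a SetInputSettings request (opcode 6). This sketch only builds the JSON payload; sending it over the socket, authentication, and the source name are left to the caller, and "Subtitle" below is just an example name.

```python
import json
import uuid

def subtitle_request(source_name: str, text: str) -> str:
    """Build an obs-websocket v5 SetInputSettings request that replaces
    the text of the OBS text source used for subtitles."""
    payload = {
        "op": 6,  # Request opcode in the obs-websocket v5 protocol
        "d": {
            "requestType": "SetInputSettings",
            "requestId": str(uuid.uuid4()),
            "requestData": {
                "inputName": source_name,
                "inputSettings": {"text": text},
            },
        },
    }
    return json.dumps(payload)
```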
VTSP
Integrate with VTube Studio hotkeys.
Optionally enable sentiment analysis to trigger expressions that match the AI’s response.
If sentiment analysis is off, you can trigger the selected hotkeys randomly or always use the same one.
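The two modes can be sketched as one selector function. The sentiment labels and hotkey names here are hypothetical; real hotkey IDs come from your own VTube Studio configuration.

```python
import random
from typing import Optional

# Hypothetical mapping from sentiment label to a VTube Studio hotkey name
SENTIMENT_HOTKEYS = {
    "positive": "Smile",
    "negative": "Sad",
    "neutral": "Idle",
}

def pick_hotkey(sentiment: Optional[str], fallback_hotkeys: list,
                rng: Optional[random.Random] = None) -> str:
    """With sentiment analysis on, return the matching expression hotkey;
    with it off (sentiment is None), pick randomly from the selected set."""
    if sentiment is not None:
        return SENTIMENT_HOTKEYS.get(sentiment, "Idle")
    rng = rng or random.Random()
    return rng.choice(fallback_hotkeys)
```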
5. Recommended Usage Steps
Install Python and Prerequisites
Ensure Python, GPU drivers, CUDA, and cuDNN are correctly installed.
Download and Launch AI-VTuber-System
Download either the full source code or the All-in-One release from GitHub.
Note: The All-in-One version still requires you to install GPU-related components and place the Whisper model of your choice.
Configure the Control Panel
Setting: Provide user name, mic device, audio output device, etc.
LLM: Enter the required API keys (if using GPT or Gemini), and set token limits and temperature.
TTS: Pick a TTS engine (e.g., Edge TTS) and desired voice.
Whisper: Load a local model or use the API.
LiveChat: Set up YouTube/Twitch channels for chat.
OBS: Enable OBS WebSocket, specify a text source name, set subtitle filters/delays.
VTSP: Confirm VTube Studio hotkeys, enable or disable sentiment analysis.
Start the Livestream Workflow
Launch VTube Studio → Enable its API Plugin → Connect from the Control Panel.
Launch OBS → Enable the WebSocket server → Connect from the Control Panel.
Test conversation or mic input in the “Main” tab to verify correct AI replies and subtitles.
Enable YouTube or Twitch chat to confirm the AI automatically responds to viewers.
During the Livestream
The system processes inputs from the host or viewers, generating replies in real time.
OBS displays the AI’s subtitles as the VTube Studio model performs speaking animations.
You can adjust settings in the Control Panel mid-broadcast as needed.
6. Additional Notes
Conversation History
The directory Text_files/ConversationHistory logs all daily conversations in date-labeled .txt files.
Token Calculator
Use My_Tools/Token_Calculator_GUI.bat to quickly estimate text token usage.
GPU Check
Run My_Tools/check_gpu_torch.bat to verify whether Torch can detect your GPU for Whisper.
7. Conclusion
Through the AI-VTuber-System, users can conveniently build a personalized VTuber without extensive Python/AI expertise. The system seamlessly integrates an LLM, TTS, ASR, VTube Studio, and OBS, producing lifelike voice responses together with synchronized Live2D animations and subtitles. This setup significantly broadens the possibilities for interactive, engaging livestreams.
For further installation and usage details, or to review the source code, please visit the GitHub repository.