This project implements a chatbot with voice recognition (plus a variant that transcribes audio files to text) and text-to-speech conversion of the chatbot's responses. It is based on a fine-tuned AI model designed to respond to movement commands.
The system includes:

- Voice recognition with Whisper, from the microphone or from an audio file.
- A fine-tuned Llama3.1 model that interprets movement commands.
- Text-to-speech conversion of the chatbot's responses with Bark.
This project uses `pip` and `conda` for dependency management. It also requires Ollama and the Llama3.1 model for processing.
```bash
ollama run llama3.1
ollama pull llama3.1
pip install ollama
```
```bash
conda install -c conda-forge ffmpeg numpy
conda install conda-forge::python-sounddevice
conda install conda-forge::librosa
conda install conda-forge::webrtcvad
pip install openai-whisper soundfile
pip install git+https://github.com/suno-ai/bark numpy scipy sounddevice
```
Note: If Bark throws an error related to `weights_only`, edit the `generation.py` file and change:

```python
checkpoint = torch.load(ckpt_path, map_location=device)
```

to:

```python
checkpoint = torch.load(ckpt_path, map_location=device, weights_only=False)
```
Fine-tuning is the process of training a pre-existing model with specific data to optimize its performance for a particular task. In this case, we have fine-tuned Llama3.1 to respond only to movement commands.
We use Llama3.1 as the base model and provide it with customized examples so that it responds only to movement commands such as "move forward," "turn," or "stop."
Example of the `Modelfile`:

```
FROM llama3.1
PARAMETER temperature 0.6
PARAMETER num_ctx 4096
SYSTEM """You are a mobile robot that receives voice commands and responds accordingly."""
MESSAGE user Move forward
MESSAGE assistant Understood, moving forward.
```
- `temperature`: Controls the randomness of the generated responses. A low value (e.g., 0.2) produces more coherent and predictable responses, while a high value (e.g., 1.0) generates more creative and varied responses.
- `num_ctx`: Specifies the maximum context size the model can process, i.e., the maximum number of tokens (words or fragments) the model can use to generate a response.

The messages in the `Modelfile` serve as interaction examples. Their content is not taken literally as the chatbot's final response; rather, they act as examples that guide response generation based on the provided context.
To create the model:
```bash
ollama create llama3.1_finetuned -f Modelfile
```
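Once the model has been created, a quick way to sanity-check it is through the `ollama` Python package installed earlier. This is only an illustrative snippet; the model name matches the one created above and the test prompt is arbitrary.

```python
import ollama

# Quick test of the fine-tuned model (name matches the `ollama create` step above).
response = ollama.chat(
    model="llama3.1_finetuned",
    messages=[{"role": "user", "content": "Turn left"}],
)
print(response["message"]["content"])
```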
We chose Whisper because it offers high accuracy in voice recognition, even in noisy environments. Additionally, it supports multiple languages and dialects. Whisper uses deep learning models for audio-to-text transcription and is based on the Transformer architecture. Its ability to handle different accents and acoustic conditions makes it a robust choice for the chatbot.
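As a rough illustration of the file-based recognition path mentioned above, Whisper can transcribe a pre-recorded audio file in a few lines (the model size and file name below are placeholders, not the project's actual choices):

```python
import whisper

# Load a Whisper model and transcribe an audio file to text.
model = whisper.load_model("base")
result = model.transcribe("comando.wav")
print(result["text"])
```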
Bark is an advanced speech synthesis model that generates realistic audio from text. We chose Bark over alternatives like Google TTS or Festival because it produces more natural and expressive speech and runs entirely locally.
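A minimal sketch of Bark in isolation, assuming the default voice presets bundled with the library (the preset name and output file are illustrative):

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download/cache the Bark models on first run, then synthesize one phrase.
preload_models()
audio = generate_audio("Entendido, avanzando.", history_prompt="v2/es_speaker_0")
write_wav("respuesta.wav", SAMPLE_RATE, audio)
```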
`reconocer_voz()`

```python
def reconocer_voz():
```

This method records real-time audio from the microphone and transcribes it using Whisper, returning the recognized text.
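A minimal sketch of what this could look like, assuming a fixed-duration recording for simplicity (the real implementation may use `webrtcvad` to detect when the speaker stops talking; the model size, duration, and sample rate are placeholders):

```python
import sounddevice as sd
import whisper

modelo_whisper = whisper.load_model("base")

def reconocer_voz(duracion=5, fs=16000):
    # Record a short clip from the default microphone.
    print("Listening...")
    audio = sd.rec(int(duracion * fs), samplerate=fs, channels=1, dtype="float32")
    sd.wait()  # block until the recording is finished

    # Whisper accepts a float32 mono array sampled at 16 kHz.
    resultado = modelo_whisper.transcribe(audio.flatten(), fp16=False)
    return resultado["text"].strip()
```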
`chat_with_ollama_fine_tuned_mic()`

```python
def chat_with_ollama_fine_tuned_mic():
```

This method is the chatbot's core. It works as follows:

1. Calls `reconocer_voz()` to obtain the user's text.
2. Sends that text to the fine-tuned Llama3.1 model to generate a response.
3. Passes the response to `hablar()` for text-to-speech conversion.
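Putting the pieces together, a hedged sketch of this loop could look like the following, reusing the `reconocer_voz()` and `hablar()` sketches in this document and the `llama3.1_finetuned` model created earlier:

```python
import ollama

def chat_with_ollama_fine_tuned_mic():
    # 1. Capture and transcribe the spoken command.
    comando = reconocer_voz()
    print(f"User: {comando}")

    # 2. Ask the fine-tuned model for a response.
    respuesta = ollama.chat(
        model="llama3.1_finetuned",
        messages=[{"role": "user", "content": comando}],
    )
    texto = respuesta["message"]["content"]
    print(f"Robot: {texto}")

    # 3. Speak the response aloud.
    hablar(texto)
```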
`hablar()`
```python
def hablar(texto):
```
This method converts text into audio using Bark and plays it back with sounddevice:

1. Generates the audio with `generate_audio()`.
2. Plays it back with `sd.play()`.
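A minimal sketch of this method, assuming Bark's default voice and blocking playback (the real implementation may differ in voice preset and buffering):

```python
import sounddevice as sd
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # load the Bark models once at startup

def hablar(texto):
    # Synthesize the response text with Bark...
    audio = generate_audio(texto)
    # ...and play it through the default output device.
    sd.play(audio, samplerate=SAMPLE_RATE)
    sd.wait()
```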
This chatbot combines voice recognition, natural language processing, and text-to-speech conversion to provide a seamless experience. Thanks to Whisper, Llama3.1 Fine-Tuned, and Bark, we achieve optimized performance in interpreting movement commands with high precision.