Watching a sound wave or unrelated images while listening to a podcast published on a platform that supports video and images can be dull.
We propose asking a trio of AI models, bundled in an app we call Talkomic, to generate images that are closely tied to the audio content discussed in the podcast.
The team at Rendergon Ltd responsible for the development of the Tapgaze apps has worked on this prototype. We are new to this field and keen to tinker and learn! The prototype has been tested in the Unity Editor and in a Windows 11 build. It also serves as a proof of concept to test the ability of AI models to help audio media extend its reach.
The app's AI models transcribe an audio file to text and generate contextual images closely tied to the transcribed text.
Visit the Github repo, where the full Unity implementation steps are described.
We chain the results of three AI models to generate the images for an audio podcast.
We initially got the project running with Unity's neural network inference engine Sentis, currently in closed beta, and added some Onnxruntime code to make it work. The project in this blog is based exclusively on Onnxruntime. Unity's team is making great strides with Sentis and I look forward to future updates!
In our Unity project, we run two of the AI models locally and access a third AI model remotely via an API.
We bundle these models inside Unity3D.
In a Unity scene we loop the AI models over each podcast audio section to generate the contextual images, as sketched below.
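To make the flow concrete, here is a minimal sketch of such a loop as a Unity MonoBehaviour. The three stub methods are placeholders for the model calls described in the rest of this post, not the actual types or signatures used in the repo:

```csharp
using UnityEngine;

// Sketch of the per-section loop; the three stubs stand in for the
// model calls described below, not the repo's actual API.
public class TalkomicPipeline : MonoBehaviour
{
    [SerializeField] private AudioClip[] podcastSections; // one clip per podcast section

    private void Start()
    {
        foreach (var section in podcastSections)
        {
            string transcript = Transcribe(section);          // 1. Whisper-tiny, local Onnxruntime
            string imagePrompt = DescribePicture(transcript); // 2. ChatGPT via remote API
            Texture2D image = GenerateImage(imagePrompt);     // 3. Stable Diffusion, local Onnxruntime
            Debug.Log($"Section {section.name}: generated {image.width}x{image.height} image");
        }
    }

    // Placeholder stubs: see the sections below for sketches of each step.
    private string Transcribe(AudioClip clip) => throw new System.NotImplementedException();
    private string DescribePicture(string transcript) => throw new System.NotImplementedException();
    private Texture2D GenerateImage(string prompt) => throw new System.NotImplementedException();
}
```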
We use the Onnx format for the AI models.
This model transcribes audio into text. Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision. Whisper-tiny is the smallest of the five available Whisper models, with 39M parameters, a great candidate to run on a mobile device.
We convert the transformer whisper-tiny AI model into the Onnx format with Microsoft Olive. The Olive github repo has a useful example with the configuration required to optimize this model using ONNXRuntime tools:
Whisper requires pre- and post-processing steps for the audio inputs and model outputs. It would have been challenging to implement these with Python libraries in Unity. Fortunately, the Onnx optimization embeds the WhisperProcessor steps inside the AI model itself, removing the need to code them outside the model! You can learn more about this feature and more in Microsoft's Build Onnxruntime demo.
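As a sketch, calling this all-in-one model from C# with the Microsoft.ML.OnnxRuntime package could look like the following. The input and output names (audio_pcm and the beam-search parameters) are assumptions based on the Olive Whisper example at the time of writing; verify them against your generated model's InputMetadata:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public static class WhisperOnnx
{
    // Transcribes one chunk of mono PCM samples (<= 30 s) with the all-in-one
    // Whisper model produced by Olive. Input/output names are assumptions
    // based on the Olive example; check session.InputMetadata for your model.
    public static string Transcribe(InferenceSession session, float[] pcm)
    {
        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("audio_pcm",
                new DenseTensor<float>(pcm, new[] { 1, pcm.Length })),
            NamedOnnxValue.CreateFromTensor("max_length",
                new DenseTensor<int>(new[] { 448 }, new[] { 1 })),
            NamedOnnxValue.CreateFromTensor("min_length",
                new DenseTensor<int>(new[] { 1 }, new[] { 1 })),
            NamedOnnxValue.CreateFromTensor("num_beams",
                new DenseTensor<int>(new[] { 2 }, new[] { 1 })),
            NamedOnnxValue.CreateFromTensor("num_return_sequences",
                new DenseTensor<int>(new[] { 1 }, new[] { 1 })),
            NamedOnnxValue.CreateFromTensor("length_penalty",
                new DenseTensor<float>(new[] { 1f }, new[] { 1 })),
            NamedOnnxValue.CreateFromTensor("repetition_penalty",
                new DenseTensor<float>(new[] { 1f }, new[] { 1 })),
        };

        using var results = session.Run(inputs);
        // The model with embedded pre/post-processing returns the decoded text directly.
        return results.First().AsTensor<string>().GetValue(0);
    }
}
```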
I can suggest these steps for a quick Whisper configuration with Olive:
Whisper workflow configuration (watch out for copy/paste formatting character errors):
The Whisper model is designed to work on audio samples of up to 30s in duration. Hence we split each podcast section into chunks of at most 30 seconds and loop these through Whisper-tiny, one section at a time.
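For illustration, a simple way to cut a Unity AudioClip into chunks of at most 30 seconds (a sketch; the repo may chunk differently, e.g. on silence boundaries):

```csharp
using System.Collections.Generic;
using UnityEngine;

public static class AudioChunker
{
    // Splits a clip's first channel into chunks of at most maxSeconds,
    // ready to be looped through Whisper-tiny one chunk at a time.
    // Note: Whisper expects 16 kHz audio; resample first if the clip rate differs.
    public static List<float[]> Split(AudioClip clip, int maxSeconds = 30)
    {
        var samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        // Down-mix to mono by taking channel 0 of each frame.
        var mono = new float[clip.samples];
        for (int i = 0; i < clip.samples; i++)
            mono[i] = samples[i * clip.channels];

        int chunkLen = clip.frequency * maxSeconds;
        var chunks = new List<float[]>();
        for (int start = 0; start < mono.Length; start += chunkLen)
        {
            int len = Mathf.Min(chunkLen, mono.Length - start);
            var chunk = new float[len];
            System.Array.Copy(mono, start, chunk, 0, len);
            chunks.Add(chunk);
        }
        return chunks;
    }
}
```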
As a large language model, and unlike the Unet in the diffusion model, ChatGPT can help us generate text that describes an image closely tied to the discussion in the podcast.
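Since this is the one model we access remotely, a UnityWebRequest against the OpenAI chat completions endpoint could look like the sketch below. The model name, prompt wording, and the minimal JSON handling are placeholder choices, not the repo's exact code:

```csharp
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

[System.Serializable]
public class ChatMessage { public string role; public string content; }

[System.Serializable]
public class ChatRequest
{
    public string model = "gpt-3.5-turbo";
    public ChatMessage[] messages;
    public ChatRequest(string transcript)
    {
        messages = new[] { new ChatMessage {
            role = "user",
            content = "Describe, in one sentence, an image that illustrates this podcast excerpt: " + transcript } };
    }
}

public class ChatGptPromptClient : MonoBehaviour
{
    private const string Endpoint = "https://api.openai.com/v1/chat/completions";
    [SerializeField] private string apiKey; // your OpenAI API key

    // Asks ChatGPT to compress a transcript into a one-sentence image description.
    public IEnumerator DescribePicture(string transcript, System.Action<string> onDone)
    {
        string body = JsonUtility.ToJson(new ChatRequest(transcript));
        using var request = new UnityWebRequest(Endpoint, "POST");
        request.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        request.downloadHandler = new DownloadHandlerBuffer();
        request.SetRequestHeader("Content-Type", "application/json");
        request.SetRequestHeader("Authorization", "Bearer " + apiKey);

        yield return request.SendWebRequest();
        if (request.result == UnityWebRequest.Result.Success)
            onDone(request.downloadHandler.text); // parse choices[0].message.content from this JSON
        else
            Debug.LogError(request.error);
    }
}
```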
These models are trained on 256 A100 GPUs for ~150k GPU-hours, on ~120M image-text pairs.
Generally speaking, these models are trained to denoise random Gaussian noise step by step until an image emerges: the neural network learns to predict the noise present at step t in a corrupted image, and the reverse process then walks the corrupted image back to step t0, the denoised image. A drawback is that they can be slow.
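For reference, these are the standard DDPM equations behind that description, where epsilon_theta is the noise-predicting network, alpha-bar_t is the cumulative noise schedule, and z is fresh Gaussian noise:

```latex
% Training objective: predict the noise added to a clean image x_0
L(\theta) = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}
  \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0
  + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\|^2 \right]

% Reverse (sampling) step from x_t to x_{t-1}, with z \sim \mathcal{N}(0, I)
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t
  - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
  + \sigma_t z
```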
Image courtesy of this visually intuitive blog; I can also suggest this blog to learn more.
We clone the stable diffusion v1.4 model from the onnx branch on Huggingface:
This is a type of diffusion model called Latent Diffusion: the model is trained to generate latent (compressed) representations of the images, which is faster than working in the actual (larger) pixel space.
Get the onnx models and unet weights .pb file:
These are the main components in the Latent Diffusion pipeline: a CLIP-based text encoder that turns the prompt tokens into embeddings, a U-Net that iteratively denoises the latents, a scheduler that drives the denoising steps, and a VAE decoder that maps the final latents back to pixel space.
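To show how these components chain together, here is a heavily simplified C# sketch using Onnxruntime. Tokenization, classifier-free guidance, and the real scheduler math (PNDM/LMS) are omitted or stubbed, and the tensor names (input_ids, sample, timestep, encoder_hidden_states, latent_sample) follow the Huggingface onnx export conventions, which you should verify against your model files:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public class StableDiffusionOnnx
{
    private readonly InferenceSession textEncoder = new InferenceSession("text_encoder/model.onnx");
    private readonly InferenceSession unet = new InferenceSession("unet/model.onnx");
    private readonly InferenceSession vaeDecoder = new InferenceSession("vae_decoder/model.onnx");

    // Prompt token ids -> latents -> pixels. Scheduler math is stubbed.
    public Tensor<float> Generate(int[] promptTokenIds, int steps = 50)
    {
        // 1. Text encoder: token ids -> text embeddings.
        var tokenTensor = new DenseTensor<int>(promptTokenIds, new[] { 1, promptTokenIds.Length });
        var encoded = textEncoder.Run(new[] {
            NamedOnnxValue.CreateFromTensor("input_ids", tokenTensor) }); // kept alive for the sketch
        Tensor<float> textEmbeddings = encoded.First().AsTensor<float>();

        // 2. Start from Gaussian noise in the 4x64x64 latent space.
        var rng = new System.Random();
        var latents = new DenseTensor<float>(new[] { 1, 4, 64, 64 });
        for (int i = 0; i < latents.Length; i++)
            latents.SetValue(i, Gaussian(rng));

        // 3. Denoising loop: the U-Net predicts the noise at each timestep and
        //    a scheduler (omitted here) steps the latents toward t = 0.
        foreach (int t in Timesteps(steps))
        {
            using var noisePred = unet.Run(new[] {
                NamedOnnxValue.CreateFromTensor("sample", latents),
                NamedOnnxValue.CreateFromTensor("timestep",
                    new DenseTensor<long>(new[] { (long)t }, new[] { 1 })),
                NamedOnnxValue.CreateFromTensor("encoder_hidden_states", textEmbeddings) });
            latents = SchedulerStep(latents, noisePred.First().AsTensor<float>(), t);
        }

        // 4. VAE decoder: rescale latents (SD uses 1/0.18215) and decode to pixels.
        for (int i = 0; i < latents.Length; i++)
            latents.SetValue(i, latents.GetValue(i) / 0.18215f);
        var decoded = vaeDecoder.Run(new[] {
            NamedOnnxValue.CreateFromTensor("latent_sample", latents) });
        return decoded.First().AsTensor<float>(); // 1x3x512x512, values in [-1, 1]
    }

    // Box-Muller transform for standard normal samples.
    private static float Gaussian(System.Random rng) =>
        (float)(System.Math.Sqrt(-2.0 * System.Math.Log(1.0 - rng.NextDouble()))
        * System.Math.Cos(2.0 * System.Math.PI * rng.NextDouble()));

    // Descending timesteps from ~999 to ~0; a real scheduler defines these.
    private static IEnumerable<int> Timesteps(int steps) =>
        Enumerable.Range(0, steps).Select(i => 999 - i * (1000 / steps));

    // Placeholder: a real implementation needs a PNDM/LMS/DDIM scheduler here.
    private static DenseTensor<float> SchedulerStep(
        DenseTensor<float> latents, Tensor<float> noisePred, int t) =>
        throw new System.NotImplementedException();
}
```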
The images for this podcast were generated with the following text prompt input to the stable diffusion model: “Create an image cyberpunk style. ”
We enhance the details of the 512×512 images generated by the stable diffusion AI model, upscaling them with ESRGAN to a crisper 2048×2048 resolution.
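A sketch of the upscaling call with Onnxruntime, assuming a 4x ESRGAN model exported to Onnx with a 1x3xHxW float input in [0, 1]; the tensor names ("input", "output" here) depend on the export, so check the session metadata:

```csharp
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using UnityEngine;

public static class EsrganUpscaler
{
    // Upscales a 512x512 texture to 2048x2048 with a 4x ESRGAN Onnx model.
    // Tensor names and the [0, 1] value range are assumptions about the export.
    public static Texture2D Upscale(InferenceSession session, Texture2D source)
    {
        int w = source.width, h = source.height;
        Color[] pixels = source.GetPixels();

        // Color array -> NCHW float tensor.
        var input = new DenseTensor<float>(new[] { 1, 3, h, w });
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                Color c = pixels[y * w + x];
                input[0, 0, y, x] = c.r;
                input[0, 1, y, x] = c.g;
                input[0, 2, y, x] = c.b;
            }

        using var results = session.Run(new[] {
            NamedOnnxValue.CreateFromTensor("input", input) });
        var output = results.First().AsTensor<float>();

        // NCHW tensor -> 4x larger texture. GetPixels/SetPixels share Unity's
        // bottom-up row order, so no vertical flip is needed.
        int W = w * 4, H = h * 4;
        var upscaled = new Texture2D(W, H, TextureFormat.RGB24, false);
        var outPixels = new Color[W * H];
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                outPixels[y * W + x] = new Color(
                    output[0, 0, y, x], output[0, 1, y, x], output[0, 2, y, x]);
        upscaled.SetPixels(outPixels);
        upscaled.Apply();
        return upscaled;
    }
}
```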
My suggested steps for a quick git installation (kept up to date in my forked repo):
All enhanced images are available to download here.
Pre-processed Image:
ESRGAN Post-Processed Image:
Special thanks to Jason Gauci, co-host of the Programming Throwdown podcast, whose idea shared on this podcast served as inspiration for this prototype.
I am thrilled and truly grateful to Maurizio Raffone at the Tech Shift F9 Podcast for trusting me to run a proof of concept of the Talkomic app prototype with the audio file of a fantastic episode of his podcast.
We thank @sd-akashic, @Haoming02 and @Microsoft for helping us better understand the onnxruntime implementation in Unity.
We thank ai-forever for posting the models and git repo.