This is similar to BVQA and the answer_transcription_ollama.
This assumes that ollama support video input and has compatible VLMs. E.g. qwen3-vl
One hurdle is finding a method to efficiently pass a video to ollama using FrameSense container architecture.