Description
Proposal summary
Add support of cost estimation for images, video, audio, other. Currently only text is supported.
Motivation
When working with multimodal models, the majority of the cost typically comes from including images, video, or audio in the inputs or outputs of LLM calls. Without visibility into these elements, cost tracking becomes less meaningful.