[FR]: Cost estimation for images/video/audio/other

### Proposal summary

Add support for cost estimation for images, video, audio, and other non-text content. Currently, only text is supported.

### Motivation

When working with multimodal models, most of the cost typically comes from images, video, or audio in the inputs or outputs of LLM calls. Without cost estimates for this content, cost tracking becomes less meaningful.