Description
Your question
Hi there, everyone. I'm trying to work out how much using this will actually cost.
I see OpenAI is $80 for 1 million output tokens.
Gemini seems a bit better at $12. Now I'm trying to work out how much actual voice output that buys. And is this billed for the whole session (so the clock ticks even if it's not outputting anything)?
Also got this from Gemini and want to know if this looks correct and what else there is to consider:
You've already got the key information to calculate this! Let's re-state it and then figure out how many minutes $1 will get you.
Key Information:
1 second of audio = 32 tokens (this is a standard conversion for Gemini models)
1 minute of audio = 1,920 tokens (32 tokens/second * 60 seconds/minute)
Given your pricing:
Input audio cost: $3 per 1M input audio tokens
Output audio cost: $12 per 1M output audio tokens
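As a rough sketch, the conversion above can be turned into a cost per minute of audio. The 32 tokens/second figure and the $3 / $12 rates are simply the numbers quoted above, treated here as assumptions rather than confirmed pricing, and the function names are made up for illustration.

```python
# Assumptions taken from the figures quoted above (not verified pricing).
TOKENS_PER_SECOND = 32          # assumed audio-to-token conversion
INPUT_USD_PER_MILLION = 3.0     # assumed input audio rate
OUTPUT_USD_PER_MILLION = 12.0   # assumed output audio rate


def tokens_per_minute(tokens_per_second=TOKENS_PER_SECOND):
    """Tokens consumed by one minute of audio."""
    return tokens_per_second * 60


def cost_per_minute(usd_per_million_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Dollar cost of one minute of audio at the given per-million-token rate."""
    return tokens_per_minute(tokens_per_second) * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    print(f"Tokens per minute: {tokens_per_minute()}")
    print(f"Input cost per minute:  ${cost_per_minute(INPUT_USD_PER_MILLION):.5f}")
    print(f"Output cost per minute: ${cost_per_minute(OUTPUT_USD_PER_MILLION):.5f}")
```

If the real conversion rate or prices differ, only the three constants at the top need to change.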
How many minutes will $1 last for Input Audio?
Tokens you get for $1 (Input):
You pay $3 for 1,000,000 tokens.
So, for $1, you get (1,000,000 tokens / $3) = 333,333.33 tokens (approximately)
Minutes of input audio for those tokens:
Since 1 minute = 1,920 tokens, then:
333,333.33 tokens / 1,920 tokens/minute = 173.61 minutes (approximately)
So, $1 will get you about 173.61 minutes of input audio with Gemini 2.5 Flash Live API.
How many minutes will $1 last for Output Audio?
Tokens you get for $1 (Output):
You pay $12 for 1,000,000 tokens.
So, for $1, you get (1,000,000 tokens / $12) = 83,333.33 tokens (approximately)
Minutes of output audio for those tokens:
Since 1 minute = 1,920 tokens, then:
83,333.33 tokens / 1,920 tokens/minute = 43.40 minutes (approximately)
So, $1 will get you about 43.40 minutes of output audio with Gemini 2.5 Flash Live API.
Summary for $1:
Input Audio: Approximately 173.61 minutes
Output Audio: Approximately 43.40 minutes
At these rates, output audio costs four times as much per minute as input audio ($12 vs. $3 per 1M tokens), which is a common characteristic of these types of AI services due to the computational resources required for high-quality audio synthesis.
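To double-check the arithmetic in the answer above, here is a small sketch that recomputes the minutes per dollar and estimates the cost of a session. It assumes the same 32 tokens/second and $3 / $12 per 1M token figures quoted above; `session_cost` is a hypothetical helper, and it only counts the minutes you pass in. Whether silent time in an open session is billed as input is exactly the open question, so that is left to the caller.

```python
TOKENS_PER_SECOND = 32  # assumed conversion, from the quoted answer


def minutes_per_dollar(usd_per_million_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Minutes of audio that one dollar buys at the given rate."""
    tokens_per_dollar = 1_000_000 / usd_per_million_tokens
    return tokens_per_dollar / (tokens_per_second * 60)


def session_cost(input_minutes, output_minutes,
                 input_rate=3.0, output_rate=12.0,
                 tokens_per_second=TOKENS_PER_SECOND):
    """Estimated dollar cost for the given billed input and output minutes.

    Rates are in USD per 1M tokens and are assumptions, not confirmed pricing.
    """
    tokens_per_min = tokens_per_second * 60
    input_cost = input_minutes * tokens_per_min * input_rate / 1_000_000
    output_cost = output_minutes * tokens_per_min * output_rate / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    print(f"Input minutes per $1:  {minutes_per_dollar(3.0):.2f}")
    print(f"Output minutes per $1: {minutes_per_dollar(12.0):.2f}")
    # Example: 10 minutes of billed input audio and 5 minutes of output audio.
    print(f"Example session cost:  ${session_cost(10, 5):.4f}")
```

The numbers match the 173.61 and 43.40 minutes given above, so the arithmetic itself holds as long as the token conversion and rates are right.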