Track cost and latency of your LLM calls
⭐ Like what you see? Star the repo and follow @dvlshah for updates!
Decorator in → Metrics out. Monitor cost & latency of any LLM function without touching its body.
```bash
pip install tokenx-core                                      # 1️⃣ install
```

```python
from tokenx.metrics import measure_cost, measure_latency     # 2️⃣ decorate
from openai import OpenAI

@measure_latency
@measure_cost(provider="openai", model="gpt-4o-mini")
def ask(prompt: str):
    return OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )

resp, m = ask("Hello!")                                      # 3️⃣ run
print(m["cost_usd"], "USD |", m["latency_ms"], "ms")
```
Integrating with LLM APIs often involves hidden costs and variable performance. Manually tracking token usage and calculating costs across different models and providers is tedious and error-prone. tokenx simplifies this by:
- Decorator-based monitoring: Add cost and latency tracking with simple decorators.
- Accurate cost calculation: Uses configurable pricing with caching discounts.
- Latency measurement: Measures API call response times.
- Multi-provider support: Supports OpenAI and Anthropic APIs.
Note: While tokenx strives for accuracy, there may be occasional discrepancies in token and cost metrics. If you encounter issues, please report them here.
```mermaid
flowchart LR
    %% --- subgraph: user's code ---
    subgraph User_Code
        A["API call"]
    end

    %% --- main pipeline ---
    A -- decorators --> B["tokenx wrapper"]
    B -- cost --> C["Cost Calculator"]
    B -- latency --> D["Latency Timer"]
    C -- lookup --> E["model_prices.yaml"]
    B -- metrics --> F["Structured JSON (stdout / exporter)"]
```
No vendor lock-in: the pure-Python wrapper emits plain dicts, so you can pipe them to Prometheus, Datadog, or stdout (see the exporter sketch after the feature list below).
- Cost tracking: USD costing with cached-token discounts (where applicable)
- No-estimates policy: only real usage data; the numbers match what you see in the provider's billing portal
- Latency measurement: measures API response times
- Decorator interface: works with sync and async functions
- Provider support: OpenAI and Anthropic, with more providers coming soon
- Minimal deps: tiktoken and pyyaml only
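For example, exporting the metrics dict to Prometheus takes only a few lines. This is a hedged sketch, not part of tokenx: it assumes `prometheus_client` is installed (it is not a tokenx dependency), the metric names are made up, and `ask` is the decorated function from the quick start above.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names - use whatever fits your monitoring conventions.
LLM_COST = Counter("llm_cost_usd_total", "Total LLM spend in USD", ["provider", "model"])
LLM_LATENCY = Histogram("llm_latency_ms", "LLM call latency in ms", ["provider", "model"])

start_http_server(9100)  # expose /metrics for Prometheus to scrape

response, m = ask("Hello!")  # `ask` is the decorated function from the quick start
LLM_COST.labels(m["provider"], m["model"]).inc(m["cost_usd"])
LLM_LATENCY.labels(m["provider"], m["model"]).observe(m["latency_ms"])
```

The same dict can just as easily be serialized to JSON and shipped through whatever log pipeline you already run.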
```bash
pip install tokenx-core              # stable
pip install tokenx-core[openai]      # with OpenAI provider extras
pip install tokenx-core[anthropic]   # with Anthropic provider extras
pip install tokenx-core[all]         # with all provider extras
```
Here's how to monitor your OpenAI API calls with just two lines of code:
```python
from tokenx.metrics import measure_cost, measure_latency
from openai import OpenAI

@measure_latency
@measure_cost(provider="openai", model="gpt-4o-mini")  # Always specify provider and model
def call_openai():
    client = OpenAI()
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}]
    )

response, metrics = call_openai()

# Access your metrics
print(f"Cost: ${metrics['cost_usd']:.6f}")
print(f"Latency: {metrics['latency_ms']:.2f}ms")
print(f"Tokens: {metrics['input_tokens']} in, {metrics['output_tokens']} out")
print(f"Cached tokens: {metrics['cached_tokens']}")  # New in v0.2.0
```
Here's how to monitor your Anthropic API calls with just two lines of code:
```python
from tokenx.metrics import measure_cost, measure_latency
from anthropic import Anthropic  # Or AsyncAnthropic

@measure_latency
@measure_cost(provider="anthropic", model="claude-3-haiku-20240307")
def call_anthropic(prompt: str):
    # For Anthropic prompt caching metrics, initialize the client with .beta
    # and use 'cache_control' in your messages.
    # Example: client = Anthropic(api_key="YOUR_KEY").beta
    client = Anthropic()  # Standard client
    return client.messages.create(
        model="claude-3-haiku-20240307",  # Ensure the model matches the decorator
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )

response_claude, metrics_claude = call_anthropic("Why is the sky blue?")

print(f"Cost: ${metrics_claude['cost_usd']:.6f} USD")
print(f"Latency: {metrics_claude['latency_ms']:.2f} ms")
print(f"Input Tokens: {metrics_claude['input_tokens']}")
print(f"Output Tokens: {metrics_claude['output_tokens']}")

# For Anthropic, 'cached_tokens' reflects 'cache_read_input_tokens'.
# 'cache_creation_input_tokens' will also be in metrics if the caching beta is active.
print(f"Cached (read) tokens: {metrics_claude.get('cached_tokens', 0)}")
print(f"Cache creation tokens: {metrics_claude.get('cache_creation_input_tokens', 'N/A (requires caching beta)')}")
```
Here's how to monitor OpenAI's new audio models with dual token pricing:
```python
from tokenx.metrics import measure_cost, measure_latency
from openai import OpenAI

@measure_latency
@measure_cost(provider="openai", model="gpt-4o-mini-transcribe")
def transcribe_audio(audio_file_path: str):
    client = OpenAI()
    with open(audio_file_path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe",
            file=audio_file,
            response_format="verbose_json"
        )

@measure_latency
@measure_cost(provider="openai", model="gpt-4o-mini-tts")
def generate_speech(text: str):
    client = OpenAI()
    return client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=text
    )

# Real usage data with separate audio/text token costs
response, metrics = transcribe_audio("path/to/audio.mp3")  # Replace with an actual audio file path
print(f"Transcription cost: ${metrics['cost_usd']:.6f}")   # Combines audio + text tokens
print(f"Audio tokens: {metrics.get('audio_input_tokens', 0)}")
print(f"Text tokens: {metrics['input_tokens']}")
```
The `measure_cost` decorator requires explicit provider and model specification:
```python
@measure_cost(provider="openai", model="gpt-4o")               # Explicit specification required
def my_function(): ...

@measure_cost(provider="openai", model="gpt-4o", tier="flex")  # Optional tier
def my_function(): ...
```
The `measure_latency` decorator works with both sync and async functions:
```python
@measure_latency
def sync_function(): ...

@measure_latency
async def async_function(): ...
```
Decorators can be combined in any order:
```python
@measure_latency
@measure_cost(provider="openai", model="gpt-4o")
def my_function(): ...

# Equivalent to:
@measure_cost(provider="openai", model="gpt-4o")
@measure_latency
def my_function(): ...
```
Both decorators work seamlessly with async functions:
```python
import asyncio

from tokenx.metrics import measure_cost, measure_latency
from openai import AsyncOpenAI  # Use the async client

@measure_latency
@measure_cost(provider="openai", model="gpt-4o-mini")
async def call_openai_async():
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Tell me an async joke!"}]
    )
    return response

async def main():
    response, metrics = await call_openai_async()
    print(metrics)

# asyncio.run(main())  # Example of how to run it
```
tokenx is designed to work with multiple LLM providers. Here's the current compatibility overview:
| Provider | Status | SDK Version | Supported APIs | Cache Support |
|---|---|---|---|---|
| OpenAI | ✅ | >= 1.0.0 | Chat, Responses, Embeddings, Audio (Transcription/TTS), Moderation, Images | ✅ |
| Anthropic | ✅ | >= 0.20.0 | Messages | ✅ |
View the Complete Coverage Matrix for a detailed breakdown of all supported APIs, models, and features.
- SDK Versions: Compatible with OpenAI Python SDK v1.0.0 and newer.
- Response Formats: Supports dictionary responses from older SDK versions and Pydantic model responses from newer SDK versions, with cached token extraction from `prompt_tokens_details.cached_tokens` (see the sketch after this list).
- API Types:
  - ✅ Chat Completions API
  - ✅ Responses API (advanced interface)
  - ✅ Embeddings API
  - ✅ Audio Transcription API (token-based models)
  - ✅ Text-to-Speech API (token-based models)
  - ✅ Audio Preview API
  - ✅ Realtime Audio API
  - ✅ Moderation API
  - ✅ Image Generation API (gpt-image-1 with hybrid pricing)
- Pricing Features:
  - ✅ Dual token pricing (separate audio/text tokens)
  - ✅ Hybrid pricing (tokens + per-image costs)
  - ✅ No-estimates policy (accuracy guaranteed)
- Models: 47+ models including o3, o4-mini, gpt-4o-transcribe, gpt-4o-mini-tts, gpt-image-1
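To illustrate where the cached-token figure comes from, here is a minimal sketch of reading it straight off an OpenAI SDK v1.x response (whether the field is populated depends on your SDK version and model); tokenx performs this extraction for you and surfaces the value as `metrics["cached_tokens"]`.

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)

usage = response.usage
# prompt_tokens_details may be missing or None on older SDKs/models, so guard the access.
details = getattr(usage, "prompt_tokens_details", None)
cached = (getattr(details, "cached_tokens", 0) or 0) if details else 0
print(usage.prompt_tokens, usage.completion_tokens, cached)
```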
- SDK Versions: Compatible with Anthropic Python SDK (e.g., v0.20.0+ for full prompt caching beta fields).
- Response Formats: Supports Pydantic-like response objects.
- Token Extraction: Extracts input_tokens and output_tokens from the usage object.
- Caching (Prompt Caching Beta): To use Anthropic's prompt caching and see the related metrics, enable the beta feature on your client (`client = Anthropic().beta`) and define `cache_control` checkpoints in your messages, as sketched below. tokenx maps `cache_read_input_tokens` to `cached_tokens` (for cost discounts if `cached_in` is defined) and includes `cache_creation_input_tokens` directly in the metrics.
- API Types: Supports the Messages API.
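Below is a hedged sketch of that flow, following the notes above rather than a verified recipe: it assumes the SDK's `.beta` namespace, an `ephemeral` `cache_control` block on a large system prompt, and that the prompt is long enough to be eligible for caching.

```python
from tokenx.metrics import measure_cost, measure_latency
from anthropic import Anthropic

LONG_CONTEXT = "Reference material ... " * 500  # reusable prefix, large enough to cache

@measure_latency
@measure_cost(provider="anthropic", model="claude-3-haiku-20240307")
def ask_with_cache(question: str):
    client = Anthropic().beta  # beta namespace exposes the prompt caching fields
    return client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        system=[
            {
                "type": "text",
                "text": LONG_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # cache checkpoint
            }
        ],
        messages=[{"role": "user", "content": question}],
    )

_, first = ask_with_cache("Summarise the material.")   # first call writes the cache
_, second = ask_with_cache("List the key points.")     # second call should read from it
print(first.get("cache_creation_input_tokens"), second.get("cached_tokens"))
```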
Prices are loaded from the `model_prices.yaml` file. You can update this file when new models are released or prices change:
```yaml
openai:
  gpt-4o:
    sync:
      in: 2.50          # USD per million input tokens
      cached_in: 1.25   # USD per million cached tokens (OpenAI specific)
      out: 10.00        # USD per million output tokens

  # Audio models with dual token pricing
  gpt-4o-mini-transcribe:
    sync:
      in: 1.25          # USD per million input tokens (text tokens)
      out: 5.00         # USD per million output tokens (text tokens)
      audio_in: 3.00    # USD per million input tokens (audio tokens)

  gpt-4o-mini-tts:
    sync:
      in: 0.60          # USD per million input tokens (text tokens)
      audio_out: 12.00  # USD per million output tokens (audio tokens)

anthropic:
  claude-3-haiku-20240307:
    sync:
      in: 0.25          # USD per million input tokens
      # cached_in: 0.10 # Example: if Anthropic offered a specific rate for cache_read_input_tokens
      out: 1.25         # USD per million output tokens
      # If 'cached_in' is not specified for an Anthropic model,
      # all input_tokens (including cache_read_input_tokens) are billed at the 'in' rate.
```
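For reference, the arithmetic these per-million rates imply looks like the sketch below. It assumes cached tokens are billed at the `cached_in` rate and the remaining input tokens at `in`, which is how the discount is described above; the token counts are made up.

```python
# Worked example with the gpt-4o rates above (USD per million tokens)
IN, CACHED_IN, OUT = 2.50, 1.25, 10.00
input_tokens, cached_tokens, output_tokens = 1_000, 200, 500

cost_usd = (
    (input_tokens - cached_tokens) * IN   # uncached input tokens
    + cached_tokens * CACHED_IN           # cached input tokens at the discounted rate
    + output_tokens * OUT                 # output tokens
) / 1_000_000

print(f"{cost_usd:.6f}")  # 0.007250
```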
By default, tokenx will:
- Check for updated pricing daily (24-hour cache)
- Automatically fetch the latest prices from the official repository
- Use stale cache if network is unavailable
- Show clear error if no pricing data available
The pricing YAML in the remote repository is kept up to date as the supported providers publish pricing changes, so the automatically fetched copy stays current.
```python
# First run - fetches latest prices
from tokenx.cost_calc import CostCalculator

calc = CostCalculator.for_provider("openai", "gpt-4o")
# Output: "tokenx: Fetching latest model prices..."
# Output: "tokenx: Cached latest prices locally."

# Subsequent runs - uses cache
calc2 = CostCalculator.for_provider("openai", "gpt-4o-mini")
# Output: "tokenx: Using cached model prices."
```
For enterprise users with custom pricing contracts:
```bash
# Set the environment variable to your custom pricing file
export TOKENX_PRICES_PATH="/path/to/my-custom-pricing.yaml"
```
Create a YAML file with only the prices you want to override:
```yaml
# my-custom-pricing.yaml
openai:
  gpt-4o:
    sync:
      in: 1.50   # Custom rate: $1.50 per million tokens
      out: 8.00  # Custom rate: $8.00 per million tokens

# You only need to specify what you're overriding.
# All other models use the standard pricing.
```
When you use the decorators, you'll get a structured metrics dictionary:
```python
{
    "provider": "openai",
    "model": "gpt-4o-mini",
    "tier": "sync",
    "input_tokens": 12,
    "output_tokens": 48,
    "cached_tokens": 20,    # New in v0.2.0
    "cost_usd": 0.000348,   # $0.000348 USD
    "latency_ms": 543.21    # 543.21 milliseconds
}
```
For Anthropic (when prompt caching beta is active and cache is utilized):
```python
{
    "provider": "anthropic",
    "model": "claude-3-haiku-20240307",
    "tier": "sync",
    "input_tokens": 25,                 # Total input tokens for the request
    "output_tokens": 60,
    "cached_tokens": 10,                # Populated from Anthropic's 'cache_read_input_tokens'
    "cache_creation_input_tokens": 15,  # Anthropic's 'cache_creation_input_tokens'
    "cost_usd": 0.0000XX,
    "usd": 0.0000XX,
    "latency_ms": 750.21
}
```
If Anthropic's prompt caching beta is not used or no cache interaction occurs, `cached_tokens` will be 0 and `cache_creation_input_tokens` may be 0 or absent.
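Because every decorated call returns a plain dict alongside the response, rolling up spend and latency is ordinary Python. A minimal sketch, reusing the `ask` function from the quick start above:

```python
# Each item is a (response, metrics) tuple returned by the decorated function.
calls = [ask(q) for q in ("Hello!", "What is tokenx?", "Goodbye!")]
total_cost = sum(m["cost_usd"] for _, m in calls)
avg_latency = sum(m["latency_ms"] for _, m in calls) / len(calls)
print(f"Total: ${total_cost:.6f} | Avg latency: {avg_latency:.1f} ms")
```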
tokenx makes it easy to add support for new LLM providers using the `@register_provider` decorator:
```python
from typing import Any, Optional

from tokenx.providers import register_provider
from tokenx.providers.base import ProviderAdapter, Usage

@register_provider("custom")
class CustomAdapter(ProviderAdapter):
    @property
    def provider_name(self) -> str:
        return "custom"

    def usage_from_response(self, response: Any) -> Usage:
        # Extract standardized usage information from the provider response
        return Usage(
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            cached_tokens=getattr(response.usage, "cached_tokens", 0),
            extra_fields={"provider": "custom"},
        )

    def matches_function(self, func: Any, args: tuple, kwargs: dict) -> bool:
        # Detect whether this function uses your provider
        return "custom" in kwargs.get("model", "").lower()

    def detect_model(self, func: Any, args: tuple, kwargs: dict) -> Optional[str]:
        return kwargs.get("model")

    def calculate_cost(self, model: str, input_tokens: int,
                       output_tokens: int, cached_tokens: int = 0,
                       tier: str = "sync") -> float:
        # Implement your pricing logic
        return input_tokens * 0.001 + output_tokens * 0.002

# Now you can use it immediately
from tokenx.metrics import measure_cost

@measure_cost(provider="custom", model="custom-model")
def call_custom_llm():
    # Your API call here
    pass
```
```bash
git clone https://github.com/dvlshah/tokenx.git
cd tokenx
pip install -e .[dev]   # or `poetry install`
pre-commit install
pytest -q && mypy src/
```
See CONTRIBUTING.md for details.
See CHANGELOG.md for full history.
MIT © 2025 Deval Shah
If tokenx saves you time or money, please consider sponsoring or giving a ⭐. It really helps!