Open
Labels
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
Description
This may be expected behavior, but I'm not sure: using different models for response_schema invalidates the implicit cache. Supporting this would greatly help with stable multi-step LLM pipelines over the same content.
Our use case is to keep a markdown representation of a document in the cache and, in separate LLM calls, extract different data models from that document.
E.g., step 1: summarize, step 2: extract metadata, step 3: another custom model. A rough sketch of that pipeline shape follows.
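For illustration only, here is a minimal sketch of the pipeline we are after. It relies on the explicit caching API (client.caches.create plus cached_content) as the workaround we would rather avoid; the model name, TTL, prompts, and the two schemas are placeholders and are not taken from the reproduction further down. Explicit caching also has its own model-support and minimum-token constraints, so this is a sketch, not a recommended implementation.

import os
from typing import List

from google import genai
from google.genai import types
from pydantic import BaseModel, Field

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

# Markdown representation of the document (placeholder content).
md = "..."

class Summary(BaseModel):
    summary: str = Field(..., description="Short summary of the document")

class DocMetadata(BaseModel):
    title: str = Field("", description="Document title, if present")
    people: List[str] = Field(default_factory=list, description="People mentioned in the document")

# Workaround: cache the long document explicitly once (model and TTL are placeholders).
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        contents=[types.Content(role="user", parts=[types.Part(text=md)])],
        ttl="300s",
    ),
)

# Run each pipeline step against the same cache with a different response_schema.
for prompt, schema in [
    ("Summarize the document.", Summary),
    ("Extract the document metadata.", DocMetadata),
]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            cached_content=cache.name,
            response_mime_type="application/json",
            response_schema=schema,
        ),
    )
    print(response.usage_metadata.cached_content_token_count)

If implicit caching survived a response_schema change, the explicit cache management above would not be needed and each step could simply resend the markdown.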
Environment details
- Programming language: Python
- OS: WSL (Ubuntu)
- Language runtime version: 3.11.9
- Package version: 1.11.0
Steps to reproduce
Code example:
import os
from typing import List

from dotenv import load_dotenv
from google import genai
from google.genai.types import GenerateContentConfig
from pydantic import BaseModel, Field

load_dotenv()
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

# Create fake long content to be implicitly cached
md = "\n".join(["This is a fake test document. Improvise the answer to the question. Make it original."] * 1000)

# Response model for the first question
class ResponseModel(BaseModel):
    response: str = Field(..., description="The response from the model")

messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "What is the document about.",
            },
        ],
    },
]
response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=ResponseModel,
    ),
)
print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=77, candidates_tokens_details=None, prompt_token_count=21006, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21006)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21083, traffic_type=None)

# Different question, same long content, same schema, same model
messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "Which people are mentioned in the document?",
            },
        ],
    },
]
response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=ResponseModel,
    ),
)
# Content is cached, as expected
print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=20343)], cached_content_token_count=20343, candidates_token_count=64, candidates_tokens_details=None, prompt_token_count=21008, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21008)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21072, traffic_type=None)

# Different response schema (a different Pydantic model), same long content, same Gemini model
class Bullet(BaseModel):
    bullet: str = Field(..., description="A single bullet point summarizing the document")

class BulletModel(BaseModel):
    bullets: List[Bullet] = Field(default_factory=list, description="The response from the model")

messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "Give 5 bullets summarizing the document.",
            },
        ],
    },
]
response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=BulletModel,
    ),
)
# No cached tokens!
print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=162, candidates_tokens_details=None, prompt_token_count=21008, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21008)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21170, traffic_type=None)

# This is now cached again (same schema as the previous call)
messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "Give 10 bullets summarizing the document.",
            },
        ],
    },
]
response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=BulletModel,
    ),
)
print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=20320)], cached_content_token_count=20320, candidates_token_count=280, candidates_tokens_details=None, prompt_token_count=21009, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21009)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21289, traffic_type=None)
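For reference, this is the check used when eyeballing these runs; it is based only on the usage_metadata fields printed above, and the helper name is ours, not part of the SDK:

from google.genai import types

def implicit_cache_hit(response: types.GenerateContentResponse) -> bool:
    """True when the response reports cached prompt tokens in usage_metadata."""
    usage = response.usage_metadata
    return bool(usage and (usage.cached_content_token_count or 0) > 0)

# Observed with the four calls above:
# call 1 (ResponseModel, first question)  -> False (cache being populated)
# call 2 (ResponseModel, second question) -> True
# call 3 (BulletModel, first question)    -> False (schema change appears to invalidate the implicit cache)
# call 4 (BulletModel, second question)   -> True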