
New response_schema invalidates implicit cache #1148

@JohanBekker

Description

This may be expected behavior, but I'm not sure. Using a different Pydantic model for response_schema invalidates the implicit cache. Keeping the cache valid across schema changes would greatly help with stable multi-step LLM pipelines over the same content.

Our use case is to keep a markdown representation of a document in the implicit cache and, in separate LLM calls, extract different data models from it.

E.g., step 1: summarize, step 2: extract metadata, step 3: another custom model.
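
For now, the closest workaround seems to be explicit caching, where the document is cached once and referenced by name in every step. Below is a minimal sketch of step 1, assuming client.caches.create with CreateCachedContentConfig and the cached_content field on GenerateContentConfig work as documented for this model; the TTL and the Summary schema are placeholders.

import os

from google import genai
from google.genai import types
from pydantic import BaseModel, Field

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

md = "..."  # the long markdown representation of the document

# Cache the shared document once. Explicit caches are billed differently and
# may not be available for every (preview) model, so treat this as a sketch.
cache = client.caches.create(
    model="gemini-2.5-flash-lite-preview-06-17",
    config=types.CreateCachedContentConfig(
        contents=[types.Content(role="user", parts=[types.Part(text=md)])],
        ttl="600s",  # placeholder TTL
    ),
)


# Placeholder schema for step 1 of the pipeline
class Summary(BaseModel):
    summary: str = Field(..., description="A short summary of the document")


# Step 1: summarize, referencing the cached document by name. Steps 2 and 3
# would reuse cache.name with their own response_schema.
response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents="Summarize the document.",
    config=types.GenerateContentConfig(
        cached_content=cache.name,
        response_mime_type="application/json",
        response_schema=Summary,
    ),
)
print(response.usage_metadata.cached_content_token_count)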

Environment details

  • Programming language: Python
  • OS: WSL (Ubuntu)
  • Language runtime version: 3.11.9
  • Package version: 1.11.0

Steps to reproduce

Code example:

import os
from typing import List

from dotenv import load_dotenv
from google import genai
from google.genai.types import GenerateContentConfig
from pydantic import BaseModel, Field

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))


# Create fake long content to be implicitly cached
md = "\n".join(["This is a fake test document. Improvise the answer to the question. Make it original."] * 1000)


# Response model for the first two questions
class ResponseModel(BaseModel):
    response: str = Field(..., description="The response from the model")


messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "What is the document about.",
            },
        ],
    },
]


response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=ResponseModel,
    ),
)

print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=77, candidates_tokens_details=None, prompt_token_count=21006, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21006)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21083, traffic_type=None)

# Different question, same long content, same model
messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "Which people are mentioned in the document?",
            },
        ],
    },
]

response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=ResponseModel,
    ),
)

# Content is cached, as expected
print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=20343)], cached_content_token_count=20343, candidates_token_count=64, candidates_tokens_details=None, prompt_token_count=21008, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21008)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21072, traffic_type=None)


# Different response_schema model, same long content, same LLM model
class Bullet(BaseModel):
    bullet: str = Field(..., description="A single bullet point summarizing the document")


class BulletModel(BaseModel):
    bullets: List[Bullet] = Field(default_factory=list, description="The response from the model")


messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "Give 5 bullets summarizing the document.",
            },
        ],
    },
]

response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=BulletModel,
    ),
)

# No cached tokens!
print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=162, candidates_tokens_details=None, prompt_token_count=21008, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21008)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21170, traffic_type=None)

# This is now cached again (same BulletModel schema as the previous call)
messages = [
    {
        "role": "user",
        "parts": [
            {
                "text": md,
            },
            {
                "text": "Give 10 bullets summarizing the document.",
            },
        ],
    },
]

response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents=messages,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=BulletModel,
    ),
)

print(response.usage_metadata)
# GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=20320)], cached_content_token_count=20320, candidates_token_count=280, candidates_tokens_details=None, prompt_token_count=21009, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=21009)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=21289, traffic_type=None)
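
A small helper like the following makes the regression easier to spot when running the script above; the name report_cache_hit is mine, not part of the SDK, and it only reads the usage_metadata fields shown in the outputs above.

from google.genai.types import GenerateContentResponse


def report_cache_hit(step: str, response: GenerateContentResponse) -> None:
    """Print how much of the prompt was served from the (implicit) cache."""
    usage = response.usage_metadata
    cached = usage.cached_content_token_count or 0
    total = usage.prompt_token_count or 0
    ratio = cached / total if total else 0.0
    print(f"{step}: {cached}/{total} prompt tokens cached ({ratio:.0%})")


# e.g. after each generate_content() call above:
# report_cache_hit("BulletModel, first call", response)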

Labels

priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
