
[Feature]: Support Using non Image/PDF files with Gemini models #9416

Open
johann-petrak opened this issue Mar 20, 2025 · 9 comments · May be fixed by #9590
Labels
enhancement New feature or request

Comments

@johann-petrak

The Feature

The docs describe how to upload images or PDFs to Gemini models:
https://docs.litellm.ai/docs/providers/vertex#gemini-15-pro-and-vision

But Gemini can accept many more MIME types in prompts; see https://ai.google.dev/gemini-api/docs/document-processing?lang=python#technical-details

However, it does not seem to be possible to send any other file types. One issue is that some MIME types need no base64 encoding (e.g. text/markdown), but not appending ";base64" to the MIME type results in an exception because the code expects ";base64" to always be present.

Trying to load the Markdown file as bytes, base64-encoding it, and sending that resulted in a strange error: the model complained that the input token limit was exceeded (it reported more than 5M tokens) even though the Markdown file is only a few thousand words.
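For context, the workaround described above can be sketched as follows. The message shape mirrors the documented image/PDF pattern, and whether Gemini handles text/markdown sent this way is exactly what this issue is about; the file content and model name are illustrative, not taken from a real run.

```python
import base64

# Made-up Markdown content standing in for the real file.
markdown_bytes = b"# Report\n\nA few thousand words of analysis would go here..."
encoded = base64.b64encode(markdown_bytes).decode("ascii")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize this document."},
        {
            # Reusing the image_url pattern from the docs for a text file:
            "type": "image_url",
            "image_url": {"url": f"data:text/markdown;base64,{encoded}"},
        },
    ],
}
# litellm.completion(model="vertex_ai/gemini-1.5-pro", messages=[message])
```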

Motivation, pitch

Using large files with Gemini is one of the specific use cases for Google's models, and all of the other supported MIME types are extremely useful for analysis purposes.

Are you a ML Ops Team?

No

Twitter / LinkedIn details

No response

@johann-petrak johann-petrak added the enhancement New feature or request label Mar 20, 2025
@NiharP31

Hey @ishaan-jaff, I'd like to be assigned to this issue and work on it. I will update the code, add test cases, and write documentation. Let me know if this sounds good!

@ishaan-jaff
Contributor

Yes @NiharP31, please send the proposed LiteLLM interface/request on this issue before building out the feature.

@NiharP31

Based on my understanding, here is the solution:

1. File Type Classification and Handling

In litellm/types/files.py, I'll add comprehensive support for all Gemini-supported MIME types by:

  • Creating a classification system to distinguish between:
    • Binary file types (images, PDFs, videos, etc.) - requiring base64 encoding
    • Text file types (text/markdown, application/json, text/csv, etc.) - no base64 encoding needed
  • Expanding the is_gemini_1_5_accepted_file_type() function to include all MIME types supported by Gemini's API
  • Adding a helper function requires_base64_encoding() to determine proper handling
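The classification step above could look something like this minimal sketch. The helper name requires_base64_encoding() comes from the proposal; the MIME-type list is illustrative (drawn from Google's documented Gemini-supported types) and is not LiteLLM's actual code.

```python
# Text-based MIME types Gemini accepts as plain text, so no base64
# wrapping is needed; everything else is treated as binary.
# (Illustrative subset of Google's documented list.)
TEXT_MIME_TYPES = {
    "text/plain",
    "text/markdown",
    "text/csv",
    "text/html",
    "application/json",
    "application/x-javascript",
    "text/x-python",
}

def requires_base64_encoding(mime_type: str) -> bool:
    """Return True for binary types (images, PDFs, video, audio, ...)."""
    # Normalize case and drop any parameters like "; charset=utf-8".
    base_type = mime_type.lower().split(";")[0].strip()
    return base_type not in TEXT_MIME_TYPES
```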

2. Transformation Logic Updates

In litellm/llms/vertex_ai/gemini/transformation.py, I'll:

  • Rename _process_gemini_image() to _process_gemini_file() to reflect its broader purpose
  • Modify the file processing logic to conditionally apply base64 encoding based on MIME type
  • Fix token counting for text-based files by properly handling them without unnecessary encoding

3. Implementation Benefits

  • No UI changes required - users will use the existing interface
  • Fixes the token miscalculation issue by preventing unnecessary base64 encoding of text files
  • Enables support for all Gemini-supported MIME types as documented in Google's API

If this approach aligns with your expectations, I'd be happy to proceed with implementation.

@ishaan-jaff
Contributor

Hi @NiharP31, can you define clear success criteria for this issue? Ideally 3-4 test cases you're hoping to pass.

@NiharP31

Test cases to validate the implementation & Success Criteria:

Test Case 1: Markdown File Processing

  • Input: A Markdown file (say, 2-3 KB in size)
  • Expected Output:
    • File processed without ";base64" appended to MIME type
    • Token count accurately reflects actual content size (roughly 500-1,000 tokens, not millions)
    • Gemini model successfully processes and responds to the Markdown content

Test Case 2: JSON Data File

  • Input: A JSON file (structured data is assumed here)
  • Expected Output:
    • File properly processed with "application/json" MIME type without base64 encoding
    • Model can correctly reference and analyze the JSON structure in its response

Test Case 3: Mixed Content Input

  • Input: A request containing both text (Markdown) and binary (image) files
  • Expected Output:
    • Text files processed without base64 encoding
    • Binary files properly encoded with base64
    • Model successfully references both content types in its response

Test Case 4: Large Text File Handling

  • Input: A larger text file (say, ~50 KB or more)
  • Expected Output:
    • File processed without token count errors
    • Response acknowledges content without hitting artificial token limits caused by encoding issues

All of the above test cases would verify that MIME type classification and conditional base64 encoding are correctly integrated, ensuring Gemini can properly access the full range of file types it supports. Further cases could cover CSV files (structured format), loading data from Google Cloud Storage, etc.
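Test Case 1's success criterion could be sketched as a runnable check like the one below. The ~4-characters-per-token heuristic is a rough rule of thumb for English text, not how Gemini actually counts tokens, and the file content is made up.

```python
import base64

def rough_token_estimate(text: str) -> int:
    # ~4 characters per token is a common rule of thumb for English text;
    # this is an approximation, not Gemini's real tokenizer.
    return len(text) // 4

def test_markdown_not_inflated():
    md = "## Section\n" + "word " * 500          # roughly 2.5 KB of Markdown
    plain_tokens = rough_token_estimate(md)
    b64 = base64.b64encode(md.encode("utf-8")).decode("ascii")
    # A small file should land in the hundreds of tokens, not millions.
    assert plain_tokens < 5_000
    # base64 output is ~4/3 the input size, so it always inflates the count.
    assert rough_token_estimate(b64) > plain_tokens

test_markdown_not_inflated()
```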

@ishaan-jaff
Contributor

ok, you can go ahead on implementing this @NiharP31

@johann-petrak what do you think of the test cases ?

@johann-petrak
Author

Looks like a really good plan to me, thank you for proposing this!

One thought: given that the LiteLLM package provides a very consistent API for a wide range of models,
is there a way to make file handling somewhat consistent across models as well?
This is complicated by the fact that there are basically two approaches: 1) upload the file in a separate request, get an ID, then send the prompt and reference the file ID in a special kind of prompt message, and 2) send the file directly as part of the prompt message.
As far as I can see, the details of these approaches (and the kinds of files supported) differ between providers; for example, here is how this works with OpenAI and PDFs: https://platform.openai.com/docs/guides/pdf-files?api-mode=chat
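The two approaches might be sketched as OpenAI-style content parts. The field shapes below follow OpenAI's PDF-input guide but should be treated as illustrative here; the file ID and the base64 payload are placeholders.

```python
# Approach 1: upload the file in a separate request, then reference
# the returned file ID in the prompt message.
by_reference = {
    "type": "file",
    "file": {"file_id": "file-abc123"},   # placeholder ID from a prior upload
}

# Approach 2: inline the file directly in the prompt message as a data URI.
inline = {
    "type": "file",
    "file": {
        "filename": "draft.pdf",
        "file_data": "data:application/pdf;base64,JVBERi0xLjc=",  # placeholder
    },
}
```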

Sorry if all this has been clear anyway ...

For me the most important aspect right now is support for Gemini, but with models very likely supporting much larger contexts, and with models that support multimodal prompting, sending files as part of a prompt will soon become much more widely used.

@NiharP31 NiharP31 linked a pull request Mar 27, 2025 that will close this issue
@NiharP31

@johann-petrak got your point. I'm currently exploring the codebase. The current PR should provide a foundation for a more consistent file handling approach across different providers. Once I'm done with the existing PR, I'd love to work on further integration if that would be helpful for the project.
@ishaan-jaff let me know your take on this.

@krrishdholakia
Contributor

> if there is a way to make using file somewhat consistent across the models?

Hey @johann-petrak, OpenAI just added support for a new file message content type; it maps quite similarly to Vertex's FileData part type, which would make sending gs:// URLs etc. much easier. I'm using it in our Gemini audio file input implementation as well.
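One plausible shape for this, sketched below: a file content part carrying a gs:// URI plus MIME type, which downstream translates to a Vertex FileData part so the provider fetches the file itself. The field names and bucket are assumptions based on the comment above, not a confirmed LiteLLM interface.

```python
# Hypothetical request-side message using the new "file" content type.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize this recording."},
        {
            "type": "file",
            "file": {
                "file_id": "gs://my-bucket/meeting.mp3",  # hypothetical bucket
                "format": "audio/mpeg",
            },
        },
    ],
}

# Downstream, this could translate to a Vertex FileData part roughly like:
file_data_part = {
    "file_data": {
        "mime_type": "audio/mpeg",
        "file_uri": "gs://my-bucket/meeting.mp3",
    },
}
```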

