feat: chunk embedding #402
Conversation
Walkthrough

This update introduces a comprehensive embedding model framework, including new data models, adapters, provider metadata, and API endpoints for embedding models. It adds core abstractions and implementations for embedding adapters, expands the data model to support embedding configurations and their relationships, and provides both backend and frontend API support for embedding model discovery and usage. Extensive unit and integration tests are included for all new components.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant ModelRegistry
    participant Adapter
    Client->>API: GET /api/available_embedding_models
    API->>ModelRegistry: Query built-in embedding models
    ModelRegistry-->>API: List of models grouped by provider
    API-->>Client: Embedding models metadata
    Client->>API: Request embedding generation
    API->>Adapter: generate_embeddings(input_texts)
    Adapter->>Adapter: Validate and preprocess input
    Adapter->>Adapter: _generate_embeddings(input_texts)
    Adapter-->>API: EmbeddingResult (embeddings, usage)
    API-->>Client: EmbeddingResult
```
📜 Recent review details

Configuration used: CodeRabbit UI
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (10). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
This is WIP because I need to double check some things, add a couple of

@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (6)
libs/core/kiln_ai/datamodel/embedding.py (4)

21-27: Address the TODO comment about the model_name field type.

The TODO comment raises a valid architectural question about using an enum vs. a string for `model_name`. For consistency and type safety, consider using an enum similar to how `model_provider` uses `ModelProviderName`.

Would you like me to suggest an implementation using the `EmbeddingModelName` enum from the embedding model list?
33-36: Consider a more robust type-checking approach.

The string-based class name comparison is a workaround for circular imports but could be fragile. Consider using `TYPE_CHECKING` imports or a more robust approach.

```diff
 def parent_project(self) -> Union["Project", None]:
-    if self.parent is None or self.parent.__class__.__name__ != "Project":
-        return None
-    return self.parent  # type: ignore
+    if self.parent is None:
+        return None
+    # More robust check using module and class info
+    if hasattr(self.parent, '__module__') and hasattr(self.parent, '__class__'):
+        return self.parent if self.parent.__class__.__name__ == "Project" else None
+    return None
```
43-45: Implement the dimensions validation.

The TODO comment indicates missing validation for the `dimensions` property. This is important for ensuring compatibility with model capabilities.

Would you like me to implement the dimensions validation logic that checks against the model's `supports_custom_dimensions` and `supported_dimensions` fields?
60-63: Apply the same robustness improvement as suggested for parent_project.

The same string-based type-checking concern applies here. Consider applying the same robustness improvement suggested for the `parent_project` method.

libs/core/kiln_ai/datamodel/test_embedding_models.py (1)
166-173: Add a test for valid provider names.

The test correctly validates invalid provider names, but consider adding a test that explicitly validates all valid `ModelProviderName` enum values to ensure the validation works correctly for valid inputs.

```diff
     with pytest.raises(ValueError):
         EmbeddingConfig(
             name="test-embedding",
             model_provider="invalid-provider",
             model_name="text-embedding-3",
             parent=mock_project,
             properties={},
         )
+
+    # Test with all valid providers
+    from kiln_ai.datamodel.datamodel_enums import ModelProviderName
+    for provider in ModelProviderName:
+        config = EmbeddingConfig(
+            name=f"test-embedding-{provider.value}",
+            model_provider=provider,
+            model_name="text-embedding-3",
+            parent=mock_project,
+            properties={},
+        )
+        assert config.model_provider == provider
```

libs/core/kiln_ai/adapters/embedding/embedding_model_list.py (1)
66-68: Address the TODO comment about the ModelProviderName enum.

The TODO raises a valid architectural question. Since provider names are used globally across the system, reusing the existing enum is the correct approach.

Remove the TODO comment, as the current approach is correct:

```diff
 KilnEmbeddingModelProvider(
-    # TODO: wondering if should use separate enum for the ModelProviderName,
-    # but since the hooking up of providers is global, maybe reusing is
-    # best?
     name=ModelProviderName.openai,
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)

- libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/embedding_model_list.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/extractors/test_extractor_runner.py (0 hunks)
- libs/core/kiln_ai/datamodel/chunk.py (3 hunks)
- libs/core/kiln_ai/datamodel/embedding.py (1 hunks)
- libs/core/kiln_ai/datamodel/project.py (3 hunks)
- libs/core/kiln_ai/datamodel/test_embedding_models.py (1 hunks)
- libs/core/kiln_ai/datamodel/test_extraction_chunk.py (2 hunks)
💤 Files with no reviewable changes (1)
- libs/core/kiln_ai/adapters/extractors/test_extractor_runner.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: leonardmq
PR: Kiln-AI/Kiln#341
File: libs/server/kiln_server/document_api.py:44-51
Timestamp: 2025-06-18T08:22:58.510Z
Learning: leonardmq prefers to defer fixing blocking I/O in async handlers when: the operation is very fast (milliseconds), user-triggered rather than automated, has no concurrency concerns, and would require additional testing to fix properly. He acknowledges such issues as valid but makes pragmatic decisions about timing the fixes.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#388
File: libs/core/kiln_ai/datamodel/test_extraction_chunk.py:49-74
Timestamp: 2025-07-03T05:13:02.853Z
Learning: leonardmq prefers automatic cleanup of temporary files in tests using `delete=True` in `tempfile.NamedTemporaryFile()` context managers, rather than manual cleanup with `delete=False` and explicit file removal. This is because automatic cleanup is simpler, safer, and avoids leaving temporary files if tests fail before cleanup.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#0
File: :0-0
Timestamp: 2025-06-27T06:45:06.591Z
Learning: leonardmq prefers unused imports to be flagged as actionable review comments rather than nitpick comments. Unused imports should be treated as concrete issues that need to be addressed, not just suggestions.
libs/core/kiln_ai/datamodel/chunk.py (1)
Learnt from: leonardmq
PR: Kiln-AI/Kiln#313
File: libs/core/kiln_ai/adapters/extraction/gemini_extractor.py:86-88
Timestamp: 2025-05-15T17:16:16.742Z
Learning: In Kiln's extraction configuration, Pydantic automatically coerces string keys (e.g., "document") into Kind enum values when validating dictionary fields typed as dict[Kind, str], so no manual conversion is needed when accessing these values.
libs/core/kiln_ai/adapters/embedding/embedding_model_list.py (1)
Learnt from: leonardmq
PR: Kiln-AI/Kiln#313
File: libs/core/kiln_ai/adapters/extraction/gemini_extractor.py:86-88
Timestamp: 2025-05-15T17:16:16.742Z
Learning: In Kiln's extraction configuration, Pydantic automatically coerces string keys (e.g., "document") into Kind enum values when validating dictionary fields typed as dict[Kind, str], so no manual conversion is needed when accessing these values.
libs/core/kiln_ai/datamodel/embedding.py (1)
Learnt from: leonardmq
PR: Kiln-AI/Kiln#313
File: libs/core/kiln_ai/adapters/extraction/gemini_extractor.py:86-88
Timestamp: 2025-05-15T17:16:16.742Z
Learning: In Kiln's extraction configuration, Pydantic automatically coerces string keys (e.g., "document") into Kind enum values when validating dictionary fields typed as dict[Kind, str], so no manual conversion is needed when accessing these values.
🧬 Code Graph Analysis (5)

libs/core/kiln_ai/datamodel/project.py (1)
- libs/core/kiln_ai/datamodel/embedding.py (1)
  - EmbeddingConfig (13-45)

libs/core/kiln_ai/datamodel/test_extraction_chunk.py (6)
- libs/core/kiln_ai/datamodel/embedding.py (3)
  - ChunkEmbeddings (52-63)
  - Embedding (48-49)
  - EmbeddingConfig (13-45)
- libs/core/kiln_ai/datamodel/test_embedding_models.py (1)
  - mock_project (14-23)
- libs/core/kiln_ai/datamodel/basemodel.py (1)
  - save_to_file (341-371)
- libs/core/kiln_ai/datamodel/chunk.py (1)
  - chunk_embeddings (129-130)
- libs/core/kiln_ai/datamodel/project.py (2)
  - embedding_configs (46-47)
  - chunker_configs (43-44)
- libs/core/kiln_ai/adapters/chunkers/test_base_chunker.py (1)
  - config (12-17)

libs/core/kiln_ai/datamodel/chunk.py (2)
- libs/core/kiln_ai/datamodel/basemodel.py (2)
  - KilnParentModel (569-717)
  - KilnParentedModel (393-564)
- libs/core/kiln_ai/datamodel/embedding.py (1)
  - ChunkEmbeddings (52-63)

libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (3)
- libs/core/kiln_ai/datamodel/task_run.py (1)
  - Usage (16-36)
- libs/core/kiln_ai/datamodel/embedding.py (1)
  - EmbeddingConfig (13-45)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)
  - _embed (38-64)

libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (2)
- libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (4)
  - BaseEmbeddingAdapter (25-46)
  - EmbeddingResult (17-22)
  - GeneratedEmbedding (13-14)
  - _embed (45-46)
- libs/core/kiln_ai/datamodel/embedding.py (1)
  - EmbeddingConfig (13-45)
🔇 Additional comments (19)
libs/core/kiln_ai/datamodel/project.py (3)

5-5: LGTM: Import follows established pattern.

The import is correctly placed and follows the existing import organization.

17-17: LGTM: Parent-child relationship correctly established.

The addition of "embedding_configs": EmbeddingConfig to the parent_of dictionary follows the established pattern for defining parent-child relationships in the Kiln data model.

46-47: LGTM: Method implementation follows established pattern.

The embedding_configs() method correctly follows the same pattern as other child accessor methods (chunker_configs, documents, etc.), with proper typing and delegation to the superclass.
libs/core/kiln_ai/datamodel/chunk.py (3)

18-20: LGTM: Imports correctly added for embedding support.

The imports for ChunkEmbeddings and KilnParentModel are necessary for the new parent-child relationship functionality.

116-118: LGTM: Dual inheritance pattern correctly implemented.

The ChunkedDocument class now correctly inherits from both KilnParentedModel (as a child of Extraction) and KilnParentModel (as a parent of ChunkEmbeddings). This establishes the proper hierarchical relationship where chunked documents can have associated embeddings.

129-130: LGTM: Method follows established pattern.

The chunk_embeddings() method correctly follows the same pattern as other parent accessor methods, with proper typing and delegation to the superclass.
libs/core/kiln_ai/datamodel/test_extraction_chunk.py (2)

9-9: LGTM: Import correctly added for embedding test models.

The import statement properly includes the necessary embedding-related models for testing.

178-207: LGTM: Comprehensive test coverage for embedding functionality.

The test extension provides excellent coverage of the new embedding functionality:

- Creates a realistic EmbeddingConfig with proper model provider and name
- Creates ChunkEmbeddings with realistic 1536-dimensional vectors (matching common embedding models like OpenAI's text-embedding-ada-002)
- Tests the parent-child relationship between ChunkedDocument and ChunkEmbeddings
- Verifies that both embedding and chunker configs are properly associated with the project

The test follows the established pattern and ensures the embedding subsystem integrates correctly with the existing data model.
libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (3)

1-10: LGTM: Well-structured imports and logging setup.

The imports are appropriate and follow established patterns. The import of `Usage` from litellm is consistent with the existing codebase usage in task_run.py.

13-23: LGTM: Well-defined data models.

The GeneratedEmbedding and EmbeddingResult models are properly structured with clear field descriptions. The typing is consistent with the embedding data model patterns used elsewhere in the codebase.

25-46: LGTM: Excellent base adapter design.

The BaseEmbeddingAdapter follows excellent design patterns:

- Abstract base class: properly uses ABC and abstractmethod
- Configuration injection: takes EmbeddingConfig in the constructor
- Edge case handling: gracefully handles empty input by returning empty results
- Clear separation: the public `embed` method handles common logic, the abstract `_embed` method holds implementation-specific logic
- Async support: consistent use of the async/await pattern

This provides a solid foundation for concrete embedding adapter implementations.
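The separation described above is essentially the template-method pattern; the classes below are simplified stand-ins for the PR's models, not its actual code:

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class EmbeddingResult:
    """Simplified stand-in for the PR's EmbeddingResult model."""

    embeddings: List[List[float]] = field(default_factory=list)


class BaseEmbeddingAdapter(ABC):
    """The public embed() owns shared logic (here, the empty-input
    short-circuit); the abstract _embed() holds provider-specific logic."""

    async def embed(self, texts: List[str]) -> EmbeddingResult:
        if not texts:
            return EmbeddingResult()  # handled once, for every adapter
        return await self._embed(texts)

    @abstractmethod
    async def _embed(self, texts: List[str]) -> EmbeddingResult: ...


class FakeAdapter(BaseEmbeddingAdapter):
    """Toy concrete adapter for demonstration."""

    async def _embed(self, texts: List[str]) -> EmbeddingResult:
        return EmbeddingResult(embeddings=[[0.0] for _ in texts])


empty_result = asyncio.run(FakeAdapter().embed([]))
full_result = asyncio.run(FakeAdapter().embed(["a", "b"]))
```

Because the edge case lives in `embed()`, every concrete adapter gets it for free and `_embed()` can assume a non-empty input.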
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (5)

13-13: LGTM: Appropriate batch size limit.

The MAX_BATCH_SIZE constant of 2048 is reasonable for embedding APIs and helps prevent API errors from oversized requests.
16-21: LGTM: Well-defined options model.

The EmbeddingOptions model properly handles the optional dimensions parameter with appropriate typing and documentation.

24-37: LGTM: Robust constructor validation.

The constructor properly validates the required fields (model_provider and model_name) and stores the necessary configuration. The validation ensures the adapter cannot be created with an invalid configuration.

48-64: LGTM: Robust embedding implementation with excellent error handling.

The _embed method implementation is well designed:

- API integration: properly uses litellm.aembedding with correct parameters
- Response validation: validates that the response data length matches the input length
- Data integrity: sorts response data by index to maintain proper ordering
- Error handling: includes sanity checks for response integrity
- Return format: correctly constructs EmbeddingResult with embeddings and usage data

The implementation follows the litellm documentation and provides good error detection.

66-73: LGTM: Proper options handling with validation.

The build_options method correctly extracts the dimensions parameter from properties with appropriate type validation. The error handling ensures invalid dimension values are caught early.
libs/core/kiln_ai/datamodel/test_embedding_models.py (2)

29-43: Excellent use of the automatic cleanup pattern.

The fixture correctly uses `tempfile.TemporaryDirectory()` as a context manager for automatic cleanup, which aligns with the preference for safer temporary file handling.

377-407: Comprehensive workflow test demonstrates proper integration.

This integration test effectively validates the complete embedding workflow, including proper parent-child relationships and ID references.
libs/core/kiln_ai/adapters/embedding/embedding_model_list.py (1)

62-63: EmbeddingModelName reference is consistent.

No other occurrences of the old `openai_text_embedding_3_small` enum name were found. The `EmbeddingModelName.openai_text_embedding_3` entry correctly maps to `"openai_text_embedding_3_small"`. No changes required here.
```python
    async def _embed(self, text: List[str]) -> EmbeddingResult:
        # TODO: providers will throw if the text input is too long - goes over the max tokens for the model
        # we should validate that upstream to prevent this from bricking the whole pipeline if the user's dataset
        # happens to include chunks that are too long.

        # documented on litellm: https://docs.litellm.ai/docs/embedding/supported_embedding
        if len(text) > MAX_BATCH_SIZE:
            raise ValueError("Text is too long")
```
🛠️ Refactor suggestion
Important TODO requires attention.
The TODO comment highlights a critical issue: input validation for token limits should be implemented to prevent pipeline failures when chunks exceed model token limits.
Consider implementing input validation in a future iteration to check text length against the model's token limits before sending to the embedding API.
Would you like me to help design an input validation system that checks text length against model token limits?
🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py around
lines 38 to 46, the current code lacks validation for input text length against
the model's token limits, risking pipeline failures. Implement a validation step
before calling the embedding API that checks each text chunk's token count
against the model's maximum allowed tokens, raising a clear error or handling
oversized inputs gracefully to prevent crashes.
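A minimal sketch of what such upstream validation could look like. The token limit here is an assumption (8192 matches OpenAI's text-embedding-3 family), and the characters-per-token heuristic is a rough stand-in; a real implementation would count tokens with the model's own tokenizer (e.g. tiktoken for OpenAI models):

```python
from typing import List

# Hypothetical limit: the real value depends on the embedding model.
MAX_INPUT_TOKENS = 8192

# Rough heuristic of ~4 characters per token for English text; replace
# with a real tokenizer count in production code.
CHARS_PER_TOKEN = 4


def validate_input_lengths(
    texts: List[str], max_tokens: int = MAX_INPUT_TOKENS
) -> None:
    """Fail fast, before the API call, if any chunk is likely over the limit."""
    for i, text in enumerate(texts):
        estimated_tokens = len(text) // CHARS_PER_TOKEN
        if estimated_tokens > max_tokens:
            raise ValueError(
                f"Text at index {i} is ~{estimated_tokens} tokens, "
                f"over the model limit of {max_tokens}"
            )


validate_input_lengths(["a short chunk", "another short chunk"])  # passes silently
```

Raising with the offending index lets the caller skip or re-chunk that one document instead of failing the whole pipeline.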
Force-pushed from 6159ece to ad96c82 (Compare)
```python
    gemini = "gemini"


class EmbeddingModelName(str, Enum):
```
These names are awkward, but to keep things aligned with the `ml_model_list`, the model names should be globally unique.
```python
        description="The number of dimensions in the output embedding. If the model allows picking between different options, provide the list of available dimensions.",
    )

    supports_custom_dimensions: bool = Field(
```
OpenAI’s 3rd-generation embedding models support shortening the embedding dimensionality. More info:

- https://platform.openai.com/docs/guides/embeddings#:~:text=embeddings%20v3%20launch%20blog%20post
- about the technique itself: https://arxiv.org/abs/2205.13147

Our `KilnEmbeddingModelProvider` setup supports this, though we don’t need to expose it in the UI yet. The default dimensions on these OpenAI models are fairly high (1536 for small, 3072 for large), so it is likely we will want to shorten them later - indexing high-dimensional vectors is usually not ideal in production (memory, speed).
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 8
♻️ Duplicate comments (1)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)

44-52: Important TODO requires attention.

The TODO comment highlights a critical issue: input validation for token limits should be implemented to prevent pipeline failures when chunks exceed model token limits.
🧹 Nitpick comments (9)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1)

288-289: Fix typo in TODO comment.

Good practice to add this TODO comment for future refactoring, but there's a typo that should be corrected. Apply this diff to fix it:

```diff
-# for example, embedding adapter for LiteLLM, and also Exractor adapter for LiteLLM
+# for example, embedding adapter for LiteLLM, and also Extractor adapter for LiteLLM
```

app/desktop/studio_server/test_provider_api.py (1)
2308-2309: Clean up excessive empty lines.

Multiple consecutive empty lines should be reduced to maintain consistency with the rest of the file.

libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (2)
134-135: Enhance the error message with the list of supported providers.

The error message could be more helpful by listing which providers are supported by litellm.

```diff
-    raise ValueError(f"Provider {provider.name} is not supported by litellm")
+    raise ValueError(
+        f"Provider {provider.name} is not supported by litellm. "
+        f"Supported providers: openrouter, openai, groq, anthropic, gemini_api, "
+        f"fireworks_ai, amazon_bedrock, azure_openai, huggingface, vertex, together_ai"
+    )
```
91-93: Consider addressing the TODO for code reuse.

The TODO correctly identifies an opportunity to share the `litellm_model_id` method across different LiteLLM adapter implementations.

Would you like me to help create a shared base class or utility module for LiteLLM adapters to avoid code duplication?

libs/core/kiln_ai/datamodel/embedding.py (2)
libs/core/kiln_ai/datamodel/embedding.py (2)
21-27
: Consider using enum for model_name in future iteration.The TODO correctly identifies that using an enum for
model_name
would provide better type safety. The current string approach aligns with existing patterns in the codebase for backward compatibility.Would you like me to help design a migration strategy to move from string to enum-based model names while maintaining backward compatibility?
40-46
: Address circular import issue in future refactoring.The FIXME correctly identifies that the ml model lists should be moved out of the adapters package to avoid circular imports. This is a valid architectural concern that should be addressed.
Consider creating a dedicated
models
orregistry
package at the same level asadapters
anddatamodel
to house the model lists, which are foundational data structures with no dependencies.libs/core/kiln_ai/adapters/test_ml_embedding_model_list.py (2)
24-30: Remove excessive blank lines between assertions.

The extra blank lines between assertions reduce readability.

```diff
 assert (
     EmbeddingModelName.openai_text_embedding_3_large
     == "openai_text_embedding_3_large"
 )
-
 assert (
     EmbeddingModelName.gemini_text_embedding_004
     == "gemini_text_embedding_004"
 )
```
269-279: Avoid hardcoding exact error messages in tests.

Testing exact error message strings makes tests brittle. If the error message format changes, these tests will fail even though the behavior (raising ValueError) is correct.

Consider testing just the exception type:

```diff
-    with pytest.raises(
-        ValueError, match="Embedding model nonexistent_model not found"
-    ):
+    with pytest.raises(ValueError):
         get_model_by_name("nonexistent_model")
```

Or, if you want to verify the error mentions the model name:

```diff
-    with pytest.raises(
-        ValueError, match="Embedding model nonexistent_model not found"
-    ):
+    with pytest.raises(ValueError, match="nonexistent_model"):
         get_model_by_name("nonexistent_model")
```

libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (1)
110-238: Add tests for empty strings and very long text inputs.

The test suite is missing coverage for some edge cases that could occur in production. Add these test cases:

```python
async def test_embed_with_empty_string(self, mock_litellm_adapter):
    """Test embedding with an empty string in the list."""
    mock_response = AsyncMock()
    mock_response.data = [
        {"index": 0, "embedding": [0.1, 0.2, 0.3]},
        {"index": 1, "embedding": [0.0, 0.0, 0.0]},  # Empty string might return zeros
        {"index": 2, "embedding": [0.4, 0.5, 0.6]},
    ]
    mock_response.usage = Usage(prompt_tokens=10, total_tokens=10)

    with patch("litellm.aembedding", return_value=mock_response):
        result = await mock_litellm_adapter._embed(["text1", "", "text3"])
        assert len(result.embeddings) == 3


async def test_embed_with_very_long_text(self, mock_litellm_adapter):
    """Test embedding with a very long individual text."""
    # Create a text that might exceed token limits
    very_long_text = "word " * 10000  # ~10k words
    mock_response = AsyncMock()
    mock_response.data = [{"index": 0, "embedding": [0.1, 0.2, 0.3]}]
    mock_response.usage = Usage(prompt_tokens=8192, total_tokens=8192)

    with patch("litellm.aembedding", return_value=mock_response):
        # This should either work or raise a specific error from litellm
        result = await mock_litellm_adapter._embed([very_long_text])
        assert len(result.embeddings) == 1
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)

- app/desktop/studio_server/provider_api.py (4 hunks)
- app/desktop/studio_server/test_provider_api.py (2 hunks)
- app/web_ui/src/lib/api_schema.d.ts (5 hunks)
- libs/core/kiln_ai/adapters/__init__.py (2 hunks)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (1 hunks)
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/test_ml_embedding_model_list.py (1 hunks)
- libs/core/kiln_ai/datamodel/embedding.py (1 hunks)
- libs/core/kiln_ai/datamodel/test_embedding_models.py (1 hunks)
- libs/core/kiln_ai/datamodel/test_extraction_chunk.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- libs/core/kiln_ai/adapters/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- libs/core/kiln_ai/datamodel/test_extraction_chunk.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: leonardmq
PR: Kiln-AI/Kiln#341
File: libs/server/kiln_server/document_api.py:44-51
Timestamp: 2025-06-18T08:22:58.510Z
Learning: leonardmq prefers to defer fixing blocking I/O in async handlers when: the operation is very fast (milliseconds), user-triggered rather than automated, has no concurrency concerns, and would require additional testing to fix properly. He acknowledges such issues as valid but makes pragmatic decisions about timing the fixes.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#388
File: libs/core/kiln_ai/datamodel/test_extraction_chunk.py:49-74
Timestamp: 2025-07-03T05:13:02.873Z
Learning: leonardmq prefers automatic cleanup of temporary files in tests using `delete=True` in `tempfile.NamedTemporaryFile()` context managers, rather than manual cleanup with `delete=False` and explicit file removal. This is because automatic cleanup is simpler, safer, and avoids leaving temporary files if tests fail before cleanup.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#0
File: :0-0
Timestamp: 2025-06-27T06:45:06.591Z
Learning: leonardmq prefers unused imports to be flagged as actionable review comments rather than nitpick comments. Unused imports should be treated as concrete issues that need to be addressed, not just suggestions.
🧬 Code Graph Analysis (6)

app/desktop/studio_server/test_provider_api.py (2)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (4)
  - EmbeddingModelName (20-30)
  - KilnEmbeddingModel (51-59)
  - KilnEmbeddingModelFamily (9-17)
  - KilnEmbeddingModelProvider (33-48)
- libs/core/kiln_ai/datamodel/datamodel_enums.py (1)
  - ModelProviderName (80-99)

libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py (2)
- libs/core/kiln_ai/datamodel/datamodel_enums.py (1)
  - ModelProviderName (80-99)
- libs/core/kiln_ai/datamodel/embedding.py (1)
  - EmbeddingConfig (13-68)

libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (3)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (2)
  - KilnEmbeddingModelProvider (33-48)
  - built_in_embedding_models_from_provider (117-125)
- libs/core/kiln_ai/datamodel/datamodel_enums.py (1)
  - ModelProviderName (80-99)
- libs/core/kiln_ai/datamodel/embedding.py (1)
  - EmbeddingConfig (13-68)

libs/core/kiln_ai/adapters/test_ml_embedding_model_list.py (2)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (6)
  - EmbeddingModelName (20-30)
  - KilnEmbeddingModel (51-59)
  - KilnEmbeddingModelFamily (9-17)
  - KilnEmbeddingModelProvider (33-48)
  - built_in_embedding_models_from_provider (117-125)
  - get_model_by_name (110-114)
- libs/core/kiln_ai/datamodel/datamodel_enums.py (1)
  - ModelProviderName (80-99)

app/desktop/studio_server/provider_api.py (1)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (1)
  - EmbeddingModelName (20-30)

libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (2)
- libs/core/kiln_ai/datamodel/datamodel_enums.py (1)
  - ModelProviderName (80-99)
- libs/core/kiln_ai/datamodel/embedding.py (1)
  - EmbeddingConfig (13-68)
🔇 Additional comments (29)

app/desktop/studio_server/test_provider_api.py (3)

10-16: LGTM: New embedding model imports are well structured.

The imports follow the existing pattern and provide the necessary classes for testing the new embedding model endpoints.

2310-2418: LGTM: Comprehensive test coverage for the embedding models endpoint.

The test provides excellent coverage of the `/api/available_embedding_models` endpoint with:

- Proper mocking of configuration and dependencies
- Multiple providers and models in test data
- Thorough assertions validating response structure and content
- Good test data that covers different embedding model configurations

The test follows the established patterns in the file and aligns well with the PR objectives.

2420-2444: LGTM: Good validation test for embedding models metadata.

The test properly validates the `/api/providers/embedding_models` endpoint by:

- Ensuring all built-in embedding models are present in the response
- Verifying model IDs and friendly names match expectations
- Checking response structure consistency
- Including specific assertions for known models

The test complements the previous test nicely by focusing on the metadata aspect rather than provider grouping.
app/web_ui/src/lib/api_schema.d.ts (7)

648-681: LGTM! Embedding model endpoints follow established patterns.

The new embedding model discovery endpoints are well structured and consistent with the existing model API patterns. Both endpoints correctly implement read-only access for model discovery functionality.

1373-1381: LGTM! Well-structured schema for embedding model provider grouping.

The `AvailableEmbeddingModels` schema correctly mirrors the pattern of `AvailableModels` while properly referencing the embedding-specific model details schema.

1871-1883: LGTM! Comprehensive embedding model metadata schema.

The `EmbeddingModelDetails` schema captures all essential embedding model characteristics, including dimensions, token limits, and custom dimension support. The required properties ensure consistent API responses.

1884-1889: LGTM! Focused enum of supported embedding models.

The `EmbeddingModelName` enum provides appropriate type safety with a focused set of well-established embedding models from major providers.

2840-2846: LGTM! Excellent reuse of the existing ProviderModel schema.

The `ProviderEmbeddingModels` schema correctly reuses the existing `ProviderModel` structure, maintaining consistency across the API while allowing flexible embedding model discovery.

4963-5002: LGTM! Well-defined operations for embedding model discovery.

Both embedding model operations correctly implement the GET-only discovery pattern with proper response type references to the new embedding model schemas. The structure mirrors the existing model discovery operations.

648-681: Excellent comprehensive embedding model API integration.

The embedding model additions successfully extend the API schema with:

- Consistent endpoint patterns following existing model discovery conventions
- Complete type definitions for embedding-specific metadata (dimensions, token limits, custom dimension support)
- Proper schema reuse and operation definitions
- A type-safe enum of supported embedding models

The auto-generated schema correctly reflects a well-designed embedding model API that integrates seamlessly with the existing infrastructure.

Also applies to: 1373-1381, 1871-1889, 2840-2846, 4963-5002
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (5)
1-20: LGTM! Well-organized imports and a reasonable batch size limit. The imports are properly organized, and the MAX_BATCH_SIZE constant of 2048 is a reasonable limit for batch embedding operations.
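A cap like this usually implies splitting inputs into provider-sized batches before calling the API. A minimal sketch of that pattern (the constant value comes from the review; the helper name is illustrative, not the adapter's actual API):

```python
MAX_BATCH_SIZE = 2048  # batch size cap noted in the review


def split_into_batches(input_texts: list[str], batch_size: int = MAX_BATCH_SIZE) -> list[list[str]]:
    """Split inputs into provider-sized batches, preserving input order."""
    return [
        input_texts[i : i + batch_size]
        for i in range(0, len(input_texts), batch_size)
    ]


batches = split_into_batches([f"chunk-{i}" for i in range(5000)])
# 5000 inputs with a 2048 cap yields batches of 2048, 2048, and 904
```

Each batch can then be embedded independently and the results concatenated in order.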
22-27: LGTM! Clean Pydantic model for embedding options. The `EmbeddingOptions` class is well-structured with appropriate field typing and description.
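The shape of such an options model can be sketched as follows. This uses a plain dataclass rather than the project's actual Pydantic model, and the field name `dimensions` is taken from the review; the serialization helper is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbeddingOptions:
    # Optional output dimensionality; only sent to the provider when set.
    dimensions: Optional[int] = None

    def to_payload(self) -> dict:
        # Exclude unset fields so the provider only receives explicit options.
        return {} if self.dimensions is None else {"dimensions": self.dimensions}
```

Dropping unset fields matters because some providers reject unknown or null option keys.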
29-43: LGTM! Proper initialization with validation. The initialization correctly validates required fields and provides clear error messages.
53-69: LGTM! Robust embedding generation with proper validation. The implementation correctly:
- Calls the litellm API with proper parameters
- Validates response integrity
- Maintains correct ordering by sorting on index
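The index-based reordering described above can be illustrated with a stubbed response. The real adapter awaits `litellm.aembedding`; the response item shape here (`index` and `embedding` keys) is an assumption based on the review:

```python
# Stub of a provider response whose items may arrive out of order.
response_data = [
    {"index": 2, "embedding": [0.3, 0.3]},
    {"index": 0, "embedding": [0.1, 0.1]},
    {"index": 1, "embedding": [0.2, 0.2]},
]

input_texts = ["a", "b", "c"]

# Validate response integrity: exactly one embedding per input text.
if len(response_data) != len(input_texts):
    raise ValueError("embedding count does not match input count")

# Restore input order by sorting on the provider-reported index.
vectors = [
    item["embedding"]
    for item in sorted(response_data, key=lambda x: x["index"])
]
```

Sorting on the reported index rather than trusting arrival order protects against providers that return batch results out of order.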
71-89: LGTM! Well-implemented helper methods. Both methods have proper validation:
- `build_options` correctly validates dimension values
- `_resolve_model_provider` provides clear error messages for missing models

libs/core/kiln_ai/datamodel/test_embedding_models.py (3)
1-24: LGTM! Well-structured test setup with a proper fixture. The imports and the `mock_project` fixture are properly implemented, with appropriate use of temporary directories.
46-236: LGTM! Comprehensive test coverage for EmbeddingConfig. The test class provides excellent coverage, including:
- Required and optional field validation
- Name format validation
- Properties and dimensions validation
- Parent-child relationships
- Edge cases and error conditions
237-469: LGTM! Well-structured tests for the Embedding and ChunkEmbeddings models. The test classes provide thorough coverage:
- `TestEmbedding`: tests vector validation with various types and sizes
- `TestChunkEmbeddings`: tests the embeddings list, parent relationships, and edge cases
- `TestEmbeddingIntegration`: tests complete workflows and parent-child relationships

All tests are well-organized and follow testing best practices.
libs/core/kiln_ai/datamodel/embedding.py (3)
13-37: LGTM! Well-structured data model with proper field definitions. The `EmbeddingConfig` class is well-designed with:
- Clear field descriptions
- Appropriate use of optional fields
- Proper parent relationship handling
47-68: LGTM! Robust validation logic for embedding configuration. The validation properly:
- Checks model provider existence
- Validates custom dimensions support
- Ensures dimensions are positive integers
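The dimension checks described above can be sketched as a standalone function. The real implementation is a Pydantic validator on the config model; the function name and error messages here are illustrative:

```python
def validate_embedding_dimensions(dimensions, supports_custom_dimensions: bool) -> None:
    """Sketch of the dimension validation rules described in the review."""
    if dimensions is None:
        return  # fall back to the model's default dimensionality
    # Reject bools explicitly: isinstance(True, int) is True in Python.
    if not isinstance(dimensions, int) or isinstance(dimensions, bool) or dimensions <= 0:
        raise ValueError("dimensions must be a positive integer")
    if not supports_custom_dimensions:
        raise ValueError("this model does not support custom dimensions")


validate_embedding_dimensions(256, supports_custom_dimensions=True)  # passes
```

Validating at config-creation time surfaces the error to the user immediately, instead of failing later inside the embedding pipeline.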
71-87: LGTM! Clean and well-documented embedding data models. Both classes are properly structured:
- `Embedding`: a simple model for embedding vectors
- `ChunkEmbeddings`: properly links embeddings to the configuration and maintains the parent relationship

The field descriptions clearly explain the correspondence between embeddings and chunks.
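The relationship between the two models can be sketched with plain dataclasses (the project uses Pydantic models with a parented base class; field names here are assumptions based on the review):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Embedding:
    # The raw embedding vector for a single chunk.
    vector: List[float]


@dataclass
class ChunkEmbeddings:
    # ID of the EmbeddingConfig that produced these vectors (assumed field name).
    embedding_config_id: str
    # One embedding per chunk, in the same order as the chunks.
    embeddings: List[Embedding] = field(default_factory=list)


chunk_embeddings = ChunkEmbeddings(
    embedding_config_id="config-1",
    embeddings=[Embedding(vector=[0.1, 0.2]), Embedding(vector=[0.3, 0.4])],
)
```

Keeping the config ID on the embeddings makes it possible to re-embed or invalidate vectors when the configuration changes.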
libs/core/kiln_ai/adapters/ml_embedding_model_list.py (4)
1-60: LGTM! Well-structured embedding model definitions. The code provides:
- Clear enum definitions for model families and names
- Well-documented Pydantic models with appropriate field types
- Good use of Field descriptions for clarity
The comment about prefixing model names for uniqueness is a good design decision.
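A registry built along these lines might look like the sketch below. The dataclass stands in for the project's Pydantic model, and the field names, token limits, and dimension values are illustrative rather than taken from the actual file:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EmbeddingModelName(str, Enum):
    # Provider-prefixed so names stay globally unique across providers.
    openai_text_embedding_3_small = "openai_text_embedding_3_small"
    gemini_text_embedding_004 = "gemini_text_embedding_004"


@dataclass
class KilnEmbeddingModel:
    name: EmbeddingModelName
    n_dimensions: int
    max_input_tokens: Optional[int]
    supports_custom_dimensions: bool = False


built_in_embedding_models: List[KilnEmbeddingModel] = [
    KilnEmbeddingModel(
        name=EmbeddingModelName.openai_text_embedding_3_small,
        n_dimensions=1536,
        max_input_tokens=8191,
        supports_custom_dimensions=True,
    ),
    KilnEmbeddingModel(
        name=EmbeddingModelName.gemini_text_embedding_004,
        n_dimensions=768,
        max_input_tokens=2048,
    ),
]


def get_model_by_name(name: EmbeddingModelName) -> KilnEmbeddingModel:
    # Raise a clear error when the model is not in the registry.
    for model in built_in_embedding_models:
        if model.name == name:
            return model
    raise ValueError(f"embedding model not found: {name}")
```

Centralizing the metadata this way lets both the API layer and the adapters validate dimensions and token limits against a single source of truth.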
69-72: ModelProviderName enum reuse is appropriate. The TODO question is valid, but reusing the global `ModelProviderName` enum is the right choice, since provider configuration is indeed global across the system.
62-107: LGTM! Comprehensive embedding model configurations. The built-in models list properly defines:
- OpenAI models with custom dimension support
- Gemini model with fixed dimensions
- Accurate token limits and dimension specifications
110-126: LGTM! Clean utility functions for model lookups. Both functions provide appropriate lookup functionality:
- `get_model_by_name`: raises a clear error when the model is not found
- `built_in_embedding_models_from_provider`: returns None for flexibility in error handling

libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py (3)
1-60: LGTM! Well-designed test adapters and fixtures. The test setup provides:
- Two concrete adapter implementations for different test scenarios
- Proper mock embedding configuration
- Clean fixture definitions for test instances
61-112: LGTM! Comprehensive tests for the embedding data models. The tests thoroughly validate:
- Various vector sizes including empty and large vectors
- EmbeddingResult with and without usage information
- Edge cases and normal operations
113-276: LGTM! Thorough testing of BaseEmbeddingAdapter functionality. The test classes provide excellent coverage:
- Proper validation that abstract class cannot be instantiated
- Edge cases including empty inputs, special characters, and long texts
- Integration tests verifying abstract method calls
- Comprehensive input validation scenarios
The tests follow best practices with clear test names and focused assertions.
libs/core/kiln_ai/adapters/test_ml_embedding_model_list.py (1)
310-447: Well-structured and comprehensive test coverage! The test class thoroughly covers various scenarios, including successful retrieval, edge cases, and parametrized tests. Good job on the test design.
Actionable comments posted: 0
♻️ Duplicate comments (1)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)
44-52: Important TODO requires attention. The TODO comment highlights a critical issue: input validation for token limits should be implemented to prevent pipeline failures when chunks exceed model token limits.
🧹 Nitpick comments (1)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)
91-141: LGTM: Comprehensive provider mapping with a refactoring opportunity. The method properly handles all provider types with exhaustive matching and clear error handling. The TODO comment about refactoring to share this with other LiteLLM adapters is a valid consideration for future code reuse.
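The provider-to-prefix mapping can be sketched as below. LiteLLM routes on a `provider/model` string; the mapping table and function name here are illustrative, while the real adapter matches exhaustively over the `ModelProviderName` enum:

```python
def litellm_model_id(provider_name: str, provider_model_id: str) -> str:
    """Build the 'provider/model' identifier string that LiteLLM expects."""
    # Kiln provider names on the left, LiteLLM route prefixes on the right
    # (entries are a sketch, not the adapter's full mapping).
    litellm_prefixes = {
        "openai": "openai",
        "gemini_api": "gemini",
    }
    prefix = litellm_prefixes.get(provider_name)
    if prefix is None:
        raise ValueError(f"unsupported provider for embeddings: {provider_name}")
    return f"{prefix}/{provider_model_id}"
```

Extracting a function like this into a shared module is one way to realize the code-reuse suggestion across the chat and embedding adapters.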
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- app/desktop/studio_server/provider_api.py (4 hunks)
- app/desktop/studio_server/test_provider_api.py (2 hunks)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (1 hunks)
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/test_ml_embedding_model_list.py (1 hunks)
- libs/core/kiln_ai/datamodel/test_embedding_models.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- libs/core/kiln_ai/adapters/test_ml_embedding_model_list.py
🚧 Files skipped from review as they are similar to previous changes (5)
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
- app/desktop/studio_server/test_provider_api.py
- app/desktop/studio_server/provider_api.py
- libs/core/kiln_ai/datamodel/test_embedding_models.py
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: leonardmq
PR: Kiln-AI/Kiln#341
File: libs/server/kiln_server/document_api.py:44-51
Timestamp: 2025-06-18T08:22:58.510Z
Learning: leonardmq prefers to defer fixing blocking I/O in async handlers when: the operation is very fast (milliseconds), user-triggered rather than automated, has no concurrency concerns, and would require additional testing to fix properly. He acknowledges such issues as valid but makes pragmatic decisions about timing the fixes.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#388
File: libs/core/kiln_ai/datamodel/test_extraction_chunk.py:49-74
Timestamp: 2025-07-03T05:13:02.873Z
Learning: leonardmq prefers automatic cleanup of temporary files in tests using `delete=True` in `tempfile.NamedTemporaryFile()` context managers, rather than manual cleanup with `delete=False` and explicit file removal. This is because automatic cleanup is simpler, safer, and avoids leaving temporary files if tests fail before cleanup.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#0
File: :0-0
Timestamp: 2025-06-27T06:45:06.591Z
Learning: leonardmq prefers unused imports to be flagged as actionable review comments rather than nitpick comments. Unused imports should be treated as concrete issues that need to be addressed, not just suggestions.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Build Desktop Apps (macos-13)
- GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
- GitHub Check: Build Desktop Apps (ubuntu-22.04)
- GitHub Check: Build Desktop Apps (windows-latest)
- GitHub Check: Build Desktop Apps (macos-latest)
- GitHub Check: Build, Typecheck, and Test Python (3.12)
- GitHub Check: Build, Typecheck, and Test Python (3.10)
- GitHub Check: Build, Typecheck, and Test Python (3.13)
- GitHub Check: Build, Typecheck, and Test Python (3.11)
🔇 Additional comments (11)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (4)
22-27: LGTM: Clean Pydantic model definition. The `EmbeddingOptions` class is well-structured with proper typing and field documentation.
29-43: LGTM: Proper initialization with validation. The constructor correctly validates required fields and follows the inheritance pattern properly.
71-79: LGTM: Proper options building with validation. The method correctly validates dimensions as a positive integer and handles the optional nature of the property.
81-89: LGTM: Clean model provider resolution. The method properly resolves the model provider using the built-in models list and provides clear error handling.
libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (7)
1-55: LGTM: Well-structured test setup and comprehensive EmbeddingOptions tests. The test fixtures are properly defined, and the EmbeddingOptions tests thoroughly cover the model's behavior, including default values and serialization.
57-85: LGTM: Initialization tests properly use mocking. The tests correctly use MagicMock instead of bypassing Pydantic validation, addressing the previous review feedback. The initialization test coverage is comprehensive.
86-244: LGTM: Comprehensive embedding functionality tests. The tests thoroughly cover the embedding functionality, including success cases, error conditions, and proper parameter passing to litellm. The mocking strategy is appropriate and the assertions are correct.
246-332: LGTM: Excellent edge case coverage. The edge case tests are well-designed and cover important scenarios, including None usage, empty vectors, and proper property filtering. The test verifying that only dimensions are passed to litellm is particularly valuable.
334-370: LGTM: Valuable paid integration tests. The parameterized integration tests provide important real-world validation with different providers and models. The test structure and assertions are appropriate for integration testing.
372-409: LGTM: Thorough custom dimensions testing. The tests comprehensively cover custom dimensions support for OpenAI models, with proper parameterization and real API validation. The assertions correctly verify the returned embedding dimensions.
411-431: LGTM: Test name now matches its behavior. The test has been appropriately renamed to `test_embedding_config_validates_custom_dimensions_support`, which accurately reflects that it tests config validation rather than embedding API behavior, addressing the previous review feedback.
d62aecb to 6038287 (Compare)
# returns map, of provider name to list of model names
@app.get("/api/available_embedding_models")
async def get_available_embedding_models() -> List[AvailableEmbeddingModels]:
The logic in this endpoint is copy/pasted from the `@app.get("/api/available_models")` endpoint. If we want to refactor (we probably should), I'd rather leave that for later, after merging, so we don't risk the `rag` branch touching too much base code that might change before the `rag` -> `main` merge.
Copy/pasting shouldn't cause merge conflicts (unlike editing existing code). General note: following the existing style is good, but some of this code was my first-ever Python code, so always feel free to raise the quality bar when touching these areas.
@@ -0,0 +1,128 @@
from enum import Enum
You should merge main, then integrate this with `remote_config.py`. It should only be a few lines, but it will let us update the model list remotely (OTA).
Integrated with `remote_config.py`; worth double-checking, since I get the basic flow but am not fully familiar with all the interactions there.
I also found some oddities in the tests (unintentional-looking env var state in the failure test case, a floating assignment in one test) and in the code (legacy/dev fallback logic). I left comments on the points I was unsure about in this commit:
I'm not seeing the comments - was this all resolved? Code looks fine
class EmbeddingConfig(KilnParentedModel):
    name: str = NAME_FIELD
    description: str | None = Field(
        default=None, description="The description of the embedding config"
Field descriptions should say more than just restating the field name, e.g. "A description for your reference, not shared with embedding models."
Addressed in ac271d0. I tend to forget that we expose these.
Actionable comments posted: 3
♻️ Duplicate comments (5)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (3)
39-41: Remove redundant instance attributes. These attributes duplicate data already available through `self.embedding_config`. Access them directly from the config instead of creating duplicate copies.

- self.model_provider_name = model_provider_name
- self.model_name = model_name
- self.properties = embedding_config.properties

Then access these values as `self.embedding_config.model_provider_name`, `self.embedding_config.model_name`, and `self.embedding_config.properties` throughout the class.
52-54: Address the TODO for input validation. This TODO highlights a critical issue that could cause pipeline failures. Consider implementing basic validation, or at least proper error handling to surface provider errors to users.
Since each provider has different token counting methods, as you mentioned, at minimum wrap the embedding call in proper error handling to catch and re-raise provider errors with clear context about which chunks failed and why.
88-89: Refactor shared code as noted in the TODO. This code duplication should be addressed in this PR rather than deferred. Consider extracting a shared utility function for provider name mapping that can be used by all LiteLLM adapters.
app/web_ui/src/lib/api_schema.d.ts (2)
648-664: Fix the endpoint path for embedding config creation. Same issue as the chunker config endpoint: the creation endpoint incorrectly includes `{embedding_config_id}` in the path. Consider changing to:

-"/api/projects/{project_id}/embedding_configs/{embedding_config_id}/create_embedding_config": {
+"/api/projects/{project_id}/embedding_configs": {
5223-5257: Operation definition confirms the REST API path issue. The operation definition shows `chunker_config_id` in the path parameters for a creation operation, confirming the REST API design violation flagged earlier. This needs to be corrected along with the endpoint path.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- app/web_ui/src/lib/api_schema.d.ts (12 hunks)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (1 hunks)
- libs/core/kiln_ai/datamodel/__init__.py (2 hunks)
- libs/core/kiln_ai/datamodel/embedding.py (1 hunks)
- libs/core/kiln_ai/datamodel/project.py (3 hunks)
- libs/core/kiln_ai/datamodel/rag.py (1 hunks)
- libs/core/kiln_ai/datamodel/test_rag.py (1 hunks)
- libs/server/kiln_server/document_api.py (4 hunks)
- libs/server/kiln_server/test_document_api.py (20 hunks)
✅ Files skipped from review due to trivial changes (2)
- libs/core/kiln_ai/datamodel/__init__.py
- libs/core/kiln_ai/datamodel/rag.py
🚧 Files skipped from review as they are similar to previous changes (2)
- libs/core/kiln_ai/datamodel/embedding.py
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: leonardmq
PR: Kiln-AI/Kiln#341
File: libs/server/kiln_server/document_api.py:44-51
Timestamp: 2025-06-18T08:22:58.510Z
Learning: leonardmq prefers to defer fixing blocking I/O in async handlers when: the operation is very fast (milliseconds), user-triggered rather than automated, has no concurrency concerns, and would require additional testing to fix properly. He acknowledges such issues as valid but makes pragmatic decisions about timing the fixes.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#388
File: libs/core/kiln_ai/datamodel/test_extraction_chunk.py:49-74
Timestamp: 2025-07-03T05:13:02.873Z
Learning: leonardmq prefers automatic cleanup of temporary files in tests using `delete=True` in `tempfile.NamedTemporaryFile()` context managers, rather than manual cleanup with `delete=False` and explicit file removal. This is because automatic cleanup is simpler, safer, and avoids leaving temporary files if tests fail before cleanup.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#0
File: :0-0
Timestamp: 2025-06-27T06:45:06.591Z
Learning: leonardmq prefers unused imports to be flagged as actionable review comments rather than nitpick comments. Unused imports should be treated as concrete issues that need to be addressed, not just suggestions.
libs/server/kiln_server/document_api.py (1)
Learnt from: leonardmq
PR: Kiln-AI/Kiln#405
File: libs/core/kiln_ai/utils/mime_type.py:22-38
Timestamp: 2025-07-05T13:37:51.664Z
Learning: In the Kiln AI application, the MIME type support is intentionally limited to a subset of what the underlying Gemini client supports. The `guess_extension` function in `libs/core/kiln_ai/utils/mime_type.py` should only include MIME types that are actually used at the application level, not the full comprehensive list supported by the underlying AI service.
🧬 Code Graph Analysis (1)
libs/core/kiln_ai/datamodel/test_rag.py (4)
- libs/core/kiln_ai/datamodel/project.py (2): Project (11-52), rag_pipelines (51-52)
- libs/core/kiln_ai/datamodel/rag.py (1): RAGPipeline (6-24)
- libs/server/kiln_server/test_document_api.py (1): mock_project (42-49)
- libs/core/kiln_ai/datamodel/basemodel.py (2): save_to_file (341-371), model_type (250-251)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Desktop Apps (ubuntu-22.04)
- GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
- GitHub Check: Build Desktop Apps (macos-13)
- GitHub Check: Build Desktop Apps (windows-latest)
- GitHub Check: Build Desktop Apps (macos-latest)
🔇 Additional comments (9)
libs/core/kiln_ai/datamodel/project.py (1)
5-7
: LGTM! Clean extension of the Project model. The implementation correctly follows the established pattern for adding new child entity types to the Project model.
Also applies to: 18-19, 48-52
libs/core/kiln_ai/datamodel/test_rag.py (1)
1-340
: Excellent test coverage! A comprehensive test suite that covers all aspects of the RAGPipeline data model, including validation, serialization, and parent-child relationships.
libs/server/kiln_server/document_api.py (1)
797-847
: Well-implemented RAG pipeline endpoints with proper validation. The implementation correctly validates all referenced configuration IDs before creating the pipeline, providing clear error messages when configs are not found.
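The validate-before-create pattern described above can be sketched without the FastAPI plumbing. The project structure and function names here are illustrative, not the server's actual API:

```python
def find_config(configs: list, config_id: str):
    """Return the config dict with the given id, or None (shape is illustrative)."""
    return next((c for c in configs if c["id"] == config_id), None)


def validate_rag_pipeline_refs(
    project: dict, extractor_id: str, chunker_id: str, embedding_id: str
) -> None:
    """Fail with a clear message if any referenced config is missing."""
    for kind, config_id in [
        ("extractor_configs", extractor_id),
        ("chunker_configs", chunker_id),
        ("embedding_configs", embedding_id),
    ]:
        if find_config(project.get(kind, []), config_id) is None:
            # e.g. "extractor_config not found: ex2"
            raise ValueError(f"{kind[:-1]} not found: {config_id}")


project = {
    "extractor_configs": [{"id": "ex1"}],
    "chunker_configs": [{"id": "ch1"}],
    "embedding_configs": [{"id": "em1"}],
}
validate_rag_pipeline_refs(project, "ex1", "ch1", "em1")  # all refs exist, passes
```

In the real endpoint, the raised error would be translated into an HTTP 4xx response naming the missing config.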
libs/server/kiln_server/test_document_api.py (2)
42-100
: Good refactoring of test fixtures. The granular fixtures improve test modularity and reusability across different test scenarios.
647-925
: Comprehensive test coverage for new endpoints. An excellent test suite covering success cases, validation errors, and edge cases for all new chunker, embedding, and RAG pipeline endpoints.
app/web_ui/src/lib/api_schema.d.ts (4)
2139-2143: Verify completeness of the EmbeddingModelName enum. The enum currently defines only 3 embedding models (`openai_text_embedding_3_small`, `openai_text_embedding_3_large`, `gemini_text_embedding_004`), but the PR objectives mention a comprehensive `built_in_embedding_models` list similar to the existing `ml_model_list`. Ensure this enum includes all supported embedding models, or confirm whether additional models will be added in future updates.
1628-1631: Verify whether the ChunkerType enum is intentionally limited. The `ChunkerType` enum currently only includes `"fixed_window"`, which seems quite restrictive for a chunking system. Most RAG implementations support multiple chunking strategies, such as semantic, sentence-based, or overlap-based chunking. Confirm whether this is the intended scope or if additional chunker types will be added.
1587-1695
: LGTM: Consistent schema design patterns. The config schemas (ChunkerConfig, CreateChunkerConfigRequest, EmbeddingConfig, CreateEmbeddingConfigRequest) follow consistent patterns:
- Proper inheritance from base models
- Flexible properties fields for extensibility
- Clear separation between creation requests and full entities
- Consistent field naming and typing
The schema design approach is well-structured and maintainable.
2125-2137: Well-designed embedding model metadata schema. The `EmbeddingModelDetails` schema comprehensively captures essential embedding model characteristics:
- Model identification (id, name)
- Technical specifications (dimensions, token limits)
- Capability flags (custom dimensions support)
This provides sufficient metadata for proper model selection and usage validation.
Actionable comments posted: 1
♻️ Duplicate comments (3)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (3)
1-1
: Use consistent type annotations throughout the file. The file uses
List[str]
but should uselist[str]
for consistency with modern Python typing conventions.-from typing import List +# Remove this import if not needed elsewhere - async def _generate_embeddings(self, input_texts: List[str]) -> EmbeddingResult: + async def _generate_embeddings(self, input_texts: list[str]) -> EmbeddingResult:Also applies to: 51-51
134-138
: Remove unnecessary instance variable assignment.The
self._litellm_model_id
assignment is unnecessary since this value is computed deterministically from immutable configuration.- self._litellm_model_id = ( - str(litellm_provider_name) + "/" + str(provider.model_id) - ) - - return self._litellm_model_id + return str(litellm_provider_name) + "/" + str(provider.model_id)
52-54
: Address or remove the TODO comment about input validation.Based on past review discussions, this TODO was intended to be removed. Either implement the validation or remove the comment to avoid technical debt accumulation.
Since token counting varies by provider and there's no unified way in LiteLLM, consider removing this TODO and ensuring that provider errors are properly surfaced to users:
- # TODO: providers will throw if the text input is too long - goes over the max tokens for the model - # we should validate that upstream to prevent this from bricking the whole pipeline if the user's dataset - # happens to include chunks that are too long. + # Note: Provider APIs will reject texts exceeding token limits. + # Errors are surfaced to maintain pipeline integrity.
🧹 Nitpick comments (3)
app/web_ui/src/lib/api_schema.d.ts (1)
682-698
: Inconsistent endpoint naming pattern for RAG pipelines. The RAG pipeline endpoints use a different naming pattern compared to other similar endpoints:
- Chunker:
/create_chunker_config
+/chunker_configs
- Embedding:
/create_embedding_config
+/embedding_configs
- Extractor:
/create_extractor_config
+/extractor_configs
- RAG:
/rag_pipelines/create_rag_pipeline
+/rag_pipelines
Consider aligning with the established pattern:
-"/api/projects/{project_id}/rag_pipelines/create_rag_pipeline": { +"/api/projects/{project_id}/create_rag_pipeline": {This would make the API more consistent and predictable for developers.
Also applies to: 699-715
libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (1)
3-3
: Use consistent type annotations throughout the file. The file mixes
List[str]
(line 35) andlist[float]
(line 14). Use consistent type annotation style - prefer the modernlist[T]
syntax overList[T]
.-from typing import List +from typing import TYPE_CHECKING - async def generate_embeddings(self, input_texts: List[str]) -> EmbeddingResult: + async def generate_embeddings(self, input_texts: list[str]) -> EmbeddingResult: - async def _generate_embeddings(self, input_texts: List[str]) -> EmbeddingResult: + async def _generate_embeddings(self, input_texts: list[str]) -> EmbeddingResult:Also applies to: 35-35
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)
73-74
: Improve type safety and variable naming in response processing.The current approach uses dictionary
.get()
calls which can be fragile. Consider using typed response handling for better error detection and more descriptive variable names.- for item in sorted(response.data, key=lambda x: x.get("index")): - embeddings.append(GeneratedEmbedding(vector=item.get("embedding"))) + for embedding_data in sorted(response.data, key=lambda x: x.get("index", 0)): + if "embedding" not in embedding_data: + raise ValueError(f"Missing embedding in response data: {embedding_data}") + embeddings.append(GeneratedEmbedding(vector=embedding_data["embedding"]))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
app/web_ui/src/lib/api_schema.d.ts
(12 hunks)libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py
(1 hunks)libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py
(1 hunks)libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py
(1 hunks)libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: leonardmq
PR: Kiln-AI/Kiln#341
File: libs/server/kiln_server/document_api.py:44-51
Timestamp: 2025-06-18T08:22:58.510Z
Learning: leonardmq prefers to defer fixing blocking I/O in async handlers when: the operation is very fast (milliseconds), user-triggered rather than automated, has no concurrency concerns, and would require additional testing to fix properly. He acknowledges such issues as valid but makes pragmatic decisions about timing the fixes.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#388
File: libs/core/kiln_ai/datamodel/test_extraction_chunk.py:49-74
Timestamp: 2025-07-03T05:13:02.873Z
Learning: leonardmq prefers automatic cleanup of temporary files in tests using `delete=True` in `tempfile.NamedTemporaryFile()` context managers, rather than manual cleanup with `delete=False` and explicit file removal. This is because automatic cleanup is simpler, safer, and avoids leaving temporary files if tests fail before cleanup.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#0
File: :0-0
Timestamp: 2025-06-27T06:45:06.591Z
Learning: leonardmq prefers unused imports to be flagged as actionable review comments rather than nitpick comments. Unused imports should be treated as concrete issues that need to be addressed, not just suggestions.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
- GitHub Check: Build Desktop Apps (ubuntu-22.04)
- GitHub Check: Build Desktop Apps (macos-13)
- GitHub Check: Build Desktop Apps (windows-latest)
- GitHub Check: Build Desktop Apps (macos-latest)
🔇 Additional comments (3)
libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (1)
25-46
: Well-designed abstract base class with proper separation of concerns. The base class design is solid with:
- Clear separation between public
generate_embeddings
and abstract_generate_embeddings
- Proper empty input handling in the public method
- Good use of abstract methods to enforce implementation requirements
libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (1)
1-411
: Comprehensive test suite with excellent coverage.The test suite demonstrates thorough coverage including:
- Unit tests with proper mocking
- Edge case handling (empty inputs, batch limits, response validation)
- Integration tests with real API calls
- Proper error handling and validation
- Good use of fixtures and parameterized tests
The addressing of past review comments (removing
__dict__
manipulation, fixing test names) shows attention to code quality.libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py (1)
1-276
: Well-designed test suite for abstract base class.The test suite effectively covers:
- Abstract class behavior and instantiation restrictions
- Comprehensive edge case testing (empty inputs, whitespace, special characters)
- Proper testing of both concrete implementations with and without usage data
- Integration tests that verify method call behavior and flow
- Good use of fixtures and test organization
The test structure follows testing best practices with clear separation of concerns and descriptive test names.
Actionable comments posted: 5
♻️ Duplicate comments (5)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (5)
29-41: Remove duplicate instance variable assignments.
The instance variables `model_provider_name`, `model_name`, and `properties` are unnecessary duplicates since they're already available through `self.embedding_config`. This creates maintenance burden and potential inconsistencies.
Apply this diff to remove the duplicates:
```diff
-        self.model_provider_name = model_provider_name
-        self.model_name = model_name
-        self.properties = embedding_config.properties
+        # Access these through self.embedding_config instead
```
Then update the references throughout the class:
- `self.model_provider_name` → `self.embedding_config.model_provider_name`
- `self.model_name` → `self.embedding_config.model_name`
- `self.properties` → `self.embedding_config.properties`
52-54: Remove TODO comment and implement basic validation.
Based on past review feedback, the TODO should be removed and replaced with basic validation. While token counting is provider-specific, you can still validate text length and provide meaningful error messages.
Apply this diff to replace the TODO with basic validation:
```diff
-        # TODO: providers will throw if the text input is too long - goes over the max tokens for the model
-        # we should validate that upstream to prevent this from bricking the whole pipeline if the user's dataset
-        # happens to include chunks that are too long.
+        # Validate input text lengths to prevent pipeline failures
+        max_chars = 100000  # Conservative estimate for most models
+        for i, text in enumerate(input_texts):
+            if len(text) > max_chars:
+                raise ValueError(f"Text at index {i} is too long ({len(text)} characters). Maximum supported: {max_chars}")
```
73-74: Improve type safety and variable naming.
The current implementation using `item.get()` and generic naming is fragile and could break with litellm updates. Use more descriptive variable names and stronger typing.
Apply this diff to improve type safety:
```diff
-        for item in sorted(response.data, key=lambda x: x.get("index")):
-            embeddings.append(Embedding(vector=item.get("embedding")))
+        for embedding_data in sorted(response.data, key=lambda x: x.get("index", 0)):
+            if not embedding_data.get("embedding"):
+                raise ValueError("Missing embedding data in response")
+            embeddings.append(Embedding(vector=embedding_data["embedding"]))
```
88-89: Refactor duplicate code as part of this PR.
The provider mapping logic is duplicated across multiple LiteLLM adapters. Based on past review feedback, this should be refactored within this PR rather than deferred.
Extract the provider mapping logic to a shared utility function that can be used by both embedding and extractor adapters.
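A shared utility along the lines requested might look like this sketch. The mapping table, the `custom_providers` parameter, and the function shape here are illustrative assumptions for the pattern, not the actual `kiln_ai.utils.litellm` implementation:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Illustrative mapping from Kiln provider names to litellm provider prefixes;
# a real table would cover every supported provider.
_PROVIDER_MAP: Dict[str, str] = {
    "openai": "openai",
    "gemini_api": "gemini",
    "anthropic": "anthropic",
}


@dataclass
class LitellmProviderInfo:
    provider_name: str      # litellm's name for the provider
    is_custom: bool         # True for user-configured custom providers
    litellm_model_id: str   # fully qualified id passed to litellm calls


def get_litellm_provider_info(
    provider_name: str,
    model_id: str,
    custom_providers: Optional[Set[str]] = None,
) -> LitellmProviderInfo:
    """Resolve a (provider, model) pair to the id litellm expects, once, in one place."""
    custom = custom_providers or set()
    if provider_name in custom:
        # Assumption: custom providers route through an OpenAI-compatible endpoint
        return LitellmProviderInfo(provider_name, True, f"openai/{model_id}")
    litellm_name = _PROVIDER_MAP.get(provider_name)
    if litellm_name is None:
        raise ValueError(f"Unsupported provider: {provider_name}")
    return LitellmProviderInfo(litellm_name, False, f"{litellm_name}/{model_id}")
```

Centralizing the lookup means both the embedding and extractor adapters resolve model ids identically, and an unsupported provider fails in one well-tested place.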
134-138: Remove unnecessary instance variable assignment.
The `self._litellm_model_id` assignment is unnecessary since this value is computed deterministically from immutable configuration.
Apply this diff to remove the unnecessary assignment:
```diff
-        self._litellm_model_id = (
-            str(litellm_provider_name) + "/" + str(provider.model_id)
-        )
-
-        return self._litellm_model_id
+        return str(litellm_provider_name) + "/" + str(provider.model_id)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: leonardmq
PR: Kiln-AI/Kiln#341
File: libs/server/kiln_server/document_api.py:44-51
Timestamp: 2025-06-18T08:22:58.510Z
Learning: leonardmq prefers to defer fixing blocking I/O in async handlers when: the operation is very fast (milliseconds), user-triggered rather than automated, has no concurrency concerns, and would require additional testing to fix properly. He acknowledges such issues as valid but makes pragmatic decisions about timing the fixes.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#388
File: libs/core/kiln_ai/datamodel/test_extraction_chunk.py:49-74
Timestamp: 2025-07-03T05:13:02.873Z
Learning: leonardmq prefers automatic cleanup of temporary files in tests using `delete=True` in `tempfile.NamedTemporaryFile()` context managers, rather than manual cleanup with `delete=False` and explicit file removal. This is because automatic cleanup is simpler, safer, and avoids leaving temporary files if tests fail before cleanup.
Learnt from: leonardmq
PR: Kiln-AI/Kiln#0
File: :0-0
Timestamp: 2025-06-27T06:45:06.591Z
Learning: leonardmq prefers unused imports to be flagged as actionable review comments rather than nitpick comments. Unused imports should be treated as concrete issues that need to be addressed, not just suggestions.
🧬 Code Graph Analysis (1)
libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (4)
- libs/core/kiln_ai/datamodel/task_run.py (1): Usage (16-36)
- libs/core/kiln_ai/datamodel/embedding.py (1): EmbeddingConfig (13-65)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1): _generate_embeddings (51-76)
- libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py (4): _generate_embeddings (15-21), _generate_embeddings (27-35), _generate_embeddings (240-243), _generate_embeddings (265-267)
🔇 Additional comments (10)
libs/core/kiln_ai/adapters/embedding/base_embedding_adapter.py (4)
1-10: LGTM! Clean imports and proper logging setup.
The imports are well-organized and the logger is properly configured for the module.
13-14: LGTM! Simple and appropriate data model.
The `Embedding` model correctly represents an embedding vector as a list of floats with descriptive field documentation.
17-20: LGTM! Well-designed result container.
The `EmbeddingResult` model appropriately combines embeddings with optional usage metadata from litellm.
23-44: LGTM! Proper abstract base class implementation.
The `BaseEmbeddingAdapter` follows good design patterns:
- Proper initialization with configuration
- Handles empty input gracefully by short-circuiting
- Delegates actual embedding generation to abstract method
- Clear separation of concerns
The naming issues mentioned in past comments appear to have been resolved.
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)
43-49: LGTM! Proper model validation.
The adapter correctly validates that the specified model exists in the built-in models list and provides clear error messages.
libs/core/kiln_ai/adapters/embedding/test_base_embedding_adapter.py (5)
61-81: LGTM! Comprehensive Embedding model tests.
The tests properly cover creation, empty vectors, and large vectors with appropriate assertions.
83-111: LGTM! Thorough EmbeddingResult model tests.
The tests cover creation with and without usage, and empty embeddings with proper assertions.
113-189: LGTM! Excellent test coverage for BaseEmbeddingAdapter.
The tests comprehensively cover:
- Initialization and configuration access
- Abstract class instantiation prevention
- Various input scenarios (empty, single, multiple)
- Edge cases (None input, whitespace, duplicates)
- Usage information propagation
The test logic is sound and assertions are appropriate.
191-225: LGTM! Good edge case coverage.
The tests appropriately handle edge cases like very long text, special characters, empty strings, and configuration property access.
227-275: LGTM! Excellent integration tests.
The integration tests properly verify that the public method correctly delegates to the abstract method and handles empty input appropriately without calling the abstract method.
Actionable comments posted: 1
♻️ Duplicate comments (3)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (3)
34-36: Remove TODO comment per past review feedback.
Based on past review feedback from scosman, this TODO should be removed rather than left indefinitely. The error handling should surface provider errors to users so they understand specific failure details.
62-64: Remove redundant validation per past review feedback.
This validation duplicates the validation already performed in `EmbeddingConfig.validate_properties()`, as noted in past reviews.
, as noted in past reviews.- if dimensions is not None: - if not isinstance(dimensions, int) or dimensions <= 0: - raise ValueError("Dimensions must be a positive integer") -
55-56: Improve type safety of response data access.
The current fragile data access using `.get()` could be improved with stronger typing, as suggested in past reviews. Consider using proper types from litellm and validating the response structure.
could be improved with stronger typing, as suggested in past reviews. Consider using proper types from litellm and validating the response structure.- for item in sorted(response.data, key=lambda x: x.get("index")): - embeddings.append(Embedding(vector=item.get("embedding"))) + for item in sorted(response.data, key=lambda x: x["index"]): + embeddings.append(Embedding(vector=item["embedding"]))
🧹 Nitpick comments (2)
libs/core/kiln_ai/datamodel/test_embedding_models.py (2)
415-418: Remove redundant comment.
The comment on line 416 is unnecessary as it's obvious from the context that this is creating mock embeddings.
```diff
 embeddings = []
 for chunk in mock_chunked_document.chunks:
-    # Create a mock embedding (in real usage, this would be generated by the embedding model)
     embedding = Embedding(vector=[0.1, 0.2, 0.3, 0.4, 0.5])
     embeddings.append(embedding)
```
442-445: Remove redundant comment.
The comment "Simulate embedding generation" is unnecessary as the code is self-explanatory.
```diff
 embeddings = []
 for chunk in mock_chunked_document.chunks:
-    # Simulate embedding generation
     embedding = Embedding(vector=[0.1] * 1536)
     embeddings.append(embedding)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (2 hunks)
- libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py (2 hunks)
- libs/core/kiln_ai/datamodel/embedding.py (1 hunks)
- libs/core/kiln_ai/datamodel/test_embedding_models.py (1 hunks)
- libs/core/kiln_ai/utils/litellm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
- libs/core/kiln_ai/datamodel/embedding.py
🧰 Additional context used
🧠 Learnings (1)
libs/core/kiln_ai/datamodel/test_embedding_models.py (1)
Learnt from: leonardmq
PR: Kiln-AI/Kiln#388
File: libs/core/kiln_ai/datamodel/test_extraction_chunk.py:49-74
Timestamp: 2025-07-03T05:13:02.873Z
Learning: leonardmq prefers automatic cleanup of temporary files in tests using `delete=True` in `tempfile.NamedTemporaryFile()` context managers, rather than manual cleanup with `delete=False` and explicit file removal. This is because automatic cleanup is simpler, safer, and avoids leaving temporary files if tests fail before cleanup.
🧬 Code Graph Analysis (1)
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py (1)
- libs/core/kiln_ai/adapters/model_adapters/litellm_config.py (1): LiteLlmConfig (7-14)
🔇 Additional comments (6)
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py (2)
10-10: LGTM! Clean import simplification.
The single-line import aligns with the refactored codebase structure.
345-345: LGTM! Patch target correctly updated for refactored utility.
The patch target properly reflects the move of `raise_exhaustive_enum_error` to the new `kiln_ai.utils.litellm` module, maintaining test functionality.
libs/core/kiln_ai/utils/litellm.py (2)
9-17: Well-structured dataclass for provider information.
The `LitellmProviderInfo` dataclass clearly encapsulates the three key pieces of provider information needed for litellm integration.
19-86: Excellent refactoring that eliminates code duplication.
This utility function successfully centralizes the provider mapping logic that was previously duplicated across adapters, addressing the refactoring request from past reviews. The comprehensive provider mapping and custom provider handling is well-implemented.
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)
83-89: Excellent use of the new utility function.
The integration with `get_litellm_provider_info` successfully addresses the code duplication issue raised in past reviews and provides clean error handling for unsupported custom providers.
libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (1)
1-442: Comprehensive test coverage for the embedding adapter.
The test suite provides excellent coverage including:
- Unit tests for all core functionality
- Edge case handling
- Mock-based testing for API interactions
- Paid integration tests with real providers
- Property filtering validation
- Error condition testing
The structure is well-organized and addresses the testing requirements for the new embedding functionality.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1 hunks)
- libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py
🔇 Additional comments (5)
libs/core/kiln_ai/adapters/embedding/test_litellm_embedding_adapter.py (5)
1-32: LGTM! Well-structured imports and fixtures.
The imports are clean and the fixtures provide good test setup for the adapter and configuration.
34-57: LGTM! Comprehensive testing of EmbeddingOptions.
The tests properly cover default values, dimension setting, and model serialization behavior.
59-234: LGTM! Thorough testing of core adapter functionality.
The tests cover initialization, option building, embedding generation, batch size limits, error handling, and proper sorting behavior. The mock usage is appropriate and the test scenarios are comprehensive.
267-288: Reconsider the duplicate indices test behavior.
The test accepts duplicate indices in the response and processes them without error, but duplicate indices in an API response likely indicate a problem that should be flagged rather than silently handled. Consider whether the adapter should validate for duplicate indices and raise an error, or if this is the intended behavior for handling malformed API responses.
461-550: LGTM! Comprehensive validation function testing.
The tests for `validate_map_to_embeddings` thoroughly cover valid scenarios, error conditions, and edge cases including sorting behavior and various validation failures.
Force-pushed from 89f699f to 390e53f
…config level so throwing before adapter
Force-pushed from 390e53f to b5d81a9
Rebased to pull
Actionable comments posted: 0
🧹 Nitpick comments (1)
libs/core/kiln_ai/adapters/test_remote_config.py (1)
84-84: Remove or clarify the unclear comment.
The comment "# why?" appears to be a leftover from development and should either be removed or replaced with a clear explanation of why the original models list is restored.
```diff
-    # why?
+    # Restore original state for subsequent tests
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- libs/core/kiln_ai/adapters/remote_config.py (2 hunks)
- libs/core/kiln_ai/adapters/test_remote_config.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
libs/core/kiln_ai/adapters/test_remote_config.py (1)
- libs/core/kiln_ai/adapters/remote_config.py (6): KilnRemoteConfig (23-25), serialize_config (28-37), deserialize_config (40-59), load_from_url (62-78), dump_builtin_config (81-86), load_remote_models (89-103)
libs/core/kiln_ai/adapters/remote_config.py (3)
- libs/core/kiln_ai/adapters/ml_embedding_model_list.py (1): KilnEmbeddingModel (55-63)
- libs/core/kiln_ai/adapters/ml_model_list.py (1): KilnModel (200-215)
- libs/core/kiln_ai/adapters/test_remote_config.py (2): json (39-40), raise_for_status (36-37)
🔇 Additional comments (16)
libs/core/kiln_ai/adapters/test_remote_config.py (8)
7-7: LGTM: Import addition for embedding models support.
The import of `built_in_embedding_models` is correctly added to support the new embedding model functionality in the tests.
10-10: LGTM: Import addition for new dataclass.
The import of `KilnRemoteConfig` is correctly added to support the new remote configuration structure.
21-28: LGTM: Test properly updated for dual model support.
The `serialize_config` function call and assertions are correctly updated to handle both regular models and embedding models. The round-trip test ensures both model types are properly serialized and deserialized.
33-49: LGTM: URL loading test properly extended.
The test correctly includes embedding model data in the mock response and verifies both model types are properly loaded from the URL.
56-61: LGTM: Builtin config dump test properly extended.
The test correctly verifies that both regular models and embedding models are properly dumped and loaded through the builtin configuration.
69-82: LGTM: Remote models success test properly extended.
The test correctly handles both model types in the fake fetch function and verifies both global model lists are updated properly.
91-101: LGTM: Remote models failure test properly extended.
The test correctly preserves both original model lists and verifies they remain unchanged when remote loading fails.
114-130: LGTM: Extra keys test comprehensively extended.
The test properly extends the extra keys handling to include embedding models, ensuring that unknown keys are ignored for both model types and their nested providers.
libs/core/kiln_ai/adapters/remote_config.py (8)
6-6: LGTM: Dataclass import addition.
The import of `dataclass` is correctly added to support the new `KilnRemoteConfig` dataclass.
12-15: LGTM: Embedding model imports addition.
The imports for `KilnEmbeddingModel` and `built_in_embedding_models` are correctly added to support the new embedding model functionality.
22-25: LGTM: Well-designed dataclass for combined configuration.
The `KilnRemoteConfig` dataclass provides a clean abstraction for holding both regular models and embedding models in a single configuration object.
28-37: LGTM: Serialize function properly updated.
The `serialize_config` function signature and implementation are correctly updated to handle both model types. The data structure uses clear keys ("model_list" and "embedding_model_list") for both model types.
40-59: LGTM: Deserialize function with proper backward compatibility.
The `deserialize_config` function correctly:
- Returns the new `KilnRemoteConfig` dataclass
- Maintains backward compatibility with the old list-only format
- Handles both model types with appropriate validation
- Uses safe `.get()` calls with empty list defaults
62-78: LGTM: URL loading function properly updated.
The `load_from_url` function correctly:
- Returns the new `KilnRemoteConfig` dataclass
- Maintains backward compatibility with the old list-only format
- Handles both model types with appropriate validation
- Uses consistent error handling
82-86: LGTM: Builtin config dump function properly updated.
The `dump_builtin_config` function correctly calls `serialize_config` with both built-in model lists.
96-97: LGTM: Remote models loading properly updated.
The `load_remote_models` function correctly updates both global model lists from the loaded configuration. The slice assignment pattern (`[:]`) ensures the original list references are preserved while updating their contents.
Force-pushed from b808395 to 5139ad5
Force-pushed from 5139ad5 to 15e07da
```python
del os.environ["KILN_SKIP_REMOTE_MODEL_LIST"]
original = built_in_models.copy()
sample_models = [built_in_models[0]]
monkeypatch.delenv("KILN_SKIP_REMOTE_MODEL_LIST", raising=False)
```
Doing `monkeypatch.delenv` is preferable because it undoes itself after the test execution whereas `os.environ` may pollute other execution by side effect
```python
monkeypatch.setattr("kiln_ai.adapters.remote_config.load_from_url", fake_fetch)

load_remote_models("http://example.com/models.json")
await asyncio.sleep(0.01)
assert built_in_models == sample_models
built_in_models[:] = original
assert built_in_embedding_models == sample_embedding_models


@pytest.mark.asyncio
async def test_load_remote_models_failure(monkeypatch):
```
The error simulated in this test was never actually hit due to `KILN_SKIP_REMOTE_MODEL_LIST == "true"` resolving to true during execution - found this out when adding the check for logger call
```python
monkeypatch.setattr("kiln_ai.adapters.remote_config.load_from_url", fake_fetch)

load_remote_models("http://example.com/models.json")
await asyncio.sleep(0.01)
assert built_in_models == sample_models
built_in_models[:] = original
```
I removed this, but am not sure what the intent was here (I get why the assignment in the actual code, but not why we are doing that at this end of this test)
```python
raw = json.loads(Path(path).read_text())
model_data = raw.get("model_list", raw if isinstance(raw, list) else [])
```
There was a fallback check trying to use the JSON as a list of models, but seems like that was a transient thing during development and now there is no scenario where `raw` is a list of models (as opposed to a dict containing the list of models)
What does this PR do?
Changes:
- `EmbeddingConfig`: the configuration managing the embedding process (e.g. what model, etc.)
- `ChunkEmbeddings`: the container that we will store the embeddings into (as JSON), where `ChunkEmbedding.embeddings[i]` will be the embedding that corresponds to `ChunkedDocument.chunks[i]`
- `BaseEmbeddingAdapter` and basic LiteLLM implementation of the adapter (`LitellmEmbeddingAdapter`)
- `built_in_embedding_models` list that describes the different available models, providers, and their features (e.g. OpenAI's `text-embedding-3` family supports custom number of dimensions) and limitations (max input tokens, etc.), similar to `ml_model_list`
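The index correspondence described above can be sketched as follows — field names follow the PR description, but the surrounding types are simplified assumptions rather than the actual datamodel classes:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Embedding:
    vector: List[float]


@dataclass
class ChunkEmbeddings:
    """Stored as JSON; embeddings[i] corresponds to ChunkedDocument.chunks[i]."""
    embeddings: List[Embedding]


# Hypothetical chunked document with two chunks
chunks = ["first chunk of text", "second chunk of text"]
chunk_embeddings = ChunkEmbeddings(
    embeddings=[Embedding(vector=[0.1, 0.2]), Embedding(vector=[0.3, 0.4])]
)

# Pair each chunk with its embedding by shared index
pairs = list(zip(chunks, chunk_embeddings.embeddings))
```

Keeping the mapping positional avoids storing chunk ids alongside each vector, at the cost of requiring the two lists to stay the same length.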
Checklists
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation
Refactor