feat: Support nullable vectors in Node.js SDK#482
feat: Support nullable vectors in Node.js SDK#482marcelo-cjl wants to merge 1 commit intomilvus-io:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: marcelo-cjl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
please specify your target milvus tag in |
Signed-off-by: marcelo-cjl <marcelo.chen@zilliz.com>
51b6ed3 to
283d75b
Compare
📝 WalkthroughWalkthroughThis pull request adds support for nullable vector fields in Milvus collections. It introduces validation to enforce that vector fields added to existing collections must be nullable, refactors data serialization to properly handle nullable vectors across multiple types, and provides comprehensive test coverage for the new functionality. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Collection
participant DataHandler
participant Serializer
participant Milvus
rect rgb(200, 220, 240)
Note over Client,Milvus: Adding Nullable Vector Field to Existing Collection
Client->>Collection: addCollectionField(vectorField)
Collection->>Collection: convertToDataType(field)
Collection->>Collection: isVectorType check
alt Vector type & not nullable
Collection->>Client: ❌ ADD_VECTOR_FIELD_MUST_BE_NULLABLE error
else Vector type & nullable ✓
Collection->>Milvus: Create field schema with nullable=true
Milvus-->>Collection: ✓ Field added
end
end
rect rgb(240, 220, 200)
Note over Client,Milvus: Inserting Data with Nullable Vectors
Client->>DataHandler: insert(rows with nullable vectors)
DataHandler->>DataHandler: For each row: buildFieldData(value)
alt Vector field value is null
DataHandler->>DataHandler: Store null at rowIndex
else Vector field value exists
DataHandler->>Serializer: Transform vector data
Serializer->>Serializer: Filter nulls, apply valid_data bitmap
Serializer->>Serializer: Build payload (buffer/array)
end
DataHandler->>Milvus: Send serialized data with nullable metadata
Milvus-->>Client: ✓ Insert complete
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
milvus/grpc/Data.ts (1)
319-427: Wrap switch case declarations in blocks.Multiple variables (
floatVecData,f16Data,binaryData,sparseData,dim,int8Data) are declared directly in switch cases without block scoping. While this works in the current code, it violates best practices and could lead to issues if cases fall through or are reordered.🔎 Proposed fix
switch (field.type) { - case DataType.FloatVector: + case DataType.FloatVector: { const floatVecData = field.nullable ? (field.data as any[]) .filter(v => v !== null && v !== undefined) .flat() : field.data; keyValue = { dim: field.dim, [dataKey]: { data: floatVecData, }, }; break; - case DataType.BFloat16Vector: - case DataType.Float16Vector: + } + case DataType.BFloat16Vector: + case DataType.Float16Vector: { const f16Data = field.nullable ? (field.data as any[]).filter(v => v !== null && v !== undefined) : field.data; keyValue = { dim: field.dim, [dataKey]: Buffer.concat(f16Data as Uint8Array[]), }; break; - case DataType.BinaryVector: + } + case DataType.BinaryVector: { const binaryData = field.nullable ? (field.data as any[]) .filter(v => v !== null && v !== undefined) .flat() : field.data; keyValue = { dim: field.dim, [dataKey]: f32ArrayToBinaryBytes(binaryData as BinaryVector), }; break; - case DataType.SparseFloatVector: + } + case DataType.SparseFloatVector: { const sparseData = field.nullable ? (field.data as any[]).filter(v => v !== null && v !== undefined) : field.data; const dim = getSparseDim(sparseData as SparseFloatVector[]); keyValue = { dim, [dataKey]: { dim, contents: sparseRowsToBytes(sparseData as SparseFloatVector[]), }, }; break; - case DataType.Int8Vector: + } + case DataType.Int8Vector: { const int8Data = field.nullable ? (field.data as any[]).filter(v => v !== null && v !== undefined) : field.data; keyValue = { dim: field.dim, [dataKey]: int8VectorRowsToBytes(int8Data as Int8Vector[]), }; break; + }Based on static analysis hints from Biome.
🧹 Nitpick comments (2)
milvus/grpc/Data.ts (1)
246-266: Consider documenting the nullable vs non-nullable storage strategy.The code uses different storage strategies:
- Nullable vectors: Indexed storage (
field.data[rowIndex] = fieldValue)- Non-nullable vectors: Concatenated storage (
field.data = field.data.concat(fieldValue))This is then reconciled in
buildColumnDatawhere nullable vectors are filtered. While this works, a comment explaining the rationale would help future maintainers understand why different strategies are used.test/grpc/NullableVector.spec.ts (1)
437-439: Consider making flush timeouts configurable or polling-based.The hardcoded 2-second timeouts could be flaky in CI environments or under load. Consider using a polling mechanism or making the timeout configurable via environment variable.
🔎 Suggested improvement
// Add at the top of the file const FLUSH_TIMEOUT = parseInt(process.env.FLUSH_TIMEOUT || '5000', 10); const FLUSH_POLL_INTERVAL = 100; // Helper function const waitForFlush = async ( client: MilvusClient, collectionName: string, maxWait: number = FLUSH_TIMEOUT ): Promise<void> => { const startTime = Date.now(); while (Date.now() - startTime < maxWait) { const state = await client.getLoadingProgress({ collection_name: collectionName }); if (state.progress === '100') { return; } await new Promise(resolve => setTimeout(resolve, FLUSH_POLL_INTERVAL)); } throw new Error(`Flush timeout after ${maxWait}ms`); }; // Usage await milvusClient.flush({ collection_names: [COLLECTION_NAME] }); await waitForFlush(milvusClient, COLLECTION_NAME);Also applies to: 536-538
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
milvus/const/error.tsmilvus/grpc/Collection.tsmilvus/grpc/Data.tsmilvus/utils/Data.tstest/grpc/NullableVector.spec.ts
🧰 Additional context used
🧬 Code graph analysis (3)
milvus/grpc/Data.ts (5)
milvus/types/DataTypes.ts (3)
BinaryVector(5-5)SparseFloatVector(15-19)Int8Vector(12-12)milvus/const/error.ts (1)
ERROR_REASONS(7-66)milvus/utils/Data.ts (1)
buildFieldData(344-432)milvus/utils/Bytes.ts (3)
f32ArrayToBinaryBytes(37-41)sparseRowsToBytes(224-230)int8VectorRowsToBytes(255-261)milvus/utils/Function.ts (1)
getSparseDim(124-133)
test/grpc/NullableVector.spec.ts (4)
test/tools/ip.ts (1)
IP(2-2)test/tools/data.ts (4)
genFloatVector(71-74)genBinaryVector(129-137)genSparseVector(150-233)genInt8Vector(235-252)test/tools/utils.ts (1)
GENERATE_NAME(7-8)milvus/const/error.ts (1)
ERROR_REASONS(7-66)
milvus/grpc/Collection.ts (3)
milvus/utils/Schema.ts (1)
convertToDataType(166-177)milvus/utils/Validate.ts (1)
isVectorType(267-276)milvus/const/error.ts (1)
ERROR_REASONS(7-66)
🪛 Biome (2.1.2)
milvus/grpc/Data.ts
[error] 321-325: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 335-337: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 344-348: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 355-357: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 358-358: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 368-370: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
🔇 Additional comments (6)
milvus/grpc/Collection.ts (1)
243-246: LGTM! Validation enforces nullable requirement for vector fields.The validation correctly checks if the field is a vector type and throws an appropriate error if
nullableis not set totrue. This prevents schema inconsistencies at field-addition time.milvus/const/error.ts (1)
64-65: LGTM! Clear and actionable error message.The error message provides clear guidance on the requirement and how to fix it.
milvus/grpc/Data.ts (2)
238-240: LGTM! Null checks prevent errors for nullable binary vectors.The additional null/undefined checks ensure the dimension validation only runs for non-null values, which is correct for nullable fields.
307-310: The removal of thekey !== 'vectors'check is intentional and correct for nullable vector support.Nullable vectors now properly compute
valid_datato track which rows contain null values. This is confirmed by the comprehensive test suite intest/grpc/NullableVector.spec.ts, which validates that nullable vectors across all 6 vector types (FloatVector, BinaryVector, Float16Vector, BFloat16Vector, SparseFloatVector, Int8Vector) can be inserted with null values and correctly queried/searched.test/grpc/NullableVector.spec.ts (2)
1-169: Excellent comprehensive test setup!The test file demonstrates thorough coverage:
- All 6 vector types (float, binary, float16, bfloat16, sparse, int8)
- Type-specific vector generators and comparison logic with appropriate tolerance
- Well-organized test configuration with VECTOR_TYPES array
The
vectorsEqualfunction correctly handles different vector formats including sparse vectors, binary vectors, and floating-point precision differences for bfloat16.
442-458: Excellent validation of nullable requirement enforcement!The test correctly verifies that adding a non-nullable vector field throws the expected error, then successfully adds a nullable field. This ensures the runtime validation works as intended.
| const applyValidDataToVectors = ( | ||
| field_data: any[], | ||
| valid_data: boolean[] | ||
| ): any[] => { | ||
| if (!valid_data || valid_data.length === 0) { | ||
| return field_data; | ||
| } | ||
|
|
||
| const result: any[] = []; | ||
| let physicalIndex = 0; | ||
|
|
||
| for (let logicalIndex = 0; logicalIndex < valid_data.length; logicalIndex++) { | ||
| if (valid_data[logicalIndex]) { | ||
| result.push(field_data[physicalIndex]); | ||
| physicalIndex++; | ||
| } else { | ||
| result.push(null); | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| }; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add bounds check to prevent index out-of-range errors.
The function assumes valid_data correctly represents the data, but doesn't validate that physicalIndex stays within field_data.length. If valid_data has more true values than field_data has elements, this could cause incorrect behavior.
🔎 Suggested fix
const applyValidDataToVectors = (
field_data: any[],
valid_data: boolean[]
): any[] => {
if (!valid_data || valid_data.length === 0) {
return field_data;
}
const result: any[] = [];
let physicalIndex = 0;
for (let logicalIndex = 0; logicalIndex < valid_data.length; logicalIndex++) {
if (valid_data[logicalIndex]) {
+ if (physicalIndex >= field_data.length) {
+ throw new Error(`Invalid valid_data: expected at least ${physicalIndex + 1} elements in field_data, but got ${field_data.length}`);
+ }
result.push(field_data[physicalIndex]);
physicalIndex++;
} else {
result.push(null);
}
}
return result;
};🤖 Prompt for AI Agents
In milvus/utils/Data.ts around lines 68 to 89, the loop can read past field_data
when valid_data contains more true entries than field_data has elements; add a
bounds check before accessing field_data[physicalIndex] — if physicalIndex <
field_data.length then push field_data[physicalIndex] and increment
physicalIndex, otherwise push null (or optionally throw a descriptive error) to
prevent index out-of-range reads and keep output length consistent with
valid_data.
related: #486
Summary by CodeRabbit
New Features
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.