- Author: zimulala
- Tracking issue: pingcap#55693
Vector Index refers to constructing a time and space efficient data structure for vector data based on a certain mathematical quantization model, to efficiently retrieve several vectors that are as similar as possible to the target vector. Currently, we plan to support the HNSW series of vector indexing methods.
This document plans to support the same processing of vector indexes as ordinary indexes, its specific functions are as follows :
- Support adding and using vector indexes, including adding vector indexes during and after table creation.
- Support dropping and renaming vector indexes.
The process of adding a vector index is similar to that of adding an ordinary index. However, since the actual vector index data is added to TiFlash, there is no process for populating the index data to TiKV.
The following figure takes the add vector index operation as an example and briefly describes its execution process.
Syntax required to create a vector index:
- Adding a new ConstraintElem syntax creates a vector index when creating a table.
CREATE TABLE foo (
id INT PRIMARY KEY,
data VECTOR(5),
data64 VECTOR64(10),
-- "WITH OPTION" is not supported in Phase 1
VECTOR INDEX idx_name USING HNSW ((VEC_COSINE_DISTANCE(data))) [WITH OPTION "m=16, ef_construction=64"]
);
- By
IndexKeyTypeOpt
addVECTOR
index type, andIndexTypeOpt
addHNSW
option type, and add vector index after creating the table.
CREATE VECTOR INDEX idx_name USING HNSW ON foo ((VEC_COSINE_DISTANCE(data)))
-- Proposal 1, "WITH OPTION" is not supported in Phase 1
[WITH OPTION "m=16, ef_construction=64"];
-- Proposal 2, "VECTOR_INDEX_PARAM" is not supported in Phase 1
[VECTOR_INDEX_PARAM "m=16, ef_construction=64"];
ALTER TABLE foo ADD VECTOR INDEX idx_name USING HNSW ((VEC_COSINE_DISTANCE(data)))
-- "WITH OPTION" is not supported in Phase 1
[WITH OPTION "m=16, ef_construction=64"];
- The syntax is similar to a normal index, and there are several ways to create an index
CREATE VECTOR INDEX idx ON t ((VEC_COSINE_DISTANCE(a))) USING HNSW;
CREATE VECTOR INDEX IF NOT EXISTS idx ON t ((VEC_COSINE_DISTANCE(a))) TYPE HNSW;
CREATE VECTOR INDEX ident ON db.t (ident, ident ASC ) TYPE HNSW;
ALTER TABLE t ADD VECTOR ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
ALTER TABLE t ADD VECTOR INDEX ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
ALTER TABLE t ADD VECTOR INDEX IF NOT EXISTS ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
- Support to choose vector index by CBO and
USE INDEX
hint. - The vector index only obtains the top n values through similarity. The following is the syntax for using a vector index when querying.
SELECT *
FROM foo
ORDER BY VEC_COSINE_DISTANCE(data, '[3,1,2]')
LIMIT 5;
Currently, the use of vector indexing is supported on the master branch and supports the explain function.
- Protobuf adds the new field of `indexId` when TiDB interacts with TiFlash.
type ANNQueryInfo struct {
QueryType ANNQueryType `protobuf:"varint,1,opt,name=query_type,json=queryType,enum=tipb.ANNQueryType" json:"query_type"`
DistanceMetric VectorDistanceMetric `protobuf:"varint,2,opt,name=distance_metric,json=distanceMetric,enum=tipb.VectorDistanceMetric" json:"distance_metric"`
TopK uint32 `protobuf:"varint,3,opt,name=top_k,json=topK" json:"top_k"`
ColumnName string `protobuf:"bytes,4,opt,name=column_name,json=columnName" json:"column_name"`
ColumnId int64 `protobuf:"varint,5,opt,name=column_id,json=columnId" json:"column_id"`
RefVecF32 []byte `protobuf:"bytes,6,opt,name=ref_vec_f32,json=refVecF32" json:"ref_vec_f32,omitempty"`
MaxDistance float64 `protobuf:"fixed64,10,opt,name=max_distance,json=maxDistance" json:"max_distance"`
HnswEfSearch uint32 `protobuf:"varint,20,opt,name=hnsw_ef_search,json=hnswEfSearch" json:"hnsw_ef_search"`
XXX_unrecognized []byte `json:"-"`
// new fields
IndexId int64 `protobuf:"varint,5,opt,name=column_id,json=columnId" json:"column_id"`
}
The analyze statement is not supported at the moment.
- A replica of TiFlash is required
- Cluster must have a TiFlash replica
- About the TiFlash replica of the table for which the vector index needs to be created
- When a table is created with an index, the TiFlash replica of the table is set to 1 by default in the first stage (serverless is greater than or equal to minReplica in the configuration item).
- After creating a table, creating an index requires a TiFlash replica.
- Cannot be created on TiFlash nodes with static encryption enabled
- Cannot be a primary key index
- Cannot be a composite index
- Create several index constraints for the same column
- There can be different vector indexes defined using different
distance functions
- Different vector indexes defined by the same
distance function
will not exist
- There can be different vector indexes defined using different
- Same name processing as the ordinary index
- Vector index requires support for functions
function | Phase 1 | Phase 2 | Phase 3 |
---|---|---|---|
VEC_L1_DISTANCE | TBD | ||
VEC_L2_DISTANCE | v | ||
VEC_NEGATIVE_INNER_PRODUCT(名字待定) VEC_NEGATIVE_INNER_PRODUCT (name to be determined) | v | ||
VEC_COSINE_DISTANCE | v |
- Incompatibility with other features(This operation is not supported in the first phase of this design, but it can be considered for support in the future).
- Setting invisible properties is not supported
- Dropping a column with a vector index is not supported, and creating multiple indexes together is not supported. Also, modifying column types with vector indexes is not supported (lossy changes).
- Workaround, split it into multiple operations: for example, we can first drop all vector indexes on the column, and then drop the column.
Add the VectorIndexInfo
information field to describe and store information about vector indexes. In addition, add relevant information to the existing IndexInfo
to record information about vector indexes.
// VectorIndexInfo is the information on the vector index of a column.
type VectorIndexInfo struct {
// Kind is the kind of vector index. Currently, only HNSW is supported.
Kind VectorIndexKind `json:"kind"`
// Dimension is the dimension of the vector.
Dimension uint64 `json:"dimension"` // Set to 0 when initially parsed from comment. Will be assigned to flen later.
// DistanceMetric is the distance metric used by the index.
DistanceMetric DistanceMetric `json:"distance_metric"`
}
// IndexInfo provides meta data describing a DB index.
type IndexInfo struct {
ID int64 `json:"id"`
Name CIStr `json:"idx_name"` // Index name.
...
// VectorInfo is the vector index information.
VectorInfo *VectorIndexInfo `json:"is_vector"`
}
- Consider adding
ActionAddVectorIndex
DDL job type, through the existing DDL framework, to achieve the operation of adding a vector index.- The middle state is consistent with adding a normal index (None -> Delete-Only -> Write-Only -> Write-Reorg -> Public).
- When in the Write-Reorg state, do not use the runReorgJob mechanism of this state, that is, do not enable backfilling or check execution progress logic in the background.
- Sync
VectorIndexInfo
to TiFlash. - Wait for TiFlash to fill in the HNSW Index data.
- We can check whether the operation has been completed by checking whether the
ROWS_STABLE_NOT_INDEXED
value of the table corresponding to the added vector index on thesystem.dt_local_indexes
table on TiFlash is 0. - Obtain error info from indexStat to determine if there is an error during execution.
- Sync
- Compatible with Cancel, Pause, and Resume operations
- The Pause operation needs to consider pausing the operation during the reorg phase.
- Not supported in the first phase
- Implement Rollback related operations.
The IsRollbackable
check is the same as a general index.- Like general indexes, if you roll back during the Write-Reorg phase, you also need to delete the index data that has been added.
- The Pause operation needs to consider pausing the operation during the reorg phase.
- No need to add a DDL job type, similar to deleting a general index operation.
- The middle state is consistent with deleting a general index (None - > Write-Only - > Delete-Only - > Delete-Reorg - > None).
- Judgment, if it is a vector index
- Delete-Reorg state
- No need to wait for HNSW Index data deletion on TiFlash to complete .
- There is no need to add this metadata to
MySQL.gc_delete_range
- Delete-Reorg state
- Compatible with Cancel, Pause, and Resume operations
- The Pause operation needs to consider pausing the operation during the reorg phase.
- Not supported in the first phase
- Implement Rollback related operations.
The IsRollbackable
check is the same as a general index.
- The Pause operation needs to consider pausing the operation during the reorg phase.
No need to add a DDL job type, just like renaming a general index operation. We don't have to wait for TiFlash's synchronization operation, just return directly after the TiKV operation is completed.
- Compatible with
admin show ddl
to display the progress of DDL operation execution- The display way is considered the same as the general DDL display way, and the
ROWS_STABLE_INDEXED
information of the corresponding table of theinformation_schema.tiflash_indexes
table on TiFlash can be filled into theROW_COUNT
information of the DDL job.
- The display way is considered the same as the general DDL display way, and the
- Add or update relevant system table information
- information_schema tiflash_tables (already supported)
- information_schema tiflash_segments (already supported)
- information_schema tiflash_indexes (already supported)
- Add columns index_name and index_id.
- Add columns error_msg and error_count.
- information_schema.tidb_indexes
- Support
show create table
The admin check table/index
andadmin repair table
operations do not currently consider processing of vector indexes.- Add relevant metrics
- Because
ActionAddVectorIndex
is a new DDL job type, the monitoring information for adding a vector index operation needs to be handled the same as adding a general index. - Because dropping or renaming a vector index does not add a DDL job type, monitoring these two operations does not require special handling.
- Because
Adding an index operation mainly consists of two steps:
- Synchronize schema operations to add index data. Currently, there are two main ways for TiFlash to obtain schema from TiKV during operation: the first is that the Background Sync Thread is responsible for regularly calling Schema Syncer to update the Local Schema Copy module; the second is that the read and write operations in TiFlash will also call Schema Syncer for updates when needed.
- Since adding a vector index does not trigger the synchronization schema of the second method, currently we can only rely on regularly synchronizing the schema to synchronize the metadata to TiFlash.
- Notify TiFlash to perform the load schema operation through HTTP.
syncTableSchema
handles synchronization Schema operations, and if new index information is found, it needs to be synchronized to the table.
- Since adding a vector index does not trigger the synchronization schema of the second method, currently we can only rely on regularly synchronizing the schema to synchronize the metadata to TiFlash.
- Create the vector index data.
- Constraints that require creating an index. That is, when the index status is Write-Reorg.
- Timing for generating index data:
- When synchronizing to add index information, obtain all segment information under the corresponding table to generate the HNSW Index.
- Use segments as task units and put them into LocalIndexerScheduler (already supported) for parallel processing tasks.
- Other operations that may trigger the generation of index data: merging delta and stable, merging/splitting/ingesting segments.
- Due to the uncontrollable timing of such triggering operations, consider separating the index generation from the existing stable layer. The plan is to wait until the index status is public to truly trigger the index generation.
- When synchronizing to add index information, obtain all segment information under the corresponding table to generate the HNSW Index.
- Execute index data generation, with segments as the unit of index data generation.
- Create a vector index for DMFile that does not have a corresponding index. Currently, only the data of the stable layer is created with a vector index.
- The real file name for creating the index is similar to: {idx_id} .idx.
- This naming convention is different from the existing naming convention {col_id} .idx. If compatibility is a concern, it is best to add a prefix.
- The real file name for creating the index is similar to: {idx_id} .idx.
- Create a
VectorIndexHNSWBuilder
to read column data and populate the corresponding index data (already supported). - Update the relevant metadata in the segment, including the index information stored on the DMFile.
- Wait for index data generation on all relevant segments to complete.
- Create a vector index for DMFile that does not have a corresponding index. Currently, only the data of the stable layer is created with a vector index.
- This operation is similar to the add vector index operation, except that this operation processes dropping the vector index.
- Most of the process is the same as creating a vector index.
- Drop the vector index data.
- When
syncTableschema
handles synchronous Schema operations, it needs to remove the corresponding index. - The difference with the add vector index is that the constraint for triggering the operation is that the index state is Delete-Reorg state.
- In addition, when triggering the actual deletion operation, if it is found that there is no corresponding vector index (i.e. metadata without vector index), and the corresponding vector index data file still exists, it is necessary to delete the index data.
- When
Add or update relevant system table information.
- information_schema tiflash_indexes needs to add some columns:
- Add columns index_name and index_id.
- Add columns error_msg and error_count.
- BR
- TiCDC
- DM
- No need for support
- Lighting
- Dumpling
Follow-up MySQL supports MySQL 9.0 vector, whether we need to consider compatibility later.
- For the time being, we don't need to be compatible with this syntax
It’s used to ensure the basic feature function works as expected. Both the integration test and the unit test should be considered.
- Create multiple vector indexes on various tables. It tests the speed and resource usage when numerous vector indexes are added in parallel.
- Test whether index data can be deleted in sync schema (Drop vector index) when the index is deleted.
A checklist to test compatibility:
- Compatibility with other features, like partition table, drop column, etc.
- Compatibility with other components, like BR, TiCDC, Dumpling, BR, DM, Lighting, etc.
- Upgrade compatibility