[Bug]: [streaming] Multiple queryNodes oom when loading a collection of L2 segments #42712

@ThreadDao

Description

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20250611-a72463c6-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

config

dependencies:
  pulsar:
    inCluster:
      values:
        broker:
          configData:
            backlogQuotaDefaultLimitGB: "-1"
    common:
      enabledJSONKeyStats: true
    dataCoord:
      compaction:
        clustering:
          autoEnable: true
      enableActiveStandby: true
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    mixCoord:
      enableActiveStandby: true
    queryCoord:
      enableActiveStandby: true
    queryNode:
      enableSegmentPrune: true
    rootCoord:
      enableActiveStandby: true

client test

  1. create a collection fouram_GCT4y7ke (with clustering key)
{'auto_id': False,
 'description': '',
 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}},
            {'name': 'int32_1', 'description': '', 'type': <DataType.INT32: 4>}, {'name': 'float32_1', 'description': '', 'type': <DataType.FLOAT: 10>, 'is_clustering_key': True},
            {'name': 'varchar_1', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 100}},
            {'name': 'varchar_2', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 100, 'enable_match': True, 'enable_analyzer': True}},
            {'name': 'array_varchar_1', 'description': '', 'type': <DataType.ARRAY: 22>, 'params': {'max_length': 100, 'max_capacity': 10}, 'element_type': <DataType.VARCHAR: 21>},
            {'name': 'json_1', 'description': '', 'type': <DataType.JSON: 23>}, {'name': 'binary_vector', 'description': '', 'type': <DataType.BINARY_VECTOR: 100>, 'params': {'dim': 128}},
            {'name': 'float16_vector', 'description': '', 'type': <DataType.FLOAT16_VECTOR: 102>, 'params': {'dim': 128}}, {'name': 'bfloat16_vector', 'description': '', 'type': <DataType.BFLOAT16_VECTOR: 103>, 'params': {'dim': 128}}],
 'enable_dynamic_field': False} (base.py:329)
  2. create all kinds of index
  'scalars_index': {'float32_1': {'index_type': 'STL_SORT'},
                    'int32_1': {'index_type': 'BITMAP'},
                    'varchar_1': {'index_type': 'TRIE'},
                    'array_varchar_1': {'index_type': 'INVERTED'}},
  'vectors_index': {'binary_vector': {'metric_type': 'JACCARD',
                                      'index_type': 'BIN_IVF_FLAT',
                                      'index_param': {'nlist': 128}},
                    'float16_vector': {'metric_type': 'COSINE',
                                       'index_type': 'IVF_SQ8',
                                       'index_param': {'nlist': 128}},
                    'bfloat16_vector': {'metric_type': 'IP',
                                        'index_type': 'IVF_FLAT',
                                        'index_param': {'nlist': 128}}},
'index_params': {'index_type': 'HNSW', 'index_param': {'M': 16, 'efConstruction': 200}},
  3. insert 5M entities -> flush -> index again -> load
  4. concurrent requests: insert + delete + query + search + hybrid_search
    After the test, the number of segments loaded by each queryNode (qn) and their memory usage were as follows:
    [Image]
  5. bulk_import 10 laion1B_nolang parquet files and create an index. The schema of collection import_1749718797_4573 is:
        fields = [
            FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
            FieldSchema(name="pk_5b", dtype=DataType.INT64, is_clustering_key=True),
            FieldSchema(name="caption", dtype=DataType.VARCHAR, max_length=8192, enable_analyzer=True,
                        enable_match=True),
            FieldSchema(name="NSFW", dtype=DataType.VARCHAR, max_length=8192),
            FieldSchema(name="similarity", dtype=DataType.DOUBLE),
            FieldSchema(name="width", dtype=DataType.INT64, is_partition_key=True),
            FieldSchema(name="height", dtype=DataType.INT64),
            FieldSchema(name="original_width", dtype=DataType.INT64),
            FieldSchema(name="original_height", dtype=DataType.INT64),
            FieldSchema(name="md5", dtype=DataType.VARCHAR, max_length=8192),
            FieldSchema(name="float32_vector", dtype=DataType.FLOAT_VECTOR, dim=VECTOR_DIM),
        ]
  6. load collection import_1749718797_4573: 3 of the 5 queryNodes OOM and restart
zong-sn-base-op-53-4126-milvus-datanode-7b9bcd8864-c5fhk          1/1     Running     0               18h     10.104.14.175   4am-node18   <none>           <none>
zong-sn-base-op-53-4126-milvus-datanode-7b9bcd8864-ddjhz          1/1     Running     0               38h     10.104.27.216   4am-node31   <none>           <none>
zong-sn-base-op-53-4126-milvus-mixcoord-594686fcdc-jpz7f          1/1     Running     0               38h     10.104.19.143   4am-node28   <none>           <none>
zong-sn-base-op-53-4126-milvus-proxy-66c4bd8f8-z27jx              1/1     Running     0               38h     10.104.34.25    4am-node37   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-7xjdd       1/1     Running     1 (15m ago)     38h     10.104.9.195    4am-node14   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-bjwr7       1/1     Running     0               38h     10.104.23.125   4am-node27   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-dh2rg       1/1     Running     1 (15m ago)     18h     10.104.25.77    4am-node30   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-gjbkd       1/1     Running     0               38h     10.104.14.52    4am-node18   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-plt6j       1/1     Running     1 (15m ago)     38h     10.104.19.144   4am-node28   <none>           <none>
zong-sn-base-op-53-4126-milvus-streamingnode-645bb4bbdd-jfls4     1/1     Running     0               38h     10.104.24.98    4am-node29   <none>           <none>
zong-sn-base-op-53-4126-milvus-streamingnode-645bb4bbdd-msr49     1/1     Running     0               38h     10.104.20.30    4am-node22   <none>           <none>
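
To pick the restarted queryNodes out of a listing like the one above, the RESTARTS column can be filtered with a short script. This is a minimal sketch, assuming the `kubectl get pods -o wide` column layout shown above; the helper name `restarted_querynodes` is hypothetical and not part of any Milvus tooling. A non-zero restart count alone does not prove an OOM kill; the `Last State` reason in `kubectl describe pod` is the place to confirm `OOMKilled`.

```python
import re

# Rows copied from the pod listing above (one datanode row kept as a negative case).
PODS = """\
zong-sn-base-op-53-4126-milvus-datanode-7b9bcd8864-c5fhk          1/1     Running     0               18h     10.104.14.175   4am-node18   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-7xjdd       1/1     Running     1 (15m ago)     38h     10.104.9.195    4am-node14   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-bjwr7       1/1     Running     0               38h     10.104.23.125   4am-node27   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-dh2rg       1/1     Running     1 (15m ago)     18h     10.104.25.77    4am-node30   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-gjbkd       1/1     Running     0               38h     10.104.14.52    4am-node18   <none>           <none>
zong-sn-base-op-53-4126-milvus-querynode-0-56d54bcd5f-plt6j       1/1     Running     1 (15m ago)     38h     10.104.19.144   4am-node28   <none>           <none>
"""

# Columns: NAME, READY, STATUS, RESTARTS, optionally followed by "(<time> ago)".
ROW = re.compile(
    r"^(?P<name>\S+)\s+\S+\s+\S+\s+(?P<restarts>\d+)(?:\s+\((?P<last>[^)]*)\))?"
)


def restarted_querynodes(listing):
    """Return (pod name, restart count, last restart) for queryNode pods with restarts > 0."""
    hits = []
    for line in listing.splitlines():
        match = ROW.match(line.strip())
        if not match or "-querynode-" not in match["name"]:
            continue
        count = int(match["restarts"])
        if count > 0:
            hits.append((match["name"], count, match["last"]))
    return hits


for name, count, last in restarted_querynodes(PODS):
    print(f"{name}: {count} restart(s), last {last}")
```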

The load itself eventually succeeded, but the queryNodes ran out of memory during the load, which caused the OOMs.

connections.connect(host="10.104.xx.xx")
utility.list_collections()
['fouram_GCT4y7ke', 'import_1749718797_4573']
c = Collection(name='import_1749718797_4573')
c.load() # qn oom
c.query('', output_fields=["count(*)"])
data: ["{'count(*)': 9683719}"]
c = Collection(name='fouram_GCT4y7ke')
c.query('', output_fields=["count(*)"])
data: ["{'count(*)': 10171780}"]
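
For a rough sense of scale, the raw size of the float vector field can be estimated from the row count above. This is only a sketch under stated assumptions: `VECTOR_DIM` is not given in this issue, so the dimension used below (768) is a placeholder, and the estimate ignores indexes, scalar fields, text match data, and any load-time overhead. The helper name `raw_float_vector_bytes` is hypothetical.

```python
def raw_float_vector_bytes(rows, dim, bytes_per_value=4):
    """Raw size of a float32 vector column: rows * dim * 4 bytes."""
    return rows * dim * bytes_per_value


ROWS = 9_683_719   # count(*) reported for import_1749718797_4573
ASSUMED_DIM = 768  # placeholder assumption: VECTOR_DIM is not stated in the issue

size = raw_float_vector_bytes(ROWS, ASSUMED_DIM)
print(f"raw float32 vectors: {size / 2**30:.2f} GiB")
```

Comparing such a lower bound with the queryNode memory limits may help decide whether the OOM comes from steady-state data size or from a temporary peak during load.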

Metrics while loading the L2 segments of the import collection

Expected Behavior

No response

Steps To Reproduce

1. concurrent requests: https://argo-workflows.zilliz.cc/archived-workflows/qa/48e7e3d7-83a7-4a21-8065-c6c4ef19d79a?nodeId=zong-sn-labor-12-clustering-2657410768
2. bulk_import laion1B_nolang: https://argo-workflows.zilliz.cc/archived-workflows/qa/cf1871fb-2adb-4b88-963b-a63c57a7b2b2?nodeId=zong-sn-import-base-1

Milvus Log

No response

Anything else?

No response

Labels: kind/bug, triage/accepted
