This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 0c15ea6

Merge branch 'ann-avs-4.0'
2 parents 337d9b9 + 1519a79 commit 0c15ea6

File tree

5 files changed

+377
-222
lines changed

.vscode/settings.json

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,6 @@
11
{
22
"cSpell.words": [
3-
"opentelemetry"
3+
"opentelemetry",
4+
"pythonping"
45
]
56
}

ann_benchmarks/algorithms/aerospike/Dockerfile

Lines changed: 3 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,15 +1,16 @@
11
FROM ann-benchmarks
22

3-
ENV AVS_HOST='172.28.86.158'
3+
ENV AVS_HOST='172.18.174.125'
44
ENV AVS_PORT=5000
55
ENV AVS_NAMESPACE='test'
66
ENV AVS_SET='ANN-data'
77
ENV APP_POPULATE_TASKS=5000
88
ENV APP_LOGFILE=
99
ENV API_ASLOGLEVEL=
1010
ENV APP_LOGLEVEL=INFO
11+
ENV APP_CHECKRESULT=false
1112

12-
ARG AEROSPIKE_CLIENT_VERSION===2.0.0
13+
ARG AEROSPIKE_CLIENT_VERSION=4.0.0
1314

1415
RUN pip install pythonping
1516
RUN pip install aerospike-vector-search==${AEROSPIKE_CLIENT_VERSION}
Lines changed: 194 additions & 167 deletions
Original file line number | Diff line number | Diff line change
@@ -1,167 +1,194 @@
1-
# Backoff Logic when encountering Resource Exhausted
2-
3-
This only applies when "ignoreExhaustedEvent" is false in the config.yml (the default). When this value is true, exhausted resource events are ignored and left to the healer.
4-
5-
# The “back-off” logic is as follows:
6-
7-
- When the exception is received:
8-
- First record in error will perform the following actions:
9-
- Signal the “main” populating task to go into “sleep mode” so that additional records will not be upserted.
10-
- A warning message is logged.
11-
- All records in error will do the following:
12-
- Call “wait for index completion.”
13-
- Once the index is built the following occurs:
14-
- Re-upsert the error records
15-
- If successful, signal the “main” populating task to re-start populating.
16-
- A warning message is logged, stating population has re-started.
17-
18-
# Environmental Variables
19-
20-
Below are the Environmental Variables:
21-
22-
- AVS_LOGLEVEL -- The Vector Client API's Log Level. Defaults to "WARNING"
23-
- Possible Values:
24-
- CRITICAL
25-
- FATAL
26-
- ERROR
27-
- WARNING
28-
- WARN
29-
- INFO
30-
- DEBUG
31-
- NOTSET
32-
33-
Note: The logging file is determined by "APP_LOGFILE"
34-
35-
- AVS_HOST -- The AVS Server's Address. Defaults to "localhost"
36-
- AVS_PORT -- The AVS Server's Port. Defaults to 5000
37-
- AVS_USELOADBALANCER -- The AVS Server's Address is a Load Balancer. Default False.
38-
- AVS_NAMESPACE -- The Vector's Namespace. Defaults to "test"
39-
- AVS_SET -- The Vector's Set name. Defaults to "ANN-data"
40-
41-
This behavior is determined by the "uniqueSetIdxName" argument defined in the config.yml file.
42-
43-
The default (True) behavior is to create a unique Set name where this is the prefix to that name.
44-
45-
The name has the following parts:
46-
47-
```
48-
{AVS_SET}_{ANN Distance Type}_{AVS Idx Type}_{Dimension}_{hnsw m}_{hnsw ef construction}_{hnsw ef}
49-
```
50-
51-
Example:
52-
53-
ANN-data_angular_COSINE_20_16_100_100
54-
55-
If "uniqueSetIdxName" is false, the Set name is as follows:
56-
57-
```
58-
{AVS_SET}_{ANN Distance Type}_{AVS Idx Type}
59-
```
60-
61-
Example:
62-
63-
ANN-data_angular_COSINE
64-
65-
- APP_LOGFILE -- The Aerospike's ANN Logging file. Default is "AerospikeANN.log".
66-
67-
The folder is always the current working directory.
68-
69-
- APP_LOGLEVEL -- The Aerospike's ANN Log Level. Defaults to "INFO"
70-
- Possible Values:
71-
- CRITICAL
72-
- FATAL
73-
- ERROR
74-
- WARNING
75-
- WARN
76-
- INFO
77-
- DEBUG
78-
- NOTSET
79-
80-
Note: For performance testing this should be set to "NOTSET".
81-
82-
When running in a docker container, logging is disabled.
83-
84-
- APP_DROP_IDX -- A Boolean value that will determine if the Vector index is dropped if it already exists. The default is to use "dropIdx" argument in the config.yml file.
85-
- APP_INDEX_SLEEP -- The amount of time to sleep after the index is dropped. The default is 0.
86-
87-
Possible values are:
88-
89-
- 0 -- Don't Sleep
90-
- \> 0 -- The number of seconds to sleep
91-
- APP_POPULATE_TASKS -- The number of concurrent record upsert (put) tasks performed during the index population phase. Once this many records have been upserted, the app waits until all upserts are completed and then processes the next set of records. The default is 5000.
92-
93-
Values:
94-
95-
- \< 0 -- All records are upserted, concurrently, and the app will only wait for the upsert completion before waiting for index completion.
96-
- 0 or 1 -- One record is upserted at a time (sync)
97-
- \> 1 -- The number of records upserted, concurrently (async), before the app waits for the upserts to complete.
98-
- APP_PINGAVS -- Checks to determine if the AVS server is reachable via ping. Default is False.
99-
- APP_CHECKRESULT -- Checks the Vector Search results for failed results or Zero Distance. Default is True
100-
101-
Note: This value is always false if running in a docker container.
102-
103-
This should be set to False when conducting performance testing!
104-
105-
The default bin name for the vectors is always "ANN_embedding".
106-
107-
# config.yml file
108-
109-
The Aerospike ANN config.yml file supports the different ANN run group configurations. It is suggested that the Aerospike ANN application be run using the ANN Distance Type configuration.
110-
111-
Using this configuration, we can match each ANN distance type (e.g., Angular, Euclidean, Jaccard) to the "best" Aerospike Vector index type. Below is an example of this configuration with comments regarding the behavior of each parameter:
112-
113-
```
114-
float:
115-
#This defines a run group based on the ANN angular datasets.
116-
angular:
117-
#All entries to “run_groups” keyword are required as-is (cannot change the values or structure)!
118-
- base_args: ['@metric', '@dimension']
119-
constructor: Aerospike
120-
disabled: false #can change to true to disable this run-group
121-
docker_tag: ann-benchmarks-aerospike
122-
module: ann_benchmarks.algorithms.aerospike
123-
name: aerospike
124-
run_groups:
125-
cosine: #Should match Idx Type
126-
#This grouping is required
127-
args: [
128-
[cosine], #Idx Type (any Aerospike Index Type, case insensitive). This is required…
129-
#A collection of HnswParams where each param is run as a separate run for this Idx Type. This is required and must have at least one item.
130-
[{m: 8, ef_construction: 64, ef: 8},
131-
{m: 16, ef_construction: 128, ef: 8} ],
132-
#Unique Set/Index Name (optional, default True). See the “AVS_SET” environment variable above.
133-
[True],
134-
#True to Drop Idx and Re-Populate, optional default true. See “APP_DROP_IDX” environment variable above.
135-
[True],
136-
#Determines what phases are executed. Values are:
137-
# IdxPopulateOnly – only conduct the populate index phase,
138-
# QueryOnly – only perform the vector search phase,
139-
# AllOps – All phases (optional default value)
140-
[AllOps]
141-
]
142-
#This grouping is required
143-
query_args: [
144-
# If provided (optional), overrides the HnswParams defined above for the vector search phase
145-
[null, #Uses default defined above
146-
{ef: 10} #Override “ef” above
147-
]
148-
]
149-
#This defines another run group based on the ANN Euclidean datasets.
150-
#This shows using only the required params.
151-
euclidean:
152-
- base_args: ['@metric', '@dimension']
153-
constructor: Aerospike
154-
disabled: false
155-
docker_tag: ann-benchmarks-aerospike
156-
module: ann_benchmarks.algorithms.aerospike
157-
name: aerospike
158-
run_groups:
159-
SQUARED_EUCLIDEAN:
160-
args: [
161-
[SQUARED_EUCLIDEAN], #Idx Type
162-
[{m: 16, ef_construction: 100, ef: 100}]
163-
]
164-
query_args: [
165-
[]
166-
]
167-
```
1+
# Backoff Logic when encountering Resource Exhausted
2+
3+
This only applies when "ignoreExhaustedEvent" is false in the config.yml (the default). When this value is true, exhausted resource events are ignored and left to the healer.
4+
5+
# The “back-off” logic is as follows:
6+
7+
- When the exception is received:
8+
- First record in error will perform the following actions:
9+
- Signal the “main” populating task to go into “sleep mode” so that additional records will not be upserted.
10+
- A warning message is logged.
11+
- All records in error will do the following:
12+
- Call “wait for index completion.”
13+
- Once the index is built the following occurs:
14+
- Re-upsert the error records
15+
- If successful, signal the “main” populating task to re-start populating.
16+
- A warning message is logged, stating population has re-started.
17+
18+
# Environmental Variables
19+
20+
Below are the Environmental Variables:
21+
22+
- AVS_LOGLEVEL -- The Vector Client API's Log Level. Defaults to "WARNING"
23+
- Possible Values:
24+
- CRITICAL
25+
- FATAL
26+
- ERROR
27+
- WARNING
28+
- WARN
29+
- INFO
30+
- DEBUG
31+
- NOTSET
32+
33+
Note: The logging file is determined by "APP_LOGFILE"
34+
35+
- AVS_HOST -- The AVS Server's Address. Defaults to "localhost"
36+
- AVS_PORT -- The AVS Server's Port. Defaults to 5000
37+
- AVS_USELOADBALANCER -- The AVS Server's Address is a Load Balancer. Default False.
38+
- AVS_NAMESPACE -- The Vector's Namespace. Defaults to "test"
39+
- AVS_SET -- The Vector's Set name. Defaults to "ANN-data"
40+
41+
This behavior is determined by the "uniqueSetIdxName" argument defined in the config.yml file.
42+
43+
The default (True) behavior is to create a unique Set name where this is the prefix to that name.
44+
45+
The name has the following parts:
46+
47+
```
48+
{AVS_SET}_{ANN Distance Type}_{AVS Idx Type}_{Dimension}_{hnsw m}_{hnsw ef construction}_{hnsw ef}
49+
```
50+
51+
Example:
52+
53+
ANN-data_angular_COSINE_20_16_100_100
54+
55+
If "uniqueSetIdxName" is false, the Set name is as follows:
56+
57+
```
58+
{AVS_SET}_{ANN Distance Type}_{AVS Idx Type}
59+
```
60+
61+
Example:
62+
63+
ANN-data_angular_COSINE
64+
65+
- APP_LOGFILE -- The Aerospike's ANN Logging file. Default is "AerospikeANN.log".
66+
67+
The folder is always the current working directory.
68+
69+
- APP_LOGLEVEL -- The Aerospike's ANN Log Level. Defaults to "INFO"
70+
- Possible Values:
71+
- CRITICAL
72+
- FATAL
73+
- ERROR
74+
- WARNING
75+
- WARN
76+
- INFO
77+
- DEBUG
78+
- NOTSET
79+
80+
Note: For performance testing this should be set to "NOTSET".
81+
82+
When running in a docker container, logging is disabled.
83+
84+
- APP_DROP_IDX -- A Boolean value that will determine if the Vector index is dropped if it already exists. The default is to use "dropIdx" argument in the config.yml file.
85+
- APP_INDEX_SLEEP -- The amount of time to sleep after the index is dropped. The default is 0.
86+
87+
Possible values are:
88+
89+
- 0 -- Don't Sleep
90+
- \> 0 -- The number of seconds to sleep
91+
- APP_POPULATE_TASKS -- The number of concurrent record upsert (put) tasks performed during the index population phase. Once this many records have been upserted, the app waits until all upserts are completed and then processes the next set of records. The default is 5000.
92+
93+
Values:
94+
95+
- \< 0 -- All records are upserted, concurrently, and the app will only wait for the upsert completion before waiting for index completion.
96+
- 0 or 1 -- One record is upserted at a time (sync)
97+
- \> 1 -- The number of records upserted, concurrently (async), before the app waits for the upserts to complete.
98+
- APP_PINGAVS -- Checks to determine if the AVS server is reachable via ping. Default is False.
99+
- APP_CHECKRESULT -- Checks the Vector Search results for failed results or Zero Distance. Default is True
100+
101+
Note: This value is always false if running in a docker container.
102+
103+
This should be set to False when conducting performance testing!
104+
105+
The default bin name for the vectors is always "ANN_embedding".
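
The APP_POPULATE_TASKS values can be read as a simple batching rule. The following is a minimal Python sketch of that rule only, not the application's actual code: `populate_batches` is a hypothetical helper name, and the real app upserts asynchronously rather than yielding lists.

```python
from typing import Iterator, List, Sequence


def populate_batches(records: Sequence, populate_tasks: int) -> Iterator[List]:
    """Yield the groups of records that would be upserted before each wait.

    Hypothetical helper illustrating the documented APP_POPULATE_TASKS values.
    """
    if populate_tasks < 0:
        # Negative: every record is submitted concurrently as one group.
        yield list(records)
    elif populate_tasks <= 1:
        # 0 or 1: one record at a time (synchronous behaviour).
        for record in records:
            yield [record]
    else:
        # > 1: fixed-size groups; the caller waits after each group completes.
        for start in range(0, len(records), populate_tasks):
            yield list(records[start:start + populate_tasks])


if __name__ == "__main__":
    for batch in populate_batches(["r1", "r2", "r3", "r4", "r5"], 2):
        print(batch)
```

With five records and a value of 2, the sketch produces two groups of two records followed by a final group of one.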
106+
107+
# config.yml file
108+
109+
The Aerospike ANN config.yml file supports the different ANN run group configurations. It is suggested that the Aerospike ANN application be run using the ANN Distance Type configuration.
110+
111+
Using this configuration, we can match each ANN distance type (e.g., Angular, Euclidean, Jaccard) to the "best" Aerospike Vector index type. Below is an example of this configuration with comments regarding the behavior of each parameter:
112+
113+
```
114+
float:
115+
#This defines a run group based on the ANN angular datasets.
116+
angular:
117+
#All entries to “run_groups” keyword are required as-is (cannot change the values or structure)!
118+
- base_args: ['@metric', '@dimension']
119+
constructor: Aerospike
120+
disabled: false #can change to true to disable this run-group
121+
docker_tag: ann-benchmarks-aerospike
122+
module: ann_benchmarks.algorithms.aerospike
123+
name: aerospike
124+
run_groups:
125+
cosine: #Should match Idx Type
126+
#This grouping is required
127+
args: [
128+
[cosine], #Idx Type (any Aerospike Index Type, case insensitive). This is required…
129+
#A collection of HnswParams where each param is run as a separate run for this Idx Type. This is required and must have at least one item.
130+
[{m: 8, ef_construction: 64, ef: 8},
131+
{m: 16, ef_construction: 128, ef: 8} ],
132+
#Unique Set/Index Name (optional, default True). See the “AVS_SET” environment variable above.
133+
[True],
134+
#True to Drop Idx and Re-Populate, optional default true. See “APP_DROP_IDX” environment variable above.
135+
[True],
136+
#Determines what phases are executed. Values are:
137+
# IdxPopulateOnly – only conduct the populate index phase,
138+
# QueryOnly – only perform the vector search phase,
139+
# AllOps – All phases (optional default value)
140+
[AllOps]
141+
]
142+
#This grouping is required
143+
query_args: [
144+
# If provided (optional), overrides the HnswParams defined above for the vector search phase
145+
[null, #Uses default defined above
146+
{ef: 10} #Override “ef” above
147+
]
148+
]
149+
#This defines another run group based on the ANN Euclidean datasets.
150+
#This shows using only the required params.
151+
euclidean:
152+
- base_args: ['@metric', '@dimension']
153+
constructor: Aerospike
154+
disabled: false
155+
docker_tag: ann-benchmarks-aerospike
156+
module: ann_benchmarks.algorithms.aerospike
157+
name: aerospike
158+
run_groups:
159+
SQUARED_EUCLIDEAN:
160+
args: [
161+
[SQUARED_EUCLIDEAN], #Idx Type
162+
[{m: 16, ef_construction: 100, ef: 100}]
163+
]
164+
query_args: [
165+
[]
166+
]
167+
```
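
The Set naming rule described under AVS_SET, which the unique Set/Index Name argument toggles, can be sketched as follows. This is an illustrative Python sketch under the assumption that the parts are joined with single underscores, as in the documented examples. `build_set_name` and its parameter names are hypothetical and are not taken from the Aerospike module.

```python
def build_set_name(avs_set: str, distance_type: str, idx_type: str,
                   dimension: int, m: int, ef_construction: int, ef: int,
                   unique: bool = True) -> str:
    """Compose a Set name from the AVS_SET prefix and the run parameters.

    Hypothetical helper; the separator is assumed from the README examples.
    """
    parts = [avs_set, distance_type, idx_type]
    if unique:
        # Unique names also carry the dimension and the hnsw parameters.
        parts += [dimension, m, ef_construction, ef]
    return "_".join(str(part) for part in parts)


if __name__ == "__main__":
    print(build_set_name("ANN-data", "angular", "COSINE", 20, 16, 100, 100))
    print(build_set_name("ANN-data", "angular", "COSINE", 20, 16, 100, 100,
                         unique=False))
```

The two calls reproduce the example names given for the unique and non-unique cases.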
168+
169+
# HDF5 Dataset Additional Attributes
170+
171+
The following attributes are added in the resulting ANN HDF5 dataset (note that all added attributes are prefixed with "as_"):
172+
173+
- as_indockercontainer – True if this run was within a docker container. False if it was run natively
174+
- as_idx_name – The name of the index
175+
- as_idx_type – The Aerospike Vector Index Type
176+
- as_idx_binname – The Index’s Bin name
177+
- as_idx_hnswparams – The index’s “hnsw” parameters as passed into this run via the config file. Any missing or None values will use the default values defined by Aerospike Vector client/server.
178+
- as_idx_drop – True if the index will be dropped
179+
- as_idx_ignoreexhuseevents – True to ignore any “Exhausted Resource” errors, in which case the Aerospike Vector Healer is used to reconcile the index. If false, the internal “back-off” logic is used.
180+
- as_idx_definition_built - Only available when the database is populated. The actual Vector Index's definitions with default values.
181+
- as_actions – The actions performed in this run (e.g., All actions, Populate Index Only, Query Only, etc.)
182+
- as_host – The Aerospike Vector server
183+
- as_isloadbalancer – If present, the as_host is a load balancer
184+
- as_namespace – The Aerospike Namespace used for the run
185+
- as_set – The Aerospike Set name used for the run
186+
- as_train_shape - The dimensions of the training dataset which is used to populate the database.
187+
- as_query_hnswsearchparams – The Query’s “hnsw” parameters as passed into this run via the config file. Any missing or None values will use the default values defined by Aerospike Vector client/server.
188+
- as_query_checkresults – If true the query results are checked/validated. This should be false for timing runs.
189+
- as_query_no_result_cnt - The number of queries that returned empty results. Only available if the query check results are true.
190+
- as_query_no_neighbors_fnd - The number of queries that returned no neighbors. Only available if the query check results are true.
191+
- as_upserted_vectors – The number of vectors inserted
192+
- as_upserted_time_secs – The amount of time to perform all the inserts in seconds. This doesn’t include index build completion.
193+
- as_idx_completion_secs – The number of seconds to complete the index build. Does not include insert time.
194+
- as_total_polulation_time_secs – The complete time to insert and build the index.
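
Because every added attribute carries the "as_" prefix, the additions can be separated from the standard ANN attributes with a simple filter. The sketch below assumes the attributes are exposed as a mapping of names to values (for example, the `attrs` of a file opened with h5py). `aerospike_attributes` is a hypothetical helper, and the sample values are made up for illustration.

```python
from typing import Any, Dict, Mapping

AS_PREFIX = "as_"


def aerospike_attributes(attrs: Mapping[str, Any]) -> Dict[str, Any]:
    """Return only the attributes whose names start with the "as_" prefix."""
    return {name: value for name, value in attrs.items()
            if name.startswith(AS_PREFIX)}


if __name__ == "__main__":
    # Made-up sample standing in for the attributes of a result dataset.
    sample = {
        "distance": "angular",
        "as_idx_type": "COSINE",
        "as_namespace": "test",
        "as_upserted_vectors": 1000,
    }
    for name, value in sorted(aerospike_attributes(sample).items()):
        print(f"{name} = {value}")
```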
