148 changes: 148 additions & 0 deletions .gitignore
@@ -0,0 +1,148 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
debug/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
.pytest_cache/
.hypothesis/
pytestdebug.log

# Translations
*.mo
*.pot
*.po

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having different versions across platforms is not crucial, uncomment the next line:
# Pipfile.lock

# PEP 582; __pypackages__
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython cache files
cython_debug/

# Operating System Files
.DS_Store
Thumbs.db

# IDE / Editor Files
.vscode/
.idea/
*.suo
*.ntvs*.*
*.njsproj
*.sln
*.swp

# Benchmark data cache
data/
116 changes: 116 additions & 0 deletions README.docker.md
@@ -0,0 +1,116 @@
# Using the AVS Benchmark Tools with Docker

This document explains how to build and use the Docker container defined in the `dockerfile` to run the Aerospike Vector Search (AVS) benchmark tools (`hdf_import.py`, `hdf_query.py`, etc.). Using Docker provides a consistent and isolated environment with all necessary dependencies pre-installed.

## Prerequisites

* **Docker:** You need Docker installed on your system. Visit [docker.com](https://docs.docker.com/get-docker/) for installation instructions.
* **Docker Buildx (Recommended for Multi-Arch):** If you need to build images for an architecture other than your host machine's (e.g., building an `arm64` image on an `amd64` host, or vice versa), use Docker Buildx. It is usually included with modern Docker Desktop installations. A quick check for both prerequisites is shown after this list.
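
Both tools can be checked quickly from a terminal (the exact output varies by Docker release):

```bash
# Confirm the Docker CLI is installed
docker --version

# Confirm the Buildx plugin is available (only needed for multi-arch builds)
docker buildx version
```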

## Building the Docker Image

1. **Navigate:** Open a terminal or command prompt and navigate to the root directory of the `avs-benchmark-ann` project (where the `dockerfile` is located).
2. **Build:** Run the build command.

* **For your host's architecture:**
```bash
docker build -t avs-benchmark-tool .
```
*(You can replace `avs-benchmark-tool` with your preferred image name)*

* **For multi-architecture (amd64 & arm64) using buildx (Recommended):**
```bash
# Optional: Create and switch to a new builder instance if you haven't already
# docker buildx create --use --name mybuilder

# Build for both platforms and push to a registry (replace 'your-repo' if pushing)
docker buildx build --platform linux/amd64,linux/arm64 -t your-repo/avs-benchmark-tool:latest --push .

# OR Build for a specific platform and load locally (e.g., for arm64)
# docker buildx build --platform linux/arm64 -t avs-benchmark-tool:arm64 --load .
```
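
3. **Verify (optional):** If you built for your host's architecture (or loaded a platform-specific image with `--load`), you can confirm the image is available locally. This is only a quick sanity check and assumes the `avs-benchmark-tool` tag used above:
```bash
docker image ls avs-benchmark-tool
```
*(Images built with `--push` are sent to the registry and will not appear in this local list)*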

## Running Benchmark Scripts

The Docker container is designed to execute the Python benchmark scripts directly. Specify the script name and its arguments after the image name on the `docker run` command line.

**General Structure:**

```bash
docker run [DOCKER_OPTIONS] <IMAGE_NAME> <SCRIPT_NAME.py> [SCRIPT_ARGUMENTS...]
```

**Key Docker Options:**

* `--rm`: Automatically removes the container filesystem when the container exits. Recommended for benchmark runs.
* `-it`: Use if a script requires interactive input (though these benchmark scripts generally don't).
* `-v /path/on/host:/app/data`: **Highly Recommended for Persisting Datasets.** This mounts a directory from your host machine (`/path/on/host`) into the container at `/app/data`. The benchmark scripts cache downloaded datasets (like `glove-100-angular`) in `/app/data`. Using a volume mount prevents re-downloading datasets every time you run the container.
* Example: `-v "$(pwd)/data:/app/data"` mounts a directory named `data` inside your current working directory on the host. Create this `data` directory on your host if it doesn't exist. A command combining these options is shown after this list.
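
Putting these options together, a minimal invocation looks like this (using `--help` as a placeholder argument; any of the script arguments described below work the same way):

```bash
# Create the host-side cache directory once; it is reused across runs
mkdir -p data

# --rm removes the container afterwards; -v persists downloaded datasets under ./data
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool hdf_import.py --help
```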

**Connecting to AVS:**

* **AVS on Host Machine (Docker Desktop):** Often, you can use the special DNS name `host.docker.internal` as the hostname. The port will be the one AVS is listening on (e.g., 10000).
* Example: `--host host.docker.internal --vectorport 10000`
* **AVS in another Docker Container:** If AVS is running in another container on the same Docker network, use the AVS container's name as the hostname (see the sketch after this list).
* **AVS on a Remote Machine:** Use the remote machine's IP address or hostname. Ensure your Docker container can reach that address (network configuration/firewalls).
* **Using `--vectorloadbalancer`:** When connecting to a single AVS node (especially via `host.docker.internal`) or to an external load balancer address from within Docker, add the `--vectorloadbalancer` flag to your command. This tells the client library not to attempt cluster node discovery based on the seed node's potentially internal address, which prevents connection errors where the client tries to connect back to `127.0.0.1` inside the container. NOTE: `--vectorloadbalancer` forces an extra network hop, which can affect performance; when optimizing for performance, it is best to configure an AVS test cluster that does not need this flag. You may also need to configure the AVS node's [advertised listener addresses](https://aerospike.com/docs/vector/manage/config#advertised-listener) if you are running AVS in the cloud or in Docker.
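
For the "AVS in another Docker Container" case, a minimal sketch follows; the network name `avs-net` and the container name `avs` are assumptions and should match however you started your AVS container:

```bash
# Assumed setup: an AVS container named "avs" attached to the user-defined
# network "avs-net" (created beforehand with `docker network create avs-net`).
# Add --vectorloadbalancer if you hit discovery-related connection errors (see the note above).
docker run --rm --network avs-net -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_import.py \
--dataset glove-100-angular \
--host avs \
--vectorport <avs_port>
```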

### Example: Basic Benchmark Run

This example runs a small benchmark using the `glove-100-angular` dataset against an AVS cluster assumed to be accessible from the container (e.g., running on the host at port 10000).

1. **Create a local data directory (if it doesn't exist):**
```bash
mkdir -p data
```

2. **Step 1: Import Data & Create Index**
This command downloads the dataset (if not already in the mounted `./data` volume), connects to AVS, drops any existing index with the default name, creates a new one, and populates it.

```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_import.py \
--dataset glove-100-angular \
--host host.docker.internal \
--vectorport 10000 \
--vectorloadbalancer \
--idxdrop
```
*(Replace `host.docker.internal` and `10000` if your AVS setup differs)*

3. **Step 2: Query the Index**
Once the import is complete, run queries against the newly populated index.

```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_query.py \
--dataset glove-100-angular \
--host host.docker.internal \
--vectorport 10000 \
--vectorloadbalancer \
--runs 1 \
--limit 10 \
--check
```
*(Adjust the host/port as needed. The volume mount ensures `hdf_query.py` can also read the cached dataset and its metadata if required)*

### Other Script Examples

* **Download a BigANN Dataset (requires `axel` and `azcopy`, both installed in the image by the provided `dockerfile`):**
```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
bigann_download.py --dataset <bigann_dataset_name>
```

* **Create an HDF dataset from Aerospike data:**
```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_create_dataset.py \
--idx <source_index_name> \
--hdf data/<output_hdf_filename>.hdf5 \
--hosts <aerospike_db_host>:<aerospike_db_port> \
[other_args...]
```
*(Note: `--hdf data/...` saves the output HDF file within the mounted volume, making it accessible on your host)*

Remember to consult the main `README.md` or use the `--help` argument for each script (`docker run avs-benchmark-tool <script_name.py> --help`) to see all available options.
71 changes: 71 additions & 0 deletions dockerfile
@@ -0,0 +1,71 @@
# Base image: Using Python 3.10 on Debian Bullseye (slim version)
FROM python:3.10-slim-bullseye

# Set environment variables for Python and Pip
ENV PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=off \
PIP_DISABLE_PIP_VERSION_CHECK=on

# Install common system dependencies first
RUN apt-get update && \
apt-get install -y --no-install-recommends \
wget \
gpg \
curl \
axel \
ca-certificates && \
rm -rf /var/lib/apt/lists/*

# Install azcopy based on architecture using direct download via aka.ms links
# ARG TARGETARCH is automatically available during buildx
ARG TARGETARCH
RUN \
# Set DEBIAN_FRONTEND to noninteractive to avoid potential prompts
export DEBIAN_FRONTEND=noninteractive && \
AZCOPY_URL="" && \
if [ "$TARGETARCH" = "arm64" ]; then \
echo "Setting azcopy download URL for arm64 (using aka.ms link)" && \
AZCOPY_URL="https://aka.ms/downloadazcopy-v10-linux-arm64"; \
elif [ "$TARGETARCH" = "amd64" ]; then \
echo "Setting azcopy download URL for amd64 (using aka.ms link)" && \
# Note: aka.ms/downloadazcopy-v10-linux typically points to the amd64 version
AZCOPY_URL="https://aka.ms/downloadazcopy-v10-linux"; \
else \
echo "Unsupported architecture for azcopy: $TARGETARCH" && exit 1; \
fi && \
\
echo "Downloading azcopy from $AZCOPY_URL" && \
# Use curl with -L to follow redirects from aka.ms
curl -L "$AZCOPY_URL" -o /tmp/azcopy.tar.gz && \
# Extract the azcopy binary to /usr/local/bin
tar -xzf /tmp/azcopy.tar.gz -C /usr/local/bin --strip-components=1 --wildcards '*/azcopy' && \
chmod +x /usr/local/bin/azcopy && \
rm /tmp/azcopy.tar.gz && \
echo "Successfully installed latest azcopy via direct download for $TARGETARCH"

# Verify azcopy installation
RUN azcopy --version

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file first to leverage Docker's build cache
COPY requirements.txt .

# Install Python dependencies specified in requirements.txt
RUN pip install -r requirements.txt

# Copy the rest of the application code from the build context into the working directory
# This includes all .py files, the 'bigann' directory, README, etc.
COPY . .

# Create a directory for dataset caching by the scripts (e.g., ./data)
# This directory can be mounted as a volume from the host to persist downloaded datasets.
RUN mkdir -p /app/data

# Set the entrypoint to python3.
# When running the container, you will specify the script name and its arguments after the image name.
ENTRYPOINT ["python3"]

# Example: If you wanted a default command (uncomment and modify as needed):
# CMD ["hdf_import.py", "--help"]