148 changes: 148 additions & 0 deletions .gitignore
@@ -0,0 +1,148 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
debug/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
.pytest_cache/
.hypothesis/
pytestdebug.log

# Translations
*.mo
*.pot
*.po

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having different versions across platforms is not crucial, uncomment the next line:
# Pipfile.lock

# PEP 582; __pypackages__
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython cache files
cython_debug/

# Operating System Files
.DS_Store
Thumbs.db

# IDE / Editor Files
.vscode/
.idea/
*.suo
*.ntvs*.*
*.njsproj
*.sln
*.swp

# Benchmark data cache
data/
116 changes: 116 additions & 0 deletions README.docker.md
@@ -0,0 +1,116 @@
# Using the AVS Benchmark Tools with Docker

This document explains how to build and use the Docker container defined in the `dockerfile` to run the Aerospike Vector Search (AVS) benchmark tools (`hdf_import.py`, `hdf_query.py`, etc.). Using Docker provides a consistent and isolated environment with all necessary dependencies pre-installed.

## Prerequisites

* **Docker:** You need Docker installed on your system. Visit [docker.com](https://docs.docker.com/get-docker/) for installation instructions.
* **Docker Buildx (Recommended for Multi-Arch):** If you need to build images for an architecture other than your host machine's (e.g., building an `arm64` image on an `amd64` host, or vice versa), use Docker Buildx. It is usually included with modern Docker Desktop installations. A quick check for both prerequisites is shown after this list.
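
Both tools can be checked quickly from a terminal (the exact output varies by Docker release):

```bash
# Confirm the Docker CLI is installed
docker --version

# Confirm the Buildx plugin is available (only needed for multi-arch builds)
docker buildx version
```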

## Building the Docker Image

1. **Navigate:** Open a terminal or command prompt and navigate to the root directory of the `avs-benchmark-ann` project (where the `dockerfile` is located).
2. **Build:** Run the build command.

* **For your host's architecture:**
```bash
docker build -t avs-benchmark-tool .
```
*(You can replace `avs-benchmark-tool` with your preferred image name)*

* **For multi-architecture (amd64 & arm64) using buildx (Recommended):**
```bash
# Optional: Create and switch to a new builder instance if you haven't already
# docker buildx create --use --name mybuilder

# Build for both platforms and push to a registry (replace 'your-repo' if pushing)
docker buildx build --platform linux/amd64,linux/arm64 -t your-repo/avs-benchmark-tool:latest --push .

# OR Build for a specific platform and load locally (e.g., for arm64)
# docker buildx build --platform linux/arm64 -t avs-benchmark-tool:arm64 --load .
```
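
3. **Verify (optional):** If you built for your host's architecture (or loaded a platform-specific image with `--load`), you can confirm the image is available locally. This is only a quick sanity check and assumes the `avs-benchmark-tool` tag used above:
```bash
docker image ls avs-benchmark-tool
```
*(Images built with `--push` are sent to the registry and will not appear in this local list)*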

## Running Benchmark Scripts

The Docker container is designed to execute the Python benchmark scripts directly. Specify the script name and its arguments after the image name on the `docker run` command line.

**General Structure:**

```bash
docker run [DOCKER_OPTIONS] <IMAGE_NAME> <SCRIPT_NAME.py> [SCRIPT_ARGUMENTS...]
```

**Key Docker Options:**

* `--rm`: Automatically removes the container filesystem when the container exits. Recommended for benchmark runs.
* `-it`: Use if a script requires interactive input (though these benchmark scripts generally don't).
* `-v /path/on/host:/app/data`: **Highly Recommended for Persisting Datasets.** This mounts a directory from your host machine (`/path/on/host`) into the container at `/app/data`. The benchmark scripts cache downloaded datasets (like `glove-100-angular`) in `/app/data`. Using a volume mount prevents re-downloading datasets every time you run the container.
* Example: `-v "$(pwd)/data:/app/data"` mounts a directory named `data` inside your current working directory on the host. Create this `data` directory on your host if it doesn't exist. A command combining these options is shown after this list.
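
Putting these options together, a minimal invocation looks like this (using `--help` as a placeholder argument; any of the script arguments described below work the same way):

```bash
# Create the host-side cache directory once; it is reused across runs
mkdir -p data

# --rm removes the container afterwards; -v persists downloaded datasets under ./data
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool hdf_import.py --help
```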

**Connecting to AVS:**

* **AVS on Host Machine (Docker Desktop):** Often, you can use the special DNS name `host.docker.internal` as the hostname. The port will be the one AVS is listening on (e.g., 10000).
* Example: `--host host.docker.internal --vectorport 10000`
* **AVS in another Docker Container:** If AVS is running in another container on the same Docker network, use the AVS container's name as the hostname (see the sketch after this list).
* **AVS on a Remote Machine:** Use the remote machine's IP address or hostname. Ensure your Docker container can reach that address (network configuration/firewalls).
* **Using `--vectorloadbalancer`:** When connecting to a single AVS node (especially via `host.docker.internal`) or to an external load balancer address from within Docker, add the `--vectorloadbalancer` flag to your command. This tells the client library not to attempt cluster node discovery based on the seed node's potentially internal address, which prevents connection errors where the client tries to connect back to `127.0.0.1` inside the container. NOTE: `--vectorloadbalancer` forces an extra network hop, which can affect performance; when optimizing for performance, it is best to configure an AVS test cluster that does not need this flag. You may also need to configure the AVS node's [advertised listener addresses](https://aerospike.com/docs/vector/manage/config#advertised-listener) if you are running AVS in the cloud or in Docker.
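
For the "AVS in another Docker Container" case, a minimal sketch follows; the network name `avs-net` and the container name `avs` are assumptions and should match however you started your AVS container:

```bash
# Assumed setup: an AVS container named "avs" attached to the user-defined
# network "avs-net" (created beforehand with `docker network create avs-net`).
# Add --vectorloadbalancer if you hit discovery-related connection errors (see the note above).
docker run --rm --network avs-net -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_import.py \
--dataset glove-100-angular \
--host avs \
--vectorport <avs_port>
```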

### Example: Basic Benchmark Run

This example runs a small benchmark using the `glove-100-angular` dataset against an AVS cluster assumed to be accessible from the container (e.g., running on the host at port 10000).

1. **Create a local data directory (if it doesn't exist):**
```bash
mkdir -p data
```

2. **Step 1: Import Data & Create Index**
This command downloads the dataset (if not already in the mounted `./data` volume), connects to AVS, drops any existing index with the default name, creates a new one, and populates it.

```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_import.py \
--dataset glove-100-angular \
--host host.docker.internal \
--vectorport 10000 \
--vectorloadbalancer \
--idxdrop
```
*(Replace `host.docker.internal` and `10000` if your AVS setup differs)*

3. **Step 2: Query the Index**
Once the import is complete, run queries against the newly populated index.

```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_query.py \
--dataset glove-100-angular \
--host host.docker.internal \
--vectorport 10000 \
--vectorloadbalancer \
--runs 1 \
--limit 10 \
--check
```
*(Adjust the host/port as needed. The volume mount ensures `hdf_query.py` can also read the cached dataset and its metadata if required)*

### Other Script Examples

* **Download a BigANN Dataset (requires `axel` and `azcopy`, both installed in the image by the provided `dockerfile`):**
```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
bigann_download.py --dataset <bigann_dataset_name>
```

* **Create an HDF dataset from Aerospike data:**
```bash
docker run --rm -v "$(pwd)/data:/app/data" avs-benchmark-tool \
hdf_create_dataset.py \
--idx <source_index_name> \
--hdf data/<output_hdf_filename>.hdf5 \
--hosts <aerospike_db_host>:<aerospike_db_port> \
[other_args...]
```
*(Note: `--hdf data/...` saves the output HDF file within the mounted volume, making it accessible on your host)*

Remember to consult the main `README.md` or use the `--help` argument for each script (`docker run avs-benchmark-tool <script_name.py> --help`) to see all available options.
71 changes: 71 additions & 0 deletions dockerfile
@@ -0,0 +1,71 @@
# Base image: Using Python 3.10 on Debian Bullseye (slim version)
FROM python:3.10-slim-bullseye

# Set environment variables for Python and Pip
ENV PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=off \
PIP_DISABLE_PIP_VERSION_CHECK=on

# Install common system dependencies first
RUN apt-get update && \
apt-get install -y --no-install-recommends \
wget \
gpg \
curl \
axel \
ca-certificates && \
rm -rf /var/lib/apt/lists/*

# Install azcopy based on architecture using direct download via aka.ms links
# ARG TARGETARCH is automatically available during buildx
ARG TARGETARCH
RUN \
# Set DEBIAN_FRONTEND to noninteractive to avoid potential prompts
export DEBIAN_FRONTEND=noninteractive && \
AZCOPY_URL="" && \
if [ "$TARGETARCH" = "arm64" ]; then \
echo "Setting azcopy download URL for arm64 (using aka.ms link)" && \
AZCOPY_URL="https://aka.ms/downloadazcopy-v10-linux-arm64"; \
elif [ "$TARGETARCH" = "amd64" ]; then \
echo "Setting azcopy download URL for amd64 (using aka.ms link)" && \
# Note: aka.ms/downloadazcopy-v10-linux typically points to the amd64 version
AZCOPY_URL="https://aka.ms/downloadazcopy-v10-linux"; \
else \
echo "Unsupported architecture for azcopy: $TARGETARCH" && exit 1; \
fi && \
\
echo "Downloading azcopy from $AZCOPY_URL" && \
# Use curl with -L to follow redirects from aka.ms
curl -L "$AZCOPY_URL" -o /tmp/azcopy.tar.gz && \
# Extract the azcopy binary to /usr/local/bin
tar -xzf /tmp/azcopy.tar.gz -C /usr/local/bin --strip-components=1 --wildcards '*/azcopy' && \
chmod +x /usr/local/bin/azcopy && \
rm /tmp/azcopy.tar.gz && \
echo "Successfully installed latest azcopy via direct download for $TARGETARCH"

# Verify azcopy installation
RUN azcopy --version

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file first to leverage Docker's build cache
COPY requirements.txt .

# Install Python dependencies specified in requirements.txt
RUN pip install -r requirements.txt

# Copy the rest of the application code from the build context into the working directory
# This includes all .py files, the 'bigann' directory, README, etc.
COPY . .

# Create a directory for dataset caching by the scripts (e.g., ./data)
# This directory can be mounted as a volume from the host to persist downloaded datasets.
RUN mkdir -p /app/data

# Set the entrypoint to python3.
# When running the container, you will specify the script name and its arguments after the image name.
ENTRYPOINT ["python3"]

# Example: If you wanted a default command (uncomment and modify as needed):
# CMD ["hdf_import.py", "--help"]