Conversation

@Erotemic
Contributor

In relation to this issue: #2019

I think having a docker image that can recreate reported benchmarks is critical for scientific reproducibility.

This Dockerfile is a start toward that. In its current form it simply creates an image with the basic helm package installed in development mode.

It uses uv for efficiency. The idea is to first install basic apt packages, then set up a pinned version of uv, and then use that to create a base Python virtual environment that behaves similarly to how a developer would work on a host machine. I update the .bashrc and .profile so the virtual environment auto-activates when running tasks in the container.

To get helm into the image, I assume you have a local checkout; the build copies over the entire .git folder (which makes the image easier to update and develop off of), then checks out a specified version and does a basic install of dependencies.

Lastly, I add an entrypoint script that ensures any command you pass to docker run is executed in the context of the .bashrc environment.

I'm using Docker BuildKit to get the heredoc-style Dockerfile syntax, which lets me write a RUN step over multiple lines; IMO this makes it much easier to read, copy, and paste into a test container for development and debugging. The last RUN is basically an echo that bundles some documentation with the Docker image on how to build it and how to run the basic tests.
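To illustrate, a minimal sketch of the heredoc style (the base image and package names here are just examples, not the actual contents of this PR's Dockerfile):

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:22.04

# With BuildKit, a RUN step can be written as a heredoc: the body reads
# exactly like a shell script, with no "&& \" line continuations.
RUN <<EOF
set -e
apt-get update
apt-get install -y --no-install-recommends git curl ca-certificates wget
rm -rf /var/lib/apt/lists/*
EOF
```

The same body can be pasted directly into an interactive container when debugging, which is the main maintainability win.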

I've verified that with this setup I can run the example in the README and run a server so I can see results on my host machine.

--

Where I want to take this is getting the HEIM benchmarks in an easy state that can be reproduced. Currently I'm having trouble with this, but writing this base image is the first step.

@yifanmai
Collaborator

Thanks for attempting this. Your timing is a bit awkward because I was writing some Dockerfiles last week but have not gotten around to merging them. You can take a look at this draft pull request if you are curious: #3749 - feel free to comment on it.

I have a few comments about what you're doing here:

  • This Dockerfile doesn't give you a reproducible build process. There are some sources of non-determinism in it.
  • You can create a Docker image using this Dockerfile and use that as a starting point for your experiments, which would give you a working snapshot that you can easily recover to. I think this is a reasonable approach.
  • You don't need to merge the Dockerfile into the repository, if that is your intent. It should be sufficient to keep that image locally or in your own Docker repository.
  • There have been previous requests for HELM to publish official Docker images as part of our release cycle. (Dockerize helm for deployment #2019) I am open to this idea, but it would require some work to set up and to run the release process.
  • You may want to look at requirements.txt, which pins the versions of all transitive Python dependencies and produces an almost-reproducible Python environment (with some caveats). The GitHub Action definition in test.yml may be helpful here, too.
  • Unfortunately, the environment defined by requirements.txt is missing the HEIM requirements. The HEIM lead authors did not provide a lockfile for the environment they used. This is partly because requirements were manually installed from sources outside of PyPI.
  • To reiterate, because the HEIM authors did not use HELM's standard dependency management, even if we provided HELM Docker images with our release, the Docker image would not support HEIM out of the box.
  • If we were to provide official Dockerfiles and Docker images, the build process should be as simple as possible. Compared to the Dockerfiles in my draft PR earlier, the Dockerfile in this PR seems a lot more complicated.

@Erotemic
Contributor Author

Erotemic commented Jul 24, 2025

I think there are some fair points, but I think you might be missing some benefits of the way I've structured things. In any case, I think getting some form of dockerization out is important, and I'd be happy to work with you to land on something. I'll defer to maintainer preferences, but I'll make some arguments for what I'm doing as well.

Small Points

This Dockerfile doesn't give you a reproducible build process. There are some sources of non-determinism in it.

The only source of non-determinism should be the state of the HELM repo. That is by design, so that the latest version of helm is the default build. However, setting HELM_GIT_HASH should remove all non-determinism. Is there something else I missed?

You don't need to merge the Dockerfile into the repository, if that is your intent. It should be sufficient to keep that image locally or in your own Docker repository.

I think it is important for the Dockerfile to be part of the repository. I've seen a pattern where people keep a separate repo just for the Docker build, and that can make sense in multi-component systems. But when the entire application is a single repo, having a Dockerfile that guarantees the ability to bring that repo into a working executable state seems desirable to me.

On My Approach's Complexity

If we were to provide official Dockerfiles and Docker images, the build process should be as simple as possible. Compared to the Dockerfiles in my draft PR earlier, the Dockerfile in this PR seems a lot more complicated.

Yes, but the complexity is warranted. Let me explain:

  • A large part of the Dockerfile is just a heredoc at the end giving users documentation on how to use it. It has no impact on the build (other than adding one dummy layer at the end, but I think that's worth it to keep documentation coupled with the code). Note, much of this documentation could be refined and simplified; its current state is effectively my developer notes, but we can clean it up. Still, I do want to sell you on the idea of ending a Dockerfile with a heredoc that bundles documentation with the image.

  • There are 5 main parts of the Dockerfile; let me explain each to justify it, and also note places where we might simplify a bit. As a meta comment, I think the use of Docker BuildKit to get the RUN << EOF ... EOF syntax is important because it lets you write commands exactly as you would in a shell script. This makes developing and debugging much easier, and lets you combine multiple steps in a RUN without resorting to ugly && \ line continuations. I think it's much more maintainable.

  1. The initial apt install. This is important to get basic tools. I'm not sure all of them are 100% necessary, but I like to start off with more than I need and trim down as needed. (I initially thought wget could be removed, but it turns out to be required.)

  2. The installation of uv. I could make this part much smaller by just using curl -LsSf https://astral.sh/uv/install.sh | bash, but I really don't like trusting random URLs. So I pin a very specific version of uv and check that the hash of the installer is what I expect. This isn't perfect, because the uv install script itself points at URLs, but those have a smaller attack surface than the main install URL, and the install script does seem to have some checksum logic in it (although I would have preferred the checksums to be baked into the script rather than pulled from a URL). The other bit is that I wanted to make it easy to update uv as future versions release, which is why I use a bash associative array: adding the hash of a new version is a one-line change.
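The associative-array pattern can be sketched as follows (the checksum below is a PLACEHOLDER, not a real uv installer hash, and the actual download/verify commands are left as comments):

```shell
#!/usr/bin/env bash
set -euo pipefail

UV_VERSION=0.8.4

# One entry per supported uv release; supporting a new version is a
# one-line change. Placeholder hash, not a real checksum.
declare -A UV_INSTALLER_SHA256=(
    ["0.8.4"]="0000000000000000000000000000000000000000000000000000000000000000"
)

# Fail fast if the requested version has no pinned hash.
EXPECTED_SHA256=${UV_INSTALLER_SHA256[$UV_VERSION]:?no pinned hash for this uv version}
echo "uv $UV_VERSION installer sha256: $EXPECTED_SHA256"

# The real Dockerfile would then do something like:
#   curl -LO "https://astral.sh/uv/$UV_VERSION/install.sh"
#   echo "$EXPECTED_SHA256  install.sh" | sha256sum --check -
#   bash install.sh
```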

  3. The setup of the uv virtual environment. (Currently noted as "install uv", which is not correct.) In the past, I've had issues with running experiments in Docker because the typical pattern is that everything is installed as root. This causes subtle issues, and I think it is useful to have a Docker image that exactly mirrors what you might see in a normal development environment: namely, the default Python is managed in a virtual environment. I may be able to remove the cargo part of the path modification to simplify it, but the idea of this section is to set up a virtual environment as you normally would and have it auto-activate so the user doesn't need to think about it. Making this seamless requires one more step later.
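As a sketch of what this section does (the user name and paths are illustrative, not the exact ones in the PR):

```dockerfile
# Create a non-root user and a uv-managed venv, then auto-activate it
# for interactive (.bashrc) and login (.profile) shells alike.
RUN <<EOF
useradd --create-home user
uv venv /home/user/venv --python 3.10
echo 'source /home/user/venv/bin/activate' >> /home/user/.bashrc
echo 'source /home/user/venv/bin/activate' >> /home/user/.profile
EOF
USER user
```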

  4. The next part simply copies the .git folder from the HELM repo on the host machine into the Docker image, then uses git to check out the appropriate state and install HELM in development mode. This makes it possible to update the image to new versions by adding only a small delta layer on top, rather than rebuilding something nearly identical and forcing users to re-download gigabytes of similar layers because of a few changed bits. Note, I have not added any script to perform this update yet, but this RUN command is what will make that possible.
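A sketch of this step, with illustrative paths (the uv pip install -e . line stands in for whatever the actual dependency-install step is):

```dockerfile
ARG HELM_GIT_HASH=main

# Only .git is copied; the reset below materializes the working tree
# at the requested commit, so updating the image to a new version only
# adds a small delta layer.
COPY .git /helm/.git
RUN <<EOF
cd /helm
git reset --hard "$HELM_GIT_HASH"
uv pip install -e .
EOF
```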

  5. Lastly, in step 3 I said we needed one more step to make auto-activation of the virtual environment seamless. The idea is to write a small entrypoint script that takes whatever command the user passes to the image and executes it in a bash environment where the Python virtualenv we set up has been activated. Frankly, I don't 100% understand why this is necessary, as I thought ENTRYPOINT ["/bin/bash", "-lc"] would have taken care of it, but my tests failed without it.
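A sketch of the entrypoint idea (paths illustrative). One plausible reason bash -lc alone didn't work: on Debian-based images the stock .bashrc returns early in non-interactive shells, so sourcing the venv's activate script directly is more reliable:

```dockerfile
RUN <<EOF
cat > /entrypoint.sh <<'SCRIPT'
#!/bin/bash
# Activate the venv explicitly instead of relying on .bashrc, which
# typically returns early when the shell is non-interactive.
source /home/user/venv/bin/activate
exec "$@"
SCRIPT
chmod +x /entrypoint.sh
EOF
ENTRYPOINT ["/entrypoint.sh"]
```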

Comparison to #3749's alternative dockerization

There are also some issues with your Dockerfiles:

  • The system seems to be based on the python:3.10 image, which, AFAIK, does not have CUDA support, so it would not be able to run GPU-based evaluations.

  • Your Dockerfile obtains HELM's requirements from a URL pointing at the main branch, which has reproducibility issues whenever main is updated. It may also disagree with the local checkout of helm used in the base Dockerfile. In contrast, my version can pin the version of the helm repo, and the requirements are coupled with the version of helm the image is being built for.

  • You have 4 different Dockerfiles, one entrypoint per main helm command. In contrast, I produce a single image whose entrypoint lets you run any helm command. Your design makes HELM work like a Docker application requiring multiple passes between the host system and Docker containers, whereas my design encapsulates the entire system in one image and runs all steps (benchmark, summarize, server) within the same environment.

Issues with my approach

I got some external feedback on my approach, and I think there are some improvements that could make it easier and clearer how to run custom deployments. This involves documentation for mounting custom experiment configurations and schemas from the local machine, and perhaps not making the root of the HELM repo the default working directory. Namely, I should add docs with an example of how to mount a custom prod_env and benchmark_output directory. (I'm still getting familiar with the exact artifact structure of HELM.)
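For example, something along these lines (the mount points and in-image working directory are assumptions, not what the PR currently does):

```shell
# Mount a local prod_env (credentials, model deployments) and a
# persistent benchmark_output directory into the container.
docker run --rm --gpus=all \
    -v "$PWD/prod_env:/mnt/workdir/prod_env" \
    -v "$PWD/benchmark_output:/mnt/workdir/benchmark_output" \
    --workdir /mnt/workdir \
    -it helm:latest \
    helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 \
        --suite my-suite --max-eval-instances 10
```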

@Erotemic force-pushed the dockerfile branch 3 times, most recently from baddb44 to 898eb7a on July 31, 2025, 18:56
@Erotemic
Contributor Author

Erotemic commented Jul 31, 2025

@yifanmai

I made an effort to clean up the Dockerfile by moving all of the docs into an associated README, and I also modified the entrypoint-creation code to try to explain more clearly why it is needed.

However, my current tests are failing, but it seems to be an issue with the current main rather than with this branch. In fact, this demonstrates the robustness of this Docker image: even with main in its current borked state, we can pin HELM to a known working version and run through the demo, e.g.

# Determine version of helm, uv, and python to use
export HELM_GIT_HASH=3f20a6cbb359d36dce028534aa0f2a3809f829dd
export UV_VERSION=0.8.4
export PYTHON_VERSION=3.10

# Build the image with version-specific tags
DOCKER_BUILDKIT=1 docker build --progress=plain \
    -t helm:${HELM_GIT_HASH}-uv${UV_VERSION}-python${PYTHON_VERSION} \
    --build-arg PYTHON_VERSION=$PYTHON_VERSION \
    --build-arg UV_VERSION=$UV_VERSION \
    --build-arg HELM_GIT_HASH=$HELM_GIT_HASH \
    -f ./dockerfiles/helm.dockerfile .

docker tag helm:${HELM_GIT_HASH}-uv${UV_VERSION}-python${PYTHON_VERSION} helm:latest

mkdir -p ./shared_directory

# Run a benchmark
docker run --rm --gpus=all \
    -v $PWD/shared_directory:/mnt/shared_directory \
    --workdir /mnt/shared_directory \
    -it helm:latest \
    helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10

# Summarize the results:
docker run --rm --gpus=all \
    -v $PWD/shared_directory:/mnt/shared_directory \
    --workdir /mnt/shared_directory \
    -it helm:latest \
    helm-summarize --suite my-suite

# Start a web server to view the results:
docker run --rm --gpus=all \
    -v $PWD/shared_directory:/mnt/shared_directory \
    --workdir /mnt/shared_directory \
    -p 8000:8000 \
    -it helm:latest \
    helm-server --suite my-suite

EDIT: I also optimized the file by using BuildKit cache mounts so it doesn't need to re-download apt or uv packages when building on the same machine. I also moved the Docker ARG statements next to where they are used: it seems that changing an ARG invalidates every layer after it, which is annoying, so they can't all be at the top.

@Erotemic
Contributor Author

Erotemic commented Aug 3, 2025

@yifanmai I could also simplify the main Docker image by moving all of the optimized uv work into a separate image and having the helm Docker image inherit from it. That would reduce the size of the main helm.dockerfile considerably, at the cost of a two-stage build process and an additional file.

The separated uv.dockerfile would look similar to this one that I use in my scripts to build CI images: https://gitlab.kitware.com/computer-vision/ci-docker/-/blob/main/uv.dockerfile?ref_type=heads
