Update accumulo-proxy Docker container to be more size efficient #22
Comments
I have documented this here as an issue because I'll raise an issue on the main accumulo project to do the same work and cross-reference.

Background
Currently we have 4 TAR files in the image, 3 of which are downloaded (Hadoop, Zookeeper, Accumulo) to get their libraries. The first cache layer is the download layer, where we download the 3 files:

Then we untar the files into their correct locations in 3 separate calls:

Current implementation
Here are some stats for the current implementation.
Before the download
After the download
After the untar and relocation
Summary

Suggested Approach
I'm going to update the download_bin() function to download the binaries, extract them and delete the original downloads. This should a) stop us from invalidating the cache layer and b) reduce the image size by at least the combined tar file size (384MB), which would bring the overall container down to approx. 1.5GB. That is still big, but a reduction of around 22%. This approach would then be mirrored in the accumulo-docker project to keep consistency. |
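The download, extract and delete sequence described above could be sketched roughly as follows. This is not the project's actual download_bin(); the function names, arguments and use of curl and a temporary file are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: names and arguments are hypothetical, not the real download_bin().
set -eu

# Unpack a gzipped tarball into dest, then delete the tarball so it is
# already gone when the enclosing RUN instruction commits its layer.
extract_and_remove() {
  local tarball="$1" dest="$2"
  mkdir -p "$dest"
  # --strip-components=1 drops the top-level directory stored in the archive
  tar -xzf "$tarball" -C "$dest" --strip-components=1
  rm -f "$tarball"
}

# Download url to a temporary file, then unpack it and remove the download.
download_extract_clean() {
  local url="$1" dest="$2" tarball
  tarball="$(mktemp)"
  curl -fsSL -o "$tarball" "$url"
  extract_and_remove "$tarball" "$dest"
}
```

The saving only appears if the whole sequence runs inside one RUN instruction: a file deleted in a later instruction still occupies space in the earlier layer that added it.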
Isn't this what multi-stage builds are for? https://docs.docker.com/develop/develop-images/multistage-build/ |
You could use a multistage build here but I am not sure it is necessary. |
@madrob I would usually associate multi-stage builds with simplifying or shrinking a compilation-based tool, where you perhaps need a tonne of libraries installed to compile but the compiled binary is standalone. This was more about being smart about how we lay out each cache layer: by splitting the download and untar across two commands we essentially doubled the size.

I'm happy to look at a multi-stage build approach if you have a good idea of where to draw the line/split. I couldn't spot an easy one that made sense. The only idea I came up with is using a builder to acquire the binaries (hadoop, accumulo, zookeeper, accumulo-proxy) and then using the second image definition to grab these binaries. I welcome your thoughts though, as I'm no expert in multi-stage builds. I've used them at work and on home projects a few times, but mostly for making consistent compilation environments, e.g. with GCC or test tools in, that aren't needed for running the app.

If it helps, I did push a branch working on this ticket (#23), but I still need to verify something on it. I got very distracted going down a rabbit hole of seeing whether I could get accumulo-proxy to run on an Alpine Linux backed JDK to reduce the size further. Have a gander and see what you think. Given I'd like to take the same approach for the main accumulo-docker image (it suffers from the same problem), it'd be good to get eyes on this. |
So it's worth noting that the pull request I uploaded (#22) has brought the generated image size down from 1.86GB to 1.46GB, a reduction of 21.5%.

Two further ideas on dropping disk usage; let me know if you think these are worth pursuing.

Remove Hadoop docs
Potential saving: 500M
Hadoop has a tonne of docs, which are useful in general but not when you're only using Hadoop for its libraries (e.g. in this container environment). Does anyone know whether it's acceptable to install the Hadoop release (in this case hadoop-3.2.1) and then remove the hadoop-3.2.1/share/doc folder? I would still need to conduct some testing, but this would potentially save us 499.1M (noting Hadoop's total install size is 897.8M).

Switch to alpine
Potential saving: 400M
I took a quick detour today down the openjdk:8-alpine3.9 rabbit hole. It turns out (at least on the face of it) that moving to Alpine isn't actually too difficult for accumulo-proxy; the only thing missing is bash (Alpine ships with sh). Whilst I can probably rewrite accumulo-proxy to not need bash and to work in both bash and sh environments, the rabbit hole continues when you start looking at things like our use of [...]

A quick but not great solution is to take the Alpine version and add bash to it. I took this for a quick tour and got the image size down to 1.06GB, so that's approx. another 400M saved.

I quickly mocked an example of combining these 3 changes:
I would expect us to also be able to make the same changes to accumulo-docker. |
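As a rough illustration of the docs-removal idea only (whether dropping the docs is acceptable is still the open question in this thread), the step might look like the sketch below. The function name and its single argument are hypothetical; the share/doc path is the one quoted in the comment above.

```shell
#!/usr/bin/env bash
# Sketch only: strip_hadoop_docs and its argument are hypothetical names.
set -eu

# Remove the bundled documentation from a Hadoop install and print the
# install size (in KiB, as reported by du -sk) before and after.
strip_hadoop_docs() {
  local hadoop_home="$1" before after
  before="$(du -sk "$hadoop_home" | cut -f1)"
  rm -rf "$hadoop_home/share/doc"
  after="$(du -sk "$hadoop_home" | cut -f1)"
  echo "before=${before}K after=${after}K"
}

# Example call (path is an assumption about where Hadoop was unpacked):
#   strip_hadoop_docs /opt/hadoop-3.2.1
```

As with the tarball cleanup, this would need to run in the same RUN instruction that unpacks Hadoop, otherwise the docs remain in the earlier layer and the image does not shrink.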
Nice. To me it is fine to just install bash (apk add on Alpine) vs rewriting. Thanks for looking into this @volmasoft |
@mjwall do you have any views/experience on whether it's acceptable to drop a whole portion of a Hadoop install (the docs) and whether this is restricted due to licensing etc? I am no expert in this area so I don't want to do something that could potentially cause issues. I'll wrap the alpine change in today and update the pull. |
@mjwall @keith-turner I'm conscious that the pull request is still open here: #23. Is any further work required? If we're good to merge then I can take a look at the accumulo-docker container to keep consistency. |
@volmasoft I think there were some outstanding comments made by @keith-turner on the PR #23 that haven't yet been addressed, regarding the preference for the jre-slim Java image. We could probably merge it once all the comments have been addressed. |
Apologies for the delay, I'll aim to get this boxed off this week. |
During the pull request here: #20 @mjwall spotted that we were being a bit inefficient with our container size by storing the tar before extracting it.
This should be cleaned up and ideally done in a single step by updating the download_bin() function.
For consistency's sake we should ideally also do this in the accumulo-docker repo: https://github.com/apache/accumulo-docker/blob/master/Dockerfile