Skip to content

Support for ARM, updated PDFAlto, Docker multi-architecture build #1165

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

lfoppiano
Copy link
Collaborator

@lfoppiano lfoppiano commented Sep 14, 2024

Finally I nailed it!

This PR provide a multi-architecture build for amd64/arm64. I've switched to the eclipse-adoptium-17-jdk/jre but, for the moment, I'm using ubuntu 20.04 (focal) instead of ubuntu 22. For now Is limited to the manual build for the CRF only image.

This requires some tests, here here here the resulting image.

This PR should help to solve #1089, #928, #1014, #1119

Update: At the moment the ARM support for the Deep Learning image is not working natively because there is no linux/arm64 docker image of the base images (even of tensorflow/tensorflow:2.17.0)

@coveralls
Copy link

coveralls commented Sep 20, 2024

Coverage Status

coverage: 40.295% (+0.02%) from 40.279%
when pulling 933f01c on multi-arch-docker-image
into 1ad5901 on master.

@lfoppiano lfoppiano marked this pull request as draft November 20, 2024 16:27
@heijligers
Copy link

I'm getting

rosetta error: failed to open elf at /lib64/ld-linux-x86-64.so.2

when trying to run on macos m1

@lfoppiano
Copy link
Collaborator Author

@heijligers which image did you try?

This one was correctly built with multi arch and arm support: lfoppiano/grobid:latest-crf-multi-arch but I did not manage to get it to work in a stable way.

My next step was to try to solve this issue with spawning process by looking on a way to have a conditional decision in the Dockerfile

Any help is welcome for this issue 🙂

@AaronNGray
Copy link

I am currently working on this for the default CRF Docker (Docker.crf) .

This needs work :-

  • ARM64 builds of a number of libraries
  • generalizing for different architectures
  • releases implementing for all the various sub project repos involved

There seems a number of libraries involved :-

tini

https://github.com/krallin/tini
There is a current release of tini for ARM64 :-
https://github.com/krallin/tini/releases/download/v0.19.0/tini-arm64
I have tried this and it works.

pdfalto

https://github.com/kermitt2/pdfalto
This builds fine on x64, and armhf, but has a compile overload call issue with ICU C++.
#1230 (comment)

libwapiti

https://github.com/kermitt2/wapiti

libcrfpp

https://github.com/taku910/crfpp/

libjep

https://github.com/ninia/jep
libjep is a library for Embed Python in Java it involved JNI. This potentially is more complex depending upon the ARM64 JNI support.

I will update this comment as a tracking comment as I make progress.

@AaronNGray
Copy link

@lfoppiano It seems the way to deal with multi platform and multi architecture OS/ARCH builds is with build and buildx :-

docker build --platform linux/amd64,linux/arm64 -t */*:* .
docker buildx build \
--push \
--platform linux/arm/v7,linux/arm64/v8,linux/amd64 \ --tag <username>/<repo>:<tag> .

https://github.com/docker/buildx?tab=readme-ov-file#building-multi-platform-images

This should work across multiple docker images :-

https://github.com/AaronNGray/docker-arch-test

@lfoppiano
Copy link
Collaborator Author

@AaronNGray yes, this is what I believe I did to build the image with github actions, however I did not succeed in having a running docker image for Apple M1.

@lfoppiano lfoppiano self-assigned this Jan 28, 2025
@lfoppiano lfoppiano modified the milestones: 0.8.2, 0.9.0 Feb 16, 2025
@lfoppiano lfoppiano force-pushed the multi-arch-docker-image branch 2 times, most recently from a2d0283 to 49e9166 Compare May 22, 2025 10:07
@AaronNGray
Copy link

Good going. I posted some stuff to check regarding ABI's and compilers on Grobid-core Discord, it maybe obvious but may need checking.

@lfoppiano lfoppiano force-pushed the multi-arch-docker-image branch 2 times, most recently from 9cb10e5 to 9a60975 Compare May 26, 2025 19:34
@lfoppiano lfoppiano force-pushed the multi-arch-docker-image branch from 9a60975 to fb66529 Compare May 26, 2025 21:04
@lfoppiano lfoppiano force-pushed the multi-arch-docker-image branch from 770cc61 to 0ea1ce7 Compare May 26, 2025 22:41
@lfoppiano lfoppiano force-pushed the multi-arch-docker-image branch from 0ea1ce7 to d95b92a Compare May 26, 2025 23:37
@lfoppiano lfoppiano marked this pull request as ready for review May 27, 2025 06:25
@lfoppiano
Copy link
Collaborator Author

lfoppiano commented May 27, 2025

I tested the image lfoppiano/grobid:latest-crf-multi-arch on a M2 without rosetta enabled and the native ARM image worked well. I procesed around 1700 PDF in about 30 minutes, without any crash or problem.

docker run --platform linux/arm64 --init -p 8070:8070 lfoppiano/grobid:latest-crf-multi-arch

@lfoppiano
Copy link
Collaborator Author

There is no image for tensorflow/tensorflow for linux/arm64 at the moment, so the DL image needs to be ran either on Linux or through Rosetta (not sure and not guaranteed it will work)

@lfoppiano
Copy link
Collaborator Author

@heijligers I believe that the latest image lfoppiano/grobid:latest-crf-multi-arch should work on an ARM

@lfoppiano lfoppiano requested a review from Copilot May 27, 2025 08:20
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements multi-architecture Docker builds for amd64 and arm64, updates platform-specific scripts and configuration, and adjusts CI workflows to support the new targets.

  • Add ARM64-specific pdfalto configs and wrapper script
  • Enhance Dockerfiles with TARGETARCH and fix cleanup patterns
  • Update GitHub Actions to build and push multi-arch CRF images

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
grobid-home/pdfalto/lin_arm-64/xpdfrc Add language support mappings for ARM64
grobid-home/pdfalto/lin_arm-64/pdfalto_server New bash wrapper to enforce timeouts, memory limits, and signals
grobid-home/lib/lin_arm-64/TODO List of pending ARM64 library integrations
grobid-core/src/main/java/org/grobid/core/utilities/Utilities.java Extend OS detection to include Linux ARM
grobid-core/src/main/java/org/grobid/core/utilities/GrobidConfig.java Correct typo in class documentation
doc/Grobid-docker.md Document ARM64 usage and required --platform flag
Dockerfile.evaluation Capitalize “AS runtime”
Dockerfile.delft Pull architecture-specific Tini binary via TARGETARCH
Dockerfile.crf Fix .git/ copy, refine cleanup, and use TARGETARCH for Tini
.github/workflows/ci-build-unstable.yml Add ARM and macOS runners to matrix, bump to Java 17 Temurin
.github/workflows/ci-build-manual-crf.yml Switch to Buildx and multi-arch build-push for CRF image

@lfoppiano lfoppiano force-pushed the multi-arch-docker-image branch from 5bc1e55 to ece8953 Compare May 27, 2025 11:28
@lfoppiano lfoppiano requested a review from Copilot May 27, 2025 11:35
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces multi-architecture build support for amd64 and arm64 by updating various Dockerfiles, shell scripts, and CI workflows. Key changes include:

  • Updated shell scripts for improved argument handling, command verification, and timeout enforcement.
  • Adjustments in Dockerfiles to incorporate TARGETARCH and multi-architecture support.
  • Minor code and documentation fixes, such as enhanced OS detection and spelling corrections.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
grobid-home/pdfalto/mac_arm-64/pdfalto_server & grobid-home/pdfalto/mac-64/pdfalto_server Improved argument handling and added command executable checks.
grobid-home/pdfalto/lin_arm-64/pdfalto_server & grobid-home/pdfalto/lin-64/pdfalto_server Enhanced timeout enforcement and memory control with OS-specific considerations.
grobid-core/src/main/java/org/grobid/core/utilities/Utilities.java
grobid-core/src/main/java/org/grobid/core/utilities/GrobidConfig.java
Refined OS detection logic and corrected documentation spelling.
Dockerfile.evaluation, Dockerfile.delft, Dockerfile.crf Updated Docker build configurations for multi-architecture support via TARGETARCH and consistent build practices.
.github/workflows/ci-build-unstable.yml
.github/workflows/ci-build-manual-crf.yml
Revised CI workflows to incorporate updated JDK setups, build matrices, and multi-platform builds.
Comments suppressed due to low confidence (3)

Dockerfile.crf:48

  • The model removal pattern has been broadened from '-BidLSTM_CRF' to '-BidLSTM_'. Please verify that this change only removes the intended model files and does not accidentally delete additional valid models.
RUN rm -rf grobid-home/models/*-BidLSTM_*

Dockerfile.delft:134

  • Verify that the binary naming convention using '${TARGETARCH}' matches the available Tini releases across all supported architectures.
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TARGETARCH} /tini

grobid-core/src/main/java/org/grobid/core/utilities/GrobidConfig.java:6

  • The documentation spelling has been corrected from 'configuation' to 'configuration'.
* This class is a bean for the YAML configuration of the GROBID instance.

@lfoppiano lfoppiano changed the title Docker build for multi-architecture amd/arm Support for ARM, updated PDFAlto, Docker multi-architecture build May 28, 2025
@lfoppiano lfoppiano force-pushed the multi-arch-docker-image branch from 6777a6a to 933f01c Compare August 8, 2025 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants