Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion gradle.properties
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ hibernate_version=6.5.3.Final
opensaml_version=5.1.6
jwt_version=0.13.0
jaxb_runtime_version=4.0.6
hazelcast_version=5.5.0
hazelcast_version=5.6.0
fasterxml_version=2.20.1
netty_version=4.2.7.Final
jgit_version=7.5.0.202512021534-r
Expand Down
6 changes: 5 additions & 1 deletion src/main/java/de/tum/cit/aet/artemis/ArtemisApp.java
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,8 @@
import org.springframework.core.env.Environment;
import org.springframework.core.env.Profiles;

import com.hazelcast.spring.HazelcastObjectExtractionConfiguration;

import de.tum.cit.aet.artemis.core.PrintStartupBeansEvent;
import de.tum.cit.aet.artemis.core.config.ArtemisCompatibleVersionsConfiguration;
import de.tum.cit.aet.artemis.core.config.ArtemisConfigHelper;
Expand All @@ -39,7 +41,9 @@
import tech.jhipster.config.DefaultProfileUtil;
import tech.jhipster.config.JHipsterConstants;

@SpringBootApplication
// Exclude HazelcastObjectExtractionConfiguration due to incompatibility with custom Hazelcast configuration
// See: https://github.com/hazelcast/hazelcast/issues/26553
@SpringBootApplication(exclude = HazelcastObjectExtractionConfiguration.class)
@EnableConfigurationProperties({ LiquibaseProperties.class, ProgrammingLanguageConfiguration.class, TheiaConfiguration.class, LicenseConfiguration.class,
ArtemisCompatibleVersionsConfiguration.class })
public class ArtemisApp {
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
package de.tum.cit.aet.artemis.buildagent.dto;

import org.jspecify.annotations.Nullable;

import de.tum.cit.aet.artemis.programming.domain.build.BuildStatus;

/**
* Common interface for build job DTOs, providing access to fields shared between
* active jobs ({@link BuildJobQueueItem}) and finished jobs ({@link FinishedBuildJobDTO}).
* <p>
* This interface enables type-safe handling of build jobs in REST endpoints that
* may return either type depending on the job's state.
*/
public interface BuildJobDTO {

/**
* @return the unique identifier of the build job
*/
String id();

/**
* @return the name/description of the build job
*/
String name();

/**
* @return the ID of the participation this build job belongs to
*/
long participationId();

/**
* @return the ID of the course this build job belongs to
*/
long courseId();

/**
* @return the ID of the exercise this build job belongs to
*/
long exerciseId();

/**
* @return the current status of the build job, or null if not yet determined
*/
@Nullable
BuildStatus status();
}
Comment on lines +1 to +46
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find BuildJobQueueItem and FinishedBuildJobDTO implementations

echo "=== Searching for BuildJobQueueItem ==="
rg -n "class BuildJobQueueItem|record BuildJobQueueItem" --type java -A 5

echo "=== Searching for FinishedBuildJobDTO ==="
rg -n "class FinishedBuildJobDTO|record FinishedBuildJobDTO" --type java -A 5

Repository: ls1intum/Artemis

Length of output: 2092


🏁 Script executed:

#!/bin/bash
# Get complete view of FinishedBuildJobDTO
echo "=== FinishedBuildJobDTO full view ==="
cat -n src/main/java/de/tum/cit/aet/artemis/buildagent/dto/FinishedBuildJobDTO.java | head -30

echo ""
echo "=== BuildJobQueueItem full view ==="
cat -n src/main/java/de/tum/cit/aet/artemis/buildagent/dto/BuildJobQueueItem.java | head -30

Repository: ls1intum/Artemis

Length of output: 3638


Interface implementation is complete but has a type contract inconsistency.

Both BuildJobQueueItem and FinishedBuildJobDTO correctly implement the BuildJobDTO interface with all required methods. However, FinishedBuildJobDTO declares status() as non-nullable while the interface specifies it as @Nullable. Although semantically sensible (finished jobs always have a determined status), this violates the interface contract and may cause type safety issues when the interface is used polymorphically. Consider marking status() as @Nullable in FinishedBuildJobDTO for consistency, or document why the stricter contract is intentional.

🤖 Prompt for AI Agents
In `@src/main/java/de/tum/cit/aet/artemis/buildagent/dto/BuildJobDTO.java` around
lines 1 - 46, The BuildJobDTO.status() method is annotated `@Nullable` but
FinishedBuildJobDTO implements it as non-null, violating the interface contract;
update FinishedBuildJobDTO to declare its status() return as nullable (or
explicitly annotate it with `@Nullable`) so the implementation's signature is
Liskov-compatible with BuildJobDTO.status(), ensuring polymorphic type-safety
when using BuildJobDTO references.

Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
@JsonInclude(JsonInclude.Include.NON_EMPTY)
public record BuildJobQueueItem(@NonNull String id, @NonNull String name, @NonNull BuildAgentDTO buildAgent, long participationId, long courseId, long exerciseId, int retryCount,
int priority, @Nullable BuildStatus status, @NonNull RepositoryInfo repositoryInfo, @NonNull JobTimingInfo jobTimingInfo, @NonNull BuildConfig buildConfig,
@Nullable ResultDTO submissionResult) implements Serializable, Comparable<BuildJobQueueItem> {
@Nullable ResultDTO submissionResult) implements BuildJobDTO, Serializable, Comparable<BuildJobQueueItem> {

@Serial
private static final long serialVersionUID = 1L;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
@JsonInclude(JsonInclude.Include.NON_EMPTY)
public record FinishedBuildJobDTO(String id, String name, String buildAgentAddress, long participationId, long courseId, long exerciseId, BuildStatus status,
RepositoryType repositoryType, String repositoryName, RepositoryType triggeredByPushTo, ZonedDateTime buildSubmissionDate, ZonedDateTime buildStartDate,
ZonedDateTime buildCompletionDate, String commitHash, ResultDTO submissionResult) {
ZonedDateTime buildCompletionDate, String commitHash, ResultDTO submissionResult) implements BuildJobDTO {

/**
* A DTO representing a result
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,9 @@ public class BuildAgentDockerService {
@Value("${artemis.continuous-integration.image-architecture:amd64}")
private String imageArchitecture;

@Value("${artemis.continuous-integration.build-agent.short-name}")
private String buildAgentShortName;

private static final String AMD64_ARCHITECTURE = "amd64";

private static final String ARM64_ARCHITECTURE = "arm64";
Expand Down Expand Up @@ -524,7 +527,7 @@ private long convertMegabytesToBytes(int mb) {
private boolean dockerClientNotAvailable(String additionalLogInfo) {
DockerClient dockerClient = buildAgentConfiguration.getDockerClient();
if (dockerClient == null) {
BuildAgentStatus status = distributedDataAccessService.getLocalBuildAgentStatus();
BuildAgentStatus status = distributedDataAccessService.getBuildAgentStatus(buildAgentShortName);
if ((status == BuildAgentStatus.PAUSED || status == BuildAgentStatus.SELF_PAUSED)) {
log.info("Docker client is not available because the build agent is paused. {} This is expected behavior.", additionalLogInfo);
return true;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@
import java.util.List;
import java.util.Objects;

import jakarta.annotation.PreDestroy;

import org.slf4j.Logger;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.info.GitProperties;
Expand Down Expand Up @@ -54,6 +56,24 @@ public BuildAgentInformationService(BuildAgentConfiguration buildAgentConfigurat
this.distributedDataAccessService = distributedDataAccessService;
}

/**
* Removes the build agent from the distributed map when the service is being destroyed.
* This ensures proper cleanup when the build agent shuts down gracefully.
*/
@PreDestroy
public void removeLocalBuildAgentInformationOnShutdown() {
try {
if (distributedDataAccessService.isInstanceRunning()) {
log.info("Build agent '{}' is shutting down. Removing from distributed build agent information map.", buildAgentShortName);
distributedDataAccessService.getDistributedBuildAgentInformation().remove(buildAgentShortName);
log.info("Successfully removed build agent '{}' from distributed map.", buildAgentShortName);
}
}
catch (Exception e) {
log.warn("Error while removing build agent information for '{}' during shutdown: {}", buildAgentShortName, e.getMessage());
}
}

public void updateLocalBuildAgentInformation(boolean isPaused) {
updateLocalBuildAgentInformationWithRecentJob(null, isPaused, false, DEFAULT_CONSECUTIVE_FAILURES);
}
Expand All @@ -64,31 +84,46 @@ public void updateLocalBuildAgentInformation(boolean isPaused, boolean isPausedD

/**
* Updates the local build agent information with the most recent build job.
* Uses the build agent's short name as the map key for stable identification,
* since the Hazelcast member address may change after initial client connection.
*
* @param recentBuildJob the most recent build job
* @param isPaused whether the build agent is paused
* @param isPausedDueToFailures whether the build agent is paused due to consecutive failures
* @param consecutiveFailures number of consecutive build failures on the build agent
*/
public void updateLocalBuildAgentInformationWithRecentJob(BuildJobQueueItem recentBuildJob, boolean isPaused, boolean isPausedDueToFailures, int consecutiveFailures) {
String memberAddress = distributedDataAccessService.getLocalMemberAddress();
// Skip if not connected to cluster (happens when build agent starts before core nodes)
if (!distributedDataAccessService.isConnectedToCluster()) {
log.debug("Not connected to Hazelcast cluster yet. Skipping build agent information update.");
return;
}

// Use buildAgentShortName as the stable key - memberAddress can change after Hazelcast client connects
String agentKey = buildAgentShortName;
try {
distributedDataAccessService.getDistributedBuildAgentInformation().lock(memberAddress);
distributedDataAccessService.getDistributedBuildAgentInformation().lock(agentKey);
// Add/update
BuildAgentInformation info = getUpdatedLocalBuildAgentInformation(recentBuildJob, isPaused, isPausedDueToFailures, consecutiveFailures);

log.debug("Updating build agent info: key='{}', name='{}', memberAddress='{}', displayName='{}'", agentKey, info.buildAgent().name(), info.buildAgent().memberAddress(),
info.buildAgent().displayName());

try {
distributedDataAccessService.getDistributedBuildAgentInformation().put(info.buildAgent().memberAddress(), info);
// Use the agent's short name as key for stable identification
distributedDataAccessService.getDistributedBuildAgentInformation().put(agentKey, info);
log.debug("Successfully stored build agent info with key '{}'. Current map size: {}", agentKey,
distributedDataAccessService.getDistributedBuildAgentInformation().size());
}
catch (Exception e) {
log.error("Error while updating build agent information for agent {} with address {}", info.buildAgent().name(), info.buildAgent().memberAddress(), e);
}
}
catch (Exception e) {
log.error("Error while updating build agent information for agent with address {}", memberAddress, e);
log.error("Error while updating build agent information for agent {}", agentKey, e);
}
finally {
distributedDataAccessService.getDistributedBuildAgentInformation().unlock(memberAddress);
distributedDataAccessService.getDistributedBuildAgentInformation().unlock(agentKey);
}
}

Expand All @@ -100,7 +135,8 @@ private BuildAgentInformation getUpdatedLocalBuildAgentInformation(BuildJobQueue
: buildAgentConfiguration.getThreadPoolSize();
boolean hasJobs = numberOfCurrentBuildJobs > 0;
BuildAgentStatus status;
BuildAgentInformation agent = distributedDataAccessService.getDistributedBuildAgentInformation().get(memberAddress);
// Use buildAgentShortName as key since that's what we use to store the agent info
BuildAgentInformation agent = distributedDataAccessService.getDistributedBuildAgentInformation().get(buildAgentShortName);
if (isPaused) {
boolean isAlreadySelfPaused = agent != null && agent.status() == BuildAgentStatus.SELF_PAUSED;
status = (isPausedDueToFailures || isAlreadySelfPaused) ? BuildAgentStatus.SELF_PAUSED : BuildAgentStatus.PAUSED;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,10 @@
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

Expand All @@ -26,6 +28,7 @@
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Lazy;
import org.springframework.context.annotation.Profile;
import org.springframework.scheduling.TaskScheduler;
import org.springframework.stereotype.Service;

import de.tum.cit.aet.artemis.buildagent.BuildAgentConfiguration;
Expand All @@ -34,7 +37,6 @@
import de.tum.cit.aet.artemis.buildagent.dto.BuildResult;
import de.tum.cit.aet.artemis.core.exception.LocalCIException;
import de.tum.cit.aet.artemis.programming.service.localci.DistributedDataAccessService;
import de.tum.cit.aet.artemis.programming.service.localci.distributed.api.topic.DistributedTopic;

/**
* Coordinates submission, tracking, timeout handling, and cancellation of build jobs
Expand Down Expand Up @@ -77,6 +79,12 @@ public class BuildJobManagementService {

private static final Logger log = LoggerFactory.getLogger(BuildJobManagementService.class);

/**
* Interval between retries when waiting for cluster connection during startup.
* Uses the same interval as the availability check in SharedQueueProcessingService for consistency.
*/
private static final java.time.Duration CLUSTER_CONNECTION_RETRY_INTERVAL = java.time.Duration.ofSeconds(5);

private final BuildJobExecutionService buildJobExecutionService;

private final BuildAgentConfiguration buildAgentConfiguration;
Expand All @@ -87,6 +95,25 @@ public class BuildJobManagementService {

private final BuildLogsMap buildLogsMap;

private final TaskScheduler taskScheduler;

/**
* Scheduled future for retrying cluster connection and initialization.
* This is used when the build agent starts before any core node is available.
*/
private ScheduledFuture<?> connectionRetryFuture;

/**
* Flag to track whether initialization has completed successfully.
* Uses AtomicBoolean to ensure thread-safe access from the retry task.
*/
private final AtomicBoolean initialized = new AtomicBoolean(false);

/**
* UUID of the cancel build job message listener. Stored to allow removal on reconnection.
*/
private java.util.UUID cancelListenerId;

/**
* Guards job lifecycle state transitions that must be atomic across multiple data structures:
* <ul>
Expand Down Expand Up @@ -150,34 +177,108 @@ public class BuildJobManagementService {
private final Set<String> cancelledBuildJobs = new ConcurrentSkipListSet<>();

public BuildJobManagementService(DistributedDataAccessService distributedDataAccessService, BuildJobExecutionService buildJobExecutionService,
BuildAgentConfiguration buildAgentConfiguration, BuildJobContainerService buildJobContainerService, BuildLogsMap buildLogsMap) {
BuildAgentConfiguration buildAgentConfiguration, BuildJobContainerService buildJobContainerService, BuildLogsMap buildLogsMap, TaskScheduler taskScheduler) {
this.buildJobExecutionService = buildJobExecutionService;
this.buildAgentConfiguration = buildAgentConfiguration;
this.buildJobContainerService = buildJobContainerService;
this.distributedDataAccessService = distributedDataAccessService;
this.buildLogsMap = buildLogsMap;
this.taskScheduler = taskScheduler;
}

/**
* Add a listener to the canceledBuildJobsTopic that cancels the build job for the given buildJobId.
* It gets broadcast to all nodes in the cluster. Only the node that is running the build job will cancel it.
* Initialize the service by setting up the cancel listener for build jobs.
* <p>
* When running as a Hazelcast client with asyncStart=true, the client may not yet be
* connected to the cluster when this method is called. In that case, we schedule
* periodic retries until the connection is established and initialization completes.
* <p>
* Additionally, a connection state listener is registered to handle reconnection after
* a connection loss. When the client reconnects to the cluster, the listener re-initializes
* the distributed topic listener which may have been lost during the disconnection.
* <p>
* EventListener cannot be used here, as the bean is lazy
* <a href="https://docs.spring.io/spring-framework/reference/core/beans/context-introduction.html#context-functionality-events-annotation">Spring Docs</a>
*/
@PostConstruct
public void init() {
DistributedTopic<String> canceledBuildJobsTopic = distributedDataAccessService.getCanceledBuildJobsTopic();
canceledBuildJobsTopic.addMessageListener(buildJobId -> {
jobLifecycleLock.lock();
try {
if (runningFutures.containsKey(buildJobId)) {
cancelBuildJob(buildJobId);
}
}
finally {
jobLifecycleLock.unlock();
// Register a connection state listener to handle both initial connection and reconnection.
// On reconnection (isInitialConnection=false), the topic listener needs to be re-registered
// because it may have been lost when the connection was interrupted.
distributedDataAccessService.addConnectionStateListener(isInitialConnection -> {
if (!isInitialConnection) {
// This is a reconnection - reset the initialized flag so listeners are re-registered
log.info("Hazelcast client reconnected to cluster. Re-initializing BuildJobManagementService listeners.");
initialized.set(false);
}
tryInitialize();
});

// If already connected, tryInitialize was called by the listener above.
// If not connected yet, schedule periodic retries as a fallback.
if (!initialized.get() && !distributedDataAccessService.isConnectedToCluster()) {
log.info("Hazelcast client not yet connected to cluster. Scheduling periodic initialization retries every {} seconds.", CLUSTER_CONNECTION_RETRY_INTERVAL.toSeconds());

connectionRetryFuture = taskScheduler.scheduleAtFixedRate(() -> {
if (tryInitialize()) {
// Initialization succeeded - cancel the retry task
if (connectionRetryFuture != null) {
connectionRetryFuture.cancel(false);
}
}
}, CLUSTER_CONNECTION_RETRY_INTERVAL);
}
}

/**
* Attempts to initialize the cancel listener for build jobs.
* <p>
* This method checks if the Hazelcast client is connected to the cluster before
* attempting to access distributed data structures. If not connected, it returns
* false so the caller can retry later.
*
* @return true if initialization succeeded, false if not connected to cluster
*/
private boolean tryInitialize() {
if (initialized.get()) {
return true;
}

if (!distributedDataAccessService.isConnectedToCluster()) {
log.debug("Cannot initialize BuildJobManagementService: not connected to Hazelcast cluster yet");
return false;
}

try {
var canceledBuildJobsTopic = distributedDataAccessService.getCanceledBuildJobsTopic();

// Remove old listener if it exists (prevents duplicate listeners on reconnection)
if (cancelListenerId != null) {
canceledBuildJobsTopic.removeMessageListener(cancelListenerId);
cancelListenerId = null;
}

cancelListenerId = canceledBuildJobsTopic.addMessageListener(buildJobId -> {
jobLifecycleLock.lock();
try {
if (runningFutures.containsKey(buildJobId)) {
cancelBuildJob(buildJobId);
}
}
finally {
jobLifecycleLock.unlock();
}
});

initialized.set(true);
log.info("BuildJobManagementService initialized successfully - cancel listener registered");
return true;
}
catch (Exception e) {
// This can happen if the connection is lost between the check and the access
log.warn("Failed to initialize BuildJobManagementService: {}. Will retry.", e.getMessage());
return false;
}
Comment on lines +233 to +281
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Prevent concurrent tryInitialize() executions.

The retry task and the connection listener can call tryInitialize() concurrently; both can pass initialized.get() and register duplicate topic listeners (Line 242-271). Synchronize the method or gate with a CAS.

🔧 Proposed fix
-    private boolean tryInitialize() {
+    private synchronized boolean tryInitialize() {
🤖 Prompt for AI Agents
In
`@src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildJobManagementService.java`
around lines 233 - 281, tryInitialize() can be invoked concurrently causing
duplicate listeners; change the gating to use an atomic CAS around
initialization (e.g., use initialized.compareAndSet(false, true) or introduce a
separate AtomicBoolean initializing) so only one thread proceeds to register the
topic listener. Concretely: at the start of tryInitialize() attempt to
atomically claim initialization (if compareAndSet returns false, return
initialized.get() result immediately), perform the getCanceledBuildJobsTopic()
and listener registration while holding the claim, set cancelListenerId as now,
and only on success leave initialized true; on any exception or failure clear
the claim (reset the atomic flag or set initialized back to false) and return
false. Ensure existing cancelListenerId removal logic and
jobLifecycleLock/usages (runningFutures, cancelBuildJob) remain unchanged and
are executed only by the successful initializer.

}

/**
Expand Down
Loading
Loading