Improve Robustness of Agent Foreground and Background Execution Modes #288

vikman90 · 2024-11-11T10:38:47Z

Parent Issue: #241

Description

The Wazuh agent currently has issues handling its execution modes when run with --run (foreground) or --start (background) flags. Specifically, launching the agent in foreground with ./wazuh-agent --run can sometimes print the following message:

wazuh-agent already running

This message typically indicates that an instance of the agent is already running. However, it may also appear if the agent's previous process terminated unexpectedly, which leads to unreliable behavior.

Proposed Solution

Separate --run and --start behavior:
- --run: Should only launch the agent in the foreground without checking if an instance is already running.
- --start: Should launch the agent in the background and include checks to ensure no other instance of the agent is running.
PID file handling:
When using --start, the agent should:
- Check for the existence of a PID file.
- If a PID file exists, verify if it corresponds to a currently running agent process.
- If no running process is found or the PID file does not exist, perform a fork and execute the agent in background mode (--run).
Systemd Compatibility:
- Ensure that the modified behavior aligns with Systemd’s service management for proper control over the agent's lifecycle.
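
A minimal sketch of the PID file check listed under "PID file handling", written in Python for brevity rather than in the agent's own language. The function names and the PID file location are hypothetical, and probing with signal 0 through os.kill is one common POSIX technique, not necessarily what the agent does or will do.

```python
import os

def read_pid_file(path):
    """Return the PID stored in `path`, or None if the file is missing or malformed."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    # Reject 0 and negative values: kill() gives them special meanings.
    return pid if pid > 0 else None

def pid_is_running(pid):
    """Probe `pid` with signal 0, which checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the PID file is stale
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True

def may_start(pid_file):
    """True when no PID file exists or the PID it names is not alive."""
    pid = read_pid_file(pid_file)
    return pid is None or not pid_is_running(pid)
```

A live PID alone does not prove that the process is the agent, because the operating system can reuse PIDs. A stricter check would also compare the process name or executable path.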

sdvendramini · 2024-11-12T16:54:38Z

12/11/2024

I've started reproducing the problem and researching different approaches to solve this issue.

13/11/2024

I tested other platforms to see how they work. I'm also running some tests with the procps library to check whether the process is running.

sdvendramini · 2024-11-15T08:26:32Z

OpenSearch

During the testing of OpenSearch, it was observed that it is possible to execute another instance of the executable while the OpenSearch service is already running. This behavior appears to create an additional node, which aligns with the fact that OpenSearch is designed as a cluster-based system.

However, it was noticed that the directory containing the PID file becomes empty after launching the second OpenSearch instance. This behavior raises questions about how the process manages resources and whether this is expected in cluster configurations.

To clarify, OpenSearch uses the same executable for all node types, including:

Master Node
Data Node
Client Node (Coordinating Node)

Each OpenSearch instance runs as an independent Java process and can be configured for different roles.

Filebeat

Testing Filebeat revealed that it does not allow multiple instances to run simultaneously with the same configuration. When attempting to execute a second instance of Filebeat while the service is already running, the following error was encountered:

Exiting: /var/lib/filebeat/filebeat.lock: data path already locked by another beat. Please make sure that multiple beats are not sharing the same data path (path.data)

This error indicates that Filebeat locks the path.data directory, preventing concurrent executions unless a separate data path is specified. Further tests showed that it is possible to launch multiple instances if distinct path.data configurations are provided.

While it is technically feasible to run multiple Filebeat instances on the same server, this practice is uncommon. Typically, a single instance is configured to handle data ingestion from multiple sources, streamlining operations.

Conclusions

After analyzing these two products, I believe that wazuh-agent will not behave the same way, as it is designed to run as a single instance. Tests could be done to observe what happens with data persistence when two instances are running simultaneously. Alternatively, it could be worth considering running two instances with different data paths to avoid issues related to this.

If the idea of executing a new instance with --run while the service is running is solely for development purposes, I think the approach described in the issue's description should be sufficient. I just don't see the need to fork with --start, as systemd already handles the process in the background.
Currently, I am working on one implementation for Linux and another for macOS to improve how we verify that the process is running. Once the development is complete, I will test how an instance launched with --run behaves while the service is active and running with --start.

vikman90 · 2024-11-18T14:30:59Z

Hi @sdvendramini,

Thank you for the detailed analysis. Based on your findings and further discussions, we propose the following adjustments to streamline the behavior of the wazuh-agent and align it more closely with practical use cases:

Proposal

Remove the --run and --start options from the agent CLI.
These options add unnecessary complexity to the behavior of the agent. Instead, we aim for a simplified and predictable execution model.
Default foreground execution:
If wazuh-agent is executed without CLI options, it will start the service in the foreground. This makes behavior consistent and reduces confusion.
Prevent multiple instances:
If wazuh-agent detects that another instance is already running, it will terminate its execution. This ensures we avoid resource conflicts and maintain a single-agent instance, as is typical.

To reliably detect whether another process is running, we suggest implementing a robust mechanism using lockfiles. This approach addresses scenarios where PID files or lockfiles might remain stale, such as:

Process crashes (e.g., segmentation faults or out-of-memory errors).
Abrupt system shutdowns (e.g., power failures).
Unexpected restarts.

Proposed Lockfile Implementation

Create a lockfile: When the agent starts, it will create a lockfile tied to its live process to claim exclusive ownership of the agent instance.
Validate lockfile ownership: Before starting, the agent will check whether the process associated with the lockfile is still running.
Remove stale lockfiles: If the process is no longer active, the agent will clean up the stale lockfile before proceeding.
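
One possible way to tie a lockfile to a living process is an advisory lock taken with flock. The sketch below is in Python and is POSIX-only; the function name and lockfile path are hypothetical, and this is an illustration of the idea under those assumptions, not the implementation chosen for the agent.

```python
import fcntl
import os

def acquire_instance_lock(path):
    """Try to become the single running instance.

    Returns an open file descriptor that holds the lock, or None if
    another live process already holds it. The caller must keep the
    descriptor open for as long as the instance runs.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock: fails at once if someone else holds it.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Record our PID for diagnostics; the lock itself is what grants ownership.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd
```

With this technique there is no separate cleanup step for stale lockfiles: the kernel releases the lock when the holder exits, including after a crash, and no lock survives a reboot, so a leftover file does not block the next start. That differs from the explicit "validate, then remove" steps listed above, which would apply to a plain PID or lockfile without a kernel-held lock.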

Let me know if you agree with this approach or have further suggestions. Once finalized, we can proceed with implementing and testing these changes.

Best regards.

vikman90 added level/task Task issue type/enhancement Enhancement issue module/agent mvp Minimum Viable Product refinement labels Nov 11, 2024

vikman90 mentioned this issue Nov 11, 2024

MVP Agent refinement (I) #241

Open

wazuhci added this to Release 5.0.0 Nov 11, 2024

wazuhci moved this to Backlog in Release 5.0.0 Nov 11, 2024

TomasTurina assigned sdvendramini Nov 11, 2024

wazuhci moved this from Backlog to In progress in Release 5.0.0 Nov 12, 2024

fdalmaup mentioned this issue Nov 18, 2024

Improve daemons CLI parameters usage wazuh/wazuh#26693

Open

sdvendramini linked a pull request Nov 19, 2024 that will close this issue

Improve agent execution modes #313

Draft
