logging: add OpenSearch job log indexing and UI log viewer #73


Merged (2 commits) on Apr 28, 2025

Conversation

jrcastro2 (Contributor) commented Mar 26, 2025

  • Add custom logging handler using contextvars and OpenSearch
  • Define JobLogEntrySchema and LogContextSchema
  • Support search_after pagination in log search API
  • Fetch logs incrementally from UI using search_after cursor
  • Add React log viewer with fade-in and scroll support
  • closes Job report #67

Example of display of different levels: [screenshot]
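
To make the first bullet above concrete, here is a minimal sketch of what a contextvars-based OpenSearch handler could look like; the names (job_context, OpenSearchJobLogHandler, the "job-logs" index) are illustrative, not the PR's actual code:

import logging
from contextvars import ContextVar
from datetime import datetime, timezone

# Illustrative names only; the PR's actual handler, index name and context
# variable may differ.
job_context: ContextVar = ContextVar("job_context", default=None)


class OpenSearchJobLogHandler(logging.Handler):
    """Index log records emitted inside a job run into OpenSearch."""

    def __init__(self, client, index="job-logs"):
        super().__init__()
        self.client = client  # an opensearchpy.OpenSearch instance
        self.index = index

    def emit(self, record):
        ctx = job_context.get()
        if not ctx:
            # Skip records emitted outside of a job run.
            return
        doc = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "context": ctx,  # e.g. {"job_id": ..., "run_id": ...}
        }
        self.client.index(index=self.index, body=doc)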

jrcastro2 mentioned this pull request on Mar 26, 2025
jrcastro2 force-pushed the add-logging branch 2 times, most recently from f5d0b18 to 55e4802 on March 26, 2025 21:42
for h in app.logger.handlers:
    h.setLevel(app.config["LOGGING_JOBS_LEVEL"])

# Add OpenSearch logging handler if not already added

jrcastro2 (Contributor Author):
In the setup, if we pass the handler to both apps and api_apps it seems to get registered twice, so we check for its existence to avoid issues. Alternatively, we could simply pass it to app only?
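
A minimal sketch of that existence check, assuming the handler class from the sketch above (names are illustrative):

# Attach the OpenSearch handler only once, even if setup runs for both the
# UI and API applications (handler class name is illustrative).
if not any(isinstance(h, OpenSearchJobLogHandler) for h in app.logger.handlers):
    app.logger.addHandler(OpenSearchJobLogHandler(client))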

jrcastro2 (Contributor Author):
For the mapping I decided to keep the information that is produced by Python logging, so that if we wanted to we could enrich the message, for instance on errors, to display the name of the function, line, module, etc. Alternatively we can keep it simpler.
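
For illustration, the standard logging.LogRecord attributes that the mapping keeps available could be extracted roughly like this (field names are a sketch, not the PR's mapping):

def record_metadata(record):
    """Collect standard LogRecord attributes kept for later enrichment."""
    return {
        "module": record.module,      # module where the log call happened
        "function": record.funcName,  # function name
        "line": record.lineno,        # source line number
        "logger": record.name,        # logger name
    }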

Comment on lines 309 to 366
def search(self, identity, params):
    """Search for app logs."""
    self.require_permission(identity, "search")
    search_after = params.pop("search_after", None)
    search = self._search(
        "search",
        identity,
        params,
        None,
        permission_action="read",
    )
    search = search.sort("timestamp", "_id").extra(size=100)
    if search_after:
        search = search.extra(search_after=search_after)

    final_results = None
    # Keep fetching until no more results
    while True:
        results = search.execute()
        hits = results.hits
        if not hits:
            if final_results is None:
                final_results = results
            break

        if not final_results:
            final_results = results  # keep metadata from first page
        else:
            final_results.hits.extend(hits)
            final_results.hits.hits.extend(hits.hits)

        search = search.extra(search_after=hits[-1].meta.sort)

    return self.result_list(
        self,
        identity,
        final_results,
        links_tpl=self.links_item_tpl,
    )

jrcastro2 (Contributor Author):
I am not sure about this function being a search endpoint with optional params. As you can see, it's meant to fetch logs without a limit. I set a batch size of 100 (the default is 10), but it could be increased.

For cases where we do not pass any query, with millions of entries this call would time out. To solve this I can think of:

  1. Add a hard limit, let's say 100k?
  2. Preferred: change this to be a read endpoint that only allows fetching the logs of a concrete run (see the sketch below).

WDYT?
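
A rough sketch of option 2, filtering the existing search down to one concrete run instead of exposing arbitrary queries (the method name, signature, and the context.run_id field are hypothetical):

def read_logs(self, identity, run_id):
    """Fetch the logs of a single run (hypothetical read-style endpoint)."""
    self.require_permission(identity, "read")
    search = self._search(
        "search",
        identity,
        {},
        None,
        permission_action="read",
    )
    # Restrict to one concrete run and keep the stable sort for search_after.
    search = (
        search.filter("term", **{"context.run_id": run_id})
        .sort("timestamp", "_id")
        .extra(size=100)
    )
    ...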

Contributor:
For the batch size, you might want to have a look at how performant the query/cluster is with different sizes.
For the limit, from what I saw in other web apps (e.g. GitLab), when the log goes over 100 MB, it is dropped.
Does it make any kind of sense to limit it, for example, to 1 million documents max? How would the web app handle this, and how fast would the rendering be?

Member:
  • Agree on testing a bit further with the batch size; since logs are pretty small, even 1k batches would really speed things up and limit the total number of requests to the cluster.
  • I would also impose a max limit at the beginning of the call, i.e. make a search.count() call and, if the results exceed some max limit (e.g. 1 million results), bail with an error.

We discussed briefly IRL with @jrcastro2 that there might be an overall alternative "streaming" approach to serving the logs that would also help with memory performance and simplify the code:

  • Because we serve a single JSON object inside our usual hits.hits response envelope, we need to collect all the logs for a job run to build the object. Even if we set a maximum total size/count, we're going to be serving a relatively big response with an array of up to 100k (small) items.
    • This also imposes a UX issue on the client side, since the user has to wait through a long "Loading logs..." message until the entire response is received, and only afterwards gets the entire log rendered.
  • If we take further advantage of the pagination methods that OpenSearch provides, we could provide a "streaming" generator response in JSON-Lines format, which would also allow the client to start showing results from the very beginning of rendering the page (see the sketch below).
    • This requires the client-side code to change as well, to handle the JSON-Lines response.
    • We would use the same response when polling every 5 seconds, with the last log's timestamp as the offset.
    • Unfortunately opensearch-py doesn't have the Search.iterate(...) method that the elasticsearch-py client has, which would greatly simplify the entire while True: ... loop in the code here. We could still use Search.scan(...), but this now uses the Scroll API by default, which is not ideal for sorted results.

This is not meant to be done now in the scope of this PR, since we have a working implementation that's good enough ™️, but it would be a nice future improvement both in performance and UX. I'll shelve this issue.
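
A rough sketch of the streaming idea, assuming a Flask response and the same timestamp/_id sort used above (names are illustrative; nothing here is part of this PR):

import json

from flask import Response


def stream_run_logs(search):
    """Stream log hits as JSON Lines, page by page, via search_after."""

    def generate():
        s = search.sort("timestamp", "_id").extra(size=1000)
        while True:
            results = s.execute()
            hits = results.hits
            if not hits:
                break
            for hit in hits:
                yield json.dumps(hit.to_dict()) + "\n"
            # Resume from the sort values of the last hit on this page.
            s = s.extra(search_after=hits[-1].meta.sort)

    return Response(generate(), mimetype="application/x-ndjson")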

jrcastro2 (Contributor Author):
Increased the batches to 1k and set a hard limit of 50k for now. If we want to allow more we need to optimize on both sides, backend and frontend, otherwise it starts to become a bit too slow. Given that there is already an issue for it (#74), I haven't spent much time on it right now.
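
A minimal sketch of that hard limit, using the search.count() guard suggested above (MAX_LOG_ENTRIES is an illustrative name):

MAX_LOG_ENTRIES = 50_000  # hard limit mentioned above

total = search.count()
if total > MAX_LOG_ENTRIES:
    # Bail out early instead of paginating through an unbounded result set.
    raise ValueError(f"Too many log entries ({total}); narrow the query.")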

Comment on lines +158 to +167
# Search for log jobs, first set the logger level to INFO
# and log a message by setting the job context
job_context.set(dict(job_id=job_id, run_id=run_id))
app.logger.setLevel("INFO")
app.logger.info("Test log message")
sleep(1) # Wait for log to be indexed
res = client.get(f"/logs/jobs?q={job_id}")
assert res.status_code == 200
assert res.json["hits"]["total"] == 1
assert res.json["hits"]["hits"][0]["message"] == "Test log message"

jrcastro2 (Contributor Author):
Is this enough or do we want to add more tests?

Copilot AI left a comment:

Pull Request Overview

This PR adds support for job log indexing in OpenSearch and a UI log viewer, enhancing logging and administration capabilities for job execution. Key changes include:

  • Implementing a context-aware logging handler to index enriched job logs.
  • Introducing new resource, service, and configuration classes for job logs.
  • Updating UI components and administration views to display job log details and handle the new PARTIAL_SUCCESS run status.

Reviewed Changes

Copilot reviewed 32 out of 34 changed files in this pull request and generated 2 comments.

Summary per file:
  • invenio_jobs/resources/config.py: Adds job log resource config and search args classes.
  • invenio_jobs/resources/__init__.py: Exports new job log resource and config.
  • invenio_jobs/proxies.py: Exposes the jobs log service proxy.
  • invenio_jobs/models.py: Introduces PARTIAL_SUCCESS in the run status enum.
  • invenio_jobs/logging/jobs.py: Implements a context-aware OpenSearch logging handler.
  • invenio_jobs/logging/celery_signals.py: Adds context capture, restoration, and cleanup for Celery tasks.
  • invenio_jobs/ext.py: Registers the job log service and resource.
  • invenio_jobs/config.py: Configures logging settings for jobs.
  • invenio_jobs/assets/semantic-ui/js/invenio_jobs/administration/StatusFormatter.js: Adds UI support for the Partial Success status.
  • invenio_jobs/assets/semantic-ui/js/invenio_jobs/administration/RunsSearchResultItemLayout.js: Updates the link for run details.
  • invenio_jobs/assets/semantic-ui/js/invenio_jobs/administration/RunsLogsView.js: Implements the React log viewer setup.
  • invenio_jobs/assets/semantic-ui/js/invenio_jobs/administration/RunsLogs.js: Implements incremental log fetching and run status monitoring.
  • invenio_jobs/administration/runs.py: Enhances the admin view to display run logs and details.
Files not reviewed (2)
  • invenio_jobs/logging/mappings/os-v1/jobslog/log-v1.0.0.json: Language not supported
  • invenio_jobs/logging/mappings/os-v2/jobslog/log-v1.0.0.json: Language not supported

Comment on lines 160 to 169
{run.formatted_started_at ? (
  <>
    <p>
      <strong>{run.formatted_started_at}</strong>
    </p>
    <p className="description">{runDuration} mins</p>
  </>
) : (
  <p className="description">Not yet started</p>
)}

Contributor:
nit: are the paragraphs needed?
With List.Item you should use List.Description, instead of "faking it" with the description class name.
It will help with readability.

jrcastro2 (Contributor Author):
The reason to use the paragraphs is to get the line breaks and spacing; if we use List.Description we would need to add custom classes to it to manage this. Let me know if I should update it.
[screenshot]

@@ -215,8 +220,9 @@ def read(self, identity, job_id, run_id):
self, identity, run_record, links_tpl=self.links_item_tpl
)

@with_job_context()

jrcastro2 (Contributor Author):
Add a comment to highlight that the order of the decorators is important, so as to not skip any logging.
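
One possible form of that note in the code; the wording and the elided method body are only illustrative, not the PR's actual comment:

# NOTE: the order of the decorators matters; if `with_job_context` is applied
# in the wrong position, records logged before the job context is set are
# skipped by the OpenSearch handler.
@with_job_context()
def read(self, identity, job_id, run_id):
    ...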

Contributor:
I have already shared my thoughts with you and @slint, mainly regarding the guard for filtering logs by the job context.
In case you decide to change the approach and go for the custom logger (or any other solution), I would suggest using a new custom logger as a context. What I also find not very intuitive is the need to use the update_context func: as a developer, it won't be clear to me when I should use it.
In general, it is clearer to define required parameters when creating an object rather than later on via setter funcs/methods, which creates the assumption that a context must be set before using other funcs/methods.
We assume an implicit order of calls, and as a developer, it is not easy to use.

Something like this would be, IMO, easier to use, and would have clearer APIs (pseudocode):

logger = JobLogger(job_id, run_id)
logger.info(...)

The object could be a singleton per job, thread-safe. I am wondering if we should even be using it via some kind of map logger = getJobLogger(run_id) (or similar), but it might be overkill. Without this, the assumption is that there is one job logger per Python process.

Changing the level of the logger (to have a run in debug mode) could be again via a constructor param, or via a .setLevel(...) func (which might require resetting the level at the end of the run).

Happy to discuss this further.
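
A slightly fuller sketch of that suggestion (illustrative pseudocode, not part of this PR):

import logging


class JobLogger:
    """Per-run logger that carries the job context from its constructor."""

    def __init__(self, job_id, run_id, level=logging.INFO):
        self.job_id = job_id
        self.run_id = run_id
        # One underlying stdlib logger per run; handlers are attached elsewhere.
        self._logger = logging.getLogger(f"invenio_jobs.run.{run_id}")
        self._logger.setLevel(level)

    def info(self, msg, *args):
        # The context travels with every record via `extra`, so no separate
        # update_context call is needed.
        self._logger.info(
            msg, *args, extra={"job_id": self.job_id, "run_id": self.run_id}
        )


# Usage: JobLogger(job_id, run_id).info("Harvesting started")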


class JobLog:
    """Job Log API."""

    index = IndexField("jobslog-log-v1.0.0", search_alias="job-logs")

Contributor:
It feels like this could be a good use case for OS datastreams. @sakshamarora1 is creating classes/methods in invenio-indexer to use datastreams instead of classic indices.

Don't forget the retention period: this should IMO be a global config here, where I can set how long we should keep logs. The default could be 30 days. Datastreams have a policy to automatically delete them.
The UI should show that no logs have been found, or that they have been deleted (with a nice message; we could check what GitHub Actions shows in this case).
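
A possible shape for such a retention config (the variable name and default are illustrative, not from this PR):

# Global retention setting for indexed job logs (illustrative name/default);
# with datastreams this would map to an automatic deletion policy.
JOBS_LOGS_RETENTION_DAYS = 30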


- Add custom logging handler using contextvars and OpenSearch
- Define JobLogEntrySchema and LogContextSchema
- Support search_after pagination in log search API
- Fetch logs incrementally from UI using search_after cursor
- Add React log viewer with fade-in and scroll support
- closes inveniosoftware#67
jrcastro2 merged commit 5f71a8d into inveniosoftware:master Apr 28, 2025
2 checks passed