-
The tagline does not represent the full scope; it is defined in somewhat more detail here: https://projects.eclipse.org/projects/technology.apoapsis The ability to manage data across individual runs was one of the main motivations for the server implementation.
The problem with that is that the ORT result model is NOT stable. So if we store ORT results as plain JSON, we will have to apply migrations to them to be able to read older results. Being able to manage breaking changes in the ORT result model was one of the reasons for the decision to map it to a relational representation. The complexity of the current schema is IMO mainly caused by the fact that it had to be developed in a rush, but it is possible to improve that. Also, providing data to the UI efficiently would likely require mapping the stored raw results to another model anyway.
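To make the migration point concrete, here is a minimal sketch of what reading older plain-JSON results could look like, assuming each stored result carries a schema version. All names, field names, and version numbers are purely illustrative; this is not existing ORT or ORT Server code.

```kotlin
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonObject
import kotlinx.serialization.json.buildJsonObject
import kotlinx.serialization.json.jsonObject

// One migration step per breaking change to the result model (names are invented).
val migrations: Map<Int, (JsonObject) -> JsonObject> = mapOf(
    1 to { obj ->
        buildJsonObject {
            obj.forEach { (key, value) ->
                // Example: a field renamed between model versions.
                put(if (key == "old_field_name") "new_field_name" else key, value)
            }
        }
    }
    // 2 to { ... }, 3 to { ... }, one entry per breaking change ever released.
)

// Bring a stored raw result up to the current model version before deserialization.
fun migrateToLatest(rawJson: String, storedVersion: Int, latestVersion: Int): JsonObject {
    var result = Json.parseToJsonElement(rawJson).jsonObject
    for (version in storedVersion until latestVersion) {
        result = migrations.getValue(version)(result)
    }
    return result
}
```

Every breaking change ever shipped would have to stay in that chain forever, which is exactly the maintenance burden the relational mapping tries to avoid.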
-
To rephrase @mnonnenmacher's answer: if you store just the ORT result in the database, you can no longer filter the ORT runs to answer questions such as:
This kind of advanced statistics was one of the main motivations for ORT Server.
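For illustration, a cross-run question such as "which repositories had the most runs with issues in the last 30 days" maps naturally onto a relational schema, whereas with opaque JSON blobs every stored result would first have to be deserialized. The table and column names below are assumptions for the sketch, not the actual ORT Server schema (PostgreSQL syntax assumed).

```kotlin
import java.sql.DriverManager

// Hypothetical cross-run statistics query; table and column names are invented
// for illustration and do not reflect the real ORT Server schema.
fun printRunsWithIssuesPerRepository(jdbcUrl: String, user: String, password: String) {
    val sql = """
        SELECT r.repository_url, COUNT(DISTINCT run.id) AS runs_with_issues
        FROM ort_runs run
        JOIN repositories r ON r.id = run.repository_id
        JOIN issues i ON i.ort_run_id = run.id
        WHERE run.created_at >= now() - INTERVAL '30 days'
        GROUP BY r.repository_url
        ORDER BY runs_with_issues DESC
    """.trimIndent()

    DriverManager.getConnection(jdbcUrl, user, password).use { connection ->
        connection.prepareStatement(sql).use { statement ->
            statement.executeQuery().use { resultSet ->
                while (resultSet.next()) {
                    println("${resultSet.getString("repository_url")}: " +
                        resultSet.getInt("runs_with_issues"))
                }
            }
        }
    }
}
```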
-
IMHO, the data model used by ORT is not very well suited to the requirements of a scalable server solution. Actually, the current implementation of the workers more or less tries to achieve what you describe: to represent an ORT result in the SQL database, to update it on each step, and to use it as input for the next step. The complexity you mention comes from the fact that it is really hard to represent the ORT result data in an efficient and somewhat normalized way in an SQL database. So, I think that for the future we should rather investigate where we could deviate from the 1:1 representation of the ORT data model in SQL to make access to and handling of the data easier and more efficient. Maybe this could also lead to changes in ORT itself. For instance, the fact that each ORT component requires a full result in memory is a hard limit on the size of projects that can be analyzed and also prevents optimizations like more fine-grained, parallel processing of single packages.
-
I think we need a fundamental architectural change: no longer trying to translate/map ORT results (the output of the stages of the ORT pipeline) to SQL database structures, but instead storing them as simply as possible. In addition, after each run we can read the final ORT result and extract the statistical data we are interested in into a statistics database that is optimized for querying (sort, filter, date/time), as sketched below.
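A minimal sketch of the second half of that idea, with all names invented for illustration: the stored ORT result stays untouched, and only a small derived row per run goes into a table that is cheap to sort and filter.

```kotlin
import java.sql.Connection

// Invented statistics record; the real numbers would be derived from the final ORT
// result, e.g. by counting packages, issues, and vulnerabilities after reading it once.
data class RunStatistics(
    val ortRunId: Long,
    val packageCount: Int,
    val issueCount: Int,
    val vulnerabilityCount: Int
)

// Write the derived statistics into a dedicated, query-optimized table; the table and
// column names are assumptions for this sketch, not an existing ORT Server schema.
fun storeRunStatistics(connection: Connection, stats: RunStatistics) {
    val sql = """
        INSERT INTO run_statistics (ort_run_id, package_count, issue_count, vulnerability_count)
        VALUES (?, ?, ?, ?)
    """.trimIndent()

    connection.prepareStatement(sql).use { statement ->
        statement.setLong(1, stats.ortRunId)
        statement.setInt(2, stats.packageCount)
        statement.setInt(3, stats.issueCount)
        statement.setInt(4, stats.vulnerabilityCount)
        statement.executeUpdate()
    }
}
```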
-
It took me some time to realize that one prominent goal of ORT Server is to use ORT as a library, while I was thinking it just provides an environment to run ORT via the command line interface (CLI). So what I had in mind was a totally different approach. I will therefore close this idea now.
-
Context
What is the original scope of ORT Server (from https://github.com/eclipse-apoapsis/ort-server):
A scalable server implementation of the OSS Review Toolkit.
The Eclipse Apoapsis project's ORT Server is a standalone application to deploy the OSS Review Toolkit as a service in the cloud.
What does it mean:
So the scope of ORT Server is to provide an environment where the ORT pipeline stages can be executed in a Kubernetes cloud, with results, logs, and reports stored in the cloud. Nothing less, nothing more.
Problem
So, isn't it as simple as taking the ORT result of the previous stage as input, processing it in the current pipeline stage, and writing the ORT result as output? The next stage then again takes this ORT result as input, and so on ...
ORT Result Input --> stage: process in worker --> ORT Result output --> next stage ...
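As a sketch of that pass-through idea (the interface and names are invented here, not the existing worker API): each stage would only add its own section to the result and hand everything else on unchanged.

```kotlin
// Illustrative pass-through pipeline; PipelineStage and runPipeline are invented
// names, not the real ORT Server worker abstractions.
interface PipelineStage {
    val name: String

    // Takes the serialized ORT result of the previous stage and returns it with only
    // this stage's own additions applied.
    fun process(ortResult: String): String
}

fun runPipeline(initialOrtResult: String, stages: List<PipelineStage>): String =
    stages.fold(initialOrtResult) { result, stage ->
        println("Running stage '${stage.name}'")
        stage.process(result)
    }
```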
The problem is that, in the meantime, there is a lot of code that:
This means that ORT Server explicitly modifies the results of pipeline stages in one way or another, instead of just handling them transparently and unmodified as they are returned from ORT core.
This way, you can no longer be sure that the scan results from a traditional ORT pipeline are the same as the ones you get when you scan a repository with ORT Server.
Effects of this can already be seen in the UI: Issues and Vulnerabilities are displayed that differ from the ones that are generated by the ORT Reporters, because they operate on different data.
Proposal
Benefit
Challenges